Re: [PATCH 0/5] Alter steal time reporting in KVM
On 11/27/2012 07:10 PM, Michael Wolf wrote: On 11/27/2012 02:48 AM, Glauber Costa wrote: Hi, On 11/27/2012 12:36 AM, Michael Wolf wrote: In the case where you have a system that is running in a capped or overcommitted environment, the user may see steal time being reported in accounting tools such as top or vmstat. This can cause confusion for the end user. To ease the confusion, this patch set adds the idea of consigned (expected steal) time. The host will separate the consigned time from the steal time. The consignment limit passed to the host will be the amount of steal time expected within a fixed period of time. Any other steal time accruing during that period will show as the traditional steal time. If you submit this again, please include a version number in your series. Will do. The patchset was sent twice yesterday by mistake. Got an error the first time and didn't think the patches went out. This has been corrected. It would also be helpful to include a small changelog about what changed between the last version and this version, so we could focus on that. Yes, will do that. When I took the RFC off the patches I was looking at it as a new patchset, which was a mistake. I will make sure to add a changelog when I submit again. As for the rest, I answered your previous two submissions saying I don't agree with the concept. If you hadn't changed anything, resending it won't change my mind. I could, of course, be mistaken or misguided. But I had also not seen any wave of support in favor of this previously, so basically I have no new data to make me believe I should see it any differently. Let's try this again: * Rik asked you in your last submission how ppc handles this. You said, and I quote: In the case of lpar on POWER systems they simply report steal time and do not alter it in any way. They do however report how much processor is assigned to the partition and that information is in /proc/ppc64/lparcfg. 
Yes, but we still get questions from users asking "what is steal time? why am I seeing this?" Now, that is a *way* more sensible thing to do. Much more. Confusing users is something extremely subjective. This is especially true about concepts that have been known for quite some time, like steal time. If you all of a sudden change the meaning of this, it is sure to confuse a lot more users than it would clarify. Something like this could certainly be done. But when I was submitting the patch set as an RFC, qemu was passing a cpu percentage that would be used by the guest kernel to adjust the steal time. This percentage was being stored on the guest as a sysctl value. Avi stated he didn't like that kind of coupling, and that the value could get out of sync. Anthony stated, "The guest shouldn't need to know its entitlement. Or at least, it's up to a management tool to report that in a way that's meaningful for the guest." So perhaps I misunderstood what they were suggesting, but I took it to mean that they did not want the guest to know what the entitlement was; the host should take care of it and just report the already-adjusted data to the guest. So in this version of the code the host uses a set period for a timer and is passed essentially a number of ticks of expected steal time. The host then uses the timer to break out the steal time into consigned and steal buckets, which are reported to the guest. Both the consigned and the steal time are reported via /proc/stat, so anyone needing to see total time away can add the two fields together. The user, however, when using tools like top or vmstat, would see the usage based on what the guest is entitled to. Do you have suggestions for how I can build consensus around one of the two approaches? Before I answer this, can you please detail which mechanism you are using to enforce the entitlement? Is it the cgroup cpu controller, or something else? 
-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [kvm:queue 23/25] arch/s390/kvm/kvm-s390.c:358:6: error: conflicting types for 'kvm_arch_vcpu_postcreate'
On Tue, Nov 27, 2012 at 11:51:19PM -0200, Marcelo Tosatti wrote: On Tue, Nov 27, 2012 at 10:56:50AM +0800, Fengguang Wu wrote: Hi Marcelo, FYI, kernel build failed on tree: git://git.kernel.org/pub/scm/virt/kvm/kvm.git queue head: fc1ddea318fa2c1ac3d496d8653ca4bc9b66e679 commit: 438d76a60e7be59a558f8712a47565fa8258d17d [23/25] KVM: x86: add kvm_arch_vcpu_postcreate callback, move TSC initialization config: make ARCH=s390 allmodconfig Fengguang Wu, Thanks. Repository has been updated, it would be good if you can rerun the tests. Marcelo, the build log shows that the new kvm/queue head d98d07ca7e0347d712d54a865af323c4aee04bc2 builds fine. :) Thanks, Fengguang
Re: [PATCHv4 0/2] kvm: direct msix injection
On Tue, Nov 27, 2012 at 09:34:57PM -0700, Alex Williamson wrote: On Wed, 2012-11-21 at 21:26 +0200, Michael S. Tsirkin wrote: On Wed, Oct 17, 2012 at 06:05:53PM +0200, Michael S. Tsirkin wrote: We can deliver certain interrupts, notably MSIX, from atomic context. Here's an untested patch to do this (compiled only). Changes from v2: Don't inject broadcast interrupts directly Changes from v1: Tried to address comments from v1, except unifying with kvm_set_irq: passing flags to it looks too ugly. Added a comment. Jan, you said you can test this? I have tested this with some networking workloads and this patchset seems to work fine. My setup isn't a good fit for benchmarking device assignment though. Alex, could you please verify that this solves the latency issue that you sometimes observe? With this patchset device assignment latency should be as fast as vfio. Yep, that seems to cover the gap. My environment is too noisy to declare an absolute winner, but pci-assign and vfio-pci are now very, very similar under netperf TCP_RR with these patches. Thanks, Alex Thanks very much for the report. Gleb, Marcelo, ACK? Also please add Tested-by: Alex Williamson alex.william...@redhat.com -- MST
Re: [PATCH v2 4/4] nested vmx: use a list to store the launched vmcs12 for L1 VMM
On Tue, Nov 27, 2012 at 10:29:08PM -0200, Marcelo Tosatti wrote: On Thu, Nov 22, 2012 at 12:51:59PM +0800, Dongxiao Xu wrote: The launch state is not a member in the VMCS area, use a separate variable (list) to store it instead. Signed-off-by: Dongxiao Xu dongxiao...@intel.com 1. What is the problem with keeping launched state in the VMCS? Assuming there is a positive answer to the above: 2. Don't you have to change VMCS ID? 3. Can't it be kept somewhere else other than a list? Current scheme allows guest to allocate unlimited amounts of host memory. 4. What is the state of migration / nested vmx again? If vmcs12 is migrated, this means launched state is not migrated anymore. Patches 1-3 seem fine. According to Dongxiao they are slowing down nested guest by 4%. -- Gleb.
Re: [PATCH v2 1/6] x86: PIT connects to pin 2 of IOAPIC
On Wed, Nov 21, 2012 at 04:09:34PM +0800, Yang Zhang wrote: When PIT connects to IOAPIC, it routes to pin 2, not pin 0.

Signed-off-by: Yang Zhang yang.z.zh...@intel.com
---
 virt/kvm/ioapic.c | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
index cfb7e4d..166c450 100644
--- a/virt/kvm/ioapic.c
+++ b/virt/kvm/ioapic.c
@@ -181,7 +181,7 @@ static int ioapic_deliver(struct kvm_ioapic *ioapic, int irq)
 #ifdef CONFIG_X86
 /* Always delivery PIT interrupt to vcpu 0 */
-	if (irq == 0) {
+	if (irq == 2) {

Hmm, this means all this time the code didn't work correctly, which makes me wonder if we need this hack at all.

 irqe.dest_mode = 0; /* Physical mode. */
 /* need to read apic_id from apic regiest since
  * it can be rewritten */
--
1.7.1

--
Gleb.
Re: Performance issue
On Tuesday, November 27, 2012 11:13:12 PM George-Cristian Bîrzan wrote: On Tue, Nov 27, 2012 at 10:38 PM, Vadim Rozenfeld vroze...@redhat.com wrote: I have some code which does both reference time and invariant TSC, but it will not work after migration. I will send it later today. Do you mean migrating guests? This is not an issue for us. OK, but don't say I didn't warn you :) There are two patches, one for kvm and another one for qemu. You will probably need to rebase them. Add the hv_tsc cpu parameter to activate this feature. You will probably need to deactivate hpet by adding the -no-hpet parameter as well. Best regards, Vadim. Also, it would be much appreciated! -- George-Cristian Bîrzan

diff --git a/arch/x86/include/asm/hyperv.h b/arch/x86/include/asm/hyperv.h
index b80420b..9c5ffef 100644
--- a/arch/x86/include/asm/hyperv.h
+++ b/arch/x86/include/asm/hyperv.h
@@ -136,6 +136,9 @@
 /* MSR used to read the per-partition time reference counter */
 #define HV_X64_MSR_TIME_REF_COUNT		0x40000020

+/* A partition's reference time stamp counter (TSC) page */
+#define HV_X64_MSR_REFERENCE_TSC		0x40000021
+
 /* Define the virtual APIC registers */
 #define HV_X64_MSR_EOI				0x40000070
 #define HV_X64_MSR_ICR				0x40000071
@@ -179,6 +182,10 @@
 #define HV_X64_MSR_APIC_ASSIST_PAGE_ADDRESS_MASK	\
		(~((1ull << HV_X64_MSR_APIC_ASSIST_PAGE_ADDRESS_SHIFT) - 1))

+#define HV_X64_MSR_TSC_REFERENCE_ENABLE		0x00000001
+#define HV_X64_MSR_TSC_REFERENCE_ADDRESS_SHIFT	12
+
+
 #define HV_PROCESSOR_POWER_STATE_C0		0
 #define HV_PROCESSOR_POWER_STATE_C1		1
 #define HV_PROCESSOR_POWER_STATE_C2		2
@@ -191,4 +198,11 @@
 #define HV_STATUS_INVALID_ALIGNMENT		4
 #define HV_STATUS_INSUFFICIENT_BUFFERS		19

+typedef struct _HV_REFERENCE_TSC_PAGE {
+	uint32_t TscSequence;
+	uint32_t Reserved1;
+	uint64_t TscScale;
+	int64_t  TscOffset;
+} HV_REFERENCE_TSC_PAGE, *PHV_REFERENCE_TSC_PAGE;
+
 #endif

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index b2e11f4..63ee09e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -565,6 +565,8 @@ struct kvm_arch {
	/* fields used by HYPER-V emulation */
	u64 hv_guest_os_id;
	u64 hv_hypercall;
+	u64 hv_ref_count;
+	u64 hv_tsc_page;

 #ifdef CONFIG_KVM_MMU_AUDIT
	int audit_point;

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4f76417..4538295 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -813,7 +813,7 @@ EXPORT_SYMBOL_GPL(kvm_rdpmc);
 static u32 msrs_to_save[] = {
	MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
	MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW,
-	HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,
+	HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL, HV_X64_MSR_REFERENCE_TSC,
	HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
	MSR_KVM_PV_EOI_EN,
	MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
@@ -1428,6 +1428,8 @@ static bool kvm_hv_msr_partition_wide(u32 msr)
	switch (msr) {
	case HV_X64_MSR_GUEST_OS_ID:
	case HV_X64_MSR_HYPERCALL:
+	case HV_X64_MSR_TIME_REF_COUNT:
+	case HV_X64_MSR_REFERENCE_TSC:
		r = true;
		break;
	}
@@ -1438,6 +1440,7 @@ static bool kvm_hv_msr_partition_wide(u32 msr)
 static int set_msr_hyperv_pw(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 {
	struct kvm *kvm = vcpu->kvm;
+	unsigned long addr;

	switch (msr) {
	case HV_X64_MSR_GUEST_OS_ID:
@@ -1467,6 +1470,27 @@ static int set_msr_hyperv_pw(struct kvm_vcpu *vcpu, u32 msr, u64 data)
		if (__copy_to_user((void __user *)addr, instructions, 4))
			return 1;
		kvm->arch.hv_hypercall = data;
+		kvm->arch.hv_ref_count = get_kernel_ns();
		break;
	}
+	case HV_X64_MSR_REFERENCE_TSC: {
+		HV_REFERENCE_TSC_PAGE tsc_ref;
+		tsc_ref.TscSequence =
+			boot_cpu_has(X86_FEATURE_CONSTANT_TSC) ? 1 : 0;
+		tsc_ref.TscScale =
+			((1LL << 32) / vcpu->arch.virtual_tsc_khz) << 32;
+		tsc_ref.TscOffset = 0;
+		if (!(data & HV_X64_MSR_TSC_REFERENCE_ENABLE)) {
+			kvm->arch.hv_tsc_page = data;
+			break;
+		}
+		addr = gfn_to_hva(vcpu->kvm, data >> HV_X64_MSR_TSC_REFERENCE_ADDRESS_SHIFT);
+		if (kvm_is_error_hva(addr))
+			return 1;
+		if (__copy_to_user((void __user *)addr, &tsc_ref, sizeof(tsc_ref)))
+			return 1;
+		kvm->arch.hv_tsc_page = data;
		break;
	}
	default:
@@ -1881,6 +1905,13 @@ static int get_msr_hyperv_pw(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
	case HV_X64_MSR_HYPERCALL:
		data = kvm->arch.hv_hypercall;
		break;
+	case HV_X64_MSR_TIME_REF_COUNT:
+		data = get_kernel_ns() - kvm->arch.hv_ref_count;
+		do_div(data, 100);
+		break;
+	case HV_X64_MSR_REFERENCE_TSC:
+		data = kvm->arch.hv_tsc_page;
+		break;
	default:
		vcpu_unimpl(vcpu, "Hyper-V unhandled rdmsr: 0x%x\n", msr);
		return 1;

diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index f3708e6..ad77b72 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -1250,6 +1250,8 @@ static int cpu_x86_find_by_name(x86_def_t *x86_cpu_def, const char *cpu_model)
	hyperv_enable_relaxed_timing(true);
Re: [PATCHv4 2/2] kvm: deliver msi interrupts from irq handler
On Wed, Oct 17, 2012 at 06:06:06PM +0200, Michael S. Tsirkin wrote: We can deliver certain interrupts, notably MSI, from atomic context. Use kvm_set_irq_inatomic to implement an irq handler for MSI. This reduces the pressure on the scheduler in the case where host and guest irqs share a host cpu.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 virt/kvm/assigned-dev.c | 36 ++--
 1 file changed, 26 insertions(+), 10 deletions(-)

diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
index 23a41a9..3642239 100644
--- a/virt/kvm/assigned-dev.c
+++ b/virt/kvm/assigned-dev.c
@@ -105,6 +105,15 @@ static irqreturn_t kvm_assigned_dev_thread_intx(int irq, void *dev_id)
 }

 #ifdef __KVM_HAVE_MSI
+static irqreturn_t kvm_assigned_dev_msi(int irq, void *dev_id)
+{
+	struct kvm_assigned_dev_kernel *assigned_dev = dev_id;
+	int ret = kvm_set_irq_inatomic(assigned_dev->kvm,
+				       assigned_dev->irq_source_id,
+				       assigned_dev->guest_irq, 1);

Why not use kvm_set_msi_inatomic() and drop kvm_set_irq_inatomic() from previous patch?

+	return unlikely(ret == -EWOULDBLOCK) ? IRQ_WAKE_THREAD : IRQ_HANDLED;
+}
+
 static irqreturn_t kvm_assigned_dev_thread_msi(int irq, void *dev_id)
 {
	struct kvm_assigned_dev_kernel *assigned_dev = dev_id;
@@ -117,6 +126,23 @@ static irqreturn_t kvm_assigned_dev_thread_msi(int irq, void *dev_id)
 #endif

 #ifdef __KVM_HAVE_MSIX
+static irqreturn_t kvm_assigned_dev_msix(int irq, void *dev_id)
+{
+	struct kvm_assigned_dev_kernel *assigned_dev = dev_id;
+	int index = find_index_from_host_irq(assigned_dev, irq);
+	u32 vector;
+	int ret = 0;
+
+	if (index >= 0) {
+		vector = assigned_dev->guest_msix_entries[index].vector;
+		ret = kvm_set_irq_inatomic(assigned_dev->kvm,
+					   assigned_dev->irq_source_id,
+					   vector, 1);
+	}
+
+	return unlikely(ret == -EWOULDBLOCK) ? IRQ_WAKE_THREAD : IRQ_HANDLED;
+}
+
 static irqreturn_t kvm_assigned_dev_thread_msix(int irq, void *dev_id)
 {
	struct kvm_assigned_dev_kernel *assigned_dev = dev_id;
@@ -334,11 +360,6 @@ static int assigned_device_enable_host_intx(struct kvm *kvm,
 }

 #ifdef __KVM_HAVE_MSI
-static irqreturn_t kvm_assigned_dev_msi(int irq, void *dev_id)
-{
-	return IRQ_WAKE_THREAD;
-}
-
 static int assigned_device_enable_host_msi(struct kvm *kvm,
					    struct kvm_assigned_dev_kernel *dev)
 {
@@ -363,11 +384,6 @@ static int assigned_device_enable_host_msi(struct kvm *kvm,
 #endif

 #ifdef __KVM_HAVE_MSIX
-static irqreturn_t kvm_assigned_dev_msix(int irq, void *dev_id)
-{
-	return IRQ_WAKE_THREAD;
-}
-
 static int assigned_device_enable_host_msix(struct kvm *kvm,
					     struct kvm_assigned_dev_kernel *dev)
 {
--
MST

--
Gleb.
Re: [PATCH] KVM: MMU: lazily drop large spte
On Wed, Nov 28, 2012 at 01:27:38PM +0800, Xiao Guangrong wrote: On 11/18/2012 11:00 AM, Marcelo Tosatti wrote: map gfn 4? See corrected step 7 above. Ah, this is a real bug, and unfortunately, it exists in current code. I will make a separate patchset to fix it. Thank you, Marcelo! Is it? Hum.. Anyway, it would be great if you can write a testcase (should be similar in size to rmap_chain). Marcelo, is this patch acceptable? Yes, can we get the reexecute_instruction fix first? (which should then be able to handle the case where a large read-only spte is created). I'll merge the testcase later today.
Re: [PATCHv4 2/2] kvm: deliver msi interrupts from irq handler
On Wed, Nov 28, 2012 at 01:43:34PM +0200, Gleb Natapov wrote: On Wed, Oct 17, 2012 at 06:06:06PM +0200, Michael S. Tsirkin wrote: We can deliver certain interrupts, notably MSI, from atomic context. Use kvm_set_irq_inatomic to implement an irq handler for MSI. This reduces the pressure on the scheduler in the case where host and guest irqs share a host cpu.

+static irqreturn_t kvm_assigned_dev_msi(int irq, void *dev_id)
+{
+	struct kvm_assigned_dev_kernel *assigned_dev = dev_id;
+	int ret = kvm_set_irq_inatomic(assigned_dev->kvm,
+				       assigned_dev->irq_source_id,
+				       assigned_dev->guest_irq, 1);

Why not use kvm_set_msi_inatomic() and drop kvm_set_irq_inatomic() from previous patch?

kvm_set_msi_inatomic needs a routing entry, and we don't have the routing entry at this level. Further, the guest irq might not be an MSI: a host MSI can cause guest intx injection, I think; we need to bounce it to the thread as we did earlier.
Re: [PATCHv4 2/2] kvm: deliver msi interrupts from irq handler
On Wed, Nov 28, 2012 at 01:56:16PM +0200, Michael S. Tsirkin wrote: On Wed, Nov 28, 2012 at 01:43:34PM +0200, Gleb Natapov wrote: Why not use kvm_set_msi_inatomic() and drop kvm_set_irq_inatomic() from previous patch? kvm_set_msi_inatomic needs a routing entry, and we don't have the routing entry at this level.

Yes, right. BTW, will this interface be used only for legacy assigned devices, or will there be other users too?

Further, the guest irq might not be an MSI: a host MSI can cause guest intx injection, I think; we need to bounce it to the thread as we did earlier.

Ah, so msi in kvm_assigned_dev_msi() is about the host msi? Can the host be intx but the guest msi? You seem not to handle this case. Also, injection of intx via the ioapic is the same as injecting an MSI. The format and the capability of the irq message are essentially the same.
Re: [PATCHv4 2/2] kvm: deliver msi interrupts from irq handler
On Wed, Nov 28, 2012 at 02:13:01PM +0200, Gleb Natapov wrote: On Wed, Nov 28, 2012 at 01:56:16PM +0200, Michael S. Tsirkin wrote: Yes, right. BTW, will this interface be used only for legacy assigned devices, or will there be other users too?

I think long term we should convert irqfd to this too.

Ah, so msi in kvm_assigned_dev_msi() is about the host msi?

Yes.

Can the host be intx but the guest msi?

No.

You seem not to handle this case. Also, injection of intx via the ioapic is the same as injecting an MSI. The format and the capability of the irq message are essentially the same.

Absolutely. So we will be able to extend this to intx long term. The difference is in the fact that unlike msi, intx can (and does) have multiple entries per GSI. I have not yet figured out how to report and handle failure in the case where one of these can be injected in atomic context and another can't. There's likely an easy way, but it can be a follow-up patch, I think.
Re: [PATCH v2 4/4] nested vmx: use a list to store the launched vmcs12 for L1 VMM
On 11/28/2012 02:29 AM, Marcelo Tosatti wrote: On Thu, Nov 22, 2012 at 12:51:59PM +0800, Dongxiao Xu wrote: The launch state is not a member in the VMCS area, use a separate variable (list) to store it instead. Signed-off-by: Dongxiao Xu dongxiao...@intel.com 1. What is the problem with keeping launched state in the VMCS? Assuming there is a positive answer to the above: 2. Don't you have to change VMCS ID? 3. Can't it be kept somewhere else other than a list? Current scheme allows guest to allocate unlimited amounts of host memory. I agree with Marcelo: you have to limit the number of VMCSs in the list, otherwise it will be easy to attack a host with nested guests :) 4. What is the state of migration / nested vmx again? If vmcs12 is migrated, this means launched state is not migrated anymore. Patches 1-3 seem fine.
Re: [PATCHv4 2/2] kvm: deliver msi interrupts from irq handler
On Wed, Nov 28, 2012 at 02:22:45PM +0200, Michael S. Tsirkin wrote: On Wed, Nov 28, 2012 at 02:13:01PM +0200, Gleb Natapov wrote: BTW, will this interface be used only for legacy assigned devices, or will there be other users too? I think long term we should convert irqfd to this too.

VFIO uses irqfd, no? So why does legacy device assignment need that code to achieve parity with VFIO? Also, why long term? What are the complications?

Absolutely. So we will be able to extend this to intx long term. The difference is in the fact that unlike msi, intx can (and does) have multiple entries per GSI. I have not yet figured out how to report and handle failure in the case where one of these can be injected in atomic context and another can't. There's likely an easy way, but it can be a follow-up patch, I think.

I prefer to figure that out before introducing the interface. Hmm, we can get rid of the vcpu loop in the pic (this should be easily done by checking for kvm_apic_accept_pic_intr() during apic configuration and keeping a global extint vcpu) and then sort the irq routing entries so that the ioapic entry is first, since only ioapic injection can fail.
Re: [PATCH v4 02/13] ARM: KVM: Keep track of currently running vcpus
Just a bunch of typos in this one :) On Sat, Nov 10, 2012 at 03:44:30PM +, Christoffer Dall wrote: From: Marc Zyngier marc.zyng...@arm.com When an interrupt occurs for the guest, it is sometimes necessary to find out which vcpu was running at that point. Keep track of which vcpu is being tun in kvm_arch_vcpu_ioctl_run(), run and allow the data to be retrived using either: retrieved - kvm_arm_get_running_vcpu(): returns the vcpu running at this point on the current CPU. Can only be used in a non-preemptable context. preemptible - kvm_arm_get_running_vcpus(): returns the per-CPU variable holding the the running vcpus, useable for per-CPU interrupts. -the usable Signed-off-by: Marc Zyngier marc.zyng...@arm.com Signed-off-by: Christoffer Dall c.d...@virtualopensystems.com --- arch/arm/include/asm/kvm_host.h | 10 ++ arch/arm/kvm/arm.c | 30 ++ 2 files changed, 40 insertions(+) diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h index e7fc249..e66cd56 100644 --- a/arch/arm/include/asm/kvm_host.h +++ b/arch/arm/include/asm/kvm_host.h @@ -154,4 +154,14 @@ static inline int kvm_test_age_hva(struct kvm *kvm, unsigned long hva) { return 0; } + +struct kvm_vcpu *kvm_arm_get_running_vcpu(void); +struct kvm_vcpu __percpu **kvm_get_running_vcpus(void); DECLARE_PER_CPU? 
+int kvm_arm_copy_coproc_indices(struct kvm_vcpu *vcpu, u64 __user *uindices); +unsigned long kvm_arm_num_coproc_regs(struct kvm_vcpu *vcpu); +struct kvm_one_reg; +int kvm_arm_coproc_get_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *); +int kvm_arm_coproc_set_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *); + #endif /* __ARM_KVM_HOST_H__ */ diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c index 2cdc07b..60b119a 100644 --- a/arch/arm/kvm/arm.c +++ b/arch/arm/kvm/arm.c @@ -53,11 +53,38 @@ static DEFINE_PER_CPU(unsigned long, kvm_arm_hyp_stack_page); static struct vfp_hard_struct __percpu *kvm_host_vfp_state; static unsigned long hyp_default_vectors; +/* Per-CPU variable containing the currently running vcpu. */ +static DEFINE_PER_CPU(struct kvm_vcpu *, kvm_arm_running_vcpu); + /* The VMID used in the VTTBR */ static atomic64_t kvm_vmid_gen = ATOMIC64_INIT(1); static u8 kvm_next_vmid; static DEFINE_SPINLOCK(kvm_vmid_lock); +static void kvm_arm_set_running_vcpu(struct kvm_vcpu *vcpu) +{ + BUG_ON(preemptible()); + __get_cpu_var(kvm_arm_running_vcpu) = vcpu; +} + +/** + * kvm_arm_get_running_vcpu - get the vcpu running on the current CPU. + * Must be called from non-preemptible context + */ +struct kvm_vcpu *kvm_arm_get_running_vcpu(void) +{ + BUG_ON(preemptible()); + return __get_cpu_var(kvm_arm_running_vcpu); +} + +/** + * kvm_arm_get_running_vcpus - get the per-CPU array on currently running vcpus. + */ s/on/of/ ? Will -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 03/13] ARM: KVM: Initial VGIC infrastructure support
On Sat, Nov 10, 2012 at 03:44:37PM +, Christoffer Dall wrote: From: Marc Zyngier marc.zyng...@arm.com Wire the basic framework code for VGIC support. Nothing to enable yet. Again, not sure how useful this patch is. Might as well merge it with code that actually does something. Couple of comments inline anyway... Signed-off-by: Marc Zyngier marc.zyng...@arm.com Signed-off-by: Christoffer Dall c.d...@virtualopensystems.com --- arch/arm/include/asm/kvm_host.h |7 arch/arm/include/asm/kvm_vgic.h | 70 +++ arch/arm/kvm/arm.c | 21 +++- arch/arm/kvm/interrupts.S |4 ++ arch/arm/kvm/mmio.c |3 ++ virt/kvm/kvm_main.c |5 ++- 6 files changed, 107 insertions(+), 3 deletions(-) create mode 100644 arch/arm/include/asm/kvm_vgic.h [...] diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c index 60b119a..426828a 100644 --- a/arch/arm/kvm/arm.c +++ b/arch/arm/kvm/arm.c @@ -183,6 +183,9 @@ int kvm_dev_ioctl_check_extension(long ext) { int r; switch (ext) { +#ifdef CONFIG_KVM_ARM_VGIC + case KVM_CAP_IRQCHIP: +#endif case KVM_CAP_USER_MEMORY: case KVM_CAP_DESTROY_MEMORY_REGION_WORKS: case KVM_CAP_ONE_REG: @@ -304,6 +307,10 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu) { /* Force users to call KVM_ARM_VCPU_INIT */ vcpu-arch.target = -1; + + /* Set up VGIC */ + kvm_vgic_vcpu_init(vcpu); + return 0; } @@ -363,7 +370,7 @@ int kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu, */ int kvm_arch_vcpu_runnable(struct kvm_vcpu *v) { - return !!v-arch.irq_lines; + return !!v-arch.irq_lines || kvm_vgic_vcpu_pending_irq(v); } So interrupt injection without the in-kernel GIC updates irq_lines, but the in-kernel GIC has its own separate data structures? Why can't the in-kernel GIC just use irq_lines instead of irq_pending_on_cpu? 
int kvm_arch_vcpu_in_guest_mode(struct kvm_vcpu *v) @@ -633,6 +640,8 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run) update_vttbr(vcpu-kvm); + kvm_vgic_sync_to_cpu(vcpu); + local_irq_disable(); /* @@ -645,6 +654,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run) if (ret = 0 || need_new_vmid_gen(vcpu-kvm)) { local_irq_enable(); + kvm_vgic_sync_from_cpu(vcpu); continue; } For VFP, we use different terminology (sync and flush). I don't think they're any clearer than what you have, but the consistency would be nice. Given that both these functions are run with interrupts enabled, why doesn't the second require a lock for updating dist-irq_pending_on_cpu? I notice there's a random smp_mb() over there... @@ -683,6 +693,8 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run) * Back from guest */ + kvm_vgic_sync_from_cpu(vcpu); Likewise. ret = handle_exit(vcpu, run, ret); } @@ -965,6 +977,13 @@ static int init_hyp_mode(void) } } + /* + * Init HYP view of VGIC + */ + err = kvm_vgic_hyp_init(); + if (err) + goto out_free_mappings; + return 0; out_free_vfp: free_percpu(kvm_host_vfp_state); [...] diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 2fb7319..665af96 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -1880,12 +1880,13 @@ static long kvm_vcpu_ioctl(struct file *filp, if (vcpu-kvm-mm != current-mm) return -EIO; -#if defined(CONFIG_S390) || defined(CONFIG_PPC) +#if defined(CONFIG_S390) || defined(CONFIG_PPC) || defined(CONFIG_ARM) /* * Special cases: vcpu ioctls that are asynchronous to vcpu execution, * so vcpu_load() would break it. */ - if (ioctl == KVM_S390_INTERRUPT || ioctl == KVM_INTERRUPT) + if (ioctl == KVM_S390_INTERRUPT || ioctl == KVM_INTERRUPT || + ioctl == KVM_IRQ_LINE) return kvm_arch_vcpu_ioctl(filp, ioctl, arg); #endif Separate patch? 
Will
[PATCH 1/2] KVM: VMX: fix invalid cpu passed to smp_call_function_single
In loaded_vmcs_clear, loaded_vmcs->cpu is the first parameter passed to smp_call_function_single. If the target cpu is going down (cpu hot remove), loaded_vmcs->cpu can become -1, and then -1 is passeded to smp_call_function_single.

It can be triggered when a vcpu is being destroyed, since loaded_vmcs_clear is called in a preemptible context.

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/vmx.c |    8 +++++---
 1 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 6599e45..29e8f42 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1007,9 +1007,11 @@ static void __loaded_vmcs_clear(void *arg)
 
 static void loaded_vmcs_clear(struct loaded_vmcs *loaded_vmcs)
 {
-	if (loaded_vmcs->cpu != -1)
-		smp_call_function_single(
-			loaded_vmcs->cpu, __loaded_vmcs_clear, loaded_vmcs, 1);
+	int cpu = loaded_vmcs->cpu;
+
+	if (cpu != -1)
+		smp_call_function_single(cpu,
+				__loaded_vmcs_clear, loaded_vmcs, 1);
 }
 
 static inline void vpid_sync_vcpu_single(struct vcpu_vmx *vmx)
-- 
1.7.7.6
[PATCH 2/2] KVM: VMX: fix memory order between loading vmcs and clearing vmcs
vmcs->cpu indicates whether the vmcs is loaded on the target cpu; -1 means the vmcs does not exist on any cpu.

If a vcpu loads a vmcs with vmcs.cpu == -1, it can be directly added to that cpu's percpu list. The list can be corrupted if the cpu prefetches the vmcs's list links before reading vmcs->cpu. Correspondingly, we should remove the vmcs from the list before making vmcs->cpu == -1 visible.

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/vmx.c | 17 +
 1 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 29e8f42..6056d88 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1002,6 +1002,15 @@ static void __loaded_vmcs_clear(void *arg)
 	if (per_cpu(current_vmcs, cpu) == loaded_vmcs->vmcs)
 		per_cpu(current_vmcs, cpu) = NULL;
 	list_del(&loaded_vmcs->loaded_vmcss_on_cpu_link);
+
+	/*
+	 * We should ensure updating loaded_vmcs->loaded_vmcss_on_cpu_link
+	 * happens before setting loaded_vmcs->cpu to -1, which is done in
+	 * loaded_vmcs_init. Otherwise, another cpu can see cpu == -1 first
+	 * and then add the vmcs to its percpu list before it is deleted.
+	 */
+	smp_wmb();
+
 	loaded_vmcs_init(loaded_vmcs);
 }
 
@@ -1537,6 +1546,14 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 		kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
 		local_irq_disable();
+
+		/*
+		 * The read of loaded_vmcs->cpu should be before fetching
+		 * loaded_vmcs->loaded_vmcss_on_cpu_link.
+		 * See the comments in __loaded_vmcs_clear().
+		 */
+		smp_rmb();
+
 		list_add(&vmx->loaded_vmcs->loaded_vmcss_on_cpu_link,
 			 &per_cpu(loaded_vmcss_on_cpu, cpu));
 		local_irq_enable();
-- 
1.7.7.6
Re: [PATCH v4 03/13] ARM: KVM: Initial VGIC infrastructure support
On 28/11/12 12:49, Will Deacon wrote: On Sat, Nov 10, 2012 at 03:44:37PM +, Christoffer Dall wrote: From: Marc Zyngier marc.zyng...@arm.com Wire the basic framework code for VGIC support. Nothing to enable yet. Again, not sure how useful this patch is. Might as well merge it with code that actually does something. Couple of comments inline anyway... Signed-off-by: Marc Zyngier marc.zyng...@arm.com Signed-off-by: Christoffer Dall c.d...@virtualopensystems.com --- arch/arm/include/asm/kvm_host.h |7 arch/arm/include/asm/kvm_vgic.h | 70 +++ arch/arm/kvm/arm.c | 21 +++- arch/arm/kvm/interrupts.S |4 ++ arch/arm/kvm/mmio.c |3 ++ virt/kvm/kvm_main.c |5 ++- 6 files changed, 107 insertions(+), 3 deletions(-) create mode 100644 arch/arm/include/asm/kvm_vgic.h [...] diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c index 60b119a..426828a 100644 --- a/arch/arm/kvm/arm.c +++ b/arch/arm/kvm/arm.c @@ -183,6 +183,9 @@ int kvm_dev_ioctl_check_extension(long ext) { int r; switch (ext) { +#ifdef CONFIG_KVM_ARM_VGIC +case KVM_CAP_IRQCHIP: +#endif case KVM_CAP_USER_MEMORY: case KVM_CAP_DESTROY_MEMORY_REGION_WORKS: case KVM_CAP_ONE_REG: @@ -304,6 +307,10 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu) { /* Force users to call KVM_ARM_VCPU_INIT */ vcpu-arch.target = -1; + +/* Set up VGIC */ +kvm_vgic_vcpu_init(vcpu); + return 0; } @@ -363,7 +370,7 @@ int kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu, */ int kvm_arch_vcpu_runnable(struct kvm_vcpu *v) { -return !!v-arch.irq_lines; +return !!v-arch.irq_lines || kvm_vgic_vcpu_pending_irq(v); } So interrupt injection without the in-kernel GIC updates irq_lines, but the in-kernel GIC has its own separate data structures? Why can't the in-kernel GIC just use irq_lines instead of irq_pending_on_cpu? They serve very different purposes: - irq_lines directly controls the IRQ and FIQ lines (it is or-ed into the HCR register before entering the guest) - irq_pending_on_cpu deals with the CPU interface, and only that. 
Plus, it is a kernel only thing. What triggers the interrupt on the guest is the presence of list registers with a pending state. You signal interrupts one way or the other. int kvm_arch_vcpu_in_guest_mode(struct kvm_vcpu *v) @@ -633,6 +640,8 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run) update_vttbr(vcpu-kvm); +kvm_vgic_sync_to_cpu(vcpu); + local_irq_disable(); /* @@ -645,6 +654,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run) if (ret = 0 || need_new_vmid_gen(vcpu-kvm)) { local_irq_enable(); +kvm_vgic_sync_from_cpu(vcpu); continue; } For VFP, we use different terminology (sync and flush). I don't think they're any clearer than what you have, but the consistency would be nice. Which one maps to which? Given that both these functions are run with interrupts enabled, why doesn't the second require a lock for updating dist-irq_pending_on_cpu? I notice there's a random smp_mb() over there... Updating *only* irq_pending_on_cpu doesn't require the lock (set_bit() should be safe, and I think the smp_mb() is a leftover of some debugging hack). kvm_vgic_to_cpu() does a lot more (it picks interrupt from the distributor, hence requires the lock to be taken). @@ -683,6 +693,8 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run) * Back from guest */ +kvm_vgic_sync_from_cpu(vcpu); Likewise. ret = handle_exit(vcpu, run, ret); } @@ -965,6 +977,13 @@ static int init_hyp_mode(void) } } +/* + * Init HYP view of VGIC + */ +err = kvm_vgic_hyp_init(); +if (err) +goto out_free_mappings; + return 0; out_free_vfp: free_percpu(kvm_host_vfp_state); [...] 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 2fb7319..665af96 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -1880,12 +1880,13 @@ static long kvm_vcpu_ioctl(struct file *filp, if (vcpu-kvm-mm != current-mm) return -EIO; -#if defined(CONFIG_S390) || defined(CONFIG_PPC) +#if defined(CONFIG_S390) || defined(CONFIG_PPC) || defined(CONFIG_ARM) /* * Special cases: vcpu ioctls that are asynchronous to vcpu execution, * so vcpu_load() would break it. */ -if (ioctl == KVM_S390_INTERRUPT || ioctl == KVM_INTERRUPT) +if (ioctl == KVM_S390_INTERRUPT || ioctl == KVM_INTERRUPT || +ioctl == KVM_IRQ_LINE)
Re: [PATCH v4 04/13] ARM: KVM: Initial VGIC MMIO support code
On Sat, Nov 10, 2012 at 03:44:44PM +, Christoffer Dall wrote: From: Marc Zyngier marc.zyng...@arm.com Wire the initial in-kernel MMIO support code for the VGIC, used for the distributor emulation. [...] diff --git a/arch/arm/kvm/vgic.c b/arch/arm/kvm/vgic.c new file mode 100644 index 000..26ada3b --- /dev/null +++ b/arch/arm/kvm/vgic.c @@ -0,0 +1,138 @@ +/* + * Copyright (C) 2012 ARM Ltd. + * Author: Marc Zyngier marc.zyng...@arm.com + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include linux/kvm.h +#include linux/kvm_host.h +#include linux/interrupt.h +#include linux/io.h +#include asm/kvm_emulate.h + +#define ACCESS_READ_VALUE(1 0) +#define ACCESS_READ_RAZ (0 0) +#define ACCESS_READ_MASK(x) ((x) (1 0)) +#define ACCESS_WRITE_IGNORED (0 1) +#define ACCESS_WRITE_SETBIT (1 1) +#define ACCESS_WRITE_CLEARBIT(2 1) +#define ACCESS_WRITE_VALUE (3 1) +#define ACCESS_WRITE_MASK(x) ((x) (3 1)) + +/** + * vgic_reg_access - access vgic register + * @mmio: pointer to the data describing the mmio access + * @reg:pointer to the virtual backing of the vgic distributor struct + * @offset: least significant 2 bits used for word offset + * @mode: ACCESS_ mode (see defines above) + * + * Helper to make vgic register access easier using one of the access + * modes defined for vgic register access + * (read,raz,write-ignored,setbit,clearbit,write) + */ +static void 
vgic_reg_access(struct kvm_exit_mmio *mmio, u32 *reg, + u32 offset, int mode) +{ + int word_offset = offset 3; You can get rid of this variable. + int shift = word_offset * 8; shift = (offset 3) 3; + u32 mask; + u32 regval; + + /* + * Any alignment fault should have been delivered to the guest + * directly (ARM ARM B3.12.7 Prioritization of aborts). + */ + + mask = (~0U) (word_offset * 8); then use shift here. + if (reg) + regval = *reg; + else { Use braces for the if clause. + BUG_ON(mode != (ACCESS_READ_RAZ | ACCESS_WRITE_IGNORED)); + regval = 0; + } + + if (mmio-is_write) { + u32 data = (*((u32 *)mmio-data) mask) shift; + switch (ACCESS_WRITE_MASK(mode)) { + case ACCESS_WRITE_IGNORED: + return; + + case ACCESS_WRITE_SETBIT: + regval |= data; + break; + + case ACCESS_WRITE_CLEARBIT: + regval = ~data; + break; + + case ACCESS_WRITE_VALUE: + regval = (regval ~(mask shift)) | data; + break; + } + *reg = regval; + } else { + switch (ACCESS_READ_MASK(mode)) { + case ACCESS_READ_RAZ: + regval = 0; + /* fall through */ + + case ACCESS_READ_VALUE: + *((u32 *)mmio-data) = (regval shift) mask; + } + } +} It might be a good idea to have some port accessors for mmio-data otherwise you'll likely get endianness issues creeping in. +/* All this should be handled by kvm_bus_io_*()... FIXME!!! */ I don't follow this comment :) Can you either make it clearer (and less alarming!) or just drop it please? +struct mmio_range { + unsigned long base; + unsigned long len; + bool (*handle_mmio)(struct kvm_vcpu *vcpu, struct kvm_exit_mmio *mmio, + u32 offset); +}; Why not make offset a phys_addr_t? +static const struct mmio_range vgic_ranges[] = { + {} +}; + +static const +struct mmio_range *find_matching_range(const struct mmio_range *ranges, +struct kvm_exit_mmio *mmio, +unsigned long base) +{ + const struct mmio_range *r = ranges; + unsigned long addr = mmio-phys_addr - base; Same here, I don't think we want to truncate everything to unsigned long. 
+	while (r->len) {
+		if (addr >= r->base &&
+		    (addr + mmio->len) <= (r->base + r->len))
+			return r;
+		r++;
+	}
Hmm, does this work correctly for adjacent mmio devices where addr sits on
Re: [PATCH v4 05/13] ARM: KVM: VGIC accept vcpu and dist base addresses from user space
On Sat, Nov 10, 2012 at 03:44:51PM +, Christoffer Dall wrote: User space defines the model to emulate to a guest and should therefore decide which addresses are used for both the virtual CPU interface directly mapped in the guest physical address space and for the emulated distributor interface, which is mapped in software by the in-kernel VGIC support. Signed-off-by: Christoffer Dall c.d...@virtualopensystems.com --- arch/arm/include/asm/kvm_mmu.h |2 + arch/arm/include/asm/kvm_vgic.h |9 ++ arch/arm/kvm/arm.c | 16 ++ arch/arm/kvm/vgic.c | 61 +++ 4 files changed, 87 insertions(+), 1 deletion(-) diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h index 9bd0508..0800531 100644 --- a/arch/arm/include/asm/kvm_mmu.h +++ b/arch/arm/include/asm/kvm_mmu.h @@ -26,6 +26,8 @@ * To save a bit of memory and to avoid alignment issues we assume 39-bit IPA * for now, but remember that the level-1 table must be aligned to its size. */ +#define KVM_PHYS_SHIFT (38) Seems a bit small... 
+#define KVM_PHYS_MASK((1ULL KVM_PHYS_SHIFT) - 1) #define PTRS_PER_PGD2512 #define PGD2_ORDER get_order(PTRS_PER_PGD2 * sizeof(pgd_t)) diff --git a/arch/arm/include/asm/kvm_vgic.h b/arch/arm/include/asm/kvm_vgic.h index b444ecf..9ca8d21 100644 --- a/arch/arm/include/asm/kvm_vgic.h +++ b/arch/arm/include/asm/kvm_vgic.h @@ -20,6 +20,9 @@ #define __ASM_ARM_KVM_VGIC_H struct vgic_dist { + /* Distributor and vcpu interface mapping in the guest */ + phys_addr_t vgic_dist_base; + phys_addr_t vgic_cpu_base; }; struct vgic_cpu { @@ -31,6 +34,7 @@ struct kvm_run; struct kvm_exit_mmio; #ifdef CONFIG_KVM_ARM_VGIC +int kvm_vgic_set_addr(struct kvm *kvm, unsigned long type, u64 addr); bool vgic_handle_mmio(struct kvm_vcpu *vcpu, struct kvm_run *run, struct kvm_exit_mmio *mmio); @@ -40,6 +44,11 @@ static inline int kvm_vgic_hyp_init(void) return 0; } +static inline int kvm_vgic_set_addr(struct kvm *kvm, unsigned long type, u64 addr) +{ + return 0; +} + static inline int kvm_vgic_init(struct kvm *kvm) { return 0; diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c index 426828a..3ac1aab 100644 --- a/arch/arm/kvm/arm.c +++ b/arch/arm/kvm/arm.c @@ -61,6 +61,8 @@ static atomic64_t kvm_vmid_gen = ATOMIC64_INIT(1); static u8 kvm_next_vmid; static DEFINE_SPINLOCK(kvm_vmid_lock); +static bool vgic_present; + static void kvm_arm_set_running_vcpu(struct kvm_vcpu *vcpu) { BUG_ON(preemptible()); @@ -825,7 +827,19 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log) static int kvm_vm_ioctl_set_device_address(struct kvm *kvm, struct kvm_device_address *dev_addr) { - return -ENODEV; + unsigned long dev_id, type; + + dev_id = (dev_addr-id KVM_DEVICE_ID_MASK) KVM_DEVICE_ID_SHIFT; + type = (dev_addr-id KVM_DEVICE_TYPE_MASK) KVM_DEVICE_TYPE_SHIFT; + + switch (dev_id) { + case KVM_ARM_DEVICE_VGIC_V2: + if (!vgic_present) + return -ENXIO; + return kvm_vgic_set_addr(kvm, type, dev_addr-addr); + default: + return -ENODEV; + } } long kvm_arch_vm_ioctl(struct file *filp, diff 
--git a/arch/arm/kvm/vgic.c b/arch/arm/kvm/vgic.c index 26ada3b..f85b275 100644 --- a/arch/arm/kvm/vgic.c +++ b/arch/arm/kvm/vgic.c @@ -22,6 +22,13 @@ #include linux/io.h #include asm/kvm_emulate.h +#define VGIC_ADDR_UNDEF (-1) +#define IS_VGIC_ADDR_UNDEF(_x) ((_x) == (typeof(_x))VGIC_ADDR_UNDEF) + +#define VGIC_DIST_SIZE 0x1000 +#define VGIC_CPU_SIZE0x2000 These defines might be useful to userspace so that they don't request the distributor and the cpu interface to be place too close together (been there, done that :). + + #define ACCESS_READ_VALUE(1 0) #define ACCESS_READ_RAZ (0 0) #define ACCESS_READ_MASK(x) ((x) (1 0)) @@ -136,3 +143,57 @@ bool vgic_handle_mmio(struct kvm_vcpu *vcpu, struct kvm_run *run, struct kvm_exi { return KVM_EXIT_MMIO; } + +static bool vgic_ioaddr_overlap(struct kvm *kvm) +{ + phys_addr_t dist = kvm-arch.vgic.vgic_dist_base; + phys_addr_t cpu = kvm-arch.vgic.vgic_cpu_base; + + if (IS_VGIC_ADDR_UNDEF(dist) || IS_VGIC_ADDR_UNDEF(cpu)) + return false; + if ((dist = cpu dist + VGIC_DIST_SIZE cpu) || + (cpu = dist cpu + VGIC_CPU_SIZE dist)) + return true; + return false; Just return the predicate that you're testing. +} + +int kvm_vgic_set_addr(struct kvm *kvm, unsigned long type, u64 addr) +{ + int r = 0; + struct vgic_dist *vgic = kvm-arch.vgic; + + if (addr ~KVM_PHYS_MASK)
Re: [PATCH v4 02/13] ARM: KVM: Keep track of currently running vcpus
On 28/11/12 12:47, Will Deacon wrote: Just a bunch of typos in this one :) Typos? me? ;-) On Sat, Nov 10, 2012 at 03:44:30PM +, Christoffer Dall wrote: From: Marc Zyngier marc.zyng...@arm.com When an interrupt occurs for the guest, it is sometimes necessary to find out which vcpu was running at that point. Keep track of which vcpu is being tun in kvm_arch_vcpu_ioctl_run(), run and allow the data to be retrived using either: retrieved - kvm_arm_get_running_vcpu(): returns the vcpu running at this point on the current CPU. Can only be used in a non-preemptable context. preemptible - kvm_arm_get_running_vcpus(): returns the per-CPU variable holding the the running vcpus, useable for per-CPU interrupts. -the usable Signed-off-by: Marc Zyngier marc.zyng...@arm.com Signed-off-by: Christoffer Dall c.d...@virtualopensystems.com --- arch/arm/include/asm/kvm_host.h | 10 ++ arch/arm/kvm/arm.c | 30 ++ 2 files changed, 40 insertions(+) diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h index e7fc249..e66cd56 100644 --- a/arch/arm/include/asm/kvm_host.h +++ b/arch/arm/include/asm/kvm_host.h @@ -154,4 +154,14 @@ static inline int kvm_test_age_hva(struct kvm *kvm, unsigned long hva) { return 0; } + +struct kvm_vcpu *kvm_arm_get_running_vcpu(void); +struct kvm_vcpu __percpu **kvm_get_running_vcpus(void); DECLARE_PER_CPU? Ah, nice one. I didn't even now it existed! 
+int kvm_arm_copy_coproc_indices(struct kvm_vcpu *vcpu, u64 __user *uindices); +unsigned long kvm_arm_num_coproc_regs(struct kvm_vcpu *vcpu); +struct kvm_one_reg; +int kvm_arm_coproc_get_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *); +int kvm_arm_coproc_set_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *); + #endif /* __ARM_KVM_HOST_H__ */ diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c index 2cdc07b..60b119a 100644 --- a/arch/arm/kvm/arm.c +++ b/arch/arm/kvm/arm.c @@ -53,11 +53,38 @@ static DEFINE_PER_CPU(unsigned long, kvm_arm_hyp_stack_page); static struct vfp_hard_struct __percpu *kvm_host_vfp_state; static unsigned long hyp_default_vectors; +/* Per-CPU variable containing the currently running vcpu. */ +static DEFINE_PER_CPU(struct kvm_vcpu *, kvm_arm_running_vcpu); + /* The VMID used in the VTTBR */ static atomic64_t kvm_vmid_gen = ATOMIC64_INIT(1); static u8 kvm_next_vmid; static DEFINE_SPINLOCK(kvm_vmid_lock); +static void kvm_arm_set_running_vcpu(struct kvm_vcpu *vcpu) +{ +BUG_ON(preemptible()); +__get_cpu_var(kvm_arm_running_vcpu) = vcpu; +} + +/** + * kvm_arm_get_running_vcpu - get the vcpu running on the current CPU. + * Must be called from non-preemptible context + */ +struct kvm_vcpu *kvm_arm_get_running_vcpu(void) +{ +BUG_ON(preemptible()); +return __get_cpu_var(kvm_arm_running_vcpu); +} + +/** + * kvm_arm_get_running_vcpus - get the per-CPU array on currently running vcpus. + */ s/on/of/ ? Indeed. Thanks, M. -- Jazz is not dead. It just smells funny...
[PATCH] KVM: use is_idle_task() instead of idle_cpu() to decide when to halt in async_pf
As Frederic pointed out, idle_cpu() may return false even if the async fault happened in the idle task, namely when a wake up is pending. In this case the code will try to put the idle task to sleep. Fix this by using is_idle_task() to check for the idle task.

Reported-by: Frederic Weisbecker fweis...@gmail.com
Signed-off-by: Gleb Natapov g...@redhat.com

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index a91c6b4..08b973f 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -120,11 +120,6 @@ void kvm_async_pf_task_wait(u32 token)
 	struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
 	struct kvm_task_sleep_node n, *e;
 	DEFINE_WAIT(wait);
-	int cpu, idle;
-
-	cpu = get_cpu();
-	idle = idle_cpu(cpu);
-	put_cpu();
 
 	spin_lock(&b->lock);
 	e = _find_apf_task(b, token);
@@ -138,7 +133,7 @@ void kvm_async_pf_task_wait(u32 token)
 
 	n.token = token;
 	n.cpu = smp_processor_id();
-	n.halted = idle || preempt_count() > 1;
+	n.halted = is_idle_task(current) || preempt_count() > 1;
 	init_waitqueue_head(&n.wq);
 	hlist_add_head(&n.link, &b->list);
 	spin_unlock(&b->lock);
--
	Gleb.
Re: [PATCH v4 06/13] ARM: KVM: VGIC distributor handling
On Sat, Nov 10, 2012 at 03:44:58PM +, Christoffer Dall wrote: From: Marc Zyngier marc.zyng...@arm.com Add the GIC distributor emulation code. A number of the GIC features are simply ignored as they are not required to boot a Linux guest. Signed-off-by: Marc Zyngier marc.zyng...@arm.com Signed-off-by: Christoffer Dall c.d...@virtualopensystems.com --- arch/arm/include/asm/kvm_vgic.h | 167 ++ arch/arm/kvm/vgic.c | 471 +++ 2 files changed, 637 insertions(+), 1 deletion(-) diff --git a/arch/arm/include/asm/kvm_vgic.h b/arch/arm/include/asm/kvm_vgic.h index 9ca8d21..9e60b1d 100644 --- a/arch/arm/include/asm/kvm_vgic.h +++ b/arch/arm/include/asm/kvm_vgic.h @@ -19,10 +19,177 @@ #ifndef __ASM_ARM_KVM_VGIC_H #define __ASM_ARM_KVM_VGIC_H +#include linux/kernel.h +#include linux/kvm.h +#include linux/kvm_host.h +#include linux/irqreturn.h +#include linux/spinlock.h +#include linux/types.h + +#define VGIC_NR_IRQS 128 #define VGIC_NR_PRIVATE_IRQS32? +#define VGIC_NR_SHARED_IRQS(VGIC_NR_IRQS - 32) then subtract it here +#define VGIC_MAX_CPUS NR_CPUS We already have KVM_MAX_VCPUS, why do we need another? + +/* Sanity checks... */ +#if (VGIC_MAX_CPUS 8) +#error Invalid number of CPU interfaces +#endif + +#if (VGIC_NR_IRQS 31) +#error VGIC_NR_IRQS must be a multiple of 32 +#endif + +#if (VGIC_NR_IRQS 1024) +#error VGIC_NR_IRQS must be = 1024 +#endif Maybe put each check directly below the #define being tested, to make it super-obvious to people thinking of changing the constants? +/* + * The GIC distributor registers describing interrupts have two parts: + * - 32 per-CPU interrupts (SGI + PPI) + * - a bunch of shared interrups (SPI) interrupts + */ +struct vgic_bitmap { + union { + u32 reg[1]; + unsigned long reg_ul[0]; + } percpu[VGIC_MAX_CPUS]; + union { + u32 reg[VGIC_NR_SHARED_IRQS / 32]; + unsigned long reg_ul[0]; + } shared; +}; Whoa, this is nasty! Firstly, let's replace the `32' with sizeof(u32) for fun. 
Secondly, can we make the reg_ul arrays sized using the BITS_TO_LONGS macro? + +static inline u32 *vgic_bitmap_get_reg(struct vgic_bitmap *x, + int cpuid, u32 offset) +{ + offset = 2; + BUG_ON(offset (VGIC_NR_IRQS / 32)); Hmm, where is offset sanity-checked before here? Do you just rely on all trapped accesses being valid? + if (!offset) + return x-percpu[cpuid].reg; + else + return x-shared.reg + offset - 1; An alternative to this would be to have a single array, with the per-cpu interrupts all laid out at the start and a macro to convert an offset to an index. Might make the code more readable and the struct definition more concise. +} + +static inline int vgic_bitmap_get_irq_val(struct vgic_bitmap *x, +int cpuid, int irq) +{ + if (irq 32) VGIC_NR_PRIVATE_IRQS (inless you go with the suggestion above) + return test_bit(irq, x-percpu[cpuid].reg_ul); + + return test_bit(irq - 32, x-shared.reg_ul); +} + +static inline void vgic_bitmap_set_irq_val(struct vgic_bitmap *x, + int cpuid, int irq, int val) +{ + unsigned long *reg; + + if (irq 32) + reg = x-percpu[cpuid].reg_ul; + else { + reg = x-shared.reg_ul; + irq -= 32; + } Likewise. + + if (val) + set_bit(irq, reg); + else + clear_bit(irq, reg); +} + +static inline unsigned long *vgic_bitmap_get_cpu_map(struct vgic_bitmap *x, +int cpuid) +{ + if (unlikely(cpuid = VGIC_MAX_CPUS)) + return NULL; + return x-percpu[cpuid].reg_ul; +} + +static inline unsigned long *vgic_bitmap_get_shared_map(struct vgic_bitmap *x) +{ + return x-shared.reg_ul; +} + +struct vgic_bytemap { + union { + u32 reg[8]; + unsigned long reg_ul[0]; + } percpu[VGIC_MAX_CPUS]; + union { + u32 reg[VGIC_NR_SHARED_IRQS / 4]; + unsigned long reg_ul[0]; + } shared; +}; Argh, it's another one! 
:) + +static inline u32 *vgic_bytemap_get_reg(struct vgic_bytemap *x, + int cpuid, u32 offset) +{ + offset = 2; + BUG_ON(offset (VGIC_NR_IRQS / 4)); + if (offset 4) + return x-percpu[cpuid].reg + offset; + else + return x-shared.reg + offset - 8; +} + +static inline int vgic_bytemap_get_irq_val(struct vgic_bytemap *x, + int cpuid, int irq) +{ + u32 *reg, shift; + shift =
Re: [PATCHv4 2/2] kvm: deliver msi interrupts from irq handler
On Wed, Nov 28, 2012 at 02:45:09PM +0200, Gleb Natapov wrote: On Wed, Nov 28, 2012 at 02:22:45PM +0200, Michael S. Tsirkin wrote: On Wed, Nov 28, 2012 at 02:13:01PM +0200, Gleb Natapov wrote: On Wed, Nov 28, 2012 at 01:56:16PM +0200, Michael S. Tsirkin wrote: On Wed, Nov 28, 2012 at 01:43:34PM +0200, Gleb Natapov wrote: On Wed, Oct 17, 2012 at 06:06:06PM +0200, Michael S. Tsirkin wrote: We can deliver certain interrupts, notably MSI, from atomic context. Use kvm_set_irq_inatomic, to implement an irq handler for msi. This reduces the pressure on scheduler in case where host and guest irq share a host cpu. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- virt/kvm/assigned-dev.c | 36 ++-- 1 file changed, 26 insertions(+), 10 deletions(-) diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c index 23a41a9..3642239 100644 --- a/virt/kvm/assigned-dev.c +++ b/virt/kvm/assigned-dev.c @@ -105,6 +105,15 @@ static irqreturn_t kvm_assigned_dev_thread_intx(int irq, void *dev_id) } #ifdef __KVM_HAVE_MSI +static irqreturn_t kvm_assigned_dev_msi(int irq, void *dev_id) +{ + struct kvm_assigned_dev_kernel *assigned_dev = dev_id; + int ret = kvm_set_irq_inatomic(assigned_dev-kvm, + assigned_dev-irq_source_id, + assigned_dev-guest_irq, 1); Why not use kvm_set_msi_inatomic() and drop kvm_set_irq_inatomic() from previous patch? kvm_set_msi_inatomic needs a routing entry, and we don't have the routing entry at this level. Yes, right. BTW is this interface will be used only for legacy assigned device or there will be other users too? I think long term we should convert irqfd to this too. VIFO uses irqfd, no? So, why legacy device assignment needs that code to achieve parity with VFIO? Clarification: there are two issues: 1. legacy assignment has bad worst case latency this is because we bounce all ainterrupts through threads this patch fixes this 2. 
irqfd injects all MSIs from an atomic context this patch does not fix this, but it can be fixed on top of this patch Also why long term? What are the complications? Nothing special. Just need to be careful with all the rcu trickery that irqfd uses. Further, guest irq might not be an MSI: host MSI can cause guest intx injection I think, we need to bounce it to thread as we did earlier. Ah, so msi in kvm_assigned_dev_msi() is about host msi? Yes. Can host be intx but guest msi? No. You seems to not handle this case. Also injection of intx via ioapic is the same as injecting MSI. The format and the capability of irq message are essentially the same. Absolutely. So we will be able to extend this to intx long term. The difference is in the fact that unlike msi, intx can (and does) have multiple entries per GSI. I have not yet figured out how to report and handle failure in case one of these can be injected in atomic context, another can't. There's likely an easy way but can be a follow up patch I think. I prefer to figure that out before introducing the interface. Ow come on, it's just an internal interface, not even exported to modules. Changing it would be trivial and the implementation is very small too. Hmm, we can get rid of vcpu loop in pic (should be very easily done by checking for kvm_apic_accept_pic_intr() during apic configuration and keeping global extint vcpu) and then sorting irq routing entries so that ioapic entry is first since only ioapic injection can fail. Yes, I think it's a good idea to remove as many vcpu loops as possible: for example, this vcpu loop is currently hit from atomic context anyway, isn't it? + return unlikely(ret == -EWOULDBLOCK) ? 
IRQ_WAKE_THREAD : IRQ_HANDLED; +} + static irqreturn_t kvm_assigned_dev_thread_msi(int irq, void *dev_id) { struct kvm_assigned_dev_kernel *assigned_dev = dev_id; @@ -117,6 +126,23 @@ static irqreturn_t kvm_assigned_dev_thread_msi(int irq, void *dev_id) #endif #ifdef __KVM_HAVE_MSIX +static irqreturn_t kvm_assigned_dev_msix(int irq, void *dev_id) +{ + struct kvm_assigned_dev_kernel *assigned_dev = dev_id; + int index = find_index_from_host_irq(assigned_dev, irq); + u32 vector; + int ret = 0; + + if (index >= 0) { + vector = assigned_dev->guest_msix_entries[index].vector; + ret = kvm_set_irq_inatomic(assigned_dev->kvm, + assigned_dev->irq_source_id, + vector, 1); + } + +
Re: [kvmarm] [PATCH v4 05/13] ARM: KVM: VGIC accept vcpu and dist base addresses from user space
On 28/11/12 13:11, Will Deacon wrote: On Sat, Nov 10, 2012 at 03:44:51PM +, Christoffer Dall wrote: User space defines the model to emulate to a guest and should therefore decide which addresses are used for both the virtual CPU interface directly mapped in the guest physical address space and for the emulated distributor interface, which is mapped in software by the in-kernel VGIC support. Signed-off-by: Christoffer Dall c.d...@virtualopensystems.com --- arch/arm/include/asm/kvm_mmu.h |2 + arch/arm/include/asm/kvm_vgic.h |9 ++ arch/arm/kvm/arm.c | 16 ++ arch/arm/kvm/vgic.c | 61 +++ 4 files changed, 87 insertions(+), 1 deletion(-) diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h index 9bd0508..0800531 100644 --- a/arch/arm/include/asm/kvm_mmu.h +++ b/arch/arm/include/asm/kvm_mmu.h @@ -26,6 +26,8 @@ * To save a bit of memory and to avoid alignment issues we assume 39-bit IPA * for now, but remember that the level-1 table must be aligned to its size. */ +#define KVM_PHYS_SHIFT (38) Seems a bit small... It's now been fixed to be 40 bits. 
+#define KVM_PHYS_MASK((1ULL KVM_PHYS_SHIFT) - 1) #define PTRS_PER_PGD2 512 #define PGD2_ORDER get_order(PTRS_PER_PGD2 * sizeof(pgd_t)) diff --git a/arch/arm/include/asm/kvm_vgic.h b/arch/arm/include/asm/kvm_vgic.h index b444ecf..9ca8d21 100644 --- a/arch/arm/include/asm/kvm_vgic.h +++ b/arch/arm/include/asm/kvm_vgic.h @@ -20,6 +20,9 @@ #define __ASM_ARM_KVM_VGIC_H struct vgic_dist { +/* Distributor and vcpu interface mapping in the guest */ +phys_addr_t vgic_dist_base; +phys_addr_t vgic_cpu_base; }; struct vgic_cpu { @@ -31,6 +34,7 @@ struct kvm_run; struct kvm_exit_mmio; #ifdef CONFIG_KVM_ARM_VGIC +int kvm_vgic_set_addr(struct kvm *kvm, unsigned long type, u64 addr); bool vgic_handle_mmio(struct kvm_vcpu *vcpu, struct kvm_run *run, struct kvm_exit_mmio *mmio); @@ -40,6 +44,11 @@ static inline int kvm_vgic_hyp_init(void) return 0; } +static inline int kvm_vgic_set_addr(struct kvm *kvm, unsigned long type, u64 addr) +{ +return 0; +} + static inline int kvm_vgic_init(struct kvm *kvm) { return 0; diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c index 426828a..3ac1aab 100644 --- a/arch/arm/kvm/arm.c +++ b/arch/arm/kvm/arm.c @@ -61,6 +61,8 @@ static atomic64_t kvm_vmid_gen = ATOMIC64_INIT(1); static u8 kvm_next_vmid; static DEFINE_SPINLOCK(kvm_vmid_lock); +static bool vgic_present; + static void kvm_arm_set_running_vcpu(struct kvm_vcpu *vcpu) { BUG_ON(preemptible()); @@ -825,7 +827,19 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log) static int kvm_vm_ioctl_set_device_address(struct kvm *kvm, struct kvm_device_address *dev_addr) { -return -ENODEV; +unsigned long dev_id, type; + +dev_id = (dev_addr-id KVM_DEVICE_ID_MASK) KVM_DEVICE_ID_SHIFT; +type = (dev_addr-id KVM_DEVICE_TYPE_MASK) KVM_DEVICE_TYPE_SHIFT; + +switch (dev_id) { +case KVM_ARM_DEVICE_VGIC_V2: +if (!vgic_present) +return -ENXIO; +return kvm_vgic_set_addr(kvm, type, dev_addr-addr); +default: +return -ENODEV; +} } long kvm_arch_vm_ioctl(struct file *filp, diff --git 
a/arch/arm/kvm/vgic.c b/arch/arm/kvm/vgic.c index 26ada3b..f85b275 100644 --- a/arch/arm/kvm/vgic.c +++ b/arch/arm/kvm/vgic.c @@ -22,6 +22,13 @@ #include <linux/io.h> #include <asm/kvm_emulate.h> +#define VGIC_ADDR_UNDEF (-1) +#define IS_VGIC_ADDR_UNDEF(_x) ((_x) == (typeof(_x))VGIC_ADDR_UNDEF) + +#define VGIC_DIST_SIZE 0x1000 +#define VGIC_CPU_SIZE 0x2000 These defines might be useful to userspace so that they don't request the distributor and the cpu interface to be placed too close together (been there, done that :). Fair enough. + + #define ACCESS_READ_VALUE (1 << 0) #define ACCESS_READ_RAZ (0 << 0) #define ACCESS_READ_MASK(x) ((x) & (1 << 0)) @@ -136,3 +143,57 @@ bool vgic_handle_mmio(struct kvm_vcpu *vcpu, struct kvm_run *run, struct kvm_exi { return KVM_EXIT_MMIO; } + +static bool vgic_ioaddr_overlap(struct kvm *kvm) +{ +phys_addr_t dist = kvm->arch.vgic.vgic_dist_base; +phys_addr_t cpu = kvm->arch.vgic.vgic_cpu_base; + +if (IS_VGIC_ADDR_UNDEF(dist) || IS_VGIC_ADDR_UNDEF(cpu)) +return false; +if ((dist <= cpu && dist + VGIC_DIST_SIZE > cpu) || +(cpu <= dist && cpu + VGIC_CPU_SIZE > dist)) +return true; +return false; Just return the predicate that you're testing. +} + +int kvm_vgic_set_addr(struct kvm *kvm, unsigned long type, u64 addr) +{ +int r = 0; +struct vgic_dist
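The "just return the predicate" suggestion for the overlap check can be sketched in isolation. This is a hedged userspace version: the constants match the quoted patch, but taking the two base addresses as parameters (rather than a struct kvm) is an illustrative simplification.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t phys_addr_t;

#define VGIC_DIST_SIZE 0x1000
#define VGIC_CPU_SIZE  0x2000

/* Return the overlap predicate directly instead of branching to an
 * explicit true/false, as the review suggests. */
static bool vgic_ioaddr_overlap(phys_addr_t dist, phys_addr_t cpu)
{
	return (dist <= cpu && dist + VGIC_DIST_SIZE > cpu) ||
	       (cpu <= dist && cpu + VGIC_CPU_SIZE > dist);
}
```

The two symmetric clauses cover the distributor starting below the cpu interface and vice versa; the undefined-address early-out from the original would stay in the caller.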
Re: [PATCHv4 2/2] kvm: deliver msi interrupts from irq handler
On Wed, Nov 28, 2012 at 03:25:44PM +0200, Michael S. Tsirkin wrote: On Wed, Nov 28, 2012 at 02:45:09PM +0200, Gleb Natapov wrote: On Wed, Nov 28, 2012 at 02:22:45PM +0200, Michael S. Tsirkin wrote: On Wed, Nov 28, 2012 at 02:13:01PM +0200, Gleb Natapov wrote: On Wed, Nov 28, 2012 at 01:56:16PM +0200, Michael S. Tsirkin wrote: On Wed, Nov 28, 2012 at 01:43:34PM +0200, Gleb Natapov wrote: On Wed, Oct 17, 2012 at 06:06:06PM +0200, Michael S. Tsirkin wrote: We can deliver certain interrupts, notably MSI, from atomic context. Use kvm_set_irq_inatomic, to implement an irq handler for msi. This reduces the pressure on scheduler in case where host and guest irq share a host cpu. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- virt/kvm/assigned-dev.c | 36 ++-- 1 file changed, 26 insertions(+), 10 deletions(-) diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c index 23a41a9..3642239 100644 --- a/virt/kvm/assigned-dev.c +++ b/virt/kvm/assigned-dev.c @@ -105,6 +105,15 @@ static irqreturn_t kvm_assigned_dev_thread_intx(int irq, void *dev_id) } #ifdef __KVM_HAVE_MSI +static irqreturn_t kvm_assigned_dev_msi(int irq, void *dev_id) +{ + struct kvm_assigned_dev_kernel *assigned_dev = dev_id; + int ret = kvm_set_irq_inatomic(assigned_dev-kvm, +assigned_dev-irq_source_id, +assigned_dev-guest_irq, 1); Why not use kvm_set_msi_inatomic() and drop kvm_set_irq_inatomic() from previous patch? kvm_set_msi_inatomic needs a routing entry, and we don't have the routing entry at this level. Yes, right. BTW is this interface will be used only for legacy assigned device or there will be other users too? I think long term we should convert irqfd to this too. VIFO uses irqfd, no? So, why legacy device assignment needs that code to achieve parity with VFIO? Clarification: there are two issues: 1. legacy assignment has bad worst case latency this is because we bounce all ainterrupts through threads this patch fixes this 2. 
irqfd injects all MSIs from an atomic context this patch does not fix this, but it can be fixed on top of this patch Thanks for clarification. Also why long term? What are the complications? Nothing special. Just need to be careful with all the rcu trickery that irqfd uses. Further, guest irq might not be an MSI: host MSI can cause guest intx injection I think, we need to bounce it to thread as we did earlier. Ah, so msi in kvm_assigned_dev_msi() is about host msi? Yes. Can host be intx but guest msi? No. You seems to not handle this case. Also injection of intx via ioapic is the same as injecting MSI. The format and the capability of irq message are essentially the same. Absolutely. So we will be able to extend this to intx long term. The difference is in the fact that unlike msi, intx can (and does) have multiple entries per GSI. I have not yet figured out how to report and handle failure in case one of these can be injected in atomic context, another can't. There's likely an easy way but can be a follow up patch I think. I prefer to figure that out before introducing the interface. Ow come on, it's just an internal interface, not even exported to modules. Changing it would be trivial and the implementation is very small too. The question is if it can be done at all or not. If it cannot then it does not matter that interface is internal, but fortunately looks like it is possible, so I am fine with proposed implementation for now. Hmm, we can get rid of vcpu loop in pic (should be very easily done by checking for kvm_apic_accept_pic_intr() during apic configuration and keeping global extint vcpu) and then sorting irq routing entries so that ioapic entry is first since only ioapic injection can fail. Yes, I think it's a good idea to remove as many vcpu loops as possible: for example, this vcpu loop is currently hit from atomic context anyway, isn't it? Actually it is not. The lock is dropped just before the loop, so this loop shouldn't be the roadblock at all. 
+ return unlikely(ret == -EWOULDBLOCK) ? IRQ_WAKE_THREAD : IRQ_HANDLED; +} + static irqreturn_t kvm_assigned_dev_thread_msi(int irq, void *dev_id) { struct kvm_assigned_dev_kernel *assigned_dev = dev_id; @@ -117,6 +126,23 @@ static irqreturn_t kvm_assigned_dev_thread_msi(int irq, void *dev_id) #endif #ifdef __KVM_HAVE_MSIX +static irqreturn_t kvm_assigned_dev_msix(int irq, void *dev_id)
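The control flow being reviewed — inject from hard-irq context when possible, and wake the threaded handler only when atomic injection would have to sleep — can be sketched in plain userspace C. All names here are stand-ins for the kernel's irqreturn_t machinery, not the actual KVM code.

```c
#include <errno.h>

enum irqreturn { IRQ_HANDLED = 1, IRQ_WAKE_THREAD = 2 };

/* Stand-in for kvm_set_irq_inatomic(): returns -EWOULDBLOCK when the
 * injection cannot be completed without sleeping. */
static int set_irq_inatomic(int must_sleep)
{
	return must_sleep ? -EWOULDBLOCK : 0;
}

/* Hard-irq handler: the fast path injects directly; the slow path
 * defers to the threaded handler, mirroring the patch's return
 * statement `ret == -EWOULDBLOCK ? IRQ_WAKE_THREAD : IRQ_HANDLED`. */
static enum irqreturn msi_hard_handler(int must_sleep)
{
	int ret = set_irq_inatomic(must_sleep);
	return ret == -EWOULDBLOCK ? IRQ_WAKE_THREAD : IRQ_HANDLED;
}
```

This is why the threaded handler is still registered: it remains the fallback whenever the atomic path reports it would block.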
Re: [PATCH v4 04/13] ARM: KVM: Initial VGIC MMIO support code
On 28/11/12 13:09, Will Deacon wrote: On Sat, Nov 10, 2012 at 03:44:44PM +, Christoffer Dall wrote: From: Marc Zyngier marc.zyng...@arm.com Wire the initial in-kernel MMIO support code for the VGIC, used for the distributor emulation. [...] diff --git a/arch/arm/kvm/vgic.c b/arch/arm/kvm/vgic.c new file mode 100644 index 000..26ada3b --- /dev/null +++ b/arch/arm/kvm/vgic.c @@ -0,0 +1,138 @@ +/* + * Copyright (C) 2012 ARM Ltd. + * Author: Marc Zyngier marc.zyng...@arm.com + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include linux/kvm.h +#include linux/kvm_host.h +#include linux/interrupt.h +#include linux/io.h +#include asm/kvm_emulate.h + +#define ACCESS_READ_VALUE (1 0) +#define ACCESS_READ_RAZ (0 0) +#define ACCESS_READ_MASK(x) ((x) (1 0)) +#define ACCESS_WRITE_IGNORED(0 1) +#define ACCESS_WRITE_SETBIT (1 1) +#define ACCESS_WRITE_CLEARBIT (2 1) +#define ACCESS_WRITE_VALUE (3 1) +#define ACCESS_WRITE_MASK(x)((x) (3 1)) + +/** + * vgic_reg_access - access vgic register + * @mmio: pointer to the data describing the mmio access + * @reg:pointer to the virtual backing of the vgic distributor struct + * @offset: least significant 2 bits used for word offset + * @mode: ACCESS_ mode (see defines above) + * + * Helper to make vgic register access easier using one of the access + * modes defined for vgic register access + * 
(read,raz,write-ignored,setbit,clearbit,write) + */ +static void vgic_reg_access(struct kvm_exit_mmio *mmio, u32 *reg, +u32 offset, int mode) +{ +int word_offset = offset & 3; You can get rid of this variable. +int shift = word_offset * 8; shift = (offset & 3) << 3; +u32 mask; +u32 regval; + +/* + * Any alignment fault should have been delivered to the guest + * directly (ARM ARM B3.12.7 Prioritization of aborts). + */ + +mask = (~0U) >> (word_offset * 8); then use shift here. Sure. +if (reg) +regval = *reg; +else { Use braces for the if clause. Indeed. +BUG_ON(mode != (ACCESS_READ_RAZ | ACCESS_WRITE_IGNORED)); +regval = 0; +} + +if (mmio->is_write) { +u32 data = (*((u32 *)mmio->data) & mask) << shift; +switch (ACCESS_WRITE_MASK(mode)) { +case ACCESS_WRITE_IGNORED: +return; + +case ACCESS_WRITE_SETBIT: +regval |= data; +break; + +case ACCESS_WRITE_CLEARBIT: +regval &= ~data; +break; + +case ACCESS_WRITE_VALUE: +regval = (regval & ~(mask << shift)) | data; +break; +} +*reg = regval; +} else { +switch (ACCESS_READ_MASK(mode)) { +case ACCESS_READ_RAZ: +regval = 0; +/* fall through */ + +case ACCESS_READ_VALUE: +*((u32 *)mmio->data) = (regval >> shift) & mask; +} +} +} It might be a good idea to have some port accessors for mmio->data otherwise you'll likely get endianness issues creeping in. Aarrgh! This depends on the relative endianness of host and guest. I'm feeling slightly sick... +/* All this should be handled by kvm_bus_io_*()... FIXME!!! */ I don't follow this comment :) Can you either make it clearer (and less alarming!) or just drop it please? The non-alarming version should read: /* * I would have liked to use the kvm_bus_io_*() API instead, but * it cannot cope with banked registers (only the VM pointer is * passed around, and we need the vcpu). One of these days, someone * please fix it! */ +struct mmio_range { +unsigned long base; +unsigned long len; +bool (*handle_mmio)(struct kvm_vcpu *vcpu, struct kvm_exit_mmio *mmio, +u32 offset); +}; Why not make offset a phys_addr_t?
Very good point. +static const struct mmio_range vgic_ranges[] = { +{} +}; + +static const +struct mmio_range *find_matching_range(const struct mmio_range *ranges, + struct kvm_exit_mmio *mmio, + unsigned long base) +{ +
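The word-offset arithmetic under review — a shift of `(offset & 3) << 3` and a mask covering the remaining bytes — is easy to check standalone. The helpers below are a hedged reconstruction of that math only, not the kernel's vgic_reg_access() itself; names are mine.

```c
#include <stdint.h>

/* Bit position of the byte lane selected by the low two bits of the
 * MMIO offset. */
static uint32_t lane_shift(uint32_t offset)
{
	return (offset & 3) << 3;	/* same as (offset & 3) * 8 */
}

/* Mask for the bytes still addressable from that lane, matching
 * mask = (~0U) >> (word_offset * 8) in the quoted code. */
static uint32_t lane_mask(uint32_t offset)
{
	return ~0U >> lane_shift(offset);
}

/* Read path as in the quoted code: (regval >> shift) & mask. */
static uint32_t lane_read(uint32_t regval, uint32_t offset)
{
	return (regval >> lane_shift(offset)) & lane_mask(offset);
}
```

Will's point about getting rid of `word_offset` is just that both the shift and the mask can be derived from `offset` directly, as above.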
Re: [PATCH 2/3] KVM: x86: let reexecute_instruction work for tdp
On Wed, Nov 28, 2012 at 11:15:13AM +0800, Xiao Guangrong wrote: On 11/28/2012 07:32 AM, Marcelo Tosatti wrote: On Tue, Nov 27, 2012 at 11:13:11AM +0800, Xiao Guangrong wrote: +static bool reexecute_instruction(struct kvm_vcpu *vcpu, unsigned long cr2) { -gpa_t gpa; +gpa_t gpa = cr2; pfn_t pfn; -if (tdp_enabled) +if (!ACCESS_ONCE(vcpu->kvm->arch.indirect_shadow_pages)) return false; How is indirect_shadow_pages protected? Why is ACCESS_ONCE() being used to read it? Hi Marcelo, It is protected by mmu-lock, since it is only changed while mmu-lock is held. And ACCESS_ONCE is used on the read path to avoid the compiler optimizing the load away. Please switch to mmu_lock protection; there is no reason to access this variable locklessly - it is not performance critical. For example, there is no use of barriers when modifying the variable. This is not bad; the worst case is that the direct mmu fails to unprotect the shadow pages (it sees indirect_shadow_pages == 0 while there are still shadowed pages), and after entering the guest we go into reexecute_instruction again, which then removes the shadow pages. Can't the same scenario happen even with mmu lock held around the indirect_shadow_pages access? But I do not have a strong opinion on it; I respect your view! :) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- Gleb.
Re: [PATCH 3/3] KVM: x86: improve reexecute_instruction
On Tue, Nov 27, 2012 at 11:30:24AM +0800, Xiao Guangrong wrote: On 11/27/2012 06:41 AM, Marcelo Tosatti wrote: - return false; +again: + page_fault_count = ACCESS_ONCE(vcpu->kvm->arch.page_fault_count); + + /* + * if emulation was due to access to shadowed page table + * and it failed try to unshadow page and re-enter the + * guest to let CPU execute the instruction. + */ + kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa)); + emulate = vcpu->arch.mmu.page_fault(vcpu, cr3, PFERR_WRITE_MASK, false); Can you explain what is the objective here? Sure. :) The instruction emulation is caused by a fault access on cr3. After unprotecting the target page, we call vcpu->arch.mmu.page_fault to fix the mapping of cr3. If it returns 1, the mmu cannot fix the mapping and we should report the error; otherwise it is fine to return to the guest and let it re-execute the instruction. page_fault_count is used to avoid racing with other vcpus: after we unprotect the target page, another cpu can enter the page fault path and have the page write-protected again. This lets us detect all the cases in which the mmu cannot be fixed. Can you write this in a comment above vcpu->arch.mmu.page_fault()? -- Gleb.
Re: [PATCH v4 03/13] ARM: KVM: Initial VGIC infrastructure support
On Wed, Nov 28, 2012 at 01:09:37PM +, Marc Zyngier wrote: On 28/11/12 12:49, Will Deacon wrote: On Sat, Nov 10, 2012 at 03:44:37PM +, Christoffer Dall wrote: @@ -363,7 +370,7 @@ int kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu, */ int kvm_arch_vcpu_runnable(struct kvm_vcpu *v) { - return !!v-arch.irq_lines; + return !!v-arch.irq_lines || kvm_vgic_vcpu_pending_irq(v); } So interrupt injection without the in-kernel GIC updates irq_lines, but the in-kernel GIC has its own separate data structures? Why can't the in-kernel GIC just use irq_lines instead of irq_pending_on_cpu? They serve very different purposes: - irq_lines directly controls the IRQ and FIQ lines (it is or-ed into the HCR register before entering the guest) - irq_pending_on_cpu deals with the CPU interface, and only that. Plus, it is a kernel only thing. What triggers the interrupt on the guest is the presence of list registers with a pending state. You signal interrupts one way or the other. Ok, thanks for the explanation. I suspect that we could use (another) cosmetic change then. How about cpui_irq_pending and hcr_irq_pending? int kvm_arch_vcpu_in_guest_mode(struct kvm_vcpu *v) @@ -633,6 +640,8 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run) update_vttbr(vcpu-kvm); + kvm_vgic_sync_to_cpu(vcpu); + local_irq_disable(); /* @@ -645,6 +654,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run) if (ret = 0 || need_new_vmid_gen(vcpu-kvm)) { local_irq_enable(); + kvm_vgic_sync_from_cpu(vcpu); continue; } For VFP, we use different terminology (sync and flush). I don't think they're any clearer than what you have, but the consistency would be nice. Which one maps to which? sync: hardware - data structure flush: data structure - hardware Given that both these functions are run with interrupts enabled, why doesn't the second require a lock for updating dist-irq_pending_on_cpu? I notice there's a random smp_mb() over there... 
Updating *only* irq_pending_on_cpu doesn't require the lock (set_bit() should be safe, and I think the smp_mb() is a leftover of some debugging hack). kvm_vgic_to_cpu() does a lot more (it picks interrupts from the distributor, and hence requires the lock to be taken). Ok, if the barrier is just a hangover from something else and you don't have any races with test/clear operations then you should be alright. Will
Re: [PATCH v4 06/13] ARM: KVM: VGIC distributor handling
On 28/11/12 13:21, Will Deacon wrote: On Sat, Nov 10, 2012 at 03:44:58PM +, Christoffer Dall wrote: From: Marc Zyngier marc.zyng...@arm.com Add the GIC distributor emulation code. A number of the GIC features are simply ignored as they are not required to boot a Linux guest. Signed-off-by: Marc Zyngier marc.zyng...@arm.com Signed-off-by: Christoffer Dall c.d...@virtualopensystems.com --- arch/arm/include/asm/kvm_vgic.h | 167 ++ arch/arm/kvm/vgic.c | 471 +++ 2 files changed, 637 insertions(+), 1 deletion(-) diff --git a/arch/arm/include/asm/kvm_vgic.h b/arch/arm/include/asm/kvm_vgic.h index 9ca8d21..9e60b1d 100644 --- a/arch/arm/include/asm/kvm_vgic.h +++ b/arch/arm/include/asm/kvm_vgic.h @@ -19,10 +19,177 @@ #ifndef __ASM_ARM_KVM_VGIC_H #define __ASM_ARM_KVM_VGIC_H +#include <linux/kernel.h> +#include <linux/kvm.h> +#include <linux/kvm_host.h> +#include <linux/irqreturn.h> +#include <linux/spinlock.h> +#include <linux/types.h> + +#define VGIC_NR_IRQS 128 #define VGIC_NR_PRIVATE_IRQS 32? +#define VGIC_NR_SHARED_IRQS (VGIC_NR_IRQS - 32) then subtract it here Sure. +#define VGIC_MAX_CPUS NR_CPUS We already have KVM_MAX_VCPUS, why do we need another? They really should be the same, and this NR_CPUS is a bug that has already been fixed. + +/* Sanity checks... */ +#if (VGIC_MAX_CPUS > 8) +#error Invalid number of CPU interfaces +#endif + +#if (VGIC_NR_IRQS & 31) +#error VGIC_NR_IRQS must be a multiple of 32 +#endif + +#if (VGIC_NR_IRQS > 1024) +#error VGIC_NR_IRQS must be <= 1024 +#endif Maybe put each check directly below the #define being tested, to make it super-obvious to people thinking of changing the constants? OK.
+/* + * The GIC distributor registers describing interrupts have two parts: + * - 32 per-CPU interrupts (SGI + PPI) + * - a bunch of shared interrupts (SPI) + */ +struct vgic_bitmap { + union { + u32 reg[1]; + unsigned long reg_ul[0]; + } percpu[VGIC_MAX_CPUS]; + union { + u32 reg[VGIC_NR_SHARED_IRQS / 32]; + unsigned long reg_ul[0]; + } shared; +}; Whoa, this is nasty! Firstly, let's replace the `32' with sizeof(u32) for fun. Secondly, can we make the reg_ul arrays sized using the BITS_TO_LONGS macro? This has already been replaced with: struct vgic_bitmap { union { u32 reg[1]; DECLARE_BITMAP(reg_ul, 32); } percpu[VGIC_MAX_CPUS]; union { u32 reg[VGIC_NR_SHARED_IRQS / 32]; DECLARE_BITMAP(reg_ul, VGIC_NR_SHARED_IRQS); } shared; }; which should address most of your concerns. As for the sizeof(u32), I think assuming that u32 has a grand total of 32 bits is safe enough ;-) + +static inline u32 *vgic_bitmap_get_reg(struct vgic_bitmap *x, + int cpuid, u32 offset) +{ + offset >>= 2; + BUG_ON(offset >= (VGIC_NR_IRQS / 32)); Hmm, where is offset sanity-checked before here? Do you just rely on all trapped accesses being valid? You've already validated the access being in a valid range. Offset is just derived from the access address. The BUG_ON() is a leftover from early debugging and should go away now. + if (!offset) + return x->percpu[cpuid].reg; + else + return x->shared.reg + offset - 1; An alternative to this would be to have a single array, with the per-cpu interrupts all laid out at the start and a macro to convert an offset to an index. Might make the code more readable and the struct definition more concise. I'll try and see if this makes the code more palatable. +} + +static inline int vgic_bitmap_get_irq_val(struct vgic_bitmap *x, +int cpuid, int irq) +{ + if (irq < 32) VGIC_NR_PRIVATE_IRQS (unless you go with the suggestion above) yep.
+ return test_bit(irq, x->percpu[cpuid].reg_ul); + + return test_bit(irq - 32, x->shared.reg_ul); +} + +static inline void vgic_bitmap_set_irq_val(struct vgic_bitmap *x, + int cpuid, int irq, int val) +{ + unsigned long *reg; + + if (irq < 32) + reg = x->percpu[cpuid].reg_ul; + else { + reg = x->shared.reg_ul; + irq -= 32; + } Likewise. + + if (val) + set_bit(irq, reg); + else + clear_bit(irq, reg); +} + +static inline unsigned long *vgic_bitmap_get_cpu_map(struct vgic_bitmap *x, +int cpuid) +{ + if (unlikely(cpuid >= VGIC_MAX_CPUS)) + return NULL; + return x->percpu[cpuid].reg_ul; +} + +static inline unsigned long *vgic_bitmap_get_shared_map(struct vgic_bitmap *x) +{ +
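The DECLARE_BITMAP fix discussed above can be sanity-checked in userspace. In this sketch the constants are illustrative, and BITS_TO_LONGS/DECLARE_BITMAP are re-derived from their kernel definitions so the example is self-contained.

```c
#include <limits.h>

#define NR_SHARED_IRQS	96	/* illustrative: VGIC_NR_IRQS - 32 */
#define MAX_CPUS	8

#define BITS_PER_LONG	(CHAR_BIT * sizeof(long))
#define BITS_TO_LONGS(n) (((n) + BITS_PER_LONG - 1) / BITS_PER_LONG)
#define DECLARE_BITMAP(name, bits) unsigned long name[BITS_TO_LONGS(bits)]

/* The fixed layout quoted in the thread: the unsigned-long view is now
 * sized to cover at least as many bits as the u32 view, instead of the
 * zero-length-array trick of the original patch. */
struct vgic_bitmap {
	union {
		unsigned int reg[1];
		DECLARE_BITMAP(reg_ul, 32);
	} percpu[MAX_CPUS];
	union {
		unsigned int reg[NR_SHARED_IRQS / 32];
		DECLARE_BITMAP(reg_ul, NR_SHARED_IRQS);
	} shared;
};
```

The point of the fix is exactly the property asserted below: each bitmap view spans at least the number of interrupt bits it is supposed to track, regardless of sizeof(long).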
kvm: let's add HYPERV capabilities (was Re: [PATCH 01/20] hyper-v: introduce Hyper-V support infrastructure.)
On Fri, Jan 20, 2012 at 03:26:27PM -0200, Marcelo Tosatti wrote: From: Vadim Rozenfeld vroze...@redhat.com [Jan: fix build with CONFIG_USER_ONLY] Signed-off-by: Marcelo Tosatti mtosa...@redhat.com --- Makefile.target |2 + target-i386/cpuid.c | 14 +++ target-i386/hyperv.c | 64 ++ target-i386/hyperv.h | 43 + 4 files changed, 123 insertions(+), 0 deletions(-) create mode 100644 target-i386/hyperv.c create mode 100644 target-i386/hyperv.h diff --git a/Makefile.target b/Makefile.target index 06d79b8..798dd30 100644 --- a/Makefile.target +++ b/Makefile.target @@ -199,6 +199,8 @@ obj-$(CONFIG_NO_KVM) += kvm-stub.o obj-y += memory.o savevm.o LIBS+=-lz +obj-i386-y +=hyperv.o + QEMU_CFLAGS += $(VNC_TLS_CFLAGS) QEMU_CFLAGS += $(VNC_SASL_CFLAGS) QEMU_CFLAGS += $(VNC_JPEG_CFLAGS) diff --git a/target-i386/cpuid.c b/target-i386/cpuid.c index 91a104b..b9bfeaf 100644 --- a/target-i386/cpuid.c +++ b/target-i386/cpuid.c @@ -27,6 +27,8 @@ #include qemu-option.h #include qemu-config.h +#include hyperv.h + /* feature flags taken from Intel Processor Identification and the CPUID * Instruction and AMD's CPUID Specification. In cases of disagreement * between feature naming conventions, aliases may be added. 
@@ -716,6 +718,14 @@ static int cpu_x86_find_by_name(x86_def_t *x86_cpu_def, const char *cpu_model) goto error; } x86_cpu_def-tsc_khz = tsc_freq / 1000; +} else if (!strcmp(featurestr, hv_spinlocks)) { +char *err; +numvalue = strtoul(val, err, 0); +if (!*val || *err) { +fprintf(stderr, bad numerical value %s\n, val); +goto error; +} +hyperv_set_spinlock_retries(numvalue); } else { fprintf(stderr, unrecognized feature %s\n, featurestr); goto error; @@ -724,6 +734,10 @@ static int cpu_x86_find_by_name(x86_def_t *x86_cpu_def, const char *cpu_model) check_cpuid = 1; } else if (!strcmp(featurestr, enforce)) { check_cpuid = enforce_cpuid = 1; +} else if (!strcmp(featurestr, hv_relaxed)) { +hyperv_enable_relaxed_timing(true); +} else if (!strcmp(featurestr, hv_vapic)) { +hyperv_enable_vapic_recommended(true); } else { fprintf(stderr, feature string `%s' not in format (+feature|-feature|feature=xyz)\n, featurestr); goto error; diff --git a/target-i386/hyperv.c b/target-i386/hyperv.c new file mode 100644 index 000..f284e99 --- /dev/null +++ b/target-i386/hyperv.c @@ -0,0 +1,64 @@ +/* + * QEMU Hyper-V support + * + * Copyright Red Hat, Inc. 2011 + * + * Author: Vadim Rozenfeld vroze...@redhat.com + * + * This work is licensed under the terms of the GNU GPL, version 2 or later. + * See the COPYING file in the top-level directory. 
+ * + */ + +#include "hyperv.h" + +static bool hyperv_vapic; +static bool hyperv_relaxed_timing; +static int hyperv_spinlock_attempts = HYPERV_SPINLOCK_NEVER_RETRY; + +void hyperv_enable_vapic_recommended(bool val) +{ +hyperv_vapic = val; +} + +void hyperv_enable_relaxed_timing(bool val) +{ +hyperv_relaxed_timing = val; +} + +void hyperv_set_spinlock_retries(int val) +{ +hyperv_spinlock_attempts = val; +if (hyperv_spinlock_attempts < 0xFFF) { +hyperv_spinlock_attempts = 0xFFF; +} +} + +bool hyperv_enabled(void) +{ +return hyperv_hypercall_available() || hyperv_relaxed_timing_enabled(); +} + +bool hyperv_hypercall_available(void) +{ +if (hyperv_vapic || +(hyperv_spinlock_attempts != HYPERV_SPINLOCK_NEVER_RETRY)) { + return true; +} +return false; +} + +bool hyperv_vapic_recommended(void) +{ +return hyperv_vapic; +} + +bool hyperv_relaxed_timing_enabled(void) +{ +return hyperv_relaxed_timing; +} + +int hyperv_get_spinlock_retries(void) +{ +return hyperv_spinlock_attempts; +} diff --git a/target-i386/hyperv.h b/target-i386/hyperv.h new file mode 100644 index 000..15467bf --- /dev/null +++ b/target-i386/hyperv.h @@ -0,0 +1,43 @@ +/* + * QEMU Hyper-V support + * + * Copyright Red Hat, Inc. 2011 + * + * Author: Vadim Rozenfeld vroze...@redhat.com + * + * This work is licensed under the terms of the GNU GPL, version 2 or later. + * See the COPYING file in the top-level directory. + * + */ + +#ifndef QEMU_HW_HYPERV_H +#define QEMU_HW_HYPERV_H 1 + +#include "qemu-common.h" +#include <asm/hyperv.h> + +#ifndef HYPERV_SPINLOCK_NEVER_RETRY +#define HYPERV_SPINLOCK_NEVER_RETRY 0xFFFFFFFF +#endif + +#ifndef KVM_CPUID_SIGNATURE_NEXT +#define KVM_CPUID_SIGNATURE_NEXT 0x40000100 +#endif
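The hv_spinlocks hunk in cpuid.c parses its value with strtoul plus an empty-string and trailing-garbage check. A minimal standalone version of that idiom (the function name is mine, not QEMU's):

```c
#include <stdbool.h>
#include <stdlib.h>

/* Parse a numeric feature value, rejecting empty strings and trailing
 * garbage, mirroring the `!*val || *err` check in the quoted hunk. */
static bool parse_feature_value(const char *val, unsigned long *out)
{
	char *err;

	if (!val || !*val)
		return false;		/* the "bad numerical value" case */
	*out = strtoul(val, &err, 0);
	return *err == '\0';
}
```

Base 0 lets strtoul accept decimal, octal, and 0x-prefixed hex, which is why `hv_spinlocks=0xFFF` works on the command line.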
Re: [PATCH 2/3] KVM: x86: let reexecute_instruction work for tdp
On 11/28/2012 10:01 PM, Gleb Natapov wrote: On Wed, Nov 28, 2012 at 11:15:13AM +0800, Xiao Guangrong wrote: On 11/28/2012 07:32 AM, Marcelo Tosatti wrote: On Tue, Nov 27, 2012 at 11:13:11AM +0800, Xiao Guangrong wrote: +static bool reexecute_instruction(struct kvm_vcpu *vcpu, unsigned long cr2) { -gpa_t gpa; +gpa_t gpa = cr2; pfn_t pfn; -if (tdp_enabled) +if (!ACCESS_ONCE(vcpu->kvm->arch.indirect_shadow_pages)) return false; How is indirect_shadow_pages protected? Why is ACCESS_ONCE() being used to read it? Hi Marcelo, It is protected by mmu-lock, since it is only changed while mmu-lock is held. And ACCESS_ONCE is used on the read path to avoid the compiler optimizing the load away. Please switch to mmu_lock protection; there is no reason to access this variable locklessly - it is not performance critical. For example, there is no use of barriers when modifying the variable. This is not bad; the worst case is that the direct mmu fails to unprotect the shadow pages (it sees indirect_shadow_pages == 0 while there are still shadowed pages), and after entering the guest we go into reexecute_instruction again, which then removes the shadow pages. Can't the same scenario happen even with mmu lock held around the indirect_shadow_pages access? Hmm, I also think it is no different. Even holding mmu-lock, we cannot prevent the target pfn from being write-protected later. Marcelo?
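For readers unfamiliar with the ACCESS_ONCE() in the hunk above: it forces the compiler to emit a fresh load on every evaluation, classically implemented as a volatile cast. This is a userspace sketch of the read-side idiom being debated; the variable and function names are illustrative stand-ins, not the KVM code.

```c
/* Classic ACCESS_ONCE() definition: a volatile access the compiler may
 * neither cache in a register nor duplicate. (__typeof__ is a GNU C
 * extension, as in the kernel.) */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

static int indirect_shadow_pages;	/* stand-in for the kvm_arch field */

/* Lockless read: a stale value is tolerable here because, as argued in
 * the thread, the caller simply retries after re-entering the guest. */
static int fast_path_allowed(void)
{
	return ACCESS_ONCE(indirect_shadow_pages) == 0;
}
```

Note that ACCESS_ONCE() constrains only the compiler; it provides no memory-ordering guarantees between CPUs, which is exactly why the write side still updates the counter under mmu-lock.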
Re: [PATCH 3/3] KVM: x86: improve reexecute_instruction
On 11/28/2012 10:12 PM, Gleb Natapov wrote: On Tue, Nov 27, 2012 at 11:30:24AM +0800, Xiao Guangrong wrote: On 11/27/2012 06:41 AM, Marcelo Tosatti wrote: - return false; +again: + page_fault_count = ACCESS_ONCE(vcpu->kvm->arch.page_fault_count); + + /* + * if emulation was due to access to shadowed page table + * and it failed try to unshadow page and re-enter the + * guest to let CPU execute the instruction. + */ + kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa)); + emulate = vcpu->arch.mmu.page_fault(vcpu, cr3, PFERR_WRITE_MASK, false); Can you explain what is the objective here? Sure. :) The instruction emulation is caused by a fault access on cr3. After unprotecting the target page, we call vcpu->arch.mmu.page_fault to fix the mapping of cr3. If it returns 1, the mmu cannot fix the mapping and we should report the error; otherwise it is fine to return to the guest and let it re-execute the instruction. page_fault_count is used to avoid races with other vcpus: after we unprotect the target page, another cpu can enter the page fault path and cause the page to be write-protected again. This way we can detect all the cases where the mmu cannot be fixed. Can you write this in a comment above vcpu->arch.mmu.page_fault()? Okay, if Marcelo does not object to this way. :) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHv4 2/2] kvm: deliver msi interrupts from irq handler
On Wed, Nov 28, 2012 at 03:38:40PM +0200, Gleb Natapov wrote: On Wed, Nov 28, 2012 at 03:25:44PM +0200, Michael S. Tsirkin wrote: On Wed, Nov 28, 2012 at 02:45:09PM +0200, Gleb Natapov wrote: On Wed, Nov 28, 2012 at 02:22:45PM +0200, Michael S. Tsirkin wrote: On Wed, Nov 28, 2012 at 02:13:01PM +0200, Gleb Natapov wrote: On Wed, Nov 28, 2012 at 01:56:16PM +0200, Michael S. Tsirkin wrote: On Wed, Nov 28, 2012 at 01:43:34PM +0200, Gleb Natapov wrote: On Wed, Oct 17, 2012 at 06:06:06PM +0200, Michael S. Tsirkin wrote: We can deliver certain interrupts, notably MSI, from atomic context. Use kvm_set_irq_inatomic to implement an irq handler for msi. This reduces the pressure on the scheduler in the case where host and guest irq share a host cpu. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- virt/kvm/assigned-dev.c | 36 ++-- 1 file changed, 26 insertions(+), 10 deletions(-) diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c index 23a41a9..3642239 100644 --- a/virt/kvm/assigned-dev.c +++ b/virt/kvm/assigned-dev.c @@ -105,6 +105,15 @@ static irqreturn_t kvm_assigned_dev_thread_intx(int irq, void *dev_id) } #ifdef __KVM_HAVE_MSI +static irqreturn_t kvm_assigned_dev_msi(int irq, void *dev_id) +{ + struct kvm_assigned_dev_kernel *assigned_dev = dev_id; + int ret = kvm_set_irq_inatomic(assigned_dev->kvm, + assigned_dev->irq_source_id, + assigned_dev->guest_irq, 1); Why not use kvm_set_msi_inatomic() and drop kvm_set_irq_inatomic() from the previous patch? kvm_set_msi_inatomic needs a routing entry, and we don't have the routing entry at this level. Yes, right. BTW, will this interface be used only for legacy assigned devices, or will there be other users too? I think long term we should convert irqfd to this too. VFIO uses irqfd, no? So why does legacy device assignment need that code to achieve parity with VFIO? Clarification: there are two issues: 1.
legacy assignment has bad worst-case latency, because we bounce all interrupts through threads; this patch fixes this. 2. irqfd injects all MSIs from an atomic context; this patch does not fix this, but it can be fixed on top of this patch. Thanks for the clarification. Also, why long term? What are the complications? Nothing special. Just need to be careful with all the rcu trickery that irqfd uses. Further, the guest irq might not be an MSI: host MSI can cause guest intx injection, I think; we need to bounce it to a thread as we did earlier. Ah, so msi in kvm_assigned_dev_msi() is about host msi? Yes. Can the host be intx but the guest msi? No. You seem to not handle this case. Also, injection of intx via ioapic is the same as injecting MSI. The format and the capability of the irq message are essentially the same. Absolutely. So we will be able to extend this to intx long term. The difference is in the fact that unlike msi, intx can (and does) have multiple entries per GSI. I have not yet figured out how to report and handle failure in case one of these can be injected in atomic context and another can't. There's likely an easy way, but it can be a follow-up patch I think. I prefer to figure that out before introducing the interface. Oh come on, it's just an internal interface, not even exported to modules. Changing it would be trivial and the implementation is very small too. The question is whether it can be done at all or not. If it cannot, then it does not matter that the interface is internal, but fortunately it looks like it is possible, so I am fine with the proposed implementation for now. Hmm, we can get rid of the vcpu loop in pic (should be very easily done by checking for kvm_apic_accept_pic_intr() during apic configuration and keeping a global extint vcpu) and then sorting irq routing entries so that the ioapic entry is first, since only ioapic injection can fail.
Yes, I think it's a good idea to remove as many vcpu loops as possible: for example, this vcpu loop is currently hit from atomic context anyway, isn't it? Actually it is not. The lock is dropped just before the loop, so this loop shouldn't be the roadblock at all. Hmm, you are saying PIC injections in atomic context always succeed? + return unlikely(ret == -EWOULDBLOCK) ? IRQ_WAKE_THREAD : IRQ_HANDLED; +} + static irqreturn_t kvm_assigned_dev_thread_msi(int irq, void *dev_id)
Re: [PATCH 2/5] Expand the steal time msr to also contain the consigned time.
On 11/27/2012 03:03 PM, Konrad Rzeszutek Wilk wrote: On Mon, Nov 26, 2012 at 02:36:45PM -0600, Michael Wolf wrote: Add a consigned field. This field will hold the time lost due to capping or overcommit. The rest of the time will still show up in the steal-time field. Signed-off-by: Michael Wolf m...@linux.vnet.ibm.com --- arch/x86/include/asm/paravirt.h |4 ++-- arch/x86/include/asm/paravirt_types.h |2 +- arch/x86/kernel/kvm.c |7 ++- kernel/sched/core.c | 10 +- kernel/sched/cputime.c|2 +- 5 files changed, 15 insertions(+), 10 deletions(-) diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index a0facf3..a5f9f30 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -196,9 +196,9 @@ struct static_key; extern struct static_key paravirt_steal_enabled; extern struct static_key paravirt_steal_rq_enabled; -static inline u64 paravirt_steal_clock(int cpu) +static inline u64 paravirt_steal_clock(int cpu, u64 *steal) So it's u64 here. { - return PVOP_CALL1(u64, pv_time_ops.steal_clock, cpu); + PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal); } static inline unsigned long long paravirt_read_pmc(int counter) diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h index 142236e..5d4fc8b 100644 --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -95,7 +95,7 @@ struct pv_lazy_ops { struct pv_time_ops { unsigned long long (*sched_clock)(void); - unsigned long long (*steal_clock)(int cpu); + void (*steal_clock)(int cpu, unsigned long long *steal); But not u64 here? Any particular reason? It should be void everywhere; thanks for catching that, I will change the code. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHv4 2/2] kvm: deliver msi interrupts from irq handler
On Wed, Nov 28, 2012 at 05:25:17PM +0200, Michael S. Tsirkin wrote: On Wed, Nov 28, 2012 at 03:38:40PM +0200, Gleb Natapov wrote: On Wed, Nov 28, 2012 at 03:25:44PM +0200, Michael S. Tsirkin wrote: On Wed, Nov 28, 2012 at 02:45:09PM +0200, Gleb Natapov wrote: On Wed, Nov 28, 2012 at 02:22:45PM +0200, Michael S. Tsirkin wrote: On Wed, Nov 28, 2012 at 02:13:01PM +0200, Gleb Natapov wrote: On Wed, Nov 28, 2012 at 01:56:16PM +0200, Michael S. Tsirkin wrote: On Wed, Nov 28, 2012 at 01:43:34PM +0200, Gleb Natapov wrote: On Wed, Oct 17, 2012 at 06:06:06PM +0200, Michael S. Tsirkin wrote: We can deliver certain interrupts, notably MSI, from atomic context. Use kvm_set_irq_inatomic to implement an irq handler for msi. This reduces the pressure on the scheduler in the case where host and guest irq share a host cpu. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- virt/kvm/assigned-dev.c | 36 ++-- 1 file changed, 26 insertions(+), 10 deletions(-) diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c index 23a41a9..3642239 100644 --- a/virt/kvm/assigned-dev.c +++ b/virt/kvm/assigned-dev.c @@ -105,6 +105,15 @@ static irqreturn_t kvm_assigned_dev_thread_intx(int irq, void *dev_id) } #ifdef __KVM_HAVE_MSI +static irqreturn_t kvm_assigned_dev_msi(int irq, void *dev_id) +{ + struct kvm_assigned_dev_kernel *assigned_dev = dev_id; + int ret = kvm_set_irq_inatomic(assigned_dev->kvm, + assigned_dev->irq_source_id, + assigned_dev->guest_irq, 1); Why not use kvm_set_msi_inatomic() and drop kvm_set_irq_inatomic() from the previous patch? kvm_set_msi_inatomic needs a routing entry, and we don't have the routing entry at this level. Yes, right. BTW, will this interface be used only for legacy assigned devices, or will there be other users too? I think long term we should convert irqfd to this too. VFIO uses irqfd, no? So why does legacy device assignment need that code to achieve parity with VFIO? Clarification: there are two issues: 1.
legacy assignment has bad worst-case latency, because we bounce all interrupts through threads; this patch fixes this. 2. irqfd injects all MSIs from an atomic context; this patch does not fix this, but it can be fixed on top of this patch. Thanks for the clarification. Also, why long term? What are the complications? Nothing special. Just need to be careful with all the rcu trickery that irqfd uses. Further, the guest irq might not be an MSI: host MSI can cause guest intx injection, I think; we need to bounce it to a thread as we did earlier. Ah, so msi in kvm_assigned_dev_msi() is about host msi? Yes. Can the host be intx but the guest msi? No. You seem to not handle this case. Also, injection of intx via ioapic is the same as injecting MSI. The format and the capability of the irq message are essentially the same. Absolutely. So we will be able to extend this to intx long term. The difference is in the fact that unlike msi, intx can (and does) have multiple entries per GSI. I have not yet figured out how to report and handle failure in case one of these can be injected in atomic context and another can't. There's likely an easy way, but it can be a follow-up patch I think. I prefer to figure that out before introducing the interface. Oh come on, it's just an internal interface, not even exported to modules. Changing it would be trivial and the implementation is very small too. The question is whether it can be done at all or not. If it cannot, then it does not matter that the interface is internal, but fortunately it looks like it is possible, so I am fine with the proposed implementation for now. Hmm, we can get rid of the vcpu loop in pic (should be very easily done by checking for kvm_apic_accept_pic_intr() during apic configuration and keeping a global extint vcpu) and then sorting irq routing entries so that the ioapic entry is first, since only ioapic injection can fail.
Yes, I think it's a good idea to remove as many vcpu loops as possible: for example, this vcpu loop is currently hit from atomic context anyway, isn't it? Actually it is not. The lock is dropped just before the loop, so this loop shouldn't be the roadblock at all. Hmm, you are saying PIC injections in atomic context always succeed? No, I am saying the vcpu loop is not hit.
Re: [PATCH] vhost-blk: Add vhost-blk support v5
From: Michael S. Tsirkin m...@redhat.com Date: Mon, 26 Nov 2012 17:14:16 +0200 On Mon, Nov 19, 2012 at 10:26:41PM +0200, Michael S. Tsirkin wrote: Userspace bits: 1) LKVM The latest vhost-blk userspace bits for kvm tool can be found here: g...@github.com:asias/linux-kvm.git blk.vhost-blk 2) QEMU The latest vhost-blk userspace prototype for QEMU can be found here: g...@github.com:asias/qemu.git blk.vhost-blk Changes in v5: - Do not assume the buffer layout - Fix wakeup race Changes in v4: - Mark req->status as userspace pointer - Use __copy_to_user() instead of copy_to_user() in vhost_blk_set_status() - Add if (need_resched()) schedule() in blk thread - Kill vhost_blk_stop_vq() and move it into vhost_blk_stop() - Use vq_err() instead of pr_warn() - Fail unsupported requests - Add flush in vhost_blk_set_features() Changes in v3: - Sending REQ_FLUSH bio instead of vfs_fsync, thanks Christoph! - Check file passed by user is a raw block device file Signed-off-by: Asias He as...@redhat.com Since there are files shared by this and vhost net it's easiest for me to merge this all through the vhost tree. Hi Dave, are you ok with this proposal? I have no problems with this, for networking parts: Acked-by: David S. Miller da...@davemloft.net -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
PCI device pass through support
Hi Alex, I am trying to pass through a PCI device to the guest to compare the MSI interrupt latency with normal device pass through and pass through using the VFIO framework. I used the following script: for dev in $(ls /sys/bus/pci/devices/:06:00.0/iommu_group/devices); do vendor=$(cat /sys/bus/pci/devices/$dev/vendor) device=$(cat /sys/bus/pci/devices/$dev/device) if [ -e /sys/bus/pci/devices/$dev/driver ]; then echo $dev > /sys/bus/pci/devices/$dev/driver/unbind fi echo $vendor $device > /sys/bus/pci/drivers/vfio-pci/new_id done and added -device vfio-pci,host,host=:06:00.0 to my qemu command line. I am getting this error - qemu-system-x86_64: -device vfio-pci,host,host=:06:00.0: Property 'vfio-pci.host' doesn't take value 'on' -- when I use the qemu from your git repo and also the latest v1.3.0-rc1 qemu. I am using kernel 3.6.7-rt18. Am I missing a step? Krishna -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: PCI device pass through support
On Wed, 2012-11-28 at 16:32 +0000, Krishna J wrote: Hi Alex, I am trying to pass through a PCI device to the guest to compare the MSI interrupt latency with normal device pass through and pass through using the VFIO framework. I used the following script: for dev in $(ls /sys/bus/pci/devices/:06:00.0/iommu_group/devices); do vendor=$(cat /sys/bus/pci/devices/$dev/vendor) device=$(cat /sys/bus/pci/devices/$dev/device) if [ -e /sys/bus/pci/devices/$dev/driver ]; then echo $dev > /sys/bus/pci/devices/$dev/driver/unbind fi echo $vendor $device > /sys/bus/pci/drivers/vfio-pci/new_id done and added -device vfio-pci,host,host=:06:00.0 to my qemu command line. ^ I am getting this error - qemu-system-x86_64: -device vfio-pci,host,host=:06:00.0: Property 'vfio-pci.host' doesn't take value 'on' -- when I use the qemu from your git repo and also the latest v1.3.0-rc1 qemu. I am using kernel 3.6.7-rt18. Am I missing a step? Hi Krishna, There's an extra "host," in the option above, is that the problem? Please let me know if I've mis-documented it anywhere. I hope you can share your results when you get it working. Thanks, Alex -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm tools: powerpc: convert spapr pci to new device iterators
Hi Will, On 2012-11-27 13:00, Will Deacon wrote: Commit 8d35d32d0148 (kvm tools: add generic device registration mechanism) introduced a tree-based device lookup-by-bus mechanism as well as iterators to enumerate the devices on a particular bus. Whilst both x86 and ppc were converted by the original patch, the spapr pci changes were incomplete, so include the required changes here. Compile-tested only on ppc64 970mp. Note that I had to hack the Makefile in order to build guest_init.o with a toolchain defaulting to ppc: $(GUEST_INIT): guest/init.c $(E) LINK $@ - $(Q) $(CC) -static guest/init.c -o $@ - $(Q) $(LD) -r -b binary -o guest/guest_init.o $(GUEST_INIT) + $(Q) $(CC) -m64 -static guest/init.c -o $@ + $(Q) $(LD) -m elf64ppc -r -b binary -o guest/guest_init.o $(GUEST_INIT) $(DEPS): Cc: Matt Evans m...@ozlabs.org Signed-off-by: Will Deacon will.dea...@arm.com I lack PPC But this patch appears too good To be left aside Acked-by: Matt Evans m...@ozlabs.org Cheers, Matt --- tools/kvm/powerpc/spapr_pci.c | 10 +++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/tools/kvm/powerpc/spapr_pci.c b/tools/kvm/powerpc/spapr_pci.c index 5f4016c..ed4b9ab 100644 --- a/tools/kvm/powerpc/spapr_pci.c +++ b/tools/kvm/powerpc/spapr_pci.c @@ -15,6 +15,7 @@ #include "spapr.h" #include "spapr_pci.h" +#include "kvm/devices.h" #include "kvm/fdt.h" #include "kvm/util.h" #include "kvm/pci.h" @@ -248,6 +249,7 @@ int spapr_populate_pci_devices(struct kvm *kvm, void *fdt) { int bus_off, node_off = 0, devid, fn, i, n, devices; + struct device_header *dev_hdr; char nodename[256]; struct { uint32_t hi; @@ -301,14 +303,15 @@ int spapr_populate_pci_devices(struct kvm *kvm, /* Populate PCI devices and allocate IRQs */ devices = 0; - - for (devid = 0; devid < KVM_MAX_DEVICES; devid++) { + dev_hdr = device__first_dev(DEVICE_BUS_PCI); + while (dev_hdr) { uint32_t *irqmap = interrupt_map[devices]; - struct pci_device_header *hdr = pci__find_dev(devid); + struct pci_device_header *hdr = dev_hdr->data;
if (!hdr) continue; + devid = dev_hdr->dev_num; fn = 0; /* kvmtool doesn't yet do multifunction devices */ sprintf(nodename, "pci@%u,%u", devid, fn); @@ -413,6 +416,7 @@ int spapr_populate_pci_devices(struct kvm *kvm, /* We don't set ibm,dma-window property as we don't have an IOMMU. */ ++devices; + dev_hdr = device__next_dev(dev_hdr); } /* Write interrupt map */ -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/5] Alter steal time reporting in KVM
On 11/27/2012 05:24 PM, Marcelo Tosatti wrote: On Mon, Nov 26, 2012 at 02:36:24PM -0600, Michael Wolf wrote: In the case where you have a system that is running in a capped or overcommitted environment the user may see steal time being reported in accounting tools such as top or vmstat. The definition of stolen time is 'time during which the virtual CPU is runnable but not running'. Overcommit is the main scenario which steal time helps to detect. Can you describe the 'capped' case? In the capped case, the time that the guest spends waiting due to having used its full allotment of time shows up as steal time. The way my patchset currently stands is that you would set up the bandwidth control and you would have to pass it a matching value from qemu. In the future, it would be possible to have something parse the bandwidth setting and automatically adjust the setting in the host used for steal time reporting. This can cause confusion for the end user. To ease the confusion this patch set adds the idea of consigned (expected steal) time. The host will separate the consigned time from the steal time. The consignment limit passed to the host will be the amount of steal time expected within a fixed period of time. Any other steal time accruing during that period will show as the traditional steal time. --- Michael Wolf (5): Alter the amount of steal time reported by the guest. Expand the steal time msr to also contain the consigned time. Add the code to send the consigned time from the host to the guest Add a timer to allow the separation of consigned from steal time. Add an ioctl to communicate the consign limit to the host.
arch/x86/include/asm/kvm_host.h | 11 +++ arch/x86/include/asm/kvm_para.h |3 +- arch/x86/include/asm/paravirt.h |4 +-- arch/x86/include/asm/paravirt_types.h |2 + arch/x86/kernel/kvm.c |8 ++--- arch/x86/kernel/paravirt.c|4 +-- arch/x86/kvm/x86.c| 50 - fs/proc/stat.c|9 +- include/linux/kernel_stat.h |2 + include/linux/kvm_host.h |2 + include/uapi/linux/kvm.h |2 + kernel/sched/core.c | 10 ++- kernel/sched/cputime.c| 21 +- kernel/sched/sched.h |2 + virt/kvm/kvm_main.c |7 + 15 files changed, 120 insertions(+), 17 deletions(-) -- Signature -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/5] Alter steal time reporting in KVM
On 11/28/2012 02:45 AM, Glauber Costa wrote: On 11/27/2012 07:10 PM, Michael Wolf wrote: On 11/27/2012 02:48 AM, Glauber Costa wrote: Hi, On 11/27/2012 12:36 AM, Michael Wolf wrote: In the case of where you have a system that is running in a capped or overcommitted environment the user may see steal time being reported in accounting tools such as top or vmstat. This can cause confusion for the end user. To ease the confusion this patch set adds the idea of consigned (expected steal) time. The host will separate the consigned time from the steal time. The consignment limit passed to the host will be the amount of steal time expected within a fixed period of time. Any other steal time accruing during that period will show as the traditional steal time. If you submit this again, please include a version number in your series. Will do. The patchset was sent twice yesterday by mistake. Got an error the first time and didn't think the patches went out. This has been corrected. It would also be helpful to include a small changelog about what changed between last version and this version, so we could focus on that. yes, will do that. When I took the RFC off the patches I was looking at it as a new patchset which was a mistake. I will make sure to add a changelog when I submit again. As for the rest, I answered your previous two submissions saying I don't agree with the concept. If you hadn't changed anything, resending it won't change my mind. I could of course, be mistaken or misguided. But I had also not seen any wave of support in favor of this previously, so basically I have no new data to make me believe I should see it any differently. Let's try this again: * Rik asked you in your last submission how does ppc handle this. You said, and I quote: In the case of lpar on POWER systems they simply report steal time and do not alter it in any way. They do however report how much processor is assigned to the partition and that information is in /proc/ppc64/lparcfg. 
Yes, but we still get questions from users asking "what is steal time?" and "why am I seeing this?" Now, that is a *way* more sensible thing to do. Much more. Confusing users is something extremely subjective. This is especially true about concepts that have been known for quite some time, like steal time. If you all of a sudden change the meaning of this, it is sure to confuse a lot more users than it would clarify. Something like this could certainly be done. But when I was submitting the patch set as an RFC, qemu was passing a cpu percentage that would be used by the guest kernel to adjust the steal time. This percentage was being stored on the guest as a sysctl value. Avi stated he didn't like that kind of coupling, and that the value could get out of sync. Anthony stated "The guest shouldn't need to know its entitlement. Or at least, it's up to a management tool to report that in a way that's meaningful for the guest." So perhaps I misunderstood what they were suggesting, but I took it to mean that they did not want the guest to know what the entitlement was. That the host should take care of it and just report the already adjusted data to the guest. So in this version of the code the host would use a set period for a timer and be passed essentially a number of ticks of expected steal time. The host would then use the timer to break out the steal time into consigned and steal buckets, which would be reported to the guest. Both the consigned and the steal would be reported via /proc/stat. So anyone needing to see total time away could add the two fields together. The user, however, when using tools like top or vmstat would see the usage based on what the guest is entitled to. Do you have suggestions for how I can build consensus around one of the two approaches? Before I answer this, can you please detail which mechanism you are using to enforce the entitlement? Is it the cgroup cpu controller, or something else? It is set up using cpu overcommit.
But the request was for something that would work both in the overcommit environment and when hard capping is being used. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Performance issue
On Wed, Nov 28, 2012 at 1:39 PM, Vadim Rozenfeld vroze...@redhat.com wrote: On Tuesday, November 27, 2012 11:13:12 PM George-Cristian Bîrzan wrote: On Tue, Nov 27, 2012 at 10:38 PM, Vadim Rozenfeld vroze...@redhat.com wrote: I have some code which do both reference time and invariant TSC but it will not work after migration. I will send it later today. Do you mean migrating guests? This is not an issue for us. OK, but don't say I didn't warn you :) There are two patches, one for kvm and another one for qemu. you will probably need to rebase them. Add hv_tsc cpu parameter to activate this feature. you will probably need to deactivate hpet by adding -no-hpet parameter as well. I've also added +hv_relaxed since then, but this is the command I'm using now and there's no change: /usr/bin/qemu-kvm -name b691546e-79f8-49c6-a293-81067503a6ad -S -M pc-1.2 -enable-kvm -m 16384 -smp 9,sockets=1,cores=9,threads=1 -uuid b691546e-79f8-49c6-a293-81067503a6ad -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/b691546e-79f8-49c6-a293-81067503a6ad.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-hpet -no-shutdown -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/var/lib/libvirt/images/dis-magnetics-2-223101/d8b233c6-8424-4de9-ae3c-7c9a60288514,if=none,id=drive-virtio-disk0,format=qcow2,cache=writeback,aio=native -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=35,id=hostnet0,vhost=on,vhostfd=36 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=22:2e:fb:a2:36:be,bus=pci.0,addr=0x3 -netdev tap,fd=40,id=hostnet1,vhost=on,vhostfd=41 -device virtio-net-pci,netdev=hostnet1,id=net1,mac=22:94:44:5a:cb:24,bus=pci.0,addr=0x4 -vnc 127.0.0.1:0,password -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -cpu host,hv_tsc I compiled qemu-1.2.0-24 after applying your patch, used the head for KVM, and I see no difference. 
I've tried setting windows' useplatformclock on and off, no change either. Other than that, I was looking into a profiling trace of the software running, and a lot of time (60%?) is spent calling two functions from hal.dll: HalpGetPmTimerSleepModePerfCounter when I disable HPET, and HalpHPETProgramRolloverTimer otherwise, which does point at something related to the timers. Any other thing I can try? -- George-Cristian Bîrzan -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/5] Alter steal time reporting in KVM
Glauber Costa glom...@parallels.com writes: Hi, On 11/27/2012 12:36 AM, Michael Wolf wrote: In the case where you have a system that is running in a capped or overcommitted environment the user may see steal time being reported in accounting tools such as top or vmstat. This can cause confusion for the end user. To ease the confusion this patch set adds the idea of consigned (expected steal) time. The host will separate the consigned time from the steal time. The consignment limit passed to the host will be the amount of steal time expected within a fixed period of time. Any other steal time accruing during that period will show as the traditional steal time. If you submit this again, please include a version number in your series. It would also be helpful to include a small changelog about what changed between last version and this version, so we could focus on that. As for the rest, I answered your previous two submissions saying I don't agree with the concept. If you hadn't changed anything, resending it won't change my mind. I could, of course, be mistaken or misguided. But I had also not seen any wave of support in favor of this previously, so basically I have no new data to make me believe I should see it any differently. Let's try this again: * Rik asked you in your last submission how does ppc handle this. You said, and I quote: In the case of lpar on POWER systems they simply report steal time and do not alter it in any way. They do however report how much processor is assigned to the partition and that information is in /proc/ppc64/lparcfg. This is only helpful for static entitlements. But if we allow dynamic entitlements--which is a very useful feature, think of buying an online upgrade in a cloud environment--then you need to account for entitlement loss at the same place where you do the rest of the accounting: in /proc/stat. Now, that is a *way* more sensible thing to do. Much more. Confusing users is something extremely subjective.
This is especially true about concepts that have been known for quite some time, like steal time. If you all of a sudden change the meaning of this, it is sure to confuse a lot more users than it would clarify. I'll bring you a nice bottle of scotch at the next KVM Forum if you can find me one user that can accurately describe what steal time is. The semantics are so incredibly subtle that I have a hard time believing anyone actually understands what it means today. Regards, Anthony Liguori --- Michael Wolf (5): Alter the amount of steal time reported by the guest. Expand the steal time msr to also contain the consigned time. Add the code to send the consigned time from the host to the guest Add a timer to allow the separation of consigned from steal time. Add an ioctl to communicate the consign limit to the host. arch/x86/include/asm/kvm_host.h | 11 +++ arch/x86/include/asm/kvm_para.h |3 +- arch/x86/include/asm/paravirt.h |4 +-- arch/x86/include/asm/paravirt_types.h |2 + arch/x86/kernel/kvm.c |8 ++--- arch/x86/kernel/paravirt.c|4 +-- arch/x86/kvm/x86.c| 50 - fs/proc/stat.c|9 +- include/linux/kernel_stat.h |2 + include/linux/kvm_host.h |2 + include/uapi/linux/kvm.h |2 + kernel/sched/core.c | 10 ++- kernel/sched/cputime.c| 21 +- kernel/sched/sched.h |2 + virt/kvm/kvm_main.c |7 + 15 files changed, 120 insertions(+), 17 deletions(-) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Performance issue
On Wed, Nov 28, 2012 at 1:39 PM, Vadim Rozenfeld vroze...@redhat.com wrote: There are two patches, one for kvm and another one for qemu. I just realised this. I was supposed to use qemu, or qemu-kvm? I used qemu -- George-Cristian Bîrzan
Re: Performance issue
On Wed, Nov 28, 2012 at 09:18:38PM +0200, George-Cristian Bîrzan wrote: On Wed, Nov 28, 2012 at 1:39 PM, Vadim Rozenfeld vroze...@redhat.com wrote: There are two patches, one for kvm and another one for qemu. I just realised this. I was supposed to use qemu, or qemu-kvm? I used qemu Does not matter, but you need to also recompile the kernel with the first patch. -- Gleb.
Re: Performance issue
On Wed, Nov 28, 2012 at 9:56 PM, Gleb Natapov g...@redhat.com wrote: On Wed, Nov 28, 2012 at 09:18:38PM +0200, George-Cristian Bîrzan wrote: On Wed, Nov 28, 2012 at 1:39 PM, Vadim Rozenfeld vroze...@redhat.com wrote: There are two patches, one for kvm and another one for qemu. I just realised this. I was supposed to use qemu, or qemu-kvm? I used qemu Does not matter, but you need to also recompile the kernel with the first patch. Do I have to recompile the kernel, or just the module? I followed the instructions at http://www.linux-kvm.org/page/Code#building_an_external_module_with_older_kernels but I guess I can do the whole kernel, if it might help. -- George-Cristian Bîrzan
Re: Performance issue
On Wed, Nov 28, 2012 at 10:01:04PM +0200, George-Cristian Bîrzan wrote: On Wed, Nov 28, 2012 at 9:56 PM, Gleb Natapov g...@redhat.com wrote: On Wed, Nov 28, 2012 at 09:18:38PM +0200, George-Cristian Bîrzan wrote: On Wed, Nov 28, 2012 at 1:39 PM, Vadim Rozenfeld vroze...@redhat.com wrote: There are two patches, one for kvm and another one for qemu. I just realised this. I was supposed to use qemu, or qemu-kvm? I used qemu Does not matter, but you need to also recompile the kernel with the first patch. Do I have to recompile the kernel, or just the module? I followed the instructions at http://www.linux-kvm.org/page/Code#building_an_external_module_with_older_kernels but I guess I can do the whole kernel, if it might help. Module is enough, but kvm-kmod is not what you want. Just rebuild the whole kernel if you do not know how to rebuild only the module for your distribution's kernel. -- Gleb.
Re: [PATCH 0/5] Alter steal time reporting in KVM
On 11/28/2012 10:43 PM, Michael Wolf wrote: On 11/27/2012 05:24 PM, Marcelo Tosatti wrote: On Mon, Nov 26, 2012 at 02:36:24PM -0600, Michael Wolf wrote: In the case where you have a system that is running in a capped or overcommitted environment, the user may see steal time being reported in accounting tools such as top or vmstat. The definition of stolen time is 'time during which the virtual CPU is runnable but not running'. Overcommit is the main scenario in which steal time helps to detect. Can you describe the 'capped' case? In the capped case, the time that the guest spends waiting due to it having used its full allotment of time shows up as steal time. The way my patchset currently stands is that you would set up the bandwidth control and you would have to pass it a matching value from qemu. In the future, it would be possible to have something parse the bandwidth setting and automatically adjust the setting in the host used for steal time reporting. Ok, so correct me if I am wrong, but I believe you would be using something like the bandwidth capper in the cpu cgroup to set those entitlements, right? Some time has passed since I last looked into it, but IIRC, after you are out of your quota, you should be out of the runqueue. In the lovely world of KVM, we approximate steal time as runqueue time: arch/x86/kvm/x86.c: delta = current->sched_info.run_delay - vcpu->arch.st.last_steal; vcpu->arch.st.last_steal = current->sched_info.run_delay; vcpu->arch.st.accum_steal = delta; include/linux/sched.h: unsigned long long run_delay; /* time spent waiting on a runqueue */ So if you are out of the runqueue, you won't get steal time accounted, and then I truly fail to understand what you are doing. In case I am wrong, and run_delay also includes the time you can't run because you are out of capacity, then maybe what we should do is just subtract it from run_delay in kvm/x86.c before we pass it on. In summary: Alter the amount of steal time reported by the guest.
Maybe this should go away.
Expand the steal time msr to also contain the consigned time.
Maybe this should go away.
Add the code to send the consigned time from the host to the guest.
This definitely should be heavily modified.
Add a timer to allow the separation of consigned from steal time.
Maybe this should go away.
Add an ioctl to communicate the consign limit to the host.
This definitely should go away.
More specifically, *whatever* way we use to cap the processor, the host system will have all the information at all times.
Re: [PATCH] vfio powerpc: implemented IOMMU driver for VFIO
On Wed, 2012-11-28 at 18:21 +1100, Alexey Kardashevskiy wrote: VFIO implements platform independent stuff such as a PCI driver, BAR access (via read/write on a file descriptor or direct mapping when possible) and IRQ signaling. The platform dependent part includes IOMMU initialization and handling. This patch implements an IOMMU driver for VFIO which does mapping/unmapping pages for the guest IO and provides information about the DMA window (required by a POWERPC guest). The counterpart in QEMU is required to support this functionality. Cc: David Gibson da...@gibson.dropbear.id.au Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru ---
 drivers/vfio/Kconfig                |   6 +
 drivers/vfio/Makefile               |   1 +
 drivers/vfio/vfio_iommu_spapr_tce.c | 332 +++++++++++++++++++++++++++
 include/linux/vfio.h                |  33 +++
 4 files changed, 372 insertions(+)
 create mode 100644 drivers/vfio/vfio_iommu_spapr_tce.c

diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index 7cd5dec..b464687 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -3,10 +3,16 @@ config VFIO_IOMMU_TYPE1
 	depends on VFIO
 	default n
 
+config VFIO_IOMMU_SPAPR_TCE
+	tristate
+	depends on VFIO && SPAPR_TCE_IOMMU
+	default n
+
 menuconfig VFIO
 	tristate "VFIO Non-Privileged userspace driver framework"
 	depends on IOMMU_API
 	select VFIO_IOMMU_TYPE1 if X86
+	select VFIO_IOMMU_SPAPR_TCE if PPC_POWERNV
 	help
 	  VFIO provides a framework for secure userspace device drivers.
 	  See Documentation/vfio.txt for more details.

diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
index 2398d4a..72bfabc 100644
--- a/drivers/vfio/Makefile
+++ b/drivers/vfio/Makefile
@@ -1,3 +1,4 @@
 obj-$(CONFIG_VFIO) += vfio.o
 obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o
+obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
 obj-$(CONFIG_VFIO_PCI) += pci/

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
new file mode 100644
index 000..b98770e
--- /dev/null
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -0,0 +1,332 @@
+/*
+ * VFIO: IOMMU DMA mapping support for TCE on POWER
+ *
+ * Copyright (C) 2012 IBM Corp. All rights reserved.
+ * Author: Alexey Kardashevskiy a...@ozlabs.ru
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Derived from original vfio_iommu_type1.c:
+ * Copyright (C) 2012 Red Hat, Inc. All rights reserved.
+ * Author: Alex Williamson alex.william...@redhat.com
+ */
+
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/err.h>
+#include <linux/vfio.h>
+#include <asm/iommu.h>
+
+#define DRIVER_VERSION "0.1"
+#define DRIVER_AUTHOR  "a...@ozlabs.ru"
+#define DRIVER_DESC    "VFIO IOMMU SPAPR TCE"
+
+static void tce_iommu_detach_group(void *iommu_data,
+		struct iommu_group *iommu_group);
+
+/*
+ * VFIO IOMMU fd for SPAPR_TCE IOMMU implementation
+ */
+
+/*
+ * This code handles mapping and unmapping of user data buffers
+ * into DMA'ble space using the IOMMU
+ */
+
+#define NPAGE_TO_SIZE(npage) ((size_t)(npage) << PAGE_SHIFT)
+
+struct vwork {
+	struct mm_struct	*mm;
+	long			npage;
+	struct work_struct	work;
+};
+
+/* delayed decrement/increment for locked_vm */
+static void lock_acct_bg(struct work_struct *work)
+{
+	struct vwork *vwork = container_of(work, struct vwork, work);
+	struct mm_struct *mm;
+
+	mm = vwork->mm;
+	down_write(&mm->mmap_sem);
+	mm->locked_vm += vwork->npage;
+	up_write(&mm->mmap_sem);
+	mmput(mm);
+	kfree(vwork);
+}
+
+static void lock_acct(long npage)
+{
+	struct vwork *vwork;
+	struct mm_struct *mm;
+
+	if (!current->mm)
+		return; /* process exited */
+
+	if (down_write_trylock(&current->mm->mmap_sem)) {
+		current->mm->locked_vm += npage;
+		up_write(&current->mm->mmap_sem);
+		return;
+	}
+
+	/*
+	 * Couldn't get mmap_sem lock, so must setup to update
+	 * mm->locked_vm later. If locked_vm were atomic, we
+	 * wouldn't need this silliness
+	 */
+	vwork = kmalloc(sizeof(struct vwork), GFP_KERNEL);
+	if (!vwork)
+		return;
+	mm = get_task_mm(current);
+	if (!mm) {
+		kfree(vwork);
+		return;
+	}
+	INIT_WORK(&vwork->work, lock_acct_bg);
+	vwork->mm = mm;
+	vwork->npage = npage;
+	schedule_work(&vwork->work);
+}

This looks familiar, should we split it out to a common file instead of duplicating it?

+
+/*
+ * The container descriptor supports only a single group per container.
+ * Required by the
[PATCH V2] Added tests for ia32_tsc_adjust functionality.
Added x86/tsc_adjust.c and updated x86/vmexit.c to include timing tests for reading and writing the emulated IA32_TSC_ADJUST msr. Signed-off-by: Will Auld will.a...@intel.com ---
 config-x86-common.mak |  5 +-
 x86/tsc_adjust.c      | 60 +++++++++++++++++++++++++++++++++
 x86/vmexit.c          | 13 +++++++
 3 files changed, 77 insertions(+), 1 deletion(-)
 create mode 100644 x86/tsc_adjust.c

diff --git a/config-x86-common.mak b/config-x86-common.mak
index c76cd11..8f909f7 100644
--- a/config-x86-common.mak
+++ b/config-x86-common.mak
@@ -34,7 +34,8 @@ tests-common = $(TEST_DIR)/vmexit.flat $(TEST_DIR)/tsc.flat \
 	$(TEST_DIR)/realmode.flat $(TEST_DIR)/msr.flat \
 	$(TEST_DIR)/hypercall.flat $(TEST_DIR)/sieve.flat \
 	$(TEST_DIR)/kvmclock_test.flat $(TEST_DIR)/eventinj.flat \
-	$(TEST_DIR)/s3.flat $(TEST_DIR)/pmu.flat $(TEST_DIR)/asyncpf.flat
+	$(TEST_DIR)/s3.flat $(TEST_DIR)/pmu.flat \
+	$(TEST_DIR)/tsc_adjust.flat $(TEST_DIR)/asyncpf.flat
 
 ifdef API
 tests-common += api/api-sample
@@ -64,6 +65,8 @@
 $(TEST_DIR)/port80.elf: $(cstart.o) $(TEST_DIR)/port80.o
 
 $(TEST_DIR)/tsc.elf: $(cstart.o) $(TEST_DIR)/tsc.o
 
+$(TEST_DIR)/tsc_adjust.elf: $(cstart.o) $(TEST_DIR)/tsc_adjust.o
+
 $(TEST_DIR)/apic.elf: $(cstart.o) $(TEST_DIR)/apic.o
 
 $(TEST_DIR)/realmode.elf: $(TEST_DIR)/realmode.o

diff --git a/x86/tsc_adjust.c b/x86/tsc_adjust.c
new file mode 100644
index 000..05cc5d9
--- /dev/null
+++ b/x86/tsc_adjust.c
@@ -0,0 +1,60 @@
+#include "libcflat.h"
+#include "processor.h"
+
+#define IA32_TSC_ADJUST 0x3b
+
+int main()
+{
+	u64 t1, t2, t3, t4, t5;
+	u64 est_delta_time;
+	bool pass = true;
+
+	if (cpuid(7).b & (1 << 1)) { // IA32_TSC_ADJUST Feature is enabled?
+		if (rdmsr(IA32_TSC_ADJUST) != 0x0) {
+			printf("failure: IA32_TSC_ADJUST msr was incorrectly initialized\n");
+			pass = false;
+		}
+		t3 = 1000ull;
+		t1 = rdtsc();
+		wrmsr(IA32_TSC_ADJUST, t3);
+		t2 = rdtsc();
+		if (rdmsr(IA32_TSC_ADJUST) != t3) {
+			printf("failure: IA32_TSC_ADJUST msr read / write incorrect\n");
+			pass = false;
+		}
+		if (t2 - t1 < t3) {
+			printf("failure: TSC did not adjust for IA32_TSC_ADJUST value\n");
+			pass = false;
+		}
+		t3 = 0x0;
+		wrmsr(IA32_TSC_ADJUST, t3);
+		if (rdmsr(IA32_TSC_ADJUST) != t3) {
+			printf("failure: IA32_TSC_ADJUST msr read / write incorrect\n");
+			pass = false;
+		}
+		t4 = 1000ull;
+		t1 = rdtsc();
+		wrtsc(t4);
+		t2 = rdtsc();
+		t5 = rdmsr(IA32_TSC_ADJUST);
+		// est of time between reading tsc and writing tsc,
+		// (based on IA32_TSC_ADJUST msr value) should be small
+		est_delta_time = t4 - t5 - t1;
+		if (est_delta_time > 2 * (t2 - t4)) {
+			// arbitrary 2x latency (wrtsc-rdtsc) threshold
+			printf("failure: IA32_TSC_ADJUST msr incorrectly adjusted on tsc write\n");
+			pass = false;
+		}
+		if (pass)
+			printf("success: IA32_TSC_ADJUST enabled and working correctly\n");
+	} else {
+		printf("success: IA32_TSC_ADJUST feature not enabled\n");
+	}
+	return pass ? 0 : 1;
+}

diff --git a/x86/vmexit.c b/x86/vmexit.c
index ad8ab55..99ff964 100644
--- a/x86/vmexit.c
+++ b/x86/vmexit.c
@@ -34,6 +34,7 @@ static void vmcall(void)
 	asm volatile ("vmcall" : "+a"(a), "=b"(b), "=c"(c), "=d"(d));
 }
 
+#define MSR_TSC_ADJUST 0x3b
 #define MSR_EFER 0xc0000080
 #define EFER_NX_MASK (1ull << 11)
 
@@ -103,6 +104,16 @@ static void ple_round_robin(void)
 	++counters[you].n1;
 }
 
+static void rd_tsc_adjust_msr(void)
+{
+	rdmsr(MSR_TSC_ADJUST);
+}
+
+static void wr_tsc_adjust_msr(void)
+{
+	wrmsr(MSR_TSC_ADJUST, 0x0);
+}
+
 static struct test {
 	void (*func)(void);
 	const char *name;
@@ -119,6 +130,8 @@
 	{ ipi, "ipi", is_smp, .parallel = 0, },
 	{ ipi_halt, "ipi+halt", is_smp, .parallel = 0, },
 	{ ple_round_robin, "ple-round-robin", .parallel = 1 },
+	{ wr_tsc_adjust_msr, "wr_tsc_adjust_msr", .parallel = 1 },
+	{ rd_tsc_adjust_msr, "rd_tsc_adjust_msr", .parallel = 1 },
 };
 
 unsigned iterations;
-- 
1.8.0.rc0
Re: [PATCH 2/3] KVM: x86: let reexecute_instruction work for tdp
On Wed, Nov 28, 2012 at 10:55:26PM +0800, Xiao Guangrong wrote: On 11/28/2012 10:01 PM, Gleb Natapov wrote: On Wed, Nov 28, 2012 at 11:15:13AM +0800, Xiao Guangrong wrote: On 11/28/2012 07:32 AM, Marcelo Tosatti wrote: On Tue, Nov 27, 2012 at 11:13:11AM +0800, Xiao Guangrong wrote: +static bool reexecute_instruction(struct kvm_vcpu *vcpu, unsigned long cr2) { - gpa_t gpa; + gpa_t gpa = cr2; pfn_t pfn; - if (tdp_enabled) + if (!ACCESS_ONCE(vcpu->kvm->arch.indirect_shadow_pages)) return false; How is indirect_shadow_pages protected? Why is ACCESS_ONCE() being used to read it? Hi Marcelo, It is protected by mmu-lock, since it can only be changed while mmu-lock is held. And ACCESS_ONCE is used on the read path to avoid magic optimization from the compiler. Please switch to mmu_lock protection, there is no reason to have access to this variable locklessly - not performance critical. For example, there is no use of barriers when modifying the variable. This is not bad; the worst case is that the direct mmu fails to unprotect the shadow pages (it sees indirect_shadow_pages = 0, but there are still shadow pages being shadowed); after entering the guest, we will go into reexecute_instruction again, and then it will remove the shadow pages. Isn't the same scenario possible even with mmu lock around the indirect_shadow_pages access? Hmm..., i also think it is no different. Even using mmu-lock, we can not prevent the target pfn from being write-protected later. Marcelo? In this particular case, it appears to be harmless (unsure if the kvm_mmu_pte_write one is safe). The point is, lockless access should not be special. Lockless access must be carefully documented (access protocol to variables documented, all possible cases listed), and done when necessary due to performance. Otherwise, don't do it. In this case, it's not necessary.
Re: [PATCH 3/3] KVM: x86: improve reexecute_instruction
On Wed, Nov 28, 2012 at 10:59:35PM +0800, Xiao Guangrong wrote: On 11/28/2012 10:12 PM, Gleb Natapov wrote: On Tue, Nov 27, 2012 at 11:30:24AM +0800, Xiao Guangrong wrote: On 11/27/2012 06:41 AM, Marcelo Tosatti wrote: - return false; +again: + page_fault_count = ACCESS_ONCE(vcpu->kvm->arch.page_fault_count); + + /* + * if emulation was due to access to shadowed page table + * and it failed try to unshadow page and re-enter the + * guest to let CPU execute the instruction. + */ + kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa)); + emulate = vcpu->arch.mmu.page_fault(vcpu, cr3, PFERR_WRITE_MASK, false); Can you explain what is the objective here? Sure. :) The instruction emulation is caused by a fault access on cr3. After unprotecting the target page, we call vcpu->arch.mmu.page_fault to fix the mapping of cr3. If it returns 1, the mmu can not fix the mapping and we should report the error; otherwise it is good to return to the guest and let it re-execute the instruction again. page_fault_count is used to avoid the race with other vcpus, since after we unprotect the target page, another cpu can enter the page fault path and let the page be write-protected again. This way we can detect all the cases where the mmu can not be fixed. Can you write this in a comment above vcpu->arch.mmu.page_fault()? Okay, if Marcelo does not object to this way. :) I do object, since it is possible to detect precisely the condition by storing which gfns have been cached. Then, Xiao, you need a way to handle large read-only sptes.
Re: [PATCH V2] Added tests for ia32_tsc_adjust functionality.
On Wed, Nov 28, 2012 at 01:32:09PM -0800, Will Auld wrote: Added x86/tsc_adjust.c and updated x86/vmexit.c to include timing tests for reading and writing the emulated IA32_TSC_ADJUST msr. Signed-off-by: Will Auld will.a...@intel.com --- config-x86-common.mak | 5 +- x86/tsc_adjust.c | 60 +++ x86/vmexit.c | 13 +++ 3 files changed, 77 insertions(+), 1 deletion(-) create mode 100644 x86/tsc_adjust.c Applied, thanks.
Re: [PATCH 3/3] KVM: x86: improve reexecute_instruction
On 11/29/2012 05:57 AM, Marcelo Tosatti wrote: On Wed, Nov 28, 2012 at 10:59:35PM +0800, Xiao Guangrong wrote: On 11/28/2012 10:12 PM, Gleb Natapov wrote: On Tue, Nov 27, 2012 at 11:30:24AM +0800, Xiao Guangrong wrote: On 11/27/2012 06:41 AM, Marcelo Tosatti wrote: - return false; +again: + page_fault_count = ACCESS_ONCE(vcpu->kvm->arch.page_fault_count); + + /* + * if emulation was due to access to shadowed page table + * and it failed try to unshadow page and re-enter the + * guest to let CPU execute the instruction. + */ + kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa)); + emulate = vcpu->arch.mmu.page_fault(vcpu, cr3, PFERR_WRITE_MASK, false); Can you explain what is the objective here? Sure. :) The instruction emulation is caused by a fault access on cr3. After unprotecting the target page, we call vcpu->arch.mmu.page_fault to fix the mapping of cr3. If it returns 1, the mmu can not fix the mapping and we should report the error; otherwise it is good to return to the guest and let it re-execute the instruction again. page_fault_count is used to avoid the race with other vcpus, since after we unprotect the target page, another cpu can enter the page fault path and let the page be write-protected again. This way we can detect all the cases where the mmu can not be fixed. Can you write this in a comment above vcpu->arch.mmu.page_fault()? Okay, if Marcelo does not object to this way. :) I do object, since it is possible to detect precisely the condition by storing which gfns have been cached. Then, Xiao, you need a way to handle large read-only sptes. Sorry, Marcelo, i am still confused why read-only sptes can not work under this patch? The code after the read-only large spte is: + if ((level > PT_PAGE_TABLE_LEVEL && + has_wrprotected_page(vcpu->kvm, gfn, level)) || + mmu_need_write_protect(vcpu, gfn, can_unsync)) { pgprintk("%s: found shadow page for %llx, marking ro\n", __func__, gfn); ret = 1; It returns 1, then reexecute_instruction returns 0. It is the same as without the read-only large spte.
Re: [PATCH 3/3] KVM: x86: improve reexecute_instruction
On 11/29/2012 06:40 AM, Xiao Guangrong wrote: On 11/29/2012 05:57 AM, Marcelo Tosatti wrote: On Wed, Nov 28, 2012 at 10:59:35PM +0800, Xiao Guangrong wrote: On 11/28/2012 10:12 PM, Gleb Natapov wrote: On Tue, Nov 27, 2012 at 11:30:24AM +0800, Xiao Guangrong wrote: On 11/27/2012 06:41 AM, Marcelo Tosatti wrote: - return false; +again: + page_fault_count = ACCESS_ONCE(vcpu->kvm->arch.page_fault_count); + + /* + * if emulation was due to access to shadowed page table + * and it failed try to unshadow page and re-enter the + * guest to let CPU execute the instruction. + */ + kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa)); + emulate = vcpu->arch.mmu.page_fault(vcpu, cr3, PFERR_WRITE_MASK, false); Can you explain what is the objective here? Sure. :) The instruction emulation is caused by a fault access on cr3. After unprotecting the target page, we call vcpu->arch.mmu.page_fault to fix the mapping of cr3. If it returns 1, the mmu can not fix the mapping and we should report the error; otherwise it is good to return to the guest and let it re-execute the instruction again. page_fault_count is used to avoid the race with other vcpus, since after we unprotect the target page, another cpu can enter the page fault path and let the page be write-protected again. This way we can detect all the cases where the mmu can not be fixed. Can you write this in a comment above vcpu->arch.mmu.page_fault()? Okay, if Marcelo does not object to this way. :) I do object, since it is possible to detect precisely the condition by storing which gfns have been cached. Then, Xiao, you need a way to handle large read-only sptes. Sorry, Marcelo, i am still confused why read-only sptes can not work under this patch? The code after the read-only large spte is: + if ((level > PT_PAGE_TABLE_LEVEL && + has_wrprotected_page(vcpu->kvm, gfn, level)) || + mmu_need_write_protect(vcpu, gfn, can_unsync)) { pgprintk("%s: found shadow page for %llx, marking ro\n", __func__, gfn); ret = 1; It returns 1, then reexecute_instruction returns 0.
It is the same as without the read-only large spte. Ah, wait, there is a case: the large page is located at 0-2M, its 0-4K range is used as a page table (e.g. a PDE), and the guest wants to write memory located at 5K, which should be freely writable. This patch can return 0 for both the current code and the read-only large spte. I need to think about it more.
RE: [PATCH V2] Added tests for ia32_tsc_adjust functionality.
Thanks Marcelo! -Original Message- From: Marcelo Tosatti [mailto:mtosa...@redhat.com] Sent: Wednesday, November 28, 2012 2:14 PM To: Auld, Will Cc: kvm@vger.kernel.org; Dugger, Donald D; Liu, Jinsong; Zhang, Xiantao; a...@redhat.com; Gleb Subject: Re: [PATCH V2] Added tests for ia32_tsc_adjust functionality. On Wed, Nov 28, 2012 at 01:32:09PM -0800, Will Auld wrote: Added x86/tsc_adjust.c and updated x86/vmexit.c to include timing tests for reading and writing the emulated IA32_TSC_ADJUST msr. Signed-off-by: Will Auld will.a...@intel.com --- config-x86-common.mak | 5 +- x86/tsc_adjust.c | 60 +++ x86/vmexit.c | 13 +++ 3 files changed, 77 insertions(+), 1 deletion(-) create mode 100644 x86/tsc_adjust.c Applied, thanks.
Re: [PATCH 3/3] KVM: x86: improve reexecute_instruction
On Thu, Nov 29, 2012 at 06:40:51AM +0800, Xiao Guangrong wrote: On 11/29/2012 05:57 AM, Marcelo Tosatti wrote: On Wed, Nov 28, 2012 at 10:59:35PM +0800, Xiao Guangrong wrote: On 11/28/2012 10:12 PM, Gleb Natapov wrote: On Tue, Nov 27, 2012 at 11:30:24AM +0800, Xiao Guangrong wrote: On 11/27/2012 06:41 AM, Marcelo Tosatti wrote: - return false; +again: + page_fault_count = ACCESS_ONCE(vcpu->kvm->arch.page_fault_count); + + /* + * if emulation was due to access to shadowed page table + * and it failed try to unshadow page and re-enter the + * guest to let CPU execute the instruction. + */ + kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa)); + emulate = vcpu->arch.mmu.page_fault(vcpu, cr3, PFERR_WRITE_MASK, false); Can you explain what is the objective here? Sure. :) The instruction emulation is caused by a fault access on cr3. After unprotecting the target page, we call vcpu->arch.mmu.page_fault to fix the mapping of cr3. If it returns 1, the mmu can not fix the mapping and we should report the error; otherwise it is good to return to the guest and let it re-execute the instruction again. page_fault_count is used to avoid the race with other vcpus, since after we unprotect the target page, another cpu can enter the page fault path and let the page be write-protected again. This way we can detect all the cases where the mmu can not be fixed. Can you write this in a comment above vcpu->arch.mmu.page_fault()? Okay, if Marcelo does not object to this way. :) I do object, since it is possible to detect precisely the condition by storing which gfns have been cached. Then, Xiao, you need a way to handle large read-only sptes. Sorry, Marcelo, i am still confused why read-only sptes can not work under this patch?
The code after the read-only large spte is: + if ((level > PT_PAGE_TABLE_LEVEL && + has_wrprotected_page(vcpu->kvm, gfn, level)) || + mmu_need_write_protect(vcpu, gfn, can_unsync)) { pgprintk("%s: found shadow page for %llx, marking ro\n", __func__, gfn); ret = 1; It returns 1, then reexecute_instruction returns 0. It is the same as without the read-only large spte. https://lkml.org/lkml/2012/11/17/75 Does unshadowing work with large sptes at reexecute_instruction? That is, do we nuke any large read-only sptes that might be causing a certain gfn to be read-only? That is, following the sequence there, is the large read-only spte properly destroyed?
Re: [PATCH 1/2] KVM: VMX: fix invalid cpu passed to smp_call_function_single
On Wed, Nov 28, 2012 at 08:53:15PM +0800, Xiao Guangrong wrote: In loaded_vmcs_clear, loaded_vmcs->cpu is the first parameter passed to smp_call_function_single; if the target cpu is going down (doing cpu hot remove), loaded_vmcs->cpu can become -1, and then -1 is passed to smp_call_function_single. It can be triggered when a vcpu is being destroyed, since loaded_vmcs_clear is called in a preemptible context. Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com --- arch/x86/kvm/vmx.c | 8 +++--- 1 files changed, 5 insertions(+), 3 deletions(-) Applied, thanks.
Re: [PATCH] KVM: use is_idle_task() instead of idle_cpu() to decide when to halt in async_pf
On Wed, Nov 28, 2012 at 03:19:08PM +0200, Gleb Natapov wrote: As Frederic pointed out, idle_cpu() may return false even if the async fault happened in the idle task, if a wake up is pending. In this case the code will try to put the idle task to sleep. Fix this by using is_idle_task() to check for the idle task. Reported-by: Frederic Weisbecker fweis...@gmail.com Signed-off-by: Gleb Natapov g...@redhat.com Applied, thanks.
Re: [PATCH 2/2] KVM: VMX: fix memory order between loading vmcs and clearing vmcs
On Wed, Nov 28, 2012 at 08:54:14PM +0800, Xiao Guangrong wrote: vmcs->cpu indicates whether it exists on the target cpu; -1 means the vmcs does not exist on any vcpu. If a vcpu loads a vmcs with vmcs.cpu = -1, it can be directly added to the cpu's percpu list. The list can be corrupted if the cpu prefetches the vmcs's list before reading vmcs->cpu. Meanwhile, we should remove the vmcs from the list before making vmcs->vcpu == -1 visible. Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com --- arch/x86/kvm/vmx.c | 17 + 1 files changed, 17 insertions(+), 0 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 29e8f42..6056d88 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -1002,6 +1002,15 @@ static void __loaded_vmcs_clear(void *arg) if (per_cpu(current_vmcs, cpu) == loaded_vmcs->vmcs) per_cpu(current_vmcs, cpu) = NULL; list_del(&loaded_vmcs->loaded_vmcss_on_cpu_link); + + /* + * we should ensure updating loaded_vmcs->loaded_vmcss_on_cpu_link + * is before setting loaded_vmcs->vcpu to -1 which is done in + * loaded_vmcs_init. Otherwise, other cpu can see vcpu = -1 first + * then adds the vmcs into percpu list before it is deleted. + */ + smp_wmb(); + Neither loads nor stores are reordered with like operations (see section 8.2.3.2 of Intel's volume 3). This behaviour makes the barrier unnecessary. However, I agree access to loaded_vmcs is not obviously safe. I can't tell whether it's safe with vmm_exclusive = 0 (where vcpu->cpu can change at any time).
Re: [PATCH 3/3] KVM: x86: improve reexecute_instruction
On Thu, Nov 29, 2012 at 07:16:50AM +0800, Xiao Guangrong wrote: On 11/29/2012 06:40 AM, Xiao Guangrong wrote: On 11/29/2012 05:57 AM, Marcelo Tosatti wrote: On Wed, Nov 28, 2012 at 10:59:35PM +0800, Xiao Guangrong wrote: On 11/28/2012 10:12 PM, Gleb Natapov wrote: On Tue, Nov 27, 2012 at 11:30:24AM +0800, Xiao Guangrong wrote: On 11/27/2012 06:41 AM, Marcelo Tosatti wrote: - return false; +again: + page_fault_count = ACCESS_ONCE(vcpu->kvm->arch.page_fault_count); + + /* + * if emulation was due to access to shadowed page table + * and it failed try to unshadow page and re-enter the + * guest to let CPU execute the instruction. + */ + kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa)); + emulate = vcpu->arch.mmu.page_fault(vcpu, cr3, PFERR_WRITE_MASK, false); Can you explain what is the objective here? Sure. :) The instruction emulation is caused by a fault access on cr3. After unprotecting the target page, we call vcpu->arch.mmu.page_fault to fix the mapping of cr3. If it returns 1, the mmu can not fix the mapping and we should report the error; otherwise it is good to return to the guest and let it re-execute the instruction again. page_fault_count is used to avoid the race with other vcpus, since after we unprotect the target page, another cpu can enter the page fault path and let the page be write-protected again. This way we can detect all the cases where the mmu can not be fixed. Can you write this in a comment above vcpu->arch.mmu.page_fault()? Okay, if Marcelo does not object to this way. :) I do object, since it is possible to detect precisely the condition by storing which gfns have been cached. Then, Xiao, you need a way to handle large read-only sptes. Sorry, Marcelo, i am still confused why read-only sptes can not work under this patch?
The code after the read-only large spte is: + if ((level > PT_PAGE_TABLE_LEVEL && + has_wrprotected_page(vcpu->kvm, gfn, level)) || + mmu_need_write_protect(vcpu, gfn, can_unsync)) { pgprintk("%s: found shadow page for %llx, marking ro\n", __func__, gfn); ret = 1; It returns 1, then reexecute_instruction returns 0. It is the same as without the read-only large spte. Ah, wait, there is a case: the large page is located at 0-2M, its 0-4K range is used as a page table (e.g. a PDE), and the guest wants to write memory located at 5K, which should be freely writable. This patch can return 0 for both the current code and the read-only large spte. Yes, we should remove the read-only large spte if any write to 0-2M fails (I said 'unshadow' in the previous email, but the correct term is 'remove large sptes in range'). I need to think about it more.
Re: [PATCH V3 RFC 1/2] sched: Bail out of yield_to when source and target runqueue has one task
On Wed, 2012-11-28 at 14:26 -0800, Chegu Vinod wrote: On 11/27/2012 6:23 AM, Chegu Vinod wrote: On 11/27/2012 2:30 AM, Raghavendra K T wrote: On 11/26/2012 07:05 PM, Andrew Jones wrote: On Mon, Nov 26, 2012 at 05:37:54PM +0530, Raghavendra K T wrote: From: Peter Zijlstra pet...@infradead.org In case of undercommitted scenarios, especially in large guests, yield_to overhead is significantly high. When the run queue length of source and target is one, take the opportunity to bail out and return -ESRCH. This return condition can be further exploited to quickly come out of the PLE handler. (History: Raghavendra initially worked on breaking out of the kvm ple handler upon seeing source runqueue length = 1, but it had to export rq length). Peter came up with the elegant idea of returning -ESRCH in the scheduler core. Signed-off-by: Peter Zijlstra pet...@infradead.org Raghavendra, Checking the rq length of target vcpu condition added. (thanks Avi) Reviewed-by: Srikar Dronamraju sri...@linux.vnet.ibm.com Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com --- kernel/sched/core.c | 25 +++-- 1 file changed, 19 insertions(+), 6 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 2d8927f..fc219a5 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4289,7 +4289,10 @@ EXPORT_SYMBOL(yield); * It's the caller's job to ensure that the target task struct * can't go away on us before we can do any checks. * - * Returns true if we indeed boosted the target task. + * Returns: + * true (>0) if we indeed boosted the target task. + * false (0) if we failed to boost the target. + * -ESRCH if there's no task to yield to. */ bool __sched yield_to(struct task_struct *p, bool preempt) { @@ -4303,6 +4306,15 @@ bool __sched yield_to(struct task_struct *p, bool preempt) again: p_rq = task_rq(p); + /* + * If we're the only runnable task on the rq and target rq also + * has only one task, there's absolutely no point in yielding.
+ */ + if (rq->nr_running == 1 && p_rq->nr_running == 1) { + yielded = -ESRCH; + goto out_irq; + } + double_rq_lock(rq, p_rq); while (task_rq(p) != p_rq) { double_rq_unlock(rq, p_rq); @@ -4310,13 +4322,13 @@ again: } if (!curr->sched_class->yield_to_task) - goto out; + goto out_unlock; if (curr->sched_class != p->sched_class) - goto out; + goto out_unlock; if (task_running(p_rq, p) || p->state) - goto out; + goto out_unlock; yielded = curr->sched_class->yield_to_task(rq, p, preempt); if (yielded) { @@ -4329,11 +4341,12 @@ again: resched_task(p_rq->curr); } -out: +out_unlock: double_rq_unlock(rq, p_rq); +out_irq: local_irq_restore(flags); - if (yielded) + if (yielded > 0) schedule(); return yielded; Acked-by: Andrew Jones drjo...@redhat.com Thank you Drew. Marcelo, Gleb.. Please let me know if you have comments / concerns on the patches.. Andrew, Vinod, IMO, the patch set looks good for undercommit scenarios, especially for large guests where we do have the overhead of vcpu iteration in the ple handler. Thanks Raghu. Will try to get this latest patch set evaluated and get back to you. Hi Raghu, Here is some preliminary data with your latest set of PLE patches (also with Andrew's throttled yield_to() change). Ran a single guest on an 80 core Westmere platform. [Note: Host and Guest had the latest kernel from kvm.git and were also using the latest qemu from qemu.git as of yesterday morning]. The guest was running an AIM7 high_systime workload. (Note: high_systime is a kernel-intensive micro-benchmark, but in this case it was run just as a workload in the guest to trigger spinlock etc. contention in the guest OS and hence PLE; i.e. this is not a real benchmark run). I have run this workload with a constant # (i.e. 2000) of users with 100 jobs per user. The numbers below represent the # of jobs per minute (JPM) - the higher the better.

                              40VCPU  60VCPU  80VCPU
a) 3.7.0-rc6+ w/ ple_gap=0    ~102K   ~88K    ~81K
b) 3.7.0-rc6+                 ~53K    ~25K    ~18-20K
c) 3.7.0-rc6+ w/ PLE patches  ~100K   ~81K    ~48K-69K (lot of variation from run to run)
Re: [PATCH 2/2] KVM: VMX: fix memory order between loading vmcs and clearing vmcs
On 11/29/2012 08:04 AM, Marcelo Tosatti wrote: On Wed, Nov 28, 2012 at 08:54:14PM +0800, Xiao Guangrong wrote: vmcs->cpu indicates whether the vmcs exists on the target cpu; -1 means the vmcs does not exist on any cpu. If a vcpu loads a vmcs with vmcs.cpu = -1, it can be directly added to the cpu's percpu list. The list can be corrupted if the cpu prefetches the vmcs's list before reading vmcs->cpu. Meanwhile, we should remove the vmcs from the list before making vmcs->cpu == -1 visible. Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com --- arch/x86/kvm/vmx.c | 17 + 1 files changed, 17 insertions(+), 0 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 29e8f42..6056d88 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -1002,6 +1002,15 @@ static void __loaded_vmcs_clear(void *arg) if (per_cpu(current_vmcs, cpu) == loaded_vmcs->vmcs) per_cpu(current_vmcs, cpu) = NULL; list_del(&loaded_vmcs->loaded_vmcss_on_cpu_link); + + /* + * we should ensure updating loaded_vmcs->loaded_vmcss_on_cpu_link + * is before setting loaded_vmcs->cpu to -1 which is done in + * loaded_vmcs_init. Otherwise, other cpus can see cpu = -1 first + * and then add the vmcs into the percpu list before it is deleted. + */ + smp_wmb(); + Neither loads nor stores are reordered with like operations (see section 8.2.3.2 of Intel's volume 3). This behaviour makes the barrier unnecessary. Ouch, yes, you are right. My memory is wrong. It seems only a later read can be reordered with an earlier write. But if 'read vs read' and 'write vs write' ordering is guaranteed by the CPU, smp_wmb() and smp_rmb() should only be a compiler barrier, so I think we can add the barriers to improve readability and portability. And anyway, the current code is missing a compiler barrier. However, I agree that access to loaded_vmcs is not obviously safe. I can't tell it's safe with vmm_exclusive = 0 (where vcpu->cpu can change at any time). If vmm_exclusive = 0, the vmcs can be removed from the percpu list when the vcpu is scheduled out.
The list is not broken.
RE: [PATCH v2 4/4] nested vmx: use a list to store the launched vmcs12 for L1 VMM
-----Original Message----- From: Gleb Natapov [mailto:g...@redhat.com] Sent: Wednesday, November 28, 2012 7:28 PM To: Marcelo Tosatti Cc: Xu, Dongxiao; kvm@vger.kernel.org Subject: Re: [PATCH v2 4/4] nested vmx: use a list to store the launched vmcs12 for L1 VMM On Tue, Nov 27, 2012 at 10:29:08PM -0200, Marcelo Tosatti wrote: On Thu, Nov 22, 2012 at 12:51:59PM +0800, Dongxiao Xu wrote: The launch state is not a member in the VMCS area; use a separate variable (list) to store it instead. Signed-off-by: Dongxiao Xu dongxiao...@intel.com 1. What is the problem with keeping launched state in the VMCS? Assuming there is a positive answer to the above: 2. Don't you have to change the VMCS ID? 3. Can't it be kept somewhere else other than a list? The current scheme allows the guest to allocate unlimited amounts of host memory. 4. What is the state of migration / nested vmx again? If vmcs12 is migrated, this means launched state is not migrated anymore. Patches 1-3 seem fine. According to Dongxiao they slow down nested guests by 4%. For that version, it did introduce a certain performance downgrade. Actually, in my new patch I simplified the vmcs12_read() and vmcs12_write() functions and there is no obvious performance downgrade. Thanks, Dongxiao -- Gleb.
Re: [PATCH V3 RFC 0/2] kvm: Improving undercommit scenarios
On 11/26/2012 4:07 AM, Raghavendra K T wrote: In some special scenarios like #vcpu = #pcpu, the PLE handler may prove very costly, because there is no need to iterate over vcpus and do unsuccessful yield_to, burning CPU. The first patch optimizes all the yield_to calls by bailing out when there is no need to continue in yield_to (i.e., when there is only one task in source and target rq). The second patch uses that in the PLE handler. Further, when a yield_to fails we do not immediately go out of the PLE handler; instead we try thrice, for a better statistical chance that the failure was genuine. Otherwise that would affect moderate overcommit cases. Results on a 3.7.0-rc6 kernel show around 140% improvement for ebizzy 1x and around 51% for dbench 1x on a 32 core PLE machine with a 32 vcpu guest. base = 3.7.0-rc6, machine: 32 core mx3850 x5 PLE machine

ebizzy (rec/sec, higher is better)
     base        stdev      patched     stdev      %improve
1x   2511.3000    21.5409   6051.8000   170.2592   140.98276
2x   2679.4000   332.4482   2692.3000   251.4005     0.48145
3x   2253.5000   266.4243   2192.1667   178.9753    -2.72169
4x   1784.3750   102.2699   2018.7500   187.5723    13.13485

dbench (throughput in MB/sec, higher is better)
     base        stdev      patched     stdev      %improve
1x   6677.4080   638.5048  10098.0060  3449.7026    51.22643
2x   2012.6760    64.7642   2019.0440    62.6702     0.31639
3x   1302.0783    40.8336   1292.7517    27.0515    -0.71629
4x   3043.1725  3243.7281   4664.4662  5946.5741    53.27643

Here is the reference no-PLE result: ebizzy-1x_nople 7592.6000 rec/sec, dbench_1x_nople 7853.6960 MB/sec. The result says we can still improve by 60% for ebizzy, but overall we are getting impressive performance with the patches. Changes since V2: - Dropped global measures usage patch (Peter Zijlstra) - Do not bail out on first failure (Avi Kivity) - Try thrice on failure of yield_to to get statistically more correct behaviour.
Changes since V1: - Discard the idea of exporting nr_running and optimize in the core scheduler (Peter) - Use yield() instead of schedule in overcommit scenarios (Rik) - Use loadavg knowledge to detect undercommit/overcommit Peter Zijlstra (1): Bail out of yield_to when source and target runqueue has one task Raghavendra K T (1): Handle yield_to failure return for potential undercommit case Please let me know your comments and suggestions. Link for V2: https://lkml.org/lkml/2012/10/29/287 Link for V1: https://lkml.org/lkml/2012/9/21/168 kernel/sched/core.c | 25 +++-- virt/kvm/kvm_main.c | 26 -- 2 files changed, 35 insertions(+), 16 deletions(-) . Tested-by: Chegu Vinod chegu_vi...@hp.com
RE: [PATCH v2 4/4] nested vmx: use a list to store the launched vmcs12 for L1 VMM
-----Original Message----- From: Orit Wasserman [mailto:owass...@redhat.com] Sent: Wednesday, November 28, 2012 8:30 PM To: Marcelo Tosatti Cc: Xu, Dongxiao; kvm@vger.kernel.org; g...@redhat.com Subject: Re: [PATCH v2 4/4] nested vmx: use a list to store the launched vmcs12 for L1 VMM On 11/28/2012 02:29 AM, Marcelo Tosatti wrote: On Thu, Nov 22, 2012 at 12:51:59PM +0800, Dongxiao Xu wrote: The launch state is not a member in the VMCS area; use a separate variable (list) to store it instead. Signed-off-by: Dongxiao Xu dongxiao...@intel.com 1. What is the problem with keeping launched state in the VMCS? Assuming there is a positive answer to the above: 2. Don't you have to change the VMCS ID? 3. Can't it be kept somewhere else other than a list? The current scheme allows the guest to allocate unlimited amounts of host memory. I agree with Marcelo; you have to limit the number of VMCSs in the list, otherwise it will be easy to attack a host with nested :) Yes, that is a good point. I will add a limit on the number of VMCSs for the guest VMM. Thanks, Dongxiao 4. What is the state of migration / nested vmx again? If vmcs12 is migrated, this means launched state is not migrated anymore. Patches 1-3 seem fine.
Re: [PATCH V3 RFC 1/2] sched: Bail out of yield_to when source and target runqueue has one task
On 11/28/2012 5:09 PM, Chegu Vinod wrote: On 11/27/2012 6:23 AM, Chegu Vinod wrote: On 11/27/2012 2:30 AM, Raghavendra K T wrote: On 11/26/2012 07:05 PM, Andrew Jones wrote: On Mon, Nov 26, 2012 at 05:37:54PM +0530, Raghavendra K T wrote: From: Peter Zijlstra pet...@infradead.org In case of undercommitted scenarios, especially in large guests, yield_to overhead is significantly high. When the run queue length of source and target is one, take the opportunity to bail out and return -ESRCH. This return condition can be further exploited to quickly come out of the PLE handler. (History: Raghavendra initially worked on breaking out of the kvm ple handler upon seeing source runqueue length = 1, but it had to export rq length). Peter came up with the elegant idea of returning -ESRCH in the scheduler core. Signed-off-by: Peter Zijlstra pet...@infradead.org Raghavendra, Checking the rq length of target vcpu condition added. (thanks Avi) Reviewed-by: Srikar Dronamraju sri...@linux.vnet.ibm.com Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com --- kernel/sched/core.c | 25 +++-- 1 file changed, 19 insertions(+), 6 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 2d8927f..fc219a5 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4289,7 +4289,10 @@ EXPORT_SYMBOL(yield); * It's the caller's job to ensure that the target task struct * can't go away on us before we can do any checks. * - * Returns true if we indeed boosted the target task. + * Returns: + * true (>0) if we indeed boosted the target task. + * false (0) if we failed to boost the target. + * -ESRCH if there's no task to yield to. */ bool __sched yield_to(struct task_struct *p, bool preempt) { @@ -4303,6 +4306,15 @@ bool __sched yield_to(struct task_struct *p, bool preempt) again: p_rq = task_rq(p); + /* + * If we're the only runnable task on the rq and target rq also + * has only one task, there's absolutely no point in yielding.
+ */ + if (rq->nr_running == 1 && p_rq->nr_running == 1) { + yielded = -ESRCH; + goto out_irq; + } + double_rq_lock(rq, p_rq); while (task_rq(p) != p_rq) { double_rq_unlock(rq, p_rq); @@ -4310,13 +4322,13 @@ again: } if (!curr->sched_class->yield_to_task) - goto out; + goto out_unlock; if (curr->sched_class != p->sched_class) - goto out; + goto out_unlock; if (task_running(p_rq, p) || p->state) - goto out; + goto out_unlock; yielded = curr->sched_class->yield_to_task(rq, p, preempt); if (yielded) { @@ -4329,11 +4341,12 @@ again: resched_task(p_rq->curr); } -out: +out_unlock: double_rq_unlock(rq, p_rq); +out_irq: local_irq_restore(flags); - if (yielded) + if (yielded > 0) schedule(); return yielded; Acked-by: Andrew Jones drjo...@redhat.com Thank you Drew. Marcelo, Gleb.. Please let me know if you have comments / concerns on the patches.. Andrew, Vinod, IMO, the patch set looks good for undercommit scenarios, especially for large guests where we do have the overhead of vcpu iteration in the ple handler. Thanks Raghu. Will try to get this latest patch set evaluated and get back to you. Vinod Resending as prev. email to the kvm and lkml email aliases bounced twice... Apologies for any repeats! Hi Raghu, Here is some preliminary data with your latest set of PLE patches (also with Andrew's throttled yield_to() change). Ran a single guest on an 80 core Westmere platform. [Note: Host and Guest had the latest kernel from kvm.git and were also using the latest qemu from qemu.git as of yesterday morning]. The guest was running an AIM7 high_systime workload. (Note: high_systime is a kernel-intensive micro-benchmark, but in this case it was run just as a workload in the guest to trigger spinlock etc. contention in the guest OS and hence PLE; i.e. this is not a real benchmark run). I have run this workload with a constant # (i.e. 2000) of users with 100 jobs per user. The numbers below represent the # of jobs per minute (JPM) - the higher the better.
                                              40VCPU  60VCPU  80VCPU
a) 3.7.0-rc6+ w/ ple_gap=0                    ~102K   ~88K    ~81K
b) 3.7.0-rc6+                                 ~53K    ~25K    ~18-20K
c) 3.7.0-rc6+ w/ PLE patches                  ~100K   ~81K    ~48K-69K (lot of variation from run to run)
d) 3.7.0-rc6+ w/ throttled yield_to() change  ~101K   ~87K    ~78K

The PLE patches case (c) does show improvements in this non-overcommit large guest case when compared to case (b). However, at 80-way I started to observe quite a bit of variation from run to run, and the JPM was lower when compared with the throttled yield_to() change, case (d). For this 80-way configuration in case (c) I also noticed that the average time spent in the PLE exit (as reported by small samplings from perf kvm stat) varied quite a bit and was at times much greater than in the throttled yield_to() change case (d). More details are
Re: [PATCH] vfio powerpc: implemented IOMMU driver for VFIO
On 29/11/12 08:01, Alex Williamson wrote: On Wed, 2012-11-28 at 18:21 +1100, Alexey Kardashevskiy wrote: VFIO implements platform independent stuff such as a PCI driver, BAR access (via read/write on a file descriptor or direct mapping when possible) and IRQ signaling. The platform dependent part includes IOMMU initialization and handling. This patch implements an IOMMU driver for VFIO which does mapping/unmapping of pages for the guest IO and provides information about the DMA window (required by a POWERPC guest). The counterpart in QEMU is required to support this functionality. Cc: David Gibson da...@gibson.dropbear.id.au Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru --- drivers/vfio/Kconfig | 6 + drivers/vfio/Makefile | 1 + drivers/vfio/vfio_iommu_spapr_tce.c | 332 +++ include/linux/vfio.h | 33 4 files changed, 372 insertions(+) create mode 100644 drivers/vfio/vfio_iommu_spapr_tce.c diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig index 7cd5dec..b464687 100644 --- a/drivers/vfio/Kconfig +++ b/drivers/vfio/Kconfig @@ -3,10 +3,16 @@ config VFIO_IOMMU_TYPE1 depends on VFIO default n +config VFIO_IOMMU_SPAPR_TCE + tristate + depends on VFIO && SPAPR_TCE_IOMMU + default n + menuconfig VFIO tristate "VFIO Non-Privileged userspace driver framework" depends on IOMMU_API select VFIO_IOMMU_TYPE1 if X86 + select VFIO_IOMMU_SPAPR_TCE if PPC_POWERNV help VFIO provides a framework for secure userspace device drivers. See Documentation/vfio.txt for more details.
diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile index 2398d4a..72bfabc 100644 --- a/drivers/vfio/Makefile +++ b/drivers/vfio/Makefile @@ -1,3 +1,4 @@ obj-$(CONFIG_VFIO) += vfio.o obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o +obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o obj-$(CONFIG_VFIO_PCI) += pci/ diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c new file mode 100644 index 000..b98770e --- /dev/null +++ b/drivers/vfio/vfio_iommu_spapr_tce.c @@ -0,0 +1,332 @@ +/* + * VFIO: IOMMU DMA mapping support for TCE on POWER + * + * Copyright (C) 2012 IBM Corp. All rights reserved. + * Author: Alexey Kardashevskiy a...@ozlabs.ru + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + * + * Derived from original vfio_iommu_type1.c: + * Copyright (C) 2012 Red Hat, Inc. All rights reserved. 
+ * Author: Alex Williamson alex.william...@redhat.com + */ + +#include <linux/module.h> +#include <linux/pci.h> +#include <linux/slab.h> +#include <linux/uaccess.h> +#include <linux/err.h> +#include <linux/vfio.h> +#include <asm/iommu.h> + +#define DRIVER_VERSION "0.1" +#define DRIVER_AUTHOR "a...@ozlabs.ru" +#define DRIVER_DESC "VFIO IOMMU SPAPR TCE" + +static void tce_iommu_detach_group(void *iommu_data, + struct iommu_group *iommu_group); + +/* + * VFIO IOMMU fd for SPAPR_TCE IOMMU implementation + */ + +/* + * This code handles mapping and unmapping of user data buffers + * into DMA'ble space using the IOMMU + */ + +#define NPAGE_TO_SIZE(npage) ((size_t)(npage) << PAGE_SHIFT) + +struct vwork { + struct mm_struct *mm; + long npage; + struct work_struct work; +}; + +/* delayed decrement/increment for locked_vm */ +static void lock_acct_bg(struct work_struct *work) +{ + struct vwork *vwork = container_of(work, struct vwork, work); + struct mm_struct *mm; + + mm = vwork->mm; + down_write(&mm->mmap_sem); + mm->locked_vm += vwork->npage; + up_write(&mm->mmap_sem); + mmput(mm); + kfree(vwork); +} + +static void lock_acct(long npage) +{ + struct vwork *vwork; + struct mm_struct *mm; + + if (!current->mm) + return; /* process exited */ + + if (down_write_trylock(&current->mm->mmap_sem)) { + current->mm->locked_vm += npage; + up_write(&current->mm->mmap_sem); + return; + } + + /* + * Couldn't get mmap_sem lock, so must setup to update + * mm->locked_vm later. If locked_vm were atomic, we + * wouldn't need this silliness + */ + vwork = kmalloc(sizeof(struct vwork), GFP_KERNEL); + if (!vwork) + return; + mm = get_task_mm(current); + if (!mm) { + kfree(vwork); + return; + } + INIT_WORK(&vwork->work, lock_acct_bg); + vwork->mm = mm; + vwork->npage = npage; + schedule_work(&vwork->work); +} This looks familiar, should we split it out to a common file instead of duplicating it? It is a simple cut-n-paste from the type1 driver :) Moving it to a separate file is up to you but it is
Re: [RFC PATCH 0/4] MSI affinity for assigned devices
On 2012-11-28 00:21, Alex Williamson wrote: On Wed, 2012-11-28 at 00:08 +0100, Jan Kiszka wrote: On 2012-11-27 23:00, Alex Williamson wrote: This is post-1.3 material, so I'll just post it as an RFC for now. MSI routing updates aren't currently handled by pci-assign or vfio-pci (when using KVM acceleration), which means that trying to set interrupt SMP affinity in the guest has no effect unless MSI is completely disabled and re-enabled. This series fixes this for both device assignment backends using similar schemes. We store the last MSIMessage programmed to KVM and do updates to the MSI route when it changes. pci-assign takes a little bit of refactoring to make this happen cleanly. Thanks, This should rather be done by implementing vector notifiers for MSI as well. That way the device model no longer has to track reasons for vector changes in an open-coded fashion, just like we already do for MSI-X. Was on my todo list for a long time, but I never reached this item. MSI masking is optional and not many devices seem to support it. What I see with a Linux guest is that it just overwrites the address/data while MSI is enabled. What were you thinking for notifiers? mask, unmask, update? I'm not sure I'm interested enough in this to add MSI vector notifiers. Thanks, We didn't merge mask notifiers but a more generic concept. Our notifiers are supposed to fire when the configuration changes effectively. So, for MSI, a use event also has to trigger on change, i.e. when the configuration is overwritten while the vector is already active. Alternatively, you could also add such an update notifier, using the MSIVectorUseNotifier signature if it makes the user side simpler. Jan
Re: SOLVED: RE: Garbled audio - Windows 7 64 bit guest on Debian
Jimmy Crossley jcrossley at CoNetrix.com writes: -----Original Message----- From: kvm-owner at vger.kernel.org [mailto:kvm-owner at vger.kernel.org] On Behalf Of Jimmy Crossley Sent: Friday, June 08, 2012 17:29 Subject: Garbled audio - Windows 7 64 bit guest on Debian I am experiencing garbled sound on a Windows 7 64 bit guest running under 64 bit Debian. I have searched many discussion groups, etc. on the net and could find nothing useful, so I thought I would post this here, hoping someone with a deeper understanding could help out. ** snip If I connect to the machine using remote desktop (mstsc.exe, rdesktop, xfreerdp), the sound gets redirected to the local machine and sounds perfect. The sound is only garbled when using SDL. The same audio problems exist if I start up the machine and connect to it with vnc. I have mostly solved this issue. The sound works much, much better, but is still not as good as on my host machine. I installed PulseAudio and used it instead of ALSA. In order to get kvm to use it, I set the environment variable QEMU_AUDIO_DRV=pa. I had been using sudo to start the VM, and that kept this environment variable from being used. I did a sudo setcap cap_net_admin+ep /usr/bin/kvm in order to be able to run kvm under a normal user account. Now the sound works quite well. I've pretty much duplicated this setup (Gentoo host, qemu-kvm 1.2.1, PulseAudio, SDL, emulated HDA audio, Windows 7 64 bit SP1 guest) for the same reasons and with the same result. However, as jcrossley mentioned, it's not as crisp as the host; there's still a lot of crackling and popping. Based on jcrossley's investigation using RDP instead (and seeing no ill effects with respect to sound), wouldn't the issue point at the HDA emulation? How would one go about getting involved in terms of helping out with the HDA emulator?