[PATCH] KVM: APIC: avoid instruction emulation for EOI writes

2011-08-29 Thread Tian, Kevin
Hi, Avi,

Here comes the patch:

KVM: APIC: avoid instruction emulation for EOI writes

Instruction emulation for EOI writes can be skipped, since a sane
guest simply uses MOV instead of string operations. This is a nice
improvement when the guest doesn't support x2apic or hyper-V EOI
support.

A single-VM test shows ~8% bandwidth improvement (7.4Gbps -> 8Gbps),
by saving ~5% of cycles otherwise spent on EOI emulation.

Signed-off-by: Kevin Tian kevin.t...@intel.com
Based on earlier work from:
Signed-off-by: Eddie Dong eddie.d...@intel.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 57dcbd4..933187e 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -864,6 +864,15 @@ static int apic_mmio_write(struct kvm_io_device *this,
return 0;
 }
 
+void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu)
+{
+   struct kvm_lapic *apic = vcpu->arch.apic;
+
+   if (apic)
+   apic_set_eoi(vcpu->arch.apic);
+}
+EXPORT_SYMBOL_GPL(kvm_lapic_set_eoi);
+
 void kvm_free_lapic(struct kvm_vcpu *vcpu)
 {
if (!vcpu->arch.apic)
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 52c9e6b..8287243 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -26,6 +26,7 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu);
 void kvm_lapic_reset(struct kvm_vcpu *vcpu);
 u64 kvm_lapic_get_cr8(struct kvm_vcpu *vcpu);
 void kvm_lapic_set_tpr(struct kvm_vcpu *vcpu, unsigned long cr8);
+void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu);
 void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value);
 u64 kvm_lapic_get_base(struct kvm_vcpu *vcpu);
 void kvm_apic_set_version(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 5e8d411..35e4af7 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4540,6 +4540,22 @@ static int handle_xsetbv(struct kvm_vcpu *vcpu)
 
 static int handle_apic_access(struct kvm_vcpu *vcpu)
 {
+   unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+   int access_type, offset;
+
+   access_type = (exit_qualification >> 12) & 0xf;
+   offset = exit_qualification & 0xfff;
+   /*
+    * A sane guest uses MOV instead of string operations to
+    * write EOI, and the written value is ignored. So take a
+    * short-cut here and avoid heavy instruction
+    * emulation.
+    */
+   if ((access_type == 1) && (offset == APIC_EOI)) {
+   kvm_lapic_set_eoi(vcpu);
+   skip_emulated_instruction(vcpu);
+   return 1;
+   }
return emulate_instruction(vcpu, 0) == EMULATE_DONE;
 }

Thanks
Kevin

 -Original Message-
 From: Avi Kivity [mailto:a...@redhat.com]
 Sent: Thursday, August 25, 2011 12:39 PM
 To: Tian, Kevin
 Cc: kvm@vger.kernel.org; Nakajima, Jun
 Subject: Re: about vEOI optimization
 
 On 08/25/2011 05:24 AM, Tian, Kevin wrote:
  
Another option is the hyper-V EOI support, which can also eliminate the
EOI exit when no additional interrupt is pending.  This can improve EOI
performance even more.
  
 
  yes, and this is an orthogonal option.
 
  So if you agree, I'll send out an updated patch atop their work.
 
 
 
 Thanks.
 
 --
 I have a truly marvellous patch that fixes the bug which this
 signature is too narrow to contain.





RE: [PATCH] KVM: emulate lapic tsc deadline timer for hvm

2011-08-29 Thread Tian, Kevin
 From: Marcelo Tosatti
 Sent: Tuesday, August 23, 2011 6:47 PM
 
 +  if (!apic->lapic_timer.tscdeadline)
 +  return;
 +
 +  tsc_target = kvm_x86_ops->
 +  guest_to_host_tsc(apic->lapic_timer.tscdeadline);
 +  rdtscll(tsc_now);
 +  tsc_delta = tsc_target - tsc_now;
 
  This only works if we have a constant tsc, that's not true for large

that type of machine exposes tricky issues to time virtualization in 
general.

  multiboard machines.  Need to do this with irqs disabled as well
  (reading both 'now' and 'tsc_now' in the same critical section).
 
 Should look like this:
 
 local_irq_disable();
 u64 guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu);
 if (guest_tsc >= tscdeadline)
 hrtimer_start(now);
 else {
   ns = convert_to_ns(tscdeadline - guest_tsc);
   hrtimer_start(now + ns);
 }
 local_irq_enable();

The above is overkill; only the guest TSC delta calculation needs to be
protected.

On the other hand, I don't think masking irqs here adds an obvious benefit.
Virtualization doesn't ensure micro-level time accuracy, since there are
many events delaying virtual timer interrupt injection even when the timer
emulation logic triggers it timely. That's the case on bare metal too,
though with a smaller scope, and thus most guests are able to handle such
situations. Masking irqs in non-critical regions simply slows down the
system's response to other events.

 
 Note the vcpu's tsc can have a different frequency than the host's, so
 vcpu_tsc_khz() should be used to convert to nanoseconds, not tsc_khz.
 

yes, that's a good suggestion.
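Marcelo's conversion rule can be sketched in isolation. The helper name and
the kHz parameter below are illustrative, not KVM's actual API; the only
point carried over from the thread is that the divisor must be the guest's
TSC frequency, not the host's tsc_khz:

```c
#include <stdint.h>

/*
 * Sketch: convert a guest TSC delta to nanoseconds using the *guest's*
 * TSC frequency in kHz. At f kHz, one tick lasts 10^6 / f nanoseconds,
 * so delta ticks last delta * 10^6 / f nanoseconds.
 */
static inline uint64_t guest_tsc_delta_to_ns(uint64_t delta_tsc,
                                             uint32_t vcpu_tsc_khz)
{
    return delta_tsc * 1000000ULL / vcpu_tsc_khz;
}
```

For example, a delta of 2,000,000 ticks on a 2 GHz (2,000,000 kHz) guest is
1 ms; dividing by the host's tsc_khz on a differently clocked host would
give the wrong deadline.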

btw, Jinsong is on vacation this week. He will update the patch according
to your comments when he is back.

Thanks
Kevin
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: APIC: avoid instruction emulation for EOI writes

2011-08-29 Thread Avi Kivity

On 08/29/2011 09:09 AM, Tian, Kevin wrote:


  static int handle_apic_access(struct kvm_vcpu *vcpu)
  {
+   unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+   int access_type, offset;
+
+   access_type = (exit_qualification >> 12) & 0xf;
+   offset = exit_qualification & 0xfff;
+   /*
+    * A sane guest uses MOV instead of string operations to
+    * write EOI, and the written value is ignored. So take a
+    * short-cut here and avoid heavy instruction
+    * emulation.
+    */
+   if ((access_type == 1) && (offset == APIC_EOI)) {


Please replace this 1 with a #define.


+   kvm_lapic_set_eoi(vcpu);
+   skip_emulated_instruction(vcpu);
+   return 1;
+   }
return emulate_instruction(vcpu, 0) == EMULATE_DONE;
  }




Looks good otherwise.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



RE: [PATCH] KVM: APIC: avoid instruction emulation for EOI writes

2011-08-29 Thread Tian, Kevin
 From: Avi Kivity [mailto:a...@redhat.com]
 Sent: Monday, August 29, 2011 3:24 PM
 
 On 08/29/2011 09:09 AM, Tian, Kevin wrote:
 
static int handle_apic_access(struct kvm_vcpu *vcpu)
{
  +   unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
  +   int access_type, offset;
  +
  +   access_type = (exit_qualification >> 12) & 0xf;
  +   offset = exit_qualification & 0xfff;
  +   /*
  +    * A sane guest uses MOV instead of string operations to
  +    * write EOI, and the written value is ignored. So take a
  +    * short-cut here and avoid heavy instruction
  +    * emulation.
  +    */
  +   if ((access_type == 1) && (offset == APIC_EOI)) {
 
 Please replace this 1 with a #define.
 

updated.

--

KVM: APIC: avoid instruction emulation for EOI writes

Instruction emulation for EOI writes can be skipped, since a sane
guest simply uses MOV instead of string operations. This is a nice
improvement when the guest doesn't support x2apic or hyper-V EOI
support.

A single-VM test shows ~8% bandwidth improvement (7.4Gbps -> 8Gbps),
by saving ~5% of cycles otherwise spent on EOI emulation.

Signed-off-by: Kevin Tian kevin.t...@intel.com
Based on earlier work from:
Signed-off-by: Eddie Dong eddie.d...@intel.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 57dcbd4..933187e 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -864,6 +864,15 @@ static int apic_mmio_write(struct kvm_io_device *this,
return 0;
 }
 
+void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu)
+{
+   struct kvm_lapic *apic = vcpu->arch.apic;
+
+   if (apic)
+   apic_set_eoi(vcpu->arch.apic);
+}
+EXPORT_SYMBOL_GPL(kvm_lapic_set_eoi);
+
 void kvm_free_lapic(struct kvm_vcpu *vcpu)
 {
if (!vcpu->arch.apic)
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 52c9e6b..8287243 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -26,6 +26,7 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu);
 void kvm_lapic_reset(struct kvm_vcpu *vcpu);
 u64 kvm_lapic_get_cr8(struct kvm_vcpu *vcpu);
 void kvm_lapic_set_tpr(struct kvm_vcpu *vcpu, unsigned long cr8);
+void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu);
 void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value);
 u64 kvm_lapic_get_base(struct kvm_vcpu *vcpu);
 void kvm_apic_set_version(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 5e8d411..34c094f 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4538,8 +4538,25 @@ static int handle_xsetbv(struct kvm_vcpu *vcpu)
return 1;
 }
 
+#define APIC_ACCESS_TYPE_W 1 /* Linear write access during inst execution */
 static int handle_apic_access(struct kvm_vcpu *vcpu)
 {
+   unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+   int access_type, offset;
+
+   access_type = (exit_qualification >> 12) & 0xf;
+   offset = exit_qualification & 0xfff;
+   /*
+    * A sane guest uses MOV instead of string operations to
+    * write EOI, and the written value is ignored. So take a
+    * short-cut here and avoid heavy instruction
+    * emulation.
+    */
+   if ((access_type == APIC_ACCESS_TYPE_W) && (offset == APIC_EOI)) {
+   kvm_lapic_set_eoi(vcpu);
+   skip_emulated_instruction(vcpu);
+   return 1;
+   }
return emulate_instruction(vcpu, 0) == EMULATE_DONE;
 }

Thanks
Kevin




RE: [PATCH] KVM: APIC: avoid instruction emulation for EOI writes

2011-08-29 Thread Sasha Levin
On Mon, 2011-08-29 at 15:35 +0800, Tian, Kevin wrote:
  From: Avi Kivity [mailto:a...@redhat.com]
  Sent: Monday, August 29, 2011 3:24 PM
  
  On 08/29/2011 09:09 AM, Tian, Kevin wrote:
  
 static int handle_apic_access(struct kvm_vcpu *vcpu)
 {
   + unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
   + int access_type, offset;
   +
   + access_type = (exit_qualification >> 12) & 0xf;
   + offset = exit_qualification & 0xfff;
   + /*
   +  * A sane guest uses MOV instead of string operations to
   +  * write EOI, and the written value is ignored. So take a
   +  * short-cut here and avoid heavy instruction
   +  * emulation.
   +  */
   + if ((access_type == 1) && (offset == APIC_EOI)) {
  
  Please replace this 1 with a #define.
  
 
 updated.
 
 --
 
 KVM: APIC: avoid instruction emulation for EOI writes
 
 Instruction emulation for EOI writes can be skipped, since a sane
 guest simply uses MOV instead of string operations. This is a nice
 improvement when the guest doesn't support x2apic or hyper-V EOI
 support.
 
 A single-VM test shows ~8% bandwidth improvement (7.4Gbps -> 8Gbps),
 by saving ~5% of cycles otherwise spent on EOI emulation.
 
 Signed-off-by: Kevin Tian kevin.t...@intel.com
 Based on earlier work from:
 Signed-off-by: Eddie Dong eddie.d...@intel.com
 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
 
 diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
 index 57dcbd4..933187e 100644
 --- a/arch/x86/kvm/lapic.c
 +++ b/arch/x86/kvm/lapic.c
 @@ -864,6 +864,15 @@ static int apic_mmio_write(struct kvm_io_device *this,
   return 0;
  }
  
 +void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu)
 +{
 + struct kvm_lapic *apic = vcpu->arch.apic;
 +
 + if (apic)
 + apic_set_eoi(vcpu->arch.apic);
 +}
 +EXPORT_SYMBOL_GPL(kvm_lapic_set_eoi);
 +
  void kvm_free_lapic(struct kvm_vcpu *vcpu)
  {
  if (!vcpu->arch.apic)
 diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
 index 52c9e6b..8287243 100644
 --- a/arch/x86/kvm/lapic.h
 +++ b/arch/x86/kvm/lapic.h
 @@ -26,6 +26,7 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu);
  void kvm_lapic_reset(struct kvm_vcpu *vcpu);
  u64 kvm_lapic_get_cr8(struct kvm_vcpu *vcpu);
  void kvm_lapic_set_tpr(struct kvm_vcpu *vcpu, unsigned long cr8);
 +void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu);
  void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value);
  u64 kvm_lapic_get_base(struct kvm_vcpu *vcpu);
  void kvm_apic_set_version(struct kvm_vcpu *vcpu);
 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
 index 5e8d411..34c094f 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -4538,8 +4538,25 @@ static int handle_xsetbv(struct kvm_vcpu *vcpu)
   return 1;
  }
  
 +#define APIC_ACCESS_TYPE_W 1 /* Linear write access during inst execution */
  static int handle_apic_access(struct kvm_vcpu *vcpu)
  {
 + unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
 + int access_type, offset;
 +
 + access_type = (exit_qualification >> 12) & 0xf;
 + offset = exit_qualification & 0xfff;

Maybe it's worth moving that #define above, and '#define'ing all the
magic that happens in these 2 lines in vmx.h along with all the other
exit qualification parameters?

 + /*
 +  * A sane guest uses MOV instead of string operations to
 +  * write EOI, and the written value is ignored. So take a
 +  * short-cut here and avoid heavy instruction
 +  * emulation.
 +  */
 + if ((access_type == APIC_ACCESS_TYPE_W) && (offset == APIC_EOI)) {
 + kvm_lapic_set_eoi(vcpu);
 + skip_emulated_instruction(vcpu);
 + return 1;
 + }
   return emulate_instruction(vcpu, 0) == EMULATE_DONE;
  }
 
 Thanks
 Kevin


-- 

Sasha.



[Bug 39412] Win Vista and Win2K8 guests' network breaks down

2011-08-29 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=39412


Jay Ren yongjie@intel.com changed:

   What|Removed |Added

 Status|NEEDINFO|CLOSED
 Resolution||CODE_FIX




--- Comment #8 from Jay Ren yongjie@intel.com  2011-08-29 08:48:53 ---
I tried this on Linus' linux 3.1.0-rc4 tree and found this bug got fixed.
The commit I tried is below.
commit c6a389f123b9f68d605bb7e0f9b32ec1e3e14132
Author: Linus Torvalds torva...@linux-foundation.org
Date:   Sun Aug 28 21:16:01 2011 -0700

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are on the CC list for the bug.
You are watching the assignee of the bug.


Re: [PATCH] KVM: APIC: avoid instruction emulation for EOI writes

2011-08-29 Thread Avi Kivity

On 08/29/2011 11:15 AM, Sasha Levin wrote:


  +#define APIC_ACCESS_TYPE_W 1 /* Linear write access during inst execution */
   static int handle_apic_access(struct kvm_vcpu *vcpu)
   {
  + unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
  + int access_type, offset;
  +
  + access_type = (exit_qualification >> 12) & 0xf;
  + offset = exit_qualification & 0xfff;

Maybe it's worth moving that #define above, and '#define'ing all the
magic that happens in these 2 lines in vmx.h along with all the other
exit qualification parameters?


Sure, that's what vmx.h is for.  And 0xfff is ~PAGE_MASK.
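Sasha's suggestion could look roughly like this; the macro names are made
up for the sketch, and only the bit positions (page offset in bits 11:0,
access type in bits 15:12 of the exit qualification) come from the code
quoted above:

```c
#include <stdint.h>

/* Hypothetical names for the APIC-access exit qualification fields;
 * a real patch would put these in vmx.h next to the other exit
 * qualification definitions. */
#define APIC_ACCESS_OFFSET_MASK       0xfffUL  /* i.e. ~PAGE_MASK */
#define APIC_ACCESS_TYPE_SHIFT        12
#define APIC_ACCESS_TYPE_MASK         0xfUL
#define APIC_ACCESS_TYPE_LINEAR_WRITE 1

static inline unsigned long apic_access_type(unsigned long exit_qualification)
{
    return (exit_qualification >> APIC_ACCESS_TYPE_SHIFT)
            & APIC_ACCESS_TYPE_MASK;
}

static inline unsigned long apic_access_offset(unsigned long exit_qualification)
{
    return exit_qualification & APIC_ACCESS_OFFSET_MASK;
}
```

With the EOI register at offset 0xb0, a linear-write exit would then carry a
qualification of 0x10b0 and decode to type 1, offset 0xb0.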

--
error compiling committee.c: too many arguments to function



[Bug 39412] Win Vista and Win2K8 guests' network breaks down

2011-08-29 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=39412


Florian Mickler flor...@mickler.org changed:

   What|Removed |Added

 Resolution|CODE_FIX|UNREPRODUCIBLE




--- Comment #9 from Florian Mickler flor...@mickler.org  2011-08-29 08:53:53 
---
Thank you. 
[Because we don't exactly know what fixed it, our rule is to close this as
unreproducible.]



Re: [PATCH 3/5] Migration thread mutex

2011-08-29 Thread Stefan Hajnoczi
On Sat, Aug 27, 2011 at 7:09 PM, Umesh Deshpande udesh...@redhat.com wrote:
 This patch implements the migrate_ram mutex, which protects the RAMBlock
 list traversal in the migration thread, during the transfer of RAM, from
 concurrent addition/removal of blocks in the iothread.

 Note: The combination of the iothread mutex and the migration thread mutex
 works as a rw-lock. Both mutexes are acquired while modifying the ram_list
 members or the RAM block list.

 Signed-off-by: Umesh Deshpande udesh...@redhat.com
 ---
  arch_init.c   |   21 +
  cpu-all.h     |    3 +++
  exec.c        |   23 +++
  qemu-common.h |    2 ++
  4 files changed, 49 insertions(+), 0 deletions(-)

 diff --git a/arch_init.c b/arch_init.c
 index 484b39d..9d02270 100644
 --- a/arch_init.c
 +++ b/arch_init.c
 @@ -109,6 +109,7 @@ static int is_dup_page(uint8_t *page, uint8_t ch)

  static RAMBlock *last_block;
  static ram_addr_t last_offset;
 +static uint64_t last_version;
[...]
  typedef struct RAMList {
 +    QemuMutex mutex;    /* Protects RAM block list */
     uint8_t *phys_dirty;
 +    uint32_t version;   /* To detect ram block addition/removal */

Is there a reason why RAMList.version is uint32_t but last_version is uint64_t?

Stefan
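Stefan's width question matters because the cached copy is compared against
the list's counter on every lookup, so the two sides should wrap identically.
A simplified sketch of the version-counter pattern the patch uses (the names
and types here are illustrative, not the patch's actual code):

```c
#include <stddef.h>
#include <stdint.h>

/* Writers bump list->version on RAM block addition/removal; the
 * migration loop drops its cached block pointer when the version no
 * longer matches. Keeping both counters the same width means the
 * comparison behaves identically across wrap-around. */
struct ram_list_state {
    uint32_t version;       /* bumped on block addition/removal */
};

struct migration_cache {
    uint32_t last_version;  /* same width as the list counter */
    void *last_block;       /* cached block pointer */
};

static void *cached_block(struct migration_cache *c,
                          const struct ram_list_state *l)
{
    if (c->last_version != l->version) {
        c->last_block = NULL;          /* stale: restart the lookup */
        c->last_version = l->version;
    }
    return c->last_block;
}
```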


Re: [PATCH 5/5] Separate migration thread

2011-08-29 Thread Stefan Hajnoczi
On Sat, Aug 27, 2011 at 7:09 PM, Umesh Deshpande udesh...@redhat.com wrote:
 This patch creates a separate thread for the guest migration on the source 
 side.
 All exits (on completion/error) from the migration thread are handled by a
 bottom handler, which is called from the iothread.

 Signed-off-by: Umesh Deshpande udesh...@redhat.com
 ---
  buffered_file.c     |   76 
  migration.c         |  105 ++
  migration.h         |    8 
  qemu-thread-posix.c |   10 +
  qemu-thread.h       |    1 +

Will this patch break Windows builds by adding a function to
qemu-thread-posix.c which is not implemented in qemu-thread-win32.c?

Stefan


Re: [PATCH 0/5] Separate thread for VM migration

2011-08-29 Thread Paolo Bonzini

On 08/27/2011 08:09 PM, Umesh Deshpande wrote:

The following patch series deals with VCPU and iothread starvation during
the migration of a guest. Currently the iothread is responsible for
performing the guest migration. It holds qemu_mutex during the migration,
which prevents VCPUs from entering qemu mode and delays their return to the
guest. The guest migration, executed as an iohandler, also delays the
execution of other iohandlers. In the following patch series,

the migration has been moved to a separate thread to
reduce the qemu_mutex contention and iohandler starvation.

Umesh Deshpande (5):
   vm_stop from non-io threads
   MRU ram block list
   migration thread mutex
   separate migration bitmap
   separate migration thread

  arch_init.c |   38 +
  buffered_file.c |   76 ++---
  cpu-all.h   |   42 ++
  cpus.c  |4 +-
  exec.c  |   97 --
  migration.c |  117 ---
  migration.h |9 
  qemu-common.h   |2 +
  qemu-thread-posix.c |   10 
  qemu-thread.h   |1 +
  10 files changed, 302 insertions(+), 94 deletions(-)
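The scheme described in the cover letter, a worker thread that holds the big
lock only around shared-state updates instead of for the whole transfer, can
be sketched with plain pthreads. Everything below is illustrative; none of
it is QEMU code:

```c
#include <pthread.h>

/* The transfer loop runs in its own thread; the big lock (qemu_mutex
 * in QEMU's case) is taken only around the shared bookkeeping, so the
 * main loop and VCPUs are not starved for the duration of the copy. */
static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;
static long pages_sent;

static void *migration_thread(void *arg)
{
    long total = *(long *)arg;

    for (long i = 0; i < total; i++) {
        /* the page copy itself would happen here, outside the lock */
        pthread_mutex_lock(&big_lock);
        pages_sent++;               /* only bookkeeping is locked */
        pthread_mutex_unlock(&big_lock);
    }
    return NULL;
}

long run_migration(long total)
{
    pthread_t tid;

    pages_sent = 0;
    pthread_create(&tid, NULL, migration_thread, &total);
    pthread_join(tid, NULL);        /* QEMU uses a bottom half instead */
    return pages_sent;
}
```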



I think this patchset is quite good.  These are the problems I found:

1) the locking rules in patch 3 are a bit too clever, and the cleverness 
will become obsolete once RCU is in place.  The advantage of the clever 
stuff is that rwlock code looks more like RCU code; the disadvantage is 
that it is harder to read and easier to bikeshed about.


2) it breaks Windows build, but that's easy to fix.

3) there are a _lot_ of cleanups possible on top of patch 5 (I would not 
be too surprised if the final version has an almost-neutral diffstat), 
and whether to prefer good or perfect is another very popular topic.


4) I'm not sure block migration has been tested in all scenarios (I'm 
curious about coroutine-heavy ones).  This may mean that the migration 
thread is blocked onto Marcelo's live streaming work, which would 
provide the ingredients to remove block migration altogether.  A round 
of Autotest testing is the minimum required to include this, and I'm not 
sure if this was done either.



That said, I find the code to be quite good overall, and I wouldn't 
oppose inclusion with only (2) fixed---may even take care of it 
myself---and more testing results apart from the impressive performance 
numbers.


About performance, I'm curious how you measured it.  Was the buffer 
cache empty?  That is, how many compressible pages were found?  I toyed 
with vectorizing is_dup_page, but I need to get some numbers before posting.


Paolo


qemu-kvm 0.15.0 boot order not working

2011-08-29 Thread Peter Lieven

Hi,

when I specify something like

qemu -boot order=dc -cdrom image.iso -drive file=img.raw,if=virtio,boot=yes

or

qemu -boot order=n -cdrom image.iso -drive file=img.raw,if=virtio,boot=yes

with qemu-kvm 0.15.0

it will always boot directly from the hard drive and not from cdrom or
network.

Is this on purpose? The behaviour was different in earlier versions.
If I omit the boot=yes in the drive specification, the -boot order
parameter works as expected.

Peter





Re: [PATCH] KVM: APIC: avoid instruction emulation for EOI writes

2011-08-29 Thread Jan Kiszka
On 2011-08-29 08:09, Tian, Kevin wrote:
 Hi, Avi,
 
 Here comes the patch:
 
 KVM: APIC: avoid instruction emulation for EOI writes
 
 Instruction emulation for EOI writes can be skipped, since a sane
 guest simply uses MOV instead of string operations. This is a nice
 improvement when the guest doesn't support x2apic or hyper-V EOI
 support.
 
 A single-VM test shows ~8% bandwidth improvement (7.4Gbps -> 8Gbps),
 by saving ~5% of cycles otherwise spent on EOI emulation.
 
 Signed-off-by: Kevin Tian kevin.t...@intel.com
 Based on earlier work from:
 Signed-off-by: Eddie Dong eddie.d...@intel.com
 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
 
 diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
 index 57dcbd4..933187e 100644
 --- a/arch/x86/kvm/lapic.c
 +++ b/arch/x86/kvm/lapic.c
 @@ -864,6 +864,15 @@ static int apic_mmio_write(struct kvm_io_device *this,
   return 0;
  }
  
 +void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu)
 +{
 + struct kvm_lapic *apic = vcpu->arch.apic;
 +
 + if (apic)
 + apic_set_eoi(vcpu->arch.apic);
 +}
 +EXPORT_SYMBOL_GPL(kvm_lapic_set_eoi);
 +
  void kvm_free_lapic(struct kvm_vcpu *vcpu)
  {
  if (!vcpu->arch.apic)
 diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
 index 52c9e6b..8287243 100644
 --- a/arch/x86/kvm/lapic.h
 +++ b/arch/x86/kvm/lapic.h
 @@ -26,6 +26,7 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu);
  void kvm_lapic_reset(struct kvm_vcpu *vcpu);
  u64 kvm_lapic_get_cr8(struct kvm_vcpu *vcpu);
  void kvm_lapic_set_tpr(struct kvm_vcpu *vcpu, unsigned long cr8);
 +void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu);
  void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value);
  u64 kvm_lapic_get_base(struct kvm_vcpu *vcpu);
  void kvm_apic_set_version(struct kvm_vcpu *vcpu);
 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
 index 5e8d411..35e4af7 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -4540,6 +4540,22 @@ static int handle_xsetbv(struct kvm_vcpu *vcpu)
  
  static int handle_apic_access(struct kvm_vcpu *vcpu)
  {
 + unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
 + int access_type, offset;
 +
 + access_type = (exit_qualification >> 12) & 0xf;
 + offset = exit_qualification & 0xfff;
 + /*
 +  * A sane guest uses MOV instead of string operations to
 +  * write EOI, and the written value is ignored. So take a
 +  * short-cut here and avoid heavy instruction
 +  * emulation.
 +  */

Is there no cheap way to validate this assumption and fall back to the
slow path in case it doesn't apply? E.g. reading the first instruction
byte and matching it against a whitelist? Even if the ignored scenarios
are highly unlikely, I think we have so far tried hard to provide both
fast and accurate results to the guest in all cases.

 + if ((access_type == 1) && (offset == APIC_EOI)) {
 + kvm_lapic_set_eoi(vcpu);
 + skip_emulated_instruction(vcpu);
 + return 1;
 + }
   return emulate_instruction(vcpu, 0) == EMULATE_DONE;
  }
 
 Thanks
 Kevin

Nice optimization otherwise!

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


Re: HPET configuration in Seabios

2011-08-29 Thread Jan Kiszka
On 2011-08-29 07:32, Avi Kivity wrote:
 On 08/29/2011 01:14 AM, Kevin O'Connor wrote:
 On Sun, Aug 28, 2011 at 10:42:49PM +0200, Jan Kiszka wrote:
   On 2011-08-28 20:54, Alexander Graf wrote:
   
 On 28.08.2011, at 02:42, Avi Kivity wrote:
   
 On 08/26/2011 08:32 AM, ya su wrote:
 hi,Avi:
   
  I met the same problem: tons of hpet vm_exits (vector 209, fault
  address in the guest vm's hpet mmio range). Even when I disable the
  hpet device in the win7 guest vm, it still produces a large amount
  of vm_exits under trace-cmd; and when I add -no-hpet to start the
  vm, it still has a HPET device inside the VM.

  Does that mean the HPET device in the VM does not depend on the
  emulated hpet device in qemu-kvm? Is there any way to disable the
  VM HPET device to prevent so many vm_exits?  Thanks.
   
   
 Looks like a bug to me.
   
 IIRC disabling the HPET device doesn't remove the entry from the
 DSDT, no? So the guest OS might still think it's there while nothing
 responds (read returns -1).
 
   Exactly. We have a fw_cfg interface in place for quite a while now
   (though I wonder how the firmware is supposed to tell -no-hpet apart
   from QEMU versions that don't provide this data - both return count =
   255), but SeaBios still exposes one HPET block at a hard-coded address
   unconditionally.
 
   There was quite some discussion about the corresponding Seabios
 patches
   back then but apparently no consensus was found. Re-reading it, I
 think
   Kevin asked for passing the necessary DSDT fragments from QEMU to the
   firmware instead of using a new, proprietary fw_cfg format. Is that
   still the key requirement for any patch finally fixing this bug?

 My preference would be to use the existing ACPI table passing
 interface (fw_cfg slot 0x8000) to pass different ACPI tables to
 SeaBIOS.

 SeaBIOS doesn't currently allow that interface to override tables
 SeaBIOS builds itself, but it's a simple change to rectify that.

 When this was last proposed, it was raised that the header information
 in the ACPI table may then not match the tables that SeaBIOS builds.
 I think I proposed at that time that SeaBIOS could use the header of
 the first fw_cfg table (or some other fw_cfg interface) to populate
 the headers of its table headers.  However, there was no consensus.

 Note - the above is in regard to the HPET table.  If the HPET entry in
 the DSDT needs to be removed then that's a bigger change.

 
 Can't seabios just poke at the hpet itself and see if it exists or not?
 

It would be hard for the BIOS to guess the locations of the blocks unless
we define the addresses used by QEMU as something like base + hpet_no *
block_size in all cases.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


Re: [AUTOTEST][PATCH][KVM] Add test for problem with killing guest when network is under load.

2011-08-29 Thread Lukáš Doktor

Thanks, this patch works well.

Acked-by: Lukas Doktor ldok...@redhat.com

On 26.8.2011 10:31, Jiří Župka wrote:

This patch contains two tests.
1) Try to kill the guest while the guest network is under load.
2) Try to kill the guest after repeatedly adding and removing network drivers.

Signed-off-by: Jiří Župkajzu...@redhat.com
---
  client/tests/kvm/tests_base.cfg.sample|   18 
  client/virt/tests/netstress_kill_guest.py |  147 +
  2 files changed, 165 insertions(+), 0 deletions(-)
  create mode 100644 client/virt/tests/netstress_kill_guest.py

diff --git a/client/tests/kvm/tests_base.cfg.sample 
b/client/tests/kvm/tests_base.cfg.sample
index ec1b48d..a7ff29f 100644
--- a/client/tests/kvm/tests_base.cfg.sample
+++ b/client/tests/kvm/tests_base.cfg.sample
@@ -845,6 +845,24 @@ variants:
  restart_vm = yes
  kill_vm_on_error = yes

+- netstress_kill_guest: install setup unattended_install.cdrom
+only Linux
+type = netstress_kill_guest
+image_snapshot = yes
+nic_mode = tap
+# There should be enough vms for build topology.
+variants:
+-driver:
+mode = driver
+-load:
+mode = load
+netperf_files = netperf-2.4.5.tar.bz2 wait_before_data.patch
+packet_size = 1500
setup_cmd = cd %s && tar xvfj netperf-2.4.5.tar.bz2 && cd netperf-2.4.5 && patch 
-p0 < ../wait_before_data.patch && ./configure && make
+clean_cmd =  while killall -9 netserver; do True test; done;
+netserver_cmd =  %s/netperf-2.4.5/src/netserver
+netperf_cmd = %s/netperf-2.4.5/src/netperf -t %s -H %s -l 60 
-- -m %s
+
  - set_link: install setup image_copy unattended_install.cdrom
  type = set_link
  test_timeout = 1000
diff --git a/client/virt/tests/netstress_kill_guest.py 
b/client/virt/tests/netstress_kill_guest.py
new file mode 100644
index 000..7452e09
--- /dev/null
+++ b/client/virt/tests/netstress_kill_guest.py
@@ -0,0 +1,147 @@
+import logging, os, signal, re, time
+from autotest_lib.client.common_lib import error
+from autotest_lib.client.bin import utils
+from autotest_lib.client.virt import aexpect, virt_utils
+
+
+def run_netstress_kill_guest(test, params, env):
+    """
+    Try to stop the network interface in a VM while another VM tries to
+    communicate with it.
+
+    @param test: kvm test object
+    @param params: Dictionary with the test parameters
+    @param env: Dictionary with test environment.
+    """
+    def get_corespond_ip(ip):
+        """
+        Get the local IP address used to contact the given remote IP.
+
+        @param ip: Remote IP
+        @return: Corresponding local IP.
+        """
+        result = utils.run("ip route get %s" % (ip)).stdout
+        ip = re.search("src (.+)", result)
+        if ip is not None:
+            ip = ip.groups()[0]
+        return ip
+
+
+    def get_ethernet_driver(session):
+        """
+        Get the drivers of the network cards.
+
+        @param session: session to machine
+        """
+        modules = []
+        out = session.cmd("ls -l /sys/class/net/*/device/driver/module")
+        for module in out.split("\n"):
+            modules.append(module.split("/")[-1])
+        modules.remove("")
+        return set(modules)
+
+
+    def kill_and_check(vm):
+        vm_pid = vm.get_pid()
+        vm.destroy(gracefully=False)
+        time.sleep(2)
+        try:
+            os.kill(vm_pid, 0)
+            logging.error("VM is not dead.")
+            raise error.TestFail("Problem with killing guest.")
+        except OSError:
+            logging.info("VM is dead.")
+
+
+    def netload_kill_problem(session_serial):
+        netperf_dir = os.path.join(os.environ['AUTODIR'], "tests/netperf2")
+        setup_cmd = params.get("setup_cmd")
+        clean_cmd = params.get("clean_cmd")
+        firewall_flush = "iptables -F"
+
+        try:
+            utils.run(firewall_flush)
+        except:
+            logging.warning("Could not flush firewall rules on host")
+
+        try:
+            session_serial.cmd(firewall_flush)
+        except aexpect.ShellError:
+            logging.warning("Could not flush firewall rules on guest")
+
+        for i in params.get("netperf_files").split():
+            vm.copy_files_to(os.path.join(netperf_dir, i), "/tmp")
+
+        guest_ip = vm.get_address(0)
+        server_ip = get_corespond_ip(guest_ip)
+
+        logging.info("Setup and run netperf on host and guest")
+        session_serial.cmd(setup_cmd % "/tmp", timeout=200)
+        utils.run(setup_cmd % netperf_dir)
+
+        try:
+            session_serial.cmd(clean_cmd)
+        except:
+            pass
+        session_serial.cmd(params.get("netserver_cmd") % "/tmp")
+
+        utils.run(clean_cmd, ignore_status=True)
+        utils.run(params.get("netserver_cmd") % netperf_dir)
+
+        server_netperf_cmd = params.get("netperf_cmd") % (netperf_dir,
+                                                          "TCP_STREAM",
+                                                          guest_ip,
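
The kill_and_check() helper in the patch above probes the QEMU process with signal 0 to confirm the kill took effect. A standalone sketch of that liveness check, independent of the autotest framework (names here are illustrative):

```python
import os
import subprocess

def is_alive(pid):
    # Signal 0 delivers nothing but still performs the error checks,
    # so an OSError means the PID no longer exists (or isn't ours).
    try:
        os.kill(pid, 0)
        return True
    except OSError:
        return False

child = subprocess.Popen(["sleep", "0.2"])
print(is_alive(child.pid))   # True while the child runs
child.wait()                 # reap it; the PID is now gone
print(is_alive(child.pid))
```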

Re: [PATCH] KVM: APIC: avoid instruction emulation for EOI writes

2011-08-29 Thread Avi Kivity

On 08/29/2011 01:24 PM, Jan Kiszka wrote:


   static int handle_apic_access(struct kvm_vcpu *vcpu)
   {
  + unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
  + int access_type, offset;
  +
  + access_type = (exit_qualification >> 12) & 0xf;
  + offset = exit_qualification & 0xfff;
  + /*
  +  * Sane guest uses MOV instead of string operations to
  +  * write EOI, with written value not cared. So make a
  +  * short-circuit here by avoiding heavy instruction
  +  * emulation.
  +  */

Is there no cheap way to validate this assumption and fall back to the
slow path in case it doesn't apply? E.g. reading the first instruction
byte and matching it against a whitelist? Even if the ignored scenarios
are highly unlikely, I think we so far tried hard to provide both fast
and accurate results to the guest in all cases.



Just reading the first byte requires a guest page table walk.  This is 
probably the highest cost in emulation (which also requires a walk for 
the data access).


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
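
As a side note, the bit layout the patch decodes from the APIC-access exit qualification (page offset in bits 11:0, access type in bits 15:12) can be checked with a small sketch. The constants for the EOI register offset (0xB0) and the linear-write access type (1) follow the Intel SDM and are used here purely for illustration:

```python
APIC_EOI_OFFSET = 0xB0     # EOI register offset within the APIC page
APIC_ACCESS_WRITE = 1      # linear access for a data write (per SDM)

def decode_apic_access(exit_qualification):
    access_type = (exit_qualification >> 12) & 0xf
    offset = exit_qualification & 0xfff
    return access_type, offset

def is_eoi_write(exit_qualification):
    # The condition the patch short-circuits on.
    access_type, offset = decode_apic_access(exit_qualification)
    return access_type == APIC_ACCESS_WRITE and offset == APIC_EOI_OFFSET
```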


Re: HPET configuration in Seabios

2011-08-29 Thread Avi Kivity

On 08/29/2011 01:25 PM, Jan Kiszka wrote:


  Can't seabios just poke at the hpet itself and see if it exists or not?


Would be hard for the BIOS to guess the locations of the blocks unless
we define the addresses used by QEMU as something like base + hpet_no *
block_size in all cases.



Currently we have a fixed address.  We could do:

 if available in fw_cfg:
 use that (may indicate no hpet)
 elif fixed address works:
 use that
 else
 no hpet
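
The probe order above can be expressed as a small decision function. The fw_cfg encoding (entry absent vs. an explicit "no HPET") and the fixed base address are assumptions made for this sketch, not QEMU's actual interface:

```python
HPET_FIXED_BASE = 0xFED00000  # conventional HPET base; illustrative here

def choose_hpet_base(fw_cfg_base, fixed_addr_works):
    """Pick an HPET base following the probe order sketched above.

    fw_cfg_base: None if fw_cfg has no HPET entry, 0 for an explicit
    "no HPET", otherwise the base address.  fixed_addr_works: callable
    probing whether the fixed address responds.
    """
    if fw_cfg_base is not None:
        return fw_cfg_base or None       # 0 encodes "no hpet"
    if fixed_addr_works(HPET_FIXED_BASE):
        return HPET_FIXED_BASE
    return None                          # no hpet
```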


--
error compiling committee.c: too many arguments to function



Re: [PATCH] KVM: APIC: avoid instruction emulation for EOI writes

2011-08-29 Thread Jan Kiszka
On 2011-08-29 12:59, Avi Kivity wrote:
 On 08/29/2011 01:24 PM, Jan Kiszka wrote:

   static int handle_apic_access(struct kvm_vcpu *vcpu)
   {
  +  unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
  +  int access_type, offset;
  +
  +  access_type = (exit_qualification >> 12) & 0xf;
  +  offset = exit_qualification & 0xfff;
  +  /*
  +   * Sane guest uses MOV instead of string operations to
  +   * write EOI, with written value not cared. So make a
  +   * short-circuit here by avoiding heavy instruction
  +   * emulation.
  +   */

 Is there no cheap way to validate this assumption and fall back to the
 slow path in case it doesn't apply? E.g. reading the first instruction
 byte and matching it against a whitelist? Even if the ignored scenarios
 are highly unlikely, I think we so far tried hard to provide both fast
 and accurate results to the guest in all cases.

 
 Just reading the first byte requires a guest page table walk.  This is 
 probably the highest cost in emulation (which also requires a walk for 
 the data access).

And what about caching the result of the first walk? Usually, a sane
guest won't have many code pages that issue the EOI.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


Re: HPET configuration in Seabios

2011-08-29 Thread Jan Kiszka
On 2011-08-29 13:00, Avi Kivity wrote:
 On 08/29/2011 01:25 PM, Jan Kiszka wrote:

  Can't seabios just poke at the hpet itself and see if it exists or not?


 Would be hard for the BIOS to guess the locations of the blocks unless
 we define the addresses used by QEMU as something like base + hpet_no *
 block_size in all cases.

 
 Currently we have a fixed address.  We could do:
 
   if available in fw_cfg:
   use that (may indicate no hpet)
   elif fixed address works:
   use that
   else
   no hpet

Currently, we also only have a single HPET block, but that's just
because of some QEMU limitations that will vanish sooner or later. Then
nothing will prevent multiple -device hpet,base=XXX.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


Re: [PATCH] KVM: APIC: avoid instruction emulation for EOI writes

2011-08-29 Thread Avi Kivity

On 08/29/2011 02:03 PM, Jan Kiszka wrote:


  Just reading the first byte requires a guest page table walk.  This is
  probably the highest cost in emulation (which also requires a walk for
  the data access).

And what about caching the result of the first walk? Usually, a sane
guest won't have many code pages that issue the EOI.



There's no way to know when to invalidate the cache.

We could go a bit further, and cache the whole thing.  On the first 
exit, do the entire emulation, and remember %rip.  On the second exit, 
if %rip matches, skip directly to kvm_lapic_eoi().


But I don't think it's worth it.  This also has failure modes, and 
really, no guest will ever write to EOI with stosl.


--
error compiling committee.c: too many arguments to function



Re: HPET configuration in Seabios

2011-08-29 Thread Avi Kivity

On 08/29/2011 02:05 PM, Jan Kiszka wrote:

On 2011-08-29 13:00, Avi Kivity wrote:
  On 08/29/2011 01:25 PM, Jan Kiszka wrote:

   Can't seabios just poke at the hpet itself and see if it exists or not?


  Would be hard for the BIOS to guess the locations of the blocks unless
  we define the addresses used by QEMU as something like base + hpet_no *
  block_size in all cases.


  Currently we have a fixed address.  We could do:

if available in fw_cfg:
use that (may indicate no hpet)
elif fixed address works:
use that
else
no hpet

Currently, we also only have a single HPET block, but that's just
because of some QEMU limitations that will vanish sooner or later. Then
nothing will prevent multiple -device hpet,base=XXX.



Yes, so we should enable the fw_cfg interface before that happens.


--
error compiling committee.c: too many arguments to function



Re: HPET configuration in Seabios

2011-08-29 Thread Jan Kiszka
On 2011-08-29 13:05, Jan Kiszka wrote:
 On 2011-08-29 13:00, Avi Kivity wrote:
 On 08/29/2011 01:25 PM, Jan Kiszka wrote:

  Can't seabios just poke at the hpet itself and see if it exists or not?


 Would be hard for the BIOS to guess the locations of the blocks unless
 we define the addresses used by QEMU as something like base + hpet_no *
 block_size in all cases.


 Currently we have a fixed address.  We could do:

   if available in fw_cfg:
   use that (may indicate no hpet)
   elif fixed address works:
   use that
   else
   no hpet
 
 Currently, we also only have a single HPET block, but that's just
 because of some QEMU limitations that will vanish sooner or later. Then
 nothing will prevent multiple -device hpet,base=XXX.

That said, some HPET probing (without any fw_cfg) may be a short-term
workaround to fix Seabios until we define the solution for
communicating HPET block configurations.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


Re: [PATCH 3/3] KVM: x86 emulator: fuzz tester

2011-08-29 Thread Avi Kivity

On 08/26/2011 01:17 AM, Lucas Meneghel Rodrigues wrote:


I still haven't gone through all the code, but it's a good idea to put 
a MODULE_LICENSE("GPL") macro around here, so the build system doesn't 
complain about it:


WARNING: modpost: missing MODULE_LICENSE() in 
arch/x86/kvm/test-emulator.o

see include/linux/module.h for more information



Thanks; fixed.

--
error compiling committee.c: too many arguments to function



Re: [PATCH 0/3] Emulator fuzz tester

2011-08-29 Thread Avi Kivity

On 08/26/2011 03:11 AM, Lucas Meneghel Rodrigues wrote:

Lucas, how would we go about integrating this into kvm-autotest?



I have applied the 3 patches on your latest tree, compiled the kernel 
but I'm having trouble in running the test the way you described.


One thing I've noticed here: I can only compile the test as a kernel 
module, not in the kernel image (menuconfig only gives me
N/m/?). So I believe there's no way to test it the way you have 
described... In any case I did try what you have suggested, then the 
kernel panics due to the lack of a filesystem/init. After some 
reading, I learned to create a bogus fs with a bogus init in it, but 
still, the test does not run (I guess it's because the test module is 
not compiled into the bzImage).


I assume there are some details you forgot to mention to get this 
done... Would you mind posting a more detailed procedure?


The module depends on KVM, so if that is a module, then you can only 
build the test as a module.  If you set CONFIG_KVM=y then you'll be able 
to build the fuzzer in.


The simplest kernel you can build is probably

  make defconfig
  set CONFIG_KVM=y
  set CONFIG_KVM_EMULATOR_TEST=y

Note there is no need to build kvm-intel or kvm-amd for that.

--
error compiling committee.c: too many arguments to function



[PATCH] kvm tools: Add ivshmem device

2011-08-29 Thread Sasha Levin
The patch adds an ivshmem device which can be used to share memory between
guests on the same host.

This implementation is lacking inter-guest communication which should be
implemented later once information regarding the client-server protocol is
gathered, though infrastructure used to add and remove clients already exists
in the patch (but isn't used anywhere).

Patch is based on David Evensky's shmem device.

Signed-off-by: David Evensky even...@sandia.gov
Signed-off-by: Sasha Levin levinsasha...@gmail.com
---
 tools/kvm/Makefile |1 +
 tools/kvm/builtin-run.c|  134 
 tools/kvm/hw/pci-shmem.c   |  266 
 tools/kvm/include/kvm/pci-shmem.h  |   28 
 tools/kvm/include/kvm/pci.h|3 +-
 tools/kvm/include/kvm/virtio-pci-dev.h |3 +
 tools/kvm/pci.c|5 +-
 tools/kvm/virtio/pci.c |4 +-
 8 files changed, 438 insertions(+), 6 deletions(-)
 create mode 100644 tools/kvm/hw/pci-shmem.c
 create mode 100644 tools/kvm/include/kvm/pci-shmem.h

diff --git a/tools/kvm/Makefile b/tools/kvm/Makefile
index 25cbd7e..efa032d 100644
--- a/tools/kvm/Makefile
+++ b/tools/kvm/Makefile
@@ -81,6 +81,7 @@ OBJS  += virtio/9p.o
 OBJS   += virtio/9p-pdu.o
 OBJS   += hw/vesa.o
 OBJS   += hw/i8042.o
+OBJS   += hw/pci-shmem.o
 
 FLAGS_BFD := $(CFLAGS) -lbfd
 has_bfd := $(call try-cc,$(SOURCE_BFD),$(FLAGS_BFD))
diff --git a/tools/kvm/builtin-run.c b/tools/kvm/builtin-run.c
index 38612b6..b9efde2 100644
--- a/tools/kvm/builtin-run.c
+++ b/tools/kvm/builtin-run.c
@@ -28,6 +28,7 @@
 #include kvm/sdl.h
 #include kvm/vnc.h
 #include kvm/guest_compat.h
+#include kvm/pci-shmem.h
 
 #include linux/types.h
 
@@ -52,6 +53,8 @@
 #define DEFAULT_SCRIPT none
 
 #define MB_SHIFT   (20)
+#define KB_SHIFT   (10)
+#define GB_SHIFT   (30)
 #define MIN_RAM_SIZE_MB(64ULL)
 #define MIN_RAM_SIZE_BYTE  (MIN_RAM_SIZE_MB  MB_SHIFT)
 
@@ -151,6 +154,131 @@ static int virtio_9p_rootdir_parser(const struct option 
*opt, const char *arg, i
return 0;
 }
 
+static int shmem_parser(const struct option *opt, const char *arg, int unset)
+{
+   const uint64_t default_size = SHMEM_DEFAULT_SIZE;
+   const uint64_t default_phys_addr = SHMEM_DEFAULT_ADDR;
+   const char *default_handle = SHMEM_DEFAULT_HANDLE;
+   struct shmem_info *si = malloc(sizeof(struct shmem_info));
+   enum { PCI, UNK } addr_type = PCI;
+   uint64_t phys_addr;
+   uint64_t size;
+   char *handle = NULL;
+   int create = 0;
+   const char *p = arg;
+   char *next;
+   int base = 10;
+   int verbose = 0;
+
+   const int skip_pci = strlen("pci:");
+   if (verbose)
+   pr_info("shmem_parser(%p,%s,%d)", opt, arg, unset);
+   /* parse out optional addr family */
+   if (strcasestr(p, "pci:")) {
+   p += skip_pci;
+   addr_type = PCI;
+   } else if (strcasestr(p, "mem:")) {
+   die("I can't add to E820 map yet.\n");
+   }
+   /* parse out physical addr */
+   base = 10;
+   if (strcasestr(p, "0x"))
+   base = 16;
+   phys_addr = strtoll(p, &next, base);
+   if (next == p && phys_addr == 0) {
+   pr_info("shmem: no physical addr specified, using default.");
+   phys_addr = default_phys_addr;
+   }
+   if (*next != ':' && *next != '\0')
+   die("shmem: unexpected chars after phys addr.\n");
+   if (*next == '\0')
+   p = next;
+   else
+   p = next + 1;
+   /* parse out size */
+   base = 10;
+   if (strcasestr(p, "0x"))
+   base = 16;
+   size = strtoll(p, &next, base);
+   if (next == p && size == 0) {
+   pr_info("shmem: no size specified, using default.");
+   size = default_size;
+   }
+   /* look for [KMGkmg][Bb]* - uses base 2. */
+   int skip_B = 0;
+   if (strspn(next, "KMGkmg")) {   /* might have a prefix */
+   if (*(next + 1) == 'B' || *(next + 1) == 'b')
+   skip_B = 1;
+   switch (*next) {
+   case 'K':
+   case 'k':
+   size = size << KB_SHIFT;
+   break;
+   case 'M':
+   case 'm':
+   size = size << MB_SHIFT;
+   break;
+   case 'G':
+   case 'g':
+   size = size << GB_SHIFT;
+   break;
+   default:
+   die("shmem: bug in detecting size prefix.");
+   break;
+   }
+   next += 1 + skip_B;
+   }
+   if (*next != ':' && *next != '\0') {
+   die("shmem: unexpected chars after phys size. %c%c\n",
+   *next, *p);
+   }
+   if (*next == '\0')
+   p 
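
The hand-rolled suffix handling in shmem_parser (strtoll with an optional 0x prefix, then a K/M/G power-of-two shift with an optional trailing B) can be mirrored in a few lines of Python as a sanity check. This is a sketch of the accepted grammar, not the kvm tools code itself:

```python
import re

KB_SHIFT, MB_SHIFT, GB_SHIFT = 10, 20, 30
_SHIFTS = {"k": KB_SHIFT, "m": MB_SHIFT, "g": GB_SHIFT}

def parse_shmem_size(text):
    # number (decimal or 0x hex) + optional K/M/G suffix + optional B
    m = re.fullmatch(r"(0[xX][0-9a-fA-F]+|\d+)([KkMmGg][Bb]?)?", text)
    if not m:
        raise ValueError("bad size: %r" % text)
    value = int(m.group(1), 0)          # base 0 honors the 0x prefix
    if m.group(2):
        value <<= _SHIFTS[m.group(2)[0].lower()]
    return value
```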

[PATCH] KVM: Document KVM_IRQFD

2011-08-29 Thread Sasha Levin
Avi Kivity a...@redhat.com
Marcelo Tosatti mtosa...@redhat.com
Signed-off-by: Sasha Levin levinsasha...@gmail.com
---
 Documentation/virtual/kvm/api.txt |   27 +++
 1 files changed, 27 insertions(+), 0 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index 2d510b6..d1150b6 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1450,6 +1450,33 @@ is supported; 2 if the processor requires all virtual 
machines to have
 an RMA, or 1 if the processor can use an RMA but doesn't require it,
 because it supports the Virtual RMA (VRMA) facility.
 
+4.64 KVM_IRQFD
+
+Capability: KVM_CAP_IRQFD
+Architectures: all
+Type: vm ioctl
+Parameters: struct kvm_irqfd (in)
+Returns: 0 on success, !0 on error
+
+This ioctl attaches or detaches an eventfd to a GSI within the guest.
+While the eventfd is assigned to the guest, any write to the eventfd
+would trigger the GSI within the guest.
+
+struct kvm_irqfd {
+   __u32 fd;
+   __u32 gsi;
+   __u32 flags;
+   __u8  pad[20];
+};
+
+The following flags are defined:
+
+#define KVM_IRQFD_FLAG_DEASSIGN (1 << 0)
+
+If deassign flag is set, the eventfd will be deassigned from the GSI and
+further writes to the eventfd won't trigger the GSI.
+
+
 5. The kvm_run structure
 
 Application code obtains a pointer to the kvm_run structure by
-- 
1.7.6.1
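
For what it's worth, the documented struct is 32 bytes: three u32 fields plus 20 pad bytes. Packing it from userspace (e.g. before issuing the ioctl from a non-C binding) can be sketched like this; the ioctl number itself is omitted since it is arch-generated:

```python
import struct

KVM_IRQFD_FLAG_DEASSIGN = 1 << 0

def pack_kvm_irqfd(fd, gsi, flags=0):
    # __u32 fd, __u32 gsi, __u32 flags, __u8 pad[20] -> 32 bytes
    return struct.pack("<III20x", fd, gsi, flags)
```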



[PATCH resend] KVM: Document KVM_IRQFD

2011-08-29 Thread Sasha Levin
Cc: Avi Kivity a...@redhat.com
Cc: Marcelo Tosatti mtosa...@redhat.com
Signed-off-by: Sasha Levin levinsasha...@gmail.com
---
 Documentation/virtual/kvm/api.txt |   27 +++
 1 files changed, 27 insertions(+), 0 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index 2d510b6..d1150b6 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1450,6 +1450,33 @@ is supported; 2 if the processor requires all virtual 
machines to have
 an RMA, or 1 if the processor can use an RMA but doesn't require it,
 because it supports the Virtual RMA (VRMA) facility.
 
+4.64 KVM_IRQFD
+
+Capability: KVM_CAP_IRQFD
+Architectures: all
+Type: vm ioctl
+Parameters: struct kvm_irqfd (in)
+Returns: 0 on success, !0 on error
+
+This ioctl attaches or detaches an eventfd to a GSI within the guest.
+While the eventfd is assigned to the guest, any write to the eventfd
+would trigger the GSI within the guest.
+
+struct kvm_irqfd {
+   __u32 fd;
+   __u32 gsi;
+   __u32 flags;
+   __u8  pad[20];
+};
+
+The following flags are defined:
+
+#define KVM_IRQFD_FLAG_DEASSIGN (1 << 0)
+
+If deassign flag is set, the eventfd will be deassigned from the GSI and
+further writes to the eventfd won't trigger the GSI.
+
+
 5. The kvm_run structure
 
 Application code obtains a pointer to the kvm_run structure by
-- 
1.7.6.1



Re: [PATCH resend] KVM: Document KVM_IRQFD

2011-08-29 Thread Avi Kivity

On 08/29/2011 03:34 PM, Sasha Levin wrote:

Cc: Avi Kivity a...@redhat.com
Cc: Marcelo Tosatti mtosa...@redhat.com
Signed-off-by: Sasha Levin levinsasha...@gmail.com


Thanks, applied.


+Returns: 0 on success, !0 on error
+


-1 (I fixed this).

--
error compiling committee.c: too many arguments to function



Re: [PATCH 3/3] KVM: x86 emulator: fuzz tester

2011-08-29 Thread Avi Kivity

On 08/25/2011 10:04 PM, Avi Kivity wrote:



Also fuzzing from an actual guest is useful to test the real backend
functions. What problem did you encounter? The new testsuite scheme
seems a good fit for that (with the exception of being locked to 32-bit
mode).


Mostly that I forgot it exists.  Other issues are that it's harder to 
force random values through it - though I could allocate a couple GB 
and fill it with random values.  We also lose the ability to test 
inputs to callbacks (not that I do much of that here).


Further issues would be:

- much slower - heavyweight exit on every insn, KVM_SET_SREGS, etc.
- need to set up GDT/LDT, I guess we can do this once and fill it with 
random entries

- much more care in setting up registers so we can get a context that runs
- need to figure out where %rip EA is so we can put insn there, hope it 
doesn't conflict with other code


So it will probably work, but the result will be of lower quality.

--
error compiling committee.c: too many arguments to function



Re: [PATCH resend] KVM: Document KVM_IRQFD

2011-08-29 Thread Sasha Levin
On Mon, 2011-08-29 at 15:37 +0300, Avi Kivity wrote:
 On 08/29/2011 03:34 PM, Sasha Levin wrote:
  Cc: Avi Kivity a...@redhat.com
  Cc: Marcelo Tosatti mtosa...@redhat.com
  Signed-off-by: Sasha Levin levinsasha...@gmail.com
 
 Thanks, applied.
 
  +Returns: 0 on success, !0 on error
  +
 
 -1 (I fixed this).
 

Why -1? It could be anything < 0, no?

-- 

Sasha.



Re: [PATCH resend] KVM: Document KVM_IRQFD

2011-08-29 Thread Avi Kivity

On 08/29/2011 03:43 PM, Sasha Levin wrote:

On Mon, 2011-08-29 at 15:37 +0300, Avi Kivity wrote:
  On 08/29/2011 03:34 PM, Sasha Levin wrote:
Cc: Avi Kivity a...@redhat.com
Cc: Marcelo Tosatti mtosa...@redhat.com
Signed-off-by: Sasha Levin levinsasha...@gmail.com

  Thanks, applied.

+Returns: 0 on success, !0 on error
+

  -1 (I fixed this).


Why -1? It could be anything < 0, no?



No.  ioctl()s (and most system calls) return -1 on error and set errno.

--
error compiling committee.c: too many arguments to function



Re: [PATCH 3/5] Migration thread mutex

2011-08-29 Thread Umesh Deshpande

On 08/29/2011 05:04 AM, Stefan Hajnoczi wrote:

On Sat, Aug 27, 2011 at 7:09 PM, Umesh Deshpande udesh...@redhat.com wrote:

This patch implements migrate_ram mutex, which protects the RAMBlock list
traversal in the migration thread during the transfer of a ram from their
addition/removal from the iothread.

Note: Combination of iothread mutex and migration thread mutex works as a
rw-lock. Both mutexes are acquired while modifying the ram_list members or RAM
block list.

Signed-off-by: Umesh Deshpande udesh...@redhat.com
---
  arch_init.c   |   21 +
  cpu-all.h |3 +++
  exec.c|   23 +++
  qemu-common.h |2 ++
  4 files changed, 49 insertions(+), 0 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index 484b39d..9d02270 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -109,6 +109,7 @@ static int is_dup_page(uint8_t *page, uint8_t ch)

  static RAMBlock *last_block;
  static ram_addr_t last_offset;
+static uint64_t last_version;

[...]

  typedef struct RAMList {
+QemuMutex mutex;/* Protects RAM block list */
 uint8_t *phys_dirty;
+uint32_t version;   /* To detect ram block addition/removal */

Is there a reason why RAMList.version is uint32_t but last_version is uint64_t?

No, my bad, they both should be consistent.
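
The two-mutex scheme the patch describes (writers take both locks, the migration thread takes only its own) behaves like a rw-lock and can be sketched as follows; the names are illustrative, not QEMU's:

```python
import threading

iothread_mutex = threading.Lock()     # held by the iothread
migrate_ram_mutex = threading.Lock()  # held by the migration thread

ram_blocks = []

def add_ram_block(name):
    # Writers take BOTH mutexes, excluding the migration thread's
    # traversal as well as other iothread-side modifications.
    with iothread_mutex:
        with migrate_ram_mutex:
            ram_blocks.append(name)

def migration_walk():
    # The migration thread needs only its own mutex for a stable view.
    with migrate_ram_mutex:
        return list(ram_blocks)
```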



Re: [PATCH 5/5] Separate migration thread

2011-08-29 Thread Umesh Deshpande

On 08/29/2011 05:09 AM, Stefan Hajnoczi wrote:

On Sat, Aug 27, 2011 at 7:09 PM, Umesh Deshpande udesh...@redhat.com wrote:

This patch creates a separate thread for the guest migration on the source side.
All exits (on completion/error) from the migration thread are handled by a
bottom handler, which is called from the iothread.

Signed-off-by: Umesh Deshpande udesh...@redhat.com
---
  buffered_file.c |   76 
  migration.c |  105 ++
  migration.h |8 
  qemu-thread-posix.c |   10 +
  qemu-thread.h   |1 +

Will this patch break Windows builds by adding a function to
qemu-thread-posix.c which is not implemented in qemu-thread-win32.c?

Yes, equivalent function needs to be added in qemu-thread.win32.c


Re: [PATCH] KVM: APIC: avoid instruction emulation for EOI writes

2011-08-29 Thread Jan Kiszka
On 2011-08-29 13:11, Avi Kivity wrote:
 On 08/29/2011 02:03 PM, Jan Kiszka wrote:

  Just reading the first byte requires a guest page table walk.  This is
  probably the highest cost in emulation (which also requires a walk for
  the data access).

 And what about caching the result of the first walk? Usually, a sane
 guest won't have many code pages that issue the EOI.

 
 There's no way to know when to invalidate the cache.

Set the affected code page read-only?

 
 We could go a bit further, and cache the whole thing.  On the first 
 exit, do the entire emulation, and remember %rip.  On the second exit, 
 if %rip matches, skip directly to kvm_lapic_eoi().
 
 But I don't think it's worth it.  This also has failure modes, and 
 really, no guest will ever write to EOI with stosl.

...or add/sub/and/or etc. Well, we've done other crazy things in the
past just to keep even the unlikely case correct. I was just wondering
if that policy changed.

However, I just realized that user space is able to avoid this
inaccuracy for potentially insane guests by not using in-kernel
irqchips. So we have at least a knob.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


[no subject]

2011-08-29 Thread David Ahern
subscribe


Re: [PATCH] KVM: APIC: avoid instruction emulation for EOI writes

2011-08-29 Thread Avi Kivity

On 08/29/2011 04:55 PM, Jan Kiszka wrote:

On 2011-08-29 13:11, Avi Kivity wrote:
  On 08/29/2011 02:03 PM, Jan Kiszka wrote:

   Just reading the first byte requires a guest page table walk.  This is
   probably the highest cost in emulation (which also requires a walk for
   the data access).

  And what about caching the result of the first walk? Usually, a sane
  guest won't have many code pages that issue the EOI.


  There's no way to know when to invalidate the cache.

Set the affected code page read-only?


The virt-phys mapping could change too.  And please, don't think of new 
reasons to write protect pages, they break up my lovely 2M maps.




  We could go a bit further, and cache the whole thing.  On the first
  exit, do the entire emulation, and remember %rip.  On the second exit,
  if %rip matches, skip directly to kvm_lapic_eoi().

  But I don't think it's worth it.  This also has failure modes, and
  really, no guest will ever write to EOI with stosl.

...or add/sub/and/or etc.


Argh, yes, flags can be updated.

Actually, this might work - if we get a read access first as part of the 
RMW, we'll emulate the instruction.  No idea what the hardware does in 
this case.



  Well, we've done other crazy things in the
past just to keep even the unlikely case correct. I was just wondering
if that policy changed.


I can't answer yes to that question.  But I see no way to make it work 
both fast and correct.




However, I just realized that user space is able to avoid this
inaccuracy for potentially insane guests by not using in-kernel
irqchips. So we have at least a knob.


Could/should have a flag to disable this in the kernel as well.

--
error compiling committee.c: too many arguments to function



Re: Broken pci_block_user_cfg_access interface

2011-08-29 Thread Michael S. Tsirkin
So how about something like the following?
Compile tested only, and I'm not sure I got the
ipr and iov error handling right.
Comments?


Subject: [PATCH] pci: fail block usercfg access on reset

Anyone who wants to block usercfg access will
also want to block reset from userspace.
On the other hand, reset from userspace
should block any other access from userspace.

Finally, make it possible to detect reset in
progress by returning an error status from
pci_block_user_cfg_access.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 drivers/pci/access.c  |   45 
 drivers/pci/iov.c |   19 
 drivers/pci/pci.c |4 +-
 drivers/scsi/ipr.c|   22 ++-
 drivers/uio/uio_pci_generic.c |7 +-
 include/linux/pci.h   |6 -
 6 files changed, 87 insertions(+), 16 deletions(-)

diff --git a/drivers/pci/access.c b/drivers/pci/access.c
index fdaa42a..2492365 100644
--- a/drivers/pci/access.c
+++ b/drivers/pci/access.c
@@ -139,7 +139,7 @@ static noinline void pci_wait_ucfg(struct pci_dev *dev)
raw_spin_unlock_irq(&pci_lock);
schedule();
raw_spin_lock_irq(&pci_lock);
-   } while (dev->block_ucfg_access);
+   } while (dev->block_ucfg_access || dev->reset_in_progress);
__remove_wait_queue(&pci_ucfg_wait, &wait);
 }
 
@@ -153,7 +153,8 @@ int pci_user_read_config_##size 
\
if (PCI_##size##_BAD)   \
return -EINVAL; \
raw_spin_lock_irq(&pci_lock);   \
-   if (unlikely(dev->block_ucfg_access)) pci_wait_ucfg(dev);   \
+   if (unlikely(dev->block_ucfg_access || dev->reset_in_progress)) \
+   pci_wait_ucfg(dev); \
ret = dev->bus->ops->read(dev->bus, dev->devfn, \
pos, sizeof(type), &data);  \
raw_spin_unlock_irq(&pci_lock); \
@@ -171,8 +172,9 @@ int pci_user_write_config_##size
\
int ret = -EIO; \
if (PCI_##size##_BAD)   \
return -EINVAL; \
-   raw_spin_lock_irq(&pci_lock);   \
-   if (unlikely(dev->block_ucfg_access)) pci_wait_ucfg(dev);   \
+   raw_spin_lock_irq(&pci_lock);   \
+   if (unlikely(dev->block_ucfg_access || dev->reset_in_progress)) \
+   pci_wait_ucfg(dev); \
ret = dev->bus->ops->write(dev->bus, dev->devfn,\
pos, sizeof(type), val);\
raw_spin_unlock_irq(&pci_lock); \
@@ -408,19 +410,24 @@ EXPORT_SYMBOL(pci_vpd_truncate);
  * sleep until access is unblocked again.  We don't allow nesting of
  * block/unblock calls.
  */
-void pci_block_user_cfg_access(struct pci_dev *dev)
+int pci_block_user_cfg_access(struct pci_dev *dev)
 {
unsigned long flags;
int was_blocked;
+   int in_progress;
 
raw_spin_lock_irqsave(&pci_lock, flags);
was_blocked = dev->block_ucfg_access;
dev->block_ucfg_access = 1;
+   in_progress = dev->reset_in_progress;
raw_spin_unlock_irqrestore(&pci_lock, flags);
 
+   if (in_progress)
+   return -EIO;
/* If we BUG() inside the pci_lock, we're guaranteed to hose
 * the machine */
BUG_ON(was_blocked);
+   return 0;
 }
 EXPORT_SYMBOL_GPL(pci_block_user_cfg_access);
 
@@ -445,3 +452,31 @@ void pci_unblock_user_cfg_access(struct pci_dev *dev)
raw_spin_unlock_irqrestore(&pci_lock, flags);
 }
 EXPORT_SYMBOL_GPL(pci_unblock_user_cfg_access);
+
+void pci_reset_start(struct pci_dev *dev)
+{
+   int was_started;
+
+   raw_spin_lock_irq(&pci_lock);
+   if (unlikely(dev->block_ucfg_access || dev->reset_in_progress))
+   pci_wait_ucfg(dev);
+   was_started = dev->reset_in_progress;
+   dev->reset_in_progress = 1;
+   raw_spin_unlock_irq(&pci_lock);
+
+   /* If we BUG() inside the pci_lock, we're guaranteed to hose
+* the machine */
+   BUG_ON(was_started);
+}
+void pci_reset_end(struct pci_dev *dev)
+{
+   raw_spin_lock_irq(&pci_lock);
+
+   /* This indicates a problem in the caller, but we don't need
+* to kill them, unlike a double-reset above. */
+   WARN_ON(!dev->reset_in_progress);
+
+   dev->reset_in_progress = 0;
+   wake_up_all(&pci_ucfg_wait);
+   raw_spin_unlock_irq(&pci_lock);
+}
diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 42fae47..464d9b5 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ 

Re: Questions regarding ivshmem spec

2011-08-29 Thread Sasha Levin
On Thu, 2011-08-25 at 16:29 +0300, Sasha Levin wrote:
 Hello,
 
 I am looking to implement an ivshmem device for KVM tools, the purpose
 is to provide same functionality as QEMU and interoperability with QEMU.

[snip]

 1. File handles and guest IDs are passed between the server and the
 peers using sockets, is the protocol itself documented anywhere? I would
 like to be able to work alongside QEMU servers/peers. 

I'm still wondering if someone could do a quick sketch of the
client-server protocol or possibly point me to something that documents
it?

-- 

Sasha.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Broken pci_block_user_cfg_access interface

2011-08-29 Thread Jan Kiszka
On 2011-08-29 17:05, Michael S. Tsirkin wrote:
 So how about something like the following?
 Compile tested only, and I'm not sure I got the
 ipr and iov error handling right.
 Comments?

This still doesn't synchronize two callers of pci_block_user_cfg_access
unless one was reset. We may not have such a scenario yet, but that's
how the old code started as well...

And it makes the interface more convoluted (from 1 meter, why should
pci_block_user_cfg_access return an error if reset is running?).

 
 
 Subject: [PATCH] pci: fail block usercfg access on reset
 
 Anyone who wants to block usercfg access will
 also want to block reset from userspace.
 On the other hand, reset from userspace
 should block any other access from userspace.
 
 Finally, make it possible to detect reset in
 progress by returning an error status from
 pci_block_user_cfg_access.
 
 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 ---
  drivers/pci/access.c  |   45 
  drivers/pci/iov.c |   19 
  drivers/pci/pci.c |4 +-
  drivers/scsi/ipr.c|   22 ++-
  drivers/uio/uio_pci_generic.c |7 +-
  include/linux/pci.h   |6 -
  6 files changed, 87 insertions(+), 16 deletions(-)
 
 diff --git a/drivers/pci/access.c b/drivers/pci/access.c
 index fdaa42a..2492365 100644
 --- a/drivers/pci/access.c
 +++ b/drivers/pci/access.c
 @@ -139,7 +139,7 @@ static noinline void pci_wait_ucfg(struct pci_dev *dev)
   raw_spin_unlock_irq(&pci_lock);
   schedule();
   raw_spin_lock_irq(&pci_lock);
 - } while (dev->block_ucfg_access);
 + } while (dev->block_ucfg_access || dev->reset_in_progress);
   __remove_wait_queue(&pci_ucfg_wait, &wait);
  }
  
 @@ -153,7 +153,8 @@ int pci_user_read_config_##size \
   if (PCI_##size##_BAD)   \
   return -EINVAL; \
   raw_spin_lock_irq(&pci_lock);   \
 - if (unlikely(dev->block_ucfg_access)) pci_wait_ucfg(dev);   \
 + if (unlikely(dev->block_ucfg_access || dev->reset_in_progress)) \
 + pci_wait_ucfg(dev); \
   ret = dev->bus->ops->read(dev->bus, dev->devfn, \
   pos, sizeof(type), &data);  \
   raw_spin_unlock_irq(&pci_lock); \
 @@ -171,8 +172,9 @@ int pci_user_write_config_##size \
   int ret = -EIO; \
   if (PCI_##size##_BAD)   \
   return -EINVAL; \
 - raw_spin_lock_irq(&pci_lock);   \
 - if (unlikely(dev->block_ucfg_access)) pci_wait_ucfg(dev);   \
 + raw_spin_lock_irq(&pci_lock);   \
 + if (unlikely(dev->block_ucfg_access || dev->reset_in_progress)) \
 + pci_wait_ucfg(dev); \
   ret = dev->bus->ops->write(dev->bus, dev->devfn,\
   pos, sizeof(type), val);\
   raw_spin_unlock_irq(&pci_lock); \
 @@ -408,19 +410,24 @@ EXPORT_SYMBOL(pci_vpd_truncate);
   * sleep until access is unblocked again.  We don't allow nesting of
   * block/unblock calls.
   */
 -void pci_block_user_cfg_access(struct pci_dev *dev)
 +int pci_block_user_cfg_access(struct pci_dev *dev)
  {
   unsigned long flags;
   int was_blocked;
 + int in_progress;
  
   raw_spin_lock_irqsave(&pci_lock, flags);
   was_blocked = dev->block_ucfg_access;
   dev->block_ucfg_access = 1;
 + in_progress = dev->reset_in_progress;
   raw_spin_unlock_irqrestore(&pci_lock, flags);
  
 + if (in_progress)
 + return -EIO;
   /* If we BUG() inside the pci_lock, we're guaranteed to hose
* the machine */
   BUG_ON(was_blocked);
 + return 0;
  }
  EXPORT_SYMBOL_GPL(pci_block_user_cfg_access);
  
 @@ -445,3 +452,31 @@ void pci_unblock_user_cfg_access(struct pci_dev *dev)
   raw_spin_unlock_irqrestore(&pci_lock, flags);
  }
  EXPORT_SYMBOL_GPL(pci_unblock_user_cfg_access);
 +
 +void pci_reset_start(struct pci_dev *dev)
 +{
 + int was_started;
 +
 + raw_spin_lock_irq(&pci_lock);
 + if (unlikely(dev->block_ucfg_access || dev->reset_in_progress))
 + pci_wait_ucfg(dev);
 + was_started = dev->reset_in_progress;
 + dev->reset_in_progress = 1;
 + raw_spin_unlock_irq(&pci_lock);
 +
 + /* If we BUG() inside the pci_lock, we're guaranteed to hose
 +  * the machine */
 + BUG_ON(was_started);
 +}
 +void pci_reset_end(struct pci_dev *dev)
 +{
 + raw_spin_lock_irq(&pci_lock);
 +
 + /* This indicates a problem in the 

Re: Broken pci_block_user_cfg_access interface

2011-08-29 Thread Michael S. Tsirkin
On Mon, Aug 29, 2011 at 05:42:16PM +0200, Jan Kiszka wrote:
 On 2011-08-29 17:05, Michael S. Tsirkin wrote:
  So how about something like the following?
  Compile tested only, and I'm not sure I got the
  ipr and iov error handling right.
  Comments?
 
 This still doesn't synchronize two callers of pci_block_user_cfg_access
 unless one was reset. We may not have such a scenario yet, but that's
 how the old code started as well...
 
 And it makes the interface more convoluted (from 1 meter, why should
 pci_block_user_cfg_access return an error if reset is running?).

Well I made the error EIO but it really is something like
EAGAIN which means 'I would block if I could, but I was
asked not to'.

  
  
  Subject: [PATCH] pci: fail block usercfg access on reset
  
  Anyone who wants to block usercfg access will
  also want to block reset from userspace.
  On the other hand, reset from userspace
  should block any other access from userspace.
  
  Finally, make it possible to detect reset in
  progress by returning an error status from
  pci_block_user_cfg_access.
  
  Signed-off-by: Michael S. Tsirkin m...@redhat.com
  ---
   drivers/pci/access.c  |   45 
  
   drivers/pci/iov.c |   19 
   drivers/pci/pci.c |4 +-
   drivers/scsi/ipr.c|   22 ++-
   drivers/uio/uio_pci_generic.c |7 +-
   include/linux/pci.h   |6 -
   6 files changed, 87 insertions(+), 16 deletions(-)
  
  diff --git a/drivers/pci/access.c b/drivers/pci/access.c
  index fdaa42a..2492365 100644
  --- a/drivers/pci/access.c
  +++ b/drivers/pci/access.c
  @@ -139,7 +139,7 @@ static noinline void pci_wait_ucfg(struct pci_dev *dev)
   raw_spin_unlock_irq(&pci_lock);
   schedule();
   raw_spin_lock_irq(&pci_lock);
   -   } while (dev->block_ucfg_access);
   +   } while (dev->block_ucfg_access || dev->reset_in_progress);
   __remove_wait_queue(&pci_ucfg_wait, &wait);
   }
   
  @@ -153,7 +153,8 @@ int pci_user_read_config_##size \
   if (PCI_##size##_BAD)   \
   return -EINVAL; \
   raw_spin_lock_irq(&pci_lock);   \
   -   if (unlikely(dev->block_ucfg_access)) pci_wait_ucfg(dev);   \
   +   if (unlikely(dev->block_ucfg_access || dev->reset_in_progress)) \
   +   pci_wait_ucfg(dev); \
   ret = dev->bus->ops->read(dev->bus, dev->devfn, \
   pos, sizeof(type), &data);  \
   raw_spin_unlock_irq(&pci_lock); \
  @@ -171,8 +172,9 @@ int pci_user_write_config_##size \
   int ret = -EIO; \
   if (PCI_##size##_BAD)   \
   return -EINVAL; \
   -   raw_spin_lock_irq(&pci_lock);   \
   -   if (unlikely(dev->block_ucfg_access)) pci_wait_ucfg(dev);   \
   +   raw_spin_lock_irq(&pci_lock);   \
   +   if (unlikely(dev->block_ucfg_access || dev->reset_in_progress)) \
   +   pci_wait_ucfg(dev); \
   ret = dev->bus->ops->write(dev->bus, dev->devfn,\
   pos, sizeof(type), val);\
   raw_spin_unlock_irq(&pci_lock); \
  @@ -408,19 +410,24 @@ EXPORT_SYMBOL(pci_vpd_truncate);
* sleep until access is unblocked again.  We don't allow nesting of
* block/unblock calls.
*/
  -void pci_block_user_cfg_access(struct pci_dev *dev)
  +int pci_block_user_cfg_access(struct pci_dev *dev)
   {
  unsigned long flags;
  int was_blocked;
  +   int in_progress;
   
   raw_spin_lock_irqsave(&pci_lock, flags);
   was_blocked = dev->block_ucfg_access;
   dev->block_ucfg_access = 1;
   +   in_progress = dev->reset_in_progress;
   raw_spin_unlock_irqrestore(&pci_lock, flags);
   
  +   if (in_progress)
  +   return -EIO;
  /* If we BUG() inside the pci_lock, we're guaranteed to hose
   * the machine */
  BUG_ON(was_blocked);
  +   return 0;
   }
   EXPORT_SYMBOL_GPL(pci_block_user_cfg_access);
   
  @@ -445,3 +452,31 @@ void pci_unblock_user_cfg_access(struct pci_dev *dev)
   raw_spin_unlock_irqrestore(&pci_lock, flags);
   }
   EXPORT_SYMBOL_GPL(pci_unblock_user_cfg_access);
  +
  +void pci_reset_start(struct pci_dev *dev)
  +{
  +   int was_started;
  +
   +   raw_spin_lock_irq(&pci_lock);
   +   if (unlikely(dev->block_ucfg_access || dev->reset_in_progress))
   +   pci_wait_ucfg(dev);
   +   was_started = dev->reset_in_progress;
   +   dev->reset_in_progress = 1;
   +   raw_spin_unlock_irq(&pci_lock);
  +
  +   /* If we BUG() inside the 

Re: Broken pci_block_user_cfg_access interface

2011-08-29 Thread Jan Kiszka
On 2011-08-29 17:58, Michael S. Tsirkin wrote:
 On Mon, Aug 29, 2011 at 05:42:16PM +0200, Jan Kiszka wrote:
 I still don't get what prevents converting ipr to allow plain mutex
 synchronization. My vision is:
  - push reset-on-error of ipr into workqueue (or threaded IRQ?)
  - require mutex synchronization for common config space access
 
 Meaning pci_user_ read/write config?

And pci_dev_reset, yes.

 
  and the full reset cycle
   - only exception: INTx status/masking access
  => use pci_lock + test for reset_in_progress, skip operation if
  that is the case

 That would allow to drop the whole block_user_cfg infrastructure.

 Jan
 
 We still need to block userspace access while INTx does
 the status/masking access, right?

Yes, pci_lock would do that for us.

We should consider making the related bits for INTx test & mask/unmask
generic PCI services so that no user (uio_pci_generic, kvm, vfio) needs
to worry about the locking details.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


Re: Broken pci_block_user_cfg_access interface

2011-08-29 Thread Michael S. Tsirkin
On Mon, Aug 29, 2011 at 06:14:39PM +0200, Jan Kiszka wrote:
 On 2011-08-29 17:58, Michael S. Tsirkin wrote:
  On Mon, Aug 29, 2011 at 05:42:16PM +0200, Jan Kiszka wrote:
  I still don't get what prevents converting ipr to allow plain mutex
  synchronization. My vision is:
   - push reset-on-error of ipr into workqueue (or threaded IRQ?)
   - require mutex synchronization for common config space access
  
  Meaning pci_user_ read/write config?
 
 And pci_dev_reset, yes.
 
  
  and the
 full reset cycle
   - only exception: INTx status/masking access
  => use pci_lock + test for reset_in_progress, skip operation if
 that is the case
 
  That would allow to drop the whole block_user_cfg infrastructure.
 
  Jan
  
  We still need to block userspace access while INTx does
  the status/masking access, right?
 
 Yes, pci_lock would do that for us.

Well this means block_user_cfg is not going away,
this is what it really is: pci_lock + a bit to lock out userspace.

 We should consider making the related bits for INTx test & mask/unmask
 generic PCI services so that no user (uio_pci_generic, kvm, vfio) needs
 to worry about the locking details.
 
 Jan
 
 -- 
 Siemens AG, Corporate Technology, CT T DE IT 1
 Corporate Competence Center Embedded Linux


[Bug 39412] Win Vista and Win2K8 guests' network breaks down

2011-08-29 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=39412





--- Comment #10 from Rafael J. Wysocki r...@sisk.pl  2011-08-29 16:23:57 ---
On Monday, August 29, 2011, Ren, Yongjie wrote:
 I've verified this bug in Linus' tree 3.1.0-rc4. It got fixed, so I closed 
 the bug.
 The commit I tried is below.
 commit c6a389f123b9f68d605bb7e0f9b32ec1e3e14132
 Author: Linus Torvalds torva...@linux-foundation.org
 Date:   Sun Aug 28 21:16:01 2011 -0700
 
 Best Regards,
  Yongjie Ren  (Jay)
 
  -Original Message-
  From: Rafael J. Wysocki [mailto:r...@sisk.pl]
  Sent: Monday, August 29, 2011 3:01 AM
  To: Linux Kernel Mailing List
  Cc: Kernel Testers List; Maciej Rutecki; Florian Mickler; Avi Kivity; Ren, 
  Yongjie;
  Marcelo Tosatti; Stefan Hajnoczi
  Subject: [Bug #39412] Win Vista and Win2K8 guests' network breaks down
  
  This message has been generated automatically as a part of a report
  of regressions introduced between 2.6.39 and 3.0.
  
  The following bug entry is on the current list of known regressions
  introduced between 2.6.39 and 3.0.  Please verify if it still should
  be listed and let the tracking team know (either way).
  
  
  Bug-Entry   : http://bugzilla.kernel.org/show_bug.cgi?id=39412
  Subject : Win Vista and Win2K8 guests' network breaks down
  Submitter   : Jay Ren yongjie@intel.com
  Date: 2011-07-15 15:31 (45 days old)
  
 


-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are on the CC list for the bug.
You are watching the assignee of the bug.


Re: Broken pci_block_user_cfg_access interface

2011-08-29 Thread Jan Kiszka
On 2011-08-29 18:23, Michael S. Tsirkin wrote:
 On Mon, Aug 29, 2011 at 06:14:39PM +0200, Jan Kiszka wrote:
 On 2011-08-29 17:58, Michael S. Tsirkin wrote:
 On Mon, Aug 29, 2011 at 05:42:16PM +0200, Jan Kiszka wrote:
 I still don't get what prevents converting ipr to allow plain mutex
 synchronization. My vision is:
  - push reset-on-error of ipr into workqueue (or threaded IRQ?)
  - require mutex synchronization for common config space access

 Meaning pci_user_ read/write config?

 And pci_dev_reset, yes.


 and the
full reset cycle
  - only exception: INTx status/masking access
 => use pci_lock + test for reset_in_progress, skip operation if
that is the case

 That would allow to drop the whole block_user_cfg infrastructure.

 Jan

 We still need to block userspace access while INTx does
 the status/masking access, right?

 Yes, pci_lock would do that for us.
 
 Well this means block_user_cfg is not going away,
 this is what it really is: pci_lock + a bit to lock out userspace.

It does, as we only end up with a mutex and pci_lock. No more hand-crafted
queuing/blocking/waking.

INTx masking is a bit special as it's the only thing that truly requires
atomic context. But that's something we should address generically anyway.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


RFC: vfio / device assignment -- layout of device fd files

2011-08-29 Thread Yoder Stuart-B08248
Alex Graf, Scott Wood, and I met last week to try to flesh out
some details as to how vfio could work for non-PCI devices,
like we have in embedded systems.   This most likely will
require a different kernel driver than vfio-- for now we are
calling it dtio (for device tree I/O) as there is no way
to discover these devices except from the device tree.   But
the dtio driver would use the same architecture and interfaces
as vfio.

For devices on a system bus and represented in a device
tree we have some different requirements than PCI for what
is exposed in the device fd file.  A device may have multiple
address regions, multiple interrupts, a variable length device
tree path, whether a region is mmapable, etc.

With existing vfio, the device fd file layout is something
like:
  0xF Config space offset
  ...
  0x6 ROM offset
  0x5 BAR 5 offset
  0x4 BAR 4 offset
  0x3 BAR 3 offset
  0x2 BAR 2 offset
  0x1 BAR 1 offset
  0x0 BAR 0 offset

We have an alternate proposal that we think is more flexible,
extensible, and will accommodate both PCI and system bus
type devices (and others).

Instead of config space fixed at 0xf, we would propose
a header and multiple 'device info' records at offset 0x0 that would
encode everything that user space needs to know about
the device:

  0x0  +-+-+
   | magic   |   version   | u64   // magic u64 identifies the type of
   |   vfio| |   // passthru I/O, plus version #
   |   dtio| |   //   vfio - PCI devices
   +-+-+   //   dtio - device tree devices
   | flags | u32   // encodes any flags (TBD)
   +---+
   |  dev info record N|
   |type   | u32   // type of record
   |rec_len| u32   // length in bytes of record
   |   |  (including record header)
   |flags  | u32   // type specific flags
   |...content...  |   // record content, which could
   +---+   // include sub-records
   |  dev info record N+1  |
   +---+
   |  dev info record N+2  |
   +---+
   ...

The device info records following the file header have the following
record types each with content encoded in a record specific way:

 REGION  - describes an addressable address range for the device
 DTPATH - describes the device tree path for the device
 DTINDEX - describes the index into the related device tree
   property (reg,ranges,interrupts,interrupt-map)
 INTERRUPT - describes an interrupt for the device
 PCI_CONFIG_SPACE - describes config space for the device
 PCI_INFO - domain:bus:device:func
 PCI_BAR_INFO - information about the BAR for a device
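As a reading aid, here is one hypothetical C rendering of the proposed layout: a fixed header followed by self-describing, variable-length records. The struct names, field widths, and the record-walking helper are only a sketch following the diagrams in this mail, not a settled ABI.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical on-"disk" layout of the proposed device fd:
 * a header at offset 0x0, then device info records. */
struct dtio_header {
    uint64_t magic_version;   /* identifies vfio vs dtio, plus version # */
    uint32_t flags;           /* TBD flags */
} __attribute__((packed));

struct dtio_record {
    uint32_t type;            /* REGION, DTPATH, DTINDEX, INTERRUPT, ... */
    uint32_t rec_len;         /* length in bytes, including this header */
    uint32_t flags;           /* type-specific flags */
    /* record content (possibly sub-records) follows */
} __attribute__((packed));

/* Walk the records in a buffer read from the device fd and return
 * how many well-formed records it contains.  A record whose rec_len
 * is too small or runs past the buffer terminates the walk. */
static int count_records(const uint8_t *buf, size_t len)
{
    size_t off = sizeof(struct dtio_header);
    int n = 0;

    while (off + sizeof(struct dtio_record) <= len) {
        struct dtio_record rec;

        memcpy(&rec, buf + off, sizeof(rec));
        if (rec.rec_len < sizeof(rec) || off + rec.rec_len > len)
            break;
        off += rec.rec_len;
        n++;
    }
    return n;
}
```

The point of `rec_len` in every record header is that a reader can skip record types it does not understand, which is what makes the format extensible.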

For a device tree type device the file may look like:

 0x0+---+
|header |  
+---+
|   type = REGION   |  
|   rec_len |  
|   flags = | u32 // region specific flags
|   is_mmapable | 
|   offset  | u64 // seek offset to region from
|   |from beginning
|   len | u64 // length of region
|   addr| u64 // phys addr of region
|   |  
+---+
 \   type = DTPATH  \  // a sub-region
  |   rec_len|  
  |   flags  |  
  |   dev tree path  | char[] // device tree path
+---+
 \   type = DTINDEX \  // a sub-region
  |   rec_len|  
  |   flags  |  
  |   prop_type  | u32  // REG, RANGES
  |   prop_index | u32  // index  into resource list
+---+
|  type = INTERRUPT |  
|  rec_len  |  
|  flags| u32 
|  ioctl_handle | u32 // argument to ioctl to get interrupts
|   |  
+---+
 \   type = DTPATH \  // a sub-region
  |   rec_len   |  
  |   flags |  
  |   dev tree path |  char[] // device tree path
+---+
  \   type = DTINDEX   \  // a sub-region 
  |   rec_len   |  
  |   flags |  
  |   prop_type | u32 // INTERRUPT,INTERRUPT_MAP
  |   prop_index| u32 // index


PCI devices would have a PCI specific encoding.  Instead of
config space and the mappable BAR regions being at specific
predetermined offsets, the device info records 

Re: Questions regarding ivshmem spec

2011-08-29 Thread Cam Macdonell
On Mon, Aug 29, 2011 at 9:25 AM, Sasha Levin levinsasha...@gmail.com wrote:
 On Thu, 2011-08-25 at 16:29 +0300, Sasha Levin wrote:
 Hello,

 I am looking to implement an ivshmem device for KVM tools, the purpose
 is to provide same functionality as QEMU and interoperability with QEMU.

 [snip]

 1. File handles and guest IDs are passed between the server and the
 peers using sockets, is the protocol itself documented anywhere? I would
 like to be able to work alongside QEMU servers/peers.

 I'm still wondering if someone could do a quick sketch of the
 client-server protocol or possibly point me to something that documents
 it?

Hi Sasha,

I have something like that.  I'll be in touch when I find it.

Cam


 --

 Sasha.




Re: Broken pci_block_user_cfg_access interface

2011-08-29 Thread Jan Kiszka
On 2011-08-29 17:42, Jan Kiszka wrote:
 I still don't get what prevents converting ipr to allow plain mutex
 synchronization. My vision is:
  - push reset-on-error of ipr into workqueue (or threaded IRQ?)

I'm starting to like your proposal: I had a look at ipr, but it turned
out to be anything but trivial to convert that driver. It runs its
complete state machine under spin_lock_irq, and the functions calling
pci_block/unblock_user_cfg_access are deep inside this thing. I have no
hardware to test whatever change, and I feel a bit uncomfortable asking
Brian to redesign his driver that massively.

So back to your idea: I would generalize pci_block_user_cfg_access to
pci_block_cfg_access. It should fail when some other site already holds
the access lock, but it should remain non-blocking - for the sake of ipr.
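A rough userspace model of the semantics being proposed here, with a pthread mutex standing in for pci_lock (all names and the -EBUSY return are hypothetical, not the eventual kernel API):

```c
#include <errno.h>
#include <pthread.h>

/* Model of a non-blocking pci_block_cfg_access(): it must fail
 * rather than sleep when another site already holds the block,
 * so that callers like ipr can use it from atomic context. */
struct pci_dev_model {
    pthread_mutex_t lock;     /* stands in for the global pci_lock */
    int cfg_access_blocked;   /* stands in for block_ucfg_access */
};

/* Returns 0 on success, -EBUSY if another site already holds
 * the access block; never sleeps. */
static int pci_block_cfg_access(struct pci_dev_model *dev)
{
    int ret = 0;

    pthread_mutex_lock(&dev->lock);
    if (dev->cfg_access_blocked)
        ret = -EBUSY;
    else
        dev->cfg_access_blocked = 1;
    pthread_mutex_unlock(&dev->lock);
    return ret;
}

static void pci_unblock_cfg_access(struct pci_dev_model *dev)
{
    pthread_mutex_lock(&dev->lock);
    dev->cfg_access_blocked = 0;
    pthread_mutex_unlock(&dev->lock);
}
```

The trylock-style failure mode is the key difference from the existing pci_block_user_cfg_access(), which BUG()s on nesting instead of reporting contention to the caller.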

We should still provide generic pci-2.3 IRQ masking services, but that
could be done in a second step. I could have a look at this.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


Re: [PATCH 1/5] Support for vm_stop from the migration thread

2011-08-29 Thread Marcelo Tosatti
On Sat, Aug 27, 2011 at 02:09:44PM -0400, Umesh Deshpande wrote:
 Currently, when any thread other than iothread calls vm_stop, it is
 scheduled to be executed later by the iothread. This patch allows the
 execution of vm_stop from threads other than iothread. This is especially
 helpful when the migration is moved into a separate thread.
 
 Signed-off-by: Umesh Deshpande udesh...@redhat.com
 ---
  cpus.c |4 ++--
  1 files changed, 2 insertions(+), 2 deletions(-)
 
 diff --git a/cpus.c b/cpus.c
 index de70e02..f35f683 100644
 --- a/cpus.c
 +++ b/cpus.c
 @@ -122,8 +122,8 @@ static void do_vm_stop(int reason)
  {
  if (vm_running) {
  cpu_disable_ticks();
 -vm_running = 0;
  pause_all_vcpus();
 +vm_running = 0;
  vm_state_notify(0, reason);
  qemu_aio_flush();
  bdrv_flush_all();

Why this change?



Re: [PATCH 3/5] Migration thread mutex

2011-08-29 Thread Marcelo Tosatti
On Sat, Aug 27, 2011 at 02:09:46PM -0400, Umesh Deshpande wrote:
 This patch implements the migrate_ram mutex, which protects the RAMBlock list
 traversal in the migration thread, during the transfer of RAM, against block
 addition/removal from the iothread.
 
 Note: Combination of iothread mutex and migration thread mutex works as a
 rw-lock. Both mutexes are acquired while modifying the ram_list members or RAM
 block list.
 
 Signed-off-by: Umesh Deshpande udesh...@redhat.com
 ---
  arch_init.c   |   21 +
  cpu-all.h |3 +++
  exec.c|   23 +++
  qemu-common.h |2 ++
  4 files changed, 49 insertions(+), 0 deletions(-)
 
 diff --git a/arch_init.c b/arch_init.c
 index 484b39d..9d02270 100644
 --- a/arch_init.c
 +++ b/arch_init.c
 @@ -109,6 +109,7 @@ static int is_dup_page(uint8_t *page, uint8_t ch)
  
  static RAMBlock *last_block;
  static ram_addr_t last_offset;
 +static uint64_t last_version;
  
  static int ram_save_block(QEMUFile *f)
  {
 @@ -170,6 +171,7 @@ static int ram_save_block(QEMUFile *f)
  
  last_block = block;
  last_offset = offset;
 +last_version = ram_list.version;
  
  return bytes_sent;
  }
 @@ -270,6 +272,7 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, 
 void *opaque)
  bytes_transferred = 0;
  last_block = NULL;
  last_offset = 0;
 +last_version = ram_list.version = 0;
  sort_ram_list();
  
  /* Make sure all dirty bits are set */
 @@ -298,6 +301,17 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, 
 void *opaque)
  bytes_transferred_last = bytes_transferred;
  bwidth = qemu_get_clock_ns(rt_clock);
  
 +if (stage != 3) {
 +qemu_mutex_lock_migrate_ram();
 +qemu_mutex_unlock_iothread();
 +}

Dropping the iothread lock from within a timer handler is not safe.
This change to ram_save_live should be moved to the patch where 
migration thread is introduced.


Re: [PATCH 5/5] Separate migration thread

2011-08-29 Thread Marcelo Tosatti
On Sat, Aug 27, 2011 at 02:09:48PM -0400, Umesh Deshpande wrote:
 This patch creates a separate thread for the guest migration on the source 
 side.
 All exits (on completion/error) from the migration thread are handled by a
 bottom handler, which is called from the iothread.
 
 Signed-off-by: Umesh Deshpande udesh...@redhat.com
 ---
  buffered_file.c |   76 
  migration.c |  105 ++
  migration.h |8 
  qemu-thread-posix.c |   10 +
  qemu-thread.h   |1 +
  5 files changed, 124 insertions(+), 76 deletions(-)
 
 diff --git a/buffered_file.c b/buffered_file.c
 index 41b42c3..c31852e 100644
 --- a/buffered_file.c
 +++ b/buffered_file.c
 @@ -16,6 +16,8 @@
  #include "qemu-timer.h"
  #include "qemu-char.h"
  #include "buffered_file.h"
 +#include "migration.h"
 +#include "qemu-thread.h"
  
  //#define DEBUG_BUFFERED_FILE
  
 @@ -28,13 +30,14 @@ typedef struct QEMUFileBuffered
  void *opaque;
  QEMUFile *file;
  int has_error;
 +int closed;
  int freeze_output;
  size_t bytes_xfer;
  size_t xfer_limit;
  uint8_t *buffer;
  size_t buffer_size;
  size_t buffer_capacity;
 -QEMUTimer *timer;
 +QemuThread thread;
  } QEMUFileBuffered;
  
  #ifdef DEBUG_BUFFERED_FILE
 @@ -155,14 +158,6 @@ static int buffered_put_buffer(void *opaque, const 
 uint8_t *buf, int64_t pos, in
  offset = size;
  }
  
 -if (pos == 0 && size == 0) {
 -DPRINTF("file is ready\n");
 -if (s->bytes_xfer >= s->xfer_limit) {
 -DPRINTF("notifying client\n");
 -s->put_ready(s->opaque);
 -}
 -}
 -
  return offset;
  }
  
 @@ -173,22 +168,25 @@ static int buffered_close(void *opaque)
  
  DPRINTF(closing\n);
  
 -while (!s->has_error && s->buffer_size) {
 -buffered_flush(s);
 -if (s->freeze_output)
 -s->wait_for_unfreeze(s);
 -}
 +s->closed = 1;
 
 -ret = s->close(s->opaque);
 +qemu_mutex_unlock_migrate_ram();
 +qemu_mutex_unlock_iothread();

This is using the ram mutex to protect migration thread specific data.
A new lock should be introduced for that purpose.

 -qemu_del_timer(s->timer);
 -qemu_free_timer(s->timer);
 +qemu_thread_join(&s->thread);
 +/* Waits for the completion of the migration thread */
 +
 +qemu_mutex_lock_iothread();
 +qemu_mutex_lock_migrate_ram();
 +
 +ret = s->close(s->opaque);
  qemu_free(s->buffer);
  qemu_free(s);
  
  return ret;
  }
  
 +
  static int buffered_rate_limit(void *opaque)
  {
  QEMUFileBuffered *s = opaque;
 @@ -228,26 +226,37 @@ static int64_t buffered_get_rate_limit(void *opaque)
  return s->xfer_limit;
  }
  
 -static void buffered_rate_tick(void *opaque)
 +static void *migrate_vm(void *opaque)
  {

buffered_file.c was generic code that has now become migration specific
(although migration was the only user). So it should either stop
pretending to be generic code, by rename to migration_thread.c along
with un-exporting interfaces, or it should remain generic and therefore
all migration specific knowledge moved somewhere else.

Anthony?

 +int64_t current_time, expire_time = qemu_get_clock_ms(rt_clock) + 100;
 +struct timeval tv = { .tv_sec = 0, .tv_usec = 10};

qemu_get_clock_ms should happen under iothread lock.

 -if (s->freeze_output)
 -return;
 +current_time = qemu_get_clock_ms(rt_clock);
 +if (!s->closed && (expire_time > current_time)) {
 +tv.tv_usec = 1000 * (expire_time - current_time);
 +select(0, NULL, NULL, NULL, &tv);
 +continue;
 +}
 
 -s->bytes_xfer = 0;
 +s->bytes_xfer = 0;
 
 -buffered_flush(s);
 +expire_time = qemu_get_clock_ms(rt_clock) + 100;
 +if (!s->closed) {
 +s->put_ready(s->opaque);
 +} else {
 +buffered_flush(s);
 +}
 +}
  
 -/* Add some checks around this */
 -s->put_ready(s->opaque);
 +return NULL;
  }
  
  QEMUFile *qemu_fopen_ops_buffered(void *opaque,
 @@ -267,15 +276,14 @@ QEMUFile *qemu_fopen_ops_buffered(void *opaque,
  s->put_ready = put_ready;
  s->wait_for_unfreeze = wait_for_unfreeze;
  s->close = close;
 +s->closed = 0;
 
  s->file = qemu_fopen_ops(s, buffered_put_buffer, NULL,
   buffered_close, buffered_rate_limit,
   buffered_set_rate_limit,
 -  buffered_get_rate_limit);
 -
 -s->timer = qemu_new_timer_ms(rt_clock, buffered_rate_tick, s);
 + buffered_get_rate_limit);
 
 -qemu_mod_timer(s->timer, qemu_get_clock_ms(rt_clock) + 100);
 +qemu_thread_create(&s->thread, migrate_vm, s);
  
  return s-file;
  }
 diff --git a/migration.c b/migration.c
 index af3a1f2..5df186d 100644
 --- a/migration.c
 +++ b/migration.c
 @@ -149,10 +149,12 @@ int do_migrate_set_speed(Monitor *mon, 

Re: [Qemu-devel] RFC: vfio / device assignment -- layout of device fd files

2011-08-29 Thread Anthony Liguori

On 08/29/2011 11:51 AM, Yoder Stuart-B08248 wrote:

Alex Graf, Scott Wood, and I met last week to try to flesh out
some details as to how vfio could work for non-PCI devices,
like we have in embedded systems.   This most likely will
require a different kernel driver than vfio-- for now we are
calling it dtio (for device tree I/O) as there is no way
to discover these devices except from the device tree.   But
the dtio driver would use the same architecture and interfaces
as vfio.

For devices on a system bus and represented in a device
tree we have some different requirements than PCI for what
is exposed in the device fd file.  A device may have multiple
address regions, multiple interrupts, a variable length device
tree path, whether a region is mmapable, etc.

With existing vfio, the device fd file layout is something
like:
   0xF Config space offset
   ...
   0x6 ROM offset
   0x5 BAR 5 offset
   0x4 BAR 4 offset
   0x3 BAR 3 offset
   0x2 BAR 2 offset
   0x1 BAR 1 offset
   0x0 BAR 0 offset

We have an alternate proposal that we think is more flexible,
extensible, and will accommodate both PCI and system bus
type devices (and others).

Instead of config space fixed at 0xf, we would propose
a header and multiple 'device info' records at offset 0x0 that would
encode everything that user space needs to know about
the device:


Why not just use an ioctl with a proper struct?

The config space is weird for PCI access because it's mirroring a well 
known binary blob.  It's not something to replicate if you're inventing 
something new.
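For contrast, the ioctl route suggested above would presumably look something like the following; the ioctl number, struct layout, and all names are purely illustrative, not an API anyone has proposed in detail:

```c
#include <stdint.h>
#include <sys/ioctl.h>

/* Hypothetical ioctl-based alternative: userspace queries one
 * region at a time with a fixed struct, instead of parsing a
 * record stream at offset 0x0 of the device fd. */
struct dtio_region_info {
    uint32_t index;     /* which region to query (input) */
    uint32_t flags;     /* e.g. a made-up DTIO_REGION_MMAPABLE bit */
    uint64_t offset;    /* seek offset of the region within the fd */
    uint64_t size;      /* length of the region in bytes */
    uint64_t phys_addr; /* bus/physical address, if meaningful */
};

/* Illustrative ioctl number only; ';' as the type byte is arbitrary. */
#define DTIO_GET_REGION_INFO _IOWR(';', 100, struct dtio_region_info)
```

Extensibility in this scheme comes from versioned ioctls or struct size fields rather than from self-describing records, which is the trade-off Scott raises in the reply below.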


Regards,

Anthony Liguori


Re: Broken pci_block_user_cfg_access interface

2011-08-29 Thread Michael S. Tsirkin
On Mon, Aug 29, 2011 at 08:47:07PM +0200, Jan Kiszka wrote:
 On 2011-08-29 17:42, Jan Kiszka wrote:
  I still don't get what prevents converting ipr to allow plain mutex
  synchronization. My vision is:
   - push reset-on-error of ipr into workqueue (or threaded IRQ?)
 
 I'm starting to like your proposal: I had a look at ipr, but it turned
 out to be anything but trivial to convert that driver. It runs its
 complete state machine under spin_lock_irq, and the functions calling
 pci_block/unblock_user_cfg_access are deep inside this thing. I have no
 hardware to test whatever change, and I feel a bit uncomfortable asking
 Brian to redesign his driver that massively.
 
 So back to your idea: I would generalize pci_block_user_cfg_access to
 pci_block_cfg_access. It should fail when some other site already holds
 the access lock, but it should remain non-blocking - for the sake of ipr.

It would be easy to have blocking and non-blocking variants.

But
- I have no idea whether supporting sysfs config/reset access
  while ipr is active makes any sense - I know we need it for uio.
- reset while uio handles interrupt needs to block, not fail I think


 We should still provide generic pci-2.3 IRQ masking services, but that
 could be done in a second step. I could have a look at this.
 
 Jan
 
 -- 
 Siemens AG, Corporate Technology, CT T DE IT 1
 Corporate Competence Center Embedded Linux


Re: [Qemu-devel] RFC: vfio / device assignment -- layout of device fd files

2011-08-29 Thread Scott Wood
On 08/29/2011 02:04 PM, Anthony Liguori wrote:
 On 08/29/2011 11:51 AM, Yoder Stuart-B08248 wrote:
 Instead of config space fixed at 0xf, we would propose
 a header and multiple 'device info' records at offset 0x0 that would
 encode everything that user space needs to know about
 the device:
 
 Why not just use an ioctl with a proper struct?

This is more extensible than a struct -- both in features, and in the
number of each type of resource that you can have, length of strings you
can have, etc.

 The config space is weird for PCI access because it's mirroring a well
 known binary blob.  It's not something to replicate if you're inventing
 something new.

There's no intent to replicate config space in general -- config space
is provided as-is.  There's little overlap between config space and the
extra information provided.  Length can be had from config space, but
only by modifying it.  Physical address sort-of overlaps, though bus
addresses could be different from CPU physical addresses[1].  In both
cases, it'd be nice to stay consistent with device-tree regions.

BAR type is overlap, but doesn't seem too unreasonable to me.

-Scott

[1] The user is probably less likely to care about the physical address
at all in the PCI case, but consistency is nice.



Re: RFC: vfio / device assignment -- layout of device fd files

2011-08-29 Thread Alex Williamson
On Mon, 2011-08-29 at 16:51 +, Yoder Stuart-B08248 wrote:
 Alex Graf, Scott Wood, and I met last week to try to flesh out
 some details as to how vfio could work for non-PCI devices,
 like we have in embedded systems.   This most likely will
require a different kernel driver than vfio -- for now we are
 calling it dtio (for device tree I/O) as there is no way
 to discover these devices except from the device tree.   But
 the dtio driver would use the same architecture and interfaces
 as vfio.

Why is this a different kernel driver?  The difference will primarily be
in what bus types vfio registers drivers and the set of device types the
device fds support.  The group and iommu interfaces will be shared.
This sounds more like vfio .config options (CONFIG_VFIO_PCI,
CONFIG_VFIO_DT).

 For devices on a system bus and represented in a device
 tree we have some different requirements than PCI for what
 is exposed in the device fd file.  A device may have multiple
 address regions, multiple interrupts, a variable length device
 tree path, whether a region is mmapable, etc.
 
 With existing vfio, the device fd file layout is something
 like:
   0xF Config space offset
   ...
   0x6 ROM offset
   0x5 BAR 5 offset
   0x4 BAR 4 offset
   0x3 BAR 3 offset
   0x2 BAR 2 offset
   0x1 BAR 1 offset
   0x0 BAR 0 offset
 
 We have an alternate proposal that we think is more flexible,
 extensible, and will accommodate both PCI and system bus
 type devices (and others).
 
 Instead of config space fixed at 0xf, we would propose
 a header and multiple 'device info' records at offset 0x0 that would
 encode everything that user space needs to know about
 the device:
 
   0x0  +-+-+
| magic   |   version   | u64   // magic u64 identifies the type of
|   vfio| |   // passthru I/O, plus version #
|   dtio| |   //   vfio - PCI devices
+-+-+   //   dtio - device tree devices

Maybe magic = pci, dt, ...

| flags | u32   // encodes any flags (TBD)
+---+
|  dev info record N|
|type   | u32   // type of record
|rec_len| u32   // length in bytes of record
|   |  (including record header)
|flags  | u32   // type specific flags
|...content...  |   // record content, which could
+---+   // include sub-records
|  dev info record N+1  |
+---+
|  dev info record N+2  |
+---+
...
 
 The device info records following the file header have the following
 record types each with content encoded in a record specific way:
 
  REGION  - describes an addressable address range for the device
  DTPATH - describes the device tree path for the device
  DTINDEX - describes the index into the related device tree
property (reg,ranges,interrupts,interrupt-map)

I don't quite understand if these are physical or virtual.

  INTERRUPT - describes an interrupt for the device
  PCI_CONFIG_SPACE - describes config space for the device

I would have expected this to be a REGION with a property of
PCI_CONFIG_SPACE.

  PCI_INFO - domain:bus:device:func

Not entirely sure we need this.  How are you imagining we get from a
group fd to a device fd (wondering if you're only including this for
enumeration)?  I'm currently coding it as a VFIO_GROUP_GET_DEVICE_FD
ioctl that takes a char* parameter that contains the dev_name() for the
device requested.  The list of devices under each group can be found by
read()ing the group fd.  If we keep this, we should make the interfaces
similar, in fact, maybe this is how we describe the capabilities of the
iommu too, reading a table from the iommu fd.

  PCI_BAR_INFO - information about the BAR for a device
 
 For a device tree type device the file may look like:
 
  0x0+---+
 |header |  
 +---+
 |   type = REGION   |  
 |   rec_len |  
 |   flags = | u32 // region specific flags
 |   is_mmapable | 
 |   offset  | u64 // seek offset to region from
 |   |from beginning
 |   len | u64 // length of region
 |   addr| u64 // phys addr of region

Would we only need to expose phys addr for 1:1 mapping requirements?
I'm not sure why we'd care to expose this otherwise.

 |   |  
 +---+
  \   type = DTPATH  \  // a sub-region
   |   rec_len|  
   |   flags  |  
   |   dev tree path 

Re: [PATCH resend] KVM: Document KVM_IRQFD

2011-08-29 Thread Scott Wood
On 08/29/2011 07:34 AM, Sasha Levin wrote:
 Cc: Avi Kivity a...@redhat.com
 Cc: Marcelo Tosatti mtosa...@redhat.com
 Signed-off-by: Sasha Levin levinsasha...@gmail.com
 ---
  Documentation/virtual/kvm/api.txt |   27 +++
  1 files changed, 27 insertions(+), 0 deletions(-)
 
 diff --git a/Documentation/virtual/kvm/api.txt 
 b/Documentation/virtual/kvm/api.txt
 index 2d510b6..d1150b6 100644
 --- a/Documentation/virtual/kvm/api.txt
 +++ b/Documentation/virtual/kvm/api.txt
 @@ -1450,6 +1450,33 @@ is supported; 2 if the processor requires all virtual 
 machines to have
  an RMA, or 1 if the processor can use an RMA but doesn't require it,
  because it supports the Virtual RMA (VRMA) facility.
  
 +4.64 KVM_IRQFD
 +
 +Capability: KVM_CAP_IRQFD
 +Architectures: all
 +Type: vm ioctl
 +Parameters: struct kvm_irqfd (in)
 +Returns: 0 on success, !0 on error
 +
 +This ioctl attaches or detaches an eventfd to a GSI within the guest.
 +While the eventfd is assigned to the guest, any write to the eventfd
 +would trigger the GSI within the guest.
 +
 +struct kvm_irqfd {
 + __u32 fd;
 + __u32 gsi;
 + __u32 flags;
 + __u8  pad[20];
 +};

Should define gsi (and how it's used by KVM) somewhere.  AFAICT it's
an ACPI-ism.

-Scott



Re: RFC: vfio / device assignment -- layout of device fd files

2011-08-29 Thread Scott Wood
On 08/29/2011 02:51 PM, Alex Williamson wrote:
 On Mon, 2011-08-29 at 16:51 +, Yoder Stuart-B08248 wrote:
 The device info records following the file header have the following
 record types each with content encoded in a record specific way:

  REGION  - describes an addressable address range for the device
  DTPATH - describes the device tree path for the device
  DTINDEX - describes the index into the related device tree
property (reg,ranges,interrupts,interrupt-map)
 
 I don't quite understand if these are physical or virtual.

If what are physical or virtual?

  INTERRUPT - describes an interrupt for the device
  PCI_CONFIG_SPACE - describes config space for the device
 
 I would have expected this to be a REGION with a property of
 PCI_CONFIG_SPACE.

Could be, if physical address is made optional.

  PCI_INFO - domain:bus:device:func
 
 Not entirely sure we need this.  How are you imagining we get from a
 group fd to a device fd (wondering if you're only including this for
 enumeration)?  I'm currently coding it as a VFIO_GROUP_GET_DEVICE_FD
 ioctl that takes a char* parameter that contains the dev_name() for the
 device requested.  The list of devices under each group can be found by
 read()ing the group fd.  

Seems like it would be nice to keep this information around, rather than
require the user to pass it around separately.  Shouldn't cost much.

 If we keep this, we should make the interfaces similar, 

Yes.

  PCI_BAR_INFO - information about the BAR for a device

 For a device tree type device the file may look like:

  0x0+---+
 |header |  
 +---+
 |   type = REGION   |  
 |   rec_len |  
 |   flags = | u32 // region specific flags
 |   is_mmapable | 
 |   offset  | u64 // seek offset to region from
 |   |from beginning
 |   len | u64 // length of region
 |   addr| u64 // phys addr of region
 
 Would we only need to expose phys addr for 1:1 mapping requirements?
 I'm not sure why we'd care to expose this otherwise.

It's more important for non-PCI, where it avoids the need for userspace
to parse the device tree to find the guest address (we'll usually want
1:1), or to consolidate pages shared by multiple regions.  It could be
nice for debugging, as well.

Seems like something that's better to have and not need, than the other
way around.  It would need to be optional, though, if we want to be able
to describe things without a physical address like config space.

 |   |  
 +---+
  \   type = DTPATH  \  // a sub-region
   |   rec_len|  
   |   flags  |  
   |   dev tree path  | char[] // device tree path
 +---+
  \   type = DTINDEX \  // a sub-region
   |   rec_len|  
   |   flags  |  
   |   prop_type  | u32  // REG, RANGES
   |   prop_index | u32  // index  into resource list
 +---+
 |  type = INTERRUPT |  
 |  rec_len  |  
 |  flags| u32 
 |  ioctl_handle | u32 // argument to ioctl to get interrupts
 
 Is this a dynamic ioctl number or just a u32 parameter to an ioctl like
 VFIO_DEVICE_GET_IRQ_FD (ie. an instance number)?

It's a parameter to VFIO_DEVICE_GET_IRQ_FD.

 +---+
 |   type = PCI_INFO |  
 |   rec_len |  
 |   flags = 0x0 |  
 |   dom:bus:dev:func| u32 // pci device info
 +---+
 |   type = REGION   |  
 |   rec_len |  
 |   flags = |  
 |   is_mmapable |  
 |   offset  | u64 // seek offset to region from
 |   |from beginning
 |   len | u64 // length of region
 |   addr| u64 // physical addr of region
 +---+
  \   type = PCI_BAR_INFO\  
   |   rec_len|  
   |   flags  |  
   |   bar_type   |  // pio
   |  |  // prefetchable mmio
   |  |  // non-prefetchable mmio
   |   bar_index  |  // index of bar in device
 
 Aren't a lot of these typical region attributes?  Wondering if we should
 just make them part of the REGION flags or we'll have a growing number
 of sub-regions describing common things.

It'd be nice to keep the base region record common among
PCI/DT/whatever.


KVM call agenda for August 30

2011-08-29 Thread Juan Quintela

Hi

Please send in any agenda items you are interested in covering.

Later, Juan.


Re: RFC: vfio / device assignment -- layout of device fd files

2011-08-29 Thread Alex Williamson
On Mon, 2011-08-29 at 16:58 -0500, Scott Wood wrote:
 On 08/29/2011 02:51 PM, Alex Williamson wrote:
  On Mon, 2011-08-29 at 16:51 +, Yoder Stuart-B08248 wrote:
  The device info records following the file header have the following
  record types each with content encoded in a record specific way:
 
   REGION  - describes an addressable address range for the device
   DTPATH - describes the device tree path for the device
   DTINDEX - describes the index into the related device tree
 property (reg,ranges,interrupts,interrupt-map)
  
  I don't quite understand if these are physical or virtual.
 
 If what are physical or virtual?

Can you give an example of a path vs an index?  I don't understand
enough about these to ask a useful question about what they're
describing.

   INTERRUPT - describes an interrupt for the device
   PCI_CONFIG_SPACE - describes config space for the device
  
  I would have expected this to be a REGION with a property of
  PCI_CONFIG_SPACE.
 
 Could be, if physical address is made optional.

Or physical address is also a property, aka sub-region.

   PCI_INFO - domain:bus:device:func
  
  Not entirely sure we need this.  How are you imagining we get from a
  group fd to a device fd (wondering if you're only including this for
  enumeration)?  I'm currently coding it as a VFIO_GROUP_GET_DEVICE_FD
  ioctl that takes a char* parameter that contains the dev_name() for the
  device requested.  The list of devices under each group can be found by
  read()ing the group fd.  
 
 Seems like it would be nice to keep this information around, rather than
 require the user to pass it around separately.  Shouldn't cost much.

Seems redundant.  The user had to know the d:b:d.f to get the device fd
in the first place.  At best it's a sanity check.

  If we keep this, we should make the interfaces similar, 
 
 Yes.
 
   PCI_BAR_INFO - information about the BAR for a device
 
  For a device tree type device the file may look like:
 
   0x0+---+
  |header |  
  +---+
  |   type = REGION   |  
  |   rec_len |  
  |   flags = | u32 // region specific flags
  |   is_mmapable | 
  |   offset  | u64 // seek offset to region from
  |   |from beginning
  |   len | u64 // length of region
  |   addr| u64 // phys addr of region
  
  Would we only need to expose phys addr for 1:1 mapping requirements?
  I'm not sure why we'd care to expose this otherwise.
 
 It's more important for non-PCI, where it avoids the need for userspace
 to parse the device tree to find the guest address (we'll usually want
 1:1), or to consolidate pages shared by multiple regions.  It could be
 nice for debugging, as well.

So the device tree path is ripped straight from the system, so it's the
actual 1:1, matching physical hardware, path.

 Seems like something that's better to have and not need, than the other
 way around.  It would need to be optional, though, if we want to be able
 to describe things without a physical address like config space.

sub-region?

  |   |  
  +---+
   \   type = DTPATH  \  // a sub-region
|   rec_len|  
|   flags  |  
|   dev tree path  | char[] // device tree path
  +---+
   \   type = DTINDEX \  // a sub-region
|   rec_len|  
|   flags  |  
|   prop_type  | u32  // REG, RANGES
|   prop_index | u32  // index  into resource list
  +---+
  |  type = INTERRUPT |  
  |  rec_len  |  
  |  flags| u32 
  |  ioctl_handle | u32 // argument to ioctl to get 
  interrupts
  
  Is this a dynamic ioctl number or just a u32 parameter to an ioctl like
  VFIO_DEVICE_GET_IRQ_FD (ie. an instance number)?
 
 It's a parameter to VFIO_DEVICE_GET_IRQ_FD.
 
  +---+
  |   type = PCI_INFO |  
  |   rec_len |  
  |   flags = 0x0 |  
  |   dom:bus:dev:func| u32 // pci device info
  +---+
  |   type = REGION   |  
  |   rec_len |  
  |   flags = |  
  |   is_mmapable |  
  |   offset  | u64 // seek offset to region from
  |   |from beginning
  |   len | u64 // length of region
  |   addr| u64 // physical addr of region
  +---+
   \   

Re: RFC: vfio / device assignment -- layout of device fd files

2011-08-29 Thread Scott Wood
On 08/29/2011 05:46 PM, Alex Williamson wrote:
 On Mon, 2011-08-29 at 16:58 -0500, Scott Wood wrote:
 On 08/29/2011 02:51 PM, Alex Williamson wrote:
 On Mon, 2011-08-29 at 16:51 +, Yoder Stuart-B08248 wrote:
 The device info records following the file header have the following
 record types each with content encoded in a record specific way:

  REGION  - describes an addressable address range for the device
  DTPATH - describes the device tree path for the device
  DTINDEX - describes the index into the related device tree
property (reg,ranges,interrupts,interrupt-map)

 I don't quite understand if these are physical or virtual.

 If what are physical or virtual?
 
 Can you give an example of a path vs an index?  I don't understand
 enough about these to ask a useful question about what they're
 describing.

You'd have both path and index.

Example, for this tree:

/ {
...
foo {
...
bar {
reg = <0x1000 64 0x1800 64>;
ranges = <0 0x2 0x1>;
...

child {
reg = <0x100 0x100>;
...
};
};
};
};

There would be 4 regions if you bind to /foo/bar:

// this is 64 bytes at 0x1000
DTPATH /foo/bar
DTINDEX prop_type=REG prop_index=0

// this is 64 bytes at 0x1800
DTPATH /foo/bar
DTINDEX prop_type=REG prop_index=1

// this is 16K at 0x2
DTPATH /foo/bar
DTINDEX prop_type=RANGES prop_index=0

// this is 256 bytes at 0x20100
DTPATH /foo/bar/child
DTINDEX prop_type=REG prop_index=0

Both ranges and the child reg are needed, since ranges could be a simple
ranges; that passes everything with no translation, and child nodes
could be absent-but-implied in some other cases (such as when they
represent PCI devices which can be probed -- we still need to map the
ranges that correspond to PCI controller windows).

  INTERRUPT - describes an interrupt for the device
  PCI_CONFIG_SPACE - describes config space for the device

 I would have expected this to be a REGION with a property of
 PCI_CONFIG_SPACE.

 Could be, if physical address is made optional.
 
 Or physical address is also a property, aka sub-region.

A subrecord of REGION is fine with me.

 Would we only need to expose phys addr for 1:1 mapping requirements?
 I'm not sure why we'd care to expose this otherwise.

 It's more important for non-PCI, where it avoids the need for userspace
 to parse the device tree to find the guest address (we'll usually want
 1:1), or to consolidate pages shared by multiple regions.  It could be
 nice for debugging, as well.
 
 So the device tree path is ripped straight from the system, so it's the
 actual 1:1, matching physical hardware, path.

Yes.

 Even for non-PCI we need to
 know if the region is pio/mmio32/mmio64/prefetchable/etc.

 Outside of PCI, what standardized form would you put such information
 in?  Where would the kernel get this information?  What does
 mmio32/mmio64 mean in this context?
 
 I could imagine a platform device described by ACPI that might want to
 differentiate.  The physical device doesn't get moved of course, but
 guest drivers might care how the device is described if we need to
 rebuild those ACPI tables.  ACPI might even be a good place to leverage
 these data structures... /me ducks.

ACPI info could be another subrecord type, but in the device tree
system-bus case we generally don't have this information at the generic
infrastructure level.  Drivers are expected to know how their devices'
regions should be mapped.

 BAR index could really just translate to a REGION instance number.

 How would that work if you make non-BAR things (such as config space)
 into regions?
 
 Put their instance numbers outside of the BAR region?  We have a fixed
 REGION space on PCI, so we could just define BAR0 == instance 0, BAR1 ==
 instance 1... ROM == instance 6, CONFIG == instance 0xF (or 7).

Seems more awkward than just having each region say what it is.  What do
you do to fill in the gaps?

-Scott



[PATCH 3/4] Add generic stubs for kvm stop check functions

2011-08-29 Thread Eric B Munson
This function is called from the watchdog code when a soft lockup is detected.
If this is an arch that does not support pvclock, this function is used.

Signed-off-by: Eric B Munson emun...@mgebm.net
---
 include/asm-generic/pvclock.h |   14 ++
 1 files changed, 14 insertions(+), 0 deletions(-)
 create mode 100644 include/asm-generic/pvclock.h

diff --git a/include/asm-generic/pvclock.h b/include/asm-generic/pvclock.h
new file mode 100644
index 000..ff046b6
--- /dev/null
+++ b/include/asm-generic/pvclock.h
@@ -0,0 +1,14 @@
+#ifndef _ASM_GENERIC_PVCLOCK_H
+#define _ASM_GENERIC_PVCLOCK_H
+
+
+/*
+ * These functions are used by architectures that support kvm to avoid issuing
+ * false soft lockup messages.
+ */
+static inline bool kvm_check_and_clear_host_stopped(int cpu)
+{
+   return false;
+}
+
+#endif
-- 
1.7.4.1



[PATCH 2/4] Add functions to check if the host has stopped the vm

2011-08-29 Thread Eric B Munson
When a host stops or suspends a VM it will set a flag to show this.  The
watchdog will use these functions to determine if a softlockup is real, or the
result of a suspended VM.

Signed-off-by: Eric B Munson emun...@mgebm.net
---
 arch/x86/include/asm/pvclock.h |2 ++
 arch/x86/kernel/kvmclock.c |   14 ++
 2 files changed, 16 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h
index a518c0a..dd59ad0 100644
--- a/arch/x86/include/asm/pvclock.h
+++ b/arch/x86/include/asm/pvclock.h
@@ -13,6 +13,8 @@ void pvclock_read_wallclock(struct pvclock_wall_clock *wall,
struct timespec *ts);
 void pvclock_resume(void);
 
+bool kvm_check_and_clear_host_stopped(int cpu);
+
 /*
  * Scale a 64-bit delta by scaling and multiplying by a 32-bit fraction,
  * yielding a 64-bit result.
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index c1a0188..5f60d2b 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -113,6 +113,20 @@ static void kvm_get_preset_lpj(void)
preset_lpj = lpj;
 }
 
+bool kvm_check_and_clear_host_stopped(int cpu)
+{
+   bool ret = false;
+   struct pvclock_vcpu_time_info *src;
+
+   src = &per_cpu(hv_clock, cpu);
+   if ((src->flags & PVCLOCK_GUEST_STOPPED) != 0) {
+   src->flags = src->flags & (~PVCLOCK_GUEST_STOPPED);
+   ret = true;
+   }
+
+   return ret;
+}
+
 static struct clocksource kvm_clock = {
.name = "kvm-clock",
.read = kvm_clock_get_cycles,
-- 
1.7.4.1



[PATCH 0/4] Avoid soft lockup message when KVM is stopped by host

2011-08-29 Thread Eric B Munson
Currently, when qemu stops a guest kernel, that guest will issue a soft lockup
message when it resumes.  This set provides the ability for qemu to communicate
to the guest that it has been stopped.  When the guest hits the watchdog on
resume it will check if it was suspended before issuing the warning.

Eric B Munson (4):
  Add flag to indicate that a vm was stopped by the host
  Add functions to check if the host has stopped the vm
  Add generic stubs for kvm stop check functions
  Add check for suspended vm in softlockup detector

 arch/x86/include/asm/pvclock-abi.h |1 +
 arch/x86/include/asm/pvclock.h |2 ++
 arch/x86/kernel/kvmclock.c |   14 ++
 include/asm-generic/pvclock.h  |   14 ++
 kernel/watchdog.c  |   12 
 5 files changed, 43 insertions(+), 0 deletions(-)
 create mode 100644 include/asm-generic/pvclock.h

-- 
1.7.4.1



[PATCH 4/4] Add check for suspended vm in softlockup detector

2011-08-29 Thread Eric B Munson
A suspended VM can cause spurious soft lockup warnings.  To avoid these, the
watchdog now checks if the kernel knows it was stopped by the host and skips
the warning if so.

Signed-off-by: Eric B Munson emun...@mgebm.net
---
 kernel/watchdog.c |   12 
 1 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 36491cd..4cbb69f 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -25,6 +25,7 @@
 #include <linux/sysctl.h>
 
 #include <asm/irq_regs.h>
+#include <asm/pvclock.h>
 #include <linux/perf_event.h>
 
 int watchdog_enabled = 1;
@@ -292,6 +293,17 @@ static enum hrtimer_restart watchdog_timer_fn(struct 
hrtimer *hrtimer)
 */
duration = is_softlockup(touch_ts);
if (unlikely(duration)) {
+   /*
+* If a virtual machine is stopped by the host it can look to
+* the watchdog like a soft lockup, check to see if the host
+* stopped the vm before we issue the warning
+*/
+   if (kvm_check_and_clear_host_stopped(get_cpu())) {
+   put_cpu();
+   return HRTIMER_RESTART;
+   }
+   put_cpu();
+
/* only warn once */
if (__this_cpu_read(soft_watchdog_warn) == true)
return HRTIMER_RESTART;
-- 
1.7.4.1



[PATCH 1/4] Add flag to indicate that a vm was stopped by the host

2011-08-29 Thread Eric B Munson
This flag will be used to check if the vm was stopped by the host when a soft
lockup was detected.

Signed-off-by: Eric B Munson emun...@mgebm.net
---
 arch/x86/include/asm/pvclock-abi.h |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/pvclock-abi.h 
b/arch/x86/include/asm/pvclock-abi.h
index 35f2d19..6167fd7 100644
--- a/arch/x86/include/asm/pvclock-abi.h
+++ b/arch/x86/include/asm/pvclock-abi.h
@@ -40,5 +40,6 @@ struct pvclock_wall_clock {
 } __attribute__((__packed__));
 
 #define PVCLOCK_TSC_STABLE_BIT (1 << 0)
+#define PVCLOCK_GUEST_STOPPED  (1 << 1)
 #endif /* __ASSEMBLY__ */
 #endif /* _ASM_X86_PVCLOCK_ABI_H */
-- 
1.7.4.1



[PATCH v2] KVM: APIC: avoid instruction emulation for EOI writes

2011-08-29 Thread Tian, Kevin
v2 changes:
define exit qualification fields for APIC-Access in vmx.h
use apic_reg_write instead of apic_set_eoi, to avoid breaking tracing
add fasteoi option to allow disabling this acceleration



commit 2a66a12cb6928c962d24907e6d39b6eb9ac94b4b
Author: Kevin Tian kevin.t...@intel.com
Date:   Mon Aug 29 13:08:28 2011 +0800

KVM: APIC: avoid instruction emulation for EOI writes

Instruction emulation for EOI writes can be skipped, since a sane
guest simply uses MOV instead of string operations. This is a nice
improvement when the guest doesn't support x2apic or Hyper-V EOI
support.

With a single VM, a ~8% bandwidth improvement is observed
(7.4Gbps -> 8Gbps), by saving ~5% of cycles from EOI emulation.

Signed-off-by: Kevin Tian kevin.t...@intel.com
Based on earlier work from:
Signed-off-by: Eddie Dong eddie.d...@intel.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 2caf290..31f180c 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -350,6 +350,18 @@ enum vmcs_field {
 #define DEBUG_REG_ACCESS_REG(eq)(((eq) >> 8) & 0xf) /* 11:8, general purpose reg. */
 
 
+/*
+ * Exit Qualifications for APIC-Access
+ */
+#define APIC_ACCESS_OFFSET  0xfff   /* 11:0, offset within the APIC page */
+#define APIC_ACCESS_TYPE0xf000  /* 15:12, access type */
+#define TYPE_LINEAR_APIC_INST_READ  (0 << 12)
+#define TYPE_LINEAR_APIC_INST_WRITE (1 << 12)
+#define TYPE_LINEAR_APIC_INST_FETCH (2 << 12)
+#define TYPE_LINEAR_APIC_EVENT  (3 << 12)
+#define TYPE_PHYSICAL_APIC_EVENT(10 << 12)
+#define TYPE_PHYSICAL_APIC_INST (15 << 12)
+
 /* segment AR */
 #define SEGMENT_AR_L_MASK (1  13)
 
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 57dcbd4..52645f2 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -864,6 +864,15 @@ static int apic_mmio_write(struct kvm_io_device *this,
return 0;
 }
 
+void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu)
+{
+   struct kvm_lapic *apic = vcpu->arch.apic;
+
+   if (apic)
+   apic_reg_write(vcpu->arch.apic, APIC_EOI, 0);
+}
+EXPORT_SYMBOL_GPL(kvm_lapic_set_eoi);
+
 void kvm_free_lapic(struct kvm_vcpu *vcpu)
 {
 if (!vcpu->arch.apic)
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 52c9e6b..8287243 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -26,6 +26,7 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu);
 void kvm_lapic_reset(struct kvm_vcpu *vcpu);
 u64 kvm_lapic_get_cr8(struct kvm_vcpu *vcpu);
 void kvm_lapic_set_tpr(struct kvm_vcpu *vcpu, unsigned long cr8);
+void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu);
 void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value);
 u64 kvm_lapic_get_base(struct kvm_vcpu *vcpu);
 void kvm_apic_set_version(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 5e8d411..9d0364b 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -71,6 +71,9 @@ module_param(vmm_exclusive, bool, S_IRUGO);
 static int __read_mostly yield_on_hlt = 1;
 module_param(yield_on_hlt, bool, S_IRUGO);
 
+static int __read_mostly fasteoi = 1;
+module_param(fasteoi, bool, S_IRUGO);
+
 /*
  * If nested=1, nested virtualization is supported, i.e., guests may use
  * VMX and be a hypervisor for its own guests. If nested=0, guests may not
@@ -4540,6 +4543,24 @@ static int handle_xsetbv(struct kvm_vcpu *vcpu)
 
 static int handle_apic_access(struct kvm_vcpu *vcpu)
 {
+   if (likely(fasteoi)) {
+   unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+   int access_type, offset;
+
+   access_type = exit_qualification & APIC_ACCESS_TYPE;
+   offset = exit_qualification & APIC_ACCESS_OFFSET;
+   /*
+* A sane guest uses MOV to write EOI, and the written
+* value is ignored. So short-circuit here to avoid
+* heavy instruction emulation.
+*/
+   if ((access_type == TYPE_LINEAR_APIC_INST_WRITE) &&
+   (offset == APIC_EOI)) {
+   kvm_lapic_set_eoi(vcpu);
+   skip_emulated_instruction(vcpu);
+   return 1;
+   }
+   }
return emulate_instruction(vcpu, 0) == EMULATE_DONE;
 }


Thanks
Kevin


20110829_kvm_eoi_opt.patch
Description: 20110829_kvm_eoi_opt.patch


Re: kvm PCI assignment VFIO ramblings

2011-08-29 Thread David Gibson
eOn Fri, Aug 26, 2011 at 01:17:05PM -0700, Aaron Fabbri wrote:
[snip]
 Yes.  In essence, I'd rather not have to run any other admin processes.
 Doing things programmatically, on the fly, from each process, is the
 cleanest model right now.

The persistent group model doesn't necessarily prevent that.
There's no reason your program can't use the administrative interface
as well as the use interface, and I don't see that making the admin
interface separate and persistent makes this any harder.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



[PATCH v3 01/11] KVM: MMU: avoid pte_list_desc running out in kvm_mmu_pte_write

2011-08-29 Thread Xiao Guangrong
kvm_mmu_pte_write is unsafe: we need to allocate a pte_list_desc in the
function when a spte is prefetched, but unfortunately we cannot know how many
sptes need to be prefetched on this path. That means we can run out of free
pte_list_desc objects in the cache, triggering the BUG_ON(). Also, some paths
do not fill the cache, such as emulation of an INS instruction that does not
trigger a page fault.

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c |   25 -
 1 files changed, 20 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 5d7fbf0..b01afee 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -592,6 +592,11 @@ static int mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache,
return 0;
 }
 
+static int mmu_memory_cache_free_objects(struct kvm_mmu_memory_cache *cache)
+{
+   return cache->nobjs;
+}
+
 static void mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc,
  struct kmem_cache *cache)
 {
@@ -969,6 +974,14 @@ static unsigned long *gfn_to_rmap(struct kvm *kvm, gfn_t gfn, int level)
	return linfo->rmap_pde;
 }
 
+static bool rmap_can_add(struct kvm_vcpu *vcpu)
+{
+   struct kvm_mmu_memory_cache *cache;
+
+   cache = &vcpu->arch.mmu_pte_list_desc_cache;
+   return mmu_memory_cache_free_objects(cache);
+}
+
 static int rmap_add(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn)
 {
struct kvm_mmu_page *sp;
@@ -3585,6 +3598,12 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
break;
}
 
+   /*
+* No need to care whether memory allocation is successful
+* or not, since pte prefetch is skipped if it does not have
+* enough objects in the cache.
+*/
+   mmu_topup_memory_caches(vcpu);
	spin_lock(&vcpu->kvm->mmu_lock);
	if (atomic_read(&vcpu->kvm->arch.invlpg_counter) != invlpg_counter)
	gentry = 0;
@@ -3655,7 +3674,7 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
	mmu_page_zap_pte(vcpu->kvm, sp, spte);
	if (gentry &&
	  !((sp->role.word ^ vcpu->arch.mmu.base_role.word)
-  & mask.word))
+  & mask.word) && rmap_can_add(vcpu))
	mmu_pte_write_new_pte(vcpu, sp, spte, gentry);
	if (!remote_flush && need_remote_flush(entry, *spte))
	remote_flush = true;
@@ -3716,10 +3735,6 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u32 error_code,
goto out;
}
 
-   r = mmu_topup_memory_caches(vcpu);
-   if (r)
-   goto out;
-
er = x86_emulate_instruction(vcpu, cr2, 0, insn, insn_len);
 
switch (er) {
-- 
1.7.5.4
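As a side note, the guard pattern this patch introduces — check a pre-filled cache before speculating, and never allocate on the hot path — can be sketched in isolation. The struct and names below are toy stand-ins for kvm_mmu_memory_cache and rmap_can_add(), not the kernel's actual definitions:

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_CAPACITY 4

/* Toy analogue of kvm_mmu_memory_cache: objects are pre-allocated up
 * front so the hot path never calls the allocator. */
struct obj_cache {
    int nobjs;
    void *objects[CACHE_CAPACITY];
};

static int cache_free_objects(struct obj_cache *cache)
{
    return cache->nobjs;
}

/* Mirrors the rmap_can_add() idea: skip speculative prefetch when the
 * cache is empty instead of hitting a BUG_ON() deeper in the path. */
static int can_prefetch(struct obj_cache *cache)
{
    return cache_free_objects(cache) > 0;
}

static void *cache_alloc(struct obj_cache *cache)
{
    if (!cache->nobjs)
        return NULL;   /* caller should have checked can_prefetch() */
    return cache->objects[--cache->nobjs];
}
```

Because prefetch is purely an optimization, refusing it when the cache is empty is always safe — which is why the patch can ignore the return value of mmu_topup_memory_caches().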


[PATCH v3 02/11] KVM: x86: tag the instructions which are used to write page table

2011-08-29 Thread Xiao Guangrong
The idea is from Avi:
| tag instructions that are typically used to modify the page tables, and
| drop shadow if any other instruction is used.
| The list would include, I'd guess, and, or, bts, btc, mov, xchg, cmpxchg,
| and cmpxchg8b.

This patch tags those instructions; in a later patch, the shadow page is
dropped if it is written by any other instruction.

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/emulate.c |   35 ---
 1 files changed, 20 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 0453c07..e24c269 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -82,6 +82,7 @@
 #define RMExt   (4<<15) /* Opcode extension in ModRM r/m if mod == 3 */
 #define Sse (1<<18) /* SSE Vector instruction */
 /* Misc flags */
+#define PageTable   (1 << 19)   /* instruction used to write page table */
 #define Prot    (1<<21) /* instruction generates #UD if not in prot-mode */
 #define VendorSpecific (1<<22) /* Vendor specific instruction */
 #define NoAccess    (1<<23) /* Don't access memory (lea/invlpg/verr etc) */
@@ -3018,10 +3019,10 @@ static struct opcode group7_rm7[] = {
 
 static struct opcode group1[] = {
I(Lock, em_add),
-   I(Lock, em_or),
+   I(Lock | PageTable, em_or),
I(Lock, em_adc),
I(Lock, em_sbb),
-   I(Lock, em_and),
+   I(Lock | PageTable, em_and),
I(Lock, em_sub),
I(Lock, em_xor),
I(0, em_cmp),
@@ -3076,18 +3077,21 @@ static struct group_dual group7 = { {
 
 static struct opcode group8[] = {
N, N, N, N,
-   D(DstMem | SrcImmByte | ModRM), D(DstMem | SrcImmByte | ModRM | Lock),
-   D(DstMem | SrcImmByte | ModRM | Lock), D(DstMem | SrcImmByte | ModRM | Lock),
+   D(DstMem | SrcImmByte | ModRM),
+   D(DstMem | SrcImmByte | ModRM | Lock | PageTable),
+   D(DstMem | SrcImmByte | ModRM | Lock),
+   D(DstMem | SrcImmByte | ModRM | Lock | PageTable),
 };
 
 static struct group_dual group9 = { {
-   N, D(DstMem64 | ModRM | Lock), N, N, N, N, N, N,
+   N, D(DstMem64 | ModRM | Lock | PageTable), N, N, N, N, N, N,
 }, {
N, N, N, N, N, N, N, N,
 } };
 
 static struct opcode group11[] = {
-   I(DstMem | SrcImm | ModRM | Mov, em_mov), X7(D(Undefined)),
+   I(DstMem | SrcImm | ModRM | Mov | PageTable, em_mov),
+   X7(D(Undefined)),
 };
 
 static struct gprefix pfx_0f_6f_0f_7f = {
@@ -3099,7 +3103,7 @@ static struct opcode opcode_table[256] = {
I6ALU(Lock, em_add),
D(ImplicitOps | Stack | No64), D(ImplicitOps | Stack | No64),
/* 0x08 - 0x0F */
-   I6ALU(Lock, em_or),
+   I6ALU(Lock | PageTable, em_or),
D(ImplicitOps | Stack | No64), N,
/* 0x10 - 0x17 */
I6ALU(Lock, em_adc),
@@ -3108,7 +3112,7 @@ static struct opcode opcode_table[256] = {
I6ALU(Lock, em_sbb),
D(ImplicitOps | Stack | No64), D(ImplicitOps | Stack | No64),
/* 0x20 - 0x27 */
-   I6ALU(Lock, em_and), N, N,
+   I6ALU(Lock | PageTable, em_and), N, N,
/* 0x28 - 0x2F */
I6ALU(Lock, em_sub), N, I(ByteOp | DstAcc | No64, em_das),
/* 0x30 - 0x37 */
@@ -3141,11 +3145,11 @@ static struct opcode opcode_table[256] = {
G(ByteOp | DstMem | SrcImm | ModRM | No64 | Group, group1),
G(DstMem | SrcImmByte | ModRM | Group, group1),
I2bv(DstMem | SrcReg | ModRM, em_test),
-   I2bv(DstMem | SrcReg | ModRM | Lock, em_xchg),
+   I2bv(DstMem | SrcReg | ModRM | Lock | PageTable, em_xchg),
/* 0x88 - 0x8F */
-   I2bv(DstMem | SrcReg | ModRM | Mov, em_mov),
+   I2bv(DstMem | SrcReg | ModRM | Mov | PageTable, em_mov),
I2bv(DstReg | SrcMem | ModRM | Mov, em_mov),
-   I(DstMem | SrcNone | ModRM | Mov, em_mov_rm_sreg),
+   I(DstMem | SrcNone | ModRM | Mov | PageTable, em_mov_rm_sreg),
D(ModRM | SrcMem | NoAccess | DstReg),
I(ImplicitOps | SrcMem16 | ModRM, em_mov_sreg_rm),
G(0, group1A),
@@ -3158,7 +3162,7 @@ static struct opcode opcode_table[256] = {
II(ImplicitOps | Stack, em_popf, popf), N, N,
/* 0xA0 - 0xA7 */
I2bv(DstAcc | SrcMem | Mov | MemAbs, em_mov),
-   I2bv(DstMem | SrcAcc | Mov | MemAbs, em_mov),
+   I2bv(DstMem | SrcAcc | Mov | MemAbs | PageTable, em_mov),
I2bv(SrcSI | DstDI | Mov | String, em_mov),
I2bv(SrcSI | DstDI | String, em_cmp),
/* 0xA8 - 0xAF */
@@ -3255,18 +3259,19 @@ static struct opcode twobyte_table[256] = {
D(DstMem | SrcReg | Src2CL | ModRM), N, N,
/* 0xA8 - 0xAF */
D(ImplicitOps | Stack), D(ImplicitOps | Stack),
-   DI(ImplicitOps, rsm), D(DstMem | SrcReg | ModRM | BitOp | Lock),
+   DI(ImplicitOps, rsm),
+   D(DstMem | SrcReg | ModRM | BitOp | Lock | PageTable),
D(DstMem | SrcReg | Src2ImmByte | ModRM),
D(DstMem | SrcReg | Src2CL | ModRM),
D(ModRM), I(DstReg | SrcMem | ModRM, 

[PATCH v3 03/11] KVM: x86: retry non-page-table writing instruction

2011-08-29 Thread Xiao Guangrong
If the emulation is caused by a #PF and the instruction is not a
page-table-writing instruction, the VM-exit was caused by shadow page
protection; we can zap the shadow page and retry the instruction directly.

The idea is from Avi

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/include/asm/kvm_emulate.h |1 +
 arch/x86/include/asm/kvm_host.h|5 
 arch/x86/kvm/emulate.c |5 
 arch/x86/kvm/mmu.c |   22 +
 arch/x86/kvm/x86.c |   47 
 5 files changed, 75 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h
index 6040d11..fa87b63 100644
--- a/arch/x86/include/asm/kvm_emulate.h
+++ b/arch/x86/include/asm/kvm_emulate.h
@@ -362,6 +362,7 @@ enum x86_intercept {
 #endif
 
 int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len);
+bool page_table_writing_insn(struct x86_emulate_ctxt *ctxt);
 #define EMULATION_FAILED -1
 #define EMULATION_OK 0
 #define EMULATION_RESTART 1
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6ab4241..27a25df 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -443,6 +443,9 @@ struct kvm_vcpu_arch {
 
cpumask_var_t wbinvd_dirty_mask;
 
+   unsigned long last_retry_eip;
+   unsigned long last_retry_addr;
+
struct {
bool halted;
gfn_t gfns[roundup_pow_of_two(ASYNC_PF_PER_VCPU)];
@@ -689,6 +692,7 @@ enum emulation_result {
 #define EMULTYPE_NO_DECODE (1 << 0)
 #define EMULTYPE_TRAP_UD   (1 << 1)
 #define EMULTYPE_SKIP  (1 << 2)
+#define EMULTYPE_RETRY (1 << 3)
 int x86_emulate_instruction(struct kvm_vcpu *vcpu, unsigned long cr2,
int emulation_type, void *insn, int insn_len);
 
@@ -753,6 +757,7 @@ void kvm_mmu_flush_tlb(struct kvm_vcpu *vcpu);
 void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
   const u8 *new, int bytes,
   bool guest_initiated);
+int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn);
 int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva);
 void __kvm_mmu_free_some_pages(struct kvm_vcpu *vcpu);
 int kvm_mmu_load(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index e24c269..c62424e 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -3691,6 +3691,11 @@ done:
return (rc != X86EMUL_CONTINUE) ? EMULATION_FAILED : EMULATION_OK;
 }
 
+bool page_table_writing_insn(struct x86_emulate_ctxt *ctxt)
+{
+   return ctxt->d & PageTable;
+}
+
 static bool string_insn_completed(struct x86_emulate_ctxt *ctxt)
 {
/* The second termination condition only applies for REPE
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index b01afee..26aae11 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1997,7 +1997,7 @@ void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned int goal_nr_mmu_pages)
	kvm->arch.n_max_mmu_pages = goal_nr_mmu_pages;
 }
 
-static int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn)
+int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn)
 {
struct kvm_mmu_page *sp;
struct hlist_node *node;
@@ -2007,6 +2007,7 @@ static int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn)
	pgprintk("%s: looking for gfn %llx\n", __func__, gfn);
	r = 0;
 
+   spin_lock(&kvm->mmu_lock);
	for_each_gfn_indirect_valid_sp(kvm, sp, gfn, node) {
	pgprintk("%s: gfn %llx role %x\n", __func__, gfn,
		 sp->role.word);
@@ -2014,8 +2015,10 @@ static int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn)
	kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
	}
	kvm_mmu_commit_zap_page(kvm, &invalid_list);
+   spin_unlock(&kvm->mmu_lock);
	return r;
 }
+EXPORT_SYMBOL_GPL(kvm_mmu_unprotect_page);
 
 static void mmu_unshadow(struct kvm *kvm, gfn_t gfn)
 {
@@ -3697,9 +3700,7 @@ int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva)
 
	gpa = kvm_mmu_gva_to_gpa_read(vcpu, gva, NULL);
 
-   spin_lock(&vcpu->kvm->mmu_lock);
	r = kvm_mmu_unprotect_page(vcpu->kvm, gpa >> PAGE_SHIFT);
-   spin_unlock(&vcpu->kvm->mmu_lock);
return r;
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_unprotect_page_virt);
@@ -3720,10 +3721,18 @@ void __kvm_mmu_free_some_pages(struct kvm_vcpu *vcpu)
	kvm_mmu_commit_zap_page(vcpu->kvm, &invalid_list);
 }
 
+static bool is_mmio_page_fault(struct kvm_vcpu *vcpu, gva_t addr)
+{
+   if (vcpu->arch.mmu.direct_map || mmu_is_nested(vcpu))
+   return vcpu_match_mmio_gpa(vcpu, addr);
+
+   return vcpu_match_mmio_gva(vcpu, addr);
+}
+
 int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u32 error_code,
   void *insn, int insn_len)
 {
-   int r;
+   int r, emulation_type = 

[PATCH v3 04/11] KVM: x86: cleanup port-in/port-out emulated

2011-08-29 Thread Xiao Guangrong
Remove the code duplicated between emulator_pio_in_emulated and
emulator_pio_out_emulated.

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/x86.c |   59 ++-
 1 files changed, 26 insertions(+), 33 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1afe59e..85e6b4e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4327,32 +4327,24 @@ static int kernel_pio(struct kvm_vcpu *vcpu, void *pd)
return r;
 }
 
-
-static int emulator_pio_in_emulated(struct x86_emulate_ctxt *ctxt,
-   int size, unsigned short port, void *val,
-   unsigned int count)
+static int emulator_pio_in_out(struct kvm_vcpu *vcpu, int size,
+  unsigned short port, void *val,
+  unsigned int count, bool in)
 {
-   struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
-
-   if (vcpu->arch.pio.count)
-   goto data_avail;
-
-   trace_kvm_pio(0, port, size, count);
+   trace_kvm_pio(!in, port, size, count);
 
	vcpu->arch.pio.port = port;
-   vcpu->arch.pio.in = 1;
+   vcpu->arch.pio.in = in;
	vcpu->arch.pio.count  = count;
	vcpu->arch.pio.size = size;
 
	if (!kernel_pio(vcpu, vcpu->arch.pio_data)) {
-   data_avail:
-   memcpy(val, vcpu->arch.pio_data, size * count);
	vcpu->arch.pio.count = 0;
	return 1;
	}
 
	vcpu->run->exit_reason = KVM_EXIT_IO;
-   vcpu->run->io.direction = KVM_EXIT_IO_IN;
+   vcpu->run->io.direction = in ? KVM_EXIT_IO_IN : KVM_EXIT_IO_OUT;
	vcpu->run->io.size = size;
	vcpu->run->io.data_offset = KVM_PIO_PAGE_OFFSET * PAGE_SIZE;
	vcpu->run->io.count = count;
@@ -4361,36 +4353,37 @@ static int emulator_pio_in_emulated(struct x86_emulate_ctxt *ctxt,
	return 0;
 }
 
-static int emulator_pio_out_emulated(struct x86_emulate_ctxt *ctxt,
-int size, unsigned short port,
-const void *val, unsigned int count)
+static int emulator_pio_in_emulated(struct x86_emulate_ctxt *ctxt,
+   int size, unsigned short port, void *val,
+   unsigned int count)
 {
	struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
+   int ret;
 
-   trace_kvm_pio(1, port, size, count);
-
-   vcpu->arch.pio.port = port;
-   vcpu->arch.pio.in = 0;
-   vcpu->arch.pio.count = count;
-   vcpu->arch.pio.size = size;
-
-   memcpy(vcpu->arch.pio_data, val, size * count);
+   if (vcpu->arch.pio.count)
+   goto data_avail;
 
-   if (!kernel_pio(vcpu, vcpu->arch.pio_data)) {
+   ret = emulator_pio_in_out(vcpu, size, port, val, count, true);
+   if (ret) {
+data_avail:
+   memcpy(val, vcpu->arch.pio_data, size * count);
	vcpu->arch.pio.count = 0;
	return 1;
	}
 
-   vcpu->run->exit_reason = KVM_EXIT_IO;
-   vcpu->run->io.direction = KVM_EXIT_IO_OUT;
-   vcpu->run->io.size = size;
-   vcpu->run->io.data_offset = KVM_PIO_PAGE_OFFSET * PAGE_SIZE;
-   vcpu->run->io.count = count;
-   vcpu->run->io.port = port;
-
	return 0;
 }
 
+static int emulator_pio_out_emulated(struct x86_emulate_ctxt *ctxt,
+int size, unsigned short port,
+const void *val, unsigned int count)
+{
+   struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
+
+   memcpy(vcpu->arch.pio_data, val, size * count);
+   return emulator_pio_in_out(vcpu, size, port, (void *)val, count, false);
+}
+
 static unsigned long get_segment_base(struct kvm_vcpu *vcpu, int seg)
 {
return kvm_x86_ops-get_segment_base(vcpu, seg);
-- 
1.7.5.4
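The shape of this refactor — one helper taking a direction flag, with the in/out wrappers differing only in when they copy the data buffer — can be sketched outside the kernel like this (the struct and field names are invented for the illustration):

```c
#include <assert.h>

/* Toy model of emulator_pio_in_out(): record the request once, with a
 * single direction flag, instead of duplicating the bookkeeping in the
 * IN and OUT paths. */
struct pio_state {
    unsigned short port;
    int in;              /* 1 = IN (device -> guest), 0 = OUT */
    unsigned int count;
    unsigned int size;
};

static int pio_in_out(struct pio_state *s, int size, unsigned short port,
                      unsigned int count, int in)
{
    s->port = port;
    s->in = in;
    s->count = count;
    s->size = size;
    /* The real helper would now try kernel_pio() and, on failure,
     * fill vcpu->run and exit to userspace. */
    return 0;
}
```

The only asymmetry left in the real patch is the data copy: OUT copies the buffer in before calling the helper, IN copies it out afterward.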



[PATCH v3 05/11] KVM: MMU: do not mark accessed bit on pte write path

2011-08-29 Thread Xiao Guangrong
In the current code, the accessed bit is always set when a page fault occurs,
so there is no need to set it on the pte write path.

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/include/asm/kvm_host.h |1 -
 arch/x86/kvm/mmu.c  |   22 +-
 2 files changed, 1 insertions(+), 22 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 27a25df..58ea3a7 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -356,7 +356,6 @@ struct kvm_vcpu_arch {
gfn_t last_pt_write_gfn;
int   last_pt_write_count;
u64  *last_pte_updated;
-   gfn_t last_pte_gfn;
 
struct fpu guest_fpu;
u64 xcr0;
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 26aae11..7ec2a6a 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2206,11 +2206,6 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
if (set_mmio_spte(sptep, gfn, pfn, pte_access))
return 0;
 
-   /*
-* We don't set the accessed bit, since we sometimes want to see
-* whether the guest actually used the pte (in order to detect
-* demand paging).
-*/
spte = PT_PRESENT_MASK;
if (!speculative)
spte |= shadow_accessed_mask;
@@ -2361,10 +2356,8 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 
*sptep,
}
}
kvm_release_pfn_clean(pfn);
-   if (speculative) {
+   if (speculative)
	vcpu->arch.last_pte_updated = sptep;
-   vcpu->arch.last_pte_gfn = gfn;
-   }
 }
 
 static void nonpaging_new_cr3(struct kvm_vcpu *vcpu)
@@ -3532,18 +3525,6 @@ static bool last_updated_pte_accessed(struct kvm_vcpu *vcpu)
	return !!(spte && (*spte & shadow_accessed_mask));
 }
 
-static void kvm_mmu_access_page(struct kvm_vcpu *vcpu, gfn_t gfn)
-{
-   u64 *spte = vcpu->arch.last_pte_updated;
-
-   if (spte
-	&& vcpu->arch.last_pte_gfn == gfn
-	&& shadow_accessed_mask
-	&& !(*spte & shadow_accessed_mask)
-	&& is_shadow_present_pte(*spte))
-   set_bit(PT_ACCESSED_SHIFT, (unsigned long *)spte);
-}
-
 void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
   const u8 *new, int bytes,
   bool guest_initiated)
@@ -3614,7 +3595,6 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
++vcpu-kvm-stat.mmu_pte_write;
trace_kvm_mmu_audit(vcpu, AUDIT_PRE_PTE_WRITE);
if (guest_initiated) {
-   kvm_mmu_access_page(vcpu, gfn);
	if (gfn == vcpu->arch.last_pt_write_gfn
	    && !last_updated_pte_accessed(vcpu)) {
	++vcpu->arch.last_pt_write_count;
-- 
1.7.5.4



[PATCH v3 06/11] KVM: MMU: cleanup FNAME(invlpg)

2011-08-29 Thread Xiao Guangrong
Directly use mmu_page_zap_pte to zap sptes in FNAME(invlpg), and remove the
code duplicated between FNAME(invlpg) and FNAME(sync_page).

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c |   16 ++--
 arch/x86/kvm/paging_tmpl.h |   42 +++---
 2 files changed, 25 insertions(+), 33 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 7ec2a6a..ed3e778 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1808,7 +1808,7 @@ static void validate_direct_spte(struct kvm_vcpu *vcpu, 
u64 *sptep,
}
 }
 
-static void mmu_page_zap_pte(struct kvm *kvm, struct kvm_mmu_page *sp,
+static bool mmu_page_zap_pte(struct kvm *kvm, struct kvm_mmu_page *sp,
 u64 *spte)
 {
u64 pte;
@@ -1816,17 +1816,21 @@ static void mmu_page_zap_pte(struct kvm *kvm, struct kvm_mmu_page *sp,
 
	pte = *spte;
	if (is_shadow_present_pte(pte)) {
-   if (is_last_spte(pte, sp->role.level))
+   if (is_last_spte(pte, sp->role.level)) {
	drop_spte(kvm, spte);
-   else {
+   if (is_large_pte(pte))
+   --kvm->stat.lpages;
+   } else {
	child = page_header(pte & PT64_BASE_ADDR_MASK);
	drop_parent_pte(child, spte);
	}
-   } else if (is_mmio_spte(pte))
+   return true;
+   }
+
+   if (is_mmio_spte(pte))
	mmu_spte_clear_no_track(spte);
 
-   if (is_large_pte(pte))
-   --kvm->stat.lpages;
+   return false;
 }
 
 static void kvm_mmu_page_unlink_children(struct kvm *kvm,
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 9299410..7862c05 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -656,6 +656,16 @@ out_unlock:
return 0;
 }
 
+static gpa_t FNAME(get_first_pte_gpa)(struct kvm_mmu_page *sp)
+{
+   int offset = 0;
+
+   if (PTTYPE == 32)
+   offset = sp->role.quadrant << PT64_LEVEL_BITS;
+
+   return gfn_to_gpa(sp->gfn) + offset * sizeof(pt_element_t);
+}
+
 static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
 {
struct kvm_shadow_walk_iterator iterator;
@@ -663,7 +673,6 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
gpa_t pte_gpa = -1;
int level;
u64 *sptep;
-   int need_flush = 0;
 
vcpu_clear_mmio_info(vcpu, gva);
 
@@ -675,36 +684,20 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
 
	sp = page_header(__pa(sptep));
	if (is_last_spte(*sptep, level)) {
-   int offset, shift;
-
	if (!sp->unsync)
	break;
 
-   shift = PAGE_SHIFT -
-	 (PT_LEVEL_BITS - PT64_LEVEL_BITS) * level;
-   offset = sp->role.quadrant << shift;
-
-   pte_gpa = (sp->gfn << PAGE_SHIFT) + offset;
+   pte_gpa = FNAME(get_first_pte_gpa)(sp);
	pte_gpa += (sptep - sp->spt) * sizeof(pt_element_t);
 
-   if (is_shadow_present_pte(*sptep)) {
-   if (is_large_pte(*sptep))
-   --vcpu->kvm->stat.lpages;
-   drop_spte(vcpu->kvm, sptep);
-   need_flush = 1;
-   } else if (is_mmio_spte(*sptep))
-   mmu_spte_clear_no_track(sptep);
-
-   break;
+   if (mmu_page_zap_pte(vcpu->kvm, sp, sptep))
+   kvm_flush_remote_tlbs(vcpu->kvm);
	}
 
	if (!is_shadow_present_pte(*sptep) || !sp->unsync_children)
break;
}
 
-   if (need_flush)
-   kvm_flush_remote_tlbs(vcpu->kvm);
-
	atomic_inc(&vcpu->kvm->arch.invlpg_counter);
 
	spin_unlock(&vcpu->kvm->mmu_lock);
@@ -769,19 +762,14 @@ static gpa_t FNAME(gva_to_gpa_nested)(struct kvm_vcpu *vcpu, gva_t vaddr,
  */
 static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 {
-   int i, offset, nr_present;
+   int i, nr_present = 0;
bool host_writable;
gpa_t first_pte_gpa;
 
-   offset = nr_present = 0;
-
/* direct kvm_mmu_page can not be unsync. */
	BUG_ON(sp->role.direct);
 
-   if (PTTYPE == 32)
-   offset = sp->role.quadrant << PT64_LEVEL_BITS;
-
-   first_pte_gpa = gfn_to_gpa(sp->gfn) + offset * sizeof(pt_element_t);
+   first_pte_gpa = FNAME(get_first_pte_gpa)(sp);
 
	for (i = 0; i < PT64_ENT_PER_PAGE; i++) {
unsigned pte_access;
-- 
1.7.5.4
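The new mmu_page_zap_pte() contract — return true exactly when something present was dropped, so the caller knows a remote TLB flush is needed — can be illustrated with a toy spte encoding. The flag bits below are invented purely for the sketch; real sptes carry far more state:

```c
#include <assert.h>
#include <stdbool.h>

/* Invented flag bits, for the illustration only. */
#define SPTE_PRESENT (1ull << 0)
#define SPTE_MMIO    (1ull << 1)

/* Sketch of the mmu_page_zap_pte() return contract: true means a present
 * spte was dropped and the caller should flush remote TLBs; MMIO and
 * not-present sptes are cleared (or ignored) without requiring a flush. */
static bool zap_pte(unsigned long long *spte)
{
    unsigned long long pte = *spte;

    if (pte & SPTE_PRESENT) {
        *spte = 0;
        return true;
    }
    if (pte & SPTE_MMIO)
        *spte = 0;   /* tracked specially in the kernel; no flush needed */
    return false;
}
```

Folding the flush decision into the return value is what lets FNAME(invlpg) above drop its local need_flush bookkeeping.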


[PATCH v3 07/11] KVM: MMU: fast prefetch spte on invlpg path

2011-08-29 Thread Xiao Guangrong
Fast prefetch spte for the unsync shadow page on invlpg path

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/include/asm/kvm_host.h |4 +---
 arch/x86/kvm/mmu.c  |   38 +++---
 arch/x86/kvm/paging_tmpl.h  |   30 ++
 arch/x86/kvm/x86.c  |4 ++--
 4 files changed, 36 insertions(+), 40 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 58ea3a7..927ba73 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -460,7 +460,6 @@ struct kvm_arch {
unsigned int n_requested_mmu_pages;
unsigned int n_max_mmu_pages;
unsigned int indirect_shadow_pages;
-   atomic_t invlpg_counter;
struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
/*
 * Hash table of struct kvm_mmu_page.
@@ -754,8 +753,7 @@ int fx_init(struct kvm_vcpu *vcpu);
 
 void kvm_mmu_flush_tlb(struct kvm_vcpu *vcpu);
 void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
-  const u8 *new, int bytes,
-  bool guest_initiated);
+  const u8 *new, int bytes);
 int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn);
 int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva);
 void __kvm_mmu_free_some_pages(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index ed3e778..f6de2fc 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3530,8 +3530,7 @@ static bool last_updated_pte_accessed(struct kvm_vcpu *vcpu)
 }
 
 void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
-  const u8 *new, int bytes,
-  bool guest_initiated)
+  const u8 *new, int bytes)
 {
	gfn_t gfn = gpa >> PAGE_SHIFT;
union kvm_mmu_page_role mask = { .word = 0 };
@@ -3540,7 +3539,7 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
LIST_HEAD(invalid_list);
u64 entry, gentry, *spte;
unsigned pte_size, page_offset, misaligned, quadrant, offset;
-   int level, npte, invlpg_counter, r, flooded = 0;
+   int level, npte, r, flooded = 0;
bool remote_flush, local_flush, zap_page;
 
/*
@@ -3555,19 +3554,16 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 
pgprintk(%s: gpa %llx bytes %d\n, __func__, gpa, bytes);
 
-   invlpg_counter = atomic_read(&vcpu->kvm->arch.invlpg_counter);
-
/*
 * Assume that the pte write on a page table of the same type
 * as the current vcpu paging mode since we update the sptes only
 * when they have the same mode.
 */
-   if ((is_pae(vcpu) && bytes == 4) || !new) {
+   if (is_pae(vcpu) && bytes == 4) {
/* Handle a 32-bit guest writing two halves of a 64-bit gpte */
-   if (is_pae(vcpu)) {
-   gpa &= ~(gpa_t)7;
-   bytes = 8;
-   }
+   gpa &= ~(gpa_t)7;
+   bytes = 8;
+
	r = kvm_read_guest(vcpu->kvm, gpa, &gentry, min(bytes, 8));
if (r)
gentry = 0;
@@ -3593,22 +3589,18 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 */
mmu_topup_memory_caches(vcpu);
	spin_lock(&vcpu->kvm->mmu_lock);
-   if (atomic_read(&vcpu->kvm->arch.invlpg_counter) != invlpg_counter)
-   gentry = 0;
kvm_mmu_free_some_pages(vcpu);
	++vcpu->kvm->stat.mmu_pte_write;
trace_kvm_mmu_audit(vcpu, AUDIT_PRE_PTE_WRITE);
-   if (guest_initiated) {
-   if (gfn == vcpu->arch.last_pt_write_gfn
-	&& !last_updated_pte_accessed(vcpu)) {
-   ++vcpu->arch.last_pt_write_count;
-   if (vcpu->arch.last_pt_write_count >= 3)
-   flooded = 1;
-   } else {
-   vcpu->arch.last_pt_write_gfn = gfn;
-   vcpu->arch.last_pt_write_count = 1;
-   vcpu->arch.last_pte_updated = NULL;
-   }
+   if (gfn == vcpu->arch.last_pt_write_gfn
+	&& !last_updated_pte_accessed(vcpu)) {
+   ++vcpu->arch.last_pt_write_count;
+   if (vcpu->arch.last_pt_write_count >= 3)
+   flooded = 1;
+   } else {
+   vcpu->arch.last_pt_write_gfn = gfn;
+   vcpu->arch.last_pt_write_count = 1;
+   vcpu->arch.last_pte_updated = NULL;
}
 
mask.cr0_wp = mask.cr4_pae = mask.nxe = 1;
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 7862c05..3395ab2 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -670,20 +670,27 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
 {
struct kvm_shadow_walk_iterator iterator;
struct kvm_mmu_page *sp;
-   gpa_t pte_gpa = -1;

[PATCH v3 08/11] KVM: MMU: remove unnecessary kvm_mmu_free_some_pages

2011-08-29 Thread Xiao Guangrong
In kvm_mmu_pte_write we do not need to allocate shadow pages, so calling
kvm_mmu_free_some_pages there is unnecessary.

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c |1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index f6de2fc..9ac0dc8 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3589,7 +3589,6 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 */
mmu_topup_memory_caches(vcpu);
	spin_lock(&vcpu->kvm->mmu_lock);
-   kvm_mmu_free_some_pages(vcpu);
	++vcpu->kvm->stat.mmu_pte_write;
trace_kvm_mmu_audit(vcpu, AUDIT_PRE_PTE_WRITE);
	if (gfn == vcpu->arch.last_pt_write_gfn
-- 
1.7.5.4



[PATCH v3 09/11] KVM: MMU: split kvm_mmu_pte_write function

2011-08-29 Thread Xiao Guangrong
kvm_mmu_pte_write is too long; split it up for better readability.

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c |  187 +++-
 1 files changed, 112 insertions(+), 75 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 9ac0dc8..cfe24fe 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3529,48 +3529,28 @@ static bool last_updated_pte_accessed(struct kvm_vcpu *vcpu)
	return !!(spte && (*spte & shadow_accessed_mask));
 }
 
-void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
-  const u8 *new, int bytes)
+static u64 mmu_pte_write_fetch_gpte(struct kvm_vcpu *vcpu, gpa_t *gpa,
+   const u8 *new, int *bytes)
 {
-   gfn_t gfn = gpa >> PAGE_SHIFT;
-   union kvm_mmu_page_role mask = { .word = 0 };
-   struct kvm_mmu_page *sp;
-   struct hlist_node *node;
-   LIST_HEAD(invalid_list);
-   u64 entry, gentry, *spte;
-   unsigned pte_size, page_offset, misaligned, quadrant, offset;
-   int level, npte, r, flooded = 0;
-   bool remote_flush, local_flush, zap_page;
-
-   /*
-* If we don't have indirect shadow pages, it means no page is
-* write-protected, so we can exit simply.
-*/
-   if (!ACCESS_ONCE(vcpu->kvm->arch.indirect_shadow_pages))
-   return;
-
-   zap_page = remote_flush = local_flush = false;
-   offset = offset_in_page(gpa);
-
-   pgprintk("%s: gpa %llx bytes %d\n", __func__, gpa, bytes);
+   u64 gentry;
+   int r;
 
/*
 * Assume that the pte write on a page table of the same type
 * as the current vcpu paging mode since we update the sptes only
 * when they have the same mode.
 */
-   if (is_pae(vcpu) && bytes == 4) {
+   if (is_pae(vcpu) && *bytes == 4) {
	/* Handle a 32-bit guest writing two halves of a 64-bit gpte */
-   gpa &= ~(gpa_t)7;
-   bytes = 8;
-
-   r = kvm_read_guest(vcpu->kvm, gpa, &gentry, min(bytes, 8));
+   *gpa &= ~(gpa_t)7;
+   *bytes = 8;
+   r = kvm_read_guest(vcpu->kvm, *gpa, &gentry, min(*bytes, 8));
if (r)
gentry = 0;
new = (const u8 *)gentry;
}
 
-   switch (bytes) {
+   switch (*bytes) {
case 4:
gentry = *(const u32 *)new;
break;
@@ -3582,71 +3562,128 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
break;
}
 
-   /*
-* No need to care whether memory allocation is successful
-* or not, since pte prefetch is skipped if it does not have
-* enough objects in the cache.
-*/
-   mmu_topup_memory_caches(vcpu);
-   spin_lock(&vcpu->kvm->mmu_lock);
-   ++vcpu->kvm->stat.mmu_pte_write;
-   trace_kvm_mmu_audit(vcpu, AUDIT_PRE_PTE_WRITE);
+   return gentry;
+}
+
+/*
+ * If we're seeing too many writes to a page, it may no longer be a page table,
+ * or we may be forking, in which case it is better to unmap the page.
+ */
+static bool detect_write_flooding(struct kvm_vcpu *vcpu, gfn_t gfn)
+{
+   bool flooded = false;
+
	if (gfn == vcpu->arch.last_pt_write_gfn
	    && !last_updated_pte_accessed(vcpu)) {
	++vcpu->arch.last_pt_write_count;
	if (vcpu->arch.last_pt_write_count >= 3)
-   flooded = 1;
+   flooded = true;
	} else {
	vcpu->arch.last_pt_write_gfn = gfn;
	vcpu->arch.last_pt_write_count = 1;
	vcpu->arch.last_pte_updated = NULL;
}
 
+   return flooded;
+}
+
+/*
+ * Misaligned accesses are too much trouble to fix up; also, they usually
+ * indicate a page is not used as a page table.
+ */
+static bool detect_write_misaligned(struct kvm_mmu_page *sp, gpa_t gpa,
+   int bytes)
+{
+   unsigned offset, pte_size, misaligned;
+
+   pgprintk("misaligned: gpa %llx bytes %d role %x\n",
+gpa, bytes, sp->role.word);
+
+   offset = offset_in_page(gpa);
+   pte_size = sp->role.cr4_pae ? 8 : 4;
+   misaligned = (offset ^ (offset + bytes - 1)) & ~(pte_size - 1);
+   misaligned |= bytes < 4;
+
+   return misaligned;
+}
+
+static u64 *get_written_sptes(struct kvm_mmu_page *sp, gpa_t gpa, int *nspte)
+{
+   unsigned page_offset, quadrant;
+   u64 *spte;
+   int level;
+
+   page_offset = offset_in_page(gpa);
+   level = sp->role.level;
+   *nspte = 1;
+   if (!sp->role.cr4_pae) {
+   page_offset <<= 1;  /* 32->64 */
+   /*
+* A 32-bit pde maps 4MB while the shadow pdes map
+* only 2MB.  So we need to double the offset again
+* and zap two pdes instead of one.
+*/
+   if (level == 
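Although the hunk is cut off above, the detect_write_flooding() helper shown earlier in this patch is self-contained enough to model: count consecutive writes to the same gfn whose last speculative spte was never accessed, and flag a flood at three. A toy version, with invented names:

```c
#include <assert.h>

/* Toy version of the write-flood heuristic: per-vcpu state, reset
 * whenever the written gfn changes or the last spte was accessed. */
struct flood_state {
    unsigned long last_gfn;
    int count;
};

/* Returns 1 when the write looks like flooding (>= 3 consecutive
 * unaccessed writes to the same gfn), mirroring detect_write_flooding(). */
static int note_write(struct flood_state *s, unsigned long gfn,
                      int pte_accessed)
{
    if (gfn == s->last_gfn && !pte_accessed) {
        if (++s->count >= 3)
            return 1;
    } else {
        s->last_gfn = gfn;
        s->count = 1;
    }
    return 0;
}
```

Patch 11 later replaces this per-vcpu state with a per-shadow-page counter, precisely because the "last speculative spte" tracked here may not belong to the written page.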

[PATCH v3 10/11] KVM: MMU: fix detecting misaligned accessed

2011-08-29 Thread Xiao Guangrong
Sometimes we modify only one byte of a pte to update a status bit; for
example, clear_bit() is used to clear the r/w bit in the Linux kernel, and
the 'andb' instruction is used there. In this case, kvm_mmu_pte_write treats
the write as a misaligned access and zaps the shadow page table.

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c |8 
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index cfe24fe..adaa160 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3601,6 +3601,14 @@ static bool detect_write_misaligned(struct kvm_mmu_page *sp, gpa_t gpa,
 
offset = offset_in_page(gpa);
	pte_size = sp->role.cr4_pae ? 8 : 4;
+
+   /*
+* Sometimes, the OS only writes the last byte to update status
+* bits, for example, in linux, the andb instruction is used in clear_bit().
+*/
+   if (!(offset & (pte_size - 1)) && bytes == 1)
+   return false;
+
misaligned = (offset ^ (offset + bytes - 1)) & ~(pte_size - 1);
misaligned |= bytes < 4;
 
-- 
1.7.5.4
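For readers following along without a kernel tree handy, here is a small userspace model of the check after this patch. It is a sketch for illustration only: `write_is_misaligned()` is a hypothetical stand-in for the kernel's `detect_write_misaligned()`, with the gpa/sp arguments reduced to a plain page offset and pte size.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Model of detect_write_misaligned() after this patch.  A write is
 * considered misaligned if it straddles a pte boundary or is shorter
 * than 4 bytes, EXCEPT for a 1-byte write at the start of a pte --
 * the 'andb' that clear_bit() emits for the low status byte.
 * pte_size is 8 with PAE/long mode, 4 otherwise.
 */
static bool write_is_misaligned(unsigned offset, int bytes, unsigned pte_size)
{
    unsigned misaligned;

    /* single status-byte update, e.g. clear_bit() via andb */
    if (!(offset & (pte_size - 1)) && bytes == 1)
        return false;

    /* does [offset, offset + bytes) cross a pte_size boundary? */
    misaligned = (offset ^ (offset + bytes - 1)) & ~(pte_size - 1);
    /* writes shorter than 4 bytes can't be a full pte update either */
    misaligned |= bytes < 4;

    return misaligned;
}
```

With this change, an aligned 1-byte write (offset 0x0, 0x8, ...) is no longer treated as misaligned, so the shadow page survives a status-bit update instead of being zapped.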

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 11/11] KVM: MMU: improve write flooding detected

2011-08-29 Thread Xiao Guangrong
Detecting write-flooding does not work well: when we handle a page write, if
the last speculative spte has not been accessed, we treat the page as
write-flooded. However, we install speculative sptes on many paths (pte
prefetch, page sync, ...), so the last speculative spte may not point to the
written page, and the written page may be accessed via other sptes; depending
on the Accessed bit of the last speculative spte is therefore not enough.

Instead of detecting whether the page was accessed, we can detect whether the
spte is accessed after it is written: if a spte is written frequently but not
accessed in between, we treat the page as not being a page table, or as not
having been used for a long time.

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/include/asm/kvm_host.h |6 +---
 arch/x86/kvm/mmu.c  |   57 +--
 arch/x86/kvm/paging_tmpl.h  |   12 +++-
 3 files changed, 26 insertions(+), 49 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 927ba73..9d17238 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -239,6 +239,8 @@ struct kvm_mmu_page {
int clear_spte_count;
 #endif
 
+   int write_flooding_count;
+
struct rcu_head rcu;
 };
 
@@ -353,10 +355,6 @@ struct kvm_vcpu_arch {
struct kvm_mmu_memory_cache mmu_page_cache;
struct kvm_mmu_memory_cache mmu_page_header_cache;
 
-   gfn_t last_pt_write_gfn;
-   int   last_pt_write_count;
-   u64  *last_pte_updated;
-
struct fpu guest_fpu;
u64 xcr0;
 
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index adaa160..fd5b389 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1652,6 +1652,18 @@ static void init_shadow_page_table(struct kvm_mmu_page *sp)
sp->spt[i] = 0ull;
 }
 
+static void __clear_sp_write_flooding_count(struct kvm_mmu_page *sp)
+{
+   sp->write_flooding_count = 0;
+}
+
+static void clear_sp_write_flooding_count(u64 *spte)
+{
+   struct kvm_mmu_page *sp =  page_header(__pa(spte));
+
+   __clear_sp_write_flooding_count(sp);
+}
+
 static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 gfn_t gfn,
 gva_t gaddr,
@@ -1695,6 +1707,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
} else if (sp->unsync)
kvm_mmu_mark_parents_unsync(sp);
 
+   __clear_sp_write_flooding_count(sp);
trace_kvm_mmu_get_page(sp, false);
return sp;
}
@@ -1847,15 +1860,6 @@ static void kvm_mmu_put_page(struct kvm_mmu_page *sp, u64 *parent_pte)
mmu_page_remove_parent_pte(sp, parent_pte);
 }
 
-static void kvm_mmu_reset_last_pte_updated(struct kvm *kvm)
-{
-   int i;
-   struct kvm_vcpu *vcpu;
-
-   kvm_for_each_vcpu(i, vcpu, kvm)
-   vcpu->arch.last_pte_updated = NULL;
-}
-
 static void kvm_mmu_unlink_parents(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
u64 *parent_pte;
@@ -1915,7 +1919,6 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
}
 
sp->role.invalid = 1;
-   kvm_mmu_reset_last_pte_updated(kvm);
return ret;
 }
 
@@ -2360,8 +2363,6 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
}
}
kvm_release_pfn_clean(pfn);
-   if (speculative)
-   vcpu->arch.last_pte_updated = sptep;
 }
 
 static void nonpaging_new_cr3(struct kvm_vcpu *vcpu)
@@ -3522,13 +3523,6 @@ static void mmu_pte_write_flush_tlb(struct kvm_vcpu *vcpu, bool zap_page,
kvm_mmu_flush_tlb(vcpu);
 }
 
-static bool last_updated_pte_accessed(struct kvm_vcpu *vcpu)
-{
-   u64 *spte = vcpu->arch.last_pte_updated;
-
-   return !!(spte && (*spte & shadow_accessed_mask));
-}
-
 static u64 mmu_pte_write_fetch_gpte(struct kvm_vcpu *vcpu, gpa_t *gpa,
const u8 *new, int *bytes)
 {
@@ -3569,22 +3563,9 @@ static u64 mmu_pte_write_fetch_gpte(struct kvm_vcpu *vcpu, gpa_t *gpa,
  * If we're seeing too many writes to a page, it may no longer be a page table,
  * or we may be forking, in which case it is better to unmap the page.
  */
-static bool detect_write_flooding(struct kvm_vcpu *vcpu, gfn_t gfn)
+static bool detect_write_flooding(struct kvm_mmu_page *sp, u64 *spte)
 {
-   bool flooded = false;
-
-   if (gfn == vcpu->arch.last_pt_write_gfn
-   && !last_updated_pte_accessed(vcpu)) {
-   ++vcpu->arch.last_pt_write_count;
-   if (vcpu->arch.last_pt_write_count >= 3)
-   flooded = true;
-   } else {
-   vcpu->arch.last_pt_write_gfn = gfn;
-   vcpu->arch.last_pt_write_count = 1;
-   vcpu->arch.last_pte_updated = NULL;
-   }
-
-   return flooded;
+   return ++sp->write_flooding_count >= 3;
 }
 
 /*

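To make the new heuristic concrete, here is a toy userspace model. It is illustrative only: `struct shadow_page`, `page_used()` and `page_written()` are hypothetical stand-ins for `kvm_mmu_page`, `__clear_sp_write_flooding_count()` and `detect_write_flooding()` in the patch above.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of the per-shadow-page flooding heuristic: the count is
 * cleared whenever the page is used as a page table again (as in
 * kvm_mmu_get_page() or a fault walking through its sptes), and each
 * emulated write increments it.  Three writes with no intervening use
 * mean the page is probably no longer a page table and gets zapped.
 */
struct shadow_page {
    int write_flooding_count;
};

/* models __clear_sp_write_flooding_count() */
static void page_used(struct shadow_page *sp)
{
    sp->write_flooding_count = 0;
}

/* models detect_write_flooding(); true means "zap this page" */
static bool page_written(struct shadow_page *sp)
{
    return ++sp->write_flooding_count >= 3;
}
```

The per-page counter replaces the per-vcpu `last_pt_write_*` state, so a guest writing one page table heavily no longer poisons flood detection for unrelated pages.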
Re: VFIO v2 design plan

2011-08-29 Thread David Gibson
On Fri, Aug 26, 2011 at 11:05:23AM -0600, Alex Williamson wrote:
 
 I don't think too much has changed since the previous email went out,
 but it seems like a good idea to post a summary in case there were
 suggestions or objections that I missed.
 
 VFIO v2 will rely on the platform iommu driver reporting grouping
 information.  Again, a group is a set of devices for which the iommu
 cannot differentiate transactions.  An example would be a set of devices
 behind a PCI-to-PCI bridge.  All transactions appear to be from the
 bridge itself rather than devices behind the bridge.  Platforms are free
 to have whatever constraints they need to for what constitutes a group.
 
 I posted a rough draft of patch to implement that for the base iommu
 driver and VT-d, adding an iommu_device_group callback on iommu ops.
 The iommu base driver also populates an iommu_group sysfs file for each
 device that's part of a group.  Members of the same group return the
 same value via either the sysfs or iommu_device_group.  The value
 returned is arbitrary, should not be assumed to be persistent across
 boots, and is left to the iommu driver to generate.  There are some
 implementation details around how to do this without favoring one bus
 over another, but the interface should be bus/device type agnostic in
 the end.
 
 When the vfio module is loaded, character devices will be created for
 each group in /dev/vfio/$GROUP.  Setting file permissions on these files
 should be sufficient for providing a user with complete access to the
 group.  Opening this device file provides what we'll call the group
 fd.  The group fd is restricted to only work with a single mm context.
 Concurrent opens will be denied if the opening process mm does not
 match.  The group fd will provide interfaces for enumerating the devices
 in the group, returning a file descriptor for each device in the group
 (the device fd), binding groups together, and returning a file
 descriptor for iommu operations (the iommu fd).
 
 A group is viable when all member devices of the group are bound to
 the vfio driver.  Until that point, the group fd only allows enumeration
 interfaces (ie. listing of group devices).  I'm currently thinking
 enumeration will be done by a simple read() on the device file returning
 a list of dev_name()s.

Ok.  Are you envisaging this interface as a virtual file, or as a
stream?  That is, can you seek around the list of devices like a
regular file - in which case, what are the precise semantics when the
list is changed by a bind - or is there no meaningful notion of file
pointer and read() just gives you the next device - in which case how
do you rewind to enumerate the group again?

  Once the group is viable, the user may bind the
 group to another group, retrieve the iommu fd, or retrieve device fds.
 Internally, each of these operations will result in an iommu domain
 being allocated and all of the devices attached to the domain.
 
 The purpose of binding groups is to share the iommu domain.  Groups
 making use of incompatible iommu domains will fail to bind.  Groups
 making use of different mm's will fail to bind.  The vfio driver may
 reject some binding based on domain capabilities, but final veto power
 is left to the iommu driver[1].  If a user makes use of a group
 independently and later wishes to bind it to another group, all the
 device fds and the iommu fd must first be closed.  This prevents using a
 stale iommu fd or accessing devices while the iommu is being switched.
 Operations on any group fds of a merged group are performed globally on
 the group (ie. enumerating the devices lists all devices in the merged
 group, retrieving the iommu fd from any group fd results in the same fd,
 device fds from any group can be retrieved from any group fd[2]).
 Groups can be merged and unmerged dynamically.  Unmerging a group
 requires the device fds for the outgoing group are closed.  The iommu fd
 will remain persistent for the remaining merged group.

As I've said I prefer a persistent group model, rather than this
transient group model, but it's not a dealbreaker by itself.  How are
unmerges specified?  I'm also assuming that in this model closing a
(bound) group fd will unmerge everything down to atomic groups again.

 If a device within a group is unbound from the vfio driver while it's in
 use (iommu fd refcnt > 0 || device fd refcnt > 0), vfio will block the
 release and send netlink remove requests for every opened device in the
 group (or merged group).

Hrm, I do dislike netlink being yet another aspect of an already
complex interface.  Would it be possible to do kernel-user
notifications with a poll()/read() interface on one of the existing
fds instead?

  If the device fds are not released and
 subsequently the iommu fd released as well, vfio will kill the user
 process after some delay.

Ouch, this seems to me a problematic semantic.  Whether the user
process survives depends on whether it processes the remove requests

Re: A non-responsive guest problem

2011-08-29 Thread Paul
Hi,
After changing the clock source from kvm-clock to tsc, everything is
OK. It's probably a bug in kvm-clock. Maybe these bugs have been
fixed in the latest qemu.
Thanks,
Paul

On Wed, Aug 24, 2011 at 8:47 PM, Paul fly...@gmail.com wrote:

 Hi,

 Sometimes this problem happened in one day, but sometimes it was very
 difficult to reproduce it.
 Previously the clock source of the guest was kvm-clock. Now I changed
 it to tsc. The problem didn't occur until now. Is it related to the
 clock source? I  find that there are some bug fixes for kvm-clock
 recently. (e.g.,
 http://www.spinics.net/lists/stable-commits/msg11942.html) Anyway, I
 will update KVM later.

 Thanks,
 Paul

 On Wed, Aug 24, 2011 at 6:24 PM, Stefan Hajnoczi stefa...@gmail.com wrote:
 
  On Wed, Aug 24, 2011 at 10:02 AM, Xiao Guangrong
  xiaoguangr...@cn.fujitsu.com wrote:
   On 08/24/2011 04:40 PM, Paul wrote:
   Hi,
  
   I captured the output of pidstat when the problem was reproduced:
  
   bash-4.1# pidstat -p $PID 8966
   Linux 2.6.32-71.el6.x86_64 (test)     07/24/11        _x86_64_        (4 
   CPU)
  
   16:25:15          PID    %usr %system  %guest    %CPU   CPU  Command
   16:25:15         8966    0.14   55.04  115.41  170.59     1  qemu-kvm
  
  
    I have tried to reproduce it, but failed. I am using the
    current KVM code. I suggest you test it with the new code if possible.
 
  Yes, that's a good idea.  The issue might already be fixed.  But if
  this is hard to reproduce then perhaps keep the spinning guest around
  a bit longer so we can poke at it and figure out what is happening.
 
  The pidstat output shows us that it's the guest that is spinning, not
  qemu-kvm userspace.
 
  The system time (time spent in host kernel) is also quite high so
  running kvm_stat might show some interesting KVM events happening.
 
  Stefan


Re: VFIO v2 design plan

2011-08-29 Thread Alex Williamson
On Tue, 2011-08-30 at 13:04 +1000, David Gibson wrote:
 On Fri, Aug 26, 2011 at 11:05:23AM -0600, Alex Williamson wrote:
  
  I don't think too much has changed since the previous email went out,
  but it seems like a good idea to post a summary in case there were
  suggestions or objections that I missed.
  
  VFIO v2 will rely on the platform iommu driver reporting grouping
  information.  Again, a group is a set of devices for which the iommu
  cannot differentiate transactions.  An example would be a set of devices
  behind a PCI-to-PCI bridge.  All transactions appear to be from the
  bridge itself rather than devices behind the bridge.  Platforms are free
  to have whatever constraints they need to for what constitutes a group.
  
  I posted a rough draft of patch to implement that for the base iommu
  driver and VT-d, adding an iommu_device_group callback on iommu ops.
  The iommu base driver also populates an iommu_group sysfs file for each
  device that's part of a group.  Members of the same group return the
  same value via either the sysfs or iommu_device_group.  The value
  returned is arbitrary, should not be assumed to be persistent across
  boots, and is left to the iommu driver to generate.  There are some
  implementation details around how to do this without favoring one bus
  over another, but the interface should be bus/device type agnostic in
  the end.
  
  When the vfio module is loaded, character devices will be created for
  each group in /dev/vfio/$GROUP.  Setting file permissions on these files
  should be sufficient for providing a user with complete access to the
  group.  Opening this device file provides what we'll call the group
  fd.  The group fd is restricted to only work with a single mm context.
  Concurrent opens will be denied if the opening process mm does not
  match.  The group fd will provide interfaces for enumerating the devices
  in the group, returning a file descriptor for each device in the group
  (the device fd), binding groups together, and returning a file
  descriptor for iommu operations (the iommu fd).
  
  A group is viable when all member devices of the group are bound to
  the vfio driver.  Until that point, the group fd only allows enumeration
  interfaces (ie. listing of group devices).  I'm currently thinking
  enumeration will be done by a simple read() on the device file returning
  a list of dev_name()s.
 
 Ok.  Are you envisaging this interface as a virtual file, or as a
 stream?  That is, can you seek around the list of devices like a
 regular file - in which case, what are the precise semantics when the
 list is changed by a bind - or is there no meaningful notion of file
 pointer and read() just gives you the next device - in which case how
 do you rewind to enumerate the group again?

I was implementing it as a virtual file that gets generated on read()
(see example in note[2] below).  It is a bit clunky as reading it a byte
at a time could experience races w/ device add/remove.  If it's read all
at once, it's an accurate snapshot.  Suggestions welcome, this just
seemed easier than trying to stuff it into a struct for an ioctl.  For a
while I thought I could do a VFIO_GROUP_GET_NUM_DEVICES +
VFIO_GROUP_GET_DEVICE_INDEX, but that assumes device stability, which I
don't think we can guarantee.

   Once the group is viable, the user may bind the
  group to another group, retrieve the iommu fd, or retrieve device fds.
  Internally, each of these operations will result in an iommu domain
  being allocated and all of the devices attached to the domain.
  
  The purpose of binding groups is to share the iommu domain.  Groups
  making use of incompatible iommu domains will fail to bind.  Groups
  making use of different mm's will fail to bind.  The vfio driver may
  reject some binding based on domain capabilities, but final veto power
  is left to the iommu driver[1].  If a user makes use of a group
  independently and later wishes to bind it to another group, all the
  device fds and the iommu fd must first be closed.  This prevents using a
  stale iommu fd or accessing devices while the iommu is being switched.
  Operations on any group fds of a merged group are performed globally on
  the group (ie. enumerating the devices lists all devices in the merged
  group, retrieving the iommu fd from any group fd results in the same fd,
  device fds from any group can be retrieved from any group fd[2]).
  Groups can be merged and unmerged dynamically.  Unmerging a group
  requires the device fds for the outgoing group are closed.  The iommu fd
  will remain persistent for the remaining merged group.
 
 As I've said I prefer a persistent group model, rather than this
 transient group model, but it's not a dealbreaker by itself.  How are
 unmerges specified?

VFIO_GROUP_UNMERGE ioctl taking a group fd parameter.

  I'm also assuming that in this model closing a
 (bound) group fd will unmerge everything down to atomic groups again.

Yes, it will 

Re: RFC: vfio / device assignment -- layout of device fd files

2011-08-29 Thread Alex Williamson
On Mon, 2011-08-29 at 18:14 -0500, Scott Wood wrote:
 On 08/29/2011 05:46 PM, Alex Williamson wrote:
  On Mon, 2011-08-29 at 16:58 -0500, Scott Wood wrote:
  On 08/29/2011 02:51 PM, Alex Williamson wrote:
  On Mon, 2011-08-29 at 16:51 +, Yoder Stuart-B08248 wrote:
  The device info records following the file header have the following
  record types each with content encoded in a record specific way:
 
   REGION  - describes an addressable address range for the device
   DTPATH - describes the device tree path for the device
   DTINDEX - describes the index into the related device tree
 property (reg,ranges,interrupts,interrupt-map)
 
  I don't quite understand if these are physical or virtual.
 
  If what are physical or virtual?
  
  Can you give an example of a path vs an index?  I don't understand
  enough about these to ask a useful question about what they're
  describing.
 
 You'd have both path and index.
 
 Example, for this tree:
 
 / {
   ...
   foo {
   ...
   bar {
    reg = <0x1000 64 0x1800 64>;
    ranges = <0 0x2 0x1>;
   ...
 
   child {
    reg = <0x100 0x100>;
   ...
   };
   };
   };
 };
 
 There would be 4 regions if you bind to /foo/bar:
 
 // this is 64 bytes at 0x1000
 DTPATH /foo/bar
 DTINDEX prop_type=REG prop_index=0
 
 // this is 64 bytes at 0x1800
 DTPATH /foo/bar
 DTINDEX prop_type=REG prop_index=1
 
 // this is 16K at 0x2
 DTPATH /foo/bar
 DTINDEX prop_type=RANGES prop_index=0
 
 // this is 256 bytes at 0x20100
 DTPATH /foo/bar/child
 DTINDEX prop_type=REG prop_index=0
 
 Both ranges and the child reg are needed, since ranges could be a simple
 ranges; that passes everything with no translation, and child nodes
 could be absent-but-implied in some other cases (such as when they
 represent PCI devices which can be probed -- we still need to map the
 ranges that correspond to PCI controller windows).

Thanks for the example.  Is it always the case that you need a path and
an index?  If so, why are they separate sub-regions instead of combined
into a DT_INFO sub-region?

   INTERRUPT - describes an interrupt for the device
   PCI_CONFIG_SPACE - describes config space for the device
 
  I would have expected this to be a REGION with a property of
  PCI_CONFIG_SPACE.
 
  Could be, if physical address is made optional.
  
  Or physical address is also a property, aka sub-region.
 
 A subrecord of REGION is fine with me.
 
  Would we only need to expose phys addr for 1:1 mapping requirements?
  I'm not sure why we'd care to expose this otherwise.
 
  It's more important for non-PCI, where it avoids the need for userspace
  to parse the device tree to find the guest address (we'll usually want
  1:1), or to consolidate pages shared by multiple regions.  It could be
  nice for debugging, as well.
  
  So the device tree path is ripped straight from the system, so it's the
  actual 1:1, matching physical hardware, path.
 
 Yes.
 
  Even for non-PCI we need to
  know if the region is pio/mmio32/mmio64/prefetchable/etc.
 
  Outside of PCI, what standardized form would you put such information
  in?  Where would the kernel get this information?  What does
  mmio32/mmio64 mean in this context?
  
  I could imagine a platform device described by ACPI that might want to
  differentiate.  The physical device doesn't get moved of course, but
  guest drivers might care how the device is described if we need to
  rebuild those ACPI tables.  ACPI might even be a good place to leverage
  these data structures... /me ducks.
 
 ACPI info could be another subrecord type, but in the device tree
 system-bus case we generally don't have this information at the generic
 infrastructure level.  Drivers are expected to know how their devices'
 regions should be mapped.

The device tree tells them how they're mapped, right?  Or maybe more
precisely, the device tree tells them where they're mapped and it
doesn't really matter whether they're 32bit or 64bit because they can't
be moved.

Maybe this is sub-region material.  It just feels wrong to enumerate a
region and not be able to include any basic properties beyond offset and
size in a common field.  For PCI, we can also describe the properties
via config space, so sub-regions could still be optional.

  BAR index could really just translate to a REGION instance number.
 
  How would that work if you make non-BAR things (such as config space)
  into regions?
  
  Put their instance numbers outside of the BAR region?  We have a fixed
  REGION space on PCI, so we could just define BAR0 == instance 0, BAR1 ==
  instance 1... ROM == instance 6, CONFIG == instance 0xF (or 7).
 
 Seems more awkward than just having each region say what it is.  What do
 you do to fill in the gaps?

You don't, instance numbers would just be non-contiguous.  The 

KVM on IBM PowerEN processor

2011-08-29 Thread Kun Wang
Hi, everyone,

This is Kun Wang from IBM Research China. My team and I have been working 
on the IBM PowerEN processor in recent years, including its simulation, 
lib/runtime optimization, etc. Now we are starting the work to enable KVM on 
the PowerEN processor. Since the A2 core of PowerEN follows Power ISA v2.06 
(more specifically, book3e and 64-bit), I believe 99% of our work will 
stick to the ISA, and hence can be leveraged by others.

As newcomers to the KVM world, my team and I definitely need your help. 
Looking forward to talking and working with you guys in the future.

Best regards,
Kun Wang

Research Staff Member, Manager of Next-Generation Systems
IBM Research China
Tel: (86)10-58748491
FAX: (86)10-58748330
E-mail: wang...@cn.ibm.com
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 4/4] Add check for suspended vm in softlockup detector

2011-08-29 Thread Eric B Munson
A suspended VM can cause spurious soft lockup warnings.  To avoid these, the
watchdog now checks if the kernel knows it was stopped by the host and skips
the warning if so.

Signed-off-by: Eric B Munson emun...@mgebm.net
---
 kernel/watchdog.c |   12 ++++++++++++
 1 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 36491cd..4cbb69f 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -25,6 +25,7 @@
 #include <linux/sysctl.h>
 
 #include <asm/irq_regs.h>
+#include <asm/pvclock.h>
 #include <linux/perf_event.h>
 
 int watchdog_enabled = 1;
@@ -292,6 +293,17 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
 */
duration = is_softlockup(touch_ts);
if (unlikely(duration)) {
+   /*
+* If a virtual machine is stopped by the host it can look to
+* the watchdog like a soft lockup; check to see if the host
+* stopped the vm before we issue the warning.
+*/
+   if (kvm_check_and_clear_host_stopped(get_cpu())) {
+   put_cpu();
+   return HRTIMER_RESTART;
+   }
+   put_cpu();
+
/* only warn once */
if (__this_cpu_read(soft_watchdog_warn) == true)
return HRTIMER_RESTART;
-- 
1.7.4.1



[PATCH 1/4] Add flag to indicate that a vm was stopped by the host

2011-08-29 Thread Eric B Munson
This flag will be used to check if the vm was stopped by the host when a soft
lockup was detected.

Signed-off-by: Eric B Munson emun...@mgebm.net
---
 arch/x86/include/asm/pvclock-abi.h |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/pvclock-abi.h 
b/arch/x86/include/asm/pvclock-abi.h
index 35f2d19..6167fd7 100644
--- a/arch/x86/include/asm/pvclock-abi.h
+++ b/arch/x86/include/asm/pvclock-abi.h
@@ -40,5 +40,6 @@ struct pvclock_wall_clock {
 } __attribute__((__packed__));
 
 #define PVCLOCK_TSC_STABLE_BIT (1 << 0)
+#define PVCLOCK_GUEST_STOPPED  (1 << 1)
 #endif /* __ASSEMBLY__ */
 #endif /* _ASM_X86_PVCLOCK_ABI_H */
-- 
1.7.4.1
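The consumer of this flag is kvm_check_and_clear_host_stopped(), used by patch 4/4 above: the host sets PVCLOCK_GUEST_STOPPED in the shared pvclock page while the VM is stopped, and the guest's softlockup check consumes (clears) it exactly once. A minimal userspace sketch of the intended check-and-clear semantics, with `check_and_clear_guest_stopped()` as a hypothetical stand-in operating on a plain byte instead of the shared pvclock structure:

```c
#include <assert.h>
#include <stdbool.h>

/* flag bits from arch/x86/include/asm/pvclock-abi.h after this patch */
#define PVCLOCK_TSC_STABLE_BIT (1 << 0)
#define PVCLOCK_GUEST_STOPPED  (1 << 1)

/*
 * Returns true (and clears the bit) if the host marked this vcpu as
 * stopped since the last check; subsequent calls return false until
 * the host sets the bit again.  Other flag bits are untouched.
 */
static bool check_and_clear_guest_stopped(unsigned char *flags)
{
    if (!(*flags & PVCLOCK_GUEST_STOPPED))
        return false;
    *flags &= ~PVCLOCK_GUEST_STOPPED;
    return true;
}
```

The clear-on-read behavior is what lets the watchdog suppress exactly one spurious warning per host-side stop, rather than silencing the detector permanently.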
