Re: source for virt io backend driver

2012-04-09 Thread Nadav Har'El
On Sun, Apr 08, 2012, Steven wrote about Re: source for virt io backend 
driver:
  vhost-net. This you can find in drivers/vhost/net.c (and
  drivers/vhost/vhost.c for the generic vhost infrastructure for virtio
  drivers in the host kernel).
 
 Yes, I am looking for the backend code in the kernel. I found the file
 for net backend drivers/vhost/net.c. However, the backend code for blk
 is not there.
 Could you point out this? Thanks.

Indeed, vhost-block is not (yet) part of the mainline kernel.

Maybe someone else more experienced in vhost-block can recommend where
to get the latest version.

Nadav.


-- 
Nadav Har'El| Monday, Apr 9 2012, 
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |Diplomat: A man who always remembers a
http://nadav.harel.org.il   |woman's birthday but never her age.


Re: [PATCH 4/4] virtio_blk: use disk_name_format() to support mass of disks naming

2012-04-09 Thread Michael S. Tsirkin
On Mon, Apr 09, 2012 at 11:47:51AM +0800, Ren Mingxin wrote:
  On 04/04/2012 04:01 PM, Michael S. Tsirkin wrote:
 On Mon, Apr 02, 2012 at 12:00:45PM -0700, Tejun Heo wrote:
 On Mon, Apr 02, 2012 at 11:56:18AM -0700, James Bottomley wrote:
 So if we're agreed no other devices going forwards should ever use this
 interface, is there any point unifying the interface?  No matter how
 many caveats you hedge it round with, putting the API in a central place
 will be a bit like a honey trap for careless bears.   It might be safer
 just to leave it buried in the three current drivers.
 Yeah, that was my hope but I think it would be easier to enforce to
 have a common function which is clearly marked legacy so that new
 driver writers can go look for the naming code in the existing ones,
 find out they're all using the same function which is marked legacy
 and explains what to do for newer drivers.
 I think I'm not the only one to be confused about the
 preferred direction here.
 James, do you agree to the approach above?
 
 It would be nice to fix virtio block for 3.4, so
 how about this:
 - I'll just apply the original bugfix patch for 3.4 -
it only affects virtio
 
 Sorry, about only affects virtio, I'm not very clear here:
 1) Just duplicate the disk name format function in virtio_blk
 like the original patch: https://lkml.org/lkml/2012/3/28/45
 2) Move the disk name format function into block core like
 this patch series but only affects virtio(not affect mtip32xx).
 Do you mean the 2) one or something else?

I mean 1) - I'll apply the original patch.

 - Ren will repost the refactoring patch on top, and we can
keep up the discussion
 
 Ren if you agree, can you make this a two patch series please?
 
 
 Sure.
 
 -- 
 Thanks,
 Ren


Re: [PATCH 4/4] virtio_blk: use disk_name_format() to support mass of disks naming

2012-04-09 Thread Michael S. Tsirkin
On Mon, Apr 09, 2012 at 11:47:51AM +0800, Ren Mingxin wrote:
  On 04/04/2012 04:01 PM, Michael S. Tsirkin wrote:
 On Mon, Apr 02, 2012 at 12:00:45PM -0700, Tejun Heo wrote:
 On Mon, Apr 02, 2012 at 11:56:18AM -0700, James Bottomley wrote:
 So if we're agreed no other devices going forwards should ever use this
 interface, is there any point unifying the interface?  No matter how
 many caveats you hedge it round with, putting the API in a central place
 will be a bit like a honey trap for careless bears.   It might be safer
 just to leave it buried in the three current drivers.
 Yeah, that was my hope but I think it would be easier to enforce to
 have a common function which is clearly marked legacy so that new
 driver writers can go look for the naming code in the existing ones,
 find out they're all using the same function which is marked legacy
 and explains what to do for newer drivers.
 I think I'm not the only one to be confused about the
 preferred direction here.
 James, do you agree to the approach above?
 
 It would be nice to fix virtio block for 3.4, so
 how about this:
 - I'll just apply the original bugfix patch for 3.4 -
it only affects virtio
 
 Sorry, about only affects virtio, I'm not very clear here:
 1) Just duplicate the disk name format function in virtio_blk
 like the original patch: https://lkml.org/lkml/2012/3/28/45

So I'd like to apply this, and we can discuss the
deduplication for 3.5. Please post a version of this that
1. isn't line-wrapped and doesn't have damaged whitespace
   so I can run git am on it
2. lists the # of supported disks correctly as 26^3+(26^2+26)
   in the description

Thanks!

 2) Move the disk name format function into block core like
 this patch series but only affects virtio(not affect mtip32xx).
 Do you mean the 2) one or something else?
 
 - Ren will repost the refactoring patch on top, and we can
keep up the discussion
 
 Ren if you agree, can you make this a two patch series please?
 
 
 Sure.
 
 -- 
 Thanks,
 Ren
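
For readers following the naming discussion above, here is a minimal sketch of the legacy sd-style naming scheme being talked about, modelled loosely on sd_format_disk_name(); the function and names below are illustrative assumptions, not code from Ren's patch. Index 0 maps to "vda", 25 to "vdz", 26 to "vdaa", and a suffix of up to three letters gives 26 + 26^2 + 26^3 = 18278 names:

  /* Illustrative only -- not taken from the patch under discussion. */
  static int format_disk_name(const char *prefix, int index, char *buf, int buflen)
  {
          const int base = 'z' - 'a' + 1;
          char *begin = buf + strlen(prefix);
          char *end = buf + buflen;
          char *p = end - 1;

          *p = '\0';
          do {
                  if (p == begin)
                          return -EINVAL;         /* name does not fit in buf */
                  *--p = 'a' + (index % base);    /* least significant letter first */
                  index = (index / base) - 1;
          } while (index >= 0);

          memmove(begin, p, end - p);             /* shift suffix next to the prefix */
          memcpy(buf, prefix, strlen(prefix));
          return 0;
  }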


Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-09 Thread Avi Kivity
On 04/08/2012 08:37 PM, Jan Kiszka wrote:
 The core problem is not the ordering. The problem is that the kernel is
 susceptible to ordering mistakes of userspace. And that is because the
 kernel panics on PCI errors of devices that are in user hands - a
 critical kernel bug IMHO. 

Certainly.  But this userspace patch won't fix it.

 Proper reset of MSI or even the whole PCI
 config space is another issue, but one the kernel should not worry about
 - still, it should be fixed (therefore this patch).

And I was asking what is the right way to do it.  Reset the device and
read back the register values, or do an emulated reset and push down the
register values.

 But even if we disallowed userland to disable MMIO and PIO access to the
 device, we would not be able to exclude that there are secret channels
 in the device's interface having the same effect. So we likely need to
 enhance PCI error handling to catch and handle faults for certain
 devices differently - those we cannot trust to behave properly while
 they are under userland/guest control.

Why not all of them?

-- 
error compiling committee.c: too many arguments to function



Re: Questing regarding KVM Guest PMU

2012-04-09 Thread Gleb Natapov
On Sun, Apr 08, 2012 at 06:27:03PM +0300, Gleb Natapov wrote:
 On Fri, Apr 06, 2012 at 09:50:50AM +0300, Gleb Natapov wrote:
  On Fri, Apr 06, 2012 at 10:43:17AM +0530, shashank rachamalla wrote:
   On Thu, Apr 5, 2012 at 8:11 PM, Gleb Natapov g...@redhat.com wrote:
On Thu, Apr 05, 2012 at 05:38:40PM +0300, Avi Kivity wrote:
On 04/05/2012 04:57 PM, Gleb Natapov wrote:
  
  May be it used NMI based profiling. We should ask oprofile 
  developers.
  As I said I am almost sure my inability to run it on a host is 
  probably
  PEBKAC, although I ran the same script exactly on the host and the
  guest (the script is from the first email of this thread)
 
 After upgrading the kernel to latest git from whatever it was there 
 the
 same script works and counts CPU_CLK_UNHALT events.

   
This is even while it violates the Intel guidelines?
   
Yes, but who says the result is correct :) It seems that we handle
global ctrl msr wrong. That is counter can be enabled either in global
ctrl or in eventsel. Trying to confirm that.
   
   if that becomes true then will global ctrl msr have any significance ?
  When it is in use yes.
  
 I was wrong. We do handle global ctrl msr correctly, I just ran my test
 incorrectly. If I disable global ctrl on all cpus (for i in `seq 0 15`;
 do wrmsr -p $i 0x38f 0; done) oprofile stops working.
 
After searching high and low I finally found the following in
Performance Monitoring Unit Sharing Guide white paper:

  Known starting state: Software requires a known starting
  state. After CPU reset, all counters and control registers are
  disabled and clear/reset to 0.  The only exception to this is the
  IA32_PERF_GLOBAL_CTRL control MSR, all programmable counter global
  enable bits are reset to 1.

Patch will follow.

--
Gleb.


[PATCH] skbuff: struct ubuf_info callback type safety

2012-04-09 Thread Michael S. Tsirkin
The skb struct ubuf_info callback gets passed struct ubuf_info
itself, not the arg value as the field name and the function signature
seem to imply. Rename the arg field to ctx to match usage,
add documentation and change the callback argument type
to make usage clear and to have compiler check correctness.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 drivers/vhost/net.c|2 +-
 drivers/vhost/vhost.c  |5 ++---
 drivers/vhost/vhost.h  |2 +-
 include/linux/skbuff.h |7 ---
 4 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index f0da2c3..1f21d2a 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -238,7 +238,7 @@ static void handle_tx(struct vhost_net *net)
 
 vq->heads[vq->upend_idx].len = len;
 ubuf->callback = vhost_zerocopy_callback;
-   ubuf->arg = vq->ubufs;
+   ubuf->ctx = vq->ubufs;
 ubuf->desc = vq->upend_idx;
msg.msg_control = ubuf;
msg.msg_controllen = sizeof(ubuf);
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 947f00d..51e4c1e 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1598,10 +1598,9 @@ void vhost_ubuf_put_and_wait(struct vhost_ubuf_ref 
*ubufs)
kfree(ubufs);
 }
 
-void vhost_zerocopy_callback(void *arg)
+void vhost_zerocopy_callback(struct ubuf_info *ubuf)
 {
-   struct ubuf_info *ubuf = arg;
-   struct vhost_ubuf_ref *ubufs = ubuf->arg;
+   struct vhost_ubuf_ref *ubufs = ubuf->ctx;
 struct vhost_virtqueue *vq = ubufs->vq;
 
/* set len = 1 to mark this desc buffers done DMA */
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 8dcf4cc..8de1fd5 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -188,7 +188,7 @@ bool vhost_enable_notify(struct vhost_dev *, struct 
vhost_virtqueue *);
 
 int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
unsigned int log_num, u64 len);
-void vhost_zerocopy_callback(void *arg);
+void vhost_zerocopy_callback(struct ubuf_info *);
 int vhost_zerocopy_signal_used(struct vhost_virtqueue *vq);
 
 #define vq_err(vq, fmt, ...) do {  \
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 3337027..4c3f138 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -238,11 +238,12 @@ enum {
 /*
  * The callback notifies userspace to release buffers when skb DMA is done in
  * lower device, the skb last reference should be 0 when calling this.
- * The desc is used to track userspace buffer index.
+ * The ctx field is used to track device context.
+ * The desc field is used to track userspace buffer index.
  */
 struct ubuf_info {
-   void (*callback)(void *);
-   void *arg;
+   void (*callback)(struct ubuf_info *);
+   void *ctx;
unsigned long desc;
 };
 
-- 
1.7.9.111.gf3fb0
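
As an aside (not part of the patch), this is roughly how a zero-copy consumer would wire up the new-style callback; everything here except struct ubuf_info itself is a hypothetical name used for illustration:

  /* Hypothetical consumer of the new callback signature (illustration only). */
  struct my_zcopy_ref {
          struct ubuf_info ubuf;    /* handed to the lower device via msg_control */
          /* ... driver-private reference counting, queue pointers, etc. ... */
  };

  static void my_zcopy_done(struct ubuf_info *ubuf)
  {
          struct my_zcopy_ref *ref = ubuf->ctx;   /* ctx carries the device context */

          /* release the userspace buffer identified by ubuf->desc ... */
  }

  static void my_zcopy_init(struct my_zcopy_ref *ref, unsigned long desc)
  {
          ref->ubuf.callback = my_zcopy_done;
          ref->ubuf.ctx = ref;
          ref->ubuf.desc = desc;
  }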


KVM call agenda for April, Tuesday 10

2012-04-09 Thread Juan Quintela

Hi

Please send in any agenda items you are interested in covering.

Cheers,

Juan.


Re: [PATCH 09/13] KVM: MMU: get expected spte out of mmu-lock

2012-04-09 Thread Avi Kivity
On 04/05/2012 09:25 PM, Xiao Guangrong wrote:
 On 04/01/2012 11:53 PM, Avi Kivity wrote:

  On 03/29/2012 11:25 AM, Xiao Guangrong wrote:
  It depends on PTE_LIST_WRITE_PROTECT bit in rmap which let us quickly know
  whether the page is writable out of mmu-lock
 
  Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
  ---
   arch/x86/kvm/mmu.c |   17 +
   arch/x86/kvm/paging_tmpl.h |2 +-
   2 files changed, 14 insertions(+), 5 deletions(-)
 
  diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
  index 3887a07..c029185 100644
  --- a/arch/x86/kvm/mmu.c
  +++ b/arch/x86/kvm/mmu.c
  @@ -1148,6 +1148,12 @@ static int rmap_write_protect(struct kvm *kvm, u64 
  gfn)
 
 *rmapp |= PTE_LIST_WRITE_PROTECT;
 
  +  /*
  +   * Setting PTE_LIST_WRITE_PROTECT bit before doing page
  +   * write-protect.
  +   */
  +  smp_mb();
  +
  
  wmb only needed.
  


 We should ensure setting this bit before reading spte, it cooperates with
 fast page fault path to avoid this case:

 On fast page fault path:On rmap_write_protect path:
 read spte: old_spte = *spte
(reading spte is reordered to the 
 front of
 setting PTE_LIST_WRITE_PROTECT bit)
 set spte.identification
smp_mb
 if (!rmap.PTE_LIST_WRITE_PROTECT)
 set rmap.PTE_LIST_WRITE_PROTECT
 cmpxchg(sptep, spte, spte | WRITABLE)
 see old_spte.identification is 
 not set,
 so it does not write-protect this 
 page
   OOPS!!!

Ah, so it's protecting two paths at the same time: rmap.write_protect ->
fast page fault, and lock(sptep) -> write protect.

The whole thing needs to be documented very carefully in locking.txt,
otherwise mmu.c will be write-protected to everyone except you.

  Would it be better to store this bit in all the sptes instead?  We're
  touching them in any case.  More work to clear them, but
  un-write-protecting a page is beneficial anyway as it can save a fault.
  

 There are two reasons:
 - if we set this bit in rmap, we can quickly check whether the page is
   writable before doing shadow page walking.

 - since a full barrier is needed, we should use smp_mb for every spte like 
 this:

   while ((spte = rmap_next(rmapp, spte))) {
   read spte
 smp_mb
 write-protect spte
   }

   smp_mb is called in the loop, i think it is not good, yes?

Yes, agree.

 If you just want to save the fault, we can let all spte to be writeable in
 mmu_need_write_protect, but we should cache gpte access bits into spte 
 firstly.
 It should be another patchset i think. :)

Yes.

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH 00/13] KVM: MMU: fast page fault

2012-04-09 Thread Avi Kivity
On 03/29/2012 11:20 AM, Xiao Guangrong wrote:
 * Idea
 The present bit of page fault error code (EFEC.P) indicates whether the
 page table is populated on all levels, if this bit is set, we can know
 the page fault is caused by the page-protection bits (e.g. W/R bit) or
 the reserved bits.

 In KVM, in most cases, all this kind of page fault (EFEC.P = 1) can be
 simply fixed: the page fault caused by reserved bit
 (EFEC.P = 1 && EFEC.RSV = 1) has already been filtered out in fast mmio
 path. What we need do to fix the rest page fault (EFEC.P = 1 && RSV != 1)
 is just increasing the corresponding access on the spte.

 This patchset introduces a fast path to fix this kind of page fault: it
 is out of mmu-lock and need not walk host page table to get the mapping
 from gfn to pfn.



This patchset is really worrying to me.

It introduces a lot of concurrency into data structures that were not
designed for it.  Even if it is correct, it will be very hard to
convince ourselves that it is correct, and if it isn't, to debug those
subtle bugs.  It will also be much harder to maintain the mmu code than
it is now.

There are a lot of things to check.  Just as an example, we need to be
sure that if we use rcu_dereference() twice in the same code path, that
any inconsistencies due to a write in between are benign.  Doing that is
a huge task.

But I appreciate the performance improvement and would like to see a
simpler version make it in.  This needs to reduce the amount of data
touched in the fast path so it is easier to validate, and perhaps reduce
the number of cases that the fast path works on.

I would like to see the fast path as simple as

  rcu_read_lock();

  (lockless shadow walk)
  spte = ACCESS_ONCE(*sptep);

  if (!(spte & PT_MAY_ALLOW_WRITES))
goto slow;

  gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->sptes)
  mark_page_dirty(kvm, gfn);

  new_spte = spte & ~(PT64_MAY_ALLOW_WRITES | PT_WRITABLE_MASK);
  if (cmpxchg(sptep, spte, new_spte) != spte)
   goto slow;

  rcu_read_unlock();
  return;

slow:
  rcu_read_unlock();
  slow_path();

It now becomes the responsibility of the slow path to maintain *sptep &
PT_MAY_ALLOW_WRITES, but that path has a simpler concurrency model.  It
can be as simple as a clear_bit() before we update sp->gfns[] or if we
add host write protection.
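
A minimal sketch of such a slow-path clear, assuming a purely hypothetical bit position for PT_MAY_ALLOW_WRITES (illustrative only, not code from this thread):

  #define PT_MAY_ALLOW_WRITES_SHIFT 57   /* assumed bit number, for illustration */

  static void revoke_fast_pf_write(u64 *sptep)
  {
          /* the fast path then sees the bit clear and falls back to the slow path */
          clear_bit(PT_MAY_ALLOW_WRITES_SHIFT, (unsigned long *)sptep);
  }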

Sorry, it's too complicated for me.  Marcelo, what's your take?

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH 09/13] KVM: MMU: get expected spte out of mmu-lock

2012-04-09 Thread Takuya Yoshikawa
On Mon, 09 Apr 2012 15:28:47 +0300
Avi Kivity a...@redhat.com wrote:

 Ah, so it's protecting two paths at the same time: rmap.write_protect ->
 fast page fault, and lock(sptep) -> write protect.
 
 The whole thing needs to be documented very carefully in locking.txt,
 otherwise mmu.c will be write-protected to everyone except you.

Before changing mmu_lock usage, better describe it in locking.txt.

I guess from mmu.c git history, it is already restricted to a few
people who can describe that.

Takuya


Re: [PATCH 00/13] KVM: MMU: fast page fault

2012-04-09 Thread Avi Kivity
On 04/06/2012 08:24 AM, Xiao Guangrong wrote:

 Foolish me, i should be crazy. Sorry for my mistake. :(

 Unfortunately, it can not work, we can not get a stable gfn from gpte or
 sp->gfns[]. For example:

 beginning:
 Gpte = Gfn1
 gfn_to_pfn(Gfn1) = Pfn
 Spte = Pfn
 Gfn1 is write-free
 Gfn2 is write-protected


 VCPU 0  VCPU 1 VCPU 2

 fault on gpte
 fast page fault path:
   set Spte.fast_pf
   get Gfn1 from Gpte/sp->gfns[]
   if (Gfn1 is writable)
 Pfn is swapped out:
   Spte = 0
   Gpte is modified to Gfn2,
 and Pfn is realloced and remapped
 to Gfn2, so:
 Spte = Pfn

   fast page fault 
 path:
  set Spte.fast_pf

  cmpxchg  Spte+w
 OOPS!!!
   we see Spte is not changed and
happily make it writable, so gfn2 can be writable

 It seems only a unique identification can prevent this. :(


Ouch.

What about restricting this to role.direct=1?  Then gfn is stable?

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH 09/13] KVM: MMU: get expected spte out of mmu-lock

2012-04-09 Thread Avi Kivity
On 04/09/2012 04:16 PM, Takuya Yoshikawa wrote:
 On Mon, 09 Apr 2012 15:28:47 +0300
 Avi Kivity a...@redhat.com wrote:

  Ah, so it's protecting two paths at the same time: rmap.write_protect ->
  fast page fault, and lock(sptep) -> write protect.
  
  The whole thing needs to be documented very carefully in locking.txt,
  otherwise mmu.c will be write-protected to everyone except you.

 Before changing mmu_lock usage, better describe it in locking.txt.

 I guess from mmu.c git history, it is already restricted to a few
 people who can describe that.

It's getting harder and harder, yes.

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH 00/13] KVM: MMU: fast page fault

2012-04-09 Thread Xiao Guangrong
On 04/09/2012 09:12 PM, Avi Kivity wrote:

 On 03/29/2012 11:20 AM, Xiao Guangrong wrote:
 * Idea
 The present bit of page fault error code (EFEC.P) indicates whether the
 page table is populated on all levels, if this bit is set, we can know
 the page fault is caused by the page-protection bits (e.g. W/R bit) or
 the reserved bits.

 In KVM, in most cases, all this kind of page fault (EFEC.P = 1) can be
 simply fixed: the page fault caused by reserved bit
 (EFEC.P = 1 && EFEC.RSV = 1) has already been filtered out in fast mmio
 path. What we need do to fix the rest page fault (EFEC.P = 1 && RSV != 1)
 is just increasing the corresponding access on the spte.

 This patchset introduces a fast path to fix this kind of page fault: it
 is out of mmu-lock and need not walk host page table to get the mapping
 from gfn to pfn.


 
 This patchset is really worrying to me.
 
 It introduces a lot of concurrency into data structures that were not
 designed for it.  Even if it is correct, it will be very hard to
 convince ourselves that it is correct, and if it isn't, to debug those
 subtle bugs.  It will also be much harder to maintain the mmu code than
 it is now.
 
 There are a lot of things to check.  Just as an example, we need to be
 sure that if we use rcu_dereference() twice in the same code path, that
 any inconsistencies due to a write in between are benign.  Doing that is
 a huge task.
 
 But I appreciate the performance improvement and would like to see a
 simpler version make it in.  This needs to reduce the amount of data
 touched in the fast path so it is easier to validate, and perhaps reduce
 the number of cases that the fast path works on.
 
 I would like to see the fast path as simple as
 
   rcu_read_lock();
 
   (lockless shadow walk)
   spte = ACCESS_ONCE(*sptep);
 
    if (!(spte & PT_MAY_ALLOW_WRITES))
 goto slow;
 
    gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->sptes)
   mark_page_dirty(kvm, gfn);
 
    new_spte = spte & ~(PT64_MAY_ALLOW_WRITES | PT_WRITABLE_MASK);
   if (cmpxchg(sptep, spte, new_spte) != spte)
goto slow;
 
   rcu_read_unlock();
   return;
 
 slow:
   rcu_read_unlock();
   slow_path();
 
 It now becomes the responsibility of the slow path to maintain *sptep &
 PT_MAY_ALLOW_WRITES, but that path has a simpler concurrency model.  It
 can be as simple as a clear_bit() before we update sp->gfns[] or if we
 add host write protection.
 


Okay, let's simplify it as possible:

- let it only fix the page fault with PFEC.P == 1 && PFEC.W = 0, that means
  unlock set_spte path can be dropped

- let it just fixes the page fault caused by dirty-log that means we always
  skip the spte which write-protected by shadow page protection.

Then, things should be fair simper:

In set_spte path, if the spte can be writable, we set ALLOW_WRITE bit
In rmap_write_protect:
  if (spte.PT_WRITABLE_MASK) {
WARN_ON(!(spte & ALLOW_WRITE));
spte &= ~PT_WRITABLE_MASK;
spte |= WRITE_PROTECT;
  }

in fast page fault:

if (spte & PT_WRITABLE_MASK)
return_go_guest;

if ((spte & ALLOW_WRITE) && !(spte & WRITE_PROTECT))
cmpxchg spte + PT_WRITABLE_MASK

The information all we needed comes from spte it is independence from other 
path,
and no barriers.


Hmm, how about this one?
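
A minimal sketch of the flow described above, assuming hypothetical SPTE_ALLOW_WRITE and SPTE_WRITE_PROTECT bit definitions; this is an illustration of the idea, not the actual patch:

  static bool fast_pf_fix_spte(u64 *sptep)
  {
          u64 spte = ACCESS_ONCE(*sptep);

          if (spte & PT_WRITABLE_MASK)
                  return true;        /* already writable, just re-enter the guest */

          if (!(spte & SPTE_ALLOW_WRITE) || (spte & SPTE_WRITE_PROTECT))
                  return false;       /* fall back to the slow path */

          /* all the needed state is in the spte itself, so no barriers */
          return cmpxchg64(sptep, spte, spte | PT_WRITABLE_MASK) == spte;
  }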



Re: [PATCH 00/13] KVM: MMU: fast page fault

2012-04-09 Thread Xiao Guangrong
On 04/09/2012 09:20 PM, Avi Kivity wrote:

 On 04/06/2012 08:24 AM, Xiao Guangrong wrote:

 Foolish me, i should be crazy. Sorry for my mistake. :(

 Unfortunately, it can not work, we can not get a stable gfn from gpte or
 sp->gfns[]. For example:

 beginning:
 Gpte = Gfn1
 gfn_to_pfn(Gfn1) = Pfn
 Spte = Pfn
 Gfn1 is write-free
 Gfn2 is write-protected


 VCPU 0  VCPU 1 VCPU 2

 fault on gpte
 fast page fault path:
   set Spte.fast_pf
   get Gfn1 from Gpte/sp->gfns[]
   if (Gfn1 is writable)
 Pfn is swapped out:
  Spte = 0
  Gpte is modified to Gfn2,
 and Pfn is realloced and remapped
 to Gfn2, so:
 Spte = Pfn

   fast page fault 
 path:
  set Spte.fast_pf

  cmpxchg  Spte+w
 OOPS!!!
   we see Spte is not changed and
happily make it writable, so gfn2 can be writable

 It seems only a unique identification can prevent this. :(

 
 Ouch.
 
 What about restricting this to role.direct=1?  Then gfn is stable?
 


Yes.

The gfn of direct sp is stable since it is calculated by sp->gfn, which is
independent of the gpte.
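
For reference, the direct case is roughly (simplified from kvm_mmu_page_get_gfn()):

  static gfn_t direct_spte_gfn(struct kvm_mmu_page *sp, u64 *sptep)
  {
          int index = sptep - sp->spt;

          /* derived from sp->gfn and the spte index only, never from a gpte */
          return sp->gfn + (index << ((sp->role.level - 1) * PT64_LEVEL_BITS));
  }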



Re: [PATCH 00/13] KVM: MMU: fast page fault

2012-04-09 Thread Xiao Guangrong
On 04/09/2012 09:55 PM, Xiao Guangrong wrote:


 
 
 Okay, let's simplify it as possible:
 
  - let it only fix the page fault with PFEC.P == 1 && PFEC.W = 0, that means


PFEC.P == 1 && PFEC.W = 1, my typo.



Re: [PATCH 00/13] KVM: MMU: fast page fault

2012-04-09 Thread Avi Kivity
On 04/09/2012 04:55 PM, Xiao Guangrong wrote:

 Okay, let's simplify it as possible:

 - let it only fix the page fault with PFEC.P == 1 && PFEC.W = 0, that means
   unlock set_spte path can be dropped

 - let it just fixes the page fault caused by dirty-log that means we always
   skip the spte which write-protected by shadow page protection.

 Then, things should be fair simper:

 In set_spte path, if the spte can be writable, we set ALLOW_WRITE bit
 In rmap_write_protect:
   if (spte.PT_WRITABLE_MASK) {
   WARN_ON(!(spte & ALLOW_WRITE));
   spte &= ~PT_WRITABLE_MASK;
   spte |= WRITE_PROTECT;
   }

 in fast page fault:

 if (spte & PT_WRITABLE_MASK)
   return_go_guest;

 if ((spte & ALLOW_WRITE) && !(spte & WRITE_PROTECT))
   cmpxchg spte + PT_WRITABLE_MASK

 The information all we needed comes from spte it is independence from other 
 path,
 and no barriers.


 Hmm, how about this one?


I like it. WRITE_PROTECT is better than my ALLOW_WRITES, the meaning is
clearer.

-- 
error compiling committee.c: too many arguments to function



[PATCH] KVM: PMU emulation: GLOBAL_CTRL MSR should be enabled on reset.

2012-04-09 Thread Gleb Natapov
On reset all PMU counters should be enabled in GLOBAL_CTRL MSR.

Signed-off-by: Gleb Natapov g...@redhat.com
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 173df38..2e88438 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -459,17 +459,17 @@ void kvm_pmu_cpuid_update(struct kvm_vcpu *vcpu)
 pmu->available_event_types = ~entry->ebx & ((1ull << bitmap_len) - 1);
 
 if (pmu->version == 1) {
-   pmu->global_ctrl = (1 << pmu->nr_arch_gp_counters) - 1;
-   return;
+   pmu->nr_arch_fixed_counters = 0;
+   } else {
+   pmu->nr_arch_fixed_counters = min((int)(entry->edx & 0x1f),
+   X86_PMC_MAX_FIXED);
+   pmu->counter_bitmask[KVM_PMC_FIXED] =
+   ((u64)1 << ((entry->edx >> 5) & 0xff)) - 1;
 }
 
-   pmu->nr_arch_fixed_counters = min((int)(entry->edx & 0x1f),
-   X86_PMC_MAX_FIXED);
-   pmu->counter_bitmask[KVM_PMC_FIXED] =
-   ((u64)1 << ((entry->edx >> 5) & 0xff)) - 1;
-   pmu->global_ctrl_mask = ~(((1 << pmu->nr_arch_gp_counters) - 1)
-   | (((1ull << pmu->nr_arch_fixed_counters) - 1)
-   << X86_PMC_IDX_FIXED));
+   pmu->global_ctrl = ((1 << pmu->nr_arch_gp_counters) - 1) |
+   (((1ull << pmu->nr_arch_fixed_counters) - 1) << X86_PMC_IDX_FIXED);
+   pmu->global_ctrl_mask = ~pmu->global_ctrl;
 }
 
 void kvm_pmu_init(struct kvm_vcpu *vcpu)
--
Gleb.


Re: [PATCH kvm-unit-test] make PMU test to pass on more cpu types

2012-04-09 Thread Gleb Natapov
ping.

On Thu, Mar 29, 2012 at 10:53:36AM +0200, Gleb Natapov wrote:
 Expand boundaries of some tests since it was observed that on some cpus
 results do not fit.
 
 Signed-off-by: Gleb Natapov g...@redhat.com
 diff --git a/x86/pmu.c b/x86/pmu.c
 index e1b8279..2c46f31 100644
 --- a/x86/pmu.c
 +++ b/x86/pmu.c
 @@ -80,8 +80,8 @@ struct pmu_event {
  } gp_events[] = {
  	{"core cycles", 0x003c, 1*N, 50*N},
  	{"instructions", 0x00c0, 10*N, 10.2*N},
 -	{"ref cycles", 0x013c, 1*N, 30*N},
 -	{"llc refference", 0x4f2e, 1, 1*N},
 +	{"ref cycles", 0x013c, 0.1*N, 30*N},
 +	{"llc refference", 0x4f2e, 1, 2*N},
  	{"llc misses", 0x412e, 1, 1*N},
  	{"branches", 0x00c4, 1*N, 1.1*N},
  	{"branch misses", 0x00c5, 0, 0.1*N},
 --
   Gleb.

--
Gleb.


[PATCH 1/5] KVM: x86 emulator: add support for vector alignment

2012-04-09 Thread Avi Kivity
x86 defines three classes of vector instructions: explicitly
aligned (#GP(0) if unaligned), explicitly unaligned, and default
(which depends on the encoding: AVX is unaligned, SSE is aligned).

Add support for marking an instruction as explicitly aligned or
unaligned, and mark MOVDQU as unaligned.

Signed-off-by: Avi Kivity a...@redhat.com
---
 arch/x86/kvm/emulate.c |   30 +-
 1 file changed, 29 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 8375622..6302e5c 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -142,6 +142,9 @@
 #define Src2FS  (OpFS << Src2Shift)
 #define Src2GS  (OpGS << Src2Shift)
 #define Src2Mask    (OpMask << Src2Shift)
+#define Aligned ((u64)1 << 41)  /* Explicitly aligned (e.g. MOVDQA) */
+#define Unaligned   ((u64)1 << 42)  /* Explicitly unaligned (e.g. MOVDQU) */
+#define Avx ((u64)1 << 43)  /* Advanced Vector Extensions */
 
 #define X2(x...) x, x
 #define X3(x...) X2(x), x
@@ -557,6 +560,29 @@ static void set_segment_selector(struct x86_emulate_ctxt 
*ctxt, u16 selector,
 ctxt->ops->set_segment(ctxt, selector, &desc, base3, seg);
 }
 
+/*
+ * x86 defines three classes of vector instructions: explicitly
+ * aligned, explicitly unaligned, and the rest, which change behaviour
+ * depending on whether they're AVX encoded or not.
+ *
+ * Also included is CMPXCHG16B which is not a vector instruction, yet it is
+ * subject to the same check.
+ */
+static bool insn_aligned(struct x86_emulate_ctxt *ctxt, unsigned size)
+{
+   if (likely(size < 16))
+   return false;
+
+   if (ctxt->d & Aligned)
+   return true;
+   else if (ctxt->d & Unaligned)
+   return false;
+   else if (ctxt->d & Avx)
+   return false;
+   else
+   return true;
+}
+
 static int __linearize(struct x86_emulate_ctxt *ctxt,
 struct segmented_address addr,
 unsigned size, bool write, bool fetch,
@@ -621,6 +647,8 @@ static int __linearize(struct x86_emulate_ctxt *ctxt,
}
 if (fetch ? ctxt->mode != X86EMUL_MODE_PROT64 : ctxt->ad_bytes != 8)
 la &= (u32)-1;
+   if (insn_aligned(ctxt, size) && ((la & (size - 1)) != 0))
+   return emulate_gp(ctxt, 0);
*linear = la;
return X86EMUL_CONTINUE;
 bad:
@@ -3415,7 +3443,7 @@ static int check_perm_out(struct x86_emulate_ctxt *ctxt)
 };
 
 static struct gprefix pfx_0f_6f_0f_7f = {
-   N, N, N, I(Sse, em_movdqu),
+   N, N, N, I(Sse | Unaligned, em_movdqu),
 };
 
 static struct opcode opcode_table[256] = {
-- 
1.7.10



[PATCH 0/5] MMX/SSE data movement

2012-04-09 Thread Avi Kivity
This patchset implements MMX registers, SSE data alignment, and three
instructions:

  MOVQ (MMX)
  MOVNTPS (SSE)
  MOVDQA (SSE)

all used in accessing framebuffers from guests.

Avi Kivity (4):
  KVM: x86 emulator: add support for vector alignment
  KVM: x86 emulator: implement movntps
  KVM: x86 emulator: MMX support
  KVM: x86 emulator: implement MMX MOVQ (opcodes 0f 6f, 0f 7f)

Stefan Hajnoczi (1):
  KVM: x86: emulate movdqa

 arch/x86/include/asm/kvm_emulate.h |4 +-
 arch/x86/kvm/emulate.c |  148 
 2 files changed, 138 insertions(+), 14 deletions(-)

-- 
1.7.10



[PATCH 5/5] KVM: x86 emulator: implement MMX MOVQ (opcodes 0f 6f, 0f 7f)

2012-04-09 Thread Avi Kivity
Needed by some framebuffer drivers.  See

https://bugzilla.kernel.org/show_bug.cgi?id=42779

Signed-off-by: Avi Kivity a...@redhat.com
---
 arch/x86/kvm/emulate.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 0011b4a..d5729a9 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -3488,7 +3488,7 @@ static int check_perm_out(struct x86_emulate_ctxt *ctxt)
 };
 
 static struct gprefix pfx_0f_6f_0f_7f = {
-   N, I(Sse | Aligned, em_mov), N, I(Sse | Unaligned, em_mov),
+   I(Mmx, em_mov), I(Sse | Aligned, em_mov), N, I(Sse | Unaligned, em_mov),
 };
 
 static struct gprefix pfx_vmovntpx = {
-- 
1.7.10



[PATCH 2/5] KVM: x86: emulate movdqa

2012-04-09 Thread Avi Kivity
From: Stefan Hajnoczi stefa...@linux.vnet.ibm.com

An Ubuntu 9.10 Karmic Koala guest is unable to boot or install due to
missing movdqa emulation:

kvm_exit: reason EXCEPTION_NMI rip 0x7fef3e025a7b info 7fef3e799000 8b0e
kvm_page_fault: address 7fef3e799000 error_code f
kvm_emulate_insn: 0:7fef3e025a7b: 66 0f 7f 07 (prot64)

movdqa %xmm0,(%rdi)

[avi: mark it explicitly aligned]

Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com
Signed-off-by: Avi Kivity a...@redhat.com
---
 arch/x86/kvm/emulate.c |   10 ++
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 6302e5c..b160fb1f 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -2818,7 +2818,7 @@ static int em_rdpmc(struct x86_emulate_ctxt *ctxt)
 
 static int em_mov(struct x86_emulate_ctxt *ctxt)
 {
-   ctxt->dst.val = ctxt->src.val;
+   memcpy(ctxt->dst.valptr, ctxt->src.valptr, ctxt->op_bytes);
return X86EMUL_CONTINUE;
 }
 
@@ -2898,12 +2898,6 @@ static int em_mov_sreg_rm(struct x86_emulate_ctxt *ctxt)
 return load_segment_descriptor(ctxt, sel, ctxt->modrm_reg);
 }
 
-static int em_movdqu(struct x86_emulate_ctxt *ctxt)
-{
-   memcpy(&ctxt->dst.vec_val, &ctxt->src.vec_val, ctxt->op_bytes);
-   return X86EMUL_CONTINUE;
-}
-
 static int em_invlpg(struct x86_emulate_ctxt *ctxt)
 {
int rc;
@@ -3443,7 +3437,7 @@ static int check_perm_out(struct x86_emulate_ctxt *ctxt)
 };
 
 static struct gprefix pfx_0f_6f_0f_7f = {
-   N, N, N, I(Sse | Unaligned, em_movdqu),
+   N, I(Sse | Aligned, em_mov), N, I(Sse | Unaligned, em_mov),
 };
 
 static struct opcode opcode_table[256] = {
-- 
1.7.10



[PATCH 3/5] KVM: x86 emulator: implement movntps

2012-04-09 Thread Avi Kivity
Used to write to framebuffers (by at least Icaros).

Signed-off-by: Avi Kivity a...@redhat.com
---
 arch/x86/kvm/emulate.c |7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index b160fb1f..fb39e0b 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -3440,6 +3440,10 @@ static int check_perm_out(struct x86_emulate_ctxt *ctxt)
N, I(Sse | Aligned, em_mov), N, I(Sse | Unaligned, em_mov),
 };
 
+static struct gprefix pfx_vmovntpx = {
+   I(0, em_mov), N, N, N,
+};
+
 static struct opcode opcode_table[256] = {
/* 0x00 - 0x07 */
I6ALU(Lock, em_add),
@@ -3571,7 +3575,8 @@ static int check_perm_out(struct x86_emulate_ctxt *ctxt)
IIP(ModRM | SrcMem | Priv | Op3264, em_cr_write, cr_write, 
check_cr_write),
IIP(ModRM | SrcMem | Priv | Op3264, em_dr_write, dr_write, 
check_dr_write),
N, N, N, N,
-   N, N, N, N, N, N, N, N,
+   N, N, N, GP(ModRM | DstMem | SrcReg | Sse | Mov | Aligned, 
pfx_vmovntpx),
+   N, N, N, N,
/* 0x30 - 0x3F */
II(ImplicitOps | Priv, em_wrmsr, wrmsr),
IIP(ImplicitOps, em_rdtsc, rdtsc, check_rdtsc),
-- 
1.7.10



[PATCH 4/5] KVM: x86 emulator: MMX support

2012-04-09 Thread Avi Kivity
General support for the MMX instruction set.  Special care is taken
to trap pending x87 exceptions so that they are properly reflected
to the guest.

Signed-off-by: Avi Kivity a...@redhat.com
---
 arch/x86/include/asm/kvm_emulate.h |4 +-
 arch/x86/kvm/emulate.c |  103 ++--
 2 files changed, 102 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_emulate.h 
b/arch/x86/include/asm/kvm_emulate.h
index c222e1a..1ac46c22 100644
--- a/arch/x86/include/asm/kvm_emulate.h
+++ b/arch/x86/include/asm/kvm_emulate.h
@@ -200,7 +200,7 @@ typedef u32 __attribute__((vector_size(16))) sse128_t;
 
 /* Type, address-of, and value of an instruction's operand. */
 struct operand {
-   enum { OP_REG, OP_MEM, OP_IMM, OP_XMM, OP_NONE } type;
+   enum { OP_REG, OP_MEM, OP_IMM, OP_XMM, OP_MM, OP_NONE } type;
unsigned int bytes;
union {
unsigned long orig_val;
@@ -213,12 +213,14 @@ struct operand {
unsigned seg;
} mem;
unsigned xmm;
+   unsigned mm;
} addr;
union {
unsigned long val;
u64 val64;
char valptr[sizeof(unsigned long) + 2];
sse128_t vec_val;
+   u64 mm_val;
};
 };
 
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index fb39e0b..0011b4a 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -142,6 +142,7 @@
 #define Src2FS  (OpFS << Src2Shift)
 #define Src2GS  (OpGS << Src2Shift)
 #define Src2Mask    (OpMask << Src2Shift)
+#define Mmx ((u64)1 << 40)  /* MMX Vector instruction */
 #define Aligned ((u64)1 << 41)  /* Explicitly aligned (e.g. MOVDQA) */
 #define Unaligned   ((u64)1 << 42)  /* Explicitly unaligned (e.g. MOVDQU) */
 #define Avx ((u64)1 << 43)  /* Advanced Vector Extensions */
@@ -887,6 +888,40 @@ static void write_sse_reg(struct x86_emulate_ctxt *ctxt, 
sse128_t *data,
 ctxt->ops->put_fpu(ctxt);
 }
 
+static void read_mmx_reg(struct x86_emulate_ctxt *ctxt, u64 *data, int reg)
+{
+   ctxt->ops->get_fpu(ctxt);
+   switch (reg) {
+   case 0: asm("movq %%mm0, %0" : "=m"(*data)); break;
+   case 1: asm("movq %%mm1, %0" : "=m"(*data)); break;
+   case 2: asm("movq %%mm2, %0" : "=m"(*data)); break;
+   case 3: asm("movq %%mm3, %0" : "=m"(*data)); break;
+   case 4: asm("movq %%mm4, %0" : "=m"(*data)); break;
+   case 5: asm("movq %%mm5, %0" : "=m"(*data)); break;
+   case 6: asm("movq %%mm6, %0" : "=m"(*data)); break;
+   case 7: asm("movq %%mm7, %0" : "=m"(*data)); break;
+   default: BUG();
+   }
+   ctxt->ops->put_fpu(ctxt);
+}
+
+static void write_mmx_reg(struct x86_emulate_ctxt *ctxt, u64 *data, int reg)
+{
+   ctxt->ops->get_fpu(ctxt);
+   switch (reg) {
+   case 0: asm("movq %0, %%mm0" : : "m"(*data)); break;
+   case 1: asm("movq %0, %%mm1" : : "m"(*data)); break;
+   case 2: asm("movq %0, %%mm2" : : "m"(*data)); break;
+   case 3: asm("movq %0, %%mm3" : : "m"(*data)); break;
+   case 4: asm("movq %0, %%mm4" : : "m"(*data)); break;
+   case 5: asm("movq %0, %%mm5" : : "m"(*data)); break;
+   case 6: asm("movq %0, %%mm6" : : "m"(*data)); break;
+   case 7: asm("movq %0, %%mm7" : : "m"(*data)); break;
+   default: BUG();
+   }
+   ctxt->ops->put_fpu(ctxt);
+}
+
 static void decode_register_operand(struct x86_emulate_ctxt *ctxt,
struct operand *op)
 {
@@ -903,6 +938,13 @@ static void decode_register_operand(struct 
x86_emulate_ctxt *ctxt,
 read_sse_reg(ctxt, &op->vec_val, reg);
 return;
 }
+   if (ctxt->d & Mmx) {
+   reg &= 7;
+   op->type = OP_MM;
+   op->bytes = 8;
+   op->addr.mm = reg;
+   return;
+   }
 
 op->type = OP_REG;
 if (ctxt->d & ByteOp) {
@@ -948,6 +990,12 @@ static int decode_modrm(struct x86_emulate_ctxt *ctxt,
 read_sse_reg(ctxt, &op->vec_val, ctxt->modrm_rm);
 return rc;
 }
+   if (ctxt->d & Mmx) {
+   op->type = OP_MM;
+   op->bytes = 8;
+   op->addr.xmm = ctxt->modrm_rm & 7;
+   return rc;
+   }
 fetch_register_operand(op);
 return rc;
}
@@ -1415,6 +1463,9 @@ static int writeback(struct x86_emulate_ctxt *ctxt)
case OP_XMM:
 write_sse_reg(ctxt, &ctxt->dst.vec_val, ctxt->dst.addr.xmm);
 break;
+   case OP_MM:
+   write_mmx_reg(ctxt, &ctxt->dst.mm_val, ctxt->dst.addr.mm);
+   break;
case OP_NONE:
/* no writeback */
break;
@@ -3987,6 +4038,8 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void 
*insn, int insn_len)
 
 if (ctxt->d & Sse)
 ctxt->op_bytes = 16;
+   else if (ctxt->d & Mmx)
+ 

Re: [PATCH 00/13] KVM: MMU: fast page fault

2012-04-09 Thread Marcelo Tosatti
On Mon, Apr 09, 2012 at 04:12:46PM +0300, Avi Kivity wrote:
 On 03/29/2012 11:20 AM, Xiao Guangrong wrote:
  * Idea
  The present bit of page fault error code (EFEC.P) indicates whether the
  page table is populated on all levels, if this bit is set, we can know
  the page fault is caused by the page-protection bits (e.g. W/R bit) or
  the reserved bits.
 
  In KVM, in most cases, all this kind of page fault (EFEC.P = 1) can be
  simply fixed: the page fault caused by reserved bit
  (EFEC.P = 1 && EFEC.RSV = 1) has already been filtered out in fast mmio
  path. What we need do to fix the rest page fault (EFEC.P = 1 && RSV != 1)
  is just increasing the corresponding access on the spte.
 
  This patchset introduces a fast path to fix this kind of page fault: it
  is out of mmu-lock and need not walk host page table to get the mapping
  from gfn to pfn.
 
 
 
 This patchset is really worrying to me.
 
 It introduces a lot of concurrency into data structures that were not
 designed for it.  Even if it is correct, it will be very hard to
 convince ourselves that it is correct, and if it isn't, to debug those
 subtle bugs.  It will also be much harder to maintain the mmu code than
 it is now.
 
 There are a lot of things to check.  Just as an example, we need to be
 sure that if we use rcu_dereference() twice in the same code path, that
 any inconsistencies due to a write in between are benign.  Doing that is
 a huge task.
 
 But I appreciate the performance improvement and would like to see a
 simpler version make it in.  This needs to reduce the amount of data
 touched in the fast path so it is easier to validate, and perhaps reduce
 the number of cases that the fast path works on.
 
 I would like to see the fast path as simple as
 
   rcu_read_lock();
 
   (lockless shadow walk)
   spte = ACCESS_ONCE(*sptep);
 
    if (!(spte & PT_MAY_ALLOW_WRITES))
 goto slow;
 
    gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->sptes)
   mark_page_dirty(kvm, gfn);
 
    new_spte = spte & ~(PT64_MAY_ALLOW_WRITES | PT_WRITABLE_MASK);
   if (cmpxchg(sptep, spte, new_spte) != spte)
goto slow;
 
   rcu_read_unlock();
   return;
 
 slow:
   rcu_read_unlock();
   slow_path();
 
 It now becomes the responsibility of the slow path to maintain *sptep &
 PT_MAY_ALLOW_WRITES, but that path has a simpler concurrency model.  It
 can be as simple as a clear_bit() before we update sp->gfns[] or if we
 add host write protection.
 
 Sorry, it's too complicated for me.  Marcelo, what's your take?

The improvement is small and limited to special cases (migration should
be rare and framebuffer memory accounts for a small percentage of total
memory).

For one, how can this be safe against mmu notifier methods?

KSM   |VCPU0| VCPU1
  | fault   | fault
  | cow-page|
  | set spte RW |
  | |
write protect host pte| |
grab mmu_lock | |
remove writeable bit in spte  | |
increase mmu_notifier_seq | |  spte = read-only spte
release mmu_lock  | |  cmpxchg succeeds, RO->RW!

MMU notifiers rely on the fault path sequence being

read host pte
read mmu_notifier_seq
spin_lock(mmu_lock)
if (mmu_notifier_seq changed)
goodbye, host pte value is stale
spin_unlock(mmu_lock)

By the example above, you cannot rely on the spte value alone,
mmu_notifier_seq must be taken into account.
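
In code, that sequence is roughly the following (simplified; helper names follow the existing locked fault path, e.g. mmu_notifier_retry(), and error handling is abbreviated):

  static int map_with_notifier_check(struct kvm_vcpu *vcpu, gfn_t gfn)
  {
          unsigned long mmu_seq = vcpu->kvm->mmu_notifier_seq;
          pfn_t pfn;

          smp_rmb();
          pfn = gfn_to_pfn(vcpu->kvm, gfn);        /* read/translate the host pte */

          spin_lock(&vcpu->kvm->mmu_lock);
          if (mmu_notifier_retry(vcpu, mmu_seq)) {
                  /* an invalidation ran in between: the pfn may be stale */
                  spin_unlock(&vcpu->kvm->mmu_lock);
                  kvm_release_pfn_clean(pfn);
                  return -EAGAIN;                  /* retry the whole fault */
          }
          /* ... safe to install the spte here ... */
          spin_unlock(&vcpu->kvm->mmu_lock);
          return 0;
  }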



Re: [PATCH 00/13] KVM: MMU: fast page fault

2012-04-09 Thread Xiao Guangrong
On 04/10/2012 01:58 AM, Marcelo Tosatti wrote:

 On Mon, Apr 09, 2012 at 04:12:46PM +0300, Avi Kivity wrote:
 On 03/29/2012 11:20 AM, Xiao Guangrong wrote:
 * Idea
 The present bit of page fault error code (EFEC.P) indicates whether the
 page table is populated on all levels, if this bit is set, we can know
 the page fault is caused by the page-protection bits (e.g. W/R bit) or
 the reserved bits.

 In KVM, in most cases, all this kind of page fault (EFEC.P = 1) can be
 simply fixed: the page fault caused by reserved bit
 (EFEC.P = 1 && EFEC.RSV = 1) has already been filtered out in fast mmio
 path. What we need do to fix the rest page fault (EFEC.P = 1 && RSV != 1)
 is just increasing the corresponding access on the spte.

 This patchset introduces a fast path to fix this kind of page fault: it
 is out of mmu-lock and need not walk host page table to get the mapping
 from gfn to pfn.



 This patchset is really worrying to me.

 It introduces a lot of concurrency into data structures that were not
 designed for it.  Even if it is correct, it will be very hard to
 convince ourselves that it is correct, and if it isn't, to debug those
 subtle bugs.  It will also be much harder to maintain the mmu code than
 it is now.

 There are a lot of things to check.  Just as an example, we need to be
 sure that if we use rcu_dereference() twice in the same code path, that
 any inconsistencies due to a write in between are benign.  Doing that is
 a huge task.

 But I appreciate the performance improvement and would like to see a
 simpler version make it in.  This needs to reduce the amount of data
 touched in the fast path so it is easier to validate, and perhaps reduce
 the number of cases that the fast path works on.

 I would like to see the fast path as simple as

   rcu_read_lock();

   (lockless shadow walk)
   spte = ACCESS_ONCE(*sptep);

    if (!(spte & PT_MAY_ALLOW_WRITES))
 goto slow;

    gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->sptes)
   mark_page_dirty(kvm, gfn);

    new_spte = spte & ~(PT64_MAY_ALLOW_WRITES | PT_WRITABLE_MASK);
   if (cmpxchg(sptep, spte, new_spte) != spte)
goto slow;

   rcu_read_unlock();
   return;

 slow:
   rcu_read_unlock();
   slow_path();

 It now becomes the responsibility of the slow path to maintain *sptep &
 PT_MAY_ALLOW_WRITES, but that path has a simpler concurrency model.  It
 can be as simple as a clear_bit() before we update sp->gfns[] or if we
 add host write protection.

 Sorry, it's too complicated for me.  Marcelo, what's your take?
 
 The improvement is small and limited to special cases (migration should
 be rare and framebuffer memory accounts for a small percentage of total
 memory).
 
 For one, how can this be safe against mmu notifier methods?
 
 KSM |VCPU0| VCPU1
 | fault   | fault
 | cow-page|
 | set spte RW |
 | |
 write protect host pte  | |
 grab mmu_lock   | |
 remove writeable bit in spte  |   |
 increase mmu_notifier_seq |   |  spte = read-only spte
 release mmu_lock| |  cmpxchg succeeds, RO->RW!
 
 MMU notifiers rely on the fault path sequence being
 
 read host pte
 read mmu_notifier_seq
 spin_lock(mmu_lock)
 if (mmu_notifier_seq changed)
   goodbye, host pte value is stale
 spin_unlock(mmu_lock)
 
 By the example above, you cannot rely on the spte value alone,
 mmu_notifier_seq must be taken into account.


No.

When KSM change the host page to read-only, the HOST_WRITABLE bit
of spte should be removed, that means, the spte should be changed
that can be watched by cmpxchg.

Note: we mark spte to be writeable only if spte.HOST_WRITABLE is
set.


Re: [PATCH 00/13] KVM: MMU: fast page fault

2012-04-09 Thread Xiao Guangrong
On 04/10/2012 01:58 AM, Marcelo Tosatti wrote:

 On Mon, Apr 09, 2012 at 04:12:46PM +0300, Avi Kivity wrote:
 On 03/29/2012 11:20 AM, Xiao Guangrong wrote:
 * Idea
 The present bit of page fault error code (EFEC.P) indicates whether the
 page table is populated on all levels, if this bit is set, we can know
 the page fault is caused by the page-protection bits (e.g. W/R bit) or
 the reserved bits.

 In KVM, in most cases, all this kind of page fault (EFEC.P = 1) can be
 simply fixed: the page fault caused by reserved bit
 (EFEC.P = 1 && EFEC.RSV = 1) has already been filtered out in fast mmio
 path. What we need do to fix the rest page fault (EFEC.P = 1 && RSV != 1)
 is just increasing the corresponding access on the spte.

 This patchset introduces a fast path to fix this kind of page fault: it
 is out of mmu-lock and need not walk host page table to get the mapping
 from gfn to pfn.



 This patchset is really worrying to me.

 It introduces a lot of concurrency into data structures that were not
 designed for it.  Even if it is correct, it will be very hard to
 convince ourselves that it is correct, and if it isn't, to debug those
 subtle bugs.  It will also be much harder to maintain the mmu code than
 it is now.

 There are a lot of things to check.  Just as an example, we need to be
 sure that if we use rcu_dereference() twice in the same code path, that
 any inconsistencies due to a write in between are benign.  Doing that is
 a huge task.

 But I appreciate the performance improvement and would like to see a
 simpler version make it in.  This needs to reduce the amount of data
 touched in the fast path so it is easier to validate, and perhaps reduce
 the number of cases that the fast path works on.

 I would like to see the fast path as simple as

   rcu_read_lock();

   (lockless shadow walk)
   spte = ACCESS_ONCE(*sptep);

    if (!(spte & PT_MAY_ALLOW_WRITES))
 goto slow;

    gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->sptes)
   mark_page_dirty(kvm, gfn);

    new_spte = spte & ~(PT64_MAY_ALLOW_WRITES | PT_WRITABLE_MASK);
   if (cmpxchg(sptep, spte, new_spte) != spte)
goto slow;

   rcu_read_unlock();
   return;

 slow:
   rcu_read_unlock();
   slow_path();

 It now becomes the responsibility of the slow path to maintain *sptep &
 PT_MAY_ALLOW_WRITES, but that path has a simpler concurrency model.  It
 can be as simple as a clear_bit() before we update sp->gfns[] or if we
 add host write protection.

 Sorry, it's too complicated for me.  Marcelo, what's your take?
 
 The improvement is small and limited to special cases (migration should
 be rare and framebuffer memory accounts for a small percentage of total
 memory).


Actually, although the framebuffer is small but it is modified really
frequently, and another unlucky things is that dirty-log is also
very frequently and need hold mmu-lock to do write-protect.

Yes, if Xwindow is not enabled, the benefit is limited. :)


Re: [PATCH 00/13] KVM: MMU: fast page fault

2012-04-09 Thread Marcelo Tosatti
On Tue, Apr 10, 2012 at 02:13:41AM +0800, Xiao Guangrong wrote:
 On 04/10/2012 01:58 AM, Marcelo Tosatti wrote:
 
  On Mon, Apr 09, 2012 at 04:12:46PM +0300, Avi Kivity wrote:
  On 03/29/2012 11:20 AM, Xiao Guangrong wrote:
  * Idea
  The present bit of page fault error code (EFEC.P) indicates whether the
  page table is populated on all levels, if this bit is set, we can know
  the page fault is caused by the page-protection bits (e.g. W/R bit) or
  the reserved bits.
 
  In KVM, in most cases, all this kind of page fault (EFEC.P = 1) can be
  simply fixed: the page fault caused by reserved bit
   (EFEC.P = 1 && EFEC.RSV = 1) has already been filtered out in fast mmio
   path. What we need do to fix the rest page fault (EFEC.P = 1 && RSV != 1)
   is just increasing the corresponding access on the spte.
  
   This patchset introduces a fast path to fix this kind of page fault: it
  is out of mmu-lock and need not walk host page table to get the mapping
  from gfn to pfn.
 
 
 
  This patchset is really worrying to me.
 
  It introduces a lot of concurrency into data structures that were not
  designed for it.  Even if it is correct, it will be very hard to
  convince ourselves that it is correct, and if it isn't, to debug those
  subtle bugs.  It will also be much harder to maintain the mmu code than
  it is now.
 
  There are a lot of things to check.  Just as an example, we need to be
  sure that if we use rcu_dereference() twice in the same code path, that
  any inconsistencies due to a write in between are benign.  Doing that is
  a huge task.
 
  But I appreciate the performance improvement and would like to see a
  simpler version make it in.  This needs to reduce the amount of data
  touched in the fast path so it is easier to validate, and perhaps reduce
  the number of cases that the fast path works on.
 
  I would like to see the fast path as simple as
 
rcu_read_lock();
 
(lockless shadow walk)
spte = ACCESS_ONCE(*sptep);
 
 if (!(spte & PT_MAY_ALLOW_WRITES))
  goto slow;
 
 gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->sptes)
mark_page_dirty(kvm, gfn);
 
 new_spte = spte & ~(PT64_MAY_ALLOW_WRITES | PT_WRITABLE_MASK);
if (cmpxchg(sptep, spte, new_spte) != spte)
 goto slow;
 
rcu_read_unlock();
return;
 
  slow:
rcu_read_unlock();
slow_path();
 
  It now becomes the responsibility of the slow path to maintain *sptep &
  PT_MAY_ALLOW_WRITES, but that path has a simpler concurrency model.  It
  can be as simple as a clear_bit() before we update sp->gfns[] or if we
  add host write protection.
 
  Sorry, it's too complicated for me.  Marcelo, what's your take?
  
  The improvement is small and limited to special cases (migration should
  be rare and framebuffer memory accounts for a small percentage of total
  memory).
  
  For one, how can this be safe against mmu notifier methods?
  
  KSM   |VCPU0| VCPU1
| fault   | fault
| cow-page|
| set spte RW |
| |
  write protect host pte| |
  grab mmu_lock | |
  remove writeable bit in spte  | |
  increase mmu_notifier_seq | |  spte = read-only spte
  release mmu_lock  | |  cmpxchg succeeds, RO->RW!
  
  MMU notifiers rely on the fault path sequence being
  
  read host pte
  read mmu_notifier_seq
  spin_lock(mmu_lock)
  if (mmu_notifier_seq changed)
  goodbye, host pte value is stale
  spin_unlock(mmu_lock)
  
  By the example above, you cannot rely on the spte value alone,
  mmu_notifier_seq must be taken into account.
 
 
 No.
 
 When KSM change the host page to read-only, the HOST_WRITABLE bit
 of spte should be removed, that means, the spte should be changed
 that can be watched by cmpxchg.
 
 Note: we mark spte to be writeable only if spte.HOST_WRITABLE is
 set.

Right. 


Re: [PATCH 00/13] KVM: MMU: fast page fault

2012-04-09 Thread Marcelo Tosatti
On Tue, Apr 10, 2012 at 02:26:27AM +0800, Xiao Guangrong wrote:
 On 04/10/2012 01:58 AM, Marcelo Tosatti wrote:
 
  On Mon, Apr 09, 2012 at 04:12:46PM +0300, Avi Kivity wrote:
  On 03/29/2012 11:20 AM, Xiao Guangrong wrote:
  * Idea
  The present bit of the page fault error code (EFEC.P) indicates whether the
  page table is populated on all levels; if this bit is set, we know the
  page fault is caused by the page-protection bits (e.g. the W/R bit) or by
  the reserved bits.
  
  In KVM, in most cases, this kind of page fault (EFEC.P = 1) can be
  simply fixed: the page fault caused by a reserved bit
  (EFEC.P = 1 && EFEC.RSV = 1) has already been filtered out in the fast mmio
  path. What we need to do to fix the rest of these page faults
  (EFEC.P = 1 && RSV != 1) is just to increase the corresponding access
  permissions on the spte.
  
  This patchset introduces a fast path to fix this kind of page fault: it
  runs outside of the mmu-lock and need not walk the host page table to get
  the mapping from gfn to pfn.
 
 
 
  This patchset is really worrying to me.
 
  It introduces a lot of concurrency into data structures that were not
  designed for it.  Even if it is correct, it will be very hard to
  convince ourselves that it is correct, and if it isn't, to debug those
  subtle bugs.  It will also be much harder to maintain the mmu code than
  it is now.
 
  There are a lot of things to check.  Just as an example, we need to be
  sure that if we use rcu_dereference() twice in the same code path, that
  any inconsistencies due to a write in between are benign.  Doing that is
  a huge task.
 
  But I appreciate the performance improvement and would like to see a
  simpler version make it in.  This needs to reduce the amount of data
  touched in the fast path so it is easier to validate, and perhaps reduce
  the number of cases that the fast path works on.
 
  I would like to see the fast path as simple as
 
rcu_read_lock();
 
(lockless shadow walk)
spte = ACCESS_ONCE(*sptep);
 
if (!(spte & PT_MAY_ALLOW_WRITES))
  goto slow;
 
gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->sptes)
mark_page_dirty(kvm, gfn);
 
new_spte = spte & ~(PT64_MAY_ALLOW_WRITES | PT_WRITABLE_MASK);
if (cmpxchg(sptep, spte, new_spte) != spte)
 goto slow;
 
rcu_read_unlock();
return;
 
  slow:
rcu_read_unlock();
slow_path();
 
  It now becomes the responsibility of the slow path to maintain *sptep &
  PT_MAY_ALLOW_WRITES, but that path has a simpler concurrency model.  It
  can be as simple as a clear_bit() before we update sp->gfns[] or if we
  add host write protection.
 
  Sorry, it's too complicated for me.  Marcelo, what's your take?
  
  The improvement is small and limited to special cases (migration should
  be rare and framebuffer memory accounts for a small percentage of total
  memory).
 
 
 Actually, although the framebuffer is small, it is modified really
 frequently, and another unlucky thing is that the dirty log is also
 requested very frequently and needs to hold the mmu-lock to do the
 write protection.
 
 Yes, if Xwindow is not enabled, the benefit is limited. :)

Ignoring that fact, the safety of lockless set_spte and friends is not
clear.

Perhaps the mmu_lock hold times by get_dirty are a large component here?
If that can be alleviated, not only RO->RW faults benefit.
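
(For context, a simplified sketch of what the x86 get_dirty_log path did at
the time: write protection of the slot happens with mmu_lock held, which is
the hold time referred to above. Simplified; the mainline code of this
period iterated the dirty bitmap and write-protected in chunks.)

    /* Simplified: kvm_vm_ioctl_get_dirty_log(), x86 flavour. */
    spin_lock(&kvm->mmu_lock);
    kvm_mmu_slot_remove_write_access(kvm, log->slot);
    spin_unlock(&kvm->mmu_lock);
    /* ... then the dirty bitmap is copied out to userspace ... */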



Re: BUG: soft lockup - CPU#1 stuck for 61s ! [qemu-kvm... (and server hang)

2012-04-09 Thread Rahul


Tomasz,

Did you find the cause of this issue? Please share; I am also facing the same issue.

-Rahul N.




Re: BUG: soft lockup - CPU#1 stuck for 61s ! [qemu-kvm... (and server hang)

2012-04-09 Thread Marcelo Tosatti
On Mon, Apr 09, 2012 at 07:56:28PM +, Rahul wrote:
 
 
 Tomasz,
 
 Did you find the cause of this issue? Please share; I am also facing the same
 issue.
 
 -Rahul N.

Rahul,

Can you please post traces?



vhost-blk development

2012-04-09 Thread Michael Baysek
Hi all.  I'm interested in any developments on the vhost-blk in-kernel
accelerator for disk I/O.

I had seen a patchset on LKML https://lkml.org/lkml/2011/7/28/175 but that is
rather old.  Are there any newer developments going on with the vhost-blk stuff?


[PATCH 0/2] adding tracepoints to vhost

2012-04-09 Thread Jason Wang
To help with analyzing vhost, the following series adds basic tracepoints to
vhost. Operations of both virtqueues and vhost work items are traced in the
current implementation; the net code is untouched. A top-like statistics
display script is introduced to help with troubleshooting.

TODO:
- net specific tracepoints?

---

Jason Wang (2):
  vhost: basic tracepoints
  tools: virtio: add a top-like utility for displaying vhost statistics


 drivers/vhost/trace.h   |  153 
 drivers/vhost/vhost.c   |   17 ++
 tools/virtio/vhost_stat |  360 +++
 3 files changed, 528 insertions(+), 2 deletions(-)
 create mode 100644 drivers/vhost/trace.h
 create mode 100755 tools/virtio/vhost_stat

-- 
Jason Wang


[PATCH 1/2] vhost: basic tracepoints

2012-04-09 Thread Jason Wang
To help with performance optimization and debugging, this patch adds tracepoints
for vhost. Note that the tracepoints are only for the core vhost code; the net
code is not touched.

Two kinds of activities were traced: virtio and vhost work.

Signed-off-by: Jason Wang jasow...@redhat.com
---
 drivers/vhost/trace.h |  153 +
 drivers/vhost/vhost.c |   17 +
 2 files changed, 168 insertions(+), 2 deletions(-)
 create mode 100644 drivers/vhost/trace.h

diff --git a/drivers/vhost/trace.h b/drivers/vhost/trace.h
new file mode 100644
index 000..0423899
--- /dev/null
+++ b/drivers/vhost/trace.h
@@ -0,0 +1,153 @@
+#if !defined(_TRACE_VHOST_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_VHOST_H
+
+#include <linux/tracepoint.h>
+#include "vhost.h"
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM vhost
+
+/*
+ * Tracepoint for updating used flag.
+ */
+TRACE_EVENT(vhost_virtio_update_used_flags,
+   TP_PROTO(struct vhost_virtqueue *vq),
+   TP_ARGS(vq),
+
+   TP_STRUCT__entry(
+   __field(struct vhost_virtqueue *, vq)
+   __field(u16, used_flags)
+   ),
+
+   TP_fast_assign(
+   __entry->vq = vq;
+   __entry->used_flags = vq->used_flags;
+   ),
+
+   TP_printk("vhost update used flag %x to vq %p notify %s",
+   __entry->used_flags, __entry->vq,
+   (__entry->used_flags & VRING_USED_F_NO_NOTIFY) ?
+   "disabled" : "enabled")
+);
+
+/*
+ * Tracepoint for updating avail event.
+ */
+TRACE_EVENT(vhost_virtio_update_avail_event,
+   TP_PROTO(struct vhost_virtqueue *vq),
+   TP_ARGS(vq),
+
+   TP_STRUCT__entry(
+   __field(struct vhost_virtqueue *, vq)
+   __field(u16, avail_idx)
+   ),
+
+   TP_fast_assign(
+   __entry->vq = vq;
+   __entry->avail_idx = vq->avail_idx;
+   ),
+
+   TP_printk("vhost update avail idx %u(%u) for vq %p",
+ __entry->avail_idx, __entry->avail_idx %
+ __entry->vq->num, __entry->vq)
+);
+
+/*
+ * Tracepoint for processing descriptor.
+ */
+TRACE_EVENT(vhost_virtio_get_vq_desc,
+   TP_PROTO(struct vhost_virtqueue *vq, unsigned int index,
+unsigned out, unsigned int in),
+   TP_ARGS(vq, index, out, in),
+
+   TP_STRUCT__entry(
+   __field(struct vhost_virtqueue *, vq)
+   __field(unsigned int, head)
+   __field(unsigned int, out)
+   __field(unsigned int, in)
+   ),
+
+   TP_fast_assign(
+   __entry->vq = vq;
+   __entry->head = index;
+   __entry->out = out;
+   __entry->in = in;
+   ),
+
+   TP_printk("vhost get vq %p head %u out %u in %u",
+ __entry->vq, __entry->head, __entry->out, __entry->in)
+
+);
+
+/*
+ * Tracepoint for signal guest.
+ */
+TRACE_EVENT(vhost_virtio_signal,
+   TP_PROTO(struct vhost_virtqueue *vq),
+   TP_ARGS(vq),
+
+   TP_STRUCT__entry(
+   __field(struct vhost_virtqueue *, vq)
+   ),
+
+   TP_fast_assign(
+   __entry->vq = vq;
+   ),
+
+   TP_printk("vhost signal vq %p", __entry->vq)
+);
+
+DECLARE_EVENT_CLASS(vhost_work_template,
+   TP_PROTO(struct vhost_dev *dev, struct vhost_work *work),
+   TP_ARGS(dev, work),
+
+   TP_STRUCT__entry(
+   __field(struct vhost_dev *, dev)
+   __field(struct vhost_work *, work)
+   __field(void *, function)
+   ),
+
+   TP_fast_assign(
+   __entry->dev = dev;
+   __entry->work = work;
+   __entry->function = work->fn;
+   ),
+
+   TP_printk("%pf for work %p dev %p",
+   __entry->function, __entry->work, __entry->dev)
+);
+
+DEFINE_EVENT(vhost_work_template, vhost_work_queue_wakeup,
+   TP_PROTO(struct vhost_dev *dev, struct vhost_work *work),
+   TP_ARGS(dev, work));
+
+DEFINE_EVENT(vhost_work_template, vhost_work_queue_coalesce,
+   TP_PROTO(struct vhost_dev *dev, struct vhost_work *work),
+   TP_ARGS(dev, work));
+
+DEFINE_EVENT(vhost_work_template, vhost_poll_start,
+   TP_PROTO(struct vhost_dev *dev, struct vhost_work *work),
+   TP_ARGS(dev, work));
+
+DEFINE_EVENT(vhost_work_template, vhost_poll_stop,
+   TP_PROTO(struct vhost_dev *dev, struct vhost_work *work),
+   TP_ARGS(dev, work));
+
+DEFINE_EVENT(vhost_work_template, vhost_work_execute_start,
+   TP_PROTO(struct vhost_dev *dev, struct vhost_work *work),
+   TP_ARGS(dev, work));
+
+DEFINE_EVENT(vhost_work_template, vhost_work_execute_end,
+   TP_PROTO(struct vhost_dev *dev, struct vhost_work *work),
+   TP_ARGS(dev, work));
+
+#endif /* _TRACE_VHOST_H */
+
+#undef TRACE_INCLUDE_PATH
+#define TRACE_INCLUDE_PATH ../../drivers/vhost
+#undef TRACE_INCLUDE_FILE
+#define TRACE_INCLUDE_FILE trace
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
+
diff --git 
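
(The vhost.c hunk of the patch is truncated in the archive at this point.
Purely as a hypothetical illustration of where the queue_wakeup and
queue_coalesce events could be emitted, based on the event names and the
shape of vhost_work_queue() at the time; this is not the actual patch
content.)

    void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
    {
            unsigned long flags;

            spin_lock_irqsave(&dev->work_lock, flags);
            if (list_empty(&work->node)) {
                    /* work was idle: queue it and wake the worker thread */
                    trace_vhost_work_queue_wakeup(dev, work);
                    list_add_tail(&work->node, &dev->work_list);
                    work->queue_seq++;
                    wake_up_process(dev->worker);
            } else {
                    /* work already pending: this request is coalesced */
                    trace_vhost_work_queue_coalesce(dev, work);
            }
            spin_unlock_irqrestore(&dev->work_lock, flags);
    }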

[PATCH 2/2] tools: virtio: add a top-like utility for displaying vhost statistics

2012-04-09 Thread Jason Wang
This patch adds a simple python script to display vhost statistics; the code
is based on the kvm_stat script from qemu. As the work function is recorded,
filters can be used to distinguish which kinds of work are being executed or
queued:

vhost statistics

 vhost_virtio_get_vq_desc   1460997  219682
 vhost_work_execute_start101248   12842
 vhost_work_execute_end  101247   12842
 vhost_work_queue_wakeup 101263   12841
 vhost_virtio_signal  684528659
 vhost_work_queue_wakeup(rx_net)  517976584
 vhost_work_execute_start(rx_net) 517956584
 vhost_work_queue_coalesce357376571
 vhost_work_queue_coalesce(rx_net)357096566
 vhost_virtio_update_avail_event  495126271
 vhost_work_execute_start(tx_kick)494296254
 vhost_work_queue_wakeup(tx_kick) 494426252
 vhost_work_queue_coalesce(tx_kick)  28   5
 vhost_work_execute_start(rx_kick)   22   3
 vhost_work_queue_wakeup(rx_kick)22   3
 vhost_poll_start 4   0

Signed-off-by: Jason Wang jasow...@redhat.com
---
 tools/virtio/vhost_stat |  360 +++
 1 files changed, 360 insertions(+), 0 deletions(-)
 create mode 100755 tools/virtio/vhost_stat

diff --git a/tools/virtio/vhost_stat b/tools/virtio/vhost_stat
new file mode 100755
index 000..b730f3b
--- /dev/null
+++ b/tools/virtio/vhost_stat
@@ -0,0 +1,360 @@
+#!/usr/bin/python
+#
+# top-like utility for displaying vhost statistics
+#
+# Copyright 2012 Red Hat, Inc.
+#
+# Modified from kvm_stat from qemu
+#
+# This work is licensed under the terms of the GNU GPL, version 2.  See
+# the COPYING file in the top-level directory.
+
+import curses
+import sys, os, time, optparse
+
+work_types = {
+    "handle_rx_kick" : "rx_kick",
+    "handle_tx_kick" : "tx_kick",
+    "handle_rx_net" : "rx_net",
+    "handle_tx_net" : "tx_net",
+    "vhost_attach_cgroups_work": "cg_attach"
+}
+
+addr = {}
+
+kallsyms = file("/proc/kallsyms").readlines()
+for kallsym in kallsyms:
+    entry = kallsym.split()
+    if entry[2] in work_types.keys():
+        addr["0x%s" % entry[0]] = work_types[entry[2]]
+
+filters = {
+'vhost_work_queue_wakeup': ('function', addr),
+'vhost_work_queue_coalesce' : ('function', addr),
+'vhost_work_execute_start' : ('function', addr),
+'vhost_poll_start' : ('function', addr),
+'vhost_poll_stop' : ('function', addr),
+}
+
+def invert(d):
+    return dict((x[1], x[0]) for x in d.iteritems())
+
+for f in filters:
+    filters[f] = (filters[f][0], invert(filters[f][1]))
+
+import ctypes, struct, array
+
+libc = ctypes.CDLL('libc.so.6')
+syscall = libc.syscall
+class perf_event_attr(ctypes.Structure):
+    _fields_ = [('type', ctypes.c_uint32),
+('size', ctypes.c_uint32),
+('config', ctypes.c_uint64),
+('sample_freq', ctypes.c_uint64),
+('sample_type', ctypes.c_uint64),
+('read_format', ctypes.c_uint64),
+('flags', ctypes.c_uint64),
+('wakeup_events', ctypes.c_uint32),
+('bp_type', ctypes.c_uint32),
+('bp_addr', ctypes.c_uint64),
+('bp_len', ctypes.c_uint64),
+]
+def _perf_event_open(attr, pid, cpu, group_fd, flags):
+    return syscall(298, ctypes.pointer(attr), ctypes.c_int(pid),
+                   ctypes.c_int(cpu), ctypes.c_int(group_fd),
+                   ctypes.c_long(flags))
+
+PERF_TYPE_HARDWARE  = 0
+PERF_TYPE_SOFTWARE  = 1
+PERF_TYPE_TRACEPOINT= 2
+PERF_TYPE_HW_CACHE  = 3
+PERF_TYPE_RAW   = 4
+PERF_TYPE_BREAKPOINT= 5
+
+PERF_SAMPLE_IP  = 1 << 0
+PERF_SAMPLE_TID = 1 << 1
+PERF_SAMPLE_TIME= 1 << 2
+PERF_SAMPLE_ADDR= 1 << 3
+PERF_SAMPLE_READ= 1 << 4
+PERF_SAMPLE_CALLCHAIN   = 1 << 5
+PERF_SAMPLE_ID  = 1 << 6
+PERF_SAMPLE_CPU = 1 << 7
+PERF_SAMPLE_PERIOD  = 1 << 8
+PERF_SAMPLE_STREAM_ID   = 1 << 9
+PERF_SAMPLE_RAW = 1 << 10
+
+PERF_FORMAT_TOTAL_TIME_ENABLED  = 1 << 0
+PERF_FORMAT_TOTAL_TIME_RUNNING  = 1 << 1
+PERF_FORMAT_ID  = 1 << 2
+PERF_FORMAT_GROUP   = 1 << 3
+
+import re
+
+sys_tracing = '/sys/kernel/debug/tracing'
+
+class Group(object):
+    def __init__(self, cpu):
+        self.events = []
+        self.group_leader = None
+        self.cpu = cpu
+    def add_event(self, name, event_set, tracepoint, filter = None):
+        self.events.append(Event(group = self,
+                                 name = name, event_set = event_set,
+                                 tracepoint = tracepoint, filter = filter))
+        if len(self.events) == 1:
+

Re: [PATCH 00/13] KVM: MMU: fast page fault

2012-04-09 Thread Xiao Guangrong
On 04/10/2012 03:46 AM, Marcelo Tosatti wrote:

 On Tue, Apr 10, 2012 at 02:26:27AM +0800, Xiao Guangrong wrote:
 On 04/10/2012 01:58 AM, Marcelo Tosatti wrote:

 On Mon, Apr 09, 2012 at 04:12:46PM +0300, Avi Kivity wrote:
 On 03/29/2012 11:20 AM, Xiao Guangrong wrote:
 * Idea
 The present bit of the page fault error code (EFEC.P) indicates whether the
 page table is populated on all levels; if this bit is set, we know the
 page fault is caused by the page-protection bits (e.g. the W/R bit) or by
 the reserved bits.

 In KVM, in most cases, this kind of page fault (EFEC.P = 1) can be
 simply fixed: the page fault caused by a reserved bit
 (EFEC.P = 1 && EFEC.RSV = 1) has already been filtered out in the fast mmio
 path. What we need to do to fix the rest of these page faults
 (EFEC.P = 1 && RSV != 1) is just to increase the corresponding access
 permissions on the spte.

 This patchset introduces a fast path to fix this kind of page fault: it
 runs outside of the mmu-lock and need not walk the host page table to get
 the mapping from gfn to pfn.



 This patchset is really worrying to me.

 It introduces a lot of concurrency into data structures that were not
 designed for it.  Even if it is correct, it will be very hard to
 convince ourselves that it is correct, and if it isn't, to debug those
 subtle bugs.  It will also be much harder to maintain the mmu code than
 it is now.

 There are a lot of things to check.  Just as an example, we need to be
 sure that if we use rcu_dereference() twice in the same code path, that
 any inconsistencies due to a write in between are benign.  Doing that is
 a huge task.

 But I appreciate the performance improvement and would like to see a
 simpler version make it in.  This needs to reduce the amount of data
 touched in the fast path so it is easier to validate, and perhaps reduce
 the number of cases that the fast path works on.

 I would like to see the fast path as simple as

   rcu_read_lock();

   (lockless shadow walk)
   spte = ACCESS_ONCE(*sptep);

   if (!(spte & PT_MAY_ALLOW_WRITES))
 goto slow;

   gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->sptes)
   mark_page_dirty(kvm, gfn);

   new_spte = spte & ~(PT64_MAY_ALLOW_WRITES | PT_WRITABLE_MASK);
   if (cmpxchg(sptep, spte, new_spte) != spte)
goto slow;

   rcu_read_unlock();
   return;

 slow:
   rcu_read_unlock();
   slow_path();

 It now becomes the responsibility of the slow path to maintain *sptep &
 PT_MAY_ALLOW_WRITES, but that path has a simpler concurrency model.  It
 can be as simple as a clear_bit() before we update sp->gfns[] or if we
 add host write protection.

 Sorry, it's too complicated for me.  Marcelo, what's your take?

 The improvement is small and limited to special cases (migration should
 be rare and framebuffer memory accounts for a small percentage of total
 memory).


 Actually, although the framebuffer is small, it is modified really
 frequently, and another unlucky thing is that the dirty log is also
 requested very frequently and needs to hold the mmu-lock to do the
 write protection.

 Yes, if Xwindow is not enabled, the benefit is limited. :)
 
 Ignoring that fact, the safety of lockless set_spte and friends is not
 clear.
 


That is why Avi suggested that I simplify the whole thing. :)


 Perhaps the mmu_lock hold times by get_dirty are a large component here?
 If that can be alleviated, not only RO->RW faults benefit.
 

Yes.



Re: Questing regarding KVM Guest PMU

2012-04-09 Thread shashank rachamalla
On Mon, Apr 9, 2012 at 2:56 PM, Gleb Natapov g...@redhat.com wrote:
 On Sun, Apr 08, 2012 at 06:27:03PM +0300, Gleb Natapov wrote:
 On Fri, Apr 06, 2012 at 09:50:50AM +0300, Gleb Natapov wrote:
  On Fri, Apr 06, 2012 at 10:43:17AM +0530, shashank rachamalla wrote:
   On Thu, Apr 5, 2012 at 8:11 PM, Gleb Natapov g...@redhat.com wrote:
On Thu, Apr 05, 2012 at 05:38:40PM +0300, Avi Kivity wrote:
On 04/05/2012 04:57 PM, Gleb Natapov wrote:
  
  Maybe it used NMI-based profiling. We should ask the oprofile
  developers.
  As I said I am almost sure my inability to run it on a host is 
  probably
  PEBKAC, although I ran the same script exactly on the host and the
  guest (the script is from the first email of this thread)
 
 After upgrading the kernel to the latest git from whatever was there,
 the same script works and counts CPU_CLK_UNHALT events.

   
This is even while it violates the Intel guidelines?
   
Yes, but who says the result is correct :) It seems that we handle the
global ctrl MSR wrong. That is, a counter can be enabled either in global
ctrl or in eventsel. Trying to confirm that.
   
   if that turns out to be true, will the global ctrl MSR have any significance?
  When it is in use, yes.
 
 I was wrong. We do handle global ctrl msr correctly, I just ran my test
 incorrectly. If I disable global ctrl on all cpus (for i in `seq 0 15`;
 do wrmsr -p $i 0x38f 0; done) oprofile stops working.

 After searching high and low I finally found the following in the
 Performance Monitoring Unit Sharing Guide white paper:

  Known starting state: Software requires a known starting
  state. After CPU reset, all counters and control registers are
  disabled and clear/reset to 0.  The only exception to this is the
  IA32_PERF_GLOBAL_CTRL control MSR, all programmable counter global
  enable bits are reset to 1.

 Patch will follow.
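
(A minimal sketch of what the described fix amounts to, not the actual
patch, which follows separately: on reset, the emulated IA32_PERF_GLOBAL_CTRL
should come up with all programmable-counter enable bits set. Field names
follow KVM's struct kvm_pmu of that era; the function name is made up.)

    static void pmu_sketch_reset_global_ctrl(struct kvm_pmu *pmu)
    {
            /* All programmable counters globally enabled after reset,
             * matching the PMU sharing guide quoted above. */
            pmu->global_ctrl = (1ull << pmu->nr_arch_gp_counters) - 1;
    }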


Thanks for the patch. The global ctrl MSR is properly set inside the guest and
I can run oprofile as well.

 --
                        Gleb.


Re: source for virt io backend driver

2012-04-09 Thread Steven
Hi,
I found this post
http://permalink.gmane.org/gmane.comp.emulators.kvm.devel/89334
So the current virtio block backend seems to be completely emulated by the QEMU userspace driver.

- ha

On Mon, Apr 9, 2012 at 2:15 AM, Nadav Har'El n...@math.technion.ac.il wrote:
 On Sun, Apr 08, 2012, Steven wrote about Re: source for virt io backend 
 driver:
  vhost-net. This you can find in drivers/vhost/net.c (and
  drivers/vhost/vhost.c for the generic vhost infrastructure for virtio
  drivers in the host kernel).

 Yes, I am looking for the backend code in the kernel. I found the file
 for net backend drivers/vhost/net.c. However, the backend code for blk
 is not there.
 Could you point out this? Thanks.

 Indeed, vhost-block is not (yet) part of the mainline kernel.

 Maybe someone else more experienced in vhost-block can recommend where
 to get the latest version.

 Nadav.


 --
 Nadav Har'El                        |                     Monday, Apr 9 2012,
 n...@math.technion.ac.il             
 |-
 Phone +972-523-790466, ICQ 13349191 |Diplomat: A man who always remembers a
 http://nadav.harel.org.il           |woman's birthday but never her age.