Re: [PATCH] kvm: rework remove-write-access for a slot

2010-06-04 Thread Lai Jiangshan
Avi Kivity wrote:
 On 06/02/2010 11:53 AM, Lai Jiangshan wrote:
 The current code uses slot_bitmap to find the ptes that map a page
 from the memory slot, but it is not precise: some ptes in the shadow page
 do not map any page from the memory slot.

 This patch uses rmap to find the ptes precisely, and remove
 the unused slot_bitmap.


 
 Patch looks good; a couple of comments:
 
 - We might see a slowdown with !tdp, since we no longer have locality. 
 Each page will map to an spte in a different page.  However, it's still
 worth it in my opinion.

Yes, this patch hurts the cache since we no longer have locality.
And if most pages of the slot are not mapped (rmap_next(kvm, rmapp, NULL) == NULL),
this patch will be worse than the old method, I think.

This patch does things straightforwardly and precisely.

 - I thought of a different approach to write protection: write protect
 the L4 sptes, on write fault add write permission to the L4 spte and
 write protect the L3 sptes that it points to, etc.  This method can use
 the slot bitmap to reduce the number of write faults.  However we can
 reintroduce the slot bitmap if/when we use the method, this shouldn't
 block the patch.

It is a very good approach, and it would be blazing fast.

I have no time to implement it at the moment;
could you add it to the TODO list?

 

 +static void rmapp_remove_write_access(struct kvm *kvm, unsigned long *rmapp)
 +{
 +	u64 *spte = rmap_next(kvm, rmapp, NULL);
 +
 +	while (spte) {
 +		/* avoid RMW */
 +		if (is_writable_pte(*spte))
 +			*spte &= ~PT_WRITABLE_MASK;
 
 Must use an atomic operation here to avoid losing dirty or accessed bit.
 

An atomic operation is too expensive, so I kept the /* avoid RMW */ comment
and will wait for someone to come up with a good approach for it.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv3 1/2] virtio: support layout with avail ring before idx

2010-06-04 Thread Michael S. Tsirkin
On Fri, Jun 04, 2010 at 12:04:57PM +0930, Rusty Russell wrote:
 On Wed, 2 Jun 2010 12:17:12 am Michael S. Tsirkin wrote:
  This adds an (unused) option to put available ring before control (avail
  index, flags), and adds padding between index and flags. This avoids
  cache line sharing between control and ring, and also makes it possible
  to extend avail control without incurring extra cache misses.
  
  Signed-off-by: Michael S. Tsirkin m...@redhat.com
 
 No no no no.  254?  You're trying to Morton me![1]

Hmm, I wonder what we will do if we want a 3rd field on
a separate cacheline. But ok.

 How's this (untested):

I think we also want to put flags there as well,
they are used on interrupt path, together with last used index.

 diff --git a/include/linux/virtio_ring.h b/include/linux/virtio_ring.h
 --- a/include/linux/virtio_ring.h
 +++ b/include/linux/virtio_ring.h
 @@ -74,8 +74,8 @@ struct vring {
  /* The standard layout for the ring is a continuous chunk of memory which looks
   * like this.  We assume num is a power of 2.
   *
 - * struct vring
 - * {
 + * struct vring {
 + *   *** The driver writes to this part.
   *   // The actual descriptors (16 bytes each)
   *   struct vring_desc desc[num];
   *
 @@ -84,9 +84,11 @@ struct vring {
   *   __u16 avail_idx;
   *   __u16 available[num];
   *
 - *   // Padding to the next align boundary.
 + *   // Padding so used_flags is on the next align boundary.
   *   char pad[];
 + *   __u16 last_used; // On a cacheline of its own.
   *
 + *   *** The device writes to this part.
   *   // A ring of used descriptor heads with free-running index.
   *   __u16 used_flags;
   *   __u16 used_idx;
 @@ -110,6 +112,12 @@ static inline unsigned vring_size(unsign
   + sizeof(__u16) * 2 + sizeof(struct vring_used_elem) * num;
  }
  
 +/* Last used index sits at the very end of the driver part of the struct */
 +static inline __u16 *vring_last_used_idx(const struct vring *vr)
 +{
  + return (__u16 *)vr->used - 1;
 +}
 +
  #ifdef __KERNEL__
   #include <linux/irqreturn.h>
  struct virtio_device;
 
 Cheers,
 Rusty.
 [1] Andrew Morton has this technique where he posts a solution so ugly it
 forces others to fix it properly.  Ego-roping, basically.


Re: [RFC PATCH v4 3/3] block: add sheepdog driver for distributed storage support

2010-06-04 Thread Kevin Wolf
On 03.06.2010 18:23, MORITA Kazutaka wrote:
 +static void sd_aio_cancel(BlockDriverAIOCB *blockacb)
 +{
 +	SheepdogAIOCB *acb = (SheepdogAIOCB *)blockacb;
 +
 +	acb->canceled = 1;
 +}

 Does this provide the right semantics? You haven't really cancelled the
 request, but you pretend to. So you actually complete the request in the
 background and then throw the return code away.

 I seem to remember that posix-aio-compat.c waits at this point for
 completion of the requests, calls the callbacks and only afterwards
 returns from aio_cancel when no more requests are in flight.

 Or if you can really cancel requests, it would be the best option, of
 course.

 
 Sheepdog cannot cancel requests that have already been sent to the
 servers.  So, as you say, we pretend to cancel the requests without
 waiting for their completion.  However, is there any situation
 where pretending to cancel causes problems in practice?

I'm not sure how often it would happen in practice, but if the guest OS
thinks the old value is on disk when in fact the new one is, this could
lead to corruption. I think if it can happen, even without evidence that
it actually does, it's already relevant enough.

 To wait for completion of the requests here, we may need to create
 another thread for processing I/O like posix-aio-compat.c.

I don't think you need a thread to get the same behaviour, you just need
to call the fd handlers like in the main loop. It would probably be the
first driver doing this, though, and it's not an often used code path,
so it might be a bad idea.

Maybe it's reasonable to just complete the request with -EIO? This way
the guest couldn't make any assumption about the data written. On the
other hand, it could be unhappy about failed requests, but that's
probably better than corruption.

Kevin


Re: [PATCHv3 1/2] virtio: support layout with avail ring before idx

2010-06-04 Thread Rusty Russell
On Fri, 4 Jun 2010 08:05:43 pm Michael S. Tsirkin wrote:
 On Fri, Jun 04, 2010 at 12:04:57PM +0930, Rusty Russell wrote:
  On Wed, 2 Jun 2010 12:17:12 am Michael S. Tsirkin wrote:
   This adds an (unused) option to put available ring before control (avail
   index, flags), and adds padding between index and flags. This avoids
   cache line sharing between control and ring, and also makes it possible
   to extend avail control without incurring extra cache misses.
   
   Signed-off-by: Michael S. Tsirkin m...@redhat.com
  
  No no no no.  254?  You're trying to Morton me![1]
 
  Hmm, I wonder what we will do if we want a 3rd field on
  a separate cacheline. But ok.
 
  How's this (untested):
 
 I think we also want to put flags there as well,
 they are used on interrupt path, together with last used index.

I'm uncomfortable with moving a field.

We haven't done that before and I wonder what will break with old code.

Should we instead just abandon the flags field and use last_used only?
Or, more radically, put flags == last_used when the feature is on?

Thoughts?
Rusty.


Re: [PATCHv3 1/2] virtio: support layout with avail ring before idx

2010-06-04 Thread Michael S. Tsirkin
On Fri, Jun 04, 2010 at 08:46:49PM +0930, Rusty Russell wrote:
 On Fri, 4 Jun 2010 08:05:43 pm Michael S. Tsirkin wrote:
  On Fri, Jun 04, 2010 at 12:04:57PM +0930, Rusty Russell wrote:
   On Wed, 2 Jun 2010 12:17:12 am Michael S. Tsirkin wrote:
This adds an (unused) option to put available ring before control (avail
index, flags), and adds padding between index and flags. This avoids
cache line sharing between control and ring, and also makes it possible
to extend avail control without incurring extra cache misses.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
   
   No no no no.  254?  You're trying to Morton me![1]
  
   Hmm, I wonder what we will do if we want a 3rd field on
   a separate cacheline. But ok.
  
   How's this (untested):
  
  I think we also want to put flags there as well,
  they are used on interrupt path, together with last used index.
 
 I'm uncomfortable with moving a field.
 
 We haven't done that before and I wonder what will break with old code.

With e.g. my patch, we only do this conditionally, when the feature bit is negotiated.

 Should we instead just abandon the flags field and use last_used only?
 Or, more radically, put flags == last_used when the feature is on?
 
 Thoughts?
 Rusty.

Hmm, e.g. with TX and virtio net, we almost never want interrupts,
whatever the index value.

-- 
MST


Re: Opteron AMD-V support

2010-06-04 Thread Andre Przywara

Brian Jackson wrote:

On Thursday 03 June 2010 21:33:24 Govender, Sashan wrote:

Hi

We bumped into this issue with VMWare ESX 4 where it doesn't support
hardware virtualization if the processor is an AMD Athlon/Opteron
(http://communities.vmware.com/docs/DOC-9150). Does linux-kvm have a
similar issue? More specifically will the the module kvm_amd.ko support
AMD-V on an Opteron 2218?


Yes, KVM doesn't try to be too smart. If you have svm/vt, it runs. If you 
don't, it falls back to tcg (qemu's normal/slow mode). The kvm-amd module will 
load as long as the bios and the CPU both support and enable svm.


That's right. Please note that KVM depends on hardware virtualization, 
so it does not have the choice like VMware has. Falling back to QEMU/TCG 
is not comparable to VMware's binary translation, because their approach 
is highly optimized and limited to x86 on x86, whereas QEMU wants to 
emulate each supported architecture on each host architecture, so it 
naturally cannot be as sophisticated as the VMware approach.
Nested paging has been supported by KVM for a long time, if it's there 
it will be automatically used.


BTW, every Opteron with a four-digit model number supports AMD-V, and KVM will 
run on every such processor. I am not aware of any _Opteron_ boards not 
allowing AMD-V, but there are some desktop/notebook systems where the 
BIOS disables AMD-V (although the processor has it).
The presence of the svm flag in /proc/cpuinfo is a safe indicator 
that KVM is usable.


Regards,
Andre.


--
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 448-3567-12



[PATCH v2 1/7] KVM: MMU: skip invalid sp when unprotect page

2010-06-04 Thread Xiao Guangrong
In kvm_mmu_unprotect_page(), the invalid sp can be skipped

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index a62e3ba..e962f26 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1629,7 +1629,7 @@ static int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn)
 	bucket = &kvm->arch.mmu_page_hash[index];
 restart:
 	hlist_for_each_entry_safe(sp, node, n, bucket, hash_link)
-		if (sp->gfn == gfn && !sp->role.direct) {
+		if (sp->gfn == gfn && !sp->role.direct && !sp->role.invalid) {
 			pgprintk("%s: gfn %lx role %x\n", __func__, gfn,
 				 sp->role.word);
 			r = 1;
-- 
1.6.1.2



[PATCH v2 2/7] KVM: MMU: introduce some macros to clean up hlist traversing

2010-06-04 Thread Xiao Guangrong
Introduce for_each_gfn_sp() and for_each_gfn_indirect_valid_sp() to
clean up the hlist traversing

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c |  122 
 1 files changed, 47 insertions(+), 75 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index e962f26..75bd6df 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1201,6 +1201,17 @@ static void kvm_unlink_unsync_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 
 static int kvm_mmu_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp);
 
+#define for_each_gfn_sp(kvm, sp, gfn, pos, n)				\
+	hlist_for_each_entry_safe(sp, pos, n,				\
+		&(kvm)->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)], hash_link) \
+		if ((sp)->gfn != (gfn)) {} else
+
+#define for_each_gfn_indirect_valid_sp(kvm, sp, gfn, pos, n)		\
+	hlist_for_each_entry_safe(sp, pos, n,				\
+		&(kvm)->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)], hash_link) \
+		if ((sp)->gfn != (gfn) || (sp)->role.direct ||		\
+			(sp)->role.invalid) {} else
+
+
 static int __kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
   bool clear_unsync)
 {
@@ -1244,16 +1255,12 @@ static int kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 /* @gfn should be write-protected at the call site */
 static void kvm_sync_pages(struct kvm_vcpu *vcpu,  gfn_t gfn)
 {
-	struct hlist_head *bucket;
 	struct kvm_mmu_page *s;
 	struct hlist_node *node, *n;
-	unsigned index;
 	bool flush = false;
 
-	index = kvm_page_table_hashfn(gfn);
-	bucket = &vcpu->kvm->arch.mmu_page_hash[index];
-	hlist_for_each_entry_safe(s, node, n, bucket, hash_link) {
-		if (s->gfn != gfn || !s->unsync || s->role.invalid)
+	for_each_gfn_indirect_valid_sp(vcpu->kvm, s, gfn, node, n) {
+		if (!s->unsync)
 			continue;
 
 		WARN_ON(s->role.level != PT_PAGE_TABLE_LEVEL);
@@ -1365,9 +1372,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 u64 *parent_pte)
 {
 	union kvm_mmu_page_role role;
-	unsigned index;
 	unsigned quadrant;
-	struct hlist_head *bucket;
 	struct kvm_mmu_page *sp;
 	struct hlist_node *node, *tmp;
 	bool need_sync = false;
@@ -1383,36 +1388,34 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 		quadrant &= (1 << ((PT32_PT_BITS - PT64_PT_BITS) * level)) - 1;
 		role.quadrant = quadrant;
 	}
-	index = kvm_page_table_hashfn(gfn);
-	bucket = &vcpu->kvm->arch.mmu_page_hash[index];
-	hlist_for_each_entry_safe(sp, node, tmp, bucket, hash_link)
-		if (sp->gfn == gfn) {
-			if (!need_sync && sp->unsync)
-				need_sync = true;
+	for_each_gfn_sp(vcpu->kvm, sp, gfn, node, tmp) {
+		if (!need_sync && sp->unsync)
+			need_sync = true;
 
-			if (sp->role.word != role.word)
-				continue;
+		if (sp->role.word != role.word)
+			continue;
 
-			if (sp->unsync && kvm_sync_page_transient(vcpu, sp))
-				break;
+		if (sp->unsync && kvm_sync_page_transient(vcpu, sp))
+			break;
 
-			mmu_page_add_parent_pte(vcpu, sp, parent_pte);
-			if (sp->unsync_children) {
-				set_bit(KVM_REQ_MMU_SYNC, &vcpu->requests);
-				kvm_mmu_mark_parents_unsync(sp);
-			} else if (sp->unsync)
-				kvm_mmu_mark_parents_unsync(sp);
+		mmu_page_add_parent_pte(vcpu, sp, parent_pte);
+		if (sp->unsync_children) {
+			set_bit(KVM_REQ_MMU_SYNC, &vcpu->requests);
+			kvm_mmu_mark_parents_unsync(sp);
+		} else if (sp->unsync)
+			kvm_mmu_mark_parents_unsync(sp);
 
-			trace_kvm_mmu_get_page(sp, false);
-			return sp;
-		}
+		trace_kvm_mmu_get_page(sp, false);
+		return sp;
+	}
 	++vcpu->kvm->stat.mmu_cache_miss;
 	sp = kvm_mmu_alloc_page(vcpu, parent_pte, direct);
 	if (!sp)
 		return sp;
 	sp->gfn = gfn;
 	sp->role = role;
-	hlist_add_head(&sp->hash_link, bucket);
+	hlist_add_head(&sp->hash_link,
+		&vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)]);
 	if (!direct) {
 		if (rmap_write_protect(vcpu->kvm, gfn))
 			kvm_flush_remote_tlbs(vcpu->kvm);
@@ -1617,46 +1620,34 @@ 

[PATCH v2 3/7] KVM: MMU: split the operations of kvm_mmu_zap_page()

2010-06-04 Thread Xiao Guangrong
Use kvm_mmu_prepare_zap_page() and kvm_mmu_commit_zap_page() to
split the kvm_mmu_zap_page() function; then we can:

- traverse the hlist safely
- easily gather the remote TLB flushes that occur while pages are zapped

These features are used in the later patches.

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c  |   52 ++
 arch/x86/kvm/mmutrace.h |2 +-
 2 files changed, 44 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 75bd6df..a64c0e0 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -916,6 +916,7 @@ static int is_empty_shadow_page(u64 *spt)
 static void kvm_mmu_free_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
 	ASSERT(is_empty_shadow_page(sp->spt));
+	hlist_del(&sp->hash_link);
 	list_del(&sp->link);
 	__free_page(virt_to_page(sp->spt));
 	if (!sp->role.direct)
@@ -1200,6 +1201,10 @@ static void kvm_unlink_unsync_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 }
 
 static int kvm_mmu_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp);
+static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+   struct list_head *invalid_list);
+static void kvm_mmu_commit_zap_page(struct kvm *kvm,
+   struct list_head *invalid_list);
 
 #define for_each_gfn_sp(kvm, sp, gfn, pos, n)				\
 	hlist_for_each_entry_safe(sp, pos, n,				\
@@ -1530,7 +1535,8 @@ static void kvm_mmu_unlink_parents(struct kvm *kvm, struct kvm_mmu_page *sp)
 }
 
 static int mmu_zap_unsync_children(struct kvm *kvm,
-  struct kvm_mmu_page *parent)
+  struct kvm_mmu_page *parent,
+  struct list_head *invalid_list)
 {
int i, zapped = 0;
struct mmu_page_path parents;
@@ -1544,7 +1550,7 @@ static int mmu_zap_unsync_children(struct kvm *kvm,
 		struct kvm_mmu_page *sp;
 
 		for_each_sp(pages, sp, parents, i) {
-			kvm_mmu_zap_page(kvm, sp);
+			kvm_mmu_prepare_zap_page(kvm, sp, invalid_list);
 			mmu_pages_clear_parents(&parents);
 			zapped++;
 		}
@@ -1554,16 +1560,16 @@ static int mmu_zap_unsync_children(struct kvm *kvm,
 }
 
-static int kvm_mmu_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp)
+static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+				    struct list_head *invalid_list)
 {
 	int ret;
 
-	trace_kvm_mmu_zap_page(sp);
+	trace_kvm_mmu_prepare_zap_page(sp);
 	++kvm->stat.mmu_shadow_zapped;
-	ret = mmu_zap_unsync_children(kvm, sp);
+	ret = mmu_zap_unsync_children(kvm, sp, invalid_list);
 	kvm_mmu_page_unlink_children(kvm, sp);
 	kvm_mmu_unlink_parents(kvm, sp);
-	kvm_flush_remote_tlbs(kvm);
 	if (!sp->role.invalid && !sp->role.direct)
 		unaccount_shadowed(kvm, sp->gfn);
 	if (sp->unsync)
@@ -1571,17 +1577,45 @@ static int kvm_mmu_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 	if (!sp->root_count) {
 		/* Count self */
 		ret++;
-		hlist_del(&sp->hash_link);
-		kvm_mmu_free_page(kvm, sp);
+		list_move(&sp->link, invalid_list);
 	} else {
-		sp->role.invalid = 1;
 		list_move(&sp->link, &kvm->arch.active_mmu_pages);
 		kvm_reload_remote_mmus(kvm);
 	}
+
+	sp->role.invalid = 1;
 	kvm_mmu_reset_last_pte_updated(kvm);
 	return ret;
 }
 
+static void kvm_mmu_commit_zap_page(struct kvm *kvm,
+				    struct list_head *invalid_list)
+{
+	struct kvm_mmu_page *sp;
+
+	if (list_empty(invalid_list))
+		return;
+
+	kvm_flush_remote_tlbs(kvm);
+
+	do {
+		sp = list_first_entry(invalid_list, struct kvm_mmu_page, link);
+		WARN_ON(!sp->role.invalid || sp->root_count);
+		kvm_mmu_free_page(kvm, sp);
+	} while (!list_empty(invalid_list));
+}
+
+static int kvm_mmu_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp)
+{
+	LIST_HEAD(invalid_list);
+	int ret;
+
+	ret = kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
+	kvm_mmu_commit_zap_page(kvm, &invalid_list);
+	return ret;
+}
+
+
 /*
  * Changing the number of mmu pages allocated to the vm
  * Note: if kvm_nr_mmu_pages is too small, you will get dead lock
diff --git a/arch/x86/kvm/mmutrace.h b/arch/x86/kvm/mmutrace.h
index 42f07b1..3aab0f0 100644
--- a/arch/x86/kvm/mmutrace.h
+++ b/arch/x86/kvm/mmutrace.h
@@ -190,7 +190,7 @@ DEFINE_EVENT(kvm_mmu_page_class, kvm_mmu_unsync_page,
TP_ARGS(sp)
 );
 
-DEFINE_EVENT(kvm_mmu_page_class, kvm_mmu_zap_page,

[PATCH v2 4/7] KVM: MMU: don't get free page number in the loop

2010-06-04 Thread Xiao Guangrong
In a later patch, we will change the way sps are zapped, like below:

kvm_mmu_prepare_zap_page A
kvm_mmu_prepare_zap_page B
kvm_mmu_prepare_zap_page C

kvm_mmu_commit_zap_page

[ zapping multiple sps needs only one kvm_mmu_commit_zap_page() call ]

In __kvm_mmu_free_some_pages(), the number of free pages is read from
'vcpu->kvm->arch.n_free_mmu_pages' inside the loop; this prevents us from
applying kvm_mmu_prepare_zap_page() and kvm_mmu_commit_zap_page() there,
since kvm_mmu_prepare_zap_page() does not free the sp.

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c |7 +--
 1 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index a64c0e0..77bc4ba 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2861,13 +2861,16 @@ EXPORT_SYMBOL_GPL(kvm_mmu_unprotect_page_virt);
 
 void __kvm_mmu_free_some_pages(struct kvm_vcpu *vcpu)
 {
-	while (vcpu->kvm->arch.n_free_mmu_pages < KVM_REFILL_PAGES &&
+	int free_pages;
+
+	free_pages = vcpu->kvm->arch.n_free_mmu_pages;
+	while (free_pages < KVM_REFILL_PAGES &&
 	       !list_empty(&vcpu->kvm->arch.active_mmu_pages)) {
 		struct kvm_mmu_page *sp;
 
 		sp = container_of(vcpu->kvm->arch.active_mmu_pages.prev,
 				  struct kvm_mmu_page, link);
-		kvm_mmu_zap_page(vcpu->kvm, sp);
+		free_pages += kvm_mmu_zap_page(vcpu->kvm, sp);
 		++vcpu->kvm->stat.mmu_recycled;
 	}
 }
-- 
1.6.1.2




[PATCH v2 5/7] KVM: MMU: gather remote tlb flush which occurs during page zapped

2010-06-04 Thread Xiao Guangrong
Use kvm_mmu_prepare_zap_page() and kvm_mmu_commit_zap_page() instead of
kvm_mmu_zap_page(); this reduces the number of remote TLB flush IPIs

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c |   84 ---
 1 files changed, 53 insertions(+), 31 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 77bc4ba..6544d8e 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1200,7 +1200,6 @@ static void kvm_unlink_unsync_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 	--kvm->stat.mmu_unsync;
 }
 
-static int kvm_mmu_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp);
 static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
 				    struct list_head *invalid_list);
 static void kvm_mmu_commit_zap_page(struct kvm *kvm,
@@ -1218,10 +1217,10 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
 			(sp)->role.invalid) {} else
 
 static int __kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
-			   bool clear_unsync)
+			   struct list_head *invalid_list, bool clear_unsync)
 {
 	if (sp->role.cr4_pae != !!is_pae(vcpu)) {
-		kvm_mmu_zap_page(vcpu->kvm, sp);
+		kvm_mmu_prepare_zap_page(vcpu->kvm, sp, invalid_list);
 		return 1;
 	}
 
@@ -1232,7 +1231,7 @@ static int __kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	}
 
 	if (vcpu->arch.mmu.sync_page(vcpu, sp)) {
-		kvm_mmu_zap_page(vcpu->kvm, sp);
+		kvm_mmu_prepare_zap_page(vcpu->kvm, sp, invalid_list);
 		return 1;
 	}
 
@@ -1244,17 +1243,22 @@ static void mmu_convert_notrap(struct kvm_mmu_page *sp);
 static int kvm_sync_page_transient(struct kvm_vcpu *vcpu,
 				   struct kvm_mmu_page *sp)
 {
+	LIST_HEAD(invalid_list);
 	int ret;
 
-	ret = __kvm_sync_page(vcpu, sp, false);
+	ret = __kvm_sync_page(vcpu, sp, &invalid_list, false);
 	if (!ret)
 		mmu_convert_notrap(sp);
+	else
+		kvm_mmu_commit_zap_page(vcpu->kvm, &invalid_list);
+
 	return ret;
 }
 
-static int kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
+static int kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
+			 struct list_head *invalid_list)
 {
-	return __kvm_sync_page(vcpu, sp, true);
+	return __kvm_sync_page(vcpu, sp, invalid_list, true);
 }
 
 /* @gfn should be write-protected at the call site */
@@ -1262,6 +1266,7 @@ static void kvm_sync_pages(struct kvm_vcpu *vcpu, gfn_t gfn)
 {
 	struct kvm_mmu_page *s;
 	struct hlist_node *node, *n;
+	LIST_HEAD(invalid_list);
 	bool flush = false;
 
 	for_each_gfn_indirect_valid_sp(vcpu->kvm, s, gfn, node, n) {
@@ -1271,13 +1276,14 @@ static void kvm_sync_pages(struct kvm_vcpu *vcpu, gfn_t gfn)
 		WARN_ON(s->role.level != PT_PAGE_TABLE_LEVEL);
 		if ((s->role.cr4_pae != !!is_pae(vcpu)) ||
 		    (vcpu->arch.mmu.sync_page(vcpu, s))) {
-			kvm_mmu_zap_page(vcpu->kvm, s);
+			kvm_mmu_prepare_zap_page(vcpu->kvm, s, &invalid_list);
 			continue;
 		}
 		kvm_unlink_unsync_page(vcpu->kvm, s);
 		flush = true;
 	}
 
+	kvm_mmu_commit_zap_page(vcpu->kvm, &invalid_list);
 	if (flush)
 		kvm_mmu_flush_tlb(vcpu);
 }
@@ -1348,6 +1354,7 @@ static void mmu_sync_children(struct kvm_vcpu *vcpu,
 	struct kvm_mmu_page *sp;
 	struct mmu_page_path parents;
 	struct kvm_mmu_pages pages;
+	LIST_HEAD(invalid_list);
 
 	kvm_mmu_pages_init(parent, &parents, &pages);
 	while (mmu_unsync_walk(parent, &pages)) {
@@ -1360,9 +1367,10 @@ static void mmu_sync_children(struct kvm_vcpu *vcpu,
 			kvm_flush_remote_tlbs(vcpu->kvm);
 
 		for_each_sp(pages, sp, parents, i) {
-			kvm_sync_page(vcpu, sp);
+			kvm_sync_page(vcpu, sp, &invalid_list);
 			mmu_pages_clear_parents(&parents);
 		}
+		kvm_mmu_commit_zap_page(vcpu->kvm, &invalid_list);
 		cond_resched_lock(&vcpu->kvm->mmu_lock);
 		kvm_mmu_pages_init(parent, &parents, &pages);
 	}
@@ -1606,16 +1614,6 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
 
 }
 
-static int kvm_mmu_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp)
-{
-	LIST_HEAD(invalid_list);
-	int ret;
-
-	ret = kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
-	kvm_mmu_commit_zap_page(kvm, &invalid_list);
-	return ret;
-}
-
 /*
  * Changing the number of mmu pages allocated to the vm
  * Note: if kvm_nr_mmu_pages is too small, you will get dead lock
@@ -1623,6 +1621,7 @@ static int kvm_mmu_zap_page(struct kvm 

[PATCH v2 6/7] KVM: MMU: traverse sp hlist safely

2010-06-04 Thread Xiao Guangrong
Now we can safely traverse the sp hlist

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c |   51 +++
 1 files changed, 23 insertions(+), 28 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 6544d8e..845cba2 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1205,13 +1205,13 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
 static void kvm_mmu_commit_zap_page(struct kvm *kvm,
 				    struct list_head *invalid_list);
 
-#define for_each_gfn_sp(kvm, sp, gfn, pos, n)				\
-	hlist_for_each_entry_safe(sp, pos, n,				\
+#define for_each_gfn_sp(kvm, sp, gfn, pos)				\
+	hlist_for_each_entry(sp, pos,					\
 		&(kvm)->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)], hash_link) \
 		if ((sp)->gfn != (gfn)) {} else
 
-#define for_each_gfn_indirect_valid_sp(kvm, sp, gfn, pos, n)		\
-	hlist_for_each_entry_safe(sp, pos, n,				\
+#define for_each_gfn_indirect_valid_sp(kvm, sp, gfn, pos)		\
+	hlist_for_each_entry(sp, pos,					\
 		&(kvm)->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)], hash_link) \
 		if ((sp)->gfn != (gfn) || (sp)->role.direct ||		\
 			(sp)->role.invalid) {} else
@@ -1265,11 +1265,11 @@ static int kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 static void kvm_sync_pages(struct kvm_vcpu *vcpu, gfn_t gfn)
 {
 	struct kvm_mmu_page *s;
-	struct hlist_node *node, *n;
+	struct hlist_node *node;
 	LIST_HEAD(invalid_list);
 	bool flush = false;
 
-	for_each_gfn_indirect_valid_sp(vcpu->kvm, s, gfn, node, n) {
+	for_each_gfn_indirect_valid_sp(vcpu->kvm, s, gfn, node) {
 		if (!s->unsync)
 			continue;
 
@@ -1387,7 +1387,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 	union kvm_mmu_page_role role;
 	unsigned quadrant;
 	struct kvm_mmu_page *sp;
-	struct hlist_node *node, *tmp;
+	struct hlist_node *node;
 	bool need_sync = false;
 
 	role = vcpu->arch.mmu.base_role;
@@ -1401,7 +1401,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 		quadrant &= (1 << ((PT32_PT_BITS - PT64_PT_BITS) * level)) - 1;
 		role.quadrant = quadrant;
 	}
-	for_each_gfn_sp(vcpu->kvm, sp, gfn, node, tmp) {
+	for_each_gfn_sp(vcpu->kvm, sp, gfn, node) {
 		if (!need_sync && sp->unsync)
 			need_sync = true;
 
@@ -1656,19 +1656,18 @@ void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned int kvm_nr_mmu_pages)
 static int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn)
 {
 	struct kvm_mmu_page *sp;
-	struct hlist_node *node, *n;
+	struct hlist_node *node;
 	LIST_HEAD(invalid_list);
 	int r;
 
 	pgprintk("%s: looking for gfn %lx\n", __func__, gfn);
 	r = 0;
-restart:
-	for_each_gfn_indirect_valid_sp(kvm, sp, gfn, node, n) {
+
+	for_each_gfn_indirect_valid_sp(kvm, sp, gfn, node) {
 		pgprintk("%s: gfn %lx role %x\n", __func__, gfn,
 			 sp->role.word);
 		r = 1;
-		if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
-			goto restart;
+		kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
 	}
 	kvm_mmu_commit_zap_page(kvm, &invalid_list);
 	return r;
@@ -1677,15 +1676,13 @@ restart:
 static void mmu_unshadow(struct kvm *kvm, gfn_t gfn)
 {
 	struct kvm_mmu_page *sp;
-	struct hlist_node *node, *nn;
+	struct hlist_node *node;
 	LIST_HEAD(invalid_list);
 
-restart:
-	for_each_gfn_indirect_valid_sp(kvm, sp, gfn, node, nn) {
+	for_each_gfn_indirect_valid_sp(kvm, sp, gfn, node) {
 		pgprintk("%s: zap %lx %x\n",
 			 __func__, gfn, sp->role.word);
-		if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
-			goto restart;
+		kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
 	}
 	kvm_mmu_commit_zap_page(kvm, &invalid_list);
 }
@@ -1830,9 +1827,9 @@ static void __kvm_unsync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 static void kvm_unsync_pages(struct kvm_vcpu *vcpu, gfn_t gfn)
 {
 	struct kvm_mmu_page *s;
-	struct hlist_node *node, *n;
+	struct hlist_node *node;
 
-	for_each_gfn_indirect_valid_sp(vcpu->kvm, s, gfn, node, n) {
+	for_each_gfn_indirect_valid_sp(vcpu->kvm, s, gfn, node) {
 		if (s->unsync)
 			continue;
 		WARN_ON(s->role.level != PT_PAGE_TABLE_LEVEL);
@@ -1844,10 +1841,10 @@ static int 

[PATCH v2 7/7] KVM: MMU: reduce remote tlb flush in kvm_mmu_pte_write()

2010-06-04 Thread Xiao Guangrong
Collect remote TLB flushes in the kvm_mmu_pte_write() path

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c |   20 +++-
 1 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 845cba2..8528e5b 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2664,11 +2664,15 @@ static bool need_remote_flush(u64 old, u64 new)
 	return (old & ~new & PT64_PERM_MASK) != 0;
 }
 
-static void mmu_pte_write_flush_tlb(struct kvm_vcpu *vcpu, u64 old, u64 new)
+static void mmu_pte_write_flush_tlb(struct kvm_vcpu *vcpu, bool zap_page,
+				    bool remote_flush, bool local_flush)
 {
-	if (need_remote_flush(old, new))
+	if (zap_page)
+		return;
+
+	if (remote_flush)
 		kvm_flush_remote_tlbs(vcpu->kvm);
-	else
+	else if (local_flush)
 		kvm_mmu_flush_tlb(vcpu);
 }
 
@@ -2733,6 +2737,9 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
int npte;
int r;
int invlpg_counter;
+   bool remote_flush, local_flush, zap_page;
+
+   zap_page = remote_flush = local_flush = false;
 
pgprintk(%s: gpa %llx bytes %d\n, __func__, gpa, bytes);
 
@@ -2806,7 +2813,7 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 */
pgprintk("misaligned: gpa %llx bytes %d role %x\n",
 gpa, bytes, sp->role.word);
-   kvm_mmu_prepare_zap_page(vcpu->kvm, sp,
+   zap_page |= !!kvm_mmu_prepare_zap_page(vcpu->kvm, sp,
 &invalid_list);
++vcpu->kvm->stat.mmu_flooded;
continue;
@@ -2831,16 +2838,19 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
if (quadrant != sp->role.quadrant)
continue;
}
+   local_flush = true;
spte = sp->spt[page_offset / sizeof(*spte)];
while (npte--) {
entry = *spte;
mmu_pte_write_zap_pte(vcpu, sp, spte);
if (gentry)
mmu_pte_write_new_pte(vcpu, sp, spte, gentry);
-   mmu_pte_write_flush_tlb(vcpu, entry, *spte);
+   if (!remote_flush && need_remote_flush(entry, *spte))
+   remote_flush = true;
++spte;
}
}
+   mmu_pte_write_flush_tlb(vcpu, zap_page, remote_flush, local_flush);
kvm_mmu_commit_zap_page(vcpu->kvm, &invalid_list);
kvm_mmu_audit(vcpu, "post pte write");
spin_unlock(&vcpu->kvm->mmu_lock);
-- 
1.6.1.2


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH resend] KVM: MMU: fix compile warning in kvm_send_hwpoison_signal()

2010-06-04 Thread Xiao Guangrong
fix:

arch/x86/kvm/mmu.c: In function ‘kvm_send_hwpoison_signal’:
arch/x86/kvm/mmu.c:2051: warning: ignoring return value of ‘copy_from_user’, declared with attribute warn_unused_result

Reported-by: Jan Kiszka jan.kis...@web.de
Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index c16c4ca..a62e3ba 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2049,7 +2049,7 @@ static void kvm_send_hwpoison_signal(struct kvm *kvm, 
gfn_t gfn)
 
/* Touch the page, so send SIGBUS */
hva = (void __user *)gfn_to_hva(kvm, gfn);
-   (void)copy_from_user(buf, hva, 1);
+   r = copy_from_user(buf, hva, 1);
 }
 
 static int kvm_handle_bad_page(struct kvm *kvm, gfn_t gfn, pfn_t pfn)
-- 
1.6.1.2


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] [scsi-bus]: Add PR-OUT and PR-IN case for SCSIRequest xfer and xfer_mode setup

2010-06-04 Thread Kevin Wolf
On 31.05.2010 03:43, Nicholas A. Bellinger wrote:
 From: Nicholas Bellinger n...@linux-iscsi.org
 
 This patch updates hw/scsi-bus.c to add PERSISTENT_RESERVE_OUT and 
 PERSISTENT_RESERVE_IN
 case in scsi_req_length() to extract the incoming buffer length into 
 SCSIRequest-cmd.xfer,
 and adds a second PERSISTENT_RESERVE_OUT case in scsi_req_xfer_mode() in 
 order to properly
 set SCSI_XFER_TO_DEV for WRITE data.
 
 Tested with Linux KVM guests and Megasas 8708EM2 HBA emulation and TCM_Loop 
 target ports.
 
 Signed-off-by: Nicholas A. Bellinger n...@linux-iscsi.org
 ---
  hw/scsi-bus.c |5 +
  1 files changed, 5 insertions(+), 0 deletions(-)
 
 diff --git a/hw/scsi-bus.c b/hw/scsi-bus.c
 index b8e4b71..75ec74e 100644
 --- a/hw/scsi-bus.c
 +++ b/hw/scsi-bus.c
 @@ -325,6 +325,10 @@ static int scsi_req_length(SCSIRequest *req, uint8_t *cmd)
  case INQUIRY:
 req->cmd.xfer = cmd[4] | (cmd[3] << 8);
  break;
 +case PERSISTENT_RESERVE_OUT:
 +case PERSISTENT_RESERVE_IN:
 +req->cmd.xfer = cmd[8] | (cmd[7] << 8);

Maybe I'm missing something, but isn't exactly the same value set in the
switch block above? (for cmd[0] >> 5 == 2)

Kevin


Re: [PATCH] kvm/powerpc: fix a build error in e500_tlb.c

2010-06-04 Thread Alexander Graf

On 03.06.2010, at 07:52, Kevin Hao wrote:

 We use the wrong number arguments when invoking trace_kvm_stlb_inval,
 and cause the following build error.
 arch/powerpc/kvm/e500_tlb.c: In function 'kvmppc_e500_stlbe_invalidate':
 arch/powerpc/kvm/e500_tlb.c:230: error: too many arguments to function 
 'trace_kvm_stlb_inval'

Liu, I'd like to get an ack from you here.

Alex



Re: [PATCH] make-release: make mtime, owner, group consistent

2010-06-04 Thread Marcelo Tosatti
On Wed, Jun 02, 2010 at 06:27:20PM +0300, Michael S. Tsirkin wrote:
 Files from git have modification time set to one
 of commit, and owner/group to root.
 Making it so for generated files as well makes
 it easier to generate an identical tarball from git.
 
 Setting owner/group to root is especially important because
 otherwise you must have a user/group with same name
 to generate an identical tarball.
 
 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 ---
  kvm/scripts/make-release |9 +++--
  1 files changed, 7 insertions(+), 2 deletions(-)

Applied, thanks.



Re: [PATCH] msix: fix msix_set/unset_mask_notifier

2010-06-04 Thread Marcelo Tosatti
On Wed, Jun 02, 2010 at 08:49:35PM +0300, Michael S. Tsirkin wrote:
 Sridhar Samudrala reported hitting the following assertions
 in msix.c when doing a guest reboot or live migration using vhost.
 qemu-kvm/hw/msix.c:375: msix_mask_all: Assertion `r = 0' failed.
 qemu-kvm/hw/msix.c:640: msix_unset_mask_notifier:
 Assertion `dev-msix_mask_notifier_opaque[vector]' failed.
 
 The issue is that we didn't clear/set the opaque pointer
 when vector is masked. The following patch fixes this.
 
 Signed-off-by: Sridhar Samudrala s...@us.ibm.com
 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 ---
 
 Sridhar, could you test the following please?
 
  hw/msix.c |   33 -
  1 files changed, 16 insertions(+), 17 deletions(-)

Applied, thanks.



Re: [PATCHv2] virtio-net: stop vhost backend on vmstop

2010-06-04 Thread Marcelo Tosatti
On Wed, Jun 02, 2010 at 09:01:52PM +0300, Michael S. Tsirkin wrote:
 vhost net currently keeps running after vmstop,
 which causes trouble as qemu does not check
 for dirty pages anymore.
 The fix is to simply keep vm and vhost running/stopped
 status in sync.
 
 Tested-by: David L Stevens dlstev...@us.ibm.com
 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 ---
 
 Changes from v1:
   Simplify code as suggested by Amit.
 
  hw/virtio-net.c |   11 +--
  1 files changed, 5 insertions(+), 6 deletions(-)

Applied, thanks.



Re: [PATCH] kvm: rework remove-write-access for a slot

2010-06-04 Thread Marcelo Tosatti
On Fri, Jun 04, 2010 at 04:14:08PM +0800, Lai Jiangshan wrote:
 Avi Kivity wrote:
  On 06/02/2010 11:53 AM, Lai Jiangshan wrote:
  Current code uses slot_bitmap to find ptes who map a page
  from the memory slot, it is not precise: some ptes in the shadow page
  are not map any page from the memory slot.
 
  This patch uses rmap to find the ptes precisely, and remove
  the unused slot_bitmap.

Note that the current code is precise: memslot_id does unalias_gfn.

  Patch looks good; a couple of comments:
  
  - We might see a slowdown with !tdp, since we no longer have locality. 
  Each page will map to an spte in a different page.  However, it's still
  worth it in my opinion.
 
 Yes, this patch hurts the cache since we no longer have locality.
 And if most pages of the slot are not mapped(rmap_next(kvm, rmapp, 
 NULL)==NULL),
 this patch will worse than old method I think.

Can you get some numbers before/after patch, with/without lots of shadow
pages instantiated? Better with large amount of memory for the guest.

Because shrinking kvm_mmu_page is good.



Re: porting fixes regarding kvm-clock and lost irqs to stable qemu-kvm 0.12.4

2010-06-04 Thread Marcelo Tosatti
On Wed, Jun 02, 2010 at 05:13:30PM +0200, Peter Lieven wrote:
 Hi,
 
 I would like to get latest stable qemu-kvm (0.12.4) to a usable state
 regarding live-migration.
 
 Problems are fixed in git, but there is so much new stuff that has not
 extensively tested and therefore I would like to stay at 0.12.4 at
 the moment.
 
 Therefore I would appreciate your help regarding 2 bugs:
 
 a) -cpu xxx,-kvmclock doesn't work in 0.12.4. It started working first
 after applying a patchset from Avi that is not even in git yet. What
 needs to be done to get it working in 0.12.4 ?

You want to disable kvmclock? -no-kvmclock kernel option.

Note migration should be working fine since kvm.git commit 
afbcf7ab8d1bc8c2d04792f6d9e786e0adeb328d.

 b) There is a bug in 0.12.4 which leads to lost irqs. It was reported
 appearing in virtio_blk (#584131) and e1000 (#585113). It is finally
 fixed in GIT. Can someone give me a hint with commit fixed this?

c3f8f61157625d0bb5bfc135047573de48fdc675.



Re: [PATCH v3] KVM VMX: Make sure single type invvpid is supported before issuing invvpid instruction

2010-06-04 Thread Marcelo Tosatti
On Fri, Jun 04, 2010 at 08:51:39AM +0800, Gui Jianfeng wrote:
 According to SDM, we need check whether single-context INVVPID type is 
 supported
 before issuing invvpid instruction.
 
 Signed-off-by: Gui Jianfeng guijianf...@cn.fujitsu.com
 ---
  arch/x86/include/asm/vmx.h |2 ++
  arch/x86/kvm/vmx.c |8 +++-
  2 files changed, 9 insertions(+), 1 deletions(-)

Applied, thanks.



Re: [PATCH resend] KVM: MMU: fix compile warning in kvm_send_hwpoison_signal()

2010-06-04 Thread Marcelo Tosatti
On Fri, Jun 04, 2010 at 10:02:35PM +0800, Xiao Guangrong wrote:
 fix:
 
 arch/x86/kvm/mmu.c: In function ‘kvm_send_hwpoison_signal’:
 arch/x86/kvm/mmu.c:2051: warning: ignoring return value of ‘copy_from_user’, declared with attribute warn_unused_result
 
 Reported-by: Jan Kiszka jan.kis...@web.de
 Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
 ---
  arch/x86/kvm/mmu.c |2 +-
  1 files changed, 1 insertions(+), 1 deletions(-)

Applied, thanks.



[PATCH] introduce -machine switch

2010-06-04 Thread Glauber Costa
This patch adds initial support for the -machine option, that allows
command line specification of machine attributes.
Besides its value per-se, it is the saner way we found to
allow for enabling/disabling of kvm's in-kernel irqchip.

machine-related options like kernel, initrd, etc, are now
accepted under this switch.

Note: This is against anthony's staging.

---
 hw/boards.h |4 +++
 hw/pc_piix.c|3 ++
 qemu-options.hx |   14 ++
 vl.c|   72 +++
 4 files changed, 67 insertions(+), 26 deletions(-)

diff --git a/hw/boards.h b/hw/boards.h
index 18b6b8f..bac8583 100644
--- a/hw/boards.h
+++ b/hw/boards.h
@@ -35,6 +35,10 @@ extern QEMUMachine *current_machine;
 
 #define COMMON_MACHINE_OPTS()  \
 {   \
+.name = "machine",  \
+.type = QEMU_OPT_STRING,\
+},  \
+{   \
.name = "ram_size", \
 .type = QEMU_OPT_NUMBER,\
 },  \
diff --git a/hw/pc_piix.c b/hw/pc_piix.c
index f01194c..3ddb695 100644
--- a/hw/pc_piix.c
+++ b/hw/pc_piix.c
@@ -67,6 +67,9 @@ static void pc_init1(QemuOpts *opts, int pci_enabled)
 
 vmport_init();
 
+if (!kernel_cmdline)
+kernel_cmdline = "";
+
 /* allocate ram and load rom/bios */
 pc_memory_init(ram_size, kernel_filename, kernel_cmdline, initrd_filename,
below_4g_mem_size, above_4g_mem_size);
diff --git a/qemu-options.hx b/qemu-options.hx
index a6928b7..76ca866 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -35,6 +35,20 @@ STEXI
 Select the emulated @var{machine} (@code{-M ?} for list)
 ETEXI
 
+DEF("machine", HAS_ARG, QEMU_OPTION_machine,
+    "-machine [machine=m][,ram_size=ram][,boot_device=dev]\n"
+    "         [,kernel=vmlinux][,cmdline=kernel_cmdline][,initrd=initrd]\n"
+    "         [,cpu=cpu_type]\n"
+    "pc-specific options: [,acpi=on|off]\n"
+    "kvm-x86 specific options: [,apic_in_kernel=on|off]\n"
+    "select emulated machine (-machine ? for list)\n",
+    QEMU_ARCH_ALL)
+STEXI
+...@item -machine @var{machine}[,@var{option}]
+...@findex -machine
+Select the emulated @var{machine} (@code{-machine ?} for list)
+ETEXI
+
 DEF("cpu", HAS_ARG, QEMU_OPTION_cpu,
 "-cpu cpu        select CPU (-cpu ? for list)\n", QEMU_ARCH_ALL)
 STEXI
diff --git a/vl.c b/vl.c
index 96b8d35..177ffe2 100644
--- a/vl.c
+++ b/vl.c
@@ -1605,6 +1605,16 @@ static QEMUMachine *find_machine(const char *name)
 if (m->alias && !strcmp(m->alias, name))
 return m;
 }
+
+printf("Supported machines are:\n");
+for(m = first_machine; m != NULL; m = m->next) {
+   if (m->alias)
+   printf("%-10s %s (alias of %s)\n",
+  m->alias, m->desc, m->name);
+   printf("%-10s %s%s\n",
+  m->name, m->desc,
+  m->is_default ? " (default)" : "");
+}
 return NULL;
 }
 
@@ -2567,7 +2577,7 @@ int main(int argc, char **argv, char **envp)
 DisplayState *ds;
 DisplayChangeListener *dcl;
 int cyls, heads, secs, translation;
-QemuOpts *hda_opts = NULL, *opts;
+QemuOpts *hda_opts = NULL, *machine_opts = NULL, *opts = NULL;
 int optind;
 const char *optarg;
 const char *loadvm = NULL;
@@ -2697,21 +2707,29 @@ int main(int argc, char **argv, char **envp)
 exit(1);
 }
 switch(popt-index) {
+case QEMU_OPTION_machine: {
+   const char *mach;
+
+machine_opts = qemu_opts_parse(&qemu_machine_opts, optarg, 0);
+if (!machine_opts) {
+fprintf(stderr, "parse error: %s\n", optarg);
+exit(1);
+}
+mach = qemu_opt_get(machine_opts, "machine");
+
+   if (!mach)
+   break;
+
+machine = find_machine(mach);
+
+if (!machine)
+exit(*mach != '?');
+   break;
+   }
 case QEMU_OPTION_M:
 machine = find_machine(optarg);
-if (!machine) {
-QEMUMachine *m;
-printf("Supported machines are:\n");
-for(m = first_machine; m != NULL; m = m->next) {
-if (m->alias)
-printf("%-10s %s (alias of %s)\n",
-   m->alias, m->desc, m->name);
-printf("%-10s %s%s\n",
-   m->name, m->desc,
-   m->is_default ? " (default)" : "");
-}
-exit(*optarg != '?');
-}
+if (!machine)
+   exit(*optarg != '?');
 break;
 case QEMU_OPTION_cpu:
 /* hw initialization will 

Re: porting fixes regarding kvm-clock and lost irqs to stable qemu-kvm 0.12.4

2010-06-04 Thread Peter Lieven

On 04.06.2010 at 17:31, Marcelo Tosatti wrote:

 On Wed, Jun 02, 2010 at 05:13:30PM +0200, Peter Lieven wrote:
 Hi,
 
 I would like to get latest stable qemu-kvm (0.12.4) to a usable state
 regarding live-migration.
 
 Problems are fixed in git, but there is so much new stuff that has not
 extensively tested and therefore I would like to stay at 0.12.4 at
 the moment.
 
 Therefore I would appreciate your help regarding 2 bugs:
 
 a) -cpu xxx,-kvmclock doesn't work in 0.12.4. It started working first
 after applying a patchset from Avi that is not even in git yet. What
 needs to be done to get it working in 0.12.4 ?
 
 You want to disable kvmclock? -no-kvmclock kernel option.
 

i am afraid it is not, at least the last git i tested some days ago.
it's still crashing with kvm-clock.

i was looking for a generic way to disable it in the hypervisor
without the need to touch every guests kernel commandline.
this would be easy revertible once its all working as expected.

 Note migration should be working fine since kvm.git commit 
 afbcf7ab8d1bc8c2d04792f6d9e786e0adeb328d.

would it be easy to backport this to stable? hw/msix.c is looking quite 
different than
that of 0.12.4. 

 
 b) There is a bug in 0.12.4 which leads to lost irqs. It was reported
 appearing in virtio_blk (#584131) and e1000 (#585113). It is finally
 fixed in GIT. Can someone give me a hint with commit fixed this?
 
 c3f8f61157625d0bb5bfc135047573de48fdc675.
 
thanks

 



Re: porting fixes regarding kvm-clock and lost irqs to stable qemu-kvm 0.12.4

2010-06-04 Thread Peter Lieven

On 04.06.2010 at 17:31, Marcelo Tosatti wrote:

 On Wed, Jun 02, 2010 at 05:13:30PM +0200, Peter Lieven wrote:
 Hi,
 
 I would like to get latest stable qemu-kvm (0.12.4) to a usable state
 regarding live-migration.
 
 Problems are fixed in git, but there is so much new stuff that has not
 extensively tested and therefore I would like to stay at 0.12.4 at
 the moment.
 
 Therefore I would appreciate your help regarding 2 bugs:
 
 a) -cpu xxx,-kvmclock doesn't work in 0.12.4. It started working first
 after applying a patchset from Avi that is not even in git yet. What
 needs to be done to get it working in 0.12.4 ?
 
 You want to disable kvmclock? -no-kvmclock kernel option.
 
 Note migration should be working fine since kvm.git commit 
 afbcf7ab8d1bc8c2d04792f6d9e786e0adeb328d.

is this patch needed in the guest or on the host?
i have 2.6.33.3 on the host system and still observed crash conditions.

 
 b) There is a bug in 0.12.4 which leads to lost irqs. It was reported
 appearing in virtio_blk (#584131) and e1000 (#585113). It is finally
 fixed in GIT. Can someone give me a hint with commit fixed this?
 
 c3f8f61157625d0bb5bfc135047573de48fdc675.
 
 



[PATCH v6 2/6] Add function to assign ioeventfd to MMIO.

2010-06-04 Thread Cam Macdonell
---
 kvm-all.c |   32 
 kvm.h |1 +
 2 files changed, 33 insertions(+), 0 deletions(-)

diff --git a/kvm-all.c b/kvm-all.c
index 47f58a6..2982631 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -1257,6 +1257,38 @@ int kvm_set_signal_mask(CPUState *env, const sigset_t *sigset)
 return r;
 }
 
+int kvm_set_ioeventfd_mmio_long(int fd, uint32_t addr, uint32_t val, bool assign)
+{
+#ifdef KVM_IOEVENTFD
+int ret;
+struct kvm_ioeventfd iofd;
+
+iofd.datamatch = val;
+iofd.addr = addr;
+iofd.len = 4;
+iofd.flags = KVM_IOEVENTFD_FLAG_DATAMATCH;
+iofd.fd = fd;
+
+if (!kvm_enabled()) {
+return -ENOSYS;
+}
+
+if (!assign) {
+iofd.flags |= KVM_IOEVENTFD_FLAG_DEASSIGN;
+}
+
+ret = kvm_vm_ioctl(kvm_state, KVM_IOEVENTFD, iofd);
+
+if (ret < 0) {
+return -errno;
+}
+
+return 0;
+#else
+return -ENOSYS;
+#endif
+}
+
 int kvm_set_ioeventfd_pio_word(int fd, uint16_t addr, uint16_t val, bool assign)
 {
 #ifdef KVM_IOEVENTFD
diff --git a/kvm.h b/kvm.h
index aab5118..52e3a7f 100644
--- a/kvm.h
+++ b/kvm.h
@@ -181,6 +181,7 @@ static inline void cpu_synchronize_post_init(CPUState *env)
 }
 
 #endif
+int kvm_set_ioeventfd_mmio_long(int fd, uint32_t adr, uint32_t val, bool assign);
 
 #if defined(KVM_IRQFD)  defined(CONFIG_KVM)
 int kvm_set_irqfd(int gsi, int fd, bool assigned);
-- 
1.6.3.2.198.g6096d



[PATCH v6 3/6] Change phys_ram_dirty to phys_ram_status

2010-06-04 Thread Cam Macdonell
phys_ram_dirty is an array of 8-bit values storing 3 dirty bits.  Change to the more
generic phys_ram_flags and use the lower 4 bits for dirty status, leaving the upper 4
free for other uses.

The names of functions may need to be changed as well, such as 
c_p_m_get_dirty().

---
 cpu-all.h |   16 +---
 exec.c|   36 ++--
 2 files changed, 27 insertions(+), 25 deletions(-)

diff --git a/cpu-all.h b/cpu-all.h
index 47a5722..9080cc7 100644
--- a/cpu-all.h
+++ b/cpu-all.h
@@ -858,7 +858,7 @@ target_phys_addr_t cpu_get_phys_page_debug(CPUState *env, target_ulong addr);
 /* memory API */
 
 extern int phys_ram_fd;
-extern uint8_t *phys_ram_dirty;
+extern uint8_t *phys_ram_flags;
 extern ram_addr_t ram_size;
 extern ram_addr_t last_ram_offset;
 
@@ -887,32 +887,34 @@ extern int mem_prealloc;
 #define CODE_DIRTY_FLAG  0x02
 #define MIGRATION_DIRTY_FLAG 0x08
 
+#define DIRTY_ALL_FLAG  (VGA_DIRTY_FLAG | CODE_DIRTY_FLAG | MIGRATION_DIRTY_FLAG)
+
 /* read dirty bit (return 0 or 1) */
 static inline int cpu_physical_memory_is_dirty(ram_addr_t addr)
 {
-return phys_ram_dirty[addr >> TARGET_PAGE_BITS] == 0xff;
+return phys_ram_flags[addr >> TARGET_PAGE_BITS] == DIRTY_ALL_FLAG;
 }
 
 static inline int cpu_physical_memory_get_dirty_flags(ram_addr_t addr)
 {
-return phys_ram_dirty[addr >> TARGET_PAGE_BITS];
+return phys_ram_flags[addr >> TARGET_PAGE_BITS];
 }
 
 static inline int cpu_physical_memory_get_dirty(ram_addr_t addr,
 int dirty_flags)
 {
-return phys_ram_dirty[addr >> TARGET_PAGE_BITS] & dirty_flags;
+return phys_ram_flags[addr >> TARGET_PAGE_BITS] & dirty_flags;
 }
 
 static inline void cpu_physical_memory_set_dirty(ram_addr_t addr)
 {
-phys_ram_dirty[addr >> TARGET_PAGE_BITS] = 0xff;
+phys_ram_flags[addr >> TARGET_PAGE_BITS] = DIRTY_ALL_FLAG;
 }
 
 static inline int cpu_physical_memory_set_dirty_flags(ram_addr_t addr,
   int dirty_flags)
 {
-return phys_ram_dirty[addr >> TARGET_PAGE_BITS] |= dirty_flags;
+return phys_ram_flags[addr >> TARGET_PAGE_BITS] |= dirty_flags;
 }
 
 static inline void cpu_physical_memory_mask_dirty_range(ram_addr_t start,
@@ -924,7 +926,7 @@ static inline void cpu_physical_memory_mask_dirty_range(ram_addr_t start,
 
 len = length >> TARGET_PAGE_BITS;
 mask = ~dirty_flags;
-p = phys_ram_dirty + (start >> TARGET_PAGE_BITS);
+p = phys_ram_flags + (start >> TARGET_PAGE_BITS);
 for (i = 0; i < len; i++) {
 p[i] &= mask;
 }
diff --git a/exec.c b/exec.c
index 7b0e1c5..39c18a7 100644
--- a/exec.c
+++ b/exec.c
@@ -116,7 +116,7 @@ uint8_t *code_gen_ptr;
 
 #if !defined(CONFIG_USER_ONLY)
 int phys_ram_fd;
-uint8_t *phys_ram_dirty;
+uint8_t *phys_ram_flags;
 static int in_migration;
 
 typedef struct RAMBlock {
@@ -2801,10 +2801,10 @@ ram_addr_t qemu_ram_map(ram_addr_t size, void *host)
 new_block->next = ram_blocks;
 ram_blocks = new_block;
 
-phys_ram_dirty = qemu_realloc(phys_ram_dirty,
+phys_ram_flags = qemu_realloc(phys_ram_flags,
 (last_ram_offset + size) >> TARGET_PAGE_BITS);
-memset(phys_ram_dirty + (last_ram_offset >> TARGET_PAGE_BITS),
-   0xff, size >> TARGET_PAGE_BITS);
+memset(phys_ram_flags + (last_ram_offset >> TARGET_PAGE_BITS),
+   DIRTY_ALL_FLAG, size >> TARGET_PAGE_BITS);
 
 last_ram_offset += size;
 
@@ -2853,10 +2853,10 @@ ram_addr_t qemu_ram_alloc(ram_addr_t size)
 new_block->next = ram_blocks;
 ram_blocks = new_block;
 
-phys_ram_dirty = qemu_realloc(phys_ram_dirty,
+phys_ram_flags = qemu_realloc(phys_ram_flags,
 (last_ram_offset + size) >> TARGET_PAGE_BITS);
-memset(phys_ram_dirty + (last_ram_offset >> TARGET_PAGE_BITS),
-   0xff, size >> TARGET_PAGE_BITS);
+memset(phys_ram_flags + (last_ram_offset >> TARGET_PAGE_BITS),
+   DIRTY_ALL_FLAG, size >> TARGET_PAGE_BITS);
 
 last_ram_offset += size;
 
@@ -3024,11 +3024,11 @@ static void notdirty_mem_writeb(void *opaque, target_phys_addr_t ram_addr,
 #endif
 }
 stb_p(qemu_get_ram_ptr(ram_addr), val);
-dirty_flags |= (0xff & ~CODE_DIRTY_FLAG);
+dirty_flags |= (DIRTY_ALL_FLAG & ~CODE_DIRTY_FLAG);
 cpu_physical_memory_set_dirty_flags(ram_addr, dirty_flags);
 /* we remove the notdirty callback only if the code has been
 flushed */
-if (dirty_flags == 0xff)
+if (dirty_flags == DIRTY_ALL_FLAG)
 tlb_set_dirty(cpu_single_env, cpu_single_env->mem_io_vaddr);
 }
 
@@ -3044,11 +3044,11 @@ static void notdirty_mem_writew(void *opaque, target_phys_addr_t ram_addr,
 #endif
 }
 stw_p(qemu_get_ram_ptr(ram_addr), val);
-dirty_flags |= (0xff & ~CODE_DIRTY_FLAG);
+dirty_flags |= (DIRTY_ALL_FLAG & ~CODE_DIRTY_FLAG);
 cpu_physical_memory_set_dirty_flags(ram_addr, dirty_flags);
 /* we remove the notdirty callback only if the code has been
 flushed */
-if (dirty_flags == 0xff)
+if (dirty_flags == DIRTY_ALL_FLAG)
 

[PATCH v6 0/6] Inter-VM Shared Memory Device with migration support

2010-06-04 Thread Cam Macdonell
Latest patch for PCI shared memory device that maps a host shared memory object
to be shared between guests

new in this series
- migration support with 'master' and 'peer' roles for guest to determine
  who owns memory.  With 'master', the guest has the canonical copy of
  the shared memory and will copy it with it on migration.  With 
'role=peer',
  the guest will not copy the shared memory, but attach to what is on the
  destination machine.
- modified phys_ram_dirty array for marking memory as not to be migrated
- add support for non-migrated memory regions

v5:
- fixed segfault for non-server case
- code style fixes
- removed limit on the number of guests
- shared memory server is now in qemu.git/contrib
- made ioeventfd setup function generic
- removed interrupts when guest joined (let application handle it)

v4:
- moved to single Doorbell register and use datamatch to trigger different
  VMs rather than one register per eventfd
- remove writing arbitrary values to eventfds.  Only values of 1 are now
  written to ensure correct usage

Cam Macdonell (6):
  Device specification for shared memory PCI device
  Adds two new functions for assigning ioeventfd and irqfds.
  Change phys_ram_dirty to phys_ram_status
  Add support for marking memory to not be migrated.  On migration,
memory is checked for the NO_MIGRATION_FLAG.
  Inter-VM shared memory PCI device
  the stand-alone shared memory server for inter-VM shared memory

 Makefile.target |3 +
 arch_init.c |   28 +-
 contrib/ivshmem-server/Makefile |   16 +
 contrib/ivshmem-server/README   |   30 ++
 contrib/ivshmem-server/ivshmem_server.c |  353 +
 contrib/ivshmem-server/send_scm.c   |  208 
 contrib/ivshmem-server/send_scm.h   |   19 +
 cpu-all.h   |   18 +-
 cpu-common.h|2 +
 docs/specs/ivshmem_device_spec.txt  |   96 
 exec.c  |   48 ++-
 hw/ivshmem.c|  852 +++
 kvm-all.c   |   32 ++
 kvm.h   |1 +
 qemu-char.c |6 +
 qemu-char.h |3 +
 qemu-doc.texi   |   32 ++
 17 files changed, 1710 insertions(+), 37 deletions(-)
 create mode 100644 contrib/ivshmem-server/Makefile
 create mode 100644 contrib/ivshmem-server/README
 create mode 100644 contrib/ivshmem-server/ivshmem_server.c
 create mode 100644 contrib/ivshmem-server/send_scm.c
 create mode 100644 contrib/ivshmem-server/send_scm.h
 create mode 100644 docs/specs/ivshmem_device_spec.txt
 create mode 100644 hw/ivshmem.c



[PATCH v6 4/6] Add support for marking memory to not be migrated. On migration, memory is checked for the NO_MIGRATION_FLAG.

2010-06-04 Thread Cam Macdonell
This is useful for devices that do not want to take a memory region's data with
them on migration.
---
 arch_init.c  |   28 
 cpu-all.h|2 ++
 cpu-common.h |2 ++
 exec.c   |   12 
 4 files changed, 32 insertions(+), 12 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index cfc03ea..7a234fa 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -118,18 +118,21 @@ static int ram_save_block(QEMUFile *f)
 current_addr + TARGET_PAGE_SIZE,
 MIGRATION_DIRTY_FLAG);
 
-p = qemu_get_ram_ptr(current_addr);
-
-if (is_dup_page(p, *p)) {
-qemu_put_be64(f, current_addr | RAM_SAVE_FLAG_COMPRESS);
-qemu_put_byte(f, *p);
-} else {
-qemu_put_be64(f, current_addr | RAM_SAVE_FLAG_PAGE);
-qemu_put_buffer(f, p, TARGET_PAGE_SIZE);
-}
+if (!cpu_physical_memory_get_dirty(current_addr,
+NO_MIGRATION_FLAG)) {
+p = qemu_get_ram_ptr(current_addr);
+
+if (is_dup_page(p, *p)) {
+qemu_put_be64(f, current_addr | RAM_SAVE_FLAG_COMPRESS);
+qemu_put_byte(f, *p);
+} else {
+qemu_put_be64(f, current_addr | RAM_SAVE_FLAG_PAGE);
+qemu_put_buffer(f, p, TARGET_PAGE_SIZE);
+}
 
-found = 1;
-break;
+found = 1;
+break;
+}
 }
 addr += TARGET_PAGE_SIZE;
 current_addr = (saved_addr + addr) % last_ram_offset;
@@ -146,7 +149,8 @@ static ram_addr_t ram_save_remaining(void)
 ram_addr_t count = 0;
 
 for (addr = 0; addr  last_ram_offset; addr += TARGET_PAGE_SIZE) {
-if (cpu_physical_memory_get_dirty(addr, MIGRATION_DIRTY_FLAG)) {
+if (!cpu_physical_memory_get_dirty(addr, NO_MIGRATION_FLAG) &&
+cpu_physical_memory_get_dirty(addr, MIGRATION_DIRTY_FLAG)) {
 count++;
 }
 }
diff --git a/cpu-all.h b/cpu-all.h
index 9080cc7..4df00ab 100644
--- a/cpu-all.h
+++ b/cpu-all.h
@@ -887,6 +887,8 @@ extern int mem_prealloc;
 #define CODE_DIRTY_FLAG  0x02
 #define MIGRATION_DIRTY_FLAG 0x08
 
+#define NO_MIGRATION_FLAG 0x10
+
 #define DIRTY_ALL_FLAG  (VGA_DIRTY_FLAG | CODE_DIRTY_FLAG | MIGRATION_DIRTY_FLAG)
 
 /* read dirty bit (return 0 or 1) */
diff --git a/cpu-common.h b/cpu-common.h
index 4b0ba60..a1ebbbe 100644
--- a/cpu-common.h
+++ b/cpu-common.h
@@ -39,6 +39,8 @@ static inline void 
cpu_register_physical_memory(target_phys_addr_t start_addr,
 cpu_register_physical_memory_offset(start_addr, size, phys_offset, 0);
 }
 
+void cpu_mark_pages_no_migrate(ram_addr_t start, uint64_t size);
+
 ram_addr_t cpu_get_physical_page_desc(target_phys_addr_t addr);
 ram_addr_t qemu_ram_map(ram_addr_t size, void *host);
 ram_addr_t qemu_ram_alloc(ram_addr_t);
diff --git a/exec.c b/exec.c
index 39c18a7..c11d22f 100644
--- a/exec.c
+++ b/exec.c
@@ -2786,6 +2786,18 @@ static void *file_ram_alloc(ram_addr_t memory, const char *path)
 }
 #endif
 
+void cpu_mark_pages_no_migrate(ram_addr_t start, uint64_t length)
+{
+int i, len;
+uint8_t *p;
+
+len = length >> TARGET_PAGE_BITS;
+p = phys_ram_flags + (start >> TARGET_PAGE_BITS);
+for (i = 0; i < len; i++) {
+p[i] |= NO_MIGRATION_FLAG;
+}
+}
+
 ram_addr_t qemu_ram_map(ram_addr_t size, void *host)
 {
 RAMBlock *new_block;
-- 
1.6.3.2.198.g6096d



[PATCH v6 5/6] Inter-VM shared memory PCI device

2010-06-04 Thread Cam Macdonell
Support an inter-vm shared memory device that maps a shared-memory object as a
PCI device in the guest.  This patch also supports interrupts between guest by
communicating over a unix domain socket.  This patch applies to the qemu-kvm
repository.

-device ivshmem,size=<size in format accepted by -m>[,shm=<shm name>]

Interrupts are supported between multiple VMs by using a shared memory server
by using a chardev socket.

-device ivshmem,size=<size in format accepted by -m>[,shm=<shm name>]
   [,chardev=<id>][,msi=on][,irqfd=on][,vectors=n][,role=peer|master]
-chardev socket,path=<path>,id=<id>

(shared memory server is qemu.git/contrib/ivshmem-server)

Sample programs and init scripts are in a git repo here:

www.gitorious.org/nahanni
---
 Makefile.target |3 +
 hw/ivshmem.c|  852 +++
 qemu-char.c |6 +
 qemu-char.h |3 +
 qemu-doc.texi   |   43 +++
 5 files changed, 907 insertions(+), 0 deletions(-)
 create mode 100644 hw/ivshmem.c

diff --git a/Makefile.target b/Makefile.target
index c4ba592..4888308 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -202,6 +202,9 @@ obj-$(CONFIG_USB_OHCI) += usb-ohci.o
 obj-y += rtl8139.o
 obj-y += e1000.o
 
+# Inter-VM PCI shared memory
+obj-y += ivshmem.o
+
 # Hardware support
 obj-i386-y += vga.o
 obj-i386-y += mc146818rtc.o i8259.o pc.o
diff --git a/hw/ivshmem.c b/hw/ivshmem.c
new file mode 100644
index 000..9057612
--- /dev/null
+++ b/hw/ivshmem.c
@@ -0,0 +1,852 @@
+/*
+ * Inter-VM Shared Memory PCI device.
+ *
+ * Author:
+ *  Cam Macdonell c...@cs.ualberta.ca
+ *
+ * Based On: cirrus_vga.c
+ *  Copyright (c) 2004 Fabrice Bellard
+ *  Copyright (c) 2004 Makoto Suzuki (suzu)
+ *
+ *  and rtl8139.c
+ *  Copyright (c) 2006 Igor Kovalenko
+ *
+ * This code is licensed under the GNU GPL v2.
+ */
+#include <sys/mman.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <sys/io.h>
+#include <sys/ioctl.h>
+#include <sys/eventfd.h>
+#include "hw.h"
+#include "console.h"
+#include "pc.h"
+#include "pci.h"
+#include "sysemu.h"
+
+#include "msix.h"
+#include "qemu-kvm.h"
+#include "libkvm.h"
+
+#define IVSHMEM_IRQFD   0
+#define IVSHMEM_MSI 1
+
+//#define DEBUG_IVSHMEM
+#ifdef DEBUG_IVSHMEM
+#define IVSHMEM_DPRINTF(fmt, args...)\
+do { printf("IVSHMEM: " fmt, ##args); } while (0)
+#else
+#define IVSHMEM_DPRINTF(fmt, args...)
+#endif
+
+typedef struct Peer {
+int nb_eventfds;
+int *eventfds;
+} Peer;
+
+typedef struct EventfdEntry {
+PCIDevice *pdev;
+int vector;
+} EventfdEntry;
+
+typedef struct IVShmemState {
+PCIDevice dev;
+uint32_t intrmask;
+uint32_t intrstatus;
+uint32_t doorbell;
+
+CharDriverState ** eventfd_chr;
+CharDriverState * server_chr;
+int ivshmem_mmio_io_addr;
+
+pcibus_t mmio_addr;
+pcibus_t shm_pci_addr;
+uint64_t ivshmem_offset;
+uint64_t ivshmem_size; /* size of shared memory region */
+int shm_fd; /* shared memory file descriptor */
+
+Peer *peers;
+int nb_peers; /* how many guests we have space for */
+int max_peer; /* maximum numbered peer */
+
+int vm_id;
+uint32_t vectors;
+uint32_t features;
+EventfdEntry *eventfd_table;
+
+char * shmobj;
+char * sizearg;
+char * role;
+} IVShmemState;
+
+/* registers for the Inter-VM shared memory device */
+enum ivshmem_registers {
+IntrMask = 0,
+IntrStatus = 4,
+IVPosition = 8,
+Doorbell = 12,
+};
+
+static inline uint32_t ivshmem_has_feature(IVShmemState *ivs, int feature) {
+return (ivs->features & (1 << feature));
+}
+
+static inline bool is_power_of_two(uint64_t x) {
+return (x & (x - 1)) == 0;
+}
+
+static void ivshmem_map(PCIDevice *pci_dev, int region_num,
+pcibus_t addr, pcibus_t size, int type)
+{
+IVShmemState *s = DO_UPCAST(IVShmemState, dev, pci_dev);
+
+s->shm_pci_addr = addr;
+
+if (s->ivshmem_offset > 0) {
+cpu_register_physical_memory(s->shm_pci_addr, s->ivshmem_size,
+s->ivshmem_offset);
+if (s->role && strncmp(s->role, "peer", 4) == 0) {
+IVSHMEM_DPRINTF("marking pages no migrate\n");
+cpu_mark_pages_no_migrate(s->ivshmem_offset, s->ivshmem_size);
+}
+}
+
+IVSHMEM_DPRINTF("guest pci addr = %u, guest h/w addr = %u, size = %u\n",
+(uint32_t)addr, (uint32_t)s->ivshmem_offset, (uint32_t)size);
+
+}
+
+/* accessing registers - based on rtl8139 */
+static void ivshmem_update_irq(IVShmemState *s, int val)
+{
+int isr;
+isr = (s->intrstatus & s->intrmask) & 0xffff;
+
+/* don't print ISR resets */
+if (isr) {
+IVSHMEM_DPRINTF("Set IRQ to %d (%04x %04x)\n",
+   isr ? 1 : 0, s->intrstatus, s->intrmask);
+}
+
+qemu_set_irq(s->dev.irq[0], (isr != 0));
+}
+
+static void ivshmem_IntrMask_write(IVShmemState *s, uint32_t val)
+{
+  

[PATCH v6 6/6] the stand-alone shared memory server for inter-VM shared memory

2010-06-04 Thread Cam Macdonell
This code is a standalone server that passes file descriptors for the shared
memory region and eventfds to support interrupts between guests using inter-VM
shared memory.
---
 contrib/ivshmem-server/Makefile |   16 ++
 contrib/ivshmem-server/README   |   30 +++
 contrib/ivshmem-server/ivshmem_server.c |  353 +++
 contrib/ivshmem-server/send_scm.c   |  208 ++
 contrib/ivshmem-server/send_scm.h   |   19 ++
 5 files changed, 626 insertions(+), 0 deletions(-)
 create mode 100644 contrib/ivshmem-server/Makefile
 create mode 100644 contrib/ivshmem-server/README
 create mode 100644 contrib/ivshmem-server/ivshmem_server.c
 create mode 100644 contrib/ivshmem-server/send_scm.c
 create mode 100644 contrib/ivshmem-server/send_scm.h

diff --git a/contrib/ivshmem-server/Makefile b/contrib/ivshmem-server/Makefile
new file mode 100644
index 000..da40ffa
--- /dev/null
+++ b/contrib/ivshmem-server/Makefile
@@ -0,0 +1,16 @@
+CC = gcc
+CFLAGS = -O3 -Wall -Werror
+LIBS = -lrt
+
+# a very simple makefile to build the inter-VM shared memory server
+
+all: ivshmem_server
+
+.c.o:
+   $(CC) $(CFLAGS) -c $^ -o $@
+
+ivshmem_server: ivshmem_server.o send_scm.o
+   $(CC) $(CFLAGS) -o $@ $^ $(LIBS)
+
+clean:
+   rm -f *.o ivshmem_server
diff --git a/contrib/ivshmem-server/README b/contrib/ivshmem-server/README
new file mode 100644
index 000..b1fc2a2
--- /dev/null
+++ b/contrib/ivshmem-server/README
@@ -0,0 +1,30 @@
+Using the ivshmem shared memory server
+--
+
+This server is only supported on Linux.
+
+To use the shared memory server, first compile it.  Running 'make' should
+accomplish this.  An executable named 'ivshmem_server' will be built.
+
+To display the options, run:
+
+./ivshmem_server -h
+
+Options
+---
+
+-h  print help message
+
+-p path on host
+unix socket to listen on.  The qemu-kvm chardev needs to connect on
+this socket. (default: '/tmp/ivshmem_socket')
+
+-s string
+POSIX shared object to create that is the shared memory (default: 'ivshmem')
+
+-m #
+size of the POSIX object in MBs (default: 1)
+
+-n #
+number of eventfds for each guest.  This number must match the
+'vectors' argument passed to the ivshmem device. (default: 1)
diff --git a/contrib/ivshmem-server/ivshmem_server.c 
b/contrib/ivshmem-server/ivshmem_server.c
new file mode 100644
index 000..e0a7b98
--- /dev/null
+++ b/contrib/ivshmem-server/ivshmem_server.c
@@ -0,0 +1,353 @@
+/*
+ * A stand-alone shared memory server for inter-VM shared memory for KVM
+*/
+
+#include <errno.h>
+#include <string.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <sys/un.h>
+#include <unistd.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <sys/eventfd.h>
+#include <sys/mman.h>
+#include <sys/select.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "send_scm.h"
+
+#define DEFAULT_SOCK_PATH "/tmp/ivshmem_socket"
+#define DEFAULT_SHM_OBJ "ivshmem"
+
+#define DEBUG 1
+
+typedef struct server_state {
+vmguest_t *live_vms;
+int nr_allocated_vms;
+int shm_size;
+long live_count;
+long total_count;
+int shm_fd;
+char * path;
+char * shmobj;
+int maxfd, conn_socket;
+long msi_vectors;
+} server_state_t;
+
+void usage(char const *prg);
+int find_set(fd_set * readset, int max);
+void print_vec(server_state_t * s, const char * c);
+
+void add_new_guest(server_state_t * s);
+void parse_args(int argc, char **argv, server_state_t * s);
+int create_listening_socket(char * path);
+
+int main(int argc, char ** argv)
+{
+fd_set readset;
+server_state_t * s;
+
+s = (server_state_t *)calloc(1, sizeof(server_state_t));
+
+s->live_count = 0;
+s->total_count = 0;
+parse_args(argc, argv, s);
+
+/* open shared memory file  */
+if ((s->shm_fd = shm_open(s->shmobj, O_CREAT|O_RDWR, S_IRWXU)) < 0)
+{
+fprintf(stderr, "kvm_ivshmem: could not open shared file\n");
+exit(-1);
+}
+
+ftruncate(s->shm_fd, s->shm_size);
+
+s->conn_socket = create_listening_socket(s->path);
+
+s->maxfd = s->conn_socket;
+
+for(;;) {
+int ret, handle, i;
+char buf[1024];
+
+print_vec(s, "vm_sockets");
+
+FD_ZERO(&readset);
+/* conn socket is in Live_vms at posn 0 */
+FD_SET(s->conn_socket, &readset);
+for (i = 0; i < s->total_count; i++) {
+if (s->live_vms[i].alive != 0) {
+FD_SET(s->live_vms[i].sockfd, &readset);
+}
+}
+
+printf("\nWaiting (maxfd = %d)\n", s->maxfd);
+
+ret = select(s->maxfd + 1, &readset, NULL, NULL, NULL);
+
+if (ret == -1) {
+perror("select()");
+}
+
+handle = find_set(&readset, s->maxfd + 1);
+if (handle == -1) continue;
+
+if (handle == s->conn_socket) {
+
+printf("[NC] new connection\n");
+FD_CLR(s->conn_socket, &readset);
+

[PATCH v6] Shared memory uio_pci driver

2010-06-04 Thread Cam Macdonell
This patch adds a driver for my shared memory PCI device using the uio_pci
interface.  The driver has three memory regions.  The first memory region is for
device registers for sending interrupts. The second BAR is for receiving MSI-X
interrupts and the third memory region maps the shared memory.  The device only
exports the first and third memory regions to userspace.

This driver supports MSI-X and regular pin interrupts.  Currently, the number
of MSI vectors is set to 1 but it could easily be increased.  If MSI is not
available, then regular interrupts will be used.
---
 drivers/uio/Kconfig   |8 ++
 drivers/uio/Makefile  |1 +
 drivers/uio/uio_ivshmem.c |  252 +
 3 files changed, 261 insertions(+), 0 deletions(-)
 create mode 100644 drivers/uio/uio_ivshmem.c

diff --git a/drivers/uio/Kconfig b/drivers/uio/Kconfig
index 1da73ec..b92cded 100644
--- a/drivers/uio/Kconfig
+++ b/drivers/uio/Kconfig
@@ -74,6 +74,14 @@ config UIO_SERCOS3
 
  If you compile this as a module, it will be called uio_sercos3.
 
+config UIO_IVSHMEM
+   tristate "KVM shared memory PCI driver"
+   default n
+   help
+ Userspace I/O interface for the KVM shared memory device.  This
+ driver will make available two memory regions, the first is
+ registers and the second is a region for sharing between VMs.
+
 config UIO_PCI_GENERIC
	tristate "Generic driver for PCI 2.3 and PCI Express cards"
depends on PCI
diff --git a/drivers/uio/Makefile b/drivers/uio/Makefile
index 18fd818..25c1ca5 100644
--- a/drivers/uio/Makefile
+++ b/drivers/uio/Makefile
@@ -6,3 +6,4 @@ obj-$(CONFIG_UIO_AEC)   += uio_aec.o
 obj-$(CONFIG_UIO_SERCOS3)  += uio_sercos3.o
 obj-$(CONFIG_UIO_PCI_GENERIC)  += uio_pci_generic.o
 obj-$(CONFIG_UIO_NETX) += uio_netx.o
+obj-$(CONFIG_UIO_IVSHMEM) += uio_ivshmem.o
diff --git a/drivers/uio/uio_ivshmem.c b/drivers/uio/uio_ivshmem.c
new file mode 100644
index 000..95be1e0
--- /dev/null
+++ b/drivers/uio/uio_ivshmem.c
@@ -0,0 +1,252 @@
+/*
+ * UIO IVShmem Driver
+ *
+ * (C) 2009 Cam Macdonell
+ * based on Hilscher CIF card driver (C) 2007 Hans J. Koch h...@linutronix.de
+ *
+ * Licensed under GPL version 2 only.
+ *
+ */
+
+#include <linux/device.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <linux/uio_driver.h>
+
+#include <asm/io.h>
+
+#define IntrStatus 0x04
+#define IntrMask 0x00
+
+struct ivshmem_info {
+   struct uio_info *uio;
+   struct pci_dev *dev;
+   char (*msix_names)[256];
+   struct msix_entry *msix_entries;
+   int nvectors;
+};
+
+static irqreturn_t ivshmem_handler(int irq, struct uio_info *dev_info)
+{
+
+   void __iomem *plx_intscr = dev_info->mem[0].internal_addr
+   + IntrStatus;
+   u32 val;
+
+   val = readl(plx_intscr);
+   if (val == 0)
+   return IRQ_NONE;
+
+   return IRQ_HANDLED;
+}
+
+static irqreturn_t ivshmem_msix_handler(int irq, void *opaque)
+{
+
+   struct uio_info * dev_info = (struct uio_info *) opaque;
+
+   /* we have to do this explicitly when using MSI-X */
+   uio_event_notify(dev_info);
+   return IRQ_HANDLED;
+}
+
+static void free_msix_vectors(struct ivshmem_info *ivs_info,
+   const int max_vector)
+{
+   int i;
+
+   for (i = 0; i < max_vector; i++)
+   free_irq(ivs_info->msix_entries[i].vector, ivs_info->uio);
+}
+
+static int request_msix_vectors(struct ivshmem_info *ivs_info, int nvectors)
+{
+   int i, err;
+   const char *name = "ivshmem";
+
+   ivs_info->nvectors = nvectors;
+
+   ivs_info->msix_entries = kmalloc(nvectors * sizeof *
+   ivs_info->msix_entries,
+   GFP_KERNEL);
+   if (ivs_info->msix_entries == NULL)
+   return -ENOSPC;
+
+   ivs_info->msix_names = kmalloc(nvectors * sizeof *ivs_info->msix_names,
+   GFP_KERNEL);
+   if (ivs_info->msix_names == NULL) {
+   kfree(ivs_info->msix_entries);
+   return -ENOSPC;
+   }
+
+   for (i = 0; i < nvectors; ++i)
+   ivs_info->msix_entries[i].entry = i;
+
+   err = pci_enable_msix(ivs_info->dev, ivs_info->msix_entries,
+   ivs_info->nvectors);
+   if (err > 0) {
+   ivs_info->nvectors = err; /* msi-x positive error code
+returns the number available */
+   err = pci_enable_msix(ivs_info->dev, ivs_info->msix_entries,
+   ivs_info->nvectors);
+   if (err) {
+   printk(KERN_INFO "no MSI (%d). Back to INTx.\n", err);
+   goto error;
+   }
+   }
+
+   if (err)
+   goto error;
+
+   for (i = 0; i < ivs_info->nvectors; i++) {
+
+