Re: [Qemu-devel] Re: Another SIGFPE in display code, now in cirrus

2010-05-30 Thread Avi Kivity

On 05/12/2010 09:11 PM, Stefano Stabellini wrote:

On Wed, 12 May 2010, Jamie Lokier wrote:
   

Stefano Stabellini wrote:
 

On Wed, 12 May 2010, Avi Kivity wrote:
   

It's useful if you have a one-line horizontal pattern you want to
propagate all over.
 


It might be useful all right, but it is not entirely clear what the
hardware should do in this situation from the documentation we have, and
certainly the current state of the cirrus emulation code doesn't help.
   

It's quite a reasonable thing for hardware to do, even if not documented.
It would be surprising if the hardware didn't copy the one-line pattern.
 


All right then, you convinced me :)

This is my proposed solution, however it is untested with Windows NT.


Signed-off-by: Stefano Stabellini <stefano.stabell...@eu.citrix.com>

---



diff --git a/hw/cirrus_vga.c b/hw/cirrus_vga.c
index 9f61a01..a7f0d3c 100644
--- a/hw/cirrus_vga.c
+++ b/hw/cirrus_vga.c
@@ -676,15 +676,17 @@ static void cirrus_do_copy(CirrusVGAState *s, int dst, int src, int w, int h)
     int sx, sy;
     int dx, dy;
     int width, height;
+    uint32_t start_addr, line_offset, line_compare;
     int depth;
     int notify = 0;

     depth = s->vga.get_bpp(&s->vga) / 8;
     s->vga.get_resolution(&s->vga, &width, &height);
+    s->vga.get_offsets(&s->vga, &line_offset, &start_addr, &line_compare);

     /* extra x, y */
-    sx = (src % ABS(s->cirrus_blt_srcpitch)) / depth;
-    sy = (src / ABS(s->cirrus_blt_srcpitch));
+    sx = (src % line_offset) / depth;
+    sy = (src / line_offset);
     dx = (dst % ABS(s->cirrus_blt_dstpitch)) / depth;
     dy = (dst / ABS(s->cirrus_blt_dstpitch));


dx/dy also need to be calculated according to line_offset.

I think the rules are:

if src_byte_range in screen and dst_byte_range in screen
   and srcpitch == dstpitch == line_offset:
   can use qemu_console_copy
elif dst_byte_range overlaps screen:
   if dstpitch == line_offset:
   invalidate rectangle
   else:
   invalidate full screen
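
Spelled out, the rules above translate to roughly the following (a minimal
sketch: the blt_op struct and every helper are assumed names for
illustration, not existing qemu functions):

#include <stdbool.h>

struct blt_op {
	int src, dst;            /* byte offsets into video memory */
	int src_len, dst_len;    /* byte length of each range */
	int srcpitch, dstpitch;
};

static void blt_notify(const struct blt_op *b, int line_offset,
		       bool (*in_screen)(int start, int len),
		       void (*console_copy)(const struct blt_op *),
		       void (*invalidate_rect)(const struct blt_op *),
		       void (*invalidate_all)(void))
{
	if (in_screen(b->src, b->src_len) && in_screen(b->dst, b->dst_len) &&
	    b->srcpitch == line_offset && b->dstpitch == line_offset) {
		console_copy(b);          /* fast screen-to-screen path */
	} else if (in_screen(b->dst, b->dst_len)) {
		if (b->dstpitch == line_offset)
			invalidate_rect(b); /* redraw just the rectangle */
		else
			invalidate_all();   /* pitch mismatch: redraw everything */
	}
}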


--
error compiling committee.c: too many arguments to function



Re: Info on Unix Domain Socket Device

2010-05-30 Thread Avi Kivity

On 05/29/2010 08:11 PM, Narendra Prasad Madanapalli wrote:

Hi,

There is a small scale TODO task in kvm, 'Add a Unix domain socket
device. With this, the guest can talk to a pci device which is
connected to a Unix domain socket on the host.'. I would like to park
on this and contribute a patch. If anyone is also planning to do
this, we can coordinate and accomplish the task.

   


This was already implemented in qemu (as a large scale TODO) and is 
called virtio-serial.


--
error compiling committee.c: too many arguments to function



Re: [PATCH] Add Documentation/kvm/msr.txt

2010-05-30 Thread Avi Kivity

On 05/27/2010 07:02 PM, Glauber Costa wrote:



+
+Custom MSR list
+
+
+The current supported Custom MSR list is:
+
+MSR_KVM_WALL_CLOCK:  0x11
+
+   data: physical address of a memory area.
   

Which must be in guest RAM (i.e., don't point it somewhere random
and expect the hypervisor to allocate it for you).

Must be aligned to 4 bytes (we don't enforce it though).
 

I don't see the reason for it.
   


Insurance.


If this is a requirement, our own implementation
is failing to meet it.
   


How are we failing to meet it?  All guests align to at least four bytes.
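
For concreteness, the guest side amounts to little more than this (a
sketch assuming an x86 guest kernel context; the structure mirrors the
pvclock wall-clock layout, with the 4-byte alignment made explicit, and
the helper name is illustrative):

#include <stdint.h>

#define MSR_KVM_WALL_CLOCK 0x11	/* from the list above */

/* The area the MSR points at; naturally 4-byte aligned, as recommended. */
struct pvclock_wall_clock {
	uint32_t version;
	uint32_t sec;
	uint32_t nsec;
} __attribute__((aligned(4)));

static inline void wrmsr(uint32_t msr, uint64_t val)
{
	asm volatile("wrmsr" : : "c"(msr),
		     "a"((uint32_t)val), "d"((uint32_t)(val >> 32)));
}

/* 'wall_clock_gpa' must be the guest-physical address of a
 * struct pvclock_wall_clock residing in guest RAM. */
static void kvm_register_wall_clock(uint64_t wall_clock_gpa)
{
	wrmsr(MSR_KVM_WALL_CLOCK, wall_clock_gpa);
}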

--
error compiling committee.c: too many arguments to function



Re: KVM: MMU: always invalidate and flush on spte page size change

2010-05-30 Thread Avi Kivity

On 05/28/2010 03:44 PM, Marcelo Tosatti wrote:

Always invalidate spte and flush TLBs when changing page size, to make
sure different sized translations for the same address are never cached
in a CPU's TLB.

The first case where this occurs is when a non-leaf spte pointer is
overwritten by a leaf, large spte entry. This can happen after dirty
logging is disabled on a memslot, for example.

The second case is a leaf, large spte entry is overwritten with a
non-leaf spte pointer, in __direct_map. Note this cannot happen now
because the only potential source of such overwrite is dirty logging
being enabled, which zaps all MMU pages. But this might change
in the future, so better be robust against it.

Noticed by Andrea.

KVM-Stable-Tag
Signed-off-by: Marcelo Tosatti <mtosa...@redhat.com>

Index: kvm/arch/x86/kvm/mmu.c
===
--- kvm.orig/arch/x86/kvm/mmu.c
+++ kvm/arch/x86/kvm/mmu.c
@@ -1952,6 +1952,8 @@ static void mmu_set_spte(struct kvm_vcpu

 		child = page_header(pte & PT64_BASE_ADDR_MASK);
 		mmu_page_remove_parent_pte(child, sptep);
+		__set_spte(sptep, shadow_trap_nonpresent_pte);
+		kvm_flush_remote_tlbs(vcpu->kvm);
 	} else if (pfn != spte_to_pfn(*sptep)) {
 		pgprintk("hfn old %lx new %lx\n",
 			 spte_to_pfn(*sptep), pfn);


Applied this bit.


@@ -2015,6 +2017,16 @@ static int __direct_map(struct kvm_vcpu
 			break;
 		}

+		if (is_shadow_present_pte(*iterator.sptep) &&
+		    !is_large_pte(*iterator.sptep))
+			continue;
+
+		if (is_large_pte(*iterator.sptep)) {
+			rmap_remove(vcpu->kvm, iterator.sptep);
+			__set_spte(iterator.sptep, shadow_trap_nonpresent_pte);
+			kvm_flush_remote_tlbs(vcpu->kvm);
+		}
+


Don't we have exactly the same issue in FNAME(fetch)()?

--
error compiling committee.c: too many arguments to function



Re: [PATCHv2-RFC 0/2] virtio: put last seen used index into ring itself

2010-05-30 Thread Michael S. Tsirkin
On Fri, May 28, 2010 at 11:56:54AM +0200, Jes Sorensen wrote:
 On 05/26/10 21:50, Michael S. Tsirkin wrote:
  Here's a rewrite of the original patch with a new layout.
  I haven't tested it yet so no idea how this performs, but
  I think this addresses the cache bounce issue raised by Avi.
  Posting for early flames/comments.
  
  Generally, the Host end of the virtio ring doesn't need to see where
  Guest is up to in consuming the ring.  However, to completely understand
  what's going on from the outside, this information must be exposed.
  For example, host can reduce the number of interrupts by detecting
  that the guest is currently handling previous buffers.
  
  We add a feature bit so the guest can tell the host that it's writing
  out the current value there, if it wants to use that.
  
  This differs from the original approach in that the used index
  is put after the avail index (they are typically written out together).
  To avoid cache bounces on descriptor access,
  and make future extensions easier, we put the ring itself at start of
  page, and move the control after it.
 
 Hi Michael,
 
 It looks pretty good to me, however one thing I have been thinking of
 while reading through it:
 
 Rather than storing a pointer within the ring struct, pointing to a
 position within the same struct, how about storing a byte offset instead
 and using a cast to get to the pointer position? That would avoid the
 pointer dereference, which is less effective cache wise and harder for
 the CPU to predict.
 
 Not sure whether it really matters performance wise, just a thought.
 
 Cheers,
 Jes

I think this won't work: when PUBLISH_USED_IDX is negotiated,
the pointer is to within the ring.
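
To make the two representations concrete (field names here are
illustrative, not the proposed ABI):

#include <stdint.h>

struct vring_ctrl {
	uint16_t avail_idx;
	/* As posted: a pointer which, once PUBLISH_USED_IDX is
	 * negotiated, points into the ring page itself. */
	uint16_t *last_used_ptr;
	/* Jes's alternative: a byte offset from the ring base,
	 * resolved with a cast at the use site. */
	uint32_t last_used_off;
};

static inline uint16_t last_used_via_ptr(const struct vring_ctrl *c)
{
	return *c->last_used_ptr;
}

static inline uint16_t last_used_via_off(const void *ring_base,
					 const struct vring_ctrl *c)
{
	return *(const uint16_t *)((const char *)ring_base +
				   c->last_used_off);
}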

-- 
MST


Re: [PATCH 2/3] workqueue: Add an API to create a singlethread workqueue attached to the current task's cgroup

2010-05-30 Thread Michael S. Tsirkin
On Fri, May 28, 2010 at 05:54:42PM +0200, Tejun Heo wrote:
 Hello,
 
 On 05/28/2010 05:08 PM, Michael S. Tsirkin wrote:
  Well, we have create_singlethread_workqueue, right?
  This is not very different ... is it?
  
  Just copying structures and code from workqueue.c,
  adding vhost_ in front of it will definitely work:
 
 Sure it will, but you'll probably be able to get away with much less.
 
  there is nothing magic about the workqueue library.
  But this just involves cut and paste which might be best avoided.
 
 What I'm saying is that some magic needs to be added to workqueue and
 if you add this single(!) exception, it will have to be backed out
 pretty soon, so it would be better to do it properly now.
 
  One final idea before we go the cut and paste way: how about
  'create_workqueue_from_task' that would get a thread and have workqueue
  run there?
 
 You can currently depend on that implementation detail but it's not
 what the workqueue interface is meant to do.  The single threadedness is
 there as execution ordering and concurrency specification and it
 doesn't (or rather won't) necessarily mean that a specific single
 thread is bound to certain workqueue.
 
 Can you please direct me to have a look at the code.  I'll be happy to
 do the conversion for you.

Great, thanks! The code in question is in drivers/vhost/vhost.c
It is used from drivers/vhost/net.c

On top of this, we have patchset from Sridhar Samudrala,
titled '0/3 Make vhost multi-threaded and associate each thread to its
guest's cgroup':

cgroups: Add an API to attach a task to current task's cgroup
workqueue: Add an API to create a singlethread workqueue attached to the
current task's cgroup
vhost: make it more scalable by creating a vhost thread per device

I have bounced the last three your way.


 Thanks.
 
 -- 
 tejun


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-05-30 Thread Avi Kivity

On 05/30/2010 03:19 PM, Michael S. Tsirkin wrote:

On Fri, May 28, 2010 at 04:07:38PM -0700, Tom Lyon wrote:
   

The VFIO driver is used to allow privileged AND non-privileged processes to
implement user-level device drivers for any well-behaved PCI, PCI-X, and PCIe
devices.
Signed-off-by: Tom Lyon <p...@cisco.com>
---
This patch is the evolution of code which was first proposed as a patch to
uio/uio_pci_generic, then as a more generic uio patch. Now it is taken entirely
out of the uio framework, and things seem much cleaner. Of course, there is
a lot of functional overlap with uio, but the previous version just seemed
like a giant mode switch in the uio code that did not lead to clarity for
either the new or old code.
 

IMO this was because this driver does two things: programming iommu and
handling interrupts. uio does interrupt handling.
We could have moved iommu / DMA programming to
a separate driver, and have uio work with it.
This would solve a limitation of the current driver,
namely that it needs an iommu domain per device.
   


How do we enforce security then?  We need to ensure that unprivileged 
users can only use the device with an iommu.



[a pony for avi...]
The major new functionality in this version is the ability to deal with
PCI config space accesses (through read & write calls) - but includes table
driven code to determine what's safe to write and what is not.
 

I don't really see why this is helpful: a driver written correctly
will not access these addresses, and we need an iommu anyway to protect
us against drivers.
   


Haven't reviewed the code (yet) but things like the BARs, MSI, and 
interrupt disable need to be protected from the guest regardless of the 
iommu.
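
For illustration, table-driven config-space checking along the lines Tom
describes could look like this (a sketch only; these are not the tables or
names from the patch):

#include <stdint.h>
#include <stddef.h>

struct cfg_perm {
	uint16_t offset;
	uint16_t len;
	int	 user_writable;
};

static const struct cfg_perm cfg_perms[] = {
	{ 0x00, 4, 0 },	/* vendor/device ID: read-only */
	{ 0x04, 2, 0 },	/* command register: mediated, never raw-writable */
	{ 0x0c, 2, 1 },	/* cacheline size / latency timer: harmless */
	/* BARs, MSI capability, etc. would be virtualized instead */
};

static int cfg_write_allowed(uint16_t off, uint16_t len)
{
	size_t i;

	for (i = 0; i < sizeof(cfg_perms) / sizeof(cfg_perms[0]); i++) {
		const struct cfg_perm *p = &cfg_perms[i];

		if (off >= p->offset && off + len <= p->offset + p->len)
			return p->user_writable;
	}
	return 0;	/* default: deny anything not whitelisted */
}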



--
error compiling committee.c: too many arguments to function



[PATCH 1/5] KVM: MMU: introduce some macros to clean up hlist traversing

2010-05-30 Thread Xiao Guangrong
Introduce for_each_gfn_sp(), for_each_gfn_indirect_sp() and
for_each_gfn_indirect_valid_sp() to clean up the hlist traversing

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c |  129 ++--
 1 files changed, 54 insertions(+), 75 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 56f8c3c..84c705e 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1200,6 +1200,22 @@ static void kvm_unlink_unsync_page(struct kvm *kvm, 
struct kvm_mmu_page *sp)
 
 static int kvm_mmu_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp);
 
+#define for_each_gfn_sp(kvm, sp, gfn, pos, n)				\
+	hlist_for_each_entry_safe(sp, pos, n,				\
+	 &kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)], hash_link)\
+		if (sp->gfn == gfn)
+
+#define for_each_gfn_indirect_sp(kvm, sp, gfn, pos, n)			\
+	hlist_for_each_entry_safe(sp, pos, n,				\
+	 &kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)], hash_link)\
+		if (sp->gfn == gfn && !sp->role.direct)
+
+#define for_each_gfn_indirect_valid_sp(kvm, sp, gfn, pos, n)		\
+	hlist_for_each_entry_safe(sp, pos, n,				\
+	 &kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)], hash_link)\
+		if (sp->gfn == gfn && !sp->role.direct &&		\
+			!sp->role.invalid)
+
 static int __kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
   bool clear_unsync)
 {
@@ -1243,16 +1259,12 @@ static int kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 /* @gfn should be write-protected at the call site */
 static void kvm_sync_pages(struct kvm_vcpu *vcpu,  gfn_t gfn)
 {
-	struct hlist_head *bucket;
 	struct kvm_mmu_page *s;
 	struct hlist_node *node, *n;
-	unsigned index;
 	bool flush = false;

-	index = kvm_page_table_hashfn(gfn);
-	bucket = &vcpu->kvm->arch.mmu_page_hash[index];
-	hlist_for_each_entry_safe(s, node, n, bucket, hash_link) {
-		if (s->gfn != gfn || !s->unsync || s->role.invalid)
+	for_each_gfn_indirect_valid_sp(vcpu->kvm, s, gfn, node, n) {
+		if (!s->unsync)
 			continue;

 		WARN_ON(s->role.level != PT_PAGE_TABLE_LEVEL);
@@ -1364,9 +1376,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 					     u64 *parent_pte)
 {
 	union kvm_mmu_page_role role;
-	unsigned index;
 	unsigned quadrant;
-	struct hlist_head *bucket;
 	struct kvm_mmu_page *sp;
 	struct hlist_node *node, *tmp;
 	bool need_sync = false;
@@ -1382,36 +1392,36 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 		quadrant = (1 << ((PT32_PT_BITS - PT64_PT_BITS) * level)) - 1;
 		role.quadrant = quadrant;
 	}
-	index = kvm_page_table_hashfn(gfn);
-	bucket = &vcpu->kvm->arch.mmu_page_hash[index];
-	hlist_for_each_entry_safe(sp, node, tmp, bucket, hash_link)
-		if (sp->gfn == gfn) {
-			if (!need_sync && sp->unsync)
-				need_sync = true;

-			if (sp->role.word != role.word)
-				continue;
+	for_each_gfn_sp(vcpu->kvm, sp, gfn, node, tmp) {
+		if (!need_sync && sp->unsync)
+			need_sync = true;

-			if (sp->unsync && kvm_sync_page_transient(vcpu, sp))
-				break;
+		if (sp->role.word != role.word)
+			continue;

-			mmu_page_add_parent_pte(vcpu, sp, parent_pte);
-			if (sp->unsync_children) {
-				set_bit(KVM_REQ_MMU_SYNC, &vcpu->requests);
-				kvm_mmu_mark_parents_unsync(sp);
-			} else if (sp->unsync)
-				kvm_mmu_mark_parents_unsync(sp);
+		if (sp->unsync && kvm_sync_page_transient(vcpu, sp))
+			break;

-			trace_kvm_mmu_get_page(sp, false);
-			return sp;
-		}
+		mmu_page_add_parent_pte(vcpu, sp, parent_pte);
+		if (sp->unsync_children) {
+			set_bit(KVM_REQ_MMU_SYNC, &vcpu->requests);
+			kvm_mmu_mark_parents_unsync(sp);
+		} else if (sp->unsync)
+			kvm_mmu_mark_parents_unsync(sp);
+
+		trace_kvm_mmu_get_page(sp, false);
+		return sp;
+	}
 	++vcpu->kvm->stat.mmu_cache_miss;
 	sp = kvm_mmu_alloc_page(vcpu, parent_pte, direct);
 	if (!sp)
 		return sp;
 	sp->gfn = gfn;
 	sp->role = role;
-   

[PATCH 2/5] KVM: MMU: split the operations of kvm_mmu_zap_page()

2010-05-30 Thread Xiao Guangrong
Using kvm_mmu_prepare_zap_page() and kvm_mmu_commit_zap_page() to
split the kvm_mmu_zap_page() function, then we can:

- traverse the hlist safely
- easily gather the remote tlb flushes which occur during page zapping

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/include/asm/kvm_host.h |1 +
 arch/x86/kvm/mmu.c  |   42 ++-
 arch/x86/kvm/x86.c  |1 +
 3 files changed, 43 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 0cd0f29..e4df1cf 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -388,6 +388,7 @@ struct kvm_arch {
 * Hash table of struct kvm_mmu_page.
 */
struct list_head active_mmu_pages;
+   struct list_head invalid_mmu_pages;
struct list_head assigned_dev_head;
struct iommu_domain *iommu_domain;
int iommu_flags;
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 84c705e..0c957bf 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -915,6 +915,7 @@ static int is_empty_shadow_page(u64 *spt)
 static void kvm_mmu_free_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
 	ASSERT(is_empty_shadow_page(sp->spt));
+	hlist_del(&sp->hash_link);
 	list_del(&sp->link);
 	__free_page(virt_to_page(sp->spt));
 	if (!sp->role.direct)
@@ -1560,6 +1561,46 @@ static int mmu_zap_unsync_children(struct kvm *kvm,
 	return zapped;
 }

+static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp)
+{
+	int ret;
+
+	trace_kvm_mmu_zap_page(sp);
+	++kvm->stat.mmu_shadow_zapped;
+	ret = mmu_zap_unsync_children(kvm, sp);
+	kvm_mmu_page_unlink_children(kvm, sp);
+	kvm_mmu_unlink_parents(kvm, sp);
+	if (!sp->role.invalid && !sp->role.direct)
+		unaccount_shadowed(kvm, sp->gfn);
+	if (sp->unsync)
+		kvm_unlink_unsync_page(kvm, sp);
+	if (!sp->root_count)
+		/* Count self */
+		ret++;
+	else
+		kvm_reload_remote_mmus(kvm);
+
+	sp->role.invalid = 1;
+	list_move(&sp->link, &kvm->arch.invalid_mmu_pages);
+	kvm_mmu_reset_last_pte_updated(kvm);
+	return ret;
+}
+
+static void kvm_mmu_commit_zap_page(struct kvm *kvm)
+{
+	struct kvm_mmu_page *sp, *n;
+
+	if (list_empty(&kvm->arch.invalid_mmu_pages))
+		return;
+
+	kvm_flush_remote_tlbs(kvm);
+	list_for_each_entry_safe(sp, n, &kvm->arch.invalid_mmu_pages, link) {
+		WARN_ON(!sp->role.invalid);
+		if (!sp->root_count)
+			kvm_mmu_free_page(kvm, sp);
+	}
+}
+
 static int kvm_mmu_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
 	int ret;
@@ -1577,7 +1618,6 @@ static int kvm_mmu_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 	if (!sp->root_count) {
 		/* Count self */
 		ret++;
-		hlist_del(&sp->hash_link);
 		kvm_mmu_free_page(kvm, sp);
 	} else {
 		sp->role.invalid = 1;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5e5cd8d..225c3c4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5331,6 +5331,7 @@ struct  kvm *kvm_arch_create_vm(void)
 	}

 	INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
+	INIT_LIST_HEAD(&kvm->arch.invalid_mmu_pages);
 	INIT_LIST_HEAD(&kvm->arch.assigned_dev_head);

 	/* Reserve bit 0 of irq_sources_bitmap for userspace irq source */
-- 
1.6.1.2





[PATCH 3/5] KVM: MMU: gather remote tlb flush which occurs during page zapped

2010-05-30 Thread Xiao Guangrong
Using kvm_mmu_prepare_zap_page() and kvm_mmu_commit_zap_page() instead of
kvm_mmu_zap_page(), which can reduce remote tlb flush IPIs

Have tested with xp/vista64 and fedora12 32/64 guests, and nothing is broken

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c  |   75 +++
 arch/x86/kvm/mmutrace.h |2 +-
 2 files changed, 31 insertions(+), 46 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 0c957bf..e2b1020 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1199,7 +1199,8 @@ static void kvm_unlink_unsync_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 	--kvm->stat.mmu_unsync;
 }

-static int kvm_mmu_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp);
+static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp);
+static void kvm_mmu_commit_zap_page(struct kvm *kvm);

 #define for_each_gfn_sp(kvm, sp, gfn, pos, n)				\
 	hlist_for_each_entry_safe(sp, pos, n,				\
@@ -1221,7 +1222,7 @@ static int __kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 			   bool clear_unsync)
 {
 	if (sp->role.cr4_pae != !!is_pae(vcpu)) {
-		kvm_mmu_zap_page(vcpu->kvm, sp);
+		kvm_mmu_prepare_zap_page(vcpu->kvm, sp);
 		return 1;
 	}

@@ -1232,7 +1233,7 @@ static int __kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	}

 	if (vcpu->arch.mmu.sync_page(vcpu, sp)) {
-		kvm_mmu_zap_page(vcpu->kvm, sp);
+		kvm_mmu_prepare_zap_page(vcpu->kvm, sp);
 		return 1;
 	}

@@ -1249,6 +1250,7 @@ static int kvm_sync_page_transient(struct kvm_vcpu *vcpu,
 	ret = __kvm_sync_page(vcpu, sp, false);
 	if (!ret)
 		mmu_convert_notrap(sp);
+	kvm_mmu_commit_zap_page(vcpu->kvm);
 	return ret;
 }
 
@@ -1271,13 +1273,13 @@ static void kvm_sync_pages(struct kvm_vcpu *vcpu,  gfn_t gfn)
 		WARN_ON(s->role.level != PT_PAGE_TABLE_LEVEL);
 		if ((s->role.cr4_pae != !!is_pae(vcpu)) ||
 			(vcpu->arch.mmu.sync_page(vcpu, s))) {
-			kvm_mmu_zap_page(vcpu->kvm, s);
+			kvm_mmu_prepare_zap_page(vcpu->kvm, s);
 			continue;
 		}
 		kvm_unlink_unsync_page(vcpu->kvm, s);
 		flush = true;
 	}
-
+	kvm_mmu_commit_zap_page(vcpu->kvm);
 	if (flush)
 		kvm_mmu_flush_tlb(vcpu);
 }
@@ -1363,6 +1365,7 @@ static void mmu_sync_children(struct kvm_vcpu *vcpu,
 			kvm_sync_page(vcpu, sp);
 			mmu_pages_clear_parents(&parents);
 		}
+		kvm_mmu_commit_zap_page(vcpu->kvm);
 		cond_resched_lock(&vcpu->kvm->mmu_lock);
 		kvm_mmu_pages_init(parent, &parents, &pages);
 	}
@@ -1551,7 +1554,7 @@ static int mmu_zap_unsync_children(struct kvm *kvm,
 	struct kvm_mmu_page *sp;

 	for_each_sp(pages, sp, parents, i) {
-		kvm_mmu_zap_page(kvm, sp);
+		kvm_mmu_prepare_zap_page(kvm, sp);
 		mmu_pages_clear_parents(&parents);
 		zapped++;
 	}
@@ -1565,7 +1568,7 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
 	int ret;

-	trace_kvm_mmu_zap_page(sp);
+	trace_kvm_mmu_prepare_zap_page(sp);
 	++kvm->stat.mmu_shadow_zapped;
 	ret = mmu_zap_unsync_children(kvm, sp);
 	kvm_mmu_page_unlink_children(kvm, sp);
@@ -1601,33 +1604,6 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm)
}
 }
 
-static int kvm_mmu_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp)
-{
-	int ret;
-
-	trace_kvm_mmu_zap_page(sp);
-	++kvm->stat.mmu_shadow_zapped;
-	ret = mmu_zap_unsync_children(kvm, sp);
-	kvm_mmu_page_unlink_children(kvm, sp);
-	kvm_mmu_unlink_parents(kvm, sp);
-	kvm_flush_remote_tlbs(kvm);
-	if (!sp->role.invalid && !sp->role.direct)
-		unaccount_shadowed(kvm, sp->gfn);
-	if (sp->unsync)
-		kvm_unlink_unsync_page(kvm, sp);
-	if (!sp->root_count) {
-		/* Count self */
-		ret++;
-		kvm_mmu_free_page(kvm, sp);
-	} else {
-		sp->role.invalid = 1;
-		list_move(&sp->link, &kvm->arch.active_mmu_pages);
-		kvm_reload_remote_mmus(kvm);
-	}
-	kvm_mmu_reset_last_pte_updated(kvm);
-	return ret;
-}
-
 /*
  * Changing the number of mmu pages allocated to the vm
  * Note: if kvm_nr_mmu_pages is too small, you will get dead lock
@@ -1652,8 +1628,9 @@ void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned int kvm_nr_mmu_pages)

 		page = container_of(kvm->arch.active_mmu_pages.prev,

[PATCH 4/5] KVM: MMU: traverse sp hlist safely

2010-05-30 Thread Xiao Guangrong
Now we can safely traverse the sp hlist

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c |   53 +++
 1 files changed, 24 insertions(+), 29 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index e2b1020..be75cba 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1202,18 +1202,18 @@ static void kvm_unlink_unsync_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp);
 static void kvm_mmu_commit_zap_page(struct kvm *kvm);

-#define for_each_gfn_sp(kvm, sp, gfn, pos, n)				\
-	hlist_for_each_entry_safe(sp, pos, n,				\
+#define for_each_gfn_sp(kvm, sp, gfn, pos)				\
+	hlist_for_each_entry(sp, pos,					\
 	 &kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)], hash_link)\
 		if (sp->gfn == gfn)

-#define for_each_gfn_indirect_sp(kvm, sp, gfn, pos, n)			\
-	hlist_for_each_entry_safe(sp, pos, n,				\
+#define for_each_gfn_indirect_sp(kvm, sp, gfn, pos)			\
+	hlist_for_each_entry(sp, pos,					\
 	 &kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)], hash_link)\
 		if (sp->gfn == gfn && !sp->role.direct)

-#define for_each_gfn_indirect_valid_sp(kvm, sp, gfn, pos, n)		\
-	hlist_for_each_entry_safe(sp, pos, n,				\
+#define for_each_gfn_indirect_valid_sp(kvm, sp, gfn, pos)		\
+	hlist_for_each_entry(sp, pos,					\
 	 &kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)], hash_link)\
 		if (sp->gfn == gfn && !sp->role.direct &&		\
 			!sp->role.invalid)
@@ -1263,10 +1263,10 @@ static int kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 static void kvm_sync_pages(struct kvm_vcpu *vcpu,  gfn_t gfn)
 {
 	struct kvm_mmu_page *s;
-	struct hlist_node *node, *n;
+	struct hlist_node *node;
 	bool flush = false;

-	for_each_gfn_indirect_valid_sp(vcpu->kvm, s, gfn, node, n) {
+	for_each_gfn_indirect_valid_sp(vcpu->kvm, s, gfn, node) {
 		if (!s->unsync)
 			continue;
@@ -1382,7 +1382,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 	union kvm_mmu_page_role role;
 	unsigned quadrant;
 	struct kvm_mmu_page *sp;
-	struct hlist_node *node, *tmp;
+	struct hlist_node *node;
 	bool need_sync = false;

 	role = vcpu->arch.mmu.base_role;
@@ -1397,7 +1397,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 		role.quadrant = quadrant;
 	}

-	for_each_gfn_sp(vcpu->kvm, sp, gfn, node, tmp) {
+	for_each_gfn_sp(vcpu->kvm, sp, gfn, node) {
 		if (!need_sync && sp->unsync)
 			need_sync = true;
@@ -1644,18 +1644,17 @@ void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned int kvm_nr_mmu_pages)
 static int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn)
 {
 	struct kvm_mmu_page *sp;
-	struct hlist_node *node, *n;
+	struct hlist_node *node;
 	int r;

 	pgprintk("%s: looking for gfn %lx\n", __func__, gfn);
 	r = 0;
-restart:
-	for_each_gfn_indirect_sp(kvm, sp, gfn, node, n) {
+
+	for_each_gfn_indirect_sp(kvm, sp, gfn, node) {
 		pgprintk("%s: gfn %lx role %x\n", __func__, gfn,
 			 sp->role.word);
 		r = 1;
-		if (kvm_mmu_prepare_zap_page(kvm, sp))
-			goto restart;
+		kvm_mmu_prepare_zap_page(kvm, sp);
 	}
 	kvm_mmu_commit_zap_page(kvm);
 	return r;
@@ -1664,14 +1663,12 @@ restart:
 static void mmu_unshadow(struct kvm *kvm, gfn_t gfn)
 {
 	struct kvm_mmu_page *sp;
-	struct hlist_node *node, *nn;
+	struct hlist_node *node;

-restart:
-	for_each_gfn_indirect_valid_sp(kvm, sp, gfn, node, nn) {
+	for_each_gfn_indirect_valid_sp(kvm, sp, gfn, node) {
 		pgprintk("%s: zap %lx %x\n",
 			 __func__, gfn, sp->role.word);
-		if (kvm_mmu_prepare_zap_page(kvm, sp))
-			goto restart;
+		kvm_mmu_prepare_zap_page(kvm, sp);
 	}
 	kvm_mmu_commit_zap_page(kvm);
 }
@@ -1816,9 +1813,9 @@ static void __kvm_unsync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 static void kvm_unsync_pages(struct kvm_vcpu *vcpu,  gfn_t gfn)
 {
 	struct kvm_mmu_page *s;
-	struct hlist_node *node, *n;
+	struct hlist_node *node;

-	for_each_gfn_indirect_valid_sp(vcpu->kvm, s, gfn, node, n) {
+	for_each_gfn_indirect_valid_sp(vcpu->kvm, s, gfn, node) {
  

[PATCH 5/5] KVM: MMU: reduce remote tlb flush in kvm_mmu_pte_write()

2010-05-30 Thread Xiao Guangrong
Collect remote tlb flushes in the kvm_mmu_pte_write() path

Have tested with xp/vista64 and fedora12 32/64 guests, and nothing is broken

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c |   20 
 1 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index be75cba..cbd8d9c 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2629,11 +2629,15 @@ static bool need_remote_flush(u64 old, u64 new)
 	return (old & ~new & PT64_PERM_MASK) != 0;
 }

-static void mmu_pte_write_flush_tlb(struct kvm_vcpu *vcpu, u64 old, u64 new)
+static void mmu_pte_write_flush_tlb(struct kvm_vcpu *vcpu, bool zap_page,
+				    bool remote_flush, bool local_flush)
 {
-	if (need_remote_flush(old, new))
+	if (zap_page)
+		return;
+
+	if (remote_flush)
 		kvm_flush_remote_tlbs(vcpu->kvm);
-	else
+	else if (local_flush)
 		kvm_mmu_flush_tlb(vcpu);
 }
 
@@ -2697,6 +2701,9 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 	int npte;
 	int r;
 	int invlpg_counter;
+	bool remote_flush, local_flush, zap_page;
+
+	zap_page = remote_flush = local_flush = false;

 	pgprintk("%s: gpa %llx bytes %d\n", __func__, gpa, bytes);
 
@@ -2770,6 +2777,7 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 		 */
 		pgprintk("misaligned: gpa %llx bytes %d role %x\n",
 			 gpa, bytes, sp->role.word);
+		zap_page = true;
 		kvm_mmu_prepare_zap_page(vcpu->kvm, sp);
 		++vcpu->kvm->stat.mmu_flooded;
 		continue;
@@ -2794,16 +2802,20 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 			if (quadrant != sp->role.quadrant)
 				continue;
 		}
+
+		local_flush = true;
 		spte = &sp->spt[page_offset / sizeof(*spte)];
 		while (npte--) {
 			entry = *spte;
 			mmu_pte_write_zap_pte(vcpu, sp, spte);
 			if (gentry)
 				mmu_pte_write_new_pte(vcpu, sp, spte, gentry);
-			mmu_pte_write_flush_tlb(vcpu, entry, *spte);
+			if (!remote_flush && need_remote_flush(entry, *spte))
+				remote_flush = true;
 			++spte;
 		}
 	}
+	mmu_pte_write_flush_tlb(vcpu, zap_page, remote_flush, local_flush);
 	kvm_mmu_commit_zap_page(vcpu->kvm);
 	kvm_mmu_audit(vcpu, "post pte write");
 	spin_unlock(&vcpu->kvm->mmu_lock);
-- 
1.6.1.2



Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-05-30 Thread Michael S. Tsirkin
On Sun, May 30, 2010 at 03:27:05PM +0300, Avi Kivity wrote:
 On 05/30/2010 03:19 PM, Michael S. Tsirkin wrote:
 On Fri, May 28, 2010 at 04:07:38PM -0700, Tom Lyon wrote:

 The VFIO driver is used to allow privileged AND non-privileged processes 
 to
 implement user-level device drivers for any well-behaved PCI, PCI-X, and 
 PCIe
 devices.
 Signed-off-by: Tom Lyon <p...@cisco.com>
 ---
 This patch is the evolution of code which was first proposed as a patch to
 uio/uio_pci_generic, then as a more generic uio patch. Now it is taken 
 entirely
 out of the uio framework, and things seem much cleaner. Of course, there is
 a lot of functional overlap with uio, but the previous version just seemed
 like a giant mode switch in the uio code that did not lead to clarity for
 either the new or old code.
  
 IMO this was because this driver does two things: programming iommu and
 handling interrupts. uio does interrupt handling.
 We could have moved iommu / DMA programming to
 a separate driver, and have uio work with it.
 This would solve a limitation of the current driver,
 namely that it needs an iommu domain per device.


 How do we enforce security then?  We need to ensure that unprivileged  
 users can only use the device with an iommu.

Force assigning to iommu before we allow any other operation?

 [a pony for avi...]
 The major new functionality in this version is the ability to deal with
 PCI config space accesses (through read & write calls) - but includes table
 driven code to determine what's safe to write and what is not.
  
 I don't really see why this is helpful: a driver written correctly
 will not access these addresses, and we need an iommu anyway to protect
 us against drivers.


 Haven't reviewed the code (yet) but things like the BARs, MSI, and  
 interrupt disable need to be protected from the guest regardless of the  
 iommu.

Yes but userspace can do this. As long as userspace can not
crash the kernel, no reason to put this policy into kernel.


 -- 
 error compiling committee.c: too many arguments to function


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-05-30 Thread Avi Kivity

On 05/29/2010 02:07 AM, Tom Lyon wrote:

The VFIO driver is used to allow privileged AND non-privileged processes to
implement user-level device drivers for any well-behaved PCI, PCI-X, and PCIe
devices.
   



+
+Why is this interesting?  Some applications, especially in the high performance
+computing field, need access to hardware functions with as little overhead as
+possible. Examples are in network adapters (typically non tcp/ip based) and
+in compute accelerators - i.e., array processors, FPGA processors, etc.
+Previous to the VFIO drivers these apps would need either a kernel-level
+driver (with corresponding overheads), or else root permissions to directly
+access the hardware. The VFIO driver allows generic access to the hardware
+from non-privileged apps IF the hardware is well-behaved enough for this
+to be safe.
   




+
+Any SR-IOV virtual function meets the VFIO definition of well-behaved, but
+there are many other non-IOV PCI devices which also meet the definition.
+Elements of this definition are:
+- The size of any memory BARs to be mmap'ed into the user process space must be
+  a multiple of the system page size.
   


You can relax this.
 - smaller than page size can be mapped if the rest of the page is unused
and if the platform tolerates writes to unused areas

 - if the rest of the page is used, we can relocate the BAR
 - otherwise, we can prevent mmap() but still allow mediated access via 
a syscall



+- If MSI-X interrupts are used, the device driver must not attempt to mmap or
+  write the MSI-X vector area.
   


We can allow mediated access (that's what qemu-kvm does).  I guess the 
ioctls for setting up msi interrupts are equivalent to this mediated access.


(later I see you do provide mediated access via pwrite - please confirm)


+- The device must not use the PCI configuration space in any non-standard way,
+  i.e., the user level driver will be permitted only to read and write standard
+  fields of the PCI config space, and only if those fields cannot cause harm to
+  the system. In addition, some fields are virtualized, so that the user
+  driver can read/write them like a kernel driver, but they do not affect the
+  real device.
   


What's wrong with nonstandard fields?


+
+Even with these restrictions, there are bound to be devices which are unsafe
+for user level use - it is still up to the system admin to decide whether to
+grant access to the device.  When the vfio module is loaded, it will have
+access to no devices until the desired PCI devices are bound to the driver.
+First, make sure the devices are not bound to another kernel driver. You can
+unload that driver if you wish to unbind all its devices, or else enter the
+driver's sysfs directory, and unbind a specific device:
+	cd /sys/bus/pci/drivers/drivername
+	echo :06:02.00 > unbind
+(The :06:02.00 is a fully qualified PCI device name - different for each
+device).  Now, to bind to the vfio driver, go to /sys/bus/pci/drivers/vfio and
+write the PCI device type of the target device to the new_id file:
+	echo 8086 10ca > new_id
+(8086 10ca are the vendor and device type for the Intel 82576 virtual function
+devices). A /dev/vfioN entry will be created for each device bound. The final
+step is to grant users permission by changing the mode and/or owner of the /dev
+entry - chmod 666 /dev/vfio0.
   


What if I have several such devices?  Isn't it better to bind by topology
(device address)?



+
+Reads  Writes:
+
+The user driver will typically use mmap to access the memory BAR(s) of a
+device; the I/O BARs and the PCI config space may be accessed through normal
+read and write system calls. Only 1 file descriptor is needed for all driver
+functions -- the desired BAR for I/O, memory, or config space is indicated via
+high-order bits of the file offset.


My preference would be one fd per BAR, but that's a matter of personal 
taste.
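
For illustration only, the offset packing could be expressed like this
(hypothetical macro names; the real conversion helpers are the
vfio_pci_space_to_offset/vfio_offset_to_pci_space routines mentioned below):

#include <sys/types.h>

#define VFIO_SPACE_SHIFT	40
#define VFIO_OFFSET(space, off) \
	(((off_t)(space) << VFIO_SPACE_SHIFT) | (off_t)(off))
#define VFIO_OFFSET_TO_SPACE(o)	((int)((o) >> VFIO_SPACE_SHIFT))

/* e.g.: pread(pci_fd, buf, len, VFIO_OFFSET(bar, bar_offset)); */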



For instance, the following implements a
+write to the PCI config space:
+
+	#include <linux/vfio.h>
+	void pci_write_config_word(int pci_fd, u16 off, u16 wd)
+	{
+		off_t cfg_off = VFIO_PCI_CONFIG_OFF + off;
+
+		if (pwrite(pci_fd, &wd, 2, cfg_off) != 2)
+			perror("pwrite config_dword");
+	}
+
   


Nice, has the benefit of avoiding endianness issues in the interface.


+The routines vfio_pci_space_to_offset and vfio_offset_to_pci_space are provided
+in vfio.h to convert bar numbers to file offsets and vice-versa.
+
+Interrupts:
+
+Device interrupts are translated by the vfio driver into input events on event
+notification file descriptors created by the eventfd system call. The user
+program must create one or more event descriptors and pass them to the vfio
+driver via ioctls to arrange for the interrupt mapping:
+1.
+	efd = eventfd(0, 0);
+	ioctl(vfio_fd, VFIO_EVENTFD_IRQ, &efd);
+   This provides an eventfd for traditional IRQ interrupts.
+   IRQs will be 

Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-05-30 Thread Michael S. Tsirkin
On Sun, May 30, 2010 at 04:01:53PM +0300, Avi Kivity wrote:
 On 05/30/2010 03:49 PM, Michael S. Tsirkin wrote:
 On Sun, May 30, 2010 at 03:27:05PM +0300, Avi Kivity wrote:

 On 05/30/2010 03:19 PM, Michael S. Tsirkin wrote:
  
 On Fri, May 28, 2010 at 04:07:38PM -0700, Tom Lyon wrote:


 The VFIO driver is used to allow privileged AND non-privileged 
 processes to
 implement user-level device drivers for any well-behaved PCI, PCI-X, and 
 PCIe
 devices.
   Signed-off-by: Tom Lyon <p...@cisco.com>
 ---
 This patch is the evolution of code which was first proposed as a patch to
 uio/uio_pci_generic, then as a more generic uio patch. Now it is taken 
 entirely
 out of the uio framework, and things seem much cleaner. Of course, there 
 is
 a lot of functional overlap with uio, but the previous version just seemed
 like a giant mode switch in the uio code that did not lead to clarity for
 either the new or old code.

  
 IMO this was because this driver does two things: programming iommu and
 handling interrupts. uio does interrupt handling.
 We could have moved iommu / DMA programming to
 a separate driver, and have uio work with it.
 This would solve a limitation of the current driver,
 namely that it needs an iommu domain per device.


 How do we enforce security then?  We need to ensure that unprivileged
 users can only use the device with an iommu.
  
 Force assigning to iommu before we allow any other operation?


 That means the driver must be aware of the iommu.

The userspace driver? Yes. And it is a good thing to be explicit
there anyway, since this lets userspace map a non-contiguous
virtual address list into a contiguous bus address range.

 -- 
 error compiling committee.c: too many arguments to function


Re: [PATCH 1/5] KVM: MMU: introduce some macros to clean up hlist traversing

2010-05-30 Thread Avi Kivity

On 05/30/2010 03:36 PM, Xiao Guangrong wrote:

Introduce for_each_gfn_sp(), for_each_gfn_indirect_sp() and
for_each_gfn_indirect_valid_sp() to clean up the hlist traversing

Signed-off-by: Xiao Guangrong <xiaoguangr...@cn.fujitsu.com>
---
  arch/x86/kvm/mmu.c |  129 ++--
  1 files changed, 54 insertions(+), 75 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 56f8c3c..84c705e 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1200,6 +1200,22 @@ static void kvm_unlink_unsync_page(struct kvm *kvm, 
struct kvm_mmu_page *sp)

  static int kvm_mmu_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp);

+#define for_each_gfn_sp(kvm, sp, gfn, pos, n)				\
+	hlist_for_each_entry_safe(sp, pos, n,				\
+	 &kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)], hash_link)\
+		if (sp->gfn == gfn)



   if (...)
       for_each_gfn_sp(...)
           blah();
   else
       BUG();

will break.  Can do 'if ((sp)->gfn != (gfn)) ; else'.

Or call functions from the for (;;) parameters to advance the cursor.

(also use parentheses to protect macro arguments)
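
To see the hazard concretely (a generic example, not the kvm macros):

/* A filtering loop macro with a bare trailing `if': */
#define for_each_even(i, n)			\
	for ((i) = 0; (i) < (n); (i)++)		\
		if ((i) % 2 == 0)

/* A caller writing:
 *
 *	if (cond)
 *		for_each_even(i, n)
 *			use(i);
 *	else
 *		handle_error();
 *
 * gets the `else' paired with the macro's `if ((i) % 2 == 0)',
 * not with `if (cond)'.  Inverting the test and consuming the
 * `else' slot avoids that: */
#define for_each_even_fixed(i, n)		\
	for ((i) = 0; (i) < (n); (i)++)		\
		if ((i) % 2 != 0) ; else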




+
+#define for_each_gfn_indirect_sp(kvm, sp, gfn, pos, n)			\
+	hlist_for_each_entry_safe(sp, pos, n,				\
+	 &kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)], hash_link)\
+		if (sp->gfn == gfn && !sp->role.direct)
+
+#define for_each_gfn_indirect_valid_sp(kvm, sp, gfn, pos, n)		\
+	hlist_for_each_entry_safe(sp, pos, n,				\
+	 &kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)], hash_link)\
+		if (sp->gfn == gfn && !sp->role.direct &&		\
+			!sp->role.invalid)


Shouldn't we always skip invalid gfns?  What about providing both gfn 
and role to the macro?


--
error compiling committee.c: too many arguments to function



Re: Perf trace event parse errors for KVM events

2010-05-30 Thread Steven Rostedt
On Sun, 2010-05-30 at 11:10 +0300, Avi Kivity wrote:
 On 05/29/2010 04:19 PM, Steven Rostedt wrote:
 
  The pretty printing from the kernel handles this fine. But there's
  pressure to pass the format to userspace in binary and have the tool
  parse it. Currently it uses the print fmt to figure out how to parse.
 
  Using one of the examples that Stefan showed:
 
  kvmmmu/kvm_mmu_get_page: print fmt: "%s %s", ({ const char *ret =
  p->buffer + p->len; static const char *access_str[] = { "---", "--x",
  "w--", "w-x", "-u-", "-ux", "wu-", "wux" }; union kvm_mmu_page_role
  role; role.word = REC->role; trace_seq_printf(p, "sp gfn %llx %u%s q%u%s
  %s%s %snxe root %u %s%c", REC->gfn, role.level, role.cr4_pae ?
  " pae" : "", role.quadrant, role.direct ? " direct" : "",
  access_str[role.access], role.invalid ? " invalid" : "", role.nxe ? "" :
  "!", REC->root_count, REC->unsync ? "unsync" : "sync", 0); ret; }),
  REC->created ? "new" : "existing"
 
 
  You need a full C parser/interpreter to understand the above.
 
 
 Right.  The tools can fall back to %x/%s based on the structure 
 descriptor if they can't parse the format string.
 

trace-cmd has plugin support to override how to read the format and
print it out. It now has the ability to write those plugins in python.

-- Steve




Re: Perf trace event parse errors for KVM events

2010-05-30 Thread Avi Kivity

On 05/30/2010 05:03 PM, Steven Rostedt wrote:



Right.  The tools can fall back to %x/%s based on the structure
descriptor if they can't parse the format string.

 

trace-cmd has plugin support to override how to read the format and
print it out. It now has the ability to write those plugins in python.
   


Cool.  May make sense to use simpler formatting in the kernel, and use 
trace-cmd plugins for the complicated cases.


It does raise issues with ABIs.  Can trace-cmd read plugins from 
/lib/modules/*?  We can then distribute the plugins with the kernel.


--
error compiling committee.c: too many arguments to function



Re: KVM: MMU: always invalidate and flush on spte page size change

2010-05-30 Thread Marcelo Tosatti
On Sun, May 30, 2010 at 01:28:19PM +0300, Avi Kivity wrote:
 On 05/28/2010 03:44 PM, Marcelo Tosatti wrote:
 Always invalidate spte and flush TLBs when changing page size, to make
 sure different sized translations for the same address are never cached
 in a CPU's TLB.
 
 The first case where this occurs is when a non-leaf spte pointer is
 overwritten by a leaf, large spte entry. This can happen after dirty
 logging is disabled on a memslot, for example.
 
 The second case is a leaf, large spte entry is overwritten with a
 non-leaf spte pointer, in __direct_map. Note this cannot happen now
 because the only potential source of such overwrite is dirty logging
 being enabled, which zaps all MMU pages. But this might change
 in the future, so better be robust against it.
 
 Noticed by Andrea.
 
 KVM-Stable-Tag
 Signed-off-by: Marcelo Tosatti <mtosa...@redhat.com>
 
 Index: kvm/arch/x86/kvm/mmu.c
 ===
 --- kvm.orig/arch/x86/kvm/mmu.c
 +++ kvm/arch/x86/kvm/mmu.c
 @@ -1952,6 +1952,8 @@ static void mmu_set_spte(struct kvm_vcpu
 
  	child = page_header(pte & PT64_BASE_ADDR_MASK);
  	mmu_page_remove_parent_pte(child, sptep);
 +	__set_spte(sptep, shadow_trap_nonpresent_pte);
 +	kvm_flush_remote_tlbs(vcpu->kvm);
  } else if (pfn != spte_to_pfn(*sptep)) {
  	pgprintk("hfn old %lx new %lx\n",
  		 spte_to_pfn(*sptep), pfn);
 
 Applied this bit.
 
 @@ -2015,6 +2017,16 @@ static int __direct_map(struct kvm_vcpu
  break;
  }
 
 +	if (is_shadow_present_pte(*iterator.sptep) &&
 +	    !is_large_pte(*iterator.sptep))
 +		continue;
 +
 +	if (is_large_pte(*iterator.sptep)) {
 +		rmap_remove(vcpu->kvm, iterator.sptep);
 +		__set_spte(iterator.sptep, shadow_trap_nonpresent_pte);
 +		kvm_flush_remote_tlbs(vcpu->kvm);
 +	}
 +
 
 Don't we have exactly the same issue in FNAME(fetch)()?

Yes, and it's already handled there:

	if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep))
		continue;

	if (is_large_pte(*sptep)) {
		rmap_remove(vcpu->kvm, sptep);
		__set_spte(sptep, shadow_trap_nonpresent_pte);
		kvm_flush_remote_tlbs(vcpu->kvm);
	}



Re: Perf trace event parse errors for KVM events

2010-05-30 Thread Steven Rostedt
On Sun, 2010-05-30 at 17:07 +0300, Avi Kivity wrote:
 On 05/30/2010 05:03 PM, Steven Rostedt wrote:
 
  Right.  The tools can fall back to %x/%s based on the structure
  descriptor if they can't parse the format string.
 
   
  trace-cmd has plugin support to override how to read the format and
  print it out. It now has the ability to write those plugins in python.
 
 
 Cool.  May make sense to use simpler formatting in the kernel, and use 
 trace-cmd plugins for the complicated cases.
 
 It does raise issues with ABIs.  Can trace-cmd read plugins from 
 /lib/modules/*?  We can then distribute the plugins with the kernel.
 

We probably could do that. Perhaps if we can port the python code to
perf, then it would work for both. Then the plugins could be just python
scripts, (or binary .so modules) and have a tools/plugins dir?

The python part probably would be easier to port, since the .so modules
are a bit more specific to trace-cmd.

-- Steve




[PATCH 2/3] cgroups: Add an API to attach a task to current task's cgroup

2010-05-30 Thread Tejun Heo
From: Sridhar Samudrala samudrala.srid...@gmail.com

Add a new kernel API to attach a task to current task's cgroup
in all the active hierarchies.

Signed-off-by: Sridhar Samudrala s...@us.ibm.com
---
 include/linux/cgroup.h |1 +
 kernel/cgroup.c|   23 +++
 2 files changed, 24 insertions(+)

Index: work/include/linux/cgroup.h
===
--- work.orig/include/linux/cgroup.h
+++ work/include/linux/cgroup.h
@@ -570,6 +570,7 @@ struct task_struct *cgroup_iter_next(str
 void cgroup_iter_end(struct cgroup *cgrp, struct cgroup_iter *it);
 int cgroup_scan_tasks(struct cgroup_scanner *scan);
 int cgroup_attach_task(struct cgroup *, struct task_struct *);
+int cgroup_attach_task_current_cg(struct task_struct *);

 /*
  * CSS ID is ID for cgroup_subsys_state structs under subsys. This only works
Index: work/kernel/cgroup.c
===
--- work.orig/kernel/cgroup.c
+++ work/kernel/cgroup.c
@@ -1788,6 +1788,29 @@ out:
return retval;
 }

+/**
+ * cgroup_attach_task_current_cg - attach task 'tsk' to current task's cgroup
+ * @tsk: the task to be attached
+ */
+int cgroup_attach_task_current_cg(struct task_struct *tsk)
+{
+	struct cgroupfs_root *root;
+	struct cgroup *cur_cg;
+	int retval = 0;
+
+	cgroup_lock();
+	for_each_active_root(root) {
+		cur_cg = task_cgroup_from_root(current, root);
+		retval = cgroup_attach_task(cur_cg, tsk);
+		if (retval)
+			break;
+	}
+	cgroup_unlock();
+
+	return retval;
+}
+EXPORT_SYMBOL_GPL(cgroup_attach_task_current_cg);
+
 /*
  * Attach task with pid 'pid' to cgroup 'cgrp'. Call with cgroup_mutex
  * held. May take task_lock of task


[PATCH 3/3] vhost: apply cpumask and cgroup to vhost pollers

2010-05-30 Thread Tejun Heo
Apply the cpumask and cgroup of the initializing task to the created
vhost poller.

Based on Sridhar Samudrala's patch.

Cc: Michael S. Tsirkin m...@redhat.com
Cc: Sridhar Samudrala samudrala.srid...@gmail.com
---
 drivers/vhost/vhost.c |   36 +++-
 1 file changed, 31 insertions(+), 5 deletions(-)

Index: work/drivers/vhost/vhost.c
===
--- work.orig/drivers/vhost/vhost.c
+++ work/drivers/vhost/vhost.c
@@ -23,6 +23,7 @@
 #include <linux/highmem.h>
 #include <linux/slab.h>
 #include <linux/kthread.h>
+#include <linux/cgroup.h>

 #include <linux/net.h>
 #include <linux/if_packet.h>
@@ -176,12 +177,30 @@ repeat:
 long vhost_dev_init(struct vhost_dev *dev,
 		    struct vhost_virtqueue *vqs, int nvqs)
 {
-	struct task_struct *poller;
-	int i;
+	struct task_struct *poller = NULL;
+	cpumask_var_t mask;
+	int i, ret = -ENOMEM;
+
+	if (!alloc_cpumask_var(&mask, GFP_KERNEL))
+		goto out;

 	poller = kthread_create(vhost_poller, dev, "vhost-%d", current->pid);
-	if (IS_ERR(poller))
-		return PTR_ERR(poller);
+	if (IS_ERR(poller)) {
+		ret = PTR_ERR(poller);
+		goto out;
+	}
+
+	ret = sched_getaffinity(current->pid, mask);
+	if (ret)
+		goto out;
+
+	ret = sched_setaffinity(poller->pid, mask);
+	if (ret)
+		goto out;
+
+	ret = cgroup_attach_task_current_cg(poller);
+	if (ret)
+		goto out;

 	dev->vqs = vqs;
 	dev->nvqs = nvqs;
@@ -202,7 +221,14 @@ long vhost_dev_init(struct vhost_dev *de
 		vhost_poll_init(&dev->vqs[i].poll,
 				dev->vqs[i].handle_kick, POLLIN, dev);
 	}
-	return 0;
+
+	wake_up_process(poller);	/* avoid contributing to loadavg */
+	ret = 0;
+out:
+	if (ret)
+		kthread_stop(poller);
+	free_cpumask_var(mask);
+	return ret;
 }

 /* Caller should have device mutex */


[PATCH 3/3] arch/s390/kvm: Use GFP_ATOMIC when a lock is held

2010-05-30 Thread Julia Lawall
From: Julia Lawall ju...@diku.dk

The containing function is called from several places.  At one of them, in
the function __sigp_stop, the spin lock fi->lock is held.

The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@gfp exists@
identifier fn;
position p;
@@

fn(...) {
 ... when != spin_unlock
     when any
 GFP_KERNEL@p
 ... when any
}

@locked@
identifier gfp.fn;
@@

spin_lock(...)
... when != spin_unlock
fn(...)

@depends on locked@
position gfp.p;
@@

- GFP_KERNEL@p
+ GFP_ATOMIC
// </smpl>

Signed-off-by: Julia Lawall ju...@diku.dk

---
 arch/s390/kvm/sigp.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff -u -p a/arch/s390/kvm/sigp.c b/arch/s390/kvm/sigp.c
--- a/arch/s390/kvm/sigp.c
+++ b/arch/s390/kvm/sigp.c
@@ -113,7 +113,7 @@ static int __inject_sigp_stop(struct kvm
 {
 	struct kvm_s390_interrupt_info *inti;

-	inti = kzalloc(sizeof(*inti), GFP_KERNEL);
+	inti = kzalloc(sizeof(*inti), GFP_ATOMIC);
 	if (!inti)
 		return -ENOMEM;
 	inti->type = KVM_S390_SIGP_STOP;


About offload function of Virtio Nic

2010-05-30 Thread Amos Kong
Hello all,

Can I ask some questions about the offloading function of the Virtio-NIC?

I'm interested in KVM virtualization. A virtual NIC is different
from real hardware, and there is also a big difference between the
Virtio-NIC and the E1000-NIC.
The offloading function of the Virtio-NIC is closely related to the
capabilities of the host's real hardware.
Associate Feature bits:
VIRTIO_NET_F_GUEST_...
VIRTIO_NET_F_HOST_...

---
command:
enable rx:  guest) # ethtool -K eth0
enable tx:  guest) # ethtool -K eth0
enable gs: guest) # ethtool -K eth0
enable tso: guest) # ethtool -K eth0
enable gso: guest) # ethtool -K eth0
enable lro: guest) # ethtool -K eth0
enable gro: guest) # ethtool -K eth0

Those steps work well with the virtual E1000-NIC, but they always fail
when using the Virtio-NIC.

---
After setup, I have two check method. For example:
1. enable tso
2. check the output of 'ethtool -k eth0'
3. send a large file from guest to host by scp, listen with tcpdump during
the transfer, and check whether it contains some large
packets ($length > 1514).

The check method works with the virtual E1000-NIC, but it always fails
when using the Virtio-NIC.

---
My question:
1. How to setup offloading function of Virtio-NIC?
2. How about my check method of offloading function(transferring file,
check packet length by tcpdump)?
3. Is there some detail user guide for Virtio-NIC?



Best Regards,
Amos


Re: About offload function of Virtio Nic

2010-05-30 Thread Amos Kong
On Mon, May 31, 2010 at 7:44 AM, Amos Kong kongjian...@gmail.com wrote:
 Hello all,

 Can I ask some questions about the offloading function of the Virtio-NIC?

 I'm interested in KVM virtualization. A virtual NIC is different
 from real hardware, and there is also a big difference between the
 Virtio-NIC and the E1000-NIC.
 The offloading function of the Virtio-NIC is closely related to the
 capabilities of the host's real hardware.
 Associate Feature bits:
 VIRTIO_NET_F_GUEST_...
 VIRTIO_NET_F_HOST_...

 ---
 command:
 enable rx:  guest) # ethtool -K eth0
 enable tx:  guest) # ethtool -K eth0
 enable gs: guest) # ethtool -K eth0
 enable tso: guest) # ethtool -K eth0
 enable gso: guest) # ethtool -K eth0
 enable lro: guest) # ethtool -K eth0
 enable gro: guest) # ethtool -K eth0

sorry, it should be

enable rx:  guest) # ethtool -K eth0 rx on
enable tx:  guest) # ethtool -K eth0 tx on
enable sg: guest) # ethtool -K eth0 sg on
enable tso: guest) # ethtool -K eth0 tso on
enable gso: guest) # ethtool -K eth0 gso on
enable lro: guest) # ethtool -K eth0 lro on
enable gro: guest) # ethtool -K eth0 gro on

It also works with my real hardware.

 Those steps work well with the virtual E1000-NIC, but they always fail
 when using the Virtio-NIC.

 ---
 After setup, I have two check method. For example:
 1. enable tso
 2. check the output of 'ethtool -k eth0'
 3. send a large file from guest to host by scp, listen with tcpdump during
 the transfer, and check whether it contains some large
 packets ($length > 1514).

 The check method works with the virtual E1000-NIC, but it always fails
 when using the Virtio-NIC.

 ---
 My question:
 1. How to setup offloading function of Virtio-NIC?
 2. How about my check method of offloading function(transferring file,
 check packet length by tcpdump)?
 3. Is there some detail user guide for Virtio-NIC?



 Best Regards,
 Amos




-- 
My homepage has moved; feel free to drop by ;)
http://kongove.whostas.com/


Re: [PATCH 2/3] cgroups: Add an API to attach a task to current task's cgroup

2010-05-30 Thread Li Zefan
04:24, Tejun Heo wrote:
 From: Sridhar Samudrala samudrala.srid...@gmail.com
 
 Add a new kernel API to attach a task to current task's cgroup
 in all the active hierarchies.
 
 Signed-off-by: Sridhar Samudrala s...@us.ibm.com

Acked-by: Li Zefan l...@cn.fujitsu.com

btw: you lost the reviewed-by tag given by Paul Menage.



Issues with getting PCI pass-through to work

2010-05-30 Thread Adhyas Avasthi
I have a PCI WLAN controller that I wish to pass through to the
standard linux-0.2.img VM (downloaded from the qemu website).
When I try qemu-kvm -pcidevice host=0c:00.0,name=abcd linux-0.2.img
The VM does boot up but does not see the device.

On the command prompt, I see the following message:
Failed to assign irq for abcd: Operation not supported
Perhaps you are assigning a device that shares an IRQ with another device?

What exactly does this message mean? Does PCI
pass-through require reconfiguring the IRQ of my
host devices before I can pass them through? Can I somehow configure
qemu-kvm to emulate the IRQ (and other registers) instead of trying to
pass through the IRQ, etc.?

BTW, my box has VT and VT-d enabled in BIOS.

--
Adhyas

Two types have compatible type if their types are the same.
   — ANSI C Standard, 3.1.2.6.



[PATCH 0/2] Setup scsi-bus xfer and xfer_mode for PR IN/OUT and Maintenance IN/OUT

2010-05-30 Thread Nicholas A. Bellinger
From: Nicholas Bellinger n...@linux-iscsi.org

Greetings Gerd, Kevin and Co,

Attached are two patches to add the necessary CDB parsing to determine
SCSIRequest->cmd.xfer (length) and SCSIRequest->cmd.mode (direction) for
Persistent Reservation IN/OUT CDBs and for Maintenance IN/OUT CDBs used for
Asymmetric Logical Unit Access, et al.  There is a special case for the
latter Maintenance CDBs with TYPE_ROM that has been included in
scsi_req_length().

Also, I should mention this is a temporary measure to ensure that we can
actually pass these CDBs through to a KVM guest for lsi and megaraid HBA
emulation.  What will eventually need to happen is to get rid of
scsi_req_xfer_mode() altogether and just set up SCSIRequest->cmd.mode based
on CDB type in scsi_req_length(), instead of having another switch (cdb[0])
statement for every SCSI WRITE CDB on the planet to set SCSI_XFER_TO_DEV.

Anyway, I will look at doing this conversion in scsi_req_length() at some
point, but please apply these for the moment so folks can get access to
their SPC-4 port LUNs with QEMU.  ;)
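
As a rough illustration of that direction (a sketch only: the helper name and
the reduced case list are assumptions, not the eventual QEMU code):

/* Illustrative only: fold direction setup into the length switch so a
 * single switch on cmd[0] yields both cmd.xfer and cmd.mode, making
 * scsi_req_xfer_mode() and its second switch unnecessary. */
static int scsi_req_length_and_mode(SCSIRequest *req, uint8_t *cmd)
{
    req->cmd.mode = SCSI_XFER_FROM_DEV;         /* assumed default */
    switch (cmd[0]) {
    case PERSISTENT_RESERVE_IN:
        req->cmd.xfer = cmd[8] | (cmd[7] << 8);
        break;
    case PERSISTENT_RESERVE_OUT:
        req->cmd.xfer = cmd[8] | (cmd[7] << 8);
        req->cmd.mode = SCSI_XFER_TO_DEV;       /* WRITE data */
        break;
    /* ... every other CDB would set xfer, and mode where needed, here. */
    }
    return 0;
}

A single switch keeps the length and direction decisions in one place, which
is the consolidation described above.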

Best,

--nab

Nicholas Bellinger (2):
  [scsi-bus]: Add PR-OUT and PR-IN case for SCSIRequest xfer and
xfer_mode setup
  [scsi-bus]: Add MAINTENANCE_IN and MAINTENANCE_OUT case for
SCSIRequest xfer and xfer_mode setup

 hw/scsi-bus.c  |   19 +++
 hw/scsi-defs.h |2 ++
 2 files changed, 21 insertions(+), 0 deletions(-)



[PATCH 1/2] [scsi-bus]: Add PR-OUT and PR-IN case for SCSIRequest xfer and xfer_mode setup

2010-05-30 Thread Nicholas A. Bellinger
From: Nicholas Bellinger n...@linux-iscsi.org

This patch updates hw/scsi-bus.c to add PERSISTENT_RESERVE_OUT and
PERSISTENT_RESERVE_IN cases in scsi_req_length() to extract the incoming
buffer length into SCSIRequest->cmd.xfer, and adds a second
PERSISTENT_RESERVE_OUT case in scsi_req_xfer_mode() in order to properly
set SCSI_XFER_TO_DEV for WRITE data.

Tested with Linux KVM guests and Megasas 8708EM2 HBA emulation and TCM_Loop 
target ports.

Signed-off-by: Nicholas A. Bellinger n...@linux-iscsi.org
---
 hw/scsi-bus.c |5 +
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/hw/scsi-bus.c b/hw/scsi-bus.c
index b8e4b71..75ec74e 100644
--- a/hw/scsi-bus.c
+++ b/hw/scsi-bus.c
@@ -325,6 +325,10 @@ static int scsi_req_length(SCSIRequest *req, uint8_t *cmd)
     case INQUIRY:
         req->cmd.xfer = cmd[4] | (cmd[3] << 8);
         break;
+    case PERSISTENT_RESERVE_OUT:
+    case PERSISTENT_RESERVE_IN:
+        req->cmd.xfer = cmd[8] | (cmd[7] << 8);
+        break;
     }
     return 0;
 }
@@ -389,6 +393,7 @@ static void scsi_req_xfer_mode(SCSIRequest *req)
     case MEDIUM_SCAN:
     case SEND_VOLUME_TAG:
     case WRITE_LONG_2:
+    case PERSISTENT_RESERVE_OUT:
         req->cmd.mode = SCSI_XFER_TO_DEV;
         break;
     default:
-- 
1.5.6.5



[PATCH 2/2] [scsi-bus]: Add MAINTENANCE_IN and MAINTENANCE_OUT case for SCSIRequest xfer and xfer_mode setup

2010-05-30 Thread Nicholas A. Bellinger
From: Nicholas Bellinger n...@linux-iscsi.org

This patch updates hw/scsi-bus.c to add MAINTENANCE_IN and MAINTENANCE_OUT
cases in scsi_req_length() to extract the incoming buffer length into
SCSIRequest->cmd.xfer, and adds a second MAINTENANCE_OUT case in
scsi_req_xfer_mode() in order to properly set SCSI_XFER_TO_DEV for WRITE
data.

Tested with Linux KVM guests and Megasas 8708EM2 HBA emulation and TCM_Loop 
target ports.

Signed-off-by: Nicholas A. Bellinger n...@linux-iscsi.org
---
 hw/scsi-bus.c  |   14 ++
 hw/scsi-defs.h |2 ++
 2 files changed, 16 insertions(+), 0 deletions(-)

diff --git a/hw/scsi-bus.c b/hw/scsi-bus.c
index 75ec74e..7d80405 100644
--- a/hw/scsi-bus.c
+++ b/hw/scsi-bus.c
@@ -329,6 +329,17 @@ static int scsi_req_length(SCSIRequest *req, uint8_t *cmd)
     case PERSISTENT_RESERVE_IN:
         req->cmd.xfer = cmd[8] | (cmd[7] << 8);
         break;
+    case MAINTENANCE_OUT:
+    case MAINTENANCE_IN:
+        if (req->dev->type != TYPE_ROM) {
+            /* Used for MI_REPORT_TARGET_PGS, MO_SET_TARGET_PGS et al. */
+            req->cmd.xfer = cmd[9] | (cmd[8] << 8) |
+                            (cmd[7] << 16) | (cmd[6] << 24);
+        } else {
+            /* GPCMD_REPORT_KEY and GPCMD_SEND_KEY from multi media commands */
+            req->cmd.xfer = cmd[9] | (cmd[8] << 8);
+        }
+        break;
     }
     return 0;
 }
@@ -394,6 +405,7 @@ static void scsi_req_xfer_mode(SCSIRequest *req)
     case SEND_VOLUME_TAG:
     case WRITE_LONG_2:
     case PERSISTENT_RESERVE_OUT:
+    case MAINTENANCE_OUT:
         req->cmd.mode = SCSI_XFER_TO_DEV;
         break;
     default:
@@ -539,6 +551,8 @@ const char *scsi_command_name(uint8_t cmd)
     [ SPACE                  ] = "SPACE",
     [ INQUIRY                ] = "INQUIRY",
     [ RECOVER_BUFFERED_DATA  ] = "RECOVER_BUFFERED_DATA",
+    [ MAINTENANCE_IN         ] = "MAINTENANCE_IN",
+    [ MAINTENANCE_OUT        ] = "MAINTENANCE_OUT",
     [ MODE_SELECT            ] = "MODE_SELECT",
     [ RESERVE                ] = "RESERVE",
     [ RELEASE                ] = "RELEASE",
diff --git a/hw/scsi-defs.h b/hw/scsi-defs.h
index 5890094..a4a3518 100644
--- a/hw/scsi-defs.h
+++ b/hw/scsi-defs.h
@@ -84,6 +84,8 @@
 #define MODE_SENSE_10 0x5a
 #define PERSISTENT_RESERVE_IN 0x5e
 #define PERSISTENT_RESERVE_OUT 0x5f
+#define MAINTENANCE_IN0xa3
+#define MAINTENANCE_OUT   0xa4
 #define MOVE_MEDIUM   0xa5
 #define READ_12   0xa8
 #define WRITE_12  0xaa
-- 
1.5.6.5



Re: [PATCH 1/5] KVM: MMU: introduce some macros to clean up hlist traversing

2010-05-30 Thread Xiao Guangrong


Avi Kivity wrote:
 On 05/30/2010 03:36 PM, Xiao Guangrong wrote:
 Introduce for_each_gfn_sp(), for_each_gfn_indirect_sp() and
 for_each_gfn_indirect_valid_sp() to clean up hlist traversing

 Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
 ---
   arch/x86/kvm/mmu.c |  129 ++--
   1 files changed, 54 insertions(+), 75 deletions(-)

 diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
 index 56f8c3c..84c705e 100644
 --- a/arch/x86/kvm/mmu.c
 +++ b/arch/x86/kvm/mmu.c
 @@ -1200,6 +1200,22 @@ static void kvm_unlink_unsync_page(struct kvm *kvm, struct kvm_mmu_page *sp)

   static int kvm_mmu_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp);

 +#define for_each_gfn_sp(kvm, sp, gfn, pos, n)\
 +  hlist_for_each_entry_safe(sp, pos, n,\
 +    &kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)], hash_link)\
 +      if (sp->gfn == gfn)


Avi,

Thanks for your review.

 
 
if (...)
 for_each_gfn_sp(...)
   blah();
else
  BUG();
 
  will break.  Can do 'if ((sp)->gfn != (gfn)) ; else'.
 
 Or call functions from the for (;;) parameters to advance the cursor.
 
 (also use parentheses to protect macro arguments)
 

Yeah, it's my mistake; I'll fix it in the next version.
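
For reference, a minimal standalone sketch of the dangling-else hazard and
the inverted-test fix Avi suggests; the macro and variable names here are
hypothetical, not the KVM ones:

/* dangling_else.c: why a for-each macro must not end in a bare 'if'. */
#include <stdio.h>

#define N 4

/* Broken shape: ends with a bare 'if', so a caller's 'else' binds to it. */
#define for_each_eq_bad(i, arr, val)        \
    for ((i) = 0; (i) < N; (i)++)           \
        if ((arr)[(i)] == (val))

/* Fixed shape from the thread: inverted test with an empty then-branch,
 * so the macro's 'if' already owns an 'else' and cannot capture one. */
#define for_each_eq(i, arr, val)            \
    for ((i) = 0; (i) < N; (i)++)           \
        if ((arr)[(i)] != (val)) ; else

int main(void)
{
    int a[N] = { 1, 5, 5, 2 };
    int i;

    if (a[0] > 0)
        for_each_eq(i, a, 5)            /* with for_each_eq_bad() the    */
            printf("a[%d] == 5\n", i);  /* 'else' below would bind to    */
    else                                /* the macro's hidden 'if'       */
        puts("not reached");

    return 0;
}

The parenthesized (arr), (i) and (val) also cover the macro-argument point
raised above.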

 
 
 +
 +#define for_each_gfn_indirect_sp(kvm, sp, gfn, pos, n)\
 +  hlist_for_each_entry_safe(sp, pos, n,\
 +    &kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)], hash_link)\
 +      if (sp->gfn == gfn && !sp->role.direct)
 +
 +#define for_each_gfn_indirect_valid_sp(kvm, sp, gfn, pos, n)\
 +  hlist_for_each_entry_safe(sp, pos, n,\
 +    &kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)], hash_link)\
 +      if (sp->gfn == gfn && !sp->role.direct &&\
 +          !sp->role.invalid)

 
 Shouldn't we always skip invalid gfns? 

Actually, the kvm_mmu_unprotect_page() function needs to find the
invalid shadow pages:

|   hlist_for_each_entry_safe(sp, node, n, bucket, hash_link)
|       if (sp->gfn == gfn && !sp->role.direct) {
|           pgprintk("%s: gfn %lx role %x\n", __func__, gfn,
|                    sp->role.word);
|           r = 1;
|           if (kvm_mmu_zap_page(kvm, sp))
|               goto restart;
|       }

I'm not sure whether we can skip invalid sp here, since it can change this
function's return value. :-(


 What about providing both gfn and role to the macro?
 

In the current code, nothing looks up an sp by role and gfn alone. In
kvm_mmu_get_page() we need to do extra work for sps where
'sp->gfn == gfn && sp->role != role', and other functions only need to
compare some members of role, not all of them.

Xiao



Re: [PATCH 2/5] KVM: MMU: split the operations of kvm_mmu_zap_page()

2010-05-30 Thread Xiao Guangrong


Avi Kivity wrote:
 On 05/30/2010 03:37 PM, Xiao Guangrong wrote:
 Use kvm_mmu_prepare_zap_page() and kvm_mmu_commit_zap_page() to
 split the kvm_mmu_zap_page() function; then we can:

 - traverse the hlist safely
 - easily gather the remote tlb flushes that occur while pages are zapped


 +static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 +{
 +    int ret;
 +
 +    trace_kvm_mmu_zap_page(sp);
 +    ++kvm->stat.mmu_shadow_zapped;
 +    ret = mmu_zap_unsync_children(kvm, sp);
 +    kvm_mmu_page_unlink_children(kvm, sp);
 +    kvm_mmu_unlink_parents(kvm, sp);
 +    if (!sp->role.invalid && !sp->role.direct)
 +        unaccount_shadowed(kvm, sp->gfn);
 +    if (sp->unsync)
 +        kvm_unlink_unsync_page(kvm, sp);
 +    if (!sp->root_count)
 +        /* Count self */
 +        ret++;
 +    else
 +        kvm_reload_remote_mmus(kvm);
 +
 +    sp->role.invalid = 1;
 +    list_move(&sp->link, &kvm->arch.invalid_mmu_pages);
 +    kvm_mmu_reset_last_pte_updated(kvm);
 +    return ret;
 +}
 +
 +static void kvm_mmu_commit_zap_page(struct kvm *kvm)
 +{
 +    struct kvm_mmu_page *sp, *n;
 +
 +    if (list_empty(&kvm->arch.invalid_mmu_pages))
 +        return;
 +
 +    kvm_flush_remote_tlbs(kvm);
 +    list_for_each_entry_safe(sp, n, &kvm->arch.invalid_mmu_pages, link) {
 +        WARN_ON(!sp->role.invalid);
 +        if (!sp->root_count)
 +            kvm_mmu_free_page(kvm, sp);
 +    }
 +}
 +

 
 You're adding two new functions but not using them here?  Possibly in
 the old kvm_mmu_zap_page()?

I use them in the next patch; it's not used in kvm_mmu_zap_page(), it's used like this:

hold mmu spin lock

kvm_mmu_prepare_zap_page page A
kvm_mmu_prepare_zap_page page B
kvm_mmu_prepare_zap_page page C
..
kvm_mmu_commit_zap_page

release mmu spin lock
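
In other words, something like this minimal sketch; the surrounding function
is hypothetical, and only the prepare/commit names come from this series:

/* Hypothetical caller showing the batched-zap pattern: several pages are
 * prepared for zapping under the mmu lock, then one commit flushes remote
 * TLBs once and frees the pages with no remaining root references. */
static void zap_pages_batched(struct kvm *kvm,
                              struct kvm_mmu_page **pages, int n)
{
    int i;

    spin_lock(&kvm->mmu_lock);
    for (i = 0; i < n; i++)
        kvm_mmu_prepare_zap_page(kvm, pages[i]);  /* unlink, mark invalid */
    kvm_mmu_commit_zap_page(kvm);    /* one remote TLB flush for the batch */
    spin_unlock(&kvm->mmu_lock);
}

The point is that kvm_flush_remote_tlbs() runs once per batch instead of
once per zapped page.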

 
 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index 5e5cd8d..225c3c4 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -5331,6 +5331,7 @@ struct  kvm *kvm_arch_create_vm(void)
   }

     INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
 +   INIT_LIST_HEAD(&kvm->arch.invalid_mmu_pages);
     INIT_LIST_HEAD(&kvm->arch.assigned_dev_head);

 
 This is a good idea, but belongs in a separate patch?  We can use it to
 reclaim invalid pages before allocating new ones.
 

This patch is very simple, and the kvm_mmu_commit_zap_page() function depends
on kvm->arch.invalid_mmu_pages, so I think we don't need to separate this
patch. What's your opinion? :-)

Thanks,
Xiao
