[PATCH] virtio_blk: Add help function to format mass of disks

2012-04-10 Thread Ren Mingxin
The current virtio block's naming algorithm just supports 18278
(26^3 + 26^2 + 26) disks. If there are mass of virtio blocks,
there will be disks with the same name.

Based on commit 3e1a7ff8a0a7b948f2684930166954f9e8e776fe, I add
function virtblk_name_format() for virtio block to support mass
of disks naming.

Signed-off-by: Ren Mingxin <re...@cn.fujitsu.com>
---
 drivers/block/virtio_blk.c |   38 ++
 1 files changed, 26 insertions(+), 12 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index c4a60ba..86516c8 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -374,6 +374,31 @@ static int init_vq(struct virtio_blk *vblk)
return err;
 }
 
+static int virtblk_name_format(char *prefix, int index, char *buf, int buflen)
+{
+   const int base = 'z' - 'a' + 1;
+   char *begin = buf + strlen(prefix);
+   char *begin = buf + strlen(prefix);
+   char *end = buf + buflen;
+   char *p;
+   int unit;
+
+   p = end - 1;
+   *p = '\0';
+   unit = base;
+   do {
+           if (p == begin)
+                   return -EINVAL;
+           *--p = 'a' + (index % unit);
+           index = (index / unit) - 1;
+   } while (index >= 0);
+
+   memmove(begin, p, end - p);
+   memcpy(buf, prefix, strlen(prefix));
+
+   return 0;
+}
+
 static int __devinit virtblk_probe(struct virtio_device *vdev)
 {
struct virtio_blk *vblk;
@@ -442,18 +467,7 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
 
    q->queuedata = vblk;
 
-   if (index < 26) {
-           sprintf(vblk->disk->disk_name, "vd%c", 'a' + index % 26);
-   } else if (index < (26 + 1) * 26) {
-           sprintf(vblk->disk->disk_name, "vd%c%c",
-                   'a' + index / 26 - 1, 'a' + index % 26);
-   } else {
-           const unsigned int m1 = (index / 26 - 1) / 26 - 1;
-           const unsigned int m2 = (index / 26 - 1) % 26;
-           const unsigned int m3 =  index % 26;
-           sprintf(vblk->disk->disk_name, "vd%c%c%c",
-                   'a' + m1, 'a' + m2, 'a' + m3);
-   }
+   virtblk_name_format("vd", index, vblk->disk->disk_name, DISK_NAME_LEN);
 
    vblk->disk->major = major;
    vblk->disk->first_minor = index_to_minor(index);
-- 
1.7.1

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
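
For readers who want to try the naming scheme from the patch above outside the kernel, here is a minimal user-space sketch. The function name name_format(), the extra argument checks and the use of assert() are assumptions made for illustration; they are not part of the posted patch. The loop itself mirrors virtblk_name_format(): the suffix is a bijective base-26 number, so index 0 maps to "a", 25 to "z", 26 to "aa", and 18278 (the first index the old three-letter code could not name) to "aaaa".

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/*
 * Hypothetical user-space copy of the naming loop, for illustration only.
 * Writes prefix followed by a bijective base-26 suffix into buf.
 * Returns 0 on success, -EINVAL if the arguments are invalid or the
 * name does not fit into buflen bytes.
 */
static int name_format(const char *prefix, int index, char *buf, int buflen)
{
	const int base = 'z' - 'a' + 1;	/* 26 letters */
	size_t plen = strlen(prefix);
	char *begin, *end, *p;

	assert(buf != NULL);
	/* Extra checks (not in the patch): need prefix + one letter + NUL. */
	if (index < 0 || buflen < 0 || plen + 2 > (size_t)buflen)
		return -EINVAL;

	begin = buf + plen;
	end = buf + buflen;
	p = end - 1;
	*p = '\0';
	/* Build the suffix right to left at the end of the buffer. */
	do {
		if (p == begin)
			return -EINVAL;	/* suffix does not fit */
		*--p = 'a' + (index % base);
		index = (index / base) - 1;
	} while (index >= 0);

	/* Move the suffix (with its NUL) behind the prefix, then copy the prefix. */
	memmove(begin, p, end - p);
	memcpy(buf, prefix, plen);
	return 0;
}
```

Calling name_format("vd", 702, buf, 32) with a 32-byte buffer fills buf with "vdaaa"; a buffer that is too short for the generated suffix makes the function return -EINVAL instead of writing a truncated name.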


Re: [PATCH 00/13] KVM: MMU: fast page fault

2012-04-10 Thread Avi Kivity
On 04/09/2012 09:26 PM, Xiao Guangrong wrote:
> Yes, if Xwindow is not enabled, the benefit is limited. :)

I'm more interested in migration.

We could optimize the framebuffer by disabling dirty logging when
VNC/Spice is not connected (which should usually be the case), or when
the SDL window is minimized (shouldn't be that often, unfortunately)

Related, qxl doesn't seem to stop the dirty log when switching to
accelerated mode.  vmsvga gets it right:

    case SVGA_REG_ENABLE:
        s->enable = value;
        s->config &= !!value;
        s->width = -1;
        s->height = -1;
        s->invalidated = 1;
        s->vga.invalidate(&s->vga);
        if (s->enable) {
            s->fb_size = ((s->depth + 7) >> 3) * s->new_width *
                         s->new_height;
            vga_dirty_log_stop(&s->vga);
        } else {
            vga_dirty_log_start(&s->vga);
        }
        break;


-- 
error compiling committee.c: too many arguments to function



Re: [PATCH 00/13] KVM: MMU: fast page fault

2012-04-10 Thread Avi Kivity
On 04/09/2012 10:46 PM, Marcelo Tosatti wrote:
> Perhaps the mmu_lock hold times by get_dirty are a large component here?
> If that can be alleviated, not only RO->RW faults benefit.



Currently the longest holder in normal use is probably reading the dirty
log and write protecting the shadow page tables.

We could fix that by switching to O(1) write protection
(write-protecting PML4Es instead of PTEs).  It would be interesting to
combine O(1) write protection with lockless write-enabling.

-- 
error compiling committee.c: too many arguments to function



Re: source for virt io backend driver

2012-04-10 Thread Stefan Hajnoczi
On Tue, Apr 10, 2012 at 4:47 AM, Steven <wangwangk...@gmail.com> wrote:
> I found this post
> http://permalink.gmane.org/gmane.comp.emulators.kvm.devel/89334
> So the current block driver seems completely emulated by the qemu driver.

That's right: qemu/hw/virtio-blk.c

Stefan


Re: [PATCH 00/13] KVM: MMU: fast page fault

2012-04-10 Thread Takuya Yoshikawa
On Tue, 10 Apr 2012 13:39:14 +0300
Avi Kivity <a...@redhat.com> wrote:

> On 04/09/2012 10:46 PM, Marcelo Tosatti wrote:
> > Perhaps the mmu_lock hold times by get_dirty are a large component here?
> > If that can be alleviated, not only RO->RW faults benefit.
> >
> >
>
> Currently the longest holder in normal use is probably reading the dirty
> log and write protecting the shadow page tables.
>
> We could fix that by switching to O(1) write protection
> (write-protecting PML4Es instead of PTEs).  It would be interesting to
> combine O(1) write protection with lockless write-enabling.
>

As Marcelo suggested during reviewing srcu-less dirty logging, we can
mitigate the get_dirty's mmu_lock hold time problem cleanly, locally in
get_dirty_log(), by using cond_resched_lock() -- although we need to
introduce cond_rescheck_lock_cb() to conditionally flush TLB.

I have already started that work.

Actually I introduced rmap based get_dirty for that kind of fine-grained
contention control.

I think we should do our best not to affect mmu so much just for the
limited time of live migration.

Takuya


Re: [PATCH 0/2] adding tracepoints to vhost

2012-04-10 Thread Stefan Hajnoczi
On Tue, Apr 10, 2012 at 3:58 AM, Jason Wang <jasow...@redhat.com> wrote:
> To help in vhost analyzing, the following series adding basic tracepoints to
> vhost. Operations of both virtqueues and vhost works were traced in current
> implementation, net code were untouched. A top-like satistics displaying script
> were introduced to help the troubleshooting.
>
> TODO:
> - net specific tracepoints?
>
> ---
>
> Jason Wang (2):
>       vhost: basic tracepoints
>       tools: virtio: add a top-like utility for displaying vhost satistics
>
>
>  drivers/vhost/trace.h   |  153
>  drivers/vhost/vhost.c   |   17 ++
>  tools/virtio/vhost_stat |  360
>  3 files changed, 528 insertions(+), 2 deletions(-)
>  create mode 100644 drivers/vhost/trace.h
>  create mode 100755 tools/virtio/vhost_stat

Perhaps this can replace the vhost log feature?  I'm not sure if
tracepoints support the right data types but it seems like vhost
debugging could be done using tracing with less code.

Stefan


Re: Question about emulation of KVM?

2012-04-10 Thread Stefan Hajnoczi
On Sat, Apr 7, 2012 at 6:18 AM, R <1989012...@gmail.com> wrote:
> I try to use the x86_emulate_instruction() function.
> But it seems like that it fails to emulate some instruction.
> My program gets stuck in somewhere. It keeps emulating
> one instructions.
> Is there some instructions that this function can not emulate?

Yes there are instructions that are not supported by the emulator but
they should produce a kernel message.

Check dmesg(1) to see if an error was logged.  You can also enable the
kvm:* tracepoints in the kernel to get detailed information on guest
behavior, including emulated instruction opcodes.

Stefan


Re: vhost-blk development

2012-04-10 Thread Stefan Hajnoczi
On Mon, Apr 9, 2012 at 11:59 PM, Michael Baysek <mbay...@liquidweb.com> wrote:
> Hi all.  I'm interested in any developments on the vhost-blk in kernel
> accelerator for disk i/o.
>
> I had seen a patchset on LKML https://lkml.org/lkml/2011/7/28/175 but that is
> rather old.  Are there any newer developments going on with the vhost-blk
> stuff?

Hi Michael,
I'm curious what you are looking for in vhost-blk.  Are you trying to
improve disk performance for KVM guests?

Perhaps you'd like to share your configuration, workload, and other
details so that we can discuss how to improve performance.

Stefan


Re: [PATCH 00/13] KVM: MMU: fast page fault

2012-04-10 Thread Xiao Guangrong
On 04/10/2012 07:40 PM, Takuya Yoshikawa wrote:

> On Tue, 10 Apr 2012 13:39:14 +0300
> Avi Kivity <a...@redhat.com> wrote:
>
>> On 04/09/2012 10:46 PM, Marcelo Tosatti wrote:
>>> Perhaps the mmu_lock hold times by get_dirty are a large component here?
>>> If that can be alleviated, not only RO->RW faults benefit.
>>>
>>>
>>
>> Currently the longest holder in normal use is probably reading the dirty
>> log and write protecting the shadow page tables.
>>
>> We could fix that by switching to O(1) write protection
>> (write-protecting PML4Es instead of PTEs).  It would be interesting to
>> combine O(1) write protection with lockless write-enabling.
>>
>
> As Marcelo suggested during reviewing srcu-less dirty logging, we can
> mitigate the get_dirty's mmu_lock hold time problem cleanly, locally in
> get_dirty_log(), by using cond_resched_lock() -- although we need to
> introduce cond_rescheck_lock_cb() to conditionally flush TLB.
>


Although it can reduce the contention, it does not reduce the overhead
of dirty logging.


> I have already started that work.
>
> Actually I introduced rmap based get_dirty for that kind of fine-grained
> contention control.
>


I do not think this way is better than O(1). Avi has explained the reason
many times, and I agree with him. :)

> I think we should do our best not to affect mmu so much just for the
> limited time of live migration.
>


No, I do not really agree with that.

We can really get a great benefit from O(1), especially if lockless
write-protect is introduced for O(1). Live migration is very useful for cloud
computing architectures, to balance the load across all nodes.

And there is no reason to disallow us from touching the MMU code. Yes, it needs
to be simple, but that does not mean stopping the development of the MMU.

On the other hand, a mechanism like yours for improving dirty-log also needs
to introduce lots of code, and it does not make the MMU clearer. :)



Re: [PATCH] KVM: PMU emulation: GLOBAL_CTRL MSR should be enabled on reset.

2012-04-10 Thread Avi Kivity
On 04/09/2012 05:38 PM, Gleb Natapov wrote:
> On reset all PMU counters should be enabled in GLOBAL_CTRL MSR.

> Signed-off-by: Gleb Natapov <g...@redhat.com>
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index 173df38..2e88438 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -459,17 +459,17 @@ void kvm_pmu_cpuid_update(struct kvm_vcpu *vcpu)
>     pmu->available_event_types = ~entry->ebx & ((1ull << bitmap_len) - 1);
>
>     if (pmu->version == 1) {
> -           pmu->global_ctrl = (1 << pmu->nr_arch_gp_counters) - 1;
> -           return;
> +           pmu->nr_arch_fixed_counters = 0;
> +   } else {
> +           pmu->nr_arch_fixed_counters = min((int)(entry->edx & 0x1f),
> +                           X86_PMC_MAX_FIXED);
> +           pmu->counter_bitmask[KVM_PMC_FIXED] =
> +                   ((u64)1 << ((entry->edx >> 5) & 0xff)) - 1;
>     }
>
> -   pmu->nr_arch_fixed_counters = min((int)(entry->edx & 0x1f),
> -                   X86_PMC_MAX_FIXED);
> -   pmu->counter_bitmask[KVM_PMC_FIXED] =
> -           ((u64)1 << ((entry->edx >> 5) & 0xff)) - 1;
> -   pmu->global_ctrl_mask = ~(((1 << pmu->nr_arch_gp_counters) - 1)
> -                   | (((1ull << pmu->nr_arch_fixed_counters) - 1)
> -                           << X86_PMC_IDX_FIXED));
> +   pmu->global_ctrl = ((1 << pmu->nr_arch_gp_counters) - 1) |
> +           (((1ull << pmu->nr_arch_fixed_counters) - 1) << X86_PMC_IDX_FIXED);
> +   pmu->global_ctrl_mask = ~pmu->global_ctrl;
>  }
>
  


This is not called on INIT (not sure it should be).  On the other hand
update_cpuid() is not the best place to initialize stuff.

Oh well, this can be fixed later (not sure it's possible), I'll apply
this to master.
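
To make the reset value in the quoted hunk concrete, here is a small stand-alone sketch of the same bit arithmetic. The helper names and the constant PMC_IDX_FIXED (assumed to be 32, the bit position where the fixed counters start in GLOBAL_CTRL) are chosen for illustration and are not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bit position of the first fixed counter in GLOBAL_CTRL. */
#define PMC_IDX_FIXED 32

/*
 * One enable bit per general-purpose counter in the low bits, plus one
 * enable bit per fixed counter starting at PMC_IDX_FIXED.
 */
static uint64_t pmu_global_ctrl(unsigned int nr_gp, unsigned int nr_fixed)
{
	uint64_t gp_bits, fixed_bits;

	assert(nr_gp < 32 && nr_fixed < 32);
	gp_bits = (1ull << nr_gp) - 1;
	fixed_bits = ((1ull << nr_fixed) - 1) << PMC_IDX_FIXED;
	return gp_bits | fixed_bits;
}

/* Every bit outside the counters that exist is reserved. */
static uint64_t pmu_global_ctrl_mask(unsigned int nr_gp, unsigned int nr_fixed)
{
	return ~pmu_global_ctrl(nr_gp, nr_fixed);
}
```

With four general-purpose and three fixed counters, pmu_global_ctrl(4, 3) yields 0x70000000f; for a version 1 PMU with no fixed counters the upper half stays zero.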

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH 0/2] adding tracepoints to vhost

2012-04-10 Thread Michael S. Tsirkin
On Tue, Apr 10, 2012 at 12:40:50PM +0100, Stefan Hajnoczi wrote:
> On Tue, Apr 10, 2012 at 3:58 AM, Jason Wang <jasow...@redhat.com> wrote:
> > To help in vhost analyzing, the following series adding basic tracepoints to
> > vhost. Operations of both virtqueues and vhost works were traced in current
> > implementation, net code were untouched. A top-like satistics displaying script
> > were introduced to help the troubleshooting.
> >
> > TODO:
> > - net specific tracepoints?
> >
> > ---
> >
> > Jason Wang (2):
> >       vhost: basic tracepoints
> >       tools: virtio: add a top-like utility for displaying vhost satistics
> >
> >
> >  drivers/vhost/trace.h   |  153
> >  drivers/vhost/vhost.c   |   17 ++
> >  tools/virtio/vhost_stat |  360
> >  3 files changed, 528 insertions(+), 2 deletions(-)
> >  create mode 100644 drivers/vhost/trace.h
> >  create mode 100755 tools/virtio/vhost_stat
>
> Perhaps this can replace the vhost log feature?  I'm not sure if
> tracepoints support the right data types but it seems like vhost
> debugging could be done using tracing with less code.
>
> Stefan

vhost log is not a debugging tool, it logs memory accesses for
migration.


Re: [PATCH 0/2] adding tracepoints to vhost

2012-04-10 Thread Stefan Hajnoczi
On Tue, Apr 10, 2012 at 1:42 PM, Michael S. Tsirkin <m...@redhat.com> wrote:
> On Tue, Apr 10, 2012 at 12:40:50PM +0100, Stefan Hajnoczi wrote:
>> On Tue, Apr 10, 2012 at 3:58 AM, Jason Wang <jasow...@redhat.com> wrote:
>> > To help in vhost analyzing, the following series adding basic tracepoints to
>> > vhost. Operations of both virtqueues and vhost works were traced in current
>> > implementation, net code were untouched. A top-like satistics displaying script
>> > were introduced to help the troubleshooting.
>> >
>> > TODO:
>> > - net specific tracepoints?
>> >
>> > ---
>> >
>> > Jason Wang (2):
>> >       vhost: basic tracepoints
>> >       tools: virtio: add a top-like utility for displaying vhost satistics
>> >
>> >
>> >  drivers/vhost/trace.h   |  153
>> >  drivers/vhost/vhost.c   |   17 ++
>> >  tools/virtio/vhost_stat |  360
>> >  3 files changed, 528 insertions(+), 2 deletions(-)
>> >  create mode 100644 drivers/vhost/trace.h
>> >  create mode 100755 tools/virtio/vhost_stat
>>
>> Perhaps this can replace the vhost log feature?  I'm not sure if
>> tracepoints support the right data types but it seems like vhost
>> debugging could be done using tracing with less code.
>>
>> Stefan
>
> vhost log is not a debugging tool, it logs memory accesses for
> migration.

Thanks.  I totally misunderstood its purpose.

Stefan


[PATCH v2] KVM: Avoid zapping unrelated shadows in __kvm_set_memory_region()

2012-04-10 Thread Takuya Yoshikawa
From: Takuya Yoshikawa <yoshikawa.tak...@oss.ntt.co.jp>

We do not need to zap all shadow pages of the guest when we create or
destroy a slot in this function.

To change this, we make kvm_mmu_zap_all()/kvm_arch_flush_shadow()
zap only those which have mappings into a given slot.

The way we iterate through active shadow pages is also changed to avoid
checking unrelated pages again and again.

Furthermore, the condition to see if we have any mmio sptes to clear is
changed so that we will not do flush for newly created slots.

With all these changes applied, the total amount of time needed to flush
shadow pages of a usual Linux guest, running Fedora with 4GB memory,
during a shutdown was reduced from 90ms to 60ms.

Furthermore, the total number of flushes needed to boot and shutdown
that guest was also reduced from 52 to 31.
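
The per-slot filtering described above can be sketched in user space roughly as follows. This is only an illustration under simplifying assumptions: struct shadow_page, its single-word slot_bitmap and the function zap_pages_for_slot() are hypothetical stand-ins, and the real code walks a linked list under mmu_lock rather than an array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a shadow page: one bit per memory slot. */
struct shadow_page {
	unsigned long slot_bitmap;
	bool zapped;
};

/*
 * Zap only the pages that have mappings into `slot`.
 * A slot of -1 means "zap everything". Returns the number of pages zapped.
 */
static size_t zap_pages_for_slot(struct shadow_page *pages, size_t n, int slot)
{
	size_t count = 0;
	size_t i;

	assert(slot < (int)(sizeof(unsigned long) * 8));
	for (i = 0; i < n; i++) {
		if (pages[i].zapped)
			continue;
		/* Skip pages that do not map anything in the target slot. */
		if (slot >= 0 && !(pages[i].slot_bitmap & (1ul << slot)))
			continue;
		pages[i].zapped = true;
		count++;
	}
	return count;
}
```

With four pages whose bitmaps are 0x1, 0x2, 0x3 and 0x0, zapping slot 1 touches only the second and third page, while a later call with slot -1 zaps the remaining two.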

Signed-off-by: Takuya Yoshikawa <yoshikawa.tak...@oss.ntt.co.jp>
Cc: Takuya Yoshikawa <takuya.yoshik...@gmail.com>
---
 [ Added cc to my gmail account because my address may change (only) a bit
   in a few months. ]

 rebased against next-candidates

 arch/ia64/kvm/kvm-ia64.c        |    2 +-
 arch/powerpc/kvm/powerpc.c      |    2 +-
 arch/s390/kvm/kvm-s390.c        |    2 +-
 arch/x86/include/asm/kvm_host.h |    2 +-
 arch/x86/kvm/mmu.c              |   22 ++
 arch/x86/kvm/x86.c              |   13 ++---
 include/linux/kvm_host.h        |    2 +-
 virt/kvm/kvm_main.c             |   15 ++-
 8 files changed, 39 insertions(+), 21 deletions(-)

diff --git a/arch/ia64/kvm/kvm-ia64.c b/arch/ia64/kvm/kvm-ia64.c
index 9d80ff8..360abe5 100644
--- a/arch/ia64/kvm/kvm-ia64.c
+++ b/arch/ia64/kvm/kvm-ia64.c
@@ -1626,7 +1626,7 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
return;
 }
 
-void kvm_arch_flush_shadow(struct kvm *kvm)
+void kvm_arch_flush_shadow(struct kvm *kvm, int slot)
 {
kvm_flush_remote_tlbs(kvm);
 }
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 58ad860..5680337 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -319,7 +319,7 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
 }
 
 
-void kvm_arch_flush_shadow(struct kvm *kvm)
+void kvm_arch_flush_shadow(struct kvm *kvm, int slot)
 {
 }
 
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index d30c835..8c25606 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -879,7 +879,7 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
return;
 }
 
-void kvm_arch_flush_shadow(struct kvm *kvm)
+void kvm_arch_flush_shadow(struct kvm *kvm, int slot)
 {
 }
 
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f624ca7..422f23a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -715,7 +715,7 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot);
 void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
                                      struct kvm_memory_slot *slot,
                                      gfn_t gfn_offset, unsigned long mask);
-void kvm_mmu_zap_all(struct kvm *kvm);
+void kvm_mmu_zap_all(struct kvm *kvm, int slot);
 unsigned int kvm_mmu_calculate_mmu_pages(struct kvm *kvm);
 void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned int kvm_nr_mmu_pages);
 
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 29ad6f9..a50f7ba 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3930,16 +3930,30 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
    kvm_flush_remote_tlbs(kvm);
 }
 
-void kvm_mmu_zap_all(struct kvm *kvm)
+/**
+ * kvm_mmu_zap_all - zap all shadows which have mappings into a given slot
+ * @kvm: the kvm instance
+ * @slot: id of the target slot
+ *
+ * If @slot is -1, zap all shadow pages.
+ */
+void kvm_mmu_zap_all(struct kvm *kvm, int slot)
 {
    struct kvm_mmu_page *sp, *node;
    LIST_HEAD(invalid_list);
+   int zapped;
 
    spin_lock(&kvm->mmu_lock);
 restart:
-   list_for_each_entry_safe(sp, node, &kvm->arch.active_mmu_pages, link)
-           if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
-                   goto restart;
+   zapped = 0;
+   list_for_each_entry_safe(sp, node, &kvm->arch.active_mmu_pages, link) {
+           if ((slot >= 0) && !test_bit(slot, sp->slot_bitmap))
+                   continue;
+
+           zapped |= kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
+   }
+   if (zapped)
+           goto restart;
 
    kvm_mmu_commit_zap_page(kvm, &invalid_list);
    spin_unlock(&kvm->mmu_lock);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0d9a578..eac378c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5038,7 +5038,7 @@ int emulator_fix_hypercall(struct x86_emulate_ctxt *ctxt)
 * to ensure that the updated hypercall appears atomically across all
 * VCPUs.
 */
-   kvm_mmu_zap_all(vcpu->kvm);
+   

Re: [PATCH] virtio_blk: Add help function to format mass of disks

2012-04-10 Thread Asias He

On 04/10/2012 03:28 PM, Ren Mingxin wrote:

> The current virtio block's naming algorithm just supports 18278
> (26^3 + 26^2 + 26) disks. If there are mass of virtio blocks,
> there will be disks with the same name.
>
> Based on commit 3e1a7ff8a0a7b948f2684930166954f9e8e776fe, I add
> function virtblk_name_format() for virtio block to support mass
> of disks naming.
>
> Signed-off-by: Ren Mingxin<re...@cn.fujitsu.com>


Makes sense to me.

Acked-by: Asias He <as...@redhat.com>


> ---
>  drivers/block/virtio_blk.c |   38 ++
>  1 files changed, 26 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> index c4a60ba..86516c8 100644
> --- a/drivers/block/virtio_blk.c
> +++ b/drivers/block/virtio_blk.c
> @@ -374,6 +374,31 @@ static int init_vq(struct virtio_blk *vblk)
>     return err;
>  }
>
> +static int virtblk_name_format(char *prefix, int index, char *buf, int buflen)
> +{
> +   const int base = 'z' - 'a' + 1;
> +   char *begin = buf + strlen(prefix);
> +   char *begin = buf + strlen(prefix);
> +   char *end = buf + buflen;
> +   char *p;
> +   int unit;
> +
> +   p = end - 1;
> +   *p = '\0';
> +   unit = base;
> +   do {
> +           if (p == begin)
> +                   return -EINVAL;
> +           *--p = 'a' + (index % unit);
> +           index = (index / unit) - 1;
> +   } while (index >= 0);
> +
> +   memmove(begin, p, end - p);
> +   memcpy(buf, prefix, strlen(prefix));
> +
> +   return 0;
> +}
> +
>  static int __devinit virtblk_probe(struct virtio_device *vdev)
>  {
>     struct virtio_blk *vblk;
> @@ -442,18 +467,7 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
>
>     q->queuedata = vblk;
>
> -   if (index < 26) {
> -           sprintf(vblk->disk->disk_name, "vd%c", 'a' + index % 26);
> -   } else if (index < (26 + 1) * 26) {
> -           sprintf(vblk->disk->disk_name, "vd%c%c",
> -                   'a' + index / 26 - 1, 'a' + index % 26);
> -   } else {
> -           const unsigned int m1 = (index / 26 - 1) / 26 - 1;
> -           const unsigned int m2 = (index / 26 - 1) % 26;
> -           const unsigned int m3 =  index % 26;
> -           sprintf(vblk->disk->disk_name, "vd%c%c%c",
> -                   'a' + m1, 'a' + m2, 'a' + m3);
> -   }
> +   virtblk_name_format("vd", index, vblk->disk->disk_name, DISK_NAME_LEN);
>
>     vblk->disk->major = major;
>     vblk->disk->first_minor = index_to_minor(index);



--
Asias


Re: [PATCH 0/2] adding tracepoints to vhost

2012-04-10 Thread Zhi Yong Wu
On Tue, Apr 10, 2012 at 8:42 PM, Michael S. Tsirkin <m...@redhat.com> wrote:
> On Tue, Apr 10, 2012 at 12:40:50PM +0100, Stefan Hajnoczi wrote:
>> On Tue, Apr 10, 2012 at 3:58 AM, Jason Wang <jasow...@redhat.com> wrote:
>> > To help in vhost analyzing, the following series adding basic tracepoints to
>> > vhost. Operations of both virtqueues and vhost works were traced in current
>> > implementation, net code were untouched. A top-like satistics displaying script
>> > were introduced to help the troubleshooting.
>> >
>> > TODO:
>> > - net specific tracepoints?
>> >
>> > ---
>> >
>> > Jason Wang (2):
>> >       vhost: basic tracepoints
>> >       tools: virtio: add a top-like utility for displaying vhost satistics
>> >
>> >
>> >  drivers/vhost/trace.h   |  153
>> >  drivers/vhost/vhost.c   |   17 ++
>> >  tools/virtio/vhost_stat |  360
>> >  3 files changed, 528 insertions(+), 2 deletions(-)
>> >  create mode 100644 drivers/vhost/trace.h
>> >  create mode 100755 tools/virtio/vhost_stat
>>
>> Perhaps this can replace the vhost log feature?  I'm not sure if
>> tracepoints support the right data types but it seems like vhost
>> debugging could be done using tracing with less code.
>>
>> Stefan
>
> vhost log is not a debugging tool, it logs memory accesses for
> migration.
Great. It would be much appreciated if there were some docs about this.

> _______________________________________________
> Virtualization mailing list
> virtualizat...@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/virtualization



-- 
Regards,

Zhi Yong Wu


Re: [PATCH] virtio_blk: Add help function to format mass of disks

2012-04-10 Thread Avi Kivity
On 04/10/2012 10:28 AM, Ren Mingxin wrote:
> The current virtio block's naming algorithm just supports 18278
> (26^3 + 26^2 + 26) disks. If there are mass of virtio blocks,
> there will be disks with the same name.
>
> Based on commit 3e1a7ff8a0a7b948f2684930166954f9e8e776fe, I add
> function virtblk_name_format() for virtio block to support mass
> of disks naming.
>
> Signed-off-by: Ren Mingxin <re...@cn.fujitsu.com>
> ---
>  drivers/block/virtio_blk.c |   38 ++
>  1 files changed, 26 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> index c4a60ba..86516c8 100644
> --- a/drivers/block/virtio_blk.c
> +++ b/drivers/block/virtio_blk.c
> @@ -374,6 +374,31 @@ static int init_vq(struct virtio_blk *vblk)
>     return err;
>  }
>
> +static int virtblk_name_format(char *prefix, int index, char *buf, int buflen)
> +{
> +   const int base = 'z' - 'a' + 1;
> +   char *begin = buf + strlen(prefix);
> +   char *begin = buf + strlen(prefix);

Duplicate line.

> +   char *end = buf + buflen;
> +   char *p;
> +   int unit;
> +
> +   p = end - 1;
> +   *p = '\0';
> +   unit = base;

Why not use 'base' below?  neither unit nor base change.

> +   do {
> +           if (p == begin)
> +                   return -EINVAL;
> +           *--p = 'a' + (index % unit);
> +           index = (index / unit) - 1;
> +   } while (index >= 0);
> +
> +   memmove(begin, p, end - p);
> +   memcpy(buf, prefix, strlen(prefix));
> +
> +   return 0;
> +}
> +


-- 
error compiling committee.c: too many arguments to function



[PATCHv0 dont apply] RFC: kvm eoi PV using shared memory

2012-04-10 Thread Michael S. Tsirkin
I took a stab at implementing PV EOI using shared memory.
This should reduce the number of exits an interrupt
causes by as much as half.

A partially complete draft for both host and guest parts
is below.

The idea is simple: there's a bit, per APIC, in guest memory,
that tells the guest that it does not need EOI.
We set it before injecting an interrupt and clear
before injecting a nested one. Guest tests it using
a test and clear operation - this is necessary
so that host can detect interrupt nesting -
and if set, it can skip the EOI MSR.

There's a new MSR to set the address of said register
in guest memory. Otherwise not much changes:
- Guest EOI is not required
- ISR is automatically cleared before injection

Some things are incomplete: add feature negotiation
options, qemu support for said options.
No testing was done beyond compiling the kernel.

I would appreciate early feedback.
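
The handshake described above can be modelled in user space roughly as follows. This is only a sketch under stated assumptions: struct pv_eoi, the function names and the eoi_writes counter are hypothetical and stand in for the shared guest memory word and the EOI MSR write; C11 atomics replace the kernel's test-and-clear bit operation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of the per-APIC word shared by host and guest. */
struct pv_eoi {
	atomic_ulong flag;        /* bit 0 set: guest may skip the EOI write */
	int pending_vector;       /* host-side record, -1 when none */
	unsigned int eoi_writes;  /* counts emulated EOI MSR writes (exits) */
};

static void pv_eoi_init(struct pv_eoi *e)
{
	atomic_init(&e->flag, 0);
	e->pending_vector = -1;
	e->eoi_writes = 0;
}

/* Host: set the bit before injecting a non-nested interrupt. */
static void host_inject(struct pv_eoi *e, int vector)
{
	e->pending_vector = vector;
	atomic_fetch_or(&e->flag, 1ul);
}

/* Host: clear the bit before injecting a nested interrupt. */
static void host_inject_nested(struct pv_eoi *e)
{
	atomic_fetch_and(&e->flag, ~1ul);
}

/*
 * Guest: test-and-clear the bit. Returns true if the EOI write was
 * skipped, false if the guest had to fall back to the (exiting) EOI.
 */
static bool guest_ack_irq(struct pv_eoi *e)
{
	if (atomic_fetch_and(&e->flag, ~1ul) & 1ul)
		return true;
	e->eoi_writes++;
	return false;
}
```

In this model a guest acknowledging right after host_inject() skips the EOI write, while an acknowledgement after host_inject_nested() (or a second acknowledgement without a new injection) falls back to the normal EOI path.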

Signed-off-by: Michael S. Tsirkin <m...@redhat.com>

--

diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index d854101..8430f41 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -457,8 +457,13 @@ static inline u32 safe_apic_wait_icr_idle(void) { return 0; }
 
 #endif /* CONFIG_X86_LOCAL_APIC */
 
+DECLARE_EARLY_PER_CPU(unsigned long, apic_eoi);
+
 static inline void ack_APIC_irq(void)
 {
+   if (__test_and_clear_bit(0, &__get_cpu_var(apic_eoi)))
+           return;
+
    /*
     * ack_APIC_irq() actually gets compiled as a single instruction
     * ... yummie.
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index e216ba0..0ee1472 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -481,6 +481,12 @@ struct kvm_vcpu_arch {
u64 length;
u64 status;
} osvw;
+
+   struct {
+   u64 msr_val;
+   struct gfn_to_hva_cache data;
+   int vector;
+   } eoi;
 };
 
 struct kvm_lpage_info {
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 734c376..e22b9f8 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -37,6 +37,8 @@
 #define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01
 #define MSR_KVM_ASYNC_PF_EN 0x4b564d02
 #define MSR_KVM_STEAL_TIME  0x4b564d03
+#define MSR_KVM_EOI_EN  0x4b564d04
+#define MSR_KVM_EOI_ENABLED 0x1
 
 struct kvm_steal_time {
__u64 steal;
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 11544d8..1b3f9fa 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -89,6 +89,9 @@ EXPORT_EARLY_PER_CPU_SYMBOL(x86_bios_cpu_apicid);
  */
 DEFINE_EARLY_PER_CPU(int, x86_cpu_to_logical_apicid, BAD_APICID);
 
+DEFINE_EARLY_PER_CPU(unsigned long, apic_eoi, 0);
+EXPORT_EARLY_PER_CPU_SYMBOL(apic_eoi);
+
 /*
  * Knob to control our willingness to enable the local APIC.
  *
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index b8ba6e4..8b50f3a 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -39,6 +39,7 @@
 #include <asm/desc.h>
 #include <asm/tlbflush.h>
 #include <asm/idle.h>
+#include <asm/apic.h>
 
 static int kvmapf = 1;
 
@@ -307,6 +308,9 @@ void __cpuinit kvm_guest_cpu_init(void)
                   smp_processor_id());
    }
 
+   wrmsrl(MSR_KVM_EOI_EN, __pa(this_cpu_ptr(&apic_eoi)) |
+          MSR_KVM_EOI_ENABLED);
+
    if (has_steal_clock)
            kvm_register_steal_time();
 }
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 8584322..9e38e12 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -265,7 +265,61 @@ int kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct kvm_lapic_irq *irq)
                    irq->level, irq->trig_mode);
 }
 
-static inline int apic_find_highest_isr(struct kvm_lapic *apic)
+static int eoi_put_user(struct kvm_vcpu *vcpu, u32 val)
+{
+
+   return kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.eoi.data, &val,
+                                 sizeof(val));
+}
+
+static int eoi_get_user(struct kvm_vcpu *vcpu, u32 *val)
+{
+
+   return kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.eoi.data, val,
+                                 sizeof(*val));
+}
+
+static inline bool eoi_enabled(struct kvm_vcpu *vcpu)
+{
+   return (vcpu->arch.eoi.msr_val & MSR_KVM_EOI_ENABLED);
+}
+
+static int eoi_get_pending_vector(struct kvm_vcpu *vcpu)
+{
+   u32 val;
+   if (eoi_get_user(vcpu, &val) < 0)
+           apic_debug("Can't read EOI MSR value: 0x%llx\n",
+                      (unsigned long long)vcpi->arch.eoi.msr_val);
+   if (!(val & 0x1))
+           vcpu->arch.eoi.vector = -1;
+   return vcpu->arch.eoi.vector;
+}
+
+static void eoi_set_pending_vector(struct kvm_vcpu *vcpu, int vector)
+{
+   BUG_ON(vcpu->arch.eoi.vector != -1);
+   if (eoi_put_user(vcpu, 0x1) < 0) {
+           apic_debug("Can't set EOI MSR value: 0x%llx\n",
+  (unsigned long 

Re: [PATCH] virtio_blk: Add help function to format mass of disks

2012-04-10 Thread Michael S. Tsirkin
On Tue, Apr 10, 2012 at 04:16:10PM +0300, Avi Kivity wrote:
 On 04/10/2012 10:28 AM, Ren Mingxin wrote:
  The current virtio block's naming algorithm just supports 18278
  (26^3 + 26^2 + 26) disks. If there are mass of virtio blocks,
  there will be disks with the same name.
 
  Based on commit 3e1a7ff8a0a7b948f2684930166954f9e8e776fe, I add
  function virtblk_name_format() for virtio block to support mass
  of disks naming.
 
  Signed-off-by: Ren Mingxin re...@cn.fujitsu.com
  ---
   drivers/block/virtio_blk.c |   38 ++
   1 files changed, 26 insertions(+), 12 deletions(-)
 
  diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
  index c4a60ba..86516c8 100644
  --- a/drivers/block/virtio_blk.c
  +++ b/drivers/block/virtio_blk.c
  @@ -374,6 +374,31 @@ static int init_vq(struct virtio_blk *vblk)
  return err;
   }
   
  +static int virtblk_name_format(char *prefix, int index, char *buf, int 
  buflen)
  +{
  +   const int base = 'z' - 'a' + 1;
  +   char *begin = buf + strlen(prefix);
  +   char *begin = buf + strlen(prefix);
 
 Duplicate line.
 
  +   char *end = buf + buflen;
  +   char *p;
  +   int unit;
  +
  +   p = end - 1;
  +   *p = '\0';
  +   unit = base;
 
 Why not use 'base' below?  neither unit nor base change.

Yes it's a bit strange, it was the same in Tejun's patch.
Tejun, any idea?

  +   do {
  +   if (p == begin)
  +   return -EINVAL;
  +   *--p = 'a' + (index % unit);
  +   index = (index / unit) - 1;
  +   } while (index >= 0);
  +
  +   memmove(begin, p, end - p);
  +   memcpy(buf, prefix, strlen(prefix));
  +
  +   return 0;
  +}
  +
 
 
 -- 
 error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM call agenda for April, Tuesday 10

2012-04-10 Thread Markus Armbruster
As there are no topics, call is cancelled.  Sorry for the late notice.


Re: [PATCHv0 dont apply] RFC: kvm eoi PV using shared memory

2012-04-10 Thread Avi Kivity
On 04/10/2012 04:27 PM, Michael S. Tsirkin wrote:
 I took a stab at implementing PV EOI using shared memory.
 This should reduce the number of exits an interrupt
 causes by as much as half.

 A partially complete draft for both host and guest parts
 is below.

 The idea is simple: there's a bit, per APIC, in guest memory,
 that tells the guest that it does not need EOI.
 We set it before injecting an interrupt and clear
 before injecting a nested one. Guest tests it using
 a test and clear operation - this is necessary
 so that host can detect interrupt nesting -
 and if set, it can skip the EOI MSR.
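
A minimal userspace sketch of the handshake described above, for readers who
want to see the protocol in isolation. It is an illustration under stated
assumptions, not the patch: struct pv_eoi_word, PV_EOI_SKIP and the three
helper names are hypothetical, and C11 atomics stand in for the single
test-and-clear instruction the guest would use on its per-CPU word.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-APIC word shared by host and guest. */
struct pv_eoi_word {
	atomic_ulong bits;
};

#define PV_EOI_SKIP 0x1UL	/* bit 0: "guest does not need to EOI" */

/* Host: set the bit before injecting an interrupt whose EOI may be skipped. */
static void host_set_skip_eoi(struct pv_eoi_word *w)
{
	atomic_fetch_or(&w->bits, PV_EOI_SKIP);
}

/*
 * Host: clear the bit before injecting a nested interrupt.
 * Returns true if the bit was still set, i.e. the guest has not yet
 * acknowledged the earlier interrupt.
 */
static bool host_clear_skip_eoi(struct pv_eoi_word *w)
{
	return atomic_fetch_and(&w->bits, ~PV_EOI_SKIP) & PV_EOI_SKIP;
}

/*
 * Guest: test and clear in one step when acknowledging an interrupt.
 * Returns true if a real EOI write is still required.
 */
static bool guest_ack_needs_eoi(struct pv_eoi_word *w)
{
	unsigned long old = atomic_fetch_and(&w->bits, ~PV_EOI_SKIP);

	return !(old & PV_EOI_SKIP);
}
```

Because the guest clears the bit in the same step that reads it, the host can
tell from the bit alone whether the earlier interrupt has been acknowledged
before it injects a nested one.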

 There's a new MSR to set the address of said register
 in guest memory. Otherwise not much changes:
 - Guest EOI is not required
 - ISR is automatically cleared before injection

 Some things are incomplete: add feature negotiation
 options, qemu support for said options.
 No testing was done beyond compiling the kernel.

 I would appreciate early feedback.

 Signed-off-by: Michael S. Tsirkin m...@redhat.com

 --

 diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
 index d854101..8430f41 100644
 --- a/arch/x86/include/asm/apic.h
 +++ b/arch/x86/include/asm/apic.h
 @@ -457,8 +457,13 @@ static inline u32 safe_apic_wait_icr_idle(void) { return 
 0; }
  
  #endif /* CONFIG_X86_LOCAL_APIC */
  
 +DECLARE_EARLY_PER_CPU(unsigned long, apic_eoi);
 +
  static inline void ack_APIC_irq(void)
  {
 + if (__test_and_clear_bit(0, __get_cpu_var(apic_eoi)))
 + return;
 +

While __test_and_clear_bit() is implemented in a single instruction,
it's not required to be.  Better have the instruction there explicitly.

   /*
* ack_APIC_irq() actually gets compiled as a single instruction
* ... yummie.
 diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
 index e216ba0..0ee1472 100644
 --- a/arch/x86/include/asm/kvm_host.h
 +++ b/arch/x86/include/asm/kvm_host.h
 @@ -481,6 +481,12 @@ struct kvm_vcpu_arch {
   u64 length;
   u64 status;
   } osvw;
 +
 + struct {
 + u64 msr_val;
 + struct gfn_to_hva_cache data;
 + int vector;
 + } eoi;
  };

Needs to be cleared on INIT.

  

 @@ -307,6 +308,9 @@ void __cpuinit kvm_guest_cpu_init(void)
  smp_processor_id());
   }
  
 + wrmsrl(MSR_KVM_EOI_EN, __pa(this_cpu_ptr(apic_eoi)) |
 +MSR_KVM_EOI_ENABLED);
 +

Clear on kexec.

   if (has_steal_clock)
   kvm_register_steal_time();
  }
 diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
 index 8584322..9e38e12 100644
 --- a/arch/x86/kvm/lapic.c
 +++ b/arch/x86/kvm/lapic.c
 @@ -265,7 +265,61 @@ int kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct 
 kvm_lapic_irq *irq)
   irq->level, irq->trig_mode);
  }
  
 -static inline int apic_find_highest_isr(struct kvm_lapic *apic)
 +static int eoi_put_user(struct kvm_vcpu *vcpu, u32 val)
 +{
 +
 + return kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.eoi.data, &val,
 +   sizeof(val));
 +}
 +
 +static int eoi_get_user(struct kvm_vcpu *vcpu, u32 *val)
 +{
 +
 + return kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.eoi.data, val,
 +   sizeof(*val));
 +}
 +
 +static inline bool eoi_enabled(struct kvm_vcpu *vcpu)
 +{
 + return (vcpu->arch.eoi.msr_val & MSR_KVM_EOI_ENABLED);
 +}
 +
 +static int eoi_get_pending_vector(struct kvm_vcpu *vcpu)
 +{
 + u32 val;
 + if (eoi_get_user(vcpu, &val) < 0)
 + apic_debug("Can't read EOI MSR value: 0x%llx\n",
 +(unsigned long long)vcpi->arch.eoi.msr_val);
 + if (!(val & 0x1))
 + vcpu->arch.eoi.vector = -1;
 + return vcpu->arch.eoi.vector;
 +}
 +
 +static void eoi_set_pending_vector(struct kvm_vcpu *vcpu, int vector)
 +{
 + BUG_ON(vcpu->arch.eoi.vector != -1);
 + if (eoi_put_user(vcpu, 0x1) < 0) {
 + apic_debug("Can't set EOI MSR value: 0x%llx\n",
 +(unsigned long long)vcpi->arch.eoi.msr_val);
 + return;
 + }
 + vcpu->arch.eoi.vector = vector;
 +}
 +
 +static int eoi_clr_pending_vector(struct kvm_vcpu *vcpu)
 +{
 + int vector;
 + vector = vcpu->arch.eoi.vector;
 + if (vector != -1 && eoi_put_user(vcpu, 0x0) < 0) {
 + apic_debug("Can't clear EOI MSR value: 0x%llx\n",
 +(unsigned long long)vcpi->arch.eoi.msr_val);
 + return -1;
 + }
 + vcpu->arch.eoi.vector = -1;
 + return vector;
 +}



 +
 +static inline int __apic_find_highest_isr(struct kvm_lapic *apic)
  {
   int result;
  
 @@ -275,6 +329,17 @@ static inline int apic_find_highest_isr(struct kvm_lapic 
 *apic)
   return result;
  }
  
 +static inline int apic_find_highest_isr(struct kvm_lapic *apic)
 +{
 + int vector;
 + if (eoi_enabled(apic->vcpu)) {
 + vector = eoi_get_pending_vector(apic->vcpu);
 + if (vector != -1)
 +  

Re: [PATCH 0/2] adding tracepoints to vhost

2012-04-10 Thread Michael S. Tsirkin
On Tue, Apr 10, 2012 at 09:10:48PM +0800, Zhi Yong Wu wrote:
  Perhaps this can replace the vhost log feature?  I'm not sure if
  tracepoints support the right data types but it seems like vhost
  debugging could be done using tracing with less code.
 
  Stefan
 
  vhost log is not a debugging tool, it logs memory accesses for
  migration.
 Great, it is very appreciated if there's some docs about this

About what? vhost logging? See the comment near the
definition of VHOST_SET_LOG_BASE in vhost.h

  ___
  Virtualization mailing list
  virtualizat...@lists.linux-foundation.org
  https://lists.linuxfoundation.org/mailman/listinfo/virtualization
 
 
 
 -- 
 Regards,
 
 Zhi Yong Wu


Re: [PATCHv0 dont apply] RFC: kvm eoi PV using shared memory

2012-04-10 Thread Avi Kivity
On 04/10/2012 05:26 PM, Michael S. Tsirkin wrote:
 u64 status;
 } osvw;
   +
   + struct {
   + u64 msr_val;
   + struct gfn_to_hva_cache data;
   + int vector;
   + } eoi;
};
  
  Needs to be cleared on INIT.

 You mean kvm_arch_vcpu_reset?

Yes, or kvm_lapic_reset().


  
   @@ -307,6 +308,9 @@ void __cpuinit kvm_guest_cpu_init(void)
smp_processor_id());
 }

   + wrmsrl(MSR_KVM_EOI_EN, __pa(this_cpu_ptr(apic_eoi)) |
   +MSR_KVM_EOI_ENABLED);
   +
  
  Clear on kexec.

 With register_reboot_notifier?

Yes, we already clear some kvm msrs there.


   - apic_set_vector(vector, apic->regs + APIC_ISR);
   + if (eoi_enabled(vcpu)) {
   + /* Anything pending? If yes disable eoi optimization. */
   + if (unlikely(apic_find_highest_isr(apic) >= 0)) {
   + int v = eoi_clr_pending_vector(vcpu);
  
  ISR != pending, that's IRR.  If ISR vector has lower priority than the
  new vector, then we don't need to disable eoi avoidance.

 Yes. But we can and it's easier than figuring out priorities.
 I am guessing such collisions are rare, right?

It's pretty easy, if there is something in IRR but
kvm_lapic_has_interrupt() returns -1, then we need to disable eoi avoidance.
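
A rough model of that check, assuming the usual local APIC priority rule: a
pending vector is deliverable only if its priority class (the top four bits
of the vector) is above both the TPR class and the class of the highest
in-service vector. Every name below is made up for illustration and none of
it is KVM code; deliverable_vector() only plays the role that the
"has interrupt" helper plays in the discussion.

```c
#include <stdbool.h>
#include <stdint.h>

#define NR_VECTORS 256

/* Hypothetical, heavily simplified local APIC state. */
struct apic_model {
	uint32_t irr[NR_VECTORS / 32];	/* requested, not yet delivered */
	uint32_t isr[NR_VECTORS / 32];	/* delivered, not yet EOI'd */
	uint8_t tpr;			/* task priority register */
};

static void set_vector(uint32_t *bitmap, int vec)
{
	bitmap[vec / 32] |= UINT32_C(1) << (vec % 32);
}

/* Highest set vector in a 256-bit bitmap, or -1 if it is empty. */
static int find_highest_vector(const uint32_t *bitmap)
{
	for (int word = NR_VECTORS / 32 - 1; word >= 0; word--) {
		if (!bitmap[word])
			continue;
		for (int bit = 31; bit >= 0; bit--)
			if (bitmap[word] & (UINT32_C(1) << bit))
				return word * 32 + bit;
	}
	return -1;
}

/* Processor priority class: the higher of the TPR and in-service classes. */
static int processor_priority(const struct apic_model *a)
{
	int isrv = find_highest_vector(a->isr);
	int isr_class = isrv < 0 ? 0 : (isrv & 0xf0);
	int tpr_class = a->tpr & 0xf0;

	return tpr_class > isr_class ? tpr_class : isr_class;
}

/* Highest pending vector that may be delivered now, or -1 if none. */
static int deliverable_vector(const struct apic_model *a)
{
	int irrv = find_highest_vector(a->irr);

	if (irrv < 0 || (irrv & 0xf0) <= processor_priority(a))
		return -1;
	return irrv;
}

/* Something is pending in IRR but nothing can be delivered right now. */
static bool must_disable_eoi_avoidance(const struct apic_model *a)
{
	return find_highest_vector(a->irr) >= 0 && deliverable_vector(a) < 0;
}
```

As I read the suggestion, the point is that a blocked pending interrupt can
only be delivered after the guest's EOI is seen by the host, so the EOI exit
cannot be skipped in that case.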

 I'll add a trace to make sure.

   + if (v != -1)
   + apic_set_vector(v, apic->regs + APIC_ISR);
   + } else {
   + eoi_set_pending_vector(vcpu, vector);
   + set_isr = false;
  
  Weird.  Just set it normally.  Remember that reading the ISR needs to
  return the correct value.

 Marcelo said linux does not normally read ISR - not true?

It's true and it's irrelevant.  We aren't coding a feature to what linux
does now, but for what linux or another guest may do in the future.

 Note this has no effect if the PV optimization is not enabled.

  We need to process the avoided EOI before any APIC read/writes, to be
  sure the guest sees the updated values.  Same for IOAPIC, EOI affects
  remote_irr.  That may mean we need to sample it after every exit, or
  perhaps disable the feature for level-triggered interrupts.

 Disabling would be very sad.  Can we sample on remote irr read?

That can be done from another vcpu.  Why do we care about
level-triggered interrupts?  Everything uses MSI or edge-triggered
IOAPIC interrupts these days.


-- 
error compiling committee.c: too many arguments to function



Re: [PATCHv0 dont apply] RFC: kvm eoi PV using shared memory

2012-04-10 Thread Michael S. Tsirkin
On Tue, Apr 10, 2012 at 05:33:18PM +0300, Avi Kivity wrote:
 On 04/10/2012 05:26 PM, Michael S. Tsirkin wrote:
u64 status;
} osvw;
+
+   struct {
+   u64 msr_val;
+   struct gfn_to_hva_cache data;
+   int vector;
+   } eoi;
 };
   
   Needs to be cleared on INIT.
 
  You mean kvm_arch_vcpu_reset?
 
 Yes, or kvm_lapic_reset().
 
 
   
@@ -307,6 +308,9 @@ void __cpuinit kvm_guest_cpu_init(void)
   smp_processor_id());
}
 
+   wrmsrl(MSR_KVM_EOI_EN, __pa(this_cpu_ptr(apic_eoi)) |
+  MSR_KVM_EOI_ENABLED);
+
   
   Clear on kexec.
 
  With register_reboot_notifier?
 
 Yes, we already clear some kvm msrs there.
 
 
-   apic_set_vector(vector, apic->regs + APIC_ISR);
+   if (eoi_enabled(vcpu)) {
+   /* Anything pending? If yes disable eoi optimization. */
+   if (unlikely(apic_find_highest_isr(apic) >= 0)) {
+   int v = eoi_clr_pending_vector(vcpu);
   
   ISR != pending, that's IRR.  If ISR vector has lower priority than the
   new vector, then we don't need to disable eoi avoidance.
 
  Yes. But we can and it's easier than figuring out priorities.
  I am guessing such collisions are rare, right?
 
 It's pretty easy, if there is something in IRR but
 kvm_lapic_has_interrupt() returns -1, then we need to disable eoi avoidance.

I only see kvm_apic_has_interrupt - is this what you mean?

  I'll add a trace to make sure.
 
+   if (v != -1)
+   apic_set_vector(v, apic->regs + APIC_ISR);
+   } else {
+   eoi_set_pending_vector(vcpu, vector);
+   set_isr = false;
   
   Weird.  Just set it normally.  Remember that reading the ISR needs to
   return the correct value.
 
  Marcelo said linux does not normally read ISR - not true?
 
 It's true and it's irrelevant.  We aren't coding a feature to what linux
 does now, but for what linux or another guest may do in the future.

Right. So you think reading ISR has value
in combination with PV EOI for future guests?
I'm not arguing either way just curious.

  Note this has no effect if the PV optimization is not enabled.
 
   We need to process the avoided EOI before any APIC read/writes, to be
   sure the guest sees the updated values.  Same for IOAPIC, EOI affects
   remote_irr.  That may mean we need to sample it after every exit, or
   perhaps disable the feature for level-triggered interrupts.
 
  Disabling would be very sad.  Can we sample on remote irr read?
 
 That can be done from another vcpu.

We still can handle it, right? Where's the code that handles that read?

 Why do we care about
 level-triggered interrupts?  Everything uses MSI or edge-triggered
 IOAPIC interrupts these days.

Well lots of emulated devices don't yet.
They probably should but it's nice to be able to
test with e.g. e1000 emulation not just virtio.

Besides, kvm_get_apic_interrupt
simply doesn't know about the triggering mode at the moment.

-- 
MST


[PATCH 1/2] virt tests: add dd_test

2012-04-10 Thread Lukas Doktor
This patch adds dd_test which tries simple read/write from/to an attached
disk. The purpose is to test readonly vs. readwrite subsystem.

The dd_test.py is highly parametrizable so it can be used in other
tests by changing parameters.

It also adds another image parameter image_readonly (bool) which is
required for this test.

Signed-off-by: Lukas Doktor ldok...@redhat.com
---
 client/virt/kvm_vm.py   |6 ++-
 client/virt/subtests.cfg.sample |   35 +
 client/virt/tests/dd_test.py|  105 +++
 3 files changed, 144 insertions(+), 2 deletions(-)
 create mode 100644 client/virt/tests/dd_test.py

diff --git a/client/virt/kvm_vm.py b/client/virt/kvm_vm.py
index 80ea453..cbc0494 100644
--- a/client/virt/kvm_vm.py
+++ b/client/virt/kvm_vm.py
@@ -310,7 +310,7 @@ class VM(virt_vm.BaseVM):
   boot=False, blkdebug=None, bus=None, port=None,
   bootindex=None, removable=None, min_io_size=None,
   opt_io_size=None, physical_block_size=None,
-  logical_block_size=None):
+  logical_block_size=None, readonly=False):
 name = None
 dev = 
 if format == ahci:
@@ -362,6 +362,7 @@ class VM(virt_vm.BaseVM):
 cmd += _add_option(snapshot, snapshot, bool)
 cmd += _add_option(boot, boot, bool)
 cmd += _add_option(id, name)
+cmd += _add_option(readonly, readonly, bool)
 return cmd + dev
 
 def add_nic(help, vlan, model=None, mac=None, device_id=None, 
netdev_id=None,
@@ -758,7 +759,8 @@ class VM(virt_vm.BaseVM):
 image_params.get(min_io_size),
 image_params.get(opt_io_size),
 image_params.get(physical_block_size),
-image_params.get(logical_block_size))
+image_params.get(logical_block_size),
+image_params.get(image_readonly))
 
 redirs = []
 for redir_name in params.objects(redirs):
diff --git a/client/virt/subtests.cfg.sample b/client/virt/subtests.cfg.sample
index 2192840..ec9fbc3 100644
--- a/client/virt/subtests.cfg.sample
+++ b/client/virt/subtests.cfg.sample
@@ -398,6 +398,41 @@ variants:
 create_image_stg = yes
 image_size_stg = 10M
 
+    - dd_test: install setup image_copy unattended_install.cdrom
+        type = dd_test
+        images += " stg1"
+        image_name_stg1 = sgt1
+        image_size_stg1 = 1M
+        image_snapshot_stg1 = no
+        drive_index_stg1 = 3
+        dd_count = 1
+        # last input and output disk
+        dd_if_select = -1
+        dd_of_select = -1
+        variants:
+            - readwrite:
+                dd_stat = 0
+                variants:
+                    - zero2disk:
+                        dd_if = ZERO
+                        dd_of = /dev/[shv]d?
+                    - disk2null:
+                        dd_if = /dev/[shv]d?
+                        dd_of = NULL
+            - readonly:
+                # ide, ahci don't support readonly disks
+                no ide, ahci
+                image_readonly_stg1 = yes
+                variants:
+                    - zero2disk:
+                        dd_if = ZERO
+                        dd_of = /dev/[shv]d?
+                        dd_stat = 1
+                    - disk2null:
+                        dd_if = /dev/[shv]d?
+                        dd_of = NULL
+                        dd_stat = 0
+
 
 - virsh_migrate: install setup image_copy unattended_install.cdrom
 type = virsh_migrate
diff --git a/client/virt/tests/dd_test.py b/client/virt/tests/dd_test.py
new file mode 100644
index 000..48d32b5
--- /dev/null
+++ b/client/virt/tests/dd_test.py
@@ -0,0 +1,105 @@
+"""
+Configurable on-guest dd test.
+@author: Lukas Doktor ldok...@redhat.com
+@copyright: 2012 Red Hat, Inc.
+"""
+import logging
+from autotest_lib.client.common_lib import error
+from autotest_lib.client.virt.aexpect import ShellCmdError
+from autotest_lib.client.virt.aexpect import ShellTimeoutError
+
+
+def run_dd_test(test, params, env):
+    """
+    Executes dd with defined parameters and checks the return number and output
+    """
+    def _get_file(filename, select):
+        """ Picks the actual file based on select value """
+        if filename == "NULL":
+            return "/dev/null"
+        elif filename == "ZERO":
+            return "/dev/zero"
+        elif filename == "RANDOM":
+            return "/dev/random"
+        elif filename == "URANDOM":
+            return "/dev/urandom"
+        else:
+            # get all matching filenames
+            try:
+                disks = sorted(session.cmd("ls -1d %s" % filename).split('\n'))
+            except ShellCmdError:   # No matching file (creating new?)
+                disks = [filename]
+            if disks[-1] == '':
+                disks = disks[:-1]
+            try:
+   

[PATCH 2/2] virt: Fix usb_stick block device subsystem

2012-04-10 Thread Lukas Doktor
* Add ehci controller when usbstick is selected
* Add default number of usb_max_port

With those mini-changes it's possible to run every test with usb_stick
block device without further test/cfg modifications.

Signed-off-by: Lukas Doktor ldok...@redhat.com
---
 client/virt/guest-hw.cfg.sample |2 ++
 client/virt/kvm_vm.py   |2 +-
 2 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/client/virt/guest-hw.cfg.sample b/client/virt/guest-hw.cfg.sample
index 0729117..655ac9b 100644
--- a/client/virt/guest-hw.cfg.sample
+++ b/client/virt/guest-hw.cfg.sample
@@ -71,6 +71,8 @@ variants:
 cd_format=ahci
 - usb_stick:
 drive_format=usb2
+usbs +=  default-ehci
+usb_type_default-ehci = usb-ehci
 - usb_cdrom:
 cd_format=usb2
 - xenblk:
diff --git a/client/virt/kvm_vm.py b/client/virt/kvm_vm.py
index cbc0494..32d7330 100644
--- a/client/virt/kvm_vm.py
+++ b/client/virt/kvm_vm.py
@@ -238,7 +238,7 @@ class VM(virt_vm.BaseVM):
 usb_dev = self.usb_dev_dict.get(usb)
 
 controller = usb
-max_port = int(usb_params.get("usb_max_port"))
+max_port = int(usb_params.get("usb_max_port", 6))
 if len(usb_dev) < max_port:
 bus = "%s.0" % usb
 self.usb_dev_dict[usb].append(dev)
-- 
1.7.7.6



Re: [PATCH 0/2] adding tracepoints to vhost

2012-04-10 Thread Zhi Yong Wu
On Tue, Apr 10, 2012 at 9:45 PM, Michael S. Tsirkin m...@redhat.com wrote:
 On Tue, Apr 10, 2012 at 09:10:48PM +0800, Zhi Yong Wu wrote:
  Perhaps this can replace the vhost log feature?  I'm not sure if
  tracepoints support the right data types but it seems like vhost
  debugging could be done using tracing with less code.
 
  Stefan
 
  vhost log is not a debugging tool, it logs memory accesses for
  migration.
 Great, it is very appreciated if there's some docs about this

 About what? vhost logging? See the comment near the
Yeah, thanks
 definition of VHOST_SET_LOG_BASE in vhost.h

  ___
  Virtualization mailing list
  virtualizat...@lists.linux-foundation.org
  https://lists.linuxfoundation.org/mailman/listinfo/virtualization



 --
 Regards,

 Zhi Yong Wu



-- 
Regards,

Zhi Yong Wu


Re: [PATCHv0 dont apply] RFC: kvm eoi PV using shared memory

2012-04-10 Thread Michael S. Tsirkin
On Tue, Apr 10, 2012 at 05:03:22PM +0300, Avi Kivity wrote:
 On 04/10/2012 04:27 PM, Michael S. Tsirkin wrote:
  I took a stab at implementing PV EOI using shared memory.
  This should reduce the number of exits an interrupt
  causes by as much as half.
 
  A partially complete draft for both host and guest parts
  is below.
 
  The idea is simple: there's a bit, per APIC, in guest memory,
  that tells the guest that it does not need EOI.
  We set it before injecting an interrupt and clear
  before injecting a nested one. Guest tests it using
  a test and clear operation - this is necessary
  so that host can detect interrupt nesting -
  and if set, it can skip the EOI MSR.
 
  There's a new MSR to set the address of said register
  in guest memory. Otherwise not much changes:
  - Guest EOI is not required
  - ISR is automatically cleared before injection
 
  Some things are incomplete: add feature negotiation
  options, qemu support for said options.
  No testing was done beyond compiling the kernel.
 
  I would appreciate early feedback.
 
  Signed-off-by: Michael S. Tsirkin m...@redhat.com
 
  --
 
  diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
  index d854101..8430f41 100644
  --- a/arch/x86/include/asm/apic.h
  +++ b/arch/x86/include/asm/apic.h
  @@ -457,8 +457,13 @@ static inline u32 safe_apic_wait_icr_idle(void) { 
  return 0; }
   
   #endif /* CONFIG_X86_LOCAL_APIC */
   
  +DECLARE_EARLY_PER_CPU(unsigned long, apic_eoi);
  +
   static inline void ack_APIC_irq(void)
   {
  +   if (__test_and_clear_bit(0, __get_cpu_var(apic_eoi)))
  +   return;
  +
 
 While __test_and_clear_bit() is implemented in a single instruction,
 it's not required to be.  Better have the instruction there explicitly.
 
  /*
   * ack_APIC_irq() actually gets compiled as a single instruction
   * ... yummie.
  diff --git a/arch/x86/include/asm/kvm_host.h 
  b/arch/x86/include/asm/kvm_host.h
  index e216ba0..0ee1472 100644
  --- a/arch/x86/include/asm/kvm_host.h
  +++ b/arch/x86/include/asm/kvm_host.h
  @@ -481,6 +481,12 @@ struct kvm_vcpu_arch {
  u64 length;
  u64 status;
  } osvw;
  +
  +   struct {
  +   u64 msr_val;
  +   struct gfn_to_hva_cache data;
  +   int vector;
  +   } eoi;
   };
 
 Needs to be cleared on INIT.

You mean kvm_arch_vcpu_reset?

   
 
  @@ -307,6 +308,9 @@ void __cpuinit kvm_guest_cpu_init(void)
 smp_processor_id());
  }
   
  +   wrmsrl(MSR_KVM_EOI_EN, __pa(this_cpu_ptr(apic_eoi)) |
  +  MSR_KVM_EOI_ENABLED);
  +
 
 Clear on kexec.

With register_reboot_notifier?

  if (has_steal_clock)
  kvm_register_steal_time();
   }
  diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
  index 8584322..9e38e12 100644
  --- a/arch/x86/kvm/lapic.c
  +++ b/arch/x86/kvm/lapic.c
  @@ -265,7 +265,61 @@ int kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct 
  kvm_lapic_irq *irq)
  irq->level, irq->trig_mode);
   }
   
  -static inline int apic_find_highest_isr(struct kvm_lapic *apic)
  +static int eoi_put_user(struct kvm_vcpu *vcpu, u32 val)
  +{
  +
  +   return kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.eoi.data, &val,
  + sizeof(val));
  +}
  +
  +static int eoi_get_user(struct kvm_vcpu *vcpu, u32 *val)
  +{
  +
  +   return kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.eoi.data, val,
  + sizeof(*val));
  +}
  +
  +static inline bool eoi_enabled(struct kvm_vcpu *vcpu)
  +{
  +   return (vcpu->arch.eoi.msr_val & MSR_KVM_EOI_ENABLED);
  +}
  +
  +static int eoi_get_pending_vector(struct kvm_vcpu *vcpu)
  +{
  +   u32 val;
  +   if (eoi_get_user(vcpu, &val) < 0)
  +   apic_debug("Can't read EOI MSR value: 0x%llx\n",
  +  (unsigned long long)vcpi->arch.eoi.msr_val);
  +   if (!(val & 0x1))
  +   vcpu->arch.eoi.vector = -1;
  +   return vcpu->arch.eoi.vector;
  +}
  +
  +static void eoi_set_pending_vector(struct kvm_vcpu *vcpu, int vector)
  +{
  +   BUG_ON(vcpu->arch.eoi.vector != -1);
  +   if (eoi_put_user(vcpu, 0x1) < 0) {
  +   apic_debug("Can't set EOI MSR value: 0x%llx\n",
  +  (unsigned long long)vcpi->arch.eoi.msr_val);
  +   return;
  +   }
  +   vcpu->arch.eoi.vector = vector;
  +}
  +
  +static int eoi_clr_pending_vector(struct kvm_vcpu *vcpu)
  +{
  +   int vector;
  +   vector = vcpu->arch.eoi.vector;
  +   if (vector != -1 && eoi_put_user(vcpu, 0x0) < 0) {
  +   apic_debug("Can't clear EOI MSR value: 0x%llx\n",
  +  (unsigned long long)vcpi->arch.eoi.msr_val);
  +   return -1;
  +   }
  +   vcpu->arch.eoi.vector = -1;
  +   return vector;
  +}
 
 
 
  +
  +static inline int __apic_find_highest_isr(struct kvm_lapic *apic)
   {
  int result;
   
  @@ -275,6 +329,17 @@ static inline int apic_find_highest_isr(struct 
  kvm_lapic *apic)
  return result;
   }
   
  +static inline int 

Re: [PATCHv0 dont apply] RFC: kvm eoi PV using shared memory

2012-04-10 Thread Avi Kivity
On 04/10/2012 05:53 PM, Michael S. Tsirkin wrote:
  
   Yes. But we can and it's easier than figuring out priorities.
   I am guessing such collisions are rare, right?
  
  It's pretty easy, if there is something in IRR but
  kvm_lapic_has_interrupt() returns -1, then we need to disable eoi avoidance.

 I only see kvm_apic_has_interrupt - is this what you mean?

Yes, sorry.

It's not clear whether to do the check in kvm_apic_has_interrupt() or
kvm_apic_get_interrupt() - the latter is called only after interrupts
are enabled, so it looks like a better place (EOIs while interrupts are
disabled have no effect).  But need to make sure those functions are
actually called, since they're protected by KVM_REQ_EVENT.

   I'll add a trace to make sure.
  
 + if (v != -1)
 + apic_set_vector(v, apic->regs + APIC_ISR);
 + } else {
 + eoi_set_pending_vector(vcpu, vector);
 + set_isr = false;

Weird.  Just set it normally.  Remember that reading the ISR needs to
return the correct value.
  
   Marcelo said linux does not normally read ISR - not true?
  
  It's true and it's irrelevant.  We aren't coding a feature to what linux
  does now, but for what linux or another guest may do in the future.

 Right. So you think reading ISR has value
 in combination with PV EOI for future guests?
 I'm not arguing either way just curious.

I don't.  But we need to preserve the same interface the APIC has
presented for thousands of years (well, almost).


   Note this has no effect if the PV optimization is not enabled.
  
We need to process the avoided EOI before any APIC read/writes, to be
sure the guest sees the updated values.  Same for IOAPIC, EOI affects
remote_irr.  That may mean we need to sample it after every exit, or
perhaps disable the feature for level-triggered interrupts.
  
   Disabling would be very sad.  Can we sample on remote irr read?
  
  That can be done from another vcpu.

 We still can handle it, right? Where's the code that handles that read?

Better to keep everything per-cpu.  The code is in virt/kvm/ioapic.c


  Why do we care about
  level-triggered interrupts?  Everything uses MSI or edge-triggered
  IOAPIC interrupts these days.

 Well lots of emulated devices don't yet.
 They probably should but it's nice to be able to
 test with e.g. e1000 emulation not just virtio.


e1000 doesn't support msi?


 Besides, kvm_get_apic_interrupt
 simply doesn't know about the triggering mode at the moment.



-- 
error compiling committee.c: too many arguments to function



Re: [PATCHv0 dont apply] RFC: kvm eoi PV using shared memory

2012-04-10 Thread Michael S. Tsirkin
On Tue, Apr 10, 2012 at 06:00:51PM +0300, Avi Kivity wrote:
 On 04/10/2012 05:53 PM, Michael S. Tsirkin wrote:
   
Yes. But we can and it's easier than figuring out priorities.
I am guessing such collisions are rare, right?
   
   It's pretty easy, if there is something in IRR but
   kvm_lapic_has_interrupt() returns -1, then we need to disable eoi 
   avoidance.
 
  I only see kvm_apic_has_interrupt - is this what you mean?
 
 Yes, sorry.
 
 It's not clear whether to do the check in kvm_apic_has_interrupt() or
 kvm_apic_get_interrupt() - the latter is called only after interrupts
 are enabled, so it looks like a better place (EOIs while interrupts are
 disabled have no effect).  But need to make sure those functions are
 actually called, since they're protected by KVM_REQ_EVENT.

Sorry not sure what you mean by make sure - read the code carefully?

I'll add a trace to make sure.
   
  +   if (v != -1)
  +   apic_set_vector(v, apic->regs + APIC_ISR);
  +   } else {
  +   eoi_set_pending_vector(vcpu, vector);
  +   set_isr = false;
 
 Weird.  Just set it normally.  Remember that reading the ISR needs to
 return the correct value.
   
Marcelo said linux does not normally read ISR - not true?
   
   It's true and it's irrelevant.  We aren't coding a feature to what linux
   does now, but for what linux or another guest may do in the future.
 
  Right. So you think reading ISR has value
  in combination with PV EOI for future guests?
  I'm not arguing either way just curious.
 
 I don't.  But we need to preserve the same interface the APIC has
 presented for thousands of years (well, almost).


Talk about overstatements :)

 
Note this has no effect if the PV optimization is not enabled.
   
 We need to process the avoided EOI before any APIC read/writes, to be
 sure the guest sees the updated values.  Same for IOAPIC, EOI affects
 remote_irr.  That may mean we need to sample it after every exit, or
 perhaps disable the feature for level-triggered interrupts.
   
Disabling would be very sad.  Can we sample on remote irr read?
   
   That can be done from another vcpu.
 
  We still can handle it, right? Where's the code that handles that read?
 
 Better to keep everything per-cpu.  The code is in virt/kvm/ioapic.c

Hmm. Disabling for level handles the ack notifiers
issue as well, which I forgot about.
It's a tough call. You think looking at
TMR in kvm_get_apic_interrupt is safe?

 
   Why do we care about
   level-triggered interrupts?  Everything uses MSI or edge-triggered
   IOAPIC interrupts these days.
 
  Well lots of emulated devices don't yet.
  They probably should but it's nice to be able to
  test with e.g. e1000 emulation not just virtio.
 
 
 e1000 doesn't support msi?

qemu emulation doesn't.

 
  Besides, kvm_get_apic_interrupt
  simply doesn't know about the triggering mode at the moment.
 
 
 
 -- 
 error compiling committee.c: too many arguments to function


Re: [PATCH] virtio_blk: Add help function to format mass of disks

2012-04-10 Thread Tejun Heo
Hello, guys.

On Tue, Apr 10, 2012 at 04:34:06PM +0300, Michael S. Tsirkin wrote:
  Why not use 'base' below?  neither unit nor base change.
 
 Yes it's a bit strange, it was the same in Tejun's patch.
 Tejun, any idea?

It was years ago, so I don't recall much.  I think I wanted to use a
variable name which signifies its role - I worked out the rather
convoluted base number logic on paper first and I probably wanted to
keep the distinctions.  I don't think it really matters at this point
tho.  Just make sure those functions are marked deprecated so that no
one else copies them.
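
For anyone reading the thread later, here is a minimal userspace sketch of
the base-26 logic under discussion. It follows the loop in the patch but is
only an illustration: disk_name_format() is a hypothetical name, the extra
argument checks are my own additions, and a single 'base' constant is used
where the patch keeps both 'base' and 'unit'.

```c
#include <errno.h>
#include <string.h>

/*
 * Format "<prefix><letters>" for a zero-based disk index:
 * 0 -> "a", 25 -> "z", 26 -> "aa", 701 -> "zz", 702 -> "aaa", ...
 * Returns 0 on success, -EINVAL if the name does not fit in buf.
 */
static int disk_name_format(const char *prefix, int index, char *buf, int buflen)
{
	const int base = 'z' - 'a' + 1;	/* 26 letters */
	size_t plen = strlen(prefix);
	char *begin, *end, *p;

	if (index < 0 || buflen <= 0 || plen >= (size_t)buflen)
		return -EINVAL;

	begin = buf + plen;
	end = buf + buflen;
	p = end - 1;
	*p = '\0';

	/* Emit letters right to left, least significant first. */
	do {
		if (p == begin)
			return -EINVAL;	/* suffix would overrun the prefix */
		*--p = 'a' + (index % base);
		/* The "- 1" is what makes "aa" follow "z" rather than "ba". */
		index = (index / base) - 1;
	} while (index >= 0);

	/* Slide the suffix, including its NUL, up against the prefix. */
	memmove(begin, p, end - p);
	memcpy(buf, prefix, plen);
	return 0;
}
```

With a 32-byte buffer and the prefix "vd", index 18277 gives "vdzzz" and
index 18278 gives "vdaaaa", which is exactly the boundary (26^3 + 26^2 + 26)
where the old three-letter scheme ran out of names.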

Thanks.

-- 
tejun


Re: [PATCH] virtio_blk: Add help function to format mass of disks

2012-04-10 Thread Michael S. Tsirkin
On Tue, Apr 10, 2012 at 08:49:43AM -0700, Tejun Heo wrote:
 Hello, guys.
 
 On Tue, Apr 10, 2012 at 04:34:06PM +0300, Michael S. Tsirkin wrote:
   Why not use 'base' below?  neither unit nor base change.
  
  Yes it's a bit strange, it was the same in Tejun's patch.
  Tejun, any idea?
 
 It was years ago, so I don't recall much.  I think I wanted to use a
 variable name which signifies its role - I worked out the rather
 convoluted base number logic on paper first and I probably wanted to
 keep the distinctions.  I don't think it really matters at this point
 tho.  Just make sure those functions are marked deprecated so that no
 one else copies them.
 
 Thanks.

I guess I'll keep it the same so it's easier to deduplicate
if someone wants to.

 -- 
 tejun


Re: [PATCHv0 dont apply] RFC: kvm eoi PV using shared memory

2012-04-10 Thread Avi Kivity
On 04/10/2012 06:14 PM, Michael S. Tsirkin wrote:
 On Tue, Apr 10, 2012 at 06:00:51PM +0300, Avi Kivity wrote:
  On 04/10/2012 05:53 PM, Michael S. Tsirkin wrote:

 Yes. But we can and it's easier than figuring out priorities.
 I am guessing such collisions are rare, right?

It's pretty easy, if there is something in IRR but
kvm_lapic_has_interrupt() returns -1, then we need to disable eoi 
avoidance.
  
   I only see kvm_apic_has_interrupt - is this what you mean?
  
  Yes, sorry.
  
  It's not clear whether to do the check in kvm_apic_has_interrupt() or
  kvm_apic_get_interrupt() - the latter is called only after interrupts
  are enabled, so it looks like a better place (EOIs while interrupts are
  disabled have no effect).  But need to make sure those functions are
  actually called, since they're protected by KVM_REQ_EVENT.

 Sorry not sure what you mean by make sure - read the code carefully?

Yes.  And I mean, get called at the right time.

  
  Better to keep everything per-cpu.  The code is in virt/kvm/ioapic.c

 Hmm. Disabling for level handles the ack notifiers
 issue as well, which I forgot about.
 It's a tough call. You think looking at
 TMR in kvm_get_apic_interrupt is safe?

Yes, it's read only from the guest point of view IIRC.

  
Why do we care about
level-triggered interrupts?  Everything uses MSI or edge-triggered
IOAPIC interrupts these days.
  
   Well lots of emulated devices don't yet.
   They probably should but it's nice to be able to
   test with e.g. e1000 emulation not just virtio.
  
  
  e1000 doesn't support msi?

 qemu emulation doesn't.


Can be changed if someone's really interested.  But really, avoiding
EOIs for e1000 won't help it much.

-- 
error compiling committee.c: too many arguments to function
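
The IRR check discussed above ("something in IRR but the has-interrupt query returns -1") can be sketched as a small user-space model. Everything here is hypothetical and simplified: struct lapic_model and the model_* helpers are invented for illustration, and the priority rule is the usual local APIC one (a pending vector is deliverable only if its priority class is above that of the task priority and of the highest in-service vector). It is not the KVM implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the relevant local APIC state: 256-bit IRR/ISR. */
struct lapic_model {
    uint32_t irr[8];    /* requested (pending) vectors */
    uint32_t isr[8];    /* in-service vectors */
    uint8_t tpr;        /* task priority register */
};

static void set_vector(uint32_t reg[8], int vector)
{
    reg[vector / 32] |= 1u << (vector % 32);
}

/* Highest set vector in a 256-bit register, or -1 if it is empty. */
static int find_highest_vector(const uint32_t reg[8])
{
    for (int word = 7; word >= 0; word--)
        for (int bit = 31; bit >= 0; bit--)
            if (reg[word] & (1u << bit))
                return word * 32 + bit;
    return -1;
}

/* Processor priority class: the higher of the TPR and highest-ISR classes. */
static uint8_t model_ppr_class(const struct lapic_model *apic)
{
    int isrv = find_highest_vector(apic->isr);
    uint8_t isr_class = isrv < 0 ? 0 : (uint8_t)(isrv & 0xf0);
    uint8_t tpr_class = apic->tpr & 0xf0;

    return tpr_class >= isr_class ? tpr_class : isr_class;
}

/* Returns the deliverable vector, or -1 if nothing can be delivered now. */
static int model_has_interrupt(const struct lapic_model *apic)
{
    int highest_irr = find_highest_vector(apic->irr);

    if (highest_irr < 0 || (highest_irr & 0xf0) <= model_ppr_class(apic))
        return -1;
    return highest_irr;
}

/*
 * Something is pending but blocked by priority: the guest's next EOI must
 * reach the host so the blocked vector can be injected afterwards.
 */
static bool must_disable_eoi_avoidance(const struct lapic_model *apic)
{
    return find_highest_vector(apic->irr) >= 0 &&
           model_has_interrupt(apic) == -1;
}
```

In this model, a pending vector 0x31 behind an in-service 0x51 is blocked, so EOI avoidance has to be turned off; a pending 0x51 behind an in-service 0x31 is deliverable and needs no special handling.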



Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-10 Thread Alex Williamson
On Mon, 2012-04-09 at 11:35 +0300, Avi Kivity wrote:
 On 04/08/2012 08:37 PM, Jan Kiszka wrote:
  The core problem is not the ordering. The problem is that the kernel is
  susceptible to ordering mistakes of userspace. And that is because the
  kernel panics on PCI errors of devices that are in user hands - a
  critical kernel bug IMHO. 
 
 Certainly.  But this userspace patch won't fix it.

No, it won't in general and I don't think it makes sense to mangle
pci-sysfs config space support to the nuances of a user space driver.
We really need a userspace driver interface that limits the config space
interactions and provides a channel to support error reporting and
userspace recovery.  This type of thing can be done with VFIO if we
could ever get off the ground and get some consensus around it.  Please
feel free to contribute to that discussion if you ever want to get away
from this clunky device assignment interface we have now.

  Proper reset of MSI or even the whole PCI
  config space is another issue, but one the kernel should not worry about
  - still, it should be fixed (therefore this patch).
 
 And I was asking what is the right way to do it.  Reset the device and
 read back the register values, or do an emulated reset and push down the
 register values.

Reading back the register values is currently a noop since the kernel
restores the config space to the incoming state after reset.  KVM does
stash away the original config space of the device to be restored prior
to releasing the device.  We could restore to that each time, but that
would mean implementing a device reset ioctl in kvm, and we'd still need
this patch for compatibility and we still have the issues Michael brings
up with the config restore updating things like MSI that we need to then
manually sync with kvm.  I fear suggesting it, but perhaps another way
to achieve this result would be to de-assign and re-assign the device in
reset.

  But even if we disallowed userland to disable MMIO and PIO access to the
  device, we would not be able to exclude that there are secret channels
  in the device's interface having the same effect. So we likely need to
  enhance PCI error handling to catch and handle faults for certain
  devices differently - those we cannot trust to behave properly while
  they are under userland/guest control.
 
 Why not all of them?

I think Jan is probably suggesting that we do need user space error
handling for all userland/guest controlled devices, but some classes of
errors on certain devices may be handled automatically by the userspace
interface layer... which we could do with VFIO (well, assuming the APEI
spec lets us NAK the BIOS reporting a fatal error).  So do we want to
invent new solutions for each of these or do we want to move to a new
interface?  Thanks,

Alex



Re: [PATCHv0 dont apply] RFC: kvm eoi PV using shared memory

2012-04-10 Thread Michael S. Tsirkin
On Tue, Apr 10, 2012 at 07:08:26PM +0300, Avi Kivity wrote:
 On 04/10/2012 06:14 PM, Michael S. Tsirkin wrote:
  On Tue, Apr 10, 2012 at 06:00:51PM +0300, Avi Kivity wrote:
   On 04/10/2012 05:53 PM, Michael S. Tsirkin wrote:
 
  Yes. But we can and it's easier than figuring out priorities.
  I am guessing such collisions are rare, right?
 
 It's pretty easy, if there is something in IRR but
 kvm_lapic_has_interrupt() returns -1, then we need to disable eoi 
 avoidance.
   
I only see kvm_apic_has_interrupt - is this what you mean?
   
   Yes, sorry.
   
   It's not clear whether to do the check in kvm_apic_has_interrupt() or
   kvm_apic_get_interrupt() - the latter is called only after interrupts
   are enabled, so it looks like a better place (EOIs while interrupts are
   disabled have no effect).  But need to make sure those functions are
   actually called, since they're protected by KVM_REQ_EVENT.
 
  Sorry not sure what you mean by make sure - read the code carefully?
 
 Yes.  And I mean, get called at the right time.

OK, review will help here.

   
   Better to keep everything per-cpu.  The code is in virt/kvm/ioapic.c
 
  Hmm. Disabling for level handles the ack notifiers
  issue as well, which I forgot about.
  It's a tough call. You think looking at
  TMR in kvm_get_apic_interrupt is safe?
 
 Yes, it's read only from the guest point of view IIRC.
 
   
 Why do we care about
 level-triggered interrupts?  Everything uses MSI or edge-triggered
 IOAPIC interrupts these days.
   
Well lots of emulated devices don't yet.
They probably should but it's nice to be able to
test with e.g. e1000 emulation not just virtio.
   
   
   e1000 doesn't support msi?
 
  qemu emulation doesn't.
 
 
 Can be changed if someone's really interested.  But really, avoiding
 EOIs for e1000 won't help it much.

It will help test EOI avoidance.

 -- 
 error compiling committee.c: too many arguments to function


Re: vhost-blk development

2012-04-10 Thread Michael Baysek
Hi Stefan.  

Well, I'm trying to determine which I/O method currently has the least 
overhead and gives the best performance for both reads and writes.

I am doing my testing by putting the entire guest onto a ramdisk.  I'm working 
on an i5-760 with 16GB RAM with VT-d enabled.  I am running the standard CentOS 
6 kernel with the 0.12.1.2 release of qemu-kvm that comes stock on CentOS 6.  The 
guest is configured with 512 MB RAM and 4 CPU cores, with its /dev/vda 
being the ramdisk on the host.

I'm not closed to building a custom kernel or kvm if I can get better 
performance reliably.  However, my initial attempts with the 3.3.1 kernel and 
latest kvm gave mixed results.
  
I've been using iozone 3.98 with -O -l32 -i0 -i1 -i2 -e -+n -r4K -s250M to 
measure performance.

So, I was interested in vhost-blk since it seemed like a promising avenue to 
take a look at.  If you have any other thoughts, that would also be helpful.

-Mike



- Original Message -
From: Stefan Hajnoczi stefa...@gmail.com
To: Michael Baysek mbay...@liquidweb.com
Cc: kvm@vger.kernel.org
Sent: Tuesday, April 10, 2012 4:55:26 AM
Subject: Re: vhost-blk development

On Mon, Apr 9, 2012 at 11:59 PM, Michael Baysek mbay...@liquidweb.com wrote:
 Hi all.  I'm interested in any developments on the vhost-blk in kernel 
 accelerator for disk i/o.

 I had seen a patchset on LKML https://lkml.org/lkml/2011/7/28/175 but that is 
 rather old.  Are there any newer developments going on with the vhost-blk 
 stuff?

Hi Michael,
I'm curious what you are looking for in vhost-blk.  Are you trying to
improve disk performance for KVM guests?

Perhaps you'd like to share your configuration, workload, and other
details so that we can discuss how to improve performance.

Stefan


Re: [PATCHv0 dont apply] RFC: kvm eoi PV using shared memory

2012-04-10 Thread Gleb Natapov
Heh, I was working on that too.

On Tue, Apr 10, 2012 at 05:26:18PM +0300, Michael S. Tsirkin wrote:
 On Tue, Apr 10, 2012 at 05:03:22PM +0300, Avi Kivity wrote:
  On 04/10/2012 04:27 PM, Michael S. Tsirkin wrote:
   I took a stab at implementing PV EOI using shared memory.
   This should reduce the number of exits an interrupt
   causes by as much as half.
  
   A partially complete draft for both host and guest parts
   is below.
  
   The idea is simple: there's a bit, per APIC, in guest memory,
   that tells the guest that it does not need EOI.
   We set it before injecting an interrupt and clear
   before injecting a nested one. Guest tests it using
   a test and clear operation - this is necessary
   so that host can detect interrupt nesting -
   and if set, it can skip the EOI MSR.
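
The handshake described above can be modelled in user space roughly as follows. This is a sketch under stated assumptions, not the patch: struct pv_eoi_model, host_inject() and guest_ack() are hypothetical names, the host-side bookkeeping is reduced to a counter, and C11 atomic_fetch_and() stands in for the single test-and-clear instruction the guest would use.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model: one shared flag per APIC plus host-side bookkeeping. */
struct pv_eoi_model {
    atomic_uint guest_flag;   /* shared word; bit 0 set = "EOI not needed" */
    int real_isr_depth;       /* in-service interrupts that need a real EOI */
    int eoi_exits;            /* EOIs that actually trapped to the host */
};

static void model_init(struct pv_eoi_model *m)
{
    atomic_init(&m->guest_flag, 0u);
    m->real_isr_depth = 0;
    m->eoi_exits = 0;
}

/* Host side: called when an interrupt is injected into the guest. */
static void host_inject(struct pv_eoi_model *m)
{
    /* Take the flag back atomically before deciding anything. */
    unsigned int old = atomic_exchange(&m->guest_flag, 0u);

    /* Flag still set: the previous interrupt was not acked yet, so this
     * one nests and the previous one must now do a real EOI as well. */
    if (old & 1u)
        m->real_isr_depth++;

    if (m->real_isr_depth == 0)
        atomic_store(&m->guest_flag, 1u);   /* this EOI can be skipped */
    else
        m->real_isr_depth++;                /* this one needs a real EOI */
}

/* Guest side: returns true if the EOI write (and its exit) was skipped. */
static bool guest_ack(struct pv_eoi_model *m)
{
    /* Test and clear must be one atomic step so the host can tell whether
     * the guest consumed the flag before a nested injection. */
    if (atomic_fetch_and(&m->guest_flag, ~1u) & 1u)
        return true;

    m->eoi_exits++;                         /* fall back to the real EOI */
    if (m->real_isr_depth > 0)
        m->real_isr_depth--;
    return false;
}
```

In the model, a lone interrupt costs no EOI exit, while a nested pair costs two real EOIs, which is the fallback behaviour the description asks for.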
  
   There's a new MSR to set the address of said register
   in guest memory. Otherwise not much changes:
   - Guest EOI is not required
   - ISR is automatically cleared before injection
  
   Some things are incomplete: add feature negotiation
   options, qemu support for said options.
   No testing was done beyond compiling the kernel.
  
   I would appreciate early feedback.
  
   Signed-off-by: Michael S. Tsirkin m...@redhat.com
  
   --
  
   diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
   index d854101..8430f41 100644
   --- a/arch/x86/include/asm/apic.h
   +++ b/arch/x86/include/asm/apic.h
   @@ -457,8 +457,13 @@ static inline u32 safe_apic_wait_icr_idle(void) { 
   return 0; }

#endif /* CONFIG_X86_LOCAL_APIC */

   +DECLARE_EARLY_PER_CPU(unsigned long, apic_eoi);
   +
static inline void ack_APIC_irq(void)
{
   + if (__test_and_clear_bit(0, &__get_cpu_var(apic_eoi)))
   + return;
   +
  
  While __test_and_clear_bit() is implemented in a single instruction,
  it's not required to be.  Better have the instruction there explicitly.
  
 /*
  * ack_APIC_irq() actually gets compiled as a single instruction
  * ... yummie.
   diff --git a/arch/x86/include/asm/kvm_host.h 
   b/arch/x86/include/asm/kvm_host.h
   index e216ba0..0ee1472 100644
   --- a/arch/x86/include/asm/kvm_host.h
   +++ b/arch/x86/include/asm/kvm_host.h
   @@ -481,6 +481,12 @@ struct kvm_vcpu_arch {
 u64 length;
 u64 status;
 } osvw;
   +
   + struct {
   + u64 msr_val;
   + struct gfn_to_hva_cache data;
   + int vector;
   + } eoi;
};
  
  Needs to be cleared on INIT.
 
 You mean kvm_arch_vcpu_reset?
 

  
   @@ -307,6 +308,9 @@ void __cpuinit kvm_guest_cpu_init(void)
smp_processor_id());
 }

   + wrmsrl(MSR_KVM_EOI_EN, __pa(this_cpu_ptr(&apic_eoi)) |
   +MSR_KVM_EOI_ENABLED);
   +
  
  Clear on kexec.
 
 With register_reboot_notifier?
 
 if (has_steal_clock)
 kvm_register_steal_time();
}
   diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
   index 8584322..9e38e12 100644
   --- a/arch/x86/kvm/lapic.c
   +++ b/arch/x86/kvm/lapic.c
   @@ -265,7 +265,61 @@ int kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct 
   kvm_lapic_irq *irq)
 irq->level, irq->trig_mode);
}

   -static inline int apic_find_highest_isr(struct kvm_lapic *apic)
   +static int eoi_put_user(struct kvm_vcpu *vcpu, u32 val)
   +{
   +
   + return kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.eoi.data, &val,
   +   sizeof(val));
   +}
   +
   +static int eoi_get_user(struct kvm_vcpu *vcpu, u32 *val)
   +{
   +
   + return kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.eoi.data, val,
   +   sizeof(*val));
   +}
   +
   +static inline bool eoi_enabled(struct kvm_vcpu *vcpu)
   +{
   + return (vcpu->arch.eoi.msr_val & MSR_KVM_EOI_ENABLED);
   +}
   +
   +static int eoi_get_pending_vector(struct kvm_vcpu *vcpu)
   +{
   + u32 val;
   + if (eoi_get_user(vcpu, &val) < 0)
   +     apic_debug("Can't read EOI MSR value: 0x%llx\n",
   +        (unsigned long long)vcpi->arch.eoi.msr_val);
   + if (!(val & 0x1))
   +     vcpu->arch.eoi.vector = -1;
   + return vcpu->arch.eoi.vector;
   +}
   +
   +static void eoi_set_pending_vector(struct kvm_vcpu *vcpu, int vector)
   +{
   + BUG_ON(vcpu->arch.eoi.vector != -1);
   + if (eoi_put_user(vcpu, 0x1) < 0) {
   +     apic_debug("Can't set EOI MSR value: 0x%llx\n",
   +        (unsigned long long)vcpi->arch.eoi.msr_val);
   +     return;
   + }
   + vcpu->arch.eoi.vector = vector;
   +}
   +
   +static int eoi_clr_pending_vector(struct kvm_vcpu *vcpu)
   +{
   + int vector;
   + vector = vcpu->arch.eoi.vector;
   + if (vector != -1 && eoi_put_user(vcpu, 0x0) < 0) {
   +     apic_debug("Can't clear EOI MSR value: 0x%llx\n",
   +        (unsigned long long)vcpi->arch.eoi.msr_val);
   +     return -1;
   + }
   + vcpu->arch.eoi.vector = -1;
   + return vector;
   +}
  
  
  
   +
   +static inline int __apic_find_highest_isr(struct kvm_lapic *apic)
{
 int 

[PATCH] KVM: Introduce generic interrupt injection for in-kernel irqchips

2012-04-10 Thread Jan Kiszka
Currently, MSI messages can only be injected to in-kernel irqchips by
defining a corresponding IRQ route for each message. Not only is this
unhandy if the MSI messages are generated on the fly by user space, but
IRQ routes are also a limited resource that user space has to manage
carefully.

By providing a direct injection path, we can both avoid using up limited
resources and simplify the necessary steps for user land. This path is
provided in a way that allows for use with other interrupt sources as
well. Besides MSIs, external interrupt lines can also be manipulated
through this interface, obsoleting KVM_IRQ_LINE_STATUS.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---

This picks up Avi's first suggestion as I still think it is the better
option to provide a direct MSI injection channel.
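
To make the proposed interface concrete, here is a hedged user-space sketch of how a request might be filled in. The structure and constants are copied from the patch in this mail, with uint32_t in place of __u32 so that it builds without kernel headers. The helpers make_msi_irq(), make_line_irq() and classify_status() are hypothetical, and the status mapping assumes the same sign convention as KVM_IRQ_LINE_STATUS (positive = delivered, zero = coalesced, negative = blocked or failed).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define KVM_IRQTYPE_EXTERNAL_LINE  0
#define KVM_IRQTYPE_MSI            1

#define KVM_IRQOP_LOWER            0
#define KVM_IRQOP_RAISE            1
#define KVM_IRQOP_TRIGGER          2

/* Layout as proposed in the patch (user-space integer types). */
struct kvm_general_irq {
    uint32_t type;
    uint32_t op;
    int32_t status;
    uint32_t pad;
    union {
        uint32_t line;
        struct {
            uint32_t address_lo;
            uint32_t address_hi;
            uint32_t data;
        } msi;
        uint8_t pad[32];
    } u;
};

/* Hypothetical helper: build a "trigger this MSI" request. */
static struct kvm_general_irq make_msi_irq(uint64_t address, uint32_t data)
{
    struct kvm_general_irq irq;

    memset(&irq, 0, sizeof(irq));      /* padding must not carry garbage */
    irq.type = KVM_IRQTYPE_MSI;
    irq.op = KVM_IRQOP_TRIGGER;        /* an MSI can only be triggered */
    irq.u.msi.address_lo = (uint32_t)address;
    irq.u.msi.address_hi = (uint32_t)(address >> 32);
    irq.u.msi.data = data;
    return irq;
}

/* Hypothetical helper: raise or lower an external interrupt line. */
static struct kvm_general_irq make_line_irq(uint32_t line, bool raise)
{
    struct kvm_general_irq irq;

    memset(&irq, 0, sizeof(irq));
    irq.type = KVM_IRQTYPE_EXTERNAL_LINE;
    irq.op = raise ? KVM_IRQOP_RAISE : KVM_IRQOP_LOWER;
    irq.u.line = line;
    return irq;
}

enum irq_result { IRQ_DELIVERED, IRQ_COALESCED, IRQ_BLOCKED };

/* Assumed mapping of the status field after a successful ioctl. */
static enum irq_result classify_status(int32_t status)
{
    if (status > 0)
        return IRQ_DELIVERED;
    return status == 0 ? IRQ_COALESCED : IRQ_BLOCKED;
}
```

Under the proposal, user space would pass the filled structure to the KVM_GENERAL_IRQ vm ioctl and, on a zero return, read the result back from the status field.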

 Documentation/virtual/kvm/api.txt |   46 +
 include/linux/kvm.h   |   26 +
 include/linux/kvm_host.h  |2 +
 virt/kvm/irq_comm.c   |   29 +++
 virt/kvm/kvm_main.c   |   20 
 5 files changed, 123 insertions(+), 0 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index 81ff39f..c70be58 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1482,6 +1482,52 @@ See KVM_ASSIGN_DEV_IRQ for the data structure.  The 
target device is specified
 by assigned_dev_id.  In the flags field, only KVM_DEV_ASSIGN_MASK_INTX is
 evaluated.
 
+4.61 KVM_GENERAL_IRQ
+
+Capability: KVM_CAP_GENERAL_IRQ
+Architectures: x86
+Type: vm ioctl
+Parameters: struct kvm_general_irq (in/out)
+Returns: 0 on success, <0 on error
+
+Inject an interrupt event to the guest. Only valid if in-kernel irqchip is
+enabled.
+
+struct kvm_general_irq {
+   __u32 type;
+   __u32 op;
+   __s32 status;
+   __u32 pad;
+   union {
+   __u32 line;
+   struct {
+   __u32 address_lo;
+   __u32 address_hi;
+   __u32 data;
+   } msi;
+   __u8 pad[32];
+   } u;
+};
+
+Supported IRQ types are:
+
+#define KVM_IRQTYPE_EXTERNAL_LINE  0
+#define KVM_IRQTYPE_MSI            1
+
+Available operations are:
+
+#define KVM_IRQOP_LOWER            0
+#define KVM_IRQOP_RAISE            1
+#define KVM_IRQOP_TRIGGER          2
+
+The level of an external interrupt line can either be raised or lowered; an
+MSI can only be triggered.
+
+If 0 is returned from the IOCTL, the status field was updated as well to
+reflect the injection result. It will be >0 on interrupt delivery, 0 if the
+interrupt was coalesced with an already pending one, and <0 if the guest
+blocked the delivery or some delivery error occurred.
+
 4.62 KVM_CREATE_SPAPR_TCE
 
 Capability: KVM_CAP_SPAPR_TCE
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 7a9dd4b..cb3afaf 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -590,6 +590,7 @@ struct kvm_ppc_pvinfo {
 #define KVM_CAP_SYNC_REGS 74
 #define KVM_CAP_PCI_2_3 75
 #define KVM_CAP_KVMCLOCK_CTRL 76
+#define KVM_CAP_GENERAL_IRQ 77
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -715,6 +716,29 @@ struct kvm_one_reg {
__u64 addr;
 };
 
+#define KVM_IRQTYPE_EXTERNAL_LINE  0
+#define KVM_IRQTYPE_MSI            1
+
+#define KVM_IRQOP_LOWER            0
+#define KVM_IRQOP_RAISE            1
+#define KVM_IRQOP_TRIGGER          2
+
+struct kvm_general_irq {
+   __u32 type;
+   __u32 op;
+   __s32 status;
+   __u32 pad;
+   union {
+   __u32 line;
+   struct {
+   __u32 address_lo;
+   __u32 address_hi;
+   __u32 data;
+   } msi;
+   __u8 pad[32];
+   } u;
+};
+
 /*
  * ioctls for VM fds
  */
@@ -789,6 +813,8 @@ struct kvm_s390_ucas_mapping {
 /* Available with KVM_CAP_PCI_2_3 */
 #define KVM_ASSIGN_SET_INTX_MASK  _IOW(KVMIO,  0xa4, \
   struct kvm_assigned_pci_dev)
+/* Available with KVM_CAP_GENERAL_IRQ */
+#define KVM_GENERAL_IRQ   _IOWR(KVMIO,  0xa5, struct kvm_general_irq)
 
 /*
  * ioctls for vcpu fds
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 49c2f2f..31d3b44 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -773,6 +773,8 @@ int kvm_set_irq_routing(struct kvm *kvm,
unsigned flags);
 void kvm_free_irq_routing(struct kvm *kvm);
 
+int kvm_general_irq(struct kvm *kvm, struct kvm_general_irq *irq);
+
 #else
 
 static inline void kvm_free_irq_routing(struct kvm *kvm) {}
diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
index 9f614b4..e487d3f 100644
--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -138,6 +138,35 @@ int kvm_set_msi(struct kvm_kernel_irq_routing_entry *e,
return 

Re: [PATCHv0 dont apply] RFC: kvm eoi PV using shared memory

2012-04-10 Thread Michael S. Tsirkin
On Tue, Apr 10, 2012 at 08:59:21PM +0300, Gleb Natapov wrote:
 Heh, I was working on that too.
 
 On Tue, Apr 10, 2012 at 05:26:18PM +0300, Michael S. Tsirkin wrote:
  On Tue, Apr 10, 2012 at 05:03:22PM +0300, Avi Kivity wrote:
   On 04/10/2012 04:27 PM, Michael S. Tsirkin wrote:
I took a stab at implementing PV EOI using shared memory.
This should reduce the number of exits an interrupt
causes by as much as half.
   
A partially complete draft for both host and guest parts
is below.
   
The idea is simple: there's a bit, per APIC, in guest memory,
that tells the guest that it does not need EOI.
We set it before injecting an interrupt and clear
before injecting a nested one. Guest tests it using
a test and clear operation - this is necessary
so that host can detect interrupt nesting -
and if set, it can skip the EOI MSR.
   
There's a new MSR to set the address of said register
in guest memory. Otherwise not much changes:
- Guest EOI is not required
- ISR is automatically cleared before injection
   
Some things are incomplete: add feature negotiation
options, qemu support for said options.
No testing was done beyond compiling the kernel.
   
I would appreciate early feedback.
   
Signed-off-by: Michael S. Tsirkin m...@redhat.com
   
--
   
diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index d854101..8430f41 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -457,8 +457,13 @@ static inline u32 safe_apic_wait_icr_idle(void) { 
return 0; }
 
 #endif /* CONFIG_X86_LOCAL_APIC */
 
+DECLARE_EARLY_PER_CPU(unsigned long, apic_eoi);
+
 static inline void ack_APIC_irq(void)
 {
+   if (__test_and_clear_bit(0, &__get_cpu_var(apic_eoi)))
+   return;
+
   
   While __test_and_clear_bit() is implemented in a single instruction,
   it's not required to be.  Better have the instruction there explicitly.
   
/*
 * ack_APIC_irq() actually gets compiled as a single instruction
 * ... yummie.
diff --git a/arch/x86/include/asm/kvm_host.h 
b/arch/x86/include/asm/kvm_host.h
index e216ba0..0ee1472 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -481,6 +481,12 @@ struct kvm_vcpu_arch {
u64 length;
u64 status;
} osvw;
+
+   struct {
+   u64 msr_val;
+   struct gfn_to_hva_cache data;
+   int vector;
+   } eoi;
 };
   
   Needs to be cleared on INIT.
  
  You mean kvm_arch_vcpu_reset?
  
 
   
@@ -307,6 +308,9 @@ void __cpuinit kvm_guest_cpu_init(void)
   smp_processor_id());
}
 
+   wrmsrl(MSR_KVM_EOI_EN, __pa(this_cpu_ptr(&apic_eoi)) |
+  MSR_KVM_EOI_ENABLED);
+
   
   Clear on kexec.
  
  With register_reboot_notifier?
  
if (has_steal_clock)
kvm_register_steal_time();
 }
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 8584322..9e38e12 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -265,7 +265,61 @@ int kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct 
kvm_lapic_irq *irq)
irq->level, irq->trig_mode);
 }
 
-static inline int apic_find_highest_isr(struct kvm_lapic *apic)
+static int eoi_put_user(struct kvm_vcpu *vcpu, u32 val)
+{
+
+   return kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.eoi.data, &val,
+ sizeof(val));
+}
+
+static int eoi_get_user(struct kvm_vcpu *vcpu, u32 *val)
+{
+
+   return kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.eoi.data, val,
+ sizeof(*val));
+}
+
+static inline bool eoi_enabled(struct kvm_vcpu *vcpu)
+{
+   return (vcpu->arch.eoi.msr_val & MSR_KVM_EOI_ENABLED);
+}
+
+static int eoi_get_pending_vector(struct kvm_vcpu *vcpu)
+{
+   u32 val;
+   if (eoi_get_user(vcpu, &val) < 0)
+       apic_debug("Can't read EOI MSR value: 0x%llx\n",
+          (unsigned long long)vcpi->arch.eoi.msr_val);
+   if (!(val & 0x1))
+       vcpu->arch.eoi.vector = -1;
+   return vcpu->arch.eoi.vector;
+}
+
+static void eoi_set_pending_vector(struct kvm_vcpu *vcpu, int vector)
+{
+   BUG_ON(vcpu->arch.eoi.vector != -1);
+   if (eoi_put_user(vcpu, 0x1) < 0) {
+       apic_debug("Can't set EOI MSR value: 0x%llx\n",
+          (unsigned long long)vcpi->arch.eoi.msr_val);
+       return;
+   }
+   vcpu->arch.eoi.vector = vector;
+}
+
+static int 

Re: [PATCHv0 dont apply] RFC: kvm eoi PV using shared memory

2012-04-10 Thread Gleb Natapov
On Tue, Apr 10, 2012 at 10:30:04PM +0300, Michael S. Tsirkin wrote:
 On Tue, Apr 10, 2012 at 08:59:21PM +0300, Gleb Natapov wrote:
  Heh, I was working on that too.
  
  On Tue, Apr 10, 2012 at 05:26:18PM +0300, Michael S. Tsirkin wrote:
   On Tue, Apr 10, 2012 at 05:03:22PM +0300, Avi Kivity wrote:
On 04/10/2012 04:27 PM, Michael S. Tsirkin wrote:
 I took a stab at implementing PV EOI using shared memory.
 This should reduce the number of exits an interrupt
 causes by as much as half.

 A partially complete draft for both host and guest parts
 is below.

 The idea is simple: there's a bit, per APIC, in guest memory,
 that tells the guest that it does not need EOI.
 We set it before injecting an interrupt and clear
 before injecting a nested one. Guest tests it using
 a test and clear operation - this is necessary
 so that host can detect interrupt nesting -
 and if set, it can skip the EOI MSR.

 There's a new MSR to set the address of said register
 in guest memory. Otherwise not much changes:
 - Guest EOI is not required
 - ISR is automatically cleared before injection

 Some things are incomplete: add feature negotiation
 options, qemu support for said options.
 No testing was done beyond compiling the kernel.

 I would appreciate early feedback.

 Signed-off-by: Michael S. Tsirkin m...@redhat.com

 --

 diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
 index d854101..8430f41 100644
 --- a/arch/x86/include/asm/apic.h
 +++ b/arch/x86/include/asm/apic.h
 @@ -457,8 +457,13 @@ static inline u32 safe_apic_wait_icr_idle(void) 
 { return 0; }
  
  #endif /* CONFIG_X86_LOCAL_APIC */
  
 +DECLARE_EARLY_PER_CPU(unsigned long, apic_eoi);
 +
  static inline void ack_APIC_irq(void)
  {
 + if (__test_and_clear_bit(0, &__get_cpu_var(apic_eoi)))
 + return;
 +

While __test_and_clear_bit() is implemented in a single instruction,
it's not required to be.  Better have the instruction there explicitly.

   /*
* ack_APIC_irq() actually gets compiled as a single instruction
* ... yummie.
 diff --git a/arch/x86/include/asm/kvm_host.h 
 b/arch/x86/include/asm/kvm_host.h
 index e216ba0..0ee1472 100644
 --- a/arch/x86/include/asm/kvm_host.h
 +++ b/arch/x86/include/asm/kvm_host.h
 @@ -481,6 +481,12 @@ struct kvm_vcpu_arch {
   u64 length;
   u64 status;
   } osvw;
 +
 + struct {
 + u64 msr_val;
 + struct gfn_to_hva_cache data;
 + int vector;
 + } eoi;
  };

Needs to be cleared on INIT.
   
   You mean kvm_arch_vcpu_reset?
   
  

 @@ -307,6 +308,9 @@ void __cpuinit kvm_guest_cpu_init(void)
  smp_processor_id());
   }
  
 + wrmsrl(MSR_KVM_EOI_EN, __pa(this_cpu_ptr(&apic_eoi)) |
 +MSR_KVM_EOI_ENABLED);
 +

Clear on kexec.
   
   With register_reboot_notifier?
   
   if (has_steal_clock)
   kvm_register_steal_time();
  }
 diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
 index 8584322..9e38e12 100644
 --- a/arch/x86/kvm/lapic.c
 +++ b/arch/x86/kvm/lapic.c
 @@ -265,7 +265,61 @@ int kvm_apic_set_irq(struct kvm_vcpu *vcpu, 
 struct kvm_lapic_irq *irq)
   irq->level, irq->trig_mode);
  }
  
 -static inline int apic_find_highest_isr(struct kvm_lapic *apic)
 +static int eoi_put_user(struct kvm_vcpu *vcpu, u32 val)
 +{
 +
 + return kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.eoi.data, &val,
 +   sizeof(val));
 +}
 +
 +static int eoi_get_user(struct kvm_vcpu *vcpu, u32 *val)
 +{
 +
 + return kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.eoi.data, val,
 +   sizeof(*val));
 +}
 +
 +static inline bool eoi_enabled(struct kvm_vcpu *vcpu)
 +{
 + return (vcpu->arch.eoi.msr_val & MSR_KVM_EOI_ENABLED);
 +}
 +
 +static int eoi_get_pending_vector(struct kvm_vcpu *vcpu)
 +{
 + u32 val;
 + if (eoi_get_user(vcpu, &val) < 0)
 +     apic_debug("Can't read EOI MSR value: 0x%llx\n",
 +        (unsigned long long)vcpi->arch.eoi.msr_val);
 + if (!(val & 0x1))
 +     vcpu->arch.eoi.vector = -1;
 + return vcpu->arch.eoi.vector;
 +}
 +
 +static void eoi_set_pending_vector(struct kvm_vcpu *vcpu, int vector)
 +{
 + BUG_ON(vcpu->arch.eoi.vector != -1);
 + if (eoi_put_user(vcpu, 0x1) < 0) {
 +     apic_debug("Can't set EOI MSR value: 0x%llx\n",
 +(unsigned long 

KVM qemu-kvm ext4_fill_flex_info() Denial of Service Vulnerability

2012-04-10 Thread Agostino Sarubbo
Hi all.

Yesterday, Secunia released an advisory about qemu-kvm:
https://secunia.com/advisories/48645/

This seems to describe an 'old' kernel bug, but I don't know if there is a 
'link' between the ext4 issue and kvm.

Could you explain this issue a bit?

Thanks in advance.
-- 
Agostino Sarubbo ago -at- gentoo.org
Gentoo/AMD64 Arch Security Liaison
GPG: 0x7CD2DC5D




Re: [PATCHv0 dont apply] RFC: kvm eoi PV using shared memory

2012-04-10 Thread Michael S. Tsirkin
On Tue, Apr 10, 2012 at 10:33:54PM +0300, Gleb Natapov wrote:
  We don't try to match what HV does 100% anyway.
  
 We should. The same code will be used for HV.

Only where it makes sense, that is where the functionality
is sufficiently similar.

   We have to notify IOAPIC about EOI ASAP. It
   may hold another interrupt for us that has to be delivered.
  
  You mean the ack notifiers? We can skip just for the vectors
  which have ack notifiers or only if there are no notifiers.
  
 No. I mean:
 
 if (!ent->fields.mask && (ioapic->irr & (1 << i)))
     ioapic_service(ioapic, i);

Hmm.
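
To spell out why the IOAPIC wants to hear about the EOI promptly, here is a small user-space model of the re-delivery that the quoted snippet performs. It is a sketch with stated simplifications: struct ioapic_model and the model_* functions are hypothetical, every pin is treated as level-triggered, and delivery is reduced to a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PINS 24

/* Hypothetical, simplified IOAPIC: all pins level-triggered. */
struct ioapic_model {
    uint32_t irr;                 /* bit i set: line i is currently asserted */
    uint32_t masked;              /* bit i set: pin i is masked */
    uint32_t remote_irr;          /* bit i set: delivered, waiting for EOI */
    uint8_t vector[MODEL_PINS];   /* vector programmed for each pin */
    int delivered[MODEL_PINS];    /* how often each pin was delivered */
};

/* Deliver pin to the CPU and remember that an EOI is outstanding. */
static void model_service(struct ioapic_model *io, int pin)
{
    io->remote_irr |= 1u << pin;
    io->delivered[pin]++;
}

/* A device changes the level of its interrupt line. */
static void model_set_line(struct ioapic_model *io, int pin, bool level)
{
    uint32_t bit = 1u << pin;

    if (!level) {
        io->irr &= ~bit;
        return;
    }
    io->irr |= bit;
    /* While an EOI is outstanding the pin is held, not delivered again. */
    if (!(io->masked & bit) && !(io->remote_irr & bit))
        model_service(io, pin);
}

/* The local APIC reports an EOI for a vector. */
static void model_eoi(struct ioapic_model *io, uint8_t vector)
{
    for (int i = 0; i < MODEL_PINS; i++) {
        uint32_t bit = 1u << i;

        if (io->vector[i] != vector)
            continue;
        io->remote_irr &= ~bit;
        /* Same test as the quoted snippet: the line is still asserted,
         * so the held interrupt has to be delivered now. */
        if (!(io->masked & bit) && (io->irr & bit))
            model_service(io, i);
    }
}
```

If the EOI never reaches model_eoi(), remote_irr stays set and the still-asserted line is never serviced again, which is the stall being pointed out here.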


Re: [PATCHv0 dont apply] RFC: kvm eoi PV using shared memory

2012-04-10 Thread Gleb Natapov
On Tue, Apr 10, 2012 at 10:40:14PM +0300, Michael S. Tsirkin wrote:
 On Tue, Apr 10, 2012 at 10:33:54PM +0300, Gleb Natapov wrote:
   We don't try to match what HV does 100% anyway.
   
  We should. The same code will be used for HV.
 
 Only where it makes sense, that is where the functionality
 is sufficiently similar.
 
You can sprinkle additional ifs in the code, but I do not see the point.

We have to notify IOAPIC about EOI ASAP. It
may hold another interrupt for us that has to be delivered.
   
   You mean the ack notifiers? We can skip just for the vectors
   which have ack notifiers or only if there are no notifiers.
   
  No. I mean:
  
  if (!ent->fields.mask && (ioapic->irr & (1 << i)))
      ioapic_service(ioapic, i);
 
 Hmm.

--
Gleb.


Re: [RFC PATCH 0/3] IOMMU groups

2012-04-10 Thread Alex Williamson

Ping.  Does this approach look like it could satisfy your desire for a
more integrated group layer?  I'd really like to move VFIO forward;
we've been stalled on this long enough.  David Woodhouse, I think this
provides the quirking you're looking for for devices like the Ricoh; do
you have any other requirements for a group layer?  Thanks,

Alex

On Mon, 2012-04-02 at 15:14 -0600, Alex Williamson wrote:
 This series attempts to make IOMMU device grouping a slightly more
 integral part of the device model.  iommu_device_groups were originally
 introduced to support the VFIO user space driver interface which needs
 to understand the granularity of device isolation in order to ensure
 security of devices when assigned for user access.  This information
 was provided via a simple group identifier from the IOMMU driver allowing
 VFIO to walk devices and assemble groups itself.
 
 The feedback received from this was that groups should be the effective
 unit of work for the IOMMU API.  The existing model of allowing domains
 to be created and individual devices attached ignores many of the
 restrictions of the IOMMU, whether by design, by topology or by defective
 devices.  Additionally we should be able to use the grouping information
 at the dma ops layer for managing domains and quirking devices.
 
 This series is a sketch at implementing only those aspects and leaving
 everything else about the multifaceted hairball of Isolation groups for
 another API.  Please comment and let me know if this seems like the
 direction we should be headed.  Thanks,
 
 Alex
 
 
 ---
 
 Alex Williamson (3):
   iommu: Create attach/detach group interface
   iommu: Create basic group infrastructure and update AMD-Vi & Intel VT-d
   iommu: Introduce iommu_group
 
 
  drivers/iommu/amd_iommu.c   |   50 ++
  drivers/iommu/intel-iommu.c |   76 
  drivers/iommu/iommu.c   |  210 
 ++-
  include/linux/device.h  |2 
  include/linux/iommu.h   |   43 +
  5 files changed, 301 insertions(+), 80 deletions(-)





Re: [PATCH] virtio_blk: Add help function to format mass of disks

2012-04-10 Thread Asias He

On 04/10/2012 11:53 PM, Michael S. Tsirkin wrote:

On Tue, Apr 10, 2012 at 08:49:43AM -0700, Tejun Heo wrote:

Hello, guys.

On Tue, Apr 10, 2012 at 04:34:06PM +0300, Michael S. Tsirkin wrote:

Why not use 'base' below?  neither unit nor base change.


Yes it's a bit strange, it was the same in Tejun's patch.
Tejun, any idea?


It was years ago, so I don't recall much.  I think I wanted to use a
variable name which signifies its role - I worked out the rather
convoluted base number logic on paper first and I probably wanted to
keep the distinctions.  I don't think it really matters at this point
tho.  Just make sure those functions are marked deprecated so that no
one else copies them.

Thanks.


I guess I'll keep it the same so it's easier to deduplicate
if someone wants to.


Why not fix it in both sd_format_disk_name() and virtblk_name_format()?
Ren, would you mind sending a v2 that drops the duplicate line?




--
tejun

___
Virtualization mailing list
virtualizat...@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization



--
Asias


Re: [PATCH] virtio_blk: Add help function to format mass of disks

2012-04-10 Thread Ren Mingxin

 On 04/10/2012 09:16 PM, Avi Kivity wrote:

On 04/10/2012 10:28 AM, Ren Mingxin wrote:

The current virtio block naming algorithm supports only 18278
(26^3 + 26^2 + 26) disks. If there are more virtio block devices
than that, there will be disks with the same name.

Based on commit 3e1a7ff8a0a7b948f2684930166954f9e8e776fe, add
function virtblk_name_format() so that virtio block can name
a large number of disks.

Signed-off-by: Ren Mingxin re...@cn.fujitsu.com
---
  drivers/block/virtio_blk.c |   38 ++
  1 files changed, 26 insertions(+), 12 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index c4a60ba..86516c8 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -374,6 +374,31 @@ static int init_vq(struct virtio_blk *vblk)
return err;
  }

+static int virtblk_name_format(char *prefix, int index, char *buf, int buflen)
+{
+   const int base = 'z' - 'a' + 1;
+   char *begin = buf + strlen(prefix);
+   char *begin = buf + strlen(prefix);

Duplicate line.



Oh, I obviously missed that :-(

--
Thanks,
Ren



Re: [PATCH] virtio_blk: Add help function to format mass of disks

2012-04-10 Thread Ren Mingxin

 On 04/11/2012 09:21 AM, Asias He wrote:

On 04/10/2012 11:53 PM, Michael S. Tsirkin wrote:

On Tue, Apr 10, 2012 at 08:49:43AM -0700, Tejun Heo wrote:

Hello, guys.

On Tue, Apr 10, 2012 at 04:34:06PM +0300, Michael S. Tsirkin wrote:

Why not use 'base' below?  Neither unit nor base changes.


Yes it's a bit strange, it was the same in Tejun's patch.
Tejun, any idea?


It was years ago, so I don't recall much.  I think I wanted to use a
variable name which signifies its role - I worked out the rather
convoluted base number logic on paper first and I probably wanted to
keep the distinctions.  I don't think it really matters at this point
tho.  Just make sure those functions are marked deprecated so that no
one else copies them.

Thanks.


I guess I'll keep it the same so it's easier to deduplicate
if someone wants to.


So, I'd keep this in the next version.



Why not fix it in both sd_format_disk_name() and virtblk_name_format()?
Ren, would you mind sending a v2 that drops the duplicate line?



I'll send v2 soon.

--
Thanks,
Ren



Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks

2012-04-10 Thread Marcelo Tosatti
On Sat, Mar 31, 2012 at 12:07:58AM +0200, Thomas Gleixner wrote:
 On Fri, 30 Mar 2012, H. Peter Anvin wrote:
 
  What is the current status of this patchset?  I haven't looked at it too
  closely because I have been focused on 3.4 up until now...
 
 The real question is whether these heuristics are the correct approach
 or not.
 
 If I look at it from the non virtualized kernel side then this is ass
 backwards. We know already that we are holding a spinlock which might
 cause other (v)cpus to go into an eternal spin. The non virtualized
 kernel solves this by disabling preemption and therefore getting out of
 the critical section as fast as possible.
 
 The virtualization problem reminds me a lot of the problem which RT
 kernels are observing where non raw spinlocks are turned into
 sleeping spinlocks and therefore can cause throughput issues for non
 RT workloads.
 
 Though the virtualized situation is even worse. Any preempted guest
 section which holds a spinlock is prone to cause unbounded delays.
 
 The paravirt ticketlock solution can only mitigate the problem, but
 not solve it. With massive overcommit there is always a way to trigger
 worst case scenarios unless you are educating the scheduler to cope
 with that.
 
 So if we need to fiddle with the scheduler and frankly that's the only
 way to get a real gain (the numbers, which are achieved by these
 patches, are not that impressive) then the question arises whether we
 should turn the whole thing around.
 
 I know that Peter is going to go berserk on me, but if we are running
 a paravirt guest then it's simple to provide a mechanism which allows
 the host (aka hypervisor) to check that in the guest just by looking
 at some global state.
 
 So if a guest exits due to an external event it's easy to inspect the
 state of that guest and avoid scheduling it away when it was interrupted
 in a spinlock held section. That guest/host shared state needs to be
 modified to tell the guest to invoke an exit when the last nested
 lock has been released.

Remember that the host schedules processes other than guest vcpus.

The case where a higher priority task (whatever that task is) interrupts
a vcpu which holds a spinlock should be frequent in an overcommit
scenario. Whenever that is the case, other vcpus _must_ be able to stop
spinning.

Now extrapolate that to guests with a large number of vcpus. There is no
replacement for sleep-in-hypervisor-instead-of-spin.

 Of course this needs to be time bound, so a rogue guest cannot
 monopolize the cpu forever, but that's the least to worry about
 problem simply because a guest which does not get out of a spinlocked
 region within a certain amount of time is borked and eligible for
 killing anyway.
 
 Thoughts ?
 
 Thanks,
 
   tglx


Re: [PATCH v2 uq/master] kvm: Drop unused kvm_pit_in_kernel

2012-04-10 Thread Marcelo Tosatti
On Thu, Mar 22, 2012 at 12:00:48AM +0100, Jan Kiszka wrote:
 From: Jan Kiszka jan.kis...@siemens.com
 
 This is now implied by kvm_irqchip_in_kernel.
 
 Signed-off-by: Jan Kiszka jan.kis...@siemens.com
 ---
 
 Rebased over latest uq/master.
 
  kvm-all.c  |6 --
  kvm-stub.c |6 --
  kvm.h  |2 --
  3 files changed, 0 insertions(+), 14 deletions(-)

Applied, thanks.



[PATCH 0/4] Export offsets of VMCS fields as note information for kdump

2012-04-10 Thread zhangyanfei
This patch set exports offsets of VMCS fields as note information for
kdump. We call it VMCSINFO. The purpose of VMCSINFO is to retrieve the
runtime state of a guest machine image, such as registers, which the
host machine's crash dump holds in VMCS format. The problem is that the
internal layout of the VMCS is not disclosed in Intel's specification,
so we reverse engineer it in the way implemented in this patch set.
Please note that this processing never affects any existing kvm logic.
The VMCSINFO is exported via sysfs to kexec-tools just like VMCOREINFO.

Here is an example:
Processor: Intel(R) Core(TM)2 Duo CPU E7500  @ 2.93GHz

$ cat /sys/kernel/vmcsinfo
1cba8c0 2000

crash> rd -p 1cba8c0 1000
 1cba8c0:  127b0009 53434d56   {...VMCS
 1cba8d0:  4f464e49 4e4f495349564552   INFOREVISION
 1cba8e0:  49460a643d44495f 5f4e495028444c45   _ID=d.FIELD(PIN_
 1cba8f0:  4d565f4445534142 4f435f434558455f   BASED_VM_EXEC_CO
 1cba900:  303d294c4f52544e 0a30383130343831   NTROL)=01840180.
 1cba910:  504328444c454946 5f44455341425f55   FIELD(CPU_BASED_
 1cba920:  5f434558455f4d56 294c4f52544e4f43   VM_EXEC_CONTROL)
 1cba930:  393130343931303d 28444c4549460a30   =01940190.FIELD(
 1cba940:  5241444e4f434553 4558455f4d565f59   SECONDARY_VM_EXE
 1cba950:  4f52544e4f435f43 30346566303d294c   C_CONTROL)=0fe40
 1cba960:  4c4549460a306566 4958455f4d562844   fe0.FIELD(VM_EXI
 1cba970:  4f52544e4f435f54 346531303d29534c   T_CONTROLS)=01e4
 1cba980:  4549460a30653130 4e455f4d5628444c   01e0.FIELD(VM_EN
 1cba990:  544e4f435f595254 33303d29534c4f52   TRY_CONTROLS)=03
 1cba9a0:  460a303133303431 45554728444c4549   140310.FIELD(GUE
 1cba9b0:  45535f53455f5453 3d29524f5443454c   ST_ES_SELECTOR)=
 1cba9c0:  4549460a30303530 545345554728444c   0500.FIELD(GUEST
 1cba9d0:  454c45535f53435f 35303d29524f5443   _CS_SELECTOR)=05
 ..
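
The text part of the dump above is a sequence of "FIELD(NAME)=value"
lines. As a rough illustration of how a consumer might split one such
line, here is a minimal userspace sketch. parse_field_line() is a
hypothetical helper, not part of this patch set, and it assumes names
shorter than 64 characters.

```c
#include <stdio.h>

/*
 * Hypothetical helper (not in the patch set): parse one
 * "FIELD(NAME)=hexvalue" line of VMCSINFO text.
 * Returns 0 on success, -1 if the line has a different shape.
 */
int parse_field_line(const char *line, char name[64],
		     unsigned long long *value)
{
	/* %63[^)] copies everything up to the closing parenthesis. */
	if (sscanf(line, "FIELD(%63[^)])=%llx", name, value) != 2)
		return -1;
	return 0;
}
```

Lines such as REVISION_ID=d do not match the FIELD(...) shape and are
rejected by this sketch; a real consumer would handle them separately.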

TODO:
  1. In kexec-tools, get VMCSINFO via sysfs and dump it as note information
 into vmcore.
  2. Dump VMCS region of each guest vcpu and VMCSINFO into qemu-process
 core file. To do this, we will modify kernel core dumper, gdb gcore
 and crash gcore.
  3. Dump guest image from the qemu-process core file into a vmcore.

zhangyanfei (4):
  x86: Add helper variables and functions to hold VMCSINFO
  KVM: VMX: Add functions to fill VMCSINFO
  ksysfs: export VMCSINFO via sysfs
  kexec: Add crash_save_vmcsinfo to update VMCSINFO

 arch/x86/include/asm/vmcsinfo.h |   42 +
 arch/x86/kernel/Makefile|2 +
 arch/x86/kernel/vmcsinfo.c  |   70 
 arch/x86/kvm/vmx.c  |  350 +++
 include/linux/kexec.h   |1 +
 kernel/kexec.c  |   14 ++
 kernel/ksysfs.c |   19 ++
 7 files changed, 498 insertions(+), 0 deletions(-)
 create mode 100644 arch/x86/include/asm/vmcsinfo.h
 create mode 100644 arch/x86/kernel/vmcsinfo.c


[PATCH 1/4] x86: Add helper variables and functions to hold VMCSINFO

2012-04-10 Thread zhangyanfei
This patch provides a set of variables to hold the VMCSINFO and also
some helper functions to help fill the VMCSINFO.

Signed-off-by: zhangyanfei zhangyan...@cn.fujitsu.com
---
 arch/x86/include/asm/vmcsinfo.h |   42 +++
 arch/x86/kernel/Makefile|2 +
 arch/x86/kernel/vmcsinfo.c  |   70 +++
 3 files changed, 114 insertions(+), 0 deletions(-)
 create mode 100644 arch/x86/include/asm/vmcsinfo.h
 create mode 100644 arch/x86/kernel/vmcsinfo.c

diff --git a/arch/x86/include/asm/vmcsinfo.h b/arch/x86/include/asm/vmcsinfo.h
new file mode 100644
index 0000000..cfdc984
--- /dev/null
+++ b/arch/x86/include/asm/vmcsinfo.h
@@ -0,0 +1,42 @@
+#ifndef _ASM_X86_VMCSINFO_H
+#define _ASM_X86_VMCSINFO_H
+
+#ifndef __ASSEMBLY__
+#include <linux/types.h>
+#include <linux/elf.h>
+
+/*
+ * Currently, 2 pages are enough for vmcsinfo.
+ */
+#define VMCSINFO_BYTES (8192)
+#define VMCSINFO_NOTE_NAME "VMCSINFO"
+#define VMCSINFO_NOTE_NAME_BYTES   ALIGN(sizeof(VMCSINFO_NOTE_NAME), 4)
+#define VMCSINFO_NOTE_HEAD_BYTES   ALIGN(sizeof(struct elf_note), 4)
+#define VMCSINFO_NOTE_SIZE (VMCSINFO_NOTE_HEAD_BYTES*2 \
+   + VMCSINFO_BYTES \
+   + VMCSINFO_NOTE_NAME_BYTES)
+
+extern size_t vmcsinfo_size;
+extern size_t vmcsinfo_max_size;
+
+extern void update_vmcsinfo_note(void);
+extern void vmcsinfo_append_str(const char *fmt, ...);
+extern unsigned long paddr_vmcsinfo_note(void);
+
+#define VMCSINFO_REVISION_ID(id) \
+   vmcsinfo_append_str("REVISION_ID=%x\n", id)
+#define VMCSINFO_FIELD16(name, value) \
+   vmcsinfo_append_str("FIELD(%s)=%04x\n", #name, value)
+#define VMCSINFO_FIELD32(name, value) \
+   vmcsinfo_append_str("FIELD(%s)=%08x\n", #name, value)
+#define VMCSINFO_FIELD64(name, value) \
+   vmcsinfo_append_str("FIELD(%s)=%016llx\n", #name, value)
+
+#ifdef CONFIG_X86_64
+#define VMCSINFO_FIELD(name, value) VMCSINFO_FIELD64(name, value)
+#else
+#define VMCSINFO_FIELD(name, value) VMCSINFO_FIELD32(name, value)
+#endif
+
+#endif /* __ASSEMBLY__ */
+#endif /* _ASM_X86_VMCSINFO_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 532d2e0..63edf33 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -102,6 +102,8 @@ obj-$(CONFIG_X86_CHECK_BIOS_CORRUPTION) += check.o
 obj-$(CONFIG_SWIOTLB)  += pci-swiotlb.o
 obj-$(CONFIG_OF)   += devicetree.o
 
+obj-y  += vmcsinfo.o
+
 ###
 # 64 bit specific files
 ifeq ($(CONFIG_X86_64),y)
diff --git a/arch/x86/kernel/vmcsinfo.c b/arch/x86/kernel/vmcsinfo.c
new file mode 100644
index 0000000..c1306ef
--- /dev/null
+++ b/arch/x86/kernel/vmcsinfo.c
@@ -0,0 +1,70 @@
+/*
+ * Architecture specific (i386/x86_64) functions for storing vmcs
+ * field information.
+ *
+ * Created by: zhangyanfei (zhangyan...@cn.fujitsu.com)
+ *
+ * Copyright (C) Fujitsu Corporation, 2012. All rights reserved.
+ *
+ * This source code is licensed under the GNU General Public License,
+ * Version 2.  See the file COPYING for more details.
+ */
+
+#include <asm/vmcsinfo.h>
+#include <linux/module.h>
+#include <linux/elf.h>
+
+static unsigned char vmcsinfo_data[VMCSINFO_BYTES];
+static u32 vmcsinfo_note[VMCSINFO_NOTE_SIZE/4];
+size_t vmcsinfo_max_size = sizeof(vmcsinfo_data);
+size_t vmcsinfo_size;
+EXPORT_SYMBOL(vmcsinfo_size);
+
+void update_vmcsinfo_note(void)
+{
+   u32 *buf = vmcsinfo_note;
+   struct elf_note note;
+
+   if (!vmcsinfo_size)
+   return;
+
+   note.n_namesz = strlen(VMCSINFO_NOTE_NAME) + 1;
+   note.n_descsz = vmcsinfo_size;
+   note.n_type   = 0;
+   memcpy(buf, &note, sizeof(note));
+   buf += (sizeof(note) + 3)/4;
+   memcpy(buf, VMCSINFO_NOTE_NAME, note.n_namesz);
+   buf += (note.n_namesz + 3)/4;
+   memcpy(buf, vmcsinfo_data, note.n_descsz);
+   buf += (note.n_descsz + 3)/4;
+
+   note.n_namesz = 0;
+   note.n_descsz = 0;
+   note.n_type   = 0;
+   memcpy(buf, &note, sizeof(note));
+}
+EXPORT_SYMBOL(update_vmcsinfo_note);
+
+void vmcsinfo_append_str(const char *fmt, ...)
+{
+   va_list args;
+   char buf[0x50];
+   int r;
+
+   va_start(args, fmt);
+   r = vsnprintf(buf, sizeof(buf), fmt, args);
+   va_end(args);
+
+   if (r + vmcsinfo_size > vmcsinfo_max_size)
+   r = vmcsinfo_max_size - vmcsinfo_size;
+
+   memcpy(&vmcsinfo_data[vmcsinfo_size], buf, r);
+
+   vmcsinfo_size += r;
+}
+EXPORT_SYMBOL(vmcsinfo_append_str);
+
+unsigned long paddr_vmcsinfo_note(void)
+{
+   return __pa((unsigned long)(char *)vmcsinfo_note);
+}
-- 
1.7.1
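
For readers less familiar with the note layout that
update_vmcsinfo_note() writes, the following is a minimal userspace
sketch of the same sequence: a three-word header, the name padded to a
32-bit boundary, the descriptor padded likewise, then an all-zero
terminating header. The names build_note(), struct note_hdr and WORDS()
are made up for this illustration, and the caller is assumed to supply
a zeroed buffer that is large enough.

```c
#include <stdint.h>
#include <string.h>

/* Same three 32-bit fields as the kernel's struct elf_note. */
struct note_hdr {
	uint32_t n_namesz;
	uint32_t n_descsz;
	uint32_t n_type;
};

/* Round a byte count up to whole 32-bit words. */
#define WORDS(bytes) (((bytes) + 3) / 4)

/*
 * Hypothetical helper: lay out one note followed by an empty
 * terminating note. Returns the number of 32-bit words written.
 */
size_t build_note(uint32_t *buf, const char *name,
		  const void *desc, uint32_t descsz)
{
	uint32_t *p = buf;
	struct note_hdr hdr;

	hdr.n_namesz = (uint32_t)strlen(name) + 1;  /* include the NUL */
	hdr.n_descsz = descsz;
	hdr.n_type = 0;

	memcpy(p, &hdr, sizeof(hdr));
	p += WORDS(sizeof(hdr));
	memcpy(p, name, hdr.n_namesz);
	p += WORDS(hdr.n_namesz);
	memcpy(p, desc, descsz);
	p += WORDS(descsz);

	/* Terminating header with every field set to zero. */
	memset(&hdr, 0, sizeof(hdr));
	memcpy(p, &hdr, sizeof(hdr));
	p += WORDS(sizeof(hdr));

	return (size_t)(p - buf);
}
```

With the name "VMCSINFO" (9 bytes including the NUL, padded to 12) and
a 3-byte descriptor, the sketch writes 3 + 3 + 1 + 3 = 10 words.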


[PATCH 2/4] KVM: VMX: Add functions to fill VMCSINFO

2012-04-10 Thread zhangyanfei
At initialization of the kvm_intel module, this patch fills VMCSINFO
with the VMCS revision identifier and the encoded offsets of VMCS
fields. The VMCSINFO processing is done at initialization of the
kvm_intel module because it is dangerous to rob VMX resources while
the kvm module is loaded.

Note that offsets of the following fields will not be filled into VMCSINFO:
1. fields defined in the Intel specification (Intel® 64 and
   IA-32 Architectures Software Developer’s Manual, Volume
   3C) but not defined in *vmcs_field*.
2. fields that do not exist because their corresponding control bits
   are not set.

Signed-off-by: zhangyanfei zhangyan...@cn.fujitsu.com
---
 arch/x86/kvm/vmx.c |  350 
 1 files changed, 350 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index ad85adf..e98fafa 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -41,6 +41,7 @@
 #include <asm/i387.h>
 #include <asm/xcr.h>
 #include <asm/perf_event.h>
+#include <asm/vmcsinfo.h>
 
 #include "trace.h"
 
@@ -2599,6 +2600,353 @@ static __init int alloc_kvm_area(void)
return 0;
 }
 
+/*
+ * For calculating offsets of fields in VMCS data, we index every 16-bit
+ * field by this kind of format:
+ * |---------- 16 bits -----------|
+ * +-------------+-+------------+-+
+ * | high 7 bits |1| low 7 bits |0|
+ * +-------------+-+------------+-+
+ * In the high byte, the lowest bit must be 1; in the low byte, the lowest
+ * bit must be 0. The two bits are set like this in case indexes in VMCS
+ * data are read as big endian mode.
+ * The remaining 14 bits of the index indicate the real offset of the
+ * field. Because the size of a VMCS region is at most 4 KBytes,
+ * 14 bits are enough to index the whole VMCS region.
+ *
+ * ENCODING_OFFSET: encode the offset into the index of this kind.
+ */
+#define OFFSET_HIGH_SHIFT (7)
+#define OFFSET_LOW_MASK   ((1 << OFFSET_HIGH_SHIFT) - 1) /* 0x7f */
+#define OFFSET_HIGH_MASK  (OFFSET_LOW_MASK << OFFSET_HIGH_SHIFT) /* 0x3f80 */
+#define ENCODING_OFFSET(offset) \
+   ((((offset) & OFFSET_LOW_MASK) << 1) + \
+   ((((offset) & OFFSET_HIGH_MASK) << 2) | 0x100))
+
+/*
+ * We separate these five control fields from other fields
+ * because some fields only exist on processors that support
+ * the 1-setting of control bits in the five control fields.
+ */
+static inline void append_control_field(void)
+{
+#define CONTROL_FIELD_OFFSET(field) \
+   VMCSINFO_FIELD32(field, vmcs_read32(field))
+
+   CONTROL_FIELD_OFFSET(PIN_BASED_VM_EXEC_CONTROL);
+   CONTROL_FIELD_OFFSET(CPU_BASED_VM_EXEC_CONTROL);
+   if (cpu_has_secondary_exec_ctrls()) {
+   CONTROL_FIELD_OFFSET(SECONDARY_VM_EXEC_CONTROL);
+   }
+   CONTROL_FIELD_OFFSET(VM_EXIT_CONTROLS);
+   CONTROL_FIELD_OFFSET(VM_ENTRY_CONTROLS);
+}
+
+static inline void append_field16(void)
+{
+#define FIELD_OFFSET16(field) \
+   VMCSINFO_FIELD16(field, vmcs_read16(field));
+
+   FIELD_OFFSET16(GUEST_ES_SELECTOR);
+   FIELD_OFFSET16(GUEST_CS_SELECTOR);
+   FIELD_OFFSET16(GUEST_SS_SELECTOR);
+   FIELD_OFFSET16(GUEST_DS_SELECTOR);
+   FIELD_OFFSET16(GUEST_FS_SELECTOR);
+   FIELD_OFFSET16(GUEST_GS_SELECTOR);
+   FIELD_OFFSET16(GUEST_LDTR_SELECTOR);
+   FIELD_OFFSET16(GUEST_TR_SELECTOR);
+   FIELD_OFFSET16(HOST_ES_SELECTOR);
+   FIELD_OFFSET16(HOST_CS_SELECTOR);
+   FIELD_OFFSET16(HOST_SS_SELECTOR);
+   FIELD_OFFSET16(HOST_DS_SELECTOR);
+   FIELD_OFFSET16(HOST_FS_SELECTOR);
+   FIELD_OFFSET16(HOST_GS_SELECTOR);
+   FIELD_OFFSET16(HOST_TR_SELECTOR);
+}
+
+static inline void append_field64(void)
+{
+#define FIELD_OFFSET64(field) \
+   VMCSINFO_FIELD64(field, vmcs_read64(field));
+
+   FIELD_OFFSET64(IO_BITMAP_A);
+   FIELD_OFFSET64(IO_BITMAP_A_HIGH);
+   FIELD_OFFSET64(IO_BITMAP_B);
+   FIELD_OFFSET64(IO_BITMAP_B_HIGH);
+   FIELD_OFFSET64(VM_EXIT_MSR_STORE_ADDR);
+   FIELD_OFFSET64(VM_EXIT_MSR_STORE_ADDR_HIGH);
+   FIELD_OFFSET64(VM_EXIT_MSR_LOAD_ADDR);
+   FIELD_OFFSET64(VM_EXIT_MSR_LOAD_ADDR_HIGH);
+   FIELD_OFFSET64(VM_ENTRY_MSR_LOAD_ADDR);
+   FIELD_OFFSET64(VM_ENTRY_MSR_LOAD_ADDR_HIGH);
+   FIELD_OFFSET64(TSC_OFFSET);
+   FIELD_OFFSET64(TSC_OFFSET_HIGH);
+   FIELD_OFFSET64(VMCS_LINK_POINTER);
+   FIELD_OFFSET64(VMCS_LINK_POINTER_HIGH);
+   FIELD_OFFSET64(GUEST_IA32_DEBUGCTL);
+   FIELD_OFFSET64(GUEST_IA32_DEBUGCTL_HIGH);
+
+   if (cpu_has_vmx_msr_bitmap()) {
+   FIELD_OFFSET64(MSR_BITMAP);
+   FIELD_OFFSET64(MSR_BITMAP_HIGH);
+   }
+
+   if (cpu_has_vmx_tpr_shadow()) {
+   FIELD_OFFSET64(VIRTUAL_APIC_PAGE_ADDR);
+   FIELD_OFFSET64(VIRTUAL_APIC_PAGE_ADDR_HIGH);
+   }
+
+   if (cpu_has_secondary_exec_ctrls()) {
+   if (vmcs_config.cpu_based_2nd_exec_ctrl 
+   
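
The index format described in the ENCODING_OFFSET comment of this patch
can be tried out in userspace. The sketch below restates the encoding as
a function and adds an inverse; decode_offset() is a hypothetical helper
that does not appear in the patch, and reading the values in the cover
letter's dump as encoded indexes is an interpretation of that comment,
not something the patch states outright.

```c
#include <stdint.h>

#define OFFSET_HIGH_SHIFT 7
#define OFFSET_LOW_MASK   ((1u << OFFSET_HIGH_SHIFT) - 1)         /* 0x7f */
#define OFFSET_HIGH_MASK  (OFFSET_LOW_MASK << OFFSET_HIGH_SHIFT)  /* 0x3f80 */

/* Same arithmetic as ENCODING_OFFSET: 14-bit offset to 16-bit index. */
uint16_t encode_offset(uint16_t offset)
{
	return (uint16_t)(((offset & OFFSET_LOW_MASK) << 1) +
			  (((offset & OFFSET_HIGH_MASK) << 2) | 0x100));
}

/* Hypothetical inverse (not in the patch): 16-bit index to offset. */
uint16_t decode_offset(uint16_t index)
{
	/* Drop the marker bits 0 and 8, then rejoin the two 7-bit halves. */
	return (uint16_t)(((index >> 1) & OFFSET_LOW_MASK) |
			  ((index >> 2) & OFFSET_HIGH_MASK));
}
```

Under that reading, the value 01840180 reported for
PIN_BASED_VM_EXEC_CONTROL in the cover letter splits into the indexes
0x0180 and 0x0184, which decode to byte offsets 0x40 and 0x42, two
adjacent 16-bit slots.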

Re: [PATCH 00/13] KVM: MMU: fast page fault

2012-04-10 Thread Marcelo Tosatti
On Tue, Apr 10, 2012 at 01:04:13PM +0300, Avi Kivity wrote:
 On 04/09/2012 10:46 PM, Marcelo Tosatti wrote:
  Perhaps the mmu_lock hold times by get_dirty are a large component here?
 
 That's my concern, because it affects the scaling of migration for wider
 guests.
 
  If that can be alleviated, not only RO->RW faults benefit.
 
 Those are the most common types of faults on modern hardware, no?

Depends on your workload, of course. If there is memory pressure,
0->PRESENT might be very frequent. My point is that reduction of
mmu_lock contention is a good thing overall.



[PATCH 3/4] ksysfs: export VMCSINFO via sysfs

2012-04-10 Thread zhangyanfei
This patch creates a sysfs file that exports where VMCSINFO is allocated,
as below:
$ cat /sys/kernel/vmcsinfo
1cb88a0 2000
The number on the left-hand side is the physical address of VMCSINFO,
while the one on the right-hand side is the max size of VMCSINFO.

Signed-off-by: zhangyanfei zhangyan...@cn.fujitsu.com
---
 kernel/ksysfs.c |   19 +++
 1 files changed, 19 insertions(+), 0 deletions(-)

diff --git a/kernel/ksysfs.c b/kernel/ksysfs.c
index 4e316e1..becbb68 100644
--- a/kernel/ksysfs.c
+++ b/kernel/ksysfs.c
@@ -18,6 +18,8 @@
 #include <linux/stat.h>
 #include <linux/sched.h>
 #include <linux/capability.h>
+#include <asm/vmcsinfo.h>
+#include <asm/virtext.h>
 
 #define KERNEL_ATTR_RO(_name) \
 static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
@@ -133,6 +135,20 @@ KERNEL_ATTR_RO(vmcoreinfo);
 
 #endif /* CONFIG_KEXEC */
 
+#ifdef CONFIG_X86
+static ssize_t vmcsinfo_show(struct kobject *kobj,
+struct kobj_attribute *attr, char *buf)
+{
+   if (cpu_has_vmx())
+   return sprintf(buf, "%lx %x\n",
+  paddr_vmcsinfo_note(),
+  (unsigned int)vmcsinfo_max_size);
+   return 0;
+}
+KERNEL_ATTR_RO(vmcsinfo);
+
+#endif /* CONFIG_X86 */
+
 /* whether file capabilities are enabled */
 static ssize_t fscaps_show(struct kobject *kobj,
  struct kobj_attribute *attr, char *buf)
@@ -182,6 +198,9 @@ static struct attribute * kernel_attrs[] = {
&kexec_crash_size_attr.attr,
&vmcoreinfo_attr.attr,
 #endif
+#ifdef CONFIG_X86
+   &vmcsinfo_attr.attr,
+#endif
NULL
 };
 
-- 
1.7.1
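
A consumer such as kexec-tools would need to split the two hex numbers
that vmcsinfo_show() prints. A minimal userspace sketch of that step
follows; parse_vmcsinfo_line() is a hypothetical name chosen for this
example, not a function from kexec-tools or from this series.

```c
#include <stdio.h>

/*
 * Hypothetical helper: parse "<paddr> <max_size>\n", both in hex
 * without a 0x prefix, the format produced by "%lx %x\n".
 * Returns 0 on success, -1 if the line does not hold two hex numbers.
 */
int parse_vmcsinfo_line(const char *line, unsigned long *paddr,
			unsigned int *max_size)
{
	if (sscanf(line, "%lx %x", paddr, max_size) != 2)
		return -1;
	return 0;
}
```

For the sample output above, the second number 0x2000 is 8192 bytes,
which matches VMCSINFO_BYTES from patch 1/4.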


[PATCH 4/4] kexec: Add crash_save_vmcsinfo to update VMCSINFO

2012-04-10 Thread zhangyanfei
crash_save_vmcsinfo updates the VMCSINFO when kernel crashes.
If no VMCSINFO has been saved before, this function will do nothing.

Signed-off-by: zhangyanfei zhangyan...@cn.fujitsu.com
---
 include/linux/kexec.h |1 +
 kernel/kexec.c|   14 ++
 2 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index 0d7d6a1..6e8ff13 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -145,6 +145,7 @@ void arch_crash_save_vmcoreinfo(void);
 __printf(1, 2)
 void vmcoreinfo_append_str(const char *fmt, ...);
 unsigned long paddr_vmcoreinfo_note(void);
+void crash_save_vmcsinfo(void);
 
 #define VMCOREINFO_OSRELEASE(value) \
vmcoreinfo_append_str("OSRELEASE=%s\n", value)
diff --git a/kernel/kexec.c b/kernel/kexec.c
index 4e2e472..19843ef 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -38,6 +38,7 @@
 #include <asm/uaccess.h>
 #include <asm/io.h>
 #include <asm/sections.h>
+#include <asm/vmcsinfo.h>
 
 /* Per cpu memory for storing cpu states in case of system crash. */
 note_buf_t __percpu *crash_notes;
@@ -1094,6 +1095,7 @@ void crash_kexec(struct pt_regs *regs)
 
crash_setup_regs(&fixed_regs, regs);
crash_save_vmcoreinfo();
+   crash_save_vmcsinfo();
machine_crash_shutdown(&fixed_regs);
machine_kexec(kexec_crash_image);
}
@@ -1458,6 +1460,18 @@ unsigned long __attribute__ ((weak)) paddr_vmcoreinfo_note(void)
return __pa((unsigned long)(char *)vmcoreinfo_note);
 }
 
+#ifdef CONFIG_X86
+void crash_save_vmcsinfo(void)
+{
+   if (!vmcsinfo_size)
+   return;
+   vmcsinfo_append_str("CRASHTIME=%ld", get_seconds());
+   update_vmcsinfo_note();
+}
+#else
+void crash_save_vmcsinfo(void) {}
+#endif /* CONFIG_X86 */
+
 static int __init crash_save_vmcoreinfo_init(void)
 {
VMCOREINFO_OSRELEASE(init_uts_ns.name.release);
-- 
1.7.1


Linux Crash Caused By KVM?

2012-04-10 Thread Peijie Yu
Hi all,
  I have run into some problems while using KVM.
  The test environment is:
Summary:Dell R610, 1 x Xeon E5645 2.40GHz, 47.1GB / 48GB 1333MHz DDR3
System: Dell PowerEdge R610 (Dell 08GXHX)
Processors: 1 (of 2) x Xeon E5645 2.40GHz 5860MHz FSB (HT enabled,
6 cores, 24 threads)
Memory: 47.1GB / 48GB 1333MHz DDR3 == 12 x 4GB
Disk:   sda: 299GB (72%) JBOD
Disk:   sdb (host9): 5.0TB JBOD == 1 x VIRTUAL-DISK
Disk:   sdc (host11): 5.0TB JBOD == 1 x VIRTUAL-DISK
Disk:   sdd (host12): 5.0TB JBOD == 1 x VIRTUAL-DISK
Disk:   sde (host10): 5.0TB JBOD == 1 x VIRTUAL-DISK
Disk-Control:   mpt2sas0: LSI Logic / Symbios Logic SAS2008
PCI-Express Fusion-MPT SAS-2 [Falcon]
Disk-Control:   host9:
Disk-Control:   host10:
Disk-Control:   host11:
Disk-Control:   host12:
Chipset:Intel 82801IB (ICH9)
Network:br1 (bridge): 14:fe:b5:dc:2c:6e
Network:em1 (bnx2): Broadcom NetXtreme II BCM5709 Gigabit,
14:fe:b5:dc:2c:6e, 1000Mb/s full-duplex
Network:em2 (bnx2): Broadcom NetXtreme II BCM5709 Gigabit,
14:fe:b5:dc:2c:70, 1000Mb/s full-duplex
Network:em3 (bnx2): Broadcom NetXtreme II BCM5709 Gigabit,
14:fe:b5:dc:2c:72, 1000Mb/s full-duplex
Network:em4 (bnx2): Broadcom NetXtreme II BCM5709 Gigabit,
14:fe:b5:dc:2c:74, 1000Mb/s full-duplex
Network:vnet0 (tun): fe:16:3e:49:fb:05, 10Mb/s full-duplex
Network:vnet1 (tun): fe:16:3e:cb:c0:d1, 10Mb/s full-duplex
Network:vnet2 (tun): fe:16:3e:1e:c1:c4, 10Mb/s full-duplex
Network:vnet3 (tun): fe:16:3e:d5:58:f4, 10Mb/s full-duplex
Network:vnet4 (tun): fe:16:3e:15:b4:16, 10Mb/s full-duplex
Network:vnet5 (tun): fe:16:3e:d2:07:47, 10Mb/s full-duplex
Network:vnet6 (tun): fe:16:3e:e1:2b:b9, 10Mb/s full-duplex
OS: RHEL Server 6.1 (Santiago), Linux
2.6.32-220.2.1.el6.x86_64 x86_64, 64-bit
BIOS:   Dell 3.0.0 01/31/2011

  While using KVM, I have run into the following issues:
  1.   Host Crash Caused by
  a.   Kernel Panic
        KERNEL: /usr/lib/debug/lib/modules/2.6.32-131.12.1.el6.x86_64/vmlinux
      DUMPFILE: ../vmcore_2012.13.46  [PARTIAL DUMP]
          CPUS: 24
          DATE: Wed Jan 11 13:34:13 2012
        UPTIME: 25 days, 04:11:05
  LOAD AVERAGE: 223.16, 172.97, 158.23
         TASKS: 1464
      NODENAME: dell2.localdomain
       RELEASE: 2.6.32-131.12.1.el6.x86_64
       VERSION: #1 SMP Sun Jul 31 16:44:56 EDT 2011
       MACHINE: x86_64  (2394 Mhz)
        MEMORY: 48 GB
         PANIC: kernel BUG at arch/x86/kernel/traps.c:547!
           PID: 11851
       COMMAND: qemu-kvm
          TASK: 880c071c3500  [THREAD_INFO: 880c132d8000]
           CPU: 1
         STATE: TASK_RUNNING (PANIC)

  PID: 11851  TASK: 880c071c3500  CPU: 1   COMMAND: qemu-kvm
   #0 [880028207be0] machine_kexec at 810310cb
   #1 [880028207c40] crash_kexec at 810b6392
   #2 [880028207d10] oops_end at 814de670
   #3 [880028207d40] die at 8100f2eb
   #4 [880028207d70] do_trap at 814ddf64
   #5 [880028207dd0] do_invalid_op at 8100ceb5
   #6 [880028207e70] invalid_op at 8100bf5b
      [exception RIP: do_nmi+554]
      RIP: 814de43a  RSP: 880028207f28  RFLAGS: 00010002
      RAX: 880c132d9fd8  RBX: 880028207f58  RCX: c101
      RDX: 8800  RSI:   RDI: 880028207f58
      RBP: 880028207f48   R8: 88005ebf9800   R9: 880028203fc0
      R10: 0034  R11: 03e8  R12: cc20
      R13: 816024a0  R14: 88005ebf9800  R15: 7000
      ORIG_RAX:   CS: 0010  SS: 0018
   #7 [880028207f50] nmi at 814ddc90
      [exception RIP: bad_to_user+37]
      RIP: 814e4e2b  RSP: 880028207bb0  RFLAGS: 00010046
      RAX: 880c132d9fd8  RBX: 880c132d9c48  RCX: 0001
      RDX:   RSI: 0001000b  RDI: 880028207c08
      RBP: 880028207c48   R8: 88005ebf9800   R9: 880028203fc0
      R10: 0034  R11: 03e8  R12: cc20
      R13: 816024a0  R14: 88005ebf9800  R15: 7000
      ORIG_RAX:   CS: 0010  SS: 0018
  --- NMI exception stack ---

 For this problem, I found that the panic is caused by
BUG_ON(in_nmi()), which means an NMI happened inside another NMI context.
But I checked the Intel Technical Manual and found: "While an NMI
interrupt handler is executing, the processor disables additional
calls to the NMI handler until the next IRET instruction is executed."
So, how can this happen?


b.  Qemu Process's CPU dead lock
        KERNEL: /usr/lib/debug/lib/modules/2.6.32-131.12.1.el6.x86_64/vmlinux
      DUMPFILE: 

Re: [PATCH] kvm: set gsi_bits and max_gsi correctly

2012-04-10 Thread Marcelo Tosatti
On Wed, Mar 28, 2012 at 02:18:05PM -0400, Jason Baron wrote:
 The current kvm_init_irq_routing() doesn't set up the used_gsi_bitmap
 correctly, and as a consequence pins max_gsi to 32 when it really
 should be 1024. I ran into this limitation while testing pci
 passthrough, where I consistently got an -ENOSPC return from
 kvm_get_irq_route_gsi() called from assigned_dev_update_msix_mmio().
 
 Signed-off-by: Jason Baron jba...@redhat.com

Applied to uq/master, thanks.



[PATCH v2] virtio_blk: Add help function to format mass of disks

2012-04-10 Thread Ren Mingxin
The current virtio block naming algorithm supports only 18278
(26^3 + 26^2 + 26) disks. If there are more virtio block devices
than that, there will be disks with the same name.

Based on commit 3e1a7ff8a0a7b948f2684930166954f9e8e776fe, add
function virtblk_name_format() so that virtio block can name
a large number of disks.

Signed-off-by: Ren Mingxin re...@cn.fujitsu.com
---
v1->v2: drop the duplicate line
---
 drivers/block/virtio_blk.c |   37 +
 1 files changed, 25 insertions(+), 12 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index c4a60ba..07b8bf9 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -374,6 +374,30 @@ static int init_vq(struct virtio_blk *vblk)
return err;
 }
 
+static int virtblk_name_format(char *prefix, int index, char *buf, int buflen)
+{
+   const int base = 'z' - 'a' + 1;
+   char *begin = buf + strlen(prefix);
+   char *end = buf + buflen;
+   char *p;
+   int unit;
+
+   p = end - 1;
+   *p = '\0';
+   unit = base;
+   do {
+   if (p == begin)
+   return -EINVAL;
+   *--p = 'a' + (index % unit);
+   index = (index / unit) - 1;
+   } while (index >= 0);
+
+   memmove(begin, p, end - p);
+   memcpy(buf, prefix, strlen(prefix));
+
+   return 0;
+}
+
 static int __devinit virtblk_probe(struct virtio_device *vdev)
 {
struct virtio_blk *vblk;
@@ -442,18 +466,7 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
 
q->queuedata = vblk;
 
-   if (index < 26) {
-   sprintf(vblk->disk->disk_name, "vd%c", 'a' + index % 26);
-   } else if (index < (26 + 1) * 26) {
-   sprintf(vblk->disk->disk_name, "vd%c%c",
-   'a' + index / 26 - 1, 'a' + index % 26);
-   } else {
-   const unsigned int m1 = (index / 26 - 1) / 26 - 1;
-   const unsigned int m2 = (index / 26 - 1) % 26;
-   const unsigned int m3 =  index % 26;
-   sprintf(vblk->disk->disk_name, "vd%c%c%c",
-   'a' + m1, 'a' + m2, 'a' + m3);
-   }
+   virtblk_name_format("vd", index, vblk->disk->disk_name, DISK_NAME_LEN);
 
vblk->disk->major = major;
vblk->disk->first_minor = index_to_minor(index);
-- 
1.7.1
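
To see which names the loop in virtblk_name_format() produces, the same
logic can be run in userspace. The sketch below is a port for
illustration only: name_format() is a made-up name, the separate
unit/base variables are folded into one, and the early length check is
an addition that the patch does not have.

```c
#include <errno.h>
#include <string.h>

/*
 * Userspace port of the naming loop (illustration only).
 * Writes prefix followed by a letter suffix for index into buf.
 * Returns 0 on success, -EINVAL if the name does not fit.
 */
int name_format(const char *prefix, int index, char *buf, int buflen)
{
	const int base = 'z' - 'a' + 1;     /* 26 letters */
	char *begin, *end, *p;

	/* Added guard, not in the patch: prefix must leave room. */
	if (buflen <= 0 || strlen(prefix) >= (size_t)buflen)
		return -EINVAL;

	begin = buf + strlen(prefix);       /* first suffix position */
	end = buf + buflen;
	p = end - 1;
	*p = '\0';

	/* Build the suffix backwards from the end of the buffer. */
	do {
		if (p == begin)
			return -EINVAL;     /* ran out of room */
		*--p = 'a' + (index % base);
		index = (index / base) - 1; /* the -1 makes "aa" follow "z" */
	} while (index >= 0);

	memmove(begin, p, end - p);         /* slide suffix next to prefix */
	memcpy(buf, prefix, strlen(prefix));
	return 0;
}
```

With prefix "vd", index 0 gives vda, 25 gives vdz, 26 gives vdaa, and
18277 gives vdzzz, the last name the old three-letter code could
produce. Index 18278 moves on to vdaaaa instead of repeating a name.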



Re: [PATCH v2] KVM: Avoid zapping unrelated shadows in __kvm_set_memory_region()

2012-04-10 Thread Xiao Guangrong
On 04/10/2012 09:05 PM, Takuya Yoshikawa wrote:

 
 diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
 index 29ad6f9..a50f7ba 100644
 --- a/arch/x86/kvm/mmu.c
 +++ b/arch/x86/kvm/mmu.c
 @@ -3930,16 +3930,30 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
   kvm_flush_remote_tlbs(kvm);
  }
 
 -void kvm_mmu_zap_all(struct kvm *kvm)
 +/**
 + * kvm_mmu_zap_all - zap all shadows which have mappings into a given slot
 + * @kvm: the kvm instance
 + * @slot: id of the target slot
 + *
 + * If @slot is -1, zap all shadow pages.
 + */
 +void kvm_mmu_zap_all(struct kvm *kvm, int slot)
  {
   struct kvm_mmu_page *sp, *node;
   LIST_HEAD(invalid_list);
 + int zapped;
 
   spin_lock(&kvm->mmu_lock);
  restart:
 - list_for_each_entry_safe(sp, node, &kvm->arch.active_mmu_pages, link)
 - if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
 - goto restart;
 + zapped = 0;
 + list_for_each_entry_safe(sp, node, &kvm->arch.active_mmu_pages, link) {
 + if ((slot >= 0) && !test_bit(slot, sp->slot_bitmap))
 + continue;
 +
 + zapped |= kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);


You should goto restart here like the original code. Also, the safe version
of list_for_each is not needed.
