date:20121113

[PATCH 0/4] KVM: PPC: Book3S HV: HPT read/write functions for userspace

2012-11-13 Thread Paul Mackerras

This series of patches provides an interface by which userspace can
read and write the hashed page table (HPT) of a Book3S HV guest.
The interface is an ioctl which provides a file descriptor which can
be accessed with the read() and write() system calls.  The data read
and written is the guest view of the HPT, in which the second
doubleword of each HPTE (HPT entry) contains a guest physical address,
as distinct from the real HPT that the hardware accesses, where the
second doubleword of each HPTE contains a real address.

Because the HPT is divided into groups (HPTEGs) of 8 entries each,
where each HPTEG usually only contains a few valid entries, or none,
the data format that we use does run-length encoding of the invalid
entries, so in fact the invalid entries take up no space in the
stream.
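
The effect of the run-length encoding can be illustrated with a small
userspace sketch.  It assumes the record header layout documented in
patch 4/4 (index, n_valid, n_invalid) and treats bit 0 of the first
doubleword as the valid bit.  The names htab_hdr and encode_hpt are
hypothetical, and this is not the kernel's implementation.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define HPTE_VALID 0x1ULL	/* valid bit in the first doubleword */

/* Same field layout as struct kvm_get_htab_header in patch 4/4. */
struct htab_hdr {
	uint32_t index;
	uint16_t n_valid;
	uint16_t n_invalid;
};

/*
 * Encode npte entries (two doublewords each) from hpt[] into out[].
 * Each record is a header followed by n_valid 16-byte entries; the run
 * of invalid entries after them is only counted in n_invalid.
 * Returns the number of bytes written, or 0 if out[] is too small.
 */
size_t encode_hpt(const uint64_t *hpt, uint32_t npte,
		  unsigned char *out, size_t outlen)
{
	size_t off = 0;
	uint32_t i = 0;

	while (i < npte) {
		struct htab_hdr hdr = { .index = i };
		uint32_t j = i;
		size_t need;

		/* Count the run of valid entries starting at i. */
		while (j < npte && (hpt[2 * j] & HPTE_VALID) &&
		       hdr.n_valid < UINT16_MAX) {
			hdr.n_valid++;
			j++;
		}
		/* Count the run of invalid entries that follows. */
		while (j < npte && !(hpt[2 * j] & HPTE_VALID) &&
		       hdr.n_invalid < UINT16_MAX) {
			hdr.n_invalid++;
			j++;
		}

		need = sizeof(hdr) + 16 * (size_t)hdr.n_valid;
		if (off + need > outlen)
			return 0;
		memcpy(out + off, &hdr, sizeof(hdr));
		memcpy(out + off + sizeof(hdr), &hpt[2 * i],
		       16 * (size_t)hdr.n_valid);
		off += need;
		i = j;
	}
	return off;
}
```

With an 8-entry group in which only entries 1 and 2 are valid, this
sketch emits two records totalling 48 bytes, against 128 bytes for the
raw entries.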

The interface also provides for doing multiple passes over the HPT,
where the first pass provides information on all HPTEs, and subsequent
passes only return the HPTEs that have changed since the previous pass.

I have implemented a read/write interface rather than an mmap-based
interface because the data is not stored contiguously anywhere in
kernel memory.  Of each 16-byte HPTE, the first 8 bytes come from the
real HPT and the second 8 bytes come from the parallel vmalloc'd array
where we store the guest view of the guest physical address,
permissions, accessed/dirty bits etc.  Thus a mmap-based interface
would not be practicable (not without doubling the size of the
parallel array, typically requiring an extra 8MB of kernel memory per
guest).  This is also why I have not used the memslot interface for
this.

This implements the interface for HV-style KVM but not for PR-style
KVM.  Userspace does not need any additional interface with PR-style
KVM because userspace maintains the guest HPT already in that case,
and has an image of the guest view of the HPT in its address space.

This series is against the next branch of the kvm tree.  The patches
are basically identical to the previous posting of the series, just
rediffed for the move of kvm.h from include/linux to
include/uapi/linux, and for commit 8ca40a70a7 ("KVM: Take kvm instead
of vcpu to mmu_notifier_retry"), which supersedes patch 1 of the old
series.

The overall diffstat is:

 Documentation/virtual/kvm/api.txt        |   53 +
 arch/powerpc/include/asm/kvm_book3s.h    |    8 +-
 arch/powerpc/include/asm/kvm_book3s_64.h |   24 ++
 arch/powerpc/include/asm/kvm_host.h      |    1 +
 arch/powerpc/include/asm/kvm_ppc.h       |    2 +
 arch/powerpc/include/uapi/asm/kvm.h      |   24 ++
 arch/powerpc/kvm/book3s_64_mmu_hv.c      |  380 +-
 arch/powerpc/kvm/book3s_hv.c             |   12 -
 arch/powerpc/kvm/book3s_hv_rm_mmu.c      |   71 --
 arch/powerpc/kvm/powerpc.c               |   17 ++
 include/uapi/linux/kvm.h                 |    3 +
 11 files changed, 551 insertions(+), 44 deletions(-)

Please apply.

Paul.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 2/4] KVM: PPC: Book3S HV: Add a mechanism for recording modified HPTEs

2012-11-13 Thread Paul Mackerras

This uses a bit in our record of the guest view of the HPTE to record
when the HPTE gets modified.  We use a reserved bit for this, and ensure
that this bit is always cleared in HPTE values returned to the guest.
The recording of modified HPTEs is only done if other code indicates
its interest by setting kvm->arch.hpte_mod_interest to a non-zero value.
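
The bookkeeping described above can be modelled in a few lines of
userspace C.  This is only a sketch: rev_model stands in for the revmap
entry, the interest flag is passed as a plain bool rather than read
from an atomic counter, and test_and_clear_modified() is an assumption
about how a later reader might consume the bit, not something this
patch adds.

```c
#include <stdint.h>
#include <stdbool.h>

#define HPTE_GR_MODIFIED (1ULL << 62)	/* reserved bit, as in the patch */

/* Hypothetical stand-in for the guest_rpte field of a revmap entry. */
struct rev_model {
	uint64_t guest_rpte;
};

/* Record a modification, but only while someone has shown interest. */
void note_modification(struct rev_model *rev, bool interest)
{
	if (interest)
		rev->guest_rpte |= HPTE_GR_MODIFIED;
}

/* Value handed back to the guest: the tracking bit is always hidden. */
uint64_t guest_visible_rpte(const struct rev_model *rev)
{
	return rev->guest_rpte & ~HPTE_GR_MODIFIED;
}

/* Assumed reader side: report whether the bit was set, then clear it. */
bool test_and_clear_modified(struct rev_model *rev)
{
	bool was = (rev->guest_rpte & HPTE_GR_MODIFIED) != 0;

	rev->guest_rpte &= ~HPTE_GR_MODIFIED;
	return was;
}
```

Keeping the flag in a reserved bit of an existing field means no extra
per-HPTE storage is needed, at the cost of masking the bit on every
path that returns the value to the guest.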

Signed-off-by: Paul Mackerras 
---
 arch/powerpc/include/asm/kvm_book3s_64.h |    6 ++
 arch/powerpc/include/asm/kvm_host.h      |    1 +
 arch/powerpc/kvm/book3s_hv_rm_mmu.c      |   25 ++---
 3 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h b/arch/powerpc/include/asm/kvm_book3s_64.h
index 1472a5b..4ca4f25 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -50,6 +50,12 @@ extern int kvm_hpt_order; /* order of preallocated HPTs */
 #define HPTE_V_HVLOCK  0x40UL
 #define HPTE_V_ABSENT  0x20UL
 
+/*
+ * We use this bit in the guest_rpte field of the revmap entry
+ * to indicate a modified HPTE.
+ */
+#define HPTE_GR_MODIFIED   (1ul << 62)
+
 static inline long try_lock_hpte(unsigned long *hpte, unsigned long bits)
 {
unsigned long tmp, old;
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 3093896..58c7264 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -248,6 +248,7 @@ struct kvm_arch {
atomic_t vcpus_running;
unsigned long hpt_npte;
unsigned long hpt_mask;
+   atomic_t hpte_mod_interest;
spinlock_t slot_phys_lock;
unsigned short last_vcpu[NR_CPUS];
struct kvmppc_vcore *vcores[KVM_MAX_VCORES];
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 362dffe..726231a 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -66,6 +66,18 @@ void kvmppc_add_revmap_chain(struct kvm *kvm, struct revmap_entry *rev,
 }
 EXPORT_SYMBOL_GPL(kvmppc_add_revmap_chain);
 
+/*
+ * Note modification of an HPTE; set the HPTE modified bit
+ * if it wasn't modified before and anyone is interested.
+ */
+static inline void note_hpte_modification(struct kvm *kvm,
+ struct revmap_entry *rev)
+{
+   if (!(rev->guest_rpte & HPTE_GR_MODIFIED) &&
+   atomic_read(&kvm->arch.hpte_mod_interest))
+   rev->guest_rpte |= HPTE_GR_MODIFIED;
+}
+
 /* Remove this HPTE from the chain for a real page */
 static void remove_revmap_chain(struct kvm *kvm, long pte_index,
struct revmap_entry *rev,
@@ -287,8 +299,10 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
rev = &kvm->arch.revmap[pte_index];
if (realmode)
rev = real_vmalloc_addr(rev);
-   if (rev)
+   if (rev) {
rev->guest_rpte = g_ptel;
+   note_hpte_modification(kvm, rev);
+   }
 
/* Link HPTE into reverse-map chain */
if (pteh & HPTE_V_VALID) {
@@ -392,7 +406,8 @@ long kvmppc_h_remove(struct kvm_vcpu *vcpu, unsigned long flags,
/* Read PTE low word after tlbie to get final R/C values */
remove_revmap_chain(kvm, pte_index, rev, v, hpte[1]);
}
-   r = rev->guest_rpte;
+   r = rev->guest_rpte & ~HPTE_GR_MODIFIED;
+   note_hpte_modification(kvm, rev);
unlock_hpte(hpte, 0);
 
vcpu->arch.gpr[4] = v;
@@ -466,6 +481,7 @@ long kvmppc_h_bulk_remove(struct kvm_vcpu *vcpu)
 
args[j] = ((0x80 | flags) << 56) + pte_index;
rev = real_vmalloc_addr(&kvm->arch.revmap[pte_index]);
+   note_hpte_modification(kvm, rev);
 
if (!(hp[0] & HPTE_V_VALID)) {
/* insert R and C bits from PTE */
@@ -555,6 +571,7 @@ long kvmppc_h_protect(struct kvm_vcpu *vcpu, unsigned long flags,
if (rev) {
r = (rev->guest_rpte & ~mask) | bits;
rev->guest_rpte = r;
+   note_hpte_modification(kvm, rev);
}
r = (hpte[1] & ~mask) | bits;
 
@@ -606,8 +623,10 @@ long kvmppc_h_read(struct kvm_vcpu *vcpu, unsigned long flags,
v &= ~HPTE_V_ABSENT;
v |= HPTE_V_VALID;
}
-   if (v & HPTE_V_VALID)
+   if (v & HPTE_V_VALID) {
r = rev[i].guest_rpte | (r & (HPTE_R_R | HPTE_R_C));
+   r &= ~HPTE_GR_MODIFIED;
+   }
vcpu->arch.gpr[4 + i * 2] = v;
vcpu->arch.gpr[5 + i * 2] = r;
}
-- 
1.7.10.4


[PATCH 1/4] KVM: PPC: Book3S HV: Restructure HPT entry creation code

2012-11-13 Thread Paul Mackerras

This restructures the code that creates HPT (hashed page table)
entries so that it can be called in situations where we don't have a
struct vcpu pointer, only a struct kvm pointer.  It also fixes a bug
where kvmppc_map_vrma() would corrupt the guest R4 value.

Most of the work of kvmppc_virtmode_h_enter is now done by a new
function, kvmppc_virtmode_do_h_enter, which itself calls another new
function, kvmppc_do_h_enter, which contains most of the old
kvmppc_h_enter.  The new kvmppc_do_h_enter takes explicit arguments
for the place to return the HPTE index, the Linux page tables to use,
and whether it is being called in real mode, thus removing the need
for it to have the vcpu as an argument.

Currently kvmppc_map_vrma creates the VRMA (virtual real mode area)
HPTEs by calling kvmppc_virtmode_h_enter, which is designed primarily
to handle H_ENTER hcalls from the guest that need to pin a page of
memory.  Since H_ENTER returns the index of the created HPTE in R4,
kvmppc_virtmode_h_enter updates the guest R4, corrupting the guest R4
in the case when it gets called from kvmppc_map_vrma on the first
VCPU_RUN ioctl.  With this, kvmppc_map_vrma instead calls
kvmppc_virtmode_do_h_enter with the address of a dummy word as the
place to store the HPTE index, thus avoiding corrupting the guest R4.
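
The shape of the fix can be sketched in isolation.  The following is a
reduced userspace model, not the kernel code: vcpu_model, do_enter,
hcall_enter and internal_enter are hypothetical names, and the core
routine does nothing except report an index through its out-parameter.

```c
#include <stdint.h>

#define H_SUCCESS 0

/* Hypothetical, much reduced stand-in for a vcpu's register file. */
struct vcpu_model {
	uint64_t gpr[32];
};

/* Core routine: reports the HPTE index via idx_ret, not a register. */
long do_enter(unsigned long pte_index, unsigned long *idx_ret)
{
	*idx_ret = pte_index;
	return H_SUCCESS;
}

/* Hcall path: the guest expects the index in R4. */
long hcall_enter(struct vcpu_model *vcpu, unsigned long pte_index)
{
	unsigned long idx;
	long ret = do_enter(pte_index, &idx);

	if (ret == H_SUCCESS)
		vcpu->gpr[4] = idx;
	return ret;
}

/* Internal caller (such as VRMA setup): the index goes to a dummy word. */
long internal_enter(unsigned long pte_index)
{
	unsigned long dummy;

	return do_enter(pte_index, &dummy);
}
```

Because only the hcall wrapper touches the register file, an internal
caller cannot disturb guest state by accident.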

Signed-off-by: Paul Mackerras 
---
 arch/powerpc/include/asm/kvm_book3s.h |    5 +++--
 arch/powerpc/kvm/book3s_64_mmu_hv.c   |   36 +++--
 arch/powerpc/kvm/book3s_hv_rm_mmu.c   |   27 -
 3 files changed, 45 insertions(+), 23 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h
index 36fcf41..fea768f 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -157,8 +157,9 @@ extern void *kvmppc_pin_guest_page(struct kvm *kvm, unsigned long addr,
 extern void kvmppc_unpin_guest_page(struct kvm *kvm, void *addr);
 extern long kvmppc_virtmode_h_enter(struct kvm_vcpu *vcpu, unsigned long flags,
long pte_index, unsigned long pteh, unsigned long ptel);
-extern long kvmppc_h_enter(struct kvm_vcpu *vcpu, unsigned long flags,
-   long pte_index, unsigned long pteh, unsigned long ptel);
+extern long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
+   long pte_index, unsigned long pteh, unsigned long ptel,
+   pgd_t *pgdir, bool realmode, unsigned long *idx_ret);
 extern long kvmppc_hv_get_dirty_log(struct kvm *kvm,
struct kvm_memory_slot *memslot, unsigned long *map);
 
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 2a89a36..6ee6516 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -41,6 +41,10 @@
 /* Power architecture requires HPT is at least 256kB */
 #define PPC_MIN_HPT_ORDER  18
 
+static long kvmppc_virtmode_do_h_enter(struct kvm *kvm, unsigned long flags,
+   long pte_index, unsigned long pteh,
+   unsigned long ptel, unsigned long *pte_idx_ret);
+
 long kvmppc_alloc_hpt(struct kvm *kvm, u32 *htab_orderp)
 {
unsigned long hpt;
@@ -185,6 +189,7 @@ void kvmppc_map_vrma(struct kvm_vcpu *vcpu, struct kvm_memory_slot *memslot,
unsigned long addr, hash;
unsigned long psize;
unsigned long hp0, hp1;
+   unsigned long idx_ret;
long ret;
struct kvm *kvm = vcpu->kvm;
 
@@ -216,7 +221,8 @@ void kvmppc_map_vrma(struct kvm_vcpu *vcpu, struct kvm_memory_slot *memslot,
hash = (hash << 3) + 7;
hp_v = hp0 | ((addr >> 16) & ~0x7fUL);
hp_r = hp1 | addr;
-   ret = kvmppc_virtmode_h_enter(vcpu, H_EXACT, hash, hp_v, hp_r);
+   ret = kvmppc_virtmode_do_h_enter(kvm, H_EXACT, hash, hp_v, hp_r,
+&idx_ret);
if (ret != H_SUCCESS) {
pr_err("KVM: map_vrma at %lx failed, ret=%ld\n",
   addr, ret);
@@ -354,15 +360,10 @@ static long kvmppc_get_guest_page(struct kvm *kvm, unsigned long gfn,
return err;
 }
 
-/*
- * We come here on a H_ENTER call from the guest when we are not
- * using mmu notifiers and we don't have the requested page pinned
- * already.
- */
-long kvmppc_virtmode_h_enter(struct kvm_vcpu *vcpu, unsigned long flags,
-   long pte_index, unsigned long pteh, unsigned long ptel)
+long kvmppc_virtmode_do_h_enter(struct kvm *kvm, unsigned long flags,
+   long pte_index, unsigned long pteh,
+   unsigned long ptel, unsigned long *pte_idx_ret)
 {
-   struct kvm *kvm = vcpu->kvm;
unsigned long psize, gpa, gfn;
struct kvm_memory_slot *memslot;
long ret;
@@ -390,8 +391,8 @@ long kvmppc_virtmode_h_enter(st

[PATCH 3/4] KVM: PPC: Book3S HV: Make a HPTE removal function available

2012-11-13 Thread Paul Mackerras

This makes a HPTE removal function, kvmppc_do_h_remove(), available
outside book3s_hv_rm_mmu.c.  This will be used by the HPT writing
code.

Signed-off-by: Paul Mackerras 
---
 arch/powerpc/include/asm/kvm_book3s.h |    3 +++
 arch/powerpc/kvm/book3s_hv_rm_mmu.c   |   19 +--
 2 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h
index fea768f..46763d10 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -160,6 +160,9 @@ extern long kvmppc_virtmode_h_enter(struct kvm_vcpu *vcpu, unsigned long flags,
 extern long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
long pte_index, unsigned long pteh, unsigned long ptel,
pgd_t *pgdir, bool realmode, unsigned long *idx_ret);
+extern long kvmppc_do_h_remove(struct kvm *kvm, unsigned long flags,
+   unsigned long pte_index, unsigned long avpn,
+   unsigned long *hpret);
 extern long kvmppc_hv_get_dirty_log(struct kvm *kvm,
struct kvm_memory_slot *memslot, unsigned long *map);
 
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 726231a..e407e97 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -364,11 +364,10 @@ static inline int try_lock_tlbie(unsigned int *lock)
return old == 0;
 }
 
-long kvmppc_h_remove(struct kvm_vcpu *vcpu, unsigned long flags,
-unsigned long pte_index, unsigned long avpn,
-unsigned long va)
+long kvmppc_do_h_remove(struct kvm *kvm, unsigned long flags,
+   unsigned long pte_index, unsigned long avpn,
+   unsigned long *hpret)
 {
-   struct kvm *kvm = vcpu->kvm;
unsigned long *hpte;
unsigned long v, r, rb;
struct revmap_entry *rev;
@@ -410,10 +409,18 @@ long kvmppc_h_remove(struct kvm_vcpu *vcpu, unsigned long flags,
note_hpte_modification(kvm, rev);
unlock_hpte(hpte, 0);
 
-   vcpu->arch.gpr[4] = v;
-   vcpu->arch.gpr[5] = r;
+   hpret[0] = v;
+   hpret[1] = r;
return H_SUCCESS;
 }
+EXPORT_SYMBOL_GPL(kvmppc_do_h_remove);
+
+long kvmppc_h_remove(struct kvm_vcpu *vcpu, unsigned long flags,
+unsigned long pte_index, unsigned long avpn)
+{
+   return kvmppc_do_h_remove(vcpu->kvm, flags, pte_index, avpn,
+ &vcpu->arch.gpr[4]);
+}
 
 long kvmppc_h_bulk_remove(struct kvm_vcpu *vcpu)
 {
-- 
1.7.10.4


[PATCH 4/4] KVM: PPC: Book3S HV: Provide a method for userspace to read and write the HPT

2012-11-13 Thread Paul Mackerras

A new ioctl, KVM_PPC_GET_HTAB_FD, returns a file descriptor.  Reads on
this fd return the contents of the HPT (hashed page table), writes
create and/or remove entries in the HPT.  There is a new capability,
KVM_CAP_PPC_HTAB_FD, to indicate the presence of the ioctl.  The ioctl
takes an argument structure with the index of the first HPT entry to
read out and a set of flags.  The flags indicate whether the user is
intending to read or write the HPT, and whether to return all entries
or only the "bolted" entries (those with the bolted bit, 0x10, set in
the first doubleword).

This is intended for use in implementing qemu's savevm/loadvm and for
live migration.  Therefore, on reads, the first pass returns information
about all HPTEs (or all bolted HPTEs).  When the first pass reaches the
end of the HPT, it returns from the read.  Subsequent reads only return
information about HPTEs that have changed since they were last read.
A read that finds no changed HPTEs in the HPT following where the last
read finished will return 0 bytes.

The format of the data provides a simple run-length compression of the
invalid entries.  Each block of data starts with a header that indicates
the index (position in the HPT, which is just an array), the number of
valid entries starting at that index (may be zero), and the number of
invalid entries following those valid entries.  The valid entries, 16
bytes each, follow the header.  The invalid entries are not explicitly
represented.
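
A userspace consumer might walk a buffer of such records as sketched
below.  This is a minimal illustration under stated assumptions: the
header is taken to be the 8-byte index/n_valid/n_invalid layout
documented in this patch, the buffer is in host byte order, and the
names htab_hdr, htab_stats and walk_htab_stream are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Same field layout as struct kvm_get_htab_header in this patch. */
struct htab_hdr {
	uint32_t index;
	uint16_t n_valid;
	uint16_t n_invalid;
};

/* Totals gathered while walking a buffer. */
struct htab_stats {
	unsigned long records;
	unsigned long valid;
	unsigned long invalid;
};

/*
 * Walk buf[0..len) as a sequence of records.  Returns 0 on success,
 * or -1 if the buffer ends in the middle of a header or an entry.
 */
int walk_htab_stream(const unsigned char *buf, size_t len,
		     struct htab_stats *st)
{
	size_t off = 0;

	memset(st, 0, sizeof(*st));
	while (off < len) {
		struct htab_hdr hdr;
		size_t body;

		if (len - off < sizeof(hdr))
			return -1;
		memcpy(&hdr, buf + off, sizeof(hdr));
		off += sizeof(hdr);

		/* n_valid 16-byte entries follow; invalid ones take no space. */
		body = 16 * (size_t)hdr.n_valid;
		if (len - off < body)
			return -1;
		/* Entries hdr.index .. hdr.index + n_valid - 1 start at buf + off. */
		off += body;

		st->records++;
		st->valid += hdr.n_valid;
		st->invalid += hdr.n_invalid;
	}
	return 0;
}
```

A real consumer would copy or apply the entries at the marked point
instead of only counting them.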

Signed-off-by: Paul Mackerras 
---
 Documentation/virtual/kvm/api.txt        |   53 +
 arch/powerpc/include/asm/kvm_book3s_64.h |   18 ++
 arch/powerpc/include/asm/kvm_ppc.h       |    2 +
 arch/powerpc/include/uapi/asm/kvm.h      |   24 +++
 arch/powerpc/kvm/book3s_64_mmu_hv.c      |  344 ++
 arch/powerpc/kvm/book3s_hv.c             |   12 --
 arch/powerpc/kvm/powerpc.c               |   17 ++
 include/uapi/linux/kvm.h                 |    3 +
 8 files changed, 461 insertions(+), 12 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 6671fdc..33080ea 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2071,6 +2071,59 @@ KVM_S390_INT_EXTERNAL_CALL (vcpu) - sigp external call; source cpu in parm
 
 Note that the vcpu ioctl is asynchronous to vcpu execution.
 
+4.78 KVM_PPC_GET_HTAB_FD
+
+Capability: KVM_CAP_PPC_HTAB_FD
+Architectures: powerpc
+Type: vm ioctl
+Parameters: Pointer to struct kvm_get_htab_fd (in)
+Returns: file descriptor number (>= 0) on success, -1 on error
+
+This returns a file descriptor that can be used either to read out the
+entries in the guest's hashed page table (HPT), or to write entries to
+initialize the HPT.  The returned fd can only be written to if the
+KVM_GET_HTAB_WRITE bit is set in the flags field of the argument, and
+can only be read if that bit is clear.  The argument struct looks like
+this:
+
+/* For KVM_PPC_GET_HTAB_FD */
+struct kvm_get_htab_fd {
+   __u64   flags;
+   __u64   start_index;
+};
+
+/* Values for kvm_get_htab_fd.flags */
+#define KVM_GET_HTAB_BOLTED_ONLY   ((__u64)0x1)
+#define KVM_GET_HTAB_WRITE ((__u64)0x2)
+
+The `start_index' field gives the index in the HPT of the entry at
+which to start reading.  It is ignored when writing.
+
+Reads on the fd will initially supply information about all
+"interesting" HPT entries.  Interesting entries are those with the
+bolted bit set, if the KVM_GET_HTAB_BOLTED_ONLY bit is set, otherwise
+all entries.  When the end of the HPT is reached, the read() will
+return.  If read() is called again on the fd, it will start again from
+the beginning of the HPT, but will only return HPT entries that have
+changed since they were last read.
+
+Data read or written is structured as a header (8 bytes) followed by a
+series of valid HPT entries (16 bytes each).  The header indicates how
+many valid HPT entries there are and how many invalid entries follow
+the valid entries.  The invalid entries are not represented explicitly
+in the stream.  The header format is:
+
+struct kvm_get_htab_header {
+   __u32   index;
+   __u16   n_valid;
+   __u16   n_invalid;
+};
+
+Writes to the fd create HPT entries starting at the index given in the
+header; first `n_valid' valid entries with contents from the data
+written, then `n_invalid' invalid entries, invalidating any previously
+valid entries found.
+
 
 5. The kvm_run structure
 
diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h b/arch/powerpc/include/asm/kvm_book3s_64.h
index 4ca4f25..dc0a78d 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -243,4 +243,22 @@ static inline bool slot_is_aligned(struct kvm_memory_slot *memslot,
return !(memslot->base_gfn & mask) && !(memslot->npages & mask);
 }
 
+static inline unsigned long slb_pgsize_encoding(unsigned long psize)
+{
+   unsigned

Re: [RFC PATCH 0/2] kvm/vmx: Output TSC offset

2012-11-13 Thread Steven Rostedt

On Tue, 2012-11-13 at 18:03 -0800, David Sharp wrote:
> On Tue, Nov 13, 2012 at 6:00 PM, Steven Rostedt  wrote:
> > On Wed, 2012-11-14 at 10:36 +0900, Yoshihiro YUNOMAE wrote:
> >
> >> To merge the data like previous pattern, we apply this patch set. Then, we can
> >> get TSC offset of the guest as follows:
> >>
> >> $ dmesg | grep kvm
> >> [   57.717180] kvm: (2687) write TSC offset 18446743360465545001, now clock ##
> >>                      ^^^^                   ^^^^^^^^^^^^^^^^^^^^            ^^
> >>                      PID                    TSC offset          HOST TSC value
> >>
> >
> > Using printk to export something like this is IMO a nasty hack.
> >
> > Can't we create a /sys or /proc file to export the same thing?
> 
> Since the value changes over the course of the trace, and seems to be
> part of the context of the trace, I think I'd include it as a
> tracepoint.
> 

I'm fine with that too.

-- Steve



Re: [RFC PATCH 0/2] kvm/vmx: Output TSC offset

2012-11-13 Thread David Sharp

On Tue, Nov 13, 2012 at 6:00 PM, Steven Rostedt  wrote:
> On Wed, 2012-11-14 at 10:36 +0900, Yoshihiro YUNOMAE wrote:
>
>> To merge the data like previous pattern, we apply this patch set. Then, we can
>> get TSC offset of the guest as follows:
>>
>> $ dmesg | grep kvm
>> [   57.717180] kvm: (2687) write TSC offset 18446743360465545001, now clock ##
>>                      ^^^^                   ^^^^^^^^^^^^^^^^^^^^            ^^
>>                      PID                    TSC offset          HOST TSC value
>>
>
> Using printk to export something like this is IMO a nasty hack.
>
> Can't we create a /sys or /proc file to export the same thing?

Since the value changes over the course of the trace, and seems to be
part of the context of the trace, I think I'd include it as a
tracepoint.

>
> -- Steve
>
>

Re: [RFC PATCH 0/2] kvm/vmx: Output TSC offset

2012-11-13 Thread H. Peter Anvin

On 11/13/2012 06:00 PM, Steven Rostedt wrote:
> 
> Using printk to export something like this is IMO a nasty hack.
> 
> Can't we create a /sys or /proc file to export the same thing?
> 

Maybe we need a /proc/pid/kvm/* directory?

-hpa



Re: [RFC PATCH 0/2] kvm/vmx: Output TSC offset

2012-11-13 Thread Steven Rostedt

On Wed, 2012-11-14 at 10:36 +0900, Yoshihiro YUNOMAE wrote:

> To merge the data like previous pattern, we apply this patch set. Then, we can
> get TSC offset of the guest as follows:
> 
> $ dmesg | grep kvm
> [   57.717180] kvm: (2687) write TSC offset 18446743360465545001, now clock ##
>                      ^^^^                   ^^^^^^^^^^^^^^^^^^^^            ^^
>                      PID                    TSC offset          HOST TSC value
> 

Using printk to export something like this is IMO a nasty hack.

Can't we create a /sys or /proc file to export the same thing?

-- Steve



[RFC PATCH 2/2] tools: Add a tool for merging trace data of a guest and a host

2012-11-13 Thread Yoshihiro YUNOMAE

This tool merges the trace data of a guest and a host in chronological order.
Note that it handles only one guest and one host (not multiple guests).

- How to use
1. Get trace data of the host and guest via ssh, virtio-serial, or virtio-trace

2. Get the TSC offset after applying the patch "kvm/vmx: Print TSC_OFFSET
   information when TSC offset value is written to VMCS"
$ dmesg | grep kvm
[   57.717180] kvm: ([PID]) write TSC offset [TSC offset], now clock [HOST TSC]

3. Use this tool
$ ./trace-merge.pl <TSC offset> <host data> <guest data>

hqemu-kvm-2687  [003] d...50550079203669: kvm_exit: [detail]
hqemu-kvm-2687  [003] d...50550079206816: kvm_entry: [detail]
gcomm-3826  [000] d.h.50550079226331: sched_wakeup: [detail]
hqemu-kvm-2687  [003] d...50550079240656: kvm_exit: [detail]
hqemu-kvm-2687  [003] d...50550079243467: kvm_entry: [detail]
hqemu-kvm-2687  [003] d...50550079256103: kvm_exit: [detail]
hqemu-kvm-2687  [003] d...50550079268391: kvm_entry: [detail]
gcomm-3826  [000] d...50550079279266: sched_switch: [detail]
hqemu-kvm-2687  [003] d...50550079280829: kvm_exit: [detail]
hqemu-kvm-2687  [003] d...50550079286028: kvm_entry: [detail]
|
\guest/host

Signed-off-by: Yoshihiro YUNOMAE 
---
 tools/scripts/trace-merge/trace-merge.pl |  109 ++
 1 file changed, 109 insertions(+)
 create mode 100755 tools/scripts/trace-merge/trace-merge.pl

diff --git a/tools/scripts/trace-merge/trace-merge.pl b/tools/scripts/trace-merge/trace-merge.pl
new file mode 100755
index 000..e0b080c
--- /dev/null
+++ b/tools/scripts/trace-merge/trace-merge.pl
@@ -0,0 +1,109 @@
+#!/usr/bin/perl
+#
+# Tool for merging and sorting trace data of a guest and host
+#
+# Created by Yoshihiro YUNOMAE 
+#
+# - How to use
+#   ./trace-merge.pl <TSC offset> <host data> <guest data>
+#
+use strict;
+use bigint;
+
+my %all_data_info = ();
+my @merged_data = ();
+my @sorted_data = ();
+
+&read_all_data();
+&merge_guest_host_data();
+&sort_data_by_tsc();
+&output_data();
+
+sub read_all_data {
+   # TSC offset value is very big ull value.
+   # This value is actually negative, so we calculate the value here.
+   $all_data_info{"tsc_offset"} = &convert_tscoffset($ARGV[0]);
+   if ($all_data_info{"tsc_offset"} == 0) {
+   die "TSC should not be 0";
+   }
+
+   if (!open(HOST_DATA, $ARGV[1])) {
+   die "Cannot open host file: $!"
+   }
+   my @host_data = <HOST_DATA>;
+   close(HOST_DATA);
+
+   if (!open(GUEST_DATA, $ARGV[2])) {
+   die "Cannot open guest file: $!"
+   }
+   my @guest_data = <GUEST_DATA>;
+   close(GUEST_DATA);
+
+   $all_data_info{"host_data"} = \@host_data;
+   $all_data_info{"guest_data"} = \@guest_data;
+}
+
+sub merge_guest_host_data {
+   &guest_push_data();
+   &host_push_data();
+}
+
+sub sort_data_by_tsc {
+   no strict 'refs';
+   @sorted_data = sort {$a->{tsc} <=> $b->{tsc}} @merged_data;
+}
+
+sub output_data {
+   foreach my $line (@sorted_data) {
+   print "$line->{name}$line->{comm}$line->{tsc}$line->{event}\n";
+   }
+}
+
+sub guest_push_data {
+   &make_data_list(1);
+}
+
+sub host_push_data {
+   &make_data_list(0);
+}
+
+#
+# If this function is used for guest's data,
+# subtract TSC offset from guest's TSC value.
+#
+# NOTE: guest's TSC is added TSC offset to actual TSC when the guest boots.
+#
+sub make_data_list {
+   my $is_guest = $_[0];
+   my @data = ();
+   my $name = "";
+   my $list = "";
+   my $tsc_offset = 0;
+
+   if ($is_guest eq 1) {
+   $name = "g";
+   @data = @{$all_data_info{"guest_data"}};
+   $tsc_offset = $all_data_info{"tsc_offset"};
+   } else {
+   $name = "h";
+   @data = @{$all_data_info{"host_data"}};
+   }
+
+   foreach my $line (@data) {
+   chomp($line);
+
+   if ($line =~ /^(.+\[[0-9]+\].{5})([0-9]+)(:.+)/) {
+   $list = {
+   name=> $name,
+   comm=> $1,
+   tsc => $2 - $tsc_offset,
+   event   => $3
+   };
+   push(@merged_data, $list);
+   }
+   }
+}
+
+sub convert_tscoffset {
+   return $_[0] - (1 << 64);
+}



[RFC PATCH 1/2] kvm/vmx: Print TSC_OFFSET information when TSC offset value is written to VMCS

2012-11-13 Thread Yoshihiro YUNOMAE

Print TSC_OFFSET information when TSC offset value is written to VMCS for
measuring actual TSC of a guest.

The TSC value on a guest is always the host TSC plus the guest's "TSC offset".
The TSC offset is stored in the VMCS by vmx_write_tsc_offset() or
vmx_adjust_tsc_offset(). KVM executes the former when a guest boots and the
latter when the kvm clock is updated. The host can read the TSC offset values
from the VMCS, so if the host outputs them, we can calculate the actual TSC
value for each TSC timestamp recorded in the guest's trace data.

Signed-off-by: Yoshihiro YUNOMAE 
Cc: Avi Kivity 
Cc: Marcelo Tosatti 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: H. Peter Anvin 
Cc: Masami Hiramatsu 
Cc: Hidehiro Kawai 
---
 arch/x86/kvm/vmx.c |5 +
 1 file changed, 5 insertions(+)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index ad6b1dd..8edfe3c 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1887,6 +1887,9 @@ static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
 vmcs12->tsc_offset : 0));
} else {
vmcs_write64(TSC_OFFSET, offset);
+   pr_info("kvm: (%d) write TSC offset %llu, now clock %llu\n",
+   current->pid, vmcs_read64(TSC_OFFSET),
+   native_read_tsc());
}
 }
 
@@ -1894,6 +1897,8 @@ static void vmx_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment, bool ho
 {
u64 offset = vmcs_read64(TSC_OFFSET);
vmcs_write64(TSC_OFFSET, offset + adjustment);
+   pr_info("kvm: (%d) adjust TSC offset %llu, now clock %llu\n",
+   current->pid, vmcs_read64(TSC_OFFSET), native_read_tsc());
if (is_guest_mode(vcpu)) {
/* Even when running L2, the adjustment needs to apply to L1 */
to_vmx(vcpu)->nested.vmcs01_tsc_offset += adjustment;



[RFC PATCH 0/2] kvm/vmx: Output TSC offset

2012-11-13 Thread Yoshihiro YUNOMAE

Hi All,

The following patch set makes it possible to sort the unordered trace data of
a guest and a host into chronological order.

In a virtualization environment, it is difficult to analyze performance
problems, such as a delayed I/O request on a guest, because multiple guests
run on the host. One approach to this kind of problem is to sort the trace
data of the guests and the host into chronological order.

After applying the patch set at https://lkml.org/lkml/2012/11/13/588, raw TSC
can be chosen as the ftrace timestamp. TSC is useful for merging trace data
in chronological order for two reasons. First, guests can read the raw TSC
directly from the CPU with the rdtsc instruction. The raw TSC value is not a
software clock like sched_clock, so we don't need to consider how the
timestamp is calculated. Second, the TSC of recent x86 CPUs increments at a
constant rate, so we don't need to worry about the pace of the timestamp.
Choosing TSC as the tracing timestamp is therefore a reasonable way to
integrate the trace data of guests and a host.

There is just one matter to consider when using TSC on guests. The TSC value
on a guest is always the host TSC plus the guest's "TSC offset". In other
words, to merge trace data in chronological order using TSC as the timestamp,
we need to take the guest's TSC offset into account.
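
The relation between guest and host TSC can be written down directly.
The following is a minimal sketch in C and is not part of the patch
set; the function names are hypothetical. It relies on unsigned 64-bit
wraparound, so an offset that is really negative but was printed as a
large unsigned value needs no special handling.

```c
#include <stdint.h>

/*
 * guest_tsc = host_tsc + tsc_offset (mod 2^64), so the host-side
 * timestamp is recovered by subtracting the offset.  Unsigned 64-bit
 * arithmetic wraps, which covers "negative" offsets as well.
 */
uint64_t guest_tsc_to_host(uint64_t guest_tsc, uint64_t tsc_offset)
{
	return guest_tsc - tsc_offset;
}

/*
 * The same offset viewed as a signed quantity.  The conversion is
 * implementation-defined in ISO C for out-of-range values, but it
 * yields the two's complement value on common compilers.
 */
int64_t tsc_offset_signed(uint64_t tsc_offset)
{
	return (int64_t)tsc_offset;
}
```

Once guest timestamps have been converted this way, guest and host
events can be compared on the same time base.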

However, only the host kernel can read the TSC offset from the VMCS, and the
TSC offset is not currently output anywhere. Userland tools therefore cannot
get the TSC offset value, so we cannot merge the trace data of the guest and
the host in chronological order. The TSC offset should be exported for
userland tools.

In this patch set, TSC offset is exported by printk() on the host. I also
attached a tool for merging trace data of a guest and a host in chronological
order.


Suppose that the wakeup latency of a command on a guest is large. Normally we
would use ftrace's wakeup-latency tracer or event tracer on the guest, but
that may not be enough to solve the problem, because guests often exit to the
host for various reasons. Instead, we use TSC as ftrace's timestamp and record
trace data on both the guest and the host. Then we get the following data:

 /* guest data */
comm-3826  [000] d...49836825726903: sched_wakeup: [detail]
comm-3826  [000] d...49836832225344: sched_switch: [detail]
 /* host data */
qemu-kvm-2687  [003] d...50550079203669: kvm_exit: [detail]
qemu-kvm-2687  [003] d...50550079206816: kvm_entry: [detail]
qemu-kvm-2687  [003] d...50550079240656: kvm_exit: [detail]
qemu-kvm-2687  [003] d...50550079243467: kvm_entry: [detail]
qemu-kvm-2687  [003] d...50550079256103: kvm_exit: [detail]
qemu-kvm-2687  [003] d...50550079268391: kvm_entry: [detail]
qemu-kvm-2687  [003] d...50550079280829: kvm_exit: [detail]
qemu-kvm-2687  [003] d...50550079286028: kvm_entry: [detail]

Since the TSC offset is not taken into account, these data cannot be merged.
If the trace data were shown as follows, we would be able to understand the
reason:

qemu-kvm-2687  [003] d...50550079203669: kvm_exit: [detail]
qemu-kvm-2687  [003] d...50550079206816: kvm_entry: [detail]
comm-3826  [000] d.h.49836825726903: sched_wakeup: [detail] <=
qemu-kvm-2687  [003] d...50550079240656: kvm_exit: [detail]
qemu-kvm-2687  [003] d...50550079243467: kvm_entry: [detail]
qemu-kvm-2687  [003] d...50550079256103: kvm_exit: [detail]
qemu-kvm-2687  [003] d...50550079268391: kvm_entry: [detail]
comm-3826  [000] d...49836832225344: sched_switch: [detail] <=
qemu-kvm-2687  [003] d...50550079280829: kvm_exit: [detail]
qemu-kvm-2687  [003] d...50550079286028: kvm_entry: [detail]

In this case, we can see that the wakeup latency was large because the guest
exited to the host twice. Getting the data sorted in chronological order like
this is our goal.
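
The merge step itself can be sketched as follows. This is only a minimal
illustration in C, not the attached merge script: the struct trace_event
type and the merge_traces function are hypothetical, each input is assumed
to be already sorted by its own TSC, and guest timestamps are assumed to
satisfy guest TSC = host TSC + TSC offset (mod 2^64).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record: one trace line reduced to its timestamp and origin. */
struct trace_event {
    uint64_t tsc;    /* raw TSC timestamp of the event */
    char origin;     /* 'h' for host, 'g' for guest */
};

/*
 * Merge host[] and guest[] (each sorted by tsc) into out[], which must have
 * room for nh + ng entries. Guest timestamps are shifted onto the host
 * timeline by subtracting tsc_offset. Returns the number of events written.
 */
static size_t merge_traces(const struct trace_event *host, size_t nh,
                           const struct trace_event *guest, size_t ng,
                           uint64_t tsc_offset, struct trace_event *out)
{
    size_t i = 0, j = 0, k = 0;

    while (i < nh || j < ng) {
        /* Guest timestamp converted to host time (unused if j >= ng). */
        uint64_t g = (j < ng) ? guest[j].tsc - tsc_offset : 0;

        if (j >= ng || (i < nh && host[i].tsc <= g)) {
            out[k++] = host[i++];
        } else {
            out[k] = guest[j++];
            out[k].tsc = g;    /* store the converted timestamp */
            k++;
        }
    }
    return k;
}
```

Ties are resolved in favour of the host event here; that is an arbitrary
choice made for the sketch.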

To merge the data as in the listing above, we apply this patch set. Then we
can get the TSC offset of the guest as follows:

$ dmesg | grep kvm
[   57.717180] kvm: (2687) write TSC offset 18446743360465545001, now clock ##
                      |                              |                      |
                     PID                        TSC offset      HOST TSC value

We pass this TSC offset value to a merge script and obtain the following data:

$ ./trace-merge.pl 18446743360465545001 host.data guest.data
hqemu-kvm-2687  [003] d...50550079203669: kvm_exit: [detail]
hqemu-kvm-2687  [003] d...50550079206816: kvm_entry: [detail]
gcomm-3826  [000] d.h.50550079226331: sched_wakeup: [detail] <=
hqemu-kvm-2687  [003] d...50550079240656: kvm_exit: [detail]
hqemu-kvm-2687  [003] d...50550079243467: kvm_entry: [detail]
hqemu-kvm-2687  [003] d...50550079256103: kvm_exit: [detail]
hqemu

virtio + vhost-net performance issue - preadv ?

2012-11-13 Thread Ben Clay

I have a working copy of libvirt 0.10.2 + qemu 1.2 installed on a vanilla
up-to-date (2.6.32-279.9.1) CentOS 6 host, and get very good VM <-> VM
network performance (both running on the same host) using virtio.  I have
cgroups set to cap the VMs at 10Gbps and iperf shows I'm getting exactly
10Gbps.

I copied these VMs to a CentOS 5 host and installed libvirt 1.0 + qemu 1.2.
However, the best performance I can get in between the VMs (again running on
the same host) is ~2Gbps.  In both cases, this is over a bridged interface
with static IPs assigned to each VM.  I've also tried virtual networking
with NAT or routing, yielding the same results.

I figured it was due to vhost-net missing on the older CentOS 5 kernel, so I
installed 2.6.39-4.2 from ELRepo and got the /dev/vhost-net device and vhost
processes associated with each VM:

]$ lsmod | grep vhost
vhost_net  28446  2
tun        23888  7 vhost_net

]$ ps aux | grep vhost-
root  9628  0.0  0.0  0 0 ?  S  17:57   0:00 [vhost-9626]
root  9671  0.0  0.0  0 0 ?  S  17:57   0:00 [vhost-9670]

]$ ls /dev/vhost-net -al
crw--- 1 root root 10, 58 Nov 13 15:19 /dev/vhost-net

After installing the new kernel, I also tried rebuilding libvirt and qemu,
to no avail.  I also disabled cgroups, just in case it was getting in the
way, as well as iptables.  I can see the virtio_net module loaded inside the
guest, and using virtio raises my performance from <400Mbps to 2Gbps, so it
does make some improvement.

The only differences between the two physical hosts that I can find are:

- qemu on the CentOS 5 host builds without preadv support - would this make
such a huge performance difference?  CentOS5 only comes with an old version
of glibc, which is missing preadv
- qemu on the CentOS 5 host builds without PIE
- libvirt 1.0 was required on the CentOS 5 host, since 0.10.2 had a build
bug. I don't think this should matter.
- I haven't tried rebuilding the VMs from scratch on the CentOS5 host, which
I guess is worth a try.

The qemu process is being started with virtio + vhost:

/usr/bin/qemu-system-x86_64 -name vmname -S -M pc-1.2 -enable-kvm -m 4096 \
  -smp 8,sockets=8,cores=1,threads=1 \
  -uuid 212915ed-a34a-4d6d-68f5-2216083a7693 \
  -no-user-config -nodefaults \
  -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/vmname.monitor,server,nowait \
  -mon chardev=charmonitor,id=monitor,mode=control \
  -rtc base=utc -no-shutdown \
  -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 \
  -drive file=/mnt/vmname/disk.img,if=none,id=drive-virtio-disk0,format=raw,cache=none \
  -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
  -netdev tap,fd=16,id=hostnet0,vhost=on,vhostfd=18 \
  -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:11:22:33:44:55,bus=pci.0,addr=0x3 \
  -chardev pty,id=charserial0 \
  -device isa-serial,chardev=charserial0,id=serial0 \
  -device usb-tablet,id=input0 \
  -vnc 127.0.0.1:1 -vga cirrus \
  -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5

The relevant part of my libvirt config, of which I've tried omitting the
target, alias and address elements with no difference in performance:

   
  
  
  
  
  
  


Is there something else which could be getting in the way here?

Thanks!

Ben Clay
rbc...@ncsu.edu



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/3] KVM: do not kfree error pointer

2012-11-13 Thread Marcelo Tosatti

On Fri, Nov 02, 2012 at 06:33:21PM +0800, Guo Chao wrote:
> We should avoid kfree()ing error pointer in kvm_vcpu_ioctl() and
> kvm_arch_vcpu_ioctl().
> 
> Signed-off-by: Guo Chao 

Applied all, thanks.


Re: [PATCH v2 0/5] s390: Guest support for virtio-ccw.

2012-11-13 Thread Marcelo Tosatti

On Tue, Oct 30, 2012 at 04:56:38PM +0100, Cornelia Huck wrote:
> Hi,
> 
> here's the respin of the virtio-ccw guest support patches
> (from http://marc.info/?l=kvm&m=135151606921361&w=2).
> 
> Changes to the last version:
> 
> - cc'ed stable for patch 1
> - coding style fixes in patches 4 and 5
> 
> Cornelia Huck (5):
>   KVM: s390: Handle hosts not supporting s390-virtio.
>   s390: Move css limits from drivers/s390/cio/ to include/asm/.
>   s390: Add a mechanism to get the subchannel id.
>   KVM: s390: Add a channel I/O based virtio transport driver.
>   KVM: s390: Split out early console code.
> 
>  arch/s390/include/asm/ccwdev.h  |   5 +
>  arch/s390/include/asm/cio.h |   2 +
>  arch/s390/include/asm/irq.h |   1 +
>  arch/s390/kernel/irq.c  |   1 +
>  drivers/s390/cio/css.h  |   3 -
>  drivers/s390/cio/device_ops.c   |  12 +
>  drivers/s390/kvm/Makefile   |   2 +-
>  drivers/s390/kvm/early_printk.c |  42 ++
>  drivers/s390/kvm/kvm_virtio.c   |  64 ++-
>  drivers/s390/kvm/virtio_ccw.c   | 843 
> 
>  10 files changed, 938 insertions(+), 37 deletions(-)
>  create mode 100644 drivers/s390/kvm/early_printk.c
>  create mode 100644 drivers/s390/kvm/virtio_ccw.c
> 
> -- 
> 1.7.12.4

Reviewed-by: Marcelo Tosatti 



Re: [RFC PATCH v3 0/5] s390: Host support for channel I/O.

2012-11-13 Thread Marcelo Tosatti

On Wed, Oct 31, 2012 at 05:24:33PM +0100, Cornelia Huck wrote:
> Hi,
> 
> here's the latest incarnation of my host patches to support channel
> I/O on s390.
> 
> Most patches have only seen minor fixes, but patch 5 is completely
> different since the kvm <-> user space interface has been reworked.
> 
> We now handle only interrupt-related operations in kvm. This
> includes two channel I/O instructions that can dequeue pending I/O
> interrupts: tpi and tsch (not the part actually interacting with
> the subchannel). This makes the interface less complex (only one
> new exit for tsch handling) and avoids duplicating code from qemu.
> 
> Cornelia Huck (5):
>   KVM: s390: Support for I/O interrupts.
>   KVM: s390: Add support for machine checks.
>   KVM: s390: In-kernel handling of I/O instructions.
>   KVM: s390: Base infrastructure for enabling capabilities.
>   KVM: s390: Add support for channel I/O instructions.
> 
>  Documentation/virtual/kvm/api.txt |  40 +-
>  arch/s390/include/asm/kvm_host.h  |  11 ++
>  arch/s390/kvm/intercept.c |  22 ++-
>  arch/s390/kvm/interrupt.c | 264 +++-
>  arch/s390/kvm/kvm-s390.c  |  38 ++
>  arch/s390/kvm/kvm-s390.h  |   6 +
>  arch/s390/kvm/priv.c  | 275 
> +++---
>  arch/s390/kvm/trace-s390.h|  26 +++-
>  include/linux/kvm.h   |  18 +++
>  include/trace/events/kvm.h|   2 +-
>  10 files changed, 673 insertions(+), 29 deletions(-)
> 
> -- 
> 1.7.12.4

Reviewed-by: Marcelo Tosatti 



Re: [patch 14/18] time: export time information for KVM pvclock

2012-11-13 Thread Marcelo Tosatti

On Fri, Nov 09, 2012 at 05:02:52PM -0800, John Stultz wrote:
> On 10/24/2012 06:13 AM, Marcelo Tosatti wrote:
> >As suggested by John, export time data similarly to how it's
> >done by vsyscall support. This allows KVM to retrieve necessary
> >information to implement vsyscall support in KVM guests.
> >
> >Signed-off-by: Marcelo Tosatti 
> Thanks Marcelo, I like this much better than what you were proposing
> privately earlier!
> 
> Fairly minor nit below.
> 
> >Index: vsyscall/kernel/time/timekeeping.c
> >===
> >--- vsyscall.orig/kernel/time/timekeeping.c
> >+++ vsyscall/kernel/time/timekeeping.c
> >@@ -21,6 +21,7 @@
> >  #include 
> >  #include 
> >  #include 
> >+#include 
> >
> >
> >  static struct timekeeper timekeeper;
> >@@ -180,6 +181,79 @@ static inline s64 timekeeping_get_ns_raw
> > return nsec + arch_gettimeoffset();
> >  }
> >
> >+static RAW_NOTIFIER_HEAD(pvclock_gtod_chain);
> >+
> >+/**
> >+ * pvclock_gtod_register_notifier - register a pvclock timedata update 
> >listener
> >+ *
> >+ * Must hold write on timekeeper.lock
> >+ */
> >+int pvclock_gtod_register_notifier(struct notifier_block *nb)
> >+{
> >+struct timekeeper *tk = &timekeeper;
> >+unsigned long flags;
> >+int ret;
> >+
> >+write_seqlock_irqsave(&tk->lock, flags);
> >+ret = raw_notifier_chain_register(&pvclock_gtod_chain, nb);
> >+write_sequnlock_irqrestore(&tk->lock, flags);
> >+
> >+return ret;
> >+}
> >+EXPORT_SYMBOL_GPL(pvclock_gtod_register_notifier);
> >+
> >+/**
> >+ * pvclock_gtod_unregister_notifier - unregister a pvclock
> >+ * timedata update listener
> >+ *
> >+ * Must hold write on timekeeper.lock
> >+ */
> >+int pvclock_gtod_unregister_notifier(struct notifier_block *nb)
> >+{
> >+struct timekeeper *tk = &timekeeper;
> >+unsigned long flags;
> >+int ret;
> >+
> >+write_seqlock_irqsave(&tk->lock, flags);
> >+ret = raw_notifier_chain_unregister(&pvclock_gtod_chain, nb);
> >+write_sequnlock_irqrestore(&tk->lock, flags);
> >+
> >+return ret;
> >+}
> >+EXPORT_SYMBOL_GPL(pvclock_gtod_unregister_notifier);
> >+
> >+struct pvclock_gtod_data pvclock_gtod_data;
> >+EXPORT_SYMBOL_GPL(pvclock_gtod_data);
> >+
> >+static void update_pvclock_gtod(struct timekeeper *tk)
> >+{
> >+struct pvclock_gtod_data *vdata = &pvclock_gtod_data;
> >+
> >+write_seqcount_begin(&vdata->seq);
> >+
> >+/* copy pvclock gtod data */
> >+vdata->clock.vclock_mode= tk->clock->archdata.vclock_mode;
> >+vdata->clock.cycle_last = tk->clock->cycle_last;
> >+vdata->clock.mask   = tk->clock->mask;
> >+vdata->clock.mult   = tk->mult;
> >+vdata->clock.shift  = tk->shift;
> >+
> >+vdata->monotonic_time_sec   = tk->xtime_sec
> >++ tk->wall_to_monotonic.tv_sec;
> >+vdata->monotonic_time_snsec = tk->xtime_nsec
> >++ (tk->wall_to_monotonic.tv_nsec
> >+<< tk->shift);
> >+while (vdata->monotonic_time_snsec >=
> >+(((u64)NSEC_PER_SEC) << tk->shift)) {
> >+vdata->monotonic_time_snsec -=
> >+((u64)NSEC_PER_SEC) << tk->shift;
> >+vdata->monotonic_time_sec++;
> >+}
> >+
> >+write_seqcount_end(&vdata->seq);
> >+raw_notifier_call_chain(&pvclock_gtod_chain, 0, NULL);
> >+}
> >+
> 
> My only request is could the update_pvclock_gtod() be implemented
> similarly to the update_vsyscall, where the update function lives in
> the pvclock code (maybe using a weak symbol or something) so we
> don't have to have all these pvclock details in the timekeeping
> core?
> 
> thanks
> -john

In KVM code, yes, no problem.
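
For reference, the carry loop in the quoted update_pvclock_gtod() can be
looked at in isolation. The sketch below is a standalone illustration, not
the kernel code: struct mono_time and normalize_shifted are hypothetical
names, and the only assumption taken from the patch is that the nanosecond
field is kept left-shifted by the clocksource shift, so one second equals
NSEC_PER_SEC << shift in that field.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical container for a seconds / shifted-nanoseconds pair. */
struct mono_time {
    uint64_t sec;      /* whole seconds */
    uint64_t snsec;    /* nanoseconds, left-shifted by 'shift' */
};

/*
 * Carry whole seconds out of the shifted-nanosecond field, mirroring the
 * while loop in the quoted patch.
 */
static struct mono_time normalize_shifted(uint64_t sec, uint64_t snsec,
                                          unsigned int shift)
{
    const uint64_t one_sec = NSEC_PER_SEC << shift;
    struct mono_time t = { sec, snsec };

    while (t.snsec >= one_sec) {
        t.snsec -= one_sec;
        t.sec++;
    }
    return t;
}
```

For example, with shift = 8, an input of 10 s plus (1500000000 << 8) shifted
nanoseconds normalizes to 11 s plus (500000000 << 8).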


Re: [PATCH v3 2/2] KVM: make crash_clear_loaded_vmcss valid when loading kvm_intel module

2012-11-13 Thread Marcelo Tosatti

On Thu, Nov 01, 2012 at 01:55:04PM +0800, zhangyanfei wrote:
> On 2012-10-31 17:01, Hatayama, Daisuke wrote:
> > 
> > 
> >> -Original Message-
> >> From: kexec-boun...@lists.infradead.org
> >> [mailto:kexec-boun...@lists.infradead.org] On Behalf Of zhangyanfei
> >> Sent: Wednesday, October 31, 2012 12:34 PM
> >> To: x...@kernel.org; ke...@lists.infradead.org; Avi Kivity; Marcelo
> >> Tosatti
> >> Cc: linux-ker...@vger.kernel.org; kvm@vger.kernel.org
> >> Subject: [PATCH v3 2/2] KVM: make crash_clear_loaded_vmcss valid when
> >> loading kvm_intel module
> >>
> >> Signed-off-by: Zhang Yanfei 
> > 
> > [...]
> > 
> >> @@ -7230,6 +7231,10 @@ static int __init vmx_init(void)
> >>if (r)
> >>goto out3;
> >>
> >> +#ifdef CONFIG_KEXEC
> >> +  crash_clear_loaded_vmcss = vmclear_local_loaded_vmcss;
> >> +#endif
> >> +
> > 
> > Assignment here cannot cover the case where an NMI is initiated after VMX is 
> > on in kvm_init and before vmclear_local_loaded_vmcss is assigned. This is 
> > rare, but it can happen.
> > 
> 
> By saying "VMX is on in kvm init", do you mean that kvm_init enables the VMX
> feature in the logical processor?
> No, KVM enables the VMX feature only when a vcpu is about to be created.
> 
> I think it makes no difference whether this assignment comes before or after
> kvm_init, because the vmcs linked list must be empty before vmx_init has
> finished.

The list is not initialized before hardware_enable(), though. Should
move the assignment after that.

Also, it is possible that the loaded_vmcss_on_cpu list is being modified
_while_ crash executes say via NMI, correct? If that is the case, better
flag that the list is under manipulation so the vmclear can be skipped.
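
One possible shape of such a flag, purely as a userspace sketch and not as
the eventual kernel patch: every name below (struct loaded_vmcs_list,
under_update, list_add_entry, crash_clear_list) is hypothetical, the list is
reduced to a counter, and the sketch assumes the crash path interrupts the
same CPU that is modifying its own per-CPU list.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a per-CPU list of loaded VMCSs. */
struct loaded_vmcs_list {
    volatile bool under_update;   /* set while the list is being modified */
    int nr_entries;               /* placeholder for the real linked list */
};

/* Normal path: mark the list busy around the modification. */
static void list_add_entry(struct loaded_vmcs_list *l)
{
    l->under_update = true;
    atomic_signal_fence(memory_order_seq_cst);  /* keep flag ahead of update */
    l->nr_entries++;
    atomic_signal_fence(memory_order_seq_cst);  /* finish update before clear */
    l->under_update = false;
}

/*
 * Crash path (think: NMI on the same CPU). Returns the number of entries
 * cleared, or -1 if the list was mid-update and the clear was skipped.
 */
static int crash_clear_list(struct loaded_vmcs_list *l)
{
    int n;

    if (l->under_update)
        return -1;

    n = l->nr_entries;
    l->nr_entries = 0;
    return n;
}
```

Skipping the clear trades completeness for safety: a half-updated list is
never walked from the crash path.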

> Thanks
> Zhang Yanfei
> 
> > What happens if vmclear_local_loaded_vmcss is called before kvm_init? I 
> > think it is no problem, since the list is initially empty.
> > 
> >>vmx_disable_intercept_for_msr(MSR_FS_BASE, false);
> >>vmx_disable_intercept_for_msr(MSR_GS_BASE, false);
> >>vmx_disable_intercept_for_msr(MSR_KERNEL_GS_BASE, true);
> >> @@ -7265,6 +7270,10 @@ static void __exit vmx_exit(void)
> >>free_page((unsigned long)vmx_io_bitmap_b);
> >>free_page((unsigned long)vmx_io_bitmap_a);
> >>
> >> +#ifdef CONFIG_KEXEC
> >> +  crash_clear_loaded_vmcss = NULL;
> >> +#endif
> >> +
> >>kvm_exit();
> >>  }
> > 
> > Also, this is converse to the above.
> > 
> > Thanks.
> > HATAYAMA, Daisuke
> > 
> > 
> 

Re: Fix lapic time counter read for periodic mode

2012-11-13 Thread Marcelo Tosatti

On Tue, Nov 13, 2012 at 08:52:54AM +0100, Christian Ehrhardt wrote:
> 
> Hi,
> 
> thanks for your reply.
> 
> On Mon, Nov 12, 2012 at 07:32:37PM -0200, Marcelo Tosatti wrote:
> > > there is a bug in the emulation of the lapic time counter. In particular
> > > what we are seeing is that the time counter of a periodic lapic timer
> > > in the guest reads as zero 99% of the time. The patch below fixes that.
> > > 
> > > The emulation of the lapic timer is done with the help of a hires
> > > timer that expires with the same frequency as the lapic counter.
> > > New expiration times for a periodic timer are calculated incrementally
> > > based on the last scheduled expiration time. This ensures long term
> > > accuracy of the emulated timer close to that of the underlying clock.
> > > 
> > > The actual value of the lapic time counter is calculated from the
> > > real time difference between current time and scheduled expiration time
> > > of the hires timer. If this difference is negative, the hires timer
> > > expired. For oneshot mode this is correctly translated into a zero value
> > > for the time counter. However, in periodic mode we must use the negative
> > > difference unmodified.
> > > 
> > >  regards   Christian
> > > 
> > > Fix lapic time counter read for periodic mode.
> > 
> > In periodic mode the hrtimer is rearmed once expired, see
> > apic_timer_fn. So _get_remaining should return proper value
> > even if the guest is not able to process timer interrupts. 
> > 
> > Can you describe your specific scenario in more detail?
> 
> In my specific case, the host is admittedly somewhat special as it
> already is a rehosted version of linux, i.e. not running directly on
> native hardware. It is still unclear if the host has sufficiently accurate
> timer interrupts. This is most likely part of the problems we are seeing.
> 
> However, AFAICS apic_timer_fn is only called once per jiffy (at least in
> some configurations). In particular, it is not called by
> hrtimer_get_remaining. Thus, depending on the frequency of the LAPIC timer
> in the guest, there might be _several_ iterations that are missed. This can
> probably be mitigated by hires timer interrupts. However, I think
> the problem is still there even in that case.
> 
> Additionally, the behaviour that I want to establish matches that of the
> PIT timer (in a not completely obvious way, though).
> 
> Having said that the proposed patch in my first mail is incomplete, as
> the mod_64 does not work correctly for negative values. A fixed version
> is below.
> 
>  regards Christian
> 
> Signed-off-by: Christian Ehrhardt 

Alright. Please add a comment from the LAPIC documentation describing
this behaviour (and a nice changelog). Thanks.

> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 43e9fad..ec7242c 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -810,11 +810,22 @@ static u32 apic_get_tmcct(struct kvm_lapic *apic)
>   if (kvm_apic_get_reg(apic, APIC_TMICT) == 0)
>   return 0;
>  
> + /*
> +  * hrtimer_get_remaining returns the signed difference between
> +  * timer expiration time and current time. Keep negative return
> +  * values iff the timer is periodic.
> +  */
>   remaining = hrtimer_get_remaining(&apic->lapic_timer.timer);
> - if (ktime_to_ns(remaining) < 0)
> - remaining = ktime_set(0, 0);
> + ns = ktime_to_ns(remaining);
> + if (unlikely(ns < 0)) {
> + if (apic_lvtt_period(apic))
> + ns = apic->lapic_timer.period -
> + mod_64(-ns, apic->lapic_timer.period);
> + else
> + ns = 0;
> + }
>  
> - ns = mod_64(ktime_to_ns(remaining), apic->lapic_timer.period);
> + ns = mod_64(ns, apic->lapic_timer.period);
>   tmcct = div64_u64(ns,
>(APIC_BUS_CYCLE_NS * apic->divide_count));
>  
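
The arithmetic in the quoted hunk can be checked in isolation with a small
standalone sketch. This is not the kernel function: lapic_remaining_ns is a
hypothetical name, plain integers replace ktime_t, and the % operator stands
in for mod_64. The point it illustrates is that a negative remaining time in
periodic mode is folded back into the current period instead of being
clamped to zero.

```c
#include <stdint.h>

/*
 * Hypothetical helper: time left in the current timer period, in ns.
 * 'remaining' is the signed difference (expiry - now); 'period' is the
 * timer period in ns; 'periodic' is non-zero for periodic mode.
 */
static uint64_t lapic_remaining_ns(int64_t remaining, uint64_t period,
                                   int periodic)
{
    uint64_t ns;

    if (period == 0)
        return 0;               /* timer not programmed */

    if (remaining < 0) {
        uint64_t overdue;

        if (!periodic)
            return 0;           /* oneshot: an expired timer reads as zero */

        /* How far past the scheduled expiry we are, without overflow. */
        overdue = 0 - (uint64_t)remaining;
        ns = period - overdue % period;
    } else {
        ns = (uint64_t)remaining;
    }

    /* Final reduction, as in the patch; maps a full period back to 0. */
    return ns % period;
}
```

For a 1000 ns period, being 300 ns or 2300 ns past the scheduled expiry both
give 700 ns remaining in periodic mode, while oneshot mode gives 0.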

[PATCH 3/3] vfio-pci: Use common msi_get_message

2012-11-13 Thread Alex Williamson

We can get rid of our local version now that a helper exists.

Signed-off-by: Alex Williamson 
---
 hw/vfio_pci.c |   24 +---
 1 file changed, 1 insertion(+), 23 deletions(-)

diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
index 4e9c2dd..7c27834 100644
--- a/hw/vfio_pci.c
+++ b/hw/vfio_pci.c
@@ -688,28 +688,6 @@ static void vfio_msix_vector_release(PCIDevice *pdev, unsigned int nr)
 vector->use = false;
 }
 
-/* TODO This should move to msi.c */
-static MSIMessage msi_get_msg(PCIDevice *pdev, unsigned int vector)
-{
-uint16_t flags = pci_get_word(pdev->config + pdev->msi_cap + PCI_MSI_FLAGS);
-bool msi64bit = flags & PCI_MSI_FLAGS_64BIT;
-MSIMessage msg;
-
-if (msi64bit) {
-msg.address = pci_get_quad(pdev->config +
-   pdev->msi_cap + PCI_MSI_ADDRESS_LO);
-} else {
-msg.address = pci_get_long(pdev->config +
-   pdev->msi_cap + PCI_MSI_ADDRESS_LO);
-}
-
-msg.data = pci_get_word(pdev->config + pdev->msi_cap +
-(msi64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32));
-msg.data += vector;
-
-return msg;
-}
-
 static void vfio_enable_msix(VFIODevice *vdev)
 {
 vfio_disable_interrupts(vdev);
@@ -748,7 +726,7 @@ retry:
 error_report("vfio: Error: event_notifier_init failed\n");
 }
 
-msg = msi_get_msg(&vdev->pdev, i);
+msg = msi_get_message(&vdev->pdev, i);
 
 /*
  * Attempt to enable route through KVM irqchip,


[PATCH 2/3] vfio-pci: Add KVM INTx acceleration

2012-11-13 Thread Alex Williamson

This makes use of the new level irqfd support enabling bypass of qemu
userspace both on INTx injection and unmask.  This significantly
boosts the performance of devices making use of legacy interrupts (ex.
~60% better netperf TCP_RR scores for an e1000e assigned to a Linux
guest and booted with pci=nomsi).  This also avoids flipping mmaps on
and off to simulate EOIs, so greatly improves performance of device
access in addition to interrupt latency.

Signed-off-by: Alex Williamson 
---
 hw/vfio_pci.c |  186 +
 1 file changed, 186 insertions(+)

diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
index 0473ae8..4e9c2dd 100644
--- a/hw/vfio_pci.c
+++ b/hw/vfio_pci.c
@@ -185,6 +185,21 @@ static void vfio_unmask_intx(VFIODevice *vdev)
 ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
 }
 
+#ifdef CONFIG_KVM /* Unused outside of CONFIG_KVM code */
+static void vfio_mask_intx(VFIODevice *vdev)
+{
+struct vfio_irq_set irq_set = {
+.argsz = sizeof(irq_set),
+.flags = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_MASK,
+.index = VFIO_PCI_INTX_IRQ_INDEX,
+.start = 0,
+.count = 1,
+};
+
+ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
+}
+#endif
+
 /*
  * Disabling BAR mmaping can be slow, but toggling it around INTx can
  * also be a huge overhead.  We try to get the best of both worlds by
@@ -248,6 +263,161 @@ static void vfio_eoi(VFIODevice *vdev)
 vfio_unmask_intx(vdev);
 }
 
+static void vfio_enable_intx_kvm(VFIODevice *vdev)
+{
+#ifdef CONFIG_KVM
+struct kvm_irqfd irqfd = {
+.fd = event_notifier_get_fd(&vdev->intx.interrupt),
+.gsi = vdev->intx.route.irq,
+.flags = KVM_IRQFD_FLAG_RESAMPLE,
+};
+struct vfio_irq_set *irq_set;
+int ret, argsz;
+int32_t *pfd;
+
+if (!kvm_irqchip_in_kernel() ||
+vdev->intx.route.mode != PCI_INTX_ENABLED ||
+!kvm_check_extension(kvm_state, KVM_CAP_IRQFD_RESAMPLE)) {
+return;
+}
+
+/* Get to a known interrupt state */
+qemu_set_fd_handler(irqfd.fd, NULL, NULL, vdev);
+vfio_mask_intx(vdev);
+vdev->intx.pending = false;
+qemu_set_irq(vdev->pdev.irq[vdev->intx.pin], 0);
+
+/* Get an eventfd for resample/unmask */
+if (event_notifier_init(&vdev->intx.unmask, 0)) {
+error_report("vfio: Error: event_notifier_init failed eoi\n");
+goto fail;
+}
+
+/* KVM triggers it, VFIO listens for it */
+irqfd.resamplefd = event_notifier_get_fd(&vdev->intx.unmask);
+
+if (kvm_vm_ioctl(kvm_state, KVM_IRQFD, &irqfd)) {
+error_report("vfio: Error: Failed to setup resample irqfd: %m\n");
+goto fail_irqfd;
+}
+
+argsz = sizeof(*irq_set) + sizeof(*pfd);
+
+irq_set = g_malloc0(argsz);
+irq_set->argsz = argsz;
+irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_UNMASK;
+irq_set->index = VFIO_PCI_INTX_IRQ_INDEX;
+irq_set->start = 0;
+irq_set->count = 1;
+pfd = (int32_t *)&irq_set->data;
+
+*pfd = irqfd.resamplefd;
+
+ret = ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, irq_set);
+g_free(irq_set);
+if (ret) {
+error_report("vfio: Error: Failed to setup INTx unmask fd: %m\n");
+goto fail_vfio;
+}
+
+/* Let'em rip */
+vfio_unmask_intx(vdev);
+
+vdev->intx.kvm_accel = true;
+
+DPRINTF("%s(%04x:%02x:%02x.%x) KVM INTx accel enabled\n",
+__func__, vdev->host.domain, vdev->host.bus,
+vdev->host.slot, vdev->host.function);
+
+return;
+
+fail_vfio:
+irqfd.flags = KVM_IRQFD_FLAG_DEASSIGN;
+kvm_vm_ioctl(kvm_state, KVM_IRQFD, &irqfd);
+fail_irqfd:
+event_notifier_cleanup(&vdev->intx.unmask);
+fail:
+qemu_set_fd_handler(irqfd.fd, vfio_intx_interrupt, NULL, vdev);
+vfio_unmask_intx(vdev);
+#endif
+}
+
+static void vfio_disable_intx_kvm(VFIODevice *vdev)
+{
+#ifdef CONFIG_KVM
+struct kvm_irqfd irqfd = {
+.fd = event_notifier_get_fd(&vdev->intx.interrupt),
+.gsi = vdev->intx.route.irq,
+.flags = KVM_IRQFD_FLAG_DEASSIGN,
+};
+
+if (!vdev->intx.kvm_accel) {
+return;
+}
+
+/*
+ * Get to a known state, hardware masked, QEMU ready to accept new
+ * interrupts, QEMU IRQ de-asserted.
+ */
+vfio_mask_intx(vdev);
+vdev->intx.pending = false;
+qemu_set_irq(vdev->pdev.irq[vdev->intx.pin], 0);
+
+/* Tell KVM to stop listening for an INTx irqfd */
+if (kvm_vm_ioctl(kvm_state, KVM_IRQFD, &irqfd)) {
+error_report("vfio: Error: Failed to disable INTx irqfd: %m\n");
+}
+
+/* We only need to close the eventfd for VFIO to cleanup the kernel side */
+event_notifier_cleanup(&vdev->intx.unmask);
+
+/* QEMU starts listening for interrupt events. */
+qemu_set_fd_handler(irqfd.fd, vfio_intx_interrupt, NULL, vdev);
+
+vdev->intx.kvm_accel = false;
+
+/* If we've missed an event, let it re-fire through QEMU */
+vfio_unmask_intx(vdev);
+
+DP

[PATCH 1/3] linux-headers: Update to 3.7-rc5

2012-11-13 Thread Alex Williamson

update-linux-headers.sh script run against Linux tag v3.7-rc5

Signed-off-by: Alex Williamson 
---
 linux-headers/asm-powerpc/kvm_para.h |6 +++---
 linux-headers/asm-s390/kvm_para.h|8 +---
 linux-headers/asm-x86/kvm.h  |   17 +
 linux-headers/linux/kvm.h|   25 +
 linux-headers/linux/kvm_para.h   |6 +++---
 linux-headers/linux/vfio.h   |6 +++---
 linux-headers/linux/virtio_config.h  |6 +++---
 linux-headers/linux/virtio_ring.h|6 +++---
 8 files changed, 54 insertions(+), 26 deletions(-)

diff --git a/linux-headers/asm-powerpc/kvm_para.h b/linux-headers/asm-powerpc/kvm_para.h
index c047a84..5e04383 100644
--- a/linux-headers/asm-powerpc/kvm_para.h
+++ b/linux-headers/asm-powerpc/kvm_para.h
@@ -17,8 +17,8 @@
  * Authors: Hollis Blanchard 
  */
 
-#ifndef __POWERPC_KVM_PARA_H__
-#define __POWERPC_KVM_PARA_H__
+#ifndef _UAPI__POWERPC_KVM_PARA_H__
+#define _UAPI__POWERPC_KVM_PARA_H__
 
 #include 
 
@@ -87,4 +87,4 @@ struct kvm_vcpu_arch_shared {
 #define KVM_MAGIC_FEAT_MAS0_TO_SPRG7   (1 << 1)
 
 
-#endif /* __POWERPC_KVM_PARA_H__ */
+#endif /* _UAPI__POWERPC_KVM_PARA_H__ */
diff --git a/linux-headers/asm-s390/kvm_para.h b/linux-headers/asm-s390/kvm_para.h
index 870051f..ff1f4e7 100644
--- a/linux-headers/asm-s390/kvm_para.h
+++ b/linux-headers/asm-s390/kvm_para.h
@@ -1,5 +1,5 @@
 /*
- * definition for paravirtual devices on s390
+ * User API definitions for paravirtual devices on s390
  *
  * Copyright IBM Corp. 2008
  *
@@ -9,9 +9,3 @@
  *
  *Author(s): Christian Borntraeger 
  */
-
-#ifndef __S390_KVM_PARA_H
-#define __S390_KVM_PARA_H
-
-
-#endif /* __S390_KVM_PARA_H */
diff --git a/linux-headers/asm-x86/kvm.h b/linux-headers/asm-x86/kvm.h
index 246617e..a65ec29 100644
--- a/linux-headers/asm-x86/kvm.h
+++ b/linux-headers/asm-x86/kvm.h
@@ -9,6 +9,22 @@
 #include 
 #include 
 
+#define DE_VECTOR 0
+#define DB_VECTOR 1
+#define BP_VECTOR 3
+#define OF_VECTOR 4
+#define BR_VECTOR 5
+#define UD_VECTOR 6
+#define NM_VECTOR 7
+#define DF_VECTOR 8
+#define TS_VECTOR 10
+#define NP_VECTOR 11
+#define SS_VECTOR 12
+#define GP_VECTOR 13
+#define PF_VECTOR 14
+#define MF_VECTOR 16
+#define MC_VECTOR 18
+
 /* Select x86 specific features in  */
 #define __KVM_HAVE_PIT
 #define __KVM_HAVE_IOAPIC
@@ -25,6 +41,7 @@
 #define __KVM_HAVE_DEBUGREGS
 #define __KVM_HAVE_XSAVE
 #define __KVM_HAVE_XCRS
+#define __KVM_HAVE_READONLY_MEM
 
 /* Architectural interrupt line count. */
 #define KVM_NR_INTERRUPTS 256
diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
index 4b9e575..81d2feb 100644
--- a/linux-headers/linux/kvm.h
+++ b/linux-headers/linux/kvm.h
@@ -101,9 +101,13 @@ struct kvm_userspace_memory_region {
__u64 userspace_addr; /* start of the userspace allocated memory */
 };
 
-/* for kvm_memory_region::flags */
-#define KVM_MEM_LOG_DIRTY_PAGES  1UL
-#define KVM_MEMSLOT_INVALID  (1UL << 1)
+/*
+ * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace,
+ * other bits are reserved for kvm internal use which are defined in
+ * include/linux/kvm_host.h.
+ */
+#define KVM_MEM_LOG_DIRTY_PAGES(1UL << 0)
+#define KVM_MEM_READONLY   (1UL << 1)
 
 /* for KVM_IRQ_LINE */
 struct kvm_irq_level {
@@ -618,6 +622,10 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_PPC_GET_SMMU_INFO 78
 #define KVM_CAP_S390_COW 79
 #define KVM_CAP_PPC_ALLOC_HTAB 80
+#ifdef __KVM_HAVE_READONLY_MEM
+#define KVM_CAP_READONLY_MEM 81
+#endif
+#define KVM_CAP_IRQFD_RESAMPLE 82
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -683,12 +691,21 @@ struct kvm_xen_hvm_config {
 #endif
 
 #define KVM_IRQFD_FLAG_DEASSIGN (1 << 0)
+/*
+ * Available with KVM_CAP_IRQFD_RESAMPLE
+ *
+ * KVM_IRQFD_FLAG_RESAMPLE indicates resamplefd is valid and specifies
+ * the irqfd to operate in resampling mode for level triggered interrupt
+ * emlation.  See Documentation/virtual/kvm/api.txt.
+ */
+#define KVM_IRQFD_FLAG_RESAMPLE (1 << 1)
 
 struct kvm_irqfd {
__u32 fd;
__u32 gsi;
__u32 flags;
-   __u8  pad[20];
+   __u32 resamplefd;
+   __u8  pad[16];
 };
 
 struct kvm_clock_data {
diff --git a/linux-headers/linux/kvm_para.h b/linux-headers/linux/kvm_para.h
index 7bdcf93..cea2c5c 100644
--- a/linux-headers/linux/kvm_para.h
+++ b/linux-headers/linux/kvm_para.h
@@ -1,5 +1,5 @@
-#ifndef __LINUX_KVM_PARA_H
-#define __LINUX_KVM_PARA_H
+#ifndef _UAPI__LINUX_KVM_PARA_H
+#define _UAPI__LINUX_KVM_PARA_H
 
 /*
  * This header file provides a method for making a hypercall to the host
@@ -25,4 +25,4 @@
  */
 #include 
 
-#endif /* __LINUX_KVM_PARA_H */
+#endif /* _UAPI__LINUX_KVM_PARA_H */
diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index f787b72..4758d1b 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -8,8 +8,8 @@
  * it under the terms of the GNU General Public License version 2 as
  * published by the Free Software Foundation.
  */
-#ifndef VFIO_H
-#define V

[PULL 0/3] vfio-pci for 1.3-rc0

2012-11-13 Thread Alex Williamson

Hi Anthony,

Please pull the tag below.  I posted the linux-headers update
separately on Oct-15; since it hasn't been applied and should be
non-controversial, I include it again here.  Thanks,

Alex

The following changes since commit f5022a135e4309a54d433c69b2a056756b2d0d6b:

  aio: fix aio_ctx_prepare with idle bottom halves (2012-11-12 20:02:09 +0400)

are available in the git repository at:

  git://github.com/awilliam/qemu-vfio.git tags/vfio-pci-for-qemu-1.3.0-rc0

for you to fetch changes up to a771c51703cf9f91023c6570426258bdf5ec775b:

  vfio-pci: Use common msi_get_message (2012-11-13 12:27:40 -0700)


vfio-pci: KVM INTx accel & common msi_get_message


Alex Williamson (3):
  linux-headers: Update to 3.7-rc5
  vfio-pci: Add KVM INTx acceleration
  vfio-pci: Use common msi_get_message

 hw/vfio_pci.c| 210 +++
 linux-headers/asm-powerpc/kvm_para.h |   6 +-
 linux-headers/asm-s390/kvm_para.h|   8 +-
 linux-headers/asm-x86/kvm.h  |  17 +++
 linux-headers/linux/kvm.h|  25 -
 linux-headers/linux/kvm_para.h   |   6 +-
 linux-headers/linux/vfio.h   |   6 +-
 linux-headers/linux/virtio_config.h  |   6 +-
 linux-headers/linux/virtio_ring.h|   6 +-
 9 files changed, 241 insertions(+), 49 deletions(-)

Re: [RFC PATCH 16/16] kvm tools: add support for ARMv7 processors

2012-11-13 Thread Will Deacon

On Tue, Nov 13, 2012 at 10:28:20AM +, Pekka Enberg wrote:
> On Tue, Nov 13, 2012 at 12:21 PM, Matt Evans  wrote:
> > I *think* Will was going to make some small changes, if you've already 
> > merged it then a follow-up set perhaps?
> 
> I only merged the non-ARM specific changes which looked good to me.
> But sure, please send an incremental patch if you need to fix them up,
> Will.

Thanks Pekka! The patches you merged are all fine, so I'll send a v2 against
kvmtool master when I've updated the remaining patches to incorporate
comments from Sasha and Matt. I also need to investigate the build failure
you saw, because I've not managed to reproduce it myself.

Will


Re: SR-IOV problem with Intel 82599EB (not enough MMIO resources for SR-IOV)

2012-11-13 Thread Yinghai Lu

On Tue, Nov 13, 2012 at 10:25 AM, Li, Sibai  wrote:
>
> I never appended "pci=realloc" on either kernel 2.6.32.279 or kernel 3.5.0 and above.

Well, can you both post a boot log with "debug ignore_loglevel"?

RE: SR-IOV problem with Intel 82599EB (not enough MMIO resources for SR-IOV)

2012-11-13 Thread Li, Sibai


> -Original Message-
> From: yhlu.ker...@gmail.com [mailto:yhlu.ker...@gmail.com] On Behalf Of
> Yinghai Lu
> Sent: Tuesday, November 13, 2012 10:17 AM
> To: Li, Sibai
> Cc: Jason Gao; bhelg...@google.com; Rose, Gregory V; ddut...@redhat.com;
> Kirsher, Jeffrey T; linux-kernel; netdev; kvm; 
> e1000-de...@lists.sourceforge.net;
> linux-...@vger.kernel.org
> Subject: Re: SR-IOV problem with Intel 82599EB (not enough MMIO resources
> for SR-IOV)
> 
> On Tue, Nov 13, 2012 at 8:04 AM, Li, Sibai  wrote:
> >
> >>
> >> Thank you very much,I try "pci=realloc" in Centos 6.3,and now it works for
> me.
> >>
> >> thank you Sibai,Our server "Dell R710",its BIOS version is just
> >> v.6.3.0 and release date is 07/24/2012,and I also configured
> >> intel_iommu=on in the grub.conf file,but I can't find these IOMMU
> >> options in "Device Drivers" in my
> >> kernel(2.6.32-279) .config file , btw my os is Centos
> >> 6.3(RHEL6.3),although the problem solved,I'd like to know what's your os
> version ,kernel version?
> >
> > I am using RHEL6.3 with unstable kernel 3.7.0-rc
> 
> that means that config has
> CONFIG_PCI_REALLOC_ENABLE_AUTO=y
> 
> So you don't need to append "pci=realloc"
> 
> Yinghai

I never appended "pci=realloc" on either kernel 2.6.32.279 or kernel 3.5.0 and above.

Re: SR-IOV problem with Intel 82599EB (not enough MMIO resources for SR-IOV)

2012-11-13 Thread Yinghai Lu

On Tue, Nov 13, 2012 at 8:04 AM, Li, Sibai  wrote:
>
>>
>> Thank you very much,I try "pci=realloc" in Centos 6.3,and now it works for 
>> me.
>>
>> thank you Sibai,Our server "Dell R710",its BIOS version is just
>> v.6.3.0 and release date is 07/24/2012,and I also configured intel_iommu=on 
>> in
>> the grub.conf file,but I can't find these IOMMU options in "Device Drivers" 
>> in my
>> kernel(2.6.32-279) .config file , btw my os is Centos 6.3(RHEL6.3),although 
>> the
>> problem solved,I'd like to know what's your os version ,kernel version?
>
> I am using RHEL6.3 with unstable kernel 3.7.0-rc

that means that config has
CONFIG_PCI_REALLOC_ENABLE_AUTO=y

So you don't need to append "pci=realloc"

Yinghai

Re: [RFC PATCH 3/6] VFIO: unregister IOMMU notifier on error recovery path

2012-11-13 Thread Alex Williamson

On Sat, 2012-11-10 at 21:57 +0800, Jiang Liu wrote:
>  From: Jiang Liu 
> 
> On error recovery path in function vfio_create_group(), it should
> unregister the IOMMU notifier for the new VFIO group. Otherwise it may
> cause invalid memory access later when handling bus notifications.
> 
> Signed-off-by: Jiang Liu 
> ---
>  drivers/vfio/vfio.c |   31 +++
>  1 file changed, 15 insertions(+), 16 deletions(-)

This patch and patch 6/6 look like good vfio fixes regardless of how we
tackle the driver binding problem.  Please submit them separately.
Thanks for the patches, I look forward to a solution here.  Thanks,

Alex

> 
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index 17830c9..3359ec2 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -191,6 +191,17 @@ static void vfio_container_put(struct vfio_container 
> *container)
>   kref_put(&container->kref, vfio_container_release);
>  }
>  
> +static void vfio_group_unlock_and_free(struct vfio_group *group)
> +{
> + mutex_unlock(&vfio.group_lock);
> + /*
> +  * Unregister outside of lock.  A spurious callback is harmless now
> +  * that the group is no longer in vfio.group_list.
> +  */
> + iommu_group_unregister_notifier(group->iommu_group, &group->nb);
> + kfree(group);
> +}
> +
>  /**
>   * Group objects - create, release, get, put, search
>   */
> @@ -229,8 +240,7 @@ static struct vfio_group *vfio_create_group(struct 
> iommu_group *iommu_group)
>  
>   minor = vfio_alloc_group_minor(group);
>   if (minor < 0) {
> - mutex_unlock(&vfio.group_lock);
> - kfree(group);
> + vfio_group_unlock_and_free(group);
>   return ERR_PTR(minor);
>   }
>  
> @@ -239,8 +249,7 @@ static struct vfio_group *vfio_create_group(struct 
> iommu_group *iommu_group)
>   if (tmp->iommu_group == iommu_group) {
>   vfio_group_get(tmp);
>   vfio_free_group_minor(minor);
> - mutex_unlock(&vfio.group_lock);
> - kfree(group);
> + vfio_group_unlock_and_free(group);
>   return tmp;
>   }
>   }
> @@ -249,8 +258,7 @@ static struct vfio_group *vfio_create_group(struct 
> iommu_group *iommu_group)
>   group, "%d", iommu_group_id(iommu_group));
>   if (IS_ERR(dev)) {
>   vfio_free_group_minor(minor);
> - mutex_unlock(&vfio.group_lock);
> - kfree(group);
> + vfio_group_unlock_and_free(group);
>   return (struct vfio_group *)dev; /* ERR_PTR */
>   }
>  
> @@ -274,16 +282,7 @@ static void vfio_group_release(struct kref *kref)
>   device_destroy(vfio.class, MKDEV(MAJOR(vfio.devt), group->minor));
>   list_del(&group->vfio_next);
>   vfio_free_group_minor(group->minor);
> -
> - mutex_unlock(&vfio.group_lock);
> -
> - /*
> -  * Unregister outside of lock.  A spurious callback is harmless now
> -  * that the group is no longer in vfio.group_list.
> -  */
> - iommu_group_unregister_notifier(group->iommu_group, &group->nb);
> -
> - kfree(group);
> + vfio_group_unlock_and_free(group);
>  }
>  
>  static void vfio_group_put(struct vfio_group *group)




Re: SR-IOV problem with Intel 82599EB (not enough MMIO resources for SR-IOV)

2012-11-13 Thread Don Dutile

On 11/13/2012 11:04 AM, Li, Sibai wrote:
>> -Original Message-
>> From: Jason Gao [mailto:pkill.2...@gmail.com]
>> Sent: Tuesday, November 13, 2012 5:38 AM
>> To: bhelg...@google.com; Rose, Gregory V; Li, Sibai
>> Cc: ddut...@redhat.com; Kirsher, Jeffrey T; linux-kernel; netdev; kvm; e1000-
>> de...@lists.sourceforge.net; linux-...@vger.kernel.org; Yinghai Lu
>> Subject: Re: SR-IOV problem with Intel 82599EB (not enough MMIO resources
>> for SR-IOV)
>>
>> I'm very sorry for delayed reply.now SR-IOV works for me in Centos 6.3,thank all
>> of you.
>>
>> On Fri, Nov 9, 2012 at 11:26 PM, Bjorn Helgaas  wrote:
>>> Linux normally uses the resource assignments done by the BIOS, but it
>>> is possible for the kernel to reassign those.  We don't have good
>>> automatic support for that yet, but on a recent upstream kernel, you
>>> can try "pci=realloc".  I doubt this option is in CentOS 6.3, though
>>
>> Thank you very much,I try "pci=realloc" in Centos 6.3,and now it works for me.
>>
>> On Sat, Nov 10, 2012 at 2:08 AM, Li, Sibai  wrote:
>>> DellR710 with the latest BIOS should work fine for SR-IOV. My BIOS is
>>> v.6.3.0 and release date is 07/24/2012 Please check if you configured
>>> intel_iommu=on in the grub.conf file.
>>> If you did, check your kernel .config file under Device Drivers->  IOMMU
>>> Hardware support->enable Support for Intel IOMMU using DMA remapping
>>> Devices, enable Intel DMA Remapping Devices by Default, enable Support for
>>> Interrupt Remapping.
>>
>> thank you Sibai,Our server "Dell R710",its BIOS version is just
>> v.6.3.0 and release date is 07/24/2012,and I also configured intel_iommu=on in
>> the grub.conf file,but I can't find these IOMMU options in "Device Drivers" in my

Sibai is referring to kernel config options.  RHEL6.3 has the IOMMU options
built into the kernel, but not enabled by default -- have to add
'intel_iommu=on' to the kernel cmdline to enable IOMMU. SRIOV support
(CONFIG_IOV) is built into the RHEL6.3 kernel as well.

>> kernel(2.6.32-279) .config file , btw my os is Centos 6.3(RHEL6.3),although the
>> problem solved,I'd like to know what's your os version ,kernel version?
>
> I am using RHEL6.3 with unstable kernel 3.7.0-rc


Re: [RFC PATCH 13/16] kvm tools: keep track of registered memory banks in struct kvm

2012-11-13 Thread Will Deacon

On Tue, Nov 13, 2012 at 04:09:05PM +, Sasha Levin wrote:
> On 11/13/2012 07:16 AM, Will Deacon wrote:
> > On Tue, Nov 13, 2012 at 04:37:38AM +, Sasha Levin wrote:
> >> On 11/12/2012 06:57 AM, Will Deacon wrote:
> >>> +struct kvm_mem_bank {
> >>> +       struct list_head        list;
> >>> +       unsigned long           guest_phys_addr;
> >>> +       void                    *host_addr;
> >>> +       unsigned long           size;
> >>> +};
> >>
> >> Can we just reuse struct kvm_userspace_memory_region here? We're also 
> >> using different
> >> data types for guest_phys_addr and size than whats in 
> >> kvm_userspace_memory_region - that
> >> can't be good.
> > 
> > I looked briefly at doing that when I wrote the multi-bank stuff, but I hit
> > a couple of issues:
> > 
> > - kvmtool itself tends to use void * for host addresses, rather than
> >   the __u64 userspace_addr in kvm_userspace_memory_region
> > 
> > - kvm_userspace_memory_region is a superset of what we need (not the
> >   end of the world I guess)
> > 
> > so you end up casting address types a fair amount. Still, I'll revisit it
> > and see if I can come up with something cleaner.
> 
> That's a good point. We used void* while the kernel side is using u64, which
> looks odd.
> 
> In that case, let's get everything moved to u64 (obviously not in the scope of
> this patch series).

Ok, I'll update the size field to match in this patch series, then we can
tackle the address discrepancy separately.

Cheers,

Will
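
To make the type discussion above concrete, here is a minimal sketch of a bank descriptor that stores everything as 64-bit integers, in the spirit of the __u64 fields of struct kvm_userspace_memory_region. It is an illustration only: struct mem_bank, host_ptr_to_u64(), u64_to_host_ptr() and bank_guest_to_host() are hypothetical names, not kvmtool code, and the casts are confined to two helpers so callers do not repeat them.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bank descriptor with fixed-width fields (not kvmtool's). */
struct mem_bank {
	uint64_t guest_phys_addr;	/* start of the bank in guest physical space */
	uint64_t host_addr;		/* host virtual address, stored as an integer */
	uint64_t size;			/* length of the bank in bytes */
};

/* Keep the pointer <-> integer casts in one place. */
static inline uint64_t host_ptr_to_u64(void *p)
{
	return (uint64_t)(uintptr_t)p;
}

static inline void *u64_to_host_ptr(uint64_t addr)
{
	return (void *)(uintptr_t)addr;
}

/*
 * Translate a guest physical address into a host pointer.
 * Returns NULL when the address falls outside the bank.
 */
static inline void *bank_guest_to_host(const struct mem_bank *bank, uint64_t gpa)
{
	if (gpa < bank->guest_phys_addr ||
	    gpa - bank->guest_phys_addr >= bank->size)
		return NULL;

	return u64_to_host_ptr(bank->host_addr + (gpa - bank->guest_phys_addr));
}
```

The bounds check subtracts before comparing so that it cannot wrap around for banks placed near the top of the 64-bit range.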


Re: [RFC PATCH 13/16] kvm tools: keep track of registered memory banks in struct kvm

2012-11-13 Thread Sasha Levin

On 11/13/2012 07:16 AM, Will Deacon wrote:
> On Tue, Nov 13, 2012 at 04:37:38AM +, Sasha Levin wrote:
>> On 11/12/2012 06:57 AM, Will Deacon wrote:
>>> +struct kvm_mem_bank {
>>> +       struct list_head        list;
>>> +       unsigned long           guest_phys_addr;
>>> +       void                    *host_addr;
>>> +       unsigned long           size;
>>> +};
>>
>> Can we just reuse struct kvm_userspace_memory_region here? We're also using 
>> different
>> data types for guest_phys_addr and size than whats in 
>> kvm_userspace_memory_region - that
>> can't be good.
> 
> I looked briefly at doing that when I wrote the multi-bank stuff, but I hit
> a couple of issues:
> 
>   - kvmtool itself tends to use void * for host addresses, rather than
> the __u64 userspace_addr in kvm_userspace_memory_region
> 
>   - kvm_userspace_memory_region is a superset of what we need (not the
> end of the world I guess)
> 
> so you end up casting address types a fair amount. Still, I'll revisit it
> and see if I can come up with something cleaner.

That's a good point. We used void* while the kernel side is using u64, which
looks odd.

In that case, let's get everything moved to u64 (obviously not in the scope of
this patch series).


Thanks,
Sasha


RE: SR-IOV problem with Intel 82599EB (not enough MMIO resources for SR-IOV)

2012-11-13 Thread Li, Sibai



> -Original Message-
> From: Jason Gao [mailto:pkill.2...@gmail.com]
> Sent: Tuesday, November 13, 2012 5:38 AM
> To: bhelg...@google.com; Rose, Gregory V; Li, Sibai
> Cc: ddut...@redhat.com; Kirsher, Jeffrey T; linux-kernel; netdev; kvm; e1000-
> de...@lists.sourceforge.net; linux-...@vger.kernel.org; Yinghai Lu
> Subject: Re: SR-IOV problem with Intel 82599EB (not enough MMIO resources
> for SR-IOV)
> 
> I'm very sorry for delayed reply.now SR-IOV works for me in Centos 6.3,thank 
> all
> of you.
> 
> 
> On Fri, Nov 9, 2012 at 11:26 PM, Bjorn Helgaas  wrote:
> > Linux normally uses the resource assignments done by the BIOS, but it
> > is possible for the kernel to reassign those.  We don't have good
> > automatic support for that yet, but on a recent upstream kernel, you
> > can try "pci=realloc".  I doubt this option is in CentOS 6.3, though
> 
> Thank you very much,I try "pci=realloc" in Centos 6.3,and now it works for me.
> 
> 
> 
> On Sat, Nov 10, 2012 at 2:08 AM, Li, Sibai  wrote:
> > DellR710 with the latest BIOS should work fine for SR-IOV. My BIOS is
> > v.6.3.0 and release date is 07/24/2012 Please check if you configured
> intel_iommu=on in the grub.conf file.
> > If you did, check your kernel .config file under Device Drivers-> IOMMU
> Hardware support->enable Support for Intel IOMMU using DMA remapping
> Devices, enable Intel DMA Remapping Devices by Default, enable Support for
> Interrupt Remapping.
> 
> thank you Sibai,Our server "Dell R710",its BIOS version is just
> v.6.3.0 and release date is 07/24/2012,and I also configured intel_iommu=on in
> the grub.conf file,but I can't find these IOMMU options in "Device Drivers" 
> in my
> kernel(2.6.32-279) .config file , btw my os is Centos 6.3(RHEL6.3),although 
> the
> problem solved,I'd like to know what's your os version ,kernel version?

I am using RHEL6.3 with unstable kernel 3.7.0-rc

Re: [RFC PATCH 1/6] driver core: add a bus notification to temporarily reject driver binding

2012-11-13 Thread Jiang Liu

On 11/11/2012 01:21 PM, Greg Kroah-Hartman wrote:
> On Sat, Nov 10, 2012 at 09:57:14PM +0800, Jiang Liu wrote:
>>  From: Jiang Liu 
>>
>> There are several requirements to temporarily reject device driver
>> binding. Possible usage cases as below:
>> 1) We should avoid binding an unsafe driver to a device belonging to
>>an active VFIO group, otherwise it will break the DMA isolation
>>property of VFIO.
>> 2) When hot-removing a PCI hierarchy, we should avoid binding device
>>drivers to PCI devices going to be removed during the window
>>between unbinding of device driver and destroying of device nodes.
>> 3) When hot-adding a PCI host bridge, we should temporarily disable
>>driver binding before setting up corresponding IOMMU and IOAPIC.
>>
>> We may add a flag into struct device to temporarily disable driver
>> binding as in this thread https://patchwork.kernel.org/patch/1535721/.
> 
> I totally do not understand.  The bus controls this, if it does not want
> to bind a device to a driver, then don't do it.  It's really quite
> simple to just block the probe callback the bus gets, right?  Why create
> all of this extra, and confusing, interface instead?
Hi Greg,
Thanks for your comments.
As you know, we already have a "drivers_autoprobe" flag for drivers, and
we are trying to provide a similar mechanism for devices.
But I'm not sure whether we could block the probe callback. For PCI
host bridge hotplug, that would effectively block the PCI host bridge hotplug
thread. For the VFIO case, the goal is to reject binding unsafe drivers to PCI
devices belonging to an active VFIO group, so it doesn't make sense to block
the driver probing thread there either. So we are trying to return an error
code instead of blocking in really_probe().
Thanks!
Gerry
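
The idea described above (ask interested parties before binding, and fail with an error code rather than block) can be sketched in plain userspace C. This is only an illustration of the control flow under discussion, not the proposed patch and not the driver core API: struct device here is a toy type, and register_bind_check(), try_bind() and reject_active_group() are hypothetical names.

```c
#include <errno.h>
#include <stddef.h>

/* Toy device type; in_active_group is a hypothetical flag. */
struct device {
	const char *name;
	int in_active_group;	/* non-zero if the device's group is in use */
};

/* A check returns 0 to allow binding or a negative errno to reject it. */
typedef int (*bind_check_fn)(const struct device *dev);

#define MAX_BIND_CHECKS 4
static bind_check_fn bind_checks[MAX_BIND_CHECKS];
static size_t nr_bind_checks;

static int register_bind_check(bind_check_fn fn)
{
	if (nr_bind_checks == MAX_BIND_CHECKS)
		return -ENOSPC;
	bind_checks[nr_bind_checks++] = fn;
	return 0;
}

/* Ask every registered check; the first objection aborts the bind. */
static int try_bind(const struct device *dev)
{
	size_t i;

	for (i = 0; i < nr_bind_checks; i++) {
		int err = bind_checks[i](dev);

		if (err)
			return err;	/* fail fast instead of blocking the caller */
	}
	return 0;
}

/* Example check: refuse devices that belong to an active group. */
static int reject_active_group(const struct device *dev)
{
	return dev->in_active_group ? -EBUSY : 0;
}
```

Because try_bind() returns immediately, the thread attempting the bind is never parked waiting for the objecting party, which is the property the reply above is asking for.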

> 
>> This patch proposes another solution to temporarily disable driver
>> binding by using bus notification mechanisms. It adds a notification
>> event to solicit if anybody has objections when binding a driver to a
>> device.
> 
> Sorry, but no, don't do this, it's way more confusing.
> 
> greg k-h
> 


Re: [PATCH] KVM: MMU: lazily drop large spte

2012-11-13 Thread Takuya Yoshikawa

Ccing live migration developers who should be interested in this work,

On Mon, 12 Nov 2012 21:10:32 -0200
Marcelo Tosatti  wrote:

> On Mon, Nov 05, 2012 at 05:59:26PM +0800, Xiao Guangrong wrote:
> > Do not drop large spte until it can be replaced by small pages so that
> > the guest can happily read memory through it
> > 
> > The idea is from Avi:
> > | As I mentioned before, write-protecting a large spte is a good idea,
> > | since it moves some work from protect-time to fault-time, so it reduces
> > | jitter.  This removes the need for the return value.
> > 
> > Signed-off-by: Xiao Guangrong 
> > ---
> >  arch/x86/kvm/mmu.c |   34 +-
> >  1 files changed, 9 insertions(+), 25 deletions(-)
> 
> It's likely that other 4k pages are mapped read-write in the 2mb range
> covered by a read-only 2mb map. Therefore it's not entirely useful to
> map read-only.
> 
> Can you measure an improvement with this change?

What we discussed at KVM Forum last week was about the jitter we could
measure right after starting live migration: both Isaku and Chegu reported
such jitter.

So if this patch reduces such jitter for some real workloads, by lazily
dropping largepage mappings and saving read faults until that point, that
would be very nice!

But sadly, what they measured included interactions with the outside of the
guest, and they guessed that the main cause was the big QEMU lock problem.
The orders of magnitude are so different that an improvement from a
kernel-side effort may not be easy to see.

FWIW: I am now changing the initial write protection by
kvm_mmu_slot_remove_write_access() to rmap based as I proposed at KVM Forum.
ftrace said that 1ms was improved to 250-350us by the change for a 10GB guest.
My code still drops largepage mappings, so the initial write protection time
itself may not be such a big issue here, I think.

Again, if we can eliminate read faults to such an extent that guests can see
measurable improvement, that should be very nice!

Any thoughts?

Thanks,
Takuya

Re: [Qemu-devel] KVM call agenda for 2012-11-12

2012-11-13 Thread Eduardo Habkost

On Tue, Nov 13, 2012 at 03:29:37PM +0100, Andreas Färber wrote:
> Am 13.11.2012 13:29, schrieb Eduardo Habkost:
> > On Mon, Nov 12, 2012 at 01:58:38PM +0100, Juan Quintela wrote:
> >>
> >> Please send in any agenda topics you are interested in.
> > 
> > - Clarify 1.3 plans for CPU:
> 
> From my submaintainer POV:
> 
> > DeviceState CPU,
> 
> I was specifically tasked with the qdev split by Anthony, so unless
> major obstacles arise I will send a PULL by Thursday.
> 
> What I am still unsure about is whether it makes sense to actually apply
> the final CPU-as-device change for v1.3 since that exposes the device
> name(s) as public "ABI", cf. below. A safety option would be no_user = 1
> to avoid users messing with untested use cases at this time.

I'm OK with holding the final TYPE_DEVICE patch, while including only
the other changes. Our main problem is coordinating/rebasing work on
those large patch series, so at least including the qdev split and
header fixes would already make our lives easier.


> 
> > x86 CPU classes,
> 
> If I get through review quickly enough and RFC seems sane and I get a
> PATCH, I might include it in the pull. To me, classes are prerequisites
> to exposing CPU-as-a-device because otherwise the user specifies the
> base class that we want to make abstract (which will then break
> backwards compatibility) and has no API to set it to something useful.
> Once applied, we would still have half a month for testing.

We still have some ongoing discussion about the CPU class namespace, and
how to map the name from "-cpu FOO" to the actual class name, so I don't
think we want to hurry to get the RFC in 1.3. I will try to rewrite it
in a way that allows all targets to use the same code to find the
appropriate CPU class.

Maybe we could try to include just the cpu_x86_init() cleanups I
sent yesterday, to make further work easier to coordinate?

(It's not that important to get that in 1.3, anyway, we can just agree
to use that series as base for further work, even if it doesn't get
included right now)

> 
> > x86 CPU properties
> 
> Won't make v1.3 due to timing constraints. There are also still
> unresolved review comments related to property naming IIRC.

OK. It still has to be rebased, so I didn't think it was feasible for
1.3.

> 
> >   (we still want to get any of this included, or all will have to wait for 1.4?)
> 
> Regards,
> Andreas
> 
> -- 
> SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
> GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg
> 

-- 
Eduardo

Re: [Qemu-devel] KVM call agenda for 2012-11-12

2012-11-13 Thread Eduardo Habkost

On Tue, Nov 13, 2012 at 03:48:55PM +0100, Juan Quintela wrote:
> Anthony Liguori  wrote:
> > Marcelo Tosatti  writes:
> >
> >> On Mon, Nov 12, 2012 at 01:58:38PM +0100, Juan Quintela wrote:
> >>> 
> >>> Hi
> >>> 
> >>> Please send in any agenda topics you are interested in.
> >>> 
> >>> Later, Juan.
> >>
> >> It would be good to have a status report on qemu-kvm compatibility
> >> (the remaining TODO items are with Anthony). They are:
> >>
> >> - qemu-kvm 1.2 machine type.
> >> - default accelerator being KVM.
> >>
> >> Note migration will remain broken due to 
> >>
> >> https://patchwork.kernel.org/patch/1674521/
> >>
> >> BTW, this can be via email, if preferred (i cannot attend the call).
> >
> > Let's cancel the call and I'll spend the hour writing up the patches and
> > sending them out.
> 
> Same for Eduardo's requests?

I'm happy with Andreas' clarifications, so it's OK to me.

-- 
Eduardo

Re: KVM call agenda for 2012-11-12

2012-11-13 Thread Juan Quintela

Anthony Liguori  wrote:
> Marcelo Tosatti  writes:
>
>> On Mon, Nov 12, 2012 at 01:58:38PM +0100, Juan Quintela wrote:
>>> 
>>> Hi
>>> 
>>> Please send in any agenda topics you are interested in.
>>> 
>>> Later, Juan.
>>
>> It would be good to have a status report on qemu-kvm compatibility
>> (the remaining TODO items are with Anthony). They are:
>>
>> - qemu-kvm 1.2 machine type.
>> - default accelerator being KVM.
>>
>> Note migration will remain broken due to 
>>
>> https://patchwork.kernel.org/patch/1674521/
>>
>> BTW, this can be via email, if preferred (i cannot attend the call).
>
> Let's cancel the call and I'll spend the hour writing up the patches and
> sending them out.

Same for Eduardo's requests?

Later, Juan.

Re: [Qemu-devel] KVM call agenda for 2012-11-12

2012-11-13 Thread Andreas Färber

Am 13.11.2012 13:29, schrieb Eduardo Habkost:
> On Mon, Nov 12, 2012 at 01:58:38PM +0100, Juan Quintela wrote:
>>
>> Please send in any agenda topics you are interested in.
> 
> - Clarify 1.3 plans for CPU:

From my submaintainer POV:

> DeviceState CPU,

I was specifically tasked with the qdev split by Anthony, so unless
major obstacles arise I will send a PULL by Thursday.

What I am still unsure about is whether it makes sense to actually apply
the final CPU-as-device change for v1.3 since that exposes the device
name(s) as public "ABI", cf. below. A safety option would be no_user = 1
to avoid users messing with untested use cases at this time.

> x86 CPU classes,

If I get through review quickly enough and RFC seems sane and I get a
PATCH, I might include it in the pull. To me, classes are prerequisites
to exposing CPU-as-a-device because otherwise the user specifies the
base class that we want to make abstract (which will then break
backwards compatibility) and has no API to set it to something useful.
Once applied, we would still have half a month for testing.

> x86 CPU properties

Won't make v1.3 due to timing constraints. There are also still
unresolved review comments related to property naming IIRC.

>   (we still want to get any of this included, or all will have to wait for 1.4?)

Regards,
Andreas

-- 
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg

Re: KVM call agenda for 2012-11-12

2012-11-13 Thread Anthony Liguori

Marcelo Tosatti  writes:

> On Mon, Nov 12, 2012 at 01:58:38PM +0100, Juan Quintela wrote:
>> 
>> Hi
>> 
>> Please send in any agenda topics you are interested in.
>> 
>> Later, Juan.
>
> It would be good to have a status report on qemu-kvm compatibility
> (the remaining TODO items are with Anthony). They are:
>
> - qemu-kvm 1.2 machine type.
> - default accelerator being KVM.
>
> Note migration will remain broken due to 
>
> https://patchwork.kernel.org/patch/1674521/
>
> BTW, this can be via email, if preferred (i cannot attend the call).

Let's cancel the call and I'll spend the hour writing up the patches and
sending them out.

Regards,

Anthony Liguori


Re: [PATCH v4 06/13] ARM: KVM: VGIC distributor handling

2012-11-13 Thread Christoffer Dall

On Mon, Nov 12, 2012 at 4:29 AM, Dong Aisheng  wrote:
> On Sat, Nov 10, 2012 at 04:44:58PM +0100, Christoffer Dall wrote:
> [...]
>> @@ -141,7 +519,98 @@ struct mmio_range *find_matching_range(const struct 
>> mmio_range *ranges,
>>   */
>>  bool vgic_handle_mmio(struct kvm_vcpu *vcpu, struct kvm_run *run, struct 
>> kvm_exit_mmio *mmio)
>>  {
>> - return KVM_EXIT_MMIO;
>> + const struct mmio_range *range;
>> + struct vgic_dist *dist = &vcpu->kvm->arch.vgic;
>> + unsigned long base = dist->vgic_dist_base;
>> + bool updated_state;
>> +
>> + if (!irqchip_in_kernel(vcpu->kvm) ||
>> + mmio->phys_addr < base ||
>> + (mmio->phys_addr + mmio->len) > (base + dist->vgic_dist_size))
>> + return false;
>> +
>> + range = find_matching_range(vgic_ranges, mmio, base);
>> + if (unlikely(!range || !range->handle_mmio)) {
>> + pr_warn("Unhandled access %d %08llx %d\n",
>> + mmio->is_write, mmio->phys_addr, mmio->len);
>> + return false;
>> + }
>> +
>> + spin_lock(&vcpu->kvm->arch.vgic.lock);
>> + updated_state = range->handle_mmio(vcpu, mmio,mmio->phys_addr - 
>> range->base - base);
> Missing space after ','.
> Checkpatch may fail here.
>
thanks,
-Christoffer

Re: SR-IOV problem with Intel 82599EB (not enough MMIO resources for SR-IOV)

2012-11-13 Thread Jason Gao

I'm very sorry for the delayed reply. SR-IOV now works for me in CentOS
6.3; thank you all.


On Fri, Nov 9, 2012 at 11:26 PM, Bjorn Helgaas  wrote:
> Linux normally uses the resource assignments done by the BIOS, but it
> is possible for the kernel to reassign those.  We don't have good
> automatic support for that yet, but on a recent upstream kernel, you
> can try "pci=realloc".  I doubt this option is in CentOS 6.3, though

Thank you very much. I tried "pci=realloc" in CentOS 6.3, and now it works for me.



On Sat, Nov 10, 2012 at 2:08 AM, Li, Sibai  wrote:
> DellR710 with the latest BIOS should work fine for SR-IOV. My BIOS is v.6.3.0 
> and release date is 07/24/2012
> Please check if you configured intel_iommu=on in the grub.conf file.
> If you did, check your kernel .config file under Device Drivers-> IOMMU 
> Hardware support->enable Support for Intel IOMMU using DMA remapping Devices, 
> enable Intel DMA Remapping Devices by Default, enable Support for Interrupt 
> Remapping.

Thank you, Sibai. Our server is a Dell R710; its BIOS version is also
v.6.3.0 with release date 07/24/2012, and I also configured
intel_iommu=on in the grub.conf file, but I can't find these IOMMU
options under "Device Drivers" in my kernel (2.6.32-279) .config file.
By the way, my OS is CentOS 6.3 (RHEL 6.3). Although the problem is solved,
I'd like to know your OS version and kernel version.

Re: [PATCH v4 05/13] ARM: KVM: VGIC accept vcpu and dist base addresses from user space

2012-11-13 Thread Christoffer Dall

On Mon, Nov 12, 2012 at 3:56 AM, Dong Aisheng  wrote:
> On Sat, Nov 10, 2012 at 04:44:51PM +0100, Christoffer Dall wrote:
> [...]
>> +int kvm_vgic_set_addr(struct kvm *kvm, unsigned long type, u64 addr)
>> +{
>> +	int r = 0;
>> +	struct vgic_dist *vgic = &kvm->arch.vgic;
>> +
>> +	if (addr & ~KVM_PHYS_MASK)
>> +		return -E2BIG;
>> +
>> +	if (addr & ~PAGE_MASK)
>> +		return -EINVAL;
>> +
>> +	mutex_lock(&kvm->lock);
>> +	switch (type) {
>> +	case KVM_VGIC_V2_ADDR_TYPE_DIST:
>> +		if (!IS_VGIC_ADDR_UNDEF(vgic->vgic_dist_base))
>> +			return -EEXIST;
>> +		if (addr + VGIC_DIST_SIZE < addr)
>> +			return -EINVAL;
>> +		kvm->arch.vgic.vgic_dist_base = addr;
>> +		break;
>> +	case KVM_VGIC_V2_ADDR_TYPE_CPU:
>> +		if (!IS_VGIC_ADDR_UNDEF(vgic->vgic_cpu_base))
>> +			return -EEXIST;
>> +		if (addr + VGIC_CPU_SIZE < addr)
>> +			return -EINVAL;
>> +		kvm->arch.vgic.vgic_cpu_base = addr;
>> +		break;
>> +	default:
>> +		r = -ENODEV;
>> +	}
>> +
>> +	if (vgic_ioaddr_overlap(kvm)) {
>> +		kvm->arch.vgic.vgic_dist_base = VGIC_ADDR_UNDEF;
>> +		kvm->arch.vgic.vgic_cpu_base = VGIC_ADDR_UNDEF;
>
> Missing mutex_unlock?

indeed, should be r = -EINVAL.

nice catch!
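
One way to avoid this class of bug is to route every exit taken after
the lock is acquired through a single unlock label. The following is a
minimal userspace sketch of that pattern, not the actual follow-up
patch: all toy_* names, the depth-counting lock and the constants are
stand-ins invented for illustration, and the overlap test is only a
simplified guess at what vgic_ioaddr_overlap() checks.

```c
#include <errno.h>
#include <stdint.h>

/* All names below are hypothetical stand-ins, not kernel definitions. */
#define TOY_ADDR_UNDEF   UINT64_MAX
#define TOY_REGION_SIZE  0x1000ULL

enum { TOY_ADDR_TYPE_DIST, TOY_ADDR_TYPE_CPU };

struct toy_vgic {
	int lock_depth;		/* stand-in for a mutex: 0 means unlocked */
	uint64_t dist_base;
	uint64_t cpu_base;
};

static void toy_lock(struct toy_vgic *v)   { v->lock_depth++; }
static void toy_unlock(struct toy_vgic *v) { v->lock_depth--; }

static void toy_vgic_init(struct toy_vgic *v)
{
	v->lock_depth = 0;
	v->dist_base = TOY_ADDR_UNDEF;
	v->cpu_base = TOY_ADDR_UNDEF;
}

/* Simplified overlap check: both regions are set and their ranges intersect. */
static int toy_overlap(const struct toy_vgic *v)
{
	if (v->dist_base == TOY_ADDR_UNDEF || v->cpu_base == TOY_ADDR_UNDEF)
		return 0;
	return v->dist_base < v->cpu_base + TOY_REGION_SIZE &&
	       v->cpu_base < v->dist_base + TOY_REGION_SIZE;
}

static int toy_set_addr(struct toy_vgic *v, unsigned long type, uint64_t addr)
{
	uint64_t *slot;
	int r = 0;

	toy_lock(v);
	switch (type) {
	case TOY_ADDR_TYPE_DIST:
		slot = &v->dist_base;
		break;
	case TOY_ADDR_TYPE_CPU:
		slot = &v->cpu_base;
		break;
	default:
		r = -ENODEV;
		goto out;
	}

	if (*slot != TOY_ADDR_UNDEF) {
		r = -EEXIST;
		goto out;
	}
	if (addr + TOY_REGION_SIZE < addr) {	/* wrap-around check */
		r = -EINVAL;
		goto out;
	}
	*slot = addr;

	if (toy_overlap(v)) {
		v->dist_base = TOY_ADDR_UNDEF;
		v->cpu_base = TOY_ADDR_UNDEF;
		r = -EINVAL;
	}
out:
	toy_unlock(v);		/* the only place the lock is released */
	return r;
}
```

Because every error path jumps to the same label, the lock depth is
back at zero after each call regardless of which check failed.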

Re: [PATCH v4 04/13] ARM: KVM: Initial VGIC MMIO support code

2012-11-13 Thread Christoffer Dall

On Mon, Nov 12, 2012 at 3:54 AM, Dong Aisheng  wrote:
> On Sat, Nov 10, 2012 at 04:44:44PM +0100, Christoffer Dall wrote:
>> From: Marc Zyngier 
>>
>> Wire the initial in-kernel MMIO support code for the VGIC, used
>> for the distributor emulation.
>>
>> Signed-off-by: Marc Zyngier 
>> Signed-off-by: Christoffer Dall 
>> ---
>>  arch/arm/include/asm/kvm_vgic.h |6 +-
>>  arch/arm/kvm/Makefile   |1
>>  arch/arm/kvm/vgic.c |  138 
>> +++
>>  3 files changed, 144 insertions(+), 1 deletion(-)
>>  create mode 100644 arch/arm/kvm/vgic.c
>>
>> diff --git a/arch/arm/include/asm/kvm_vgic.h b/arch/arm/include/asm/kvm_vgic.h
>> index d75540a..b444ecf 100644
>> --- a/arch/arm/include/asm/kvm_vgic.h
>> +++ b/arch/arm/include/asm/kvm_vgic.h
>> @@ -30,7 +30,11 @@ struct kvm_vcpu;
>>  struct kvm_run;
>>  struct kvm_exit_mmio;
>>
>> -#ifndef CONFIG_KVM_ARM_VGIC
>> +#ifdef CONFIG_KVM_ARM_VGIC
>> +bool vgic_handle_mmio(struct kvm_vcpu *vcpu, struct kvm_run *run,
>> +   struct kvm_exit_mmio *mmio);
>> +
>> +#else
>>  static inline int kvm_vgic_hyp_init(void)
>>  {
>>   return 0;
>> diff --git a/arch/arm/kvm/Makefile b/arch/arm/kvm/Makefile
>> index 8a4f396..c019f02 100644
>> --- a/arch/arm/kvm/Makefile
>> +++ b/arch/arm/kvm/Makefile
>> @@ -20,3 +20,4 @@ obj-$(CONFIG_KVM_ARM_HOST) += $(addprefix ../../../virt/kvm/, kvm_main.o coalesc
>>
>>  obj-$(CONFIG_KVM_ARM_HOST) += arm.o guest.o mmu.o emulate.o reset.o
>>  obj-$(CONFIG_KVM_ARM_HOST) += coproc.o coproc_a15.o mmio.o decode.o
>> +obj-$(CONFIG_KVM_ARM_VGIC) += vgic.o
>> diff --git a/arch/arm/kvm/vgic.c b/arch/arm/kvm/vgic.c
>> new file mode 100644
>> index 000..26ada3b
>> --- /dev/null
>> +++ b/arch/arm/kvm/vgic.c
>> @@ -0,0 +1,138 @@
>> +/*
>> + * Copyright (C) 2012 ARM Ltd.
>> + * Author: Marc Zyngier 
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License version 2 as
>> + * published by the Free Software Foundation.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU General Public License for more details.
>> + *
>> + * You should have received a copy of the GNU General Public License
>> + * along with this program; if not, write to the Free Software
>> + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
>> + */
>> +
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +
>> +#define ACCESS_READ_VALUE	(1 << 0)
>> +#define ACCESS_READ_RAZ		(0 << 0)
>> +#define ACCESS_READ_MASK(x)	((x) & (1 << 0))
>> +#define ACCESS_WRITE_IGNORED	(0 << 1)
>> +#define ACCESS_WRITE_SETBIT	(1 << 1)
>> +#define ACCESS_WRITE_CLEARBIT	(2 << 1)
>> +#define ACCESS_WRITE_VALUE	(3 << 1)
>> +#define ACCESS_WRITE_MASK(x)	((x) & (3 << 1))
>> +
>> +/**
>> + * vgic_reg_access - access vgic register
>> + * @mmio:   pointer to the data describing the mmio access
>> + * @reg:    pointer to the virtual backing of the vgic distributor struct
>
> Is this correct?
>
>> + * @offset: least significant 2 bits used for word offset
>> + * @mode:   ACCESS_ mode (see defines above)
>> + *
>> + * Helper to make vgic register access easier using one of the access
>> + * modes defined for vgic register access
>> + * (read,raz,write-ignored,setbit,clearbit,write)
>> + */
>> +static void vgic_reg_access(struct kvm_exit_mmio *mmio, u32 *reg,
>> +			    u32 offset, int mode)
>> +{
>> +	int word_offset = offset & 3;
>> +	int shift = word_offset * 8;
>> +	u32 mask;
>> +	u32 regval;
>> +
>> +	/*
>> +	 * Any alignment fault should have been delivered to the guest
>> +	 * directly (ARM ARM B3.12.7 "Prioritization of aborts").
>> +	 */
>> +
>> +	mask = (~0U) >> (word_offset * 8);
>> +	if (reg)
>> +		regval = *reg;
>> +	else {
>> +		BUG_ON(mode != (ACCESS_READ_RAZ | ACCESS_WRITE_IGNORED));
>> +		regval = 0;
>> +	}
>> +
>> +	if (mmio->is_write) {
>> +		u32 data = (*((u32 *)mmio->data) & mask) << shift;
>> +		switch (ACCESS_WRITE_MASK(mode)) {
>> +		case ACCESS_WRITE_IGNORED:
>> +			return;
>> +
>> +		case ACCESS_WRITE_SETBIT:
>> +			regval |= data;
>> +			break;
>> +
>> +		case ACCESS_WRITE_CLEARBIT:
>> +			regval &= ~data;
>> +			break;
>> +
>> +		case ACCESS_WRITE_VALUE:
>> +			regval = (regval & ~(mask << shift)) | data;
>> +			break;
>> +		}
>> +		*reg = regval;
>> +	} else {
>> +		switch (ACCESS_READ_MASK(mode)) {
>> +		case ACCESS_READ_RAZ:
>> + regva
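
(The quoted patch is cut off here in the archive.) To make the write
half of the helper easier to follow, here is a small standalone sketch
of the same mask/shift arithmetic. It is only an illustration under
assumptions: toy_apply_write() is a hypothetical name, it takes the
written data as a plain value instead of a struct kvm_exit_mmio, and it
returns the new register value rather than storing it. The
ACCESS_WRITE_* values are copied from the defines quoted above.

```c
#include <stdint.h>

#define ACCESS_WRITE_IGNORED	(0 << 1)
#define ACCESS_WRITE_SETBIT	(1 << 1)
#define ACCESS_WRITE_CLEARBIT	(2 << 1)
#define ACCESS_WRITE_VALUE	(3 << 1)
#define ACCESS_WRITE_MASK(x)	((x) & (3 << 1))

/*
 * Hypothetical helper: compute the register value that results from a
 * guest write of 'wdata' at byte offset 'offset' within a 32-bit word.
 */
static uint32_t toy_apply_write(uint32_t regval, uint32_t wdata,
				uint32_t offset, int mode)
{
	uint32_t word_offset = offset & 3;	/* byte lane within the word */
	uint32_t shift = word_offset * 8;
	uint32_t mask = ~0U >> shift;		/* bits that fit above the lane */
	uint32_t data = (wdata & mask) << shift;

	switch (ACCESS_WRITE_MASK(mode)) {
	case ACCESS_WRITE_SETBIT:
		return regval | data;		/* writing 1 sets a bit */
	case ACCESS_WRITE_CLEARBIT:
		return regval & ~data;		/* writing 1 clears a bit */
	case ACCESS_WRITE_VALUE:
		return (regval & ~(mask << shift)) | data;
	default:				/* ACCESS_WRITE_IGNORED */
		return regval;
	}
}
```

Note that in the plain-write mode everything from the addressed byte
lane upwards is replaced, while the bytes below the lane are preserved.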

Re: KVM call agenda for 2012-11-12

2012-11-13 Thread Zhi Yong Wu

Hi,

I'm confused by the date in this topic.

On Mon, Nov 12, 2012 at 8:58 PM, Juan Quintela  wrote:
>
> Hi
>
> Please send in any agenda topics you are interested in.
>
> Later, Juan.



-- 
Regards,

Zhi Yong Wu

Re: [Qemu-devel] KVM call agenda for 2012-11-12

2012-11-13 Thread Eduardo Habkost

On Mon, Nov 12, 2012 at 01:58:38PM +0100, Juan Quintela wrote:
> 
> Hi
> 
> Please send in any agenda topics you are interested in.

- Clarify 1.3 plans for CPU: DeviceState CPU, x86 CPU classes, x86 CPU properties
  (do we still want to get any of this included, or will it all have to wait for 1.4?)

-- 
Eduardo

Re: [RFC PATCH 13/16] kvm tools: keep track of registered memory banks in struct kvm

2012-11-13 Thread Will Deacon

On Tue, Nov 13, 2012 at 04:37:38AM +, Sasha Levin wrote:
> On 11/12/2012 06:57 AM, Will Deacon wrote:
> > +struct kvm_mem_bank {
> > +   struct list_head    list;
> > +   unsigned long       guest_phys_addr;
> > +   void                *host_addr;
> > +   unsigned long       size;
> > +};
> 
> Can we just reuse struct kvm_userspace_memory_region here? We're also using
> different data types for guest_phys_addr and size than what's in
> kvm_userspace_memory_region - that can't be good.

I looked briefly at doing that when I wrote the multi-bank stuff, but I hit
a couple of issues:

- kvmtool itself tends to use void * for host addresses, rather than
  the __u64 userspace_addr in kvm_userspace_memory_region

- kvm_userspace_memory_region is a superset of what we need (not the
  end of the world I guess)

so you end up casting address types a fair amount. Still, I'll revisit it
and see if I can come up with something cleaner.
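
To illustrate the casting in question, here is a minimal sketch. It is
an assumption-laden illustration, not kvmtool code: struct
toy_mem_region is a local mirror of the field layout of struct
kvm_userspace_memory_region from <linux/kvm.h> (so the snippet builds
without kernel headers), and the two helper names are hypothetical.

```c
#include <stdint.h>

/* Local mirror of struct kvm_userspace_memory_region (illustration only). */
struct toy_mem_region {
	uint32_t slot;
	uint32_t flags;
	uint64_t guest_phys_addr;
	uint64_t memory_size;
	uint64_t userspace_addr;	/* host address stored as an integer */
};

/* Hypothetical accessor: recover the pointer form of the host address. */
static inline void *toy_region_host_addr(const struct toy_mem_region *r)
{
	return (void *)(uintptr_t)r->userspace_addr;
}

/* Hypothetical setter: store a host pointer in the 64-bit field. */
static inline void toy_region_set_host_addr(struct toy_mem_region *r, void *p)
{
	r->userspace_addr = (uint64_t)(uintptr_t)p;
}
```

Going through uintptr_t keeps the conversions well defined on both
32-bit and 64-bit hosts, and confining them to two helpers would keep
the casts out of the callers.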

> >  struct kvm {
> >     struct kvm_arch     arch;
> >     struct kvm_config   cfg;
> > @@ -49,6 +56,7 @@ struct kvm {
> >     u64                 ram_size;
> >     void                *ram_start;
> >     u64                 ram_pagesize;
> > +   struct list_head    mem_banks;
> 
> These memory banks actually look like a perfect example to use our augmented
> interval rb-tree. Can we switch them to use it, or is it a list on purpose?

Well, the usual case is one memory bank but that doesn't swing the argument
either way. I'll update to use the fancy new tree.

Thanks for the comments,

Will


Re: [RFC PATCH 16/16] kvm tools: add support for ARMv7 processors

2012-11-13 Thread Pekka Enberg

On Tue, Nov 13, 2012 at 12:21 PM, Matt Evans  wrote:
> I *think* Will was going to make some small changes, if you've already merged 
> it then a follow-up set perhaps?

I only merged the non-ARM specific changes which looked good to me.
But sure, please send an incremental patch if you need to fix them up,
Will.

Re: [RFC PATCH 00/16] kvm tools: add support for ARMv7 processors

2012-11-13 Thread Will Deacon

On Mon, Nov 12, 2012 at 10:40:33PM +, Christoffer Dall wrote:
> On Mon, Nov 12, 2012 at 7:52 AM, Christoffer Dall
>  wrote:
> > On Mon, Nov 12, 2012 at 7:27 AM, Will Deacon  wrote:
> >> Hi Christoffer,
> >>
> >> On Mon, Nov 12, 2012 at 12:18:57PM +, Christoffer Dall wrote:
> >>> On Mon, Nov 12, 2012 at 6:57 AM, Will Deacon  wrote:
> >>> > Hello,
> >>> >
> >>> > This patch series adds support for ARMv7 processors (Cortex-A15) to kvm
> >>> > tool. The majority of the series consists of small changes in
> >>> > preparation for ARM support, which is added by the final patch. I can
> >>> > try to split this up further, but given that there is no current support
> >>> > for ARM, the sub-patches wouldn't be especially meaningful.
> >>>
> >>> Very cool, looking forward to trying it out!
> >>
> >> Great -- please let me know if/when it explodes...
> >>
> just to make sure, this patch series is also the one that can be found
> here (HEAD~1):
> git://git.kernel.org/pub/scm/linux/kernel/git/will/kvmtool.git kvmtool/arm

Currently, yes, but that is my development branch and will rebase as and
when I make changes in preparation for v2. There's also a patch there for
AArch64 support, but the kernel side needs to settle down before that's
ready for posting.

Cheers,

Will


RE: [RFC PATCH 16/16] kvm tools: add support for ARMv7 processors

2012-11-13 Thread Matt Evans

Hi Pekka,

On 13 November 2012 07:40 Pekka Enberg wrote:

> On Mon, 12 Nov 2012, Will Deacon wrote:
> > This patch adds initial support for ARMv7 processors (more
> specifically,
> > Cortex-A15) to kvmtool.
> >
> > Everything is driven by FDT, including dynamic generation of virtio
> nodes
> > for MMIO devices (PCI is not used due to lack of a suitable host-
> bridge).
> >
> > The virtual timers and virtual interrupt controller (VGIC) are
> provided
> > by the kernel and require very little in terms of userspace code.
> >
> > Signed-off-by: Will Deacon 
>
> I'm happy with this but I'm not really an ARM guy. Is there anyone in
> the
> ARM/KVM community who is interested in reviewing this?

For the sake of transparency, I sent some minor feedback to Will off-list so 
that my (Outlook >:( ) broken linewraps & legal autoappend signature (see above 
and below) wouldn't pollute the thread on-list.  He'll use my IMAP address next 
time. ;-)

I *think* Will was going to make some small changes, if you've already merged 
it then a follow-up set perhaps?

(Modulo minor changes,
Acked-By: Matt Evans 
)

Cheers,


Matt


-- IMPORTANT NOTICE: The contents of this email and any attachments are 
confidential and may also be privileged. If you are not the intended recipient, 
please notify the sender immediately and do not disclose the contents to any 
other person, use it for any purpose, or store or copy the information in any 
medium.  Thank you.


Re: [RFC PATCH 08/16] kvm tools: add generic device registration mechanism

2012-11-13 Thread Will Deacon

Hi Sasha,

On Tue, Nov 13, 2012 at 04:29:33AM +, Sasha Levin wrote:
> On 11/12/2012 06:57 AM, Will Deacon wrote:
> > diff --git a/tools/kvm/devices.c b/tools/kvm/devices.c
> > new file mode 100644
> > index 000..f9666b9
> > --- /dev/null
> > +++ b/tools/kvm/devices.c
> > @@ -0,0 +1,24 @@
> > +#include "kvm/devices.h"
> > +#include "kvm/kvm.h"
> > +
> > +#include 
> > +
> > +static struct device_header *devices[KVM_MAX_DEVICES];
> 
> Does it really have a hard limit at KVM_MAX_DEVICES? Or can we turn it into
> something more dynamic (list/tree/whatever)?

Sure, I'm happy to change the datatype to something more appropriate. Matt
also suggested trying to split up PCI devices from MMIO devices so that they
have their own ID namespaces, so I'll need to have a play.
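
As a rough idea of what a registry without a fixed KVM_MAX_DEVICES
bound could look like, here is a minimal singly linked list sketch. It
is purely illustrative and makes several assumptions: the toy_* names
and the contents of struct toy_device_header are hypothetical (the real
struct device_header is not shown in this thread), and a tree or
per-bus lists may well turn out to be the better fit.

```c
#include <stddef.h>

/* Hypothetical device header; the real struct is not shown in the thread. */
struct toy_device_header {
	int dev_num;			/* assigned at registration time */
	struct toy_device_header *next;
};

static struct toy_device_header *toy_device_list;	/* head of the list */
static int toy_next_dev_num;

/* Register a device and return its ID; no fixed upper bound on the count. */
static int toy_device_register(struct toy_device_header *dev)
{
	dev->dev_num = toy_next_dev_num++;
	dev->next = toy_device_list;
	toy_device_list = dev;
	return dev->dev_num;
}

/* Look a device up by ID; returns NULL if it was never registered. */
static struct toy_device_header *toy_device_find(int dev_num)
{
	struct toy_device_header *dev;

	for (dev = toy_device_list; dev; dev = dev->next)
		if (dev->dev_num == dev_num)
			return dev;
	return NULL;
}
```

Separate PCI and MMIO namespaces could be modelled with one such list
head and counter per bus type.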

Cheers,

Will


Re: [PATCH 2/3] s390: Virtual channel subsystem support.

2012-11-13 Thread Cornelia Huck

On Mon, 12 Nov 2012 23:17:55 -0200
Marcelo Tosatti  wrote:

> Hi Cornelia,
> 
> On Wed, Oct 31, 2012 at 05:24:47PM +0100, Cornelia Huck wrote:
> > Provide a mechanism for qemu to provide fully virtual subchannels to
> > the guest. In the KVM case, this relies on the kernel's css support
> > for I/O and machine check interrupt handling. The !KVM case handles
> > interrupts on its own.
> > 
> > Signed-off-by: Cornelia Huck 
> > ---
> >  hw/s390x/Makefile.objs |1 +
> >  hw/s390x/css.c | 1209 
> > 
> >  hw/s390x/css.h |   90 
> >  target-s390x/Makefile.objs |2 +-
> >  target-s390x/cpu.h |  232 +
> >  target-s390x/helper.c  |  146 ++
> >  target-s390x/ioinst.c  |  737 +++
> >  target-s390x/ioinst.h  |  213 
> >  target-s390x/kvm.c |  251 -
> >  target-s390x/misc_helper.c |6 +-
> >  10 files changed, 2872 insertions(+), 15 deletions(-)
> >  create mode 100644 hw/s390x/css.c
> >  create mode 100644 hw/s390x/css.h
> >  create mode 100644 target-s390x/ioinst.c
> >  create mode 100644 target-s390x/ioinst.h
> 
> > +void kvm_s390_enable_css_support(CPUS390XState *env)
> > +{
> > +    struct kvm_enable_cap cap = {};
> > +    int r;
> > +
> > +    /* Activate host kernel channel subsystem support. */
> > +    if (kvm_enabled()) {
> > +        /* One CPU has to run */
> > +        s390_add_running_cpu(env);
> 
> Care to explain this?

Old code leftovers; I've removed it.

> 
> > +
> > +        cap.cap = KVM_CAP_S390_CSS_SUPPORT;
> > +        r = kvm_vcpu_ioctl(env, KVM_ENABLE_CAP, &cap);
> > +        assert(r == 0);
> > +    }
> > +}
> 
> 


Re: interrupt remapping support

2012-11-13 Thread Gleb Natapov

On Mon, Nov 12, 2012 at 10:44:51PM -0500, Abhinav Srivastava wrote:
> Hi there,
> 
> I would like to know if KVM supports interrupt remapping and queued
> invalidation. I could not find it in the kvm source code.
> I also noticed that these features are in KVM's TODO list. Is that
> accurate? Any pointers or direction would be appreciated.
> 
KVM does not implement VT-d spec if this is your question. Any help with
this will be appreciated.

--
Gleb.

Re: [PATCH 2/5] KVM: MMU: simplify mmu_set_spte

2012-11-13 Thread Xiao Guangrong

On 11/13/2012 07:12 AM, Marcelo Tosatti wrote:
> On Mon, Nov 05, 2012 at 08:10:08PM +0800, Xiao Guangrong wrote:
>> In order to detect spte remapping, we can simply check whether the
>> spte already points to the pfn, even if the spte is not the last
>> spte, because a middle spte points to a kernel pfn, which cannot
>> be mapped to userspace.
>>
>> Also, update slot and stat.lpages iff the spte is not remapped.
>>
>> Signed-off-by: Xiao Guangrong 
>> ---
>>  arch/x86/kvm/mmu.c |   40 +---
>>  1 files changed, 13 insertions(+), 27 deletions(-)
>>
>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> index 692ebb1..4ea731e 100644
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -2420,8 +2420,7 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
>>   pfn_t pfn, bool speculative,
>>   bool host_writable)
>>  {
>> -int was_rmapped = 0;
>> -int rmap_count;
>> +bool was_rmapped = false;
>>
>>  pgprintk("%s: spte %llx access %x write_fault %d"
>>   " user_fault %d gfn %llx\n",
>> @@ -2429,25 +2428,13 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
>>   write_fault, user_fault, gfn);
>>
>>  if (is_rmap_spte(*sptep)) {
>> -/*
>> - * If we overwrite a PTE page pointer with a 2MB PMD, unlink
>> - * the parent of the now unreachable PTE.
>> - */
>> -if (level > PT_PAGE_TABLE_LEVEL &&
>> -!is_large_pte(*sptep)) {
>> -struct kvm_mmu_page *child;
>> -u64 pte = *sptep;
>> +if (pfn != spte_to_pfn(*sptep)) {
>> +struct kvm_mmu_page *sp = page_header(__pa(sptep));
>>
>> -child = page_header(pte & PT64_BASE_ADDR_MASK);
>> -drop_parent_pte(child, sptep);
>> -kvm_flush_remote_tlbs(vcpu->kvm);
> 
> How come it's safe to drop this case?

We use "if (pfn != spte_to_pfn(*sptep))" to simplify this.
There are two cases:
1) The sptep is not the last mapping.
   In this case, sptep must point to a shadow page table, which means
   spte_to_pfn(*sptep) is used by the KVM module, while 'pfn' is used by
   userspace. So the 'if' condition must be satisfied, and the sptep will
   be dropped.

   Actually, this is the original case:
  | if (level > PT_PAGE_TABLE_LEVEL &&
  |     !is_large_pte(*sptep))

2) The sptep is the last mapping.
   In this case, the level of the spte (sp.level) must equal the 'level'
   that we pass to mmu_set_spte. If they point to the same pfn, it is a
   'remap'; otherwise we drop it.

I think this is safe. :)
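
The decision described in the two cases above can be modelled roughly
as follows. This is a toy sketch for illustration only, not the kernel
code: struct toy_spte, enum toy_action and toy_classify() are
hypothetical names, and 'present' stands in for is_rmap_spte().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical outcome of looking at the old spte before setting a new one. */
enum toy_action {
	TOY_INSTALL,		/* nothing mapped before */
	TOY_REMAP,		/* same pfn already mapped */
	TOY_DROP_THEN_INSTALL	/* different pfn: drop the old spte first */
};

/* Toy spte: 'present' stands in for is_rmap_spte(), 'pfn' for spte_to_pfn(). */
struct toy_spte {
	bool present;
	uint64_t pfn;
};

static enum toy_action toy_classify(const struct toy_spte *old, uint64_t new_pfn)
{
	if (!old->present)
		return TOY_INSTALL;
	/*
	 * Case 1: a non-last spte points to a shadow page table whose pfn
	 * belongs to the kernel, so it can never equal a userspace pfn.
	 * Case 2: a last-level spte pointing to a different pfn.
	 */
	if (old->pfn != new_pfn)
		return TOY_DROP_THEN_INSTALL;
	return TOY_REMAP;
}
```

In this model, the single pfn comparison covers both the old
"level > PT_PAGE_TABLE_LEVEL && !is_large_pte()" case and the ordinary
changed-mapping case.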


Re: [PATCH] KVM: MMU: lazily drop large spte

2012-11-13 Thread Xiao Guangrong

Hi Marcelo,

On 11/13/2012 07:10 AM, Marcelo Tosatti wrote:
> On Mon, Nov 05, 2012 at 05:59:26PM +0800, Xiao Guangrong wrote:
>> Do not drop a large spte until it can be replaced by small pages, so that
>> the guest can happily read memory through it.
>>
>> The idea is from Avi:
>> | As I mentioned before, write-protecting a large spte is a good idea,
>> | since it moves some work from protect-time to fault-time, so it reduces
>> | jitter.  This removes the need for the return value.
>>
>> Signed-off-by: Xiao Guangrong 
>> ---
>>  arch/x86/kvm/mmu.c |   34 +-
>>  1 files changed, 9 insertions(+), 25 deletions(-)
> 
> It's likely that other 4k pages are mapped read-write in the 2mb range
> covered by a read-only 2mb map. Therefore it's not entirely useful to
> map read-only.
> 

Currently a page fault is needed to install a pte even for a read access.
After the change, that page fault can be avoided.

> Can you measure an improvement with this change?

I have attached a test case that measures the read time.
It first maps 4k pages (dirty-logged), then switches to large sptes
(dirty logging stopped), and finally measures the read access time after
write-protecting the sptes.

Before: 23314111 ns After: 11404197 ns


testcase.tar.bz2
Description: application/bzip
