Re: Massive read only kvm guests when backing file was missing

2014-03-26 Thread Michael S. Tsirkin
On Wed, Mar 26, 2014 at 11:08:03PM -0300, Alejandro Comisario wrote:
> Hi list!
> I hope someone can help me. We had a big issue in our cloud the other
> day: a couple of our OpenStack regions (2000+ KVM guests with qcow2)
> went read-only on the guest side because the backing-files directory
> (the OpenStack _base directory) was compromised and the data was lost.
> When we realized the data was lost, it took us 5 minutes to restore
> the backup of the backing files, but by that time all the KVM guests
> had received some kind of I/O error from the hypervisor layer and
> remounted their root filesystems read-only.
> 
> My question is: is there a way to hold the I/O operations against the
> backing files (I assume these would be 99% READ operations) a little
> longer (I'm asking because I don't quite understand what the process
> is and when it raises the error) in case the backing files are missing
> (no I/O possible) but recoverable within minutes?
> 
> Any tip on how to achieve this, if possible, or information about how
> backing files work in KVM, would be amazing.
> Waiting for feedback!
> 
> Kindest regards,
> Alejandro Comisario


I'm guessing this is what happened: the guests' disk requests timed out
in the meantime. You can increase the timeout within the guest:
echo 600 > /sys/block/sda/device/timeout
to time out after 10 minutes (600 seconds).
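
In case it helps, a minimal sketch that applies the same setting to every
block device (assuming the standard /sys/block/<dev>/device/timeout layout;
devices without that attribute are simply skipped):

#include <dirent.h>
#include <stdio.h>

int main(void)
{
	DIR *d = opendir("/sys/block");
	struct dirent *e;
	char path[256];

	if (!d)
		return 1;
	while ((e = readdir(d)) != NULL) {
		FILE *f;

		if (e->d_name[0] == '.')
			continue;
		snprintf(path, sizeof(path),
			 "/sys/block/%s/device/timeout", e->d_name);
		f = fopen(path, "w");
		if (!f)
			continue;	/* no SCSI-style timeout attribute here */
		fprintf(f, "600\n");	/* 600 s = 10 minutes */
		fclose(f);
	}
	closedir(d);
	return 0;
}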

If you have the QEMU guest agent installed, you can do this from the
host. Unfortunately, by default its memory can be pushed out to swap,
and then, on a disk error, accessing it might fail :(
Maybe we should consider mlocking all of its memory, at least as an option.
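
A minimal sketch of what such an option could do at startup (mlockall()
is standard POSIX; wiring it into the agent is only a suggestion here,
not existing qemu-ga behavior):

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	/* pin current and future allocations so the agent stays
	 * responsive while the host's disks are unavailable */
	if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
		perror("mlockall");
		return 1;
	}
	/* ... agent main loop would run here ... */
	return 0;
}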

You could pause your guests and restart them after the issue is
resolved, and we could, I guess, add functionality to pause VMs on
disk errors automatically.
Stefan?


-- 
MST


[PATCH 0/4] KVM: enable Intel SMAP for KVM

2014-03-26 Thread Feng Wu
Supervisor Mode Access Prevention (SMAP) is a new security feature
disclosed by Intel; please refer to the following document:

http://software.intel.com/sites/default/files/319433-014.pdf
 
Every access to a linear address is either a supervisor-mode access
or a user-mode access. All accesses performed while the current
privilege level (CPL) is less than 3 are supervisor-mode accesses.
If CPL = 3, accesses are generally user-mode accesses. However, some
operations implicitly access system data structures, and the resulting
accesses to those data structures are supervisor-mode accesses regardless
of CPL. Examples of such implicit supervisor accesses include the following:
accesses to the global descriptor table (GDT) or local descriptor table
(LDT) to load a segment descriptor; accesses to the interrupt descriptor
table (IDT) when delivering an interrupt or exception; and accesses to the
task-state segment (TSS) as part of a task switch or change of CPL.

If CR4.SMAP = 1, supervisor-mode data accesses are not allowed to linear
addresses that are accessible in user mode. If CPL < 3, SMAP protections
are disabled if EFLAGS.AC = 1. If CPL = 3, SMAP applies to all supervisor-mode
data accesses (these are implicit supervisor accesses) regardless of the
value of EFLAGS.AC.
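
A minimal sketch condensing those rules into one predicate (the names here
are illustrative only, not KVM symbols; the access is assumed to be a
supervisor-mode data access):

#include <stdbool.h>

static bool smap_faults(bool cr4_smap, unsigned int cpl, bool eflags_ac,
			bool page_user_accessible)
{
	if (!cr4_smap || !page_user_accessible)
		return false;		/* SMAP only guards user-accessible pages */
	if (cpl < 3)
		return !eflags_ac;	/* EFLAGS.AC = 1 disables SMAP at CPL < 3 */
	return true;			/* CPL = 3: implicit supervisor access, AC ignored */
}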

This patchset passes the SMAP feature through to guests, letting them
benefit from it.

Feng Wu (4):
  KVM: expose SMAP feature to guest
  KVM: Remove SMAP bit from CR4_RESERVED_BITS.
  KVM: Add SMAP support when setting CR4
  KVM: Disable SMAP for guests in EPT realmode and EPT unpaging mode

 arch/x86/include/asm/kvm_host.h |  2 +-
 arch/x86/kvm/cpuid.c            |  2 +-
 arch/x86/kvm/cpuid.h            |  8 ++++++++
 arch/x86/kvm/mmu.c              | 22 +++++++++++++++++++---
 arch/x86/kvm/mmu.h              |  2 ++
 arch/x86/kvm/vmx.c              | 10 ++++++----
 arch/x86/kvm/x86.c              |  6 ++++++
 7 files changed, 43 insertions(+), 9 deletions(-)

-- 
1.8.3.1



[PATCH 1/4] KVM: expose SMAP feature to guest

2014-03-26 Thread Feng Wu
This patch exposes the SMAP feature to the guest.

Signed-off-by: Feng Wu 
---
 arch/x86/kvm/cpuid.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index c697625..deb5f9b 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -303,7 +303,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
/* cpuid 7.0.ebx */
const u32 kvm_supported_word9_x86_features =
F(FSGSBASE) | F(BMI1) | F(HLE) | F(AVX2) | F(SMEP) |
-   F(BMI2) | F(ERMS) | f_invpcid | F(RTM);
+   F(BMI2) | F(ERMS) | f_invpcid | F(RTM) | F(SMAP);
 
/* all calls to cpuid_count() should be made on the same cpu */
get_cpu();
-- 
1.8.3.1



[PATCH 4/4] KVM: Disable SMAP for guests in EPT realmode and EPT unpaging mode

2014-03-26 Thread Feng Wu
SMAP is disabled if the CPU is in non-paging mode in hardware.
However, KVM always uses paging mode to emulate guest non-paging
mode with TDP. To emulate this behavior, SMAP needs to be
manually disabled when the guest switches to non-paging mode.

Signed-off-by: Feng Wu 
---
 arch/x86/kvm/vmx.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index dcc4de3..1d37e50 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3421,13 +3421,15 @@ static int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
hw_cr4 &= ~X86_CR4_PAE;
hw_cr4 |= X86_CR4_PSE;
/*
-* SMEP is disabled if CPU is in non-paging mode in
-* hardware. However KVM always uses paging mode to
+* SMEP/SMAP is disabled if CPU is in non-paging mode
+* in hardware. However KVM always uses paging mode to
 * emulate guest non-paging mode with TDP.
-* To emulate this behavior, SMEP needs to be manually
-* disabled when guest switches to non-paging mode.
+* To emulate this behavior, SMEP/SMAP needs to be
+* manually disabled when guest switches to non-paging
+* mode.
 */
hw_cr4 &= ~X86_CR4_SMEP;
+   hw_cr4 &= ~X86_CR4_SMAP;
} else if (!(cr4 & X86_CR4_PAE)) {
hw_cr4 &= ~X86_CR4_PAE;
}
-- 
1.8.3.1



[PATCH 3/4] KVM: Add SMAP support when setting CR4

2014-03-26 Thread Feng Wu
This patch adds SMAP handling logic when setting CR4 for guests.

Signed-off-by: Feng Wu 
---
 arch/x86/kvm/cpuid.h |  8 ++++++++
 arch/x86/kvm/mmu.c   | 22 +++++++++++++++++++---
 arch/x86/kvm/mmu.h   |  2 ++
 arch/x86/kvm/x86.c   |  6 ++++++
 4 files changed, 35 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
index f1e4895..63124a2 100644
--- a/arch/x86/kvm/cpuid.h
+++ b/arch/x86/kvm/cpuid.h
@@ -48,6 +48,14 @@ static inline bool guest_cpuid_has_smep(struct kvm_vcpu *vcpu)
return best && (best->ebx & bit(X86_FEATURE_SMEP));
 }
 
+static inline bool guest_cpuid_has_smap(struct kvm_vcpu *vcpu)
+{
+   struct kvm_cpuid_entry2 *best;
+
+   best = kvm_find_cpuid_entry(vcpu, 7, 0);
+   return best && (best->ebx & bit(X86_FEATURE_SMAP));
+}
+
 static inline bool guest_cpuid_has_fsgsbase(struct kvm_vcpu *vcpu)
 {
struct kvm_cpuid_entry2 *best;
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 40772ef..33e656c 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3591,14 +3591,15 @@ static void reset_rsvds_bits_mask_ept(struct kvm_vcpu *vcpu,
}
 }
 
-static void update_permission_bitmask(struct kvm_vcpu *vcpu,
+void update_permission_bitmask(struct kvm_vcpu *vcpu,
struct kvm_mmu *mmu, bool ept)
 {
unsigned bit, byte, pfec;
u8 map;
-   bool fault, x, w, u, wf, uf, ff, smep;
+   bool fault, x, w, u, wf, uf, ff, smep, smap;
 
smep = kvm_read_cr4_bits(vcpu, X86_CR4_SMEP);
+   smap = kvm_read_cr4_bits(vcpu, X86_CR4_SMAP);
for (byte = 0; byte < ARRAY_SIZE(mmu->permissions); ++byte) {
pfec = byte << 1;
map = 0;
@@ -3617,11 +3618,26 @@ static void update_permission_bitmask(struct kvm_vcpu *vcpu,
w |= !is_write_protection(vcpu) && !uf;
/* Disallow supervisor fetches of user code if cr4.smep */
x &= !(smep && u && !uf);
+
+   /*
+    * SMAP: kernel-mode data accesses from user-mode
+    * mappings should fault. A fault is considered
+    * a SMAP violation if all of the following
+    * conditions are true:
+    *   - X86_CR4_SMAP is set in CR4
+    *   - A user page is accessed
+    *   - !(CPL < 3 && X86_EFLAGS_AC is set)
+    *   - Page fault in kernel mode
+    */
+   smap = smap && u && !uf &&
+       !((kvm_x86_ops->get_cpl(vcpu) < 3) &&
+         ((kvm_x86_ops->get_rflags(vcpu) &
+           X86_EFLAGS_AC) != 0));
} else
/* Not really needed: no U/S accesses on ept  */
u = 1;
 
-   fault = (ff && !x) || (uf && !u) || (wf && !w);
+   fault = (ff && !x) || (uf && !u) || (wf && !w) || smap;
map |= fault << bit;
}
mmu->permissions[byte] = map;
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 2926152..8820f78 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -73,6 +73,8 @@ int handle_mmio_page_fault_common(struct kvm_vcpu *vcpu, u64 addr, bool direct);
 void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context);
 void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context,
bool execonly);
+void update_permission_bitmask(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
+   bool ept);
 
 static inline unsigned int kvm_mmu_available_pages(struct kvm *kvm)
 {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4e33b85..f8293fb 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -630,6 +630,9 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
if (!guest_cpuid_has_smep(vcpu) && (cr4 & X86_CR4_SMEP))
return 1;
 
+   if (!guest_cpuid_has_smap(vcpu) && (cr4 & X86_CR4_SMAP))
+   return 1;
+
if (!guest_cpuid_has_fsgsbase(vcpu) && (cr4 & X86_CR4_FSGSBASE))
return 1;
 
@@ -658,6 +661,9 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
(!(cr4 & X86_CR4_PCIDE) && (old_cr4 & X86_CR4_PCIDE)))
kvm_mmu_reset_context(vcpu);
 
+   if ((cr4 ^ old_cr4) & X86_CR4_SMAP)
+   update_permission_bitmask(vcpu, vcpu->arch.walk_mmu, false);
+
if ((cr4 ^ old_cr4) & X86_CR4_OSXSAVE)
kvm_update_cpuid(vcpu);
 
-- 
1.8.3.1


[PATCH 2/4] KVM: Remove SMAP bit from CR4_RESERVED_BITS.

2014-03-26 Thread Feng Wu
This patch removes the SMAP bit from CR4_RESERVED_BITS.

Signed-off-by: Feng Wu 
---
 arch/x86/include/asm/kvm_host.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ae5d783..b673925 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -60,7 +60,7 @@
  | X86_CR4_PSE | X86_CR4_PAE | X86_CR4_MCE \
 | X86_CR4_PGE | X86_CR4_PCE | X86_CR4_OSFXSR | X86_CR4_PCIDE \
  | X86_CR4_OSXSAVE | X86_CR4_SMEP | X86_CR4_FSGSBASE \
- | X86_CR4_OSXMMEXCPT | X86_CR4_VMXE))
+ | X86_CR4_OSXMMEXCPT | X86_CR4_VMXE | X86_CR4_SMAP))
 
 #define CR8_RESERVED_BITS (~(unsigned long)X86_CR8_TPR)
 
-- 
1.8.3.1



Massive read only kvm guests when backing file was missing

2014-03-26 Thread Alejandro Comisario
Hi list!
I hope someone can help me. We had a big issue in our cloud the other
day: a couple of our OpenStack regions (2000+ KVM guests with qcow2)
went read-only on the guest side because the backing-files directory
(the OpenStack _base directory) was compromised and the data was lost.
When we realized the data was lost, it took us 5 minutes to restore
the backup of the backing files, but by that time all the KVM guests
had received some kind of I/O error from the hypervisor layer and
remounted their root filesystems read-only.

My question is: is there a way to hold the I/O operations against the
backing files (I assume these would be 99% READ operations) a little
longer (I'm asking because I don't quite understand what the process
is and when it raises the error) in case the backing files are missing
(no I/O possible) but recoverable within minutes?

Any tip on how to achieve this, if possible, or information about how
backing files work in KVM, would be amazing.
Waiting for feedback!

Kindest regards,
Alejandro Comisario


Re: [PATCH v4] core, nfqueue, openvswitch: Orphan frags in skb_zerocopy and handle errors

2014-03-26 Thread Zoltan Kiss

On 26/03/14 20:12, David Miller wrote:

From: David Miller 
Date: Wed, 26 Mar 2014 15:59:58 -0400 (EDT)


From: Zoltan Kiss 
Date: Fri, 21 Mar 2014 10:31:34 +


skb_zerocopy can copy elements of the frags array between skbs, but it doesn't
orphan them. Also, it doesn't handle errors, so this patch takes care of that
as well, and modifies the callers accordingly. skb_tx_error() is also added to
the callers so they will signal the failed delivery towards the creator of the
skb.

Signed-off-by: Zoltan Kiss 


Applied, thanks Zoltan.


Actually, Zoltan, you have to fix this:

net/core/skbuff.c: In function ‘skb_zerocopy’:
net/core/skbuff.c:2172:2: warning: passing argument 1 of ‘skb_orphan_frags’ discards ‘const’ qualifier from pointer target type [enabled by default]
In file included from include/linux/tcp.h:21:0,
  from net/core/skbuff.c:50:
include/linux/skbuff.h:1904:19: note: expected ‘struct sk_buff *’ but argument is of type ‘const struct sk_buff *’
net/core/skbuff.c:2173:3: warning: passing argument 1 of ‘skb_tx_error’ discards ‘const’ qualifier from pointer target type [enabled by default]
net/core/skbuff.c:642:6: note: expected ‘struct sk_buff *’ but argument is of type ‘const struct sk_buff *’



OK, resubmitted. 'from' is no longer a const parameter, because
skb->pfmemalloc might change.


Zoli


[PATCH v5] core, nfqueue, openvswitch: Orphan frags in skb_zerocopy and handle errors

2014-03-26 Thread Zoltan Kiss
skb_zerocopy can copy elements of the frags array between skbs, but it doesn't
orphan them. Also, it doesn't handle errors, so this patch takes care of that
as well, and modifies the callers accordingly. skb_tx_error() is also added to
the callers so they will signal the failed delivery towards the creator of the
skb.

Signed-off-by: Zoltan Kiss 
---
v2: orphan the frags right before touching the frags

v3:
- orphan 'from' instead of 'to'
- call skb_tx_error() in the callers if something went wrong

v4: correctly use error path in queue_userspace_packet

v5: remove const qualifier of "from" in skb_zerocopy parameters
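
For reference, a hedged sketch of the calling convention this introduces
(an illustrative fragment only; the real callers are updated in the hunks
below):

	err = skb_zerocopy(to, from, len, hlen);
	if (err) {
		/* -ENOMEM: frags of 'from' could not be orphaned, and
		 * skb_zerocopy() already signaled skb_tx_error() on it;
		 * -EFAULT: skb_copy_bits() hit bad skb geometry. Either
		 * way, drop the half-built skb and propagate the error. */
		kfree_skb(to);
		return err;
	}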

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 03db95a..0db91fa 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2508,8 +2508,8 @@ int skb_splice_bits(struct sk_buff *skb, unsigned int offset,
unsigned int flags);
 void skb_copy_and_csum_dev(const struct sk_buff *skb, u8 *to);
 unsigned int skb_zerocopy_headlen(const struct sk_buff *from);
-void skb_zerocopy(struct sk_buff *to, const struct sk_buff *from,
- int len, int hlen);
+int skb_zerocopy(struct sk_buff *to, struct sk_buff *from,
+int len, int hlen);
 void skb_split(struct sk_buff *skb, struct sk_buff *skb1, const u32 len);
 int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, int shiftlen);
 void skb_scrub_packet(struct sk_buff *skb, bool xnet);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 3f14c63..908ad98 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2127,25 +2127,31 @@ EXPORT_SYMBOL_GPL(skb_zerocopy_headlen);
  *
  * The `hlen` as calculated by skb_zerocopy_headlen() specifies the
  * headroom in the `to` buffer.
+ *
+ * Return value:
+ * 0: everything is OK
+ * -ENOMEM: couldn't orphan frags of @from due to lack of memory
+ * -EFAULT: skb_copy_bits() found some problem with skb geometry
  */
-void
-skb_zerocopy(struct sk_buff *to, const struct sk_buff *from, int len, int hlen)
+int
+skb_zerocopy(struct sk_buff *to, struct sk_buff *from, int len, int hlen)
 {
int i, j = 0;
int plen = 0; /* length of skb->head fragment */
+   int ret;
struct page *page;
unsigned int offset;
 
BUG_ON(!from->head_frag && !hlen);
 
/* dont bother with small payloads */
-   if (len <= skb_tailroom(to)) {
-   skb_copy_bits(from, 0, skb_put(to, len), len);
-   return;
-   }
+   if (len <= skb_tailroom(to))
+   return skb_copy_bits(from, 0, skb_put(to, len), len);
 
if (hlen) {
-   skb_copy_bits(from, 0, skb_put(to, hlen), hlen);
+   ret = skb_copy_bits(from, 0, skb_put(to, hlen), hlen);
+   if (unlikely(ret))
+   return ret;
len -= hlen;
} else {
plen = min_t(int, skb_headlen(from), len);
@@ -2163,6 +2169,11 @@ skb_zerocopy(struct sk_buff *to, const struct sk_buff *from, int len, int hlen)
to->len += len + plen;
to->data_len += len + plen;
 
+   if (unlikely(skb_orphan_frags(from, GFP_ATOMIC))) {
+   skb_tx_error(from);
+   return -ENOMEM;
+   }
+
for (i = 0; i < skb_shinfo(from)->nr_frags; i++) {
if (!len)
break;
@@ -2173,6 +2184,8 @@ skb_zerocopy(struct sk_buff *to, const struct sk_buff *from, int len, int hlen)
j++;
}
skb_shinfo(to)->nr_frags = j;
+
+   return 0;
 }
 EXPORT_SYMBOL_GPL(skb_zerocopy);
 
diff --git a/net/netfilter/nfnetlink_queue_core.c b/net/netfilter/nfnetlink_queue_core.c
index f072fe8..108120f 100644
--- a/net/netfilter/nfnetlink_queue_core.c
+++ b/net/netfilter/nfnetlink_queue_core.c
@@ -354,13 +354,16 @@ nfqnl_build_packet_message(struct net *net, struct nfqnl_instance *queue,
 
skb = nfnetlink_alloc_skb(net, size, queue->peer_portid,
  GFP_ATOMIC);
-   if (!skb)
+   if (!skb) {
+   skb_tx_error(entskb);
return NULL;
+   }
 
nlh = nlmsg_put(skb, 0, 0,
NFNL_SUBSYS_QUEUE << 8 | NFQNL_MSG_PACKET,
sizeof(struct nfgenmsg), 0);
if (!nlh) {
+   skb_tx_error(entskb);
kfree_skb(skb);
return NULL;
}
@@ -488,13 +491,15 @@ nfqnl_build_packet_message(struct net *net, struct nfqnl_instance *queue,
nla->nla_type = NFQA_PAYLOAD;
nla->nla_len = nla_attr_size(data_len);
 
-   skb_zerocopy(skb, entskb, data_len, hlen);
+   if (skb_zerocopy(skb, entskb, data_len, hlen))
+   goto nla_put_failure;
}
 
nlh->nlmsg_len = skb->len;
return skb;
 
 nla_put_failure:
+   skb_tx_error(entskb);
kfree_skb(skb);
net_err_ratelimited("nf_queue: error creating packet message\n");
  

Re: [PATCH 4/4] KVM: PPC: Bookehv: Get vcpu's last instruction for emulation

2014-03-26 Thread Scott Wood
On Thu, 2014-02-20 at 18:30 +0200, Mihai Caraman wrote:
> Load external pid (lwepx) instruction faults (when called from
> KVM with guest context) need to be handled by KVM. This implies
> additional code in the DO_KVM macro to identify the source of the
> exception (which originates from the KVM host rather than the guest).
> The hook requires checking the Exception Syndrome Register
> ESR[EPID] and the External PID Load Context Register EPLC[EGS] for
> some exceptions (DTLB_MISS, DSI and LRAT). Doing this on the Data TLB
> miss exception is obviously intrusive for the host.
> 
> Get rid of lwepx and acquire the last instruction in kvmppc_get_last_inst()
> by searching for the physical address and kmapping it. This fixes an
> infinite loop caused by lwepx's data TLB miss being handled in the host
> and the TODO for TLB eviction and execute-but-not-read entries.
> 
> Signed-off-by: Mihai Caraman 
> ---
>  arch/powerpc/kvm/bookehv_interrupts.S |  37 +++----------
>  arch/powerpc/kvm/e500_mmu_host.c      |  93 ++++++++++++++++++++++++++++
>  2 files changed, 102 insertions(+), 28 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/bookehv_interrupts.S b/arch/powerpc/kvm/bookehv_interrupts.S
> index 20c7a54..c50490c 100644
> --- a/arch/powerpc/kvm/bookehv_interrupts.S
> +++ b/arch/powerpc/kvm/bookehv_interrupts.S
> @@ -119,38 +119,14 @@
>  1:
>  
>   .if \flags & NEED_EMU
> - /*
> -  * This assumes you have external PID support.
> -  * To support a bookehv CPU without external PID, you'll
> -  * need to look up the TLB entry and create a temporary mapping.
> -  *
> -  * FIXME: we don't currently handle if the lwepx faults.  PR-mode
> -  * booke doesn't handle it either.  Since Linux doesn't use
> -  * broadcast tlbivax anymore, the only way this should happen is
> -  * if the guest maps its memory execute-but-not-read, or if we
> -  * somehow take a TLB miss in the middle of this entry code and
> -  * evict the relevant entry.  On e500mc, all kernel lowmem is
> -  * bolted into TLB1 large page mappings, and we don't use
> -  * broadcast invalidates, so we should not take a TLB miss here.
> -  *
> -  * Later we'll need to deal with faults here.  Disallowing guest
> -  * mappings that are execute-but-not-read could be an option on
> -  * e500mc, but not on chips with an LRAT if it is used.
> -  */
> -
> - mfspr   r3, SPRN_EPLC   /* will already have correct ELPID and EGS */
>   PPC_STL r15, VCPU_GPR(R15)(r4)
>   PPC_STL r16, VCPU_GPR(R16)(r4)
>   PPC_STL r17, VCPU_GPR(R17)(r4)
>   PPC_STL r18, VCPU_GPR(R18)(r4)
>   PPC_STL r19, VCPU_GPR(R19)(r4)
> - mr  r8, r3
>   PPC_STL r20, VCPU_GPR(R20)(r4)
> - rlwimi  r8, r6, EPC_EAS_SHIFT - MSR_IR_LG, EPC_EAS
>   PPC_STL r21, VCPU_GPR(R21)(r4)
> - rlwimi  r8, r6, EPC_EPR_SHIFT - MSR_PR_LG, EPC_EPR
>   PPC_STL r22, VCPU_GPR(R22)(r4)
> - rlwimi  r8, r10, EPC_EPID_SHIFT, EPC_EPID
>   PPC_STL r23, VCPU_GPR(R23)(r4)
>   PPC_STL r24, VCPU_GPR(R24)(r4)
>   PPC_STL r25, VCPU_GPR(R25)(r4)
> @@ -160,10 +136,15 @@
>   PPC_STL r29, VCPU_GPR(R29)(r4)
>   PPC_STL r30, VCPU_GPR(R30)(r4)
>   PPC_STL r31, VCPU_GPR(R31)(r4)
> - mtspr   SPRN_EPLC, r8
> - isync
> - lwepx   r9, 0, r5
> - mtspr   SPRN_EPLC, r3
> +
> + /*
> +  * We don't use external PID support. lwepx faults would need to be
> +  * handled by KVM and this implies additional code in DO_KVM (for
> +  * DTLB_MISS, DSI and LRAT) to check ESR[EPID] and EPLC[EGS], which
> +  * is too intrusive for the host. Get the last instruction in
> +  * kvmppc_get_last_inst().
> +  */
> + li  r9, KVM_INST_FETCH_FAILED
>   stw r9, VCPU_LAST_INST(r4)
>   .endif
>  
> diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c
> index 6025cb7..1b4cb41 100644
> --- a/arch/powerpc/kvm/e500_mmu_host.c
> +++ b/arch/powerpc/kvm/e500_mmu_host.c
> @@ -598,9 +598,102 @@ void kvmppc_mmu_map(struct kvm_vcpu *vcpu, u64 eaddr, gpa_t gpaddr,
>   }
>  }
>  
> +#ifdef CONFIG_KVM_BOOKE_HV
> +int kvmppc_ld_inst(struct kvm_vcpu *vcpu, u32 *instr)

It'd be interesting to see what the performance impact of doing this on
non-HV would be -- it would eliminate divergent code, eliminate the
MSR_DS hack, and make exec-only mappings work.

> +{
> + gva_t geaddr;
> + hpa_t addr;
> + hfn_t pfn;
> + hva_t eaddr;
> + u32 mas0, mas1, mas2, mas3;
> + u64 mas7_mas3;
> + struct page *page;
> + unsigned int addr_space, psize_shift;
> + bool pr;
> + unsigned long flags;
> +
> + /* Search TLB for guest pc to get the real address */
> + geaddr = kvmppc_get_pc(vcpu);
> + addr_space = (vcpu->arch.shared->msr & MSR_IS) >> MSR_IR_LG;
> +
> + local_irq_save(flags);
> + mtspr(SPRN_MAS6, (vcpu->arch.pid << MAS6_SPID_SHIFT) | addr_space);
> + mtspr(SPRN_MAS5, MAS5_SGS | vcpu->kvm->arch.lpid);
> + isync();
> 

Re: [PATCH 3/4] KVM: PPC: Alow kvmppc_get_last_inst() to fail

2014-03-26 Thread Scott Wood
On Thu, 2014-02-20 at 18:30 +0200, Mihai Caraman wrote:
> diff --git a/arch/powerpc/kvm/book3s_paired_singles.c b/arch/powerpc/kvm/book3s_paired_singles.c
> index a59a25a..80c533e 100644
> --- a/arch/powerpc/kvm/book3s_paired_singles.c
> +++ b/arch/powerpc/kvm/book3s_paired_singles.c
> @@ -640,19 +640,24 @@ static int kvmppc_ps_one_in(struct kvm_vcpu *vcpu, bool rc,
>  
>  int kvmppc_emulate_paired_single(struct kvm_run *run, struct kvm_vcpu *vcpu)
>  {
> - u32 inst = kvmppc_get_last_inst(vcpu);
> + u32 inst;
>   enum emulation_result emulated = EMULATE_DONE;
> -
> - int ax_rd = inst_get_field(inst, 6, 10);
> - int ax_ra = inst_get_field(inst, 11, 15);
> - int ax_rb = inst_get_field(inst, 16, 20);
> - int ax_rc = inst_get_field(inst, 21, 25);
> - short full_d = inst_get_field(inst, 16, 31);
> -
> - u64 *fpr_d = &vcpu->arch.fpr[ax_rd];
> - u64 *fpr_a = &vcpu->arch.fpr[ax_ra];
> - u64 *fpr_b = &vcpu->arch.fpr[ax_rb];
> - u64 *fpr_c = &vcpu->arch.fpr[ax_rc];
> + int ax_rd, ax_ra, ax_rb, ax_rc;
> + short full_d;
> + u64 *fpr_d, *fpr_a, *fpr_b, *fpr_c;
> +
> + kvmppc_get_last_inst(vcpu, &inst);

Should probably check for failure here and elsewhere -- even though it
can't currently fail on book3s, the interface now allows it.

> diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
> index 5b9e906..b0d884d 100644
> --- a/arch/powerpc/kvm/book3s_pr.c
> +++ b/arch/powerpc/kvm/book3s_pr.c
> @@ -624,9 +624,10 @@ void kvmppc_giveup_ext(struct kvm_vcpu *vcpu, ulong msr)
>  static int kvmppc_read_inst(struct kvm_vcpu *vcpu)
>  {
>   ulong srr0 = kvmppc_get_pc(vcpu);
> - u32 last_inst = kvmppc_get_last_inst(vcpu);
> + u32 last_inst;
>   int ret;
>  
> + kvmppc_get_last_inst(vcpu, &last_inst);
>   ret = kvmppc_ld(vcpu, &srr0, sizeof(u32), &last_inst, false);

This isn't new, but this function looks odd to me -- calling
kvmppc_get_last_inst() but ignoring last_inst, then calling kvmppc_ld()
and ignoring anything but failure.  last_inst itself is never read.  And
no comments to explain the weirdness. :-)

I get that kvmppc_get_last_inst() is probably being used for the side
effect of filling in vcpu->arch.last_inst, but why store the return
value without using it?  Why pass the address of it to kvmppc_ld(),
which seems to be used only as an indirect way of determining whether
kvmppc_get_last_inst() failed?  And that whole mechanism becomes
stranger once it becomes possible for kvmppc_get_last_inst() to directly
return failure.

> diff --git a/arch/powerpc/kvm/booke.h b/arch/powerpc/kvm/booke.h
> index 09bfd9b..c7c60c2 100644
> --- a/arch/powerpc/kvm/booke.h
> +++ b/arch/powerpc/kvm/booke.h
> @@ -90,6 +90,9 @@ void kvmppc_vcpu_disable_spe(struct kvm_vcpu *vcpu);
>  void kvmppc_booke_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
>  void kvmppc_booke_vcpu_put(struct kvm_vcpu *vcpu);
>  
> +void kvmppc_core_queue_inst_storage(struct kvm_vcpu *vcpu,
> +   ulong esr_flags);

Whitespace

> +
>  enum int_class {
>   INT_CLASS_NONCRIT,
>   INT_CLASS_CRIT,
> @@ -123,6 +126,8 @@ extern int kvmppc_core_emulate_mtspr_e500(struct kvm_vcpu *vcpu, int sprn,
>  extern int kvmppc_core_emulate_mfspr_e500(struct kvm_vcpu *vcpu, int sprn,
> ulong *spr_val);
>  
> +extern int kvmppc_ld_inst(struct kvm_vcpu *vcpu, u32 *instr) ;

Whitespace

> diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c
> index ecf2247..6025cb7 100644
> --- a/arch/powerpc/kvm/e500_mmu_host.c
> +++ b/arch/powerpc/kvm/e500_mmu_host.c
> @@ -34,6 +34,7 @@
>  #include "e500.h"
>  #include "timing.h"
>  #include "e500_mmu_host.h"
> +#include "booke.h"
>  
>  #include "trace_booke.h"
>  
> @@ -597,6 +598,10 @@ void kvmppc_mmu_map(struct kvm_vcpu *vcpu, u64 eaddr, gpa_t gpaddr,
>   }
>  }
>  
> +int kvmppc_ld_inst(struct kvm_vcpu *vcpu, u32 *instr) {
> + return EMULATE_FAIL;
> +};

Brace placement

>  /* MMU Notifiers */
>  
>  int kvm_unmap_hva(struct kvm *kvm, unsigned long hva)
> diff --git a/arch/powerpc/kvm/emulate.c b/arch/powerpc/kvm/emulate.c
> index 2f9a087..24a8e50 100644
> --- a/arch/powerpc/kvm/emulate.c
> +++ b/arch/powerpc/kvm/emulate.c
> @@ -225,19 +225,26 @@ static int kvmppc_emulate_mfspr(struct kvm_vcpu *vcpu, int sprn, int rt)
>   * from opcode tables in the future. */
>  int kvmppc_emulate_instruction(struct kvm_run *run, struct kvm_vcpu *vcpu)
>  {
> - u32 inst = kvmppc_get_last_inst(vcpu);
> - int ra = get_ra(inst);
> - int rs = get_rs(inst);
> - int rt = get_rt(inst);
> - int sprn = get_sprn(inst);
> - enum emulation_result emulated = EMULATE_DONE;
> + u32 inst;
> + int ra, rs, rt, sprn;
> + enum emulation_result emulated;
>   int advance = 1;
>  
>   /* this default type might be overwritten by subcategories */
>   kvmppc_set_exit_type(vcpu, E

Re: [PATCH 3/3] KVM: nVMX: check for null vmcs12 when L1 does invept

2014-03-26 Thread Bandan Das
Jan Kiszka  writes:

> On 2014-03-22 17:43, Bandan Das wrote:
>> Jan Kiszka  writes:
>> 
>>> On 2014-03-20 21:58, Bandan Das wrote:
 Jan Kiszka  writes:

> On 2014-03-20 04:28, Bandan Das wrote:
>> Some L1 hypervisors such as Xen seem to be calling invept after
>> vmclear or before vmptrld on L2. In this case, proceed with
>> falling through and syncing roots as a case where
>> context wide invalidation can't be supported
>
> Can we also base this behaviour on a statement in the SDM? But on first
> glance, I do not find anything like this over there.

 The SDM doesn't explicitly mention anything of this sort, but 28.3.3.1
 "Operations that Invalidate Cached Mappings" does mention that
 the instruction may invalidate mappings associated with other
 EP4TAs (even in single context).
>>>
>>> Yes, "may". So we are implementing undefined behavior in order to please
>>> a broken hypervisor that relies on it? Then please state this in the
>>> patch and probably also inform Xen about their issue.
>> 
>> Why undefined behavior? We don't do anything specific for
>> the single context invalidation case anyway, i.e. if the eptp matches what
>> vmcs12 has, single context invalidation already falls through to the global
>> invalidation case. All this change does is add the "L1 calls
>> invept after vmclear and before vmptrld" case to the list of cases that fall
>> through to global invalidation, since nvmx doesn't have any knowledge of
>> the current eptp in this case.
>
> OK, I think I misunderstood what the guest expects and how we currently
> achieve this: we do not track the mapping between guest and host eptp,
> thus cannot properly emulate its behaviour. We therefore need to flush
> everything.
>
>> 
>> Or do you think we should rethink this approach ?
>
> Well, I wonder if we should expose single-context invept support at all.
>
> I'm also wondering if we are returning proper flags on
>
> if ((operand.eptp & eptp_mask) !=
> (nested_ept_get_cr3(vcpu) & eptp_mask))
> break;

That does sound plausible but then we would have to get rid of this 
little optimization in the code you have quoted above. We would have
to flush and sync roots for all invept calls. So, my preference is to 
keep it.

> Neither nested_vmx_succeed nor nested_vmx_fail* is called if this
> condition evaluates to true.

It appears we should always call nested_vmx_succeed(). If you are OK
with that, I will send a v2.
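
Roughly, the behavior we would keep looks like this (an illustrative
sketch only, not the actual handle_invept() code):

	switch (type) {
	case VMX_EPT_EXTENT_CONTEXT:
		/*
		 * Without a guest->host eptp mapping we cannot invalidate
		 * just one context, so an unmatched (or unknown) eptp is
		 * treated exactly like the global case below.
		 */
		/* fall through */
	case VMX_EPT_EXTENT_GLOBAL:
		kvm_mmu_sync_roots(vcpu);
		kvm_mmu_flush_tlb(vcpu);
		nested_vmx_succeed(vcpu);
		break;
	}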

Thanks,
Bandan

> Jan


Re: [PATCH v4] core, nfqueue, openvswitch: Orphan frags in skb_zerocopy and handle errors

2014-03-26 Thread David Miller
From: David Miller 
Date: Wed, 26 Mar 2014 15:59:58 -0400 (EDT)

> From: Zoltan Kiss 
> Date: Fri, 21 Mar 2014 10:31:34 +
> 
>> skb_zerocopy can copy elements of the frags array between skbs, but it
>> doesn't orphan them. Also, it doesn't handle errors, so this patch takes
>> care of that as well, and modifies the callers accordingly. skb_tx_error()
>> is also added to the callers so they will signal the failed delivery
>> towards the creator of the skb.
>> 
>> Signed-off-by: Zoltan Kiss 
> 
> Applied, thanks Zoltan.

Actually, Zoltan, you have to fix this:

net/core/skbuff.c: In function ‘skb_zerocopy’:
net/core/skbuff.c:2172:2: warning: passing argument 1 of ‘skb_orphan_frags’ discards ‘const’ qualifier from pointer target type [enabled by default]
In file included from include/linux/tcp.h:21:0,
 from net/core/skbuff.c:50:
include/linux/skbuff.h:1904:19: note: expected ‘struct sk_buff *’ but argument is of type ‘const struct sk_buff *’
net/core/skbuff.c:2173:3: warning: passing argument 1 of ‘skb_tx_error’ discards ‘const’ qualifier from pointer target type [enabled by default]
net/core/skbuff.c:642:6: note: expected ‘struct sk_buff *’ but argument is of type ‘const struct sk_buff *’

Re: [PATCH v4] core, nfqueue, openvswitch: Orphan frags in skb_zerocopy and handle errors

2014-03-26 Thread David Miller
From: Zoltan Kiss 
Date: Fri, 21 Mar 2014 10:31:34 +

> skb_zerocopy can copy elements of the frags array between skbs, but it doesn't
> orphan them. Also, it doesn't handle errors, so this patch takes care of that
> as well, and modifies the callers accordingly. skb_tx_error() is also added to
> the callers so they will signal the failed delivery towards the creator of the
> skb.
> 
> Signed-off-by: Zoltan Kiss 

Applied, thanks Zoltan.


[PATCH] KVM: vmx: fix MPX detection

2014-03-26 Thread Paolo Bonzini
kvm_x86_ops is still NULL at this point.  Since kvm_init_msr_list
cannot fail, it is safe to initialize it before the call.

Reported-by: Fengguang Wu 
Cc: kvm@vger.kernel.org
Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/x86.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 39c28f09dfd5..49b514f76b5d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5537,9 +5537,10 @@ int kvm_arch_init(void *opaque)
goto out_free_percpu;
 
kvm_set_mmio_spte_mask();
-   kvm_init_msr_list();
 
kvm_x86_ops = ops;
+   kvm_init_msr_list();
+
kvm_mmu_set_mask_ptes(PT_USER_MASK, PT_ACCESSED_MASK,
PT_DIRTY_MASK, PT64_NX_MASK, 0);
 
-- 
1.8.5.3



Re: Preparing kvm/next for first pull request of 3.15 merge window

2014-03-26 Thread Paul Mackerras
On Wed, Mar 26, 2014 at 10:19:52AM +0100, Paolo Bonzini wrote:
> On 26/03/2014 04:51, Paul Mackerras wrote:
> >>> I would like to know from ARM and PPC maintainers *now* (before the
> >>> merge window opens) what will be in 3.15.  Also, PPC guys, please
> >>> make sure the pull requests will be based on commit e724f080f5dd
> >>> (KVM: PPC: Book3S HV: Fix register usage when loading/saving VRSAVE,
> >>> 2014-03-13).
> >Alex has 5 commits in his kvm-ppc-queue branch that should go in.
> >That tree is based on your kvm-3.14-2 tag, not on e724f080f5dd.
> >Would you accept a pull request for that from me on Alex's behalf
> >since he's away?
> 
> Yes, of course.
> 
> >Do you need me to rebase it at all?
> 
> These are:
> 
> 1ac4484979 PPC: KVM: Introduce hypervisor call H_GET_TCE
> e0a7be38c9 PPC: KVM: fix RESUME_GUEST check before returning from
> kvmppc_run_core()
> 013e98e657 PPC: KVM: fix RESUME_GUEST check before ending CEDE in
> kvmppc_run_core()
> e8a7f11fe0 PPC: KVM: fix VCPU run for HV KVM (v2)
> 26d96ec97c PPC: KVM: introduce helper to check RESUME_GUEST and related
> 
> I would consider rebasing; but you know better than me the effect of
> the two host-crash-fixing patches and how the
> testability/bisectability of kvm-ppc-queue is affected.
> 
> In particular, how likely is it that reverse-bisection ("which
> commit fixed the bug?") would end up on e724f080f5dd rather than one
> of the four RESUME_GUEST commits?

OK, based on that and the fact that 4 of those commits should ideally
be combined into one (as Greg commented), I will apply those changes
as two commits on top of e724f080f5dd, followed by the 8 that I posted
recently, and send you a pull request.

Thanks,
Paul.


Re: [RFC]Two ideas to optimize updating irq routing table

2014-03-26 Thread Michael S. Tsirkin
On Wed, Mar 26, 2014 at 08:22:29AM +, Gonglei (Arei) wrote:
> > > Based on discussions in:
> > > http://lists.gnu.org/archive/html/qemu-devel/2013-11/threads.html#03322
> > >
> > > About KVM_SET_GSI_ROUTING ioctl, I tested changing RCU to SRCU, but
> > unfortunately
> > > it looks like SRCU's grace period is no better than RCU.
> > 
> > Really?  This is not what Christian Borntraeger claimed at
> > http://permalink.gmane.org/gmane.comp.emulators.kvm.devel/118083 -- in
> > fact, Andrew Theurer is currently testing a variant of that patch and I
> > was going to post it for 3.16.
> > 
> > Have you tried synchronize_srcu_expedited?  Unlike the RCU variant, you
> > can use this function.
> > 
> Yes, previously I was using synchronize_srcu, which is not good. When I
> changed it to synchronize_srcu_expedited, grace period delay was much
> better than with synchronize_srcu. Though in our tests, we can still see
> some impact of the KVM_SET_GSI_ROUTING ioctl.
> 
> Our testing scenario is like this. In the VM we run a script that sets
> smp_affinity for each IRQ every 0.5s (this leads QEMU to do the
> KVM_SET_GSI_ROUTING ioctl). Outside the VM we ping that VM.
> 
> Without patches, ping time can jump from 0.3ms to 2ms-30ms. With the
> synchronize_srcu patch, ping time is worse. With the
> synchronize_srcu_expedited patch, ping time is overall good, though
> sometimes it jumps to 1ms-3ms.
> 
> With the following raw patch, ping time is like with the call_rcu patch:
> it is not influenced by setting IRQ affinity and stays at 0.3ms, and there
> is no vulnerability; frequent intermediate KVM_SET_GSI_ROUTING settings
> are just skipped, and the newest setting always takes effect.
> 
> Do you think this patch is acceptable, or does it have problems? Thanks in advance.

So this will just queue the update and never bother waiting for it
to take effect.

I'm not sure it's safe in all cases, but maybe we can optimize out
some cases by detecting in QEMU that a change only affects VCPUs that
aren't running, and telling KVM that it can delay the expensive
synchronization.



> diff -urp kvm_kmod/include/linux/kvm_host.h kvm_kmod_new//include/linux/kvm_host.h
> --- kvm_kmod/include/linux/kvm_host.h 2014-03-10 14:08:28.0 +
> +++ kvm_kmod_new//include/linux/kvm_host.h 2014-03-26 15:07:48.0 +
> @@ -337,6 +337,12 @@ struct kvm {
>  
>   struct mutex irq_lock;
>  #ifdef CONFIG_HAVE_KVM_IRQCHIP
> + struct task_struct *kthread;
> + wait_queue_head_t wq;
> + struct mutex irq_routing_lock;
> + struct kvm_irq_routing to_update_routing;
> + struct kvm_irq_routing_entry *to_update_entries;
> + atomic_t have_new;
>   /*
>* Update side is protected by irq_lock and,
>* if configured, irqfds.lock.
> diff -urp kvm_kmod/x86/assigned-dev.c kvm_kmod_new//x86/assigned-dev.c
> --- kvm_kmod/x86/assigned-dev.c   2014-03-10 14:08:28.0 +
> +++ kvm_kmod_new//x86/assigned-dev.c  2014-03-26 15:22:33.0 +
> @@ -1056,12 +1056,23 @@ long kvm_vm_ioctl_assigned_device(struct
>   r = -EFAULT;
>   urouting = argp;
>   if (copy_from_user(entries, urouting->entries,
> -routing.nr * sizeof(*entries)))
> - goto out_free_irq_routing;
> - r = kvm_set_irq_routing(kvm, entries, routing.nr,
> - routing.flags);
> - out_free_irq_routing:
> - vfree(entries);
> +routing.nr * sizeof(*entries))) {
> + vfree(entries);
> + break;
> + }
> +
> + mutex_lock(&kvm->irq_routing_lock);
> + if (kvm->to_update_entries) {
> + vfree(kvm->to_update_entries);
> + kvm->to_update_entries = NULL;
> + }
> + kvm->to_update_routing = routing;
> + kvm->to_update_entries = entries;
> + atomic_set(&kvm->have_new, 1);
> + mutex_unlock(&kvm->irq_routing_lock);
> +
> + wake_up(&kvm->wq);
> + r = 0;  /* parameter validity and memory allocation availability should be checked before return */
>   break;
>   }
>  #endif /* KVM_CAP_IRQ_ROUTING */
> diff -urp kvm_kmod/x86/kvm_main.c kvm_kmod_new//x86/kvm_main.c
> --- kvm_kmod/x86/kvm_main.c   2014-03-10 14:08:28.0 +
> +++ kvm_kmod_new//x86/kvm_main.c  2014-03-26 15:27:02.0 +
> @@ -83,6 +83,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -501,6 +502,49 @@ static void kvm_init_memslots_id(struct
>   slots->id_to_index[i] = slots->memslots[i].id = i;
>  }
>  
> +static int do_irq_routing_table_update(struct kvm *kvm)
> +{
> + struct kvm_irq_routing_entry *entries;
> + unsigned nr;
> + unsigned flags;
> +
> + mutex_lock(&kvm->irq_routing_lock);
> + BUG_ON(!kvm->to_update_entries);
> +

Re: [RFC]Two ideas to optimize updating irq routing table

2014-03-26 Thread Christian Borntraeger
On 26/03/14 09:22, Gonglei (Arei) wrote:

> Without patches, ping time can jump from 0.3ms to 2ms-30ms. With the
> synchronize_srcu patch, ping time is worse. With the
> synchronize_srcu_expedited patch, ping time is overall good, though
> sometimes it jumps to 1ms-3ms.

Just to understand what's going on, does something like

diff --git a/kernel/rcu/srcu.c b/kernel/rcu/srcu.c
index 3318d82..432c2a2 100644
--- a/kernel/rcu/srcu.c
+++ b/kernel/rcu/srcu.c
@@ -331,7 +331,7 @@ EXPORT_SYMBOL_GPL(__srcu_read_unlock);
  */
 #define SRCU_RETRY_CHECK_DELAY 5
 #define SYNCHRONIZE_SRCU_TRYCOUNT  2
-#define SYNCHRONIZE_SRCU_EXP_TRYCOUNT  12
+#define SYNCHRONIZE_SRCU_EXP_TRYCOUNT  50
 
 /*
  * @@@ Wait until all pre-existing readers complete.  Such readers

make the problem go away?




Re: Preparing kvm/next for first pull request of 3.15 merge window

2014-03-26 Thread Greg Kurz
On Wed, 26 Mar 2014 10:19:52 +0100
Paolo Bonzini  wrote:
> Il 26/03/2014 04:51, Paul Mackerras ha scritto:
> >> > I would like to know from ARM and PPC maintainers *now* (before the
> >> > merge window opens) what will be in 3.15.  Also, PPC guys, please
> >> > make sure the pull requests will be based on commit e724f080f5dd
> >> > (KVM: PPC: Book3S HV: Fix register usage when loading/saving VRSAVE,
> >> > 2014-03-13).
> > Alex has 5 commits in his kvm-ppc-queue branch that should go in.
> > That tree is based on your kvm-3.14-2 tag, not on e724f080f5dd.
> > Would you accept a pull request for that from me on Alex's behalf
> > since he's away?
> 
> Yes, of course.
> 
> > Do you need me to rebase it at all?
> 
> These are:
> 
> 1ac4484979 PPC: KVM: Introduce hypervisor call H_GET_TCE
> e0a7be38c9 PPC: KVM: fix RESUME_GUEST check before returning from 
> kvmppc_run_core()
> 013e98e657 PPC: KVM: fix RESUME_GUEST check before ending CEDE in 
> kvmppc_run_core()
> e8a7f11fe0 PPC: KVM: fix VCPU run for HV KVM (v2)
> 26d96ec97c PPC: KVM: introduce helper to check RESUME_GUEST and related
> 
> I would consider rebasing; but you know better than me the effect of the 
> two host-crash-fixing patches and how the testability/bisectability of 
> kvm-ppc-queue is affected.
> 
> In particular, how likely is it that reverse-bisection ("which commit 
> fixed the bug?") would end up on e724f080f5dd rather than one of the 
> four RESUME_GUEST commits?
> 

Paul,

It is safe to merge e0a7be38c9, 013e98e657, e8a7f11fe0 and 26d96ec97c as
a single patch... something I should have done from the beginning, I guess. :)

Cheers.

--
Greg

> > There is also the set of 8 patches that I just posted and Scott
> > acked.  I can make a branch based on e724f080f5dd and send you a pull
> > request.
> 
> Yes, thanks!
> 
> Paolo
> 
> > Let me know what you would prefer regarding Alex's kvm-ppc-queue
> > branch.
> 



-- 
Gregory Kurz kurzg...@fr.ibm.com
 gk...@linux.vnet.ibm.com
Software Engineer @ IBM/Meiosys  http://www.ibm.com
Tel +33 (0)562 165 496

"Anarchy is about taking complete responsibility for yourself."
Alan Moore.



Re: GPF in intel_pmu_lbr_reset() with qemu -cpu host

2014-03-26 Thread Paolo Bonzini

On 26/03/2014 08:23, Wu, Feng wrote:

Is there a solution for this issue right now? I also hit this GPF crash.


Can you attach your .config?

Paolo



Re: [RFC]Two ideas to optimize updating irq routing table

2014-03-26 Thread Paolo Bonzini

On 26/03/2014 09:22, Gonglei (Arei) wrote:

Yes, previously I was using synchronize_srcu, which is not good. When I
changed it to synchronize_srcu_expedited, grace period delay was much better
than with synchronize_srcu. Though in our tests, we can still see some impact
of the KVM_SET_GSI_ROUTING ioctl.

Our testing scenario is like this. In the VM we run a script that sets
smp_affinity for each IRQ every 0.5s (this leads QEMU to do the
KVM_SET_GSI_ROUTING ioctl). Outside the VM we ping that VM.


Does the affinity actually change every 0.5s?


Without patches, ping time can jump from 0.3ms to 2ms-30ms. With the
synchronize_srcu patch, ping time is worse. With the
synchronize_srcu_expedited patch, ping time is overall good, though
sometimes it jumps to 1ms-3ms.

With the following raw patch, ping time is like with the call_rcu patch:
it is not influenced by setting IRQ affinity and stays at 0.3ms, and there
is no vulnerability; frequent intermediate KVM_SET_GSI_ROUTING settings are
just skipped, and the newest setting always takes effect.


Interesting, but it only works for assigned-dev.c, which is deprecated.
If you used VFIO you'd see no improvement, and Christian's s390 use case
would also see no improvement.


Paolo


Re: Preparing kvm/next for first pull request of 3.15 merge window

2014-03-26 Thread Paolo Bonzini

On 25/03/2014 21:35, Christian Borntraeger wrote:

On 25/03/14 18:29, Paolo Bonzini wrote:

Also, the pull requests I got in the last couple of months were a bit
messy, and I'm already fearing that Linus will notice. In the future, I'll
tag kvm-3.x-base (e.g. kvm-3.16-base) when I open kvm/next, and I will
*reject* pull requests from submaintainers that include extra non-KVM
commits without a very good reason.


Can you clarify this statement? What are the don't-do things? (I already
checked one obvious answer.)

[ ] I include non-kvm s390 patches (with a Maintainer ack from Martin
or Heiko) which are required for other patches. These patches might
even go via the s390 tree as well


This is okay.  The patches will stand out in the diffstat and I'll check 
that they have Acked-bys before pulling.  Bonus for using signed tags 
and mentioning it in the tag message. :)



[ ] My pull request is based on current kvm/next instead of kvm/next
that was branched away after rc1


This is not only okay, it's almost always desirable.  It's also okay to 
base your pull request on your last pull request.  Any commit between 
kvm-3.x-base and kvm/next will do.


Exception: if there is a particularly important bugfix that you want in 
both -rc and kvm/next, make the pull request based on kvm-3.x-base. 
After pulling it in kvm/next, I'll "forward" the pull request 
immediately to Linus.  But it should happen very rarely, and only for 
very bad bugs introduced during the last merge window.


For an example, see https://lkml.org/lkml/2014/3/18/112.


[X] My pull request is based on 3.x-rcy instead of kvm/next


Indeed this is bad and it is my main complaint.  As mentioned above, 
pull requests should be based on a commit between kvm-3.x-base and kvm/next.



[ ] My pull request is based on kvm/queue instead of kvm/next


This is not bad per se, but kvm/queue can be rebased: if down the line I 
have problems with kvm/queue, you'll have to resend the pull request.


In fact, there should be no need to base pull requests on kvm/queue. 
Even if your arch-specific patches touch virt/kvm/ or have prerequisites 
in virt/kvm/, it is better to harass me until I test kvm/queue and merge 
it back into kvm/next. :)


Paolo


Re: Preparing kvm/next for first pull request of 3.15 merge window

2014-03-26 Thread Paolo Bonzini

On 26/03/2014 04:51, Paul Mackerras wrote:

> I would like to know from ARM and PPC maintainers *now* (before the
> merge window opens) what will be in 3.15.  Also, PPC guys, please
> make sure the pull requests will be based on commit e724f080f5dd
> (KVM: PPC: Book3S HV: Fix register usage when loading/saving VRSAVE,
> 2014-03-13).

Alex has 5 commits in his kvm-ppc-queue branch that should go in.
That tree is based on your kvm-3.14-2 tag, not on e724f080f5dd.
Would you accept a pull request for that from me on Alex's behalf
since he's away?


Yes, of course.


Do you need me to rebase it at all?


These are:

1ac4484979 PPC: KVM: Introduce hypervisor call H_GET_TCE
e0a7be38c9 PPC: KVM: fix RESUME_GUEST check before returning from 
kvmppc_run_core()
013e98e657 PPC: KVM: fix RESUME_GUEST check before ending CEDE in 
kvmppc_run_core()

e8a7f11fe0 PPC: KVM: fix VCPU run for HV KVM (v2)
26d96ec97c PPC: KVM: introduce helper to check RESUME_GUEST and related

I would consider rebasing; but you know better than me the effect of the 
two host-crash-fixing patches and how the testability/bisectability of 
kvm-ppc-queue is affected.


In particular, how likely is it that reverse-bisection ("which commit 
fixed the bug?") would end up on e724f080f5dd rather than one of the 
four RESUME_GUEST commits?



There is also the set of 8 patches that I just posted and Scott
acked.  I can make a branch based on e724f080f5dd and send you a pull
request.


Yes, thanks!

Paolo


Let me know what you would prefer regarding Alex's kvm-ppc-queue
branch.




RE: [RFC]Two ideas to optimize updating irq routing table

2014-03-26 Thread Gonglei (Arei)
> On my system I have HZ=100 and lots of CPUs. So RCU's "every CPU has
> scheduled" is certainly slower than SRCU's algorithm
> (/*
>  * We use an adaptive strategy for synchronize_srcu() and especially for
>  * synchronize_srcu_expedited().  We spin for a fixed time period
>  * (defined below) to allow SRCU readers to exit their read-side critical
>  * sections.  If there are still some readers after 10 microseconds,
>  * we repeatedly block for 1-millisecond time periods.  This approach
>  * has done well in testing, so there is no need for a config parameter.
>  */
> )
> 
> With HZ==1000 and a small number of CPUs, SRCU's spinning might be in
> the same delay range as classic RCU, depending on how long the read-side
> critical section is (if we move from spinning to blocking).
> So using synchronize_srcu_expedited is certainly something to test, as it
> increases the spinning time.
> 
> Christian
Yes, after we changed to synchronize_srcu_expedited, grace period latency
improves a lot and overall this is good. However, as I mentioned in another
mail, in our setting-IRQ-affinity and ping test we can still see some impact
of the KVM_SET_GSI_ROUTING ioctl. I wrote another patch in that mail and
would like it to be examined to see whether it is acceptable or has any
problems. Thank you.


Best regards,
-Gonglei



[Bug 72381] [Nested] L1 call trace when create windows 7 guest as L2 guest.

2014-03-26 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=72381

--- Comment #2 from robert...@intel.com ---
Tried this case: L0 94b3ffcd + L1 94b3ffcd, L1 panics.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.


RE: [RFC]Two ideas to optimize updating irq routing table

2014-03-26 Thread Gonglei (Arei)
> > Based on discussions in:
> > http://lists.gnu.org/archive/html/qemu-devel/2013-11/threads.html#03322
> >
> > About KVM_SET_GSI_ROUTING ioctl, I tested changing RCU to SRCU, but
> unfortunately
> > it looks like SRCU's grace period is no better than RCU.
> 
> Really?  This is not what Christian Borntraeger claimed at
> http://permalink.gmane.org/gmane.comp.emulators.kvm.devel/118083 -- in
> fact, Andrew Theurer is currently testing a variant of that patch and I
> was going to post it for 3.16.
> 
> Have you tried synchronize_srcu_expedited?  Unlike the RCU variant, you
> can use this function.
> 
Yes, previously I was using synchronize_srcu, which is not good. When I
changed it to synchronize_srcu_expedited, grace period delay was much better
than with synchronize_srcu. Though in our tests, we can still see some impact
of the KVM_SET_GSI_ROUTING ioctl.

Our testing scenario is like this. In the VM we run a script that sets
smp_affinity for each IRQ every 0.5s (this leads QEMU to do the
KVM_SET_GSI_ROUTING ioctl). Outside the VM we ping that VM.

Without patches, ping time can jump from 0.3ms to 2ms-30ms. With the
synchronize_srcu patch, ping time is worse. With the
synchronize_srcu_expedited patch, ping time is overall good, though
sometimes it jumps to 1ms-3ms.

With the following raw patch, ping time is like with the call_rcu patch:
it is not influenced by setting IRQ affinity and stays at 0.3ms, and there
is no vulnerability; frequent intermediate KVM_SET_GSI_ROUTING settings are
just skipped, and the newest setting always takes effect.

Do you think this patch is acceptable, or does it have problems? Thanks in advance.

diff -urp kvm_kmod/include/linux/kvm_host.h kvm_kmod_new//include/linux/kvm_host.h
--- kvm_kmod/include/linux/kvm_host.h   2014-03-10 14:08:28.0 +
+++ kvm_kmod_new//include/linux/kvm_host.h  2014-03-26 15:07:48.0 +
@@ -337,6 +337,12 @@ struct kvm {
 
struct mutex irq_lock;
 #ifdef CONFIG_HAVE_KVM_IRQCHIP
+   struct task_struct *kthread;
+   wait_queue_head_t wq;
+   struct mutex irq_routing_lock;
+   struct kvm_irq_routing to_update_routing;
+   struct kvm_irq_routing_entry *to_update_entries;
+   atomic_t have_new;
/*
 * Update side is protected by irq_lock and,
 * if configured, irqfds.lock.
diff -urp kvm_kmod/x86/assigned-dev.c kvm_kmod_new//x86/assigned-dev.c
--- kvm_kmod/x86/assigned-dev.c 2014-03-10 14:08:28.0 +
+++ kvm_kmod_new//x86/assigned-dev.c2014-03-26 15:22:33.0 +
@@ -1056,12 +1056,23 @@ long kvm_vm_ioctl_assigned_device(struct
r = -EFAULT;
urouting = argp;
if (copy_from_user(entries, urouting->entries,
-  routing.nr * sizeof(*entries)))
-   goto out_free_irq_routing;
-   r = kvm_set_irq_routing(kvm, entries, routing.nr,
-   routing.flags);
-   out_free_irq_routing:
-   vfree(entries);
+  routing.nr * sizeof(*entries))) {
+   vfree(entries);
+   break;
+   }
+
+   mutex_lock(&kvm->irq_routing_lock);
+   if (kvm->to_update_entries) {
+   vfree(kvm->to_update_entries);
+   kvm->to_update_entries = NULL;
+   }
+   kvm->to_update_routing = routing;
+   kvm->to_update_entries = entries;
+   atomic_set(&kvm->have_new, 1);
+   mutex_unlock(&kvm->irq_routing_lock);
+
+   wake_up(&kvm->wq);
+   r = 0;  /* parameter validity and memory allocation availability should be checked before return */
break;
}
 #endif /* KVM_CAP_IRQ_ROUTING */
diff -urp kvm_kmod/x86/kvm_main.c kvm_kmod_new//x86/kvm_main.c
--- kvm_kmod/x86/kvm_main.c 2014-03-10 14:08:28.0 +
+++ kvm_kmod_new//x86/kvm_main.c  2014-03-26 15:27:02.0 +
@@ -83,6 +83,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -501,6 +502,49 @@ static void kvm_init_memslots_id(struct
slots->id_to_index[i] = slots->memslots[i].id = i;
 }
 
+static int do_irq_routing_table_update(struct kvm *kvm)
+{
+   struct kvm_irq_routing_entry *entries;
+   unsigned nr;
+   unsigned flags;
+
+   mutex_lock(&kvm->irq_routing_lock);
+   BUG_ON(!kvm->to_update_entries);
+
+   nr = kvm->to_update_routing.nr;
+   flags = kvm->to_update_routing.flags;
+   entries = vmalloc(nr * sizeof(*entries));
+   if (!entries) {
+   mutex_unlock(&kvm->irq_routing_lock);
+   return 0;
+   }
+   /* this memcpy can be eliminated by doing the set inside the mutex_lock and the synchronize_rcu outside */
+   memcpy(entries, kvm->to_update_entries, nr * sizeof(*entries));
+
+   atomic_set(&kvm->have_new, 0);
+   mutex_unlock(&kvm->irq_routing_lock);
+

Questions about interrupt injection

2014-03-26 Thread 缪天翔
Hi all,

I want to implement a virtual PCI device in QEMU and, when needed,
inject a virtual interrupt into the running guest. However, this
action can fail and crash the guest VM, especially when the workload
in the guest is heavy. QEMU reports the following errors:

> KVM: entry failed, hardware error 0x8021
> 
> If you're running a guest on an Intel machine without unrestricted mode
> support, the failure can be most likely due to the guest entering an invalid
> state for Intel VT. For example, the guest maybe running in big real mode
> which is not supported on less recent Intel processors.

BTW, there are two functions named kvm_set_irq and kvm_queue_interrupt
that can be used to inject interrupts into the guest; is that right?

Thx a lot.





-- 
Tianxiang Miao,
Graduate Student,
Institute of Parallel and Distributed Systems (IPADS),
School of Software,
Shanghai JiaoTong University


Re: [RFC]Two ideas to optimize updating irq routing table

2014-03-26 Thread Christian Borntraeger
On 25/03/14 13:37, Paolo Bonzini wrote:
> On 25/03/2014 04:19, Gonglei (Arei) wrote:
>> Based on discussions in:
>> http://lists.gnu.org/archive/html/qemu-devel/2013-11/threads.html#03322
>>
>> About KVM_SET_GSI_ROUTING ioctl, I tested changing RCU to SRCU, but 
>> unfortunately
>> it looks like SRCU's grace period is no better than RCU.
> 
> Really?  This is not what Christian Borntraeger claimed at 
> http://permalink.gmane.org/gmane.comp.emulators.kvm.devel/118083 -- in fact, 
> Andrew Theurer is currently testing a variant of that patch and I was going 
> to post it for 3.16.
> 
> Have you tried synchronize_srcu_expedited?  Unlike the RCU variant, you can 
> use this function.

On my system I have HZ=100 and lots of CPUs. So RCU's "every CPU has scheduled"
is certainly slower than SRCU's algorithm
(/*
 * We use an adaptive strategy for synchronize_srcu() and especially for
 * synchronize_srcu_expedited().  We spin for a fixed time period
 * (defined below) to allow SRCU readers to exit their read-side critical
 * sections.  If there are still some readers after 10 microseconds,
 * we repeatedly block for 1-millisecond time periods.  This approach
 * has done well in testing, so there is no need for a config parameter.
 */
)

With HZ==1000 and a small number of CPUs, SRCU's spinning might be in the same
delay range as classic RCU, depending on how long the read-side critical
section is (if we move from spinning to blocking).
So using synchronize_srcu_expedited is certainly something to test, as it
increases the spinning time.
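
A minimal userspace sketch of that adaptive spin-then-block pattern (the
constants and the readers_active() helper are illustrative assumptions,
not the kernel's):

#include <stdbool.h>
#include <time.h>

extern bool readers_active(void);	/* assumed: true while readers remain */

static void wait_for_readers(void)
{
	struct timespec one_ms = { .tv_sec = 0, .tv_nsec = 1000 * 1000 };
	int spins;

	/* spin briefly, hoping readers leave their critical sections */
	for (spins = 0; spins < 10000; spins++)
		if (!readers_active())
			return;

	/* still busy: back off to 1-millisecond sleeps instead */
	while (readers_active())
		nanosleep(&one_ms, NULL);
}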

Christian



RE: GPF in intel_pmu_lbr_reset() with qemu -cpu host

2014-03-26 Thread Wu, Feng


> -Original Message-
> From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On
> Behalf Of Peter Wu
> Sent: Sunday, March 23, 2014 6:00 AM
> To: Gleb Natapov
> Cc: Venkatesh Srinivas; Peter Zijlstra; Ingo Molnar; Andi Kleen; Linux Kernel
> Developers List; H. Peter Anvin; kvm@vger.kernel.org; Paolo Bonzini
> Subject: Re: GPF in intel_pmu_lbr_reset() with qemu -cpu host
> 
> On Saturday 22 March 2014 14:27:59 Gleb Natapov wrote:
> > > but now I have a NULL dereference (in rapl_pmu_init). Previously, when
> > > `-cpu SandyBridge` was passed to qemu, it would show this:
> > >
> > > [0.016995] Performance Events: unsupported p6 CPU model 42
> no PMU driver, software events only.
> > >
> > > The same NULL pointer deref would be visible (slightly different
> > > addresses, but the Code lines are equal). With `-host`, the NULL deref
> > > with `-cpu host` contains:
> > >
> > > [0.016445] Performance Events: 16-deep LBR, IvyBridge events,
> Intel PMU driver.
> > >
> > > Full dmesg below.
> > >
> > I am confused. Do you see the crash now with -cpu SandyBridge and -cpu
> > host, or -cpu host only?
> 
> The RAPL crash is seen with both SandyBridge and host, I mentioned SB
> because that environment should be more constant than "host" (which
> depends on the CPU you have on, well, the host).
> 
> Peter

Is there a solution for this issue right now? I also hit this GPF crash.


Thanks,
Feng