[Bug 60830] New: L2 rhel6u4(32bit) guest reboot continuously

2013-09-02 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=60830

Bug ID: 60830
   Summary: L2 rhel6u4(32bit) guest reboot continuously
   Product: Virtualization
   Version: unspecified
Kernel Version: 3.11.0-rc1
  Hardware: All
OS: Linux
  Tree: Mainline
Status: NEW
  Severity: normal
  Priority: P1
 Component: kvm
  Assignee: virtualization_...@kernel-bugs.osdl.org
  Reporter: chao.z...@intel.com
Regression: No

Environment:

Host OS (ia32/ia32e/IA64):ia32e
Guest OS (ia32/ia32e/IA64):ia32e
Guest OS Type (Linux/Windows):Linux
kvm.git Commit:cc2df20c7c4ce594c3e17e9cc260c330646012c8
qemu.git Commit:f7ad538e1ea130c8b6f3abb06ad6c856242c799e
Host Kernel Version:3.11.0-rc1
Hardware:Romley_EP

Bug detailed description:
--

When creating an L1 guest with -cpu host and then creating a 32bit rhel6u4 guest as the L2
guest, the L2 guest reboots continuously.

This commit introduced this bug:

commit afa61f752ba62549e4143d9f9378a8d1d710d6eb
Author: Nadav Har'El n...@il.ibm.com
Date:   Wed Aug 7 14:59:22 2013 +0200
Advertise the support of EPT to the L1 guest, through the appropriate MSR.
This is the last patch of the basic Nested EPT feature, so as to allow
bisection through this patch series: The guest will not see EPT support
until this last patch, and will not attempt to use the half-applied feature.

note:

1. Creating a 32bit RHEL6u3 guest as the L2 guest: the guest also reboots continuously.
2. Creating a 64bit rhel6u4 guest as the L2 guest: the L2 guest works fine.
3. This should be a kernel bug:

kvm  +  qemu = result

cc2df20c + f7ad538e  = bad

205befd9 + f7ad538e  = good

Reproduce steps:


1. create L1 guest:

qemu-system-x86_64 -enable-kvm -m 8G -smp 4 -net nic,macaddr=00:12:41:51:14:16
-net tap,script=/etc/kvm/qemu-ifup ia32e_nested-kvm.img -cpu host

2. create L2 guest

qemu-system-x86_64 -enable-kvm -m 1024 -smp 2 -net none ia32p_rhel6u4.img

Current result:


32bit rhel6u4 as L2 guest reboots continuously

Expected result:


32bit rhel6u4 as L2 guest works fine

Basic root-causing log:
--



Re: [PATCH V2 2/6] vhost_net: use vhost_add_used_and_signal_n() in vhost_zerocopy_signal_used()

2013-09-02 Thread Jason Wang
On 09/02/2013 01:50 PM, Michael S. Tsirkin wrote:
 On Fri, Aug 30, 2013 at 12:29:18PM +0800, Jason Wang wrote:
  We tend to batch the used adding and signaling in vhost_zerocopy_callback()
  which may result more than 100 used buffers to be updated in
  vhost_zerocopy_signal_used() in some cases. So wwitch to use
 switch

Ok.
  vhost_add_used_and_signal_n() to avoid multiple calls to
  vhost_add_used_and_signal(). Which means much more less times of used index
  updating and memory barriers.
 pls put info on perf gain in commit log too


Sure.


Re: [PATCH V2 1/6] vhost_net: make vhost_zerocopy_signal_used() returns void

2013-09-02 Thread Jason Wang
On 09/02/2013 01:51 PM, Michael S. Tsirkin wrote:
 tweak subj s/returns/return/

 On Fri, Aug 30, 2013 at 12:29:17PM +0800, Jason Wang wrote:
  None of its callers uses its return value, so let it return void.
  
  Signed-off-by: Jason Wang jasow...@redhat.com
  ---

Will correct it in v3.


Re: [PATCH V2 6/6] vhost_net: correctly limit the max pending buffers

2013-09-02 Thread Jason Wang
On 09/02/2013 01:56 PM, Michael S. Tsirkin wrote:
 On Fri, Aug 30, 2013 at 12:29:22PM +0800, Jason Wang wrote:
 As Michael points out, we used to limit the max pending DMAs to get better cache
 utilization. But it was not done correctly, since the check was only done when there were no
 new buffers submitted from the guest. The guest can easily exceed the limitation by
 keeping sending packets.

 So this patch moves the check into the main loop. Tests show about a 5%-10%
 improvement in per-CPU throughput for guest tx, but a 5% drop in per-CPU
 transaction rate for a single-session TCP_RR.
 Any explanation for the drop? single session TCP_RR is unlikely to
 exceed VHOST_MAX_PEND, correct?

Unlikely to exceed. I rechecked the result; it looks like it was not stable
enough. Will re-test and report.

 Signed-off-by: Jason Wang jasow...@redhat.com
 ---
  drivers/vhost/net.c |   15 ---
  1 files changed, 4 insertions(+), 11 deletions(-)

 diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
 index d09c17c..592e1f2 100644
 --- a/drivers/vhost/net.c
 +++ b/drivers/vhost/net.c
 @@ -363,6 +363,10 @@ static void handle_tx(struct vhost_net *net)
  		if (zcopy)
  			vhost_zerocopy_signal_used(net, vq);
  
 +		if ((nvq->upend_idx + vq->num - VHOST_MAX_PEND) % UIO_MAXIOV ==
 +		    nvq->done_idx)
 +			break;
 +
  		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
  					 ARRAY_SIZE(vq->iov),
  					 &out, &in,
 @@ -372,17 +376,6 @@ static void handle_tx(struct vhost_net *net)
  			break;
  		/* Nothing new?  Wait for eventfd to tell us they refilled. */
  		if (head == vq->num) {
 -			int num_pends;
 -
 -			/* If more outstanding DMAs, queue the work.
 -			 * Handle upend_idx wrap around
 -			 */
 -			num_pends = likely(nvq->upend_idx >= nvq->done_idx) ?
 -				    (nvq->upend_idx - nvq->done_idx) :
 -				    (nvq->upend_idx + UIO_MAXIOV -
 -				     nvq->done_idx);
 -			if (unlikely(num_pends > VHOST_MAX_PEND))
 -				break;
  			if (unlikely(vhost_enable_notify(&net->dev, vq))) {
  				vhost_disable_notify(&net->dev, vq);
  				continue;
 -- 
 1.7.1


Re: [PATCH 09/12] KVM: MMU: introduce pte-list lockless walker

2013-09-02 Thread Xiao Guangrong
On 08/30/2013 07:38 PM, Gleb Natapov wrote:
 On Thu, Aug 29, 2013 at 07:26:40PM +0800, Xiao Guangrong wrote:
 On 08/29/2013 05:51 PM, Gleb Natapov wrote:
 On Thu, Aug 29, 2013 at 05:31:42PM +0800, Xiao Guangrong wrote:
 As Documentation/RCU/whatisRCU.txt says:

 As with rcu_assign_pointer(), an important function of
 rcu_dereference() is to document which pointers are protected by
 RCU, in particular, flagging a pointer that is subject to changing
 at any time, including immediately after the rcu_dereference().
 And, again like rcu_assign_pointer(), rcu_dereference() is
 typically used indirectly, via the _rcu list-manipulation
 primitives, such as list_for_each_entry_rcu().

 The documentation aspect of rcu_assign_pointer()/rcu_dereference() is
 important. The code is complicated, so self documentation will not hurt.
 I want to see what is actually protected by rcu here. Freeing shadow
 pages with call_rcu() further complicates matters: does it mean that
 shadow pages are also protected by rcu? 

 Yes, it stops the shadow page from being freed while we do write-protection on
 it.

 Yeah, I got the trick; what I am saying is that we have a data structure
 here protected by RCU, but we do not use RCU functions to access it...

 Yes, they are not used when inserting a spte into the rmap and getting the rmap
 from the entry... but do we need to use these functions to guarantee the order?

 The worst case is: we fetch the spte from the desc but the spte is not
 updated yet; we can happily skip this spte since it will set the
 dirty-bitmap later. This is guaranteed by the barrier between mmu_spte_update()
 and mark_page_dirty(). The code is:

 set_spte():

 	if (mmu_spte_update(sptep, spte))
 		kvm_flush_remote_tlbs(vcpu->kvm);

 	if (!remap) {
 		if (rmap_add(vcpu, sptep, gfn) > RMAP_RECYCLE_THRESHOLD)
 			rmap_recycle(vcpu, sptep, gfn);

 		if (level > PT_PAGE_TABLE_LEVEL)
 			++vcpu->kvm->stat.lpages;
 	}

 	smp_wmb();

 	if (pte_access & ACC_WRITE_MASK)
 		mark_page_dirty(vcpu->kvm, gfn);

 So, I guess if we can guarantee the order ourselves, we do not need
 to call the rcu functions explicitly...

 But the memory barriers in the rcu functions are really light on x86 (a store
 can not be reordered with another store), so I do not mind using them
 explicitly if you think this way is safer. :)

 I think the self documentation aspect of using rcu function is also
 important.

Okay. I will use these rcu functions and measure them to see whether it'll
cause performance issue.

 
 BTW why not allocate sp->spt from a SLAB_DESTROY_BY_RCU cache too? We may
 switch write protection on a random spt occasionally if a page is deleted
 and reused for another spt, though. For a last level spt it should not be a
 problem, and for non last level we have the is_last_spte() check in
 __rmap_write_protect_lockless(). Can it work?

 Yes, I also considered this way. It can work if we handle is_last_spte()
 properly. Since the sp->spt can be reused, we can not get the mapping
 level from the sp. We need to encode the mapping level into the spte so that
 cmpxchg can understand if the page table has been moved to another mapping
 level.
 Isn't one bit that says that spte is the last one enough? IIRC we
 have one more ignored bit to spare in spte.

Right. But I also want to use this way in fast_page_fault where the mapping
level is needed.

 
 Could you allow me to make this optimization separately after this
 patchset is merged?

 If you think it will complicate the initial version I am fine with
 postponing it for later.

Thank you, Gleb!
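As a reference for the rcu_dereference()/rcu_assign_pointer() discussion
above, here is a minimal sketch (not part of the patchset) of what the
documented lockless pte-list walk could look like. pte_from_desc() is a
hypothetical helper; pte_list_desc matches the struct quoted later in this
thread, and the caller must still tolerate stale sptes as discussed:

	static u64 *pte_from_desc(struct pte_list_desc __rcu *head, int i)
	{
		struct pte_list_desc *desc;
		u64 *sptep = NULL;

		rcu_read_lock();
		/* rcu_dereference() pairs with rcu_assign_pointer() on the
		 * insert side and documents that the pointer may change at
		 * any time. */
		for (desc = rcu_dereference(head); desc;
		     desc = rcu_dereference(desc->more)) {
			if (i < PTE_LIST_EXT) {
				sptep = desc->sptes[i];
				break;
			}
			i -= PTE_LIST_EXT;
		}
		rcu_read_unlock();
		return sptep;
	}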




Re: Is fallback vhost_net to qemu for live migrate available?

2013-09-02 Thread Wei Liu
On Sat, Aug 31, 2013 at 12:45:11PM +0800, Qin Chuanyu wrote:
 On 2013/8/30 0:08, Anthony Liguori wrote:
 Hi Qin,
 
 By changing the memory copy and notify mechanism, virtio-net with
 vhost_net can currently run on Xen with good performance.
 
 I think the key in doing this would be to implement a proper
 ioeventfd and irqfd interface in the driver domain kernel.  Just
 hacking vhost_net with Xen specific knowledge would be pretty nasty
 IMHO.
 
 Yes, I added a kernel module which persists the virtio-net pio_addr and
 msix address as the kvm module did. The guest wakes up the vhost thread
 through a hook function added in evtchn_interrupt.
 
 Did you modify the front end driver to do grant table mapping or is
 this all being done by mapping the domain's memory?
 
 There is nothing changed in the front end driver. Currently I use
 alloc_vm_area to get address space, and map the domain's memory as
 qemu did.
 

You mean you're using xc_map_foreign_range and friends in the backend to
map guest memory? That's not very desirable as it violates Xen's
security model. It would not be too hard to pass grant references
instead of guest physical memory address IMHO.

Wei.


Re: Is fallback vhost_net to qemu for live migrate available?

2013-09-02 Thread Michael S. Tsirkin
On Mon, Sep 02, 2013 at 08:57:22AM +0100, Wei Liu wrote:
 
 You mean you're using xc_map_foreign_range and friends in the backend to
 map guest memory? That's not very desirable as it violates Xen's
 security model. It would not be too hard to pass grant references
 instead of guest physical memory address IMHO.
 
 Wei.

It's a start and it should make it fast and work with existing
infrastructure in the host, though.

-- 
MST


Re: [PATCH v3 1/6] KVM: nVMX: Replace kvm_set_cr0 with vmx_set_cr0 in load_vmcs12_host_state

2013-09-02 Thread Gleb Natapov
On Thu, Aug 08, 2013 at 04:26:28PM +0200, Jan Kiszka wrote:
 Likely a typo, but a fatal one as kvm_set_cr0 performs checks on the
Not a typo :) That's what Avi asked for during the initial nested VMX
review: http://markmail.org/message/hhidqyhbo2mrgxxc

But there is at least one transition check that kvm_set_cr0() does that
should not be done during vmexit emulation, namely the CS.L bit check, so I
tend to agree that kvm_set_cr0() is not appropriate here, at least not as
it is. But can we skip the other checks kvm_set_cr0() does? For instance,
what prevents us from loading CR0.PG = 1, EFER.LME = 1 and CR4.PAE = 0
during nested vmexit?  What _should_ prevent it is the vmentry check from
26.2.4:

If the host address-space size VM-exit control is 1, the following
must hold:
 - Bit 5 of the CR4 field (corresponding to CR4.PAE) is 1.

But I do not see that we do that check on vmentry.

What about NW/CD bit checks, or reserved bits checks? 27.5.1 says:
  The following bits are not modified:
   For CR0, ET, CD, NW; bits 63:32 (on processors that support Intel 64
   architecture), 28:19, 17, and 15:6; and any bits that are fixed in
   VMX operation (see Section 23.8).

But again the current vmexit code does not emulate this properly and just
sets everything from host_cr0. vmentry should also preserve all those
bits, but it looks like it doesn't either.


 state transition that may prevent loading L1's cr0.
 
 Signed-off-by: Jan Kiszka jan.kis...@siemens.com
 ---
  arch/x86/kvm/vmx.c |2 +-
  1 files changed, 1 insertions(+), 1 deletions(-)
 
 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
 index 57b4e12..d001b019 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -8185,7 +8185,7 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
 	 * fpu_active (which may have changed).
 	 * Note that vmx_set_cr0 refers to efer set above.
 	 */
 -	kvm_set_cr0(vcpu, vmcs12->host_cr0);
 +	vmx_set_cr0(vcpu, vmcs12->host_cr0);
 	/*
 	 * If we did fpu_activate()/fpu_deactivate() during L2's run, we need
 	 * to apply the same changes to L1's vmcs. We just set cr0 correctly,
 -- 
 1.7.3.4

--
Gleb.
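For reference, a minimal sketch (not the actual KVM code) of the 26.2.4
host-state check discussed above, which nested vmentry emulation apparently
lacks. nested_vmx_check_host_pae() is a hypothetical helper; the field and
constant names follow arch/x86/kvm/vmx.c:

	static int nested_vmx_check_host_pae(struct vmcs12 *vmcs12)
	{
		/* 26.2.4: if the "host address-space size" VM-exit control
		 * is 1, bit 5 (CR4.PAE) of the host CR4 field must be 1. */
		if ((vmcs12->vm_exit_controls & VM_EXIT_HOST_ADDR_SPACE_SIZE) &&
		    !(vmcs12->host_cr4 & X86_CR4_PAE))
			return -EINVAL;
		return 0;
	}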


Re: [PATCH V2 6/6] vhost_net: correctly limit the max pending buffers

2013-09-02 Thread Jason Wang
On 09/02/2013 02:30 PM, Jason Wang wrote:
 On 09/02/2013 01:56 PM, Michael S. Tsirkin wrote:
  Any explanation for the drop? single session TCP_RR is unlikely to
  exceed VHOST_MAX_PEND, correct?
 Unlikely to exceed. I rechecked the result; it looks like it was not stable
 enough. Will re-test and report.

Re-tested with more iterations; the results show no difference on the TCP_RR
test and a 5%-10% improvement in per-CPU throughput for guest tx.

Will post V3 soon.


Re: [PATCH 09/12] KVM: MMU: introduce pte-list lockless walker

2013-09-02 Thread Xiao Guangrong
On 08/30/2013 07:44 PM, Gleb Natapov wrote:
 On Thu, Aug 29, 2013 at 08:02:30PM +0800, Xiao Guangrong wrote:
 On 08/29/2013 07:33 PM, Xiao Guangrong wrote:
 On 08/29/2013 05:31 PM, Gleb Natapov wrote:
 On Thu, Aug 29, 2013 at 02:50:51PM +0800, Xiao Guangrong wrote:
 After more thinking, I still think rcu_assign_pointer() is unneeded when an
 entry is removed. The remove API does not care about the order between
 unlinking the entry and the changes to its fields. It is the caller's
 responsibility:

 - in the case of rcuhlist, the caller uses call_rcu()/synchronize_rcu(), etc.
   to enforce that all lookups have exited and that later changes on that
   entry are invisible to the lookups.

 - In the case of rculist_nulls, it seems a refcounter is used to guarantee
   the order (see the example from Documentation/RCU/rculist_nulls.txt).

 - In our case, we allow the lookup to see the deleted desc even if it is in
   the slab cache, or it is initialized, or it is re-added.

 BTW is it a good idea? We can access a deleted desc while it is allocated
 and initialized to zero by kmem_cache_zalloc(); are we sure we cannot
 see a partially initialized desc->sptes[] entry? On a related note, what about
 32 bit systems? They do not have atomic access to desc->sptes[].

 Ah... wait. desc is an array of pointers:

 struct pte_list_desc {
  u64 *sptes[PTE_LIST_EXT];
  struct pte_list_desc *more;
 };

 Yep, I misread it to be u64 bits and wondered why we use u64 to store
 pointers.
 
 Assigning a pointer is always atomic, but we should carefully initialize it
 as you said. I will introduce a constructor for the desc slab cache which
 initializes the struct like this:

 for (i = 0; i < PTE_LIST_EXT; i++)
 	desc->sptes[i] = NULL;

 It is okay.

 I hope slab does not write anything into allocated memory internally if
 constructor is present. 

If only the constructor is present (no SLAB_DESTROY_BY_RCU), it'll temporarily
write the poison value into the memory, then call the constructor to
initialize it again, e.g. in slab.c:

static void *cache_alloc_debugcheck_after(struct kmem_cache *cachep,
				gfp_t flags, void *objp, unsigned long caller)
{
	if (cachep->flags & SLAB_POISON) {
		..
		poison_obj(cachep, objp, POISON_INUSE);
	}
	..
	if (cachep->ctor && cachep->flags & SLAB_POISON)
		cachep->ctor(objp);
}

But SLAB_DESTROY_BY_RCU forces the allocator not to touch
the memory, and this is true in our case.

 BTW do you know what happens when SLAB debug is enabled
 and SLAB_DESTROY_BY_RCU is set? 

When SLAB debug is enabled, these 3 flags may be set:
#define SLAB_DEBUG_FREE	0x0100UL	/* DEBUG: Perform (expensive) checks on free */
#define SLAB_RED_ZONE	0x0400UL	/* DEBUG: Red zone objs in a cache */
#define SLAB_POISON	0x0800UL	/* DEBUG: Poison objects */

Only SLAB_POISON can write something into the memory, and ...

 Is the poison value written into a freed
 object (freed to slab, but not yet to the page allocator)?

SLAB_POISON is cleared if SLAB_DESTROY_BY_RCU is used.
- In slab, there is this code in __kmem_cache_create():
	if (flags & SLAB_DESTROY_BY_RCU)
		BUG_ON(flags & SLAB_POISON);

- In slub, the code is in calculate_sizes():
	/*
	 * Determine if we can poison the object itself. If the user of
	 * the slab may touch the object after free or before allocation
	 * then we should never poison the object itself.
	 */
	if ((flags & SLAB_POISON) && !(flags & SLAB_DESTROY_BY_RCU) &&
	    !s->ctor)
		s->flags |= __OBJECT_POISON;
	else
		s->flags &= ~__OBJECT_POISON;

- In slob, it seems SLAB debugging is not supported.
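To make the plan above concrete, here is a minimal sketch (not from the
patchset) of creating the desc cache with SLAB_DESTROY_BY_RCU and a
constructor that NULLs the sptes[] slots; pte_list_desc_cache_init() is a
hypothetical helper and the names mirror the structures quoted in this thread:

	static void pte_list_desc_ctor(void *p)
	{
		struct pte_list_desc *desc = p;
		int i;

		/* a lockless walker must never see garbage pointers */
		for (i = 0; i < PTE_LIST_EXT; i++)
			desc->sptes[i] = NULL;
		desc->more = NULL;
	}

	static struct kmem_cache *pte_list_desc_cache;

	static int pte_list_desc_cache_init(void)
	{
		pte_list_desc_cache =
			kmem_cache_create("pte_list_desc",
					  sizeof(struct pte_list_desc), 0,
					  SLAB_DESTROY_BY_RCU,
					  pte_list_desc_ctor);
		return pte_list_desc_cache ? 0 : -ENOMEM;
	}

Note that the constructor only runs when the cache gets fresh pages from the
page allocator, so reused objects keep their previous contents; that is
exactly the situation the lockless walker must tolerate, as discussed above.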



[PATCH V3 5/6] vhost_net: poll vhost queue after marking DMA is done

2013-09-02 Thread Jason Wang
We used to poll the vhost queue before marking DMA as done. This is racy: if
the vhost thread is woken up before DMA is marked as done, the signal can be
missed. Fix this by always polling the vhost queue after DMA is marked as done.

Signed-off-by: Jason Wang jasow...@redhat.com
---
- The patch is needed for stable 3.4+
---
 drivers/vhost/net.c |9 +
 1 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 3f89dea..8e9dc55 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -308,6 +308,11 @@ static void vhost_zerocopy_callback(struct ubuf_info *ubuf, bool success)
 	struct vhost_virtqueue *vq = ubufs->vq;
 	int cnt = atomic_read(&ubufs->kref.refcount);
 
+	/* set len to mark this desc buffers done DMA */
+	vq->heads[ubuf->desc].len = success ?
+		VHOST_DMA_DONE_LEN : VHOST_DMA_FAILED_LEN;
+	vhost_net_ubuf_put(ubufs);
+
 	/*
 	 * Trigger polling thread if guest stopped submitting new buffers:
 	 * in this case, the refcount after decrement will eventually reach 1
@@ -318,10 +323,6 @@ static void vhost_zerocopy_callback(struct ubuf_info *ubuf, bool success)
 	 */
 	if (cnt <= 2 || !(cnt % 16))
 		vhost_poll_queue(&vq->poll);
-	/* set len to mark this desc buffers done DMA */
-	vq->heads[ubuf->desc].len = success ?
-		VHOST_DMA_DONE_LEN : VHOST_DMA_FAILED_LEN;
-	vhost_net_ubuf_put(ubufs);
 }
 
 /* Expects to be always run from workqueue - which acts as
-- 
1.7.1



[PATCH V3 3/6] vhost: switch to use vhost_add_used_n()

2013-09-02 Thread Jason Wang
Let vhost_add_used() use vhost_add_used_n() to reduce code duplication. To
avoid the overhead brought by __copy_to_user(), we use put_user() when a
single used element needs to be added.

Signed-off-by: Jason Wang jasow...@redhat.com
---
 drivers/vhost/vhost.c |   54 ++--
 1 files changed, 12 insertions(+), 42 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 448efe0..9a9502a 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1332,48 +1332,9 @@ EXPORT_SYMBOL_GPL(vhost_discard_vq_desc);
  * want to notify the guest, using eventfd. */
 int vhost_add_used(struct vhost_virtqueue *vq, unsigned int head, int len)
 {
-	struct vring_used_elem __user *used;
+	struct vring_used_elem heads = { head, len };
 
-	/* The virtqueue contains a ring of used buffers.  Get a pointer to the
-	 * next entry in that used ring. */
-	used = &vq->used->ring[vq->last_used_idx % vq->num];
-	if (__put_user(head, &used->id)) {
-		vq_err(vq, "Failed to write used id");
-		return -EFAULT;
-	}
-	if (__put_user(len, &used->len)) {
-		vq_err(vq, "Failed to write used len");
-		return -EFAULT;
-	}
-	/* Make sure buffer is written before we update index. */
-	smp_wmb();
-	if (__put_user(vq->last_used_idx + 1, &vq->used->idx)) {
-		vq_err(vq, "Failed to increment used idx");
-		return -EFAULT;
-	}
-	if (unlikely(vq->log_used)) {
-		/* Make sure data is seen before log. */
-		smp_wmb();
-		/* Log used ring entry write. */
-		log_write(vq->log_base,
-			  vq->log_addr +
-			   ((void __user *)used - (void __user *)vq->used),
-			  sizeof *used);
-		/* Log used index update. */
-		log_write(vq->log_base,
-			  vq->log_addr + offsetof(struct vring_used, idx),
-			  sizeof vq->used->idx);
-		if (vq->log_ctx)
-			eventfd_signal(vq->log_ctx, 1);
-	}
-	vq->last_used_idx++;
-	/* If the driver never bothers to signal in a very long while,
-	 * used index might wrap around. If that happens, invalidate
-	 * signalled_used index we stored. TODO: make sure driver
-	 * signals at least once in 2^16 and remove this. */
-	if (unlikely(vq->last_used_idx == vq->signalled_used))
-		vq->signalled_used_valid = false;
-	return 0;
+	return vhost_add_used_n(vq, &heads, 1);
 }
 EXPORT_SYMBOL_GPL(vhost_add_used);
 
@@ -1387,7 +1348,16 @@ static int __vhost_add_used_n(struct vhost_virtqueue *vq,
 
 	start = vq->last_used_idx % vq->num;
 	used = vq->used->ring + start;
-	if (__copy_to_user(used, heads, count * sizeof *used)) {
+	if (count == 1) {
+		if (__put_user(heads[0].id, &used->id)) {
+			vq_err(vq, "Failed to write used id");
+			return -EFAULT;
+		}
+		if (__put_user(heads[0].len, &used->len)) {
+			vq_err(vq, "Failed to write used len");
+			return -EFAULT;
+		}
+	} else if (__copy_to_user(used, heads, count * sizeof *used)) {
 		vq_err(vq, "Failed to write used");
 		return -EFAULT;
 	}
-- 
1.7.1



[PATCH V3 4/6] vhost_net: determine whether or not to use zerocopy at one time

2013-09-02 Thread Jason Wang
Currently, even if the packet length is smaller than VHOST_GOODCOPY_LEN, if
upend_idx != done_idx we still set zcopy_used to true and roll back this
choice later. This can be avoided by determining whether to use zerocopy
once, by checking all the conditions at one time beforehand.

Signed-off-by: Jason Wang jasow...@redhat.com
---
 drivers/vhost/net.c |   47 ---
 1 files changed, 20 insertions(+), 27 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 8a6dd0d..3f89dea 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -404,43 +404,36 @@ static void handle_tx(struct vhost_net *net)
 			       iov_length(nvq->hdr, s), hdr_size);
 			break;
 		}
-		zcopy_used = zcopy && (len >= VHOST_GOODCOPY_LEN ||
-				       nvq->upend_idx != nvq->done_idx);
+
+		zcopy_used = zcopy && len >= VHOST_GOODCOPY_LEN
+				   && (nvq->upend_idx + 1) % UIO_MAXIOV !=
+				      nvq->done_idx
+				   && vhost_net_tx_select_zcopy(net);
 
 		/* use msg_control to pass vhost zerocopy ubuf info to skb */
 		if (zcopy_used) {
+			struct ubuf_info *ubuf;
+			ubuf = nvq->ubuf_info + nvq->upend_idx;
+
 			vq->heads[nvq->upend_idx].id = head;
-			if (!vhost_net_tx_select_zcopy(net) ||
-			    len < VHOST_GOODCOPY_LEN) {
-				/* copy don't need to wait for DMA done */
-				vq->heads[nvq->upend_idx].len =
-							VHOST_DMA_DONE_LEN;
-				msg.msg_control = NULL;
-				msg.msg_controllen = 0;
-				ubufs = NULL;
-			} else {
-				struct ubuf_info *ubuf;
-				ubuf = nvq->ubuf_info + nvq->upend_idx;
-
-				vq->heads[nvq->upend_idx].len =
-					VHOST_DMA_IN_PROGRESS;
-				ubuf->callback = vhost_zerocopy_callback;
-				ubuf->ctx = nvq->ubufs;
-				ubuf->desc = nvq->upend_idx;
-				msg.msg_control = ubuf;
-				msg.msg_controllen = sizeof(ubuf);
-				ubufs = nvq->ubufs;
-				kref_get(&ubufs->kref);
-			}
+			vq->heads[nvq->upend_idx].len = VHOST_DMA_IN_PROGRESS;
+			ubuf->callback = vhost_zerocopy_callback;
+			ubuf->ctx = nvq->ubufs;
+			ubuf->desc = nvq->upend_idx;
+			msg.msg_control = ubuf;
+			msg.msg_controllen = sizeof(ubuf);
+			ubufs = nvq->ubufs;
+			kref_get(&ubufs->kref);
 			nvq->upend_idx = (nvq->upend_idx + 1) % UIO_MAXIOV;
-		} else
+		} else {
 			msg.msg_control = NULL;
+			ubufs = NULL;
+		}
 		/* TODO: Check specific error and bomb out unless ENOBUFS? */
 		err = sock->ops->sendmsg(NULL, sock, &msg, len);
 		if (unlikely(err < 0)) {
 			if (zcopy_used) {
-				if (ubufs)
-					vhost_net_ubuf_put(ubufs);
+				vhost_net_ubuf_put(ubufs);
 				nvq->upend_idx = ((unsigned)nvq->upend_idx - 1)
 					% UIO_MAXIOV;
 			}
-- 
1.7.1



[PATCH V3 6/6] vhost_net: correctly limit the max pending buffers

2013-09-02 Thread Jason Wang
As Michael points out, we used to limit the max pending DMAs to get better
cache utilization. But it was not done correctly, since the check was only
done when there were no new buffers submitted from the guest. The guest can
easily exceed the limitation by keeping sending packets.

So this patch moves the check into the main loop. Tests show about a 5%-10%
improvement in per-CPU throughput for guest tx.

Signed-off-by: Jason Wang jasow...@redhat.com
---
 drivers/vhost/net.c |   18 +++---
 1 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 8e9dc55..831eb4f 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -363,6 +363,13 @@ static void handle_tx(struct vhost_net *net)
 		if (zcopy)
 			vhost_zerocopy_signal_used(net, vq);
 
+		/* If more outstanding DMAs, queue the work.
+		 * Handle upend_idx wrap around
+		 */
+		if (unlikely((nvq->upend_idx + vq->num - VHOST_MAX_PEND)
+			     % UIO_MAXIOV == nvq->done_idx))
+			break;
+
 		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
 					 ARRAY_SIZE(vq->iov),
 					 &out, &in,
@@ -372,17 +379,6 @@ static void handle_tx(struct vhost_net *net)
 			break;
 		/* Nothing new?  Wait for eventfd to tell us they refilled. */
 		if (head == vq->num) {
-			int num_pends;
-
-			/* If more outstanding DMAs, queue the work.
-			 * Handle upend_idx wrap around
-			 */
-			num_pends = likely(nvq->upend_idx >= nvq->done_idx) ?
-				    (nvq->upend_idx - nvq->done_idx) :
-				    (nvq->upend_idx + UIO_MAXIOV -
-				     nvq->done_idx);
-			if (unlikely(num_pends > VHOST_MAX_PEND))
-				break;
 			if (unlikely(vhost_enable_notify(&net->dev, vq))) {
 				vhost_disable_notify(&net->dev, vq);
 				continue;
-- 
1.7.1



Re: [PATCH v2 0/4] kvm-unit-tests: Add a series of test cases

2013-09-02 Thread Arthur Chunqi Li
Hi Gleb, Paolo and Jan,

Would you please review this series when you can spare the time?
Jan has reviewed it and, of course, further suggestions are welcome.

Arthur

On Thu, Aug 15, 2013 at 7:45 PM, Arthur Chunqi Li yzt...@gmail.com wrote:
 Add a series of test cases for nested VMX in kvm-unit-tests.

 Arthur Chunqi Li (4):
   kvm-unit-tests: VMX: Add test cases for PAT and EFER
   kvm-unit-tests: VMX: Add test cases for CR0/4 shadowing
   kvm-unit-tests: VMX: Add test cases for I/O bitmaps
   kvm-unit-tests: VMX: Add test cases for instruction  interception

  lib/x86/vm.h|4 +
  x86/vmx.c   |3 +-
  x86/vmx.h   |   20 +-
  x86/vmx_tests.c |  714 +++
  4 files changed, 736 insertions(+), 5 deletions(-)

 --
 1.7.9.5



Re: [PATCH v3 1/6] KVM: nVMX: Replace kvm_set_cr0 with vmx_set_cr0 in load_vmcs12_host_state

2013-09-02 Thread Jan Kiszka
On 2013-09-02 10:21, Gleb Natapov wrote:
 On Thu, Aug 08, 2013 at 04:26:28PM +0200, Jan Kiszka wrote:
 Likely a typo, but a fatal one as kvm_set_cr0 performs checks on the
 Not a typo :) That's what Avi asked for during the initial nested VMX
 review: http://markmail.org/message/hhidqyhbo2mrgxxc

Yeah, should rephrase this.

 
 But there is at least one transition check that kvm_set_cr0() does that
 should not be done during vmexit emulation, namely the CS.L bit check, so I
 tend to agree that kvm_set_cr0() is not appropriate here, at least not as
 it is.

kvm_set_cr0() is for emulating explicit guest changes. It is not the
proper interface for implicit, vendor-dependent changes like this one.

 But can we skip the other checks kvm_set_cr0() does? For instance,
 what prevents us from loading CR0.PG = 1, EFER.LME = 1 and CR4.PAE = 0
 during nested vmexit?  What _should_ prevent it is the vmentry check from
 26.2.4:
 
 If the host address-space size VM-exit control is 1, the following
 must hold:
  - Bit 5 of the CR4 field (corresponding to CR4.PAE) is 1.
 
 But I do not see that we do that check on vmentry.
 
 What about NW/CD bit checks, or reserved bits checks? 27.5.1 says:
   The following bits are not modified:
For CR0, ET, CD, NW; bits 63:32 (on processors that support Intel 64
architecture), 28:19, 17, and 15:6; and any bits that are fixed in
VMX operation (see Section 23.8).
 
 But again the current vmexit code does not emulate this properly and just
 sets everything from host_cr0. vmentry should also preserve all those
 bits, but it looks like it doesn't either.
 

Yes, there is surely more to improve. Do you think the lacking checks
can cause troubles for L0, or is this just imprecise emulation that can
be addressed separately?

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux


[PATCH V3 2/6] vhost_net: use vhost_add_used_and_signal_n() in vhost_zerocopy_signal_used()

2013-09-02 Thread Jason Wang
We tend to batch the used adding and signaling in vhost_zerocopy_callback(),
which may result in more than 100 used buffers being updated in
vhost_zerocopy_signal_used() in some cases. So switch to
vhost_add_used_and_signal_n() to avoid multiple calls to
vhost_add_used_and_signal(), which means far fewer used-index updates and
memory barriers.

A 2% performance improvement was seen on the netperf TCP_RR test.

Signed-off-by: Jason Wang jasow...@redhat.com
---
 drivers/vhost/net.c |   13 -
 1 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 280ee66..8a6dd0d 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -281,7 +281,7 @@ static void vhost_zerocopy_signal_used(struct vhost_net *net,
 {
 	struct vhost_net_virtqueue *nvq =
 		container_of(vq, struct vhost_net_virtqueue, vq);
-	int i;
+	int i, add;
 	int j = 0;
 
 	for (i = nvq->done_idx; i != nvq->upend_idx; i = (i + 1) % UIO_MAXIOV) {
@@ -289,14 +289,17 @@ static void vhost_zerocopy_signal_used(struct vhost_net *net,
 			vhost_net_tx_err(net);
 		if (VHOST_DMA_IS_DONE(vq->heads[i].len)) {
 			vq->heads[i].len = VHOST_DMA_CLEAR_LEN;
-			vhost_add_used_and_signal(vq->dev, vq,
-						  vq->heads[i].id, 0);
 			++j;
 		} else
 			break;
 	}
-	if (j)
-		nvq->done_idx = i;
+	while (j) {
+		add = min(UIO_MAXIOV - nvq->done_idx, j);
+		vhost_add_used_and_signal_n(vq->dev, vq,
+					    &vq->heads[nvq->done_idx], add);
+		nvq->done_idx = (nvq->done_idx + add) % UIO_MAXIOV;
+		j -= add;
+	}
 }
 
 static void vhost_zerocopy_callback(struct ubuf_info *ubuf, bool success)
-- 
1.7.1



[PATCH V3 1/6] vhost_net: make vhost_zerocopy_signal_used() return void

2013-09-02 Thread Jason Wang
None of its callers uses its return value, so let it return void.

Signed-off-by: Jason Wang jasow...@redhat.com
---
 drivers/vhost/net.c |5 ++---
 1 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 969a859..280ee66 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -276,8 +276,8 @@ static void copy_iovec_hdr(const struct iovec *from, struct iovec *to,
  * of used idx. Once lower device DMA done contiguously, we will signal KVM
  * guest used idx.
  */
-static int vhost_zerocopy_signal_used(struct vhost_net *net,
-				      struct vhost_virtqueue *vq)
+static void vhost_zerocopy_signal_used(struct vhost_net *net,
+				       struct vhost_virtqueue *vq)
 {
 	struct vhost_net_virtqueue *nvq =
 		container_of(vq, struct vhost_net_virtqueue, vq);
@@ -297,7 +297,6 @@ static int vhost_zerocopy_signal_used(struct vhost_net *net,
 	}
 	if (j)
 		nvq->done_idx = i;
-	return j;
 }
 
 static void vhost_zerocopy_callback(struct ubuf_info *ubuf, bool success)
-- 
1.7.1



[Bug 60833] New: Window guest time passed very quickly, when restore vm that saved 10 minutes ago.

2013-09-02 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=60833

Bug ID: 60833
   Summary: Window guest time passed very quickly, when restore vm
that saved 10 minutes ago.
   Product: Virtualization
   Version: unspecified
Kernel Version: qemu-kvm-0.12.1.2-2.355.el6.x86_64
  Hardware: x86-64
OS: Linux
  Tree: Mainline
Status: NEW
  Severity: normal
  Priority: P1
 Component: kvm
  Assignee: virtualization_...@kernel-bugs.osdl.org
  Reporter: y...@jhinno.com
Regression: No

Recently, I found a problem with Windows guests (both WinXP and Win7): the
guest time passes very quickly when restoring a VM that was saved an hour ago.

Kvm server OS: Red Hat Enterprise Linux Server release 6.4

Reproduce Steps:
1. install a new windows xp vm named winxp.
2. power on the vm.
3. save the vm by command virsh save winxp /tmp/winxp.save
4. wait 10 minutes.
5. restore the vm by command virsh restore /tmp/winxp.save
6. Watch the clock on the task bar: time passes about 10 times as quickly as
real time.



[PATCH V3 0/6] vhost code cleanup and minor enhancement

2013-09-02 Thread Jason Wang
This series tries to unify and simplify the vhost code, especially for
zerocopy. With this series, a 5%-10% improvement in per-CPU throughput was
seen during netperf guest sending tests.

Please review.

Changes from V2:
- Typo fixes and code style fix
- Add performance gain in the commit log of patch 2/6
- Retest and update the result in patch 6/6

Changes from V1:
- Fix the zerocopy enabling check by changing the check of upend_idx != done_idx
  to (upend_idx + 1) % UIO_MAXIOV == done_idx.
- Switch to use put_user() in __vhost_add_used_n() if there's only one used
- Keep the max pending check based on Michael's suggestion.

Jason Wang (6):
  vhost_net: make vhost_zerocopy_signal_used() return void
  vhost_net: use vhost_add_used_and_signal_n() in
vhost_zerocopy_signal_used()
  vhost: switch to use vhost_add_used_n()
  vhost_net: determine whether or not to use zerocopy at one time
  vhost_net: poll vhost queue after marking DMA is done
  vhost_net: correctly limit the max pending buffers

 drivers/vhost/net.c   |   92 ++--
 drivers/vhost/vhost.c |   54 ++--
 2 files changed, 54 insertions(+), 92 deletions(-)



Re: [PATCH] KVM: mmu: allow page tables to be in read-only slots

2013-09-02 Thread Xiao Guangrong
On 08/30/2013 08:41 PM, Paolo Bonzini wrote:
 Page tables in a read-only memory slot will currently cause a triple
 fault because the page walker uses gfn_to_hva and it fails on such a slot.
 
 OVMF uses such a page table; however, real hardware seems to be fine with
 that as long as the accessed/dirty bits are set.  Save whether the slot
 is readonly, and later check it when updating the accessed and dirty bits.

Paolo, do you know why OVMF is using readonly memory like this?

AFAIK, the fault triggered by this kind of access can hardly be fixed by
userspace, since the fault is triggered by page-table walking, not by the
current instruction. Do you have any idea how to let userspace emulate it
properly?



[Bug 60833] Window guest time passed very quickly, when restore vm that saved 10 minutes ago.

2013-09-02 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=60833

Gleb g...@redhat.com changed:

   What|Removed |Added

 CC||g...@redhat.com

--- Comment #1 from Gleb g...@redhat.com ---
This is how it is supposed to work. If you do not want this behaviour, change
the time catch-up policy in libvirt.

BTW this bugzilla is for upstream issues, not vendor kernels.
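The policy is set per timer in the libvirt domain XML; an illustrative
fragment (check the libvirt domain format documentation for the exact options
supported by your version):

    <clock offset='localtime'>
      <timer name='rtc' tickpolicy='delay'/>
    </clock>

With tickpolicy='catchup' missed ticks are injected at a higher rate until the
guest clock catches up (the behaviour reported above); 'delay' delivers them
at the normal rate, so the guest clock lags instead of running fast.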



Re: [PATCH] KVM: mmu: allow page tables to be in read-only slots

2013-09-02 Thread Gleb Natapov
On Mon, Sep 02, 2013 at 05:20:15PM +0800, Xiao Guangrong wrote:
 On 08/30/2013 08:41 PM, Paolo Bonzini wrote:
  Page tables in a read-only memory slot will currently cause a triple
  fault because the page walker uses gfn_to_hva and it fails on such a slot.
  
  OVMF uses such a page table; however, real hardware seems to be fine with
  that as long as the accessed/dirty bits are set.  Save whether the slot
  is readonly, and later check it when updating the accessed and dirty bits.
 
 Paolo, do you know why OVMF is using readonly memory like this?
 
Just a guess, but perhaps they want to move to paging mode as early as
possible, even before the memory controller is fully initialized.

 AFAIK, the fault triggered by this kind of access can hardly be fixed by
 userspace, since the fault is triggered by page-table walking, not by the
 current instruction. Do you have any idea how to let userspace emulate it
 properly?
Not sure what userspace you mean here, but there shouldn't be a fault in the
first place if the ROM page tables have the accessed/dirty bits set, and
they do.

--
Gleb.


[PATCH] kvm-unit-tests: VMX: Add the framework of EPT

2013-09-02 Thread Arthur Chunqi Li
Add a framework of EPT in nested VMX testing, including a set of
functions to construct and read EPT paging structures and a simple
read/write test of EPT remapping from guest to host.

Signed-off-by: Arthur Chunqi Li yzt...@gmail.com
---
 x86/vmx.c   |  132 --
 x86/vmx.h   |   76 +++
 x86/vmx_tests.c |  156 +++
 3 files changed, 360 insertions(+), 4 deletions(-)

diff --git a/x86/vmx.c b/x86/vmx.c
index ca36d35..a156b71 100644
--- a/x86/vmx.c
+++ b/x86/vmx.c
@@ -143,6 +143,132 @@ asm(
   call hypercall\n\t
 );
 
+/* EPT paging structure related functions */
+/* install_ept_entry : Install a page to a given level in EPT
+   @pml4 : addr of pml4 table
+   @pte_level : level of PTE to set
+   @guest_addr : physical address of guest
+   @pte : pte value to set
+   @pt_page : address of page table, NULL for a new page
+ */
+void install_ept_entry(unsigned long *pml4,
+   int pte_level,
+   unsigned long guest_addr,
+   unsigned long pte,
+   unsigned long *pt_page)
+{
+   int level;
+   unsigned long *pt = pml4;
+   unsigned offset;
+
+	for (level = EPT_PAGE_LEVEL; level > pte_level; --level) {
+		offset = (guest_addr >> ((level-1) * EPT_PGDIR_WIDTH + 12))
+				& EPT_PGDIR_MASK;
+		if (!(pt[offset] & (EPT_RA | EPT_WA | EPT_EA))) {
+			unsigned long *new_pt = pt_page;
+			if (!new_pt)
+				new_pt = alloc_page();
+			else
+				pt_page = 0;
+			memset(new_pt, 0, PAGE_SIZE);
+			pt[offset] = virt_to_phys(new_pt)
+					| EPT_RA | EPT_WA | EPT_EA;
+		}
+		pt = phys_to_virt(pt[offset] & 0xffffffffff000ull);
+	}
+	offset = ((unsigned long)guest_addr >> ((level-1) *
+			EPT_PGDIR_WIDTH + 12)) & EPT_PGDIR_MASK;
+	pt[offset] = pte;
+}
+
+/* Map a page, @perm is the permission of the page */
+void install_ept(unsigned long *pml4,
+   unsigned long phys,
+   unsigned long guest_addr,
+   u64 perm)
+{
+	install_ept_entry(pml4, 1, guest_addr, (phys & PAGE_MASK) | perm, 0);
+}
+
+/* Map a 1G-size page */
+void install_1g_ept(unsigned long *pml4,
+   unsigned long phys,
+   unsigned long guest_addr,
+   u64 perm)
+{
+	install_ept_entry(pml4, 3, guest_addr,
+			(phys & PAGE_MASK) | perm | EPT_LARGE_PAGE, 0);
+}
+
+/* Map a 2M-size page */
+void install_2m_ept(unsigned long *pml4,
+   unsigned long phys,
+   unsigned long guest_addr,
+   u64 perm)
+{
+	install_ept_entry(pml4, 2, guest_addr,
+			(phys & PAGE_MASK) | perm | EPT_LARGE_PAGE, 0);
+}
+
+/* setup_ept_range : Setup a range of 1:1 mapped page to EPT paging structure.
+   @start : start address of guest page
+   @len : length of address to be mapped
+   @map_1g : whether 1G page map is used
+   @map_2m : whether 2M page map is used
+   @perm : permission for every page
+ */
+int setup_ept_range(unsigned long *pml4, unsigned long start,
+   unsigned long len, int map_1g, int map_2m, u64 perm)
+{
+   u64 phys = start;
+   u64 max = (u64)len + (u64)start;
+
+   if (map_1g) {
+		while (phys + PAGE_SIZE_1G <= max) {
+   install_1g_ept(pml4, phys, phys, perm);
+   phys += PAGE_SIZE_1G;
+   }
+   }
+   if (map_2m) {
+		while (phys + PAGE_SIZE_2M <= max) {
+   install_2m_ept(pml4, phys, phys, perm);
+   phys += PAGE_SIZE_2M;
+   }
+   }
+	while (phys + PAGE_SIZE <= max) {
+   install_ept(pml4, phys, phys, perm);
+   phys += PAGE_SIZE;
+   }
+   return 0;
+}
+
+/* get_ept_pte : Get the PTE of a given level in EPT,
+@level == 1 means get the latest level*/
+unsigned long get_ept_pte(unsigned long *pml4,
+   unsigned long guest_addr, int level)
+{
+   int l;
+   unsigned long *pt = pml4, pte;
+   unsigned offset;
+
+	for (l = EPT_PAGE_LEVEL; l > 1; --l) {
+		offset = (guest_addr >> (((l-1) * EPT_PGDIR_WIDTH) + 12))
+				& EPT_PGDIR_MASK;
+		pte = pt[offset];
+		if (!(pte & (EPT_RA | EPT_WA | EPT_EA)))
+			return 0;
+		if (l == level)
+			return pte;
+		if (l < 4 && (pte & EPT_LARGE_PAGE))
+			return pte;
+   pt = (unsigned long 

Re: [PATCH v3 1/6] KVM: nVMX: Replace kvm_set_cr0 with vmx_set_cr0 in load_vmcs12_host_state

2013-09-02 Thread Gleb Natapov
On Mon, Sep 02, 2013 at 11:06:53AM +0200, Jan Kiszka wrote:
 On 2013-09-02 10:21, Gleb Natapov wrote:
  On Thu, Aug 08, 2013 at 04:26:28PM +0200, Jan Kiszka wrote:
  Likely a typo, but a fatal one as kvm_set_cr0 performs checks on the
  Not a typo :) That's what Avi asked for during the initial nested VMX
  review: http://markmail.org/message/hhidqyhbo2mrgxxc
 
 Yeah, should rephrase this.
 
  
  But there is at least one transition check that kvm_set_cr0() does that
  should not be done during vmexit emulation, namely the CS.L bit check, so I
  tend to agree that kvm_set_cr0() is not appropriate here, at least not as
  it is.
 
 kvm_set_cr0() is for emulating explicit guest changes. It is not the
 proper interface for implicit, vendor-dependent changes like this one.
 
Agreed, the problem is that we do not have a proper interface for implicit
changes like this one (I do not see why it is vendor-dependent; SVM also
restores host state in a similar way).

  But can we skip the other checks kvm_set_cr0() does? For instance,
  what prevents us from loading CR0.PG = 1, EFER.LME = 1 and CR4.PAE = 0
  during nested vmexit?  What _should_ prevent it is the vmentry check from
  26.2.4:
  
  If the host address-space size VM-exit control is 1, the following
  must hold:
   - Bit 5 of the CR4 field (corresponding to CR4.PAE) is 1.
  
  But I do not see that we do that check on vmentry.
  
  What about NW/CD bit checks, or reserved bits checks? 27.5.1 says:
The following bits are not modified:
 For CR0, ET, CD, NW; bits 63:32 (on processors that support Intel 64
 architecture), 28:19, 17, and 15:6; and any bits that are fixed in
 VMX operation (see Section 23.8).
  
  But again the current vmexit code does not emulate this properly and just
  sets everything from host_cr0. vmentry should also preserve all those
  bits, but it looks like it doesn't either.
  
 
 Yes, there is surely more to improve. Do you think the lacking checks
 can cause troubles for L0, or is this just imprecise emulation that can
 be addressed separately?
 
The lacking checks may cause L0 to fail guest entry, which will trigger an
internal error. If it is exploitable by L0 userspace it is a serious problem;
if only the L0 kernel can trigger it, then less so. I remember Avi was
concerned that KVM code may depend on all registers being consistent,
otherwise it can be exploited. I cannot prove or disprove this theory :),
but if that is the case then even the L0 kernel case is problematic.

--
Gleb.


Re: [PATCH] kvm-unit-tests: VMX: Add the framework of EPT

2013-09-02 Thread Arthur Chunqi Li
There must be some minor revisions to be done in this patch, so this is
mainly an RFC mail.

Besides, I'm not quite clear on what we should test in the nested EPT
module, and I bet the writers of nested EPT must have ideas on how to
continue and refine this testing part. Any suggestion on which parts of
nested EPT to test, and how, is welcome.

Please help me CC EPT-related people if anyone knows them.

Thanks,
Arthur

On Mon, Sep 2, 2013 at 5:26 PM, Arthur Chunqi Li yzt...@gmail.com wrote:
 Add a framework of EPT in nested VMX testing, including a set of
 functions to construct and read EPT paging structures and a simple
 read/write test of EPT remapping from guest to host.

 Signed-off-by: Arthur Chunqi Li yzt...@gmail.com
 ---

Re: [PATCH] KVM: mmu: allow page tables to be in read-only slots

2013-09-02 Thread Xiao Guangrong
On 09/01/2013 05:17 PM, Gleb Natapov wrote:
 On Fri, Aug 30, 2013 at 02:41:37PM +0200, Paolo Bonzini wrote:
 Page tables in a read-only memory slot will currently cause a triple
 fault because the page walker uses gfn_to_hva and it fails on such a slot.

 OVMF uses such a page table; however, real hardware seems to be fine with
 that as long as the accessed/dirty bits are set.  Save whether the slot
 is readonly, and later check it when updating the accessed and dirty bits.

 The fix looks OK to me, but some comments below.
 
 Cc: sta...@vger.kernel.org
 Cc: g...@redhat.com
 Cc: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
 Signed-off-by: Paolo Bonzini pbonz...@redhat.com
 ---
  CCing to stable@ since the regression was introduced with
  support for readonly memory slots.

  arch/x86/kvm/paging_tmpl.h |  7 ++-
  include/linux/kvm_host.h   |  1 +
  virt/kvm/kvm_main.c| 14 +-
  3 files changed, 16 insertions(+), 6 deletions(-)

 diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
 index 0433301..dadc5c0 100644
 --- a/arch/x86/kvm/paging_tmpl.h
 +++ b/arch/x86/kvm/paging_tmpl.h
 @@ -99,6 +99,7 @@ struct guest_walker {
 	pt_element_t prefetch_ptes[PTE_PREFETCH_NUM];
 	gpa_t pte_gpa[PT_MAX_FULL_LEVELS];
 	pt_element_t __user *ptep_user[PT_MAX_FULL_LEVELS];
 +	bool pte_writable[PT_MAX_FULL_LEVELS];
 	unsigned pt_access;
 	unsigned pte_access;
 	gfn_t gfn;
 @@ -235,6 +236,9 @@ static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu,
 		if (pte == orig_pte)
 			continue;
 
 +		if (unlikely(!walker->pte_writable[level - 1]))
 +			return -EACCES;
 +
 		ret = FNAME(cmpxchg_gpte)(vcpu, mmu, ptep_user, index, orig_pte, pte);
 		if (ret)
 			return ret;
 @@ -309,7 +313,8 @@ retry_walk:
 			goto error;
 		real_gfn = gpa_to_gfn(real_gfn);
 
 -		host_addr = gfn_to_hva(vcpu->kvm, real_gfn);
 +		host_addr = gfn_to_hva_read(vcpu->kvm, real_gfn,
 +					    &walker->pte_writable[walker->level - 1]);
 The use of gfn_to_hva_read is misleading. The code can still write into the
 gfn. Let's rename gfn_to_hva_read to gfn_to_hva_prot() and gfn_to_hva()
 to gfn_to_hva_write().

Yes. I agreed.

 
 This makes me think: are there other places where gfn_to_hva() was
 used, but gfn_to_hva_prot() should have been?
  - kvm_host_page_size() looks incorrect. We never use huge pages to map
    read-only memory slots currently.

It only checks whether the gfn has been mapped. I think we can use
gfn_to_hva_read() instead; the real permission will be checked when we
translate the gfn to a pfn.

  - kvm_handle_bad_page() also looks incorrect and may cause an incorrect
    address to be reported to userspace.

I have no idea on this point. kvm_handle_bad_page() is called when it fails
to translate the target gfn to a pfn; then the emulator can detect the error
on the target gfn properly, no? Or have I misunderstood your meaning?

  - kvm_setup_async_pf() is also incorrect. It makes all page faults on a
    read-only slot synchronous.
  - kvm_vm_fault() looks OK since the function assumes writable slots, but
    it is obsolete and should be deleted anyway.




Re: [PATCH] KVM: mmu: allow page tables to be in read-only slots

2013-09-02 Thread Gleb Natapov
On Mon, Sep 02, 2013 at 05:42:25PM +0800, Xiao Guangrong wrote:
 On 09/01/2013 05:17 PM, Gleb Natapov wrote:
  On Fri, Aug 30, 2013 at 02:41:37PM +0200, Paolo Bonzini wrote:
  Page tables in a read-only memory slot will currently cause a triple
  fault because the page walker uses gfn_to_hva and it fails on such a slot.
 
  OVMF uses such a page table; however, real hardware seems to be fine with
  that as long as the accessed/dirty bits are set.  Save whether the slot
  is readonly, and later check it when updating the accessed and dirty bits.
 
  The fix looks OK to me, but some comment below.
  
  Cc: sta...@vger.kernel.org
  Cc: g...@redhat.com
  Cc: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
  Signed-off-by: Paolo Bonzini pbonz...@redhat.com
  ---
 CCing to stable@ since the regression was introduced with
 support for readonly memory slots.
 
   arch/x86/kvm/paging_tmpl.h |  7 ++-
   include/linux/kvm_host.h   |  1 +
   virt/kvm/kvm_main.c| 14 +-
   3 files changed, 16 insertions(+), 6 deletions(-)
 
  diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
  index 0433301..dadc5c0 100644
  --- a/arch/x86/kvm/paging_tmpl.h
  +++ b/arch/x86/kvm/paging_tmpl.h
  @@ -99,6 +99,7 @@ struct guest_walker {
 pt_element_t prefetch_ptes[PTE_PREFETCH_NUM];
 gpa_t pte_gpa[PT_MAX_FULL_LEVELS];
 pt_element_t __user *ptep_user[PT_MAX_FULL_LEVELS];
  +  bool pte_writable[PT_MAX_FULL_LEVELS];
 unsigned pt_access;
 unsigned pte_access;
 gfn_t gfn;
  @@ -235,6 +236,9 @@ static int FNAME(update_accessed_dirty_bits)(struct 
  kvm_vcpu *vcpu,
 if (pte == orig_pte)
 continue;
   
  +  if (unlikely(!walker->pte_writable[level - 1]))
  +  return -EACCES;
  +
 ret = FNAME(cmpxchg_gpte)(vcpu, mmu, ptep_user, index, 
  orig_pte, pte);
 if (ret)
 return ret;
  @@ -309,7 +313,8 @@ retry_walk:
 goto error;
 real_gfn = gpa_to_gfn(real_gfn);
   
  -  host_addr = gfn_to_hva(vcpu->kvm, real_gfn);
  +  host_addr = gfn_to_hva_read(vcpu->kvm, real_gfn,
  +                              &walker->pte_writable[walker->level - 1]);
  The use of gfn_to_hva_read is misleading. The code can still write into
  gfn. Let's rename gfn_to_hva_read to gfn_to_hva_prot() and gfn_to_hva()
  to gfn_to_hva_write().
 
 Yes, I agree.
 
  
  This makes me think are there other places where gfn_to_hva() was
  used, but gfn_to_hva_prot() should have been?
   - kvm_host_page_size() looks incorrect. We never use huge page to map
 read only memory slots currently.
 
 It only checks whether the gfn has been mapped. I think we can use
 gfn_to_hva_read() instead; the real permission will be checked when we
 translate the gfn to pfn.
 
Yes, all the cases I listed should be changed to use a function that looks
at both regular and RO slots.

   - kvm_handle_bad_page() also looks incorrect and may cause incorrect
 address to be reported to userspace.
 
 I am not sure about this point. kvm_handle_bad_page() is called when we fail
 to translate the target gfn to a pfn, and the emulator can then detect the
 error on the target gfn properly, no? Or did I misunderstand your meaning?
 
I am talking about the following code:

if (pfn == KVM_PFN_ERR_HWPOISON) {
        kvm_send_hwpoison_signal(gfn_to_hva(vcpu->kvm, gfn), current);
        return 0;
}

pfn will be KVM_PFN_ERR_HWPOISON if gfn is backed by faulty memory; we need
to report the linear address of the faulty memory to userspace here,
but if gfn is in an RO slot gfn_to_hva() will not return the correct address
here.
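
For illustration, the RO-aware variant suggested here would look roughly like
this (a sketch, assuming the gfn_to_hva_prot() rename proposed above; NULL
because this caller does not care whether the slot is writable):

	if (pfn == KVM_PFN_ERR_HWPOISON) {
		kvm_send_hwpoison_signal(gfn_to_hva_prot(vcpu->kvm, gfn, NULL),
					 current);
		return 0;
	}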

   - kvm_setup_async_pf() also looks incorrect. It makes all page faults on
 read-only slots synchronous.
   - kvm_vm_fault() looks OK since the function assumes write-only slots,
 but it is obsolete and should be deleted anyway.
 

--
Gleb.


Re: [PATCH] KVM: mmu: allow page tables to be in read-only slots

2013-09-02 Thread Xiao Guangrong
On 09/02/2013 05:25 PM, Gleb Natapov wrote:
 On Mon, Sep 02, 2013 at 05:20:15PM +0800, Xiao Guangrong wrote:
 On 08/30/2013 08:41 PM, Paolo Bonzini wrote:
 Page tables in a read-only memory slot will currently cause a triple
 fault because the page walker uses gfn_to_hva and it fails on such a slot.

 OVMF uses such a page table; however, real hardware seems to be fine with
 that as long as the accessed/dirty bits are set.  Save whether the slot
 is readonly, and later check it when updating the accessed and dirty bits.

 Paolo, do you know why OVMF is using readonly memory like this?

 Just a guess, but perhaps they want to move to paging mode as early as
 possible even before memory controller is fully initialized.
 
 AFAIK, the fault triggered by this kind of access can hardly be fixed by
 userspace since the fault is triggered by pagetable walking, not by the current
 instruction. Do you have any idea to let userspace emulate it properly?
 Not sure what userspace you mean here, but there shouldn't be a fault in the

I just wonder how to fix this kind of fault. The current patch returns -EACCES
but that will crash the guest. I think we'd better let userspace fix this
error (let userspace set the D/A bits).

 first place if ROM page tables have access/dirty bit set and they do.

Yes, so we cannot call x86_emulate_instruction() to fix this fault (that
function emulates the access in the first place). Should we directly return an
MMIO exit to userspace when we hit this fault? And what happens if this fault
on pagetable walking is triggered inside x86_emulate_instruction()?



Re: [PATCH] KVM: mmu: allow page tables to be in read-only slots

2013-09-02 Thread Xiao Guangrong
On 09/02/2013 05:49 PM, Gleb Natapov wrote:
 On Mon, Sep 02, 2013 at 05:42:25PM +0800, Xiao Guangrong wrote:
 On 09/01/2013 05:17 PM, Gleb Natapov wrote:
 On Fri, Aug 30, 2013 at 02:41:37PM +0200, Paolo Bonzini wrote:
 Page tables in a read-only memory slot will currently cause a triple
 fault because the page walker uses gfn_to_hva and it fails on such a slot.

 OVMF uses such a page table; however, real hardware seems to be fine with
 that as long as the accessed/dirty bits are set.  Save whether the slot
 is readonly, and later check it when updating the accessed and dirty bits.

 The fix looks OK to me, but some comments below.

 Cc: sta...@vger.kernel.org
 Cc: g...@redhat.com
 Cc: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
 Signed-off-by: Paolo Bonzini pbonz...@redhat.com
 ---
CCing to stable@ since the regression was introduced with
support for readonly memory slots.

  arch/x86/kvm/paging_tmpl.h |  7 ++-
  include/linux/kvm_host.h   |  1 +
  virt/kvm/kvm_main.c| 14 +-
  3 files changed, 16 insertions(+), 6 deletions(-)

 diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
 index 0433301..dadc5c0 100644
 --- a/arch/x86/kvm/paging_tmpl.h
 +++ b/arch/x86/kvm/paging_tmpl.h
 @@ -99,6 +99,7 @@ struct guest_walker {
pt_element_t prefetch_ptes[PTE_PREFETCH_NUM];
gpa_t pte_gpa[PT_MAX_FULL_LEVELS];
pt_element_t __user *ptep_user[PT_MAX_FULL_LEVELS];
 +  bool pte_writable[PT_MAX_FULL_LEVELS];
unsigned pt_access;
unsigned pte_access;
gfn_t gfn;
 @@ -235,6 +236,9 @@ static int FNAME(update_accessed_dirty_bits)(struct 
 kvm_vcpu *vcpu,
if (pte == orig_pte)
continue;
  
 +  if (unlikely(!walker->pte_writable[level - 1]))
 +  return -EACCES;
 +
ret = FNAME(cmpxchg_gpte)(vcpu, mmu, ptep_user, index, 
 orig_pte, pte);
if (ret)
return ret;
 @@ -309,7 +313,8 @@ retry_walk:
goto error;
real_gfn = gpa_to_gfn(real_gfn);
  
 -  host_addr = gfn_to_hva(vcpu->kvm, real_gfn);
 +  host_addr = gfn_to_hva_read(vcpu->kvm, real_gfn,
 +                              &walker->pte_writable[walker->level - 1]);
 The use of gfn_to_hva_read is misleading. The code can still write into
 gfn. Let's rename gfn_to_hva_read to gfn_to_hva_prot() and gfn_to_hva()
 to gfn_to_hva_write().

 Yes, I agree.


 This makes me think are there other places where gfn_to_hva() was
 used, but gfn_to_hva_prot() should have been?
  - kvm_host_page_size() looks incorrect. We never use huge page to map
read only memory slots currently.

 It only checks whether the gfn has been mapped. I think we can use
 gfn_to_hva_read() instead; the real permission will be checked when we
 translate the gfn to pfn.

 Yes, all the cases I listed should be changed to use a function that looks
 at both regular and RO slots.
 
  - kvm_handle_bad_page() also looks incorrect and may cause incorrect
address to be reported to userspace.

 I am not sure about this point. kvm_handle_bad_page() is called when we fail
 to translate the target gfn to a pfn, and the emulator can then detect the
 error on the target gfn properly, no? Or did I misunderstand your meaning?

 I am talking about the following code:
 
 if (pfn == KVM_PFN_ERR_HWPOISON) {
         kvm_send_hwpoison_signal(gfn_to_hva(vcpu->kvm, gfn), current);
         return 0;
 }
 
 pfn will be KVM_PFN_ERR_HWPOISON if gfn is backed by faulty memory; we need
 to report the linear address of the faulty memory to userspace here,
 but if gfn is in an RO slot gfn_to_hva() will not return the correct address
 here.

Got it, thanks for your explanation.

BTW, if you and Paolo are busy on other things, I am happy to fix these
issues. :)



Re: [PATCH] KVM: mmu: allow page tables to be in read-only slots

2013-09-02 Thread Gleb Natapov
On Mon, Sep 02, 2013 at 06:00:39PM +0800, Xiao Guangrong wrote:
 On 09/02/2013 05:25 PM, Gleb Natapov wrote:
  On Mon, Sep 02, 2013 at 05:20:15PM +0800, Xiao Guangrong wrote:
  On 08/30/2013 08:41 PM, Paolo Bonzini wrote:
  Page tables in a read-only memory slot will currently cause a triple
  fault because the page walker uses gfn_to_hva and it fails on such a slot.
 
  OVMF uses such a page table; however, real hardware seems to be fine with
  that as long as the accessed/dirty bits are set.  Save whether the slot
  is readonly, and later check it when updating the accessed and dirty bits.
 
  Paolo, do you know why OVMF is using readonly memory like this?
 
  Just a guess, but perhaps they want to move to paging mode as early as
  possible even before memory controller is fully initialized.
  
  AFAIK, the fault triggered by this kind of access can hardly be fixed by
  userspace since the fault is triggered by pagetable walking, not by the
  current instruction. Do you have any idea to let userspace emulate it
  properly?
  Not sure what userspace you mean here, but there shouldn't be a fault in the
 
 I just wonder how to fix this kind of fault. The current patch returns -EACCES
 but that will crash the guest. I think we'd better let userspace fix this
 error (let userspace set the D/A bits).
 
Ugh, this is not good. Missed that. Don't know what real HW will do
here, but the easy thing for us to do would be to just return success.

  first place if ROM page tables have access/dirty bit set and they do.
 
 Yes, so we cannot call x86_emulate_instruction() to fix this fault (that
 function emulates the access in the first place). Should we directly return
 an MMIO exit to userspace when we hit this fault? And what happens if this
 fault on pagetable walking is triggered inside x86_emulate_instruction()?
I think we should not return an MMIO exit to userspace. Either ignore the
write attempt or kill the guest.
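
A sketch of the ignore-the-write option, applied to the check that the patch
adds in FNAME(update_accessed_dirty_bits) (hypothetical; it skips the cmpxchg
for that level instead of failing the whole walk):

	if (unlikely(!walker->pte_writable[level - 1]))
		continue;	/* drop the A/D-bit write, like a write to ROM */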

--
Gleb.


Re: [PATCH] KVM: mmu: allow page tables to be in read-only slots

2013-09-02 Thread Gleb Natapov
On Mon, Sep 02, 2013 at 06:05:10PM +0800, Xiao Guangrong wrote:
 On 09/02/2013 05:49 PM, Gleb Natapov wrote:
  On Mon, Sep 02, 2013 at 05:42:25PM +0800, Xiao Guangrong wrote:
  On 09/01/2013 05:17 PM, Gleb Natapov wrote:
  On Fri, Aug 30, 2013 at 02:41:37PM +0200, Paolo Bonzini wrote:
  Page tables in a read-only memory slot will currently cause a triple
  fault because the page walker uses gfn_to_hva and it fails on such a 
  slot.
 
  OVMF uses such a page table; however, real hardware seems to be fine with
  that as long as the accessed/dirty bits are set.  Save whether the slot
  is readonly, and later check it when updating the accessed and dirty 
  bits.
 
  The fix looks OK to me, but some comments below.
 
  Cc: sta...@vger.kernel.org
  Cc: g...@redhat.com
  Cc: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
  Signed-off-by: Paolo Bonzini pbonz...@redhat.com
  ---
   CCing to stable@ since the regression was introduced with
   support for readonly memory slots.
 
   arch/x86/kvm/paging_tmpl.h |  7 ++-
   include/linux/kvm_host.h   |  1 +
   virt/kvm/kvm_main.c| 14 +-
   3 files changed, 16 insertions(+), 6 deletions(-)
 
  diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
  index 0433301..dadc5c0 100644
  --- a/arch/x86/kvm/paging_tmpl.h
  +++ b/arch/x86/kvm/paging_tmpl.h
  @@ -99,6 +99,7 @@ struct guest_walker {
   pt_element_t prefetch_ptes[PTE_PREFETCH_NUM];
   gpa_t pte_gpa[PT_MAX_FULL_LEVELS];
   pt_element_t __user *ptep_user[PT_MAX_FULL_LEVELS];
  +bool pte_writable[PT_MAX_FULL_LEVELS];
   unsigned pt_access;
   unsigned pte_access;
   gfn_t gfn;
  @@ -235,6 +236,9 @@ static int FNAME(update_accessed_dirty_bits)(struct 
  kvm_vcpu *vcpu,
   if (pte == orig_pte)
   continue;
   
  +if (unlikely(!walker->pte_writable[level - 1]))
  +return -EACCES;
  +
   ret = FNAME(cmpxchg_gpte)(vcpu, mmu, ptep_user, index, 
  orig_pte, pte);
   if (ret)
   return ret;
  @@ -309,7 +313,8 @@ retry_walk:
   goto error;
   real_gfn = gpa_to_gfn(real_gfn);
   
  -host_addr = gfn_to_hva(vcpu->kvm, real_gfn);
  +host_addr = gfn_to_hva_read(vcpu->kvm, real_gfn,
  +                            &walker->pte_writable[walker->level - 1]);
  The use of gfn_to_hva_read is misleading. The code can still write into
  gfn. Let's rename gfn_to_hva_read to gfn_to_hva_prot() and gfn_to_hva()
  to gfn_to_hva_write().
 
  Yes, I agree.
 
 
  This makes me think are there other places where gfn_to_hva() was
  used, but gfn_to_hva_prot() should have been?
   - kvm_host_page_size() looks incorrect. We never use huge page to map
 read only memory slots currently.
 
  It only checks whether the gfn has been mapped. I think we can use
  gfn_to_hva_read() instead; the real permission will be checked when we
  translate the gfn to pfn.
 
  Yes, all the cases I listed should be changed to use a function that looks
  at both regular and RO slots.
  
   - kvm_handle_bad_page() also looks incorrect and may cause incorrect
 address to be reported to userspace.
 
  I am not sure about this point. kvm_handle_bad_page() is called when we fail
  to translate the target gfn to a pfn, and the emulator can then detect the
  error on the target gfn properly, no? Or did I misunderstand your meaning?
 
  I am talking about the following code:
  
  if (pfn == KVM_PFN_ERR_HWPOISON) {
          kvm_send_hwpoison_signal(gfn_to_hva(vcpu->kvm, gfn), current);
          return 0;
  }
  
  pfn will be KVM_PFN_ERR_HWPOISON if gfn is backed by faulty memory; we need
  to report the linear address of the faulty memory to userspace here,
  but if gfn is in an RO slot gfn_to_hva() will not return the correct address
  here.
 
 Got it, thanks for your explanation.
 
  BTW, if you and Paolo are busy on other things, I am happy to fix these
  issues. :)
I am busy with reviews mostly :). If you are not too busy with lockless
write protection then fine with me. Let's wait for Paolo's input on the
proposed API though.

--
Gleb.


[PATCH] KVM: nEPT: reset PDPTR register cache on nested vmentry emulation

2013-09-02 Thread Gleb Natapov
After a nested vmentry, a stale cache can be used to reload the L2 PDPTR
pointers, which will cause the L2 guest to fail. Fix it by invalidating the
cache on nested vmentry emulation.

https://bugzilla.kernel.org/show_bug.cgi?id=60830

Signed-off-by: Gleb Natapov g...@redhat.com
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 57b4e12..6f69aac 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -7765,6 +7765,10 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
	vmcs_write64(GUEST_PDPTR1, vmcs12->guest_pdptr1);
	vmcs_write64(GUEST_PDPTR2, vmcs12->guest_pdptr2);
	vmcs_write64(GUEST_PDPTR3, vmcs12->guest_pdptr3);
+	__clear_bit(VCPU_EXREG_PDPTR,
+		    (unsigned long *)&vcpu->arch.regs_avail);
+	__clear_bit(VCPU_EXREG_PDPTR,
+		    (unsigned long *)&vcpu->arch.regs_dirty);
	}

	kvm_register_write(vcpu, VCPU_REGS_RSP, vmcs12->guest_rsp);
--
Gleb.


Re: [Qemu-devel] [PATCH] i386: forward CPUID cache leaves when -cpu host is used

2013-09-02 Thread Andreas Färber
Hi,

target-i386: please.

On 27.08.2013 22:38, Benoît Canet wrote:
 Some users running CPU-intensive tasks that check the cache CPUID leaves at
 startup and make decisions based on the result reported that the guest was
 not reflecting the host cache CPUID leaves when -cpu host is used.
 
 This patch fixes this.
 
 Signed-off-by: Benoit Canet ben...@irqsave.net
 ---
  target-i386/cpu.c |   19 +++
  target-i386/cpu.h |1 +
  2 files changed, 20 insertions(+)
 
 diff --git a/target-i386/cpu.c b/target-i386/cpu.c
 index 42c5de0..2c8eaf7 100644
 --- a/target-i386/cpu.c
 +++ b/target-i386/cpu.c
 @@ -374,6 +374,7 @@ typedef struct x86_def_t {
  int stepping;
  FeatureWordArray features;
  char model_id[48];
 +bool fwd_cpuid_cache_leaves;
  } x86_def_t;
  
  #define I486_FEATURES (CPUID_FP87 | CPUID_VME | CPUID_PSE)
 @@ -1027,6 +1028,7 @@ static void kvm_cpu_fill_host(x86_def_t *x86_cpu_def)
  assert(kvm_enabled());
  
  x86_cpu_def->name = "host";
 +x86_cpu_def->fwd_cpuid_cache_leaves = true;
  host_cpuid(0x0, 0, &eax, &ebx, &ecx, &edx);
  x86_cpu_vendor_words2str(x86_cpu_def->vendor, ebx, edx, ecx);
  
 @@ -1776,6 +1778,7 @@ static void cpu_x86_register(X86CPU *cpu, const char 
 *name, Error **errp)
  env->features[FEAT_C000_0001_EDX] = def->features[FEAT_C000_0001_EDX];
  env->features[FEAT_7_0_EBX] = def->features[FEAT_7_0_EBX];
  env->cpuid_xlevel2 = def->xlevel2;
 +env->fwd_cpuid_cache_leaves = def->fwd_cpuid_cache_leaves;
  
  object_property_set_str(OBJECT(cpu), def->model_id, "model-id", errp);
  }
 @@ -1949,6 +1952,10 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, 
 uint32_t count,
  }
  break;
  case 2:
 +if (env->fwd_cpuid_cache_leaves) {
 +host_cpuid(0x2, 0, eax, ebx, ecx, edx);
 +break;
 +}
  /* cache info: needed for Pentium Pro compatibility */
  *eax = 1;
  *ebx = 0;
 @@ -1956,6 +1963,10 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, 
 uint32_t count,
  *edx = 0x2c307d;
  break;
  case 4:
 +if (env->fwd_cpuid_cache_leaves) {
 +host_cpuid(0x4, count, eax, ebx, ecx, edx);
 +break;
 +}
  /* cache info: needed for Core compatibility */
  if (cs->nr_cores > 1) {
  *eax = (cs->nr_cores - 1) << 26;
 @@ -2102,6 +2113,10 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, 
 uint32_t count,
  break;
  case 0x8005:
  /* cache info (L1 cache) */
 +if (env->fwd_cpuid_cache_leaves) {
 +host_cpuid(0x8005, 0, eax, ebx, ecx, edx);
 +break;
 +}
  *eax = 0x01ff01ff;
  *ebx = 0x01ff01ff;
  *ecx = 0x40020140;
 @@ -2109,6 +2124,10 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, 
 uint32_t count,
  break;
  case 0x8006:
  /* cache info (L2 cache) */
 +if (env->fwd_cpuid_cache_leaves) {
 +host_cpuid(0x8006, 0, eax, ebx, ecx, edx);
 +break;
 +}
  *eax = 0;
  *ebx = 0x42004200;
  *ecx = 0x02008140;

This hunk may trivially conflict with Eduardo's cache flags cleanup.

 diff --git a/target-i386/cpu.h b/target-i386/cpu.h
 index 8a3d0fd..1ec32fa 100644
 --- a/target-i386/cpu.h
 +++ b/target-i386/cpu.h
 @@ -865,6 +865,7 @@ typedef struct CPUX86State {
  bool tsc_valid;
  int tsc_khz;
  void *kvm_xsave_buf;
 +bool fwd_cpuid_cache_leaves;
  
  /* in order to simplify APIC support, we leave this pointer to the
 user */

Please place the field in X86CPU instead and document it.

Otherwise the patch looks okay to me at a brief glance; but since this is
about -cpu host I would prefer this to go through uq/master once fixed,
or at least to get some acks.
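
For reference, a simplified sketch of the host_cpuid() helper these hunks rely
on (the real QEMU version carries extra #ifdefs for 32-bit and PIC builds):

	static void host_cpuid(uint32_t function, uint32_t count, uint32_t *eax,
	                       uint32_t *ebx, uint32_t *ecx, uint32_t *edx)
	{
	    uint32_t vec[4];

	    /* query the host CPU directly; count selects the sub-leaf in ecx */
	    asm volatile("cpuid"
	                 : "=a"(vec[0]), "=b"(vec[1]),
	                   "=c"(vec[2]), "=d"(vec[3])
	                 : "0"(function), "c"(count) : "cc");

	    if (eax) {
	        *eax = vec[0];
	    }
	    if (ebx) {
	        *ebx = vec[1];
	    }
	    if (ecx) {
	        *ecx = vec[2];
	    }
	    if (edx) {
	        *edx = vec[3];
	    }
	}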

Regards,
Andreas

-- 
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg


Re: [Qemu-devel] [PATCH] i386: forward CPUID cache leaves when -cpu host is used

2013-09-02 Thread Eduardo Habkost
On Mon, Sep 02, 2013 at 02:55:36PM +0200, Andreas Färber wrote:
[...]
  diff --git a/target-i386/cpu.h b/target-i386/cpu.h
  index 8a3d0fd..1ec32fa 100644
  --- a/target-i386/cpu.h
  +++ b/target-i386/cpu.h
  @@ -865,6 +865,7 @@ typedef struct CPUX86State {
   bool tsc_valid;
   int tsc_khz;
   void *kvm_xsave_buf;
  +bool fwd_cpuid_cache_leaves;
   
   /* in order to simplify APIC support, we leave this pointer to the
  user */
 
 Please place the field in X86CPU instead and document it.

While moving it, I believe the name can be made clearer. I would name it
fwd_host_cache_info or something like that, to make it clear that it
will expose the _host CPU_ cache information to the guest.

-- 
Eduardo


Re: [PATCH v3 3/6] KVM: nVMX: Load nEPT state after EFER

2013-09-02 Thread Gleb Natapov
On Thu, Aug 08, 2013 at 04:26:30PM +0200, Jan Kiszka wrote:
 We need to update EFER.NX before building the nEPT state via
 nested_ept_init_mmu_context. Otherwise, we risk creating an MMU context
 that claims to have NX disabled while the guest EPT used NX. This will
 cause spurious faults for L2.
 
Hmm, I do not see how the nested EPT MMU depends on the guest's EFER.NX
setting. It just sets mmu->nx to true.

 Signed-off-by: Jan Kiszka jan.kis...@siemens.com
 ---
  arch/x86/kvm/vmx.c |   10 +-
  1 files changed, 5 insertions(+), 5 deletions(-)
 
 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
 index 6c42518..363fe19 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -7727,11 +7727,6 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, 
 struct vmcs12 *vmcs12)
   vmx_flush_tlb(vcpu);
   }
  
 - if (nested_cpu_has_ept(vmcs12)) {
 - kvm_mmu_unload(vcpu);
 - nested_ept_init_mmu_context(vcpu);
 - }
 -
   if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_EFER)
   vcpu->arch.efer = vmcs12->guest_ia32_efer;
   else if (vmcs12->vm_entry_controls & VM_ENTRY_IA32E_MODE)
 @@ -7741,6 +7736,11 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, 
 struct vmcs12 *vmcs12)
   /* Note: modifies VM_ENTRY/EXIT_CONTROLS and GUEST/HOST_IA32_EFER */
   vmx_set_efer(vcpu, vcpu->arch.efer);
  
 + if (nested_cpu_has_ept(vmcs12)) {
 + kvm_mmu_unload(vcpu);
 + nested_ept_init_mmu_context(vcpu);
 + }
 +
   /*
* This sets GUEST_CR0 to vmcs12->guest_cr0, with possibly a modified
* TS bit (for lazy fpu) and bits which we consider mandatory enabled.
 -- 
 1.7.3.4

--
Gleb.


[PATCH V3] forward cpuid leaves when using -cpu host

2013-09-02 Thread Benoît Canet
This patch directly uses host_cpuid to forward the information instead of
storing a variable number of leaves in the CPU state.

v3:
   s/i386/target-i386/ [Andreas]
   move forward field to X86CPU structure and document it [Andreas]
   rename forward field [Eduardo]
   Rebase on top of Eduardo's cache flags cleanup [Andreas]

v2:
   use index as argument to host_cpuid

Benoît Canet (1):
  target-i386: forward CPUID cache leaves when -cpu host is used

 target-i386/cpu-qom.h |3 +++
 target-i386/cpu.c |   19 +++
 2 files changed, 22 insertions(+)

-- 
1.7.10.4



[PATCH V3] target-i386: forward CPUID cache leaves when -cpu host is used

2013-09-02 Thread Benoît Canet
Some users running CPU-intensive tasks that check the cache CPUID leaves at
startup and make decisions based on the result reported that the guest was
not reflecting the host cache CPUID leaves when -cpu host is used.

This patch fixes this.

Signed-off-by: Benoit Canet ben...@irqsave.net
---
 target-i386/cpu-qom.h |3 +++
 target-i386/cpu.c |   19 +++
 2 files changed, 22 insertions(+)

diff --git a/target-i386/cpu-qom.h b/target-i386/cpu-qom.h
index c4447c2..b1d1bd8 100644
--- a/target-i386/cpu-qom.h
+++ b/target-i386/cpu-qom.h
@@ -70,6 +70,9 @@ typedef struct X86CPU {
 bool hyperv_relaxed_timing;
 int hyperv_spinlock_attempts;
 
+/* If true, the CPUID code directly forwards host cache leaves to the guest */
+bool fwd_host_cache_info;
+
 /* Features that were filtered out because of missing host capabilities */
 uint32_t filtered_features[FEATURE_WORDS];
 
diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index c36345e..f0df4db 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -486,6 +486,7 @@ typedef struct x86_def_t {
 int stepping;
 FeatureWordArray features;
 char model_id[48];
+bool fwd_host_cache_info;
 } x86_def_t;
 
 #define I486_FEATURES (CPUID_FP87 | CPUID_VME | CPUID_PSE)
@@ -1139,6 +1140,7 @@ static void kvm_cpu_fill_host(x86_def_t *x86_cpu_def)
 assert(kvm_enabled());
 
 x86_cpu_def->name = "host";
+x86_cpu_def->fwd_host_cache_info = true;
 host_cpuid(0x0, 0, &eax, &ebx, &ecx, &edx);
 x86_cpu_vendor_words2str(x86_cpu_def->vendor, ebx, edx, ecx);
 
@@ -1888,6 +1890,7 @@ static void cpu_x86_register(X86CPU *cpu, const char *name, Error **errp)
 env->features[FEAT_C000_0001_EDX] = def->features[FEAT_C000_0001_EDX];
 env->features[FEAT_7_0_EBX] = def->features[FEAT_7_0_EBX];
 env->cpuid_xlevel2 = def->xlevel2;
+cpu->fwd_host_cache_info = def->fwd_host_cache_info;
 
 object_property_set_str(OBJECT(cpu), def->model_id, "model-id", errp);
 }
@@ -2062,6 +2065,10 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, 
uint32_t count,
 break;
 case 2:
 /* cache info: needed for Pentium Pro compatibility */
+if (cpu->fwd_host_cache_info) {
+host_cpuid(index, 0, eax, ebx, ecx, edx);
+break;
+}
 *eax = 1; /* Number of CPUID[EAX=2] calls required */
 *ebx = 0;
 *ecx = 0;
@@ -2071,6 +2078,10 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, 
uint32_t count,
 break;
 case 4:
 /* cache info: needed for Core compatibility */
+if (cpu->fwd_host_cache_info) {
+host_cpuid(index, count, eax, ebx, ecx, edx);
+break;
+}
 if (cs->nr_cores > 1) {
 *eax = (cs->nr_cores - 1) << 26;
 } else {
@@ -2228,6 +2239,10 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, 
uint32_t count,
 break;
 case 0x8005:
 /* cache info (L1 cache) */
+if (cpu->fwd_host_cache_info) {
+host_cpuid(index, 0, eax, ebx, ecx, edx);
+break;
+}
 *eax = (L1_DTLB_2M_ASSOC << 24) | (L1_DTLB_2M_ENTRIES << 16) | \
        (L1_ITLB_2M_ASSOC <<  8) | (L1_ITLB_2M_ENTRIES);
 *ebx = (L1_DTLB_4K_ASSOC << 24) | (L1_DTLB_4K_ENTRIES << 16) | \
        (L1_ITLB_4K_ASSOC <<  8) | (L1_ITLB_4K_ENTRIES);
uint32_t count,
 break;
 case 0x8006:
 /* cache info (L2 cache) */
+if (cpu->fwd_host_cache_info) {
+host_cpuid(index, 0, eax, ebx, ecx, edx);
+break;
+}
 *eax = (AMD_ENC_ASSOC(L2_DTLB_2M_ASSOC) << 28) | \
        (L2_DTLB_2M_ENTRIES << 16) | \
        (AMD_ENC_ASSOC(L2_ITLB_2M_ASSOC) << 12) | \
-- 
1.7.10.4



Re: [PATCH] KVM: mmu: allow page tables to be in read-only slots

2013-09-02 Thread Paolo Bonzini
On 02/09/2013 12:07, Gleb Natapov wrote:
 On Mon, Sep 02, 2013 at 06:00:39PM +0800, Xiao Guangrong wrote:
 On 09/02/2013 05:25 PM, Gleb Natapov wrote:
 On Mon, Sep 02, 2013 at 05:20:15PM +0800, Xiao Guangrong wrote:
 On 08/30/2013 08:41 PM, Paolo Bonzini wrote:
 Page tables in a read-only memory slot will currently cause a triple
 fault because the page walker uses gfn_to_hva and it fails on such a slot.

 OVMF uses such a page table; however, real hardware seems to be fine with
 that as long as the accessed/dirty bits are set.  Save whether the slot
 is readonly, and later check it when updating the accessed and dirty bits.

 Paolo, do you know why OVMF is using readonly memory like this?

 Just a guess, but perhaps they want to move to paging mode as early as
 possible even before memory controller is fully initialized.

 AFAIK, the fault triggered by this kind of access can hardly be fixed by
 userspace since the fault is triggered by pagetable walking, not by the
 current instruction. Do you have any idea to let userspace emulate it properly?
 Not sure what userspace you mean here, but there shouldn't be a fault in the

 I just wonder how to fix this kind of fault. The current patch returns
 -EACCES but that will crash the guest. I think we'd better let userspace fix
 this error (let userspace set the D/A bits).

 Ugh, this is not good. Missed that. Don't know what real HW will do
 here, but the easy thing for us to do would be to just return success.

Real hardware would just do a memory write.  What happens depends on
what is on the bus, i.e. on what the ROM is used for.

QEMU uses read-only slots for two things: actual read-only memory where
writes go to the bitbucket, and ROMD memory where writes are treated
as MMIO.

So, in the first case we would ignore the write.  In the second we would
do an MMIO exit to userspace.  But ignoring the write isn't always
correct, and doing an MMIO exit is complicated, so I would just kill the
guest.
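
As a sketch of how the two flavors are set up on the QEMU side (assuming the
MemoryRegion API of this period; region names and ops are illustrative):

	/* read-only RAM: guest writes are silently dropped */
	memory_region_init_ram(&rom, owner, "pc.rom", rom_size);
	memory_region_set_readonly(&rom, true);

	/* ROMD: reads are served from RAM, writes go to the ops callbacks */
	memory_region_init_rom_device(&flash, owner, &pflash_ops, opaque,
	                              "pflash", flash_size);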

EPT will probably write to the read-only slots without thinking much
about it.

My patch injects a page fault, which is very likely to escalate to a
triple fault.  This is probably never what you want---on the other hand,
I wasn't sure what level of accuracy we want in this case, given that
EPT does it wrong too.

Paolo

 first place if ROM page tables have access/dirty bit set and they do.

 Yes, so we cannot call x86_emulate_instruction() to fix this fault (that
 function emulates the access in the first place). Should we directly return
 an MMIO exit to userspace when we hit this fault? And what happens if this
 fault on pagetable walking is triggered inside x86_emulate_instruction()?
 I think we should not return an MMIO exit to userspace. Either ignore the
 write attempt or kill the guest.
 
 --
   Gleb.
 



Re: [PATCH] KVM: mmu: allow page tables to be in read-only slots

2013-09-02 Thread Paolo Bonzini
On 02/09/2013 12:11, Gleb Natapov wrote:
  
  Got it, thanks for your explanation.
  
  BTW, if you and Paolo are busy on other things, I am happy to fix these
  issues. :)
 I am busy with reviews mostly :). If you are not too busy with lockless
 write protection then fine with me. Let's wait for Paolo's input on the
 proposed API though.

No problem, I'm happy to do this kind of exercise.  I like the API you
proposed.  I can do this change on top of the API change.  However,
since this bug is in the wild it may be applicable to stable@ as well.
If you agree, I will also post v2 of this patch for stable kernels as
soon as we agree on the desired faulting behavior.

Paolo


Re: [PATCH] KVM: mmu: allow page tables to be in read-only slots

2013-09-02 Thread Paolo Bonzini
On 02/09/2013 11:25, Gleb Natapov wrote:
 On Mon, Sep 02, 2013 at 05:20:15PM +0800, Xiao Guangrong wrote:
 On 08/30/2013 08:41 PM, Paolo Bonzini wrote:
 Page tables in a read-only memory slot will currently cause a triple
 fault because the page walker uses gfn_to_hva and it fails on such a slot.

 OVMF uses such a page table; however, real hardware seems to be fine with
 that as long as the accessed/dirty bits are set.  Save whether the slot
 is readonly, and later check it when updating the accessed and dirty bits.

 Paolo, do you know why OVMF is using readonly memory like this?

 Just a guess, but perhaps they want to move to paging mode as early as
 possible even before memory controller is fully initialized.

More precisely they want to move to 64-bit mode as early as possible,
and that requires paging.

 AFAIK, the fault triggered by this kind of access can hardly be fixed by
 userspace since the fault is triggered by pagetable walking, not by the current
 instruction. Do you have any idea to let userspace emulate it properly?

 Not sure what userspace you mean here, but there shouldn't be a fault in the
 first place if ROM page tables have access/dirty bit set and they do.

Yep.  Actually they do not set dirty on the PD/PDP/PML4, only on the
bottom level (which is also fine).

Paolo



Re: [PATCH] KVM: mmu: allow page tables to be in read-only slots

2013-09-02 Thread Gleb Natapov
On Mon, Sep 02, 2013 at 06:00:08PM +0200, Paolo Bonzini wrote:
 On 02/09/2013 11:25, Gleb Natapov wrote:
  On Mon, Sep 02, 2013 at 05:20:15PM +0800, Xiao Guangrong wrote:
  On 08/30/2013 08:41 PM, Paolo Bonzini wrote:
  Page tables in a read-only memory slot will currently cause a triple
  fault because the page walker uses gfn_to_hva and it fails on such a slot.
 
  OVMF uses such a page table; however, real hardware seems to be fine with
  that as long as the accessed/dirty bits are set.  Save whether the slot
  is readonly, and later check it when updating the accessed and dirty bits.
 
  Paolo, do you know why OVMF is using readonly memory like this?
 
  Just a guess, but perhaps they want to move to paging mode as early as
  possible even before memory controller is fully initialized.
 
 More precisely they want to move to 64-bit mode as early as possible,
 and that requires paging.
 
  AFAIK, the fault triggered by this kind of access can hardly be fixed by
  userspace since the fault is triggered by pagetable walking, not by the
  current instruction. Do you have any idea to let userspace emulate it
  properly?
 
  Not sure what userspace you mean here, but there shouldn't be a fault in the
  first place if ROM page tables have access/dirty bit set and they do.
 
 Yep.  Actually they do not set dirty on the PD/PDP/PML4, only on the
 bottom level (which is also fine).
 
They do it for a good reason: the dirty bit exists only on the bottom level :)

--
Gleb.


Re: [PATCH V3] target-i386: forward CPUID cache leaves when -cpu host is used

2013-09-02 Thread Eduardo Habkost
On Mon, Sep 02, 2013 at 05:06:37PM +0200, Benoît Canet wrote:
 Some users running CPU-intensive tasks that check the cache CPUID leaves at
 startup and make decisions based on the result reported that the guest was
 not reflecting the host cache CPUID leaves when -cpu host is used.
 
 This patch fixes this.
 
 Signed-off-by: Benoit Canet ben...@irqsave.net

Reviewed-by: Eduardo Habkost ehabk...@redhat.com

-- 
Eduardo


Re: [PATCH] KVM: mmu: allow page tables to be in read-only slots

2013-09-02 Thread Gleb Natapov
On Mon, Sep 02, 2013 at 05:56:07PM +0200, Paolo Bonzini wrote:
 On 02/09/2013 12:07, Gleb Natapov wrote:
  On Mon, Sep 02, 2013 at 06:00:39PM +0800, Xiao Guangrong wrote:
  On 09/02/2013 05:25 PM, Gleb Natapov wrote:
  On Mon, Sep 02, 2013 at 05:20:15PM +0800, Xiao Guangrong wrote:
  On 08/30/2013 08:41 PM, Paolo Bonzini wrote:
  Page tables in a read-only memory slot will currently cause a triple
  fault because the page walker uses gfn_to_hva and it fails on such a 
  slot.
 
  OVMF uses such a page table; however, real hardware seems to be fine 
  with
  that as long as the accessed/dirty bits are set.  Save whether the slot
  is readonly, and later check it when updating the accessed and dirty 
  bits.
 
  Paolo, do you know why OVMF is using readonly memory like this?
 
  Just a guess, but perhaps they want to move to paging mode as early as
  possible even before memory controller is fully initialized.
 
  AFAIK, the fault triggered by this kind of access can hardly be fixed by
  userspace since the fault is triggered by pagetable walking, not by the
  current instruction. Do you have any idea to let userspace emulate it
  properly?
  Not sure what userspace you mean here, but there shouldn't be a fault in 
  the
 
  I just wonder how to fix this kind of fault. The current patch returns
  -EACCES but that will crash the guest. I think we'd better let userspace
  fix this error (let userspace set the D/A bits).
 
  Ugh, this is not good. Missed that. Don't know what real HW will do
  here, but the easy thing for us to do would be to just return success.
 
 Real hardware would just do a memory write.  What happens depends on
 what is on the bus, i.e. on what the ROM is used for.
 
 QEMU uses read-only slots for two things: actual read-only memory where
 writes go to the bitbucket, and ROMD memory where writes are treated
 as MMIO.
 
 So, in the first case we would ignore the write.  In the second we would
 do an MMIO exit to userspace.  But ignoring the write isn't always
 correct, and doing an MMIO exit is complicated, so I would just kill the
 guest.
 
 EPT will probably write to the read-only slots without thinking much
 about it.
 
 My patch injects a page fault, which is very likely to escalate to a
 triple fault.  This is probably never what you want---on the other hand,
 I wasn't sure what level of accuracy we want in this case, given that
 EPT does it wrong too.
 
Just ignoring a write into an RO slot is OK. We do not handle page tables in
MMIO memory, so handling page tables in ROMD memory shouldn't be a
priority either.

--
Gleb.


Re: [PATCH] KVM: mmu: allow page tables to be in read-only slots

2013-09-02 Thread Gleb Natapov
On Mon, Sep 02, 2013 at 05:58:41PM +0200, Paolo Bonzini wrote:
 On 02/09/2013 12:11, Gleb Natapov wrote:
   
   Got it, thanks for your explanation.
   
   BTW, if you and Paolo are busy on other things, I am happy to fix these
   issues. :)
  I am busy with reviews mostly :). If you are not too busy with lockless
  write protection then fine with me. Let's wait for Paolo's input on the
  proposed API though.
 
 No problem, I'm happy to do this kind of exercise.  I like the API you
 proposed.  I can do this change on top of the API change.  However,
 since this bug is in the wild it may be applicable to stable@ as well.
  If you agree, I will also post v2 of this patch for stable kernels as
 soon as we agree on the desired faulting behavior.
 
I am OK with having minimal patch for backporting to stable and doing
API changes + fixes for other call sites on top.

--
Gleb.


Re: [Qemu-devel] [PATCH V3] target-i386: forward CPUID cache leaves when -cpu host is used

2013-09-02 Thread Benoît Canet
  
  Signed-off-by: Benoit Canet ben...@irqsave.net
 
 Reviewed-by: Eduardo Habkost ehabk...@redhat.com

Thanks.

Do you have an idea on how QEMU could reflect the real host clock frequency
to the guest when the host CPU frequency scaling governor kicks in?
Giving a false value to cloud customers is mildly annoying.

Best regards

Benoît


Re: [PATCH v3 3/6] KVM: nVMX: Load nEPT state after EFER

2013-09-02 Thread Jan Kiszka
On 2013-09-02 15:16, Gleb Natapov wrote:
 On Thu, Aug 08, 2013 at 04:26:30PM +0200, Jan Kiszka wrote:
 We need to update EFER.NX before building the nEPT state via
 nested_ept_init_mmu_context. Otherwise, we risk creating an MMU context
 that claims to have NX disabled while the guest EPT used NX. This will
 cause spurious faults for L2.

 Hmm, I do not see how the nested EPT MMU depends on the guest's EFER.NX
 setting. It just sets mmu->nx to true.

Don't ask me for the details behind this, but update_permission_bitmask,
called by kvm_init_shadow_ept_mmu, is using it, for example. And the
before/after effect was clearly visible when L2 and L1 were using
different NX settings. Maybe Arthur can write a test for this.

Jan

 
 Signed-off-by: Jan Kiszka jan.kis...@siemens.com
 ---
  arch/x86/kvm/vmx.c |   10 +-
  1 files changed, 5 insertions(+), 5 deletions(-)

 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
 index 6c42518..363fe19 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -7727,11 +7727,6 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, 
 struct vmcs12 *vmcs12)
  vmx_flush_tlb(vcpu);
  }
  
 -if (nested_cpu_has_ept(vmcs12)) {
 -kvm_mmu_unload(vcpu);
 -nested_ept_init_mmu_context(vcpu);
 -}
 -
  if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_EFER)
  vcpu->arch.efer = vmcs12->guest_ia32_efer;
  else if (vmcs12->vm_entry_controls & VM_ENTRY_IA32E_MODE)
 @@ -7741,6 +7736,11 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, 
 struct vmcs12 *vmcs12)
  /* Note: modifies VM_ENTRY/EXIT_CONTROLS and GUEST/HOST_IA32_EFER */
  vmx_set_efer(vcpu, vcpu->arch.efer);
  
 +if (nested_cpu_has_ept(vmcs12)) {
 +kvm_mmu_unload(vcpu);
 +nested_ept_init_mmu_context(vcpu);
 +}
 +
  /*
   * This sets GUEST_CR0 to vmcs12->guest_cr0, with possibly a modified
   * TS bit (for lazy fpu) and bits which we consider mandatory enabled.
 -- 
 1.7.3.4
 
 --
   Gleb.
 

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux


Re: [PATCH v3 3/6] KVM: nVMX: Load nEPT state after EFER

2013-09-02 Thread Gleb Natapov
On Mon, Sep 02, 2013 at 07:58:30PM +0200, Jan Kiszka wrote:
 On 2013-09-02 15:16, Gleb Natapov wrote:
  On Thu, Aug 08, 2013 at 04:26:30PM +0200, Jan Kiszka wrote:
  We need to update EFER.NX before building the nEPT state via
  nested_ept_init_mmu_context. Otherwise, we risk creating an MMU context
  that claims to have NX disabled while the guest EPT used NX. This will
  cause spurious faults for L2.
 
  Hmm, I do not see how the nested EPT MMU depends on the guest's EFER.NX
  setting. It just sets mmu->nx to true.
 
 Don't ask me for the details behind this, but update_permission_bitmask,
 called by kvm_init_shadow_ept_mmu, is using it, for example. And the
It uses it only in the !ept case though, and never looks at the guest setting
as far as I can tell. Is it possible that this was an artifact of older
nEPT code and the latest one does not need this patch?

 before-after effect was clearly visible when L2 and L1 were using
 different NX settings. Maybe Arthur can write a test for this.
 
 Jan
 
  
  Signed-off-by: Jan Kiszka jan.kis...@siemens.com
  ---
   arch/x86/kvm/vmx.c |   10 +-
   1 files changed, 5 insertions(+), 5 deletions(-)
 
  diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
  index 6c42518..363fe19 100644
  --- a/arch/x86/kvm/vmx.c
  +++ b/arch/x86/kvm/vmx.c
  @@ -7727,11 +7727,6 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, 
  struct vmcs12 *vmcs12)
 vmx_flush_tlb(vcpu);
 }
   
  -  if (nested_cpu_has_ept(vmcs12)) {
  -  kvm_mmu_unload(vcpu);
  -  nested_ept_init_mmu_context(vcpu);
  -  }
  -
 if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_EFER)
 vcpu->arch.efer = vmcs12->guest_ia32_efer;
 else if (vmcs12->vm_entry_controls & VM_ENTRY_IA32E_MODE)
  @@ -7741,6 +7736,11 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, 
  struct vmcs12 *vmcs12)
 /* Note: modifies VM_ENTRY/EXIT_CONTROLS and GUEST/HOST_IA32_EFER */
 vmx_set_efer(vcpu, vcpu->arch.efer);
   
  +  if (nested_cpu_has_ept(vmcs12)) {
  +  kvm_mmu_unload(vcpu);
  +  nested_ept_init_mmu_context(vcpu);
  +  }
  +
 /*
  * This sets GUEST_CR0 to vmcs12->guest_cr0, with possibly a modified
  * TS bit (for lazy fpu) and bits which we consider mandatory enabled.
  -- 
  1.7.3.4
  
  --
  Gleb.
  
 
 -- 
 Siemens AG, Corporate Technology, CT RTC ITP SES-DE
 Corporate Competence Center Embedded Linux

--
Gleb.


Re: [PATCH v3 3/6] KVM: nVMX: Load nEPT state after EFER

2013-09-02 Thread Jan Kiszka
On 2013-09-02 20:09, Gleb Natapov wrote:
 On Mon, Sep 02, 2013 at 07:58:30PM +0200, Jan Kiszka wrote:
 On 2013-09-02 15:16, Gleb Natapov wrote:
 On Thu, Aug 08, 2013 at 04:26:30PM +0200, Jan Kiszka wrote:
 We need to update EFER.NX before building the nEPT state via
 nested_ept_init_mmu_context. Otherwise, we risk creating an MMU context
 that claims to have NX disabled while the guest EPT used NX. This will
 cause spurious faults for L2.

 Hmm, I do not see how the nested EPT MMU depends on the guest's EFER.NX
 setting. It just sets mmu->nx to true.

 Don't ask me for the details behind this, but update_permission_bitmask,
 called by kvm_init_shadow_ept_mmu, is using it, for example. And the
 It uses it only in the !ept case though, and never looks at the guest setting
 as far as I can tell. Is it possible that this was an artifact of older
 nEPT code and the latest one does not need this patch?

Hmm, possibly. Let me recheck, hope I can find the reproduction pattern
again...

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux


Re: [PATCH v3 3/6] KVM: nVMX: Load nEPT state after EFER

2013-09-02 Thread Jan Kiszka
On 2013-09-02 20:20, Jan Kiszka wrote:
 On 2013-09-02 20:09, Gleb Natapov wrote:
 On Mon, Sep 02, 2013 at 07:58:30PM +0200, Jan Kiszka wrote:
 On 2013-09-02 15:16, Gleb Natapov wrote:
 On Thu, Aug 08, 2013 at 04:26:30PM +0200, Jan Kiszka wrote:
 We need to update EFER.NX before building the nEPT state via
 nested_ept_init_mmu_context. Otherwise, we risk creating an MMU context
 that claims to have NX disabled while the guest EPT used NX. This will
 cause spurious faults for L2.

 Hmm, I do not see how the nested EPT MMU depends on the guest's EFER.NX
 setting. It just sets mmu->nx to true.

 Don't ask me for the details behind this, but update_permission_bitmask,
 called by kvm_init_shadow_ept_mmu, is using it, for example. And the
 It uses it only in the !ept case though, and never looks at the guest setting
 as far as I can tell. Is it possible that this was an artifact of older
 nEPT code and the latest one does not need this patch?
 
 Hmm, possibly. Let me recheck, hope I can find the reproduction pattern
 again...

Yeah, drop it. Things work fine without it, and I just found an old
version of nEPT where kvm_init_shadow_EPT_mmu did this:

context->nx = is_nx(vcpu); /* TODO: ? */

Obviously, this patch was a workaround for that issue.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux


[PATCH v2] KVM: PPC: Book3S HV: Reserve POWER8 space in get/set_one_reg

2013-09-02 Thread Michael Neuling
This reserves space in the get/set_one_reg ioctls for the extra guest state
needed for POWER8.  It doesn't implement these at all, it just reserves
them so that the ABI is defined now.

A few things to note here:

- This adds *a lot* of state for transactional memory.  Because of TM
  suspend mode, this is unavoidable: you can't simply roll back all
  transactions and store only the checkpointed state.  I've added this all
  to get/set_one_reg (including GPRs) rather than creating a new ioctl
  which returns a struct kvm_regs like KVM_GET_REGS does.  This means
  if we need to extract the TM state, we are going to need a bucket load
  of IOCTLs (see the sketch after these notes).  Hopefully most of the
  time this will not be needed as we can look at the MSR to see if TM is
  active and only grab them when needed.  If this becomes a bottleneck in
  future we can add another ioctl to grab all this state in one go.

- The TM state is offset by 0x8000.

- For TM, I've done away with VMX and FP and created a single 64x128 bit
  VSX register space.

- I've left a space of 1 (at 0x9c) since Paulus needs to add a value
  which applies to POWER7 as well.
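
As an illustration of what that means for userspace, each register is fetched
individually through the existing one_reg interface; a sketch (error handling
omitted, vcpu_fd is an open vcpu file descriptor):

	#include <linux/kvm.h>
	#include <sys/ioctl.h>

	__u64 val;
	struct kvm_one_reg reg = {
		.id   = KVM_REG_PPC_TAR,		/* one of the new IDs */
		.addr = (__u64)(unsigned long)&val,	/* where KVM stores it */
	};

	ioctl(vcpu_fd, KVM_GET_ONE_REG, &reg);	/* val now holds the TAR */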

Signed-off-by: Michael Neuling mi...@neuling.org

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index ef925ea..341058c 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1810,6 +1810,45 @@ registers, find a list below:
   PPC   | KVM_REG_PPC_TLB3PS   | 32
   PPC   | KVM_REG_PPC_EPTCFG   | 32
   PPC   | KVM_REG_PPC_ICP_STATE | 64
+  PPC   | KVM_REG_PPC_SPMC1| 32
+  PPC   | KVM_REG_PPC_SPMC2| 32
+  PPC   | KVM_REG_PPC_IAMR | 64
+  PPC   | KVM_REG_PPC_TFHAR| 64
+  PPC   | KVM_REG_PPC_TFIAR| 64
+  PPC   | KVM_REG_PPC_TEXASR   | 64
+  PPC   | KVM_REG_PPC_FSCR | 64
+  PPC   | KVM_REG_PPC_PSPB | 32
+  PPC   | KVM_REG_PPC_EBBHR| 64
+  PPC   | KVM_REG_PPC_EBBRR| 64
+  PPC   | KVM_REG_PPC_BESCR| 64
+  PPC   | KVM_REG_PPC_TAR  | 64
+  PPC   | KVM_REG_PPC_DPDES| 64
+  PPC   | KVM_REG_PPC_DAWR | 64
+  PPC   | KVM_REG_PPC_DAWRX| 64
+  PPC   | KVM_REG_PPC_CIABR| 64
+  PPC   | KVM_REG_PPC_IC   | 64
+  PPC   | KVM_REG_PPC_VTB  | 64
+  PPC   | KVM_REG_PPC_CSIGR| 64
+  PPC   | KVM_REG_PPC_TACR | 64
+  PPC   | KVM_REG_PPC_TCSCR| 64
+  PPC   | KVM_REG_PPC_PID  | 64
+  PPC   | KVM_REG_PPC_ACOP | 64
+  PPC   | KVM_REG_PPC_TM_GPR0  | 64
+  ...
+  PPC   | KVM_REG_PPC_TM_GPR31 | 64
+  PPC   | KVM_REG_PPC_TM_VSR0  | 128
+  ...
+  PPC   | KVM_REG_PPC_TM_VSR63 | 128
+  PPC   | KVM_REG_PPC_TM_CR| 64
+  PPC   | KVM_REG_PPC_TM_LR| 64
+  PPC   | KVM_REG_PPC_TM_CTR   | 64
+  PPC   | KVM_REG_PPC_TM_FPSCR | 64
+  PPC   | KVM_REG_PPC_TM_AMR   | 64
+  PPC   | KVM_REG_PPC_TM_PPR   | 64
+  PPC   | KVM_REG_PPC_TM_VRSAVE| 64
+  PPC   | KVM_REG_PPC_TM_VSCR  | 32
+  PPC   | KVM_REG_PPC_TM_DSCR  | 64
+  PPC   | KVM_REG_PPC_TM_TAR   | 64
 
 ARM registers are mapped using the lower 32 bits.  The upper 16 of that
 is the register group type, or coprocessor number:
diff --git a/arch/powerpc/include/uapi/asm/kvm.h 
b/arch/powerpc/include/uapi/asm/kvm.h
index 0fb1a6e..8e687a1 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -429,6 +429,11 @@ struct kvm_get_htab_header {
 #define KVM_REG_PPC_MMCR0  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x10)
 #define KVM_REG_PPC_MMCR1  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x11)
 #define KVM_REG_PPC_MMCRA  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x12)
+#define KVM_REG_PPC_MMCR2  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x13)
+#define KVM_REG_PPC_MMCRS  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x14)
+#define KVM_REG_PPC_SIAR   (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x15)
+#define KVM_REG_PPC_SDAR   (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x16)
+#define KVM_REG_PPC_SIER   (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x17)
 
 #define KVM_REG_PPC_PMC1   (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x18)
 #define KVM_REG_PPC_PMC2   (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x19)
@@ -499,6 +504,55 @@ struct kvm_get_htab_header {
 #define KVM_REG_PPC_TLB3PS (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x9a)
 #define KVM_REG_PPC_EPTCFG (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x9b)
 
+/* POWER8 registers */
+#define KVM_REG_PPC_SPMC1  (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x9d)
+#define KVM_REG_PPC_SPMC2  (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x9e)
+#define KVM_REG_PPC_IAMR   (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x9f)
+#define KVM_REG_PPC_TFHAR  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xa0)
+#define KVM_REG_PPC_TFIAR  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xa1)
+#define KVM_REG_PPC_TEXASR (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xa2)
+#define KVM_REG_PPC_FSCR   (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xa3)
+#define KVM_REG_PPC_PSPB   (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0xa4)
+#define KVM_REG_PPC_EBBHR  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xa5)
+#define KVM_REG_PPC_EBBRR  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xa6)
+#define 

Re: Is fallback vhost_net to qemu for live migrate available?

2013-09-02 Thread Qin Chuanyu

On 2013/9/2 15:57, Wei Liu wrote:

On Sat, Aug 31, 2013 at 12:45:11PM +0800, Qin Chuanyu wrote:

On 2013/8/30 0:08, Anthony Liguori wrote:

Hi Qin,



By changing the memory copy and notify mechanism, currently virtio-net with
vhost_net can run on Xen with good performance.


I think the key in doing this would be to implement a proper
ioeventfd and irqfd interface in the driver domain kernel.  Just
hacking vhost_net with Xen-specific knowledge would be pretty nasty
IMHO.


Yes, I added a kernel module which persists the virtio-net pio_addr and
msix address, as the kvm module does. The guest wakes up the vhost thread
via a hook function added in evtchn_interrupt.


Did you modify the front end driver to do grant table mapping or is
this all being done by mapping the domain's memory?


Nothing is changed in the frontend driver. Currently I use
alloc_vm_area to get address space, and map the domain's memory as
qemu does.



You mean you're using xc_map_foreign_range and friends in the backend to
map guest memory? That's not very desirable as it violates Xen's
security model. It would not be too hard to pass grant references
instead of guest physical memory address IMHO.


In fact, I did what virtio-net has done in QEMU. I think security
is a moot point here because Dom0 is under control.

The host can access guest memory in KVM much more easily than in Xen,
but I haven't heard anyone say KVM is insecure.

Regards
Qin chuanyu




Re: KVM: x86: update masterclock when kvmclock_offset is calculated

2013-09-02 Thread Marcelo Tosatti
On Wed, Aug 28, 2013 at 02:37:20PM +0200, Paolo Bonzini wrote:
 On 28/08/2013 04:52, Marcelo Tosatti wrote:
  On Thu, Aug 22, 2013 at 07:05:20PM +0200, Paolo Bonzini wrote:
  On 20/08/2013 20:20, Marcelo Tosatti wrote:
 
  The offset to add to the host's monotonic time, kvmclock_offset, is
  calculated against the monotonic time at KVM_SET_CLOCK ioctl time.
 
  Request a master clock update at this time, to reduce a potentially
  unbounded difference between the values of the masterclock and
  the clock value used to calculate kvmclock_offset.
 
  Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
 
  Index: linux-2.6-kvmclock-fixes/arch/x86/kvm/x86.c
  ===
  --- linux-2.6-kvmclock-fixes.orig/arch/x86/kvm/x86.c
  +++ linux-2.6-kvmclock-fixes/arch/x86/kvm/x86.c
  @@ -3806,6 +3806,7 @@ long kvm_arch_vm_ioctl(struct file *filp
delta = user_ns.clock - now_ns;
local_irq_enable();
kvm->arch.kvmclock_offset = delta;
  + kvm_gen_update_masterclock(kvm);
break;
}
case KVM_GET_CLOCK: {
 
  Reviewed-by: Paolo Bonzini pbonz...@redhat.com
 
  While reviewing this patch, which BTW looks good, I noticed the handling
  of KVM_REQ_MCLOCK_INPROGRESS, the dummy request that is never processed
  and is only used to block guest entry.
 
  It seems to me that this bit is not necessary.  After
  KVM_REQ_CLOCK_UPDATE is issued, no guest entries will happen because
  kvm_guest_time_update will try to take the pvclock_gtod_sync_lock,
  currently taken by kvm_gen_update_masterclock.
  
  Not entirely clear, to cancel guest entry the bit is necessary:
  
if (vcpu->mode == EXITING_GUEST_MODE || vcpu->requests
|| need_resched() || signal_pending(current)) {
vcpu->mode = OUTSIDE_GUEST_MODE;
  smp_wmb();
  local_irq_enable();
  preempt_enable();
  r = 1;
  goto cancel_injection;
  }
 
 Yes, KVM_REQ_CLOCK_UPDATE is enough to cancel guest entries too.  See
 code below.
 
  Thus, you do not need the dummy request.  You can simply issue
  KVM_REQ_CLOCK_UPDATE before calling pvclock_update_vm_gtod_copy (with
  the side effect of exiting VCPUs).  VCPUs will stall in
  kvm_guest_time_update until pvclock_gtod_sync_lock is released by
  kvm_gen_update_masterclock.  What do you think?
  
Not sure it's safe. Can you describe the safety of your proposal in
more detail?
 
 Here is the code I was thinking of:
 
  spin_lock(&ka->pvclock_gtod_sync_lock);
   make_all_cpus_request(kvm, KVM_REQ_CLOCK_UPDATE);
 
   /*
* No guest entries from this point: VCPUs will be spinning
* on pvclock_gtod_sync_lock in kvm_guest_time_update.
*/
   pvclock_update_vm_gtod_copy(kvm);
 
   /*
* Let kvm_guest_time_update continue: entering the guest
* is now allowed too.
*/
  spin_unlock(&ka->pvclock_gtod_sync_lock);
 
 KVM_REQ_CLOCK_UPDATE is used to cancel guest entry and execute
 kvm_guest_time_update.  But kvm_guest_time_update will spin on
 pvclock_gtod_sync_lock until pvclock_update_vm_gtod_copy exits and
 kvm_gen_update_masterclock releases the spinlock.

Not safe, because there are places which set KVM_REQ_CLOCK_UPDATE
without kicking the target vcpu out of guest mode, unless you use a
modified make_all_cpus_request.

The point of REQ_MCLOCK_INPROGRESS request is to guarantee that the
following is not possible:

- 2 vcpus in guest mode with per-vcpu kvmclock areas with
different {system_timestamp, tsc_offset} values.

To achieve that (sketched in code below):

- Kick all vcpus out of guest mode (via a request bit that can't be
  cleared).
- Update the {system_timestamp, tsc_offset} values.
- Clear the request bit.
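
In code, the 3.11-era kvm_gen_update_masterclock() has roughly this
shape (a sketch from memory, not a verbatim quote of x86.c):

static void kvm_gen_update_masterclock(struct kvm *kvm)
{
#ifdef CONFIG_X86_64
        int i;
        struct kvm_vcpu *vcpu;
        struct kvm_arch *ka = &kvm->arch;

        spin_lock(&ka->pvclock_gtod_sync_lock);

        /* Kick all vcpus out of guest mode; this request bit is never
         * processed, so no vcpu can re-enter while it is set. */
        kvm_make_mclock_inprogress_request(kvm);

        /* Update the masterclock {system_timestamp, tsc_offset} pair. */
        pvclock_update_vm_gtod_copy(kvm);

        /* Make every vcpu refresh its kvmclock area on next entry. */
        kvm_for_each_vcpu(i, vcpu, kvm)
                set_bit(KVM_REQ_CLOCK_UPDATE, &vcpu->requests);

        /* Clear the blocking bit: guest entries are allowed again. */
        kvm_for_each_vcpu(i, vcpu, kvm)
                clear_bit(KVM_REQ_MCLOCK_INPROGRESS, &vcpu->requests);

        spin_unlock(&ka->pvclock_gtod_sync_lock);
#endif
}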

  On top of this, optionally the spinlock could become an rw_semaphore
  so that clock updates for different VCPUs will not be serialized.  The
  effect is probably not visible, though.
  
Still not clear on the benefits, but this area certainly welcomes
performance improvements (the global kick is one thing we discussed
and that should be improved).
 
 This unfortunately does not eliminate the global kick, so there is
 likely no performance improvement yet.  It simplifies the logic a bit
 though.
 
 The change I suggested above is to make pvclock_gtod_sync_lock an rwsem
 or, probably better, a seqlock. 
VCPUs reading ka->use_master_clock,
ka->master_cycle_now, ka->master_kernel_ns can then run concurrently,
 with no locked operations in kvm_guest_time_update.

Good point.
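
A sketch of the seqlock variant Paolo describes, assuming
pvclock_gtod_sync_lock were converted to a seqlock_t (called pvclock_sl
here, initialized with seqlock_init() at VM creation); this is an
illustration, not committed code:

#include <linux/seqlock.h>

        /* Writer side, in kvm_gen_update_masterclock(): */
        write_seqlock(&ka->pvclock_sl);
        pvclock_update_vm_gtod_copy(kvm);
        write_sequnlock(&ka->pvclock_sl);

        /* Reader side, in kvm_guest_time_update(): no locked operations
         * on the fast path; readers simply retry if an update raced. */
        unsigned seq;
        do {
                seq = read_seqbegin(&ka->pvclock_sl);
                use_master = ka->use_master_clock;
                cycle_now  = ka->master_cycle_now;
                kernel_ns  = ka->master_kernel_ns;
        } while (read_seqretry(&ka->pvclock_sl, seq));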



[PATCH v3] KVM: PPC: Book3S HV: Reserve POWER8 space in get/set_one_reg

2013-09-02 Thread Michael Neuling
This reserves space in the get/set_one_reg ioctl for the extra guest
state needed for POWER8.  It doesn't implement these registers at all;
it just reserves them so that the ABI is defined now.

A few things to note here:

- This adds *a lot* of state for transactional memory.  Because of TM
  suspend mode, this is unavoidable: you can't simply roll back all
  transactions and store only the checkpointed state.  I've added this
  all to get/set_one_reg (including GPRs) rather than creating a new
  ioctl which returns a struct kvm_regs like KVM_GET_REGS does.  This
  means that if we need to extract the TM state, we are going to need a
  bucket load of ioctls (see the sketch after these notes).  Hopefully
  most of the time this will not be needed, as we can look at the MSR to
  see if TM is active and only grab the state when needed.  If this
  becomes a bottleneck in the future, we can add another ioctl to grab
  all this state in one go.

- The TM state is offset by 0x8000.

- For TM, I've done away with separate VMX and FP state and created a
  single 64 x 128-bit VSX register space.

- I've left a gap of one register (at 0x9c) since Paulus needs to add a
  value which applies to POWER7 as well.
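
As a usage sketch: once implemented, userspace would fetch any of these
registers through the existing one_reg interface (assuming the new
defines are visible via <linux/kvm.h>; error handling elided):

#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Read the (to-be-implemented) TAR register from a vcpu fd. */
static __u64 get_tar(int vcpu_fd)
{
        __u64 val = 0;
        struct kvm_one_reg reg = {
                .id   = KVM_REG_PPC_TAR,        /* 64-bit, 8-byte buffer */
                .addr = (__u64)(unsigned long)&val,
        };

        ioctl(vcpu_fd, KVM_GET_ONE_REG, &reg);
        return val;
}

The 128-bit TM_VSR registers work the same way, with .addr pointing at a
16-byte buffer (the size is encoded in the id via KVM_REG_SIZE_U128).
Until the registers are actually implemented, the ioctl will just fail.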

Signed-off-by: Michael Neuling mi...@neuling.org

--- 
The last one was screwed up... sorry..

v3: 
  fix naming mistake and whitespace screwage.

v2: 
  integrate feedback

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index ef925ea..341058c 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1810,6 +1810,45 @@ registers, find a list below:
   PPC   | KVM_REG_PPC_TLB3PS   | 32
   PPC   | KVM_REG_PPC_EPTCFG   | 32
   PPC   | KVM_REG_PPC_ICP_STATE | 64
+  PPC   | KVM_REG_PPC_SPMC1    | 32
+  PPC   | KVM_REG_PPC_SPMC2    | 32
+  PPC   | KVM_REG_PPC_IAMR     | 64
+  PPC   | KVM_REG_PPC_TFHAR    | 64
+  PPC   | KVM_REG_PPC_TFIAR    | 64
+  PPC   | KVM_REG_PPC_TEXASR   | 64
+  PPC   | KVM_REG_PPC_FSCR     | 64
+  PPC   | KVM_REG_PPC_PSPB     | 32
+  PPC   | KVM_REG_PPC_EBBHR    | 64
+  PPC   | KVM_REG_PPC_EBBRR    | 64
+  PPC   | KVM_REG_PPC_BESCR    | 64
+  PPC   | KVM_REG_PPC_TAR      | 64
+  PPC   | KVM_REG_PPC_DPDES    | 64
+  PPC   | KVM_REG_PPC_DAWR     | 64
+  PPC   | KVM_REG_PPC_DAWRX    | 64
+  PPC   | KVM_REG_PPC_CIABR    | 64
+  PPC   | KVM_REG_PPC_IC       | 64
+  PPC   | KVM_REG_PPC_VTB      | 64
+  PPC   | KVM_REG_PPC_CSIGR    | 64
+  PPC   | KVM_REG_PPC_TACR     | 64
+  PPC   | KVM_REG_PPC_TCSCR    | 64
+  PPC   | KVM_REG_PPC_PID      | 64
+  PPC   | KVM_REG_PPC_ACOP     | 64
+  PPC   | KVM_REG_PPC_TM_GPR0  | 64
+  ...
+  PPC   | KVM_REG_PPC_TM_GPR31 | 64
+  PPC   | KVM_REG_PPC_TM_VSR0  | 128
+  ...
+  PPC   | KVM_REG_PPC_TM_VSR63 | 128
+  PPC   | KVM_REG_PPC_TM_CR    | 64
+  PPC   | KVM_REG_PPC_TM_LR    | 64
+  PPC   | KVM_REG_PPC_TM_CTR   | 64
+  PPC   | KVM_REG_PPC_TM_FPSCR | 64
+  PPC   | KVM_REG_PPC_TM_AMR   | 64
+  PPC   | KVM_REG_PPC_TM_PPR   | 64
+  PPC   | KVM_REG_PPC_TM_VRSAVE | 64
+  PPC   | KVM_REG_PPC_TM_VSCR  | 32
+  PPC   | KVM_REG_PPC_TM_DSCR  | 64
+  PPC   | KVM_REG_PPC_TM_TAR   | 64
 
 ARM registers are mapped using the lower 32 bits.  The upper 16 of that
 is the register group type, or coprocessor number:
diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index 0fb1a6e..7ed41c0 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -429,6 +429,11 @@ struct kvm_get_htab_header {
 #define KVM_REG_PPC_MMCR0  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x10)
 #define KVM_REG_PPC_MMCR1  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x11)
 #define KVM_REG_PPC_MMCRA  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x12)
+#define KVM_REG_PPC_MMCR2  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x13)
+#define KVM_REG_PPC_MMCRS  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x14)
+#define KVM_REG_PPC_SIAR   (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x15)
+#define KVM_REG_PPC_SDAR   (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x16)
+#define KVM_REG_PPC_SIER   (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x17)
 
 #define KVM_REG_PPC_PMC1   (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x18)
 #define KVM_REG_PPC_PMC2   (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x19)
@@ -499,6 +504,55 @@ struct kvm_get_htab_header {
 #define KVM_REG_PPC_TLB3PS (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x9a)
 #define KVM_REG_PPC_EPTCFG (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x9b)
 
+/* POWER8 registers */
+#define KVM_REG_PPC_SPMC1  (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x9d)
+#define KVM_REG_PPC_SPMC2  (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x9e)
+#define KVM_REG_PPC_IAMR   (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x9f)
+#define KVM_REG_PPC_TFHAR  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xa0)
+#define KVM_REG_PPC_TFIAR  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xa1)
+#define KVM_REG_PPC_TEXASR (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xa2)
+#define KVM_REG_PPC_FSCR   (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xa3)
+#define KVM_REG_PPC_PSPB   (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0xa4)
+#define KVM_REG_PPC_EBBHR  

[PATCH v2] KVM: PPC: Book3S HV: Reserve POWER8 space in get/set_one_reg

2013-09-02 Thread Michael Neuling
This reserves space in the get/set_one_reg ioctl for the extra guest
state needed for POWER8.  It doesn't implement these registers at all;
it just reserves them so that the ABI is defined now.

A few things to note here:

- This adds *a lot* of state for transactional memory.  Because of TM
  suspend mode, this is unavoidable: you can't simply roll back all
  transactions and store only the checkpointed state.  I've added this
  all to get/set_one_reg (including GPRs) rather than creating a new
  ioctl which returns a struct kvm_regs like KVM_GET_REGS does.  This
  means that if we need to extract the TM state, we are going to need a
  bucket load of ioctls.  Hopefully most of the time this will not be
  needed, as we can look at the MSR to see if TM is active and only grab
  the state when needed.  If this becomes a bottleneck in the future, we
  can add another ioctl to grab all this state in one go.

- The TM state is offset by 0x8000.

- For TM, I've done away with separate VMX and FP state and created a
  single 64 x 128-bit VSX register space.

- I've left a gap of one register (at 0x9c) since Paulus needs to add a
  value which applies to POWER7 as well.

Signed-off-by: Michael Neuling mi...@neuling.org

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index ef925ea..341058c 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1810,6 +1810,45 @@ registers, find a list below:
   PPC   | KVM_REG_PPC_TLB3PS   | 32
   PPC   | KVM_REG_PPC_EPTCFG   | 32
   PPC   | KVM_REG_PPC_ICP_STATE | 64
+  PPC   | KVM_REG_PPC_SPMC1    | 32
+  PPC   | KVM_REG_PPC_SPMC2    | 32
+  PPC   | KVM_REG_PPC_IAMR     | 64
+  PPC   | KVM_REG_PPC_TFHAR    | 64
+  PPC   | KVM_REG_PPC_TFIAR    | 64
+  PPC   | KVM_REG_PPC_TEXASR   | 64
+  PPC   | KVM_REG_PPC_FSCR     | 64
+  PPC   | KVM_REG_PPC_PSPB     | 32
+  PPC   | KVM_REG_PPC_EBBHR    | 64
+  PPC   | KVM_REG_PPC_EBBRR    | 64
+  PPC   | KVM_REG_PPC_BESCR    | 64
+  PPC   | KVM_REG_PPC_TAR      | 64
+  PPC   | KVM_REG_PPC_DPDES    | 64
+  PPC   | KVM_REG_PPC_DAWR     | 64
+  PPC   | KVM_REG_PPC_DAWRX    | 64
+  PPC   | KVM_REG_PPC_CIABR    | 64
+  PPC   | KVM_REG_PPC_IC       | 64
+  PPC   | KVM_REG_PPC_VTB      | 64
+  PPC   | KVM_REG_PPC_CSIGR    | 64
+  PPC   | KVM_REG_PPC_TACR     | 64
+  PPC   | KVM_REG_PPC_TCSCR    | 64
+  PPC   | KVM_REG_PPC_PID      | 64
+  PPC   | KVM_REG_PPC_ACOP     | 64
+  PPC   | KVM_REG_PPC_TM_GPR0  | 64
+  ...
+  PPC   | KVM_REG_PPC_TM_GPR31 | 64
+  PPC   | KVM_REG_PPC_TM_VSR0  | 128
+  ...
+  PPC   | KVM_REG_PPC_TM_VSR63 | 128
+  PPC   | KVM_REG_PPC_TM_CR    | 64
+  PPC   | KVM_REG_PPC_TM_LR    | 64
+  PPC   | KVM_REG_PPC_TM_CTR   | 64
+  PPC   | KVM_REG_PPC_TM_FPSCR | 64
+  PPC   | KVM_REG_PPC_TM_AMR   | 64
+  PPC   | KVM_REG_PPC_TM_PPR   | 64
+  PPC   | KVM_REG_PPC_TM_VRSAVE | 64
+  PPC   | KVM_REG_PPC_TM_VSCR  | 32
+  PPC   | KVM_REG_PPC_TM_DSCR  | 64
+  PPC   | KVM_REG_PPC_TM_TAR   | 64
 
 ARM registers are mapped using the lower 32 bits.  The upper 16 of that
 is the register group type, or coprocessor number:
diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index 0fb1a6e..8e687a1 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -429,6 +429,11 @@ struct kvm_get_htab_header {
 #define KVM_REG_PPC_MMCR0  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x10)
 #define KVM_REG_PPC_MMCR1  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x11)
 #define KVM_REG_PPC_MMCRA  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x12)
+#define KVM_REG_PPC_MMCR2  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x13)
+#define KVM_REG_PPC_MMCRS  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x14)
+#define KVM_REG_PPC_SIAR   (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x15)
+#define KVM_REG_PPC_SDAR   (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x16)
+#define KVM_REG_PPC_SIER   (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x17)
 
 #define KVM_REG_PPC_PMC1   (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x18)
 #define KVM_REG_PPC_PMC2   (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x19)
@@ -499,6 +504,55 @@ struct kvm_get_htab_header {
 #define KVM_REG_PPC_TLB3PS (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x9a)
 #define KVM_REG_PPC_EPTCFG (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x9b)
 
+/* POWER8 registers */
+#define KVM_REG_PPC_SPMC1  (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x9d)
+#define KVM_REG_PPC_SPMC2  (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x9e)
+#define KVM_REG_PPC_IAMR   (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x9f)
+#define KVM_REG_PPC_TFHAR  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xa0)
+#define KVM_REG_PPC_TFIAR  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xa1)
+#define KVM_REG_PPC_TEXASR (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xa2)
+#define KVM_REG_PPC_FSCR   (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xa3)
+#define KVM_REG_PPC_PSPB   (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0xa4)
+#define KVM_REG_PPC_EBBHR  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xa5)
+#define KVM_REG_PPC_EBBRR  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xa6)
+#define 
