Re: source for virt io backend driver

2012-04-09 Thread Steven
Hi,
I found this post:
http://permalink.gmane.org/gmane.comp.emulators.kvm.devel/89334
So the current block device appears to be emulated entirely by the QEMU
userspace driver.

- ha

On Mon, Apr 9, 2012 at 2:15 AM, Nadav Har'El  wrote:
> On Sun, Apr 08, 2012, Steven wrote about "Re: source for virt io backend 
> driver":
>> > vhost-net. This you can find in drivers/vhost/net.c (and
>> > drivers/vhost/vhost.c for the generic vhost infrastructure for virtio
>> > drivers in the host kernel).
>>
>> Yes, I am looking for the backend code in the kernel. I found the file
>> for net backend drivers/vhost/net.c. However, the backend code for blk
>> is not there.
>> Could you point out this? Thanks.
>
> Indeed, vhost-block is not (yet) part of the mainline kernel.
>
> Maybe someone else more experienced in vhost-block can recommend where
> to get the latest version.
>
> Nadav.
>
>
> --
> Nadav Har'El                        |                     Monday, Apr 9 2012,
> n...@math.technion.ac.il            |-
> Phone +972-523-790466, ICQ 13349191 |Diplomat: A man who always remembers a
> http://nadav.harel.org.il           |woman's birthday but never her age.


Re: Question regarding KVM Guest PMU

2012-04-09 Thread shashank rachamalla
On Mon, Apr 9, 2012 at 2:56 PM, Gleb Natapov  wrote:
> On Sun, Apr 08, 2012 at 06:27:03PM +0300, Gleb Natapov wrote:
>> On Fri, Apr 06, 2012 at 09:50:50AM +0300, Gleb Natapov wrote:
>> > On Fri, Apr 06, 2012 at 10:43:17AM +0530, shashank rachamalla wrote:
>> > > On Thu, Apr 5, 2012 at 8:11 PM, Gleb Natapov  wrote:
>> > > > On Thu, Apr 05, 2012 at 05:38:40PM +0300, Avi Kivity wrote:
>> > > >> On 04/05/2012 04:57 PM, Gleb Natapov wrote:
>> > > >> > > >
>> > > >> > > May be it used NMI based profiling. We should ask oprofile 
>> > > >> > > developers.
>> > > >> > > As I said I am almost sure my inability to run it on a host is 
>> > > >> > > probably
>> > > >> > > PEBKAC, although I ran the same script exactly on the host and the
>> > > >> > > guest (the script is from the first email of this thread)
>> > > >> > >
>> > > >> > After upgrading the kernel to latest git from whatever it was there 
>> > > >> > the
>> > > >> > same script works and counts CPU_CLK_UNHALT events.
>> > > >> >
>> > > >>
>> > > >> This is even while it violates the Intel guidelines?
>> > > >>
>> > > > Yes, but who says the result is correct :) It seems that we handle
>> > > > global ctrl msr wrong. That is counter can be enabled either in global
>> > > > ctrl or in eventsel. Trying to confirm that.
>> > > >
>> > > if that becomes true then will global ctrl msr have any significance ?
>> > When it is in use yes.
>> >
>> I was wrong. We do handle global ctrl msr correctly, I just ran my test
>> incorrectly. If I disable global ctrl on all cpus (for i in `seq 0 15`;
>> do wrmsr -p $i 0x38f 0; done) oprofile stops working.
>>
> After searching high and low I finally found the following in
> "Performance Monitoring Unit Sharing Guide" white paper:
>
>  Known starting state: Software requires a known starting
>  state. After CPU reset, all counters and control registers are
>  disabled and clear/reset to 0.  The only exception to this is the
>  IA32_PERF_GLOBAL_CTRL control MSR, all programmable counter global
>  enable bits are reset to 1.
>
> Patch will follow.
>

Thanks for the patch. The global ctrl MSR is now properly set inside the guest
and I can run oprofile as well.
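
For what it's worth, the value the guest sees can be double-checked through the
msr module; a minimal sketch in C (assumes root, 'modprobe msr' and CPU 0 --
this is only an illustration, not code from the thread):

    /* Read IA32_PERF_GLOBAL_CTRL (MSR 0x38f) via /dev/cpu/0/msr. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        uint64_t val;
        int fd = open("/dev/cpu/0/msr", O_RDONLY);

        if (fd < 0 || pread(fd, &val, sizeof(val), 0x38f) != sizeof(val)) {
            perror("IA32_PERF_GLOBAL_CTRL");
            return 1;
        }
        printf("IA32_PERF_GLOBAL_CTRL = %#llx\n", (unsigned long long)val);
        return 0;
    }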

> --
>                        Gleb.


Re: [PATCH 00/13] KVM: MMU: fast page fault

2012-04-09 Thread Xiao Guangrong
On 04/10/2012 03:46 AM, Marcelo Tosatti wrote:

> On Tue, Apr 10, 2012 at 02:26:27AM +0800, Xiao Guangrong wrote:
>> On 04/10/2012 01:58 AM, Marcelo Tosatti wrote:
>>
>>> On Mon, Apr 09, 2012 at 04:12:46PM +0300, Avi Kivity wrote:
 On 03/29/2012 11:20 AM, Xiao Guangrong wrote:
> * Idea
> The present bit of page fault error code (EFEC.P) indicates whether the
> page table is populated on all levels, if this bit is set, we can know
> the page fault is caused by the page-protection bits (e.g. W/R bit) or
> the reserved bits.
>
> In KVM, in most cases, all this kind of page fault (EFEC.P = 1) can be
> simply fixed: the page fault caused by reserved bit
> (EFFC.P = 1 && EFEC.RSV = 1) has already been filtered out in fast mmio
> path. What we need do to fix the rest page fault (EFEC.P = 1 && RSV != 1)
> is just increasing the corresponding access on the spte.
>
> This pachset introduces a fast path to fix this kind of page fault: it
> is out of mmu-lock and need not walk host page table to get the mapping
> from gfn to pfn.
>
>

 This patchset is really worrying to me.

 It introduces a lot of concurrency into data structures that were not
 designed for it.  Even if it is correct, it will be very hard to
 convince ourselves that it is correct, and if it isn't, to debug those
 subtle bugs.  It will also be much harder to maintain the mmu code than
 it is now.

 There are a lot of things to check.  Just as an example, we need to be
 sure that if we use rcu_dereference() twice in the same code path, that
 any inconsistencies due to a write in between are benign.  Doing that is
 a huge task.

 But I appreciate the performance improvement and would like to see a
 simpler version make it in.  This needs to reduce the amount of data
 touched in the fast path so it is easier to validate, and perhaps reduce
 the number of cases that the fast path works on.

 I would like to see the fast path as simple as

   rcu_read_lock();

   (lockless shadow walk)
   spte = ACCESS_ONCE(*sptep);

   if (!(spte & PT_MAY_ALLOW_WRITES))
 goto slow;

   gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->sptes)
   mark_page_dirty(kvm, gfn);

   new_spte = spte & ~(PT64_MAY_ALLOW_WRITES | PT_WRITABLE_MASK);
   if (cmpxchg(sptep, spte, new_spte) != spte)
goto slow;

   rcu_read_unlock();
   return;

 slow:
   rcu_read_unlock();
   slow_path();

 It now becomes the responsibility of the slow path to maintain *sptep &
 PT_MAY_ALLOW_WRITES, but that path has a simpler concurrency model.  It
 can be as simple as a clear_bit() before we update sp->gfns[] or if we
 add host write protection.

 Sorry, it's too complicated for me.  Marcelo, what's your take?
>>>
>>> The improvement is small and limited to special cases (migration should
>>> be rare and framebuffer memory accounts for a small percentage of total
>>> memory).
>>
>>
>> Actually, although the framebuffer is small but it is modified really
>> frequently, and another unlucky things is that dirty-log is also
>> very frequently and need hold mmu-lock to do write-protect.
>>
>> Yes, if Xwindow is not enabled, the benefit is limited. :)
> 
> Ignoring that fact, the safety of lockless set_spte and friends is not
> clear.
> 


That is why Avi suggested that I simplify the whole thing. :)


> Perhaps the mmu_lock hold times by get_dirty are a large component here?
> If that can be alleviated, not only RO->RW faults benefit.
> 

Yes.



[PATCH 2/2] tools: virtio: add a top-like utility for displaying vhost statistics

2012-04-09 Thread Jason Wang
This patch adds a simple Python script to display vhost statistics; the code is
based on the kvm_stat script from QEMU. Since the work function is recorded,
filters can be used to distinguish which kinds of work are being executed or
queued:

vhost statistics

 vhost_virtio_get_vq_desc             1460997   219682
 vhost_work_execute_start              101248    12842
 vhost_work_execute_end                101247    12842
 vhost_work_queue_wakeup               101263    12841
 vhost_virtio_signal                    68452     8659
 vhost_work_queue_wakeup(rx_net)        51797     6584
 vhost_work_execute_start(rx_net)       51795     6584
 vhost_work_queue_coalesce              35737     6571
 vhost_work_queue_coalesce(rx_net)      35709     6566
 vhost_virtio_update_avail_event        49512     6271
 vhost_work_execute_start(tx_kick)      49429     6254
 vhost_work_queue_wakeup(tx_kick)       49442     6252
 vhost_work_queue_coalesce(tx_kick)        28        5
 vhost_work_execute_start(rx_kick)         22        3
 vhost_work_queue_wakeup(rx_kick)          22        3
 vhost_poll_start                            4        0

Signed-off-by: Jason Wang 
---
 tools/virtio/vhost_stat |  360 +++
 1 files changed, 360 insertions(+), 0 deletions(-)
 create mode 100755 tools/virtio/vhost_stat

diff --git a/tools/virtio/vhost_stat b/tools/virtio/vhost_stat
new file mode 100755
index 000..b730f3b
--- /dev/null
+++ b/tools/virtio/vhost_stat
@@ -0,0 +1,360 @@
+#!/usr/bin/python
+#
+# top-like utility for displaying vhost statistics
+#
+# Copyright 2012 Red Hat, Inc.
+#
+# Modified from kvm_stat from qemu
+#
+# This work is licensed under the terms of the GNU GPL, version 2.  See
+# the COPYING file in the top-level directory.
+
+import curses
+import sys, os, time, optparse
+
+work_types = {
+"handle_rx_kick" : "rx_kick",
+"handle_tx_kick" : "tx_kick",
+"handle_rx_net" : "rx_net",
+"handle_tx_net" : "tx_net",
+"vhost_attach_cgroups_work": "cg_attach"
+}
+
+addr = {}
+
+kallsyms = file("/proc/kallsyms").readlines()
+for kallsym in kallsyms:
+    entry = kallsym.split()
+    if entry[2] in work_types.keys():
+        addr["0x%s" % entry[0]] = work_types[entry[2]]
+
+filters = {
+'vhost_work_queue_wakeup': ('function', addr),
+'vhost_work_queue_coalesce' : ('function', addr),
+'vhost_work_execute_start' : ('function', addr),
+'vhost_poll_start' : ('function', addr),
+'vhost_poll_stop' : ('function', addr),
+}
+
+def invert(d):
+    return dict((x[1], x[0]) for x in d.iteritems())
+
+for f in filters:
+    filters[f] = (filters[f][0], invert(filters[f][1]))
+
+import ctypes, struct, array
+
+libc = ctypes.CDLL('libc.so.6')
+syscall = libc.syscall
+class perf_event_attr(ctypes.Structure):
+    _fields_ = [('type', ctypes.c_uint32),
+                ('size', ctypes.c_uint32),
+                ('config', ctypes.c_uint64),
+                ('sample_freq', ctypes.c_uint64),
+                ('sample_type', ctypes.c_uint64),
+                ('read_format', ctypes.c_uint64),
+                ('flags', ctypes.c_uint64),
+                ('wakeup_events', ctypes.c_uint32),
+                ('bp_type', ctypes.c_uint32),
+                ('bp_addr', ctypes.c_uint64),
+                ('bp_len', ctypes.c_uint64),
+                ]
+def _perf_event_open(attr, pid, cpu, group_fd, flags):
+    return syscall(298, ctypes.pointer(attr), ctypes.c_int(pid),
+                   ctypes.c_int(cpu), ctypes.c_int(group_fd),
+                   ctypes.c_long(flags))
+
+PERF_TYPE_HARDWARE  = 0
+PERF_TYPE_SOFTWARE  = 1
+PERF_TYPE_TRACEPOINT= 2
+PERF_TYPE_HW_CACHE  = 3
+PERF_TYPE_RAW   = 4
+PERF_TYPE_BREAKPOINT= 5
+
+PERF_SAMPLE_IP  = 1 << 0
+PERF_SAMPLE_TID = 1 << 1
+PERF_SAMPLE_TIME= 1 << 2
+PERF_SAMPLE_ADDR= 1 << 3
+PERF_SAMPLE_READ= 1 << 4
+PERF_SAMPLE_CALLCHAIN   = 1 << 5
+PERF_SAMPLE_ID  = 1 << 6
+PERF_SAMPLE_CPU = 1 << 7
+PERF_SAMPLE_PERIOD  = 1 << 8
+PERF_SAMPLE_STREAM_ID   = 1 << 9
+PERF_SAMPLE_RAW = 1 << 10
+
+PERF_FORMAT_TOTAL_TIME_ENABLED  = 1 << 0
+PERF_FORMAT_TOTAL_TIME_RUNNING  = 1 << 1
+PERF_FORMAT_ID  = 1 << 2
+PERF_FORMAT_GROUP   = 1 << 3
+
+import re
+
+sys_tracing = '/sys/kernel/debug/tracing'
+
+class Group(object):
+    def __init__(self, cpu):
+        self.events = []
+        self.group_leader = None
+        self.cpu = cpu
+    def add_event(self, name, event_set, tracepoint, filter = None):
+        self.events.append(Event(group = self,
+                                 name = name, event_set = event_set,
+                                 tracepoint = tracepoint, filter = filter))
+if len(se

[PATCH 1/2] vhost: basic tracepoints

2012-04-09 Thread Jason Wang
To help with performance optimization and debugging, this patch adds
tracepoints to vhost. Note that the tracepoints cover only the generic vhost
code; the net code is not touched.

Two kinds of activities were traced: virtio and vhost work.

Signed-off-by: Jason Wang 
---
 drivers/vhost/trace.h |  153 +
 drivers/vhost/vhost.c |   17 +
 2 files changed, 168 insertions(+), 2 deletions(-)
 create mode 100644 drivers/vhost/trace.h

diff --git a/drivers/vhost/trace.h b/drivers/vhost/trace.h
new file mode 100644
index 000..0423899
--- /dev/null
+++ b/drivers/vhost/trace.h
@@ -0,0 +1,153 @@
+#if !defined(_TRACE_VHOST_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_VHOST_H
+
+#include <linux/tracepoint.h>
+#include "vhost.h"
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM vhost
+
+/*
+ * Tracepoint for updating used flag.
+ */
+TRACE_EVENT(vhost_virtio_update_used_flags,
+   TP_PROTO(struct vhost_virtqueue *vq),
+   TP_ARGS(vq),
+
+   TP_STRUCT__entry(
+   __field(struct vhost_virtqueue *, vq)
+   __field(u16, used_flags)
+   ),
+
+   TP_fast_assign(
+   __entry->vq = vq;
+   __entry->used_flags = vq->used_flags;
+   ),
+
+   TP_printk("vhost update used flag %x to vq %p notify %s",
+   __entry->used_flags, __entry->vq,
+   (__entry->used_flags & VRING_USED_F_NO_NOTIFY) ?
+   "disabled" : "enabled")
+);
+
+/*
+ * Tracepoint for updating avail event.
+ */
+TRACE_EVENT(vhost_virtio_update_avail_event,
+   TP_PROTO(struct vhost_virtqueue *vq),
+   TP_ARGS(vq),
+
+   TP_STRUCT__entry(
+   __field(struct vhost_virtqueue *, vq)
+   __field(u16, avail_idx)
+   ),
+
+   TP_fast_assign(
+   __entry->vq = vq;
+   __entry->avail_idx = vq->avail_idx;
+   ),
+
+   TP_printk("vhost update avail idx %u(%u) for vq %p",
+ __entry->avail_idx, __entry->avail_idx %
+ __entry->vq->num, __entry->vq)
+);
+
+/*
+ * Tracepoint for processing descriptor.
+ */
+TRACE_EVENT(vhost_virtio_get_vq_desc,
+   TP_PROTO(struct vhost_virtqueue *vq, unsigned int index,
+unsigned out, unsigned int in),
+   TP_ARGS(vq, index, out, in),
+
+   TP_STRUCT__entry(
+   __field(struct vhost_virtqueue *, vq)
+   __field(unsigned int, head)
+   __field(unsigned int, out)
+   __field(unsigned int, in)
+   ),
+
+   TP_fast_assign(
+   __entry->vq = vq;
+   __entry->head = index;
+   __entry->out = out;
+   __entry->in = in;
+   ),
+
+   TP_printk("vhost get vq %p head %u out %u in %u",
+ __entry->vq, __entry->head, __entry->out, __entry->in)
+
+);
+
+/*
+ * Tracepoint for signal guest.
+ */
+TRACE_EVENT(vhost_virtio_signal,
+   TP_PROTO(struct vhost_virtqueue *vq),
+   TP_ARGS(vq),
+
+   TP_STRUCT__entry(
+   __field(struct vhost_virtqueue *, vq)
+   ),
+
+   TP_fast_assign(
+   __entry->vq = vq;
+   ),
+
+   TP_printk("vhost signal vq %p", __entry->vq)
+);
+
+DECLARE_EVENT_CLASS(vhost_work_template,
+   TP_PROTO(struct vhost_dev *dev, struct vhost_work *work),
+   TP_ARGS(dev, work),
+
+   TP_STRUCT__entry(
+   __field(struct vhost_dev *, dev)
+   __field(struct vhost_work *, work)
+   __field(void *, function)
+   ),
+
+   TP_fast_assign(
+   __entry->dev = dev;
+   __entry->work = work;
+   __entry->function = work->fn;
+   ),
+
+   TP_printk("%pf for work %p dev %p",
+   __entry->function, __entry->work, __entry->dev)
+);
+
+DEFINE_EVENT(vhost_work_template, vhost_work_queue_wakeup,
+   TP_PROTO(struct vhost_dev *dev, struct vhost_work *work),
+   TP_ARGS(dev, work));
+
+DEFINE_EVENT(vhost_work_template, vhost_work_queue_coalesce,
+   TP_PROTO(struct vhost_dev *dev, struct vhost_work *work),
+   TP_ARGS(dev, work));
+
+DEFINE_EVENT(vhost_work_template, vhost_poll_start,
+   TP_PROTO(struct vhost_dev *dev, struct vhost_work *work),
+   TP_ARGS(dev, work));
+
+DEFINE_EVENT(vhost_work_template, vhost_poll_stop,
+   TP_PROTO(struct vhost_dev *dev, struct vhost_work *work),
+   TP_ARGS(dev, work));
+
+DEFINE_EVENT(vhost_work_template, vhost_work_execute_start,
+   TP_PROTO(struct vhost_dev *dev, struct vhost_work *work),
+   TP_ARGS(dev, work));
+
+DEFINE_EVENT(vhost_work_template, vhost_work_execute_end,
+   TP_PROTO(struct vhost_dev *dev, struct vhost_work *work),
+   TP_ARGS(dev, work));
+
+#endif /* _TRACE_VHOST_H */
+
+#undef TRACE_INCLUDE_PATH
+#define TRACE_INCLUDE_PATH ../../drivers/vhost
+#undef TRACE_INCLUDE_FILE
+#define TRACE_INCLUDE_FILE trace
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
+
diff --git a/drivers/vhost/vhost.c b/driver

[PATCH 0/2] adding tracepoints to vhost

2012-04-09 Thread Jason Wang
To help with vhost analysis, the following series adds basic tracepoints to
vhost. Operations of both virtqueues and vhost work items are traced in the
current implementation; the net code is untouched. A top-like statistics
display script is introduced to help with troubleshooting.

TODO:
- net specific tracepoints?

---

Jason Wang (2):
  vhost: basic tracepoints
  tools: virtio: add a top-like utility for displaying vhost statistics


 drivers/vhost/trace.h   |  153 
 drivers/vhost/vhost.c   |   17 ++
 tools/virtio/vhost_stat |  360 +++
 3 files changed, 528 insertions(+), 2 deletions(-)
 create mode 100644 drivers/vhost/trace.h
 create mode 100755 tools/virtio/vhost_stat

-- 
Jason Wang


vhost-blk development

2012-04-09 Thread Michael Baysek
Hi all.  I'm interested in any developments on the vhost-blk in-kernel
accelerator for disk I/O.

I saw a patchset on LKML (https://lkml.org/lkml/2011/7/28/175), but it is
rather old.  Are there any newer developments going on with vhost-blk?


Re: BUG: soft lockup - CPU#1 stuck for 61s ! [qemu-kvm... (and server hang)

2012-04-09 Thread Marcelo Tosatti
On Mon, Apr 09, 2012 at 07:56:28PM +, Rahul wrote:
> 
> 
> Tomasz,
> 
> Did you get the cause of this issue? plz share, I am also facing the same 
> issue.
> 
> -Rahul N.

Rahul,

Can you please post traces?



Re: BUG: soft lockup - CPU#1 stuck for 61s ! [qemu-kvm... (and server hang)

2012-04-09 Thread Rahul


Tomasz,

Did you get the cause of this issue? plz share, I am also facing the same issue.

-Rahul N.




Re: [PATCH 00/13] KVM: MMU: fast page fault

2012-04-09 Thread Marcelo Tosatti
On Tue, Apr 10, 2012 at 02:26:27AM +0800, Xiao Guangrong wrote:
> On 04/10/2012 01:58 AM, Marcelo Tosatti wrote:
> 
> > On Mon, Apr 09, 2012 at 04:12:46PM +0300, Avi Kivity wrote:
> >> On 03/29/2012 11:20 AM, Xiao Guangrong wrote:
> >>> * Idea
> >>> The present bit of page fault error code (EFEC.P) indicates whether the
> >>> page table is populated on all levels, if this bit is set, we can know
> >>> the page fault is caused by the page-protection bits (e.g. W/R bit) or
> >>> the reserved bits.
> >>>
> >>> In KVM, in most cases, all this kind of page fault (EFEC.P = 1) can be
> >>> simply fixed: the page fault caused by reserved bit
> >>> (EFFC.P = 1 && EFEC.RSV = 1) has already been filtered out in fast mmio
> >>> path. What we need do to fix the rest page fault (EFEC.P = 1 && RSV != 1)
> >>> is just increasing the corresponding access on the spte.
> >>>
> >>> This pachset introduces a fast path to fix this kind of page fault: it
> >>> is out of mmu-lock and need not walk host page table to get the mapping
> >>> from gfn to pfn.
> >>>
> >>>
> >>
> >> This patchset is really worrying to me.
> >>
> >> It introduces a lot of concurrency into data structures that were not
> >> designed for it.  Even if it is correct, it will be very hard to
> >> convince ourselves that it is correct, and if it isn't, to debug those
> >> subtle bugs.  It will also be much harder to maintain the mmu code than
> >> it is now.
> >>
> >> There are a lot of things to check.  Just as an example, we need to be
> >> sure that if we use rcu_dereference() twice in the same code path, that
> >> any inconsistencies due to a write in between are benign.  Doing that is
> >> a huge task.
> >>
> >> But I appreciate the performance improvement and would like to see a
> >> simpler version make it in.  This needs to reduce the amount of data
> >> touched in the fast path so it is easier to validate, and perhaps reduce
> >> the number of cases that the fast path works on.
> >>
> >> I would like to see the fast path as simple as
> >>
> >>   rcu_read_lock();
> >>
> >>   (lockless shadow walk)
> >>   spte = ACCESS_ONCE(*sptep);
> >>
> >>   if (!(spte & PT_MAY_ALLOW_WRITES))
> >> goto slow;
> >>
> >>   gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->sptes)
> >>   mark_page_dirty(kvm, gfn);
> >>
> >>   new_spte = spte & ~(PT64_MAY_ALLOW_WRITES | PT_WRITABLE_MASK);
> >>   if (cmpxchg(sptep, spte, new_spte) != spte)
> >>goto slow;
> >>
> >>   rcu_read_unlock();
> >>   return;
> >>
> >> slow:
> >>   rcu_read_unlock();
> >>   slow_path();
> >>
> >> It now becomes the responsibility of the slow path to maintain *sptep &
> >> PT_MAY_ALLOW_WRITES, but that path has a simpler concurrency model.  It
> >> can be as simple as a clear_bit() before we update sp->gfns[] or if we
> >> add host write protection.
> >>
> >> Sorry, it's too complicated for me.  Marcelo, what's your take?
> > 
> > The improvement is small and limited to special cases (migration should
> > be rare and framebuffer memory accounts for a small percentage of total
> > memory).
> 
> 
> Actually, although the framebuffer is small but it is modified really
> frequently, and another unlucky things is that dirty-log is also
> very frequently and need hold mmu-lock to do write-protect.
> 
> Yes, if Xwindow is not enabled, the benefit is limited. :)

Ignoring that fact, the safety of lockless set_spte and friends is not
clear.

Perhaps the mmu_lock hold times by get_dirty are a large component here?
If that can be alleviated, not only RO->RW faults benefit.



Re: [PATCH 00/13] KVM: MMU: fast page fault

2012-04-09 Thread Marcelo Tosatti
On Tue, Apr 10, 2012 at 02:13:41AM +0800, Xiao Guangrong wrote:
> On 04/10/2012 01:58 AM, Marcelo Tosatti wrote:
> 
> > On Mon, Apr 09, 2012 at 04:12:46PM +0300, Avi Kivity wrote:
> >> On 03/29/2012 11:20 AM, Xiao Guangrong wrote:
> >>> * Idea
> >>> The present bit of page fault error code (EFEC.P) indicates whether the
> >>> page table is populated on all levels, if this bit is set, we can know
> >>> the page fault is caused by the page-protection bits (e.g. W/R bit) or
> >>> the reserved bits.
> >>>
> >>> In KVM, in most cases, all this kind of page fault (EFEC.P = 1) can be
> >>> simply fixed: the page fault caused by reserved bit
> >>> (EFFC.P = 1 && EFEC.RSV = 1) has already been filtered out in fast mmio
> >>> path. What we need do to fix the rest page fault (EFEC.P = 1 && RSV != 1)
> >>> is just increasing the corresponding access on the spte.
> >>>
> >>> This pachset introduces a fast path to fix this kind of page fault: it
> >>> is out of mmu-lock and need not walk host page table to get the mapping
> >>> from gfn to pfn.
> >>>
> >>>
> >>
> >> This patchset is really worrying to me.
> >>
> >> It introduces a lot of concurrency into data structures that were not
> >> designed for it.  Even if it is correct, it will be very hard to
> >> convince ourselves that it is correct, and if it isn't, to debug those
> >> subtle bugs.  It will also be much harder to maintain the mmu code than
> >> it is now.
> >>
> >> There are a lot of things to check.  Just as an example, we need to be
> >> sure that if we use rcu_dereference() twice in the same code path, that
> >> any inconsistencies due to a write in between are benign.  Doing that is
> >> a huge task.
> >>
> >> But I appreciate the performance improvement and would like to see a
> >> simpler version make it in.  This needs to reduce the amount of data
> >> touched in the fast path so it is easier to validate, and perhaps reduce
> >> the number of cases that the fast path works on.
> >>
> >> I would like to see the fast path as simple as
> >>
> >>   rcu_read_lock();
> >>
> >>   (lockless shadow walk)
> >>   spte = ACCESS_ONCE(*sptep);
> >>
> >>   if (!(spte & PT_MAY_ALLOW_WRITES))
> >> goto slow;
> >>
> >>   gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->sptes)
> >>   mark_page_dirty(kvm, gfn);
> >>
> >>   new_spte = spte & ~(PT64_MAY_ALLOW_WRITES | PT_WRITABLE_MASK);
> >>   if (cmpxchg(sptep, spte, new_spte) != spte)
> >>goto slow;
> >>
> >>   rcu_read_unlock();
> >>   return;
> >>
> >> slow:
> >>   rcu_read_unlock();
> >>   slow_path();
> >>
> >> It now becomes the responsibility of the slow path to maintain *sptep &
> >> PT_MAY_ALLOW_WRITES, but that path has a simpler concurrency model.  It
> >> can be as simple as a clear_bit() before we update sp->gfns[] or if we
> >> add host write protection.
> >>
> >> Sorry, it's too complicated for me.  Marcelo, what's your take?
> > 
> > The improvement is small and limited to special cases (migration should
> > be rare and framebuffer memory accounts for a small percentage of total
> > memory).
> > 
> > For one, how can this be safe against mmu notifier methods?
> > 
> > KSM   |VCPU0| VCPU1
> >   | fault   | fault
> >   | cow-page|
> >   | set spte RW |
> >   | |
> > write protect host pte| |
> > grab mmu_lock | |
> > remove writeable bit in spte  | |
> > increase mmu_notifier_seq | |  spte = read-only spte
> > release mmu_lock  | |  cmpxchg succeeds, RO->RW!
> > 
> > MMU notifiers rely on the fault path sequence being
> > 
> > read host pte
> > read mmu_notifier_seq
> > spin_lock(mmu_lock)
> > if (mmu_notifier_seq changed)
> > goodbye, host pte value is stale
> > spin_unlock(mmu_lock)
> > 
> > By the example above, you cannot rely on the spte value alone,
> > mmu_notifier_seq must be taken into account.
> 
> 
> No.
> 
> When KSM change the host page to read-only, the HOST_WRITABLE bit
> of spte should be removed, that means, the spte should be changed
> that can be watched by cmpxchg.
> 
> Note: we mark spte to be writeable only if spte.HOST_WRITABLE is
> set.

Right. 


Re: [PATCH 00/13] KVM: MMU: fast page fault

2012-04-09 Thread Xiao Guangrong
On 04/10/2012 01:58 AM, Marcelo Tosatti wrote:

> On Mon, Apr 09, 2012 at 04:12:46PM +0300, Avi Kivity wrote:
>> On 03/29/2012 11:20 AM, Xiao Guangrong wrote:
>>> * Idea
>>> The present bit of page fault error code (EFEC.P) indicates whether the
>>> page table is populated on all levels, if this bit is set, we can know
>>> the page fault is caused by the page-protection bits (e.g. W/R bit) or
>>> the reserved bits.
>>>
>>> In KVM, in most cases, all this kind of page fault (EFEC.P = 1) can be
>>> simply fixed: the page fault caused by reserved bit
>>> (EFFC.P = 1 && EFEC.RSV = 1) has already been filtered out in fast mmio
>>> path. What we need do to fix the rest page fault (EFEC.P = 1 && RSV != 1)
>>> is just increasing the corresponding access on the spte.
>>>
>>> This pachset introduces a fast path to fix this kind of page fault: it
>>> is out of mmu-lock and need not walk host page table to get the mapping
>>> from gfn to pfn.
>>>
>>>
>>
>> This patchset is really worrying to me.
>>
>> It introduces a lot of concurrency into data structures that were not
>> designed for it.  Even if it is correct, it will be very hard to
>> convince ourselves that it is correct, and if it isn't, to debug those
>> subtle bugs.  It will also be much harder to maintain the mmu code than
>> it is now.
>>
>> There are a lot of things to check.  Just as an example, we need to be
>> sure that if we use rcu_dereference() twice in the same code path, that
>> any inconsistencies due to a write in between are benign.  Doing that is
>> a huge task.
>>
>> But I appreciate the performance improvement and would like to see a
>> simpler version make it in.  This needs to reduce the amount of data
>> touched in the fast path so it is easier to validate, and perhaps reduce
>> the number of cases that the fast path works on.
>>
>> I would like to see the fast path as simple as
>>
>>   rcu_read_lock();
>>
>>   (lockless shadow walk)
>>   spte = ACCESS_ONCE(*sptep);
>>
>>   if (!(spte & PT_MAY_ALLOW_WRITES))
>> goto slow;
>>
>>   gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->sptes)
>>   mark_page_dirty(kvm, gfn);
>>
>>   new_spte = spte & ~(PT64_MAY_ALLOW_WRITES | PT_WRITABLE_MASK);
>>   if (cmpxchg(sptep, spte, new_spte) != spte)
>>goto slow;
>>
>>   rcu_read_unlock();
>>   return;
>>
>> slow:
>>   rcu_read_unlock();
>>   slow_path();
>>
>> It now becomes the responsibility of the slow path to maintain *sptep &
>> PT_MAY_ALLOW_WRITES, but that path has a simpler concurrency model.  It
>> can be as simple as a clear_bit() before we update sp->gfns[] or if we
>> add host write protection.
>>
>> Sorry, it's too complicated for me.  Marcelo, what's your take?
> 
> The improvement is small and limited to special cases (migration should
> be rare and framebuffer memory accounts for a small percentage of total
> memory).


Actually, although the framebuffer is small, it is modified really
frequently; another unlucky thing is that dirty-log is also requested very
frequently and needs to hold mmu-lock to do write-protection.

Yes, if Xwindow is not enabled, the benefit is limited. :)


Re: [PATCH 00/13] KVM: MMU: fast page fault

2012-04-09 Thread Xiao Guangrong
On 04/10/2012 01:58 AM, Marcelo Tosatti wrote:

> On Mon, Apr 09, 2012 at 04:12:46PM +0300, Avi Kivity wrote:
>> On 03/29/2012 11:20 AM, Xiao Guangrong wrote:
>>> * Idea
>>> The present bit of page fault error code (EFEC.P) indicates whether the
>>> page table is populated on all levels, if this bit is set, we can know
>>> the page fault is caused by the page-protection bits (e.g. W/R bit) or
>>> the reserved bits.
>>>
>>> In KVM, in most cases, all this kind of page fault (EFEC.P = 1) can be
>>> simply fixed: the page fault caused by reserved bit
>>> (EFFC.P = 1 && EFEC.RSV = 1) has already been filtered out in fast mmio
>>> path. What we need do to fix the rest page fault (EFEC.P = 1 && RSV != 1)
>>> is just increasing the corresponding access on the spte.
>>>
>>> This pachset introduces a fast path to fix this kind of page fault: it
>>> is out of mmu-lock and need not walk host page table to get the mapping
>>> from gfn to pfn.
>>>
>>>
>>
>> This patchset is really worrying to me.
>>
>> It introduces a lot of concurrency into data structures that were not
>> designed for it.  Even if it is correct, it will be very hard to
>> convince ourselves that it is correct, and if it isn't, to debug those
>> subtle bugs.  It will also be much harder to maintain the mmu code than
>> it is now.
>>
>> There are a lot of things to check.  Just as an example, we need to be
>> sure that if we use rcu_dereference() twice in the same code path, that
>> any inconsistencies due to a write in between are benign.  Doing that is
>> a huge task.
>>
>> But I appreciate the performance improvement and would like to see a
>> simpler version make it in.  This needs to reduce the amount of data
>> touched in the fast path so it is easier to validate, and perhaps reduce
>> the number of cases that the fast path works on.
>>
>> I would like to see the fast path as simple as
>>
>>   rcu_read_lock();
>>
>>   (lockless shadow walk)
>>   spte = ACCESS_ONCE(*sptep);
>>
>>   if (!(spte & PT_MAY_ALLOW_WRITES))
>> goto slow;
>>
>>   gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->sptes)
>>   mark_page_dirty(kvm, gfn);
>>
>>   new_spte = spte & ~(PT64_MAY_ALLOW_WRITES | PT_WRITABLE_MASK);
>>   if (cmpxchg(sptep, spte, new_spte) != spte)
>>goto slow;
>>
>>   rcu_read_unlock();
>>   return;
>>
>> slow:
>>   rcu_read_unlock();
>>   slow_path();
>>
>> It now becomes the responsibility of the slow path to maintain *sptep &
>> PT_MAY_ALLOW_WRITES, but that path has a simpler concurrency model.  It
>> can be as simple as a clear_bit() before we update sp->gfns[] or if we
>> add host write protection.
>>
>> Sorry, it's too complicated for me.  Marcelo, what's your take?
> 
> The improvement is small and limited to special cases (migration should
> be rare and framebuffer memory accounts for a small percentage of total
> memory).
> 
> For one, how can this be safe against mmu notifier methods?
> 
> KSM |VCPU0| VCPU1
> | fault   | fault
> | cow-page|
> | set spte RW |
> | |
> write protect host pte  | |
> grab mmu_lock   | |
> remove writeable bit in spte  |   |
> increase mmu_notifier_seq |   |  spte = read-only spte
> release mmu_lock| |  cmpxchg succeeds, RO->RW!
> 
> MMU notifiers rely on the fault path sequence being
> 
> read host pte
> read mmu_notifier_seq
> spin_lock(mmu_lock)
> if (mmu_notifier_seq changed)
>   goodbye, host pte value is stale
> spin_unlock(mmu_lock)
> 
> By the example above, you cannot rely on the spte value alone,
> mmu_notifier_seq must be taken into account.


No.

When KSM changes the host page to read-only, the HOST_WRITABLE bit of the
spte should be removed; that means the spte is changed, and that change is
caught by the cmpxchg.

Note: we mark an spte as writable only if spte.HOST_WRITABLE is set.


Re: [PATCH 00/13] KVM: MMU: fast page fault

2012-04-09 Thread Marcelo Tosatti
On Mon, Apr 09, 2012 at 04:12:46PM +0300, Avi Kivity wrote:
> On 03/29/2012 11:20 AM, Xiao Guangrong wrote:
> > * Idea
> > The present bit of page fault error code (EFEC.P) indicates whether the
> > page table is populated on all levels, if this bit is set, we can know
> > the page fault is caused by the page-protection bits (e.g. W/R bit) or
> > the reserved bits.
> >
> > In KVM, in most cases, all this kind of page fault (EFEC.P = 1) can be
> > simply fixed: the page fault caused by reserved bit
> > (EFFC.P = 1 && EFEC.RSV = 1) has already been filtered out in fast mmio
> > path. What we need do to fix the rest page fault (EFEC.P = 1 && RSV != 1)
> > is just increasing the corresponding access on the spte.
> >
> > This pachset introduces a fast path to fix this kind of page fault: it
> > is out of mmu-lock and need not walk host page table to get the mapping
> > from gfn to pfn.
> >
> >
> 
> This patchset is really worrying to me.
> 
> It introduces a lot of concurrency into data structures that were not
> designed for it.  Even if it is correct, it will be very hard to
> convince ourselves that it is correct, and if it isn't, to debug those
> subtle bugs.  It will also be much harder to maintain the mmu code than
> it is now.
> 
> There are a lot of things to check.  Just as an example, we need to be
> sure that if we use rcu_dereference() twice in the same code path, that
> any inconsistencies due to a write in between are benign.  Doing that is
> a huge task.
> 
> But I appreciate the performance improvement and would like to see a
> simpler version make it in.  This needs to reduce the amount of data
> touched in the fast path so it is easier to validate, and perhaps reduce
> the number of cases that the fast path works on.
> 
> I would like to see the fast path as simple as
> 
>   rcu_read_lock();
> 
>   (lockless shadow walk)
>   spte = ACCESS_ONCE(*sptep);
> 
>   if (!(spte & PT_MAY_ALLOW_WRITES))
> goto slow;
> 
>   gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->sptes)
>   mark_page_dirty(kvm, gfn);
> 
>   new_spte = spte & ~(PT64_MAY_ALLOW_WRITES | PT_WRITABLE_MASK);
>   if (cmpxchg(sptep, spte, new_spte) != spte)
>goto slow;
> 
>   rcu_read_unlock();
>   return;
> 
> slow:
>   rcu_read_unlock();
>   slow_path();
> 
> It now becomes the responsibility of the slow path to maintain *sptep &
> PT_MAY_ALLOW_WRITES, but that path has a simpler concurrency model.  It
> can be as simple as a clear_bit() before we update sp->gfns[] or if we
> add host write protection.
> 
> Sorry, it's too complicated for me.  Marcelo, what's your take?

The improvement is small and limited to special cases (migration should
be rare and framebuffer memory accounts for a small percentage of total
memory).

For one, how can this be safe against mmu notifier methods?

KSM   |VCPU0| VCPU1
  | fault   | fault
  | cow-page|
  | set spte RW |
  | |
write protect host pte| |
grab mmu_lock | |
remove writeable bit in spte  | |
increase mmu_notifier_seq | |  spte = read-only spte
release mmu_lock  | |  cmpxchg succeeds, RO->RW!

MMU notifiers rely on the fault path sequence being

read host pte
read mmu_notifier_seq
spin_lock(mmu_lock)
if (mmu_notifier_seq changed)
goodbye, host pte value is stale
spin_unlock(mmu_lock)

By the example above, you cannot rely on the spte value alone,
mmu_notifier_seq must be taken into account.
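
For reference, a rough sketch of how that ordering looks in the fault path
(helper names are from memory and may not match the source exactly):

	mmu_seq = vcpu->kvm->mmu_notifier_seq;
	smp_rmb();
	pfn = gfn_to_pfn(vcpu->kvm, gfn);	/* reads the host pte */

	spin_lock(&vcpu->kvm->mmu_lock);
	if (mmu_notifier_retry(vcpu, mmu_seq))	/* seq changed: pfn is stale */
		goto out_unlock;
	/* ... install or update the spte ... */
	spin_unlock(&vcpu->kvm->mmu_lock);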



[PATCH 4/5] KVM: x86 emulator: MMX support

2012-04-09 Thread Avi Kivity
General support for the MMX instruction set.  Special care is taken
to trap pending x87 exceptions so that they are properly reflected
to the guest.

Signed-off-by: Avi Kivity 
---
 arch/x86/include/asm/kvm_emulate.h |4 +-
 arch/x86/kvm/emulate.c |  103 ++--
 2 files changed, 102 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_emulate.h 
b/arch/x86/include/asm/kvm_emulate.h
index c222e1a..1ac46c22 100644
--- a/arch/x86/include/asm/kvm_emulate.h
+++ b/arch/x86/include/asm/kvm_emulate.h
@@ -200,7 +200,7 @@ typedef u32 __attribute__((vector_size(16))) sse128_t;
 
 /* Type, address-of, and value of an instruction's operand. */
 struct operand {
-   enum { OP_REG, OP_MEM, OP_IMM, OP_XMM, OP_NONE } type;
+   enum { OP_REG, OP_MEM, OP_IMM, OP_XMM, OP_MM, OP_NONE } type;
unsigned int bytes;
union {
unsigned long orig_val;
@@ -213,12 +213,14 @@ struct operand {
unsigned seg;
} mem;
unsigned xmm;
+   unsigned mm;
} addr;
union {
unsigned long val;
u64 val64;
char valptr[sizeof(unsigned long) + 2];
sse128_t vec_val;
+   u64 mm_val;
};
 };
 
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index fb39e0b..0011b4a 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -142,6 +142,7 @@
 #define Src2FS  (OpFS << Src2Shift)
 #define Src2GS  (OpGS << Src2Shift)
 #define Src2Mask(OpMask << Src2Shift)
+#define Mmx ((u64)1 << 40)  /* MMX Vector instruction */
 #define Aligned ((u64)1 << 41)  /* Explicitly aligned (e.g. MOVDQA) */
 #define Unaligned   ((u64)1 << 42)  /* Explicitly unaligned (e.g. MOVDQU) */
 #define Avx ((u64)1 << 43)  /* Advanced Vector Extensions */
@@ -887,6 +888,40 @@ static void write_sse_reg(struct x86_emulate_ctxt *ctxt, 
sse128_t *data,
ctxt->ops->put_fpu(ctxt);
 }
 
+static void read_mmx_reg(struct x86_emulate_ctxt *ctxt, u64 *data, int reg)
+{
+   ctxt->ops->get_fpu(ctxt);
+   switch (reg) {
+   case 0: asm("movq %%mm0, %0" : "=m"(*data)); break;
+   case 1: asm("movq %%mm1, %0" : "=m"(*data)); break;
+   case 2: asm("movq %%mm2, %0" : "=m"(*data)); break;
+   case 3: asm("movq %%mm3, %0" : "=m"(*data)); break;
+   case 4: asm("movq %%mm4, %0" : "=m"(*data)); break;
+   case 5: asm("movq %%mm5, %0" : "=m"(*data)); break;
+   case 6: asm("movq %%mm6, %0" : "=m"(*data)); break;
+   case 7: asm("movq %%mm7, %0" : "=m"(*data)); break;
+   default: BUG();
+   }
+   ctxt->ops->put_fpu(ctxt);
+}
+
+static void write_mmx_reg(struct x86_emulate_ctxt *ctxt, u64 *data, int reg)
+{
+   ctxt->ops->get_fpu(ctxt);
+   switch (reg) {
+   case 0: asm("movq %0, %%mm0" : : "m"(*data)); break;
+   case 1: asm("movq %0, %%mm1" : : "m"(*data)); break;
+   case 2: asm("movq %0, %%mm2" : : "m"(*data)); break;
+   case 3: asm("movq %0, %%mm3" : : "m"(*data)); break;
+   case 4: asm("movq %0, %%mm4" : : "m"(*data)); break;
+   case 5: asm("movq %0, %%mm5" : : "m"(*data)); break;
+   case 6: asm("movq %0, %%mm6" : : "m"(*data)); break;
+   case 7: asm("movq %0, %%mm7" : : "m"(*data)); break;
+   default: BUG();
+   }
+   ctxt->ops->put_fpu(ctxt);
+}
+
 static void decode_register_operand(struct x86_emulate_ctxt *ctxt,
struct operand *op)
 {
@@ -903,6 +938,13 @@ static void decode_register_operand(struct 
x86_emulate_ctxt *ctxt,
read_sse_reg(ctxt, &op->vec_val, reg);
return;
}
+   if (ctxt->d & Mmx) {
+   reg &= 7;
+   op->type = OP_MM;
+   op->bytes = 8;
+   op->addr.mm = reg;
+   return;
+   }
 
op->type = OP_REG;
if (ctxt->d & ByteOp) {
@@ -948,6 +990,12 @@ static int decode_modrm(struct x86_emulate_ctxt *ctxt,
read_sse_reg(ctxt, &op->vec_val, ctxt->modrm_rm);
return rc;
}
+   if (ctxt->d & Mmx) {
+   op->type = OP_MM;
+   op->bytes = 8;
+   op->addr.xmm = ctxt->modrm_rm & 7;
+   return rc;
+   }
fetch_register_operand(op);
return rc;
}
@@ -1415,6 +1463,9 @@ static int writeback(struct x86_emulate_ctxt *ctxt)
case OP_XMM:
write_sse_reg(ctxt, &ctxt->dst.vec_val, ctxt->dst.addr.xmm);
break;
+   case OP_MM:
+   write_mmx_reg(ctxt, &ctxt->dst.mm_val, ctxt->dst.addr.mm);
+   break;
case OP_NONE:
/* no writeback */
break;
@@ -3987,6 +4038,8 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void 
*insn, int insn_len)
 
 

[PATCH 3/5] KVM: x86 emulator: implement movntps

2012-04-09 Thread Avi Kivity
Used to write to framebuffers (by at least Icaros).

Signed-off-by: Avi Kivity 
---
 arch/x86/kvm/emulate.c |7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index b160fb1f..fb39e0b 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -3440,6 +3440,10 @@ static int check_perm_out(struct x86_emulate_ctxt *ctxt)
N, I(Sse | Aligned, em_mov), N, I(Sse | Unaligned, em_mov),
 };
 
+static struct gprefix pfx_vmovntpx = {
+   I(0, em_mov), N, N, N,
+};
+
 static struct opcode opcode_table[256] = {
/* 0x00 - 0x07 */
I6ALU(Lock, em_add),
@@ -3571,7 +3575,8 @@ static int check_perm_out(struct x86_emulate_ctxt *ctxt)
IIP(ModRM | SrcMem | Priv | Op3264, em_cr_write, cr_write, 
check_cr_write),
IIP(ModRM | SrcMem | Priv | Op3264, em_dr_write, dr_write, 
check_dr_write),
N, N, N, N,
-   N, N, N, N, N, N, N, N,
+   N, N, N, GP(ModRM | DstMem | SrcReg | Sse | Mov | Aligned, 
&pfx_vmovntpx),
+   N, N, N, N,
/* 0x30 - 0x3F */
II(ImplicitOps | Priv, em_wrmsr, wrmsr),
IIP(ImplicitOps, em_rdtsc, rdtsc, check_rdtsc),
-- 
1.7.10



[PATCH 2/5] KVM: x86: emulate movdqa

2012-04-09 Thread Avi Kivity
From: Stefan Hajnoczi 

An Ubuntu 9.10 Karmic Koala guest is unable to boot or install due to
missing movdqa emulation:

kvm_exit: reason EXCEPTION_NMI rip 0x7fef3e025a7b info 7fef3e799000 8b0e
kvm_page_fault: address 7fef3e799000 error_code f
kvm_emulate_insn: 0:7fef3e025a7b: 66 0f 7f 07 (prot64)

movdqa %xmm0,(%rdi)

[avi: mark it explicitly aligned]

Signed-off-by: Stefan Hajnoczi 
Signed-off-by: Avi Kivity 
---
 arch/x86/kvm/emulate.c |   10 ++
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 6302e5c..b160fb1f 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -2818,7 +2818,7 @@ static int em_rdpmc(struct x86_emulate_ctxt *ctxt)
 
 static int em_mov(struct x86_emulate_ctxt *ctxt)
 {
-   ctxt->dst.val = ctxt->src.val;
+   memcpy(ctxt->dst.valptr, ctxt->src.valptr, ctxt->op_bytes);
return X86EMUL_CONTINUE;
 }
 
@@ -2898,12 +2898,6 @@ static int em_mov_sreg_rm(struct x86_emulate_ctxt *ctxt)
return load_segment_descriptor(ctxt, sel, ctxt->modrm_reg);
 }
 
-static int em_movdqu(struct x86_emulate_ctxt *ctxt)
-{
-   memcpy(&ctxt->dst.vec_val, &ctxt->src.vec_val, ctxt->op_bytes);
-   return X86EMUL_CONTINUE;
-}
-
 static int em_invlpg(struct x86_emulate_ctxt *ctxt)
 {
int rc;
@@ -3443,7 +3437,7 @@ static int check_perm_out(struct x86_emulate_ctxt *ctxt)
 };
 
 static struct gprefix pfx_0f_6f_0f_7f = {
-   N, N, N, I(Sse | Unaligned, em_movdqu),
+   N, I(Sse | Aligned, em_mov), N, I(Sse | Unaligned, em_mov),
 };
 
 static struct opcode opcode_table[256] = {
-- 
1.7.10



[PATCH 5/5] KVM: x86 emulator: implement MMX MOVQ (opcodes 0f 6f, 0f 7f)

2012-04-09 Thread Avi Kivity
Needed by some framebuffer drivers.  See

https://bugzilla.kernel.org/show_bug.cgi?id=42779

Signed-off-by: Avi Kivity 
---
 arch/x86/kvm/emulate.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 0011b4a..d5729a9 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -3488,7 +3488,7 @@ static int check_perm_out(struct x86_emulate_ctxt *ctxt)
 };
 
 static struct gprefix pfx_0f_6f_0f_7f = {
-   N, I(Sse | Aligned, em_mov), N, I(Sse | Unaligned, em_mov),
+   I(Mmx, em_mov), I(Sse | Aligned, em_mov), N, I(Sse | Unaligned, em_mov),
 };
 
 static struct gprefix pfx_vmovntpx = {
-- 
1.7.10



[PATCH 0/5] MMX/SSE data movement

2012-04-09 Thread Avi Kivity
This patchset implements MMX registers, SSE data alignment, and three
instructions:

  MOVQ (MMX)
  MOVNTPS (SSE)
  MOVDQA (SSE)

all used in accessing framebuffers from guests.

Avi Kivity (4):
  KVM: x86 emulator: add support for vector alignment
  KVM: x86 emulator: implement movntps
  KVM: x86 emulator: MMX support
  KVM: x86 emulator: implement MMX MOVQ (opcodes 0f 6f, 0f 7f)

Stefan Hajnoczi (1):
  KVM: x86: emulate movdqa

 arch/x86/include/asm/kvm_emulate.h |4 +-
 arch/x86/kvm/emulate.c |  148 
 2 files changed, 138 insertions(+), 14 deletions(-)

-- 
1.7.10



[PATCH 1/5] KVM: x86 emulator: add support for vector alignment

2012-04-09 Thread Avi Kivity
x86 defines three classes of vector instructions: explicitly
aligned (#GP(0) if unaligned), explicitly unaligned, and default
(which depends on the encoding: AVX is unaligned, SSE is aligned).

Add support for marking an instruction as explicitly aligned or
unaligned, and mark MOVDQU as unaligned.

Signed-off-by: Avi Kivity 
---
 arch/x86/kvm/emulate.c |   30 +-
 1 file changed, 29 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 8375622..6302e5c 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -142,6 +142,9 @@
 #define Src2FS  (OpFS << Src2Shift)
 #define Src2GS  (OpGS << Src2Shift)
 #define Src2Mask(OpMask << Src2Shift)
+#define Aligned ((u64)1 << 41)  /* Explicitly aligned (e.g. MOVDQA) */
+#define Unaligned   ((u64)1 << 42)  /* Explicitly unaligned (e.g. MOVDQU) */
+#define Avx ((u64)1 << 43)  /* Advanced Vector Extensions */
 
 #define X2(x...) x, x
 #define X3(x...) X2(x), x
@@ -557,6 +560,29 @@ static void set_segment_selector(struct x86_emulate_ctxt 
*ctxt, u16 selector,
ctxt->ops->set_segment(ctxt, selector, &desc, base3, seg);
 }
 
+/*
+ * x86 defines three classes of vector instructions: explicitly
+ * aligned, explicitly unaligned, and the rest, which change behaviour
+ * depending on whether they're AVX encoded or not.
+ *
+ * Also included is CMPXCHG16B which is not a vector instruction, yet it is
+ * subject to the same check.
+ */
+static bool insn_aligned(struct x86_emulate_ctxt *ctxt, unsigned size)
+{
+   if (likely(size < 16))
+   return false;
+
+   if (ctxt->d & Aligned)
+   return true;
+   else if (ctxt->d & Unaligned)
+   return false;
+   else if (ctxt->d & Avx)
+   return false;
+   else
+   return true;
+}
+
 static int __linearize(struct x86_emulate_ctxt *ctxt,
 struct segmented_address addr,
 unsigned size, bool write, bool fetch,
@@ -621,6 +647,8 @@ static int __linearize(struct x86_emulate_ctxt *ctxt,
}
if (fetch ? ctxt->mode != X86EMUL_MODE_PROT64 : ctxt->ad_bytes != 8)
la &= (u32)-1;
+   if (insn_aligned(ctxt, size) && ((la & (size - 1)) != 0))
+   return emulate_gp(ctxt, 0);
*linear = la;
return X86EMUL_CONTINUE;
 bad:
@@ -3415,7 +3443,7 @@ static int check_perm_out(struct x86_emulate_ctxt *ctxt)
 };
 
 static struct gprefix pfx_0f_6f_0f_7f = {
-   N, N, N, I(Sse, em_movdqu),
+   N, N, N, I(Sse | Unaligned, em_movdqu),
 };
 
 static struct opcode opcode_table[256] = {
-- 
1.7.10



Re: [PATCH kvm-unit-test] make PMU test to pass on more cpu types

2012-04-09 Thread Gleb Natapov
ping.

On Thu, Mar 29, 2012 at 10:53:36AM +0200, Gleb Natapov wrote:
> Expand boundaries of some tests since it was observed that on some cpus
> results do not fit.
> 
> Signed-off-by: Gleb Natapov 
> diff --git a/x86/pmu.c b/x86/pmu.c
> index e1b8279..2c46f31 100644
> --- a/x86/pmu.c
> +++ b/x86/pmu.c
> @@ -80,8 +80,8 @@ struct pmu_event {
>  } gp_events[] = {
>   {"core cycles", 0x003c, 1*N, 50*N},
>   {"instructions", 0x00c0, 10*N, 10.2*N},
> - {"ref cycles", 0x013c, 1*N, 30*N},
> - {"llc refference", 0x4f2e, 1, 1*N},
> + {"ref cycles", 0x013c, 0.1*N, 30*N},
> + {"llc refference", 0x4f2e, 1, 2*N},
>   {"llc misses", 0x412e, 1, 1*N},
>   {"branches", 0x00c4, 1*N, 1.1*N},
>   {"branch misses", 0x00c5, 0, 0.1*N},
> --
>   Gleb.

--
Gleb.


[PATCH] KVM: PMU emulation: GLOBAL_CTRL MSR should be enabled on reset.

2012-04-09 Thread Gleb Natapov
On reset, all PMU counters should be enabled in the GLOBAL_CTRL MSR.

Signed-off-by: Gleb Natapov 
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 173df38..2e88438 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -459,17 +459,17 @@ void kvm_pmu_cpuid_update(struct kvm_vcpu *vcpu)
pmu->available_event_types = ~entry->ebx & ((1ull << bitmap_len) - 1);
 
if (pmu->version == 1) {
-   pmu->global_ctrl = (1 << pmu->nr_arch_gp_counters) - 1;
-   return;
+   pmu->nr_arch_fixed_counters = 0;
+   } else {
+   pmu->nr_arch_fixed_counters = min((int)(entry->edx & 0x1f),
+   X86_PMC_MAX_FIXED);
+   pmu->counter_bitmask[KVM_PMC_FIXED] =
+   ((u64)1 << ((entry->edx >> 5) & 0xff)) - 1;
}
 
-   pmu->nr_arch_fixed_counters = min((int)(entry->edx & 0x1f),
-   X86_PMC_MAX_FIXED);
-   pmu->counter_bitmask[KVM_PMC_FIXED] =
-   ((u64)1 << ((entry->edx >> 5) & 0xff)) - 1;
-   pmu->global_ctrl_mask = ~(((1 << pmu->nr_arch_gp_counters) - 1)
-   | (((1ull << pmu->nr_arch_fixed_counters) - 1)
-   << X86_PMC_IDX_FIXED));
+   pmu->global_ctrl = ((1 << pmu->nr_arch_gp_counters) - 1) |
+   (((1ull << pmu->nr_arch_fixed_counters) - 1) << 
X86_PMC_IDX_FIXED);
+   pmu->global_ctrl_mask = ~pmu->global_ctrl;
 }
 
 void kvm_pmu_init(struct kvm_vcpu *vcpu)
--
Gleb.


Re: [PATCH 00/13] KVM: MMU: fast page fault

2012-04-09 Thread Avi Kivity
On 04/09/2012 04:55 PM, Xiao Guangrong wrote:
>
> Okay, let's simplify it as possible:
>
> - let it only fix the page fault with PFEC.P == 1 && PFEC.W = 0, that means
>   unlock set_spte path can be dropped
>
> - let it just fixes the page fault caused by dirty-log that means we always
>   skip the spte which write-protected by shadow page protection.
>
> Then, things should be fair simper:
>
> In set_spte path, if the spte can be writable, we set ALLOW_WRITE bit
> In rmap_write_protect:
>   if (spte.PT_WRITABLE_MASK) {
>   WARN_ON(!(spte & ALLOW_WRITE));
>   spte &= ~PT_WRITABLE_MASK;
>   spte |= WRITE_PROTECT;
>   }
>
> in fast page fault:
>
> if (spte & PT_WRITABLE_MASK)
>   return_go_guest;
>
> if ((spte & ALLOW_WRITE) && !(spte & WRITE_PROTECT))
>   cmpxchg spte + PT_WRITABLE_MASK
>
> The information all we needed comes from spte it is independence from other 
> path,
> and no barriers.
>
>
> Hmm, how about this one?
>

I like it. WRITE_PROTECT is better than my ALLOW_WRITES, the meaning is
clearer.

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH 00/13] KVM: MMU: fast page fault

2012-04-09 Thread Xiao Guangrong
On 04/09/2012 09:55 PM, Xiao Guangrong wrote:


> 
> 
> Okay, let's simplify it as possible:
> 
> - let it only fix the page fault with PFEC.P == 1 && PFEC.W = 0, that means


PFEC.P == 1 && PFEC.W = 1, my typo.



Re: [PATCH 00/13] KVM: MMU: fast page fault

2012-04-09 Thread Xiao Guangrong
On 04/09/2012 09:20 PM, Avi Kivity wrote:

> On 04/06/2012 08:24 AM, Xiao Guangrong wrote:
>>
>> Foolish me, i should be crazy. Sorry for my mistake. :(
>>
>> Unfortunately, it can not work, we can not get a stable gfn from gpte or
>> sp->gfns[]. For example:
>>
>> beginning:
>> Gpte = Gfn1
>> gfn_to_pfn(Gfn1) = Pfn
>> Spte = Pfn
>> Gfn1 is write-free
>> Gfn2 is write-protected
>>
>>
>> VCPU 0  VCPU 1 VCPU 2
>>
>> fault on gpte
>> fast page fault path:
>>   set Spte.fast_pf
>>   get Gfn1 from Gpte/sp->gfns[]
>>   if (Gfn1 is writable)
>> Pfn is swapped out:
>>  Spte = 0
>>  Gpte is modified to Gfn2,
>> and Pfn is realloced and remapped
>> to Gfn2, so:
>> Spte = Pfn
>>
>>   fast page fault 
>> path:
>>  set Spte.fast_pf
>>
>>  cmpxchg  Spte+w
>> OOPS!!!
>>   <happily make it writable, so gfn2 can be writable>
>>
>> It seems only a unique identification can prevent this. :(
>>
> 
> Ouch.
> 
> What about restricting this to role.direct=1?  Then gfn is stable?
> 


Yes.

The gfn of a direct sp is stable since it is calculated from sp->gfn, which is
independent of the gpte.
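
For reference, the gfn of a direct sp is computed only from sp->gfn and the
spte index, roughly like this (a sketch from memory of mmu.c, not a quote):

	static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
	{
		if (!sp->role.direct)
			return sp->gfns[index];	/* can follow a moving gpte */

		/* direct sp: depends only on sp->gfn and the index */
		return sp->gfn + (index << ((sp->role.level - 1) * PT64_LEVEL_BITS));
	}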



Re: [PATCH 00/13] KVM: MMU: fast page fault

2012-04-09 Thread Xiao Guangrong
On 04/09/2012 09:12 PM, Avi Kivity wrote:

> On 03/29/2012 11:20 AM, Xiao Guangrong wrote:
>> * Idea
>> The present bit of page fault error code (EFEC.P) indicates whether the
>> page table is populated on all levels, if this bit is set, we can know
>> the page fault is caused by the page-protection bits (e.g. W/R bit) or
>> the reserved bits.
>>
>> In KVM, in most cases, all this kind of page fault (EFEC.P = 1) can be
>> simply fixed: the page fault caused by reserved bit
>> (EFFC.P = 1 && EFEC.RSV = 1) has already been filtered out in fast mmio
>> path. What we need do to fix the rest page fault (EFEC.P = 1 && RSV != 1)
>> is just increasing the corresponding access on the spte.
>>
>> This pachset introduces a fast path to fix this kind of page fault: it
>> is out of mmu-lock and need not walk host page table to get the mapping
>> from gfn to pfn.
>>
>>
> 
> This patchset is really worrying to me.
> 
> It introduces a lot of concurrency into data structures that were not
> designed for it.  Even if it is correct, it will be very hard to
> convince ourselves that it is correct, and if it isn't, to debug those
> subtle bugs.  It will also be much harder to maintain the mmu code than
> it is now.
> 
> There are a lot of things to check.  Just as an example, we need to be
> sure that if we use rcu_dereference() twice in the same code path, that
> any inconsistencies due to a write in between are benign.  Doing that is
> a huge task.
> 
> But I appreciate the performance improvement and would like to see a
> simpler version make it in.  This needs to reduce the amount of data
> touched in the fast path so it is easier to validate, and perhaps reduce
> the number of cases that the fast path works on.
> 
> I would like to see the fast path as simple as
> 
>   rcu_read_lock();
> 
>   (lockless shadow walk)
>   spte = ACCESS_ONCE(*sptep);
> 
>   if (!(spte & PT_MAY_ALLOW_WRITES))
> goto slow;
> 
>   gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->sptes)
>   mark_page_dirty(kvm, gfn);
> 
>   new_spte = spte & ~(PT64_MAY_ALLOW_WRITES | PT_WRITABLE_MASK);
>   if (cmpxchg(sptep, spte, new_spte) != spte)
>goto slow;
> 
>   rcu_read_unlock();
>   return;
> 
> slow:
>   rcu_read_unlock();
>   slow_path();
> 
> It now becomes the responsibility of the slow path to maintain *sptep &
> PT_MAY_ALLOW_WRITES, but that path has a simpler concurrency model.  It
> can be as simple as a clear_bit() before we update sp->gfns[] or if we
> add host write protection.
> 


Okay, let's simplify it as possible:

- let it only fix the page fault with PFEC.P == 1 && PFEC.W = 0, that means
  the out-of-mmu-lock set_spte path can be dropped

- let it just fix the page fault caused by dirty logging, which means we always
  skip sptes that are write-protected by shadow page protection.

Then, things should be fairly simple:

In set_spte path, if the spte can be writable, we set ALLOW_WRITE bit
In rmap_write_protect:
  if (spte.PT_WRITABLE_MASK) {
WARN_ON(!(spte & ALLOW_WRITE));
spte &= ~PT_WRITABLE_MASK;
spte |= WRITE_PROTECT;
  }

in fast page fault:

if (spte & PT_WRITABLE_MASK)
return_go_guest;

if ((spte & ALLOW_WRITE) && !(spte & WRITE_PROTECT))
cmpxchg spte + PT_WRITABLE_MASK

All the information we need comes from the spte; it is independent of other
paths, and no barriers are needed.


Hmm, how about this one?
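
To make the simplified proposal above concrete, here is a minimal C sketch of the
fast-path check it describes. The bit names (SPTE_ALLOW_WRITE, SPTE_WRITE_PROTECT)
and the helper name are assumptions for illustration only, not the encoding used in
the posted patches.

/* Illustrative only: bit positions and helper names are hypothetical. */
#define SPTE_ALLOW_WRITE	(1ull << 61)	/* set_spte: spte may become writable */
#define SPTE_WRITE_PROTECT	(1ull << 62)	/* rmap_write_protect: keep read-only */

static bool fast_pf_fix_spte(u64 *sptep)
{
	u64 spte = ACCESS_ONCE(*sptep);

	/* Already writable: nothing to do, just return to the guest. */
	if (spte & PT_WRITABLE_MASK)
		return true;

	/*
	 * Only restore the write bit on sptes that were write-protected purely
	 * for dirty logging: writable in principle (ALLOW_WRITE) and not under
	 * shadow-page write protection (WRITE_PROTECT).
	 */
	if (!(spte & SPTE_ALLOW_WRITE) || (spte & SPTE_WRITE_PROTECT))
		return false;			/* fall back to the slow path */

	return cmpxchg(sptep, spte, spte | PT_WRITABLE_MASK) == spte;
}

The slow paths (set_spte and rmap_write_protect) stay responsible for keeping the
two bits consistent, exactly as sketched in the message above.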



Re: [PATCH 09/13] KVM: MMU: get expected spte out of mmu-lock

2012-04-09 Thread Avi Kivity
On 04/09/2012 04:16 PM, Takuya Yoshikawa wrote:
> On Mon, 09 Apr 2012 15:28:47 +0300
> Avi Kivity  wrote:
>
> > Ah, so it's protecting two paths at the same time: rmap.write_protect ->
> > fast page fault, and lock(sptep) -> write protect.
> > 
> > The whole thing needs to be documented very carefully in locking.txt,
> > otherwise mmu.c will be write-protected to everyone except you.
>
> Before changing mmu_lock usage, better describe it in locking.txt.
>
> I guess from mmu.c git history, it is already restricted to a few
> people who can describe that.

It's getting harder and harder, yes.

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH 00/13] KVM: MMU: fast page fault

2012-04-09 Thread Avi Kivity
On 04/06/2012 08:24 AM, Xiao Guangrong wrote:
>
> Foolish me, i should be crazy. Sorry for my mistake. :(
>
> Unfortunately, it can not work, we can not get a stable gfn from gpte or
> sp->gfns[]. For example:
>
> beginning:
> Gpte = Gfn1
> gfn_to_pfn(Gfn1) = Pfn
> Spte = Pfn
> Gfn1 is write-free
> Gfn2 is write-protected
>
>
> VCPU 0                            VCPU 1                          VCPU 2
>
> fault on gpte
> fast page fault path:
>   set Spte.fast_pf
>   get Gfn1 from Gpte/sp->gfns[]
>   if (Gfn1 is writable)
>                                    Pfn is swapped out:
>                                       Spte = 0
>                                    Gpte is modified to Gfn2,
>                                    and Pfn is realloced and remapped
>                                    to Gfn2, so:
>                                       Spte = Pfn
>                                                                     fast page fault path:
>                                                                        set Spte.fast_pf
>
>   cmpxchg Spte+w
>   OOPS!!!
>   <happily make it writable, so gfn2 can be writable>
>
> It seems only a unique identification can prevent this. :(
>

Ouch.

What about restricting this to role.direct=1?  Then gfn is stable?

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH 09/13] KVM: MMU: get expected spte out of mmu-lock

2012-04-09 Thread Takuya Yoshikawa
On Mon, 09 Apr 2012 15:28:47 +0300
Avi Kivity  wrote:

> Ah, so it's protecting two paths at the same time: rmap.write_protect ->
> fast page fault, and lock(sptep) -> write protect.
> 
> The whole thing needs to be documented very carefully in locking.txt,
> otherwise mmu.c will be write-protected to everyone except you.

Before changing mmu_lock usage, better describe it in locking.txt.

I guess from mmu.c git history, it is already restricted to a few
people who can describe that.

Takuya


Re: [PATCH 00/13] KVM: MMU: fast page fault

2012-04-09 Thread Avi Kivity
On 03/29/2012 11:20 AM, Xiao Guangrong wrote:
> * Idea
> The present bit of the page fault error code (PFEC.P) indicates whether the
> page table is populated on all levels; if this bit is set, we know the
> page fault was caused by the page-protection bits (e.g. the W/R bit) or
> by the reserved bits.
>
> In KVM, in most cases, this kind of page fault (PFEC.P = 1) can be
> simply fixed: the page fault caused by a reserved bit
> (PFEC.P = 1 && PFEC.RSV = 1) has already been filtered out in the fast mmio
> path. What we need to do to fix the remaining page faults (PFEC.P = 1 && RSV != 1)
> is just to raise the corresponding access permission on the spte.
>
> This patchset introduces a fast path to fix this kind of page fault: it
> runs out of mmu-lock and need not walk the host page table to get the
> mapping from gfn to pfn.
>
>

This patchset is really worrying to me.

It introduces a lot of concurrency into data structures that were not
designed for it.  Even if it is correct, it will be very hard to
convince ourselves that it is correct, and if it isn't, to debug those
subtle bugs.  It will also be much harder to maintain the mmu code than
it is now.

There are a lot of things to check.  Just as an example, we need to be
sure that if we use rcu_dereference() twice in the same code path, that
any inconsistencies due to a write in between are benign.  Doing that is
a huge task.

But I appreciate the performance improvement and would like to see a
simpler version make it in.  This needs to reduce the amount of data
touched in the fast path so it is easier to validate, and perhaps reduce
the number of cases that the fast path works on.

I would like to see the fast path as simple as

  rcu_read_lock();

  (lockless shadow walk)
  spte = ACCESS_ONCE(*sptep);

  if (!(spte & PT_MAY_ALLOW_WRITES))
goto slow;

  gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->sptes)
  mark_page_dirty(kvm, gfn);

  new_spte = spte & ~(PT64_MAY_ALLOW_WRITES | PT_WRITABLE_MASK);
  if (cmpxchg(sptep, spte, new_spte) != spte)
   goto slow;

  rcu_read_unlock();
  return;

slow:
  rcu_read_unlock();
  slow_path();

It now becomes the responsibility of the slow path to maintain *sptep &
PT_MAY_ALLOW_WRITES, but that path has a simpler concurrency model.  It
can be as simple as a clear_bit() before we update sp->gfns[] or if we
add host write protection.

Sorry, it's too complicated for me.  Marcelo, what's your take?

-- 
error compiling committee.c: too many arguments to function
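
As a minimal sketch of the slow-path obligation described above (the bit name and
helper are placeholders, not code from the series): clear the fast-path permission
bit before changing what it guards, so a concurrent fast page fault either sees the
bit cleared or loses its cmpxchg.

/*
 * Sketch only: PT_MAY_ALLOW_WRITES_SHIFT and the call site are placeholders.
 * The point is the ordering: drop the fast-path permission bit first, then
 * update sp->gfns[] or add host write protection.
 */
static void disallow_fast_write(u64 *sptep)
{
	clear_bit(PT_MAY_ALLOW_WRITES_SHIFT, (unsigned long *)sptep);

	/*
	 * Depending on the final design a barrier may be wanted here, so the
	 * clear is visible before sp->gfns[] is changed or the gfn is
	 * write-protected; a racing fast page fault then either sees the bit
	 * cleared and bails out, or loses its cmpxchg.
	 */
}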



Re: [PATCH 09/13] KVM: MMU: get expected spte out of mmu-lock

2012-04-09 Thread Avi Kivity
On 04/05/2012 09:25 PM, Xiao Guangrong wrote:
> On 04/01/2012 11:53 PM, Avi Kivity wrote:
>
> > On 03/29/2012 11:25 AM, Xiao Guangrong wrote:
> >> It depends on the PTE_LIST_WRITE_PROTECT bit in the rmap, which lets us
> >> quickly know whether the page is writable out of mmu-lock
> >>
> >> Signed-off-by: Xiao Guangrong 
> >> ---
> >>  arch/x86/kvm/mmu.c |   17 +
> >>  arch/x86/kvm/paging_tmpl.h |2 +-
> >>  2 files changed, 14 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> >> index 3887a07..c029185 100644
> >> --- a/arch/x86/kvm/mmu.c
> >> +++ b/arch/x86/kvm/mmu.c
> >> @@ -1148,6 +1148,12 @@ static int rmap_write_protect(struct kvm *kvm, u64 gfn)
> >>
> >>*rmapp |= PTE_LIST_WRITE_PROTECT;
> >>
> >> +  /*
> >> +   * Setting PTE_LIST_WRITE_PROTECT bit before doing page
> >> +   * write-protect.
> >> +   */
> >> +  smp_mb();
> >> +
> > 
> > wmb only needed.
> > 
>
>
> We should ensure this bit is set before reading the spte; it cooperates with
> the fast page fault path to avoid this case:
>
> On fast page fault path:                 On rmap_write_protect path:
>                                          read spte: old_spte = *spte
>                                          (reading spte is reordered to the front
>                                          of setting PTE_LIST_WRITE_PROTECT bit)
> set spte.identification
> smp_mb
> if (!rmap.PTE_LIST_WRITE_PROTECT)
>                                          set rmap.PTE_LIST_WRITE_PROTECT
>      cmpxchg(sptep, spte, spte | WRITABLE)
>                                          see old_spte.identification is not set,
>                                          so it does not write-protect this page
>                                          OOPS!!!

Ah, so it's protecting two paths at the same time: rmap.write_protect ->
fast page fault, and lock(sptep) -> write protect.

The whole thing needs to be documented very carefully in locking.txt,
otherwise mmu.c will be write-protected to everyone except you.

> > Would it be better to store this bit in all the sptes instead?  We're
> > touching them in any case.  More work to clear them, but
> > un-write-protecting a page is beneficial anyway as it can save a fault.
> > 
>
> There are two reasons:
> - if we set this bit in the rmap, we can do a quick check to see whether the
>   page is writable before doing the shadow page walk.
>
> - since a full barrier is needed, we would have to use smp_mb for every spte,
>   like this:
>
>   while ((spte = rmap_next(rmapp, spte))) {
>           read spte
>           smp_mb
>           write-protect spte
>   }
>
>   smp_mb would be called in the loop; i think that is not good, yes?

Yes, agree.

> If you just want to save the fault, we can let all sptes be writable in
> mmu_need_write_protect, but we should cache the gpte access bits into the
> spte first.
> That should be another patchset, i think. :)

Yes.

-- 
error compiling committee.c: too many arguments to function
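
For readers trying to follow the ordering argument in this thread, a hedged sketch
of the intended barrier pairing: PTE_LIST_WRITE_PROTECT and PT_WRITABLE_MASK are
from the patches, while the helper names and the spte "identification" step are
placeholders.

/* rmap_write_protect() side: publish the flag, then look at the sptes. */
static void protect_side(unsigned long *rmapp)
{
	*rmapp |= PTE_LIST_WRITE_PROTECT;
	smp_mb();	/* pairs with the barrier in fast_side() */
	/* ... walk the rmap and clear PT_WRITABLE_MASK on each spte ... */
}

/* fast page fault side: mark the spte, then check the flag. */
static bool fast_side(u64 *sptep, unsigned long *rmapp)
{
	u64 spte = mark_spte_identification(sptep);	/* placeholder step */

	smp_mb();	/* pairs with the barrier in protect_side() */
	if (*rmapp & PTE_LIST_WRITE_PROTECT)
		return false;		/* being write-protected: take the slow path */

	return cmpxchg(sptep, spte, spte | PT_WRITABLE_MASK) == spte;
}

This is the usual store/smp_mb/load pattern on both sides: each path makes its own
store visible before checking the other path's flag, so at least one of them
notices the conflict.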



KVM call agenda for April, Tuesday 10

2012-04-09 Thread Juan Quintela

Hi

Please send in any agenda items you are interested in covering.

Cheers,

Juan.


[PATCH] skbuff: struct ubuf_info callback type safety

2012-04-09 Thread Michael S. Tsirkin
The skb struct ubuf_info callback gets passed struct ubuf_info
itself, not the arg value as the field name and the function signature
seem to imply. Rename the arg field to ctx to match usage,
add documentation and change the callback argument type
to make usage clear and to have compiler check correctness.

Signed-off-by: Michael S. Tsirkin 
---
 drivers/vhost/net.c|2 +-
 drivers/vhost/vhost.c  |5 ++---
 drivers/vhost/vhost.h  |2 +-
 include/linux/skbuff.h |7 ---
 4 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index f0da2c3..1f21d2a 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -238,7 +238,7 @@ static void handle_tx(struct vhost_net *net)
 
vq->heads[vq->upend_idx].len = len;
ubuf->callback = vhost_zerocopy_callback;
-   ubuf->arg = vq->ubufs;
+   ubuf->ctx = vq->ubufs;
ubuf->desc = vq->upend_idx;
msg.msg_control = ubuf;
msg.msg_controllen = sizeof(ubuf);
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 947f00d..51e4c1e 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1598,10 +1598,9 @@ void vhost_ubuf_put_and_wait(struct vhost_ubuf_ref *ubufs)
kfree(ubufs);
 }
 
-void vhost_zerocopy_callback(void *arg)
+void vhost_zerocopy_callback(struct ubuf_info *ubuf)
 {
-   struct ubuf_info *ubuf = arg;
-   struct vhost_ubuf_ref *ubufs = ubuf->arg;
+   struct vhost_ubuf_ref *ubufs = ubuf->ctx;
struct vhost_virtqueue *vq = ubufs->vq;
 
/* set len = 1 to mark this desc buffers done DMA */
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 8dcf4cc..8de1fd5 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -188,7 +188,7 @@ bool vhost_enable_notify(struct vhost_dev *, struct vhost_virtqueue *);
 
 int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
unsigned int log_num, u64 len);
-void vhost_zerocopy_callback(void *arg);
+void vhost_zerocopy_callback(struct ubuf_info *);
 int vhost_zerocopy_signal_used(struct vhost_virtqueue *vq);
 
 #define vq_err(vq, fmt, ...) do {  \
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 3337027..4c3f138 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -238,11 +238,12 @@ enum {
 /*
  * The callback notifies userspace to release buffers when skb DMA is done in
  * lower device, the skb last reference should be 0 when calling this.
- * The desc is used to track userspace buffer index.
+ * The ctx field is used to track device context.
+ * The desc field is used to track userspace buffer index.
  */
 struct ubuf_info {
-   void (*callback)(void *);
-   void *arg;
+   void (*callback)(struct ubuf_info *);
+   void *ctx;
unsigned long desc;
 };
 
-- 
1.7.9.111.gf3fb0
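
As a usage illustration of the new signature (hypothetical driver-side code, not
part of the patch): the completion callback now receives the ubuf_info itself and
recovers its private state from ctx.

/* Hypothetical driver-side usage of the new callback signature. */
struct my_zc_state {
	int id;		/* whatever per-buffer bookkeeping the driver needs */
};

static void my_zerocopy_done(struct ubuf_info *ubuf)
{
	struct my_zc_state *state = ubuf->ctx;	/* ctx carries the device context */

	/* ... release the userspace buffers tracked by ubuf->desc ... */
	(void)state;
}

static void my_zerocopy_setup(struct ubuf_info *ubuf, struct my_zc_state *state,
			      unsigned long buf_index)
{
	ubuf->callback = my_zerocopy_done;
	ubuf->ctx = state;		/* was the overloaded "arg" field before */
	ubuf->desc = buf_index;		/* userspace buffer index */
}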


Re: Questing regarding KVM Guest PMU

2012-04-09 Thread Gleb Natapov
On Sun, Apr 08, 2012 at 06:27:03PM +0300, Gleb Natapov wrote:
> On Fri, Apr 06, 2012 at 09:50:50AM +0300, Gleb Natapov wrote:
> > On Fri, Apr 06, 2012 at 10:43:17AM +0530, shashank rachamalla wrote:
> > > On Thu, Apr 5, 2012 at 8:11 PM, Gleb Natapov  wrote:
> > > > On Thu, Apr 05, 2012 at 05:38:40PM +0300, Avi Kivity wrote:
> > > >> On 04/05/2012 04:57 PM, Gleb Natapov wrote:
> > > >> > > >
> > > >> > > May be it used NMI based profiling. We should ask oprofile 
> > > >> > > developers.
> > > >> > > As I said I am almost sure my inability to run it on a host is 
> > > >> > > probably
> > > >> > > PEBKAC, although I ran the same script exactly on the host and the
> > > >> > > guest (the script is from the first email of this thread)
> > > >> > >
> > > >> > After upgrading the kernel to latest git from whatever it was there 
> > > >> > the
> > > >> > same script works and counts CPU_CLK_UNHALT events.
> > > >> >
> > > >>
> > > >> This is even while it violates the Intel guidelines?
> > > >>
> > > > Yes, but who says the result is correct :) It seems that we handle
> > > > global ctrl msr wrong. That is counter can be enabled either in global
> > > > ctrl or in eventsel. Trying to confirm that.
> > > >
> > > if that becomes true then will global ctrl msr have any significance ?
> > When it is in use yes.
> > 
> I was wrong. We do handle global ctrl msr correctly, I just ran my test
> incorrectly. If I disable global ctrl on all cpus (for i in `seq 0 15`;
> do wrmsr -p $i 0x38f 0; done) oprofile stops working.
> 
After searching high and low I finally found the following in
"Performance Monitoring Unit Sharing Guide" white paper:

  Known starting state: Software requires a known starting
  state. After CPU reset, all counters and control registers are
  disabled and clear/reset to 0.  The only exception to this is the
  IA32_PERF_GLOBAL_CTRL control MSR, all programmable counter global
  enable bits are reset to 1.

Patch will follow.

--
Gleb.
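
A hedged sketch of what such a fix could look like in KVM's vPMU initialization
(not the actual patch; field names follow struct kvm_pmu only approximately): on
vCPU reset, the guest-visible IA32_PERF_GLOBAL_CTRL comes up with the enable bit of
every general-purpose counter set, matching the white-paper wording quoted above.

/*
 * Illustrative only: mirror the documented power-on state, in which
 * IA32_PERF_GLOBAL_CTRL has the global enable bit of every programmable
 * counter set to 1 while everything else is 0.
 */
static void pmu_reset_global_ctrl(struct kvm_pmu *pmu)
{
	pmu->global_ctrl = (1ull << pmu->nr_arch_gp_counters) - 1;
}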


Re: [PATCH v2] kvm: Disable MSI/MSI-X in assigned device reset path

2012-04-09 Thread Avi Kivity
On 04/08/2012 08:37 PM, Jan Kiszka wrote:
> The core problem is not the ordering. The problem is that the kernel is
> susceptible to ordering mistakes of userspace. And that is because the
> kernel panics on PCI errors of devices that are in user hands - a
> critical kernel bug IMHO. 

Certainly.  But this userspace patch won't fix it.

> Proper reset of MSI or even the whole PCI
> config space is another issue, but one the kernel should not worry about
> - still, it should be fixed (therefore this patch).

And I was asking what is the right way to do it.  Reset the device and
read back the register values, or do an emulated reset and push down the
register values.

> But even if we disallowed userland to disable MMIO and PIO access to the
> device, we would not be able to rule out secret channels in the device's
> interface having the same effect. So we likely need to enhance PCI error
> handling to catch and handle faults for certain devices differently -
> those we cannot trust to behave properly while they are under
> userland/guest control.

Why not all of them?

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH 4/4] virtio_blk: use disk_name_format() to support mass of disks naming

2012-04-09 Thread Michael S. Tsirkin
On Mon, Apr 09, 2012 at 11:47:51AM +0800, Ren Mingxin wrote:
>  On 04/04/2012 04:01 PM, Michael S. Tsirkin wrote:
> >On Mon, Apr 02, 2012 at 12:00:45PM -0700, Tejun Heo wrote:
> >>On Mon, Apr 02, 2012 at 11:56:18AM -0700, James Bottomley wrote:
> >>>So if we're agreed no other devices going forwards should ever use this
> >>>interface, is there any point unifying the interface?  No matter how
> >>>many caveats you hedge it round with, putting the API in a central place
> >>>will be a bit like a honey trap for careless bears.   It might be safer
> >>>just to leave it buried in the three current drivers.
> >>Yeah, that was my hope but I think it would be easier to enforce to
> >>have a common function which is clearly marked legacy so that new
> >>driver writers can go look for the naming code in the existing ones,
> >>find out they're all using the same function which is marked legacy
> >>and explains what to do for newer drivers.
> >I think I'm not the only one to be confused about the
> >preferred direction here.
> >James, do you agree to the approach above?
> >
> >It would be nice to fix virtio block for 3.4, so
> >how about this:
> >- I'll just apply the original bugfix patch for 3.4 -
> >   it only affects virtio
> 
> Sorry, about "only affects virtio", I'm not very clear here:
> 1) Just duplicate the disk name format function in virtio_blk
> like the original patch: https://lkml.org/lkml/2012/3/28/45

So I'd like to apply this, and we can discuss the
deduplication for 3.5. Please post a version of this that
1. isn't line-wrapped and doesn't have damaged whitespace
   so I can run git am on it
2. lists the # of supported disks correctly as 26^3+(26^2+26)
   in the description (a naming sketch follows below)

Thanks!

> 2) Move the disk name format function into block core like
> this patch series, but only affecting virtio (not mtip32xx).
> Do you mean the 2) one or something else?
> 
> >- Ren will repost the refactoring patch on top, and we can
> >   keep up the discussion
> >
> >Ren if you agree, can you make this a two patch series please?
> >
> 
> Sure.
> 
> -- 
> Thanks,
> Ren
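
For context, a hedged sketch of the sd-style naming scheme under discussion, in the
spirit of sd_format_disk_name(); the helper name and exact error handling are
illustrative, not the code in the posted patch. With up to three suffix letters it
yields the 26 + 26^2 + 26^3 names mentioned above (vda..vdz, vdaa..vdzz,
vdaaa..vdzzz).

/*
 * Illustrative helper: map a 0-based disk index to an sd-style suffix,
 * e.g. 0 -> "vda", 25 -> "vdz", 26 -> "vdaa", 701 -> "vdzz", 702 -> "vdaaa".
 */
static int format_disk_name(const char *prefix, int index, char *buf, int buflen)
{
	const int base = 'z' - 'a' + 1;
	char *begin = buf + strlen(prefix);
	char *end = buf + buflen;
	char *p = end - 1;

	*p = '\0';
	do {
		if (p == begin)
			return -EINVAL;		/* buffer too small for this index */
		*--p = 'a' + (index % base);
		index = index / base - 1;
	} while (index >= 0);

	memmove(begin, p, end - p);		/* shift the suffix next to the prefix */
	memcpy(buf, prefix, strlen(prefix));

	return 0;
}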


Re: [PATCH 4/4] virtio_blk: use disk_name_format() to support mass of disks naming

2012-04-09 Thread Michael S. Tsirkin
On Mon, Apr 09, 2012 at 11:47:51AM +0800, Ren Mingxin wrote:
>  On 04/04/2012 04:01 PM, Michael S. Tsirkin wrote:
> >On Mon, Apr 02, 2012 at 12:00:45PM -0700, Tejun Heo wrote:
> >>On Mon, Apr 02, 2012 at 11:56:18AM -0700, James Bottomley wrote:
> >>>So if we're agreed no other devices going forwards should ever use this
> >>>interface, is there any point unifying the interface?  No matter how
> >>>many caveats you hedge it round with, putting the API in a central place
> >>>will be a bit like a honey trap for careless bears.   It might be safer
> >>>just to leave it buried in the three current drivers.
> >>Yeah, that was my hope but I think it would be easier to enforce to
> >>have a common function which is clearly marked legacy so that new
> >>driver writers can go look for the naming code in the existing ones,
> >>find out they're all using the same function which is marked legacy
> >>and explains what to do for newer drivers.
> >I think I'm not the only one to be confused about the
> >preferred direction here.
> >James, do you agree to the approach above?
> >
> >It would be nice to fix virtio block for 3.4, so
> >how about this:
> >- I'll just apply the original bugfix patch for 3.4 -
> >   it only affects virtio
> 
> Sorry, about "only affects virtio", I'm not very clear here:
> 1) Just duplicate the disk name format function in virtio_blk
> like the original patch: https://lkml.org/lkml/2012/3/28/45
> 2) Move the disk name format function into block core like
> this patch series, but only affecting virtio (not mtip32xx).
> Do you mean the 2) one or something else?

I mean 1) - I'll apply the original patch.

> >- Ren will repost the refactoring patch on top, and we can
> >   keep up the discussion
> >
> >Ren if you agree, can you make this a two patch series please?
> >
> 
> Sure.
> 
> -- 
> Thanks,
> Ren