Re: [Qemu-devel] qemu and qemu.git - Migration + disk stress introduces qcow2 corruptions

2011-11-11 Thread Kevin Wolf
Am 10.11.2011 22:30, schrieb Anthony Liguori:
 Live migration with qcow2 or any other image format is just not going to work 
 right now even with proper clustered storage.  I think doing a block level 
 flush 
 cache interface and letting block devices decide how to do it is the best 
 approach.

I would really prefer reusing the existing open/close code. It means
less (duplicated) code, is existing code that is well tested and doesn't
make migration much of a special case.

If you want to avoid reopening the file on the OS level, we can reopen
only the topmost layer (i.e. the format, but not the protocol) for now
and in 1.1 we can use bdrv_reopen().

Kevin


Re: [libvirt] Add driver support for Native Linux KVM Tool

2011-11-11 Thread Osier Yang

Sorry, I missed the title. :(

Osier
On 11/11/2011 07:56 PM, Osier Yang wrote:

Hi, all

This is a basic implementation of libvirt Native Linux KVM
Tool driver. Note that this is just made with my own interest
and spare time, it's not an endorsement/effort by Red Hat,
and it isn't supported by Red Hat officially.

Basically, the driver is designed as *stateful*, as KVM tool
doesn't maintain any info about the guest except a socket used
for its own IPC. It is implemented on top of the KVM tool binary,
which is currently named kvm, along with the cgroup controllers
cpuacct and memory. As one of KVM tool's principles is to let
both non-root and root users play with it, the driver is designed
to support root and non-root too, just like QEMU does. Examples
of the connection URI:

 virsh -c kvmtool:///system
 virsh -c kvmtool:///session
 virsh -c kvmtool+unix:///system
 virsh -c kvmtool+unix:///session

The implementation currently supports about 15 virsh commands,
including basic domain lifecycle operations (define/undefine,
start/destroy, suspend/resume, console, setmem, schedinfo, dumpxml,
autostart, dominfo, etc.)

About the domain configuration:
   * kernel: must be specified, as KVM tool currently only supports
  booting from a kernel (no integration with a BIOS app yet).

   * disk: only virtio bus is supported, and device type must be 'disk'.

   * serial/console: only one console is supported, of type serial or
  virtio (this can be extended to multiple consoles once kvm tool
  supports them; libvirt already supports multiple consoles, see
  upstream commit 0873b688c).

   * p9fs: only the source dir and mount tag can be specified, and only
  the 'mount' type is supported.

   * memballoon: only virtio is supported, and there is no way
  to configure the address.

   * Multiple disks and p9fs mounts are supported.

   * Graphics and network are not supported; this is explained below.

Please see [PATCH 7/8] for an example of the domain config (it
contains all the XML supported by the current implementation).

The problems with Native Linux KVM Tool from libvirt's p.o.v.:

   * Some distros package qemu-kvm as kvm, and kvm is also a long
 established name for KVM itself, so naming the project
 kvm might not be a good idea. I assume it will be named
 kvmtool in this implementation; never mind if you don't
 like that, it can be changed easily. :-)

   * It doesn't have an official package yet, not even a make install,
 which means we have no way to check the dependency and do the
 checking at 'configure' time. I assume it will be installed as
 /usr/bin/kvmtool in this implementation. I guess this is the main
 reason that may prevent upstream libvirt from accepting the patches.

   * It lacks options for user configuration. For example with -vnc
 there is no option for the user to configure properties such as
 the port. It hides things and provides no way to query these
 properties either, which makes it hard for libvirt to add
 VNC support, as VNC clients such as virt-manager and virt-viewer
 have no way to connect to the guest. Even vncviewer can't.

   * KVM tool manages the network completely by itself (with DHCP support?);
 there is no way to configure it except specifying the mode
 (user|tap|none). I have not tested it yet, but it probably needs an
 explicit script to set up the network rules (e.g. NAT) for the guest
 to access the outside world. In any case, there is no way for libvirt
 to control the guest network.

   * There is a gap in domain status between KVM tool and libvirt:
 KVM tool unlink()s the guest socket when the user exits
 the console (both text and graphic), but libvirt still thinks the
 guest is running.

   * KVM tool uses $HOME/.kvm_tool as the state dir, with no way to configure
 it. I made a small patch to let KVM tool accept an ENV variable,
 KVM_STATE_DIR, which is used across the driver; this simple patch
 against kvm tool is needed for the whole series to work, see
 [PATCH] kvm tools.. Generally we want the state dir of
 a driver to be /var/run/libvirt/kvmtool/... for the root user or
 $HOME/.libvirt/kvmtool/run for non-root users.

   * kvmtoolGetVersion is simply broken now, as ./kvm version returns
 something like 3.0.rc5.873.gb73216, while libvirt wants something
 like 2.6.40.6-0. This may stop being a problem once KVM tool
 has an official package.

   * console connection is implemented by setting up ptys in libvirt: stdout/stderr
 of the kvm tool process are redirected to the master pty, and libvirt connects
 to the slave pty (a generic sketch of this pty technique follows this list).
 This works fine now, but it might be better if kvm tool could provide
 more advanced console mechanisms, just like QEMU does.

   * There are not many ways yet for external apps or users to query guest
 information. But this may change soon, as KVM tool is growing up
 quickly.

   * It will 
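A minimal sketch of the pty redirection technique mentioned in the console
item above. This is illustrative only, not the driver's actual code, and the
"kvmtool" binary name is an assumption (there is no official install location
yet):

/* Run a child with its stdout/stderr attached to one end of a pty pair
 * while the parent reads the other end (build with -lutil). */
#include <pty.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int master, slave;
    char buf[256];
    ssize_t n;

    if (openpty(&master, &slave, NULL, NULL, NULL) < 0)
        return 1;

    if (fork() == 0) {
        /* child: send stdout/stderr to the pty */
        close(master);
        dup2(slave, STDOUT_FILENO);
        dup2(slave, STDERR_FILENO);
        execlp("kvmtool", "kvmtool", "run", (char *)NULL); /* assumed name */
        _exit(127);
    }

    /* parent ("console" side): read whatever the child writes */
    close(slave);
    while ((n = read(master, buf, sizeof(buf))) > 0)
        write(STDOUT_FILENO, buf, n);
    return 0;
}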

Re: [PATCH v4 2/7] iommu/core: split mapping to page sizes as supported by the hardware

2011-11-11 Thread Joerg Roedel
On Thu, Nov 10, 2011 at 07:28:39PM +, David Woodhouse wrote:

 ... which implies that a mapping, once made, might *never* actually get
 torn down until we loop and start reusing address space? That has
 interesting security implications.

Yes, it is a trade-off between security and performance. But if the user
wants more security the unmap_flush parameter can be used.

 Is it true even for devices which have been assigned to a VM and then
 unassigned?

No, this is only used in the DMA-API path. The device-assignment code
uses the IOMMU-API directly. There the IOTLB is always flushed on unmap.

  There is something similar on the AMD IOMMU side. There it is called
  unmap_flush.
 
 OK, so that definitely wants consolidating into a generic option.

Agreed.

  Some time ago I proposed the iommu_commit() interface which changes
  these requirements. With this interface the requirement is that after a
  couple of map/unmap operations the IOMMU-API user has to call
  iommu_commit() to make these changes visible to the hardware (so mostly
  sync the IOTLBs). As discussed at that time this would make sense for
  the Intel and AMD IOMMU drivers.
 
 I would *really* want to keep those off the fast path (thinking mostly
 about DMA API here, since that's the performance issue). But as long as
 we can achieve that, that's fine.

For AMD IOMMU there is a feature called not-present cache. It says that
the IOMMU caches non-present entries as well and needs an IOTLB flush
when something is mapped (meant for software implementations of the
IOMMU).
So it can't be really taken out of the fast-path. But the IOMMU driver
can optimize the function so that it only flushes the IOTLB when there
was an unmap call before. It is also an improvement over the current
situation, where every iommu_unmap call implicitly results in a flush.
That implicit flush is pretty much a no-go for using the IOMMU-API for
DMA mapping at the moment.
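To illustrate the optimization described above, here is a minimal sketch in
plain C with stub helpers (none of these names come from an actual driver):
flush the IOTLB on map only when the hardware caches non-present entries and
an unmap has happened since the last flush.

#include <stdbool.h>
#include <stddef.h>

struct example_domain {
    bool need_flush;   /* set by unmap, cleared by the next flush */
    bool npcache;      /* hardware caches non-present entries */
};

/* Stubs standing in for the real page-table and IOTLB operations. */
static void clear_ptes(struct example_domain *d, unsigned long iova, size_t len) {}
static void set_ptes(struct example_domain *d, unsigned long iova,
                     unsigned long paddr, size_t len) {}
static void flush_iotlb(struct example_domain *d) {}

static void example_unmap(struct example_domain *dom, unsigned long iova,
                          size_t size)
{
    clear_ptes(dom, iova, size);
    dom->need_flush = true;          /* the IOTLB may still hold the old entry */
}

static void example_map(struct example_domain *dom, unsigned long iova,
                        unsigned long paddr, size_t size)
{
    set_ptes(dom, iova, paddr, size);

    /* Flush on map only for IOMMUs that cache non-present entries, and
     * only if an unmap happened since the last flush. */
    if (dom->npcache && dom->need_flush) {
        flush_iotlb(dom);
        dom->need_flush = false;
    }
}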

 But also, it's not *so* much of an issue to divide the space up even
 when it's limited. The idea was not to have it *strictly* per-CPU, but
 just for a CPU to try allocating from its own subrange first, and then
 fall back to allocating a new subrange, and *then* fall back to
 allocating from subranges belonging to other CPUs. It's not that the
 allocation from a subrange would be lockless — it's that the lock would
 almost never leave the l1 cache of the CPU that *normally* uses that
 subrange.

Yeah, I get the idea. I fear that the memory consumption will get pretty
high with that approach. It basically means one round-robin allocator
per cpu and device. What does that mean on a 4096 CPU machine :)
How much the lock contention is lowered also depends on the workload.
If dma-handles are frequently freed from a different cpu than the one they
were allocated on, the same problem re-appears.
But in the end we have to try it out and see what works best :)
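For reference, a rough sketch of the allocation policy being discussed: try
the CPU's own subrange first, then carve out a new subrange, then steal from
other CPUs. All type and helper names here are illustrative, not an existing
allocator.

/* Illustrative only; struct example_as and the subrange_* helpers are
 * assumed, not existing kernel code. */
static unsigned long example_alloc_iova(struct example_as *as, size_t size)
{
    int cpu = raw_smp_processor_id();
    unsigned long iova;

    /* 1. fast path: the subrange this CPU normally uses; its lock should
     *    almost never leave this CPU's L1 cache */
    iova = subrange_alloc(as->percpu_range[cpu], size);
    if (iova)
        return iova;

    /* 2. carve out a fresh subrange for this CPU if address space is left */
    if (subrange_grow(as, cpu))
        return subrange_alloc(as->percpu_range[cpu], size);

    /* 3. slow path: fall back to subranges belonging to other CPUs */
    return subrange_steal(as, size);
}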


Regards,

Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632



Re: [PATCHv2 RFC] virtio-spec: flexible configuration layout

2011-11-11 Thread Michael S. Tsirkin
On Fri, Nov 11, 2011 at 02:54:31PM +1030, Rusty Russell wrote:
 On Wed, 09 Nov 2011 22:57:28 +0200, Sasha Levin levinsasha...@gmail.com 
 wrote:
  On Wed, 2011-11-09 at 22:52 +0200, Michael S. Tsirkin wrote:
   On Wed, Nov 09, 2011 at 10:24:47PM +0200, Sasha Levin wrote:
It'll be a bit harder deprecating it in the future.
   
   Harder than ... what ?
  
  Harder than allowing devices not to present it at all if new layout
  config is used. Right now the simple implementation is to use MMIO for
  config and device specific, and let it fallback to legacy for ISR and
  notifications (and therefore, this is probably how everybody will
  implement it), which means that when you do want to deprecate legacy,
  there will be extra work to be done then, instead of doing it now.
 
 Indeed, I'd like to see two changes to your proposal:
 
 (1) It should be all or nothing.  If a driver can find the virtio header
 capability, it should only use the capabilities.  Otherwise, it
 should fall back to legacy.  Your draft suggests a mix is possible;
 I prefer a clean failure (ie. one day don't present a BAR 0 *at
 all*, so ancient drivers just fail to load.).
 (2) There's no huge win in keeping the same layout.  Let's make some
 cleanups.  There are more users ahead of us than behind us (I
 hope!).
 But I think this is the right direction!
 
 Thanks,
 Rusty.

I'll do these changes, thanks!


Re: [PATCHv2 RFC] virtio-spec: flexible configuration layout

2011-11-11 Thread Michael S. Tsirkin
On Fri, Nov 11, 2011 at 09:39:13AM +0200, Sasha Levin wrote:
 On Fri, Nov 11, 2011 at 6:24 AM, Rusty Russell ru...@rustcorp.com.au wrote:
  On Wed, 09 Nov 2011 22:57:28 +0200, Sasha Levin levinsasha...@gmail.com 
  wrote:
  On Wed, 2011-11-09 at 22:52 +0200, Michael S. Tsirkin wrote:
   On Wed, Nov 09, 2011 at 10:24:47PM +0200, Sasha Levin wrote:
It'll be a bit harder deprecating it in the future.
  
   Harder than ... what ?
 
  Harder than allowing devices not to present it at all if new layout
  config is used. Right now the simple implementation is to use MMIO for
  config and device specific, and let it fallback to legacy for ISR and
  notifications (and therefore, this is probably how everybody will
  implement it), which means that when you do want to deprecate legacy,
  there will be extra work to be done then, instead of doing it now.
 
  Indeed, I'd like to see two changes to your proposal:
 
  (1) It should be all or nothing.  If a driver can find the virtio header
      capability, it should only use the capabilities.  Otherwise, it
     should fall back to legacy.  Your draft suggests a mix is possible;
     I prefer a clean failure (ie. one day don't present a BAR 0 *at
     all*, so ancient drivers just fail to load.).
 
  (2) There's no huge win in keeping the same layout.  Let's make some
     cleanups.  There are more users ahead of us than behind us (I
     hope!).
 
 Actually, if we already do cleanups, here are two more suggestions:
 
 1. Make 64bit features a one big 64bit block, instead of having 32bits
 in one place and 32 in another.
 2. Remove the reserved fields out of the config (the ones that were
 caused by moving the ISR and the notifications out).
 
  But I think this is the right direction!
 
  Thanks,
  Rusty.
 
 
 Also, an unrelated questions: With PIO, requests were ordered, which
 means that if we wrote to the queue selector and then read from a
 queue register we would read the correct queue info.
 Is the same thing assured to us with MMIO?

For real PCI, reads do not bypass writes. However, this
is only true if both accesses are MMIO or both are PIO.
I don't think the ordering of MMIO versus PIO is guaranteed.

On KVM, the kernel doesn't do anything to guarantee ordering.
So you get the natural ordering of the CPU.

 If we write to a queue
 selector and immediately read from queue info would we be reading the
 right info, or is there the slight chance that it would get reordered
 and we would be reading queue info first and writing to the selector
 later?

The thing to realize is that the write to the queue selector with KVM is in
the end performed by the host. And reading queue info means that the host
will be reading the queue selector. So this is a write followed by a read
of the same location. AFAIK no CPU can reorder such accesses,
so you get the right info.
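To make the pattern concrete, here is a sketch along the lines of the legacy
virtio-pci code quoted elsewhere in this digest; the function itself is
illustrative, VIRTIO_PCI_QUEUE_SEL/VIRTIO_PCI_QUEUE_NUM are the legacy
register offsets:

static u16 example_get_queue_size(struct virtio_pci_device *vp_dev, u16 index)
{
    /* select the queue ... */
    iowrite16(index, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_SEL);

    /* ... then read back its size.  The device (host) observes the two
     * accesses in program order, so the read returns info for the queue
     * that was just selected. */
    return ioread16(vp_dev->ioaddr + VIRTIO_PCI_QUEUE_NUM);
}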

-- 
MST


[RFC] [ver3 PATCH 2/6] virtio: Move 'num_queues' to virtqueue

2011-11-11 Thread Krishna Kumar
Move queue_index from virtio_net_config to virtqueue. This is
needed to figure out the queue number of the vq in the 'done'
handler of the device.

Signed-off-by: krkum...@in.ibm.com
---
 drivers/virtio/virtio_pci.c |   10 +++---
 include/linux/virtio.h  |1 +
 2 files changed, 4 insertions(+), 7 deletions(-)

diff -ruNp org/drivers/virtio/virtio_pci.c new/drivers/virtio/virtio_pci.c
--- org/drivers/virtio/virtio_pci.c 2011-11-11 16:44:30.0 +0530
+++ new/drivers/virtio/virtio_pci.c 2011-11-11 16:44:45.0 +0530
@@ -75,9 +75,6 @@ struct virtio_pci_vq_info
/* the number of entries in the queue */
int num;
 
-   /* the index of the queue */
-   int queue_index;
-
/* the virtual address of the ring queue */
void *queue;
 
@@ -180,11 +177,10 @@ static void vp_reset(struct virtio_devic
 static void vp_notify(struct virtqueue *vq)
 {
struct virtio_pci_device *vp_dev = to_vp_device(vq->vdev);
-   struct virtio_pci_vq_info *info = vq->priv;
 
/* we write the queue's selector into the notification register to
 * signal the other end */
-   iowrite16(info->queue_index, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_NOTIFY);
+   iowrite16(vq->queue_index, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_NOTIFY);
 }
 
 /* Handle a configuration change: Tell driver if it wants to know. */
@@ -380,7 +376,6 @@ static struct virtqueue *setup_vq(struct
if (!info)
return ERR_PTR(-ENOMEM);
 
-   info->queue_index = index;
info->num = num;
info->msix_vector = msix_vec;
 
@@ -403,6 +398,7 @@ static struct virtqueue *setup_vq(struct
goto out_activate_queue;
}
 
+   vq->queue_index = index;
vq->priv = info;
info->vq = vq;
 
@@ -445,7 +441,7 @@ static void vp_del_vq(struct virtqueue *
list_del(&info->node);
spin_unlock_irqrestore(&vp_dev->lock, flags);
 
-   iowrite16(info->queue_index, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_SEL);
+   iowrite16(vq->queue_index, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_SEL);
 
if (vp_dev->msix_enabled) {
iowrite16(VIRTIO_MSI_NO_VECTOR,
diff -ruNp org/include/linux/virtio.h new/include/linux/virtio.h
--- org/include/linux/virtio.h  2011-11-11 16:44:30.0 +0530
+++ new/include/linux/virtio.h  2011-11-11 16:44:45.0 +0530
@@ -22,6 +22,7 @@ struct virtqueue {
void (*callback)(struct virtqueue *vq);
const char *name;
struct virtio_device *vdev;
+   int queue_index;/* the index of the queue */
void *priv;
 };
 


[RFC] [ver3 PATCH 1/6] virtio_net: Introduce VIRTIO_NET_F_MULTIQUEUE

2011-11-11 Thread Krishna Kumar
Introduce VIRTIO_NET_F_MULTIQUEUE. 

Signed-off-by: krkum...@in.ibm.com
---
 include/linux/virtio_net.h |1 +
 1 file changed, 1 insertion(+)

diff -ruNp org/include/linux/virtio_net.h new/include/linux/virtio_net.h
--- org/include/linux/virtio_net.h  2011-10-12 10:16:46.0 +0530
+++ new/include/linux/virtio_net.h  2011-11-11 16:44:34.0 +0530
@@ -49,6 +49,7 @@
 #define VIRTIO_NET_F_CTRL_RX   18  /* Control channel RX mode support */
 #define VIRTIO_NET_F_CTRL_VLAN 19  /* Control channel VLAN filtering */
 #define VIRTIO_NET_F_CTRL_RX_EXTRA 20  /* Extra RX mode control support */
+#define VIRTIO_NET_F_MULTIQUEUE 21  /* Device supports multiple TXQ/RXQ */
 
 #define VIRTIO_NET_S_LINK_UP   1   /* Link is up */
 



[RFC] [ver3 PATCH 0/6] Implement multiqueue virtio-net

2011-11-11 Thread Krishna Kumar
This patch series resurrects the earlier multiple TX/RX queues
functionality for virtio_net, and addresses the issues pointed
out.  It also includes an API to share irq's, f.e.  amongst the
TX vqs. 

I plan to run TCP/UDP STREAM and RR tests for local-host and
local-remote, and send the results in the next couple of days.


patch #1: Introduce VIRTIO_NET_F_MULTIQUEUE
patch #2: Move 'num_queues' to virtqueue
patch #3: virtio_net driver changes
patch #4: vhost_net changes
patch #5: Implement find_vqs_irq()
patch #6: Convert virtio_net driver to use find_vqs_irq()


Changes from rev2:
Michael:
---
1. Added functions to handle setting RX/TX/CTRL vq's.
2. num_queue_pairs instead of numtxqs.
3. Experimental support for fewer irq's in find_vqs.

Rusty:
--
4. Cleaned up some existing while (1).
5. rvq/svq and rx_sg/tx_sg changed to vq and sg respectively.
6. Cleaned up some #if 1 code.


Issue when using patch5:
-

The new API is designed to minimize code duplication.  E.g.
vp_find_vqs() is implemented as:

static int vp_find_vqs(...)
{
return vp_find_vqs_irq(vdev, nvqs, vqs, callbacks, names, NULL);
}

In my testing, when multiple tx/rx queues are used with multiple netperf
sessions, all the device tx queues stop a few thousand times and are
subsequently woken up by skb_xmit_done.  But after some 40K-50K
iterations of stop/wake, some of the txq's stop and no wake
interrupt comes. (modprobe -r followed by modprobe solves this, so
it is not a system hang).  At the time of the hang (#txqs=#rxqs=4):

# egrep "CPU|virtio0" /proc/interrupts | grep -v config
            CPU0       CPU1       CPU2       CPU3
41:        49057      49262      48828      49421   PCI-MSI-edge   virtio0-input.0
42:         5066       5213       5221       5109   PCI-MSI-edge   virtio0-output.0
43:        43380      43770      43007      43148   PCI-MSI-edge   virtio0-input.1
44:        41433      41727      42101      41175   PCI-MSI-edge   virtio0-input.2
45:        38465      37629      38468      38768   PCI-MSI-edge   virtio0-input.3

# tc -s qdisc show dev eth0
qdisc mq 0: root  
Sent 393196939897 bytes 271191624 pkt (dropped 59897,
overlimits 0 requeues 67156) backlog 25375720b 1601p
requeues 67156  

I am not sure if patch #5 is responsible for the hang.  Also, without
patch #5/patch #6, I changed vp_find_vqs() to:
static int vp_find_vqs(...)
{
return vp_try_to_find_vqs(vdev, nvqs, vqs, callbacks, names,
  false, false);
}
No packets were getting TX'd with this change when #txqs > 1.  This is
with the MQ-only patch that doesn't touch the drivers/virtio/ directory.

Also, the MQ patch works reasonably well with 2 vectors - with
use_msix=1 and per_vq_vectors=0 in vp_find_vqs().

Patch against net-next - please review.

Signed-off-by: krkum...@in.ibm.com
---



[RFC] [ver3 PATCH 4/6] vhost_net: vhost_net changes

2011-11-11 Thread Krishna Kumar
Changes for multiqueue vhost_net driver.

Signed-off-by: krkum...@in.ibm.com
---
 drivers/vhost/net.c   |  253 +---
 drivers/vhost/vhost.c |  225 ---
 drivers/vhost/vhost.h |   26 +++-
 3 files changed, 340 insertions(+), 164 deletions(-)

diff -ruNp org/drivers/vhost/net.c new/drivers/vhost/net.c
--- org/drivers/vhost/net.c 2011-11-11 16:44:56.0 +0530
+++ new/drivers/vhost/net.c 2011-11-11 16:45:11.0 +0530
@@ -41,12 +41,6 @@ MODULE_PARM_DESC(experimental_zcopytx, 
 #define VHOST_MAX_PEND 128
 #define VHOST_GOODCOPY_LEN 256
 
-enum {
-   VHOST_NET_VQ_RX = 0,
-   VHOST_NET_VQ_TX = 1,
-   VHOST_NET_VQ_MAX = 2,
-};
-
 enum vhost_net_poll_state {
VHOST_NET_POLL_DISABLED = 0,
VHOST_NET_POLL_STARTED = 1,
@@ -55,12 +49,13 @@ enum vhost_net_poll_state {
 
 struct vhost_net {
struct vhost_dev dev;
-   struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
-   struct vhost_poll poll[VHOST_NET_VQ_MAX];
+   struct vhost_virtqueue *vqs;
+   struct vhost_poll *poll;
+   struct socket **socks;
/* Tells us whether we are polling a socket for TX.
 * We only do this when socket buffer fills up.
 * Protected by tx vq lock. */
-   enum vhost_net_poll_state tx_poll_state;
+   enum vhost_net_poll_state *tx_poll_state;
 };
 
 static bool vhost_sock_zcopy(struct socket *sock)
@@ -108,28 +103,28 @@ static void copy_iovec_hdr(const struct 
 }
 
 /* Caller must have TX VQ lock */
-static void tx_poll_stop(struct vhost_net *net)
+static void tx_poll_stop(struct vhost_net *net, int qnum)
 {
-   if (likely(net->tx_poll_state != VHOST_NET_POLL_STARTED))
+   if (likely(net->tx_poll_state[qnum / 2] != VHOST_NET_POLL_STARTED))
return;
-   vhost_poll_stop(net->poll + VHOST_NET_VQ_TX);
-   net->tx_poll_state = VHOST_NET_POLL_STOPPED;
+   vhost_poll_stop(&net->poll[qnum]);
+   net->tx_poll_state[qnum / 2] = VHOST_NET_POLL_STOPPED;
 }
 
 /* Caller must have TX VQ lock */
-static void tx_poll_start(struct vhost_net *net, struct socket *sock)
+static void tx_poll_start(struct vhost_net *net, struct socket *sock, int qnum)
 {
-   if (unlikely(net->tx_poll_state != VHOST_NET_POLL_STOPPED))
+   if (unlikely(net->tx_poll_state[qnum / 2] != VHOST_NET_POLL_STOPPED))
return;
-   vhost_poll_start(net->poll + VHOST_NET_VQ_TX, sock->file);
-   net->tx_poll_state = VHOST_NET_POLL_STARTED;
+   vhost_poll_start(&net->poll[qnum], sock->file);
+   net->tx_poll_state[qnum / 2] = VHOST_NET_POLL_STARTED;
 }
 
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
-static void handle_tx(struct vhost_net *net)
+static void handle_tx(struct vhost_virtqueue *vq)
 {
-   struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
+   struct vhost_net *net = container_of(vq->dev, struct vhost_net, dev);
unsigned out, in, s;
int head;
struct msghdr msg = {
@@ -155,7 +150,7 @@ static void handle_tx(struct vhost_net *
wmem = atomic_read(&sock->sk->sk_wmem_alloc);
if (wmem >= sock->sk->sk_sndbuf) {
mutex_lock(&vq->mutex);
-   tx_poll_start(net, sock);
+   tx_poll_start(net, sock, vq->qnum);
mutex_unlock(&vq->mutex);
return;
}
@@ -164,7 +159,7 @@ static void handle_tx(struct vhost_net *
vhost_disable_notify(&net->dev, vq);
 
if (wmem < sock->sk->sk_sndbuf / 2)
-   tx_poll_stop(net);
+   tx_poll_stop(net, vq->qnum);
hdr_size = vq->vhost_hlen;
zcopy = vhost_sock_zcopy(sock);
 
@@ -186,7 +181,7 @@ static void handle_tx(struct vhost_net *
 
wmem = atomic_read(&sock->sk->sk_wmem_alloc);
if (wmem >= sock->sk->sk_sndbuf * 3 / 4) {
-   tx_poll_start(net, sock);
+   tx_poll_start(net, sock, vq->qnum);
set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
break;
}
@@ -197,7 +192,7 @@ static void handle_tx(struct vhost_net *
(vq->upend_idx - vq->done_idx) :
(vq->upend_idx + UIO_MAXIOV - vq->done_idx);
if (unlikely(num_pends > VHOST_MAX_PEND)) {
-   tx_poll_start(net, sock);
+   tx_poll_start(net, sock, vq->qnum);
set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
break;
}
@@ -257,7 +252,7 @@ static void handle_tx(struct vhost_net *
UIO_MAXIOV;
}
vhost_discard_vq_desc(vq, 1);
-   tx_poll_start(net, sock);
+   

Re: [PATCHv2 RFC] virtio-spec: flexible configuration layout

2011-11-11 Thread Pawel Moll
  Also, an unrelated questions: With PIO, requests were ordered, which
  means that if we wrote to the queue selector and then read from a
  queue register we would read the correct queue info.
  Is the same thing assured to us with MMIO?
 
 For real PCI, reads do not bypass writes in PCI. However this
 is only true if both are MMIO or both PIO reads.
 I don't think the ordering of MMIO versus PIO is guaranteed.
 
 On KVM, the kernel doesn't do anything to guarantee ordering.
 So you get the natural ordering of the CPU.
 
  If we write to a queue
  selector and immediately read from queue info would we be reading the
  right info, or is there the slight chance that it would get reordered
  and we would be reading queue info first and writing to the selector
  later?
 
 The thing to realize is that write to queue selector with KVM is in the
 end performed by host. And reading queue info means that host will be
 reading the queue selector. So this is a write followed by read
 from the same address. AFAIK no CPUs can reorder such accesses,
 so you get the right info.

(As far as I understand all the complexity ;-) Memory mapped using
ioremap()-like functions should preserve access order. In the case of the ARM
architecture such a memory region is defined (at the page table level) as
device memory (as opposed to normal memory) and the processor will
not try to be clever about it. I know next-to-nothing about x86, but I
suppose a similar idea applies.

Cheers!

Paweł




[RFC] [ver3 PATCH 3/6] virtio_net: virtio_net driver changes

2011-11-11 Thread Krishna Kumar
Changes for multiqueue virtio_net driver.

Signed-off-by: krkum...@in.ibm.com
---
 drivers/net/virtio_net.c   |  688 ---
 include/linux/virtio_net.h |2 
 2 files changed, 481 insertions(+), 209 deletions(-)

diff -ruNp org/drivers/net/virtio_net.c new/drivers/net/virtio_net.c
--- org/drivers/net/virtio_net.c2011-11-11 16:44:38.0 +0530
+++ new/drivers/net/virtio_net.c2011-11-11 16:44:59.0 +0530
@@ -40,33 +40,42 @@ module_param(gso, bool, 0444);
 
 #define VIRTNET_SEND_COMMAND_SG_MAX    2
 
-struct virtnet_stats {
+struct virtnet_send_stats {
struct u64_stats_sync syncp;
u64 tx_bytes;
u64 tx_packets;
+};
 
+struct virtnet_recv_stats {
+   struct u64_stats_sync syncp;
u64 rx_bytes;
u64 rx_packets;
 };
 
-struct virtnet_info {
-   struct virtio_device *vdev;
-   struct virtqueue *rvq, *svq, *cvq;
-   struct net_device *dev;
-   struct napi_struct napi;
-   unsigned int status;
+/* Internal representation of a send virtqueue */
+struct send_queue {
+   /* Virtqueue associated with this send _queue */
+   struct virtqueue *vq;
 
-   /* Number of input buffers, and max we've ever had. */
-   unsigned int num, max;
+   /* TX: fragments + linear part + virtio header */
+   struct scatterlist sg[MAX_SKB_FRAGS + 2];
 
-   /* I like... big packets and I cannot lie! */
-   bool big_packets;
+   /* Active tx statistics */
+   struct virtnet_send_stats __percpu *stats;
+};
 
-   /* Host will merge rx buffers for big packets (shake it! shake it!) */
-   bool mergeable_rx_bufs;
+/* Internal representation of a receive virtqueue */
+struct receive_queue {
+   /* Virtqueue associated with this receive_queue */
+   struct virtqueue *vq;
+
+   /* Back pointer to the virtnet_info */
+   struct virtnet_info *vi;
 
-   /* Active statistics */
-   struct virtnet_stats __percpu *stats;
+   struct napi_struct napi;
+
+   /* Number of input buffers, and max we've ever had. */
+   unsigned int num, max;
 
/* Work struct for refilling if we run low on memory. */
struct delayed_work refill;
@@ -74,9 +83,29 @@ struct virtnet_info {
/* Chain pages by the private ptr. */
struct page *pages;
 
-   /* fragments + linear part + virtio header */
-   struct scatterlist rx_sg[MAX_SKB_FRAGS + 2];
-   struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
+   /* RX: fragments + linear part + virtio header */
+   struct scatterlist sg[MAX_SKB_FRAGS + 2];
+
+   /* Active rx statistics */
+   struct virtnet_recv_stats __percpu *stats;
+};
+
+struct virtnet_info {
+   int num_queue_pairs;/* # of RX/TX vq pairs */
+
+   struct send_queue **sq;
+   struct receive_queue **rq;
+   struct virtqueue *cvq;
+
+   struct virtio_device *vdev;
+   struct net_device *dev;
+   unsigned int status;
+
+   /* I like... big packets and I cannot lie! */
+   bool big_packets;
+
+   /* Host will merge rx buffers for big packets (shake it! shake it!) */
+   bool mergeable_rx_bufs;
 };
 
 struct skb_vnet_hdr {
@@ -106,22 +135,22 @@ static inline struct skb_vnet_hdr *skb_v
  * private is used to chain pages for big packets, put the whole
  * most recent used list in the beginning for reuse
  */
-static void give_pages(struct virtnet_info *vi, struct page *page)
+static void give_pages(struct receive_queue *rq, struct page *page)
 {
struct page *end;
 
/* Find end of list, sew whole thing into vi->pages. */
for (end = page; end->private; end = (struct page *)end->private);
-   end->private = (unsigned long)vi->pages;
-   vi->pages = page;
+   end->private = (unsigned long)rq->pages;
+   rq->pages = page;
 }
 
-static struct page *get_a_page(struct virtnet_info *vi, gfp_t gfp_mask)
+static struct page *get_a_page(struct receive_queue *rq, gfp_t gfp_mask)
 {
-   struct page *p = vi->pages;
+   struct page *p = rq->pages;
 
if (p) {
-   vi->pages = (struct page *)p->private;
+   rq->pages = (struct page *)p->private;
/* clear private here, it is used to chain pages */
p->private = 0;
} else
@@ -129,15 +158,16 @@ static struct page *get_a_page(struct vi
return p;
 }
 
-static void skb_xmit_done(struct virtqueue *svq)
+static void skb_xmit_done(struct virtqueue *vq)
 {
-   struct virtnet_info *vi = svq->vdev->priv;
+   struct virtnet_info *vi = vq->vdev->priv;
+   int qnum = vq->queue_index / 2; /* RX/TX vqs are allocated in pairs */
 
/* Suppress further interrupts. */
-   virtqueue_disable_cb(svq);
+   virtqueue_disable_cb(vq);
 
/* We were probably waiting for more output buffers. */
-   netif_wake_queue(vi->dev);
+   netif_wake_subqueue(vi->dev, qnum);
 }
 
 static void set_skb_frag(struct sk_buff *skb, struct 

Changing IOMMU-API for generic DMA-mapping supported by the hardware

2011-11-11 Thread Joerg Roedel
Okay, separate thread for this one.

On Thu, Nov 10, 2011 at 07:28:39PM +, David Woodhouse wrote:
  The plan is to have a single DMA-API implementation for all IOMMU
  drivers (X86 and ARM) which just uses the IOMMU-API. But to make this
  perform reasonably well a few changes to the IOMMU-API are required.
  I already have some ideas which we can discuss if you want.
 
 Yeah, that sounds useful.

As I said, some changes to the IOMMU-API are required in my opinion.
These changes should also allow moving old-style IOMMUs like
Calgary or GART over to it later.

The basic idea is that IOMMU drivers should be required to put every
device they are responsible for into a default domain. The DMA mapping
code can query this default domain for each device.

Also the default domain has capabilities that can be queried. Those
capabilities include the size and offset of the address space they can
re-map. For GART and Calgary this will be the aperture, for VT-d and AMD
IOMMU the whole 64bit address space. Another capability is whether
addresses outside of that area are 1-1 mapped or no accessible to the
device.

The generic DMA-mapping code will use that information to initialize its
allocator and uses iommu_map/iommu_unmap to create and destroy mappings
as requested by the DMA-API (but the DMA-mapping code does not need to
create a domain of its own).
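A rough sketch of what such an interface could look like; all names below,
including iommu_get_default_domain and struct iommu_domain_caps, are
illustrative and not an existing kernel API:

struct iommu_domain_caps {
    u64  aperture_start;   /* lowest IOVA the domain can remap */
    u64  aperture_end;     /* highest IOVA; the whole 64bit space for VT-d/AMD */
    bool outside_identity; /* true: addresses outside are 1-1 mapped,
                            * false: they are not accessible to the device */
};

/* The IOMMU driver puts each device it handles into a default domain; the
 * generic DMA-mapping code then only needs these two calls plus
 * iommu_map()/iommu_unmap() to drive its address allocator. */
struct iommu_domain *iommu_get_default_domain(struct device *dev);
int iommu_domain_get_caps(struct iommu_domain *domain,
                          struct iommu_domain_caps *caps);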

The good thing about these default domains is that IOMMU drivers can
implement their own optimizations on it. The AMD IOMMU driver for
example already makes a distinction between dma-mapping domains and
other protection-domains. The optimization for dma-mapping domains is
that the leaf-pages of the page-table are kept in an array so that it
is very easy to find the PTE for an address. Those optimizations are
still possible with the default-domain concept.

In short, the benefits of the default-domain concept are:

1) It allows existing optimizations for the DMA-mapping code
   paths to persist
2) It also fits old-style IOMMUs like GART, Calgary and others

An open problem is how to report reserved ranges of an address-space.
These ranges might exist from a BIOS requirement for 1-1 mapping of
certain address ranges (in AMD jargon: Unity mapped ranges, something
similar exists on VT-d afaik) or hardware requirements like the reserved
address range used for MSI interrupts.

Regards,

Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632



Re: [PATCH v4 2/7] iommu/core: split mapping to page sizes as supported by the hardware

2011-11-11 Thread Joerg Roedel
On Thu, Nov 10, 2011 at 01:12:00PM -0800, Stepan Moskovchenko wrote:
 I have been experimenting with an iommu_map_range call, which maps a
 given scatterlist of discontiguous physical pages into a contiguous
 virtual region at a given IOVA. This has some performance advantages
 over just calling iommu_map iteratively. First, it reduces the
 amount of table walking / calculation needed for mapping each page,
 given that you know that all the pages will be mapped into a single
 virtually-contiguous region (so in most cases, the first-level table
 calculation can be reused). Second, it allows one to defer the TLB
 (and sometimes cache) maintenance operations until the entire
 scatterlist has been mapped, rather than doing a TLB invalidate
 after mapping each page, as would have been the case if iommu_map
 were just being called from within a loop. Granted, just using
 iommu_map many times may be acceptable on the slow path, but I have
 seen significant performance gains when using this approach on the
 fast path.

Yes, from a performance point-of-view that makes sense, as an addition
to the existing iommu_map interface. Are the pages in the list allowed
to have different page-sizes?
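For illustration, the kind of interface being discussed could look roughly
like this. Neither the prototype nor the two helpers are an existing
IOMMU-API function; the point is the single deferred flush at the end:

int example_iommu_map_range(struct iommu_domain *domain, unsigned long iova,
                            struct scatterlist *sgl, unsigned int nents,
                            int prot)
{
    struct scatterlist *sg;
    unsigned long cur = iova;
    unsigned int i;
    int ret;

    for_each_sg(sgl, sg, nents, i) {
        /* page-table update only, no per-page TLB invalidate */
        ret = example_map_no_flush(domain, cur, sg_phys(sg), sg->length, prot);
        if (ret)
            return ret;
        cur += sg->length;
    }

    /* one IOTLB/cache maintenance operation for the whole region */
    example_flush_iotlb_range(domain, iova, cur - iova);
    return 0;
}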


Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632



Re: [PATCH v4 2/7] iommu/core: split mapping to page sizes as supported by the hardware

2011-11-11 Thread David Woodhouse
On Fri, 2011-11-11 at 13:58 +0100, Joerg Roedel wrote:
 For AMD IOMMU there is a feature called not-present cache. It says that
 the IOMMU caches non-present entries as well and needs an IOTLB flush
 when something is mapped (meant for software implementations of the
 IOMMU).
 So it can't be really taken out of the fast-path. But the IOMMU driver
 can optimize the function so that it only flushes the IOTLB when there
 was an unmap-call before. 

We have exactly the same situation with the Intel IOMMU (we call it
'Caching Mode') for the same reasons.

I'd be wary about making the IOMMU driver *track* whether there was an
unmap call before — that seems like hard work and more cache contention,
especially if the ->commit() call happens on a CPU other than the one
that just did the unmap.

I'm also not sure exactly when you'd call the ->commit() function when
the DMA API is being used, and which 'side' of that API the
deferred-flush optimisations would live.

Would the optimisation be done on the generic side, only calling
->commit when it absolutely *has* to happen? (Or periodically after
unmaps have happened to avoid entries hanging around for ever?)

Or would the optimisation be done in the IOMMU driver, thus turning the
->commit() function into more of a *hint*? You could add a 'simon_says'
boolean argument to it, I suppose...?

 It is also an improvement over the current
 situation where every iommu_unmap call results in a flush implicitly.
 This is pretty much a no-go for using IOMMU-API in DMA mapping at the
 moment.

Right. That definitely needs to be handled. We just need to work out the
(above and other) details.

  But also, it's not *so* much of an issue to divide the space up even
  when it's limited. The idea was not to have it *strictly* per-CPU, but
  just for a CPU to try allocating from its own subrange first…
 
 Yeah, I get the idea. I fear that the memory consumption will get pretty
 high with that approach. It basically means one round-robin allocator
 per cpu and device. What does that mean on a 4096 CPU machine :)

Well, if your network device is taking interrupts, and mapping/unmapping
buffers across all 4096 CPUs, then your performance is screwed anyway :)

Certainly your concerns are valid, but I think we can cope with them
fairly reasonably. If we *do* have large number of CPUs allocating for a
given domain, we can move to a per-node rather than per-CPU allocator.
And we can have dynamically sized allocation regions, so we aren't
wasting too much space on unused bitmaps if you map just *one* page from
each of your 4096 CPUs.

 How much lock contention will be lowered also depends on the work-load.
 If dma-handles are frequently freed from another cpu than they were
 allocated from the same problem re-appears.

The idea is that dma handles are *infrequently* freed, in batches. So
we'll bounce the lock's cache line occasionally, but not all the time.

In strict or unmap_flush mode, you get to go slowly unless you do
the unmap on the same CPU that you mapped it from. I can live with that.
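A rough sketch of that batching idea; every name here is illustrative, not
existing code: completed handles are queued and only returned to the
allocator, with a single IOTLB flush, when the batch fills up.

#define EXAMPLE_BATCH 64

struct example_free_batch {
    unsigned int  count;
    unsigned long iova[EXAMPLE_BATCH];
    size_t        size[EXAMPLE_BATCH];
};

static void example_queue_free(struct example_domain *dom,
                               struct example_free_batch *b,
                               unsigned long iova, size_t size)
{
    b->iova[b->count] = iova;
    b->size[b->count] = size;

    if (++b->count == EXAMPLE_BATCH) {
        unsigned int i;

        flush_iotlb(dom);            /* one flush for the whole batch */
        for (i = 0; i < b->count; i++)
            free_iova_range(dom, b->iova[i], b->size[i]); /* assumed helper */
        b->count = 0;
    }
}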

 But in the end we have to try it out and see what works best :)

Indeed. I'm just trying to work out if I should try to do the allocator
thing purely inside the Intel code first, and then try to move it out
and make it generic — or if I should start with making the DMA API work
with a wrapper around the IOMMU API, with your ->commit() and other
necessary changes. I think I'd prefer the latter, if we can work out how
it should look.

-- 
dwmw2




[RFC] [ver3 PATCH 6/6] virtio_net: Convert virtio_net driver to use find_vqs_irq

2011-11-11 Thread Krishna Kumar
Convert virtio_net driver to use find_vqs_irq(). The TX vq's
share a single irq, while the RX vq's have individual irq's.
The skb_xmit_done handler also checks if any work is required.

Signed-off-by: krkum...@in.ibm.com
---
 drivers/net/virtio_net.c |   29 ++---
 1 file changed, 22 insertions(+), 7 deletions(-)

diff -ruNp org/drivers/net/virtio_net.c new/drivers/net/virtio_net.c
--- org/drivers/net/virtio_net.c2011-11-11 16:45:17.0 +0530
+++ new/drivers/net/virtio_net.c2011-11-11 16:48:45.0 +0530
@@ -163,11 +163,13 @@ static void skb_xmit_done(struct virtque
struct virtnet_info *vi = vq->vdev->priv;
int qnum = vq->queue_index / 2; /* RX/TX vqs are allocated in pairs */
 
-   /* Suppress further interrupts. */
-   virtqueue_disable_cb(vq);
+   if (__netif_subqueue_stopped(vi->dev, qnum)) {
+   /* Suppress further interrupts. */
+   virtqueue_disable_cb(vq);
 
-   /* We were probably waiting for more output buffers. */
-   netif_wake_subqueue(vi->dev, qnum);
+   /* We were probably waiting for more output buffers. */
+   netif_wake_subqueue(vi->dev, qnum);
+   }
 }
 
 static void set_skb_frag(struct sk_buff *skb, struct page *page,
@@ -1120,6 +1122,7 @@ static void setup_cvq(struct virtnet_inf
 
 static int invoke_find_vqs(struct virtnet_info *vi)
 {
+   unsigned long *flags = NULL;
vq_callback_t **callbacks;
struct virtqueue **vqs;
int ret = -ENOMEM;
@@ -1141,6 +1144,14 @@ static int invoke_find_vqs(struct virtne
if (!vqs || !callbacks || !names)
goto err;
 
+   if (vi->num_queue_pairs > 1) {
+   int num = (total_vqs + BITS_PER_LONG - 1) / BITS_PER_LONG;
+
+   flags = kzalloc(num * sizeof(*flags), GFP_KERNEL);
+   if (!flags)
+   goto err;
+   }
+
/* Allocate/initialize parameters for recv virtqueues */
for (i = 0; i < vi->num_queue_pairs * 2; i += 2) {
callbacks[i] = skb_recv_done;
@@ -1155,6 +1166,8 @@ static int invoke_find_vqs(struct virtne
names[i] = kasprintf(GFP_KERNEL, "output.%d", i / 2);
if (!names[i])
goto err;
+   if (flags)
+   set_bit(i, flags);
}
 
/* Parameters for control virtqueue, if any */
@@ -1163,9 +1176,9 @@ static int invoke_find_vqs(struct virtne
names[i - 1] = "control";
}
 
-   ret = vi->vdev->config->find_vqs(vi->vdev, total_vqs, vqs, callbacks,
-(const char **)names);
-
+   ret = vi->vdev->config->find_vqs_irq(vi->vdev, total_vqs, vqs,
+callbacks, (const char **)names,
+flags);
if (ret)
goto err;
 
@@ -1174,6 +1187,8 @@ static int invoke_find_vqs(struct virtne
setup_cvq(vi, vqs, vi-num_queue_pairs * 2);
 
 err:
+   kfree(flags);
+
if (ret  names)
for (i = 0; i < vi->num_queue_pairs * 2; i++)
kfree(names[i]);



[RFC] [ver3 PATCH 5/6] virtio: Implement find_vqs_irq()

2011-11-11 Thread Krishna Kumar
Implement find_vqs_irq() to reduce number of vectors. It can
be used to specify which vq's need their own irqs, and which
can share irqs with other vq's.

Signed-off-by: krkum...@in.ibm.com
---
 drivers/virtio/virtio_pci.c   |  108 
 include/linux/virtio_config.h |   14 
 2 files changed, 95 insertions(+), 27 deletions(-)

diff -ruNp org/drivers/virtio/virtio_pci.c new/drivers/virtio/virtio_pci.c
--- org/drivers/virtio/virtio_pci.c 2011-11-11 16:45:09.0 +0530
+++ new/drivers/virtio/virtio_pci.c 2011-11-11 16:54:35.0 +0530
@@ -40,7 +40,7 @@ struct virtio_pci_device
/* the IO mapping for the PCI config space */
void __iomem *ioaddr;
 
-   /* a list of queues so we can dispatch IRQs */
+   /* a list of queues which have registered to receive IRQs */
spinlock_t lock;
struct list_head virtqueues;
 
@@ -196,7 +196,7 @@ static irqreturn_t vp_config_changed(int
return IRQ_HANDLED;
 }
 
-/* Notify all virtqueues on an interrupt. */
+/* Notify all vq's on 'virtqueues' list on an interrupt. */
 static irqreturn_t vp_vring_interrupt(int irq, void *opaque)
 {
struct virtio_pci_device *vp_dev = opaque;
@@ -358,7 +358,7 @@ static struct virtqueue *setup_vq(struct
struct virtio_pci_device *vp_dev = to_vp_device(vdev);
struct virtio_pci_vq_info *info;
struct virtqueue *vq;
-   unsigned long flags, size;
+   unsigned long size;
u16 num;
int err;
 
@@ -378,6 +378,7 @@ static struct virtqueue *setup_vq(struct
 
info->num = num;
info->msix_vector = msix_vec;
+   INIT_LIST_HEAD(&info->node);
 
size = PAGE_ALIGN(vring_size(num, VIRTIO_PCI_VRING_ALIGN));
info-queue = alloc_pages_exact(size, GFP_KERNEL|__GFP_ZERO);
@@ -411,14 +412,6 @@ static struct virtqueue *setup_vq(struct
}
}
 
-   if (callback) {
-   spin_lock_irqsave(&vp_dev->lock, flags);
-   list_add(&info->node, &vp_dev->virtqueues);
-   spin_unlock_irqrestore(&vp_dev->lock, flags);
-   } else {
-   INIT_LIST_HEAD(&info->node);
-   }
-
return vq;
 
 out_assign:
@@ -472,7 +465,8 @@ static void vp_del_vqs(struct virtio_dev
if (vp_dev->per_vq_vectors &&
info->msix_vector != VIRTIO_MSI_NO_VECTOR)
free_irq(vp_dev->msix_entries[info->msix_vector].vector,
-vq);
+list_empty(&info->node) ?
+(void *)vq : (void *)vp_dev);
vp_del_vq(vq);
}
vp_dev-per_vq_vectors = false;
@@ -480,16 +474,37 @@ static void vp_del_vqs(struct virtio_dev
vp_free_vectors(vdev);
 }
 
+static void add_vq_to_list(struct virtqueue *vq,
+  struct virtio_pci_device *vp_dev,
+  vq_callback_t *cb)
+{
+   struct virtio_pci_vq_info *info = vq->priv;
+   unsigned long flags;
+
+   if (cb) {
+   spin_lock_irqsave(&vp_dev->lock, flags);
+   list_add(&info->node, &vp_dev->virtqueues);
+   spin_unlock_irqrestore(&vp_dev->lock, flags);
+   }
+}
+
+/* Return true if flags is NULL, or 'bit'# in flags is clear */
+static bool bit_clear(unsigned long *flags, int bit)
+{
+   return flags ? !test_bit(bit, flags) : true;
+}
+
 static int vp_try_to_find_vqs(struct virtio_device *vdev, unsigned nvqs,
  struct virtqueue *vqs[],
  vq_callback_t *callbacks[],
  const char *names[],
  bool use_msix,
- bool per_vq_vectors)
+ bool per_vq_vectors, unsigned long *flags)
 {
struct virtio_pci_device *vp_dev = to_vp_device(vdev);
u16 msix_vec;
int i, err, nvectors, allocated_vectors;
+   int count = 0;  /* Count of vq's using shared irq's */
 
if (!use_msix) {
/* Old style: one normal interrupt for change and all vqs. */
@@ -500,9 +515,19 @@ static int vp_try_to_find_vqs(struct vir
if (per_vq_vectors) {
/* Best option: one for change interrupt, one per vq. */
nvectors = 1;
-   for (i = 0; i < nvqs; ++i)
-   if (callbacks[i])
+   for (i = 0; i < nvqs; ++i) {
+   bool alloc_irq = bit_clear(flags, i);
+
+   /*
+* We allocate a vector if cb is present,
+* AND (driver requested a vector OR this
+* is the first shared vector).
+*/
+   if (callbacks[i] &&
+   (alloc_irq || ++count == 1))
   

Re: [Qemu-devel] qemu and qemu.git - Migration + disk stress introduces qcow2 corruptions

2011-11-11 Thread Anthony Liguori

On 11/11/2011 04:15 AM, Kevin Wolf wrote:

Am 10.11.2011 22:30, schrieb Anthony Liguori:

Live migration with qcow2 or any other image format is just not going to work
right now even with proper clustered storage.  I think doing a block level flush
cache interface and letting block devices decide how to do it is the best 
approach.


I would really prefer reusing the existing open/close code. It means
less (duplicated) code, is existing code that is well tested and doesn't
make migration much of a special case.


Just to be clear, reopen only addresses image format migration.  It does not 
address NFS migration since it doesn't guarantee close-to-open semantics.


The problem I have with the reopen patches is that they introduce regressions
and change the semantics for a management tool.  If you look at the libvirt
workflow with encrypted disks, it would break with the reopen patches.




If you want to avoid reopening the file on the OS level, we can reopen
only the topmost layer (i.e. the format, but not the protocol) for now
and in 1.1 we can use bdrv_reopen().


I don't view not supporting migration with image formats as a regression as it's 
never been a feature we've supported.  While there might be confusion about 
support around NFS, I think it's always been clear that image formats cannot be 
used.


Given that, I don't think this is a candidate for 1.0.

Regards,

Anthony Liguori




Kevin





Re: [PATCH] KVM: PPC: e500: Casting (void *) value returned by kmalloc is useless

2011-11-11 Thread Alexander Graf

On 11/08/2011 08:15 PM, Thomas Meyer wrote:

From: Thomas Meyer tho...@m3y3r.de

  Casting (void *) value returned by kmalloc is useless
  as mentioned in Documentation/CodingStyle, Chap 14.

  The semantic patch that makes this change is available
  in scripts/coccinelle/api/alloc/drop_kmalloc_cast.cocci.

  More information about semantic patching is available at
  http://coccinelle.lip6.fr/

Signed-off-by: Thomas Meyer tho...@m3y3r.de
---

diff -u -p a/arch/powerpc/kvm/e500_tlb.c b/arch/powerpc/kvm/e500_tlb.c
--- a/arch/powerpc/kvm/e500_tlb.c 2011-11-07 19:37:27.329638682 +0100
+++ b/arch/powerpc/kvm/e500_tlb.c 2011-11-08 09:18:39.955928218 +0100
@@ -1026,11 +1026,11 @@ int kvmppc_e500_tlb_init(struct kvmppc_v
if (vcpu_e500->gtlb_arch[1] == NULL)
goto err_out_guest0;

-   vcpu_e500->gtlb_priv[0] = (struct tlbe_priv *)
+   vcpu_e500->gtlb_priv[0] =
kzalloc(sizeof(struct tlbe_priv) * KVM_E500_TLB0_SIZE, 
GFP_KERNEL);
if (vcpu_e500->gtlb_priv[0] == NULL)
goto err_out_guest1;
-   vcpu_e500->gtlb_priv[1] = (struct tlbe_priv *)
+   vcpu_e500->gtlb_priv[1] =
kzalloc(sizeof(struct tlbe_priv) * KVM_E500_TLB1_SIZE, 
GFP_KERNEL);


Sorry, I can't seem to find that code anywhere. I suppose you're using 
an old git checkout?



Alex



Re: [PATCH] KVM: PPC: booke: check for signals in kvmppc_vcpu_run

2011-11-11 Thread Alexander Graf

On 11/08/2011 11:11 PM, Scott Wood wrote:

Currently we check prior to returning from a lightweight exit,
but not prior to initial entry.

book3s already does a similar test.

Signed-off-by: Scott Wood scottw...@freescale.com
---
  arch/powerpc/kvm/booke.c |   10 +-
  1 files changed, 9 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index b642200..9c78589 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -322,11 +322,19 @@ int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct 
kvm_vcpu *vcpu)
}

local_irq_disable();
+
+   if (signal_pending(current)) {


Any reason you're doing this after irq_disable()?


Alex



Re: [PATCH v4 2/7] iommu/core: split mapping to page sizes as supported by the hardware

2011-11-11 Thread Joerg Roedel
On Fri, Nov 11, 2011 at 01:27:28PM +, David Woodhouse wrote:
 On Fri, 2011-11-11 at 13:58 +0100, Joerg Roedel wrote:
  For AMD IOMMU there is a feature called not-present cache. It says that
  the IOMMU caches non-present entries as well and needs an IOTLB flush
  when something is mapped (meant for software implementations of the
  IOMMU).
  So it can't be really taken out of the fast-path. But the IOMMU driver
  can optimize the function so that it only flushes the IOTLB when there
  was an unmap-call before. 
 
 We have exactly the same situation with the Intel IOMMU (we call it
 'Caching Mode') for the same reasons.
 
 I'd be wary about making the IOMMU driver *track* whether there was an
 unmap call before — that seems like hard work and more cache contention,
 especially if the ->commit() call happens on a CPU other than the one
 that just did the unmap.

Well, the IOMMU driver does not need to track this globally or
per-IOMMU. It basically tracks it per-domain. This lowers the
cache-bouncing a bit.

 I'm also not sure exactly when you'd call the ->commit() function when
 the DMA API is being used, and which 'side' of that API the
 deferred-flush optimisations would live.

The commit function is called every time before one of the DMA-API
functions returns.

 Would the optimisation be done on the generic side, only calling
 ->commit when it absolutely *has* to happen? (Or periodically after
 unmaps have happened to avoid entries hanging around for ever?)
 
 Or would the optimisation be done in the IOMMU driver, thus turning the
 ->commit() function into more of a *hint*? You could add a 'simon_says'
 boolean argument to it, I suppose...?

The IOMMU driver can't do that because special knowledge about the
address allocator is needed. So it has to happen in the generic part.
Okay, that optimization does not really work with the ->commit() idea,
now that I think about it. The requirement to call commit() on the
map side would result in too many flushes.

So it may be better to just call it iommu_flush_tlb(domain) or
something. The user of the IOMMU-API can assume that the flush is only
necessary on the unmap-path. Stricter flushing rules need to be handled
in the IOMMU driver itself.
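A sketch of how the generic DMA-mapping unmap path could use such a
call-back. Signatures are simplified, iommu_flush_tlb() does not exist yet,
and example_free_iova() is an assumed allocator hook:

static void example_dma_unmap(struct iommu_domain *domain,
                              unsigned long iova, size_t size)
{
    /* clear the page-table entries; no implicit IOTLB flush here */
    iommu_unmap(domain, iova, size);

    /* the proposed call-back: flush once, on the unmap path only */
    iommu_flush_tlb(domain);

    /* return the range to the IOVA allocator */
    example_free_iova(domain, iova, size);
}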

  It is also an improvement over the current
  situation where every iommu_unmap call results in a flush implicitly.
  This is pretty much a no-go for using IOMMU-API in DMA mapping at the
  moment.
 
 Right. That definitely needs to be handled. We just need to work out the
 (above and other) details.

Yes, I am convinced now that the commit() idea is not the right solution
:)
Some less generic semantics like an iommu_flush_tlb call-back would do
better with such an optimization.

 Certainly your concerns are valid, but I think we can cope with them
 fairly reasonably. If we *do* have large number of CPUs allocating for a
 given domain, we can move to a per-node rather than per-CPU allocator.
 And we can have dynamically sized allocation regions, so we aren't
 wasting too much space on unused bitmaps if you map just *one* page from
 each of your 4096 CPUs.

It also somewhat requires that the chunks are reasonably small. On a
typical small system the amount of allocated DMA memory is rather small
(GPUs are the exception, of course).

Btw, do you have numbers how much time is spent in spinlocks for
multi-node machines and the VT-d driver for some workloads? I remember
that I did this measurement some time ago on a 24 core AMD machine with
netperf on a 10GBit network card and the time spent spinning came out
at ~5%.
For 1Gbit network and disk-load it was around 0.5% on that machine.

  How much lock contention will be lowered also depends on the work-load.
  If dma-handles are frequently freed from another cpu than they were
  allocated from the same problem re-appears.
 
 The idea is that dma handles are *infrequently* freed, in batches. So
 we'll bounce the lock's cache line occasionally, but not all the time.

Hmm, btw, how does this batch-freeing work? Is the freeing done when
the address space is filled up or is another approach used?

  But in the end we have to try it out and see what works best :)
 
 Indeed. I'm just trying to work out if I should try to do the allocator
 thing purely inside the Intel code first, and then try to move it out
 and make it generic — or if I should start with making the DMA API work
  with a wrapper around the IOMMU API, with your ->commit() and other
 necessary changes. I think I'd prefer the latter, if we can work out how
 it should look.

I think it is best to start with a generic DMA-mapping
implementation, benchmark it and then find the bottlenecks and try out
possible optimizations. In the beginning it is hard enough to come up
with an implementation that fits X86 and ARM at the same time. When this
is worked out we can continue to optimize :)


Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: 

Re: [Qemu-devel] qemu and qemu.git - Migration + disk stress introduces qcow2 corruptions

2011-11-11 Thread Kevin Wolf
Am 11.11.2011 15:03, schrieb Anthony Liguori:
 On 11/11/2011 04:15 AM, Kevin Wolf wrote:
 Am 10.11.2011 22:30, schrieb Anthony Liguori:
 Live migration with qcow2 or any other image format is just not going to 
 work
 right now even with proper clustered storage.  I think doing a block level 
 flush
 cache interface and letting block devices decide how to do it is the best 
 approach.

 I would really prefer reusing the existing open/close code. It means
 less (duplicated) code, is existing code that is well tested and doesn't
 make migration much of a special case.
 
 Just to be clear, reopen only addresses image format migration.  It does not 
 address NFS migration since it doesn't guarantee close-to-open semantics.

Yes. But image formats are the only thing that is really completely
broken today. For NFS etc. we can tell users to use
cache=none/directsync and they will be good. There is no such option
that makes image formats safe.

  The problem I have with the reopen patches is that they introduce regressions
  and change the semantics for a management tool.  If you look at the libvirt
 workflow with encrypted disks, it would break with the reopen patches.

Yes, this is nasty. But on the other hand: Today migration is broken for
all qcow2 images, with the reopen it's only broken for encrypted ones.
Certainly an improvement, even though there's still a bug left.

 If you want to avoid reopening the file on the OS level, we can reopen
 only the topmost layer (i.e. the format, but not the protocol) for now
 and in 1.1 we can use bdrv_reopen().
 
 I don't view not supporting migration with image formats as a regression as 
 it's 
 never been a feature we've supported.  While there might be confusion about 
 support around NFS, I think it's always been clear that image formats cannot 
 be 
 used.
 
 Given that, I don't think this is a candidate for 1.0.

Nobody says it's a regression, but it's a bad bug and you're blocking a
solution for it for over a year now because the solution isn't perfect
enough in your eyes. :-(

Kevin
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] qemu and qemu.git - Migration + disk stress introduces qcow2 corruptions

2011-11-11 Thread Anthony Liguori

On 11/11/2011 08:29 AM, Kevin Wolf wrote:

Am 11.11.2011 15:03, schrieb Anthony Liguori:

On 11/11/2011 04:15 AM, Kevin Wolf wrote:

Am 10.11.2011 22:30, schrieb Anthony Liguori:

Live migration with qcow2 or any other image format is just not going to work
right now even with proper clustered storage.  I think doing a block level flush
cache interface and letting block devices decide how to do it is the best 
approach.


I would really prefer reusing the existing open/close code. It means
less (duplicated) code, is existing code that is well tested and doesn't
make migration much of a special case.


Just to be clear, reopen only addresses image format migration.  It does not
address NFS migration since it doesn't guarantee close-to-open semantics.


Yes. But image formats are the only thing that is really completely
broken today. For NFS etc. we can tell users to use
cache=none/directsync and they will be good. There is no such option
that makes image formats safe.


The problem I have with the reopen patches is that they introduce regressions
and change the semantics for a management tool.  If you look at the libvirt
workflow with encrypted disks, it would break with the reopen patches.


Yes, this is nasty. But on the other hand: Today migration is broken for
all qcow2 images, with the reopen it's only broken for encrypted ones.
Certainly an improvement, even though there's still a bug left.


This sounds like a good thing to work through in the next release.




If you want to avoid reopening the file on the OS level, we can reopen
only the topmost layer (i.e. the format, but not the protocol) for now
and in 1.1 we can use bdrv_reopen().


I don't view not supporting migration with image formats as a regression as it's
never been a feature we've supported.  While there might be confusion about
support around NFS, I think it's always been clear that image formats cannot be
used.

Given that, I don't think this is a candidate for 1.0.


Nobody says it's a regression, but it's a bad bug and you're blocking a
solution for it for over a year now because the solution isn't perfect
enough in your eyes. :-(


This patch was posted a year ago.  Feedback was provided and there was never any 
follow up[1].  I've never Nack'd this approach.  I can't see how I was blocking 
this since I never even responded in the thread.  If this came in before soft 
freeze, I wouldn't have objected if you wanted to go in this direction.


This is not a bug fix, this is a new feature.  We're long past feature freeze. 
It's not a simple and obvious fix either.  It only partially fixes the problem 
and introduces other problems.  It's not a good candidate for making an 
exception at this stage in the release.


[1] http://mid.gmane.org/cover.1294150511.git.quint...@redhat.com

Regards,

Anthony Liguori



Kevin



--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] qemu and qemu.git - Migration + disk stress introduces qcow2 corruptions

2011-11-11 Thread Kevin Wolf
Am 11.11.2011 15:35, schrieb Anthony Liguori:
 On 11/11/2011 08:29 AM, Kevin Wolf wrote:
 Am 11.11.2011 15:03, schrieb Anthony Liguori:
 On 11/11/2011 04:15 AM, Kevin Wolf wrote:
 Am 10.11.2011 22:30, schrieb Anthony Liguori:
 Live migration with qcow2 or any other image format is just not going to 
 work
 right now even with proper clustered storage.  I think doing a block 
 level flush
 cache interface and letting block devices decide how to do it is the best 
 approach.

 I would really prefer reusing the existing open/close code. It means
 less (duplicated) code, is existing code that is well tested and doesn't
 make migration much of a special case.

 Just to be clear, reopen only addresses image format migration.  It does not
 address NFS migration since it doesn't guarantee close-to-open semantics.

 Yes. But image formats are the only thing that is really completely
 broken today. For NFS etc. we can tell users to use
 cache=none/directsync and they will be good. There is no such option
 that makes image formats safe.

 The problem I have with the reopen patches is that they introduce
 regressions
 and change the semantics for a management tool.  If you look at the libvirt
 workflow with encrypted disks, it would break with the reopen patches.

 Yes, this is nasty. But on the other hand: Today migration is broken for
 all qcow2 images, with the reopen it's only broken for encrypted ones.
 Certainly an improvement, even though there's still a bug left.
 
 This sounds like a good thing to work through in the next release.
 

 If you want to avoid reopening the file on the OS level, we can reopen
 only the topmost layer (i.e. the format, but not the protocol) for now
 and in 1.1 we can use bdrv_reopen().

 I don't view not supporting migration with image formats as a regression as 
 it's
 never been a feature we've supported.  While there might be confusion about
 support around NFS, I think it's always been clear that image formats 
 cannot be
 used.

 Given that, I don't think this is a candidate for 1.0.

 Nobody says it's a regression, but it's a bad bug and you're blocking a
 solution for it for over a year now because the solution isn't perfect
 enough in your eyes. :-(
 
 This patch was posted a year ago.  Feedback was provided and there was never 
 any 
 follow up[1].  I've never Nack'd this approach.  I can't see how I was 
 blocking 
 this since I never even responded in the thread.  If this came in before soft 
 freeze, I wouldn't have objected if you wanted to go in this direction.
 
 This is not a bug fix, this is a new feature.  We're long past feature 
 freeze. 
 It's not a simple and obvious fix either.  It only partially fixes the 
 problem 
 and introduces other problems.  It's not a good candidate for making an 
 exception at this stage in the release.
 
 [1] http://mid.gmane.org/cover.1294150511.git.quint...@redhat.com

Then please send a fix that fails migration with non-raw images. Not
breaking images silently during migration is critical for 1.0, IMO.

Kevin
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: PPC: e500: Casting (void *) value returned by kmalloc is useless

2011-11-11 Thread Thomas Meyer
Am Freitag, den 11.11.2011, 15:05 +0100 schrieb Alexander Graf:
 On 11/08/2011 08:15 PM, Thomas Meyer wrote:
  From: Thomas Meyertho...@m3y3r.de
 
Casting (void *) value returned by kmalloc is useless
as mentioned in Documentation/CodingStyle, Chap 14.
 
The semantic patch that makes this change is available
in scripts/coccinelle/api/alloc/drop_kmalloc_cast.cocci.
 
More information about semantic patching is available at
http://coccinelle.lip6.fr/
 
  Signed-off-by: Thomas Meyertho...@m3y3r.de
  ---
 
  diff -u -p a/arch/powerpc/kvm/e500_tlb.c b/arch/powerpc/kvm/e500_tlb.c
  --- a/arch/powerpc/kvm/e500_tlb.c 2011-11-07 19:37:27.329638682 +0100
  +++ b/arch/powerpc/kvm/e500_tlb.c 2011-11-08 09:18:39.955928218 +0100
  @@ -1026,11 +1026,11 @@ int kvmppc_e500_tlb_init(struct kvmppc_v
  if (vcpu_e500->gtlb_arch[1] == NULL)
  goto err_out_guest0;
 
  -   vcpu_e500->gtlb_priv[0] = (struct tlbe_priv *)
  +   vcpu_e500->gtlb_priv[0] =
  kzalloc(sizeof(struct tlbe_priv) * KVM_E500_TLB0_SIZE, 
  GFP_KERNEL);
  if (vcpu_e500->gtlb_priv[0] == NULL)
  goto err_out_guest1;
  -   vcpu_e500->gtlb_priv[1] = (struct tlbe_priv *)
  +   vcpu_e500->gtlb_priv[1] =
  kzalloc(sizeof(struct tlbe_priv) * KVM_E500_TLB1_SIZE, 
  GFP_KERNEL);
 
 Sorry, I can't seem to find that code anywhere. I suppose you're using 
 an old git checkout?

No, I'm at 3.2-rc1:

https://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=blob;f=arch/powerpc/kvm/e500_tlb.c#l1029

 
 
 Alex
 



--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: PPC: e500: Casting (void *) value returned by kmalloc is useless

2011-11-11 Thread Alexander Graf

On 11/11/2011 04:24 PM, Thomas Meyer wrote:

Am Freitag, den 11.11.2011, 15:05 +0100 schrieb Alexander Graf:

On 11/08/2011 08:15 PM, Thomas Meyer wrote:

From: Thomas Meyertho...@m3y3r.de

   Casting (void *) value returned by kmalloc is useless
   as mentioned in Documentation/CodingStyle, Chap 14.

   The semantic patch that makes this change is available
   in scripts/coccinelle/api/alloc/drop_kmalloc_cast.cocci.

   More information about semantic patching is available at
   http://coccinelle.lip6.fr/

Signed-off-by: Thomas Meyertho...@m3y3r.de
---

diff -u -p a/arch/powerpc/kvm/e500_tlb.c b/arch/powerpc/kvm/e500_tlb.c
--- a/arch/powerpc/kvm/e500_tlb.c 2011-11-07 19:37:27.329638682 +0100
+++ b/arch/powerpc/kvm/e500_tlb.c 2011-11-08 09:18:39.955928218 +0100
@@ -1026,11 +1026,11 @@ int kvmppc_e500_tlb_init(struct kvmppc_v
if (vcpu_e500->gtlb_arch[1] == NULL)
goto err_out_guest0;

-   vcpu_e500->gtlb_priv[0] = (struct tlbe_priv *)
+   vcpu_e500->gtlb_priv[0] =
kzalloc(sizeof(struct tlbe_priv) * KVM_E500_TLB0_SIZE, 
GFP_KERNEL);
if (vcpu_e500->gtlb_priv[0] == NULL)
goto err_out_guest1;
-   vcpu_e500->gtlb_priv[1] = (struct tlbe_priv *)
+   vcpu_e500->gtlb_priv[1] =
kzalloc(sizeof(struct tlbe_priv) * KVM_E500_TLB1_SIZE, 
GFP_KERNEL);

Sorry, I can't seem to find that code anywhere. I suppose you're using
an old git checkout?

No, I'm at 3.2-rc1:

https://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=blob;f=arch/powerpc/kvm/e500_tlb.c#l1029


Ah, yes, that's old. I'm behind Avi's tree, so changes take a while to 
trickle through to Linus. Please check out the following (current 
development) tree:


git://github.com/agraf/linux-2.6.git kvm-ppc-next

Alex


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: PPC: booke: check for signals in kvmppc_vcpu_run

2011-11-11 Thread Alexander Graf

On 11/11/2011 04:32 PM, Scott Wood wrote:

On Fri, Nov 11, 2011 at 03:09:17PM +0100, Alexander Graf wrote:

On 11/08/2011 11:11 PM, Scott Wood wrote:

Currently we check prior to returning from a lightweight exit,
but not prior to initial entry.

book3s already does a similar test.

Signed-off-by: Scott Woodscottw...@freescale.com
---
  arch/powerpc/kvm/booke.c |   10 +-
  1 files changed, 9 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index b642200..9c78589 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -322,11 +322,19 @@ int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct 
kvm_vcpu *vcpu)
}

local_irq_disable();
+
+   if (signal_pending(current)) {

Any reason you're doing this after irq_disable()?

If we get a signal after the check, we want to be sure that we don't
receive the reschedule IPI until after we're in the guest, so that it
will cause another signal check.


Makes sense. So the current book3s implementation is wrong?


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: PPC: booke: check for signals in kvmppc_vcpu_run

2011-11-11 Thread Scott Wood
On Fri, Nov 11, 2011 at 03:09:17PM +0100, Alexander Graf wrote:
 On 11/08/2011 11:11 PM, Scott Wood wrote:
 Currently we check prior to returning from a lightweight exit,
 but not prior to initial entry.
 
 book3s already does a similar test.
 
 Signed-off-by: Scott Woodscottw...@freescale.com
 ---
   arch/powerpc/kvm/booke.c |   10 +-
   1 files changed, 9 insertions(+), 1 deletions(-)
 
 diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
 index b642200..9c78589 100644
 --- a/arch/powerpc/kvm/booke.c
 +++ b/arch/powerpc/kvm/booke.c
 @@ -322,11 +322,19 @@ int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct 
 kvm_vcpu *vcpu)
  }
 
  local_irq_disable();
 +
 +if (signal_pending(current)) {
 
 Any reason you're doing this after irq_disable()?

If we get a signal after the check, we want to be sure that we don't
receive the reschedule IPI until after we're in the guest, so that it
will cause another signal check.
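
Schematically (just a sketch of the ordering, not the actual booke code; the
guest-entry helper name below is a placeholder):

local_irq_disable();

if (signal_pending(current)) {
	/* bail out to userspace before entering the guest */
	local_irq_enable();
	kvm_run->exit_reason = KVM_EXIT_INTR;
	return -EINTR;
}

/*
 * Interrupts stay off until the entry path re-enables them inside the
 * guest, so a reschedule IPI for a signal that arrives after the check
 * is only delivered once we are in the guest, where it forces an exit
 * and another signal check.
 */
enter_guest(vcpu);	/* placeholder for the real entry code */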

-Scott

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: PPC: protect use of kvmppc_h_pr

2011-11-11 Thread Alexander Graf

On 11/08/2011 06:17 PM, Andreas Schwab wrote:

kvmppc_h_pr is only available if CONFIG_KVM_BOOK3S_64_PR.

Signed-off-by: Andreas Schwabsch...@linux-m68k.org


Thanks, applied to kvm-ppc-next.


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: PPC: move compute_tlbie_rb to book3s_64 common header

2011-11-11 Thread Alexander Graf

On 11/08/2011 06:08 PM, Andreas Schwab wrote:

compute_tlbie_rb is only used on ppc64 and cannot be compiled on ppc32.

Signed-off-by: Andreas Schwabsch...@linux-m68k.org


Thanks, applied to kvm-ppc-next.

Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: PPC: booke: check for signals in kvmppc_vcpu_run

2011-11-11 Thread Scott Wood
On 11/11/2011 09:35 AM, Alexander Graf wrote:
 On 11/11/2011 04:32 PM, Scott Wood wrote:
 On Fri, Nov 11, 2011 at 03:09:17PM +0100, Alexander Graf wrote:
 On 11/08/2011 11:11 PM, Scott Wood wrote:
 Currently we check prior to returning from a lightweight exit,
 but not prior to initial entry.

 book3s already does a similar test.

 Signed-off-by: Scott Woodscottw...@freescale.com
 ---
   arch/powerpc/kvm/booke.c |   10 +-
   1 files changed, 9 insertions(+), 1 deletions(-)

 diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
 index b642200..9c78589 100644
 --- a/arch/powerpc/kvm/booke.c
 +++ b/arch/powerpc/kvm/booke.c
 @@ -322,11 +322,19 @@ int kvmppc_vcpu_run(struct kvm_run *kvm_run,
 struct kvm_vcpu *vcpu)
   }

   local_irq_disable();
 +
 +if (signal_pending(current)) {
 Any reason you're doing this after irq_disable()?
 If we get a signal after the check, we want to be sure that we don't
 receive the reschedule IPI until after we're in the guest, so that it
 will cause another signal check.
 
 Makes sense. So the current book3s implementation is wrong?

I think so.

-Scott

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [RFC PATCH] vfio: VFIO Driver core framework

2011-11-11 Thread Alex Williamson
On Wed, 2011-11-09 at 18:57 -0600, Christian Benvenuti (benve) wrote:
 Here are a few minor comments on vfio_iommu.c ...

Sorry, I've been poking sticks at trying to figure out a clean way to
solve the force vfio driver attach problem.

  diff --git a/drivers/vfio/vfio_iommu.c b/drivers/vfio/vfio_iommu.c
  new file mode 100644
  index 000..029dae3
  --- /dev/null
  +++ b/drivers/vfio/vfio_iommu.c
snip
  +
  +#include vfio_private.h
 
 Doesn't the 'dma_'  prefix belong to the generic DMA code?

Sure, we could make these more vfio-centric.

  +struct dma_map_page {
  +   struct list_headlist;
  +   dma_addr_t  daddr;
  +   unsigned long   vaddr;
  +   int npage;
  +   int rdwr;
  +};
  +
  +/*
  + * This code handles mapping and unmapping of user data buffers
  + * into DMA'ble space using the IOMMU
  + */
  +
  +#define NPAGE_TO_SIZE(npage)   ((size_t)(npage) << PAGE_SHIFT)
  +
  +struct vwork {
  +   struct mm_struct*mm;
  +   int npage;
  +   struct work_struct  work;
  +};
  +
  +/* delayed decrement for locked_vm */
  +static void vfio_lock_acct_bg(struct work_struct *work)
  +{
  +   struct vwork *vwork = container_of(work, struct vwork, work);
  +   struct mm_struct *mm;
  +
  +   mm = vwork->mm;
  +   down_write(&mm->mmap_sem);
  +   mm->locked_vm += vwork->npage;
  +   up_write(&mm->mmap_sem);
  +   mmput(mm);  /* unref mm */
  +   kfree(vwork);
  +}
  +
  +static void vfio_lock_acct(int npage)
  +{
  +   struct vwork *vwork;
  +   struct mm_struct *mm;
  +
  +   if (!current->mm) {
  +   /* process exited */
  +   return;
  +   }
  +   if (down_write_trylock(&current->mm->mmap_sem)) {
  +   current->mm->locked_vm += npage;
  +   up_write(&current->mm->mmap_sem);
  +   return;
  +   }
  +   /*
  +* Couldn't get mmap_sem lock, so must setup to decrement
   ^
 
 Increment?

Yep

snip
  +int vfio_remove_dma_overlap(struct vfio_iommu *iommu, dma_addr_t
  start,
  +   size_t size, struct dma_map_page *mlp)
  +{
  +   struct dma_map_page *split;
  +   int npage_lo, npage_hi;
  +
  +   /* Existing dma region is completely covered, unmap all */
 
 This works. However, given how vfio_dma_map_dm implements the merging
 logic, I think it is impossible to have
 
 (start < mlp->daddr &&
  start + size > mlp->daddr + NPAGE_TO_SIZE(mlp->npage))

It's quite possible.  This allows userspace to create a sparse mapping,
then blow it all away with a single unmap from 0 to ~0.

  +   if (start <= mlp->daddr &&
  +   start + size >= mlp->daddr + NPAGE_TO_SIZE(mlp->npage)) {
  +   vfio_dma_unmap(iommu, mlp->daddr, mlp->npage, mlp->rdwr);
  +   list_del(&mlp->list);
  +   npage_lo = mlp->npage;
  +   kfree(mlp);
  +   return npage_lo;
  +   }
  +
  +   /* Overlap low address of existing range */
 
  Same as above (ie, '<' is impossible)

existing:   |--- A ---|  |--- B ---|
unmap:|--- C ---|

Maybe not good practice from userspace, but we shouldn't count on
userspace to be well behaved.

  +   if (start <= mlp->daddr) {
  +   size_t overlap;
  +
  +   overlap = start + size - mlp->daddr;
  +   npage_lo = overlap >> PAGE_SHIFT;
  +   npage_hi = mlp->npage - npage_lo;
  +
  +   vfio_dma_unmap(iommu, mlp->daddr, npage_lo, mlp->rdwr);
  +   mlp->daddr += overlap;
  +   mlp->vaddr += overlap;
  +   mlp->npage -= npage_lo;
  +   return npage_lo;
  +   }
 
  Same as above (ie, '>' is impossible).

Same example as above.

  +   /* Overlap high address of existing range */
  +   if (start + size >= mlp->daddr + NPAGE_TO_SIZE(mlp->npage)) {
  +   size_t overlap;
  +
  +   overlap = mlp->daddr + NPAGE_TO_SIZE(mlp->npage) - start;
  +   npage_hi = overlap >> PAGE_SHIFT;
  +   npage_lo = mlp->npage - npage_hi;
  +
  +   vfio_dma_unmap(iommu, start, npage_hi, mlp->rdwr);
  +   mlp->npage -= npage_hi;
  +   return npage_hi;
  +   }
snip
  +int vfio_dma_map_dm(struct vfio_iommu *iommu, struct vfio_dma_map
  *dmp)
  +{
  +   int npage;
  +   struct dma_map_page *mlp, *mmlp = NULL;
  +   dma_addr_t daddr = dmp->dmaaddr;
  +   unsigned long locked, lock_limit, vaddr = dmp->vaddr;
  +   size_t size = dmp->size;
  +   int ret = 0, rdwr = dmp->flags & VFIO_DMA_MAP_FLAG_WRITE;
  +
  +   if (vaddr & (PAGE_SIZE-1))
  +   return -EINVAL;
  +   if (daddr & (PAGE_SIZE-1))
  +   return -EINVAL;
  +   if (size & (PAGE_SIZE-1))
  +   return -EINVAL;
  +
  +   npage = size >> PAGE_SHIFT;
  +   if (!npage)
  +   return -EINVAL;
  +
  +   if (!iommu)
  +   return -EINVAL;
  +
  +   mutex_lock(&iommu->dgate);
  +
  +   if (vfio_find_dma(iommu, daddr, size)) {
  +   ret = -EBUSY;
  +   goto out_lock;
  +   }
  +
  +   /* account for locked pages 

Re: [Qemu-devel] qemu and qemu.git - Migration + disk stress introduces qcow2 corruptions

2011-11-11 Thread Anthony Liguori

On 11/11/2011 08:44 AM, Kevin Wolf wrote:

Am 11.11.2011 15:35, schrieb Anthony Liguori:

This is not a bug fix, this is a new feature.  We're long past feature freeze.
It's not a simple and obvious fix either.  It only partially fixes the problem
and introduces other problems.  It's not a good candidate for making an
exception at this stage in the release.

[1] http://mid.gmane.org/cover.1294150511.git.quint...@redhat.com


Then please send a fix that fails migration with non-raw images. Not
breaking images silently during migration is critical for 1.0, IMO.


I sent a quick series.  If you want to do things differently for the block
layer, like adding a field to BlockDriver to indicate whether migration is
supported and registering the blocker in the core code, feel free to do that.
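
Purely as an illustration of that alternative (all names below are made up,
not QEMU's actual API), the BlockDriver-field variant could look roughly like
this, with the core code walking the open drives during migration setup:

/* hypothetical sketch, not real QEMU code */
struct BlockDriver {
    const char *format_name;
    bool migration_unsafe;              /* set by qcow2, qed, ... */
    /* ... */
};

static int block_check_migration_blockers(BlockDriverState **bs, int n)
{
    int i;

    for (i = 0; i < n; i++) {
        if (bs[i]->drv && bs[i]->drv->migration_unsafe) {
            fprintf(stderr, "live migration not supported for format '%s'\n",
                    bs[i]->drv->format_name);
            return -ENOTSUP;            /* fail migration instead of corrupting */
        }
    }
    return 0;
}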


Regards,

Anthony Liguori



Kevin



--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] [ver3 PATCH 0/6] Implement multiqueue virtio-net

2011-11-11 Thread Sasha Levin
Hi,

I'm seeing this BUG() sometimes when running it using a small patch I
did for KVM tool:

[1.280766] BUG: unable to handle kernel NULL pointer dereference at
0010
[1.281531] IP: [810b3ac7] free_percpu+0x9a/0x104
[1.281531] PGD 0 
[1.281531] Oops:  [#1] PREEMPT SMP 
[1.281531] CPU 0 
[1.281531] Pid: 1, comm: swapper Not tainted
3.1.0-sasha-19665-gef3d2b7 #39  
[1.281531] RIP: 0010:[810b3ac7]  [810b3ac7]
free_percpu+0x9a/0x104
[1.281531] RSP: 0018:88001383fd50  EFLAGS: 00010046
[1.281531] RAX:  RBX: 0282 RCX:
000f4400
[1.281531] RDX: 3000 RSI: 88000240 RDI:
01c06063
[1.281531] RBP: 880013fcb7c0 R08: ea4e30c0 R09:
8138ba64
[1.281531] R10: 1880 R11: 1880 R12:
881213c0
[1.281531] R13: 8800138c0e00 R14: 0010 R15:
8800138c0d00
[1.281531] FS:  () GS:880013c0()
knlGS:
[1.281531] CS:  0010 DS:  ES:  CR0: 8005003b
[1.281531] CR2: 0010 CR3: 01c05000 CR4:
000406f0
[1.281531] DR0:  DR1:  DR2:

[1.281531] DR3:  DR6: 0ff0 DR7:
0400
[1.281531] Process swapper (pid: 1, threadinfo 88001383e000,
task 880013848000)
[1.281531] Stack:
[1.281531]  880013846ec0  
8138a0e5
[1.281531]  880013846ec0 880013846800 880013b6c000
8138bb63
[1.281531]  0011 000f 8800fff0
000181239bcd
[1.281531] Call Trace:
[1.281531]  [8138a0e5] ? free_rq_sq+0x2c/0xce
[1.281531]  [8138bb63] ? virtnet_probe+0x81c/0x855
[1.281531]  [8129c9e7] ? virtio_dev_probe+0xa7/0xc6
[1.281531]  [8134d2c3] ? driver_probe_device+0xb2/0x142
[1.281531]  [8134d3a2] ? __driver_attach+0x4f/0x6f
[1.281531]  [8134d353] ? driver_probe_device+0x142/0x142
[1.281531]  [8134c3ab] ? bus_for_each_dev+0x47/0x72
[1.281531]  [8134c90d] ? bus_add_driver+0xa2/0x1e6
[1.281531]  [81cc1b36] ? tun_init+0x89/0x89
[1.281531]  [8134db59] ? driver_register+0x8d/0xf8
[1.281531]  [81cc1b36] ? tun_init+0x89/0x89
[1.281531]  [81c98ac1] ? do_one_initcall+0x78/0x130
[1.281531]  [81c98c0e] ? kernel_init+0x95/0x113
[1.281531]  [81658274] ? kernel_thread_helper+0x4/0x10
[1.281531]  [81c98b79] ? do_one_initcall+0x130/0x130
[1.281531]  [81658270] ? gs_change+0x13/0x13
[1.281531] Code: c2 85 d2 48 0f 45 2d d1 39 ce 00 eb 22 65 8b 14 25
90 cc 00 00 48 8b 05 f0 a6 bc 00 48 63 d2 4c 89 e7 48 03 3c d0 e8 83 dd
00 00 
[1.281531]  8b 68 10 44 89 e6 48 89 ef 2b 75 18 e8 e4 f1 ff ff 8b 05
fd 
[1.281531] RIP  [810b3ac7] free_percpu+0x9a/0x104
[1.281531]  RSP 88001383fd50
[1.281531] CR2: 0010
[1.281531] ---[ end trace 68cbc23dfe2fe62a ]---

I don't have time today to dig into it, sorry.

On Fri, 2011-11-11 at 18:32 +0530, Krishna Kumar wrote:
 This patch series resurrects the earlier multiple TX/RX queues
 functionality for virtio_net, and addresses the issues pointed
 out.  It also includes an API to share irq's, f.e.  amongst the
 TX vqs. 
 
 I plan to run TCP/UDP STREAM and RR tests for local-host and
 local-remote, and send the results in the next couple of days.
 
 
 patch #1: Introduce VIRTIO_NET_F_MULTIQUEUE
 patch #2: Move 'num_queues' to virtqueue
 patch #3: virtio_net driver changes
 patch #4: vhost_net changes
 patch #5: Implement find_vqs_irq()
 patch #6: Convert virtio_net driver to use find_vqs_irq()
 
 
   Changes from rev2:
 Michael:
 ---
 1. Added functions to handle setting RX/TX/CTRL vq's.
 2. num_queue_pairs instead of numtxqs.
 3. Experimental support for fewer irq's in find_vqs.
 
 Rusty:
 --
 4. Cleaned up some existing while (1).
 5. rvq/svq and rx_sg/tx_sg changed to vq and sg respectively.
 6. Cleaned up some #if 1 code.
 
 
 Issue when using patch5:
 -
 
 The new API is designed to minimize code duplication.  E.g.
 vp_find_vqs() is implemented as:
 
 static int vp_find_vqs(...)
 {
   return vp_find_vqs_irq(vdev, nvqs, vqs, callbacks, names, NULL);
 }
 
 In my testing, when multiple tx/rx is used with multiple netperf
 sessions, all the device tx queues stops a few thousand times and
 subsequently woken up by skb_xmit_done.  But after some 40K-50K
 iterations of stop/wake, some of the txq's stop and no wake
 interrupt comes. (modprobe -r followed by modprobe solves this, so
 it is not a system hang).  At the time of the hang (#txqs=#rxqs=4):
 
 # egrep CPU|virtio0 /proc/interrupts | grep -v config
CPU0 CPU1 CPU2CPU3
 41:49057

[PATCH] kvm tools: Fix parameter to virtio device features config

2011-11-11 Thread Sasha Levin
Commit ef10d0bfc95f72bebd37a2ef7205bf2d1f68d162 broke config updates
by passing wrong argument to set_config(), this patch fixes it.

Signed-off-by: Sasha Levin levinsasha...@gmail.com
---
 tools/kvm/virtio/pci.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/tools/kvm/virtio/pci.c b/tools/kvm/virtio/pci.c
index 85994f8..1660f06 100644
--- a/tools/kvm/virtio/pci.c
+++ b/tools/kvm/virtio/pci.c
@@ -181,7 +181,7 @@ static bool virtio_pci__io_out(struct ioport *ioport, 
struct kvm *kvm, u16 port,
switch (offset) {
case VIRTIO_PCI_GUEST_FEATURES:
val = ioport__read32(data);
-   vtrans->virtio_ops->set_guest_features(kvm, vpci, val);
+   vtrans->virtio_ops->set_guest_features(kvm, vpci->dev, val);
break;
case VIRTIO_PCI_QUEUE_PFN:
val = ioport__read32(data);
-- 
1.7.7.2

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC] kvm tools: Implement multiple VQ for virtio-net

2011-11-11 Thread Sasha Levin
This is a patch based on Krishna Kumar's patch series which implements
multiple VQ support for virtio-net.

The patch was tested with ver3 of the patch.

Cc: Krishna Kumar krkum...@in.ibm.com
Cc: Michael S. Tsirkin m...@redhat.com
Cc: Rusty Russell ru...@rustcorp.com.au
Cc: virtualizat...@lists.linux-foundation.org
Cc: net...@vger.kernel.org
Signed-off-by: Sasha Levin levinsasha...@gmail.com
---
 tools/kvm/include/kvm/virtio-pci.h |2 +-
 tools/kvm/virtio/net.c |   94 +++
 2 files changed, 52 insertions(+), 44 deletions(-)

diff --git a/tools/kvm/include/kvm/virtio-pci.h 
b/tools/kvm/include/kvm/virtio-pci.h
index 2bbb271..94d20ee 100644
--- a/tools/kvm/include/kvm/virtio-pci.h
+++ b/tools/kvm/include/kvm/virtio-pci.h
@@ -6,7 +6,7 @@
 
 #include linux/types.h
 
-#define VIRTIO_PCI_MAX_VQ  3
+#define VIRTIO_PCI_MAX_VQ  16
 #define VIRTIO_PCI_MAX_CONFIG  1
 
 struct kvm;
diff --git a/tools/kvm/virtio/net.c b/tools/kvm/virtio/net.c
index cee2b5b..0754795 100644
--- a/tools/kvm/virtio/net.c
+++ b/tools/kvm/virtio/net.c
@@ -27,9 +27,8 @@
 #include sys/wait.h
 
 #define VIRTIO_NET_QUEUE_SIZE  128
-#define VIRTIO_NET_NUM_QUEUES  2
-#define VIRTIO_NET_RX_QUEUE0
-#define VIRTIO_NET_TX_QUEUE1
+#define VIRTIO_NET_NUM_QUEUES  16
+#define VIRTIO_NET_IS_RX_QUEUE(x)  (((x) % 2) == 0)
 
 struct net_dev;
 
@@ -49,14 +48,13 @@ struct net_dev {
struct virtio_net_configconfig;
u32 features;
 
-   pthread_t   io_rx_thread;
-   pthread_mutex_t io_rx_lock;
-   pthread_cond_t  io_rx_cond;
-
-   pthread_t   io_tx_thread;
-   pthread_mutex_t io_tx_lock;
-   pthread_cond_t  io_tx_cond;
+   pthread_t   io_thread[VIRTIO_NET_NUM_QUEUES];
+   pthread_mutex_t io_lock[VIRTIO_NET_NUM_QUEUES];
+   pthread_cond_t  io_cond[VIRTIO_NET_NUM_QUEUES];
 
+   int rx_vq_num;
+   int tx_vq_num;
+   int vq_num;
int tap_fd;
chartap_name[IFNAMSIZ];
 
@@ -78,17 +76,22 @@ static void *virtio_net_rx_thread(void *p)
struct net_dev *ndev = p;
u16 out, in;
u16 head;
-   int len;
+   int len, queue_num;
+
+   mutex_lock(ndev-mutex);
+   queue_num = ndev-rx_vq_num * 2;
+   ndev-tx_vq_num++;
+   mutex_unlock(ndev-mutex);
 
kvm = ndev-kvm;
-   vq  = ndev-vqs[VIRTIO_NET_RX_QUEUE];
+   vq  = ndev-vqs[queue_num];
 
while (1) {
 
-   mutex_lock(ndev-io_rx_lock);
+   mutex_lock(ndev-io_lock[queue_num]);
if (!virt_queue__available(vq))
-   pthread_cond_wait(ndev-io_rx_cond, ndev-io_rx_lock);
-   mutex_unlock(ndev-io_rx_lock);
+   pthread_cond_wait(ndev-io_cond[queue_num], 
ndev-io_lock[queue_num]);
+   mutex_unlock(ndev-io_lock[queue_num]);
 
while (virt_queue__available(vq)) {
 
@@ -99,7 +102,7 @@ static void *virtio_net_rx_thread(void *p)
virt_queue__set_used_elem(vq, head, len);
 
/* We should interrupt guest right now, otherwise 
latency is huge. */
-   ndev-vtrans.trans_ops-signal_vq(kvm, ndev-vtrans, 
VIRTIO_NET_RX_QUEUE);
+   ndev-vtrans.trans_ops-signal_vq(kvm, ndev-vtrans, 
queue_num);
}
 
}
@@ -117,16 +120,21 @@ static void *virtio_net_tx_thread(void *p)
struct net_dev *ndev = p;
u16 out, in;
u16 head;
-   int len;
+   int len, queue_num;
+
+   mutex_lock(ndev-mutex);
+   queue_num = ndev-tx_vq_num * 2 + 1;
+   ndev-tx_vq_num++;
+   mutex_unlock(ndev-mutex);
 
kvm = ndev-kvm;
-   vq  = ndev-vqs[VIRTIO_NET_TX_QUEUE];
+   vq  = ndev-vqs[queue_num];
 
while (1) {
-   mutex_lock(ndev-io_tx_lock);
+   mutex_lock(ndev-io_lock[queue_num]);
if (!virt_queue__available(vq))
-   pthread_cond_wait(ndev-io_tx_cond, ndev-io_tx_lock);
-   mutex_unlock(ndev-io_tx_lock);
+   pthread_cond_wait(ndev-io_cond[queue_num], 
ndev-io_lock[queue_num]);
+   mutex_unlock(ndev-io_lock[queue_num]);
 
while (virt_queue__available(vq)) {
 
@@ -137,7 +145,7 @@ static void *virtio_net_tx_thread(void *p)
virt_queue__set_used_elem(vq, head, len);
}
 
-   ndev-vtrans.trans_ops-signal_vq(kvm, ndev-vtrans, 
VIRTIO_NET_TX_QUEUE);
+   ndev-vtrans.trans_ops-signal_vq(kvm, ndev-vtrans, 
queue_num);
}
 

RE: [RFC PATCH] vfio: VFIO Driver core framework

2011-11-11 Thread Christian Benvenuti (benve)
 -Original Message-
 From: Alex Williamson [mailto:alex.william...@redhat.com]
 Sent: Friday, November 11, 2011 10:04 AM
 To: Christian Benvenuti (benve)
 Cc: chr...@sous-sol.org; a...@au1.ibm.com; p...@au1.ibm.com;
 d...@au1.ibm.com; joerg.roe...@amd.com; ag...@suse.de; Aaron Fabbri
 (aafabbri); b08...@freescale.com; b07...@freescale.com; a...@redhat.com;
 konrad.w...@oracle.com; kvm@vger.kernel.org; qemu-de...@nongnu.org;
 io...@lists.linux-foundation.org; linux-...@vger.kernel.org
 Subject: RE: [RFC PATCH] vfio: VFIO Driver core framework
 
 On Wed, 2011-11-09 at 18:57 -0600, Christian Benvenuti (benve) wrote:
   Here are a few minor comments on vfio_iommu.c ...
 
 Sorry, I've been poking sticks at trying to figure out a clean way to
 solve the force vfio driver attach problem.

Attach or detach?

   diff --git a/drivers/vfio/vfio_iommu.c b/drivers/vfio/vfio_iommu.c
   new file mode 100644
   index 000..029dae3
   --- /dev/null
   +++ b/drivers/vfio/vfio_iommu.c
 snip
   +
   +#include vfio_private.h
 
  Doesn't the 'dma_'  prefix belong to the generic DMA code?
 
  Sure, we could make these more vfio-centric.

Like vfio_dma_map_page?

 
   +struct dma_map_page {
   + struct list_headlist;
   + dma_addr_t  daddr;
   + unsigned long   vaddr;
   + int npage;
   + int rdwr;
   +};
   +
   +/*
   + * This code handles mapping and unmapping of user data buffers
   + * into DMA'ble space using the IOMMU
   + */
   +
   +#define NPAGE_TO_SIZE(npage) ((size_t)(npage) << PAGE_SHIFT)
   +
   +struct vwork {
   + struct mm_struct*mm;
   + int npage;
   + struct work_struct  work;
   +};
   +
   +/* delayed decrement for locked_vm */
   +static void vfio_lock_acct_bg(struct work_struct *work)
   +{
   + struct vwork *vwork = container_of(work, struct vwork, work);
   + struct mm_struct *mm;
   +
   + mm = vwork-mm;
   + down_write(mm-mmap_sem);
   + mm-locked_vm += vwork-npage;
   + up_write(mm-mmap_sem);
   + mmput(mm);  /* unref mm */
   + kfree(vwork);
   +}
   +
   +static void vfio_lock_acct(int npage)
   +{
   + struct vwork *vwork;
   + struct mm_struct *mm;
   +
   + if (!current-mm) {
   + /* process exited */
   + return;
   + }
   + if (down_write_trylock(current-mm-mmap_sem)) {
   + current-mm-locked_vm += npage;
   + up_write(current-mm-mmap_sem);
   + return;
   + }
   + /*
   +  * Couldn't get mmap_sem lock, so must setup to decrement
^
 
  Increment?
 
 Yep
 
 snip
   +int vfio_remove_dma_overlap(struct vfio_iommu *iommu, dma_addr_t
   start,
   + size_t size, struct dma_map_page *mlp)
   +{
   + struct dma_map_page *split;
   + int npage_lo, npage_hi;
   +
   + /* Existing dma region is completely covered, unmap all */
 
  This works. However, given how vfio_dma_map_dm implements the merging
  logic, I think it is impossible to have
 
  (start < mlp->daddr &&
   start + size > mlp->daddr + NPAGE_TO_SIZE(mlp->npage))
 
 It's quite possible.  This allows userspace to create a sparse mapping,
 then blow it all away with a single unmap from 0 to ~0.

I would prefer the user to use exact ranges in the unmap operations
because it would make it easier to detect bugs/leaks in the map/unmap
logic used by the callers.
My assumptions are that:

- the user always keeps track of the mappings

- the user either unmaps one specific mapping or 'all of them'.
  The 'all of them' case would also take care of those cases where
  the user does _not_ keep track of mappings and simply uses
  the unmap from 0 to ~0 each time.

Because of this you could still provide an exact map/unmap logic
and allow such unmap from 0 to ~0 by making the latter a special
case.
However, if we want to allow any arbitrary/inexact unmap request, then OK.

   + if (start <= mlp->daddr &&
   + start + size >= mlp->daddr + NPAGE_TO_SIZE(mlp->npage)) {
   + vfio_dma_unmap(iommu, mlp->daddr, mlp->npage, mlp->rdwr);
   + list_del(&mlp->list);
   + npage_lo = mlp->npage;
   + kfree(mlp);
   + return npage_lo;
   + }
   +
   + /* Overlap low address of existing range */
 
   Same as above (ie, '<' is impossible)
 
 existing:   |--- A ---|  |--- B ---|
 unmap:|--- C ---|
 
 Maybe not good practice from userspace, but we shouldn't count on
 userspace to be well behaved.
 
   + if (start <= mlp->daddr) {
   + size_t overlap;
   +
   + overlap = start + size - mlp->daddr;
   + npage_lo = overlap >> PAGE_SHIFT;
   + npage_hi = mlp->npage - npage_lo;
   +
   + vfio_dma_unmap(iommu, mlp->daddr, npage_lo, mlp->rdwr);
   + mlp->daddr += overlap;
   + mlp->vaddr += overlap;
   + mlp->npage -= npage_lo;
   + return npage_lo;
   + }
 
   Same as above (ie, '>' is impossible).
 
 Same example as above.
 
   + /* 

Re: [RFC PATCH] vfio: VFIO Driver core framework

2011-11-11 Thread Scott Wood
On 11/03/2011 03:12 PM, Alex Williamson wrote:
 +Many modern system now provide DMA and interrupt remapping facilities
 +to help ensure I/O devices behave within the boundaries they've been
 +allotted.  This includes x86 hardware with AMD-Vi and Intel VT-d as
 +well as POWER systems with Partitionable Endpoints (PEs) and even
 +embedded powerpc systems (technology name unknown).  

Maybe replace (technology name unknown) with (such as Freescale chips
with PAMU) or similar?

Or just leave out the parenthetical.

 +As documented in linux/vfio.h, several ioctls are provided on the
 +group chardev:
 +
 +#define VFIO_GROUP_GET_FLAGS_IOR(';', 100, __u64)
 + #define VFIO_GROUP_FLAGS_VIABLE(1  0)
 + #define VFIO_GROUP_FLAGS_MM_LOCKED (1  1)
 +#define VFIO_GROUP_MERGE_IOW(';', 101, int)
 +#define VFIO_GROUP_UNMERGE  _IOW(';', 102, int)
 +#define VFIO_GROUP_GET_IOMMU_FD _IO(';', 103)
 +#define VFIO_GROUP_GET_DEVICE_FD_IOW(';', 104, char *)

This suggests the argument to VFIO_GROUP_GET_DEVICE_FD is a pointer to a
pointer to char rather than a pointer to an array of char (just as e.g.
VFIO_GROUP_MERGE takes a pointer to an int, not just an int).

 +The IOMMU file descriptor provides this set of ioctls:
 +
 +#define VFIO_IOMMU_GET_FLAGS_IOR(';', 105, __u64)
 + #define VFIO_IOMMU_FLAGS_MAP_ANY   (1  0)
 +#define VFIO_IOMMU_MAP_DMA  _IOWR(';', 106, struct vfio_dma_map)
 +#define VFIO_IOMMU_UNMAP_DMA_IOWR(';', 107, struct vfio_dma_map)

What is the implication if VFIO_IOMMU_FLAGS_MAP_ANY is clear?  Is such
an implementation supposed to add a new flag that describes its
restrictions?

Can we get a way to turn DMA access off and on, short of unmapping
everything, and then mapping it again?

 +The GET_FLAGS ioctl returns basic information about the IOMMU domain.
 +We currently only support IOMMU domains that are able to map any
 +virtual address to any IOVA.  This is indicated by the MAP_ANY flag.
 +
 +The (UN)MAP_DMA commands make use of struct vfio_dma_map for mapping
 +and unmapping IOVAs to process virtual addresses:
 +
 +struct vfio_dma_map {
 +__u64   len;/* length of structure */
 +__u64   vaddr;  /* process virtual addr */
 +__u64   dmaaddr;/* desired and/or returned dma address */
 +__u64   size;   /* size in bytes */
 +__u64   flags;
 +#define VFIO_DMA_MAP_FLAG_WRITE (1  0) /* req writeable DMA mem */
 +};

What are the semantics of desired and/or returned dma address?

Are we always supposed to provide a desired address, but it may be
different on return?  Or are there cases where we want to say give me
whatever you want or give me this or fail?

How much of this needs to be filled out for unmap?

Note that the length of structure approach means that ioctl numbers
will change whenever this grows -- perhaps we should avoid encoding the
struct size into these ioctls?
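
The size really is part of the number: _IOWR() folds sizeof(type) into the
request value, so appending a field changes the ioctl number.  A small
userspace illustration (the struct layouts here are assumed, not the final
ABI):

#include <stdio.h>
#include <linux/ioctl.h>
#include <linux/types.h>

struct vfio_dma_map_v1 {
	__u64 len, vaddr, dmaaddr, size, flags;
};

struct vfio_dma_map_v2 {			/* same struct, one field appended */
	__u64 len, vaddr, dmaaddr, size, flags, extra;
};

int main(void)
{
	/* the two printed request numbers differ because sizeof() differs */
	printf("v1: %#lx\n", (unsigned long)_IOWR(';', 106, struct vfio_dma_map_v1));
	printf("v2: %#lx\n", (unsigned long)_IOWR(';', 106, struct vfio_dma_map_v2));
	return 0;
}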

 +struct vfio_region_info {
 +__u32   len;/* length of structure */
 +__u32   index;  /* region number */
 +__u64   size;   /* size in bytes of region */
 +__u64   offset; /* start offset of region */
 +__u64   flags;
 +#define VFIO_REGION_INFO_FLAG_MMAP  (1  0)
 +#define VFIO_REGION_INFO_FLAG_RO(1  1)
 +#define VFIO_REGION_INFO_FLAG_PHYS_VALID(1  2)
 +__u64   phys;   /* physical address of region */
 +};
 +
 +#define VFIO_DEVICE_GET_REGION_INFO _IOWR(';', 110, struct 
 vfio_region_info)
 +
 +The offset indicates the offset into the device file descriptor which
 +accesses the given range (for read/write/mmap/seek).  Flags indicate the
 +available access types and validity of optional fields.  For instance
 +the phys field may only be valid for certain devices types.
 +
 +Interrupts are described using a similar interface.  GET_NUM_IRQS
 +reports the number of IRQ indexes for the device.
 +
 +#define VFIO_DEVICE_GET_NUM_IRQS_IOR(';', 111, int)
 +
 +struct vfio_irq_info {
 +__u32   len;/* length of structure */
 +__u32   index;  /* IRQ number */
 +__u32   count;  /* number of individual IRQs */
 +__u64   flags;
 +#define VFIO_IRQ_INFO_FLAG_LEVEL(1  0)

Make sure flags is 64-bit aligned -- some 32-bit ABIs, such as x86, will
not do this, causing problems if the kernel is 64-bit and thus assumes a
different layout.
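
A concrete illustration of the layout difference (the padded variant is just
one possible fix; __aligned_u64 would do the same job):

#include <linux/types.h>

/* i386: 'flags' lands at offset 12, sizeof() == 20.
 * x86_64: 'flags' lands at offset 16, sizeof() == 24.
 * A 64-bit kernel and a 32-bit caller would disagree about the layout. */
struct vfio_irq_info_unpadded {
	__u32 len;
	__u32 index;
	__u32 count;
	__u64 flags;
};

/* Explicit padding keeps both ABIs identical. */
struct vfio_irq_info_padded {
	__u32 len;
	__u32 index;
	__u32 count;
	__u32 pad;
	__u64 flags;		/* offset 16 everywhere */
};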

 +Information about each index can be retrieved using the GET_IRQ_INFO
 +ioctl, used much like GET_REGION_INFO.
 +
 +#define VFIO_DEVICE_GET_IRQ_INFO_IOWR(';', 112, struct vfio_irq_info)
 +
 +Individual indexes can describe single or sets of IRQs.  This provides the
 +flexibility to describe PCI INTx, MSI, and MSI-X using a single interface.
 +
 +All VFIO interrupts are signaled to userspace via eventfds.  Integer arrays,
 +as shown 

Re: [PATCH v4 2/7] iommu/core: split mapping to page sizes as supported by the hardware

2011-11-11 Thread Stepan Moskovchenko

On 11/11/2011 5:24 AM, Joerg Roedel wrote:

On Thu, Nov 10, 2011 at 01:12:00PM -0800, Stepan Moskovchenko wrote:

I have been experimenting with an iommu_map_range call, which maps a
given scatterlist of discontiguous physical pages into a contiguous
virtual region at a given IOVA. This has some performance advantages
over just calling iommu_map iteratively. First, it reduces the
amount of table walking / calculation needed for mapping each page,
given how you know that all the pages will be mapped into a single
virtually-contiguous region (so in most cases, the first-level table
calculation can be reused). Second, it allows one to defer the TLB
(and sometimes cache) maintenance operations until the entire
scatterlist has been mapped, rather than doing a TLB invalidate
after mapping each page, as would have been the case if iommu_map
were just being called from within a loop. Granted, just using
iommu_map many times may be acceptable on the slow path, but I have
seen significant performance gains when using this approach on the
fast path.

Yes, from a performance point-of-view that makes sense, as an addition
to the existing iommu_map interface. Are the pages in the list allowed
to have different page-sizes?


Joerg



Hello

Yes, the scatterlist is allowed to have different page sizes. But, they 
are required to have a length that is a multiple of 4K. If a page in the 
list is bigger than 4K, the code will iteratively map it with 4K pages. 
I suppose based on how my implementation is written, it would not be too 
difficult to add checks for the proper length and VA/PA alignments, and 
insert a 64K / 1M / 16M mapping if the alignment is lucky and the SG 
item is big enough.


In my particular test case, even though the pages in the list might be 
of different sizes, they are not guaranteed to be aligned properly and I 
would most likely have to fall back on mapping them as multiple 
consecutive 4K pages, anyway. But even despite this, having map_range to 
consolidate a lot of the common operations into one call sill gives me a 
nice speed boost.


I hadn't sent the patches out because this was all for my testing, but 
would you be interested in me adding a map_range to the API? The 
iommu_map_range call could even do a check if the ops supports a 
.map_range, and fall back on calling iommu_map repeatedly if the driver 
doesn't support this operation natively. In my code, the function takes 
a domain, iova, scatterlist, length, and prot.
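
Purely as a sketch of that fallback idea (the .map_range op and the parameter
ordering are assumptions here, not an existing kernel interface, and the
iommu_map() call below uses the current order-based signature):

#include <linux/iommu.h>
#include <linux/scatterlist.h>

static int iommu_map_range_sketch(struct iommu_domain *domain,
				  unsigned long iova,
				  struct scatterlist *sgl,
				  unsigned int nents, int prot)
{
	struct scatterlist *sg;
	unsigned long va = iova;
	unsigned int i, off;
	int ret;

	if (domain->ops->map_range)		/* assumed optional op */
		return domain->ops->map_range(domain, iova, sgl, nents, prot);

	/* Fall back to order-0 (4K) iommu_map() calls, one page at a time. */
	for_each_sg(sgl, sg, nents, i) {
		phys_addr_t pa = sg_phys(sg);

		for (off = 0; off < sg->length; off += PAGE_SIZE) {
			ret = iommu_map(domain, va, pa + off, 0, prot);
			if (ret)
				return ret;
			va += PAGE_SIZE;
		}
	}
	return 0;
}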


Steve
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] [ver3 PATCH 0/6] Implement multiqueue virtio-net

2011-11-11 Thread Krishna Kumar
Sasha Levin levinsasha...@gmail.com wrote on 11/12/2011 03:32:04 AM:

 I'm seeing this BUG() sometimes when running it using a small patch I
 did for KVM tool:
 
 [1.281531] Call Trace:
 [1.281531]  [8138a0e5] ? free_rq_sq+0x2c/0xce
 [1.281531]  [8138bb63] ? virtnet_probe+0x81c/0x855
 [1.281531]  [8129c9e7] ? virtio_dev_probe+0xa7/0xc6
 [1.281531]  [8134d2c3] ? driver_probe_device+0xb2/0x142
 [1.281531]  [8134d3a2] ? __driver_attach+0x4f/0x6f
 [1.281531]  [8134d353] ? driver_probe_device+0x142/0x142
 [1.281531]  [8134c3ab] ? bus_for_each_dev+0x47/0x72
 [1.281531]  [8134c90d] ? bus_add_driver+0xa2/0x1e6
 [1.281531]  [81cc1b36] ? tun_init+0x89/0x89
 [1.281531]  [8134db59] ? driver_register+0x8d/0xf8
 [1.281531]  [81cc1b36] ? tun_init+0x89/0x89
 [1.281531]  [81c98ac1] ? do_one_initcall+0x78/0x130
 [1.281531]  [81c98c0e] ? kernel_init+0x95/0x113
 [1.281531]  [81658274] ? kernel_thread_helper+0x4/0x10
 [1.281531]  [81c98b79] ? do_one_initcall+0x130/0x130
 [1.281531]  [81658270] ? gs_change+0x13/0x13
 [1.281531] Code: c2 85 d2 48 0f 45 2d d1 39 ce 00 eb 22 65 8b 14 25
 90 cc 00 00 48 8b 05 f0 a6 bc 00 48 63 d2 4c 89 e7 48 03 3c d0 e8 83 dd
 00 00 
 [1.281531]  8b 68 10 44 89 e6 48 89 ef 2b 75 18 e8 e4 f1 ff ff 8b 05
 fd 
 [1.281531] RIP  [810b3ac7] free_percpu+0x9a/0x104
 [1.281531]  RSP 88001383fd50
 [1.281531] CR2: 0010
 [1.281531] ---[ end trace 68cbc23dfe2fe62a ]---
 
 I don't have time today to dig into it, sorry.

Thanks for the report.

free_rq_sq() was being called twice in the failure path. The second
call panic'd since it had freed the same pointers earlier.

1. free_rq_sq() was being called twice in the failure path.
   virtnet_setup_vqs() had already freed up rq/sq on error, and
   virtnet_probe() tried to do it again. Fix it in virtnet_probe
   by moving the call up.
2. Make free_rq_sq() re-entrant by setting freed pointers to NULL.
3. Remove free_stats() as it was being called only once.

Sasha, could you please try this patch on top of existing patches?

thanks!

Signed-off-by: krkum...@in.ibm.com
---
 drivers/net/virtio_net.c |   41 +++--
 1 file changed, 13 insertions(+), 28 deletions(-)

diff -ruNp n6/drivers/net/virtio_net.c n7/drivers/net/virtio_net.c
--- n6/drivers/net/virtio_net.c 2011-11-12 11:03:48.0 +0530
+++ n7/drivers/net/virtio_net.c 2011-11-12 10:39:28.0 +0530
@@ -782,23 +782,6 @@ static void virtnet_netpoll(struct net_d
 }
 #endif
 
-static void free_stats(struct virtnet_info *vi)
-{
-   int i;
-
-   for (i = 0; i  vi-num_queue_pairs; i++) {
-   if (vi-sq  vi-sq[i]) {
-   free_percpu(vi-sq[i]-stats);
-   vi-sq[i]-stats = NULL;
-   }
-
-   if (vi-rq  vi-rq[i]) {
-   free_percpu(vi-rq[i]-stats);
-   vi-rq[i]-stats = NULL;
-   }
-   }
-}
-
 static int virtnet_open(struct net_device *dev)
 {
struct virtnet_info *vi = netdev_priv(dev);
@@ -1054,19 +1037,22 @@ static void free_rq_sq(struct virtnet_in
 {
int i;
 
-   free_stats(vi);
-
-   if (vi-rq) {
-   for (i = 0; i  vi-num_queue_pairs; i++)
+   for (i = 0; i  vi-num_queue_pairs; i++) {
+   if (vi-rq  vi-rq[i]) {
+   free_percpu(vi-rq[i]-stats);
kfree(vi-rq[i]);
-   kfree(vi-rq);
-   }
+   vi-rq[i] = NULL;
+   }
 
-   if (vi-sq) {
-   for (i = 0; i  vi-num_queue_pairs; i++)
+   if (vi-sq  vi-sq[i]) {
+   free_percpu(vi-sq[i]-stats);
kfree(vi-sq[i]);
-   kfree(vi-sq);
+   vi-sq[i] = NULL;
+   }
}
+
+   kfree(vi-rq);
+   kfree(vi-sq);
 }
 
 static void free_unused_bufs(struct virtnet_info *vi)
@@ -1387,10 +1373,9 @@ free_vqs:
for (i = 0; i  num_queue_pairs; i++)
cancel_delayed_work_sync(vi-rq[i]-refill);
vdev-config-del_vqs(vdev);
-
-free_netdev:
free_rq_sq(vi);
 
+free_netdev:
free_netdev(dev);
return err;
 }

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] [ver3 PATCH 0/6] Implement multiqueue virtio-net

2011-11-11 Thread Sasha Levin
On Sat, 2011-11-12 at 11:15 +0530, Krishna Kumar wrote:
 Sasha Levin levinsasha...@gmail.com wrote on 11/12/2011 03:32:04 AM:
 
  I'm seeing this BUG() sometimes when running it using a small patch I
  did for KVM tool:
  
  [1.281531] Call Trace:
  [1.281531]  [8138a0e5] ? free_rq_sq+0x2c/0xce
  [1.281531]  [8138bb63] ? virtnet_probe+0x81c/0x855
  [1.281531]  [8129c9e7] ? virtio_dev_probe+0xa7/0xc6
  [1.281531]  [8134d2c3] ? driver_probe_device+0xb2/0x142
  [1.281531]  [8134d3a2] ? __driver_attach+0x4f/0x6f
  [1.281531]  [8134d353] ? driver_probe_device+0x142/0x142
  [1.281531]  [8134c3ab] ? bus_for_each_dev+0x47/0x72
  [1.281531]  [8134c90d] ? bus_add_driver+0xa2/0x1e6
  [1.281531]  [81cc1b36] ? tun_init+0x89/0x89
  [1.281531]  [8134db59] ? driver_register+0x8d/0xf8
  [1.281531]  [81cc1b36] ? tun_init+0x89/0x89
  [1.281531]  [81c98ac1] ? do_one_initcall+0x78/0x130
  [1.281531]  [81c98c0e] ? kernel_init+0x95/0x113
  [1.281531]  [81658274] ? kernel_thread_helper+0x4/0x10
  [1.281531]  [81c98b79] ? do_one_initcall+0x130/0x130
  [1.281531]  [81658270] ? gs_change+0x13/0x13
  [1.281531] Code: c2 85 d2 48 0f 45 2d d1 39 ce 00 eb 22 65 8b 14 25
  90 cc 00 00 48 8b 05 f0 a6 bc 00 48 63 d2 4c 89 e7 48 03 3c d0 e8 83 dd
  00 00 
  [1.281531]  8b 68 10 44 89 e6 48 89 ef 2b 75 18 e8 e4 f1 ff ff 8b 05
  fd 
  [1.281531] RIP  [810b3ac7] free_percpu+0x9a/0x104
  [1.281531]  RSP 88001383fd50
  [1.281531] CR2: 0010
  [1.281531] ---[ end trace 68cbc23dfe2fe62a ]---
  
  I don't have time today to dig into it, sorry.
 
 Thanks for the report.
 
 free_rq_sq() was being called twice in the failure path. The second
 call panic'd since it had freed the same pointers earlier.
 
 1. free_rq_sq() was being called twice in the failure path.
virtnet_setup_vqs() had already freed up rq/sq on error, and
virtnet_probe() tried to do it again. Fix it in virtnet_probe
by moving the call up.
 2. Make free_rq_sq() re-entrant by setting freed pointers to NULL.
 3. Remove free_stats() as it was being called only once.
 
 Sasha, could you please try this patch on top of existing patches?
 
 thanks!
 
 Signed-off-by: krkum...@in.ibm.com
 ---
[snip]

I've tested it and it looks good, no BUG() or panic.

I would suggest adding some output if we take the failure path, since
while the guest does boot peacefully, it doesn't have a network
interface and there's nothing in the boot log to indicate anything has
gone wrong.
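
For instance, something along the lines of

dev_warn(&vdev->dev, "failed to set up virtqueues, no network device\n");

in the failure path would make the reason show up in the boot log (exact
wording and placement up to you, of course).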

-- 

Sasha.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: PPC: booke: check for signals in kvmppc_vcpu_run

2011-11-11 Thread Alexander Graf

On 11/08/2011 11:11 PM, Scott Wood wrote:

Currently we check prior to returning from a lightweight exit,
but not prior to initial entry.

book3s already does a similar test.

Signed-off-by: Scott Woodscottw...@freescale.com
---
  arch/powerpc/kvm/booke.c |   10 +-
  1 files changed, 9 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index b642200..9c78589 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -322,11 +322,19 @@ int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct 
kvm_vcpu *vcpu)
}

local_irq_disable();
+
+   if (signal_pending(current)) {


Any reason you're doing this after irq_disable()?


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: PPC: e500: Casting (void *) value returned by kmalloc is useless

2011-11-11 Thread Alexander Graf

On 11/11/2011 04:24 PM, Thomas Meyer wrote:

Am Freitag, den 11.11.2011, 15:05 +0100 schrieb Alexander Graf:

On 11/08/2011 08:15 PM, Thomas Meyer wrote:

From: Thomas Meyertho...@m3y3r.de

   Casting (void *) value returned by kmalloc is useless
   as mentioned in Documentation/CodingStyle, Chap 14.

   The semantic patch that makes this change is available
   in scripts/coccinelle/api/alloc/drop_kmalloc_cast.cocci.

   More information about semantic patching is available at
   http://coccinelle.lip6.fr/

Signed-off-by: Thomas Meyertho...@m3y3r.de
---

diff -u -p a/arch/powerpc/kvm/e500_tlb.c b/arch/powerpc/kvm/e500_tlb.c
--- a/arch/powerpc/kvm/e500_tlb.c 2011-11-07 19:37:27.329638682 +0100
+++ b/arch/powerpc/kvm/e500_tlb.c 2011-11-08 09:18:39.955928218 +0100
@@ -1026,11 +1026,11 @@ int kvmppc_e500_tlb_init(struct kvmppc_v
if (vcpu_e500->gtlb_arch[1] == NULL)
goto err_out_guest0;

-   vcpu_e500->gtlb_priv[0] = (struct tlbe_priv *)
+   vcpu_e500->gtlb_priv[0] =
kzalloc(sizeof(struct tlbe_priv) * KVM_E500_TLB0_SIZE, 
GFP_KERNEL);
if (vcpu_e500->gtlb_priv[0] == NULL)
goto err_out_guest1;
-   vcpu_e500->gtlb_priv[1] = (struct tlbe_priv *)
+   vcpu_e500->gtlb_priv[1] =
kzalloc(sizeof(struct tlbe_priv) * KVM_E500_TLB1_SIZE, 
GFP_KERNEL);

Sorry, I can't seem to find that code anywhere. I suppose you're using
an old git checkout?

No, I'm at 3.2-rc1:

https://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=blob;f=arch/powerpc/kvm/e500_tlb.c#l1029


Ah, yes, that's old. I'm behind Avi's tree, so changes take a while to 
trickle through to Linus. Please check out the following (current 
development) tree:


git://github.com/agraf/linux-2.6.git kvm-ppc-next

Alex


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: PPC: booke: check for signals in kvmppc_vcpu_run

2011-11-11 Thread Alexander Graf

On 11/11/2011 04:32 PM, Scott Wood wrote:

On Fri, Nov 11, 2011 at 03:09:17PM +0100, Alexander Graf wrote:

On 11/08/2011 11:11 PM, Scott Wood wrote:

Currently we check prior to returning from a lightweight exit,
but not prior to initial entry.

book3s already does a similar test.

Signed-off-by: Scott Woodscottw...@freescale.com
---
  arch/powerpc/kvm/booke.c |   10 +-
  1 files changed, 9 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index b642200..9c78589 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -322,11 +322,19 @@ int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct 
kvm_vcpu *vcpu)
}

local_irq_disable();
+
+   if (signal_pending(current)) {

Any reason you're doing this after irq_disable()?

If we get a signal after the check, we want to be sure that we don't
receive the reschedule IPI until after we're in the guest, so that it
will cause another signal check.


Makes sense. So the current book3s implementation is wrong?


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: PPC: booke: check for signals in kvmppc_vcpu_run

2011-11-11 Thread Scott Wood
On Fri, Nov 11, 2011 at 03:09:17PM +0100, Alexander Graf wrote:
 On 11/08/2011 11:11 PM, Scott Wood wrote:
 Currently we check prior to returning from a lightweight exit,
 but not prior to initial entry.
 
 book3s already does a similar test.
 
 Signed-off-by: Scott Woodscottw...@freescale.com
 ---
   arch/powerpc/kvm/booke.c |   10 +-
   1 files changed, 9 insertions(+), 1 deletions(-)
 
 diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
 index b642200..9c78589 100644
 --- a/arch/powerpc/kvm/booke.c
 +++ b/arch/powerpc/kvm/booke.c
 @@ -322,11 +322,19 @@ int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct 
 kvm_vcpu *vcpu)
  }
 
  local_irq_disable();
 +
 +if (signal_pending(current)) {
 
 Any reason you're doing this after irq_disable()?

If we get a signal after the check, we want to be sure that we don't
receive the reschedule IPI until after we're in the guest, so that it
will cause another signal check.

-Scott

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] powerpc kvm: fix kvmppc_start_thread() for CONFIG_SMP=N

2011-11-11 Thread Alexander Graf

On 11/11/2011 03:03 AM, Michael Neuling wrote:

Currently kvmppc_start_thread() tries to wake other SMT threads via
xics_wake_cpu().  Unfortunately xics_wake_cpu only exists when
CONFIG_SMP=Y so when compiling with CONFIG_SMP=N we get:

   arch/powerpc/kvm/built-in.o: In function `.kvmppc_start_thread':
   book3s_hv.c:(.text+0xa1e0): undefined reference to `.xics_wake_cpu'

The following should be fine since kvmppc_start_thread() shouldn't be
called to start non-zero threads when SMP=N since threads_per_core=1.

Signed-off-by: Michael Neulingmi...@neuling.org


Thanks, applied to kvm-ppc-next. Please CC kvm-ppc@vger next time :).


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: PPC: protect use of kvmppc_h_pr

2011-11-11 Thread Alexander Graf

On 11/08/2011 06:17 PM, Andreas Schwab wrote:

kvmppc_h_pr is only available if CONFIG_KVM_BOOK3S_64_PR.

Signed-off-by: Andreas Schwabsch...@linux-m68k.org


Thanks, applied to kvm-ppc-next.


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: PPC: move compute_tlbie_rb to book3s_64 common header

2011-11-11 Thread Alexander Graf

On 11/08/2011 06:08 PM, Andreas Schwab wrote:

compute_tlbie_rb is only used on ppc64 and cannot be compiled on ppc32.

Signed-off-by: Andreas Schwabsch...@linux-m68k.org


Thanks, applied to kvm-ppc-next.

Alex

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: PPC: booke: check for signals in kvmppc_vcpu_run

2011-11-11 Thread Scott Wood
On 11/11/2011 09:35 AM, Alexander Graf wrote:
 On 11/11/2011 04:32 PM, Scott Wood wrote:
 On Fri, Nov 11, 2011 at 03:09:17PM +0100, Alexander Graf wrote:
 On 11/08/2011 11:11 PM, Scott Wood wrote:
 Currently we check prior to returning from a lightweight exit,
 but not prior to initial entry.

 book3s already does a similar test.

 Signed-off-by: Scott Woodscottw...@freescale.com
 ---
   arch/powerpc/kvm/booke.c |   10 +-
   1 files changed, 9 insertions(+), 1 deletions(-)

 diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
 index b642200..9c78589 100644
 --- a/arch/powerpc/kvm/booke.c
 +++ b/arch/powerpc/kvm/booke.c
 @@ -322,11 +322,19 @@ int kvmppc_vcpu_run(struct kvm_run *kvm_run,
 struct kvm_vcpu *vcpu)
   }

   local_irq_disable();
 +
 +if (signal_pending(current)) {
 Any reason you're doing this after irq_disable()?
 If we get a signal after the check, we want to be sure that we don't
 receive the reschedule IPI until after we're in the guest, so that it
 will cause another signal check.
 
 Makes sense. So the current book3s implementation is wrong?

I think so.

-Scott

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] powerpc kvm: fix kvmppc_start_thread() for CONFIG_SMP=N

2011-11-11 Thread Michael Neuling
In message 4ebd46f4.5040...@suse.de you wrote:
 On 11/11/2011 03:03 AM, Michael Neuling wrote:
  Currently kvmppc_start_thread() tries to wake other SMT threads via
  xics_wake_cpu().  Unfortunately xics_wake_cpu only exists when
  CONFIG_SMP=Y so when compiling with CONFIG_SMP=N we get:
 
 arch/powerpc/kvm/built-in.o: In function `.kvmppc_start_thread':
 book3s_hv.c:(.text+0xa1e0): undefined reference to `.xics_wake_cpu'
 
  The following should be fine since kvmppc_start_thread() shouldn't be
  called to start non-zero threads when SMP=N since threads_per_core=1.
 
  Signed-off-by: Michael Neulingmi...@neuling.org
 
 Thanks, applied to kvm-ppc-next. Please CC kvm-ppc@vger next time :).

Thanks will do.

Mikey
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html