RE: [RFC][PATCH v2 0/3] Provide a zero-copy method on KVM virtio-net.

2010-04-06 Thread Xin, Xiaohui
Sridhar,

 The idea is simple: just pin the guest VM user space and then
 let the host NIC driver have the chance to directly DMA to it.
 The patches are based on the vhost-net backend driver. We add a device
 which provides proto_ops such as sendmsg/recvmsg to vhost-net to
 send/recv directly to/from the NIC driver. A KVM guest which uses the
 vhost-net backend may bind any ethX interface on the host side to
 get copyless data transfer through the guest virtio-net frontend.

What is the advantage of this approach compared to PCI-passthrough
of the host NIC to the guest?

PCI passthrough needs hardware support: an IOMMU engine to help
translate guest physical addresses to host physical addresses.
And currently, a PCI-passthrough device cannot survive live migration.

The zero-copy approach is a pure software solution; it doesn't need special
hardware support.
In theory, it can survive live migration.
 
Does this require pinning of the entire guest memory? Or only the
send/receive buffers?

We need only to pin the send/receive buffers.

Thanks
Xiaohui

Thanks
Sridhar
 
 The scenario is like this:
 
 The guest virtio-net driver submits multiple requests through the vhost-net
 backend driver to the kernel. The requests are queued and then
 completed after the corresponding actions in h/w are done.
 
 For read (rx), user-space buffers are dispensed to the NIC driver when
 a page constructor API is invoked, which means the NIC can allocate user
 buffers from a page constructor. We add a hook in the netif_receive_skb()
 function to intercept the incoming packets and notify the zero-copy device.
 
 For write (tx), the zero-copy device may allocate a new host skb, put the
 payload on skb_shinfo(skb)->frags, and copy the header to skb->data.
 The request remains pending until the skb is transmitted by h/w.
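To make the tx path concrete, here is a minimal illustrative sketch; every name in it is an assumption for illustration, not the patch's actual interface:

/*
 * Illustrative sketch only -- all names are assumptions, not the patch API.
 * The tx idea described above: copy just the protocol header into the
 * skb's linear area and attach the pinned guest payload page as a fragment.
 */
#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/string.h>

static struct sk_buff *mp_build_tx_skb(struct net_device *dev,
				       const void *hdr, unsigned int hdr_len,
				       struct page *page, int off, int len)
{
	struct sk_buff *skb = netdev_alloc_skb(dev, hdr_len);

	if (!skb)
		return NULL;
	memcpy(skb_put(skb, hdr_len), hdr, hdr_len);	/* small header copy */
	/* the guest payload page goes in as a fragment: no data copy */
	skb_fill_page_desc(skb, 0, page, off, len);
	skb->len      += len;
	skb->data_len += len;
	skb->truesize += len;
	return skb;
}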
 
 Here, we have considered two ways to utilize the page constructor
 API to dispense the user buffers.
 
 One:  Modify the __alloc_skb() function a bit so that it only allocates
   the sk_buff structure, while the data pointer points to a
   user buffer coming from a page constructor API.
   The shinfo of the skb then also comes from the guest.
   When a packet is received from hardware, skb->data is filled
   directly by h/w. This is the way we have implemented it.
 
   Pros:   We can avoid any copy here.
   Cons:   The guest virtio-net driver needs to allocate the skb in almost
   the same way as the host NIC drivers, i.e. the size used by
   netdev_alloc_skb() and the same reserved space in the
   head of the skb. Many NIC drivers match the guest and are
   ok with this, but some of the latest NIC drivers reserve special
   room in the skb head. To deal with that, we suggest providing
   a method in the guest virtio-net driver to ask the NIC driver
   for the parameters we are interested in once we know which device
   has been bound for zero-copy, and then have the guest follow them.
   Is that reasonable?
 
 Two:  Modify the driver to get user buffers allocated from a page constructor
   API (substituting alloc_page()); the user buffers are used as payload
   buffers and filled by h/w directly when a packet is received. The driver
   should associate the pages with the skb (skb_shinfo(skb)->frags). For
   the head buffer, let the host allocate the skb and h/w fill it.
   After that, the data filled into the host skb header is copied into
   the guest header buffer, which is submitted together with the payload buffers.
 
   Pros:   We care less about how the guest or host allocates its buffers.
   Cons:   We still need a small copy here for the skb header.
 
 We are not sure which way is better here; this is the first thing we want
 to get comments on from the community. We wish the modification to the network
 part to be generic, used not only by the vhost-net backend, but also by a user
 application, once the zero-copy device provides async
 read/write operations later.
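For reference, a rough sketch of the page constructor idea behind approach Two; all names below are assumptions, not the actual patch API:

/*
 * Sketch only: a "page constructor" that the NIC driver consults instead
 * of calling alloc_page() for rx payload pages, so received data can be
 * DMAed straight into pinned guest buffers.  All names are assumptions.
 */
#include <linux/gfp.h>
#include <linux/mm.h>

struct page_ctor {
	struct page *(*alloc_page)(struct page_ctor *ctor, gfp_t gfp);
	void *data;			/* e.g. a pool of pinned guest pages */
};

static inline struct page *rx_alloc_page(struct page_ctor *ctor, gfp_t gfp)
{
	if (ctor)			/* device bound for zero-copy */
		return ctor->alloc_page(ctor, gfp);
	return alloc_page(gfp);		/* otherwise an ordinary host page */
}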
 
 Please give comments especially for the network part modifications.
 
 
 We also provide multiple submits and asynchronous notification to
 vhost-net.
 
 Our goal is to improve the bandwidth and reduce the CPU usage.
 Exact performance data will be provided later, but in a simple
 test with netperf we found that bandwidth goes up and CPU % goes up too,
 though the bandwidth increase is much larger than the CPU % increase.
 
 What we have not done yet:
   packet split support
   To support GRO
   Performance tuning
 
 What we have done in v1:
   polish the RCU usage
   deal with write logging in asynchronous mode in vhost
   add a notifier block for the mp device
   rename page_ctor to mp_port in netdevice.h to make it look generic
   add mp_dev_change_flags() for the mp device to change NIC state
   add CONFIG_VHOST_MPASSTHRU to limit the usage when the module is not loaded
   a small fix for a missing dev_put() on failure
   use a dynamic minor number

RE: [RFC] [PATCH v2 3/3] Let host NIC driver to DMA to guest user space.

2010-04-06 Thread Xin, Xiaohui

 From: Xin Xiaohui xiaohui@intel.com
 
 The patch lets the host NIC driver receive user-space skbs, so the
 driver has the chance to directly DMA to guest user-space
 buffers through a single ethX interface.
 We want it to be more generic, as a zero-copy framework.
 
 Signed-off-by: Xin Xiaohui xiaohui@intel.com
 Signed-off-by: Zhao Yu yzha...@gmail.com
 Signed-off-by: Jeff Dike jd...@c2.user-mode-linux.org
 ---
 
 We consider two ways to utilize the user buffers, but are not sure which one
 is better. Please give any comments.
 
 One: Modify the __alloc_skb() function a bit so that it only allocates
 the sk_buff structure, while the data pointer points to a
 user buffer coming from a page constructor API.
 The shinfo of the skb then also comes from the guest.
 When a packet is received from hardware, skb->data is filled
 directly by h/w. This is the way we have implemented it.
 
 Pros:   We can avoid any copy here.
 Cons:   The guest virtio-net driver needs to allocate the skb in almost
 the same way as the host NIC drivers, i.e. the size used by
 netdev_alloc_skb() and the same reserved space in the
 head of the skb. Many NIC drivers match the guest and are
 ok with this, but some of the latest NIC drivers reserve special
 room in the skb head. To deal with that, we suggest providing
 a method in the guest virtio-net driver to ask the NIC driver
 for the parameters we are interested in once we know which device
 has been bound for zero-copy, and then have the guest follow them.
 Is that reasonable?
 
 Two: Modify the driver to get user buffers allocated from a page constructor
 API (substituting alloc_page()); the user buffers are used as payload
 buffers and filled by h/w directly when a packet is received. The driver
 should associate the pages with the skb (skb_shinfo(skb)->frags). For
 the head buffer, let the host allocate the skb and h/w fill it.
 After that, the data filled into the host skb header is copied into
 the guest header buffer, which is submitted together with the payload
 buffers.
 
 Pros:   We care less about how the guest or host allocates its buffers.
 Cons:   We still need a small copy here for the skb header.
 
 We are not sure which way is better here. This is the first thing we want
 to get comments on from the community. We wish the modification to the network
 part to be generic, used not only by the vhost-net backend, but also by a user
 application, once the zero-copy device provides async
 read/write operations later.
 
 
 Thanks
 Xiaohui

How do you deal with the DoS problem of a hostile user-space app posting a huge
number of receives and never getting anything?

That's a problem we are trying to deal with; it's critical for the long term.
Currently, we try to limit the pages it can pin, but we are not sure how much
is reasonable.
For now, the buffers submitted come from the guest virtio-net driver, so it is
safe to some extent,
just for now.

Thanks
Xiaohui


Re: [PATCH 1/3] A device for zero-copy based on KVM virtio-net.

2010-04-06 Thread Michael S. Tsirkin
On Tue, Apr 06, 2010 at 01:41:37PM +0800, Xin, Xiaohui wrote:
 Michael,
  
 For the DoS issue, I'm not sure what a reasonable limit on the pages
  get_user_pages() can pin is; should we compute it from the bandwidth?
 
 There's a ulimit for locked memory. Can we use this, decreasing
 the value in the rlimit array? We can do this when the backend is
 enabled and re-increment it when the backend is disabled.
 
 I have tried it with rlim[RLIMIT_MEMLOCK].rlim_cur, but I found
 the initial value for it is 0x10000; after right-shifting by PAGE_SHIFT,
 that is only 16 pages we can lock, which seems too small, since the
 guest virtio-net driver may submit a lot of requests at one time.
 
 
 Thanks
 Xiaohui

Yes, that's the default, but the system administrator can always increase
this value with ulimit if necessary.
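For illustration only, pinned-page accounting against RLIMIT_MEMLOCK could look roughly like the mlock-style accounting other drivers use; the helper name and placement below are assumptions, not code from the patches:

#include <linux/sched.h>
#include <linux/mm.h>
#include <linux/capability.h>
#include <linux/errno.h>

/* hypothetical helper: charge npages of pinned guest memory to mm */
static int mp_account_pinned(struct mm_struct *mm, unsigned long npages)
{
	/* default RLIMIT_MEMLOCK is 64KB, i.e. 16 pages with 4KB pages */
	unsigned long limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur
			      >> PAGE_SHIFT;
	int ret = 0;

	down_write(&mm->mmap_sem);
	if (mm->locked_vm + npages > limit && !capable(CAP_IPC_LOCK))
		ret = -ENOMEM;
	else
		mm->locked_vm += npages;
	up_write(&mm->mmap_sem);
	return ret;
}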

-- 
MST


Re: [PATCH v1 2/3] Provides multiple submits and asynchronous notifications.

2010-04-06 Thread Michael S. Tsirkin
On Tue, Apr 06, 2010 at 01:46:56PM +0800, Xin, Xiaohui wrote:
 Michael,
   For the write logging, do you have a function at hand with which we can
   recompute the log? If so, I think I can use it to recompute the
  log info when logging is suddenly enabled.
   For the outstanding requests, do you mean all the user buffers that have
  been submitted before the logging ioctl changed? That may be a lot, and
   some of them are still in NIC ring descriptors. Waiting for them to
  finish may take some time. I think it is also reasonable that when the
   logging ioctl changes, the logging changes just after that.
  
  The key point is that after the logging ioctl returns, any
  subsequent change to memory must be logged. It does not
  matter when the request was submitted, otherwise we will
  get memory corruption on migration.
 
  The change to memory happens in vhost_add_used_and_signal(), right?
  So after the ioctl returns, just recomputing the log info for the events
  in the async queue is ok, since the ioctl and write-log operations are
  all protected by vq->mutex.
  
  Thanks
  Xiaohui
 
 Yes, I think this will work.
 
 Thanks, so do you have the function to recompute the log info at hand that
 I can use? I vaguely remember that you mentioned it some time ago.

Doesn't just rerunning vhost_get_vq_desc work?

   Thanks
   Xiaohui
   
drivers/vhost/net.c   |  189 
   +++--
drivers/vhost/vhost.h |   10 +++
2 files changed, 192 insertions(+), 7 deletions(-)
   
   diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
   index 22d5fef..2aafd90 100644
   --- a/drivers/vhost/net.c
   +++ b/drivers/vhost/net.c
   @@ -17,11 +17,13 @@
 #include <linux/workqueue.h>
 #include <linux/rcupdate.h>
 #include <linux/file.h>
    +#include <linux/aio.h>
 
 #include <linux/net.h>
 #include <linux/if_packet.h>
 #include <linux/if_arp.h>
 #include <linux/if_tun.h>
    +#include <linux/mpassthru.h>
 
 #include <net/sock.h>

   @@ -47,6 +49,7 @@ struct vhost_net {
 struct vhost_dev dev;
 struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
 struct vhost_poll poll[VHOST_NET_VQ_MAX];
   + struct kmem_cache   *cache;
 /* Tells us whether we are polling a socket for TX.
  * We only do this when socket buffer fills up.
  * Protected by tx vq lock. */
   @@ -91,11 +94,88 @@ static void tx_poll_start(struct vhost_net *net, 
   struct socket *sock)
  net->tx_poll_state = VHOST_NET_POLL_STARTED;
}

   +struct kiocb *notify_dequeue(struct vhost_virtqueue *vq)
   +{
   + struct kiocb *iocb = NULL;
   + unsigned long flags;
   +
    + spin_lock_irqsave(&vq->notify_lock, flags);
    + if (!list_empty(&vq->notifier)) {
    + iocb = list_first_entry(&vq->notifier,
    + struct kiocb, ki_list);
    + list_del(&iocb->ki_list);
    + }
    + spin_unlock_irqrestore(&vq->notify_lock, flags);
   + return iocb;
   +}
   +
   +static void handle_async_rx_events_notify(struct vhost_net *net,
   + struct vhost_virtqueue *vq)
   +{
   + struct kiocb *iocb = NULL;
   + struct vhost_log *vq_log = NULL;
   + int rx_total_len = 0;
   + int log, size;
   +
    + if (vq->link_state != VHOST_VQ_LINK_ASYNC)
    + return;
    +
    + if (vq->receiver)
    + vq->receiver(vq);
    +
    + vq_log = unlikely(vhost_has_feature(
    + &net->dev, VHOST_F_LOG_ALL)) ? vq->log : NULL;
    + while ((iocb = notify_dequeue(vq)) != NULL) {
    + vhost_add_used_and_signal(&net->dev, vq,
    + iocb->ki_pos, iocb->ki_nbytes);
    + log = (int)iocb->ki_user_data;
    + size = iocb->ki_nbytes;
    + rx_total_len += iocb->ki_nbytes;
    +
    + if (iocb->ki_dtor)
    + iocb->ki_dtor(iocb);
    + kmem_cache_free(net->cache, iocb);
    +
    + if (unlikely(vq_log))
    + vhost_log_write(vq, vq_log, log, size);
    + if (unlikely(rx_total_len >= VHOST_NET_WEIGHT)) {
    + vhost_poll_queue(&vq->poll);
   + break;
   + }
   + }
   +}
   +
   +static void handle_async_tx_events_notify(struct vhost_net *net,
   + struct vhost_virtqueue *vq)
   +{
   + struct kiocb *iocb = NULL;
   + int tx_total_len = 0;
   +
    + if (vq->link_state != VHOST_VQ_LINK_ASYNC)
    + return;
    +
    + while ((iocb = notify_dequeue(vq)) != NULL) {
    + vhost_add_used_and_signal(&net->dev, vq,
    + iocb->ki_pos, 0);
    + tx_total_len += iocb->ki_nbytes;
    +
    + if (iocb->ki_dtor)
    + iocb->ki_dtor(iocb);
    +
    + kmem_cache_free(net->cache, iocb);
    + if (unlikely(tx_total_len >= VHOST_NET_WEIGHT)) {
    + vhost_poll_queue(&vq->poll);
   + break;
   + }
   + }
   +}
   +
/* Expects to be always run from workqueue - which acts as
 * read-size 

Re: [PATCH 1/2] qemu-kvm: extboot: Keep variables in RAM

2010-04-06 Thread Jan Kiszka
Avi Kivity wrote:
 On 02/18/2010 06:13 PM, Jan Kiszka wrote:
 Instead of saving the old INT 0x13 and 0x19 handlers in ROM which fails
 under QEMU as it enforces protection, keep them in spare vectors of the
 interrupt table, namely INT 0x80 and 0x81.


 
 Applied both, thanks.

Forgot to tag it: Please consider the first one (Keep variables in
RAM, 2dcbbec) for stable as well.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


Re: Question on skip_emulated_instructions()

2010-04-06 Thread Gleb Natapov
On Tue, Apr 06, 2010 at 01:11:23PM +0900, Yoshiaki Tamura wrote:
 Hi.
 
 When handle_io() is called, rip is currently advanced *before* the I/O is
 actually handled by qemu in userland.  Upon implementing Kemari for
 KVM (http://www.mail-archive.com/kvm@vger.kernel.org/msg25141.html) mainly in
 userland qemu, we encountered a problem: synchronizing the content of the VCPU
 before handling I/O in qemu is too late, because rip has already been advanced
 in KVM.  Although we avoided this issue with a temporary hack, I would like to
 ask a few questions on skip_emulated_instructions().
 
 1. Does rip need to be advanced before the I/O is handled by qemu?
In the current kvm.git, rip is advanced before the I/O is handled by qemu only
in the case of the out instruction. From an architectural point of view I think
it's OK, since on real HW you can't guarantee that the I/O will take effect
before the instruction pointer is advanced. It is done like that because we
want out emulation to be really fast, so we skip the x86 emulator.

 2. If not, is it possible to divide skip_emulated_instructions(), e.g.
 rec_emulated_instructions() to remember the next rip, and
 skip_emulated_instructions() to actually advance the rip?
Currently only the emulator can call userspace to do I/O, so after
userspace returns from the I/O exit, control is handed back to the emulator
unconditionally.  The out instruction skips the emulator, but there is nothing
to do after userspace returns, so the regular cpu loop is executed. If we
want to advance rip only after userspace has executed the I/O done by out, we
need to distinguish who requested the I/O (the emulator or kvm_fast_pio_out())
and call different code depending on who that was. It can be done by
having a callback that (if not null) is called on return from userspace.
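A rough sketch of that callback idea, as it might sit in arch/x86/kvm/x86.c; the complete_userspace_io field and both function names are hypothetical, not existing KVM code:

/* assumed hook, stored by kvm_fast_pio_out() before exiting to userspace */
static int complete_fast_pio_out(struct kvm_vcpu *vcpu)
{
	/* only now that userspace has performed the "out", advance rip */
	kvm_x86_ops->skip_emulated_instruction(vcpu);
	return 1;
}

/* called on return from userspace, before re-entering the guest */
static int kvm_post_userspace_io(struct kvm_vcpu *vcpu)
{
	int (*cb)(struct kvm_vcpu *vcpu) = vcpu->arch.complete_userspace_io;

	if (!cb)
		return 1;	/* emulator path resumes exactly as before */
	vcpu->arch.complete_userspace_io = NULL;
	return cb(vcpu);
}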

 3. svm has next_rip but when it is 0, nop is emulated.  Can this be modified 
 to
 continue without emulating nop when next_rip is 0?
 
I don't see where a nop is emulated if next_rip is 0. As far as I can see, in
the case of next_rip == 0 the instruction at rip is decoded to figure out its
length and then rip is advanced by the instruction length. Anyway, next_rip
is an svm-only thing.

--
Gleb.


page allocation failure

2010-04-06 Thread kvm
Hi,

running kernel 2.6.32 (kvm 0.12.3) in host
and 2.6.30 in guest (using Gentoo) works fine.
Now I've upgraded several guests to 2.6.32 too
and have had no problems so far. But with one
guest after 2-3 hours the guest hangs and I
always get a message like this:

[ 1392.030904] rpciod/0: page allocation failure. order:0, mode:0x20
[ 1392.030910] Pid: 407, comm: rpciod/0 Not tainted 2.6.32-gentoo-r5 #1
[ 1392.030912] Call Trace:
[ 1392.030915]  IRQ  [8109cf2f]
__alloc_pages_nodemask+0x5ad/0x5fa
[ 1392.030986]  [81028fdc] ? default_spin_lock_flags+0x9/0xd
[ 1392.031008]  [810c1aa9] alloc_pages_current+0x96/0x9f
[ 1392.031061]  [a00f90c4] try_fill_recv+0x8c/0x18e [virtio_net]
[ 1392.031065]  [a00f9c19] virtnet_poll+0x594/0x60f [virtio_net]
[ 1392.031070]  [81029050] ? pvclock_clocksource_read+0x42/0x7e
[ 1392.031087]  [810520b2] ? run_timer_softirq+0x1d8/0x1f0
[ 1392.031118]  [8171760a] net_rx_action+0xad/0x1a8
[ 1392.031126]  [8104e2a9] __do_softirq+0x9c/0x127
[ 1392.031141]  [8100c1cc] call_softirq+0x1c/0x28
[ 1392.031147]  [8100dd8d] do_softirq+0x41/0x81
[ 1392.031154]  [8104dfb3] irq_exit+0x36/0x75
[ 1392.031161]  [81021c5f] smp_apic_timer_interrupt+0x88/0x96
[ 1392.031170]  [8100bb93] apic_timer_interrupt+0x13/0x20
[ 1392.031174]  EOI  [811e60cd] ?
nfs_readpage_result_full+0x7a/0xcd
[ 1392.031221]  [817932b3] ? rpc_exit_task+0x27/0x54
[ 1392.031227]  [81793997] ? __rpc_execute+0x86/0x247
[ 1392.031234]  [81793bf2] ? rpc_async_schedule+0x0/0x12
[ 1392.031241]  [81793c02] ? rpc_async_schedule+0x10/0x12
[ 1392.031248]  [81058a9e] ? worker_thread+0x173/0x214
[ 1392.031259]  [8105c404] ? autoremove_wake_function+0x0/0x38
[ 1392.031263]  [8105892b] ? worker_thread+0x0/0x214
[ 1392.031266]  [8105c139] ? kthread+0x7d/0x85
[ 1392.031269]  [8100c0ca] ? child_rip+0xa/0x20
[ 1392.031273]  [8105c0bc] ? kthread+0x0/0x85
[ 1392.031279]  [8100c0c0] ? child_rip+0x0/0x20
[ 1392.031283] Mem-Info:
[ 1392.031287] Node 0 DMA per-cpu:
[ 1392.031293] CPU0: hi:0, btch:   1 usd:   0
[ 1392.031298] CPU1: hi:0, btch:   1 usd:   0
[ 1392.031302] Node 0 DMA32 per-cpu:
[ 1392.031308] CPU0: hi:  186, btch:  31 usd: 201
[ 1392.031312] CPU1: hi:  186, btch:  31 usd: 153
[ 1392.031320] active_anon:8093 inactive_anon:8705 isolated_anon:0
[ 1392.031322]  active_file:5209 inactive_file:214248 isolated_file:0
[ 1392.031324]  unevictable:0 dirty:8 writeback:0 unstable:0
[ 1392.031326]  free:1351 slab_reclaimable:2093 slab_unreclaimable:3964
[ 1392.031329]  mapped:2406 shmem:190 pagetables:1184 bounce:0
[ 1392.031334] Node 0 DMA free:3996kB min:60kB low:72kB high:88kB
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:11860kB
unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15372kB
mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB
slab_reclaimable:0kB slab_unreclaimable:8kB kernel_stack:0kB pagetables:0kB
unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0
all_unreclaimable? yes
[ 1392.031355] lowmem_reserve[]: 0 994 994 994
[ 1392.031368] Node 0 DMA32 free:1408kB min:4000kB low:5000kB high:6000kB
active_anon:32372kB inactive_anon:34820kB active_file:20836kB
inactive_file:845132kB unevictable:0kB isolated(anon):0kB
isolated(file):0kB present:1018068kB mlocked:0kB dirty:32kB writeback:0kB
mapped:9624kB shmem:760kB slab_reclaimable:8372kB
slab_unreclaimable:15848kB kernel_stack:1112kB pagetables:4736kB
unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0
all_unreclaimable? no
[ 1392.031391] lowmem_reserve[]: 0 0 0 0
[ 1392.031404] Node 0 DMA: 1*4kB 1*8kB 1*16kB 2*32kB 3*64kB 3*128kB
3*256kB 1*512kB 0*1024kB 1*2048kB 0*4096kB = 3996kB
[ 1392.031433] Node 0 DMA32: 0*4kB 0*8kB 1*16kB 0*32kB 0*64kB 1*128kB
1*256kB 0*512kB 1*1024kB 0*2048kB 0*4096kB = 1424kB
[ 1392.031463] 219673 total pagecache pages
[ 1392.031467] 0 pages in swap cache
[ 1392.031471] Swap cache stats: add 0, delete 0, find 0/0
[ 1392.031475] Free swap  = 1959920kB
[ 1392.031479] Total swap = 1959920kB
[ 1392.036726] 262141 pages RAM
[ 1392.036728] 7326 pages reserved
[ 1392.036730] 21063 pages shared
[ 1392.036731] 244405 pages non-shared

The guest only runs an Apache webserver and has an
NFSv4 mount. Nothing special. The guest has 1 GB
of memory and two vcpus. Here are the startup
parameters:

/usr/bin/qemu-system-x86_64 --enable-kvm -m 1024 -smp 2 -cpu host
-daemonize -k de -vnc 127.0.0.1:2 -monitor
telnet:172.18.105.4:4445,server,nowait -localtime -pidfile
/var/tmp/kvm-vm10.pid -drive
file=/data/kvm/kvmimages/vm10.qcow2,if=virtio,boot=on -net
nic,vlan=104,model=virtio,macaddr=00:ff:48:46:01:f2 -net
tap,vlan=104,ifname=tap.b.vm10,script=no -net
nic,vlan=96,model=virtio,macaddr=00:ff:48:46:01:f4 -net
tap,vlan=96,ifname=tap.f.vm10,script=no

Just to make sure that my kernel config doesn't have
any issues I've 

[PATCH 1/2] KVM MMU: remove unused field

2010-04-06 Thread Xiao Guangrong
kvm_mmu_page.oos_link is not used, so remove it

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/include/asm/kvm_host.h |2 --
 arch/x86/kvm/mmu.c  |1 -
 2 files changed, 0 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 26c629a..0c49c88 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -187,8 +187,6 @@ struct kvm_mmu_page {
struct list_head link;
struct hlist_node hash_link;
 
-   struct list_head oos_link;
-
/*
 * The following two entries are used to key the shadow page in the
 * hash table.
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index d7700bb..8dfe8eb 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -922,7 +922,6 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct 
kvm_vcpu *vcpu,
sp->gfns = mmu_memory_cache_alloc(&vcpu->arch.mmu_page_cache, 
PAGE_SIZE);
set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
list_add(&sp->link, &vcpu->kvm->arch.active_mmu_pages);
-   INIT_LIST_HEAD(&sp->oos_link);
bitmap_zero(sp->slot_bitmap, KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS);
sp->multimapped = 0;
sp->parent_pte = parent_pte;
-- 
1.6.1.2



[PATCH 2/2] KVM MMU: remove unnecessary judgement

2010-04-06 Thread Xiao Guangrong
After the is_rsvd_bits_set() check, EFER.NXE must be enabled if the NX bit is set

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/paging_tmpl.h |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 067797a..d9dea28 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -170,7 +170,7 @@ walk:
goto access_error;
 
 #if PTTYPE == 64
-   if (fetch_fault && is_nx(vcpu) && (pte & PT64_NX_MASK))
+   if (fetch_fault && (pte & PT64_NX_MASK))
goto access_error;
 #endif
 
-- 
1.6.1.2




Re: page allocation failure

2010-04-06 Thread Thomas Mueller
Am Tue, 06 Apr 2010 14:15:10 +0200 schrieb kvm:

 Hi,
 
 running kernel 2.6.32 (kvm 0.12.3) in host and 2.6.30 in guest (using
 Gentoo) works fine. Now I've upgraded several guests to 2.6.32 too and
 have had no problems so far. But with one guest after 2-3 hours the
 guest hangs and I always get a message like this:
 

IMHO there are NFS related problems with =2.6.32.x . 

try googling.

- Thomas



Re: page allocation failure

2010-04-06 Thread kvm
Thanks! I'll try a new kernel. Interestingly, two guests with 2.6.32-r3
(Gentoo naming, not rc3) with much more NFS traffic don't show
this behavior of 2.6.32-r5, so I'll try 2.6.32-r8. I've found
some threads about NFS and kernel 2.6.32.x related problems
which seem to be fixed in later versions.

- Robert

On Tue, 6 Apr 2010 10:59:45 + (UTC), Thomas Mueller
tho...@chaschperli.ch wrote:
 Am Tue, 06 Apr 2010 14:15:10 +0200 schrieb kvm:
 
 Hi,
 
 running kernel 2.6.32 (kvm 0.12.3) in host and 2.6.30 in guest (using
 Gentoo) works fine. Now I've upgraded several guests to 2.6.32 too and
 have had no problems so far. But with one guest after 2-3 hours the
 guest hangs and I always get a message like this:
 
 
 IMHO there are NFS related problems with =2.6.32.x . 
 
 try googling.
 
 - Thomas
 


[GSoC 2010] Shared memory transport between guest(s) and host

2010-04-06 Thread Mohammed Gamal
Hi,
I am interested in the Shared memory transport between guest(s) and
host project for GSoC 2010. The description of the project is pretty
straightforward, but I am a little bit lost on some parts:

1- Is there any documentation available on KVM shared memory
transport. This'd definitely help understand how inter-vm shared
memory should work.

2- Does the project only aim at providing a shared memory transport
between a single host and a number of guests, with the host acting as
a central node containing shared memory objects and communication
taking place only between guests and the host, or is there any kind of
guest-to-guest communication to be supported? If yes, how should it be
done?

Regards,
Mohammed


Re: [Qemu-devel] KVM call agenda for Apr 6

2010-04-06 Thread Alexander Graf
Chris Wright wrote:
 Please send in any agenda items you are interested in covering.
   

Management stack discussion (again :))


Alex



KVM call minutes for Apr 6

2010-04-06 Thread Chris Wright
Management stack again
- qemud?
- external mgmt stack, qemu/kvm devs less inclined to care
  - Oh, you're using virsh, try #virt on OFTC
- standard libvirt issues
  - concern about speed of adopting kvm features
  - complicated, XML hard to understand
  - being slowed by hv-agnostic design
  - ...
  - but...clearly have done a lot of work and widely used/deployed/etc...
- libvirt is still useful, so need to play well w/ libvirt
- qemud
  - qemu registers
  - have enumeration
  - have access to all qemu features (QMP based)
  - runs privileged, can deal w/ bridge, tap setup
  - really good UI magically appears </sarcasm>
  - regressions (we'd lose these w/ qemud):
- sVirt
- networking
- storage pools, etc
- device assignment
- hotplug
- large pages
- cgroups
- stable mgmt ABI
  - what's needed global/privileged?
- guest enumeration
- network setup
- device hotplug
  - need good single VM UI
- but...as soon as you want 2 VMs, or 2 hosts...
- no need to reinvent the wheel
- qemu project push features up towards mgmt stack, but doesn't make
  those features exist in mgmt stack
- automated interface creation (remove barrier to adding libvirt features)
- QtEmu as example of nice interface that suffered because programming to qemu 
cli is too hard
- libvirt has made a lot of effort, nobody is discounting that, what's un
- strong agreement is that libvirt is needed long term
- we should focus on making qemu easy to manage not on writing mgmt tools
  - qmp + libvirt
  - define requirements for layering
- needs for global scope and privilege requirements


Re: [PATCH] vhost: Make it more scalable by creating a vhost thread per device.

2010-04-06 Thread Avi Kivity

On 04/05/2010 08:35 PM, Sridhar Samudrala wrote:

On Sun, 2010-04-04 at 14:14 +0300, Michael S. Tsirkin wrote:
   

On Fri, Apr 02, 2010 at 10:31:20AM -0700, Sridhar Samudrala wrote:
 

Make vhost scalable by creating a separate vhost thread per vhost
device. This provides better scaling across multiple guests and with
multiple interfaces in a guest.
   

Thanks for looking into this. An alternative approach is
to simply replace create_singlethread_workqueue with
create_workqueue which would get us a thread per host CPU.

It seems that in theory this should be the optimal approach
wrt CPU locality, however, in practice a single thread
seems to get better numbers. I have a TODO to investigate this.
Could you try looking into this?
 

Yes. I tried using create_workqueue(), but the results were not good,
at least when the number of guest interfaces is less than the number
of CPUs. I didn't try more than 8 guests.
Creating a separate thread per guest interface seems to be more
scalable based on the testing I have done so far.
   


Thread per guest is also easier to account.  I'm worried about guests 
impacting other guests' performance outside scheduler control by 
extensive use of vhost.
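For clarity, a sketch of the two models being compared; apart from the workqueue/kthread APIs themselves, the names and structure below are assumptions:

#include <linux/workqueue.h>
#include <linux/kthread.h>
#include <linux/err.h>
#include <linux/errno.h>

struct vhost_dev_sketch {		/* hypothetical per-device state */
	struct task_struct *worker;
	int id;
};

static int vhost_worker_fn(void *data)
{
	/* would service this device's virtqueues until kthread_stop() */
	return 0;
}

/* (a) shared model: one workqueue, i.e. a worker thread per host CPU */
static struct workqueue_struct *vhost_wq;

static int vhost_init_shared_wq(void)
{
	vhost_wq = create_workqueue("vhost");
	return vhost_wq ? 0 : -ENOMEM;
}

/* (b) the patch's model: a dedicated thread per vhost device, so its CPU
 * time is attributable to, and schedulable against, a single guest */
static int vhost_dev_start_worker(struct vhost_dev_sketch *dev)
{
	dev->worker = kthread_run(vhost_worker_fn, dev, "vhost-%d", dev->id);
	return IS_ERR(dev->worker) ? PTR_ERR(dev->worker) : 0;
}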


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [PATCH 2/2] KVM: Trace emulated instructions

2010-04-06 Thread Marcelo Tosatti
On Tue, Apr 06, 2010 at 12:38:00AM +0300, Avi Kivity wrote:
 On 04/05/2010 09:44 PM, Marcelo Tosatti wrote:
 On Thu, Mar 25, 2010 at 05:02:56PM +0200, Avi Kivity wrote:
 Log emulated instructions in ftrace, especially if they failed.
 Why not log all emulated instructions? Seems useful to me.
 
 
 That was the intent, but it didn't pan out.  I tried to avoid
 double-logging where an mmio read is dispatched to userspace, and
 the instruction is re-executed.
 
 Perhaps we should split it into a decode trace and execution result
 trace?  Less easy to use but avoids confusion due to duplication.

I don't think duplication introduces confusion, as long as one can see
rip on the trace entry (the duplication can actually be useful).


Re: [PATCHv6 0/4] qemu-kvm: vhost net port

2010-04-06 Thread Marcelo Tosatti
On Sun, Apr 04, 2010 at 07:30:20PM +0300, Avi Kivity wrote:
 On 04/04/2010 02:46 PM, Michael S. Tsirkin wrote:
 On Wed, Mar 24, 2010 at 02:38:57PM +0200, Avi Kivity wrote:
 On 03/17/2010 03:04 PM, Michael S. Tsirkin wrote:
 This is port of vhost v6 patch set I posted previously to qemu-kvm, for
 those that want to get good performance out of it :) This patchset needs
 to be applied when qemu.git one gets merged, this includes irqchip
 support.
 
 
 Ping me when this happens please.
 Ping
 
 Bounce.

Applied, thanks.



[PATCH v3] Add Mergeable receive buffer support to vhost_net

2010-04-06 Thread David L Stevens

This patch adds support for the Mergeable Receive Buffers feature to
vhost_net.

+-DLS

Changes from the previous revision:
1) renamed:
vhost_discard_vq_desc -> vhost_discard_desc
vhost_get_heads -> vhost_get_desc_n
vhost_get_vq_desc -> vhost_get_desc
2) added heads as an argument to vhost_get_desc_n
3) changed vq->heads from iovec to vring_used_elem, removed casts
4) changed vhost_add_used to do multiple elements in a single
copy_to_user, or two when we wrap the ring (see the sketch after
this list)
5) removed rxmaxheadcount and the available-buffer checks in favor of
running until an allocation failure, but making sure we break the loop
if we get two in a row, indicating we have at least 1 buffer, but not
enough for the current receive packet
6) restore non-vnet header handling
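As an illustration of item 4 (not the patch code itself), writing a batch of used elements needs at most two copy_to_user() calls when the used ring wraps:

#include <linux/uaccess.h>
#include <linux/virtio_ring.h>
#include <linux/kernel.h>
#include <linux/errno.h>

/* hypothetical helper: write 'count' used elems starting at ring slot 'start' */
static int used_ring_write(struct vring_used_elem __user *ring, unsigned size,
			   unsigned start, struct vring_used_elem *heads,
			   unsigned count)
{
	unsigned first = min(count, size - start);

	/* contiguous part, up to the end of the ring */
	if (copy_to_user(ring + start, heads, first * sizeof(*heads)))
		return -EFAULT;
	/* wrapped remainder, if any, continues at slot 0 */
	if (count > first &&
	    copy_to_user(ring, heads + first, (count - first) * sizeof(*heads)))
		return -EFAULT;
	return 0;
}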

Signed-Off-By: David L Stevens dlstev...@us.ibm.com

diff -ruNp net-next-p0/drivers/vhost/net.c
net-next-v3/drivers/vhost/net.c
--- net-next-p0/drivers/vhost/net.c 2010-03-22 12:04:38.0 -0700
+++ net-next-v3/drivers/vhost/net.c 2010-04-06 12:54:56.0 -0700
@@ -130,9 +130,8 @@ static void handle_tx(struct vhost_net *
hdr_size = vq->hdr_size;
 
for (;;) {
-   head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
-ARRAY_SIZE(vq->iov),
-&out, &in,
+   head = vhost_get_desc(&net->dev, vq, vq->iov,
+ARRAY_SIZE(vq->iov), &out, &in,
 NULL, NULL);
/* Nothing new?  Wait for eventfd to tell us they refilled. */
if (head == vq->num) {
@@ -167,8 +166,15 @@ static void handle_tx(struct vhost_net *
/* TODO: Check specific error and bomb out unless ENOBUFS? */
err = sock->ops->sendmsg(NULL, sock, &msg, len);
if (unlikely(err < 0)) {
-   vhost_discard_vq_desc(vq);
-   tx_poll_start(net, sock);
+   if (err == -EAGAIN) {
+   vhost_discard_desc(vq, 1);
+   tx_poll_start(net, sock);
+   } else {
+   vq_err(vq, "sendmsg: errno %d\n", -err);
+   /* drop packet; do not discard/resend */
+   vhost_add_used_and_signal(&net->dev, vq, head,
+ 0);
+   }
break;
}
if (err != len)
@@ -186,12 +192,25 @@ static void handle_tx(struct vhost_net *
 unuse_mm(net->dev.mm);
 }
 
+static int vhost_head_len(struct sock *sk)
+{
+   struct sk_buff *head;
+   int len = 0;
+
+   lock_sock(sk);
+   head = skb_peek(&sk->sk_receive_queue);
+   if (head)
+   len = head->len;
+   release_sock(sk);
+   return len;
+}
+
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_rx(struct vhost_net *net)
 {
struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
-   unsigned head, out, in, log, s;
+   unsigned in, log, s;
struct vhost_log *vq_log;
struct msghdr msg = {
.msg_name = NULL,
@@ -202,13 +221,14 @@ static void handle_rx(struct vhost_net *
.msg_flags = MSG_DONTWAIT,
};
 
-   struct virtio_net_hdr hdr = {
-   .flags = 0,
-   .gso_type = VIRTIO_NET_HDR_GSO_NONE
+   struct virtio_net_hdr_mrg_rxbuf hdr = {
+   .hdr.flags = 0,
+   .hdr.gso_type = VIRTIO_NET_HDR_GSO_NONE
};
 
+   int retries = 0;
size_t len, total_len = 0;
-   int err;
+   int err, headcount, datalen;
size_t hdr_size;
struct socket *sock = rcu_dereference(vq->private_data);
if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
@@ -222,31 +242,25 @@ static void handle_rx(struct vhost_net *
vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
vq->log : NULL;
 
-   for (;;) {
-   head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
-ARRAY_SIZE(vq->iov),
-&out, &in,
-vq_log, &log);
+   while ((datalen = vhost_head_len(sock->sk))) {
+   headcount = vhost_get_desc_n(vq, vq->heads, datalen, &in,
+vq_log, &log);
/* OK, now we need to know about added descriptors. */
-   if (head == vq->num) {
-   if (unlikely(vhost_enable_notify(vq))) {
+   if (!headcount) {
+   if (retries == 0 && unlikely(vhost_enable_notify(vq))) {
/* They have 

Re: Setting nx bit in virtual CPU

2010-04-06 Thread Richard Simpson
On 05/04/10 09:27, Avi Kivity wrote:
 On 04/03/2010 12:07 AM, Richard Simpson wrote:
 Nope, both Kernels are 64 bit.

 uname -a Host: Linux gordon 2.6.27-gentoo-r8 #5 Sat Mar 14 18:01:59 GMT
 2009 x86_64 AMD Athlon(tm) 64 Processor 3200+ AuthenticAMD GNU/Linux

 uname -a Guest: Linux andrew 2.6.28-hardened-r9 #4 Mon Jan 18 22:39:31
 GMT 2010 x86_64 AMD Athlon(tm) 64 Processor 3200+ AuthenticAMD GNU/Linux

 As you can see, both kernels are a little old, and I have been wondering
 if that might be part of the problem.  The Guest one is old because that
 is the latest stable hardened version in Gentoo.  The host one is old
 because of:

 
 2.6.27 should be plenty fine for nx.  Really the important bit is that
 the host kernel has nx enabled.  Can you check if that is so?
 
Umm, could you give me a clue about how to do that.  It is some time
since I configured the host kernel, but I do have a /proc/config.gz.
Could I check by looking in that?

Thanks


[PATCH] KVM test: Fix some typos on autotest run utility function

2010-04-06 Thread Lucas Meneghel Rodrigues
Fix some typos found in the utility function that runs
autotest tests on a guest.

Signed-off-by: Lucas Meneghel Rodrigues l...@redhat.com
---
 client/tests/kvm/kvm_test_utils.py |8 
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/client/tests/kvm/kvm_test_utils.py 
b/client/tests/kvm/kvm_test_utils.py
index 6ba834a..f512044 100644
--- a/client/tests/kvm/kvm_test_utils.py
+++ b/client/tests/kvm/kvm_test_utils.py
@@ -298,12 +298,12 @@ def run_autotest(vm, session, control_path, timeout, 
test_name, outputdir):
 @param dest_dir: Destination dir for the contents
 
 basename = os.path.basename(remote_path)
-logging.info("Extracting %s..." % basename)
+logging.info("Extracting %s...", basename)
 (status, output) = session.get_command_status_output(
   "tar xjvf %s -C %s" % (remote_path, 
dest_dir))
 if status != 0:
 logging.error("Uncompress output:\n%s" % output)
-raise error.TestFail("Could not extract % on guest")
+raise error.TestFail("Could not extract %s on guest" % basename)
 
 if not os.path.isfile(control_path):
 raise error.TestError("Invalid path to autotest control file: %s" %
@@ -356,7 +356,7 @@ def run_autotest(vm, session, control_path, timeout, 
test_name, outputdir):
 raise error.TestFail("Could not copy the test control file to guest")
 
 # Run the test
-logging.info("Running test '%s'..." % test_name)
+logging.info("Running test '%s'...", test_name)
 session.get_command_output("cd %s" % autotest_path)
 session.get_command_output("rm -f control.state")
 session.get_command_output("rm -rf results/*")
@@ -364,7 +364,7 @@ def run_autotest(vm, session, control_path, timeout, 
 test_name, outputdir):
 status = session.get_command_status("bin/autotest control",
 timeout=timeout,
 print_func=logging.info)
-logging.info("--End of test output ")
+logging.info("- End of test output ")
 if status is None:
 raise error.TestFail("Timeout elapsed while waiting for autotest to "
  "complete")
-- 
1.6.6.1



[RFC] vhost-blk implementation (v2)

2010-04-06 Thread Badari Pulavarty
Hi All,

Here is the latest version of the vhost-blk implementation.
The major difference from my previous implementation is that I
now merge all contiguous requests (both read and write) before
submitting them. This significantly improved IO performance.
I am still collecting performance numbers and will be posting
them in the next few days.
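As an aside, the contiguity test behind that merging can be pictured with a helper like the one below; it assumes the vhost_blk_io fields from the patch (with len counted in bytes) and is illustrative only, not code from the patch:

/*
 * Illustrative only: two queued requests are merge candidates when they
 * go the same direction and the second starts exactly where the first ends.
 */
static int vbio_contiguous(struct vhost_blk_io *prev, struct vhost_blk_io *next)
{
	if ((prev->type & VIRTIO_BLK_T_OUT) != (next->type & VIRTIO_BLK_T_OUT))
		return 0;
	return prev->sector + (prev->len >> SECTOR_SHIFT) == next->sector;
}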

Comments ?

Todo:
- Address hch's comments on annotations
- Implement per device read/write queues
- Finish up error handling

Thanks,
Badari

---
 drivers/vhost/blk.c |  445 
 1 file changed, 445 insertions(+)

Index: net-next/drivers/vhost/blk.c
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ net-next/drivers/vhost/blk.c2010-04-06 16:38:03.563847905 -0400
@@ -0,0 +1,445 @@
+ /*
+  * virtio-block server in host kernel.
+  * Inspired by vhost-net and shamlessly ripped code from it :)
+  */
+
+#include <linux/compat.h>
+#include <linux/eventfd.h>
+#include <linux/vhost.h>
+#include <linux/virtio_net.h>
+#include <linux/virtio_blk.h>
+#include <linux/mmu_context.h>
+#include <linux/miscdevice.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/workqueue.h>
+#include <linux/rcupdate.h>
+#include <linux/file.h>
+
+#include "vhost.h"
+
+#define VHOST_BLK_VQ_MAX 1
+#define SECTOR_SHIFT 9
+
+struct vhost_blk {
+   struct vhost_dev dev;
+   struct vhost_virtqueue vqs[VHOST_BLK_VQ_MAX];
+   struct vhost_poll poll[VHOST_BLK_VQ_MAX];
+};
+
+struct vhost_blk_io {
+   struct list_head list;
+   struct work_struct work;
+   struct vhost_blk *blk;
+   struct file *file;
+   int head;
+   uint32_t type;
+   uint32_t nvecs;
+   uint64_t sector;
+   uint64_t len;
+   struct iovec iov[0];
+};
+
+static struct workqueue_struct *vblk_workqueue;
+static LIST_HEAD(write_queue);
+static LIST_HEAD(read_queue);
+
+static void handle_io_work(struct work_struct *work)
+{
+   struct vhost_blk_io *vbio, *entry;
+   struct vhost_virtqueue *vq;
+   struct vhost_blk *blk;
+   struct list_head single, *head, *node, *tmp;
+
+   int i, need_free, ret = 0;
+   loff_t pos;
+   uint8_t status = 0;
+
+   vbio = container_of(work, struct vhost_blk_io, work);
+   blk = vbio->blk;
+   vq = &blk->dev.vqs[0];
+   pos = vbio->sector << 8;
+
+   use_mm(blk->dev.mm);
+   if (vbio->type & VIRTIO_BLK_T_FLUSH)  {
+   ret = vfs_fsync(vbio->file, vbio->file->f_path.dentry, 1);
+   } else if (vbio->type & VIRTIO_BLK_T_OUT) {
+   ret = vfs_writev(vbio->file, vbio->iov, vbio->nvecs, &pos);
+   } else {
+   ret = vfs_readv(vbio->file, vbio->iov, vbio->nvecs, &pos);
+   }
+   status = (ret < 0) ? VIRTIO_BLK_S_IOERR : VIRTIO_BLK_S_OK;
+   if (vbio->head != -1) {
+   INIT_LIST_HEAD(&single);
+   list_add(&vbio->list, &single);
+   head = &single;
+   need_free = 0;
+   } else {
+   head = &vbio->list;
+   need_free = 1;
+   }
+   list_for_each_entry(entry, head, list) {
+   copy_to_user(entry->iov[entry->nvecs].iov_base, &status, sizeof 
status);
+   }
+   mutex_lock(&vq->mutex);
+   list_for_each_safe(node, tmp, head) {
+   entry = list_entry(node, struct vhost_blk_io, list);
+   vhost_add_used_and_signal(&blk->dev, vq, entry->head, ret);
+   list_del(node);
+   kfree(entry);
+   }
+   mutex_unlock(&vq->mutex);
+   unuse_mm(blk->dev.mm);
+   if (need_free)
+   kfree(vbio);
+}
+
+static struct vhost_blk_io *allocate_vbio(int nvecs)
+{
+   struct vhost_blk_io *vbio;
+   int size = sizeof(struct vhost_blk_io) + nvecs * sizeof(struct iovec);
+   vbio = kmalloc(size, GFP_KERNEL);
+   if (vbio) {
+   INIT_WORK(&vbio->work, handle_io_work);
+   INIT_LIST_HEAD(&vbio->list);
+   }
+   return vbio;
+}
+
+static void merge_and_handoff_work(struct list_head *queue)
+{
+   struct vhost_blk_io *vbio, *entry;
+   int nvecs = 0;
+   int entries = 0;
+
+   list_for_each_entry(entry, queue, list) {
+   nvecs += entry->nvecs;
+   entries++;
+   }
+
+   if (entries == 1) {
+   vbio = list_first_entry(queue, struct vhost_blk_io, list);
+   list_del(&vbio->list);
+   queue_work(vblk_workqueue, &vbio->work);
+   return;
+   }
+
+   vbio = allocate_vbio(nvecs);
+   if (!vbio) {
+   /* Unable to allocate memory - submit IOs individually */
+   list_for_each_entry(vbio, queue, list) {
+   queue_work(vblk_workqueue, &vbio->work);
+   }
+   INIT_LIST_HEAD(queue);
+   return;
+   }
+
+   entry = list_first_entry(queue, struct vhost_blk_io, list);
+   vbio->nvecs = nvecs;
+   vbio->blk = entry->blk;
+   

Re: virsh dump blocking problem

2010-04-06 Thread KAMEZAWA Hiroyuki
On Tue, 06 Apr 2010 09:35:09 +0800
Gui Jianfeng guijianf...@cn.fujitsu.com wrote:

 Hi all,
 
 I'm not sure whether it's appropriate to post the problem here.
 I played with virsh under Fedora 12, and started a KVM Fedora 12 guest
 with the virsh start command. The Fedora 12 guest started successfully.
 Then I ran the following command to dump the guest core:
 #virsh dump 1 mycoredump (domain id is 1)
 
 This command seems to block and never return. According to the strace
 output, virsh dump seems to be blocking in a poll() call. I think
 the following should be the call trace of virsh.
 
 cmdDump()
   -> virDomainCoreDump()
 -> remoteDomainCoreDump()
  -> call()
  -> remoteIO()
  -> remoteIOEventLoop()
   -> poll(fds, ARRAY_CARDINALITY(fds), -1)
 
 
 Any one encounters this problem also, any thoughts?
 

I hit it too, and it seems qemu-kvm keeps counting the number of dirty pages
and never answers libvirt. The guest never works and I have to kill it.

I hit this with 2.6.32 + qemu-0.12.3 + libvirt 0.7.7.1.
When I updated the host kernel to 2.6.33, qemu-kvm never worked, so I moved
back to Fedora 12's latest qemu-kvm.

Now, with 2.6.34-rc3 + qemu-0.11.0-13.fc12.x86_64 + libvirt 0.7.7.1,
# virsh dump
hangs.

In most case, I see following 2 back trace.(with gdb)

(gdb) bt
#0  ram_save_remaining () at /usr/src/debug/qemu-kvm-0.11.0/vl.c:3104
#1  ram_bytes_remaining () at /usr/src/debug/qemu-kvm-0.11.0/vl.c:3112
#2  0x004ab2cf in do_info_migrate (mon=0x16b7970) at migration.c:150
#3  0x00414b1a in monitor_handle_command (mon=<value optimized out>,
cmdline=<value optimized out>)
at /usr/src/debug/qemu-kvm-0.11.0/monitor.c:2870
#4  0x00414c6a in monitor_command_cb (mon=0x16b7970,
cmdline=<value optimized out>, opaque=<value optimized out>)
at /usr/src/debug/qemu-kvm-0.11.0/monitor.c:3160
#5  0x0048b71b in readline_handle_byte (rs=0x208d6a0,
ch=<value optimized out>) at readline.c:369
#6  0x00414cdc in monitor_read (opaque=<value optimized out>,
buf=0x7fff1b1104b0 "info migrate\r", size=13)
at /usr/src/debug/qemu-kvm-0.11.0/monitor.c:3146
#7  0x004b2a53 in tcp_chr_read (opaque=0x1614c30) at qemu-char.c:2006
#8  0x0040a6c7 in main_loop_wait (timeout=<value optimized out>)
at /usr/src/debug/qemu-kvm-0.11.0/vl.c:4188
#9  0x0040eed5 in main_loop (argc=<value optimized out>,
argv=<value optimized out>, envp=<value optimized out>)
at /usr/src/debug/qemu-kvm-0.11.0/vl.c:4414
#10 main (argc=<value optimized out>, argv=<value optimized out>,
envp=<value optimized out>) at /usr/src/debug/qemu-kvm-0.11.0/vl.c:6263


(gdb) bt
#0  0x003c2680e0bd in write () at ../sysdeps/unix/syscall-template.S:82
#1  0x004b304a in unix_write (fd=11, buf=<value optimized out>, len1=40)
at qemu-char.c:512
#2  send_all (fd=11, buf=<value optimized out>, len1=40) at qemu-char.c:528
#3  0x00411201 in monitor_flush (mon=0x16b7970)
at /usr/src/debug/qemu-kvm-0.11.0/monitor.c:131
#4  0x00414cdc in monitor_read (opaque=<value optimized out>,
buf=0x7fff1b1104b0 "info migrate\r", size=13)
at /usr/src/debug/qemu-kvm-0.11.0/monitor.c:3146
#5  0x004b2a53 in tcp_chr_read (opaque=0x1614c30) at qemu-char.c:2006
#6  0x0040a6c7 in main_loop_wait (timeout=<value optimized out>)
at /usr/src/debug/qemu-kvm-0.11.0/vl.c:4188
#7  0x0040eed5 in main_loop (argc=<value optimized out>,
argv=<value optimized out>, envp=<value optimized out>)
at /usr/src/debug/qemu-kvm-0.11.0/vl.c:4414
#8  main (argc=<value optimized out>, argv=<value optimized out>,
envp=<value optimized out>) at /usr/src/debug/qemu-kvm-0.11.0/vl.c:6263

And I see no dump progress.

I'm sorry if this is not a hang but just very slow. I don't see any
progress for at least 15 minutes and qemu-kvm continues to use 75% of the cpus.
I'm not sure why the dump command triggers the migration code...

How long does it take to do "virsh dump xxx" on an idle VM with 2G of memory?
I'm sorry if I'm asking on the wrong mailing list.

Thanks,
-Kame











RE: [PATCH v1 2/3] Provides multiple submits and asynchronous notifications.

2010-04-06 Thread Xin, Xiaohui
Michael,
   For the write logging, do you have a function at hand with which we can
   recompute the log? If so, I think I can use it to recompute the
  log info when logging is suddenly enabled.
   For the outstanding requests, do you mean all the user buffers that have
  been submitted before the logging ioctl changed? That may be a lot, and
   some of them are still in NIC ring descriptors. Waiting for them to
  finish may take some time. I think it is also reasonable that when the
   logging ioctl changes, the logging changes just after that.

  The key point is that after the logging ioctl returns, any
  subsequent change to memory must be logged. It does not
  matter when the request was submitted, otherwise we will
  get memory corruption on migration.

  The change to memory happens in vhost_add_used_and_signal(), right?
  So after the ioctl returns, just recomputing the log info for the events
  in the async queue is ok, since the ioctl and write-log operations are
  all protected by
  vq->mutex.

  Thanks
  Xiaohui

 Yes, I think this will work.

 Thanks, so do you have the function to recompute the log info at hand
 that I can
use? I vaguely remember that you mentioned it some time ago.

Doesn't just rerunning vhost_get_vq_desc work?

Am I missing something here?
vhost_get_vq_desc() looks in the vq, finds the first available buffer, and
converts it to an iovec. I think the first available buffer is not one of the
buffers in the async queue, so I
think rerunning vhost_get_vq_desc() cannot work.

Thanks
Xiaohui

   Thanks
   Xiaohui
  
drivers/vhost/net.c   |  189 
   +++--
drivers/vhost/vhost.h |   10 +++
2 files changed, 192 insertions(+), 7 deletions(-)
  
   diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
   index 22d5fef..2aafd90 100644
   --- a/drivers/vhost/net.c
   +++ b/drivers/vhost/net.c
   @@ -17,11 +17,13 @@
 #include <linux/workqueue.h>
 #include <linux/rcupdate.h>
 #include <linux/file.h>
    +#include <linux/aio.h>
   
 #include <linux/net.h>
 #include <linux/if_packet.h>
 #include <linux/if_arp.h>
 #include <linux/if_tun.h>
    +#include <linux/mpassthru.h>
   
 #include <net/sock.h>
  
   @@ -47,6 +49,7 @@ struct vhost_net {
 struct vhost_dev dev;
 struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
 struct vhost_poll poll[VHOST_NET_VQ_MAX];
   + struct kmem_cache   *cache;
 /* Tells us whether we are polling a socket for TX.
  * We only do this when socket buffer fills up.
  * Protected by tx vq lock. */
   @@ -91,11 +94,88 @@ static void tx_poll_start(struct vhost_net *net, 
   struct socket *sock)
  net->tx_poll_state = VHOST_NET_POLL_STARTED;
}
  
   +struct kiocb *notify_dequeue(struct vhost_virtqueue *vq)
   +{
   + struct kiocb *iocb = NULL;
   + unsigned long flags;
   +
    + spin_lock_irqsave(&vq->notify_lock, flags);
    + if (!list_empty(&vq->notifier)) {
    + iocb = list_first_entry(&vq->notifier,
    + struct kiocb, ki_list);
    + list_del(&iocb->ki_list);
    + }
    + spin_unlock_irqrestore(&vq->notify_lock, flags);
   + return iocb;
   +}
   +
   +static void handle_async_rx_events_notify(struct vhost_net *net,
   + struct vhost_virtqueue *vq)
   +{
   + struct kiocb *iocb = NULL;
   + struct vhost_log *vq_log = NULL;
   + int rx_total_len = 0;
   + int log, size;
   +
    + if (vq->link_state != VHOST_VQ_LINK_ASYNC)
    + return;
    +
    + if (vq->receiver)
    + vq->receiver(vq);
    +
    + vq_log = unlikely(vhost_has_feature(
    + &net->dev, VHOST_F_LOG_ALL)) ? vq->log : NULL;
    + while ((iocb = notify_dequeue(vq)) != NULL) {
    + vhost_add_used_and_signal(&net->dev, vq,
    + iocb->ki_pos, iocb->ki_nbytes);
    + log = (int)iocb->ki_user_data;
    + size = iocb->ki_nbytes;
    + rx_total_len += iocb->ki_nbytes;
    +
    + if (iocb->ki_dtor)
    + iocb->ki_dtor(iocb);
    + kmem_cache_free(net->cache, iocb);
    +
    + if (unlikely(vq_log))
    + vhost_log_write(vq, vq_log, log, size);
    + if (unlikely(rx_total_len >= VHOST_NET_WEIGHT)) {
    + vhost_poll_queue(&vq->poll);
   + break;
   + }
   + }
   +}
   +
   +static void handle_async_tx_events_notify(struct vhost_net *net,
   + struct vhost_virtqueue *vq)
   +{
   + struct kiocb *iocb = NULL;
   + int tx_total_len = 0;
   +
    + if (vq->link_state != VHOST_VQ_LINK_ASYNC)
    + return;
    +
    + while ((iocb = notify_dequeue(vq)) != NULL) {
    + vhost_add_used_and_signal(&net->dev, vq,
    + iocb->ki_pos, 0);
    + tx_total_len += iocb->ki_nbytes;
    +
    + if (iocb->ki_dtor)
    + iocb->ki_dtor(iocb);
    +
    + kmem_cache_free(net->cache, iocb);
    + if (unlikely(tx_total_len >= 

[PATCH] [RFC] KVM test: Introduce sample performance test set

2010-04-06 Thread Lucas Meneghel Rodrigues
As part of the performance testing effort for KVM,
introduce a base performance testset for the
sample KVM control file. It will execute several
benchmarks on a Fedora 12 guest, bringing back the
results to the host. This base testset can be tweaked
for folks interested in getting figures from a
particular KVM release.

Signed-off-by: Lucas Meneghel Rodrigues l...@redhat.com
Signed-off-by: Jes Sorensen jes.soren...@redhat.com
---
 .../tests/kvm/autotest_control/hackbench.control   |6 +-
 client/tests/kvm/autotest_control/iozone.control   |   18 +++
 .../tests/kvm/autotest_control/kernbench.control   |2 +-
 client/tests/kvm/autotest_control/lmbench.control  |   33 +
 .../tests/kvm/autotest_control/performance.control |  123 
 client/tests/kvm/autotest_control/reaim.control|   11 ++
 client/tests/kvm/autotest_control/tiobench.control |   12 ++
 .../tests/kvm/autotest_control/unixbench.control   |   11 ++
 client/tests/kvm/tests.cfg.sample  |   13 ++
 client/tests/kvm/tests_base.cfg.sample |   21 
 10 files changed, 247 insertions(+), 3 deletions(-)
 create mode 100644 client/tests/kvm/autotest_control/iozone.control
 create mode 100644 client/tests/kvm/autotest_control/lmbench.control
 create mode 100644 client/tests/kvm/autotest_control/performance.control
 create mode 100644 client/tests/kvm/autotest_control/reaim.control
 create mode 100644 client/tests/kvm/autotest_control/tiobench.control
 create mode 100644 client/tests/kvm/autotest_control/unixbench.control

diff --git a/client/tests/kvm/autotest_control/hackbench.control 
b/client/tests/kvm/autotest_control/hackbench.control
index 5b94865..3248b26 100644
--- a/client/tests/kvm/autotest_control/hackbench.control
+++ b/client/tests/kvm/autotest_control/hackbench.control
@@ -1,4 +1,4 @@
-AUTHOR = Sudhir Kumar sku...@linux.vnet.ibm.com
+AUTHOR = nc...@google.com (Nikhil Rao)
 NAME = Hackbench
 TIME = SHORT
 TEST_CLASS = Kernel
@@ -6,8 +6,10 @@ TEST_CATEGORY = Benchmark
 TEST_TYPE = client
 
 DOC = 
-Hackbench is a benchmark which measures the performance, overhead and
+Hackbench is a benchmark for measuring the performance, overhead and
 scalability of the Linux scheduler.
 
+hackbench.c copied from:
+http://people.redhat.com/~mingo/cfs-scheduler/tools/hackbench.c
 
 job.run_test('hackbench')
diff --git a/client/tests/kvm/autotest_control/iozone.control 
b/client/tests/kvm/autotest_control/iozone.control
new file mode 100644
index 000..17d9be2
--- /dev/null
+++ b/client/tests/kvm/autotest_control/iozone.control
@@ -0,0 +1,18 @@
+AUTHOR = Ying Tao ying...@cn.ibm.com
+TIME = MEDIUM
+NAME = IOzone
+TEST_TYPE = client
+TEST_CLASS = Kernel
+TEST_CATEGORY = Benchmark
+
+DOC = 
+Iozone is useful for performing a broad filesystem analysis of a vendors
+computer platform. The benchmark tests file I/O performance for the following
+operations:
+  Read, write, re-read, re-write, read backwards, read strided, fread,
+  fwrite, random read, pread ,mmap, aio_read, aio_write
+
+For more information see http://www.iozone.org
+
+
+job.run_test('iozone')
diff --git a/client/tests/kvm/autotest_control/kernbench.control 
b/client/tests/kvm/autotest_control/kernbench.control
index 76a546e..9fc5da7 100644
--- a/client/tests/kvm/autotest_control/kernbench.control
+++ b/client/tests/kvm/autotest_control/kernbench.control
@@ -1,4 +1,4 @@
-AUTHOR = Sudhir Kumar sku...@linux.vnet.ibm.com
+AUTHOR = mbl...@google.com (Martin Bligh)
 NAME = Kernbench
 TIME = SHORT
 TEST_CLASS = Kernel
diff --git a/client/tests/kvm/autotest_control/lmbench.control 
b/client/tests/kvm/autotest_control/lmbench.control
new file mode 100644
index 000..95b47fb
--- /dev/null
+++ b/client/tests/kvm/autotest_control/lmbench.control
@@ -0,0 +1,33 @@
+NAME = lmbench
+AUTHOR = Martin Bligh mbl...@google.com
+TIME = MEDIUM
+TEST_CATEGORY = BENCHMARK
+TEST_CLASS = KERNEL
+TEST_TYPE = CLIENT
+DOC = 
+README for lmbench 2alpha8 net release.
+
+To run the benchmark, you should be able to say:
+
+cd src
+make results
+
+If you want to see how you did compared to the other system results
+included here, say
+
+make see
+
+Be warned that many of these benchmarks are sensitive to other things
+being run on the system, mainly from CPU cache and CPU cycle effects.
+So make sure your screen saver is not running, etc.
+
+It's a good idea to do several runs and compare the output like so
+
+make results
+make rerun
+make rerun
+make rerun
+cd Results && make LIST=<your OS>/*
+
+
+job.run_test('lmbench')
diff --git a/client/tests/kvm/autotest_control/performance.control b/client/tests/kvm/autotest_control/performance.control
new file mode 100644
index 000..5bc0b28
--- /dev/null
+++ b/client/tests/kvm/autotest_control/performance.control
@@ -0,0 +1,123 @@
+def step_init():
+job.next_step('step0')
+job.next_step('step1')
+job.next_step('step2')
+

buildbot failure in qemu-kvm on disable_kvm_x86_64_debian_5_0

2010-04-06 Thread qemu-kvm
The Buildbot has detected a new failure of disable_kvm_x86_64_debian_5_0 on 
qemu-kvm.
Full details are available at:
 
http://buildbot.b1-systems.de/qemu-kvm/builders/disable_kvm_x86_64_debian_5_0/builds/336

Buildbot URL: http://buildbot.b1-systems.de/qemu-kvm/

Buildslave for this Build: b1_qemu_kvm_1

Build Reason: The Nightly scheduler named 'nightly_disable_kvm' triggered this 
build
Build Source Stamp: [branch master] HEAD
Blamelist: 

BUILD FAILED: failed compile

sincerely,
 -The Buildbot



buildbot failure in qemu-kvm on disable_kvm_i386_debian_5_0

2010-04-06 Thread qemu-kvm
The Buildbot has detected a new failure of disable_kvm_i386_debian_5_0 on 
qemu-kvm.
Full details are available at:
 
http://buildbot.b1-systems.de/qemu-kvm/builders/disable_kvm_i386_debian_5_0/builds/337

Buildbot URL: http://buildbot.b1-systems.de/qemu-kvm/

Buildslave for this Build: b1_qemu_kvm_2

Build Reason: The Nightly scheduler named 'nightly_disable_kvm' triggered this 
build
Build Source Stamp: [branch master] HEAD
Blamelist: 

BUILD FAILED: failed compile

sincerely,
 -The Buildbot



buildbot failure in qemu-kvm on disable_kvm_i386_out_of_tree

2010-04-06 Thread qemu-kvm
The Buildbot has detected a new failure of disable_kvm_i386_out_of_tree on 
qemu-kvm.
Full details are available at:
 
http://buildbot.b1-systems.de/qemu-kvm/builders/disable_kvm_i386_out_of_tree/builds/285

Buildbot URL: http://buildbot.b1-systems.de/qemu-kvm/

Buildslave for this Build: b1_qemu_kvm_2

Build Reason: The Nightly scheduler named 'nightly_disable_kvm' triggered this 
build
Build Source Stamp: [branch master] HEAD
Blamelist: 

BUILD FAILED: failed compile

sincerely,
 -The Buildbot



buildbot failure in qemu-kvm on disable_kvm_x86_64_out_of_tree

2010-04-06 Thread qemu-kvm
The Buildbot has detected a new failure of disable_kvm_x86_64_out_of_tree on 
qemu-kvm.
Full details are available at:
 
http://buildbot.b1-systems.de/qemu-kvm/builders/disable_kvm_x86_64_out_of_tree/builds/285

Buildbot URL: http://buildbot.b1-systems.de/qemu-kvm/

Buildslave for this Build: b1_qemu_kvm_1

Build Reason: The Nightly scheduler named 'nightly_disable_kvm' triggered this 
build
Build Source Stamp: [branch master] HEAD
Blamelist: 

BUILD FAILED: failed compile

sincerely,
 -The Buildbot



RE: [PATCH 1/3] A device for zero-copy based on KVM virtio-net.

2010-04-06 Thread Xin, Xiaohui
Michael,
 
 Qemu needs a userspace write, is that a synchronous one or
asynchronous one?

It's a synchronous non-blocking write.
Sorry, why does Qemu live migration need the device to have a userspace write?
How does the write operation work? And why is a read operation not a concern here?

Thanks
Xiaohui


Re: Setting nx bit in virtual CPU

2010-04-06 Thread Avi Kivity

On 04/07/2010 01:31 AM, Richard Simpson wrote:



2.6.27 should be plenty fine for nx.  Really the important bit is that
the host kernel has nx enabled.  Can you check if that is so?

 

Umm, could you give me a clue about how to do that? It has been some time
since I configured the host kernel, but I do have a /proc/config.gz.
Could I check by looking in that?
   


The attached script should verify it.

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

#!/usr/bin/python

import struct

class msr(object):
    def __init__(self):
        try:
            self.f = file('/dev/cpu/0/msr')
        except:
            self.f = file('/dev/msr0')
    def read(self, index, default = None):
        self.f.seek(index)
        try:
            return struct.unpack('Q', self.f.read(8))[0]
        except:
            return default

# EFER is MSR 0xc0000080; bit 11 (NXE) is set when the kernel has enabled
# no-execute support.
efer = msr().read(0xc0000080, 0)
nx = (efer >> 11) & 1

if nx:
    print 'nx: enabled'
else:
    print 'nx: disabled'
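
A complementary quick check, as a minimal sketch assuming /proc/cpuinfo is readable: it only shows whether the CPU advertises NX at all, whereas EFER.NXE (read by the script above) shows whether the kernel actually enabled it.

#!/usr/bin/python
# Minimal sketch: report whether the 'nx' flag appears in /proc/cpuinfo.
for line in open('/proc/cpuinfo'):
    if line.startswith('flags'):
        print 'nx in cpu flags:', 'nx' in line.split()
        break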


Re: PCI passthrough resource remapping

2010-04-06 Thread Avi Kivity

On 03/31/2010 06:18 PM, Chris Wright wrote:



Hrm, I'm not sure these would be related to the small BAR region patch.
It looks more like a timing issue.
 

small BAR == slow path == timing issue?
   


It would be interesting to verify this with perf, using the 'kvm:kvm_mmio'
tracepoint event, to see how many MMIO exits happen per second.
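
A rough sketch of that measurement, assuming perf and the kvm tracepoints are available on the host and the event name matches this kernel:

#!/usr/bin/python
# Count kvm:kvm_mmio events system-wide for 10 seconds; perf stat prints
# the totals when the dummy workload (sleep 10) finishes.
import subprocess
subprocess.call(['perf', 'stat', '-e', 'kvm:kvm_mmio', '-a', 'sleep', '10'])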


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [questions] savevm|loadvm

2010-04-06 Thread Avi Kivity

On 04/01/2010 10:35 PM, Wenhao Xu wrote:

Does current qemu-kvm (qemu v0.12.3) use KVM's irqchip and pit? I
cannot find any KVM_CREATE_IRQCHIP or KVM_CREATE_PIT calls in the qemu
code.

   


Are you looking at qemu or qemu-kvm?


Concerning the interface between qemu and kvm, I have the following confusion:
1. How do KVM's irqchip and pit collaborate with QEMU's irq and pit
emulation? As far as I can see, qemu-kvm still uses qemu's irq and
pit emulation, doesn't it?
   


No, they're completely separate.


2. For return from KVM to QEMU, I cannot get the meaning of two exit reasons:
case KVM_EXIT_EXCEPTION:
What exceptions cause a KVM exit?
   


I think that's obsolete.


default:
    dprintf("kvm_arch_handle_exit\n");
    ret = kvm_arch_handle_exit(env, run);
Which exit reasons end up in the default case?

3. How can DMA interrupt the cpu when it finishes, given that qemu-kvm
is still running inside kvm at that point?
   


Usually the device that does the dma will raise an interrupt, which qemu 
is waiting for.



I am still working on the patch, but these points of confusion are really
preventing me from moving forward. Thanks in advance for any further hints.


The following is the code so far I write:
The main idea is to synchronize the CPU state and enter emulation mode
when switching from kvm to the emulator. I only do the switch when the
exit reason is KVM_EXIT_IRQ_WINDOW_OPEN.
   


That doesn't happen with qemu-kvm.


However, I got the following error:
Whenever I switch from kvm to qemu, the interrupt request in qemu causes
qemu to enter SMM mode, which is definitely a bug.
   


Definitely shouldn't happen.

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.
