RE: [RFC][PATCH v2 0/3] Provide a zero-copy method on KVM virtio-net.
Sridhar, The idea is simple: pin the guest VM user space buffers and then give the host NIC driver the chance to DMA to them directly. The patches are based on the vhost-net backend driver. We add a device which provides proto_ops (sendmsg/recvmsg) to vhost-net, so it can send/receive directly to/from the NIC driver. A KVM guest which uses the vhost-net backend may bind any ethX interface on the host side to get copyless data transfer through the guest virtio-net frontend.

What is the advantage of this approach compared to PCI-passthrough of the host NIC to the guest?

PCI-passthrough needs hardware support: some kind of IOMMU engine is required to translate guest physical addresses to host physical addresses. And currently, a PCI-passthrough device cannot survive live migration. The zero-copy approach is a pure software solution; it doesn't need special hardware support, and in theory it can survive live migration.

Does this require pinning of the entire guest memory? Or only the send/receive buffers?

We only need to pin the send/receive buffers.

Thanks
Xiaohui

Thanks, Sridhar. The scenario is like this: the guest virtio-net driver submits multiple requests through the vhost-net backend driver to the kernel, and the requests are queued and then completed after the corresponding actions in h/w are done.

For read, user space buffers are dispensed to the NIC driver for rx when a page constructor API is invoked; that is, NICs can allocate user buffers from a page constructor. We add a hook in the netif_receive_skb() function to intercept the incoming packets and notify the zero-copy device. For write, the zero-copy device allocates a new host skb, puts the payload into skb_shinfo(skb)->frags, and copies the header into skb->data. The request remains pending until the skb is transmitted by h/w.

We have considered two ways to utilize the page constructor API to dispense the user buffers.

One: Modify the __alloc_skb() function a bit so that it only allocates the sk_buff structure, with the data pointer pointing to a user buffer which comes from the page constructor API. The shinfo of the skb is then also from the guest. When a packet is received from hardware, skb->data is filled directly by h/w. This is what we have implemented.
Pros: We can avoid any copy here.
Cons: The guest virtio-net driver needs to allocate the skb in almost the same way as the host NIC drivers do, i.e. with the netdev_alloc_skb() size and the same reserved space at the head of the skb. Many NIC drivers match the guest here and are fine with this, but some of the latest NIC drivers reserve special room in the skb head. To deal with that, we suggest providing a method in the guest virtio-net driver to ask the NIC driver for the parameters we are interested in once we know which device we have bound for zero-copy, and then have the guest follow them. Is that reasonable?

Two: Modify the driver to get user buffers allocated from a page constructor API (substituting alloc_page()); the user buffers are used as payload buffers and filled by h/w directly when a packet is received. The driver should associate the pages with the skb (skb_shinfo(skb)->frags). For the head buffer, let the host allocate the skb and have h/w fill it. After that, the data filled into the host skb header is copied into the guest header buffer, which is submitted together with the payload buffer.
Pros: We care less about how the guest or the host allocates its buffers.
Cons: We still need a small copy here for the skb header.

We are not sure which way is better. This is the first thing we want to get comments on from the community.
We wish the modifications to the network part to be generic, not used by the vhost-net backend only; a user application may use them as well once the zero-copy device provides async read/write operations later. Please give comments, especially on the network part modifications.

We provide multiple submits and asynchronous notification to vhost-net too. Our goal is to improve the bandwidth and reduce the CPU usage. Exact performance data will be provided later, but in a simple test with netperf we found that bandwidth goes up and CPU % goes up too, with the bandwidth increase much larger than the CPU % increase.

What we have not done yet:
- packet split support
- GRO support
- performance tuning

What we have done since v1:
- polish the RCU usage
- deal with write logging in asynchronous mode in vhost
- add a notifier block for the mp device
- rename page_ctor to mp_port in netdevice.h to make it look generic
- add mp_dev_change_flags() for the mp device to change NIC state
- add CONFIG_VHOST_MPASSTHRU to limit the usage when the module is not loaded
- a small fix for a missing dev_put when failing to use a dynamic minor
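To make the second option above concrete, here is a rough, illustrative sketch of how a NIC driver's rx allocation path could pull payload pages from a guest-supplied pool instead of alloc_page(). The names used below (netdev->mp_port, mp_port_has_buffers(), mp_alloc_page()) are placeholders standing in for the page constructor hook, not the actual interface defined by the patches:

static struct page *rx_alloc_payload_page(struct net_device *netdev)
{
	/* mp_port / mp_port_has_buffers() / mp_alloc_page() are hypothetical
	 * names standing in for the page constructor hook described above. */
	if (netdev->mp_port && mp_port_has_buffers(netdev->mp_port))
		/* a pinned guest page posted through the zero-copy device */
		return mp_alloc_page(netdev->mp_port);

	/* no guest buffers queued: fall back to a normal kernel page */
	return alloc_page(GFP_ATOMIC);
}

The driver would attach the returned page to skb_shinfo(skb)->frags as usual; only the header lands in a host-allocated skb and is copied into the guest header buffer afterwards.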
RE: [RFC] [PATCH v2 3/3] Let host NIC driver to DMA to guest user space.
From: Xin Xiaohui xiaohui@intel.com

The patch lets the host NIC driver receive user space skbs, so the driver has the chance to DMA directly to guest user space buffers through a single ethX interface. We want it to be more generic, as a zero-copy framework.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzha...@gmail.com
Signed-off-by: Jeff Dike jd...@c2.user-mode-linux.org
---
We considered 2 ways to utilize the user buffers, but are not sure which one is better. Please give any comments.

One: Modify the __alloc_skb() function a bit so that it only allocates the sk_buff structure, with the data pointer pointing to a user buffer which comes from a page constructor API. The shinfo of the skb is then also from the guest. When a packet is received from hardware, skb->data is filled directly by h/w. This is what we have implemented.
Pros: We can avoid any copy here.
Cons: The guest virtio-net driver needs to allocate the skb in almost the same way as the host NIC drivers do, i.e. with the netdev_alloc_skb() size and the same reserved space at the head of the skb. Many NIC drivers match the guest here and are fine with this, but some of the latest NIC drivers reserve special room in the skb head. To deal with that, we suggest providing a method in the guest virtio-net driver to ask the NIC driver for the parameters we are interested in once we know which device we have bound for zero-copy, and then have the guest follow them. Is that reasonable?

Two: Modify the driver to get user buffers allocated from a page constructor API (substituting alloc_page()); the user buffers are used as payload buffers and filled by h/w directly when a packet is received. The driver should associate the pages with the skb (skb_shinfo(skb)->frags). For the head buffer, let the host allocate the skb and have h/w fill it. After that, the data filled into the host skb header is copied into the guest header buffer, which is submitted together with the payload buffer.
Pros: We care less about how the guest or the host allocates its buffers.
Cons: We still need a small copy here for the skb header.

We are not sure which way is better. This is the first thing we want to get comments on from the community. We wish the modifications to the network part to be generic, not used by the vhost-net backend only; a user application may use them as well once the zero-copy device provides async read/write operations later.

Thanks
Xiaohui

How do you deal with the DoS problem of a hostile user space app posting a huge number of receives and never getting anything?

That's a problem we are trying to deal with; it's critical for the long term. Currently, we try to limit the number of pages it can pin, but are not sure how much is reasonable. For now, the buffers submitted come from the guest virtio-net driver, so it's safe to some extent just for now.

Thanks
Xiaohui
-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] A device for zero-copy based on KVM virtio-net.
On Tue, Apr 06, 2010 at 01:41:37PM +0800, Xin, Xiaohui wrote:

Michael, for the DoS issue, I'm not sure how large a limit on what get_user_pages() can pin is reasonable; should we compute it from the bandwidth?

There's a ulimit for locked memory. Can we use this, decreasing the value in the rlimit array? We can do this when the backend is enabled and re-increment it when the backend is disabled.

I have tried it with rlim[RLIMIT_MEMLOCK].rlim_cur, but I found the initial value for it is 0x10000; after a right shift by PAGE_SHIFT, that's only 16 pages we can lock, which seems too small, since the guest virtio-net driver may submit a lot of requests at one time.

Thanks
Xiaohui

Yes, that's the default, but the system administrator can always increase this value with ulimit if necessary.

--
MST
-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
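For reference, the locked-memory limit being discussed can be inspected from userspace with getrlimit(); a minimal example using only standard libc calls (nothing vhost-specific):

#include <stdio.h>
#include <sys/resource.h>
#include <unistd.h>

int main(void)
{
	struct rlimit rl;
	long page_size = sysconf(_SC_PAGESIZE);

	if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
		return 1;
	/* with the common default of 0x10000 bytes and 4KB pages this
	 * prints 16 pages, matching the observation above */
	printf("RLIMIT_MEMLOCK: %llu bytes = %llu pages\n",
	       (unsigned long long)rl.rlim_cur,
	       (unsigned long long)rl.rlim_cur / page_size);
	return 0;
}

The administrator can raise the limit for a session with, e.g., ulimit -l <kbytes> before starting the backend.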
Re: [PATCH v1 2/3] Provides multiple submits and asynchronous notifications.
On Tue, Apr 06, 2010 at 01:46:56PM +0800, Xin, Xiaohui wrote: Michael, For the write logging, do you have a function in hand that we can recompute the log? If that, I think I can use it to recompute the log info when the logging is suddenly enabled. For the outstanding requests, do you mean all the user buffers have submitted before the logging ioctl changed? That may be a lot, and some of them are still in NIC ring descriptors. Waiting them to be finished may be need some time. I think when logging ioctl changed, then the logging is changed just after that is also reasonable. The key point is that after loggin ioctl returns, any subsequent change to memory must be logged. It does not matter when was the request submitted, otherwise we will get memory corruption on migration. The change to memory happens when vhost_add_used_and_signal(), right? So after ioctl returns, just recompute the log info to the events in the async queue, is ok. Since the ioctl and write log operations are all protected by vq-mutex. Thanks Xiaohui Yes, I think this will work. Thanks, so do you have the function to recompute the log info in your hand that I can use? I have weakly remembered that you have noticed it before some time. Doesn't just rerunning vhost_get_vq_desc work? Thanks Xiaohui drivers/vhost/net.c | 189 +++-- drivers/vhost/vhost.h | 10 +++ 2 files changed, 192 insertions(+), 7 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index 22d5fef..2aafd90 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -17,11 +17,13 @@ #include linux/workqueue.h #include linux/rcupdate.h #include linux/file.h +#include linux/aio.h #include linux/net.h #include linux/if_packet.h #include linux/if_arp.h #include linux/if_tun.h +#include linux/mpassthru.h #include net/sock.h @@ -47,6 +49,7 @@ struct vhost_net { struct vhost_dev dev; struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX]; struct vhost_poll poll[VHOST_NET_VQ_MAX]; + struct kmem_cache *cache; /* Tells us whether we are polling a socket for TX. * We only do this when socket buffer fills up. * Protected by tx vq lock. */ @@ -91,11 +94,88 @@ static void tx_poll_start(struct vhost_net *net, struct socket *sock) net-tx_poll_state = VHOST_NET_POLL_STARTED; } +struct kiocb *notify_dequeue(struct vhost_virtqueue *vq) +{ + struct kiocb *iocb = NULL; + unsigned long flags; + + spin_lock_irqsave(vq-notify_lock, flags); + if (!list_empty(vq-notifier)) { + iocb = list_first_entry(vq-notifier, + struct kiocb, ki_list); + list_del(iocb-ki_list); + } + spin_unlock_irqrestore(vq-notify_lock, flags); + return iocb; +} + +static void handle_async_rx_events_notify(struct vhost_net *net, + struct vhost_virtqueue *vq) +{ + struct kiocb *iocb = NULL; + struct vhost_log *vq_log = NULL; + int rx_total_len = 0; + int log, size; + + if (vq-link_state != VHOST_VQ_LINK_ASYNC) + return; + + if (vq-receiver) + vq-receiver(vq); + + vq_log = unlikely(vhost_has_feature( + net-dev, VHOST_F_LOG_ALL)) ? 
vq-log : NULL; + while ((iocb = notify_dequeue(vq)) != NULL) { + vhost_add_used_and_signal(net-dev, vq, + iocb-ki_pos, iocb-ki_nbytes); + log = (int)iocb-ki_user_data; + size = iocb-ki_nbytes; + rx_total_len += iocb-ki_nbytes; + + if (iocb-ki_dtor) + iocb-ki_dtor(iocb); + kmem_cache_free(net-cache, iocb); + + if (unlikely(vq_log)) + vhost_log_write(vq, vq_log, log, size); + if (unlikely(rx_total_len = VHOST_NET_WEIGHT)) { + vhost_poll_queue(vq-poll); + break; + } + } +} + +static void handle_async_tx_events_notify(struct vhost_net *net, + struct vhost_virtqueue *vq) +{ + struct kiocb *iocb = NULL; + int tx_total_len = 0; + + if (vq-link_state != VHOST_VQ_LINK_ASYNC) + return; + + while ((iocb = notify_dequeue(vq)) != NULL) { + vhost_add_used_and_signal(net-dev, vq, + iocb-ki_pos, 0); + tx_total_len += iocb-ki_nbytes; + + if (iocb-ki_dtor) + iocb-ki_dtor(iocb); + + kmem_cache_free(net-cache, iocb); + if (unlikely(tx_total_len = VHOST_NET_WEIGHT)) { + vhost_poll_queue(vq-poll); + break; + } + } +} + /* Expects to be always run from workqueue - which acts as * read-size
Re: [PATCH 1/2] qemu-kvm: extboot: Keep variables in RAM
Avi Kivity wrote: On 02/18/2010 06:13 PM, Jan Kiszka wrote: Instead of saving the old INT 0x13 and 0x19 handlers in ROM which fails under QEMU as it enforces protection, keep them in spare vectors of the interrupt table, namely INT 0x80 and 0x81. Applied both, thanks. Forgot to tag it: Please consider the first one (Keep variables in RAM, 2dcbbec) for stable as well. Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Question on skip_emulated_instructions()
On Tue, Apr 06, 2010 at 01:11:23PM +0900, Yoshiaki Tamura wrote:

Hi. When handle_io() is called, rip is currently advanced *before* the I/O is actually handled by qemu in userland. While implementing Kemari for KVM (http://www.mail-archive.com/kvm@vger.kernel.org/msg25141.html), mainly in userland qemu, we encountered a problem: synchronizing the VCPU contents before handling I/O in qemu is too late, because rip has already been advanced in KVM. Although we avoided this issue with a temporary hack, I would like to ask a few questions about skip_emulated_instructions().

1. Does rip need to be advanced before the I/O is handled by qemu?

In current kvm.git, rip is advanced before the I/O is handled by qemu only in the case of the out instruction. From an architectural point of view I think that's OK, since on real HW you can't guarantee that the I/O takes effect before the instruction pointer is advanced. It is done this way because we want out emulation to be really fast, so we skip the x86 emulator.

2. If not, is it possible to split skip_emulated_instructions(), e.g. a rec_emulated_instructions() to remember next_rip, and skip_emulated_instructions() to actually advance rip?

Currently only the emulator can call into userspace to do I/O, so after userspace returns from the I/O exit, control is handed back to the emulator unconditionally. The out instruction skips the emulator, but there is nothing to do after userspace returns, so the regular cpu loop is executed. If we want to advance rip only after userspace has executed the I/O done by out, we need to distinguish who requested the I/O (the emulator or kvm_fast_pio_out()) and call different code depending on who that was. It can be done by having a callback that (if not null) is called on return from userspace.

3. svm has next_rip, but when it is 0, a nop is emulated. Can this be modified to continue without emulating a nop when next_rip is 0?

I don't see where a nop is emulated if next_rip is 0. As far as I can see, in the next_rip==0 case the instruction at rip is decoded to figure out its length and then rip is advanced by the instruction length. Anyway, next_rip is an svm thing only.

--
Gleb.
-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
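The callback idea could look roughly like the sketch below. The per-vcpu fields pio_next_rip and complete_userspace_io, and the hook point, are hypothetical here, not existing kvm.git code; kvm_fast_pio_out() would record the rip to skip to and install the callback, while the emulator-driven path would simply leave the callback NULL:

/* hypothetical sketch, not actual kvm.git code */
static void complete_fast_pio(struct kvm_vcpu *vcpu)
{
	/* advance rip only after userspace has actually performed the out */
	kvm_rip_write(vcpu, vcpu->arch.pio_next_rip);	/* pio_next_rip: hypothetical field */
}

static void kvm_post_userspace_io(struct kvm_vcpu *vcpu)
{
	/* called on re-entry from userspace, before resuming the guest;
	 * complete_userspace_io is a hypothetical per-vcpu callback field */
	if (vcpu->arch.complete_userspace_io) {
		vcpu->arch.complete_userspace_io(vcpu);
		vcpu->arch.complete_userspace_io = NULL;
	}
}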
page allocation failure
Hi, running kernel 2.6.32 (kvm 0.12.3) in host and 2.6.30 in guest (using Gentoo) works fine. Now I've upgraded several guests to 2.6.32 too and have had no problems so far. But with one guest after 2-3 hours the guest hangs and I always get a message like this: [ 1392.030904] rpciod/0: page allocation failure. order:0, mode:0x20 [ 1392.030910] Pid: 407, comm: rpciod/0 Not tainted 2.6.32-gentoo-r5 #1 [ 1392.030912] Call Trace: [ 1392.030915] IRQ [8109cf2f] __alloc_pages_nodemask+0x5ad/0x5fa [ 1392.030986] [81028fdc] ? default_spin_lock_flags+0x9/0xd [ 1392.031008] [810c1aa9] alloc_pages_current+0x96/0x9f [ 1392.031061] [a00f90c4] try_fill_recv+0x8c/0x18e [virtio_net] [ 1392.031065] [a00f9c19] virtnet_poll+0x594/0x60f [virtio_net] [ 1392.031070] [81029050] ? pvclock_clocksource_read+0x42/0x7e [ 1392.031087] [810520b2] ? run_timer_softirq+0x1d8/0x1f0 [ 1392.031118] [8171760a] net_rx_action+0xad/0x1a8 [ 1392.031126] [8104e2a9] __do_softirq+0x9c/0x127 [ 1392.031141] [8100c1cc] call_softirq+0x1c/0x28 [ 1392.031147] [8100dd8d] do_softirq+0x41/0x81 [ 1392.031154] [8104dfb3] irq_exit+0x36/0x75 [ 1392.031161] [81021c5f] smp_apic_timer_interrupt+0x88/0x96 [ 1392.031170] [8100bb93] apic_timer_interrupt+0x13/0x20 [ 1392.031174] EOI [811e60cd] ? nfs_readpage_result_full+0x7a/0xcd [ 1392.031221] [817932b3] ? rpc_exit_task+0x27/0x54 [ 1392.031227] [81793997] ? __rpc_execute+0x86/0x247 [ 1392.031234] [81793bf2] ? rpc_async_schedule+0x0/0x12 [ 1392.031241] [81793c02] ? rpc_async_schedule+0x10/0x12 [ 1392.031248] [81058a9e] ? worker_thread+0x173/0x214 [ 1392.031259] [8105c404] ? autoremove_wake_function+0x0/0x38 [ 1392.031263] [8105892b] ? worker_thread+0x0/0x214 [ 1392.031266] [8105c139] ? kthread+0x7d/0x85 [ 1392.031269] [8100c0ca] ? child_rip+0xa/0x20 [ 1392.031273] [8105c0bc] ? kthread+0x0/0x85 [ 1392.031279] [8100c0c0] ? child_rip+0x0/0x20 [ 1392.031283] Mem-Info: [ 1392.031287] Node 0 DMA per-cpu: [ 1392.031293] CPU0: hi:0, btch: 1 usd: 0 [ 1392.031298] CPU1: hi:0, btch: 1 usd: 0 [ 1392.031302] Node 0 DMA32 per-cpu: [ 1392.031308] CPU0: hi: 186, btch: 31 usd: 201 [ 1392.031312] CPU1: hi: 186, btch: 31 usd: 153 [ 1392.031320] active_anon:8093 inactive_anon:8705 isolated_anon:0 [ 1392.031322] active_file:5209 inactive_file:214248 isolated_file:0 [ 1392.031324] unevictable:0 dirty:8 writeback:0 unstable:0 [ 1392.031326] free:1351 slab_reclaimable:2093 slab_unreclaimable:3964 [ 1392.031329] mapped:2406 shmem:190 pagetables:1184 bounce:0 [ 1392.031334] Node 0 DMA free:3996kB min:60kB low:72kB high:88kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:11860kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15372kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:8kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes [ 1392.031355] lowmem_reserve[]: 0 994 994 994 [ 1392.031368] Node 0 DMA32 free:1408kB min:4000kB low:5000kB high:6000kB active_anon:32372kB inactive_anon:34820kB active_file:20836kB inactive_file:845132kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:1018068kB mlocked:0kB dirty:32kB writeback:0kB mapped:9624kB shmem:760kB slab_reclaimable:8372kB slab_unreclaimable:15848kB kernel_stack:1112kB pagetables:4736kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? 
no [ 1392.031391] lowmem_reserve[]: 0 0 0 0 [ 1392.031404] Node 0 DMA: 1*4kB 1*8kB 1*16kB 2*32kB 3*64kB 3*128kB 3*256kB 1*512kB 0*1024kB 1*2048kB 0*4096kB = 3996kB [ 1392.031433] Node 0 DMA32: 0*4kB 0*8kB 1*16kB 0*32kB 0*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 0*2048kB 0*4096kB = 1424kB [ 1392.031463] 219673 total pagecache pages [ 1392.031467] 0 pages in swap cache [ 1392.031471] Swap cache stats: add 0, delete 0, find 0/0 [ 1392.031475] Free swap = 1959920kB [ 1392.031479] Total swap = 1959920kB [ 1392.036726] 262141 pages RAM [ 1392.036728] 7326 pages reserved [ 1392.036730] 21063 pages shared [ 1392.036731] 244405 pages non-shared The guest only runs a Apache webserver and has a NFSv4 mount. Nothing special. The guest has 1 GB of memory and two vcpu's. Here're the startup parameters: /usr/bin/qemu-system-x86_64 --enable-kvm -m 1024 -smp 2 -cpu host -daemonize -k de -vnc 127.0.0.1:2 -monitor telnet:172.18.105.4:4445,server,nowait -localtime -pidfile /var/tmp/kvm-vm10.pid -drive file=/data/kvm/kvmimages/vm10.qcow2,if=virtio,boot=on -net nic,vlan=104,model=virtio,macaddr=00:ff:48:46:01:f2 -net tap,vlan=104,ifname=tap.b.vm10,script=no -net nic,vlan=96,model=virtio,macaddr=00:ff:48:46:01:f4 -net tap,vlan=96,ifname=tap.f.vm10,script=no Just to get sure that my kernel config doesn't have some issues I've
[PATCH 1/2] KVM MMU: remove unused field
kvm_mmu_page.oos_link is not used, so remove it

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/include/asm/kvm_host.h |    2 --
 arch/x86/kvm/mmu.c              |    1 -
 2 files changed, 0 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 26c629a..0c49c88 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -187,8 +187,6 @@ struct kvm_mmu_page {
 	struct list_head link;
 	struct hlist_node hash_link;
 
-	struct list_head oos_link;
-
 	/*
 	 * The following two entries are used to key the shadow page in the
 	 * hash table.
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index d7700bb..8dfe8eb 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -922,7 +922,6 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu,
 	sp->gfns = mmu_memory_cache_alloc(&vcpu->arch.mmu_page_cache, PAGE_SIZE);
 	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
 	list_add(&sp->link, &vcpu->kvm->arch.active_mmu_pages);
-	INIT_LIST_HEAD(&sp->oos_link);
 	bitmap_zero(sp->slot_bitmap, KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS);
 	sp->multimapped = 0;
 	sp->parent_pte = parent_pte;
-- 
1.6.1.2
-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/2] KVM MMU: remove unnecessary judgement
After the is_rsvd_bits_set() checks, EFER.NXE must be enabled if the NX bit is set

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/paging_tmpl.h |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 067797a..d9dea28 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -170,7 +170,7 @@ walk:
 			goto access_error;
 
 #if PTTYPE == 64
-		if (fetch_fault && is_nx(vcpu) && (pte & PT64_NX_MASK))
+		if (fetch_fault && (pte & PT64_NX_MASK))
 			goto access_error;
 #endif
 
-- 
1.6.1.2
-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: page allocation failure
Am Tue, 06 Apr 2010 14:15:10 +0200 schrieb kvm: Hi, running kernel 2.6.32 (kvm 0.12.3) in host and 2.6.30 in guest (using Gentoo) works fine. Now I've upgraded several guests to 2.6.32 too and have had no problems so far. But with one guest after 2-3 hours the guest hangs and I always get a message like this: IMHO there are NFS related problems with =2.6.32.x . try googling. - Thomas -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: page allocation failure
Thanks! I'll try a new kernel. Interestingly two guests with 2.6.32-r3 (Gentoo naming not rc3) with much more NFS traffic don't show this behavior of 2.6.32-r5. So I'll try 2.6.32-r8. I've found some threads with NFS and kernel 2.6.32.x related problems which seems to be fixed in later versions. - Robert On Tue, 6 Apr 2010 10:59:45 + (UTC), Thomas Mueller tho...@chaschperli.ch wrote: Am Tue, 06 Apr 2010 14:15:10 +0200 schrieb kvm: Hi, running kernel 2.6.32 (kvm 0.12.3) in host and 2.6.30 in guest (using Gentoo) works fine. Now I've upgraded several guests to 2.6.32 too and have had no problems so far. But with one guest after 2-3 hours the guest hangs and I always get a message like this: IMHO there are NFS related problems with =2.6.32.x . try googling. - Thomas -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[GSoC 2010] Shared memory transport between guest(s) and host
Hi, I am interested in the "Shared memory transport between guest(s) and host" project for GSoC 2010. The description of the project is pretty straightforward, but I am a little bit lost on some parts:

1- Is there any documentation available on KVM shared memory transport? This would definitely help in understanding how inter-VM shared memory should work.

2- Does the project only aim at providing a shared memory transport between a single host and a number of guests, with the host acting as a central node containing shared memory objects and communication taking place only between guests and host, or is there any kind of guest-to-guest communication to be supported? If yes, how should it be done?

Regards,
Mohammed
-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] KVM call agenda for Apr 6
Chris Wright wrote: Please send in any agenda items you are interested in covering. Management stack discussion (again :)) Alex -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
KVM call minutes for Apr 6
Management stack again
- qemud?
- external mgmt stack, qemu/kvm devs less inclined to care
- "Oh, you're using virsh, try #virt on OFTC"
- standard libvirt issues
  - concern about speed of adopting kvm features
  - complicated, XML hard to understand
  - being slowed by being hv agnostic
  - ...
  - but... clearly have done a lot of work and widely used/deployed/etc...
- libvirt is still useful, so need to play well w/ libvirt
- qemud
  - qemu registers
  - have enumeration
  - have access to all qemu features (QMP based)
  - runs privileged, can deal w/ bridge, tap setup
  - really good UI magically appears /sarcasm
  - regressions (we'd lose these w/ qemud):
    - sVirt
    - networking
    - storage pools, etc
    - device assignment
    - hotplug
    - large pages
    - cgroups
    - stable mgmt ABI
- what's needed global/privileged?
  - guest enumeration
  - network setup
  - device hotplug
- need good single VM UI
  - but... as soon as you want 2 VMs, or 2 hosts...
- no need to reinvent the wheel
- qemu project pushes features up towards mgmt stack, but doesn't make those features exist in mgmt stack
- automated interface creation (remove barrier to adding libvirt features)
- QtEmu as example of nice interface that suffered because programming to the qemu cli is too hard
- libvirt has made a lot of effort, nobody is discounting that, what's un
- strong agreement is that libvirt is needed long term
- we should focus on making qemu easy to manage, not on writing mgmt tools
- qmp + libvirt
- define requirements for layering
  - needs for global scope and privilege requirements
-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] vhost: Make it more scalable by creating a vhost thread per device.
On 04/05/2010 08:35 PM, Sridhar Samudrala wrote:
On Sun, 2010-04-04 at 14:14 +0300, Michael S. Tsirkin wrote:
On Fri, Apr 02, 2010 at 10:31:20AM -0700, Sridhar Samudrala wrote:

Make vhost scalable by creating a separate vhost thread per vhost device. This provides better scaling across multiple guests and with multiple interfaces in a guest.

Thanks for looking into this. An alternative approach is to simply replace create_singlethread_workqueue with create_workqueue, which would get us a thread per host CPU. It seems that in theory this should be the optimal approach wrt CPU locality; however, in practice a single thread seems to get better numbers. I have a TODO to investigate this. Could you try looking into this?

Yes. I tried using create_workqueue(), but the results were not good, at least when the number of guest interfaces is less than the number of CPUs. I didn't try more than 8 guests. Creating a separate thread per guest interface seems to be more scalable based on the testing I have done so far.

A thread per guest is also easier to account. I'm worried about guests impacting other guests' performance outside scheduler control by extensive use of vhost.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
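For readers following along, the alternative mentioned above is essentially a one-line change at workqueue creation time; a sketch, assuming the queue is created in vhost's module init (the exact location and name string may differ from the real code):

	/* current: one worker thread shared by every vhost device */
	vhost_workqueue = create_singlethread_workqueue("vhost");

	/* alternative: one worker thread per host CPU, better locality in theory */
	vhost_workqueue = create_workqueue("vhost");

Sridhar's patch instead drops the shared workqueue and spawns a kernel thread per vhost device, which is what the numbers discussed above refer to.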
Re: [PATCH 2/2] KVM: Trace emulated instructions
On Tue, Apr 06, 2010 at 12:38:00AM +0300, Avi Kivity wrote: On 04/05/2010 09:44 PM, Marcelo Tosatti wrote: On Thu, Mar 25, 2010 at 05:02:56PM +0200, Avi Kivity wrote: Log emulated instructions in ftrace, especially if they failed. Why not log all emulated instructions? Seems useful to me. That was the intent, but it didn't pan out. I tried to avoid double-logging where an mmio read is dispatched to userspace, and the instruction is re-executed. Perhaps we should split it into a decode trace and execution result trace? Less easy to use but avoids confusion due to duplication. I don't think duplication introduces confusion, as long as one can see rip on the trace entry (the duplication can actually be useful). -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHv6 0/4] qemu-kvm: vhost net port
On Sun, Apr 04, 2010 at 07:30:20PM +0300, Avi Kivity wrote: On 04/04/2010 02:46 PM, Michael S. Tsirkin wrote: On Wed, Mar 24, 2010 at 02:38:57PM +0200, Avi Kivity wrote: On 03/17/2010 03:04 PM, Michael S. Tsirkin wrote: This is port of vhost v6 patch set I posted previously to qemu-kvm, for those that want to get good performance out of it :) This patchset needs to be applied when qemu.git one gets merged, this includes irqchip support. Ping me when this happens please. Ping Bounce. Applied, thanks. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3] Add Mergeable receive buffer support to vhost_net
This patch adds support for the Mergeable Receive Buffers feature to vhost_net. +-DLS Changes from previous revision: 1) renamed: vhost_discard_vq_desc - vhost_discard_desc vhost_get_heads - vhost_get_desc_n vhost_get_vq_desc - vhost_get_desc 2) added heads as argument to ghost_get_desc_n 3) changed vq-heads from iovec to vring_used_elem, removed casts 4) changed vhost_add_used to do multiple elements in a single copy_to_user, or two when we wrap the ring. 5) removed rxmaxheadcount and available buffer checks in favor of running until an allocation failure, but making sure we break the loop if we get two in a row, indicating we have at least 1 buffer, but not enough for the current receive packet 6) restore non-vnet header handling Signed-Off-By: David L Stevens dlstev...@us.ibm.com diff -ruNp net-next-p0/drivers/vhost/net.c net-next-v3/drivers/vhost/net.c --- net-next-p0/drivers/vhost/net.c 2010-03-22 12:04:38.0 -0700 +++ net-next-v3/drivers/vhost/net.c 2010-04-06 12:54:56.0 -0700 @@ -130,9 +130,8 @@ static void handle_tx(struct vhost_net * hdr_size = vq-hdr_size; for (;;) { - head = vhost_get_vq_desc(net-dev, vq, vq-iov, -ARRAY_SIZE(vq-iov), -out, in, + head = vhost_get_desc(net-dev, vq, vq-iov, +ARRAY_SIZE(vq-iov), out, in, NULL, NULL); /* Nothing new? Wait for eventfd to tell us they refilled. */ if (head == vq-num) { @@ -167,8 +166,15 @@ static void handle_tx(struct vhost_net * /* TODO: Check specific error and bomb out unless ENOBUFS? */ err = sock-ops-sendmsg(NULL, sock, msg, len); if (unlikely(err 0)) { - vhost_discard_vq_desc(vq); - tx_poll_start(net, sock); + if (err == -EAGAIN) { + vhost_discard_desc(vq, 1); + tx_poll_start(net, sock); + } else { + vq_err(vq, sendmsg: errno %d\n, -err); + /* drop packet; do not discard/resend */ + vhost_add_used_and_signal(net-dev, vq, head, + 0); + } break; } if (err != len) @@ -186,12 +192,25 @@ static void handle_tx(struct vhost_net * unuse_mm(net-dev.mm); } +static int vhost_head_len(struct sock *sk) +{ + struct sk_buff *head; + int len = 0; + + lock_sock(sk); + head = skb_peek(sk-sk_receive_queue); + if (head) + len = head-len; + release_sock(sk); + return len; +} + /* Expects to be always run from workqueue - which acts as * read-size critical section for our kind of RCU. */ static void handle_rx(struct vhost_net *net) { struct vhost_virtqueue *vq = net-dev.vqs[VHOST_NET_VQ_RX]; - unsigned head, out, in, log, s; + unsigned in, log, s; struct vhost_log *vq_log; struct msghdr msg = { .msg_name = NULL, @@ -202,13 +221,14 @@ static void handle_rx(struct vhost_net * .msg_flags = MSG_DONTWAIT, }; - struct virtio_net_hdr hdr = { - .flags = 0, - .gso_type = VIRTIO_NET_HDR_GSO_NONE + struct virtio_net_hdr_mrg_rxbuf hdr = { + .hdr.flags = 0, + .hdr.gso_type = VIRTIO_NET_HDR_GSO_NONE }; + int retries = 0; size_t len, total_len = 0; - int err; + int err, headcount, datalen; size_t hdr_size; struct socket *sock = rcu_dereference(vq-private_data); if (!sock || skb_queue_empty(sock-sk-sk_receive_queue)) @@ -222,31 +242,25 @@ static void handle_rx(struct vhost_net * vq_log = unlikely(vhost_has_feature(net-dev, VHOST_F_LOG_ALL)) ? vq-log : NULL; - for (;;) { - head = vhost_get_vq_desc(net-dev, vq, vq-iov, -ARRAY_SIZE(vq-iov), -out, in, -vq_log, log); + while ((datalen = vhost_head_len(sock-sk))) { + headcount = vhost_get_desc_n(vq, vq-heads, datalen, in, +vq_log, log); /* OK, now we need to know about added descriptors. 
*/ - if (head == vq-num) { - if (unlikely(vhost_enable_notify(vq))) { + if (!headcount) { + if (retries == 0 unlikely(vhost_enable_notify(vq))) { /* They have
Re: Setting nx bit in virtual CPU
On 05/04/10 09:27, Avi Kivity wrote: On 04/03/2010 12:07 AM, Richard Simpson wrote: Nope, both Kernels are 64 bit. uname -a Host: Linux gordon 2.6.27-gentoo-r8 #5 Sat Mar 14 18:01:59 GMT 2009 x86_64 AMD Athlon(tm) 64 Processor 3200+ AuthenticAMD GNU/Linux uname -a Guest: Linux andrew 2.6.28-hardened-r9 #4 Mon Jan 18 22:39:31 GMT 2010 x86_64 AMD Athlon(tm) 64 Processor 3200+ AuthenticAMD GNU/Linux As you can see, both kernels are a little old, and I have been wondering if that might be part of the problem. The Guest one is old because that is the latest stable hardened version in Gentoo. The host one is old because of: 2.6.27 should be plenty fine for nx. Really the important bit is that the host kernel has nx enabled. Can you check if that is so? Umm, could you give me a clue about how to do that. It is some time since I configured the host kernel, but I do have a /proc/config.gz. Could I check by looking in that? Thanks -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] KVM test: Fix some typos on autotest run utility function
Fix some typos found on the utility function that runs autotest tests on a guest. Signed-off-by: Lucas Meneghel Rodrigues l...@redhat.com --- client/tests/kvm/kvm_test_utils.py |8 1 files changed, 4 insertions(+), 4 deletions(-) diff --git a/client/tests/kvm/kvm_test_utils.py b/client/tests/kvm/kvm_test_utils.py index 6ba834a..f512044 100644 --- a/client/tests/kvm/kvm_test_utils.py +++ b/client/tests/kvm/kvm_test_utils.py @@ -298,12 +298,12 @@ def run_autotest(vm, session, control_path, timeout, test_name, outputdir): @param dest_dir: Destination dir for the contents basename = os.path.basename(remote_path) -logging.info(Extracting %s... % basename) +logging.info(Extracting %s..., basename) (status, output) = session.get_command_status_output( tar xjvf %s -C %s % (remote_path, dest_dir)) if status != 0: logging.error(Uncompress output:\n%s % output) -raise error.TestFail(Could not extract % on guest) +raise error.TestFail(Could not extract %s on guest % basename) if not os.path.isfile(control_path): raise error.TestError(Invalid path to autotest control file: %s % @@ -356,7 +356,7 @@ def run_autotest(vm, session, control_path, timeout, test_name, outputdir): raise error.TestFail(Could not copy the test control file to guest) # Run the test -logging.info(Running test '%s'... % test_name) +logging.info(Running test '%s'..., test_name) session.get_command_output(cd %s % autotest_path) session.get_command_output(rm -f control.state) session.get_command_output(rm -rf results/*) @@ -364,7 +364,7 @@ def run_autotest(vm, session, control_path, timeout, test_name, outputdir): status = session.get_command_status(bin/autotest control, timeout=timeout, print_func=logging.info) -logging.info(--End of test output ) +logging.info(- End of test output ) if status is None: raise error.TestFail(Timeout elapsed while waiting for autotest to complete) -- 1.6.6.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC] vhost-blk implementation (v2)
Hi All, Here is the latest version of vhost-blk implementation. Major difference from my previous implementation is that, I now merge all contiguous requests (both read and write), before submitting them. This significantly improved IO performance. I am still collecting performance numbers, I will be posting in next few days. Comments ? Todo: - Address hch's comments on annontations - Implement per device read/write queues - Finish up error handling Thanks, Badari --- drivers/vhost/blk.c | 445 1 file changed, 445 insertions(+) Index: net-next/drivers/vhost/blk.c === --- /dev/null 1970-01-01 00:00:00.0 + +++ net-next/drivers/vhost/blk.c2010-04-06 16:38:03.563847905 -0400 @@ -0,0 +1,445 @@ + /* + * virtio-block server in host kernel. + * Inspired by vhost-net and shamlessly ripped code from it :) + */ + +#include linux/compat.h +#include linux/eventfd.h +#include linux/vhost.h +#include linux/virtio_net.h +#include linux/virtio_blk.h +#include linux/mmu_context.h +#include linux/miscdevice.h +#include linux/module.h +#include linux/mutex.h +#include linux/workqueue.h +#include linux/rcupdate.h +#include linux/file.h + +#include vhost.h + +#define VHOST_BLK_VQ_MAX 1 +#define SECTOR_SHIFT 9 + +struct vhost_blk { + struct vhost_dev dev; + struct vhost_virtqueue vqs[VHOST_BLK_VQ_MAX]; + struct vhost_poll poll[VHOST_BLK_VQ_MAX]; +}; + +struct vhost_blk_io { + struct list_head list; + struct work_struct work; + struct vhost_blk *blk; + struct file *file; + int head; + uint32_t type; + uint32_t nvecs; + uint64_t sector; + uint64_t len; + struct iovec iov[0]; +}; + +static struct workqueue_struct *vblk_workqueue; +static LIST_HEAD(write_queue); +static LIST_HEAD(read_queue); + +static void handle_io_work(struct work_struct *work) +{ + struct vhost_blk_io *vbio, *entry; + struct vhost_virtqueue *vq; + struct vhost_blk *blk; + struct list_head single, *head, *node, *tmp; + + int i, need_free, ret = 0; + loff_t pos; + uint8_t status = 0; + + vbio = container_of(work, struct vhost_blk_io, work); + blk = vbio-blk; + vq = blk-dev.vqs[0]; + pos = vbio-sector 8; + + use_mm(blk-dev.mm); + if (vbio-type VIRTIO_BLK_T_FLUSH) { + ret = vfs_fsync(vbio-file, vbio-file-f_path.dentry, 1); + } else if (vbio-type VIRTIO_BLK_T_OUT) { + ret = vfs_writev(vbio-file, vbio-iov, vbio-nvecs, pos); + } else { + ret = vfs_readv(vbio-file, vbio-iov, vbio-nvecs, pos); + } + status = (ret 0) ? 
VIRTIO_BLK_S_IOERR : VIRTIO_BLK_S_OK; + if (vbio-head != -1) { + INIT_LIST_HEAD(single); + list_add(vbio-list, single); + head = single; + need_free = 0; + } else { + head = vbio-list; + need_free = 1; + } + list_for_each_entry(entry, head, list) { + copy_to_user(entry-iov[entry-nvecs].iov_base, status, sizeof status); + } + mutex_lock(vq-mutex); + list_for_each_safe(node, tmp, head) { + entry = list_entry(node, struct vhost_blk_io, list); + vhost_add_used_and_signal(blk-dev, vq, entry-head, ret); + list_del(node); + kfree(entry); + } + mutex_unlock(vq-mutex); + unuse_mm(blk-dev.mm); + if (need_free) + kfree(vbio); +} + +static struct vhost_blk_io *allocate_vbio(int nvecs) +{ + struct vhost_blk_io *vbio; + int size = sizeof(struct vhost_blk_io) + nvecs * sizeof(struct iovec); + vbio = kmalloc(size, GFP_KERNEL); + if (vbio) { + INIT_WORK(vbio-work, handle_io_work); + INIT_LIST_HEAD(vbio-list); + } + return vbio; +} + +static void merge_and_handoff_work(struct list_head *queue) +{ + struct vhost_blk_io *vbio, *entry; + int nvecs = 0; + int entries = 0; + + list_for_each_entry(entry, queue, list) { + nvecs += entry-nvecs; + entries++; + } + + if (entries == 1) { + vbio = list_first_entry(queue, struct vhost_blk_io, list); + list_del(vbio-list); + queue_work(vblk_workqueue, vbio-work); + return; + } + + vbio = allocate_vbio(nvecs); + if (!vbio) { + /* Unable to allocate memory - submit IOs individually */ + list_for_each_entry(vbio, queue, list) { + queue_work(vblk_workqueue, vbio-work); + } + INIT_LIST_HEAD(queue); + return; + } + + entry = list_first_entry(queue, struct vhost_blk_io, list); + vbio-nvecs = nvecs; + vbio-blk = entry-blk; +
Re: virsh dump blocking problem
On Tue, 06 Apr 2010 09:35:09 +0800 Gui Jianfeng guijianf...@cn.fujitsu.com wrote: Hi all, I'm not sure whether it's appropriate to post the problem here. I played with virsh under Fedora 12, and started a KVM fedora12 guest by virsh start command. The fedora12 guest is successfully started. Than I run the following command to dump the guest core: #virsh dump 1 mycoredump (domain id is 1) This command seemed blocking and not return. According to he strace output, virsh dump seems that it's blocking at poll() call. I think the following should be the call trace of virsh. cmdDump() - virDomainCoreDump() - remoteDomainCoreDump() - call() - remoteIO() - remoteIOEventLoop() - poll(fds, ARRAY_CARDINALITY(fds), -1) Any one encounters this problem also, any thoughts? I met and it seems qemu-kvm continues to counting the number of dirty pages and does no answer to libvirt. Guest never work and I have to kill it. I met this with 2.6.32+ qemu-0.12.3+ libvirt 0.7.7.1. When I updated the host kernel to 2.6.33, qemu-kvm never work. So, I moved back to fedora12's latest qemu-kvm. Now, 2.6.34-rc3+ qemu-0.11.0-13.fc12.x86_64 + libvirt 0.7.7.1 # virsh dump hangs. In most case, I see following 2 back trace.(with gdb) (gdb) bt #0 ram_save_remaining () at /usr/src/debug/qemu-kvm-0.11.0/vl.c:3104 #1 ram_bytes_remaining () at /usr/src/debug/qemu-kvm-0.11.0/vl.c:3112 #2 0x004ab2cf in do_info_migrate (mon=0x16b7970) at migration.c:150 #3 0x00414b1a in monitor_handle_command (mon=value optimized out, cmdline=value optimized out) at /usr/src/debug/qemu-kvm-0.11.0/monitor.c:2870 #4 0x00414c6a in monitor_command_cb (mon=0x16b7970, cmdline=value optimized out, opaque=value optimized out) at /usr/src/debug/qemu-kvm-0.11.0/monitor.c:3160 #5 0x0048b71b in readline_handle_byte (rs=0x208d6a0, ch=value optimized out) at readline.c:369 #6 0x00414cdc in monitor_read (opaque=value optimized out, buf=0x7fff1b1104b0 info migrate\r, size=13) at /usr/src/debug/qemu-kvm-0.11.0/monitor.c:3146 #7 0x004b2a53 in tcp_chr_read (opaque=0x1614c30) at qemu-char.c:2006 #8 0x0040a6c7 in main_loop_wait (timeout=value optimized out) at /usr/src/debug/qemu-kvm-0.11.0/vl.c:4188 #9 0x0040eed5 in main_loop (argc=value optimized out, argv=value optimized out, envp=value optimized out) at /usr/src/debug/qemu-kvm-0.11.0/vl.c:4414 #10 main (argc=value optimized out, argv=value optimized out, envp=value optimized out) at /usr/src/debug/qemu-kvm-0.11.0/vl.c:6263 (gdb) bt #0 0x003c2680e0bd in write () at ../sysdeps/unix/syscall-template.S:82 #1 0x004b304a in unix_write (fd=11, buf=value optimized out, len1=40) at qemu-char.c:512 #2 send_all (fd=11, buf=value optimized out, len1=40) at qemu-char.c:528 #3 0x00411201 in monitor_flush (mon=0x16b7970) at /usr/src/debug/qemu-kvm-0.11.0/monitor.c:131 #4 0x00414cdc in monitor_read (opaque=value optimized out, buf=0x7fff1b1104b0 info migrate\r, size=13) at /usr/src/debug/qemu-kvm-0.11.0/monitor.c:3146 #5 0x004b2a53 in tcp_chr_read (opaque=0x1614c30) at qemu-char.c:2006 #6 0x0040a6c7 in main_loop_wait (timeout=value optimized out) at /usr/src/debug/qemu-kvm-0.11.0/vl.c:4188 #7 0x0040eed5 in main_loop (argc=value optimized out, argv=value optimized out, envp=value optimized out) at /usr/src/debug/qemu-kvm-0.11.0/vl.c:4414 #8 main (argc=value optimized out, argv=value optimized out, envp=value optimized out) at /usr/src/debug/qemu-kvm-0.11.0/vl.c:6263 And see no dump progress. I'm sorry if this is not a hang but just very slow. I don't see any progress at lease for 15 minutes and qemu-kvm continues to use 75% of cpus. 
I'm not sure why the dump command triggers the migration code... How long should virsh dump xxx take for an idle VM with 2G of memory? I'm sorry if I'm asking on the wrong mailing list.

Thanks,
-Kame
-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH v1 2/3] Provides multiple submits and asynchronous notifications.
Michael, For the write logging, do you have a function in hand that we can recompute the log? If that, I think I can use it to recompute the log info when the logging is suddenly enabled. For the outstanding requests, do you mean all the user buffers have submitted before the logging ioctl changed? That may be a lot, and some of them are still in NIC ring descriptors. Waiting them to be finished may be need some time. I think when logging ioctl changed, then the logging is changed just after that is also reasonable. The key point is that after loggin ioctl returns, any subsequent change to memory must be logged. It does not matter when was the request submitted, otherwise we will get memory corruption on migration. The change to memory happens when vhost_add_used_and_signal(), right? So after ioctl returns, just recompute the log info to the events in the async queue, is ok. Since the ioctl and write log operations are all protected by vq-mutex. Thanks Xiaohui Yes, I think this will work. Thanks, so do you have the function to recompute the log info in your hand that I can use? I have weakly remembered that you have noticed it before some time. Doesn't just rerunning vhost_get_vq_desc work? Am I missing something here? The vhost_get_vq_desc() looks in vq, and finds the first available buffers, and converts it to an iovec. I think the first available buffer is not the buffers in the async queue, so I think rerunning vhost_get_vq_desc() cannot work. Thanks Xiaohui Thanks Xiaohui drivers/vhost/net.c | 189 +++-- drivers/vhost/vhost.h | 10 +++ 2 files changed, 192 insertions(+), 7 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index 22d5fef..2aafd90 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -17,11 +17,13 @@ #include linux/workqueue.h #include linux/rcupdate.h #include linux/file.h +#include linux/aio.h #include linux/net.h #include linux/if_packet.h #include linux/if_arp.h #include linux/if_tun.h +#include linux/mpassthru.h #include net/sock.h @@ -47,6 +49,7 @@ struct vhost_net { struct vhost_dev dev; struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX]; struct vhost_poll poll[VHOST_NET_VQ_MAX]; + struct kmem_cache *cache; /* Tells us whether we are polling a socket for TX. * We only do this when socket buffer fills up. * Protected by tx vq lock. */ @@ -91,11 +94,88 @@ static void tx_poll_start(struct vhost_net *net, struct socket *sock) net-tx_poll_state = VHOST_NET_POLL_STARTED; } +struct kiocb *notify_dequeue(struct vhost_virtqueue *vq) +{ + struct kiocb *iocb = NULL; + unsigned long flags; + + spin_lock_irqsave(vq-notify_lock, flags); + if (!list_empty(vq-notifier)) { + iocb = list_first_entry(vq-notifier, + struct kiocb, ki_list); + list_del(iocb-ki_list); + } + spin_unlock_irqrestore(vq-notify_lock, flags); + return iocb; +} + +static void handle_async_rx_events_notify(struct vhost_net *net, + struct vhost_virtqueue *vq) +{ + struct kiocb *iocb = NULL; + struct vhost_log *vq_log = NULL; + int rx_total_len = 0; + int log, size; + + if (vq-link_state != VHOST_VQ_LINK_ASYNC) + return; + + if (vq-receiver) + vq-receiver(vq); + + vq_log = unlikely(vhost_has_feature( + net-dev, VHOST_F_LOG_ALL)) ? 
vq-log : NULL; + while ((iocb = notify_dequeue(vq)) != NULL) { + vhost_add_used_and_signal(net-dev, vq, + iocb-ki_pos, iocb-ki_nbytes); + log = (int)iocb-ki_user_data; + size = iocb-ki_nbytes; + rx_total_len += iocb-ki_nbytes; + + if (iocb-ki_dtor) + iocb-ki_dtor(iocb); + kmem_cache_free(net-cache, iocb); + + if (unlikely(vq_log)) + vhost_log_write(vq, vq_log, log, size); + if (unlikely(rx_total_len = VHOST_NET_WEIGHT)) { + vhost_poll_queue(vq-poll); + break; + } + } +} + +static void handle_async_tx_events_notify(struct vhost_net *net, + struct vhost_virtqueue *vq) +{ + struct kiocb *iocb = NULL; + int tx_total_len = 0; + + if (vq-link_state != VHOST_VQ_LINK_ASYNC) + return; + + while ((iocb = notify_dequeue(vq)) != NULL) { + vhost_add_used_and_signal(net-dev, vq, + iocb-ki_pos, 0); + tx_total_len += iocb-ki_nbytes; + + if (iocb-ki_dtor) + iocb-ki_dtor(iocb); + + kmem_cache_free(net-cache, iocb); + if (unlikely(tx_total_len =
[PATCH] [RFC] KVM test: Introduce sample performance test set
As part of the performance testing effort for KVM, introduce a base performance testset for the sample KVM control file. It will execute several benchmarks on a Fedora 12 guest, bringing back the results to the host. This base testset can be tweaked for folks interested on getting figures from a particular KVM release. Signed-off-by: Lucas Meneghel Rodrigues l...@redhat.com Signed-off-by: Jes Sorensen jes.soren...@redhat.com --- .../tests/kvm/autotest_control/hackbench.control |6 +- client/tests/kvm/autotest_control/iozone.control | 18 +++ .../tests/kvm/autotest_control/kernbench.control |2 +- client/tests/kvm/autotest_control/lmbench.control | 33 + .../tests/kvm/autotest_control/performance.control | 123 client/tests/kvm/autotest_control/reaim.control| 11 ++ client/tests/kvm/autotest_control/tiobench.control | 12 ++ .../tests/kvm/autotest_control/unixbench.control | 11 ++ client/tests/kvm/tests.cfg.sample | 13 ++ client/tests/kvm/tests_base.cfg.sample | 21 10 files changed, 247 insertions(+), 3 deletions(-) create mode 100644 client/tests/kvm/autotest_control/iozone.control create mode 100644 client/tests/kvm/autotest_control/lmbench.control create mode 100644 client/tests/kvm/autotest_control/performance.control create mode 100644 client/tests/kvm/autotest_control/reaim.control create mode 100644 client/tests/kvm/autotest_control/tiobench.control create mode 100644 client/tests/kvm/autotest_control/unixbench.control diff --git a/client/tests/kvm/autotest_control/hackbench.control b/client/tests/kvm/autotest_control/hackbench.control index 5b94865..3248b26 100644 --- a/client/tests/kvm/autotest_control/hackbench.control +++ b/client/tests/kvm/autotest_control/hackbench.control @@ -1,4 +1,4 @@ -AUTHOR = Sudhir Kumar sku...@linux.vnet.ibm.com +AUTHOR = nc...@google.com (Nikhil Rao) NAME = Hackbench TIME = SHORT TEST_CLASS = Kernel @@ -6,8 +6,10 @@ TEST_CATEGORY = Benchmark TEST_TYPE = client DOC = -Hackbench is a benchmark which measures the performance, overhead and +Hackbench is a benchmark for measuring the performance, overhead and scalability of the Linux scheduler. +hackbench.c copied from: +http://people.redhat.com/~mingo/cfs-scheduler/tools/hackbench.c job.run_test('hackbench') diff --git a/client/tests/kvm/autotest_control/iozone.control b/client/tests/kvm/autotest_control/iozone.control new file mode 100644 index 000..17d9be2 --- /dev/null +++ b/client/tests/kvm/autotest_control/iozone.control @@ -0,0 +1,18 @@ +AUTHOR = Ying Tao ying...@cn.ibm.com +TIME = MEDIUM +NAME = IOzone +TEST_TYPE = client +TEST_CLASS = Kernel +TEST_CATEGORY = Benchmark + +DOC = +Iozone is useful for performing a broad filesystem analysis of a vendors +computer platform. 
The benchmark tests file I/O performance for the following +operations: + Read, write, re-read, re-write, read backwards, read strided, fread, + fwrite, random read, pread ,mmap, aio_read, aio_write + +For more information see http://www.iozone.org + + +job.run_test('iozone') diff --git a/client/tests/kvm/autotest_control/kernbench.control b/client/tests/kvm/autotest_control/kernbench.control index 76a546e..9fc5da7 100644 --- a/client/tests/kvm/autotest_control/kernbench.control +++ b/client/tests/kvm/autotest_control/kernbench.control @@ -1,4 +1,4 @@ -AUTHOR = Sudhir Kumar sku...@linux.vnet.ibm.com +AUTHOR = mbl...@google.com (Martin Bligh) NAME = Kernbench TIME = SHORT TEST_CLASS = Kernel diff --git a/client/tests/kvm/autotest_control/lmbench.control b/client/tests/kvm/autotest_control/lmbench.control new file mode 100644 index 000..95b47fb --- /dev/null +++ b/client/tests/kvm/autotest_control/lmbench.control @@ -0,0 +1,33 @@ +NAME = lmbench +AUTHOR = Martin Bligh mbl...@google.com +TIME = MEDIUM +TEST_CATEGORY = BENCHMARK +TEST_CLASS = KERNEL +TEST_TYPE = CLIENT +DOC = +README for lmbench 2alpha8 net release. + +To run the benchmark, you should be able to say: + +cd src +make results + +If you want to see how you did compared to the other system results +included here, say + +make see + +Be warned that many of these benchmarks are sensitive to other things +being run on the system, mainly from CPU cache and CPU cycle effects. +So make sure your screen saver is not running, etc. + +It's a good idea to do several runs and compare the output like so + +make results +make rerun +make rerun +make rerun +cd Results make LIST=your OS/* + + +job.run_test('lmbench') diff --git a/client/tests/kvm/autotest_control/performance.control b/client/tests/kvm/autotest_control/performance.control new file mode 100644 index 000..5bc0b28 --- /dev/null +++ b/client/tests/kvm/autotest_control/performance.control @@ -0,0 +1,123 @@ +def step_init(): +job.next_step('step0') +job.next_step('step1') +job.next_step('step2') +
buildbot failure in qemu-kvm on disable_kvm_x86_64_debian_5_0
The Buildbot has detected a new failure of disable_kvm_x86_64_debian_5_0 on qemu-kvm. Full details are available at: http://buildbot.b1-systems.de/qemu-kvm/builders/disable_kvm_x86_64_debian_5_0/builds/336 Buildbot URL: http://buildbot.b1-systems.de/qemu-kvm/ Buildslave for this Build: b1_qemu_kvm_1 Build Reason: The Nightly scheduler named 'nightly_disable_kvm' triggered this build Build Source Stamp: [branch master] HEAD Blamelist: BUILD FAILED: failed compile sincerely, -The Buildbot -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
buildbot failure in qemu-kvm on disable_kvm_i386_debian_5_0
The Buildbot has detected a new failure of disable_kvm_i386_debian_5_0 on qemu-kvm. Full details are available at: http://buildbot.b1-systems.de/qemu-kvm/builders/disable_kvm_i386_debian_5_0/builds/337 Buildbot URL: http://buildbot.b1-systems.de/qemu-kvm/ Buildslave for this Build: b1_qemu_kvm_2 Build Reason: The Nightly scheduler named 'nightly_disable_kvm' triggered this build Build Source Stamp: [branch master] HEAD Blamelist: BUILD FAILED: failed compile sincerely, -The Buildbot -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
buildbot failure in qemu-kvm on disable_kvm_i386_out_of_tree
The Buildbot has detected a new failure of disable_kvm_i386_out_of_tree on qemu-kvm. Full details are available at: http://buildbot.b1-systems.de/qemu-kvm/builders/disable_kvm_i386_out_of_tree/builds/285 Buildbot URL: http://buildbot.b1-systems.de/qemu-kvm/ Buildslave for this Build: b1_qemu_kvm_2 Build Reason: The Nightly scheduler named 'nightly_disable_kvm' triggered this build Build Source Stamp: [branch master] HEAD Blamelist: BUILD FAILED: failed compile sincerely, -The Buildbot -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
buildbot failure in qemu-kvm on disable_kvm_x86_64_out_of_tree
The Buildbot has detected a new failure of disable_kvm_x86_64_out_of_tree on qemu-kvm. Full details are available at: http://buildbot.b1-systems.de/qemu-kvm/builders/disable_kvm_x86_64_out_of_tree/builds/285 Buildbot URL: http://buildbot.b1-systems.de/qemu-kvm/ Buildslave for this Build: b1_qemu_kvm_1 Build Reason: The Nightly scheduler named 'nightly_disable_kvm' triggered this build Build Source Stamp: [branch master] HEAD Blamelist: BUILD FAILED: failed compile sincerely, -The Buildbot -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 1/3] A device for zero-copy based on KVM virtio-net.
Michael,

Qemu needs a userspace write, is that a synchronous one or an asynchronous one?

It's a synchronous non-blocking write.

Sorry, why does Qemu live migration need the device to have a userspace write? How does the write operation work? And why is a read operation not a concern here?

Thanks
Xiaohui
-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Setting nx bit in virtual CPU
On 04/07/2010 01:31 AM, Richard Simpson wrote:

2.6.27 should be plenty fine for nx. Really the important bit is that the host kernel has nx enabled. Can you check if that is so?

Umm, could you give me a clue about how to do that? It is some time since I configured the host kernel, but I do have a /proc/config.gz. Could I check by looking in that?

The attached script should verify it.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

#!/usr/bin/python

class msr(object):
    def __init__(self):
        try:
            self.f = file('/dev/cpu/0/msr')
        except:
            self.f = file('/dev/msr0')
    def read(self, index, default = None):
        import struct
        self.f.seek(index)
        try:
            return struct.unpack('Q', self.f.read(8))[0]
        except:
            return default

efer = msr().read(0xc0000080, 0)
nx = (efer >> 11) & 1

if nx:
    print 'nx: enabled'
else:
    print 'nx: disabled'
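Note that reading /dev/cpu/0/msr generally requires root privileges and the msr kernel module (modprobe msr) to be loaded on the host; the script reads the EFER MSR and reports whether its NXE bit is set.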
Re: PCI passthrough resource remapping
On 03/31/2010 06:18 PM, Chris Wright wrote: Hrm, I'm not sure these would be related to the small BAR region patch. It looks more like a timing issue. small BAR == slow path == timing issue? Would be interesting to verify using perf with the 'kvm:kvm_mmio' software event, see how many happen per second. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [questions] savevm|loadvm
On 04/01/2010 10:35 PM, Wenhao Xu wrote:

Does current qemu-kvm (qemu v0.12.3) use the irqchip and pit of KVM? I cannot find any KVM_CREATE_IRQCHIP and KVM_CREATE_PIT in the qemu code.

Are you looking at qemu or qemu-kvm?

Concerning the interface between qemu and kvm, I have the following confusions:

1. How do the irqchip and pit of KVM collaborate with the irq and pit emulation of QEMU? As far as I can see, qemu-kvm still uses qemu's irq and pit emulation, doesn't it?

No, they're completely separate.

2. For returns from KVM to QEMU, I cannot work out the meaning of two exit reasons: case KVM_EXIT_EXCEPTION: What exception will cause KVM to exit?

I think that's obsolete.

default: dprintf("kvm_arch_handle_exit\n"); ret = kvm_arch_handle_exit(env, run); Which exit reasons take the default path?

3. How can DMA interrupt the cpu when it finishes, while qemu-kvm is still running in kvm?

Usually the device that does the dma will raise an interrupt, which qemu is waiting for.

I am still working on the patch, but these confusions really prevent me from moving forward. Thanks first for giving me more hints. The following is the code I have written so far. The main idea is to synchronize the CPU state and enter emulation mode when switching from kvm to the emulator. I only do the switch when the exit reason is KVM_EXIT_IRQ_WINDOW_OPEN.

That doesn't happen with qemu-kvm.

However, I got the following errors: whenever I switch from kvm to qemu, the interrupt request in qemu causes qemu to enter smm mode, which is definitely a bug.

Definitely shouldn't happen.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
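For what it's worth, the in-kernel irqchip and PIT are requested by userspace with two VM ioctls; a minimal sketch (not qemu-kvm's actual code, error handling omitted):

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

int main(void)
{
	int kvm = open("/dev/kvm", O_RDWR);
	int vm = ioctl(kvm, KVM_CREATE_VM, 0);

	ioctl(vm, KVM_CREATE_IRQCHIP, 0);	/* in-kernel PIC/IOAPIC/local APIC */
	ioctl(vm, KVM_CREATE_PIT, 0);		/* in-kernel i8254 */
	return 0;
}

When these succeed, interrupt and timer emulation live in the kernel and qemu's own i8259/i8254 models are bypassed. qemu-kvm of that era issued these calls while upstream qemu did not, which is why they don't show up when grepping the qemu tree.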