Re: KVM Page Fault Question
On 04/02/2010 07:41 AM, Marek Olszewski wrote:
> When a guest OS writes to a shadowed (and therefore page protected) guest page table, does the resulting page fault get handled in paging_tmpl.h:xxx_page_fault, or does it call some rmap related code directly?

Page faults are dispatched to the page_fault callback.

> Also, what does the "direct" mmu page role mean?

It means that the page maps the linear range (gfn << 12)..((gfn + (1 << (level * 9))) << 12) instead of shadowing a guest page table at gfn. Useful for real mode, large pages, and tdp.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/1] uio_pci_generic: extensions to allow access for non-privileged processes
On 04/02/2010 12:27 AM, Tom Lyon wrote:
>> kvm really wants the event counter to be an eventfd, which allows hooking it directly to kvm (which can inject an interrupt on an eventfd_signal); can you adapt your patch to do this?
> I looked further into eventfds - they seem the perfect solution for the MSI/MSI-X interrupts. Will include in V2.

They are indeed. Thanks.
[RFC][PATCH v2 0/3] Provide a zero-copy method on KVM virtio-net.
The idea is simple: just pin the guest VM user space and then let the host NIC driver have the chance to directly DMA to it.

The patches are based on the vhost-net backend driver. We add a device which provides proto_ops such as sendmsg/recvmsg to vhost-net to send/recv directly to/from the NIC driver. A KVM guest which uses the vhost-net backend may bind any ethX interface on the host side to get copyless data transfer through the guest virtio-net frontend.

The scenario is like this: the guest virtio-net driver submits multiple requests through the vhost-net backend driver to the kernel, and the requests are queued and then completed after the corresponding actions in h/w are done.

For read, user space buffers are dispensed to the NIC driver for rx when a page constructor API is invoked; that means NICs can allocate user buffers from a page constructor. We add a hook in the netif_receive_skb() function to intercept the incoming packets and notify the zero-copy device.

For write, the zero-copy device may allocate a new host skb, put the payload on skb_shinfo(skb)->frags, and copy the header to skb->data. The request remains pending until the skb is transmitted by h/w.

Here, we have considered 2 ways to utilize the page constructor API to dispense the user buffers.

One: Modify the __alloc_skb() function a bit so it can allocate only a struct sk_buff, with the data pointer pointing to a user buffer which comes from a page constructor API. Then the shinfo of the skb is also from the guest. When a packet is received from hardware, skb->data is filled directly by h/w. What we have done is in this way.
Pros: We can avoid any copy here.
Cons: The guest virtio-net driver needs to allocate the skb in almost the same way as the host NIC drivers, say the size of netdev_alloc_skb() and the same reserved space at the head of the skb. Many NIC drivers are the same as the guest and ok with this, but some of the latest NIC drivers reserve special room in the skb head.
To deal with that, we suggest providing a method in the guest virtio-net driver to ask the NIC driver for the parameters we are interested in once we know which device we have bound for zero-copy, and then asking the guest to allocate accordingly. Is that reasonable?

Two: Modify the driver to get user buffers allocated from a page constructor API (substituting for alloc_page()); the user buffers are used as payload buffers and filled by h/w directly when a packet is received. The driver should associate the pages with the skb (skb_shinfo(skb)->frags). For the head buffer, let the host allocate the skb and h/w fill it. After that, the data filled into the host skb header is copied into the guest header buffer, which is submitted together with the payload buffer.
Pros: We care less about the way the guest or host allocates its buffers.
Cons: We still need a small copy here for the skb header.

We are not sure which way is better here. This is the first thing we want comments on from the community. We wish the modification to the network part to be generic, not used by the vhost-net backend only; a user application may use it as well when the zero-copy device provides async read/write operations later. Please give comments, especially on the network part modifications.

We provide multiple submits and asynchronous notification to vhost-net too.

Our goal is to improve the bandwidth and reduce the CPU usage. Exact performance data will be provided later, but in a simple test with netperf we found that bandwidth went up and CPU % went up too, though the bandwidth increase is much larger than the CPU % increase.
What we have not done yet:
  packet split support
  GRO support
  performance tuning

What we have done in v1:
  polish the RCU usage
  deal with write logging in asynchronous mode in vhost
  add a notifier block for the mp device
  rename page_ctor to mp_port in netdevice.h to make it look generic
  add mp_dev_change_flags() for the mp device to change NIC state
  add CONFIG_VHOST_MPASSTHRU to limit the usage when the module is not loaded
  a small fix for a missing dev_put on failure
  use a dynamic minor instead of a static minor number
  a __KERNEL__ guard for mp_get_sock()

What we have done in v2:
  remove most of the RCU usage, since the ctor pointer is only changed by the BIND/UNBIND ioctl, and during that time the NIC will be stopped to get a clean teardown (all outstanding requests are finished), so the ctor pointer cannot race into a wrong state
  replace struct vhost_notifier with struct kiocb; let the vhost-net backend alloc/free the kiocbs and transfer them via sendmsg/recvmsg
  use get_user_pages_fast() and set_page_dirty_lock()
[RFC] [PATCH v2 1/3] A device for zero-copy based on KVM virtio-net.
From: Xin Xiaohui xiaohui@intel.com

Add a device to utilize the vhost-net backend driver for copy-less data transfer between guest FE and host NIC. It pins the guest user space to the host memory and provides proto_ops such as sendmsg/recvmsg to vhost-net.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzha...@gmail.com
Signed-off-by: Jeff Dike jd...@c2.user-mode-linux.org
---
 drivers/vhost/Kconfig     |    5 +
 drivers/vhost/Makefile    |    2 +
 drivers/vhost/mpassthru.c | 1162 +
 include/linux/mpassthru.h |   29 ++
 4 files changed, 1198 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vhost/mpassthru.c
 create mode 100644 include/linux/mpassthru.h

diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index 9f409f4..ee32a3b 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -9,3 +9,8 @@ config VHOST_NET
 	  To compile this driver as a module, choose M here: the module will
 	  be called vhost_net.
+
+config VHOST_PASSTHRU
+	tristate "Zerocopy network driver (EXPERIMENTAL)"
+	depends on VHOST_NET
+	---help---
+	  zerocopy network I/O support

diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
index 72dd020..3f79c79 100644
--- a/drivers/vhost/Makefile
+++ b/drivers/vhost/Makefile
@@ -1,2 +1,4 @@
 obj-$(CONFIG_VHOST_NET) += vhost_net.o
 vhost_net-y := vhost.o net.o
+
+obj-$(CONFIG_VHOST_PASSTHRU) += mpassthru.o

diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
new file mode 100644
index 000..6e8fc4d
--- /dev/null
+++ b/drivers/vhost/mpassthru.c
@@ -0,0 +1,1162 @@
+/*
+ *  MPASSTHRU - Mediate passthrough device.
+ *  Copyright (C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ */
+
+#define DRV_NAME	"mpassthru"
+#define DRV_DESCRIPTION	"Mediate passthru device driver"
+#define DRV_COPYRIGHT	"(C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G"
+
+#include <linux/module.h>
+#include <linux/errno.h>
+#include <linux/kernel.h>
+#include <linux/major.h>
+#include <linux/slab.h>
+#include <linux/smp_lock.h>
+#include <linux/poll.h>
+#include <linux/fcntl.h>
+#include <linux/init.h>
+#include <linux/aio.h>
+
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/miscdevice.h>
+#include <linux/ethtool.h>
+#include <linux/rtnetlink.h>
+#include <linux/if.h>
+#include <linux/if_arp.h>
+#include <linux/if_ether.h>
+#include <linux/crc32.h>
+#include <linux/nsproxy.h>
+#include <linux/uaccess.h>
+#include <linux/virtio_net.h>
+#include <linux/mpassthru.h>
+#include <net/net_namespace.h>
+#include <net/netns/generic.h>
+#include <net/rtnetlink.h>
+#include <net/sock.h>
+
+#include <asm/system.h>
+
+#include "vhost.h"
+
+/* Uncomment to enable debugging */
+/* #define MPASSTHRU_DEBUG 1 */
+
+#ifdef MPASSTHRU_DEBUG
+static int debug;
+
+#define DBG  if (mp->debug) printk
+#define DBG1 if (debug == 2) printk
+#else
+#define DBG(a...)
+#define DBG1(a...)
+#endif
+
+#define COPY_THRESHOLD (L1_CACHE_BYTES * 4)
+#define COPY_HDR_LEN   (L1_CACHE_BYTES < 64 ? 64 : L1_CACHE_BYTES)
+
+struct frag {
+	u16	offset;
+	u16	size;
+};
+
+struct page_ctor {
+	struct list_head	readq;
+	int			w_len;
+	int			r_len;
+	spinlock_t		read_lock;
+	struct kmem_cache	*cache;
+	struct net_device	*dev;
+	struct mpassthru_port	port;
+};
+
+struct page_info {
+	void			*ctrl;
+	struct list_head	list;
+	int			header;
+	/* indicate the actual length of bytes
+	 * send/recv in the user space buffers
+	 */
+	int			total;
+	int			offset;
+	struct page		*pages[MAX_SKB_FRAGS+1];
+	struct skb_frag_struct	frag[MAX_SKB_FRAGS+1];
+	struct sk_buff		*skb;
+	struct page_ctor	*ctor;
+
+	/* The pointer relayed to skb, to indicate
+	 * it's a user space allocated skb or kernel
+	 */
+	struct skb_user_page	user;
+	struct skb_shared_info	ushinfo;
+
+#define INFO_READ	0
+#define INFO_WRITE	1
+	unsigned		flags;
+	unsigned		pnum;
+
+	/* It's meaningful for receive, means
+	 * the max length allowed
+	 */
+	size_t			len;
+
+	/* The fields after that is for
[RFC] [PATCH v2 2/3] Provides multiple submits and asynchronous notifications.
From: Xin Xiaohui xiaohui@intel.com

The vhost-net backend now only supports synchronous send/recv operations. The patch provides multiple submits and asynchronous notifications. This is needed for the zero-copy case.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
---
 drivers/vhost/net.c   | 189 +++--
 drivers/vhost/vhost.h |  10 +++
 2 files changed, 192 insertions(+), 7 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 22d5fef..2aafd90 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -17,11 +17,13 @@
 #include <linux/workqueue.h>
 #include <linux/rcupdate.h>
 #include <linux/file.h>
+#include <linux/aio.h>
 
 #include <linux/net.h>
 #include <linux/if_packet.h>
 #include <linux/if_arp.h>
 #include <linux/if_tun.h>
+#include <linux/mpassthru.h>
 
 #include <net/sock.h>
@@ -47,6 +49,7 @@ struct vhost_net {
 	struct vhost_dev dev;
 	struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
 	struct vhost_poll poll[VHOST_NET_VQ_MAX];
+	struct kmem_cache *cache;
 	/* Tells us whether we are polling a socket for TX.
 	 * We only do this when socket buffer fills up.
 	 * Protected by tx vq lock. */
@@ -91,11 +94,88 @@ static void tx_poll_start(struct vhost_net *net, struct socket *sock)
 	net->tx_poll_state = VHOST_NET_POLL_STARTED;
 }
 
+struct kiocb *notify_dequeue(struct vhost_virtqueue *vq)
+{
+	struct kiocb *iocb = NULL;
+	unsigned long flags;
+
+	spin_lock_irqsave(&vq->notify_lock, flags);
+	if (!list_empty(&vq->notifier)) {
+		iocb = list_first_entry(&vq->notifier,
+				struct kiocb, ki_list);
+		list_del(&iocb->ki_list);
+	}
+	spin_unlock_irqrestore(&vq->notify_lock, flags);
+	return iocb;
+}
+
+static void handle_async_rx_events_notify(struct vhost_net *net,
+					  struct vhost_virtqueue *vq)
+{
+	struct kiocb *iocb = NULL;
+	struct vhost_log *vq_log = NULL;
+	int rx_total_len = 0;
+	int log, size;
+
+	if (vq->link_state != VHOST_VQ_LINK_ASYNC)
+		return;
+
+	if (vq->receiver)
+		vq->receiver(vq);
+
+	vq_log = unlikely(vhost_has_feature(
+			&net->dev, VHOST_F_LOG_ALL)) ? vq->log : NULL;
+	while ((iocb = notify_dequeue(vq)) != NULL) {
+		vhost_add_used_and_signal(&net->dev, vq,
+				iocb->ki_pos, iocb->ki_nbytes);
+		log = (int)iocb->ki_user_data;
+		size = iocb->ki_nbytes;
+		rx_total_len += iocb->ki_nbytes;
+
+		if (iocb->ki_dtor)
+			iocb->ki_dtor(iocb);
+		kmem_cache_free(net->cache, iocb);
+
+		if (unlikely(vq_log))
+			vhost_log_write(vq, vq_log, log, size);
+		if (unlikely(rx_total_len >= VHOST_NET_WEIGHT)) {
+			vhost_poll_queue(&vq->poll);
+			break;
+		}
+	}
+}
+
+static void handle_async_tx_events_notify(struct vhost_net *net,
+					  struct vhost_virtqueue *vq)
+{
+	struct kiocb *iocb = NULL;
+	int tx_total_len = 0;
+
+	if (vq->link_state != VHOST_VQ_LINK_ASYNC)
+		return;
+
+	while ((iocb = notify_dequeue(vq)) != NULL) {
+		vhost_add_used_and_signal(&net->dev, vq,
+				iocb->ki_pos, 0);
+		tx_total_len += iocb->ki_nbytes;
+
+		if (iocb->ki_dtor)
+			iocb->ki_dtor(iocb);
+
+		kmem_cache_free(net->cache, iocb);
+		if (unlikely(tx_total_len >= VHOST_NET_WEIGHT)) {
+			vhost_poll_queue(&vq->poll);
+			break;
+		}
+	}
+}
+
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_tx(struct vhost_net *net)
 {
 	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
+	struct kiocb *iocb = NULL;
 	unsigned head, out, in, s;
 	struct msghdr msg = {
 		.msg_name = NULL,
@@ -124,6 +204,8 @@ static void handle_tx(struct vhost_net *net)
 		tx_poll_stop(net);
 	hdr_size = vq->hdr_size;
 
+	handle_async_tx_events_notify(net, vq);
+
 	for (;;) {
 		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
					 ARRAY_SIZE(vq->iov),
@@ -151,6 +233,15 @@ static void handle_tx(struct vhost_net *net)
 		/* Skip header. TODO: support TSO. */
 		s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, out);
 		msg.msg_iovlen = out;
+
+		if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
+			iocb = kmem_cache_zalloc(net->cache, GFP_KERNEL);
+			if (!iocb)
+				break;
+
[RFC] [PATCH v2 3/3] Let host NIC driver to DMA to guest user space.
From: Xin Xiaohui xiaohui@intel.com

The patch lets the host NIC driver receive user space skbs, so the driver has the chance to directly DMA to guest user space buffers through a single ethX interface. We want it to be more generic, as a zero-copy framework.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzha...@gmail.com
Signed-off-by: Jeff Dike jd...@c2.user-mode-linux.org
---
We considered 2 ways to utilize the user buffers, but are not sure which one is better. Please give any comments.

One: Modify the __alloc_skb() function a bit so it can allocate only a struct sk_buff, with the data pointer pointing to a user buffer which comes from a page constructor API. Then the shinfo of the skb is also from the guest. When a packet is received from hardware, skb->data is filled directly by h/w. What we have done is in this way.
Pros: We can avoid any copy here.
Cons: The guest virtio-net driver needs to allocate the skb in almost the same way as the host NIC drivers, say the size of netdev_alloc_skb() and the same reserved space at the head of the skb. Many NIC drivers are the same as the guest and ok with this, but some of the latest NIC drivers reserve special room in the skb head. To deal with that, we suggest providing a method in the guest virtio-net driver to ask the NIC driver for the parameters we are interested in once we know which device we have bound for zero-copy, and then asking the guest to allocate accordingly. Is that reasonable?

Two: Modify the driver to get user buffers allocated from a page constructor API (substituting for alloc_page()); the user buffers are used as payload buffers and filled by h/w directly when a packet is received. The driver should associate the pages with the skb (skb_shinfo(skb)->frags). For the head buffer, let the host allocate the skb and h/w fill it. After that, the data filled into the host skb header is copied into the guest header buffer, which is submitted together with the payload buffer.
Pros: We care less about the way the guest or host allocates its buffers.
Cons: We still need a small copy here for the skb header.

We are not sure which way is better here. This is the first thing we want comments on from the community. We wish the modification to the network part to be generic, not used by the vhost-net backend only; a user application may use it as well when the zero-copy device provides async read/write operations later.

Thanks
Xiaohui

 include/linux/netdevice.h | 69 ++++-
 include/linux/skbuff.h    | 30 ++--
 net/core/dev.c            | 63 ++++++
 net/core/skbuff.c         | 74 ++++
 4 files changed, 224 insertions(+), 12 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 94958c1..ba48eb0 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -485,6 +485,17 @@ struct netdev_queue {
 	unsigned long		tx_dropped;
 } ____cacheline_aligned_in_smp;
 
+#if defined(CONFIG_VHOST_PASSTHRU) || defined(CONFIG_VHOST_PASSTHRU_MODULE)
+struct mpassthru_port {
+	int		hdr_len;
+	int		data_len;
+	int		npages;
+	unsigned	flags;
+	struct socket	*sock;
+	struct skb_user_page	*(*ctor)(struct mpassthru_port *,
+				struct sk_buff *, int);
+};
+#endif
 
 /*
  * This structure defines the management hooks for network devices.
@@ -636,6 +647,10 @@ struct net_device_ops {
 	int			(*ndo_fcoe_ddp_done)(struct net_device *dev,
						     u16 xid);
 #endif
+#if defined(CONFIG_VHOST_PASSTHRU) || defined(CONFIG_VHOST_PASSTHRU_MODULE)
+	int			(*ndo_mp_port_prep)(struct net_device *dev,
+						struct mpassthru_port *port);
+#endif
 };
 
 /*
@@ -891,7 +906,8 @@ struct net_device
 	struct macvlan_port	*macvlan_port;
 	/* GARP */
 	struct garp_port	*garp_port;
-
+	/* mpassthru */
+	struct mpassthru_port	*mp_port;
 	/* class/net/name entry */
 	struct device		dev;
 	/* space for optional statistics and wireless sysfs groups */
@@ -2013,6 +2029,55 @@ static inline u32 dev_ethtool_get_flags(struct net_device *dev)
 		return 0;
 	return dev->ethtool_ops->get_flags(dev);
 }
-#endif /* __KERNEL__ */
+
+/* To support zero-copy between user space application and NIC driver,
+ * we'd better ask NIC driver for the capability it can provide, especially
+ * for packet split mode, now we only ask for the header size, and the
+ *
Re: [RFC] [PATCH v2 3/3] Let host NIC driver to DMA to guest user space.
On Fri, 2 Apr 2010 15:30:10 +0800 xiaohui@intel.com wrote:
> From: Xin Xiaohui xiaohui@intel.com
>
> The patch lets the host NIC driver receive user space skbs, so the driver has the chance to directly DMA to guest user space buffers through a single ethX interface. We want it to be more generic, as a zero-copy framework.
>
> Signed-off-by: Xin Xiaohui xiaohui@intel.com
> Signed-off-by: Zhao Yu yzha...@gmail.com
> Signed-off-by: Jeff Dike jd...@c2.user-mode-linux.org
> ---
> We considered 2 ways to utilize the user buffers, but are not sure which one is better. Please give any comments.
>
> One: Modify the __alloc_skb() function a bit so it can allocate only a struct sk_buff, with the data pointer pointing to a user buffer which comes from a page constructor API. Then the shinfo of the skb is also from the guest. When a packet is received from hardware, skb->data is filled directly by h/w. What we have done is in this way.
> Pros: We can avoid any copy here.
> Cons: The guest virtio-net driver needs to allocate the skb in almost the same way as the host NIC drivers, say the size of netdev_alloc_skb() and the same reserved space at the head of the skb. Many NIC drivers are the same as the guest and ok with this, but some of the latest NIC drivers reserve special room in the skb head. To deal with that, we suggest providing a method in the guest virtio-net driver to ask the NIC driver for the parameters we are interested in once we know which device we have bound for zero-copy, and then asking the guest to allocate accordingly. Is that reasonable?
>
> Two: Modify the driver to get user buffers allocated from a page constructor API (substituting for alloc_page()); the user buffers are used as payload buffers and filled by h/w directly when a packet is received. The driver should associate the pages with the skb (skb_shinfo(skb)->frags). For the head buffer, let the host allocate the skb and h/w fill it. After that, the data filled into the host skb header is copied into the guest header buffer, which is submitted together with the payload buffer.
> Pros: We care less about the way the guest or host allocates its buffers.
> Cons: We still need a small copy here for the skb header.
>
> We are not sure which way is better here. This is the first thing we want comments on from the community. We wish the modification to the network part to be generic, not used by the vhost-net backend only; a user application may use it as well when the zero-copy device provides async read/write operations later.
>
> Thanks
> Xiaohui

How do you deal with the DoS problem of a hostile user space app posting a huge number of receives and never getting anything?
Re: [PATCH 0/1] uio_pci_generic: extensions to allow access for non-privileged processes
On Fri, Apr 02, 2010 at 09:43:35AM +0300, Avi Kivity wrote:
> On 04/01/2010 10:24 PM, Tom Lyon wrote:
>>> But there are multiple msi-x interrupts, how do you know which one triggered?
>> You don't. This would suck for KVM, I guess, but we'd need major rework of the generic UIO stuff to have a separate event channel for each MSI-X.
> Doesn't it suck for non-kvm in the same way? Multiple vectors are there for a reason. For example, if you have a multiqueue NIC, you'd have to process all queues instead of just the one that triggered.
>> For my purposes, collapsing all the MSI-Xs into one MSI-look-alike is fine, because I'd be using MSI anyways if I could. The weird Intel 82599 VF only supports MSI-X. So one big question is - do we expand the whole UIO framework for KVM requirements, or do we split off either KVM or non-KVM into a separate driver? Hans or Greg - care to opine?
> Currently kvm does device assignment with its own code; I'd like to unify it with uio, not split it off. Separate notifications for msi-x interrupts are just as useful for uio as they are for kvm.

I agree, there should not be a difference here for KVM vs. the normal version.

thanks,

greg k-h
[PATCH] vhost: Make it more scalable by creating a vhost thread per device.
Make vhost scalable by creating a separate vhost thread per vhost device. This provides better scaling across multiple guests and with multiple interfaces in a guest. I am seeing better aggregated throughput/latency when running netperf across multiple guests or multiple interfaces in a guest in parallel with this patch.

Signed-off-by: Sridhar Samudrala s...@us.ibm.com

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index a6a88df..29aa80f 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -339,8 +339,10 @@ static int vhost_net_open(struct inode *inode, struct file *f)
 		return r;
 	}
 
-	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT);
-	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN);
+	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT,
+			&n->dev);
+	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN,
+			&n->dev);
 	n->tx_poll_state = VHOST_NET_POLL_DISABLED;
 
 	f->private_data = n;
@@ -643,25 +645,14 @@ static struct miscdevice vhost_net_misc = {
 
 int vhost_net_init(void)
 {
-	int r = vhost_init();
-	if (r)
-		goto err_init;
-	r = misc_register(&vhost_net_misc);
-	if (r)
-		goto err_reg;
-	return 0;
-err_reg:
-	vhost_cleanup();
-err_init:
-	return r;
-
+	return misc_register(&vhost_net_misc);
 }
+
 module_init(vhost_net_init);
 
 void vhost_net_exit(void)
 {
 	misc_deregister(&vhost_net_misc);
-	vhost_cleanup();
 }
 module_exit(vhost_net_exit);

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 7bd7a1e..243f4d3 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -36,8 +36,6 @@ enum {
 	VHOST_MEMORY_F_LOG = 0x1,
 };
 
-static struct workqueue_struct *vhost_workqueue;
-
 static void vhost_poll_func(struct file *file, wait_queue_head_t *wqh,
 			    poll_table *pt)
 {
@@ -56,18 +54,19 @@ static int vhost_poll_wakeup(wait_queue_t *wait, unsigned mode, int sync,
 	if (!((unsigned long)key & poll->mask))
 		return 0;
 
-	queue_work(vhost_workqueue, &poll->work);
+	queue_work(poll->dev->wq, &poll->work);
 	return 0;
 }
 
 /* Init poll structure */
 void vhost_poll_init(struct vhost_poll *poll, work_func_t func,
-		     unsigned long mask)
+		     unsigned long mask, struct vhost_dev *dev)
 {
 	INIT_WORK(&poll->work, func);
 	init_waitqueue_func_entry(&poll->wait, vhost_poll_wakeup);
 	init_poll_funcptr(&poll->table, vhost_poll_func);
 	poll->mask = mask;
+	poll->dev = dev;
 }
 
 /* Start polling a file. We add ourselves to file's wait queue. The caller must
@@ -96,7 +95,7 @@ void vhost_poll_flush(struct vhost_poll *poll)
 void vhost_poll_queue(struct vhost_poll *poll)
 {
-	queue_work(vhost_workqueue, &poll->work);
+	queue_work(poll->dev->wq, &poll->work);
 }
 
 static void vhost_vq_reset(struct vhost_dev *dev,
@@ -128,6 +127,11 @@ long vhost_dev_init(struct vhost_dev *dev,
 		    struct vhost_virtqueue *vqs, int nvqs)
 {
 	int i;
+
+	dev->wq = create_singlethread_workqueue("vhost");
+	if (!dev->wq)
+		return -ENOMEM;
+
 	dev->vqs = vqs;
 	dev->nvqs = nvqs;
 	mutex_init(&dev->mutex);
@@ -143,7 +147,7 @@ long vhost_dev_init(struct vhost_dev *dev,
 		if (dev->vqs[i].handle_kick)
 			vhost_poll_init(&dev->vqs[i].poll,
 					dev->vqs[i].handle_kick,
-					POLLIN);
+					POLLIN, dev);
 	}
 	return 0;
 }
@@ -216,6 +220,8 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
 	if (dev->mm)
 		mmput(dev->mm);
 	dev->mm = NULL;
+
+	destroy_workqueue(dev->wq);
 }
 
 static int log_access_ok(void __user *log_base, u64 addr, unsigned long sz)
@@ -1095,16 +1101,3 @@ void vhost_disable_notify(struct vhost_virtqueue *vq)
 		vq_err(vq, "Failed to enable notification at %p: %d\n",
 		       &vq->used->flags, r);
 }
-
-int vhost_init(void)
-{
-	vhost_workqueue = create_singlethread_workqueue("vhost");
-	if (!vhost_workqueue)
-		return -ENOMEM;
-	return 0;
-}
-
-void vhost_cleanup(void)
-{
-	destroy_workqueue(vhost_workqueue);
-}

diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 44591ba..60fefd0 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -29,10 +29,11 @@ struct vhost_poll {
 	/* struct which will handle all actual work. */
 	struct work_struct	  work;
 	unsigned long		  mask;
+	struct vhost_dev	 *dev;
 };
 
 void vhost_poll_init(struct vhost_poll *poll, work_func_t func,
-		     unsigned long mask);
+		     unsigned long mask, struct vhost_dev *dev);
[GSoC 2010] Completing Nested VMX
Hello All,

I'm interested in adding nested VMX support to KVM in GSoC 2010 (among other things). I see that Orit Wasserman has done some work in this area, but it didn't get merged yet. The last patches were a few months ago and I have not seen any substantial progress on that front since.

I wonder whether the previous work can be used as a starting point for any future effort? What is missing from it? What are the current limitations of that implementation? And how can it be extended? And within the scope of GSoC, what do you think the achievements of such a project should be?

Regards,
Mohammed
Re: Setting nx bit in virtual CPU
Nope, both kernels are 64 bit.

uname -a, Host:
Linux gordon 2.6.27-gentoo-r8 #5 Sat Mar 14 18:01:59 GMT 2009 x86_64 AMD Athlon(tm) 64 Processor 3200+ AuthenticAMD GNU/Linux

uname -a, Guest:
Linux andrew 2.6.28-hardened-r9 #4 Mon Jan 18 22:39:31 GMT 2010 x86_64 AMD Athlon(tm) 64 Processor 3200+ AuthenticAMD GNU/Linux

As you can see, both kernels are a little old, and I have been wondering if that might be part of the problem. The guest one is old because that is the latest stable hardened version in Gentoo. The host one is old because of:

(gordon:~) rs10% uptime
22:01:37 up 374 days, 23:29, 1 user, load average: 1.09, 0.42, 0.18

Now that I have managed to smash the psychologically important 1 year uptime for the first time ever (Woo!) I shall probably upgrade the host kernel in the near future. Of course, it is important to remember that with the --no-kvm switch it works just fine (only slowly) with exactly the same two kernels.

Thanks

On 01/04/10 09:43, Avi Kivity wrote:
> On 03/30/2010 01:16 AM, Richard Simpson wrote:
>> Hello,
>> Summary: How can I have a virtual CPU with the nx bit set whilst enjoying KVM acceleration?
>> My Host - AMD Athlon(tm) 64 Processor 3200+ running Gentoo
>> My VM - KVM running hardened Gentoo
>> My KVM version - 0.12.3
>> My Task - Implement restricted secure VM to handle services exposed to internet.
>> My Command - kvm -hda /dev/mapper/vols-andrew -kernel ./bzImage -append root=/dev/hda2 -cpu host -runas xxx -net nic -net user -m 256 -k en-gb -vnc :1 -monitor stdio
> Are you running a 32-bit non-pae host kernel? In that case, nx is disabled both for the guest and host. Switch to a pae (or 64-bit) kernel and all should be well.
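For anyone debugging the same symptom: a quick way to see whether a running Linux kernel (host or guest) actually has NX enabled is to look for the `nx` flag in /proc/cpuinfo. This is just a generic check, not specific to the setup above:

```shell
# "nx" appears in the cpuinfo flags when a 64-bit or PAE kernel
# has the no-execute bit enabled; run on both host and guest.
if grep -qw nx /proc/cpuinfo; then
    echo "NX reported by CPU"
else
    echo "NX not reported"
fi
```

Comparing the output on the host and inside the guest narrows down whether the bit is being lost by the host kernel or by the virtual CPU model.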
Re: [RFC][PATCH v2 0/3] Provide a zero-copy method on KVM virtio-net.
On Fri, 2010-04-02 at 15:25 +0800, xiaohui@intel.com wrote:
> The idea is simple: just pin the guest VM user space and then let the host NIC driver have the chance to directly DMA to it.
> The patches are based on the vhost-net backend driver. We add a device which provides proto_ops such as sendmsg/recvmsg to vhost-net to send/recv directly to/from the NIC driver. A KVM guest which uses the vhost-net backend may bind any ethX interface on the host side to get copyless data transfer through the guest virtio-net frontend.

What is the advantage of this approach compared to PCI-passthrough of the host NIC to the guest?

Does this require pinning of the entire guest memory, or only the send/receive buffers?

Thanks
Sridhar

> The scenario is like this: the guest virtio-net driver submits multiple requests through the vhost-net backend driver to the kernel, and the requests are queued and then completed after the corresponding actions in h/w are done.
> For read, user space buffers are dispensed to the NIC driver for rx when a page constructor API is invoked; that means NICs can allocate user buffers from a page constructor. We add a hook in netif_receive_skb() to intercept the incoming packets and notify the zero-copy device.
> For write, the zero-copy device may allocate a new host skb, put the payload on skb_shinfo(skb)->frags, and copy the header to skb->data. The request remains pending until the skb is transmitted by h/w.
> Here, we have considered 2 ways to utilize the page constructor API to dispense the user buffers.
> One: Modify __alloc_skb() a bit so it can allocate only a struct sk_buff, with the data pointer pointing to a user buffer which comes from a page constructor API. Then the shinfo of the skb is also from the guest. When a packet is received from hardware, skb->data is filled directly by h/w. What we have done is in this way.
> Pros: We can avoid any copy here.
> Cons: The guest virtio-net driver needs to allocate the skb in almost the same way as the host NIC drivers, say the size of netdev_alloc_skb() and the same reserved space at the head of the skb. Many NIC drivers are the same as the guest and ok with this, but some of the latest NIC drivers reserve special room in the skb head. To deal with that, we suggest providing a method in the guest virtio-net driver to ask the NIC driver for the parameters we are interested in once we know which device we have bound for zero-copy, and then asking the guest to allocate accordingly. Is that reasonable?
> Two: Modify the driver to get user buffers allocated from a page constructor API (substituting for alloc_page()); the user buffers are used as payload buffers and filled by h/w directly when a packet is received. The driver should associate the pages with the skb (skb_shinfo(skb)->frags). For the head buffer, let the host allocate the skb and h/w fill it. After that, the data filled into the host skb header is copied into the guest header buffer, which is submitted together with the payload buffer.
> Pros: We care less about the way the guest or host allocates its buffers.
> Cons: We still need a small copy here for the skb header.
> We are not sure which way is better here. This is the first thing we want comments on from the community. We wish the modification to the network part to be generic, not used by the vhost-net backend only; a user application may use it as well when the zero-copy device provides async read/write operations later.
> Please give comments, especially on the network part modifications.
> We provide multiple submits and asynchronous notification to vhost-net too.
> Our goal is to improve the bandwidth and reduce the CPU usage. Exact performance data will be provided later, but in a simple test with netperf we found that bandwidth went up and CPU % went up too, though the bandwidth increase is much larger than the CPU % increase.
>
> What we have not done yet:
>   packet split support
>   GRO support
>   performance tuning
>
> What we have done in v1:
>   polish the RCU usage
>   deal with write logging in asynchronous mode in vhost
>   add a notifier block for the mp device
>   rename page_ctor to mp_port in netdevice.h to make it look generic
>   add mp_dev_change_flags() for the mp device to change NIC state
>   add CONFIG_VHOST_MPASSTHRU to limit the usage when the module is not loaded
>   a small fix for a missing dev_put on failure
>   use a dynamic minor instead of a static minor number
>   a __KERNEL__ guard for mp_get_sock()
>
> What we have done in v2:
>   remove most of the RCU usage, since the ctor pointer is only changed by the BIND/UNBIND ioctl, and during that time the NIC will be stopped to get a clean teardown (all outstanding requests are finished), so the