Re: [PATCH] fix kvmclock bug
On 19.09.2010 02:15, Zachary Amsden wrote:
> For CPUs with unstable TSC, we null time offset between not just VCPU
> switches, but all preemptions of the kvm thread. This makes a bug much
> more likely where the kvmclock values are updated before a successful
> exit from virt, causing an underflow.
>
> The null offsetting was added at:
>   bf0fb4a42ba7eb362f4013bd2e93209666793e66
>
> The underflow happens with this additional patch:
>   cf839f5da2b0779b9ec8b990f851fb4e7d681da0
>
> There is a secondary bug, which is that TSC fails to advance with real
> time on unstable TSC, but the fix is much more involved (it requires
> the TSC catchup code). For now, this patch is sufficient to get things
> working again for me.

...but not for me. I still face stuck (or infinitely slow) guests that want to use kvmclock once tsc_unstable gets set. Or is this patch addressing a different issue?

Jan
Re: [Autotest] [PATCH 07/18] KVM test: Add a subtest jumbo
On Tue, 14 Sep 2010 19:25:32 -0300 Lucas Meneghel Rodrigues l...@redhat.com wrote:
> +session.close()
> +logging.info("Removing the temporary ARP entry")
> +utils.run("arp -d %s -i %s" % (ip, ifname))

Hi Lucas,

I tried different combinations for this jumbo test case, but it didn't work for me. I guess there is a problem when trying to remove the ARP entry. An ARP entry can be removed from the cache using the IP address and the network interface (for example, eth0):

    arp -d ip -i eth0

The error I got:

23:06:14 DEBUG| Running 'arp -d 192.168.122.104 -i rtl8139_0_5900'
23:06:14 ERROR| Test failed: CmdError: Command arp -d 192.168.122.104 -i rtl8139_0_5900 failed, rc=255, Command returned non-zero exit status
* Command: arp -d 192.168.122.104 -i rtl8139_0_5900
Exit status: 255
Duration: 0.00138521194458
stderr: SIOCDARP(pub): No such file or directory

When I try it manually, this one works for me.

Guest IP: 192.168.122.104

try arp -i 192.168.122.1 -i eth0

Thanks,
Pradeep

--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
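One way to make the failure above easier to diagnose is to check that the interface name passed to arp actually exists on the host before attempting the delete. The following is only a sketch under that assumption: the helper names iface_exists and remove_arp_entry are hypothetical and are not part of the autotest patch, and the /sys/class/net lookup is Linux-specific.

```shell
# Hypothetical helpers, not part of the jumbo subtest.

# True when the host has a network interface with this name (Linux sysfs).
iface_exists() {
    [ -e "/sys/class/net/$1" ]
}

# Delete an ARP cache entry only when the interface is real; return 2
# (instead of letting arp fail with SIOCDARP) when it is not.
remove_arp_entry() {
    arp_ip=$1
    arp_ifname=$2
    if ! iface_exists "$arp_ifname"; then
        echo "skip: no host interface named '$arp_ifname'" >&2
        return 2
    fi
    arp -d "$arp_ip" -i "$arp_ifname"
}
```

With the values from the log, remove_arp_entry 192.168.122.104 rtl8139_0_5900 would report the missing interface rather than run arp, provided no host interface has that name.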
[PATCH 1/2] make-release: don't use --tmpdir mktemp option
This allows the script to work on older systems, where 'mktemp --tmpdir' is not available.

Signed-off-by: Eduardo Habkost ehabk...@redhat.com
---
 kvm/scripts/make-release |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kvm/scripts/make-release b/kvm/scripts/make-release
index 64e77f9..c5f8c92 100755
--- a/kvm/scripts/make-release
+++ b/kvm/scripts/make-release
@@ -12,7 +12,7 @@ formal=
 
 releasedir=~/sf-release
 [[ -z $TMP ]] && TMP=/tmp
-tmpdir=`mktemp -d --tmpdir=$TMP qemu-kvm-make-release.XX`
+tmpdir=`mktemp -d $TMP/qemu-kvm-make-release.XX`
 while [[ $1 = -* ]]; do
     opt=$1
     shift
-- 
1.7.2.1
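For anyone who wants to reuse the idea outside make-release, here is a minimal sketch as a standalone helper. The function name make_tmpdir and the ten-X template are assumptions made for illustration; the only thing taken from the patch is passing the template as a full path instead of relying on --tmpdir.

```shell
# Hypothetical helper, not part of make-release: create a temporary
# directory without the GNU-only 'mktemp --tmpdir' option.
make_tmpdir() {
    # Same fallback as the script: use $TMP when set, /tmp otherwise.
    mt_base=${TMP:-/tmp}
    # A template given as a full path is accepted by old and new mktemp.
    mktemp -d "$mt_base/qemu-kvm-make-release.XXXXXXXXXX"
}
```

The caller captures the printed path, for example tmpdir=$(make_tmpdir), and removes the directory when done.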
[PATCH 2/2] make-release: don't use --mtime and --transform tar options
Those options are not available on older systems.

Instead of --transform, just create the file inside the expected directory. Instead of --mtime, use 'touch' to set the file mtime before running tar.

Signed-off-by: Eduardo Habkost ehabk...@redhat.com
---
 kvm/scripts/make-release |   18 ++
 1 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/kvm/scripts/make-release b/kvm/scripts/make-release
index c5f8c92..56302c3 100755
--- a/kvm/scripts/make-release
+++ b/kvm/scripts/make-release
@@ -52,20 +52,22 @@ mkdir -p $(dirname $tarball)
 git archive --prefix=$name/ --format=tar $commit > $tarball
 
 mtime=`git show --format=%ct $commit^{commit} --`
-tarargs="--owner=root --group=root --mtime=@$mtime"
+tarargs="--owner=root --group=root"
 
-mkdir -p $tmpdir
+mkdir -p $tmpdir/$name
 git cat-file -p ${commit}:roms | awk ' { print $4, $3 } ' \
-    > $tmpdir/EXTERNAL_DEPENDENCIES
-tar -rf $tarball --transform s,^,$name/, -C $tmpdir \
+    > $tmpdir/$name/EXTERNAL_DEPENDENCIES
+touch -d @$mtime $tmpdir/$name/EXTERNAL_DEPENDENCIES
+tar -rf $tarball -C $tmpdir \
     $tarargs \
-    EXTERNAL_DEPENDENCIES
+    $name/EXTERNAL_DEPENDENCIES
 rm -rf $tmpdir
 
 if [[ -n $formal ]]; then
-    mkdir -p $tmpdir
-    echo $name > $tmpdir/KVM_VERSION
-    tar -rf $tarball --transform s,^,$name/, -C $tmpdir KVM_VERSION \
+    mkdir -p $tmpdir/$name
+    echo $name > $tmpdir/$name/KVM_VERSION
+    touch -d @$mtime $tmpdir/$name/KVM_VERSION
+    tar -rf $tarball -C $tmpdir $name/KVM_VERSION \
         $tarargs
     rm -rf $tmpdir
 fi
-- 
1.7.2.1
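The two substitutions in this patch can be sketched as one small function: stage the file under the prefix directory (replacing --transform), stamp it with touch (replacing --mtime), then append it with plain tar -rf. The function name append_with_prefix and its argument order are assumptions made for illustration; like the patch, the sketch still relies on GNU tar's --owner/--group and on a touch that accepts '-d @SECONDS'.

```shell
# Hypothetical helper, not part of make-release.
# usage: append_with_prefix TARBALL NAME FILE MTIME CONTENT
append_with_prefix() {
    ap_tarball=$1 ap_name=$2 ap_file=$3 ap_mtime=$4 ap_content=$5

    ap_stage=$(mktemp -d "${TMP:-/tmp}/stage.XXXXXX") || return 1

    # Instead of --transform: create the file where the archive expects it.
    mkdir -p "$ap_stage/$ap_name"
    printf '%s\n' "$ap_content" > "$ap_stage/$ap_name/$ap_file"

    # Instead of --mtime: set the timestamp before tar reads the file.
    touch -d "@$ap_mtime" "$ap_stage/$ap_name/$ap_file"

    tar -rf "$ap_tarball" -C "$ap_stage" --owner=root --group=root \
        "$ap_name/$ap_file"
    ap_rc=$?

    rm -rf "$ap_stage"
    return $ap_rc
}
```

Staging each file in a fresh directory keeps the member name in the archive identical to what --transform would have produced, at the cost of one extra mkdir per appended file.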
[PATCH 0/2] qemu-kvm: make make-release work on older systems
Hi,

The following patches allow make-release to be run on older systems (such as RHEL5), where mktemp doesn't have the --tmpdir option and tar doesn't have the --transform and --mtime options.

I made these changes to the scripts for my own use (to help with testing and packaging of qemu-kvm), but I don't know whether they are interesting enough to be applied to upstream qemu-kvm.

Eduardo Habkost (2):
  make-release: don't use --tmpdir mktemp option
  make-release: don't use --mtime and --transform tar options

 kvm/scripts/make-release |   20 +++-
 1 files changed, 11 insertions(+), 9 deletions(-)

-- 
1.7.2.1
Re: [PATCH 0/3] VFIO V4: VFIO driver: Non-privileged user level PCI drivers
On Wed, 22 Sep 2010 14:15:57 -0700 Tom Lyon p...@cisco.com wrote:
> After a long summer break, it's tanned, it's rested, and it's ready to
> rumble!
>
> In this version:
>
> *** REBASE to 2.6.35 ***
>
> There's new code using generic netlink messages which allows the kernel
> to notify the user level of weird events and allows the user level to
> respond. This is currently used to handle device removal (whether
> software or hardware driven), PCI error events, and system suspend &
> hibernate.
>
> The driver now supports devices which use multiple MSI interrupts,
> reflecting the actual number of interrupts allocated by the system to
> the user level.
>
> PCI config accesses are now done through the pci_user_{read,write}_config
> routines from drivers/pci/access.c.

I really don't like to encourage user level drivers (in fact I've actively tried to kill them in our graphics stack), but I do understand that they're convenient in many scenarios. So assuming you can convince someone to apply the VFIO framework, I'm ok with exporting the user level accessor functions (after all, we export them to userland already via sysfs, so exporting them as GPL symbols to another module is fine).

So you can add my

Acked-by: Jesse Barnes jbar...@virtuousgeek.org

to the PCI parts, but don't take it as an endorsement of VFIO in general! :)

-- 
Jesse Barnes, Intel Open Source Technology Center
inconsistent use of $TMP vs. $TMPDIR
I noticed today that various kvm source files are inconsistent on the use of $TMP vs. $TMPDIR:

$ git grep -l '\$TMP\b' | cat
scripts/Kbuild.include
tools/perf/feature-tests.mak
$ git grep -l '\$TMPDIR\b' | cat
Documentation/lguest/extract

According to POSIX, you should probably be using $TMPDIR instead of $TMP when referring to the preferred temporary directory location.

http://www.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap08.html#tag_08_03

-- 
Eric Blake ebl...@redhat.com +1-801-349-2682
Libvirt virtualization library http://libvirt.org
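A script that wants to follow POSIX without breaking users who still export $TMP could resolve the directory once, in a fixed order. This is only a sketch: the function name tmp_base and the choice to keep $TMP as a second-level fallback are assumptions, not something any of the files listed above currently do.

```shell
# Hypothetical helper: pick the temporary directory, preferring the
# POSIX variable TMPDIR, then the legacy TMP, then /tmp.
tmp_base() {
    printf '%s\n' "${TMPDIR:-${TMP:-/tmp}}"
}
```

A caller would then write something like dir=$(tmp_base) and build its temporary file names under "$dir".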
[PATCH v11 05/17] Add a function to indicate if device use external buffer.
From: Xin Xiaohui xiaohui@intel.com

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 include/linux/netdevice.h |    5 +
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 5f192de..23d6ec0 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1602,6 +1602,11 @@ extern gro_result_t napi_gro_frags(struct napi_struct *napi);
 extern int netdev_mp_port_prep(struct net_device *dev,
 				struct mpassthru_port *port);
 
+static inline bool dev_is_mpassthru(struct net_device *dev)
+{
+	return dev && dev->mp_port;
+}
+
 static inline void napi_free_frags(struct napi_struct *napi)
 {
 	kfree_skb(napi->skb);
-- 
1.7.3
[PATCH v11 13/17] Add mp(mediate passthru) device.
From: Xin Xiaohui xiaohui@intel.com

This patch adds the mp (mediate passthru) device, which is currently based on the vhost-net backend driver and provides proto_ops to send/receive guest buffer data from/to the guest virtio-net driver.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 drivers/vhost/mpassthru.c | 1407 +
 1 files changed, 1407 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vhost/mpassthru.c

diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
new file mode 100644
index 000..d86d94c
--- /dev/null
+++ b/drivers/vhost/mpassthru.c
@@ -0,0 +1,1407 @@
+/*
+ *  MPASSTHRU - Mediate passthrough device.
+ *  Copyright (C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ *  GNU General Public License for more details.
+ *
+ */
+
+#define DRV_NAME	"mpassthru"
+#define DRV_DESCRIPTION	"Mediate passthru device driver"
+#define DRV_COPYRIGHT	"(C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G"
+
+#include <linux/compat.h>
+#include <linux/module.h>
+#include <linux/errno.h>
+#include <linux/kernel.h>
+#include <linux/major.h>
+#include <linux/slab.h>
+#include <linux/smp_lock.h>
+#include <linux/poll.h>
+#include <linux/fcntl.h>
+#include <linux/init.h>
+#include <linux/aio.h>
+
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/miscdevice.h>
+#include <linux/ethtool.h>
+#include <linux/rtnetlink.h>
+#include <linux/if.h>
+#include <linux/if_arp.h>
+#include <linux/if_ether.h>
+#include <linux/crc32.h>
+#include <linux/nsproxy.h>
+#include <linux/uaccess.h>
+#include <linux/virtio_net.h>
+#include <linux/mpassthru.h>
+#include <net/net_namespace.h>
+#include <net/netns/generic.h>
+#include <net/rtnetlink.h>
+#include <net/sock.h>
+
+#include <asm/system.h>
+
+/* Uncomment to enable debugging */
+/* #define MPASSTHRU_DEBUG 1 */
+
+#ifdef MPASSTHRU_DEBUG
+static int debug;
+
+#define DBG  if (mp->debug) printk
+#define DBG1 if (debug == 2) printk
+#else
+#define DBG(a...)
+#define DBG1(a...)
+#endif
+
+#define COPY_THRESHOLD (L1_CACHE_BYTES * 4)
+#define COPY_HDR_LEN   (L1_CACHE_BYTES < 64 ? 64 : L1_CACHE_BYTES)
+
+struct frag {
+	u16	offset;
+	u16	size;
+};
+
+#define	HASH_BUCKETS	(8192*2)
+
+struct page_info {
+	struct list_head	list;
+	struct page_info	*next;
+	struct page_info	*prev;
+	struct page		*pages[MAX_SKB_FRAGS];
+	struct sk_buff		*skb;
+	struct page_ctor	*ctor;
+
+	/* The pointer relayed to skb, to indicate
+	 * it's a external allocated skb or kernel
+	 */
+	struct skb_ext_page	ext_page;
+
+#define INFO_READ		0
+#define INFO_WRITE		1
+	unsigned		flags;
+	unsigned		pnum;
+
+	/* The fields after that is for backend
+	 * driver, now for vhost-net.
+	 */
+
+	struct kiocb		*iocb;
+	unsigned int		desc_pos;
+	struct iovec		hdr[2];
+	struct iovec		iov[MAX_SKB_FRAGS];
+};
+
+static struct kmem_cache *ext_page_info_cache;
+
+struct page_ctor {
+	struct list_head	readq;
+	int			wq_len;
+	int			rq_len;
+	spinlock_t		read_lock;
+	/* record the locked pages */
+	int			lock_pages;
+	struct rlimit		o_rlim;
+	struct net_device	*dev;
+	struct mpassthru_port	port;
+	struct page_info	**hash_table;
+};
+
+struct mp_struct {
+	struct mp_file		*mfile;
+	struct net_device	*dev;
+	struct page_ctor	*ctor;
+	struct socket		socket;
+
+#ifdef MPASSTHRU_DEBUG
+	int			debug;
+#endif
+};
+
+struct mp_file {
+	atomic_t count;
+	struct mp_struct *mp;
+	struct net *net;
+};
+
+struct mp_sock {
+	struct sock		sk;
+	struct mp_struct	*mp;
+};
+
+static int mp_dev_change_flags(struct net_device *dev, unsigned flags)
+{
+	int ret = 0;
+
+	rtnl_lock();
+	ret = dev_change_flags(dev, flags);
+	rtnl_unlock();
+
+	if (ret < 0)
+		printk(KERN_ERR "failed to change dev state of %s", dev->name);
+
+	return ret;
+}
+
+/* The main function to allocate external buffers */
+static struct skb_ext_page *page_ctor(struct mpassthru_port *port,
[PATCH v11 12/17] Add a kconfig entry and make entry for mp device.
From: Xin Xiaohui xiaohui@intel.com

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 drivers/vhost/Kconfig  |   10 ++
 drivers/vhost/Makefile |    2 ++
 2 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index e4e2fd1..a6b8cbf 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -9,3 +9,13 @@ config VHOST_NET
 	  To compile this driver as a module, choose M here: the module will
 	  be called vhost_net.
 
+config MEDIATE_PASSTHRU
+	tristate "mediate passthru network driver (EXPERIMENTAL)"
+	depends on VHOST_NET
+	---help---
+	  zerocopy network I/O support, we call it as mediate passthru to
+	  be distiguish with hardare passthru.
+
+	  To compile this driver as a module, choose M here: the module will
+	  be called mpassthru.
+
diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
index 72dd020..c18b9fc 100644
--- a/drivers/vhost/Makefile
+++ b/drivers/vhost/Makefile
@@ -1,2 +1,4 @@
 obj-$(CONFIG_VHOST_NET) += vhost_net.o
 vhost_net-y := vhost.o net.o
+
+obj-$(CONFIG_MEDIATE_PASSTHRU) += mpassthru.o
-- 
1.7.3
[PATCH v11 14/17] Provide multiple submits and asynchronous notifications.
From: Xin Xiaohui xiaohui@intel.com

The vhost-net backend currently supports only synchronous send/recv operations. This patch adds multiple submits and asynchronous notifications, which are needed for the zero-copy case.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
---
 drivers/vhost/net.c   |  341 +
 drivers/vhost/vhost.c |   79
 drivers/vhost/vhost.h |   15 ++
 3 files changed, 407 insertions(+), 28 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index b38abc6..44f4b15 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -24,6 +24,8 @@
 #include <linux/if_arp.h>
 #include <linux/if_tun.h>
 #include <linux/if_macvlan.h>
+#include <linux/mpassthru.h>
+#include <linux/aio.h>
 
 #include <net/sock.h>
 
@@ -39,6 +41,8 @@ enum {
 	VHOST_NET_VQ_MAX = 2,
 };
 
+static struct kmem_cache *notify_cache;
+
 enum vhost_net_poll_state {
 	VHOST_NET_POLL_DISABLED = 0,
 	VHOST_NET_POLL_STARTED = 1,
@@ -49,6 +53,7 @@ struct vhost_net {
 	struct vhost_dev dev;
 	struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
 	struct vhost_poll poll[VHOST_NET_VQ_MAX];
+	struct kmem_cache *cache;
 	/* Tells us whether we are polling a socket for TX.
 	 * We only do this when socket buffer fills up.
 	 * Protected by tx vq lock.
 	 */
@@ -93,11 +98,183 @@ static void tx_poll_start(struct vhost_net *net, struct socket *sock)
 	net->tx_poll_state = VHOST_NET_POLL_STARTED;
 }
 
+struct kiocb *notify_dequeue(struct vhost_virtqueue *vq)
+{
+	struct kiocb *iocb = NULL;
+	unsigned long flags;
+
+	spin_lock_irqsave(&vq->notify_lock, flags);
+	if (!list_empty(&vq->notifier)) {
+		iocb = list_first_entry(&vq->notifier,
+				struct kiocb, ki_list);
+		list_del(&iocb->ki_list);
+	}
+	spin_unlock_irqrestore(&vq->notify_lock, flags);
+	return iocb;
+}
+
+static void handle_iocb(struct kiocb *iocb)
+{
+	struct vhost_virtqueue *vq = iocb->private;
+	unsigned long flags;
+
+	spin_lock_irqsave(&vq->notify_lock, flags);
+	list_add_tail(&iocb->ki_list, &vq->notifier);
+	spin_unlock_irqrestore(&vq->notify_lock, flags);
+}
+
+static int is_async_vq(struct vhost_virtqueue *vq)
+{
+	return (vq->link_state == VHOST_VQ_LINK_ASYNC);
+}
+
+static void handle_async_rx_events_notify(struct vhost_net *net,
+					  struct vhost_virtqueue *vq,
+					  struct socket *sock)
+{
+	struct kiocb *iocb = NULL;
+	struct vhost_log *vq_log = NULL;
+	int rx_total_len = 0;
+	unsigned int head, log, in, out;
+	int size;
+
+	if (!is_async_vq(vq))
+		return;
+
+	if (sock->sk->sk_data_ready)
+		sock->sk->sk_data_ready(sock->sk, 0);
+
+	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
+		vq->log : NULL;
+
+	while ((iocb = notify_dequeue(vq)) != NULL) {
+		if (!iocb->ki_left) {
+			vhost_add_used_and_signal(&net->dev, vq,
+					iocb->ki_pos, iocb->ki_nbytes);
+			size = iocb->ki_nbytes;
+			head = iocb->ki_pos;
+			rx_total_len += iocb->ki_nbytes;
+
+			if (iocb->ki_dtor)
+				iocb->ki_dtor(iocb);
+			kmem_cache_free(net->cache, iocb);
+
+			/* when log is enabled, recomputing the log is needed,
+			 * since these buffers are in async queue, may not get
+			 * the log info before.
+			 */
+			if (unlikely(vq_log)) {
+				if (!log)
+					__vhost_get_desc(&net->dev, vq, vq->iov,
+							ARRAY_SIZE(vq->iov),
+							&out, &in, vq_log,
+							&log, head);
+				vhost_log_write(vq, vq_log, log, size);
+			}
+			if (unlikely(rx_total_len >= VHOST_NET_WEIGHT)) {
+				vhost_poll_queue(&vq->poll);
+				break;
+			}
+		} else {
+			int i = 0;
+			int count = iocb->ki_left;
+			int hc = count;
+			while (count--) {
+				if (iocb) {
+					vq->heads[i].id = iocb->ki_pos;
+					vq->heads[i].len = iocb->ki_nbytes;
+					size = iocb->ki_nbytes;
+					head = iocb->ki_pos;
+					rx_total_len +=
[PATCH v11 17/17] Add two new ioctls for mp device.
From: Xin Xiaohui xiaohui@intel.com

This patch adds two ioctls for the mp device. One lets userspace query how much locked memory is needed for the mp device to run smoothly. The other lets userspace set how much locked memory it really wants.

---
 drivers/vhost/mpassthru.c |  103 +++--
 include/linux/mpassthru.h |    2 +
 2 files changed, 54 insertions(+), 51 deletions(-)

diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
index d86d94c..e3a0199 100644
--- a/drivers/vhost/mpassthru.c
+++ b/drivers/vhost/mpassthru.c
@@ -67,6 +67,8 @@ static int debug;
 #define COPY_THRESHOLD (L1_CACHE_BYTES * 4)
 #define COPY_HDR_LEN   (L1_CACHE_BYTES < 64 ? 64 : L1_CACHE_BYTES)
 
+#define DEFAULT_NEED	((8192*2*2)*4096)
+
 struct frag {
 	u16	offset;
 	u16	size;
@@ -110,7 +112,8 @@ struct page_ctor {
 	int			rq_len;
 	spinlock_t		read_lock;
 	/* record the locked pages */
-	int			lock_pages;
+	int			locked_pages;
+	int			cur_pages;
 	struct rlimit		o_rlim;
 	struct net_device	*dev;
 	struct mpassthru_port	port;
@@ -122,6 +125,7 @@ struct mp_struct {
 	struct net_device	*dev;
 	struct page_ctor	*ctor;
 	struct socket		socket;
+	struct task_struct	*user;
 
 #ifdef MPASSTHRU_DEBUG
 	int			debug;
@@ -231,7 +235,8 @@ static int page_ctor_attach(struct mp_struct *mp)
 	ctor->port.ctor = page_ctor;
 	ctor->port.sock = &mp->socket;
 	ctor->port.hash = mp_lookup;
-	ctor->lock_pages = 0;
+	ctor->locked_pages = 0;
+	ctor->cur_pages = 0;
 
 	/* locked by mp_mutex */
 	dev->mp_port = &ctor->port;
@@ -264,37 +269,6 @@ struct page_info *info_dequeue(struct page_ctor *ctor)
 	return info;
 }
 
-static int set_memlock_rlimit(struct page_ctor *ctor, int resource,
-			      unsigned long cur, unsigned long max)
-{
-	struct rlimit new_rlim, *old_rlim;
-	int retval;
-
-	if (resource != RLIMIT_MEMLOCK)
-		return -EINVAL;
-	new_rlim.rlim_cur = cur;
-	new_rlim.rlim_max = max;
-
-	old_rlim = current->signal->rlim + resource;
-
-	/* remember the old rlimit value when backend enabled */
-	ctor->o_rlim.rlim_cur = old_rlim->rlim_cur;
-	ctor->o_rlim.rlim_max = old_rlim->rlim_max;
-
-	if ((new_rlim.rlim_max > old_rlim->rlim_max) &&
-			!capable(CAP_SYS_RESOURCE))
-		return -EPERM;
-
-	retval = security_task_setrlimit(resource, &new_rlim);
-	if (retval)
-		return retval;
-
-	task_lock(current->group_leader);
-	*old_rlim = new_rlim;
-	task_unlock(current->group_leader);
-	return 0;
-}
-
 static void relinquish_resource(struct page_ctor *ctor)
 {
 	if (!(ctor->dev->flags & IFF_UP) &&
@@ -323,7 +297,7 @@ static void mp_ki_dtor(struct kiocb *iocb)
 	} else
 		info->ctor->wq_len--;
 	/* Decrement the number of locked pages */
-	info->ctor->lock_pages -= info->pnum;
+	info->ctor->cur_pages -= info->pnum;
 	kmem_cache_free(ext_page_info_cache, info);
 	relinquish_resource(info->ctor);
 
@@ -357,6 +331,7 @@ static int page_ctor_detach(struct mp_struct *mp)
 {
 	struct page_ctor *ctor;
 	struct page_info *info;
+	struct task_struct *tsk = mp->user;
 	int i;
 
 	/* locked by mp_mutex */
@@ -375,9 +350,9 @@ static int page_ctor_detach(struct mp_struct *mp)
 
 	relinquish_resource(ctor);
 
-	set_memlock_rlimit(ctor, RLIMIT_MEMLOCK,
-			   ctor->o_rlim.rlim_cur,
-			   ctor->o_rlim.rlim_max);
+	down_write(&tsk->mm->mmap_sem);
+	tsk->mm->locked_vm -= ctor->locked_pages;
+	up_write(&tsk->mm->mmap_sem);
 
 	/* locked by mp_mutex */
 	ctor->dev->mp_port = NULL;
@@ -514,7 +489,6 @@ static struct page_info *mp_hash_delete(struct page_ctor *ctor,
 {
 	key_mp_t key = mp_hash(info->pages[0], HASH_BUCKETS);
 	struct page_info *tmp = NULL;
-	int i;
 
 	tmp = ctor->hash_table[key];
 	while (tmp) {
@@ -565,14 +539,11 @@ static struct page_info *alloc_page_info(struct page_ctor *ctor,
 	int rc;
 	int i, j, n = 0;
 	int len;
-	unsigned long base, lock_limit;
+	unsigned long base;
 	struct page_info *info = NULL;
 
-	lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur;
-	lock_limit >>= PAGE_SHIFT;
-
-	if (ctor->lock_pages + count > lock_limit && npages) {
-		printk(KERN_INFO "exceed the locked memory rlimit.");
+	if (ctor->cur_pages + count > ctor->locked_pages) {
+		printk(KERN_INFO "Exceed memory lock rlimt.");
 		return NULL;
 	}
 
@@ -634,7 +605,7 @@ static struct page_info *alloc_page_info(struct page_ctor *ctor,
[PATCH v11 15/17] An example of how to modify a NIC driver to use the napi_gro_frags() interface
From: Xin Xiaohui xiaohui@intel.com

This example is based on the ixgbe driver. It provides the API is_rx_buffer_mapped_as_page() to indicate whether the driver uses the napi_gro_frags() interface or not. The example allocates 2 pages for DMA per ring descriptor using netdev_alloc_page(). When packets arrive, it uses napi_gro_frags() to allocate the skb and receive the packets.

---
 drivers/net/ixgbe/ixgbe.h      |    3 +
 drivers/net/ixgbe/ixgbe_main.c |  151
 2 files changed, 125 insertions(+), 29 deletions(-)

diff --git a/drivers/net/ixgbe/ixgbe.h b/drivers/net/ixgbe/ixgbe.h
index 79c35ae..fceffc5 100644
--- a/drivers/net/ixgbe/ixgbe.h
+++ b/drivers/net/ixgbe/ixgbe.h
@@ -131,6 +131,9 @@ struct ixgbe_rx_buffer {
 	struct page *page;
 	dma_addr_t page_dma;
 	unsigned int page_offset;
+	u16 mapped_as_page;
+	struct page *page_skb;
+	unsigned int page_skb_offset;
 };
 
 struct ixgbe_queue_stats {
diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index 6c00ee4..905d6d2 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -688,6 +688,12 @@ static inline void ixgbe_release_rx_desc(struct ixgbe_hw *hw,
 	IXGBE_WRITE_REG(hw, IXGBE_RDT(rx_ring->reg_idx), val);
 }
 
+static bool is_rx_buffer_mapped_as_page(struct ixgbe_rx_buffer *bi,
+					struct net_device *dev)
+{
+	return true;
+}
+
 /**
  * ixgbe_alloc_rx_buffers - Replace used receive buffers; packet split
  * @adapter: address of board private structure
@@ -704,13 +710,17 @@ static void ixgbe_alloc_rx_buffers(struct ixgbe_adapter *adapter,
 	i = rx_ring->next_to_use;
 	bi = &rx_ring->rx_buffer_info[i];
 
+
 	while (cleaned_count--) {
 		rx_desc = IXGBE_RX_DESC_ADV(*rx_ring, i);
 
+		bi->mapped_as_page =
+			is_rx_buffer_mapped_as_page(bi, adapter->netdev);
+
 		if (!bi->page_dma &&
 		    (rx_ring->flags & IXGBE_RING_RX_PS_ENABLED)) {
 			if (!bi->page) {
-				bi->page = alloc_page(GFP_ATOMIC);
+				bi->page = netdev_alloc_page(adapter->netdev);
 				if (!bi->page) {
 					adapter->alloc_rx_page_failed++;
 					goto no_buffers;
@@ -727,7 +737,7 @@ static void ixgbe_alloc_rx_buffers(struct ixgbe_adapter *adapter,
 						    PCI_DMA_FROMDEVICE);
 		}
 
-		if (!bi->skb) {
+		if (!bi->mapped_as_page && !bi->skb) {
 			struct sk_buff *skb;
 			/* netdev_alloc_skb reserves 32 bytes up front!! */
 			uint bufsz = rx_ring->rx_buf_len + SMP_CACHE_BYTES;
@@ -747,6 +757,19 @@ static void ixgbe_alloc_rx_buffers(struct ixgbe_adapter *adapter,
 						 rx_ring->rx_buf_len,
 						 PCI_DMA_FROMDEVICE);
 		}
+
+		if (bi->mapped_as_page && !bi->page_skb) {
+			bi->page_skb = netdev_alloc_page(adapter->netdev);
+			if (!bi->page_skb) {
+				adapter->alloc_rx_page_failed++;
+				goto no_buffers;
+			}
+			bi->page_skb_offset = 0;
+			bi->dma = pci_map_page(pdev, bi->page_skb,
+					bi->page_skb_offset,
+					(PAGE_SIZE / 2),
+					PCI_DMA_FROMDEVICE);
+		}
 		/* Refresh the desc even if buffer_addrs didn't change because
 		 * each write-back erases this info. */
 		if (rx_ring->flags & IXGBE_RING_RX_PS_ENABLED) {
@@ -823,6 +846,13 @@ struct ixgbe_rsc_cb {
 	dma_addr_t dma;
 };
 
+static bool is_no_buffer(struct ixgbe_rx_buffer *rx_buffer_info)
+{
+	return (!rx_buffer_info->skb ||
+		!rx_buffer_info->page_skb) &&
+		!rx_buffer_info->page;
+}
+
 #define IXGBE_RSC_CB(skb) ((struct ixgbe_rsc_cb *)(skb)->cb)
 
 static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
@@ -832,6 +862,7 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 	struct ixgbe_adapter *adapter = q_vector->adapter;
 	struct net_device *netdev = adapter->netdev;
 	struct pci_dev *pdev = adapter->pdev;
+	struct napi_struct *napi = &q_vector->napi;
 	union ixgbe_adv_rx_desc *rx_desc, *next_rxd;
 	struct ixgbe_rx_buffer *rx_buffer_info, *next_buffer;
 	struct sk_buff *skb;
@@ -868,29 +899,71 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 			len = le16_to_cpu(rx_desc->wb.upper.length);
 		}
 
+		if
[PATCH v11 00/17] Provide a zero-copy method on KVM virtio-net.
We provide a zero-copy method by which the driver side can get external buffers for DMA. Here "external" means the driver does not use kernel space to allocate skb buffers. Currently the external buffers can come from the guest virtio-net driver.

The idea is simple: pin the guest VM user space and then let the host NIC driver DMA directly to it. The patches are based on the vhost-net backend driver. We add a device which provides proto_ops (sendmsg/recvmsg) to vhost-net to send/recv directly to/from the NIC driver. A KVM guest that uses the vhost-net backend may bind any ethX interface on the host side to get copyless data transfer through the guest virtio-net frontend.

patch 01-10: net core and kernel changes.
patch 11-13: new device as interface to manipulate external buffers.
patch 14:    for vhost-net.
patch 15:    An example of modifying a NIC driver to use napi_gro_frags().
patch 16:    An example of how to get guest buffers based on a driver that
             uses napi_gro_frags().
patch 17:    A patch to address comments from Michael S. Tsirkin to add 2
             new ioctls in mp device. We split it out here to make it
             easier to review.

The guest virtio-net driver submits multiple requests through the vhost-net backend driver to the kernel. The requests are queued and then completed after the corresponding actions in h/w are done.

For read, user space buffers are dispensed to the NIC driver for rx when a page constructor API is invoked. This means NICs can allocate user buffers from a page constructor. We add a hook in the netif_receive_skb() function to intercept the incoming packets and notify the zero-copy device.

For write, the zero-copy device may allocate a new host skb, put the payload on skb_shinfo(skb)->frags, and copy the header to skb->data. The request remains pending until the skb is transmitted by h/w.

We provide multiple submits and asynchronous notification to vhost-net too.

Our goal is to improve the bandwidth and reduce the CPU usage. Exact performance data will be provided later.
What we have not done yet:
	Performance tuning

What we have done in v1:
	polish the RCU usage
	deal with write logging in asynchronous mode in vhost
	add notifier block for mp device
	rename page_ctor to mp_port in netdevice.h to make it look generic
	add mp_dev_change_flags() for mp device to change NIC state
	add CONFIG_VHOST_MPASSTHRU to limit the usage when module is not loaded
	a small fix for missing dev_put when fail
	using dynamic minor instead of static minor number
	a __KERNEL__ protect to mp_get_sock()

What we have done in v2:
	remove most of the RCU usage, since the ctor pointer is only
	changed by BIND/UNBIND ioctl, and during that time, NIC will be
	stopped to get good cleanup (all outstanding requests are finished),
	so the ctor pointer cannot be raced into wrong situation.

	Replace the struct vhost_notifier with struct kiocb.
	Let vhost-net backend alloc/free the kiocb and transfer them
	via sendmsg/recvmsg.

	use get_user_pages_fast() and set_page_dirty_lock() when read.

	Add some comments for netdev_mp_port_prep() and handle_mpassthru().

What we have done in v3:
	the async write logging is rewritten
	a drafted synchronous write function for qemu live migration
	a limit for locked pages from get_user_pages_fast() to prevent DoS
	by using RLIMIT_MEMLOCK

What we have done in v4:
	add iocb completion callback from vhost-net to queue iocb in mp device
	replace vq->receiver by mp_sock_data_ready()
	remove stuff in mp device which access structures from vhost-net
	modify skb_reserve() to ignore host NIC driver reserved space
	rebase to the latest vhost tree
	split large patches into small pieces, especially for net core part.

What we have done in v5:
	address Arnd Bergmann's comments
		-remove IFF_MPASSTHRU_EXCL flag in mp device
		-Add CONFIG_COMPAT macro
		-remove mp_release ops
	move dev_is_mpassthru() as inline func
	fix a bug in memory relinquish
	Apply to current git (2.6.34-rc6) tree.
What we have done in v6:
	move create_iocb() out of page_dtor which may happen in interrupt context
		-This removes the potential issues with a lock being taken
		 in interrupt context
	make the cache used by mp, vhost as static, and created/destroyed during
	modules init/exit functions.
		-This allows multiple mp guests to be created at the same time.

What we have done in v7:
	some cleanup prepared to support PS mode

What we have done in v8:
	discarding the modifications to point skb->data to guest buffer directly.
	Add code to modify driver to support napi_gro_frags() with Herbert's comments.
	To support PS mode.
	Add mergeable
[PATCH v11 01/17] Add a new structure for skb buffer from external.
From: Xin Xiaohui xiaohui@intel.com

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 include/linux/skbuff.h |    9 +
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 124f90c..74af06c 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -203,6 +203,15 @@ struct skb_shared_info {
 	void *		destructor_arg;
 };
 
+/* The structure is for a skb which pages may point to
+ * an external buffer, which is not allocated from kernel space.
+ * It also contains a destructor for itself.
+ */
+struct skb_ext_page {
+	struct page *page;
+	void	(*dtor)(struct skb_ext_page *);
+};
+
 /* We divide dataref into two halves. The higher 16 bits hold references
  * to the payload part of skb->data. The lower 16 bits hold references to
  * the entire skb->data. A clone of a headerless skb holds the length of
-- 
1.7.3
[PATCH v11 11/17] Add header file for mp device.
From: Xin Xiaohui xiaohui@intel.com

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 include/linux/mpassthru.h |   25 +
 1 files changed, 25 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/mpassthru.h

diff --git a/include/linux/mpassthru.h b/include/linux/mpassthru.h
new file mode 100644
index 000..ba8f320
--- /dev/null
+++ b/include/linux/mpassthru.h
@@ -0,0 +1,25 @@
+#ifndef __MPASSTHRU_H
+#define __MPASSTHRU_H
+
+#include <linux/types.h>
+#include <linux/if_ether.h>
+
+/* ioctl defines */
+#define MPASSTHRU_BINDDEV	_IOW('M', 213, int)
+#define MPASSTHRU_UNBINDDEV	_IO('M', 214)
+
+#ifdef __KERNEL__
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+struct socket *mp_get_socket(struct file *);
+#else
+#include <linux/err.h>
+#include <linux/errno.h>
+struct file;
+struct socket;
+static inline struct socket *mp_get_socket(struct file *f)
+{
+	return ERR_PTR(-EINVAL);
+}
+#endif /* CONFIG_MEDIATE_PASSTHRU */
+#endif /* __KERNEL__ */
+#endif /* __MPASSTHRU_H */
-- 
1.7.3
[PATCH v11 10/17] Add a hook to intercept external buffers from NIC driver.
From: Xin Xiaohui xiaohui@intel.com

The hook is called in netif_receive_skb().

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 net/core/dev.c |   35 +++
 1 files changed, 35 insertions(+), 0 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 636f11b..4b379b1 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2517,6 +2517,37 @@ err:
 EXPORT_SYMBOL(netdev_mp_port_prep);
 #endif
 
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+/* Add a hook to intercept mediate passthru(zero-copy) packets,
+ * and insert it to the socket queue owned by mp_port specially.
+ */
+static inline struct sk_buff *handle_mpassthru(struct sk_buff *skb,
+					       struct packet_type **pt_prev,
+					       int *ret,
+					       struct net_device *orig_dev)
+{
+	struct mpassthru_port *mp_port = NULL;
+	struct sock *sk = NULL;
+
+	if (!dev_is_mpassthru(skb->dev))
+		return skb;
+	mp_port = skb->dev->mp_port;
+
+	if (*pt_prev) {
+		*ret = deliver_skb(skb, *pt_prev, orig_dev);
+		*pt_prev = NULL;
+	}
+
+	sk = mp_port->sock->sk;
+	skb_queue_tail(&sk->sk_receive_queue, skb);
+	sk->sk_state_change(sk);
+
+	return NULL;
+}
+#else
+#define handle_mpassthru(skb, pt_prev, ret, orig_dev)	(skb)
+#endif
+
 /**
  *	netif_receive_skb - process receive buffer from network
  *	@skb: buffer to process
@@ -2598,6 +2629,10 @@ int netif_receive_skb(struct sk_buff *skb)
 ncls:
 #endif
 
+	/* To intercept mediate passthru(zero-copy) packets here */
+	skb = handle_mpassthru(skb, &pt_prev, &ret, orig_dev);
+	if (!skb)
+		goto out;
 	skb = handle_bridge(skb, &pt_prev, &ret, orig_dev);
 	if (!skb)
 		goto out;
--
1.7.3
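The hook above follows the same convention as handle_bridge(): a handler returns the skb to let processing continue, or NULL once it has consumed the packet. A minimal user-space sketch of that convention is given below. struct pkt, struct queue and all function names are hypothetical stand-ins, and the fixed-size queue replaces sk_receive_queue purely to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for sk_buff: only what the sketch needs. */
struct pkt {
	int from_mp_dev;	/* plays the role of dev_is_mpassthru(skb->dev) */
	int id;
};

#define QUEUE_MAX 8

/* Hypothetical stand-in for the socket receive queue owned by the port. */
struct queue {
	struct pkt *items[QUEUE_MAX];
	int len;
};

/* Returns the packet if it is not ours, NULL once it has been queued. */
static struct pkt *handle_mp(struct pkt *p, struct queue *mp_queue)
{
	if (!p->from_mp_dev)
		return p;
	/* A full queue silently drops the packet in this sketch. */
	if (mp_queue->len < QUEUE_MAX)
		mp_queue->items[mp_queue->len++] = p;
	return NULL;
}

/* A later handler in the chain; it only counts what it sees. */
static struct pkt *handle_other(struct pkt *p, int *seen)
{
	(*seen)++;
	return p;
}

/* Returns 1 if the packet reached the end of the chain, 0 if consumed. */
static int receive(struct pkt *p, struct queue *mp_queue, int *seen)
{
	assert(p != NULL);
	p = handle_mp(p, mp_queue);
	if (!p)
		return 0;
	p = handle_other(p, seen);
	if (!p)
		return 0;
	return 1;
}
```

Because the mp handler runs first, a packet from an mp device never reaches the later handlers, which matches the placement of the call ahead of handle_bridge() in the hunk above.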
[PATCH v11 09/17] Don't do skb recycle, if device use external buffer.
From: Xin Xiaohui xiaohui@intel.com

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 net/core/skbuff.c |    6 ++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index bbf4707..9b156bb 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -565,6 +565,12 @@ int skb_recycle_check(struct sk_buff *skb, int skb_size)
 	if (skb_shared(skb) || skb_cloned(skb))
 		return 0;
 
+	/* if the device wants to do mediate passthru, the skb may
+	 * get external buffer, so don't recycle
+	 */
+	if (dev_is_mpassthru(skb->dev))
+		return 0;
+
 	skb_release_head_state(skb);
 	shinfo = skb_shinfo(skb);
 	atomic_set(&shinfo->dataref, 1);
--
1.7.3
[PATCH v11 07/17] Modify netdev_alloc_page() to get external buffer
From: Xin Xiaohui xiaohui@intel.com

Currently, it can get external buffers from mp device.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 net/core/skbuff.c |   27 +++
 1 files changed, 27 insertions(+), 0 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 117d82b..1a61e2b 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -269,11 +269,38 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
 }
 EXPORT_SYMBOL(__netdev_alloc_skb);
 
+struct page *netdev_alloc_ext_pages(struct net_device *dev, int npages)
+{
+	struct mpassthru_port *port;
+	struct skb_ext_page *ext_page = NULL;
+
+	port = dev->mp_port;
+	if (!port)
+		goto out;
+	ext_page = port->ctor(port, NULL, npages);
+	if (ext_page)
+		return ext_page->page;
+out:
+	return NULL;
+
+}
+EXPORT_SYMBOL(netdev_alloc_ext_pages);
+
+struct page *netdev_alloc_ext_page(struct net_device *dev)
+{
+	return netdev_alloc_ext_pages(dev, 1);
+
+}
+EXPORT_SYMBOL(netdev_alloc_ext_page);
+
 struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask)
 {
 	int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;
 	struct page *page;
 
+	if (dev_is_mpassthru(dev))
+		return netdev_alloc_ext_page(dev);
+
 	page = alloc_pages_node(node, gfp_mask, 0);
 	return page;
 }
--
1.7.3
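The allocation path above dispatches on the device: an mp device asks its port's ctor for a page, everything else uses the normal allocator. A reduced user-space sketch of that dispatch is shown here. The types are cut-down stand-ins, calloc() stands in for alloc_pages_node(), the ctor signature drops the sk_buff argument, and treating "mp_port is set" as the mp-device test is an assumption, since dev_is_mpassthru() is not defined in the patches shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Reduced stand-ins for the kernel types (hypothetical). */
struct page { int external; };
struct skb_ext_page { struct page *page; };

struct port {
	/* Simplified ctor: the real one also takes an sk_buff pointer. */
	struct skb_ext_page *(*ctor)(struct port *port, int npages);
};

struct dev { struct port *mp_port; };

/* Hypothetical owner ctor: hands out one page from a static pool. */
static struct page pool_page = { .external = 1 };
static struct skb_ext_page pool_ext = { .page = &pool_page };

static struct skb_ext_page *pool_ctor(struct port *port, int npages)
{
	(void)port;
	return npages == 1 ? &pool_ext : NULL;
}

/* External page if the device has a port, ordinary allocation otherwise. */
static struct page *alloc_page_for(struct dev *dev)
{
	assert(dev != NULL);
	if (dev->mp_port) {	/* assumed meaning of dev_is_mpassthru() */
		struct skb_ext_page *ext = dev->mp_port->ctor(dev->mp_port, 1);

		return ext ? ext->page : NULL;
	}
	return calloc(1, sizeof(struct page));	/* stands in for alloc_pages_node() */
}
```

Note that, as in the patch, an mp device whose ctor fails gets NULL back rather than falling through to the ordinary allocator.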
[PATCH v11 08/17] Modify netdev_free_page() to release external buffer
From: Xin Xiaohui xiaohui@intel.com

Currently, it can get external buffers from mp device.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 include/linux/skbuff.h |    4 +++-
 net/core/skbuff.c      |   24
 2 files changed, 27 insertions(+), 1 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index ab29675..3d7f70e 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1512,9 +1512,11 @@ static inline struct page *netdev_alloc_page(struct net_device *dev)
 	return __netdev_alloc_page(dev, GFP_ATOMIC);
 }
 
+extern void __netdev_free_page(struct net_device *dev, struct page *page);
+
 static inline void netdev_free_page(struct net_device *dev, struct page *page)
 {
-	__free_page(page);
+	__netdev_free_page(dev, page);
 }
 
 /**
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 1a61e2b..bbf4707 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -306,6 +306,30 @@ struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask)
 }
 EXPORT_SYMBOL(__netdev_alloc_page);
 
+void netdev_free_ext_page(struct net_device *dev, struct page *page)
+{
+	struct skb_ext_page *ext_page = NULL;
+	if (dev_is_mpassthru(dev) && dev->mp_port->hash) {
+		ext_page = dev->mp_port->hash(dev, page);
+		if (ext_page)
+			ext_page->dtor(ext_page);
+		else
+			__free_page(page);
+	}
+}
+EXPORT_SYMBOL(netdev_free_ext_page);
+
+void __netdev_free_page(struct net_device *dev, struct page *page)
+{
+	if (dev_is_mpassthru(dev)) {
+		netdev_free_ext_page(dev, page);
+		return;
+	}
+
+	__free_page(page);
+}
+EXPORT_SYMBOL(__netdev_free_page);
+
 void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, int off,
 		int size)
 {
--
1.7.3
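On the free side, the patch above uses the port's hash callback to map a page back to its skb_ext_page, then either runs the destructor or frees the page normally. The following user-space sketch shows that lookup-then-release step in isolation. The linear table, lookup(), mark_returned() and the kernel_frees counter are all hypothetical; the series does not show how hash is implemented, and a counter stands in for __free_page().

```c
#include <assert.h>
#include <stddef.h>

/* Reduced stand-ins for the kernel types (hypothetical). */
struct page { int id; };

struct skb_ext_page {
	struct page *page;
	void (*dtor)(struct skb_ext_page *);
	int returned;		/* sketch-only flag set by the destructor */
};

#define TABLE_SIZE 4

/* Hypothetical page -> owner map; plays the role of mp_port->hash. */
static struct skb_ext_page *table[TABLE_SIZE];

static struct skb_ext_page *lookup(struct page *page)
{
	for (int i = 0; i < TABLE_SIZE; i++)
		if (table[i] && table[i]->page == page)
			return table[i];
	return NULL;
}

/* Hypothetical destructor: just records that the owner got the page back. */
static void mark_returned(struct skb_ext_page *ext)
{
	ext->returned = 1;
}

/* Counts pages released the ordinary way; stands in for __free_page(). */
static int kernel_frees;

static void free_page_for(struct page *page)
{
	struct skb_ext_page *ext;

	assert(page != NULL);
	ext = lookup(page);
	if (ext)
		ext->dtor(ext);	/* external page: give it back to its owner */
	else
		kernel_frees++;	/* ordinary page */
}
```

The point of the lookup is that the driver only holds a struct page pointer at free time, so something has to map it back to the owner's bookkeeping before the destructor can run.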
[PATCH v11 06/17] Use callback to deal with skb_release_data() specially.
From: Xin Xiaohui xiaohui@intel.com

If buffer is external, then use the callback to destruct buffers.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 include/linux/skbuff.h |    3 ++-
 net/core/skbuff.c      |    8
 2 files changed, 10 insertions(+), 1 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 74af06c..ab29675 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -197,10 +197,11 @@ struct skb_shared_info {
 	union skb_shared_tx tx_flags;
 	struct sk_buff	*frag_list;
 	struct skb_shared_hwtstamps hwtstamps;
-	skb_frag_t	frags[MAX_SKB_FRAGS];
 	/* Intermediate layers must ensure that destructor_arg
 	 * remains valid until skb destructor */
 	void *		destructor_arg;
+
+	skb_frag_t	frags[MAX_SKB_FRAGS];
 };
 
 /* The structure is for a skb which pages may point to
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 93c4e06..117d82b 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -217,6 +217,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 	shinfo->gso_type = 0;
 	shinfo->ip6_frag_id = 0;
 	shinfo->tx_flags.flags = 0;
+	shinfo->destructor_arg = NULL;
 	skb_frag_list_init(skb);
 	memset(&shinfo->hwtstamps, 0, sizeof(shinfo->hwtstamps));
 
@@ -350,6 +351,13 @@ static void skb_release_data(struct sk_buff *skb)
 		if (skb_has_frags(skb))
 			skb_drop_fraglist(skb);
 
+		if (skb->dev && dev_is_mpassthru(skb->dev)) {
+			struct skb_ext_page *ext_page =
+				skb_shinfo(skb)->destructor_arg;
+			if (ext_page && ext_page->dtor)
+				ext_page->dtor(ext_page);
+		}
+
 		kfree(skb->head);
 	}
 }
--
1.7.3
[PATCH v11 02/17] Add a new struct for device to manipulate external buffer.
From: Xin Xiaohui xiaohui@intel.com

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 include/linux/netdevice.h |   22 +-
 1 files changed, 21 insertions(+), 1 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index fa8b476..ba582e1 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -530,6 +530,25 @@ struct netdev_queue {
 	unsigned long		tx_dropped;
 } ____cacheline_aligned_in_smp;
 
+/* Add a structure in structure net_device, the new field is
+ * named as mp_port. It's for mediate passthru (zero-copy).
+ * It contains the capability for the net device driver,
+ * a socket, and an external buffer creator, external means
+ * skb buffer belongs to the device may not be allocated from
+ * kernel space.
+ */
+struct mpassthru_port {
+	int		hdr_len;
+	int		data_len;
+	int		npages;
+	unsigned	flags;
+	struct socket	*sock;
+	int		vnet_hlen;
+	struct skb_ext_page *(*ctor)(struct mpassthru_port *,
+				struct sk_buff *, int);
+	struct skb_ext_page *(*hash)(struct net_device *,
+				struct page *);
+};
 
 /*
  * This structure defines the management hooks for network devices.
@@ -952,7 +971,8 @@ struct net_device {
 	struct macvlan_port	*macvlan_port;
 	/* GARP */
 	struct garp_port	*garp_port;
-
+	/* mpassthru */
+	struct mpassthru_port	*mp_port;
 	/* class/net/name entry */
 	struct device		dev;
 	/* space for optional device, statistics, and wireless sysfs groups */
--
1.7.3
[PATCH v11 03/17] Add a ndo_mp_port_prep pointer to net_device_ops.
From: Xin Xiaohui xiaohui@intel.com

If the driver wants to allocate external buffers, it can export its
capabilities, such as the skb buffer header length and the page length
that can be DMA'd. The owner of the external buffers may use this.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 include/linux/netdevice.h |    4
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ba582e1..aba0308 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -710,6 +710,10 @@ struct net_device_ops {
 	int			(*ndo_fcoe_get_wwn)(struct net_device *dev,
 						    u64 *wwn, int type);
 #endif
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+	int			(*ndo_mp_port_prep)(struct net_device *dev,
+						struct mpassthru_port *port);
+#endif
 };
 
 /*
--
1.7.3
[PATCH v11 04/17] Add a function to let the external buffer owner query capability.
From: Xin Xiaohui xiaohui@intel.com

The external buffer owner can use the functions to get
the capability of the underlying NIC driver.
---
 include/linux/netdevice.h |    2 +
 net/core/dev.c            |   49 +
 2 files changed, 51 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index aba0308..5f192de 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1599,6 +1599,8 @@ extern gro_result_t	napi_frags_finish(struct napi_struct *napi,
 					  gro_result_t ret);
 extern struct sk_buff *	napi_frags_skb(struct napi_struct *napi);
 extern gro_result_t	napi_gro_frags(struct napi_struct *napi);
+extern int netdev_mp_port_prep(struct net_device *dev,
+				struct mpassthru_port *port);
 
 static inline void napi_free_frags(struct napi_struct *napi)
 {
diff --git a/net/core/dev.c b/net/core/dev.c
index 264137f..636f11b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2468,6 +2468,55 @@ void netif_nit_deliver(struct sk_buff *skb)
 	rcu_read_unlock();
 }
 
+/* To support mediate passthru(zero-copy) with NIC driver,
+ * we'd better query NIC driver for the capability it can
+ * provide, especially for packet split mode, now we only
+ * query for the header size, and the payload a descriptor
+ * may carry. If a driver does not use the API to export,
+ * then we may try to use a default value, currently,
+ * we use the default value from an IGB driver. Now,
+ * it's only called by mpassthru device.
+ */
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+int netdev_mp_port_prep(struct net_device *dev,
+		struct mpassthru_port *port)
+{
+	int rc;
+	int npages, data_len;
+	const struct net_device_ops *ops = dev->netdev_ops;
+
+	if (ops->ndo_mp_port_prep) {
+		rc = ops->ndo_mp_port_prep(dev, port);
+		if (rc)
+			return rc;
+	} else {
+		/* If the NIC driver did not report this,
+		 * then we try to use default value.
+		 */
+		port->hdr_len = 128;
+		port->data_len = 2048;
+		port->npages = 1;
+	}
+
+	if (port->hdr_len <= 0)
+		goto err;
+
+	npages = port->npages;
+	data_len = port->data_len;
+	if (npages <= 0 || npages > MAX_SKB_FRAGS ||
+			(data_len < PAGE_SIZE * (npages - 1) ||
+			 data_len > PAGE_SIZE * npages))
+		goto err;
+
+	return 0;
+err:
+	dev_warn(&dev->dev, "invalid page constructor parameters\n");
+
+	return -EINVAL;
+}
+EXPORT_SYMBOL(netdev_mp_port_prep);
+#endif
+
 /**
  * netif_receive_skb - process receive buffer from network
  * @skb: buffer to process
--
1.7.3
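The bounds check in netdev_mp_port_prep() is easier to read pulled out on its own: the header length must be positive, the page count must lie in 1..MAX_SKB_FRAGS, and data_len must fit in npages pages without fitting in npages - 1. A user-space sketch under stated assumptions follows: PAGE_SIZE is taken as 4096 and MAX_SKB_FRAGS as 18 (both vary by kernel configuration), and struct port_caps, port_caps_default() and port_caps_check() are hypothetical names, not part of the series.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define PAGE_SIZE	4096	/* assumed; architecture dependent */
#define MAX_SKB_FRAGS	18	/* assumed; derived from PAGE_SIZE in the kernel */

/* Hypothetical subset of struct mpassthru_port used by the check. */
struct port_caps {
	int hdr_len;
	int data_len;
	int npages;
};

/* Defaults used by the patch when the driver exports nothing. */
static void port_caps_default(struct port_caps *c)
{
	assert(c != NULL);
	c->hdr_len = 128;
	c->data_len = 2048;
	c->npages = 1;
}

/* Returns 0 if the parameters are usable, -EINVAL otherwise. */
static int port_caps_check(const struct port_caps *c)
{
	assert(c != NULL);
	if (c->hdr_len <= 0)
		return -EINVAL;
	if (c->npages <= 0 || c->npages > MAX_SKB_FRAGS)
		return -EINVAL;
	/* data_len must need exactly npages pages: no fewer, no more. */
	if (c->data_len < PAGE_SIZE * (c->npages - 1) ||
	    c->data_len > PAGE_SIZE * c->npages)
		return -EINVAL;
	return 0;
}
```

With these assumed constants the defaults pass, while for example 2048 bytes spread over two pages is rejected because a single page would have been enough.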