Re: [PATCH] fix kvmclock bug

2010-09-24 Thread Jan Kiszka
On 19.09.2010 02:15, Zachary Amsden wrote:
> For CPUs with unstable TSC, we null time offset between not just VCPU
> switches, but all preemptions of the kvm thread.  This makes a bug much
> more likely where the kvmclock values are updated before a successful
> exit from virt, causing an underflow.
>
> The null offsetting was added at : bf0fb4a42ba7eb362f4013bd2e93209666793e66
> The underflow happens with this additional patch :
> cf839f5da2b0779b9ec8b990f851fb4e7d681da0
>
> There is a secondary bug, which is that TSC fails to advance with real
> time on unstable TSC, but the fix is much more involved (it requires the
> TSC catchup code).
>
> For now, this patch is sufficient to get things working again for me.

...but not for me. I still face stuck (or infinitely slow) guests that
want to use kvmclock once tsc_unstable gets set. Or is this patch
addressing a different issue?

Jan





Re: [Autotest] [PATCH 07/18] KVM test: Add a subtest jumbo

2010-09-24 Thread pradeep
On Tue, 14 Sep 2010 19:25:32 -0300
Lucas Meneghel Rodrigues l...@redhat.com wrote:

> +session.close()
> +logging.info("Removing the temporary ARP entry")
> +utils.run("arp -d %s -i %s" % (ip, ifname))
 

Hi Lucas,

I tried different combinations for this jumbo test case, but it didn't
work for me. I guess there is a problem when removing the ARP entry.
An ARP entry can be removed from the cache using the IP address and the
network interface (for example, eth0):

arp -d ip -i eth0


The error I got:

23:06:14 DEBUG| Running 'arp -d 192.168.122.104 -i rtl8139_0_5900'
23:06:14 ERROR| Test failed: CmdError: Command arp -d 192.168.122.104
-i rtl8139_0_5900 failed, rc=255, Command returned non-zero exit status
* Command: 
arp -d 192.168.122.104 -i rtl8139_0_5900
Exit status: 255
Duration: 0.00138521194458

stderr:
SIOCDARP(pub): No such file or directory

When I try manually, this one works for me.

Guest IP: 192.168.122.104

try arp -i 192.168.122.1 -i eth0


--Thanks
Pradeep
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] make-release: don't use --tmpdir mktemp option

2010-09-24 Thread Eduardo Habkost
This allows the script to work on older systems, where 'mktemp --tmpdir' is not
available.

Signed-off-by: Eduardo Habkost ehabk...@redhat.com
---
 kvm/scripts/make-release |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kvm/scripts/make-release b/kvm/scripts/make-release
index 64e77f9..c5f8c92 100755
--- a/kvm/scripts/make-release
+++ b/kvm/scripts/make-release
@@ -12,7 +12,7 @@ formal=
 
 releasedir=~/sf-release
 [[ -z $TMP ]] && TMP=/tmp
-tmpdir=`mktemp -d --tmpdir=$TMP qemu-kvm-make-release.XX`
+tmpdir=`mktemp -d $TMP/qemu-kvm-make-release.XX`
 while [[ $1 = -* ]]; do
 opt=$1
 shift
-- 
1.7.2.1
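
A minimal sketch of the portable form the patch above switches to, assuming only that mktemp accepts a full path template ending in a run of X characters (GNU mktemp requires at least three). The function name make_tmpdir and the six-X suffix are illustrative and not taken from make-release:

```shell
#!/bin/sh
# Illustrative helper (not part of make-release): create a private
# temporary directory without relying on `mktemp --tmpdir`.
make_tmpdir() {
    # Fall back to /tmp when TMP is unset or empty, as the script does.
    base=${TMP:-/tmp}
    # The directory is part of the template itself, so no --tmpdir
    # option is needed.
    mktemp -d "$base/qemu-kvm-make-release.XXXXXX"
}

tmpdir=$(make_tmpdir) || exit 1
echo "working in $tmpdir"
rm -rf "$tmpdir"
```

Quoting the template keeps the call safe if TMP contains spaces.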



[PATCH 2/2] make-release: don't use --mtime and --transform tar options

2010-09-24 Thread Eduardo Habkost
Those options are not available on older systems.

Instead of --transform, just create the file inside the expected directory.

Instead of --mtime, use 'touch' to set file mtime before running tar.

Signed-off-by: Eduardo Habkost ehabk...@redhat.com
---
 kvm/scripts/make-release |   18 ++
 1 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/kvm/scripts/make-release b/kvm/scripts/make-release
index c5f8c92..56302c3 100755
--- a/kvm/scripts/make-release
+++ b/kvm/scripts/make-release
@@ -52,20 +52,22 @@ mkdir -p $(dirname $tarball)
 git archive --prefix=$name/ --format=tar $commit > $tarball
 
 mtime=`git show --format=%ct $commit^{commit} --`
-tarargs="--owner=root --group=root --mtime=@$mtime"
+tarargs="--owner=root --group=root"
 
-mkdir -p $tmpdir
+mkdir -p $tmpdir/$name
 git cat-file -p ${commit}:roms | awk ' { print $4, $3 } ' \
-    > $tmpdir/EXTERNAL_DEPENDENCIES
-tar -rf $tarball --transform s,^,$name/, -C $tmpdir \
+    > $tmpdir/$name/EXTERNAL_DEPENDENCIES
+touch -d @$mtime $tmpdir/$name/EXTERNAL_DEPENDENCIES
+tar -rf $tarball -C $tmpdir \
 $tarargs \
-EXTERNAL_DEPENDENCIES
+$name/EXTERNAL_DEPENDENCIES
 rm -rf $tmpdir
 
 if [[ -n $formal ]]; then
-mkdir -p $tmpdir
-echo $name > $tmpdir/KVM_VERSION
-tar -rf $tarball --transform s,^,$name/, -C $tmpdir KVM_VERSION \
+mkdir -p $tmpdir/$name
+echo $name > $tmpdir/$name/KVM_VERSION
+touch -d @$mtime $tmpdir/$name/KVM_VERSION
+tar -rf $tarball -C $tmpdir $name/KVM_VERSION \
 $tarargs
 rm -rf $tmpdir
 fi
-- 
1.7.2.1
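
The two substitutions in the patch above can be sketched in isolation as follows. This assumes GNU touch (for the `-d @<epoch>` syntax), GNU stat and GNU tar. The archive name, directory name and timestamp are placeholders, whereas make-release derives them from git. Unlike the script, the sketch creates the archive with `-c` rather than appending with `-r`:

```shell
#!/bin/sh
# Illustrative values; make-release derives name and mtime from git.
set -e
name=qemu-kvm-demo
mtime=1285286400   # 2010-09-24 00:00:00 UTC

work=$(mktemp -d "${TMPDIR:-/tmp}/tar-demo.XXXXXX")
tarball=$work/$name.tar

# Instead of --transform: create the file under the directory name it
# should have inside the archive.
mkdir -p "$work/stage/$name"
echo "$name" > "$work/stage/$name/KVM_VERSION"

# Instead of --mtime: stamp the file itself before archiving it,
# then read the timestamp back to confirm it took effect.
touch -d "@$mtime" "$work/stage/$name/KVM_VERSION"
stamp=$(stat -c %Y "$work/stage/$name/KVM_VERSION")

tar -cf "$tarball" --owner=root --group=root \
    -C "$work/stage" "$name/KVM_VERSION"

# List the archive members, then clean up.
listing=$(tar -tf "$tarball")
echo "$listing"
rm -rf "$work"
```

Because only the file path is passed to tar, the staging directory itself is not added as an archive member, so its own timestamp does not matter.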



[PATCH 0/2] qemu-kvm: make make-release work on older systems

2010-09-24 Thread Eduardo Habkost
Hi,

The following patches allow make-release to be run on older systems (such as
RHEL5), where mktemp doesn't have the --tmpdir option and tar doesn't have the
--transform and --mtime options.

I made these changes to the scripts for my own use (to help with testing and
packaging qemu-kvm), but I don't know whether they are worth applying to
upstream qemu-kvm.

Eduardo Habkost (2):
  make-release: don't use --tmpdir mktemp option
  make-release: don't use --mtime and --transform tar options

 kvm/scripts/make-release |   20 +++-
 1 files changed, 11 insertions(+), 9 deletions(-)

-- 
1.7.2.1



Re: [PATCH 0/3] VFIO V4: VFIO driver: Non-privileged user level PCI drivers

2010-09-24 Thread Jesse Barnes
On Wed, 22 Sep 2010 14:15:57 -0700
Tom Lyon p...@cisco.com wrote:

> After a long summer break, it's tanned, it's rested, and it's ready to rumble!
>
> In this version:  *** REBASE to 2.6.35 ***
>
> There's new code using generic netlink messages which allows the kernel
> to notify the user level of weird events and allows the user level to
> respond. This is currently used to handle device removal (whether software
> or hardware driven), PCI error events, and system suspend & hibernate.
>
> The driver now supports devices which use multiple MSI interrupts, reflecting
> the actual number of interrupts allocated by the system to the user level.
>
> PCI config accesses are now done through the pci_user_{read,write}_config
> routines from drivers/pci/access.c.
 

I really don't like to encourage user level drivers (in fact I've
actively tried to kill them in our graphics stack), but I do understand
that they're convenient in many scenarios.

So assuming you can convince someone to apply the VFIO framework, I'm
ok with exporting the user level accessor functions (after all, we
export them to userland already via sysfs, so exporting them as GPL
symbols to another module is fine).

So you can add my Acked-by: Jesse Barnes jbar...@virtuousgeek.org to
the PCI parts, but don't take it as an endorsement of VFIO in
general! :) 

-- 
Jesse Barnes, Intel Open Source Technology Center


inconsistent use of $TMP vs. $TMPDIR

2010-09-24 Thread Eric Blake
I noticed today that various kvm source files are inconsistent on the 
use of $TMP vs. $TMPDIR:


$ git grep -l '\$TMP\b' | cat
scripts/Kbuild.include
tools/perf/feature-tests.mak
$ git grep -l '\$TMPDIR\b' | cat
Documentation/lguest/extract

According to POSIX, you should probably be using $TMPDIR instead of $TMP 
when referring to the preferred temporary directory location.


http://www.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap08.html#tag_08_03 
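
One way a script could follow this advice while still honouring an existing $TMP setting is sketched below. The function name pick_tmp and the fallback order (TMPDIR, then TMP, then /tmp) are assumptions of this sketch, not something the kvm scripts currently do:

```shell
#!/bin/sh
# Illustrative helper: prefer the POSIX TMPDIR, then the legacy TMP,
# and finally /tmp. The fallback order is an assumption of this sketch.
pick_tmp() {
    printf '%s\n' "${TMPDIR:-${TMP:-/tmp}}"
}

echo "temporary files go under: $(pick_tmp)"
```

The `:-` expansion treats an empty variable the same as an unset one, so `TMPDIR=` falls through to the next candidate.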



--
Eric Blake   ebl...@redhat.com    +1-801-349-2682
Libvirt virtualization library http://libvirt.org


[PATCH v11 05/17] Add a function to indicate if device use external buffer.

2010-09-24 Thread xiaohui . xin
From: Xin Xiaohui xiaohui@intel.com

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 include/linux/netdevice.h |5 +
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 5f192de..23d6ec0 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1602,6 +1602,11 @@ extern gro_result_t  napi_gro_frags(struct 
napi_struct *napi);
 extern int netdev_mp_port_prep(struct net_device *dev,
struct mpassthru_port *port);
 
+static inline bool dev_is_mpassthru(struct net_device *dev)
+{
+   return dev && dev->mp_port;
+}
+
 static inline void napi_free_frags(struct napi_struct *napi)
 {
kfree_skb(napi->skb);
-- 
1.7.3



[PATCH v11 13/17] Add mp(mediate passthru) device.

2010-09-24 Thread xiaohui . xin
From: Xin Xiaohui xiaohui@intel.com

This patch adds the mp (mediate passthru) device, which is currently
based on the vhost-net backend driver and provides proto_ops
to send/receive guest buffer data from/to the guest virtio-net
driver.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 drivers/vhost/mpassthru.c | 1407 +
 1 files changed, 1407 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vhost/mpassthru.c

diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
new file mode 100644
index 000..d86d94c
--- /dev/null
+++ b/drivers/vhost/mpassthru.c
@@ -0,0 +1,1407 @@
+/*
+ *  MPASSTHRU - Mediate passthrough device.
+ *  Copyright (C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ *  GNU General Public License for more details.
+ *
+ */
+
+#define DRV_NAME        "mpassthru"
+#define DRV_DESCRIPTION "Mediate passthru device driver"
+#define DRV_COPYRIGHT   "(C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G"
+
+#include <linux/compat.h>
+#include <linux/module.h>
+#include <linux/errno.h>
+#include <linux/kernel.h>
+#include <linux/major.h>
+#include <linux/slab.h>
+#include <linux/smp_lock.h>
+#include <linux/poll.h>
+#include <linux/fcntl.h>
+#include <linux/init.h>
+#include <linux/aio.h>
+
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/miscdevice.h>
+#include <linux/ethtool.h>
+#include <linux/rtnetlink.h>
+#include <linux/if.h>
+#include <linux/if_arp.h>
+#include <linux/if_ether.h>
+#include <linux/crc32.h>
+#include <linux/nsproxy.h>
+#include <linux/uaccess.h>
+#include <linux/virtio_net.h>
+#include <linux/mpassthru.h>
+#include <net/net_namespace.h>
+#include <net/netns/generic.h>
+#include <net/rtnetlink.h>
+#include <net/sock.h>
+
+#include <asm/system.h>
+
+/* Uncomment to enable debugging */
+/* #define MPASSTHRU_DEBUG 1 */
+
+#ifdef MPASSTHRU_DEBUG
+static int debug;
+
+#define DBG  if (mp->debug) printk
+#define DBG1 if (debug == 2) printk
+#else
+#define DBG(a...)
+#define DBG1(a...)
+#endif
+
+#define COPY_THRESHOLD (L1_CACHE_BYTES * 4)
+#define COPY_HDR_LEN   (L1_CACHE_BYTES  64 ? 64 : L1_CACHE_BYTES)
+
+struct frag {
+   u16 offset;
+   u16 size;
+};
+
+#define HASH_BUCKETS    (8192*2)
+
+struct page_info {
+   struct list_head        list;
+   struct page_info        *next;
+   struct page_info        *prev;
+   struct page             *pages[MAX_SKB_FRAGS];
+   struct sk_buff          *skb;
+   struct page_ctor        *ctor;
+
+   /* The pointer relayed to skb, to indicate
+    * it's a external allocated skb or kernel
+    */
+   struct skb_ext_page     ext_page;
+
+#define INFO_READ  0
+#define INFO_WRITE 1
+   unsigned                flags;
+   unsigned                pnum;
+
+   /* The fields after that is for backend
+    * driver, now for vhost-net.
+    */
+
+   struct kiocb            *iocb;
+   unsigned int            desc_pos;
+   struct iovec            hdr[2];
+   struct iovec            iov[MAX_SKB_FRAGS];
+};
+
+static struct kmem_cache *ext_page_info_cache;
+
+struct page_ctor {
+   struct list_head        readq;
+   int                     wq_len;
+   int                     rq_len;
+   spinlock_t              read_lock;
+   /* record the locked pages */
+   int                     lock_pages;
+   struct rlimit           o_rlim;
+   struct net_device       *dev;
+   struct mpassthru_port   port;
+   struct page_info        **hash_table;
+};
+
+struct mp_struct {
+   struct mp_file          *mfile;
+   struct net_device       *dev;
+   struct page_ctor        *ctor;
+   struct socket           socket;
+
+#ifdef MPASSTHRU_DEBUG
+   int                     debug;
+#endif
+};
+
+struct mp_file {
+   atomic_t count;
+   struct mp_struct *mp;
+   struct net *net;
+};
+
+struct mp_sock {
+   struct sock             sk;
+   struct mp_struct        *mp;
+};
+
+static int mp_dev_change_flags(struct net_device *dev, unsigned flags)
+{
+   int ret = 0;
+
+   rtnl_lock();
+   ret = dev_change_flags(dev, flags);
+   rtnl_unlock();
+
+   if (ret < 0)
+   printk(KERN_ERR "failed to change dev state of %s", dev->name);
+
+   return ret;
+}
+
+/* The main function to allocate external buffers */
+static struct skb_ext_page *page_ctor(struct mpassthru_port *port,
+  

[PATCH v11 12/17] Add a kconfig entry and make entry for mp device.

2010-09-24 Thread xiaohui . xin
From: Xin Xiaohui xiaohui@intel.com

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 drivers/vhost/Kconfig  |   10 ++
 drivers/vhost/Makefile |2 ++
 2 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index e4e2fd1..a6b8cbf 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -9,3 +9,13 @@ config VHOST_NET
  To compile this driver as a module, choose M here: the module will
  be called vhost_net.
 
+config MEDIATE_PASSTHRU
+   tristate "mediate passthru network driver (EXPERIMENTAL)"
+   depends on VHOST_NET
+   ---help---
+ Zero-copy network I/O support. We call it mediate passthru to
+ distinguish it from hardware passthru.
+
+ To compile this driver as a module, choose M here: the module will
+ be called mpassthru.
+
diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
index 72dd020..c18b9fc 100644
--- a/drivers/vhost/Makefile
+++ b/drivers/vhost/Makefile
@@ -1,2 +1,4 @@
 obj-$(CONFIG_VHOST_NET) += vhost_net.o
 vhost_net-y := vhost.o net.o
+
+obj-$(CONFIG_MEDIATE_PASSTHRU) += mpassthru.o
-- 
1.7.3



[PATCH v11 14/17]Provides multiple submits and asynchronous notifications.

2010-09-24 Thread xiaohui . xin
From: Xin Xiaohui xiaohui@intel.com

The vhost-net backend currently supports only synchronous send/recv
operations. This patch adds multiple submits and asynchronous
notifications, which are needed for the zero-copy case.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
---
 drivers/vhost/net.c   |  341 +
 drivers/vhost/vhost.c |   79 
 drivers/vhost/vhost.h |   15 ++
 3 files changed, 407 insertions(+), 28 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index b38abc6..44f4b15 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -24,6 +24,8 @@
 #include <linux/if_arp.h>
 #include <linux/if_tun.h>
 #include <linux/if_macvlan.h>
+#include <linux/mpassthru.h>
+#include <linux/aio.h>
 
 #include <net/sock.h>
 
@@ -39,6 +41,8 @@ enum {
VHOST_NET_VQ_MAX = 2,
 };
 
+static struct kmem_cache *notify_cache;
+
 enum vhost_net_poll_state {
VHOST_NET_POLL_DISABLED = 0,
VHOST_NET_POLL_STARTED = 1,
@@ -49,6 +53,7 @@ struct vhost_net {
struct vhost_dev dev;
struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
struct vhost_poll poll[VHOST_NET_VQ_MAX];
+   struct kmem_cache   *cache;
/* Tells us whether we are polling a socket for TX.
 * We only do this when socket buffer fills up.
 * Protected by tx vq lock. */
@@ -93,11 +98,183 @@ static void tx_poll_start(struct vhost_net *net, struct 
socket *sock)
net->tx_poll_state = VHOST_NET_POLL_STARTED;
 }
 
+struct kiocb *notify_dequeue(struct vhost_virtqueue *vq)
+{
+   struct kiocb *iocb = NULL;
+   unsigned long flags;
+
+   spin_lock_irqsave(&vq->notify_lock, flags);
+   if (!list_empty(&vq->notifier)) {
+   iocb = list_first_entry(&vq->notifier,
+   struct kiocb, ki_list);
+   list_del(&iocb->ki_list);
+   }
+   spin_unlock_irqrestore(&vq->notify_lock, flags);
+   return iocb;
+}
+
+static void handle_iocb(struct kiocb *iocb)
+{
+   struct vhost_virtqueue *vq = iocb->private;
+   unsigned long flags;
+
+   spin_lock_irqsave(&vq->notify_lock, flags);
+   list_add_tail(&iocb->ki_list, &vq->notifier);
+   spin_unlock_irqrestore(&vq->notify_lock, flags);
+}
+
+static int is_async_vq(struct vhost_virtqueue *vq)
+{
+   return (vq->link_state == VHOST_VQ_LINK_ASYNC);
+}
+
+static void handle_async_rx_events_notify(struct vhost_net *net,
+ struct vhost_virtqueue *vq,
+ struct socket *sock)
+{
+   struct kiocb *iocb = NULL;
+   struct vhost_log *vq_log = NULL;
+   int rx_total_len = 0;
+   unsigned int head, log, in, out;
+   int size;
+
+   if (!is_async_vq(vq))
+   return;
+
+   if (sock->sk->sk_data_ready)
+   sock->sk->sk_data_ready(sock->sk, 0);
+
+   vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
+   vq->log : NULL;
+
+   while ((iocb = notify_dequeue(vq)) != NULL) {
+   if (!iocb->ki_left) {
+   vhost_add_used_and_signal(&net->dev, vq,
+   iocb->ki_pos, iocb->ki_nbytes);
+   size = iocb->ki_nbytes;
+   head = iocb->ki_pos;
+   rx_total_len += iocb->ki_nbytes;
+
+   if (iocb->ki_dtor)
+   iocb->ki_dtor(iocb);
+   kmem_cache_free(net->cache, iocb);
+
+   /* when log is enabled, recomputing the log is needed,
+* since these buffers are in async queue, may not get
+* the log info before.
+*/
+   if (unlikely(vq_log)) {
+   if (!log)
+   __vhost_get_desc(&net->dev, vq, vq->iov,
+   ARRAY_SIZE(vq->iov),
+   &out, &in, vq_log,
+   &log, head);
+   vhost_log_write(vq, vq_log, log, size);
+   }
+   if (unlikely(rx_total_len >= VHOST_NET_WEIGHT)) {
+   vhost_poll_queue(&vq->poll);
+   break;
+   }
+   } else {
+   int i = 0;
+   int count = iocb->ki_left;
+   int hc = count;
+   while (count--) {
+   if (iocb) {
+   vq->heads[i].id = iocb->ki_pos;
+   vq->heads[i].len = iocb->ki_nbytes;
+   size = iocb->ki_nbytes;
+   head = iocb->ki_pos;
+   rx_total_len += 

[PATCH v11 17/17]add two new ioctls for mp device.

2010-09-24 Thread xiaohui . xin
From: Xin Xiaohui xiaohui@intel.com

This patch adds two ioctls to the mp device.
One lets userspace query how much locked memory is needed for the mp device
to run smoothly. The other lets userspace set how much locked memory
it actually wants.

---
 drivers/vhost/mpassthru.c |  103 +++--
 include/linux/mpassthru.h |2 +
 2 files changed, 54 insertions(+), 51 deletions(-)

diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
index d86d94c..e3a0199 100644
--- a/drivers/vhost/mpassthru.c
+++ b/drivers/vhost/mpassthru.c
@@ -67,6 +67,8 @@ static int debug;
 #define COPY_THRESHOLD (L1_CACHE_BYTES * 4)
 #define COPY_HDR_LEN   (L1_CACHE_BYTES  64 ? 64 : L1_CACHE_BYTES)
 
+#define DEFAULT_NEED   ((8192*2*2)*4096)
+
 struct frag {
u16 offset;
u16 size;
@@ -110,7 +112,8 @@ struct page_ctor {
int rq_len;
spinlock_t  read_lock;
/* record the locked pages */
-   int lock_pages;
+   int locked_pages;
+   int cur_pages;
struct rlimit   o_rlim;
struct net_device   *dev;
struct mpassthru_port   port;
@@ -122,6 +125,7 @@ struct mp_struct {
struct net_device   *dev;
struct page_ctor        *ctor;
struct socket   socket;
+   struct task_struct  *user;
 
 #ifdef MPASSTHRU_DEBUG
int debug;
@@ -231,7 +235,8 @@ static int page_ctor_attach(struct mp_struct *mp)
ctor->port.ctor = page_ctor;
ctor->port.sock = &mp->socket;
ctor->port.hash = mp_lookup;
-   ctor->lock_pages = 0;
+   ctor->locked_pages = 0;
+   ctor->cur_pages = 0;
 
/* locked by mp_mutex */
dev->mp_port = &ctor->port;
@@ -264,37 +269,6 @@ struct page_info *info_dequeue(struct page_ctor *ctor)
return info;
 }
 
-static int set_memlock_rlimit(struct page_ctor *ctor, int resource,
- unsigned long cur, unsigned long max)
-{
-   struct rlimit new_rlim, *old_rlim;
-   int retval;
-
-   if (resource != RLIMIT_MEMLOCK)
-   return -EINVAL;
-   new_rlim.rlim_cur = cur;
-   new_rlim.rlim_max = max;
-
-   old_rlim = current->signal->rlim + resource;
-
-   /* remember the old rlimit value when backend enabled */
-   ctor->o_rlim.rlim_cur = old_rlim->rlim_cur;
-   ctor->o_rlim.rlim_max = old_rlim->rlim_max;
-
-   if ((new_rlim.rlim_max > old_rlim->rlim_max) &&
-   !capable(CAP_SYS_RESOURCE))
-   return -EPERM;
-
-   retval = security_task_setrlimit(resource, &new_rlim);
-   if (retval)
-   return retval;
-
-   task_lock(current->group_leader);
-   *old_rlim = new_rlim;
-   task_unlock(current->group_leader);
-   return 0;
-}
-
 static void relinquish_resource(struct page_ctor *ctor)
 {
if (!(ctor->dev->flags & IFF_UP) &&
@@ -323,7 +297,7 @@ static void mp_ki_dtor(struct kiocb *iocb)
} else
info->ctor->wq_len--;
/* Decrement the number of locked pages */
-   info->ctor->lock_pages -= info->pnum;
+   info->ctor->cur_pages -= info->pnum;
kmem_cache_free(ext_page_info_cache, info);
relinquish_resource(info->ctor);
 
@@ -357,6 +331,7 @@ static int page_ctor_detach(struct mp_struct *mp)
 {
struct page_ctor *ctor;
struct page_info *info;
+   struct task_struct *tsk = mp->user;
int i;
 
/* locked by mp_mutex */
@@ -375,9 +350,9 @@ static int page_ctor_detach(struct mp_struct *mp)
 
relinquish_resource(ctor);
 
-   set_memlock_rlimit(ctor, RLIMIT_MEMLOCK,
-  ctor->o_rlim.rlim_cur,
-  ctor->o_rlim.rlim_max);
+   down_write(&tsk->mm->mmap_sem);
+   tsk->mm->locked_vm -= ctor->locked_pages;
+   up_write(&tsk->mm->mmap_sem);
 
/* locked by mp_mutex */
ctor->dev->mp_port = NULL;
@@ -514,7 +489,6 @@ static struct page_info *mp_hash_delete(struct page_ctor 
*ctor,
 {
key_mp_t key = mp_hash(info->pages[0], HASH_BUCKETS);
struct page_info *tmp = NULL;
-   int i;
 
tmp = ctor->hash_table[key];
while (tmp) {
@@ -565,14 +539,11 @@ static struct page_info *alloc_page_info(struct page_ctor 
*ctor,
int rc;
int i, j, n = 0;
int len;
-   unsigned long base, lock_limit;
+   unsigned long base;
struct page_info *info = NULL;
 
-   lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur;
-   lock_limit >>= PAGE_SHIFT;
-
-   if (ctor->lock_pages + count > lock_limit && npages) {
-   printk(KERN_INFO "exceed the locked memory rlimit.");
+   if (ctor->cur_pages + count > ctor->locked_pages) {
+   printk(KERN_INFO "Exceed memory lock rlimt.");
return NULL;
}
 
@@ -634,7 +605,7 @@ static struct page_info *alloc_page_info(struct page_ctor 
*ctor,
 

[PATCH v11 15/17]An example how to modifiy NIC driver to use napi_gro_frags() interface

2010-09-24 Thread xiaohui . xin
From: Xin Xiaohui xiaohui@intel.com

This example is based on the ixgbe driver.
It provides the API is_rx_buffer_mapped_as_page() to indicate
whether the driver uses the napi_gro_frags() interface.
The example allocates 2 pages for DMA per ring descriptor
using netdev_alloc_page(). When packets arrive, it uses
napi_gro_frags() to allocate the skb and to receive the packets.

---
 drivers/net/ixgbe/ixgbe.h  |3 +
 drivers/net/ixgbe/ixgbe_main.c |  151 
 2 files changed, 125 insertions(+), 29 deletions(-)

diff --git a/drivers/net/ixgbe/ixgbe.h b/drivers/net/ixgbe/ixgbe.h
index 79c35ae..fceffc5 100644
--- a/drivers/net/ixgbe/ixgbe.h
+++ b/drivers/net/ixgbe/ixgbe.h
@@ -131,6 +131,9 @@ struct ixgbe_rx_buffer {
struct page *page;
dma_addr_t page_dma;
unsigned int page_offset;
+   u16 mapped_as_page;
+   struct page *page_skb;
+   unsigned int page_skb_offset;
 };
 
 struct ixgbe_queue_stats {
diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index 6c00ee4..905d6d2 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -688,6 +688,12 @@ static inline void ixgbe_release_rx_desc(struct ixgbe_hw 
*hw,
IXGBE_WRITE_REG(hw, IXGBE_RDT(rx_ring->reg_idx), val);
 }
 
+static bool is_rx_buffer_mapped_as_page(struct ixgbe_rx_buffer *bi,
+   struct net_device *dev)
+{
+   return true;
+}
+
 /**
  * ixgbe_alloc_rx_buffers - Replace used receive buffers; packet split
  * @adapter: address of board private structure
@@ -704,13 +710,17 @@ static void ixgbe_alloc_rx_buffers(struct ixgbe_adapter 
*adapter,
i = rx_ring->next_to_use;
bi = &rx_ring->rx_buffer_info[i];
 
+
while (cleaned_count--) {
rx_desc = IXGBE_RX_DESC_ADV(*rx_ring, i);
 
+   bi->mapped_as_page =
+   is_rx_buffer_mapped_as_page(bi, adapter->netdev);
+
if (!bi->page_dma &&
(rx_ring->flags & IXGBE_RING_RX_PS_ENABLED)) {
if (!bi->page) {
-   bi->page = alloc_page(GFP_ATOMIC);
+   bi->page = netdev_alloc_page(adapter->netdev);
if (!bi->page) {
adapter->alloc_rx_page_failed++;
goto no_buffers;
@@ -727,7 +737,7 @@ static void ixgbe_alloc_rx_buffers(struct ixgbe_adapter 
*adapter,
PCI_DMA_FROMDEVICE);
}
 
-   if (!bi->skb) {
+   if (!bi->mapped_as_page && !bi->skb) {
struct sk_buff *skb;
/* netdev_alloc_skb reserves 32 bytes up front!! */
uint bufsz = rx_ring->rx_buf_len + SMP_CACHE_BYTES;
@@ -747,6 +757,19 @@ static void ixgbe_alloc_rx_buffers(struct ixgbe_adapter 
*adapter,
 rx_ring->rx_buf_len,
 PCI_DMA_FROMDEVICE);
}
+
+   if (bi->mapped_as_page && !bi->page_skb) {
+   bi->page_skb = netdev_alloc_page(adapter->netdev);
+   if (!bi->page_skb) {
+   adapter->alloc_rx_page_failed++;
+   goto no_buffers;
+   }
+   bi->page_skb_offset = 0;
+   bi->dma = pci_map_page(pdev, bi->page_skb,
+   bi->page_skb_offset,
+   (PAGE_SIZE / 2),
+   PCI_DMA_FROMDEVICE);
+   }
/* Refresh the desc even if buffer_addrs didn't change because
 * each write-back erases this info. */
if (rx_ring->flags & IXGBE_RING_RX_PS_ENABLED) {
@@ -823,6 +846,13 @@ struct ixgbe_rsc_cb {
dma_addr_t dma;
 };
 
+static bool is_no_buffer(struct ixgbe_rx_buffer *rx_buffer_info)
+{
+   return (!rx_buffer_info->skb ||
+   !rx_buffer_info->page_skb) &&
+   !rx_buffer_info->page;
+}
+
 #define IXGBE_RSC_CB(skb) ((struct ixgbe_rsc_cb *)(skb)->cb)
 
 static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
@@ -832,6 +862,7 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector 
*q_vector,
struct ixgbe_adapter *adapter = q_vector->adapter;
struct net_device *netdev = adapter->netdev;
struct pci_dev *pdev = adapter->pdev;
+   struct napi_struct *napi = &q_vector->napi;
union ixgbe_adv_rx_desc *rx_desc, *next_rxd;
struct ixgbe_rx_buffer *rx_buffer_info, *next_buffer;
struct sk_buff *skb;
@@ -868,29 +899,71 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector 
*q_vector,
len = le16_to_cpu(rx_desc->wb.upper.length);
}
 
+   if 

[PATCH v11 00/17] Provide a zero-copy method on KVM virtio-net.

2010-09-24 Thread xiaohui . xin
We provide a zero-copy method by which the driver side may get external
buffers to DMA into. Here "external" means the driver does not use kernel
space to allocate skb buffers. Currently the external buffers can come from
the guest virtio-net driver.

The idea is simple: pin the guest VM user space and then let the host NIC
driver DMA to it directly.
The patches are based on the vhost-net backend driver. We add a device
which provides proto_ops such as sendmsg/recvmsg to vhost-net to
send/recv directly to/from the NIC driver. A KVM guest that uses the
vhost-net backend may bind any ethX interface on the host side to
get copyless data transfer through the guest virtio-net frontend.

patch 01-10: net core and kernel changes.
patch 11-13: new device as an interface to manipulate external buffers.
patch 14:    for vhost-net.
patch 15:    An example of modifying a NIC driver to use napi_gro_frags().
patch 16:    An example of how to get guest buffers in a driver
             that uses napi_gro_frags().
patch 17:    A patch addressing comments from Michael S. Tsirkin
             to add 2 new ioctls to the mp device.
             We split it out here to make review easier.

The guest virtio-net driver submits multiple requests through the vhost-net
backend driver to the kernel. The requests are queued and then
completed after the corresponding actions in h/w are done.

For read, user space buffers are dispensed to the NIC driver for rx when
a page constructor API is invoked. This means NICs can allocate user buffers
from a page constructor. We add a hook in the netif_receive_skb() function
to intercept the incoming packets and notify the zero-copy device.

For write, the zero-copy device may allocate a new host skb, put the
payload on skb_shinfo(skb)->frags, and copy the header to skb->data.
The request remains pending until the skb is transmitted by h/w.

We provide multiple submits and asynchronous notification to
vhost-net too.

Our goal is to improve the bandwidth and reduce the CPU usage.
Exact performance data will be provided later.

What we have not done yet:
Performance tuning

what we have done in v1:
polish the RCU usage
deal with write logging in asynchronous mode in vhost
add notifier block for mp device
rename page_ctor to mp_port in netdevice.h to make it look generic
add mp_dev_change_flags() for mp device to change NIC state
add CONFIG_VHOST_MPASSTHRU to limit the usage when the module is not loaded
a small fix for missing dev_put on failure
use dynamic minor instead of static minor number
a __KERNEL__ protect to mp_get_sock()

what we have done in v2:

remove most of the RCU usage, since the ctor pointer is only
changed by the BIND/UNBIND ioctls, and during that time the NIC will be
stopped to get a good cleanup (all outstanding requests are finished),
so the ctor pointer cannot be raced into a wrong situation.

Replace the struct vhost_notifier with struct kiocb.
Let the vhost-net backend alloc/free the kiocb and transfer them
via sendmsg/recvmsg.

use get_user_pages_fast() and set_page_dirty_lock() when reading.

Add some comments for netdev_mp_port_prep() and handle_mpassthru().

what we have done in v3:
the async write logging is rewritten
a drafted synchronous write function for qemu live migration
a limit for locked pages from get_user_pages_fast() to prevent DoS
by using RLIMIT_MEMLOCK


what we have done in v4:
add iocb completion callback from vhost-net to queue iocb in mp device
replace vq->receiver by mp_sock_data_ready()
remove stuff in mp device which accesses structures from vhost-net
modify skb_reserve() to ignore host NIC driver reserved space
rebase to the latest vhost tree
split large patches into small pieces, especially for the net core part.


what we have done in v5:
address Arnd Bergmann's comments
-remove IFF_MPASSTHRU_EXCL flag in mp device
-Add CONFIG_COMPAT macro
-remove mp_release ops
move dev_is_mpassthru() as inline func
fix a bug in memory relinquish
Apply to current git (2.6.34-rc6) tree.

what we have done in v6:
move create_iocb() out of page_dtor, which may happen in interrupt
context
-This removes the potential issue of a lock being taken in interrupt context
make the caches used by mp and vhost static, and create/destroy them in the
modules' init/exit functions.
-This allows multiple mp guests to be created at the same time.

what we have done in v7:
some cleanup in preparation for supporting PS mode

what we have done in v8:
discard the modifications that point skb->data to the guest buffer
directly.
Add code to modify the driver to support napi_gro_frags(), following
Herbert's comments.
To support PS mode.
Add mergeable 

[PATCH v11 01/17] Add a new structure for skb buffer from external.

2010-09-24 Thread xiaohui . xin
From: Xin Xiaohui xiaohui@intel.com

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 include/linux/skbuff.h |9 +
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 124f90c..74af06c 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -203,6 +203,15 @@ struct skb_shared_info {
void *  destructor_arg;
 };
 
+/* The structure is for a skb which pages may point to
+ * an external buffer, which is not allocated from kernel space.
+ * It also contains a destructor for itself.
+ */
+struct skb_ext_page {
+   struct  page *page;
+   void(*dtor)(struct skb_ext_page *);
+};
+
 /* We divide dataref into two halves.  The higher 16 bits hold references
  * to the payload part of skb->data.  The lower 16 bits hold references to
  * the entire skb->data.  A clone of a headerless skb holds the length of
-- 
1.7.3

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
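
The dtor member is easiest to follow through the embedding pattern it allows. What follows is a user-space model, not code from this series: struct page is reduced to a stub, and struct owner_page_info, owner_dtor(), release_ext_page() and the container_of macro defined here are hypothetical names chosen for illustration. It assumes the buffer owner embeds struct skb_ext_page in its own bookkeeping record and recovers that record inside the destructor.

```c
#include <stddef.h>

/* Stub standing in for the kernel's struct page (assumption). */
struct page { int id; };

/* Mirrors the structure added by this patch. */
struct skb_ext_page {
	struct page *page;
	void (*dtor)(struct skb_ext_page *);
};

/* Hypothetical owner-side bookkeeping that embeds skb_ext_page. */
struct owner_page_info {
	int released;             /* set once the destructor has run */
	struct skb_ext_page ext;  /* the part handed to the networking code */
};

/* User-space equivalent of the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical destructor: recover the owner record and mark it released. */
static void owner_dtor(struct skb_ext_page *ext)
{
	struct owner_page_info *info =
		container_of(ext, struct owner_page_info, ext);

	info->released = 1;
}

/* What a consumer does when it is finished with the buffer. */
static void release_ext_page(struct skb_ext_page *ext)
{
	if (ext && ext->dtor)
		ext->dtor(ext);
}
```

The consumer only ever sees struct skb_ext_page, so the owner can change its private record without touching the networking code.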


[PATCH v11 11/17] Add header file for mp device.

2010-09-24 Thread xiaohui . xin
From: Xin Xiaohui xiaohui@intel.com

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 include/linux/mpassthru.h |   25 +
 1 files changed, 25 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/mpassthru.h

diff --git a/include/linux/mpassthru.h b/include/linux/mpassthru.h
new file mode 100644
index 000..ba8f320
--- /dev/null
+++ b/include/linux/mpassthru.h
@@ -0,0 +1,25 @@
+#ifndef __MPASSTHRU_H
+#define __MPASSTHRU_H
+
+#include <linux/types.h>
+#include <linux/if_ether.h>
+
+/* ioctl defines */
+#define MPASSTHRU_BINDDEV  _IOW('M', 213, int)
+#define MPASSTHRU_UNBINDDEV  _IO('M', 214)
+
+#ifdef __KERNEL__
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+struct socket *mp_get_socket(struct file *);
+#else
+#include <linux/err.h>
+#include <linux/errno.h>
+struct file;
+struct socket;
+static inline struct socket *mp_get_socket(struct file *f)
+{
+   return ERR_PTR(-EINVAL);
+}
+#endif /* CONFIG_MEDIATE_PASSTHRU */
+#endif /* __KERNEL__ */
+#endif /* __MPASSTHRU_H */
-- 
1.7.3
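
When the feature is compiled out, the header's stub returns an error pointer rather than NULL, so callers need the IS_ERR()/PTR_ERR() idiom. The sketch below re-creates those helpers in user space to show the calling convention; the helper bodies are simplified models of what the kernel's err.h provides, and try_attach() is a hypothetical caller that does not appear in this series.

```c
#include <errno.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* User-space models of the kernel's error-pointer helpers. */
static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	/* Error codes occupy the top MAX_ERRNO values of the address space. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct file;
struct socket;

/* Same shape as the stub in the header when the feature is disabled. */
static inline struct socket *mp_get_socket(struct file *f)
{
	(void)f;
	return ERR_PTR(-EINVAL);
}

/* Hypothetical caller: returns 0 on success or a negative errno. */
static int try_attach(struct file *f, struct socket **out)
{
	struct socket *sock = mp_get_socket(f);

	if (IS_ERR(sock))
		return (int)PTR_ERR(sock);
	*out = sock;
	return 0;
}
```

A caller that only tested for NULL would treat the stub's return value as a valid socket, which is why the check goes through IS_ERR().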



[PATCH v11 10/17] Add a hook to intercept external buffers from NIC driver.

2010-09-24 Thread xiaohui . xin
From: Xin Xiaohui xiaohui@intel.com

The hook is called in netif_receive_skb().

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com

---
 net/core/dev.c |   35 +++
 1 files changed, 35 insertions(+), 0 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 636f11b..4b379b1 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2517,6 +2517,37 @@ err:
 EXPORT_SYMBOL(netdev_mp_port_prep);
 #endif
 
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+/* Add a hook to intercept mediate passthru(zero-copy) packets,
+ * and insert it to the socket queue owned by mp_port specially.
+ */
+static inline struct sk_buff *handle_mpassthru(struct sk_buff *skb,
+  struct packet_type **pt_prev,
+  int *ret,
+  struct net_device *orig_dev)
+{
+   struct mpassthru_port *mp_port = NULL;
+   struct sock *sk = NULL;
+
+   if (!dev_is_mpassthru(skb->dev))
+   return skb;
+   mp_port = skb->dev->mp_port;
+
+   if (*pt_prev) {
+   *ret = deliver_skb(skb, *pt_prev, orig_dev);
+   *pt_prev = NULL;
+   }
+
+   sk = mp_port->sock->sk;
+   skb_queue_tail(&sk->sk_receive_queue, skb);
+   sk->sk_state_change(sk);
+
+   return NULL;
+}
+#else
+#define handle_mpassthru(skb, pt_prev, ret, orig_dev) (skb)
+#endif
+
 /**
  * netif_receive_skb - process receive buffer from network
  * @skb: buffer to process
@@ -2598,6 +2629,10 @@ int netif_receive_skb(struct sk_buff *skb)
 ncls:
 #endif
 
+   /* To intercept mediate passthru(zero-copy) packets here */
+   skb = handle_mpassthru(skb, &pt_prev, &ret, orig_dev);
+   if (!skb)
+   goto out;
skb = handle_bridge(skb, &pt_prev, &ret, orig_dev);
if (!skb)
goto out;
-- 
1.7.3
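
The hook follows the same convention as handle_bridge(): return the packet to let the receive path continue, or return NULL once the packet has been consumed. The sketch below is a user-space model of that convention only. struct pkt, struct pkt_queue, handle_mp() and receive() are hypothetical stand-ins, and the from_mp_dev flag is an assumed substitute for dev_is_mpassthru(skb->dev).

```c
#include <stddef.h>

/* Minimal stand-in for struct sk_buff (assumption). */
struct pkt {
	int from_mp_dev;     /* models dev_is_mpassthru(skb->dev) */
	struct pkt *next;
};

/* Minimal stand-in for the mp socket's receive queue. */
struct pkt_queue {
	struct pkt *head, *tail;
	int len;
};

static void queue_tail(struct pkt_queue *q, struct pkt *p)
{
	p->next = NULL;
	if (q->tail)
		q->tail->next = p;
	else
		q->head = p;
	q->tail = p;
	q->len++;
}

/* Returns the packet if it was not consumed, NULL if it was queued. */
static struct pkt *handle_mp(struct pkt *p, struct pkt_queue *mp_queue)
{
	if (!p->from_mp_dev)
		return p;
	queue_tail(mp_queue, p);
	return NULL;
}

/* Models the receive path: returns 1 if the packet went on to the stack. */
static int receive(struct pkt *p, struct pkt_queue *mp_queue)
{
	p = handle_mp(p, mp_queue);
	if (!p)
		return 0;
	/* Later hooks and protocol delivery would run here. */
	return 1;
}
```

Because the mp hook runs first, a packet from a device in mp mode never reaches the later hooks in this model.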



[PATCH v11 09/17] Don't do skb recycle, if device use external buffer.

2010-09-24 Thread xiaohui . xin
From: Xin Xiaohui xiaohui@intel.com

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com

---
 net/core/skbuff.c |6 ++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index bbf4707..9b156bb 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -565,6 +565,12 @@ int skb_recycle_check(struct sk_buff *skb, int skb_size)
if (skb_shared(skb) || skb_cloned(skb))
return 0;
 
+   /* if the device wants to do mediate passthru, the skb may
+* get external buffer, so don't recycle
+*/
+   if (dev_is_mpassthru(skb->dev))
+   return 0;
+
skb_release_head_state(skb);
shinfo = skb_shinfo(skb);
atomic_set(&shinfo->dataref, 1);
-- 
1.7.3



[PATCH v11 07/17] Modify netdev_alloc_page() to get external buffer

2010-09-24 Thread xiaohui . xin
From: Xin Xiaohui xiaohui@intel.com

Currently, it can get external buffers from mp device.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com

---
 net/core/skbuff.c |   27 +++
 1 files changed, 27 insertions(+), 0 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 117d82b..1a61e2b 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -269,11 +269,38 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
 }
 EXPORT_SYMBOL(__netdev_alloc_skb);
 
+struct page *netdev_alloc_ext_pages(struct net_device *dev, int npages)
+{
+   struct mpassthru_port *port;
+   struct skb_ext_page *ext_page = NULL;
+
+   port = dev->mp_port;
+   if (!port)
+   goto out;
+   ext_page = port->ctor(port, NULL, npages);
+   if (ext_page)
+   return ext_page->page;
+out:
+   return NULL;
+
+}
+EXPORT_SYMBOL(netdev_alloc_ext_pages);
+
+struct page *netdev_alloc_ext_page(struct net_device *dev)
+{
+   return netdev_alloc_ext_pages(dev, 1);
+
+}
+EXPORT_SYMBOL(netdev_alloc_ext_page);
+
 struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask)
 {
int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;
struct page *page;
 
+   if (dev_is_mpassthru(dev))
+   return netdev_alloc_ext_page(dev);
+
page = alloc_pages_node(node, gfp_mask, 0);
return page;
 }
-- 
1.7.3
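
The allocation path dispatches on whether the device is in mp mode: if so, the page comes from the port's constructor, otherwise from the normal allocator. The sketch below models that dispatch in user space. It is simplified under stated assumptions: the constructor takes no skb argument, a non-NULL mp_port stands in for dev_is_mpassthru(), calloc() stands in for alloc_pages_node(), and demo_ctor() with its static page is a hypothetical owner.

```c
#include <stddef.h>
#include <stdlib.h>

/* Stubs for the kernel types (assumptions). */
struct page { int external; };

struct skb_ext_page {
	struct page *page;
	void (*dtor)(struct skb_ext_page *);
};

struct mp_port {
	/* Simplified: the ctor in the series also takes an skb pointer. */
	struct skb_ext_page *(*ctor)(struct mp_port *, int npages);
};

struct net_dev { struct mp_port *mp_port; };

/* Hypothetical constructor handing out one preallocated external page. */
static struct page demo_page = { 1 };
static struct skb_ext_page demo_ext = { &demo_page, NULL };

static struct skb_ext_page *demo_ctor(struct mp_port *port, int npages)
{
	(void)port;
	return npages == 1 ? &demo_ext : NULL;
}

/* Models netdev_alloc_ext_pages(): ask the port's ctor for pages. */
static struct page *alloc_ext_pages(struct net_dev *dev, int npages)
{
	struct mp_port *port = dev->mp_port;
	struct skb_ext_page *ext;

	if (!port)
		return NULL;
	ext = port->ctor(port, npages);
	return ext ? ext->page : NULL;
}

/* Models __netdev_alloc_page(): external page in mp mode, else allocator. */
static struct page *alloc_page_for(struct net_dev *dev)
{
	if (dev->mp_port)
		return alloc_ext_pages(dev, 1);
	return calloc(1, sizeof(struct page));
}
```

In this model a failing constructor yields NULL rather than falling back to the allocator, which matches the early return in the patch.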



[PATCH v11 08/17] Modify netdev_free_page() to release external buffer

2010-09-24 Thread xiaohui . xin
From: Xin Xiaohui xiaohui@intel.com

Currently, it can release external buffers obtained from the mp device.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 include/linux/skbuff.h |4 +++-
 net/core/skbuff.c  |   24 
 2 files changed, 27 insertions(+), 1 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index ab29675..3d7f70e 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1512,9 +1512,11 @@ static inline struct page *netdev_alloc_page(struct net_device *dev)
return __netdev_alloc_page(dev, GFP_ATOMIC);
 }
 
+extern void __netdev_free_page(struct net_device *dev, struct page *page);
+
 static inline void netdev_free_page(struct net_device *dev, struct page *page)
 {
-   __free_page(page);
+   __netdev_free_page(dev, page);
 }
 
 /**
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 1a61e2b..bbf4707 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -306,6 +306,30 @@ struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask)
 }
 EXPORT_SYMBOL(__netdev_alloc_page);
 
+void netdev_free_ext_page(struct net_device *dev, struct page *page)
+{
+   struct skb_ext_page *ext_page = NULL;
+   if (dev_is_mpassthru(dev) && dev->mp_port->hash) {
+   ext_page = dev->mp_port->hash(dev, page);
+   if (ext_page)
+   ext_page->dtor(ext_page);
+   else
+   __free_page(page);
+   }
+}
+EXPORT_SYMBOL(netdev_free_ext_page);
+
+void __netdev_free_page(struct net_device *dev, struct page *page)
+{
+   if (dev_is_mpassthru(dev)) {
+   netdev_free_ext_page(dev, page);
+   return;
+   }
+
+   __free_page(page);
+}
+EXPORT_SYMBOL(__netdev_free_page);
+
 void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, int off,
int size)
 {
-- 
1.7.3
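
The free path has three outcomes, which the sketch below makes explicit by returning an enum instead of freeing anything. It is a user-space model under stated assumptions: a non-NULL mp_port stands in for dev_is_mpassthru(), the enum values, demo_hash() and demo_dtor() are hypothetical, and no real memory is released. The NOT_FREED case mirrors the posted code path in which the device is in mp mode but has no hash callback.

```c
#include <stddef.h>

/* Stubs for the kernel types (assumptions). */
struct page { int id; };

struct skb_ext_page {
	struct page *page;
	void (*dtor)(struct skb_ext_page *);
};

enum free_result { FREED_BY_DTOR, FREED_TO_ALLOCATOR, NOT_FREED };

struct net_dev;

struct mp_port {
	/* Maps a page back to its external record, or NULL if not external. */
	struct skb_ext_page *(*hash)(struct net_dev *, struct page *);
};

struct net_dev { struct mp_port *mp_port; };

static int dtor_calls;

static void demo_dtor(struct skb_ext_page *ext)
{
	(void)ext;
	dtor_calls++;
}

static struct page ext_pg = { 1 };
static struct skb_ext_page ext_rec = { &ext_pg, demo_dtor };

/* Hypothetical lookup: only ext_pg is known to be external. */
static struct skb_ext_page *demo_hash(struct net_dev *dev, struct page *pg)
{
	(void)dev;
	return pg == &ext_pg ? &ext_rec : NULL;
}

/* Models __netdev_free_page() together with netdev_free_ext_page(). */
static enum free_result free_page_for(struct net_dev *dev, struct page *pg)
{
	struct mp_port *port = dev->mp_port;
	struct skb_ext_page *ext;

	if (!port)
		return FREED_TO_ALLOCATOR;  /* models __free_page() */
	if (!port->hash)
		return NOT_FREED;           /* mp mode without a hash callback */
	ext = port->hash(dev, pg);
	if (!ext)
		return FREED_TO_ALLOCATOR;
	ext->dtor(ext);
	return FREED_BY_DTOR;
}
```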



[PATCH v11 06/17] Use callback to deal with skb_release_data() specially.

2010-09-24 Thread xiaohui . xin
From: Xin Xiaohui xiaohui@intel.com

If the buffer is external, use the callback to destroy it.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com

---
 include/linux/skbuff.h |3 ++-
 net/core/skbuff.c  |8 
 2 files changed, 10 insertions(+), 1 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 74af06c..ab29675 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -197,10 +197,11 @@ struct skb_shared_info {
union skb_shared_tx tx_flags;
struct sk_buff  *frag_list;
struct skb_shared_hwtstamps hwtstamps;
-   skb_frag_t  frags[MAX_SKB_FRAGS];
/* Intermediate layers must ensure that destructor_arg
 * remains valid until skb destructor */
void *  destructor_arg;
+
+   skb_frag_t  frags[MAX_SKB_FRAGS];
 };
 
 /* The structure is for a skb which pages may point to
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 93c4e06..117d82b 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -217,6 +217,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
shinfo->gso_type = 0;
shinfo->ip6_frag_id = 0;
shinfo->tx_flags.flags = 0;
+   shinfo->destructor_arg = NULL;
skb_frag_list_init(skb);
memset(&shinfo->hwtstamps, 0, sizeof(shinfo->hwtstamps));
 
@@ -350,6 +351,13 @@ static void skb_release_data(struct sk_buff *skb)
if (skb_has_frags(skb))
skb_drop_fraglist(skb);
 
+   if (skb->dev && dev_is_mpassthru(skb->dev)) {
+   struct skb_ext_page *ext_page =
+   skb_shinfo(skb)->destructor_arg;
+   if (ext_page && ext_page->dtor)
+   ext_page->dtor(ext_page);
+   }
+
kfree(skb->head);
}
 }
-- 
1.7.3



[PATCH v11 02/17] Add a new struct for device to manipulate external buffer.

2010-09-24 Thread xiaohui . xin
From: Xin Xiaohui xiaohui@intel.com

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 include/linux/netdevice.h |   22 +-
 1 files changed, 21 insertions(+), 1 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index fa8b476..ba582e1 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -530,6 +530,25 @@ struct netdev_queue {
unsigned long   tx_dropped;
 } cacheline_aligned_in_smp;
 
+/* Add a structure in structure net_device, the new field is
+ * named as mp_port. It's for mediate passthru (zero-copy).
+ * It contains the capability for the net device driver,
+ * a socket, and an external buffer creator, external means
+ * skb buffer belongs to the device may not be allocated from
+ * kernel space.
+ */
+struct mpassthru_port  {
+   int hdr_len;
+   int data_len;
+   int npages;
+   unsigned    flags;
+   struct socket   *sock;
+   int vnet_hlen;
+   struct skb_ext_page *(*ctor)(struct mpassthru_port *,
+   struct sk_buff *, int);
+   struct skb_ext_page *(*hash)(struct net_device *,
+   struct page *);
+};
 
 /*
  * This structure defines the management hooks for network devices.
@@ -952,7 +971,8 @@ struct net_device {
struct macvlan_port *macvlan_port;
/* GARP */
struct garp_port*garp_port;
-
+   /* mpassthru */
+   struct mpassthru_port   *mp_port;
/* class/net/name entry */
struct device   dev;
/* space for optional device, statistics, and wireless sysfs groups */
-- 
1.7.3



[PATCH v11 03/17] Add a ndo_mp_port_prep pointer to net_device_ops.

2010-09-24 Thread xiaohui . xin
From: Xin Xiaohui xiaohui@intel.com

If the driver wants to allocate external buffers, it can export its
capabilities, such as the skb buffer header length and the page length
that can be DMAed. The owner of the external buffers may use this
information.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 include/linux/netdevice.h |4 
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ba582e1..aba0308 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -710,6 +710,10 @@ struct net_device_ops {
int (*ndo_fcoe_get_wwn)(struct net_device *dev,
u64 *wwn, int type);
 #endif
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+   int (*ndo_mp_port_prep)(struct net_device *dev,
+   struct mpassthru_port *port);
+#endif
 };
 
 /*
-- 
1.7.3



[PATCH v11 04/17] Add a function to let the external buffer owner query capability.

2010-09-24 Thread xiaohui . xin
From: Xin Xiaohui xiaohui@intel.com

The external buffer owner can use the functions to get
the capability of the underlying NIC driver.

---
 include/linux/netdevice.h |2 +
 net/core/dev.c|   49 +
 2 files changed, 51 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index aba0308..5f192de 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1599,6 +1599,8 @@ extern gro_result_t   napi_frags_finish(struct napi_struct *napi,
  gro_result_t ret);
 extern struct sk_buff *napi_frags_skb(struct napi_struct *napi);
 extern gro_result_t    napi_gro_frags(struct napi_struct *napi);
+extern int netdev_mp_port_prep(struct net_device *dev,
+   struct mpassthru_port *port);
 
 static inline void napi_free_frags(struct napi_struct *napi)
 {
diff --git a/net/core/dev.c b/net/core/dev.c
index 264137f..636f11b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2468,6 +2468,55 @@ void netif_nit_deliver(struct sk_buff *skb)
rcu_read_unlock();
 }
 
+/* To support mediate passthru(zero-copy) with NIC driver,
+ * we'd better query NIC driver for the capability it can
+ * provide, especially for packet split mode, now we only
+ * query for the header size, and the payload a descriptor
+ * may carry. If a driver does not use the API to export,
+ * then we may try to use a default value, currently,
+ * we use the default value from an IGB driver. Now,
+ * it's only called by mpassthru device.
+ */
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+int netdev_mp_port_prep(struct net_device *dev,
+   struct mpassthru_port *port)
+{
+   int rc;
+   int npages, data_len;
+   const struct net_device_ops *ops = dev->netdev_ops;
+
+   if (ops->ndo_mp_port_prep) {
+   rc = ops->ndo_mp_port_prep(dev, port);
+   if (rc)
+   return rc;
+   } else {
+   /* If the NIC driver did not report this,
+* then we try to use default value.
+*/
+   port->hdr_len = 128;
+   port->data_len = 2048;
+   port->npages = 1;
+   }
+
+   if (port->hdr_len <= 0)
+   goto err;
+
+   npages = port->npages;
+   data_len = port->data_len;
+   if (npages <= 0 || npages > MAX_SKB_FRAGS ||
+   (data_len < PAGE_SIZE * (npages - 1) ||
+data_len > PAGE_SIZE * npages))
+   goto err;
+
+   return 0;
+err:
+   dev_warn(&dev->dev, "invalid page constructor parameters\n");
+
+   return -EINVAL;
+}
+EXPORT_SYMBOL(netdev_mp_port_prep);
+#endif
+
 /**
  * netif_receive_skb - process receive buffer from network
  * @skb: buffer to process
-- 
1.7.3
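
The parameter check in netdev_mp_port_prep() can be tried out in isolation. The sketch below is a user-space model, assuming the comparison operators stripped by the list archive read as "<=", "<" and ">". SKETCH_PAGE_SIZE and SKETCH_MAX_FRAGS are assumed constants for a 4 KiB page configuration, and struct mp_caps, mp_caps_default() and mp_caps_check() are hypothetical names that do not appear in the series.

```c
#include <errno.h>

#define SKETCH_PAGE_SIZE  4096  /* assumed page size */
#define SKETCH_MAX_FRAGS  18    /* assumed MAX_SKB_FRAGS for 4 KiB pages */

/* The three capability fields the patch validates. */
struct mp_caps {
	int hdr_len;
	int data_len;
	int npages;
};

/* Fallback values the patch uses when the driver reports nothing. */
static void mp_caps_default(struct mp_caps *c)
{
	c->hdr_len = 128;
	c->data_len = 2048;
	c->npages = 1;
}

/* Returns 0 if the parameters are consistent, -EINVAL otherwise. */
static int mp_caps_check(const struct mp_caps *c)
{
	if (c->hdr_len <= 0)
		return -EINVAL;
	if (c->npages <= 0 || c->npages > SKETCH_MAX_FRAGS)
		return -EINVAL;
	/* data_len must lie between (npages - 1) and npages pages, inclusive. */
	if (c->data_len < SKETCH_PAGE_SIZE * (c->npages - 1) ||
	    c->data_len > SKETCH_PAGE_SIZE * c->npages)
		return -EINVAL;
	return 0;
}
```

Under this reading the defaults pass, while a driver that claims two pages for 2048 bytes of payload is rejected because one page would have been enough.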
