Re: [PATCH RFC 4/5] tun: vringfd xmit support.

2008-04-07 Thread Rusty Russell
On Monday 07 April 2008 17:35:28 David Miller wrote:
> From: Rusty Russell <[EMAIL PROTECTED]>
> Date: Mon, 7 Apr 2008 17:24:51 +1000
>
> > On Monday 07 April 2008 15:13:44 Herbert Xu wrote:
> > > On second thought, this is not going to work.  The network stack
> > > can clone individual pages out of this skb and put it into a new
> > > skb.  Therefore whatever scheme we come up with will either need
> > > to be page-based, or add a flag to tell the network stack that it
> > > can't clone those pages.
> >
> > Erk... I'll put in the latter for now.  A page-level solution is not
> > really an option: userspace might hand us mmapped pages, for example.
>
> Keep in mind that the core of the TCP stack really depends
> upon being able to slice and dice paged SKBs as it pleases
> in order to send packets out.
>
> In fact, it also does such splitting during SACK processing.
>
> It really is a base requirement for efficient TSO support.
> Otherwise the above operations would be so incredibly
> expensive we might as well rip all of the TSO support out.

In this case that's OK; these are only for packets coming from 
userspace/guest, so the only interaction with the tcp code is possibly as a 
recipient.

If tcp wanted to do zero-copy aio xmit, I think we'd need something else.

Cheers,
Rusty.






Re: [PATCH RFC 4/5] tun: vringfd xmit support.

2008-04-07 Thread David Miller
From: Rusty Russell <[EMAIL PROTECTED]>
Date: Mon, 7 Apr 2008 17:24:51 +1000

> On Monday 07 April 2008 15:13:44 Herbert Xu wrote:
> > On second thought, this is not going to work.  The network stack
> > can clone individual pages out of this skb and put it into a new
> > skb.  Therefore whatever scheme we come up with will either need
> > to be page-based, or add a flag to tell the network stack that it
> > can't clone those pages.
> 
> Erk... I'll put in the latter for now.  A page-level solution is not really
> an option: userspace might hand us mmapped pages, for example.

Keep in mind that the core of the TCP stack really depends
upon being able to slice and dice paged SKBs as it pleases
in order to send packets out.

In fact, it also does such splitting during SACK processing.

It really is a base requirement for efficient TSO support.
Otherwise the above operations would be so incredibly
expensive we might as well rip all of the TSO support out.
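
A rough sketch of the splitting being described, purely for illustration (the
real work is done by kernel helpers along the lines of skb_split() on the TSO
and SACK paths; the function below is not actual stack code):

#include <linux/skbuff.h>

static void split_paged_skb_sketch(struct sk_buff *orig, struct sk_buff *tail,
				   int first_moved_frag)
{
	int i;

	/* The frag descriptors, and with them the page references, simply
	 * move to the second skb.  Nothing attached behind the first skb's
	 * skb_shared_info (such as a tun "used buffer" record) follows them,
	 * and no destructor runs at this point. */
	for (i = first_moved_frag; i < skb_shinfo(orig)->nr_frags; i++) {
		skb_shinfo(tail)->frags[skb_shinfo(tail)->nr_frags++] =
			skb_shinfo(orig)->frags[i];
	}
	skb_shinfo(orig)->nr_frags = first_moved_frag;
}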


Re: [PATCH RFC 4/5] tun: vringfd xmit support.

2008-04-07 Thread Rusty Russell
On Monday 07 April 2008 15:13:44 Herbert Xu wrote:
> Rusty Russell <[EMAIL PROTECTED]> wrote:
> > +/* We are done with this skb: put it in the used pile. */
> > +static void skb_finished(struct skb_shared_info *sinfo)
> > +{
> > +   struct skb_shinfo_tun *sht = (void *)(sinfo + 1);
> > +
> > +   /* FIXME: Race prevention */
> > +   vring_used_buffer_atomic(sht->tun->outring, sht->id, sht->len);
> > +   vring_wake(sht->tun->outring);
> > +
> > +   /* Release device. */
> > +   dev_put(sht->tun->dev);
> > +}
>
> On second thought, this is not going to work.  The network stack
> can clone individual pages out of this skb and put it into a new
> skb.  Therefore whatever scheme we come up with will either need
> to be page-based, or add a flag to tell the network stack that it
> can't clone those pages.

Erk... I'll put in the latter for now.  A page-level solution is not really
an option: userspace might hand us mmapped pages, for example.

Thanks,
Rusty.


Re: [PATCH RFC 4/5] tun: vringfd xmit support.

2008-04-06 Thread Herbert Xu
Rusty Russell <[EMAIL PROTECTED]> wrote:
>
> +/* We are done with this skb: put it in the used pile. */
> +static void skb_finished(struct skb_shared_info *sinfo)
> +{
> +   struct skb_shinfo_tun *sht = (void *)(sinfo + 1);
> +
> +   /* FIXME: Race prevention */
> +   vring_used_buffer_atomic(sht->tun->outring, sht->id, sht->len);
> +   vring_wake(sht->tun->outring);
> +
> +   /* Release device. */
> +   dev_put(sht->tun->dev);
> +}

On second thought, this is not going to work.  The network stack
can clone individual pages out of this skb and put it into a new
skb.  Therefore whatever scheme we come up with will either need
to be page-based, or add a flag to tell the network stack that it
can't clone those pages.
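
To make the failure mode concrete, here is a sketch (illustrative only, not
actual stack code) of one frag page being lifted into a brand new skb.  The
new skb carries its own skb_shared_info, so a record placed behind the
original shinfo, which is where skb_finished() above expects to find its
skb_shinfo_tun, is never seen again:

#include <linux/skbuff.h>
#include <linux/mm.h>

static void steal_one_frag_sketch(struct sk_buff *src, struct sk_buff *dst,
				  int i)
{
	skb_frag_t *f = &skb_shinfo(src)->frags[i];

	/* The page gains a second owner and may outlive src entirely;
	 * nothing ties it back to src's vring buffer any more. */
	get_page(f->page);
	skb_fill_page_desc(dst, skb_shinfo(dst)->nr_frags,
			   f->page, f->page_offset, f->size);
}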

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


[PATCH RFC 4/5] tun: vringfd xmit support.

2008-04-05 Thread Rusty Russell
This patch modifies tun to allow a vringfd to specify the send
buffer.  The user does a write to push out packets from the buffer.

Again, more thought needs to be put into the possible races with ring
registration.

Again we use the 'struct virtio_net_hdr' to allow userspace to send
GSO packets.  In this case, it can hint how much to copy, and the
other pages will be made into skb fragments.
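
For orientation, here is a rough sketch of the userspace side of such a send
buffer.  Whether hdr_len is the actual copy hint is an assumption based on
the description above, not something this patch spells out, and the helper
name is made up:

#include <linux/virtio_net.h>
#include <string.h>

/* Fill the virtio_net_hdr that precedes a packet handed to the xmit ring.
 * The kernel side copies roughly the hinted amount into the skb head and
 * pins the rest of the buffer's pages as skb fragments. */
static void fill_xmit_header(struct virtio_net_hdr *h,
			     size_t header_bytes, size_t mss)
{
	memset(h, 0, sizeof(*h));
	h->gso_type = VIRTIO_NET_HDR_GSO_TCPV4;	/* e.g. a TCPv4 GSO packet */
	h->hdr_len  = header_bytes;		/* hint: copy this much */
	h->gso_size = mss;			/* MSS for later segmentation */
	/* the packet payload follows in the same ring descriptor chain */
}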

Signed-off-by: Rusty Russell <[EMAIL PROTECTED]>

diff -r 8270b5fdf03f drivers/net/tun.c
--- a/drivers/net/tun.c Sat Apr 05 22:49:10 2008 +1100
+++ b/drivers/net/tun.c Sat Apr 05 22:51:10 2008 +1100
@@ -101,7 +101,7 @@ struct tun_struct {
u32 chr_filter[2];
u32 net_filter[2];
 
-   struct vring_info   *inring;
+   struct vring_info   *inring, *outring;
 
 #ifdef TUN_DEBUG   
int debug;
@@ -258,6 +258,162 @@ static void tun_net_init(struct net_devi
}
 }
 
+/* We don't consolidate consecutive iovecs, so huge iovecs can break here.
+ * Users will learn not to do that. */
+static int get_user_skb_frags(const struct iovec *iv, size_t len,
+ struct skb_frag_struct *f)
+{
+   unsigned int i, j, num_pg = 0;
+   int err;
+   struct page *pages[MAX_SKB_FRAGS];
+
+   down_read(&current->mm->mmap_sem);
+   while (len) {
+   int n, npages;
+   unsigned long base, len;
+   base = (unsigned long)iv->iov_base;
+   len = (unsigned long)iv->iov_len;
+
+   if (len == 0) {
+   iv++;
+   continue;
+   }
+
+   /* How many pages will this take? */
+   npages = 1 + (base + len - 1)/PAGE_SIZE - base/PAGE_SIZE;
+   if (unlikely(num_pg + npages > MAX_SKB_FRAGS)) {
+   err = -ENOSPC;
+   goto fail;
+   }
+   n = get_user_pages(current, current->mm, base, npages,
+  0, 0, pages, NULL);
+   if (unlikely(n < 0)) {
+   err = n;
+   goto fail;
+   }
+
+   /* Transfer pages to the frag array */
+   for (j = 0; j < n; j++) {
+   f[num_pg].page = pages[j];
+   if (j == 0) {
+   f[num_pg].page_offset = offset_in_page(base);
+   f[num_pg].size = min(len, PAGE_SIZE -
+f[num_pg].page_offset);
+   } else {
+   f[num_pg].page_offset = 0;
+   f[num_pg].size = min(len, PAGE_SIZE);
+   }
+   len -= f[num_pg].size;
+   base += f[num_pg].size;
+   num_pg++;
+   }
+
+   if (unlikely(n != npages)) {
+   err = -EFAULT;
+   goto fail;
+   }
+   }
+   up_read(&current->mm->mmap_sem);
+   return num_pg;
+
+fail:
+   for (i = 0; i < num_pg; i++)
+   put_page(f[i].page);
+   up_read(&current->mm->mmap_sem);
+   return err;
+}
+
+/* Get packet from user space buffer.  copylen is a hint as to how
+ * much to copy (rest is pinned).  */
+static struct sk_buff *get_user_skb(struct tun_struct *tun, struct iovec *iv,
+   size_t copylen, size_t len, int extra)
+{
+   struct tun_pi pi = { 0, __constant_htons(ETH_P_IP) };
+   struct sk_buff *skb;
+   size_t align = 0;
+   int err;
+
+   /* You can't have user fragments without room for destruction info. */
+   BUG_ON(!extra && copylen != len);
+
+   if (!(tun->flags & TUN_NO_PI)) {
+   if (len < sizeof(pi)) {
+   err = -EINVAL;
+   goto fail;
+   }
+   len -= sizeof(pi);
+
+   if (memcpy_fromiovec((void *)&pi, iv, sizeof(pi))) {
+   err = -EFAULT;
+   goto fail;
+   }
+   if (copylen > len)
+   copylen = len;  
+   }
+
+   if ((tun->flags & TUN_TYPE_MASK) == TUN_TAP_DEV) {
+   align = NET_IP_ALIGN;
+   if (unlikely(copylen < ETH_HLEN)) {
+   if (len < ETH_HLEN) {
+   err = -EINVAL;
+   goto fail;
+   }
+   copylen = ETH_HLEN;
+   }
+   }
+
+   /* We don't need a destructor if we don't have fragments. */
+   if (extra && copylen == len)
+   extra = 0;
+
+   if (!(skb = __alloc_skb(copylen + align, GFP_KERNEL, 0, extra, -1))) {
+   err = -ENOMEM;
+   goto fail;
+   }
+
+   if (align)
+   skb_reserve(skb, align);
+   if (memcpy_fromiovec(skb_put(skb, co