Re: [PATCH RFC 4/5] tun: vringfd xmit support.
On Monday 07 April 2008 17:35:28 David Miller wrote:
> From: Rusty Russell <[EMAIL PROTECTED]>
> Date: Mon, 7 Apr 2008 17:24:51 +1000
>
> > On Monday 07 April 2008 15:13:44 Herbert Xu wrote:
> > > On second thought, this is not going to work.  The network stack
> > > can clone individual pages out of this skb and put it into a new
> > > skb.  Therefore whatever scheme we come up with will either need
> > > to be page-based, or add a flag to tell the network stack that it
> > > can't clone those pages.
> >
> > Erk... I'll put in the latter for now.  A page-level solution is not
> > really an option: if userspace hands us mmaped pages, for example.
>
> Keep in mind that the core of the TCP stack really depends
> upon being able to slice and dice paged SKBs as it pleases
> in order to send packets out.
>
> In fact, it also does such splitting during SACK processing.
>
> It really is a base requirement for efficient TSO support.
> Otherwise the above operations would be so incredibly
> expensive we might as well rip all of the TSO support out.

In this case that's OK; these are only for packets coming from
userspace/guest, so the only interaction with the TCP code is possibly
as a recipient.

If TCP wanted to do zero-copy aio xmit, I think we'd need something else.

Cheers,
Rusty.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH RFC 4/5] tun: vringfd xmit support.
From: Rusty Russell <[EMAIL PROTECTED]>
Date: Mon, 7 Apr 2008 17:24:51 +1000

> On Monday 07 April 2008 15:13:44 Herbert Xu wrote:
> > On second thought, this is not going to work.  The network stack
> > can clone individual pages out of this skb and put it into a new
> > skb.  Therefore whatever scheme we come up with will either need
> > to be page-based, or add a flag to tell the network stack that it
> > can't clone those pages.
>
> Erk... I'll put in the latter for now.  A page-level solution is not really
> an option: if userspace hands us mmaped pages, for example.

Keep in mind that the core of the TCP stack really depends
upon being able to slice and dice paged SKBs as it pleases
in order to send packets out.

In fact, it also does such splitting during SACK processing.

It really is a base requirement for efficient TSO support.
Otherwise the above operations would be so incredibly
expensive we might as well rip all of the TSO support out.
Re: [PATCH RFC 4/5] tun: vringfd xmit support.
On Monday 07 April 2008 15:13:44 Herbert Xu wrote:
> Rusty Russell <[EMAIL PROTECTED]> wrote:
> > +/* We are done with this skb: put it in the used pile. */
> > +static void skb_finished(struct skb_shared_info *sinfo)
> > +{
> > +	struct skb_shinfo_tun *sht = (void *)(sinfo + 1);
> > +
> > +	/* FIXME: Race prevention */
> > +	vring_used_buffer_atomic(sht->tun->outring, sht->id, sht->len);
> > +	vring_wake(sht->tun->outring);
> > +
> > +	/* Release device. */
> > +	dev_put(sht->tun->dev);
> > +}
>
> On second thought, this is not going to work.  The network stack
> can clone individual pages out of this skb and put it into a new
> skb.  Therefore whatever scheme we come up with will either need
> to be page-based, or add a flag to tell the network stack that it
> can't clone those pages.

Erk... I'll put in the latter for now.  A page-level solution is not really
an option: if userspace hands us mmaped pages, for example.

Thanks,
Rusty.
Re: [PATCH RFC 4/5] tun: vringfd xmit support.
Rusty Russell <[EMAIL PROTECTED]> wrote:
>
> +/* We are done with this skb: put it in the used pile. */
> +static void skb_finished(struct skb_shared_info *sinfo)
> +{
> +	struct skb_shinfo_tun *sht = (void *)(sinfo + 1);
> +
> +	/* FIXME: Race prevention */
> +	vring_used_buffer_atomic(sht->tun->outring, sht->id, sht->len);
> +	vring_wake(sht->tun->outring);
> +
> +	/* Release device. */
> +	dev_put(sht->tun->dev);
> +}

On second thought, this is not going to work.  The network stack
can clone individual pages out of this skb and put it into a new
skb.  Therefore whatever scheme we come up with will either need
to be page-based, or add a flag to tell the network stack that it
can't clone those pages.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
[PATCH RFC 4/5] tun: vringfd xmit support.
This patch modifies tun to allow a vringfd to specify the send buffer.
The user does a write to push out packets from the buffer.

Again, more thought needs to be put into the possible races with ring
registration.

Again we use the 'struct virtio_net_hdr' to allow userspace to send GSO
packets.  In this case, it can hint how much to copy, and the other pages
will be made into skb fragments.

Signed-off-by: Rusty Russell <[EMAIL PROTECTED]>

diff -r 8270b5fdf03f drivers/net/tun.c
--- a/drivers/net/tun.c	Sat Apr 05 22:49:10 2008 +1100
+++ b/drivers/net/tun.c	Sat Apr 05 22:51:10 2008 +1100
@@ -101,7 +101,7 @@ struct tun_struct {
 	u32 chr_filter[2];
 	u32 net_filter[2];
-	struct vring_info *inring;
+	struct vring_info *inring, *outring;
 
 #ifdef TUN_DEBUG
 	int debug;
@@ -258,6 +258,162 @@ static void tun_net_init(struct net_devi
 	}
 }
 
+/* We don't consolidate consecutive iovecs, so huge iovecs can break here.
+ * Users will learn not to do that. */
+static int get_user_skb_frags(const struct iovec *iv, size_t len,
+			      struct skb_frag_struct *f)
+{
+	unsigned int i, j, num_pg = 0;
+	int err;
+	struct page *pages[MAX_SKB_FRAGS];
+
+	down_read(&current->mm->mmap_sem);
+	while (len) {
+		int n, npages;
+		unsigned long base, len;
+		base = (unsigned long)iv->iov_base;
+		len = (unsigned long)iv->iov_len;
+
+		if (len == 0) {
+			iv++;
+			continue;
+		}
+
+		/* How many pages will this take? */
+		npages = 1 + (base + len - 1)/PAGE_SIZE - base/PAGE_SIZE;
+		if (unlikely(num_pg + npages > MAX_SKB_FRAGS)) {
+			err = -ENOSPC;
+			goto fail;
+		}
+		n = get_user_pages(current, current->mm, base, npages,
+				   0, 0, pages, NULL);
+		if (unlikely(n < 0)) {
+			err = n;
+			goto fail;
+		}
+
+		/* Transfer pages to the frag array */
+		for (j = 0; j < n; j++) {
+			f[num_pg].page = pages[j];
+			if (j == 0) {
+				f[num_pg].page_offset = offset_in_page(base);
+				f[num_pg].size = min(len, PAGE_SIZE -
+						     f[num_pg].page_offset);
+			} else {
+				f[num_pg].page_offset = 0;
+				f[num_pg].size = min(len, PAGE_SIZE);
+			}
+			len -= f[num_pg].size;
+			base += f[num_pg].size;
+			num_pg++;
+		}
+
+		if (unlikely(n != npages)) {
+			err = -EFAULT;
+			goto fail;
+		}
+	}
+	up_read(&current->mm->mmap_sem);
+	return num_pg;
+
+fail:
+	for (i = 0; i < num_pg; i++)
+		put_page(f[i].page);
+	up_read(&current->mm->mmap_sem);
+	return err;
+}
+
+/* Get packet from user space buffer.  copylen is a hint as to how
+ * much to copy (rest is pinned). */
+static struct sk_buff *get_user_skb(struct tun_struct *tun, struct iovec *iv,
+				    size_t copylen, size_t len, int extra)
+{
+	struct tun_pi pi = { 0, __constant_htons(ETH_P_IP) };
+	struct sk_buff *skb;
+	size_t align = 0;
+	int err;
+
+	/* You can't have user fragments without room for destruction info. */
+	BUG_ON(!extra && copylen != len);
+
+	if (!(tun->flags & TUN_NO_PI)) {
+		if (len < sizeof(pi)) {
+			err = -EINVAL;
+			goto fail;
+		}
+		len -= sizeof(pi);
+
+		if (memcpy_fromiovec((void *)&pi, iv, sizeof(pi))) {
+			err = -EFAULT;
+			goto fail;
+		}
+		if (copylen > len)
+			copylen = len;
+	}
+
+	if ((tun->flags & TUN_TYPE_MASK) == TUN_TAP_DEV) {
+		align = NET_IP_ALIGN;
+		if (unlikely(copylen < ETH_HLEN)) {
+			if (len < ETH_HLEN) {
+				err = -EINVAL;
+				goto fail;
+			}
+			copylen = ETH_HLEN;
+		}
+	}
+
+	/* We don't need a destructor if we don't have fragments. */
+	if (extra && copylen == len)
+		extra = 0;
+
+	if (!(skb = __alloc_skb(copylen + align, GFP_KERNEL, 0, extra, -1))) {
+		err = -ENOMEM;
+		goto fail;
+	}
+
+	if (align)
+		skb_reserve(skb, align);
+	if (memcpy_fromiovec(skb_put(skb, co