Re: [PATCH 5/5] tun: vringfd xmit support.

2008-04-18 Thread Andrew Morton
On Sat, 19 Apr 2008 01:15:15 +1000 Rusty Russell <[EMAIL PROTECTED]> wrote:

> It's not clear to me why this is in swap.c, but it exists
> even without CONFIG_SWAP, so that's OK.

We should have done mv mm/swap.c mm/mmlib.c years ago.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 2/5] /dev/vring: simple userspace-kernel ringbuffer interface.

2008-04-18 Thread Michael Kerrisk
On 4/18/08, Andrew Morton <[EMAIL PROTECTED]> wrote:
> On Sat, 19 Apr 2008 00:32:39 +1000 Rusty Russell <[EMAIL PROTECTED]> wrote:
>
>  > > Isn't this kinda-sorta like what a relayfs file does?  The oprofile
>  > > buffers?  etc?  Nothing in common at all, no hope?
>  >
>  > An excellent question, but I thought the modern kernel etiquette was to 
> only
>  > comment on whitespace and formatting, and call it "review"? :)
>  >
>  > Yes, kinda-sorta in that it's a ring buffer.  No, in that it's bidir and
>  > consumption can be out-of-order (kind of important for I/O buffers).
>  >
>  > But the reason I'm not proposing it as a syscall is that I'm not convinced
>  > it's the One True Solution which everyone should be using.  Time will tell:
>  > it's clearly not tied to tun and it's been generically useful for virtual
>  > I/O, but history has not been kind to new userspace interfaces.
>
>
> This is may be our third high-bandwidth user/kernel interface to transport
>  bulk data ("hbukittbd") which was implemented because its predecessors
>  weren't quite right.  In a year or two's time someone else will need a
>  hbukittbd and will find that the existing three aren't quite right and will
>  give us another one.  One day we need to stop doing this ;)
>
>  It could be that this person will look at Rusty's hbukittbd and find that
>  it _could_ be tweaked to do what he wants, but it's already shipping and
>  it's part of the kernel API and hence can't be made to do what he wants.
>
>  So I think it would be good to plonk the proposed interface on the table
>  and have a poke at it.  Is it compat-safe?  Is it extensible in a
>  backward-compatible fashion?  Are there future-safe changes we should make
>  to it?  Can Michael Kerrisk understand, review and document it?  etc.

Well, it helps if he's CCed

I'm happy to work *with someone* on the documentation (pointless to do
it on my own -- how do I know what Rusty's *intended* behavior for the
interface is), and review, and testing.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 5/5] tun: vringfd xmit support.

2008-04-18 Thread Andrew Morton
On Sat, 19 Apr 2008 01:15:15 +1000 Rusty Russell <[EMAIL PROTECTED]> wrote:

> >
> > What is the maximum numbet of pages which an unpriviliged user can
> > concurrently pin with this code?
> 
> Since only root can open the tun device, it's currently OK.  The old code
> kmalloced and copied: is there some mm-fu reason why pinning userspace memory
> is worse?

We generally try to avoid it - it allows users to dos the box.  Although I
suspect that direct-io presently permits users to transiently pin an amount
of memory which is proportional to the number of disks upon which they can
open files.

> Subject: Export release_pages; nice undo for get_user_pages.
> 
> Andrew Morton suggests tun/tap use release_pages, but it's not
> exported.  It's not clear to me why this is in swap.c, but it exists
> even without CONFIG_SWAP, so that's OK.
> 
> Signed-off-by: Rusty Russell <[EMAIL PROTECTED]>
> 
> diff -r abd2ad431e5c mm/swap.c
> --- a/mm/swap.c   Sat Apr 19 00:34:54 2008 +1000
> +++ b/mm/swap.c   Sat Apr 19 01:11:40 2008 +1000
> @@ -346,6 +346,7 @@ void release_pages(struct page **pages, 
>  
>   pagevec_free(&pages_to_free);
>  }
> +EXPORT_SYMBOL(release_pages);
>  
>  /*
>   * The pages which we're about to release may be in the deferred lru-addition

acked-by: me.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 2/5] /dev/vring: simple userspace-kernel ringbuffer interface.

2008-04-18 Thread Andrew Morton
On Sat, 19 Apr 2008 00:32:39 +1000 Rusty Russell <[EMAIL PROTECTED]> wrote:

> > Isn't this kinda-sorta like what a relayfs file does?  The oprofile
> > buffers?  etc?  Nothing in common at all, no hope?
> 
> An excellent question, but I thought the modern kernel etiquette was to only 
> comment on whitespace and formatting, and call it "review"? :)
> 
> Yes, kinda-sorta in that it's a ring buffer.  No, in that it's bidir and 
> consumption can be out-of-order (kind of important for I/O buffers).
> 
> But the reason I'm not proposing it as a syscall is that I'm not convinced 
> it's the One True Solution which everyone should be using.  Time will tell: 
> it's clearly not tied to tun and it's been generically useful for virtual 
> I/O, but history has not been kind to new userspace interfaces.

This is may be our third high-bandwidth user/kernel interface to transport
bulk data ("hbukittbd") which was implemented because its predecessors
weren't quite right.  In a year or two's time someone else will need a
hbukittbd and will find that the existing three aren't quite right and will
give us another one.  One day we need to stop doing this ;)

It could be that this person will look at Rusty's hbukittbd and find that
it _could_ be tweaked to do what he wants, but it's already shipping and
it's part of the kernel API and hence can't be made to do what he wants.

So I think it would be good to plonk the proposed interface on the table
and have a poke at it.  Is it compat-safe?  Is it extensible in a
backward-compatible fashion?  Are there future-safe changes we should make
to it?  Can Michael Kerrisk understand, review and document it?  etc.

You know what I'm saying ;)  What is the proposed interface?
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 5/5] tun: vringfd xmit support.

2008-04-18 Thread pradeep singh rautela
On Fri, Apr 18, 2008 at 7:55 PM, Ray Lee <[EMAIL PROTECTED]> wrote:
> On Fri, Apr 18, 2008 at 4:46 AM, pradeep singh rautela
>  <[EMAIL PROTECTED]> wrote:
>  >
>  > On Fri, Apr 18, 2008 at 10:13 AM, Rusty Russell <[EMAIL PROTECTED]> wrote:
>
> >  >  +   /* How many pages will this take? */
>  >  >  +   npages = 1 + (base + len - 1)/PAGE_SIZE - 
> base/PAGE_SIZE;
>  >
>  >  Hi Rusty,
>  >  A trivial suggestion, how about
>  >   npages = 1+(len -1)/PAGE_SIZE ?
>
>  That's not the same. In particular, his version accounts for the
>  fractional page at the beginning, while yours doesn't.

Oh thanks for correcting me Ray. :)
>
>  While it's tempting to use algebra to simplify things, it's not safe
>  to do so when the expression involves division over the integers. The
>  only care-free integer math in a computer is subtraction and ++.

I stand corrected.

Sorry for noise.

Thanks,
-- 
Pradeep Singh Rautela
http://eagain.wordpress.com
http://emptydomain.googlepages.com
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 5/5] tun: vringfd xmit support.

2008-04-18 Thread Ray Lee
On Fri, Apr 18, 2008 at 8:15 AM, Rusty Russell <[EMAIL PROTECTED]> wrote:
> On Friday 18 April 2008 21:31:20 Andrew Morton wrote:
>  > On Fri, 18 Apr 2008 14:43:24 +1000 Rusty Russell <[EMAIL PROTECTED]> wrote:
>
> > > +   /* How many pages will this take? */
>  > > +   npages = 1 + (base + len - 1)/PAGE_SIZE - base/PAGE_SIZE;
>  >
>  > Brain hurts.  I hope you got that right.
>
>  I tested it when I wrote it, but just wrote a tester again:
>
>  baselen npages
>  0   1   1
>  0xfff   1   1
>  0x1000  1   1
>  0   40961
>  0x1 40962
>  0xfff   40962
>  0x1000  40961
>  0xf000  40961
>  0xf000  40974293918722

Picky, picky :-).

If base were always page aligned, we'd want

npages = (len + PAGE_SIZE - 1) / PAGE_SIZE;

which is simply rounding len up to the next largest page size. So when
base is not page aligned, increase len by the left-over space at the
beginning, and then do the same calculation as above. (ie, pretend
base is page aligned, and instead count the excess at the beginning as
part of len.)

npages = ( (base & PAGE_MASK) + len + PAGE_SIZE - 1) / PAGE_SIZE;

As long as len < PAGE_MASK - PAGE_SIZE, we're safe from overflows.

(The above gives a different answer when len=0, but you guard for that.)
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 5/5] tun: vringfd xmit support.

2008-04-18 Thread Rusty Russell
On Friday 18 April 2008 21:31:20 Andrew Morton wrote:
> On Fri, 18 Apr 2008 14:43:24 +1000 Rusty Russell <[EMAIL PROTECTED]> wrote:
> > +   /* How many pages will this take? */
> > +   npages = 1 + (base + len - 1)/PAGE_SIZE - base/PAGE_SIZE;
>
> Brain hurts.  I hope you got that right.

I tested it when I wrote it, but just wrote a tester again:

baselen npages
0   1   1
0xfff   1   1
0x1000  1   1
0   40961
0x1 40962
0xfff   40962
0x1000  40961
0xf000  40961
0xf000  40974293918722

> > +   if (unlikely(num_pg + npages > MAX_SKB_FRAGS)) {
> > +   err = -ENOSPC;
> > +   goto fail;
> > +   }
> > +   n = get_user_pages(current, current->mm, base, npages,
> > +  0, 0, pages, NULL);
>
> What is the maximum numbet of pages which an unpriviliged user can
> concurrently pin with this code?

Since only root can open the tun device, it's currently OK.  The old code
kmalloced and copied: is there some mm-fu reason why pinning userspace memory
is worse?

But I actually think it's OK even for non-root, since these become skbs, which
means they either go into an outgoing device queue or a socket queue which is
accounted for exactly for this reason. 

> > +   if (unlikely(n < 0)) {
> > +   err = n;
> > +   goto fail;
> > +   }
> > +
> > +   /* Transfer pages to the frag array */
> > +   for (j = 0; j < n; j++) {
> > +   f[num_pg].page = pages[j];
> > +   if (j == 0) {
> > +   f[num_pg].page_offset = offset_in_page(base);
> > +   f[num_pg].size = min(len, PAGE_SIZE -
> > +f[num_pg].page_offset);
> > +   } else {
> > +   f[num_pg].page_offset = 0;
> > +   f[num_pg].size = min(len, PAGE_SIZE);
> > +   }
> > +   len -= f[num_pg].size;
> > +   base += f[num_pg].size;
> > +   num_pg++;
> > +   }
>
> This loop is a fancy way of doing
>
>   num_pg = n;

Damn, you had me reworking this until I realized why.  It's not: we're
inside a loop, doing one iovec array element at a time.

> > +   if (unlikely(n != npages)) {
> > +   err = -EFAULT;
> > +   goto fail;
> > +   }
>
> why not do this immediately after running get_user_pages()?

To simplify the failure path.  Hmm, I would use release_pages here...

> > +fail:
> > +   for (i = 0; i < num_pg; i++)
> > +   put_page(f[i].page);
>
> release_pages() could be a tad more efficient, but it's only error-path.

... but I didn't know that existed.  Had to include pagemap.h, and it's not
exported.  It seems to be a useful interface; see patch.

Cheers,
Rusty.

Subject: Export release_pages; nice undo for get_user_pages.

Andrew Morton suggests tun/tap use release_pages, but it's not
exported.  It's not clear to me why this is in swap.c, but it exists
even without CONFIG_SWAP, so that's OK.

Signed-off-by: Rusty Russell <[EMAIL PROTECTED]>

diff -r abd2ad431e5c mm/swap.c
--- a/mm/swap.c Sat Apr 19 00:34:54 2008 +1000
+++ b/mm/swap.c Sat Apr 19 01:11:40 2008 +1000
@@ -346,6 +346,7 @@ void release_pages(struct page **pages, 
 
pagevec_free(&pages_to_free);
 }
+EXPORT_SYMBOL(release_pages);
 
 /*
  * The pages which we're about to release may be in the deferred lru-addition
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 5/5] tun: vringfd xmit support.

2008-04-18 Thread Ray Lee
On Fri, Apr 18, 2008 at 4:46 AM, pradeep singh rautela
<[EMAIL PROTECTED]> wrote:
>
> On Fri, Apr 18, 2008 at 10:13 AM, Rusty Russell <[EMAIL PROTECTED]> wrote:
>  >  +   /* How many pages will this take? */
>  >  +   npages = 1 + (base + len - 1)/PAGE_SIZE - base/PAGE_SIZE;
>
>  Hi Rusty,
>  A trivial suggestion, how about
>   npages = 1+(len -1)/PAGE_SIZE ?

That's not the same. In particular, his version accounts for the
fractional page at the beginning, while yours doesn't.

While it's tempting to use algebra to simplify things, it's not safe
to do so when the expression involves division over the integers. The
only care-free integer math in a computer is subtraction and ++.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 2/5] /dev/vring: simple userspace-kernel ringbuffer interface.

2008-04-18 Thread Rusty Russell
On Friday 18 April 2008 21:18:46 Andrew Morton wrote:
> > +   /* Must be a power of two, and limit indices to a u16. */
> > +   if (!num_descs || (num_descs & (num_descs-1)) || num_descs > 65536)
>
> We have an is_power_of_2().

Thanks, fixed.

> > + * vring_get - check out a vring file descriptor
> > + * @filp: the file structure to attach to (eg. from fget()).
> > + *
> > + * Userspace opens /dev/vring and mmaps it, then hands that fd to the
> > + * kernel subsystem it wants to communicate with.  That subsystem uses
> > + * this routine and vring_set_ops() to attach to it.
> > + *
> > + * This simply checks that it really is a vring fd (otherwise it
> > + * returns NULL), the other routine checks that it's not already
> > + * attached.
> > + */
>
> hm, I don't understand the big picture here yet.
>
> Isn't this kinda-sorta like what a relayfs file does?  The oprofile
> buffers?  etc?  Nothing in common at all, no hope?

An excellent question, but I thought the modern kernel etiquette was to only 
comment on whitespace and formatting, and call it "review"? :)

Yes, kinda-sorta in that it's a ring buffer.  No, in that it's bidir and 
consumption can be out-of-order (kind of important for I/O buffers).

But the reason I'm not proposing it as a syscall is that I'm not convinced 
it's the One True Solution which everyone should be using.  Time will tell: 
it's clearly not tied to tun and it's been generically useful for virtual 
I/O, but history has not been kind to new userspace interfaces.

> > +   mutex_unlock(&vr->lock);
> > +   local_irq_enable();
>
> what's this doing here?

Snot from previous version.  Removed.

> > +void vring_unset_ops(struct vring_info *vr)
> > +{
> > +   BUG_ON(!vr->ops);
> > +   mutex_lock(&vr->lock);
> > +   vr->ops = NULL;
> > +   mutex_unlock(&vr->lock);
> > +}
> > +EXPORT_SYMBOL_GPL(vring_unset_ops);
>
> Isn't this just vring_set_ops(vr, NULL, NULL)?

Yes, except I like the clarity and the BUG_ON.

> ponders #include 

"#include " for me, just to add more inclement weather to 
that teacup...

Thanks,
Rusty.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [6/6] [VIRTIO] net: Allow receiving SG packets

2008-04-18 Thread Herbert Xu
On Sat, Apr 19, 2008 at 12:08:04AM +1000, Rusty Russell wrote:
> On Friday 18 April 2008 13:24:27 Herbert Xu wrote:
> > Finally this patch lets virtio_net receive GSO packets in addition
> > to sending them.  This can definitely be optimised for the non-GSO
> > case.  For comparison the Xen approach stores one page in each skb
> > and uses subsequent skb's pages to construct an SG skb instead of
> > preallocating the maximum amount of pages per skb.
> 
> Well, this ends up freeing the unused pages immediately which is actually 
> quite nice.  We can cache them ourselves if we want, but we'd have to see if 
> the page allocation/free overhead is measurable...

I was thinking of the case of a large number of these interfaces
but yeah for one interface it's no biggie :)

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [6/6] [VIRTIO] net: Allow receiving SG packets

2008-04-18 Thread Rusty Russell
On Friday 18 April 2008 13:24:27 Herbert Xu wrote:
> Finally this patch lets virtio_net receive GSO packets in addition
> to sending them.  This can definitely be optimised for the non-GSO
> case.  For comparison the Xen approach stores one page in each skb
> and uses subsequent skb's pages to construct an SG skb instead of
> preallocating the maximum amount of pages per skb.

Well, this ends up freeing the unused pages immediately which is actually 
quite nice.  We can cache them ourselves if we want, but we'd have to see if 
the page allocation/free overhead is measurable...

Thanks!
Rusty.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 5/5] tun: vringfd xmit support.

2008-04-18 Thread pradeep singh rautela
On Fri, Apr 18, 2008 at 10:13 AM, Rusty Russell <[EMAIL PROTECTED]> wrote:
> This patch modifies tun to allow a vringfd to specify the send
>  buffer.  The user does a write to push out packets from the buffer.
>
>  Again we use the 'struct virtio_net_hdr' to allow userspace to send
>  GSO packets.  In this case, it can hint how much to copy, and the
>  other pages will be made into skb fragments.
>
>  Signed-off-by: Rusty Russell <[EMAIL PROTECTED]>
>  ---
>   drivers/net/tun.c  |  410 
> +
>   include/linux/if_tun.h |1
>   2 files changed, 351 insertions(+), 60 deletions(-)
>
>  diff -r f797ec115d1b drivers/net/tun.c
>  --- a/drivers/net/tun.c Fri Apr 18 05:58:40 2008 +1000
>  +++ b/drivers/net/tun.c Fri Apr 18 06:07:21 2008 +1000
>  @@ -65,6 +65,8 @@
>   #include 
>   #include 
>   #include 
>  +#include 
>  +#include 
>   #include 
>
>   #include 
>  @@ -102,8 +104,8 @@ struct tun_struct {
> u32 chr_filter[2];
> u32 net_filter[2];
>
>  -   struct vring_info   *inring;
>  -   struct file *infile;
>  +   struct vring_info   *inring, *outring;
>  +   struct file *infile, *outfile;
>
>   #ifdef TUN_DEBUG
> int debug;
>  @@ -258,6 +261,169 @@ static void tun_net_init(struct net_devi
> dev->tx_queue_len = TUN_READQ_SIZE;  /* We prefer our own 
> queue length */
> break;
> }
>  +}
>  +
>  +/* We don't consolidate consecutive iovecs, so huge iovecs can break here.
>  + * Users will learn not to do that. */
>  +static int get_user_skb_frags(const struct iovec *iv, size_t len,
>  + struct skb_frag_struct *f)
>  +{
>  +   unsigned int i, j, num_pg = 0;
>  +   int err;
>  +   struct page *pages[MAX_SKB_FRAGS];
>  +
>  +   down_read(¤t->mm->mmap_sem);
>  +   while (len) {
>  +   int n, npages;
>  +   unsigned long base, len;
>  +   base = (unsigned long)iv->iov_base;
>  +   len = (unsigned long)iv->iov_len;
>  +
>  +   if (len == 0) {
>  +   iv++;
>  +   continue;
>  +   }
>  +
>  +   /* How many pages will this take? */
>  +   npages = 1 + (base + len - 1)/PAGE_SIZE - base/PAGE_SIZE;

Hi Rusty,
A trivial suggestion, how about
  npages = 1+(len -1)/PAGE_SIZE ?

Thanks,
 --Pradeep
>  +   if (unlikely(num_pg + npages > MAX_SKB_FRAGS)) {
>  +   err = -ENOSPC;
>  +   goto fail;
>  +   }
>  +   n = get_user_pages(current, current->mm, base, npages,
>  +  0, 0, pages, NULL);
>  +   if (unlikely(n < 0)) {
>  +   err = n;
>  +   goto fail;
>  +   }
>  +
>  +   /* Transfer pages to the frag array */
>  +   for (j = 0; j < n; j++) {
>  +   f[num_pg].page = pages[j];
>  +   if (j == 0) {
>  +   f[num_pg].page_offset = offset_in_page(base);
>  +   f[num_pg].size = min(len, PAGE_SIZE -
>  +f[num_pg].page_offset);
>  +   } else {
>  +   f[num_pg].page_offset = 0;
>  +   f[num_pg].size = min(len, PAGE_SIZE);
>  +   }
>  +   len -= f[num_pg].size;
>  +   base += f[num_pg].size;
>  +   num_pg++;
>  +   }
>  +
>  +   if (unlikely(n != npages)) {
>  +   err = -EFAULT;
>  +   goto fail;
>  +   }
>  +   }
>  +   up_read(¤t->mm->mmap_sem);
>  +   return num_pg;
>  +
>  +fail:
>  +   for (i = 0; i < num_pg; i++)
>  +   put_page(f[i].page);
>  +   up_read(¤t->mm->mmap_sem);
>  +   return err;
>  +}
>  +
>  +/* We actually store this at the head of the skb. */
>  +struct skb_tun_hdr {
>  +   struct list_head list;
>  +   struct tun_struct *tun;
>  +   unsigned int id;
>  +   unsigned int len;
>  +};
>  +
>  +/* Get packet from user space buffer.  copylen is a hint as to how
>  + * much to copy (rest is pinned).  */
>  +static struct sk_buff *get_user_skb(struct tun_struct *tun, struct iovec 
> *iv,
>  +   size_t copylen, size_t len)
>  +{
>  +   struct tun_pi pi = { 0, __constant_htons(ETH_P_IP) };
>  +   struct sk_buff *skb;
>  +   size_t align = 0, extra = 0;
>  +   int err;
>  +
>  +   if (!(tun->flags & TUN_NO_PI)) {
>  +   if (len < sizeof(pi)) {
>  +   err = -EINVAL;
>  +   goto fail;
>  +   }
>  +   len -= sizeof(pi);
>  +
>  +   

Re: [PATCH 5/5] tun: vringfd xmit support.

2008-04-18 Thread Andrew Morton
On Fri, 18 Apr 2008 14:43:24 +1000 Rusty Russell <[EMAIL PROTECTED]> wrote:

> This patch modifies tun to allow a vringfd to specify the send
> buffer.  The user does a write to push out packets from the buffer.
> 
> Again we use the 'struct virtio_net_hdr' to allow userspace to send
> GSO packets.  In this case, it can hint how much to copy, and the
> other pages will be made into skb fragments.
> 
> 
> ...
>
> +/* We don't consolidate consecutive iovecs, so huge iovecs can break here.
> + * Users will learn not to do that. */
> +static int get_user_skb_frags(const struct iovec *iv, size_t len,
> +   struct skb_frag_struct *f)
> +{
> + unsigned int i, j, num_pg = 0;
> + int err;
> + struct page *pages[MAX_SKB_FRAGS];
> +
> + down_read(¤t->mm->mmap_sem);
> + while (len) {
> + int n, npages;
> + unsigned long base, len;
> + base = (unsigned long)iv->iov_base;
> + len = (unsigned long)iv->iov_len;
> +
> + if (len == 0) {
> + iv++;
> + continue;
> + }
> +
> + /* How many pages will this take? */
> + npages = 1 + (base + len - 1)/PAGE_SIZE - base/PAGE_SIZE;

Brain hurts.  I hope you got that right.

> + if (unlikely(num_pg + npages > MAX_SKB_FRAGS)) {
> + err = -ENOSPC;
> + goto fail;
> + }
> + n = get_user_pages(current, current->mm, base, npages,
> +0, 0, pages, NULL);

What is the maximum numbet of pages which an unpriviliged user can
concurrently pin with this code?

> + if (unlikely(n < 0)) {
> + err = n;
> + goto fail;
> + }
> +
> + /* Transfer pages to the frag array */
> + for (j = 0; j < n; j++) {
> + f[num_pg].page = pages[j];
> + if (j == 0) {
> + f[num_pg].page_offset = offset_in_page(base);
> + f[num_pg].size = min(len, PAGE_SIZE -
> +  f[num_pg].page_offset);
> + } else {
> + f[num_pg].page_offset = 0;
> + f[num_pg].size = min(len, PAGE_SIZE);
> + }
> + len -= f[num_pg].size;
> + base += f[num_pg].size;
> + num_pg++;
> + }

This loop is a fancy way of doing

num_pg = n;

> + if (unlikely(n != npages)) {
> + err = -EFAULT;
> + goto fail;
> + }

why not do this immediately after running get_user_pages()?

> + }
> + up_read(¤t->mm->mmap_sem);
> + return num_pg;
> +
> +fail:
> + for (i = 0; i < num_pg; i++)
> + put_page(f[i].page);

release_pages() could be a tad more efficient, but it's only error-path.

> + up_read(¤t->mm->mmap_sem);
> + return err;
> +}
> +
>  

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 2/5] /dev/vring: simple userspace-kernel ringbuffer interface.

2008-04-18 Thread Andrew Morton
On Fri, 18 Apr 2008 14:39:48 +1000 Rusty Russell <[EMAIL PROTECTED]> wrote:

> virtio introduced a ring structure ABI for guest-host communications
> (currently used by lguest and kvm).  Using this same ABI, we can
> create a nice fd version.
> 
> This is useful for efficiently passing packets to and from the tun,
> for example.
> 
> ...
>
> +static int vring_mmap(struct file *filp, struct vm_area_struct *vma)
> +{
> + unsigned long size, num_descs;
> + struct vring_info *vr = filp->private_data;
> + int err;
> +
> + /* We overload mmap's offset to hold the ring number. */
> + num_descs = vma->vm_pgoff;
> +
> + /* Must be a power of two, and limit indices to a u16. */
> + if (!num_descs || (num_descs & (num_descs-1)) || num_descs > 65536)

We have an is_power_of_2().

> + return -EINVAL;
> +
> + /* mmap size must be what we expect for such a ring. */
> + size = vma->vm_end - vma->vm_start;
> + if (size != ALIGN(vring_size(num_descs, PAGE_SIZE), PAGE_SIZE))
> + return -EINVAL;
> +
> + /* We only let them map this in one place. */
> + mutex_lock(&vr->lock);
> + if (vr->ring.num != 0) {
> + err = -EBUSY;
> + goto unlock;
> + }
> +
> + vring_init(&vr->ring, num_descs, (void *)vma->vm_start, PAGE_SIZE);
> +
> + vr->mask = num_descs - 1;
> + err = 0;
> +
> +unlock:
> + mutex_unlock(&vr->lock);
> + return err;
> +}
>
> ...
>
> +/**
> + * vring_get - check out a vring file descriptor
> + * @filp: the file structure to attach to (eg. from fget()).
> + *
> + * Userspace opens /dev/vring and mmaps it, then hands that fd to the
> + * kernel subsystem it wants to communicate with.  That subsystem uses
> + * this routine and vring_set_ops() to attach to it.
> + *
> + * This simply checks that it really is a vring fd (otherwise it
> + * returns NULL), the other routine checks that it's not already
> + * attached.
> + */

hm, I don't understand the big picture here yet.

Isn't this kinda-sorta like what a relayfs file does?  The oprofile
buffers?  etc?  Nothing in common at all, no hope?

> +struct vring_info *vring_get(struct file *filp)
> +{
> + /* Must be one of ours. */
> + if (filp->f_op != &vring_fops)
> + return NULL;
> +
> + return filp->private_data;
> +}
> +EXPORT_SYMBOL_GPL(vring_get);
> +
> +/**
> + * vring_set_ops - attach operations to a vring file descriptor.
> + * @vr: the vring_info returned from vring_get.
> + * @ops: the operations to attach.
> + * @ops_data: the argument to the ops callbacks.
> + *
> + * This is called after vring_get(): the reason for the two-part
> + * process is that the ops can be called before vring_set_ops returns
> + * (we don't do locking), so you really need to set things up before
> + * this call.
> + *
> + * This simply checks that the ring is not already attached to something,
> + * then sets the ops.
> + */
> +int vring_set_ops(struct vring_info *vr,
> +   const struct vring_ops *ops, void *ops_data)
> +{
> + int err;
> +
> + mutex_lock(&vr->lock);
> + if (vr->ops) {
> + err = -EBUSY;
> + goto unlock;
> + }
> +
> + /* We don't lock, so make sure we get this in the right order. */
> + vr->ops_data = ops_data;
> + wmb();
> + vr->ops = ops;
> +
> + err = 0;
> +unlock:
> + mutex_unlock(&vr->lock);
> + local_irq_enable();

what's this doing here?

> + return err;
> +}
> +EXPORT_SYMBOL_GPL(vring_set_ops);
> +
> +/**
> + * vring_unset_ops - remove operations to a vring file descriptor.
> + * @vr: the vring_info previously successfully vring_set_ops'd
> + */
> +void vring_unset_ops(struct vring_info *vr)
> +{
> + BUG_ON(!vr->ops);
> + mutex_lock(&vr->lock);
> + vr->ops = NULL;
> + mutex_unlock(&vr->lock);
> +}
> +EXPORT_SYMBOL_GPL(vring_unset_ops);

Isn't this just vring_set_ops(vr, NULL, NULL)?

> +static struct miscdevice vring_dev = {
> + .minor = MISC_DYNAMIC_MINOR,
> + .name = KBUILD_MODNAME,
> + .fops = &vring_fops,
> +};
> +
> +static int __init init(void)
> +{
> + return misc_register(&vring_dev);
> +}
> +
> +static void __exit fini(void)
> +{
> + misc_deregister(&vring_dev);
> +}
> +
> +module_init(init);
> +module_exit(fini);
> diff -r b2d9869d338f include/linux/vring.h
> --- /dev/null Thu Jan 01 00:00:00 1970 +
> +++ b/include/linux/vring.h   Fri Apr 18 13:35:16 2008 +1000
> @@ -0,0 +1,58 @@
> +/* Ring-buffer file descriptor implementation.
> + *
> + *  Copyright 2008 Rusty Russell IBM Corporation
> + *
> + *  This program is free software; you can redistribute it and/or modify
> + *  it under the terms of the GNU General Public License as published by
> + *  the Free Software Foundation; either version 2 of the License, or
> + *  (at your option) any later version.
> + *
> + *  This program is distributed in the hope that it will be useful,
> + *  but WITHOUT ANY WARRANTY; without even the implied warr

Re: [kvm-devel] [PATCH 1/1] QEMU/KVM: Support for PCI Passthrough

2008-04-18 Thread Samuel Masham
On Fri, Apr 18, 2008 at 2:39 PM, Amit Shah <[EMAIL PROTECTED]> wrote:
> * On Monday 14 Apr 2008 06:01:07 Samuel Masham wrote:
>  >
>  > Please keep the userspace support alive.
>  >
>  > I am particularly interested in using the pci-passthough to qemu
>  > running non x86 system emulation
>  > (at the moment mips)
>  >
>  > My hope is that the pci - passthough could help with developing
>  > drivers and testing across architectures...
>
>  OK; keeping support around won't be too much of a hassle, though the current
>  support for pci-passthrough and the irqhook module are developed with x86 in
>  mind (and only tested on x86).
>
>  Since it's not tested on any other architecture, I've marked it TARGET_I386
>  and TARGET_X86_64 for now. Feel free to extend it to other architectures.

Thanks,

I will let you know if I get anything useful out of it.

Samuel
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization