Re: [PATCH][RFC] network splice receive v3

2007-07-19 Thread Evgeniy Polyakov
Hi.

On Fri, Jul 13, 2007 at 02:21:00PM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote:
  It really looks like the last tree we tested, so if you think additional
  one will not hurt, feel free to ping, so I will completely rebase
  testing tree.
 
 It would be great if you could retest! There are some minor changes in
 there, and some extra testing definitely will not hurt.

I've just tested it with 2.6.22
(e1c1e98d2a3f57b22a0d4136c8160e54404aa437 commit) and did not found any
problems - after qute big files were transferred there is no observed 
previously skb leak, no crashes (quite a few debug options are turned on 
in config) and files are correct on both peers, so it works good.

 -- 
 Jens Axboe

-- 
Evgeniy Polyakov
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RFC] network splice receive v3

2007-07-19 Thread YOSHIFUJI Hideaki / 吉藤英明
Hello.

In article [EMAIL PROTECTED] (at Wed, 11 Jul 2007 11:19:27 +0200), Jens Axboe 
[EMAIL PROTECTED] says:

 @@ -835,6 +835,7 @@ const struct proto_ops inet_stream_ops = {
   .recvmsg   = sock_common_recvmsg,
   .mmap  = sock_no_mmap,
   .sendpage  = tcp_sendpage,
 + .splice_read   = tcp_splice_read,
  #ifdef CONFIG_COMPAT
   .compat_setsockopt = compat_sock_common_setsockopt,
   .compat_getsockopt = compat_sock_common_getsockopt,

Please add similar bits in net/ipv6/af_inet6.c
unless there are any dependency on IPv4.
(And if there are, it is not good.)

--yoshfuji
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RFC] network splice receive v3

2007-07-19 Thread Jens Axboe
On Thu, Jul 19 2007, YOSHIFUJI Hideaki / ?$B5HF#1QL@ wrote:
 Hello.
 
 In article [EMAIL PROTECTED] (at Wed, 11 Jul 2007 11:19:27 +0200), Jens 
 Axboe [EMAIL PROTECTED] says:
 
  @@ -835,6 +835,7 @@ const struct proto_ops inet_stream_ops = {
  .recvmsg   = sock_common_recvmsg,
  .mmap  = sock_no_mmap,
  .sendpage  = tcp_sendpage,
  +   .splice_read   = tcp_splice_read,
   #ifdef CONFIG_COMPAT
  .compat_setsockopt = compat_sock_common_setsockopt,
  .compat_getsockopt = compat_sock_common_getsockopt,
 
 Please add similar bits in net/ipv6/af_inet6.c
 unless there are any dependency on IPv4.
 (And if there are, it is not good.)

There are no specific ipv4 depedencies, it's just an oversight. So
thanks for the clue, I'll add it!

-- 
Jens Axboe

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RFC] network splice receive v3

2007-07-13 Thread Jens Axboe
On Thu, Jul 12 2007, Evgeniy Polyakov wrote:
 On Wed, Jul 11, 2007 at 11:19:27AM +0200, Jens Axboe ([EMAIL PROTECTED]) 
 wrote:
  Hi,
 
 Hi Jens.
 
  Here's an updated implementation of tcp network splice receive support.
  It actually works for me now, no data corruption seen.
  
  For the original announcement and how to test it, see:
  
  http://marc.info/?l=linux-netdevm=118103093400770w=2
  
  The splice core changes needed to support this are now merged in
  2.6.22-git, so the patchset shrinks to just two patches - one for adding
  a release hook, and one for the networking changes.
  
  The code is also available in the splice-net branch here:
  
  git://git.kernel.dk/data/git/linux-2.6-block.git splice-net
  
  There's a third experimental patch in there that allows vmsplice
  directly to user memory, that still needs some work though.
  
  Comments, testing welcome!
 
 It looks like you included all bits we found in the previous runs, so
 likely it will work good, but so far I have conflicts merging todays git
 and your tree in include/linux/splice.h, fs/ext2/file.c, fs/splice.c and 
 mm/filemap_xip.c. This can be a problem with my tree though.

Hmm, the patch should apply directly to the tree as of when I posted
this original mail, or any later one. I just tried a rebase, and it
rebased fine on top of the current -git as well. So I think the issue is
with your tree, sorry!

 It really looks like the last tree we tested, so if you think additional
 one will not hurt, feel free to ping, so I will completely rebase
 testing tree.

It would be great if you could retest! There are some minor changes in
there, and some extra testing definitely will not hurt.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RFC] network splice receive v3

2007-07-12 Thread Evgeniy Polyakov
On Wed, Jul 11, 2007 at 11:19:27AM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote:
 Hi,

Hi Jens.

 Here's an updated implementation of tcp network splice receive support.
 It actually works for me now, no data corruption seen.
 
 For the original announcement and how to test it, see:
 
 http://marc.info/?l=linux-netdevm=118103093400770w=2
 
 The splice core changes needed to support this are now merged in
 2.6.22-git, so the patchset shrinks to just two patches - one for adding
 a release hook, and one for the networking changes.
 
 The code is also available in the splice-net branch here:
 
 git://git.kernel.dk/data/git/linux-2.6-block.git splice-net
 
 There's a third experimental patch in there that allows vmsplice
 directly to user memory, that still needs some work though.
 
 Comments, testing welcome!

It looks like you included all bits we found in the previous runs, so
likely it will work good, but so far I have conflicts merging todays git
and your tree in include/linux/splice.h, fs/ext2/file.c, fs/splice.c and 
mm/filemap_xip.c. This can be a problem with my tree though.
It really looks like the last tree we tested, so if you think additional
one will not hurt, feel free to ping, so I will completely rebase
testing tree.

 -- 
 Jens Axboe


-- 
Evgeniy Polyakov
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH][RFC] network splice receive v3

2007-07-11 Thread Jens Axboe
Hi,

Here's an updated implementation of tcp network splice receive support.
It actually works for me now, no data corruption seen.

For the original announcement and how to test it, see:

http://marc.info/?l=linux-netdevm=118103093400770w=2

The splice core changes needed to support this are now merged in
2.6.22-git, so the patchset shrinks to just two patches - one for adding
a release hook, and one for the networking changes.

The code is also available in the splice-net branch here:

git://git.kernel.dk/data/git/linux-2.6-block.git splice-net

There's a third experimental patch in there that allows vmsplice
directly to user memory, that still needs some work though.

Comments, testing welcome!

-- 
Jens Axboe

From e59a68f2d7d261b301960b97659910aab8e3d776 Mon Sep 17 00:00:00 2001
From: Jens Axboe [EMAIL PROTECTED]
Date: Mon, 11 Jun 2007 13:00:32 +0200
Subject: [PATCH] splice: don't assume regular pages in splice_to_pipe()

Allow caller to pass in a release function, there might be
other resources that need releasing as well. Needed for
network receive.

Signed-off-by: Jens Axboe [EMAIL PROTECTED]
---
 fs/splice.c|9 -
 include/linux/splice.h |1 +
 2 files changed, 9 insertions(+), 1 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index 3160951..4b4b501 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -254,11 +254,16 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
 	}
 
 	while (page_nr  spd_pages)
-		page_cache_release(spd-pages[page_nr++]);
+		spd-spd_release(spd, page_nr++);
 
 	return ret;
 }
 
+static void spd_release_page(struct splice_pipe_desc *spd, unsigned int i)
+{
+	page_cache_release(spd-pages[i]);
+}
+
 static int
 __generic_file_splice_read(struct file *in, loff_t *ppos,
 			   struct pipe_inode_info *pipe, size_t len,
@@ -277,6 +282,7 @@ __generic_file_splice_read(struct file *in, loff_t *ppos,
 		.partial = partial,
 		.flags = flags,
 		.ops = page_cache_pipe_buf_ops,
+		.spd_release = spd_release_page,
 	};
 
 	index = *ppos  PAGE_CACHE_SHIFT;
@@ -1674,6 +1680,7 @@ static long vmsplice_to_pipe(struct file *file, const struct iovec __user *iov,
 		.partial = partial,
 		.flags = flags,
 		.ops = user_page_pipe_buf_ops,
+		.spd_release = spd_release_page,
 	};
 
 	pipe = pipe_info(file-f_path.dentry-d_inode);
diff --git a/include/linux/splice.h b/include/linux/splice.h
index 2c08456..b8fa41e 100644
--- a/include/linux/splice.h
+++ b/include/linux/splice.h
@@ -54,6 +54,7 @@ struct splice_pipe_desc {
 	int nr_pages;			/* number of pages in map */
 	unsigned int flags;		/* splice flags */
 	const struct pipe_buf_operations *ops;/* ops associated with output pipe */
+	void (*spd_release)(struct splice_pipe_desc *, unsigned int);
 };
 
 typedef int (splice_actor)(struct pipe_inode_info *, struct pipe_buffer *,
-- 
1.5.3.rc0.90.gbaa79

From b62e4a5a3e3220702e837e556427972dc591ff59 Mon Sep 17 00:00:00 2001
From: Jens Axboe [EMAIL PROTECTED]
Date: Wed, 20 Jun 2007 09:54:14 +0200
Subject: [PATCH] TCP splice receive support

Support for network splice receive.

Signed-off-by: Jens Axboe [EMAIL PROTECTED]
---
 include/linux/net.h|3 +
 include/linux/skbuff.h |5 +
 include/net/tcp.h  |3 +
 net/core/skbuff.c  |  246 
 net/ipv4/af_inet.c |1 +
 net/ipv4/tcp.c |  129 +
 net/socket.c   |   13 +++
 7 files changed, 400 insertions(+), 0 deletions(-)

diff --git a/include/linux/net.h b/include/linux/net.h
index efc4517..472ee12 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -19,6 +19,7 @@
 #define _LINUX_NET_H
 
 #include linux/wait.h
+#include linux/splice.h
 #include asm/socket.h
 
 struct poll_table_struct;
@@ -165,6 +166,8 @@ struct proto_ops {
   struct vm_area_struct * vma);
 	ssize_t		(*sendpage)  (struct socket *sock, struct page *page,
   int offset, size_t size, int flags);
+	ssize_t 	(*splice_read)(struct socket *sock,  loff_t *ppos,
+   struct pipe_inode_info *pipe, size_t len, unsigned int flags);
 };
 
 struct net_proto_family {
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 6f0b2f7..177bffc 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1504,6 +1504,11 @@ extern int	   skb_store_bits(struct sk_buff *skb, int offset,
 extern __wsum	   skb_copy_and_csum_bits(const struct sk_buff *skb,
 	  int offset, u8 *to, int len,
 	  __wsum csum);
+extern int skb_splice_bits(struct sk_buff *skb,
+		unsigned int offset,
+		struct pipe_inode_info *pipe,
+		unsigned int len,
+		unsigned int flags);
 extern void	   skb_copy_and_csum_dev(const struct sk_buff *skb, u8 *to);
 extern void	   skb_split(struct sk_buff *skb,
  struct sk_buff *skb1, const u32 len);
diff --git a/include/net/tcp.h b/include/net/tcp.h
index a8af9ae..8e86697 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -308,6 +308,9 @@ extern int	

Re: [PATCH][RFC] network splice receive v3

2007-07-11 Thread Joel Becker
On Wed, Jul 11, 2007 at 11:19:27AM +0200, Jens Axboe wrote:
 Subject: [PATCH] splice: don't assume regular pages in splice_to_pipe()
 
 Allow caller to pass in a release function, there might be
 other resources that need releasing as well. Needed for
 network receive.
 
 diff --git a/fs/splice.c b/fs/splice.c
 index 3160951..4b4b501 100644
 --- a/fs/splice.c
 +++ b/fs/splice.c
 @@ -254,11 +254,16 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
   }
  
   while (page_nr  spd_pages)
 - page_cache_release(spd-pages[page_nr++]);
 + spd-spd_release(spd, page_nr++);

Rather than requiring the caller set this, shouldn't we just
allow it?  Especially given there is only one non-page user?

while (page_nr  spd_pages)
 -  page_cache_release(spd-pages[page_nr++]);
 +  if (spd-spd_release)
 +  spd-spd_release(spd, page_nr++);
 +  else
 +  page_cache_release(spd-pages[page_nr++]);

Joel

-- 

Any man who is under 30, and is not a liberal, has not heart;
 and any man who is over 30, and is not a conservative, has no brains.
 - Sir Winston Churchill 

Joel Becker
Principal Software Developer
Oracle
E-mail: [EMAIL PROTECTED]
Phone: (650) 506-8127
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RFC] network splice receive v3

2007-07-11 Thread Jens Axboe
On Wed, Jul 11 2007, Joel Becker wrote:
 On Wed, Jul 11, 2007 at 11:19:27AM +0200, Jens Axboe wrote:
  Subject: [PATCH] splice: don't assume regular pages in splice_to_pipe()
  
  Allow caller to pass in a release function, there might be
  other resources that need releasing as well. Needed for
  network receive.
  
  diff --git a/fs/splice.c b/fs/splice.c
  index 3160951..4b4b501 100644
  --- a/fs/splice.c
  +++ b/fs/splice.c
  @@ -254,11 +254,16 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
  }
   
  while (page_nr  spd_pages)
  -   page_cache_release(spd-pages[page_nr++]);
  +   spd-spd_release(spd, page_nr++);
 
   Rather than requiring the caller set this, shouldn't we just
 allow it?  Especially given there is only one non-page user?
 
   while (page_nr  spd_pages)
  -page_cache_release(spd-pages[page_nr++]);
  +if (spd-spd_release)
  +spd-spd_release(spd, page_nr++);
  +else
  +page_cache_release(spd-pages[page_nr++]);

Certainly possible, I think it's cleaner with it always being set
though. If it grows other out-of-splice.c users, then your change may be
a good idea though.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RFC] network splice receive v2

2007-06-15 Thread Jens Axboe
On Fri, Jun 15 2007, Evgeniy Polyakov wrote:
 On Thu, Jun 14, 2007 at 01:59:02PM +0200, Jens Axboe ([EMAIL PROTECTED]) 
 wrote:
  On Thu, Jun 14 2007, Evgeniy Polyakov wrote:
   On Wed, Jun 13, 2007 at 12:01:04PM +0400, Evgeniy Polyakov ([EMAIL 
   PROTECTED]) wrote:
I will rebase my tree, likely something was not merged correctly.
   
   Ok, I've just rebased a tree from recent git and pulled from brick -
   things seems to be settled. I've ran several tests with different
   filesizes and all files were received correctly without kernel crashes.
   There is skb leak indeed, and it looks like it is in the
   __splice_from_pipe() for the last page.
  
  Uh, the leak, right - I had forgotten about that, was working on getting
  vmsplice into shape the last two days. Interesting that you mention the
  last page, I'll dig in now! Any more info on this (how did you notice
  the leak originates from there)?
 
 I first observed leak via slabinfo data (not a leak, but number of skbs
 did not dropped after quite huge files were transferred), then added
 number of allocated and freed objects in skbuff.c, they did not match
 for big files, so I started to check splice source and found that
 sometimes -release callback is not called, but code breaks out of the
 loop. I've put some printks in __splice_from_pipe() and found following
 case, when skb is leaked:
 when the same cloned skb was shared multiple times (no more than 2 though),
 only one copy was freed.
 
 Further analysis description requires some splice background (obvious
 for you, but that clears it for me):
 pipe_buffer only contains 16 pages.
 There is a code, which copies pages (pointers) from spd to pipe_buffer
 (splice_to_pipe()).
 skb allocations happens in chunks of different size (i.e. with different
 number of skbs/pages per call), so it is possible that number of
 allocated skbs will be less than pipe_buffer size (16), and then the
 rest of the pages will be put into different (or the same) pipe_buffer later.
 Sometimes two skbs from above example happens to be on the boundary of
 the pipe buffer, so only one of them is being copied into pipe_buffer,
 which is then transferred over the pipe.
 So, we have a case, when spd has (it had more, but this two are special) 
 2 pages (actually the same page, but two references to it), but pipe_buffer 
 has a place only for one. In that case second page from spd will be missed.
 
 So, things turned down to be not in the __splice_from_pipe(), but
 splice_to_pipe(). Attached patch fixes a leak for me.
 It was tested with different data files and all were received correctly.
 
 Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]
 
 diff --git a/fs/splice.c b/fs/splice.c
 index bc481f1..365bfd9 100644
 --- a/fs/splice.c
 +++ b/fs/splice.c
 @@ -211,8 +211,6 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
   break;
   if (pipe-nrbufs  PIPE_BUFFERS)
   continue;
 -
 - break;
   }
  
   if (spd-flags  SPLICE_F_NONBLOCK) {
 

Hmm, curious. If we hit that location, then two conditions are true:

- Pipe is full
- We transferred some data

if you remove the break, then you'll end up blocking in pipe_wait()
(unless you have SPLICE_F_NONBLOCK also set). And we don't want to block
waiting for more room, if we already transferred some data. In that case
we just want to return a short splice. Looking at pipe_write(), it'll
block as well though. Just doesn't seem optimal to me, but...

So the question is why would doing the break there cause a leak? I just
don't yet see how it can happen, I'd love to fix that condition instead.
For the case you describe, we should have page_nr == 1 and spd-nr_pages
== 2. Is the:

while (page_nr  spd-nr_pages)
spd-spd_release(spd, page_nr++);

not dropping the right reference?

-- 
Jens Axboe

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RFC] network splice receive v2

2007-06-15 Thread Evgeniy Polyakov
On Fri, Jun 15, 2007 at 10:43:18AM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote:
  So, things turned down to be not in the __splice_from_pipe(), but
  splice_to_pipe(). Attached patch fixes a leak for me.
  It was tested with different data files and all were received correctly.
  
  Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]
  
  diff --git a/fs/splice.c b/fs/splice.c
  index bc481f1..365bfd9 100644
  --- a/fs/splice.c
  +++ b/fs/splice.c
  @@ -211,8 +211,6 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
  break;
  if (pipe-nrbufs  PIPE_BUFFERS)
  continue;
  -
  -   break;
  }
   
  if (spd-flags  SPLICE_F_NONBLOCK) {
  
 
 Hmm, curious. If we hit that location, then two conditions are true:
 
 - Pipe is full
 - We transferred some data

Yep.

 if you remove the break, then you'll end up blocking in pipe_wait()
 (unless you have SPLICE_F_NONBLOCK also set). And we don't want to block
 waiting for more room, if we already transferred some data. In that case
 we just want to return a short splice. Looking at pipe_write(), it'll
 block as well though. Just doesn't seem optimal to me, but...
 
 So the question is why would doing the break there cause a leak? I just
 don't yet see how it can happen, I'd love to fix that condition instead.
 For the case you describe, we should have page_nr == 1 and spd-nr_pages
 == 2. Is the:
 
 while (page_nr  spd-nr_pages)
 spd-spd_release(spd, page_nr++);
 
 not dropping the right reference?

Both spd-nr_pages and page_nr are equal to 1. When spd-nr_pages
was 2 there was only 1 free slot in pipe_buffer.

spd_fill_page: allocated: 89, freed: 73, data: 81003d606d28
spd_fill_page: allocated: 90, freed: 73, data: 81003d606d28
splice_to_pipe: priv: 81003d606d28, spd_nrpages: 1, pipe_nrbufs: 16, 
page_nr: 1.

spd_fill_page: allocated: 91, freed: 73, data: fff81003d6549c8 // next data
...

__splice_from_pipe: process: sd_len: 0, buf_len: 0, buf_priv: 81003d606d28.
__splice_from_pipe: release 81003d606d28.
sock_pipe_buf_release: allocated: 91, freed: 89, ptr: 81003d606d28

splice_to_pipe: priv: 81003d6549c8, spd_nrpages: 0, pipe_nrbufs: 1, 
page_nr: 1. // next data

 -- 
 Jens Axboe

-- 
Evgeniy Polyakov
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RFC] network splice receive v2

2007-06-15 Thread Jens Axboe
On Fri, Jun 15 2007, Evgeniy Polyakov wrote:
 On Fri, Jun 15, 2007 at 10:43:18AM +0200, Jens Axboe ([EMAIL PROTECTED]) 
 wrote:
   So, things turned down to be not in the __splice_from_pipe(), but
   splice_to_pipe(). Attached patch fixes a leak for me.
   It was tested with different data files and all were received correctly.
   
   Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]
   
   diff --git a/fs/splice.c b/fs/splice.c
   index bc481f1..365bfd9 100644
   --- a/fs/splice.c
   +++ b/fs/splice.c
   @@ -211,8 +211,6 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
 break;
 if (pipe-nrbufs  PIPE_BUFFERS)
 continue;
   -
   - break;
 }

 if (spd-flags  SPLICE_F_NONBLOCK) {
   
  
  Hmm, curious. If we hit that location, then two conditions are true:
  
  - Pipe is full
  - We transferred some data
 
 Yep.
 
  if you remove the break, then you'll end up blocking in pipe_wait()
  (unless you have SPLICE_F_NONBLOCK also set). And we don't want to block
  waiting for more room, if we already transferred some data. In that case
  we just want to return a short splice. Looking at pipe_write(), it'll
  block as well though. Just doesn't seem optimal to me, but...
  
  So the question is why would doing the break there cause a leak? I just
  don't yet see how it can happen, I'd love to fix that condition instead.
  For the case you describe, we should have page_nr == 1 and spd-nr_pages
  == 2. Is the:
  
  while (page_nr  spd-nr_pages)
  spd-spd_release(spd, page_nr++);
  
  not dropping the right reference?
 
 Both spd-nr_pages and page_nr are equal to 1. When spd-nr_pages
 was 2 there was only 1 free slot in pipe_buffer.

Ah duh, indeed, it is a dumb bug indeed. This should work.

diff --git a/fs/splice.c b/fs/splice.c
index 89871c6..5327cbf 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -172,6 +172,7 @@ static const struct pipe_buf_operations 
user_page_pipe_buf_ops = {
 ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
   struct splice_pipe_desc *spd)
 {
+   unsigned int spd_pages = spd-nr_pages;
int ret, do_wakeup, page_nr;
 
ret = 0;
@@ -252,7 +253,7 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
}
}
 
-   while (page_nr  spd-nr_pages)
+   while (page_nr  spd_pages)
spd-spd_release(spd, page_nr++);
 
return ret;

-- 
Jens Axboe

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RFC] network splice receive v2

2007-06-15 Thread Evgeniy Polyakov
On Fri, Jun 15, 2007 at 11:34:30AM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote:
  Both spd-nr_pages and page_nr are equal to 1. When spd-nr_pages
  was 2 there was only 1 free slot in pipe_buffer.
 
 Ah duh, indeed, it is a dumb bug indeed. This should work.

Yep, this one works too.

 diff --git a/fs/splice.c b/fs/splice.c
 index 89871c6..5327cbf 100644
 --- a/fs/splice.c
 +++ b/fs/splice.c
 @@ -172,6 +172,7 @@ static const struct pipe_buf_operations 
 user_page_pipe_buf_ops = {
  ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
  struct splice_pipe_desc *spd)
  {
 + unsigned int spd_pages = spd-nr_pages;
   int ret, do_wakeup, page_nr;
  
   ret = 0;
 @@ -252,7 +253,7 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
   }
   }
  
 - while (page_nr  spd-nr_pages)
 + while (page_nr  spd_pages)
   spd-spd_release(spd, page_nr++);
  
   return ret;
 
 -- 
 Jens Axboe

-- 
Evgeniy Polyakov
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RFC] network splice receive v2

2007-06-14 Thread Evgeniy Polyakov
On Wed, Jun 13, 2007 at 12:01:04PM +0400, Evgeniy Polyakov ([EMAIL PROTECTED]) 
wrote:
 I will rebase my tree, likely something was not merged correctly.

Ok, I've just rebased a tree from recent git and pulled from brick -
things seems to be settled. I've ran several tests with different
filesizes and all files were received correctly without kernel crashes.
There is skb leak indeed, and it looks like it is in the
__splice_from_pipe() for the last page.

Thanks Jens, good work.

-- 
Evgeniy Polyakov
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RFC] network splice receive v2

2007-06-14 Thread Jens Axboe
On Thu, Jun 14 2007, Evgeniy Polyakov wrote:
 On Wed, Jun 13, 2007 at 12:01:04PM +0400, Evgeniy Polyakov ([EMAIL 
 PROTECTED]) wrote:
  I will rebase my tree, likely something was not merged correctly.
 
 Ok, I've just rebased a tree from recent git and pulled from brick -
 things seems to be settled. I've ran several tests with different
 filesizes and all files were received correctly without kernel crashes.
 There is skb leak indeed, and it looks like it is in the
 __splice_from_pipe() for the last page.

Uh, the leak, right - I had forgotten about that, was working on getting
vmsplice into shape the last two days. Interesting that you mention the
last page, I'll dig in now! Any more info on this (how did you notice
the leak originates from there)?

 Thanks Jens, good work.

Thanks, you're welcome!

-- 
Jens Axboe

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RFC] network splice receive v2

2007-06-13 Thread Evgeniy Polyakov
On Tue, Jun 12, 2007 at 08:17:32PM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote:
 On Tue, Jun 12 2007, Evgeniy Polyakov wrote:
  On Mon, Jun 11, 2007 at 01:59:26PM +0200, Jens Axboe ([EMAIL PROTECTED]) 
  wrote:
   Patches are against the #splice branch of the block repo, official url
   of that is:
   
   git://git.kernel.dk/data/git/linux-2.6-block.git/
   
   and it's based on Linus main tree. Let me know if I should supply netdev
   branch patches instead, or even just provide a rolled up patch (or patch
   series) for anyone curious to test or play with it.
  
  Hi Jens.
  
  I've just pulled your tree (splice-net, but splice tree looks the
  same, git pull says 'Already up-to-date.') on top of linus git and got
  following bug trace.  I will investigate it further tomorrow.
 
 Please tell me the contents of splice-net, it looks like you didn't
 actually use the new code. That BUG_ON() is in get_page(), which
 splice-net no longer uses. So the bug report cannot be valid for the
 current code.

This is the last commit in that tree:

commit c90a6ce8242d108a5bc6fd0bc1b2aca72a2b5944
Author: Jens Axboe [EMAIL PROTECTED]
Date:   Mon Jun 11 21:59:50 2007 +0200

TCP splice receive support

Support for network splice receive.

Signed-off-by: Jens Axboe [EMAIL PROTECTED]

:100644 100644 efc4517... 472ee12... M include/linux/net.h
:100644 100644 e7367c7... 64e3eed... M include/linux/skbuff.h
:100644 100644 a8af9ae... 8e86697... M include/net/tcp.h
:100644 100644 7c6a34e... daea7b0... M net/core/skbuff.c
:100644 100644 041fba3... 0ff9f86... M net/ipv4/af_inet.c
:100644 100644 450f44b... 63efd7a... M net/ipv4/tcp.c
:100644 100644 f453019... 41240f5... M net/socket.c

I will rebase my tree, likely something was not merged correctly.

 -- 
 Jens Axboe

-- 
Evgeniy Polyakov
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RFC] network splice receive v2

2007-06-13 Thread Jens Axboe
On Wed, Jun 13 2007, Evgeniy Polyakov wrote:
 On Tue, Jun 12, 2007 at 08:17:32PM +0200, Jens Axboe ([EMAIL PROTECTED]) 
 wrote:
  On Tue, Jun 12 2007, Evgeniy Polyakov wrote:
   On Mon, Jun 11, 2007 at 01:59:26PM +0200, Jens Axboe ([EMAIL PROTECTED]) 
   wrote:
Patches are against the #splice branch of the block repo, official url
of that is:

git://git.kernel.dk/data/git/linux-2.6-block.git/

and it's based on Linus main tree. Let me know if I should supply netdev
branch patches instead, or even just provide a rolled up patch (or patch
series) for anyone curious to test or play with it.
   
   Hi Jens.
   
   I've just pulled your tree (splice-net, but splice tree looks the
   same, git pull says 'Already up-to-date.') on top of linus git and got
   following bug trace.  I will investigate it further tomorrow.
  
  Please tell me the contents of splice-net, it looks like you didn't
  actually use the new code. That BUG_ON() is in get_page(), which
  splice-net no longer uses. So the bug report cannot be valid for the
  current code.
 
 This is the last commit in that tree:
 
 commit c90a6ce8242d108a5bc6fd0bc1b2aca72a2b5944
 Author: Jens Axboe [EMAIL PROTECTED]
 Date:   Mon Jun 11 21:59:50 2007 +0200
 
 TCP splice receive support
 
 Support for network splice receive.
 
 Signed-off-by: Jens Axboe [EMAIL PROTECTED]
 
 :100644 100644 efc4517... 472ee12... M include/linux/net.h
 :100644 100644 e7367c7... 64e3eed... M include/linux/skbuff.h
 :100644 100644 a8af9ae... 8e86697... M include/net/tcp.h
 :100644 100644 7c6a34e... daea7b0... M net/core/skbuff.c
 :100644 100644 041fba3... 0ff9f86... M net/ipv4/af_inet.c
 :100644 100644 450f44b... 63efd7a... M net/ipv4/tcp.c
 :100644 100644 f453019... 41240f5... M net/socket.c
 
 I will rebase my tree, likely something was not merged correctly.

It must have been, please let me know how the current stuf works for
you!

-- 
Jens Axboe

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RFC] network splice receive

2007-06-12 Thread Evgeniy Polyakov
On Sat, Jun 09, 2007 at 08:36:09AM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote:
 On Fri, Jun 08 2007, Evgeniy Polyakov wrote:
  On Fri, Jun 08, 2007 at 06:57:25PM +0400, Evgeniy Polyakov ([EMAIL 
  PROTECTED]) wrote:
   I will try some things for the nearest 30-60 minutes, and then will move 
   to
   canoe trip until thuesday, so will not be able to work on this idea.
  
  Ok, replacing in fs/splice.c every page_cache_release() with
  static void splice_page_release(struct page *p)
  {
  if (!PageSlab(p))
  page_cache_release(p);
  }
 
 Ehm, I don't see why that should be necessary. Except in
 splice_to_pipe(), I have considered that we need to pass in a release
 function if mapping fails at some point. But it's probably best to do
 that in the caller, since they have the knowledge of how to release the
 pages.
 
 The rest of the PageSlab() tests are bogus.

I had a crashdump, where page was released via splice_to_pipe() indeed,
I did not investigate if it is possible to release provided page in
other places. I think if in future there will other slab usage cases
except networking receiving, that might be useful, but as is it is not
needed.

  and putting cloned skb into private field instead of 
  original on in spd_fill_page() ends up without kernel hung.
 
 Why? Seems pointless to allocate a clone just to hold on to the skb, a
 reference should be equally good. I would not be opposed to doing it
 this way, I just don't see what a clone buys us as compared to just
 holding that reference to the skb.

Receiving code does not expect shared skbs - too many fields are changed
with assumptions that it is a private copy.

  I'm not sure it is correct, that page can be released in fs/splice.c
  without calling any callback from network code, when network data is
  being processed.
 
 Please explain!

I had a crashdump, where page was attempted to be released in
fs/splice.c:splice_to_pipe(), I do not have details handy, but the best
solution would be to provide a release callback and use that instead of
page_cache_release().

-- 
Evgeniy Polyakov
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RFC] network splice receive

2007-06-12 Thread Jens Axboe
On Tue, Jun 12 2007, Evgeniy Polyakov wrote:
 On Sat, Jun 09, 2007 at 08:36:09AM +0200, Jens Axboe ([EMAIL PROTECTED]) 
 wrote:
  On Fri, Jun 08 2007, Evgeniy Polyakov wrote:
   On Fri, Jun 08, 2007 at 06:57:25PM +0400, Evgeniy Polyakov ([EMAIL 
   PROTECTED]) wrote:
I will try some things for the nearest 30-60 minutes, and then will 
move to
canoe trip until thuesday, so will not be able to work on this idea.
   
   Ok, replacing in fs/splice.c every page_cache_release() with
   static void splice_page_release(struct page *p)
   {
 if (!PageSlab(p))
 page_cache_release(p);
   }
  
  Ehm, I don't see why that should be necessary. Except in
  splice_to_pipe(), I have considered that we need to pass in a release
  function if mapping fails at some point. But it's probably best to do
  that in the caller, since they have the knowledge of how to release the
  pages.
  
  The rest of the PageSlab() tests are bogus.
 
 I had a crashdump, where page was released via splice_to_pipe() indeed,
 I did not investigate if it is possible to release provided page in
 other places. I think if in future there will other slab usage cases
 except networking receiving, that might be useful, but as is it is not
 needed.

Read the just posted code, it has moved way beyond this :-)

   and putting cloned skb into private field instead of 
   original on in spd_fill_page() ends up without kernel hung.
  
  Why? Seems pointless to allocate a clone just to hold on to the skb, a
  reference should be equally good. I would not be opposed to doing it
  this way, I just don't see what a clone buys us as compared to just
  holding that reference to the skb.
 
 Receiving code does not expect shared skbs - too many fields are changed
 with assumptions that it is a private copy.

Actually the main problem is that tcp_read_sock() unconditionally frees
the skb, so it wouldn't help if we grabbed a reference to it. I've yet
to receive an explanation of why it does so, seem awkward and violates
the whole principle of reference counted objects. Davem??

So for now, skb_splice_bits() clones the incoming skb to avoid that. I'd
hope we can get rid of that by fixing tcp_read_sock(), though.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RFC] network splice receive

2007-06-12 Thread Evgeniy Polyakov
On Tue, Jun 12, 2007 at 01:33:54PM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote:
  I had a crashdump, where page was released via splice_to_pipe() indeed,
  I did not investigate if it is possible to release provided page in
  other places. I think if in future there will other slab usage cases
  except networking receiving, that might be useful, but as is it is not
  needed.
 
 Read the just posted code, it has moved way beyond this :-)

It is just a side result of traditional optimization technique called 
vim ':%s/page_cache_release/splice_page_release' :)

and putting cloned skb into private field instead of 
original on in spd_fill_page() ends up without kernel hung.
   
   Why? Seems pointless to allocate a clone just to hold on to the skb, a
   reference should be equally good. I would not be opposed to doing it
   this way, I just don't see what a clone buys us as compared to just
   holding that reference to the skb.
  
  Receiving code does not expect shared skbs - too many fields are changed
  with assumptions that it is a private copy.
 
 Actually the main problem is that tcp_read_sock() unconditionally frees
 the skb, so it wouldn't help if we grabbed a reference to it. I've yet
 to receive an explanation of why it does so, seem awkward and violates
 the whole principle of reference counted objects. Davem??
 
 So for now, skb_splice_bits() clones the incoming skb to avoid that. I'd
 hope we can get rid of that by fixing tcp_read_sock(), though.

It does that because it knows, that skb is not allowed to be shared
there. Similar things are being done in udp for example - code changes
internal mebers of skb, since it knows skb is not shared.

For example generic_make_request() is not allowed to change, say, 
bio-bi_sector or bi_destructor, since it does not own a block request, 
not matter what bi_cnt is. From another side, -bi_destructor() can do
whatever it wants with bio without any check for its reference counter.

According to sk_eat_skb() - it is an optimisation to remove atomic
check.

 -- 
 Jens Axboe

-- 
Evgeniy Polyakov
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RFC] network splice receive

2007-06-12 Thread Jens Axboe
On Tue, Jun 12 2007, Evgeniy Polyakov wrote:
 and putting cloned skb into private field instead of 
 original on in spd_fill_page() ends up without kernel hung.

Why? Seems pointless to allocate a clone just to hold on to the skb, a
reference should be equally good. I would not be opposed to doing it
this way, I just don't see what a clone buys us as compared to just
holding that reference to the skb.
   
   Receiving code does not expect shared skbs - too many fields are changed
   with assumptions that it is a private copy.
  
  Actually the main problem is that tcp_read_sock() unconditionally frees
  the skb, so it wouldn't help if we grabbed a reference to it. I've yet
  to receive an explanation of why it does so, seem awkward and violates
  the whole principle of reference counted objects. Davem??
  
  So for now, skb_splice_bits() clones the incoming skb to avoid that. I'd
  hope we can get rid of that by fixing tcp_read_sock(), though.
 
 It does that because it knows, that skb is not allowed to be shared
 there. Similar things are being done in udp for example - code changes
 internal mebers of skb, since it knows skb is not shared.
 
 For example generic_make_request() is not allowed to change, say, 
 bio-bi_sector or bi_destructor, since it does not own a block request, 
 not matter what bi_cnt is. From another side, -bi_destructor() can do
 whatever it wants with bio without any check for its reference counter.

But generic_make_request() DOES change -bi_sector, that's how partition
remapping works :-). The destructor can of course do whatever it wants,
by definition the bio is not referenced at that point (or it would not
have been called). So while I think your analogy is quite poor, I do now
follow the code (even if I think it's ugly). There's quite a big
difference between changing parts of the elements of a structure to just
grabbing a reference to it. If the skb cannot be referenced, skb_get()
should return NULL.

But that aside, I see the issue. I'll just stick to the clone, it works
fine as-is (well almost, there's a leak there, but functionally it's
ok!).

-- 
Jens Axboe

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RFC] network splice receive

2007-06-12 Thread Evgeniy Polyakov
On Tue, Jun 12, 2007 at 02:40:05PM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote:
 On Tue, Jun 12 2007, Evgeniy Polyakov wrote:
  and putting cloned skb into private field instead of 
  original on in spd_fill_page() ends up without kernel hung.
 
 Why? Seems pointless to allocate a clone just to hold on to the skb, a
 reference should be equally good. I would not be opposed to doing it
 this way, I just don't see what a clone buys us as compared to just
 holding that reference to the skb.

Receiving code does not expect shared skbs - too many fields are changed
with assumptions that it is a private copy.
   
   Actually the main problem is that tcp_read_sock() unconditionally frees
   the skb, so it wouldn't help if we grabbed a reference to it. I've yet
   to receive an explanation of why it does so, seem awkward and violates
   the whole principle of reference counted objects. Davem??
   
   So for now, skb_splice_bits() clones the incoming skb to avoid that. I'd
   hope we can get rid of that by fixing tcp_read_sock(), though.
  
  It does that because it knows, that skb is not allowed to be shared
  there. Similar things are being done in udp for example - code changes
  internal mebers of skb, since it knows skb is not shared.
  
  For example generic_make_request() is not allowed to change, say, 
  bio-bi_sector or bi_destructor, since it does not own a block request, 
  not matter what bi_cnt is. From another side, -bi_destructor() can do
  whatever it wants with bio without any check for its reference counter.
 
 But generic_make_request() DOES change -bi_sector, that's how partition
 remapping works :-). The destructor can of course do whatever it wants,
 by definition the bio is not referenced at that point (or it would not
 have been called). So while I think your analogy is quite poor, I do now
 follow the code (even if I think it's ugly). There's quite a big

Yeah, that was quite long time ago I hacked block layer :)
Good we found a way to explain the issue.

 difference between changing parts of the elements of a structure to just
 grabbing a reference to it. If the skb cannot be referenced, skb_get()
 should return NULL.
 
 But that aside, I see the issue. I'll just stick to the clone, it works
 fine as-is (well almost, there's a leak there, but functionally it's
 ok!).

Btw, is it allowed to use splice from network with, say, nfs?
Since RPC code uses sk_user_data as long as network splice.

 -- 
 Jens Axboe

-- 
Evgeniy Polyakov
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RFC] network splice receive

2007-06-12 Thread Jens Axboe
On Tue, Jun 12 2007, Evgeniy Polyakov wrote:
  difference between changing parts of the elements of a structure to just
  grabbing a reference to it. If the skb cannot be referenced, skb_get()
  should return NULL.
  
  But that aside, I see the issue. I'll just stick to the clone, it works
  fine as-is (well almost, there's a leak there, but functionally it's
  ok!).
 
 Btw, is it allowed to use splice from network with, say, nfs?
 Since RPC code uses sk_user_data as long as network splice.

It doesn't anymore, see the version posted today (or yesterday, but it
would be silly to read older code than the newest :-)

-- 
Jens Axboe

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RFC] network splice receive v2

2007-06-12 Thread Evgeniy Polyakov
On Mon, Jun 11, 2007 at 01:59:26PM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote:
 Patches are against the #splice branch of the block repo, official url
 of that is:
 
 git://git.kernel.dk/data/git/linux-2.6-block.git/
 
 and it's based on Linus main tree. Let me know if I should supply netdev
 branch patches instead, or even just provide a rolled up patch (or patch
 series) for anyone curious to test or play with it.

Hi Jens.

I've just pulled your tree (splice-net, but splice tree looks the same, git 
pull says
'Already up-to-date.') on top of linus git and got following bug trace. 
I will investigate it further tomorrow.

[   51.942373] [ cut here ]
[   51.947041] kernel BUG at include/linux/mm.h:285!
[   51.951786] invalid opcode:  [1] PREEMPT SMP 
[   51.956680] CPU 0 
[   51.958784] Modules linked in: button loop snd_intel8x0
snd_ac97_codec psmouse ac97_bus snd_pcm snd_timer snd soundcore
snd_page_alloc k8temp i2c_nforcen
[   51.988793] Pid: 2604, comm: splice-fromnet Not tainted
2.6.22-rc4-splice #2
[   51.995886] RIP: 0010:[80389b15]  [80389b15]
__skb_splice_bits+0xcd/0x201
[   52.004520] RSP: 0018:810037f23c28  EFLAGS: 00010246
[   52.009872] RAX:  RBX: 810037f23d98 RCX:
003f
[   52.017053] RDX: 81003fe93808 RSI: 81003fe93808 RDI:
0003c0a3
[   52.024233] RBP: 810037f23c78 R08:  R09:
81003780e4b8
[   52.031412] R10: 803b01d9 R11: 810037f23de8 R12:
009a
[   52.038591] R13:  R14: 810037f23c90 R15:
05a8
[   52.045771] FS:  2b9181d2c6d0() GS:804fb000()
knlGS:
[   52.053920] CS:  0010 DS:  ES:  CR0: 8005003b
[   52.059714] CR2: 2b9181bb60e0 CR3: 3d109000 CR4:
06e0
[   52.066894] Process splice-fromnet (pid: 2604, threadinfo
810037f22000, task 8100010f4100)
[   52.075908] Stack:  004612d0 810037f23c94
81003780e4b8 37f23c78
[   52.084214]  faf2050e 81003780e4b8 81003780e4b8
81003e8f22d8
[   52.091860]  81003c99c820 4d5f4ede 810037f23dd8
8038bf20
[   52.099265] Call Trace:
[   52.101998]  [8038bf20] skb_splice_bits+0x6c/0xd0
[   52.107619]  [803dc720] _read_unlock_irq+0x31/0x4e
[   52.113330]  [803afc1c] tcp_splice_data_recv+0x20/0x22
[   52.119386]  [803afaf3] tcp_read_sock+0xa2/0x1ab
[   52.124920]  [803afbfc] tcp_splice_data_recv+0x0/0x22
[   52.130888]  [803b0232] tcp_splice_read+0xa1/0x21b
[   52.136593]  [803891cf] sock_def_readable+0x0/0x6f
[   52.142303]  [80384a25] sock_splice_read+0x15/0x17
[   52.148010]  [8029e773] do_splice_to+0x76/0x88
[   52.153370]  [8029fc87] sys_splice+0x1a8/0x232
[   52.158733]  [802097ce] system_call+0x7e/0x83
[   52.164005] 
[   52.165544] 
[   52.165545] Code: 0f 0b eb fe 44 39 65 d4 8b 4d d4 41 0f 47 cc 90 ff
42 08 48 
[   52.175364] RIP  [80389b15] __skb_splice_bits+0xcd/0x201
[   52.181636]  RSP 810037f23c28



-- 
Evgeniy Polyakov
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RFC] network splice receive v2

2007-06-12 Thread Jens Axboe
On Tue, Jun 12 2007, Evgeniy Polyakov wrote:
 On Mon, Jun 11, 2007 at 01:59:26PM +0200, Jens Axboe ([EMAIL PROTECTED]) 
 wrote:
  Patches are against the #splice branch of the block repo, official url
  of that is:
  
  git://git.kernel.dk/data/git/linux-2.6-block.git/
  
  and it's based on Linus main tree. Let me know if I should supply netdev
  branch patches instead, or even just provide a rolled up patch (or patch
  series) for anyone curious to test or play with it.
 
 Hi Jens.
 
 I've just pulled your tree (splice-net, but splice tree looks the
 same, git pull says 'Already up-to-date.') on top of linus git and got
 following bug trace.  I will investigate it further tomorrow.

Please tell me the contents of splice-net, it looks like you didn't
actually use the new code. That BUG_ON() is in get_page(), which
splice-net no longer uses. So the bug report cannot be valid for the
current code.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RFC] network splice receive

2007-06-11 Thread Jens Axboe
On Fri, Jun 08 2007, Evgeniy Polyakov wrote:
 Size of the received file is bigger than file sent, file contains repeated
 blocks of data sometimes. Cloned skb usage is likely too big overhead,
 although for receiving fast clone is unused in most cases, so there
 might be some gain.

That was actually a new bug, here:

plen -= *offset;
poff += *offset;

in __skb_slice_bits(), we should only subtract the offset from plen, not
add to poff. Then we just create some weird hole without any meaning. So
remove those two poff additions in the two loops, and the size issue is
resolved at least.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH][RFC] network splice receive v2

2007-06-11 Thread Jens Axboe
Hi,

Here's an updated implementation of tcp network splice receive support.
It actually works for me now, no data corruption seen.

For the original announcement and how to test it, see:

http://marc.info/?l=linux-netdevm=118103093400770w=2

I hang on to the original skb by creating a clone of it and holding
references to that. I really don't need a clone, but tcp_read_sock() is
not being very helpful by calling sk_eat_skb() which blissfully ignores
any reference counts and uncondtionally frees the skb - why is that??
The clone works around that issue.

The current code also gets rid of the data_ready hack, and it should now
handle linear/fragments/fraglist in the skb just fine. Thanks to Olaf
for providing review of that stuff!

Patches are against the #splice branch of the block repo, official url
of that is:

git://git.kernel.dk/data/git/linux-2.6-block.git/

and it's based on Linus main tree. Let me know if I should supply netdev
branch patches instead, or even just provide a rolled up patch (or patch
series) for anyone curious to test or play with it.

-- 
Jens Axboe

From fd79bf84fdeb15b72f256c342609257ae8a56235 Mon Sep 17 00:00:00 2001
From: Jens Axboe [EMAIL PROTECTED]
Date: Mon, 11 Jun 2007 13:00:32 +0200
Subject: [PATCH] splice: don't assume regular pages in splice_to_pipe()

Allow caller to pass in a release function, there might be
other resources that need releasing as well. Needed for
network receive.

Signed-off-by: Jens Axboe [EMAIL PROTECTED]
---
 fs/splice.c|9 -
 include/linux/splice.h |1 +
 2 files changed, 9 insertions(+), 1 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index f24e367..25ec9c8 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -247,11 +247,16 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
 	}
 
 	while (page_nr  spd-nr_pages)
-		page_cache_release(spd-pages[page_nr++]);
+		spd-spd_release(spd, page_nr++);
 
 	return ret;
 }
 
+static void spd_release_page(struct splice_pipe_desc *spd, unsigned int i)
+{
+	page_cache_release(spd-pages[i]);
+}
+
 static int
 __generic_file_splice_read(struct file *in, loff_t *ppos,
 			   struct pipe_inode_info *pipe, size_t len,
@@ -270,6 +275,7 @@ __generic_file_splice_read(struct file *in, loff_t *ppos,
 		.partial = partial,
 		.flags = flags,
 		.ops = page_cache_pipe_buf_ops,
+		.spd_release = spd_release_page,
 	};
 
 	index = *ppos  PAGE_CACHE_SHIFT;
@@ -1442,6 +1448,7 @@ static long vmsplice_to_pipe(struct file *file, const struct iovec __user *iov,
 		.partial = partial,
 		.flags = flags,
 		.ops = user_page_pipe_buf_ops,
+		.spd_release = spd_release_page,
 	};
 
 	pipe = pipe_info(file-f_path.dentry-d_inode);
diff --git a/include/linux/splice.h b/include/linux/splice.h
index 1a1182b..04c1068 100644
--- a/include/linux/splice.h
+++ b/include/linux/splice.h
@@ -53,6 +53,7 @@ struct splice_pipe_desc {
 	int nr_pages;			/* number of pages in map */
 	unsigned int flags;		/* splice flags */
 	const struct pipe_buf_operations *ops;/* ops associated with output pipe */
+	void (*spd_release)(struct splice_pipe_desc *, unsigned int);
 };
 
 typedef int (splice_actor)(struct pipe_inode_info *, struct pipe_buffer *,
-- 
1.5.2.1.174.gcd03

From 95a1ee277f2a6df5c95d786b9229ea0ffa46850d Mon Sep 17 00:00:00 2001
From: Jens Axboe [EMAIL PROTECTED]
Date: Mon, 4 Jun 2007 15:06:43 +0200
Subject: [PATCH] tcp_read_sock: alloc recv_actor() return return negative error value

Signed-off-by: Jens Axboe [EMAIL PROTECTED]
---
 net/ipv4/tcp.c |8 ++--
 1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index cd3c7e9..450f44b 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1064,7 +1064,11 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 	break;
 			}
 			used = recv_actor(desc, skb, offset, len);
-			if (used = len) {
+			if (used  0) {
+if (!copied)
+	copied = used;
+break;
+			} else if (used = len) {
 seq += used;
 copied += used;
 offset += used;
@@ -1086,7 +1090,7 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 	tcp_rcv_space_adjust(sk);
 
 	/* Clean up data we have read: This will do ACK frames. */
-	if (copied)
+	if (copied  0)
 		tcp_cleanup_rbuf(sk, copied);
 	return copied;
 }
-- 
1.5.2.1.174.gcd03

From f0329226c6c1f521c2069358699bae5ed48f5a43 Mon Sep 17 00:00:00 2001
From: Jens Axboe [EMAIL PROTECTED]
Date: Mon, 11 Jun 2007 13:51:57 +0200
Subject: [PATCH] TCP splice receive support

Support for network splice receive.

Signed-off-by: Jens Axboe [EMAIL PROTECTED]
---
 include/linux/net.h|3 +
 include/linux/skbuff.h |5 ++
 include/net/tcp.h  |3 +
 net/core/skbuff.c  |  175 
 net/ipv4/af_inet.c |1 +
 net/ipv4/tcp.c |  122 +
 net/socket.c   |   13 
 7 files changed, 322 insertions(+), 0 deletions(-)

diff --git a/include/linux/net.h b/include/linux/net.h
index 

Re: [PATCH][RFC] network splice receive

2007-06-09 Thread Jens Axboe
On Fri, Jun 08 2007, Evgeniy Polyakov wrote:
 On Fri, Jun 08, 2007 at 06:57:25PM +0400, Evgeniy Polyakov ([EMAIL 
 PROTECTED]) wrote:
  I will try some things for the nearest 30-60 minutes, and then will move to
  canoe trip until thuesday, so will not be able to work on this idea.
 
 Ok, replacing in fs/splice.c every page_cache_release() with
 static void splice_page_release(struct page *p)
 {
   if (!PageSlab(p))
   page_cache_release(p);
 }

Ehm, I don't see why that should be necessary. Except in
splice_to_pipe(), I have considered that we need to pass in a release
function if mapping fails at some point. But it's probably best to do
that in the caller, since they have the knowledge of how to release the
pages.

The rest of the PageSlab() tests are bogus.

 and putting cloned skb into private field instead of 
 original on in spd_fill_page() ends up without kernel hung.

Why? Seems pointless to allocate a clone just to hold on to the skb, a
reference should be equally good. I would not be opposed to doing it
this way, I just don't see what a clone buys us as compared to just
holding that reference to the skb.

 I'm not sure it is correct, that page can be released in fs/splice.c
 without calling any callback from network code, when network data is
 being processed.

Please explain!

 Size of the received file is bigger than file sent, file contains repeated
 blocks of data sometimes. Cloned skb usage is likely too big overhead,
 although for receiving fast clone is unused in most cases, so there
 might be some gain.
 
 Attached your patch with above changes.

Thanks, I'll fiddle with this on monday.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RFC] network splice receive

2007-06-08 Thread Jens Axboe
On Thu, Jun 07 2007, Evgeniy Polyakov wrote:
 On Thu, Jun 07, 2007 at 12:51:59PM +0200, Jens Axboe ([EMAIL PROTECTED]) 
 wrote:
   What bout checking if page belongs to kmalloc cache (or any other cache
   via priviate pointers) and do not perform any kind of reference counting
   on them? I will play with this a bit later today.
  
  That might work, but sounds a little dirty... But there's probably no
  way around. Be sure to look at the #splice-net branch if you are playing
  with this, I've updated it a number of times and fixed some bugs in
  there. Notably it now gets the offset right, and handles fragments and
  fraglist as well.
 
 I've pulled splice-net, which indeed fixed some issues, but referencing
 slab pages is still is not allowed. There are at least two problems
 (although they are related):
 1. if we do not increment reference counter for slab pages, they
 eventually get refilled and slab exploses after it understood that its
 pages are in use (or user dies when page is moved out of his control in
 slab).
 2. get/put page does not work with slab pages, and simple
 increment/decrement of the reference counters is not allowed too.
 
 Both problems have the same root - slab does not allow anyone to 
 manipulate page's members. That should be broken/changed to allow splice
 to put its hands into network using the fastest way.
 I will think about it.

Perhaps it's possible to solve this at a different level - can we hang
on to the skb until the pipe buffer has been consumed, and prevent reuse
that way? Then we don't have to care what backing the skb has, as long
as it (and its data) isn't being reused until we drop the reference to
it in sock_pipe_buf_release().

-- 
Jens Axboe

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RFC] network splice receive

2007-06-08 Thread David Miller
From: Jens Axboe [EMAIL PROTECTED]
Date: Fri, 8 Jun 2007 09:48:24 +0200

 Perhaps it's possible to solve this at a different level - can we hang
 on to the skb until the pipe buffer has been consumed, and prevent reuse
 that way? Then we don't have to care what backing the skb has, as long
 as it (and its data) isn't being reused until we drop the reference to
 it in sock_pipe_buf_release().

Depending upon whether the pipe buffer consumption is bounded of not,
this will jam up the TCP sender because the SKB data allocation is
charged against the socket send buffer allocation.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RFC] network splice receive

2007-06-08 Thread Jens Axboe
On Fri, Jun 08 2007, David Miller wrote:
 From: Jens Axboe [EMAIL PROTECTED]
 Date: Fri, 8 Jun 2007 09:48:24 +0200
 
  Perhaps it's possible to solve this at a different level - can we hang
  on to the skb until the pipe buffer has been consumed, and prevent reuse
  that way? Then we don't have to care what backing the skb has, as long
  as it (and its data) isn't being reused until we drop the reference to
  it in sock_pipe_buf_release().
 
 Depending upon whether the pipe buffer consumption is bounded of not,
 this will jam up the TCP sender because the SKB data allocation is
 charged against the socket send buffer allocation.

Forgive my network ignorance, but is that a problem? Since you bring it
up, I guess so :-)

We can grow the pipe, should we have to. So instead of blocking waiting
on reader consumption, we can extend the size of the pipe and keep
going.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RFC] network splice receive

2007-06-08 Thread Jens Axboe
On Fri, Jun 08 2007, Evgeniy Polyakov wrote:
 On Fri, Jun 08, 2007 at 10:38:53AM +0200, Jens Axboe ([EMAIL PROTECTED]) 
 wrote:
  On Fri, Jun 08 2007, David Miller wrote:
   From: Jens Axboe [EMAIL PROTECTED]
   Date: Fri, 8 Jun 2007 09:48:24 +0200
   
Perhaps it's possible to solve this at a different level - can we hang
on to the skb until the pipe buffer has been consumed, and prevent reuse
that way? Then we don't have to care what backing the skb has, as long
as it (and its data) isn't being reused until we drop the reference to
it in sock_pipe_buf_release().
   
   Depending upon whether the pipe buffer consumption is bounded of not,
   this will jam up the TCP sender because the SKB data allocation is
   charged against the socket send buffer allocation.
  
  Forgive my network ignorance, but is that a problem? Since you bring it
  up, I guess so :-)
 
 David means, that socket bufer allocation is limited, and delaying
 freeing can end up with exhausint that limit.

OK, so a delayed empty of the pipe could end up causing packet drops
elsewhere due to allocation exhaustion?

  We can grow the pipe, should we have to. So instead of blocking waiting
  on reader consumption, we can extend the size of the pipe and keep
  going.
 
 I have a code, which roughly works (but I will test it some more), which
 just introduces reference counters for slab pages, so that the would not
 be actually freed via page reclaim, but only after reference counters
 are dropped. That forced changes in mm/slab.c so likely it is
 unacceptible solution, but it is interesting as is.

Hmm, still seems like it's working around the problem. We essentially
just need to ensure that the data doesn't get _reused_, not just freed.
It doesn't help holding a reference to the page, if someone else just
reuses it and fills it with other data before it has been consumed and
released by the pipe buffer operations.

That's why I thought the skb referencing was the better idea, then we
don't have to care about the backing of the skb either. Provided that
preventing the free of the skb before the pipe buffer has been consumed
guarantees that the contents aren't reused.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RFC] network splice receive

2007-06-08 Thread Evgeniy Polyakov
On Fri, Jun 08, 2007 at 10:38:53AM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote:
 On Fri, Jun 08 2007, David Miller wrote:
  From: Jens Axboe [EMAIL PROTECTED]
  Date: Fri, 8 Jun 2007 09:48:24 +0200
  
   Perhaps it's possible to solve this at a different level - can we hang
   on to the skb until the pipe buffer has been consumed, and prevent reuse
   that way? Then we don't have to care what backing the skb has, as long
   as it (and its data) isn't being reused until we drop the reference to
   it in sock_pipe_buf_release().
  
  Depending upon whether the pipe buffer consumption is bounded of not,
  this will jam up the TCP sender because the SKB data allocation is
  charged against the socket send buffer allocation.
 
 Forgive my network ignorance, but is that a problem? Since you bring it
 up, I guess so :-)

David means, that socket bufer allocation is limited, and delaying
freeing can end up with exhausint that limit.

 We can grow the pipe, should we have to. So instead of blocking waiting
 on reader consumption, we can extend the size of the pipe and keep
 going.

I have a code, which roughly works (but I will test it some more), which
just introduces reference counters for slab pages, so that the would not
be actually freed via page reclaim, but only after reference counters
are dropped. That forced changes in mm/slab.c so likely it is
unacceptible solution, but it is interesting as is.

 -- 
 Jens Axboe

-- 
Evgeniy Polyakov
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RFC] network splice receive

2007-06-08 Thread Evgeniy Polyakov
On Fri, Jun 08, 2007 at 11:04:40AM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote:
 OK, so a delayed empty of the pipe could end up causing packet drops
 elsewhere due to allocation exhaustion?

Yes.

   We can grow the pipe, should we have to. So instead of blocking waiting
   on reader consumption, we can extend the size of the pipe and keep
   going.
  
  I have a code, which roughly works (but I will test it some more), which
  just introduces reference counters for slab pages, so that the would not
  be actually freed via page reclaim, but only after reference counters
  are dropped. That forced changes in mm/slab.c so likely it is
  unacceptible solution, but it is interesting as is.
 
 Hmm, still seems like it's working around the problem. We essentially
 just need to ensure that the data doesn't get _reused_, not just freed.
 It doesn't help holding a reference to the page, if someone else just
 reuses it and fills it with other data before it has been consumed and
 released by the pipe buffer operations.
 
 That's why I thought the skb referencing was the better idea, then we
 don't have to care about the backing of the skb either. Provided that
 preventing the free of the skb before the pipe buffer has been consumed
 guarantees that the contents aren't reused.

It is not only better idea, it is the only correct one.
Attached patch for interested reader, which does slab pages accounting,
but it is broken. It does not fires up with kernel bug, but it fills
output file with random garbage from reused and dirtied pages. And I do
not know why, but received file is always smaller than file being sent
(when file has resonable size like 10mb, with 4-40kb filesize things
seems to be ok).

I've started skb referencing work, let's see where this will end up.

diff --git a/fs/splice.c b/fs/splice.c
index 928bea0..742e1ee 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -29,6 +29,18 @@
 #include linux/syscalls.h
 #include linux/uio.h
 
+extern void slab_change_usage(struct page *p);
+
+static inline void splice_page_release(struct page *p)
+{
+   struct page *head = p-first_page;
+   if (!PageSlab(head))
+   page_cache_release(p);
+   else {
+   slab_change_usage(head);
+   }
+}
+
 /*
  * Attempt to steal a page from a pipe buffer. This should perhaps go into
  * a vm helper function, it's already simplified quite a bit by the
@@ -81,7 +93,7 @@ static int page_cache_pipe_buf_steal(struct pipe_inode_info 
*pipe,
 static void page_cache_pipe_buf_release(struct pipe_inode_info *pipe,
struct pipe_buffer *buf)
 {
-   page_cache_release(buf-page);
+   splice_page_release(buf-page);
buf-flags = ~PIPE_BUF_FLAG_LRU;
 }
 
@@ -246,7 +258,7 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
}
 
while (page_nr  spd-nr_pages)
-   page_cache_release(spd-pages[page_nr++]);
+   splice_page_release(spd-pages[page_nr++]);
 
return ret;
 }
@@ -322,7 +334,7 @@ __generic_file_splice_read(struct file *in, loff_t *ppos,
error = add_to_page_cache_lru(page, mapping, index,
  GFP_KERNEL);
if (unlikely(error)) {
-   page_cache_release(page);
+   splice_page_release(page);
if (error == -EEXIST)
continue;
break;
@@ -448,7 +460,7 @@ fill_it:
 * we got, 'nr_pages' is how many pages are in the map.
 */
while (page_nr  nr_pages)
-   page_cache_release(pages[page_nr++]);
+   splice_page_release(pages[page_nr++]);
 
if (spd.nr_pages)
return splice_to_pipe(pipe, spd);
@@ -604,7 +616,7 @@ find_page:
 
if (ret != AOP_TRUNCATED_PAGE)
unlock_page(page);
-   page_cache_release(page);
+   splice_page_release(page);
if (ret == AOP_TRUNCATED_PAGE)
goto find_page;
 
@@ -634,7 +646,7 @@ find_page:
ret = mapping-a_ops-commit_write(file, page, offset, offset+this_len);
if (ret) {
if (ret == AOP_TRUNCATED_PAGE) {
-   page_cache_release(page);
+   splice_page_release(page);
goto find_page;
}
if (ret  0)
@@ -651,7 +663,7 @@ find_page:
 */
mark_page_accessed(page);
 out:
-   page_cache_release(page);
+   splice_page_release(page);
unlock_page(page);
 out_ret:
return ret;
diff --git a/mm/slab.c b/mm/slab.c
index 2e71a32..673383d 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1649,8 +1649,12 @@ static void *kmem_getpages(struct kmem_cache *cachep, 
gfp_t flags, int nodeid)
else
add_zone_page_state(page_zone(page),

Re: [PATCH][RFC] network splice receive

2007-06-08 Thread Jens Axboe
On Fri, Jun 08 2007, Evgeniy Polyakov wrote:
 On Fri, Jun 08, 2007 at 11:04:40AM +0200, Jens Axboe ([EMAIL PROTECTED]) 
 wrote:
  OK, so a delayed empty of the pipe could end up causing packet drops
  elsewhere due to allocation exhaustion?
 
 Yes.
 
We can grow the pipe, should we have to. So instead of blocking waiting
on reader consumption, we can extend the size of the pipe and keep
going.
   
   I have a code, which roughly works (but I will test it some more), which
   just introduces reference counters for slab pages, so that the would not
   be actually freed via page reclaim, but only after reference counters
   are dropped. That forced changes in mm/slab.c so likely it is
   unacceptible solution, but it is interesting as is.
  
  Hmm, still seems like it's working around the problem. We essentially
  just need to ensure that the data doesn't get _reused_, not just freed.
  It doesn't help holding a reference to the page, if someone else just
  reuses it and fills it with other data before it has been consumed and
  released by the pipe buffer operations.
  
  That's why I thought the skb referencing was the better idea, then we
  don't have to care about the backing of the skb either. Provided that
  preventing the free of the skb before the pipe buffer has been consumed
  guarantees that the contents aren't reused.
 
 It is not only better idea, it is the only correct one.
 Attached patch for interested reader, which does slab pages accounting,
 but it is broken. It does not fires up with kernel bug, but it fills
 output file with random garbage from reused and dirtied pages. And I do
 not know why, but received file is always smaller than file being sent
 (when file has resonable size like 10mb, with 4-40kb filesize things
 seems to be ok).
 
 I've started skb referencing work, let's see where this will end up.

Here's a start, for the splice side at least of storing a buf-private
entity with the ops.

diff --git a/fs/splice.c b/fs/splice.c
index 90588a8..f24e367 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -191,6 +191,7 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
buf-page = spd-pages[page_nr];
buf-offset = spd-partial[page_nr].offset;
buf-len = spd-partial[page_nr].len;
+   buf-private = spd-partial[page_nr].private;
buf-ops = spd-ops;
if (spd-flags  SPLICE_F_GIFT)
buf-flags |= PIPE_BUF_FLAG_GIFT;
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index 7ba228d..4409167 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -14,6 +14,7 @@ struct pipe_buffer {
unsigned int offset, len;
const struct pipe_buf_operations *ops;
unsigned int flags;
+   unsigned long private;
 };
 
 struct pipe_inode_info {
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 619dcf5..64e3eed 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1504,7 +1504,7 @@ extern int   skb_store_bits(struct sk_buff 
*skb, int offset,
 extern __wsum skb_copy_and_csum_bits(const struct sk_buff *skb,
  int offset, u8 *to, int len,
  __wsum csum);
-extern int skb_splice_bits(const struct sk_buff *skb,
+extern int skb_splice_bits(struct sk_buff *skb,
unsigned int offset,
struct pipe_inode_info *pipe,
unsigned int len,
diff --git a/include/linux/splice.h b/include/linux/splice.h
index b3f1528..1a1182b 100644
--- a/include/linux/splice.h
+++ b/include/linux/splice.h
@@ -41,6 +41,7 @@ struct splice_desc {
 struct partial_page {
unsigned int offset;
unsigned int len;
+   unsigned long private;
 };
 
 /*
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index d2b2547..7d9ec9e 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -78,7 +78,10 @@ static void sock_pipe_buf_release(struct pipe_inode_info 
*pipe,
 #ifdef NET_COPY_SPLICE
__free_page(buf-page);
 #else
-   put_page(buf-page);
+   struct sk_buff *skb = (struct sk_buff *) buf-private;
+
+   kfree_skb(skb);
+   //put_page(buf-page);
 #endif
 }
 
@@ -1148,7 +1151,8 @@ fault:
  * Fill page/offset/length into spd, if it can hold more pages.
  */
 static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page 
*page,
-   unsigned int len, unsigned int offset)
+   unsigned int len, unsigned int offset,
+   struct sk_buff *skb)
 {
struct page *p;
 
@@ -1163,12 +1167,14 @@ static inline int spd_fill_page(struct splice_pipe_desc 
*spd, struct page *page,

Re: [PATCH][RFC] network splice receive

2007-06-08 Thread Evgeniy Polyakov
On Fri, Jun 08, 2007 at 04:14:52PM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote:
 Here's a start, for the splice side at least of storing a buf-private
 entity with the ops.

:) I tested the same implementation, but I put skb pointer into
page-private. My approach is not correct, since the same page can hold
several objects, so if there are several splicers, this will scream.
I've tested your patch on top of splice-net branch, here is a result:

[   44.798853] Slab corruption: skbuff_head_cache start=81003b726668, 
len=192
[   44.806148] Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
[   44.811598] Last user: [803699fd](kfree_skbmem+0x7a/0x7e)
[   44.818012] 0b0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6a 6b 6b a5
[   44.824889] Prev obj: start=81003b726590, len=192
[   44.829985] Redzone: 0xd84156c5635688c0/0xd84156c5635688c0.
[   44.835604] Last user: [8036a22c](__alloc_skb+0x40/0x13f)
[   44.842010] 000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[   44.848896] 010: 20 58 7e 3b 00 81 ff ff 00 00 00 00 00 00 00 00
[   44.855772] Next obj: start=81003b726740, len=192
[   44.860868] Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
[   44.866314] Last user: [803699fd](kfree_skbmem+0x7a/0x7e)
[   44.872721] 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
[   44.879597] 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b

I will try some things for the nearest 30-60 minutes, and then will move to
canoe trip until thuesday, so will not be able to work on this idea.

-- 
Evgeniy Polyakov
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RFC] network splice receive

2007-06-08 Thread Jens Axboe
On Fri, Jun 08 2007, Evgeniy Polyakov wrote:
 On Fri, Jun 08, 2007 at 04:14:52PM +0200, Jens Axboe ([EMAIL PROTECTED]) 
 wrote:
  Here's a start, for the splice side at least of storing a buf-private
  entity with the ops.
 
 :) I tested the same implementation, but I put skb pointer into
 page-private. My approach is not correct, since the same page can hold
 several objects, so if there are several splicers, this will scream.
 I've tested your patch on top of splice-net branch, here is a result:
 
 [   44.798853] Slab corruption: skbuff_head_cache start=81003b726668, 
 len=192
 [   44.806148] Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
 [   44.811598] Last user: [803699fd](kfree_skbmem+0x7a/0x7e)
 [   44.818012] 0b0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6a 6b 6b a5
 [   44.824889] Prev obj: start=81003b726590, len=192
 [   44.829985] Redzone: 0xd84156c5635688c0/0xd84156c5635688c0.
 [   44.835604] Last user: [8036a22c](__alloc_skb+0x40/0x13f)
 [   44.842010] 000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 [   44.848896] 010: 20 58 7e 3b 00 81 ff ff 00 00 00 00 00 00 00 00
 [   44.855772] Next obj: start=81003b726740, len=192
 [   44.860868] Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
 [   44.866314] Last user: [803699fd](kfree_skbmem+0x7a/0x7e)
 [   44.872721] 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
 [   44.879597] 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
 
 I will try some things for the nearest 30-60 minutes, and then will move to
 canoe trip until thuesday, so will not be able to work on this idea.

I'm not surprised, it wasn't tested at all - just provides the basic
framework for storing the skb so we can access it on pipe buffer
release.

Lets talk more next week, I'll likely play with this approach on monday.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RFC] network splice receive

2007-06-08 Thread Evgeniy Polyakov
On Fri, Jun 08, 2007 at 06:57:25PM +0400, Evgeniy Polyakov ([EMAIL PROTECTED]) 
wrote:
 I will try some things for the nearest 30-60 minutes, and then will move to
 canoe trip until thuesday, so will not be able to work on this idea.

Ok, replacing in fs/splice.c every page_cache_release() with
static void splice_page_release(struct page *p)
{
if (!PageSlab(p))
page_cache_release(p);
}

and putting cloned skb into private field instead of 
original on in spd_fill_page() ends up without kernel hung.

I'm not sure it is correct, that page can be released in fs/splice.c
without calling any callback from network code, when network data is
being processed.

Size of the received file is bigger than file sent, file contains repeated
blocks of data sometimes. Cloned skb usage is likely too big overhead,
although for receiving fast clone is unused in most cases, so there
might be some gain.

Attached your patch with above changes.

diff --git a/fs/splice.c b/fs/splice.c
index 928bea0..a75dc56 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -29,6 +29,12 @@
 #include linux/syscalls.h
 #include linux/uio.h
 
+static void splice_page_release(struct page *p)
+{
+   if (!PageSlab(p))
+   page_cache_release(p);
+}
+
 /*
  * Attempt to steal a page from a pipe buffer. This should perhaps go into
  * a vm helper function, it's already simplified quite a bit by the
@@ -81,7 +87,7 @@ static int page_cache_pipe_buf_steal(struct pipe_inode_info 
*pipe,
 static void page_cache_pipe_buf_release(struct pipe_inode_info *pipe,
struct pipe_buffer *buf)
 {
-   page_cache_release(buf-page);
+   splice_page_release(buf-page);
buf-flags = ~PIPE_BUF_FLAG_LRU;
 }
 
@@ -191,6 +197,7 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
buf-page = spd-pages[page_nr];
buf-offset = spd-partial[page_nr].offset;
buf-len = spd-partial[page_nr].len;
+   buf-private = spd-partial[page_nr].private;
buf-ops = spd-ops;
if (spd-flags  SPLICE_F_GIFT)
buf-flags |= PIPE_BUF_FLAG_GIFT;
@@ -246,7 +253,7 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
}
 
while (page_nr  spd-nr_pages)
-   page_cache_release(spd-pages[page_nr++]);
+   splice_page_release(spd-pages[page_nr++]);
 
return ret;
 }
@@ -322,7 +329,7 @@ __generic_file_splice_read(struct file *in, loff_t *ppos,
error = add_to_page_cache_lru(page, mapping, index,
  GFP_KERNEL);
if (unlikely(error)) {
-   page_cache_release(page);
+   splice_page_release(page);
if (error == -EEXIST)
continue;
break;
@@ -448,7 +455,7 @@ fill_it:
 * we got, 'nr_pages' is how many pages are in the map.
 */
while (page_nr  nr_pages)
-   page_cache_release(pages[page_nr++]);
+   splice_page_release(pages[page_nr++]);
 
if (spd.nr_pages)
return splice_to_pipe(pipe, spd);
@@ -604,7 +611,7 @@ find_page:
 
if (ret != AOP_TRUNCATED_PAGE)
unlock_page(page);
-   page_cache_release(page);
+   splice_page_release(page);
if (ret == AOP_TRUNCATED_PAGE)
goto find_page;
 
@@ -634,7 +641,7 @@ find_page:
ret = mapping-a_ops-commit_write(file, page, offset, offset+this_len);
if (ret) {
if (ret == AOP_TRUNCATED_PAGE) {
-   page_cache_release(page);
+   splice_page_release(page);
goto find_page;
}
if (ret  0)
@@ -651,7 +658,7 @@ find_page:
 */
mark_page_accessed(page);
 out:
-   page_cache_release(page);
+   splice_page_release(page);
unlock_page(page);
 out_ret:
return ret;
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index 7ba228d..4409167 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -14,6 +14,7 @@ struct pipe_buffer {
unsigned int offset, len;
const struct pipe_buf_operations *ops;
unsigned int flags;
+   unsigned long private;
 };
 
 struct pipe_inode_info {
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 619dcf5..64e3eed 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1504,7 +1504,7 @@ extern int   skb_store_bits(struct sk_buff 
*skb, int offset,
 extern __wsum skb_copy_and_csum_bits(const struct sk_buff *skb,
  int offset, u8 *to, int len,
  

Re: [PATCH][RFC] network splice receive

2007-06-07 Thread Evgeniy Polyakov
On Wed, Jun 06, 2007 at 09:17:25AM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote:
  Some pages have zero reference counter here.
 
 But it's somewhat annoying to get pages with zero reference counts
 there, I wonder how that happens. I guess if the skb-data originated
 from kmalloc() then we don't really know. The main intent there was just
 to ensure the page wasn't going away, but clearly it's not good enough
 to ensure that reuse isn't taking place.

What bout checking if page belongs to kmalloc cache (or any other cache
via priviate pointers) and do not perform any kind of reference counting
on them? I will play with this a bit later today.

 -- 
 Jens Axboe

-- 
Evgeniy Polyakov
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RFC] network splice receive

2007-06-07 Thread Jens Axboe
On Thu, Jun 07 2007, Evgeniy Polyakov wrote:
 On Wed, Jun 06, 2007 at 09:17:25AM +0200, Jens Axboe ([EMAIL PROTECTED]) 
 wrote:
   Some pages have zero reference counter here.
  
  But it's somewhat annoying to get pages with zero reference counts
  there, I wonder how that happens. I guess if the skb-data originated
  from kmalloc() then we don't really know. The main intent there was just
  to ensure the page wasn't going away, but clearly it's not good enough
  to ensure that reuse isn't taking place.
 
 What bout checking if page belongs to kmalloc cache (or any other cache
 via priviate pointers) and do not perform any kind of reference counting
 on them? I will play with this a bit later today.

That might work, but sounds a little dirty... But there's probably no
way around. Be sure to look at the #splice-net branch if you are playing
with this, I've updated it a number of times and fixed some bugs in
there. Notably it now gets the offset right, and handles fragments and
fraglist as well.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RFC] network splice receive

2007-06-07 Thread Evgeniy Polyakov
On Thu, Jun 07, 2007 at 12:51:59PM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote:
  What bout checking if page belongs to kmalloc cache (or any other cache
  via priviate pointers) and do not perform any kind of reference counting
  on them? I will play with this a bit later today.
 
 That might work, but sounds a little dirty... But there's probably no
 way around. Be sure to look at the #splice-net branch if you are playing
 with this, I've updated it a number of times and fixed some bugs in
 there. Notably it now gets the offset right, and handles fragments and
 fraglist as well.

I've pulled splice-net, which indeed fixed some issues, but referencing
slab pages is still is not allowed. There are at least two problems
(although they are related):
1. if we do not increment reference counter for slab pages, they
eventually get refilled and slab exploses after it understood that its
pages are in use (or user dies when page is moved out of his control in
slab).
2. get/put page does not work with slab pages, and simple
increment/decrement of the reference counters is not allowed too.

Both problems have the same root - slab does not allow anyone to 
manipulate page's members. That should be broken/changed to allow splice
to put its hands into network using the fastest way.
I will think about it.

 -- 
 Jens Axboe

-- 
Evgeniy Polyakov
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RFC] network splice receive

2007-06-06 Thread Jens Axboe
On Tue, Jun 05 2007, jamal wrote:
 On Tue, 2007-05-06 at 14:20 +0200, Jens Axboe wrote:
 
   
    1800 4ff3 937f e000 6381 7275 0008
   
   Perhaps that hex pattern rings a bell with someone intimate with the
   networking. The remaining wrong bytes don't seem to have anything in
   common.
  
  Ok, the source mac address is 00:18:F3:4F:7F:93 and the destination is
  00:E0:81:63:75:72 which are the middle 12 bytes of the 16.
  
 
 It appears you may have endianness issues and perhaps a 16 bit
 mis-offset. the 0008 at the end looks like a swapped 0x800 which
 is the ethernet type for IPV4.

Yeah, it looks like the first part of the ethernet frame. I'm using an
e1000 on the receive side and a sky2 on the sender.

  Hope that helps someone clue me in as to which network part is reusing
  the data. Do I need to 'pin' the sk_buff until the pipe data has been
  consumed?
 
 I would worry about the driver level first - likely thats where your
 corruption is.

I'd just be a heck-of-a-lot more inclined to suspect my code! Hence I
thought that data reuse for perhaps another packet would be the likely
explanation, especially since it looked like valid network driver data
ending up where it should not.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RFC] network splice receive

2007-06-06 Thread Jens Axboe
On Tue, Jun 05 2007, Evgeniy Polyakov wrote:
 On Tue, Jun 05, 2007 at 10:05:43AM +0200, Jens Axboe ([EMAIL PROTECTED]) 
 wrote:
  Here's an implementation of tcp network splice receive support. It's
  originally based on the patch set that Intel posted some time ago, but
  has been (close to) 100% reworked.
  
  Now, I'm not a networking guru by any stretch of the imagination, so I'd
  like some input on the direction of the main patch. Is the approach
  feasible? Glaring errors? Missing bits?
 
   263.709262] [ cut here ]
   [  263.713932] kernel BUG at include/linux/mm.h:285!
   [  263.718678] invalid opcode:  [1] PREEMPT SMP 
   [  263.723561] CPU 0 
   [  263.725665] Modules linked in: button loop snd_intel8x0
   snd_ac97_codec ac97_bus snd_pcm snd_timer snd soundcore psmouse
   snd_page_alloc k8temp i2c_nforcen
   [  263.755666] Pid: 2709, comm: splice-fromnet Not tainted
   2.6.22-rc4-splice #2
   [  263.762759] RIP: 0010:[8038c60c]  [8038c60c]
   skb_splice_bits+0xac/0x1c9
   [  263.771212] RSP: 0018:81003c79fc88  EFLAGS: 00010246
   [  263.776564] RAX:  RBX: 05a8 RCX:
   81003ff04778
   [  263.783743] RDX: 81003ff04778 RSI: 0ab2 RDI:
   0003d52d
   [  263.790925] RBP: 81003c79fdd8 R08:  R09:
   81003d927b78
   [  263.798104] R10: 803b0181 R11: 81003c79fde8 R12:
   81003d52d000
   [  263.805284] R13: 054e R14: 81003d927b78 R15:
   81003bbc6ea0
   [  263.812463] FS:  2ac4089a86d0() GS:804fb000()
   knlGS:
   [  263.820611] CS:  0010 DS:  ES:  CR0: 8005003b
   [  263.826396] CR2: 2ac4088320e0 CR3: 3c987000 CR4:
   06e0
   [  263.833578] Process splice-fromnet (pid: 2709, threadinfo
   81003c79e000, task 81003755c380)
   [  263.842591] Stack:  81003ff04720 
   81003755c380 0046
   [  263.850897]  00c6 0046 81003b0428b8
   81003d0b5b10
   [  263.858543]  00c6 81003d0b5b10 81003b0428b8
   81003d0b5b10
   [  263.865957] Call Trace:
   [  263.868683]  [803dc630] _read_unlock_irq+0x31/0x4e
   [  263.874393]  [803afb54] tcp_splice_data_recv+0x20/0x22
   [  263.880447]  [803afa2b] tcp_read_sock+0xa2/0x1ab
   [  263.885983]  [803afb34] tcp_splice_data_recv+0x0/0x22
   [  263.891951]  [803b01c1] tcp_splice_read+0xae/0x1a3
   [  263.897655]  [8038920f] sock_def_readable+0x0/0x6f
   [  263.903366]  [80384a65] sock_splice_read+0x15/0x17
   [  263.909072]  [8029e773] do_splice_to+0x76/0x88
   [  263.914432]  [8029fcc8] sys_splice+0x1a8/0x232
   [  263.919795]  [802097ce] system_call+0x7e/0x83
   [  263.925067] 
   [  263.926606] 
   [  263.926607] Code: 0f 0b eb fe 44 89 e6 81 e6 ff 0f 00 00 90 ff 42
   08 48 63 55 
   [  263.936418] RIP  [8038c60c] skb_splice_bits+0xac/0x1c9
   [  263.942516]  RSP 81003c79fc88
 
 This a vm_bug_on in get_page().
 
  +static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page 
  *page,
  +   unsigned int len, unsigned int offset)
  +{
  +   struct page *p;
  +
  +   if (unlikely(spd-nr_pages == PIPE_BUFFERS))
  +   return 1;
  +
  +#ifdef NET_COPY_SPLICE
  +   p = alloc_pages(GFP_KERNEL, 0);
  +   if (!p)
  +   return 1;
  +
  +   memcpy(page_address(p) + offset, page_address(page) + offset, len);
  +#else
  +   p = page;
  +   get_page(p);
  +#endif
 
 Some pages have zero reference counter here.
 
 Is commented NET_COPY_SPLICE part from old implementation?
 It will be always slower than existing approach due to allocation
 overhead.

NET_COPY_SPLICE is not supposed to be enabled, it's just a demonstration
of a copy operation if you wanted to test functionality without just
linking in the skb pages. At least that would allow you to test
correctness of the rest of the code, since I don't know if the skb page
linking is entirely correct yet.

But it's somewhat annoying to get pages with zero reference counts
there, I wonder how that happens. I guess if the skb-data originated
from kmalloc() then we don't really know. The main intent there was just
to ensure the page wasn't going away, but clearly it's not good enough
to ensure that reuse isn't taking place.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH][RFC] network splice receive

2007-06-05 Thread Jens Axboe
Hi,

Here's an implementation of tcp network splice receive support. It's
originally based on the patch set that Intel posted some time ago, but
has been (close to) 100% reworked.

Now, I'm not a networking guru by any stretch of the imagination, so I'd
like some input on the direction of the main patch. Is the approach
feasible? Glaring errors? Missing bits?

If you want to test it, I'd suggest downloading the latest splice tools
snapshot here:

http://brick.kernel.dk/snaps/splice-git-latest.tar.gz

Examples:

- Sending a small test message over the network:

  [EMAIL PROTECTED]:~/splice $ ./splice-fromnet  | cat
  [EMAIL PROTECTED]:~ $ echo hello | netcat host1 

  should write hello on host1. Yeah, complex stuff.

- Sending a file over the network:

  [EMAIL PROTECTED]:~/splice $ ./splice-fromnet  | ./splice out outfile
  [EMAIL PROTECTED]:~ $ cat somefile | netcat host1 

  should send somefile over the network, storing it in outfile.

Seems to work reasonably well for me, sometimes I do see small ranges
inside the output file that are not correct, but I haven't been able to
reproduce today. I think it has to do with page reuse, hence the
NET_COPY_SPLICE ifdef that you can enable to just plain copy the data
instead of referencing it.

Patches are against the #splice branch of the block repo, official url
of that is:

git://git.kernel.dk/data/git/linux-2.6-block.git/

and it's based on Linus main tree (at 2.6.22-rc4 right now). Let me know
if I should supply netdev branch patches instead.

-- 
Jens Axboe

From 592c46ea813c31c0d6b28bf543ce2f5dd884a75e Mon Sep 17 00:00:00 2001
From: Jens Axboe [EMAIL PROTECTED]
Date: Mon, 4 Jun 2007 15:06:43 +0200
Subject: [PATCH] [NET] tcp_read_sock: alloc recv_actor() return return negative error value

Signed-off-by: Jens Axboe [EMAIL PROTECTED]
---
 net/ipv4/tcp.c |8 ++--
 1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index cd3c7e9..450f44b 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1064,7 +1064,11 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 	break;
 			}
 			used = recv_actor(desc, skb, offset, len);
-			if (used = len) {
+			if (used  0) {
+if (!copied)
+	copied = used;
+break;
+			} else if (used = len) {
 seq += used;
 copied += used;
 offset += used;
@@ -1086,7 +1090,7 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 	tcp_rcv_space_adjust(sk);
 
 	/* Clean up data we have read: This will do ACK frames. */
-	if (copied)
+	if (copied  0)
 		tcp_cleanup_rbuf(sk, copied);
 	return copied;
 }
-- 
1.5.2.rc1

From 10d906a9a5a16a022d5067bee3963a0e3a03ae0c Mon Sep 17 00:00:00 2001
From: Jens Axboe [EMAIL PROTECTED]
Date: Tue, 5 Jun 2007 09:54:00 +0200
Subject: [PATCH] [NET] TCP splice receive support

Losely based on original patches from Intel, modified to actually
be zero-copy (the original patches memcpy'ed the data).

Signed-off-by: Jens Axboe [EMAIL PROTECTED]
---
 include/linux/net.h|3 +
 include/linux/skbuff.h |5 ++
 include/net/tcp.h  |3 +
 net/core/skbuff.c  |  114 +++
 net/ipv4/af_inet.c |1 +
 net/ipv4/tcp.c |  138 
 net/socket.c   |   13 +
 7 files changed, 277 insertions(+), 0 deletions(-)

diff --git a/include/linux/net.h b/include/linux/net.h
index efc4517..472ee12 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -19,6 +19,7 @@
 #define _LINUX_NET_H
 
 #include linux/wait.h
+#include linux/splice.h
 #include asm/socket.h
 
 struct poll_table_struct;
@@ -165,6 +166,8 @@ struct proto_ops {
   struct vm_area_struct * vma);
 	ssize_t		(*sendpage)  (struct socket *sock, struct page *page,
   int offset, size_t size, int flags);
+	ssize_t 	(*splice_read)(struct socket *sock,  loff_t *ppos,
+   struct pipe_inode_info *pipe, size_t len, unsigned int flags);
 };
 
 struct net_proto_family {
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index e7367c7..619dcf5 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1504,6 +1504,11 @@ extern int	   skb_store_bits(struct sk_buff *skb, int offset,
 extern __wsum	   skb_copy_and_csum_bits(const struct sk_buff *skb,
 	  int offset, u8 *to, int len,
 	  __wsum csum);
+extern int skb_splice_bits(const struct sk_buff *skb,
+		unsigned int offset,
+		struct pipe_inode_info *pipe,
+		unsigned int len,
+		unsigned int flags);
 extern void	   skb_copy_and_csum_dev(const struct sk_buff *skb, u8 *to);
 extern void	   skb_split(struct sk_buff *skb,
  struct sk_buff *skb1, const u32 len);
diff --git a/include/net/tcp.h b/include/net/tcp.h
index a8af9ae..8e86697 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -308,6 +308,9 @@ extern int			tcp_twsk_unique(struct sock *sk,
 
 extern void			tcp_twsk_destructor(struct sock 

Re: [PATCH][RFC] network splice receive

2007-06-05 Thread Jens Axboe
On Tue, Jun 05 2007, Jens Axboe wrote:
 Seems to work reasonably well for me, sometimes I do see small ranges
 inside the output file that are not correct, but I haven't been able to
 reproduce today. I think it has to do with page reuse, hence the
 NET_COPY_SPLICE ifdef that you can enable to just plain copy the data
 instead of referencing it.

I managed to reproduce. It's segments of 68-80 bytes beyond corrupt in
the middle of the out, and there might be 1-3 of such occurences in the
30mb file I tested with. The first 16 bytes of the corruption are always
the same:

 1800 4ff3 937f e000 6381 7275 0008

Perhaps that hex pattern rings a bell with someone intimate with the
networking. The remaining wrong bytes don't seem to have anything in
common.

Slab poisoning doesn't change the pattern, so it's not use-after-free.


-- 
Jens Axboe

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RFC] network splice receive

2007-06-05 Thread Jens Axboe
On Tue, Jun 05 2007, Jens Axboe wrote:
 On Tue, Jun 05 2007, Jens Axboe wrote:
  Seems to work reasonably well for me, sometimes I do see small ranges
  inside the output file that are not correct, but I haven't been able to
  reproduce today. I think it has to do with page reuse, hence the
  NET_COPY_SPLICE ifdef that you can enable to just plain copy the data
  instead of referencing it.
 
 I managed to reproduce. It's segments of 68-80 bytes beyond corrupt in
 the middle of the out, and there might be 1-3 of such occurences in the
 30mb file I tested with. The first 16 bytes of the corruption are always
 the same:
 
  1800 4ff3 937f e000 6381 7275 0008
 
 Perhaps that hex pattern rings a bell with someone intimate with the
 networking. The remaining wrong bytes don't seem to have anything in
 common.

Ok, the source mac address is 00:18:F3:4F:7F:93 and the destination is
00:E0:81:63:75:72 which are the middle 12 bytes of the 16.

Hope that helps someone clue me in as to which network part is reusing
the data. Do I need to 'pin' the sk_buff until the pipe data has been
consumed?

-- 
Jens Axboe

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RFC] network splice receive

2007-06-05 Thread Evgeniy Polyakov
On Tue, Jun 05, 2007 at 10:05:43AM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote:
 Hi,

Hi Jens.

 Here's an implementation of tcp network splice receive support. It's
 originally based on the patch set that Intel posted some time ago, but
 has been (close to) 100% reworked.
 
 Now, I'm not a networking guru by any stretch of the imagination, so I'd
 like some input on the direction of the main patch. Is the approach
 feasible? Glaring errors? Missing bits?

First one - you seems to create new data_ready callback
tcp_splice_data_ready(), but it is unused, and actually can not be at 
all - there will be a deadlock, since sk_data_ready can be called with locked 
socket and also in bh context.

I will setup this and report abck about bugs.

-- 
Evgeniy Polyakov
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RFC] network splice receive

2007-06-05 Thread Evgeniy Polyakov
On Tue, Jun 05, 2007 at 10:05:43AM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote:
 Here's an implementation of tcp network splice receive support. It's
 originally based on the patch set that Intel posted some time ago, but
 has been (close to) 100% reworked.
 
 Now, I'm not a networking guru by any stretch of the imagination, so I'd
 like some input on the direction of the main patch. Is the approach
 feasible? Glaring errors? Missing bits?

  263.709262] [ cut here ]
  [  263.713932] kernel BUG at include/linux/mm.h:285!
  [  263.718678] invalid opcode:  [1] PREEMPT SMP 
  [  263.723561] CPU 0 
  [  263.725665] Modules linked in: button loop snd_intel8x0
  snd_ac97_codec ac97_bus snd_pcm snd_timer snd soundcore psmouse
  snd_page_alloc k8temp i2c_nforcen
  [  263.755666] Pid: 2709, comm: splice-fromnet Not tainted
  2.6.22-rc4-splice #2
  [  263.762759] RIP: 0010:[8038c60c]  [8038c60c]
  skb_splice_bits+0xac/0x1c9
  [  263.771212] RSP: 0018:81003c79fc88  EFLAGS: 00010246
  [  263.776564] RAX:  RBX: 05a8 RCX:
  81003ff04778
  [  263.783743] RDX: 81003ff04778 RSI: 0ab2 RDI:
  0003d52d
  [  263.790925] RBP: 81003c79fdd8 R08:  R09:
  81003d927b78
  [  263.798104] R10: 803b0181 R11: 81003c79fde8 R12:
  81003d52d000
  [  263.805284] R13: 054e R14: 81003d927b78 R15:
  81003bbc6ea0
  [  263.812463] FS:  2ac4089a86d0() GS:804fb000()
  knlGS:
  [  263.820611] CS:  0010 DS:  ES:  CR0: 8005003b
  [  263.826396] CR2: 2ac4088320e0 CR3: 3c987000 CR4:
  06e0
  [  263.833578] Process splice-fromnet (pid: 2709, threadinfo
  81003c79e000, task 81003755c380)
  [  263.842591] Stack:  81003ff04720 
  81003755c380 0046
  [  263.850897]  00c6 0046 81003b0428b8
  81003d0b5b10
  [  263.858543]  00c6 81003d0b5b10 81003b0428b8
  81003d0b5b10
  [  263.865957] Call Trace:
  [  263.868683]  [803dc630] _read_unlock_irq+0x31/0x4e
  [  263.874393]  [803afb54] tcp_splice_data_recv+0x20/0x22
  [  263.880447]  [803afa2b] tcp_read_sock+0xa2/0x1ab
  [  263.885983]  [803afb34] tcp_splice_data_recv+0x0/0x22
  [  263.891951]  [803b01c1] tcp_splice_read+0xae/0x1a3
  [  263.897655]  [8038920f] sock_def_readable+0x0/0x6f
  [  263.903366]  [80384a65] sock_splice_read+0x15/0x17
  [  263.909072]  [8029e773] do_splice_to+0x76/0x88
  [  263.914432]  [8029fcc8] sys_splice+0x1a8/0x232
  [  263.919795]  [802097ce] system_call+0x7e/0x83
  [  263.925067] 
  [  263.926606] 
  [  263.926607] Code: 0f 0b eb fe 44 89 e6 81 e6 ff 0f 00 00 90 ff 42
  08 48 63 55 
  [  263.936418] RIP  [8038c60c] skb_splice_bits+0xac/0x1c9
  [  263.942516]  RSP 81003c79fc88

This a vm_bug_on in get_page().

 +static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page 
 *page,
 + unsigned int len, unsigned int offset)
 +{
 + struct page *p;
 +
 + if (unlikely(spd-nr_pages == PIPE_BUFFERS))
 + return 1;
 +
 +#ifdef NET_COPY_SPLICE
 + p = alloc_pages(GFP_KERNEL, 0);
 + if (!p)
 + return 1;
 +
 + memcpy(page_address(p) + offset, page_address(page) + offset, len);
 +#else
 + p = page;
 + get_page(p);
 +#endif

Some pages have zero reference counter here.

Is commented NET_COPY_SPLICE part from old implementation?
It will be always slower than existing approach due to allocation
overhead.

-- 
Evgeniy Polyakov
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RFC] network splice receive

2007-06-05 Thread Evgeniy Polyakov
On Tue, Jun 05, 2007 at 06:31:31PM +0400, Evgeniy Polyakov ([EMAIL PROTECTED]) 
wrote:
   [  263.936418] RIP  [8038c60c] skb_splice_bits+0xac/0x1c9
   [  263.942516]  RSP 81003c79fc88
 
 This a vm_bug_on in get_page().
 
  +static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page 
  *page,
  +   unsigned int len, unsigned int offset)
  +{
  +   struct page *p;
  +
  +   if (unlikely(spd-nr_pages == PIPE_BUFFERS))
  +   return 1;
  +
  +#ifdef NET_COPY_SPLICE
  +   p = alloc_pages(GFP_KERNEL, 0);
  +   if (!p)
  +   return 1;
  +
  +   memcpy(page_address(p) + offset, page_address(page) + offset, len);
  +#else
  +   p = page;
  +   get_page(p);
  +#endif
 
 Some pages have zero reference counter here.

Very likley bug with mac address is related to this one and you do not
have vm debug enabled in the config? Naive atomic_inc and
atomic_dec_return with bug_on  0 instead of that get_page and put_page
in spd_fill_page()/sock_pipe_buf_release() resulted in broken file -
initial data contained 6B and 5A instead of zeroes sent. Even more naive 
atomic_add(2, page) ended with:

[   48.273345] page:81003ff22a18 flags:0x0100
mapping: mapcount:0 count:2
[   48.273347] Trying to fix it up, but a reboot is needed
[   48.273349] Backtrace:
[   48.295576] 
[   48.295577] Call Trace:
[   48.299624]  [8025f075] bad_page+0x67/0x95
[   48.304636]  [8025f771] __free_pages_ok+0x76/0x2c1
[   48.310343]  [8025fbec] __free_pages+0x29/0x2b
[   48.315703]  [8025fc38] free_pages+0x4a/0x4f
[   48.320884]  [8027b15f] kmem_freepages+0xd9/0xe2
[   48.326416]  [8027bd93] slab_destroy+0xef/0x114
[   48.331865]  [8027bf15] free_block+0x15d/0x19f
[   48.337227]  [8027c0bb] cache_flusharray+0x95/0xff
[   48.342933]  [8027c36a] kfree+0x1cd/0x1ec
[   48.347863]  [8038b233] skb_release_data+0xab/0xb0
[   48.353567]  [8038aff3] kfree_skbmem+0x11/0x7e
[   48.358927]  [8038b104] __kfree_skb+0xa4/0xa9
[   48.364204]  [803afa9e] tcp_read_sock+0x101/0x1ab
[   48.369823]  [803afb48] tcp_splice_data_recv+0x0/0x22
[   48.375791]  [803b01d5] tcp_splice_read+0xae/0x1a3
[   48.381497]  [8038920f] sock_def_readable+0x0/0x6f
[   48.387209]  [80384a65] sock_splice_read+0x15/0x17
[   48.392913]  [8029e773] do_splice_to+0x76/0x88
[   48.398273]  [8029fcc8] sys_splice+0x1a8/0x232
[   48.403636]  [802097ce] system_call+0x7e/0x83

-- 
Evgeniy Polyakov
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html