Re: [PATCH][RFC] network splice receive v3
Hi. On Fri, Jul 13, 2007 at 02:21:00PM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote: It really looks like the last tree we tested, so if you think additional one will not hurt, feel free to ping, so I will completely rebase testing tree. It would be great if you could retest! There are some minor changes in there, and some extra testing definitely will not hurt. I've just tested it with 2.6.22 (e1c1e98d2a3f57b22a0d4136c8160e54404aa437 commit) and did not found any problems - after qute big files were transferred there is no observed previously skb leak, no crashes (quite a few debug options are turned on in config) and files are correct on both peers, so it works good. -- Jens Axboe -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][RFC] network splice receive v3
Hello. In article [EMAIL PROTECTED] (at Wed, 11 Jul 2007 11:19:27 +0200), Jens Axboe [EMAIL PROTECTED] says: @@ -835,6 +835,7 @@ const struct proto_ops inet_stream_ops = { .recvmsg = sock_common_recvmsg, .mmap = sock_no_mmap, .sendpage = tcp_sendpage, + .splice_read = tcp_splice_read, #ifdef CONFIG_COMPAT .compat_setsockopt = compat_sock_common_setsockopt, .compat_getsockopt = compat_sock_common_getsockopt, Please add similar bits in net/ipv6/af_inet6.c unless there are any dependency on IPv4. (And if there are, it is not good.) --yoshfuji - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][RFC] network splice receive v3
On Thu, Jul 19 2007, YOSHIFUJI Hideaki / ?$B5HF#1QL@ wrote: Hello. In article [EMAIL PROTECTED] (at Wed, 11 Jul 2007 11:19:27 +0200), Jens Axboe [EMAIL PROTECTED] says: @@ -835,6 +835,7 @@ const struct proto_ops inet_stream_ops = { .recvmsg = sock_common_recvmsg, .mmap = sock_no_mmap, .sendpage = tcp_sendpage, + .splice_read = tcp_splice_read, #ifdef CONFIG_COMPAT .compat_setsockopt = compat_sock_common_setsockopt, .compat_getsockopt = compat_sock_common_getsockopt, Please add similar bits in net/ipv6/af_inet6.c unless there are any dependency on IPv4. (And if there are, it is not good.) There are no specific ipv4 depedencies, it's just an oversight. So thanks for the clue, I'll add it! -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][RFC] network splice receive v3
On Thu, Jul 12 2007, Evgeniy Polyakov wrote: On Wed, Jul 11, 2007 at 11:19:27AM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote: Hi, Hi Jens. Here's an updated implementation of tcp network splice receive support. It actually works for me now, no data corruption seen. For the original announcement and how to test it, see: http://marc.info/?l=linux-netdevm=118103093400770w=2 The splice core changes needed to support this are now merged in 2.6.22-git, so the patchset shrinks to just two patches - one for adding a release hook, and one for the networking changes. The code is also available in the splice-net branch here: git://git.kernel.dk/data/git/linux-2.6-block.git splice-net There's a third experimental patch in there that allows vmsplice directly to user memory, that still needs some work though. Comments, testing welcome! It looks like you included all bits we found in the previous runs, so likely it will work good, but so far I have conflicts merging todays git and your tree in include/linux/splice.h, fs/ext2/file.c, fs/splice.c and mm/filemap_xip.c. This can be a problem with my tree though. Hmm, the patch should apply directly to the tree as of when I posted this original mail, or any later one. I just tried a rebase, and it rebased fine on top of the current -git as well. So I think the issue is with your tree, sorry! It really looks like the last tree we tested, so if you think additional one will not hurt, feel free to ping, so I will completely rebase testing tree. It would be great if you could retest! There are some minor changes in there, and some extra testing definitely will not hurt. -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][RFC] network splice receive v3
On Wed, Jul 11, 2007 at 11:19:27AM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote: Hi, Hi Jens. Here's an updated implementation of tcp network splice receive support. It actually works for me now, no data corruption seen. For the original announcement and how to test it, see: http://marc.info/?l=linux-netdevm=118103093400770w=2 The splice core changes needed to support this are now merged in 2.6.22-git, so the patchset shrinks to just two patches - one for adding a release hook, and one for the networking changes. The code is also available in the splice-net branch here: git://git.kernel.dk/data/git/linux-2.6-block.git splice-net There's a third experimental patch in there that allows vmsplice directly to user memory, that still needs some work though. Comments, testing welcome! It looks like you included all bits we found in the previous runs, so likely it will work good, but so far I have conflicts merging todays git and your tree in include/linux/splice.h, fs/ext2/file.c, fs/splice.c and mm/filemap_xip.c. This can be a problem with my tree though. It really looks like the last tree we tested, so if you think additional one will not hurt, feel free to ping, so I will completely rebase testing tree. -- Jens Axboe -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH][RFC] network splice receive v3
Hi, Here's an updated implementation of tcp network splice receive support. It actually works for me now, no data corruption seen. For the original announcement and how to test it, see: http://marc.info/?l=linux-netdevm=118103093400770w=2 The splice core changes needed to support this are now merged in 2.6.22-git, so the patchset shrinks to just two patches - one for adding a release hook, and one for the networking changes. The code is also available in the splice-net branch here: git://git.kernel.dk/data/git/linux-2.6-block.git splice-net There's a third experimental patch in there that allows vmsplice directly to user memory, that still needs some work though. Comments, testing welcome! -- Jens Axboe From e59a68f2d7d261b301960b97659910aab8e3d776 Mon Sep 17 00:00:00 2001 From: Jens Axboe [EMAIL PROTECTED] Date: Mon, 11 Jun 2007 13:00:32 +0200 Subject: [PATCH] splice: don't assume regular pages in splice_to_pipe() Allow caller to pass in a release function, there might be other resources that need releasing as well. Needed for network receive. Signed-off-by: Jens Axboe [EMAIL PROTECTED] --- fs/splice.c|9 - include/linux/splice.h |1 + 2 files changed, 9 insertions(+), 1 deletions(-) diff --git a/fs/splice.c b/fs/splice.c index 3160951..4b4b501 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -254,11 +254,16 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe, } while (page_nr spd_pages) - page_cache_release(spd-pages[page_nr++]); + spd-spd_release(spd, page_nr++); return ret; } +static void spd_release_page(struct splice_pipe_desc *spd, unsigned int i) +{ + page_cache_release(spd-pages[i]); +} + static int __generic_file_splice_read(struct file *in, loff_t *ppos, struct pipe_inode_info *pipe, size_t len, @@ -277,6 +282,7 @@ __generic_file_splice_read(struct file *in, loff_t *ppos, .partial = partial, .flags = flags, .ops = page_cache_pipe_buf_ops, + .spd_release = spd_release_page, }; index = *ppos PAGE_CACHE_SHIFT; @@ -1674,6 +1680,7 @@ static long vmsplice_to_pipe(struct file *file, const struct iovec __user *iov, .partial = partial, .flags = flags, .ops = user_page_pipe_buf_ops, + .spd_release = spd_release_page, }; pipe = pipe_info(file-f_path.dentry-d_inode); diff --git a/include/linux/splice.h b/include/linux/splice.h index 2c08456..b8fa41e 100644 --- a/include/linux/splice.h +++ b/include/linux/splice.h @@ -54,6 +54,7 @@ struct splice_pipe_desc { int nr_pages; /* number of pages in map */ unsigned int flags; /* splice flags */ const struct pipe_buf_operations *ops;/* ops associated with output pipe */ + void (*spd_release)(struct splice_pipe_desc *, unsigned int); }; typedef int (splice_actor)(struct pipe_inode_info *, struct pipe_buffer *, -- 1.5.3.rc0.90.gbaa79 From b62e4a5a3e3220702e837e556427972dc591ff59 Mon Sep 17 00:00:00 2001 From: Jens Axboe [EMAIL PROTECTED] Date: Wed, 20 Jun 2007 09:54:14 +0200 Subject: [PATCH] TCP splice receive support Support for network splice receive. Signed-off-by: Jens Axboe [EMAIL PROTECTED] --- include/linux/net.h|3 + include/linux/skbuff.h |5 + include/net/tcp.h |3 + net/core/skbuff.c | 246 net/ipv4/af_inet.c |1 + net/ipv4/tcp.c | 129 + net/socket.c | 13 +++ 7 files changed, 400 insertions(+), 0 deletions(-) diff --git a/include/linux/net.h b/include/linux/net.h index efc4517..472ee12 100644 --- a/include/linux/net.h +++ b/include/linux/net.h @@ -19,6 +19,7 @@ #define _LINUX_NET_H #include linux/wait.h +#include linux/splice.h #include asm/socket.h struct poll_table_struct; @@ -165,6 +166,8 @@ struct proto_ops { struct vm_area_struct * vma); ssize_t (*sendpage) (struct socket *sock, struct page *page, int offset, size_t size, int flags); + ssize_t (*splice_read)(struct socket *sock, loff_t *ppos, + struct pipe_inode_info *pipe, size_t len, unsigned int flags); }; struct net_proto_family { diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 6f0b2f7..177bffc 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -1504,6 +1504,11 @@ extern int skb_store_bits(struct sk_buff *skb, int offset, extern __wsum skb_copy_and_csum_bits(const struct sk_buff *skb, int offset, u8 *to, int len, __wsum csum); +extern int skb_splice_bits(struct sk_buff *skb, + unsigned int offset, + struct pipe_inode_info *pipe, + unsigned int len, + unsigned int flags); extern void skb_copy_and_csum_dev(const struct sk_buff *skb, u8 *to); extern void skb_split(struct sk_buff *skb, struct sk_buff *skb1, const u32 len); diff --git a/include/net/tcp.h b/include/net/tcp.h index a8af9ae..8e86697 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -308,6 +308,9 @@ extern int
Re: [PATCH][RFC] network splice receive v3
On Wed, Jul 11, 2007 at 11:19:27AM +0200, Jens Axboe wrote: Subject: [PATCH] splice: don't assume regular pages in splice_to_pipe() Allow caller to pass in a release function, there might be other resources that need releasing as well. Needed for network receive. diff --git a/fs/splice.c b/fs/splice.c index 3160951..4b4b501 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -254,11 +254,16 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe, } while (page_nr spd_pages) - page_cache_release(spd-pages[page_nr++]); + spd-spd_release(spd, page_nr++); Rather than requiring the caller set this, shouldn't we just allow it? Especially given there is only one non-page user? while (page_nr spd_pages) - page_cache_release(spd-pages[page_nr++]); + if (spd-spd_release) + spd-spd_release(spd, page_nr++); + else + page_cache_release(spd-pages[page_nr++]); Joel -- Any man who is under 30, and is not a liberal, has not heart; and any man who is over 30, and is not a conservative, has no brains. - Sir Winston Churchill Joel Becker Principal Software Developer Oracle E-mail: [EMAIL PROTECTED] Phone: (650) 506-8127 - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][RFC] network splice receive v3
On Wed, Jul 11 2007, Joel Becker wrote: On Wed, Jul 11, 2007 at 11:19:27AM +0200, Jens Axboe wrote: Subject: [PATCH] splice: don't assume regular pages in splice_to_pipe() Allow caller to pass in a release function, there might be other resources that need releasing as well. Needed for network receive. diff --git a/fs/splice.c b/fs/splice.c index 3160951..4b4b501 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -254,11 +254,16 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe, } while (page_nr spd_pages) - page_cache_release(spd-pages[page_nr++]); + spd-spd_release(spd, page_nr++); Rather than requiring the caller set this, shouldn't we just allow it? Especially given there is only one non-page user? while (page_nr spd_pages) -page_cache_release(spd-pages[page_nr++]); +if (spd-spd_release) +spd-spd_release(spd, page_nr++); +else +page_cache_release(spd-pages[page_nr++]); Certainly possible, I think it's cleaner with it always being set though. If it grows other out-of-splice.c users, then your change may be a good idea though. -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][RFC] network splice receive v2
On Fri, Jun 15 2007, Evgeniy Polyakov wrote: On Thu, Jun 14, 2007 at 01:59:02PM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote: On Thu, Jun 14 2007, Evgeniy Polyakov wrote: On Wed, Jun 13, 2007 at 12:01:04PM +0400, Evgeniy Polyakov ([EMAIL PROTECTED]) wrote: I will rebase my tree, likely something was not merged correctly. Ok, I've just rebased a tree from recent git and pulled from brick - things seems to be settled. I've ran several tests with different filesizes and all files were received correctly without kernel crashes. There is skb leak indeed, and it looks like it is in the __splice_from_pipe() for the last page. Uh, the leak, right - I had forgotten about that, was working on getting vmsplice into shape the last two days. Interesting that you mention the last page, I'll dig in now! Any more info on this (how did you notice the leak originates from there)? I first observed leak via slabinfo data (not a leak, but number of skbs did not dropped after quite huge files were transferred), then added number of allocated and freed objects in skbuff.c, they did not match for big files, so I started to check splice source and found that sometimes -release callback is not called, but code breaks out of the loop. I've put some printks in __splice_from_pipe() and found following case, when skb is leaked: when the same cloned skb was shared multiple times (no more than 2 though), only one copy was freed. Further analysis description requires some splice background (obvious for you, but that clears it for me): pipe_buffer only contains 16 pages. There is a code, which copies pages (pointers) from spd to pipe_buffer (splice_to_pipe()). skb allocations happens in chunks of different size (i.e. with different number of skbs/pages per call), so it is possible that number of allocated skbs will be less than pipe_buffer size (16), and then the rest of the pages will be put into different (or the same) pipe_buffer later. Sometimes two skbs from above example happens to be on the boundary of the pipe buffer, so only one of them is being copied into pipe_buffer, which is then transferred over the pipe. So, we have a case, when spd has (it had more, but this two are special) 2 pages (actually the same page, but two references to it), but pipe_buffer has a place only for one. In that case second page from spd will be missed. So, things turned down to be not in the __splice_from_pipe(), but splice_to_pipe(). Attached patch fixes a leak for me. It was tested with different data files and all were received correctly. Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED] diff --git a/fs/splice.c b/fs/splice.c index bc481f1..365bfd9 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -211,8 +211,6 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe, break; if (pipe-nrbufs PIPE_BUFFERS) continue; - - break; } if (spd-flags SPLICE_F_NONBLOCK) { Hmm, curious. If we hit that location, then two conditions are true: - Pipe is full - We transferred some data if you remove the break, then you'll end up blocking in pipe_wait() (unless you have SPLICE_F_NONBLOCK also set). And we don't want to block waiting for more room, if we already transferred some data. In that case we just want to return a short splice. Looking at pipe_write(), it'll block as well though. Just doesn't seem optimal to me, but... So the question is why would doing the break there cause a leak? I just don't yet see how it can happen, I'd love to fix that condition instead. For the case you describe, we should have page_nr == 1 and spd-nr_pages == 2. Is the: while (page_nr spd-nr_pages) spd-spd_release(spd, page_nr++); not dropping the right reference? -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][RFC] network splice receive v2
On Fri, Jun 15, 2007 at 10:43:18AM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote: So, things turned down to be not in the __splice_from_pipe(), but splice_to_pipe(). Attached patch fixes a leak for me. It was tested with different data files and all were received correctly. Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED] diff --git a/fs/splice.c b/fs/splice.c index bc481f1..365bfd9 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -211,8 +211,6 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe, break; if (pipe-nrbufs PIPE_BUFFERS) continue; - - break; } if (spd-flags SPLICE_F_NONBLOCK) { Hmm, curious. If we hit that location, then two conditions are true: - Pipe is full - We transferred some data Yep. if you remove the break, then you'll end up blocking in pipe_wait() (unless you have SPLICE_F_NONBLOCK also set). And we don't want to block waiting for more room, if we already transferred some data. In that case we just want to return a short splice. Looking at pipe_write(), it'll block as well though. Just doesn't seem optimal to me, but... So the question is why would doing the break there cause a leak? I just don't yet see how it can happen, I'd love to fix that condition instead. For the case you describe, we should have page_nr == 1 and spd-nr_pages == 2. Is the: while (page_nr spd-nr_pages) spd-spd_release(spd, page_nr++); not dropping the right reference? Both spd-nr_pages and page_nr are equal to 1. When spd-nr_pages was 2 there was only 1 free slot in pipe_buffer. spd_fill_page: allocated: 89, freed: 73, data: 81003d606d28 spd_fill_page: allocated: 90, freed: 73, data: 81003d606d28 splice_to_pipe: priv: 81003d606d28, spd_nrpages: 1, pipe_nrbufs: 16, page_nr: 1. spd_fill_page: allocated: 91, freed: 73, data: fff81003d6549c8 // next data ... __splice_from_pipe: process: sd_len: 0, buf_len: 0, buf_priv: 81003d606d28. __splice_from_pipe: release 81003d606d28. sock_pipe_buf_release: allocated: 91, freed: 89, ptr: 81003d606d28 splice_to_pipe: priv: 81003d6549c8, spd_nrpages: 0, pipe_nrbufs: 1, page_nr: 1. // next data -- Jens Axboe -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][RFC] network splice receive v2
On Fri, Jun 15 2007, Evgeniy Polyakov wrote: On Fri, Jun 15, 2007 at 10:43:18AM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote: So, things turned down to be not in the __splice_from_pipe(), but splice_to_pipe(). Attached patch fixes a leak for me. It was tested with different data files and all were received correctly. Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED] diff --git a/fs/splice.c b/fs/splice.c index bc481f1..365bfd9 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -211,8 +211,6 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe, break; if (pipe-nrbufs PIPE_BUFFERS) continue; - - break; } if (spd-flags SPLICE_F_NONBLOCK) { Hmm, curious. If we hit that location, then two conditions are true: - Pipe is full - We transferred some data Yep. if you remove the break, then you'll end up blocking in pipe_wait() (unless you have SPLICE_F_NONBLOCK also set). And we don't want to block waiting for more room, if we already transferred some data. In that case we just want to return a short splice. Looking at pipe_write(), it'll block as well though. Just doesn't seem optimal to me, but... So the question is why would doing the break there cause a leak? I just don't yet see how it can happen, I'd love to fix that condition instead. For the case you describe, we should have page_nr == 1 and spd-nr_pages == 2. Is the: while (page_nr spd-nr_pages) spd-spd_release(spd, page_nr++); not dropping the right reference? Both spd-nr_pages and page_nr are equal to 1. When spd-nr_pages was 2 there was only 1 free slot in pipe_buffer. Ah duh, indeed, it is a dumb bug indeed. This should work. diff --git a/fs/splice.c b/fs/splice.c index 89871c6..5327cbf 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -172,6 +172,7 @@ static const struct pipe_buf_operations user_page_pipe_buf_ops = { ssize_t splice_to_pipe(struct pipe_inode_info *pipe, struct splice_pipe_desc *spd) { + unsigned int spd_pages = spd-nr_pages; int ret, do_wakeup, page_nr; ret = 0; @@ -252,7 +253,7 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe, } } - while (page_nr spd-nr_pages) + while (page_nr spd_pages) spd-spd_release(spd, page_nr++); return ret; -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][RFC] network splice receive v2
On Fri, Jun 15, 2007 at 11:34:30AM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote: Both spd-nr_pages and page_nr are equal to 1. When spd-nr_pages was 2 there was only 1 free slot in pipe_buffer. Ah duh, indeed, it is a dumb bug indeed. This should work. Yep, this one works too. diff --git a/fs/splice.c b/fs/splice.c index 89871c6..5327cbf 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -172,6 +172,7 @@ static const struct pipe_buf_operations user_page_pipe_buf_ops = { ssize_t splice_to_pipe(struct pipe_inode_info *pipe, struct splice_pipe_desc *spd) { + unsigned int spd_pages = spd-nr_pages; int ret, do_wakeup, page_nr; ret = 0; @@ -252,7 +253,7 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe, } } - while (page_nr spd-nr_pages) + while (page_nr spd_pages) spd-spd_release(spd, page_nr++); return ret; -- Jens Axboe -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][RFC] network splice receive v2
On Wed, Jun 13, 2007 at 12:01:04PM +0400, Evgeniy Polyakov ([EMAIL PROTECTED]) wrote: I will rebase my tree, likely something was not merged correctly. Ok, I've just rebased a tree from recent git and pulled from brick - things seems to be settled. I've ran several tests with different filesizes and all files were received correctly without kernel crashes. There is skb leak indeed, and it looks like it is in the __splice_from_pipe() for the last page. Thanks Jens, good work. -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][RFC] network splice receive v2
On Thu, Jun 14 2007, Evgeniy Polyakov wrote: On Wed, Jun 13, 2007 at 12:01:04PM +0400, Evgeniy Polyakov ([EMAIL PROTECTED]) wrote: I will rebase my tree, likely something was not merged correctly. Ok, I've just rebased a tree from recent git and pulled from brick - things seems to be settled. I've ran several tests with different filesizes and all files were received correctly without kernel crashes. There is skb leak indeed, and it looks like it is in the __splice_from_pipe() for the last page. Uh, the leak, right - I had forgotten about that, was working on getting vmsplice into shape the last two days. Interesting that you mention the last page, I'll dig in now! Any more info on this (how did you notice the leak originates from there)? Thanks Jens, good work. Thanks, you're welcome! -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][RFC] network splice receive v2
On Tue, Jun 12, 2007 at 08:17:32PM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote: On Tue, Jun 12 2007, Evgeniy Polyakov wrote: On Mon, Jun 11, 2007 at 01:59:26PM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote: Patches are against the #splice branch of the block repo, official url of that is: git://git.kernel.dk/data/git/linux-2.6-block.git/ and it's based on Linus main tree. Let me know if I should supply netdev branch patches instead, or even just provide a rolled up patch (or patch series) for anyone curious to test or play with it. Hi Jens. I've just pulled your tree (splice-net, but splice tree looks the same, git pull says 'Already up-to-date.') on top of linus git and got following bug trace. I will investigate it further tomorrow. Please tell me the contents of splice-net, it looks like you didn't actually use the new code. That BUG_ON() is in get_page(), which splice-net no longer uses. So the bug report cannot be valid for the current code. This is the last commit in that tree: commit c90a6ce8242d108a5bc6fd0bc1b2aca72a2b5944 Author: Jens Axboe [EMAIL PROTECTED] Date: Mon Jun 11 21:59:50 2007 +0200 TCP splice receive support Support for network splice receive. Signed-off-by: Jens Axboe [EMAIL PROTECTED] :100644 100644 efc4517... 472ee12... M include/linux/net.h :100644 100644 e7367c7... 64e3eed... M include/linux/skbuff.h :100644 100644 a8af9ae... 8e86697... M include/net/tcp.h :100644 100644 7c6a34e... daea7b0... M net/core/skbuff.c :100644 100644 041fba3... 0ff9f86... M net/ipv4/af_inet.c :100644 100644 450f44b... 63efd7a... M net/ipv4/tcp.c :100644 100644 f453019... 41240f5... M net/socket.c I will rebase my tree, likely something was not merged correctly. -- Jens Axboe -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][RFC] network splice receive v2
On Wed, Jun 13 2007, Evgeniy Polyakov wrote: On Tue, Jun 12, 2007 at 08:17:32PM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote: On Tue, Jun 12 2007, Evgeniy Polyakov wrote: On Mon, Jun 11, 2007 at 01:59:26PM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote: Patches are against the #splice branch of the block repo, official url of that is: git://git.kernel.dk/data/git/linux-2.6-block.git/ and it's based on Linus main tree. Let me know if I should supply netdev branch patches instead, or even just provide a rolled up patch (or patch series) for anyone curious to test or play with it. Hi Jens. I've just pulled your tree (splice-net, but splice tree looks the same, git pull says 'Already up-to-date.') on top of linus git and got following bug trace. I will investigate it further tomorrow. Please tell me the contents of splice-net, it looks like you didn't actually use the new code. That BUG_ON() is in get_page(), which splice-net no longer uses. So the bug report cannot be valid for the current code. This is the last commit in that tree: commit c90a6ce8242d108a5bc6fd0bc1b2aca72a2b5944 Author: Jens Axboe [EMAIL PROTECTED] Date: Mon Jun 11 21:59:50 2007 +0200 TCP splice receive support Support for network splice receive. Signed-off-by: Jens Axboe [EMAIL PROTECTED] :100644 100644 efc4517... 472ee12... M include/linux/net.h :100644 100644 e7367c7... 64e3eed... M include/linux/skbuff.h :100644 100644 a8af9ae... 8e86697... M include/net/tcp.h :100644 100644 7c6a34e... daea7b0... M net/core/skbuff.c :100644 100644 041fba3... 0ff9f86... M net/ipv4/af_inet.c :100644 100644 450f44b... 63efd7a... M net/ipv4/tcp.c :100644 100644 f453019... 41240f5... M net/socket.c I will rebase my tree, likely something was not merged correctly. It must have been, please let me know how the current stuf works for you! -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][RFC] network splice receive
On Sat, Jun 09, 2007 at 08:36:09AM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote: On Fri, Jun 08 2007, Evgeniy Polyakov wrote: On Fri, Jun 08, 2007 at 06:57:25PM +0400, Evgeniy Polyakov ([EMAIL PROTECTED]) wrote: I will try some things for the nearest 30-60 minutes, and then will move to canoe trip until thuesday, so will not be able to work on this idea. Ok, replacing in fs/splice.c every page_cache_release() with static void splice_page_release(struct page *p) { if (!PageSlab(p)) page_cache_release(p); } Ehm, I don't see why that should be necessary. Except in splice_to_pipe(), I have considered that we need to pass in a release function if mapping fails at some point. But it's probably best to do that in the caller, since they have the knowledge of how to release the pages. The rest of the PageSlab() tests are bogus. I had a crashdump, where page was released via splice_to_pipe() indeed, I did not investigate if it is possible to release provided page in other places. I think if in future there will other slab usage cases except networking receiving, that might be useful, but as is it is not needed. and putting cloned skb into private field instead of original on in spd_fill_page() ends up without kernel hung. Why? Seems pointless to allocate a clone just to hold on to the skb, a reference should be equally good. I would not be opposed to doing it this way, I just don't see what a clone buys us as compared to just holding that reference to the skb. Receiving code does not expect shared skbs - too many fields are changed with assumptions that it is a private copy. I'm not sure it is correct, that page can be released in fs/splice.c without calling any callback from network code, when network data is being processed. Please explain! I had a crashdump, where page was attempted to be released in fs/splice.c:splice_to_pipe(), I do not have details handy, but the best solution would be to provide a release callback and use that instead of page_cache_release(). -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][RFC] network splice receive
On Tue, Jun 12 2007, Evgeniy Polyakov wrote: On Sat, Jun 09, 2007 at 08:36:09AM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote: On Fri, Jun 08 2007, Evgeniy Polyakov wrote: On Fri, Jun 08, 2007 at 06:57:25PM +0400, Evgeniy Polyakov ([EMAIL PROTECTED]) wrote: I will try some things for the nearest 30-60 minutes, and then will move to canoe trip until thuesday, so will not be able to work on this idea. Ok, replacing in fs/splice.c every page_cache_release() with static void splice_page_release(struct page *p) { if (!PageSlab(p)) page_cache_release(p); } Ehm, I don't see why that should be necessary. Except in splice_to_pipe(), I have considered that we need to pass in a release function if mapping fails at some point. But it's probably best to do that in the caller, since they have the knowledge of how to release the pages. The rest of the PageSlab() tests are bogus. I had a crashdump, where page was released via splice_to_pipe() indeed, I did not investigate if it is possible to release provided page in other places. I think if in future there will other slab usage cases except networking receiving, that might be useful, but as is it is not needed. Read the just posted code, it has moved way beyond this :-) and putting cloned skb into private field instead of original on in spd_fill_page() ends up without kernel hung. Why? Seems pointless to allocate a clone just to hold on to the skb, a reference should be equally good. I would not be opposed to doing it this way, I just don't see what a clone buys us as compared to just holding that reference to the skb. Receiving code does not expect shared skbs - too many fields are changed with assumptions that it is a private copy. Actually the main problem is that tcp_read_sock() unconditionally frees the skb, so it wouldn't help if we grabbed a reference to it. I've yet to receive an explanation of why it does so, seem awkward and violates the whole principle of reference counted objects. Davem?? So for now, skb_splice_bits() clones the incoming skb to avoid that. I'd hope we can get rid of that by fixing tcp_read_sock(), though. -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][RFC] network splice receive
On Tue, Jun 12, 2007 at 01:33:54PM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote: I had a crashdump, where page was released via splice_to_pipe() indeed, I did not investigate if it is possible to release provided page in other places. I think if in future there will other slab usage cases except networking receiving, that might be useful, but as is it is not needed. Read the just posted code, it has moved way beyond this :-) It is just a side result of traditional optimization technique called vim ':%s/page_cache_release/splice_page_release' :) and putting cloned skb into private field instead of original on in spd_fill_page() ends up without kernel hung. Why? Seems pointless to allocate a clone just to hold on to the skb, a reference should be equally good. I would not be opposed to doing it this way, I just don't see what a clone buys us as compared to just holding that reference to the skb. Receiving code does not expect shared skbs - too many fields are changed with assumptions that it is a private copy. Actually the main problem is that tcp_read_sock() unconditionally frees the skb, so it wouldn't help if we grabbed a reference to it. I've yet to receive an explanation of why it does so, seem awkward and violates the whole principle of reference counted objects. Davem?? So for now, skb_splice_bits() clones the incoming skb to avoid that. I'd hope we can get rid of that by fixing tcp_read_sock(), though. It does that because it knows, that skb is not allowed to be shared there. Similar things are being done in udp for example - code changes internal mebers of skb, since it knows skb is not shared. For example generic_make_request() is not allowed to change, say, bio-bi_sector or bi_destructor, since it does not own a block request, not matter what bi_cnt is. From another side, -bi_destructor() can do whatever it wants with bio without any check for its reference counter. According to sk_eat_skb() - it is an optimisation to remove atomic check. -- Jens Axboe -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][RFC] network splice receive
On Tue, Jun 12 2007, Evgeniy Polyakov wrote: and putting cloned skb into private field instead of original on in spd_fill_page() ends up without kernel hung. Why? Seems pointless to allocate a clone just to hold on to the skb, a reference should be equally good. I would not be opposed to doing it this way, I just don't see what a clone buys us as compared to just holding that reference to the skb. Receiving code does not expect shared skbs - too many fields are changed with assumptions that it is a private copy. Actually the main problem is that tcp_read_sock() unconditionally frees the skb, so it wouldn't help if we grabbed a reference to it. I've yet to receive an explanation of why it does so, seem awkward and violates the whole principle of reference counted objects. Davem?? So for now, skb_splice_bits() clones the incoming skb to avoid that. I'd hope we can get rid of that by fixing tcp_read_sock(), though. It does that because it knows, that skb is not allowed to be shared there. Similar things are being done in udp for example - code changes internal mebers of skb, since it knows skb is not shared. For example generic_make_request() is not allowed to change, say, bio-bi_sector or bi_destructor, since it does not own a block request, not matter what bi_cnt is. From another side, -bi_destructor() can do whatever it wants with bio without any check for its reference counter. But generic_make_request() DOES change -bi_sector, that's how partition remapping works :-). The destructor can of course do whatever it wants, by definition the bio is not referenced at that point (or it would not have been called). So while I think your analogy is quite poor, I do now follow the code (even if I think it's ugly). There's quite a big difference between changing parts of the elements of a structure to just grabbing a reference to it. If the skb cannot be referenced, skb_get() should return NULL. But that aside, I see the issue. I'll just stick to the clone, it works fine as-is (well almost, there's a leak there, but functionally it's ok!). -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][RFC] network splice receive
On Tue, Jun 12, 2007 at 02:40:05PM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote: On Tue, Jun 12 2007, Evgeniy Polyakov wrote: and putting cloned skb into private field instead of original on in spd_fill_page() ends up without kernel hung. Why? Seems pointless to allocate a clone just to hold on to the skb, a reference should be equally good. I would not be opposed to doing it this way, I just don't see what a clone buys us as compared to just holding that reference to the skb. Receiving code does not expect shared skbs - too many fields are changed with assumptions that it is a private copy. Actually the main problem is that tcp_read_sock() unconditionally frees the skb, so it wouldn't help if we grabbed a reference to it. I've yet to receive an explanation of why it does so, seem awkward and violates the whole principle of reference counted objects. Davem?? So for now, skb_splice_bits() clones the incoming skb to avoid that. I'd hope we can get rid of that by fixing tcp_read_sock(), though. It does that because it knows, that skb is not allowed to be shared there. Similar things are being done in udp for example - code changes internal mebers of skb, since it knows skb is not shared. For example generic_make_request() is not allowed to change, say, bio-bi_sector or bi_destructor, since it does not own a block request, not matter what bi_cnt is. From another side, -bi_destructor() can do whatever it wants with bio without any check for its reference counter. But generic_make_request() DOES change -bi_sector, that's how partition remapping works :-). The destructor can of course do whatever it wants, by definition the bio is not referenced at that point (or it would not have been called). So while I think your analogy is quite poor, I do now follow the code (even if I think it's ugly). There's quite a big Yeah, that was quite long time ago I hacked block layer :) Good we found a way to explain the issue. difference between changing parts of the elements of a structure to just grabbing a reference to it. If the skb cannot be referenced, skb_get() should return NULL. But that aside, I see the issue. I'll just stick to the clone, it works fine as-is (well almost, there's a leak there, but functionally it's ok!). Btw, is it allowed to use splice from network with, say, nfs? Since RPC code uses sk_user_data as long as network splice. -- Jens Axboe -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][RFC] network splice receive
On Tue, Jun 12 2007, Evgeniy Polyakov wrote: difference between changing parts of the elements of a structure to just grabbing a reference to it. If the skb cannot be referenced, skb_get() should return NULL. But that aside, I see the issue. I'll just stick to the clone, it works fine as-is (well almost, there's a leak there, but functionally it's ok!). Btw, is it allowed to use splice from network with, say, nfs? Since RPC code uses sk_user_data as long as network splice. It doesn't anymore, see the version posted today (or yesterday, but it would be silly to read older code than the newest :-) -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][RFC] network splice receive v2
On Mon, Jun 11, 2007 at 01:59:26PM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote: Patches are against the #splice branch of the block repo, official url of that is: git://git.kernel.dk/data/git/linux-2.6-block.git/ and it's based on Linus main tree. Let me know if I should supply netdev branch patches instead, or even just provide a rolled up patch (or patch series) for anyone curious to test or play with it. Hi Jens. I've just pulled your tree (splice-net, but splice tree looks the same, git pull says 'Already up-to-date.') on top of linus git and got following bug trace. I will investigate it further tomorrow. [ 51.942373] [ cut here ] [ 51.947041] kernel BUG at include/linux/mm.h:285! [ 51.951786] invalid opcode: [1] PREEMPT SMP [ 51.956680] CPU 0 [ 51.958784] Modules linked in: button loop snd_intel8x0 snd_ac97_codec psmouse ac97_bus snd_pcm snd_timer snd soundcore snd_page_alloc k8temp i2c_nforcen [ 51.988793] Pid: 2604, comm: splice-fromnet Not tainted 2.6.22-rc4-splice #2 [ 51.995886] RIP: 0010:[80389b15] [80389b15] __skb_splice_bits+0xcd/0x201 [ 52.004520] RSP: 0018:810037f23c28 EFLAGS: 00010246 [ 52.009872] RAX: RBX: 810037f23d98 RCX: 003f [ 52.017053] RDX: 81003fe93808 RSI: 81003fe93808 RDI: 0003c0a3 [ 52.024233] RBP: 810037f23c78 R08: R09: 81003780e4b8 [ 52.031412] R10: 803b01d9 R11: 810037f23de8 R12: 009a [ 52.038591] R13: R14: 810037f23c90 R15: 05a8 [ 52.045771] FS: 2b9181d2c6d0() GS:804fb000() knlGS: [ 52.053920] CS: 0010 DS: ES: CR0: 8005003b [ 52.059714] CR2: 2b9181bb60e0 CR3: 3d109000 CR4: 06e0 [ 52.066894] Process splice-fromnet (pid: 2604, threadinfo 810037f22000, task 8100010f4100) [ 52.075908] Stack: 004612d0 810037f23c94 81003780e4b8 37f23c78 [ 52.084214] faf2050e 81003780e4b8 81003780e4b8 81003e8f22d8 [ 52.091860] 81003c99c820 4d5f4ede 810037f23dd8 8038bf20 [ 52.099265] Call Trace: [ 52.101998] [8038bf20] skb_splice_bits+0x6c/0xd0 [ 52.107619] [803dc720] _read_unlock_irq+0x31/0x4e [ 52.113330] [803afc1c] tcp_splice_data_recv+0x20/0x22 [ 52.119386] [803afaf3] tcp_read_sock+0xa2/0x1ab [ 52.124920] [803afbfc] tcp_splice_data_recv+0x0/0x22 [ 52.130888] [803b0232] tcp_splice_read+0xa1/0x21b [ 52.136593] [803891cf] sock_def_readable+0x0/0x6f [ 52.142303] [80384a25] sock_splice_read+0x15/0x17 [ 52.148010] [8029e773] do_splice_to+0x76/0x88 [ 52.153370] [8029fc87] sys_splice+0x1a8/0x232 [ 52.158733] [802097ce] system_call+0x7e/0x83 [ 52.164005] [ 52.165544] [ 52.165545] Code: 0f 0b eb fe 44 39 65 d4 8b 4d d4 41 0f 47 cc 90 ff 42 08 48 [ 52.175364] RIP [80389b15] __skb_splice_bits+0xcd/0x201 [ 52.181636] RSP 810037f23c28 -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][RFC] network splice receive v2
On Tue, Jun 12 2007, Evgeniy Polyakov wrote: On Mon, Jun 11, 2007 at 01:59:26PM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote: Patches are against the #splice branch of the block repo, official url of that is: git://git.kernel.dk/data/git/linux-2.6-block.git/ and it's based on Linus main tree. Let me know if I should supply netdev branch patches instead, or even just provide a rolled up patch (or patch series) for anyone curious to test or play with it. Hi Jens. I've just pulled your tree (splice-net, but splice tree looks the same, git pull says 'Already up-to-date.') on top of linus git and got following bug trace. I will investigate it further tomorrow. Please tell me the contents of splice-net, it looks like you didn't actually use the new code. That BUG_ON() is in get_page(), which splice-net no longer uses. So the bug report cannot be valid for the current code. -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][RFC] network splice receive
On Fri, Jun 08 2007, Evgeniy Polyakov wrote: Size of the received file is bigger than file sent, file contains repeated blocks of data sometimes. Cloned skb usage is likely too big overhead, although for receiving fast clone is unused in most cases, so there might be some gain. That was actually a new bug, here: plen -= *offset; poff += *offset; in __skb_slice_bits(), we should only subtract the offset from plen, not add to poff. Then we just create some weird hole without any meaning. So remove those two poff additions in the two loops, and the size issue is resolved at least. -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH][RFC] network splice receive v2
Hi, Here's an updated implementation of tcp network splice receive support. It actually works for me now, no data corruption seen. For the original announcement and how to test it, see: http://marc.info/?l=linux-netdevm=118103093400770w=2 I hang on to the original skb by creating a clone of it and holding references to that. I really don't need a clone, but tcp_read_sock() is not being very helpful by calling sk_eat_skb() which blissfully ignores any reference counts and uncondtionally frees the skb - why is that?? The clone works around that issue. The current code also gets rid of the data_ready hack, and it should now handle linear/fragments/fraglist in the skb just fine. Thanks to Olaf for providing review of that stuff! Patches are against the #splice branch of the block repo, official url of that is: git://git.kernel.dk/data/git/linux-2.6-block.git/ and it's based on Linus main tree. Let me know if I should supply netdev branch patches instead, or even just provide a rolled up patch (or patch series) for anyone curious to test or play with it. -- Jens Axboe From fd79bf84fdeb15b72f256c342609257ae8a56235 Mon Sep 17 00:00:00 2001 From: Jens Axboe [EMAIL PROTECTED] Date: Mon, 11 Jun 2007 13:00:32 +0200 Subject: [PATCH] splice: don't assume regular pages in splice_to_pipe() Allow caller to pass in a release function, there might be other resources that need releasing as well. Needed for network receive. Signed-off-by: Jens Axboe [EMAIL PROTECTED] --- fs/splice.c|9 - include/linux/splice.h |1 + 2 files changed, 9 insertions(+), 1 deletions(-) diff --git a/fs/splice.c b/fs/splice.c index f24e367..25ec9c8 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -247,11 +247,16 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe, } while (page_nr spd-nr_pages) - page_cache_release(spd-pages[page_nr++]); + spd-spd_release(spd, page_nr++); return ret; } +static void spd_release_page(struct splice_pipe_desc *spd, unsigned int i) +{ + page_cache_release(spd-pages[i]); +} + static int __generic_file_splice_read(struct file *in, loff_t *ppos, struct pipe_inode_info *pipe, size_t len, @@ -270,6 +275,7 @@ __generic_file_splice_read(struct file *in, loff_t *ppos, .partial = partial, .flags = flags, .ops = page_cache_pipe_buf_ops, + .spd_release = spd_release_page, }; index = *ppos PAGE_CACHE_SHIFT; @@ -1442,6 +1448,7 @@ static long vmsplice_to_pipe(struct file *file, const struct iovec __user *iov, .partial = partial, .flags = flags, .ops = user_page_pipe_buf_ops, + .spd_release = spd_release_page, }; pipe = pipe_info(file-f_path.dentry-d_inode); diff --git a/include/linux/splice.h b/include/linux/splice.h index 1a1182b..04c1068 100644 --- a/include/linux/splice.h +++ b/include/linux/splice.h @@ -53,6 +53,7 @@ struct splice_pipe_desc { int nr_pages; /* number of pages in map */ unsigned int flags; /* splice flags */ const struct pipe_buf_operations *ops;/* ops associated with output pipe */ + void (*spd_release)(struct splice_pipe_desc *, unsigned int); }; typedef int (splice_actor)(struct pipe_inode_info *, struct pipe_buffer *, -- 1.5.2.1.174.gcd03 From 95a1ee277f2a6df5c95d786b9229ea0ffa46850d Mon Sep 17 00:00:00 2001 From: Jens Axboe [EMAIL PROTECTED] Date: Mon, 4 Jun 2007 15:06:43 +0200 Subject: [PATCH] tcp_read_sock: alloc recv_actor() return return negative error value Signed-off-by: Jens Axboe [EMAIL PROTECTED] --- net/ipv4/tcp.c |8 ++-- 1 files changed, 6 insertions(+), 2 deletions(-) diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index cd3c7e9..450f44b 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -1064,7 +1064,11 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc, break; } used = recv_actor(desc, skb, offset, len); - if (used = len) { + if (used 0) { +if (!copied) + copied = used; +break; + } else if (used = len) { seq += used; copied += used; offset += used; @@ -1086,7 +1090,7 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc, tcp_rcv_space_adjust(sk); /* Clean up data we have read: This will do ACK frames. */ - if (copied) + if (copied 0) tcp_cleanup_rbuf(sk, copied); return copied; } -- 1.5.2.1.174.gcd03 From f0329226c6c1f521c2069358699bae5ed48f5a43 Mon Sep 17 00:00:00 2001 From: Jens Axboe [EMAIL PROTECTED] Date: Mon, 11 Jun 2007 13:51:57 +0200 Subject: [PATCH] TCP splice receive support Support for network splice receive. Signed-off-by: Jens Axboe [EMAIL PROTECTED] --- include/linux/net.h|3 + include/linux/skbuff.h |5 ++ include/net/tcp.h |3 + net/core/skbuff.c | 175 net/ipv4/af_inet.c |1 + net/ipv4/tcp.c | 122 + net/socket.c | 13 7 files changed, 322 insertions(+), 0 deletions(-) diff --git a/include/linux/net.h b/include/linux/net.h index
Re: [PATCH][RFC] network splice receive
On Fri, Jun 08 2007, Evgeniy Polyakov wrote: On Fri, Jun 08, 2007 at 06:57:25PM +0400, Evgeniy Polyakov ([EMAIL PROTECTED]) wrote: I will try some things for the nearest 30-60 minutes, and then will move to canoe trip until thuesday, so will not be able to work on this idea. Ok, replacing in fs/splice.c every page_cache_release() with static void splice_page_release(struct page *p) { if (!PageSlab(p)) page_cache_release(p); } Ehm, I don't see why that should be necessary. Except in splice_to_pipe(), I have considered that we need to pass in a release function if mapping fails at some point. But it's probably best to do that in the caller, since they have the knowledge of how to release the pages. The rest of the PageSlab() tests are bogus. and putting cloned skb into private field instead of original on in spd_fill_page() ends up without kernel hung. Why? Seems pointless to allocate a clone just to hold on to the skb, a reference should be equally good. I would not be opposed to doing it this way, I just don't see what a clone buys us as compared to just holding that reference to the skb. I'm not sure it is correct, that page can be released in fs/splice.c without calling any callback from network code, when network data is being processed. Please explain! Size of the received file is bigger than file sent, file contains repeated blocks of data sometimes. Cloned skb usage is likely too big overhead, although for receiving fast clone is unused in most cases, so there might be some gain. Attached your patch with above changes. Thanks, I'll fiddle with this on monday. -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][RFC] network splice receive
On Thu, Jun 07 2007, Evgeniy Polyakov wrote: On Thu, Jun 07, 2007 at 12:51:59PM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote: What bout checking if page belongs to kmalloc cache (or any other cache via priviate pointers) and do not perform any kind of reference counting on them? I will play with this a bit later today. That might work, but sounds a little dirty... But there's probably no way around. Be sure to look at the #splice-net branch if you are playing with this, I've updated it a number of times and fixed some bugs in there. Notably it now gets the offset right, and handles fragments and fraglist as well. I've pulled splice-net, which indeed fixed some issues, but referencing slab pages is still is not allowed. There are at least two problems (although they are related): 1. if we do not increment reference counter for slab pages, they eventually get refilled and slab exploses after it understood that its pages are in use (or user dies when page is moved out of his control in slab). 2. get/put page does not work with slab pages, and simple increment/decrement of the reference counters is not allowed too. Both problems have the same root - slab does not allow anyone to manipulate page's members. That should be broken/changed to allow splice to put its hands into network using the fastest way. I will think about it. Perhaps it's possible to solve this at a different level - can we hang on to the skb until the pipe buffer has been consumed, and prevent reuse that way? Then we don't have to care what backing the skb has, as long as it (and its data) isn't being reused until we drop the reference to it in sock_pipe_buf_release(). -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][RFC] network splice receive
From: Jens Axboe [EMAIL PROTECTED] Date: Fri, 8 Jun 2007 09:48:24 +0200 Perhaps it's possible to solve this at a different level - can we hang on to the skb until the pipe buffer has been consumed, and prevent reuse that way? Then we don't have to care what backing the skb has, as long as it (and its data) isn't being reused until we drop the reference to it in sock_pipe_buf_release(). Depending upon whether the pipe buffer consumption is bounded of not, this will jam up the TCP sender because the SKB data allocation is charged against the socket send buffer allocation. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][RFC] network splice receive
On Fri, Jun 08 2007, David Miller wrote: From: Jens Axboe [EMAIL PROTECTED] Date: Fri, 8 Jun 2007 09:48:24 +0200 Perhaps it's possible to solve this at a different level - can we hang on to the skb until the pipe buffer has been consumed, and prevent reuse that way? Then we don't have to care what backing the skb has, as long as it (and its data) isn't being reused until we drop the reference to it in sock_pipe_buf_release(). Depending upon whether the pipe buffer consumption is bounded of not, this will jam up the TCP sender because the SKB data allocation is charged against the socket send buffer allocation. Forgive my network ignorance, but is that a problem? Since you bring it up, I guess so :-) We can grow the pipe, should we have to. So instead of blocking waiting on reader consumption, we can extend the size of the pipe and keep going. -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][RFC] network splice receive
On Fri, Jun 08 2007, Evgeniy Polyakov wrote: On Fri, Jun 08, 2007 at 10:38:53AM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote: On Fri, Jun 08 2007, David Miller wrote: From: Jens Axboe [EMAIL PROTECTED] Date: Fri, 8 Jun 2007 09:48:24 +0200 Perhaps it's possible to solve this at a different level - can we hang on to the skb until the pipe buffer has been consumed, and prevent reuse that way? Then we don't have to care what backing the skb has, as long as it (and its data) isn't being reused until we drop the reference to it in sock_pipe_buf_release(). Depending upon whether the pipe buffer consumption is bounded of not, this will jam up the TCP sender because the SKB data allocation is charged against the socket send buffer allocation. Forgive my network ignorance, but is that a problem? Since you bring it up, I guess so :-) David means, that socket bufer allocation is limited, and delaying freeing can end up with exhausint that limit. OK, so a delayed empty of the pipe could end up causing packet drops elsewhere due to allocation exhaustion? We can grow the pipe, should we have to. So instead of blocking waiting on reader consumption, we can extend the size of the pipe and keep going. I have a code, which roughly works (but I will test it some more), which just introduces reference counters for slab pages, so that the would not be actually freed via page reclaim, but only after reference counters are dropped. That forced changes in mm/slab.c so likely it is unacceptible solution, but it is interesting as is. Hmm, still seems like it's working around the problem. We essentially just need to ensure that the data doesn't get _reused_, not just freed. It doesn't help holding a reference to the page, if someone else just reuses it and fills it with other data before it has been consumed and released by the pipe buffer operations. That's why I thought the skb referencing was the better idea, then we don't have to care about the backing of the skb either. Provided that preventing the free of the skb before the pipe buffer has been consumed guarantees that the contents aren't reused. -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][RFC] network splice receive
On Fri, Jun 08, 2007 at 10:38:53AM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote: On Fri, Jun 08 2007, David Miller wrote: From: Jens Axboe [EMAIL PROTECTED] Date: Fri, 8 Jun 2007 09:48:24 +0200 Perhaps it's possible to solve this at a different level - can we hang on to the skb until the pipe buffer has been consumed, and prevent reuse that way? Then we don't have to care what backing the skb has, as long as it (and its data) isn't being reused until we drop the reference to it in sock_pipe_buf_release(). Depending upon whether the pipe buffer consumption is bounded of not, this will jam up the TCP sender because the SKB data allocation is charged against the socket send buffer allocation. Forgive my network ignorance, but is that a problem? Since you bring it up, I guess so :-) David means, that socket bufer allocation is limited, and delaying freeing can end up with exhausint that limit. We can grow the pipe, should we have to. So instead of blocking waiting on reader consumption, we can extend the size of the pipe and keep going. I have a code, which roughly works (but I will test it some more), which just introduces reference counters for slab pages, so that the would not be actually freed via page reclaim, but only after reference counters are dropped. That forced changes in mm/slab.c so likely it is unacceptible solution, but it is interesting as is. -- Jens Axboe -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][RFC] network splice receive
On Fri, Jun 08, 2007 at 11:04:40AM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote: OK, so a delayed empty of the pipe could end up causing packet drops elsewhere due to allocation exhaustion? Yes. We can grow the pipe, should we have to. So instead of blocking waiting on reader consumption, we can extend the size of the pipe and keep going. I have a code, which roughly works (but I will test it some more), which just introduces reference counters for slab pages, so that the would not be actually freed via page reclaim, but only after reference counters are dropped. That forced changes in mm/slab.c so likely it is unacceptible solution, but it is interesting as is. Hmm, still seems like it's working around the problem. We essentially just need to ensure that the data doesn't get _reused_, not just freed. It doesn't help holding a reference to the page, if someone else just reuses it and fills it with other data before it has been consumed and released by the pipe buffer operations. That's why I thought the skb referencing was the better idea, then we don't have to care about the backing of the skb either. Provided that preventing the free of the skb before the pipe buffer has been consumed guarantees that the contents aren't reused. It is not only better idea, it is the only correct one. Attached patch for interested reader, which does slab pages accounting, but it is broken. It does not fires up with kernel bug, but it fills output file with random garbage from reused and dirtied pages. And I do not know why, but received file is always smaller than file being sent (when file has resonable size like 10mb, with 4-40kb filesize things seems to be ok). I've started skb referencing work, let's see where this will end up. diff --git a/fs/splice.c b/fs/splice.c index 928bea0..742e1ee 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -29,6 +29,18 @@ #include linux/syscalls.h #include linux/uio.h +extern void slab_change_usage(struct page *p); + +static inline void splice_page_release(struct page *p) +{ + struct page *head = p-first_page; + if (!PageSlab(head)) + page_cache_release(p); + else { + slab_change_usage(head); + } +} + /* * Attempt to steal a page from a pipe buffer. This should perhaps go into * a vm helper function, it's already simplified quite a bit by the @@ -81,7 +93,7 @@ static int page_cache_pipe_buf_steal(struct pipe_inode_info *pipe, static void page_cache_pipe_buf_release(struct pipe_inode_info *pipe, struct pipe_buffer *buf) { - page_cache_release(buf-page); + splice_page_release(buf-page); buf-flags = ~PIPE_BUF_FLAG_LRU; } @@ -246,7 +258,7 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe, } while (page_nr spd-nr_pages) - page_cache_release(spd-pages[page_nr++]); + splice_page_release(spd-pages[page_nr++]); return ret; } @@ -322,7 +334,7 @@ __generic_file_splice_read(struct file *in, loff_t *ppos, error = add_to_page_cache_lru(page, mapping, index, GFP_KERNEL); if (unlikely(error)) { - page_cache_release(page); + splice_page_release(page); if (error == -EEXIST) continue; break; @@ -448,7 +460,7 @@ fill_it: * we got, 'nr_pages' is how many pages are in the map. */ while (page_nr nr_pages) - page_cache_release(pages[page_nr++]); + splice_page_release(pages[page_nr++]); if (spd.nr_pages) return splice_to_pipe(pipe, spd); @@ -604,7 +616,7 @@ find_page: if (ret != AOP_TRUNCATED_PAGE) unlock_page(page); - page_cache_release(page); + splice_page_release(page); if (ret == AOP_TRUNCATED_PAGE) goto find_page; @@ -634,7 +646,7 @@ find_page: ret = mapping-a_ops-commit_write(file, page, offset, offset+this_len); if (ret) { if (ret == AOP_TRUNCATED_PAGE) { - page_cache_release(page); + splice_page_release(page); goto find_page; } if (ret 0) @@ -651,7 +663,7 @@ find_page: */ mark_page_accessed(page); out: - page_cache_release(page); + splice_page_release(page); unlock_page(page); out_ret: return ret; diff --git a/mm/slab.c b/mm/slab.c index 2e71a32..673383d 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -1649,8 +1649,12 @@ static void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, int nodeid) else add_zone_page_state(page_zone(page),
Re: [PATCH][RFC] network splice receive
On Fri, Jun 08 2007, Evgeniy Polyakov wrote: On Fri, Jun 08, 2007 at 11:04:40AM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote: OK, so a delayed empty of the pipe could end up causing packet drops elsewhere due to allocation exhaustion? Yes. We can grow the pipe, should we have to. So instead of blocking waiting on reader consumption, we can extend the size of the pipe and keep going. I have a code, which roughly works (but I will test it some more), which just introduces reference counters for slab pages, so that the would not be actually freed via page reclaim, but only after reference counters are dropped. That forced changes in mm/slab.c so likely it is unacceptible solution, but it is interesting as is. Hmm, still seems like it's working around the problem. We essentially just need to ensure that the data doesn't get _reused_, not just freed. It doesn't help holding a reference to the page, if someone else just reuses it and fills it with other data before it has been consumed and released by the pipe buffer operations. That's why I thought the skb referencing was the better idea, then we don't have to care about the backing of the skb either. Provided that preventing the free of the skb before the pipe buffer has been consumed guarantees that the contents aren't reused. It is not only better idea, it is the only correct one. Attached patch for interested reader, which does slab pages accounting, but it is broken. It does not fires up with kernel bug, but it fills output file with random garbage from reused and dirtied pages. And I do not know why, but received file is always smaller than file being sent (when file has resonable size like 10mb, with 4-40kb filesize things seems to be ok). I've started skb referencing work, let's see where this will end up. Here's a start, for the splice side at least of storing a buf-private entity with the ops. diff --git a/fs/splice.c b/fs/splice.c index 90588a8..f24e367 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -191,6 +191,7 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe, buf-page = spd-pages[page_nr]; buf-offset = spd-partial[page_nr].offset; buf-len = spd-partial[page_nr].len; + buf-private = spd-partial[page_nr].private; buf-ops = spd-ops; if (spd-flags SPLICE_F_GIFT) buf-flags |= PIPE_BUF_FLAG_GIFT; diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h index 7ba228d..4409167 100644 --- a/include/linux/pipe_fs_i.h +++ b/include/linux/pipe_fs_i.h @@ -14,6 +14,7 @@ struct pipe_buffer { unsigned int offset, len; const struct pipe_buf_operations *ops; unsigned int flags; + unsigned long private; }; struct pipe_inode_info { diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 619dcf5..64e3eed 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -1504,7 +1504,7 @@ extern int skb_store_bits(struct sk_buff *skb, int offset, extern __wsum skb_copy_and_csum_bits(const struct sk_buff *skb, int offset, u8 *to, int len, __wsum csum); -extern int skb_splice_bits(const struct sk_buff *skb, +extern int skb_splice_bits(struct sk_buff *skb, unsigned int offset, struct pipe_inode_info *pipe, unsigned int len, diff --git a/include/linux/splice.h b/include/linux/splice.h index b3f1528..1a1182b 100644 --- a/include/linux/splice.h +++ b/include/linux/splice.h @@ -41,6 +41,7 @@ struct splice_desc { struct partial_page { unsigned int offset; unsigned int len; + unsigned long private; }; /* diff --git a/net/core/skbuff.c b/net/core/skbuff.c index d2b2547..7d9ec9e 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -78,7 +78,10 @@ static void sock_pipe_buf_release(struct pipe_inode_info *pipe, #ifdef NET_COPY_SPLICE __free_page(buf-page); #else - put_page(buf-page); + struct sk_buff *skb = (struct sk_buff *) buf-private; + + kfree_skb(skb); + //put_page(buf-page); #endif } @@ -1148,7 +1151,8 @@ fault: * Fill page/offset/length into spd, if it can hold more pages. */ static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page, - unsigned int len, unsigned int offset) + unsigned int len, unsigned int offset, + struct sk_buff *skb) { struct page *p; @@ -1163,12 +1167,14 @@ static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
Re: [PATCH][RFC] network splice receive
On Fri, Jun 08, 2007 at 04:14:52PM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote: Here's a start, for the splice side at least of storing a buf-private entity with the ops. :) I tested the same implementation, but I put skb pointer into page-private. My approach is not correct, since the same page can hold several objects, so if there are several splicers, this will scream. I've tested your patch on top of splice-net branch, here is a result: [ 44.798853] Slab corruption: skbuff_head_cache start=81003b726668, len=192 [ 44.806148] Redzone: 0x9f911029d74e35b/0x9f911029d74e35b. [ 44.811598] Last user: [803699fd](kfree_skbmem+0x7a/0x7e) [ 44.818012] 0b0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6a 6b 6b a5 [ 44.824889] Prev obj: start=81003b726590, len=192 [ 44.829985] Redzone: 0xd84156c5635688c0/0xd84156c5635688c0. [ 44.835604] Last user: [8036a22c](__alloc_skb+0x40/0x13f) [ 44.842010] 000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 44.848896] 010: 20 58 7e 3b 00 81 ff ff 00 00 00 00 00 00 00 00 [ 44.855772] Next obj: start=81003b726740, len=192 [ 44.860868] Redzone: 0x9f911029d74e35b/0x9f911029d74e35b. [ 44.866314] Last user: [803699fd](kfree_skbmem+0x7a/0x7e) [ 44.872721] 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b [ 44.879597] 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b I will try some things for the nearest 30-60 minutes, and then will move to canoe trip until thuesday, so will not be able to work on this idea. -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][RFC] network splice receive
On Fri, Jun 08 2007, Evgeniy Polyakov wrote: On Fri, Jun 08, 2007 at 04:14:52PM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote: Here's a start, for the splice side at least of storing a buf-private entity with the ops. :) I tested the same implementation, but I put skb pointer into page-private. My approach is not correct, since the same page can hold several objects, so if there are several splicers, this will scream. I've tested your patch on top of splice-net branch, here is a result: [ 44.798853] Slab corruption: skbuff_head_cache start=81003b726668, len=192 [ 44.806148] Redzone: 0x9f911029d74e35b/0x9f911029d74e35b. [ 44.811598] Last user: [803699fd](kfree_skbmem+0x7a/0x7e) [ 44.818012] 0b0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6a 6b 6b a5 [ 44.824889] Prev obj: start=81003b726590, len=192 [ 44.829985] Redzone: 0xd84156c5635688c0/0xd84156c5635688c0. [ 44.835604] Last user: [8036a22c](__alloc_skb+0x40/0x13f) [ 44.842010] 000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 44.848896] 010: 20 58 7e 3b 00 81 ff ff 00 00 00 00 00 00 00 00 [ 44.855772] Next obj: start=81003b726740, len=192 [ 44.860868] Redzone: 0x9f911029d74e35b/0x9f911029d74e35b. [ 44.866314] Last user: [803699fd](kfree_skbmem+0x7a/0x7e) [ 44.872721] 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b [ 44.879597] 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b I will try some things for the nearest 30-60 minutes, and then will move to canoe trip until thuesday, so will not be able to work on this idea. I'm not surprised, it wasn't tested at all - just provides the basic framework for storing the skb so we can access it on pipe buffer release. Lets talk more next week, I'll likely play with this approach on monday. -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][RFC] network splice receive
On Fri, Jun 08, 2007 at 06:57:25PM +0400, Evgeniy Polyakov ([EMAIL PROTECTED]) wrote: I will try some things for the nearest 30-60 minutes, and then will move to canoe trip until thuesday, so will not be able to work on this idea. Ok, replacing in fs/splice.c every page_cache_release() with static void splice_page_release(struct page *p) { if (!PageSlab(p)) page_cache_release(p); } and putting cloned skb into private field instead of original on in spd_fill_page() ends up without kernel hung. I'm not sure it is correct, that page can be released in fs/splice.c without calling any callback from network code, when network data is being processed. Size of the received file is bigger than file sent, file contains repeated blocks of data sometimes. Cloned skb usage is likely too big overhead, although for receiving fast clone is unused in most cases, so there might be some gain. Attached your patch with above changes. diff --git a/fs/splice.c b/fs/splice.c index 928bea0..a75dc56 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -29,6 +29,12 @@ #include linux/syscalls.h #include linux/uio.h +static void splice_page_release(struct page *p) +{ + if (!PageSlab(p)) + page_cache_release(p); +} + /* * Attempt to steal a page from a pipe buffer. This should perhaps go into * a vm helper function, it's already simplified quite a bit by the @@ -81,7 +87,7 @@ static int page_cache_pipe_buf_steal(struct pipe_inode_info *pipe, static void page_cache_pipe_buf_release(struct pipe_inode_info *pipe, struct pipe_buffer *buf) { - page_cache_release(buf-page); + splice_page_release(buf-page); buf-flags = ~PIPE_BUF_FLAG_LRU; } @@ -191,6 +197,7 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe, buf-page = spd-pages[page_nr]; buf-offset = spd-partial[page_nr].offset; buf-len = spd-partial[page_nr].len; + buf-private = spd-partial[page_nr].private; buf-ops = spd-ops; if (spd-flags SPLICE_F_GIFT) buf-flags |= PIPE_BUF_FLAG_GIFT; @@ -246,7 +253,7 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe, } while (page_nr spd-nr_pages) - page_cache_release(spd-pages[page_nr++]); + splice_page_release(spd-pages[page_nr++]); return ret; } @@ -322,7 +329,7 @@ __generic_file_splice_read(struct file *in, loff_t *ppos, error = add_to_page_cache_lru(page, mapping, index, GFP_KERNEL); if (unlikely(error)) { - page_cache_release(page); + splice_page_release(page); if (error == -EEXIST) continue; break; @@ -448,7 +455,7 @@ fill_it: * we got, 'nr_pages' is how many pages are in the map. */ while (page_nr nr_pages) - page_cache_release(pages[page_nr++]); + splice_page_release(pages[page_nr++]); if (spd.nr_pages) return splice_to_pipe(pipe, spd); @@ -604,7 +611,7 @@ find_page: if (ret != AOP_TRUNCATED_PAGE) unlock_page(page); - page_cache_release(page); + splice_page_release(page); if (ret == AOP_TRUNCATED_PAGE) goto find_page; @@ -634,7 +641,7 @@ find_page: ret = mapping-a_ops-commit_write(file, page, offset, offset+this_len); if (ret) { if (ret == AOP_TRUNCATED_PAGE) { - page_cache_release(page); + splice_page_release(page); goto find_page; } if (ret 0) @@ -651,7 +658,7 @@ find_page: */ mark_page_accessed(page); out: - page_cache_release(page); + splice_page_release(page); unlock_page(page); out_ret: return ret; diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h index 7ba228d..4409167 100644 --- a/include/linux/pipe_fs_i.h +++ b/include/linux/pipe_fs_i.h @@ -14,6 +14,7 @@ struct pipe_buffer { unsigned int offset, len; const struct pipe_buf_operations *ops; unsigned int flags; + unsigned long private; }; struct pipe_inode_info { diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 619dcf5..64e3eed 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -1504,7 +1504,7 @@ extern int skb_store_bits(struct sk_buff *skb, int offset, extern __wsum skb_copy_and_csum_bits(const struct sk_buff *skb, int offset, u8 *to, int len,
Re: [PATCH][RFC] network splice receive
On Wed, Jun 06, 2007 at 09:17:25AM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote: Some pages have zero reference counter here. But it's somewhat annoying to get pages with zero reference counts there, I wonder how that happens. I guess if the skb-data originated from kmalloc() then we don't really know. The main intent there was just to ensure the page wasn't going away, but clearly it's not good enough to ensure that reuse isn't taking place. What bout checking if page belongs to kmalloc cache (or any other cache via priviate pointers) and do not perform any kind of reference counting on them? I will play with this a bit later today. -- Jens Axboe -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][RFC] network splice receive
On Thu, Jun 07 2007, Evgeniy Polyakov wrote: On Wed, Jun 06, 2007 at 09:17:25AM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote: Some pages have zero reference counter here. But it's somewhat annoying to get pages with zero reference counts there, I wonder how that happens. I guess if the skb-data originated from kmalloc() then we don't really know. The main intent there was just to ensure the page wasn't going away, but clearly it's not good enough to ensure that reuse isn't taking place. What bout checking if page belongs to kmalloc cache (or any other cache via priviate pointers) and do not perform any kind of reference counting on them? I will play with this a bit later today. That might work, but sounds a little dirty... But there's probably no way around. Be sure to look at the #splice-net branch if you are playing with this, I've updated it a number of times and fixed some bugs in there. Notably it now gets the offset right, and handles fragments and fraglist as well. -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][RFC] network splice receive
On Thu, Jun 07, 2007 at 12:51:59PM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote: What bout checking if page belongs to kmalloc cache (or any other cache via priviate pointers) and do not perform any kind of reference counting on them? I will play with this a bit later today. That might work, but sounds a little dirty... But there's probably no way around. Be sure to look at the #splice-net branch if you are playing with this, I've updated it a number of times and fixed some bugs in there. Notably it now gets the offset right, and handles fragments and fraglist as well. I've pulled splice-net, which indeed fixed some issues, but referencing slab pages is still is not allowed. There are at least two problems (although they are related): 1. if we do not increment reference counter for slab pages, they eventually get refilled and slab exploses after it understood that its pages are in use (or user dies when page is moved out of his control in slab). 2. get/put page does not work with slab pages, and simple increment/decrement of the reference counters is not allowed too. Both problems have the same root - slab does not allow anyone to manipulate page's members. That should be broken/changed to allow splice to put its hands into network using the fastest way. I will think about it. -- Jens Axboe -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][RFC] network splice receive
On Tue, Jun 05 2007, jamal wrote: On Tue, 2007-05-06 at 14:20 +0200, Jens Axboe wrote: 1800 4ff3 937f e000 6381 7275 0008 Perhaps that hex pattern rings a bell with someone intimate with the networking. The remaining wrong bytes don't seem to have anything in common. Ok, the source mac address is 00:18:F3:4F:7F:93 and the destination is 00:E0:81:63:75:72 which are the middle 12 bytes of the 16. It appears you may have endianness issues and perhaps a 16 bit mis-offset. the 0008 at the end looks like a swapped 0x800 which is the ethernet type for IPV4. Yeah, it looks like the first part of the ethernet frame. I'm using an e1000 on the receive side and a sky2 on the sender. Hope that helps someone clue me in as to which network part is reusing the data. Do I need to 'pin' the sk_buff until the pipe data has been consumed? I would worry about the driver level first - likely thats where your corruption is. I'd just be a heck-of-a-lot more inclined to suspect my code! Hence I thought that data reuse for perhaps another packet would be the likely explanation, especially since it looked like valid network driver data ending up where it should not. -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][RFC] network splice receive
On Tue, Jun 05 2007, Evgeniy Polyakov wrote: On Tue, Jun 05, 2007 at 10:05:43AM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote: Here's an implementation of tcp network splice receive support. It's originally based on the patch set that Intel posted some time ago, but has been (close to) 100% reworked. Now, I'm not a networking guru by any stretch of the imagination, so I'd like some input on the direction of the main patch. Is the approach feasible? Glaring errors? Missing bits? 263.709262] [ cut here ] [ 263.713932] kernel BUG at include/linux/mm.h:285! [ 263.718678] invalid opcode: [1] PREEMPT SMP [ 263.723561] CPU 0 [ 263.725665] Modules linked in: button loop snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm snd_timer snd soundcore psmouse snd_page_alloc k8temp i2c_nforcen [ 263.755666] Pid: 2709, comm: splice-fromnet Not tainted 2.6.22-rc4-splice #2 [ 263.762759] RIP: 0010:[8038c60c] [8038c60c] skb_splice_bits+0xac/0x1c9 [ 263.771212] RSP: 0018:81003c79fc88 EFLAGS: 00010246 [ 263.776564] RAX: RBX: 05a8 RCX: 81003ff04778 [ 263.783743] RDX: 81003ff04778 RSI: 0ab2 RDI: 0003d52d [ 263.790925] RBP: 81003c79fdd8 R08: R09: 81003d927b78 [ 263.798104] R10: 803b0181 R11: 81003c79fde8 R12: 81003d52d000 [ 263.805284] R13: 054e R14: 81003d927b78 R15: 81003bbc6ea0 [ 263.812463] FS: 2ac4089a86d0() GS:804fb000() knlGS: [ 263.820611] CS: 0010 DS: ES: CR0: 8005003b [ 263.826396] CR2: 2ac4088320e0 CR3: 3c987000 CR4: 06e0 [ 263.833578] Process splice-fromnet (pid: 2709, threadinfo 81003c79e000, task 81003755c380) [ 263.842591] Stack: 81003ff04720 81003755c380 0046 [ 263.850897] 00c6 0046 81003b0428b8 81003d0b5b10 [ 263.858543] 00c6 81003d0b5b10 81003b0428b8 81003d0b5b10 [ 263.865957] Call Trace: [ 263.868683] [803dc630] _read_unlock_irq+0x31/0x4e [ 263.874393] [803afb54] tcp_splice_data_recv+0x20/0x22 [ 263.880447] [803afa2b] tcp_read_sock+0xa2/0x1ab [ 263.885983] [803afb34] tcp_splice_data_recv+0x0/0x22 [ 263.891951] [803b01c1] tcp_splice_read+0xae/0x1a3 [ 263.897655] [8038920f] sock_def_readable+0x0/0x6f [ 263.903366] [80384a65] sock_splice_read+0x15/0x17 [ 263.909072] [8029e773] do_splice_to+0x76/0x88 [ 263.914432] [8029fcc8] sys_splice+0x1a8/0x232 [ 263.919795] [802097ce] system_call+0x7e/0x83 [ 263.925067] [ 263.926606] [ 263.926607] Code: 0f 0b eb fe 44 89 e6 81 e6 ff 0f 00 00 90 ff 42 08 48 63 55 [ 263.936418] RIP [8038c60c] skb_splice_bits+0xac/0x1c9 [ 263.942516] RSP 81003c79fc88 This a vm_bug_on in get_page(). +static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page, + unsigned int len, unsigned int offset) +{ + struct page *p; + + if (unlikely(spd-nr_pages == PIPE_BUFFERS)) + return 1; + +#ifdef NET_COPY_SPLICE + p = alloc_pages(GFP_KERNEL, 0); + if (!p) + return 1; + + memcpy(page_address(p) + offset, page_address(page) + offset, len); +#else + p = page; + get_page(p); +#endif Some pages have zero reference counter here. Is commented NET_COPY_SPLICE part from old implementation? It will be always slower than existing approach due to allocation overhead. NET_COPY_SPLICE is not supposed to be enabled, it's just a demonstration of a copy operation if you wanted to test functionality without just linking in the skb pages. At least that would allow you to test correctness of the rest of the code, since I don't know if the skb page linking is entirely correct yet. But it's somewhat annoying to get pages with zero reference counts there, I wonder how that happens. I guess if the skb-data originated from kmalloc() then we don't really know. The main intent there was just to ensure the page wasn't going away, but clearly it's not good enough to ensure that reuse isn't taking place. -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH][RFC] network splice receive
Hi, Here's an implementation of tcp network splice receive support. It's originally based on the patch set that Intel posted some time ago, but has been (close to) 100% reworked. Now, I'm not a networking guru by any stretch of the imagination, so I'd like some input on the direction of the main patch. Is the approach feasible? Glaring errors? Missing bits? If you want to test it, I'd suggest downloading the latest splice tools snapshot here: http://brick.kernel.dk/snaps/splice-git-latest.tar.gz Examples: - Sending a small test message over the network: [EMAIL PROTECTED]:~/splice $ ./splice-fromnet | cat [EMAIL PROTECTED]:~ $ echo hello | netcat host1 should write hello on host1. Yeah, complex stuff. - Sending a file over the network: [EMAIL PROTECTED]:~/splice $ ./splice-fromnet | ./splice out outfile [EMAIL PROTECTED]:~ $ cat somefile | netcat host1 should send somefile over the network, storing it in outfile. Seems to work reasonably well for me, sometimes I do see small ranges inside the output file that are not correct, but I haven't been able to reproduce today. I think it has to do with page reuse, hence the NET_COPY_SPLICE ifdef that you can enable to just plain copy the data instead of referencing it. Patches are against the #splice branch of the block repo, official url of that is: git://git.kernel.dk/data/git/linux-2.6-block.git/ and it's based on Linus main tree (at 2.6.22-rc4 right now). Let me know if I should supply netdev branch patches instead. -- Jens Axboe From 592c46ea813c31c0d6b28bf543ce2f5dd884a75e Mon Sep 17 00:00:00 2001 From: Jens Axboe [EMAIL PROTECTED] Date: Mon, 4 Jun 2007 15:06:43 +0200 Subject: [PATCH] [NET] tcp_read_sock: alloc recv_actor() return return negative error value Signed-off-by: Jens Axboe [EMAIL PROTECTED] --- net/ipv4/tcp.c |8 ++-- 1 files changed, 6 insertions(+), 2 deletions(-) diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index cd3c7e9..450f44b 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -1064,7 +1064,11 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc, break; } used = recv_actor(desc, skb, offset, len); - if (used = len) { + if (used 0) { +if (!copied) + copied = used; +break; + } else if (used = len) { seq += used; copied += used; offset += used; @@ -1086,7 +1090,7 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc, tcp_rcv_space_adjust(sk); /* Clean up data we have read: This will do ACK frames. */ - if (copied) + if (copied 0) tcp_cleanup_rbuf(sk, copied); return copied; } -- 1.5.2.rc1 From 10d906a9a5a16a022d5067bee3963a0e3a03ae0c Mon Sep 17 00:00:00 2001 From: Jens Axboe [EMAIL PROTECTED] Date: Tue, 5 Jun 2007 09:54:00 +0200 Subject: [PATCH] [NET] TCP splice receive support Losely based on original patches from Intel, modified to actually be zero-copy (the original patches memcpy'ed the data). Signed-off-by: Jens Axboe [EMAIL PROTECTED] --- include/linux/net.h|3 + include/linux/skbuff.h |5 ++ include/net/tcp.h |3 + net/core/skbuff.c | 114 +++ net/ipv4/af_inet.c |1 + net/ipv4/tcp.c | 138 net/socket.c | 13 + 7 files changed, 277 insertions(+), 0 deletions(-) diff --git a/include/linux/net.h b/include/linux/net.h index efc4517..472ee12 100644 --- a/include/linux/net.h +++ b/include/linux/net.h @@ -19,6 +19,7 @@ #define _LINUX_NET_H #include linux/wait.h +#include linux/splice.h #include asm/socket.h struct poll_table_struct; @@ -165,6 +166,8 @@ struct proto_ops { struct vm_area_struct * vma); ssize_t (*sendpage) (struct socket *sock, struct page *page, int offset, size_t size, int flags); + ssize_t (*splice_read)(struct socket *sock, loff_t *ppos, + struct pipe_inode_info *pipe, size_t len, unsigned int flags); }; struct net_proto_family { diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index e7367c7..619dcf5 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -1504,6 +1504,11 @@ extern int skb_store_bits(struct sk_buff *skb, int offset, extern __wsum skb_copy_and_csum_bits(const struct sk_buff *skb, int offset, u8 *to, int len, __wsum csum); +extern int skb_splice_bits(const struct sk_buff *skb, + unsigned int offset, + struct pipe_inode_info *pipe, + unsigned int len, + unsigned int flags); extern void skb_copy_and_csum_dev(const struct sk_buff *skb, u8 *to); extern void skb_split(struct sk_buff *skb, struct sk_buff *skb1, const u32 len); diff --git a/include/net/tcp.h b/include/net/tcp.h index a8af9ae..8e86697 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -308,6 +308,9 @@ extern int tcp_twsk_unique(struct sock *sk, extern void tcp_twsk_destructor(struct sock
Re: [PATCH][RFC] network splice receive
On Tue, Jun 05 2007, Jens Axboe wrote: Seems to work reasonably well for me, sometimes I do see small ranges inside the output file that are not correct, but I haven't been able to reproduce today. I think it has to do with page reuse, hence the NET_COPY_SPLICE ifdef that you can enable to just plain copy the data instead of referencing it. I managed to reproduce. It's segments of 68-80 bytes beyond corrupt in the middle of the out, and there might be 1-3 of such occurences in the 30mb file I tested with. The first 16 bytes of the corruption are always the same: 1800 4ff3 937f e000 6381 7275 0008 Perhaps that hex pattern rings a bell with someone intimate with the networking. The remaining wrong bytes don't seem to have anything in common. Slab poisoning doesn't change the pattern, so it's not use-after-free. -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][RFC] network splice receive
On Tue, Jun 05 2007, Jens Axboe wrote: On Tue, Jun 05 2007, Jens Axboe wrote: Seems to work reasonably well for me, sometimes I do see small ranges inside the output file that are not correct, but I haven't been able to reproduce today. I think it has to do with page reuse, hence the NET_COPY_SPLICE ifdef that you can enable to just plain copy the data instead of referencing it. I managed to reproduce. It's segments of 68-80 bytes beyond corrupt in the middle of the out, and there might be 1-3 of such occurences in the 30mb file I tested with. The first 16 bytes of the corruption are always the same: 1800 4ff3 937f e000 6381 7275 0008 Perhaps that hex pattern rings a bell with someone intimate with the networking. The remaining wrong bytes don't seem to have anything in common. Ok, the source mac address is 00:18:F3:4F:7F:93 and the destination is 00:E0:81:63:75:72 which are the middle 12 bytes of the 16. Hope that helps someone clue me in as to which network part is reusing the data. Do I need to 'pin' the sk_buff until the pipe data has been consumed? -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][RFC] network splice receive
On Tue, Jun 05, 2007 at 10:05:43AM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote: Hi, Hi Jens. Here's an implementation of tcp network splice receive support. It's originally based on the patch set that Intel posted some time ago, but has been (close to) 100% reworked. Now, I'm not a networking guru by any stretch of the imagination, so I'd like some input on the direction of the main patch. Is the approach feasible? Glaring errors? Missing bits? First one - you seems to create new data_ready callback tcp_splice_data_ready(), but it is unused, and actually can not be at all - there will be a deadlock, since sk_data_ready can be called with locked socket and also in bh context. I will setup this and report abck about bugs. -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][RFC] network splice receive
On Tue, Jun 05, 2007 at 10:05:43AM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote: Here's an implementation of tcp network splice receive support. It's originally based on the patch set that Intel posted some time ago, but has been (close to) 100% reworked. Now, I'm not a networking guru by any stretch of the imagination, so I'd like some input on the direction of the main patch. Is the approach feasible? Glaring errors? Missing bits? 263.709262] [ cut here ] [ 263.713932] kernel BUG at include/linux/mm.h:285! [ 263.718678] invalid opcode: [1] PREEMPT SMP [ 263.723561] CPU 0 [ 263.725665] Modules linked in: button loop snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm snd_timer snd soundcore psmouse snd_page_alloc k8temp i2c_nforcen [ 263.755666] Pid: 2709, comm: splice-fromnet Not tainted 2.6.22-rc4-splice #2 [ 263.762759] RIP: 0010:[8038c60c] [8038c60c] skb_splice_bits+0xac/0x1c9 [ 263.771212] RSP: 0018:81003c79fc88 EFLAGS: 00010246 [ 263.776564] RAX: RBX: 05a8 RCX: 81003ff04778 [ 263.783743] RDX: 81003ff04778 RSI: 0ab2 RDI: 0003d52d [ 263.790925] RBP: 81003c79fdd8 R08: R09: 81003d927b78 [ 263.798104] R10: 803b0181 R11: 81003c79fde8 R12: 81003d52d000 [ 263.805284] R13: 054e R14: 81003d927b78 R15: 81003bbc6ea0 [ 263.812463] FS: 2ac4089a86d0() GS:804fb000() knlGS: [ 263.820611] CS: 0010 DS: ES: CR0: 8005003b [ 263.826396] CR2: 2ac4088320e0 CR3: 3c987000 CR4: 06e0 [ 263.833578] Process splice-fromnet (pid: 2709, threadinfo 81003c79e000, task 81003755c380) [ 263.842591] Stack: 81003ff04720 81003755c380 0046 [ 263.850897] 00c6 0046 81003b0428b8 81003d0b5b10 [ 263.858543] 00c6 81003d0b5b10 81003b0428b8 81003d0b5b10 [ 263.865957] Call Trace: [ 263.868683] [803dc630] _read_unlock_irq+0x31/0x4e [ 263.874393] [803afb54] tcp_splice_data_recv+0x20/0x22 [ 263.880447] [803afa2b] tcp_read_sock+0xa2/0x1ab [ 263.885983] [803afb34] tcp_splice_data_recv+0x0/0x22 [ 263.891951] [803b01c1] tcp_splice_read+0xae/0x1a3 [ 263.897655] [8038920f] sock_def_readable+0x0/0x6f [ 263.903366] [80384a65] sock_splice_read+0x15/0x17 [ 263.909072] [8029e773] do_splice_to+0x76/0x88 [ 263.914432] [8029fcc8] sys_splice+0x1a8/0x232 [ 263.919795] [802097ce] system_call+0x7e/0x83 [ 263.925067] [ 263.926606] [ 263.926607] Code: 0f 0b eb fe 44 89 e6 81 e6 ff 0f 00 00 90 ff 42 08 48 63 55 [ 263.936418] RIP [8038c60c] skb_splice_bits+0xac/0x1c9 [ 263.942516] RSP 81003c79fc88 This a vm_bug_on in get_page(). +static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page, + unsigned int len, unsigned int offset) +{ + struct page *p; + + if (unlikely(spd-nr_pages == PIPE_BUFFERS)) + return 1; + +#ifdef NET_COPY_SPLICE + p = alloc_pages(GFP_KERNEL, 0); + if (!p) + return 1; + + memcpy(page_address(p) + offset, page_address(page) + offset, len); +#else + p = page; + get_page(p); +#endif Some pages have zero reference counter here. Is commented NET_COPY_SPLICE part from old implementation? It will be always slower than existing approach due to allocation overhead. -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][RFC] network splice receive
On Tue, Jun 05, 2007 at 06:31:31PM +0400, Evgeniy Polyakov ([EMAIL PROTECTED]) wrote: [ 263.936418] RIP [8038c60c] skb_splice_bits+0xac/0x1c9 [ 263.942516] RSP 81003c79fc88 This a vm_bug_on in get_page(). +static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page, + unsigned int len, unsigned int offset) +{ + struct page *p; + + if (unlikely(spd-nr_pages == PIPE_BUFFERS)) + return 1; + +#ifdef NET_COPY_SPLICE + p = alloc_pages(GFP_KERNEL, 0); + if (!p) + return 1; + + memcpy(page_address(p) + offset, page_address(page) + offset, len); +#else + p = page; + get_page(p); +#endif Some pages have zero reference counter here. Very likley bug with mac address is related to this one and you do not have vm debug enabled in the config? Naive atomic_inc and atomic_dec_return with bug_on 0 instead of that get_page and put_page in spd_fill_page()/sock_pipe_buf_release() resulted in broken file - initial data contained 6B and 5A instead of zeroes sent. Even more naive atomic_add(2, page) ended with: [ 48.273345] page:81003ff22a18 flags:0x0100 mapping: mapcount:0 count:2 [ 48.273347] Trying to fix it up, but a reboot is needed [ 48.273349] Backtrace: [ 48.295576] [ 48.295577] Call Trace: [ 48.299624] [8025f075] bad_page+0x67/0x95 [ 48.304636] [8025f771] __free_pages_ok+0x76/0x2c1 [ 48.310343] [8025fbec] __free_pages+0x29/0x2b [ 48.315703] [8025fc38] free_pages+0x4a/0x4f [ 48.320884] [8027b15f] kmem_freepages+0xd9/0xe2 [ 48.326416] [8027bd93] slab_destroy+0xef/0x114 [ 48.331865] [8027bf15] free_block+0x15d/0x19f [ 48.337227] [8027c0bb] cache_flusharray+0x95/0xff [ 48.342933] [8027c36a] kfree+0x1cd/0x1ec [ 48.347863] [8038b233] skb_release_data+0xab/0xb0 [ 48.353567] [8038aff3] kfree_skbmem+0x11/0x7e [ 48.358927] [8038b104] __kfree_skb+0xa4/0xa9 [ 48.364204] [803afa9e] tcp_read_sock+0x101/0x1ab [ 48.369823] [803afb48] tcp_splice_data_recv+0x0/0x22 [ 48.375791] [803b01d5] tcp_splice_read+0xae/0x1a3 [ 48.381497] [8038920f] sock_def_readable+0x0/0x6f [ 48.387209] [80384a65] sock_splice_read+0x15/0x17 [ 48.392913] [8029e773] do_splice_to+0x76/0x88 [ 48.398273] [8029fcc8] sys_splice+0x1a8/0x232 [ 48.403636] [802097ce] system_call+0x7e/0x83 -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html