Re: PATCH zero-copy send completion callback
On Tuesday 17 October 2006 02:53, Eric Barton wrote:

> If so, do you have any ideas about how to do it more economically? It's 2
> pointers rather than 1 to avoid forcing an unnecessary packet boundary
> between successive zero-copy sends. But I guess that might not be hugely
> significant since you're generally sending many pages when zero-copy is
> needed for performance. Also, (please correct me if I'm wrong) I didn't
> think this would push the allocation over to the next entry in
> 'malloc_sizes'.

Well, skbuff heads are allocated from dedicated kmem_caches
(skbuff_fclone_cache and skbuff_head_cache), and these caches are not
constrained by the sizes available in malloc_sizes. Their sizes are a
multiple of the L1 cache line size, which is 64 bytes on most common
machines.

Even if your two-pointer addition (16 bytes on x86_64) doesn't cross a
64-byte line (I didn't check), the pointers are going to be set to NULL each
time an skbuff is allocated, and checked against NULL each time an skbuff is
destroyed.

Eric
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Re: PATCH zero-copy send completion callback
On Tue, Oct 17, 2006 at 01:53:02AM +0100, Eric Barton ([EMAIL PROTECTED]) wrote:

> > And these days we're trying to figure out how to eliminate skbuff and
> > skb_shared_info struct members whereas you're adding 16-bytes of space
> > on 64-bit platforms.
>
> Do you think the general concept of a zero-copy completion callback is
> useful?

You can use the existing skb destructor; the appropriate reference counter
is already there. In your own destructor you need to call the old one of
course, and its type can be determined from analysis of the headers and the
skb itself (there are not that many destructor types, actually). If that
level of abstraction is not enough, it is possible to change
skb_release_data()/__kfree_skb() so that it would be possible in
skb->destructor() to determine whether the attached pages will be freed or
not.

> If so, do you have any ideas about how to do it more economically? It's 2
> pointers rather than 1 to avoid forcing an unnecessary packet boundary
> between successive zero-copy sends. But I guess that might not be hugely
> significant since you're generally sending many pages when zero-copy is
> needed for performance.

The existing sendfile() implementation is synchronous; it does not require
an async callback. It looks like lustre sets a number of pages to be sent
asynchronously, reports to the user that everything is ok, and then, when
the appropriate callback is invoked, updates its metadata? Fair enough, it
looks similar to the VFS cache in the case of a usual write.

> Also, (please correct me if I'm wrong) I didn't think this would push the
> allocation over to the next entry in 'malloc_sizes'.

skbs are allocated from their own cache, and the smaller it is, the better.

Cheers, Eric

-- 
	Evgeniy Polyakov
RE: PATCH zero-copy send completion callback
> > Also, (please correct me if I'm wrong) I didn't think this would push
> > the allocation over to the next entry in 'malloc_sizes'.
>
> Well, skbuff heads are allocated from dedicated kmem_caches
> (skbuff_fclone_cache and skbuff_head_cache), and these caches are not
> constrained by the sizes available in malloc_sizes. Their sizes are a
> multiple of the L1 cache line size, which is 64 bytes on most common
> machines.

Indeed, struct sk_buff is allocated that way. But I added the callback
pointers to struct skb_shared_info, where the page pointers are stored, and
that struct is allocated along with the packet header using kmalloc.

> Even if your two-pointer addition (16 bytes on x86_64) doesn't cross a
> 64-byte line (I didn't check), the pointers are going to be set to NULL
> each time an skbuff is allocated, and checked against NULL each time an
> skbuff is destroyed.

Indeed. Do you think that's significant?

Cheers, Eric

---------------------------------------------------
|Eric Barton       Barton Software                 |
|9 York Gardens    Tel:    +44 (117) 330 1575      |
|Clifton           Mobile: +44 (7909) 680 356      |
|Bristol BS8 4LL   Fax:    call first              |
|United Kingdom    E-Mail: [EMAIL PROTECTED]       |
---------------------------------------------------
RE: PATCH zero-copy send completion callback
> In addition to that I'm pretty sure I remember that some clusterfs person
> already posted these patches a while ago and got ripped apart in the same
> way.

Yes - unfortunately I didn't submit my patch personally. And I've rewritten
it since then to avoid the obvious criticisms. This time around, I find the
comments much more to the point.

Cheers, Eric
RE: PATCH zero-copy send completion callback
Evgeniy,

> You can use the existing skb destructor; the appropriate reference counter
> is already there. In your own destructor you need to call the old one of
> course, and its type can be determined from analysis of the headers and
> the skb itself (there are not that many destructor types, actually). If
> that level of abstraction is not enough, it is possible to change
> skb_release_data()/__kfree_skb() so that it would be possible in
> skb->destructor() to determine whether the attached pages will be freed
> or not.

Yes, absolutely. My first thought was to use the skbuff destructor, but I
was paranoid I might screw up the destructor stacking. Maybe I should have
been braver? Since the callback descriptor needs to track the pages in
skb_shinfo() rather than the skbuff itself, it seemed natural to make
skb_release_data() the trigger.

> Existing sendfile() implementation is synchronous, it does not require
> async callback.

Is it not true that you cannot know when it is safe to overwrite pages sent
in this way?

> skbs are allocated from own cache, and the smaller it is, the better.

As I mentioned in another reply, skbs are indeed allocated from their own
cache, but skb_shinfo() is allocated contiguously with the packet header
using kmalloc.

-- 
Cheers, Eric
Re: PATCH zero-copy send completion callback
On Tue, Oct 17, 2006 at 01:50:04PM +0100, Eric Barton ([EMAIL PROTECTED]) wrote:

> Yes, absolutely. My first thought was to use the skbuff destructor, but I
> was paranoid I might screw up the destructor stacking. Maybe I should
> have been braver?

It depends on the results' quality...

> Since the callback descriptor needs to track the pages in skb_shinfo()
> rather than the skbuff itself, it seemed natural to make
> skb_release_data() the trigger.
>
> > Existing sendfile() implementation is synchronous, it does not require
> > async callback.
>
> Is it not true that you cannot know when it is safe to overwrite pages
> sent in this way?

There are tricks all over the place in sendfile. The first one is the
sendpage() implementation, which copies the data if the hardware does not
support checksumming and scatter-gather; simultaneous writing is protected
in the higher layer (check do_generic_mapping_read()). We do not care about
'later' writing, i.e. while the skb was in some queue on the local machine,
since new data will be transferred in that case. Truncation is also
protected by the fact that the page's reference counter is increased, so
the same page cannot be freed and reused. It was a design decision not to
care about page overwrites (and thus no page locking) - either smart
hardware transfers the new data, or we copy and send the old data.

> As I mentioned in another reply, skbs are indeed allocated from their own
> cache, but skb_shinfo() is allocated contiguously with the packet header
> using kmalloc.

Yes, the skb itself is not touched. You probably saw a lot of discussions
about problems with e1000 hardware, memory fragmentation and jumbo frames.
Since skb_shared_info is added to the actual data, it frequently forces
next-order allocations, so one of the proposed solutions is to put
skb_shared_info into a separate allocation in some cases. Although those
discussions are sleeping right now, the problem still exists, and if your
current needs can be handled within the existing interfaces, that should be
tried first.

-- 
	Evgeniy Polyakov
Re: PATCH zero-copy send completion callback
From: Eric Barton [EMAIL PROTECTED]
Date: Tue, 17 Oct 2006 13:23:10 +0100

> > Even if your two-pointer addition (16 bytes on x86_64) doesn't cross a
> > 64-byte line (I didn't check), the pointers are going to be set to NULL
> > each time an skbuff is allocated, and checked against NULL each time an
> > skbuff is destroyed.
>
> Indeed. Do you think that's significant?

On a machine routing a million packets per second, it definitely is. It is
the most crucial data structure for performance in all of the networking.
PATCH zero-copy send completion callback
This patch has been used with the lustre cluster file system (www.lustre.org)
to give notification when page buffers used to send bulk data via TCP/IP may
be overwritten. It implements...

a) A general-purpose callback to inform higher-level protocols when a
   zero-copy send of a set of pages has completed.

b) tcp_sendpage_zccd(), a variation on tcp_sendpage() that includes a
   completion callback parameter.

How to use it (you are a higher-level protocol driver)...

a) Initialise a zero-copy descriptor with your callback procedure.

b) Pass this descriptor in all zero-copy sends for an arbitrary set of
   pages. Skbuffs that reference your pages also take a reference on your
   zero-copy callback descriptor. They release this reference when they
   release their page references.

c) Release your own reference when you've posted all your pages and you're
   ready for the callback.

d) The callback occurs when the last reference is dropped.

This patch applies on branch 'master' of
git://kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 85577a4..4afaef1 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -129,6 +129,36 @@ struct skb_frag_struct {
 	__u16 size;
 };
 
+/* Zero Copy Callback Descriptor
+ * This struct supports receiving notification when zero-copy network I/O
+ * has completed. The ZCCD can be embedded in a struct containing the state
+ * of a zero-copy network send. Every skbuff that references that send's
+ * pages also keeps a reference on the ZCCD. When they have all been
+ * disposed of, the reference count on the ZCCD drops to zero and the
+ * callback is made, telling the original caller that the pages may now be
+ * overwritten. */
+struct zccd
+{
+	atomic_t	zccd_refcount;
+	void		(*zccd_callback)(struct zccd *);
+};
+
+static inline void zccd_init(struct zccd *d, void (*callback)(struct zccd *))
+{
+	atomic_set(&d->zccd_refcount, 1);
+	d->zccd_callback = callback;
+}
+
+static inline void zccd_incref(struct zccd *d)	/* take a reference */
+{
+	atomic_inc(&d->zccd_refcount);
+}
+
+static inline void zccd_decref(struct zccd *d)	/* release a reference */
+{
+	if (atomic_dec_and_test(&d->zccd_refcount))
+		(d->zccd_callback)(d);
+}
+
 /* This data is invariant across clones and lives at
  * the end of the header data, ie. at skb->end.
  */
@@ -141,6 +171,11 @@ struct skb_shared_info {
 	unsigned short	gso_type;
 	unsigned int	ip6_frag_id;
 	struct sk_buff	*frag_list;
+	struct zccd	*zccd1;
+	struct zccd	*zccd2;
+	/* NB zero-copy data is normally whole pages. We have 2 zccds in an
+	 * skbuff so we don't unnecessarily split the packet where pages
+	 * fall into the same packet. */
 	skb_frag_t	frags[MAX_SKB_FRAGS];
 };
 
@@ -1311,6 +1346,23 @@ #ifdef CONFIG_HIGHMEM
 #endif
 }
 
+/* This skbuff has dropped its pages: drop refs on any zero-copy callback
+ * descriptors it has. */
+static inline void skb_complete_zccd(struct sk_buff *skb)
+{
+	struct skb_shared_info *info = skb_shinfo(skb);
+
+	if (info->zccd1 != NULL) {
+		zccd_decref(info->zccd1);
+		info->zccd1 = NULL;
+	}
+
+	if (info->zccd2 != NULL) {
+		zccd_decref(info->zccd2);
+		info->zccd2 = NULL;
+	}
+}
+
 #define skb_queue_walk(queue, skb) \
 		for (skb = (queue)->next; \
 		     prefetch(skb->next), (skb != (struct sk_buff *)(queue)); \
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 7a093d0..e02b55f 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -278,6 +278,8 @@ extern int tcp_v4_tw_remember_stam
 extern int tcp_sendmsg(struct kiocb *iocb, struct sock *sk,
 		       struct msghdr *msg, size_t size);
 extern ssize_t tcp_sendpage(struct socket *sock, struct page *page,
 			    int offset, size_t size, int flags);
+extern ssize_t tcp_sendpage_zccd(struct socket *sock, struct page *page,
+				 int offset, size_t size, int flags,
+				 struct zccd *zccd);
 
 extern int tcp_ioctl(struct sock *sk, int cmd,
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 3c23760..a1d2ed0 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -177,6 +177,8 @@ struct sk_buff *__alloc_skb(unsigned int
 	shinfo->gso_type = 0;
 	shinfo->ip6_frag_id = 0;
 	shinfo->frag_list = NULL;
+	shinfo->zccd1 = NULL;
+	shinfo->zccd2 = NULL;
 
 	if (fclone) {
 		struct sk_buff *child = skb + 1;
@@
RE: PATCH zero-copy send completion callback
David,

> Also, the correct mailing list to get to the networking developers is
> [EMAIL PROTECTED] linux-net is for users.

Noted.

> Finally, I very much doubt you have much chance getting this change in,
> the infrastructure is implemented in a very ad-hoc fashion and it takes
> into consideration none of the potential other users of such a thing.

Are you referring to the absence of a callback argument other than the
callback descriptor itself? It seemed natural to me to contain the
descriptor in whatever state the higher-level protocol associates with the
message it's sending, and to derive this from the descriptor address in the
callback. If this isn't what you mean, could you explain? I'm not at all
religious about it.

> And these days we're trying to figure out how to eliminate skbuff and
> skb_shared_info struct members whereas you're adding 16-bytes of space on
> 64-bit platforms.

Do you think the general concept of a zero-copy completion callback is
useful? If so, do you have any ideas about how to do it more economically?
It's 2 pointers rather than 1 to avoid forcing an unnecessary packet
boundary between successive zero-copy sends. But I guess that might not be
hugely significant since you're generally sending many pages when zero-copy
is needed for performance. Also, (please correct me if I'm wrong) I didn't
think this would push the allocation over to the next entry in
'malloc_sizes'.

Cheers, Eric