Re: PATCH zero-copy send completion callback

2006-10-17 Thread Eric Dumazet
On Tuesday 17 October 2006 02:53, Eric Barton wrote:
 If so, do you have any ideas about how to do it more economically?  It's 2
 pointers rather than 1 to avoid forcing an unnecessary packet boundary
 between successive zero-copy sends.  But I guess that might not be hugely
 significant since you're generally sending many pages when zero-copy is
 needed for performance.  Also, (please correct me if I'm wrong) I didn't
 think this would push the allocation over to the next entry in
 'malloc_sizes'.

Well, skbuff heads are allocated from dedicated kmem_caches 
(skbuff_fclone_cache & skbuff_head_cache), and these caches are not 
constrained by the sizes available in malloc_sizes. Their size is a multiple 
of the L1 cache line size, which is 64 bytes on most common machines.
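A quick userspace sketch of that rounding, for intuition (the 64-byte line size and the rounding rule are simplifications of what a hardware-aligned kmem_cache does, not kernel code):

```c
#include <assert.h>

#define L1_CACHE_BYTES 64UL

/* Round an object size up to a whole number of cache lines, roughly
 * what SLAB does for hardware-aligned caches like skbuff_head_cache. */
static unsigned long cache_round(unsigned long size)
{
	return (size + L1_CACHE_BYTES - 1) & ~(L1_CACHE_BYTES - 1);
}
```

Whether a 16-byte addition spills into another line depends on the slack in the last line: a 250-byte object rounds to 256 bytes, but 266 bytes rounds to 320.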

Even if your two-pointer addition (16 bytes on x86_64) doesn't cross a 64-byte 
line (I didn't check), the pointers are going to be set to NULL each time a skbuff 
is allocated, and checked against NULL each time a skbuff is destroyed.

Eric
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: PATCH zero-copy send completion callback

2006-10-17 Thread Evgeniy Polyakov
On Tue, Oct 17, 2006 at 01:53:02AM +0100, Eric Barton ([EMAIL PROTECTED]) wrote:
  And these days we're trying to figure
  out how to eliminate skbuff and skb_shared_info struct members
  whereas you're adding 16-bytes of space on 64-bit platforms.
 
 Do you think the general concept of a zero-copy completion callback is
 useful?

You can use the existing skb destructor, and an appropriate reference counter is
already there. In your own destructor you of course need to call the old one,
and its type can be determined from analysis of the headers
and the skb itself (there are not so many destructor types, actually).
If that level of abstraction is not enough, it is possible to change
skb_release_data()/__kfree_skb() so that it would be possible in
skb->destructor() to determine whether the attached pages will be freed or not.
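A userspace sketch of that destructor chaining follows; struct fake_skb and all the names here are illustrative stand-ins rather than the kernel's real sk_buff API:

```c
#include <assert.h>

/* Minimal stand-in for the parts of sk_buff that matter here. */
struct fake_skb {
	void (*destructor)(struct fake_skb *);
	void (*old_destructor)(struct fake_skb *);	/* saved original */
	int sock_released;	/* set by the "old" destructor */
	int pages_completed;	/* set by our zero-copy destructor */
};

/* Whatever destructor the stack installed before us (e.g. sock_wfree). */
static void original_destructor(struct fake_skb *skb)
{
	skb->sock_released = 1;
}

/* Our destructor: signal zero-copy completion, then chain to the old one. */
static void zerocopy_destructor(struct fake_skb *skb)
{
	skb->pages_completed = 1;
	if (skb->old_destructor)
		skb->old_destructor(skb);	/* "call the old one", as above */
}

/* Install our destructor, remembering whatever was there before. */
static void hook_destructor(struct fake_skb *skb)
{
	skb->old_destructor = skb->destructor;
	skb->destructor = zerocopy_destructor;
}
```

When the stack finally frees the skb and invokes its destructor, both the zero-copy completion and the original accounting run.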

 If so, do you have any ideas about how to do it more economically?  It's 2
 pointers rather than 1 to avoid forcing an unnecessary packet boundary
 between successive zero-copy sends.  But I guess that might not be hugely
 significant since you're generally sending many pages when zero-copy is

The existing sendfile() implementation is synchronous, it does not require
an async callback. It looks like lustre sets a number of pages to be sent
asynchronously and reports to the user that everything is ok, and when the
appropriate callback is invoked, it updates its metadata? Fair enough,
it looks similar to the VFS cache in the case of a usual write.

 needed for performance.  Also, (please correct me if I'm wrong) I didn't
 think this would push the allocation over to the next entry in
 'malloc_sizes'.

skbs are allocated from their own cache, and the smaller it is, the better.

 Cheers,
 Eric

-- 
Evgeniy Polyakov


RE: PATCH zero-copy send completion callback

2006-10-17 Thread Eric Barton
  Also, (please correct me if I'm wrong) I didn't
  think this would push the allocation over to the next entry in
  'malloc_sizes'.
 
 Well, skbuff heads are allocated from dedicated kmem_caches 
 (skbuff_fclone_cache & skbuff_head_cache), and these caches are not 
 constrained by the sizes available in malloc_sizes. Their size is a 
 multiple of the L1 cache line size, which is 64 bytes on most common 
 machines.

Indeed, struct sk_buff is allocated that way.  But I added the callback
pointers to struct skb_shared_info, where the page pointers are stored,
and this struct is allocated along with the packet header using kmalloc.

 Even if your two-pointer addition (16 bytes on x86_64) doesn't cross 
 a 64-byte line (I didn't check), the pointers are going to be set to 
 NULL each time a skbuff is allocated, and checked against NULL each 
 time a skbuff is destroyed.

Indeed.  Do you think that's significant?

Cheers,
Eric

 ---------------------------------------------------
|Eric Barton        Barton Software                 |
|9 York Gardens     Tel:    +44 (117) 330 1575      |
|Clifton            Mobile: +44 (7909) 680 356      |
|Bristol BS8 4LL    Fax:    call first              |
|United Kingdom     E-Mail: [EMAIL PROTECTED]       |
 ---------------------------------------------------




RE: PATCH zero-copy send completion callback

2006-10-17 Thread Eric Barton
 In addition to that I'm pretty sure I remember that some clusterfs
 person already posted these patches a while ago and got ripped apart
 in the same way.

Yes - unfortunately I didn't submit my patch personally.  And I've
rewritten it since then to avoid the obvious criticisms.  This time
around, I find the comments much more to the point.

Cheers,
Eric




RE: PATCH zero-copy send completion callback

2006-10-17 Thread Eric Barton
Evgeniy,

 You can use the existing skb destructor, and an appropriate reference
 counter is already there. In your own destructor you of course need
 to call the old one, and its type can be determined from analysis of
 the headers and the skb itself (there are not so many destructor
 types, actually).  If that level of abstraction is not enough, it is
 possible to change skb_release_data()/__kfree_skb() so that it would
 be possible in skb->destructor() to determine whether the attached
 pages will be freed or not.

Yes, absolutely.  My first thought was to use the skb destructor,
but I was paranoid I might screw up the destructor stacking.
Maybe I should have been braver?

Since the callback descriptor needs to track the pages in
skb_shinfo() rather than the skb itself, it seemed natural
to make skb_release_data() the trigger.

 Existing sendfile() implementation is synchronous, it does not
 require async callback. 

Is it not true that you cannot know when it is safe to overwrite
pages sent in this way?

 skbs are allocated from own cache, and the smaller it is, the better.

As I mentioned in another reply, skbs are indeed allocated from
their own cache, but skb_shinfo() is allocated contiguously with
the packet header using kmalloc.
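For intuition about the malloc_sizes point, here is a toy bucket model; real 2.6 malloc_sizes also includes 96- and 192-byte caches, so treat this strictly as a sketch:

```c
#include <assert.h>

/* Toy model: kmalloc satisfies a request from the smallest bucket that
 * fits.  Whether adding 16 bytes of zccd pointers to skb_shared_info
 * pushes the combined header+shinfo allocation into the next bucket
 * depends on where the total already sits. */
static unsigned long kmalloc_bucket(unsigned long size)
{
	unsigned long bucket = 32;

	while (bucket < size)
		bucket <<= 1;
	return bucket;
}
```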

-- 

Cheers,
Eric




Re: PATCH zero-copy send completion callback

2006-10-17 Thread Evgeniy Polyakov
On Tue, Oct 17, 2006 at 01:50:04PM +0100, Eric Barton ([EMAIL PROTECTED]) wrote:
 Evgeniy,
 
  You can use the existing skb destructor, and an appropriate reference
  counter is already there. In your own destructor you of course need
  to call the old one, and its type can be determined from analysis of
  the headers and the skb itself (there are not so many destructor
  types, actually).  If that level of abstraction is not enough, it is
  possible to change skb_release_data()/__kfree_skb() so that it would
  be possible in skb->destructor() to determine whether the attached
  pages will be freed or not.
 
 Yes, absolutely.  My first thought was to use the skb destructor,
 but I was paranoid I might screw up the destructor stacking.
 Maybe I should have been braver?

It depends on the results quality...

 Since the callback descriptor needs to track the pages in
 skb_shinfo() rather than the skb itself, it seemed natural
 to make skb_release_data() the trigger.
 
  Existing sendfile() implementation is synchronous, it does not
  require async callback. 
 
 Is it not true that you cannot know when it is safe to overwrite
 pages sent in this way?

There are tricks all over the place in sendfile. The first one is the sendpage()
implementation, which copies data if the hardware does not support
checksumming and scatter-gather, and simultaneous writing is protected at
the higher layer (check do_generic_mapping_read()). We do not care about
'later' writing, i.e. while the skb was in some queue on the local machine,
since new data will be transferred in that case.
Truncation is also protected by the fact that the page's reference counter
is increased, so the same page cannot be freed and reused.

It was a design decision not to care about page overwrites (and thus no
page locking) - either smart hardware transfers new data, or we do a copy 
and send old data.
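The refcount protection against truncation can be modelled in a few lines of userspace C; get_page()/put_page() here mimic the kernel names but act on a toy struct:

```c
#include <assert.h>

/* Toy page: freed only when the last reference is dropped. */
struct fake_page {
	int refcount;
	int freed;
};

static void get_page(struct fake_page *p)
{
	p->refcount++;
}

static void put_page(struct fake_page *p)
{
	assert(p->refcount > 0);
	if (--p->refcount == 0)
		p->freed = 1;	/* the real kernel returns it to the allocator */
}
```

An in-flight zero-copy skb holds its own reference, so truncation cannot free and reuse the page underneath it.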

  skbs are allocated from own cache, and the smaller it is, the better.
 
 As I mentioned in another reply, skbs are indeed allocated from
 their own cache, but skb_shinfo() is allocated contiguously with
 the packet header using kmalloc.

Yes, skb itself is not touched.

You probably saw a lot of discussions about problems with e1000
hardware, memory fragmentation and jumbo frames.
Since skb_shared_info is added to the actual data, it frequently forces
next-order allocations, so one of the solutions is to put skb_shared_info
into a separate allocation in some cases. Although those discussions are
sleeping right now, the problem still exists, and if your current needs can
be handled within existing interfaces, that should be tried first.

 -- 
 
 Cheers,
 Eric
 

-- 
Evgeniy Polyakov


Re: PATCH zero-copy send completion callback

2006-10-17 Thread David Miller
From: Eric Barton [EMAIL PROTECTED]
Date: Tue, 17 Oct 2006 13:23:10 +0100

  Even if your two-pointer addition (16 bytes on x86_64) doesn't cross 
  a 64-byte line (I didn't check), the pointers are going to be set to 
  NULL each time a skbuff is allocated, and checked against NULL each 
  time a skbuff is destroyed.
 
 Indeed.  Do you think that's significant?

On a machine routing a million packets per second, it
definitely is.  It is the most crucial data structure
for performance in all of networking.


PATCH zero-copy send completion callback

2006-10-16 Thread Eric Barton

This patch has been used with the lustre cluster file system (www.lustre.org)
to give notification when page buffers used to send bulk data via TCP/IP may be
overwritten.  It implements...

  a) A general-purpose callback to inform higher-level protocols when a
 zero-copy send of a set of pages has completed.

  b) tcp_sendpage_zccd(), a variation on tcp_sendpage() that includes a
 completion callback parameter.

How to use it (you are a higher-level protocol driver)...

  a) Initialise a zero-copy descriptor with your callback procedure.

  b) Pass this descriptor in all zero-copy sends for an arbitrary set of pages.
 Skbuffs that reference your pages also take a reference on your zero-copy
 callback descriptor.  They release this reference when they release their
 page references.

  c) Release your own reference when you've posted all your pages and you're
 ready for the callback.

  d) The callback occurs when the last reference is dropped.
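A rough userspace illustration of steps (a) through (d), using a plain int where the kernel patch uses atomic_t, and with toy names throughout:

```c
#include <assert.h>

/* Userspace sketch of the zero-copy completion descriptor lifecycle. */
struct zccd {
	int refcount;
	void (*callback)(struct zccd *);
};

static void zccd_init(struct zccd *d, void (*cb)(struct zccd *))
{
	d->refcount = 1;	/* the sender's own reference (step a) */
	d->callback = cb;
}

static void zccd_incref(struct zccd *d)
{
	d->refcount++;		/* each referencing skbuff (step b) */
}

static void zccd_decref(struct zccd *d)
{
	if (--d->refcount == 0)
		d->callback(d);	/* last reference gone (step d) */
}

/* The zccd is embedded in the sender's per-message state, so the
 * callback can recover that state from the descriptor address. */
struct bulk_send {
	struct zccd zccd;	/* must be first for the cast below */
	int done;
};

static void send_complete(struct zccd *d)
{
	struct bulk_send *s = (struct bulk_send *)d;

	s->done = 1;		/* pages may now be overwritten */
}
```

The callback fires only after both the sender and every skbuff referencing the pages have dropped their references.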


This patch applies on branch 'master' of
git://kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6


diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 85577a4..4afaef1 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -129,6 +129,36 @@ struct skb_frag_struct {
__u16 size;
 };
 
+/* Zero Copy Callback Descriptor
+ * This struct supports receiving notification when zero-copy network I/O has
+ * completed.  The ZCCD can be embedded in a struct containing the state of a
+ * zero-copy network send.  Every skbuff that references that send's pages also
+ * keeps a reference on the ZCCD.  When they have all been disposed of, the
+ * reference count on the ZCCD drops to zero and the callback is made, telling
+ * the original caller that the pages may now be overwritten. */
+struct zccd 
+{
+	atomic_t	zccd_refcount;
+	void		(*zccd_callback)(struct zccd *);
+};
+
+static inline void zccd_init (struct zccd *d, void (*callback)(struct zccd *))
+{
+	atomic_set (&d->zccd_refcount, 1);
+	d->zccd_callback = callback;
+}
+
+static inline void zccd_incref (struct zccd *d)		/* take a reference */
+{
+	atomic_inc (&d->zccd_refcount);
+}
+
+static inline void zccd_decref (struct zccd *d)		/* release a reference */
+{
+	if (atomic_dec_and_test (&d->zccd_refcount))
+		(d->zccd_callback)(d);
+}
+
 /* This data is invariant across clones and lives at
  * the end of the header data, ie. at skb->end.
  */
@@ -141,6 +171,11 @@ struct skb_shared_info {
unsigned short  gso_type;
unsigned intip6_frag_id;
struct sk_buff  *frag_list;
+	struct zccd	*zccd1;
+	struct zccd	*zccd2;
+	/* NB zero-copy data is normally whole pages.  We have 2 zccds in an
+	 * skbuff so we don't unnecessarily split the packet where pages fall
+	 * into the same packet. */
skb_frag_t  frags[MAX_SKB_FRAGS];
 };
 
@@ -1311,6 +1346,23 @@ #ifdef CONFIG_HIGHMEM
 #endif
 }
 
+/* This skbuff has dropped its pages: drop refs on any zero-copy callback
+ * descriptors it has. */
+static inline void skb_complete_zccd (struct sk_buff *skb)
+{
+	struct skb_shared_info *info = skb_shinfo(skb);
+
+	if (info->zccd1 != NULL) {
+		zccd_decref(info->zccd1);
+		info->zccd1 = NULL;
+	}
+
+	if (info->zccd2 != NULL) {
+		zccd_decref(info->zccd2);
+		info->zccd2 = NULL;
+	}
+}
+
 #define skb_queue_walk(queue, skb) \
		for (skb = (queue)->next;					\
		     prefetch(skb->next), (skb != (struct sk_buff *)(queue));	\
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 7a093d0..e02b55f 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -278,6 +278,8 @@ extern int			tcp_v4_tw_remember_stam
 extern int			tcp_sendmsg(struct kiocb *iocb, struct sock *sk,
					    struct msghdr *msg, size_t size);
 extern ssize_t			tcp_sendpage(struct socket *sock, struct page *page,
					     int offset, size_t size, int flags);
+extern ssize_t			tcp_sendpage_zccd(struct socket *sock, struct page *page,
+					     int offset, size_t size,
+					     int flags, struct zccd *zccd);
 
 extern int			tcp_ioctl(struct sock *sk, int cmd,
					  unsigned long arg);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 3c23760..a1d2ed0 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -177,6 +177,8 @@ struct sk_buff *__alloc_skb(unsigned int
 	shinfo->gso_type = 0;
 	shinfo->ip6_frag_id = 0;
 	shinfo->frag_list = NULL;
+	shinfo->zccd1 = NULL;
+	shinfo->zccd2 = NULL;
 
if (fclone) {
struct sk_buff *child = skb + 1;
@@ 


RE: PATCH zero-copy send completion callback

2006-10-16 Thread Eric Barton
David,

 Also, the correct mailing list to get to the networking developers
 is [EMAIL PROTECTED]  linux-net is for users.

Noted.

 Finally, I very much doubt you have much chance getting this
 change in, the infrastructure is implemented in a very ad-hoc
 fashion and it takes into consideration none of the potential
 other users of such a thing.  

Are you referring to the absence of a callback argument other than the
callback descriptor itself?  It seemed natural to me to embed the
descriptor in whatever state the higher-level protocol associates with the
message it's sending, and to derive this state from the descriptor address
in the callback.

If this isn't what you mean, could you explain?  I'm not at all religious
about it.

 And these days we're trying to figure
 out how to eliminate skbuff and skb_shared_info struct members
 whereas you're adding 16-bytes of space on 64-bit platforms.

Do you think the general concept of a zero-copy completion callback is
useful?

If so, do you have any ideas about how to do it more economically?  It's 2
pointers rather than 1 to avoid forcing an unnecessary packet boundary
between successive zero-copy sends.  But I guess that might not be hugely
significant since you're generally sending many pages when zero-copy is
needed for performance.  Also, (please correct me if I'm wrong) I didn't
think this would push the allocation over to the next entry in
'malloc_sizes'.

Cheers,
Eric

