Re: [Kiobuf-io-devel] Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-19 Thread Christoph Hellwig

On Fri, Jan 19, 2001 at 08:05:41AM +0530, [EMAIL PROTECTED] wrote:
> Shouldn't we have an error / status field too ?

Might make sense.

Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-19 Thread Andrew Scott

On 10 Jan 2001, at 15:32, Linus Torvalds wrote:

> Latin 101. Literally "about taste no argument".

Or "about taste no argument there is" if you add the 'est', which 
still makes sense in english, in a twisted (convoluted as apposed to 
'bad' or 'sick') way.   

Q.E.D.

> I suspect that it _should_ be "De gustibus non disputandum est", but
> it's been too many years. That adds the required verb ("is") to make it
> a full sentence. 
> 
> In English: "There is no arguing taste".
> 
>   Linus


--Mailed via Pegasus 3.12c & Mercury 1.48---
[EMAIL PROTECTED]Fax (617)373-2942
Andrew Scott   Tel (617)373-5278   _
Northeastern University--138 Meserve Hall / \   /
College of Arts & Sciences-Deans Office  / \ \ /
Boston, Ma. 02115   /   \_/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-18 Thread Linus Torvalds

In article <[EMAIL PROTECTED]>,
Albert D. Cahalan <[EMAIL PROTECTED]> wrote:
>
>What about getting rid of both that and the pointer, and just
>hanging that data on the end as a variable length array?
>
>struct kiovec2{
>  int nbufs;
>  /* ... */
>  struct kiobuf[0];
>}

If the struct ends up having lots of other fields, yes.

On the other hand, if one basic form of kiobuf's ends up being really
just the array and the number of elements, there are reasons not to do
this. One is that you can "peel" off parts of the buffer, and split it
up if (for example) your driver has some limitation to the number of
scatter-gather requests it can make. For example, you may have code that
looks roughly like

.. int nr, struct kiobuf *buf ..

while (nr > MAX_SEGMENTS) {
lower_level(MAX_SEGMENTS, buf);
nr -= MAX_SEGMENTS;
buf += MAX_SEGMENTS;
}
lower_level(nr, buf);

which is rather awkward to do if you tie "nr" and the array too closely
together. 

(Of course, the driver could just split them up - take it from the
structure and pass them down in the separated manner. I don't know which
level the separation is worth doing at, but I have this feeling that if
the structure ends up being _only_ the nbufs and bufs, they should not
be tied together.)

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
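
As a concrete illustration of the splitting loop Linus sketches above, here is a
minimal, self-contained C version; struct kiobuf, lower_level() and MAX_SEGMENTS
are stand-ins for illustration only, not the real 2.4 definitions.

/* Self-contained sketch of the "peel off MAX_SEGMENTS at a time" idea
 * from the message above.  struct kiobuf and lower_level() are stand-ins
 * for illustration only, not the real kernel types. */
#include <stdio.h>

#define MAX_SEGMENTS 4

struct kiobuf {
    void *page;              /* would be struct page * in the kernel  */
    unsigned short offset;   /* offset of valid data within the page  */
    unsigned short length;   /* number of valid bytes                 */
};

/* Stand-in for a driver entry point with a scatter-gather limit. */
static void lower_level(int nr, struct kiobuf *buf)
{
    printf("submitting %d segment(s) starting at %p\n", nr, (void *)buf);
}

/* Because nr and the array are passed separately, a caller can split a
 * large vector into chunks the driver can handle. */
static void submit(int nr, struct kiobuf *buf)
{
    while (nr > MAX_SEGMENTS) {
        lower_level(MAX_SEGMENTS, buf);
        nr  -= MAX_SEGMENTS;
        buf += MAX_SEGMENTS;
    }
    lower_level(nr, buf);
}

int main(void)
{
    struct kiobuf bufs[10] = { { 0 } };

    submit(10, bufs);
    return 0;
}

The point survives intact: because the count and the array travel as two
separate arguments, the caller can peel off driver-sized chunks without
touching any containing structure.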



Re: [Kiobuf-io-devel] Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-18 Thread bsuparna


>Ok. Then we need an additional more or less generic object that is used
>for passing in a rw_kiovec file operation (and we really want that for
>many kinds of IO). It should mostly be used for communicating to the
>high-level driver.
>
>/*
> * the name is just plain stupid, but that shouldn't matter
> */
>struct vfs_kiovec {
>struct kiovec * iov;
>
>/* private data, mostly for the callback */
>void * private;
>
>/* completion callback */
>void (*end_io) (struct vfs_kiovec *);
>wait_queue_head_t wait_queue;
>};
>
>Christoph

Shouldn't we have an error / status field too ?


  Suparna Bhattacharya
  Systems Software Group, IBM Global Services, India
  E-mail : [EMAIL PROTECTED]
  Phone : 91-80-5267117, Extn : 2525


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
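
For reference, this is roughly what Christoph's proposed structure would look
like with the error/status field Suparna asks for folded in. It is only a sketch
of the suggestion under discussion, not code that was merged; wait_queue_head_t
is reduced to a stand-in so the snippet compiles on its own.

struct kiovec;                                   /* the bare nbufs+bufs container */
typedef struct { int dummy; } wait_queue_head_t; /* stand-in, not the kernel type */

struct vfs_kiovec {
    struct kiovec *iov;

    /* private data, mostly for the callback */
    void *private;

    /* completion status: 0 on success, a negative errno-style code on failure */
    int err;

    /* completion callback; may inspect err to see how the IO ended */
    void (*end_io)(struct vfs_kiovec *);
    wait_queue_head_t wait_queue;
};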



Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-18 Thread Albert D. Cahalan

>> struct kiovec2 {
>>  int nbufs;  /* Kiobufs actually referenced */
>>  int array_len;  /* Space in the allocated lists */
>>  struct kiobuf * bufs;
>
> Any reason for array_len?
>
> Why not just 
> 
>   int nbufs,
>   struct kiobuf *bufs;
>
> Remember: simplicity is a virtue. 
>
> Simplicity is also what makes it usable for people who do NOT want to have
> huge overhead.
>
>>  unsigned int locked : 1; /* If set, pages has been locked */
>
> Remove this. I don't think it's valid to lock the pages. Who wants to use
> this anyway?
>
>>  /* Always embed enough struct pages for 64k of IO */
>>  struct kiobuf * buf_array[KIO_STATIC_PAGES]; 
>
> Kill kill kill kill. 
>
> If somebody wants to embed a kiovec into their own data structure, THEY
> can decide to add their own buffers etc. A fundamental data structure
> should _never_ make assumptions like this.

What about getting rid of both that and the pointer, and just
hanging that data on the end as a variable length array?

struct kiovec2{
  int nbufs;
  /* ... */
  struct kiobuf[0];
}
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
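
A self-contained sketch of the variable-length-array layout Albert suggests, as
it would be written in plain C. Note the trailing member needs a name; C99 spells
it bufs[], while the kernel of that era used the GNU bufs[0] form. struct kiobuf
is a stand-in here, and alloc_kiovec2() is a made-up helper for illustration.

/* One allocation holds the header plus all of its kiobufs. */
#include <stdlib.h>
#include <stdio.h>

struct kiobuf {
    void *page;
    unsigned short offset;
    unsigned short length;
};

struct kiovec2 {
    int nbufs;
    /* ... other fields ... */
    struct kiobuf bufs[];    /* storage allocated together with the header */
};

static struct kiovec2 *alloc_kiovec2(int nbufs)
{
    struct kiovec2 *vec;

    vec = malloc(sizeof(*vec) + nbufs * sizeof(struct kiobuf));
    if (vec)
        vec->nbufs = nbufs;
    return vec;
}

int main(void)
{
    struct kiovec2 *vec = alloc_kiovec2(8);

    if (!vec)
        return 1;
    printf("one allocation holds header plus %d kiobufs\n", vec->nbufs);
    free(vec);
    return 0;
}

The trade-off Linus raises in his reply still applies: with the count embedded in
the same allocation as the array, you can no longer peel segments off by simply
advancing a (nbufs, bufs) pair.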



Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-18 Thread Linus Torvalds



On Thu, 18 Jan 2001, Christoph Hellwig wrote:
> > 
> > Remove this. I don't think it's valid to lock the pages. Who wants to use
> > this anyway?
> 
> E.g. in the block IO paths the pages have to be locked.
> It's also used by free_kiovec to see whether to do unlock_kiovec beforehand.

This is all MUCH higher level functionality, and probably bogus anyway.

> > That's kind of the minimal set. That should be one level of abstraction in
> > its own right. 
> 
> Ok. Then we need an additional more or less generic object that is used for
> passing in a rw_kiovec file operation (and we really want that for many kinds
> of IO). It should mostly be used for communicating to the high-level driver.

That's fine.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-18 Thread Christoph Hellwig

On Wed, Jan 17, 2001 at 05:13:31PM -0800, Linus Torvalds wrote:
> 
> 
> On Thu, 18 Jan 2001, Christoph Hellwig wrote:
> > 
> > /*
> >  * a simple page,offset,length tuple like Linus wants it
> >  */
> > struct kiobuf2 {
> > struct page *   page;   /* The page itself   */
> > u_int16_t   offset; /* Offset to start of valid data */
> > u_int16_t   length; /* Number of valid bytes of data */
> > };
> 
> Please use "u16". Or "__u16" if you want to export it to user space.

Ok.


> 
> > struct kiovec2 {
> > int nbufs;  /* Kiobufs actually referenced */
> > int array_len;  /* Space in the allocated lists */
> > struct kiobuf * bufs;
> 
> Any reason for array_len?

It's useful for the expand function - but with kiobufs as a secondary data
structure it may no longer be necessary.

> Why not just 
> 
>   int nbufs,
>   struct kiobuf *bufs;
> 
> 
> Remember: simplicity is a virtue. 
> 
> Simplicity is also what makes it usable for people who do NOT want to have
> huge overhead.
> 
> > unsigned int locked : 1; /* If set, pages has been locked */
> 
> Remove this. I don't think it's valid to lock the pages. Who wants to use
> this anyway?

E.g. in the block IO paths the pages have to be locked.
It's also used by free_kiovec to see whether to do unlock_kiovec beforehand.

> 
> > /* Always embed enough struct pages for 64k of IO */
> > struct kiobuf * buf_array[KIO_STATIC_PAGES]; 
> 
> Kill kill kill kill. 
> 
> If somebody wants to embed a kiovec into their own data structure, THEY
> can decide to add their own buffers etc. A fundamental data structure
> should _never_ make assumptions like this.

Ok.

> 
> > /* Private data */
> > void *  private;
> > 
> > /* Dynamic state for IO completion: */
> > atomic_t io_count;   /* IOs still in progress */
> 
> What is io_count used for?

In the current buffer_head based IO scheme it is used to determine whether
all bh requests are finished.  It's obsolete once we pass kiobufs to the
low-level drivers.

> 
> > int errno;
> > 
> > /* Status of completed IO */
> > void (*end_io)  (struct kiovec *); /* Completion callback */
> > wait_queue_head_t wait_queue;
> 
> I suspect all of the above ("private", "end_io" etc) should be at a higher
> layer. Not everybody will necessarily need them.
> 
> Remember: if this is to be well designed, we want to have the data
> structures to pass down to low-level drivers etc, that may not want or
> need a lot of high-level stuff. You should not pass down more than the
> driver really needs.
> 
> In the end, the only thing you _know_ a driver will need (assuming that it
> wants these kinds of buffers) is just
> 
>   int nbufs;
>   struct biobuf *bufs;
> 
> That's kind of the minimal set. That should be one level of abstraction in
> its own right. 

Ok. Then we need an additional more or less generic object that is used for
passing in a rw_kiovec file operation (and we really want that for many kinds
of IO). It should mostly be used for communicating to the high-level driver.

/*
 * the name is just plain stupid, but that shouldn't matter
 */
struct vfs_kiovec {
struct kiovec * iov;

/* private data, mostly for the callback */
void * private;

/* completion callback */
void (*end_io)  (struct vfs_kiovec *);
wait_queue_head_t wait_queue;
};

Christoph

-- 
Whip me.  Beat me.  Make me maintain AIX.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
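
To make the two-level split being converged on here concrete, a small
self-contained sketch: the low level sees only the bare (nbufs, bufs) pair, while
the vfs_kiovec wrapper carries the completion machinery. Everything below is
simplified for illustration; the wait queue is modeled as a plain flag and the
function names are made up.

#include <stdio.h>

struct kiobuf { void *page; unsigned short offset, length; };

struct kiovec {                 /* the minimal low-level object          */
    int nbufs;
    struct kiobuf *bufs;
};

struct vfs_kiovec {             /* the higher-level wrapper              */
    struct kiovec *iov;
    void *private;              /* private data, mostly for the callback */
    void (*end_io)(struct vfs_kiovec *);
    int done;                   /* stand-in for wait_queue_head_t        */
};

/* Low-level "driver": knows nothing about callbacks or wait queues. */
static void low_level_rw(int nbufs, struct kiobuf *bufs)
{
    printf("low level got %d segment(s) at %p\n", nbufs, (void *)bufs);
}

/* Higher-level path: submits the bare vector, then signals completion. */
static void rw_kiovec(struct vfs_kiovec *v)
{
    low_level_rw(v->iov->nbufs, v->iov->bufs);
    if (v->end_io)
        v->end_io(v);           /* completion callback */
}

static void my_end_io(struct vfs_kiovec *v)
{
    v->done = 1;
}

int main(void)
{
    struct kiobuf bufs[2] = { { 0 } };
    struct kiovec iov = { 2, bufs };
    struct vfs_kiovec v = { &iov, NULL, my_end_io, 0 };

    rw_kiovec(&v);
    printf("done = %d\n", v.done);
    return 0;
}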



Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-17 Thread Linus Torvalds



On Thu, 18 Jan 2001, Christoph Hellwig wrote:
> 
> /*
>  * a simple page,offset,length tuple like Linus wants it
>  */
> struct kiobuf2 {
>   struct page *   page;   /* The page itself   */
>   u_int16_t   offset; /* Offset to start of valid data */
>   u_int16_t   length; /* Number of valid bytes of data */
> };

Please use "u16". Or "__u16" if you want to export it to user space.

> struct kiovec2 {
>   int nbufs;  /* Kiobufs actually referenced */
>   int array_len;  /* Space in the allocated lists */
>   struct kiobuf * bufs;

Any reason for array_len?

Why not just 

int nbufs,
struct kiobuf *bufs;


Remember: simplicity is a virtue. 

Simplicity is also what makes it usable for people who do NOT want to have
huge overhead.

>   unsigned int locked : 1; /* If set, pages has been locked */

Remove this. I don't think it's valid to lock the pages. Who wants to use
this anyway?

>   /* Always embed enough struct pages for 64k of IO */
>   struct kiobuf * buf_array[KIO_STATIC_PAGES]; 

Kill kill kill kill. 

If somebody wants to embed a kiovec into their own data structure, THEY
can decide to add their own buffers etc. A fundamental data structure
should _never_ make assumptions like this.

>   /* Private data */
>   void *  private;
>   
>   /* Dynamic state for IO completion: */
>   atomic_t io_count;   /* IOs still in progress */

What is io_count used for?

>   int errno;
> 
>   /* Status of completed IO */
>   void (*end_io)  (struct kiovec *); /* Completion callback */
>   wait_queue_head_t wait_queue;

I suspect all of the above ("private", "end_io" etc) should be at a higher
layer. Not everybody will necessarily need them.

Remember: if this is to be well designed, we want to have the data
structures to pass down to low-level drivers etc, that may not want or
need a lot of high-level stuff. You should not pass down more than the
driver really needs.

In the end, the only thing you _know_ a driver will need (assuming that it
wants these kinds of buffers) is just

int nbufs;
struct biobuf *bufs;

That's kind of the minimal set. That should be one level of abstraction in
its own right. 

Never over-design. Never think "Hmm, maybe somebody would find this
useful". Start from what you know people _have_ to have, and try to make
that set smaller. When you can make it no smaller, you've reached one
point. That's a good point to start from - use that for some real
implementation.

Once you've gotten that far, you can see how well you can embed the lower
layers into higher layers. That does _not_ mean that the lower layers
should know about the high-level data structures. Try to avoid pushing
down abstractions too far. Maybe you'll want to push down the error code.
But maybe not. And you should NOT link the callback with the vector of
IO's: you may find (in fact, I bet you _will_ find), that the lowest level
will want a callback to call up to when it is ready, and that layer may
want _another_ callback to call up to higher levels.

Imagine, for example, the network driver telling the IP layer that "ok,
packet sent". That's _NOT_ the same callback as the TCP layer telling the
upper layers that the packet data has been sent and successfully
acknowledged, and that the data structures can be free'd now. They are at
two completely different levels of abstraction, and one level needing
something doesn't mean that the other level should necessarily even care.

Don't imagine that everybody wants the same data structure, and that that
data structure should thus be very generic. Genericity kills good ideas.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
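
A compact, self-contained model of the "different callbacks at different layers"
point: the driver calls back whoever submitted to it, and that layer keeps its
own, separate completion notion for the levels above. All names and types below
are illustrative, not kernel interfaces.

#include <stdio.h>

/* Lowest level: a "driver" request with its own completion callback. */
struct ll_request {
    void (*complete)(struct ll_request *);   /* e.g. "packet sent on the wire" */
    void *owner;                             /* opaque to this layer           */
};

static void driver_xmit(struct ll_request *req)
{
    /* ... hardware finishes the transmit ... */
    req->complete(req);
}

/* Middle level ("TCP"): has its own idea of completion - the data is only
 * done when it has been acknowledged, which happens later and separately. */
struct ul_buffer {
    void (*done)(struct ul_buffer *);        /* "acked, safe to free/reuse" */
    struct ll_request ll;
    int sent;
};

static void tcp_tx_complete(struct ll_request *req)
{
    struct ul_buffer *buf = req->owner;

    printf("driver: packet sent\n");
    buf->sent = 1;               /* note it, but do NOT tell the upper layer yet */
}

static void tcp_ack_received(struct ul_buffer *buf)
{
    printf("tcp: data acknowledged\n");
    buf->done(buf);              /* only now is the upper-level callback due */
}

static void app_done(struct ul_buffer *buf)
{
    printf("upper layer: buffer %p reusable\n", (void *)buf);
}

int main(void)
{
    struct ul_buffer buf = { app_done, { tcp_tx_complete, &buf }, 0 };

    driver_xmit(&buf.ll);        /* first completion, at the driver level  */
    tcp_ack_received(&buf);      /* second, independent completion, later  */
    return 0;
}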



Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-17 Thread Christoph Hellwig

On Thu, Jan 18, 2001 at 01:05:43AM +1100, Rik van Riel wrote:
> On Wed, 10 Jan 2001, Christoph Hellwig wrote:
> 
> > Simple.  Because I stated before that I DON'T even want the
> > networking to use kiobufs in lower layers.  My whole argument is
> > to pass a kiovec into the fileop instead of a page, because it
> > makes sense for other drivers to use multiple pages,
> 
> Now wouldn't it be great if we had one type of data
> structure that would work for both the network layer
> and the block layer (and v4l, ...)  ?

Sure it would be nice, and IIRC that was what the kiobuf stuff was
designed for.  But it looks like it doesn't do well for the networking
(and maybe other) guys.

That means we have to find something that might be worth paying a little
overhead for in all layers, but that on the other hand is usable everywhere.

So after the last flame^H^H^H^H^Hthread I've come up with the
following structures:

/*
 * a simple page,offset,length tuple like Linus wants it
 */
struct kiobuf2 {
struct page *   page;   /* The page itself   */
u_int16_t   offset; /* Offset to start of valid data */
u_int16_t   length; /* Number of valid bytes of data */
};

/*
 * A container for the tuples - it is actually pretty similar to old
 * kiobuf, but on the other hand allows SG
 */
struct kiovec2 {
int nbufs;  /* Kiobufs actually referenced */
int array_len;  /* Space in the allocated lists */

struct kiobuf * bufs;

unsigned int locked : 1; /* If set, pages has been locked */

/* Always embed enough struct pages for 64k of IO */
struct kiobuf * buf_array[KIO_STATIC_PAGES]; 

/* Private data */
void *  private;

/* Dynamic state for IO completion: */
atomic_t io_count;   /* IOs still in progress */
int errno;

/* Status of completed IO */
void (*end_io)  (struct kiovec *); /* Completion callback */
wait_queue_head_t wait_queue;
};


We don't need the page-length/offset in the usual block-io path, but on
the other hand, if we get a common interface for it...

Christoph

-- 
Whip me.  Beat me.  Make me maintain AIX.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
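
To show what the proposed (page, offset, length) tuples buy over whole-page
descriptors, here is a self-contained sketch that describes a buffer starting
part-way into its first page. struct page and fill_kiobufs() are stand-ins and a
4096-byte page is assumed; this illustrates the layout only, not kernel code.

#include <stdio.h>

#define PAGE_SIZE 4096u

struct page { int dummy; };

struct kiobuf2 {
    struct page *page;        /* the page itself               */
    unsigned short offset;    /* offset to start of valid data */
    unsigned short length;    /* number of valid bytes of data */
};

/* Describe 'len' bytes starting 'off' bytes into the first page. */
static int fill_kiobufs(struct kiobuf2 *kb, struct page **pages,
                        unsigned int off, unsigned int len)
{
    int n = 0;

    while (len) {
        unsigned int chunk = PAGE_SIZE - off;

        if (chunk > len)
            chunk = len;
        kb[n].page   = pages[n];
        kb[n].offset = (unsigned short)off;
        kb[n].length = (unsigned short)chunk;
        len -= chunk;
        off  = 0;              /* only the first page has an offset */
        n++;
    }
    return n;                  /* number of tuples used */
}

int main(void)
{
    struct page pg[3];
    struct page *pages[3] = { &pg[0], &pg[1], &pg[2] };
    struct kiobuf2 kb[3];
    int i, n = fill_kiobufs(kb, pages, 1000, 6000);

    for (i = 0; i < n; i++)
        printf("tuple %d: offset %hu length %hu\n", i, kb[i].offset, kb[i].length);
    return 0;
}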



Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-17 Thread Rik van Riel

On Wed, 10 Jan 2001, Christoph Hellwig wrote:

> Simple.  Because I stated before that I DON'T even want the
> networking to use kiobufs in lower layers.  My whole argument is
> to pass a kiovec into the fileop instead of a page, because it
> makes sense for other drivers to use multiple pages,

Now wouldn't it be great if we had one type of data
structure that would work for both the network layer
and the block layer (and v4l, ...)  ?

If we constantly need to convert between zerocopy
metadata types, I'm sure we'll lose most of the performance
gain we started this whole idea for in the first place.

cheers,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/
http://www.conectiva.com/   http://distro.conectiva.com.br/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-16 Thread Rik van Riel

On Tue, 9 Jan 2001, Andrea Arcangeli wrote:

> BTW, I noticed what is left in blk-13B seems to be my work

Yeah yeah, we'll buy you beer at the next conference... ;)

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/
http://www.conectiva.com/   http://distro.conectiva.com.br/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-13 Thread yodaiken


FWIW: POSIX mq_send does not promise that the buffer is safe, it only
promises that the message is queued. Interesting interface.



On Wed, Jan 10, 2001 at 09:41:24AM +0100, Manfred Spraul wrote:
> > > In user space, how do you know when it's safe to reuse the buffer that 
> > > was handed to sendmsg() with the MSG_NOCOPY flag? Or does sendmsg() 
> > > with that flag block until the buffer isn't needed by the kernel any 
> > > more? If it does block, doesn't that defeat the use of non-blocking 
> > > I/O? 
> > 
> > sendmsg() marks those pages COW and copies the original page into a new 
> > one for further usage. (the old page is used until the packet is 
> > released.) So for maximum performance user-space should not reuse such 
> > buffers immediately. 
> >
> That means sendmsg() changes the page tables? I measured
> smp_call_function on my Dual Pentium 350, and it took around 1950 cpu
> ticks.
> I'm sure that for an 8 way server the total lost time on all cpus (multi
> threaded server) is larger than the time required to copy the complete
> page.
> (I've attached my patch, just run "insmod dummy p_shift=0")
> 
> 
> --
>   Manfred
> --- 2.4/drivers/net/dummy.c   Mon Dec  4 02:45:22 2000
> +++ build-2.4/drivers/net/dummy.c Wed Jan 10 09:15:20 2001
> @@ -95,9 +95,168 @@
>  
>  static struct net_device dev_dummy;
>  
> +/* * */
> +int p_shift = -1;
> +MODULE_PARM (p_shift, "1i");
> +MODULE_PARM_DESC(p_shift, "Shift for the profile buffer");
> +
> +int p_size = 0;
> +MODULE_PARM (p_size, "1i");
> +MODULE_PARM_DESC(p_size, "size");
> +
> +
> +#define STAT_TABLELEN  16384
> +static unsigned long totals[STAT_TABLELEN];
> +static unsigned int overflows;
> +
> +static unsigned long long stime;
> +static void start_measure(void)
> +{
> +  __asm__ __volatile__ (
> + ".align 64\n\t"
> + "pushal\n\t"
> + "cpuid\n\t"
> + "popal\n\t"
> + "rdtsc\n\t"
> + "movl %%eax,(%0)\n\t"
> + "movl %%edx,4(%0)\n\t"
> + : /* no output */
> + : "c"()
> + : "eax", "edx", "memory" );
> +}
> +
> +static void end_measure(void)
> +{
> +static unsigned long long etime;
> + __asm__ __volatile__ (
> + "pushal\n\t"
> + "cpuid\n\t"
> + "popal\n\t"
> + "rdtsc\n\t"
> + "movl %%eax,(%0)\n\t"
> + "movl %%edx,4(%0)\n\t"
> + : /* no output */
> + : "c"()
> + : "eax", "edx", "memory" );
> + {
> + unsigned long time = (unsigned long)(etime-stime);
> + time >>= p_shift;
> + if(time < STAT_TABLELEN) {
> + totals[time]++;
> + } else {
> + overflows++;
> + }
> + }
> +}
> +
> +static void clean_buf(void)
> +{
> + memset(totals,0,sizeof(totals));
> + overflows = 0;
> +}
> +
> +static void print_line(unsigned long* array)
> +{
> + int i;
> + for(i=0;i<32;i++) {
> + if((i%32)==16)
> + printk(":");
> + printk("%lx ",array[i]); 
> + }
> +}
> +
> +static void print_buf(char* caption)
> +{
> + int i, other = 0;
> + printk("Results - %s - shift %d",
> + caption, p_shift);
> +
> + for(i=0;i<STAT_TABLELEN;i+=32) {
> + int j;
> + int local = 0;
> + for(j=0;j<32;j++)
> + local += totals[i+j];
> +
> + if(local) {
> + printk("\n%3x: ",i);
> + print_line(&totals[i]);
> + other += local;
> + }
> + }
> + printk("\nOverflows: %d.\n",
> + overflows);
> + printk("Sum: %ld\n",other+overflows);
> +}
> +
> +static void return_immediately(void* dummy)
> +{
> + return;
> +}
> +
> +static void just_one_page(void* dummy)
> +{
> + __flush_tlb_one(0x12345678);
> + return;
> +}
> +
> +
>  static int __init dummy_init_module(void)
>  {
>   int err;
> +
> + if(p_shift != -1) {
> + int i;
> + void* p;
> + kmem_cache_t* cachep;
> + /* empty test measurement: */
> + printk(" kernel cpu benchmark started **\n");
> + clean_buf();
> + set_current_state(TASK_UNINTERRUPTIBLE);
> + schedule_timeout(200);
> + for(i=0;i<100;i++) {
> + start_measure();
> + return_immediately(NULL);
> + return_immediately(NULL);
> + return_immediately(NULL);
> + return_immediately(NULL);
> + end_measure();
> + }
> + print_buf("zero");
> + clean_buf();
> +
> + set_current_state(TASK_UNINTERRUPTIBLE);
> + schedule_timeout(200);
> + for(i=0;i<100;i++) {
> +   

Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-11 Thread Stephen C. Tweedie

Hi,

On Tue, Jan 09, 2001 at 11:14:54AM -0800, Linus Torvalds wrote:
> In article <[EMAIL PROTECTED]>,
> 
> kiobufs are crap. Face it. They do NOT allow proper multi-page scatter
> gather, regardless of what the kiobuf PR department has said.

It's not surprising, since they were designed to solve a totally
different problem.

Kiobufs were always intended to represent logical buffers --- a virtual
address range from some process, or a region of a cached file.  The
purpose behind them was, if you remember, to allow something like
map_user_kiobuf() to produce a list of physical pages from the user VA
range.

This works exactly as intended.  The raw IO device driver may build a
kiobuf to represent a user VA range, and the XFS filesystem may build
one for its pagebuf abstraction to represent a range within a file in
the page cache.  The lower level IO routines just don't care where the
buffers came from.

There are still problems here --- the encoding of block addresses in
the list, dealing with a stack of completion events if you push these
buffers down through various layers of logical block device such as
raid/lvm, carving requests up and merging them if you get requests
which span a raid or LVM stripe, for example.  Kiobufs don't solve
those, but neither do skfrags, and neither does the MSG_MORE concept.

If you want a scatter-gather list capable of taking individual
buffer_heads and merging them, then sure, kiobufs won't do the trick
as they stand now: they were never intended to.  The whole point of
kiobufs was to encapsulate one single buffer in the higher layers, and
to allow lower layers to work on that buffer without caring where the
memory came from.  

But adding the sub-page sg lists is a simple extension.  I've got a
number of raw IO fixes pending, and we've just traced the source of
the last problem that was holding it up, so if you want I'll add the
per-page offset/length with those. 

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
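
A small self-contained illustration of the "logical buffer" point: a kiobuf-style
mapping takes one user virtual range and reduces it to a page list plus a single
offset and length. The arithmetic below only models the idea (4096-byte pages and
an example address are assumed); it does not use the real map_user_kiobuf()
interface.

#include <stdio.h>

#define PAGE_SIZE  4096UL
#define PAGE_MASK  (~(PAGE_SIZE - 1))

int main(void)
{
    unsigned long va  = 0x08049f20UL;   /* example user address  */
    unsigned long len = 10000UL;        /* example buffer length */

    unsigned long first  = va & PAGE_MASK;
    unsigned long last   = (va + len - 1) & PAGE_MASK;
    unsigned long npages = ((last - first) / PAGE_SIZE) + 1;

    /* The kiobuf records the per-buffer offset once; the individual
     * pages need no offset/length of their own for this case. */
    printf("offset into first page: %lu\n", va - first);
    printf("total length          : %lu\n", len);
    printf("pages needed          : %lu\n", npages);
    return 0;
}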



Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-10 Thread Linus Torvalds

In article <[EMAIL PROTECTED]>,
Andrew Morton  <[EMAIL PROTECTED]> wrote:
>Linus Torvalds wrote:
>> 
>> De gustibus non disputandum.
>
>http://cogprints.soton.ac.uk/documents/disk0/00/00/07/57/
>
>   "ingestion of the afterbirth during delivery"
>
>eh?
>
>
>http://www.degustibus.co.uk/
>
>   "Award winning artisan breadmakers."
>
>Ah.  That'll be it.

Latin 101. Literally "about taste no argument".

I suspect that it _should_ be "De gustibus non disputandum est", but
it's been too many years. That adds the required verb ("is") to make it
a full sentence. 

In English: "There is no arguing taste".

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-10 Thread Jamie Lokier

Ingo Molnar wrote:
> > > well, this is a performance problem if you are using threads. For normal
> > > processes there is no need for a SMP cross-call, there TLB flushes are
> > > local only.
> > >
> > But that would be ugly as hell:
> > so apache 2.0 would become slower with MSG_NOCOPY, whereas samba 2.2
> > would become faster.
> 
> there *is* a cost of having a shared VM - and this is i suspect
> unavoidable.

Is it possible to avoid the SMP cross-call in the case that the other
threads have neither accessed nor dirtied the page in question?

One way to implement this is to share VMs but not the page tables, or to
share parts of the page tables that don't contain writable pages.

Just a sudden inspired thought...  I don't know if it is possible or
worthwhile.

enjoy,
-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-10 Thread Stephen C. Tweedie

Hi,

On Tue, Jan 09, 2001 at 02:25:43PM -0800, Linus Torvalds wrote:
> In article <[EMAIL PROTECTED]>,
> Stephen C. Tweedie <[EMAIL PROTECTED]> wrote:
> >
> >Jes has also got hard numbers for the performance advantages of
> >jumbograms on some of the networks he's been using, and you ain't
> >going to get udp jumbograms through a page-by-page API, ever.
> 
> The only thing you need is a nagle-type thing that coalesces requests.

Is this robust enough to build a useful user-level API on top of?

What happens if we have a threaded application in which more than one
process may be sending udp sendmsg()s to the file descriptor?  If we
end up decomposing each datagram into multiple page-sized chunks, then
you can imagine them arriving at the fd stream in interleaved order.

You can fix that by adding extra locking, but that just indicates that
the original API wasn't sufficient to communicate the precise intent
of the application in the first place.

Things look worse from the point of view of ll_rw_block, which lacks
any concept of (a) a file descriptor, or (b) a non-reorderable stream
of atomic requests.  ll_rw_block coalesces in any order it chooses, so
its coalescing function is a _lot_ more complex than hooking the next
page onto a linked list.  

Once the queue size grows non-trivial, adding a new request can become
quite expensive (even with only one item on the request queue at once,
make_request is still by far the biggest cost on a kernel profile
running raw IO).  If you've got a 32-page IO to send, sending it in
chunks means either merging 32 times into that queue when you could
have just done it once, or holding off all merging until you're told
to unplug: but with multiple clients, you just encounter the lack of
caller context again, and each client can unplug the other before its
time.

I realise these are apples and oranges to some extent, because
ll_rw_block doesn't accept a file descriptor: the place where we _do_
use file descriptors, block_write(), could be doing some of this if
the requests were coming from an application.

However, that doesn't address the fact that we have got raw devices
and filesystems such as XFS already generating large multi-page block
IO requests and having to cram them down the thin pipe which is
ll_rw_block, and the MSG_MORE flag doesn't seem capable of extending
to ll_rw_block sufficiently well.

I guess it comes down to this: what problem are we trying to fix?  If
it's strictly limited to sendfile/writev and related calls, then
you've convinced me that page-by-page MSG_MORE can work if you add a
bit of locking, but that locking is by itself nasty.  

Think about O_DIRECT to a database file.  We get a write() call,
locate the physical pages through unspecified magic, and fire off a
series of page or partial-page writes to the O_DIRECT fd.  If we are
coalescing these via MSG_MORE, then we have to keep the fd locked for
write until we've processed the whole IO (including any page faults
that result).  The filesystem --- which is what understands the
concept of a file descriptor --- can merge these together into another
request, but we'd just have to split that request into chunks again to
send them to ll_rw_block.

We may also have things like software raid layers in the write path.
That's the motivation for having an object capable of describing
multi-page IOs --- it lets us pass the desired IO chunks down through
the filesystem, virtual block devices and physical block devices,
without any context being required and without having to
decompose/merge at each layer.

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
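
For reference, the page-by-page MSG_MORE pattern being debated reads roughly like
this against the interface that eventually shipped in Linux: MSG_MORE is set on
every chunk except the last so the kernel may coalesce them. Error handling is
minimal and the helper name is made up; note that two threads calling this on the
same socket can interleave their chunks, which is exactly the concern raised
above.

#include <sys/types.h>
#include <sys/socket.h>
#include <errno.h>

#define CHUNK 4096

/* Send 'len' bytes from 'buf' over a connected stream socket 'fd' in
 * page-sized pieces.  Returns 0 on success, -1 on error (errno set by send). */
static int send_in_chunks(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        size_t n   = len > CHUNK ? CHUNK : len;
        int flags  = len > n ? MSG_MORE : 0;   /* more data follows this chunk? */
        ssize_t rc = send(fd, buf, n, flags);

        if (rc < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        buf += rc;
        len -= (size_t)rc;
    }
    return 0;
}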



Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-10 Thread Ingo Molnar


On Wed, 10 Jan 2001, Manfred Spraul wrote:

> > well, this is a performance problem if you are using threads. For normal
> > processes there is no need for a SMP cross-call, there TLB flushes are
> > local only.
> >
> But that would be ugly as hell:
> so apache 2.0 would become slower with MSG_NOCOPY, whereas samba 2.2
> would become faster.

there *is* a cost of having a shared VM - and this is i suspect
unavoidable.

> Is it possible to move the responsibility for maintaining the copy to
> the caller?

this needs a completion event i believe.

> e.g. use msg_control, and then the caller can request either that a
> signal is sent when that data is transfered, or that a variable is set
> to 0.

i believe a signal-based thing would be the right (and scalable) solution
- the signal handler could free() the buffer.

this makes sense even in the VM-assisted MSG_NOCOPY case, since one wants
to do garbage collection of these in-flight buffers anyway. (not for
correctness but for performance reasons - free()-ing and immediately
reusing such a buffer would generate a COW.)

Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-10 Thread Manfred Spraul

Ingo Molnar wrote:
> 
> On Wed, 10 Jan 2001, Manfred Spraul wrote:
> 
> > That means sendmsg() changes the page tables? I measured
> > smp_call_function on my Dual Pentium 350, and it took around 1950 cpu
> > ticks.
> 
> well, this is a performance problem if you are using threads. For normal
> processes there is no need for a SMP cross-call, there TLB flushes are
> local only.
> 
But that would be ugly as hell:
so apache 2.0 would become slower with MSG_NOCOPY, whereas samba 2.2
would become faster.

Is it possible to move the responsibility for maintaining the copy to the
caller?

e.g. use msg_control, and then the caller can request either that a
signal is sent when that data is transfered, or that a variable is set
to 0.

--
Manfred
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-10 Thread Ingo Molnar


On Wed, 10 Jan 2001, Manfred Spraul wrote:

> That means sendmsg() changes the page tables? I measured
> smp_call_function on my Dual Pentium 350, and it took around 1950 cpu
> ticks.

well, this is a performance problem if you are using threads. For normal
processes there is no need for a SMP cross-call, there TLB flushes are
local only.

Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-10 Thread Trond Myklebust

> " " == David S Miller <[EMAIL PROTECTED]> writes:

 > Date: Tue, 9 Jan 2001 16:27:49 +0100 (CET)
 > From: Trond Myklebust <[EMAIL PROTECTED]>

 >OK, but can you eventually generalize it to non-stream
 >protocols (i.e. UDP)?

 > Sure, this is what MSG_MORE is meant to accomodate.  UDP could
 > support it just fine.

Great! I've been waiting for something like this. In particular the
knfsd TCP server code can get very buffer-intensive without it since
you need to pre-allocate 1 set of buffers per TCP connection (else you
get DOS due to buffer saturation when doing wait+retry for blocked
sockets).

If it all gets in to the kernel, I'll do the work of adapting the NFS
+ sunrpc stuff.

Cheers,
  Trond
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-10 Thread David S. Miller

   Date:Wed, 10 Jan 2001 09:41:24 +0100
   From: Manfred Spraul <[EMAIL PROTECTED]>

   That means sendmsg() changes the page tables?

Not in the zerocopy patch I am proposing and asking people to test.  I
stated in another email that MSG_NOCOPY was considered experimental
and thus left out of my patches.

   I measured smp_call_function on my Dual Pentium 350, and it took
   around 1950 cpu ticks.

And this is one of several reasons why the MSG_NOCOPY facility is
considered experimental.

Later,
David S. Miller
[EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-10 Thread Manfred Spraul

> > In user space, how do you know when its safe to reuse the buffer that 
> > was handed to sendmsg() with the MSG_NOCOPY flag? Or does sendmsg() 
> > with that flag block until the buffer isn't needed by the kernel any 
> > more? If it does block, doesn't that defeat the use of non-blocking 
> > I/O? 
> 
> sendmsg() marks those pages COW and copies the original page into a new 
> one for further usage. (the old page is used until the packet is 
> released.) So for maximum performance user-space should not reuse such 
> buffers immediately. 
>
That means sendmsg() changes the page tables? I measured
smp_call_function on my Dual Pentium 350, and it took around 1950 cpu
ticks.
I'm sure that for an 8 way server the total lost time on all cpus (multi
threaded server) is larger than the time required to copy the complete
page.
(I've attached my patch, just run "insmod dummy p_shift=0")


--
Manfred

--- 2.4/drivers/net/dummy.c Mon Dec  4 02:45:22 2000
+++ build-2.4/drivers/net/dummy.c   Wed Jan 10 09:15:20 2001
@@ -95,9 +95,168 @@
 
 static struct net_device dev_dummy;
 
+/* * */
+int p_shift = -1;
+MODULE_PARM (p_shift, "1i");
+MODULE_PARM_DESC(p_shift, "Shift for the profile buffer");
+
+int p_size = 0;
+MODULE_PARM (p_size, "1i");
+MODULE_PARM_DESC(p_size, "size");
+
+
+#define STAT_TABLELEN  16384
+static unsigned long totals[STAT_TABLELEN];
+static unsigned int overflows;
+
+static unsigned long long stime;
+static void start_measure(void)
+{
+__asm__ __volatile__ (
+   ".align 64\n\t"
+   "pushal\n\t"
+   "cpuid\n\t"
+   "popal\n\t"
+   "rdtsc\n\t"
+   "movl %%eax,(%0)\n\t"
+   "movl %%edx,4(%0)\n\t"
+   : /* no output */
+   : "c"(&stime)  /* %ecx points at stime; the asm stores the TSC there */
+   : "eax", "edx", "memory" );
+}
+
+static void end_measure(void)
+{
+static unsigned long long etime;
+   __asm__ __volatile__ (
+   "pushal\n\t"
+   "cpuid\n\t"
+   "popal\n\t"
+   "rdtsc\n\t"
+   "movl %%eax,(%0)\n\t"
+   "movl %%edx,4(%0)\n\t"
+   : /* no output */
+   : "c"(&etime)  /* %ecx points at etime */
+   : "eax", "edx", "memory" );
+   {
+   unsigned long time = (unsigned long)(etime-stime);
+   time >>= p_shift;
+   if(time < STAT_TABLELEN) {
+   totals[time]++;
+   } else {
+   overflows++;
+   }
+   }
+}
+
+static void clean_buf(void)
+{
+   memset(totals,0,sizeof(totals));
+   overflows = 0;
+}
+
+static void print_line(unsigned long* array)
+{
+   int i;
+   for(i=0;i<32;i++) {
+   if((i%32)==16)
+   printk(":");
+   printk("%lx ",array[i]); 
+   }
+}
+
+static void print_buf(char* caption)
+{
+   int i, other = 0;
+   printk("Results - %s - shift %d",
+   caption, p_shift);
+
+   for(i=0;i<STAT_TABLELEN;i+=32) {
+   int j;
+   int local = 0;
+   for(j=0;j<32;j++)
+   local += totals[i+j];
+
+   if(local) {
+   printk("\n%3x: ",i);
+   print_line(&totals[i]);
+   other += local;
+   }
+   }
+   printk("\nOverflows: %d.\n",
+   overflows);
+   printk("Sum: %ld\n",other+overflows);
+}
+
+static void return_immediately(void* dummy)
+{
+   return;
+}
+
+static void just_one_page(void* dummy)
+{
+   __flush_tlb_one(0x12345678);
+   return;
+}
+
+
 static int __init dummy_init_module(void)
 {
 int err;
+
+   if(p_shift != -1) {
+   int i;
+   void* p;
+   kmem_cache_t* cachep;
+   /* empty test measurement: */
+   printk(" kernel cpu benchmark started **\n");
+   clean_buf();
+   set_current_state(TASK_UNINTERRUPTIBLE);
+   schedule_timeout(200);
+   for(i=0;i<100;i++) {
+   start_measure();
+   return_immediately(NULL);
+   return_immediately(NULL);
+   return_immediately(NULL);
+   return_immediately(NULL);
+   end_measure();
+   }
+   print_buf("zero");
+   clean_buf();
+
+   set_current_state(TASK_UNINTERRUPTIBLE);
+   schedule_timeout(200);
+   for(i=0;i<100;i++) {
+   start_measure();
+   return_immediately(NULL);
+   return_immediately(NULL);
+   smp_call_function(return_immediately,NULL,
+   1, 1);
+   return_immediately(NULL);


Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-10 Thread Christoph Hellwig

On Wed, Jan 10, 2001 at 12:05:01AM -0800, Linus Torvalds wrote:
> 
> 
> On Wed, 10 Jan 2001, Christoph Hellwig wrote:
> > 
> > Simple.  Because I stated before that I DON'T even want the networking
> > to use kiobufs in lower layers.  My whole argument is to pass a kiovec
> > into the fileop instead of a page, because it makes sense for other
> > drivers to use multiple pages, and doesn't hurt networking besides
> > the cost of one kiobuf (116k) and the processor cycles for creating
> > and destroying it once per sys_sendfile.
> 
> Fair enough.
> 
> My whole argument against that is that I think kiovec's are incredibly
> ugly, and the less I see of them in critical regions, the happier I am.
> 
> And that, I have to admit, is really mostly a matter of "taste". 

Ok.

This is a statement that makes all the current kiobuf efforts look no
longer as interesting as before.

IMHO it is time to find a generic interface for IO that is acceptable to
you and widely usable.

As you stated before, that seems to be s.th. with page,offset,length
tuples.

Christoph

-- 
Whip me.  Beat me.  Make me maintain AIX.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-10 Thread Andrew Morton

Linus Torvalds wrote:
> 
> De gustibus non disputandum.

http://cogprints.soton.ac.uk/documents/disk0/00/00/07/57/

"ingestion of the afterbirth during delivery"

eh?


http://www.degustibus.co.uk/

"Award winning artisan breadmakers."

Ah.  That'll be it.

-
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-10 Thread Gerd Knorr

> > Please tell me what you think the right interface is that provides a hook
> > on io completion and is asynchronous.
> 
> Suggested fix to kiovec's: get rid of them. Immediately. Replace them with
> kiobuf's that can handle scatter-gather pages. kiobuf's have 90% of that
> support already.
> 
> Never EVER have a "struct page **" interface. It is never the valid thing
> to do.

Hmm, /me is quite happy with it.  It's fine for *big* chunks of memory like
video frames:  I just need a large number of pages, length and offset.  If
someone wants to have a look: a rewritten bttv version which uses kiobufs
is available at http://www.strusel007.de/linux/bttv/bttv-0.8.8.tar.gz

It does _not_ use kiovecs though (to be exact: kiovecs with just one single
kiobuf in there).

> You should have
> 
>   struct fragment {
>   struct page *page;
>   __u16 offset, length;
>   }

What happens with big memory blocks?  Do all pages but the first and last
get offset=0 and length=PAGE_SIZE then?
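
A sketch of the decomposition being asked about, assuming exactly that
convention; struct fragment is taken from Linus's mail above, while
make_fragments and the kernel-virtual-buffer assumption are illustrative:

struct fragment {
        struct page *page;
        __u16 offset, length;
};

/* Illustration only (not from any posted patch): split a kernel-virtual
 * buffer [buf, buf+len) into fragments.  The first fragment starts at
 * buf's offset within its page, every middle fragment is a whole page
 * (offset 0, length PAGE_SIZE), and the last one is short. */
static int make_fragments(unsigned long buf, size_t len,
                          struct fragment *frag, int max)
{
        int n = 0;

        while (len && n < max) {
                unsigned long off = buf & (PAGE_SIZE - 1);
                size_t chunk = PAGE_SIZE - off;

                if (chunk > len)
                        chunk = len;
                frag[n].page = virt_to_page(buf);
                frag[n].offset = off;
                frag[n].length = chunk;
                buf += chunk;
                len -= chunk;
                n++;
        }
        return n;
}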

  Gerd

-- 
Get back there in front of the computer NOW. Christmas can wait.
-- Linus "the Grinch" Torvalds,  24 Dec 2000 on linux-kernel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-10 Thread Stephen C. Tweedie

Hi,

On Tue, Jan 09, 2001 at 02:25:43PM -0800, Linus Torvalds wrote:
> In article <[EMAIL PROTECTED]>,
> Stephen C. Tweedie <[EMAIL PROTECTED]> wrote:
> >
> >Jes has also got hard numbers for the performance advantages of
> >jumbograms on some of the networks he's been using, and you ain't
> >going to get udp jumbograms through a page-by-page API, ever.
> 
> The only thing you need is a nagle-type thing that coalesces requests.

Is this robust enough to build a useful user-level API on top of?

What happens if we have a threaded application in which more than one
process may be sending udp sendmsg()s to the file descriptor?  If we
end up decomposing each datagram into multiple page-sized chunks, then
you can imagine them arriving at the fd stream in interleaved order.

You can fix that by adding extra locking, but that just indicates that
the original API wasn't sufficient to communicate the precise intent
of the application in the first place.

Things look worse from the point of view of ll_rw_block, which lacks
any concept of (a) a file descriptor, or (b) a non-reorderable stream
of atomic requests.  ll_rw_block coalesces in any order it chooses, so
its coalescing function is a _lot_ more complex than hooking the next
page onto a linked list.  

Once the queue size grows non-trivial, adding a new request can become
quite expensive (even with only one item on the request queue at once,
make_request is still by far the biggest cost on a kernel profile
running raw IO).  If you've got a 32-page IO to send, sending it in
chunks means either merging 32 times into that queue when you could
have just done it once, or holding off all merging until you're told
to unplug: but with multiple clients, you just encounter the lack of
caller context again, and each client can unplug the other before its
time.

I realise these are apples and oranges to some extent, because
ll_rw_block doesn't accept a file descriptor: the place where we _do_
use file descriptors, block_write(), could be doing some of this if
the requests were coming from an application.

However, that doesn't address the fact that we have got raw devices
and filesystems such as XFS already generating large multi-page block
IO requests and having to cram them down the thin pipe which is
ll_rw_block, and the MSG_MORE flag doesn't seem capable of extending
to ll_rw_block sufficiently well.

I guess it comes down to this: what problem are we trying to fix?  If
it's strictly limited to sendfile/writev and related calls, then
you've convinced me that page-by-page MSG_MORE can work if you add a
bit of locking, but that locking is by itself nasty.  

Think about O_DIRECT to a database file.  We get a write() call,
locate the physical pages through unspecified magic, and fire off a
series of page or partial-page writes to the O_DIRECT fd.  If we are
coalescing these via MSG_MORE, then we have to keep the fd locked for
write until we've processed the whole IO (including any page faults
that result).  The filesystem --- which is what understands the
concept of a file descriptor --- can merge these together into another
request, but we'd just have to split that request into chunks again to
send them to ll_rw_block.

We may also have things like software raid layers in the write path.
That's the motivation for having an object capable of describing
multi-page IOs --- it lets us pass the desired IO chunks down through
the filesystem, virtual block devices and physical block devices,
without any context being required and without having to
decompose/merge at each layer.

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-10 Thread Jamie Lokier

Ingo Molnar wrote:
> > > well, this is a performance problem if you are using threads. For normal
> > > processes there is no need for a SMP cross-call, there TLB flushes are
> > > local only.
> >
> > But that would be ugly as hell:
> > so apache 2.0 would become slower with MSG_NOCOPY, whereas samba 2.2
> > would become faster.
> 
> there *is* a cost of having a shared VM - and this is i suspect
> unavoidable.

Is it possible to avoid the SMP cross-call in the case that the other
threads have neither accessed nor dirtied the page in question?

One way to implement this is to share VMs but not the page tables, or to
share parts of the page tables that don't contain writable pages.

Just a sudden inspired thought...  I don't know if it is possible or
worthwhile.

enjoy,
-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-10 Thread Linus Torvalds

In article <[EMAIL PROTECTED]>,
Andrew Morton  <[EMAIL PROTECTED]> wrote:
>Linus Torvalds wrote:
>> 
>> De gustibus non disputandum.
>
>http://cogprints.soton.ac.uk/documents/disk0/00/00/07/57/
>
>"ingestion of the afterbirth during delivery"
>
>eh?
>
>
>http://www.degustibus.co.uk/
>
>"Award winning artisan breadmakers."
>
>Ah.  That'll be it.

Latin 101. Literally "about taste no argument".

I suspect that it _should_ be "De gustibus non disputandum est", but
it's been too many years. That adds the required verb ("is") to make it
a full sentence. 

In English: "There is no arguing taste".

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-09 Thread Linus Torvalds



On Wed, 10 Jan 2001, Christoph Hellwig wrote:
> 
> Simple.  Because I stated before that I DON'T even want the networking
> to use kiobufs in lower layers.  My whole argument is to pass a kiovec
> into the fileop instead of a page, because it makes sense for other
> drivers to use multiple pages, and doesn't hurt networking besides
> the cost of one kiobuf (116k) and the processor cycles for creating
> and destroying it once per sys_sendfile.

Fair enough.

My whole argument against that is that I think kiovec's are incredibly
ugly, and the less I see of them in critical regions, the happier I am.

And that, I have to admit, is really mostly a matter of "taste". 

De gustibus non disputandum.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-09 Thread Christoph Hellwig

On Tue, Jan 09, 2001 at 01:26:44PM -0800, Linus Torvalds wrote:
> 
> 
> On Tue, 9 Jan 2001, Christoph Hellwig wrote:
> > > 
> > > Look at sendfile(). You do NOT have a "bunch" of pages.
> > > 
> > > Sendfile() is very much a page-at-a-time thing, and expects the actual IO
> > > layers to do it's own scatter-gather. 
> > > 
> > > So sendfile() doesn't want any array at all: it only wants a single
> > > page-offset-length tuple interface.
> > 
> > The current implementation does.
> > But others are possible.  I could post one in a few days to show that it is
> > possible.
> 
> Why do you bother arguing, when I've shown you that even if sendfile()
> _did_ do multiple pages, it STILL wouldn't make kibuf's the right
> interface. You just snipped out that part of my email, which states that
> the networking layer would still need to do better scatter-gather than
> kiobuf's can give it for multiple send-file invocations.

Simple.  Because I stated before that I DON'T even want the networking
to use kiobufs in lower layers.  My whole argument is to pass a kiovec
into the fileop instead of a page, because it makes sense for other
drivers to use multiple pages, and doesn't hurt networking besides
the cost of one kiobuf (116k) and the processor cycles for creating
and destroying it once per sys_sendfile.

Christoph

-- 
Whip me.  Beat me.  Make me maintain AIX.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: storage over IP (was Re: [PLEASE-TESTME] Zerocopy networking patch,2.4.0-1)

2001-01-09 Thread David S. Miller

   Date: Tue, 9 Jan 2001 18:56:33 -0800 (PST)
   From: dean gaudet <[EMAIL PROTECTED]>

   is NFS receive single copy today?

With the zerocopy patches, NFS client receive is "single cpu copy" if
that's what you mean.

Later,
David S. Miller
[EMAIL PROTECTED]

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



storage over IP (was Re: [PLEASE-TESTME] Zerocopy networking patch,2.4.0-1)

2001-01-09 Thread dean gaudet

On Tue, 9 Jan 2001, Ingo Molnar wrote:

> On Mon, 8 Jan 2001, Rik van Riel wrote:
>
> > Having proper kiobuf support would make it possible to, for example,
> > do zerocopy network->disk data transfers and lots of other things.
>
> i used to think that this is useful, but these days it isnt.

this seems to be in the general theme of "network receive is boring".
which i mostly agree with... except recently i've been thinking about an
application where it may not be so boring, but i haven't researched all
the details yet.

the application is storage over IP -- SAN using IP (i.e. gigabit ethernet)
technologies instead of fiberchannel technologies.  several companies are
doing it or planning to do it (for example EMC, 3ware).

i'm taking a wild guess that SCSI over FC is arranged conveniently to
allow a scatter request to read packets off the FC NIC such that the
headers go one way and the data lands neatly into the page cache (i.e.
fixed length headers).  i've never investigated the actual protocols
though so maybe the solution used was to just push a lot of the detail
down into the controllers.

a quick look at the iSCSI specification
, and the
FCIP spec

show that both use TCP/IP.  TCP/IP has variable length headers (or am i on
crack?), which totally complicates the receive path.

the iSCSI requirements document seems to imply they're happy with pushing
this extra processing down to a special storage NIC.  that kind of sucks
-- one of the benefits of storage over IP would be the ability to
redundantly connect a box to storage and IP with only two NICs (instead of
4 -- 2 IP and 2 FC).

is NFS receive single copy today?

anyone tried doing packet demultiplexing by grabbing headers on one pass
and scattering the data on a second pass?
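
A userspace sketch of that two-pass idea, assuming a fixed-length protocol
header; the function and its names are illustrative, not taken from any
existing stack:

#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Peek at the header first, then pull header plus payload in one call,
 * scattering the payload into a page-aligned buffer chosen from the
 * header contents. */
static ssize_t recv_split(int sock, void *hdr, size_t hdrlen,
                          void *page_buf, size_t buflen)
{
        struct iovec iov[2] = {
                { .iov_base = hdr,      .iov_len = hdrlen },
                { .iov_base = page_buf, .iov_len = buflen },
        };
        struct msghdr msg = { .msg_iov = iov, .msg_iovlen = 2 };

        if (recv(sock, hdr, hdrlen, MSG_PEEK) < 0)  /* pass 1: header only */
                return -1;
        /* ...pick the destination pages based on the header here... */
        return recvmsg(sock, &msg, 0);              /* pass 2: scatter */
}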

i'm hoping i'm missing something.  anyone else looked around at this stuff
yet?

-dean

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-09 Thread Dave Zarzycki

On Tue, 9 Jan 2001, David S. Miller wrote:

> Ignore Ingo's comments about the MSG_NOCOPY flag, I've not included
> those parts in the zerocopy patches as they are very controversial
> and require some VM layer support.

Okay, I talked to some kernel engineers where I work and they were (I
think) very justifiably skeptical of zero-copy work with respect to
read/write style APIs.

> Basically, it pins the userspace pages, so if you write to them before
> the data is fully sent and the networking buffer freed, they get
> copied with a COW fault.

Yum... Assuming a gigabit ethernet link is saturated with the
sendmsg(MSG_NOCOPY) API, what is CPU utilization like for a given clock
speed and processor make? Is it any different than the sendfile() case?

davez

-- 
Dave Zarzycki
http://thor.sbay.org/~dave/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-09 Thread David S. Miller

   Date:Tue, 9 Jan 2001 17:14:33 -0800 (PST)
   From: Dave Zarzycki <[EMAIL PROTECTED]>

   On Tue, 9 Jan 2001, Ingo Molnar wrote:

   > then you'll love the zerocopy patch :-) Just use sendfile() or specify
   > MSG_NOCOPY to sendmsg(), and you'll see effective memory-to-card
   > DMA-and-checksumming on cards that support it.

   I'm confused.

   In user space, how do you know when its safe to reuse the buffer that was
   handed to sendmsg() with the MSG_NOCOPY flag? Or does sendmsg() with that
   flag block until the buffer isn't needed by the kernel any more? If it
   does block, doesn't that defeat the use of non-blocking I/O?

Ignore Ingo's comments about the MSG_NOCOPY flag, I've not included
those parts in the zerocopy patches as they are very controversial
and require some VM layer support.

Basically, it pins the userspace pages, so if you write to them before
the data is fully sent and the networking buffer freed, they get
copied with a COW fault.

Later,
David S. Miller
[EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-09 Thread Ingo Molnar


On Tue, 9 Jan 2001, Dave Zarzycki wrote:

> In user space, how do you know when its safe to reuse the buffer that
> was handed to sendmsg() with the MSG_NOCOPY flag? Or does sendmsg()
> with that flag block until the buffer isn't needed by the kernel any
> more? If it does block, doesn't that defeat the use of non-blocking
> I/O?

sendmsg() marks those pages COW and copies the original page into a new
one for further usage. (the old page is used until the packet is
released.) So for maximum performance user-space should not reuse such
buffers immediately.
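
In userspace terms that means cycling through a pool of send buffers
instead of overwriting the one just handed to the kernel. A sketch, with
MSG_NOCOPY stubbed out because the flag exists only in the experimental
patch discussed in this thread:

#include <string.h>
#include <sys/socket.h>

#ifndef MSG_NOCOPY
#define MSG_NOCOPY 0    /* placeholder: flag exists only in the experimental patch */
#endif

#define NBUF  8
#define BUFSZ (64 * 1024)

static char pool[NBUF][BUFSZ];

/* Rotate through a pool deep enough that buffer i has normally left the
 * NIC before we come back around to it, so the COW copy rarely triggers. */
static void send_rotating(int sock, const void *data, size_t len)
{
        static unsigned int next;
        char *buf = pool[next++ % NBUF];

        memcpy(buf, data, len);                 /* len <= BUFSZ assumed */
        (void)send(sock, buf, len, MSG_NOCOPY);
}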

Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-09 Thread Dave Zarzycki

On Tue, 9 Jan 2001, Ingo Molnar wrote:

> then you'll love the zerocopy patch :-) Just use sendfile() or specify
> MSG_NOCOPY to sendmsg(), and you'll see effective memory-to-card
> DMA-and-checksumming on cards that support it.

I'm confused.

In user space, how do you know when its safe to reuse the buffer that was
handed to sendmsg() with the MSG_NOCOPY flag? Or does sendmsg() with that
flag block until the buffer isn't needed by the kernel any more? If it
does block, doesn't that defeat the use of non-blocking I/O?

davez

-- 
Dave Zarzycki
http://thor.sbay.org/~dave/


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-09 Thread Andrea Arcangeli

On Tue, Jan 09, 2001 at 09:10:24PM +0100, Ingo Molnar wrote:
> 
> On Tue, 9 Jan 2001, Andrea Arcangeli wrote:
> 
> > BTW, I noticed what is left in blk-13B seems to be my work (Jens's
> > fixes for merging when the I/O queue is full are just been integrated
> > in test1x).  [...]
> 
> it was Jens' [i think those were implemented by Jens entirely]
> batch-freeing changes that made the most difference. (we did

Confirm, the batch-freeing was Jens's work.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-09 Thread Linus Torvalds



On Tue, 9 Jan 2001, Benjamin C.R. LaHaise wrote:

> On Tue, 9 Jan 2001, Linus Torvalds wrote:
> 
> > The _lower-level_ stuff (ie TCP and the drivers) want the "array of
> > tuples", and again, they do NOT want an array of pages, because if
> > somebody does two sendfile() calls that fit in one packet, it really needs
> > an array of tuples.
> 
> A kiobuf simply provides that tuple plus the completion callback.  Stick a
> bunch of them together and you've got a kiovec.  I don't see the advantage
> of moving to simpler primitives if they don't provide needed
> functionality.

Ehh.

Let's re-state your argument:

 "You could have used the existing, complex and cumbersome primitives that
  had the wrong semantics. I don't see the advantage of pointing out the
  fact that those primitives are badly designed for the problem at hand 
  and moving to simpler and better designed primitives that fit the
  problem well"

Would you agree that that is the essence of what you said? And if not,
then why not?

> Please tell me what you think the right interface is that provides a hook
> on io completion and is asynchronous.

Suggested fix to kiovec's: get rid of them. Immediately. Replace them with
kiobuf's that can handle scatter-gather pages. kiobuf's have 90% of that
support already.

Never EVER have a "struct page **" interface. It is never the valid thing
to do. You should have

struct fragment {
struct page *page;
__u16 offset, length;
}

and then have "struct fragment **" inside the kiobuf's instead. Rename
"nr_pages" as "nr_fragments", and get rid of the global offset/length, as
they don't make any sense. Voila - your kiobuf is suddenly a lot more
flexible.

Finally, don't embed the static KIO_STATIC_PAGES array in the kiobuf. The
caller knows when it makes sense, and when it doesn't. Don't embed that
knowledge in fundamental data structures.
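
Put together, the structure being proposed would look roughly like the
sketch below; the two fields of struct fragment are from this mail, while
the remaining field and callback names are guesses at how the pieces would
fit, not code from any posted patch:

struct fragment {
        struct page *page;
        __u16 offset, length;
};

struct kiobuf {
        int              nr_fragments;      /* was nr_pages */
        struct fragment **fraglist;         /* no global offset/length */

        /* async completion hook, as in the existing kiobuf */
        void            (*end_io)(struct kiobuf *);
        void             *private;

        /* no embedded KIO_STATIC_PAGES array: a caller that wants a
         * small inline allocation provides its own */
};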

In the meantime, I'm more than happy to make sure that the networking
infrastructure is sane. Which implies that the networking infrastructure
does NOT use kiovecs.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-09 Thread Andrea Arcangeli

On Wed, Jan 10, 2001 at 12:34:35AM +0100, Jens Axboe wrote:
> Ah I see. It would be nice to base the QUEUE_NR_REQUEST on something else
> than a static number. For example, 3000 per queue translates into 281Kb
> of request slots per queue. On a typical system with a floppy, hard drive,
> and CD-ROM it's getting close to 1Mb of RAM used for this alone. On a
> 32Mb box this is unacceptable.

Yes of course. In fact 3000 was just the number I chose when doing the
benchmarks on a 128M box. Things need to be autotuned and that's not yet
implemented. I meant 3000 to show how far such a number can grow. Right now if you
use 3000 you will need to lock 1.5G of RAM (more than the normal zone!) before
you can block with the 512K scsi commands.  This was just to show the rest of
the blkdev layer was obviously restructured.  On a 8G box 1 requests
would probably be a good number.

> Yes I see your point. However memory shortage will fire the queue in due
> time, it won't make the WRITE block however. In this case it would be

That's the performance problem I'm talking about on the lowmem boxes. In fact
this problem will happen in 2.4.x too, just less biased than with the
512K scsi commands and by you increasing the number of requests from 256 to 512.

> bdflush blocking on the WRITE's, which seem exactly what we don't want?

In 2.4.0 Linus fixed wakeup_bdflush not to wait for bdflush anymore as I suggested,
now it's the task context that submits the requests directly to the I/O queue
so it's the task that must block, not bdflush. And the task will block correctly
_if_ we unplug at the sane time in ll_rw_block.

> So you imposed a MB limit on how much I/O would be outstanding in
> blkdev_release_request? Wouldn't it make more sense to move this to

No absolutely. Not in blkdev_release_request. The changes there
are because you need to somehow do some accounting at I/O completion.

> get_request time, since with the blkdev_release_request approach you won't

Yes, only ll_rw_block unplugs, not blkdev_release_request.  Obviously since the
latter runs from irqs.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-09 Thread Jens Axboe

On Wed, Jan 10 2001, Andrea Arcangeli wrote:
> On Tue, Jan 09, 2001 at 09:12:04PM +0100, Jens Axboe wrote:
> > I haven't heard anything beyond the raised QUEUE_NR_REQUEST, so I'd like to
> > see what you have pending so we can merge :-). The tiotest seek increase was
> > mainly due to the elevator having 3000 requests to juggle and thus being able
> > to eliminate a lot of seeks right?
> 
> Raising QUEUE_NR_REQUEST is possible because of the rework of other parts of
> ll_rw_block meant to fix the lowmem boxes.

Ah I see. It would be nice to base the QUEUE_NR_REQUEST on something else
than a static number. For example, 3000 per queue translates into 281Kb
of request slots per queue. On a typical system with a floppy, hard drive,
and CD-ROM it's getting close to 1Mb of RAM used for this alone. On a
32Mb box this is unacceptable.
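
That works out to roughly 96 bytes per request slot (3000 x 96 is about
281Kb), so the obvious alternative is to scale the pool with available
memory. A sketch with made-up scaling factors; num_physpages and
PAGE_SHIFT are real 2.4 symbols, everything else is illustrative:

/* Illustration only: size the per-queue request pool from RAM instead
 * of a fixed QUEUE_NR_REQUEST. */
static int queue_nr_requests(void)
{
        unsigned long ram_kb = num_physpages << (PAGE_SHIFT - 10);
        int nr = ram_kb / 64;      /* ~1 slot per 64Kb of RAM */

        if (nr < 128)
                nr = 128;          /* floor for tiny boxes */
        if (nr > 3000)
                nr = 3000;         /* cap: ~281Kb of slots */
        return nr;                 /* 32Mb box: 512 slots, ~48Kb */
}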

I previously had blk_init_queue_nr(q, nr_free_slots) to eg not use that
many free slots on say floppy, which doesn't really make much sense
anyway.

> > I don't see any lowmem problems -- if under pressure, the queue should be
> > fired and thus it won't get as long as if you have lots of memory free.`
> 
> A write(2) shouldn't cause the allocator to wait I/O completion. It's the write
> that should block when it's only polluting the cache or you'll hurt the
> innocent rest of the system that isn't writing.
> 
> At least with my original implementation of the 512K large scsi command
> support that you merged, before a write could block you first had to generate
> at least 128Mbyte of memory _locked_ all queued in the I/O request list waiting
> the driver to process the requests (only locked, without considering
> the dirty part of memory).
> 
> Since you raised from 256 requests per queue to 512 with your patch you
> may have to generate 256Mbyte of locked memory before a write can block.
> 
> This is great on the 8G boxes that runs specweb but this isn't that great on a
> 32Mbyte box connected incidentally to a decent SCSI adapter.

Yes I see your point. However memory shortage will fire the queue in due
time, it won't make the WRITE block however. In this case it would be
bdflush blocking on the WRITE's, which seem exactly what we don't want?

> I say "may" because I didn't check closely if you introduced any kind of
> logic to avoid this. It seems not though because such a logic needs to touch at
> least blkdev_release_request and that's what I developed in my tree and then I
> could raise the number of I/O requests in the queue up to 1 if I wanted
> without any problem, the max-I/O in flight was controlled properly. (this
> allowed me to optimize away not 256 or in your case 512 seeks but 1 seeks)
> This is what I meant with exploiting the elevator. No panic, there's no buffer
> overflow there ;)

So you imposed a MB limit on how much I/O would be outstanding in
blkdev_release_request? Wouldn't it make more sense to move this to
get_request time, since with the blkdev_release_request approach you won't
catch lots of outstanding locked buffers before you start releasing one of
them, at which point it would be too late (it might recover, but still).


-- 
* Jens Axboe <[EMAIL PROTECTED]>
* SuSE Labs
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-09 Thread Andrea Arcangeli

On Tue, Jan 09, 2001 at 09:12:04PM +0100, Jens Axboe wrote:
> I haven't heard anything beyond the raised QUEUE_NR_REQUEST, so I'd like to
> see what you have pending so we can merge :-). The tiotest seek increase was
> mainly due to the elevator having 3000 requests to juggle and thus being able
> to eliminate a lot of seeks right?

Raising QUEUE_NR_REQUEST is possible because of the rework of other parts of
ll_rw_block meant to fix the lowmem boxes.

> > write numbers, streaming I/O doesn't change on highmem boxes but it doesn't
> > hurt lowmem boxes anymore). Current blk-13B isn't ok for integration yet
> > because it hurts with lowmem (try with mem=32m with your scsi array that gets
> > 512K*512 requests in flight :) and it's not able to exploit the elevator as
> 
> I don't see any lowmem problems -- if under pressure, the queue should be
> fired and thus it won't get as long as if you have lots of memory free.`

A write(2) shouldn't cause the allocator to wait for I/O completion. It's the write
that should block when it's only polluting the cache or you'll hurt the
innocent rest of the system that isn't writing.

At least with my original implementation of the 512K large scsi command
support that you merged, before a write could block you first had to generate
at least 128Mbyte of memory _locked_ all queued in the I/O request list waiting for
the driver to process the requests (only locked, without considering
the dirty part of memory).

Since you raised from 256 requests per queue to 512 with your patch you
may have to generate 256Mbyte of locked memory before a write can block.

This is great on the 8G boxes that runs specweb but this isn't that great on a
32Mbyte box connected incidentally to a decent SCSI adapter.

I say "may" because I didn't check closely if you introduced any kind of
logic to avoid this. It seems not though because such a logic needs to touch at
least blkdev_release_request and that's what I developed in my tree and then I
could raise the number of I/O requests in the queue up to 1 if I wanted
without any problem, the max-I/O in flight was controlled properly. (this
allowed me to optimize away not 256 or in your case 512 seeks but 1 seeks)
This is what I meant with exploiting the elevator. No panic, there's no buffer
overflow there ;)

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-09 Thread Dan Hollis

On Tue, 9 Jan 2001, Ingo Molnar wrote:
> On Tue, 9 Jan 2001, Dan Hollis wrote:
> > > This is not what senfile() does, it sends (to a network socket) a
> > > file (from the page cache), nothing more.
> > Ok in any case, it would be nice to have a generic sendfile() which works
> > on any fd's - socket or otherwise.
> it's a bad name in that case. We dont 'send any file' if we in fact are
> receiving a data stream from a socket and writing it into a file :-)

So we should have different system calls just so one can handle socket
and one can handle disk fd? :P

Ok so now we will have a special case sendfile() for each different kind
of fd's.

To connect socket-socket we can call it electrician() and to connect
pipe-pipe we can call it plumber() [1].

:P :b :P :b

-Dan

[1] Yes, Alex Belits, I know i've now stolen your joke...

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-09 Thread Benjamin C.R. LaHaise

On Tue, 9 Jan 2001, Linus Torvalds wrote:

> The _lower-level_ stuff (ie TCP and the drivers) want the "array of
> tuples", and again, they do NOT want an array of pages, because if
> somebody does two sendfile() calls that fit in one packet, it really needs
> an array of tuples.

A kiobuf simply provides that tuple plus the completion callback.  Stick a
bunch of them together and you've got a kiovec.  I don't see the advantage
of moving to simpler primitives if they don't provide needed
functionality.

> In short, the kiobuf interface is _always_ the wrong one.

Please tell me what you think the right interface is that provides a hook
on io completion and is asynchronous.

-ben


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-09 Thread Ingo Molnar


On Tue, 9 Jan 2001, Dan Hollis wrote:

> > This is not what senfile() does, it sends (to a network socket) a
> > file (from the page cache), nothing more.
>
> Ok in any case, it would be nice to have a generic sendfile() which works
> on any fd's - socket or otherwise.

it's a bad name in that case. We dont 'send any file' if we in fact are
receiving a data stream from a socket and writing it into a file :-)

(i think Pavel raised this issue before.)

Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



[patch]: ac4 blk (was Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1)

2001-01-09 Thread Jens Axboe

On Tue, Jan 09 2001, Ingo Molnar wrote:
> 
> > > but in 2.4, with the right patch from Jens, it doesnt suck anymore. )
> >
> > Is this "right patch from Jens" on the radar for 2.4 inclusion?
> 
> i do hope so!

Here's a version against 2.4.0-ac4, blk-13B did not apply cleanly due to
moving of i2o files and S/390 dasd changes:

*.kernel.org/pub/linux/kernel/people/axboe/patches/2.4.0-ac4/blk-13C.bz2

-- 
* Jens Axboe <[EMAIL PROTECTED]>
* SuSE Labs
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-09 Thread Dan Hollis

On Tue, 9 Jan 2001, David S. Miller wrote:
>Just extend sendfile to allow any fd to any fd. sendfile already
>does file->socket and file->file. It only needs to be extended to
>do socket->file.
> This is not what sendfile() does, it sends (to a network socket) a
> file (from the page cache), nothing more.

Ok in any case, it would be nice to have a generic sendfile() which works
on any fd's - socket or otherwise.

What sort of sendfile() behaviour is defined with select()? Can it be
asynchronous?

-Dan

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-09 Thread Linus Torvalds

In article <[EMAIL PROTECTED]>,
Stephen C. Tweedie <[EMAIL PROTECTED]> wrote:
>
>Jes has also got hard numbers for the performance advantages of
>jumbograms on some of the networks he's been using, and you ain't
>going to get udp jumbograms through a page-by-page API, ever.

Wrong.

The only thing you need is a nagle-type thing that coalesces requests.
In the case of UDP, that coalescing obviously has to be explicitly
controlled, as the "standard" UDP behaviour is to send out just one
packet per write.

But this is a problem for TCP too: you want to tell TCP to _not_ send
out a short packet even if there are none in-flight, if you know you
want to send more.  So you want to have some way to anti-nagle for TCP
anyway. 

Also, if you look at the problem of "writev()", you'll notice that you
have many of the same issues: what you really want is to _always_
coalesce, and only send out when explicitly asked for (and then that
explicit ask would be on by default at the end of write() and at the
very end of the last segment in "writev()"). 

It so happens that this logic already exists, it's called MSG_MORE or
something similar (I'm too lazy to check the actual patches). 

And it's there exactly because it is stupid to make the upper layers
have to gather everything into one packet if the lower layers need that
logic for other reasons anyway. Which they obviously do.

So what you can do is to just do multiple writes, and set the MSG_MORE
flag.  This works with sendfile(), but more importantly it is also an
uncommonly good interface to user mode.  With this, you can actually
implement things like "writev()" _properly_ from user-space, and we
could get rid of the special socket writev() magic if we wanted to. 

So if you have a header, you just send out that header separately (with
the MSG_MORE flag), and then do a "sendfile()" or whatever to send out
the data. 
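
A userspace sketch of that pattern; send(), sendfile() and MSG_MORE are
used here as described in this thread, so how far the flag is honoured
depends on the patch in question:

#include <string.h>
#include <sys/types.h>
#include <sys/sendfile.h>
#include <sys/socket.h>

/* Send "header, then file body" as one coalesced stream: the header is
 * flagged MSG_MORE so TCP holds it back and merges it with the first
 * chunk of file data pushed by sendfile(). */
static int send_response(int sock, int filefd, const char *hdr, off_t filelen)
{
        off_t off = 0;

        if (send(sock, hdr, strlen(hdr), MSG_MORE) < 0)
                return -1;
        while (off < filelen) {
                ssize_t n = sendfile(sock, filefd, &off, filelen - off);
                if (n <= 0)
                        return -1;
        }
        return 0;
}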

This is much more flexible than writev(), and a lot easier to use.  It's
also a hell of a lot more flexible than the ugly sendfile() interfaces
that HP-UX and the BSD people have - I'm ashamed of how little taste the
BSD group in general has had in interface design.  Ugh.  Tacking on a
mixture of writev() and sendfile() in the same system call.  Tacky. 

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-09 Thread David S. Miller

   Date:Tue, 9 Jan 2001 11:14:05 -0800 (PST)
   From: Dan Hollis <[EMAIL PROTECTED]>

   Just extend sendfile to allow any fd to any fd. sendfile already
   does file->socket and file->file. It only needs to be extended to
   do socket->file.

This is not what sendfile() does, it sends (to a network socket) a
file (from the page cache), nothing more.

Later,
David S. Miller
[EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-09 Thread David S. Miller

   Date: Tue, 9 Jan 2001 16:27:49 +0100 (CET)
   From: Trond Myklebust <[EMAIL PROTECTED]>

   OK, but can you eventually generalize it to non-stream protocols
   (i.e. UDP)?

Sure, this is what MSG_MORE is meant to accommodate.  UDP could support
it just fine.
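
A sketch of what that could look like from userspace, assuming UDP
honoured MSG_MORE the same way, i.e. the datagram only goes out with the
final, unflagged send; names are illustrative:

#include <sys/socket.h>
#include <sys/uio.h>

/* Build one UDP jumbogram from several pieces: every piece but the last
 * carries MSG_MORE, so the stack appends instead of emitting a datagram
 * per call. */
static int send_jumbogram(int sock, const struct iovec *piece, int n)
{
        int i;

        for (i = 0; i < n; i++) {
                int flags = (i < n - 1) ? MSG_MORE : 0;

                if (send(sock, piece[i].iov_base, piece[i].iov_len, flags) < 0)
                        return -1;
        }
        return 0;
}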

Later,
David S. Miller
[EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-09 Thread David S. Miller

   Date: Tue, 9 Jan 2001 15:17:25 +
   From: "Stephen C. Tweedie" <[EMAIL PROTECTED]>

   Jes has also got hard numbers for the performance advantages of
   jumbograms on some of the networks he's been using, and you ain't
   going to get udp jumbograms through a page-by-page API, ever.

Again, see MSG_MORE in the patches.  It is possible and our UDP
implementation could make it easily.

Later,
David S. Miller
[EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-09 Thread David S. Miller

   Date: Tue, 9 Jan 2001 14:25:42 +
   From: "Stephen C. Tweedie" <[EMAIL PROTECTED]>

   Perhaps tcp can merge internal 4K requests, but if you're doing udp
   jumbograms (or STP or VIA), you do need an interface which can give
   the networking stack more than one page at once.

All network protocols can use the current interface and get the result
you are after, see MSG_MORE.  TCP isn't "special" in this regard.

Later,
David S. Miller
[EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-09 Thread Linus Torvalds



On Tue, 9 Jan 2001, Christoph Hellwig wrote:
> > 
> > Look at sendfile(). You do NOT have a "bunch" of pages.
> > 
> > Sendfile() is very much a page-at-a-time thing, and expects the actual IO
> > layers to do it's own scatter-gather. 
> > 
> > So sendfile() doesn't want any array at all: it only wants a single
> > page-offset-length tuple interface.
> 
> The current implementation does.
> But others are possible.  I could post one in a few days to show that it is
> possible.

Why do you bother arguing, when I've shown you that even if sendfile()
_did_ do multiple pages, it STILL wouldn't make kibuf's the right
interface. You just snipped out that part of my email, which states that
the networking layer would still need to do better scatter-gather than
kiobuf's can give it for multiple send-file invocations.

Let me iterate:

 - the layers like TCP _need_ to do scatter-gather anyway: you absolutely
   want to be able to send out just one packet even if the data comes from
   two different sources (for example, one source might be the http
   header, while the other source is the actual file contents. This is
   definitely not a made-up-example, this is THE example of something like
   this, and happens with just about all protocols that have a notion of 
   a header, which is pretty much 100% of them).

 - because TCP needs to do scatter-gather anyway across calls, there is no
   real reason for sendfile() to do it. And sendfile() doing it would
   _not_ obviate the need for it in the networking layer - it would only
   add complexity for absolutely no performance gain.

So neither sendfile _nor_ the networking layer want kiobuf's. Never have,
never will. The "half-way scatter-gather" support they give ends up either
being too much baggage, or too little. It's never the right fit.

kiovec adds the support for true scatter-gather, but with a horribly bad
interface, and much too much overhead - and absolutely NO advantages over
the _proper_ array of <page,offset,length> which is much simpler than the
complex two-level arrays that you get with kiovec+kiobuf.

End of story.

Linus




Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-09 Thread Christoph Hellwig

On Tue, Jan 09, 2001 at 12:55:51PM -0800, Linus Torvalds wrote:
> 
> 
> On Tue, 9 Jan 2001, Christoph Hellwig wrote:
> > 
> > Also the tuple argument you gave earlier isn't right in this specific case:
> > 
> > when doing sendfile from pagecache to an fs, you have a bunch of pages,
> > an offset in the first and a length that makes the data end before last
> > page's end.
> 
> No.
> 
> Look at sendfile(). You do NOT have a "bunch" of pages.
> 
> Sendfile() is very much a page-at-a-time thing, and expects the actual IO
> layers to do it's own scatter-gather. 
> 
> So sendfile() doesn't want any array at all: it only wants a single
> page-offset-length tuple interface.

The current implementation does.
But others are possible.  I could post one in a few days to show that it is
possible.

Christoph

-- 
Whip me.  Beat me.  Make me maintain AIX.



Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-09 Thread Linus Torvalds



On Tue, 9 Jan 2001, Christoph Hellwig wrote:
> 
> Also the tuple argument you gave earlier isn't right in this specific case:
> 
> when doing sendfile from pagecache to an fs, you have a bunch of pages,
> an offset in the first and a length that makes the data end before last
> page's end.

No.

Look at sendfile(). You do NOT have a "bunch" of pages.

Sendfile() is very much a page-at-a-time thing, and expects the actual IO
layers to do its own scatter-gather. 

So sendfile() doesn't want any array at all: it only wants a single
page-offset-length tuple interface.

The _lower-level_ stuff (ie TCP and the drivers) want the "array of
tuples", and again, they do NOT want an array of pages, because if
somebody does two sendfile() calls that fit in one packet, it really needs
an array of tuples.

In short, the kiobuf interface is _always_ the wrong one.

Linus




Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-09 Thread Christoph Hellwig

In article <[EMAIL PROTECTED]> you wrote:


> On Tue, 9 Jan 2001, Ingo Molnar wrote:
>> 
>>   So i do believe that the networking
>> code is properly designed in this respect, and this concept goes to the
>> highest level of the networking code.

> Absolutely. This is why I have no conceptual problems with the networking
> layer changes, and why I am in violent disagreement with people who think
> the networking layer should have used the (much inferior, in my opinion)
> kiobuf/kiovec approach.

At least I (who started this thread) haven't said they should use iobufs
internally.  I said: use iovecs in the interface, because this interface
is a little more general and allows it to integrate nicely with other parts
(namely Ben's aio work).

Also the tuple argument you gave earlier isn't right in this specific case:

when doing sendfile from pagecache to an fs, you have a bunch of pages,
an offset in the first and a length that makes the data end before last
page's end.

> For people who worry about code re-use and argue for kiobuf/kiovec on
> those grounds, I can only say that the code re-use should go the other
> way. It should be "the bad code should re-use code from the good code". It
> should NOT be "the new code should re-use code from the old code".

It's not really about reusing, but about compatibility with other interfaces...

Christoph

-- 
Whip me.  Beat me.  Make me maintain AIX.



Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-09 Thread Linus Torvalds



On Tue, 9 Jan 2001, Ingo Molnar wrote:
> 
>So i do believe that the networking
> code is properly designed in this respect, and this concept goes to the
> highest level of the networking code.

Absolutely. This is why I have no conceptual problems with the networking
layer changes, and why I am in violent disagreement with people who think
the networking layer should have used the (much inferior, in my opinion)
kiobuf/kiovec approach.

For people who worry about code re-use and argue for kiobuf/kiovec on
those grounds, I can only say that the code re-use should go the other
way. It should be "the bad code should re-use code from the good code". It
should NOT be "the new code should re-use code from the old code".

Linus




Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-09 Thread Jens Axboe

On Tue, Jan 09 2001, Andrea Arcangeli wrote:
> > > > Thats fine. Get me 128K-512K chunks nicely streaming into my raid controller
> > > > and I'll be a happy man
> > >
> > > No problem, apply blk-13B and you'll get 512K chunks for SCSI and RAID.
> > 
> > i cannot agree more - Jens' patch did wonders to IO performance here. It
> 
> BTW, I noticed what is left in blk-13B seems to be my work (Jens's fixes for
> merging when the I/O queue is full are just been integrated in test1x). The
> 512K SCSI command, wake_up_nr, elevator fixes and cleanups and removal of the
> bogus 64 max_segment limit in scsi.c that matters only with the IOMMU to allow
> devices with sg_tablesize <64 to do SG with 64 segments were all thought and
> implemented by me. My last public patch with most of the blk-13B stuff in it
> was here:
> 
>   
>ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/2.4.0-test7/blkdev-3
> 
> I sumbitted a later revision of the above blkdev-3 to Jens and he kept nicely
> maintaining it in sync with 2.4.x-latest.

There are several parts that have been merged beyond recognition at this
point :-). The wake_up_nr was actually partially redone by Ingo, I suspect
he can fill in the gaps there. Then there are the general cleanups and cruft
removal done by you (elevator->nr_segments stuff). The bogus 64 max segments
from SCSI was there before merge too, I think I've actually had that in my
tree for ages!

The request free batching and pending queues were done by me, and Ingo
helped tweak it during the spec runs to find a sweet spot of how much to
batch etc.

The elevator received lots of massaging beyond blkdev-3. For one, there
is now only one complete queue scan for both merge and insert of a request,
where we before did one for each of them. The merger also does correct
accounting and aging.

In addition there are a bunch of other small fixes in there, I'm too lazy
to list them all now :)

> My blkdev tree is even more advanced but I didn't had time to update with 2.4.0
> and marge it with Jens yet (I just described to Jens what "more advanced"
> means though, in practice it means something like a x2 speedup in tiotest seek

I haven't heard anything beyond the raised QUEUE_NR_REQUEST, so I'd like to
see what you have pending so we can merge :-). The tiotest seek increase was
mainly due to the elevator having 3000 requests to juggle and thus being able
to eliminate a lot of seeks right?

> write numbers, streaming I/O doesn't change on highmem boxes but it doesn't
> hurt lowmem boxes anymore). Current blk-13B isn't ok for integration yet
> because it hurts with lowmem (try with mem=32m with your scsi array that gets
> 512K*512 requests in flight :) and it's not able to exploit the elevator as

I don't see any lowmem problems -- if under pressure, the queue should be
fired and thus it won't get as long as if you have lots of memory free.

> well as my tree even on highmemory machines. So I'd wait until I merge the last
> bits with Jens (I raised the QUEUE_NR_REQUESTS to 3000) before inclusion.

?? What do you mean exploit the elevator?

-- 
* Jens Axboe <[EMAIL PROTECTED]>
* SuSE Labs



Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-09 Thread Ingo Molnar


On Tue, 9 Jan 2001, Andrea Arcangeli wrote:

> BTW, I noticed what is left in blk-13B seems to be my work (Jens's
> fixes for merging when the I/O queue is full are just been integrated
> in test1x).  [...]

it was Jens' [i think those were implemented by Jens entirely]
batch-freeing changes that made the most difference. (we did
profile it step by step.)

> ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/2.4.0-test7/blkdev-3

great! i'm happy that the block IO layer and IO scheduler now has
a real home :-) nice work.

Ingo




Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-09 Thread Ingo Molnar


On 9 Jan 2001, Linus Torvalds wrote:

> I told David that he can fix the network zero-copy code two ways: either
> he makes it _truly_ scatter-gather (an array of not just pages, but of
> proper page-offset-length tuples), or he makes it just a single area and
> lets the low-level TCP/whatever code build up multiple segments
> internally.  Either of which are good designs.

it's actually truly zero-copy internally, we use an array of
(page,offset,length) tuples, with proper per-page usage counting. We did
this for more than half a year. I believe the array-of-pages solution you refer
to went only from the pagecache layer into the highest level of TCP - then
it got converted into the internal representation. These tuples right now
do not have their own life, they are always associated with actual
outgoing packets (and in fact are allocated together with skb's and are at
the end of the header area).

the lowlevel networking drivers (and even midlevel networking code) knows
nothing about kiovecs or arrays of pages, it's using the array-of-tuples
representation:

typedef struct skb_frag_struct skb_frag_t;

struct skb_frag_struct
{
	struct page *page;
	__u16 page_offset;
	__u16 size;
};

/* This data is invariant across clones and lives at
 * the end of the header data, ie. at skb->end.
 */
struct skb_shared_info {
	atomic_t	dataref;
	unsigned int	nr_frags;
	struct sk_buff	*frag_list;
	skb_frag_t	frags[MAX_SKB_FRAGS];
};

(the __u16 thing is more of a cache footprint paranoia than real
necessity, it could be int as well.). So i do believe that the networking
code is properly designed in this respect, and this concept goes to the
highest level of the networking code.
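
As an illustration of that last point, a hedged sketch of how a lowlevel
transmit path can walk this tuple array.  fill_tx_desc() is a hypothetical
driver helper, page_address() is used for brevity (assuming lowmem pages),
skb->data_len is assumed to hold the length of the paged part as in the
patch, and the shared info is taken from skb->end as the struct comment
above says:

#include <linux/mm.h>
#include <linux/skbuff.h>	/* assumes the patched tree's definitions */

extern void fill_tx_desc(void *addr, unsigned int len);	/* hypothetical */

static void xmit_fragments(struct sk_buff *skb)
{
	struct skb_shared_info *shinfo = (struct skb_shared_info *)skb->end;
	unsigned int i;

	/* The linear header area is one piece... */
	fill_tx_desc(skb->data, skb->len - skb->data_len);

	/* ...and each (page, offset, size) tuple is another; nothing here
	 * assumes that consecutive tuples are contiguous in memory. */
	for (i = 0; i < shinfo->nr_frags; i++) {
		skb_frag_t *f = &shinfo->frags[i];

		fill_tx_desc((char *)page_address(f->page) + f->page_offset,
			     f->size);
	}
}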

Ingo




Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-09 Thread Andrea Arcangeli

On Tue, Jan 09, 2001 at 07:38:28PM +0100, Ingo Molnar wrote:
> 
> On Tue, 9 Jan 2001, Jens Axboe wrote:
> 
> > > > ever seen, this is why i quoted it - the talk was about block-IO
> > > > performance, and Stephen said that our block IO sucks. It used to suck,
> > > > but in 2.4, with the right patch from Jens, it doesnt suck anymore. )
> > >
> > > Thats fine. Get me 128K-512K chunks nicely streaming into my raid controller
> > > and I'll be a happy man
> >
> > No problem, apply blk-13B and you'll get 512K chunks for SCSI and RAID.
> 
> i cannot agree more - Jens' patch did wonders to IO performance here. It

BTW, I noticed what is left in blk-13B seems to be my work (Jens's fixes for
merging when the I/O queue is full have just been integrated in test1x). The
512K SCSI command, wake_up_nr, elevator fixes and cleanups and removal of the
bogus 64 max_segment limit in scsi.c that matters only with the IOMMU to allow
devices with sg_tablesize <64 to do SG with 64 segments were all thought and
implemented by me. My last public patch with most of the blk-13B stuff in it
was here:


ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/2.4.0-test7/blkdev-3

I submitted a later revision of the above blkdev-3 to Jens and he kept nicely
maintaining it in sync with 2.4.x-latest.

My blkdev tree is even more advanced, but I didn't have time to update it to 2.4.0
and merge it with Jens's yet (I just described to Jens what "more advanced"
means though; in practice it means something like a 2x speedup in tiotest seek
write numbers, streaming I/O doesn't change on highmem boxes but it doesn't
hurt lowmem boxes anymore). Current blk-13B isn't ok for integration yet
because it hurts with lowmem (try with mem=32m with your scsi array that gets
512K*512 requests in flight :) and it's not able to exploit the elevator as
well as my tree even on highmemory machines. So I'd wait until I merge the last
bits with Jens (I raised the QUEUE_NR_REQUESTS to 3000) before inclusion.

Confirm Jens?

Andrea



Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-09 Thread J Sloan

Alan Cox wrote:

>
> > it might not be important to others, but we do hold one particular
> > SPECweb99 world record: on 2-way, 2 GB RAM, testing a load with a full
>
> And its real world value is exactly the same as the mindcraft NT values. Don't
> forget that.

In other words, devastating.

jjs




Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-09 Thread Ingo Molnar


On Tue, 9 Jan 2001, Christoph Hellwig wrote:

> I didn't want to suggest that - I'm to clueless concerning networking
> to even consider an internal design for network zero-copy IO. I'm just
> talking about the VFS interface to the rest of the kernel.

(well, i think you just cannot be clueless about one and then demand
various things about the other...)

Ingo




Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-09 Thread Linus Torvalds

In article <[EMAIL PROTECTED]>,
Christoph Hellwig  <[EMAIL PROTECTED]> wrote:
>
>You get that multiple page call with kiobufs for free...

No, you don't.

kiobufs are crap. Face it. They do NOT allow proper multi-page scatter
gather, regardless of what the kiobuf PR department has said.

I've complained about it before, and nobody listened. David's zero-copy
network code had the same bug. I complained about it to David, and David
took about a day to understand my arguments, and fixed it.

It's more likely that the zero-copy network code will be used in real
life than kiobufs will ever be.  The kiobufs are damn ugly by
comparison, and the fact that the kiobuf people don't even seem to
realize the problems makes me just more convinced that it's not worth
even arguing about.

What is the problem with kiobuf's? Simple: they have an "offset" and a
"length", and an array of pages.  What that completely and utterly
misses is that if you have an array of pages, you should have an array
of "offset" and "length" too.  As it is, kiobuf's cannot be used for
things like readv() and writev(). 
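
Put as data structures, the contrast being drawn here looks roughly like
this (field names are illustrative, not the exact kernel definitions):

struct page;

/* kiobuf-style: a single offset/length for a whole run of pages, so the
 * described data has to be one contiguous region within those pages. */
struct kiobuf_like {
	int		offset;		/* offset into the first page */
	int		length;		/* length of the single run   */
	struct page	**maplist;	/* pages backing that run     */
};

/* tuple-array style: each element carries its own page, offset and size,
 * so one I/O can gather unrelated pieces - which is exactly what
 * readv()/writev() describe. */
struct frag_like {
	struct page	*page;
	unsigned int	offset;
	unsigned int	size;
};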

Yes, to work around this limitation, there's the notion of "kiovec", an
array of kiobuf's.  Never mind the fact that if kiobuf's had been
properly designed in the first place, you wouldn't need kiovec's at all. 
And kiovec's are too damn heavy to use for something like the networking
zero-copy, with all the double indirection etc. 

I told David that he can fix the network zero-copy code two ways: either
he makes it _truly_ scatter-gather (an array of not just pages, but of
proper page-offset-length tuples), or he makes it just a single area and
lets the low-level TCP/whatever code build up multiple segments
internally.  Either of which are good designs.

It so happens that none of the users actually wanted multi-page
scatter-gather, and the only thing that really wanted to do the sg was
the networking layer when it created a single packet out of multiple
areas, so the zero-copy stuff uses the simpler non-array interface. 

And kiobufs can rot in hell for their design mistakes.  Maybe somebody
will listen some day and fix them up, and in the meantime they can look
at the networking code for an example of how to do it. 

Linus



Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-09 Thread Dan Hollis

On Wed, 10 Jan 2001, Andrew Morton wrote:
> y'know our pals have patented it?
> http://www.delphion.com/details?pn=US05845280__

Bad faith patent? Actionable, treble damages?

-Dan




Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-09 Thread Dan Hollis

On Tue, 9 Jan 2001, Ingo Molnar wrote:
> :-) I think sendfile() should also have its logical extensions:
> receivefile(). I dont know how the HPUX implementation works, but in
> Linux, right now it's only possible to sendfile() from a file to a socket.
> The logical extension of this is to allow socket->file IO and file->file,
> socket->socket IO as well. (the later one could be interesting for things
> like web proxies.)

Just extend sendfile to allow any fd to any fd. sendfile already does
file->socket and file->file. It only needs to be extended to do
socket->file.

-Dan




Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-09 Thread Ingo Molnar


On Tue, 9 Jan 2001, Chris Evans wrote:

> > but in 2.4, with the right patch from Jens, it doesnt suck anymore. )
>
> Is this "right patch from Jens" on the radar for 2.4 inclusion?

i do hope so!

Ingo




Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-09 Thread Ingo Molnar


On Tue, 9 Jan 2001, Jens Axboe wrote:

> > > ever seen, this is why i quoted it - the talk was about block-IO
> > > performance, and Stephen said that our block IO sucks. It used to suck,
> > > but in 2.4, with the right patch from Jens, it doesnt suck anymore. )
> >
> > Thats fine. Get me 128K-512K chunks nicely streaming into my raid controller
> > and I'll be a happy man
>
> No problem, apply blk-13B and you'll get 512K chunks for SCSI and RAID.

i cannot agree more - Jens' patch did wonders to IO performance here. It
fixes a long-standing bug in the Linux block-IO-scheduler that caused very
suboptimal requests being issued to lowlevel drivers once the request
queue gets full. I think this patch is a clear candidate for 2.4.x
inclusion.

Ingo




Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-09 Thread Ingo Molnar


On Tue, 9 Jan 2001, Benjamin C.R. LaHaise wrote:

> I've already got fully async read and write working via a helper thread
  ^^^
> for doing the bmaps when the page is not uptodate in the page cache.
  ^^^

that's what TUX 2.0 does. (it does async reads at the moment.)

> The primatives for async locking of pages and waiting on events such
> that converting ext2 to performing full async bmap should be trivial.

well - if you think it's trivial (ie. no process context, no helper thread
will be needed), more power to you. How are you going to assure that the
issuing process does not block during the bmap()? [without extensive
lowlevel-FS changes that is.]

> Note that O_NONBLOCK is not good enough because you can't implement an
> asynchronous O_SYNC write with it.

(i'm using it for reads only.)

Ingo




Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-09 Thread Christoph Hellwig

On Tue, Jan 09, 2001 at 12:05:59PM +0100, Ingo Molnar wrote:
> 
> On Tue, 9 Jan 2001, Christoph Hellwig wrote:
> 
> > > 2.4. In any case, the zerocopy code is 'kiovec in spirit' (uses
> > > vectors of struct page *, offset, size entities),
> 
> > Yep. That is why I was so worried aboit the writepages file op.
> 
> i believe you misunderstand. kiovecs (in their current form) are simply
> too bloated for networking purposes.

Stop.  I NEVER said you should use them internally.
My concern is to use a file operation with a kiobuf ** as the main argument
instead of a page *.  With a little more bloat it allows you to do the same
as you do now.  But it also offers a real advantage: you don't have to call
into the network stack for every single page, and it fits easily into Ben's
AIO stuff, so your code is very well integrated into the (future) async IO
framework.  (The latter was my main concern.)

You pay 116 bytes and a few cycles for a _lot_ more abstraction and
integration.  Exactly this kind of design principle (design vs. speed) is
the reason why UNIX has survived so long.


> Due to its nature and nonpersistency,
> networking is very lightweight and memory-footprint-sensitive code (as
> opposed to eg. block IO code), right now an 'struct skb_shared_info'
> [which is roughly equivalent to a kiovec] is 12+4*6 == 36 bytes, which
> includes support for 6 distinct fragments (each fragment can be on any
> page, any offset, any size). A *single* kiobuf (which is roughly
> equivalent to an skb fragment) is 52+16*4 == 116 bytes. 6 of these would
> be 696 bytes, for a single TCP packet (!!!). This is simply not something
> to be used for lightweight zero-copy networking.

This doesn't matter, because rw_kiovec can easily take only one kiobuf,
and you don't really need the different fragments there.

> so it's easy to say 'use kiovecs', but right now it's simply not
> practical. kiobufs are a loaded concept, and i'm not sure whether it's
> desirable at all to mix networking zero-copy concepts with
> block-IO/filesystem zero-copy concepts.

I didn't want to suggest that - I'm too clueless concerning networking to
even consider an internal design for network zero-copy IO.
I'm just talking about the VFS interface to the rest of the kernel.

> we talked (and are talking) to Stephen about this problem, but it's a
> clealy 2.5 kernel issue. Merging to a finalized zero-copy framework will
> be easy. (The overwhelming percentage of zero-copy code is in the
> networking code itself and is insensitive to any kiovec issues.)

Agreed.

> > It's rather hackish (only write, looks usefull only for networking)
> > instead of the proposed rw_kiovec fop.
> 
> i'm not sure what you are trying to say. You mean we should remove
> sendfile() as well? It's only write, looks useful mostly for networking. A
> substantial percentage of kernel code is useful only for networking :-)

No.  But it looks like a recvmsg syscall wouldn't be too bad either ...

Christoph

-- 
Whip me.  Beat me.  Make me maintain AIX.



Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-09 Thread Stephen C. Tweedie

Hi,

On Tue, Jan 09, 2001 at 05:16:40PM +0100, Ingo Molnar wrote:
> On Tue, 9 Jan 2001, Stephen C. Tweedie wrote:
> 
> i'm talking about kiovecs not kiobufs (because those are equivalent to a
> fragmented packet - every packet fragment can be anywhere). Initializing a
> kiovec involves touching a dozen cachelines. Keeping structures compressed
> is very important.
> 
> i dont know. I dont think it's necesserily bad for a subsystem to have its
> own 'native structure' how it manages data.

For the transmit case, unless the sender needs seriously fragmented
data, the kiovec is just a kiobuf*.

> i do believe that you are wrong here. We did have a multi-page API between
> sendfile and the TCP layer initially, and it made *absolutely no
> performance difference*.

That may be fine for tcp, but tcp explicitly maintains the state of
the caller and can stream things sequentially to a specific file
descriptor.

The block device layer, on the other hand, has to accept requests _in
any order_ and still reorder them to the optimal elevator order.  The
merging in ll_rw_block is _far_ more expensive than adding a request
to the end of a list.  It's not helped by the fact that each such
request has a buffer_head and a struct request associated with it, so
deconstructing the large IO into buffer_heads results in huge amounts
of data being allocated and deleted.

We could streamline this greatly if the block device layer kept
per-caller context in the way that tcp does, but the block device API
just doesn't work that way.

> > We have already shown that the IO-plugging API sucks, I'm afraid.
> 
> it might not be important to others, but we do hold one particular
> SPECweb99 world record: on 2-way, 2 GB RAM, testing a load with a full
> fileset of ~9 GB. It generates insane block-IO load, and we do beat other
> OSs that have multipage support, including SGI. (and no, it's not due to
> kernel-space acceleration alone this time - it's mostly due to very good
> block-IO performance.) We use Jens Axobe's IO-batching fixes that
> dramatically improve the block scheduler's performance under high load.

Perhaps, but we have proven and significant reductions in CPU
utilisation from eliminating the per-buffer_head API to the block
layer.  Next time M$ gets close to our specweb records, maybe this is
the next place to look for those extra few % points!

> > Gig Ethernet, [...]
> 
> we handle gigabit ethernet with 1.5K zero-copy packets just fine. One
> thing people forget is IRQ throttling: when switching from 1500 byte
> packets to 9000 byte packets then the amount of interrupts drops by a
> factor of 6. Now if the tunings of a driver are not changed accordingly,
> 1500 byte MTU can show dramatically lower performance than 9000 byte MTU.
> But if tuned properly, i see little difference between 1500 byte and 9000
> byte MTU. (when using a good protocol such as TCP.)

Maybe you see good throughput numbers, but I still bet the CPU
utilisation could be bettered significantly with jumbograms.

That's one of the problems with benchmarks: our CPU may be fast enough
that we can keep the IO subsystems streaming, and the benchmark will
not show up any OS bottlenecks, but we may still be consuming far too
much CPU time internally.  That's certainly the case with the block IO
measurements made on XFS: sure, ext2 can keep a fast disk loaded to
pretty much 100%, but at the cost of far more system CPU time than
XFS+pagebuf+kiobuf-IO takes on the same disk.

> > The presence of terrible performance in the old ll_rw_block code is
> > NOT a good excuse for perpetuating that model.
> 
> i'd like to measure this performance problem (because i'd like to
> double-check it) - what measurement method was used?

"time" will show it.  A 13MB/sec raw IO dd using 64K blocks uses
something between 5% and 15% of CPU time on the various systems I've
tested on (up to 30% on an old 486 with a 1540, but that's hardly
representative. :)  The kernel profile clearly shows the buffer
management as the biggest cost, with the SCSI code walking those
buffer heads a close second.

On my main scsi server test box, I get raw 32K reads taking about 7%
system time on the cpu, with make_request and __get_request_wait being
the biggest hogs.

--Stephen



Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-09 Thread Stephen C. Tweedie

Hi,

On Tue, Jan 09, 2001 at 12:30:39PM -0500, Benjamin C.R. LaHaise wrote:
> On Tue, 9 Jan 2001, Ingo Molnar wrote:
> 
> > this is why i ment that *right now* kiobufs are not suited for networking,
> > at least the way we do it. Maybe if kiobufs had the same kind of internal
> > structure as sk_frag (ie. array of (page,offset,size) triples, not array
> > of pages), that would work out better.
> 
> That I can agree with, and it would make my life easier since I really
> only care about the completion of an entire io, not the individual
> fragments of it.

Right, but this is why the kiobuf IO functions are supposed to accept
kiovecs (ie. counted vectors of kiobuf *s, just like ll_rw_block
receives buffer_heads).

The kiobuf is supposed to be a unit of memory, not of IO.  You can map
several different kiobufs from different sources and send them all
together to brw_kiovec() as a single IO.

Cheers,
 Stephen



Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-09 Thread Chris Evans


On Tue, 9 Jan 2001, Ingo Molnar wrote:

> This is one of the busiest and most complex block-IO Linux systems i've
> ever seen, this is why i quoted it - the talk was about block-IO
> performance, and Stephen said that our block IO sucks. It used to suck,
> but in 2.4, with the right patch from Jens, it doesnt suck anymore. )

Is this "right patch from Jens" on the radar for 2.4 inclusion?

Cheers
Chris




Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-09 Thread Christoph Hellwig

On Tue, Jan 09, 2001 at 10:38:30AM -0500, Benjamin C.R. LaHaise wrote:
> What you're completely ignoring is that sendpages is lacking a huge amount
> of functionality that is *needed*.  I can't implement clean async io on
> top of sendpages -- it'll require keeping 1 task around per outstanding
> io, which is exactly the bottleneck we're trying to work around.

Yepp.  That's why I proposed to use rw_kiovec.  Currently Alexey seems
to have his own hack for socket-only async IO with some COW semantics
for the userlevel buffers, but I would much prefer a generic version...

Christoph

P.S. Any chance of finding a new version of your aio-patch somewhere?
-- 
Of course it doesn't work. We've performed a software upgrade.



Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-09 Thread Manfred Spraul

sct wrote:
> We've already got measurements showing how insane this is. Raw IO 
> requests, plus internal pagebuf contiguous requests from XFS, have to 
> get broken down into page-sized chunks by the current ll_rw_block() 
> API, only to get reassembled by the make_request code. It's 
> *enormous* overhead, and the kiobuf-based disk IO code demonstrates 
> this clearly. 

Stephen, I see one big difference between ll_rw_block and the proposed
tcp_sendpage():
You must allocate and initialize a complete buffer head for each page
you want to read, and then you pass the array of buffer heads to
ll_rw_block with one function call.
I'm certain the overhead is the allocation/initialization/freeing of the
buffer heads, not the function call.

AFAICS the proposed tcp_sendpage interface is the other way around:
you need one function call for each page, but no memory
allocation/setup. The memory is allocated internally by the tcp_sendpage
implementation, and it merges requests when possible, thus for a 9000
byte jumbopacket you'd need 3 function calls to tcp_sendpage(MSG_MORE),
but only one skb is allocated and set up.

Ingo is that correct?
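
A hedged sketch of the ll_rw_block side of that contrast (illustration
only; setup_bh_for_page() is a hypothetical helper standing in for the
usual buffer_head allocation and field setup, and only ll_rw_block()
itself is the real 2.4 interface):

#include <linux/fs.h>
#include <linux/mm.h>

/* hypothetical helper: allocate a buffer_head and point it at one page */
extern struct buffer_head *setup_bh_for_page(kdev_t dev, unsigned long block,
					     int size, struct page *page);

/* One buffer_head has to be conjured up per page before a single call
 * can be issued; nr is assumed to be at most 32 here. */
static void read_run(kdev_t dev, struct page *pages[], int nr,
		     unsigned long first_block, int blocksize)
{
	struct buffer_head *bhs[32];
	int i;

	for (i = 0; i < nr; i++)
		bhs[i] = setup_bh_for_page(dev, first_block + i,
					   blocksize, pages[i]);

	ll_rw_block(READ, nr, bhs);	/* ...then one submission */
}

/* The tcp_sendpage(MSG_MORE) path described above inverts this: one call
 * per page, but no per-page allocation or setup by the caller. */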

--
Manfred




Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-09 Thread Jens Axboe

On Tue, Jan 09 2001, Alan Cox wrote:
> > ever seen, this is why i quoted it - the talk was about block-IO
> > performance, and Stephen said that our block IO sucks. It used to suck,
> > but in 2.4, with the right patch from Jens, it doesnt suck anymore. )
> 
> Thats fine. Get me 128K-512K chunks nicely streaming into my raid controller
> and I'll be a happy man

No problem, apply blk-13B and you'll get 512K chunks for SCSI and RAID.

-- 
* Jens Axboe <[EMAIL PROTECTED]>
* SuSE Labs



Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-09 Thread Benjamin C.R. LaHaise

On Tue, 9 Jan 2001, Ingo Molnar wrote:

> this is why i ment that *right now* kiobufs are not suited for networking,
> at least the way we do it. Maybe if kiobufs had the same kind of internal
> structure as sk_frag (ie. array of (page,offset,size) triples, not array
> of pages), that would work out better.

That I can agree with, and it would make my life easier since I really
only care about the completion of an entire io, not the individual
fragments of it.

> Please take a look at next release of TUX. Probably the last missing piece
> was that i added O_NONBLOCK to generic_file_read() && sendfile(), so not
> fully cached requests can be offloaded to IO threads.
> 
> Otherwise the current lowlevel filesystem infrastructure is not suited for
> implementing "process-less async IO "- and kiovecs wont be able to help
> that either. Unless we implement async, IRQ-driven bmap(), we'll always
> need some sort of process context to set up IO.

I've already got fully async read and write working via a helper thread
for doing the bmaps when the page is not uptodate in the page cache.  The
primitives for async locking of pages and waiting on events are in place,
so converting ext2 to perform a full async bmap should be trivial.  Note
that O_NONBLOCK is not good enough because you can't implement an
asynchronous O_SYNC write with it.

-ben




Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-09 Thread Alan Cox

> ever seen, this is why i quoted it - the talk was about block-IO
> performance, and Stephen said that our block IO sucks. It used to suck,
> but in 2.4, with the right patch from Jens, it doesnt suck anymore. )

That's fine. Get me 128K-512K chunks nicely streaming into my raid controller
and I'll be a happy man.

I don't have a problem with the claim that it's not the per-page stuff and
plugging that breaks ll_rw_blk. If there is evidence contradicting the SGI
stuff it's very interesting.




Re: [PLEASE-TESTME] Zerocopy networking patch, 2.4.0-1

2001-01-09 Thread Ingo Molnar


On Tue, 9 Jan 2001, Alan Cox wrote:

> > > We have already shown that the IO-plugging API sucks, I'm afraid.
> >
> > it might not be important to others, but we do hold one particular
> > SPECweb99 world record: on 2-way, 2 GB RAM, testing a load with a full
>
> And its real world value is exactly the same as the mindcraft NT
> values. Don't forget that.

( what you have not quoted is the part that says that the fileset is 9GB.
This is one of the busiest and most complex block-IO Linux systems i've
ever seen, which is why i quoted it - the talk was about block-IO
performance, and Stephen said that our block IO sucks. It used to suck,
but in 2.4, with the right patch from Jens, it doesn't suck anymore. )

Ingo



