Re: scsi vs ide performance on fsync's

2001-03-07 Thread Stephen C. Tweedie

Hi,

On Wed, Mar 07, 2001 at 09:15:36PM +0100, Jens Axboe wrote:
> On Wed, Mar 07 2001, Stephen C. Tweedie wrote:
> > 
> > For most fs'es, that's not an issue.  The fs won't start writeback on
> > the primary disk at all until the journal commit has been acknowledged
> > as firm on disk.
> 
> But do you then force wait on that journal commit?

It doesn't matter too much --- it's only the writeback which is doing
this (ext3 uses a separate journal thread for it), so any sleep is
only there to wait for the moment when writeback can safely begin:
users of the filesystem won't see any stalls.

> A barrier operation is sufficient then. So you're saying don't
> over design, a simple barrier is all you need?

Pretty much so.  The simple barrier is the only thing which can be
effectively optimised at the hardware level with SCSI anyway.
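
A minimal user-space sketch of that ordering, with fsync() standing in
for the "commit acknowledged as firm on disk" barrier (illustrative
only -- the function and file descriptors are made up, this is not
ext3 code):

#include <stddef.h>
#include <unistd.h>

static int commit_then_writeback(int journal_fd, int data_fd,
                                 const char *data, size_t len)
{
        static const char commit_record[] = "COMMIT";

        /* Step 1: write the commit record to the journal. */
        if (write(journal_fd, commit_record, sizeof(commit_record)) < 0)
                return -1;

        /* Step 2: the barrier -- wait until the commit is firm on disk.
         * Only the journal/writeback thread sleeps here, so users of
         * the filesystem see no stall. */
        if (fsync(journal_fd) < 0)
                return -1;

        /* Step 3: writeback to the primary disk may now begin. */
        if (write(data_fd, data, len) < 0)
                return -1;
        return 0;
}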

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Raw IO fixes for 2.4.2-ac8

2001-03-02 Thread Stephen C. Tweedie

Hi,

I've just uploaded the current raw IO fixes as
kiobuf-2.4.2-ac8-A0.tar.gz on

ftp.uk.linux.org:/pub/linux/sct/fs/raw-io/

and

ftp.*.kernel.org:/pub/linux/kernel/people/sct/raw-io/

This includes:

00-movecode.diff:  move kiobuf code from mm/memory.c to fs/iobuf.c
02-faultfix.diff:  fixes for faulting and pinning pages
03-unbind.diff:    allow unbinding of raw devices
04-pgdirty.diff:   use the new SetPageDirty to dirty pages after reads
05-bh-err.diff:    fix cleanup of buffer_heads after ENOMEM
06-eio.diff:       fix error returned on EIO in first block of IO

The first 3 of these are from the current 2.2 raw patches.  The 4th is
the fix for dirtying pages after raw reads, using the new
functionality of the 2.4 VM.  The 5th and 6th fix up problems
introduced when brw_kiovec was moved to use submit_bh().

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [patch] set kiobuf io_count once, instead of increment

2001-03-02 Thread Stephen C. Tweedie

Hi,

On Wed, Feb 28, 2001 at 09:18:59AM -0800, Robert Read wrote:
> On Tue, Feb 27, 2001 at 10:50:54PM -0300, Marcelo Tosatti wrote:

> This is true, but it looks like the brw_kiovec allocation failure
> handling is broken already; it's calling __put_unused_buffer_head on
> bhs without waiting for them to complete first.  Also, the err won't
> be returned if the previous batch of bhs finished ok.  It looks like
> brw_kiovec needs some work, but I'm going to need some coffee first...

Right, looks like this happened when somebody was changing the bh
submission mechanism to use submit_bh().  I'll fix it.

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [patch] set kiobuf io_count once, instead of increment

2001-03-02 Thread Stephen C. Tweedie

On Tue, Feb 27, 2001 at 04:22:22PM -0800, Robert Read wrote:
> Currently in brw_kiovec, iobuf->io_count is being incremented as each
> bh is submitted, and decremented in the bh->b_end_io().  This means
> io_count can go to zero before all the bhs have been submitted,
> especially during a large request. This causes the end_kio_request()
> to be called before all of the io is complete.  

brw_kiovec is currently entirely synchronous, so end_kio_request()
calling is probably not a big deal right now.  It would be much more
important for an async version.

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Writing on raw device with software RAID 0 is slow

2001-03-01 Thread Stephen C. Tweedie

Hi,

On Thu, Mar 01, 2001 at 11:08:13AM -0500, Ben LaHaise wrote:
> On Thu, 1 Mar 2001, Stephen C. Tweedie wrote:
> 
> Actually, how about making it a sysctl?  That's probably the most
> reasonable approach for now since the optimal size depends on hardware.

Fine with me.

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Writing on raw device with software RAID 0 is slow

2001-03-01 Thread Stephen C. Tweedie

Hi,

On Thu, Mar 01, 2001 at 10:44:38AM -0500, Ben LaHaise wrote:
> 
> On Thu, 1 Mar 2001, Stephen C. Tweedie wrote:
> 
> > Raw IO is always synchronous: it gets flushed to disk before the write
> > returns.  You don't get any write-behind with raw IO, so the smaller
> > the blocksize you write in, the slower things get.
> 
> More importantly, the mainstream raw io code only writes in 64KB chunks
> that are unpipelined, which can lead to writes not hitting the drive
> before the sector passes under the rw head.  You can work around this to
> some extent by issuing multiple writes (via threads, or the aio work I've
> done) at the expense of atomicity.  Also, before we allow locking of
> arbitrary larger ios in main memory, we need bean counting to prevent the
> obvious DoSes.

Yep.  There shouldn't be any problem increasing the 64KB size, it's
only the lack of accounting for the pinned memory which stopped me
increasing it by default.
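
A rough back-of-the-envelope illustration of the worst case being
described (the 7200 RPM figure is an assumption, not from the thread):
if every synchronous 64KB chunk misses its sector and waits a full
revolution, throughput is capped at roughly one chunk per revolution.

#include <stdio.h>

int main(void)
{
        double rpm = 7200.0;                         /* assumed drive speed */
        double revolution_ms = 60.0 * 1000.0 / rpm;  /* ~8.33 ms            */
        double chunk_kb = 64.0;

        /* One unpipelined 64KB chunk per missed revolution: */
        double mb_per_sec = (chunk_kb / 1024.0) / (revolution_ms / 1000.0);
        printf("worst case: about %.1f MB/s\n", mb_per_sec);   /* ~7.5 */
        return 0;
}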

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: ext3 fsck question

2001-03-01 Thread Stephen C. Tweedie

Hi,

On Wed, Feb 28, 2001 at 08:03:21PM -0600, Neal Gieselman wrote:
> 
> I applied the libs and other utilites from e2fsprogs by hand.
> I ran fsck.ext3 on my secondary partition and it ran fine.  The boot fsck
> on / was complaining about something but I could not catch it.
> I then went single user and ran fsck.ext3 on / while mounted.

e2fsck should complain loudly and ask for confirmation if you do that.
Going ahead with the fsck is a bad move on a mounted, rw filesystem!

> Excuse the stupid question, but with ext3, do I really require the
> fsck.ext3?  

fsck.ext3 is just a link to e2fsck.  Make sure you're running recent
e2fsprogs, though (either the latest snapshot from
downloads.sourceforge.net or a build from
ftp.uk.linux.org:/pub/linux/sct/fs/jfs/).

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [PATCH] guard mm->rss with page_table_lock (241p11)

2001-02-13 Thread Stephen C. Tweedie

Hi,

On Mon, Feb 12, 2001 at 07:15:57PM -0800, george anzinger wrote:
> Excuse me if I am off base here, but wouldn't an atomic operation be
> better here.  There are atomic inc/dec and add/sub macros for this.  It
> just seems that that is all that is needed here (from inspection of the
> patch).

The counter-argument is that we already hold the page table lock in
the vast majority of places where the rss is modified, so overall it's
cheaper to avoid the extra atomic update.
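
A user-space analogy of that trade-off, with a pthread mutex and C11
atomics standing in for the page table spinlock and the kernel's
atomic type (a sketch only, not kernel code; the mutex is assumed to
be initialised elsewhere with PTHREAD_MUTEX_INITIALIZER):

#include <pthread.h>
#include <stdatomic.h>

struct mm_like {
        pthread_mutex_t page_table_lock;
        unsigned long rss;           /* guarded by page_table_lock     */
        atomic_ulong rss_atomic;     /* the alternative being proposed */
};

static void fault_in_page(struct mm_like *mm)
{
        pthread_mutex_lock(&mm->page_table_lock);
        /* ... page table manipulation that needed the lock anyway ... */
        mm->rss++;                   /* free: the lock is already held */
        pthread_mutex_unlock(&mm->page_table_lock);
}

static void fault_in_page_atomic(struct mm_like *mm)
{
        pthread_mutex_lock(&mm->page_table_lock);
        /* ... the same page table work still needs the lock ... */
        pthread_mutex_unlock(&mm->page_table_lock);
        atomic_fetch_add(&mm->rss_atomic, 1);  /* extra atomic op on top */
}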

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: ext2: block > big ?

2001-02-12 Thread Stephen C. Tweedie

Hi,

On Sun, Feb 11, 2001 at 05:44:02PM -0700, Brian Grossman wrote:
> 
> What does a message like 'ext2: block > big' indicate?

An attempt was made to access a block beyond the legal max size for an
ext2 file.  That probably implies a corrupt inode, because the ext2
file write code checks for that limit and won't attempt to write
beyond the boundary.
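
For reference, a back-of-the-envelope sketch of where that limit comes
from, using the standard ext2 block-pointer layout (12 direct blocks
plus single, double and triple indirect trees; other caps such as
32-bit sector counts are ignored here):

#include <stdio.h>

int main(void)
{
        unsigned long block_size = 4096;              /* assumed 4K blocks */
        unsigned long per_indirect = block_size / 4;  /* 32-bit pointers   */

        unsigned long long max_blocks = 12ULL
                + per_indirect
                + per_indirect * per_indirect
                + (unsigned long long)per_indirect * per_indirect * per_indirect;

        printf("max addressable blocks: %llu\n", max_blocks);
        printf("roughly %llu GB of file data\n",
               max_blocks * block_size / (1024ULL * 1024 * 1024));
        return 0;
}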

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://vger.kernel.org/lkml/



Re: Raw devices bound to RAID arrays ?

2001-02-11 Thread Stephen C. Tweedie

Hi,

On Sun, Feb 11, 2001 at 06:29:12PM +0200, Petru Paler wrote:
> 
> Is it possible to bind a raw device to a software RAID 1 array ?

Yes.

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

2001-02-08 Thread Stephen C. Tweedie

Hi,

On Thu, Feb 08, 2001 at 03:52:35PM +0100, Mikulas Patocka wrote:
> 
> > How do you write high-performance ftp server without threads if select
> > on regular file always returns "ready"?
> 
> No, it's not really possible on Linux. Use SYS$QIO call on VMS :-)

Ahh, but even VMS SYS$QIO is synchronous at doing opens, allocation of
the IO request packets, and mapping file location to disk blocks.
Only the data IO is ever async (and Ben's async IO stuff for Linux
provides that too).

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

2001-02-08 Thread Stephen C. Tweedie

Hi,

On Thu, Feb 08, 2001 at 12:15:13AM +0100, Pavel Machek wrote:
> 
> > EAGAIN is _not_ a valid return value for block devices or for regular
> > files. And in fact it _cannot_ be, because select() is defined to always
> > return 1 on them - so if a write() were to return EAGAIN, user space would
> > have nothing to wait on. Busy waiting is evil.
> 
> So you consider inability to select() on regular files _feature_?

Select might make some sort of sense for sequential access to files,
and for random access via lseek/read, but it makes no sense at all for
pread and pwrite, where select() has no idea _which_ part of the file
the user is going to want to access next.

> How do you write high-performance ftp server without threads if select
> on regular file always returns "ready"?

Select can work if the access is sequential, but async IO is a more
general solution.
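
A small runnable illustration of the behaviour in question: select()
reports a regular file descriptor as ready immediately, so it says
nothing about whether the read would actually block on the disk (the
file name here is arbitrary):

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/select.h>

int main(void)
{
        int fd = open("/etc/hosts", O_RDONLY);  /* any regular file */
        if (fd < 0)
                return 1;

        fd_set rfds;
        FD_ZERO(&rfds);
        FD_SET(fd, &rfds);

        struct timeval tv = { 0, 0 };           /* do not wait at all */
        int n = select(fd + 1, &rfds, NULL, NULL, &tv);

        printf("select returned %d: regular file reported %s\n",
               n, FD_ISSET(fd, &rfds) ? "ready" : "not ready");
        close(fd);
        return 0;
}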

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

2001-02-07 Thread Stephen C. Tweedie

Hi,

On Wed, Feb 07, 2001 at 12:12:44PM -0700, Richard Gooch wrote:
> Stephen C. Tweedie writes:
> > 
> > Sorry?  I'm not sure where communication is breaking down here, but
> > we really don't seem to be talking about the same things.  SGI's
> > kiobuf request patches already let us pass a large IO through the
> > request layer in a single unit without having to split it up to
> > squeeze it through the API.
> 
> Isn't Linus saying that you can use (say) 4 kiB buffer_heads, so you
> don't need kiobufs? IIRC, kiobufs are page containers, so a 4 kiB
> buffer_head is effectively the same thing.

kiobufs let you encode _any_ contiguous region of user VA or of an
inode's page cache contents in one kiobuf, no matter how many pages
there are in it.  A write of a megabyte to a raw device can be encoded
as a single kiobuf if we want to pass the entire 1MB IO down to the
block layers untouched.  That's what the page vector in the kiobuf is
for.
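
A simplified sketch of that shape -- the field names below are made up
and this is not the actual 2.4 struct kiobuf, just the idea: one
object, one page vector, one completion callback for the whole IO.

struct page;                      /* stand-in for the kernel's struct page */

struct kio_sketch {
        int            nr_pages;   /* entries in the page vector below    */
        unsigned int   offset;     /* byte offset into the first page     */
        unsigned int   length;     /* total bytes covered by the IO       */
        struct page  **pages;      /* a 1MB IO is ~256 entries, but it is */
                                   /* still a single object               */
        int            error;      /* completion status                   */
        void         (*end_io)(struct kio_sketch *kio);  /* one callback  */
};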

Doing the same thing with buffer_heads would still require a couple of
hundred of them, and you'd have to submit each such buffer_head to the
IO subsystem independently.  And then the IO layer will just have to
reassemble them on the other side (and it may have to scan the
device's entire request queue once for every single buffer_head to do
so).

> But an API extension to allow passing a pre-built chain would be even
> better.

Yep.

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

2001-02-07 Thread Stephen C. Tweedie

Hi,

On Tue, Feb 06, 2001 at 06:37:41PM -0800, Linus Torvalds wrote:
> >
> However, I really _do_ want to have the page cache have a bigger
> granularity than the smallest memory mapping size, and there are always
> special cases that might be able to generate IO in bigger chunks (ie
> in-kernel services etc)

No argument there.

> > Yes.  We still have this fundamental property: if a user sends in a
> > 128kB IO, we end up having to split it up into buffer_heads and doing
> > a separate submit_bh() on each single one.  Given our VM, PAGE_SIZE
> > (*not* PAGE_CACHE_SIZE) is the best granularity we can hope for in
> > this case.
> 
> Absolutely. And this is independent of what kind of interface we end up
> using, whether it be kiobuf of just plain "struct buffer_head". In that
> respect they are equivalent.

Sorry?  I'm not sure where communication is breaking down here, but
we really don't seem to be talking about the same things.  SGI's
kiobuf request patches already let us pass a large IO through the
request layer in a single unit without having to split it up to
squeeze it through the API.

> > THAT is the overhead that I'm talking about: having to split a large
> > IO into small chunks, each of which just ends up having to be merged
> > back again into a single struct request by the *make_request code.
> 
> You could easily just generate the bh then and there, if you wanted to.

In the current 2.4 tree, we already do: brw_kiovec creates the
temporary buffer_heads on demand to feed them to the IO layers.

> Your overhead comes from the fact that you want to gather the IO together. 

> And I'm saying that you _shouldn't_ gather the IO. There's no point.

I don't --- the underlying layer does.  And that is where the overhead
is: for every single large IO being created by the higher layers,
make_request is doing a dozen or more merges because I can only feed
the IO through make_request in tiny pieces.

> The
> gathering is sufficiently done by the low-level code anyway, and I've
> tried to explain why the low-level code _has_ to do that work regardless
> of what upper layers do.

I know.  The problem is the low-level code doing it a hundred times
for a single injected IO.

> You need to generate a separate sg entry for each page anyway. So why not
> just use the existing one? The "struct buffer_head". Which already
> _handles_ all the issues that you have complained are hard to handle.

Two issues here.  First is that the buffer_head is an enormously
heavyweight object for a sg-list fragment.  It contains a ton of
fields of interest only to the buffer cache.  We could mitigate this
to some extent by ensuring that the relevant fields for IO (rsector,
size, req_next, state, data, page etc) were in a single cache line.

Secondly, the cost of adding each single buffer_head to the request
list is O(n) in the number of requests already on the list.  We end up
walking potentially the entire request queue before finding the
request to merge against, and we do that again and again, once for
every single buffer_head in the list.  We do this even if the caller
went in via a multi-bh ll_rw_block() call in which case we know in
advance that all of the buffer_heads are contiguous on disk.


There is a side problem: right now, things like raid remapping occur
during generic_make_request, before we have a request built.  That
means that all of the raid0 remapping or raid1/5 request expanding is
being done on a per-buffer_head, not per-request, basis, so again
we're doing a whole lot of unnecessary duplicate work when an IO
larger than a buffer_head is submitted.


If you really don't mind the size of the buffer_head as a sg fragment
header, then at least I'd like us to be able to submit a pre-built
chain of bh's all at once without having to go through the remap/merge
cost for each single bh.

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

2001-02-07 Thread Stephen C. Tweedie

Hi,

On Wed, Feb 07, 2001 at 09:10:32AM +0000, David Howells wrote:
> 
> I presume that correct_size will always be a power of 2...

Yes.

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

2001-02-06 Thread Stephen C. Tweedie

Hi,

On Tue, Feb 06, 2001 at 04:50:19PM -0800, Linus Torvalds wrote:
> 
> 
> On Wed, 7 Feb 2001, Stephen C. Tweedie wrote:
> > 
> > That gets us from 512-byte blocks to 4k, but no more (ll_rw_block
> > enforces a single blocksize on all requests, but relaxing that
> > requirement is no big deal).  Buffer_heads can't deal with data which
> > spans more than a page right now.
> 
> "struct buffer_head" can deal with pretty much any size: the only thing it
> cares about is bh->b_size.

Right now, anything larger than a page is physically non-contiguous,
and sorry if I didn't make that explicit, but I thought that was
obvious enough that I didn't need to.  We were talking about raw IO,
and as long as we're doing IO out of user anonymous data allocated
from individual pages, buffer_heads are limited to that page size in
this context.

> Have you ever spent even just 5 minutes actually _looking_ at the block
> device layer, before you decided that you think it needs to be completely
> re-done some other way? It appears that you never bothered to.

Yes.  We still have this fundamental property: if a user sends in a
128kB IO, we end up having to split it up into buffer_heads and doing
a separate submit_bh() on each single one.  Given our VM, PAGE_SIZE
(*not* PAGE_CACHE_SIZE) is the best granularity we can hope for in
this case.

THAT is the overhead that I'm talking about: having to split a large
IO into small chunks, each of which just ends up having to be merged
back again into a single struct request by the *make_request code.

A constructed IO request basically doesn't care about anything in the
buffer_head except for the data pointer and size, and the completion
status info and callback.  All of the physical IO description is in
the struct request by this point.  The chain of buffer_heads is
carrying around a huge amount of information which isn't used by the
IO, and if the caller is something like the raw IO driver which isn't
using the buffer cache, that extra buffer_head data is just overhead. 

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

2001-02-06 Thread Stephen C. Tweedie

Hi,

On Tue, Feb 06, 2001 at 04:41:21PM -0800, Linus Torvalds wrote:
> 
> On Wed, 7 Feb 2001, Stephen C. Tweedie wrote:
> > No, it is a problem of the ll_rw_block interface: buffer_heads need to
> > be aligned on disk at a multiple of their buffer size.
> 
> Ehh.. True of ll_rw_block() and submit_bh(), which are meant for the
> traditional block device setup, where "b_blocknr" is the "virtual
> blocknumber" and that indeed is tied in to the block size.
> 
> The fact is, if you have problems like the above, then you don't
> understand the interfaces. And it sounds like you designed kiobuf support
> around the wrong set of interfaces.

They used the only interfaces available at the time...

> If you want to get at the _sector_ level, then you do
...
> which doesn't look all that complicated to me. What's the problem?

Doesn't this break nastily as soon as the IO hits an LVM or soft raid
device?  I don't think we are safe if we create a larger-sized
buffer_head which spans a raid stripe: the raid mapping is only
applied once per buffer_head.

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

2001-02-06 Thread Stephen C. Tweedie

Hi,

On Tue, Feb 06, 2001 at 07:25:19PM -0500, Ingo Molnar wrote:
> 
> On Wed, 7 Feb 2001, Stephen C. Tweedie wrote:
> 
> > No, it is a problem of the ll_rw_block interface: buffer_heads need to
> > be aligned on disk at a multiple of their buffer size.  Under the Unix
> > raw IO interface it is perfectly legal to begin a 128kB IO at offset
> > 512 bytes into a device.
> 
> then we should either fix this limitation, or the raw IO code should split
> the request up into several, variable-size bhs, so that the range is
> filled out optimally with aligned bhs.

That gets us from 512-byte blocks to 4k, but no more (ll_rw_block
enforces a single blocksize on all requests, but relaxing that
requirement is no big deal).  Buffer_heads can't deal with data which
spans more than a page right now.

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

2001-02-06 Thread Stephen C. Tweedie

Hi,

On Tue, Feb 06, 2001 at 08:57:13PM +0100, Ingo Molnar wrote:
> 
> [overhead of 512-byte bhs in the raw IO code is an artificial problem of
> the raw IO code.]

No, it is a problem of the ll_rw_block interface: buffer_heads need to
be aligned on disk at a multiple of their buffer size.  Under the Unix
raw IO interface it is perfectly legal to begin a 128kB IO at offset
512 bytes into a device.
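
A small sketch of that constraint (not kernel code): with ll_rw_block,
a buffer_head of size s must start at a device offset that is a
multiple of s, so the start offset dictates the largest bh usable,
assuming raw IO already guarantees 512-byte alignment.

#include <stdio.h>

static unsigned int largest_bh_size(unsigned long long offset,
                                    unsigned int max_size)
{
        unsigned int size = 512;
        while (size < max_size && offset % (size * 2) == 0)
                size *= 2;
        return size;
}

int main(void)
{
        /* A 128kB raw IO starting 512 bytes into the device: */
        printf("at offset 512:  %u-byte bh\n", largest_bh_size(512, 4096));
        printf("at offset 4096: %u-byte bh\n", largest_bh_size(4096, 4096));
        return 0;
}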

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: sync & asyck i/o

2001-02-06 Thread Stephen C. Tweedie

Hi,

On Tue, Feb 06, 2001 at 11:25:00AM -0800, Andre Hedrick wrote:
> On Tue, 6 Feb 2001, Stephen C. Tweedie wrote:

> > No, we simply omit to instruct them to enable write-back caching.
> > Linux assumes that the WCE (write cache enable) bit in a disk's
> > caching mode page is zero.
> 
> You can not be so blind to omit the command.

Linux has traditionally ignored the issue.  Don't ask me to defend it
--- the last advice I got from anybody who knew SCSI well was that
SCSI disks were defaulting to WCE-disabled.  

Note that disabling SCSI WCE doesn't disable the cache, it just
enforces synchronous completion.  With tagged command queuing,
writeback caching doesn't necessarily mean a huge performance
increase.  But if WCE is being enabled by default on modern SCSI
drives, then that's something which the scsi stack really does need to
fix --- the upper block layers will most definitely break if we have
WCE enabled and we don't set force-unit-access on the scsi commands.

The ll_rw_block interface is perfectly clear: it expects the data to
be written to persistent storage once the buffer_head end_io is
called.  If that's not the case, somebody needs to fix the lower
layers.

Cheers, 
  Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

2001-02-06 Thread Stephen C. Tweedie

Hi,

On Tue, Feb 06, 2001 at 06:22:58PM +0100, Christoph Hellwig wrote:
> On Tue, Feb 06, 2001 at 05:05:06PM +0000, Stephen C. Tweedie wrote:
> > The whole point of the post was that it is merging, not splitting,
> > which is troublesome.  How are you going to merge requests without
> > having chains of scatter-gather entities each with their own
> > completion callbacks?
> 
> The object passed down to the low-level driver just needs to ne able
> to contain multiple end-io callbacks.  The decision what to call when
> some of the scatter-gather entities fail is of course not so easy to
> handle and needs further discussion.

Umm, and if you want the separate higher-level IOs to be told which
IOs succeeded and which ones failed on error, you need to associate
each of the multiple completion callbacks with its particular
scatter-gather fragment or fragments.  So you end up with the same
sort of kiobuf/kiovec concept where you have chains of sg chunks, each
chunk with its own completion information.

This is *precisely* what I've been trying to get people to address.
Forget whether the individual sg fragments are based on pages or not:
if you want to have IO merging and accurate completion callbacks, you
need not just one sg list but multiple lists each with a separate
callback header.

Abandon the merging of sg-list requests (by moving that functionality
into the higher-level layers) and that problem disappears: flat
sg-lists will then work quite happily at the request layer.

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: sync & asyck i/o

2001-02-06 Thread Stephen C. Tweedie

Hi,

On Tue, Feb 06, 2001 at 02:52:40PM +0000, Alan Cox wrote:
> > According to the man page for fsync it copies in-core data to disk 
> > prior to its return. Does that take async i/o to the media in account? 
> > I.e. does it wait for completion of the async i/o to the disk?
> 
> Undefined. 

> In practice some IDE disks do write merging and small amounts of write
> caching in the drive firmware so you cannot trust it 100%. 

It's worth noting that it *is* defined unambiguously in the standards:
fsync waits until all the data is hard on disk.  Linux will obey that
if it possibly can: only in cases where the hardware is actively lying
about when the data has hit disk will the guarantee break down.
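
A minimal runnable sketch of those semantics (the file name is
arbitrary): fsync() does not return until the kernel has pushed the
data out to the device, and only hardware that lies about write
completion can break the guarantee.

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        int fd = open("important.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
                return 1;

        const char buf[] = "record that must survive a crash\n";
        if (write(fd, buf, strlen(buf)) < 0)
                return 1;

        if (fsync(fd) != 0)       /* blocks until the data is on disk */
                return 1;

        /* Only now is it safe to report the data as committed. */
        close(fd);
        return 0;
}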

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

2001-02-06 Thread Stephen C. Tweedie

Hi,

On Tue, Feb 06, 2001 at 06:00:58PM +0100, Christoph Hellwig wrote:
> On Tue, Feb 06, 2001 at 12:07:04AM +0000, Stephen C. Tweedie wrote:
> > 
> > Is that a realistic basis for a cleaned-up ll_rw_blk.c?
> 
> I don't think os.  If we minimize the state in the IO container object,
> the lower levels could split them at their guess and the IO completion
> function just has to handle the case that it might be called for a smaller
> object.

The whole point of the post was that it is merging, not splitting,
which is troublesome.  How are you going to merge requests without
having chains of scatter-gather entities each with their own
completion callbacks?

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: rawio usage

2001-02-06 Thread Stephen C. Tweedie

Hi,

On Mon, Feb 05, 2001 at 10:36:32PM -0800, Mayank Vasa wrote:
> 
> When I run this program as root, I get the error "write: Invalid argument".

Raw IO requires that the buffers are aligned on a 512-byte boundary in
memory.
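
A sketch of the user-space requirement, assuming /dev/raw/raw1 has
already been bound to a block device with raw(8), and with
posix_memalign() standing in for whatever aligned allocator is
available; both the buffer address and the transfer length need to be
multiples of 512:

#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
        void *buf;
        size_t len = 8192;                        /* a multiple of 512 */

        /* A 512-byte aligned buffer -- a plain malloc() is not enough. */
        if (posix_memalign(&buf, 512, len) != 0)
                return 1;
        memset(buf, 0, len);

        int fd = open("/dev/raw/raw1", O_WRONLY);
        if (fd < 0)
                return 1;

        if (write(fd, buf, len) < 0)              /* EINVAL if misaligned */
                return 1;

        close(fd);
        free(buf);
        return 0;
}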

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

2001-02-05 Thread Stephen C. Tweedie

Hi,

OK, if we take a step back, what does this look like:

On Mon, Feb 05, 2001 at 08:54:29PM +0000, Stephen C. Tweedie wrote:
> 
> If we are doing readahead, we want completion callbacks raised as soon
> as possible on IO completions, no matter how many other IOs have been
> merged with the current one.  More importantly though, when we are
> merging multiple page or buffer_head IOs in a request, we want to know
> exactly which buffer/page contents are valid and which are not once
> the IO completes.

This is the current situation.  If the page cache submits a 64K IO to
the block layer, it does so in pieces, and then expects to be told on
return exactly which pages succeeded and which failed.

That's where the mess of having multiple completion objects in a
single IO request comes from.  Can we just forbid this case?

That's the short cut that SGI's kiobuf block dev patches do when they
get kiobufs: they currently deal with either buffer_heads or kiobufs
in struct requests, but they don't merge kiobuf requests.  (XFS
already clusters the IOs for them in that case.)

Is that a realistic basis for a cleaned-up ll_rw_blk.c?

It implies that the caller has to do IO merging.  For read, that's not
much pain, as the most important case --- readahead --- is already
done in a generic way which could submit larger IOs relatively easily.
It would be harder for writes, but high-level write clustering code
has already been started.

It implies that for any IO, on IO failure you don't get told which
part of the IO failed.  That adds code to the caller: the page cache
would have to retry per-page to work out which pages are readable and
which are not.  It means that for soft raid, you don't get told which
blocks are bad if a stripe has an error anywhere.  Ingo, is that a
potential problem?

But it gives very, very simple semantics to the request layer: single
IOs go in (with a completion callback and a single scatter-gather
list), and results go back with success or failure.

With that change, it becomes _much_ more natural to push a simple sg
list down through the disk layers.
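
A sketch of the request object that simplification implies -- the
names are made up: one completion callback, one flat scatter-gather
list, and success or failure reported for the IO as a whole.

struct sg_frag {
        void          *data;       /* or a struct page * plus offset */
        unsigned int   length;
};

struct simple_request {
        unsigned long long  sector;    /* where the IO starts on disk     */
        int                 nr_frags;
        struct sg_frag     *frags;     /* flat sg list, no per-frag state */
        void              (*end_io)(struct simple_request *req, int error);
};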

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-05 Thread Stephen C. Tweedie

Hi,

On Mon, Feb 05, 2001 at 11:06:48PM +0000, Alan Cox wrote:
> > do you then tell the application _above_ raid0 if one of the
> > underlying IOs succeeds and the other fails halfway through?
> 
> struct 
> {
>   u32 flags;  /* because everything needs flags */
>   struct io_completion *completions;
>   kiovec_t sglist[0];
> } thingy;
> 
> now kmalloc one object of the header the sglist of the right size and the
> completion list. Shove the completion list on the end of it as another
> array of objects and what is the problem.

XFS uses both small metadata items in the buffer cache and large
pagebufs.  You may have merged a 512-byte read with a large pagebuf
read: one completion callback is associated with a single sg fragment,
the next callback belongs to a dozen different fragments.  Associating
the two lists becomes non-trivial, although it could be done.

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-05 Thread Stephen C. Tweedie

Hi,

On Mon, Feb 05, 2001 at 10:28:37PM +0100, Ingo Molnar wrote:
> 
> On Mon, 5 Feb 2001, Stephen C. Tweedie wrote:
> 
> it's exactly these 'compound' structures i'm vehemently against. I do
> think it's a design nightmare. I can picture these monster kiobufs
> complicating the whole code for no good reason - we couldnt even get the
> bh-list code in block_device.c right - why do you think kiobufs *all
> across the kernel* will be any better?
> 
> RAID0 is not an issue. Split it up, use separate kiobufs for every
> different disk.

Umm, that's not the point --- of course you can use separate kiobufs
for the communication between raid0 and the underlying disks, but what
do you then tell the application _above_ raid0 if one of the
underlying IOs succeeds and the other fails halfway through?

And what about raid1?  Are you really saying that raid1 doesn't need
to know which blocks succeeded and which failed?  That's the level of
completion information I'm worrying about at the moment.

> fragmented skbs are a different matter: they are simply a bit more generic
> abstractions of 'memory buffer'. Clear goal, clear solution. I do not
> think kiobufs have clear goals.

The goal: allow arbitrary IOs to be pushed down through the stack in
such a way that the callers can get meaningful information back about
what worked and what did not.  If the write was a 128kB raw IO, then
you obviously get coarse granularity of completion callback.  If the
write was a series of independent pages which happened to be
contiguous on disk, you actually get told which pages hit disk and
which did not.

> and what is the goal of having multi-page kiobufs. To avoid having to do
> multiple function calls via a simpler interface? Shouldnt we optimize that
> codepath instead?

The original multi-page buffers came from the map_user_kiobuf
interface: they represented a user data buffer.  I'm not wedded to
that format --- we can happily replace it with a fine-grained sg list
--- but the reason they have been pushed so far down the IO stack is
the need for accurate completion information on the originally
requested IOs.

In other words, even if we expand the kiobuf into a sg vector list,
when it comes to merging requests in ll_rw_blk.c we still need to
track the callbacks on each independent source kiobuf.
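
As a rough illustration of that tracking (all names invented, nothing
kernel-specific assumed): the merged request only has to remember which
slice of the transfer belongs to which source, and each source then gets
its own callback:

/* Hedged sketch: per-source completion records inside a merged request. */
#include <stdio.h>

#define MAX_SOURCES 8

struct source_cb {
	void	(*end_io)(void *source, int uptodate);
	void	*source;		/* the caller's kiobuf/buffer      */
	unsigned long start_sector;	/* slice of the merged IO it owns  */
	unsigned long nr_sectors;
};

struct merged_request {
	int		nr_sources;
	struct source_cb cb[MAX_SOURCES];
};

/* The device driver reports how far the transfer got before any error. */
static void complete_merged(struct merged_request *req,
			    unsigned long sectors_done, int error)
{
	int i;

	for (i = 0; i < req->nr_sources; i++) {
		struct source_cb *cb = &req->cb[i];
		int ok = !error ||
			 (cb->start_sector + cb->nr_sectors <= sectors_done);

		cb->end_io(cb->source, ok);
	}
}

static void say(void *source, int uptodate)
{
	printf("%s: %s\n", (char *)source, uptodate ? "ok" : "failed");
}

int main(void)
{
	struct merged_request req = {
		.nr_sources = 2,
		.cb = {
			{ say, "first kiobuf",  0, 8 },
			{ say, "second kiobuf", 8, 8 },
		},
	};

	/* error after 8 sectors: the first source succeeded, the second did not */
	complete_merged(&req, 8, 1);
	return 0;
}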

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

2001-02-05 Thread Stephen C. Tweedie

Hi,

On Mon, Feb 05, 2001 at 11:28:17AM -0800, Linus Torvalds wrote:

> The _vectors_ are needed at the very lowest levels: the levels that do not
> necessarily have to worry at all about completion notification etc. You
> want the arbitrary scatter-gather vectors passed down to the stuff that
> sets up the SG arrays etc, the stuff that doesn't care AT ALL about the
> high-level semantics.

OK, this is exactly where we have a problem: I can see too many cases
where we *do* need to know about completion stuff at a fine
granularity when it comes to disk IO (unlike network IO, where we can
usually rely on a caller doing retransmit at some point in the stack).

If we are doing readahead, we want completion callbacks raised as soon
as possible on IO completions, no matter how many other IOs have been
merged with the current one.  More importantly though, when we are
merging multiple page or buffer_head IOs in a request, we want to know
exactly which buffer/page contents are valid and which are not once
the IO completes.

The current request struct's buffer_head list provides that quite
naturally, but is a hugely heavyweight way of performing large IOs.
What I'm really after is a way of sending IOs to make_request in such
a way that if the caller provides an array of buffer_heads, it gets
back completion information on each one, but if the IO is requested in
large chunks (eg. XFS's pagebufs or large kiobufs from raw IO), then
the request code can deal with it in those large chunks.

What worries me is things like the soft raid1/5 code: pretending that
we can skimp on the return information about which blocks were
transferred successfully and which were not sounds like a really bad
idea when you've got a driver which relies on that completion
information in order to do intelligent error recovery.

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-05 Thread Stephen C. Tweedie

Hi,

On Mon, Feb 05, 2001 at 08:36:31AM -0800, Linus Torvalds wrote:

> Have you ever thought about other things, like networking, special
> devices, stuff like that? They can (and do) have packet boundaries that
> have nothing to do with pages what-so-ever. They can have such notions as
> packets that contain multiple streams in one packet, where it ends up
> being split up into several pieces. Where neither the original packet
> _nor_ the final pieces have _anything_ to do with "pages".
> 
> THERE IS NO PAGE ALIGNMENT.

And kiobufs don't require IO to be page aligned, and they never have.
The only page alignment they assume is that if a *single*
scatter-gather element spans multiple pages, then the joins between
those pages occur on page boundaries.

Remember, a kiobuf is only designed to represent one scatter-gather
fragment, not a full sg list.  That was the whole reason for having a
kiovec as a separate concept: if you have more than one independent
fragment in the sg-list, you need more than one kiobuf.

And the reason why we created sg fragments which can span pages was so
that we can encode IOs which interact with the VM: any arbitrary
virtually-contiguous user data buffer can be mapped into a *single*
kiobuf for a write() call, so it's a generic way of supporting things
like O_DIRECT without the IO layers having to know anything about VM
(and Ben's async IO patches also use kiobufs in this way to allow
read()s to write to the user's data buffer once the IO completes,
without having to have a context switch back into that user's
context.)  Similarly, any extent of a file in the page cache can be
encoded in a single kiobuf.

And no, the simpler networking-style sg-list does not cut it for block
device IO, because for block devices, we want to have separate
completion status made available for each individual sg fragment in
the IO.  *That* is why the kiobuf is more heavyweight than the
networking variant: each fragment [kiobuf] in the scatter-gather list
[kiovec] has its own completion information.  

If we have a bunch of separate data buffers queued for sequential disk
IO as a single request, then we still want things like readahead and
error handling to work.  That means that we want the first kiobuf in
the chain to get its completion wakeup as soon as that segment of the
IO is complete, without having to wait for the remaining sectors of
the IO to be transferred.  It also means that if we've done something
like split the IO over a raid stripe, then when an error occurs, we
still want to know which of the callers' buffers succeeded and which
failed.

Yes, I agree that the original kiovec mechanism of using a *kiobuf[]
array to assemble the scatter-gather fragments sucked.  But I don't
believe that just throwing away the concept of kiobuf as an sg-fragment
will work either when it comes to disk IOs: the need for per-fragment
completion is too compelling.  I'd rather shift to allowing kiobufs to
be assembled into linked lists for IO to avoid *kiobuf[] vectors, in
just the same way that we currently chain buffer_heads for IO.  
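
A minimal sketch of that chaining --- the struct here is deliberately much
simpler than a real kiobuf and all the names are invented --- each element
keeps its own end_io, and merging is just list linking:

/* Hedged sketch: chained fragments, each completed independently. */
#include <stdio.h>

struct kiobuf_sketch {
	int	length;			/* bytes covered by this fragment */
	void	(*end_io)(struct kiobuf_sketch *, int uptodate);
	struct kiobuf_sketch *io_next;	/* chained for a single request   */
};

/* The request layer walks the chain and completes each fragment on its
 * own, just as it wakes chained buffer_heads today. */
static void complete_chain(struct kiobuf_sketch *head, int bytes_done)
{
	struct kiobuf_sketch *k;

	for (k = head; k; k = k->io_next) {
		int ok = bytes_done >= k->length;

		k->end_io(k, ok);
		bytes_done = ok ? bytes_done - k->length : 0;
	}
}

static void note(struct kiobuf_sketch *k, int uptodate)
{
	printf("fragment of %d bytes: %s\n", k->length,
	       uptodate ? "uptodate" : "not transferred");
}

int main(void)
{
	struct kiobuf_sketch third  = { 4096, note, NULL };
	struct kiobuf_sketch second = { 4096, note, &third };
	struct kiobuf_sketch first  = { 4096, note, &second };

	/* the device only transferred 8192 of the 12288 bytes */
	complete_chain(&first, 8192);
	return 0;
}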

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-05 Thread Stephen C. Tweedie

Hi,

On Mon, Feb 05, 2001 at 05:29:47PM +0000, Alan Cox wrote:
> > 
> > _All_ drivers would have to do that in the degenerate case, because
> > none of our drivers can deal with a dma boundary in the middle of a
> > sector, and even in those places where the hardware supports it in
> > theory, you are still often limited to word-alignment.
> 
> Thats true for _block_ disk devices but if we want a generic kiovec then
> if I am going from video capture to network I dont need to force anything more
> than 4 byte align

Kiobufs have never, ever required the IO to be aligned on any
particular boundary.  They simply make the assumption that the
underlying buffered object can be described in terms of pages with
some arbitrary (non-aligned) start/offset.  Every video framebuffer
I've ever seen satisfies that, so you can easily map an arbitrary
contiguous region of the framebuffer with a kiobuf already.
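
As a trivial worked example of that description (plain arithmetic, the page
size and addresses invented for the example), an arbitrary non-aligned
region needs nothing more than a run of pages plus one offset/length pair:

/* Hedged sketch: which pages, and what offset/length, describe a region. */
#include <stdio.h>

#define PAGE_SIZE_SK 4096UL

int main(void)
{
	unsigned long start = 0x12345;		/* arbitrary byte address */
	unsigned long len   = 10000;		/* arbitrary byte count   */

	unsigned long first_page = start / PAGE_SIZE_SK;
	unsigned long offset     = start % PAGE_SIZE_SK;
	unsigned long nr_pages   =
		(offset + len + PAGE_SIZE_SK - 1) / PAGE_SIZE_SK;

	printf("map pages %lu..%lu, offset %lu, length %lu\n",
	       first_page, first_page + nr_pages - 1, offset, len);
	return 0;
}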

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-05 Thread Stephen C. Tweedie

Hi,

On Mon, Feb 05, 2001 at 03:19:09PM +0000, Alan Cox wrote:
> > Yes, it's the sort of thing that you would hope should work, but in
> > practice it's not reliable.
> 
> So the less smart devices need to call something like
> 
>   kiovec_align(kiovec, 512);
> 
> and have it do the bounce buffers ?

_All_ drivers would have to do that in the degenerate case, because
none of our drivers can deal with a dma boundary in the middle of a
sector, and even in those places where the hardware supports it in
theory, you are still often limited to word-alignment.

--Stephen

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-05 Thread Stephen C. Tweedie

Hi,

On Mon, Feb 05, 2001 at 01:00:51PM +0100, Manfred Spraul wrote:
> "Stephen C. Tweedie" wrote:
> > 
> > You simply cannot do physical disk IO on
> > non-sector-aligned memory or in chunks which aren't a multiple of
> > sector size.
> 
> Why not?
> 
> Obviously the disk access itself must be sector aligned and the total
> length must be a multiple of the sector length, but there shouldn't be
> any restrictions on the data buffers.

But there are.  Many controllers just break down and corrupt things
silently if you don't align the data buffers (Jeff Merkey found this
by accident when he started generating unaligned IOs within page
boundaries in his NWFS code).  And a lot of controllers simply cannot
break a sector dma over a page boundary (at least not without some
form of IOMMU remapping).

Yes, it's the sort of thing that you would hope should work, but in
practice it's not reliable.

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-05 Thread Stephen C. Tweedie

Hi,

On Mon, Feb 05, 2001 at 08:01:45PM +0530, [EMAIL PROTECTED] wrote:
> 
> >It's the very essence of readahead that we wake up the earlier buffers
> >as soon as they become available, without waiting for the later ones
> >to complete, so we _need_ this multiple completion concept.
> 
> I can understand this in principle, but when we have a single request going
> down to the device that actually fills in multiple buffers, do we get
> notified (interrupted) by the device before all the data in that request
> got transferred ?

It depends on the device driver.  Different controllers will have
different maximum transfer sizes.  For IDE, for example, we get wakeups
all over the place.  For SCSI, it depends on how many scatter-gather
entries the driver can push into a single on-the-wire request.  Exceed
that limit and the driver is forced to open a new scsi mailbox, and
you get independent completion signals for each such chunk.
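
A small sketch of that chunking, with the segment limit and the function
name invented for the example:

/* Hedged sketch: split a large sg list at the driver's segment limit,
 * issuing one request per chunk, each with its own completion. */
#include <stdio.h>

#define MAX_SEGMENTS 4

static void issue_chunk(int first_seg, int nr_segs)
{
	/* stand-in for queueing one on-the-wire request */
	printf("request: segments %d..%d (completes independently)\n",
	       first_seg, first_seg + nr_segs - 1);
}

int main(void)
{
	int total_segments = 11;	/* e.g. an 11-fragment sg list */
	int done = 0;

	while (done < total_segments) {
		int this = total_segments - done;

		if (this > MAX_SEGMENTS)
			this = MAX_SEGMENTS;
		issue_chunk(done, this);
		done += this;
	}
	return 0;
}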

> >Which is exactly why we have one kiobuf per higher-level buffer, and
> >we chain together kiobufs when we need to for a long request, but we
> >still get the independent completion notifiers.
> 
> As I mentioned above, the alternative is to have the i/o completion related
> linkage information within the wakeup structures instead. That way, it
> doesn't matter to the lower level driver what higher level structure we
> have above (maybe buffer heads, may be page cache structures, may be
> kiobufs). We only chain together memory descriptors for the buffers during
> the io.

You forgot IO failures: it is essential, once the IO completes, to
know exactly which higher-level structures completed successfully and
which did not.  The low-level drivers have to have access to the
independent completion notifications for this to work.

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-05 Thread Stephen C. Tweedie

Hi,

On Fri, Feb 02, 2001 at 01:02:28PM +0100, Christoph Hellwig wrote:
> 
> > I may still be persuaded that we need the full scatter-gather list
> > fields throughout, but for now I tend to think that, at least in the
> > disk layers, we may get cleaner results by allow linked lists of
> > page-aligned kiobufs instead.  That allows for merging of kiobufs
> > without having to copy all of the vector information each time.
> 
> But it will have the same problems as the array soloution: there will
> be one complete kio structure for each kiobuf, with it's own end_io
> callback, etc.

And what's the problem with that?

You *need* this.  You have to have that multiple-completion concept in
the disk layers.  Think about chains of buffer_heads being sent to
disk as a single IO --- you need to know which buffers make it to disk
successfully and which had IO errors.

And no, the IO success is *not* necessarily sequential from the start
of the IO: if you are doing IO to raid0, for example, and the IO gets
striped across two disks, you might find that the first disk gets an
error so the start of the IO fails but the rest completes.  It's the
completion code which notifies the caller of what worked and what did
not.

And for readahead, you want to notify the caller as early as possible
about completion for the first part of the IO, even if the device
driver is still processing the rest.

Multiple completions are a necessary feature of the current block
device interface.  Removing that would be a step backwards.

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-05 Thread Stephen C. Tweedie

Hi,

On Sun, Feb 04, 2001 at 06:54:58PM +0530, [EMAIL PROTECTED] wrote:
> 
> Can't we define a kiobuf structure as just this ? A combination of a
> frag_list and a page_list ?

Then all code which needs to accept an arbitrary kiobuf needs to be
able to parse both --- ugh.

> BTW, We could have a higher level io container that includes a status
> field and a wait_queue_head to take care of i/o completion

IO completion requirements are much more complex.  Think of disk
readahead: we can create a single request struct for an IO of a
hundred buffer heads, and as the device driver satisfies that request,
it wakes up the buffer heads as it goes.  There is a separate
completion notification for every single buffer head in the chain.

It's the very essence of readahead that we wake up the earlier buffers
as soon as they become available, without waiting for the later ones
to complete, so we _need_ this multiple completion concept.

Which is exactly why we have one kiobuf per higher-level buffer, and
we chain together kiobufs when we need to for a long request, but we
still get the independent completion notifiers.

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-05 Thread Stephen C. Tweedie

Hi,

On Sat, Feb 03, 2001 at 12:28:47PM -0800, Linus Torvalds wrote:
> 
> On Thu, 1 Feb 2001, Stephen C. Tweedie wrote:
> > 
> Neither the read nor the write are page-aligned. I don't know where you
> got that idea. It's obviously not true even in the common case: it depends
> _entirely_ on what the file offsets are, and expecting the offset to be
> zero is just being stupid. It's often _not_ zero. With networking it is in
> fact seldom zero, because the network packets are seldom aligned either in
> size or in location.

The underlying buffer is.  The VFS (and the current kiobuf code) is
already happy about IO happening at odd offsets within a page.
However, the more general case --- doing zero-copy IO on arbitrary
unaligned buffers --- simply won't work if you expect to be able to
push those buffers to disk without a copy.  

The splice case you talked about is fine because it's doing the normal
prepare/commit logic where the underlying buffer is page aligned, even
if the splice IO is not to a page aligned location.  That's _exactly_
what kiobufs were intended to support.  The prepare_read/prepare_write/
pull/push cycle lets the caller tell the pull() function where to
store its data, because there are alignment constraints which just
can't be ignored: you simply cannot do physical disk IO on
non-sector-aligned memory or in chunks which aren't a multiple of
sector size.  (The buffer address alignment can sometimes be relaxed
--- obviously if you're doing PIO then it doesn't matter --- but the
length granularity is rigidly enforced.)
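
As a sketch of those two constraints (sector size and names invented for
the example), the checks amount to no more than:

/* Hedged sketch: length must be a sector multiple; the buffer address
 * usually needs sector alignment too unless the transfer is PIO. */
#include <stdio.h>
#include <stdint.h>

#define SECTOR_SIZE 512UL

static int io_params_ok(uintptr_t buf, unsigned long len, int pio)
{
	if (len == 0 || (len % SECTOR_SIZE) != 0)
		return 0;		/* length granularity: rigid      */
	if (!pio && (buf % SECTOR_SIZE) != 0)
		return 0;		/* address alignment: DMA case    */
	return 1;
}

int main(void)
{
	printf("%d\n", io_params_ok(0x1000, 4096, 0));	/* 1: OK            */
	printf("%d\n", io_params_ok(0x1000, 1000, 0));	/* 0: bad length    */
	printf("%d\n", io_params_ok(0x1234, 4096, 0));	/* 0: unaligned     */
	printf("%d\n", io_params_ok(0x1234, 4096, 1));	/* 1: PIO relaxes it */
	return 0;
}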
 
Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-02 Thread Stephen C. Tweedie

Hi,

On Fri, Feb 02, 2001 at 12:51:35PM +0100, Christoph Hellwig wrote:
> > 
> > If I have a page vector with a single offset/length pair, I can build
> > a new header with the same vector and modified offset/length to split
> > the vector in two without copying it.
> 
> You just say in the higher-level structure ignore from x to y even if
> they have an offset in their own vector.

Exactly --- and so you end up with something _much_ uglier, because
you end up with all sorts of combinations of length/offset fields all
over the place.

This is _precisely_ the mess I want to avoid.

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-01 Thread Stephen C. Tweedie

Hi,

On Thu, Feb 01, 2001 at 09:33:27PM +0100, Christoph Hellwig wrote:

> I think you want the whole kio concept only for disk-like IO.  

No.  I want something good for zero-copy IO in general, but a lot of
that concerns the problem of interacting with the user, and the basic
center of that interaction in 99% of the interesting cases is either a
user VM buffer or the page cache --- all of which are page-aligned.  

If you look at the sorts of models being proposed (even by Linus) for
splice, you get

len = prepare_read();
prepare_write();
pull_fd();
commit_write();

in which the read is being pulled into a known location in the page
cache -- it's page-aligned, again.  I'm perfectly willing to accept
that there may be a need for scatter-gather boundaries including
non-page-aligned fragments in this model, but I can't see one if
you're using the page cache as a mediator, nor if you're doing it
through a user mmapped buffer.

The only reason you need finer scatter-gather boundaries --- and it
may be a compelling reason --- is if you are merging multiple IOs
together into a single device-level IO.  That makes perfect sense for
the zerocopy tcp case where you're doing MSG_MORE-type coalescing.  It
doesn't help the existing SGI kiobuf block device code, because that
performs its merging in the filesystem layers and the block device
code just squirts the IOs to the wire as-is, but if we want to start
merging those kiobuf-based IOs within make_request() then the block
device layer may want it too.

And Linus is right, the old way of using a *kiobuf[] for that was
painful, but the solution of adding start/length to every entry in
the page vector just doesn't sit right with many components of the
block device environment either.

I may still be persuaded that we need the full scatter-gather list
fields throughout, but for now I tend to think that, at least in the
disk layers, we may get cleaner results by allow linked lists of
page-aligned kiobufs instead.  That allows for merging of kiobufs
without having to copy all of the vector information each time.

The killer, however, is what happens if you want to split such a
merged kiobuf.  Right now, that's something that I can only imagine
happening in the block layers if we start encoding buffer_head chains
as kiobufs, but if we do that in the future, or if we start merging
genuine kiobuf requests, then doing that split later on (for
raid0 etc) may require duplicating whole chains of kiobufs.  At that
point, just doing scatter-gather lists is cleaner.

But for now, the way to picture what I'm trying to achieve is that
kiobufs are a bit like buffer_heads --- they represent the physical
pages of some VM object that a higher layer has constructed, such as
the page cache or a user VM buffer.  You can chain these objects
together for IO, but that doesn't stop the individual objects from
being separate entities with independent IO completion callbacks to be
honoured.  

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-01 Thread Stephen C. Tweedie

Hi,

On Thu, Feb 01, 2001 at 09:33:27PM +0100, Christoph Hellwig wrote:
> 
> > On Thu, Feb 01, 2001 at 05:34:49PM +0000, Alan Cox wrote:
> > In the disk IO case, you basically don't get that (the only thing
> > which comes close is raid5 parity blocks).  The data which the user
> > started with is the data sent out on the wire.  You do get some
> > interesting cases such as soft raid and LVM, or even in the scsi stack
> > if you run out of mailbox space, where you need to send only a
> > sub-chunk of the input buffer. 
> 
> Though your describption is right, I don't think the case is very common:
> Sometimes in LVM on a pv boundary and maybe sometimes in the scsi code.

On raid0 stripes, it's common to have stripes of between 16k and 64k,
so it's rather more common there than you'd like.  In any case, you
need the code to handle it, and I don't want to make the code paths
any more complex than necessary.

> In raid1 you need some kind of clone iobuf, which should work with both
> cases.  In raid0 you need a complete new pagelist anyway

No you don't.  You take the existing one, specify which region of it
is going to the current stripe, and send it off.  Nothing more.
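
A rough sketch of that clone-and-narrow step (struct and field names
invented for the illustration): the page vector is shared, never copied,
and only the new header's offset/length change:

/* Hedged sketch: narrowing a parent IO down to one stripe's region. */
#include <stdio.h>

struct kio_sketch {
	void	**pages;	/* shared page vector, never copied */
	int	nr_pages;
	int	offset;		/* byte offset of valid data */
	int	length;		/* byte count of valid data  */
};

static struct kio_sketch clone_narrow(const struct kio_sketch *parent,
				      int sub_offset, int sub_length)
{
	struct kio_sketch child = *parent;	/* copies the header only */

	child.offset = parent->offset + sub_offset;
	child.length = sub_length;
	return child;
}

int main(void)
{
	void *pages[4] = { 0 };
	struct kio_sketch parent = { pages, 4, 0, 16384 };
	/* a 16k IO split over two 8k stripe chunks, for the example */
	struct kio_sketch first  = clone_narrow(&parent, 0,    8192);
	struct kio_sketch second = clone_narrow(&parent, 8192, 8192);

	printf("first:  offset %d length %d\n", first.offset,  first.length);
	printf("second: offset %d length %d\n", second.offset, second.length);
	return 0;
}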

> > In that case, having offset/len as the kiobuf limit markers is ideal:
> > you can clone a kiobuf header using the same page vector as the
> > parent, narrow down the start/end points, and continue down the stack
> > without having to copy any part of the page list.  If you had the
> > offset/len data encoded implicitly into each entry in the sglist, you
> > would not be able to do that.
> 
> Sure you could: you embedd that information in a higher-level structure.

What's the point in a common data container structure if you need
higher-level information to make any sense out of it?

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-01 Thread Stephen C. Tweedie

Hi,

On Thu, Feb 01, 2001 at 09:46:27PM +0100, Christoph Hellwig wrote:

> > Right now we can take a kiobuf and turn it into a bunch of
> > buffer_heads for IO.  The io_count lets us track all of those sub-IOs
> > so that we know when all submitted IO has completed, so that we can
> > pass the completion callback back up the chain without having to
> > allocate yet more descriptor structs for the IO.
> 
> > Again, remove this and the IO becomes more heavyweight because we need
> > to create a separate struct for the info.
> 
> No.  Just allow passing the multiple of the devices blocksize over
> ll_rw_block.

That was just one example: you need the sub-ios just as much when
you split up an IO over stripe boundaries in LVM or raid0, for
example.  Secondly, ll_rw_block needs to die anyway: you can expand
the blocksize up to PAGE_SIZE but not beyond, whereas something like
ll_rw_kiobuf can submit a much larger IO atomically (and we have
devices which don't start to deliver good throughput until you use
IO sizes of 1MB or more).

> >> and the lack of
> >> scatter gather in one kiobuf struct (you always need an array)
> 
> > Again, _all_ data being sent down through the block device layer is
> > either in buffer heads or is page aligned.
> 
> That's the point.  You are always talking about the block-layer only.

I'm talking about why the minimal, generic solution doesn't provide
what the block layer needs.


> > Obviously, extra code will be needed to scan kiobufs if we do that,
> > and unless we have both per-page _and_ per-kiobuf start/offset pairs
> > (adding even further to the complexity), those scatter-gather lists
> > would prevent us from carving up a kiobuf into smaller sub-ios without
> > copying the whole (expanded) vector.
> 
> No.  I think I explained that in my last mail.

How?

If I've got a vector (page X, offset 0, length PAGE_SIZE) and I want
to split it in two, I have to make two new vectors (page X, offset 0,
length n) and (page X, offset n, length PAGE_SIZE-n).  That implies
copying both vectors.

If I have a page vector with a single offset/length pair, I can build
a new header with the same vector and modified offset/length to split
the vector in two without copying it.

> > Possibly, but I remain to be convinced, because you may end up with a
> > mechanism which is generic but is not well-tuned for any specific
> > case, so everything goes slower.
> 
> As kiobufs are widely used for real IO, just as containers, this is
> better then nothing.

Surely having all of the subsystems working fast is better still?

> And IMHO a nice generic concepts that lets different subsystems work
> toegther is a _lot_ better then a bunch of over-optimized, rather isolated
> subsytems.  The IO-Lite people have done a nice research of the effect of
> an unified IO-Caching system vs. the typical isolated systems.

I know, and IO-Lite has some major problems (the close integration of
that code into the cache, for example, makes it harder to expose the
zero-copy to user-land).

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-01 Thread Stephen C. Tweedie

Hi,

On Thu, Feb 01, 2001 at 07:14:03PM +0100, Christoph Hellwig wrote:
> On Thu, Feb 01, 2001 at 05:41:20PM +0000, Stephen C. Tweedie wrote:
> > > 
> > > We can't allocate a huge kiobuf structure just for requesting one page of
> > > IO.  It might get better with VM-level IO clustering though.
> > 
> > A kiobuf is *much* smaller than, say, a buffer_head, and we currently
> > allocate a buffer_head per block for all IO.
> 
> A kiobuf is 124 bytes,

... the vast majority of which is room for the page vector to expand
without having to be copied.  You don't touch that in the normal case.

> a buffer_head 96.  And a buffer_head is additionally
> used for caching data, a kiobuf not.

Buffer_heads are _sometimes_ used for caching data.  That's one of the
big problems with them, they are too overloaded, being both IO
descriptors _and_ cache descriptors.  If you've got 128k of data to
write out from user space, do you want to set up one kiobuf or 256
buffer_heads?  Buffer_heads become really very heavy indeed once you
start doing non-trivial IO.

> > What is so heavyweight in the current kiobuf (other than the embedded
> > vector, which I've already noted I'm willing to cut)?
> 
> array_len

kiobufs can be reused after IO.  You can depopulate a kiobuf,
repopulate it with new pages and submit new IO without having to
deallocate the kiobuf.  You can't do this without knowing how big the
data vector is.  Removing that functionality will prevent reuse,
making them _more_ heavyweight.

> io_count,

Right now we can take a kiobuf and turn it into a bunch of
buffer_heads for IO.  The io_count lets us track all of those sub-IOs
so that we know when all submitted IO has completed, so that we can
pass the completion callback back up the chain without having to
allocate yet more descriptor structs for the IO.
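
A sketch of what io_count buys (simplified and non-atomic, with invented
names): each sub-IO completion decrements the counter, and the kiobuf's own
callback fires when it reaches zero:

/* Hedged sketch: counting sub-IO completions back up to one callback. */
#include <stdio.h>

struct kio_sketch {
	int	io_count;	/* outstanding sub-IOs */
	int	errors;		/* how many sub-IOs failed */
	void	(*end_io)(struct kio_sketch *);
};

static void sub_io_done(struct kio_sketch *kio, int uptodate)
{
	if (!uptodate)
		kio->errors++;
	if (--kio->io_count == 0)	/* atomic_dec_and_test in real life */
		kio->end_io(kio);
}

static void whole_io_done(struct kio_sketch *kio)
{
	printf("kiobuf complete, %d sub-IO error(s)\n", kio->errors);
}

int main(void)
{
	struct kio_sketch kio = { 3, 0, whole_io_done };

	/* three buffer_head-sized pieces complete, one with an error */
	sub_io_done(&kio, 1);
	sub_io_done(&kio, 0);
	sub_io_done(&kio, 1);
	return 0;
}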

Again, remove this and the IO becomes more heavyweight because we need
to create a separate struct for the info.

> the presence of wait_queue AND end_io,

That's fine, I'm happy scrapping the wait queue: people can always use
the kiobuf private data field to refer to a wait queue if they want
to.

> and the lack of
> scatter gather in one kiobuf struct (you always need an array)

Again, _all_ data being sent down through the block device layer is
either in buffer heads or is page aligned.  You want us to triple the
size of the "heavyweight" kiobuf's data vector for what gain, exactly?
Obviously, extra code will be needed to scan kiobufs if we do that,
and unless we have both per-page _and_ per-kiobuf start/offset pairs
(adding even further to the complexity), those scatter-gather lists
would prevent us from carving up a kiobuf into smaller sub-ios without
copying the whole (expanded) vector.

That's a _lot_ of extra complexity in the disk IO layers.

I'm all for a fast kiobuf_to_sglist converter.  But I haven't seen any
evidence that such scatter-gather lists will do anything in the block
device case except complicate the code and decrease performance.

> S.th. like:
...
> makes it a lot simpler for the subsytems to integrate.

Possibly, but I remain to be convinced, because you may end up with a
mechanism which is generic but is not well-tuned for any specific
case, so everything goes slower.

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-01 Thread Stephen C. Tweedie

Hi,

On Thu, Feb 01, 2001 at 06:49:50PM +0100, Christoph Hellwig wrote:
> 
> > Adding tons of base/limit pairs to kiobufs makes it worse not better
> 
> For disk I/O it makes the handling a little easier for the cost of the
> additional offset/length fields.

Umm, actually, no, it makes it much worse for many of the cases.  

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-01 Thread Stephen C. Tweedie

Hi,

On Thu, Feb 01, 2001 at 05:34:49PM +0000, Alan Cox wrote:
> > 
> > I don't see any real advantage for disk IO.  The real advantage is that
> > we can have a generic structure that is also usefull in e.g. networking
> > and can lead to a unified IO buffering scheme (a little like IO-Lite).
> 
> Networking wants something lighter rather than heavier. Adding tons of
> base/limit pairs to kiobufs makes it worse not better

Networking has fundamentally different requirements.  In a network
stack, you want the ability to add fragments to unaligned chunks of
data to represent headers at any point in the stack.

In the disk IO case, you basically don't get that (the only thing
which comes close is raid5 parity blocks).  The data which the user
started with is the data sent out on the wire.  You do get some
interesting cases such as soft raid and LVM, or even in the scsi stack
if you run out of mailbox space, where you need to send only a
sub-chunk of the input buffer.  

In that case, having offset/len as the kiobuf limit markers is ideal:
you can clone a kiobuf header using the same page vector as the
parent, narrow down the start/end points, and continue down the stack
without having to copy any part of the page list.  If you had the
offset/len data encoded implicitly into each entry in the sglist, you
would not be able to do that.

--Stephen

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-01 Thread Stephen C. Tweedie

Hi,

On Thu, Feb 01, 2001 at 06:05:15PM +0100, Christoph Hellwig wrote:
> On Thu, Feb 01, 2001 at 04:16:15PM +0000, Stephen C. Tweedie wrote:
> > > 
> > > No, and with the current kiobufs it would not make sense, because they
> > > are too heavy-weight.
> > 
> > Really?  In what way?  
> 
> We can't allocate a huge kiobuf structure just for requesting one page of
> IO.  It might get better with VM-level IO clustering though.

A kiobuf is *much* smaller than, say, a buffer_head, and we currently
allocate a buffer_head per block for all IO.

A kiobuf contains enough embedded page vector space for 16 pages by
default, but I'm happy enough to remove that if people want.  However,
note that that memory is not initialised, so there is no memory access
cost at all for that empty space.  Remove that space and instead of
one memory allocation per kiobuf, you get two, so the cost goes *UP*
for small IOs.
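
From memory, the shape under discussion is roughly the following (not
a verbatim copy of the header; the inline array is the 16-page space
mentioned above):

    struct kiobuf {
        int            nr_pages;      /* pages actually referenced */
        int            array_len;     /* slots available in maplist */
        int            offset;        /* byte offset into first page */
        int            length;        /* bytes of valid data */
        struct page  **maplist;       /* -> map_array for small IOs */
        struct page   *map_array[16]; /* inline space, uninitialised */
        atomic_t       io_count;      /* sub-IOs still outstanding */
        void         (*end_io)(struct kiobuf *);
        wait_queue_head_t wait_queue;
        /* ... */
    };

For an IO of up to 16 pages the whole thing is a single allocation,
with maplist simply pointing at map_array; only larger IOs pay for a
second, expanded vector.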

> > > With page,length,offset iobufs this makes sense
> > > and is IMHO the way to go.
> > 
> > What, you mean adding *extra* stuff to the heavyweight kiobuf makes it
> > lean enough to do the job??
> 
> No.  I was speaking about the light-weight kiobuf Linus & me discussed on
> lkml some time ago (though I'd much more like to call it kiovec analogous
> to BSD iovecs).

What is so heavyweight in the current kiobuf (other than the embedded
vector, which I've already noted I'm willing to cut)?

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [PATCH] vma limited swapin readahead

2001-02-01 Thread Stephen C. Tweedie

Hi,

On Thu, Feb 01, 2001 at 02:45:04PM -0200, Rik van Riel wrote:
> On Thu, 1 Feb 2001, Stephen C. Tweedie wrote:
> 
> But only when the extra pages we're reading in don't
> displace useful data from memory, making us fault in
> those other pages ... causing us to go to the disk
> again and do more readahead, which could potentially
> displace even more pages, etc...

Remember, it's a balance.  You can displace a few useful pages and
still win overall because the cost _per page_ goes way down due to
better disk IO utilisation.

> One solution could be to put (most of) the swapin readahead
> pages on the inactive_dirty list, so pressure by readahead
> on the resident pages is smaller and the not used readahead
> pages are reclaimed faster.

Yep, that would make much sense.

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-01 Thread Stephen C. Tweedie

Hi,

On Thu, Feb 01, 2001 at 04:09:53PM +0100, Christoph Hellwig wrote:
> On Thu, Feb 01, 2001 at 08:14:58PM +0530, [EMAIL PROTECTED] wrote:
> > 
> > That would require the vfs interfaces themselves (address space
> > readpage/writepage ops) to take kiobufs as arguments, instead of struct
> > page *  . That's not the case right now, is it ?
> 
> No, and with the current kiobufs it would not make sense, because they
> are too heavy-weight.

Really?  In what way?  

> With page,length,offset iobufs this makes sense
> and is IMHO the way to go.

What, you mean adding *extra* stuff to the heavyweight kiobuf makes it
lean enough to do the job??

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-01 Thread Stephen C. Tweedie

Hi,

On Thu, Feb 01, 2001 at 10:08:45AM -0600, Steve Lord wrote:
> Christoph Hellwig wrote:
> > On Thu, Feb 01, 2001 at 08:14:58PM +0530, [EMAIL PROTECTED] wrote:
> > > 
> > > That would require the vfs interfaces themselves (address space
> > > readpage/writepage ops) to take kiobufs as arguments, instead of struct
> > > page *  . That's not the case right now, is it ?
> > 
> > No, and with the current kiobufs it would not make sense, because they
> > are too heavy-weight.  With page,length,offset iobufs this makes sense
> > and is IMHO the way to go.
> 
> Enquiring minds would like to know if you are working towards this 
> revamp of the kiobuf structure at the moment, you have been very quiet
> recently. 

I'm in the middle of some parts of it, and am actively soliciting
feedback on what cleanups are required.  

I've been merging all of the 2.2 fixes into a 2.4 kiobuf tree, and
have started doing some of the cleanups needed --- removing the
embedded page vector, and adding support for lightweight stacking of
kiobufs for completion callback chains.

However, filesystem IO is almost *always* page aligned: O_DIRECT IO
comes from VM pages, and internal filesystem IO comes from page cache
pages.  Buffer cache IOs are the only exception, and kiobufs only fail
for such IOs once you have multiple buffer_heads being merged into
single requests.

So, what are the benefits in the disk IO stack of adding length/offset
pairs to each page of the kiobuf?  Basically, the only advantage right
now is that it would allow us to merge requests together without
having to chain separate kiobufs.  However, chaining kiobufs in this
case is actually much better than merging them if the original IOs
came in as kiobufs: merging kiobufs requires us to reallocate a new,
longer (page/offset/len) vector, whereas chaining kiobufs is just a
list operation.
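
To spell the comparison out (the "next" link below is part of the
proposed chaining, not a field in the current structure):

    /* Chaining (proposed): constant-time list work, no allocation. */
    static void kiobuf_chain(struct kiobuf *head, struct kiobuf *tail)
    {
        while (head->next)
            head = head->next;
        head->next = tail;
    }

    /* Merging, by contrast, means allocating a vector large enough
     * for both requests and copying every page entry into it:
     * O(pages) work plus an allocation that can fail under memory
     * pressure, right in the IO submission path. */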

Having true scatter-gather lists in the kiobuf would let us represent
arbitrary lists of buffer_heads as a single kiobuf, though, and that
_is_ a big win if we can avoid using buffer_heads below the
ll_rw_block layer at all.  (It's not clear that this is really
possible, though, since we still need to propagate completion
information back up into each individual buffer head's status and wait
queue.)
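
The fan-out itself is cheap per buffer, it just has to visit every
buffer_head.  A sketch (the bh list is assumed to be remembered
somewhere reachable from the completed kiobuf; there is no such field
today):

    /* Push kiobuf-level completion back into the individual
     * buffer_heads the request was built from. */
    static void kiobuf_end_bh_io(struct buffer_head *bh_list, int uptodate)
    {
        struct buffer_head *bh;

        for (bh = bh_list; bh; bh = bh->b_reqnext) {
            mark_buffer_uptodate(bh, uptodate);
            unlock_buffer(bh);    /* wakes anyone sleeping in bh->b_wait */
        }
    }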

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [PATCH] vma limited swapin readahead

2001-02-01 Thread Stephen C. Tweedie

Hi,

On Thu, Feb 01, 2001 at 08:53:33AM -0200, Marcelo Tosatti wrote:
> 
> On Thu, 1 Feb 2001, Stephen C. Tweedie wrote:
> 
> If we're under free memory shortage, "unlucky" readaheads will be harmful.

I know, it's a balancing act.  But given that even one successful
readahead per read will halve the number of swapin seeks, the
performance loss due to the extra scavenging would have to be very bad
indeed to outweigh the benefit.

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-01 Thread Stephen C. Tweedie

Hi,

On Thu, Feb 01, 2001 at 01:28:33PM +0530, [EMAIL PROTECTED] wrote:
> 
> Here's a second pass attempt, based on Ben's wait queue extensions:
> Does this sound any better ?

It's a mechanism, all right, but you haven't described what problems
it is trying to solve, and where it is likely to be used, so it's hard
to judge it. :)

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-01 Thread Stephen C. Tweedie

Hi,

On Thu, Feb 01, 2001 at 10:25:22AM +0530, [EMAIL PROTECTED] wrote:
> 
> >We _do_ need the ability to stack completion events, but as far as the
> >kiobuf work goes, my current thoughts are to do that by stacking
> >lightweight "clone" kiobufs.
> 
> Would that work with stackable filesystems ?

Only if the filesystems were using VFS interfaces which used kiobufs.
Right now, the only filesystem using kiobufs is XFS, and it only
passes them down to the block device layer, not to other filesystems.

> Being able to track the children of a kiobuf would help with I/O
> cancellation (e.g. to pull sub-ios off their request queues if I/O
> cancellation for the parent kiobuf was issued). Not essential, I guess, in
> general, but useful in some situations.

What exactly is the justification for IO cancellation?  It really
upsets the normal flow of control through the IO stack to have
voluntary cancellation semantics.

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [PATCH] vma limited swapin readahead

2001-02-01 Thread Stephen C. Tweedie

Hi,

On Wed, Jan 31, 2001 at 04:24:24PM -0800, David Gould wrote:
> 
> I am skeptical of the argument that we can win by replacing "the least
> desirable" pages with pages that were even less desirable and that we have
> no recent indication of any need for. It seems possible under heavy swap
> to discard quite a portion of the useful pages in favor of junk that just
> happened to have a lucky disk address.

When readin clustering was added to 2.2 for swap and paging,
performance for a lot of VM-intensive tasks more than doubled.  Disk
seeks are _expensive_.  If you read in 15 neighbouring pages on swapin
and on average only one of them turns out to be useful, you have still
halved the number of swapin IOs required.  The performance advantages
are so enormous that they easily compensate for the cost of holding the
other, unneeded pages in memory for a while.
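
As a back-of-the-envelope check of that claim (numbers purely
illustrative):

    100 faults, no clustering:    1 useful page per seek  -> 100 seeks
    100 faults, 16-page clusters: the faulting page plus 15 neighbours
                                  per seek, of which on average one
                                  later proves useful
                                  -> 2 useful pages per seek -> 50 seeks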

Also remember that the readahead pages won't actually get mapped into
memory, so they can be recycled easily.  So, under swapping you tend
to find that the extra readin pages are going to be replacing old,
unneeded readahead pages to some extent, rather than swapping out
useful pages.

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-01 Thread Stephen C. Tweedie

Hi,

On Thu, Feb 01, 2001 at 09:46:27PM +0100, Christoph Hellwig wrote:

  Right now we can take a kiobuf and turn it into a bunch of
  buffer_heads for IO.  The io_count lets us track all of those sub-IOs
  so that we know when all submitted IO has completed, so that we can
  pass the completion callback back up the chain without having to
  allocate yet more descriptor structs for the IO.
 
  Again, remove this and the IO becomes more heavyweight because we need
  to create a separate struct for the info.
 
 No.  Just allow passing a multiple of the device's blocksize over
 ll_rw_block.

That was just one example: you need the sub-ios just as much when
you split up an IO over stripe boundaries in LVM or raid0, for
example.  Secondly, ll_rw_block needs to die anyway: you can expand
the blocksize up to PAGE_SIZE but not beyond, whereas something like
ll_rw_kiobuf can submit a much larger IO atomically (and we have
devices which don't start to deliver good throughput until you use
IO sizes of 1MB or more).
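
The sort of call sequence I mean, as a sketch (ll_rw_kiobuf and the
local names are purely illustrative; map_user_kiobuf/unmap_kiobuf are
the existing mapping helpers, quoted from memory):

    /* Read 1MB from a block device straight into a user buffer as
     * one logical IO, rather than 256 buffer_heads. */
    err = map_user_kiobuf(READ, iobuf, user_addr, 1024 * 1024);
    if (!err) {
        err = ll_rw_kiobuf(READ, iobuf, dev, first_sector);
        unmap_kiobuf(iobuf);
    }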

  and the lack of
  scatter gather in one kiobuf struct (you always need an array)
 
  Again, _all_ data being sent down through the block device layer is
  either in buffer heads or is page aligned.
 
 That's the point.  You are always talking about the block-layer only.

I'm talking about why the minimal, generic solution doesn't provide
what the block layer needs.


  Obviously, extra code will be needed to scan kiobufs if we do that,
  and unless we have both per-page _and_ per-kiobuf start/offset pairs
  (adding even further to the complexity), those scatter-gather lists
  would prevent us from carving up a kiobuf into smaller sub-ios without
  copying the whole (expanded) vector.
 
 No.  I think I explained that in my last mail.

How?

If I've got a vector (page X, offset 0, length PAGE_SIZE) and I want
to split it in two, I have to make two new vectors (page X, offset 0,
length n) and (page X, offset n, length PAGE_SIZE-n).  That implies
copying both vectors.

If I have a page vector with a single offset/length pair, I can build
a new header with the same vector and modified offset/length to split
the vector in two without copying it.

  Possibly, but I remain to be convinced, because you may end up with a
  mechanism which is generic but is not well-tuned for any specific
  case, so everything goes slower.
 
 As kiobufs are widely used for real IO, just as containers, this is
 better than nothing.

Surely having all of the subsystems working fast is better still?

 And IMHO a nice generic concept that lets different subsystems work
 together is a _lot_ better than a bunch of over-optimized, rather isolated
 subsystems.  The IO-Lite people have done some nice research on the effect of
 a unified IO-Caching system vs. the typical isolated systems.

I know, and IO-Lite has some major problems (the close integration of
that code into the cache, for example, makes it harder to expose the
zero-copy to user-land).

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-01 Thread Stephen C. Tweedie

Hi,

On Thu, Feb 01, 2001 at 09:33:27PM +0100, Christoph Hellwig wrote:
 
  On Thu, Feb 01, 2001 at 05:34:49PM +0000, Alan Cox wrote:
  In the disk IO case, you basically don't get that (the only thing
  which comes close is raid5 parity blocks).  The data which the user
  started with is the data sent out on the wire.  You do get some
  interesting cases such as soft raid and LVM, or even in the scsi stack
  if you run out of mailbox space, where you need to send only a
  sub-chunk of the input buffer. 
 
 Though your description is right, I don't think the case is very common:
 Sometimes in LVM on a pv boundary and maybe sometimes in the scsi code.

On raid0 stripes, it's common to have stripes of between 16k and 64k,
so it's rather more common there than you'd like.  In any case, you
need the code to handle it, and I don't want to make the code paths
any more complex than necessary.

 In raid1 you need some kind of clone iobuf, which should work with both
 cases.  In raid0 you need a complete new pagelist anyway

No you don't.  You take the existing one, specify which region of it
is going to the current stripe, and send it off.  Nothing more.

  In that case, having offset/len as the kiobuf limit markers is ideal:
  you can clone a kiobuf header using the same page vector as the
  parent, narrow down the start/end points, and continue down the stack
  without having to copy any part of the page list.  If you had the
  offset/len data encoded implicitly into each entry in the sglist, you
  would not be able to do that.
 
 Sure you could: you embed that information in a higher-level structure.

What's the point in a common data container structure if you need
higher-level information to make any sense out of it?

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-02-01 Thread Stephen C. Tweedie

Hi,

On Thu, Feb 01, 2001 at 09:33:27PM +0100, Christoph Hellwig wrote:

 I think you want the whole kio concept only for disk-like IO.  

No.  I want something good for zero-copy IO in general, but a lot of
that concerns the problem of interacting with the user, and the basic
center of that interaction in 99% of the interesting cases is either a
user VM buffer or the page cache --- all of which are page-aligned.  

If you look at the sorts of models being proposed (even by Linus) for
splice, you get

len = prepare_read();
prepare_write();
pull_fd();
commit_write();

in which the read is being pulled into a known location in the page
cache -- it's page-aligned, again.  I'm perfectly willing to accept
that there may be a need for scatter-gather boundaries including
non-page-aligned fragments in this model, but I can't see one if
you're using the page cache as a mediator, nor if you're doing it
through a user mmapped buffer.
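
Fleshed out into a loop, the pull model reads roughly as follows (the
four calls are only named in the proposal; the argument lists here are
invented for the sketch, and error handling is omitted):

    /* Pull-style splice: the sink drives, committing data straight
     * into page-aligned page cache pages. */
    for (;;) {
        len = prepare_read(src);     /* how much can the source offer? */
        if (len <= 0)
            break;
        prepare_write(dst, len);     /* reserve page cache pages */
        pull_fd(src, dst, len);      /* hand the pages across, no copy */
        commit_write(dst, len);      /* make them live */
    }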

The only reason you need finer scatter-gather boundaries --- and it
may be a compelling reason --- is if you are merging multiple IOs
together into a single device-level IO.  That makes perfect sense for
the zerocopy tcp case where you're doing MSG_MORE-type coalescing.  It
doesn't help the existing SGI kiobuf block device code, because that
performs its merging in the filesystem layers and the block device
code just squirts the IOs to the wire as-is, but if we want to start
merging those kiobuf-based IOs within make_request() then the block
device layer may want it too.

And Linus is right, the old way of using a *kiobuf[] for that was
painful, but the solution of adding start/length to every entry in
the page vector just doesn't sit right with many components of the
block device environment either.

I may still be persuaded that we need the full scatter-gather list
fields throughout, but for now I tend to think that, at least in the
disk layers, we may get cleaner results by allow linked lists of
page-aligned kiobufs instead.  That allows for merging of kiobufs
without having to copy all of the vector information each time.

The killer, however, is what happens if you want to split such a
merged kiobuf.  Right now, that's something that I can only imagine
happening in the block layers if we start encoding buffer_head chains
as kiobufs, but if we do that in the future, or if we start merging
genuine kiobuf requests, then doing that split later on (for
raid0 etc) may require duplicating whole chains of kiobufs.  At that
point, just doing scatter-gather lists is cleaner.

But for now, the way to picture what I'm trying to achieve is that
kiobufs are a bit like buffer_heads --- they represent the physical
pages of some VM object that a higher layer has constructed, such as
the page cache or a user VM buffer.  You can chain these objects
together for IO, but that doesn't stop the individual objects from
being separate entities with independent IO completion callbacks to be
honoured.  

Cheers,
 Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains

2001-01-31 Thread Stephen C. Tweedie

Hi,

On Wed, Jan 31, 2001 at 07:28:01PM +0530, [EMAIL PROTECTED] wrote:
> 
> Do the following modifications to your wait queue extension sound
> reasonable ?
> 
> 1. Change add_wait_queue to add elements to the end of queue (fifo, by
> default) and instead have an add_wait_queue_lifo() routine that adds to the
> head of the queue ?

Cache efficiency: you wake up the task whose data set is most likely
to be in L1 cache by waking it before its triggering event is flushed
from cache.
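
Concretely, with the current wait queue layout the two variants would
differ only in which end of the list they touch (sketched from memory,
locking omitted; add_wait_queue_lifo is the proposed routine):

    /* FIFO (proposed default): the longest waiter is woken first. */
    static inline void __add_wait_queue_fifo(wait_queue_head_t *q,
                                             wait_queue_t *wait)
    {
        list_add_tail(&wait->task_list, &q->task_list);
    }

    /* LIFO: the most recently queued task is woken first, while its
     * working set is still likely to be warm in L1. */
    static inline void __add_wait_queue_lifo(wait_queue_head_t *q,
                                             wait_queue_t *wait)
    {
        list_add(&wait->task_list, &q->task_list);
    }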

--Stephen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/


