Re: scsi vs ide performance on fsync's
Hi,

On Wed, Mar 07, 2001 at 09:15:36PM +0100, Jens Axboe wrote:
> On Wed, Mar 07 2001, Stephen C. Tweedie wrote:
> > For most fs'es, that's not an issue. The fs won't start writeback on
> > the primary disk at all until the journal commit has been acknowledged
> > as firm on disk.
>
> But do you then force wait on that journal commit?

It doesn't matter too much --- it's only the writeback which is doing
this (ext3 uses a separate journal thread for it), so any sleep is only
there to wait for the moment when writeback can safely begin: users of
the filesystem won't see any stalls.

> > A barrier operation is sufficient then.
>
> So you're saying don't over design, a simple barrier is all you need?

Pretty much so. The simple barrier is the only thing which can be
effectively optimised at the hardware level with SCSI anyway.

Cheers,
Stephen
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Raw IO fixes for 2.4.2-ac8
Hi,

I've just uploaded the current raw IO fixes as kiobuf-2.4.2-ac8-A0.tar.gz on
ftp.uk.linux.org:/pub/linux/sct/fs/raw-io/ and
ftp.*.kernel.org:/pub/linux/kernel/people/sct/raw-io/

This includes:

00-movecode.diff: move kiobuf code from mm/memory.c to fs/iobuf.c
02-faultfix.diff: fixes for faulting and pinning pages
03-unbind.diff:   allow unbinding of raw devices
04-pgdirty.diff:  use the new SetPageDirty to dirty pages after reads
05-bh-err.diff:   fix cleanup of buffer_heads after ENOMEM
06-eio.diff:      fix error returned on EIO in first block of IO

The first 3 of these are from the current 2.2 raw patches. The 4th is the
fix for dirtying pages after raw reads, using the new functionality of
the 2.4 VM. The 5th and 6th fix up problems introduced when brw_kiovec
was moved to use submit_bh().

Cheers,
Stephen
Re: [patch] set kiobuf io_count once, instead of increment
Hi,

On Wed, Feb 28, 2001 at 09:18:59AM -0800, Robert Read wrote:
> On Tue, Feb 27, 2001 at 10:50:54PM -0300, Marcelo Tosatti wrote:
>
> This is true, but it looks like the brw_kiovec allocation failure
> handling is broken already; it's calling __put_unused_buffer_head on
> bhs without waiting for them to complete first. Also, the err won't
> be returned if the previous batch of bhs finished ok. It looks like
> brw_kiovec needs some work, but I'm going to need some coffee first...

Right, looks like this happened when somebody was changing the bh
submission mechanism to use submit_bh(). I'll fix it.

--Stephen
Re: [patch] set kiobuf io_count once, instead of increment
On Tue, Feb 27, 2001 at 04:22:22PM -0800, Robert Read wrote:
> Currently in brw_kiovec, iobuf->io_count is being incremented as each
> bh is submitted, and decremented in the bh->b_end_io(). This means
> io_count can go to zero before all the bhs have been submitted,
> especially during a large request. This causes the end_kio_request()
> to be called before all of the io is complete.

brw_kiovec is currently entirely synchronous, so end_kio_request()
calling is probably not a big deal right now. It would be much more
important for an async version.

--Stephen
Re: Writing on raw device with software RAID 0 is slow
Hi,

On Thu, Mar 01, 2001 at 11:08:13AM -0500, Ben LaHaise wrote:
> On Thu, 1 Mar 2001, Stephen C. Tweedie wrote:
>
> Actually, how about making it a sysctl? That's probably the most
> reasonable approach for now since the optimal size depends on hardware.

Fine with me.

--Stephen
Re: Writing on raw device with software RAID 0 is slow
Hi,

On Thu, Mar 01, 2001 at 10:44:38AM -0500, Ben LaHaise wrote:
> On Thu, 1 Mar 2001, Stephen C. Tweedie wrote:
>
> > Raw IO is always synchronous: it gets flushed to disk before the write
> > returns. You don't get any write-behind with raw IO, so the smaller
> > the blocksize you write in, the slower things get.
>
> More importantly, the mainstream raw io code only writes in 64KB chunks
> that are unpipelined, which can lead to writes not hitting the drive
> before the sector passes under the rw head. You can work around this to
> some extent by issuing multiple writes (via threads, or the aio work I've
> done) at the expense of atomicity. Also, before we allow locking of
> arbitrary larger ios in main memory, we need bean counting to prevent the
> obvious DoSes.

Yep. There shouldn't be any problem increasing the 64KB size, it's only
the lack of accounting for the pinned memory which stopped me increasing
it by default.

--Stephen
Re: ext3 fsck question
Hi,

On Wed, Feb 28, 2001 at 08:03:21PM -0600, Neal Gieselman wrote:
>
> I applied the libs and other utilities from e2fsprogs by hand.
> I ran fsck.ext3 on my secondary partition and it ran fine. The boot fsck
> on / was complaining about something but I could not catch it.
> I then went single user and ran fsck.ext3 on / while mounted.

e2fsck should complain loudly and ask for confirmation if you do that.
Going ahead with the fsck is a bad move on a mounted, rw filesystem!

> Excuse the stupid question, but with ext3, do I really require the
> fsck.ext3?

fsck.ext3 is just a link to e2fsck. Make sure you're running recent
e2fsprogs, though (either the latest snapshot from
downloads.sourceforge.net or a build from
ftp.uk.linux.org:/pub/linux/sct/fs/jfs/).

Cheers,
Stephen
Re: [PATCH] guard mm->rss with page_table_lock (241p11)
Hi,

On Mon, Feb 12, 2001 at 07:15:57PM -0800, george anzinger wrote:
> Excuse me if I am off base here, but wouldn't an atomic operation be
> better here. There are atomic inc/dec and add/sub macros for this. It
> just seems that that is all that is needed here (from inspection of the
> patch).

The counter-argument is that we already hold the page table lock in the
vast majority of places where the rss is modified, so overall it's
cheaper to avoid the extra atomic update.

Cheers,
Stephen
Re: ext2: block > big ?
Hi,

On Sun, Feb 11, 2001 at 05:44:02PM -0700, Brian Grossman wrote:
>
> What does a message like 'ext2: block > big' indicate?

An attempt was made to access a block beyond the legal max size for an
ext2 file. That probably implies a corrupt inode, because the ext2 file
write code checks for that limit and won't attempt to write beyond the
boundary.

Cheers,
Stephen
Re: Raw devices bound to RAID arrays ?
Hi,

On Sun, Feb 11, 2001 at 06:29:12PM +0200, Petru Paler wrote:
>
> Is it possible to bind a raw device to a software RAID 1 array ?

Yes.

--Stephen
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Hi,

On Thu, Feb 08, 2001 at 03:52:35PM +0100, Mikulas Patocka wrote:
>
> > How do you write high-performance ftp server without threads if select
> > on regular file always returns "ready"?
>
> No, it's not really possible on Linux. Use SYS$QIO call on VMS :-)

Ahh, but even VMS SYS$QIO is synchronous at doing opens, allocation of
the IO request packets, and mapping file location to disk blocks. Only
the data IO is ever async (and Ben's async IO stuff for Linux provides
that too).

--Stephen
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Hi,

On Thu, Feb 08, 2001 at 12:15:13AM +0100, Pavel Machek wrote:
>
> > EAGAIN is _not_ a valid return value for block devices or for regular
> > files. And in fact it _cannot_ be, because select() is defined to always
> > return 1 on them - so if a write() were to return EAGAIN, user space would
> > have nothing to wait on. Busy waiting is evil.
>
> So you consider inability to select() on regular files _feature_?

Select might make some sort of sense for sequential access to files, and
for random access via lseek/read, but it makes no sense at all for pread
and pwrite, where select() has no idea _which_ part of the file the user
is going to want to access next.

> How do you write high-performance ftp server without threads if select
> on regular file always returns "ready"?

Select can work if the access is sequential, but async IO is a more
general solution.

Cheers,
Stephen
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Hi,

On Wed, Feb 07, 2001 at 12:12:44PM -0700, Richard Gooch wrote:
> Stephen C. Tweedie writes:
> >
> > Sorry? I'm not sure where communication is breaking down here, but
> > we really don't seem to be talking about the same things. SGI's
> > kiobuf request patches already let us pass a large IO through the
> > request layer in a single unit without having to split it up to
> > squeeze it through the API.
>
> Isn't Linus saying that you can use (say) 4 kiB buffer_heads, so you
> don't need kiobufs? IIRC, kiobufs are page containers, so a 4 kiB
> buffer_head is effectively the same thing.

kiobufs let you encode _any_ contiguous region of user VA or of an
inode's page cache contents in one kiobuf, no matter how many pages
there are in it. A write of a megabyte to a raw device can be encoded as
a single kiobuf if we want to pass the entire 1MB IO down to the block
layers untouched. That's what the page vector in the kiobuf is for.

Doing the same thing with buffer_heads would still require a couple of
hundred of them, and you'd have to submit each such buffer_head to the
IO subsystem independently. And then the IO layer will just have to
reassemble them on the other side (and it may have to scan the device's
entire request queue once for every single buffer_head to do so).

> But an API extension to allow passing a pre-built chain would be even
> better.

Yep.

--Stephen
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Hi,

On Tue, Feb 06, 2001 at 06:37:41PM -0800, Linus Torvalds wrote:
>
> However, I really _do_ want to have the page cache have a bigger
> granularity than the smallest memory mapping size, and there are always
> special cases that might be able to generate IO in bigger chunks (ie
> in-kernel services etc)

No argument there.

> > Yes. We still have this fundamental property: if a user sends in a
> > 128kB IO, we end up having to split it up into buffer_heads and doing
> > a separate submit_bh() on each single one. Given our VM, PAGE_SIZE
> > (*not* PAGE_CACHE_SIZE) is the best granularity we can hope for in
> > this case.
>
> Absolutely. And this is independent of what kind of interface we end up
> using, whether it be kiobuf or just plain "struct buffer_head". In that
> respect they are equivalent.

Sorry? I'm not sure where communication is breaking down here, but we
really don't seem to be talking about the same things. SGI's kiobuf
request patches already let us pass a large IO through the request layer
in a single unit without having to split it up to squeeze it through the
API.

> > THAT is the overhead that I'm talking about: having to split a large
> > IO into small chunks, each of which just ends up having to be merged
> > back again into a single struct request by the *make_request code.
>
> You could easily just generate the bh then and there, if you wanted to.

In the current 2.4 tree, we already do: brw_kiovec creates the temporary
buffer_heads on demand to feed them to the IO layers.

> Your overhead comes from the fact that you want to gather the IO together.
> And I'm saying that you _shouldn't_ gather the IO. There's no point.

I don't --- the underlying layer does. And that is where the overhead
is: for every single large IO being created by the higher layers,
make_request is doing a dozen or more merges because I can only feed the
IO through make_request in tiny pieces.

> The gathering is sufficiently done by the low-level code anyway, and I've
> tried to explain why the low-level code _has_ to do that work regardless
> of what upper layers do.

I know. The problem is the low-level code doing it a hundred times for a
single injected IO.

> You need to generate a separate sg entry for each page anyway. So why not
> just use the existing one? The "struct buffer_head". Which already
> _handles_ all the issues that you have complained are hard to handle.

Two issues here. First is that the buffer_head is an enormously
heavyweight object for a sg-list fragment. It contains a ton of fields
of interest only to the buffer cache. We could mitigate this to some
extent by ensuring that the relevant fields for IO (rsector, size,
req_next, state, data, page etc) were in a single cache line.

Secondly, the cost of adding each single buffer_head to the request list
is O(n) in the number of requests already on the list. We end up walking
potentially the entire request queue before finding the request to merge
against, and we do that again and again, once for every single
buffer_head in the list. We do this even if the caller went in via a
multi-bh ll_rw_block() call, in which case we know in advance that all
of the buffer_heads are contiguous on disk.

There is a side problem: right now, things like raid remapping occur
during generic_make_request, before we have a request built. That means
that all of the raid0 remapping or raid1/5 request expanding is being
done on a per-buffer_head, not per-request, basis, so again we're doing
a whole lot of unnecessary duplicate work when an IO larger than a
buffer_head is submitted.

If you really don't mind the size of the buffer_head as a sg fragment
header, then at least I'd like us to be able to submit a pre-built chain
of bh's all at once without having to go through the remap/merge cost
for each single bh.

Cheers,
Stephen
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Hi,

On Wed, Feb 07, 2001 at 09:10:32AM +0000, David Howells wrote:
>
> I presume that correct_size will always be a power of 2...

Yes.

--Stephen
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Hi, On Tue, Feb 06, 2001 at 04:50:19PM -0800, Linus Torvalds wrote: > > > On Wed, 7 Feb 2001, Stephen C. Tweedie wrote: > > > > That gets us from 512-byte blocks to 4k, but no more (ll_rw_block > > enforces a single blocksize on all requests but that relaxing that > > requirement is no big deal). Buffer_heads can't deal with data which > > spans more than a page right now. > > "struct buffer_head" can deal with pretty much any size: the only thing it > cares about is bh->b_size. Right now, anything larger than a page is physically non-contiguous, and sorry if I didn't make that explicit, but I thought that was obvious enough that I didn't need to. We were talking about raw IO, and as long as we're doing IO out of user anonymous data allocated from individual pages, buffer_heads are limited to that page size in this context. > Have you ever spent even just 5 minutes actually _looking_ at the block > device layer, before you decided that you think it needs to be completely > re-done some other way? It appears that you never bothered to. Yes. We still have this fundamental property: if a user sends in a 128kB IO, we end up having to split it up into buffer_heads and doing a separate submit_bh() on each single one. Given our VM, PAGE_SIZE (*not* PAGE_CACHE_SIZE) is the best granularity we can hope for in this case. THAT is the overhead that I'm talking about: having to split a large IO into small chunks, each of which just ends up having to be merged back again into a single struct request by the *make_request code. A constructed IO request basically doesn't care about anything in the buffer_head except for the data pointer and size, and the completion status info and callback. All of the physical IO description is in the struct request by this point. 
The chain of buffer_heads is carrying around a huge amount of information which isn't used by the IO, and if the caller is something like the raw IO driver which isn't using the buffer cache, that extra buffer_head data is just overhead. --Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
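The split-then-remerge overhead described above can be sketched as follows; submit() stands in for submit_bh(), and the chunking mirrors the PAGE_SIZE granularity limit (a hypothetical helper, not kernel code):

```c
#include <assert.h>

#define PAGE_SIZE 4096u

/* Chop a large IO into PAGE_SIZE-limited chunks, invoking submit()
 * once per chunk -- the per-buffer_head submission pattern described
 * above.  submit may be NULL; the return value is how many separate
 * submissions the IO required, each of which the *make_request code
 * then has to merge back into a single struct request. */
static int split_and_submit(unsigned long long byte_offset,
                            unsigned int bytes,
                            void (*submit)(unsigned long long off,
                                           unsigned int len))
{
        int count = 0;

        while (bytes) {
                unsigned int chunk =
                        PAGE_SIZE - (unsigned int)(byte_offset % PAGE_SIZE);

                if (chunk > bytes)
                        chunk = bytes;
                if (submit)
                        submit(byte_offset, chunk);
                byte_offset += chunk;
                bytes -= chunk;
                count++;
        }
        return count;
}
```

A page-aligned 128kB IO turns into 32 submissions; start it 512 bytes into a page and it becomes 33.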
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Hi, On Tue, Feb 06, 2001 at 04:41:21PM -0800, Linus Torvalds wrote: > > On Wed, 7 Feb 2001, Stephen C. Tweedie wrote: > > No, it is a problem of the ll_rw_block interface: buffer_heads need to > > be aligned on disk at a multiple of their buffer size. > > Ehh.. True of ll_rw_block() and submit_bh(), which are meant for the > traditional block device setup, where "b_blocknr" is the "virtual > blocknumber" and that indeed is tied in to the block size. > > The fact is, if you have problems like the above, then you don't > understand the interfaces. And it sounds like you designed kiobuf support > around the wrong set of interfaces. They used the only interfaces available at the time... > If you want to get at the _sector_ level, then you do ... > which doesn't look all that complicated to me. What's the problem? Doesn't this break nastily as soon as the IO hits an LVM or soft raid device? I don't think we are safe if we create a larger-sized buffer_head which spans a raid stripe: the raid mapping is only applied once per buffer_head. --Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Hi, On Tue, Feb 06, 2001 at 07:25:19PM -0500, Ingo Molnar wrote: > > On Wed, 7 Feb 2001, Stephen C. Tweedie wrote: > > > No, it is a problem of the ll_rw_block interface: buffer_heads need to > > be aligned on disk at a multiple of their buffer size. Under the Unix > > raw IO interface it is perfectly legal to begin a 128kB IO at offset > > 512 bytes into a device. > > then we should either fix this limitation, or the raw IO code should split > the request up into several, variable-size bhs, so that the range is > filled out optimally with aligned bhs. That gets us from 512-byte blocks to 4k, but no more (ll_rw_block enforces a single blocksize on all requests, but relaxing that requirement is no big deal). Buffer_heads can't deal with data which spans more than a page right now. --Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
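Ingo's suggested scheme --- filling the range with variable-size bhs, each aligned on disk at a multiple of its own size --- can be sketched as a greedy split, capped at 4kB as noted in the reply (illustrative code, not the raw IO driver):

```c
#include <assert.h>

#define MAX_SECTORS_PER_BH 8    /* 4kB bh, measured in 512-byte sectors */

/* Greedy split of [start, start+nr) (units: 512-byte sectors) into
 * buffer_heads whose on-disk position is aligned to their own size,
 * as the ll_rw_block interface requires.  Returns how many bhs the
 * range needs. */
static int count_aligned_bhs(unsigned long start, unsigned long nr)
{
        int count = 0;

        while (nr) {
                unsigned long size = MAX_SECTORS_PER_BH;

                /* shrink until the chunk is size-aligned and fits */
                while (size > 1 && ((start % size) || size > nr))
                        size >>= 1;
                start += size;
                nr -= size;
                count++;
        }
        return count;
}
```

An aligned 128kB IO needs 32 4kB bhs; start the same IO one sector in and the edges force small bhs (512B, 1kB, 2kB ramps), giving 35 in total --- better than 256 sector-sized bhs, but still bounded by the page-size ceiling the reply points out.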
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Hi, On Tue, Feb 06, 2001 at 08:57:13PM +0100, Ingo Molnar wrote: > > [overhead of 512-byte bhs in the raw IO code is an artificial problem of > the raw IO code.] No, it is a problem of the ll_rw_block interface: buffer_heads need to be aligned on disk at a multiple of their buffer size. Under the Unix raw IO interface it is perfectly legal to begin a 128kB IO at offset 512 bytes into a device. --Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: sync & asyck i/o
Hi, On Tue, Feb 06, 2001 at 11:25:00AM -0800, Andre Hedrick wrote: > On Tue, 6 Feb 2001, Stephen C. Tweedie wrote: > > No, we simply omit to instruct them to enable write-back caching. > > Linux assumes that the WCE (write cache enable) bit in a disk's > > caching mode page is zero. > > You can not be so blind to omit the command. Linux has traditionally ignored the issue. Don't ask me to defend it --- the last advice I got from anybody who knew SCSI well was that SCSI disks were defaulting to WCE-disabled. Note that disabling SCSI WCE doesn't disable the cache, it just enforces synchronous completion. With tagged command queuing, writeback caching doesn't necessarily mean a huge performance increase. But if WCE is being enabled by default on modern SCSI drives, then that's something which the scsi stack really does need to fix --- the upper block layers will most definitely break if we have WCE enabled and we don't set force-unit-access on the scsi commands. The ll_rw_block interface is perfectly clear: it expects the data to be written to persistent storage once the buffer_head end_io is called. If that's not the case, somebody needs to fix the lower layers. Cheers, Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
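For reference, the WCE flag being discussed lives in the SCSI caching mode page (page code 0x08): bit 2 of byte 2 of the page, alongside RCD (read cache disable) in bit 0. A sketch of the bit test a mode-sense consumer would perform (an illustrative helper, not the scsi stack's actual code):

```c
#include <assert.h>

/* SCSI caching mode page: page code 0x08.  Byte 2 of the page carries
 * WCE (write cache enable, bit 2) and RCD (read cache disable, bit 0).
 * 'page' points at the start of the mode page itself, as returned
 * inside MODE SENSE data. */
#define CACHING_PAGE_CODE 0x08
#define WCE_BIT 0x04
#define RCD_BIT 0x01

/* Returns 1 if write-back caching is enabled, 0 if not,
 * -1 if this is not the caching mode page. */
static int wce_enabled(const unsigned char *page)
{
        if ((page[0] & 0x3f) != CACHING_PAGE_CODE)
                return -1;
        return (page[2] & WCE_BIT) != 0;
}
```

This is the bit Linux was assuming to be zero; when it is set, end_io no longer implies the data is on the platter unless force-unit-access is used.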
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Hi, On Tue, Feb 06, 2001 at 06:22:58PM +0100, Christoph Hellwig wrote: > On Tue, Feb 06, 2001 at 05:05:06PM +0000, Stephen C. Tweedie wrote: > > The whole point of the post was that it is merging, not splitting, > > which is troublesome. How are you going to merge requests without > > having chains of scatter-gather entities each with their own > > completion callbacks? > > The object passed down to the low-level driver just needs to be able > to contain multiple end-io callbacks. The decision what to call when > some of the scatter-gather entities fail is of course not so easy to > handle and needs further discussion. Umm, and if you want the separate higher-level IOs to be told which IOs succeeded and which ones failed on error, you need to associate each of the multiple completion callbacks with its particular scatter-gather fragment or fragments. So you end up with the same sort of kiobuf/kiovec concept where you have chains of sg chunks, each chunk with its own completion information. This is *precisely* what I've been trying to get people to address. Forget whether the individual sg fragments are based on pages or not: if you want to have IO merging and accurate completion callbacks, you need not just one sg list but multiple lists each with a separate callback header. Abandon the merging of sg-list requests (by moving that functionality into the higher-level layers) and that problem disappears: flat sg-lists will then work quite happily at the request layer. Cheers, Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
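The structure being argued for --- a merged request carrying a chain of sg chunks, each with its own completion callback so partial failure is reported per fragment --- might look roughly like this (all names are illustrative, not a proposed kernel API):

```c
#include <assert.h>
#include <stddef.h>

/* One scatter-gather fragment of a merged request, with its own
 * end_io callback.  'result' is bookkeeping for the demo below. */
struct sg_chunk {
        unsigned long start, len;
        void (*end_io)(struct sg_chunk *, int uptodate);
        struct sg_chunk *next;
        int result;
};

/* Complete a merged request in which everything before byte 'bad_at'
 * made it to disk: each chunk's own callback hears about its fate. */
static void complete_chain(struct sg_chunk *c, unsigned long bad_at)
{
        for (; c; c = c->next)
                c->end_io(c, c->start + c->len <= bad_at);
}

static void record(struct sg_chunk *c, int uptodate)
{
        c->result = uptodate;
}

/* Build a 3-fragment chain covering [0,8), [8,16), [16,24), complete
 * it with an error at byte 'bad_at', and return a bitmask of the
 * fragments whose callbacks reported success. */
static int demo_mask(unsigned long bad_at)
{
        struct sg_chunk c[3] = {
                { 0,  8, record, &c[1], 0 },
                { 8,  8, record, &c[2], 0 },
                { 16, 8, record, NULL,  0 },
        };
        int i, mask = 0;

        complete_chain(&c[0], bad_at);
        for (i = 0; i < 3; i++)
                if (c[i].result)
                        mask |= 1 << i;
        return mask;
}
```

With a single flat callback, all three callers would see one pass/fail bit; with per-chunk callbacks, an error partway through the merged IO tells exactly the right callers that their data is bad.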
Re: sync & asyck i/o
Hi, On Tue, Feb 06, 2001 at 02:52:40PM +, Alan Cox wrote: > > According to the man page for fsync it copies in-core data to disk > > prior to its return. Does that take async i/o to the media in account? > > I.e. does it wait for completion of the async i/o to the disk? > > Undefined. > In practice some IDE disks do write merging and small amounts of write > caching in the drive firmware so you cannot trust it 100%. It's worth noting that it *is* defined unambiguously in the standards: fsync waits until all the data is hard on disk. Linux will obey that if it possibly can: only in cases where the hardware is actively lying about when the data has hit disk will the guarantee break down. --Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
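The contract in question, seen from userspace: a successful fsync() means the data is expected to be on stable storage (modulo drive caches that lie about completion). A minimal POSIX example:

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Write a buffer and wait for it to reach stable storage.  Per the
 * standard, a successful fsync() means the data is hard on disk;
 * as noted above, only hardware that lies about completion breaks
 * this guarantee.  Returns 0 on success, -1 on any failure. */
static int write_durably(const char *path, const void *buf, size_t len)
{
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

        if (fd < 0)
                return -1;
        if (write(fd, buf, len) != (ssize_t)len ||
            fsync(fd) != 0) {   /* wait for the media, not the cache */
                close(fd);
                return -1;
        }
        return close(fd);
}
```

Note that checking the return value of fsync() (and of close()) matters: an async write error may only be reported here.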
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Hi, On Tue, Feb 06, 2001 at 06:00:58PM +0100, Christoph Hellwig wrote: > On Tue, Feb 06, 2001 at 12:07:04AM +0000, Stephen C. Tweedie wrote: > > > > Is that a realistic basis for a cleaned-up ll_rw_blk.c? > > I don't think os. If we minimize the state in the IO container object, > the lower levels could split them at their guess and the IO completion > function just has to handle the case that it might be called for a smaller > object. The whole point of the post was that it is merging, not splitting, which is troublesome. How are you going to merge requests without having chains of scatter-gather entities each with their own completion callbacks? --Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: rawio usage
Hi, On Mon, Feb 05, 2001 at 10:36:32PM -0800, Mayank Vasa wrote: > > When I run this program as root, I get the error "write: Invalid argument". Raw IO requires that the buffers are aligned on a 512-byte boundary in memory. --Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
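A userspace illustration of that alignment requirement: allocate the IO buffer with posix_memalign() rather than malloc(), which promises nothing beyond the usual malloc alignment:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Raw IO buffers must begin on a 512-byte boundary in memory; plain
 * malloc() makes no such promise, posix_memalign() does.  (Userspace
 * illustration of the requirement, not kernel code.) */
#define RAW_ALIGN 512

static void *alloc_raw_buffer(size_t len)
{
        void *buf = NULL;

        if (posix_memalign(&buf, RAW_ALIGN, len) != 0)
                return NULL;
        return buf;
}
```

A write() to /dev/raw/* from a buffer allocated this way avoids the EINVAL; an unaligned buffer is exactly what produces "write: Invalid argument".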
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Hi, OK, if we take a step back what does this look like: On Mon, Feb 05, 2001 at 08:54:29PM +, Stephen C. Tweedie wrote: > > If we are doing readahead, we want completion callbacks raised as soon > as possible on IO completions, no matter how many other IOs have been > merged with the current one. More importantly though, when we are > merging multiple page or buffer_head IOs in a request, we want to know > exactly which buffer/page contents are valid and which are not once > the IO completes. This is the current situation. If the page cache submits a 64K IO to the block layer, it does so in pieces, and then expects to be told on return exactly which pages succeeded and which failed. That's where the mess of having multiple completion objects in a single IO request comes from. Can we just forbid this case? That's the short cut that SGI's kiobuf block dev patches do when they get kiobufs: they currently deal with either buffer_heads or kiobufs in struct requests, but they don't merge kiobuf requests. (XFS already clusters the IOs for them in that case.) Is that a realistic basis for a cleaned-up ll_rw_blk.c? It implies that the caller has to do IO merging. For read, that's not much pain, as the most important case --- readahead --- is already done in a generic way which could submit larger IOs relatively easily. It would be harder for writes, but high-level write clustering code has already been started. It implies that for any IO, on IO failure you don't get told which part of the IO failed. That adds code to the caller: the page cache would have to retry per-page to work out which pages are readable and which are not. It means that for soft raid, you don't get told which blocks are bad if a stripe has an error anywhere. Ingo, is that a potential problem? But it gives very, very simple semantics to the request layer: single IOs go in (with a completion callback and a single scatter-gather list), and results go back with success or failure. 
With that change, it becomes _much_ more natural to push a simple sg list down through the disk layers. --Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
Hi, On Mon, Feb 05, 2001 at 11:06:48PM +, Alan Cox wrote:
> > do you then tell the application _above_ raid0 if one of the
> > underlying IOs succeeds and the other fails halfway through?
>
> struct
> {
>         u32 flags; /* because everything needs flags */
>         struct io_completion *completions;
>         kiovec_t sglist[0];
> } thingy;
>
> now kmalloc one object of the header, the sglist of the right size and the
> completion list. Shove the completion list on the end of it as another
> array of objects and what is the problem.

XFS uses both small metadata items in the buffer cache and large pagebufs. You may have merged a 512-byte read with a large pagebuf read: one completion callback is associated with a single sg fragment, the next callback belongs to a dozen different fragments. Associating the two lists becomes non-trivial, although it could be done. --Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
Hi, On Mon, Feb 05, 2001 at 10:28:37PM +0100, Ingo Molnar wrote: > > On Mon, 5 Feb 2001, Stephen C. Tweedie wrote: > > it's exactly these 'compound' structures i'm vehemently against. I do > think it's a design nightmare. I can picture these monster kiobufs > complicating the whole code for no good reason - we couldnt even get the > bh-list code in block_device.c right - why do you think kiobufs *all > across the kernel* will be any better? > > RAID0 is not an issue. Split it up, use separate kiobufs for every > different disk. Umm, that's not the point --- of course you can use separate kiobufs for the communication between raid0 and the underlying disks, but what do you then tell the application _above_ raid0 if one of the underlying IOs succeeds and the other fails halfway through? And what about raid1? Are you really saying that raid1 doesn't need to know which blocks succeeded and which failed? That's the level of completion information I'm worrying about at the moment. > fragmented skbs are a different matter: they are simply a bit more generic > abstractions of 'memory buffer'. Clear goal, clear solution. I do not > think kiobufs have clear goals. The goal: allow arbitrary IOs to be pushed down through the stack in such a way that the callers can get meaningful information back about what worked and what did not. If the write was a 128kB raw IO, then you obviously get coarse granularity of completion callback. If the write was a series of independent pages which happened to be contiguous on disk, you actually get told which pages hit disk and which did not. > and what is the goal of having multi-page kiobufs. To avoid having to do > multiple function calls via a simpler interface? Shouldnt we optimize that > codepath instead? The original multi-page buffers came from the map_user_kiobuf interface: they represented a user data buffer. 
I'm not wedded to that format --- we can happily replace it with a fine-grained sg list --- but the reason they have been pushed so far down the IO stack is the need for accurate completion information on the originally requested IOs. In other words, even if we expand the kiobuf into a sg vector list, when it comes to merging requests in ll_rw_blk.c we still need to track the callbacks on each independent source kiobufs. --Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Hi, On Mon, Feb 05, 2001 at 11:28:17AM -0800, Linus Torvalds wrote: > The _vectors_ are needed at the very lowest levels: the levels that do not > necessarily have to worry at all about completion notification etc. You > want the arbitrary scatter-gather vectors passed down to the stuff that > sets up the SG arrays etc, the stuff that doesn't care AT ALL about the > high-level semantics. OK, this is exactly where we have a problem: I can see too many cases where we *do* need to know about completion stuff at a fine granularity when it comes to disk IO (unlike network IO, where we can usually rely on a caller doing retransmit at some point in the stack). If we are doing readahead, we want completion callbacks raised as soon as possible on IO completions, no matter how many other IOs have been merged with the current one. More importantly though, when we are merging multiple page or buffer_head IOs in a request, we want to know exactly which buffer/page contents are valid and which are not once the IO completes. The current request struct's buffer_head list provides that quite naturally, but is a hugely heavyweight way of performing large IOs. What I'm really after is a way of sending IOs to make_request in such a way that if the caller provides an array of buffer_heads, it gets back completion information on each one, but if the IO is requested in large chunks (eg. XFS's pagebufs or large kiobufs from raw IO), then the request code can deal with it in those large chunks. What worries me is things like the soft raid1/5 code: pretending that we can skimp on the return information about which blocks were transferred successfully and which were not sounds like a really bad idea when you've got a driver which relies on that completion information in order to do intelligent error recovery. Cheers, Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
Hi, On Mon, Feb 05, 2001 at 08:36:31AM -0800, Linus Torvalds wrote: > Have you ever thought about other things, like networking, special > devices, stuff like that? They can (and do) have packet boundaries that > have nothing to do with pages what-so-ever. They can have such notions as > packets that contain multiple streams in one packet, where it ends up > being split up into several pieces. Where neither the original packet > _nor_ the final pieces have _anything_ to do with "pages". > > THERE IS NO PAGE ALIGNMENT. And kiobufs don't require IO to be page aligned, and they have never done. The only page alignment they assume is that if a *single* scatter-gather element spans multiple pages, then the joins between those pages occur on page boundaries. Remember, a kiobuf is only designed to represent one scatter-gather fragment, not a full sg list. That was the whole reason for having a kiovec as a separate concept: if you have more than one independent fragment in the sg-list, you need more than one kiobuf. And the reason why we created sg fragments which can span pages was so that we can encode IOs which interact with the VM: any arbitrary virtually-contiguous user data buffer can be mapped into a *single* kiobuf for a write() call, so it's a generic way of supporting things like O_DIRECT without the IO layers having to know anything about VM (and Ben's async IO patches also use kiobufs in this way to allow read()s to write to the user's data buffer once the IO completes, without having to have a context switch back into that user's context.) Similarly, any extent of a file in the page cache can be encoded in a single kiobuf. And no, the simpler networking-style sg-list does not cut it for block device IO, because for block devices, we want to have separate completion status made available for each individual sg fragment in the IO. 
*That* is why the kiobuf is more heavyweight than the networking variant: each fragment [kiobuf] in the scatter-gather list [kiovec] has its own completion information. If we have a bunch of separate data buffers queued for sequential disk IO as a single request, then we still want things like readahead and error handling to work. That means that we want the first kiobuf in the chain to get its completion wakeup as soon as that segment of the IO is complete, without having to wait for the remaining sectors of the IO to be transferred. It also means that if we've done something like split the IO over a raid stripe, then when an error occurs, we still want to know which of the callers' buffers succeeded and which failed. Yes, I agree that the original kiovec mechanism of using a *kiobuf[] array to assemble the scatter-gather fragments sucked. But I don't believe that just throwing away the concept of kiobuf as a sg-fragment will work either when it comes to disk IOs: the need for per-fragment completion is too compelling. I'd rather shift to allowing kiobufs to be assembled into linked lists for IO to avoid *kiobuf[] vectors, in just the same way that we currently chain buffer_heads for IO. Cheers, Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
Hi, On Mon, Feb 05, 2001 at 05:29:47PM +, Alan Cox wrote: > > > > _All_ drivers would have to do that in the degenerate case, because > > none of our drivers can deal with a dma boundary in the middle of a > > sector, and even in those places where the hardware supports it in > > theory, you are still often limited to word-alignment. > > Thats true for _block_ disk devices but if we want a generic kiovec then > if I am going from video capture to network I dont need to force anything more > than 4 byte align Kiobufs have never, ever required the IO to be aligned on any particular boundary. They simply make the assumption that the underlying buffered object can be described in terms of pages with some arbitrary (non-aligned) start/offset. Every video framebuffer I've ever seen satisfies that, so you can easily map an arbitrary contiguous region of the framebuffer with a kiobuf already. --Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
Hi, On Mon, Feb 05, 2001 at 03:19:09PM +, Alan Cox wrote: > > Yes, it's the sort of thing that you would hope should work, but in > > practice it's not reliable. > > So the less smart devices need to call something like > > kiovec_align(kiovec, 512); > > and have it do the bounce buffers ? _All_ drivers would have to do that in the degenerate case, because none of our drivers can deal with a dma boundary in the middle of a sector, and even in those places where the hardware supports it in theory, you are still often limited to word-alignment. --Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
Hi, On Mon, Feb 05, 2001 at 01:00:51PM +0100, Manfred Spraul wrote: > "Stephen C. Tweedie" wrote: > > > > You simply cannot do physical disk IO on > > non-sector-aligned memory or in chunks which aren't a multiple of > > sector size. > > Why not? > > Obviously the disk access itself must be sector aligned and the total > length must be a multiple of the sector length, but there shouldn't be > any restrictions on the data buffers. But there are. Many controllers just break down and corrupt things silently if you don't align the data buffers (Jeff Merkey found this by accident when he started generating unaligned IOs within page boundaries in his NWFS code). And a lot of controllers simply cannot break a sector dma over a page boundary (at least not without some form of IOMMU remapping). Yes, it's the sort of thing that you would hope should work, but in practice it's not reliable. Cheers, Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
Hi, On Mon, Feb 05, 2001 at 08:01:45PM +0530, [EMAIL PROTECTED] wrote: > > >It's the very essence of readahead that we wake up the earlier buffers > >as soon as they become available, without waiting for the later ones > >to complete, so we _need_ this multiple completion concept. > > I can understand this in principle, but when we have a single request going > down to the device that actually fills in multiple buffers, do we get > notified (interrupted) by the device before all the data in that request > got transferred ? It depends on the device driver. Different controllers will have different maximum transfer size. For IDE, for example, we get wakeups all over the place. For SCSI, it depends on how many scatter-gather entries the driver can push into a single on-the-wire request. Exceed that limit and the driver is forced to open a new scsi mailbox, and you get independent completion signals for each such chunk. > >Which is exactly why we have one kiobuf per higher-level buffer, and > >we chain together kiobufs when we need to for a long request, but we > >still get the independent completion notifiers. > > As I mentioned above, the alternative is to have the i/o completion related > linkage information within the wakeup structures instead. That way, it > doesn't matter to the lower level driver what higher level structure we > have above (maybe buffer heads, may be page cache structures, may be > kiobufs). We only chain together memory descriptors for the buffers during > the io. You forgot IO failures: it is essential, once the IO completes, to know exactly which higher-level structures completed successfully and which did not. The low-level drivers have to have access to the independent completion notifications for this to work. Cheers, Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
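The driver-dependent chunking described here is easy to quantify: with at most max_sg scatter-gather entries per on-the-wire command, an IO of N fragments generates ceil(N / max_sg) independent completions (a back-of-envelope sketch, not driver code):

```c
#include <assert.h>

/* If a driver can push at most max_sg scatter-gather entries into one
 * on-the-wire command, an IO of nr fragments is forced into
 * ceil(nr / max_sg) commands, each delivering its own completion
 * signal when it finishes. */
static int completions_for(int nr, int max_sg)
{
        return (nr + max_sg - 1) / max_sg;
}

/* Which of those completions a given fragment (0-based) belongs to --
 * i.e. which wakeup tells this fragment's caller that its data moved. */
static int completion_index(int fragment, int max_sg)
{
        return fragment / max_sg;
}
```

So a 129-fragment IO through a driver with a 128-entry limit yields two mailboxes and two completion signals, which is why the lower layers must be able to report them independently.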
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
Hi,

On Fri, Feb 02, 2001 at 01:02:28PM +0100, Christoph Hellwig wrote:

> > I may still be persuaded that we need the full scatter-gather list
> > fields throughout, but for now I tend to think that, at least in the
> > disk layers, we may get cleaner results by allowing linked lists of
> > page-aligned kiobufs instead.  That allows for merging of kiobufs
> > without having to copy all of the vector information each time.
>
> But it will have the same problems as the array solution: there will
> be one complete kio structure for each kiobuf, with its own end_io
> callback, etc.

And what's the problem with that?  You *need* this.  You have to have
that multiple-completion concept in the disk layers.  Think about chains
of buffer_heads being sent to disk as a single IO --- you need to know
which buffers make it to disk successfully and which had IO errors.

And no, the IO success is *not* necessarily sequential from the start of
the IO: if you are doing IO to raid0, for example, and the IO gets
striped across two disks, you might find that the first disk gets an
error so the start of the IO fails but the rest completes.  It's the
completion code which notifies the caller of what worked and what did
not.

And for readahead, you want to notify the caller as early as possible
about completion for the first part of the IO, even if the device driver
is still processing the rest.

Multiple completions are a necessary feature of the current block device
interface.  Removing that would be a step backwards.

Cheers,
 Stephen
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
Hi,

On Sun, Feb 04, 2001 at 06:54:58PM +0530, [EMAIL PROTECTED] wrote:

> Can't we define a kiobuf structure as just this?  A combination of a
> frag_list and a page_list?

Then all code which needs to accept an arbitrary kiobuf needs to be able
to parse both --- ugh.

> BTW, We could have a higher level io container that includes a status
> field and a wait_queue_head to take care of i/o completion

IO completion requirements are much more complex.  Think of disk
readahead: we can create a single request struct for an IO of a hundred
buffer heads, and as the device driver satisfies that request, it wakes
up the buffer heads as it goes.  There is a separate completion
notification for every single buffer head in the chain.

It's the very essence of readahead that we wake up the earlier buffers
as soon as they become available, without waiting for the later ones to
complete, so we _need_ this multiple completion concept.

Which is exactly why we have one kiobuf per higher-level buffer, and we
chain together kiobufs when we need to for a long request, but we still
get the independent completion notifiers.

Cheers,
 Stephen
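[The per-buffer completion idea in this mail can be sketched in a few
lines of toy C.  The names here are invented for illustration --- this is
not the kernel's actual buffer_head or kiobuf API --- but it shows the
shape of the argument: each buffer in a chained request carries its own
end_io notifier, so a driver can wake the early buffers before the tail
of the chain has transferred.]

```c
/* Toy model of per-buffer completion (illustrative only, not the
 * real kernel structures): each buffer in a chained request carries
 * its own end_io callback, so earlier buffers can be woken before
 * the whole chain finishes. */
#include <assert.h>
#include <stddef.h>

struct toy_buf {
    int uptodate;                      /* set when this buffer's data arrived */
    void (*end_io)(struct toy_buf *);  /* per-buffer completion notifier */
    struct toy_buf *next;              /* chain link for one large IO */
};

static void mark_done(struct toy_buf *b)
{
    b->uptodate = 1;
}

/* Driver side: complete only the first n buffers of the chain, as if
 * the controller had transferred just that much of the request so far. */
static void complete_first_n(struct toy_buf *head, int n)
{
    for (struct toy_buf *b = head; b && n-- > 0; b = b->next)
        b->end_io(b);
}
```

A readahead consumer waiting on the first buffer would then be woken as
soon as `complete_first_n` reaches it, regardless of how much of the
chain is still in flight.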
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
Hi,

On Sat, Feb 03, 2001 at 12:28:47PM -0800, Linus Torvalds wrote:

> On Thu, 1 Feb 2001, Stephen C. Tweedie wrote:
> >
> > Neither the read nor the write are page-aligned.
>
> I don't know where you got that idea.  It's obviously not true even in
> the common case: it depends _entirely_ on what the file offsets are,
> and expecting the offset to be zero is just being stupid.  It's often
> _not_ zero.  With networking it is in fact seldom zero, because the
> network packets are seldom aligned either in size or in location.

The underlying buffer is.  The VFS (and the current kiobuf code) is
already happy about IO happening at odd offsets within a page.  However,
the more general case --- doing zero-copy IO on arbitrary unaligned
buffers --- simply won't work if you expect to be able to push those
buffers to disk without a copy.

The splice case you talked about is fine because it's doing the normal
prepare/commit logic where the underlying buffer is page aligned, even
if the splice IO is not to a page aligned location.  That's _exactly_
what kiobufs were intended to support.

The prepare_read/prepare_write/pull/push cycle lets the caller tell the
pull() function where to store its data, because there are alignment
constraints which just can't be ignored: you simply cannot do physical
disk IO on non-sector-aligned memory or in chunks which aren't a
multiple of sector size.  (The buffer address alignment can sometimes be
relaxed --- obviously if you're doing PIO then it doesn't matter --- but
the length granularity is rigidly enforced.)

Cheers,
 Stephen
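[The alignment constraint stated above can be written out as a check.
`disk_io_ok` is a hypothetical helper invented for illustration, not a
real kernel function: physical disk IO wants a sector-aligned buffer
address and a length that is a whole number of sectors.]

```c
/* Illustrative check of the disk IO alignment rule described in the
 * mail (hypothetical helper, not a kernel API): the buffer must start
 * on a sector boundary and the length must be a non-zero multiple of
 * the sector size. */
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define SECTOR_SIZE 512u

static int disk_io_ok(uintptr_t buf_addr, size_t len)
{
    return buf_addr % SECTOR_SIZE == 0 &&
           len > 0 &&
           len % SECTOR_SIZE == 0;
}
```

(As the mail notes, the address half of this rule can sometimes be
relaxed for PIO, but the length half cannot.)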
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
Hi,

On Mon, Feb 05, 2001 at 01:00:51PM +0100, Manfred Spraul wrote:

> "Stephen C. Tweedie" wrote:
> >
> > You simply cannot do physical disk IO on non-sector-aligned memory
> > or in chunks which aren't a multiple of sector size.
>
> Why not?  Obviously the disk access itself must be sector aligned and
> the total length must be a multiple of the sector length, but there
> shouldn't be any restrictions on the data buffers.

But there are.  Many controllers just break down and corrupt things
silently if you don't align the data buffers (Jeff Merkey found this by
accident when he started generating unaligned IOs within page boundaries
in his NWFS code).  And a lot of controllers simply cannot break a
sector dma over a page boundary (at least not without some form of IOMMU
remapping).

Yes, it's the sort of thing that you would hope should work, but in
practice it's not reliable.

Cheers,
 Stephen
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
Hi,

On Mon, Feb 05, 2001 at 03:19:09PM +, Alan Cox wrote:

> > Yes, it's the sort of thing that you would hope should work, but in
> > practice it's not reliable.
>
> So the less smart devices need to call something like
>
>       kiovec_align(kiovec, 512);
>
> and have it do the bounce buffers ?

_All_ drivers would have to do that in the degenerate case, because none
of our drivers can deal with a dma boundary in the middle of a sector,
and even in those places where the hardware supports it in theory, you
are still often limited to word-alignment.

--Stephen
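[One plausible reading of Alan's `kiovec_align()` suggestion, sketched
in userspace C.  The helper name and shape are invented for
illustration --- no such function existed in this form: if the caller's
buffer is not sector aligned, substitute an aligned bounce buffer
holding a copy of the data, and remember to copy back after the IO.]

```c
/* Sketch of a bounce-buffer fallback (invented helper, illustration
 * only): return the original buffer if it is already sector aligned,
 * otherwise allocate an aligned bounce buffer and copy the data in.
 * The caller must copy back and free when *bounced is set. */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SECTOR 512u

static void *bounce_if_needed(void *buf, size_t len, int *bounced)
{
    if (((uintptr_t)buf % SECTOR) == 0) {
        *bounced = 0;
        return buf;                 /* already usable for DMA */
    }
    /* Round the allocation up to a whole number of sectors, as
     * aligned_alloc requires a size that is a multiple of alignment. */
    void *b = aligned_alloc(SECTOR, (len + SECTOR - 1) / SECTOR * SECTOR);
    if (b)
        memcpy(b, buf, len);
    *bounced = 1;
    return b;
}
```

As the reply points out, in the degenerate case every driver ends up
paying this copy, which is why the thread treats it as a fallback rather
than a design.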
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
Hi,

On Mon, Feb 05, 2001 at 05:29:47PM +, Alan Cox wrote:

> > _All_ drivers would have to do that in the degenerate case, because
> > none of our drivers can deal with a dma boundary in the middle of a
> > sector, and even in those places where the hardware supports it in
> > theory, you are still often limited to word-alignment.
>
> Thats true for _block_ disk devices but if we want a generic kiovec
> then if I am going from video capture to network I dont need to force
> anything more than 4 byte align

Kiobufs have never, ever required the IO to be aligned on any particular
boundary.  They simply make the assumption that the underlying buffered
object can be described in terms of pages with some arbitrary
(non-aligned) start/offset.  Every video framebuffer I've ever seen
satisfies that, so you can easily map an arbitrary contiguous region of
the framebuffer with a kiobuf already.

--Stephen
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
Hi,

On Mon, Feb 05, 2001 at 08:36:31AM -0800, Linus Torvalds wrote:

> Have you ever thought about other things, like networking, special
> devices, stuff like that?  They can (and do) have packet boundaries
> that have nothing to do with pages what-so-ever.  They can have such
> notions as packets that contain multiple streams in one packet, where
> it ends up being split up into several pieces.  Where neither the
> original packet _nor_ the final pieces have _anything_ to do with
> "pages".
>
> THERE IS NO PAGE ALIGNMENT.

And kiobufs don't require IO to be page aligned, and they never have.
The only page alignment they assume is that if a *single* scatter-gather
element spans multiple pages, then the joins between those pages occur
on page boundaries.

Remember, a kiobuf is only designed to represent one scatter-gather
fragment, not a full sg list.  That was the whole reason for having a
kiovec as a separate concept: if you have more than one independent
fragment in the sg-list, you need more than one kiobuf.

And the reason why we created sg fragments which can span pages was so
that we can encode IOs which interact with the VM: any arbitrary
virtually-contiguous user data buffer can be mapped into a *single*
kiobuf for a write() call, so it's a generic way of supporting things
like O_DIRECT without the IO layers having to know anything about VM
(and Ben's async IO patches also use kiobufs in this way to allow
read()s to write to the user's data buffer once the IO completes,
without having to have a context switch back into that user's context.)
Similarly, any extent of a file in the page cache can be encoded in a
single kiobuf.

And no, the simpler networking-style sg-list does not cut it for block
device IO, because for block devices, we want to have separate
completion status made available for each individual sg fragment in the
IO.  *That* is why the kiobuf is more heavyweight than the networking
variant: each fragment [kiobuf] in the scatter-gather list [kiovec] has
its own completion information.

If we have a bunch of separate data buffers queued for sequential disk
IO as a single request, then we still want things like readahead and
error handling to work.  That means that we want the first kiobuf in the
chain to get its completion wakeup as soon as that segment of the IO is
complete, without having to wait for the remaining sectors of the IO to
be transferred.  It also means that if we've done something like split
the IO over a raid stripe, then when an error occurs, we still want to
know which of the callers' buffers succeeded and which failed.

Yes, I agree that the original kiovec mechanism of using a *kiobuf[]
array to assemble the scatter-gather fragments sucked.  But I don't
believe that just throwing away the concept of kiobuf as a sg-fragment
will work either when it comes to disk IOs: the need for per-fragment
completion is too compelling.  I'd rather shift to allowing kiobufs to
be assembled into linked lists for IO to avoid *kiobuf[] vectors, in
just the same way that we currently chain buffer_heads for IO.

Cheers,
 Stephen
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Hi,

On Mon, Feb 05, 2001 at 11:28:17AM -0800, Linus Torvalds wrote:

> The _vectors_ are needed at the very lowest levels: the levels that do
> not necessarily have to worry at all about completion notification
> etc.  You want the arbitrary scatter-gather vectors passed down to the
> stuff that sets up the SG arrays etc, the stuff that doesn't care AT
> ALL about the high-level semantics.

OK, this is exactly where we have a problem: I can see too many cases
where we *do* need to know about completion stuff at a fine granularity
when it comes to disk IO (unlike network IO, where we can usually rely
on a caller doing retransmit at some point in the stack).

If we are doing readahead, we want completion callbacks raised as soon
as possible on IO completions, no matter how many other IOs have been
merged with the current one.  More importantly though, when we are
merging multiple page or buffer_head IOs in a request, we want to know
exactly which buffer/page contents are valid and which are not once the
IO completes.

The current request struct's buffer_head list provides that quite
naturally, but is a hugely heavyweight way of performing large IOs.
What I'm really after is a way of sending IOs to make_request in such a
way that if the caller provides an array of buffer_heads, it gets back
completion information on each one, but if the IO is requested in large
chunks (eg. XFS's pagebufs or large kiobufs from raw IO), then the
request code can deal with it in those large chunks.

What worries me is things like the soft raid1/5 code: pretending that we
can skimp on the return information about which blocks were transferred
successfully and which were not sounds like a really bad idea when
you've got a driver which relies on that completion information in order
to do intelligent error recovery.

Cheers,
 Stephen
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
Hi,

On Mon, Feb 05, 2001 at 10:28:37PM +0100, Ingo Molnar wrote:

> On Mon, 5 Feb 2001, Stephen C. Tweedie wrote:
>
> it's exactly these 'compound' structures i'm vehemently against.  I do
> think it's a design nightmare.  I can picture these monster kiobufs
> complicating the whole code for no good reason - we couldnt even get
> the bh-list code in block_device.c right - why do you think kiobufs
> *all across the kernel* will be any better?  RAID0 is not an issue.
> Split it up, use separate kiobufs for every different disk.

Umm, that's not the point --- of course you can use separate kiobufs for
the communication between raid0 and the underlying disks, but what do
you then tell the application _above_ raid0 if one of the underlying IOs
succeeds and the other fails halfway through?

And what about raid1?  Are you really saying that raid1 doesn't need to
know which blocks succeeded and which failed?  That's the level of
completion information I'm worrying about at the moment.

> fragmented skbs are a different matter: they are simply a bit more
> generic abstractions of 'memory buffer'.  Clear goal, clear solution.
> I do not think kiobufs have clear goals.

The goal: allow arbitrary IOs to be pushed down through the stack in
such a way that the callers can get meaningful information back about
what worked and what did not.  If the write was a 128kB raw IO, then you
obviously get coarse granularity of completion callback.  If the write
was a series of independent pages which happened to be contiguous on
disk, you actually get told which pages hit disk and which did not.

> and what is the goal of having multi-page kiobufs.  To avoid having to
> do multiple function calls via a simpler interface?  Shouldnt we
> optimize that codepath instead?

The original multi-page buffers came from the map_user_kiobuf interface:
they represented a user data buffer.  I'm not wedded to that format ---
we can happily replace it with a fine-grained sg list --- but the reason
they have been pushed so far down the IO stack is the need for accurate
completion information on the originally requested IOs.

In other words, even if we expand the kiobuf into a sg vector list, when
it comes to merging requests in ll_rw_blk.c we still need to track the
callbacks on each independent source kiobuf.

--Stephen
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
Hi,

On Mon, Feb 05, 2001 at 11:06:48PM +, Alan Cox wrote:

> > do you then tell the application _above_ raid0 if one of the
> > underlying IOs succeeds and the other fails halfway through?
>
> struct
> {
>       u32 flags;      /* because everything needs flags */
>       struct io_completion *completions;
>       kiovec_t sglist[0];
> } thingy;
>
> now kmalloc one object of the header the sglist of the right size and
> the completion list.  Shove the completion list on the end of it as
> another array of objects and what is the problem.

XFS uses both small metadata items in the buffer cache and large
pagebufs.  You may have merged a 512-byte read with a large pagebuf
read: one completion callback is associated with a single sg fragment,
the next callback belongs to a dozen different fragments.  Associating
the two lists becomes non-trivial, although it could be done.

--Stephen
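[Alan's `thingy` struct can be fleshed out in portable C to show where
the association problem in the reply bites: once IOs are merged, several
sg fragments may share one caller's completion record, so some explicit
fragment-to-completion map is needed on top of the two arrays.  All
names below are invented for illustration.]

```c
/* Sketch of the proposed structure (invented names, userspace C):
 * a single allocation carrying flags, a per-caller completion array,
 * and a trailing flexible array of sg fragments.  frag_owner[] is the
 * extra map the reply says is needed: completions[frag_owner[i]] is
 * the completion record that fragment i reports into. */
#include <assert.h>
#include <stdlib.h>

struct io_completion { int status; };

struct sg_frag { unsigned long offset, len; };

struct thingy {
    unsigned int flags;
    unsigned int nr_frags;
    struct io_completion *completions;  /* one entry per originating caller */
    int *frag_owner;                    /* fragment index -> caller index */
    struct sg_frag sglist[];            /* C99 flexible array member */
};

static struct thingy *thingy_alloc(unsigned int nr_frags,
                                   unsigned int nr_callers)
{
    struct thingy *t =
        calloc(1, sizeof(*t) + nr_frags * sizeof(struct sg_frag));
    if (!t)
        return NULL;
    t->nr_frags    = nr_frags;
    t->completions = calloc(nr_callers, sizeof(*t->completions));
    t->frag_owner  = calloc(nr_frags, sizeof(*t->frag_owner));
    return t;
}
```

In the XFS example from the mail, a merged request might carry one
512-byte metadata fragment owned by one completion record plus a dozen
pagebuf fragments all owned by another --- the many-to-one `frag_owner`
map is what makes that association explicit.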
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Hi,

OK, if we take a step back what does this look like:

On Mon, Feb 05, 2001 at 08:54:29PM +, Stephen C. Tweedie wrote:

> If we are doing readahead, we want completion callbacks raised as soon
> as possible on IO completions, no matter how many other IOs have been
> merged with the current one.  More importantly though, when we are
> merging multiple page or buffer_head IOs in a request, we want to know
> exactly which buffer/page contents are valid and which are not once
> the IO completes.

This is the current situation.  If the page cache submits a 64K IO to
the block layer, it does so in pieces, and then expects to be told on
return exactly which pages succeeded and which failed.  That's where the
mess of having multiple completion objects in a single IO request comes
from.

Can we just forbid this case?  That's the short cut that SGI's kiobuf
block dev patches do when they get kiobufs: they currently deal with
either buffer_heads or kiobufs in struct requests, but they don't merge
kiobuf requests.  (XFS already clusters the IOs for them in that case.)

Is that a realistic basis for a cleaned-up ll_rw_blk.c?

It implies that the caller has to do IO merging.  For read, that's not
much pain, as the most important case --- readahead --- is already done
in a generic way which could submit larger IOs relatively easily.  It
would be harder for writes, but high-level write clustering code has
already been started.

It implies that for any IO, on IO failure you don't get told which part
of the IO failed.  That adds code to the caller: the page cache would
have to retry per-page to work out which pages are readable and which
are not.

It means that for soft raid, you don't get told which blocks are bad if
a stripe has an error anywhere.  Ingo, is that a potential problem?

But it gives very, very simple semantics to the request layer: single
IOs go in (with a completion callback and a single scatter-gather list),
and results go back with success or failure.  With that change, it
becomes _much_ more natural to push a simple sg list down through the
disk layers.

--Stephen
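[The simplified contract proposed above --- one sg list and one
completion callback in, one whole-request status out --- can be written
down as a signature.  These are toy names invented for illustration,
not an interface that existed in the kernel.]

```c
/* Sketch of the simplified request contract (invented names): the
 * caller supplies a single scatter-gather list and one completion
 * callback, and the callback fires exactly once for the whole IO
 * with a single success-or-failure status. */
#include <assert.h>
#include <stddef.h>

struct sg_frag { void *addr; size_t len; };

typedef void (*io_done_fn)(int status, void *private);

struct simple_request {
    struct sg_frag *sglist;
    size_t nr_frags;
    io_done_fn done;       /* called exactly once for the whole IO */
    void *private;
};

/* Example completion handler: record the status into *private. */
static void record_status(int status, void *private)
{
    *(int *)private = status;
}

/* Stand-in for the driver side: the whole request completes at once,
 * which is precisely the loss of per-piece information the mail is
 * weighing up. */
static void simulate_completion(struct simple_request *req, int status)
{
    req->done(status, req->private);
}
```

Note what this buys and costs: the request layer becomes trivial, but a
partial failure can only be reported as failure of the whole request,
which is exactly the retry-per-page and soft-raid concern raised above.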
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
Hi,

On Fri, Feb 02, 2001 at 12:51:35PM +0100, Christoph Hellwig wrote:

> > If I have a page vector with a single offset/length pair, I can
> > build a new header with the same vector and modified offset/length
> > to split the vector in two without copying it.
>
> You just say in the higher-level structure ignore from x to y even if
> they have an offset in their own vector.

Exactly --- and so you end up with something _much_ uglier, because you
end up with all sorts of combinations of length/offset fields all over
the place.  This is _precisely_ the mess I want to avoid.

Cheers,
 Stephen
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
Hi,

On Thu, Feb 01, 2001 at 09:33:27PM +0100, Christoph Hellwig wrote:

> I think you want the whole kio concept only for disk-like IO.

No.  I want something good for zero-copy IO in general, but a lot of
that concerns the problem of interacting with the user, and the basic
center of that interaction in 99% of the interesting cases is either a
user VM buffer or the page cache --- all of which are page-aligned.

If you look at the sorts of models being proposed (even by Linus) for
splice, you get

        len = prepare_read();
        prepare_write();
        pull_fd();
        commit_write();

in which the read is being pulled into a known location in the page
cache --- it's page-aligned, again.

I'm perfectly willing to accept that there may be a need for
scatter-gather boundaries including non-page-aligned fragments in this
model, but I can't see one if you're using the page cache as a mediator,
nor if you're doing it through a user mmapped buffer.

The only reason you need finer scatter-gather boundaries --- and it may
be a compelling reason --- is if you are merging multiple IOs together
into a single device-level IO.  That makes perfect sense for the
zerocopy tcp case where you're doing MSG_MORE-type coalescing.  It
doesn't help the existing SGI kiobuf block device code, because that
performs its merging in the filesystem layers and the block device code
just squirts the IOs to the wire as-is, but if we want to start merging
those kiobuf-based IOs within make_request() then the block device layer
may want it too.

And Linus is right, the old way of using a *kiobuf[] for that was
painful, but the solution of adding start/length to every entry in the
page vector just doesn't sit right with many components of the block
device environment either.

I may still be persuaded that we need the full scatter-gather list
fields throughout, but for now I tend to think that, at least in the
disk layers, we may get cleaner results by allowing linked lists of
page-aligned kiobufs instead.  That allows for merging of kiobufs
without having to copy all of the vector information each time.

The killer, however, is what happens if you want to split such a merged
kiobuf.  Right now, that's something that I can only imagine happening
in the block layers if we start encoding buffer_head chains as kiobufs,
but if we do that in the future, or if we start merging genuine kiobuf
requests, then doing that split later on (for raid0 etc) may require
duplicating whole chains of kiobufs.  At that point, just doing
scatter-gather lists is cleaner.

But for now, the way to picture what I'm trying to achieve is that
kiobufs are a bit like buffer_heads --- they represent the physical
pages of some VM object that a higher layer has constructed, such as the
page cache or a user VM buffer.  You can chain these objects together
for IO, but that doesn't stop the individual objects from being separate
entities with independent IO completion callbacks to be honoured.

Cheers,
 Stephen
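[The clone-and-narrow operation this thread keeps returning to (the
raid0 example) can be sketched with toy types.  `struct toy_kiobuf` is
invented for illustration and is not the real `struct kiobuf`: the point
is only that the child reuses the parent's page vector and adjusts
offset/length, with no copying of the vector itself.]

```c
/* Sketch of cloning a kiobuf header and narrowing its window (toy
 * types, illustration only): a raid0 layer serving one stripe copies
 * just the header, shares the parent's page vector, and shifts the
 * offset/length to cover its sub-range. */
#include <assert.h>
#include <stddef.h>

struct page_stub;  /* opaque stand-in for the page descriptors */

struct toy_kiobuf {
    struct page_stub **maplist;  /* shared page vector, never copied */
    size_t offset;               /* byte offset of this IO in the vector */
    size_t length;               /* bytes covered by this IO */
};

static struct toy_kiobuf clone_narrowed(const struct toy_kiobuf *parent,
                                        size_t sub_off, size_t sub_len)
{
    struct toy_kiobuf child = *parent;        /* header copy only */
    child.offset = parent->offset + sub_off;  /* same pages, narrower window */
    child.length = sub_len;
    return child;
}
```

This is the cheap operation the offset/length-in-the-header layout
enables; the "killer" case in the mail is when the thing being split is
a chain of such headers rather than a single one.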
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
Hi,

On Thu, Feb 01, 2001 at 09:33:27PM +0100, Christoph Hellwig wrote:

> > On Thu, Feb 01, 2001 at 05:34:49PM +, Alan Cox wrote:
> >
> > In the disk IO case, you basically don't get that (the only thing
> > which comes close is raid5 parity blocks).  The data which the user
> > started with is the data sent out on the wire.  You do get some
> > interesting cases such as soft raid and LVM, or even in the scsi
> > stack if you run out of mailbox space, where you need to send only a
> > sub-chunk of the input buffer.
>
> Though your describption is right, I don't think the case is very
> common: Sometimes in LVM on a pv boundary and maybe sometimes in the
> scsi code.

On raid0 stripes, it's common to have stripes of between 16k and 64k, so
it's rather more common there than you'd like.  In any case, you need
the code to handle it, and I don't want to make the code paths any more
complex than necessary.

> In raid1 you need some kind of clone iobuf, which should work with
> both cases.  In raid0 you need a complete new pagelist anyway

No you don't.  You take the existing one, specify which region of it is
going to the current stripe, and send it off.  Nothing more.

> > In that case, having offset/len as the kiobuf limit markers is
> > ideal: you can clone a kiobuf header using the same page vector as
> > the parent, narrow down the start/end points, and continue down the
> > stack without having to copy any part of the page list.  If you had
> > the offset/len data encoded implicitly into each entry in the
> > sglist, you would not be able to do that.
>
> Sure you could: you embedd that information in a higher-level
> structure.

What's the point in a common data container structure if you need
higher-level information to make any sense out of it?

--Stephen
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains
Hi,

On Thu, Feb 01, 2001 at 09:46:27PM +0100, Christoph Hellwig wrote:
> > Right now we can take a kiobuf and turn it into a bunch of
> > buffer_heads for IO.  The io_count lets us track all of those
> > sub-IOs so that we know when all submitted IO has completed, so
> > that we can pass the completion callback back up the chain without
> > having to allocate yet more descriptor structs for the IO.
> >
> > Again, remove this and the IO becomes more heavyweight because we
> > need to create a separate struct for the info.
>
> No.  Just allow passing a multiple of the device's blocksize over
> ll_rw_block.

That was just one example: you need the sub-IOs just as much when you split up an IO over stripe boundaries in LVM or raid0, for example.  Secondly, ll_rw_block needs to die anyway: you can expand the blocksize up to PAGE_SIZE but not beyond, whereas something like ll_rw_kiobuf can submit a much larger IO atomically (and we have devices which don't start to deliver good throughput until you use IO sizes of 1MB or more).

> > > and the lack of scatter gather in one kiobuf struct (you always
> > > need an array)
> >
> > Again, _all_ data being sent down through the block device layer is
> > either in buffer heads or is page aligned.
>
> That's the point.  You are always talking about the block layer only.

I'm talking about why the minimal, generic solution doesn't provide what the block layer needs.

> > Obviously, extra code will be needed to scan kiobufs if we do that,
> > and unless we have both per-page _and_ per-kiobuf start/offset
> > pairs (adding even further to the complexity), those scatter-gather
> > lists would prevent us from carving up a kiobuf into smaller
> > sub-IOs without copying the whole (expanded) vector.
>
> No.  I think I explained that in my last mail.

How?  If I've got a vector (page X, offset 0, length PAGE_SIZE) and I want to split it in two, I have to make two new vectors (page X, offset 0, length n) and (page X, offset n, length PAGE_SIZE-n).  That implies copying both vectors.  If I have a page vector with a single offset/length pair, I can build a new header with the same vector and a modified offset/length to split the vector in two without copying it.

> > Possibly, but I remain to be convinced, because you may end up with
> > a mechanism which is generic but is not well-tuned for any specific
> > case, so everything goes slower.
>
> As kiobufs are widely used for real IO, just as containers, this is
> better than nothing.

Surely having all of the subsystems working fast is better still?

> And IMHO a nice generic concept that lets different subsystems work
> together is a _lot_ better than a bunch of over-optimized, rather
> isolated subsystems.  The IO-Lite people have done nice research on
> the effect of a unified IO-caching system vs. the typical isolated
> systems.

I know, and IO-Lite has some major problems (the close integration of that code into the cache, for example, makes it harder to expose the zero-copy to user-land).

--Stephen
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains
Hi,

On Thu, Feb 01, 2001 at 07:14:03PM +0100, Christoph Hellwig wrote:
> On Thu, Feb 01, 2001 at 05:41:20PM +0000, Stephen C. Tweedie wrote:
> > > We can't allocate a huge kiobuf structure just for requesting one
> > > page of IO.  It might get better with VM-level IO clustering
> > > though.
> >
> > A kiobuf is *much* smaller than, say, a buffer_head, and we
> > currently allocate a buffer_head per block for all IO.
>
> A kiobuf is 124 bytes,

... the vast majority of which is room for the page vector to expand without having to be copied.  You don't touch that in the normal case.

> a buffer_head 96.  And a buffer_head is additionally used for caching
> data, a kiobuf not.

Buffer_heads are _sometimes_ used for caching data.  That's one of the big problems with them: they are too overloaded, being both IO descriptors _and_ cache descriptors.  If you've got 128k of data to write out from user space, do you want to set up one kiobuf or 256 buffer_heads?  Buffer_heads become really very heavy indeed once you start doing non-trivial IO.

> > What is so heavyweight in the current kiobuf (other than the
> > embedded vector, which I've already noted I'm willing to cut)?
>
> array_len

kiobufs can be reused after IO.  You can depopulate a kiobuf, repopulate it with new pages and submit new IO without having to deallocate the kiobuf.  You can't do this without knowing how big the data vector is.  Removing that functionality will prevent reuse, making them _more_ heavyweight.

> io_count,

Right now we can take a kiobuf and turn it into a bunch of buffer_heads for IO.  The io_count lets us track all of those sub-IOs so that we know when all submitted IO has completed, so that we can pass the completion callback back up the chain without having to allocate yet more descriptor structs for the IO.

Again, remove this and the IO becomes more heavyweight because we need to create a separate struct for the info.

> the presence of wait_queue AND end_io,

That's fine, I'm happy scrapping the wait queue: people can always use the kiobuf private data field to refer to a wait queue if they want to.

> and the lack of scatter gather in one kiobuf struct (you always need
> an array)

Again, _all_ data being sent down through the block device layer is either in buffer heads or is page aligned.  You want us to triple the size of the "heavyweight" kiobuf's data vector for what gain, exactly?

Obviously, extra code will be needed to scan kiobufs if we do that, and unless we have both per-page _and_ per-kiobuf start/offset pairs (adding even further to the complexity), those scatter-gather lists would prevent us from carving up a kiobuf into smaller sub-IOs without copying the whole (expanded) vector.  That's a _lot_ of extra complexity in the disk IO layers.

I'm all for a fast kiobuf_to_sglist converter.  But I haven't seen any evidence that such scatter-gather lists will do anything in the block device case except complicate the code and decrease performance.

> S.th. like: ...
> makes it a lot simpler for the subsystems to integrate.

Possibly, but I remain to be convinced, because you may end up with a mechanism which is generic but is not well-tuned for any specific case, so everything goes slower.

--Stephen
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains
Hi,

On Thu, Feb 01, 2001 at 06:49:50PM +0100, Christoph Hellwig wrote:
> > Adding tons of base/limit pairs to kiobufs makes it worse not
> > better
>
> For disk I/O it makes the handling a little easier for the cost of
> the additional offset/length fields.

Umm, actually, no, it makes it much worse for many of the cases.

--Stephen
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains
Hi,

On Thu, Feb 01, 2001 at 05:34:49PM +0000, Alan Cox wrote:
> > I don't see any real advantage for disk IO.  The real advantage is
> > that we can have a generic structure that is also useful in e.g.
> > networking and can lead to a unified IO buffering scheme (a little
> > like IO-Lite).
>
> Networking wants something lighter rather than heavier.  Adding tons
> of base/limit pairs to kiobufs makes it worse not better.

Networking has fundamentally different requirements.  In a network stack, you want the ability to add fragments to unaligned chunks of data to represent headers at any point in the stack.

In the disk IO case, you basically don't get that (the only thing which comes close is raid5 parity blocks).  The data which the user started with is the data sent out on the wire.  You do get some interesting cases such as soft raid and LVM, or even in the scsi stack if you run out of mailbox space, where you need to send only a sub-chunk of the input buffer.

In that case, having offset/len as the kiobuf limit markers is ideal: you can clone a kiobuf header using the same page vector as the parent, narrow down the start/end points, and continue down the stack without having to copy any part of the page list.  If you had the offset/len data encoded implicitly into each entry in the sglist, you would not be able to do that.

--Stephen
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains
Hi,

On Thu, Feb 01, 2001 at 06:05:15PM +0100, Christoph Hellwig wrote:
> On Thu, Feb 01, 2001 at 04:16:15PM +0000, Stephen C. Tweedie wrote:
> > > No, and with the current kiobufs it would not make sense, because
> > > they are too heavy-weight.
> >
> > Really?  In what way?
>
> We can't allocate a huge kiobuf structure just for requesting one
> page of IO.  It might get better with VM-level IO clustering though.

A kiobuf is *much* smaller than, say, a buffer_head, and we currently allocate a buffer_head per block for all IO.

A kiobuf contains enough embedded page vector space for 16 pages by default, but I'm happy enough to remove that if people want.  However, note that that memory is not initialised, so there is no memory access cost at all for that empty space.  Remove that space and instead of one memory allocation per kiobuf you get two, so the cost goes *UP* for small IOs.

> > > With page,length,offset iobufs this makes sense and is IMHO the
> > > way to go.
> >
> > What, you mean adding *extra* stuff to the heavyweight kiobuf makes
> > it lean enough to do the job??
>
> No.  I was speaking about the light-weight kiobuf Linus & me
> discussed on lkml some time ago (though I'd much prefer to call it
> kiovec, analogous to BSD iovecs).

What is so heavyweight in the current kiobuf (other than the embedded vector, which I've already noted I'm willing to cut)?

--Stephen
Re: [PATCH] vma limited swapin readahead
Hi,

On Thu, Feb 01, 2001 at 02:45:04PM -0200, Rik van Riel wrote:
> On Thu, 1 Feb 2001, Stephen C. Tweedie wrote:
>
> But only when the extra pages we're reading in don't displace useful
> data from memory, making us fault in those other pages ... causing
> us to go to the disk again and do more readahead, which could
> potentially displace even more pages, etc...

Remember, it's a balance.  You can displace a few useful pages and still win overall because the cost _per page_ goes way down due to better disk IO utilisation.

> One solution could be to put (most of) the swapin readahead pages on
> the inactive_dirty list, so pressure by readahead on the resident
> pages is smaller and the not-used readahead pages are reclaimed
> faster.

Yep, that would make much sense.

--Stephen
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains
Hi,

On Thu, Feb 01, 2001 at 04:09:53PM +0100, Christoph Hellwig wrote:
> On Thu, Feb 01, 2001 at 08:14:58PM +0530, [EMAIL PROTECTED] wrote:
> > That would require the vfs interfaces themselves (address space
> > readpage/writepage ops) to take kiobufs as arguments, instead of
> > struct page *.  That's not the case right now, is it?
>
> No, and with the current kiobufs it would not make sense, because
> they are too heavy-weight.

Really?  In what way?

> With page,length,offset iobufs this makes sense and is IMHO the way
> to go.

What, you mean adding *extra* stuff to the heavyweight kiobuf makes it lean enough to do the job??

Cheers,
 Stephen
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains
Hi,

On Thu, Feb 01, 2001 at 10:08:45AM -0600, Steve Lord wrote:
> Christoph Hellwig wrote:
> > On Thu, Feb 01, 2001 at 08:14:58PM +0530, [EMAIL PROTECTED] wrote:
> > > That would require the vfs interfaces themselves (address space
> > > readpage/writepage ops) to take kiobufs as arguments, instead of
> > > struct page *.  That's not the case right now, is it?
> >
> > No, and with the current kiobufs it would not make sense, because
> > they are too heavy-weight.  With page,length,offset iobufs this
> > makes sense and is IMHO the way to go.
>
> Enquiring minds would like to know if you are working towards this
> revamp of the kiobuf structure at the moment, you have been very
> quiet recently.

I'm in the middle of some parts of it, and am actively soliciting feedback on what cleanups are required.  I've been merging all of the 2.2 fixes into a 2.4 kiobuf tree, and have started doing some of the cleanups needed --- removing the embedded page vector, and adding support for lightweight stacking of kiobufs for completion callback chains.

However, filesystem IO is almost *always* page aligned: O_DIRECT IO comes from VM pages, and internal filesystem IO comes from page cache pages.  Buffer cache IOs are the only exception, and kiobufs only fail for such IOs once you have multiple buffer_heads being merged into single requests.

So, what are the benefits in the disk IO stack of adding length/offset pairs to each page of the kiobuf?  Basically, the only advantage right now is that it would allow us to merge requests together without having to chain separate kiobufs.  However, chaining kiobufs in this case is actually much better than merging them if the original IOs came in as kiobufs: merging kiobufs requires us to reallocate a new, longer (page/offset/len) vector, whereas chaining kiobufs is just a list operation.

Having true scatter-gather lists in the kiobuf would let us represent arbitrary lists of buffer_heads as a single kiobuf, though, and that _is_ a big win if we can avoid using buffer_heads below the ll_rw_block layer at all.  (It's not clear that this is really possible, though, since we still need to propagate completion information back up into each individual buffer_head's status and wait queue.)

Cheers,
 Stephen
Re: [PATCH] vma limited swapin readahead
Hi,

On Thu, Feb 01, 2001 at 08:53:33AM -0200, Marcelo Tosatti wrote:
> On Thu, 1 Feb 2001, Stephen C. Tweedie wrote:
>
> If we're under free memory shortage, "unlucky" readaheads will be
> harmful.

I know, it's a balancing act.  But given that even one successful readahead per read will halve the number of swapin seeks, the performance loss due to the extra scavenging has got to be bad to outweigh the benefit.

Cheers,
 Stephen
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains
Hi,

On Thu, Feb 01, 2001 at 01:28:33PM +0530, [EMAIL PROTECTED] wrote:
> Here's a second pass attempt, based on Ben's wait queue extensions:
> Does this sound any better?

It's a mechanism, all right, but you haven't described what problems it is trying to solve, and where it is likely to be used, so it's hard to judge it.  :)

--Stephen
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains
Hi,

On Thu, Feb 01, 2001 at 10:25:22AM +0530, [EMAIL PROTECTED] wrote:
> > We _do_ need the ability to stack completion events, but as far as
> > the kiobuf work goes, my current thoughts are to do that by
> > stacking lightweight "clone" kiobufs.
>
> Would that work with stackable filesystems ?

Only if the filesystems were using VFS interfaces which used kiobufs.  Right now, the only filesystem using kiobufs is XFS, and it only passes them down to the block device layer, not to other filesystems.

> Being able to track the children of a kiobuf would help with I/O
> cancellation (e.g. to pull sub-IOs off their request queues if I/O
> cancellation for the parent kiobuf was issued).  Not essential, I
> guess, in general, but useful in some situations.

What exactly is the justification for IO cancellation?  It really upsets the normal flow of control through the IO stack to have voluntary cancellation semantics.

--Stephen
Re: [PATCH] vma limited swapin readahead
Hi,

On Wed, Jan 31, 2001 at 04:24:24PM -0800, David Gould wrote:
> I am skeptical of the argument that we can win by replacing "the
> least desirable" pages with pages that were even less desirable and
> that we have no recent indication of any need for.  It seems possible
> under heavy swap to discard quite a portion of the useful pages in
> favor of junk that just happened to have a lucky disk address.

When readin clustering was added to 2.2 for swap and paging, performance for a lot of VM-intensive tasks more than doubled.  Disk seeks are _expensive_.  If you read in 15 neighbouring pages on swapin and on average only one of them turns out to be useful, you have still halved the number of swapin IOs required.  The performance advantages are so enormous that they easily compensate for the cost of holding the other, unneeded pages in memory for a while.

Also remember that the readahead pages won't actually get mapped into memory, so they can be recycled easily.  So, under swapping you tend to find that the extra readin pages are going to be replacing old, unneeded readahead pages to some extent, rather than swapping out useful pages.

Cheers,
 Stephen
Re: [PATCH] vma limited swapin readahead
Hi, On Wed, Jan 31, 2001 at 04:24:24PM -0800, David Gould wrote: I am skeptical of the argument that we can win by replacing "the least desirable" pages with pages were even less desireable and that we have no recent indication of any need for. It seems possible under heavy swap to discard quite a portion of the useful pages in favor of junk that just happenned to have a lucky disk address. When readin clustering was added to 2.2 for swap and paging, performance for a lot of VM-intensive tasks more than doubled. Disk seeks are _expensive_. If you read in 15 neighbouring pages on swapin and on average only one of them turns out to be useful, you have still halved the number of swapin IOs required. The performance advantages are so enormous that easily compensate for the cost of holding the other, unneeded pages in memory for a while. Also remember that the readahead pages won't actually get mapped into memory, so they can be recycled easily. So, under swapping you tend to find that the extra readin pages are going to be replacing old, unneeded readahead pages to some extent, rather than swapping out useful pages. Cheers, Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
Hi, On Thu, Feb 01, 2001 at 10:25:22AM +0530, [EMAIL PROTECTED] wrote: We _do_ need the ability to stack completion events, but as far as the kiobuf work goes, my current thoughts are to do that by stacking lightweight "clone" kiobufs. Would that work with stackable filesystems ? Only if the filesystems were using VFS interfaces which used kiobufs. Right now, the only filesystem using kiobufs is XFS, and it only passes them down to the block device layer, not to other filesystems. Being able to track the children of a kiobuf would help with I/O cancellation (e.g. to pull sub-ios off their request queues if I/O cancellation for the parent kiobuf was issued). Not essential, I guess, in general, but useful in some situations. What exactly is the justification for IO cancellation? It really upsets the normal flow of control through the IO stack to have voluntary cancellation semantics. --Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
Hi, On Thu, Feb 01, 2001 at 01:28:33PM +0530, [EMAIL PROTECTED] wrote: Here's a second pass attempt, based on Ben's wait queue extensions: Does this sound any better ? It's a mechanism, all right, but you haven't described what problems it is trying to solve, and where it is likely to be used, so it's hard to judge it. :) --Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] vma limited swapin readahead
Hi, On Thu, Feb 01, 2001 at 08:53:33AM -0200, Marcelo Tosatti wrote: On Thu, 1 Feb 2001, Stephen C. Tweedie wrote: If we're under free memory shortage, "unlucky" readaheads will be harmful. I know, it's a balancing act. But given that even one successful readahead per read will halve the number of swapin seeks, the performance loss due to the extra scavenging has got to be bad to outweigh the benefit. Cheers, Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
Hi, On Thu, Feb 01, 2001 at 10:08:45AM -0600, Steve Lord wrote: Christoph Hellwig wrote: On Thu, Feb 01, 2001 at 08:14:58PM +0530, [EMAIL PROTECTED] wrote: That would require the vfs interfaces themselves (address space readpage/writepage ops) to take kiobufs as arguments, instead of struct page * . That's not the case right now, is it ? No, and with the current kiobufs it would not make sense, because they are to heavy-weight. With page,length,offsett iobufs this makes sense and is IMHO the way to go. Enquiring minds would like to know if you are working towards this revamp of the kiobuf structure at the moment, you have been very quiet recently. I'm in the middle of some parts of it, and am actively soliciting feedback on what cleanups are required. I've been merging all of the 2.2 fixes into a 2.4 kiobuf tree, and have started doing some of the cleanups needed --- removing the embedded page vector, and adding support for lightweight stacking of kiobufs for completion callback chains. However, filesystem IO is almost *always* page aligned: O_DIRECT IO comes from VM pages, and internal filesystem IO comes from page cache pages. Buffer cache IOs are the only exception, and kiobufs only fail for such IOs once you have multiple buffer_heads being merged into single requests. So, what are the benefits in the disk IO stack of adding length/offset pairs to each page of the kiobuf? Basically, the only advantage right now is that it would allow us to merge requests together without having to chain separate kiobufs. However, chaining kiobufs in this case is actually much better than merging them if the original IOs came in as kiobufs: merging kiobufs requires us to reallocate a new, longer (page/offset/len) vector, whereas chaining kiobufs is just a list operation. 
Having true scatter-gather lists in the kiobuf would let us represent arbitrary lists of buffer_heads as a single kiobuf, though, and that _is_ a big win if we can avoid using buffer_heads below the ll_rw_block layer at all. (It's not clear that this is really possible, though, since we still need to propagate completion information back up into each individual buffer head's status and wait queue.) Cheers, Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
Hi, On Thu, Feb 01, 2001 at 04:09:53PM +0100, Christoph Hellwig wrote: On Thu, Feb 01, 2001 at 08:14:58PM +0530, [EMAIL PROTECTED] wrote: That would require the vfs interfaces themselves (address space readpage/writepage ops) to take kiobufs as arguments, instead of struct page * . That's not the case right now, is it ? No, and with the current kiobufs it would not make sense, because they are to heavy-weight. Really? In what way? With page,length,offsett iobufs this makes sense and is IMHO the way to go. What, you mean adding *extra* stuff to the heavyweight kiobuf makes it lean enough to do the job?? Cheers, Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] vma limited swapin readahead
Hi, On Thu, Feb 01, 2001 at 02:45:04PM -0200, Rik van Riel wrote: On Thu, 1 Feb 2001, Stephen C. Tweedie wrote: But only when the extra pages we're reading in don't displace useful data from memory, making us fault in those other pages ... causing us to go to the disk again and do more readahead, which could potentially displace even more pages, etc... Remember, it's a balance. You can displace a few useful pages and still win overall because the cost _per page_ goes way down due to better disk IO utilisation. One solution could be to put (most of) the swapin readahead pages on the inactive_dirty list, so pressure by readahead on the resident pages is smaller and the not used readahead pages are reclaimed faster. Yep, that would make much sense. --Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
Hi, On Thu, Feb 01, 2001 at 06:05:15PM +0100, Christoph Hellwig wrote: On Thu, Feb 01, 2001 at 04:16:15PM +, Stephen C. Tweedie wrote: No, and with the current kiobufs it would not make sense, because they are to heavy-weight. Really? In what way? We can't allocate a huge kiobuf structure just for requesting one page of IO. It might get better with VM-level IO clustering though. A kiobuf is *much* smaller than, say, a buffer_head, and we currently allocate a buffer_head per block for all IO. A kiobuf contains enough embedded page vector space for 16 pages by default, but I'm happy enough to remove that if people want. However, note that that memory is not initialised, so there is no memory access cost at all for that empty space. Remove that space and instead of one memory allocation per kiobuf, you get two, so the cost goes *UP* for small IOs. With page,length,offsett iobufs this makes sense and is IMHO the way to go. What, you mean adding *extra* stuff to the heavyweight kiobuf makes it lean enough to do the job?? No. I was speaking abou the light-weight kiobuf Linux Me discussed on lkml some time ago (though I'd much more like to call it kiovec analogous to BSD iovecs). What is so heavyweight in the current kiobuf (other than the embedded vector, which I've already noted I'm willing to cut)? --Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
Hi, On Thu, Feb 01, 2001 at 05:34:49PM +, Alan Cox wrote: I don't see any real advantage for disk IO. The real advantage is that we can have a generic structure that is also usefull in e.g. networking and can lead to a unified IO buffering scheme (a little like IO-Lite). Networking wants something lighter rather than heavier. Adding tons of base/limit pairs to kiobufs makes it worse not better Networking has fundamentally different requirements. In a network stack, you want the ability to add fragments to unaligned chunks of data to represent headers at any point in the stack. In the disk IO case, you basically don't get that (the only thing which comes close is raid5 parity blocks). The data which the user started with is the data sent out on the wire. You do get some interesting cases such as soft raid and LVM, or even in the scsi stack if you run out of mailbox space, where you need to send only a sub-chunk of the input buffer. In that case, having offset/len as the kiobuf limit markers is ideal: you can clone a kiobuf header using the same page vector as the parent, narrow down the start/end points, and continue down the stack without having to copy any part of the page list. If you had the offset/len data encoded implicitly into each entry in the sglist, you would not be able to do that. --Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
Hi, On Thu, Feb 01, 2001 at 06:49:50PM +0100, Christoph Hellwig wrote: Adding tons of base/limit pairs to kiobufs makes it worse not better For disk I/O it makes the handling a little easier for the cost of the additional offset/length fields. Umm, actually, no, it makes it much worse for many of the cases. --Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait /notify + callback chains
Hi, On Thu, Feb 01, 2001 at 07:14:03PM +0100, Christoph Hellwig wrote: On Thu, Feb 01, 2001 at 05:41:20PM +, Stephen C. Tweedie wrote: We can't allocate a huge kiobuf structure just for requesting one page of IO. It might get better with VM-level IO clustering though. A kiobuf is *much* smaller than, say, a buffer_head, and we currently allocate a buffer_head per block for all IO. A kiobuf is 124 bytes, ... the vast majority of which is room for the page vector to expand without having to be copied. You don't touch that in the normal case. a buffer_head 96. And a buffer_head is additionally used for caching data, a kiobuf not. Buffer_heads are _sometimes_ used for caching data. That's one of the big problems with them, they are too overloaded, being both IO descriptors _and_ cache descriptors. If you've got 128k of data to write out from user space, do you want to set up one kiobuf or 256 buffer_heads? Buffer_heads become really very heavy indeed once you start doing non-trivial IO. What is so heavyweight in the current kiobuf (other than the embedded vector, which I've already noted I'm willing to cut)? array_len kiobufs can be reused after IO. You can depopulate a kiobuf, repopulate it with new pages and submit new IO without having to deallocate the kiobuf. You can't do this without knowing how big the data vector is. Removing that functionality will prevent reuse, making them _more_ heavyweight. io_count, Right now we can take a kiobuf and turn it into a bunch of buffer_heads for IO. The io_count lets us track all of those sub-IOs so that we know when all submitted IO has completed, so that we can pass the completion callback back up the chain without having to allocate yet more descriptor structs for the IO. Again, remove this and the IO becomes more heavyweight because we need to create a separate struct for the info. 
> the presence of wait_queue AND end_io,

That's fine; I'm happy scrapping the wait queue: people can always use
the kiobuf private data field to refer to a wait queue if they want to.

> and the lack of scatter gather in one kiobuf struct (you always need
> an array)

Again, _all_ data being sent down through the block device layer is
either in buffer heads or is page aligned.  You want us to triple the
size of the "heavyweight" kiobuf's data vector for what gain, exactly?

Obviously, extra code will be needed to scan kiobufs if we do that, and
unless we have both per-page _and_ per-kiobuf start/offset pairs
(adding even further to the complexity), those scatter-gather lists
would prevent us from carving up a kiobuf into smaller sub-ios without
copying the whole (expanded) vector.  That's a _lot_ of extra
complexity in the disk IO layers.

I'm all for a fast kiobuf_to_sglist converter.  But I haven't seen any
evidence that such scatter-gather lists will do anything in the block
device case except complicate the code and decrease performance.

> S.th. like: ... makes it a lot simpler for the subsystems to
> integrate.

Possibly, but I remain to be convinced, because you may end up with a
mechanism which is generic but is not well-tuned for any specific case,
so everything goes slower.

--Stephen
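To make the conversion cost concrete, here is a minimal userspace sketch of the kind of kiobuf_to_sglist converter mentioned above. The structure and names (`kiobuf_m`, `sg_ent`) are illustrative stand-ins, not the real 2.4 kernel definitions: a page vector plus a single offset/length pair expands into one scatter-gather entry per page touched.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096

/* Simplified model of a kiobuf: page vector + one offset/length pair.
 * Field names are illustrative, not the real 2.4 struct kiobuf. */
struct kiobuf_m {
    int     nr_pages;   /* pages currently populated           */
    int     offset;     /* byte offset into the first page     */
    int     length;     /* total bytes covered by the vector   */
    void  **maplist;    /* page vector (page-aligned buffers)  */
};

struct sg_ent {
    void   *addr;
    size_t  len;
};

/* Expand a kiobuf into a flat scatter-gather list.
 * Returns the number of entries written. */
static int kiobuf_to_sglist(const struct kiobuf_m *kio, struct sg_ent *sg)
{
    int i, n = 0;
    int off = kio->offset;
    int left = kio->length;

    for (i = 0; i < kio->nr_pages && left > 0; i++) {
        int chunk = PAGE_SIZE - off;
        if (chunk > left)
            chunk = left;
        sg[n].addr = (char *)kio->maplist[i] + off;
        sg[n].len  = (size_t)chunk;
        n++;
        left -= chunk;
        off = 0;        /* only the first page carries an offset */
    }
    return n;
}
```

The point of the single offset/length representation shows up here: the conversion is a trivial linear scan, so a fast converter at the driver boundary is cheap, while the kiobuf itself stays small.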
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains
Hi,

On Thu, Feb 01, 2001 at 09:46:27PM +0100, Christoph Hellwig wrote:

> > Right now we can take a kiobuf and turn it into a bunch of
> > buffer_heads for IO.  The io_count lets us track all of those
> > sub-IOs so that we know when all submitted IO has completed, so that
> > we can pass the completion callback back up the chain without having
> > to allocate yet more descriptor structs for the IO.  Again, remove
> > this and the IO becomes more heavyweight because we need to create a
> > separate struct for the info.
>
> No.  Just allow passing a multiple of the device's blocksize over
> ll_rw_block.

That was just one example: you need the sub-ios just as much when you
split up an IO over stripe boundaries in LVM or raid0, for example.

Secondly, ll_rw_block needs to die anyway: you can expand the blocksize
up to PAGE_SIZE but not beyond, whereas something like ll_rw_kiobuf can
submit a much larger IO atomically (and we have devices which don't
start to deliver good throughput until you use IO sizes of 1MB or
more).

> > and the lack of scatter gather in one kiobuf struct (you always
> > need an array)
> >
> > Again, _all_ data being sent down through the block device layer is
> > either in buffer heads or is page aligned.
>
> That's the point.  You are always talking about the block-layer only.

I'm talking about why the minimal, generic solution doesn't provide
what the block layer needs.

> > Obviously, extra code will be needed to scan kiobufs if we do that,
> > and unless we have both per-page _and_ per-kiobuf start/offset pairs
> > (adding even further to the complexity), those scatter-gather lists
> > would prevent us from carving up a kiobuf into smaller sub-ios
> > without copying the whole (expanded) vector.
>
> No.  I think I explained that in my last mail.

How?  If I've got a vector (page X, offset 0, length PAGE_SIZE) and I
want to split it in two, I have to make two new vectors (page X,
offset 0, length n) and (page X, offset n, length PAGE_SIZE-n).  That
implies copying both vectors.
If I have a page vector with a single offset/length pair, I can build a
new header with the same vector and modified offset/length to split the
vector in two without copying it.

> > Possibly, but I remain to be convinced, because you may end up with
> > a mechanism which is generic but is not well-tuned for any specific
> > case, so everything goes slower.
>
> As kiobufs are widely used for real IO, just as containers, this is
> better than nothing.

Surely having all of the subsystems working fast is better still?

> And IMHO a nice generic concept that lets different subsystems work
> together is a _lot_ better than a bunch of over-optimized, rather
> isolated subsystems.  The IO-Lite people have done nice research on
> the effect of a unified IO-caching system vs. the typical isolated
> systems.

I know, and IO-Lite has some major problems (the close integration of
that code into the cache, for example, makes it harder to expose the
zero-copy to user-land).

--Stephen
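The no-copy split described above can be sketched in a few lines. This is a hedged userspace model (`kiobuf_m` is an illustrative stand-in, not the kernel struct): both halves share the parent's page vector and differ only in their header fields, so the split is O(1) regardless of how long the vector is.

```c
#include <assert.h>

/* Illustrative model: a kiobuf header pointing at a shared page vector. */
struct kiobuf_m {
    int    nr_pages;
    int    offset;     /* byte offset into the vector */
    int    length;     /* bytes covered               */
    void **maplist;    /* shared page vector, never copied */
};

/* Split 'src' at 'point' bytes: 'front' and 'back' reference the same
 * maplist; only the single offset/length pair in each header changes. */
static void kiobuf_split(const struct kiobuf_m *src, int point,
                         struct kiobuf_m *front, struct kiobuf_m *back)
{
    *front = *src;              /* struct copy: header only, not pages */
    *back  = *src;
    front->length = point;
    back->offset  = src->offset + point;
    back->length  = src->length - point;
}
```

With per-entry offset/length pairs in the vector, the same split would force both new vectors to be written out in full, which is exactly the copying objected to in the mail above.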
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains
Hi,

On Thu, Feb 01, 2001 at 09:33:27PM +0100, Christoph Hellwig wrote:
> On Thu, Feb 01, 2001 at 05:34:49PM +, Alan Cox wrote:
>
> > > In the disk IO case, you basically don't get that (the only thing
> > > which comes close is raid5 parity blocks).  The data which the
> > > user started with is the data sent out on the wire.
> >
> > You do get some interesting cases such as soft raid and LVM, or even
> > in the scsi stack if you run out of mailbox space, where you need to
> > send only a sub-chunk of the input buffer.
>
> Though your description is right, I don't think the case is very
> common: sometimes in LVM on a pv boundary and maybe sometimes in the
> scsi code.

On raid0 stripes, it's common to have stripes of between 16k and 64k,
so it's rather more common there than you'd like.  In any case, you
need the code to handle it, and I don't want to make the code paths any
more complex than necessary.

> In raid1 you need some kind of clone iobuf, which should work with
> both cases.  In raid0 you need a complete new pagelist anyway.

No you don't.  You take the existing one, specify which region of it is
going to the current stripe, and send it off.  Nothing more.

In that case, having offset/len as the kiobuf limit markers is ideal:
you can clone a kiobuf header using the same page vector as the parent,
narrow down the start/end points, and continue down the stack without
having to copy any part of the page list.

> > If you had the offset/len data encoded implicitly into each entry in
> > the sglist, you would not be able to do that.
>
> Sure you could: you embed that information in a higher-level
> structure.

What's the point of a common data container structure if you need
higher-level information to make any sense out of it?

--Stephen
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains
Hi,

On Thu, Feb 01, 2001 at 09:33:27PM +0100, Christoph Hellwig wrote:

> I think you want the whole kio concept only for disk-like IO.

No.  I want something good for zero-copy IO in general, but a lot of
that concerns the problem of interacting with the user, and the basic
center of that interaction in 99% of the interesting cases is either a
user VM buffer or the page cache --- all of which are page-aligned.

If you look at the sorts of models being proposed (even by Linus) for
splice, you get

	len = prepare_read();
	prepare_write();
	pull_fd();
	commit_write();

in which the read is being pulled into a known location in the page
cache --- it's page-aligned, again.

I'm perfectly willing to accept that there may be a need for
scatter-gather boundaries including non-page-aligned fragments in this
model, but I can't see one if you're using the page cache as a
mediator, nor if you're doing it through a user mmapped buffer.

The only reason you need finer scatter-gather boundaries --- and it may
be a compelling reason --- is if you are merging multiple IOs together
into a single device-level IO.  That makes perfect sense for the
zerocopy tcp case where you're doing MSG_MORE-type coalescing.  It
doesn't help the existing SGI kiobuf block device code, because that
performs its merging in the filesystem layers and the block device code
just squirts the IOs to the wire as-is, but if we want to start merging
those kiobuf-based IOs within make_request() then the block device
layer may want it too.

And Linus is right, the old way of using a *kiobuf[] for that was
painful, but the solution of adding start/length to every entry in the
page vector just doesn't sit right with many components of the block
device environment either.

I may still be persuaded that we need the full scatter-gather list
fields throughout, but for now I tend to think that, at least in the
disk layers, we may get cleaner results by allowing linked lists of
page-aligned kiobufs instead.
That allows for merging of kiobufs without having to copy all of the
vector information each time.

The killer, however, is what happens if you want to split such a merged
kiobuf.  Right now, that's something that I can only imagine happening
in the block layers if we start encoding buffer_head chains as kiobufs,
but if we do that in the future, or if we start merging genuine kiobuf
requests, then doing that split later on (for raid0 etc.) may require
duplicating whole chains of kiobufs.  At that point, just doing
scatter-gather lists is cleaner.

But for now, the way to picture what I'm trying to achieve is that
kiobufs are a bit like buffer_heads --- they represent the physical
pages of some VM object that a higher layer has constructed, such as
the page cache or a user VM buffer.  You can chain these objects
together for IO, but that doesn't stop the individual objects from
being separate entities with independent IO completion callbacks to be
honoured.

Cheers,
 Stephen
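The picture of chained kiobufs with independent completion callbacks, combined with the io_count accounting discussed earlier in the thread, might look roughly like the following userspace sketch. All names and fields here are illustrative assumptions, not the real 2.4 API: each chain element carries its own end_io, and io_count counts the outstanding sub-IOs (e.g. per-buffer_head completions) before that element's callback fires.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative chain element: per-kiobuf completion state. */
struct kiobuf_m {
    struct kiobuf_m *next;       /* chain link for merged IO      */
    int              io_count;   /* outstanding sub-IOs           */
    int              error;      /* first error seen, 0 if none   */
    void           (*end_io)(struct kiobuf_m *);
};

/* Called once per completed sub-IO.  When the last sub-IO for this
 * element finishes, its own end_io callback is honoured --- chaining
 * does not merge the elements' completion semantics. */
static void end_kio_request(struct kiobuf_m *kio, int err)
{
    if (err && !kio->error)
        kio->error = err;
    if (--kio->io_count == 0 && kio->end_io)
        kio->end_io(kio);
}

/* Tiny demo callback used below. */
static int completions;
static void count_completion(struct kiobuf_m *k) { (void)k; completions++; }
```

Splitting such a chain at an arbitrary byte boundary (the raid0 case in the mail above) is exactly where this scheme gets awkward: the split point may fall inside one element, forcing duplication of part of the chain.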
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains
Hi,

On Wed, Jan 31, 2001 at 07:28:01PM +0530, [EMAIL PROTECTED] wrote:

> Do the following modifications to your wait queue extension sound
> reasonable?
>
> 1. Change add_wait_queue to add elements to the end of queue (fifo,
> by default) and instead have an add_wait_queue_lifo() routine that
> adds to the head of the queue?

Cache efficiency: you wake up the task whose data set is most likely to
be in L1 cache by waking it before its triggering event is flushed from
cache.

--Stephen
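The FIFO-by-default / explicit-LIFO proposal can be modelled with a toy queue. This is a simplified userspace sketch, not the real kernel wait-queue implementation; `add_wait_queue_lifo` follows the naming suggested in the quoted mail, and "wake one" simply pops the head, so a LIFO add puts the most recently queued (cache-warm) waiter first in line.

```c
#include <assert.h>
#include <stddef.h>

/* Toy singly-linked wait queue; illustrative only. */
struct waiter {
    struct waiter *next;
    int id;
};

struct wait_queue_m { struct waiter *head, *tail; };

/* Default behaviour in the proposal: append, so wakeups are fair. */
static void add_wait_queue_fifo(struct wait_queue_m *q, struct waiter *w)
{
    w->next = NULL;
    if (q->tail) q->tail->next = w; else q->head = w;
    q->tail = w;
}

/* LIFO variant: push at the head, so the cache-warm waiter wakes first. */
static void add_wait_queue_lifo(struct wait_queue_m *q, struct waiter *w)
{
    w->next = q->head;
    q->head = w;
    if (!q->tail) q->tail = w;
}

/* wake-one: pop whichever waiter is at the head. */
static struct waiter *wake_up_one(struct wait_queue_m *q)
{
    struct waiter *w = q->head;
    if (w) {
        q->head = w->next;
        if (!q->head) q->tail = NULL;
    }
    return w;
}
```

The cache argument in the mail is then just this: with the LIFO add, the task woken is the one that went to sleep last, whose working set is the one least likely to have been evicted from L1 in the meantime.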