Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Going through all the discussions once again and trying to look at this
from the point of view of just the basic requirements for data structures
and mechanisms that they imply:

1. Should have a data structure that represents a memory chain, which may
   not be contiguous in physical memory, and which can be passed down as a
   single unit all the way through to the lowest level drivers - e.g. for
   direct i/o to/from a contiguous virtual address range in user space
   (without any intermediate copies).

   (Networking and block i/o may require different optimizations in the
   design of such a data structure, due to differences in the kind of
   patterns expected, as is apparent from the zero-copy networking
   fragments vs raw i/o kiobuf/kiovec patches. There are situations when
   such a data structure may be passed between subsystems, as in the i2o
   example.)

   This data structure could be part of an I/O container.

2. I/O containers may get split or merged as they pass through various
   layers --- so any completion mechanism and i/o container design should
   be able to account for both cases. At any point, a request could be
   - a collection of several higher level requests, or
   - one among several sub-requests of a single higher level request.

   (Just as appropriate "clustering" could happen at each level,
   appropriate "splitting" may also take place depending on the situation.
   It may make sense to delay splitting as far down the chain as possible
   in many situations, where the higher level is only interested in the
   i/o in its entirety and not in partial completion.)

   When caching/buffers are involved, sometimes the sub-requests of a
   single higher level request may have individual completion requirements
   (even when no merges were involved), because the sub-request buffers
   may be used to service other requests alongside. With raw i/o that
   might not be the case.

3. It is desirable that layers which process the requests along the way
   without splitting/merging be able to pass along the same I/O container
   without any duplication or cloning, and intercept async i/o completions
   for post processing.

4. (Optional) It would be nice if different kinds of I/O containers or
   buffer structures could be used at different levels, without having
   explicit linkage fields (like bh --> page, for example), and in a way
   that intermediate drivers or layers can work with transparently.

3 & 4 are more of layering related items, which get a little specific, but
do 1 and 2 cover the general things we are looking for?

Regards
Suparna

Suparna Bhattacharya
Systems Software Group, IBM Global Services, India
E-mail : [EMAIL PROTECTED]
Phone : 91-80-5267117, Extn : 2525
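A minimal sketch of what requirements 1 and 2 above could look like as C
structures. All names here are invented for illustration -- this is not
the kiobuf API or any existing kernel interface -- and locking around the
error field is omitted for brevity:

	#include <asm/atomic.h>

	/* One physically contiguous piece of the memory chain. */
	struct mem_frag {
		void *addr;              /* kernel mapping of the piece  */
		size_t len;
		struct mem_frag *next;   /* chain need not be contiguous */
	};

	/* Requirement 1: the chain travels as a single unit.
	 * Requirement 2: 'pending' accounts for splits and merges. */
	struct io_container {
		struct mem_frag *frags;
		atomic_t pending;        /* live sub-requests            */
		int error;               /* first error wins (unlocked)  */
		void (*complete)(struct io_container *, int error);
		struct io_container *parent;  /* set on split children   */
	};

	/* A layer that splits takes one reference per child... */
	static void io_split_ref(struct io_container *c)
	{
		atomic_inc(&c->pending);
	}

	/* ...and each child completion drops one; the last completion
	 * fires the callback, so a higher level that only cares about
	 * the i/o in its entirety sees exactly one event. */
	static void io_complete(struct io_container *c, int err)
	{
		if (err && !c->error)
			c->error = err;
		if (atomic_dec_and_test(&c->pending))
			c->complete(c, c->error);
	}

A layer that passes the container through unchanged (requirement 3) would
just save and replace the 'complete' pointer to intercept the completion,
with no cloning needed.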
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Hi!

> > So you consider inability to select() on regular files _feature_?
>
> select on files is unimplementable. You can't do background file IO the
> same way you do background receiving of packets on socket. Filesystem is
> synchronous. It can block.

You can use helper threads if the VFS layer is not able to handle
background IO. Then we can do it right in linux-4.4.

								Pavel
--
I'm [EMAIL PROTECTED] "In my country we have almost anarchy and I don't care."
Panos Katsaloulis describing me w.r.t. patents at [EMAIL PROTECTED]
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Linus Torvalds wrote:
> Absolutely. This is exactly what I mean by saying that low-level drivers
> may not actually be able to handle new cases that they've never been
> asked to do before - they just never saw anything like a 64kB request
> before or something that crossed its own alignment.
>
> But the _higher_ levels are there. And there's absolutely nothing in the
> design that is a real problem. But there's no question that you might
> need to fix up more than one or two low-level drivers.
>
> (The only drivers I know better are the IDE ones, and as far as I can
> tell they'd have no trouble at all with any of this. Most other normal
> drivers are likely to be in this same situation. But because I've not had
> a reason to test, I certainly won't guarantee even that).

PCI has dma_mask, which distinguishes different device capabilities. This
nice interface handles 64-bit capable devices, 32-bit ones, ISA
limitations (the old 16MB limit) and some other strange devices. This mask
appears in block devices one way or another so that bounce buffers are
used for high addresses.

How about a mask for block devices which indicates the kinds of alignment
and lengths that the driver can handle? For old drivers that can't be
thoroughly tested, we assume the worst. Some devices have hardware
limitations. Newer, tested drivers can relax the limits.

It's probably not difficult to say, "this 64k request can't be handled so
split it into 1k requests". It integrates naturally with the decision to
use bounce buffers -- alignment restrictions cause copying just as high
addresses cause copying.

-- Jamie
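The mask Jamie describes might look something like this -- a sketch with
invented names, not an existing interface:

	#include <linux/types.h>

	/* Per-queue limits; conservative defaults for untested drivers,
	 * relaxed by drivers that are known to handle more. */
	struct blk_limits {
		unsigned int max_len;      /* largest single request    */
		unsigned int align_mask;   /* required buffer alignment */
		u64 dma_mask;              /* highest DMA-able address  */
	};

	static const struct blk_limits conservative_limits = {
		.max_len    = 1024,        /* "split it into 1k requests" */
		.align_mask = 511,         /* sector-aligned buffers only */
		.dma_mask   = 0x00ffffff,  /* ISA-style 16MB limit        */
	};

	/* Mirrors the bounce-buffer decision: an oversized, misaligned
	 * or too-high request gets split or copied, just as high
	 * addresses already do today. */
	static int request_needs_fixup(const struct blk_limits *lim,
				       u64 addr, unsigned int len)
	{
		return len > lim->max_len ||
		       (addr & lim->align_mask) != 0 ||
		       (addr + len - 1) > lim->dma_mask;
	}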
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Linus Torvalds wrote:
>
> On Thu, 8 Feb 2001, Rik van Riel wrote:
> > On Thu, 8 Feb 2001, Mikulas Patocka wrote:
> > > > > You need aio_open.
> > > > Could you explain this?
> > >
> > > If the server is sending many small files, disk spends huge
> > > amount time walking directory tree and seeking to inodes. Maybe
> > > opening the file is even slower than reading it
> >
> > Not if you have a big enough inode_cache and dentry_cache.
> >
> > OTOH ... if you have enough memory the whole async IO argument
> > is moot anyway because all your files will be in memory too.
>
> Note that this _is_ an important point.
>
> You should never _ever_ think about pure IO speed as the most important
> thing. Even if you get absolutely perfect IO streaming off the fastest
> disk you can find, I will beat you every single time with a cached setup
> that doesn't need to do IO at all.
>
> 90% of the VFS layer is all about caching, and trying to avoid IO. Of
> the rest, about 9% is about trying to avoid even calling down to the
> low-level filesystem, because it's faster if we can handle it at a high
> level without any need to even worry about issues like physical disk
> addresses. Even if those addresses are cached.
>
> The remaining 1% is about actually getting the IO done. At that point we
> end up throwing our hands in the air and saying "ok, this will be slow".
>
> So if you design your system for disk load, you are missing a big
> portion of the picture.
>
> There are cases where IO really matters. The most notable one being
> databases, certainly _not_ web or ftp servers. For web- or ftp-servers
> you buy more memory if you want high performance, and you tend to be
> limited by the network speed anyway (if you have multiple gigabit
> networks and network speed isn't an issue, then I can also tell you that
> buying a few gigabytes of RAM isn't an issue, because you are obviously
> working for something like the DoD and have very little regard for the
> cost of the thing ;)
>
> For databases (and for file servers that you want to be robust over a
> crash), IO throughput is an issue mainly because you need to put the
> damn requests in stable memory somewhere. Which tends to mean that
> _write_ speed is what really matters, because the reads you can still
> try to cache as efficiently as humanly possible (and the issue of
> database design then turns into trying to find every single piece of
> locality you can, so that the read caching works as well as possible).
>
> Short and sweet: "aio_open()" is basically never supposed to be an
> issue. If it is, you've misdesigned something, or you're trying too damn
> hard to single-thread everything (and "hiding" the threading that _does_
> happen by just calling it "AIO" instead - lying to yourself, in short).

Right - I agree with you that an AIO design is basically hiding an
inherently multi-threaded program flow. This argument is indeed very
catchy. And looking at it from another point of view, one will see that
most of the AIO designs are from times when multi-threading in
applications wasn't as common as it is now. Most prominently, coprocesses
in a shell come to my mind as a very good example of how to handle AIO
(sort of)...
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Hi,

On Thu, Feb 08, 2001 at 03:52:35PM +0100, Mikulas Patocka wrote:
> > How do you write high-performance ftp server without threads if select
> > on regular file always returns "ready"?
>
> No, it's not really possible on Linux. Use SYS$QIO call on VMS :-)

Ahh, but even VMS SYS$QIO is synchronous at doing opens, allocation of the
IO request packets, and mapping file location to disk blocks. Only the
data IO is ever async (and Ben's async IO stuff for Linux provides that
too).

--Stephen
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Thu, 8 Feb 2001, Rik van Riel wrote:
> On Thu, 8 Feb 2001, Mikulas Patocka wrote:
> > > > You need aio_open.
> > > Could you explain this?
> >
> > If the server is sending many small files, disk spends huge
> > amount time walking directory tree and seeking to inodes. Maybe
> > opening the file is even slower than reading it
>
> Not if you have a big enough inode_cache and dentry_cache.
>
> OTOH ... if you have enough memory the whole async IO argument
> is moot anyway because all your files will be in memory too.

Note that this _is_ an important point.

You should never _ever_ think about pure IO speed as the most important
thing. Even if you get absolutely perfect IO streaming off the fastest
disk you can find, I will beat you every single time with a cached setup
that doesn't need to do IO at all.

90% of the VFS layer is all about caching, and trying to avoid IO. Of the
rest, about 9% is about trying to avoid even calling down to the low-level
filesystem, because it's faster if we can handle it at a high level
without any need to even worry about issues like physical disk addresses.
Even if those addresses are cached.

The remaining 1% is about actually getting the IO done. At that point we
end up throwing our hands in the air and saying "ok, this will be slow".

So if you design your system for disk load, you are missing a big portion
of the picture.

There are cases where IO really matters. The most notable one being
databases, certainly _not_ web or ftp servers. For web- or ftp-servers you
buy more memory if you want high performance, and you tend to be limited
by the network speed anyway (if you have multiple gigabit networks and
network speed isn't an issue, then I can also tell you that buying a few
gigabytes of RAM isn't an issue, because you are obviously working for
something like the DoD and have very little regard for the cost of the
thing ;)

For databases (and for file servers that you want to be robust over a
crash), IO throughput is an issue mainly because you need to put the damn
requests in stable memory somewhere. Which tends to mean that _write_
speed is what really matters, because the reads you can still try to cache
as efficiently as humanly possible (and the issue of database design then
turns into trying to find every single piece of locality you can, so that
the read caching works as well as possible).

Short and sweet: "aio_open()" is basically never supposed to be an issue.
If it is, you've misdesigned something, or you're trying too damn hard to
single-thread everything (and "hiding" the threading that _does_ happen by
just calling it "AIO" instead - lying to yourself, in short).

		Linus
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Thu, 8 Feb 2001, Marcelo Tosatti wrote:
>
> On Thu, 8 Feb 2001, Stephen C. Tweedie wrote:
> >
> > > How do you write high-performance ftp server without threads if
> > > select on regular file always returns "ready"?
> >
> > Select can work if the access is sequential, but async IO is a more
> > general solution.
>
> Even async IO (ie aio_read/aio_write) should block on the request queue
> if its full in Linus mind.

Not necessarily.

I said that "READA/WRITEA" are only worth exporting inside the kernel -
because the latencies and complexities are low-level enough that it should
not be exported to user space as such.

But I could imagine a kernel aio package that does the equivalent of

	bh->b_end_io = completion_handler;
	generic_make_request(WRITE, bh);	/* this may block */
	bh = bh->b_next;

	/* Now, fill it up as much as we can.. */
	current->state = TASK_INTERRUPTIBLE;
	while (more data to be written) {
		if (generic_make_request(WRITEA, bh) < 0)
			break;
		bh = bh->b_next;
	}
	return;

and then you make the _completion handler_ thing continue to feed more
requests. Yes, you may block at some points (because you need to always
have at least _one_ request in-flight in order to have the state machine
active), but you can basically try to avoid blocking more than necessary.

But do you see why the above can't be done from user space?

It requires that the completion handler (which runs in an interrupt
context) be able to continue to feed requests and keep the queue filled.
If you don't do that, you'll never have good throughput, because it takes
too long to send signals, re-schedule or whatever to user mode.

And do you see how it has to block _sometimes_? If people do hundreds of
AIO requests, we can't let memory just fill up with pending writes..

		Linus
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Thu, 8 Feb 2001, Martin Dalecki wrote:
>
> > But you'll have a bitch of a time trying to merge multiple
> > threads/processes reading from the same area on disk at roughly the
> > same time. Your higher levels won't even _know_ that there is merging
> > to be done until the IO requests hit the wall in waiting for the disk.
>
> Merging is a hardware-tied optimization, so it should happen where you
> really have full "knowledge" and control of the hardware -> namely the
> device driver.

Or, in many cases, the device itself.

There are valid reasons for not doing merging in the driver, but they all
tend to boil down to "even lower layers can do a better job of it". They
basically _never_ boil down to "upper layers already did it for us".

That said, there tend to be advantages to doing "appropriate" clustering
at each level. Upper layers can (and do) use read-ahead to help the lower
levels. The write-out can (and currently does not) try to sort the
requests for better elevator behaviour. The driver level can (and does)
further cluster the requests - even if the low-level device does a perfect
job of ordering and merging on its own, it's usually advantageous to have
fewer (and bigger) commands in-flight in order to have fewer completion
interrupts and less command traffic on the bus.

So it's obviously not entirely black-and-white. Upper layers can help, but
it's a mistake to think that they should "do the work".

(Note: a lot of people seem to think that "layering" means that the
complexity is in upper layers, and that lower layers should be simple and
"stupid". This is not true. A well-balanced layering would have all layers
doing potentially equally complex things - but the complexity should be
_independent_. Complex interactions are bad. But it's also bad to think
that lower levels shouldn't be allowed to optimize because they should be
"simple".)

		Linus
Re: select() returning busy for regular files [was Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait]
On Thu, 8 Feb 2001, Pavel Machek wrote:
>
> > There are currently no other alternatives in user space. You'd have to
> > create whole new interfaces for aio_read/write, and ways for the kernel
> > to inform user space that "now you can re-try submitting your IO".
>
> Why is current select() interface not good enough?

Ehh..

One major reason is rather simple: disk request wait times tend to be on
the order of sub-millisecond (remember: if we run out of requests, that
means that we have 256 of them already queued, which means that it's very
likely that several of them will be freed up in the very near future due
to completion).

The fact is, that if you start doing write/select loops, you're going to
waste a _large_ portion of your CPU speed on it. Especially considering
that the select() call would have to go all the way down to the ll_rw_blk
layer to figure out whether there are more requests etc.

So there are (a) historical reasons that say that regular files can never
wait and EAGAIN is not an acceptable return value, and (b) practical
reasons for why such an interface would be a bad one.

There are better ways to do it. Either using threads, or just having a
better aio-like interface.

		Linus
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Thu, 8 Feb 2001, Rik van Riel wrote:
> On Thu, 8 Feb 2001, Mikulas Patocka wrote:
> > > > You need aio_open.
> > > Could you explain this?
> >
> > If the server is sending many small files, disk spends huge
> > amount time walking directory tree and seeking to inodes. Maybe
> > opening the file is even slower than reading it
>
> Not if you have a big enough inode_cache and dentry_cache.

Eh? However big the caches are, you can still get misses which will
require multiple (blocking) disk accesses to handle...

> OTOH ... if you have enough memory the whole async IO argument
> is moot anyway because all your files will be in memory too.

Only for cache hits. If you're doing a Mindcraft benchmark or something
with everything in RAM, you're fine - for real world servers, that's not
really an option ;-)

Really, you want/need cache MISSES to be handled without blocking.
However big the caches, short of running EVERYTHING from a ramdisk, these
will still happen!

James.
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Thu, 8 Feb 2001, Mikulas Patocka wrote:
> > > You need aio_open.
> > Could you explain this?
>
> If the server is sending many small files, disk spends huge
> amount time walking directory tree and seeking to inodes. Maybe
> opening the file is even slower than reading it

Not if you have a big enough inode_cache and dentry_cache.

OTOH ... if you have enough memory the whole async IO argument
is moot anyway because all your files will be in memory too.

regards,

Rik
--
Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml

Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/		http://www.conectiva.com/
http://distro.conectiva.com/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Thu, 8 Feb 2001, Mikulas Patocka wrote:
> > > The problem is that aio_read and aio_write are pretty useless for ftp
> > > or http server. You need aio_open.
> >
> > Could you explain this?
>
> If the server is sending many small files, disk spends huge amount time
> walking directory tree and seeking to inodes. Maybe opening the file is
> even slower than reading it - read is usually sequential but open needs
> to seek at few areas of disk.
>
> And if you have one-threaded server using open, close, aio_read and
> aio_write, you actually block the whole server while it is opening a
> single file. This is not how async io is supposed to work.

Ok but this is not the point of the discussion.
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
> > The problem is that aio_read and aio_write are pretty useless for ftp
> > or http server. You need aio_open.
>
> Could you explain this?

If the server is sending many small files, disk spends huge amount time
walking directory tree and seeking to inodes. Maybe opening the file is
even slower than reading it - read is usually sequential but open needs to
seek at few areas of disk.

And if you have one-threaded server using open, close, aio_read and
aio_write, you actually block the whole server while it is opening a
single file. This is not how async io is supposed to work.

Mikulas
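The pattern being criticized, sketched with the POSIX aio calls of the
day. Everything below is standard except the missing piece itself -- there
is no aio_open(), which is exactly Mikulas' point:

	#include <aio.h>
	#include <fcntl.h>
	#include <string.h>

	/* 'cb' must outlive the async request; static only for brevity. */
	static struct aiocb cb;

	int start_sending(const char *path, char *buf, size_t len)
	{
		/* BLOCKS: the directory-tree walk and inode seeks happen
		 * synchronously, stalling every connection served by this
		 * one thread. */
		int fd = open(path, O_RDONLY);
		if (fd < 0)
			return -1;

		/* Only from here on is the io actually asynchronous. */
		memset(&cb, 0, sizeof cb);
		cb.aio_fildes = fd;
		cb.aio_buf = buf;
		cb.aio_nbytes = len;
		cb.aio_offset = 0;
		return aio_read(&cb);
	}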
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Thu, Feb 08 2001, Mikulas Patocka wrote:
> > Even async IO (ie aio_read/aio_write) should block on the request
> > queue if its full in Linus mind.
>
> This is not problem (you can create queue big enough to handle the load).

Well in theory, but in practice this isn't a very good idea. At some point
throwing yet more requests in there doesn't make a whole lot of sense. You
are basically _always_ going to be able to empty the request list by
dirtying lots of data.

--
Jens Axboe
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Thu, 8 Feb 2001, Mikulas Patocka wrote:
> > > > How do you write high-performance ftp server without threads if
> > > > select on regular file always returns "ready"?
> > >
> > > Select can work if the access is sequential, but async IO is a more
> > > general solution.
> >
> > Even async IO (ie aio_read/aio_write) should block on the request
> > queue if its full in Linus mind.
>
> This is not problem (you can create queue big enough to handle the load).

The point is that you want to be able to not block if the queue is full
(and the queue size has nothing to do with that).

> The problem is that aio_read and aio_write are pretty useless for ftp or
> http server. You need aio_open.

Could you explain this?
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
> > > How do you write high-performance ftp server without threads if
> > > select on regular file always returns "ready"?
> >
> > Select can work if the access is sequential, but async IO is a more
> > general solution.
>
> Even async IO (ie aio_read/aio_write) should block on the request queue
> if its full in Linus mind.

This is not problem (you can create queue big enough to handle the load).

The problem is that aio_read and aio_write are pretty useless for ftp or
http server. You need aio_open.

Mikulas
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Thu, 8 Feb 2001, Marcelo Tosatti wrote:
>
> On Thu, 8 Feb 2001, Ben LaHaise wrote:
> > > (besides, latency would suck. I bet you're better off waiting for
> > > the requests if they are all used up. It takes too long to get deep
> > > into the kernel from user space, and you cannot use the exclusive
> > > waiters with its anti-herd behaviour etc).
> >
> > Ah, but no. In fact for some things, the wait queue extensions I'm
> > using will be more efficient as things like test_and_set_bit for
> > obtaining a lock gets executed without waking up a task.
>
> The latency argument is somewhat bogus because there is no problem to
> check the request queue, in the aio syscalls, and simply fail if its
> full.

Ugh, I forgot to say: check the request queue before doing any filesystem
work.
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Thu, 8 Feb 2001, Ben LaHaise wrote:
> > (besides, latency would suck. I bet you're better off waiting for the
> > requests if they are all used up. It takes too long to get deep into
> > the kernel from user space, and you cannot use the exclusive waiters
> > with its anti-herd behaviour etc).
>
> Ah, but no. In fact for some things, the wait queue extensions I'm using
> will be more efficient as things like test_and_set_bit for obtaining a
> lock gets executed without waking up a task.

The latency argument is somewhat bogus because there is no problem to
check the request queue, in the aio syscalls, and simply fail if its full.
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Tue, 6 Feb 2001, Linus Torvalds wrote:
> There are currently no other alternatives in user space. You'd have to
> create whole new interfaces for aio_read/write, and ways for the kernel
> to inform user space that "now you can re-try submitting your IO".
>
> Could be done. But that's a big thing.

Has been done. Still needs some work, but it works pretty well. As for
throttling io, having ios submitted does not have to correspond to them
being queued in the lower layers. The main issue with async io is
limiting the amount of pinned memory for ios; if that's taken care of, I
don't think it matters how many ios are in flight.

> > An application which sets non blocking behavior and busy waits for a
> > request (which seems to be your argument) is just stupid, of course.
>
> Tell me what else it could do at some point? You need something like
> select() to wait on it. There are no such interfaces right now...
>
> (besides, latency would suck. I bet you're better off waiting for the
> requests if they are all used up. It takes too long to get deep into the
> kernel from user space, and you cannot use the exclusive waiters with
> its anti-herd behaviour etc).

Ah, but no. In fact for some things, the wait queue extensions I'm using
will be more efficient as things like test_and_set_bit for obtaining a
lock gets executed without waking up a task.

> Simple rule: if you want to optimize concurrency and avoid waiting - use
> several processes or threads instead. At which point you can get real
> work done on multiple CPU's, instead of worrying about what happens when
> you have to wait on the disk.

There do exist plenty of cases where threads are not efficient enough.
Just the stack overhead alone with 8000 threads makes things really suck.
Event based io completion means that server processes don't need to have
the overhead of select/poll. Add in NT style completion ports for waking
up the right number of worker threads off of the completion queue, and

That said, I don't expect all devices to support async io. But given
support for files, raw and sockets all the important cases are covered.
The remainder can be supported via userspace helpers.

		-ben
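For readers unfamiliar with the completion-port pattern Ben refers to,
here is a rough user-space sketch with invented types (this is not the
API of his patch). One queue collects completion events; a small pool of
workers consumes them, so a server needs neither a thread per request nor
a select/poll scan:

	#include <pthread.h>

	struct io_event {
		void *cookie;   /* which request finished      */
		long result;    /* bytes transferred or -errno */
	};

	struct completion_q {
		pthread_mutex_t lock;
		pthread_cond_t wait;
		struct io_event ring[256];  /* assumes <= 256 in flight */
		unsigned head, tail;        /* head == tail means empty */
	};

	/* Completion side: publish one event and wake exactly one
	 * worker -- the "right number of worker threads" idea. */
	void cq_push(struct completion_q *q, struct io_event ev)
	{
		pthread_mutex_lock(&q->lock);
		q->ring[q->head++ % 256] = ev;
		pthread_cond_signal(&q->wait);
		pthread_mutex_unlock(&q->lock);
	}

	/* Worker side: sleep until a completion arrives. */
	struct io_event cq_pop(struct completion_q *q)
	{
		struct io_event ev;
		pthread_mutex_lock(&q->lock);
		while (q->head == q->tail)
			pthread_cond_wait(&q->wait, &q->lock);
		ev = q->ring[q->tail++ % 256];
		pthread_mutex_unlock(&q->lock);
		return ev;
	}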
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Hi! > So you consider inability to select() on regular files _feature_? select on files is unimplementable. You can't do background file IO the same way you do background receiving of packets on socket. Filesystem is synchronous. It can block. > It can be a pretty serious problem with slow block devices > (floppy). It also hurts when you are trying to do high-performance > reads/writes. [I know it hurt in userspace sherlock search engine -- > kind of small altavista.] > > How do you write high-performance ftp server without threads if select > on regular file always returns "ready"? No, it's not really possible on Linux. Use SYS$QIO call on VMS :-) You can emulate asynchronous IO with kernel threads like FreeBSD and some commercial Unices do, but you still need as many (possibly kernel) threads as many requests you are servicing. > > Remember: in the end you HAVE to wait somewhere. You're always going to be > > able to generate data faster than the disk can take it. SOMETHING > > Userspace wants to _know_ when to stop. It asks politely using > "select()". And how do you want to wait for other select()ed events if you are blocked in wait_for_buffer in get_block (former bmap)? Making real async IO would require to rewrite all filesystems and whole VFS _from_scratch_. It won't happen. Mikulas - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: select() returning busy for regular files [was Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait]
On Thu, 8 Feb 2001, Pavel Machek wrote:
> Hi!
>
> > > Its arguing against making a smart application block on the disk
> > > while its able to use the CPU for other work.
> >
> > There are currently no other alternatives in user space. You'd have to
> > create whole new interfaces for aio_read/write, and ways for the
> > kernel to inform user space that "now you can re-try submitting your
> > IO".
>
> Why is current select() interface not good enough?

Think of random disk io scattered across the disk. Think about aio_write
providing a means to perform zero copy io without needing to resort to
playing mm tricks write protecting pages in the user's page tables. It's
also a means for dealing efficiently with thousands of outstanding
requests for network io. Using a select based interface is going to be an
ugly kludge that still has all the overhead of select/poll.

		-ben
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Thu, 8 Feb 2001, Stephen C. Tweedie wrote:
> > How do you write high-performance ftp server without threads if select
> > on regular file always returns "ready"?
>
> Select can work if the access is sequential, but async IO is a more
> general solution.

Even async IO (ie aio_read/aio_write) should block on the request queue if
its full in Linus mind.
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Hi,

On Thu, Feb 08, 2001 at 12:15:13AM +0100, Pavel Machek wrote:
> > EAGAIN is _not_ a valid return value for block devices or for regular
> > files. And in fact it _cannot_ be, because select() is defined to
> > always return 1 on them - so if a write() were to return EAGAIN, user
> > space would have nothing to wait on. Busy waiting is evil.
>
> So you consider inability to select() on regular files _feature_?

Select might make some sort of sense for sequential access to files, and
for random access via lseek/read, but it makes no sense at all for pread
and pwrite, where select() has no idea _which_ part of the file the user
is going to want to access next.

> How do you write high-performance ftp server without threads if select
> on regular file always returns "ready"?

Select can work if the access is sequential, but async IO is a more
general solution.

Cheers,
 Stephen
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Linus Torvalds wrote:
>
> On Tue, 6 Feb 2001, Ben LaHaise wrote:
> > On Tue, 6 Feb 2001, Stephen C. Tweedie wrote:
> > > The whole point of the post was that it is merging, not splitting,
> > > which is troublesome. How are you going to merge requests without
> > > having chains of scatter-gather entities each with their own
> > > completion callbacks?
> >
> > Let me just emphasize what Stephen is pointing out: if requests are
> > properly merged at higher layers, then merging is neither required nor
> > desired.
>
> I will claim that you CANNOT merge at higher levels and get good
> performance.
>
> Sure, you can do read-ahead, and try to get big merges that way at a
> high level. Good for you.
>
> But you'll have a bitch of a time trying to merge multiple
> threads/processes reading from the same area on disk at roughly the same
> time. Your higher levels won't even _know_ that there is merging to be
> done until the IO requests hit the wall in waiting for the disk.

Merging is a hardware-tied optimization, so it should happen where you
really have full "knowledge" and control of the hardware -> namely the
device driver.

> Quite frankly, this whole discussion sounds worthless. We have solved
> this problem already: it's called a "buffer head". Deceptively simple at
> higher levels, and lower levels can easily merge them together into
> chains and do fancy scatter-gather structures of them that can be
> dynamically extended at any time.
>
> The buffer heads together with "struct request" do a hell of a lot more
> than just a simple scatter-gather: it's able to create ordered lists of
> independent sg-events, together with full call-backs etc. They are
> low-cost, fairly efficient, and they have worked beautifully for years.
>
> The fact that kiobufs can't be made to do the same thing is somebody
> elses problem. I _know_ that merging has to happen late, and if others
> are hitting their heads against this issue until they turn silly, then
> that's their problem. You'll eventually learn, or you'll hit your heads
> into a pulp.

Amen.

--
- phone: +49 214 8656 283
- job:   STOCK-WORLD Media AG, LEV .de (MY OPPINNIONS ARE MY OWN!)
- langs: de_DE.ISO8859-1, en_US, pl_PL.ISO8859-2, last ressort: ru_RU.KOI8-R
select() returning busy for regular files [was Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait]
Hi!

> > Its arguing against making a smart application block on the disk while
> > its able to use the CPU for other work.
>
> There are currently no other alternatives in user space. You'd have to
> create whole new interfaces for aio_read/write, and ways for the kernel
> to inform user space that "now you can re-try submitting your IO".

Why is current select() interface not good enough?

Defining that select may say a regular file is not ready should be enough.
Okay, maybe you'd want a new fcntl() flag saying "I _really_ want this
regular file to be non-blocking". No need for new interfaces.

								Pavel
--
I'm [EMAIL PROTECTED] "In my country we have almost anarchy and I don't care."
Panos Katsaloulis describing me w.r.t. patents at [EMAIL PROTECTED]
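Concretely, the proposal amounts to something like the loop below. The
O_FILE_NONBLOCK flag is invented here to illustrate the shape of the
interface -- neither it nor non-blocking semantics for regular files
exist:

	#include <fcntl.h>
	#include <sys/select.h>
	#include <unistd.h>
	#include <errno.h>

	#define O_FILE_NONBLOCK 0x40000000	/* hypothetical flag */

	ssize_t proposed_write(int fd, const void *buf, size_t len)
	{
		fcntl(fd, F_SETFL, O_NONBLOCK | O_FILE_NONBLOCK);

		for (;;) {
			ssize_t n = write(fd, buf, len);
			if (n >= 0 || errno != EAGAIN)
				return n;

			/* Under the proposal, select() would sleep until
			 * the request queue drains instead of claiming
			 * regular files are always ready. */
			fd_set w;
			FD_ZERO(&w);
			FD_SET(fd, &w);
			select(fd + 1, NULL, &w, NULL, NULL);
		}
	}

This is precisely the write/select loop Linus objects to elsewhere in the
thread: with sub-millisecond request wait times, it burns a large share
of CPU on system calls.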
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Hi!

> > > Reading write(2):
> > >
> > >    EAGAIN Non-blocking I/O has been selected using O_NONBLOCK and
> > >           there was no room in the pipe or socket connected to fd to
> > >           write the data immediately.
> > >
> > > I see no reason why "aio function have to block waiting for
> > > requests".
> >
> > That was my reasoning too with READA etc, but Linus seems to want that
> > we can block while submitting the I/O (as throttling, Linus?) just not
> > until completion.
>
> Note the "in the pipe or socket" part.
>           ^^^^^^^^^^^^^^^^^^^^^
>
> EAGAIN is _not_ a valid return value for block devices or for regular
> files. And in fact it _cannot_ be, because select() is defined to always
> return 1 on them - so if a write() were to return EAGAIN, user space
> would have nothing to wait on. Busy waiting is evil.

So you consider inability to select() on regular files _feature_?

It can be a pretty serious problem with slow block devices (floppy). It
also hurts when you are trying to do high-performance reads/writes. [I
know it hurt in userspace sherlock search engine -- kind of small
altavista.]

How do you write high-performance ftp server without threads if select on
regular file always returns "ready"?

> Remember: in the end you HAVE to wait somewhere. You're always going to
> be able to generate data faster than the disk can take it. SOMETHING

Userspace wants to _know_ when to stop. It asks politely using
"select()".

								Pavel
--
I'm [EMAIL PROTECTED] "In my country we have almost anarchy and I don't care."
Panos Katsaloulis describing me w.r.t. patents at [EMAIL PROTECTED]
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Tue, Feb 06, 2001 at 10:14:21AM -0800, Linus Torvalds wrote:
> I will claim that you CANNOT merge at higher levels and get good
> performance.
>
> Sure, you can do read-ahead, and try to get big merges that way at a
> high level. Good for you.
>
> But you'll have a bitch of a time trying to merge multiple
> threads/processes reading from the same area on disk at roughly the same
> time. Your higher levels won't even _know_ that there is merging to be
> done until the IO requests hit the wall in waiting for the disk.

Hi,

I've tried to experimentally check this statement. I instrumented a kernel
with the following patch. It keeps a counter for every merge between
unrelated requests. An unrelated merge is defined as one between requests
allocated by different values of current.

I did various tests and surprisingly I was not able to trigger a single
unrelated merge on my IDE system with various IO loads (dbench, news
expire, news sort, kernel compile, swapping ...)

So either my patch is wrong (if yes, what is wrong?), or they simply do
not happen in usual IO loads. I know that it has a few holes (like it
doesn't count unrelated merges that happen from the same process, or if a
process quits and another one gets its kernel stack and IO of both is
merged it'll be counted as a related merge), but if unrelated merges were
relevant, more should still show up, no?

My pet theory is that the page and buffer cache filter most unrelated
merges out. I haven't tried to use raw IO to avoid this problem, but I
expect that anything that does raw IO will do some intelligent IO
scheduling on its own anyways.

If anyone is interested: it would be interesting if other people are able
to trigger unrelated merges in real loads.

Here is the patch. Display statistics using:

	(echo print unrelated_merge ; echo print related_merge) | \
		gdb vmlinux /proc/kcore

--- linux/drivers/block/ll_rw_blk.c-REQSTAT	Tue Jan 30 13:33:25 2001
+++ linux/drivers/block/ll_rw_blk.c	Thu Feb  8 01:13:57 2001
@@ -31,6 +31,9 @@
 
 #include <linux/module.h>
 
+int unrelated_merge;
+int related_merge;
+
 /*
  * MAC Floppy IWM hooks
  */
@@ -478,6 +481,7 @@
 		rq->rq_status = RQ_ACTIVE;
 		rq->special = NULL;
 		rq->q = q;
+		rq->originator = current;
 	}
 
 	return rq;
@@ -668,6 +672,11 @@
 	if (!q->merge_requests_fn(q, req, next, max_segments))
 		return;
 
+	if (next->originator != req->originator)
+		unrelated_merge++;
+	else
+		related_merge++;
+
 	q->elevator.elevator_merge_req_fn(req, next);
 	req->bhtail->b_reqnext = next->bh;
 	req->bhtail = next->bhtail;
--- linux/include/linux/blkdev.h-REQSTAT	Tue Jan 30 17:17:01 2001
+++ linux/include/linux/blkdev.h	Wed Feb  7 23:33:35 2001
@@ -45,6 +45,8 @@
 	struct buffer_head * bh;
 	struct buffer_head * bhtail;
 	request_queue_t *q;
+
+	struct task_struct *originator;
 };
 
 #include <linux/elevator.h>

-Andi
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Tue, Feb 06, 2001 at 10:14:21AM -0800, Linus Torvalds wrote: I will claim that you CANNOT merge at higher levels and get good performance. Sure, you can do read-ahead, and try to get big merges that way at a high level. Good for you. But you'll have a bitch of a time trying to merge multiple threads/processes reading from the same area on disk at roughly the same time. Your higher levels won't even _know_ that there is merging to be done until the IO requests hit the wall in waiting for the disk. Hi, I've tried to experimentally check this statement. I instrumented a kernel with the following patch. It keeps a counter for every merge between unrelated requests. An unrelated requests is defined as the requests getting allocated from different currents. I did various tests and suprisingly I was not able to trigger a single unrelated merge on my IDE system with various IO loads (dbench, news expire, news sort, kernel compile, swapping ...) So either my patch is wrong (if yes, what is wrong?), or they do simply not happen in usual IO loads. I know that it has a few holes (like it doesn't count unrelated merges that happen from the same process, or if a process quits and another one gets its kernel stack and IO of both is merged it'll be counted as related merge), but if unrelated merges were relevant there should still show up more, no? My pet theory is that page and buffer cache filters most unrelated merges out. I haven't tried to use raw IO to avoid this problem, but I expect that anything that does raw IO will do some intelligent IO scheduling on its own anyways. If anyone is interested: it would be interesting if other people are able to trigger unrelated merges in real loads. Here is a patch. Display statistics using: (echo print unrelated_merge ; print related_merge ) | gdb vmlinux /proc/kcore --- linux/drivers/block/ll_rw_blk.c-REQSTAT Tue Jan 30 13:33:25 2001 +++ linux/drivers/block/ll_rw_blk.c Thu Feb 8 01:13:57 2001 @@ -31,6 +31,9 @@ #include linux/module.h +int unrelated_merge; +int related_merge; + /* * MAC Floppy IWM hooks */ @@ -478,6 +481,7 @@ rq-rq_status = RQ_ACTIVE; rq-special = NULL; rq-q = q; + rq-originator = current; } return rq; @@ -668,6 +672,11 @@ if (!q-merge_requests_fn(q, req, next, max_segments)) return; + if (next-originator != req-originator) + unrelated_merge++; + else + related_merge++; + q-elevator.elevator_merge_req_fn(req, next); req-bhtail-b_reqnext = next-bh; req-bhtail = next-bhtail; --- linux/include/linux/blkdev.h-REQSTATTue Jan 30 17:17:01 2001 +++ linux/include/linux/blkdev.hWed Feb 7 23:33:35 2001 @@ -45,6 +45,8 @@ struct buffer_head * bh; struct buffer_head * bhtail; request_queue_t *q; + + struct task_struct *originator; }; #include linux/elevator.h -Andi - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Hi! Reading write(2): EAGAIN Non-blocking I/O has been selected using O_NONBLOCK and there was no room in the pipe or socket connected to fd to write the data immediately. I see no reason why "aio function have to block waiting for requests". That was my reasoning too with READA etc, but Linus seems to want that we can block while submitting the I/O (as throttling, Linus?) just not until completion. Note the "in the pipe or socket" part. ^^ EAGAIN is _not_ a valid return value for block devices or for regular files. And in fact it _cannot_ be, because select() is defined to always return 1 on them - so if a write() were to return EAGAIN, user space would have nothing to wait on. Busy waiting is evil. So you consider inability to select() on regular files _feature_? It can be a pretty serious problem with slow block devices (floppy). It also hurts when you are trying to do high-performance reads/writes. [I know it hurt in userspace sherlock search engine -- kind of small altavista.] How do you write high-performance ftp server without threads if select on regular file always returns "ready"? Remember: in the end you HAVE to wait somewhere. You're always going to be able to generate data faster than the disk can take it. SOMETHING Userspace wants to _know_ when to stop. It asks politely using "select()". Pavel -- I'm [EMAIL PROTECTED] "In my country we have almost anarchy and I don't care." Panos Katsaloulis describing me w.r.t. patents at [EMAIL PROTECTED] - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
select() returning busy for regular files [was Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait]
Hi! Its arguing against making a smart application block on the disk while its able to use the CPU for other work. There are currently no other alternatives in user space. You'd have to create whole new interfaces for aio_read/write, and ways for the kernel to inform user space that "now you can re-try submitting your IO". Why is current select() interface not good enough? Defining that select may say regular file is not ready should be enough. Okay, maybe you'd want new fcntl() flag saying "I _really_ want this regular file to be non-blocking". No need for new interfaces. Pavel -- I'm [EMAIL PROTECTED] "In my country we have almost anarchy and I don't care." Panos Katsaloulis describing me w.r.t. patents at [EMAIL PROTECTED] - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Hi, On Thu, Feb 08, 2001 at 12:15:13AM +0100, Pavel Machek wrote: EAGAIN is _not_ a valid return value for block devices or for regular files. And in fact it _cannot_ be, because select() is defined to always return 1 on them - so if a write() were to return EAGAIN, user space would have nothing to wait on. Busy waiting is evil. So you consider inability to select() on regular files _feature_? Select might make some sort of sense for sequential access to files, and for random access via lseek/read but it makes no sense at all for pread and pwrite where select() has no idea _which_ part of the file the user is going to want to access next. How do you write high-performance ftp server without threads if select on regular file always returns "ready"? Select can work if the access is sequential, but async IO is a more general solution. Cheers, Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Thu, 8 Feb 2001, Stephen C. Tweedie wrote:

<snip>

> > How do you write high-performance ftp server without threads if select on regular file always returns "ready"?
>
> Select can work if the access is sequential, but async IO is a more general solution.

Even async IO (ie aio_read/aio_write) should block on the request queue if it's full, in Linus' mind.

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: select() returning busy for regular files [was Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait]
On Thu, 8 Feb 2001, Pavel Machek wrote:
> Hi!
>
> > > Its arguing against making a smart application block on the disk while its able to use the CPU for other work.
> > >
> > > There are currently no other alternatives in user space. You'd have to create whole new interfaces for aio_read/write, and ways for the kernel to inform user space that "now you can re-try submitting your IO".
>
> Why is current select() interface not good enough?

Think of random disk io scattered across the disk. Think about aio_write providing a means to perform zero-copy io without needing to resort to playing mm tricks such as write-protecting pages in the user's page tables. It's also a means for dealing efficiently with thousands of outstanding requests for network io. Using a select-based interface is going to be an ugly kludge that still has all the overhead of select/poll.

-ben
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
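For flavour, this is what the submission/completion split looks like through the POSIX AIO interface (glibc's aio_write(), not Ben's kernel patch -- shown only because the patch's user-visible API does not appear in this thread):

#include <aio.h>
#include <errno.h>
#include <string.h>

/* Queue an async write; returns immediately instead of blocking
 * until the data reaches the request queue. */
static int submit_write(int fd, char *buf, size_t len, off_t off,
			struct aiocb *cb)
{
	memset(cb, 0, sizeof(*cb));
	cb->aio_fildes = fd;
	cb->aio_buf    = buf;
	cb->aio_nbytes = len;
	cb->aio_offset = off;
	return aio_write(cb);
}

/* Later: harvest the result without ever having slept in write(). */
static ssize_t harvest(struct aiocb *cb)
{
	const struct aiocb *const list[1] = { cb };

	while (aio_error(cb) == EINPROGRESS)
		aio_suspend(list, 1, NULL);	/* sleep until done */
	return aio_return(cb);
}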
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Hi!

> > > So you consider inability to select() on regular files _feature_?
> >
> > select on files is unimplementable. You can't do background file IO the same way you do background receiving of packets on socket. Filesystem is synchronous. It can block.
>
> It can be a pretty serious problem with slow block devices (floppy). It also hurts when you are trying to do high-performance reads/writes. [I know it hurt in userspace sherlock search engine -- kind of small altavista.] How do you write high-performance ftp server without threads if select on regular file always returns "ready"?

No, it's not really possible on Linux. Use SYS$QIO call on VMS :-)

You can emulate asynchronous IO with kernel threads like FreeBSD and some commercial Unices do, but you still need as many (possibly kernel) threads as requests you are servicing.

> > Remember: in the end you HAVE to wait somewhere. You're always going to be able to generate data faster than the disk can take it. SOMETHING
>
> Userspace wants to _know_ when to stop. It asks politely using "select()".

And how do you want to wait for other select()ed events if you are blocked in wait_for_buffer in get_block (former bmap)? Making real async IO would require rewriting all filesystems and the whole VFS _from_scratch_. It won't happen.

Mikulas
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
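The emulation Mikulas mentions amounts to pinning one thread per outstanding request. A bare-bones sketch of the idea (illustrative names only, not FreeBSD's actual implementation):

#include <pthread.h>
#include <unistd.h>

struct io_req {
	int	fd;
	void	*buf;
	size_t	len;
	off_t	off;
	ssize_t	result;
	int	notify_fd;	/* write end of a pipe the main
				 * loop can select() on */
};

static void *io_worker(void *arg)
{
	struct io_req *r = arg;

	r->result = pread(r->fd, r->buf, r->len, r->off);  /* blocks */
	write(r->notify_fd, "x", 1);	/* wake the select() loop */
	return NULL;
}

/* One thread per request in flight -- the scaling problem:
 * 8000 outstanding requests means 8000 stacks. */
static int submit(struct io_req *r)
{
	pthread_t t;
	int err = pthread_create(&t, NULL, io_worker, r);

	if (!err)
		pthread_detach(t);
	return err;
}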
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Tue, 6 Feb 2001, Linus Torvalds wrote:
> > There are currently no other alternatives in user space. You'd have to create whole new interfaces for aio_read/write, and ways for the kernel to inform user space that "now you can re-try submitting your IO".
>
> Could be done. But that's a big thing.

Has been done. Still needs some work, but it works pretty well. As for throttling io, having ios submitted does not have to correspond to them being queued in the lower layers. The main issue with async io is limiting the amount of pinned memory for ios; if that's taken care of, I don't think it matters how many ios are in flight.

> An application which sets non blocking behavior and busy waits for a request (which seems to be your argument) is just stupid, of course.

Tell me what else it could do at some point? You need something like select() to wait on it. There are no such interfaces right now...

> (besides, latency would suck. I bet you're better off waiting for the requests if they are all used up. It takes too long to get deep into the kernel from user space, and you cannot use the exclusive waiters with its anti-herd behaviour etc).

Ah, but no. In fact for some things, the wait queue extensions I'm using will be more efficient, as things like test_and_set_bit for obtaining a lock get executed without waking up a task.

> Simple rule: if you want to optimize concurrency and avoid waiting - use several processes or threads instead. At which point you can get real work done on multiple CPU's, instead of worrying about what happens when you have to wait on the disk.

There do exist plenty of cases where threads are not efficient enough. Just the stack overhead alone with 8000 threads makes things really suck. Event based io completion means that server processes don't need to have the overhead of select/poll. Add in NT style completion ports for waking up the right number of worker threads off of the completion queue, and...

That said, I don't expect all devices to support async io. But given support for files, raw and sockets all the important cases are covered. The remainder can be supported via userspace helpers.

-ben
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Thu, 8 Feb 2001, Ben LaHaise wrote:

<snip>

> > (besides, latency would suck. I bet you're better off waiting for the requests if they are all used up. It takes too long to get deep into the kernel from user space, and you cannot use the exclusive waiters with its anti-herd behaviour etc).
>
> Ah, but no. In fact for some things, the wait queue extensions I'm using will be more efficient, as things like test_and_set_bit for obtaining a lock get executed without waking up a task.

The latency argument is somewhat bogus, because it is no problem to check the request queue in the aio syscalls and simply fail if it's full.

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Thu, 8 Feb 2001, Marcelo Tosatti wrote:
> On Thu, 8 Feb 2001, Ben LaHaise wrote:
>
> <snip>
>
> > > (besides, latency would suck. I bet you're better off waiting for the requests if they are all used up. It takes too long to get deep into the kernel from user space, and you cannot use the exclusive waiters with its anti-herd behaviour etc).
> >
> > Ah, but no. In fact for some things, the wait queue extensions I'm using will be more efficient, as things like test_and_set_bit for obtaining a lock get executed without waking up a task.
>
> The latency argument is somewhat bogus, because it is no problem to check the request queue, in the aio syscalls, and simply fail if it's full.

Ugh, I forgot to say: check the request queue before doing any filesystem work.

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
> > > How do you write high-performance ftp server without threads if select on regular file always returns "ready"?
> >
> > Select can work if the access is sequential, but async IO is a more general solution.
>
> Even async IO (ie aio_read/aio_write) should block on the request queue if its full in Linus mind.

This is not a problem (you can create a queue big enough to handle the load).

The problem is that aio_read and aio_write are pretty useless for ftp or http server. You need aio_open.

Mikulas
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Thu, 8 Feb 2001, Mikulas Patocka wrote:
> > > > How do you write high-performance ftp server without threads if select on regular file always returns "ready"?
> > >
> > > Select can work if the access is sequential, but async IO is a more general solution.
> >
> > Even async IO (ie aio_read/aio_write) should block on the request queue if its full in Linus mind.
>
> This is not a problem (you can create a queue big enough to handle the load).

The point is that you want to be able to not block if the queue is full (and the queue size has nothing to do with that).

> The problem is that aio_read and aio_write are pretty useless for ftp or http server. You need aio_open.

Could you explain this?

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Thu, Feb 08 2001, Mikulas Patocka wrote:
> > Even async IO (ie aio_read/aio_write) should block on the request queue if its full in Linus mind.
>
> This is not a problem (you can create a queue big enough to handle the load).

Well in theory, but in practice this isn't a very good idea. At some point throwing yet more requests in there doesn't make a whole lot of sense. You are basically _always_ going to be able to empty the request list by dirtying lots of data.

-- 
Jens Axboe
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Thu, 8 Feb 2001, Mikulas Patocka wrote:
> > > The problem is that aio_read and aio_write are pretty useless for ftp or http server. You need aio_open.
> >
> > Could you explain this?
>
> If the server is sending many small files, disk spends huge amount time walking directory tree and seeking to inodes. Maybe opening the file is even slower than reading it - read is usually sequential but open needs to seek at few areas of disk.
>
> And if you have one-threaded server using open, close, aio_read and aio_write, you actually block the whole server while it is opening a single file. This is not how async io is supposed to work.

Ok, but this is not the point of the discussion.

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
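Absent an aio_open(), the only way a single-threaded server avoids the stall Mikulas describes is to farm the open() out to a helper thread -- a purely illustrative workaround, not anything proposed in the thread:

#include <pthread.h>
#include <fcntl.h>
#include <unistd.h>

struct open_req {
	const char *path;
	int	   result_fd;
	int	   notify_fd;	/* pipe back to the select() loop */
};

static void *open_worker(void *arg)
{
	struct open_req *r = arg;

	/* This is the part that can block on directory-tree walks and
	 * inode seeks -- now off the server's main thread. */
	r->result_fd = open(r->path, O_RDONLY);
	write(r->notify_fd, "x", 1);
	return NULL;
}

The submission side is the same one-thread-per-outstanding-request pattern as in the read sketch earlier, with the same scaling cost.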
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Thu, 8 Feb 2001, Mikulas Patocka wrote:
> > > You need aio_open.
> >
> > Could you explain this?
>
> If the server is sending many small files, disk spends huge amount time walking directory tree and seeking to inodes. Maybe opening the file is even slower than reading it

Not if you have a big enough inode_cache and dentry_cache.

OTOH ... if you have enough memory the whole async IO argument is moot anyway because all your files will be in memory too.

regards,

Rik
--
Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml
Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://www.conectiva.com/ http://distro.conectiva.com/
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Thu, 8 Feb 2001, Rik van Riel wrote:
> On Thu, 8 Feb 2001, Mikulas Patocka wrote:
> > > > You need aio_open.
> > >
> > > Could you explain this?
> >
> > If the server is sending many small files, disk spends huge amount time walking directory tree and seeking to inodes. Maybe opening the file is even slower than reading it
>
> Not if you have a big enough inode_cache and dentry_cache.

Eh? However big the caches are, you can still get misses which will require multiple (blocking) disk accesses to handle...

> OTOH ... if you have enough memory the whole async IO argument is moot anyway because all your files will be in memory too.

Only for cache hits. If you're doing a Mindcraft benchmark or something with everything in RAM, you're fine - for real world servers, that's not really an option ;-)

Really, you want/need cache MISSES to be handled without blocking. However big the caches, short of running EVERYTHING from a ramdisk, these will still happen!

James.
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: select() returning busy for regular files [was Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait]
On Thu, 8 Feb 2001, Pavel Machek wrote:
> > There are currently no other alternatives in user space. You'd have to create whole new interfaces for aio_read/write, and ways for the kernel to inform user space that "now you can re-try submitting your IO".
>
> Why is current select() interface not good enough?

Ehh.. One major reason is rather simple: disk request wait times tend to be on the order of sub-millisecond (remember: if we run out of requests, that means that we have 256 of them already queued, which means that it's very likely that several of them will be freed up in the very near future due to completion).

The fact is, that if you start doing write/select loops, you're going to waste a _large_ portion of your CPU speed on it. Especially considering that the select() call would have to go all the way down to the ll_rw_blk layer to figure out whether there are more requests etc.

So there are (a) historical reasons that say that regular files can never wait and EAGAIN is not an acceptable return value, and (b) practical reasons for why such an interface would be a bad one. There are better ways to do it. Either using threads, or just having a better aio-like interface.

Linus
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Thu, 8 Feb 2001, Martin Dalecki wrote:
> > But you'll have a bitch of a time trying to merge multiple threads/processes reading from the same area on disk at roughly the same time. Your higher levels won't even _know_ that there is merging to be done until the IO requests hit the wall in waiting for the disk.
>
> Merging is a hardware tighted optimization, so it should happen, there we you really have full "knowlendge" and controll of the hardware - namely the device driver.

Or, in many cases, the device itself. There are valid reasons for not doing merging in the driver, but they all tend to boil down to "even lower layers can do a better job of it". They basically _never_ boil down to "upper layers already did it for us".

That said, there tend to be advantages to doing "appropriate" clustering at each level. Upper layers can (and do) use read-ahead to help the lower levels. The write-out can (and currently does not) try to sort the requests for better elevator behaviour. The driver level can (and does) further cluster the requests - even if the low-level device does a perfect job of ordering and merging on its own, it's usually advantageous to have fewer (and bigger) commands in-flight in order to have fewer completion interrupts and less command traffic on the bus.

So it's obviously not entirely black-and-white. Upper layers can help, but it's a mistake to think that they should "do the work".

(Note: a lot of people seem to think that "layering" means that the complexity is in upper layers, and that lower layers should be simple and "stupid". This is not true. A well-balanced layering would have all layers doing potentially equally complex things - but the complexity should be _independent_. Complex interactions are bad. But it's also bad to think that lower levels shouldn't be allowed to optimize because they should be "simple".)

Linus
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Thu, 8 Feb 2001, Marcelo Tosatti wrote:
> On Thu, 8 Feb 2001, Stephen C. Tweedie wrote:
>
> <snip>
>
> > > How do you write high-performance ftp server without threads if select on regular file always returns "ready"?
> >
> > Select can work if the access is sequential, but async IO is a more general solution.
>
> Even async IO (ie aio_read/aio_write) should block on the request queue if its full in Linus mind.

Not necessarily.

I said that "READA/WRITEA" are only worth exporting inside the kernel - because the latencies and complexities are low-level enough that it should not be exported to user space as such. But I could imagine a kernel aio package that does the equivalent of

	bh->b_end_io = completion_handler;
	generic_make_request(WRITE, bh);	/* this may block */
	bh = bh->b_next;

	/* Now, fill it up as much as we can.. */
	current->state = TASK_INTERRUPTIBLE;
	while (more data to be written) {
		if (generic_make_request(WRITEA, bh) < 0)
			break;
		bh = bh->b_next;
	}
	return;

and then you make the _completion handler_ thing continue to feed more requests. Yes, you may block at some points (because you need to always have at least _one_ request in-flight in order to have the state machine active), but you can basically try to avoid blocking more than necessary.

But do you see why the above can't be done from user space? It requires that the completion handler (which runs in an interrupt context) be able to continue to feed requests and keep the queue filled. If you don't do that, you'll never have good throughput, because it takes too long to send signals, re-schedule or whatever to user mode.

And do you see how it has to block _sometimes_? If people do hundreds of AIO requests, we can't let memory just fill up with pending writes..

Linus
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
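To make the state machine concrete: the completion handler Linus refers to would run at interrupt time and keep refilling the queue. In the sketch below the bookkeeping is invented (struct aio_state and its fields are hypothetical, locking elided); only generic_make_request() and the bh fields come from the message above.

/* Hypothetical bookkeeping shared by all bh's of one AIO. */
struct aio_state {
	struct buffer_head *next;	/* not yet submitted */
	int pending;			/* submitted, not completed */
	int error;
	wait_queue_head_t waiters;
};

static void completion_handler(struct buffer_head *bh, int uptodate)
{
	struct aio_state *s = bh->b_private;

	if (!uptodate)
		s->error = 1;

	/* Feed the queue without blocking: WRITEA fails instead of
	 * sleeping when no request slots are free. */
	while (s->next) {
		if (generic_make_request(WRITEA, s->next) < 0)
			break;		/* queue full; retry on the
					 * next completion */
		s->next = s->next->b_reqnext;
		s->pending++;
	}

	if (--s->pending == 0 && !s->next)
		wake_up(&s->waiters);	/* whole AIO finished */
}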
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Thu, 8 Feb 2001, Rik van Riel wrote:
> On Thu, 8 Feb 2001, Mikulas Patocka wrote:
> > > > You need aio_open.
> > >
> > > Could you explain this?
> >
> > If the server is sending many small files, disk spends huge amount time walking directory tree and seeking to inodes. Maybe opening the file is even slower than reading it
>
> Not if you have a big enough inode_cache and dentry_cache.
>
> OTOH ... if you have enough memory the whole async IO argument is moot anyway because all your files will be in memory too.

Note that this _is_ an important point.

You should never _ever_ think about pure IO speed as the most important thing. Even if you get absolutely perfect IO streaming off the fastest disk you can find, I will beat you every single time with a cached setup that doesn't need to do IO at all.

90% of the VFS layer is all about caching, and trying to avoid IO. Of the rest, about 9% is about trying to avoid even calling down to the low-level filesystem, because it's faster if we can handle it at a high level without any need to even worry about issues like physical disk addresses. Even if those addresses are cached.

The remaining 1% is about actually getting the IO done. At that point we end up throwing our hands in the air and saying "ok, this will be slow".

So if you design your system for disk load, you are missing a big portion of the picture. There are cases where IO really matters. The most notable one being databases, certainly _not_ web or ftp servers. For web- or ftp-servers you buy more memory if you want high performance, and you tend to be limited by the network speed anyway (if you have multiple gigabit networks and network speed isn't an issue, then I can also tell you that buying a few gigabytes of RAM isn't an issue, because you are obviously working for something like the DoD and have very little regard for the cost of the thing ;)

For databases (and for file servers that you want to be robust over a crash), IO throughput is an issue mainly because you need to put the damn requests in stable memory somewhere. Which tends to mean that _write_ speed is what really matters, because the reads you can still try to cache as efficiently as humanly possible (and the issue of database design then turns into trying to find every single piece of locality you can, so that the read caching works as well as possible).

Short and sweet: "aio_open()" is basically never supposed to be an issue. If it is, you've misdesigned something, or you're trying too damn hard to single-thread everything (and "hiding" the threading that _does_ happen by just calling it "AIO" instead - lying to yourself, in short).

Linus
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Hi,

On Thu, Feb 08, 2001 at 03:52:35PM +0100, Mikulas Patocka wrote:
> > How do you write high-performance ftp server without threads if select on regular file always returns "ready"?
>
> No, it's not really possible on Linux. Use SYS$QIO call on VMS :-)

Ahh, but even VMS SYS$QIO is synchronous at doing opens, allocation of the IO request packets, and mapping file location to disk blocks. Only the data IO is ever async (and Ben's async IO stuff for Linux provides that too).

--Stephen
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Wednesday February 7, [EMAIL PROTECTED] wrote:
>
> On Wed, 7 Feb 2001, Christoph Hellwig wrote:
> > On Tue, Feb 06, 2001 at 12:59:02PM -0800, Linus Torvalds wrote:
> > >
> > > Actually, they really aren't.
> > >
> > > They kind of _used_ to be, but more and more they've moved away from that historical use. Check in particular the page cache, and as a really extreme case the swap cache version of the page cache.
> >
> > Yes. And that exactly why I think it's ugly to have the left-over caching stuff in the same data structure as the IO buffer.
>
> I do agree.
>
> I would not be opposed to factoring out the "pure block IO" part from the bh struct. It should not even be very hard. You'd do something like
>
>	struct block_io {
>		.. here is the stuff needed for block IO ..
>	};
>
>	struct buffer_head {
>		struct block_io io;
>		.. here is the stuff needed for hashing etc ..
>	}
>
> and then you make "generic_make_request()" and everything lower down take just the "struct block_io".

I was just thinking the same, or a similar thing. I wanted to do

	struct io_head {
		stuff
	};

	struct buffer_head {
		struct io_head;
		more stuff;
	}

so that, as an unnamed substructure, the content of the struct io_head would automagically be promoted to appear to be content of buffer_head. However I then remembered (when it didn't work) that unnamed substructures are a feature of the Plan-9 C compiler, not the GNU Compiler Collection. (Any gcc coders out there think this would be a good thing to add? http://plan9.bell-labs.com/sys/doc/compiler.html )

Anyway, I produced the same result in a rather ugly way with #defines and modified raid5 to use 32byte block_io structures instead of the 80+ byte buffer_heads, and it ... doesn't quite work :-( it boots fine, but raid5 dies and the Oops message is a few kilometers away. Anyway, I think the concept is fine. Patch is below for your inspection.

It occurs to me that Stephen's desire to pass lots of requests through make_request all at once isn't a bad idea and could be done by simply linking the io_heads together with b_reqnext (see the sketch after the patch). This would require:
 1/ all callers of generic_make_request (there are 3) to initialise b_reqnext
 2/ all registered make_request_fn functions (there are again 3 I think) to cope with following b_reqnext
It shouldn't be too hard to make the elevator code take advantage of any ordering that it finds in the list. I don't have a patch which does this.

NeilBrown

--- ./include/linux/fs.h	2001/02/07 22:45:37	1.1
+++ ./include/linux/fs.h	2001/02/07 23:09:05
@@ -207,6 +207,7 @@
 #define BH_Protected	6	/* 1 if the buffer is protected */
 
 /*
+ * THIS COMMENT NO-LONGER CORRECT.
  * Try to keep the most commonly used fields in single cache lines (16
  * bytes) to improve performance.  This ordering should be
  * particularly beneficial on 32-bit processors.
@@ -217,31 +218,43 @@
  * The second 16 bytes we use for lru buffer scans, as used by
  * sync_buffers() and refill_freelist(). -- sct
  */
+
+/*
+ * io_head is all that is needed by device drivers.
+ */
+#define io_head_fields \
+	unsigned long b_state;		/* buffer state bitmap (see above) */ \
+	struct buffer_head *b_reqnext;	/* request queue */ \
+	unsigned short b_size;		/* block size */ \
+	kdev_t b_rdev;			/* Real device */ \
+	unsigned long b_rsector;	/* Real buffer location on disk */ \
+	char * b_data;			/* pointer to data block (512 byte) */ \
+	void (*b_end_io)(struct buffer_head *bh, int uptodate); /* I/O completion */ \
+	void *b_private;		/* reserved for b_end_io */ \
+	struct page *b_page;		/* the page this bh is mapped to */ \
+	/* this line intensionally left blank */
+struct io_head {
+	io_head_fields
+};
+
+/* buffer_head adds all the stuff needed by the buffer cache */
 struct buffer_head {
-	/* First cache line: */
+	io_head_fields
+
 	struct buffer_head *b_next;	/* Hash queue list */
 	unsigned long b_blocknr;	/* block number */
-	unsigned short b_size;		/* block size */
 	unsigned short b_list;		/* List that this buffer appears */
 	kdev_t b_dev;			/* device (B_FREE = free) */
 	atomic_t b_count;		/* users using this block */
-	kdev_t b_rdev;			/* Real device */
-	unsigned long b_state;		/* buffer state bitmap (see above) */
 	unsigned long b_flushtime;	/* Time when (dirty) buffer should be written */
 	struct buffer_head *b_next_free;/* lru/free list linkage */
 	struct buffer_head *b_prev_free;/* doubly linked list
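The field-sharing trick in the patch, reduced to a compilable minimum (field set shortened; only the technique matches the patch, none of the surrounding kernel code is reproduced):

/* Shared leading fields: any buffer_head can be treated as an io_head
 * by the layers that only do IO. */
#define IO_HEAD_FIELDS \
	unsigned long b_state; \
	void *b_reqnext; \
	unsigned short b_size; \
	char *b_data;

struct io_head {
	IO_HEAD_FIELDS
};

struct buffer_head {
	IO_HEAD_FIELDS			/* must stay first */
	unsigned long b_blocknr;	/* buffer-cache-only fields follow */
	int b_count;
};

/* Drivers see only the small struct: */
void submit_io(struct io_head *io);

/* The buffer cache passes its bigger struct via the common prefix; this
 * works because both structs share an identical leading layout. */
static inline void submit_bh_compat(struct buffer_head *bh)
{
	submit_io((struct io_head *)bh);
}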
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Hi, On Wed, Feb 07, 2001 at 12:12:44PM -0700, Richard Gooch wrote: > Stephen C. Tweedie writes: > > > > Sorry? I'm not sure where communication is breaking down here, but > > we really don't seem to be talking about the same things. SGI's > > kiobuf request patches already let us pass a large IO through the > > request layer in a single unit without having to split it up to > > squeeze it through the API. > > Isn't Linus saying that you can use (say) 4 kiB buffer_heads, so you > don't need kiobufs? IIRC, kiobufs are page containers, so a 4 kiB > buffer_head is effectively the same thing. kiobufs let you encode _any_ contiguous region of user VA or of an inode's page cache contents in one kiobuf, no matter how many pages there are in it. A write of a megabyte to a raw device can be encoded as a single kiobuf if we want to pass the entire 1MB IO down to the block layers untouched. That's what the page vector in the kiobuf is for. Doing the same thing with buffer_heads would still require a couple of hundred of them, and you'd have to submit each such buffer_head to the IO subsystem independently. And then the IO layer will just have to reassemble them on the other side (and it may have to scan the device's entire request queue once for every single buffer_head to do so). > But an API extension to allow passing a pre-built chain would be even > better. Yep. --Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
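For readers without the patches at hand, the object Stephen is describing has roughly this shape (field names modelled on the 2.4 kiobuf but simplified; treat it as an illustration, not the exact definition). A megabyte raw write is then one such object with 256 maplist entries, instead of 256 separately submitted buffer_heads.

struct page;	/* kernel page descriptor */

struct kiobuf_sketch {
	int		nr_pages;	/* pages in maplist[] */
	int		offset;		/* byte offset into the first page */
	int		length;		/* total bytes covered */
	struct page	**maplist;	/* the page vector */
	/* completion callback: fires once for the whole IO */
	void		(*end_io)(struct kiobuf_sketch *kio, int uptodate);
};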
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Stephen C. Tweedie writes: > Hi, > > On Tue, Feb 06, 2001 at 06:37:41PM -0800, Linus Torvalds wrote: > > Absolutely. And this is independent of what kind of interface we end up > > using, whether it be kiobuf of just plain "struct buffer_head". In that > > respect they are equivalent. > > Sorry? I'm not sure where communication is breaking down here, but > we really don't seem to be talking about the same things. SGI's > kiobuf request patches already let us pass a large IO through the > request layer in a single unit without having to split it up to > squeeze it through the API. Isn't Linus saying that you can use (say) 4 kiB buffer_heads, so you don't need kiobufs? IIRC, kiobufs are page containers, so a 4 kiB buffer_head is effectively the same thing. > If you really don't mind the size of the buffer_head as a sg fragment > header, then at least I'd like us to be able to submit a pre-built > chain of bh's all at once without having to go through the remap/merge > cost for each single bh. Even if you are limited to feeding one buffer_head at a time, the merge costs should be somewhat mitigated, since you'll decrease your calls into the API by a factor of 8 or 16. But an API extension to allow passing a pre-built chain would be even better. Hopefully I haven't missed the point. I've got the flu so I'm not running on all 4 cylinders :-( Regards, Richard Permanent: [EMAIL PROTECTED] Current: [EMAIL PROTECTED] - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Wed, Feb 07, 2001 at 10:36:47AM -0800, Linus Torvalds wrote: > > > On Wed, 7 Feb 2001, Christoph Hellwig wrote: > > > On Tue, Feb 06, 2001 at 12:59:02PM -0800, Linus Torvalds wrote: > > > > > > Actually, they really aren't. > > > > > > They kind of _used_ to be, but more and more they've moved away from that > > > historical use. Check in particular the page cache, and as a really > > > extreme case the swap cache version of the page cache. > > > > Yes. And that exactly why I think it's ugly to have the left-over > > caching stuff in the same data sctruture as the IO buffer. > > I do agree. > > I would not be opposed to factoring out the "pure block IO" part from the > bh struct. It should not even be very hard. You'd do something like > > struct block_io { > .. here is the stuff needed for block IO .. > }; > > struct buffer_head { > struct block_io io; > .. here is the stuff needed for hashing etc .. > } > > and then you make "generic_make_request()" and everything lower down take > just the "struct block_io". Yep. (besides the name block_io sucks :)) > You'd still leave "ll_rw_block()" and "submit_bh()" operating on bh's, > because they knoa about bh semantics (ie things like scaling the sector > number to the bh size etc). Which means that pretty much all the code > outside the block layer wouldn't even _notice_. Which is a sign of good > layering. Yep. > If you want to do this, please do go ahead. I'll take a look at it. > But do realize that this is not exactly a 2.4.x thing ;) Sure. Christoph -- Whip me. Beat me. Make me maintain AIX. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Wed, 7 Feb 2001, Christoph Hellwig wrote:
> On Tue, Feb 06, 2001 at 12:59:02PM -0800, Linus Torvalds wrote:
> >
> > Actually, they really aren't.
> >
> > They kind of _used_ to be, but more and more they've moved away from that historical use. Check in particular the page cache, and as a really extreme case the swap cache version of the page cache.
>
> Yes. And that exactly why I think it's ugly to have the left-over caching stuff in the same data structure as the IO buffer.

I do agree.

I would not be opposed to factoring out the "pure block IO" part from the bh struct. It should not even be very hard. You'd do something like

	struct block_io {
		.. here is the stuff needed for block IO ..
	};

	struct buffer_head {
		struct block_io io;
		.. here is the stuff needed for hashing etc ..
	}

and then you make "generic_make_request()" and everything lower down take just the "struct block_io".

You'd still leave "ll_rw_block()" and "submit_bh()" operating on bh's, because they know about bh semantics (ie things like scaling the sector number to the bh size etc). Which means that pretty much all the code outside the block layer wouldn't even _notice_. Which is a sign of good layering.

If you want to do this, please do go ahead. But do realize that this is not exactly a 2.4.x thing ;)

Linus
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Tue, Feb 06, 2001 at 09:35:58PM +0100, Ingo Molnar wrote: > caching bmap() blocks was a recent addition around 2.3.20, and i suggested > some time ago to cache pagecache blocks via explicit entries in struct > page. That would be one solution - but it creates overhead. > > but there isnt anything wrong with having the bhs around to cache blocks - > think of it as a 'cached and recycled IO buffer entry, with the block > information cached'. I was not talking about caching physical blocks but the remaining buffer-cache support stuff. Christoph -- Of course it doesn't work. We've performed a software upgrade. Whip me. Beat me. Make me maintain AIX. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Tue, Feb 06, 2001 at 12:59:02PM -0800, Linus Torvalds wrote:
> On Tue, 6 Feb 2001, Christoph Hellwig wrote:
> >
> > The second is that bh's are two things:
> >  - a cacheing object
> >  - an io buffer
>
> Actually, they really aren't.
>
> They kind of _used_ to be, but more and more they've moved away from that historical use. Check in particular the page cache, and as a really extreme case the swap cache version of the page cache.

Yes. And that is exactly why I think it's ugly to have the left-over caching stuff in the same data structure as the IO buffer.

> It certainly _used_ to be true that "bh"s were actually first-class memory management citizens, and actually had a data buffer and a cache associated with them. And because of that historical baggage, that's how many people still think of them.

I do even know that the pagecache is our primary cache now :) Anyway, having that caching cruft still in is ugly.

> > This is not really an clean appropeach, and I would really like to get away from it.
>
> Trust me, you really _can_ get away from it. It's not designed into the bh's at all. You can already just allocate a single (or multiple) "struct buffer_head" and just use them as IO objects, and give them your _own_ pointers to the IO buffer etc.

So true. Exactly because of that the data structures should become separated also.

Christoph
--
Of course it doesn't work. We've performed a software upgrade.
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Hi, On Tue, Feb 06, 2001 at 06:37:41PM -0800, Linus Torvalds wrote: > > > However, I really _do_ want to have the page cache have a bigger > granularity than the smallest memory mapping size, and there are always > special cases that might be able to generate IO in bigger chunks (ie > in-kernel services etc) No argument there. > > Yes. We still have this fundamental property: if a user sends in a > > 128kB IO, we end up having to split it up into buffer_heads and doing > > a separate submit_bh() on each single one. Given our VM, PAGE_SIZE > > (*not* PAGE_CACHE_SIZE) is the best granularity we can hope for in > > this case. > > Absolutely. And this is independent of what kind of interface we end up > using, whether it be kiobuf of just plain "struct buffer_head". In that > respect they are equivalent. Sorry? I'm not sure where communication is breaking down here, but we really don't seem to be talking about the same things. SGI's kiobuf request patches already let us pass a large IO through the request layer in a single unit without having to split it up to squeeze it through the API. > > THAT is the overhead that I'm talking about: having to split a large > > IO into small chunks, each of which just ends up having to be merged > > back again into a single struct request by the *make_request code. > > You could easily just generate the bh then and there, if you wanted to. In the current 2.4 tree, we already do: brw_kiovec creates the temporary buffer_heads on demand to feed them to the IO layers. > Your overhead comes from the fact that you want to gather the IO together. > And I'm saying that you _shouldn't_ gather the IO. There's no point. I don't --- the underlying layer does. And that is where the overhead is: for every single large IO being created by the higher layers, make_request is doing a dozen or more merges because I can only feed the IO through make_request in tiny pieces. > The > gathering is sufficiently done by the low-level code anyway, and I've > tried to explain why the low-level code _has_ to do that work regardless > of what upper layers do. I know. The problem is the low-level code doing it a hundred times for a single injected IO. > You need to generate a separate sg entry for each page anyway. So why not > just use the existing one? The "struct buffer_head". Which already > _handles_ all the issues that you have complained are hard to handle. Two issues here. First is that the buffer_head is an enormously heavyweight object for a sg-list fragment. It contains a ton of fields of interest only to the buffer cache. We could mitigate this to some extent by ensuring that the relevant fields for IO (rsector, size, req_next, state, data, page etc) were in a single cache line. Secondly, the cost of adding each single buffer_head to the request list is O(n) in the number of requests already on the list. We end up walking potentially the entire request queue before finding the request to merge against, and we do that again and again, once for every single buffer_head in the list. We do this even if the caller went in via a multi-bh ll_rw_block() call in which case we know in advance that all of the buffer_heads are contiguous on disk. There is a side problem: right now, things like raid remapping occur during generic_make_request, before we have a request built. 
That means that all of the raid0 remapping or raid1/5 request expanding is being done on a per-buffer_head, not per-request, basis, so again we're doing a whole lot of unnecessary duplicate work when an IO larger than a buffer_head is submitted. If you really don't mind the size of the buffer_head as a sg fragment header, then at least I'd like us to be able to submit a pre-built chain of bh's all at once without having to go through the remap/merge cost for each single bh. Cheers, Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
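At the interface level, the extension Stephen is asking for might look like the sketch below. generic_make_request_chain() and its helpers are hypothetical names invented for illustration; nothing like this entry point existed in the tree at the time.

/* Hypothetical entry point: take a chain of bh's already known to be
 * contiguous on disk, pay the O(n) queue walk once, then append the
 * rest in O(1) each. (Locking and failure paths elided.) */
void generic_make_request_chain(int rw, struct buffer_head *chain)
{
	struct buffer_head *bh;
	struct request *req;

	req = find_or_create_request(rw, chain);	/* one queue scan */
	for (bh = chain; bh != NULL; bh = bh->b_reqnext)
		append_bh_to_request(req, bh);		/* no rescans */
	elevator_queue(req);
}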
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Hi, On Wed, Feb 07, 2001 at 09:10:32AM +, David Howells wrote: > > I presume that correct_size will always be a power of 2... Yes. --Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Linus Torvalds <[EMAIL PROTECTED]> wrote: > Actually, I'd rather leave it in, but speed it up with the saner and > faster > > if (bh->b_size & (correct_size-1)) { I presume that correct_size will always be a power of 2... David - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
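The identity behind the substitution: for n a power of two, (x % n) == (x & (n - 1)), which is what lets the AND replace a division. A standalone check (illustrative):

#include <assert.h>

int main(void)
{
	unsigned int n = 4096;	/* power of two, like a block size */
	unsigned int x;

	for (x = 0; x < 3 * n; x += 511)
		assert((x % n) == (x & (n - 1)));
	return 0;
}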
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Wed, 7 Feb 2001, Stephen C. Tweedie wrote:
> >
> > "struct buffer_head" can deal with pretty much any size: the only thing it cares about is bh->b_size.
>
> Right now, anything larger than a page is physically non-contiguous, and sorry if I didn't make that explicit, but I thought that was obvious enough that I didn't need to. We were talking about raw IO, and as long as we're doing IO out of user anonymous data allocated from individual pages, buffer_heads are limited to that page size in this context.

Sure. That's obviously also one of the reasons why the IO layer has never seen bigger requests anyway - the data _does_ tend to be fundamentally broken up into page-size entities, if for no other reason than that is how user-space sees memory.

However, I really _do_ want to have the page cache have a bigger granularity than the smallest memory mapping size, and there are always special cases that might be able to generate IO in bigger chunks (ie in-kernel services etc)

> Yes. We still have this fundamental property: if a user sends in a 128kB IO, we end up having to split it up into buffer_heads and doing a separate submit_bh() on each single one. Given our VM, PAGE_SIZE (*not* PAGE_CACHE_SIZE) is the best granularity we can hope for in this case.

Absolutely. And this is independent of what kind of interface we end up using, whether it be kiobuf or just plain "struct buffer_head". In that respect they are equivalent.

> THAT is the overhead that I'm talking about: having to split a large IO into small chunks, each of which just ends up having to be merged back again into a single struct request by the *make_request code.

You could easily just generate the bh then and there, if you wanted to.

Your overhead comes from the fact that you want to gather the IO together. And I'm saying that you _shouldn't_ gather the IO. There's no point. The gathering is sufficiently done by the low-level code anyway, and I've tried to explain why the low-level code _has_ to do that work regardless of what upper layers do.

You need to generate a separate sg entry for each page anyway. So why not just use the existing one? The "struct buffer_head". Which already _handles_ all the issues that you have complained are hard to handle.

Linus
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Tue, Feb 06 2001, Linus Torvalds wrote:
> > > [...] so I would be _really_ nervous about just turning it on
> > > silently. This is all very much a 2.5.x-kind of thing ;)
> >
> > Then you might want to apply this :-)
> >
> > --- drivers/block/ll_rw_blk.c~	Wed Feb  7 02:38:31 2001
> > +++ drivers/block/ll_rw_blk.c	Wed Feb  7 02:38:42 2001
> > @@ -1048,7 +1048,7 @@
> >  	/* Verify requested block sizes. */
> >  	for (i = 0; i < nr; i++) {
> >  		struct buffer_head *bh = bhs[i];
> > -		if (bh->b_size % correct_size) {
> > +		if (bh->b_size != correct_size) {
> >  			printk(KERN_NOTICE "ll_rw_block: device %s: "
> >  			       "only %d-char blocks implemented (%u)\n",
> >  			       kdevname(bhs[0]->b_dev),
>
> Actually, I'd rather leave it in, but speed it up with the saner and
> faster
>
> 	if (bh->b_size & (correct_size-1)) {
> 		...
>
> That way people who _want_ to test the odd-size thing can do so. And
> normal code (that never generates requests on any other size than the
> "native" size) won't ever notice either way.

Fine, as I said I didn't spot anything bad so that's why it was changed.

> (Oh, we'll eventually need to move to "correct_size == hardware
> blocksize", not the "virtual blocksize" that it is now. As it is, a
> tester needs to set the soft-blk size by hand now).

Exactly, wrt earlier mail about submitting < hw block size requests to
the lower levels.

--
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Hi,

On Tue, Feb 06, 2001 at 04:50:19PM -0800, Linus Torvalds wrote:
>
> On Wed, 7 Feb 2001, Stephen C. Tweedie wrote:
> >
> > That gets us from 512-byte blocks to 4k, but no more (ll_rw_block
> > enforces a single blocksize on all requests but relaxing that
> > requirement is no big deal). Buffer_heads can't deal with data which
> > spans more than a page right now.
>
> "struct buffer_head" can deal with pretty much any size: the only thing it
> cares about is bh->b_size.

Right now, anything larger than a page is physically non-contiguous,
and sorry if I didn't make that explicit, but I thought that was
obvious enough that I didn't need to. We were talking about raw IO,
and as long as we're doing IO out of user anonymous data allocated
from individual pages, buffer_heads are limited to that page size in
this context.

> Have you ever spent even just 5 minutes actually _looking_ at the block
> device layer, before you decided that you think it needs to be completely
> re-done some other way? It appears that you never bothered to.

Yes. We still have this fundamental property: if a user sends in a
128kB IO, we end up having to split it up into buffer_heads and doing
a separate submit_bh() on each single one. Given our VM, PAGE_SIZE
(*not* PAGE_CACHE_SIZE) is the best granularity we can hope for in
this case.

THAT is the overhead that I'm talking about: having to split a large
IO into small chunks, each of which just ends up having to be merged
back again into a single struct request by the *make_request code.

A constructed IO request basically doesn't care about anything in the
buffer_head except for the data pointer and size, and the completion
status info and callback. All of the physical IO description is in the
struct request by this point. The chain of buffer_heads is carrying
around a huge amount of information which isn't used by the IO, and if
the caller is something like the raw IO driver which isn't using the
buffer cache, that extra buffer_head data is just overhead.

--Stephen

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Wed, 7 Feb 2001, Jens Axboe wrote:
>
> > [...] so I would be _really_ nervous about just turning it on
> > silently. This is all very much a 2.5.x-kind of thing ;)
>
> Then you might want to apply this :-)
>
> --- drivers/block/ll_rw_blk.c~	Wed Feb  7 02:38:31 2001
> +++ drivers/block/ll_rw_blk.c	Wed Feb  7 02:38:42 2001
> @@ -1048,7 +1048,7 @@
>  	/* Verify requested block sizes. */
>  	for (i = 0; i < nr; i++) {
>  		struct buffer_head *bh = bhs[i];
> -		if (bh->b_size % correct_size) {
> +		if (bh->b_size != correct_size) {
>  			printk(KERN_NOTICE "ll_rw_block: device %s: "
>  			       "only %d-char blocks implemented (%u)\n",
>  			       kdevname(bhs[0]->b_dev),

Actually, I'd rather leave it in, but speed it up with the saner and
faster

	if (bh->b_size & (correct_size-1)) {
		...

That way people who _want_ to test the odd-size thing can do so. And
normal code (that never generates requests on any other size than the
"native" size) won't ever notice either way.

(Oh, we'll eventually need to move to "correct_size == hardware
blocksize", not the "virtual blocksize" that it is now. As it is, a
tester needs to set the soft-blk size by hand now).

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
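The reason the masked test is "saner and faster": for any power-of-two correct_size, (x % correct_size) and (x & (correct_size - 1)) compute the same value, and the AND avoids a division. A standalone demonstration, not kernel code:

	#include <assert.h>

	int main(void)
	{
		unsigned int n = 4096;	/* must be a power of two */
		unsigned int sizes[] = { 512, 1024, 4096, 4608, 8192 };
		unsigned int i;

		/* the two tests agree for every size, aligned or not */
		for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
			assert((sizes[i] % n) == (sizes[i] & (n - 1)));
		return 0;
	}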
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Wed, 7 Feb 2001, Stephen C. Tweedie wrote:
>
> > The fact is, if you have problems like the above, then you don't
> > understand the interfaces. And it sounds like you designed kiobuf support
> > around the wrong set of interfaces.
>
> They used the only interfaces available at the time...

Ehh.. "generic_make_request()" goes back a _loong_ time. It used to be
called just "make_request()", but all my points still stand. It's even
exported to modules.

As far as I know, the raid code has always used this interface exactly
because raid needed to feed back the remapped stuff and get around the
blocksizing in ll_rw_block(). This really isn't anything new. I _know_
it's there in 2.2.x, and I would not be surprised if it was there even
in 2.0.x.

> > If you want to get at the _sector_ level, then you do
> ...
> > which doesn't look all that complicated to me. What's the problem?
>
> Doesn't this break nastily as soon as the IO hits an LVM or soft raid
> device? I don't think we are safe if we create a larger-sized
> buffer_head which spans a raid stripe: the raid mapping is only
> applied once per buffer_head.

Absolutely. This is exactly what I mean by saying that low-level
drivers may not actually be able to handle new cases that they've never
been asked to do before - they just never saw anything like a 64kB
request before or something that crossed its own alignment.

But the _higher_ levels are there. And there's absolutely nothing in
the design that is a real problem. But there's no question that you
might need to fix up more than one or two low-level drivers.

(The only drivers I know better are the IDE ones, and as far as I can
tell they'd have no trouble at all with any of this. Most other normal
drivers are likely to be in this same situation. But because I've not
had a reason to test, I certainly won't guarantee even that).

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Tue, Feb 06 2001, Linus Torvalds wrote:
> > I don't see anything that would break doing this, in fact you can
> > do this as long as the buffers are all at least a multiple of the
> > block size. All the drivers I've inspected handle this fine, noone
> > assumes that rq->bh->b_size is the same in all the buffers attached
> > to the request.
>
> It's really easy to get this wrong when going forward in the request list:
> you need to make sure that you update "request->current_nr_sectors" each
> time you move on to the next bh.
>
> I would not be surprised if some of them have been seriously buggered.

Maybe have been, but it looks good at least with the general drivers
that I mentioned.

> [...] so I would be _really_ nervous about just turning it on
> silently. This is all very much a 2.5.x-kind of thing ;)

Then you might want to apply this :-)

--- drivers/block/ll_rw_blk.c~	Wed Feb  7 02:38:31 2001
+++ drivers/block/ll_rw_blk.c	Wed Feb  7 02:38:42 2001
@@ -1048,7 +1048,7 @@
 	/* Verify requested block sizes. */
 	for (i = 0; i < nr; i++) {
 		struct buffer_head *bh = bhs[i];
-		if (bh->b_size % correct_size) {
+		if (bh->b_size != correct_size) {
 			printk(KERN_NOTICE "ll_rw_block: device %s: "
 			       "only %d-char blocks implemented (%u)\n",
 			       kdevname(bhs[0]->b_dev),

--
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Hi,

On Tue, Feb 06, 2001 at 04:41:21PM -0800, Linus Torvalds wrote:
>
> On Wed, 7 Feb 2001, Stephen C. Tweedie wrote:
> > No, it is a problem of the ll_rw_block interface: buffer_heads need to
> > be aligned on disk at a multiple of their buffer size.
>
> Ehh.. True of ll_rw_block() and submit_bh(), which are meant for the
> traditional block device setup, where "b_blocknr" is the "virtual
> blocknumber" and that indeed is tied in to the block size.
>
> The fact is, if you have problems like the above, then you don't
> understand the interfaces. And it sounds like you designed kiobuf support
> around the wrong set of interfaces.

They used the only interfaces available at the time...

> If you want to get at the _sector_ level, then you do
...
> which doesn't look all that complicated to me. What's the problem?

Doesn't this break nastily as soon as the IO hits an LVM or soft raid
device? I don't think we are safe if we create a larger-sized
buffer_head which spans a raid stripe: the raid mapping is only
applied once per buffer_head.

--Stephen

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Wed, 7 Feb 2001, Ingo Molnar wrote:
>
> most likely some coding error on your side. buffer-size mismatches should
> show up as filesystem corruption or random DMA scribble, not in-driver
> oopses.

I'm not sure. If I was a driver writer (and I'm happy those days are
mostly behind me ;), I would not be totally disinclined to check for
various limits and things. There can be hardware out there that simply
has trouble with non-native alignment, ie be unhappy about getting a
1kB request that is aligned in memory at a 512-byte boundary.

So there are real reasons why drivers might need updating. Don't
dismiss the concerns out-of-hand.

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Wed, 7 Feb 2001, Jens Axboe wrote:
>
> I don't see anything that would break doing this, in fact you can
> do this as long as the buffers are all at least a multiple of the
> block size. All the drivers I've inspected handle this fine, noone
> assumes that rq->bh->b_size is the same in all the buffers attached
> to the request.

It's really easy to get this wrong when going forward in the request
list: you need to make sure that you update
"request->current_nr_sectors" each time you move on to the next bh.

I would not be surprised if some of them have been seriously buggered.

On the other hand, I would _also_ not be surprised if we've actually
fixed a lot of them: one of the things that the RAID code and loopback
testing does is exactly getting these kinds of issues right (not this
exact one, but similar ones). And let's remember things like the old
ultrastor driver that was totally unable to handle anything but 1kB
devices etc.

I would not be _totally_ surprised if it turns out that there are still
drivers out there that remember the time when Linux only ever had 1kB
buffers. Even if it is 7 years ago or so ;)

(Also, there might be drivers that are "optimized" - they set the IO
length once per request, and just never set it again as they do partial
end_io() calls.)

None of those kinds of issues would ever be found under normal load, so
I would be _really_ nervous about just turning it on silently. This is
all very much a 2.5.x-kind of thing ;)

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
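The bookkeeping Linus is warning about looks roughly like this in a driver's completion path. A hedged sketch modeled on 2.4-era end_request() handling (names and details simplified); the bug is skipping the refresh marked below when the bhs in one request have different sizes:

	/* Sketch: finish the current bh and advance to the next one
	 * in the same request. */
	static void advance_to_next_bh(struct request *req)
	{
		struct buffer_head *bh = req->bh;

		bh->b_end_io(bh, 1 /* uptodate */);

		req->bh = bh->b_reqnext;
		if (req->bh) {
			req->sector += bh->b_size >> 9;
			/* the next bh may have a different b_size -
			 * this refresh is what a buggy driver skips */
			req->current_nr_sectors = req->bh->b_size >> 9;
			req->buffer = req->bh->b_data;
		}
	}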
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Wed, Feb 07, 2001 at 02:06:27AM +0100, Ingo Molnar wrote:
>
> On Tue, 6 Feb 2001, Jeff V. Merkey wrote:
>
> > > I don't see anything that would break doing this, in fact you can
> > > do this as long as the buffers are all at least a multiple of the
> > > block size. All the drivers I've inspected handle this fine, noone
> > > assumes that rq->bh->b_size is the same in all the buffers attached
> > > to the request. This includes SCSI (scsi_lib.c builds sg tables),
> > > IDE, and the Compaq array + Mylex driver. This mostly leaves the
> > > "old-style" drivers using CURRENT etc, the kernel helpers for these
> > > handle it as well.
> > >
> > > So I would appreciate pointers to these devices that break so we
> > > can inspect them.
> > >
> > > --
> > > Jens Axboe
> >
> > Adaptec drivers had an oops. Also, AIC7XXX also had some oops with it.
>
> most likely some coding error on your side. buffer-size mismatches should
> show up as filesystem corruption or random DMA scribble, not in-driver
> oopses.
>
> Ingo

Oops was in my code, but was caused by these drivers. The Adaptec
driver did have an oops at its own code address, AIC7XXX crashed in my
code.

Jeff

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Wed, 7 Feb 2001, Jens Axboe wrote:
>
> > > Adaptec drivers had an oops. Also, AIC7XXX also had some oops with it.
> >
> > most likely some coding error on your side. buffer-size mismatches should
> > show up as filesystem corruption or random DMA scribble, not in-driver
> > oopses.
>
> I would suspect so, aic7xxx shouldn't care about anything except the
> sg entries and I would seriously doubt that it makes any such
> assumptions on them :-)

yep - and not a single reference to b_size in aic7xxx.c.

	Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Wed, Feb 07, 2001 at 02:08:53AM +0100, Jens Axboe wrote:
> On Tue, Feb 06 2001, Jeff V. Merkey wrote:
> > Adaptec drivers had an oops. Also, AIC7XXX also had some oops with it.
>
> Do you still have this oops?

I can recreate. Will work on it tomorrow. SCI testing today.

Jeff

> --
> Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Tue, Feb 06 2001, Jeff V. Merkey wrote:
> Adaptec drivers had an oops. Also, AIC7XXX also had some oops with it.

Do you still have this oops?

--
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Wed, Feb 07 2001, Ingo Molnar wrote:
> > > So I would appreciate pointers to these devices that break so we
> > > can inspect them.
> > >
> > > --
> > > Jens Axboe
> >
> > Adaptec drivers had an oops. Also, AIC7XXX also had some oops with it.
>
> most likely some coding error on your side. buffer-size mismatches should
> show up as filesystem corruption or random DMA scribble, not in-driver
> oopses.

I would suspect so, aic7xxx shouldn't care about anything except the
sg entries and I would seriously doubt that it makes any such
assumptions on them :-)

--
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Tue, 6 Feb 2001, Jeff V. Merkey wrote:
> > I don't see anything that would break doing this, in fact you can
> > do this as long as the buffers are all at least a multiple of the
> > block size. All the drivers I've inspected handle this fine, noone
> > assumes that rq->bh->b_size is the same in all the buffers attached
> > to the request. This includes SCSI (scsi_lib.c builds sg tables),
> > IDE, and the Compaq array + Mylex driver. This mostly leaves the
> > "old-style" drivers using CURRENT etc, the kernel helpers for these
> > handle it as well.
> >
> > So I would appreciate pointers to these devices that break so we
> > can inspect them.
> >
> > --
> > Jens Axboe
>
> Adaptec drivers had an oops. Also, AIC7XXX also had some oops with it.

most likely some coding error on your side. buffer-size mismatches
should show up as filesystem corruption or random DMA scribble, not
in-driver oopses.

	Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Wed, Feb 07, 2001 at 02:02:21AM +0100, Jens Axboe wrote:
> On Tue, Feb 06 2001, Jeff V. Merkey wrote:
> > I remember Linus asking to try this variable buffer head chaining
> > thing 512-1024-512 kind of stuff several months back, and mixing them to
> > see what would happen -- result. About half the drivers break with it.
> > The interface allows you to do it, I've tried it, (works on Andre's
> > drivers, but a lot of SCSI drivers break) but a lot of drivers seem to
> > have assumptions about these things all being the same size in a
> > buffer head chain.
>
> I don't see anything that would break doing this, in fact you can
> do this as long as the buffers are all at least a multiple of the
> block size. All the drivers I've inspected handle this fine, noone
> assumes that rq->bh->b_size is the same in all the buffers attached
> to the request. This includes SCSI (scsi_lib.c builds sg tables),
> IDE, and the Compaq array + Mylex driver. This mostly leaves the
> "old-style" drivers using CURRENT etc, the kernel helpers for these
> handle it as well.
>
> So I would appreciate pointers to these devices that break so we
> can inspect them.
>
> --
> Jens Axboe

Adaptec drivers had an oops. Also, AIC7XXX also had some oops with it.

Jeff

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Wed, Feb 07, 2001 at 02:01:54AM +0100, Ingo Molnar wrote:
>
> On Tue, 6 Feb 2001, Jeff V. Merkey wrote:
>
> > I remember Linus asking to try this variable buffer head chaining
> > thing 512-1024-512 kind of stuff several months back, and mixing them
> > to see what would happen -- result. About half the drivers break with
> > it. [...]
>
> time to fix them then - instead of rewriting the rest of the kernel ;-)
>
> Ingo

I agree.

Jeff

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Tue, 6 Feb 2001, Jeff V. Merkey wrote:

> I remember Linus asking to try this variable buffer head chaining
> thing 512-1024-512 kind of stuff several months back, and mixing them
> to see what would happen -- result. About half the drivers break with
> it. [...]

time to fix them then - instead of rewriting the rest of the kernel ;-)

	Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Tue, Feb 06 2001, Jeff V. Merkey wrote:
> I remember Linus asking to try this variable buffer head chaining
> thing 512-1024-512 kind of stuff several months back, and mixing them to
> see what would happen -- result. About half the drivers break with it.
> The interface allows you to do it, I've tried it, (works on Andre's
> drivers, but a lot of SCSI drivers break) but a lot of drivers seem to
> have assumptions about these things all being the same size in a
> buffer head chain.

I don't see anything that would break doing this, in fact you can
do this as long as the buffers are all at least a multiple of the
block size. All the drivers I've inspected handle this fine, noone
assumes that rq->bh->b_size is the same in all the buffers attached
to the request. This includes SCSI (scsi_lib.c builds sg tables),
IDE, and the Compaq array + Mylex driver. This mostly leaves the
"old-style" drivers using CURRENT etc, the kernel helpers for these
handle it as well.

So I would appreciate pointers to these devices that break so we
can inspect them.

--
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
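The reason sg-building drivers cope naturally with mixed sizes: each bh contributes its own length to the scatter-gather table, so nothing in the loop assumes a uniform b_size. A simplified sketch of the kind of loop Jens says scsi_lib.c performs; struct my_sg stands in for the real scatterlist type and the function name is hypothetical:

	struct my_sg {
		char		*address;
		unsigned int	length;
	};

	/* Sketch: walk the bh chain of a request, one sg entry per bh. */
	static int build_sg_table(struct request *req, struct my_sg *sg, int max)
	{
		struct buffer_head *bh;
		int nseg = 0;

		for (bh = req->bh; bh && nseg < max; bh = bh->b_reqnext) {
			sg[nseg].address = bh->b_data;
			sg[nseg].length  = bh->b_size;	/* per-bh, not per-request */
			nseg++;
		}
		return nseg;	/* number of segments built */
	}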
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Tue, Feb 06, 2001 at 04:50:19PM -0800, Linus Torvalds wrote:
>
> On Wed, 7 Feb 2001, Stephen C. Tweedie wrote:
> >
> > That gets us from 512-byte blocks to 4k, but no more (ll_rw_block
> > enforces a single blocksize on all requests but relaxing that
> > requirement is no big deal). Buffer_heads can't deal with data which
> > spans more than a page right now.
>
> Stephen, you're so full of shit lately that it's unbelievable. You're
> batting a clear 0.000 so far.
>
> "struct buffer_head" can deal with pretty much any size: the only thing it
> cares about is bh->b_size.
>
> It so happens that if you have highmem support, then "create_bounce()"
> will work on a per-page thing, but that just means that you'd better have
> done your bouncing into low memory before you call generic_make_request().
>
> Have you ever spent even just 5 minutes actually _looking_ at the block
> device layer, before you decided that you think it needs to be completely
> re-done some other way? It appears that you never bothered to.
>
> Sure, I would not be surprised if some device driver ends up being
> surprised if you start passing it different request sizes than it is used
> to. But that's a driver and testing issue, nothing more.
>
> (Which is not to say that "driver and testing" issues aren't important as
> hell: it's one of the more scary things in fact, and it can take a long
> time to get right if you start doing something that historically has never
> been done and thus has historically never gotten any testing. So I'm not
> saying that it should work out-of-the-box. But I _am_ saying that there's
> no point in trying to re-design upper layers that already do ALL of this
> with no problems at all).
>
> Linus
>

I remember Linus asking to try this variable buffer head chaining thing
512-1024-512 kind of stuff several months back, and mixing them to see
what would happen -- result. About half the drivers break with it. The
interface allows you to do it, I've tried it, (works on Andre's
drivers, but a lot of SCSI drivers break) but a lot of drivers seem to
have assumptions about these things all being the same size in a buffer
head chain.

:-)

Jeff

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Wed, 7 Feb 2001, Stephen C. Tweedie wrote:
>
> That gets us from 512-byte blocks to 4k, but no more (ll_rw_block
> enforces a single blocksize on all requests but relaxing that
> requirement is no big deal). Buffer_heads can't deal with data which
> spans more than a page right now.

Stephen, you're so full of shit lately that it's unbelievable. You're
batting a clear 0.000 so far.

"struct buffer_head" can deal with pretty much any size: the only thing
it cares about is bh->b_size.

It so happens that if you have highmem support, then "create_bounce()"
will work on a per-page thing, but that just means that you'd better
have done your bouncing into low memory before you call
generic_make_request().

Have you ever spent even just 5 minutes actually _looking_ at the block
device layer, before you decided that you think it needs to be
completely re-done some other way? It appears that you never bothered
to.

Sure, I would not be surprised if some device driver ends up being
surprised if you start passing it different request sizes than it is
used to. But that's a driver and testing issue, nothing more.

(Which is not to say that "driver and testing" issues aren't important
as hell: it's one of the more scary things in fact, and it can take a
long time to get right if you start doing something that historically
has never been done and thus has historically never gotten any testing.
So I'm not saying that it should work out-of-the-box. But I _am_ saying
that there's no point in trying to re-design upper layers that already
do ALL of this with no problems at all).

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Wed, Feb 07, 2001 at 12:36:29AM +0000, Stephen C. Tweedie wrote:
> Hi,
>
> On Tue, Feb 06, 2001 at 07:25:19PM -0500, Ingo Molnar wrote:
> >
> > On Wed, 7 Feb 2001, Stephen C. Tweedie wrote:
> >
> > > No, it is a problem of the ll_rw_block interface: buffer_heads need to
> > > be aligned on disk at a multiple of their buffer size. Under the Unix
> > > raw IO interface it is perfectly legal to begin a 128kB IO at offset
> > > 512 bytes into a device.
> >
> > then we should either fix this limitation, or the raw IO code should split
> > the request up into several, variable-size bhs, so that the range is
> > filled out optimally with aligned bhs.
>
> That gets us from 512-byte blocks to 4k, but no more (ll_rw_block
> enforces a single blocksize on all requests but relaxing that
> requirement is no big deal). Buffer_heads can't deal with data which
> spans more than a page right now.

I can handle requests larger than a page (64K) but I am not using the
buffer cache in Linux. We really need an NT/NetWare like model to
support the non-Unix FS's properly. i.e. a disk request should be
and get rid of this fixed block stuff with buffer heads. :-)

I understand that the way the elevator is implemented in Linux makes
this very hard at this point to support, since it's very troublesome
to handle requests that overlap sector boundaries.

Jeff

> --Stephen

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Tue, 6 Feb 2001, Ingo Molnar wrote:
>
> On Wed, 7 Feb 2001, Stephen C. Tweedie wrote:
>
> > No, it is a problem of the ll_rw_block interface: buffer_heads need to
> > be aligned on disk at a multiple of their buffer size. Under the Unix
> > raw IO interface it is perfectly legal to begin a 128kB IO at offset
> > 512 bytes into a device.
>
> then we should either fix this limitation, or the raw IO code should split
> the request up into several, variable-size bhs, so that the range is
> filled out optimally with aligned bhs.

As mentioned, no such limitation exists if you just use the right
interfaces.

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Wed, 7 Feb 2001, Stephen C. Tweedie wrote:
>
> On Tue, Feb 06, 2001 at 08:57:13PM +0100, Ingo Molnar wrote:
> >
> > [overhead of 512-byte bhs in the raw IO code is an artificial problem of
> > the raw IO code.]
>
> No, it is a problem of the ll_rw_block interface: buffer_heads need to
> be aligned on disk at a multiple of their buffer size.

Ehh.. True of ll_rw_block() and submit_bh(), which are meant for the
traditional block device setup, where "b_blocknr" is the "virtual
blocknumber" and that indeed is tied in to the block size.

That's the whole _point_ of ll_rw_block() and friends - they show the
device at a different "virtual blocking" level than the low-level
physical accesses necessarily are. Which very much means that if you
have a 4kB "view" of the device, you get a stream of 4kB blocks. Not
4kB sized blocks at 512-byte offsets (or whatever the hardware blocking
size is).

This way the interfaces are independent of the hardware blocksize.
Which is logical and what you'd expect. You need to go to a lower level
to see those kinds of blocking issues.

But it is _not_ true of "generic_make_request()" and the block IO layer
in general. It obviously _cannot_ be true, because the block I/O layer
has always had the notion of merging consecutive blocks together -
regardless of whether the end result is even a power of two or anything
like that in size. You can make an IO request for pretty much any size,
as long as it's a multiple of the hardware blocksize (normally 512
bytes, but there are certainly devices out there with other
blocksizes).

The fact is, if you have problems like the above, then you don't
understand the interfaces. And it sounds like you designed kiobuf
support around the wrong set of interfaces.

If you want to get at the _sector_ level, then you do

	lock_bh();
	bh->b_rdev = device;
	bh->b_rsector = sector-number (where linux defines "sector" to be 512 bytes)
	bh->b_size = size in bytes (must be a multiple of 512);
	bh->b_data = pointer;
	bh->b_end_io = callback;
	generic_make_request(rw, bh);

which doesn't look all that complicated to me. What's the problem?

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
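Linus's recipe, gathered into one hypothetical helper. His "lock_bh()" is shorthand, so this sketch just sets BH_Lock directly; the allocation and error handling are elided, and submit_sectors itself is an assumed name, not an existing kernel function:

	/* Sketch: submit an arbitrary sector-aligned range directly to
	 * the block layer, bypassing ll_rw_block()'s virtual blocking. */
	static void submit_sectors(int rw, kdev_t dev, unsigned long sector,
				   char *data, unsigned int bytes,
				   void (*end_io)(struct buffer_head *, int))
	{
		struct buffer_head *bh = kmalloc(sizeof(*bh), GFP_KERNEL);

		memset(bh, 0, sizeof(*bh));
		bh->b_state   = 1 << BH_Lock;	/* stands in for "lock_bh()" */
		bh->b_rdev    = dev;
		bh->b_rsector = sector;		/* 512-byte sectors */
		bh->b_size    = bytes;		/* any multiple of 512 */
		bh->b_data    = data;
		bh->b_end_io  = end_io;		/* called on completion */

		generic_make_request(rw, bh);
	}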
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Hi,

On Tue, Feb 06, 2001 at 07:25:19PM -0500, Ingo Molnar wrote:
>
> On Wed, 7 Feb 2001, Stephen C. Tweedie wrote:
>
> > No, it is a problem of the ll_rw_block interface: buffer_heads need to
> > be aligned on disk at a multiple of their buffer size. Under the Unix
> > raw IO interface it is perfectly legal to begin a 128kB IO at offset
> > 512 bytes into a device.
>
> then we should either fix this limitation, or the raw IO code should split
> the request up into several, variable-size bhs, so that the range is
> filled out optimally with aligned bhs.

That gets us from 512-byte blocks to 4k, but no more (ll_rw_block
enforces a single blocksize on all requests but relaxing that
requirement is no big deal). Buffer_heads can't deal with data which
spans more than a page right now.

--Stephen

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Wed, Feb 07 2001, Stephen C. Tweedie wrote:
> > [overhead of 512-byte bhs in the raw IO code is an artificial problem of
> > the raw IO code.]
>
> No, it is a problem of the ll_rw_block interface: buffer_heads need to
> be aligned on disk at a multiple of their buffer size. Under the Unix
> raw IO interface it is perfectly legal to begin a 128kB IO at offset
> 512 bytes into a device.

Submitting buffers to lower layers that are not hw sector aligned can't
be supported below ll_rw_blk anyway (they can, but look at the problems
this has always created), and I would much rather see stuff like this
handled outside of there.

--
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Hi,

On Tue, Feb 06, 2001 at 08:57:13PM +0100, Ingo Molnar wrote:
>
> [overhead of 512-byte bhs in the raw IO code is an artificial problem of
> the raw IO code.]

No, it is a problem of the ll_rw_block interface: buffer_heads need to
be aligned on disk at a multiple of their buffer size. Under the Unix
raw IO interface it is perfectly legal to begin a 128kB IO at offset
512 bytes into a device.

--Stephen

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
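Why the ll_rw_block()-style interface imposes that alignment: it addresses the device in units of b_size, so the sector a buffer lands on is derived from the block number, roughly b_rsector = b_blocknr * (b_size >> 9) in 2.4-era submit_bh(). That arithmetic can only ever produce positions that are a multiple of the buffer size. A toy illustration (values assumed for the example):

	/* ll_rw_block()-style addressing: the block number is in units
	 * of the block size, so a 4kB bh can only start at 4kB-aligned
	 * sectors - a 128kB IO at byte offset 512 is unreachable. */
	unsigned long  b_blocknr = 5;		/* in b_size units */
	unsigned short b_size    = 4096;	/* bytes           */
	unsigned long  b_rsector = b_blocknr * (b_size >> 9);	/* = 40 */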
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Wed, 7 Feb 2001, Stephen C. Tweedie wrote:

> No, it is a problem of the ll_rw_block interface: buffer_heads need to
> be aligned on disk at a multiple of their buffer size. Under the Unix
> raw IO interface it is perfectly legal to begin a 128kB IO at offset
> 512 bytes into a device.

then we should either fix this limitation, or the raw IO code should
split the request up into several, variable-size bhs, so that the range
is filled out optimally with aligned bhs.

	Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
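The split Ingo suggests can be computed greedily: cover the range with the largest chunk that is both naturally aligned and no bigger than what remains, capped at a page. A standalone demonstration for the thread's 128kB-at-offset-512 example (user-space C, not kernel code):

	#include <stdio.h>

	#define MAX_CHUNK 8	/* 8 sectors * 512 = one 4kB page */

	int main(void)
	{
		unsigned long sector = 1;	/* 128kB IO at byte offset 512 */
		unsigned long left = 256;	/* 128kB in 512-byte sectors   */

		while (left) {
			unsigned long chunk = MAX_CHUNK;

			/* shrink until the chunk is aligned and fits:
			 * yields 512, 1024, 2048, then 4kB bhs here */
			while (chunk > left || (sector & (chunk - 1)))
				chunk >>= 1;

			printf("bh at sector %3lu, b_size %lu\n",
			       sector, chunk * 512);
			sector += chunk;
			left -= chunk;
		}
		return 0;
	}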
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Tue, 6 Feb 2001, Marcelo Tosatti wrote:
>
> It's arguing against making a smart application block on the disk while
> it's able to use the CPU for other work.

There are currently no other alternatives in user space. You'd have to
create whole new interfaces for aio_read/write, and ways for the kernel
to inform user space that "now you can re-try submitting your IO".

Could be done. But that's a big thing.

> An application which sets non blocking behavior and busy waits for a
> request (which seems to be your argument) is just stupid, of course.

Tell me what else it could do at some point? You need something like
select() to wait on it. There are no such interfaces right now...

(besides, latency would suck. I bet you're better off waiting for the
requests if they are all used up. It takes too long to get deep into
the kernel from user space, and you cannot use the exclusive waiters
with their anti-herd behaviour etc).

Simple rule: if you want to optimize concurrency and avoid waiting -
use several processes or threads instead. At which point you can get
real work done on multiple CPU's, instead of worrying about what
happens when you have to wait on the disk.

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
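Linus's "simple rule" in user-space terms: instead of waiting for an async-IO interface, hand the blocking read to a worker thread and keep the main thread busy on the CPU. A minimal POSIX sketch (error handling omitted; the file name is arbitrary):

	#define _XOPEN_SOURCE 500
	#include <fcntl.h>
	#include <pthread.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>

	struct job { int fd; off_t off; size_t len; };

	static void *io_worker(void *arg)
	{
		struct job *j = arg;
		char *buf = malloc(j->len);

		/* this thread blocks in the kernel; the rest of the
		 * program keeps using the CPU meanwhile */
		ssize_t n = pread(j->fd, buf, j->len, j->off);

		printf("read %ld bytes\n", (long)n);
		free(buf);
		return NULL;
	}

	int main(void)
	{
		struct job j = { open("/etc/hosts", O_RDONLY), 0, 4096 };
		pthread_t t;

		pthread_create(&t, NULL, io_worker, &j);
		/* ... useful CPU-bound work happens here ... */
		pthread_join(&t, NULL);
		return 0;
	}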