Re: DVD blockdevice buffers
Hi!

> I can easily give more examples - just ask. BTW, the fact that this stuff
> is so fragmented is not a bug - we want it evenly spread over disk, just
> to have the ability to allocate a block/inode not too far from the piece
> of bitmap we'll need to modify.

BTW, is this still true? This assumes that a long seek takes more time than
a short seek. With a 12,000 rpm drive, one rotation takes 5 msec. A "full"
seek is around 12 msec these days, no?

								Pavel
--
Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt,
details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
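Pavel's numbers can be sanity-checked with a line of arithmetic. A minimal sketch, assuming the hypothetical drive parameters from the post (12,000 rpm, ~12 ms full-stroke seek) rather than any measured hardware:

```c
/* Back-of-the-envelope check of the figures above.  The drive parameters
 * (12,000 rpm, ~12 ms full-stroke seek) are the hypothetical ones from
 * the post, not measurements of a real drive. */
static double rotation_ms(double rpm)
{
    return 60.0 * 1000.0 / rpm;     /* time for one full rotation, in ms */
}

static double avg_rotational_delay_ms(double rpm)
{
    return rotation_ms(rpm) / 2.0;  /* on average, half a rotation */
}
```

At 12,000 rpm this gives a 5.0 ms rotation and a 2.5 ms average rotational delay; if a full seek costs ~12 ms, the difference between a short and a long seek is only a couple of rotations' worth of latency, which is exactly why the locality assumption is being questioned.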
Re: DVD blockdevice buffers
On 25 May 2001, Eric W. Biederman wrote:
>
> I obviously picked a bad name, and a bad place to start.
>
>	int data_uptodate(struct page *page, unsigned offset, unsigned len)
>
> This is really an extension to PG_uptodate, not readpage.

Ugh. The above is just horrible.

It doesn't fix any problems; it is only an ugly work-around for a situation
that never happens in real life. An application that only re-reads the data
that it just wrote itself is a _stupid_ application, and I'm absolutely not
interested in having a new interface that is useless for everything _but_
such a stupid application.

		Linus
Re: DVD blockdevice buffers
Linus Torvalds <[EMAIL PROTECTED]> writes:

> On 25 May 2001, Eric W. Biederman wrote:
> >
> > For the small random read case we could use a
> > mapping->a_ops->readpartialpage
>
> No, if so I'd prefer to just change "readpage()" to take the same kinds of
> arguments commit_page() does, namely the beginning and end of the read
> area.

No. I obviously picked a bad name, and a bad place to start.

	int data_uptodate(struct page *page, unsigned offset, unsigned len)

This is really an extension to PG_uptodate, not readpage. It should never
ever do any I/O. It should just implement a check to see if we already have
all of the wanted data in the page in the page cache. As simply a
buffer-checking entity it will likely share virtually no code with
readpage.

> Filesystems could choose to ignore the arguments completely, and just act
> the way they already do - filling in the whole page.
>
> OR a filesystem might know that the page is partially up-to-date (because
> of a partial write), and just return an immediate "this area is already
> uptodate" return code or something. Or it could even fill in the page
> partially, and just unlock it (but not mark it up-to-date: the reader then
> has to wait for the page and then look at PG_error to decide whether the
> partial read succeeded or not).

First, mm/filemap.c has generic cache management, so it should make the
decision. The logic is: does this page have the data in cache? If so, just
return it. Otherwise read all that you can at once. So we either want a
virtual function that can make the decision on a per-filesystem basis about
whether we have the data we need in the page cache, or we need to convert
the buffer_head into a more generic entity so everyone can use it.

> I don't think it really matters, I have to say. It would be very easy to
> implement (all the buffer-based filesystems already use the common
> fs/buffer.c readpage, so it would really need changes in just one place,
> along with some expanded prototypes with ignored arguments in some other
> places).
>
> But it _could_ be a performance helper for some strange loads (write a
> partial page and immediately read it back - what a stupid program), and
> more importantly Al Viro felt earlier that a "partial read" approach might
> help his metadata-in-page-cache stuff because metadata tends to sometimes
> be scattered wildly across the disk.

Maybe. I think that despite the similarities (partial pages) Al and I are
looking at two entirely different problems.

> So then we'd have
>
>	int (*readpage)(struct file *, struct page *, unsigned offset, unsigned len);
>
> and the semantics would be:
>  - the function needs to start IO for _at_least_ the page area
>    [offset, offset+len[
>  - return error code for _immediate_ errors (ie not asynchronous)
>  - if there was an asynchronous read error, we set PG_error
>  - if the page is fully populated, we set PG_uptodate
>  - if the page was not fully populated, but the partial read succeeded,
>    the filesystem needs to have some way of keeping track of the partial
>    success ("page->buffers" is obviously the way for a block-based one),
>    and must _not_ set PG_uptodate.
>  - after the asynchronous operation (whether complete, partial or
>    unsuccessful), the page is unlocked to tell the reader that it is done.
>
> Now, this would be coupled with:
>  - generic_file_read() does the read-ahead decisions, and may decide that
>    we really only need a partial page.
>
> But NOTE! The above is meant to potentially avoid unnecessary IO and thus
> speed up the read-in. HOWEVER, it _will_ slow down the case where we first
> would read a small part of the page and then soon afterwards read in the
> rest of the page. I suspect that is the common case by far, and that the
> current whole-page approach is the faster one in 99% of all cases. So I'm
> not at all convinced that the above is actually worth it.

I don't want partial I/O at all, and I always want to see reads reading in
all of the data for a page. I just want an interface where we can say:
"hey, we don't actually have to do any I/O for this read request - give
them back their data."

> If somebody can show that the above is worth it and worth implementing (ie
> the Al Viro kind of "I have a real-life scenario where I'd like to use
> it"), and implements it (should be a fairly trivial exercise), then I'll
> happily accept new semantics like this.
>
> But I do _not_ want to see another new function ("partialread()"), and I
> do _not_ want to see synchronous interfaces (Al's first suggestion).

My naming mistake. I don't want to see this logic combined with readpage;
that is an entirely different case. I can't see how adding a slow path to
PageUptodate to check for a partially uptodate page could hurt our
performance, and I can imagine how it could help.

Eric
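The check Eric describes can be sketched in userspace. This is a mock, not kernel code: a real page would consult the uptodate bits on `page->buffers`, and the 1K-blocks-in-a-4K-page layout and all names here are illustrative stand-ins:

```c
/* Userspace mock of Eric's data_uptodate() idea: answer "is the byte
 * range [offset, offset+len) already valid in this page?" with no I/O.
 * A bitmask over 1K blocks in a 4K page stands in for the list of
 * buffer_heads a real kernel page would carry.  Illustrative only. */
enum { PAGE_SZ = 4096, BLOCK_SZ = 1024, NBLOCKS = PAGE_SZ / BLOCK_SZ };

struct mock_page {
    unsigned valid_mask;        /* bit n set => block n is up to date */
};

static int data_uptodate(const struct mock_page *page,
                         unsigned offset, unsigned len)
{
    unsigned first = offset / BLOCK_SZ;
    unsigned last  = (offset + len - 1) / BLOCK_SZ;
    unsigned b;

    for (b = first; b <= last; b++)
        if (!(page->valid_mask & (1u << b)))
            return 0;           /* a covering block is not cached */
    return 1;                   /* whole range already in the cache */
}
```

Under this model a page where only the first 1K block was written satisfies a 512-byte re-read entirely from cache, while a 2K read still needs I/O for the missing block - which is the distinction generic_file_read currently cannot see.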
Re: blkdev-pagecache-2 [was Re: DVD blockdevice buffers]
On Fri, May 25, 2001 at 10:12:51PM +0200, Andrea Arcangeli wrote:
> >ftp://ftp.kernel.org/pub/linux/kernel/people/andrea/patches/v2.2/2.4.5pre6/blkdev-pagecache-2
                                                                 ^ 4

sorry
blkdev-pagecache-2 [was Re: DVD blockdevice buffers]
On Thu, May 24, 2001 at 12:32:20AM +0200, Andrea Arcangeli wrote:
> userspace. I will try to work on the blkdev patch tomorrow to bring it
> in an usable state.

It seems to be in a usable state right now, but it is still very early
beta; I need to recheck the whole thing, and I will do that tomorrow. For
now it should get right the fsck of a ro-mounted fs and the cache coherency
across multiple inodes all pointing to the same blkdev; it actually worked
without any problem in the first basic tests I did.

However I expect it to corrupt a rw-mounted fs if you open the blkdev under
it (the fsck test happens with the fs ro), so while it's in a usable state
it's not ready for public consumption yet. Of course ramdisk is still
totally broken too. The other first round of bugs mentioned in the first
thread should be fixed. The blocksize is still hardwired to 4k; I'll think
about the read-modify-write problem later.

About the proposed readpage API change, I think it's not worthwhile for new
hardware, where reading 1k or 4k doesn't make a relevant difference.
Handling partial I/O seems worthwhile only during writes, because a partial
write would otherwise trigger a read-modify-write operation with a
synchronous read.

	ftp://ftp.kernel.org/pub/linux/kernel/people/andrea/patches/v2.2/2.4.5pre6/blkdev-pagecache-2

Andrea
Re: DVD blockdevice buffers
Hi,

On Fri, May 25, 2001 at 02:24:52PM -0400, Alexander Viro wrote:
> If you are OK with adding two extra arguments to ->readpage() I could
> submit a patch replacing that with plain and simple page cache by
> tomorrow. It should not be a problem to port, but I want to get some
> sleep before testing it...

The problem will be returning the IO completion status. We can't just rely
on PG_Error: what happens if two separate partial reads are submitted at
once within the same page, yet the page is not completely in cache?

If we forced readpage to be synchronous in that case we could just return
the status directly. Otherwise we need a separate way of determining the
completion status once the page becomes unlocked (eg. have a special
readpage return value which means "all done, completion status is X", and
resubmit the readpage to get that completion status once the page lock is
dropped.)

--Stephen
Re: DVD blockdevice buffers
Hi,

On Fri, May 25, 2001 at 09:09:37AM -0600, Eric W. Biederman wrote:
> The case we don't get quite right are partial reads that hit cached
> data, on a page that doesn't have PG_Uptodate set. We don't actually
> need to do the I/O on the surrounding page to satisfy the read
> request. But we do because generic_file_read doesn't even think about
> that case.

That's *precisely* the case in question. The whole design of the page cache
involves reading entire pages at a time, in fact. We _could_ read in only
partial pages, but in that case we end up wasting a lot of the page.

> For the small random read case we could use a
> mapping->a_ops->readpartialpage
> function that sees if a request can be satisfied entirely
> from cached data. But this is just to allow generic_file_read
> to handle this case.

Agreed. The only case where blockdev-in-pagecache really results in
significantly more IO is partial writes followed by partial reads. Reads
from totally-uncached pages ought to just fill the entire page from disk;
it's only when there is something already present in the cache for that
page that we want to look for partial buffers.

--Stephen
Re: DVD blockdevice buffers
On Fri, 25 May 2001, Linus Torvalds wrote:

> For example, I suspect that the metadata bitmaps in particular cache so
> well that the fact that we need to do several seeks over them every once
> in a while is a non-issue: we might be happier having the bitmaps in
> memory (and having simpler code), than try to avoid the occasional seeks.
>
> The "simpler code" argument in particular is, I think, a fairly strong
> one. Our current bitmap code is quite horrible, with multiple layers of
> caching (ext2 will explicitly hold references to some blocks, while at the
> same time depending on the buffer cache to cache the other blocks -
> nightmare)

Oh, the current code is a complete mess - no arguments here. An 8-element
LRU. Combined with the fact that directory allocation tries to get an even
distribution of directory inodes across cylinder groups, you blow that LRU
completely on a regular basis if your fs is larger than 16 cg. For a fs
with 1Kb blocks that's 128Mb; for 4Kb - 2Gb. And the pain starts at half of
that size.

If you are OK with adding two extra arguments to ->readpage() I could
submit a patch replacing that with plain and simple page cache by tomorrow.
It should not be a problem to port, but I want to get some sleep before
testing it...
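The 128Mb and 2Gb thresholds Al quotes follow from ext2's layout, where one block-bitmap block describes one cylinder group, so a group covers blocksize * 8 blocks of blocksize bytes each. A sketch of that arithmetic (assuming this standard layout; the function names are illustrative):

```c
/* Sketch of the arithmetic behind the 128Mb / 2Gb figures above,
 * assuming ext2's layout: one bitmap block per cylinder group, so a
 * group covers (blocksize * 8) blocks of blocksize bytes each. */
static unsigned long long bytes_per_group(unsigned blocksize)
{
    unsigned long long blocks = (unsigned long long)blocksize * 8;
    return blocks * blocksize;       /* bytes of data one group spans */
}

static unsigned long long fs_size_at_16_groups(unsigned blocksize)
{
    /* past 16 groups, the 8-entry bitmap LRU starts thrashing */
    return 16ULL * bytes_per_group(blocksize);
}
```

With 1Kb blocks a group spans 8Mb, so 16 groups is 128Mb; with 4Kb blocks a group spans 128Mb and 16 groups is 2Gb - matching the sizes in the post.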
Re: DVD blockdevice buffers
On Fri, 25 May 2001, Alexander Viro wrote:
>
> OK, here's a real-world scenario: inode table on 1Kb ext2 (or 4Kb on
> Alpha, etc.) consists of compact pieces - one per cylinder group.
>
> There is a natural mapping from inodes to offsets in that beast.
> However, these pieces can trivially be not page-aligned. readpage()
> on a boundary of two pieces means a large seek.

Yes. But by "real-world" I mean "you can tell in real life". I see the
theoretical arguments for it. But I want to know that it makes a real
difference under real load.

For example, I suspect that the metadata bitmaps in particular cache so
well that the fact that we need to do several seeks over them every once in
a while is a non-issue: we might be happier having the bitmaps in memory
(and having simpler code), than trying to avoid the occasional seeks.

The "simpler code" argument in particular is, I think, a fairly strong one.
Our current bitmap code is quite horrible, with multiple layers of caching
(ext2 will explicitly hold references to some blocks, while at the same
time depending on the buffer cache to cache the other blocks - nightmare).

		Linus
Re: DVD blockdevice buffers
On Fri, 25 May 2001, Linus Torvalds wrote:

> If somebody can show that the above is worth it and worth implementing (ie
> the Al Viro kind of "I have a real-life scenario where I'd like to use
> it"), and implements it (should be a fairly trivial exercise), then I'll
> happily accept new semantics like this.

OK, here's a real-world scenario: the inode table on 1Kb ext2 (or 4Kb on
Alpha, etc.) consists of compact pieces - one per cylinder group. There is
a natural mapping from inodes to offsets in that beast. However, these
pieces can trivially be not page-aligned. readpage() on a boundary of two
pieces means a large seek.

Another example (even funnier) is bitmaps. Same story, but here you can
have 1Kb per cylinder group - which covers 8Mb in that case. I.e. on Alpha
it means that readpage() will require 7 seeks, 8Mb each. And the worst
thing is, unless we have corrupted free-inode counters we _will_ find what
we need in the first 1Kb chunk we are looking at.

I can easily give more examples - just ask. BTW, the fact that this stuff
is so fragmented is not a bug - we want it evenly spread over the disk,
just to have the ability to allocate a block/inode not too far from the
piece of bitmap we'll need to modify.

Al

PS: Uff... OK, looking at the locking stuff in fs/super.c was useful - I've
found a way to do it that is seriously simpler than what I used to do. Just
let me torture it for a couple of hours - so far it looks fine...
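Al's bitmap example can be checked numerically. A sketch assuming the parameters in the post (Alpha's 8K pages, ext2 with 1Kb blocks, one 1Kb bitmap block per cylinder group; the constant names are illustrative):

```c
/* Numeric check of the bitmap example above, under the post's
 * assumptions: Alpha pages are 8K, ext2 uses 1Kb blocks, and each
 * cylinder group contributes one contiguous 1Kb bitmap chunk that sits
 * 8Mb away from the next one on disk.  Illustrative only. */
enum { ALPHA_PAGE = 8192, BITMAP_CHUNK = 1024, EXT2_BLOCK = 1024 };

static unsigned seeks_per_readpage(void)
{
    /* 8 bitmap chunks fill one page; reading them all in sequence
     * crosses 7 inter-chunk gaps, i.e. 7 long seeks */
    return ALPHA_PAGE / BITMAP_CHUNK - 1;
}

static unsigned long group_span_bytes(void)
{
    /* a 1Kb bitmap has 8192 bits => it maps 8192 blocks of 1Kb = 8Mb */
    return (unsigned long)BITMAP_CHUNK * 8 * EXT2_BLOCK;
}
```

This reproduces the "7 seeks, 8Mb each" cost of one whole-page bitmap read, against which the first 1Kb chunk alone would almost always have sufficed.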
Re: DVD blockdevice buffers
On 25 May 2001, Eric W. Biederman wrote:
>
> For the small random read case we could use a
> mapping->a_ops->readpartialpage

No, if so I'd prefer to just change "readpage()" to take the same kinds of
arguments commit_page() does, namely the beginning and end of the read
area.

Filesystems could choose to ignore the arguments completely, and just act
the way they already do - filling in the whole page.

OR a filesystem might know that the page is partially up-to-date (because
of a partial write), and just return an immediate "this area is already
uptodate" return code or something. Or it could even fill in the page
partially, and just unlock it (but not mark it up-to-date: the reader then
has to wait for the page and then look at PG_error to decide whether the
partial read succeeded or not).

I don't think it really matters, I have to say. It would be very easy to
implement (all the buffer-based filesystems already use the common
fs/buffer.c readpage, so it would really need changes in just one place,
along with some expanded prototypes with ignored arguments in some other
places).

But it _could_ be a performance helper for some strange loads (write a
partial page and immediately read it back - what a stupid program), and
more importantly Al Viro felt earlier that a "partial read" approach might
help his metadata-in-page-cache stuff because metadata tends to sometimes
be scattered wildly across the disk.

So then we'd have

	int (*readpage)(struct file *, struct page *, unsigned offset, unsigned len);

and the semantics would be:
 - the function needs to start IO for _at_least_ the page area
   [offset, offset+len[
 - return an error code for _immediate_ errors (ie not asynchronous)
 - if there was an asynchronous read error, we set PG_error
 - if the page is fully populated, we set PG_uptodate
 - if the page was not fully populated, but the partial read succeeded,
   the filesystem needs to have some way of keeping track of the partial
   success ("page->buffers" is obviously the way for a block-based one),
   and must _not_ set PG_uptodate.
 - after the asynchronous operation (whether complete, partial or
   unsuccessful), the page is unlocked to tell the reader that it is done.

Now, this would be coupled with:
 - generic_file_read() does the read-ahead decisions, and may decide that
   we really only need a partial page.

But NOTE! The above is meant to potentially avoid unnecessary IO and thus
speed up the read-in. HOWEVER, it _will_ slow down the case where we first
would read a small part of the page and then soon afterwards read in the
rest of the page. I suspect that is the common case by far, and that the
current whole-page approach is the faster one in 99% of all cases. So I'm
not at all convinced that the above is actually worth it.

If somebody can show that the above is worth it and worth implementing (ie
the Al Viro kind of "I have a real-life scenario where I'd like to use
it"), and implements it (should be a fairly trivial exercise), then I'll
happily accept new semantics like this.

But I do _not_ want to see another new function ("partialread()"), and I
do _not_ want to see synchronous interfaces (Al's first suggestion).

		Linus
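The unlock/PG_uptodate protocol described above can be modelled in a few lines of userspace C. This is a mock, not the proposed kernel implementation: the flag names mirror the kernel's, but the page structure, block layout, and "I/O" are stand-ins:

```c
/* Userspace model of the proposed semantics: readpage(offset, len)
 * validates at least the asked range, sets PG_uptodate only when the
 * whole page ends up populated, and unlocks the page when done.  The
 * page structure, the 4 x 1K block layout, and the "I/O" (just setting
 * bits) are illustrative stand-ins, not kernel code. */
enum { PG_locked = 1u, PG_uptodate = 2u, PG_error = 4u };
enum { BLK = 1024, NBLK = 4, FULL = (1u << NBLK) - 1 };

struct mock_page { unsigned flags, valid; };

static int mock_readpage(struct mock_page *p, unsigned offset, unsigned len)
{
    unsigned first = offset / BLK, last = (offset + len - 1) / BLK;
    unsigned b;

    p->flags |= PG_locked;
    for (b = first; b <= last; b++)
        p->valid |= 1u << b;     /* "IO" for _at_least_ [offset, offset+len[ */
    if (p->valid == FULL)
        p->flags |= PG_uptodate; /* page fully populated */
    p->flags &= ~PG_locked;      /* unlock: tell the reader it is done */
    return 0;                    /* no immediate error */
}
```

The key invariant is the one Linus spells out: a successful partial read must leave PG_uptodate clear, so the per-block state (page->buffers in a real filesystem) remains the only record of what is valid.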
Re: DVD blockdevice buffers
"Stephen C. Tweedie" <[EMAIL PROTECTED]> writes:

> Hi,
>
> On Wed, May 23, 2001 at 01:01:56PM -0700, Linus Torvalds wrote:
> > On Wed, 23 May 2001, Stephen C. Tweedie wrote:
> > > > that the filesystems already do. And you can do it a lot _better_
> > > > than the current buffer-cache-based approach. Done right, you can
> > > > actually do all IO in page-sized chunks, BUT fall down on
> > > > sector-sized things for the cases where you want to.
> > >
> > > Right, but you still lose the caching in that case. The write works,
> > > but the "cache" becomes nothing more than a buffer.
> >
> > No. It is still cached. You find the buffer with "page->buffers", and
> > when all of them are up-to-date (whether from read-in or from having
> > written to them all), you just mark the whole page up-to-date.
>
> It works, but *only* if the application writes a whole page worth of
> data. From the previous emails I had the understanding that this
> application is writing small data items in random 512-byte blocks. It
> is not writing the rest of the page. The page never becomes uptodate.
> That in itself isn't a problem, but readpage() can't tell the
> underlying layers that only a part of the page is wanted, so there's
> no way to tell readpage that the page is in fact partially uptodate.
>
> And just telling the application to write the rest of the page too
> isn't going to cut it, because the rest of the page may contain other
> objects which aren't in cache, so we can't write them without first
> reading the page. The only alternative is to change the on-disk
> layout, forcing a minimum PAGESIZE on the IO chunks.
>
> > This _works_. Try it on ext2 or NFS today.
>
> Not for this workload. Now, maybe it's not an interesting workload.
> But shifting the uptodate granularity from buffer to page sized _does_
> impact the effectiveness of the cache for such an application.
>
> > So in short: the page cache supports _today_ all the optimizations.
>
> For write, perhaps; but for subsequent read, generic_read_page
> doesn't see any of the data in the page unless the whole page has been
> written.

generic_read_page??? block_read_full_page seems to handle this correctly -
at least with respect to keeping the data around, and not doing the I/O on
data we already have. But it still reads in the unpopulated parts of the
page even when that is unnecessary.

The case we don't get quite right is partial reads that hit cached data, on
a page that doesn't have PG_Uptodate set. We don't actually need to do the
I/O on the surrounding page to satisfy the read request. But we do, because
generic_file_read doesn't even think about that case.

For the small random read case we could use a
mapping->a_ops->readpartialpage function that sees if a request can be
satisfied entirely from cached data. But this is just to allow
generic_file_read to handle this case.

Eric
Re: DVD blockdevice buffers
Stephen C. Tweedie [EMAIL PROTECTED] writes: Hi, On Wed, May 23, 2001 at 01:01:56PM -0700, Linus Torvalds wrote: On Wed, 23 May 2001, Stephen C. Tweedie wrote: that the filesystems already do. And you can do it a lot _better_ than the current buffer-cache-based approach. Done right, you can actually do all IO in page-sized chunks, BUT fall down on sector-sized things for the cases where you want to. Right, but you still lose the caching in that case. The write works, but the cache becomes nothing more than a buffer. No. It is still cached. You find the buffer with page-buffer, and when all of them are up-to-date (whether from read-in or from having written to them all), you just mark the whole page up-to-date. It works, but *only* if the application writes a whole page worth of data. From the previous emails I had the understanding that this application is writing small data items in random 512-byte blocks. It is not writing the rest of the page. The page never becomes uptodate. That in itself isn't a problem, but readpage() can't tell the underlying layers that only a part of the page is wanted, so there's no way to tell readpage that the page is in fact partially uptodate. And just telling the application to write the rest of the page too isn't going to cut it, because the rest of the page may contain other objects which aren't in cache so we can't write them without first reading the page. The only alternative is to change the on-disk layout, forcing a minimum PAGESIZE on the IO chunks. This _works_. Try it on ext2 or NFS today. Not for this workload. Now, maybe it's not an interesting workload. But shifting the uptodate granularity from buffer to page sized _does_ impact the effectiveness of the cache for such an application. So in short: the page cache supports _today_ all the optimizations. For write, perhaps; but for subsequent read, generic_read_page doesn't see any of the data in the page unless the whole page has been written. generic_read_page??? 
block_read_full_page seems to handle this correctly. At least with respect to keeping the data around, and not doing the I/O on data we already have. But it still reads in the unpopulated parts of the page even if it is unnecessary. The case we don't get quite right are partial reads that hit cached data, on a page that doesn't have PG_Uptodate set. We don't actually need to do the I/O on the surrounding page to satisfy the read request. But we do because generic_file_read doesn't even think about that case. For the small random read case we could use a mapping-a_ops-readpartialpage function that sees if a request can be satisfied entirely from cached data. But this is just to allow generic_file_read to handle this, case. Eric - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: DVD blockdevice buffers
On 25 May 2001, Eric W. Biederman wrote: For the small random read case we could use a mapping-a_ops-readpartialpage No, if so I'd prefer to just change readpage() to take the same kinds of arguments commit_page() does, namely the beginning and end of the read area. Filesystems could choose to ignore the arguments completely, and just act the way they already do - filling in the whole page. OR a filesystem might know that the page is partially up-to-date (because of a partial write), and just return an immediate this area is already uptodate return code or something. Or it could even fill in the page partially, and just unlock it (but not mark it up-to-date: the reader then has to wait for the page and then look at PG_error to decide whether the partial read succeeded or not). I don't think it really matters, I have to say. It would be very easy to implement (all the buffer-based filesystems already use the common fs/buffer.c readpage, so it would really need changes in just one place, along with some expanded prototypes with ignored arguments in some other places). But it _could_ be a performance helper for some strange loads (write a partial page and immediately read it back - what a stupid program), and more importantly Al Viro felt earlier that a partial read approach might help his metadata-in-page-cache stuff because metadata tends to sometimes be scattered wildly across the disk. 
So then we'd have int (*readpage)(struct file *, struct page *, unsigned offset, unsigned len); and the semantics would be: - the function needs to start IO for _at_least_ the page area [offset, offset+len[ - return error code for _immediate_ errors (ie not asynchronous) - if there was an asynchronous read error, we set PG_error - if the page is fully populated, we set PG_uptodate - if the page was not fully populated, but the partial read succeeded, the filesystem needs to have some way of keeping track of the partial success (page-buffers is obviously the way for a block-based one), and must _not_ set PG_uptodate. - after the asynchronous operation (whether complete, partial or unsuccessful), the page is unlocked to tell the reader that it is done. Now, this would be coupled with: - generic_file_read() does the read-ahead decisions, and may decide that we really only need a partial page. But NOTE! The above is meant to potentially avoid unnecessary IO and thus speed up the read-in. HOWEVER, it _will_ slow down the case where we first would read a small part of the page and then soon afterwards read in the rest of the page. I suspect that is the common case by far, and that the current whole-page approach is the faster one in 99% of all cases. So I'm not at all convinced that the above is actually worth it. If somebody can show that the above is worth it and worth implementing (ie the Al Viro kind of I have a real-life schenario where I'd like to use it), and implements it (should be a fairly trivial exercise), then I'll happily accept new semantics like this. But I do _not_ want to see another new function (partialread()), and I do _not_ want to see synchronous interfaces (Al's first suggestion). Linus - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: DVD blockdevice buffers
On Fri, 25 May 2001, Linus Torvalds wrote: If somebody can show that the above is worth it and worth implementing (ie the Al Viro kind of I have a real-life schenario where I'd like to use it), and implements it (should be a fairly trivial exercise), then I'll happily accept new semantics like this. OK, here's a real-world scenario: inode table on 1Kb ext2 (or 4Kb on Alpha, etc.) consists of compact pieces - one per cylinder group. There is a natural mapping from inodes to offsets in that beast. However, these pieces can trivially be not page-aligned. readpage() on a boundary of two pieces means large seek. Another example (even funnier) is bitmaps. Same story, but here you can have 1Kb per cylinder group. Which is 8Mb in that case. I.e. on Alpha it means that readpage() will require 7 seeks, 8Mb each. And the worst thing being, unless we have corrupted free inodes counters we _will_ find what we need in the first 1Kb chunk we are looking at. I can easily give more examples - just ask. BTW, the fact that this stuff is so fragmented is not a bug - we want it evenly spread over disk, just to have the ability to allocate a block/inode not too far from the piece of bitmap we'll need to modify. Al PS: Uff... OK, looking at the locking stuff in fs/super.c was useful - I've found a way to do it that is seriously simpler than what I used to do. Just let me torture it for a couple of hours - so far it looks fine... - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: DVD blockdevice buffers
On Fri, 25 May 2001, Alexander Viro wrote: OK, here's a real-world scenario: inode table on 1Kb ext2 (or 4Kb on Alpha, etc.) consists of compact pieces - one per cylinder group. There is a natural mapping from inodes to offsets in that beast. However, these pieces can trivially be not page-aligned. readpage() on a boundary of two pieces means large seek. Yes. But by real-world I mean you can tell in real life. I see the theoretical arguments for it. But I want to know that it makes a real difference under real load. For example, I suspect that the metadata bitmaps in particular cache so well that the fact that we need to do several seeks over them every once in a while is a non-issue: we might be happier having the bitmaps in memory (and having simpler code), than try to avoid the occasional seeks. The simpler code argument in particular is, I think, a fairly strong one. Our current bitmap code is quite horrible, with multiple layers of caching (ext2 will explicitly hold references to some blocks, while at the same time depending on the buffer cache to cache the other blocks - nightmare) Linus - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: DVD blockdevice buffers
On Fri, 25 May 2001, Linus Torvalds wrote:

> For example, I suspect that the metadata bitmaps in particular cache so
> well that the fact that we need to do several seeks over them every
> once in a while is a non-issue: we might be happier having the bitmaps
> in memory (and having simpler code), than try to avoid the occasional
> seeks.
>
> The "simpler code" argument in particular is, I think, a fairly strong
> one. Our current bitmap code is quite horrible, with multiple layers of
> caching (ext2 will explicitly hold references to some blocks, while at
> the same time depending on the buffer cache to cache the other blocks -
> a nightmare).

Oh, the current code is a complete mess - no arguments here. An
8-element LRU. Combined with the fact that directory allocation tries to
get an even distribution of directory inodes over cylinder groups, you
blow that LRU completely on a regular basis if your fs is larger than 16
cylinder groups. For a 1Kb-block fs that's 128Mb; for 4Kb - 2Gb. And the
pain starts at half of that size.

If you are OK with adding two extra arguments to ->readpage() I could
submit a patch replacing that with a plain and simple page cache by
tomorrow. It should not be a problem to port, but I want to get some
sleep before testing it...
Re: DVD blockdevice buffers
Hi,

On Fri, May 25, 2001 at 09:09:37AM -0600, Eric W. Biederman wrote:

> The case we don't get quite right is partial reads that hit cached
> data, on a page that doesn't have PG_Uptodate set. We don't actually
> need to do the I/O on the surrounding page to satisfy the read request.
> But we do, because generic_file_read doesn't even think about that
> case.

That's *precisely* the case in question. The whole design of the page
cache involves reading entire pages at a time, in fact. We _could_ read
in only partial pages, but in that case we end up wasting a lot of the
page.

> For the small random read case we could use a
> mapping->a_ops->readpartialpage function that sees if a request can be
> satisfied entirely from cached data. But this is just to allow
> generic_file_read to handle this case.

Agreed. The only case where blockdev-in-pagecache really results in
significantly more IO is partial writes followed by partial reads. Reads
from totally-uncached pages ought to just fill the entire page from
disk; it's only when there is something already present in the cache for
that page that we want to look for partial buffers.

--Stephen
Re: DVD blockdevice buffers
Hi,

On Fri, May 25, 2001 at 02:24:52PM -0400, Alexander Viro wrote:

> If you are OK with adding two extra arguments to ->readpage() I could
> submit a patch replacing that with a plain and simple page cache by
> tomorrow. It should not be a problem to port, but I want to get some
> sleep before testing it...

The problem will be returning the IO completion status. We can't just
rely on PG_Error: what happens if two separate partial reads are
submitted at once within the same page, yet the page is not completely
in cache?

If we forced readpage to be synchronous in that case we could just
return the status directly. Otherwise we need a separate way of
determining the completion status once the page becomes unlocked (eg.
have a special readpage return which means "all done, completion status
is X", and resubmit the readpage to get that completion status once the
page lock is dropped.)

--Stephen
blkdev-pagecache-2 [was Re: DVD blockdevice buffers]
On Thu, May 24, 2001 at 12:32:20AM +0200, Andrea Arcangeli wrote:

> userspace. I will try to work on the blkdev patch tomorrow to bring it
> in an usable state.

It seems to be in a usable state right now, but it is still very early
beta and I need to recheck the whole thing; I will do that tomorrow. For
now it should get the fsck of a ro-mounted fs right, as well as the
cache coherency across multiple inodes all pointing to the same blkdev -
it actually worked without any problem in the first basic tests I did.
However I expect it to corrupt a rw-mounted fs if you open the blkdev
under it (the fsck test happens with the fs ro), so while it's in a
usable state it's not ready for public consumption yet. Of course
ramdisk is still totally broken too. The other first round of bugs
mentioned in the first thread should be fixed. The blocksize is still
hardwired to 4k; I'll think about the read-modify-write problem later.

About the proposed readpage API change, I think it's not worthwhile for
new hardware, where reading 1k or 4k doesn't make a relevant difference.
Handling partial I/O seems worthwhile only during writes, because a
partial write would otherwise trigger a read-modify-write operation with
a synchronous read.

ftp://ftp.kernel.org/pub/linux/kernel/people/andrea/patches/v2.2/2.4.5pre6/blkdev-pagecache-2

Andrea
Re: blkdev-pagecache-2 [was Re: DVD blockdevice buffers]
On Fri, May 25, 2001 at 10:12:51PM +0200, Andrea Arcangeli wrote:

> ftp://ftp.kernel.org/pub/linux/kernel/people/andrea/patches/v2.2/2.4.5pre6/blkdev-pagecache-2
                                                                 ^ 4

sorry
Re: DVD blockdevice buffers
Linus Torvalds <[EMAIL PROTECTED]> writes:

> On 25 May 2001, Eric W. Biederman wrote:
> > For the small random read case we could use a
> > mapping->a_ops->readpartialpage
>
> No, if so I'd prefer to just change readpage() to take the same kinds
> of arguments commit_page() does, namely the beginning and end of the
> read area.

No. I obviously picked a bad name, and a bad place to start.

int data_uptodate(struct page *page, unsigned offset, unsigned len)

This is really an extension to PG_uptodate, not readpage. It should
never ever do any I/O. It should just implement a check to see if we
have all of the data wanted already in the page in the page cache. As
simply a buffer-checking entity it will likely share virtually 0 code
with readpage.

> Filesystems could choose to ignore the arguments completely, and just
> act the way they already do - filling in the whole page. OR a
> filesystem might know that the page is partially up-to-date (because of
> a partial write), and just return an immediate "this area is already
> uptodate" return code or something. Or it could even fill in the page
> partially, and just unlock it (but not mark it up-to-date: the reader
> then has to wait for the page and then look at PG_error to decide
> whether the partial read succeeded or not).

First, mm/filemap.c has generic cache management, so it should make the
decision. The logic is: does this page have the data in cache? If so,
just return it. Otherwise read all that you can at once.

So we either want a virtual function that can make the decision on a
per-filesystem basis if we have the data we need in the page cache, or
we need to convert the buffer_head into a more generic entity so
everyone can use it.

> I don't think it really matters, I have to say. It would be very easy
> to implement (all the buffer-based filesystems already use the common
> fs/buffer.c readpage, so it would really need changes in just one
> place, along with some expanded prototypes with ignored arguments in
> some other places).
>
> But it _could_ be a performance helper for some strange loads (write a
> partial page and immediately read it back - what a stupid program), and
> more importantly Al Viro felt earlier that a partial read approach
> might help his metadata-in-page-cache stuff because metadata tends to
> sometimes be scattered wildly across the disk.

Maybe. I think despite the similarities (partial pages) Al and I are
looking at two entirely different problems.

> So then we'd have
>
>	int (*readpage)(struct file *, struct page *, unsigned offset, unsigned len);
>
> and the semantics would be:
> - the function needs to start IO for _at_least_ the page area
>   [offset, offset+len[
> - return error code for _immediate_ errors (ie not asynchronous)
> - if there was an asynchronous read error, we set PG_error
> - if the page is fully populated, we set PG_uptodate
> - if the page was not fully populated, but the partial read succeeded,
>   the filesystem needs to have some way of keeping track of the partial
>   success (page->buffers is obviously the way for a block-based one),
>   and must _not_ set PG_uptodate.
> - after the asynchronous operation (whether complete, partial or
>   unsuccessful), the page is unlocked to tell the reader that it is
>   done.
>
> Now, this would be coupled with:
> - generic_file_read() does the read-ahead decisions, and may decide
>   that we really only need a partial page.
>
> But NOTE! The above is meant to potentially avoid unnecessary IO and
> thus speed up the read-in. HOWEVER, it _will_ slow down the case where
> we first would read a small part of the page and then soon afterwards
> read in the rest of the page. I suspect that is the common case by far,
> and that the current whole-page approach is the faster one in 99% of
> all cases. So I'm not at all convinced that the above is actually worth
> it.

I don't want partial I/O at all. And I always want to see reads reading
in all of the data for a page. I just want an interface where we can say
"hey, we don't actually have to do any I/O for this read request, give
them back their data".

> If somebody can show that the above is worth it and worth implementing
> (ie the Al Viro kind of "I have a real-life scenario where I'd like to
> use it"), and implements it (should be a fairly trivial exercise), then
> I'll happily accept new semantics like this.
>
> But I do _not_ want to see another new function (partialread()), and I
> do _not_ want to see synchronous interfaces (Al's first suggestion).

My naming mistake. I don't want to see this logic combined with
readpage. That is an entirely different case.

I can't see how adding a slow case to PageUptodate to check for a
partially uptodate page could hurt our performance. And I can imagine
how it could help.

Eric
Re: DVD blockdevice buffers
On 25 May 2001, Eric W. Biederman wrote:

> I obviously picked a bad name, and a bad place to start.
>
>	int data_uptodate(struct page *page, unsigned offset, unsigned len)
>
> This is really an extension to PG_uptodate, not readpage.

Ugh. The above is just horrible. It doesn't fix any problems, it is only
an ugly work-around for a situation that never happens in real life.

An application that only re-reads the data that it just wrote itself is
a _stupid_ application, and I'm absolutely not interested in having a
new interface that is useless for everything _but_ such a stupid
application.

Linus
Re: DVD blockdevice buffers
Hi,

On Wed, May 23, 2001 at 01:01:56PM -0700, Linus Torvalds wrote:

> On Wed, 23 May 2001, Stephen C. Tweedie wrote:
> > > that the filesystems already do. And you can do it a lot _better_ than the
> > > current buffer-cache-based approach. Done right, you can actually do all
> > > IO in page-sized chunks, BUT fall down on sector-sized things for the
> > > cases where you want to.
> >
> > Right, but you still lose the caching in that case. The write works,
> > but the "cache" becomes nothing more than a buffer.
>
> No. It is still cached. You find the buffer with "page->buffer", and when
> all of them are up-to-date (whether from read-in or from having written
> to them all), you just mark the whole page up-to-date.

It works, but *only* if the application writes a whole page worth of
data. From the previous emails I had the understanding that this
application is writing small data items in random 512-byte blocks. It is
not writing the rest of the page. The page never becomes uptodate.

That in itself isn't a problem, but readpage() can't tell the underlying
layers that only a part of the page is wanted, so there's no way to tell
readpage that the page is in fact partially uptodate. And just telling
the application to write the rest of the page too isn't going to cut it,
because the rest of the page may contain other objects which aren't in
cache, so we can't write them without first reading the page. The only
alternative is to change the on-disk layout, forcing a minimum PAGESIZE
on the IO chunks.

> This _works_. Try it on ext2 or NFS today.

Not for this workload. Now, maybe it's not an interesting workload. But
shifting the uptodate granularity from buffer to page sized _does_
impact the effectiveness of the cache for such an application.

> So in short: the page cache supports _today_ all the optimizations.

For write, perhaps; but for subsequent read, generic_read_page doesn't
see any of the data in the page unless the whole page has been written.

--Stephen
Re: DVD blockdevice buffers
On Wed, May 23, 2001 at 04:40:14PM -0400, Jeff Garzik wrote:

> Linus Torvalds wrote:
> > Now, it may be that the preliminary patches from Andrea do not work this
> > way. I didn't look at them too closely, and I assume that Andrea basically
> > made the block-size be the same as the page size. That's how I would have
> > done it (and then waited for people to find real life cases where we want
> > to allow sector writes).
>
> Due to limitations in low-level drivers, Andrea was forced to hardcode
> 4096 for the block size, instead of using PAGE_SIZE or PAGE_CACHE_SIZE.

Yes. Actually, to trigger the read-modify-write logic no more often than
the current buffercache does, I could simply decrease the softblocksize
of the blkdev pagecache to 1k, like the default granularity of the
current buffercache before any filesystem is mounted, but that would
impose a _very_ significant performance hit on the non-cached case,
which is quite important as well, mainly for a blkdev I think. I
measured that on high-end disks, reading (out of cache) with a 4k
buffercache blocksize instead of a 1k buffercache blocksize is an exact
2x improvement, because at that speed the bottleneck becomes the work
that has to be done by the cpu.

In fact rawio via /dev/raw* is also 2 times slower than the 2.4 4k
buffercache on blkdev in those environments (of course with rawio the
cpu is not used much compared to the buffered I/O), and that's one of
the reasons I also imposed a 4k granularity on the direct I/O from
open("/dev/hda", O_DIRECT|O_RDWR). I haven't benchmarked it yet, but I
suspect that doing rawio with forced 4k bh (as opposed to the 512-byte
bh of /dev/raw*) will make O_DIRECT on the blkdev much faster than the
buffered I/O on the blkdev through the pagecache, just as O_DIRECT
scored the 170MByte/sec of very scalable I/O recently - I think also
because it was done through ext2, which imposed a 4k softblocksize:

	http://boudicca.tux.org/hypermail/linux-kernel/2001week17/1175.html
	http://boudicca.tux.org/hypermail/linux-kernel/2001week17/att-1175/01-directio.png

(boudicca.tux.org is not online at the moment but I assume it will
return online soon)

However this is still flexible; right now my first objective is to solve
the showstoppers (so that, for example, I can run my machine with the
patch applied) and then we can think about how to solve the
4k/1k/512-byte softblocksize issues, possibly automatically or
selectable from userspace. I will try to work on the blkdev patch
tomorrow to bring it in a usable state.

Andrea
Re: DVD blockdevice buffers
On Wed, May 23, 2001 at 06:13:13PM -0400, Alexander Viro wrote:

> Uh-oh... After you solved what?

The superblock is pinned by the kernel in the buffercache while you fsck
a ro-mounted ext2, so I must somehow bring this superblock uptodate in
the buffercache before collecting away the pagecache containing more
recent info from fsck. It's all done lazily; I just thought not to break
the assumption that an idling buffercache will never become not-uptodate
under you at any time, because it seems not too painful to implement
compared to changing the fs: it puts the check in a slow path and it
doesn't break the API with the buffercache (so I don't need to change
all the filesystems to check if the superblock is still uptodate before
marking it dirty).

Andrea
Re: DVD blockdevice buffers
On Wed, May 23, 2001 at 01:01:56PM -0700, Linus Torvalds wrote:

> [..] I assume that Andrea basically
> made the block-size be the same as the page size. That's how I would have

Exactly (the softblocksize is fixed at 4k, regardless of the page cache
size, to avoid confusing device drivers).

> done it (and then waited for people to find real life cases where we want
> to allow sector writes).

Correct, the partial write logic is kind of disabled on x86 because the
artificial softblocksize of the blkdev pagecache matches the
pagecachesize, but it should just work on the other archs. Now I could
try to make the bh more granular for partial writes in a dynamic manner
(so we don't pay the overhead of the 512-byte bh in the common case),
but I think this would need its own additional logic and I prefer to
think about it after I have solved the coherency issues between pinned
buffer cache and filesystem - so after the showstoppers are solved and
the patch is usable in real life (possibly with the overhead of
read-modify-write for some workloads doing small random write I/O). An
easy short-term fix for removing the read-modify-write would be to use
the hardblocksize of the underlying device as the softblocksize, but
again that would cause us to pay for the 512-byte bhs, which I don't
like to... ;)

Andrea
Re: DVD blockdevice buffers
On Thu, 24 May 2001, Andrea Arcangeli wrote:

> prefer to think about it after I solved the coherency issues between
> pinned buffer cache and filesystem, so after the showstoppers are solved

Uh-oh... After you solved what?
Re: DVD blockdevice buffers
Linus Torvalds wrote:

> Now, it may be that the preliminary patches from Andrea do not work this
> way. I didn't look at them too closely, and I assume that Andrea basically
> made the block-size be the same as the page size. That's how I would have
> done it (and then waited for people to find real life cases where we want
> to allow sector writes).

Due to limitations in low-level drivers, Andrea was forced to hardcode
4096 for the block size, instead of using PAGE_SIZE or PAGE_CACHE_SIZE.

-- 
Jeff Garzik      | "Are you the police?"
Building 1024    | "No, ma'am. We're musicians."
MandrakeSoft     |
Re: DVD blockdevice buffers
On Wed, 23 May 2001, Stephen C. Tweedie wrote:

> > that the filesystems already do. And you can do it a lot _better_ than the
> > current buffer-cache-based approach. Done right, you can actually do all
> > IO in page-sized chunks, BUT fall down on sector-sized things for the
> > cases where you want to.
>
> Right, but you still lose the caching in that case. The write works,
> but the "cache" becomes nothing more than a buffer.

No. It is still cached. You find the buffer with "page->buffer", and
when all of them are up-to-date (whether from read-in or from having
written to them all), you just mark the whole page up-to-date.

This _works_. Try it on ext2 or NFS today.

Now, it may be that the preliminary patches from Andrea do not work this
way. I didn't look at them too closely, and I assume that Andrea
basically made the block-size be the same as the page size. That's how I
would have done it (and then waited for people to find real life cases
where we want to allow sector writes).

So in short: the page cache supports _today_ all the optimizations. In
fact, you can, on NFS, do 4096 one-byte writes, and they will be (a)
coalesced into one write over the wire, and (b) cached in the page, with
the page marked up-to-date.

Linus
Re: DVD blockdevice buffers
Hi,

On Wed, May 23, 2001 at 11:12:00AM -0700, Linus Torvalds wrote:

> On Wed, 23 May 2001, Stephen C. Tweedie wrote:
>
> No, you can actually do all the "prepare_write()"/"commit_write()" stuff
> that the filesystems already do. And you can do it a lot _better_ than the
> current buffer-cache-based approach. Done right, you can actually do all
> IO in page-sized chunks, BUT fall down on sector-sized things for the
> cases where you want to.

Right, but you still lose the caching in that case. The write works, but
the "cache" becomes nothing more than a buffer.

This actually came up recently after the first posting of the
bdev-on-pagecache patches, when somebody was getting lousy database
performance for an application I think they had developed from scratch
--- it was using 512-byte blocks as the basic write alignment and was
relying on the kernel caching that. In fact, in that case even our old
buffer cache was failing due to the default blocksize of 1024 bytes, and
he had had to add an ioctl to force the blocksize to 512 bytes before
the application would perform at all well on Linux.

So we do have at least one real-world example which will fail if we
increase the IO granularity. We may well decide that the pain is worth
it, but the page cache really cannot deal properly with this right now
without having an uptodate labelling at finer granularity than the page
(which would be unnecessary ugliness in most cases).

--Stephen
Re: DVD blockdevice buffers
On Wed, 23 May 2001, Stephen C. Tweedie wrote:

> Right. I'd like to see buffered IO able to work well --- apart from
> the VM issues, it's the easiest way to allow the application to take
> advantage of readahead. However, there's one sticking point we
> encountered, which is applications which write to block devices in
> units smaller than a page. Small block writes get magically
> transformed into read/modify/write cycles if you shift the block
> devices into the page cache.

No, you can actually do all the "prepare_write()"/"commit_write()" stuff
that the filesystems already do. And you can do it a lot _better_ than
the current buffer-cache-based approach. Done right, you can actually do
all IO in page-sized chunks, BUT fall down on sector-sized things for
the cases where you want to.

This is exactly the same issue that filesystems had with writers of less
than a page - and the page cache interfaces allow for byte-granular
writes (as actually shown by things like NFS, which do exactly that. For
a block device, the granularity obviously tends to be at least 512
bytes).

> Of course, we could just say "then don't do that" and be done with it
> --- after all, we already have this behaviour when writing to regular
> files.

No, we really don't. When you write an aligned 1kB block to a 1kB ext2
filesystem, it will _not_ do a page-sized read-modify-write. It will
just create the proper 1kB buffers, and mark one of them dirty.

Now, admittedly it is _easier_ to just always consider things 4kB in
size. And faster too, for the common cases. So it might not be worth it
to do the extra work unless somebody can show a good reason for it.

Linus
Re: DVD blockdevice buffers
Hi,

On Sat, May 19, 2001 at 07:36:07PM -0700, Linus Torvalds wrote:

> Right now we don't try to aggressively drop streaming pages, but it's
> possible. Using raw devices is a silly work-around that should not be
> needed, and this load shows a real problem in current Linux (one soon
> to be fixed, I think - Andrea already has some experimental patches for
> the page-cache thing).

Right. I'd like to see buffered IO able to work well --- apart from the
VM issues, it's the easiest way to allow the application to take
advantage of readahead. However, there's one sticking point we
encountered, which is applications which write to block devices in units
smaller than a page. Small block writes get magically transformed into
read/modify/write cycles if you shift the block devices into the page
cache.

Of course, we could just say "then don't do that" and be done with it
--- after all, we already have this behaviour when writing to regular
files.

--Stephen
Re: DVD blockdevice buffers
Hi, On Sat, May 19, 2001 at 07:36:07PM -0700, Linus Torvalds wrote: Right now we don't try to aggressively drop streaming pages, but it's possible. Using raw devices is a silly work-around that should not be needed, and this load shows a real problem in current Linux (one soon to be fixed, I think - Andrea already has some experimental patches for the page-cache thing). Right. I'd like to see buffered IO able to work well --- apart from the VM issues, it's the easiest way to allow the application to take advantage of readahead. However, there's one sticking point we encountered, which is applications which write to block devices in units smaller than a page. Small block writes get magically transformed into read/modify/write cycles if you shift the block devices into the page cache. Of course, we could just say then don't do that and be done with it --- after all, we already have this behaviour when writing to regular files. --Stephen - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: DVD blockdevice buffers
On Wed, 23 May 2001, Stephen C. Tweedie wrote: Right. I'd like to see buffered IO able to work well --- apart from the VM issues, it's the easiest way to allow the application to take advantage of readahead. However, there's one sticking point we encountered, which is applications which write to block devices in units smaller than a page. Small block writes get magically transformed into read/modify/write cycles if you shift the block devices into the page cache. No, you can actually do all the prepare_write()/commit_write() stuff that the filesystems already do. And you can do it a lot _better_ than the current buffer-cache-based approach. Done right, you can actually do all IO in page-sized chunks, BUT fall down on sector-sized things for the cases where you want to. This is exactly the same issue that filesystems had with writers of less than a page - and the page cache interfaces allow for byte-granular writes (as actually shown by things like NFS, which do exactly that. For a block device, the granularity obviously tends to be at least 512 bytes). Of course, we could just say then don't do that and be done with it --- after all, we already have this behaviour when writing to regular files. No, we really don't. When you write an aligned 1kB block to a 1kB ext2 filesystem, it will _not_ do a page-sized read-modify-write. It will just create the proper 1kB buffers, and mark one of the dirty. Now, admittedly it is _easier_ to just always consider things 4kB in size. And faster too, for the common cases. So it might not be worth it to do the extra work unless somebody can show a good reason for it. Linus - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: DVD blockdevice buffers
Hi, On Wed, May 23, 2001 at 11:12:00AM -0700, Linus Torvalds wrote: On Wed, 23 May 2001, Stephen C. Tweedie wrote: No, you can actually do all the prepare_write()/commit_write() stuff that the filesystems already do. And you can do it a lot _better_ than the current buffer-cache-based approach. Done right, you can actually do all IO in page-sized chunks, BUT fall down on sector-sized things for the cases where you want to. Right, but you still lose the caching in that case. The write works, but the cache becomes nothing more than a buffer. This actually came up recently after the first posting of the bdev-on-pagecache patches, when somebody was getting lousy database performance for an application I think they had developed from scratch --- it was using 512-byte blocks as the basic write alignment and was relying on the kernel caching that. In fact, in that case even our old buffer cache was failing due to the default blocksize of 1024 bytes, and he had had to add an ioctl to force the blocksize to 512 bytes before the application would perform at all well on Linux. So we do have at least one real-world example which will fail if we increase the IO granularity. We may well decide that the pain is worth it, but the page cache really cannot deal properly with this right now without having an uptodate labeling at finer granularity than the page (which would be unnecessary ugliness in most cases). --Stephen - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: DVD blockdevice buffers
On Wed, 23 May 2001, Stephen C. Tweedie wrote: that the filesystems already do. And you can do it a lot _better_ than the current buffer-cache-based approach. Done right, you can actually do all IO in page-sized chunks, BUT fall down on sector-sized things for the cases where you want to. Right, but you still lose the caching in that case. The write works, but the cache becomes nothing more than a buffer. No. It is still cached. You find the buffer with page-buffer, and when all of them are up-to-date (whether from read-in or from having written to them all), you just mark the whole page up-to-date. This _works_. Try it on ext2 or NFS today. Now, it may be that the preliminary patches from Andrea do not work this way. I didn't look at them too closely, and I assume that Andrea basically made the block-size be the same as the page size. That's how I would have done it (and then waited for people to find real life cases where we want to allow sector writes). So in short: the page cache supports _today_ all the optimizations. In fact, you can, on NFS, do 4096 one-byte writes, and they will be (a) coalesced into one write over the wire, and (b) will be cached in the page and the page marked up-to-date. Linus - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: DVD blockdevice buffers
Linus Torvalds wrote:
> Now, it may be that the preliminary patches from Andrea do not work this
> way. I didn't look at them too closely, and I assume that Andrea
> basically made the block-size be the same as the page size. That's how I
> would have done it (and then waited for people to find real life cases
> where we want to allow sector writes).

Due to limitations in low-level drivers, Andrea was forced to hardcode
4096 for the block size, instead of using PAGE_SIZE or PAGE_CACHE_SIZE.

-- 
Jeff Garzik      | Are you the police?
Building 1024    | No, ma'am. We're musicians.
MandrakeSoft     |
Re: DVD blockdevice buffers
On Thu, 24 May 2001, Andrea Arcangeli wrote:
> prefer to think about it after I solved the coherency issues between
> pinned buffer cache and filesystem, so after the showstoppers are solved

Uh-oh... After you solved what?
Re: DVD blockdevice buffers
On Wed, May 23, 2001 at 01:01:56PM -0700, Linus Torvalds wrote:
> [..] I assume that Andrea basically
> made the block-size be the same as the page size. That's how I would have

Exactly (the softblocksize is fixed at 4k, regardless of the page cache
size, to avoid confusing device drivers).

> done it (and then waited for people to find real life cases where we
> want to allow sector writes).

Correct, the partial-write logic is in effect disabled on x86 because
the artificial softblocksize of the blkdev pagecache matches the
pagecache size, but it should just work on the other archs. Now I could
try to make the bh more granular for partial writes in a dynamic manner
(so we don't pay the overhead of the 512-byte bhs in the common case),
but I think that would need its own additional logic, and I prefer to
think about it after I have solved the coherency issues between the
pinned buffer cache and the filesystem, so after the showstoppers are
solved and the patch is usable in real life (possibly with the overhead
of read-modify-write for some workloads doing small random write I/O).
An easy short-term fix for removing the read-modify-write would be to
use the hardblocksize of the underlying device as the softblocksize, but
again that would make us pay for the 512-byte bhs, which I'd rather
not... ;)

Andrea
Re: DVD blockdevice buffers
On Wed, May 23, 2001 at 06:13:13PM -0400, Alexander Viro wrote:
> Uh-oh... After you solved what?

The superblock is pinned by the kernel in the buffercache while you fsck
a ro-mounted ext2, so I must somehow bring this superblock in the
buffercache up to date before collecting away the pagecache containing
the more recent info from fsck. It's all done lazily; I just thought not
to break the assumption that an idling buffercache will never become
not-uptodate under you, because it seems not too painful to implement
compared to changing the fs: it puts the check in a slow path and it
doesn't break the API with the buffercache (so I don't need to change
all the filesystems to check whether the superblock is still uptodate
before marking it dirty).

Andrea
Re: DVD blockdevice buffers
On Wed, May 23, 2001 at 04:40:14PM -0400, Jeff Garzik wrote:
> Linus Torvalds wrote:
> > Now, it may be that the preliminary patches from Andrea do not work
> > this way. I didn't look at them too closely, and I assume that Andrea
> > basically made the block-size be the same as the page size. That's how
> > I would have done it (and then waited for people to find real life
> > cases where we want to allow sector writes).
>
> Due to limitations in low-level drivers, Andrea was forced to hardcode
> 4096 for the block size, instead of using PAGE_SIZE or PAGE_CACHE_SIZE.

Yes. Actually, to trigger the read-modify-write logic no more often than
the current buffercache does, I could simply decrease the softblocksize
of the blkdev pagecache to 1k, like the default granularity of the
current buffercache before any filesystem is mounted, but that would
impose a _very_ significant performance hit on the non-cached case,
which is quite important as well, mainly for a blkdev I think. I
measured that on high-end disks, reading (out of cache) with a 4k
buffercache blocksize instead of a 1k buffercache blocksize is an exact
2x improvement, because at that speed the bottleneck becomes the work
that has to be done by the cpu.

In fact rawio on /dev/raw* is likewise 2 times slower than the 2.4 4k
buffercache on the blkdev in those environments (of course with rawio
the cpu is not used much compared to the buffered I/O), and that's one
of the reasons I also imposed a 4k granularity on the direct I/O from
open("/dev/hda", O_DIRECT|O_RDWR). I haven't benchmarked it yet, but I
suspect that doing rawio with forced 4k bhs (as opposed to the 512-byte
bhs of /dev/raw*) will make O_DIRECT on the blkdev much faster than the
buffered I/O on the blkdev through the pagecache, just as O_DIRECT
recently scored 170MByte/sec of very scalable I/O, I think also because
it was done through ext2, which imposed a 4k softblocksize:

http://boudicca.tux.org/hypermail/linux-kernel/2001week17/1175.html
http://boudicca.tux.org/hypermail/linux-kernel/2001week17/att-1175/01-directio.png

(boudicca.tux.org is not online at the moment but I assume it will
return online soon)

However this is still flexible; right now my first objective is to solve
the showstoppers (so that, for example, I can run my machine with the
patch applied), and then we can think about how to solve the
4k/1k/512-byte softblocksize issues, possibly automatically or
selectable from userspace. I will try to work on the blkdev patch
tomorrow to bring it into a usable state.

Andrea
Re: DVD blockdevice buffers
On Mon, May 21 2001, Adam Schrotenboer wrote:
> On Sun, 20 May 2001, Jens Axboe wrote:
> > On Sat, May 19 2001, Adam Schrotenboer wrote:
> > > /dev/raw* Where? I can't find it in my .config (grep RAW .config).
> > > I am using 2.4.4-ac11 and playing w/ 2.4.5-pre3.
> >
> > It's automagically included, no config options necessary
> > (drivers/char/raw.c)
>
> Then where is /dev/raw* ? I'm using devfs, if that helps any.

Apparently raw doesn't set itself up using the devfs_register functions;
someone needs to fix that. So either fix it (look at other drivers) or
don't use devfs.

-- 
Jens Axboe
Re: DVD blockdevice buffers
On Sun, 20 May 2001, Jens Axboe wrote:
> On Sat, May 19 2001, Adam Schrotenboer wrote:
> > /dev/raw* Where? I can't find it in my .config (grep RAW .config).
> > I am using 2.4.4-ac11 and playing w/ 2.4.5-pre3.
>
> It's automagically included, no config options necessary
> (drivers/char/raw.c)

Then where is /dev/raw* ? I'm using devfs, if that helps any.
Re: DVD blockdevice buffers
In article <[EMAIL PROTECTED]>, Jens Axboe <[EMAIL PROTECTED]> wrote:
> >
> > As a result the system performance goes down. I'm still able to use
> > my applications, but as every single piece of unused memory is swapped
> > out, and swapping in costs a certain amount of time.
>
> That's why streaming media applications like a dvd player should use raw
> I/O -- to bypass system cache. See /dev/raw*

I disagree.

The fact is that the block device fs infrastructure is just sadly
broken. By using the buffer cache, it makes memory management very hard,
and just upgrading to the page cache would (a) speed stuff up and (b)
make it much easier for the kernel to do the right thing wrt the MM use.
Right now we don't try to aggressively drop streaming pages, but it's
possible.

Using raw devices is a silly work-around that should not be needed, and
this load shows a real problem in current Linux (one soon to be fixed, I
think - Andrea already has some experimental patches for the page-cache
thing).

Linus
Re: DVD blockdevice buffers
On Sun, 20 May 2001, Jens Axboe wrote:
> On Sat, May 19 2001, Adam Schrotenboer wrote:
> > /dev/raw* Where? I can't find it in my .config (grep RAW .config).
> > I am using 2.4.4-ac11 and playing w/ 2.4.5-pre3.
>
> It's automagically included, no config options necessary
> (drivers/char/raw.c)

then why can't I find /dev/raw* (I'm using devfs, FWIW)
Re: DVD blockdevice buffers
On Sat, May 19 2001, Adam Schrotenboer wrote:
> /dev/raw* Where? I can't find it in my .config (grep RAW .config).
> I am using 2.4.4-ac11 and playing w/ 2.4.5-pre3.

It's automagically included, no config options necessary
(drivers/char/raw.c)

-- 
Jens Axboe
Re: DVD blockdevice buffers
/dev/raw* Where? I can't find it in my .config (grep RAW .config). I am
using 2.4.4-ac11 and playing w/ 2.4.5-pre3.

TIA
Adam Schrotenboer
Re: DVD blockdevice buffers
On Fri, May 18, 2001 at 09:25:31PM +0200, Jens Axboe wrote:
> On Fri, May 18 2001, Eduard Hasenleithner wrote:
> > I have a problem with the buffering mechanism of my blockdevice,
> > namely a ide_scsi DVD-ROM drive. After inserting a DVD and reading
> > data linearly from the DVD, an excessive amount of buffer memory gets
> > allocated.
> >
> > This can easily be reproduced with
> > cat /dev/sr0 > /dev/null
> >
> > Remember, nearly the same task is carried out when playing a DVD.
> >
> > As a result the system performance goes down. I'm still able to use
> > my applications, but as every single piece of unused memory is swapped
> > out, swapping it back in costs a certain amount of time.
>
> That's why streaming media applications like a dvd player should use raw
> I/O -- to bypass system cache. See /dev/raw*

Oh, thank you. That was very fast!

I use xine. To be honest, the procedure for creating a raw device is
described in their FAQ. But it is not described what the raw device
does, only that it provides a speed improvement. Until today, I didn't
know what rawio actually does. Strange that I didn't come across any
information about it. Was there an official announcement of the
availability of this feature? Is there more detailed information about
rawio anywhere?

-- 
Eduard Hasenleithner
student of Salzburg University of Applied Sciences and Technologies
Re: DVD blockdevice buffers
On Fri, May 18 2001, Eduard Hasenleithner wrote:
> I have a problem with the buffering mechanism of my blockdevice,
> namely a ide_scsi DVD-ROM drive. After inserting a DVD and reading
> data linearly from the DVD, an excessive amount of buffer memory gets
> allocated.
>
> This can easily be reproduced with
> cat /dev/sr0 > /dev/null
>
> Remember, nearly the same task is carried out when playing a DVD.
>
> As a result the system performance goes down. I'm still able to use
> my applications, but as every single piece of unused memory is swapped
> out, swapping it back in costs a certain amount of time.

That's why streaming media applications like a dvd player should use raw
I/O -- to bypass system cache. See /dev/raw*

-- 
Jens Axboe
DVD blockdevice buffers
I have a problem with the buffering mechanism of my blockdevice, namely
a ide_scsi DVD-ROM drive. After inserting a DVD and reading data
linearly from the DVD, an excessive amount of buffer memory gets
allocated.

This can easily be reproduced with

cat /dev/sr0 > /dev/null

Remember, nearly the same task is carried out when playing a DVD.

As a result the system performance goes down. I'm still able to use my
applications, but as every single piece of unused memory is swapped out,
swapping it back in costs a certain amount of time.

So, what went wrong? I tried to find some information on this with
google and geocrawler, but I didn't have any success :(

Kernel: linux-2.4.4

hoping for some tips ...

-- 
Eduard Hasenleithner
student of Salzburg University of Applied Sciences and Technologies