Re: [RFC] Generic deferred file writing
On Saturday, December 30, 2000 06:28:39 PM -0800 Linus Torvalds <[EMAIL PROTECTED]> wrote:

> There are only two real advantages to deferred writing:
>
>  - not having to do get_block() at all for temp-files, as we never have to
>    do the allocation if we end up removing the file.
>
>    NOTE NOTE NOTE! The overhead for trying to get ENOSPC and quota errors
>    right is quite possibly big enough that this advantage is possibly very
>    questionable. It's very possible that people could speed things up
>    using this approach, but I also suspect that it is equally (if not
>    more) possible to speed things up by just making sure that the
>    low-level FS has a fast get_block().
>
>  - Using "global" access patterns to do a better job of "get_block()", ie
>    taking advantage of issues with journalling etc and deferring the write
>    in order to get a bigger journal.
>
> The second point is completely different, and THIS is where I think there
> are potentially real advantages.

Absolutely. I wrote reiserfs delayed allocation code back in October, and left it alone until the VM had the callbacks needed to make it clean (err, less ugly). I included a bunch of optimizations to reiserfs_get_block, and the most effective one was a cache of block pointers in the inode to avoid consecutive tree searches. That was a locking and an I/O win, for both reading and writing (reiserfs needs this more than ext2 does). For growing a file, delayed allocation was a huge bonus.
For all the reasons you've already discussed, and because writing a file went from this (reiserfs_get_block is starting/stopping the transaction):

    while (bytes_to_write)
        start_transaction
        allocate block
        insert block pointer
        end_transaction
    end

to this:

    while (bytes_to_write)
        update counters
    end

    (delayed alloc routine is starting/stopping the transaction)
    start_transaction
    allocate X blocks
    insert X block pointers
    update counters
    end_transaction

A big fat transaction is a happy one ;-) Anyway, I'll return to the optimizations once things have settled down a bit, and might give generic delayed allocation (instead of reiserfs-only code) a try.

-chris

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] Generic deferred file writing
Hi,

On Tue, 2 Jan 2001, Alexander Viro wrote:

> Depends on the filesystem. Generally you don't want asynchronous operations
> to grab semaphores shared with something else. kswapd knows to skip locked
> pages, but that's it - if writepage() is going to block on a semaphore you
> will not know what hit you. And while buffer-cache operations will not
> trigger writepage() (grep for GFP_BUFFER and GFP_IO and you'll see), you
> have no such guarantees for other sources of memory pressure. If one of
> them hits while you are holding such a semaphore - you are toast.

I just checked that and you're right; sorry for causing confusion, and thanks for clearing this up.

> We probably could pull it off for ext2_truncate() vs. ext2_get_block(),
> but it would not do us any good. It would give excessive exclusion for
> operations that can be done in parallel just fine (example: we have
> a hole from 100Kb to 200Kb. Pageouts in that area can trivially be
> done in parallel - the current code will not even try to do unrolls. With
> your locking they would be serialized for no good reason). What for?

Let me come back to the three phases I mentioned earlier:

alloc_block: does only a read-only check of whether a block needs to be allocated or not; this can be done in parallel and only needs the page lock.

get_block: blocks are now really allocated, and this needs locking of the bitmap.

commit_block: write the allocated blocks to the inode; this would now use an inode-specific semaphore to protect the updates of indirect blocks.

The only problem I see is truncate, but if we move the release of unneeded indirect blocks to file_close, only new indirect blocks can appear while the file is open, and they won't change anymore, which would make lots of the checks easier.

bye, Roman
Re: [RFC] Generic deferred file writing
On Tue, 2 Jan 2001, Roman Zippel wrote:

> Block allocation is not my problem right now (and even directory handling
> is not that difficult), but I will post something about this on fsdevel
> later. But one question is still open that I'd really like an answer for:
> is it possible to use a per-inode indirect-block semaphore?

Depends on the filesystem. Generally you don't want asynchronous operations to grab semaphores shared with something else. kswapd knows to skip locked pages, but that's it - if writepage() is going to block on a semaphore you will not know what hit you. And while buffer-cache operations will not trigger writepage() (grep for GFP_BUFFER and GFP_IO and you'll see), you have no such guarantees for other sources of memory pressure. If one of them hits while you are holding such a semaphore - you are toast.

We probably could pull it off for ext2_truncate() vs. ext2_get_block(), but it would not do us any good. It would give excessive exclusion for operations that can be done in parallel just fine (example: we have a hole from 100Kb to 200Kb. Pageouts in that area can trivially be done in parallel - the current code will not even try to do unrolls. With your locking they would be serialized for no good reason). What for?
Re: [RFC] Generic deferred file writing
Hi,

On Mon, 1 Jan 2001, Alexander Viro wrote:

> But... But with AFFS you _have_ exclusion between block allocation and
> truncate(). It has no sparse files, so pageout will never allocate
> anything. I.e. all allocations come from write(2). And both write(2) and
> truncate(2) hold i_sem.
>
> The problem with AFFS is on the directory side of that business, and there
> it's really scary. Block allocation is trivial...

Block allocation is not my problem right now (and even directory handling is not that difficult), but I will post something about this on fsdevel later. One question is still open, though, that I'd really like an answer for: is it possible to use a per-inode indirect-block semaphore?

bye, Roman
Re: [RFC] Generic deferred file writing
Alexander Viro wrote:

> GFP_BUFFER _may_ become an issue if we move bitmaps into pagecache.
> Then we'll need a per-address_space gfp_mask. Again, patches exist
> and had been tested (not right now - I didn't port them to the current
> tree yet). Bugger if I remember whether they were posted or not - they've
> definitely been mentioned on linux-mm, but IIRC I had sent the
> modifications of VM code only to Jens. I can repost them.

Please, and I'll ask Rik to post them on the kernelnewbies.org patch page. (Rik?)

Putting bitmaps and group descriptors into the page cache will allow the current ad hoc bitmap and groupdesc caching code to be deleted - the for-real cache should have better LRU behaviour, be more efficient to access, not need special locking, and won't have an arbitrary limit on the number of cached bitmaps. About 300 lines of spaghetti gone. I suppose the group descriptor pages still need to be locked in memory so we can address them through a table instead of searching the hash.

OK, this must be what you meant by a 'proper' fix for the ialloc group desc bug I posted last month, which by the way is *still* there. How about applying my patch in the interim? It's a real bug, it just doesn't trigger often.

> Some pieces of balloc.c cleanup had been posted on fsdevel. Check the
> archives. They prepare the ground for killing lock_super() contention
> on ext2_new_inode(), but that part wasn't there back then.
>
> I will start the -bird (aka FS-CURRENT) branch as soon as Linus opens 2.4.
> Hopefully by the time of 2.5 it will be tested well enough. Right now
> it exists as a large patchset against a more or less recent -test and
> I'm waiting for a slowdown of the changes in the main tree to put them
> all together.

It would be awfully nice to have those patches available via ftp. Web-based mail archives don't cut it because you generally can't get the patches out intact - the tabs get expanded and other noise inserted.

-- Daniel
Re: [RFC] Generic deferred file writing
On Mon, 1 Jan 2001, Roman Zippel wrote:

> The other reason for the question is that I'm currently reworking the
> block handling in affs, especially the extended block handling, where I'm
> implementing a new extended block cache, which I would pretty much prefer
> to protect with a semaphore. I could probably do it without the semaphore
> and use a spinlock+rechecking, but the semaphore would keep it so much
> simpler. (I can post more details about this part on fsdevel if needed /
> wanted.)

But... But with AFFS you _have_ exclusion between block allocation and truncate(). It has no sparse files, so pageout will never allocate anything. I.e. all allocations come from write(2). And both write(2) and truncate(2) hold i_sem.

The problem with AFFS is on the directory side of that business, and there it's really scary. Block allocation is trivial...
Re: [RFC] Generic deferred file writing
Hi,

On Sun, 31 Dec 2000, Alexander Viro wrote:

> Reread the original thread. GFP_BUFFER protects us from buffer cache
> allocations triggering pageouts. It has nothing to do with the deadlock
> scenario that would come from grabbing ->i_sem on pageout.

I don't want to grab i_sem. It was a very, very early idea... :)

> Sheesh... "Complexity" of ext2_get_block() (down to the ext2_new_block()
> calls) is really, really not a problem. Normally it just gives you the
> straightforward path. All unrolls are for contention cases and they
> are precisely what we have to do there.

Maybe complexity is the wrong word; of course the logic in there is straightforward (once one has understood it :) ). Let me ask it differently, and it's now only about indirect block handling: is it possible to use a per-inode indirect-block semaphore?

The reason for the question is that I maybe see a different sort of contention here - livelocks. I don't mind the getting of resources and the rechecking of whether everything went well. The problem is how many resources you need to get (and to release, if something failed). Somewhere there is always a point where two threads can't make any progress, or one thread can stall the progress of a second. To come back to ext2_get_block: IMO such a scenario could happen in the double or triple indirect block case, when two or more threads try to allocate/truncate a block there. Maybe my concerns are baseless, but I'd just like to know that there isn't a possibility for a DoS attack here. (BTW, that's what I mean by complexity: it's less the logical complexity, it's more the "runtime complexity".)

The other reason for the question is that I'm currently reworking the block handling in affs, especially the extended block handling, where I'm implementing a new extended block cache, which I would pretty much prefer to protect with a semaphore. I could probably do it without the semaphore and use a spinlock+rechecking, but the semaphore would keep it so much simpler. (I can post more details about this part on fsdevel if needed / wanted.)

bye, Roman
Re: [RFC] Generic deferred file writing
On Mon, 1 Jan 2001, Roman Zippel wrote:

> I just rechecked that, but I don't see a superblock lock here; it uses
> the kernel_lock instead. Although Al could give the definitive answer for
> this, he wrote it. :)

No superblock lock in get_block() proper. Tons of it in the dungheap called balloc.c. _That's_ where the bottleneck is. BTW, even the BKL is easily removable from get_block() - check the /* Reader: */ and /* Writer: */ comments, they mark the places to put a spinlock in.

> > The way the Linux MM works, if the lower levels need to do buffer
> > allocations, they will use GFP_BUFFER (which "bread()" does internally),
> > which will mean that the VM layer will _not_ call down recursively to
> > write stuff out while it's already trying to write something else. This
> > is exactly so that filesystems don't have to release and re-try if they
> > don't want to.
> >
> > In short, I don't see any of your arguments.
>
> Then I must have misunderstood Al. Al? If you were right here, I would
> see absolutely no reason for the current complexity. (Me is a bit
> confused here.)

Reread the original thread. GFP_BUFFER protects us from buffer cache allocations triggering pageouts. It has nothing to do with the deadlock scenario that would come from grabbing ->i_sem on pageout.

Sheesh... "Complexity" of ext2_get_block() (down to the ext2_new_block() calls) is really, really not a problem. Normally it just gives you the straightforward path. All unrolls are for contention cases and they are precisely what we have to do there.

Again, the scalability problems are in the block allocator, not in the indirect block handling. It's completely independent from get_block(). We overlock. Big way. And the structures we are protecting excessively are:

 * cylinder group descriptors
 * block bitmaps
 * (to a lesser extent) inode bitmaps and the inode table.

That's what needs to be fixed. It has nothing to do with the VFS or VM - it's purely internal ext2 business.

Another ext2 issue is reducing the buffer cache pressure - mostly by moving the directories into the page cache. I've posted such patches on fsdevel and they are applied to the kernel I'm running here. Works like a charm and allows the rest of the metadata to stay in cache longer.

GFP_BUFFER _may_ become an issue if we move bitmaps into pagecache. Then we'll need a per-address_space gfp_mask. Again, patches exist and had been tested (not right now - I didn't port them to the current tree yet). Bugger if I remember whether they were posted or not - they've definitely been mentioned on linux-mm, but IIRC I had sent the modifications of VM code only to Jens. I can repost them.

Some pieces of balloc.c cleanup had been posted on fsdevel. Check the archives. They prepare the ground for killing lock_super() contention on ext2_new_inode(), but that part wasn't there back then.

I will start the -bird (aka FS-CURRENT) branch as soon as Linus opens 2.4. Hopefully by the time of 2.5 it will be tested well enough. Right now it exists as a large patchset against a more or less recent -test and I'm waiting for a slowdown of the changes in the main tree to put them all together.

Cheers, Al
Re: [RFC] Generic deferred file writing
Hi,

On Sun, 31 Dec 2000, Linus Torvalds wrote:

>     cached_allocation = NULL;
>
> repeat:
>     spin_lock();
>     result = try_to_find_existing();
>     if (!result) {
>         if (!cached_allocation) {
>             spin_unlock();
>             cached_allocation = allocate_block();
>             goto repeat;
>         }
>         result = cached_allocation;
>         add_to_datastructures(result);
>     }
>     spin_unlock();
>     return result;
>
> This is quite standard, and Linux does it in many places. It doesn't have
> to be complex or ugly.

No problem with that.

> Also, I don't see why you claim the current get_block() is recursive and
> hard to use: it obviously isn't. If you look at the current ext2
> get_block(), the way it protects most of its data structures is by the
> super-block-global lock. That wouldn't work if your claims of recursive
> invocation were true.

I just rechecked that, but I don't see a superblock lock here; it uses the kernel_lock instead. Although Al could give the definitive answer for this, he wrote it. :)

> The way the Linux MM works, if the lower levels need to do buffer
> allocations, they will use GFP_BUFFER (which "bread()" does internally),
> which will mean that the VM layer will _not_ call down recursively to
> write stuff out while it's already trying to write something else. This is
> exactly so that filesystems don't have to release and re-try if they don't
> want to.
>
> In short, I don't see any of your arguments.

Then I must have misunderstood Al. Al? If you were right here, I would see absolutely no reason for the current complexity. (Me is a bit confused here.)

bye, Roman
Re: [RFC] Generic deferred file writing
On Sun, 31 Dec 2000, Roman Zippel wrote:

> On Sun, 31 Dec 2000, Linus Torvalds wrote:
>
> > Let me repeat myself one more time:
> >
> > I do not believe that "get_block()" is as big of a problem as people
> > make it out to be.
>
> The real problem is that get_block() doesn't scale and it's very hard to
> make it do so. A recursive per-inode semaphore might help, but it's still
> a pain to get it right.

Not true. There's nothing unscalable in get_block() per se. The only lock we hold is the per-page lock, which we must hold anyway. get_block() itself does not need any real locking: you can do it with a simple per-inode spinlock if you want to (and release the spinlock and try again if you need to fetch non-cached data blocks).

Sure, the current ext2 _implementation_ sucks. Nobody has ever contested that. Stop re-designing something just because you want to.

		Linus
Re: [RFC] Generic deferred file writing
Hi,

On Sun, 31 Dec 2000, Linus Torvalds wrote:

> Let me repeat myself one more time:
>
> I do not believe that "get_block()" is as big of a problem as people make
> it out to be.

The real problem is that get_block() doesn't scale and it's very hard to make it do so. A recursive per-inode semaphore might help, but it's still a pain to get it right.

bye, Roman
Re: [RFC] Generic deferred file writing
On Sun, 31 Dec 2000, Daniel Phillips wrote:
> Linus Torvalds wrote:
> > I do not believe that "get_block()" is as big of a problem as people make
> > it out to be.
>
> I didn't mention get_block - disk accesses obviously far outweigh
> filesystem cpu/cache usage in overall impact. The question is, what
> happens to disk access patterns when we do the deferred allocation.

Note that deferred allocation is only possible with a full page write. Go
and do statistics on a system of how often this happens, and what the
circumstances are. Just for fun.

I will bet you $5 USD that 99.9% of all such writes are to new files, at
the end of the file. I'm sure you can come up with other usage patterns,
but they'll be special (big databases etc, and I bet that they'll want to
have stuff cached all the time anyway for other reasons). So I seriously
doubt that you'll have much of an IO component to the writing anyway -
except for the "normal" deferred write of actually writing the stuff out
at all.

Now, this is where I agree with you, but I disagree with where most of the
discussion has been going: I _do_ believe that we may want to change block
allocation decisions at write-out time. That makes sense to me. But that
doesn't really impact "ENOSPC" - the write would not really be "deferred"
by the VM layer, and the filesystem would always be aware of the writer
synchronously.

> > One form of deferred writes I _do_ like is the mount-time-option form.
> > Because that one doesn't add complexity. Kind of like the "noatime" mount
> > option - it can be worth it under some circumstances, and sometimes it's
> > acceptable to not get 100% unix semantics - at which point deferred writes
> > have none of the disadvantages of trying to be clever.
>
> And the added attraction of requiring almost no effort.

Did I mention my belief in the true meaning of "intelligence"?
"Intelligence is the ability to avoid doing work, yet get the work done."
Lazy programmers are the best programmers. Think Tom Sawyer painting the
fence. That's intelligence. Requiring almost no effort is a big plus in my
book.

It's the "clever" programmer I'm afraid of. The one who isn't afraid of
generating complexity, because he has a Plan (capital "P"), and he knows
he can work out the details later.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
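[Editor's note] Linus's precondition above (deferred allocation is only possible for a full page write) amounts to a simple alignment check. The sketch below is an illustrative userspace predicate, not kernel code: the constant `PAGE_SIZE_SIM` and the function name are invented, and end-of-file corner cases are ignored.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE_SIM 4096u  /* illustrative page size */

/* A write can take the deferred-allocation path only when it starts on
 * a page boundary and covers whole pages, so nothing already on disk
 * has to be read back and merged first. */
bool is_full_page_write(uint64_t pos, uint64_t count)
{
    return count > 0
        && (pos % PAGE_SIZE_SIM) == 0
        && (count % PAGE_SIZE_SIM) == 0;
}
```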
Re: [RFC] Generic deferred file writing
Linus Torvalds wrote:
> I do not believe that "get_block()" is as big of a problem as people make
> it out to be.

I didn't mention get_block - disk accesses obviously far outweigh
filesystem cpu/cache usage in overall impact. The question is, what
happens to disk access patterns when we do the deferred allocation.

> One form of deferred writes I _do_ like is the mount-time-option form.
> Because that one doesn't add complexity. Kind of like the "noatime" mount
> option - it can be worth it under some circumstances, and sometimes it's
> acceptable to not get 100% unix semantics - at which point deferred writes
> have none of the disadvantages of trying to be clever.

And the added attraction of requiring almost no effort.

--
Daniel
Re: [RFC] Generic deferred file writing
On Sun, 31 Dec 2000, Daniel Phillips wrote:
>
> It's not that hard or inefficient to return the ENOSPC from the usual
> point. For example, make a gross overestimate of the space needed for
> the write, compare to a cached filesystem free space value less the
> amount deferred so far, and fail to take the optimization if it looks
> even close.

Let me repeat myself one more time: I do not believe that "get_block()"
is as big of a problem as people make it out to be.

And more importantly: I strongly believe that trying to be clever is
detrimental to your health.

The "clever" approach is to add tons of complexity, have various
heuristics to try to not overflow, and then try to debug it considering
that the ENOSPC case is actually rather rare.

The "intelligent" approach is just to say that if get_block() shows up on
the performance profiles, then it should be optimized.

I'd rather be intelligent than clever. Optimize get_block(), which in the
case of ext2 seems to be mostly ext2_new_block() and the balloc.c mess.

The argument that Andrea had is bogus: the common case for writes (and
writes is the only part that deferred writing would touch) is re-writing
the whole file, and the IO to look up the metadata is never an issue for
that case. Everything is basically cached and created on-the-fly. IO is
not the issue; being good about new block allocation _is_ the issue.

Don't get me wrong: I like the notion of deferred writes. But I'm also
very pragmatic: I have not heard a really good argument that makes it
obvious that deferred writes are a major win performance-wise that would
make them worth the complexity.

One form of deferred writes I _do_ like is the mount-time-option form.
Because that one doesn't add complexity. Kind of like the "noatime" mount
option - it can be worth it under some circumstances, and sometimes it's
acceptable to not get 100% unix semantics - at which point deferred
writes have none of the disadvantages of trying to be clever.

Linus
Re: [RFC] Generic deferred file writing
Linus Torvalds wrote:
> There are only two real advantages to deferred writing:
>
>  - not having to do get_block() at all for temp-files, as we never have to
>    do the allocation if we end up removing the file.
>
>    NOTE NOTE NOTE! The overhead for trying to get ENOSPC and quota errors
>    right is quite possibly big enough that this advantage is possibly very
>    questionable. It's very possible that people could speed things up
>    using this approach, but I also suspect that it is equally (if not
>    more) possible to speed things up by just making sure that the
>    low-level FS has a fast get_block().

It's not that hard or inefficient to return the ENOSPC from the usual
point. For example, make a gross overestimate of the space needed for the
write, compare to a cached filesystem free space value less the amount
deferred so far, and fail to take the optimization if it looks even close.

Also, it's not necessarily bad to defer the ENOSPC to file close time.
The same thing can happen with failed disk io, and that's just a fact of
life with asynchronous io. A reliable program needs to be able to deal
with it. (I think there was a long thread on this not long ago, regarding
filesystem errors returned after a program exits.)

>  - Using "global" access patterns to do a better job of "get_block()", ie
>    taking advantage of issues with journalling etc and deferring the write
>    in order to get a bigger journal.

Another nicety is not having to bother the filesystem at all about
sequences of short writes.

> The second point is completely different, and THIS is where I think there
> are potentially real advantages. However, I also think that this is not
> actually about deferred writes at all: it's really a question of the
> filesystem having access to the page when the physical write is actually
> started so that the filesystem might choose to _change_ the allocation it
> did - it might have allocated a backing store block at "get_block()" time,
> but by the time it actually writes the stuff out to disk it may have
> allocated a bigger contiguous area somewhere else for the data..

You'd most likely be able to measure the overhead under load of doing the
allocation twice, due to the occasional need to read back a swapped-out
metadata block. XFS does deferred allocation and the Reiser guys are
talking about it - it seems to be something worth doing. For me the
question is: if the VFS can already do the deferring, is there a good
reason for a filesystem to duplicate that functionality?

--
Daniel
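[Editor's note] Daniel's overestimate scheme can be sketched in userspace C. Everything here is illustrative, not an actual implementation: the struct and function names are invented, and the metadata allowance is arbitrary. It only shows the shape of the check: deliberately overestimate, compare against the cached free-space value less what has already been deferred, and decline the optimization when it looks even close.

```c
#include <stdbool.h>
#include <stdint.h>

struct fs_counters {
    uint64_t cached_free_blocks; /* cached filesystem free-space value */
    uint64_t deferred_blocks;    /* space promised to deferred writes  */
};

/* Gross overestimate of blocks needed: assume every block of the write
 * is new, plus a generous allowance for indirect/metadata blocks. */
uint64_t overestimate_blocks(uint64_t bytes, uint32_t blocksize)
{
    uint64_t data = (bytes + blocksize - 1) / blocksize;
    return data + data / 4 + 4;
}

/* Take the deferred-write optimization only when it clearly cannot hit
 * ENOSPC later; otherwise the caller allocates synchronously, so ENOSPC
 * is returned from the usual point. */
bool may_defer_write(struct fs_counters *c, uint64_t bytes, uint32_t blocksize)
{
    uint64_t need = overestimate_blocks(bytes, blocksize);

    if (c->cached_free_blocks < c->deferred_blocks + need)
        return false;            /* looks even close: don't defer */
    c->deferred_blocks += need;  /* account for the promise       */
    return true;
}
```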
Re: [RFC] Generic deferred file writing
On Sun, 31 Dec 2000, Alexander Viro wrote:
>
> On Sun, 31 Dec 2000, Linus Torvalds wrote:
> > The other thing is that one of the common cases for writing is consecutive
> > writing to the end of the file. Now, you figure it out: if get_block()
> > really is a bottle-neck, why not cache the last tree lookup? You'd get a
> > 99% hitrate for that common case.
>
> Because it is not a bottleneck. The _real_ bottleneck is in
> ext2_new_block(). Try to profile it and you'll see.
>
> We could diddle with ext2_get_block(). No arguments. But the real source
> of PITA is balloc.c, not inode.c. Look at the group descriptor cache code
> and weep. That, and bitmaps handling.

I'm not surprised. Just doing pre-allocation 32 blocks at a time would
probably help. But that code really should be re-written, I think.

Linus
Re: [RFC] Generic deferred file writing
On Sun, 31 Dec 2000, Linus Torvalds wrote:
> The other thing is that one of the common cases for writing is consecutive
> writing to the end of the file. Now, you figure it out: if get_block()
> really is a bottle-neck, why not cache the last tree lookup? You'd get a
> 99% hitrate for that common case.

Because it is not a bottleneck. The _real_ bottleneck is in
ext2_new_block(). Try to profile it and you'll see.

We could diddle with ext2_get_block(). No arguments. But the real source
of PITA is balloc.c, not inode.c. Look at the group descriptor cache code
and weep. That, and bitmaps handling.
Re: [RFC] Generic deferred file writing
On Sun, Dec 31, 2000 at 08:33:01AM -0800, Linus Torvalds wrote:
> By doing a better job of caching stuff.

Caching can only happen after we have been slow and waited for I/O
synchronously the first time (bread). How can we optimize the first time
(when the indirect blocks are out of buffer cache) without changing the
on-disk format? We can't, as far as I can see.

It's of course fine to optimize the address_space->physical_block
resolver algorithm, because it has to run anyway if we want to write such
data to disk eventually (whether it's asynchronous with allocate-on-flush
or synchronous like now). Probably it's a more sensible optimization than
the allocate-on-flush thing. But still, being able to run the resolver in
an asynchronous manner, in parallel, only at the time we need to flush
the page to disk, looks like nicer behaviour to me.

Andrea
Re: [RFC] Generic deferred file writing
On Sun, 31 Dec 2000, Andrea Arcangeli wrote:
>
> get_block for large files can be improved using extents, but how can we
> implement a fast get_block without restructuring the on-disk format of the
> filesystem? (in turn using another filesystem instead of ext2?)

By doing a better job of caching stuff.

There are multiple levels of caching here. One issue is the question of
"allocate a new block". Go and look how the ext2 block pre-allocation
works, and cry. It should _not_ be a loop that sets one bit at a time; it
should be something that notices when the (u32 *) is zero and grabs the
whole 32 blocks in one go. Instead it defaults to a pre-allocation of 8
blocks, doing that same expensive thing much too often.

The other thing is that one of the common cases for writing is
consecutive writing to the end of the file. Now, you figure it out: if
get_block() really is a bottleneck, why not cache the last tree lookup?
You'd get a 99% hitrate for that common case.

You should realize that all the block allocation etc code was written for
a very different VFS layer. "get_block()" didn't even exist. We didn't
have SMP issues. We had very different access patterns (the virtual
caches in the page cache make the accesses to "get_block()" very
different, as the VFS layer keeps track of many mappings on its own). And
get_block() was basically tacked on top of the old code.

Al Viro did a good job of cleaning up some of the issues, but go and look
at get_block() and follow it all the way down into "ext2_new_block()" and
back, and I bet you'll ask yourself why it's so complex. And wonder if it
couldn't just be cleaned up and sped up a lot that way.

Linus
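[Editor's note] Linus's point about grabbing a whole free bitmap word at once can be illustrated with a small userspace sketch. This is not the real balloc.c code; the function name and bitmap layout are invented. When a 32-bit word is completely free, it claims all 32 blocks with one store instead of looping bit by bit.

```c
#include <stddef.h>
#include <stdint.h>

/* Claim up to `wanted` free blocks in bitmap word `w`; a set bit means
 * "in use". When the whole word is free and enough blocks are wanted,
 * grab all 32 in one go instead of one bit at a time. Returns the
 * number of blocks actually claimed. */
int claim_blocks(uint32_t *bitmap, size_t w, int wanted)
{
    int got = 0;

    if (bitmap[w] == 0 && wanted >= 32) {
        bitmap[w] = 0xffffffffu;    /* the fast whole-word path */
        return 32;
    }
    for (int bit = 0; bit < 32 && got < wanted; bit++) {
        if (!(bitmap[w] & (1u << bit))) {
            bitmap[w] |= 1u << bit; /* the slow bit-at-a-time path */
            got++;
        }
    }
    return got;
}
```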
Re: [RFC] Generic deferred file writing
On Sat, Dec 30, 2000 at 06:28:39PM -0800, Linus Torvalds wrote:
> There are only two real advantages to deferred writing:
>
>  - not having to do get_block() at all for temp-files, as we never have to
>    do the allocation if we end up removing the file.
>
>    NOTE NOTE NOTE! The overhead for trying to get ENOSPC and quota errors
>    right is quite possibly big enough that this advantage is possibly very
>    questionable. It's very possible that people could speed things up
>    using this approach, but I also suspect that it is equally (if not
>    more) possible to speed things up by just making sure that the
>    low-level FS has a fast get_block().

get_block for large files can be improved using extents, but how can we
implement a fast get_block without restructuring the on-disk format of
the filesystem? (in turn using another filesystem instead of ext2?)

get_block needs to walk all levels of inode metadata indirection if they
exist. It has to map the logical page from its (inode) address space to
the physical blocks. If those indirection blocks aren't in cache it has
to block to read them. It doesn't matter how it is actually implemented
in core. And then later, as you say, those allocated blocks could never
get written because the file may be deleted in the meantime.

With allocate-on-flush we can run the slow get_block in parallel,
asynchronously, using a kernel daemon after the page flushtime timeout
triggers. It looks nicer to me. The in-core overhead of the reserved
blocks for delayed allocation should not be relevant at all (and it also
should not need the big kernel lock, making the whole write path
big-lock free).

Andrea
Re: [RFC] Generic deferred file writing
Hi,

On Sat, 30 Dec 2000, Linus Torvalds wrote:
> In fact, in a properly designed filesystem just a bit of low-level caching
> would easily make the average "get_block()" be very fast indeed. The fact
> that right now ext2 has not been optimized for this is _not_ a reason to
> design the VFS layer around a slow get_block() operation.
> [..]
> The second point is completely different, and THIS is where I think there
> are potentially real advantages. However, I also think that this is not
> actually about deferred writes at all: it's really a question of the
> filesystem having access to the page when the physical write is actually
> started so that the filesystem might choose to _change_ the allocation it
> did - it might have allocated a backing store block at "get_block()" time,
> but by the time it actually writes the stuff out to disk it may have
> allocated a bigger contiguous area somewhere else for the data..
>
> I really think that the #2 thing is the more interesting one, and that
> anybody looking at ext2 should look at just improving the locking and
> making the block allocation functions run faster. Which should not be all
> that difficult - the last time I looked at the thing it was quite
> horrible.

What makes the get_block business complicated now is that it can be
called recursively: get_block needs to allocate something, which might
start new i/o, which calls get_block again. Writing dirty pages should be
a truly asynchronous process, but it isn't right now, as get_block is
synchronous. Making get_block asynchronous is almost impossible, so one
usually does it in a separate thread.

So IMO something like this should happen: dirty pages should be put on a
separate list, and a thread takes these pages, allocates the buffers for
them and starts the i/o. This has another advantage: get_block wouldn't
really need to do preallocation anymore; the get_block function could
work instead on a number of pages (preallocation would instead happen in
the page cache). This could make the get_block function and the needed
locking very simple, e.g. one could use a simple semaphore instead of
kernel_lock to protect getting of multiple blocks instead of only one.

Also, splitting it into several tasks can make it faster: in one step we
just do the resource allocation to guarantee the write, and in a separate
step we do the real allocation. If this is done for several pages at
once, it can be very fast and simple.

bye, Roman
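[Editor's note] Roman's two-step scheme (queue dirty pages, then have a separate thread allocate for a whole batch and start the i/o) can be sketched in userspace C. The structures and the trivial sequential allocator are invented for illustration, and the actual i/o submission is only marked by a comment.

```c
#include <stddef.h>

#define NO_BLOCK (-1L)

/* A dirty page waiting on the deferred list: its file offset (index)
 * is known, but no disk block has been assigned yet. */
struct dpage {
    long index;  /* page index within the file       */
    long block;  /* assigned disk block, or NO_BLOCK */
};

/* The flushing thread's work: allocate blocks for the whole batch in
 * one step, then start the i/o. `next_free` stands in for the real
 * block allocator, handing out consecutive blocks. */
void flush_dirty_list(struct dpage *list, size_t n, long *next_free)
{
    for (size_t i = 0; i < n; i++)
        if (list[i].block == NO_BLOCK)
            list[i].block = (*next_free)++;
    /* ... here the i/o for all n pages would be submitted at once ... */
}
```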
Re: [RFC] Generic deferred file writing
Hi,

On Sun, 31 Dec 2000, Linus Torvalds wrote:
> Let me repeat myself one more time: I do not believe that "get_block()"
> is as big of a problem as people make it out to be.

The real problem is that get_block() doesn't scale, and it's very hard to fix that. A recursive per-inode semaphore might help, but it's still a pain to get it right.

bye, Roman
Re: [RFC] Generic deferred file writing
On Sun, 31 Dec 2000, Roman Zippel wrote:
> On Sun, 31 Dec 2000, Linus Torvalds wrote:
> > Let me repeat myself one more time: I do not believe that
> > "get_block()" is as big of a problem as people make it out to be.
>
> The real problem is that get_block() doesn't scale, and it's very hard
> to fix that. A recursive per-inode semaphore might help, but it's still
> a pain to get it right.

Not true. There's nothing unscalable in get_block() per se. The only lock we hold is the per-page lock, which we must hold anyway. get_block() itself does not need any real locking: you can do it with a simple per-inode spinlock if you want to (and release the spinlock and try again if you need to fetch non-cached data blocks).

Sure, the current ext2 _implementation_ sucks. Nobody has ever contested that. Stop re-designing something just because you want to.

		Linus
Re: [RFC] Generic deferred file writing
Hi,

On Sun, 31 Dec 2000, Linus Torvalds wrote:
> 	cached_allocation = NULL;
> repeat:
> 	spin_lock();
> 	result = try_to_find_existing();
> 	if (!result) {
> 		if (!cached_allocation) {
> 			spin_unlock();
> 			cached_allocation = allocate_block();
> 			goto repeat;
> 		}
> 		result = cached_allocation;
> 		add_to_datastructures(result);
> 	}
> 	spin_unlock();
> 	return result;
>
> This is quite standard, and Linux does it in many places. It doesn't
> have to be complex or ugly.

No problem with that.

> Also, I don't see why you claim the current get_block() is recursive
> and hard to use: it obviously isn't. If you look at the current ext2
> get_block(), the way it protects most of its data structures is by the
> super-block-global lock. That wouldn't work if your claims of recursive
> invocation were true.

I just rechecked that, but I don't see a superblock lock here; it uses the kernel lock instead. Although Al could give the definitive answer for this - he wrote it. :)

> The way the Linux MM works, if the lower levels need to do buffer
> allocations, they will use GFP_BUFFER (which "bread()" does
> internally), which will mean that the VM layer will _not_ call down
> recursively to write stuff out while it's already trying to write
> something else. This is exactly so that filesystems don't have to
> release and re-try if they don't want to.
>
> In short, I don't see any of your arguments.

Then I must have misunderstood Al. Al? If you were right here, I would see absolutely no reason for the current complexity. (Me is a bit confused here.)

bye, Roman
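Linus's retry pattern above can be modelled as a small self-contained userspace sketch. The names mirror his pseudocode, but everything here is illustrative, not kernel code: the lock functions are no-op stand-ins (single-threaded model of a per-inode spinlock), and the "datastructure" is a single cached pointer.

```c
#include <assert.h>
#include <stdlib.h>

/* Single-threaded stand-ins for the per-inode spinlock, so the shape
 * of the retry loop can be followed without kernel infrastructure. */
static void spin_lock(void) { }
static void spin_unlock(void) { }

static int *the_entry;	/* the "existing" datastructure entry, if any */

static int *try_to_find_existing(void) { return the_entry; }
static void add_to_datastructures(int *result) { the_entry = result; }

/* Stand-in for a blocking allocation that must not run under the lock. */
static int *allocate_block(void) { return malloc(sizeof(int)); }

static int *get_entry(void)
{
	int *cached_allocation = NULL;
	int *result;
repeat:
	spin_lock();
	result = try_to_find_existing();
	if (!result) {
		if (!cached_allocation) {
			spin_unlock();
			/* Allocate outside the lock, then retry the lookup:
			 * another CPU may have raced and installed an entry
			 * while we slept in the allocator. */
			cached_allocation = allocate_block();
			goto repeat;
		}
		result = cached_allocation;
		cached_allocation = NULL;
		add_to_datastructures(result);
	}
	spin_unlock();
	free(cached_allocation);	/* lost the race: discard the spare */
	return result;
}
```

The point of the pattern is that the sleepable allocation never happens with the lock held, at the cost of an occasional wasted allocation on a race.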
Re: [RFC] Generic deferred file writing
On Mon, 1 Jan 2001, Roman Zippel wrote:
> I just rechecked that, but I don't see a superblock lock here; it uses
> the kernel lock instead. Although Al could give the definitive answer
> for this - he wrote it. :)

No superblock lock in get_block() proper. Tons of it in the dungheap called balloc.c. _That's_ where the bottleneck is. BTW, even the BKL is easily removable from get_block() - check the /* Reader: */ and /* Writer: */ comments, they mark the places to put a spinlock in.

> > The way the Linux MM works, if the lower levels need to do buffer
> > allocations, they will use GFP_BUFFER (which "bread()" does
> > internally), which will mean that the VM layer will _not_ call down
> > recursively to write stuff out while it's already trying to write
> > something else. This is exactly so that filesystems don't have to
> > release and re-try if they don't want to.
> >
> > In short, I don't see any of your arguments.
>
> Then I must have misunderstood Al. Al? If you were right here, I would
> see absolutely no reason for the current complexity. (Me is a bit
> confused here.)

Reread the original thread. GFP_BUFFER protects us from buffer cache allocations triggering pageouts. It has nothing to do with the deadlock scenario that would come from grabbing ->i_sem on pageout. Sheesh...

"Complexity" of ext2_get_block() (down to the ext2_new_block() calls) is really, really not a problem. Normally it just gives you the straightforward path. All unrolls are for contention cases and they are precisely what we have to do there.

Again, scalability problems are in the block allocator, not in the indirect blocks handling. It's completely independent from get_block(). We overlock. Big way. And the structures we are protecting excessively are:
	* cylinder group descriptors
	* block bitmaps
	* (to a lesser extent) inode bitmaps and inode table.

That's what needs to be fixed. It has nothing to do with VFS or VM - it's purely internal ext2 business. Another ext2 issue is reducing the buffer cache pressure - mostly by moving the directories into the page cache. I've posted such patches on fsdevel and they are applied to the kernel I'm running here. Works like a charm and allows the rest of the metadata to stay in cache longer.

GFP_BUFFER _may_ become an issue if we move bitmaps into the pagecache. Then we'll need a per-address_space gfp_mask. Again, patches exist and had been tested (not right now - I didn't port them to the current tree yet). Bugger if I remember whether they were posted or not - they've definitely been mentioned on linux-mm, but IIRC I had sent the modifications of VM code only to Jens. I can repost them.

Some pieces of balloc.c cleanup had been posted on fsdevel. Check the archives. They prepare the ground for killing lock_super() contention in ext2_new_inode(), but that part wasn't there back then. I will start the -bird (aka FS-CURRENT) branch as soon as Linus opens 2.4. Hopefully by the time of 2.5 it will be tested well enough. Right now it exists as a large patchset against a more or less recent -testN, and I'm waiting for a slowdown of the changes in the main tree to put them all together.

								Cheers,
									Al
Re: [RFC] Generic deferred file writing
On Sat, Dec 30, 2000 at 08:50:52PM -0500, Alexander Viro wrote:
> And its meaning for 2/3 of filesystems would be?

It should stay in the private part of the in-core superblock, of course.

> I _doubt_ it. If it is a pagecache issue it should apply to NFS. It
> should apply to ramfs. It should apply to a helluva lot of filesystems
> that are not block-based. Pagecache doesn't (and shouldn't) know about
> blocks.

With pagecache I meant the library of pagecache methods in buffer.c. Even if they are called by the lowlevel filesystem code and can be overridden by lowlevel filesystem code, they aren't lowlevel filesystem code - they are in fact common code. We can implement another version of them that, instead of knowing about get_block, knows about another filesystem callback, and when possible only reserves the space for a delayed allocation later triggered (in parallel) by a future kupdate. They will know about this new callback in the same way the current standard pagecache library methods know about get_block_t. Filesystems implementing this callback will be able to use those new pagecache library methods.

> it should use functions that do not expect such argument. That's it. No
> need to invent new methods or shoehorn all block filesystems into the
> same scheme.

Of course.

Andrea
Re: [RFC] Generic deferred file writing
On Sun, 31 Dec 2000, Roman Zippel wrote:
> On Sun, 31 Dec 2000, Andrea Arcangeli wrote:
> > > estimate than just the data blocks it should not be hard to add an
> > > extra callback to the filesystem.
> >
> > Yes, I was thinking of this callback too. Such a callback is nearly
> > the only support we need from the filesystem to provide allocate on
> > flush.
>
> Actually the getblock function could be split into 3 functions:
> - alloc_block: mostly just decrementing a counter (and quota)
> - get_block: allocating a block from the bitmap
> - commit_block: inserting the new block into the inode
>
> This would be really useful for streaming: one could get the block
> number as fast as possible and the data could be written very quickly,
> while keeping the cache usage low. Or streaming directly from a device
> to disk also wants to get rid of the data as fast as possible.

Now, to insert a small note of sanity here: I think people are starting to overdesign stuff.

The fact is that currently the "get_block()" interface that the page cache helper functions use does NOT have to be very expensive at all. In fact, in a properly designed filesystem just a bit of low-level caching would easily make the average "get_block()" very fast indeed. The fact that right now ext2 has not been optimized for this is _not_ a reason to design the VFS layer around a slow get_block() operation.

If you look at the generic block-based writing routines, they are actually not all that expensive. Any kind of complication is only going to make those functions more complex, and any performance gained could easily be lost in extra complexity.

There are only two real advantages to deferred writing:

 - not having to do get_block() at all for temp-files, as we never have
   to do the allocation if we end up removing the file.

   NOTE NOTE NOTE! The overhead for trying to get ENOSPC and quota
   errors right is quite possibly big enough that this advantage is very
   questionable. It's very possible that people could speed things up
   using this approach, but I also suspect that it is equally (if not
   more) possible to speed things up by just making sure that the
   low-level FS has a fast get_block().

 - using "global" access patterns to do a better job of "get_block()",
   ie taking advantage of issues with journalling etc and deferring the
   write in order to get a bigger journal.

The second point is completely different, and THIS is where I think there are potentially real advantages. However, I also think that this is not actually about deferred writes at all: it's really a question of the filesystem having access to the page when the physical write is actually started, so that the filesystem might choose to _change_ the allocation it did - it might have allocated a backing store block at "get_block()" time, but by the time it actually writes the stuff out to disk it may have allocated a bigger contiguous area somewhere else for the data.

I really think that the #2 thing is the more interesting one, and that anybody looking at ext2 should look at just improving the locking and making the block allocation functions run faster. Which should not be all that difficult - the last time I looked at the thing it was quite horrible.

		Linus
Re: [RFC] Generic deferred file writing
Hi,

On Sun, 31 Dec 2000, Andrea Arcangeli wrote:
> > estimate than just the data blocks it should not be hard to add an
> > extra callback to the filesystem.
>
> Yes, I was thinking of this callback too. Such a callback is nearly the
> only support we need from the filesystem to provide allocate on flush.

Actually the getblock function could be split into 3 functions:
- alloc_block: mostly just decrementing a counter (and quota)
- get_block: allocating a block from the bitmap
- commit_block: inserting the new block into the inode

This would be really useful for streaming: one could get the block number as fast as possible and the data could be written very quickly, while keeping the cache usage low. Or streaming directly from a device to disk also wants to get rid of the data as fast as possible.

bye, Roman
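Roman's three-way split could look roughly like this as a self-contained userspace sketch. Everything here is illustrative, not an existing kernel interface: the toy structs, the function names (fs_alloc_block, fs_get_block, fs_commit_block), and the flat bitmap are all made up to show the three phases.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define NBLOCKS 64

/* Toy filesystem state: a free-block counter plus an allocation bitmap. */
struct toyfs {
	long free_blocks;		/* phase 1 decrements this */
	uint8_t bitmap[NBLOCKS / 8];	/* phase 2 searches this */
};

/* Toy inode: a flat table of direct block pointers. */
struct toyinode {
	long blkptr[16];		/* phase 3 fills this in */
};

/* Phase 1 (alloc_block): reserve space - cheap, no bitmap search,
 * fails synchronously with ENOSPC so write() can report it. */
static int fs_alloc_block(struct toyfs *fs)
{
	if (fs->free_blocks <= 0)
		return -ENOSPC;
	fs->free_blocks--;
	return 0;
}

/* Phase 2 (get_block): pick an actual block number from the bitmap. */
static long fs_get_block(struct toyfs *fs)
{
	for (long i = 0; i < NBLOCKS; i++) {
		if (!(fs->bitmap[i / 8] & (1 << (i % 8)))) {
			fs->bitmap[i / 8] |= 1 << (i % 8);
			return i;
		}
	}
	return -ENOSPC;	/* shouldn't happen if phase 1 succeeded */
}

/* Phase 3 (commit_block): insert the block pointer into the inode. */
static void fs_commit_block(struct toyinode *inode, int idx, long blk)
{
	inode->blkptr[idx] = blk;
}
```

The streaming win Roman describes comes from phases 2 and 3 being deferrable: phase 1 settles the ENOSPC/quota question up front, while the bitmap search and metadata update can be batched later.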
Re: [RFC] Generic deferred file writing
On Sun, 31 Dec 2000, Andrea Arcangeli wrote:
> On Sat, Dec 30, 2000 at 03:00:43PM -0700, Eric W. Biederman wrote:
> > To get ENOSPC handling 99% correct all we need to do is decrement a
> > counter that remembers how many disk blocks are free. If we need a
> > better
>
> Yes, we need to add one field to the in-core superblock to do this
> accounting.

And its meaning for 2/3 of filesystems would be?

> > estimate than just the data blocks it should not be hard to add an
> > extra callback to the filesystem.
>
> Yes, I was thinking of this callback too. Such a callback is nearly the
> only support we need from the filesystem to provide allocate on flush.
> Allocate on flush is a pagecache issue, not really a filesystem issue.
> When a filesystem

I _doubt_ it. If it is a pagecache issue it should apply to NFS. It should apply to ramfs. It should apply to a helluva lot of filesystems that are not block-based. Pagecache doesn't (and shouldn't) know about blocks.

> doesn't implement such a callback we can simply get_block(create) at
> pagecache creation time as usual.

Umm... You do realize that get_block is _not_ visible at the pagecache level? Sure thing, a filesystem should be free to use whatever functions it wants for address_space methods. No arguments here. It should be able to use whatever callbacks these functions expect. If a filesystem doesn't implement reservation it should use functions that do not expect such an argument. That's it. No need to invent new methods or shoehorn all block filesystems into the same scheme.
Re: [RFC] Generic deferred file writing
Linus Torvalds <[EMAIL PROTECTED]> writes:
> On 30 Dec 2000, Eric W. Biederman wrote:
> > One other thing to think about for the VFS/MM layer is limiting the
> > total number of dirty pages in the system (to what disk pressure
> > shows the disk can handle), to keep system performance smooth when
> > swapping.
>
> This is a separate issue, and I think that it is most closely tied in
> to the "RSS limit" kind of patches because of the memory mapping
> issues. If you've seen the RSS rlimit patch (it's been posted a few
> times this week), then you could think of that modified by a "Resident
> writable pages Set Size" approach.

Building on the RSS limit approach sounds much simpler than the way I was thinking.

> Not just for shared mappings - this is also an issue with limiting
> swapout.
>
> (I actually don't think that RSS is all that interesting, it's really
> the "potentially dirty RSS" that counts for VM behaviour - everything
> else can be dropped easily enough)

Definitely. Now the only tricky bit is how we sense when we are overloading the swap disks. Well, that is the next step. I'll take a look and see what it takes to keep statistics on dirty mapped pages.

Eric
Re: [RFC] Generic deferred file writing
On Sat, Dec 30, 2000 at 03:00:43PM -0700, Eric W. Biederman wrote:
> To get ENOSPC handling 99% correct all we need to do is decrement a
> counter that remembers how many disk blocks are free. If we need a
> better

Yes, we need to add one field to the in-core superblock to do this accounting.

> estimate than just the data blocks it should not be hard to add an
> extra callback to the filesystem.

Yes, I was thinking of this callback too. Such a callback is nearly the only support we need from the filesystem to provide allocate on flush. Allocate on flush is a pagecache issue, not really a filesystem issue. When a filesystem doesn't implement such a callback we can simply get_block(create) at pagecache creation time as usual.

Andrea
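The counter-based accounting Eric and Andrea describe might look like this in outline. This is a userspace sketch under stated assumptions: the struct, the field, and the sb_reserve/sb_unreserve names are made up, not existing kernel code.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical private part of an in-core superblock. */
struct sb_private {
	long free_blocks;	/* blocks not yet claimed by any pending write */
};

/* Claim n blocks at write() time, before any bitmap work is done,
 * so ENOSPC is reported synchronously even though the real
 * allocation is deferred to flush time. */
static int sb_reserve(struct sb_private *sb, long n)
{
	if (sb->free_blocks < n)
		return -ENOSPC;
	sb->free_blocks -= n;
	return 0;
}

/* Give blocks back if the pages are truncated before they are flushed
 * (the temp-file case where the allocation never has to happen). */
static void sb_unreserve(struct sb_private *sb, long n)
{
	sb->free_blocks += n;
}
```

The thread's point is that this counter is all the VFS would need from a cooperating filesystem for the common case; the callback Andrea mentions would only refine the estimate.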
Re: [RFC] Generic deferred file writing
On Sat, 30 Dec 2000, Alexander Viro wrote:
> Well, see above. I'm pretty nervous about breaking the ordering of
> metadata allocation. For pageout() we don't have such ordering. For
> write() we certainly do. Notice that reserving disk space upon write()
> and eating it later is a _very_ messy job - you'll have to take care of
> situations when we reserve the space upon write() and have pageout do
> the real allocation. Not nice, since pageout has no way in hell to tell
> whether it is eating from a reserved area or just flushing the mmaped
> one. We could keep a per-bh "reserved" flag to fold that information
> into the pagecache, but IMO it's simply not worth the trouble. If some
> filesystem wants that - hey, it can do that right now. Just make
> ->prepare_write() do reservations and let ->commit_write() mark the
> page dirty. Then ->writepage() will eventually flush it.

This is a refinement of the idea, and some abstraction like that is clearly needed - maybe that is exactly the right one. For now I'm interested in putting this on the table so that we can check the stability and performance, maybe uncover some more bugs, then start going after some of the things that need to be done to turn it into a useful option.

P.S. I humbly apologize for writing (!offset && bytes == PAGE_SIZE) when I could have just written (bytes == PAGE_SIZE).

--
Daniel
Re: [RFC] Generic deferred file writing
On 30 Dec 2000, Eric W. Biederman wrote:
> One other thing to think about for the VFS/MM layer is limiting the
> total number of dirty pages in the system (to what disk pressure shows
> the disk can handle), to keep system performance smooth when swapping.

This is a separate issue, and I think that it is most closely tied in to the "RSS limit" kind of patches because of the memory mapping issues. If you've seen the RSS rlimit patch (it's been posted a few times this week), then you could think of that modified by a "Resident writable pages Set Size" approach. Not just for shared mappings - this is also an issue with limiting swapout.

(I actually don't think that RSS is all that interesting, it's really the "potentially dirty RSS" that counts for VM behaviour - everything else can be dropped easily enough)

		Linus
Re: [RFC] Generic deferred file writing
Linus Torvalds <[EMAIL PROTECTED]> writes:
> In short, I don't see _those_ kinds of issues. I do see error reporting
> as a major issue, though. If we need to do proper low-level block
> allocation in order to get correct ENOSPC handling, then the win from
> doing deferred writes is not very big.

To get ENOSPC handling 99% correct all we need to do is decrement a counter that remembers how many disk blocks are free. If we need a better estimate than just the data blocks it should not be hard to add an extra callback to the filesystem.

There look to be some interesting cases to handle when we fill up a filesystem. Before actually failing and returning ENOSPC the filesystem might want to fsync itself, and see how correct its estimates were. But that is the rare case and shouldn't affect performance.

In the long term, VFS support for deferred writes looks like a major win. Knowing how large a file is before we write it to disk allows very efficient disk organization, and fast file access (especially combined with an extent-based fs). Support for compressing files in real time falls out naturally. So does support for filesystems that maintain coherency by never writing the same block back to the same disk location.

One other thing to think about for the VFS/MM layer is limiting the total number of dirty pages in the system (to what disk pressure shows the disk can handle), to keep system performance smooth when swapping. All cases except mmaped files are easy, and those could be handled by a modified page fault handler that directly puts the dirty bit on the struct page. (Except that is buggy with respect to clearing the dirty bit on the struct page.) In reality we would have to create a queue of pointers to dirty ptes from the page fault handler and, depending on a timer or memory pressure, move the dirty bits to the actual page. Combined with the code to make sync and fsync work on the page cache, msync would be obsolete?

Of course the most important part is that when all of that is working, the VFS/MM layer would be perfect. World domination would be achieved. For who would be caught using an OS with an imperfect VFS layer :)

Eric
Re: [RFC] Generic deferred file writing
On Sat, 30 Dec 2000, Linus Torvalds wrote:
> On Sat, 30 Dec 2000, Alexander Viro wrote:
> > Except that we've got file-expanding writes outside of ->i_sem.
> > Thanks, but no thanks.
>
> No, Al, the file size is still updated inside i_sem.

Then we are screwed. Look: we call write(). Twice. The second call happens to overflow the quota. Getting the second chunk of data written and the first one ending up as a hole is the last thing you would expect, isn't it?

> In short, I don't see _those_ kinds of issues. I do see error reporting
> as a major issue, though. If we need to do proper low-level block
> allocation in order to get correct ENOSPC handling, then the win from
> doing deferred writes is not very big.

Well, see above. I'm pretty nervous about breaking the ordering of metadata allocation. For pageout() we don't have such ordering. For write() we certainly do. Notice that reserving disk space upon write() and eating it later is a _very_ messy job - you'll have to take care of situations when we reserve the space upon write() and have pageout do the real allocation. Not nice, since pageout has no way in hell to tell whether it is eating from a reserved area or just flushing the mmaped one. We could keep a per-bh "reserved" flag to fold that information into the pagecache, but IMO it's simply not worth the trouble. If some filesystem wants that - hey, it can do that right now. Just make ->prepare_write() do reservations and let ->commit_write() mark the page dirty. Then ->writepage() will eventually flush it.

Again, if one is willing to implement reservation at the block level - fine, there is no need to change anything in VFS or VM. I certainly don't want to mess with that, but hey, if somebody is into masochism - let them.
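Viro's suggestion - reserve in ->prepare_write(), mark dirty in ->commit_write(), allocate for real at ->writepage() time - can be sketched as a userspace state machine. The structs and signatures below are illustrative stand-ins, not the actual address_space_operations interface.

```c
#include <assert.h>
#include <errno.h>

struct fs { long free_blocks; long next_block; };
struct page_state { int reserved; int dirty; long block; };

/* ->prepare_write() stand-in: reserve space but pick no block yet,
 * so write() still gets a synchronous ENOSPC. */
static int prepare_write(struct fs *fs, struct page_state *p)
{
	if (fs->free_blocks == 0)
		return -ENOSPC;
	fs->free_blocks--;
	p->reserved = 1;
	return 0;
}

/* ->commit_write() stand-in: just mark the page dirty; no allocation. */
static void commit_write(struct page_state *p)
{
	p->dirty = 1;
}

/* ->writepage() stand-in: the real block allocation happens here,
 * eating from the reservation made at write() time. */
static int writepage(struct fs *fs, struct page_state *p)
{
	if (!p->dirty)
		return -EINVAL;
	p->block = fs->next_block++;	/* stand-in for bitmap allocation */
	p->reserved = 0;
	p->dirty = 0;
	return 0;
}
```

The design point Viro is making is that this split needs no VFS or VM changes at all: the filesystem already controls all three methods.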
Re: [RFC] Generic deferred file writing
Linus writes:
> In short, I don't see _those_ kinds of issues. I do see error reporting
> as a major issue, though. If we need to do proper low-level block
> allocation in order to get correct ENOSPC handling, then the win from
> doing deferred writes is not very big.

It should be relatively light-weight to call into the filesystem simply to allocate a "count" of blocks needed for the current file. It may even be possible to do delayed inode allocation. This would defer the inode/block bitmap searching/allocation on ext2 until the file was actually written - only the free_blocks/free_inodes counts in the superblock would be decremented, and we would get ENOSPC immediately if we don't have enough space for the file. On fsck, these values are recalculated from the group descriptors on ext2, so it wouldn't be a problem if the system crashed with pre-allocated blocks.

It would definitely be a win on ext2 and XFS, and if it isn't possible on other filesystems, it should at least not be a loss. We would need to ensure we also keep enough space for indirect blocks and such, so we need to pass more information than just the number of blocks added (i.e. how big the file already is).

Cheers, Andreas
--
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
                 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/               -- Dogbert
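For an ext2-like layout, the "more information than just the number of blocks added" could feed an estimate along these lines. A sketch only: the function name and the simplified geometry (12 direct pointers, single indirection only, so it undercounts very large files) are illustrative, not ext2's actual reservation code, which does not exist in this form.

```c
#include <assert.h>

/* Upper-bound cost of growing a file from old_blocks to
 * old_blocks + new_blocks data blocks: the data blocks themselves
 * plus any *new* indirect blocks the growth requires. Assumes 12
 * direct pointers and addrs_per_block pointers per indirect block,
 * with single indirection only (an illustration, not real ext2). */
static long blocks_needed(long old_blocks, long new_blocks,
			  long addrs_per_block)
{
	long total = old_blocks + new_blocks;
	long old_ind, new_ind;

	/* indirect blocks in use before and after the write */
	old_ind = old_blocks <= 12 ? 0 :
		(old_blocks - 12 + addrs_per_block - 1) / addrs_per_block;
	new_ind = total <= 12 ? 0 :
		(total - 12 + addrs_per_block - 1) / addrs_per_block;

	return new_blocks + (new_ind - old_ind);
}
```

This is why the callback needs the current file size: the metadata cost of the same number of new data blocks differs depending on where in the file they land.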
Re: [RFC] Generic deferred file writing
On Sat, 30 Dec 2000, Alexander Viro wrote:
> Except that we've got file-expanding writes outside of ->i_sem. Thanks,
> but no thanks.

No, Al, the file size is still updated inside i_sem. Yes, it will do actual block allocation outside i_sem, but that is already true of any mmap'ed writes, and has been true for a long long time. So if we have a bug here (and I don't think we have one), it's not something new.

But the inode semaphore doesn't protect the balloc() data structures anyway, as they are filesystem-global. If you're nervous about the effects of "truncate()", then that should be handled properly by truncate_inode_pages().

In short, I don't see _those_ kinds of issues. I do see error reporting as a major issue, though. If we need to do proper low-level block allocation in order to get correct ENOSPC handling, then the win from doing deferred writes is not very big.

		Linus
Re: [RFC] Generic deferred file writing
On Sat, 30 Dec 2000, Daniel Phillips wrote:
> When I saw you put in the if (PageDirty) ->writepage and related code
> over the last couple of weeks I was wondering if you realize how close
> we are to having generic deferred file writing in the VFS. I took some
> time today to code this little hack and it comes awfully close to doing
> the job. However, *** Warning, do not run this on a machine you care
> about, it will mess it up ***.
>
> The advantages of deferred file writing are pretty obvious. Right now
> we are deferring just the writeout of data to the disk, but we can also
> defer the disk mapping, so that metadata blocks don't have to stay
> around in cache waiting for data blocks to get mapped into them one at
> a time - a whole group can be done in one flush.

Except that we've got file-expanding writes outside of ->i_sem. Thanks, but no thanks.
Re: [RFC] Generic deferred file writing
On Sat, 30 Dec 2000, Daniel Phillips wrote:
> When I saw you put in the if (PageDirty) ->writepage and related code
> over the last couple of weeks I was wondering if you realize how close
> we are to having generic deferred file writing in the VFS.

I'm very aware of it indeed.

However, it does break various common assumptions, one of them being proper error handling. Things like proper detection of quota overflows and even simple "out of disk space" issues. One of the main advantages of deferred writing would be that we could do temp-files without ever actually doing most of the low-level filesystem block allocation, but in order to get that advantage we really need to handle the out-of-disk case gracefully.

I considered doing something like this as a mount option, so that people could decide on their own whether they want a safe filesystem, or whether it's ok to do deferred writes. People might find it worth it for /tmp, but might be unwilling to use it for /var/spool/mail, for example.

(Hmm.. It might be perfectly fine for /var/spool/mail - mail delivery tends to be really careful about doing "fsync()" etc and actually picks up the errors that way. HOWEVER, before doing that you should expand the writepage logic to set the page "error" bit when it fails to write out a full page - right now we just lose the error completely).

		Linus
Re: [RFC] Generic deferred file writing
On Sat, 30 Dec 2000, Daniel Phillips wrote: When I saw you put in the if (PageDirty) --writepage and related code over the last couple of weeks I was wondering if you realize how close we are to having generic deferred file writing in the VFS. I'm very aware of it indeed. However, it does break various common assumptions, one of them being proper error handling. Things like proper detection of quota overflows and even simple "out of disk space" issues. One of the main advantages of deferred writing would be that we could do temp-files without ever actually doing most of the low-level filesystem block allocation, but in order to get that advantage we really need to handle the out-of-disk case gracefully. I considered doing something like this as a mount option, so that people could decide on their own whether they want a safe filesystem, or whether it's ok to do deferred writes. People might find it worth it for /tmp, but might be unwilling to use it for /var/spool/mail, for example. (Hmm.. It might be perfectly fine for /vsr/spool/mail - mail delivery tends to be really careful about doing "fsync()" etc and actually pick up the errors that way. HOWEVER, before doing that you should expand the writepage logic to set the page "error" bit for when it fails to write out a full page - right now we just lose the error completely). Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] Generic deferred file writing
On Sat, 30 Dec 2000, Daniel Phillips wrote: When I saw you put in the if (PageDirty) --writepage and related code over the last couple of weeks I was wondering if you realize how close we are to having generic deferred file writing in the VFS. I took some time today to code this little hack and it comes awfully close to doing the job. However, *** Warning, do not run this on a machine you care about, it will mess it up ***. The advantages of deferred file writing are pretty obvious. Right now we are deferring just the writeout of data to the disk, but we can also defer the disk mapping, so that metadata blocks don't have to stay around in cache waiting for data blocks to get mapped into them one at a time - a whole group can be done in one flush. Except that we've got file-expanding writes outside of -i_sem. Thanks, but no thanks. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] Generic deferred file writing
On Sat, 30 Dec 2000, Alexander Viro wrote:
> Except that we've got file-expanding writes outside of ->i_sem. Thanks,
> but no thanks.

No, Al, the file size is still updated inside i_sem.

Yes, it will do actual block allocation outside i_sem, but that is
already true of any mmap'ed writes, and has been true for a long long
time. So if we have a bug here (and I don't think we have one), it's not
something new. But the inode semaphore doesn't protect the balloc() data
structures anyway, as they are filesystem-global.

If you're nervous about the effects of "truncate()", then that should be
handled properly by truncate_inode_pages().

In short, I don't see _those_ kinds of issues. I do see error reporting
as a major issue, though. If we need to do proper low-level block
allocation in order to get correct ENOSPC handling, then the win from
doing deferred writes is not very big.

		Linus
Re: [RFC] Generic deferred file writing
Linus writes:
> In short, I don't see _those_ kinds of issues. I do see error reporting
> as a major issue, though. If we need to do proper low-level block
> allocation in order to get correct ENOSPC handling, then the win from
> doing deferred writes is not very big.

It should be relatively light-weight to call into the filesystem simply
to allocate a "count" of blocks needed for the current file. It may even
be possible to do delayed inode allocation. This would defer the
inode/block bitmap searching/allocation on ext2 until the file was
actually written - only the free_blocks/free_inodes count in the
superblock would be decremented, and we would get ENOSPC immediately if
we don't have enough space for the file.

On fsck, these values are recalculated from the group descriptors on
ext2, so it wouldn't be a problem if the system crashed with
pre-allocated blocks. It would definitely be a win on ext2 and XFS, and
if it isn't possible on other filesystems, it should at least not be a
loss. We would need to ensure we also keep enough space for indirect
blocks and such, so we need to pass more information than just the number
of blocks added (i.e. how big the file already is).

Cheers, Andreas
--
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
                 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/            -- Dogbert
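[Editor's note: Andreas's counting scheme can be sketched in a few lines of C. This is an illustrative userspace model, not kernel code - the counter, the 256-pointers-per-indirect-block geometry, and every function name are assumptions made up for the example, and only single-indirect blocks are counted.]

```c
#include <stdio.h>
#include <errno.h>
#include <assert.h>

#define ADDR_PER_BLOCK 256  /* assumed block pointers per indirect block */

/* Hypothetical in-core superblock counter, as Andreas describes:
 * only this is decremented at write() time; the bitmap search is
 * deferred until the data is actually flushed. */
long s_free_blocks = 1000;

/* Estimate data + indirect blocks needed to grow a file that already
 * has old_blocks data blocks by new_blocks more. This is why "how big
 * the file already is" must be passed down, not just the block count. */
long estimate_blocks(long old_blocks, long new_blocks)
{
	long old_ind = (old_blocks + ADDR_PER_BLOCK - 1) / ADDR_PER_BLOCK;
	long new_ind = (old_blocks + new_blocks + ADDR_PER_BLOCK - 1)
			/ ADDR_PER_BLOCK;
	return new_blocks + (new_ind - old_ind);
}

/* Reserve space up front: ENOSPC is reported immediately, before any
 * dirty data is queued, without touching the bitmaps at all. */
int reserve_blocks(long old_blocks, long new_blocks)
{
	long needed = estimate_blocks(old_blocks, new_blocks);

	if (needed > s_free_blocks)
		return -ENOSPC;
	s_free_blocks -= needed;
	return 0;
}
```

A real version would also have to count double/triple indirect blocks and the delayed inode itself, but the shape of the accounting is the same.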
Re: [RFC] Generic deferred file writing
On Sat, 30 Dec 2000, Linus Torvalds wrote:
> On Sat, 30 Dec 2000, Alexander Viro wrote:
> > Except that we've got file-expanding writes outside of ->i_sem.
> > Thanks, but no thanks.
>
> No, Al, the file size is still updated inside i_sem.

Then we are screwed. Look: we call write(). Twice. The second call
happens to overflow the quota. Getting the second chunk of data written
and the first one ending up as a hole is the last thing you would expect,
isn't it?

> In short, I don't see _those_ kinds of issues. I do see error reporting
> as a major issue, though. If we need to do proper low-level block
> allocation in order to get correct ENOSPC handling, then the win from
> doing deferred writes is not very big.

Well, see above. I'm pretty nervous about breaking the ordering of
metadata allocation. For pageout() we don't have such ordering. For
write() we certainly do.

Notice that reserving disk space upon write() and eating it later is a
_very_ messy job - you'll have to take care of situations when we reserve
the space upon write() and get pageout to do the real allocation. Not
nice, since pageout has no way in hell to tell whether it is eating from
a reserved area or just flushing the mmaped one. We could keep a per-bh
"reserved" flag to fold that information into the pagecache, but IMO it's
simply not worth the trouble.

If some filesystem wants that - hey, it can do that right now. Just make
->prepare_write() do reservations and let ->commit_write() mark the page
dirty. Then ->writepage() will eventually flush it.

Again, if one is willing to implement reservation on block level - fine,
there is no need to change anything in VFS or VM. I certainly don't want
to mess with that, but hey, if somebody is into masochism - let them.
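[Editor's note: the "filesystem can do that right now" scheme Al sketches - reserve in ->prepare_write(), only dirty the page in ->commit_write(), do the real allocation at ->writepage() time - can be modeled as a toy state machine. Everything here (names, the one-block-per-page granularity, the counters) is invented for illustration; the point is that the ENOSPC failure stays ordered with respect to write(), so the second write cannot succeed while the first becomes a hole.]

```c
#include <errno.h>
#include <assert.h>

long fs_free_pages = 2;  /* space left on the toy filesystem */
long fs_reserved = 0;    /* reserved but not yet allocated */

struct page {
	int dirty;
	int reserved;
};

/* Reservation happens synchronously in the write() path, so quota and
 * ENOSPC errors are reported in order, before any data is accepted. */
int toy_prepare_write(struct page *p)
{
	if (p->reserved)
		return 0;		/* this page is already paid for */
	if (fs_free_pages == 0)
		return -ENOSPC;
	fs_free_pages--;
	fs_reserved++;
	p->reserved = 1;
	return 0;
}

/* No allocation here: just mark the page dirty and defer the rest. */
int toy_commit_write(struct page *p)
{
	p->dirty = 1;
	return 0;
}

/* The eventual flush eats from the reservation and so can never fail
 * with ENOSPC, even though the on-disk allocation happens only now. */
int toy_writepage(struct page *p)
{
	if (p->reserved) {
		fs_reserved--;
		p->reserved = 0;
	}
	p->dirty = 0;
	return 0;
}
```

With two free pages, a third write fails up front in toy_prepare_write() rather than leaving a hole behind a later successful write - which is exactly the ordering property Al is worried about losing.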
Re: [RFC] Generic deferred file writing
Linus Torvalds [EMAIL PROTECTED] writes:
> In short, I don't see _those_ kinds of issues. I do see error reporting
> as a major issue, though. If we need to do proper low-level block
> allocation in order to get correct ENOSPC handling, then the win from
> doing deferred writes is not very big.

To get ENOSPC handling 99% correct all we need to do is decrement a
counter that remembers how many disk blocks are free. If we need a better
estimate than just the data blocks it should not be hard to add an extra
callback to the filesystem.

There look to be some interesting cases to handle when we fill up a
filesystem. Before actually failing and returning ENOSPC the filesystem
might want to fsync itself, and see how correct its estimates were. But
that is the rare case and shouldn't affect performance.

<rant>
In the long term VFS support for deferred writes looks like a major win.
Knowing how large a file is before we write it to disk allows very
efficient disk organization, and fast file access (esp combined with an
extent based fs). Support for compressing files in real time falls out
naturally. Support for filesystems that maintain coherency by never
writing the same block back to the same disk location also appears.
</rant>

One other thing to think about for the VFS/MM layer is limiting the total
number of dirty pages in the system (to what disk pressure shows the disk
can handle), to keep system performance smooth when swapping. All cases
except mmaped files are easy, and they can be handled by a modified page
fault handler that directly puts the dirty bit on the struct page.
(Except that is buggy with respect to clearing the dirty bit on the
struct page.) In reality we would have to create a queue of pointers to
dirty pte's from the page fault handler and, depending on a timer or
memory pressure, move the dirty bits to the actual page. Combined with
the code to make sync and fsync work on the page cache, msync would be
obsolete?
Of course the most important part is that when all of that is working,
the VFS/MM layer would be perfect. World domination would be achieved.
For who would be caught using an OS with an imperfect VFS layer :)

Eric
Re: [RFC] Generic deferred file writing
On 30 Dec 2000, Eric W. Biederman wrote:
> One other thing to think about for the VFS/MM layer is limiting the
> total number of dirty pages in the system (to what disk pressure shows
> the disk can handle), to keep system performance smooth when swapping.

This is a separate issue, and I think that it is most closely tied in to
the "RSS limit" kind of patches because of the memory mapping issues.

If you've seen the RSS rlimit patch (it's been posted a few times this
week), then you could think of that modified by a "Resident writable
pages Set Size" approach. Not just for shared mappings - this is also an
issue with limiting swapout.

(I actually don't think that RSS is all that interesting, it's really the
"potentially dirty RSS" that counts for VM behaviour - everything else
can be dropped easily enough)

		Linus
Re: [RFC] Generic deferred file writing
On Sat, 30 Dec 2000, Alexander Viro wrote:
> Well, see above. I'm pretty nervous about breaking the ordering of
> metadata allocation. For pageout() we don't have such ordering. For
> write() we certainly do.
>
> Notice that reserving disk space upon write() and eating it later is a
> _very_ messy job - you'll have to take care of situations when we
> reserve the space upon write() and get pageout to do the real
> allocation. Not nice, since pageout has no way in hell to tell whether
> it is eating from a reserved area or just flushing the mmaped one. We
> could keep a per-bh "reserved" flag to fold that information into the
> pagecache, but IMO it's simply not worth the trouble.
>
> If some filesystem wants that - hey, it can do that right now. Just
> make ->prepare_write() do reservations and let ->commit_write() mark
> the page dirty. Then ->writepage() will eventually flush it.

This is a refinement of the idea and some abstraction like that is
clearly needed, and maybe that is exactly the right one. For now I'm
interested in putting this on the table so that we can check the
stability and performance, maybe uncover some more bugs, then start going
after some of the things that need to be done to turn it into a useful
option.

P.S., I humbly apologize for writing (!offset && bytes == PAGE_SIZE) when
I could have just written (bytes == PAGE_SIZE).

--
Daniel
Re: [RFC] Generic deferred file writing
On Sat, Dec 30, 2000 at 03:00:43PM -0700, Eric W. Biederman wrote:
> To get ENOSPC handling 99% correct all we need to do is decrement a
> counter that remembers how many disk blocks are free. If we need a
> better

Yes, we need to add one field to the in-core superblock to do this
accounting.

> estimate than just the data blocks it should not be hard to add an
> extra callback to the filesystem.

Yes, I was thinking of this callback too. Such a callback is nearly the
only support we need from the filesystem to provide allocate on flush.
Allocate on flush is a pagecache issue, not really a filesystem issue.
When a filesystem doesn't implement such a callback we can simply
get_block(create) at pagecache creation time as usual.

Andrea
Re: [RFC] Generic deferred file writing
Linus Torvalds [EMAIL PROTECTED] writes:
> On 30 Dec 2000, Eric W. Biederman wrote:
> > One other thing to think about for the VFS/MM layer is limiting the
> > total number of dirty pages in the system (to what disk pressure
> > shows the disk can handle), to keep system performance smooth when
> > swapping.
>
> This is a separate issue, and I think that it is most closely tied in
> to the "RSS limit" kind of patches because of the memory mapping
> issues.
>
> If you've seen the RSS rlimit patch (it's been posted a few times this
> week), then you could think of that modified by a "Resident writable
> pages Set Size" approach.

Building on the RSS limit approach sounds much simpler than the way I was
thinking.

> Not just for shared mappings - this is also an issue with limiting
> swapout.
>
> (I actually don't think that RSS is all that interesting, it's really
> the "potentially dirty RSS" that counts for VM behaviour - everything
> else can be dropped easily enough)

Definitely. Now the only tricky bit is how do we sense when we are
overloading the swap disks. Well, that is the next step. I'll take a look
and see what it takes to keep statistics on dirty mapped pages.

Eric
Re: [RFC] Generic deferred file writing
On Sun, 31 Dec 2000, Andrea Arcangeli wrote:
> On Sat, Dec 30, 2000 at 03:00:43PM -0700, Eric W. Biederman wrote:
> > To get ENOSPC handling 99% correct all we need to do is decrement a
> > counter that remembers how many disk blocks are free. If we need a
> > better
>
> Yes, we need to add one field to the in-core superblock to do this
> accounting.

And its meaning for 2/3 of filesystems would be?

> > estimate than just the data blocks it should not be hard to add an
> > extra callback to the filesystem.
>
> Yes, I was thinking of this callback too. Such a callback is nearly the
> only support we need from the filesystem to provide allocate on flush.
> Allocate on flush is a pagecache issue, not really a filesystem issue.
> When a filesystem

I _doubt_ it. If it is a pagecache issue it should apply to NFS. It
should apply to ramfs. It should apply to a helluva lot of filesystems
that are not block-based. Pagecache doesn't (and shouldn't) know about
blocks.

> doesn't implement such a callback we can simply get_block(create) at
> pagecache creation time as usual.

Umm... You do realize that get_block is _not_ visible on the pagecache
level? Sure thing, a filesystem should be free to use whatever functions
it wants for address_space methods. No arguments here. It should be able
to use whatever callbacks these functions expect. If a filesystem doesn't
implement reservation it should use functions that do not expect such an
argument. That's it. No need to invent new methods or shoehorn all block
filesystems into the same scheme.
Re: [RFC] Generic deferred file writing
Hi,

On Sun, 31 Dec 2000, Andrea Arcangeli wrote:
> > estimate than just the data blocks it should not be hard to add an
> > extra callback to the filesystem.
>
> Yes, I was thinking of this callback too. Such a callback is nearly the
> only support we need from the filesystem to provide allocate on flush.

Actually the getblock function could be split into 3 functions:
- alloc_block: mostly just decrementing a counter (and quota)
- get_block: allocating a block from the bitmap
- commit_block: inserting the new block into the inode

This would be really useful for streaming: one could get the block number
as fast as possible and the data could be very quickly written, while
keeping the cache usage low. Streaming directly from a device to disk
also wants to get rid of the data as fast as possible.

bye, Roman
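[Editor's note: Roman's three-way split can be sketched as a toy interface. This is a hedged userspace model, not a proposed kernel API - the bitmap, the counters, and the toy_ function names are all invented for illustration; only the division of labour between the three steps comes from the message above.]

```c
#include <errno.h>
#include <assert.h>

#define NBLOCKS 8

long free_count = NBLOCKS;       /* cheap counter, quota-style */
unsigned char bitmap[NBLOCKS];   /* the real on-disk allocation state */
long inode_blocks[16];           /* toy inode block-pointer array */

/* Step 1: cheap accounting only. This is all the write() path needs to
 * do to report ENOSPC (and quota overflow) correctly and in order. */
int toy_alloc_block(void)
{
	if (free_count == 0)
		return -ENOSPC;
	free_count--;
	return 0;
}

/* Step 2: pick an actual block from the bitmap, so streaming writers
 * can learn the block number and push the data out early. Cannot fail
 * with ENOSPC if step 1 succeeded, since the space is already counted. */
long toy_get_block(void)
{
	long i;

	for (i = 0; i < NBLOCKS; i++) {
		if (!bitmap[i]) {
			bitmap[i] = 1;
			return i;
		}
	}
	return -ENOSPC;
}

/* Step 3: insert the block pointer into the inode - the step a
 * journalling filesystem would want to batch into one big transaction. */
void toy_commit_block(long file_index, long blocknr)
{
	inode_blocks[file_index] = blocknr;
}
```

The value for streaming is that step 2 can run long before step 3: the data leaves the cache as soon as a block number exists, while the metadata update is deferred and batched.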
Re: [RFC] Generic deferred file writing
On Sun, 31 Dec 2000, Roman Zippel wrote:
> On Sun, 31 Dec 2000, Andrea Arcangeli wrote:
> > Yes, I was thinking of this callback too. Such a callback is nearly
> > the only support we need from the filesystem to provide allocate on
> > flush.
>
> Actually the getblock function could be split into 3 functions:
> - alloc_block: mostly just decrementing a counter (and quota)
> - get_block: allocating a block from the bitmap
> - commit_block: inserting the new block into the inode
>
> This would be really useful for streaming: one could get the block
> number as fast as possible and the data could be very quickly written,
> while keeping the cache usage low. Streaming directly from a device to
> disk also wants to get rid of the data as fast as possible.

Now, to insert a small note of sanity here: I think people are starting
to overdesign stuff.

The fact is that currently the "get_block()" interface that the page
cache helper functions use does NOT have to be very expensive at all. In
fact, in a properly designed filesystem just a bit of low-level caching
would easily make the average "get_block()" be very fast indeed.

The fact that right now ext2 has not been optimized for this is _not_ a
reason to design the VFS layer around a slow get_block() operation.

If you look at the generic block-based writing routines, they are
actually not all that expensive. Any kind of complication is only going
to make those functions more complex, and any performance gained could
easily be lost in extra complexity.

There are only two real advantages to deferred writing:

 - not having to do get_block() at all for temp-files, as we never have
   to do the allocation if we end up removing the file.

   NOTE NOTE NOTE! The overhead for trying to get ENOSPC and quota
   errors right is quite possibly big enough that this advantage is
   possibly very questionable. It's very possible that people could
   speed things up using this approach, but I also suspect that it is
   equally (if not more) possible to speed things up by just making sure
   that the low-level FS has a fast get_block().

 - Using "global" access patterns to do a better job of "get_block()",
   ie taking advantage of issues with journalling etc and deferring the
   write in order to get a bigger journal.

The second point is completely different, and THIS is where I think there
are potentially real advantages. However, I also think that this is not
actually about deferred writes at all: it's really a question of the
filesystem having access to the page when the physical write is actually
started, so that the filesystem might choose to _change_ the allocation
it did - it might have allocated a backing store block at "get_block()"
time, but by the time it actually writes the stuff out to disk it may
have allocated a bigger contiguous area somewhere else for the data..

I really think that the #2 thing is the more interesting one, and that
anybody looking at ext2 should look at just improving the locking and
making the block allocation functions run faster. Which should not be
all that difficult - the last time I looked at the thing it was quite
horrible.

		Linus
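[Editor's note: the "bit of low-level caching" Linus mentions can be illustrated with a toy sketch - remember the last lookup in the in-core inode so the common sequential case skips the expensive mapping walk. The structures and names are invented; nothing here is real ext2 code, and the toy assumes allocation is contiguous, which a real filesystem would have to verify before trusting the cached hint.]

```c
#include <assert.h>

int slow_lookups;               /* counts "expensive" tree/bitmap walks */

struct toy_inode {
	long cached_file_block; /* last logical block looked up */
	long cached_disk_block; /* where it landed on disk */
};

/* Stand-in for the expensive indirect-block walk or tree search. The
 * toy pretends allocation is perfectly contiguous from block 1000. */
long slow_map(long file_block)
{
	slow_lookups++;
	return 1000 + file_block;
}

/* Fast path: for sequential I/O, the next logical block is usually the
 * next disk block, so most calls never touch slow_map() at all. */
long cached_get_block(struct toy_inode *inode, long file_block)
{
	if (file_block == inode->cached_file_block + 1) {
		inode->cached_file_block = file_block;
		return ++inode->cached_disk_block;
	}
	inode->cached_file_block = file_block;
	inode->cached_disk_block = slow_map(file_block);
	return inode->cached_disk_block;
}
```

For a sequential write of N blocks the expensive walk runs once instead of N times - the kind of win that makes a fast get_block() a plausible alternative to deferring the allocation entirely.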
Re: [RFC] Generic deferred file writing
On Sat, Dec 30, 2000 at 08:50:52PM -0500, Alexander Viro wrote:
> And its meaning for 2/3 of filesystems would be?

It should stay in the private part of the in-core superblock of course.

> I _doubt_ it. If it is a pagecache issue it should apply to NFS. It
> should apply to ramfs. It should apply to a helluva lot of filesystems
> that are not block-based. Pagecache doesn't (and shouldn't) know about
> blocks.

With pagecache I meant the library of pagecache methods in buffer.c. Even
if they are recalled by the lowlevel filesystem code and they can be
overridden by lowlevel filesystem code, they aren't lowlevel filesystem
code but are in fact common code.

We can implement another version of them that, instead of knowing about
get_block, also knows about another filesystem callback and, when
possible, only reserves the space for a delayed allocation later
triggered (in parallel) by a future kupdate. They will know about this
new callback in the same way the current standard pagecache library
methods know about get_block_t. Filesystems implementing this callback
will be able to use those new pagecache library methods.

> it should use functions that do not expect such an argument. That's it.
> No need to invent new methods or shoehorn all block filesystems into
> the same scheme.

Of course.

Andrea