Re: [PATCH] ext4: dir inode reservation V3
hmm. so you trade 265% degradation of creation for 40% improvement of unlink? thanks, Alex Coly Li wrote: normal ext4 ext4 with dir inode reservation mount options: -o data=writeback -o data=writeback,dir_ireserve=low Create dirs:real0m49.101s real2m59.703s Create files: real24m17.962s real21m8.161s Unlink all: real24m43.788s real17m29.862s Creating dirs with dir inode reservation is slower than normal ext4 as predicted, because allocating directory inodes in non-linear order will cause extra hard disk seeking and block I/O. Creating files with dir inode reservation is 13% faster than normal ext4. Unlink all the directories and files is 29.2% faster as expected. When number of directories is increased, the performance improvement will be more considerable. More benchmark result will be posted here if necessary, because I need more time to run more test cases. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] ext4: dir inode reservation V3
hmm. so you trade 265% degradation of creation for 40% improvement of unlink? thanks, Alex Coly Li wrote: normal ext4 ext4 with dir inode reservation mount options: -o data=writeback -o data=writeback,dir_ireserve=low Create dirs:real0m49.101s real2m59.703s Create files: real24m17.962s real21m8.161s Unlink all: real24m43.788s real17m29.862s Creating dirs with dir inode reservation is slower than normal ext4 as predicted, because allocating directory inodes in non-linear order will cause extra hard disk seeking and block I/O. Creating files with dir inode reservation is 13% faster than normal ext4. Unlink all the directories and files is 29.2% faster as expected. When number of directories is increased, the performance improvement will be more considerable. More benchmark result will be posted here if necessary, because I need more time to run more test cases. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On 9/19/07, David Chinner <[EMAIL PROTECTED]> wrote: > The problem is this: to alter the fundamental block size of the > filesystem we also need to alter the data block size and that is > exactly the piece that linux does not support right now. So while > we have the capability to use large block sizes in certain > filesystems, we can't use that capability until the data path > supports it. it's much simpler to teach fs to understand multipage data (like multipage bitmap scan, multipage extent search, etc) then deal with mm fragmentation. IMHO. at same time you don't bust IO traffic with non-used space. -- thanks, Alex - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On 9/19/07, David Chinner [EMAIL PROTECTED] wrote: The problem is this: to alter the fundamental block size of the filesystem we also need to alter the data block size and that is exactly the piece that linux does not support right now. So while we have the capability to use large block sizes in certain filesystems, we can't use that capability until the data path supports it. it's much simpler to teach fs to understand multipage data (like multipage bitmap scan, multipage extent search, etc) then deal with mm fragmentation. IMHO. at same time you don't bust IO traffic with non-used space. -- thanks, Alex - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] ext4:fix unexpected error from ext4_reserve_global
ACK, of course. thanks, Alex Mingming Cao wrote: On Thu, 2007-06-14 at 19:29 +0400, Dmitriy Monakhov wrote: I just cant belive my eyes then i saw this at the first time... simple test: strace dd if=/dev/zero of=/mnt/file Thanks for reporting it. open("/dev/zero", O_RDONLY) = 0 close(1)= 0 open("/mnt/test/file", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 1 read(0, "\0\0\0\0\0\0\0\0\0"..., 512) = 512 write(1, "\0\0\0\0\0\0\0\0"..., 512) = 512 read(0, "\0\0\0\0\0\0\0\0\0"..., 512) = 512 write(1, "\0\0\0\0\0\0\0\0"..., 512) = -1 ENOENT (No such fil e or directory) This strange error returned from ext4_reserve_global(). It's just typo because: a) In fact this is 100% ENOSPC situation b) simular function ext4_reserve_local() returns -ENOSPC I agree. Patch is put in ext4 patch queue. Alex if you can Ack, that would be great. Thanks, Mingming Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]> --- fs/ext4/balloc.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c index 4d7bfd2..43ae8f8 100644 --- a/fs/ext4/balloc.c +++ b/fs/ext4/balloc.c @@ -1920,7 +1920,7 @@ int ext4_reserve_global(struct super_block *sb, int blocks) { struct ext4_sb_info *sbi = EXT4_SB(sb); struct ext4_reservation_slot *rs; - int i, rc = -ENOENT; + int i, rc = -ENOSPC; __u64 free = 0; rs = sbi->s_reservation_slots; - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] ext4:fix unexpected error from ext4_reserve_global
ACK, of course. thanks, Alex Mingming Cao wrote: On Thu, 2007-06-14 at 19:29 +0400, Dmitriy Monakhov wrote: I just cant belive my eyes then i saw this at the first time... simple test: strace dd if=/dev/zero of=/mnt/file Thanks for reporting it. open(/dev/zero, O_RDONLY) = 0 close(1)= 0 open(/mnt/test/file, O_WRONLY|O_CREAT|O_TRUNC, 0666) = 1 read(0, \0\0\0\0\0\0\0\0\0..., 512) = 512 write(1, \0\0\0\0\0\0\0\0..., 512) = 512 read(0, \0\0\0\0\0\0\0\0\0..., 512) = 512 write(1, \0\0\0\0\0\0\0\0..., 512) = -1 ENOENT (No such fil e or directory) This strange error returned from ext4_reserve_global(). It's just typo because: a) In fact this is 100% ENOSPC situation b) simular function ext4_reserve_local() returns -ENOSPC I agree. Patch is put in ext4 patch queue. Alex if you can Ack, that would be great. Thanks, Mingming Signed-off-by: Dmitriy Monakhov [EMAIL PROTECTED] --- fs/ext4/balloc.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c index 4d7bfd2..43ae8f8 100644 --- a/fs/ext4/balloc.c +++ b/fs/ext4/balloc.c @@ -1920,7 +1920,7 @@ int ext4_reserve_global(struct super_block *sb, int blocks) { struct ext4_sb_info *sbi = EXT4_SB(sb); struct ext4_reservation_slot *rs; - int i, rc = -ENOENT; + int i, rc = -ENOSPC; __u64 free = 0; rs = sbi-s_reservation_slots; - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
Andrew Morton wrote: I'm still not understanding. The terms you're using are a bit ambiguous. What does "find some dirty unallocated blocks" mean? Find a page which is dirty and which does not have a disk mapping? Normally the above operation would be implemented via ext4_writeback_writepage(), and it runs under lock_page(). I'm mostly worried about delayed allocation case. My impression was that holding number of pages locked isn't a good idea, even if they're locked in index order. so, I was going to turn number of pages writeback, then allocate blocks for all of them at once, then put proper blocknr's into bh's (or PG_mappedtodisk?). going to commit find inode I dirty do NOT find these blocks because they're allocated only, but pages/bhs aren't mapped to them start commit I think you're assuming here that commit would be using ->t_sync_datalist to locate dirty buffer_heads. nope, I mean sb->inode->page walk. But under this proposal, t_sync_datalist just gets removed: the new ordered-data mode _only_ need to do the sb->inode->page walk. So if I'm understanding you, the way in which we'd handle any such race is to make kjournald's writeback of the dirty pages block in lock_page(). Once it gets the page lock it can look to see if some other thread has mapped the page to disk. if I'm right holding number of pages locked, then they won't be locked, but writeback. of course kjournald can block on writeback as well, but how does it find pages with *newly allocated* blocks only? It may turn out that kjournald needs a private way of getting at the I_DIRTY_PAGES inodes to do this properly, but I don't _think_ so. If we had the radix-tree-of-dirty-inodes thing then that's easy enough to do anyway, with a tagged search. But I expect that a single pass through the superblock's dirty inodes would suffice for ordered-data. Files which have chattr +j would screw things up, as usual. not dirty inodes only, but rather some fast way to find pages with newly allocated pages. I assume (hope) that your delayed allocation code implements ->writepages()? Doing the allocation one-page-at-a-time sounds painful... indeed. this is a root cause of all this complexity. thanks, Alex - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
Andrew Morton wrote: On Fri, 04 May 2007 10:18:12 +0400 Alex Tomas <[EMAIL PROTECTED]> wrote: Andrew Morton wrote: Yes, there can be issues with needing to allocate journal space within the context of a commit. But no-no, this isn't required. we only need to mark pages/blocks within transaction, otherwise race is possible when we allocate blocks in transaction, then transacton starts to commit, then we mark pages/blocks to be flushed before commit. I don't understand. Can you please describe the race in more detail? if I understood your idea right, then in data=ordered mode, commit thread writes all dirty mapped blocks before real commit. say, we have two threads: t1 is a thread doing flushing and t2 is a commit thread t1 t2 find dirty inode I find some dirty unallocated blocks journal_start() allocate blocks attach them to I journal_stop() going to commit find inode I dirty do NOT find these blocks because they're allocated only, but pages/bhs aren't mapped to them start commit map pages/bhs to just allocate blocks so, either we mark pages/bhs someway within journal_start()--journal_stop() or commit thread should do lookup for all dirty pages. the latter doesn't sound nice, IMHO. thanks, Alex - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
Andrew Morton wrote: Yes, there can be issues with needing to allocate journal space within the context of a commit. But no-no, this isn't required. we only need to mark pages/blocks within transaction, otherwise race is possible when we allocate blocks in transaction, then transacton starts to commit, then we mark pages/blocks to be flushed before commit. a) If the page has newly allocated space on disk then the metadata which refers to that page is already in the journal: no new journal space needed. b) If the page doesn't have space allocated on disk then we don't need to write it out at ordered-mode commit time, because the post-recovery filesystem will not have any references to that page. c) If the page is dirty due to overwrite then no metadata update was required. IOW, under what circumstances would an ordered-mode commit need to allocate space for a delayed-allocate page? no need to allocate space within commit thread, I think. only to take care of the race I described above. in hackish version of data=ordered for delayed allocation I used counter of submitted bio's with newly-allocated blocks and commit thread waits for the counter to reach 0. However b) might lead to the hey-my-file-is-full-of-zeroes problem. thanks, Alex - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [ext3][kernels = 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
Andrew Morton wrote: Yes, there can be issues with needing to allocate journal space within the context of a commit. But no-no, this isn't required. we only need to mark pages/blocks within transaction, otherwise race is possible when we allocate blocks in transaction, then transacton starts to commit, then we mark pages/blocks to be flushed before commit. a) If the page has newly allocated space on disk then the metadata which refers to that page is already in the journal: no new journal space needed. b) If the page doesn't have space allocated on disk then we don't need to write it out at ordered-mode commit time, because the post-recovery filesystem will not have any references to that page. c) If the page is dirty due to overwrite then no metadata update was required. IOW, under what circumstances would an ordered-mode commit need to allocate space for a delayed-allocate page? no need to allocate space within commit thread, I think. only to take care of the race I described above. in hackish version of data=ordered for delayed allocation I used counter of submitted bio's with newly-allocated blocks and commit thread waits for the counter to reach 0. However b) might lead to the hey-my-file-is-full-of-zeroes problem. thanks, Alex - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [ext3][kernels = 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
Andrew Morton wrote: On Fri, 04 May 2007 10:18:12 +0400 Alex Tomas [EMAIL PROTECTED] wrote: Andrew Morton wrote: Yes, there can be issues with needing to allocate journal space within the context of a commit. But no-no, this isn't required. we only need to mark pages/blocks within transaction, otherwise race is possible when we allocate blocks in transaction, then transacton starts to commit, then we mark pages/blocks to be flushed before commit. I don't understand. Can you please describe the race in more detail? if I understood your idea right, then in data=ordered mode, commit thread writes all dirty mapped blocks before real commit. say, we have two threads: t1 is a thread doing flushing and t2 is a commit thread t1 t2 find dirty inode I find some dirty unallocated blocks journal_start() allocate blocks attach them to I journal_stop() going to commit find inode I dirty do NOT find these blocks because they're allocated only, but pages/bhs aren't mapped to them start commit map pages/bhs to just allocate blocks so, either we mark pages/bhs someway within journal_start()--journal_stop() or commit thread should do lookup for all dirty pages. the latter doesn't sound nice, IMHO. thanks, Alex - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [ext3][kernels = 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
Andrew Morton wrote: I'm still not understanding. The terms you're using are a bit ambiguous. What does find some dirty unallocated blocks mean? Find a page which is dirty and which does not have a disk mapping? Normally the above operation would be implemented via ext4_writeback_writepage(), and it runs under lock_page(). I'm mostly worried about delayed allocation case. My impression was that holding number of pages locked isn't a good idea, even if they're locked in index order. so, I was going to turn number of pages writeback, then allocate blocks for all of them at once, then put proper blocknr's into bh's (or PG_mappedtodisk?). going to commit find inode I dirty do NOT find these blocks because they're allocated only, but pages/bhs aren't mapped to them start commit I think you're assuming here that commit would be using -t_sync_datalist to locate dirty buffer_heads. nope, I mean sb-inode-page walk. But under this proposal, t_sync_datalist just gets removed: the new ordered-data mode _only_ need to do the sb-inode-page walk. So if I'm understanding you, the way in which we'd handle any such race is to make kjournald's writeback of the dirty pages block in lock_page(). Once it gets the page lock it can look to see if some other thread has mapped the page to disk. if I'm right holding number of pages locked, then they won't be locked, but writeback. of course kjournald can block on writeback as well, but how does it find pages with *newly allocated* blocks only? It may turn out that kjournald needs a private way of getting at the I_DIRTY_PAGES inodes to do this properly, but I don't _think_ so. If we had the radix-tree-of-dirty-inodes thing then that's easy enough to do anyway, with a tagged search. But I expect that a single pass through the superblock's dirty inodes would suffice for ordered-data. Files which have chattr +j would screw things up, as usual. not dirty inodes only, but rather some fast way to find pages with newly allocated pages. I assume (hope) that your delayed allocation code implements -writepages()? Doing the allocation one-page-at-a-time sounds painful... indeed. this is a root cause of all this complexity. thanks, Alex - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
Andrew Morton wrote: We can make great improvements here, and I've (twice) previously decribed how: hoist the entire ordered-mode data handling out of ext3, and out of the buffer_head layer and move it up into the VFS pagecache layer. Basically, do ordered-data with a commit-time inode walk, calling do_sync_mapping_range(). Do it in the VFS. Make reiserfs use it, remove reiserfs ordered-mode too. Make XFS use it, fix the hey-my-files-are-all-full-of-zeroes problem there. I'm not sure it's that easy. if we move to pages, then we have to mark pages to be flushed holding transaction open. now take delayed allocation into account: we need to allocate number of blocks at once and then mark all pages mapped, again within context of the same transaction. so, an implementation would look like the following? generic_writepages() { /* collect set of contig. dirty pages */ foo_get_blocks() { foo_journal_start(); foo_new_blocks(); foo_attach_blocks_to_inode(); generic_mark_pages_mapped(); foo_journal_stop(); } } another question is will it scale well given number of dirty inodes can be much larger than number of inodes with dirty mapped blocks (in delayed allocation case, for example) ? thanks, Alex - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [ext3][kernels = 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)
Andrew Morton wrote: We can make great improvements here, and I've (twice) previously decribed how: hoist the entire ordered-mode data handling out of ext3, and out of the buffer_head layer and move it up into the VFS pagecache layer. Basically, do ordered-data with a commit-time inode walk, calling do_sync_mapping_range(). Do it in the VFS. Make reiserfs use it, remove reiserfs ordered-mode too. Make XFS use it, fix the hey-my-files-are-all-full-of-zeroes problem there. I'm not sure it's that easy. if we move to pages, then we have to mark pages to be flushed holding transaction open. now take delayed allocation into account: we need to allocate number of blocks at once and then mark all pages mapped, again within context of the same transaction. so, an implementation would look like the following? generic_writepages() { /* collect set of contig. dirty pages */ foo_get_blocks() { foo_journal_start(); foo_new_blocks(); foo_attach_blocks_to_inode(); generic_mark_pages_mapped(); foo_journal_stop(); } } another question is will it scale well given number of dirty inodes can be much larger than number of inodes with dirty mapped blocks (in delayed allocation case, for example) ? thanks, Alex - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.21-ext4-1
Theodore Ts'o wrote: P.S. One bug which I've noted --- if there is a failure due to disk filling up, running e2fsck on the filesystem will show that the i_blocks fields on the inodes where there was a failure to allocate disk blocks are left incorrect. I'm guessing this is a bug in the delayed allocation patches. Alex, when you have a moment, could you take a look? Thanks!! definitely. thanks for the report. thanks, Alex - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.21-ext4-1
Theodore Ts'o wrote: P.S. One bug which I've noted --- if there is a failure due to disk filling up, running e2fsck on the filesystem will show that the i_blocks fields on the inodes where there was a failure to allocate disk blocks are left incorrect. I'm guessing this is a bug in the delayed allocation patches. Alex, when you have a moment, could you take a look? Thanks!! definitely. thanks for the report. thanks, Alex - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: O_DIRECT question
I think one problem with mmap/msync is that they can't maintain i_size atomically like regular write does. so, one needs to implement own i_size management in userspace. thanks, Alex > Side note: the only reason O_DIRECT exists is because database people are > too used to it, because other OS's haven't had enough taste to tell them > to do it right, so they've historically hacked their OS to get out of the > way. > As a result, our madvise and/or posix_fadvise interfaces may not be all > that strong, because people sadly don't use them that much. It's a sad > example of a totally broken interface (O_DIRECT) resulting in better > interfaces not getting used, and then not getting as much development > effort put into them. > So O_DIRECT not only is a total disaster from a design standpoint (just > look at all the crap it results in), it also indirectly has hurt better > interfaces. For example, POSIX_FADV_NOREUSE (which _could_ be a useful and > clean interface to make sure we don't pollute memory unnecessarily with > cached pages after they are all done) ends up being a no-op ;/ > Sad. And it's one of those self-fulfilling prophecies. Still, I hope some > day we can just rip the damn disaster out. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: O_DIRECT question
I think one problem with mmap/msync is that they can't maintain i_size atomically like regular write does. so, one needs to implement own i_size management in userspace. thanks, Alex Side note: the only reason O_DIRECT exists is because database people are too used to it, because other OS's haven't had enough taste to tell them to do it right, so they've historically hacked their OS to get out of the way. As a result, our madvise and/or posix_fadvise interfaces may not be all that strong, because people sadly don't use them that much. It's a sad example of a totally broken interface (O_DIRECT) resulting in better interfaces not getting used, and then not getting as much development effort put into them. So O_DIRECT not only is a total disaster from a design standpoint (just look at all the crap it results in), it also indirectly has hurt better interfaces. For example, POSIX_FADV_NOREUSE (which _could_ be a useful and clean interface to make sure we don't pollute memory unnecessarily with cached pages after they are all done) ends up being a no-op ;/ Sad. And it's one of those self-fulfilling prophecies. Still, I hope some day we can just rip the damn disaster out. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] return ENOENT from ext3_link when racing with unlink
> Peter Staubach (PS) writes: PS> Just out of curosity, what keeps i_nlink from going to 0 immediately PS> after the new test is executed? i_mutex in vfs_link() and vfs_unlink() thanks, Alex - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] return ENOENT from ext3_link when racing with unlink
Peter Staubach (PS) writes: PS Just out of curosity, what keeps i_nlink from going to 0 immediately PS after the new test is executed? i_mutex in vfs_link() and vfs_unlink() thanks, Alex - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] remove ext3 inode from orphan list when link and unlink race
> Eric Sandeen (ES) writes: ES> Al says "no" and I'm not arguing. :) ES> Apparently this may be OK with some filesystems, and Al says he doesn't ES> want to know about i_nlink in the vfs in any case. well, generic_drop_inode() uses i_nlink ... ES> But I suppose there may be other filesystems which DO care, and should ES> be checking if they're not. this is why I thought VFS could take care. thanks, Alex - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] remove ext3 inode from orphan list when link and unlink race
> Eric Sandeen (ES) writes: ES> I tend to agree, chatting w/ Al I think he does too. :) I'll test ES> a patch that kicks out ext3_link() with -ENOENT at the top, and resubmit ES> that if things go well. shouldn't VFS do that? thanks, Alex - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] remove ext3 inode from orphan list when link and unlink race
> Eric Sandeen (ES) writes: ES> so I think it's possible that link can sneak in there & find it after ES> the mutex is dropped...? Is this ok? :) It's certainly -happening- ES> anyway yes, but it shouldn't allow to re-link such inode back, IMHO. a filesystem may start some non-revertable activity in its unlink method. thanks, Alex - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] remove ext3 inode from orphan list when link and unlink race
ah, it seems vfs_link() doesn't check whether source is still alive. for example, in mkdir case vfs_mkdir() calls may_create() and checks the parent is still there: if (IS_DEADDIR(dir)) return -ENOENT; VFS doesn't set S_DEAD on regular files, but we could check i_nlink. thanks, Alex >>>>> Alex Tomas (AT) writes: AT> interesting .. AT> I thought VFS doesn't allow concurrent operations. AT> if unlink goes first, then link should wait on the AT> parent's i_mutex and then found no source name. AT> thanks, Alex >>>>> Eric Sandeen (ES) writes: ES> ) ES> I've been looking at a case where many threads are opening, unlinking, and ES> hardlinking files on ext3 . At unmount time I see an oops, because the superblock's ES> orphan list points to a freed inode. ES> I did some tracing of the inodes, and it looks like this: ES> ext3_unlink():[/src/linux-2.6.18/fs/ext3/namei.c:2123] adding orphan ES> i_state:0x7 cpu:1 i_count:2 i_nlink:0 ES> ext3_orphan_add():[/src/linux-2.6.18/fs/ext3/namei.c:1890] ext3_orphan_add ES> i_state:0x7 cpu:1 i_count:2 i_nlink:0 ES> iput():[/src/linux-2.6.18/fs/inode.c:1139] iput enter ES> i_state:0x7 cpu:1 i_count:2 i_nlink:0 ES> ext3_link():[/src/linux-2.6.18/fs/ext3/namei.c:2202] ext3_link enter ES> i_state:0x7 cpu:3 i_count:1 i_nlink:0 ES> ext3_inc_count():[/src/linux-2.6.18/fs/ext3/namei.c:1627] done ES> i_state:0x7 cpu:3 i_count:1 i_nlink:1 ES> The unlink gets there first, finds i_count > 0 (in use) but nlink goes to 0, so ES> it puts it on the orphan inode list. Then link comes along, and bumps the link ES> back up to 1. So now we are on the orphan inode list, but we are not unlinked. ES> Eventually when count goes to 0, and we still have 1 link, again no action is ES> taken to remove the inode from the orphan list, because it is still linked (i.e. ES> we don't go through ext3_delete()) ES> When this inode is eventually freed, the sb orphan list gets corrupted, because ES> we have freed it without first removing it from the orphan list. ES> I think the simple solution is to remove the inode from the orphan list ES> when we bump the link back up from 0 to 1. I put that test in there because ES> there are other potential reasons that we might be on the list (truncates, ES> direct IO). ES> Comments? ES> Thanks, ES> -Eric ES> p.s. ext3_inc_count and ext3_dec_count seem misnamed, have an unused ES> arg, and are very infrequently called. I'll probably submit a patch ES> to just put the single line of code into the caller, too. ES> --- ES> Remove inode from the orphan list in ext3_link() if we might have ES> raced with ext3_unlink(), which potentially put it on the list. ES> If we're on the list with nlink > 0, we'll never get cleaned up ES> properly and eventually may corrupt the list. ES> Signed-off-by: Eric Sandeen <[EMAIL PROTECTED]> ES> Index: linux-2.6.19/fs/ext3/namei.c ES> === ES> --- linux-2.6.19.orig/fs/ext3/namei.c ES> +++ linux-2.6.19/fs/ext3/namei.c ES> @@ -2204,6 +2204,9 @@ retry: inode-> i_ctime = CURRENT_TIME_SEC; ES> ext3_inc_count(handle, inode); ES> atomic_inc(>i_count); ES> + /* did we race w/ unlink? */ ES> + if (inode->i_nlink == 1) ES> + ext3_orphan_del(handle, inode); ES> err = ext3_add_nondir(handle, dentry, inode); ES> ext3_journal_stop(handle); ES> - ES> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in ES> the body of a message to [EMAIL PROTECTED] ES> More majordomo info at http://vger.kernel.org/majordomo-info.html AT> - AT> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in AT> the body of a message to [EMAIL PROTECTED] AT> More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] remove ext3 inode from orphan list when link and unlink race
interesting .. I thought VFS doesn't allow concurrent operations. if unlink goes first, then link should wait on the parent's i_mutex and then found no source name. thanks, Alex > Eric Sandeen (ES) writes: ES> ) ES> I've been looking at a case where many threads are opening, unlinking, and ES> hardlinking files on ext3 . At unmount time I see an oops, because the superblock's ES> orphan list points to a freed inode. ES> I did some tracing of the inodes, and it looks like this: ES> ext3_unlink():[/src/linux-2.6.18/fs/ext3/namei.c:2123] adding orphan ES> i_state:0x7 cpu:1 i_count:2 i_nlink:0 ES> ext3_orphan_add():[/src/linux-2.6.18/fs/ext3/namei.c:1890] ext3_orphan_add ES> i_state:0x7 cpu:1 i_count:2 i_nlink:0 ES> iput():[/src/linux-2.6.18/fs/inode.c:1139] iput enter ES> i_state:0x7 cpu:1 i_count:2 i_nlink:0 ES> ext3_link():[/src/linux-2.6.18/fs/ext3/namei.c:2202] ext3_link enter ES> i_state:0x7 cpu:3 i_count:1 i_nlink:0 ES> ext3_inc_count():[/src/linux-2.6.18/fs/ext3/namei.c:1627] done ES> i_state:0x7 cpu:3 i_count:1 i_nlink:1 ES> The unlink gets there first, finds i_count > 0 (in use) but nlink goes to 0, so ES> it puts it on the orphan inode list. Then link comes along, and bumps the link ES> back up to 1. So now we are on the orphan inode list, but we are not unlinked. ES> Eventually when count goes to 0, and we still have 1 link, again no action is ES> taken to remove the inode from the orphan list, because it is still linked (i.e. ES> we don't go through ext3_delete()) ES> When this inode is eventually freed, the sb orphan list gets corrupted, because ES> we have freed it without first removing it from the orphan list. ES> I think the simple solution is to remove the inode from the orphan list ES> when we bump the link back up from 0 to 1. I put that test in there because ES> there are other potential reasons that we might be on the list (truncates, ES> direct IO). ES> Comments? ES> Thanks, ES> -Eric ES> p.s. ext3_inc_count and ext3_dec_count seem misnamed, have an unused ES> arg, and are very infrequently called. I'll probably submit a patch ES> to just put the single line of code into the caller, too. ES> --- ES> Remove inode from the orphan list in ext3_link() if we might have ES> raced with ext3_unlink(), which potentially put it on the list. ES> If we're on the list with nlink > 0, we'll never get cleaned up ES> properly and eventually may corrupt the list. ES> Signed-off-by: Eric Sandeen <[EMAIL PROTECTED]> ES> Index: linux-2.6.19/fs/ext3/namei.c ES> === ES> --- linux-2.6.19.orig/fs/ext3/namei.c ES> +++ linux-2.6.19/fs/ext3/namei.c ES> @@ -2204,6 +2204,9 @@ retry: inode-> i_ctime = CURRENT_TIME_SEC; ES>ext3_inc_count(handle, inode); ES>atomic_inc(>i_count); ES> + /* did we race w/ unlink? */ ES> + if (inode->i_nlink == 1) ES> + ext3_orphan_del(handle, inode); ES>err = ext3_add_nondir(handle, dentry, inode); ES>ext3_journal_stop(handle); ES> - ES> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in ES> the body of a message to [EMAIL PROTECTED] ES> More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] remove ext3 inode from orphan list when link and unlink race
interesting .. I thought VFS doesn't allow concurrent operations. if unlink goes first, then link should wait on the parent's i_mutex and then found no source name. thanks, Alex Eric Sandeen (ES) writes: ES ) ES I've been looking at a case where many threads are opening, unlinking, and ES hardlinking files on ext3 . At unmount time I see an oops, because the superblock's ES orphan list points to a freed inode. ES I did some tracing of the inodes, and it looks like this: ES ext3_unlink():[/src/linux-2.6.18/fs/ext3/namei.c:2123] adding orphan ES i_state:0x7 cpu:1 i_count:2 i_nlink:0 ES ext3_orphan_add():[/src/linux-2.6.18/fs/ext3/namei.c:1890] ext3_orphan_add ES i_state:0x7 cpu:1 i_count:2 i_nlink:0 ES iput():[/src/linux-2.6.18/fs/inode.c:1139] iput enter ES i_state:0x7 cpu:1 i_count:2 i_nlink:0 ES ext3_link():[/src/linux-2.6.18/fs/ext3/namei.c:2202] ext3_link enter ES i_state:0x7 cpu:3 i_count:1 i_nlink:0 ES ext3_inc_count():[/src/linux-2.6.18/fs/ext3/namei.c:1627] done ES i_state:0x7 cpu:3 i_count:1 i_nlink:1 ES The unlink gets there first, finds i_count 0 (in use) but nlink goes to 0, so ES it puts it on the orphan inode list. Then link comes along, and bumps the link ES back up to 1. So now we are on the orphan inode list, but we are not unlinked. ES Eventually when count goes to 0, and we still have 1 link, again no action is ES taken to remove the inode from the orphan list, because it is still linked (i.e. ES we don't go through ext3_delete()) ES When this inode is eventually freed, the sb orphan list gets corrupted, because ES we have freed it without first removing it from the orphan list. ES I think the simple solution is to remove the inode from the orphan list ES when we bump the link back up from 0 to 1. I put that test in there because ES there are other potential reasons that we might be on the list (truncates, ES direct IO). ES Comments? ES Thanks, ES -Eric ES p.s. ext3_inc_count and ext3_dec_count seem misnamed, have an unused ES arg, and are very infrequently called. I'll probably submit a patch ES to just put the single line of code into the caller, too. ES --- ES Remove inode from the orphan list in ext3_link() if we might have ES raced with ext3_unlink(), which potentially put it on the list. ES If we're on the list with nlink 0, we'll never get cleaned up ES properly and eventually may corrupt the list. ES Signed-off-by: Eric Sandeen [EMAIL PROTECTED] ES Index: linux-2.6.19/fs/ext3/namei.c ES === ES --- linux-2.6.19.orig/fs/ext3/namei.c ES +++ linux-2.6.19/fs/ext3/namei.c ES @@ -2204,6 +2204,9 @@ retry: inode- i_ctime = CURRENT_TIME_SEC; ESext3_inc_count(handle, inode); ESatomic_inc(inode-i_count); ES + /* did we race w/ unlink? */ ES + if (inode-i_nlink == 1) ES + ext3_orphan_del(handle, inode); ESerr = ext3_add_nondir(handle, dentry, inode); ESext3_journal_stop(handle); ES - ES To unsubscribe from this list: send the line unsubscribe linux-ext4 in ES the body of a message to [EMAIL PROTECTED] ES More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] remove ext3 inode from orphan list when link and unlink race
ah, it seems vfs_link() doesn't check whether source is still alive. for example, in mkdir case vfs_mkdir() calls may_create() and checks the parent is still there: if (IS_DEADDIR(dir)) return -ENOENT; VFS doesn't set S_DEAD on regular files, but we could check i_nlink. thanks, Alex Alex Tomas (AT) writes: AT interesting .. AT I thought VFS doesn't allow concurrent operations. AT if unlink goes first, then link should wait on the AT parent's i_mutex and then found no source name. AT thanks, Alex Eric Sandeen (ES) writes: ES ) ES I've been looking at a case where many threads are opening, unlinking, and ES hardlinking files on ext3 . At unmount time I see an oops, because the superblock's ES orphan list points to a freed inode. ES I did some tracing of the inodes, and it looks like this: ES ext3_unlink():[/src/linux-2.6.18/fs/ext3/namei.c:2123] adding orphan ES i_state:0x7 cpu:1 i_count:2 i_nlink:0 ES ext3_orphan_add():[/src/linux-2.6.18/fs/ext3/namei.c:1890] ext3_orphan_add ES i_state:0x7 cpu:1 i_count:2 i_nlink:0 ES iput():[/src/linux-2.6.18/fs/inode.c:1139] iput enter ES i_state:0x7 cpu:1 i_count:2 i_nlink:0 ES ext3_link():[/src/linux-2.6.18/fs/ext3/namei.c:2202] ext3_link enter ES i_state:0x7 cpu:3 i_count:1 i_nlink:0 ES ext3_inc_count():[/src/linux-2.6.18/fs/ext3/namei.c:1627] done ES i_state:0x7 cpu:3 i_count:1 i_nlink:1 ES The unlink gets there first, finds i_count 0 (in use) but nlink goes to 0, so ES it puts it on the orphan inode list. Then link comes along, and bumps the link ES back up to 1. So now we are on the orphan inode list, but we are not unlinked. ES Eventually when count goes to 0, and we still have 1 link, again no action is ES taken to remove the inode from the orphan list, because it is still linked (i.e. ES we don't go through ext3_delete()) ES When this inode is eventually freed, the sb orphan list gets corrupted, because ES we have freed it without first removing it from the orphan list. ES I think the simple solution is to remove the inode from the orphan list ES when we bump the link back up from 0 to 1. I put that test in there because ES there are other potential reasons that we might be on the list (truncates, ES direct IO). ES Comments? ES Thanks, ES -Eric ES p.s. ext3_inc_count and ext3_dec_count seem misnamed, have an unused ES arg, and are very infrequently called. I'll probably submit a patch ES to just put the single line of code into the caller, too. ES --- ES Remove inode from the orphan list in ext3_link() if we might have ES raced with ext3_unlink(), which potentially put it on the list. ES If we're on the list with nlink 0, we'll never get cleaned up ES properly and eventually may corrupt the list. ES Signed-off-by: Eric Sandeen [EMAIL PROTECTED] ES Index: linux-2.6.19/fs/ext3/namei.c ES === ES --- linux-2.6.19.orig/fs/ext3/namei.c ES +++ linux-2.6.19/fs/ext3/namei.c ES @@ -2204,6 +2204,9 @@ retry: inode- i_ctime = CURRENT_TIME_SEC; ES ext3_inc_count(handle, inode); ES atomic_inc(inode-i_count); ES + /* did we race w/ unlink? */ ES + if (inode-i_nlink == 1) ES + ext3_orphan_del(handle, inode); ES err = ext3_add_nondir(handle, dentry, inode); ES ext3_journal_stop(handle); ES - ES To unsubscribe from this list: send the line unsubscribe linux-ext4 in ES the body of a message to [EMAIL PROTECTED] ES More majordomo info at http://vger.kernel.org/majordomo-info.html AT - AT To unsubscribe from this list: send the line unsubscribe linux-ext4 in AT the body of a message to [EMAIL PROTECTED] AT More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] remove ext3 inode from orphan list when link and unlink race
Eric Sandeen (ES) writes: ES so I think it's possible that link can sneak in there find it after ES the mutex is dropped...? Is this ok? :) It's certainly -happening- ES anyway yes, but it shouldn't allow to re-link such inode back, IMHO. a filesystem may start some non-revertable activity in its unlink method. thanks, Alex - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] remove ext3 inode from orphan list when link and unlink race
Eric Sandeen (ES) writes: ES I tend to agree, chatting w/ Al I think he does too. :) I'll test ES a patch that kicks out ext3_link() with -ENOENT at the top, and resubmit ES that if things go well. shouldn't VFS do that? thanks, Alex - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] remove ext3 inode from orphan list when link and unlink race
Eric Sandeen (ES) writes: ES Al says no and I'm not arguing. :) ES Apparently this may be OK with some filesystems, and Al says he doesn't ES want to know about i_nlink in the vfs in any case. well, generic_drop_inode() uses i_nlink ... ES But I suppose there may be other filesystems which DO care, and should ES be checking if they're not. this is why I thought VFS could take care. thanks, Alex - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] delayed allocation for ext4
> David Chinner (DC) writes: DC> So that mean's we'll have 2 separate mechanisms for marking DC> pages as delalloc. XFS uses the BH_delay flag to indicate DC> that a buffer (block) attached to the page is using delalloc. >> >> well, for blocksize=pagesize we can save 56 bytes on every page. DC> Sure, but it means that ext4 w/ delalloc won't work on lots of DC> machines it does not currenly. but I'm going to implement that. not sure whether it's worth to have two different codepaths for block size=page size and block size < page size. DC> FWIW, how does this mechanism deal with block size < page size? DC> Don't you have to track delalloc on a block basis rather than DC> a page basis? >> >> I'm still thinking how better to deal with that w/o much code duplication. DC> Code duplication in ext4, or across all filesystems? given what Andrew said about moving the code into VFS, it's more about all filesystems. thanks, Alex - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] delayed allocation for ext4
David Chinner (DC) writes: DC So that mean's we'll have 2 separate mechanisms for marking DC pages as delalloc. XFS uses the BH_delay flag to indicate DC that a buffer (block) attached to the page is using delalloc. well, for blocksize=pagesize we can save 56 bytes on every page. DC Sure, but it means that ext4 w/ delalloc won't work on lots of DC machines it does not currenly. but I'm going to implement that. not sure whether it's worth to have two different codepaths for block size=page size and block size page size. DC FWIW, how does this mechanism deal with block size page size? DC Don't you have to track delalloc on a block basis rather than DC a page basis? I'm still thinking how better to deal with that w/o much code duplication. DC Code duplication in ext4, or across all filesystems? given what Andrew said about moving the code into VFS, it's more about all filesystems. thanks, Alex - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] ext4-block-reservation.patch
Hi, > Andrew Morton (AM) writes: AM> Should be cacheline_aligned_in_smp. AM> That's assuming it needs to be cacheline aligned at all. It can consume a AM> lot of space. the idea is to make block reservation cheap because it's called for every page. AM> AM> oh, this should be allocated with alloc_percpu(), in which case the AM> open-coded alignment can perhaps go away. got it. >> + >> +int ext4_reserve_local(struct super_block *sb, int blocks) >> +{ >> + struct ext4_sb_info *sbi = EXT4_SB(sb); >> + struct ext4_reservation_slot *rs; >> + int rc = -ENOSPC; >> + >> + preempt_disable(); >> + rs = sbi->s_reservation_slots + smp_processor_id(); AM> use get_cpu() here. ok. >> +void ext4_rebalance_reservation(struct ext4_reservation_slot *rs, __u64 >> free) >> +{ >> + int i, used_slots = 0; >> + __u64 chunk; >> + >> + /* let's know what slots have been used */ >> + for (i = 0; i < NR_CPUS; i++) >> + if (rs[i].rs_reserved || i == smp_processor_id()) >> + used_slots++; >> + >> + /* chunk is a number of block every used >> +* slot will get. make sure it isn't 0 */ >> + chunk = free + used_slots - 1; >> + do_div(chunk, used_slots); >> + >> + for (i = 0; i < NR_CPUS; i++) { AM> all these NR_CPUS loops need to go away. Use either AM> for_each_possible_cpu() or, preferably, for_each_online_cpu() and a hotplug AM> notifier. hmm, i see. AM> Why is this code using per-cpu data at all, btw? These optimisations tend AM> to be marginal in filesystems. What is the perfomance impact of making AM> this data be single-superblock-wide-instance? well, even on 2way box a single-lock reservation was in top10. thanks, Alex - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] delayed allocation for ext4
> Christoph Hellwig (CH) writes: CH> Note that recording delayed alloc state at a page granularity in addition CH> to just the buffer heads has a lot of advantages aswell and would help CH> xfs, too. But I think it makes a lot more sense to record it as a radix CH> tree tag to speed up the gang lookups for delalloc conversion. please, exaplein about radix tree tag. in ext4-delalloc I use this bit the only way - to avoid multiple reservation space for same page. I guess you need to find non-allocated pages. probably to flush them and update number of reserved blocks in case of -ENOSPC? thanks, Alex - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] delayed allocation for ext4
Good day, > David Chinner (DC) writes: DC> So that mean's we'll have 2 separate mechanisms for marking DC> pages as delalloc. XFS uses the BH_delay flag to indicate DC> that a buffer (block) attached to the page is using delalloc. well, for blocksize=pagesize we can save 56 bytes on every page. DC> FWIW, how does this mechanism deal with block size < page size? DC> Don't you have to track delalloc on a block basis rather than DC> a page basis? I'm still thinking how better to deal with that w/o much code duplication. DC> Ah, that's why you can get away with a page flag - you've ignored DC> the partial page delay state problem. Any plans to use the DC> existing method in the future so we will be able to use ext4 delalloc DC> on machines with a page size larger than 4k? what do you mean by "exsiting"? BH_delay? thanks, Alex - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] delayed allocation for ext4
Good day, David Chinner (DC) writes: DC So that mean's we'll have 2 separate mechanisms for marking DC pages as delalloc. XFS uses the BH_delay flag to indicate DC that a buffer (block) attached to the page is using delalloc. well, for blocksize=pagesize we can save 56 bytes on every page. DC FWIW, how does this mechanism deal with block size page size? DC Don't you have to track delalloc on a block basis rather than DC a page basis? I'm still thinking how better to deal with that w/o much code duplication. DC Ah, that's why you can get away with a page flag - you've ignored DC the partial page delay state problem. Any plans to use the DC existing method in the future so we will be able to use ext4 delalloc DC on machines with a page size larger than 4k? what do you mean by exsiting? BH_delay? thanks, Alex - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] delayed allocation for ext4
Christoph Hellwig (CH) writes: CH Note that recording delayed alloc state at a page granularity in addition CH to just the buffer heads has a lot of advantages aswell and would help CH xfs, too. But I think it makes a lot more sense to record it as a radix CH tree tag to speed up the gang lookups for delalloc conversion. please, exaplein about radix tree tag. in ext4-delalloc I use this bit the only way - to avoid multiple reservation space for same page. I guess you need to find non-allocated pages. probably to flush them and update number of reserved blocks in case of -ENOSPC? thanks, Alex - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] ext4-block-reservation.patch
Hi, Andrew Morton (AM) writes: AM Should be cacheline_aligned_in_smp. AM That's assuming it needs to be cacheline aligned at all. It can consume a AM lot of space. the idea is to make block reservation cheap because it's called for every page. AM looks AM oh, this should be allocated with alloc_percpu(), in which case the AM open-coded alignment can perhaps go away. got it. + +int ext4_reserve_local(struct super_block *sb, int blocks) +{ + struct ext4_sb_info *sbi = EXT4_SB(sb); + struct ext4_reservation_slot *rs; + int rc = -ENOSPC; + + preempt_disable(); + rs = sbi-s_reservation_slots + smp_processor_id(); AM use get_cpu() here. ok. +void ext4_rebalance_reservation(struct ext4_reservation_slot *rs, __u64 free) +{ + int i, used_slots = 0; + __u64 chunk; + + /* let's know what slots have been used */ + for (i = 0; i NR_CPUS; i++) + if (rs[i].rs_reserved || i == smp_processor_id()) + used_slots++; + + /* chunk is a number of block every used +* slot will get. make sure it isn't 0 */ + chunk = free + used_slots - 1; + do_div(chunk, used_slots); + + for (i = 0; i NR_CPUS; i++) { AM all these NR_CPUS loops need to go away. Use either AM for_each_possible_cpu() or, preferably, for_each_online_cpu() and a hotplug AM notifier. hmm, i see. AM Why is this code using per-cpu data at all, btw? These optimisations tend AM to be marginal in filesystems. What is the perfomance impact of making AM this data be single-superblock-wide-instance? well, even on 2way box a single-lock reservation was in top10. thanks, Alex - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC] ext4-block-reservation.patch
Index: linux-2.6.20-rc1/include/linux/ext4_fs.h === --- linux-2.6.20-rc1.orig/include/linux/ext4_fs.h 2006-12-14 04:14:23.0 +0300 +++ linux-2.6.20-rc1/include/linux/ext4_fs.h2006-12-22 20:21:12.0 +0300 @@ -201,6 +201,7 @@ struct ext4_group_desc #define EXT4_STATE_JDATA 0x0001 /* journaled data exists */ #define EXT4_STATE_NEW 0x0002 /* inode is newly created */ #define EXT4_STATE_XATTR 0x0004 /* has in-inode xattrs */ +#define EXT4_STATE_BLOCKS_RESERVED 0x0008 /* blocks reserved */ /* Used to pass group descriptor data when online resize is done */ struct ext4_new_group_input { @@ -808,6 +809,10 @@ extern struct ext4_group_desc * ext4_get extern int ext4_should_retry_alloc(struct super_block *sb, int *retries); extern void ext4_init_block_alloc_info(struct inode *); extern void ext4_rsv_window_add(struct super_block *sb, struct ext4_reserve_window_node *rsv); +int ext4_reserve_init(struct super_block *sb); +void ext4_reserve_release(struct super_block *sb); +void ext4_release_blocks(struct super_block *sb, int blocks); +int ext4_reserve_blocks(struct super_block *sb, int blocks); /* dir.c */ extern int ext4_check_dir_entry(const char *, struct inode *, Index: linux-2.6.20-rc1/include/linux/ext4_fs_sb.h === --- linux-2.6.20-rc1.orig/include/linux/ext4_fs_sb.h2006-12-14 04:14:23.0 +0300 +++ linux-2.6.20-rc1/include/linux/ext4_fs_sb.h 2006-12-22 20:20:10.0 +0300 @@ -24,6 +24,8 @@ #endif #include +struct ext4_reservation_slot; + /* * third extended-fs super-block data in memory */ @@ -65,6 +67,9 @@ struct ext4_sb_info { struct rb_root s_rsv_window_root; struct ext4_reserve_window_node s_rsv_window_head; + /* global reservation structures */ + struct ext4_reservation_slot *s_reservation_slots; + /* Journaling */ struct inode * s_journal_inode; struct journal_s * s_journal; Index: linux-2.6.20-rc1/fs/ext4/super.c === --- linux-2.6.20-rc1.orig/fs/ext4/super.c 2006-12-14 04:14:23.0 +0300 +++ linux-2.6.20-rc1/fs/ext4/super.c2006-12-22 20:20:10.0 +0300 @@ -439,6 +439,7 @@ static void ext4_put_super (struct super struct ext4_super_block *es = sbi->s_es; int i; + ext4_reserve_release(sb); ext4_ext_release(sb); ext4_xattr_put_super(sb); jbd2_journal_destroy(sbi->s_journal); @@ -1867,6 +1868,7 @@ static int ext4_fill_super (struct super "writeback"); ext4_ext_init(sb); + ext4_reserve_init(sb); lock_kernel(); return 0; Index: linux-2.6.20-rc1/fs/ext4/balloc.c === --- linux-2.6.20-rc1.orig/fs/ext4/balloc.c 2006-12-14 04:14:23.0 +0300 +++ linux-2.6.20-rc1/fs/ext4/balloc.c 2006-12-22 20:32:11.0 +0300 @@ -630,8 +630,10 @@ void ext4_free_blocks(handle_t *handle, return; } ext4_free_blocks_sb(handle, sb, block, count, _freed_blocks); - if (dquot_freed_blocks) + if (dquot_freed_blocks) { + ext4_release_blocks(sb, dquot_freed_blocks); DQUOT_FREE_BLOCK(inode, dquot_freed_blocks); + } return; } @@ -1440,7 +1442,7 @@ ext4_fsblk_t ext4_new_blocks(handle_t *h struct ext4_sb_info *sbi; struct ext4_reserve_window_node *my_rsv = NULL; struct ext4_block_alloc_info *block_i; - unsigned short windowsz = 0; + unsigned short windowsz = 0, reserved = 0; #ifdef EXT4FS_DEBUG static int goal_hits, goal_attempts; #endif @@ -1462,6 +1464,13 @@ ext4_fsblk_t ext4_new_blocks(handle_t *h return 0; } + if (!(EXT4_I(inode)->i_state & EXT4_STATE_BLOCKS_RESERVED)) { + *errp = ext4_reserve_blocks(sb, num); + if (*errp) + return 0; + reserved = num; + } + sbi = EXT4_SB(sb); es = EXT4_SB(sb)->s_es; ext4_debug("goal=%lu.\n", goal); @@ -1674,8 +1683,11 @@ out: /* * Undo the block allocation */ - if (!performed_allocation) + if (!performed_allocation) { DQUOT_FREE_BLOCK(inode, *count); + if (reserved) + ext4_release_blocks(sb, reserved); + } brelse(bitmap_bh); return 0; } @@ -1834,3 +1846,161 @@ unsigned long ext4_bg_num_gdb(struct sup return ext4_bg_num_gdb_meta(sb,group); } + +/* + * reservation.c contains routines to reserve blocks. + * we need this for delayed allocation, otherwise we + * could meet -ENOSPC at flush time + */ + +/* + * as ->commit_write() where we're going to reserve + *
[RFC] ext4-delayed-allocation.patch
sv0, Opt_quota, Opt_noquota, Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_usrquota, - Opt_grpquota, Opt_extents, + Opt_grpquota, Opt_extents, Opt_delayed_alloc, }; static match_table_t tokens = { @@ -780,6 +788,7 @@ static match_table_t tokens = { {Opt_usrquota, "usrquota"}, {Opt_barrier, "barrier=%u"}, {Opt_extents, "extents"}, + {Opt_delayed_alloc, "delalloc"}, {Opt_err, NULL}, {Opt_resize, "resize"}, }; @@ -1094,6 +1103,9 @@ clear_qf_name: else clear_opt(sbi->s_mount_opt, BARRIER); break; + case Opt_delayed_alloc: + set_opt(sbi->s_mount_opt, DELAYED_ALLOC); + break; case Opt_ignore: break; case Opt_resize: @@ -1869,6 +1881,7 @@ static int ext4_fill_super (struct super ext4_ext_init(sb); ext4_reserve_init(sb); + ext4_wb_init(sb); lock_kernel(); return 0; Index: linux-2.6.20-rc1/fs/ext4/extents.c === --- linux-2.6.20-rc1.orig/fs/ext4/extents.c 2006-12-14 04:14:23.0 +0300 +++ linux-2.6.20-rc1/fs/ext4/extents.c 2006-12-22 22:56:04.0 +0300 @@ -2159,6 +2159,36 @@ int ext4_ext_writepage_trans_blocks(stru return needed; } +int ext4_ext_calc_metadata_amount(struct inode *inode, int blocks) +{ + int lcap, icap, rcap, leafs, idxs, num; + + rcap = ext4_ext_space_root(inode); + if (blocks <= rcap) { + /* all extents fit to the root */ + return 0; + } + + rcap = ext4_ext_space_root_idx(inode); + lcap = ext4_ext_space_block(inode); + icap = ext4_ext_space_block_idx(inode); + + num = leafs = (blocks + lcap - 1) / lcap; + if (leafs <= rcap) { + /* all pointers to leafs fit to the root */ + return leafs; + } + + /* ok. we need separate index block(s) to link all leaf blocks */ + idxs = (leafs + icap - 1) / icap; + do { + num += idxs; + idxs = (idxs + icap - 1) / icap; + } while (idxs > rcap); + + return num; +} + EXPORT_SYMBOL(ext4_mark_inode_dirty); EXPORT_SYMBOL(ext4_ext_invalidate_cache); EXPORT_SYMBOL(ext4_ext_insert_extent); Index: linux-2.6.20-rc1/fs/ext4/Makefile === --- linux-2.6.20-rc1.orig/fs/ext4/Makefile 2006-12-14 04:14:23.0 +0300 +++ linux-2.6.20-rc1/fs/ext4/Makefile 2006-12-22 22:56:04.0 +0300 @@ -6,7 +6,7 @@ obj-$(CONFIG_EXT4DEV_FS) += ext4dev.o ext4dev-y := balloc.o bitmap.o dir.o file.o fsync.o ialloc.o inode.o \ ioctl.o namei.o super.o symlink.o hash.o resize.o extents.o \ - ext4_jbd2.o + ext4_jbd2.o writeback.o ext4dev-$(CONFIG_EXT4DEV_FS_XATTR) += xattr.o xattr_user.o xattr_trusted.o ext4dev-$(CONFIG_EXT4DEV_FS_POSIX_ACL) += acl.o Index: linux-2.6.20-rc1/fs/ext4/writeback.c === --- linux-2.6.20-rc1.orig/fs/ext4/writeback.c 2006-11-30 15:32:10.563465031 +0300 +++ linux-2.6.20-rc1/fs/ext4/writeback.c2006-12-22 22:59:33.0 +0300 @@ -0,0 +1,1167 @@ +/* + * Copyright (c) 2003-2006, Cluster File Systems, Inc, [EMAIL PROTECTED] + * Written by Alex Tomas <[EMAIL PROTECTED]> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public Licens + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111- + */ + +/* + * TODO: + * MUST: + * - flush dirty pages in -ENOSPC case in order to free reserved blocks + * - direct I/O support + * - blocksize != PAGE_CACHE_SIZE support + * - store last unwritten page in ext4_wb_writepages() and + * continue from it in a next run + * WISH: + * - should ext4_wb_writepage() try to flush neighbours? + * - ext4_wb_block_truncate_page() must flush partial truncated pages + * - reservation can be done per write-request in ext4_file_write() + * rather than per-page in ext4_wb_commit_write() -- it's quite + * expensive to recalculate amount of required metadata for evey page + * - re-allocation to improve layout + */ + +#include +#include +
[RFC] booked-page-flag.patch
Index: linux-2.6.20-rc1/include/linux/page-flags.h === --- linux-2.6.20-rc1.orig/include/linux/page-flags.h2006-12-14 04:14:23.0 +0300 +++ linux-2.6.20-rc1/include/linux/page-flags.h 2006-12-22 20:05:31.0 +0300 @@ -90,6 +90,7 @@ #define PG_reclaim 17 /* To be reclaimed asap */ #define PG_nosave_free 18 /* Used for system suspend/resume */ #define PG_buddy 19 /* Page is free, on buddy lists */ +#define PG_booked 20 /* Has blocks reserved on-disk */ #if (BITS_PER_LONG > 32) @@ -230,6 +231,10 @@ static inline void SetPageUptodate(struc #define SetPageMappedToDisk(page) set_bit(PG_mappedtodisk, &(page)->flags) #define ClearPageMappedToDisk(page) clear_bit(PG_mappedtodisk, &(page)->flags) +#define PageBooked(page) test_bit(PG_booked, &(page)->flags) +#define SetPageBooked(page)set_bit(PG_booked, &(page)->flags) +#define ClearPageBooked(page) clear_bit(PG_booked, &(page)->flags) + #define PageReclaim(page) test_bit(PG_reclaim, &(page)->flags) #define SetPageReclaim(page) set_bit(PG_reclaim, &(page)->flags) #define ClearPageReclaim(page) clear_bit(PG_reclaim, &(page)->flags) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC] delayed allocation for ext4
Good day, probably the previous set of patches (including mballoc/lg) is too large. so, I reworked delayed allocation a bit so that it can be used on top of regular balloc, though it still can be used with extents-enabled files only. this time series contains just 3 patches: - booked-page-flag.patch adds PG_booked bit to page->flags. it's used in delayed allocation to mark space is already reserved for page (including possible metadata) - ext4-block-reservation.patch this is scalable free space management. every time we delay allocation of some page, a space (including metadata) should be reserved - ext4-delayed-allocation.patch delayed allocation itself, enabled by "delalloc" mount option. extents support is also required. currently it works only with blocksize=pagesize. all the patches can be used in ftp://ftp.clusterfs.com/pub/people/alex/2.6.20-rc1/ the series passed basic tests like dd/dbench/fsx. any comments/questions are very welcome. thanks, Alex - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC] delayed allocation for ext4
Good day, probably the previous set of patches (including mballoc/lg) is too large. so, I reworked delayed allocation a bit so that it can be used on top of regular balloc, though it still can be used with extents-enabled files only. this time series contains just 3 patches: - booked-page-flag.patch adds PG_booked bit to page-flags. it's used in delayed allocation to mark space is already reserved for page (including possible metadata) - ext4-block-reservation.patch this is scalable free space management. every time we delay allocation of some page, a space (including metadata) should be reserved - ext4-delayed-allocation.patch delayed allocation itself, enabled by delalloc mount option. extents support is also required. currently it works only with blocksize=pagesize. all the patches can be used in ftp://ftp.clusterfs.com/pub/people/alex/2.6.20-rc1/ the series passed basic tests like dd/dbench/fsx. any comments/questions are very welcome. thanks, Alex - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC] booked-page-flag.patch
Index: linux-2.6.20-rc1/include/linux/page-flags.h === --- linux-2.6.20-rc1.orig/include/linux/page-flags.h2006-12-14 04:14:23.0 +0300 +++ linux-2.6.20-rc1/include/linux/page-flags.h 2006-12-22 20:05:31.0 +0300 @@ -90,6 +90,7 @@ #define PG_reclaim 17 /* To be reclaimed asap */ #define PG_nosave_free 18 /* Used for system suspend/resume */ #define PG_buddy 19 /* Page is free, on buddy lists */ +#define PG_booked 20 /* Has blocks reserved on-disk */ #if (BITS_PER_LONG 32) @@ -230,6 +231,10 @@ static inline void SetPageUptodate(struc #define SetPageMappedToDisk(page) set_bit(PG_mappedtodisk, (page)-flags) #define ClearPageMappedToDisk(page) clear_bit(PG_mappedtodisk, (page)-flags) +#define PageBooked(page) test_bit(PG_booked, (page)-flags) +#define SetPageBooked(page)set_bit(PG_booked, (page)-flags) +#define ClearPageBooked(page) clear_bit(PG_booked, (page)-flags) + #define PageReclaim(page) test_bit(PG_reclaim, (page)-flags) #define SetPageReclaim(page) set_bit(PG_reclaim, (page)-flags) #define ClearPageReclaim(page) clear_bit(PG_reclaim, (page)-flags) - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC] ext4-delayed-allocation.patch
, Opt_barrier, Opt_err, Opt_resize, Opt_usrquota, - Opt_grpquota, Opt_extents, + Opt_grpquota, Opt_extents, Opt_delayed_alloc, }; static match_table_t tokens = { @@ -780,6 +788,7 @@ static match_table_t tokens = { {Opt_usrquota, usrquota}, {Opt_barrier, barrier=%u}, {Opt_extents, extents}, + {Opt_delayed_alloc, delalloc}, {Opt_err, NULL}, {Opt_resize, resize}, }; @@ -1094,6 +1103,9 @@ clear_qf_name: else clear_opt(sbi-s_mount_opt, BARRIER); break; + case Opt_delayed_alloc: + set_opt(sbi-s_mount_opt, DELAYED_ALLOC); + break; case Opt_ignore: break; case Opt_resize: @@ -1869,6 +1881,7 @@ static int ext4_fill_super (struct super ext4_ext_init(sb); ext4_reserve_init(sb); + ext4_wb_init(sb); lock_kernel(); return 0; Index: linux-2.6.20-rc1/fs/ext4/extents.c === --- linux-2.6.20-rc1.orig/fs/ext4/extents.c 2006-12-14 04:14:23.0 +0300 +++ linux-2.6.20-rc1/fs/ext4/extents.c 2006-12-22 22:56:04.0 +0300 @@ -2159,6 +2159,36 @@ int ext4_ext_writepage_trans_blocks(stru return needed; } +int ext4_ext_calc_metadata_amount(struct inode *inode, int blocks) +{ + int lcap, icap, rcap, leafs, idxs, num; + + rcap = ext4_ext_space_root(inode); + if (blocks = rcap) { + /* all extents fit to the root */ + return 0; + } + + rcap = ext4_ext_space_root_idx(inode); + lcap = ext4_ext_space_block(inode); + icap = ext4_ext_space_block_idx(inode); + + num = leafs = (blocks + lcap - 1) / lcap; + if (leafs = rcap) { + /* all pointers to leafs fit to the root */ + return leafs; + } + + /* ok. we need separate index block(s) to link all leaf blocks */ + idxs = (leafs + icap - 1) / icap; + do { + num += idxs; + idxs = (idxs + icap - 1) / icap; + } while (idxs rcap); + + return num; +} + EXPORT_SYMBOL(ext4_mark_inode_dirty); EXPORT_SYMBOL(ext4_ext_invalidate_cache); EXPORT_SYMBOL(ext4_ext_insert_extent); Index: linux-2.6.20-rc1/fs/ext4/Makefile === --- linux-2.6.20-rc1.orig/fs/ext4/Makefile 2006-12-14 04:14:23.0 +0300 +++ linux-2.6.20-rc1/fs/ext4/Makefile 2006-12-22 22:56:04.0 +0300 @@ -6,7 +6,7 @@ obj-$(CONFIG_EXT4DEV_FS) += ext4dev.o ext4dev-y := balloc.o bitmap.o dir.o file.o fsync.o ialloc.o inode.o \ ioctl.o namei.o super.o symlink.o hash.o resize.o extents.o \ - ext4_jbd2.o + ext4_jbd2.o writeback.o ext4dev-$(CONFIG_EXT4DEV_FS_XATTR) += xattr.o xattr_user.o xattr_trusted.o ext4dev-$(CONFIG_EXT4DEV_FS_POSIX_ACL) += acl.o Index: linux-2.6.20-rc1/fs/ext4/writeback.c === --- linux-2.6.20-rc1.orig/fs/ext4/writeback.c 2006-11-30 15:32:10.563465031 +0300 +++ linux-2.6.20-rc1/fs/ext4/writeback.c2006-12-22 22:59:33.0 +0300 @@ -0,0 +1,1167 @@ +/* + * Copyright (c) 2003-2006, Cluster File Systems, Inc, [EMAIL PROTECTED] + * Written by Alex Tomas [EMAIL PROTECTED] + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public Licens + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111- + */ + +/* + * TODO: + * MUST: + * - flush dirty pages in -ENOSPC case in order to free reserved blocks + * - direct I/O support + * - blocksize != PAGE_CACHE_SIZE support + * - store last unwritten page in ext4_wb_writepages() and + * continue from it in a next run + * WISH: + * - should ext4_wb_writepage() try to flush neighbours? + * - ext4_wb_block_truncate_page() must flush partial truncated pages + * - reservation can be done per write-request in ext4_file_write() + * rather than per-page in ext4_wb_commit_write() -- it's quite + * expensive to recalculate amount of required metadata for evey page + * - re-allocation to improve layout + */ + +#include linux/module.h +#include linux/fs.h +#include linux/bio.h +#include linux/time.h +#include linux/ext4_jbd2.h +#include linux/jbd.h +#include linux
[RFC] ext4-block-reservation.patch
Index: linux-2.6.20-rc1/include/linux/ext4_fs.h === --- linux-2.6.20-rc1.orig/include/linux/ext4_fs.h 2006-12-14 04:14:23.0 +0300 +++ linux-2.6.20-rc1/include/linux/ext4_fs.h2006-12-22 20:21:12.0 +0300 @@ -201,6 +201,7 @@ struct ext4_group_desc #define EXT4_STATE_JDATA 0x0001 /* journaled data exists */ #define EXT4_STATE_NEW 0x0002 /* inode is newly created */ #define EXT4_STATE_XATTR 0x0004 /* has in-inode xattrs */ +#define EXT4_STATE_BLOCKS_RESERVED 0x0008 /* blocks reserved */ /* Used to pass group descriptor data when online resize is done */ struct ext4_new_group_input { @@ -808,6 +809,10 @@ extern struct ext4_group_desc * ext4_get extern int ext4_should_retry_alloc(struct super_block *sb, int *retries); extern void ext4_init_block_alloc_info(struct inode *); extern void ext4_rsv_window_add(struct super_block *sb, struct ext4_reserve_window_node *rsv); +int ext4_reserve_init(struct super_block *sb); +void ext4_reserve_release(struct super_block *sb); +void ext4_release_blocks(struct super_block *sb, int blocks); +int ext4_reserve_blocks(struct super_block *sb, int blocks); /* dir.c */ extern int ext4_check_dir_entry(const char *, struct inode *, Index: linux-2.6.20-rc1/include/linux/ext4_fs_sb.h === --- linux-2.6.20-rc1.orig/include/linux/ext4_fs_sb.h2006-12-14 04:14:23.0 +0300 +++ linux-2.6.20-rc1/include/linux/ext4_fs_sb.h 2006-12-22 20:20:10.0 +0300 @@ -24,6 +24,8 @@ #endif #include linux/rbtree.h +struct ext4_reservation_slot; + /* * third extended-fs super-block data in memory */ @@ -65,6 +67,9 @@ struct ext4_sb_info { struct rb_root s_rsv_window_root; struct ext4_reserve_window_node s_rsv_window_head; + /* global reservation structures */ + struct ext4_reservation_slot *s_reservation_slots; + /* Journaling */ struct inode * s_journal_inode; struct journal_s * s_journal; Index: linux-2.6.20-rc1/fs/ext4/super.c === --- linux-2.6.20-rc1.orig/fs/ext4/super.c 2006-12-14 04:14:23.0 +0300 +++ linux-2.6.20-rc1/fs/ext4/super.c2006-12-22 20:20:10.0 +0300 @@ -439,6 +439,7 @@ static void ext4_put_super (struct super struct ext4_super_block *es = sbi-s_es; int i; + ext4_reserve_release(sb); ext4_ext_release(sb); ext4_xattr_put_super(sb); jbd2_journal_destroy(sbi-s_journal); @@ -1867,6 +1868,7 @@ static int ext4_fill_super (struct super writeback); ext4_ext_init(sb); + ext4_reserve_init(sb); lock_kernel(); return 0; Index: linux-2.6.20-rc1/fs/ext4/balloc.c === --- linux-2.6.20-rc1.orig/fs/ext4/balloc.c 2006-12-14 04:14:23.0 +0300 +++ linux-2.6.20-rc1/fs/ext4/balloc.c 2006-12-22 20:32:11.0 +0300 @@ -630,8 +630,10 @@ void ext4_free_blocks(handle_t *handle, return; } ext4_free_blocks_sb(handle, sb, block, count, dquot_freed_blocks); - if (dquot_freed_blocks) + if (dquot_freed_blocks) { + ext4_release_blocks(sb, dquot_freed_blocks); DQUOT_FREE_BLOCK(inode, dquot_freed_blocks); + } return; } @@ -1440,7 +1442,7 @@ ext4_fsblk_t ext4_new_blocks(handle_t *h struct ext4_sb_info *sbi; struct ext4_reserve_window_node *my_rsv = NULL; struct ext4_block_alloc_info *block_i; - unsigned short windowsz = 0; + unsigned short windowsz = 0, reserved = 0; #ifdef EXT4FS_DEBUG static int goal_hits, goal_attempts; #endif @@ -1462,6 +1464,13 @@ ext4_fsblk_t ext4_new_blocks(handle_t *h return 0; } + if (!(EXT4_I(inode)-i_state EXT4_STATE_BLOCKS_RESERVED)) { + *errp = ext4_reserve_blocks(sb, num); + if (*errp) + return 0; + reserved = num; + } + sbi = EXT4_SB(sb); es = EXT4_SB(sb)-s_es; ext4_debug(goal=%lu.\n, goal); @@ -1674,8 +1683,11 @@ out: /* * Undo the block allocation */ - if (!performed_allocation) + if (!performed_allocation) { DQUOT_FREE_BLOCK(inode, *count); + if (reserved) + ext4_release_blocks(sb, reserved); + } brelse(bitmap_bh); return 0; } @@ -1834,3 +1846,161 @@ unsigned long ext4_bg_num_gdb(struct sup return ext4_bg_num_gdb_meta(sb,group); } + +/* + * reservation.c contains routines to reserve blocks. + * we need this for delayed allocation, otherwise we + * could meet -ENOSPC at flush time + */ + +/* + * as -commit_write() where we're going to
Re: Boot failure with ext2 and initrds
> Andrew Morton (AM) writes: AM> What lock protects the fields in struct ext[234]_reserve_window from being AM> concurrently modified by two CPUs? None, it seems. Ditto AM> ext[234]_reserve_window_node. i_mutex will cover it for write(), but not AM> for pageout over a file hole. If we end up with a zero- or negative-sized AM> window then odd things might happen. truncate_mutex? thanks, Alex - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Boot failure with ext2 and initrds
Andrew Morton (AM) writes: AM What lock protects the fields in struct ext[234]_reserve_window from being AM concurrently modified by two CPUs? None, it seems. Ditto AM ext[234]_reserve_window_node. i_mutex will cover it for write(), but not AM for pageout over a file hole. If we end up with a zero- or negative-sized AM window then odd things might happen. truncate_mutex? thanks, Alex - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] pdirops: vfs patch
> Jan Blunck (JB) writes: JB> Nope, d_alloc() is setting d_flags to DCACHE_UNHASHED. Therefore it is not found JB> by __d_lookup() until it is rehashed which is implicit done by ->lookup(). that means we can have two processes allocated dentry for same name. they'll call ->lookup() each against own dentry, fill them and hash. so, we'll get two equal dentries. is that OK? thanks, Alex - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] pdirops: vfs patch
Jan Blunck (JB) writes: JB Nope, d_alloc() is setting d_flags to DCACHE_UNHASHED. Therefore it is not found JB by __d_lookup() until it is rehashed which is implicit done by -lookup(). that means we can have two processes allocated dentry for same name. they'll call -lookup() each against own dentry, fill them and hash. so, we'll get two equal dentries. is that OK? thanks, Alex - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] pdirops: vfs patch
> Jan Blunck (JB) writes: JB> i_sem does NOT protect the dcache. Also not in real_lookup(). The lock must be JB> acquired for ->lookup() and because we might sleep on i_sem, we have to get it JB> early and check for repopulation of the dcache. dentry is part of dcache, right? i_sem protects dentry from being returned with incomplete data inside, right? thanks, Alex - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] pdirops: vfs patch
> Jan Blunck (JB) writes: >> 1) i_sem protects dcache too JB> Where? i_sem is the per-inode lock, and shouldn't be used else. read comments in fs/namei.c:read_lookup() >> 2) tmpfs has no "own" data, so we can use it this way (see 2nd patch) >> 3) I have pdirops patch for ext3, but it needs some cleaning ... JB> I think you didn't get my point. JB> 1) Your approach is duplicating the locking effort for regular filesystem JB> (like ext2): JB> a) locking with s_pdirops_sems JB> b) locking the low-level filesystem data JB> It's cool that it speeds up tmpfs, but I don't think that this legatimate the JB> doubled locking for every other filesystem. JB> I'm not sure that it also increases performance for regular filesystems, if you JB> do the locking right. we've already done this for ext3. it works. it speeds some loads up significantly. especially on big directories. and you can control this via mount option, so almost zero cost for fs that doesn't support pdirops. JB> 2) In my opinion, a superblock-wide semaphore array which allows 1024 JB> different (different names and different operations) accesses to ONE single JB> inode (which is the data it should protect) is not a good idea. yes, it has some weakness, i'm reworking vfs patch to avoid inter-inode collisions. thanks, Alex - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] pdirops: vfs patch
Jan Blunck (JB) writes: 1) i_sem protects dcache too JB Where? i_sem is the per-inode lock, and shouldn't be used else. read comments in fs/namei.c:read_lookup() 2) tmpfs has no own data, so we can use it this way (see 2nd patch) 3) I have pdirops patch for ext3, but it needs some cleaning ... JB I think you didn't get my point. JB 1) Your approach is duplicating the locking effort for regular filesystem JB (like ext2): JB a) locking with s_pdirops_sems JB b) locking the low-level filesystem data JB It's cool that it speeds up tmpfs, but I don't think that this legatimate the JB doubled locking for every other filesystem. JB I'm not sure that it also increases performance for regular filesystems, if you JB do the locking right. we've already done this for ext3. it works. it speeds some loads up significantly. especially on big directories. and you can control this via mount option, so almost zero cost for fs that doesn't support pdirops. JB 2) In my opinion, a superblock-wide semaphore array which allows 1024 JB different (different names and different operations) accesses to ONE single JB inode (which is the data it should protect) is not a good idea. yes, it has some weakness, i'm reworking vfs patch to avoid inter-inode collisions. thanks, Alex - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] pdirops: vfs patch
Jan Blunck (JB) writes: JB i_sem does NOT protect the dcache. Also not in real_lookup(). The lock must be JB acquired for -lookup() and because we might sleep on i_sem, we have to get it JB early and check for repopulation of the dcache. dentry is part of dcache, right? i_sem protects dentry from being returned with incomplete data inside, right? thanks, Alex - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] pdirops: vfs patch
> Jan Blunck (JB) writes: JB> With luck you have s_pdirops_size (or 1024) different renames altering JB> concurrently one directory inode. Therefore you need a lock protecting JB> your filesystem data. This is basically the job done by i_sem. So in JB> my opinion you only move "The Problem" from the VFS to the lowlevel JB> filesystems. But then there is no need for i_sem or your JB> s_pdirops_sems anymore. 1) i_sem protects dcache too 2) tmpfs has no "own" data, so we can use it this way (see 2nd patch) 3) I have pdirops patch for ext3, but it needs some cleaning ... thanks, Alex - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] pdirops: vfs patch
Jan Blunck (JB) writes: JB With luck you have s_pdirops_size (or 1024) different renames altering JB concurrently one directory inode. Therefore you need a lock protecting JB your filesystem data. This is basically the job done by i_sem. So in JB my opinion you only move The Problem from the VFS to the lowlevel JB filesystems. But then there is no need for i_sem or your JB s_pdirops_sems anymore. 1) i_sem protects dcache too 2) tmpfs has no own data, so we can use it this way (see 2nd patch) 3) I have pdirops patch for ext3, but it needs some cleaning ... thanks, Alex - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] pdirops: tmpfs patch
Index: linux-2.6.10/mm/shmem.c === --- linux-2.6.10.orig/mm/shmem.c2005-01-28 19:32:16.0 +0300 +++ linux-2.6.10/mm/shmem.c 2005-02-19 20:05:32.642599576 +0300 @@ -1849,7 +1849,7 @@ #endif }; -static int shmem_parse_options(char *options, int *mode, uid_t *uid, gid_t *gid, unsigned long *blocks, unsigned long *inodes) +static int shmem_parse_options(char *options, int *mode, uid_t *uid, gid_t *gid, unsigned long *blocks, unsigned long *inodes, struct super_block *sb) { char *this_char, *value, *rest; @@ -1858,6 +1858,9 @@ continue; if ((value = strchr(this_char,'=')) != NULL) { *value++ = 0; + } else if (!strcmp(this_char,"pdirops")) { + sb->s_flags |= S_PDIROPS; + continue; } else { printk(KERN_ERR "tmpfs: No value for mount option '%s'\n", @@ -1928,7 +1931,7 @@ max_blocks = sbinfo->max_blocks; max_inodes = sbinfo->max_inodes; } - if (shmem_parse_options(data, NULL, NULL, NULL, _blocks, _inodes)) + if (shmem_parse_options(data, NULL, NULL, NULL, _blocks, _inodes, sb)) return -EINVAL; /* Keep it simple: disallow limited <-> unlimited remount */ if ((max_blocks || max_inodes) == !sbinfo) @@ -1978,7 +1981,7 @@ inodes = blocks; if (shmem_parse_options(data, , - , , , )) + , , , , sb)) return -EINVAL; } - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] pdirops: vfs patch
fs/inode.c |1 fs/namei.c | 66 ++--- include/linux/fs.h | 11 3 files changed, 54 insertions(+), 24 deletions(-) Index: linux-2.6.10/fs/namei.c === --- linux-2.6.10.orig/fs/namei.c2005-01-28 19:32:13.0 +0300 +++ linux-2.6.10/fs/namei.c 2005-02-19 20:40:05.763437304 +0300 @@ -104,6 +104,28 @@ * any extra contention... */ +static inline struct semaphore * lock_sem(struct inode *dir, struct qstr *name) +{ + if (IS_PDIROPS(dir)) { + struct super_block *sb; + /* name->hash expected to be already calculated */ + sb = dir->i_sb; + BUG_ON(sb->s_pdirops_sems == NULL); + return sb->s_pdirops_sems + name->hash % sb->s_pdirops_size; + } + return >i_sem; +} + +static inline void lock_dir(struct inode *dir, struct qstr *name) +{ + down(lock_sem(dir, name)); +} + +static inline void unlock_dir(struct inode *dir, struct qstr *name) +{ + up(lock_sem(dir, name)); +} + /* In order to reduce some races, while at the same time doing additional * checking and hopefully speeding things up, we copy filenames to the * kernel data space before using them.. @@ -380,7 +402,7 @@ struct dentry * result; struct inode *dir = parent->d_inode; - down(>i_sem); + lock_dir(dir, name); /* * First re-do the cached lookup just in case it was created * while we waited for the directory semaphore.. @@ -406,7 +428,7 @@ else result = dentry; } - up(>i_sem); + unlock_dir(dir, name); return result; } @@ -414,7 +436,7 @@ * Uhhuh! Nasty case: the cache was re-populated while * we waited on the semaphore. Need to revalidate. */ - up(>i_sem); + unlock_dir(dir, name); if (result->d_op && result->d_op->d_revalidate) { if (!result->d_op->d_revalidate(result, nd) && !d_invalidate(result)) { dput(result); @@ -1182,12 +1204,26 @@ /* * p1 and p2 should be directories on the same fs. */ -struct dentry *lock_rename(struct dentry *p1, struct dentry *p2) +struct dentry *lock_rename(struct dentry *p1, struct qstr *n1, + struct dentry *p2, struct qstr *n2) { struct dentry *p; if (p1 == p2) { - down(>d_inode->i_sem); + if (IS_PDIROPS(p1->d_inode)) { + unsigned int h1, h2; + h1 = n1->hash % p1->d_inode->i_sb->s_pdirops_size; + h2 = n2->hash % p2->d_inode->i_sb->s_pdirops_size; + if (h1 < h2) { + lock_dir(p1->d_inode, n1); + lock_dir(p2->d_inode, n2); + } else if (h1 > h2) { + lock_dir(p2->d_inode, n2); + lock_dir(p1->d_inode, n1); + } else + lock_dir(p1->d_inode, n1); + } else + down(>d_inode->i_sem); return NULL; } @@ -1195,31 +1231,35 @@ for (p = p1; p->d_parent != p; p = p->d_parent) { if (p->d_parent == p2) { - down(>d_inode->i_sem); - down(>d_inode->i_sem); + lock_dir(p2->d_inode, n2); + lock_dir(p1->d_inode, n1); return p; } } for (p = p2; p->d_parent != p; p = p->d_parent) { if (p->d_parent == p1) { - down(>d_inode->i_sem); - down(>d_inode->i_sem); + lock_dir(p1->d_inode, n1); + lock_dir(p2->d_inode, n2); return p; } } - down(>d_inode->i_sem); - down(>d_inode->i_sem); + lock_dir(p1->d_inode, n1); + lock_dir(p2->d_inode, n2); return NULL; } -void unlock_rename(struct dentry *p1, struct dentry *p2) +void unlock_rename(struct dentry *p1, struct qstr *n1, + struct dentry *p2, struct qstr *n2) { - up(>d_inode->i_sem); + unlock_dir(p1->d_inode, n1); if (p1 != p2) { - up(>d_inode->i_sem); + unlock_dir(p2->d_inode, n2); up(>d_inode->i_sb->s_vfs_rename_sem); + } else if (IS_PDIROPS(p1->d_inode) && + n1->hash != n2->hash) { + unlock_dir(p2->d_inode, n2); } } @@ -1386,13 +1426,13 @@ dir = nd->dentry; nd->flags &= ~LOOKUP_PARENT; - down(>d_inode->i_sem); + lock_dir(dir->d_inode, >last);
[RFC] parallel directory operations
Good day Al and all could you review couple patches that implement $subj for vfs and tmpfs. In short the idea is that we can protect operations taking semaphore related for set of names. definitely, protection at vfs layer isn't enough and filesystem will need to protect their own structures by itself, but in some cases vfs patch is enough. for example, tmpfs. during some loads one can see quite high load in /tmp. being mounted as tmpfs on big smp, we can get high contention on i_sem. probably someone could try more-less real load? I wrote simple program that spawn few processes, then chdir to the given directory, then loops creating and unlinking file. The test box is dual PIII-1GHz: run 1: 2 processes create/unlink file on regular tmpfs [EMAIL PROTECTED] root]# mount -t tmpfs none /test [EMAIL PROTECTED] root]# (cd /test; time /root/crunl ./f 100 2) #1998: 100 iterations, create/unlink ./f-0-1998 #1999: 100 iterations, create/unlink ./f-1-1999 #384: done #384: done wait for completion ... OK real0m36.224s user0m0.823s sys 0m47.994s run 2: 2 processes create/unlink file on tmpfs + pdirops [EMAIL PROTECTED] root]# mount -t tmpfs -o pdirops none /test [EMAIL PROTECTED] root]# (cd /test; time /root/crunl ./f 100 2) #1992: 100 iterations, create/unlink ./f-0-1992 #1993: 100 iterations, create/unlink ./f-1-1993 #384: done #384: done wait for completion ... OK real0m15.108s user0m0.592s sys 0m29.406s run 3: 1 process creates/unlinks file on regular tmpfs [EMAIL PROTECTED] root]# mount -t tmpfs none /test [EMAIL PROTECTED] root]# (cd /test; time /root/crunl ./f 100 1) #2004: 100 iterations, create/unlink ./f-0-2004 #384: done wait for completion ... OK real0m11.950s user0m0.262s sys 0m7.465s run 4: 1 process creates/unlinks file on tmpf + pdirops [EMAIL PROTECTED] root]# mount -t tmpfs -o pdirops none /test [EMAIL PROTECTED] root]# (cd /test; time /root/crunl ./f 100 1) #2009: 100 iterations, create/unlink ./f-0-2009 #384: done wait for completion ... OK real0m8.047s user0m0.243s sys 0m7.646s 2 processes creating/unlinking on regular tmpfs cause ~200K context switches: procs memory swap io system cpu r b w swpd free buff cache si sobibo incs us sy id 2 0 0 0 1005760 6616 1092800 0 0 1007 215095 1 64 35 2 0 0 0 1005760 6616 1092800 0 0 1007 213580 1 67 32 2 0 0 0 1005760 6616 1092800 0 0 1007 214445 1 63 36 2 0 0 0 1005760 6616 1092800 0 0 1007 216250 1 63 36 2 processes creating/unlinking on tmpfs + pdirops cause ~44 context switches: procs memory swap io system cpu r b w swpd free buff cache si sobibo incs us sy id 2 0 1 0 1001824 6632 1091200 0 0 100841 2 98 0 2 0 2 0 1002144 6632 1091200 0 0 100845 2 98 0 2 0 2 0 1001632 6632 1091200 0 0 100747 2 98 0 the next benchmark is rename. two processes generate random name, create file, generate new name, rename created file in new name and unlink: run 5: regular tmpfs [EMAIL PROTECTED] root]# mount -t tmpfs none /test [EMAIL PROTECTED] root]# (cd /test; time /root/rndrename ./f 100 2) #2036: 100 iterations #2037: 100 iterations wait for completion ... OK real1m22.381s user0m10.254s sys 1m50.214s run 6: tmpfs + pdirops [EMAIL PROTECTED] root]# mount -t tmpfs -o pdirops none /test [EMAIL PROTECTED] root]# (cd /test; time /root/rndrename ./f 100 2) #2044: 100 iterations #2045: 100 iterations wait for completion ... OK real0m39.403s user0m9.411s sys 1m8.626s thanks, Alex - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC] parallel directory operations
Good day Al and all could you review couple patches that implement $subj for vfs and tmpfs. In short the idea is that we can protect operations taking semaphore related for set of names. definitely, protection at vfs layer isn't enough and filesystem will need to protect their own structures by itself, but in some cases vfs patch is enough. for example, tmpfs. during some loads one can see quite high load in /tmp. being mounted as tmpfs on big smp, we can get high contention on i_sem. probably someone could try more-less real load? I wrote simple program that spawn few processes, then chdir to the given directory, then loops creating and unlinking file. The test box is dual PIII-1GHz: run 1: 2 processes create/unlink file on regular tmpfs [EMAIL PROTECTED] root]# mount -t tmpfs none /test [EMAIL PROTECTED] root]# (cd /test; time /root/crunl ./f 100 2) #1998: 100 iterations, create/unlink ./f-0-1998 #1999: 100 iterations, create/unlink ./f-1-1999 #384: done #384: done wait for completion ... OK real0m36.224s user0m0.823s sys 0m47.994s run 2: 2 processes create/unlink file on tmpfs + pdirops [EMAIL PROTECTED] root]# mount -t tmpfs -o pdirops none /test [EMAIL PROTECTED] root]# (cd /test; time /root/crunl ./f 100 2) #1992: 100 iterations, create/unlink ./f-0-1992 #1993: 100 iterations, create/unlink ./f-1-1993 #384: done #384: done wait for completion ... OK real0m15.108s user0m0.592s sys 0m29.406s run 3: 1 process creates/unlinks file on regular tmpfs [EMAIL PROTECTED] root]# mount -t tmpfs none /test [EMAIL PROTECTED] root]# (cd /test; time /root/crunl ./f 100 1) #2004: 100 iterations, create/unlink ./f-0-2004 #384: done wait for completion ... OK real0m11.950s user0m0.262s sys 0m7.465s run 4: 1 process creates/unlinks file on tmpf + pdirops [EMAIL PROTECTED] root]# mount -t tmpfs -o pdirops none /test [EMAIL PROTECTED] root]# (cd /test; time /root/crunl ./f 100 1) #2009: 100 iterations, create/unlink ./f-0-2009 #384: done wait for completion ... OK real0m8.047s user0m0.243s sys 0m7.646s 2 processes creating/unlinking on regular tmpfs cause ~200K context switches: procs memory swap io system cpu r b w swpd free buff cache si sobibo incs us sy id 2 0 0 0 1005760 6616 1092800 0 0 1007 215095 1 64 35 2 0 0 0 1005760 6616 1092800 0 0 1007 213580 1 67 32 2 0 0 0 1005760 6616 1092800 0 0 1007 214445 1 63 36 2 0 0 0 1005760 6616 1092800 0 0 1007 216250 1 63 36 2 processes creating/unlinking on tmpfs + pdirops cause ~44 context switches: procs memory swap io system cpu r b w swpd free buff cache si sobibo incs us sy id 2 0 1 0 1001824 6632 1091200 0 0 100841 2 98 0 2 0 2 0 1002144 6632 1091200 0 0 100845 2 98 0 2 0 2 0 1001632 6632 1091200 0 0 100747 2 98 0 the next benchmark is rename. two processes generate random name, create file, generate new name, rename created file in new name and unlink: run 5: regular tmpfs [EMAIL PROTECTED] root]# mount -t tmpfs none /test [EMAIL PROTECTED] root]# (cd /test; time /root/rndrename ./f 100 2) #2036: 100 iterations #2037: 100 iterations wait for completion ... OK real1m22.381s user0m10.254s sys 1m50.214s run 6: tmpfs + pdirops [EMAIL PROTECTED] root]# mount -t tmpfs -o pdirops none /test [EMAIL PROTECTED] root]# (cd /test; time /root/rndrename ./f 100 2) #2044: 100 iterations #2045: 100 iterations wait for completion ... OK real0m39.403s user0m9.411s sys 1m8.626s thanks, Alex - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] pdirops: vfs patch
fs/inode.c |1 fs/namei.c | 66 ++--- include/linux/fs.h | 11 3 files changed, 54 insertions(+), 24 deletions(-) Index: linux-2.6.10/fs/namei.c === --- linux-2.6.10.orig/fs/namei.c2005-01-28 19:32:13.0 +0300 +++ linux-2.6.10/fs/namei.c 2005-02-19 20:40:05.763437304 +0300 @@ -104,6 +104,28 @@ * any extra contention... */ +static inline struct semaphore * lock_sem(struct inode *dir, struct qstr *name) +{ + if (IS_PDIROPS(dir)) { + struct super_block *sb; + /* name-hash expected to be already calculated */ + sb = dir-i_sb; + BUG_ON(sb-s_pdirops_sems == NULL); + return sb-s_pdirops_sems + name-hash % sb-s_pdirops_size; + } + return dir-i_sem; +} + +static inline void lock_dir(struct inode *dir, struct qstr *name) +{ + down(lock_sem(dir, name)); +} + +static inline void unlock_dir(struct inode *dir, struct qstr *name) +{ + up(lock_sem(dir, name)); +} + /* In order to reduce some races, while at the same time doing additional * checking and hopefully speeding things up, we copy filenames to the * kernel data space before using them.. @@ -380,7 +402,7 @@ struct dentry * result; struct inode *dir = parent-d_inode; - down(dir-i_sem); + lock_dir(dir, name); /* * First re-do the cached lookup just in case it was created * while we waited for the directory semaphore.. @@ -406,7 +428,7 @@ else result = dentry; } - up(dir-i_sem); + unlock_dir(dir, name); return result; } @@ -414,7 +436,7 @@ * Uhhuh! Nasty case: the cache was re-populated while * we waited on the semaphore. Need to revalidate. */ - up(dir-i_sem); + unlock_dir(dir, name); if (result-d_op result-d_op-d_revalidate) { if (!result-d_op-d_revalidate(result, nd) !d_invalidate(result)) { dput(result); @@ -1182,12 +1204,26 @@ /* * p1 and p2 should be directories on the same fs. */ -struct dentry *lock_rename(struct dentry *p1, struct dentry *p2) +struct dentry *lock_rename(struct dentry *p1, struct qstr *n1, + struct dentry *p2, struct qstr *n2) { struct dentry *p; if (p1 == p2) { - down(p1-d_inode-i_sem); + if (IS_PDIROPS(p1-d_inode)) { + unsigned int h1, h2; + h1 = n1-hash % p1-d_inode-i_sb-s_pdirops_size; + h2 = n2-hash % p2-d_inode-i_sb-s_pdirops_size; + if (h1 h2) { + lock_dir(p1-d_inode, n1); + lock_dir(p2-d_inode, n2); + } else if (h1 h2) { + lock_dir(p2-d_inode, n2); + lock_dir(p1-d_inode, n1); + } else + lock_dir(p1-d_inode, n1); + } else + down(p1-d_inode-i_sem); return NULL; } @@ -1195,31 +1231,35 @@ for (p = p1; p-d_parent != p; p = p-d_parent) { if (p-d_parent == p2) { - down(p2-d_inode-i_sem); - down(p1-d_inode-i_sem); + lock_dir(p2-d_inode, n2); + lock_dir(p1-d_inode, n1); return p; } } for (p = p2; p-d_parent != p; p = p-d_parent) { if (p-d_parent == p1) { - down(p1-d_inode-i_sem); - down(p2-d_inode-i_sem); + lock_dir(p1-d_inode, n1); + lock_dir(p2-d_inode, n2); return p; } } - down(p1-d_inode-i_sem); - down(p2-d_inode-i_sem); + lock_dir(p1-d_inode, n1); + lock_dir(p2-d_inode, n2); return NULL; } -void unlock_rename(struct dentry *p1, struct dentry *p2) +void unlock_rename(struct dentry *p1, struct qstr *n1, + struct dentry *p2, struct qstr *n2) { - up(p1-d_inode-i_sem); + unlock_dir(p1-d_inode, n1); if (p1 != p2) { - up(p2-d_inode-i_sem); + unlock_dir(p2-d_inode, n2); up(p1-d_inode-i_sb-s_vfs_rename_sem); + } else if (IS_PDIROPS(p1-d_inode) + n1-hash != n2-hash) { + unlock_dir(p2-d_inode, n2); } } @@ -1386,13 +1426,13 @@ dir = nd-dentry; nd-flags = ~LOOKUP_PARENT; - down(dir-d_inode-i_sem); + lock_dir(dir-d_inode, nd-last); dentry =
Re: [RFC] pdirops: tmpfs patch
Index: linux-2.6.10/mm/shmem.c === --- linux-2.6.10.orig/mm/shmem.c2005-01-28 19:32:16.0 +0300 +++ linux-2.6.10/mm/shmem.c 2005-02-19 20:05:32.642599576 +0300 @@ -1849,7 +1849,7 @@ #endif }; -static int shmem_parse_options(char *options, int *mode, uid_t *uid, gid_t *gid, unsigned long *blocks, unsigned long *inodes) +static int shmem_parse_options(char *options, int *mode, uid_t *uid, gid_t *gid, unsigned long *blocks, unsigned long *inodes, struct super_block *sb) { char *this_char, *value, *rest; @@ -1858,6 +1858,9 @@ continue; if ((value = strchr(this_char,'=')) != NULL) { *value++ = 0; + } else if (!strcmp(this_char,pdirops)) { + sb-s_flags |= S_PDIROPS; + continue; } else { printk(KERN_ERR tmpfs: No value for mount option '%s'\n, @@ -1928,7 +1931,7 @@ max_blocks = sbinfo-max_blocks; max_inodes = sbinfo-max_inodes; } - if (shmem_parse_options(data, NULL, NULL, NULL, max_blocks, max_inodes)) + if (shmem_parse_options(data, NULL, NULL, NULL, max_blocks, max_inodes, sb)) return -EINVAL; /* Keep it simple: disallow limited - unlimited remount */ if ((max_blocks || max_inodes) == !sbinfo) @@ -1978,7 +1981,7 @@ inodes = blocks; if (shmem_parse_options(data, mode, - uid, gid, blocks, inodes)) + uid, gid, blocks, inodes, sb)) return -EINVAL; } - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Ext2-devel] Re: Latest ext3 patches (extents, mballoc, delayed allocation)
> Sonny Rao (SR) writes: SR> Alex, small buglet, If the FIBMAP-ioctl get's called on a file with SR> delayed allocation, you need to flush it (or at least allocate) before SR> returning the mappings. This doesn't seem to work properly at SR> present. good catch. thanks. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Ext2-devel] Re: Latest ext3 patches (extents, mballoc, delayed allocation)
Sonny Rao (SR) writes: SR Alex, small buglet, If the FIBMAP-ioctl get's called on a file with SR delayed allocation, you need to flush it (or at least allocate) before SR returning the mappings. This doesn't seem to work properly at SR present. good catch. thanks. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Latest ext3 patches (extents, mballoc, delayed allocation)
Good day all, I've updated the patchset against 2.6.10. A bunch of bugs have been fixed and mballoc now behaves smarter a bit. Extents and mballoc patches collects some stats they print upon umount. NOTE: they must not be used to store important data. A lot of things are to be done. Please review. Any comments and suggestions are very welcome. The patches are too big to send in a message (previous time they got rejected). I put them in ftp://ftp.clusterfs.com/pub/people/alex/2.6.10 The following names are used: ext3rs - ext3 mounted with data=writeback,reservation options ext3i - ext3 mounted with extents,mballoc,delalloc options reiser - reiserfs v3 I did some benchmarking. Two systems were used: SMP - dual PIII-1GHz, 1GB, FC-connected to 2 disks raid0 UP - PIII-933MHz, 256MB, FC-connected to 2 disks raid0 The tests dd500 and dd1000 generate 500M and 1000M using dd on fresh fs. Real time varies run from run because of disks, but sys.time is stable. SMPUP TEST / FS real sys real sys dd500/ext2 - 9.14 2.3319.03 2.50 dd500/ext3 - 12.23 4.0618.99 4.19 dd500/ext3rs- 13.39 4.0115.08 4.16 dd500/ext3i - 9.19 2.3119.07 2.52 dd500/reiser- 7.95 2.8721.23 3.09 dd500/xfs - 17.88 2.4219.25 2.67 dd1000/ext2 - 37.47 4.6944.59 5.02 dd1000/ext3 - 38.03 7.9040.77 8.31 dd1000/ext3rs - 34.62 7.9540.51 8.30 dd1000/ext3i- 33.73 4.6538.74 4.93 dd1000/reiser - 29.29 5.7944.88 6.08 dd1000/xfs - 37.11 5.0339.98 5.27 The test untar unpacks linux-2.6.0.tar: SMP UP TEST / FS real sys real sys untar/ext2 - 9.21 2.4727.31 2.57 untar/ext3 - 15.63 3.3834.54 3.61 untar/ext3rs- 15.91 3.2735.98 3.65 untar/ext3i - 8.33 2.7016.75 2.84 untar/reiser- 13.38 5.4725.88 5.96 untar/xfs - 44.62 5.6451.92 4.88 The next test is dbench: SMP UP TEST / FS real sys MB/s real sys MB/s dbench1/ext2- 5.87 1.4363.72565.93 1.5463.79 dbench1/ext3- 2.46 1.75139.7943.60 1.85131.372 dbench1/ext3rs - 2.46 1.73140.3783.55 1.85133.071 dbench1/ext3i - 2.19 1.48156.9762.28 1.53150.403 dbench1/reiser - 2.80 2.05122.7612.88 2.11119.815 dbench1/xfs - 2.83 1.81122.1592.87 1.91119.564 dbench4/ext2- 4.99 7.13274.2169.96 6.22137.316 dbench4/ext3- 5.79 8.64236.85811.49 7.40130.146 dbench4/ext3rs - 5.80 8.57236.35611.45 7.38130.467 dbench4/ext3i - 5.16 7.16265.6219.25 6.31147.872 dbench4/reiser - 7.11 10.85 192.49111.85 8.59115.392 dbench4/xfs - 6.60 8.88207.59911.98 7.69114.177 dbench8/ext2- 16.04 14.27 181.93 18.32 12.14 149.228 dbench8/ext3- 18.87 17.04 165.21423.77 14.88 119.995 dbench8/ext3rs - 11.52 17.21 237.64723.35 15.04 123.088 dbench8/ext3i - 11.20 14.66 268.05221.00 12.49 130.223 dbench8/reiser - 13.92 21.65 196.30624.98 17.67 114.083 dbench8/xfs - 14.84 18.23 184.26325.84 15.53 105.743 dbench16/ext2 - 20.39 28.79 267.97947.37 25.13 115.582 dbench16/ext3 - 25.69 34.54 212.74553.43 30.78 102.266 dbench16/ext3rs - 24.47 34.36 223.37 54.68 30.44 100.082 dbench16/ext3i - 22.94 29.40 238.21544.89 25.73 121.962 dbench16/reiser - 28.56 43.86 191.40756.48 36.41 96.9686 dbench16/xfs- 33.49 36.94 163.15978.68 32.56 69.4764 dbench32/ext2 - 43.54 58.87 250.947108.1 51.24 101.354 dbench32/ext3 - 61.83 70.66 176.707139.84 63.77 78.8818 dbench32/ext3rs - 67.24 71.69 162.651145.03 63.38 75.5155 dbench32/ext3i - 55.29 59.41 197.757100.24 52.50 109.26 dbench32/reiser - 76.32 93.45 143.127128.77 77.94 85.7179 dbench32/xfs- 119.4 88.09 91.4746670.45 76.19 16.298 The followins crazy listing shows tiobench's results for SMP box: Sequential Reads File Blk Num Avg CPU IdentifierSize Size Thr Rate (CPU%) Latency Eff -- - --- -- -- - - ext2 512 40961 40.95 12.80% 0.095 320 ext3 512 40961 45.03 13.99% 0.086 322 ext3rs512 40961 50.53
Re: Latest ext3 patches (extents, mballoc, delayed allocation)
Good day all, I've updated the patchset against 2.6.10. A bunch of bugs have been fixed and mballoc now behaves smarter a bit. Extents and mballoc patches collects some stats they print upon umount. NOTE: they must not be used to store important data. A lot of things are to be done. Please review. Any comments and suggestions are very welcome. The patches are too big to send in a message (previous time they got rejected). I put them in ftp://ftp.clusterfs.com/pub/people/alex/2.6.10 The following names are used: ext3rs - ext3 mounted with data=writeback,reservation options ext3i - ext3 mounted with extents,mballoc,delalloc options reiser - reiserfs v3 I did some benchmarking. Two systems were used: SMP - dual PIII-1GHz, 1GB, FC-connected to 2 disks raid0 UP - PIII-933MHz, 256MB, FC-connected to 2 disks raid0 The tests dd500 and dd1000 generate 500M and 1000M using dd on fresh fs. Real time varies run from run because of disks, but sys.time is stable. SMPUP TEST / FS real sys real sys dd500/ext2 - 9.14 2.3319.03 2.50 dd500/ext3 - 12.23 4.0618.99 4.19 dd500/ext3rs- 13.39 4.0115.08 4.16 dd500/ext3i - 9.19 2.3119.07 2.52 dd500/reiser- 7.95 2.8721.23 3.09 dd500/xfs - 17.88 2.4219.25 2.67 dd1000/ext2 - 37.47 4.6944.59 5.02 dd1000/ext3 - 38.03 7.9040.77 8.31 dd1000/ext3rs - 34.62 7.9540.51 8.30 dd1000/ext3i- 33.73 4.6538.74 4.93 dd1000/reiser - 29.29 5.7944.88 6.08 dd1000/xfs - 37.11 5.0339.98 5.27 The test untar unpacks linux-2.6.0.tar: SMP UP TEST / FS real sys real sys untar/ext2 - 9.21 2.4727.31 2.57 untar/ext3 - 15.63 3.3834.54 3.61 untar/ext3rs- 15.91 3.2735.98 3.65 untar/ext3i - 8.33 2.7016.75 2.84 untar/reiser- 13.38 5.4725.88 5.96 untar/xfs - 44.62 5.6451.92 4.88 The next test is dbench: SMP UP TEST / FS real sys MB/s real sys MB/s dbench1/ext2- 5.87 1.4363.72565.93 1.5463.79 dbench1/ext3- 2.46 1.75139.7943.60 1.85131.372 dbench1/ext3rs - 2.46 1.73140.3783.55 1.85133.071 dbench1/ext3i - 2.19 1.48156.9762.28 1.53150.403 dbench1/reiser - 2.80 2.05122.7612.88 2.11119.815 dbench1/xfs - 2.83 1.81122.1592.87 1.91119.564 dbench4/ext2- 4.99 7.13274.2169.96 6.22137.316 dbench4/ext3- 5.79 8.64236.85811.49 7.40130.146 dbench4/ext3rs - 5.80 8.57236.35611.45 7.38130.467 dbench4/ext3i - 5.16 7.16265.6219.25 6.31147.872 dbench4/reiser - 7.11 10.85 192.49111.85 8.59115.392 dbench4/xfs - 6.60 8.88207.59911.98 7.69114.177 dbench8/ext2- 16.04 14.27 181.93 18.32 12.14 149.228 dbench8/ext3- 18.87 17.04 165.21423.77 14.88 119.995 dbench8/ext3rs - 11.52 17.21 237.64723.35 15.04 123.088 dbench8/ext3i - 11.20 14.66 268.05221.00 12.49 130.223 dbench8/reiser - 13.92 21.65 196.30624.98 17.67 114.083 dbench8/xfs - 14.84 18.23 184.26325.84 15.53 105.743 dbench16/ext2 - 20.39 28.79 267.97947.37 25.13 115.582 dbench16/ext3 - 25.69 34.54 212.74553.43 30.78 102.266 dbench16/ext3rs - 24.47 34.36 223.37 54.68 30.44 100.082 dbench16/ext3i - 22.94 29.40 238.21544.89 25.73 121.962 dbench16/reiser - 28.56 43.86 191.40756.48 36.41 96.9686 dbench16/xfs- 33.49 36.94 163.15978.68 32.56 69.4764 dbench32/ext2 - 43.54 58.87 250.947108.1 51.24 101.354 dbench32/ext3 - 61.83 70.66 176.707139.84 63.77 78.8818 dbench32/ext3rs - 67.24 71.69 162.651145.03 63.38 75.5155 dbench32/ext3i - 55.29 59.41 197.757100.24 52.50 109.26 dbench32/reiser - 76.32 93.45 143.127128.77 77.94 85.7179 dbench32/xfs- 119.4 88.09 91.4746670.45 76.19 16.298 The followins crazy listing shows tiobench's results for SMP box: Sequential Reads File Blk Num Avg CPU IdentifierSize Size Thr Rate (CPU%) Latency Eff -- - --- -- -- - - ext2 512 40961 40.95 12.80% 0.095 320 ext3 512 40961 45.03 13.99% 0.086 322 ext3rs512 40961 50.53
Re: [Ext2-devel] [PATCH] JBD: journal_release_buffer()
>>>>> Stephen C Tweedie (SCT) writes: SCT> Hi, SCT> On Tue, 2005-01-25 at 19:30, Alex Tomas wrote: >> >> journal_dirty_metadata(handle, bh) >> >> { >> >> transaction->t_reserved--; >> >> handle->h_buffer_credits--; >> >> if (jh->b_tcount > 0) { >> >> /* modifed, no need to track it any more */ >> >> transaction-> t_outstanding_credits++; >> >>jh-> b_tcount = -1; >> >> } >> >> } >> SCT> Actually, the whole thing can be wrapped in if (jh->b_tcount > 0) {}, I SCT> think. >> the idea is: >> 1) the sooner we drop reservation, the higher probability to cover many >> changes by single transaction SCT> But that's exactly why we _don't_ want to do this. You're dropping the SCT> reservation, but remember, we return unused handle credits to the SCT> transaction at the end of the handle's life. yup, you're right. so, here is an implementation of this. tested on UP/SMP by dbench/fsx. Stephen, Andrew, could you review the patch and give it a run? thanks, Alex fix against credits leak in journal_release_buffer() The idea is to charge a buffer in journal_dirty_metadata(), not in journal_get_*_access()). Each buffer has flag call journal_dirty_metadata() sets on the buffer. Signed-off-by: Alex Tomas <[EMAIL PROTECTED]> Index: linux-2.6.10/include/linux/journal-head.h === --- linux-2.6.10.orig/include/linux/journal-head.h 2003-06-24 18:05:26.0 +0400 +++ linux-2.6.10/include/linux/journal-head.h 2005-01-29 03:20:05.0 +0300 @@ -32,6 +32,13 @@ unsigned b_jlist; /* +* This flag signals the buffer has been modified by +* the currently running transaction +* [jbd_lock_bh_state()] +*/ + unsigned b_modified; + + /* * Copy of the buffer data frozen for writing to the log. * [jbd_lock_bh_state()] */ Index: linux-2.6.10/include/linux/ext3_jbd.h === --- linux-2.6.10.orig/include/linux/ext3_jbd.h 2005-01-28 19:32:15.0 +0300 +++ linux-2.6.10/include/linux/ext3_jbd.h 2005-01-29 03:13:41.0 +0300 @@ -113,9 +113,9 @@ static inline int __ext3_journal_get_undo_access(const char *where, handle_t *handle, - struct buffer_head *bh, int *credits) + struct buffer_head *bh) { - int err = journal_get_undo_access(handle, bh, credits); + int err = journal_get_undo_access(handle, bh); if (err) ext3_journal_abort_handle(where, __FUNCTION__, bh, handle,err); return err; @@ -123,19 +123,18 @@ static inline int __ext3_journal_get_write_access(const char *where, handle_t *handle, - struct buffer_head *bh, int *credits) + struct buffer_head *bh) { - int err = journal_get_write_access(handle, bh, credits); + int err = journal_get_write_access(handle, bh); if (err) ext3_journal_abort_handle(where, __FUNCTION__, bh, handle,err); return err; } static inline void -ext3_journal_release_buffer(handle_t *handle, struct buffer_head *bh, - int credits) +ext3_journal_release_buffer(handle_t *handle, struct buffer_head *bh) { - journal_release_buffer(handle, bh, credits); + journal_release_buffer(handle, bh); } static inline void @@ -175,12 +174,10 @@ } -#define ext3_journal_get_undo_access(handle, bh, credits) \ - __ext3_journal_get_undo_access(__FUNCTION__, (handle), (bh), (credits)) +#define ext3_journal_get_undo_access(handle, bh) \ + __ext3_journal_get_undo_access(__FUNCTION__, (handle), (bh)) #define ext3_journal_get_write_access(handle, bh) \ - __ext3_journal_get_write_access(__FUNCTION__, (handle), (bh), NULL) -#define ext3_journal_get_write_access_credits(handle, bh, credits) \ - __ext3_journal_get_write_access(__FUNCTION__, (handle), (bh), (credits)) + __ext3_journal_get_write_access(__FUNCTION__, (handle), (bh)) #define ext3_journal_revoke(handle, blocknr, bh) \ __ext3_journal_revoke(__FUNCTION__, (handle), (blocknr), (bh)) #define ext3_journal_get_create_access(handle, bh) \ Index: linux-2.6.10/include/linux/jbd.h === --- linux-2.6.10.orig/include/linux/jbd.h 2005-01-28 19:32:15.0 +0300 +++ linux-2.6.10/include/linux/jbd.h2005-01-29 03:13:41.0 +0300 @@ -865,15 +865,12 @@ extern handle_t *journal_start(journal_t *, int nblocks); extern int journal_restart (handle_t *, int nblocks); extern int journal_exte
Re: [Ext2-devel] [PATCH] JBD: journal_release_buffer()
> Stephen C Tweedie (SCT) writes: >> journal_dirty_metadata(handle, bh) >> { >> transaction->t_reserved--; >> handle->h_buffer_credits--; >> if (jh->b_tcount > 0) { >> /* modifed, no need to track it any more */ >> transaction-> t_outstanding_credits++; >>jh-> b_tcount = -1; >> } >> } SCT> Actually, the whole thing can be wrapped in if (jh->b_tcount > 0) {}, I SCT> think. If we have already charged the transaction for this buffer, then SCT> there's no need to charge the handle for it again. That's going to be SCT> particularly important for truncate(), where we are continually updating SCT> the same blocks (eg. bitmap, indirect blocks), so we want to make sure SCT> that after the first journal_dirty_metadata() call, no further charge is SCT> made. the idea is: 1) the sooner we drop reservation, the higher probability to cover many changes by single transaction 1) having h_buffer_credits being decremented for each modification could help us to debug handle overflow situations SCT> If we do that, do we in fact need t_reserved at all? hmm. if t_outstanding_credits holds number of modified buffers, then we need sum of all running h_buffer_credits to protect from transaction overflow. t_reserved is sum of h_buffer_credits. thanks, Alex - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Ext2-devel] [PATCH] JBD: journal_release_buffer()
Hi, could you review the following solution? t_outstanding_credits - number of _modified_ blocks in the transaction t_reserved - number of blocks all running handle reserved transaction size = t_outstanding_credits + t_reserved; #define TSIZE(t)((t)->t_outstanding_credits + (t)->t_reserved) journal_start(blocks) { if (TSIZE(transaction) + blocks > MAX) wait_for_space(journal); transaction->t_reserved += blocks; handle->h_buffer_credits = blocks; } journal_get_write_access(handle, bh) { if (jh->b_tcount >= 0) jh->b_tcount++; } journal_dirty_metadata(handle, bh) { transaction->t_reserved--; handle->h_buffer_credits--; if (jh->b_tcount > 0) { /* modifed, no need to track it any more */ transaction->t_outstanding_credits++; jh->b_tcount = -1; } } journal_release_buffer(handle, bh) { if (jh->b_tcount > 0) { /* it's not modified yet */ jh->b_tcount--; if (jh->b_tcount == 0) { /* remove from the transaction */ } } } journal_stop(handle) { transaction->t_outstanding_credits -= handle->h_buffer_credits; } thanks, Alex - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Ext2-devel] [PATCH] JBD: journal_release_buffer()
Hi, could you review the following solution? t_outstanding_credits - number of _modified_ blocks in the transaction t_reserved - number of blocks all running handle reserved transaction size = t_outstanding_credits + t_reserved; #define TSIZE(t)((t)-t_outstanding_credits + (t)-t_reserved) journal_start(blocks) { if (TSIZE(transaction) + blocks MAX) wait_for_space(journal); transaction-t_reserved += blocks; handle-h_buffer_credits = blocks; } journal_get_write_access(handle, bh) { if (jh-b_tcount = 0) jh-b_tcount++; } journal_dirty_metadata(handle, bh) { transaction-t_reserved--; handle-h_buffer_credits--; if (jh-b_tcount 0) { /* modifed, no need to track it any more */ transaction-t_outstanding_credits++; jh-b_tcount = -1; } } journal_release_buffer(handle, bh) { if (jh-b_tcount 0) { /* it's not modified yet */ jh-b_tcount--; if (jh-b_tcount == 0) { /* remove from the transaction */ } } } journal_stop(handle) { transaction-t_outstanding_credits -= handle-h_buffer_credits; } thanks, Alex - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Ext2-devel] [PATCH] JBD: journal_release_buffer()
Stephen C Tweedie (SCT) writes: journal_dirty_metadata(handle, bh) { transaction-t_reserved--; handle-h_buffer_credits--; if (jh-b_tcount 0) { /* modifed, no need to track it any more */ transaction- t_outstanding_credits++; jh- b_tcount = -1; } } SCT Actually, the whole thing can be wrapped in if (jh-b_tcount 0) {}, I SCT think. If we have already charged the transaction for this buffer, then SCT there's no need to charge the handle for it again. That's going to be SCT particularly important for truncate(), where we are continually updating SCT the same blocks (eg. bitmap, indirect blocks), so we want to make sure SCT that after the first journal_dirty_metadata() call, no further charge is SCT made. the idea is: 1) the sooner we drop reservation, the higher probability to cover many changes by single transaction 1) having h_buffer_credits being decremented for each modification could help us to debug handle overflow situations SCT If we do that, do we in fact need t_reserved at all? hmm. if t_outstanding_credits holds number of modified buffers, then we need sum of all running h_buffer_credits to protect from transaction overflow. t_reserved is sum of h_buffer_credits. thanks, Alex - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Ext2-devel] [PATCH] JBD: journal_release_buffer()
> Stephen C Tweedie (SCT) writes: >> + /* return credit back to the handle if it was really spent */ >> + if (credits) { >> + handle->h_buffer_credits++; >> + spin_lock(>h_transaction->t_handle_lock); >> + handle->h_transaction->t_outstanding_credits++; >> + spin_lock(>h_transaction->t_handle_lock); >> + } SCT> That returns the credit to A (satisfying ext3), but you just grew SCT> t_outstanding_credits, thus growing the journal commitments without SCT> checking if it's safe to do so or being able to handle failure. So it SCT> just reintroduces the original bug. incremented h_buffer_credits will be subtracted from incremented t_outstanding_credits in journal_stop(). so, there is no imbalance at this point. then, if (b_tcount == 0) we can correct t_outstanind_credits or h_buffer_credits. wrong? let's try another way ... we have two processes: P1 and P2. they access block B. the code is: if (credits != 0) { handle->h_buffer_credits++; transaction->t_outstanding_credits++; } if (jh->b_tcount == 0) transcation->t_outstanding_credits--; case 1: P1 accesses B (*credits=1) P1 releases B (credits != 0) h1->h_buffer_credits++; (credits != 0) transaction->t_outstanding_credits++; (b_tcount == 0) transaction->t_outstanding_credits--; OUTPUT: transaction->t_outstanding_credits -= 1 case 2: P1 accesses B (*credits=1) P2 accesses B (*credits=0) P1 releases B P2 modifies B (credits != 0) h1->h_buffer_credits++; (credits != 0) transaction->t_outstainding_credits++; (b_tcount != 0) OUTPUT: journal_stop() will subtract incremented h_buffer_credits from incremented t_outstading_credits => no changes case 3: P1 accesses B (*credits=1) P2 accesses B (*credits=0) P2 releases B P1 modifies B (credits != 0) (credits != 0) (b_tcount == 0) OUTPUT: no changes case 4: P1 accesses B (*credits=1) P2 accesses B (*credits=0) P2 releases B P1 releases B (credits != 0) (credits != 0) (b_tcount == 0) (credits != 0) h1->h_buffer_credits++; (credits != 0) transaction->t_outstanding_credits++; (b_tcount == 0) transaction->t_outstanding_credits--; OUTPUT: P2 will change nothing, P1 will drop the buffer and correct t_outstanding_credits case 5: P1 accesses B (*credits=1) P2 accesses B (*credits=0) P1 releases B P2 releases B (credits != 0) h1->h_buffer_credits++; (credits != 0) transaction->t_outstanding_credits++; (b_tcount == 0) (credits != 0) (credits != 0) (b_tcount == 0) transaction->t_outstanding_credits--; OUTPUT: P1 will change own credits, P2 will drop the buffer and correct t_outstanding_credits thanks, Alex - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Ext2-devel] [PATCH] JBD: log space management optimization
>>>>> Stephen C Tweedie (SCT) writes: SCT> If you returned them to the handle directly, it would be slightly more SCT> efficient. good point. thanks. here is the fixed patch. during truncate ext3 calls journal_forget() for freed blocks, but before these blocks go to the transaction and jbd reserves space in log for them (->t_outstanding_credits). also, journal_forget() removes these blocks from the transaction, but doesn't correct log space reservation. for example, removal of 500MB file reserves 136 blocks, but only 10 blocks go to the log. a commit is expensive and correct reservation allows us to avoid needless commits. here is the patch. tested on UP. Signed-off-by: Alex Tomas <[EMAIL PROTECTED]> Index: linux-2.6.7/fs/jbd/transaction.c === --- linux-2.6.7.orig/fs/jbd/transaction.c 2004-08-26 17:12:40.0 +0400 +++ linux-2.6.7/fs/jbd/transaction.c2005-01-24 22:51:34.0 +0300 @@ -1204,6 +1204,7 @@ transaction_t *transaction = handle->h_transaction; journal_t *journal = transaction->t_journal; struct journal_head *jh; + int drop_reserve = 0; BUFFER_TRACE(bh, "entry"); @@ -1227,6 +1228,7 @@ J_ASSERT_JH(jh, !jh->b_committed_data); __journal_unfile_buffer(jh); + drop_reserve = 1; /* * We are no longer going to journal this buffer. @@ -1249,7 +1251,7 @@ spin_unlock(>j_list_lock); jbd_unlock_bh_state(bh); __bforget(bh); - return; + goto drop; } } } else if (jh->b_transaction) { @@ -1264,6 +1266,7 @@ if (jh->b_next_transaction) { J_ASSERT(jh->b_next_transaction == transaction); jh->b_next_transaction = NULL; + drop_reserve = 1; } } @@ -1271,6 +1274,13 @@ spin_unlock(>j_list_lock); jbd_unlock_bh_state(bh); __brelse(bh); + +drop: + if (drop_reserve) { + /* no need to reserve log space for this block -bzzz */ + handle->h_buffer_credits++; + } + return; } - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Ext2-devel] [PATCH] JBD: journal_release_buffer()
> Stephen C Tweedie (SCT) writes: >> + /* return credit back to the handle if it was really spent */ >> + if (credits) >> + handle->h_buffer_credits++; >> + jh->b_tcount--; >> + if (jh->b_tcount == 0) { >> + /* >> +* this was last reference to the block from the current >> +* transaction and we'd like to return credit to the >> +* whole transaction -bzzz >> +*/ >> + if (!credits) >> + handle->h_buffer_credits++; SCT> I think there's a problem here. SCT> What if: SCT> Process A gets write access, and is the first to do so (*credits=1) SCT> Processes B gets write access (*credits=0) SCT> B modifies the buffer and finishes SCT> A looks again, sees B's modifications, and releases the buffer because SCT> it's no use any more. SCT> Now, B's release didn't return credits. The bh is part of the SCT> transaction and was not released. hmmm. that's a good catch. so, with this patch A increments h_buffer_credits and this one will go to the t_outstanding_credits while the buffer is still part of the transaction. indeed, an imbalance. probably something like the following would be enough? + /* return credit back to the handle if it was really spent */ + if (credits) { + handle->h_buffer_credits++; + spin_lock(>h_transaction->t_handle_lock); + handle->h_transaction->t_outstanding_credits++; + spin_lock(>h_transaction->t_handle_lock); + } thanks, Alex - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Ext2-devel] [PATCH] JBD: fix against journal overflow
> Stephen C Tweedie (SCT) writes: SCT> /* SCT>* Be pessimistic here about the number of those free blocks which SCT>* might be required for log descriptor control blocks. SCT>*/ SCT> ... SCT> left -= (left >> 3); oops. i overlooked this line. so, the fix becomes minor improvement patch ;) thanks! do you think the following can be improved? /* * @@@ AKPM: This seems rather over-defensive. We're giving commit * a _lot_ of headroom: 1/4 of the journal plus the size of * the committing transaction. Really, we only need to give it * committing_transaction->t_outstanding_credits plus "enough" for * the log control blocks. * Also, this test is inconsitent with the matching one in * journal_extend(). */ if (__log_space_left(journal) < jbd_space_needed(journal)) { - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Ext2-devel] [PATCH] JBD: fix against journal overflow
> Stephen C Tweedie (SCT) writes: SCT> I don't see how that "limit" is relevant here. wbuf is nothing but the SCT> size of the IO batches we pass to ll_rw_block() during that commit SCT> phase. j_free affects the total size of space the *entire* commit has SCT> to run into, and (as akpm has commented with a big marker beside it) SCT> start_this_handle() reserves a *lot* of headroom for the extra space SCT> that may be needed for transaction metadata. /* If there's no more to do, or if the descriptor is full, let the IO rip! */ if (bufs == ARRAY_SIZE(wbuf) || commit_transaction->t_buffers == NULL || space_left < sizeof(journal_block_tag_t) + 16) { /* Force a new descriptor to be generated next time round the loop. */ descriptor = NULL; bufs = 0; ^^^ SCT> Have you really seen this patch make a difference in testing? of course - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Ext2-devel] [PATCH] JBD: fix against journal overflow
Stephen C Tweedie (SCT) writes: SCT I don't see how that limit is relevant here. wbuf is nothing but the SCT size of the IO batches we pass to ll_rw_block() during that commit SCT phase. j_free affects the total size of space the *entire* commit has SCT to run into, and (as akpm has commented with a big marker beside it) SCT start_this_handle() reserves a *lot* of headroom for the extra space SCT that may be needed for transaction metadata. /* If there's no more to do, or if the descriptor is full, let the IO rip! */ if (bufs == ARRAY_SIZE(wbuf) || commit_transaction-t_buffers == NULL || space_left sizeof(journal_block_tag_t) + 16) { /* Force a new descriptor to be generated next time round the loop. */ descriptor = NULL; bufs = 0; ^^^ SCT Have you really seen this patch make a difference in testing? of course - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Ext2-devel] [PATCH] JBD: fix against journal overflow
Stephen C Tweedie (SCT) writes: SCT /* SCT* Be pessimistic here about the number of those free blocks which SCT* might be required for log descriptor control blocks. SCT*/ SCT ... SCT left -= (left 3); oops. i overlooked this line. so, the fix becomes minor improvement patch ;) thanks! do you think the following can be improved? /* * @@@ AKPM: This seems rather over-defensive. We're giving commit * a _lot_ of headroom: 1/4 of the journal plus the size of * the committing transaction. Really, we only need to give it * committing_transaction-t_outstanding_credits plus enough for * the log control blocks. * Also, this test is inconsitent with the matching one in * journal_extend(). */ if (__log_space_left(journal) jbd_space_needed(journal)) { - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Ext2-devel] [PATCH] JBD: journal_release_buffer()
Stephen C Tweedie (SCT) writes: + /* return credit back to the handle if it was really spent */ + if (credits) + handle-h_buffer_credits++; + jh-b_tcount--; + if (jh-b_tcount == 0) { + /* +* this was last reference to the block from the current +* transaction and we'd like to return credit to the +* whole transaction -bzzz +*/ + if (!credits) + handle-h_buffer_credits++; SCT I think there's a problem here. SCT What if: SCT Process A gets write access, and is the first to do so (*credits=1) SCT Processes B gets write access (*credits=0) SCT B modifies the buffer and finishes SCT A looks again, sees B's modifications, and releases the buffer because SCT it's no use any more. SCT Now, B's release didn't return credits. The bh is part of the SCT transaction and was not released. hmmm. that's a good catch. so, with this patch A increments h_buffer_credits and this one will go to the t_outstanding_credits while the buffer is still part of the transaction. indeed, an imbalance. probably something like the following would be enough? + /* return credit back to the handle if it was really spent */ + if (credits) { + handle-h_buffer_credits++; + spin_lock(handle-h_transaction-t_handle_lock); + handle-h_transaction-t_outstanding_credits++; + spin_lock(handle-h_transaction-t_handle_lock); + } thanks, Alex - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Ext2-devel] [PATCH] JBD: log space management optimization
Stephen C Tweedie (SCT) writes: SCT If you returned them to the handle directly, it would be slightly more SCT efficient. good point. thanks. here is the fixed patch. during truncate ext3 calls journal_forget() for freed blocks, but before these blocks go to the transaction and jbd reserves space in log for them (-t_outstanding_credits). also, journal_forget() removes these blocks from the transaction, but doesn't correct log space reservation. for example, removal of 500MB file reserves 136 blocks, but only 10 blocks go to the log. a commit is expensive and correct reservation allows us to avoid needless commits. here is the patch. tested on UP. Signed-off-by: Alex Tomas [EMAIL PROTECTED] Index: linux-2.6.7/fs/jbd/transaction.c === --- linux-2.6.7.orig/fs/jbd/transaction.c 2004-08-26 17:12:40.0 +0400 +++ linux-2.6.7/fs/jbd/transaction.c2005-01-24 22:51:34.0 +0300 @@ -1204,6 +1204,7 @@ transaction_t *transaction = handle-h_transaction; journal_t *journal = transaction-t_journal; struct journal_head *jh; + int drop_reserve = 0; BUFFER_TRACE(bh, entry); @@ -1227,6 +1228,7 @@ J_ASSERT_JH(jh, !jh-b_committed_data); __journal_unfile_buffer(jh); + drop_reserve = 1; /* * We are no longer going to journal this buffer. @@ -1249,7 +1251,7 @@ spin_unlock(journal-j_list_lock); jbd_unlock_bh_state(bh); __bforget(bh); - return; + goto drop; } } } else if (jh-b_transaction) { @@ -1264,6 +1266,7 @@ if (jh-b_next_transaction) { J_ASSERT(jh-b_next_transaction == transaction); jh-b_next_transaction = NULL; + drop_reserve = 1; } } @@ -1271,6 +1274,13 @@ spin_unlock(journal-j_list_lock); jbd_unlock_bh_state(bh); __brelse(bh); + +drop: + if (drop_reserve) { + /* no need to reserve log space for this block -bzzz */ + handle-h_buffer_credits++; + } + return; } - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Ext2-devel] [PATCH] JBD: journal_release_buffer()
Stephen C Tweedie (SCT) writes: + /* return credit back to the handle if it was really spent */ + if (credits) { + handle-h_buffer_credits++; + spin_lock(handle-h_transaction-t_handle_lock); + handle-h_transaction-t_outstanding_credits++; + spin_lock(handle-h_transaction-t_handle_lock); + } SCT That returns the credit to A (satisfying ext3), but you just grew SCT t_outstanding_credits, thus growing the journal commitments without SCT checking if it's safe to do so or being able to handle failure. So it SCT just reintroduces the original bug. incremented h_buffer_credits will be subtracted from incremented t_outstanding_credits in journal_stop(). so, there is no imbalance at this point. then, if (b_tcount == 0) we can correct t_outstanind_credits or h_buffer_credits. wrong? let's try another way ... we have two processes: P1 and P2. they access block B. the code is: if (credits != 0) { handle-h_buffer_credits++; transaction-t_outstanding_credits++; } if (jh-b_tcount == 0) transcation-t_outstanding_credits--; case 1: P1 accesses B (*credits=1) P1 releases B (credits != 0) h1-h_buffer_credits++; (credits != 0) transaction-t_outstanding_credits++; (b_tcount == 0) transaction-t_outstanding_credits--; OUTPUT: transaction-t_outstanding_credits -= 1 case 2: P1 accesses B (*credits=1) P2 accesses B (*credits=0) P1 releases B P2 modifies B (credits != 0) h1-h_buffer_credits++; (credits != 0) transaction-t_outstainding_credits++; (b_tcount != 0) OUTPUT: journal_stop() will subtract incremented h_buffer_credits from incremented t_outstading_credits = no changes case 3: P1 accesses B (*credits=1) P2 accesses B (*credits=0) P2 releases B P1 modifies B (credits != 0) (credits != 0) (b_tcount == 0) OUTPUT: no changes case 4: P1 accesses B (*credits=1) P2 accesses B (*credits=0) P2 releases B P1 releases B (credits != 0) (credits != 0) (b_tcount == 0) (credits != 0) h1-h_buffer_credits++; (credits != 0) transaction-t_outstanding_credits++; (b_tcount == 0) transaction-t_outstanding_credits--; OUTPUT: P2 will change nothing, P1 will drop the buffer and correct t_outstanding_credits case 5: P1 accesses B (*credits=1) P2 accesses B (*credits=0) P1 releases B P2 releases B (credits != 0) h1-h_buffer_credits++; (credits != 0) transaction-t_outstanding_credits++; (b_tcount == 0) (credits != 0) (credits != 0) (b_tcount == 0) transaction-t_outstanding_credits--; OUTPUT: P1 will change own credits, P2 will drop the buffer and correct t_outstanding_credits thanks, Alex - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] JBD: log space management optimization
Good day, during truncate ext3 calls journal_forget() for freed blocks, but before these blocks go to the transaction and jbd reserves space in log for them (->t_outstanding_credits). also, journal_forget() removes these blocks from the transaction, but doesn't correct log space reservation. for example, removal of 500MB file reserves 136 blocks, but only 10 blocks go to the log. a commit is expensive and correct reservation allows us to avoid needless commits. here is the patch. tested on UP. thanks, Alex Signed-off-by: Alex Tomas <[EMAIL PROTECTED]> Index: linux-2.6.7/fs/jbd/transaction.c === --- linux-2.6.7.orig/fs/jbd/transaction.c 2004-08-26 17:12:40.0 +0400 +++ linux-2.6.7/fs/jbd/transaction.c2005-01-19 17:23:30.058160408 +0300 @@ -1204,6 +1257,7 @@ transaction_t *transaction = handle->h_transaction; journal_t *journal = transaction->t_journal; struct journal_head *jh; + int drop_reserve = 0; BUFFER_TRACE(bh, "entry"); @@ -1227,6 +1281,7 @@ J_ASSERT_JH(jh, !jh->b_committed_data); __journal_unfile_buffer(jh); + drop_reserve = 1; /* * We are no longer going to journal this buffer. @@ -1249,7 +1304,7 @@ spin_unlock(>j_list_lock); jbd_unlock_bh_state(bh); __bforget(bh); - return; + goto drop; } } } else if (jh->b_transaction) { @@ -1264,6 +1319,7 @@ if (jh->b_next_transaction) { J_ASSERT(jh->b_next_transaction == transaction); jh->b_next_transaction = NULL; + drop_reserve = 1; } } @@ -1271,6 +1327,15 @@ spin_unlock(>j_list_lock); jbd_unlock_bh_state(bh); __brelse(bh); + +drop: + if (drop_reserve) { + /* no need to reserve log space for this block -bzzz */ + spin_lock(>t_handle_lock); + transaction->t_outstanding_credits--; + spin_unlock(>t_handle_lock); + } + return; } - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] JBD: journal_release_buffer()
Good day, journal_release_buffer() can cause journal overflow in some (very rare) conditions. Please, take a look at possible fix. Th journal_head structure holds number of users of the buffer for current transaction. The routines do_get_write_access() and journal_get_create_access() tracks this number: 1) resets it to zero if the block's becoming part of the current transaction 2) increments it journal_release_buffer() decrements it and if it's 0, then the blocks isn't member of the transaction. The patch has been tested on UP with dbench and tool that uses xattr very much. Signed-off-by: Alex Tomas <[EMAIL PROTECTED]> Index: linux-2.6.7/include/linux/journal-head.h === --- linux-2.6.7.orig/include/linux/journal-head.h 2003-06-24 18:05:26.0 +0400 +++ linux-2.6.7/include/linux/journal-head.h2005-01-19 14:09:59.0 +0300 @@ -80,6 +80,11 @@ * [j_list_lock] */ struct journal_head *b_cpnext, *b_cpprev; + + /* +* counter to track users of the buffer in current transaction +*/ + int b_tcount; }; #endif /* JOURNAL_HEAD_H_INCLUDED */ Index: linux-2.6.7/fs/jbd/transaction.c === --- linux-2.6.7.orig/fs/jbd/transaction.c 2004-08-26 17:12:40.0 +0400 +++ linux-2.6.7/fs/jbd/transaction.c2005-01-19 17:23:30.058160408 +0300 @@ -611,6 +611,10 @@ handle->h_buffer_credits--; if (credits) (*credits)++; + + /* the block's becoming member of the trasaction -bzzz */ + jh->b_tcount = 0; + goto done; } @@ -694,6 +698,9 @@ if (credits) (*credits)++; + /* the block's becoming member of the trasaction -bzzz */ + jh->b_tcount = 0; + /* * Finally, if the buffer is not journaled right now, we need to make * sure it doesn't get written to disk before the caller actually @@ -723,6 +730,11 @@ memcpy(jh->b_frozen_data, source+offset, jh2bh(jh)->b_size); kunmap_atomic(source, KM_USER0); } + + /* track all references to the block to be able to recognize the +* situation when the buffer is not part of transaction -bzzz */ + jh->b_tcount++; + jbd_unlock_bh_state(bh); /* @@ -822,11 +834,20 @@ jh->b_transaction = transaction; JBUFFER_TRACE(jh, "file as BJ_Reserved"); __journal_file_buffer(jh, transaction, BJ_Reserved); + jh->b_tcount = 0; } else if (jh->b_transaction == journal->j_committing_transaction) { JBUFFER_TRACE(jh, "set next transaction"); jh->b_next_transaction = transaction; + jh->b_tcount = 0; } spin_unlock(>j_list_lock); + + /* +* track all reference to the block to be able to recognize +* the situation when the buffer is not part of transaction -bzzz +*/ + jh->b_tcount++; + jbd_unlock_bh_state(bh); /* @@ -1178,8 +1199,40 @@ void journal_release_buffer(handle_t *handle, struct buffer_head *bh, int credits) { + journal_t *journal = handle->h_transaction->t_journal; + struct journal_head *jh = bh2jh(bh); + BUFFER_TRACE(bh, "entry"); - handle->h_buffer_credits += credits; + + /* return credit back to the handle if it was really spent */ + if (credits) + handle->h_buffer_credits++; + + jbd_lock_bh_state(bh); + J_ASSERT(jh->b_tcount > 0); + + jh->b_tcount--; + if (jh->b_tcount == 0) { + /* we can drop it from the transaction -bzzz */ + J_ASSERT(jh->b_transaction == handle->h_transaction || + jh->b_next_transaction == handle->h_transaction); + if (jh->b_transaction == handle->h_transaction) { + spin_lock(>j_list_lock); + __journal_unfile_buffer(jh); + spin_unlock(>j_list_lock); + } else if(jh->b_next_transaction) { + jh->b_next_transaction = NULL; + } + + /* +* this was last reference to the block from the current +* transaction and we'd like to return credit to the +* whole transaction -bzzz +*/ + if (!credits) + handle->h_buffer_credits++; + } + jbd_unlock_bh_state(bh); } /** - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] JBD: fix against journal overflow
Good day, under some quite high load, jbd can hit J_ASSERT(journal->j_free > 1) in journal_next_log_block(). The cause is the following: journal_commit_transaction() { struct buffer_head *wbuf[64]; /* If there's no more to do, or if the descriptor is full, let the IO rip! */ if (bufs == ARRAY_SIZE(wbuf) || commit_transaction->t_buffers == NULL || space_left < sizeof(journal_block_tag_t) + 16) { so, the real limit isn't size of journal descriptor, but wbuf. __log_space_left() { /* * Be pessimistic here about the number of those free blocks which * might be required for log descriptor control blocks. */ #define MIN_LOG_RESERVED_BLOCKS 32 /* Allow for rounding errors */ left -= MIN_LOG_RESERVED_BLOCKS; so, jbd expects upto 32 blocks to be used for internal purpose. but with 64 blocks a descriptor limit we easily can break the limit. The fix allocates wbuf in journal_init_dev(). That's enough because we have only commit thread per filesystem. thank, Alex The journal_commit_transaction() generates too many descriptor blocks because static array wbuf can hold 64 blocks only. The fix is to have persistent array big enough to hold max. possible blocks. Signed-off-by: Alex Tomas <[EMAIL PROTECTED]> Index: linux-2.6.7/include/linux/jbd.h === --- linux-2.6.7.orig/include/linux/jbd.h2004-08-26 17:12:16.0 +0400 +++ linux-2.6.7/include/linux/jbd.h 2005-01-19 17:08:33.144512008 +0300 @@ -826,6 +826,12 @@ struct jbd_revoke_table_s *j_revoke_table[2]; /* +* array of bhs for journal_commit_transaction +*/ + struct buffer_head **j_wbuf; + int j_wbufsize; + + /* * An opaque pointer to fs-private information. ext3 puts its * superblock pointer here */ Index: linux-2.6.7/fs/jbd/commit.c === --- linux-2.6.7.orig/fs/jbd/commit.c2004-08-26 17:12:40.0 +0400 +++ linux-2.6.7/fs/jbd/commit.c 2005-01-19 17:28:32.965111552 +0300 @@ -103,7 +103,7 @@ { transaction_t *commit_transaction; struct journal_head *jh, *new_jh, *descriptor; - struct buffer_head *wbuf[64]; + struct buffer_head **wbuf = journal->j_wbuf; int bufs; int flags; int err; @@ -271,7 +283,7 @@ BUFFER_TRACE(bh, "start journal writeout"); get_bh(bh); wbuf[bufs++] = bh; - if (bufs == ARRAY_SIZE(wbuf)) { + if (bufs == journal->j_wbufsize) { jbd_debug(2, "submit %d writes\n", bufs); spin_unlock(>j_list_lock); @@ -488,7 +500,7 @@ /* If there's no more to do, or if the descriptor is full, let the IO rip! */ - if (bufs == ARRAY_SIZE(wbuf) || + if (bufs == journal->j_wbufsize || commit_transaction->t_buffers == NULL || space_left < sizeof(journal_block_tag_t) + 16) { Index: linux-2.6.7/fs/jbd/journal.c === --- linux-2.6.7.orig/fs/jbd/journal.c 2005-01-19 12:07:59.0 +0300 +++ linux-2.6.7/fs/jbd/journal.c2005-01-19 17:11:08.589880720 +0300 @@ -687,6 +687,7 @@ { journal_t *journal = journal_init_common(); struct buffer_head *bh; + int n; if (!journal) return NULL; @@ -702,6 +703,17 @@ journal->j_sb_buffer = bh; journal->j_superblock = (journal_superblock_t *)bh->b_data; + /* journal descriptor can store upto n blocks -bzzz */ + n = journal->j_blocksize / sizeof(journal_block_tag_t); + journal->j_wbufsize = n; + journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL); + if (!journal->j_wbuf) { + printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n", + __FUNCTION__); + kfree(journal); + journal = NULL; + } + return journal; } @@ -717,7 +729,7 @@ { struct buffer_head *bh; journal_t *journal = journal_init_common(); - int err; + int err, n; unsigned long blocknr; if (!journal) @@ -734,6 +746,17 @@ journal->j_maxlen = inode->i_size >> inode->i_sb->s_blocksize_bits; journal->j_blocksize = inode->i_sb->s_blocksize; + /* journal descriptor can store
[PATCH] JBD: fix against journal overflow
Good day, under some quite high load, jbd can hit J_ASSERT(journal-j_free 1) in journal_next_log_block(). The cause is the following: journal_commit_transaction() { struct buffer_head *wbuf[64]; /* If there's no more to do, or if the descriptor is full, let the IO rip! */ if (bufs == ARRAY_SIZE(wbuf) || commit_transaction-t_buffers == NULL || space_left sizeof(journal_block_tag_t) + 16) { so, the real limit isn't size of journal descriptor, but wbuf. __log_space_left() { /* * Be pessimistic here about the number of those free blocks which * might be required for log descriptor control blocks. */ #define MIN_LOG_RESERVED_BLOCKS 32 /* Allow for rounding errors */ left -= MIN_LOG_RESERVED_BLOCKS; so, jbd expects upto 32 blocks to be used for internal purpose. but with 64 blocks a descriptor limit we easily can break the limit. The fix allocates wbuf in journal_init_dev(). That's enough because we have only commit thread per filesystem. thank, Alex The journal_commit_transaction() generates too many descriptor blocks because static array wbuf can hold 64 blocks only. The fix is to have persistent array big enough to hold max. possible blocks. Signed-off-by: Alex Tomas [EMAIL PROTECTED] Index: linux-2.6.7/include/linux/jbd.h === --- linux-2.6.7.orig/include/linux/jbd.h2004-08-26 17:12:16.0 +0400 +++ linux-2.6.7/include/linux/jbd.h 2005-01-19 17:08:33.144512008 +0300 @@ -826,6 +826,12 @@ struct jbd_revoke_table_s *j_revoke_table[2]; /* +* array of bhs for journal_commit_transaction +*/ + struct buffer_head **j_wbuf; + int j_wbufsize; + + /* * An opaque pointer to fs-private information. ext3 puts its * superblock pointer here */ Index: linux-2.6.7/fs/jbd/commit.c === --- linux-2.6.7.orig/fs/jbd/commit.c2004-08-26 17:12:40.0 +0400 +++ linux-2.6.7/fs/jbd/commit.c 2005-01-19 17:28:32.965111552 +0300 @@ -103,7 +103,7 @@ { transaction_t *commit_transaction; struct journal_head *jh, *new_jh, *descriptor; - struct buffer_head *wbuf[64]; + struct buffer_head **wbuf = journal-j_wbuf; int bufs; int flags; int err; @@ -271,7 +283,7 @@ BUFFER_TRACE(bh, start journal writeout); get_bh(bh); wbuf[bufs++] = bh; - if (bufs == ARRAY_SIZE(wbuf)) { + if (bufs == journal-j_wbufsize) { jbd_debug(2, submit %d writes\n, bufs); spin_unlock(journal-j_list_lock); @@ -488,7 +500,7 @@ /* If there's no more to do, or if the descriptor is full, let the IO rip! */ - if (bufs == ARRAY_SIZE(wbuf) || + if (bufs == journal-j_wbufsize || commit_transaction-t_buffers == NULL || space_left sizeof(journal_block_tag_t) + 16) { Index: linux-2.6.7/fs/jbd/journal.c === --- linux-2.6.7.orig/fs/jbd/journal.c 2005-01-19 12:07:59.0 +0300 +++ linux-2.6.7/fs/jbd/journal.c2005-01-19 17:11:08.589880720 +0300 @@ -687,6 +687,7 @@ { journal_t *journal = journal_init_common(); struct buffer_head *bh; + int n; if (!journal) return NULL; @@ -702,6 +703,17 @@ journal-j_sb_buffer = bh; journal-j_superblock = (journal_superblock_t *)bh-b_data; + /* journal descriptor can store upto n blocks -bzzz */ + n = journal-j_blocksize / sizeof(journal_block_tag_t); + journal-j_wbufsize = n; + journal-j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL); + if (!journal-j_wbuf) { + printk(KERN_ERR %s: Cant allocate bhs for commit thread\n, + __FUNCTION__); + kfree(journal); + journal = NULL; + } + return journal; } @@ -717,7 +729,7 @@ { struct buffer_head *bh; journal_t *journal = journal_init_common(); - int err; + int err, n; unsigned long blocknr; if (!journal) @@ -734,6 +746,17 @@ journal-j_maxlen = inode-i_size inode-i_sb-s_blocksize_bits; journal-j_blocksize = inode-i_sb-s_blocksize; + /* journal descriptor can store upto n blocks -bzzz */ + n = journal-j_blocksize / sizeof(journal_block_tag_t); + journal-j_wbufsize = n; + journal-j_wbuf
[PATCH] JBD: journal_release_buffer()
Good day, journal_release_buffer() can cause journal overflow in some (very rare) conditions. Please, take a look at possible fix. Th journal_head structure holds number of users of the buffer for current transaction. The routines do_get_write_access() and journal_get_create_access() tracks this number: 1) resets it to zero if the block's becoming part of the current transaction 2) increments it journal_release_buffer() decrements it and if it's 0, then the blocks isn't member of the transaction. The patch has been tested on UP with dbench and tool that uses xattr very much. Signed-off-by: Alex Tomas [EMAIL PROTECTED] Index: linux-2.6.7/include/linux/journal-head.h === --- linux-2.6.7.orig/include/linux/journal-head.h 2003-06-24 18:05:26.0 +0400 +++ linux-2.6.7/include/linux/journal-head.h2005-01-19 14:09:59.0 +0300 @@ -80,6 +80,11 @@ * [j_list_lock] */ struct journal_head *b_cpnext, *b_cpprev; + + /* +* counter to track users of the buffer in current transaction +*/ + int b_tcount; }; #endif /* JOURNAL_HEAD_H_INCLUDED */ Index: linux-2.6.7/fs/jbd/transaction.c === --- linux-2.6.7.orig/fs/jbd/transaction.c 2004-08-26 17:12:40.0 +0400 +++ linux-2.6.7/fs/jbd/transaction.c2005-01-19 17:23:30.058160408 +0300 @@ -611,6 +611,10 @@ handle-h_buffer_credits--; if (credits) (*credits)++; + + /* the block's becoming member of the trasaction -bzzz */ + jh-b_tcount = 0; + goto done; } @@ -694,6 +698,9 @@ if (credits) (*credits)++; + /* the block's becoming member of the trasaction -bzzz */ + jh-b_tcount = 0; + /* * Finally, if the buffer is not journaled right now, we need to make * sure it doesn't get written to disk before the caller actually @@ -723,6 +730,11 @@ memcpy(jh-b_frozen_data, source+offset, jh2bh(jh)-b_size); kunmap_atomic(source, KM_USER0); } + + /* track all references to the block to be able to recognize the +* situation when the buffer is not part of transaction -bzzz */ + jh-b_tcount++; + jbd_unlock_bh_state(bh); /* @@ -822,11 +834,20 @@ jh-b_transaction = transaction; JBUFFER_TRACE(jh, file as BJ_Reserved); __journal_file_buffer(jh, transaction, BJ_Reserved); + jh-b_tcount = 0; } else if (jh-b_transaction == journal-j_committing_transaction) { JBUFFER_TRACE(jh, set next transaction); jh-b_next_transaction = transaction; + jh-b_tcount = 0; } spin_unlock(journal-j_list_lock); + + /* +* track all reference to the block to be able to recognize +* the situation when the buffer is not part of transaction -bzzz +*/ + jh-b_tcount++; + jbd_unlock_bh_state(bh); /* @@ -1178,8 +1199,40 @@ void journal_release_buffer(handle_t *handle, struct buffer_head *bh, int credits) { + journal_t *journal = handle-h_transaction-t_journal; + struct journal_head *jh = bh2jh(bh); + BUFFER_TRACE(bh, entry); - handle-h_buffer_credits += credits; + + /* return credit back to the handle if it was really spent */ + if (credits) + handle-h_buffer_credits++; + + jbd_lock_bh_state(bh); + J_ASSERT(jh-b_tcount 0); + + jh-b_tcount--; + if (jh-b_tcount == 0) { + /* we can drop it from the transaction -bzzz */ + J_ASSERT(jh-b_transaction == handle-h_transaction || + jh-b_next_transaction == handle-h_transaction); + if (jh-b_transaction == handle-h_transaction) { + spin_lock(journal-j_list_lock); + __journal_unfile_buffer(jh); + spin_unlock(journal-j_list_lock); + } else if(jh-b_next_transaction) { + jh-b_next_transaction = NULL; + } + + /* +* this was last reference to the block from the current +* transaction and we'd like to return credit to the +* whole transaction -bzzz +*/ + if (!credits) + handle-h_buffer_credits++; + } + jbd_unlock_bh_state(bh); } /** - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] JBD: log space management optimization
Good day, during truncate ext3 calls journal_forget() for freed blocks, but before these blocks go to the transaction and jbd reserves space in log for them (-t_outstanding_credits). also, journal_forget() removes these blocks from the transaction, but doesn't correct log space reservation. for example, removal of 500MB file reserves 136 blocks, but only 10 blocks go to the log. a commit is expensive and correct reservation allows us to avoid needless commits. here is the patch. tested on UP. thanks, Alex Signed-off-by: Alex Tomas [EMAIL PROTECTED] Index: linux-2.6.7/fs/jbd/transaction.c === --- linux-2.6.7.orig/fs/jbd/transaction.c 2004-08-26 17:12:40.0 +0400 +++ linux-2.6.7/fs/jbd/transaction.c2005-01-19 17:23:30.058160408 +0300 @@ -1204,6 +1257,7 @@ transaction_t *transaction = handle-h_transaction; journal_t *journal = transaction-t_journal; struct journal_head *jh; + int drop_reserve = 0; BUFFER_TRACE(bh, entry); @@ -1227,6 +1281,7 @@ J_ASSERT_JH(jh, !jh-b_committed_data); __journal_unfile_buffer(jh); + drop_reserve = 1; /* * We are no longer going to journal this buffer. @@ -1249,7 +1304,7 @@ spin_unlock(journal-j_list_lock); jbd_unlock_bh_state(bh); __bforget(bh); - return; + goto drop; } } } else if (jh-b_transaction) { @@ -1264,6 +1319,7 @@ if (jh-b_next_transaction) { J_ASSERT(jh-b_next_transaction == transaction); jh-b_next_transaction = NULL; + drop_reserve = 1; } } @@ -1271,6 +1327,15 @@ spin_unlock(journal-j_list_lock); jbd_unlock_bh_state(bh); __brelse(bh); + +drop: + if (drop_reserve) { + /* no need to reserve log space for this block -bzzz */ + spin_lock(transaction-t_handle_lock); + transaction-t_outstanding_credits--; + spin_unlock(transaction-t_handle_lock); + } + return; } - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/