Re: What's cooking in e2fsprogs.git (topics)
On Mon, Feb 25, 2008 at 10:13:39AM -0500, Theodore Tso wrote: On Sun, Feb 24, 2008 at 09:20:50PM -0700, Andreas Dilger wrote: On Feb 22, 2008 19:15 -0500, Theodore Ts'o wrote: So before the recent patch were we actually creating long symlinks in extents format? Or were we just setting the flag but still treating them as a block number? If it was the latter, I guess we can put in code into e2fsck to detect that case, and convert it back to a singleton block number. Eric informed me that the long symlinks were actually stored in extent mapped blocks. That is not harmful, because it can only be a single block and it will always fit into the inode. The other thing to note is that extent mapping is REQUIRED for 32-bit blocknumbers, so we may as well fix e2fsprogs to allow these symlinks to be handled normally. Well, at least some kernel versions (as of sometime just before 2.6.25, iirc) were storing the long symlink as a single block in i_block[0], despite EXTENTS_FL being set. Valerie noticed this, and I confirmed it, as it caused the mainline e2fsck extents support to core dump. Basically, what this means is that e2fsprogs can't trust EXTENTS_FL for long symlinks. But you do raise a good point that we need to support using the extents format in order to support blocks 2**32, so we can't just arbitrary convert all symlinks to the old-style direct block maps. How about the patch like below on top of the patch queue. Patch queue currently enable extent flag only for directory and file . This patch add it to normal symlink. With this fast symlink still have the extent format enabled. I guess this would need a patch to the interim branch of e2fsprogs to allow normal symlink to have extent format. -aneesh ext4: Enable extent format for symlink. From: Aneesh Kumar K.V [EMAIL PROTECTED] This patch enable extent format for normal symlink. Extent format enables to refere file system blocks 32 bits. Enabling extent format for symlink enables to have symlink block beyond 2**32 blocks. We still don't enable extent format for fast symlink. Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] --- fs/ext4/ialloc.c |4 ++-- fs/ext4/namei.c |2 ++ 2 files changed, 4 insertions(+), 2 deletions(-) diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c index 78d1094..1462189 100644 --- a/fs/ext4/ialloc.c +++ b/fs/ext4/ialloc.c @@ -842,8 +842,8 @@ got: goto fail_free_drop; } if (test_opt(sb, EXTENTS)) { - /* set extent flag only for diretory and file */ - if (S_ISDIR(mode) || S_ISREG(mode)) { + /* set extent flag only for diretory, file and normal symlink*/ + if (S_ISDIR(mode) || S_ISREG(mode) || S_ISLNK(mode)) { EXT4_I(inode)-i_flags |= EXT4_EXTENTS_FL; ext4_ext_tree_init(handle, inode); err = ext4_update_incompat_feature(handle, sb, diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c index da942bc..63c33e0 100644 --- a/fs/ext4/namei.c +++ b/fs/ext4/namei.c @@ -,6 +,8 @@ retry: goto out_stop; } } else { + /* clear the extent format for fast symlink */ + EXT4_I(inode)-i_flags = ~EXT4_EXTENTS_FL; inode-i_op = ext4_fast_symlink_inode_operations; memcpy((char*)EXT4_I(inode)-i_data,symname,l); inode-i_size = l-1; - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] ext4: Convert uninitialized extent to initialized extent in case of file system full
A write to prealloc area cause the split of unititalized extent into a initialized and uninitialized extent. If we don't have space to add new extent information instead of returning error convert the existing uninitialized extent to initialized one. We need to zero out the blocks corresponding to the extent to prevent wrong data reaching userspace. Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] --- fs/ext4/extents.c | 151 ++--- 1 files changed, 144 insertions(+), 7 deletions(-) diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index d315cc1..cdc7dca 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -2136,6 +2136,124 @@ void ext4_ext_release(struct super_block *sb) #endif } +static int extend_credit_for_zeroout(handle_t *handle, struct inode *inode) +{ + int retval = 0, needed; + + if (handle-h_buffer_credits EXT4_RESERVE_TRANS_BLOCKS) + return 0; + + /* number of filesytem blocks in one page */ + needed = 1 (PAGE_CACHE_SHIFT - inode-i_blkbits); + + if (ext4_journal_extend(handle, needed) != 0) + retval = ext4_journal_restart(handle, needed); + + return retval; +} + +/* FIXME!! we need to try to merge to left or right after zerout */ +static int ext4_ext_zeroout(handle_t *handle, struct inode *inode, + ext4_lblk_t iblock, struct ext4_extent *ex) +{ + ext4_lblk_t ee_block; + unsigned int ee_len, blkcount, blocksize; + loff_t pos; + pgoff_t index, skip_index; + unsigned long offset; + struct page *page; + struct address_space *mapping = inode-i_mapping; + struct buffer_head *head, *bh; + int err = 0; + + ee_block = le32_to_cpu(ex-ee_block); + ee_len = blkcount = ext4_ext_get_actual_len(ex); + blocksize = inode-i_sb-s_blocksize; + + /* +* find the skip index. We can't call __grab_cache_page for this +* because we are in the writeout of this page and we already have +* taken the lock on this page +*/ + pos = iblock inode-i_blkbits; + skip_index = pos PAGE_CACHE_SHIFT; + + while (blkcount) { + pos = (ee_block + ee_len - blkcount) inode-i_blkbits; + index = pos PAGE_CACHE_SHIFT; + offset = (pos (PAGE_CACHE_SIZE - 1)); + if (index == skip_index) { + /* Page will already be locked via +* write_begin or writepage +*/ + read_lock_irq(mapping-tree_lock); + page = radix_tree_lookup(mapping-page_tree, index); + read_unlock_irq(mapping-tree_lock); + if (page) + page_cache_get(page); + else + return -ENOMEM; + } else { + page = __grab_cache_page(mapping, index); + if (!page) + return -ENOMEM; + } + + if (!page_has_buffers(page)) + create_empty_buffers(page, blocksize, 0); + + /* extent the credit in the journal */ + extend_credit_for_zeroout(handle, inode); + + head = page_buffers(page); + /* Look for the buffer_head which map the block */ + bh = head; + while (offset 0) { + bh = bh-b_this_page; + offset -= blocksize; + } + offset = (pos (PAGE_CACHE_SIZE - 1)); + + /* Now write all the buffer_heads in the page */ + do { + set_buffer_uptodate(bh); + if (ext4_should_journal_data(inode)) { + err = ext4_journal_get_write_access(handle, bh); + if (err) + goto err_out; + } + zero_user(page, offset, blocksize); + offset += blocksize; + if (ext4_should_journal_data(inode)) { + err = ext4_journal_dirty_metadata(handle, bh); + if (err) + goto err_out; + } else { + if (ext4_should_order_data(inode)) { + err = ext4_journal_dirty_data(handle, + bh); + if (err) + goto err_out; + } + mark_buffer_dirty(bh); + } + + bh = bh-b_this_page
[PATCH] ext4: Fix fallocate error path.
Put the old extent details back if we fail to split the uninitialized extent. Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] --- fs/ext4/extents.c | 26 -- 1 files changed, 24 insertions(+), 2 deletions(-) diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index 39d5315..d315cc1 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -2152,7 +2152,7 @@ static int ext4_ext_convert_to_initialized(handle_t *handle, ext4_lblk_t iblock, unsigned long max_blocks) { - struct ext4_extent *ex, newex; + struct ext4_extent *ex, newex, orig_ex; struct ext4_extent *ex1 = NULL; struct ext4_extent *ex2 = NULL; struct ext4_extent *ex3 = NULL; @@ -2171,6 +2171,9 @@ static int ext4_ext_convert_to_initialized(handle_t *handle, allocated = ee_len - (iblock - ee_block); newblock = iblock - ee_block + ext_pblock(ex); ex2 = ex; + orig_ex.ee_block = ex-ee_block; + orig_ex.ee_len = cpu_to_le16(ee_len); + ext4_ext_store_pblock(orig_ex, ext_pblock(ex)); err = ext4_ext_get_access(handle, inode, path + depth); if (err) @@ -2199,13 +2202,25 @@ static int ext4_ext_convert_to_initialized(handle_t *handle, ex3-ee_len = cpu_to_le16(allocated - max_blocks); ext4_ext_mark_uninitialized(ex3); err = ext4_ext_insert_extent(handle, inode, path, ex3); - if (err) + if (err) { + ex-ee_block = orig_ex.ee_block; + ex-ee_len = orig_ex.ee_len; + ext4_ext_store_pblock(ex, ext_pblock(orig_ex)); + ext4_ext_mark_uninitialized(ex); + ext4_ext_dirty(handle, inode, path + depth); goto out; + } /* * The depth, and hence eh ex might change * as part of the insert above. */ newdepth = ext_depth(inode); + /* +* update the extent length after successfull insert of the +* split extent +*/ + orig_ex.ee_len = cpu_to_le16(ee_len - + ext4_ext_get_actual_len(ex3)); if (newdepth != depth) { depth = newdepth; ext4_ext_drop_refs(path); @@ -2280,6 +2295,13 @@ static int ext4_ext_convert_to_initialized(handle_t *handle, goto out; insert: err = ext4_ext_insert_extent(handle, inode, path, newex); + if (err) { + ex-ee_block = orig_ex.ee_block; + ex-ee_len = orig_ex.ee_len; + ext4_ext_store_pblock(ex, ext_pblock(orig_ex)); + ext4_ext_mark_uninitialized(ex); + ext4_ext_dirty(handle, inode, path + depth); + } out: return err ? err : allocated; } -- 1.5.4.1.97.g40aab-dirty - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] ext4: Convert uninitialized extent to initialized extent in case of file system full
Test results for the patch. mmaptest simply maps a range and write to it. The length of the extent indicate whether it is initialized or not. There is no space in the file system for another block. So the uninitialized extent have to be converted to initialized extent. [EMAIL PROTECTED]:/ext4# /root/mmaptest testfile 0 100 mmaping 0 to 100 [EMAIL PROTECTED] linux-2.6.25-rc2]$ ~/ext4migrate --display /home/kvaneesh/test-images/ext3.img testfile extent: block=0-3 len=4 start=1153 start_hi=0 extent: block=4-32807 len=32804 start=1157 start_hi=0 [EMAIL PROTECTED]:/ext4# dd if=/root/a.c of=testfile seek=6 bs=4096 conv=notrunc [EMAIL PROTECTED] linux-2.6.25-rc2]$ ~/ext4migrate --display /home/kvaneesh/test-images/ext3.img testfile extent: block=0-3 len=4 start=1153 start_hi=0 extent: block=4-32791 len=32788 start=1157 start_hi=0 extent: block=24-24 len=1 start=1177 start_hi=0 extent: block=25-32807 len=32783 start=1178 start_hi=0 [EMAIL PROTECTED]:/ext4# dd if=/root/a.c of=testfile seek=8 bs=4096 conv=notrunc [EMAIL PROTECTED] linux-2.6.25-rc2]$ ~/ext4migrate --display /home/kvaneesh/test-images/ext3.img testfile extent: block=0-3 len=4 start=1153 start_hi=0 extent: block=4-32791 len=32788 start=1157 start_hi=0 extent: block=24-24 len=1 start=1177 start_hi=0 extent: block=25-39 len=15 start=1178 start_hi=0 [EMAIL PROTECTED]:/ext4# /root/mmaptest testfile 4096 5000 mmaping 4096 to 5000 [EMAIL PROTECTED] linux-2.6.25-rc2]$ ~/ext4migrate --display /home/kvaneesh/test-images/ext3.img testfile extent: block=0-3 len=4 start=1153 start_hi=0 extent: block=4-23 len=20 start=1157 start_hi=0 extent: block=24-24 len=1 start=1177 start_hi=0 extent: block=25-39 len=15 start=1178 start_hi=0 - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH] ext4: Convert uninitialized extent to initialized extent in case of file system full
On Fri, Feb 22, 2008 at 08:01:28PM +0530, Aneesh Kumar K.V wrote: + + /* Now write all the buffer_heads in the page */ + do { + set_buffer_uptodate(bh); + if (ext4_should_journal_data(inode)) { + err = ext4_journal_get_write_access(handle, bh); + /* do we have that many credits ??*/ + if (err) + goto err_out; + } + zero_user(page, offset, blocksize); Ah oh, you are trying to zero out the pages in the page cache, that's seems wrong to me. By the time get_block() is called from writepages(), the pages should have meaningful content that needs to flush to disk, zero the pages out will lost the data. It is writebegin. In case of writebegin the pages doesn't have the content. By the time we reach write begin the page is supposed to have buffer heads that are alreayd mapped. So we won't end up calling get_blk. Even in case of mmap with page_mkwrite change we would have called writebegin equivalent before the writepage. I guess the above para is confusing. The callback is actually writebegin.In case of writebegin the page doesn't have the content. With respect to writepage by the time we call the callback the buffer_heads related to the page would already be mapped. So we won't end up calling get_blk. -aneesh - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] ext4: Use page_mkwrite vma_operations to get mmap write notification.
On Fri, Feb 22, 2008 at 10:10:48AM -0800, Mingming Cao wrote: On Fri, 2008-02-22 at 20:09 +0530, Aneesh Kumar K.V wrote: . + ext4_journal_stop(handle); + goto out_unlock; + } + if (!ret ext4_should_order_data(inode)) { + ret = walk_page_buffers(handle, page_buffers(page), + 0, end, NULL, ext4_journal_dirty_data); + } + if (!ret) + ret = block_commit_write(page, 0, end); + Hmm, it seems wired to do commit_write when the page is about becoming writable, but maybe that's the way it needs to? Don't we need to update the i_size somewhere? block_commit_write simply iterate over buffer_head of page and mark them dirty. That is why we don't want to call that for data=journalled mode. + ext4_journal_stop(handle); +out_unlock: + unlock_page(page); + mutex_unlock(inode-i_mutex); + return ret; +} It seems this combined the three journalling mode prepare_write() code here:( Since prepare_write() and commit_write() is going to sunset, why not simply calling mappings-a_ops-write_begin() and then write_end()? that should take care of pretty much the journalling and the page operations, no? write_begin and write_end works with the user space buffer. In this case we don't have one. Also what ext4_page_mkwrite does is mostly what write_begin/write_end does except the copy of user space buffer. -aneesh - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC][PATCH] ext4: Convert uninitialized extent to initialized extent in case of file system full
This patch had very minimal testing. I am sending this to get the feedback on the approach. The skip_index section in the below patch is ugly. Any suggestion to improve ? NOTE: ext4_ext_convert_to_initialized error path have some BUGs. It doesn't reset the extent information in case of error. But that is another patch. From 6a73edd4dbb32344e6a83ebdc07edd0e96d376bd Mon Sep 17 00:00:00 2001 From: Aneesh Kumar K.V [EMAIL PROTECTED] Date: Thu, 21 Feb 2008 23:57:38 +0530 Subject: [PATCH] ext4: Convert uninitialized extent to initialized extent in case of file system full A write to prealloc area cause the split of unititalized extent into a initialized and uninitialized extent. If we don't have space to add new extent information instead of returning error convert the existing uninitialized extent to initialized one. We need to zero out the blocks corresponding to the extent to prevent wrong data reaching userspace. Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] --- fs/ext4/extents.c | 135 - 1 files changed, 133 insertions(+), 2 deletions(-) diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index b179b03..d37c14e 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -2137,6 +2137,103 @@ void ext4_ext_release(struct super_block *sb) #endif } +static int ext4_ext_zero_out(handle_t *handle, struct inode *inode, + ext4_lblk_t iblock, struct ext4_extent *ex) +{ + ext4_lblk_t ee_block; + unsigned int ee_len, blkcount, blocksize; + loff_t pos; + pgoff_t index, skip_index; + unsigned long offset; + struct page *page; + struct address_space *mapping = inode-i_mapping; + struct buffer_head *head, *bh; + int err = 0; + + ee_block = le32_to_cpu(ex-ee_block); + ee_len = blkcount = ext4_ext_get_actual_len(ex); + blocksize = inode-i_sb-s_blocksize; + + /* +* find the skip index. We can't call __grab_cache_page for this +* because we are in the writeout of this page and we already have +* taken the lock on this page +*/ + pos = iblock inode-i_blkbits; + skip_index = pos PAGE_CACHE_SHIFT; + + while (blkcount) { + pos = (ee_block + ee_len - blkcount) inode-i_blkbits; + index = pos PAGE_CACHE_SHIFT; + offset = (pos (PAGE_CACHE_SIZE - 1)); + if (index == skip_index) { + /* Page will already be locked in the writepage */ + read_lock_irq(mapping-tree_lock); + page = radix_tree_lookup(mapping-page_tree, index); + read_unlock_irq(mapping-tree_lock); + if (page) + page_cache_get(page); + else + return -ENOMEM; + } else { + page = __grab_cache_page(mapping, index); + if (!page) + return -ENOMEM; + } + + if (!page_has_buffers(page)) + create_empty_buffers(page, blocksize, 0); + + head = page_buffers(page); + /* Look for the buffer_head which map the block */ + bh = head; + while (offset 0) { + bh = bh-b_this_page; + offset -= blocksize; + } + offset = (pos (PAGE_CACHE_SIZE - 1)); + + /* Now write all the buffer_heads in the page */ + do { + set_buffer_uptodate(bh); + if (ext4_should_journal_data(inode)) { + err = ext4_journal_get_write_access(handle, bh); + /* do we have that many credits ??*/ + if (err) + goto err_out; + } + zero_user(page, offset, blocksize); + offset += blocksize; + if (ext4_should_journal_data(inode)) { + err = ext4_journal_dirty_metadata(handle, bh); + if (err) + goto err_out; + } else { + if (ext4_should_order_data(inode)) { + err = ext4_journal_dirty_data(handle, + bh); + if (err) + goto err_out; + } + mark_buffer_dirty(bh); + } + + bh = bh-b_this_page; + blkcount--; + } while ((bh != head) (blkcount 0
Re: [PATCH -v2] ext4: Use page_mkwrite vma_operations to get mmap write notification.
On Thu, Feb 21, 2008 at 09:39:20AM -0800, Mingming Cao wrote: On Tue, 2008-02-19 at 09:13 +0530, Aneesh Kumar K.V wrote: We would like to get notified when we are doing a write on mmap section. This is needed with respect to preallocated area. We split the preallocated area into initialzed extent and uninitialzed extent in the call back. This let us handle ENOSPC better. Otherwise we get ENOSPC in the writepage and that would result in data loss. The changes are also needed to handle ENOSPC when writing to an mmap section of files with holes. Hi Aneesh, I have a concern, it seems we missed journalling the allocation activity for the mmaped write. See comments below... Another thing, perhaps similar patch should be ported to ext2/3, as this also addressed the mmaped write ENOSPC error without preallocation/deleyed allocation. Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] .. + +int ext4_page_mkwrite(struct vm_area_struct *vma, struct page *page) +{ + return block_page_mkwrite(vma, page, ext4_get_block); +} + I don't see block allocation being journalled here. block_page_mkwrite() eventually calling block_prepare_write() which invokes ext4_get_block() without starting a new journal handle. Perhaps call ext4 write_begin() and write_end() inode operations, that would taking care of different write_begin and write_end for three different journalling mode. You are correct. I missed the journalling details when working on the patch. Will send the one with fix. -aneesh - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] ext4: ext4_find_next_zero_bit needs an aligned address on some arch
ext4_find_next_zero_bit and ext4_find_next_bit needs a long aligned address on x8_64. Add mb_find_next_zero_bit and mb_find_next_bit and use them in the mballoc. Fix: https://bugzilla.redhat.com/show_bug.cgi?id=433286 Eric Sandeen debugged the problem and suggested the fix. Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] CC:Eric Sandeen [EMAIL PROTECTED] --- fs/ext4/mballoc.c | 62 ++-- 1 files changed, 40 insertions(+), 22 deletions(-) diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 89772b9..ccddd21 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -627,21 +627,19 @@ static ext4_fsblk_t ext4_grp_offs_to_block(struct super_block *sb, return block; } +static inline void *mb_correct_addr_and_bit(int *bit, void *addr) +{ #if BITS_PER_LONG == 64 -#define mb_correct_addr_and_bit(bit, addr) \ -{ \ - bit += ((unsigned long) addr 7UL) 3; \ - addr = (void *) ((unsigned long) addr ~7UL); \ -} + *bit += ((unsigned long) addr 7UL) 3; + addr = (void *) ((unsigned long) addr ~7UL); #elif BITS_PER_LONG == 32 -#define mb_correct_addr_and_bit(bit, addr) \ -{ \ - bit += ((unsigned long) addr 3UL) 3; \ - addr = (void *) ((unsigned long) addr ~3UL); \ -} + *bit += ((unsigned long) addr 3UL) 3; + addr = (void *) ((unsigned long) addr ~3UL); #else #error how many bits you are?! #endif + return addr; +} static inline int mb_test_bit(int bit, void *addr) { @@ -649,34 +647,54 @@ static inline int mb_test_bit(int bit, void *addr) * ext4_test_bit on architecture like powerpc * needs unsigned long aligned address */ - mb_correct_addr_and_bit(bit, addr); + addr = mb_correct_addr_and_bit(bit, addr); return ext4_test_bit(bit, addr); } static inline void mb_set_bit(int bit, void *addr) { - mb_correct_addr_and_bit(bit, addr); + addr = mb_correct_addr_and_bit(bit, addr); ext4_set_bit(bit, addr); } static inline void mb_set_bit_atomic(spinlock_t *lock, int bit, void *addr) { - mb_correct_addr_and_bit(bit, addr); + addr = mb_correct_addr_and_bit(bit, addr); ext4_set_bit_atomic(lock, bit, addr); } static inline void mb_clear_bit(int bit, void *addr) { - mb_correct_addr_and_bit(bit, addr); + addr = mb_correct_addr_and_bit(bit, addr); ext4_clear_bit(bit, addr); } static inline void mb_clear_bit_atomic(spinlock_t *lock, int bit, void *addr) { - mb_correct_addr_and_bit(bit, addr); + addr = mb_correct_addr_and_bit(bit, addr); ext4_clear_bit_atomic(lock, bit, addr); } +static inline int mb_find_next_zero_bit(void *addr, int max, int start) +{ + int fix = 0; + addr = mb_correct_addr_and_bit(fix, addr); + max += fix; + start += fix; + + return ext4_find_next_zero_bit(addr, max, start) - fix; +} + +static inline int mb_find_next_bit(void *addr, int max, int start) +{ + int fix = 0; + addr = mb_correct_addr_and_bit(fix, addr); + max += fix; + start += fix; + + return ext4_find_next_bit(addr, max, start) - fix; +} + static void *mb_find_buddy(struct ext4_buddy *e4b, int order, int *max) { char *bb; @@ -946,12 +964,12 @@ static void ext4_mb_generate_buddy(struct super_block *sb, /* initialize buddy from bitmap which is aggregation * of on-disk bitmap and preallocations */ - i = ext4_find_next_zero_bit(bitmap, max, 0); + i = mb_find_next_zero_bit(bitmap, max, 0); grp-bb_first_free = i; while (i max) { fragments++; first = i; - i = ext4_find_next_bit(bitmap, max, i); + i = mb_find_next_bit(bitmap, max, i); len = i - first; free += len; if (len 1) @@ -959,7 +977,7 @@ static void ext4_mb_generate_buddy(struct super_block *sb, else grp-bb_counters[0]++; if (i max) - i = ext4_find_next_zero_bit(bitmap, max, i); + i = mb_find_next_zero_bit(bitmap, max, i); } grp-bb_fragments = fragments; @@ -1783,7 +1801,7 @@ static void ext4_mb_simple_scan_group(struct ext4_allocation_context *ac, buddy = mb_find_buddy(e4b, i, max); BUG_ON(buddy == NULL); - k = ext4_find_next_zero_bit(buddy, max, 0); + k = mb_find_next_zero_bit(buddy, max, 0); BUG_ON(k = max); ac-ac_found++; @@ -1823,7 +1841,7 @@ static void ext4_mb_complex_scan_group(struct ext4_allocation_context *ac, i = e4b-bd_info-bb_first_free; while (free ac-ac_status == AC_STATUS_CONTINUE) { - i
Re: [PATCH] fs/ext4/mballoc.c: Convert to list_for_each_entry_rcu()
On Tue, Feb 19, 2008 at 01:31:18AM +0100, Roel Kluin wrote: Please verify, this patch was not yet tested --- Convert list_for_each_rcu() to list_for_each_entry_rcu() Signed-off-by: Roel Kluin [EMAIL PROTECTED] NACK. This patch doesn't build. You have extra cur in the conversion. Right changes attached. ext4: Convert list_for_each_rcu() to list_for_each_entry_rcu() From: Aneesh Kumar K.V [EMAIL PROTECTED] The list_for_each_entry_rcu() primitive should be used instead of list_for_each_rcu(), as the former is easier to use and provides better type safety. http://groups.google.com/group/linux.kernel/browse_thread/thread/45749c83451cebeb/0633a65759ce7713?lnk=raot Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] --- fs/ext4/mballoc.c | 18 +- 1 files changed, 5 insertions(+), 13 deletions(-) diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 52d3af2..89772b9 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -3135,10 +3135,10 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac, { int bsbits, max; ext4_lblk_t end; - struct list_head *cur; loff_t size, orig_size, start_off; ext4_lblk_t start, orig_start; struct ext4_inode_info *ei = EXT4_I(ac-ac_inode); + struct ext4_prealloc_space *pa; /* do normalize only data requests, metadata requests do not need preallocation */ @@ -3224,12 +3224,9 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac, /* check we don't cross already preallocated blocks */ rcu_read_lock(); - list_for_each_rcu(cur, ei-i_prealloc_list) { - struct ext4_prealloc_space *pa; + list_for_each_entry_rcu(pa, ei-i_prealloc_list, pa_inode_list) { unsigned long pa_end; - pa = list_entry(cur, struct ext4_prealloc_space, pa_inode_list); - if (pa-pa_deleted) continue; spin_lock(pa-pa_lock); @@ -3271,10 +3268,8 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac, /* XXX: extra loop to check we really don't overlap preallocations */ rcu_read_lock(); - list_for_each_rcu(cur, ei-i_prealloc_list) { - struct ext4_prealloc_space *pa; + list_for_each_entry_rcu(pa, ei-i_prealloc_list, pa_inode_list) { unsigned long pa_end; - pa = list_entry(cur, struct ext4_prealloc_space, pa_inode_list); spin_lock(pa-pa_lock); if (pa-pa_deleted == 0) { pa_end = pa-pa_lstart + pa-pa_len; @@ -3401,7 +3396,6 @@ static noinline int ext4_mb_use_preallocated(struct ext4_allocation_context *ac) struct ext4_inode_info *ei = EXT4_I(ac-ac_inode); struct ext4_locality_group *lg; struct ext4_prealloc_space *pa; - struct list_head *cur; /* only data can be preallocated */ if (!(ac-ac_flags EXT4_MB_HINT_DATA)) @@ -3409,8 +3403,7 @@ static noinline int ext4_mb_use_preallocated(struct ext4_allocation_context *ac) /* first, try per-file preallocation */ rcu_read_lock(); - list_for_each_rcu(cur, ei-i_prealloc_list) { - pa = list_entry(cur, struct ext4_prealloc_space, pa_inode_list); + list_for_each_entry_rcu(pa, ei-i_prealloc_list, pa_inode_list) { /* all fields in this condition don't change, * so we can skip locking for them */ @@ -3442,8 +3435,7 @@ static noinline int ext4_mb_use_preallocated(struct ext4_allocation_context *ac) return 0; rcu_read_lock(); - list_for_each_rcu(cur, lg-lg_prealloc_list) { - pa = list_entry(cur, struct ext4_prealloc_space, pa_inode_list); + list_for_each_entry_rcu(pa, lg-lg_prealloc_list, pa_inode_list) { spin_lock(pa-pa_lock); if (pa-pa_deleted == 0 pa-pa_free = ac-ac_o_ex.fe_len) { atomic_inc(pa-pa_count); - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Error with the latest stable series of the patch queue.
Hi all, I am seeing the below error in the console. But the tests are reported as success. EXT4-fs: mballoc enabled EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode #204044: invalid magic - magic 0, entries 0, max 0(0), depth 0(0) EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode #204045: invalid magic - magic 0, entries 0, max 0(0), depth 0(0) EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode #204047: invalid magic - magic 0, entries 0, max 0(0), depth 0(0) EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode #204056: invalid magic - magic 0, entries 0, max 0(0), depth 0(0) EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode #204061: invalid magic - magic 0, entries 0, max 0(0), depth 0(0) EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode #204065: invalid magic - magic 0, entries 0, max 0(0), depth 0(0) EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode #204068: invalid magic - magic 0, entries 0, max 0(0), depth 0(0) EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode #204069: invalid magic - magic 0, entries 0, max 0(0), depth 0(0) EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode #204071: invalid magic - magic 0, entries 0, max 0(0), depth 0(0) EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode #204077: invalid magic - -aneesh - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] clear extents flag on inodes created in ext4_mknod
On Tue, Feb 19, 2008 at 10:19:44PM +0530, Aneesh Kumar K.V wrote: On Tue, Feb 19, 2008 at 10:39:52AM -0600, Eric Sandeen wrote: Eric Sandeen wrote: e2fsck doesn't expect to find char, block, fifo, or socket files with the extent flag set, so clear that in ext4_mknod. Signed-off-by: Eric Sandeen [EMAIL PROTECTED] --- Index: linux-2.6.24/fs/ext4/namei.c === --- linux-2.6.24.orig/fs/ext4/namei.c +++ linux-2.6.24/fs/ext4/namei.c @@ -1766,6 +1766,7 @@ retry: #ifdef CONFIG_EXT4DEV_FS_XATTR inode-i_op = ext4_special_inode_operations; #endif + EXT4_I(inode)-i_flags = ~EXT4_EXTENTS_FL; err = ext4_add_nondir(handle, dentry, inode); } ext4_journal_stop(handle); now that I think about it; perhaps it would be better to put this logic into ext4_new_inode, rather than setting it by default and clearing it here... that way new_inode() has all the logic about whether or not a particular type of file is in extents format. How about enabling it only for directory and regular files rather than enabling it globally and then disabling the flag for symlink and device files ? how about something like below. There are two reason for not inheriting the i_flag from directory. a) if the directory have extent flag set we end up with symlink which have extent flag set but on which ext4_ext_tree_init is not called. b) if we create a directory with extent flag and later mount the file system with -o noextents, the files created in that directory will also have extent flag set but we would not have called ext4_ext_tree_init for them. Both results in the stack below which cause ext4_error. Call Trace: [c0002c937320] [c01a0b84] .ext4_ext_find_extent+0x74/0x344 (unreliable) [c0002c9373e0] [c01a32c0] .ext4_ext_get_blocks+0x1e8/0xc3c [c0002c937570] [c0191394] .ext4_get_blocks_wrap+0xa4/0x19c [c0002c937640] [c0191588] .ext4_get_block+0xfc/0x16c [c0002c937700] [c011490c] .__block_prepare_write+0x1f0/0x4f8 [c0002c937810] [c0114e5c] .block_write_begin+0xc4/0x160 [c0002c9378e0] [c018ec28] .ext4_write_begin+0x12c/0x24c [c0002c9379e0] [c00a7ef4] .pagecache_write_begin+0x84/0x1e0 [c0002c937ac0] [c00f17f8] .__page_symlink+0x6c/0x154 [c0002c937b80] [c019784c] .ext4_symlink+0x1d8/0x2b8 [c0002c937c60] [c00f0508] .vfs_symlink+0x130/0x1c8 [c0002c937d00] [c00f0648] .sys_symlinkat+0xa8/0x110 [c0002c937e30] [c000872c] syscall_exit+0x0/0x40 and console message: EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode #204044: invalid magic - magic 0, entries 0, max 0(0), depth 0(0) diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c index d2c2e55..f430939 100644 --- a/fs/ext4/ialloc.c +++ b/fs/ext4/ialloc.c @@ -794,7 +794,12 @@ got: ei-i_dir_start_lookup = 0; ei-i_disksize = 0; - ei-i_flags = EXT4_I(dir)-i_flags ~EXT4_INDEX_FL; + /* +* Don't inherit extent flag from directory. We set extent flag on +* newly created directory and file only if -o extent mount option is +* specfied +*/ + ei-i_flags = EXT4_I(dir)-i_flags ~ (EXT4_INDEX_FL|EXT4_EXTENTS_FL); if (S_ISLNK(mode)) ei-i_flags = ~(EXT4_IMMUTABLE_FL|EXT4_APPEND_FL); /* dirsync only applies to directories */ @@ -836,8 +841,10 @@ got: ext4_std_error(sb, err); goto fail_free_drop; } - if (test_opt(sb, EXTENTS) !S_ISLNK(mode)) { - EXT4_I(inode)-i_flags |= EXT4_EXTENTS_FL; + if (test_opt(sb, EXTENTS)) { + /* set extent flag only for diretory and file */ + if (S_ISDIR(mode) || S_ISREG(mode)) + EXT4_I(inode)-i_flags |= EXT4_EXTENTS_FL; ext4_ext_tree_init(handle, inode); err = ext4_update_incompat_feature(handle, sb, EXT4_FEATURE_INCOMPAT_EXTENTS); diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c index 23902ba..da942bc 100644 --- a/fs/ext4/namei.c +++ b/fs/ext4/namei.c @@ -1771,7 +1771,6 @@ retry: #ifdef CONFIG_EXT4DEV_FS_XATTR inode-i_op = ext4_special_inode_operations; #endif - EXT4_I(inode)-i_flags = ~EXT4_EXTENTS_FL; err = ext4_add_nondir(handle, dentry, inode); } ext4_journal_stop(handle); - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Error with the latest stable series of the patch queue.
On Tue, Feb 19, 2008 at 06:15:01PM +0100, Valerie Clement wrote: Aneesh Kumar K.V wrote: Hi all, I am seeing the below error in the console. But the tests are reported as success. EXT4-fs: mballoc enabled EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode #204044: invalid magic - magic 0, entries 0, max 0(0), depth 0(0) EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode #204045: invalid magic - magic 0, entries 0, max 0(0), depth 0(0) EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode #204047: invalid magic - magic 0, entries 0, max 0(0), depth 0(0) EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode #204056: invalid magic - magic 0, entries 0, max 0(0), depth 0(0) EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode #204061: invalid magic - magic 0, entries 0, max 0(0), depth 0(0) EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode #204065: invalid magic - magic 0, entries 0, max 0(0), depth 0(0) EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode #204068: invalid magic - magic 0, entries 0, max 0(0), depth 0(0) EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode #204069: invalid magic - magic 0, entries 0, max 0(0), depth 0(0) EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode #204071: invalid magic - magic 0, entries 0, max 0(0), depth 0(0) EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode #204077: invalid magic - Hi Aneesh, I've got also several issues while running ffsb tests today. The tests ended with success but e2fsck reported an error: Pass 1: Checking inodes, blocks, and sizes Inode 3367164, i_size is 57380864, should be 57442304. Fix? Inode 3367164 is allocated in the last group of the filesystem. As I changed the allocation algorithm for the last group in the patch ext4_fix_block_alloc_algorithm_for_last_group.patch, I removed this patch and ran again the same test. I didn't reproduce the issue. *But* I reproduced it on a filesystem created with a smaller block size value (= 1024 instead of 4096 previously) and with a kernel *without* my patch applied. e2fsck reports the same error on inodes created in the last group. Sometimes in this configuration, error messages are also displayed on the console: EXT4-fs error (device sdc): ext4_valid_block_bitmap: Invalid block bitmap - block_group = 7358, block = 60276737 EXT4-fs error (device sdc): ext4_valid_block_bitmap: Invalid block bitmap - block_group = 7358, block = 60276737 and e2fsck reports errors like: Inode 2113777 has corrupt extent index at block 61165699 (logical -1) entry 0 Fix? So, there is a problem when allocating inodes in the last group: - without my patch when block size value is 1024, - with my patch when block size value is 4096. Could you check if your tests allocate inodes in the last group and run also e2fsck to see if it reports errors. Can you run the test with the below patch on top of stable series. diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c index d2c2e55..14fb73b 100644 --- a/fs/ext4/ialloc.c +++ b/fs/ext4/ialloc.c @@ -794,7 +794,12 @@ got: ei-i_dir_start_lookup = 0; ei-i_disksize = 0; - ei-i_flags = EXT4_I(dir)-i_flags ~EXT4_INDEX_FL; + /* +* Don't inherit extent flag from directory. We set extent flag on +* newly created directory and file only if -o extent mount option is +* specfied +*/ + ei-i_flags = EXT4_I(dir)-i_flags ~ (EXT4_INDEX_FL|EXT4_EXTENTS_FL); if (S_ISLNK(mode)) ei-i_flags = ~(EXT4_IMMUTABLE_FL|EXT4_APPEND_FL); /* dirsync only applies to directories */ @@ -836,13 +841,16 @@ got: ext4_std_error(sb, err); goto fail_free_drop; } - if (test_opt(sb, EXTENTS) !S_ISLNK(mode)) { - EXT4_I(inode)-i_flags |= EXT4_EXTENTS_FL; - ext4_ext_tree_init(handle, inode); - err = ext4_update_incompat_feature(handle, sb, - EXT4_FEATURE_INCOMPAT_EXTENTS); - if (err) - goto fail; + if (test_opt(sb, EXTENTS)) { + /* set extent flag only for diretory and file */ + if (S_ISDIR(mode) || S_ISREG(mode)) { + EXT4_I(inode)-i_flags |= EXT4_EXTENTS_FL; + ext4_ext_tree_init(handle, inode); + err = ext4_update_incompat_feature(handle, sb, + EXT4_FEATURE_INCOMPAT_EXTENTS); + if (err) + goto fail; + } } ext4_debug(allocating inode %lu\n, inode-i_ino); diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c index 23902ba..da942bc 100644 --- a/fs/ext4/namei.c +++ b/fs/ext4/namei.c @@ -1771,7 +1771,6
Re: Error with the latest stable series of the patch queue.
On Tue, Feb 19, 2008 at 06:15:01PM +0100, Valerie Clement wrote: Aneesh Kumar K.V wrote: Hi all, I am seeing the below error in the console. But the tests are reported as success. EXT4-fs: mballoc enabled EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode #204044: invalid magic - magic 0, entries 0, max 0(0), depth 0(0) EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode #204045: invalid magic - magic 0, entries 0, max 0(0), depth 0(0) EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode #204047: invalid magic - magic 0, entries 0, max 0(0), depth 0(0) EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode #204056: invalid magic - magic 0, entries 0, max 0(0), depth 0(0) EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode #204061: invalid magic - magic 0, entries 0, max 0(0), depth 0(0) EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode #204065: invalid magic - magic 0, entries 0, max 0(0), depth 0(0) EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode #204068: invalid magic - magic 0, entries 0, max 0(0), depth 0(0) EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode #204069: invalid magic - magic 0, entries 0, max 0(0), depth 0(0) EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode #204071: invalid magic - magic 0, entries 0, max 0(0), depth 0(0) EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode #204077: invalid magic - The above problem is due to symlink having extent flag set but not having extent tree initialized. That was mainly due to inheriting the inode i_flag from parent directory. I am right now testing fix for this. Hi Aneesh, I've got also several issues while running ffsb tests today. The tests ended with success but e2fsck reported an error: Pass 1: Checking inodes, blocks, and sizes Inode 3367164, i_size is 57380864, should be 57442304. Fix? Inode 3367164 is allocated in the last group of the filesystem. As I changed the allocation algorithm for the last group in the patch ext4_fix_block_alloc_algorithm_for_last_group.patch, I removed this patch and ran again the same test. I didn't reproduce the issue. *But* I reproduced it on a filesystem created with a smaller block size value (= 1024 instead of 4096 previously) and with a kernel *without* my patch applied. e2fsck reports the same error on inodes created in the last group. Sometimes in this configuration, error messages are also displayed on the console: EXT4-fs error (device sdc): ext4_valid_block_bitmap: Invalid block bitmap - block_group = 7358, block = 60276737 EXT4-fs error (device sdc): ext4_valid_block_bitmap: Invalid block bitmap - block_group = 7358, block = 60276737 and e2fsck reports errors like: Inode 2113777 has corrupt extent index at block 61165699 (logical -1) entry 0 Fix? So, there is a problem when allocating inodes in the last group: - without my patch when block size value is 1024, - with my patch when block size value is 4096. Could you check if your tests allocate inodes in the last group and run also e2fsck to see if it reports errors. For the moment, I have no idea how to fix that problem. This looks like a completely different problem. Will try to see if i can reproduce it here. -aneesh - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] clear extents flag on inodes created in ext4_mknod
On Tue, Feb 19, 2008 at 10:39:52AM -0600, Eric Sandeen wrote: Eric Sandeen wrote: e2fsck doesn't expect to find char, block, fifo, or socket files with the extent flag set, so clear that in ext4_mknod. Signed-off-by: Eric Sandeen [EMAIL PROTECTED] --- Index: linux-2.6.24/fs/ext4/namei.c === --- linux-2.6.24.orig/fs/ext4/namei.c +++ linux-2.6.24/fs/ext4/namei.c @@ -1766,6 +1766,7 @@ retry: #ifdef CONFIG_EXT4DEV_FS_XATTR inode-i_op = ext4_special_inode_operations; #endif + EXT4_I(inode)-i_flags = ~EXT4_EXTENTS_FL; err = ext4_add_nondir(handle, dentry, inode); } ext4_journal_stop(handle); now that I think about it; perhaps it would be better to put this logic into ext4_new_inode, rather than setting it by default and clearing it here... that way new_inode() has all the logic about whether or not a particular type of file is in extents format. How about enabling it only for directory and regular files rather than enabling it globally and then disabling the flag for symlink and device files ? -aneesh - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] ext4: set EXT4_EXTENTS_FL only for directory and regular files
Also don't inherit EXT4_EXTENTS_FL from parent directory. If we have a directory with extent flag set and later mount the file system with -o noextents, the files created in that directory will also have extent flag set but we would not have called ext4_ext_tree_init for them. This will cause error later when we are verifying the extent header Also we don't want to set extent flag for symlinks, char, block, fifo or socket Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] --- fs/ext4/ialloc.c | 22 +++--- fs/ext4/namei.c |1 - 2 files changed, 15 insertions(+), 8 deletions(-) diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c index 028e601..78d1094 100644 --- a/fs/ext4/ialloc.c +++ b/fs/ext4/ialloc.c @@ -794,7 +794,12 @@ got: ei-i_dir_start_lookup = 0; ei-i_disksize = 0; - ei-i_flags = EXT4_I(dir)-i_flags ~EXT4_INDEX_FL; + /* +* Don't inherit extent flag from directory. We set extent flag on +* newly created directory and file only if -o extent mount option is +* specified +*/ + ei-i_flags = EXT4_I(dir)-i_flags ~(EXT4_INDEX_FL|EXT4_EXTENTS_FL); if (S_ISLNK(mode)) ei-i_flags = ~(EXT4_IMMUTABLE_FL|EXT4_APPEND_FL); /* dirsync only applies to directories */ @@ -837,12 +842,15 @@ got: goto fail_free_drop; } if (test_opt(sb, EXTENTS)) { - EXT4_I(inode)-i_flags |= EXT4_EXTENTS_FL; - ext4_ext_tree_init(handle, inode); - err = ext4_update_incompat_feature(handle, sb, - EXT4_FEATURE_INCOMPAT_EXTENTS); - if (err) - goto fail; + /* set extent flag only for diretory and file */ + if (S_ISDIR(mode) || S_ISREG(mode)) { + EXT4_I(inode)-i_flags |= EXT4_EXTENTS_FL; + ext4_ext_tree_init(handle, inode); + err = ext4_update_incompat_feature(handle, sb, + EXT4_FEATURE_INCOMPAT_EXTENTS); + if (err) + goto fail; + } } ext4_debug(allocating inode %lu\n, inode-i_ino); diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c index 39d4af4..da942bc 100644 --- a/fs/ext4/namei.c +++ b/fs/ext4/namei.c @@ -2225,7 +2225,6 @@ retry: inode-i_op = ext4_fast_symlink_inode_operations; memcpy((char*)EXT4_I(inode)-i_data,symname,l); inode-i_size = l-1; - EXT4_I(inode)-i_flags = ~EXT4_EXTENTS_FL; } EXT4_I(inode)-i_disksize = inode-i_size; err = ext4_add_nondir(handle, dentry, inode); -- 1.5.4.1.97.g40aab-dirty - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] ext4: set EXT4_EXTENTS_FL only for directory and regular files
Mingming, On Wed, Feb 20, 2008 at 01:19:05AM +0530, Aneesh Kumar K.V wrote: Also don't inherit EXT4_EXTENTS_FL from parent directory. If we have a directory with extent flag set and later mount the file system with -o noextents, the files created in that directory will also have extent flag set but we would not have called ext4_ext_tree_init for them. This will cause error later when we are verifying the extent header Also we don't want to set extent flag for symlinks, char, block, fifo or socket Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] This should replace the below two patches in the patch queue. ext4-donot-set-extents-flag-for-any-symlinks.patch ext4-clear-extents-flag-on-inodes-created-in-ext4-mknod.patch -aneesh - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] ext4: Use page_mkwrite vma_operations to get mmap write notification.
We would like to get notified when we are doing a write on mmap section. This is needed with respect to preallocated area. We split the preallocated area into initialzed extent and uninitialzed extent in the call back. This let us handle ENOSPC better. Otherwise we get ENOSPC in the writepage and that would result in data loss. Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] --- fs/ext4/file.c | 19 ++- fs/ext4/inode.c |6 ++ include/linux/ext4_fs.h |1 + 3 files changed, 25 insertions(+), 1 deletions(-) diff --git a/fs/ext4/file.c b/fs/ext4/file.c index 20507a2..77341c1 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -123,6 +123,23 @@ force_commit: return ret; } +static struct vm_operations_struct ext4_file_vm_ops = { + .fault = filemap_fault, + .page_mkwrite = ext4_page_mkwrite, +}; + +static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma) +{ + struct address_space *mapping = file-f_mapping; + + if (!mapping-a_ops-readpage) + return -ENOEXEC; + file_accessed(file); + vma-vm_ops = ext4_file_vm_ops; + vma-vm_flags |= VM_CAN_NONLINEAR; + return 0; +} + const struct file_operations ext4_file_operations = { .llseek = generic_file_llseek, .read = do_sync_read, @@ -133,7 +150,7 @@ const struct file_operations ext4_file_operations = { #ifdef CONFIG_COMPAT .compat_ioctl = ext4_compat_ioctl, #endif - .mmap = generic_file_mmap, + .mmap = ext4_file_mmap, .open = generic_file_open, .release= ext4_release_file, .fsync = ext4_sync_file, diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 34f3eb6..81faa67 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -3466,3 +3466,9 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val) return err; } + +int ext4_page_mkwrite(struct vm_area_struct *vma, struct page *page) +{ + return block_page_mkwrite(vma, page, ext4_get_block); +} + diff --git a/include/linux/ext4_fs.h b/include/linux/ext4_fs.h index 22810b1..8f5a563 100644 --- a/include/linux/ext4_fs.h +++ b/include/linux/ext4_fs.h @@ -1059,6 +1059,7 @@ extern void ext4_set_aops(struct inode *inode); extern int ext4_writepage_trans_blocks(struct inode *); extern int ext4_block_truncate_page(handle_t *handle, struct page *page, struct address_space *mapping, loff_t from); +extern int ext4_page_mkwrite(struct vm_area_struct *vma, struct page *page); /* ioctl.c */ extern long ext4_ioctl(struct file *, unsigned int, unsigned long); -- 1.5.4.1.97.g40aab-dirty - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] ext4: Don't mark the filesystem with errors if we fail to fallocate.
IF we fail fallocate don't call ext4_error. Also don't hide errors from ext4_get_blocks_wrap Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] --- fs/ext4/extents.c |5 +++-- 1 files changed, 3 insertions(+), 2 deletions(-) diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index 5b22f71..bb01ac6 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -2653,13 +2653,14 @@ retry: ret = ext4_get_blocks_wrap(handle, inode, block, max_blocks, map_bh, EXT4_CREATE_UNINITIALIZED_EXT, 0); - WARN_ON(ret = 0); if (ret = 0) { +#ifdef EXT4FS_DEBUG + WARN_ON(ret = 0); ext4_error(inode-i_sb, ext4_fallocate, ext4_ext_get_blocks returned error: inode#%lu, block=%u, max_blocks=%lu, inode-i_ino, block, max_blocks); - ret = -EIO; +#endif ext4_mark_inode_dirty(handle, inode); ret2 = ext4_journal_stop(handle); break; -- 1.5.4.1.97.g40aab-dirty - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] ext4: When reading from fallocated blocks make sure we return zero.
On Mon, Feb 18, 2008 at 04:14:34PM -0800, Mingming Cao wrote: On Sat, 2008-02-16 at 08:53 +0530, Aneesh Kumar K.V wrote: How about the following patch? Regards, Mingming ext4: ext4_get_blocks_wrap fix for writing to preallocated From: Mingming Cao [EMAIL PROTECTED] This patch fixed a issue with wrting to a preallocated blocks. A write hit a BUG_ON() in fs/buffer.c saying the buffer is not mapped. On the write path, ext4_get_block_wrap() is called with create=1, but it will pass create=0 down to the underlying ext4ext_get_blocks() to do a look up first. In the preallocation case, ext4_ext_get_blocks() with create = 0, will return number of blocks pre-allocated and buffer head unmapped. ext4_get_blocks_wrap() thinks it succeeds too early, without checking if it needs again call ext4_ext_get_blocks with create = 1 which would do proper handling for writing to preallocated blocks: split the extent to initialized and uninitialized one and returns the mapped buffer head. Treating preallocated blocks as holes equally(i.e. ignoring the number of blocks pre-allocated and returns 0) when get_blocks() is called with create = 0 is not enough. ext4_ext_get_blocks() needs to differentiate these two cases for delayed allocation purpose, as for holes it need to do reservation and prepare for later delayed allocation, but for pre-allocated blocks it needs skip that work. It would makes things more clear if we have clear definition of what get_blocks() return value means. Similar to ext4_get_blocks_handle(), the following * return 0, # of blocks already allocated * if these are pre-allocated blocks and create = 0 * buffer head is unmapped * otherwise blocks are mapped. * * return = 0, if plain look up failed (blocks have not been allocated) * buffer head is unmapped * * return 0, error case. The for the write path, at ext4_ext_get_blocks_wrap(), it could check the buffer_mapped() status for preallocated extent before quit too early. Signed-off-by: Mingming Cao [EMAIL PROTECTED] Reviewed-by: Aneesh Kumar K.V [EMAIL PROTECTED] I guess we also need to make sure the buffer head have the mapped bit set. Something like the patch below. diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index bc7081f..69ccda9 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -2294,6 +2294,7 @@ int ext4_ext_get_blocks(handle_t *handle, struct inode *inode, struct ext4_allocation_request ar; __clear_bit(BH_New, bh_result-b_state); + __clear_bit(BH_Mapped, bh_result-b_state); ext_debug(blocks %u/%lu requested for inode %u\n, iblock, max_blocks, inode-i_ino); - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH -v2] ext4: Use page_mkwrite vma_operations to get mmap write notification.
We would like to get notified when we are doing a write on mmap section. This is needed with respect to preallocated area. We split the preallocated area into initialzed extent and uninitialzed extent in the call back. This let us handle ENOSPC better. Otherwise we get ENOSPC in the writepage and that would result in data loss. The changes are also needed to handle ENOSPC when writing to an mmap section of files with holes. Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] --- fs/ext4/file.c | 19 ++- fs/ext4/inode.c |6 ++ include/linux/ext4_fs.h |1 + 3 files changed, 25 insertions(+), 1 deletions(-) diff --git a/fs/ext4/file.c b/fs/ext4/file.c index 20507a2..77341c1 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -123,6 +123,23 @@ force_commit: return ret; } +static struct vm_operations_struct ext4_file_vm_ops = { + .fault = filemap_fault, + .page_mkwrite = ext4_page_mkwrite, +}; + +static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma) +{ + struct address_space *mapping = file-f_mapping; + + if (!mapping-a_ops-readpage) + return -ENOEXEC; + file_accessed(file); + vma-vm_ops = ext4_file_vm_ops; + vma-vm_flags |= VM_CAN_NONLINEAR; + return 0; +} + const struct file_operations ext4_file_operations = { .llseek = generic_file_llseek, .read = do_sync_read, @@ -133,7 +150,7 @@ const struct file_operations ext4_file_operations = { #ifdef CONFIG_COMPAT .compat_ioctl = ext4_compat_ioctl, #endif - .mmap = generic_file_mmap, + .mmap = ext4_file_mmap, .open = generic_file_open, .release= ext4_release_file, .fsync = ext4_sync_file, diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 34f3eb6..81faa67 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -3466,3 +3466,9 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val) return err; } + +int ext4_page_mkwrite(struct vm_area_struct *vma, struct page *page) +{ + return block_page_mkwrite(vma, page, ext4_get_block); +} + diff --git a/include/linux/ext4_fs.h b/include/linux/ext4_fs.h index 22810b1..8f5a563 100644 --- a/include/linux/ext4_fs.h +++ b/include/linux/ext4_fs.h @@ -1059,6 +1059,7 @@ extern void ext4_set_aops(struct inode *inode); extern int ext4_writepage_trans_blocks(struct inode *); extern int ext4_block_truncate_page(handle_t *handle, struct page *page, struct address_space *mapping, loff_t from); +extern int ext4_page_mkwrite(struct vm_area_struct *vma, struct page *page); /* ioctl.c */ extern long ext4_ioctl(struct file *, unsigned int, unsigned long); -- 1.5.4.1.97.g40aab-dirty - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] ext4: When reading from fallocated blocks make sure we return zero.
On Sat, Feb 16, 2008 at 08:53:34AM +0530, Aneesh Kumar K.V wrote: On Fri, Feb 15, 2008 at 11:43:04AM -0800, Mingming Cao wrote: On Fri, 2008-02-15 at 23:46 +0530, Aneesh Kumar K.V wrote: fallocate blocks are considered as sparse area and read from them should return zero. ext4_ext_get_blocks should return zero for read request. The patch itself looks harmless, but I still don't see how this could fix the problem you described at irc: a write hit a BUG_ON() in fs/buffer.c saying the buffer is not mapped. Could you add more details here? Write will take the below call chain ext4_write_begin block_write_begin __block_prepare_write ext4_getblock ext4_get_blocks_wrap (1) ext4_ext_get_blocks with create = 0 return allocated ll_rw_block if buffer not uptodate. submit_bh BUG_ON(!buffer_mapped(bh)) ext4_ext_get_blocks at (1) should have returned 0. That would cause ext4_get_blocks_wrap to again call ext4_ext_get_blocks with create = 1 and that would have returned us the buffer head which is mapped. This would also result in splitting the extent to initialized and uninitialized one. The change is also needed to get mmap on fallocate space to work. [ cut here ] WARNING: at fs/buffer.c:1680 __block_write_full_page+0x101/0x2f1() Modules linked in: Pid: 2478, comm: mmaptest Not tainted 2.6.25-rc1 #12 [c0120e84] warn_on_slowpath+0x41/0x51 [c0108c00] ? native_sched_clock+0x2d/0x9f [c013b44e] ? __lock_acquire+0xacb/0xb13 [c013b44e] ? __lock_acquire+0xacb/0xb13 [c0180f97] __block_write_full_page+0x101/0x2f1 [c01d053f] ? ext4_get_block+0x0/0xc0 [c018124f] block_write_full_page+0xc8/0xd1 [c01d053f] ? ext4_get_block+0x0/0xc0 [c01d1a36] ext4_ordered_writepage+0xad/0x146 [c01cec12] ? bget_one+0x0/0xb [c014c5dd] __writepage+0xb/0x25 [c014cab2] write_cache_pages+0x180/0x287 [c014c5d2] ? __writepage+0x0/0x25 [c01527d5] ? __do_fault+0x2e2/0x324 [c0147889] ? unlock_page+0x25/0x28 [c014cbd6] generic_writepages+0x1d/0x27 [c014cc0c] do_writepages+0x2c/0x34 [c0147fe1] __filemap_fdatawrite_range+0x5b/0x67 [c01481ba] filemap_fdatawrite+0x15/0x17 [c017ea3d] do_fsync+0x2c/0x9a [c017eacb] __do_fsync+0x20/0x2f [c017eaf9] sys_fsync+0xd/0xf [c0104992] sysenter_past_esp+0x5f/0xa5 === ---[ end trace 5ba60b430e0af601 ]--- -aneesh - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] ext4: When reading from fallocated blocks make sure we return zero.
fallocate blocks are considered as sparse area and read from them should return zero. ext4_ext_get_blocks should return zero for read request. Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] --- fs/ext4/extents.c |8 +++- 1 files changed, 7 insertions(+), 1 deletions(-) diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index 3efbfd1..5b22f71 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -2379,8 +2379,14 @@ int ext4_ext_get_blocks(handle_t *handle, struct inode *inode, } if (create == EXT4_CREATE_UNINITIALIZED_EXT) goto out; - if (!create) + if (!create) { + /* +* read request should return zero blocks +* allocated +*/ + allocated = 0; goto out2; + } ret = ext4_ext_convert_to_initialized(handle, inode, path, iblock, -- 1.5.4.1.97.g40aab-dirty - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC/PATCH] ext4: Fix the memory leak and buffer head leak with respect to ext4_ext_path
The path variable returned via ext4_ext_find_extent is a kmalloc variable and need to be freeded. It also contain refrences to buffer_head which need to be dropped. Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] --- fs/ext4/extents.c |6 +++--- fs/ext4/migrate.c |5 + include/linux/ext4_fs_extents.h |1 + 3 files changed, 9 insertions(+), 3 deletions(-) diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index e856f66..995ac16 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -349,7 +349,7 @@ static void ext4_ext_show_leaf(struct inode *inode, struct ext4_ext_path *path) #define ext4_ext_show_leaf(inode,path) #endif -static void ext4_ext_drop_refs(struct ext4_ext_path *path) +void ext4_ext_drop_refs(struct ext4_ext_path *path) { int depth = path-p_depth; int i; @@ -2200,10 +2200,10 @@ static int ext4_ext_convert_to_initialized(handle_t *handle, newdepth = ext_depth(inode); if (newdepth != depth) { depth = newdepth; - path = ext4_ext_find_extent(inode, iblock, NULL); + ext4_ext_drop_refs(path); + path = ext4_ext_find_extent(inode, iblock, path); if (IS_ERR(path)) { err = PTR_ERR(path); - path = NULL; goto out; } eh = path[depth].p_hdr; diff --git a/fs/ext4/migrate.c b/fs/ext4/migrate.c index 8c6c685..5c1e27d 100644 --- a/fs/ext4/migrate.c +++ b/fs/ext4/migrate.c @@ -43,6 +43,7 @@ static int finish_range(handle_t *handle, struct inode *inode, if (IS_ERR(path)) { retval = PTR_ERR(path); + path = NULL; goto err_out; } @@ -74,6 +75,10 @@ static int finish_range(handle_t *handle, struct inode *inode, } retval = ext4_ext_insert_extent(handle, inode, path, newext); err_out: + if (path) { + ext4_ext_drop_refs(path); + kfree(path); + } lb-first_pblock = 0; return retval; } diff --git a/include/linux/ext4_fs_extents.h b/include/linux/ext4_fs_extents.h index 697da4b..1285c58 100644 --- a/include/linux/ext4_fs_extents.h +++ b/include/linux/ext4_fs_extents.h @@ -227,5 +227,6 @@ extern int ext4_ext_search_left(struct inode *, struct ext4_ext_path *, ext4_lblk_t *, ext4_fsblk_t *); extern int ext4_ext_search_right(struct inode *, struct ext4_ext_path *, ext4_lblk_t *, ext4_fsblk_t *); +extern void ext4_ext_drop_refs(struct ext4_ext_path *); #endif /* _LINUX_EXT4_EXTENTS */ -- 1.5.4.1.97.g40aab-dirty - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC/PATCH] ext4: Request for journal write access early.
In ext4_ext_convert_to_initialized before we need to request for journal write access before we even modify the extent length. Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] --- fs/ext4/extents.c | 11 --- 1 files changed, 8 insertions(+), 3 deletions(-) diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index 995ac16..c4d6f19 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -2168,6 +2168,10 @@ static int ext4_ext_convert_to_initialized(handle_t *handle, newblock = iblock - ee_block + ext_pblock(ex); ex2 = ex; + err = ext4_ext_get_access(handle, inode, path + depth); + if (err) + goto out; + /* ex1: ee_block to iblock - 1 : uninitialized */ if (iblock ee_block) { ex1 = ex; @@ -2210,6 +2214,10 @@ static int ext4_ext_convert_to_initialized(handle_t *handle, ex = path[depth].p_ext; if (ex2 != newex) ex2 = ex; + + err = ext4_ext_get_access(handle, inode, path + depth); + if (err) + goto out; } allocated = max_blocks; } @@ -2230,9 +2238,6 @@ static int ext4_ext_convert_to_initialized(handle_t *handle, ex2-ee_len = cpu_to_le16(allocated); if (ex2 != ex) goto insert; - err = ext4_ext_get_access(handle, inode, path + depth); - if (err) - goto out; /* * New (initialized) extent starts from the first block * in the current extent. i.e., ex2 == ex -- 1.5.4.1.97.g40aab-dirty - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] ext4: Set directory link count to zero if we fail to create the directory.
With ext4 we fail directory creation if we fail to allocate initial block for the directory. With EXT4_FEATURE_RO_COMPAT_DIR_NLINK, if we fail to create the directory, the inode link count was wrongly set to 1. This cause e2fsck throws the below warning Inode 1015 is a zero-length directory. Clear? yes Inode 1120 is a zero-length directory. Clear? yes Inode 1121 is a zero-length directory. Clear? yes Fix the same by dropping the inode link count using drop_nlink Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] --- fs/ext4/namei.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c index a9347fb..fd3b031 100644 --- a/fs/ext4/namei.c +++ b/fs/ext4/namei.c @@ -1805,7 +1805,7 @@ retry: inode-i_size = EXT4_I(inode)-i_disksize = inode-i_sb-s_blocksize; dir_block = ext4_bread (handle, inode, 0, 1, err); if (!dir_block) { - ext4_dec_count(handle, inode); /* is this nlink == 0? */ + drop_nlink(inode); ext4_mark_inode_dirty(handle, inode); iput (inode); goto out_stop; -- 1.5.4.1.97.g40aab-dirty - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
EXT4_FEATURE_RO_COMPAT_DIR_NLINK and the directory size.
Steps: [root]# more /root/largedir i=65000 mkdir /mnt/tmp/test/ cd /mnt/tmp/test/ while [ $i -gt 0 ] do mkdir $i i=$(expr $i - 1) done # /root/largedir # cd /mnt/tmp/test # rm -rf * # cd .. # ls -al drwxr-xr-x 1 root root 1380352 Feb 12 07:15 test # e2fsck -fv /dev/sda7 # ls -al drwxr-xr-x 2 root root 1380352 Feb 12 07:15 test So e2fsck fix the link count. But the size is not updated. If i try to run the script again, i get EXT4-fs error (device sda7): ext4_find_entry: reading directory #187681 offset 0 attempt to access beyond end of device sda7: rw=32, want=17582522464, limit=19551042 attempt to access beyond end of device sda7: rw=32, want=32620150800, limit=19551042 attempt to access beyond end of device sda7: rw=32, want=32620150800, limit=19551042 Also after running fsstress fsck fails with zero size directory on powerpc. I am finding directories with i_links_count == 1 i_blocks == 0 I missed to check whether the l_i_blocks_hi was having some value. -aneesh - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][0/28] Lustre e2fsprogs patch series
On Sun, Feb 10, 2008 at 11:19:12PM -0500, Theodore Tso wrote: On Sat, Feb 02, 2008 at 12:59:43AM -0700, Andreas Dilger wrote: The following series of emails will contain the large part of the e2fsprogs patch series that is used for Lustre. It will not contain the regression tests for EXTENTS nor the DIR_NLINK features, as those are very large and were previously submitted. A full tarball that includes the patches, series, and regression tests will be uploaded to ftp://ftp.lustre.org/pub/lustre/other/e2fsprogs/ Hey Andreas, I've applied these patches to the tip of maint, and exported it as e2fsprogs-interim on the e2fsprogs git repository. There quite a few patch conflicts, mostly due to some changes that had happened on the tip of maint, but also apparently because your patchset was missing the flex bg changes. I haven't applied them yet, but I'll probably tack them at the end. If you could sanity check to make sure they are sane, I would appreciate it. Needed this diff to get it build from ../../../lib/ext2fs/crc16.h:18, from ../../../lib/ext2fs/crc16.c:9: /usr/include/sys/types.h:46: error: conflicting types for 'loff_t' /usr/include/linux/types.h:30: error: previous declaration of 'loff_t' was here /usr/include/sys/types.h:62: error: conflicting types for 'dev_t' /usr/include/linux/types.h:13: error: previous declaration of 'dev_t' was here In file included from /usr/include/sys/types.h:133, from /usr/include/stdlib.h:438, from ../../../lib/ext2fs/crc16.h:18, from ../../../lib/ext2fs/crc16.c:9 diff --git a/lib/ext2fs/crc16.c b/lib/ext2fs/crc16.c index 5d87e10..246813f 100644 --- a/lib/ext2fs/crc16.c +++ b/lib/ext2fs/crc16.c @@ -5,7 +5,6 @@ * Version 2. See the file COPYING for more details. */ -#include linux/types.h #include crc16.h /** CRC table for the CRC-16. The poly is 0x8005 (x^16 + x^15 + x^2 + 1) */ - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] ext4: Don't claim block from group which has corrupt bitmap
On Mon, Feb 11, 2008 at 05:00:18PM +0530, Aneesh Kumar K.V wrote: In ext4_mb_complex_scan_group, if the extent length of the newly found extentet is greater than than the total free blocks counted in group info, break without claiming the block. Document different ext4_error usage, explaining the state with which we continue if we mount with errors=continue Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] --- fs/ext4/mballoc.c | 16 +++- 1 files changed, 15 insertions(+), 1 deletions(-) This was actually part of the version 2 of the patch ext4_avoid_panic_in_case_of_corrupted_bitmap.patch. But we missed it in the last update. So sending it as a separate patch. -aneesh - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BUG_ON at mballoc.c:3752
On Wed, Feb 06, 2008 at 03:59:48PM -0600, Dave Kleikamp wrote: File systems should not call BUG() due to a corrupt file system. Instead the code should fail the operation, possibly marking the file system read-only (or panicking) depending on the errors= mount option. Eric Sandeen explained me the same on IRC. I was busy with the migrate locking bug. That's why i didn't update here. Today i tried to reproduce the problem using the image provided. But in my case it is not hitting the BUG_ON (mostly due to single cpu). I did look at the code and am not still not clear how we can hit that BUG_ON. prealloc free space pa_free is generated out of bitmap. So only if something corrupted bitmap after we initialized prealloc space we will hit this case. In mballoc we error out if the block allocated or fall in system zone. One thing i noticed is, the journal is corrupt. So the only possibility that i have is journal write resulted in bitmap corruption. I also looked at the mballoc to make sure we don't panic in case of a corrupt bitmap. Below is the patch that i have now. This one is yet to go through the ABAT test but it would be nice to see whether the below change cause any other issues. Eric , can you run the test with below patch and see if this makes any difference ?. I know we are not fixing any bugs in the below patch. ext4: Don't panic in case of corrupt bitmap From: Aneesh Kumar K.V [EMAIL PROTECTED] Multiblock allocator was calling BUG_ON in many case if the free and used blocks count obtained looking at the bitmap is different from what the allocator internally accounted for. Use ext4_error in such case and don't panic the system. Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] --- fs/ext4/mballoc.c | 35 +-- 1 files changed, 21 insertions(+), 14 deletions(-) diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 06d1f52..656729b 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -680,7 +680,6 @@ static void *mb_find_buddy(struct ext4_buddy *e4b, int order, int *max) { char *bb; - /* FIXME!! is this needed */ BUG_ON(EXT4_MB_BITMAP(e4b) == EXT4_MB_BUDDY(e4b)); BUG_ON(max == NULL); @@ -964,7 +963,7 @@ static void ext4_mb_generate_buddy(struct super_block *sb, grp-bb_fragments = fragments; if (free != grp-bb_free) { - printk(KERN_DEBUG + ext4_error(sb, __FUNCTION__, EXT4-fs: group %lu: %u blocks in bitmap, %u in gd\n, group, free, grp-bb_free); grp-bb_free = free; @@ -1821,13 +1820,24 @@ static void ext4_mb_complex_scan_group(struct ext4_allocation_context *ac, i = ext4_find_next_zero_bit(bitmap, EXT4_BLOCKS_PER_GROUP(sb), i); if (i = EXT4_BLOCKS_PER_GROUP(sb)) { - BUG_ON(free != 0); + /* +* IF we corrupt the bitmap we won't find any +* free blocks even though group info says we +* we have free blocks +*/ + ext4_error(sb, __FUNCTION__, %d free blocks as per + group info. But bitmap says 0\n, + free); break; } mb_find_extent(e4b, 0, i, ac-ac_g_ex.fe_len, ex); BUG_ON(ex.fe_len = 0); - BUG_ON(free ex.fe_len); + if (free ex.fe_len) { + ext4_error(sb, __FUNCTION__, %d free blocks as per + group info. But got %d blocks\n, + free, ex.fe_len); + } ext4_mb_measure_extent(ac, ex, e4b); @@ -3354,13 +3364,10 @@ static void ext4_mb_use_group_pa(struct ext4_allocation_context *ac, ac-ac_pa = pa; /* we don't correct pa_pstart or pa_plen here to avoid -* possible race when tte group is being loaded concurrently +* possible race when the group is being loaded concurrently * instead we correct pa later, after blocks are marked -* in on-disk bitmap -- see ext4_mb_release_context() */ - /* -* FIXME!! but the other CPUs can look at this particular -* pa and think that it have enought free blocks if we -* don't update pa_free here right ? +* in on-disk bitmap -- see ext4_mb_release_context() +* Other CPUs are prevented from allocating from this pa by lg_mutex */ mb_debug(use %u/%u from group pa %p\n, pa-pa_lstart-len, len, pa); } @@ -3743,13 +3750,13 @@ static int ext4_mb_release_inode_pa(struct ext4_buddy *e4b, bit = next + 1; } if (free != pa-pa_free) { - printk(KERN_ERR pa %p: logic %lu, phys. %lu, len %lu\n
[PATCH] ext4: Fix circular locking dependency with migrate and rm.
We now take inode-i_mutex lock to prevent any update of the inode i_data field. Before we switch the inode format we take i_data_sem to prevent parallel read. === [ INFO: possible circular locking dependency detected ] 2.6.24-rc8 #6 --- rm/2401 is trying to acquire lock: (ei-i_data_sem){}, at: [c01dca58] ext4_get_blocks_wrap+0x21/0x108 but task is already holding lock: (jbd2_handle){--..}, at: [c01fc4a7] jbd2_journal_start+0xd2/0xff which lock already depends on the new lock. the existing dependency chain (in reverse order) is: - #1 (jbd2_handle){--..}: [c0143a5c] __lock_acquire+0xa31/0xc1a [c0143cbf] lock_acquire+0x7a/0x94 [c01fc4ca] jbd2_journal_start+0xf5/0xff [c01e3539] ext4_journal_start_sb+0x48/0x4a [c01eb980] ext4_ext_migrate+0x7d/0x535 [c01df328] ext4_ioctl+0x528/0x56c [c0177700] do_ioctl+0x50/0x67 [c017794e] vfs_ioctl+0x237/0x24a [c0177992] sys_ioctl+0x31/0x4b [c0104f8a] sysenter_past_esp+0x5f/0xa5 [] 0x - #0 (ei-i_data_sem){}: [c014394c] __lock_acquire+0x921/0xc1a [c0143cbf] lock_acquire+0x7a/0x94 [c044f247] down_read+0x42/0x79 [c01dca58] ext4_get_blocks_wrap+0x21/0x108 [c01dcba1] ext4_getblk+0x62/0x1c4 [c01e0de9] ext4_find_entry+0x350/0x5b7 [c01e200c] ext4_unlink+0x6e/0x1a4 [c017449e] vfs_unlink+0x49/0x89 [c0175f02] do_unlinkat+0x96/0x12c [c0175fa8] sys_unlink+0x10/0x12 [c0104f8a] sysenter_past_esp+0x5f/0xa5 [] 0x other info that might help us debug this: 3 locks held by rm/2401: #0: (type-i_mutex_dir_key#5/1){--..}, at: [c0175eca] do_unlinkat+0x5e/0x12c #1: (sb-s_type-i_mutex_key#8){--..}, at: [c017448b] vfs_unlink+0x36/0x89 #2: (jbd2_handle){--..}, at: [c01fc4a7] jbd2_journal_start+0xd2/0xff stack backtrace: Pid: 2401, comm: rm Not tainted 2.6.24-rc8 #6 [c0106017] show_trace_log_lvl+0x1a/0x2f [c0106893] show_trace+0x12/0x14 [c0106b89] dump_stack+0x6c/0x72 [c0141b26] print_circular_bug_tail+0x5f/0x68 [c014394c] __lock_acquire+0x921/0xc1a [c0143cbf] lock_acquire+0x7a/0x94 [c044f247] down_read+0x42/0x79 [c01dca58] ext4_get_blocks_wrap+0x21/0x108 [c01dcba1] ext4_getblk+0x62/0x1c4 [c01e0de9] ext4_find_entry+0x350/0x5b7 [c01e200c] ext4_unlink+0x6e/0x1a4 [c017449e] vfs_unlink+0x49/0x89 [c0175f02] do_unlinkat+0x96/0x12c [c0175fa8] sys_unlink+0x10/0x12 [c0104f8a] sysenter_past_esp+0x5f/0xa5 Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] --- fs/ext4/migrate.c | 117 +--- 1 files changed, 74 insertions(+), 43 deletions(-) diff --git a/fs/ext4/migrate.c b/fs/ext4/migrate.c index 9ee1f7c..8c6c685 100644 --- a/fs/ext4/migrate.c +++ b/fs/ext4/migrate.c @@ -61,10 +61,9 @@ static int finish_range(handle_t *handle, struct inode *inode, retval = ext4_journal_restart(handle, needed); if (retval) goto err_out; - } - if (needed) { + } else if (needed) { retval = ext4_journal_extend(handle, needed); - if (retval != 0) { + if (retval) { /* * IF not able to extend the journal restart the journal */ @@ -220,6 +219,26 @@ static int update_tind_extent_range(handle_t *handle, struct inode *inode, } +static int extend_credit_for_blkdel(handle_t *handle, struct inode *inode) +{ + int retval = 0, needed; + + if (handle-h_buffer_credits EXT4_RESERVE_TRANS_BLOCKS) + return 0; + /* +* We are freeing a blocks. During this we touch +* superblock, group descriptor and block bitmap. +* So allocate a credit of 3. We may update +* quota (user and group). +*/ + needed = 3 + 2*EXT4_QUOTA_TRANS_BLOCKS(inode-i_sb); + + if (ext4_journal_extend(handle, needed) != 0) + retval = ext4_journal_restart(handle, needed); + + return retval; +} + static int free_dind_blocks(handle_t *handle, struct inode *inode, __le32 i_data) { @@ -234,11 +253,14 @@ static int free_dind_blocks(handle_t *handle, tmp_idata = (__le32 *)bh-b_data; for (i = 0; i max_entries; i++) { - if (tmp_idata[i]) + if (tmp_idata[i]) { + extend_credit_for_blkdel(handle, inode); ext4_free_blocks(handle, inode, le32_to_cpu(tmp_idata[i]), 1, 1); + } } put_bh(bh); + extend_credit_for_blkdel(handle, inode); ext4_free_blocks(handle, inode, le32_to_cpu(i_data), 1, 1); return 0; } @@ -267,29 +289,32 @@ static int free_tind_blocks(handle_t *handle
Re: jbd2_handle and i_data_sem circular locking dependency detected
On Mon, Feb 04, 2008 at 05:31:56PM +0100, Jan Kara wrote: Hi, On Mon 04-02-08 15:42:28, Aneesh Kumar K.V wrote: This is with the new ext3 - ext4 migrate code added. The recently added lockdep for jbd2 helped to find this out. We want to hold the i_data_sem on the ext3 inode during migration to prevent walking the ext3 inode when it is being converted to ext4 format. Also we want to avoid file truncation and new blocks being added while converting to ext4. Also we dont want to reserve large number of credits for journal. Any idea how to fix this ? Hmm, while briefly looking at the code - why do you introduce i_data_sem and not use i_alloc_sem which is already in VFS inode? That is aimed exactly at the serialization of truncates, writes and similar users. That doesn't solve problems with lock ordering but I was just wondering... Another problem - ext4_fallocate() has the same lock ordering problem as the migration code and maybe there are others... One (stupid) solution to your problem is to make i_data_sem be always locked before the transaction is started. It could possibly have negative performance impact because you'd have to hold the semaphore for a longer time and thus a writer would block readers for longer time. So one would have to measure how big difference that would make. Another possibility is to start a single transaction for migration and extend it as long as you can (as truncate does it). And when you can't extend any more, you drop the i_data_sem and start a new transaction and acquire the semaphore again. This has the disadvantage that after dropping the semaphore you have to resync your original inode with the temporary one your are building which probably ends up being ugly as night... Hmm, but maybe we could get rid of this - hold i_mutex to protect against all writes (that ranks outside of transaction start so you can hold it for the whole migration time - maybe you even hold it if you are called from the write path...). After dropping i_data_sem you let some readers proceed but writers still wait on i_mutex so the file shouldn't change under you (but I suggest adding some BUG_ONs to verify that the file really doesn't change :). How about the patch below. I did the below testing a) migrate a file b) run fs_inode fsstres fsx_linux. The intention was to find out whether the new locking is breaking any of the other expected hierarchy. It seems to fine. I didn't get any lockdep warning. ext4: Fix circular locking dependency with migrate and rm. From: Aneesh Kumar K.V [EMAIL PROTECTED] We now take inode-i_mutex lock to prevent any update of the inode i_data field. Before we switch the inode format we take i_data_sem to prevent parallel read. === [ INFO: possible circular locking dependency detected ] 2.6.24-rc8 #6 --- rm/2401 is trying to acquire lock: (ei-i_data_sem){}, at: [c01dca58] ext4_get_blocks_wrap+0x21/0x108 but task is already holding lock: (jbd2_handle){--..}, at: [c01fc4a7] jbd2_journal_start+0xd2/0xff which lock already depends on the new lock. the existing dependency chain (in reverse order) is: - #1 (jbd2_handle){--..}: [c0143a5c] __lock_acquire+0xa31/0xc1a [c0143cbf] lock_acquire+0x7a/0x94 [c01fc4ca] jbd2_journal_start+0xf5/0xff [c01e3539] ext4_journal_start_sb+0x48/0x4a [c01eb980] ext4_ext_migrate+0x7d/0x535 [c01df328] ext4_ioctl+0x528/0x56c [c0177700] do_ioctl+0x50/0x67 [c017794e] vfs_ioctl+0x237/0x24a [c0177992] sys_ioctl+0x31/0x4b [c0104f8a] sysenter_past_esp+0x5f/0xa5 [] 0x - #0 (ei-i_data_sem){}: [c014394c] __lock_acquire+0x921/0xc1a [c0143cbf] lock_acquire+0x7a/0x94 [c044f247] down_read+0x42/0x79 [c01dca58] ext4_get_blocks_wrap+0x21/0x108 [c01dcba1] ext4_getblk+0x62/0x1c4 [c01e0de9] ext4_find_entry+0x350/0x5b7 [c01e200c] ext4_unlink+0x6e/0x1a4 [c017449e] vfs_unlink+0x49/0x89 [c0175f02] do_unlinkat+0x96/0x12c [c0175fa8] sys_unlink+0x10/0x12 [c0104f8a] sysenter_past_esp+0x5f/0xa5 [] 0x other info that might help us debug this: 3 locks held by rm/2401: #0: (type-i_mutex_dir_key#5/1){--..}, at: [c0175eca] do_unlinkat+0x5e/0x12c #1: (sb-s_type-i_mutex_key#8){--..}, at: [c017448b] vfs_unlink+0x36/0x89 #2: (jbd2_handle){--..}, at: [c01fc4a7] jbd2_journal_start+0xd2/0xff stack backtrace: Pid: 2401, comm: rm Not tainted 2.6.24-rc8 #6 [c0106017] show_trace_log_lvl+0x1a/0x2f [c0106893] show_trace+0x12/0x14 [c0106b89] dump_stack+0x6c/0x72 [c0141b26] print_circular_bug_tail+0x5f/0x68 [c014394c] __lock_acquire+0x921/0xc1a [c0143cbf] lock_acquire+0x7a/0x94 [c044f247] down_read+0x42/0x79 [c01dca58] ext4_get_blocks_wrap+0x21/0x108 [c01dcba1] ext4_getblk+0x62/0x1c4 [c01e0de9] ext4_find_entry+0x350
Re: jbd2_handle and i_data_sem circular locking dependency detected
On Tue, Feb 05, 2008 at 02:42:28PM +0100, Jan Kara wrote: On Tue 05-02-08 17:53:42, Aneesh Kumar K.V wrote: How about the patch below. I did the below testing a) migrate a file b) run fs_inode fsstres fsx_linux. The intention was to find out whether the new locking is breaking any of the other expected hierarchy. It seems to fine. I didn't get any lockdep warning. I think there's a problem in the patch. I don't think you can call free_ind_block() while readers are accessing the file. That could make them think the file contains holes when they aren't there (or even worse read freed blocks or so). So you need to lock i_data_sem before this call (that means you have to move journal_extend() as well). Actually, I don't quite get why ext4_journal_extend() is called in that function. You can just count with the 1 credit in ext4_ext_migrate() when you call ext4_journal_extend() before calling ext4_ext_swap_inode_data(). Oh, probably because free_ind_block() could extend the transaction (which they don't do now as far as I can see). ext4_journal_extend is called to extend the journal by one credit to take care of writing the block containing inode. I moved the journal extend before taking i_data_sem lock. diff --git a/fs/ext4/migrate.c b/fs/ext4/migrate.c index f97c993..dc6617a 100644 --- a/fs/ext4/migrate.c +++ b/fs/ext4/migrate.c @@ -302,10 +302,6 @@ static int ext4_ext_swap_inode_data(handle_t *handle, struct inode *inode, struct ext4_inode_info *ei = EXT4_I(inode); struct ext4_inode_info *tmp_ei = EXT4_I(tmp_inode); - retval = free_ind_block(handle, inode); - if (retval) - goto err_out; - /* * One credit accounted for writing the * i_data field of the original inode @@ -318,6 +314,10 @@ static int ext4_ext_swap_inode_data(handle_t *handle, struct inode *inode, } down_write(EXT4_I(inode)-i_data_sem); + retval = free_ind_block(handle, inode); + if (retval) + goto err_out; + /* * We have the extent map build with the tmp inode. * Now copy the i_data across The above change also make sure we don't fail after we free the indirect blocks. BTW: That freeing code should really take into account that it can modify bitmaps in different block groups. It's even not that hard to do. Just before each ext4_free_blocks() in free_ind_block() you check whether you have still enough credits in the handle (use h_buffer_credits) and if not, extend it by some amount. I have a FIXME at migrate.c:524 documenting exactly that. The difficult question was by how much we should extent the journal. ? But in reality we might have accumulated enough journal credits, I never really ran across a case where we are running out of the journal credit. Maybe you could do some test like writing a big file with some data and then while migration is running read it in a loop and compare that MD5SUM is the same all the time. Also run some memory-pressure during this test so that data blocks aren't cached for the whole time of the test... That should reasonably stress the migration code. I have tested migrate by booting with mem= and doing parallel read/write and migrate. I didn't do the MDSUM compare. Will do that this time. Thanks for all the help with review. -aneesh - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: jbd2_handle and i_data_sem circular locking dependency detected
On Tue, Feb 05, 2008 at 05:34:04PM +0100, Jan Kara wrote: On Tue 05-02-08 21:57:03, Aneesh Kumar K.V wrote: I have a FIXME at migrate.c:524 documenting exactly that. The difficult question was by how much we should extent the journal. ? But in reality we might have accumulated enough journal credits, I never really ran across a case where we are running out of the journal credit. Yes, I don't think it is likely to happen in reality but if somebody would trigger this, it would be almost impossible to track down so one should be quite careful with these things... And as I described, doing it failsafe is easy - just look how try_to_extend_transaction() in ext4/inode.c handles similar problems with truncate. I moved the indirect block freeing after i update the original inode. That makes sure even if we fail to free the indirect blocks we have the original inode converted. Now since i don't need atomicity with freeing of blocks i extend the journal for each block freed. Below is the diff i have right now. diff --git a/fs/ext4/migrate.c b/fs/ext4/migrate.c index 1712844..1b00587 100644 --- a/fs/ext4/migrate.c +++ b/fs/ext4/migrate.c @@ -233,11 +233,14 @@ static int free_dind_blocks(handle_t *handle, tmp_idata = (__le32 *)bh-b_data; for (i = 0; i max_entries; i++) { - if (tmp_idata[i]) + if (tmp_idata[i]) { + extend_blkdelete_credit(handle, inode); ext4_free_blocks(handle, inode, le32_to_cpu(tmp_idata[i]), 1, 1); + } } put_bh(bh); + extend_blkdelete_credit(handle, inode); ext4_free_blocks(handle, inode, le32_to_cpu(i_data), 1, 1); return 0; } @@ -266,29 +269,32 @@ static int free_tind_blocks(handle_t *handle, } } put_bh(bh); + extend_blkdelete_credit(handle, inode); ext4_free_blocks(handle, inode, le32_to_cpu(i_data), 1, 1); return 0; } -static int free_ind_block(handle_t *handle, struct inode *inode) +static int free_ind_block(handle_t *handle, __le32 *i_data) { int retval; - struct ext4_inode_info *ei = EXT4_I(inode); - if (ei-i_data[EXT4_IND_BLOCK]) + /* ei-i_data[EXT4_IND_BLOCK] */ + if (i_data[0]) { + extend_blkdelete_credit(handle, inode); ext4_free_blocks(handle, inode, - le32_to_cpu(ei-i_data[EXT4_IND_BLOCK]), 1, 1); + le32_to_cpu(i_data[0]), 1, 1); + } - if (ei-i_data[EXT4_DIND_BLOCK]) { - retval = free_dind_blocks(handle, inode, - ei-i_data[EXT4_DIND_BLOCK]); + /* ei-i_data[EXT4_DIND_BLOCK] */ + if (i_data[1]) { + retval = free_dind_blocks(handle, inode, i_data[1]); if (retval) return retval; } - if (ei-i_data[EXT4_TIND_BLOCK]) { - retval = free_tind_blocks(handle, inode, - ei-i_data[EXT4_TIND_BLOCK]); + /* ei-i_data[EXT4_TIND_BLOCK] */ + if (i_data[2]) { + retval = free_tind_blocks(handle, inode, i_data[2]); if (retval) return retval; } @@ -299,6 +305,7 @@ static int ext4_ext_swap_inode_data(handle_t *handle, struct inode *inode, struct inode *tmp_inode) { int retval; + __le32 i_data[3]; struct ext4_inode_info *ei = EXT4_I(inode); struct ext4_inode_info *tmp_ei = EXT4_I(tmp_inode); @@ -313,11 +320,11 @@ static int ext4_ext_swap_inode_data(handle_t *handle, struct inode *inode, goto err_out; } - down_write(EXT4_I(inode)-i_data_sem); - retval = free_ind_block(handle, inode); - if (retval) - goto err_out; + i_data[0] = ei-i_data[EXT4_IND_BLOCK]; + i_data[1] = ei-i_data[EXT4_DIND_BLOCK]; + i_data[2] = ei-i_data[EXT4_TIND_BLOCK]; + down_write(EXT4_I(inode)-i_data_sem); /* * We have the extent map build with the tmp inode. * Now copy the i_data across @@ -338,8 +345,12 @@ static int ext4_ext_swap_inode_data(handle_t *handle, struct inode *inode, inode-i_blocks += tmp_inode-i_blocks; spin_unlock(inode-i_lock); up_write(EXT4_I(inode)-i_data_sem); - ext4_mark_inode_dirty(handle, inode); + + /* Now free the indirec block of the inode */ + retval = free_ind_block(handle, i_data); + if (retval) + goto err_out; err_out: return retval; } @@ -367,6 +378,7 @@ static int free_ext_idx(handle_t *handle, struct inode *inode, } } put_bh(bh); + extend_blkdelete_credit(handle, inode); ext4_free_blocks(handle, inode, block, 1, 1); return
Re: [PATCH] Fix ext4 bitops
On Mon, Feb 04, 2008 at 10:24:36AM +0100, Heiko Carstens wrote: | fs/ext4/mballoc.c: In function 'ext4_mb_generate_buddy': | fs/ext4/mballoc.c:954: error: implicit declaration of function 'generic_find_next_le_bit' The s390 specific bitops uses parts of the generic implementation. Include the correct header. That doesn't work: fs/built-in.o: In function `ext4_mb_release_inode_pa': mballoc.c:(.text+0x95a8a): undefined reference to `generic_find_next_le_bit' fs/built-in.o: In function `ext4_mb_init_cache': mballoc.c:(.text+0x967ea): undefined reference to `generic_find_next_le_bit' This still needs generic_find_next_le_bit which comes from lib/find_next_bit.c. That one doesn't get built on s390 since we don't set GENERIC_FIND_NEXT_BIT. Currently we have the lengthly patch below queued. Similar issue on m68k. As Bastian also saw it on powerpc, I'm getting the impression the ext4 people don't (compile) test on big endian machines? Gr{oetje,eeting}s, I have sent this patches to linux-arch expecting a review from different arch people. It is true that the patches are tested only on powerpc, x86-64, x86. That's the primary reason of me sending the patches to linux-arch. Is there anything special I need to do so the ext4 code actually uses ext2_find_next_bit() ? Haven't looked at the ext4 code, but I'd like to test if the s390 implementation is ok. With the latest linus kernel in git you can test it by mounting ext4 mount -t ext4dev device mntpoint -aneesh - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Fw: [Bugme-new] [Bug 9849] New: NULL pointer deref in journal_wait_on_commit_record
On Wed, Jan 30, 2008 at 03:17:57PM -0800, Mingming Cao wrote: The buufer head pointer passed to journal_wait_on_commit_record() could be NULL if the previous journal_submit_commit_record() failed or journal has already aborted. Looking at the jbd2 debug messages, before the oops happen, the jbd2 is aborted due to trying to access the next log block beyond the end of device. This might be caused by using a corrupted image. We need to check the error returns from journal_submit_commit_record() and avoid calling journal_wait_on_commit_record() in the failure case. Signed-off-by: Mingming Cao [EMAIL PROTECTED] The buufer head pointer passed to journal_wait_on_commit_record() could be NULL if the previous journal_submit_commit_record() failed or journal has already aborted. We need to check the error returns from journal_submit_commit_record() and avoid calling journal_wait_on_commit_record() in the failure case. Signed-off-by: Mingming Cao [EMAIL PROTECTED] --- fs/jbd2/commit.c |3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) Index: linux-2.6.24-rc8/fs/jbd2/commit.c === --- linux-2.6.24-rc8.orig/fs/jbd2/commit.c2008-01-30 14:12:10.0 -0800 +++ linux-2.6.24-rc8/fs/jbd2/commit.c 2008-01-30 15:09:50.0 -0800 @@ -872,7 +872,8 @@ wait_for_iobuf: if (err) __jbd2_journal_abort_hard(journal); } - err = journal_wait_on_commit_record(cbh); + if (!err !is_journal_aborted(journal)) + err = journal_wait_on_commit_record(cbh); if (err) jbd2_journal_abort(journal, err); Needs the below small change also. I don't see this patch in the patch queue. So i guess we can add the below diff to the same. The change was suggested by Girish. Before journal checksum changes sync_dirty_buffer did the get_bh. Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c index da8d0eb..2b88ab0 100644 --- a/fs/jbd2/commit.c +++ b/fs/jbd2/commit.c @@ -136,7 +136,7 @@ static int journal_submit_commit_record(journal_t *journal, JBUFFER_TRACE(descriptor, submit commit block); lock_buffer(bh); - + get_bh(bh); set_buffer_dirty(bh); set_buffer_uptodate(bh); bh-b_end_io = journal_end_buffer_io_sync; - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
jbd2_handle and i_data_sem circular locking dependency detected
Hi, This is with the new ext3 - ext4 migrate code added. The recently added lockdep for jbd2 helped to find this out. We want to hold the i_data_sem on the ext3 inode during migration to prevent walking the ext3 inode when it is being converted to ext4 format. Also we want to avoid file truncation and new blocks being added while converting to ext4. Also we dont want to reserve large number of credits for journal. Any idea how to fix this ? -aneesh === [ INFO: possible circular locking dependency detected ] 2.6.24-rc8 #6 --- rm/2401 is trying to acquire lock: (ei-i_data_sem){}, at: [c01dca58] ext4_get_blocks_wrap+0x21/0x108 but task is already holding lock: (jbd2_handle){--..}, at: [c01fc4a7] jbd2_journal_start+0xd2/0xff which lock already depends on the new lock. the existing dependency chain (in reverse order) is: - #1 (jbd2_handle){--..}: [c0143a5c] __lock_acquire+0xa31/0xc1a [c0143cbf] lock_acquire+0x7a/0x94 [c01fc4ca] jbd2_journal_start+0xf5/0xff [c01e3539] ext4_journal_start_sb+0x48/0x4a [c01eb980] ext4_ext_migrate+0x7d/0x535 [c01df328] ext4_ioctl+0x528/0x56c [c0177700] do_ioctl+0x50/0x67 [c017794e] vfs_ioctl+0x237/0x24a [c0177992] sys_ioctl+0x31/0x4b [c0104f8a] sysenter_past_esp+0x5f/0xa5 [] 0x - #0 (ei-i_data_sem){}: [c014394c] __lock_acquire+0x921/0xc1a [c0143cbf] lock_acquire+0x7a/0x94 [c044f247] down_read+0x42/0x79 [c01dca58] ext4_get_blocks_wrap+0x21/0x108 [c01dcba1] ext4_getblk+0x62/0x1c4 [c01e0de9] ext4_find_entry+0x350/0x5b7 [c01e200c] ext4_unlink+0x6e/0x1a4 [c017449e] vfs_unlink+0x49/0x89 [c0175f02] do_unlinkat+0x96/0x12c [c0175fa8] sys_unlink+0x10/0x12 [c0104f8a] sysenter_past_esp+0x5f/0xa5 [] 0x other info that might help us debug this: 3 locks held by rm/2401: #0: (type-i_mutex_dir_key#5/1){--..}, at: [c0175eca] do_unlinkat+0x5e/0x12c #1: (sb-s_type-i_mutex_key#8){--..}, at: [c017448b] vfs_unlink+0x36/0x89 #2: (jbd2_handle){--..}, at: [c01fc4a7] jbd2_journal_start+0xd2/0xff stack backtrace: Pid: 2401, comm: rm Not tainted 2.6.24-rc8 #6 [c0106017] show_trace_log_lvl+0x1a/0x2f [c0106893] show_trace+0x12/0x14 [c0106b89] dump_stack+0x6c/0x72 [c0141b26] print_circular_bug_tail+0x5f/0x68 [c014394c] __lock_acquire+0x921/0xc1a [c0143cbf] lock_acquire+0x7a/0x94 [c044f247] down_read+0x42/0x79 [c01dca58] ext4_get_blocks_wrap+0x21/0x108 [c01dcba1] ext4_getblk+0x62/0x1c4 [c01e0de9] ext4_find_entry+0x350/0x5b7 [c01e200c] ext4_unlink+0x6e/0x1a4 [c017449e] vfs_unlink+0x49/0x89 [c0175f02] do_unlinkat+0x96/0x12c [c0175fa8] sys_unlink+0x10/0x12 [c0104f8a] sysenter_past_esp+0x5f/0xa5 - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: jbd2_handle and i_data_sem circular locking dependency detected
On Mon, Feb 04, 2008 at 05:31:56PM +0100, Jan Kara wrote: Hi, On Mon 04-02-08 15:42:28, Aneesh Kumar K.V wrote: This is with the new ext3 - ext4 migrate code added. The recently added lockdep for jbd2 helped to find this out. We want to hold the i_data_sem on the ext3 inode during migration to prevent walking the ext3 inode when it is being converted to ext4 format. Also we want to avoid file truncation and new blocks being added while converting to ext4. Also we dont want to reserve large number of credits for journal. Any idea how to fix this ? Hmm, while briefly looking at the code - why do you introduce i_data_sem and not use i_alloc_sem which is already in VFS inode? That is aimed exactly at the serialization of truncates, writes and similar users. How about read ? We are changing the format of inode. We don't want even the read to go through. That doesn't solve problems with lock ordering but I was just wondering... Another problem - ext4_fallocate() has the same lock ordering problem as the migration code and maybe there are others... I will look at the same when fixing this. One (stupid) solution to your problem is to make i_data_sem be always locked before the transaction is started. It could possibly have negative performance impact because you'd have to hold the semaphore for a longer time and thus a writer would block readers for longer time. So one would have to measure how big difference that would make. Another possibility is to start a single transaction for migration and extend it as long as you can (as truncate does it). And when you can't extend any more, you drop the i_data_sem and start a new transaction and acquire the semaphore again. This has the disadvantage that after dropping the semaphore you have to resync your original inode with the temporary one your are building which probably ends up being ugly as night... Hmm, but maybe we could get rid of this - hold i_mutex to protect against all writes (that ranks outside of transaction start so you can hold it for the whole migration time - maybe you even hold it if you are called from the write path...). After dropping i_data_sem you let some readers proceed but writers still wait on i_mutex so the file shouldn't change under you (but I suggest adding some BUG_ONs to verify that the file really doesn't change :). A quick look says truncate can happen even when we hold i_mutex ?? But of all this looks like a workable solution. Will try this out. -aneesh - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] ext4: Fix reference counting on buffer head.
With journal checksum patch we added asyn commit of journal commit headers. During the conversion we missed to take a reference on buffer head. Before the change sync_dirty_buffer did the get_bh(). The associative put_bh is done by journal_wait_on_commit_record() Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] --- fs/jbd2/commit.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c index da8d0eb..2b88ab0 100644 --- a/fs/jbd2/commit.c +++ b/fs/jbd2/commit.c @@ -136,7 +136,7 @@ static int journal_submit_commit_record(journal_t *journal, JBUFFER_TRACE(descriptor, submit commit block); lock_buffer(bh); - + get_bh(bh); set_buffer_dirty(bh); set_buffer_uptodate(bh); bh-b_end_io = journal_end_buffer_io_sync; -- 1.5.4.rc3.24.gb53139-dirty - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: jbd2_handle and i_data_sem circular locking dependency detected
On Mon, Feb 04, 2008 at 10:23:16AM -0500, Josef Bacik wrote: On Monday 04 February 2008 5:12:28 am Aneesh Kumar K.V wrote: Hi, This is with the new ext3 - ext4 migrate code added. The recently added lockdep for jbd2 helped to find this out. We want to hold the i_data_sem on the ext3 inode during migration to prevent walking the ext3 inode when it is being converted to ext4 format. Also we want to avoid file truncation and new blocks being added while converting to ext4. Also we dont want to reserve large number of credits for journal. Any idea how to fix this ? Hello, Seems you should be taking the i_data_sem after starting the journal in ext4_ext_migrate. I haven't looked at this stuff too much, but everywhere I see i_data_sem taken its within a journal_start/journal_stop. Here's the patch. Let me know if I'm way off btw, like I said I'm new to this :). Thank you, But that doesn't help. Because we call ext4_journal_restart after taking i_data_sem. That implies we can sleep waiting for other running transaction to commit. And if that transaction (in the above example rm) is waiting for i_data_sem we deadlock. -aneesh - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Fix ext4 bitops
On Sun, Feb 03, 2008 at 01:39:02PM +0100, Geert Uytterhoeven wrote: On Sun, 3 Feb 2008, Heiko Carstens wrote: On Fri, Feb 01, 2008 at 10:04:04PM +0100, Bastian Blank wrote: On Fri, Feb 01, 2008 at 12:22:57PM -0800, Andrew Morton wrote: On Fri, 1 Feb 2008 21:02:08 +0100 Bastian Blank [EMAIL PROTECTED] wrote: Fix ext4 bitops. This is incomplete. Please tell us what was fixed. If it was a build error then please quote the compile error output in the changelog, as well as the usual description of what the problem is, and how it was fixed. | fs/ext4/mballoc.c: In function 'ext4_mb_generate_buddy': | fs/ext4/mballoc.c:954: error: implicit declaration of function 'generic_find_next_le_bit' The s390 specific bitops uses parts of the generic implementation. Include the correct header. That doesn't work: fs/built-in.o: In function `ext4_mb_release_inode_pa': mballoc.c:(.text+0x95a8a): undefined reference to `generic_find_next_le_bit' fs/built-in.o: In function `ext4_mb_init_cache': mballoc.c:(.text+0x967ea): undefined reference to `generic_find_next_le_bit' This still needs generic_find_next_le_bit which comes from lib/find_next_bit.c. That one doesn't get built on s390 since we don't set GENERIC_FIND_NEXT_BIT. Currently we have the lengthly patch below queued. Similar issue on m68k. As Bastian also saw it on powerpc, I'm getting the impression the ext4 people don't (compile) test on big endian machines? Gr{oetje,eeting}s, I have sent this patches to linux-arch expecting a review from different arch people. It is true that the patches are tested only on powerpc, x86-64, x86. That's the primary reason of me sending the patches to linux-arch. http://marc.info/?l=linux-archm=119503501127737w=2 -aneesh - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BUG_ON at mballoc.c:3752
On Thu, Jan 31, 2008 at 04:42:07PM +0100, Eric Sesterhenn wrote: * Eric Sesterhenn ([EMAIL PROTECTED]) wrote: hi, while running a modified version of fsfuzzer i triggered the BUG() in ext4_mb_release_inode_pa(). Sadly I am not able to reproduce this using the generated image, but running the fuzzer will usually trigger this in less than 40 attempts. Increasing the JBD2 Debug level didnt give more information. The kernel is current git with ext4-fix-null-pointer-deref-in-journal_wait_on_commit_record.patch applied. I am now able to reproduce this using this image: http://www.cccmz.de/~snakebyte/ext4.24.img.bz2 the following commands will trigger the oops for me mount cfs/ext4.24.img /media/test -t ext4dev -o extents -o loop mkdir /media/test/stress chown snakebyte:snakebyte /media/test/stress sudo -u snakebyte fstest -n 10 -l 10 -f 5 -s 4 -p /media/test/stress/ The file system is corrupted. The BUG_ON indicate that the free spcae marked in the prealloc space and found by looking at the bitmap are not same. Do you have a set of steps that i can follow to reproduce this ? on a clean file system ? Where do i find the fsfuzzer that you are using ? [EMAIL PROTECTED]:/tmp$/home/opensource/patches/e2fsprogs-1.40.4.cfs1/e2fsck/e2fsck.static -f ./ext4.24.img e2fsck 1.40.4.cfs1 (31-Dec-2007) Superblock has an invalid ext3 journal (inode 8). Cleary? yes *** ext3 journal has been deleted - filesystem is now ext2 only *** Pass 1: Checking inodes, blocks, and sizes Inode 7 has illegal block(s). Cleary? yes Illegal block #552 (9568256) in inode 7. CLEARED. Illegal block #647 (4063232) in inode 7. CLEARED. Illegal block #659 (12517376) in inode 7. CLEARED. Illegal block #766 (2600468480) in inode 7. CLEARED. Illegal block #944 (51200) in inode 7. CLEARED. Illegal block #1135 (2583691264) in inode 7. CLEARED. Illegal block #1214 (15925248) in inode 7. CLEARED. Illegal block #1345 (771751936) in inode 7. CLEARED. Illegal block #1384 (10092544) in inode 7. CLEARED. Illegal block #1416 (14811136) in inode 7. CLEARED. Illegal block #1470 (10420224) in inode 7. CLEARED. Illegal block #1709 (10158080) in inode 7. CLEARED. Too many illegal blocks in inode 7. Clear inodey? yes Journal inode is not in use, but contains data. Cleary? yes Deleted inode 160 has zero dtime. Fixy? yes Deleted inode 257 has zero dtime. Fixy? yes Deleted inode 638 has zero dtime. Fixy? yes Deleted inode 1407 has zero dtime. Fixy? yes Deleted inode 1650 has zero dtime. Fixy? yes Deleted inode 1656 has zero dtime. Fixy? yes Deleted inode 1677 has zero dtime. Fixy? yes Deleted inode 1720 has zero dtime. Fixy? yes Inode 11 has illegal block(s). Cleary? yes Illegal block #249 (15990784) in inode 11. CLEARED. Inode 11, i_size is 12288, should be 33792. Fixy? yes Inode 11, i_blocks is 24, should be 28. Fixy? yes Recreate journal to make the filesystem ext3 again? Fixy? yes -aneesh - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] ext4: Fix Null dereference.
Repoted by Adrian Bunk [EMAIL PROTECTED]: The Coverity checker spotted the following NULL dereference: static int ext4_mb_mark_diskspace_used { ... if (!bitmap_bh) goto out_err; ... out_err: sb-s_dirt = 1; put_bh(bitmap_bh); ... Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] --- fs/ext4/mballoc.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 76e5fed..06d1f52 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -3069,7 +3069,7 @@ static int ext4_mb_mark_diskspace_used(struct ext4_allocation_context *ac, out_err: sb-s_dirt = 1; - put_bh(bitmap_bh); + brelse(bitmap_bh); return err; } -- 1.5.4.rc3.24.gb53139-dirty - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 36/49] ext4: Add EXT4_IOC_MIGRATE ioctl
On Fri, Jan 25, 2008 at 11:15:00PM -0500, Theodore Tso wrote: On Thu, Jan 24, 2008 at 11:25:32AM +0530, Aneesh Kumar K.V wrote: +static int free_ext_idx(handle_t *handle, struct inode *inode, + struct ext4_extent_idx *ix) +{ + int i, retval = 0; + ext4_fsblk_t block; + struct buffer_head *bh; + struct ext4_extent_header *eh; + + block = idx_pblock(ix); + bh = sb_bread(inode-i_sb, block); + if (!bh) + return -EIO; + + eh = (struct ext4_extent_header *)bh-b_data; + if (eh-eh_depth == 0) { + brelse(bh); + ext4_free_blocks(handle, inode, block, 1); + } else { + ix = EXT_FIRST_INDEX(eh); + for (i = 0; i le16_to_cpu(eh-eh_entries); i++, ix++) { + retval = free_ext_idx(handle, inode, ix); + if (retval) + return retval; + } + } + return retval; +} Aneesh, looks like if eh-eh_depth is != 0, bh gets leaked. This is how I plan to fix it up: +static int free_ext_idx(handle_t *handle, struct inode *inode, + struct ext4_extent_idx *ix) +{ + int i, retval = 0; + ext4_fsblk_t block; + struct buffer_head *bh; + struct ext4_extent_header *eh; + + block = idx_pblock(ix); + bh = sb_bread(inode-i_sb, block); + if (!bh) + return -EIO; + + eh = (struct ext4_extent_header *)bh-b_data; + if (eh-eh_depth == 0) + ext4_free_blocks(handle, inode, block, 1); + else { + ix = EXT_FIRST_INDEX(eh); + for (i = 0; i le16_to_cpu(eh-eh_entries); i++, ix++) { + retval = free_ext_idx(handle, inode, ix); + if (retval) + break; + } + } + put_bh(bh); We need to mark the index block as free. via ext4_free_blocks(handle, inode, block, 1); I remember making this change. May be it was related to dind/tind blocks. -aneesh - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 41/49] ext4: Add multi block allocator for ext4
updated patch. Waiting for the test results. I am only attaching the diff. Mballoc patch is really large. -aneesh diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt index 4f329af..ec7d349 100644 --- a/Documentation/filesystems/ext4.txt +++ b/Documentation/filesystems/ext4.txt @@ -89,6 +89,8 @@ When mounting an ext4 filesystem, the following option are accepted: extentsext4 will use extents to address file data. The file system will no longer be mountable by ext3. +noextents ext4 will not use extents for new files created. + journal_checksum Enable checksumming of the journal transactions. This will allow the recovery code in e2fsck and the kernel to detect corruption in the kernel. It is a @@ -206,6 +208,10 @@ nobh (a) cache disk block mapping information nobh option tries to avoid associating buffer heads (supported only for writeback mode). +mballoc(*) Use the mutliblock allocator for block allocation +nomballoc disabled multiblock allocator for block allocation. +stripe=n filesystem blocks per stripe for a RAID configuration. + Data Mode - diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index dec9945..4413a2d 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -857,6 +857,45 @@ CPUs. The procs_blocked line gives the number of processes currently blocked, waiting for I/O to complete. +1.9 Ext4 file system parameters +-- +Ext4 file system have one directory per partition under /proc/fs/ext4/ +# ls /proc/fs/ext4/hdc/ +group_prealloc max_to_scan mb_groups mb_history min_to_scan order2_req +stats stream_req + +mb_groups: +This file gives the details of mutiblock allocator buddy cache of free blocks + +mb_history: +Multiblock allocation history. + +stats: +This file indicate whether the multiblock allocator should start collecting +statistics. The statistics are shown during unmount + +group_prealloc: +The multiblock allocator normalize the block allocation request to +group_prealloc filesystem blocks if we don't have strip value set. +The stripe value can be specified at mount time or during mke2fs. + +max_to_scan: +How long multiblock allocator can look for a best extent (in found extents) + +min_to_scan: +How long multiblock allocator must look for a best extent + +order2_req: +Multiblock allocator use 2^N search using buddies only for requests greater +than or equal to order2_req. The request size is specfied in file system +blocks. A value of 2 indicate only if the requests are greater than or equal +to 4 blocks. + +stream_req: +Files smaller than stream_req are served by the stream allocator, whose +purpose is to pack requests as close each to other as possible to +produce smooth I/O traffic. Avalue of 16 indicate that file smaller than 16 +filesystem block size will use group based preallocation. -- Summary diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 0398aa0..310bad6 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -489,7 +489,7 @@ struct ext4_free_extent { */ struct ext4_locality_group { /* for allocator */ - struct semaphorelg_sem; /* to serialize allocates */ + struct mutexlg_sem; /* to serialize allocates */ struct list_headlg_prealloc_list;/* list of preallocations */ spinlock_t lg_prealloc_lock; }; @@ -563,7 +563,10 @@ struct ext4_buddy { #define EXT4_MB_BUDDY(e4b) ((e4b)-bd_buddy) #ifndef EXT4_MB_HISTORY -#define ext4_mb_store_history(ac) +static inline void ext4_mb_store_history(struct ext4_allocation_context *ac) +{ + return; +} #else static void ext4_mb_store_history(struct ext4_allocation_context *ac); #endif @@ -641,6 +644,10 @@ static ext4_fsblk_t ext4_grp_offs_to_block(struct super_block *sb, static inline int mb_test_bit(int bit, void *addr) { + /* +* ext4_test_bit on architecture like powerpc +* needs unsigned long aligned address +*/ mb_correct_addr_and_bit(bit, addr); return ext4_test_bit(bit, addr); } @@ -669,7 +676,7 @@ static inline void mb_clear_bit_atomic(spinlock_t *lock, int bit, void *addr) ext4_clear_bit_atomic(lock, bit, addr); } -static inline void *mb_find_buddy(struct ext4_buddy *e4b, int order, int *max) +static void *mb_find_buddy(struct ext4_buddy *e4b, int order, int *max) { char *bb; @@ -752,9 +759,20 @@ static void mb_cmp_bitmaps(struct ext4_buddy *e4b, void *bitmap) } #else -#define mb_free_blocks_double(a, b, c, d) -#define mb_mark_used_double(a, b, c) -#define
Patch queue update
I have updated patches based on the review feedback from Andrew. I have tested this on 128(64p) ppc64 sles 4(2p)ppc64 debian 4(2p)x86_64 ubuntu-gutsy Updated patches are at http://www.radian.org/~kvaneesh/ext4/jan-24-2008/ http://www.radian.org/~kvaneesh/ext4/jan-24-2008/patches.tar Diff for reference diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt index 4f329af..ec7d349 100644 --- a/Documentation/filesystems/ext4.txt +++ b/Documentation/filesystems/ext4.txt @@ -89,6 +89,8 @@ When mounting an ext4 filesystem, the following option are accepted: extentsext4 will use extents to address file data. The file system will no longer be mountable by ext3. +noextents ext4 will not use extents for new files created. + journal_checksum Enable checksumming of the journal transactions. This will allow the recovery code in e2fsck and the kernel to detect corruption in the kernel. It is a @@ -206,6 +208,10 @@ nobh (a) cache disk block mapping information nobh option tries to avoid associating buffer heads (supported only for writeback mode). +mballoc(*) Use the mutliblock allocator for block allocation +nomballoc disabled multiblock allocator for block allocation. +stripe=n filesystem blocks per stripe for a RAID configuration. + Data Mode - diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index dec9945..4413a2d 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -857,6 +857,45 @@ CPUs. The procs_blocked line gives the number of processes currently blocked, waiting for I/O to complete. +1.9 Ext4 file system parameters +-- +Ext4 file system have one directory per partition under /proc/fs/ext4/ +# ls /proc/fs/ext4/hdc/ +group_prealloc max_to_scan mb_groups mb_history min_to_scan order2_req +stats stream_req + +mb_groups: +This file gives the details of mutiblock allocator buddy cache of free blocks + +mb_history: +Multiblock allocation history. + +stats: +This file indicate whether the multiblock allocator should start collecting +statistics. The statistics are shown during unmount + +group_prealloc: +The multiblock allocator normalize the block allocation request to +group_prealloc filesystem blocks if we don't have strip value set. +The stripe value can be specified at mount time or during mke2fs. + +max_to_scan: +How long multiblock allocator can look for a best extent (in found extents) + +min_to_scan: +How long multiblock allocator must look for a best extent + +order2_req: +Multiblock allocator use 2^N search using buddies only for requests greater +than or equal to order2_req. The request size is specfied in file system +blocks. A value of 2 indicate only if the requests are greater than or equal +to 4 blocks. + +stream_req: +Files smaller than stream_req are served by the stream allocator, whose +purpose is to pack requests as close each to other as possible to +produce smooth I/O traffic. Avalue of 16 indicate that file smaller than 16 +filesystem block size will use group based preallocation. -- Summary diff --git a/fs/buffer.c b/fs/buffer.c index 982cf1a..921eeec 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -3232,19 +3232,21 @@ int bh_uptodate_or_lock(struct buffer_head *bh) return 1; } EXPORT_SYMBOL(bh_uptodate_or_lock); + /** * bh_submit_read: Submit a locked buffer for reading * @bh: struct buffer_head * - * Returns a negative error + * Returns zero on success and -EIO on error. */ int bh_submit_read(struct buffer_head *bh) { - if (!buffer_locked(bh)) - lock_buffer(bh); + BUG_ON(!buffer_locked(bh)); - if (buffer_uptodate(bh)) + if (buffer_uptodate(bh)) { + unlock_buffer(bh); return 0; + } get_bh(bh); bh-b_end_io = end_buffer_read_sync; @@ -3255,6 +3257,7 @@ int bh_submit_read(struct buffer_head *bh) return -EIO; } EXPORT_SYMBOL(bh_submit_read); + void __init buffer_init(void) { int nrpages; diff --git a/fs/ext4/defrag.c b/fs/ext4/defrag.c index 4ef3dc0..0d76c74 100644 --- a/fs/ext4/defrag.c +++ b/fs/ext4/defrag.c @@ -30,14 +30,6 @@ ext4_fsblk_t idx_pblock(struct ext4_extent_idx *ix) return block; } -/* Will go away */ -static void ext4_ext_store_pblock(struct ext4_extent *ex, ext4_fsblk_t pb) -{ - ex-ee_start_lo = cpu_to_le32((unsigned long) (pb 0x)); - ex-ee_start_hi = - cpu_to_le16((unsigned long) ((pb 31) 1) 0x); -} - /* * this structure is used to gather extents from the tree via ioctl */ diff --git
Re: [PATCH 41/49] ext4: Add multi block allocator for ext4
On Thu, Jan 24, 2008 at 01:26:14PM +0530, Aneesh Kumar K.V wrote: +/* find most significant bit */ +static int fmsb(unsigned short word) +{ + int order; + + if (word 255) { + order = 7; + word = 8; + } else { + order = -1; + } + + do { + order++; + word = 1; + } while (word != 0); + + return order; +} Did we just reinvent fls()? replaced by fls. That should be fls() - 1; The full patch is at http://www.radian.org/~kvaneesh/ext4/jan-24-2008/mballoc-core.patch The patch is too big to inline. -aneesh - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
GFP_ usage in kmalloc with mballoc
Hi, I looked at the GFP flag usage in mballoc. I think the below change make sense with respect to mballoc. First hunk is memory allocation during ext4_mb_init which is called during mount time. I guess it is ok to convert that to GFP_KERNEL. Second hunk is during ext4_mb_free_metadata. I guess it should be GFP_NOFS. This is called during ext4_free_blocks. diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index bec699a..a15f18e 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -2662,7 +2662,7 @@ int ext4_mb_init(struct super_block *sb, int needs_recovery) sbi-s_mb_group_prealloc = MB_DEFAULT_GROUP_PREALLOC; i = sizeof(struct ext4_locality_group) * NR_CPUS; - sbi-s_locality_groups = kmalloc(i, GFP_NOFS); + sbi-s_locality_groups = kmalloc(i, GFP_KERNEL); if (sbi-s_locality_groups == NULL) { clear_opt(sbi-s_mount_opt, MBALLOC); kfree(sbi-s_mb_offsets); @@ -4366,7 +4366,7 @@ static int ext4_mb_free_metadata(handle_t *handle, struct ext4_buddy *e4b, if (md == NULL) { ext4_unlock_group(sb, group); - md = kmalloc(sizeof(*md), GFP_KERNEL); + md = kmalloc(sizeof(*md), GFP_NOFS); if (md == NULL) return -ENOMEM; md-num = 0; - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 23/49] Add buffer head related helper functions
On Wed, Jan 23, 2008 at 02:06:48PM -0800, Andrew Morton wrote: On Mon, 21 Jan 2008 22:02:02 -0500 Theodore Ts'o [EMAIL PROTECTED] wrote: +} +EXPORT_SYMBOL(bh_uptodate_or_lock); +/** Missing newline. + * bh_submit_read: Submit a locked buffer for reading + * @bh: struct buffer_head + * + * Returns a negative error + */ +int bh_submit_read(struct buffer_head *bh) +{ + if (!buffer_locked(bh)) + lock_buffer(bh); + + if (buffer_uptodate(bh)) + return 0; Here it can lock the buffer then return zero + get_bh(bh); + bh-b_end_io = end_buffer_read_sync; + submit_bh(READ, bh); + wait_on_buffer(bh); + if (buffer_uptodate(bh)) + return 0; Here it will unlock the buffer and return zero. This function is unusable when passed an unlocked buffer. Updated patch below. commit 70d4ca32604e0935a8b9a49c5ac8b9c64c810693 Author: Aneesh Kumar K.V [EMAIL PROTECTED] Date: Thu Jan 24 10:50:24 2008 +0530 Add buffer head related helper functions Add buffer head related helper function bh_uptodate_or_lock and bh_submit_read which can be used by file system Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] diff --git a/fs/buffer.c b/fs/buffer.c index 7249e01..82aa2db 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -3213,6 +3213,53 @@ static int buffer_cpu_notify(struct notifier_block *self, return NOTIFY_OK; } +/** + * bh_uptodate_or_lock: Test whether the buffer is uptodate + * @bh: struct buffer_head + * + * Return true if the buffer is up-to-date and false, + * with the buffer locked, if not. + */ +int bh_uptodate_or_lock(struct buffer_head *bh) +{ + if (!buffer_uptodate(bh)) { + lock_buffer(bh); + if (!buffer_uptodate(bh)) + return 0; + unlock_buffer(bh); + } + return 1; +} +EXPORT_SYMBOL(bh_uptodate_or_lock); + +/** + * bh_submit_read: Submit a locked buffer for reading + * @bh: struct buffer_head + * + * Returns zero on success and -EIO on error.If the input + * buffer is not locked returns -EINVAL + * + */ +int bh_submit_read(struct buffer_head *bh) +{ + if (!buffer_locked(bh)) + return -EINVAL; + + if (buffer_uptodate(bh)) { + unlock_buffer(bh); + return 0; + } + + get_bh(bh); + bh-b_end_io = end_buffer_read_sync; + submit_bh(READ, bh); + wait_on_buffer(bh); + if (buffer_uptodate(bh)) + return 0; + return -EIO; +} +EXPORT_SYMBOL(bh_submit_read); + void __init buffer_init(void) { int nrpages; diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h index da0d83f..e98801f 100644 --- a/include/linux/buffer_head.h +++ b/include/linux/buffer_head.h @@ -192,6 +192,8 @@ int sync_dirty_buffer(struct buffer_head *bh); int submit_bh(int, struct buffer_head *); void write_boundary_block(struct block_device *bdev, sector_t bblock, unsigned blocksize); +int bh_uptodate_or_lock(struct buffer_head *bh); +int bh_submit_read(struct buffer_head *bh); extern int buffer_heads_over_limit; - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 41/49] ext4: Add multi block allocator for ext4
On Wed, Jan 23, 2008 at 02:07:27PM -0800, Andrew Morton wrote: On Mon, 21 Jan 2008 22:02:20 -0500 Theodore Ts'o [EMAIL PROTECTED] wrote: From: Alex Tomas [EMAIL PROTECTED] Signed-off-by: Alex Tomas [EMAIL PROTECTED] Signed-off-by: Andreas Dilger [EMAIL PROTECTED] Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] Signed-off-by: Eric Sandeen [EMAIL PROTECTED] Signed-off-by: Theodore Ts'o [EMAIL PROTECTED] ... +#if BITS_PER_LONG == 64 +#define mb_correct_addr_and_bit(bit, addr) \ +{ \ + bit += ((unsigned long) addr 7UL) 3; \ + addr = (void *) ((unsigned long) addr ~7UL); \ +} +#elif BITS_PER_LONG == 32 +#define mb_correct_addr_and_bit(bit, addr) \ +{ \ + bit += ((unsigned long) addr 3UL) 3; \ + addr = (void *) ((unsigned long) addr ~3UL); \ +} +#else +#error how many bits you are?! +#endif Why do these exist? Initial version on mballoc supported on x86 32 this was there to give compile warning on 64 bit platform. I guess we can remove that now. Or may be we can keep it as such because it is harmless. +static inline int mb_test_bit(int bit, void *addr) +{ + mb_correct_addr_and_bit(bit, addr); + return ext4_test_bit(bit, addr); +} ext2_test_bit() already handles bitnum wordsize. If mb_correct_addr_and_bit() is actually needed then some suitable comment would help. ext4_test_bit on powerpc needs the addr to be 8 byte aligned. Othewise it fails +static inline void mb_set_bit(int bit, void *addr) +{ + mb_correct_addr_and_bit(bit, addr); + ext4_set_bit(bit, addr); +} + +static inline void mb_set_bit_atomic(spinlock_t *lock, int bit, void *addr) +{ + mb_correct_addr_and_bit(bit, addr); + ext4_set_bit_atomic(lock, bit, addr); +} + +static inline void mb_clear_bit(int bit, void *addr) +{ + mb_correct_addr_and_bit(bit, addr); + ext4_clear_bit(bit, addr); +} + +static inline void mb_clear_bit_atomic(spinlock_t *lock, int bit, void *addr) +{ + mb_correct_addr_and_bit(bit, addr); + ext4_clear_bit_atomic(lock, bit, addr); +} + +static inline void *mb_find_buddy(struct ext4_buddy *e4b, int order, int *max) uninlining this will save about eighty squigabytes of text. Fixed Please review all of ext4/jbd2 with a view to removig unnecessary and wrong inlings. +{ + char *bb; + + /* FIXME!! is this needed */ + BUG_ON(EXT4_MB_BITMAP(e4b) == EXT4_MB_BUDDY(e4b)); + BUG_ON(max == NULL); + + if (order e4b-bd_blkbits + 1) { + *max = 0; + return NULL; + } + + /* at order 0 we see each particular block */ + *max = 1 (e4b-bd_blkbits + 3); + if (order == 0) + return EXT4_MB_BITMAP(e4b); + + bb = EXT4_MB_BUDDY(e4b) + EXT4_SB(e4b-bd_sb)-s_mb_offsets[order]; + *max = EXT4_SB(e4b-bd_sb)-s_mb_maxs[order]; + + return bb; +} + ... +#else +#define mb_free_blocks_double(a, b, c, d) +#define mb_mark_used_double(a, b, c) +#define mb_cmp_bitmaps(a, b) +#endif Please use the do{}while(0) thing. Or, better, proper C functions which have typechecking (unless this will cause undefined-var compile errors, which happens sometimes) makde static inline void. +/* find most significant bit */ +static int fmsb(unsigned short word) +{ + int order; + + if (word 255) { + order = 7; + word = 8; + } else { + order = -1; + } + + do { + order++; + word = 1; + } while (word != 0); + + return order; +} Did we just reinvent fls()? replaced by fls. +/* FIXME!! need more doc */ +static void ext4_mb_mark_free_simple(struct super_block *sb, + void *buddy, unsigned first, int len, + struct ext4_group_info *grp) +{ + struct ext4_sb_info *sbi = EXT4_SB(sb); + unsigned short min; + unsigned short max; + unsigned short chunk; + unsigned short border; + + BUG_ON(len = EXT4_BLOCKS_PER_GROUP(sb)); + + border = 2 sb-s_blocksize_bits; Won't this explode with = 32k blocksize? + while (len 0) { + /* find how many blocks can be covered since this position */ + max = ffs(first | border) - 1; + + /* find how many blocks of power 2 we need to mark */ + min = fmsb(len); + + if (max min) + min = max; + chunk = 1 min; + + /* mark multiblock chunks only */ + grp-bb_counters[min]++; + if (min 0) + mb_clear_bit(first min, +buddy + sbi-s_mb_offsets[min]); + + len -= chunk; + first += chunk; + } +} + ... +static
Re: [PATCH] Fix oops in mballoc caused by a variable overflow
On Thu, Jan 17, 2008 at 10:43:40AM +0100, Valerie Clement wrote: Aneesh Kumar K.V wrote: What about this ? I guess we will overflow start = start bsbits; Hi Aneesh, your patch below doesn't fix the issue, because as start_off is also loff_t, start_off = ac-ac_o_ex.fe_logical bsbits also overflows. loff_t is 64 bits. typedef long long __kernel_loff_t; typedef __u32 ext4_lblk_t; -aneesh - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Fix oops in mballoc caused by a variable overflow
On Thu, Jan 17, 2008 at 10:43:40AM +0100, Valerie Clement wrote: Aneesh Kumar K.V wrote: What about this ? I guess we will overflow start = start bsbits; Hi Aneesh, your patch below doesn't fix the issue, because as start_off is also loff_t, start_off = ac-ac_o_ex.fe_logical bsbits also overflows. loff_t is 64 bits. typedef __kernel_loff_t loff_t; typedef long long __kernel_loff_t; typedef __u32 ext4_lblk_t; typedef unsigned long long ext4_fsblk_t start_off = ac-ac_o_ex.fe_logical bsbits; In the above line what we are storing in start_off is the offset in bytes.So it makes sense to use the type loff_t. It is neither logical block nor physical block. -aneesh - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Fix oops in mballoc caused by a variable overflow
On Thu, Jan 17, 2008 at 02:09:41PM +0100, Valerie Clement wrote: Aneesh Kumar K.V wrote: On Thu, Jan 17, 2008 at 10:43:40AM +0100, Valerie Clement wrote: Aneesh Kumar K.V wrote: What about this ? I guess we will overflow start = start bsbits; Hi Aneesh, your patch below doesn't fix the issue, because as start_off is also loff_t, start_off = ac-ac_o_ex.fe_logical bsbits also overflows. loff_t is 64 bits. typedef __kernel_loff_t loff_t; typedef long long __kernel_loff_t; typedef __u32 ext4_lblk_t; typedef unsigned long long ext4_fsblk_t start_off = ac-ac_o_ex.fe_logical bsbits; In the above line what we are storing in start_off is the offset in bytes.So it makes sense to use the type loff_t. It is neither logical block nor physical block. Oh yes, sorry, you're right. I read too quickly. In fact, it's missing a cast : start_off = (loff_t) ac-ac_o_ex.fe_logical bsbits; With that change, the test is ok. Updated patch below. -aneesh ext4: Fix overflow in ext4_mb_normalize_request From: Aneesh Kumar K.V [EMAIL PROTECTED] kernel BUG at fs/ext4/mballoc.c:3148! The BUG_ON is: BUG_ON(size = 0 || size = EXT4_BLOCKS_PER_GROUP(ac-ac_sb)); where the value of size is 4293920768. This is due to the overflow of the variable start in the ext4_mb_normalize_request() function. Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] --- fs/ext4/mballoc.c | 24 +++- 1 files changed, 11 insertions(+), 13 deletions(-) diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index d8cd81e..d16083c 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -2998,7 +2998,7 @@ static void ext4_mb_normalize_request(struct ext4_allocation_context *ac, int bsbits, max; ext4_lblk_t end; struct list_head *cur; - loff_t size, orig_size; + loff_t size, orig_size, start_off; ext4_lblk_t start, orig_start; struct ext4_inode_info *ei = EXT4_I(ac-ac_inode); @@ -3039,7 +3039,7 @@ static void ext4_mb_normalize_request(struct ext4_allocation_context *ac, /* first, try to predict filesize */ /* XXX: should this table be tunable? */ - start = 0; + start_off = 0; if (size = 16 * 1024) { size = 16 * 1024; } else if (size = 32 * 1024) { @@ -3055,26 +3055,24 @@ static void ext4_mb_normalize_request(struct ext4_allocation_context *ac, } else if (size = 1024 * 1024) { size = 1024 * 1024; } else if (NRL_CHECK_SIZE(size, 4 * 1024 * 1024, max, bsbits)) { - start = ac-ac_o_ex.fe_logical bsbits; - start = (start / (1024 * 1024)) * (1024 * 1024); + start_off = ((loff_t)ac-ac_o_ex.fe_logical + (20 - bsbits)) 20; size = 1024 * 1024; } else if (NRL_CHECK_SIZE(size, 8 * 1024 * 1024, max, bsbits)) { - start = ac-ac_o_ex.fe_logical bsbits; - start = (start / (4 * (1024 * 1024))) * 4 * (1024 * 1024); + start_off = ((loff_t)ac-ac_o_ex.fe_logical + (22 - bsbits)) 22; size = 4 * 1024 * 1024; } else if (NRL_CHECK_SIZE(ac-ac_o_ex.fe_len, (820)bsbits, max, bsbits)) { - start = ac-ac_o_ex.fe_logical; - start = start bsbits; - start = (start / (8 * (1024 * 1024))) * 8 * (1024 * 1024); + start_off = ((loff_t)ac-ac_o_ex.fe_logical + (23 - bsbits)) 23; size = 8 * 1024 * 1024; } else { - start = ac-ac_o_ex.fe_logical; - start = start bsbits; - size = ac-ac_o_ex.fe_len bsbits; + start_off = (loff_t)ac-ac_o_ex.fe_logical bsbits; + size = ac-ac_o_ex.fe_len bsbits; } orig_size = size = size bsbits; - orig_start = start = start bsbits; + orig_start = start = start_off bsbits; /* don't cover already allocated blocks in selected range */ if (ar-pleft start = ar-lleft) {
Re: [PATCH] Fix oops in mballoc caused by a variable overflow
On Wed, Jan 16, 2008 at 10:48:27AM -0800, Mingming Cao wrote: On Wed, 2008-01-16 at 20:11 +0100, Valerie Clement wrote: A simple dd oopses the kernel (2.6.24-rc7 with the latest patch queue): dd if=/dev/zero of=/mnt/test/foo bs=1M count=8096 EXT4-fs: mballoc enabled [ cut here ] kernel BUG at fs/ext4/mballoc.c:3148! The BUG_ON is: BUG_ON(size = 0 || size = EXT4_BLOCKS_PER_GROUP(ac-ac_sb)); where the value of size is 4293920768. This is due to the overflow of the variable start in the ext4_mb_normalize_request() function. The patch below fixes it. Thanks! Signed-off-by: Valerie Clement [EMAIL PROTECTED] --- mballoc.c | 23 --- 1 file changed, 12 insertions(+), 11 deletions(-) Index: linux-2.6.24-rc7/fs/ext4/mballoc.c === --- linux-2.6.24-rc7.orig/fs/ext4/mballoc.c 2008-01-16 19:22:45.0 +0100 +++ linux-2.6.24-rc7/fs/ext4/mballoc.c 2008-01-16 19:25:04.0 +0100 @@ -2990,6 +2990,7 @@ static void ext4_mb_normalize_request(st struct list_head *cur; loff_t size, orig_size; ext4_lblk_t start, orig_start; + ext4_fsblk_t pstart; ext4_fsblk_t is used for fs physical block number, here I think pstart is pointing to some logical block location.. struct ext4_inode_info *ei = EXT4_I(ac-ac_inode); /* do normalize only data requests, metadata requests @@ -3029,7 +3030,7 @@ static void ext4_mb_normalize_request(st /* first, try to predict filesize */ /* XXX: should this table be tunable? */ - start = 0; + pstart = 0; if (size = 16 * 1024) { size = 16 * 1024; } else if (size = 32 * 1024) { @@ -3045,25 +3046,25 @@ static void ext4_mb_normalize_request(st } else if (size = 1024 * 1024) { size = 1024 * 1024; } else if (NRL_CHECK_SIZE(size, 4 * 1024 * 1024, max, bsbits)) { - start = ac-ac_o_ex.fe_logical bsbits; - start = (start / (1024 * 1024)) * (1024 * 1024); + pstart = ac-ac_o_ex.fe_logical bsbits; + pstart = (pstart / (1024 * 1024)) * (1024 * 1024); How about using shift... - start = ac-ac_o_ex.fe_logical bsbits; - start = (start / (1024 * 1024)) * (1024 * 1024); + start = (ac-ac_o_ex.fe_logical (20-bsbits)) 20; That would be more efficient and should fix the overflow issue size = 1024 * 1024; } else if (NRL_CHECK_SIZE(size, 8 * 1024 * 1024, max, bsbits)) { - start = ac-ac_o_ex.fe_logical bsbits; - start = (start / (4 * (1024 * 1024))) * 4 * (1024 * 1024); + pstart = ac-ac_o_ex.fe_logical bsbits; + pstart = (pstart / (4 * (1024 * 1024))) * 4 * (1024 * 1024); + start = (ac-ac_o_ex.fe_logical (22-bsbits)) 22; size = 4 * 1024 * 1024; } else if(NRL_CHECK_SIZE(ac-ac_o_ex.fe_len,(820)bsbits,max,bsbits)){ - start = ac-ac_o_ex.fe_logical; - start = start bsbits; - start = (start / (8 * (1024 * 1024))) * 8 * (1024 * 1024); + pstart = ac-ac_o_ex.fe_logical; + pstart = pstart bsbits; + pstart = (pstart / (8 * (1024 * 1024))) * 8 * (1024 * 1024); + start = (ac-ac_o_ex.fe_logical (23-bsbits)) 23; size = 8 * 1024 * 1024; } else { - start = ac-ac_o_ex.fe_logical; - start = start bsbits; + pstart = ac-ac_o_ex.fe_logical; + pstart = pstart bsbits; size = ac-ac_o_ex.fe_len bsbits; What about this ? I guess we will overflow start = start bsbits; I guess start should be of type loff_t. Patch below -aneesh ext4: Fix overflow in ext4_mb_normalize_request From: Aneesh Kumar K.V [EMAIL PROTECTED] kernel BUG at fs/ext4/mballoc.c:3148! The BUG_ON is: BUG_ON(size = 0 || size = EXT4_BLOCKS_PER_GROUP(ac-ac_sb)); where the value of size is 4293920768. This is due to the overflow of the variable start in the ext4_mb_normalize_request() function. Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] --- fs/ext4/mballoc.c | 21 - 1 files changed, 8 insertions(+), 13 deletions(-) diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index d8cd81e..d8a2db8 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -2998,7 +2998,7 @@ static void ext4_mb_normalize_request(struct ext4_allocation_context *ac, int bsbits, max; ext4_lblk_t end; struct list_head *cur; - loff_t size, orig_size; + loff_t size, orig_size, start_off; ext4_lblk_t start, orig_start; struct ext4_inode_info *ei = EXT4_I(ac-ac_inode); @@ -3039,7 +3039,7 @@ static void ext4_mb_normalize_request(struct ext4_allocation_context *ac, /* first, try to predict filesize */ /* XXX
Re: checkpatch.pl warnings
On Mon, Jan 14, 2008 at 12:49:27PM -0800, Mingming Cao wrote: Hi Guys, Could you check the checkpatch.pl warnings and see if it make sense to fix them? Thanks! [EMAIL PROTECTED]:~/fs/ext4/stylecheck$ grep has style problems * linux-2.6.24-rc7-48-bit-i_blocks.patch.out:Your patch has style problems, please review. If any of these errors linux-2.6.24-rc7-ext3-4-migrate.patch.out:Your patch has style problems, please review. If any of these errors linux-2.6.24-rc7-ext4_export_iov_shorten_from_kernel_for_ext4.patch.out:Your patch has style problems, please review. If any of these errors linux-2.6.24-rc7-ext4-journal_chksum-2.6.20.patch.out:Your patch has style problems, please review. If any of these errors linux-2.6.24-rc7-ext4_rec_len_overflow_with_64kblk_fix-v2.patch.out:Your patch has style problems, please review. If any of these errors linux-2.6.24-rc7-ext4_store_maxbytes_for_bitmaped_files.patch.out:Your patch has style problems, please review. If any of these errors linux-2.6.24-rc7-inode-version-ext4.patch.out:Your patch has style problems, please review. If any of these errors linux-2.6.24-rc7-jbd-stats-through-procfs.out:Your patch has style problems, please review. If any of these errors linux-2.6.24-rc7-large-file-blocktype.patch.out:Your patch has style problems, please review. If any of these errors linux-2.6.24-rc7-mballoc-core.patch.out:Your patch has style problems, please review. If any of these errors Fixed the checkpatch.pl warning for all the patches in the patch queue. The diff is attached below for review. patch queue at http://www.radian.org/~kvaneesh/ext4/jan-15-2008/ http://www.radian.org/~kvaneesh/ext4/jan-15-2008/patch-queue.tar This include the complete patch queue. Tested with -- fsx_linux, fs_inode, fsstress on x86_64 fsx_linux, fs_inode, fsstress on ppc64 diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index 1e46997..2ea7ef4 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -1798,8 +1798,9 @@ static int ext4_remove_blocks(handle_t *handle, struct inode *inode, printk(KERN_INFO strange request: removal %u-%u from %u:%u\n, from, to, le32_to_cpu(ex-ee_block), ee_len); } else { - printk(KERN_INFO strange request: removal(2) %u-%u from %u:%u\n, - from, to, le32_to_cpu(ex-ee_block), ee_len); + printk(KERN_INFO strange request: removal(2) + %u-%u from %u:%u\n, + from, to, le32_to_cpu(ex-ee_block), ee_len); } return 0; } @@ -2140,10 +2141,11 @@ void ext4_ext_release(struct super_block *sb) * b Splits in two extents: Write is happening at either end of the extent * c Splits in three extents: Somone is writing in middle of the extent */ -static int ext4_ext_convert_to_initialized(handle_t *handle, struct inode *inode, - struct ext4_ext_path *path, - ext4_lblk_t iblock, - unsigned long max_blocks) +static int ext4_ext_convert_to_initialized(handle_t *handle, + struct inode *inode, + struct ext4_ext_path *path, + ext4_lblk_t iblock, + unsigned long max_blocks) { struct ext4_extent *ex, newex; struct ext4_extent *ex1 = NULL; diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index f7dc5f3..6bb788d 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -751,7 +751,8 @@ err_out: for (i = 1; i = num; i++) { BUFFER_TRACE(where[i].bh, call jbd2_journal_forget); ext4_journal_forget(handle, where[i].bh); - ext4_free_blocks(handle,inode,le32_to_cpu(where[i-1].key),1, 0); + ext4_free_blocks(handle, inode, + le32_to_cpu(where[i-1].key), 1, 0); } ext4_free_blocks(handle, inode, le32_to_cpu(where[num].key), blks, 0); @@ -2829,7 +2830,7 @@ static blkcnt_t ext4_inode_blocks(struct ext4_inode *raw_inode, EXT4_FEATURE_RO_COMPAT_HUGE_FILE)) { /* we are using combined 48 bit field */ i_blocks = ((u64)le16_to_cpu(raw_inode-i_blocks_high)) 32 | - le32_to_cpu(raw_inode-i_blocks_lo); + le32_to_cpu(raw_inode-i_blocks_lo); if (ei-i_flags EXT4_HUGE_FILE_FL) { /* i_blocks represent file system block size */ return i_blocks (inode-i_blkbits - 9); diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 16854fd..d8cd81e 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -337,7 +337,7 @@ struct ext4_group_info
Re: [PATCH] ext4: use ext4_ext_get_actual_len instead of directly using ext4_extent.ee_len
On Sat, Jan 12, 2008 at 11:44:00PM +0530, Aneesh Kumar K.V wrote: fs/ext4/extents.c | 26 ++ 1 files changed, 14 insertions(+), 12 deletions(-) diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index 81bce98..4269cc6 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -1029,7 +1029,7 @@ ext4_ext_search_left(struct inode *inode, struct ext4_ext_path *path, { struct ext4_extent_idx *ix; struct ext4_extent *ex; - int depth; + int depth, ee_len; BUG_ON(path == NULL); depth = path-p_depth; @@ -1043,6 +1043,7 @@ ext4_ext_search_left(struct inode *inode, struct ext4_ext_path *path, * first one in the file */ ex = path[depth].p_ext; + ee_len = ext4_ext_get_actual_len(ex); if (*logical le32_to_cpu(ex-ee_block)) { BUG_ON(EXT_FIRST_EXTENT(path[depth].p_hdr) != ex); while (--depth = 0) { @@ -1052,10 +1053,10 @@ ext4_ext_search_left(struct inode *inode, struct ext4_ext_path *path, return 0; } - BUG_ON(*logical le32_to_cpu(ex-ee_block) + le16_to_cpu(ex-ee_len)); + BUG_ON(*logical (le32_to_cpu(ex-ee_block) + ee_len)); - *logical = le32_to_cpu(ex-ee_block) + le16_to_cpu(ex-ee_len) - 1; - *phys = ext_pblock(ex) + le16_to_cpu(ex-ee_len) - 1; + *logical = le32_to_cpu(ex-ee_block) + ee_len - 1; + *phys = ext_pblock(ex) + ee_len - 1; return 0; } @@ -1075,7 +1076,7 @@ ext4_ext_search_right(struct inode *inode, struct ext4_ext_path *path, struct ext4_extent_idx *ix; struct ext4_extent *ex; ext4_fsblk_t block; - int depth; + int depth, ee_len; BUG_ON(path == NULL); depth = path-p_depth; @@ -1089,6 +1090,7 @@ ext4_ext_search_right(struct inode *inode, struct ext4_ext_path *path, * first one in the file */ ex = path[depth].p_ext; + ee_len = ext4_ext_get_actual_len(ex); if (*logical le32_to_cpu(ex-ee_block)) { BUG_ON(EXT_FIRST_EXTENT(path[depth].p_hdr) != ex); while (--depth = 0) { @@ -1100,7 +1102,7 @@ ext4_ext_search_right(struct inode *inode, struct ext4_ext_path *path, return 0; } - BUG_ON(*logical le32_to_cpu(ex-ee_block) + le16_to_cpu(ex-ee_len)); + BUG_ON(*logical (le32_to_cpu(ex-ee_block) + ee_len)); if (ex != EXT_LAST_EXTENT(path[depth].p_hdr)) { /* next allocated block in this leaf */ @@ -1315,7 +1317,7 @@ ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1, if (ext1_ee_len + ext2_ee_len max_len) return 0; #ifdef AGGRESSIVE_TEST - if (le16_to_cpu(ex1-ee_len) = 4) + if (ext1_ee_len = 4) return 0; #endif @@ -1555,7 +1557,7 @@ has_space: nearex = path[depth].p_ext; nearex-ee_block = newext-ee_block; ext4_ext_store_pblock(nearex, ext_pblock(newext)); - nearex-ee_len = newext-ee_len; + nearex-ee_len = ext4_ext_get_actual_len(newext); merge: /* try to merge extents to the right */ This change is not needed. ext4_ext_try_to_merge check whether the extent is uninitialized or not and zero out the blocks if we are merging. @@ -2310,7 +2312,7 @@ int ext4_ext_get_blocks(handle_t *handle, struct inode *inode, - le32_to_cpu(newex.ee_block) + ext_pblock(newex); /* number of remaining blocks in the extent */ New patch is attached below. -aneesh et4: Get the actual length of extent From: Aneesh Kumar K.V [EMAIL PROTECTED] ext4 use the extent len for encoding whether the extent is intialized or not. The helper function ext4_ext_get_actual_len should be used to get the actual length of the extent. Fix the below kernel BUG http://bugzilla.kernel.org/show_bug.cgi?id=9732 kernel BUG at fs/ext4/extents.c:1056! Call Trace: [88366073] :ext4dev:ext4_ext_get_blocks+0x5ba/0x8c1 [81053c91] lock_release_holdtime+0x27/0x49 [812748f6] _spin_unlock+0x17/0x20 [883400a6] :jbd2:start_this_handle+0x4e0/0x4fe [88366564] :ext4dev:ext4_fallocate+0x175/0x39a [81053c91] lock_release_holdtime+0x27/0x49 [81056480] __lock_acquire+0x4e7/0xc4d [81053c91] lock_release_holdtime+0x27/0x49 [810a8de7] sys_fallocate+0xe4/0x10d [8100c043] tracesys+0xd5/0xda Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] --- fs/ext4/extents.c | 24 +--- 1 files changed, 13 insertions(+), 11 deletions(-) diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index 81bce98..1e46997 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -1029,7 +1029,7 @@ ext4_ext_search_left(struct inode *inode, struct ext4_ext_path *path, { struct ext4_extent_idx *ix; struct ext4_extent *ex; - int depth; + int depth, ee_len; BUG_ON(path == NULL); depth = path-p_depth
[PATCH] ext4: use ext4_ext_get_actual_len instead of directly using ext4_extent.ee_len
ext4 use the extent len for encoding whether the extent is intialized or not. The helper function ext4_ext_get_actual_len should be used to get the actual length of the extent. Fix the below kernel BUG http://bugzilla.kernel.org/show_bug.cgi?id=9732 kernel BUG at fs/ext4/extents.c:1056! Call Trace: [88366073] :ext4dev:ext4_ext_get_blocks+0x5ba/0x8c1 [81053c91] lock_release_holdtime+0x27/0x49 [812748f6] _spin_unlock+0x17/0x20 [883400a6] :jbd2:start_this_handle+0x4e0/0x4fe [88366564] :ext4dev:ext4_fallocate+0x175/0x39a [81053c91] lock_release_holdtime+0x27/0x49 [81056480] __lock_acquire+0x4e7/0xc4d [81053c91] lock_release_holdtime+0x27/0x49 [810a8de7] sys_fallocate+0xe4/0x10d [8100c043] tracesys+0xd5/0xda Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] --- fs/ext4/extents.c | 26 ++ 1 files changed, 14 insertions(+), 12 deletions(-) diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index 81bce98..4269cc6 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -1029,7 +1029,7 @@ ext4_ext_search_left(struct inode *inode, struct ext4_ext_path *path, { struct ext4_extent_idx *ix; struct ext4_extent *ex; - int depth; + int depth, ee_len; BUG_ON(path == NULL); depth = path-p_depth; @@ -1043,6 +1043,7 @@ ext4_ext_search_left(struct inode *inode, struct ext4_ext_path *path, * first one in the file */ ex = path[depth].p_ext; + ee_len = ext4_ext_get_actual_len(ex); if (*logical le32_to_cpu(ex-ee_block)) { BUG_ON(EXT_FIRST_EXTENT(path[depth].p_hdr) != ex); while (--depth = 0) { @@ -1052,10 +1053,10 @@ ext4_ext_search_left(struct inode *inode, struct ext4_ext_path *path, return 0; } - BUG_ON(*logical le32_to_cpu(ex-ee_block) + le16_to_cpu(ex-ee_len)); + BUG_ON(*logical (le32_to_cpu(ex-ee_block) + ee_len)); - *logical = le32_to_cpu(ex-ee_block) + le16_to_cpu(ex-ee_len) - 1; - *phys = ext_pblock(ex) + le16_to_cpu(ex-ee_len) - 1; + *logical = le32_to_cpu(ex-ee_block) + ee_len - 1; + *phys = ext_pblock(ex) + ee_len - 1; return 0; } @@ -1075,7 +1076,7 @@ ext4_ext_search_right(struct inode *inode, struct ext4_ext_path *path, struct ext4_extent_idx *ix; struct ext4_extent *ex; ext4_fsblk_t block; - int depth; + int depth, ee_len; BUG_ON(path == NULL); depth = path-p_depth; @@ -1089,6 +1090,7 @@ ext4_ext_search_right(struct inode *inode, struct ext4_ext_path *path, * first one in the file */ ex = path[depth].p_ext; + ee_len = ext4_ext_get_actual_len(ex); if (*logical le32_to_cpu(ex-ee_block)) { BUG_ON(EXT_FIRST_EXTENT(path[depth].p_hdr) != ex); while (--depth = 0) { @@ -1100,7 +1102,7 @@ ext4_ext_search_right(struct inode *inode, struct ext4_ext_path *path, return 0; } - BUG_ON(*logical le32_to_cpu(ex-ee_block) + le16_to_cpu(ex-ee_len)); + BUG_ON(*logical (le32_to_cpu(ex-ee_block) + ee_len)); if (ex != EXT_LAST_EXTENT(path[depth].p_hdr)) { /* next allocated block in this leaf */ @@ -1315,7 +1317,7 @@ ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1, if (ext1_ee_len + ext2_ee_len max_len) return 0; #ifdef AGGRESSIVE_TEST - if (le16_to_cpu(ex1-ee_len) = 4) + if (ext1_ee_len = 4) return 0; #endif @@ -1555,7 +1557,7 @@ has_space: nearex = path[depth].p_ext; nearex-ee_block = newext-ee_block; ext4_ext_store_pblock(nearex, ext_pblock(newext)); - nearex-ee_len = newext-ee_len; + nearex-ee_len = ext4_ext_get_actual_len(newext); merge: /* try to merge extents to the right */ @@ -2310,7 +2312,7 @@ int ext4_ext_get_blocks(handle_t *handle, struct inode *inode, - le32_to_cpu(newex.ee_block) + ext_pblock(newex); /* number of remaining blocks in the extent */ - allocated = le16_to_cpu(newex.ee_len) - + allocated = ext4_ext_get_actual_len(newex) - (iblock - le32_to_cpu(newex.ee_block)); goto out; } else { @@ -2426,7 +2428,7 @@ int ext4_ext_get_blocks(handle_t *handle, struct inode *inode, newex.ee_len = cpu_to_le16(max_blocks); err = ext4_ext_check_overlap(inode, newex, path); if (err) - allocated = le16_to_cpu(newex.ee_len); + allocated = ext4_ext_get_actual_len(newex); else allocated = max_blocks; @@ -2458,7 +2460,7 @@ int ext4_ext_get_blocks(handle_t *handle, struct inode *inode, * but otherwise we'd need
Re: Problems with mballoc and uninit_groups option
On Fri, Jan 11, 2008 at 03:04:08PM +0100, Valerie Clement wrote: Hi, I've got problems with mballoc when I create the ext4 filesystem with the uninit_groups option enabled. First, I do a single test on a filesystem created without the uninit_groups option and mounted with the defaults option: dd if=/dev/zero of=/mnt/test/foo bs=1M count=1024 In this case, the file blocks are allocated in the groups 4, 5, 6, 7, 8, 9, 10. When the filesystem is created with the uninit_groups option enabled and mounted with the defaults option, I do the same dd command. In this case, the file blocks are allocated in the groups 5, 7, 9, 25, 27, 49, 81. It seems that the blocks could be allocated only in the already initialized groups. That is because we skip the uninitialized group in ext4_mb_good_group. I guess we are trying criteria 0 allocation and we skip uninit group for criteria 0 . You can tune by setting higher value for /proc/fs/ext4/partition/orders2_req. Setting it to a high value would skip criteria 0 allocation for small requests. I guess you are using delayed allocation ? -aneesh - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
patch queue update
Hi Mingming, New patches for patch queue can be found at http://www.radian.org/~kvaneesh/ext4/jan-10-2008-ver2/ The changes are a) mballoc patch got an explanation about regular allocator. b) mballoc regular allocator we changed the usage of ffs to fls. I guess it makes sense to use fls because we want to compare it against the tunable s_mb_order2_reqs. Only request above this order are using criteria 0 allocation. c) stripe.patch to use the stripe size set in the super block for block allocation. The diff is attached for reference. diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 0d31817..0085fde 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -468,7 +468,6 @@ static void ext4_mb_free_committed_blocks(struct super_block *); static void ext4_mb_return_to_preallocation(struct inode *inode, struct ext4_buddy *e4b, sector_t block, int count); -static void ext4_mb_show_ac(struct ext4_allocation_context *ac); static void ext4_mb_put_pa(struct ext4_allocation_context *, struct super_block *, struct ext4_prealloc_space *pa); static int ext4_mb_init_per_dev_proc(struct super_block *sb); @@ -1838,14 +1837,23 @@ static int ext4_mb_regular_allocator(struct ext4_allocation_context *ac) if (unlikely(ac-ac_flags EXT4_MB_HINT_GOAL_ONLY)) goto out; - i = ffs(ac-ac_g_ex.fe_len); + /* +* ac-ac2_order is set only if the fe_len is a power of 2 +* if ac2_order is set we also set criteria to 0 so whtat we +* try exact allocation using buddy. +*/ + i = fls(ac-ac_g_ex.fe_len); ac-ac_2order = 0; - /* FIXME!! -* What happens if i is still greater than s_mb_order2_reqs + /* +* We search using buddy data only if the order of the request +* is greater than equal to the sbi_s_mb_order2_reqs +* You can tune it via /proc/fs/ext4/partition/order2_req */ if (i = sbi-s_mb_order2_reqs) { - i--; - if ((ac-ac_g_ex.fe_len (~(1 i))) == 0) + /* +* This should tell if fe_len is exactly power of 2 +*/ + if ((ac-ac_g_ex.fe_len (~(1 (i - 1 == 0) ac-ac_2order = i; } @@ -1865,17 +1873,17 @@ static int ext4_mb_regular_allocator(struct ext4_allocation_context *ac) spin_unlock(sbi-s_md_lock); } + /* searching for the right group start from the goal value specified */ group = ac-ac_g_ex.fe_group; /* Let's just scan groups to find more-less suitable blocks */ cr = ac-ac_2order ? 0 : 1; + /* +* cr == 0 try to get exact allocation, +* cr == 3 try to get anything +*/ repeat: for (; cr 4 ac-ac_status == AC_STATUS_CONTINUE; cr++) { - /* FIXME!! -* We need to explain what criteria is and also -* need to define the number 0 to 4 for criteria -* What they actually means. -*/ ac-ac_criteria = cr; for (i = 0; i EXT4_SB(sb)-s_groups_count; group++, i++) { struct ext4_group_info *grp; @@ -1889,23 +1897,28 @@ repeat: if (grp-bb_free == 0) continue; + /* +* if the group is already init we check whether it is +* a good group and if not we don't load the buddy +*/ if (EXT4_MB_GRP_NEED_INIT(EXT4_GROUP_INFO(sb, group))) { - /* we need full data about the group -* to make a good selection */ + /* +* we need full data about the group +* to make a good selection +*/ err = ext4_mb_load_buddy(sb, group, e4b); if (err) goto out; ext4_mb_release_desc(e4b); } - /* check is group good for our criteries */ + /* +* If the particular group doesn't satisfy our +* criteria we continue with the next group +*/ if (!ext4_mb_good_group(ac, group, cr)) continue; - /* FIXME!! -* here also we are loading the buddy. so what difference -* does EXT4_MB_GRP_NEED_INIT actually make -*/ err = ext4_mb_load_buddy(sb, group, e4b);
Re: patch queue update
On Thu, Jan 10, 2008 at 02:43:23PM -0700, Andreas Dilger wrote: On Jan 10, 2008 21:03 +0530, Aneesh Kumar K.V wrote: if (i = sbi-s_mb_order2_reqs) { - i--; - if ((ac-ac_g_ex.fe_len (~(1 i))) == 0) + /* +* This should tell if fe_len is exactly power of 2 +*/ + if ((ac-ac_g_ex.fe_len (~(1 (i - 1 == 0) ac-ac_2order = i; While you changed i to (i - 1) in the if you didn't change it when setting ac_2order... Is that incorrect? Yes that ac_2order should be i - 1; Will fix it in the next update. I see that the patch queue update doesn't have most of the changes I have placed at http://www.radian.org/~kvaneesh/ext4/jan-10-2008-ver2/ /* +* Yield the CPU here so that we don't get soft lockup */ - schedule_timeout(HZ); + schedule(); goto repeat; } @@ -3808,7 +3820,7 @@ repeat: printk(KERN_ERR uh-oh! used pa while discarding\n); dump_stack(); current-state = TASK_UNINTERRUPTIBLE; - schedule(); + schedule_timeout(HZ); goto repeat; Is this change to schedule_timeout() intentional? The earlier code is removing the use of schedule_timeout. I could be wrong, as I didn't follow this discussion closely, but sometimes changes like this happen accidentally and people don't look at the patch itself... The patch queue had it modified from schedule_timeout to schedule(). I am moving it back to the original version. If we have set the task state to TASK_UNINTERRUPTIBLE it should be schedule_timeout. And at these place we intent to wait uninterrupted for 1 sec. The place where we wanted to just yield is ext4_mb_discard_group_preallocations. +static unsigned long ext4_get_stripe_size(struct ext4_sb_info *sbi) +{ + unsigned long stride = le16_to_cpu(sbi-s_es-s_raid_stride); + unsigned long stripe_width = le32_to_cpu(sbi-s_es-s_raid_stripe_width); + + if (sbi-s_stripe sbi-s_stripe = sbi-s_blocks_per_group) { + return sbi-s_stripe; + } else if (stripe_width = sbi-s_blocks_per_group) { + return stripe_width; + } else if (stride = sbi-s_blocks_per_group) { + return stride; + } If you are doing return XXX you don't need else. + /* +* set the stripe size. If we have specified it via mount option, then +* use the mount option value. If the value specified at mount time is +* greater than the blocks per group use the super block value. +* Allocator needs it be less than blocks per group. +*/ + sbi-s_stripe = ext4_get_stripe_size(sbi); This comment should probably go by ext4_get_stripe_size() definition instead of here at the caller. Will move that to the function definition. -aneesh - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] set s_raid_stride and s_raid_stripe_width
[..snip...] + if (param-s_raid_stride + (param-s_raid_stripe_width % param-s_raid_stride) != 0) + fprintf(stderr, _(\nWarning: RAID stripe-width %u not an even + multiple of stride %u.\n\n), + param-s_raid_stripe_width, param-s_raid_stride); We also want to validate that s_raid_stripe_width is not = blocks_per_group. -aneesh - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] ext4: Use superblock s_raid_stripe_width as stripe size during block allocation.
The stipe size used during block allocation is calculated as below. a) if we specify a stripe=value option using mount time. Use that value. b) if not use the value specified in super block. b) If the value specfied at mount time is greater than blocks per group use the value specified ini the super block. Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] --- fs/ext4/super.c | 10 ++ 1 files changed, 10 insertions(+), 0 deletions(-) diff --git a/fs/ext4/super.c b/fs/ext4/super.c index db1edc8..10330eb 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -2136,6 +2136,16 @@ static int ext4_fill_super (struct super_block *sb, void *data, int silent) sbi-s_rsv_window_head.rsv_alloc_hit = 0; sbi-s_rsv_window_head.rsv_goal_size = 0; ext4_rsv_window_add(sb, sbi-s_rsv_window_head); + /* +* set the stripe size. If we have specified it via mount option, then +* use the mount option value. If the value specified at mount time is +* greater than the blocks per group use the super block value. +* Allocator needs it be less than blocks per group. +*/ + if (!sbi-s_stripe || + sbi-s_stripe = sbi-s_blocks_per_group) { + sbi-s_stripe = le32_to_cpu(es-s_raid_stripe_width); + } /* * set up enough so that it can read an inode -- 1.5.4.rc2.60.gb2e62-dirty - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] ext4: Check for return value from sb_set_blocksize
sb_set_blocksize validates whether the specfied block size can be used by the file system. Make sure we fail mounting the file system if the blocksize specfied cannot be used. Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] --- fs/ext4/super.c | 18 +++--- 1 files changed, 7 insertions(+), 11 deletions(-) diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 10330eb..ab62d7f 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -1793,7 +1793,6 @@ static int ext4_fill_super (struct super_block *sb, void *data, int silent) unsigned long def_mount_opts; struct inode *root; int blocksize; - int hblock; int db_count; int i; int needs_recovery; @@ -1958,20 +1957,17 @@ static int ext4_fill_super (struct super_block *sb, void *data, int silent) goto failed_mount; } - hblock = bdev_hardsect_size(sb-s_bdev); if (sb-s_blocksize != blocksize) { - /* -* Make sure the blocksize for the filesystem is larger -* than the hardware sectorsize for the machine. -*/ - if (blocksize hblock) { - printk(KERN_ERR EXT4-fs: blocksize %d too small for - device blocksize %d.\n, blocksize, hblock); + + brelse (bh); + + /* Validate the filesystem blocksize */ + if (!sb_set_blocksize(sb, blocksize)) { + printk(KERN_ERR EXT4-fs: bad block size %d.\n, + blocksize); goto failed_mount; } - brelse (bh); - sb_set_blocksize(sb, blocksize); logical_sb_block = sb_block * EXT4_MIN_BLOCK_SIZE; offset = do_div(logical_sb_block, blocksize); bh = sb_bread(sb, logical_sb_block); -- 1.5.4.rc2.60.gb2e62-dirty - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Should we enabling mballoc by default ?
Hi, mballoc currently causes fragmentation of small size files. The behaviour can be observed by running parallel dd on ext4 file system. A sample test case can be found here. http://www.radian.org/~kvaneesh/ext4/mballoc-frag/fragmentation-analysis This is because for small size request/file mballoc uses group prealloc space. That means if you have singe cpu and multiple threads doing parallel io to the file the block request are served from the same prealloc space. This results in the fragmentation observed above. (Group prealloc space is per cpu ) This problem should go away with delayed allocation because the write out happens with multiple blocks and mballoc places then close together even though they get served by the same group prealloc space. Considering that we don't have delalloc yet, i guess we should push mballoc to 2.6.25 with default disabled and O_DIRECT type workloads can enabled then via mount option. -aneesh - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] ext4: Check for return value from sb_set_blocksize
Updated patch. The earlier patch did multiple brelse() during failed mount case. ext4: Check for return value from sb_set_blocksize From: Aneesh Kumar K.V [EMAIL PROTECTED] sb_set_blocksize validates whether the specfied block size can be used by the file system. Make sure we fail mounting the file system if the blocksize specfied cannot be used. Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] --- fs/ext4/super.c | 15 +-- 1 files changed, 5 insertions(+), 10 deletions(-) diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 10330eb..f9a9ef1 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -1793,7 +1793,6 @@ static int ext4_fill_super (struct super_block *sb, void *data, int silent) unsigned long def_mount_opts; struct inode *root; int blocksize; - int hblock; int db_count; int i; int needs_recovery; @@ -1958,20 +1957,16 @@ static int ext4_fill_super (struct super_block *sb, void *data, int silent) goto failed_mount; } - hblock = bdev_hardsect_size(sb-s_bdev); if (sb-s_blocksize != blocksize) { - /* -* Make sure the blocksize for the filesystem is larger -* than the hardware sectorsize for the machine. -*/ - if (blocksize hblock) { - printk(KERN_ERR EXT4-fs: blocksize %d too small for - device blocksize %d.\n, blocksize, hblock); + + /* Validate the filesystem blocksize */ + if (!sb_set_blocksize(sb, blocksize)) { + printk(KERN_ERR EXT4-fs: bad block size %d.\n, + blocksize); goto failed_mount; } brelse (bh); - sb_set_blocksize(sb, blocksize); logical_sb_block = sb_block * EXT4_MIN_BLOCK_SIZE; offset = do_div(logical_sb_block, blocksize); bh = sb_bread(sb, logical_sb_block); - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] ext4: Fix the soft lockup with multi block allocator.
On Wed, Jan 09, 2008 at 01:10:41PM +0100, Jan Kara wrote: With the multi block allocator when we don't have prealloc space we discard @@ -3790,7 +3782,9 @@ repeat: /* if we still need more blocks and some PAs were used, try again */ if (free needed busy) { + busy = 0; ext4_unlock_group(sb, group); + schedule_timeout(HZ); goto repeat; } Hmm, wouldn't just schedule() be enough here? That would give a good chance to other processes to proceed and we would avoid this artificial wait of 1s which is quite ugly IMO. Honza But then who will wake up the task ?. I have the below comment added to the patch in the patch queue. /* * We see this quiet rare. But if a particular workload is * effected by this we may need to add a waitqueue */ -aneesh - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] ext4: Fix the soft lockup with multi block allocator.
On Wed, Jan 09, 2008 at 07:44:30PM +0100, Jan Kara wrote: On Wed 09-01-08 23:54:28, Aneesh Kumar K.V wrote: On Wed, Jan 09, 2008 at 01:10:41PM +0100, Jan Kara wrote: With the multi block allocator when we don't have prealloc space we discard @@ -3790,7 +3782,9 @@ repeat: /* if we still need more blocks and some PAs were used, try again */ if (free needed busy) { + busy = 0; ext4_unlock_group(sb, group); + schedule_timeout(HZ); goto repeat; } Hmm, wouldn't just schedule() be enough here? That would give a good chance to other processes to proceed and we would avoid this artificial wait of 1s which is quite ugly IMO. But then who will wake up the task ?. I have the below comment added to the patch in the patch queue. As far as I know, you don't have to wake-up the task explicitely. Scheduler will simply schedule the task sometime in future (it is a similar situation as if the task got preempted in the kernel). I missed the current-state = TASK_UNINTERRUPTIBLE; in that code piece. So yes without setting the task state yes it would put it back to the run queue.Infact schedule_timeout() without changing the task state also behaves similarly. Now that that we know that it if fine just to have a schedule() there since schedule_timeout() was just behaving as schedule(). I guess we should make the change you suggested. In that case we can remove the comment which says we need to add the wait queue. Mingming, Do you want me to send a patch or can you make the modification ? -aneesh - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] ext4: Use superblock s_raid_stripe_width as stripe size during block allocation.
On Wed, Jan 09, 2008 at 04:36:27PM -0700, Andreas Dilger wrote: On Jan 09, 2008 22:37 +0530, Aneesh Kumar K.V wrote: The stipe size used during block allocation is calculated as below. a) if we specify a stripe=value option using mount time. Use that value. b) if not use the value specified in super block. b) If the value specfied at mount time is greater than blocks per group use the value specified ini the super block. What if the value on disk is also bad? I'd add (after the second 'b' :-): d) if s_stripe is still s_blocks_per_group try s_raid_stride e) if s_stripe is still s_blocks_per_group use 0 But i guess mke2fs and tune2fs should validate the value of s_raid_stripe_width and s_raid_stride. Both of them should be less that blocks per group. Should I add extra check in the kernel for them ? Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] --- fs/ext4/super.c | 10 ++ 1 files changed, 10 insertions(+), 0 deletions(-) diff --git a/fs/ext4/super.c b/fs/ext4/super.c index db1edc8..10330eb 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -2136,6 +2136,16 @@ static int ext4_fill_super (struct super_block *sb, void *data, int silent) sbi-s_rsv_window_head.rsv_alloc_hit = 0; sbi-s_rsv_window_head.rsv_goal_size = 0; ext4_rsv_window_add(sb, sbi-s_rsv_window_head); + /* +* set the stripe size. If we have specified it via mount option, then +* use the mount option value. If the value specified at mount time is +* greater than the blocks per group use the super block value. +* Allocator needs it be less than blocks per group. +*/ + if (!sbi-s_stripe || + sbi-s_stripe = sbi-s_blocks_per_group) { So what do you think should it be or =. Looking at the mballoc I guess it should work with stripe size equal to blocks per group. I am not sure how efficient the allocation would be though. -aneesh - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
mballoc and inode prealloc space deletion
Hi Alex, With the latest changes i see both group preallocation and inode preallocation being used by mballoc. Since the choice is now made with file size not just the request size. (None of the test were actually using inode prealloc previously). Now that the test are using inode prealloc i see the below dump stack on the console. I guess you put in the dump stack there to find out how often these happen. Do you still want the dump stack to be there. Or can I remove it. ? uh-oh! some one just deleted it Call Trace: [c000e4143970] [c0010554] .show_stack+0x6c/0x1a0 (unreliable) [c000e4143a10] [c019db90] .ext4_mb_discard_inode_preallocations+0x1d4/0x428 [c000e4143b20] [c0178e8c] .ext4_discard_reservation+0x30/0xb4 [c000e4143bc0] [c017c0b8] .ext4_release_file+0x54/0xbc [c000e4143c50] [c00de4b0] .__fput+0x114/0x214 [c000e4143d00] [c00db2d8] .filp_close+0xb0/0xd8 [c000e4143d90] [c00dbb20] .sys_close+0xc4/0x128 [c000e4143e30] [c000872c] syscall_exit+0x0/0x40 uh-oh! some one just deleted it Call Trace: [c000ede97970] [c0010554] .show_stack+0x6c/0x1a0 (unreliable) [c000ede97a10] [c019db90] .ext4_mb_discard_inode_preallocations+0x1d4/0x428 [c000ede97b20] [c0178e8c] .ext4_discard_reservation+0x30/0xb4 [c000ede97bc0] [c017c0b8] .ext4_release_file+0x54/0xbc [c000ede97c50] [c00de4b0] .__fput+0x114/0x214 [c000ede97d00] [c00db2d8] .filp_close+0xb0/0xd8 [c000ede97d90] [c00dbb20] .sys_close+0xc4/0x128 [c000ede97e30] [c000872c] syscall_exit+0x0/0x40 uh-oh! some one just deleted it Call Trace: [c000e6b1f970] [c0010554] .show_stack+0x6c/0x1a0 (unreliable) [c000e6b1fa10] [c019db90] .ext4_mb_discard_inode_preallocations+0x1d4/0x428 [c000e6b1fb20] [c0178e8c] .ext4_discard_reservation+0x30/0xb4 [c000e6b1fbc0] [c017c0b8] .ext4_release_file+0x54/0xbc [c000e6b1fc50] [c00de4b0] .__fput+0x114/0x214 [c000e6b1fd00] [c00db2d8] .filp_close+0xb0/0xd8 [c000e6b1fd90] [c00dbb20] .sys_close+0xc4/0x128 [c000e6b1fe30] [c000872c] syscall_exit+0x0/0x40 -- 0:conmux-control -- time-stamp -- Jan/08/08 2:20:11 -- -- 0:conmux-control -- time-stamp -- Jan/08/08 2:38:19 -- (bot:conmon-payload) disconnected - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] mballoc update
On Tue, Jan 08, 2008 at 12:01:14PM +0530, Aneesh Kumar K.V wrote: Hi, This is the update for mballoc patch. The changes are result of merging with the lustre cvs version of mballoc. I liked this patch better because it is simple. I also the updated the commit message. The update commit message is also attached below. We only have one FIXME!! in the commit message now to explain the inode buddy cache allocator. Let me know what you think. This is not yet for patch queue. I will update the mballoc-core.patch and send it the full patch later. The patch for patch queue can be found at http://www.radian.org/~kvaneesh/ext4/jan-8-2008/mballoc-core.patch -aneesh - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] mballoc changes from ldiskfs
Hi, This patch is not even compile tested. I am sending it over to find out whether some of the changes are even needed and to make sure i didn't drop any bug fix in the merge. something I noticed. a) prealloc table is completely gone. b) ext4_mb_put_pa change. ( I guess that is a bug with ldiskfs ). now by default request less that 64K use locality group preallocation. The ldiskfs change i looked at is from lustre/ldiskfs/kernel_patches/patches/ext3-mballoc3-core.patch diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 58a70a1..cb84516 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -532,10 +532,10 @@ static inline void mb_set_bit(int bit, void *addr) ext4_set_bit(bit, addr); } -static inline void mb_set_bit_atomic(int bit, void *addr) +static inline void mb_set_bit_atomic(spinlock_t *lock, int bit, void *addr) { mb_correct_addr_and_bit(bit, addr); - ext4_set_bit_atomic(NULL, bit, addr); + ext4_set_bit_atomic(lock, bit, addr); } static inline void mb_clear_bit(int bit, void *addr) @@ -544,10 +544,10 @@ static inline void mb_clear_bit(int bit, void *addr) ext4_clear_bit(bit, addr); } -static inline void mb_clear_bit_atomic(int bit, void *addr) +static inline void mb_clear_bit_atomic(spinlock_t *lock, int bit, void *addr) { mb_correct_addr_and_bit(bit, addr); - ext4_clear_bit_atomic(NULL, bit, addr); + ext4_clear_bit_atomic(lock, bit, addr); } static inline void *mb_find_buddy(struct ext4_buddy *e4b, int order, int *max) @@ -1155,7 +1155,7 @@ static int mb_find_order_for_block(struct ext4_buddy *e4b, int block) return 0; } -static void mb_clear_bits(void *bm, int cur, int len) +static void mb_clear_bits(spinlock_t *lock, void *bm, int cur, int len) { __u32 *addr; @@ -1168,12 +1168,12 @@ static void mb_clear_bits(void *bm, int cur, int len) cur += 32; continue; } - mb_clear_bit_atomic(cur, bm); + mb_clear_bit_atomic(lock, cur, bm); cur++; } } -static void mb_set_bits(void *bm, int cur, int len) +static void mb_set_bits(spinlock_t *lock, void *bm, int cur, int len) { __u32 *addr; @@ -1186,7 +1186,7 @@ static void mb_set_bits(void *bm, int cur, int len) cur += 32; continue; } - mb_set_bit_atomic(cur, bm); + mb_set_bit_atomic(lock, cur, bm); cur++; } } @@ -1403,7 +1403,8 @@ static int mb_mark_used(struct ext4_buddy *e4b, struct ext4_free_extent *ex) e4b-bd_info-bb_counters[ord]++; } - mb_set_bits(EXT4_MB_BITMAP(e4b), ex-fe_start, len0); + mb_set_bits(sb_bgl_lock(EXT4_SB(e4b-bd_sb), ex-fe_group), + EXT4_MB_BITMAP(e4b), ex-fe_start, len0); mb_check_buddy(e4b); return ret; @@ -1439,14 +1440,6 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac, get_page(ac-ac_bitmap_page); ac-ac_buddy_page = e4b-bd_buddy_page; get_page(ac-ac_buddy_page); - - /* store last allocated for subsequent stream allocation */ - if ((ac-ac_flags EXT4_MB_HINT_DATA)) { - spin_lock(sbi-s_md_lock); - sbi-s_mb_last_group = ac-ac_f_ex.fe_group; - sbi-s_mb_last_start = ac-ac_f_ex.fe_start; - spin_unlock(sbi-s_md_lock); - } } /* @@ -1509,8 +1502,8 @@ static void ext4_mb_measure_extent(struct ext4_allocation_context *ac, struct ext4_free_extent *gex = ac-ac_g_ex; BUG_ON(ex-fe_len = 0); - BUG_ON(ex-fe_len = (1 ac-ac_sb-s_blocksize_bits) * 8); - BUG_ON(ex-fe_start = (1 ac-ac_sb-s_blocksize_bits) * 8); + BUG_ON(ex-fe_len = EXT4_BLOCKS_PER_GROUP(ac-ac_sb)); + BUG_ON(ex-fe_start = EXT4_BLOCKS_PER_GROUP(ac-ac_sb)); BUG_ON(ac-ac_status != AC_STATUS_CONTINUE); ac-ac_found++; @@ -1553,7 +1546,6 @@ static void ext4_mb_measure_extent(struct ext4_allocation_context *ac, /* if the request is satisfied, then we try to find * an extent that still satisfy the request, but is * smaller than previous one */ - if (ex-fe_len bex-fe_len) *bex = *ex; } @@ -1702,8 +1694,8 @@ static void ext4_mb_complex_scan_group(struct ext4_allocation_context *ac, i = e4b-bd_info-bb_first_free; while (free ac-ac_status == AC_STATUS_CONTINUE) { - i = ext4_find_next_zero_bit(bitmap, sb-s_blocksize * 8, i); - if (i = sb-s_blocksize * 8) { + i = ext4_find_next_zero_bit(bitmap, EXT4_BLOCKS_PER_GROUP(sb), i); + if (i = EXT4_BLOCKS_PER_GROUP(sb)) { BUG_ON(free != 0); break; } @@ -1744,7 +1736,7 @@ static void ext4_mb_scan_aligned(struct
Re: [PATCH] mballoc changes from ldiskfs
On Mon, Jan 07, 2008 at 11:58:00PM +0530, Aneesh Kumar K.V wrote: Hi, This patch is not even compile tested. I am sending it over to find out whether some of the changes are even needed and to make sure i didn't drop any bug fix in the merge. something I noticed. a) prealloc table is completely gone. b) ext4_mb_put_pa change. ( I guess that is a bug with ldiskfs ). now by default request less that 64K use locality group preallocation. The ldiskfs change i looked at is from lustre/ldiskfs/kernel_patches/patches/ext3-mballoc3-core.patch [... snip... ] BUG_ON(ex-fe_len = 0); - BUG_ON(ex-fe_len = (1 ac-ac_sb-s_blocksize_bits) * 8); - BUG_ON(ex-fe_start = (1 ac-ac_sb-s_blocksize_bits) * 8); + BUG_ON(ex-fe_len = EXT4_BLOCKS_PER_GROUP(ac-ac_sb)); + BUG_ON(ex-fe_start = EXT4_BLOCKS_PER_GROUP(ac-ac_sb)); BUG_ON(ac-ac_status != AC_STATUS_CONTINUE); ac-ac_found++; @@ -1553,7 +1546,6 @@ static void ext4_mb_measure_extent(struct ext4_allocation_context *ac, /* if the request is satisfied, then we try to find * an extent that still satisfy the request, but is * smaller than previous one */ - if (ex-fe_len bex-fe_len) *bex = *ex; } This one is a bug fix for ext4 patch queue from Alex. So this change is needed. @@ -1702,8 +1694,8 @@ static void ext4_mb_complex_scan_group(struct ext4_allocation_context *ac, i = e4b-bd_info-bb_first_free; while (free ac-ac_status == AC_STATUS_CONTINUE) { - i = ext4_find_next_zero_bit(bitmap, sb-s_blocksize * 8, i); - if (i = sb-s_blocksize * 8) { + i = ext4_find_next_zero_bit(bitmap, EXT4_BLOCKS_PER_GROUP(sb), i); + if (i = EXT4_BLOCKS_PER_GROUP(sb)) { BUG_ON(free != 0); break; } @@ -1744,7 +1736,7 @@ static void ext4_mb_scan_aligned(struct ext4_allocation_context *ac, i = (i - le32_to_cpu(sbi-s_es-s_first_data_block)) % EXT4_BLOCKS_PER_GROUP(sb); - while (i sb-s_blocksize * 8) { + while (i EXT4_BLOCKS_PER_GROUP(sb)) { if (!mb_test_bit(i, bitmap)) { max = mb_find_extent(e4b, 0, i, sbi-s_stripe, ex); if (max = sbi-s_stripe) { @@ -1839,20 +1831,6 @@ static int ext4_mb_regular_allocator(struct ext4_allocation_context *ac) ac-ac_2order = i; } - /* if stream allocation is enabled, use global goal */ - - /* FIXME!! - * Need more explanation on what it is and how stream - * allocation is represented by the below conditional - */ - if ((ac-ac_g_ex.fe_len sbi-s_mb_large_req) - (ac-ac_flags EXT4_MB_HINT_DATA)) { - /* TBD: may be hot point */ - spin_lock(sbi-s_md_lock); - ac-ac_g_ex.fe_group = sbi-s_mb_last_group; - ac-ac_g_ex.fe_start = sbi-s_mb_last_start; - spin_unlock(sbi-s_md_lock); - } group = ac-ac_g_ex.fe_group; @@ -2291,7 +2269,8 @@ static void ext4_mb_history_init(struct super_block *sb) spin_lock_init(sbi-s_mb_history_lock); i = sbi-s_mb_history_max * sizeof(struct ext4_mb_history); sbi-s_mb_history = kmalloc(i, GFP_KERNEL); - memset(sbi-s_mb_history, 0, i); + if (likely(sbi-s_mb_history != NULL)) + memset(sbi-s_mb_history, 0, i); /* if we can't allocate history, then we simple won't use it */ } @@ -2300,7 +2279,7 @@ static void ext4_mb_store_history(struct ext4_allocation_context *ac) struct ext4_sb_info *sbi = EXT4_SB(ac-ac_sb); struct ext4_mb_history h; - if (likely(sbi-s_mb_history == NULL)) + if (unlikely(sbi-s_mb_history == NULL)) return; if (!(ac-ac_op sbi-s_mb_history_filter)) @@ -2312,11 +2291,6 @@ static void ext4_mb_store_history(struct ext4_allocation_context *ac) h.orig = ac-ac_o_ex; h.result = ac-ac_b_ex; h.flags = ac-ac_flags; - h.found = ac-ac_found; - h.groups = ac-ac_groups_scanned; - h.cr = ac-ac_criteria; - h.tail = ac-ac_tail; - h.buddy = ac-ac_buddy; h.merged = 0; This one is a bug for ext4 patch queue from Alex. So this change is needed. [] + + /* first, try to predict filesize */ + /* XXX: should this table be tunable? */ Here we says we need to have tunables. Does that mean prealloc table is needed ? + start = 0; + if (size = 16 * 1024) { + size = 16 * 1024; + } else if (size = 32 * 1024) { + size = 32 * 1024; + } else if (size = 64 * 1024) { + size = 64 * 1024; + } else if (size = 128 * 1024) { + size = 128 * 1024; + } else if (size = 256 * 1024) { + size = 256 * 1024; + } else if (size = 512 * 1024
[PATCH] mballoc update
Hi, This is the update for mballoc patch. The changes are result of merging with the lustre cvs version of mballoc. I liked this patch better because it is simple. I also the updated the commit message. The update commit message is also attached below. We only have one FIXME!! in the commit message now to explain the inode buddy cache allocator. Let me know what you think. This is not yet for patch queue. I will update the mballoc-core.patch and send it the full patch later. --- commit message --- ext4: Add multi block allocator for ext4 From: Alex Tomas [EMAIL PROTECTED] The allocation request involve request for multiple number of blocks near to the goal(block) value specified. During initialization phase of the allocator we decide to use the group preallocation or inode preallocation depending on the size file. The size of the file could be the resulting file size we would have after allocation or the current file size which ever is larger. If the size is less that sbi-s_mb_stream_request we select the group preallocation. The default value of s_mb_stream_request is 16 blocks. This can also be tuned via /proc/fs/ext4/partition/stream_req. The value is represented in terms of number of blocks. The main motivation for having small file use group preallocation is to ensure that we have small file closer in the disk. First stage the allocator looks at the inode prealloc list ext4_inode_info-i_prealloc_list contain list of prealloc spaces for this particular inode. The inode prealloc space is represented as pa_lstart - the logical start block for this prealloc space pa_pstart - the physical start block for this prealloc space pa_len- lenght for this prealloc space pa_free - free space available in this prealloc space The inode preallocation space is used looking at the _logical_ start block. If only the logical file block falls within the range of prealloc space we will consume the particular prealloc space. This make sure that that the we have contiguous physical blocks representing the file blocks The important thing to be noted in case of inode prealloc space is that we don't modify the values associated to inode prealloc space except pa_free. If we are not able to find blocks in the inode prealloc space and if we have the group allocation flag set then we look at the locality group prealloc space. These are per CPU prealloc list represented as ext4_sb_info.s_locality_groups[smp_processor_id()] The reason for having a per cpu locality group is to reduce the contention between CPUs. It is possible to get scheduled at this point. The locality group prealloc space is used looking at whether we have enough free space (pa_free) withing the prealloc space. If we can't allocate blocks via inode prealloc or/and locality group prealloc then we look at the buddy cache. The buddy cache is represented by ext4_sb_info.s_buddy_cache (struct inode) whose file offset gets mapped to the buddy and bitmap information regarding different groups. The buddy information is attached to buddy cache inode so that we can access them through the page cache. The information regarding each group is loaded via ext4_mb_load_buddy. The information involve block bitmap and buddy information. The information are stored in the inode as {page} [ group 0 buddy][ group 0 bitmap] [group 1][ group 1]... one block each for bitmap and buddy information. So for each group we take up 2 blocks. A page can contain blocks_per_page (PAGE_CACHE_SIZE / blocksize) blocks. So it can have information regarding groups_per_page which is blocks_per_page/2 Buddy cache inode is not stored on disk. The inode get thrown away at the end when unmounting the disk. We look for count number of blocks in the buddy cache. If we were able to locate that many free blocks we return with additional information regarding rest of the contiguous physical block available Before allocating blocks via buddy cache we normalize the request blocks. This ensure we ask for more blocks that we needed. The extra blocks that we get after allocation is added to the respective prealloc list. In case of inode preallocation we follow a list of heuristics based on file size. This can be found in ext4_mb_normalize_request. If we are doing a group prealloc we try to normalize the request to sbi-s_mb_group_prealloc. Default value of s_mb_group_prealloc is set to 512 blocks. This can be tuned via /proc/fs/ext4/partition/group_prealloc. The value is represented in terms of number of blocks. If we have mounted the file system with -O stripe=value option the group prealloc request is normalized to the stripe value (sbi-s_stripe) /* FIXME!! explanation of how blocks are picked from buddy cache and the tunable max_to_scan min_to_scan order2_req */ Both the prealloc space are getting populated as above. So for the first request we will hit the buddy cache which will result in this prealloc space getting filled. The
[Updated PATCH] ext4: Use the correct block number when reading the super block.
If the block device hard sector size is larger than EXT4_MIN_BLOCK_SIZE we end up with wrong block number when reading super block. Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] --- fs/ext4/super.c |8 1 files changed, 4 insertions(+), 4 deletions(-) diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 1ca0f54..3095370 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -1629,8 +1629,8 @@ static int ext4_fill_super (struct super_block *sb, void *data, int silent) * block sizes. We need to calculate the offset from buffer start. */ if (blocksize != EXT4_MIN_BLOCK_SIZE) { - logical_sb_block = sb_block * EXT4_MIN_BLOCK_SIZE; - offset = do_div(logical_sb_block, blocksize); + logical_sb_block = (sb_block * EXT4_MIN_BLOCK_SIZE) / blocksize; + offset = do_div((sb_block * EXT4_MIN_BLOCK_SIZE), blocksize); } else { logical_sb_block = sb_block; } @@ -1747,8 +1747,8 @@ static int ext4_fill_super (struct super_block *sb, void *data, int silent) brelse (bh); sb_set_blocksize(sb, blocksize); - logical_sb_block = sb_block * EXT4_MIN_BLOCK_SIZE; - offset = do_div(logical_sb_block, blocksize); + logical_sb_block = (sb_block * EXT4_MIN_BLOCK_SIZE) / blocksize; + offset = do_div((sb_block * EXT4_MIN_BLOCK_SIZE), blocksize); bh = sb_bread(sb, logical_sb_block); if (!bh) { printk(KERN_ERR -- 1.5.4.rc2-dirty - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Updated PATCH] ext4: Use the correct block number when reading the super block.
On Mon, Dec 31, 2007 at 01:59:22PM +0530, Aneesh Kumar K.V wrote: If the block device hard sector size is larger than EXT4_MIN_BLOCK_SIZE we end up with wrong block number when reading super block. Ignore the patch. I got confused by the do_div syntax. The do_div already save the division result in logical_sb_block; -aneesh - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] ext4: Use the correct block number when reading the super block.
If the block device hard sector size is larger than EXT4_MIN_BLOCK_SIZE we end up with wrong block number when reading super block. Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] --- fs/ext4/super.c |4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 1ca0f54..832e1ad 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -1629,8 +1629,8 @@ static int ext4_fill_super (struct super_block *sb, void *data, int silent) * block sizes. We need to calculate the offset from buffer start. */ if (blocksize != EXT4_MIN_BLOCK_SIZE) { - logical_sb_block = sb_block * EXT4_MIN_BLOCK_SIZE; - offset = do_div(logical_sb_block, blocksize); + logical_sb_block = (sb_block * EXT4_MIN_BLOCK_SIZE) / blocksize; + offset = do_div((sb_block * EXT4_MIN_BLOCK_SIZE), blocksize); } else { logical_sb_block = sb_block; } -- 1.5.4.rc2-dirty - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Patch queue update
Hi Mingming, I have placed the updated patch queue at http://www.radian.org/~kvaneesh/ext4/dec-24-2007/ The .tar can be found at http://www.radian.org/~kvaneesh/ext4/dec-24-2007/patch-series.tar The changes involve the below attached diff. I also updated the commit message of mballoc core patch. a) remove the ext-truncate-mutex.patch b) Add the truncate mutex to read write semaphore conversion patch set. This is added at the top of unstable series expecting that it can be pushed to stable soon. c) Update mballoc and migrate patch to use i_data_sem. d) mballoc soft lockup fix. e) update mballoc core patch commit message. f) Update enable-delalloc-and-mballoc.patch patch to enabled write back mode by default. Delalloc only support write back. diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c index abc8900..643046b 100644 --- a/fs/ext4/balloc.c +++ b/fs/ext4/balloc.c @@ -526,7 +526,7 @@ static inline int rsv_is_empty(struct ext4_reserve_window *rsv) * when setting the reservation window size through ioctl before the file * is open for write (needs block allocation). * - * Needs truncate_mutex protection prior to call this function. + * Needs down_write(i_data_sem) protection prior to call this function. */ void ext4_init_block_alloc_info(struct inode *inode) { diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index ff83982..99e539d 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -1706,7 +1706,7 @@ static int ext4_ext_rm_idx(handle_t *handle, struct inode *inode, * This routine returns max. credits that the extent tree can consume. * It should be OK for low-performance paths like -writepage() * To allow many writing processes to fit into a single transaction, - * the caller should calculate credits under truncate_mutex and + * the caller should calculate credits under i_data_sem and * pass the actual path. */ int ext4_ext_calc_credits_for_insert(struct inode *inode, @@ -2272,7 +2272,8 @@ out: /* * Need to be called with - * mutex_lock(EXT4_I(inode)-truncate_mutex); + * down_read(EXT4_I(inode)-i_data_sem) if not allocating file system block + * (ie, create is zero). Otherwise down_write(EXT4_I(inode)-i_data_sem) */ int ext4_ext_get_blocks(handle_t *handle, struct inode *inode, ext4_lblk_t iblock, @@ -2514,7 +2515,7 @@ void ext4_ext_truncate(struct inode * inode, struct page *page) if (page) ext4_block_truncate_page(handle, page, mapping, inode-i_size); - mutex_lock(EXT4_I(inode)-truncate_mutex); + down_write(EXT4_I(inode)-i_data_sem); ext4_ext_invalidate_cache(inode); ext4_mb_discard_inode_preallocations(inode); @@ -2552,7 +2553,7 @@ out_stop: if (inode-i_nlink) ext4_orphan_del(handle, inode); - mutex_unlock(EXT4_I(inode)-truncate_mutex); + up_write(EXT4_I(inode)-i_data_sem); ext4_journal_stop(handle); } @@ -2616,6 +2617,7 @@ long ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len) * modify 1 super block, 1 block bitmap and 1 group descriptor. */ credits = EXT4_DATA_TRANS_BLOCKS(inode-i_sb) + 3; + down_write((EXT4_I(inode)-i_data_sem)); retry: while (ret = 0 ret max_blocks) { block = block + ret; @@ -2672,6 +2674,7 @@ retry: if (ret == -ENOSPC ext4_should_retry_alloc(inode-i_sb, retries)) goto retry; + up_write((EXT4_I(inode)-i_data_sem)); /* * Time to update the file size. * Update only when preallocation was requested beyond the file size. diff --git a/fs/ext4/file.c b/fs/ext4/file.c index a6b2aa1..ac35ec5 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -37,9 +37,9 @@ static int ext4_release_file (struct inode * inode, struct file * filp) if ((filp-f_mode FMODE_WRITE) (atomic_read(inode-i_writecount) == 1)) { - mutex_lock(EXT4_I(inode)-truncate_mutex); + down_write(EXT4_I(inode)-i_data_sem); ext4_discard_reservation(inode); - mutex_unlock(EXT4_I(inode)-truncate_mutex); + up_write(EXT4_I(inode)-i_data_sem); } if (is_dx(inode) filp-private_data) ext4_htree_free_dir_info(filp-private_data); diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index caec966..33018a5 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -346,7 +346,7 @@ static int ext4_block_to_path(struct inode *inode, * the whole chain, all way to the data (returns %NULL, *err == 0). * * Need to be called with - * mutex_lock(EXT4_I(inode)-truncate_mutex) + * down_read(EXT4_I(inode)-i_data_sem) */ static Indirect *ext4_get_branch(struct inode *inode, int depth, ext4_lblk_t *offsets, @@ -778,7 +778,8 @@ err_out: * * * Need to be called with - * mutex_lock(EXT4_I(inode)-truncate_mutex) + *
[PATCH] ext4: Fix the soft lockup with multi block allocator.
With the multi block allocator when we don't have prealloc space we discard the existing preallocaltion data and try to rebuild the buddy cache. While discarding the loop through the group specific prealloc list. If we find any particular prealloc space being used we mark the space busy. If we are not able to find enough free space and if we have any prealloc space busy we loop back again. With non preempted kernel this tight loop resulted in watchdog timer triggering soft lockup warning. Whe we are allocation the block we search the prealloc list and mark the prealloc space used via incrementing pa_count value. One after succesffuly allocating the block we need to update the block bitmap and this could actually involved a disk io if the bitmap need to read from the disk. This actually cause prealloc space to be marked used for quiet a long time. This inturn results in the discard logic going on tight loop resulting in watchdog timer triggering soft lockup warning. Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] --- fs/ext4/mballoc.c | 12 +++- 1 files changed, 3 insertions(+), 9 deletions(-) diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 844765c..cbc8143 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -3729,7 +3729,7 @@ static int ext4_mb_discard_group_preallocations(struct super_block *sb, struct list_head list; struct ext4_buddy e4b; int err; - int busy; + int busy = 0; int free = 0; mb_debug(discard preallocation for group %lu\n, group); @@ -3754,20 +3754,12 @@ static int ext4_mb_discard_group_preallocations(struct super_block *sb, INIT_LIST_HEAD(list); repeat: - busy = 0; ext4_lock_group(sb, group); list_for_each_entry_safe(pa, tmp, grp-bb_prealloc_list, pa_group_list) { spin_lock(pa-pa_lock); if (atomic_read(pa-pa_count)) { spin_unlock(pa-pa_lock); - /* FIXME!! -* It is quiet natural to have the pa being -* used on other cpus when we are trying free -* space - printk(KERN_ERR uh! busy PA\n); - dump_stack(); - */ busy = 1; continue; } @@ -3790,7 +3782,9 @@ repeat: /* if we still need more blocks and some PAs were used, try again */ if (free needed busy) { + busy = 0; ext4_unlock_group(sb, group); + schedule_timeout(HZ); goto repeat; } -- 1.5.4.rc0-dirty - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BUG: soft lockup - CPU#0 stuck for 11s! [fsstress:5534]
On Thu, Dec 20, 2007 at 08:02:42PM +0530, Aneesh Kumar K.V wrote: I am seeing this with the patch queue. I can reproduce this on x86 and powerpc. I see the file system full when this happens. The same happens even without delalloc enabled. The below patch fix the same for me. One thing i observed with the patch queue is, enabling delalloc by default. Delalloc doesn't handle the file system full case because there is no block reservation. Unless we have block reservation i guess we should disable delalloc by default. I will send a full patch with proper log message. diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 72e1920..8b45ac0 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -3791,6 +3791,7 @@ repeat: /* if we still need more blocks and some PAs were used, try again */ if (free needed busy) { ext4_unlock_group(sb, group); + schedule_timeout(HZ); goto repeat; } - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC] truncate_mutex to read_write semaphore
The series include the truncate_mutex to read write semaphore conversion. I am marking below some of the test results. For O_DIRECT workloads we won't see the contention on truncate mutex because we are doing a get_block under inode-i_mutex. For FIBMAP we won't see contention because the get_block get called under BKL. threaded read with low memory --- Top contenting locks were: (/proc/lock_stat output) q-__queue_lock: 12549 12572 10.65 3302.1636818.7846618 395721 3.47 49636.48 571453.47 inode-i_data.tree_lock-W: 3970 4026 2.62 33.39 3508.74 25924 95164 5.33 949.5980180.03 inode-i_data.tree_lock-R: 1937 2002 2.52 22.05 1528.78 19543 141863 5.57 119.72137126.60 ei-truncate_mutex#2: 4553 4769 169.62 1028484.20 39334253.92 19610 47069 31.74 102280.63 680802.57 second run - q-__queue_lock: 12499 12535 3.76247.7119799.9446341 405427 4.34 216.31527282.59 inode-i_data.tree_lock-W: 4009 4071 10.04 31.78 3434.95 25612 93458 7.29 61.87 78365.20 inode-i_data.tree_lock-R: 1919 1973 4.4330.93 1523.04 18953 142635 4.95 109.20137098.84 ei-truncate_mutex#2: 4346 4499 1546.39 896379.29 31107317.47 19051 48223 37.94 122579.25 628968.65 The above result implies that the threaded read with low memory (booted with mem=512M on a 16 cpu x86-64) results in contention on truncate_mutex. threaded read with low memory after converting to i_data_sem --- ei-i_data_sem-R: 0 0 0.000.00 0.00 18017 48801 38.12 3494783.37 22982474.21 ei-i_data_sem-R: 0 0 0.000.00 0.00 18233 49118 45.09 4953783.87 32699001.46 As you can see from the /proc/lock_stat output above the write semaphore is not taken at all. threaded write -- ei-i_data_sem-W: 0 0 0.00 0.00 0.00 24 64163 41.04 2620905.3216331786.48 ei-i_data_sem-R: 0 0 0.00 0.00 0.00 13352 83969 51.40 1212864.742834511.75 Here we see some read semphore acquisition. We take read mode of the semaphore to not content in the overwrite case. We see no contention here because the write gets done under inode-i_mutex sb-s_type-i_mutex_key#1: 313958 313962 3650.35 99510834.17 4881402594.11 314481 616528 37.22 7579553.9754139119.82 -- sb-s_type-i_mutex_key#1 313962 [8027aa80] generic_file_aio_write+0x4f/0xc2 sb-s_type-i_mutex_key#1 0 [802a538c] generic_file_llseek+0x36/0x98 second-run - ei-i_data_sem-W: 00 0.00 0.00 0.00 2 6114341.56 9299754.4515811211.79 ei-i_data_sem-R: 00 0.00 0.00 0.00 13272 8244268.40 1632405.222877135.32 sb-s_type-i_mutex_key#1: 441031 441163 10873.77 144350457.93 4988289572.34 441679 742079 163.0515158665.5659655118.60 -- sb-s_type-i_mutex_key#1 441163 [802a538c] generic_file_llseek+0x36/0x98 sb-s_type-i_mutex_key#1 0 [8027aa80] generic_file_aio_write+0x4f/0xc2 The test program is at http://www.radian.org/~kvaneesh/ext4/truncate_mutex/ The file system is modified to create highly fragmented file via frag.c - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/3] ext4: Make ext4_get_blocks_wrap take the truncate_mutex early.
When doing a migrate from ext3 to ext4 inode we need to make sure the test for inode type and walking inode data happens inside lock. To make this happen move truncate_mutex early before checking the i_flags. This actually should enable us to remove the verify_chain(). Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] --- fs/ext4/extents.c |9 -- fs/ext4/inode.c | 69 +- include/linux/ext4_fs.h | 12 ++-- 3 files changed, 23 insertions(+), 67 deletions(-) diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index 8528774..98ef84d 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -2225,6 +2225,10 @@ out: return err ? err : allocated; } +/* + * Need to be called with + * mutex_lock(EXT4_I(inode)-truncate_mutex); + */ int ext4_ext_get_blocks(handle_t *handle, struct inode *inode, ext4_fsblk_t iblock, unsigned long max_blocks, struct buffer_head *bh_result, @@ -2240,7 +2244,6 @@ int ext4_ext_get_blocks(handle_t *handle, struct inode *inode, __clear_bit(BH_New, bh_result-b_state); ext_debug(blocks %d/%lu requested for inode %u\n, (int) iblock, max_blocks, (unsigned) inode-i_ino); - mutex_lock(EXT4_I(inode)-truncate_mutex); /* check in cache */ goal = ext4_ext_in_cache(inode, iblock, newex); @@ -2414,8 +2417,6 @@ out2: ext4_ext_drop_refs(path); kfree(path); } - mutex_unlock(EXT4_I(inode)-truncate_mutex); - return err ? err : allocated; } @@ -2544,6 +2545,7 @@ long ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len) * modify 1 super block, 1 block bitmap and 1 group descriptor. */ credits = EXT4_DATA_TRANS_BLOCKS(inode-i_sb) + 3; + mutex_lock(EXT4_I(inode)-truncate_mutex) retry: while (ret = 0 ret max_blocks) { block = block + ret; @@ -2600,6 +2602,7 @@ retry: if (ret == -ENOSPC ext4_should_retry_alloc(inode-i_sb, retries)) goto retry; + mutex_unlock(EXT4_I(inode)-truncate_mutex) /* * Time to update the file size. * Update only when preallocation was requested beyond the file size. diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 5489703..9cf37b2 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -243,13 +243,6 @@ static inline void add_chain(Indirect *p, struct buffer_head *bh, __le32 *v) p-bh = bh; } -static int verify_chain(Indirect *from, Indirect *to) -{ - while (from = to from-key == *from-p) - from++; - return (from to); -} - /** * ext4_block_to_path - parse the block number into array of offsets * @inode: inode in question (we are only interested in its superblock) @@ -344,10 +337,11 @@ static int ext4_block_to_path(struct inode *inode, * (pointer to last triple returned, [EMAIL PROTECTED] == 0) * or when it gets an IO error reading an indirect block * (ditto, [EMAIL PROTECTED] == -EIO) - * or when it notices that chain had been changed while it was reading - * (ditto, [EMAIL PROTECTED] == -EAGAIN) * or when it reads all @depth-1 indirect blocks successfully and finds * the whole chain, all way to the data (returns %NULL, *err == 0). + * + * Need to be called with + * mutex_lock(EXT4_I(inode)-truncate_mutex) */ static Indirect *ext4_get_branch(struct inode *inode, int depth, int *offsets, Indirect chain[4], int *err) @@ -365,9 +359,6 @@ static Indirect *ext4_get_branch(struct inode *inode, int depth, int *offsets, bh = sb_bread(sb, le32_to_cpu(p-key)); if (!bh) goto failure; - /* Reader: pointers */ - if (!verify_chain(chain, p)) - goto changed; add_chain(++p, bh, (__le32*)bh-b_data + *++offsets); /* Reader: end */ if (!p-key) @@ -375,10 +366,6 @@ static Indirect *ext4_get_branch(struct inode *inode, int depth, int *offsets, } return NULL; -changed: - brelse(bh); - *err = -EAGAIN; - goto no_block; failure: *err = -EIO; no_block: @@ -782,6 +769,10 @@ err_out: * return 0, # of blocks mapped or allocated. * return = 0, if plain lookup failed. * return 0, error case. + * + * + * Need to be called with + * mutex_lock(EXT4_I(inode)-truncate_mutex) */ int ext4_get_blocks_handle(handle_t *handle, struct inode *inode, sector_t iblock, unsigned long maxblocks, @@ -819,18 +810,6 @@ int ext4_get_blocks_handle(handle_t *handle, struct inode *inode, while (count maxblocks count = blocks_to_boundary) { ext4_fsblk_t blk; - if (!verify_chain(chain, partial
[RFC][PATCH 2/3] ext4: Convert truncate_mutex to read write semaphore.
We are currently taking the truncate_mutex for every read. This would have performance impact on large CPU configuration. Convert the lock to read write semaphore and take read lock when we are trying to read the file. Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] --- fs/ext4/balloc.c |2 +- fs/ext4/extents.c |8 fs/ext4/file.c|4 ++-- fs/ext4/inode.c | 36 ++-- fs/ext4/ioctl.c |4 ++-- fs/ext4/super.c |2 +- include/linux/ext4_fs.h | 22 +++--- include/linux/ext4_fs_i.h |6 +++--- 8 files changed, 46 insertions(+), 38 deletions(-) diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c index 71ee95e..4eca63d 100644 --- a/fs/ext4/balloc.c +++ b/fs/ext4/balloc.c @@ -463,7 +463,7 @@ static inline int rsv_is_empty(struct ext4_reserve_window *rsv) * when setting the reservation window size through ioctl before the file * is open for write (needs block allocation). * - * Needs truncate_mutex protection prior to call this function. + * Needs down_write(i_data_sem) protection prior to call this function. */ void ext4_init_block_alloc_info(struct inode *inode) { diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index 9dd4746..a2475d4 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -1666,7 +1666,7 @@ int ext4_ext_rm_idx(handle_t *handle, struct inode *inode, * This routine returns max. credits that the extent tree can consume. * It should be OK for low-performance paths like -writepage() * To allow many writing processes to fit into a single transaction, - * the caller should calculate credits under truncate_mutex and + * the caller should calculate credits under i_data_sem and * pass the actual path. */ int ext4_ext_calc_credits_for_insert(struct inode *inode, @@ -2227,7 +2227,7 @@ out: /* * Need to be called with - * mutex_lock(EXT4_I(inode)-truncate_mutex); + * (EXT4_I(inode)-i_data_sem); */ int ext4_ext_get_blocks(handle_t *handle, struct inode *inode, ext4_fsblk_t iblock, @@ -2446,7 +2446,7 @@ void ext4_ext_truncate(struct inode * inode, struct page *page) if (page) ext4_block_truncate_page(handle, page, mapping, inode-i_size); - mutex_lock(EXT4_I(inode)-truncate_mutex); + down_write(EXT4_I(inode)-i_data_sem); ext4_ext_invalidate_cache(inode); /* @@ -2482,7 +2482,7 @@ out_stop: if (inode-i_nlink) ext4_orphan_del(handle, inode); - mutex_unlock(EXT4_I(inode)-truncate_mutex); + up_write(EXT4_I(inode)-i_data_sem); ext4_journal_stop(handle); } diff --git a/fs/ext4/file.c b/fs/ext4/file.c index 1a81cd6..ba4cf05 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -37,9 +37,9 @@ static int ext4_release_file (struct inode * inode, struct file * filp) if ((filp-f_mode FMODE_WRITE) (atomic_read(inode-i_writecount) == 1)) { - mutex_lock(EXT4_I(inode)-truncate_mutex); + down_write(EXT4_I(inode)-i_data_sem); ext4_discard_reservation(inode); - mutex_unlock(EXT4_I(inode)-truncate_mutex); + up_write(EXT4_I(inode)-i_data_sem); } if (is_dx(inode) filp-private_data) ext4_htree_free_dir_info(filp-private_data); diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 9cf37b2..8aa8855 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -341,7 +341,7 @@ static int ext4_block_to_path(struct inode *inode, * the whole chain, all way to the data (returns %NULL, *err == 0). * * Need to be called with - * mutex_lock(EXT4_I(inode)-truncate_mutex) + * (EXT4_I(inode)-i_data_sem) */ static Indirect *ext4_get_branch(struct inode *inode, int depth, int *offsets, Indirect chain[4], int *err) @@ -772,7 +772,7 @@ err_out: * * * Need to be called with - * mutex_lock(EXT4_I(inode)-truncate_mutex) + * (EXT4_I(inode)-i_data_sem) */ int ext4_get_blocks_handle(handle_t *handle, struct inode *inode, sector_t iblock, unsigned long maxblocks, @@ -859,7 +859,7 @@ int ext4_get_blocks_handle(handle_t *handle, struct inode *inode, err = ext4_splice_branch(handle, inode, iblock, partial, indirect_blks, count); /* -* i_disksize growing is protected by truncate_mutex. Don't forget to +* i_disksize growing is protected by i_data_sem. Don't forget to * protect it if you're about to implement concurrent * ext4_get_block() -bzzz */ @@ -889,6 +889,30 @@ out: #define DIO_CREDITS (EXT4_RESERVE_TRANS_BLOCKS + 32) +int ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block, + unsigned long max_blocks, struct buffer_head *bh, + int create, int extend_disksize
[RFC][PATCH 1/3] ext4: Make ext4_get_blocks_wrap take the truncate_mutex early.
When doing a migrate from ext3 to ext4 inode we need to make sure the test for inode type and walking inode data happens inside lock. To make this happen move truncate_mutex early before checking the i_flags. This actually should enable us to remove the verify_chain(). Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] --- fs/ext4/extents.c |7 +++-- fs/ext4/inode.c | 69 +- include/linux/ext4_fs.h | 12 ++-- 3 files changed, 21 insertions(+), 67 deletions(-) diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index 8528774..9dd4746 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -2225,6 +2225,10 @@ out: return err ? err : allocated; } +/* + * Need to be called with + * mutex_lock(EXT4_I(inode)-truncate_mutex); + */ int ext4_ext_get_blocks(handle_t *handle, struct inode *inode, ext4_fsblk_t iblock, unsigned long max_blocks, struct buffer_head *bh_result, @@ -2240,7 +2244,6 @@ int ext4_ext_get_blocks(handle_t *handle, struct inode *inode, __clear_bit(BH_New, bh_result-b_state); ext_debug(blocks %d/%lu requested for inode %u\n, (int) iblock, max_blocks, (unsigned) inode-i_ino); - mutex_lock(EXT4_I(inode)-truncate_mutex); /* check in cache */ goal = ext4_ext_in_cache(inode, iblock, newex); @@ -2414,8 +2417,6 @@ out2: ext4_ext_drop_refs(path); kfree(path); } - mutex_unlock(EXT4_I(inode)-truncate_mutex); - return err ? err : allocated; } diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 5489703..9cf37b2 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -243,13 +243,6 @@ static inline void add_chain(Indirect *p, struct buffer_head *bh, __le32 *v) p-bh = bh; } -static int verify_chain(Indirect *from, Indirect *to) -{ - while (from = to from-key == *from-p) - from++; - return (from to); -} - /** * ext4_block_to_path - parse the block number into array of offsets * @inode: inode in question (we are only interested in its superblock) @@ -344,10 +337,11 @@ static int ext4_block_to_path(struct inode *inode, * (pointer to last triple returned, [EMAIL PROTECTED] == 0) * or when it gets an IO error reading an indirect block * (ditto, [EMAIL PROTECTED] == -EIO) - * or when it notices that chain had been changed while it was reading - * (ditto, [EMAIL PROTECTED] == -EAGAIN) * or when it reads all @depth-1 indirect blocks successfully and finds * the whole chain, all way to the data (returns %NULL, *err == 0). + * + * Need to be called with + * mutex_lock(EXT4_I(inode)-truncate_mutex) */ static Indirect *ext4_get_branch(struct inode *inode, int depth, int *offsets, Indirect chain[4], int *err) @@ -365,9 +359,6 @@ static Indirect *ext4_get_branch(struct inode *inode, int depth, int *offsets, bh = sb_bread(sb, le32_to_cpu(p-key)); if (!bh) goto failure; - /* Reader: pointers */ - if (!verify_chain(chain, p)) - goto changed; add_chain(++p, bh, (__le32*)bh-b_data + *++offsets); /* Reader: end */ if (!p-key) @@ -375,10 +366,6 @@ static Indirect *ext4_get_branch(struct inode *inode, int depth, int *offsets, } return NULL; -changed: - brelse(bh); - *err = -EAGAIN; - goto no_block; failure: *err = -EIO; no_block: @@ -782,6 +769,10 @@ err_out: * return 0, # of blocks mapped or allocated. * return = 0, if plain lookup failed. * return 0, error case. + * + * + * Need to be called with + * mutex_lock(EXT4_I(inode)-truncate_mutex) */ int ext4_get_blocks_handle(handle_t *handle, struct inode *inode, sector_t iblock, unsigned long maxblocks, @@ -819,18 +810,6 @@ int ext4_get_blocks_handle(handle_t *handle, struct inode *inode, while (count maxblocks count = blocks_to_boundary) { ext4_fsblk_t blk; - if (!verify_chain(chain, partial)) { - /* -* Indirect block might be removed by -* truncate while we were reading it. -* Handling of that case: forget what we've -* got now. Flag the err as EAGAIN, so it -* will reread. -*/ - err = -EAGAIN; - count = 0; - break; - } blk = le32_to_cpu(*(chain[depth-1].p + count)); if (blk == first_block + count
[RFC] truncate_mutex read write semaphore conversion
The below patchset is NOT for patch queue. I am posting it here to get feedback regarding the approach and what test I need to run to make sure we are not breaking any locking rules. I have run dbench, ffsb, fsstress, fs_di, fs_inode, fsx_linux . Bonnie didn't run completely. In the automated setup i had it didn't save any logs to find out what went wrong. I will be looking into this - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC][PATCH 3/3] ext4: Take read lock during overwrite case.
When we are overwriting a file and not actually allocating new file system blocks we need to take only the read lock on i_data_sem. Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] --- fs/ext4/inode.c | 32 1 files changed, 24 insertions(+), 8 deletions(-) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 8aa8855..4489bfd 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -894,11 +894,31 @@ int ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block, int create, int extend_disksize) { int retval; - if (create) { - down_write((EXT4_I(inode)-i_data_sem)); + /* +* Try to see if we can get the block without requesting +* for new file system block. +*/ + down_read((EXT4_I(inode)-i_data_sem)); + if (EXT4_I(inode)-i_flags EXT4_EXTENTS_FL) { + retval = ext4_ext_get_blocks(handle, inode, block, max_blocks, + bh, 0, 0); } else { - down_read((EXT4_I(inode)-i_data_sem)); + retval = ext4_get_blocks_handle(handle, inode, block, max_blocks, + bh, 0, 0); } + up_read((EXT4_I(inode)-i_data_sem)); + if (!create || (retval 0)) + return retval; + + /* +* We need to allocate new blocks which will result +* in i_data update +*/ + down_write((EXT4_I(inode)-i_data_sem)); + /* +* We need to check for EXT4 here because migrate +* could have changed the inode type in between +*/ if (EXT4_I(inode)-i_flags EXT4_EXTENTS_FL) { retval = ext4_ext_get_blocks(handle, inode, block, max_blocks, bh, create, extend_disksize); @@ -906,11 +926,7 @@ int ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block, retval = ext4_get_blocks_handle(handle, inode, block, max_blocks, bh, create, extend_disksize); } - if (create) { - up_write((EXT4_I(inode)-i_data_sem)); - } else { - up_read((EXT4_I(inode)-i_data_sem)); - } + up_write((EXT4_I(inode)-i_data_sem)); return retval; } static int ext4_get_block(struct inode *inode, sector_t iblock, -- 1.5.3.6.985.g65c6a4-dirty - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] ext4: Check for the correct error return from ext4_ext_get_blocks
ext4_ext_get_blocks returns negative values on error. We should check for = 0 Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] --- fs/ext4/extents.c |4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index a2475d4..ce57245 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -2558,8 +2558,8 @@ retry: ret = ext4_ext_get_blocks(handle, inode, block, max_blocks, map_bh, EXT4_CREATE_UNINITIALIZED_EXT, 0); - WARN_ON(!ret); - if (!ret) { + WARN_ON(ret = 0); + if (ret = 0) { ext4_error(inode-i_sb, ext4_fallocate, ext4_ext_get_blocks returned 0! inode#%lu , block=%llu, max_blocks=%llu, -- 1.5.3.6.985.g65c6a4-dirty - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Understanding mballoc
Alex, This is my attempt at understanding multi block allocator. I have few questions marked as FIXME below. Can you help answering them. Most of this data is already in the patch queue as commit message. I have updated some details regarding preallocation. Once we understand the details i will update the patch queue commit message. The allocation request involve request for multiple number of blocks near to the goal(block) value specified. During initialization phase of the allocator we decide to use the group preallocation or inode preallocation depending on the size of the request. If the request is smaller than sbi-s_mb_small_req we select the group preallocation. This is needed because we would like to have the small files closer. The value of s_mb_small_req is 256 blocks. /* FIXME!! Does the value of s_mb_small_req depend on the s_mb_prealloc_table ? If yes, then how do we update the s_mb_small_req. We have a hook to update the prealloc table via /proc. But that doesn't update the s_mb_small_req. */ /* FIXME!! The code within ext4_mb_group_or_file does below. if (ac-ac_o_ex.fe_len = sbi-s_mb_large_req) return; if (ac-ac_o_ex.fe_len = sbi-s_mb_small_req) return; That doesn't seem to make sense because the if the len is greater than s_mb_sall_req it will be always greater than s_mb_large_req. What are we expecting to do here ? */ First stage the allocator looks at the inode prealloc list ext4_inode_info-i_prealloc_list contain list of prealloc spaces for this particular inode. The inode prealloc space is represented as pa_lstart - the logical start block for this prealloc space pa_pstart - the physical start block for this prealloc space pa_len- lenght for this prealloc space pa_free - free space available in this prealloc space The inode preallocation space is used looking at the _logical_ start block. If only the logical file block falls within the range of prealloc space we will consume the particular prealloc space. This make sure that that the we have contiguous physical blocks representing the file blocks The important thing to be noted in case of inode prealloc space is that we don't modify the values associated to inode prealloc space except pa_free. If we are not able to find blocks in the inode prealloc space and if we have the group allocation flag set then we look at the locality group prealloc space. These are per CPU prealloc list repreasented as ext4_sb_info.s_locality_groups[smp_processor_id()] /* FIXME!! After getting the locality group related to the current CPU we could be scheduled out and scheduled in on different CPU. So why are we putting the locality group per cpu ? */ The locality group prealloc space is used looking at whether we have enough free space (pa_free) withing the prealloc space. If we can't allocate blocks via inode prealloc or/and locality group prealloc then we look at the buddy cache. The buddy cache is represented by ext4_sb_info.s_buddy_cache (struct inode) whose file offset gets mapped to the buddy and bitmap information regarding different groups. The buddy information is attached to buddy cache inode so that we can access them through the page cache. The information regarding each group is loaded via ext4_mb_load_buddy. The information involve block bitmap and buddy information. The information are stored in the inode as {page} [ group 0 buddy][ group 0 bitmap] [group 1][ group 1]... one block each for bitmap and buddy information. So for each group we take up 2 blocks. A page can contain blocks_per_page (PAGE_CACHE_SIZE / blocksize) blocks. So it can have information regarding groups_per_page which is blocks_per_page/2 Buddy cachche inode is not stored on disk. The inode get thrown away at the end when unmounting the disk. We look for count number of blocks in the buddy cache. If we were able to locate that many free blocks we return with additional information regarding rest of the contiguous physical block available /* FIXME: We need to explain the normalization of the request length. What are the conditions we are checking the request length against. Why are group request always requested at 512 blocks ? Buddy scanning follows different criteria. We need to explain what a criteria is how they infulence the allocation */ If we allocate more space than we requested for then the remaining space get added to the locality group prealloc space or inode prealloc space. Both the prealloc space are getting populated as above. So for the first request we will hit the buddy cache which will result in this prealloc space getting filled. The prealloc space is then later used for the subsequent request. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
circular locking dependency detected
=== [ INFO: possible circular locking dependency detected ] 2.6.24-rc3 #6 --- bash/2294 is trying to acquire lock: (journal-j_list_lock){--..}, at: [c01eee2f] journal_try_to_free_buffers+0x76/0x10c but task is already holding lock: (inode_lock){--..}, at: [c01864b6] drop_pagecache+0x48/0xd8 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: - #1 (inode_lock){--..}: [c0143929] __lock_acquire+0xa31/0xc1a [c0143b8c] lock_acquire+0x7a/0x94 [c0447ae5] _spin_lock+0x2e/0x58 [c0185fae] __mark_inode_dirty+0xd8/0x15e [c018996f] __set_page_dirty+0xfb/0x10a [c01899fe] mark_buffer_dirty+0x80/0x86 [c01ed1b6] __journal_temp_unlink_buffer+0xc1/0xc5 [c01ed2f3] __journal_unfile_buffer+0xb/0x15 [c01ed338] __journal_refile_buffer+0x3b/0x85 [c01efe1b] journal_commit_transaction+0xe7f/0x10ec [c01f34b0] kjournald+0x131/0x35f [c0138ee9] kthread+0x3b/0x62 [c0105c7f] kernel_thread_helper+0x7/0x10 [] 0x - #0 (journal-j_list_lock){--..}: [c0143819] __lock_acquire+0x921/0xc1a [c0143b8c] lock_acquire+0x7a/0x94 [c0447ae5] _spin_lock+0x2e/0x58 [c01eee2f] journal_try_to_free_buffers+0x76/0x10c [c01cb7a2] ext3_releasepage+0x68/0x74 [c0151ff1] try_to_release_page+0x33/0x44 [c01578b0] __invalidate_mapping_pages+0x74/0xe0 [c01864de] drop_pagecache+0x70/0xd8 [c018657c] drop_caches_sysctl_handler+0x36/0x4e [c019efaa] proc_sys_write+0x6b/0x85 [c016da2b] vfs_write+0x90/0x119 [c016dfe2] sys_write+0x3d/0x61 [c0104f6e] sysenter_past_esp+0x5f/0xa5 [] 0x other info that might help us debug this: 2 locks held by bash/2294: #0: (type-s_umount_key#16){}, at: [c01864a6] drop_pagecache+0x38/0xd8 #1: (inode_lock){--..}, at: [c01864b6] drop_pagecache+0x48/0xd8 stack backtrace: [c0105ffb] show_trace_log_lvl+0x1a/0x2f [c0106875] show_trace+0x12/0x14 [c0106967] dump_stack+0x16/0x18 [c0141b33] print_circular_bug_tail+0x5f/0x68 [c0143819] __lock_acquire+0x921/0xc1a [c0143b8c] lock_acquire+0x7a/0x94 [c0447ae5] _spin_lock+0x2e/0x58 [c01eee2f] journal_try_to_free_buffers+0x76/0x10c [c01cb7a2] ext3_releasepage+0x68/0x74 [c0151ff1] try_to_release_page+0x33/0x44 [c01578b0] __invalidate_mapping_pages+0x74/0xe0 [c01864de] drop_pagecache+0x70/0xd8 [c018657c] drop_caches_sysctl_handler+0x36/0x4e [c019efaa] proc_sys_write+0x6b/0x85 [c016da2b] vfs_write+0x90/0x119 [c016dfe2] sys_write+0x3d/0x61 [c0104f6e] sysenter_past_esp+0x5f/0xa5 === - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ext4 still broken on arm (at least)
On Tue, Nov 27, 2007 at 09:14:44PM -0800, Andrew Morton wrote: fs/ext4/mballoc.c: In function `ext4_mb_generate_buddy': fs/ext4/mballoc.c:836: error: implicit declaration of function `ext2_find_next_bit' I actually sent in a patch which changes asking for review to linux-arch. I haven't got the response yet. Attaching the patch below Introduce ext4_find_next_bit From: Aneesh Kumar K.V [EMAIL PROTECTED] This gets used by the ext4 multi block allocator patches. Also add generic_find_next_le_bit Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] Cc: linux-ext4@vger.kernel.org Signed-off-by: Andrew Morton [EMAIL PROTECTED] --- include/asm-arm/bitops.h |2 + include/asm-generic/bitops/ext2-non-atomic.h |2 + include/asm-generic/bitops/le.h |4 ++ include/asm-m68k/bitops.h|2 + include/asm-m68knommu/bitops.h |2 + include/asm-powerpc/bitops.h |4 ++ include/asm-s390/bitops.h|2 + include/linux/ext4_fs.h |1 + lib/find_next_bit.c | 43 ++ 9 files changed, 62 insertions(+), 0 deletions(-) diff --git a/include/asm-arm/bitops.h b/include/asm-arm/bitops.h index 47a6b08..5c60bfc 100644 --- a/include/asm-arm/bitops.h +++ b/include/asm-arm/bitops.h @@ -310,6 +310,8 @@ static inline int constant_fls(int x) _find_first_zero_bit_le(p,sz) #define ext2_find_next_zero_bit(p,sz,off) \ _find_next_zero_bit_le(p,sz,off) +#define ext2_find_next_bit(p, sz, off) \ + _find_next_bit_le(p, sz, off) /* * Minix is defined to use little-endian byte ordering. diff --git a/include/asm-generic/bitops/ext2-non-atomic.h b/include/asm-generic/bitops/ext2-non-atomic.h index 1697404..63cf822 100644 --- a/include/asm-generic/bitops/ext2-non-atomic.h +++ b/include/asm-generic/bitops/ext2-non-atomic.h @@ -14,5 +14,7 @@ generic_find_first_zero_le_bit((unsigned long *)(addr), (size)) #define ext2_find_next_zero_bit(addr, size, off) \ generic_find_next_zero_le_bit((unsigned long *)(addr), (size), (off)) +#define ext2_find_next_bit(addr, size, off) \ + generic_find_next_le_bit((unsigned long *)(addr), (size), (off)) #endif /* _ASM_GENERIC_BITOPS_EXT2_NON_ATOMIC_H_ */ diff --git a/include/asm-generic/bitops/le.h b/include/asm-generic/bitops/le.h index b9c7e5d..80e3bf1 100644 --- a/include/asm-generic/bitops/le.h +++ b/include/asm-generic/bitops/le.h @@ -20,6 +20,8 @@ #define generic___test_and_clear_le_bit(nr, addr) __test_and_clear_bit(nr, addr) #define generic_find_next_zero_le_bit(addr, size, offset) find_next_zero_bit(addr, size, offset) +#define generic_find_next_le_bit(addr, size, offset) \ + find_next_bit(addr, size, offset) #elif defined(__BIG_ENDIAN) @@ -42,6 +44,8 @@ extern unsigned long generic_find_next_zero_le_bit(const unsigned long *addr, unsigned long size, unsigned long offset); +extern unsigned long generic_find_next_le_bit(const unsigned long *addr, + unsigned long size, unsigned long offset); #else #error Please fix asm/byteorder.h diff --git a/include/asm-m68k/bitops.h b/include/asm-m68k/bitops.h index 2976b5d..83d1f28 100644 --- a/include/asm-m68k/bitops.h +++ b/include/asm-m68k/bitops.h @@ -410,6 +410,8 @@ static inline int ext2_find_next_zero_bit(const void *vaddr, unsigned size, res = ext2_find_first_zero_bit (p, size - 32 * (p - addr)); return (p - addr) * 32 + res; } +#define ext2_find_next_bit(addr, size, off) \ + generic_find_next_le_bit((unsigned long *)(addr), (size), (off)) #endif /* __KERNEL__ */ diff --git a/include/asm-m68knommu/bitops.h b/include/asm-m68knommu/bitops.h index f8dfb7b..f43afe1 100644 --- a/include/asm-m68knommu/bitops.h +++ b/include/asm-m68knommu/bitops.h @@ -294,6 +294,8 @@ found_middle: return result + ffz(__swab32(tmp)); } +#define ext2_find_next_bit(addr, size, off) \ + generic_find_next_le_bit((unsigned long *)(addr), (size), (off)) #include asm-generic/bitops/minix.h #endif /* __KERNEL__ */ diff --git a/include/asm-powerpc/bitops.h b/include/asm-powerpc/bitops.h index 733b4af..220d9a7 100644 --- a/include/asm-powerpc/bitops.h +++ b/include/asm-powerpc/bitops.h @@ -359,6 +359,8 @@ static __inline__ int test_le_bit(unsigned long nr, unsigned long generic_find_next_zero_le_bit(const unsigned long *addr, unsigned long size, unsigned long offset); +unsigned long generic_find_next_le_bit(const unsigned long *addr, + unsigned long size, unsigned long offset); /* Bitmap functions for the ext2 filesystem */ #define ext2_set_bit(nr,addr) \ @@ -378,6 +380,8 @@ unsigned long generic_find_next_zero_le_bit(const unsigned long *addr
blk bitmap validation test results
Andreas suggested me to get the iozone results after multiple runs. I don't see any performance issue with the blk bitmap validation changes now. v2.6.24-rc3-35-g2e12044 File VM system latencies in microseconds - smaller is better -- Host OS 0K File 10K File MmapProtPage Create Delete Create Delete Latency Fault Fault - - -- -- -- -- --- - - elm3a150. Linux 2.6.24- 28.8 24.9 90.3 52.213.4K 1.796 3.0 elm3a150. Linux 2.6.24- 29.2 25.1 90.3 52.013.4K 1.829 3.0 elm3a150. Linux 2.6.24- 28.7 24.8 90.2 51.213.3K 1.971 3.0 v2.6.24-rc3-35-g2e12044 + with block bitmap patch File VM system latencies in microseconds - smaller is better -- Host OS 0K File 10K File MmapProtPage Create Delete Create Delete Latency Fault Fault - - -- -- -- -- --- - - elm3a150. Linux 2.6.24- 28.5 24.7 89.3 52.113.4K 1.785 3.0 elm3a150. Linux 2.6.24- 28.7 24.9 89.4 52.213.3K 1.792 3.0 elm3a150. Linux 2.6.24- 29.1 24.8 90.1 52.413.3K 1.779 3.0 elm3a150. Linux 2.6.24- 29.2 24.9 90.4 52.013.9K 1.848 3.0 iozones results are attached. Iozone: Performance Test of File I/O Version $Revision: 3.226 $ Compiled for 64 bit mode. Build: linux Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins Al Slater, Scott Rhine, Mike Wisner, Ken Goss Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR, Randy Dunlap, Mark Montague, Dan Million, Jean-Marc Zucconi, Jeff Blomberg, Erik Habbinga, Kris Strecker. Run began: Thu Nov 22 03:03:46 2007 Auto Mode Command line used: /usr/local/autobench/sources/iozone/src/current/iozone -a -f /ext4/test.iozone Output is in Kbytes/sec Time Resolution = 0.01 seconds. Processor cache size set to 1024 Kbytes. Processor cache line size set to 32 bytes. File stride size set to 17 * record size. random random bkwd record stride KB reclen write rewritereadrereadread write read rewriteread fwrite frewrite fread freread 64 4 176276 432398 1279447 1484662 1066042 404998 874936 387461 1163036 190947 471099 1083249 1421755 64 8 192178 481234 1638743 1879725 1679761 512471 1119387 429630 1828508 212570 520419 1492919 1768283 64 16 206519 515423 1690338 2006158 2203800 561808 1336792 394870 2133730 259156 633392 1492919 1947927 64 32 209093 537079 1638743 2203800 2662899 621657 1363961 316997 2537060 353760 998622 1947927 2067979 64 64 213415 537079 1526887 2379626 2923952 592827 1330167 219883 2892445 215471 548044 1363961 2278628 128 4 184440 428193 1319723 1487977 1133103 402510 934004 435134 1123617 191002 456986 1266785 1391557 128 8 202837 504196 1504659 1827299 1705404 506097 1231904 483316 1827299 210884 496272 1539168 1749872 128 16 205398 554157 1827299 2058509 223 595328 1802755 499970 2248149 233950 617941 1663139 1997245 128 32 214079 603357 1827299 2166499 2608629 609522 1543594 435134 2673584 273382 684954 1732927 2098744 128 64 222693 568836 1727352 2248149 2985839 640042 1543594 324039 2969325 364770 1007629 1689305 223 128 128 224556 546819 1487977 2377579 3199360 624409 1424795 220679 3199360 223342 592699 1421023 2326073 256 4 191199 435452 1286217 1422540 1137672 408918 904732 443730 1153561 193230 413008 1300235 1398455 256 8 204140 501794 1652404 1802167 1683493 502733 1384034 541253 1752173 208825 515279 1600674 1766587 256 16 211208 508929 1649865 1985448 2247235 543720 1479379 563701 2330140 224172 523570 1817419 1970871 256 32 221262 555248 1894373 2170027 2665657 632056 1639786 539079 2533571 247816 609801 1842366 2118645 256 64 220670 568777 1868008 2247235 2975955 616807 1729594 453092 3043436 278835 706964 2015259 2223962 256 128 221627 584570
[Take 2] ext2/3/4 block bitmap validation patches
This is the updated ext2/3/4 block bitmap validation patches Changes from the last post a) moved the bh_uptodate_or_lock and bh_submit_read to fs/buffer.c and added EXPORT_SYMBOL b) Updated bh_submit_read not to release buffer on failure. This handles one reference handling bug in the error patch. - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] ext2: add block bitmap validation
When a new block bitmap is read from disk in read_block_bitmap() there are a few bits that should ALWAYS be set. In particular, the blocks given corresponding to block bitmap, inode bitmap and inode tables. Validate the block bitmap against these blocks. Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] --- fs/ext2/balloc.c | 81 -- 1 files changed, 72 insertions(+), 9 deletions(-) diff --git a/fs/ext2/balloc.c b/fs/ext2/balloc.c index 377ad17..4b5c511 100644 --- a/fs/ext2/balloc.c +++ b/fs/ext2/balloc.c @@ -69,9 +69,53 @@ struct ext2_group_desc * ext2_get_group_desc(struct super_block * sb, return desc + offset; } +static int ext2_valid_block_bitmap(struct super_block *sb, + struct ext2_group_desc *desc, + unsigned int block_group, + struct buffer_head *bh) +{ + ext2_grpblk_t offset; + ext2_grpblk_t next_zero_bit; + ext2_fsblk_t bitmap_blk; + ext2_fsblk_t group_first_block; + + group_first_block = ext2_group_first_block_no(sb, block_group); + + /* check whether block bitmap block number is set */ + bitmap_blk = le32_to_cpu(desc-bg_block_bitmap); + offset = bitmap_blk - group_first_block; + if (!ext2_test_bit(offset, bh-b_data)) + /* bad block bitmap */ + goto err_out; + + /* check whether the inode bitmap block number is set */ + bitmap_blk = le32_to_cpu(desc-bg_inode_bitmap); + offset = bitmap_blk - group_first_block; + if (!ext2_test_bit(offset, bh-b_data)) + /* bad block bitmap */ + goto err_out; + + /* check whether the inode table block number is set */ + bitmap_blk = le32_to_cpu(desc-bg_inode_table); + offset = bitmap_blk - group_first_block; + next_zero_bit = ext2_find_next_zero_bit(bh-b_data, + offset + EXT2_SB(sb)-s_itb_per_group, + offset); + if (next_zero_bit = offset + EXT2_SB(sb)-s_itb_per_group) + /* good bitmap for inode tables */ + return 1; + +err_out: + ext2_error(sb, __FUNCTION__, + Invalid block bitmap - + block_group = %d, block = %lu, + block_group, bitmap_blk); + return 0; +} + /* - * Read the bitmap for a given block_group, reading into the specified - * slot in the superblock's bitmap cache. + * Read the bitmap for a given block_group,and validate the + * bits for block/inode/inode tables are set in the bitmaps * * Return buffer_head on success or NULL in case of failure. */ @@ -80,17 +124,36 @@ read_block_bitmap(struct super_block *sb, unsigned int block_group) { struct ext2_group_desc * desc; struct buffer_head * bh = NULL; - - desc = ext2_get_group_desc (sb, block_group, NULL); + ext2_fsblk_t bitmap_blk; + + desc = ext2_get_group_desc(sb, block_group, NULL); if (!desc) - goto error_out; - bh = sb_bread(sb, le32_to_cpu(desc-bg_block_bitmap)); - if (!bh) - ext2_error (sb, read_block_bitmap, + return NULL; + bitmap_blk = le32_to_cpu(desc-bg_block_bitmap); + bh = sb_getblk(sb, bitmap_blk); + if (unlikely(!bh)) { + ext2_error(sb, __FUNCTION__, Cannot read block bitmap - block_group = %d, block_bitmap = %u, block_group, le32_to_cpu(desc-bg_block_bitmap)); -error_out: + return NULL; + } + if (likely(bh_uptodate_or_lock(bh))) + return bh; + + if (bh_submit_read(bh) 0) { + brelse(bh); + ext2_error(sb, __FUNCTION__, + Cannot read block bitmap - + block_group = %d, block_bitmap = %u, + block_group, le32_to_cpu(desc-bg_block_bitmap)); + return NULL; + } + if (!ext2_valid_block_bitmap(sb, desc, block_group, bh)) { + brelse(bh); + return NULL; + } + return bh; } -- 1.5.3.5.652.gf192c-dirty - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] ext3: add block bitmap validation
When a new block bitmap is read from disk in read_block_bitmap() there are a few bits that should ALWAYS be set. In particular, the blocks given corresponding to block bitmap, inode bitmap and inode tables. Validate the block bitmap against these blocks. Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] --- fs/ext3/balloc.c | 78 - 1 files changed, 70 insertions(+), 8 deletions(-) diff --git a/fs/ext3/balloc.c b/fs/ext3/balloc.c index a8ba7e8..a26e683 100644 --- a/fs/ext3/balloc.c +++ b/fs/ext3/balloc.c @@ -80,13 +80,57 @@ struct ext3_group_desc * ext3_get_group_desc(struct super_block * sb, return desc + offset; } +static int ext3_valid_block_bitmap(struct super_block *sb, + struct ext3_group_desc *desc, + unsigned int block_group, + struct buffer_head *bh) +{ + ext3_grpblk_t offset; + ext3_grpblk_t next_zero_bit; + ext3_fsblk_t bitmap_blk; + ext3_fsblk_t group_first_block; + + group_first_block = ext3_group_first_block_no(sb, block_group); + + /* check whether block bitmap block number is set */ + bitmap_blk = le32_to_cpu(desc-bg_block_bitmap); + offset = bitmap_blk - group_first_block; + if (!ext3_test_bit(offset, bh-b_data)) + /* bad block bitmap */ + goto err_out; + + /* check whether the inode bitmap block number is set */ + bitmap_blk = le32_to_cpu(desc-bg_inode_bitmap); + offset = bitmap_blk - group_first_block; + if (!ext3_test_bit(offset, bh-b_data)) + /* bad block bitmap */ + goto err_out; + + /* check whether the inode table block number is set */ + bitmap_blk = le32_to_cpu(desc-bg_inode_table); + offset = bitmap_blk - group_first_block; + next_zero_bit = ext3_find_next_zero_bit(bh-b_data, + offset + EXT3_SB(sb)-s_itb_per_group, + offset); + if (next_zero_bit = offset + EXT3_SB(sb)-s_itb_per_group) + /* good bitmap for inode tables */ + return 1; + +err_out: + ext3_error(sb, __FUNCTION__, + Invalid block bitmap - + block_group = %d, block = %lu, + block_group, bitmap_blk); + return 0; +} + /** * read_block_bitmap() * @sb:super block * @block_group: given block group * - * Read the bitmap for a given block_group, reading into the specified - * slot in the superblock's bitmap cache. + * Read the bitmap for a given block_group,and validate the + * bits for block/inode/inode tables are set in the bitmaps * * Return buffer_head on success or NULL in case of failure. */ @@ -95,17 +139,35 @@ read_block_bitmap(struct super_block *sb, unsigned int block_group) { struct ext3_group_desc * desc; struct buffer_head * bh = NULL; + ext3_fsblk_t bitmap_blk; - desc = ext3_get_group_desc (sb, block_group, NULL); + desc = ext3_get_group_desc(sb, block_group, NULL); if (!desc) - goto error_out; - bh = sb_bread(sb, le32_to_cpu(desc-bg_block_bitmap)); - if (!bh) - ext3_error (sb, read_block_bitmap, + return NULL; + bitmap_blk = le32_to_cpu(desc-bg_block_bitmap); + bh = sb_getblk(sb, bitmap_blk); + if (unlikely(!bh)) { + ext3_error(sb, __FUNCTION__, Cannot read block bitmap - block_group = %d, block_bitmap = %u, block_group, le32_to_cpu(desc-bg_block_bitmap)); -error_out: + return NULL; + } + if (likely(bh_uptodate_or_lock(bh))) + return bh; + + if (bh_submit_read(bh) 0) { + brelse(bh); + ext3_error(sb, __FUNCTION__, + Cannot read block bitmap - + block_group = %d, block_bitmap = %u, + block_group, le32_to_cpu(desc-bg_block_bitmap)); + return NULL; + } + if (!ext3_valid_block_bitmap(sb, desc, block_group, bh)) { + brelse(bh); + return NULL; + } return bh; } /* -- 1.5.3.5.652.gf192c-dirty - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] ext4: add block bitmap validation
When a new block bitmap is read from disk in read_block_bitmap() there are a few bits that should ALWAYS be set. In particular, the blocks given corresponding to block bitmap, inode bitmap and inode tables. Validate the block bitmap against these blocks. Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] --- fs/ext4/balloc.c | 97 - 1 files changed, 80 insertions(+), 17 deletions(-) diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c index ff3428e..3b9b07b 100644 --- a/fs/ext4/balloc.c +++ b/fs/ext4/balloc.c @@ -189,13 +189,65 @@ struct ext4_group_desc * ext4_get_group_desc(struct super_block * sb, return desc; } +static int ext4_valid_block_bitmap(struct super_block *sb, + struct ext4_group_desc *desc, + unsigned int block_group, + struct buffer_head *bh) +{ + ext4_grpblk_t offset; + ext4_grpblk_t next_zero_bit; + ext4_fsblk_t bitmap_blk; + ext4_fsblk_t group_first_block; + + if (EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_FLEX_BG)) { + /* with FLEX_BG, the inode/block bitmaps and itable +* blocks may not be in the group at all +* so the bitmap validation will be skipped for those groups +* or it has to also read the block group where the bitmaps +* are located to verify they are set. +*/ + return 1; + } + group_first_block = ext4_group_first_block_no(sb, block_group); + + /* check whether block bitmap block number is set */ + bitmap_blk = ext4_block_bitmap(sb, desc); + offset = bitmap_blk - group_first_block; + if (!ext4_test_bit(offset, bh-b_data)) + /* bad block bitmap */ + goto err_out; + + /* check whether the inode bitmap block number is set */ + bitmap_blk = ext4_inode_bitmap(sb, desc); + offset = bitmap_blk - group_first_block; + if (!ext4_test_bit(offset, bh-b_data)) + /* bad block bitmap */ + goto err_out; + + /* check whether the inode table block number is set */ + bitmap_blk = ext4_inode_table(sb, desc); + offset = bitmap_blk - group_first_block; + next_zero_bit = ext4_find_next_zero_bit(bh-b_data, + offset + EXT4_SB(sb)-s_itb_per_group, + offset); + if (next_zero_bit = offset + EXT4_SB(sb)-s_itb_per_group) + /* good bitmap for inode tables */ + return 1; + +err_out: + ext4_error(sb, __FUNCTION__, + Invalid block bitmap - + block_group = %d, block = %llu, + block_group, bitmap_blk); + return 0; +} /** * read_block_bitmap() * @sb:super block * @block_group: given block group * - * Read the bitmap for a given block_group, reading into the specified - * slot in the superblock's bitmap cache. + * Read the bitmap for a given block_group,and validate the + * bits for block/inode/inode tables are set in the bitmaps * * Return buffer_head on success or NULL in case of failure. */ @@ -210,25 +262,36 @@ read_block_bitmap(struct super_block *sb, ext4_group_t block_group) if (!desc) return NULL; bitmap_blk = ext4_block_bitmap(sb, desc); + bh = sb_getblk(sb, bitmap_blk); + if (unlikely(!bh)) { + ext4_error(sb, __FUNCTION__, + Cannot read block bitmap - + block_group = %d, block_bitmap = %llu, + block_group, bitmap_blk); + return NULL; + } + if (bh_uptodate_or_lock(bh)) + return bh; + if (desc-bg_flags cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) { - bh = sb_getblk(sb, bitmap_blk); - if (!buffer_uptodate(bh)) { - lock_buffer(bh); - if (!buffer_uptodate(bh)) { - ext4_init_block_bitmap(sb, bh, block_group, - desc); - set_buffer_uptodate(bh); - } - unlock_buffer(bh); - } - } else { - bh = sb_bread(sb, bitmap_blk); + ext4_init_block_bitmap(sb, bh, block_group, desc); + set_buffer_uptodate(bh); + unlock_buffer(bh); + return bh; } - if (!bh) - ext4_error (sb, __FUNCTION__, + if (bh_submit_read(bh) 0) { + brelse(bh); + ext4_error(sb, __FUNCTION__, Cannot read block bitmap - - block_group = %lu, block_bitmap = %llu
[PATCH] Add buffer head related helper functions
Add buffer head related helper function bh_uptodate_or_lock and bh_submit_read which can be used by file system Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] --- fs/buffer.c | 41 + include/linux/buffer_head.h |2 ++ 2 files changed, 43 insertions(+), 0 deletions(-) diff --git a/fs/buffer.c b/fs/buffer.c index 7249e01..7593ff3 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -3213,6 +3213,47 @@ static int buffer_cpu_notify(struct notifier_block *self, return NOTIFY_OK; } +/** + * bh_uptodate_or_lock: Test whether the buffer is uptodate + * @bh: struct buffer_head + * + * Return true if the buffer is up-to-date and false, + * with the buffer locked, if not. + */ +int bh_uptodate_or_lock(struct buffer_head *bh) +{ + if (!buffer_uptodate(bh)) { + lock_buffer(bh); + if (!buffer_uptodate(bh)) + return 0; + unlock_buffer(bh); + } + return 1; +} +EXPORT_SYMBOL(bh_uptodate_or_lock); +/** + * bh_submit_read: Submit a locked buffer for reading + * @bh: struct buffer_head + * + * Returns a negative error + */ +int bh_submit_read(struct buffer_head *bh) +{ + if (!buffer_locked(bh)) + lock_buffer(bh); + + if (buffer_uptodate(bh)) + return 0; + + get_bh(bh); + bh-b_end_io = end_buffer_read_sync; + submit_bh(READ, bh); + wait_on_buffer(bh); + if (buffer_uptodate(bh)) + return 0; + return -EIO; +} +EXPORT_SYMBOL(bh_submit_read); void __init buffer_init(void) { int nrpages; diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h index da0d83f..e98801f 100644 --- a/include/linux/buffer_head.h +++ b/include/linux/buffer_head.h @@ -192,6 +192,8 @@ int sync_dirty_buffer(struct buffer_head *bh); int submit_bh(int, struct buffer_head *); void write_boundary_block(struct block_device *bdev, sector_t bblock, unsigned blocksize); +int bh_uptodate_or_lock(struct buffer_head *bh); +int bh_submit_read(struct buffer_head *bh); extern int buffer_heads_over_limit; -- 1.5.3.5.652.gf192c-dirty - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Take 2] ext2/3/4 block bitmap validation patches
Aneesh Kumar K.V wrote: This is the updated ext2/3/4 block bitmap validation patches Changes from the last post a) moved the bh_uptodate_or_lock and bh_submit_read to fs/buffer.c and added EXPORT_SYMBOL b) Updated bh_submit_read not to release buffer on failure. This handles one reference handling bug in the error patch. I forgot to pass --numbered to git format-patch The order is [PATCH 1/4] Add buffer head related helper functions [PATCH 2/4] ext2: add block bitmap validation [PATCH 3/4] ext3: add block bitmap validation [PATCH 4/4] ext4: add block bitmap validation it is against patch queue and applies as the last set of patches in the stable series buffer_head_helper.patch ext2_bitmap_validation.patch ext3_bitmap_validation.patch ext4_bitmap_validation.patch stable-boundary -aneesh - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/4] ext3: add block bitmap validation
When a new block bitmap is read from disk in read_block_bitmap() there are a few bits that should ALWAYS be set. In particular, the blocks given corresponding to block bitmap, inode bitmap and inode tables. Validate the block bitmap against these blocks. Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] --- fs/ext3/balloc.c | 77 - 1 files changed, 69 insertions(+), 8 deletions(-) diff --git a/fs/ext3/balloc.c b/fs/ext3/balloc.c index a8ba7e8..8f732b8 100644 --- a/fs/ext3/balloc.c +++ b/fs/ext3/balloc.c @@ -80,13 +80,52 @@ struct ext3_group_desc * ext3_get_group_desc(struct super_block * sb, return desc + offset; } +static int ext3_valid_block_bitmap(struct super_block *sb, + struct ext3_group_desc *desc, + unsigned int block_group, + struct buffer_head *bh) +{ + ext3_grpblk_t offset; + ext3_grpblk_t next_zero_bit; + ext3_fsblk_t bitmap_blk; + ext3_fsblk_t group_first_block; + + group_first_block = ext3_group_first_block_no(sb, block_group); + + /* check whether block bitmap block number is set */ + bitmap_blk = le32_to_cpu(desc-bg_block_bitmap); + offset = bitmap_blk - group_first_block; + if (!ext3_test_bit(offset, bh-b_data)) { + /* bad block bitmap */ + return 0; + } + /* check whether the inode bitmap block number is set */ + bitmap_blk = le32_to_cpu(desc-bg_inode_bitmap); + offset = bitmap_blk - group_first_block; + if (!ext3_test_bit(offset, bh-b_data)) { + /* bad block bitmap */ + return 0; + } + /* check whether the inode table block number is set */ + bitmap_blk = le32_to_cpu(desc-bg_inode_table); + offset = bitmap_blk - group_first_block; + next_zero_bit = ext3_find_next_zero_bit(bh-b_data, + EXT3_SB(sb)-s_itb_per_group + 1, + offset); + if (next_zero_bit = offset + EXT3_SB(sb)-s_itb_per_group) + /* good bitmap for inode tables */ + return 1; + + return 0; +} + /** * read_block_bitmap() * @sb:super block * @block_group: given block group * - * Read the bitmap for a given block_group, reading into the specified - * slot in the superblock's bitmap cache. + * Read the bitmap for a given block_group,and validate the + * bits for block/inode/inode tables are set in the bitmaps * * Return buffer_head on success or NULL in case of failure. */ @@ -95,17 +134,39 @@ read_block_bitmap(struct super_block *sb, unsigned int block_group) { struct ext3_group_desc * desc; struct buffer_head * bh = NULL; + ext3_fsblk_t bitmap_blk; - desc = ext3_get_group_desc (sb, block_group, NULL); + desc = ext3_get_group_desc(sb, block_group, NULL); if (!desc) - goto error_out; - bh = sb_bread(sb, le32_to_cpu(desc-bg_block_bitmap)); - if (!bh) - ext3_error (sb, read_block_bitmap, + return NULL; + bitmap_blk = le32_to_cpu(desc-bg_block_bitmap); + bh = sb_getblk(sb, bitmap_blk); + if (unlikely(!bh)) { + ext3_error(sb, __FUNCTION__, + Cannot read block bitmap - + block_group = %d, block_bitmap = %u, + block_group, le32_to_cpu(desc-bg_block_bitmap)); + return NULL; + } + if (likely(bh_uptodate_or_lock(bh))) + return bh; + + if (bh_submit_read(bh) 0) { + brelse(bh); + ext3_error(sb, __FUNCTION__, Cannot read block bitmap - block_group = %d, block_bitmap = %u, block_group, le32_to_cpu(desc-bg_block_bitmap)); -error_out: + return NULL; + } + if (!ext3_valid_block_bitmap(sb, desc, block_group, bh)) { + brelse(bh); + ext3_error(sb, __FUNCTION__, + Invalid block bitmap - + block_group = %d, block = %lu, + block_group, bitmap_blk); + return NULL; + } return bh; } /* -- 1.5.3.5.652.gf192c-dirty - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/4] ext4: add block bitmap validation
When a new block bitmap is read from disk in read_block_bitmap() there are a few bits that should ALWAYS be set. In particular, the blocks given corresponding to block bitmap, inode bitmap and inode tables. Validate the block bitmap against these blocks. Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED] --- fs/ext4/balloc.c | 96 +- 1 files changed, 80 insertions(+), 16 deletions(-) diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c index 71ee95e..2657a1e 100644 --- a/fs/ext4/balloc.c +++ b/fs/ext4/balloc.c @@ -189,13 +189,62 @@ struct ext4_group_desc * ext4_get_group_desc(struct super_block * sb, return desc; } +static int ext4_valid_block_bitmap(struct super_block *sb, + struct ext4_group_desc *desc, + unsigned int block_group, + struct buffer_head *bh) +{ + ext4_grpblk_t offset; + ext4_grpblk_t next_zero_bit; + ext4_fsblk_t bitmap_blk; + ext4_fsblk_t group_first_block; + + if (EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_FLEX_BG)) { + /* with FLEX_BG, the inode/block bitmaps and itable +* blocks may not be in the group at all +* so the bitmap validation will be skipped for those groups +* or it has to also read the block group where the bitmaps +* are located to verify they are set. +*/ + return 1; + } + group_first_block = ext4_group_first_block_no(sb, block_group); + + /* check whether block bitmap block number is set */ + bitmap_blk = ext4_block_bitmap(sb, desc); + offset = bitmap_blk - group_first_block; + if (!ext4_test_bit(offset, bh-b_data)) { + /* bad block bitmap */ + return 0; + } + + /* check whether the inode bitmap block number is set */ + bitmap_blk = ext4_inode_bitmap(sb, desc); + offset = bitmap_blk - group_first_block; + if (!ext4_test_bit(offset, bh-b_data)) { + /* bad block bitmap */ + return 0; + } + + /* check whether the inode table block number is set */ + bitmap_blk = ext4_inode_table(sb, desc); + offset = bitmap_blk - group_first_block; + next_zero_bit = ext4_find_next_zero_bit(bh-b_data, + EXT4_SB(sb)-s_itb_per_group + 1, + offset); + if (next_zero_bit = offset + EXT4_SB(sb)-s_itb_per_group) + /* good bitmap for inode tables */ + return 1; + + return 0; +} /** * read_block_bitmap() * @sb:super block * @block_group: given block group * - * Read the bitmap for a given block_group, reading into the specified - * slot in the superblock's bitmap cache. + * Read the bitmap for a given block_group,and validate the + * bits for block/inode/inode tables are set in the bitmaps * * Return buffer_head on success or NULL in case of failure. */ @@ -210,25 +259,40 @@ read_block_bitmap(struct super_block *sb, unsigned int block_group) if (!desc) return NULL; bitmap_blk = ext4_block_bitmap(sb, desc); + bh = sb_getblk(sb, bitmap_blk); + if (unlikely(!bh)) { + ext4_error(sb, __FUNCTION__, + Cannot read block bitmap - + block_group = %d, block_bitmap = %llu, + block_group, bitmap_blk); + return NULL; + } + if (bh_uptodate_or_lock(bh)) + return bh; + if (desc-bg_flags cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) { - bh = sb_getblk(sb, bitmap_blk); - if (!buffer_uptodate(bh)) { - lock_buffer(bh); - if (!buffer_uptodate(bh)) { - ext4_init_block_bitmap(sb, bh, block_group, - desc); - set_buffer_uptodate(bh); - } - unlock_buffer(bh); - } - } else { - bh = sb_bread(sb, bitmap_blk); + ext4_init_block_bitmap(sb, bh, block_group, desc); + set_buffer_uptodate(bh); + unlock_buffer(bh); + return bh; } - if (!bh) - ext4_error (sb, __FUNCTION__, + if (bh_submit_read(bh) 0) { + brelse(bh); + ext4_error(sb, __FUNCTION__, Cannot read block bitmap - block_group = %d, block_bitmap = %llu, block_group, bitmap_blk); + return NULL; + } + if (!ext4_valid_block_bitmap(sb, desc, block_group, bh)) { + brelse(bh
delalloc and reservation.
Hi All, I looked at the delalloc and reservation differences that Valerie was observing. Below is my understanding. I am not sure whether the below will result in higher fragmentation that Eric Sandeen is observing. I guess it should not. Even though the reservation gets discarded during the clear inode due to memory pressure the request for new reservation should get the blocks nearby and not break extents right ? any how below is the simple case. without delalloc the blocks are requested during prepare_write/write_begin. That means we enter ext4_new_blocks_old which will call ext4_try_to_allocate_with_rsv. Now if there is no reservation for this inode a new one will be allocated. After using the blocks this reservation is destroyed during the close via ext4_release_file With delalloc the blocks are not requested until we hit writeback/ext4_da_writepages That means if we create new file and close them the reservation will be discarded during close via ext4_release_file.( Actually there will be nothing to clear) Now when we do a sync/or write back. We try to get the block, the inode will request for new reservation. This reservation is not discarded untill we call clear_inode and that results in the behavior we are seeing. Free blocks: 1440-8191, 8194-8199, 8202-8207, 8210-8215, 8218-8223, 8226-8231, 8234-8239, 8242-8247, 8250-8255, 8258-8263, 8266-8271, 8274-8279, 8282-8287, 8290-8295, 8298-8303, 8306-8311, 8314-8319, 8322-8327, 8330-8335, 8338-8343, 8346-12799 So now the question is where do we discard the reservation in case of delalloc. -aneesh - To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: delalloc and reservation
I guess the list dropped this mail. Sending again. -aneesh ---BeginMessage--- Aneesh Kumar K.V wrote: Hi All, I looked at the delalloc and reservation differences that Valerie was observing. Below is my understanding. I am not sure whether the below will result in higher fragmentation that Eric Sandeen is observing. I guess it should not. Even though the reservation gets discarded during the clear inode due to memory pressure the request for new reservation should get the blocks nearby and not break extents right ? any how below is the simple case. without delalloc the blocks are requested during prepare_write/write_begin. That means we enter ext4_new_blocks_old which will call ext4_try_to_allocate_with_rsv. Now if there is no reservation for this inode a new one will be allocated. After using the blocks this reservation is destroyed during the close via ext4_release_file With delalloc the blocks are not requested until we hit writeback/ext4_da_writepages That means if we create new file and close them the reservation will be discarded during close via ext4_release_file.( Actually there will be nothing to clear) Now when we do a sync/or write back. We try to get the block, the inode will request for new reservation. This reservation is not discarded untill we call clear_inode and that results in the behavior we are seeing. Free blocks: 1440-8191, 8194-8199, 8202-8207, 8210-8215, 8218-8223, 8226-8231, 8234-8239, 8242-8247, 8250-8255, 8258-8263, 8266-8271, 8274-8279, 8282-8287, 8290-8295, 8298-8303, 8306-8311, 8314-8319, 8322-8327, 8330-8335, 8338-8343, 8346-12799 So now the question is where do we discard the reservation in case of delalloc. - with respect to mballoc we are not seeing this because we are doing allocation from group prealloc list which is per cpu. For most the case we have EXT4_MB_HINT_GROUP_ALLOC set in mballoc. In ext4_mb_group_or_file i already have a FIXME!! regarding this. currently we have /* request is so large that we don't care about * streaming - it overweights any possible seek */ if (ac-ac_o_ex.fe_len = sbi-s_mb_large_req) return; /* FIXME!! * is this = considering the above ? */ if (ac-ac_o_ex.fe_len = sbi-s_mb_small_req) return; . .. /* we're going to use group allocation */ ac-ac_flags |= EXT4_MB_HINT_GROUP_ALLOC; . So for small size we have the EXT4_MB_HINT_GROUP_ALLOC set . Now if i change the the line below FIXME!! to = , that will force small size to use inode prealloc and that cause Free blocks: 1442-1443, 1446-1447, 1450-1451, 1454-1455, 1458-1459, 1462-1463, 1466-1467, 1470-1471, 1474-1475, 1478-1479, 1482-1483, 1486-1487, 1490-1491, 1494-1495, 1498-1499, 1502-1503, 1506-1507, 1510-1511, 1514-1515, 1518-12799 So the problem is generic. -aneesh ---End Message---