Re: What's cooking in e2fsprogs.git (topics)

2008-02-25 Thread Aneesh Kumar K.V
On Mon, Feb 25, 2008 at 10:13:39AM -0500, Theodore Tso wrote:
 On Sun, Feb 24, 2008 at 09:20:50PM -0700, Andreas Dilger wrote:
  On Feb 22, 2008  19:15 -0500, Theodore Ts'o wrote:
   So before the recent patch were we actually creating long symlinks in
   extents format?  Or were we just setting the flag but still treating
   them as a block number?  If it was the latter, I guess we can put in
   code into e2fsck to detect that case, and convert it back to a
   singleton block number.  
  
  Eric informed me that the long symlinks were actually stored in extent
  mapped blocks.  That is not harmful, because it can only be a single
  block and it will always fit into the inode.  The other thing to note
  is that extent mapping is REQUIRED for  32-bit blocknumbers, so we
  may as well fix e2fsprogs to allow these symlinks to be handled normally.
 
 Well, at least some kernel versions (as of sometime just before
 2.6.25, iirc) were storing the long symlink as a single block in
 i_block[0], despite EXTENTS_FL being set.  Valerie noticed this, and I
 confirmed it, as it caused the mainline e2fsck extents support to core
 dump.  Basically, what this means is that e2fsprogs can't trust
 EXTENTS_FL for long symlinks.
 
 But you do raise a good point that we need to support using the
 extents format in order to support blocks  2**32, so we can't just
 arbitrary convert all symlinks to the old-style direct block maps.

How about the patch like below on top of the patch queue. Patch queue
currently enable extent flag only for directory and file . This patch
add it to normal symlink. With this fast symlink still have the extent
format enabled. I guess this would need a patch to the interim branch of
e2fsprogs to allow normal symlink to have extent format.

-aneesh

ext4: Enable extent format for symlink.

From: Aneesh Kumar K.V [EMAIL PROTECTED]

This patch enable extent format for normal symlink. Extent format enables
to refere file system blocks  32 bits. Enabling extent format for symlink
enables to have symlink block beyond 2**32 blocks. We still don't enable
extent format for fast symlink.

Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED]
---

 fs/ext4/ialloc.c |4 ++--
 fs/ext4/namei.c  |2 ++
 2 files changed, 4 insertions(+), 2 deletions(-)


diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 78d1094..1462189 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -842,8 +842,8 @@ got:
goto fail_free_drop;
}
if (test_opt(sb, EXTENTS)) {
-   /* set extent flag only for diretory and file */
-   if (S_ISDIR(mode) || S_ISREG(mode)) {
+   /* set extent flag only for diretory, file and normal symlink*/
+   if (S_ISDIR(mode) || S_ISREG(mode) || S_ISLNK(mode)) {
EXT4_I(inode)-i_flags |= EXT4_EXTENTS_FL;
ext4_ext_tree_init(handle, inode);
err = ext4_update_incompat_feature(handle, sb,
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index da942bc..63c33e0 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -,6 +,8 @@ retry:
goto out_stop;
}
} else {
+   /* clear the extent format for fast symlink */
+   EXT4_I(inode)-i_flags = ~EXT4_EXTENTS_FL;
inode-i_op = ext4_fast_symlink_inode_operations;
memcpy((char*)EXT4_I(inode)-i_data,symname,l);
inode-i_size = l-1;
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] ext4: Convert uninitialized extent to initialized extent in case of file system full

2008-02-22 Thread Aneesh Kumar K.V
A write to prealloc area cause the split of unititalized extent into a 
initialized
and uninitialized extent. If we don't have space to add new extent information 
instead
of returning error convert the existing uninitialized extent to initialized 
one. We
need to zero out the blocks corresponding to the extent to prevent wrong data 
reaching
userspace.

Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED]
---
 fs/ext4/extents.c |  151 ++---
 1 files changed, 144 insertions(+), 7 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index d315cc1..cdc7dca 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -2136,6 +2136,124 @@ void ext4_ext_release(struct super_block *sb)
 #endif
 }
 
+static int extend_credit_for_zeroout(handle_t *handle, struct inode *inode)
+{
+   int retval = 0, needed;
+
+   if (handle-h_buffer_credits  EXT4_RESERVE_TRANS_BLOCKS)
+   return 0;
+
+   /* number of filesytem blocks in one page */
+   needed = 1  (PAGE_CACHE_SHIFT - inode-i_blkbits);
+
+   if (ext4_journal_extend(handle, needed) != 0)
+   retval = ext4_journal_restart(handle, needed);
+
+   return retval;
+}
+
+/* FIXME!! we need to try to merge to left or right after zerout  */
+static int ext4_ext_zeroout(handle_t *handle, struct inode *inode,
+   ext4_lblk_t iblock, struct ext4_extent *ex)
+{
+   ext4_lblk_t ee_block;
+   unsigned int ee_len, blkcount, blocksize;
+   loff_t pos;
+   pgoff_t index, skip_index;
+   unsigned long offset;
+   struct page *page;
+   struct address_space *mapping = inode-i_mapping;
+   struct buffer_head *head, *bh;
+   int err = 0;
+
+   ee_block = le32_to_cpu(ex-ee_block);
+   ee_len = blkcount = ext4_ext_get_actual_len(ex);
+   blocksize = inode-i_sb-s_blocksize;
+
+   /*
+* find the skip index. We can't call __grab_cache_page for this
+* because we are in the writeout of this page and we already have
+* taken the lock on this page
+*/
+   pos = iblock   inode-i_blkbits;
+   skip_index = pos  PAGE_CACHE_SHIFT;
+
+   while (blkcount) {
+   pos = (ee_block  + ee_len - blkcount)  inode-i_blkbits;
+   index = pos  PAGE_CACHE_SHIFT;
+   offset = (pos  (PAGE_CACHE_SIZE - 1));
+   if (index == skip_index) {
+   /* Page will already be locked via
+* write_begin or writepage
+*/
+   read_lock_irq(mapping-tree_lock);
+   page = radix_tree_lookup(mapping-page_tree, index);
+   read_unlock_irq(mapping-tree_lock);
+   if (page)
+   page_cache_get(page);
+   else
+   return -ENOMEM;
+   } else {
+   page = __grab_cache_page(mapping, index);
+   if (!page)
+   return -ENOMEM;
+   }
+
+   if (!page_has_buffers(page))
+   create_empty_buffers(page, blocksize, 0);
+
+   /* extent the credit in the journal */
+   extend_credit_for_zeroout(handle, inode);
+
+   head = page_buffers(page);
+   /* Look for the buffer_head which map the block */
+   bh = head;
+   while (offset  0) {
+   bh = bh-b_this_page;
+   offset -= blocksize;
+   }
+   offset = (pos  (PAGE_CACHE_SIZE - 1));
+
+   /* Now write all the buffer_heads in the page */
+   do {
+   set_buffer_uptodate(bh);
+   if (ext4_should_journal_data(inode)) {
+   err = ext4_journal_get_write_access(handle, bh);
+   if (err)
+   goto err_out;
+   }
+   zero_user(page, offset, blocksize);
+   offset += blocksize;
+   if (ext4_should_journal_data(inode)) {
+   err = ext4_journal_dirty_metadata(handle, bh);
+   if (err)
+   goto err_out;
+   } else {
+   if (ext4_should_order_data(inode)) {
+   err = ext4_journal_dirty_data(handle,
+   bh);
+   if (err)
+   goto err_out;
+   }
+   mark_buffer_dirty(bh);
+   }
+
+   bh = bh-b_this_page

[PATCH] ext4: Fix fallocate error path.

2008-02-22 Thread Aneesh Kumar K.V
Put the old extent details back if we fail to split the
uninitialized extent.

Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED]
---
 fs/ext4/extents.c |   26 --
 1 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 39d5315..d315cc1 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -2152,7 +2152,7 @@ static int ext4_ext_convert_to_initialized(handle_t 
*handle,
ext4_lblk_t iblock,
unsigned long max_blocks)
 {
-   struct ext4_extent *ex, newex;
+   struct ext4_extent *ex, newex, orig_ex;
struct ext4_extent *ex1 = NULL;
struct ext4_extent *ex2 = NULL;
struct ext4_extent *ex3 = NULL;
@@ -2171,6 +2171,9 @@ static int ext4_ext_convert_to_initialized(handle_t 
*handle,
allocated = ee_len - (iblock - ee_block);
newblock = iblock - ee_block + ext_pblock(ex);
ex2 = ex;
+   orig_ex.ee_block = ex-ee_block;
+   orig_ex.ee_len   = cpu_to_le16(ee_len);
+   ext4_ext_store_pblock(orig_ex, ext_pblock(ex));
 
err = ext4_ext_get_access(handle, inode, path + depth);
if (err)
@@ -2199,13 +2202,25 @@ static int ext4_ext_convert_to_initialized(handle_t 
*handle,
ex3-ee_len = cpu_to_le16(allocated - max_blocks);
ext4_ext_mark_uninitialized(ex3);
err = ext4_ext_insert_extent(handle, inode, path, ex3);
-   if (err)
+   if (err) {
+   ex-ee_block = orig_ex.ee_block;
+   ex-ee_len   = orig_ex.ee_len;
+   ext4_ext_store_pblock(ex, ext_pblock(orig_ex));
+   ext4_ext_mark_uninitialized(ex);
+   ext4_ext_dirty(handle, inode, path + depth);
goto out;
+   }
/*
 * The depth, and hence eh  ex might change
 * as part of the insert above.
 */
newdepth = ext_depth(inode);
+   /*
+* update the extent length after successfull insert of the
+* split extent
+*/
+   orig_ex.ee_len = cpu_to_le16(ee_len -
+   ext4_ext_get_actual_len(ex3));
if (newdepth != depth) {
depth = newdepth;
ext4_ext_drop_refs(path);
@@ -2280,6 +2295,13 @@ static int ext4_ext_convert_to_initialized(handle_t 
*handle,
goto out;
 insert:
err = ext4_ext_insert_extent(handle, inode, path, newex);
+   if (err) {
+   ex-ee_block = orig_ex.ee_block;
+   ex-ee_len   = orig_ex.ee_len;
+   ext4_ext_store_pblock(ex, ext_pblock(orig_ex));
+   ext4_ext_mark_uninitialized(ex);
+   ext4_ext_dirty(handle, inode, path + depth);
+   }
 out:
return err ? err : allocated;
 }
-- 
1.5.4.1.97.g40aab-dirty

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ext4: Convert uninitialized extent to initialized extent in case of file system full

2008-02-22 Thread Aneesh Kumar K.V

Test results for the patch.

mmaptest simply maps a range and write to it.
The length of the extent indicate whether it is initialized or not.

There is no space in the file system for another block.
So the uninitialized extent have to be converted to initialized extent.

[EMAIL PROTECTED]:/ext4# /root/mmaptest testfile 0 100
mmaping 0 to 100
[EMAIL PROTECTED] linux-2.6.25-rc2]$ ~/ext4migrate --display 
/home/kvaneesh/test-images/ext3.img  testfile
extent: block=0-3 len=4 start=1153 start_hi=0
extent: block=4-32807 len=32804 start=1157 start_hi=0

[EMAIL PROTECTED]:/ext4# dd if=/root/a.c of=testfile  seek=6 bs=4096 
conv=notrunc
[EMAIL PROTECTED] linux-2.6.25-rc2]$ ~/ext4migrate --display 
/home/kvaneesh/test-images/ext3.img  testfile
extent: block=0-3 len=4 start=1153 start_hi=0
extent: block=4-32791 len=32788 start=1157 start_hi=0
extent: block=24-24 len=1 start=1177 start_hi=0
extent: block=25-32807 len=32783 start=1178 start_hi=0

[EMAIL PROTECTED]:/ext4# dd if=/root/a.c of=testfile  seek=8 bs=4096 
conv=notrunc
[EMAIL PROTECTED] linux-2.6.25-rc2]$ ~/ext4migrate --display 
/home/kvaneesh/test-images/ext3.img  testfile
extent: block=0-3 len=4 start=1153 start_hi=0
extent: block=4-32791 len=32788 start=1157 start_hi=0
extent: block=24-24 len=1 start=1177 start_hi=0
extent: block=25-39 len=15 start=1178 start_hi=0

[EMAIL PROTECTED]:/ext4# /root/mmaptest testfile 4096 5000
mmaping 4096 to 5000
[EMAIL PROTECTED] linux-2.6.25-rc2]$ ~/ext4migrate --display 
/home/kvaneesh/test-images/ext3.img  testfile
extent: block=0-3 len=4 start=1153 start_hi=0
extent: block=4-23 len=20 start=1157 start_hi=0
extent: block=24-24 len=1 start=1177 start_hi=0
extent: block=25-39 len=15 start=1178 start_hi=0

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] ext4: Convert uninitialized extent to initialized extent in case of file system full

2008-02-22 Thread Aneesh Kumar K.V
On Fri, Feb 22, 2008 at 08:01:28PM +0530, Aneesh Kumar K.V wrote:
   +
   + /* Now write all the buffer_heads in the page */
   + do {
   + set_buffer_uptodate(bh);
   + if (ext4_should_journal_data(inode)) {
   + err = ext4_journal_get_write_access(handle, bh);
   + /* do we have that many credits ??*/
   + if (err)
   + goto err_out;
   + }
   + zero_user(page, offset, blocksize);
  
  Ah oh, you are trying to zero out the pages in the page cache, that's
  seems wrong to me. By the time get_block() is called from writepages(),
  the pages should have meaningful content that needs to flush to disk,
  zero the pages out will lost the data.
  
 
 It is writebegin.  In case of writebegin the pages doesn't have the content. 
 By the
 time we reach write begin the page is supposed to have buffer heads that
 are alreayd mapped. So we won't end up calling get_blk. Even in case of
 mmap with page_mkwrite change we would have called writebegin equivalent
 before the writepage.

I guess the above para is confusing.

The callback is actually writebegin.In case of writebegin the page
doesn't have the content. With respect to writepage by the time we call the
callback the buffer_heads related to the page would already be mapped.
So we won't end up calling get_blk.


-aneesh
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ext4: Use page_mkwrite vma_operations to get mmap write notification.

2008-02-22 Thread Aneesh Kumar K.V
On Fri, Feb 22, 2008 at 10:10:48AM -0800, Mingming Cao wrote:
 On Fri, 2008-02-22 at 20:09 +0530, Aneesh Kumar K.V wrote:

.

  +   ext4_journal_stop(handle);
  +   goto out_unlock;
  +   }
  +   if (!ret  ext4_should_order_data(inode)) {
  +   ret = walk_page_buffers(handle, page_buffers(page),
  +   0, end, NULL, ext4_journal_dirty_data);
  +   }
  +   if (!ret)
  +   ret = block_commit_write(page, 0, end);
  +
 Hmm, it seems wired to do commit_write when the page is about becoming
 writable, but maybe that's the way it needs to?
 
 Don't we need to update the i_size somewhere?

block_commit_write simply iterate over buffer_head of page and mark them
dirty. That is why we don't want to call that for data=journalled mode.

 
  +   ext4_journal_stop(handle);
  +out_unlock:
  +   unlock_page(page);
  +   mutex_unlock(inode-i_mutex);
  +   return ret;
  +}
 
 It seems this combined the three journalling mode prepare_write() code
 here:( 
 
 Since prepare_write() and commit_write() is going to sunset, why not
 simply calling mappings-a_ops-write_begin() and then write_end()? that
 should take care of pretty much the journalling and the page operations,
 no?

write_begin and write_end works with the user space buffer. In this case
we don't have one. Also what ext4_page_mkwrite does is mostly what
write_begin/write_end does except the copy of user space buffer.


-aneesh
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC][PATCH] ext4: Convert uninitialized extent to initialized extent in case of file system full

2008-02-21 Thread Aneesh Kumar K.V
This patch had very minimal testing. I am sending this to get the
feedback on the approach. The skip_index section in the below patch
is ugly. Any suggestion to improve ?

NOTE: ext4_ext_convert_to_initialized error path have some BUGs. It
doesn't reset the extent information in case of error. But that is
another patch.


From 6a73edd4dbb32344e6a83ebdc07edd0e96d376bd Mon Sep 17 00:00:00 2001
From: Aneesh Kumar K.V [EMAIL PROTECTED]
Date: Thu, 21 Feb 2008 23:57:38 +0530
Subject: [PATCH] ext4: Convert uninitialized extent to initialized extent in 
case of file system full

A write to prealloc area cause the split of unititalized extent into a 
initialized
and uninitialized extent. If we don't have space to add new extent information 
instead
of returning error convert the existing uninitialized extent to initialized 
one. We
need to zero out the blocks corresponding to the extent to prevent wrong data 
reaching
userspace.

Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED]
---
 fs/ext4/extents.c |  135 -
 1 files changed, 133 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index b179b03..d37c14e 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -2137,6 +2137,103 @@ void ext4_ext_release(struct super_block *sb)
 #endif
 }
 
+static int ext4_ext_zero_out(handle_t *handle, struct inode *inode,
+   ext4_lblk_t iblock, struct ext4_extent *ex)
+{
+   ext4_lblk_t ee_block;
+   unsigned int ee_len, blkcount, blocksize;
+   loff_t pos;
+   pgoff_t index, skip_index;
+   unsigned long offset;
+   struct page *page;
+   struct address_space *mapping = inode-i_mapping;
+   struct buffer_head *head, *bh;
+   int err = 0;
+
+   ee_block = le32_to_cpu(ex-ee_block);
+   ee_len = blkcount = ext4_ext_get_actual_len(ex);
+   blocksize = inode-i_sb-s_blocksize;
+
+   /*
+* find the skip index. We can't call __grab_cache_page for this
+* because we are in the writeout of this page and we already have
+* taken the lock on this page
+*/
+   pos = iblock   inode-i_blkbits;
+   skip_index = pos  PAGE_CACHE_SHIFT;
+
+   while (blkcount) {
+   pos = (ee_block  + ee_len - blkcount)  inode-i_blkbits;
+   index = pos  PAGE_CACHE_SHIFT;
+   offset = (pos  (PAGE_CACHE_SIZE - 1));
+   if (index == skip_index) {
+   /* Page will already be locked in the writepage */
+   read_lock_irq(mapping-tree_lock);
+   page = radix_tree_lookup(mapping-page_tree, index);
+   read_unlock_irq(mapping-tree_lock);
+   if (page)
+   page_cache_get(page);
+   else
+   return -ENOMEM;
+   } else {
+   page = __grab_cache_page(mapping, index);
+   if (!page)
+   return -ENOMEM;
+   }
+
+   if (!page_has_buffers(page))
+   create_empty_buffers(page, blocksize, 0);
+
+   head = page_buffers(page);
+   /* Look for the buffer_head which map the block */
+   bh = head;
+   while (offset  0) {
+   bh = bh-b_this_page;
+   offset -= blocksize;
+   }
+   offset = (pos  (PAGE_CACHE_SIZE - 1));
+
+   /* Now write all the buffer_heads in the page */
+   do {
+   set_buffer_uptodate(bh);
+   if (ext4_should_journal_data(inode)) {
+   err = ext4_journal_get_write_access(handle, bh);
+   /* do we have that many credits ??*/
+   if (err)
+   goto err_out;
+   }
+   zero_user(page, offset, blocksize);
+   offset += blocksize;
+   if (ext4_should_journal_data(inode)) {
+   err = ext4_journal_dirty_metadata(handle, bh);
+   if (err)
+   goto err_out;
+   } else {
+   if (ext4_should_order_data(inode)) {
+   err = ext4_journal_dirty_data(handle,
+   bh);
+   if (err)
+   goto err_out;
+   }
+   mark_buffer_dirty(bh);
+   }
+
+   bh = bh-b_this_page;
+   blkcount--;
+   } while ((bh != head)  (blkcount  0

Re: [PATCH -v2] ext4: Use page_mkwrite vma_operations to get mmap write notification.

2008-02-21 Thread Aneesh Kumar K.V
On Thu, Feb 21, 2008 at 09:39:20AM -0800, Mingming Cao wrote:
 On Tue, 2008-02-19 at 09:13 +0530, Aneesh Kumar K.V wrote:
  We would like to get notified when we are doing a write on mmap section.
  This is needed with respect to preallocated area. We split the preallocated
  area into initialzed extent and uninitialzed extent in the call back. This
  let us handle ENOSPC better. Otherwise we get ENOSPC in the writepage and
  that would result in data loss. The changes are also needed to handle ENOSPC
  when writing to an mmap section of files with holes.
  
 
 Hi Aneesh,
 
 I have a concern, it seems we missed journalling the allocation activity
 for the mmaped write. See comments below...
 
 Another thing, perhaps similar patch should be ported to ext2/3, as this
 also addressed the mmaped write ENOSPC error without
 preallocation/deleyed allocation.
 
  Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED]

..

  +
  +int ext4_page_mkwrite(struct vm_area_struct *vma, struct page *page)
  +{
  +   return block_page_mkwrite(vma, page, ext4_get_block);
  +}
  +
 
 I don't see block allocation being journalled here. block_page_mkwrite()
 eventually calling block_prepare_write() which invokes ext4_get_block()
 without starting a new journal handle.
 
 Perhaps call ext4 write_begin() and write_end() inode operations, that
 would taking care of different write_begin and write_end for three
 different journalling mode.
 

You are correct. I missed the journalling details when working on the
patch. Will send the one with fix.


-aneesh
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] ext4: ext4_find_next_zero_bit needs an aligned address on some arch

2008-02-20 Thread Aneesh Kumar K.V
ext4_find_next_zero_bit and ext4_find_next_bit needs a long aligned
address on x8_64. Add mb_find_next_zero_bit and mb_find_next_bit
and use them in the mballoc.

Fix: https://bugzilla.redhat.com/show_bug.cgi?id=433286

Eric Sandeen debugged the problem and suggested the fix.

Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED]
CC:Eric Sandeen [EMAIL PROTECTED]
---
 fs/ext4/mballoc.c |   62 ++--
 1 files changed, 40 insertions(+), 22 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 89772b9..ccddd21 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -627,21 +627,19 @@ static ext4_fsblk_t ext4_grp_offs_to_block(struct 
super_block *sb,
return block;
 }
 
+static inline void *mb_correct_addr_and_bit(int *bit, void *addr)
+{
 #if BITS_PER_LONG == 64
-#define mb_correct_addr_and_bit(bit, addr) \
-{  \
-   bit += ((unsigned long) addr  7UL)  3;   \
-   addr = (void *) ((unsigned long) addr  ~7UL);  \
-}
+   *bit += ((unsigned long) addr  7UL)  3;
+   addr = (void *) ((unsigned long) addr  ~7UL);
 #elif BITS_PER_LONG == 32
-#define mb_correct_addr_and_bit(bit, addr) \
-{  \
-   bit += ((unsigned long) addr  3UL)  3;   \
-   addr = (void *) ((unsigned long) addr  ~3UL);  \
-}
+   *bit += ((unsigned long) addr  3UL)  3;
+   addr = (void *) ((unsigned long) addr  ~3UL);
 #else
 #error how many bits you are?!
 #endif
+   return addr;
+}
 
 static inline int mb_test_bit(int bit, void *addr)
 {
@@ -649,34 +647,54 @@ static inline int mb_test_bit(int bit, void *addr)
 * ext4_test_bit on architecture like powerpc
 * needs unsigned long aligned address
 */
-   mb_correct_addr_and_bit(bit, addr);
+   addr = mb_correct_addr_and_bit(bit, addr);
return ext4_test_bit(bit, addr);
 }
 
 static inline void mb_set_bit(int bit, void *addr)
 {
-   mb_correct_addr_and_bit(bit, addr);
+   addr = mb_correct_addr_and_bit(bit, addr);
ext4_set_bit(bit, addr);
 }
 
 static inline void mb_set_bit_atomic(spinlock_t *lock, int bit, void *addr)
 {
-   mb_correct_addr_and_bit(bit, addr);
+   addr = mb_correct_addr_and_bit(bit, addr);
ext4_set_bit_atomic(lock, bit, addr);
 }
 
 static inline void mb_clear_bit(int bit, void *addr)
 {
-   mb_correct_addr_and_bit(bit, addr);
+   addr = mb_correct_addr_and_bit(bit, addr);
ext4_clear_bit(bit, addr);
 }
 
 static inline void mb_clear_bit_atomic(spinlock_t *lock, int bit, void *addr)
 {
-   mb_correct_addr_and_bit(bit, addr);
+   addr = mb_correct_addr_and_bit(bit, addr);
ext4_clear_bit_atomic(lock, bit, addr);
 }
 
+static inline int mb_find_next_zero_bit(void *addr, int max, int start)
+{
+   int fix = 0;
+   addr = mb_correct_addr_and_bit(fix, addr);
+   max += fix;
+   start += fix;
+
+   return ext4_find_next_zero_bit(addr, max, start) - fix;
+}
+
+static inline int mb_find_next_bit(void *addr, int max, int start)
+{
+   int fix = 0;
+   addr = mb_correct_addr_and_bit(fix, addr);
+   max += fix;
+   start += fix;
+
+   return ext4_find_next_bit(addr, max, start) - fix;
+}
+
 static void *mb_find_buddy(struct ext4_buddy *e4b, int order, int *max)
 {
char *bb;
@@ -946,12 +964,12 @@ static void ext4_mb_generate_buddy(struct super_block *sb,
 
/* initialize buddy from bitmap which is aggregation
 * of on-disk bitmap and preallocations */
-   i = ext4_find_next_zero_bit(bitmap, max, 0);
+   i = mb_find_next_zero_bit(bitmap, max, 0);
grp-bb_first_free = i;
while (i  max) {
fragments++;
first = i;
-   i = ext4_find_next_bit(bitmap, max, i);
+   i = mb_find_next_bit(bitmap, max, i);
len = i - first;
free += len;
if (len  1)
@@ -959,7 +977,7 @@ static void ext4_mb_generate_buddy(struct super_block *sb,
else
grp-bb_counters[0]++;
if (i  max)
-   i = ext4_find_next_zero_bit(bitmap, max, i);
+   i = mb_find_next_zero_bit(bitmap, max, i);
}
grp-bb_fragments = fragments;
 
@@ -1783,7 +1801,7 @@ static void ext4_mb_simple_scan_group(struct 
ext4_allocation_context *ac,
buddy = mb_find_buddy(e4b, i, max);
BUG_ON(buddy == NULL);
 
-   k = ext4_find_next_zero_bit(buddy, max, 0);
+   k = mb_find_next_zero_bit(buddy, max, 0);
BUG_ON(k = max);
 
ac-ac_found++;
@@ -1823,7 +1841,7 @@ static void ext4_mb_complex_scan_group(struct 
ext4_allocation_context *ac,
i = e4b-bd_info-bb_first_free;
 
while (free  ac-ac_status == AC_STATUS_CONTINUE) {
-   i

Re: [PATCH] fs/ext4/mballoc.c: Convert to list_for_each_entry_rcu()

2008-02-19 Thread Aneesh Kumar K.V
On Tue, Feb 19, 2008 at 01:31:18AM +0100, Roel Kluin wrote:
 Please verify, this patch was not yet tested
 ---
 Convert list_for_each_rcu() to list_for_each_entry_rcu()
 
 Signed-off-by: Roel Kluin [EMAIL PROTECTED]


NACK. This patch doesn't build. You have extra cur in the conversion.
Right changes attached. 

ext4:  Convert list_for_each_rcu() to list_for_each_entry_rcu()

From: Aneesh Kumar K.V [EMAIL PROTECTED]

The list_for_each_entry_rcu() primitive should be used instead of
list_for_each_rcu(), as the former is easier to use and provides
better type safety.

http://groups.google.com/group/linux.kernel/browse_thread/thread/45749c83451cebeb/0633a65759ce7713?lnk=raot

Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED]
---

 fs/ext4/mballoc.c |   18 +-
 1 files changed, 5 insertions(+), 13 deletions(-)


diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 52d3af2..89772b9 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -3135,10 +3135,10 @@ ext4_mb_normalize_request(struct 
ext4_allocation_context *ac,
 {
int bsbits, max;
ext4_lblk_t end;
-   struct list_head *cur;
loff_t size, orig_size, start_off;
ext4_lblk_t start, orig_start;
struct ext4_inode_info *ei = EXT4_I(ac-ac_inode);
+   struct ext4_prealloc_space *pa;
 
/* do normalize only data requests, metadata requests
   do not need preallocation */
@@ -3224,12 +3224,9 @@ ext4_mb_normalize_request(struct ext4_allocation_context 
*ac,
 
/* check we don't cross already preallocated blocks */
rcu_read_lock();
-   list_for_each_rcu(cur, ei-i_prealloc_list) {
-   struct ext4_prealloc_space *pa;
+   list_for_each_entry_rcu(pa, ei-i_prealloc_list, pa_inode_list) {
unsigned long pa_end;
 
-   pa = list_entry(cur, struct ext4_prealloc_space, pa_inode_list);
-
if (pa-pa_deleted)
continue;
spin_lock(pa-pa_lock);
@@ -3271,10 +3268,8 @@ ext4_mb_normalize_request(struct ext4_allocation_context 
*ac,
 
/* XXX: extra loop to check we really don't overlap preallocations */
rcu_read_lock();
-   list_for_each_rcu(cur, ei-i_prealloc_list) {
-   struct ext4_prealloc_space *pa;
+   list_for_each_entry_rcu(pa, ei-i_prealloc_list, pa_inode_list) {
unsigned long pa_end;
-   pa = list_entry(cur, struct ext4_prealloc_space, pa_inode_list);
spin_lock(pa-pa_lock);
if (pa-pa_deleted == 0) {
pa_end = pa-pa_lstart + pa-pa_len;
@@ -3401,7 +3396,6 @@ static noinline int ext4_mb_use_preallocated(struct 
ext4_allocation_context *ac)
struct ext4_inode_info *ei = EXT4_I(ac-ac_inode);
struct ext4_locality_group *lg;
struct ext4_prealloc_space *pa;
-   struct list_head *cur;
 
/* only data can be preallocated */
if (!(ac-ac_flags  EXT4_MB_HINT_DATA))
@@ -3409,8 +3403,7 @@ static noinline int ext4_mb_use_preallocated(struct 
ext4_allocation_context *ac)
 
/* first, try per-file preallocation */
rcu_read_lock();
-   list_for_each_rcu(cur, ei-i_prealloc_list) {
-   pa = list_entry(cur, struct ext4_prealloc_space, pa_inode_list);
+   list_for_each_entry_rcu(pa, ei-i_prealloc_list, pa_inode_list) {
 
/* all fields in this condition don't change,
 * so we can skip locking for them */
@@ -3442,8 +3435,7 @@ static noinline int ext4_mb_use_preallocated(struct 
ext4_allocation_context *ac)
return 0;
 
rcu_read_lock();
-   list_for_each_rcu(cur, lg-lg_prealloc_list) {
-   pa = list_entry(cur, struct ext4_prealloc_space, pa_inode_list);
+   list_for_each_entry_rcu(pa, lg-lg_prealloc_list, pa_inode_list) {
spin_lock(pa-pa_lock);
if (pa-pa_deleted == 0  pa-pa_free = ac-ac_o_ex.fe_len) {
atomic_inc(pa-pa_count);
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Error with the latest stable series of the patch queue.

2008-02-19 Thread Aneesh Kumar K.V
Hi all,

I am seeing the below error in the console. But the tests are reported
as success.

EXT4-fs: mballoc enabled
EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode
#204044: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode
#204045: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode
#204047: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode
#204056: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode
#204061: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode
#204065: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode
#204068: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode
#204069: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode
#204071: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode
#204077: invalid magic - 


-aneesh
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] clear extents flag on inodes created in ext4_mknod

2008-02-19 Thread Aneesh Kumar K.V
On Tue, Feb 19, 2008 at 10:19:44PM +0530, Aneesh Kumar K.V wrote:
 On Tue, Feb 19, 2008 at 10:39:52AM -0600, Eric Sandeen wrote:
  Eric Sandeen wrote:
   e2fsck doesn't expect to find char, block, fifo, or socket
   files with the extent flag set, so clear that in ext4_mknod.
   
   Signed-off-by: Eric Sandeen [EMAIL PROTECTED]
   ---
   
   Index: linux-2.6.24/fs/ext4/namei.c
   ===
   --- linux-2.6.24.orig/fs/ext4/namei.c
   +++ linux-2.6.24/fs/ext4/namei.c
   @@ -1766,6 +1766,7 @@ retry:
#ifdef CONFIG_EXT4DEV_FS_XATTR
 inode-i_op = ext4_special_inode_operations;
#endif
   + EXT4_I(inode)-i_flags = ~EXT4_EXTENTS_FL;
 err = ext4_add_nondir(handle, dentry, inode);
 }
 ext4_journal_stop(handle);
  
  now that I think about it; perhaps it would be better to put this logic
  into ext4_new_inode, rather than setting it by default and clearing it
  here... that way new_inode() has all the logic about whether or not a
  particular type of file is in extents format.
  
 
 How about enabling it only for directory and regular files rather than
 enabling it globally and then disabling the flag for  symlink and device
 files ?
 

how about something like below. There are two reason for not inheriting
the i_flag from directory.


a) if the directory have extent flag set we end up with symlink which
have extent flag set but on which ext4_ext_tree_init is not called.

b) if we create a directory with extent flag and later mount the file
system with -o noextents, the files created in that directory will also
have extent flag set but we would not have called ext4_ext_tree_init for
them.

Both results in the stack below which cause ext4_error.

Call Trace:
[c0002c937320] [c01a0b84] .ext4_ext_find_extent+0x74/0x344 
(unreliable)
[c0002c9373e0] [c01a32c0] .ext4_ext_get_blocks+0x1e8/0xc3c
[c0002c937570] [c0191394] .ext4_get_blocks_wrap+0xa4/0x19c
[c0002c937640] [c0191588] .ext4_get_block+0xfc/0x16c
[c0002c937700] [c011490c] .__block_prepare_write+0x1f0/0x4f8
[c0002c937810] [c0114e5c] .block_write_begin+0xc4/0x160
[c0002c9378e0] [c018ec28] .ext4_write_begin+0x12c/0x24c
[c0002c9379e0] [c00a7ef4] .pagecache_write_begin+0x84/0x1e0
[c0002c937ac0] [c00f17f8] .__page_symlink+0x6c/0x154
[c0002c937b80] [c019784c] .ext4_symlink+0x1d8/0x2b8
[c0002c937c60] [c00f0508] .vfs_symlink+0x130/0x1c8
[c0002c937d00] [c00f0648] .sys_symlinkat+0xa8/0x110
[c0002c937e30] [c000872c] syscall_exit+0x0/0x40


and console message:

EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode
#204044: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)




diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index d2c2e55..f430939 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -794,7 +794,12 @@ got:
ei-i_dir_start_lookup = 0;
ei-i_disksize = 0;
 
-   ei-i_flags = EXT4_I(dir)-i_flags  ~EXT4_INDEX_FL;
+   /*
+* Don't inherit extent flag from directory. We set extent flag on
+* newly created directory and file only if -o extent mount option is
+* specfied
+*/
+   ei-i_flags = EXT4_I(dir)-i_flags  ~ (EXT4_INDEX_FL|EXT4_EXTENTS_FL);
if (S_ISLNK(mode))
ei-i_flags = ~(EXT4_IMMUTABLE_FL|EXT4_APPEND_FL);
/* dirsync only applies to directories */
@@ -836,8 +841,10 @@ got:
ext4_std_error(sb, err);
goto fail_free_drop;
}
-   if (test_opt(sb, EXTENTS)  !S_ISLNK(mode)) {
-   EXT4_I(inode)-i_flags |= EXT4_EXTENTS_FL;
+   if (test_opt(sb, EXTENTS)) {
+   /* set extent flag only for diretory and file */
+   if (S_ISDIR(mode) || S_ISREG(mode))
+   EXT4_I(inode)-i_flags |= EXT4_EXTENTS_FL;
ext4_ext_tree_init(handle, inode);
err = ext4_update_incompat_feature(handle, sb,
EXT4_FEATURE_INCOMPAT_EXTENTS);
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 23902ba..da942bc 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -1771,7 +1771,6 @@ retry:
 #ifdef CONFIG_EXT4DEV_FS_XATTR
inode-i_op = ext4_special_inode_operations;
 #endif
-   EXT4_I(inode)-i_flags = ~EXT4_EXTENTS_FL;
err = ext4_add_nondir(handle, dentry, inode);
}
ext4_journal_stop(handle);
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Error with the latest stable series of the patch queue.

2008-02-19 Thread Aneesh Kumar K.V
On Tue, Feb 19, 2008 at 06:15:01PM +0100, Valerie Clement wrote:
 Aneesh Kumar K.V wrote:
 Hi all,

 I am seeing the below error in the console. But the tests are reported
 as success.

 EXT4-fs: mballoc enabled
 EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode
 #204044: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
 EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode
 #204045: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
 EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode
 #204047: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
 EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode
 #204056: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
 EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode
 #204061: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
 EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode
 #204065: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
 EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode
 #204068: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
 EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode
 #204069: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
 EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode
 #204071: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
 EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode
 #204077: invalid magic - 

 Hi Aneesh,

 I've got also several issues while running ffsb tests today. The tests
 ended with success but e2fsck reported an error:

  Pass 1: Checking inodes, blocks, and sizes
  Inode 3367164, i_size is 57380864, should be 57442304.  Fix?

 Inode 3367164 is allocated in the last group of the filesystem.

 As I changed the allocation algorithm for the last group in the patch
 ext4_fix_block_alloc_algorithm_for_last_group.patch, I removed this
 patch and ran again the same test. I didn't reproduce the issue.

 *But* I reproduced it on a filesystem created with a smaller block size
 value (= 1024 instead of 4096 previously) and with a kernel *without*
 my patch applied. e2fsck reports the same error on inodes created in the
 last group. Sometimes in this configuration, error messages are also
 displayed on the console:

 EXT4-fs error (device sdc): ext4_valid_block_bitmap: Invalid block bitmap 
 - block_group = 7358, block = 60276737
 EXT4-fs error (device sdc): ext4_valid_block_bitmap: Invalid block bitmap 
 - block_group = 7358, block = 60276737

 and e2fsck reports errors like:
 Inode 2113777 has corrupt extent index at block 61165699 (logical -1) entry 0
 Fix?

 So, there is a problem when allocating inodes in the last group:
 - without my patch when block size value is 1024,
 - with my patch when block size value is 4096.

 Could you check if your tests allocate inodes in the last group and run
 also e2fsck to see if it reports errors.



Can you run the test with the below patch on top of stable series.


diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index d2c2e55..14fb73b 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -794,7 +794,12 @@ got:
ei-i_dir_start_lookup = 0;
ei-i_disksize = 0;
 
-   ei-i_flags = EXT4_I(dir)-i_flags  ~EXT4_INDEX_FL;
+   /*
+* Don't inherit extent flag from directory. We set extent flag on
+* newly created directory and file only if -o extent mount option is
+* specfied
+*/
+   ei-i_flags = EXT4_I(dir)-i_flags  ~ (EXT4_INDEX_FL|EXT4_EXTENTS_FL);
if (S_ISLNK(mode))
ei-i_flags = ~(EXT4_IMMUTABLE_FL|EXT4_APPEND_FL);
/* dirsync only applies to directories */
@@ -836,13 +841,16 @@ got:
ext4_std_error(sb, err);
goto fail_free_drop;
}
-   if (test_opt(sb, EXTENTS)  !S_ISLNK(mode)) {
-   EXT4_I(inode)-i_flags |= EXT4_EXTENTS_FL;
-   ext4_ext_tree_init(handle, inode);
-   err = ext4_update_incompat_feature(handle, sb,
-   EXT4_FEATURE_INCOMPAT_EXTENTS);
-   if (err)
-   goto fail;
+   if (test_opt(sb, EXTENTS)) {
+   /* set extent flag only for diretory and file */
+   if (S_ISDIR(mode) || S_ISREG(mode)) {
+   EXT4_I(inode)-i_flags |= EXT4_EXTENTS_FL;
+   ext4_ext_tree_init(handle, inode);
+   err = ext4_update_incompat_feature(handle, sb,
+   EXT4_FEATURE_INCOMPAT_EXTENTS);
+   if (err)
+   goto fail;
+   }
}
 
ext4_debug(allocating inode %lu\n, inode-i_ino);
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 23902ba..da942bc 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -1771,7 +1771,6

Re: Error with the latest stable series of the patch queue.

2008-02-19 Thread Aneesh Kumar K.V
On Tue, Feb 19, 2008 at 06:15:01PM +0100, Valerie Clement wrote:
 Aneesh Kumar K.V wrote:
 Hi all,

 I am seeing the below error in the console. But the tests are reported
 as success.

 EXT4-fs: mballoc enabled
 EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode
 #204044: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
 EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode
 #204045: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
 EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode
 #204047: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
 EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode
 #204056: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
 EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode
 #204061: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
 EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode
 #204065: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
 EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode
 #204068: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
 EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode
 #204069: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
 EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode
 #204071: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
 EXT4-fs error (device sda7): ext4_ext_find_extent: bad header in inode
 #204077: invalid magic - 


The above problem is due to symlink having extent flag set but not
having extent tree initialized. That was mainly due to inheriting the
inode i_flag from parent directory. I am right now testing fix for this.


 Hi Aneesh,

 I've got also several issues while running ffsb tests today. The tests
 ended with success but e2fsck reported an error:

  Pass 1: Checking inodes, blocks, and sizes
  Inode 3367164, i_size is 57380864, should be 57442304.  Fix?

 Inode 3367164 is allocated in the last group of the filesystem.

 As I changed the allocation algorithm for the last group in the patch
 ext4_fix_block_alloc_algorithm_for_last_group.patch, I removed this
 patch and ran again the same test. I didn't reproduce the issue.

 *But* I reproduced it on a filesystem created with a smaller block size
 value (= 1024 instead of 4096 previously) and with a kernel *without*
 my patch applied. e2fsck reports the same error on inodes created in the
 last group. Sometimes in this configuration, error messages are also
 displayed on the console:

 EXT4-fs error (device sdc): ext4_valid_block_bitmap: Invalid block bitmap 
 - block_group = 7358, block = 60276737
 EXT4-fs error (device sdc): ext4_valid_block_bitmap: Invalid block bitmap 
 - block_group = 7358, block = 60276737

 and e2fsck reports errors like:
 Inode 2113777 has corrupt extent index at block 61165699 (logical -1) entry 0
 Fix?

 So, there is a problem when allocating inodes in the last group:
 - without my patch when block size value is 1024,
 - with my patch when block size value is 4096.

 Could you check if your tests allocate inodes in the last group and run
 also e2fsck to see if it reports errors.

 For the moment, I have no idea how to fix that problem.


This looks like a completely different problem. Will try to see if i can
reproduce it here.

-aneesh







-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] clear extents flag on inodes created in ext4_mknod

2008-02-19 Thread Aneesh Kumar K.V
On Tue, Feb 19, 2008 at 10:39:52AM -0600, Eric Sandeen wrote:
 Eric Sandeen wrote:
  e2fsck doesn't expect to find char, block, fifo, or socket
  files with the extent flag set, so clear that in ext4_mknod.
  
  Signed-off-by: Eric Sandeen [EMAIL PROTECTED]
  ---
  
  Index: linux-2.6.24/fs/ext4/namei.c
  ===
  --- linux-2.6.24.orig/fs/ext4/namei.c
  +++ linux-2.6.24/fs/ext4/namei.c
  @@ -1766,6 +1766,7 @@ retry:
   #ifdef CONFIG_EXT4DEV_FS_XATTR
  inode-i_op = ext4_special_inode_operations;
   #endif
  +   EXT4_I(inode)-i_flags = ~EXT4_EXTENTS_FL;
  err = ext4_add_nondir(handle, dentry, inode);
  }
  ext4_journal_stop(handle);
 
 now that I think about it; perhaps it would be better to put this logic
 into ext4_new_inode, rather than setting it by default and clearing it
 here... that way new_inode() has all the logic about whether or not a
 particular type of file is in extents format.
 

How about enabling it only for directory and regular files rather than
enabling it globally and then disabling the flag for  symlink and device
files ?

-aneesh
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] ext4: set EXT4_EXTENTS_FL only for directory and regular files

2008-02-19 Thread Aneesh Kumar K.V
Also don't inherit EXT4_EXTENTS_FL from parent directory.
If we have a directory with extent flag set and later mount the file
system with -o noextents, the files created in that directory will also
have extent flag set but we would not have called ext4_ext_tree_init for
them. This will cause error later when we are verifying the extent header

Also we don't want to set extent flag for symlinks, char, block, fifo
or socket

Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED]
---
 fs/ext4/ialloc.c |   22 +++---
 fs/ext4/namei.c  |1 -
 2 files changed, 15 insertions(+), 8 deletions(-)

diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 028e601..78d1094 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -794,7 +794,12 @@ got:
ei-i_dir_start_lookup = 0;
ei-i_disksize = 0;
 
-   ei-i_flags = EXT4_I(dir)-i_flags  ~EXT4_INDEX_FL;
+   /*
+* Don't inherit extent flag from directory. We set extent flag on
+* newly created directory and file only if -o extent mount option is
+* specified
+*/
+   ei-i_flags = EXT4_I(dir)-i_flags  ~(EXT4_INDEX_FL|EXT4_EXTENTS_FL);
if (S_ISLNK(mode))
ei-i_flags = ~(EXT4_IMMUTABLE_FL|EXT4_APPEND_FL);
/* dirsync only applies to directories */
@@ -837,12 +842,15 @@ got:
goto fail_free_drop;
}
if (test_opt(sb, EXTENTS)) {
-   EXT4_I(inode)-i_flags |= EXT4_EXTENTS_FL;
-   ext4_ext_tree_init(handle, inode);
-   err = ext4_update_incompat_feature(handle, sb,
-   EXT4_FEATURE_INCOMPAT_EXTENTS);
-   if (err)
-   goto fail;
+   /* set extent flag only for diretory and file */
+   if (S_ISDIR(mode) || S_ISREG(mode)) {
+   EXT4_I(inode)-i_flags |= EXT4_EXTENTS_FL;
+   ext4_ext_tree_init(handle, inode);
+   err = ext4_update_incompat_feature(handle, sb,
+   EXT4_FEATURE_INCOMPAT_EXTENTS);
+   if (err)
+   goto fail;
+   }
}
 
ext4_debug(allocating inode %lu\n, inode-i_ino);
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 39d4af4..da942bc 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2225,7 +2225,6 @@ retry:
inode-i_op = ext4_fast_symlink_inode_operations;
memcpy((char*)EXT4_I(inode)-i_data,symname,l);
inode-i_size = l-1;
-   EXT4_I(inode)-i_flags = ~EXT4_EXTENTS_FL;
}
EXT4_I(inode)-i_disksize = inode-i_size;
err = ext4_add_nondir(handle, dentry, inode);
-- 
1.5.4.1.97.g40aab-dirty

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ext4: set EXT4_EXTENTS_FL only for directory and regular files

2008-02-19 Thread Aneesh Kumar K.V
Mingming,


On Wed, Feb 20, 2008 at 01:19:05AM +0530, Aneesh Kumar K.V wrote:
 Also don't inherit EXT4_EXTENTS_FL from parent directory.
 If we have a directory with extent flag set and later mount the file
 system with -o noextents, the files created in that directory will also
 have extent flag set but we would not have called ext4_ext_tree_init for
 them. This will cause error later when we are verifying the extent header
 
 Also we don't want to set extent flag for symlinks, char, block, fifo
 or socket
 
 Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED]

This should replace the below two patches in the patch queue.

ext4-donot-set-extents-flag-for-any-symlinks.patch
ext4-clear-extents-flag-on-inodes-created-in-ext4-mknod.patch


-aneesh
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] ext4: Use page_mkwrite vma_operations to get mmap write notification.

2008-02-18 Thread Aneesh Kumar K.V
We would like to get notified when we are doing a write on mmap section.
This is needed with respect to preallocated area. We split the preallocated
area into initialzed extent and uninitialzed extent in the call back. This
let us handle ENOSPC better. Otherwise we get ENOSPC in the writepage and
that would result in data loss.

Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED]
---
 fs/ext4/file.c  |   19 ++-
 fs/ext4/inode.c |6 ++
 include/linux/ext4_fs.h |1 +
 3 files changed, 25 insertions(+), 1 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 20507a2..77341c1 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -123,6 +123,23 @@ force_commit:
return ret;
 }
 
+static struct vm_operations_struct ext4_file_vm_ops = {
+   .fault  = filemap_fault,
+   .page_mkwrite   = ext4_page_mkwrite,
+};
+
+static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
+{
+   struct address_space *mapping = file-f_mapping;
+
+   if (!mapping-a_ops-readpage)
+   return -ENOEXEC;
+   file_accessed(file);
+   vma-vm_ops = ext4_file_vm_ops;
+   vma-vm_flags |= VM_CAN_NONLINEAR;
+   return 0;
+}
+
 const struct file_operations ext4_file_operations = {
.llseek = generic_file_llseek,
.read   = do_sync_read,
@@ -133,7 +150,7 @@ const struct file_operations ext4_file_operations = {
 #ifdef CONFIG_COMPAT
.compat_ioctl   = ext4_compat_ioctl,
 #endif
-   .mmap   = generic_file_mmap,
+   .mmap   = ext4_file_mmap,
.open   = generic_file_open,
.release= ext4_release_file,
.fsync  = ext4_sync_file,
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 34f3eb6..81faa67 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3466,3 +3466,9 @@ int ext4_change_inode_journal_flag(struct inode *inode, 
int val)
 
return err;
 }
+
+int ext4_page_mkwrite(struct vm_area_struct *vma, struct page *page)
+{
+   return block_page_mkwrite(vma, page, ext4_get_block);
+}
+
diff --git a/include/linux/ext4_fs.h b/include/linux/ext4_fs.h
index 22810b1..8f5a563 100644
--- a/include/linux/ext4_fs.h
+++ b/include/linux/ext4_fs.h
@@ -1059,6 +1059,7 @@ extern void ext4_set_aops(struct inode *inode);
 extern int ext4_writepage_trans_blocks(struct inode *);
 extern int ext4_block_truncate_page(handle_t *handle, struct page *page,
struct address_space *mapping, loff_t from);
+extern int ext4_page_mkwrite(struct vm_area_struct *vma, struct page *page);
 
 /* ioctl.c */
 extern long ext4_ioctl(struct file *, unsigned int, unsigned long);
-- 
1.5.4.1.97.g40aab-dirty

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] ext4: Don't mark the filesystem with errors if we fail to fallocate.

2008-02-18 Thread Aneesh Kumar K.V
IF we fail fallocate don't call ext4_error. Also don't hide errors
from ext4_get_blocks_wrap

Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED]
---
 fs/ext4/extents.c |5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 5b22f71..bb01ac6 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -2653,13 +2653,14 @@ retry:
ret = ext4_get_blocks_wrap(handle, inode, block,
  max_blocks, map_bh,
  EXT4_CREATE_UNINITIALIZED_EXT, 0);
-   WARN_ON(ret = 0);
if (ret = 0) {
+#ifdef EXT4FS_DEBUG
+   WARN_ON(ret = 0);
ext4_error(inode-i_sb, ext4_fallocate,
ext4_ext_get_blocks returned error: 
inode#%lu, block=%u, max_blocks=%lu,
inode-i_ino, block, max_blocks);
-   ret = -EIO;
+#endif
ext4_mark_inode_dirty(handle, inode);
ret2 = ext4_journal_stop(handle);
break;
-- 
1.5.4.1.97.g40aab-dirty

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ext4: When reading from fallocated blocks make sure we return zero.

2008-02-18 Thread Aneesh Kumar K.V
On Mon, Feb 18, 2008 at 04:14:34PM -0800, Mingming Cao wrote:
 On Sat, 2008-02-16 at 08:53 +0530, Aneesh Kumar K.V wrote:
 
 How about the following patch?
 
 Regards,
 Mingming
 
 ext4: ext4_get_blocks_wrap fix for writing to preallocated
 From: Mingming Cao [EMAIL PROTECTED]
 
 This patch fixed a issue with wrting to a preallocated blocks.
 A write hit a BUG_ON() in fs/buffer.c saying the buffer is not mapped.
 
 On the write path, ext4_get_block_wrap() is called with create=1, but it
 will pass create=0 down to the underlying ext4ext_get_blocks()
 to do a look up first.  In the preallocation case, ext4_ext_get_blocks()
 with create = 0, will return number of blocks pre-allocated and buffer
 head unmapped. ext4_get_blocks_wrap() thinks it succeeds too early, without
 checking if it needs again call ext4_ext_get_blocks with create = 1
 which would do proper handling for writing to preallocated blocks:
 split the extent to initialized and uninitialized one and
 returns the mapped buffer head.
 
 Treating preallocated blocks as holes equally(i.e. ignoring the number of 
 blocks
 pre-allocated and returns 0) when get_blocks() is called with create = 0 is 
 not enough.
 ext4_ext_get_blocks() needs to differentiate these two cases for delayed 
 allocation
 purpose, as for holes it need to do reservation and prepare for later
 delayed allocation, but for pre-allocated blocks it needs skip that work.
 
 It would makes things more clear if we have clear definition of what
 get_blocks() return value means.
 
 Similar to ext4_get_blocks_handle(),  the following 
 * return  0, # of blocks already allocated
 *  if these are pre-allocated blocks and create = 0
 * buffer head is unmapped
 *  otherwise blocks are mapped.
 *
 * return = 0, if plain look up failed (blocks have not been allocated)
 *  buffer head is unmapped
 *
 * return  0, error case.
 
 The for the write path, at ext4_ext_get_blocks_wrap(), it could check the
 buffer_mapped() status for preallocated extent before quit too early.
 
 Signed-off-by: Mingming Cao [EMAIL PROTECTED]

Reviewed-by: Aneesh Kumar K.V [EMAIL PROTECTED]



I guess we also need to make sure the buffer head have the mapped bit
set. Something like the patch below.

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index bc7081f..69ccda9 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -2294,6 +2294,7 @@ int ext4_ext_get_blocks(handle_t *handle, struct inode 
*inode,
struct ext4_allocation_request ar;
 
__clear_bit(BH_New, bh_result-b_state);
+   __clear_bit(BH_Mapped, bh_result-b_state);
ext_debug(blocks %u/%lu requested for inode %u\n,
iblock, max_blocks, inode-i_ino);
 
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH -v2] ext4: Use page_mkwrite vma_operations to get mmap write notification.

2008-02-18 Thread Aneesh Kumar K.V
We would like to get notified when we are doing a write on mmap section.
This is needed with respect to preallocated area. We split the preallocated
area into initialzed extent and uninitialzed extent in the call back. This
let us handle ENOSPC better. Otherwise we get ENOSPC in the writepage and
that would result in data loss. The changes are also needed to handle ENOSPC
when writing to an mmap section of files with holes.

Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED]
---
 fs/ext4/file.c  |   19 ++-
 fs/ext4/inode.c |6 ++
 include/linux/ext4_fs.h |1 +
 3 files changed, 25 insertions(+), 1 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 20507a2..77341c1 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -123,6 +123,23 @@ force_commit:
return ret;
 }
 
+static struct vm_operations_struct ext4_file_vm_ops = {
+   .fault  = filemap_fault,
+   .page_mkwrite   = ext4_page_mkwrite,
+};
+
+static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
+{
+   struct address_space *mapping = file-f_mapping;
+
+   if (!mapping-a_ops-readpage)
+   return -ENOEXEC;
+   file_accessed(file);
+   vma-vm_ops = ext4_file_vm_ops;
+   vma-vm_flags |= VM_CAN_NONLINEAR;
+   return 0;
+}
+
 const struct file_operations ext4_file_operations = {
.llseek = generic_file_llseek,
.read   = do_sync_read,
@@ -133,7 +150,7 @@ const struct file_operations ext4_file_operations = {
 #ifdef CONFIG_COMPAT
.compat_ioctl   = ext4_compat_ioctl,
 #endif
-   .mmap   = generic_file_mmap,
+   .mmap   = ext4_file_mmap,
.open   = generic_file_open,
.release= ext4_release_file,
.fsync  = ext4_sync_file,
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 34f3eb6..81faa67 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3466,3 +3466,9 @@ int ext4_change_inode_journal_flag(struct inode *inode, 
int val)
 
return err;
 }
+
+int ext4_page_mkwrite(struct vm_area_struct *vma, struct page *page)
+{
+   return block_page_mkwrite(vma, page, ext4_get_block);
+}
+
diff --git a/include/linux/ext4_fs.h b/include/linux/ext4_fs.h
index 22810b1..8f5a563 100644
--- a/include/linux/ext4_fs.h
+++ b/include/linux/ext4_fs.h
@@ -1059,6 +1059,7 @@ extern void ext4_set_aops(struct inode *inode);
 extern int ext4_writepage_trans_blocks(struct inode *);
 extern int ext4_block_truncate_page(handle_t *handle, struct page *page,
struct address_space *mapping, loff_t from);
+extern int ext4_page_mkwrite(struct vm_area_struct *vma, struct page *page);
 
 /* ioctl.c */
 extern long ext4_ioctl(struct file *, unsigned int, unsigned long);
-- 
1.5.4.1.97.g40aab-dirty

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ext4: When reading from fallocated blocks make sure we return zero.

2008-02-17 Thread Aneesh Kumar K.V
On Sat, Feb 16, 2008 at 08:53:34AM +0530, Aneesh Kumar K.V wrote:
 On Fri, Feb 15, 2008 at 11:43:04AM -0800, Mingming Cao wrote:
  On Fri, 2008-02-15 at 23:46 +0530, Aneesh Kumar K.V wrote:
   fallocate blocks are considered as sparse area and read from them should
   return zero. ext4_ext_get_blocks should return zero for read request.
   
  
  The patch itself looks harmless, but I still don't see how this could
  fix the problem you described at irc: a write hit a BUG_ON() in
  fs/buffer.c saying the buffer is not mapped. Could you add more details
  here?
 
 Write will take the below call chain
 
 ext4_write_begin
   block_write_begin
 __block_prepare_write
ext4_getblock
  ext4_get_blocks_wrap
 (1)  ext4_ext_get_blocks with create = 0 return allocated
ll_rw_block  if buffer not uptodate.
  submit_bh
  BUG_ON(!buffer_mapped(bh))
 
 
 ext4_ext_get_blocks at (1) should have returned 0. That would cause
 ext4_get_blocks_wrap to again call ext4_ext_get_blocks with create = 1
 and that would have returned us the buffer head which is mapped. This
 would also result in splitting the extent to initialized and
 uninitialized one.
 

The change is also needed to get mmap on fallocate space to work.

[ cut here ]
WARNING: at fs/buffer.c:1680 __block_write_full_page+0x101/0x2f1()
 Modules linked in:
 Pid: 2478, comm: mmaptest Not tainted 2.6.25-rc1 #12
  [c0120e84] warn_on_slowpath+0x41/0x51
  [c0108c00] ?  native_sched_clock+0x2d/0x9f
  [c013b44e] ?  __lock_acquire+0xacb/0xb13
  [c013b44e] ?  __lock_acquire+0xacb/0xb13
  [c0180f97] __block_write_full_page+0x101/0x2f1
  [c01d053f] ?  ext4_get_block+0x0/0xc0
  [c018124f] block_write_full_page+0xc8/0xd1
  [c01d053f] ?  ext4_get_block+0x0/0xc0
  [c01d1a36] ext4_ordered_writepage+0xad/0x146
  [c01cec12] ? bget_one+0x0/0xb
  [c014c5dd] __writepage+0xb/0x25
  [c014cab2] write_cache_pages+0x180/0x287
  [c014c5d2] ? __writepage+0x0/0x25
  [c01527d5] ?  __do_fault+0x2e2/0x324
  [c0147889] ? unlock_page+0x25/0x28 
  [c014cbd6] generic_writepages+0x1d/0x27
  [c014cc0c] do_writepages+0x2c/0x34
  [c0147fe1] __filemap_fdatawrite_range+0x5b/0x67
  [c01481ba] filemap_fdatawrite+0x15/0x17
  [c017ea3d] do_fsync+0x2c/0x9a
  [c017eacb] __do_fsync+0x20/0x2f
  [c017eaf9] sys_fsync+0xd/0xf
  [c0104992] sysenter_past_esp+0x5f/0xa5
  ===
 ---[ end trace 5ba60b430e0af601 ]---


-aneesh
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] ext4: When reading from fallocated blocks make sure we return zero.

2008-02-15 Thread Aneesh Kumar K.V
fallocate blocks are considered as sparse area and read from them should
return zero. ext4_ext_get_blocks should return zero for read request.

Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED]
---
 fs/ext4/extents.c |8 +++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 3efbfd1..5b22f71 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -2379,8 +2379,14 @@ int ext4_ext_get_blocks(handle_t *handle, struct inode 
*inode,
}
if (create == EXT4_CREATE_UNINITIALIZED_EXT)
goto out;
-   if (!create)
+   if (!create) {
+   /*
+* read request should return zero blocks
+* allocated
+*/
+   allocated = 0;
goto out2;
+   }
 
ret = ext4_ext_convert_to_initialized(handle, inode,
path, iblock,
-- 
1.5.4.1.97.g40aab-dirty

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC/PATCH] ext4: Fix the memory leak and buffer head leak with respect to ext4_ext_path

2008-02-14 Thread Aneesh Kumar K.V
The path variable returned via ext4_ext_find_extent is a kmalloc variable
and need to be freeded. It also contain refrences to buffer_head which need
to be dropped.

Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED]
---
 fs/ext4/extents.c   |6 +++---
 fs/ext4/migrate.c   |5 +
 include/linux/ext4_fs_extents.h |1 +
 3 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index e856f66..995ac16 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -349,7 +349,7 @@ static void ext4_ext_show_leaf(struct inode *inode, struct 
ext4_ext_path *path)
 #define ext4_ext_show_leaf(inode,path)
 #endif
 
-static void ext4_ext_drop_refs(struct ext4_ext_path *path)
+void ext4_ext_drop_refs(struct ext4_ext_path *path)
 {
int depth = path-p_depth;
int i;
@@ -2200,10 +2200,10 @@ static int ext4_ext_convert_to_initialized(handle_t 
*handle,
newdepth = ext_depth(inode);
if (newdepth != depth) {
depth = newdepth;
-   path = ext4_ext_find_extent(inode, iblock, NULL);
+   ext4_ext_drop_refs(path);
+   path = ext4_ext_find_extent(inode, iblock, path);
if (IS_ERR(path)) {
err = PTR_ERR(path);
-   path = NULL;
goto out;
}
eh = path[depth].p_hdr;
diff --git a/fs/ext4/migrate.c b/fs/ext4/migrate.c
index 8c6c685..5c1e27d 100644
--- a/fs/ext4/migrate.c
+++ b/fs/ext4/migrate.c
@@ -43,6 +43,7 @@ static int finish_range(handle_t *handle, struct inode *inode,
 
if (IS_ERR(path)) {
retval = PTR_ERR(path);
+   path = NULL;
goto err_out;
}
 
@@ -74,6 +75,10 @@ static int finish_range(handle_t *handle, struct inode 
*inode,
}
retval = ext4_ext_insert_extent(handle, inode, path, newext);
 err_out:
+   if (path) {
+   ext4_ext_drop_refs(path);
+   kfree(path);
+   }
lb-first_pblock = 0;
return retval;
 }
diff --git a/include/linux/ext4_fs_extents.h b/include/linux/ext4_fs_extents.h
index 697da4b..1285c58 100644
--- a/include/linux/ext4_fs_extents.h
+++ b/include/linux/ext4_fs_extents.h
@@ -227,5 +227,6 @@ extern int ext4_ext_search_left(struct inode *, struct 
ext4_ext_path *,
ext4_lblk_t *, ext4_fsblk_t *);
 extern int ext4_ext_search_right(struct inode *, struct ext4_ext_path *,
ext4_lblk_t *, ext4_fsblk_t *);
+extern void ext4_ext_drop_refs(struct ext4_ext_path *);
 #endif /* _LINUX_EXT4_EXTENTS */
 
-- 
1.5.4.1.97.g40aab-dirty

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC/PATCH] ext4: Request for journal write access early.

2008-02-14 Thread Aneesh Kumar K.V
In ext4_ext_convert_to_initialized before we need to request for journal
write access before we even modify the extent length.

Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED]
---
 fs/ext4/extents.c |   11 ---
 1 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 995ac16..c4d6f19 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -2168,6 +2168,10 @@ static int ext4_ext_convert_to_initialized(handle_t 
*handle,
newblock = iblock - ee_block + ext_pblock(ex);
ex2 = ex;
 
+   err = ext4_ext_get_access(handle, inode, path + depth);
+   if (err)
+   goto out;
+
/* ex1: ee_block to iblock - 1 : uninitialized */
if (iblock  ee_block) {
ex1 = ex;
@@ -2210,6 +2214,10 @@ static int ext4_ext_convert_to_initialized(handle_t 
*handle,
ex = path[depth].p_ext;
if (ex2 != newex)
ex2 = ex;
+
+   err = ext4_ext_get_access(handle, inode, path + depth);
+   if (err)
+   goto out;
}
allocated = max_blocks;
}
@@ -2230,9 +2238,6 @@ static int ext4_ext_convert_to_initialized(handle_t 
*handle,
ex2-ee_len = cpu_to_le16(allocated);
if (ex2 != ex)
goto insert;
-   err = ext4_ext_get_access(handle, inode, path + depth);
-   if (err)
-   goto out;
/*
 * New (initialized) extent starts from the first block
 * in the current extent. i.e., ex2 == ex
-- 
1.5.4.1.97.g40aab-dirty

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] ext4: Set directory link count to zero if we fail to create the directory.

2008-02-12 Thread Aneesh Kumar K.V
With ext4 we fail directory creation if we fail to allocate initial
block for the directory. With EXT4_FEATURE_RO_COMPAT_DIR_NLINK,
if we fail to create the directory, the inode link count was wrongly
set to 1.  This cause e2fsck throws the below warning

Inode 1015 is a zero-length directory.  Clear? yes
Inode 1120 is a zero-length directory.  Clear? yes
Inode 1121 is a zero-length directory.  Clear? yes

Fix the same by dropping the inode link count using drop_nlink

Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED]
---
 fs/ext4/namei.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index a9347fb..fd3b031 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -1805,7 +1805,7 @@ retry:
inode-i_size = EXT4_I(inode)-i_disksize = inode-i_sb-s_blocksize;
dir_block = ext4_bread (handle, inode, 0, 1, err);
if (!dir_block) {
-   ext4_dec_count(handle, inode); /* is this nlink == 0? */
+   drop_nlink(inode);
ext4_mark_inode_dirty(handle, inode);
iput (inode);
goto out_stop;
-- 
1.5.4.1.97.g40aab-dirty

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


EXT4_FEATURE_RO_COMPAT_DIR_NLINK and the directory size.

2008-02-12 Thread Aneesh Kumar K.V
Steps:
[root]# more /root/largedir 
i=65000
mkdir /mnt/tmp/test/
cd /mnt/tmp/test/
while [ $i -gt 0 ]
do 
mkdir $i
i=$(expr $i - 1)
done

# /root/largedir
# cd /mnt/tmp/test
# rm -rf *
# cd ..
# ls -al 
drwxr-xr-x  1 root root 1380352 Feb 12 07:15 test
# e2fsck -fv /dev/sda7
# ls -al 
drwxr-xr-x  2 root root 1380352 Feb 12 07:15 test


So e2fsck fix the link count. But the size is not updated. If i try to
run the script again, i get 

EXT4-fs error (device sda7): ext4_find_entry: reading directory #187681
offset 0
attempt to access beyond end of device
sda7: rw=32, want=17582522464, limit=19551042
attempt to access beyond end of device
sda7: rw=32, want=32620150800, limit=19551042
attempt to access beyond end of device
sda7: rw=32, want=32620150800, limit=19551042



Also after running fsstress fsck fails with zero size directory on
powerpc. I am finding directories with 

i_links_count == 1
i_blocks == 0

I missed to check whether the l_i_blocks_hi was having some value.

-aneesh
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][0/28] Lustre e2fsprogs patch series

2008-02-11 Thread Aneesh Kumar K.V
On Sun, Feb 10, 2008 at 11:19:12PM -0500, Theodore Tso wrote:
 On Sat, Feb 02, 2008 at 12:59:43AM -0700, Andreas Dilger wrote:
  The following series of emails will contain the large part of the
  e2fsprogs patch series that is used for Lustre.  It will not contain
  the regression tests for EXTENTS nor the DIR_NLINK features, as those
  are very large and were previously submitted.
  
  A full tarball that includes the patches, series, and regression tests
  will be uploaded to ftp://ftp.lustre.org/pub/lustre/other/e2fsprogs/
 
 Hey Andreas,
 
 I've applied these patches to the tip of maint, and exported it as
 e2fsprogs-interim on the e2fsprogs git repository.  There quite a
 few patch conflicts, mostly due to some changes that had happened on
 the tip of maint, but also apparently because your patchset was
 missing the flex bg changes.  I haven't applied them yet, but I'll
 probably tack them at the end.
 
 If you could sanity check to make sure they are sane, I would
 appreciate it.
 

Needed this diff to get it build

from ../../../lib/ext2fs/crc16.h:18,
 from ../../../lib/ext2fs/crc16.c:9:
/usr/include/sys/types.h:46: error: conflicting types for 'loff_t'
/usr/include/linux/types.h:30: error: previous declaration of 'loff_t' was here
/usr/include/sys/types.h:62: error: conflicting types for 'dev_t'
/usr/include/linux/types.h:13: error: previous declaration of 'dev_t' was here
In file included from /usr/include/sys/types.h:133,
 from /usr/include/stdlib.h:438,
 from ../../../lib/ext2fs/crc16.h:18,
 from ../../../lib/ext2fs/crc16.c:9


diff --git a/lib/ext2fs/crc16.c b/lib/ext2fs/crc16.c
index 5d87e10..246813f 100644
--- a/lib/ext2fs/crc16.c
+++ b/lib/ext2fs/crc16.c
@@ -5,7 +5,6 @@
  * Version 2. See the file COPYING for more details.
  */
 
-#include linux/types.h
 #include crc16.h
 
 /** CRC table for the CRC-16. The poly is 0x8005 (x^16 + x^15 + x^2 + 1) */
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ext4: Don't claim block from group which has corrupt bitmap

2008-02-11 Thread Aneesh Kumar K.V
On Mon, Feb 11, 2008 at 05:00:18PM +0530, Aneesh Kumar K.V wrote:
 In ext4_mb_complex_scan_group, if the extent length of the newly
 found extentet is greater than than the total free blocks counted
 in group info, break without claiming the block.
 
 Document different ext4_error usage, explaining the state with which we
 continue if we mount with errors=continue
 
 Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED]
 ---
  fs/ext4/mballoc.c |   16 +++-
  1 files changed, 15 insertions(+), 1 deletions(-)


This was actually part of the version 2 of the patch
ext4_avoid_panic_in_case_of_corrupted_bitmap.patch. But we missed it
in the last update. So sending it as a separate patch.


-aneesh
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BUG_ON at mballoc.c:3752

2008-02-07 Thread Aneesh Kumar K.V
On Wed, Feb 06, 2008 at 03:59:48PM -0600, Dave Kleikamp wrote:
 
 File systems should not call BUG() due to a corrupt file system.
 Instead the code should fail the operation, possibly marking the file
 system read-only (or panicking) depending on the errors= mount option.
 

Eric Sandeen explained me the same on IRC. I was busy with the migrate
locking bug. That's why i didn't update here. Today i tried to reproduce
the problem using the image provided. But in my case it is not hitting
the BUG_ON (mostly due to single cpu). I did look at the code and am not
still not clear how we can hit that BUG_ON. prealloc free space pa_free is
generated out of bitmap. So only if something corrupted bitmap after we
initialized prealloc space we will hit this case. In mballoc we error out
if the block allocated or fall in system zone. One thing i noticed is,
the journal is corrupt. So the only possibility that i have is journal write
resulted in bitmap corruption.

I also looked at the mballoc to make sure we don't panic in case of a
corrupt bitmap. Below is the patch that i have now. This one is yet to
go through the ABAT test but it would be nice to see whether the below
change cause any other issues.

Eric ,
can you run the test with below patch and see if this makes any
difference ?. I know we are not fixing any bugs in the below patch.


ext4: Don't panic in case of corrupt bitmap

From: Aneesh Kumar K.V [EMAIL PROTECTED]

Multiblock allocator was calling BUG_ON in many case if the free and used
blocks count obtained looking at the bitmap is different from what
the allocator internally accounted for. Use ext4_error in such case
and don't panic the system.

Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED]
---

 fs/ext4/mballoc.c |   35 +--
 1 files changed, 21 insertions(+), 14 deletions(-)


diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 06d1f52..656729b 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -680,7 +680,6 @@ static void *mb_find_buddy(struct ext4_buddy *e4b, int 
order, int *max)
 {
char *bb;
 
-   /* FIXME!! is this needed */
BUG_ON(EXT4_MB_BITMAP(e4b) == EXT4_MB_BUDDY(e4b));
BUG_ON(max == NULL);
 
@@ -964,7 +963,7 @@ static void ext4_mb_generate_buddy(struct super_block *sb,
grp-bb_fragments = fragments;
 
if (free != grp-bb_free) {
-   printk(KERN_DEBUG
+   ext4_error(sb, __FUNCTION__,
EXT4-fs: group %lu: %u blocks in bitmap, %u in gd\n,
group, free, grp-bb_free);
grp-bb_free = free;
@@ -1821,13 +1820,24 @@ static void ext4_mb_complex_scan_group(struct 
ext4_allocation_context *ac,
i = ext4_find_next_zero_bit(bitmap,
EXT4_BLOCKS_PER_GROUP(sb), i);
if (i = EXT4_BLOCKS_PER_GROUP(sb)) {
-   BUG_ON(free != 0);
+   /*
+* IF we corrupt the bitmap  we won't find any
+* free blocks even though group info says we
+* we have free blocks
+*/
+   ext4_error(sb, __FUNCTION__, %d free blocks as per 
+   group info. But bitmap says 0\n,
+   free);
break;
}
 
mb_find_extent(e4b, 0, i, ac-ac_g_ex.fe_len, ex);
BUG_ON(ex.fe_len = 0);
-   BUG_ON(free  ex.fe_len);
+   if (free  ex.fe_len) {
+   ext4_error(sb, __FUNCTION__, %d free blocks as per 
+   group info. But got %d blocks\n,
+   free, ex.fe_len);
+   }
 
ext4_mb_measure_extent(ac, ex, e4b);
 
@@ -3354,13 +3364,10 @@ static void ext4_mb_use_group_pa(struct 
ext4_allocation_context *ac,
ac-ac_pa = pa;
 
/* we don't correct pa_pstart or pa_plen here to avoid
-* possible race when tte group is being loaded concurrently
+* possible race when the group is being loaded concurrently
 * instead we correct pa later, after blocks are marked
-* in on-disk bitmap -- see ext4_mb_release_context() */
-   /*
-* FIXME!! but the other CPUs can look at this particular
-* pa and think that it have enought free blocks if we
-* don't update pa_free here right ?
+* in on-disk bitmap -- see ext4_mb_release_context()
+* Other CPUs are prevented from allocating from this pa by lg_mutex
 */
mb_debug(use %u/%u from group pa %p\n, pa-pa_lstart-len, len, pa);
 }
@@ -3743,13 +3750,13 @@ static int ext4_mb_release_inode_pa(struct ext4_buddy 
*e4b,
bit = next + 1;
}
if (free != pa-pa_free) {
-   printk(KERN_ERR pa %p: logic %lu, phys. %lu, len %lu\n

[PATCH] ext4: Fix circular locking dependency with migrate and rm.

2008-02-06 Thread Aneesh Kumar K.V
We now take inode-i_mutex lock to prevent any update of the inode i_data
field. Before we switch the inode format we take i_data_sem to prevent
parallel read.

===
[ INFO: possible circular locking dependency detected ]
2.6.24-rc8 #6
---
rm/2401 is trying to acquire lock:
 (ei-i_data_sem){}, at: [c01dca58] ext4_get_blocks_wrap+0x21/0x108

but task is already holding lock:
 (jbd2_handle){--..}, at: [c01fc4a7] jbd2_journal_start+0xd2/0xff

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

- #1 (jbd2_handle){--..}:
   [c0143a5c] __lock_acquire+0xa31/0xc1a
   [c0143cbf] lock_acquire+0x7a/0x94
   [c01fc4ca] jbd2_journal_start+0xf5/0xff
   [c01e3539] ext4_journal_start_sb+0x48/0x4a
   [c01eb980] ext4_ext_migrate+0x7d/0x535
   [c01df328] ext4_ioctl+0x528/0x56c
   [c0177700] do_ioctl+0x50/0x67
   [c017794e] vfs_ioctl+0x237/0x24a
   [c0177992] sys_ioctl+0x31/0x4b
   [c0104f8a] sysenter_past_esp+0x5f/0xa5
   [] 0x

- #0 (ei-i_data_sem){}:
   [c014394c] __lock_acquire+0x921/0xc1a
   [c0143cbf] lock_acquire+0x7a/0x94
   [c044f247] down_read+0x42/0x79
   [c01dca58] ext4_get_blocks_wrap+0x21/0x108
   [c01dcba1] ext4_getblk+0x62/0x1c4
   [c01e0de9] ext4_find_entry+0x350/0x5b7
   [c01e200c] ext4_unlink+0x6e/0x1a4
   [c017449e] vfs_unlink+0x49/0x89
   [c0175f02] do_unlinkat+0x96/0x12c
   [c0175fa8] sys_unlink+0x10/0x12
   [c0104f8a] sysenter_past_esp+0x5f/0xa5
   [] 0x

other info that might help us debug this:

3 locks held by rm/2401:
 #0:  (type-i_mutex_dir_key#5/1){--..}, at: [c0175eca] 
do_unlinkat+0x5e/0x12c
 #1:  (sb-s_type-i_mutex_key#8){--..}, at: [c017448b] vfs_unlink+0x36/0x89
 #2:  (jbd2_handle){--..}, at: [c01fc4a7] jbd2_journal_start+0xd2/0xff

stack backtrace:
Pid: 2401, comm: rm Not tainted 2.6.24-rc8 #6
 [c0106017] show_trace_log_lvl+0x1a/0x2f
 [c0106893] show_trace+0x12/0x14
 [c0106b89] dump_stack+0x6c/0x72
 [c0141b26] print_circular_bug_tail+0x5f/0x68
 [c014394c] __lock_acquire+0x921/0xc1a
 [c0143cbf] lock_acquire+0x7a/0x94
 [c044f247] down_read+0x42/0x79
 [c01dca58] ext4_get_blocks_wrap+0x21/0x108
 [c01dcba1] ext4_getblk+0x62/0x1c4
 [c01e0de9] ext4_find_entry+0x350/0x5b7
 [c01e200c] ext4_unlink+0x6e/0x1a4
 [c017449e] vfs_unlink+0x49/0x89
 [c0175f02] do_unlinkat+0x96/0x12c
 [c0175fa8] sys_unlink+0x10/0x12
 [c0104f8a] sysenter_past_esp+0x5f/0xa5

Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED]
---
 fs/ext4/migrate.c |  117 +---
 1 files changed, 74 insertions(+), 43 deletions(-)

diff --git a/fs/ext4/migrate.c b/fs/ext4/migrate.c
index 9ee1f7c..8c6c685 100644
--- a/fs/ext4/migrate.c
+++ b/fs/ext4/migrate.c
@@ -61,10 +61,9 @@ static int finish_range(handle_t *handle, struct inode 
*inode,
retval = ext4_journal_restart(handle, needed);
if (retval)
goto err_out;
-   }
-   if (needed) {
+   } else if (needed) {
retval = ext4_journal_extend(handle, needed);
-   if (retval != 0) {
+   if (retval) {
/*
 * IF not able to extend the journal restart the journal
 */
@@ -220,6 +219,26 @@ static int update_tind_extent_range(handle_t *handle, 
struct inode *inode,
 
 }
 
+static int extend_credit_for_blkdel(handle_t *handle, struct inode *inode)
+{
+   int retval = 0, needed;
+
+   if (handle-h_buffer_credits  EXT4_RESERVE_TRANS_BLOCKS)
+   return 0;
+   /*
+* We are freeing a blocks. During this we touch
+* superblock, group descriptor and block bitmap.
+* So allocate a credit of 3. We may update
+* quota (user and group).
+*/
+   needed = 3 + 2*EXT4_QUOTA_TRANS_BLOCKS(inode-i_sb);
+
+   if (ext4_journal_extend(handle, needed) != 0)
+   retval = ext4_journal_restart(handle, needed);
+
+   return retval;
+}
+
 static int free_dind_blocks(handle_t *handle,
struct inode *inode, __le32 i_data)
 {
@@ -234,11 +253,14 @@ static int free_dind_blocks(handle_t *handle,
 
tmp_idata = (__le32 *)bh-b_data;
for (i = 0; i  max_entries; i++) {
-   if (tmp_idata[i])
+   if (tmp_idata[i]) {
+   extend_credit_for_blkdel(handle, inode);
ext4_free_blocks(handle, inode,
le32_to_cpu(tmp_idata[i]), 1, 1);
+   }
}
put_bh(bh);
+   extend_credit_for_blkdel(handle, inode);
ext4_free_blocks(handle, inode, le32_to_cpu(i_data), 1, 1);
return 0;
 }
@@ -267,29 +289,32 @@ static int free_tind_blocks(handle_t *handle

Re: jbd2_handle and i_data_sem circular locking dependency detected

2008-02-05 Thread Aneesh Kumar K.V
On Mon, Feb 04, 2008 at 05:31:56PM +0100, Jan Kara wrote:
   Hi,
 
 On Mon 04-02-08 15:42:28, Aneesh Kumar K.V wrote:
  This is with the new ext3 - ext4 migrate code added. The recently added
  lockdep for jbd2 helped to find this out. We want to hold the i_data_sem
  on the ext3 inode during migration to prevent walking the ext3 inode
  when it is being converted to ext4 format. Also we want to avoid 
  file truncation and new blocks being added while converting to ext4.
  Also we dont want to reserve large number of credits for journal.
  Any idea how to fix this ?
   Hmm, while briefly looking at the code - why do you introduce i_data_sem
 and not use i_alloc_sem which is already in VFS inode? That is aimed
 exactly at the serialization of truncates, writes and similar users.
 That doesn't solve problems with lock ordering but I was just wondering...
   Another problem - ext4_fallocate() has the same lock ordering problem as
 the migration code and maybe there are others...
   One (stupid) solution to your problem is to make i_data_sem be
 always locked before the transaction is started. It could possibly have
 negative performance impact because you'd have to hold the semaphore for
 a longer time and thus a writer would block readers for longer time. So one
 would have to measure how big difference that would make.
   Another possibility is to start a single transaction for migration and
 extend it as long as you can (as truncate does it). And when you can't
 extend any more, you drop the i_data_sem and start a new transaction and
 acquire the semaphore again. This has the disadvantage that after dropping
 the semaphore you have to resync your original inode with the temporary
 one your are building which probably ends up being ugly as night... Hmm,
 but maybe we could get rid of this - hold i_mutex to protect against all
 writes (that ranks outside of transaction start so you can hold it for the
 whole migration time - maybe you even hold it if you are called from the
 write path...). After dropping i_data_sem you let some readers proceed
 but writers still wait on i_mutex so the file shouldn't change under you
 (but I suggest adding some BUG_ONs to verify that the file really doesn't
 change :).
 

How about the patch below. I did the below testing
a) migrate a file
b) run fs_inode fsstres fsx_linux.

The intention was to find out whether the new locking is breaking any of
the other expected hierarchy. It seems to fine. I didn't get any lockdep
warning.

ext4: Fix circular locking dependency with migrate and rm.

From: Aneesh Kumar K.V [EMAIL PROTECTED]

We now take inode-i_mutex lock to prevent any update of the inode i_data
field. Before we switch the inode format we take i_data_sem to prevent
parallel read.

===
[ INFO: possible circular locking dependency detected ]
2.6.24-rc8 #6
---
rm/2401 is trying to acquire lock:
 (ei-i_data_sem){}, at: [c01dca58] ext4_get_blocks_wrap+0x21/0x108

but task is already holding lock:
 (jbd2_handle){--..}, at: [c01fc4a7] jbd2_journal_start+0xd2/0xff

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

- #1 (jbd2_handle){--..}:
   [c0143a5c] __lock_acquire+0xa31/0xc1a
   [c0143cbf] lock_acquire+0x7a/0x94
   [c01fc4ca] jbd2_journal_start+0xf5/0xff
   [c01e3539] ext4_journal_start_sb+0x48/0x4a
   [c01eb980] ext4_ext_migrate+0x7d/0x535
   [c01df328] ext4_ioctl+0x528/0x56c
   [c0177700] do_ioctl+0x50/0x67
   [c017794e] vfs_ioctl+0x237/0x24a
   [c0177992] sys_ioctl+0x31/0x4b
   [c0104f8a] sysenter_past_esp+0x5f/0xa5
   [] 0x

- #0 (ei-i_data_sem){}:
   [c014394c] __lock_acquire+0x921/0xc1a
   [c0143cbf] lock_acquire+0x7a/0x94
   [c044f247] down_read+0x42/0x79
   [c01dca58] ext4_get_blocks_wrap+0x21/0x108
   [c01dcba1] ext4_getblk+0x62/0x1c4
   [c01e0de9] ext4_find_entry+0x350/0x5b7
   [c01e200c] ext4_unlink+0x6e/0x1a4
   [c017449e] vfs_unlink+0x49/0x89
   [c0175f02] do_unlinkat+0x96/0x12c
   [c0175fa8] sys_unlink+0x10/0x12
   [c0104f8a] sysenter_past_esp+0x5f/0xa5
   [] 0x

other info that might help us debug this:

3 locks held by rm/2401:
 #0:  (type-i_mutex_dir_key#5/1){--..}, at: [c0175eca] 
do_unlinkat+0x5e/0x12c
 #1:  (sb-s_type-i_mutex_key#8){--..}, at: [c017448b] vfs_unlink+0x36/0x89
 #2:  (jbd2_handle){--..}, at: [c01fc4a7] jbd2_journal_start+0xd2/0xff

stack backtrace:
Pid: 2401, comm: rm Not tainted 2.6.24-rc8 #6
 [c0106017] show_trace_log_lvl+0x1a/0x2f
 [c0106893] show_trace+0x12/0x14
 [c0106b89] dump_stack+0x6c/0x72
 [c0141b26] print_circular_bug_tail+0x5f/0x68
 [c014394c] __lock_acquire+0x921/0xc1a
 [c0143cbf] lock_acquire+0x7a/0x94
 [c044f247] down_read+0x42/0x79
 [c01dca58] ext4_get_blocks_wrap+0x21/0x108
 [c01dcba1] ext4_getblk+0x62/0x1c4
 [c01e0de9] ext4_find_entry+0x350

Re: jbd2_handle and i_data_sem circular locking dependency detected

2008-02-05 Thread Aneesh Kumar K.V
On Tue, Feb 05, 2008 at 02:42:28PM +0100, Jan Kara wrote:
 On Tue 05-02-08 17:53:42, Aneesh Kumar K.V wrote:
  
  How about the patch below. I did the below testing
  a) migrate a file
  b) run fs_inode fsstres fsx_linux.
  
  The intention was to find out whether the new locking is breaking any of
  the other expected hierarchy. It seems to fine. I didn't get any lockdep
  warning.
   I think there's a problem in the patch. I don't think you can call
 free_ind_block() while readers are accessing the file. That could make them
 think the file contains holes when they aren't there (or even worse read
 freed blocks or so). So you need to lock i_data_sem before this call (that
 means you have to move journal_extend() as well). Actually, I don't quite
 get why ext4_journal_extend() is called in that function. You can just
 count with the 1 credit in ext4_ext_migrate() when you call
 ext4_journal_extend() before calling ext4_ext_swap_inode_data(). Oh,
 probably because free_ind_block() could extend the transaction (which they
 don't do now as far as I can see).

ext4_journal_extend is called to extend the journal by one credit to
take care of writing the block containing inode. I moved the journal
extend before taking i_data_sem lock.

diff --git a/fs/ext4/migrate.c b/fs/ext4/migrate.c
index f97c993..dc6617a 100644
--- a/fs/ext4/migrate.c
+++ b/fs/ext4/migrate.c
@@ -302,10 +302,6 @@ static int ext4_ext_swap_inode_data(handle_t *handle, 
struct inode *inode,
struct ext4_inode_info *ei = EXT4_I(inode);
struct ext4_inode_info *tmp_ei = EXT4_I(tmp_inode);
 
-   retval = free_ind_block(handle, inode);
-   if (retval)
-   goto err_out;
-
/*
 * One credit accounted for writing the
 * i_data field of the original inode
@@ -318,6 +314,10 @@ static int ext4_ext_swap_inode_data(handle_t *handle, 
struct inode *inode,
}
 
down_write(EXT4_I(inode)-i_data_sem);
+   retval = free_ind_block(handle, inode);
+   if (retval)
+   goto err_out;
+
/*
 * We have the extent map build with the tmp inode.
 * Now copy the i_data across

The above change also make sure we don't fail after we free the indirect
blocks.


   BTW: That freeing code should really take into account that it can modify
 bitmaps in different block groups. It's even not that hard to do. Just
 before each ext4_free_blocks() in free_ind_block() you check whether you
 have still enough credits in the handle (use h_buffer_credits) and if not,
 extend it by some amount.


I have a FIXME at migrate.c:524 documenting exactly that. The
difficult question was by how much we should extent the journal. ? But
in reality we might have accumulated enough journal credits, I never
really ran across a case where we are running out of the journal credit.


   Maybe you could do some test like writing a big file with some data and then
 while migration is running read it in a loop and compare that MD5SUM is the
 same all the time. Also run some memory-pressure during this test so that
 data blocks aren't cached for the whole time of the test... That should
 reasonably stress the migration code.

I have tested migrate by booting with mem= and doing parallel
read/write and migrate. I didn't do the MDSUM compare. Will do that this
time.


Thanks for all the help with review.
-aneesh
 
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: jbd2_handle and i_data_sem circular locking dependency detected

2008-02-05 Thread Aneesh Kumar K.V
On Tue, Feb 05, 2008 at 05:34:04PM +0100, Jan Kara wrote:
 On Tue 05-02-08 21:57:03, Aneesh Kumar K.V wrote:
  
  I have a FIXME at migrate.c:524 documenting exactly that. The
  difficult question was by how much we should extent the journal. ? But
  in reality we might have accumulated enough journal credits, I never
  really ran across a case where we are running out of the journal credit.
   Yes, I don't think it is likely to happen in reality but if somebody
 would trigger this, it would be almost impossible to track down so one
 should be quite careful with these things...
   And as I described, doing it failsafe is easy - just look how
 try_to_extend_transaction() in ext4/inode.c handles similar problems with
 truncate.
 

I moved the indirect block freeing after i update the original inode.
That makes sure even if we fail to free the indirect blocks we have the
original inode converted. Now since i don't need atomicity with freeing of
blocks i extend the journal for each block freed. Below is the diff i have
right now.

diff --git a/fs/ext4/migrate.c b/fs/ext4/migrate.c
index 1712844..1b00587 100644
--- a/fs/ext4/migrate.c
+++ b/fs/ext4/migrate.c
@@ -233,11 +233,14 @@ static int free_dind_blocks(handle_t *handle,
 
tmp_idata = (__le32 *)bh-b_data;
for (i = 0; i  max_entries; i++) {
-   if (tmp_idata[i])
+   if (tmp_idata[i]) {
+   extend_blkdelete_credit(handle, inode);
ext4_free_blocks(handle, inode,
le32_to_cpu(tmp_idata[i]), 1, 1);
+   }
}
put_bh(bh);
+   extend_blkdelete_credit(handle, inode);
ext4_free_blocks(handle, inode, le32_to_cpu(i_data), 1, 1);
return 0;
 }
@@ -266,29 +269,32 @@ static int free_tind_blocks(handle_t *handle,
}
}
put_bh(bh);
+   extend_blkdelete_credit(handle, inode);
ext4_free_blocks(handle, inode, le32_to_cpu(i_data), 1, 1);
return 0;
 }
 
-static int free_ind_block(handle_t *handle, struct inode *inode)
+static int free_ind_block(handle_t *handle, __le32 *i_data)
 {
int retval;
-   struct ext4_inode_info *ei = EXT4_I(inode);
 
-   if (ei-i_data[EXT4_IND_BLOCK])
+   /* ei-i_data[EXT4_IND_BLOCK] */
+   if (i_data[0]) {
+   extend_blkdelete_credit(handle, inode);
ext4_free_blocks(handle, inode,
-   le32_to_cpu(ei-i_data[EXT4_IND_BLOCK]), 1, 1);
+   le32_to_cpu(i_data[0]), 1, 1);
+   }
 
-   if (ei-i_data[EXT4_DIND_BLOCK]) {
-   retval = free_dind_blocks(handle, inode,
-   ei-i_data[EXT4_DIND_BLOCK]);
+   /* ei-i_data[EXT4_DIND_BLOCK] */
+   if (i_data[1]) {
+   retval = free_dind_blocks(handle, inode, i_data[1]);
if (retval)
return retval;
}
 
-   if (ei-i_data[EXT4_TIND_BLOCK]) {
-   retval = free_tind_blocks(handle, inode,
-   ei-i_data[EXT4_TIND_BLOCK]);
+   /* ei-i_data[EXT4_TIND_BLOCK] */
+   if (i_data[2]) {
+   retval = free_tind_blocks(handle, inode, i_data[2]);
if (retval)
return retval;
}
@@ -299,6 +305,7 @@ static int ext4_ext_swap_inode_data(handle_t *handle, 
struct inode *inode,
struct inode *tmp_inode)
 {
int retval;
+   __le32  i_data[3];
struct ext4_inode_info *ei = EXT4_I(inode);
struct ext4_inode_info *tmp_ei = EXT4_I(tmp_inode);
 
@@ -313,11 +320,11 @@ static int ext4_ext_swap_inode_data(handle_t *handle, 
struct inode *inode,
goto err_out;
}
 
-   down_write(EXT4_I(inode)-i_data_sem);
-   retval = free_ind_block(handle, inode);
-   if (retval)
-   goto err_out;
+   i_data[0] = ei-i_data[EXT4_IND_BLOCK];
+   i_data[1] = ei-i_data[EXT4_DIND_BLOCK];
+   i_data[2] = ei-i_data[EXT4_TIND_BLOCK];
 
+   down_write(EXT4_I(inode)-i_data_sem);
/*
 * We have the extent map build with the tmp inode.
 * Now copy the i_data across
@@ -338,8 +345,12 @@ static int ext4_ext_swap_inode_data(handle_t *handle, 
struct inode *inode,
inode-i_blocks += tmp_inode-i_blocks;
spin_unlock(inode-i_lock);
up_write(EXT4_I(inode)-i_data_sem);
-
ext4_mark_inode_dirty(handle, inode);
+
+   /* Now free the indirec block of the  inode */
+   retval = free_ind_block(handle, i_data);
+   if (retval)
+   goto err_out;
 err_out:
return retval;
 }
@@ -367,6 +378,7 @@ static int free_ext_idx(handle_t *handle, struct inode 
*inode,
}
}
put_bh(bh);
+   extend_blkdelete_credit(handle, inode);
ext4_free_blocks(handle, inode, block, 1, 1);
return

Re: [PATCH] Fix ext4 bitops

2008-02-04 Thread Aneesh Kumar K.V
On Mon, Feb 04, 2008 at 10:24:36AM +0100, Heiko Carstens wrote:
 | fs/ext4/mballoc.c: In function 'ext4_mb_generate_buddy':
 | fs/ext4/mballoc.c:954: error: implicit declaration of function 
 'generic_find_next_le_bit'
 
 The s390 specific bitops uses parts of the generic implementation.
 Include the correct header.

That doesn't work:

fs/built-in.o: In function `ext4_mb_release_inode_pa':
mballoc.c:(.text+0x95a8a): undefined reference to 
`generic_find_next_le_bit'
fs/built-in.o: In function `ext4_mb_init_cache':
mballoc.c:(.text+0x967ea): undefined reference to 
`generic_find_next_le_bit'

This still needs generic_find_next_le_bit which comes
from lib/find_next_bit.c. That one doesn't get built on s390 since we
don't set GENERIC_FIND_NEXT_BIT.
Currently we have the lengthly patch below queued.
   
   Similar issue on m68k. As Bastian also saw it on powerpc, I'm getting the
   impression the ext4 people don't (compile) test on big endian machines?
   
   Gr{oetje,eeting}s,
   
  
  I have sent this patches to linux-arch expecting a review from
  different arch people. It is true that the patches are tested only on
  powerpc, x86-64, x86. That's the primary reason of me sending the
  patches to linux-arch.
 
 Is there anything special I need to do so the ext4 code actually uses
 ext2_find_next_bit() ? Haven't looked at the ext4 code, but I'd like to
 test if the s390 implementation is ok.

With the latest linus kernel in git you can test it by mounting  ext4

mount -t ext4dev device mntpoint


-aneesh
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fw: [Bugme-new] [Bug 9849] New: NULL pointer deref in journal_wait_on_commit_record

2008-02-04 Thread Aneesh Kumar K.V
On Wed, Jan 30, 2008 at 03:17:57PM -0800, Mingming Cao wrote:
 
 The buufer head pointer passed to journal_wait_on_commit_record() could
 be NULL if the previous journal_submit_commit_record() failed or journal
 has already aborted.
 
 Looking at the jbd2 debug messages, before the oops happen, the jbd2 is
 aborted due to trying to access the next log block beyond the end of
 device. This might be caused by using a corrupted image.
 
 We need to check the error returns from journal_submit_commit_record()
 and avoid calling journal_wait_on_commit_record() in the failure case.
 
 Signed-off-by: Mingming Cao [EMAIL PROTECTED]
 The buufer head pointer passed to journal_wait_on_commit_record()
 could be NULL if the previous journal_submit_commit_record() failed
 or journal has already aborted.
 
 We need to check the error returns from journal_submit_commit_record()
 and avoid calling journal_wait_on_commit_record() in the failure case.
 
 Signed-off-by: Mingming Cao [EMAIL PROTECTED]
 ---
  fs/jbd2/commit.c |3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)
 
 Index: linux-2.6.24-rc8/fs/jbd2/commit.c
 ===
 --- linux-2.6.24-rc8.orig/fs/jbd2/commit.c2008-01-30 14:12:10.0 
 -0800
 +++ linux-2.6.24-rc8/fs/jbd2/commit.c 2008-01-30 15:09:50.0 -0800
 @@ -872,7 +872,8 @@ wait_for_iobuf:
   if (err)
   __jbd2_journal_abort_hard(journal);
   }
 - err = journal_wait_on_commit_record(cbh);
 + if (!err  !is_journal_aborted(journal))
 + err = journal_wait_on_commit_record(cbh);
 
   if (err)
   jbd2_journal_abort(journal, err);
 
 

Needs the below small change also. I don't see this patch in the patch
queue. So i guess we can add the below diff to the same. The change was
suggested by Girish. Before journal checksum changes sync_dirty_buffer
did the get_bh.

Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED]

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index da8d0eb..2b88ab0 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -136,7 +136,7 @@ static int journal_submit_commit_record(journal_t *journal,
 
JBUFFER_TRACE(descriptor, submit commit block);
lock_buffer(bh);
-
+   get_bh(bh);
set_buffer_dirty(bh);
set_buffer_uptodate(bh);
bh-b_end_io = journal_end_buffer_io_sync;
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


jbd2_handle and i_data_sem circular locking dependency detected

2008-02-04 Thread Aneesh Kumar K.V
Hi,

This is with the new ext3 - ext4 migrate code added. The recently added
lockdep for jbd2 helped to find this out. We want to hold the i_data_sem
on the ext3 inode during migration to prevent walking the ext3 inode
when it is being converted to ext4 format. Also we want to avoid 
file truncation and new blocks being added while converting to ext4.
Also we dont want to reserve large number of credits for journal.
Any idea how to fix this ?


-aneesh

===
[ INFO: possible circular locking dependency detected ]
2.6.24-rc8 #6
---
rm/2401 is trying to acquire lock:
 (ei-i_data_sem){}, at: [c01dca58] ext4_get_blocks_wrap+0x21/0x108

but task is already holding lock:
 (jbd2_handle){--..}, at: [c01fc4a7] jbd2_journal_start+0xd2/0xff

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

- #1 (jbd2_handle){--..}:
   [c0143a5c] __lock_acquire+0xa31/0xc1a
   [c0143cbf] lock_acquire+0x7a/0x94
   [c01fc4ca] jbd2_journal_start+0xf5/0xff
   [c01e3539] ext4_journal_start_sb+0x48/0x4a
   [c01eb980] ext4_ext_migrate+0x7d/0x535
   [c01df328] ext4_ioctl+0x528/0x56c
   [c0177700] do_ioctl+0x50/0x67
   [c017794e] vfs_ioctl+0x237/0x24a
   [c0177992] sys_ioctl+0x31/0x4b
   [c0104f8a] sysenter_past_esp+0x5f/0xa5
   [] 0x

- #0 (ei-i_data_sem){}:
   [c014394c] __lock_acquire+0x921/0xc1a
   [c0143cbf] lock_acquire+0x7a/0x94
   [c044f247] down_read+0x42/0x79
   [c01dca58] ext4_get_blocks_wrap+0x21/0x108
   [c01dcba1] ext4_getblk+0x62/0x1c4
   [c01e0de9] ext4_find_entry+0x350/0x5b7
   [c01e200c] ext4_unlink+0x6e/0x1a4
   [c017449e] vfs_unlink+0x49/0x89
   [c0175f02] do_unlinkat+0x96/0x12c
   [c0175fa8] sys_unlink+0x10/0x12
   [c0104f8a] sysenter_past_esp+0x5f/0xa5
   [] 0x

other info that might help us debug this:

3 locks held by rm/2401:
 #0:  (type-i_mutex_dir_key#5/1){--..}, at: [c0175eca] 
do_unlinkat+0x5e/0x12c
 #1:  (sb-s_type-i_mutex_key#8){--..}, at: [c017448b] vfs_unlink+0x36/0x89
 #2:  (jbd2_handle){--..}, at: [c01fc4a7] jbd2_journal_start+0xd2/0xff

stack backtrace:
Pid: 2401, comm: rm Not tainted 2.6.24-rc8 #6
 [c0106017] show_trace_log_lvl+0x1a/0x2f
 [c0106893] show_trace+0x12/0x14
 [c0106b89] dump_stack+0x6c/0x72
 [c0141b26] print_circular_bug_tail+0x5f/0x68
 [c014394c] __lock_acquire+0x921/0xc1a
 [c0143cbf] lock_acquire+0x7a/0x94
 [c044f247] down_read+0x42/0x79
 [c01dca58] ext4_get_blocks_wrap+0x21/0x108
 [c01dcba1] ext4_getblk+0x62/0x1c4
 [c01e0de9] ext4_find_entry+0x350/0x5b7
 [c01e200c] ext4_unlink+0x6e/0x1a4
 [c017449e] vfs_unlink+0x49/0x89
 [c0175f02] do_unlinkat+0x96/0x12c
 [c0175fa8] sys_unlink+0x10/0x12
 [c0104f8a] sysenter_past_esp+0x5f/0xa5
 
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: jbd2_handle and i_data_sem circular locking dependency detected

2008-02-04 Thread Aneesh Kumar K.V
On Mon, Feb 04, 2008 at 05:31:56PM +0100, Jan Kara wrote:
   Hi,
 
 On Mon 04-02-08 15:42:28, Aneesh Kumar K.V wrote:
  This is with the new ext3 - ext4 migrate code added. The recently added
  lockdep for jbd2 helped to find this out. We want to hold the i_data_sem
  on the ext3 inode during migration to prevent walking the ext3 inode
  when it is being converted to ext4 format. Also we want to avoid 
  file truncation and new blocks being added while converting to ext4.
  Also we dont want to reserve large number of credits for journal.
  Any idea how to fix this ?
   Hmm, while briefly looking at the code - why do you introduce i_data_sem
 and not use i_alloc_sem which is already in VFS inode? That is aimed
 exactly at the serialization of truncates, writes and similar users.

How about read ? We are changing the format of inode. We don't want even
the read to go through.


 That doesn't solve problems with lock ordering but I was just wondering...
   Another problem - ext4_fallocate() has the same lock ordering problem as
 the migration code and maybe there are others...


I will look at the same when fixing this.

   One (stupid) solution to your problem is to make i_data_sem be
 always locked before the transaction is started. It could possibly have
 negative performance impact because you'd have to hold the semaphore for
 a longer time and thus a writer would block readers for longer time. So one
 would have to measure how big difference that would make.
   Another possibility is to start a single transaction for migration and
 extend it as long as you can (as truncate does it). And when you can't
 extend any more, you drop the i_data_sem and start a new transaction and
 acquire the semaphore again. This has the disadvantage that after dropping
 the semaphore you have to resync your original inode with the temporary
 one your are building which probably ends up being ugly as night... Hmm,
 but maybe we could get rid of this - hold i_mutex to protect against all
 writes (that ranks outside of transaction start so you can hold it for the
 whole migration time - maybe you even hold it if you are called from the
 write path...). After dropping i_data_sem you let some readers proceed
 but writers still wait on i_mutex so the file shouldn't change under you
 (but I suggest adding some BUG_ONs to verify that the file really doesn't
 change :).

A quick look says truncate can happen even when we hold i_mutex ??

But of all this looks like a workable solution. Will try this out.


-aneesh
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] ext4: Fix reference counting on buffer head.

2008-02-04 Thread Aneesh Kumar K.V
With journal checksum patch we added asyn commit of journal commit headers.
During the conversion we missed to take a reference on buffer head. Before
the change sync_dirty_buffer did the get_bh(). The associative put_bh is
done by journal_wait_on_commit_record()

Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED]
---
 fs/jbd2/commit.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index da8d0eb..2b88ab0 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -136,7 +136,7 @@ static int journal_submit_commit_record(journal_t *journal,
 
JBUFFER_TRACE(descriptor, submit commit block);
lock_buffer(bh);
-
+   get_bh(bh);
set_buffer_dirty(bh);
set_buffer_uptodate(bh);
bh-b_end_io = journal_end_buffer_io_sync;
-- 
1.5.4.rc3.24.gb53139-dirty

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: jbd2_handle and i_data_sem circular locking dependency detected

2008-02-04 Thread Aneesh Kumar K.V
On Mon, Feb 04, 2008 at 10:23:16AM -0500, Josef Bacik wrote:
 On Monday 04 February 2008 5:12:28 am Aneesh Kumar K.V wrote:
  Hi,
 
  This is with the new ext3 - ext4 migrate code added. The recently added
  lockdep for jbd2 helped to find this out. We want to hold the i_data_sem
  on the ext3 inode during migration to prevent walking the ext3 inode
  when it is being converted to ext4 format. Also we want to avoid
  file truncation and new blocks being added while converting to ext4.
  Also we dont want to reserve large number of credits for journal.
  Any idea how to fix this ?
 
 
 Hello,
 
 Seems you should be taking the i_data_sem after starting the journal in 
 ext4_ext_migrate.  I haven't looked at this stuff too much, but everywhere I 
 see i_data_sem taken its within a journal_start/journal_stop.  Here's the 
 patch.  Let me know if I'm way off btw, like I said I'm new to this :).  
 Thank 
 you,

But that doesn't help. Because we call ext4_journal_restart after taking
i_data_sem. That implies we can sleep waiting for other running
transaction to commit. And if that transaction (in the above example
rm) is waiting for i_data_sem we deadlock.

-aneesh
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Fix ext4 bitops

2008-02-03 Thread Aneesh Kumar K.V
On Sun, Feb 03, 2008 at 01:39:02PM +0100, Geert Uytterhoeven wrote:
 On Sun, 3 Feb 2008, Heiko Carstens wrote:
  On Fri, Feb 01, 2008 at 10:04:04PM +0100, Bastian Blank wrote:
   On Fri, Feb 01, 2008 at 12:22:57PM -0800, Andrew Morton wrote:
On Fri, 1 Feb 2008 21:02:08 +0100
Bastian Blank [EMAIL PROTECTED] wrote:

 Fix ext4 bitops.

This is incomplete.  Please tell us what was fixed.

If it was a build error then please quote the compile error output in 
the
changelog, as well as the usual description of what the problem is, and 
how
it was fixed.
   
   | fs/ext4/mballoc.c: In function 'ext4_mb_generate_buddy':
   | fs/ext4/mballoc.c:954: error: implicit declaration of function 
   'generic_find_next_le_bit'
   
   The s390 specific bitops uses parts of the generic implementation.
   Include the correct header.
  
  That doesn't work:
  
  fs/built-in.o: In function `ext4_mb_release_inode_pa':
  mballoc.c:(.text+0x95a8a): undefined reference to `generic_find_next_le_bit'
  fs/built-in.o: In function `ext4_mb_init_cache':
  mballoc.c:(.text+0x967ea): undefined reference to `generic_find_next_le_bit'
  
  This still needs generic_find_next_le_bit which comes
  from lib/find_next_bit.c. That one doesn't get built on s390 since we
  don't set GENERIC_FIND_NEXT_BIT.
  Currently we have the lengthly patch below queued.
 
 Similar issue on m68k. As Bastian also saw it on powerpc, I'm getting the
 impression the ext4 people don't (compile) test on big endian machines?
 
 Gr{oetje,eeting}s,
 

I have sent this patches to linux-arch expecting a review from
different arch people. It is true that the patches are tested only on
powerpc, x86-64, x86. That's the primary reason of me sending the
patches to linux-arch.

http://marc.info/?l=linux-archm=119503501127737w=2


-aneesh
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BUG_ON at mballoc.c:3752

2008-02-03 Thread Aneesh Kumar K.V
On Thu, Jan 31, 2008 at 04:42:07PM +0100, Eric Sesterhenn wrote:
 * Eric Sesterhenn ([EMAIL PROTECTED]) wrote:
  hi,
  
  while running a modified version of fsfuzzer i triggered the BUG() in
  ext4_mb_release_inode_pa(). Sadly I am not able to reproduce this using
  the generated image, but running the fuzzer will usually trigger this in
  less than 40 attempts. Increasing the JBD2 Debug level didnt give more
  information. The kernel is current git with
  ext4-fix-null-pointer-deref-in-journal_wait_on_commit_record.patch
  applied. 
 
 I am now able to reproduce this using this image:
 http://www.cccmz.de/~snakebyte/ext4.24.img.bz2
 
 the following commands will trigger the oops for me
 
 mount cfs/ext4.24.img /media/test -t ext4dev -o extents -o loop
 mkdir /media/test/stress
 chown snakebyte:snakebyte /media/test/stress  sudo -u snakebyte fstest -n 
 10 -l 10 -f 5 -s 4 -p /media/test/stress/
 

The file system is corrupted. The BUG_ON indicate that the free spcae
marked in the prealloc space and found by looking at the bitmap are not
same. Do you have a set of steps that i can follow to reproduce this ?
on a clean file system ? Where do i find the fsfuzzer that you are using ?


[EMAIL 
PROTECTED]:/tmp$/home/opensource/patches/e2fsprogs-1.40.4.cfs1/e2fsck/e2fsck.static
  -f ./ext4.24.img
e2fsck 1.40.4.cfs1 (31-Dec-2007)
Superblock has an invalid ext3 journal (inode 8).
Cleary? yes

*** ext3 journal has been deleted - filesystem is now ext2 only ***

Pass 1: Checking inodes, blocks, and sizes
Inode 7 has illegal block(s).  Cleary? yes

Illegal block #552 (9568256) in inode 7.  CLEARED.
Illegal block #647 (4063232) in inode 7.  CLEARED.
Illegal block #659 (12517376) in inode 7.  CLEARED.
Illegal block #766 (2600468480) in inode 7.  CLEARED.
Illegal block #944 (51200) in inode 7.  CLEARED.
Illegal block #1135 (2583691264) in inode 7.  CLEARED.
Illegal block #1214 (15925248) in inode 7.  CLEARED.
Illegal block #1345 (771751936) in inode 7.  CLEARED.
Illegal block #1384 (10092544) in inode 7.  CLEARED.
Illegal block #1416 (14811136) in inode 7.  CLEARED.
Illegal block #1470 (10420224) in inode 7.  CLEARED.
Illegal block #1709 (10158080) in inode 7.  CLEARED.
Too many illegal blocks in inode 7.
Clear inodey? yes

Journal inode is not in use, but contains data.  Cleary? yes

Deleted inode 160 has zero dtime.  Fixy? yes

Deleted inode 257 has zero dtime.  Fixy? yes

Deleted inode 638 has zero dtime.  Fixy? yes

Deleted inode 1407 has zero dtime.  Fixy? yes

Deleted inode 1650 has zero dtime.  Fixy? yes

Deleted inode 1656 has zero dtime.  Fixy? yes

Deleted inode 1677 has zero dtime.  Fixy? yes

Deleted inode 1720 has zero dtime.  Fixy? yes

Inode 11 has illegal block(s).  Cleary? yes

Illegal block #249 (15990784) in inode 11.  CLEARED.
Inode 11, i_size is 12288, should be 33792.  Fixy? yes

Inode 11, i_blocks is 24, should be 28.  Fixy? yes

Recreate journal to make the filesystem ext3 again?
Fixy? yes

-aneesh
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] ext4: Fix Null dereference.

2008-02-03 Thread Aneesh Kumar K.V
Repoted by Adrian Bunk [EMAIL PROTECTED]:

The Coverity checker spotted the following NULL dereference:

static int ext4_mb_mark_diskspace_used
{
...
if (!bitmap_bh)
goto out_err;
...
out_err:
sb-s_dirt = 1;
put_bh(bitmap_bh);
...

Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED]
---
 fs/ext4/mballoc.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 76e5fed..06d1f52 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -3069,7 +3069,7 @@ static int ext4_mb_mark_diskspace_used(struct 
ext4_allocation_context *ac,
 
 out_err:
sb-s_dirt = 1;
-   put_bh(bitmap_bh);
+   brelse(bitmap_bh);
return err;
 }
 
-- 
1.5.4.rc3.24.gb53139-dirty

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 36/49] ext4: Add EXT4_IOC_MIGRATE ioctl

2008-01-26 Thread Aneesh Kumar K.V
On Fri, Jan 25, 2008 at 11:15:00PM -0500, Theodore Tso wrote:
 On Thu, Jan 24, 2008 at 11:25:32AM +0530, Aneesh Kumar K.V wrote:
  +static int free_ext_idx(handle_t *handle, struct inode *inode,
  +   struct ext4_extent_idx *ix)
  +{
  +   int i, retval = 0;
  +   ext4_fsblk_t block;
  +   struct buffer_head *bh;
  +   struct ext4_extent_header *eh;
  +
  +   block = idx_pblock(ix);
  +   bh = sb_bread(inode-i_sb, block);
  +   if (!bh)
  +   return -EIO;
  +
  +   eh = (struct ext4_extent_header *)bh-b_data;
  +   if (eh-eh_depth == 0) {
  +   brelse(bh);
  +   ext4_free_blocks(handle, inode, block, 1);
  +   } else {
  +   ix = EXT_FIRST_INDEX(eh);
  +   for (i = 0; i  le16_to_cpu(eh-eh_entries); i++, ix++) {
  +   retval = free_ext_idx(handle, inode, ix);
  +   if (retval)
  +   return retval;
  +   }
  +   }
  +   return retval;
  +}
 
 Aneesh, looks like if eh-eh_depth is != 0, bh gets leaked.  This is
 how I plan to fix it up:
 
 +static int free_ext_idx(handle_t *handle, struct inode *inode,
 + struct ext4_extent_idx *ix)
 +{
 + int i, retval = 0;
 + ext4_fsblk_t block;
 + struct buffer_head *bh;
 + struct ext4_extent_header *eh;
 +
 + block = idx_pblock(ix);
 + bh = sb_bread(inode-i_sb, block);
 + if (!bh)
 + return -EIO;
 +
 + eh = (struct ext4_extent_header *)bh-b_data;
 + if (eh-eh_depth == 0)
 + ext4_free_blocks(handle, inode, block, 1);
 + else {
 + ix = EXT_FIRST_INDEX(eh);
 + for (i = 0; i  le16_to_cpu(eh-eh_entries); i++, ix++) {
 + retval = free_ext_idx(handle, inode, ix);
 + if (retval)
 + break;
 + }
 + }
 + put_bh(bh);


We need to mark the index block as free.
 via ext4_free_blocks(handle, inode, block, 1);

I remember making this change. May be it was related to dind/tind
blocks. 


-aneesh


-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 41/49] ext4: Add multi block allocator for ext4

2008-01-24 Thread Aneesh Kumar K.V
updated patch. Waiting for the test results.

I am only attaching the diff. Mballoc patch is really large.

-aneesh
diff --git a/Documentation/filesystems/ext4.txt 
b/Documentation/filesystems/ext4.txt
index 4f329af..ec7d349 100644
--- a/Documentation/filesystems/ext4.txt
+++ b/Documentation/filesystems/ext4.txt
@@ -89,6 +89,8 @@ When mounting an ext4 filesystem, the following option are 
accepted:
 extentsext4 will use extents to address file data.  The
file system will no longer be mountable by ext3.
 
+noextents  ext4 will not use extents for new files created.
+
 journal_checksum   Enable checksumming of the journal transactions.
This will allow the recovery code in e2fsck and the
kernel to detect corruption in the kernel.  It is a
@@ -206,6 +208,10 @@ nobh   (a) cache disk block mapping 
information
nobh option tries to avoid associating buffer
heads (supported only for writeback mode).
 
+mballoc(*) Use the mutliblock allocator for block 
allocation
+nomballoc  disabled multiblock allocator for block allocation.
+stripe=n   filesystem blocks per stripe for a RAID configuration.
+
 
 Data Mode
 -
diff --git a/Documentation/filesystems/proc.txt 
b/Documentation/filesystems/proc.txt
index dec9945..4413a2d 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -857,6 +857,45 @@ CPUs.
 The   procs_blocked line gives  the  number of  processes currently blocked,
 waiting for I/O to complete.
 
+1.9 Ext4 file system parameters
+--
+Ext4 file system have one directory per partition under /proc/fs/ext4/
+# ls /proc/fs/ext4/hdc/
+group_prealloc  max_to_scan  mb_groups  mb_history  min_to_scan  order2_req
+stats  stream_req
+
+mb_groups:
+This file gives the details of mutiblock allocator buddy cache of free blocks
+
+mb_history:
+Multiblock allocation history.
+
+stats:
+This file indicate whether the multiblock allocator should start collecting
+statistics. The statistics are shown during unmount
+
+group_prealloc:
+The multiblock allocator normalize the block allocation request to
+group_prealloc filesystem blocks if we don't have strip value set.
+The stripe value can be specified at mount time or during mke2fs.
+
+max_to_scan:
+How long multiblock allocator can look for a best extent (in found extents)
+
+min_to_scan:
+How long multiblock allocator  must look for a best extent
+
+order2_req:
+Multiblock allocator use  2^N search using buddies only for requests greater
+than or equal to order2_req. The request size is specfied in file system
+blocks. A value of 2 indicate only if the requests are greater than or equal
+to 4 blocks.
+
+stream_req:
+Files smaller than stream_req are served by the stream allocator, whose
+purpose is to pack requests as close each to other as possible to
+produce smooth I/O traffic. Avalue of 16 indicate that file smaller than 16
+filesystem block size will use group based preallocation.
 
 --
 Summary
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 0398aa0..310bad6 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -489,7 +489,7 @@ struct ext4_free_extent {
  */
 struct ext4_locality_group {
/* for allocator */
-   struct semaphorelg_sem; /* to serialize allocates */
+   struct mutexlg_sem; /* to serialize allocates */
struct list_headlg_prealloc_list;/* list of preallocations */
spinlock_t  lg_prealloc_lock;
 };
@@ -563,7 +563,10 @@ struct ext4_buddy {
 #define EXT4_MB_BUDDY(e4b) ((e4b)-bd_buddy)
 
 #ifndef EXT4_MB_HISTORY
-#define ext4_mb_store_history(ac)
+static inline void ext4_mb_store_history(struct ext4_allocation_context *ac)
+{
+   return;
+}
 #else
 static void ext4_mb_store_history(struct ext4_allocation_context *ac);
 #endif
@@ -641,6 +644,10 @@ static ext4_fsblk_t ext4_grp_offs_to_block(struct 
super_block *sb,
 
 static inline int mb_test_bit(int bit, void *addr)
 {
+   /*
+* ext4_test_bit on architecture like powerpc
+* needs unsigned long aligned address
+*/
mb_correct_addr_and_bit(bit, addr);
return ext4_test_bit(bit, addr);
 }
@@ -669,7 +676,7 @@ static inline void mb_clear_bit_atomic(spinlock_t *lock, 
int bit, void *addr)
ext4_clear_bit_atomic(lock, bit, addr);
 }
 
-static inline void *mb_find_buddy(struct ext4_buddy *e4b, int order, int *max)
+static void *mb_find_buddy(struct ext4_buddy *e4b, int order, int *max)
 {
char *bb;
 
@@ -752,9 +759,20 @@ static void mb_cmp_bitmaps(struct ext4_buddy *e4b, void 
*bitmap)
 }
 
 #else
-#define mb_free_blocks_double(a, b, c, d)
-#define mb_mark_used_double(a, b, c)
-#define 

Patch queue update

2008-01-24 Thread Aneesh Kumar K.V
I have updated patches based on the review feedback from Andrew.

I have tested this on 
128(64p) ppc64  sles
4(2p)ppc64  debian
4(2p)x86_64 ubuntu-gutsy

Updated patches are at
http://www.radian.org/~kvaneesh/ext4/jan-24-2008/
http://www.radian.org/~kvaneesh/ext4/jan-24-2008/patches.tar

Diff for reference

diff --git a/Documentation/filesystems/ext4.txt 
b/Documentation/filesystems/ext4.txt
index 4f329af..ec7d349 100644
--- a/Documentation/filesystems/ext4.txt
+++ b/Documentation/filesystems/ext4.txt
@@ -89,6 +89,8 @@ When mounting an ext4 filesystem, the following option are 
accepted:
 extentsext4 will use extents to address file data.  The
file system will no longer be mountable by ext3.
 
+noextents  ext4 will not use extents for new files created.
+
 journal_checksum   Enable checksumming of the journal transactions.
This will allow the recovery code in e2fsck and the
kernel to detect corruption in the kernel.  It is a
@@ -206,6 +208,10 @@ nobh   (a) cache disk block mapping 
information
nobh option tries to avoid associating buffer
heads (supported only for writeback mode).
 
+mballoc(*) Use the mutliblock allocator for block 
allocation
+nomballoc  disabled multiblock allocator for block allocation.
+stripe=n   filesystem blocks per stripe for a RAID configuration.
+
 
 Data Mode
 -
diff --git a/Documentation/filesystems/proc.txt 
b/Documentation/filesystems/proc.txt
index dec9945..4413a2d 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -857,6 +857,45 @@ CPUs.
 The   procs_blocked line gives  the  number of  processes currently blocked,
 waiting for I/O to complete.
 
+1.9 Ext4 file system parameters
+--
+Ext4 file system have one directory per partition under /proc/fs/ext4/
+# ls /proc/fs/ext4/hdc/
+group_prealloc  max_to_scan  mb_groups  mb_history  min_to_scan  order2_req
+stats  stream_req
+
+mb_groups:
+This file gives the details of mutiblock allocator buddy cache of free blocks
+
+mb_history:
+Multiblock allocation history.
+
+stats:
+This file indicate whether the multiblock allocator should start collecting
+statistics. The statistics are shown during unmount
+
+group_prealloc:
+The multiblock allocator normalize the block allocation request to
+group_prealloc filesystem blocks if we don't have strip value set.
+The stripe value can be specified at mount time or during mke2fs.
+
+max_to_scan:
+How long multiblock allocator can look for a best extent (in found extents)
+
+min_to_scan:
+How long multiblock allocator  must look for a best extent
+
+order2_req:
+Multiblock allocator use  2^N search using buddies only for requests greater
+than or equal to order2_req. The request size is specfied in file system
+blocks. A value of 2 indicate only if the requests are greater than or equal
+to 4 blocks.
+
+stream_req:
+Files smaller than stream_req are served by the stream allocator, whose
+purpose is to pack requests as close each to other as possible to
+produce smooth I/O traffic. Avalue of 16 indicate that file smaller than 16
+filesystem block size will use group based preallocation.
 
 --
 Summary
diff --git a/fs/buffer.c b/fs/buffer.c
index 982cf1a..921eeec 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -3232,19 +3232,21 @@ int bh_uptodate_or_lock(struct buffer_head *bh)
return 1;
 }
 EXPORT_SYMBOL(bh_uptodate_or_lock);
+
 /**
  * bh_submit_read: Submit a locked buffer for reading
  * @bh: struct buffer_head
  *
- * Returns a negative error
+ * Returns zero on success and -EIO on error.
  */
 int bh_submit_read(struct buffer_head *bh)
 {
-   if (!buffer_locked(bh))
-   lock_buffer(bh);
+   BUG_ON(!buffer_locked(bh));
 
-   if (buffer_uptodate(bh))
+   if (buffer_uptodate(bh)) {
+   unlock_buffer(bh);
return 0;
+   }
 
get_bh(bh);
bh-b_end_io = end_buffer_read_sync;
@@ -3255,6 +3257,7 @@ int bh_submit_read(struct buffer_head *bh)
return -EIO;
 }
 EXPORT_SYMBOL(bh_submit_read);
+
 void __init buffer_init(void)
 {
int nrpages;
diff --git a/fs/ext4/defrag.c b/fs/ext4/defrag.c
index 4ef3dc0..0d76c74 100644
--- a/fs/ext4/defrag.c
+++ b/fs/ext4/defrag.c
@@ -30,14 +30,6 @@ ext4_fsblk_t idx_pblock(struct ext4_extent_idx *ix)
return block;
 }
 
-/* Will go away */
-static void ext4_ext_store_pblock(struct ext4_extent *ex, ext4_fsblk_t pb)
-{
-   ex-ee_start_lo = cpu_to_le32((unsigned long) (pb  0x));
-   ex-ee_start_hi =
-   cpu_to_le16((unsigned long) ((pb  31)  1)  0x);
-}
-
 /*
  * this structure is used to gather extents from the tree via ioctl
  */
diff --git 

Re: [PATCH 41/49] ext4: Add multi block allocator for ext4

2008-01-24 Thread Aneesh Kumar K.V
On Thu, Jan 24, 2008 at 01:26:14PM +0530, Aneesh Kumar K.V wrote:
 
  
   +/* find most significant bit */
   +static int fmsb(unsigned short word)
   +{
   + int order;
   +
   + if (word  255) {
   + order = 7;
   + word = 8;
   + } else {
   + order = -1;
   + }
   +
   + do {
   + order++;
   + word = 1;
   + } while (word != 0);
   +
   + return order;
   +}
  
  Did we just reinvent fls()?
 
 replaced by fls.
 
  

That should be fls() - 1;

The full patch is at

http://www.radian.org/~kvaneesh/ext4/jan-24-2008/mballoc-core.patch

The patch is too big to inline.

-aneesh
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


GFP_ usage in kmalloc with mballoc

2008-01-24 Thread Aneesh Kumar K.V
Hi,


I looked at the GFP flag usage in mballoc.  I think the below change
make sense with respect to mballoc.

First hunk is memory allocation during ext4_mb_init which is called
during mount time. I guess it is ok to convert that to GFP_KERNEL.

Second hunk is during ext4_mb_free_metadata. I guess it should be
GFP_NOFS. This is called during ext4_free_blocks.


diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index bec699a..a15f18e 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2662,7 +2662,7 @@ int ext4_mb_init(struct super_block *sb, int 
needs_recovery)
sbi-s_mb_group_prealloc = MB_DEFAULT_GROUP_PREALLOC;
 
i = sizeof(struct ext4_locality_group) * NR_CPUS;
-   sbi-s_locality_groups = kmalloc(i, GFP_NOFS);
+   sbi-s_locality_groups = kmalloc(i, GFP_KERNEL);
if (sbi-s_locality_groups == NULL) {
clear_opt(sbi-s_mount_opt, MBALLOC);
kfree(sbi-s_mb_offsets);
@@ -4366,7 +4366,7 @@ static int ext4_mb_free_metadata(handle_t *handle, struct 
ext4_buddy *e4b,
 
if (md == NULL) {
ext4_unlock_group(sb, group);
-   md = kmalloc(sizeof(*md), GFP_KERNEL);
+   md = kmalloc(sizeof(*md), GFP_NOFS);
if (md == NULL)
return -ENOMEM;
md-num = 0;
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 23/49] Add buffer head related helper functions

2008-01-23 Thread Aneesh Kumar K.V
On Wed, Jan 23, 2008 at 02:06:48PM -0800, Andrew Morton wrote:
  On Mon, 21 Jan 2008 22:02:02 -0500 Theodore Ts'o [EMAIL PROTECTED] 
  wrote:
  +}
  +EXPORT_SYMBOL(bh_uptodate_or_lock);
  +/**
 
 Missing newline.
 
  + * bh_submit_read: Submit a locked buffer for reading
  + * @bh: struct buffer_head
  + *
  + * Returns a negative error
  + */
  +int bh_submit_read(struct buffer_head *bh)
  +{
  +   if (!buffer_locked(bh))
  +   lock_buffer(bh);
  +
  +   if (buffer_uptodate(bh))
  +   return 0;
 
 Here it can lock the buffer then return zero
 
  +   get_bh(bh);
  +   bh-b_end_io = end_buffer_read_sync;
  +   submit_bh(READ, bh);
  +   wait_on_buffer(bh);
  +   if (buffer_uptodate(bh))
  +   return 0;
 
 Here it will unlock the buffer and return zero.
 
 This function is unusable when passed an unlocked buffer.
 

Updated patch below.

commit 70d4ca32604e0935a8b9a49c5ac8b9c64c810693
Author: Aneesh Kumar K.V [EMAIL PROTECTED]
Date:   Thu Jan 24 10:50:24 2008 +0530

Add buffer head related helper functions

Add buffer head related helper function bh_uptodate_or_lock and
bh_submit_read which can be used by file system

Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED]

diff --git a/fs/buffer.c b/fs/buffer.c
index 7249e01..82aa2db 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -3213,6 +3213,53 @@ static int buffer_cpu_notify(struct notifier_block *self,
return NOTIFY_OK;
 }
 
+/**
+ * bh_uptodate_or_lock: Test whether the buffer is uptodate
+ * @bh: struct buffer_head
+ *
+ * Return true if the buffer is up-to-date and false,
+ * with the buffer locked, if not.
+ */
+int bh_uptodate_or_lock(struct buffer_head *bh)
+{
+   if (!buffer_uptodate(bh)) {
+   lock_buffer(bh);
+   if (!buffer_uptodate(bh))
+   return 0;
+   unlock_buffer(bh);
+   }
+   return 1;
+}
+EXPORT_SYMBOL(bh_uptodate_or_lock);
+
+/**
+ * bh_submit_read: Submit a locked buffer for reading
+ * @bh: struct buffer_head
+ *
+ * Returns zero on success and -EIO on error.If the input
+ * buffer is not locked returns -EINVAL
+ *
+ */
+int bh_submit_read(struct buffer_head *bh)
+{
+   if (!buffer_locked(bh))
+   return -EINVAL;
+
+   if (buffer_uptodate(bh)) {
+   unlock_buffer(bh);
+   return 0;
+   }
+
+   get_bh(bh);
+   bh-b_end_io = end_buffer_read_sync;
+   submit_bh(READ, bh);
+   wait_on_buffer(bh);
+   if (buffer_uptodate(bh))
+   return 0;
+   return -EIO;
+}
+EXPORT_SYMBOL(bh_submit_read);
+
 void __init buffer_init(void)
 {
int nrpages;
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index da0d83f..e98801f 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -192,6 +192,8 @@ int sync_dirty_buffer(struct buffer_head *bh);
 int submit_bh(int, struct buffer_head *);
 void write_boundary_block(struct block_device *bdev,
sector_t bblock, unsigned blocksize);
+int bh_uptodate_or_lock(struct buffer_head *bh);
+int bh_submit_read(struct buffer_head *bh);
 
 extern int buffer_heads_over_limit;
 
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 41/49] ext4: Add multi block allocator for ext4

2008-01-23 Thread Aneesh Kumar K.V
On Wed, Jan 23, 2008 at 02:07:27PM -0800, Andrew Morton wrote:
  On Mon, 21 Jan 2008 22:02:20 -0500 Theodore Ts'o [EMAIL PROTECTED] 
  wrote:
  From: Alex Tomas [EMAIL PROTECTED]
  
  Signed-off-by: Alex Tomas [EMAIL PROTECTED]
  Signed-off-by: Andreas Dilger [EMAIL PROTECTED]
  Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED]
  Signed-off-by: Eric Sandeen [EMAIL PROTECTED]
  Signed-off-by: Theodore Ts'o [EMAIL PROTECTED]
 
  ...
 
  +#if BITS_PER_LONG == 64
  +#define mb_correct_addr_and_bit(bit, addr) \
  +{  \
  +   bit += ((unsigned long) addr  7UL)  3;   \
  +   addr = (void *) ((unsigned long) addr  ~7UL);  \
  +}
  +#elif BITS_PER_LONG == 32
  +#define mb_correct_addr_and_bit(bit, addr) \
  +{  \
  +   bit += ((unsigned long) addr  3UL)  3;   \
  +   addr = (void *) ((unsigned long) addr  ~3UL);  \
  +}
  +#else
  +#error how many bits you are?!
  +#endif
 
 Why do these exist?

Initial version on mballoc supported on x86 32 this was there to give
compile warning on 64 bit platform. I guess we can remove that now.
Or may be we can keep it as such because it is harmless.


 
  +static inline int mb_test_bit(int bit, void *addr)
  +{
  +   mb_correct_addr_and_bit(bit, addr);
  +   return ext4_test_bit(bit, addr);
  +}
 
 ext2_test_bit() already handles bitnum  wordsize.
 
 If mb_correct_addr_and_bit() is actually needed then some suitable comment
 would help.

ext4_test_bit on powerpc needs the addr to be 8 byte aligned. Othewise
it fails

 
  +static inline void mb_set_bit(int bit, void *addr)
  +{
  +   mb_correct_addr_and_bit(bit, addr);
  +   ext4_set_bit(bit, addr);
  +}
  +
  +static inline void mb_set_bit_atomic(spinlock_t *lock, int bit, void *addr)
  +{
  +   mb_correct_addr_and_bit(bit, addr);
  +   ext4_set_bit_atomic(lock, bit, addr);
  +}
  +
  +static inline void mb_clear_bit(int bit, void *addr)
  +{
  +   mb_correct_addr_and_bit(bit, addr);
  +   ext4_clear_bit(bit, addr);
  +}
  +
  +static inline void mb_clear_bit_atomic(spinlock_t *lock, int bit, void 
  *addr)
  +{
  +   mb_correct_addr_and_bit(bit, addr);
  +   ext4_clear_bit_atomic(lock, bit, addr);
  +}
  +
  +static inline void *mb_find_buddy(struct ext4_buddy *e4b, int order, int 
  *max)
 
 uninlining this will save about eighty squigabytes of text.

Fixed


 
 Please review all of ext4/jbd2 with a view to removig unnecessary and wrong
 inlings.
 
  +{
  +   char *bb;
  +
  +   /* FIXME!! is this needed */
  +   BUG_ON(EXT4_MB_BITMAP(e4b) == EXT4_MB_BUDDY(e4b));
  +   BUG_ON(max == NULL);
  +
  +   if (order  e4b-bd_blkbits + 1) {
  +   *max = 0;
  +   return NULL;
  +   }
  +
  +   /* at order 0 we see each particular block */
  +   *max = 1  (e4b-bd_blkbits + 3);
  +   if (order == 0)
  +   return EXT4_MB_BITMAP(e4b);
  +
  +   bb = EXT4_MB_BUDDY(e4b) + EXT4_SB(e4b-bd_sb)-s_mb_offsets[order];
  +   *max = EXT4_SB(e4b-bd_sb)-s_mb_maxs[order];
  +
  +   return bb;
  +}
  +
 
  ...
 
  +#else
  +#define mb_free_blocks_double(a, b, c, d)
  +#define mb_mark_used_double(a, b, c)
  +#define mb_cmp_bitmaps(a, b)
  +#endif
 
 Please use the do{}while(0) thing.  Or, better, proper C functions which
 have typechecking (unless this will cause undefined-var compile errors,
 which happens sometimes)

makde static inline void.

 
  +/* find most significant bit */
  +static int fmsb(unsigned short word)
  +{
  +   int order;
  +
  +   if (word  255) {
  +   order = 7;
  +   word = 8;
  +   } else {
  +   order = -1;
  +   }
  +
  +   do {
  +   order++;
  +   word = 1;
  +   } while (word != 0);
  +
  +   return order;
  +}
 
 Did we just reinvent fls()?

replaced by fls.

 
  +/* FIXME!! need more doc */
  +static void ext4_mb_mark_free_simple(struct super_block *sb,
  +   void *buddy, unsigned first, int len,
  +   struct ext4_group_info *grp)
  +{
  +   struct ext4_sb_info *sbi = EXT4_SB(sb);
  +   unsigned short min;
  +   unsigned short max;
  +   unsigned short chunk;
  +   unsigned short border;
  +
  +   BUG_ON(len = EXT4_BLOCKS_PER_GROUP(sb));
  +
  +   border = 2  sb-s_blocksize_bits;
 
 Won't this explode with = 32k blocksize?
 
  +   while (len  0) {
  +   /* find how many blocks can be covered since this position */
  +   max = ffs(first | border) - 1;
  +
  +   /* find how many blocks of power 2 we need to mark */
  +   min = fmsb(len);
  +
  +   if (max  min)
  +   min = max;
  +   chunk = 1  min;
  +
  +   /* mark multiblock chunks only */
  +   grp-bb_counters[min]++;
  +   if (min  0)
  +   mb_clear_bit(first  min,
  +buddy + sbi-s_mb_offsets[min]);
  +
  +   len -= chunk;
  +   first += chunk;
  +   }
  +}
  +
  
  ...
 
  +static

Re: [PATCH] Fix oops in mballoc caused by a variable overflow

2008-01-17 Thread Aneesh Kumar K.V
On Thu, Jan 17, 2008 at 10:43:40AM +0100, Valerie Clement wrote:
 Aneesh Kumar K.V wrote:
 What about this  ? I guess we will overflow start = start  bsbits;


 Hi Aneesh,
 your patch below doesn't fix the issue, because as start_off is also  
 loff_t, start_off = ac-ac_o_ex.fe_logical  bsbits  also overflows.


loff_t is 64 bits.

typedef long long   __kernel_loff_t;
typedef __u32 ext4_lblk_t;

-aneesh
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Fix oops in mballoc caused by a variable overflow

2008-01-17 Thread Aneesh Kumar K.V
On Thu, Jan 17, 2008 at 10:43:40AM +0100, Valerie Clement wrote:
 Aneesh Kumar K.V wrote:
 What about this  ? I guess we will overflow start = start  bsbits;


 Hi Aneesh,
 your patch below doesn't fix the issue, because as start_off is also  
 loff_t, start_off = ac-ac_o_ex.fe_logical  bsbits  also overflows.


loff_t is 64 bits.

typedef __kernel_loff_t loff_t;
typedef long long   __kernel_loff_t;
typedef __u32 ext4_lblk_t;
typedef unsigned long long ext4_fsblk_t

start_off = ac-ac_o_ex.fe_logical  bsbits;

In the above line what we are storing in start_off is the offset in bytes.So it 
makes
sense to use the type loff_t. It is neither logical block nor physical block.


-aneesh
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Fix oops in mballoc caused by a variable overflow

2008-01-17 Thread Aneesh Kumar K.V
On Thu, Jan 17, 2008 at 02:09:41PM +0100, Valerie Clement wrote:
 Aneesh Kumar K.V wrote:
 On Thu, Jan 17, 2008 at 10:43:40AM +0100, Valerie Clement wrote:
 Aneesh Kumar K.V wrote:
 What about this  ? I guess we will overflow start = start  bsbits;

 Hi Aneesh,
 your patch below doesn't fix the issue, because as start_off is also  
 loff_t, start_off = ac-ac_o_ex.fe_logical  bsbits  also overflows.


 loff_t is 64 bits.

 typedef __kernel_loff_t loff_t;
 typedef long long   __kernel_loff_t;
 typedef __u32 ext4_lblk_t;
 typedef unsigned long long ext4_fsblk_t

 start_off = ac-ac_o_ex.fe_logical  bsbits;

 In the above line what we are storing in start_off is the offset in bytes.So 
 it makes
 sense to use the type loff_t. It is neither logical block nor physical block.

 Oh yes, sorry, you're right. I read too quickly.

 In fact, it's missing a cast :
   start_off = (loff_t) ac-ac_o_ex.fe_logical  bsbits;

 With that change, the test is ok.

Updated patch below.

-aneesh
ext4: Fix overflow in ext4_mb_normalize_request

From: Aneesh Kumar K.V [EMAIL PROTECTED]

kernel BUG at fs/ext4/mballoc.c:3148!

The BUG_ON is:
BUG_ON(size = 0 || size = EXT4_BLOCKS_PER_GROUP(ac-ac_sb));

where the value of size is 4293920768.

This is due to the overflow of the variable start in the
ext4_mb_normalize_request() function.

Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED]
---

 fs/ext4/mballoc.c |   24 +++-
 1 files changed, 11 insertions(+), 13 deletions(-)


diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index d8cd81e..d16083c 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2998,7 +2998,7 @@ static void ext4_mb_normalize_request(struct ext4_allocation_context *ac,
 	int bsbits, max;
 	ext4_lblk_t end;
 	struct list_head *cur;
-	loff_t size, orig_size;
+	loff_t size, orig_size, start_off;
 	ext4_lblk_t start, orig_start;
 	struct ext4_inode_info *ei = EXT4_I(ac-ac_inode);
 
@@ -3039,7 +3039,7 @@ static void ext4_mb_normalize_request(struct ext4_allocation_context *ac,
 
 	/* first, try to predict filesize */
 	/* XXX: should this table be tunable? */
-	start = 0;
+	start_off = 0;
 	if (size = 16 * 1024) {
 		size = 16 * 1024;
 	} else if (size = 32 * 1024) {
@@ -3055,26 +3055,24 @@ static void ext4_mb_normalize_request(struct ext4_allocation_context *ac,
 	} else if (size = 1024 * 1024) {
 		size = 1024 * 1024;
 	} else if (NRL_CHECK_SIZE(size, 4 * 1024 * 1024, max, bsbits)) {
-		start = ac-ac_o_ex.fe_logical  bsbits;
-		start = (start / (1024 * 1024)) * (1024 * 1024);
+		start_off = ((loff_t)ac-ac_o_ex.fe_logical 
+		(20 - bsbits))  20;
 		size = 1024 * 1024;
 	} else if (NRL_CHECK_SIZE(size, 8 * 1024 * 1024, max, bsbits)) {
-		start = ac-ac_o_ex.fe_logical  bsbits;
-		start = (start / (4 * (1024 * 1024))) * 4 * (1024 * 1024);
+		start_off = ((loff_t)ac-ac_o_ex.fe_logical 
+			(22 - bsbits))  22;
 		size = 4 * 1024 * 1024;
 	} else if (NRL_CHECK_SIZE(ac-ac_o_ex.fe_len,
 	(820)bsbits, max, bsbits)) {
-		start = ac-ac_o_ex.fe_logical;
-		start = start  bsbits;
-		start = (start / (8 * (1024 * 1024))) * 8 * (1024 * 1024);
+		start_off = ((loff_t)ac-ac_o_ex.fe_logical 
+			(23 - bsbits))  23;
 		size = 8 * 1024 * 1024;
 	} else {
-		start = ac-ac_o_ex.fe_logical;
-		start = start  bsbits;
-		size = ac-ac_o_ex.fe_len  bsbits;
+		start_off = (loff_t)ac-ac_o_ex.fe_logical  bsbits;
+		size	  = ac-ac_o_ex.fe_len  bsbits;
 	}
 	orig_size = size = size  bsbits;
-	orig_start = start = start  bsbits;
+	orig_start = start = start_off  bsbits;
 
 	/* don't cover already allocated blocks in selected range */
 	if (ar-pleft  start = ar-lleft) {


Re: [PATCH] Fix oops in mballoc caused by a variable overflow

2008-01-16 Thread Aneesh Kumar K.V
On Wed, Jan 16, 2008 at 10:48:27AM -0800, Mingming Cao wrote:
 On Wed, 2008-01-16 at 20:11 +0100, Valerie Clement wrote:
  A simple dd oopses the kernel (2.6.24-rc7 with the latest patch queue):
dd if=/dev/zero of=/mnt/test/foo bs=1M count=8096
  
  EXT4-fs: mballoc enabled
  [ cut here ]
  kernel BUG at fs/ext4/mballoc.c:3148!
  
  The BUG_ON is:
  BUG_ON(size = 0 || size = EXT4_BLOCKS_PER_GROUP(ac-ac_sb));
  
  where the value of size is 4293920768.
  
  This is due to the overflow of the variable start in the 
  ext4_mb_normalize_request() function.
  The patch below fixes it.
  
 Thanks!
 
  Signed-off-by: Valerie Clement [EMAIL PROTECTED]
  ---
  
   mballoc.c |   23 ---
   1 file changed, 12 insertions(+), 11 deletions(-)
  
  
  Index: linux-2.6.24-rc7/fs/ext4/mballoc.c
  ===
  --- linux-2.6.24-rc7.orig/fs/ext4/mballoc.c 2008-01-16 19:22:45.0 
  +0100
  +++ linux-2.6.24-rc7/fs/ext4/mballoc.c  2008-01-16 19:25:04.0 
  +0100
  @@ -2990,6 +2990,7 @@ static void ext4_mb_normalize_request(st
  struct list_head *cur;
  loff_t size, orig_size;
  ext4_lblk_t start, orig_start;
  +   ext4_fsblk_t pstart;
 
 ext4_fsblk_t is used for fs physical block number, here I think pstart
 is pointing to some logical block location..
 
  struct ext4_inode_info *ei = EXT4_I(ac-ac_inode);
  
  /* do normalize only data requests, metadata requests
  @@ -3029,7 +3030,7 @@ static void ext4_mb_normalize_request(st
  
  /* first, try to predict filesize */
  /* XXX: should this table be tunable? */
  -   start = 0;
  +   pstart = 0;
  if (size = 16 * 1024) {
  size = 16 * 1024;
  } else if (size = 32 * 1024) {
  @@ -3045,25 +3046,25 @@ static void ext4_mb_normalize_request(st
  } else if (size = 1024 * 1024) {
  size = 1024 * 1024;
  } else if (NRL_CHECK_SIZE(size, 4 * 1024 * 1024, max, bsbits)) {
  -   start = ac-ac_o_ex.fe_logical  bsbits;
  -   start = (start / (1024 * 1024)) * (1024 * 1024);
  +   pstart = ac-ac_o_ex.fe_logical  bsbits;
  +   pstart = (pstart / (1024 * 1024)) * (1024 * 1024);
 
 How about using shift...
 
 - start = ac-ac_o_ex.fe_logical  bsbits;
 - start = (start / (1024 * 1024)) * (1024 * 1024);
 + start = (ac-ac_o_ex.fe_logical  (20-bsbits))  20;
 
 That would be more efficient and should fix the overflow issue
 
  size = 1024 * 1024;
  } else if (NRL_CHECK_SIZE(size, 8 * 1024 * 1024, max, bsbits)) {
  -   start = ac-ac_o_ex.fe_logical  bsbits;
  -   start = (start / (4 * (1024 * 1024))) * 4 * (1024 * 1024);
  +   pstart = ac-ac_o_ex.fe_logical  bsbits;
  +   pstart = (pstart / (4 * (1024 * 1024))) * 4 * (1024 * 1024);
 
 + start = (ac-ac_o_ex.fe_logical  (22-bsbits))  22;
 
  size = 4 * 1024 * 1024;
  } else 
  if(NRL_CHECK_SIZE(ac-ac_o_ex.fe_len,(820)bsbits,max,bsbits)){
  -   start = ac-ac_o_ex.fe_logical;
  -   start = start  bsbits;
  -   start = (start / (8 * (1024 * 1024))) * 8 * (1024 * 1024);
  +   pstart = ac-ac_o_ex.fe_logical;
  +   pstart = pstart  bsbits;
  +   pstart = (pstart / (8 * (1024 * 1024))) * 8 * (1024 * 1024);
 
 + start = (ac-ac_o_ex.fe_logical  (23-bsbits))  23;
 
  size = 8 * 1024 * 1024;
  } else {
  -   start = ac-ac_o_ex.fe_logical;
  -   start = start  bsbits;
  +   pstart = ac-ac_o_ex.fe_logical;
  +   pstart = pstart  bsbits;
  size = ac-ac_o_ex.fe_len  bsbits;


What about this  ? I guess we will overflow 
start = start  bsbits;

I guess start should be of type loff_t. Patch below

-aneesh

ext4: Fix overflow in ext4_mb_normalize_request

From: Aneesh Kumar K.V [EMAIL PROTECTED]

kernel BUG at fs/ext4/mballoc.c:3148!

The BUG_ON is:
BUG_ON(size = 0 || size = EXT4_BLOCKS_PER_GROUP(ac-ac_sb));

where the value of size is 4293920768.

This is due to the overflow of the variable start in the
ext4_mb_normalize_request() function.

Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED]
---

 fs/ext4/mballoc.c |   21 -
 1 files changed, 8 insertions(+), 13 deletions(-)


diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index d8cd81e..d8a2db8 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2998,7 +2998,7 @@ static void ext4_mb_normalize_request(struct 
ext4_allocation_context *ac,
int bsbits, max;
ext4_lblk_t end;
struct list_head *cur;
-   loff_t size, orig_size;
+   loff_t size, orig_size, start_off;
ext4_lblk_t start, orig_start;
struct ext4_inode_info *ei = EXT4_I(ac-ac_inode);
 
@@ -3039,7 +3039,7 @@ static void ext4_mb_normalize_request(struct 
ext4_allocation_context *ac,
 
/* first, try to predict filesize */
/* XXX

Re: checkpatch.pl warnings

2008-01-15 Thread Aneesh Kumar K.V
On Mon, Jan 14, 2008 at 12:49:27PM -0800, Mingming Cao wrote:
 Hi Guys,
 
 Could you check the checkpatch.pl warnings and see if it make sense to fix 
 them? Thanks!
 
 [EMAIL PROTECTED]:~/fs/ext4/stylecheck$ grep has style problems *
 linux-2.6.24-rc7-48-bit-i_blocks.patch.out:Your patch has style problems, 
 please review.  If any of these errors
 linux-2.6.24-rc7-ext3-4-migrate.patch.out:Your patch has style problems, 
 please review.  If any of these errors
 linux-2.6.24-rc7-ext4_export_iov_shorten_from_kernel_for_ext4.patch.out:Your 
 patch has style problems, please review.  If any of these errors
 linux-2.6.24-rc7-ext4-journal_chksum-2.6.20.patch.out:Your patch has style 
 problems, please review.  If any of these errors
 linux-2.6.24-rc7-ext4_rec_len_overflow_with_64kblk_fix-v2.patch.out:Your 
 patch has style problems, please review.  If any of these errors
 linux-2.6.24-rc7-ext4_store_maxbytes_for_bitmaped_files.patch.out:Your patch 
 has style problems, please review.  If any of these errors
 linux-2.6.24-rc7-inode-version-ext4.patch.out:Your patch has style problems, 
 please review.  If any of these errors
 linux-2.6.24-rc7-jbd-stats-through-procfs.out:Your patch has style problems, 
 please review.  If any of these errors
 linux-2.6.24-rc7-large-file-blocktype.patch.out:Your patch has style 
 problems, please review.  If any of these errors
 linux-2.6.24-rc7-mballoc-core.patch.out:Your patch has style problems, please 
 review.  If any of these errors
 

Fixed the checkpatch.pl warning for all the patches in the patch queue.
The diff is attached below for review.

patch queue at
http://www.radian.org/~kvaneesh/ext4/jan-15-2008/
http://www.radian.org/~kvaneesh/ext4/jan-15-2008/patch-queue.tar

This include the complete patch queue.

Tested with
--
fsx_linux, fs_inode, fsstress on x86_64 
fsx_linux, fs_inode, fsstress on ppc64


diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 1e46997..2ea7ef4 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -1798,8 +1798,9 @@ static int ext4_remove_blocks(handle_t *handle, struct 
inode *inode,
printk(KERN_INFO strange request: removal %u-%u from %u:%u\n,
from, to, le32_to_cpu(ex-ee_block), ee_len);
} else {
-   printk(KERN_INFO strange request: removal(2) %u-%u from 
%u:%u\n,
-   from, to, le32_to_cpu(ex-ee_block), ee_len);
+   printk(KERN_INFO strange request: removal(2) 
+   %u-%u from %u:%u\n,
+   from, to, le32_to_cpu(ex-ee_block), ee_len);
}
return 0;
 }
@@ -2140,10 +2141,11 @@ void ext4_ext_release(struct super_block *sb)
  *   b Splits in two extents: Write is happening at either end of the extent
  *   c Splits in three extents: Somone is writing in middle of the extent
  */
-static int ext4_ext_convert_to_initialized(handle_t *handle, struct inode 
*inode,
-   struct ext4_ext_path *path,
-   ext4_lblk_t iblock,
-   unsigned long max_blocks)
+static int ext4_ext_convert_to_initialized(handle_t *handle,
+   struct inode *inode,
+   struct ext4_ext_path *path,
+   ext4_lblk_t iblock,
+   unsigned long max_blocks)
 {
struct ext4_extent *ex, newex;
struct ext4_extent *ex1 = NULL;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index f7dc5f3..6bb788d 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -751,7 +751,8 @@ err_out:
for (i = 1; i = num; i++) {
BUFFER_TRACE(where[i].bh, call jbd2_journal_forget);
ext4_journal_forget(handle, where[i].bh);
-   ext4_free_blocks(handle,inode,le32_to_cpu(where[i-1].key),1, 0);
+   ext4_free_blocks(handle, inode,
+   le32_to_cpu(where[i-1].key), 1, 0);
}
ext4_free_blocks(handle, inode, le32_to_cpu(where[num].key), blks, 0);
 
@@ -2829,7 +2830,7 @@ static blkcnt_t ext4_inode_blocks(struct ext4_inode 
*raw_inode,
EXT4_FEATURE_RO_COMPAT_HUGE_FILE)) {
/* we are using combined 48 bit field */
i_blocks = ((u64)le16_to_cpu(raw_inode-i_blocks_high))  32 |
-   
le32_to_cpu(raw_inode-i_blocks_lo);
+   le32_to_cpu(raw_inode-i_blocks_lo);
if (ei-i_flags  EXT4_HUGE_FILE_FL) {
/* i_blocks represent file system block size */
return i_blocks   (inode-i_blkbits - 9);
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 16854fd..d8cd81e 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -337,7 +337,7 @@ struct ext4_group_info 

Re: [PATCH] ext4: use ext4_ext_get_actual_len instead of directly using ext4_extent.ee_len

2008-01-14 Thread Aneesh Kumar K.V
On Sat, Jan 12, 2008 at 11:44:00PM +0530, Aneesh Kumar K.V wrote:
  fs/ext4/extents.c |   26 ++
  1 files changed, 14 insertions(+), 12 deletions(-)
 
 diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
 index 81bce98..4269cc6 100644
 --- a/fs/ext4/extents.c
 +++ b/fs/ext4/extents.c
 @@ -1029,7 +1029,7 @@ ext4_ext_search_left(struct inode *inode, struct 
 ext4_ext_path *path,
  {
   struct ext4_extent_idx *ix;
   struct ext4_extent *ex;
 - int depth;
 + int depth, ee_len;
 
   BUG_ON(path == NULL);
   depth = path-p_depth;
 @@ -1043,6 +1043,7 @@ ext4_ext_search_left(struct inode *inode, struct 
 ext4_ext_path *path,
* first one in the file */
 
   ex = path[depth].p_ext;
 + ee_len = ext4_ext_get_actual_len(ex);
   if (*logical  le32_to_cpu(ex-ee_block)) {
   BUG_ON(EXT_FIRST_EXTENT(path[depth].p_hdr) != ex);
   while (--depth = 0) {
 @@ -1052,10 +1053,10 @@ ext4_ext_search_left(struct inode *inode, struct 
 ext4_ext_path *path,
   return 0;
   }
 
 - BUG_ON(*logical  le32_to_cpu(ex-ee_block) + le16_to_cpu(ex-ee_len));
 + BUG_ON(*logical  (le32_to_cpu(ex-ee_block) + ee_len));
 
 - *logical = le32_to_cpu(ex-ee_block) + le16_to_cpu(ex-ee_len) - 1;
 - *phys = ext_pblock(ex) + le16_to_cpu(ex-ee_len) - 1;
 + *logical = le32_to_cpu(ex-ee_block) + ee_len - 1;
 + *phys = ext_pblock(ex) + ee_len - 1;
   return 0;
  }
 
 @@ -1075,7 +1076,7 @@ ext4_ext_search_right(struct inode *inode, struct 
 ext4_ext_path *path,
   struct ext4_extent_idx *ix;
   struct ext4_extent *ex;
   ext4_fsblk_t block;
 - int depth;
 + int depth, ee_len;
 
   BUG_ON(path == NULL);
   depth = path-p_depth;
 @@ -1089,6 +1090,7 @@ ext4_ext_search_right(struct inode *inode, struct 
 ext4_ext_path *path,
* first one in the file */
 
   ex = path[depth].p_ext;
 + ee_len = ext4_ext_get_actual_len(ex);
   if (*logical  le32_to_cpu(ex-ee_block)) {
   BUG_ON(EXT_FIRST_EXTENT(path[depth].p_hdr) != ex);
   while (--depth = 0) {
 @@ -1100,7 +1102,7 @@ ext4_ext_search_right(struct inode *inode, struct 
 ext4_ext_path *path,
   return 0;
   }
 
 - BUG_ON(*logical  le32_to_cpu(ex-ee_block) + le16_to_cpu(ex-ee_len));
 + BUG_ON(*logical  (le32_to_cpu(ex-ee_block) + ee_len));
 
   if (ex != EXT_LAST_EXTENT(path[depth].p_hdr)) {
   /* next allocated block in this leaf */
 @@ -1315,7 +1317,7 @@ ext4_can_extents_be_merged(struct inode *inode, struct 
 ext4_extent *ex1,
   if (ext1_ee_len + ext2_ee_len  max_len)
   return 0;
  #ifdef AGGRESSIVE_TEST
 - if (le16_to_cpu(ex1-ee_len) = 4)
 + if (ext1_ee_len = 4)
   return 0;
  #endif
 
 @@ -1555,7 +1557,7 @@ has_space:
   nearex = path[depth].p_ext;
   nearex-ee_block = newext-ee_block;
   ext4_ext_store_pblock(nearex, ext_pblock(newext));
 - nearex-ee_len = newext-ee_len;
 + nearex-ee_len = ext4_ext_get_actual_len(newext);
 
  merge:
   /* try to merge extents to the right */

This change is not needed.  ext4_ext_try_to_merge check whether the
extent is uninitialized or not and zero out the blocks if we are
merging.

 @@ -2310,7 +2312,7 @@ int ext4_ext_get_blocks(handle_t *handle, struct inode 
 *inode,
  - le32_to_cpu(newex.ee_block)
  + ext_pblock(newex);
   /* number of remaining blocks in the extent */

New patch is attached below.

-aneesh
et4: Get the actual length of extent

From: Aneesh Kumar K.V [EMAIL PROTECTED]

ext4 use the extent len for encoding whether the extent is intialized or
not. The helper function ext4_ext_get_actual_len should be used to get the
actual length of the extent.

Fix the below kernel BUG
http://bugzilla.kernel.org/show_bug.cgi?id=9732

kernel BUG at fs/ext4/extents.c:1056!

Call Trace:
[88366073] :ext4dev:ext4_ext_get_blocks+0x5ba/0x8c1
[81053c91] lock_release_holdtime+0x27/0x49
[812748f6] _spin_unlock+0x17/0x20
[883400a6] :jbd2:start_this_handle+0x4e0/0x4fe
[88366564] :ext4dev:ext4_fallocate+0x175/0x39a
[81053c91] lock_release_holdtime+0x27/0x49
[81056480] __lock_acquire+0x4e7/0xc4d
[81053c91] lock_release_holdtime+0x27/0x49
[810a8de7] sys_fallocate+0xe4/0x10d
[8100c043] tracesys+0xd5/0xda


Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED]
---

 fs/ext4/extents.c |   24 +---
 1 files changed, 13 insertions(+), 11 deletions(-)


diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 81bce98..1e46997 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -1029,7 +1029,7 @@ ext4_ext_search_left(struct inode *inode, struct ext4_ext_path *path,
 {
 	struct ext4_extent_idx *ix;
 	struct ext4_extent *ex;
-	int depth;
+	int depth, ee_len;
 
 	BUG_ON(path == NULL);
 	depth = path-p_depth

[PATCH] ext4: use ext4_ext_get_actual_len instead of directly using ext4_extent.ee_len

2008-01-12 Thread Aneesh Kumar K.V

ext4 use the extent len for encoding whether the extent is intialized or
not. The helper function ext4_ext_get_actual_len should be used to get the
actual length of the extent.

Fix the below kernel BUG
http://bugzilla.kernel.org/show_bug.cgi?id=9732

kernel BUG at fs/ext4/extents.c:1056!

Call Trace:
[88366073] :ext4dev:ext4_ext_get_blocks+0x5ba/0x8c1
[81053c91] lock_release_holdtime+0x27/0x49
[812748f6] _spin_unlock+0x17/0x20
[883400a6] :jbd2:start_this_handle+0x4e0/0x4fe
[88366564] :ext4dev:ext4_fallocate+0x175/0x39a
[81053c91] lock_release_holdtime+0x27/0x49
[81056480] __lock_acquire+0x4e7/0xc4d
[81053c91] lock_release_holdtime+0x27/0x49
[810a8de7] sys_fallocate+0xe4/0x10d
[8100c043] tracesys+0xd5/0xda


Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED]
---
 fs/ext4/extents.c |   26 ++
 1 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 81bce98..4269cc6 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -1029,7 +1029,7 @@ ext4_ext_search_left(struct inode *inode, struct 
ext4_ext_path *path,
 {
struct ext4_extent_idx *ix;
struct ext4_extent *ex;
-   int depth;
+   int depth, ee_len;
 
BUG_ON(path == NULL);
depth = path-p_depth;
@@ -1043,6 +1043,7 @@ ext4_ext_search_left(struct inode *inode, struct 
ext4_ext_path *path,
 * first one in the file */
 
ex = path[depth].p_ext;
+   ee_len = ext4_ext_get_actual_len(ex);
if (*logical  le32_to_cpu(ex-ee_block)) {
BUG_ON(EXT_FIRST_EXTENT(path[depth].p_hdr) != ex);
while (--depth = 0) {
@@ -1052,10 +1053,10 @@ ext4_ext_search_left(struct inode *inode, struct 
ext4_ext_path *path,
return 0;
}
 
-   BUG_ON(*logical  le32_to_cpu(ex-ee_block) + le16_to_cpu(ex-ee_len));
+   BUG_ON(*logical  (le32_to_cpu(ex-ee_block) + ee_len));
 
-   *logical = le32_to_cpu(ex-ee_block) + le16_to_cpu(ex-ee_len) - 1;
-   *phys = ext_pblock(ex) + le16_to_cpu(ex-ee_len) - 1;
+   *logical = le32_to_cpu(ex-ee_block) + ee_len - 1;
+   *phys = ext_pblock(ex) + ee_len - 1;
return 0;
 }
 
@@ -1075,7 +1076,7 @@ ext4_ext_search_right(struct inode *inode, struct 
ext4_ext_path *path,
struct ext4_extent_idx *ix;
struct ext4_extent *ex;
ext4_fsblk_t block;
-   int depth;
+   int depth, ee_len;
 
BUG_ON(path == NULL);
depth = path-p_depth;
@@ -1089,6 +1090,7 @@ ext4_ext_search_right(struct inode *inode, struct 
ext4_ext_path *path,
 * first one in the file */
 
ex = path[depth].p_ext;
+   ee_len = ext4_ext_get_actual_len(ex);
if (*logical  le32_to_cpu(ex-ee_block)) {
BUG_ON(EXT_FIRST_EXTENT(path[depth].p_hdr) != ex);
while (--depth = 0) {
@@ -1100,7 +1102,7 @@ ext4_ext_search_right(struct inode *inode, struct 
ext4_ext_path *path,
return 0;
}
 
-   BUG_ON(*logical  le32_to_cpu(ex-ee_block) + le16_to_cpu(ex-ee_len));
+   BUG_ON(*logical  (le32_to_cpu(ex-ee_block) + ee_len));
 
if (ex != EXT_LAST_EXTENT(path[depth].p_hdr)) {
/* next allocated block in this leaf */
@@ -1315,7 +1317,7 @@ ext4_can_extents_be_merged(struct inode *inode, struct 
ext4_extent *ex1,
if (ext1_ee_len + ext2_ee_len  max_len)
return 0;
 #ifdef AGGRESSIVE_TEST
-   if (le16_to_cpu(ex1-ee_len) = 4)
+   if (ext1_ee_len = 4)
return 0;
 #endif
 
@@ -1555,7 +1557,7 @@ has_space:
nearex = path[depth].p_ext;
nearex-ee_block = newext-ee_block;
ext4_ext_store_pblock(nearex, ext_pblock(newext));
-   nearex-ee_len = newext-ee_len;
+   nearex-ee_len = ext4_ext_get_actual_len(newext);
 
 merge:
/* try to merge extents to the right */
@@ -2310,7 +2312,7 @@ int ext4_ext_get_blocks(handle_t *handle, struct inode 
*inode,
   - le32_to_cpu(newex.ee_block)
   + ext_pblock(newex);
/* number of remaining blocks in the extent */
-   allocated = le16_to_cpu(newex.ee_len) -
+   allocated = ext4_ext_get_actual_len(newex) -
(iblock - le32_to_cpu(newex.ee_block));
goto out;
} else {
@@ -2426,7 +2428,7 @@ int ext4_ext_get_blocks(handle_t *handle, struct inode 
*inode,
newex.ee_len = cpu_to_le16(max_blocks);
err = ext4_ext_check_overlap(inode, newex, path);
if (err)
-   allocated = le16_to_cpu(newex.ee_len);
+   allocated = ext4_ext_get_actual_len(newex);
else
allocated = max_blocks;
 
@@ -2458,7 +2460,7 @@ int ext4_ext_get_blocks(handle_t *handle, struct inode 
*inode,
 * but otherwise we'd need

Re: Problems with mballoc and uninit_groups option

2008-01-11 Thread Aneesh Kumar K.V
On Fri, Jan 11, 2008 at 03:04:08PM +0100, Valerie Clement wrote:
 Hi,

 I've got problems with mballoc when I create the ext4 filesystem with  
 the uninit_groups option enabled.

 First, I do a single test on a filesystem created without the  
 uninit_groups option and mounted with the defaults option:
   dd if=/dev/zero of=/mnt/test/foo bs=1M count=1024

 In this case, the file blocks are allocated in the groups 4, 5, 6, 7, 8,  
 9, 10.

 When the filesystem is created with the uninit_groups option enabled  
 and mounted with the defaults option, I do the same dd command.

 In this case, the file blocks are allocated in the groups 5, 7, 9, 25,  
 27, 49, 81. It seems that the blocks could be allocated only in the  
 already initialized groups.


That is because we skip the uninitialized group in ext4_mb_good_group.
I guess we are trying criteria 0 allocation and we skip uninit group for
criteria 0 . You can tune by setting higher value for
/proc/fs/ext4/partition/orders2_req. Setting it to a high value would
skip criteria 0 allocation for small requests.

I guess you are using delayed allocation ?

-aneesh

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


patch queue update

2008-01-10 Thread Aneesh Kumar K.V
Hi Mingming,

New patches for patch queue can be found at
http://www.radian.org/~kvaneesh/ext4/jan-10-2008-ver2/

The changes are

a) mballoc patch got an explanation about regular allocator.
b) mballoc regular allocator we changed the usage of ffs to fls. I guess
it makes sense to use fls because we want to compare it against the
tunable s_mb_order2_reqs. Only request above this order are using
criteria 0 allocation.
c) stripe.patch to use the stripe size set in the super block for block
allocation.

The diff is attached for reference.

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 0d31817..0085fde 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -468,7 +468,6 @@ static void ext4_mb_free_committed_blocks(struct 
super_block *);
 static void ext4_mb_return_to_preallocation(struct inode *inode,
struct ext4_buddy *e4b, sector_t block,
int count);
-static void ext4_mb_show_ac(struct ext4_allocation_context *ac);
 static void ext4_mb_put_pa(struct ext4_allocation_context *, struct 
super_block *,
struct ext4_prealloc_space *pa);
 static int ext4_mb_init_per_dev_proc(struct super_block *sb);
@@ -1838,14 +1837,23 @@ static int ext4_mb_regular_allocator(struct 
ext4_allocation_context *ac)
if (unlikely(ac-ac_flags  EXT4_MB_HINT_GOAL_ONLY))
goto out;
 
-   i = ffs(ac-ac_g_ex.fe_len);
+   /*
+* ac-ac2_order is set only if the fe_len is a power of 2
+* if ac2_order is set we also set criteria to 0 so whtat we
+* try exact allocation using buddy.
+*/
+   i = fls(ac-ac_g_ex.fe_len);
ac-ac_2order = 0;
-   /* FIXME!!
-* What happens if i is still greater than s_mb_order2_reqs
+   /*
+* We search using buddy data only if the order of the request
+* is greater than equal to the sbi_s_mb_order2_reqs
+* You can tune it via /proc/fs/ext4/partition/order2_req
 */
if (i = sbi-s_mb_order2_reqs) {
-   i--;
-   if ((ac-ac_g_ex.fe_len  (~(1  i))) == 0)
+   /*
+* This should tell if fe_len is exactly power of 2
+*/
+   if ((ac-ac_g_ex.fe_len  (~(1  (i - 1 == 0)
ac-ac_2order = i;
}
 
@@ -1865,17 +1873,17 @@ static int ext4_mb_regular_allocator(struct 
ext4_allocation_context *ac)
spin_unlock(sbi-s_md_lock);
}
 
+   /* searching for the right group start from the goal value specified */
group = ac-ac_g_ex.fe_group;
 
/* Let's just scan groups to find more-less suitable blocks */
cr = ac-ac_2order ? 0 : 1;
+   /*
+* cr == 0 try to get exact allocation,
+* cr == 3  try to get anything
+*/
 repeat:
for (; cr  4  ac-ac_status == AC_STATUS_CONTINUE; cr++) {
-   /* FIXME!!
-* We need to explain what criteria is and also
-* need to define the number 0 to 4 for criteria
-* What they actually means.
-*/
ac-ac_criteria = cr;
for (i = 0; i  EXT4_SB(sb)-s_groups_count; group++, i++) {
struct ext4_group_info *grp;
@@ -1889,23 +1897,28 @@ repeat:
if (grp-bb_free == 0)
continue;
 
+   /*
+* if the group is already init we check whether it is
+* a good group and if not we don't load the buddy
+*/
if (EXT4_MB_GRP_NEED_INIT(EXT4_GROUP_INFO(sb, group))) {
-   /* we need full data about the group
-* to make a good selection */
+   /*
+* we need full data about the group
+* to make a good selection
+*/
err = ext4_mb_load_buddy(sb, group, e4b);
if (err)
goto out;
ext4_mb_release_desc(e4b);
}
 
-   /* check is group good for our criteries */
+   /*
+* If the particular group doesn't satisfy our
+* criteria we continue with the next group
+*/
if (!ext4_mb_good_group(ac, group, cr))
continue;
 
-   /* FIXME!!
-* here also we are loading the buddy. so what 
difference
-* does EXT4_MB_GRP_NEED_INIT actually make
-*/
err = ext4_mb_load_buddy(sb, group, e4b);
  

Re: patch queue update

2008-01-10 Thread Aneesh Kumar K.V
On Thu, Jan 10, 2008 at 02:43:23PM -0700, Andreas Dilger wrote:
 On Jan 10, 2008  21:03 +0530, Aneesh Kumar K.V wrote:
  if (i = sbi-s_mb_order2_reqs) {
  -   i--;
  -   if ((ac-ac_g_ex.fe_len  (~(1  i))) == 0)
  +   /*
  +* This should tell if fe_len is exactly power of 2
  +*/
  +   if ((ac-ac_g_ex.fe_len  (~(1  (i - 1 == 0)
  ac-ac_2order = i;
 
 While you changed i to (i - 1) in the if you didn't change it when
 setting ac_2order...  Is that incorrect?

Yes that ac_2order should be i - 1;
Will fix it in the next update.

I see that the patch queue update doesn't have most of the changes I
have placed at http://www.radian.org/~kvaneesh/ext4/jan-10-2008-ver2/

 
  /*
  +* Yield the CPU here so that we don't get soft lockup
   */
  -   schedule_timeout(HZ);
  +   schedule();
  goto repeat;
  }
   
  @@ -3808,7 +3820,7 @@ repeat:
  printk(KERN_ERR uh-oh! used pa while discarding\n);
  dump_stack();
  current-state = TASK_UNINTERRUPTIBLE;
  -   schedule();
  +   schedule_timeout(HZ);
  goto repeat;
 
 Is this change to schedule_timeout() intentional?  The earlier code is
 removing the use of schedule_timeout.  I could be wrong, as I didn't
 follow this discussion closely, but sometimes changes like this happen
 accidentally and people don't look at the patch itself...


The patch queue had it modified from schedule_timeout to schedule(). I
am moving it back to the original version. If we have set the task state
to TASK_UNINTERRUPTIBLE it should be schedule_timeout. And at these
place we intent to wait uninterrupted for 1 sec. The place where we
wanted to just yield is ext4_mb_discard_group_preallocations.


  +static unsigned long ext4_get_stripe_size(struct ext4_sb_info *sbi)
  +{
  +   unsigned long stride = le16_to_cpu(sbi-s_es-s_raid_stride);
  +   unsigned long stripe_width = 
  le32_to_cpu(sbi-s_es-s_raid_stripe_width);
  +
  +   if (sbi-s_stripe  sbi-s_stripe = sbi-s_blocks_per_group) {
  +   return sbi-s_stripe;
  +   } else if (stripe_width = sbi-s_blocks_per_group) {
  +   return stripe_width;
  +   } else if (stride = sbi-s_blocks_per_group) {
  +   return stride;
  +   }
 
 If you are doing return XXX you don't need else.
 
  +   /*
  +* set the stripe size. If we have specified it via mount option, then
  +* use the mount option value. If the value specified at mount time is
  +* greater than the blocks per group use the super block value.
  +* Allocator needs it be less than blocks per group.
  +*/
  +   sbi-s_stripe = ext4_get_stripe_size(sbi);
 
 This comment should probably go by ext4_get_stripe_size() definition instead
 of here at the caller.

Will move that to the function definition.

-aneesh
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] set s_raid_stride and s_raid_stripe_width

2008-01-09 Thread Aneesh Kumar K.V

[..snip...]

 + if (param-s_raid_stride 
 + (param-s_raid_stripe_width % param-s_raid_stride) != 0)
 + fprintf(stderr, _(\nWarning: RAID stripe-width %u not an even 
 +   multiple of stride %u.\n\n),
 + param-s_raid_stripe_width, param-s_raid_stride);


We also want to validate that s_raid_stripe_width is not =
blocks_per_group.


-aneesh
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] ext4: Use superblock s_raid_stripe_width as stripe size during block allocation.

2008-01-09 Thread Aneesh Kumar K.V
The stipe size used during block allocation is calculated as below.
a) if we specify a stripe=value option using mount time. Use that value.
b) if not use the value specified in super block.
b) If the value specfied at mount time is greater than blocks per group use
the value specified ini the super block.

Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED]
---
 fs/ext4/super.c |   10 ++
 1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index db1edc8..10330eb 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2136,6 +2136,16 @@ static int ext4_fill_super (struct super_block *sb, void 
*data, int silent)
sbi-s_rsv_window_head.rsv_alloc_hit = 0;
sbi-s_rsv_window_head.rsv_goal_size = 0;
ext4_rsv_window_add(sb, sbi-s_rsv_window_head);
+   /*
+* set the stripe size. If we have specified it via mount option, then
+* use the mount option value. If the value specified at mount time is
+* greater than the blocks per group use the super block value.
+* Allocator needs it be less than blocks per group.
+*/
+   if (!sbi-s_stripe ||
+   sbi-s_stripe = sbi-s_blocks_per_group) {
+   sbi-s_stripe = le32_to_cpu(es-s_raid_stripe_width);
+   }
 
/*
 * set up enough so that it can read an inode
-- 
1.5.4.rc2.60.gb2e62-dirty

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] ext4: Check for return value from sb_set_blocksize

2008-01-09 Thread Aneesh Kumar K.V
sb_set_blocksize validates whether the specfied block size can be used by
the file system. Make sure we fail mounting the file system if the
blocksize specfied cannot be used.

Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED]
---
 fs/ext4/super.c |   18 +++---
 1 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 10330eb..ab62d7f 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1793,7 +1793,6 @@ static int ext4_fill_super (struct super_block *sb, void 
*data, int silent)
unsigned long def_mount_opts;
struct inode *root;
int blocksize;
-   int hblock;
int db_count;
int i;
int needs_recovery;
@@ -1958,20 +1957,17 @@ static int ext4_fill_super (struct super_block *sb, 
void *data, int silent)
goto failed_mount;
}
 
-   hblock = bdev_hardsect_size(sb-s_bdev);
if (sb-s_blocksize != blocksize) {
-   /*
-* Make sure the blocksize for the filesystem is larger
-* than the hardware sectorsize for the machine.
-*/
-   if (blocksize  hblock) {
-   printk(KERN_ERR EXT4-fs: blocksize %d too small for 
-  device blocksize %d.\n, blocksize, hblock);
+
+   brelse (bh);
+
+   /* Validate the filesystem blocksize */
+   if (!sb_set_blocksize(sb, blocksize)) {
+   printk(KERN_ERR EXT4-fs: bad block size %d.\n,
+   blocksize);
goto failed_mount;
}
 
-   brelse (bh);
-   sb_set_blocksize(sb, blocksize);
logical_sb_block = sb_block * EXT4_MIN_BLOCK_SIZE;
offset = do_div(logical_sb_block, blocksize);
bh = sb_bread(sb, logical_sb_block);
-- 
1.5.4.rc2.60.gb2e62-dirty

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Should we enabling mballoc by default ?

2008-01-09 Thread Aneesh Kumar K.V
Hi,

mballoc currently causes fragmentation of small size files. The
behaviour can be observed by running parallel dd on ext4 file system. A
sample test case can be found here.

http://www.radian.org/~kvaneesh/ext4/mballoc-frag/fragmentation-analysis

This is because for small size request/file mballoc uses group prealloc
space. That means if you have singe cpu and multiple threads doing
parallel io to the file the block request are served from the same
prealloc space. This results in the fragmentation observed above. (Group
prealloc space is per cpu )

This problem should go away with delayed allocation because the write
out happens with multiple blocks and mballoc places then close together
even though they get served by the same group prealloc space.
Considering that we don't have delalloc yet, i guess we should push
mballoc to 2.6.25 with default disabled and O_DIRECT type workloads can
enabled then via mount option.


-aneesh
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ext4: Check for return value from sb_set_blocksize

2008-01-09 Thread Aneesh Kumar K.V
Updated patch. The earlier patch did multiple brelse() during failed
mount case.


ext4: Check for return value from sb_set_blocksize

From: Aneesh Kumar K.V [EMAIL PROTECTED]

sb_set_blocksize validates whether the specfied block size can be used by
the file system. Make sure we fail mounting the file system if the
blocksize specfied cannot be used.

Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED]
---

 fs/ext4/super.c |   15 +--
 1 files changed, 5 insertions(+), 10 deletions(-)


diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 10330eb..f9a9ef1 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1793,7 +1793,6 @@ static int ext4_fill_super (struct super_block *sb, void 
*data, int silent)
unsigned long def_mount_opts;
struct inode *root;
int blocksize;
-   int hblock;
int db_count;
int i;
int needs_recovery;
@@ -1958,20 +1957,16 @@ static int ext4_fill_super (struct super_block *sb, 
void *data, int silent)
goto failed_mount;
}
 
-   hblock = bdev_hardsect_size(sb-s_bdev);
if (sb-s_blocksize != blocksize) {
-   /*
-* Make sure the blocksize for the filesystem is larger
-* than the hardware sectorsize for the machine.
-*/
-   if (blocksize  hblock) {
-   printk(KERN_ERR EXT4-fs: blocksize %d too small for 
-  device blocksize %d.\n, blocksize, hblock);
+
+   /* Validate the filesystem blocksize */
+   if (!sb_set_blocksize(sb, blocksize)) {
+   printk(KERN_ERR EXT4-fs: bad block size %d.\n,
+   blocksize);
goto failed_mount;
}
 
brelse (bh);
-   sb_set_blocksize(sb, blocksize);
logical_sb_block = sb_block * EXT4_MIN_BLOCK_SIZE;
offset = do_div(logical_sb_block, blocksize);
bh = sb_bread(sb, logical_sb_block);
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ext4: Fix the soft lockup with multi block allocator.

2008-01-09 Thread Aneesh Kumar K.V
On Wed, Jan 09, 2008 at 01:10:41PM +0100, Jan Kara wrote:
  With the multi block allocator when we don't have prealloc space we discard
  @@ -3790,7 +3782,9 @@ repeat:
   
  /* if we still need more blocks and some PAs were used, try again */
  if (free  needed  busy) {
  +   busy = 0;
  ext4_unlock_group(sb, group);
  +   schedule_timeout(HZ);
  goto repeat;
  }
   Hmm, wouldn't just schedule() be enough here? That would give a good
 chance to other processes to proceed and we would avoid this artificial
 wait of 1s which is quite ugly IMO.
 
   Honza

But then who will wake up the task ?. I have the below comment added to
the patch in the patch queue.

/*
 * We see this quiet rare. But if a particular workload is
 * effected by this we may need to add a waitqueue
 */


-aneesh
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ext4: Fix the soft lockup with multi block allocator.

2008-01-09 Thread Aneesh Kumar K.V
On Wed, Jan 09, 2008 at 07:44:30PM +0100, Jan Kara wrote:
 On Wed 09-01-08 23:54:28, Aneesh Kumar K.V wrote:
  On Wed, Jan 09, 2008 at 01:10:41PM +0100, Jan Kara wrote:
With the multi block allocator when we don't have prealloc space we 
discard
@@ -3790,7 +3782,9 @@ repeat:
 
/* if we still need more blocks and some PAs were used, try 
again */
if (free  needed  busy) {
+   busy = 0;
ext4_unlock_group(sb, group);
+   schedule_timeout(HZ);
goto repeat;
}
 Hmm, wouldn't just schedule() be enough here? That would give a good
   chance to other processes to proceed and we would avoid this artificial
   wait of 1s which is quite ugly IMO.
   
  
  But then who will wake up the task ?. I have the below comment added to
  the patch in the patch queue.
   As far as I know, you don't have to wake-up the task explicitely.
 Scheduler will simply schedule the task sometime in future (it is a similar
 situation as if the task got preempted in the kernel).

I missed the current-state = TASK_UNINTERRUPTIBLE; in that code piece.
So yes without setting the task state yes it would put it back to the run
queue.Infact schedule_timeout() without changing the task state also
behaves similarly. Now that that we know that it if fine just to have a
schedule() there since schedule_timeout() was just behaving as
schedule(). I guess we should make the change you suggested. In that
case we can remove the comment which says we need to add the wait queue.

Mingming,

Do you want me to send a patch or can you make the modification ?

-aneesh

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ext4: Use superblock s_raid_stripe_width as stripe size during block allocation.

2008-01-09 Thread Aneesh Kumar K.V
On Wed, Jan 09, 2008 at 04:36:27PM -0700, Andreas Dilger wrote:
 On Jan 09, 2008  22:37 +0530, Aneesh Kumar K.V wrote:
  The stipe size used during block allocation is calculated as below.
  a) if we specify a stripe=value option using mount time. Use that value.
  b) if not use the value specified in super block.
  b) If the value specfied at mount time is greater than blocks per group use
  the value specified ini the super block.
 
 What if the value on disk is also bad?  I'd add (after the second 'b' :-):
 
 d) if s_stripe is still  s_blocks_per_group try s_raid_stride
 e) if s_stripe is still  s_blocks_per_group use 0

But i guess mke2fs and tune2fs should validate the value of
s_raid_stripe_width and s_raid_stride. Both of them should be less that
blocks per group. Should I add extra check in the kernel for them ?

 
  Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED]
  ---
   fs/ext4/super.c |   10 ++
   1 files changed, 10 insertions(+), 0 deletions(-)
  
  diff --git a/fs/ext4/super.c b/fs/ext4/super.c
  index db1edc8..10330eb 100644
  --- a/fs/ext4/super.c
  +++ b/fs/ext4/super.c
  @@ -2136,6 +2136,16 @@ static int ext4_fill_super (struct super_block *sb, 
  void *data, int silent)
  sbi-s_rsv_window_head.rsv_alloc_hit = 0;
  sbi-s_rsv_window_head.rsv_goal_size = 0;
  ext4_rsv_window_add(sb, sbi-s_rsv_window_head);
  +   /*
  +* set the stripe size. If we have specified it via mount option, then
  +* use the mount option value. If the value specified at mount time is
  +* greater than the blocks per group use the super block value.
  +* Allocator needs it be less than blocks per group.
  +*/
  +   if (!sbi-s_stripe ||
  +   sbi-s_stripe = sbi-s_blocks_per_group) {
 

So what do you think should it be  or =. Looking at the mballoc I
guess it should work with stripe size equal to blocks per group. I am
not sure how efficient the allocation would be though.


-aneesh
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


mballoc and inode prealloc space deletion

2008-01-08 Thread Aneesh Kumar K.V
Hi Alex,

With the latest changes i see both group preallocation and inode
preallocation being used by mballoc. Since the choice is now made with
file size not just the request size. (None of the test were actually
using inode prealloc previously). Now that the test are using inode prealloc i
see the below dump stack on the console. I guess you put in the dump
stack there to  find out how often these happen. Do you still want the
dump stack to be there. Or can I remove it. ?

uh-oh! some one just deleted it
Call Trace:
[c000e4143970] [c0010554] .show_stack+0x6c/0x1a0
(unreliable)
[c000e4143a10] [c019db90]
.ext4_mb_discard_inode_preallocations+0x1d4/0x428
[c000e4143b20] [c0178e8c]
.ext4_discard_reservation+0x30/0xb4
[c000e4143bc0] [c017c0b8] .ext4_release_file+0x54/0xbc
[c000e4143c50] [c00de4b0] .__fput+0x114/0x214
[c000e4143d00] [c00db2d8] .filp_close+0xb0/0xd8
[c000e4143d90] [c00dbb20] .sys_close+0xc4/0x128
[c000e4143e30] [c000872c] syscall_exit+0x0/0x40
uh-oh! some one just deleted it
Call Trace:
[c000ede97970] [c0010554] .show_stack+0x6c/0x1a0
(unreliable)
[c000ede97a10] [c019db90]
.ext4_mb_discard_inode_preallocations+0x1d4/0x428
[c000ede97b20] [c0178e8c]
.ext4_discard_reservation+0x30/0xb4
[c000ede97bc0] [c017c0b8] .ext4_release_file+0x54/0xbc
[c000ede97c50] [c00de4b0] .__fput+0x114/0x214
[c000ede97d00] [c00db2d8] .filp_close+0xb0/0xd8
[c000ede97d90] [c00dbb20] .sys_close+0xc4/0x128
[c000ede97e30] [c000872c] syscall_exit+0x0/0x40
uh-oh! some one just deleted it
Call Trace:
[c000e6b1f970] [c0010554] .show_stack+0x6c/0x1a0
(unreliable)
[c000e6b1fa10] [c019db90]
.ext4_mb_discard_inode_preallocations+0x1d4/0x428
[c000e6b1fb20] [c0178e8c]
.ext4_discard_reservation+0x30/0xb4
[c000e6b1fbc0] [c017c0b8] .ext4_release_file+0x54/0xbc
[c000e6b1fc50] [c00de4b0] .__fput+0x114/0x214
[c000e6b1fd00] [c00db2d8] .filp_close+0xb0/0xd8
[c000e6b1fd90] [c00dbb20] .sys_close+0xc4/0x128
[c000e6b1fe30] [c000872c] syscall_exit+0x0/0x40
-- 0:conmux-control -- time-stamp -- Jan/08/08  2:20:11 --
-- 0:conmux-control -- time-stamp -- Jan/08/08  2:38:19 --
(bot:conmon-payload) disconnected

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] mballoc update

2008-01-08 Thread Aneesh Kumar K.V
On Tue, Jan 08, 2008 at 12:01:14PM +0530, Aneesh Kumar K.V wrote:
 Hi,
 
 This is the update for mballoc patch. The changes are result of merging
 with the lustre cvs version of mballoc. I liked this patch better because
 it is simple. I also the updated the commit message. The update commit
 message is also attached below. We only have one FIXME!! in the commit
 message now to explain the inode buddy cache  allocator.
 
 Let me know what you think.
 
 This is not yet for patch queue. I will update the mballoc-core.patch
 and send it the full patch later.

The patch for patch queue can be found at

http://www.radian.org/~kvaneesh/ext4/jan-8-2008/mballoc-core.patch

-aneesh
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] mballoc changes from ldiskfs

2008-01-07 Thread Aneesh Kumar K.V
Hi,

This patch is  not even compile tested. I am sending it over to find out
whether some of the changes are even needed and to make sure i didn't
drop any bug fix in the merge.

something I noticed.

a) prealloc table is completely gone.
b) ext4_mb_put_pa change. ( I guess that is a bug with ldiskfs ).


now by default request less that 64K use locality group preallocation.

The ldiskfs change i looked at is from
lustre/ldiskfs/kernel_patches/patches/ext3-mballoc3-core.patch

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 58a70a1..cb84516 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -532,10 +532,10 @@ static inline void mb_set_bit(int bit, void *addr)
ext4_set_bit(bit, addr);
 }
 
-static inline void mb_set_bit_atomic(int bit, void *addr)
+static inline void mb_set_bit_atomic(spinlock_t *lock, int bit, void *addr)
 {
mb_correct_addr_and_bit(bit, addr);
-   ext4_set_bit_atomic(NULL, bit, addr);
+   ext4_set_bit_atomic(lock, bit, addr);
 }
 
 static inline void mb_clear_bit(int bit, void *addr)
@@ -544,10 +544,10 @@ static inline void mb_clear_bit(int bit, void *addr)
ext4_clear_bit(bit, addr);
 }
 
-static inline void mb_clear_bit_atomic(int bit, void *addr)
+static inline void mb_clear_bit_atomic(spinlock_t *lock, int bit, void *addr)
 {
mb_correct_addr_and_bit(bit, addr);
-   ext4_clear_bit_atomic(NULL, bit, addr);
+   ext4_clear_bit_atomic(lock, bit, addr);
 }
 
 static inline void *mb_find_buddy(struct ext4_buddy *e4b, int order, int *max)
@@ -1155,7 +1155,7 @@ static int mb_find_order_for_block(struct ext4_buddy 
*e4b, int block)
return 0;
 }
 
-static void mb_clear_bits(void *bm, int cur, int len)
+static void mb_clear_bits(spinlock_t *lock, void *bm, int cur, int len)
 {
__u32 *addr;
 
@@ -1168,12 +1168,12 @@ static void mb_clear_bits(void *bm, int cur, int len)
cur += 32;
continue;
}
-   mb_clear_bit_atomic(cur, bm);
+   mb_clear_bit_atomic(lock, cur, bm);
cur++;
}
 }
 
-static void mb_set_bits(void *bm, int cur, int len)
+static void mb_set_bits(spinlock_t *lock, void *bm, int cur, int len)
 {
__u32 *addr;
 
@@ -1186,7 +1186,7 @@ static void mb_set_bits(void *bm, int cur, int len)
cur += 32;
continue;
}
-   mb_set_bit_atomic(cur, bm);
+   mb_set_bit_atomic(lock, cur, bm);
cur++;
}
 }
@@ -1403,7 +1403,8 @@ static int mb_mark_used(struct ext4_buddy *e4b, struct 
ext4_free_extent *ex)
e4b-bd_info-bb_counters[ord]++;
}
 
-   mb_set_bits(EXT4_MB_BITMAP(e4b), ex-fe_start, len0);
+   mb_set_bits(sb_bgl_lock(EXT4_SB(e4b-bd_sb), ex-fe_group),
+   EXT4_MB_BITMAP(e4b), ex-fe_start, len0);
mb_check_buddy(e4b);
 
return ret;
@@ -1439,14 +1440,6 @@ static void ext4_mb_use_best_found(struct 
ext4_allocation_context *ac,
get_page(ac-ac_bitmap_page);
ac-ac_buddy_page = e4b-bd_buddy_page;
get_page(ac-ac_buddy_page);
-
-   /* store last allocated for subsequent stream allocation */
-   if ((ac-ac_flags  EXT4_MB_HINT_DATA)) {
-   spin_lock(sbi-s_md_lock);
-   sbi-s_mb_last_group = ac-ac_f_ex.fe_group;
-   sbi-s_mb_last_start = ac-ac_f_ex.fe_start;
-   spin_unlock(sbi-s_md_lock);
-   }
 }
 
 /*
@@ -1509,8 +1502,8 @@ static void ext4_mb_measure_extent(struct 
ext4_allocation_context *ac,
struct ext4_free_extent *gex = ac-ac_g_ex;
 
BUG_ON(ex-fe_len = 0);
-   BUG_ON(ex-fe_len = (1  ac-ac_sb-s_blocksize_bits) * 8);
-   BUG_ON(ex-fe_start = (1  ac-ac_sb-s_blocksize_bits) * 8);
+   BUG_ON(ex-fe_len = EXT4_BLOCKS_PER_GROUP(ac-ac_sb));
+   BUG_ON(ex-fe_start = EXT4_BLOCKS_PER_GROUP(ac-ac_sb));
BUG_ON(ac-ac_status != AC_STATUS_CONTINUE);
 
ac-ac_found++;
@@ -1553,7 +1546,6 @@ static void ext4_mb_measure_extent(struct 
ext4_allocation_context *ac,
/* if the request is satisfied, then we try to find
 * an extent that still satisfy the request, but is
 * smaller than previous one */
-   if (ex-fe_len  bex-fe_len)
*bex = *ex;
}
 
@@ -1702,8 +1694,8 @@ static void ext4_mb_complex_scan_group(struct 
ext4_allocation_context *ac,
i = e4b-bd_info-bb_first_free;
 
while (free  ac-ac_status == AC_STATUS_CONTINUE) {
-   i = ext4_find_next_zero_bit(bitmap, sb-s_blocksize * 8, i);
-   if (i = sb-s_blocksize * 8) {
+   i = ext4_find_next_zero_bit(bitmap, EXT4_BLOCKS_PER_GROUP(sb), 
i);
+   if (i = EXT4_BLOCKS_PER_GROUP(sb)) {
BUG_ON(free != 0);
break;
}
@@ -1744,7 +1736,7 @@ static void ext4_mb_scan_aligned(struct 

Re: [PATCH] mballoc changes from ldiskfs

2008-01-07 Thread Aneesh Kumar K.V
On Mon, Jan 07, 2008 at 11:58:00PM +0530, Aneesh Kumar K.V wrote:
 Hi,
 
 This patch is  not even compile tested. I am sending it over to find out
 whether some of the changes are even needed and to make sure i didn't
 drop any bug fix in the merge.
 
 something I noticed.
 
 a) prealloc table is completely gone.
 b) ext4_mb_put_pa change. ( I guess that is a bug with ldiskfs ).
 
 
 now by default request less that 64K use locality group preallocation.
 
 The ldiskfs change i looked at is from
 lustre/ldiskfs/kernel_patches/patches/ext3-mballoc3-core.patch

[... snip... ]

 
   BUG_ON(ex-fe_len = 0);
 - BUG_ON(ex-fe_len = (1  ac-ac_sb-s_blocksize_bits) * 8);
 - BUG_ON(ex-fe_start = (1  ac-ac_sb-s_blocksize_bits) * 8);
 + BUG_ON(ex-fe_len = EXT4_BLOCKS_PER_GROUP(ac-ac_sb));
 + BUG_ON(ex-fe_start = EXT4_BLOCKS_PER_GROUP(ac-ac_sb));
   BUG_ON(ac-ac_status != AC_STATUS_CONTINUE);
 
   ac-ac_found++;
 @@ -1553,7 +1546,6 @@ static void ext4_mb_measure_extent(struct 
 ext4_allocation_context *ac,
   /* if the request is satisfied, then we try to find
* an extent that still satisfy the request, but is
* smaller than previous one */
 - if (ex-fe_len  bex-fe_len)
   *bex = *ex;
   }


This one is a bug fix for ext4 patch queue from Alex. So this change is needed.


 
 @@ -1702,8 +1694,8 @@ static void ext4_mb_complex_scan_group(struct 
 ext4_allocation_context *ac,
   i = e4b-bd_info-bb_first_free;
 
   while (free  ac-ac_status == AC_STATUS_CONTINUE) {
 - i = ext4_find_next_zero_bit(bitmap, sb-s_blocksize * 8, i);
 - if (i = sb-s_blocksize * 8) {
 + i = ext4_find_next_zero_bit(bitmap, EXT4_BLOCKS_PER_GROUP(sb), 
 i);
 + if (i = EXT4_BLOCKS_PER_GROUP(sb)) {
   BUG_ON(free != 0);
   break;
   }
 @@ -1744,7 +1736,7 @@ static void ext4_mb_scan_aligned(struct 
 ext4_allocation_context *ac,
   i = (i - le32_to_cpu(sbi-s_es-s_first_data_block))
   % EXT4_BLOCKS_PER_GROUP(sb);
 
 - while (i  sb-s_blocksize * 8) {
 + while (i  EXT4_BLOCKS_PER_GROUP(sb)) {
   if (!mb_test_bit(i, bitmap)) {
   max = mb_find_extent(e4b, 0, i, sbi-s_stripe, ex);
   if (max = sbi-s_stripe) {
 @@ -1839,20 +1831,6 @@ static int ext4_mb_regular_allocator(struct 
 ext4_allocation_context *ac)
   ac-ac_2order = i;
   }
 
 - /* if stream allocation is enabled, use global goal */
 -
 - /* FIXME!!
 -  * Need more explanation on what it is and how stream
 -  * allocation is represented by the below conditional
 -  */
 - if ((ac-ac_g_ex.fe_len  sbi-s_mb_large_req) 
 - (ac-ac_flags  EXT4_MB_HINT_DATA)) {
 - /* TBD: may be hot point */
 - spin_lock(sbi-s_md_lock);
 - ac-ac_g_ex.fe_group = sbi-s_mb_last_group;
 - ac-ac_g_ex.fe_start = sbi-s_mb_last_start;
 - spin_unlock(sbi-s_md_lock);
 - }
 
   group = ac-ac_g_ex.fe_group;
 
 @@ -2291,7 +2269,8 @@ static void ext4_mb_history_init(struct super_block *sb)
   spin_lock_init(sbi-s_mb_history_lock);
   i = sbi-s_mb_history_max * sizeof(struct ext4_mb_history);
   sbi-s_mb_history = kmalloc(i, GFP_KERNEL);
 - memset(sbi-s_mb_history, 0, i);
 + if (likely(sbi-s_mb_history != NULL))
 + memset(sbi-s_mb_history, 0, i);
   /* if we can't allocate history, then we simple won't use it */
  }
 
 @@ -2300,7 +2279,7 @@ static void ext4_mb_store_history(struct 
 ext4_allocation_context *ac)
   struct ext4_sb_info *sbi = EXT4_SB(ac-ac_sb);
   struct ext4_mb_history h;
 
 - if (likely(sbi-s_mb_history == NULL))
 + if (unlikely(sbi-s_mb_history == NULL))
   return;
 
   if (!(ac-ac_op  sbi-s_mb_history_filter))
 @@ -2312,11 +2291,6 @@ static void ext4_mb_store_history(struct 
 ext4_allocation_context *ac)
   h.orig = ac-ac_o_ex;
   h.result = ac-ac_b_ex;
   h.flags = ac-ac_flags;
 - h.found = ac-ac_found;
 - h.groups = ac-ac_groups_scanned;
 - h.cr = ac-ac_criteria;
 - h.tail = ac-ac_tail;
 - h.buddy = ac-ac_buddy;
   h.merged = 0;


This one is a bug for ext4 patch queue from Alex. So this change is
needed.


[]
 +
 + /* first, try to predict filesize */
 + /* XXX: should this table be tunable? */

Here we says we need to have tunables. Does that mean prealloc table is
needed ?


 + start = 0;
 + if (size = 16 * 1024) {
 + size = 16 * 1024;
 + } else if (size = 32 * 1024) {
 + size = 32 * 1024;
 + } else if (size = 64 * 1024) {
 + size = 64 * 1024;
 + } else if (size = 128 * 1024) {
 + size = 128 * 1024;
 + } else if (size = 256 * 1024) {
 + size = 256 * 1024;
 + } else if (size = 512 * 1024

[PATCH] mballoc update

2008-01-07 Thread Aneesh Kumar K.V
Hi,

This is the update for mballoc patch. The changes are result of merging
with the lustre cvs version of mballoc. I liked this patch better because
it is simple. I also the updated the commit message. The update commit
message is also attached below. We only have one FIXME!! in the commit
message now to explain the inode buddy cache  allocator.

Let me know what you think.

This is not yet for patch queue. I will update the mballoc-core.patch
and send it the full patch later.

--- commit message ---

ext4: Add multi block allocator for ext4

From: Alex Tomas [EMAIL PROTECTED]

The allocation request involve request for multiple number of
blocks near to the goal(block) value specified.

During initialization phase of the allocator we decide to use the group
preallocation or inode preallocation depending on the size file. The size of
the file could be the resulting file size we would have after allocation or
the current file size which ever is larger. If the size is less that
sbi-s_mb_stream_request we select the group preallocation. The default value
of s_mb_stream_request is 16 blocks. This can also be tuned via
/proc/fs/ext4/partition/stream_req. The value is represented in terms of
number of blocks.

The main motivation for having small file use group preallocation is to ensure
that we have small file closer in the disk.


First stage the allocator looks at the inode prealloc list
ext4_inode_info-i_prealloc_list contain list of prealloc
spaces for this particular inode. The inode prealloc space
is represented as
pa_lstart - the logical start block for this prealloc space
pa_pstart - the physical start block for this prealloc space
pa_len- lenght for this prealloc space
pa_free   -  free space available in this prealloc space

The inode preallocation space is used looking at the _logical_
start block. If only the logical file block falls within the
range of prealloc space we will consume the particular prealloc
space. This make sure that that the we have contiguous physical
blocks representing the file blocks

The important thing to be noted in case of inode prealloc space
is that we don't modify the values associated to inode prealloc
space except pa_free.

If we are not able to find blocks in the inode prealloc space and if we have
the group allocation flag set then we look at the locality group prealloc
space. These are per CPU prealloc list represented as

ext4_sb_info.s_locality_groups[smp_processor_id()]

The reason for having a per cpu locality group is to reduce the contention
between CPUs. It is possible to get scheduled at this point.

The locality group prealloc space is used looking at whether we have
enough free space (pa_free) withing the prealloc space.

If we can't allocate blocks via inode prealloc or/and locality group prealloc
then we look at the buddy cache. The buddy cache is represented by
ext4_sb_info.s_buddy_cache (struct inode) whose file offset gets mapped to the
buddy and bitmap information regarding different groups. The buddy information
is attached to buddy cache inode so that we can access them through the page
cache. The information regarding each group is loaded via ext4_mb_load_buddy.
The information involve block bitmap and buddy information. The information are
stored in the inode as

 {page}
 [ group 0 buddy][ group 0 bitmap] [group 1][ group 1]...


one block each for bitmap and buddy information.
So for each group we take up 2 blocks. A page can
contain blocks_per_page (PAGE_CACHE_SIZE / blocksize)  blocks.
So it can have information regarding groups_per_page which
is blocks_per_page/2

Buddy cache inode is not stored on disk. The inode get
thrown away at the end when unmounting the disk.

We look for count number of blocks in the buddy cache. If
we were able to locate that many free blocks we return
with additional information regarding rest of the
contiguous physical block available


Before allocating blocks via buddy cache we normalize the request blocks. This
ensure we ask for more blocks that we needed. The extra blocks that we get
after allocation is added to the respective prealloc list. In case of inode
preallocation we follow a list of heuristics based on file size. This can be
found in ext4_mb_normalize_request. If we are doing a group prealloc we try to
normalize the request to sbi-s_mb_group_prealloc. Default value of
s_mb_group_prealloc is set to 512 blocks. This can be tuned via
/proc/fs/ext4/partition/group_prealloc. The value is represented in terms of
number of blocks. If we have mounted the file system with -O stripe=value
option the group prealloc request is normalized to the stripe value 
(sbi-s_stripe)

/* FIXME!! explanation of how blocks are picked from buddy cache and the tunable
max_to_scan min_to_scan  order2_req */

Both the prealloc space are getting populated
as above. So for the first request we will hit the buddy cache
which will result in this prealloc space getting filled. The 

[Updated PATCH] ext4: Use the correct block number when reading the super block.

2007-12-31 Thread Aneesh Kumar K.V
If the block device hard sector size is larger than EXT4_MIN_BLOCK_SIZE we
end up with wrong block number when reading super block.

Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED]
---
 fs/ext4/super.c |8 
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 1ca0f54..3095370 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1629,8 +1629,8 @@ static int ext4_fill_super (struct super_block *sb, void 
*data, int silent)
 * block sizes.  We need to calculate the offset from buffer start.
 */
if (blocksize != EXT4_MIN_BLOCK_SIZE) {
-   logical_sb_block = sb_block * EXT4_MIN_BLOCK_SIZE;
-   offset = do_div(logical_sb_block, blocksize);
+   logical_sb_block = (sb_block * EXT4_MIN_BLOCK_SIZE) / blocksize;
+   offset = do_div((sb_block * EXT4_MIN_BLOCK_SIZE), blocksize);
} else {
logical_sb_block = sb_block;
}
@@ -1747,8 +1747,8 @@ static int ext4_fill_super (struct super_block *sb, void 
*data, int silent)
 
brelse (bh);
sb_set_blocksize(sb, blocksize);
-   logical_sb_block = sb_block * EXT4_MIN_BLOCK_SIZE;
-   offset = do_div(logical_sb_block, blocksize);
+   logical_sb_block = (sb_block * EXT4_MIN_BLOCK_SIZE) / blocksize;
+   offset = do_div((sb_block * EXT4_MIN_BLOCK_SIZE), blocksize);
bh = sb_bread(sb, logical_sb_block);
if (!bh) {
printk(KERN_ERR
-- 
1.5.4.rc2-dirty

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Updated PATCH] ext4: Use the correct block number when reading the super block.

2007-12-31 Thread Aneesh Kumar K.V
On Mon, Dec 31, 2007 at 01:59:22PM +0530, Aneesh Kumar K.V wrote:
 If the block device hard sector size is larger than EXT4_MIN_BLOCK_SIZE we
 end up with wrong block number when reading super block.
 

Ignore the patch. I got confused by the do_div syntax. The do_div already save
the division result in logical_sb_block;

-aneesh
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] ext4: Use the correct block number when reading the super block.

2007-12-30 Thread Aneesh Kumar K.V
If the block device hard sector size is larger than EXT4_MIN_BLOCK_SIZE we
end up with wrong block number when reading super block.

Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED]
---
 fs/ext4/super.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 1ca0f54..832e1ad 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1629,8 +1629,8 @@ static int ext4_fill_super (struct super_block *sb, void 
*data, int silent)
 * block sizes.  We need to calculate the offset from buffer start.
 */
if (blocksize != EXT4_MIN_BLOCK_SIZE) {
-   logical_sb_block = sb_block * EXT4_MIN_BLOCK_SIZE;
-   offset = do_div(logical_sb_block, blocksize);
+   logical_sb_block = (sb_block * EXT4_MIN_BLOCK_SIZE) / blocksize;
+   offset = do_div((sb_block * EXT4_MIN_BLOCK_SIZE), blocksize);
} else {
logical_sb_block = sb_block;
}
-- 
1.5.4.rc2-dirty

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Patch queue update

2007-12-23 Thread Aneesh Kumar K.V
Hi Mingming,

I have placed the updated patch queue at

http://www.radian.org/~kvaneesh/ext4/dec-24-2007/

The .tar can be found at

http://www.radian.org/~kvaneesh/ext4/dec-24-2007/patch-series.tar

The changes involve the below attached diff. I also updated the commit message
of mballoc core patch.

a) remove the ext-truncate-mutex.patch
b) Add the truncate mutex to read write semaphore conversion patch set.
This is added at the top of unstable series expecting that it can be
pushed to stable soon. 
c) Update mballoc and migrate patch to use i_data_sem.
d) mballoc soft lockup fix.
e) update mballoc core patch commit message.
f) Update enable-delalloc-and-mballoc.patch patch to enabled write back
   mode by default. Delalloc only support write back.


diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index abc8900..643046b 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -526,7 +526,7 @@ static inline int rsv_is_empty(struct ext4_reserve_window 
*rsv)
  * when setting the reservation window size through ioctl before the file
  * is open for write (needs block allocation).
  *
- * Needs truncate_mutex protection prior to call this function.
+ * Needs down_write(i_data_sem) protection prior to call this function.
  */
 void ext4_init_block_alloc_info(struct inode *inode)
 {
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index ff83982..99e539d 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -1706,7 +1706,7 @@ static int ext4_ext_rm_idx(handle_t *handle, struct inode 
*inode,
  * This routine returns max. credits that the extent tree can consume.
  * It should be OK for low-performance paths like -writepage()
  * To allow many writing processes to fit into a single transaction,
- * the caller should calculate credits under truncate_mutex and
+ * the caller should calculate credits under i_data_sem and
  * pass the actual path.
  */
 int ext4_ext_calc_credits_for_insert(struct inode *inode,
@@ -2272,7 +2272,8 @@ out:
 
 /*
  * Need to be called with
- * mutex_lock(EXT4_I(inode)-truncate_mutex);
+ * down_read(EXT4_I(inode)-i_data_sem) if not allocating file system block
+ * (ie, create is zero). Otherwise down_write(EXT4_I(inode)-i_data_sem)
  */
 int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,
ext4_lblk_t iblock,
@@ -2514,7 +2515,7 @@ void ext4_ext_truncate(struct inode * inode, struct page 
*page)
if (page)
ext4_block_truncate_page(handle, page, mapping, inode-i_size);
 
-   mutex_lock(EXT4_I(inode)-truncate_mutex);
+   down_write(EXT4_I(inode)-i_data_sem);
ext4_ext_invalidate_cache(inode);
 
ext4_mb_discard_inode_preallocations(inode);
@@ -2552,7 +2553,7 @@ out_stop:
if (inode-i_nlink)
ext4_orphan_del(handle, inode);
 
-   mutex_unlock(EXT4_I(inode)-truncate_mutex);
+   up_write(EXT4_I(inode)-i_data_sem);
ext4_journal_stop(handle);
 }
 
@@ -2616,6 +2617,7 @@ long ext4_fallocate(struct inode *inode, int mode, loff_t 
offset, loff_t len)
 * modify 1 super block, 1 block bitmap and 1 group descriptor.
 */
credits = EXT4_DATA_TRANS_BLOCKS(inode-i_sb) + 3;
+   down_write((EXT4_I(inode)-i_data_sem));
 retry:
while (ret = 0  ret  max_blocks) {
block = block + ret;
@@ -2672,6 +2674,7 @@ retry:
if (ret == -ENOSPC  ext4_should_retry_alloc(inode-i_sb, retries))
goto retry;
 
+   up_write((EXT4_I(inode)-i_data_sem));
/*
 * Time to update the file size.
 * Update only when preallocation was requested beyond the file size.
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index a6b2aa1..ac35ec5 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -37,9 +37,9 @@ static int ext4_release_file (struct inode * inode, struct 
file * filp)
if ((filp-f_mode  FMODE_WRITE) 
(atomic_read(inode-i_writecount) == 1))
{
-   mutex_lock(EXT4_I(inode)-truncate_mutex);
+   down_write(EXT4_I(inode)-i_data_sem);
ext4_discard_reservation(inode);
-   mutex_unlock(EXT4_I(inode)-truncate_mutex);
+   up_write(EXT4_I(inode)-i_data_sem);
}
if (is_dx(inode)  filp-private_data)
ext4_htree_free_dir_info(filp-private_data);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index caec966..33018a5 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -346,7 +346,7 @@ static int ext4_block_to_path(struct inode *inode,
  * the whole chain, all way to the data (returns %NULL, *err == 0).
  *
  *  Need to be called with
- *  mutex_lock(EXT4_I(inode)-truncate_mutex)
+ *  down_read(EXT4_I(inode)-i_data_sem)
  */
 static Indirect *ext4_get_branch(struct inode *inode, int depth,
 ext4_lblk_t  *offsets,
@@ -778,7 +778,8 @@ err_out:
  *
  *
  * Need to be called with
- * mutex_lock(EXT4_I(inode)-truncate_mutex)
+ * 

[PATCH] ext4: Fix the soft lockup with multi block allocator.

2007-12-21 Thread Aneesh Kumar K.V
With the multi block allocator when we don't have prealloc space we discard
the existing preallocaltion data and try to rebuild the buddy cache. While
discarding the loop through the group specific prealloc list. If we find any
particular prealloc space being used we mark the space busy. If we are not
able to find enough free space and if we have any prealloc space busy we loop
back again. With non preempted kernel this tight loop resulted in watchdog
timer triggering soft lockup warning.


Whe we are allocation the block we search the prealloc list and mark the
prealloc space used via incrementing pa_count value. One after succesffuly
allocating the block we need to update the block bitmap and this could
actually involved a disk io if the bitmap need to read from the disk. This
actually cause prealloc space to be marked used for quiet a long time. This
inturn results in the discard logic going on tight loop resulting in watchdog
timer triggering soft lockup warning.

Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED]
---
 fs/ext4/mballoc.c |   12 +++-
 1 files changed, 3 insertions(+), 9 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 844765c..cbc8143 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -3729,7 +3729,7 @@ static int ext4_mb_discard_group_preallocations(struct 
super_block *sb,
struct list_head list;
struct ext4_buddy e4b;
int err;
-   int busy;
+   int busy = 0;
int free = 0;
 
mb_debug(discard preallocation for group %lu\n, group);
@@ -3754,20 +3754,12 @@ static int ext4_mb_discard_group_preallocations(struct 
super_block *sb,
INIT_LIST_HEAD(list);
 
 repeat:
-   busy = 0;
ext4_lock_group(sb, group);
list_for_each_entry_safe(pa, tmp,
grp-bb_prealloc_list, pa_group_list) {
spin_lock(pa-pa_lock);
if (atomic_read(pa-pa_count)) {
spin_unlock(pa-pa_lock);
-   /* FIXME!!
-* It is quiet natural to have the pa being
-* used on other cpus when we are trying free
-* space
-   printk(KERN_ERR uh! busy PA\n);
-   dump_stack();
-   */
busy = 1;
continue;
}
@@ -3790,7 +3782,9 @@ repeat:
 
/* if we still need more blocks and some PAs were used, try again */
if (free  needed  busy) {
+   busy = 0;
ext4_unlock_group(sb, group);
+   schedule_timeout(HZ);
goto repeat;
}
 
-- 
1.5.4.rc0-dirty

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BUG: soft lockup - CPU#0 stuck for 11s! [fsstress:5534]

2007-12-21 Thread Aneesh Kumar K.V
On Thu, Dec 20, 2007 at 08:02:42PM +0530, Aneesh Kumar K.V wrote:
 I am seeing this with the patch queue. I can reproduce this on x86 and
 powerpc. I see the file system full when this happens. The same happens even
 without delalloc enabled.
 
 


The below patch fix the same for me. One thing i observed with the patch queue
is, enabling delalloc by default. Delalloc doesn't handle the file system full
case because there is no block reservation. Unless we have block reservation i
guess we should disable delalloc by default.

I will send a full patch with proper log message.

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 72e1920..8b45ac0 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -3791,6 +3791,7 @@ repeat:
/* if we still need more blocks and some PAs were used, try again */
if (free  needed  busy) {
ext4_unlock_group(sb, group);
+   schedule_timeout(HZ);
goto repeat;
}
 
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC] truncate_mutex to read_write semaphore

2007-12-14 Thread Aneesh Kumar K.V
The series include the truncate_mutex to read write semaphore conversion. I am
marking below some of the test results.


For O_DIRECT workloads we won't see the contention on truncate mutex because we 
are
doing a get_block under inode-i_mutex.


For FIBMAP we won't see contention because the get_block get called under BKL.


threaded read with low memory
---

Top contenting locks were: (/proc/lock_stat output)

q-__queue_lock:   12549 12572 10.65  3302.1636818.7846618 
395721 3.47  49636.48  571453.47
inode-i_data.tree_lock-W: 3970  4026  2.62   33.39  3508.74 25924 
95164  5.33  949.5980180.03
inode-i_data.tree_lock-R: 1937  2002  2.52   22.05  1528.78 19543 
141863 5.57  119.72137126.60
ei-truncate_mutex#2:  4553  4769  169.62 1028484.20 39334253.92 19610 
47069  31.74 102280.63 680802.57

second run
-

q-__queue_lock:   12499 12535 3.76247.7119799.9446341  
405427  4.34  216.31527282.59
inode-i_data.tree_lock-W: 4009  4071  10.04   31.78 3434.95 25612  
93458  7.29  61.87 78365.20
inode-i_data.tree_lock-R: 1919  1973  4.4330.93 1523.04 18953  
142635 4.95  109.20137098.84
ei-truncate_mutex#2:  4346  4499  1546.39 896379.29 31107317.47 19051  
48223  37.94 122579.25 628968.65

The above result implies that the threaded read with low memory (booted with
mem=512M on a 16 cpu x86-64) results in contention on truncate_mutex.


threaded read with low memory after converting to i_data_sem
---

ei-i_data_sem-R:  0 0  0.000.00  0.00   18017  
48801   38.12 3494783.37 22982474.21
ei-i_data_sem-R:  0 0  0.000.00  0.00   18233  
49118   45.09 4953783.87 32699001.46

As you can see from the /proc/lock_stat output above the write semaphore is
not taken at all.

threaded write
--
ei-i_data_sem-W: 0 0  0.00 0.00  0.00 24
64163   41.04 2620905.3216331786.48
ei-i_data_sem-R: 0 0  0.00 0.00  0.00 13352 
83969   51.40 1212864.742834511.75


Here we see some read semphore acquisition. We take read mode of the semaphore
to  not content in the overwrite case. We see no contention here because the
write gets done under inode-i_mutex

sb-s_type-i_mutex_key#1:  313958  313962   3650.35 99510834.17  
4881402594.11  314481   616528   37.22 7579553.9754139119.82
  --
  sb-s_type-i_mutex_key#1 313962 [8027aa80] 
generic_file_aio_write+0x4f/0xc2
  sb-s_type-i_mutex_key#1 0  [802a538c] 
generic_file_llseek+0x36/0x98


second-run
-

ei-i_data_sem-W: 00  0.00  0.00   0.00   2   6114341.56   
9299754.4515811211.79
ei-i_data_sem-R: 00  0.00  0.00   0.00   13272   8244268.40   
1632405.222877135.32



sb-s_type-i_mutex_key#1:  441031  441163 10873.77 144350457.93  
4988289572.34  441679 742079   163.0515158665.5659655118.60
  --
   sb-s_type-i_mutex_key#1 441163  [802a538c] 
generic_file_llseek+0x36/0x98
   sb-s_type-i_mutex_key#1 0   [8027aa80] 
generic_file_aio_write+0x4f/0xc2



The test program is at http://www.radian.org/~kvaneesh/ext4/truncate_mutex/
The file system is modified to create highly fragmented file via frag.c



-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/3] ext4: Make ext4_get_blocks_wrap take the truncate_mutex early.

2007-12-14 Thread Aneesh Kumar K.V
When doing a migrate from ext3 to ext4 inode we need to make sure the test
for inode type and walking inode data happens inside  lock. To make this
happen move truncate_mutex early before checking the i_flags.


This actually should enable us to remove the verify_chain().

Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED]
---
 fs/ext4/extents.c   |9 --
 fs/ext4/inode.c |   69 +-
 include/linux/ext4_fs.h |   12 ++--
 3 files changed, 23 insertions(+), 67 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 8528774..98ef84d 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -2225,6 +2225,10 @@ out:
return err ? err : allocated;
 }
 
+/*
+ * Need to be called with
+ * mutex_lock(EXT4_I(inode)-truncate_mutex);
+ */
 int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,
ext4_fsblk_t iblock,
unsigned long max_blocks, struct buffer_head *bh_result,
@@ -2240,7 +2244,6 @@ int ext4_ext_get_blocks(handle_t *handle, struct inode 
*inode,
__clear_bit(BH_New, bh_result-b_state);
ext_debug(blocks %d/%lu requested for inode %u\n, (int) iblock,
max_blocks, (unsigned) inode-i_ino);
-   mutex_lock(EXT4_I(inode)-truncate_mutex);
 
/* check in cache */
goal = ext4_ext_in_cache(inode, iblock, newex);
@@ -2414,8 +2417,6 @@ out2:
ext4_ext_drop_refs(path);
kfree(path);
}
-   mutex_unlock(EXT4_I(inode)-truncate_mutex);
-
return err ? err : allocated;
 }
 
@@ -2544,6 +2545,7 @@ long ext4_fallocate(struct inode *inode, int mode, loff_t 
offset, loff_t len)
 * modify 1 super block, 1 block bitmap and 1 group descriptor.
 */
credits = EXT4_DATA_TRANS_BLOCKS(inode-i_sb) + 3;
+   mutex_lock(EXT4_I(inode)-truncate_mutex)
 retry:
while (ret = 0  ret  max_blocks) {
block = block + ret;
@@ -2600,6 +2602,7 @@ retry:
if (ret == -ENOSPC  ext4_should_retry_alloc(inode-i_sb, retries))
goto retry;
 
+   mutex_unlock(EXT4_I(inode)-truncate_mutex)
/*
 * Time to update the file size.
 * Update only when preallocation was requested beyond the file size.
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 5489703..9cf37b2 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -243,13 +243,6 @@ static inline void add_chain(Indirect *p, struct 
buffer_head *bh, __le32 *v)
p-bh = bh;
 }
 
-static int verify_chain(Indirect *from, Indirect *to)
-{
-   while (from = to  from-key == *from-p)
-   from++;
-   return (from  to);
-}
-
 /**
  * ext4_block_to_path - parse the block number into array of offsets
  * @inode: inode in question (we are only interested in its superblock)
@@ -344,10 +337,11 @@ static int ext4_block_to_path(struct inode *inode,
  * (pointer to last triple returned, [EMAIL PROTECTED] == 0)
  * or when it gets an IO error reading an indirect block
  * (ditto, [EMAIL PROTECTED] == -EIO)
- * or when it notices that chain had been changed while it was reading
- * (ditto, [EMAIL PROTECTED] == -EAGAIN)
  * or when it reads all @depth-1 indirect blocks successfully and finds
  * the whole chain, all way to the data (returns %NULL, *err == 0).
+ *
+ *  Need to be called with
+ *  mutex_lock(EXT4_I(inode)-truncate_mutex)
  */
 static Indirect *ext4_get_branch(struct inode *inode, int depth, int *offsets,
 Indirect chain[4], int *err)
@@ -365,9 +359,6 @@ static Indirect *ext4_get_branch(struct inode *inode, int 
depth, int *offsets,
bh = sb_bread(sb, le32_to_cpu(p-key));
if (!bh)
goto failure;
-   /* Reader: pointers */
-   if (!verify_chain(chain, p))
-   goto changed;
add_chain(++p, bh, (__le32*)bh-b_data + *++offsets);
/* Reader: end */
if (!p-key)
@@ -375,10 +366,6 @@ static Indirect *ext4_get_branch(struct inode *inode, int 
depth, int *offsets,
}
return NULL;
 
-changed:
-   brelse(bh);
-   *err = -EAGAIN;
-   goto no_block;
 failure:
*err = -EIO;
 no_block:
@@ -782,6 +769,10 @@ err_out:
  * return  0, # of blocks mapped or allocated.
  * return = 0, if plain lookup failed.
  * return  0, error case.
+ *
+ *
+ * Need to be called with
+ * mutex_lock(EXT4_I(inode)-truncate_mutex)
  */
 int ext4_get_blocks_handle(handle_t *handle, struct inode *inode,
sector_t iblock, unsigned long maxblocks,
@@ -819,18 +810,6 @@ int ext4_get_blocks_handle(handle_t *handle, struct inode 
*inode,
while (count  maxblocks  count = blocks_to_boundary) {
ext4_fsblk_t blk;
 
-   if (!verify_chain(chain, partial

[RFC][PATCH 2/3] ext4: Convert truncate_mutex to read write semaphore.

2007-12-03 Thread Aneesh Kumar K.V
We are currently taking the truncate_mutex for every read. This would have
performance impact on large CPU configuration. Convert the lock to read write
semaphore and take read lock when we are trying to read the file.

Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED]
---
 fs/ext4/balloc.c  |2 +-
 fs/ext4/extents.c |8 
 fs/ext4/file.c|4 ++--
 fs/ext4/inode.c   |   36 ++--
 fs/ext4/ioctl.c   |4 ++--
 fs/ext4/super.c   |2 +-
 include/linux/ext4_fs.h   |   22 +++---
 include/linux/ext4_fs_i.h |6 +++---
 8 files changed, 46 insertions(+), 38 deletions(-)

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index 71ee95e..4eca63d 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -463,7 +463,7 @@ static inline int rsv_is_empty(struct ext4_reserve_window 
*rsv)
  * when setting the reservation window size through ioctl before the file
  * is open for write (needs block allocation).
  *
- * Needs truncate_mutex protection prior to call this function.
+ * Needs down_write(i_data_sem) protection prior to call this function.
  */
 void ext4_init_block_alloc_info(struct inode *inode)
 {
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 9dd4746..a2475d4 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -1666,7 +1666,7 @@ int ext4_ext_rm_idx(handle_t *handle, struct inode *inode,
  * This routine returns max. credits that the extent tree can consume.
  * It should be OK for low-performance paths like -writepage()
  * To allow many writing processes to fit into a single transaction,
- * the caller should calculate credits under truncate_mutex and
+ * the caller should calculate credits under i_data_sem and
  * pass the actual path.
  */
 int ext4_ext_calc_credits_for_insert(struct inode *inode,
@@ -2227,7 +2227,7 @@ out:
 
 /*
  * Need to be called with
- * mutex_lock(EXT4_I(inode)-truncate_mutex);
+ * (EXT4_I(inode)-i_data_sem);
  */
 int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,
ext4_fsblk_t iblock,
@@ -2446,7 +2446,7 @@ void ext4_ext_truncate(struct inode * inode, struct page 
*page)
if (page)
ext4_block_truncate_page(handle, page, mapping, inode-i_size);
 
-   mutex_lock(EXT4_I(inode)-truncate_mutex);
+   down_write(EXT4_I(inode)-i_data_sem);
ext4_ext_invalidate_cache(inode);
 
/*
@@ -2482,7 +2482,7 @@ out_stop:
if (inode-i_nlink)
ext4_orphan_del(handle, inode);
 
-   mutex_unlock(EXT4_I(inode)-truncate_mutex);
+   up_write(EXT4_I(inode)-i_data_sem);
ext4_journal_stop(handle);
 }
 
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 1a81cd6..ba4cf05 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -37,9 +37,9 @@ static int ext4_release_file (struct inode * inode, struct 
file * filp)
if ((filp-f_mode  FMODE_WRITE) 
(atomic_read(inode-i_writecount) == 1))
{
-   mutex_lock(EXT4_I(inode)-truncate_mutex);
+   down_write(EXT4_I(inode)-i_data_sem);
ext4_discard_reservation(inode);
-   mutex_unlock(EXT4_I(inode)-truncate_mutex);
+   up_write(EXT4_I(inode)-i_data_sem);
}
if (is_dx(inode)  filp-private_data)
ext4_htree_free_dir_info(filp-private_data);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 9cf37b2..8aa8855 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -341,7 +341,7 @@ static int ext4_block_to_path(struct inode *inode,
  * the whole chain, all way to the data (returns %NULL, *err == 0).
  *
  *  Need to be called with
- *  mutex_lock(EXT4_I(inode)-truncate_mutex)
+ *  (EXT4_I(inode)-i_data_sem)
  */
 static Indirect *ext4_get_branch(struct inode *inode, int depth, int *offsets,
 Indirect chain[4], int *err)
@@ -772,7 +772,7 @@ err_out:
  *
  *
  * Need to be called with
- * mutex_lock(EXT4_I(inode)-truncate_mutex)
+ * (EXT4_I(inode)-i_data_sem)
  */
 int ext4_get_blocks_handle(handle_t *handle, struct inode *inode,
sector_t iblock, unsigned long maxblocks,
@@ -859,7 +859,7 @@ int ext4_get_blocks_handle(handle_t *handle, struct inode 
*inode,
err = ext4_splice_branch(handle, inode, iblock,
partial, indirect_blks, count);
/*
-* i_disksize growing is protected by truncate_mutex.  Don't forget to
+* i_disksize growing is protected by i_data_sem.  Don't forget to
 * protect it if you're about to implement concurrent
 * ext4_get_block() -bzzz
*/
@@ -889,6 +889,30 @@ out:
 
 #define DIO_CREDITS (EXT4_RESERVE_TRANS_BLOCKS + 32)
 
+int ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block,
+   unsigned long max_blocks, struct buffer_head *bh,
+   int create, int extend_disksize

[RFC][PATCH 1/3] ext4: Make ext4_get_blocks_wrap take the truncate_mutex early.

2007-12-03 Thread Aneesh Kumar K.V
When doing a migrate from ext3 to ext4 inode we need to make sure the test
for inode type and walking inode data happens inside  lock. To make this
happen move truncate_mutex early before checking the i_flags.


This actually should enable us to remove the verify_chain().

Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED]
---
 fs/ext4/extents.c   |7 +++--
 fs/ext4/inode.c |   69 +-
 include/linux/ext4_fs.h |   12 ++--
 3 files changed, 21 insertions(+), 67 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 8528774..9dd4746 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -2225,6 +2225,10 @@ out:
return err ? err : allocated;
 }
 
+/*
+ * Need to be called with
+ * mutex_lock(EXT4_I(inode)-truncate_mutex);
+ */
 int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,
ext4_fsblk_t iblock,
unsigned long max_blocks, struct buffer_head *bh_result,
@@ -2240,7 +2244,6 @@ int ext4_ext_get_blocks(handle_t *handle, struct inode 
*inode,
__clear_bit(BH_New, bh_result-b_state);
ext_debug(blocks %d/%lu requested for inode %u\n, (int) iblock,
max_blocks, (unsigned) inode-i_ino);
-   mutex_lock(EXT4_I(inode)-truncate_mutex);
 
/* check in cache */
goal = ext4_ext_in_cache(inode, iblock, newex);
@@ -2414,8 +2417,6 @@ out2:
ext4_ext_drop_refs(path);
kfree(path);
}
-   mutex_unlock(EXT4_I(inode)-truncate_mutex);
-
return err ? err : allocated;
 }
 
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 5489703..9cf37b2 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -243,13 +243,6 @@ static inline void add_chain(Indirect *p, struct 
buffer_head *bh, __le32 *v)
p-bh = bh;
 }
 
-static int verify_chain(Indirect *from, Indirect *to)
-{
-   while (from = to  from-key == *from-p)
-   from++;
-   return (from  to);
-}
-
 /**
  * ext4_block_to_path - parse the block number into array of offsets
  * @inode: inode in question (we are only interested in its superblock)
@@ -344,10 +337,11 @@ static int ext4_block_to_path(struct inode *inode,
  * (pointer to last triple returned, [EMAIL PROTECTED] == 0)
  * or when it gets an IO error reading an indirect block
  * (ditto, [EMAIL PROTECTED] == -EIO)
- * or when it notices that chain had been changed while it was reading
- * (ditto, [EMAIL PROTECTED] == -EAGAIN)
  * or when it reads all @depth-1 indirect blocks successfully and finds
  * the whole chain, all way to the data (returns %NULL, *err == 0).
+ *
+ *  Need to be called with
+ *  mutex_lock(EXT4_I(inode)-truncate_mutex)
  */
 static Indirect *ext4_get_branch(struct inode *inode, int depth, int *offsets,
 Indirect chain[4], int *err)
@@ -365,9 +359,6 @@ static Indirect *ext4_get_branch(struct inode *inode, int 
depth, int *offsets,
bh = sb_bread(sb, le32_to_cpu(p-key));
if (!bh)
goto failure;
-   /* Reader: pointers */
-   if (!verify_chain(chain, p))
-   goto changed;
add_chain(++p, bh, (__le32*)bh-b_data + *++offsets);
/* Reader: end */
if (!p-key)
@@ -375,10 +366,6 @@ static Indirect *ext4_get_branch(struct inode *inode, int 
depth, int *offsets,
}
return NULL;
 
-changed:
-   brelse(bh);
-   *err = -EAGAIN;
-   goto no_block;
 failure:
*err = -EIO;
 no_block:
@@ -782,6 +769,10 @@ err_out:
  * return  0, # of blocks mapped or allocated.
  * return = 0, if plain lookup failed.
  * return  0, error case.
+ *
+ *
+ * Need to be called with
+ * mutex_lock(EXT4_I(inode)-truncate_mutex)
  */
 int ext4_get_blocks_handle(handle_t *handle, struct inode *inode,
sector_t iblock, unsigned long maxblocks,
@@ -819,18 +810,6 @@ int ext4_get_blocks_handle(handle_t *handle, struct inode 
*inode,
while (count  maxblocks  count = blocks_to_boundary) {
ext4_fsblk_t blk;
 
-   if (!verify_chain(chain, partial)) {
-   /*
-* Indirect block might be removed by
-* truncate while we were reading it.
-* Handling of that case: forget what we've
-* got now. Flag the err as EAGAIN, so it
-* will reread.
-*/
-   err = -EAGAIN;
-   count = 0;
-   break;
-   }
blk = le32_to_cpu(*(chain[depth-1].p + count));
 
if (blk == first_block + count

[RFC] truncate_mutex read write semaphore conversion

2007-12-03 Thread Aneesh Kumar K.V
The below patchset is NOT for patch queue. I am posting it here to get
feedback regarding the approach and what test I need to run to make sure we
are not breaking any locking rules.

I have run dbench, ffsb, fsstress, fs_di, fs_inode, fsx_linux . Bonnie
didn't run completely. In the automated setup i had it didn't save any
logs to find out what went wrong. I will be looking into this


-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC][PATCH 3/3] ext4: Take read lock during overwrite case.

2007-12-03 Thread Aneesh Kumar K.V
When we are overwriting a file and not actually allocating new file system
blocks we need to take only the read lock on i_data_sem.

Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED]
---
 fs/ext4/inode.c |   32 
 1 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 8aa8855..4489bfd 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -894,11 +894,31 @@ int ext4_get_blocks_wrap(handle_t *handle, struct inode 
*inode, sector_t block,
int create, int extend_disksize)
 {
int retval;
-   if (create) {
-   down_write((EXT4_I(inode)-i_data_sem));
+   /*
+* Try to see if we can get  the block without requesting
+* for new file system block.
+*/
+   down_read((EXT4_I(inode)-i_data_sem));
+   if (EXT4_I(inode)-i_flags  EXT4_EXTENTS_FL) {
+   retval =  ext4_ext_get_blocks(handle, inode, block, max_blocks,
+   bh, 0, 0);
} else {
-   down_read((EXT4_I(inode)-i_data_sem));
+   retval =  ext4_get_blocks_handle(handle, inode, block, 
max_blocks,
+   bh, 0, 0);
}
+   up_read((EXT4_I(inode)-i_data_sem));
+   if (!create || (retval  0))
+   return retval;
+
+   /*
+* We need to allocate new blocks which will result
+* in i_data update
+*/
+   down_write((EXT4_I(inode)-i_data_sem));
+   /*
+* We need to check for EXT4 here because migrate
+* could have changed the inode type in between
+*/
if (EXT4_I(inode)-i_flags  EXT4_EXTENTS_FL) {
retval =  ext4_ext_get_blocks(handle, inode, block, max_blocks,
bh, create, extend_disksize);
@@ -906,11 +926,7 @@ int ext4_get_blocks_wrap(handle_t *handle, struct inode 
*inode, sector_t block,
retval =  ext4_get_blocks_handle(handle, inode, block, 
max_blocks,
bh, create, extend_disksize);
}
-   if (create) {
-   up_write((EXT4_I(inode)-i_data_sem));
-   } else {
-   up_read((EXT4_I(inode)-i_data_sem));
-   }
+   up_write((EXT4_I(inode)-i_data_sem));
return retval;
 }
 static int ext4_get_block(struct inode *inode, sector_t iblock,
-- 
1.5.3.6.985.g65c6a4-dirty

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] ext4: Check for the correct error return from ext4_ext_get_blocks

2007-12-03 Thread Aneesh Kumar K.V
ext4_ext_get_blocks returns negative values on error. We should
check for  = 0

Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED]
---
 fs/ext4/extents.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index a2475d4..ce57245 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -2558,8 +2558,8 @@ retry:
ret = ext4_ext_get_blocks(handle, inode, block,
  max_blocks, map_bh,
  EXT4_CREATE_UNINITIALIZED_EXT, 0);
-   WARN_ON(!ret);
-   if (!ret) {
+   WARN_ON(ret = 0);
+   if (ret = 0) {
ext4_error(inode-i_sb, ext4_fallocate,
   ext4_ext_get_blocks returned 0! inode#%lu
   , block=%llu, max_blocks=%llu,
-- 
1.5.3.6.985.g65c6a4-dirty

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Understanding mballoc

2007-12-03 Thread Aneesh Kumar K.V
Alex,

This is my attempt at understanding multi block allocator. I have
few questions marked as FIXME below. Can you help answering them.
Most of this data is already in the patch queue as commit message.
I have updated some details regarding preallocation. Once we
understand the details i will update the patch queue commit message.



The allocation request involve request for multiple number of
blocks near to the goal(block) value specified.

During initialization phase of the allocator we decide to use the group
preallocation or inode preallocation depending on the size of the request. If
the request is smaller than  sbi-s_mb_small_req we select the group
preallocation.  This is needed because we would like to have the small files
closer. The value of s_mb_small_req  is 256 blocks.

/* FIXME!!
Does the value of s_mb_small_req depend on the s_mb_prealloc_table ?
If yes, then how do we update the s_mb_small_req. We have a  hook to update
the prealloc table via /proc. But that doesn't update the s_mb_small_req.
*/

/* FIXME!! The code within ext4_mb_group_or_file does  below.
if (ac-ac_o_ex.fe_len = sbi-s_mb_large_req)
return;

if (ac-ac_o_ex.fe_len = sbi-s_mb_small_req)
return;

That doesn't seem to make sense because the if the len is greater than
s_mb_sall_req it will be always greater than s_mb_large_req. What are we
expecting to do here ?
*/


First stage the allocator looks at the inode prealloc list
ext4_inode_info-i_prealloc_list contain list of prealloc
spaces for this particular inode. The inode prealloc space
is represented as
pa_lstart - the logical start block for this prealloc space
pa_pstart - the physical start block for this prealloc space
pa_len- lenght for this prealloc space
pa_free   -  free space available in this prealloc space

The inode preallocation space is used looking at the _logical_
start block. If only the logical file block falls within the
range of prealloc space we will consume the particular prealloc
space. This make sure that that the we have contiguous physical
blocks representing the file blocks

The important thing to be noted in case of inode prealloc space
is that we don't modify the values associated to inode prealloc
space except pa_free.

If we are not able to find blocks in the inode prealloc space and if we have
the group allocation flag set then we look at the locality group prealloc
space. These are per CPU prealloc list repreasented as

ext4_sb_info.s_locality_groups[smp_processor_id()]

/* FIXME!! 
After getting the locality group related to the current CPU we could be
scheduled out and scheduled in on different CPU. So why are we putting the
locality group per cpu ?
*/

The locality group prealloc space is used looking at whether we have
enough free space (pa_free) withing the prealloc space.


If we can't allocate blocks via inode prealloc or/and locality group prealloc
then we look at the buddy cache. The buddy cache is represented by
ext4_sb_info.s_buddy_cache (struct inode) whose file offset gets mapped to the
buddy and bitmap information regarding different groups. The buddy information
is attached to buddy cache inode so that we can access them through the page
cache. The information regarding each group is loaded via ext4_mb_load_buddy.
The information involve block bitmap and buddy information. The information are
stored in the inode as

 {page}
 [ group 0 buddy][ group 0 bitmap] [group 1][ group 1]...


one block each for bitmap and buddy information.
So for each group we take up 2 blocks. A page can
contain blocks_per_page (PAGE_CACHE_SIZE / blocksize)  blocks.
So it can have information regarding groups_per_page which
is blocks_per_page/2

Buddy cachche inode is not stored on disk. The inode get
thrown away at the end when unmounting the disk.

We look for count number of blocks in the buddy cache. If
we were able to locate that many free blocks we return
with additional information regarding rest of the
contiguous physical block available

/* FIXME: 
We need to explain the normalization of the request length.
What are the conditions we are checking the request length
against. Why are group request always requested at 512 blocks ?


Buddy scanning follows different criteria. We need to explain what
a criteria is how they infulence the allocation 
*/

If we allocate more space than we requested for then the remaining
space get added to the locality group prealloc space or
inode prealloc space.


Both the prealloc space are getting populated
as above. So for the first request we will hit the buddy cache
which will result in this prealloc space getting filled. The prealloc
space is then later used for the subsequent request.
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


circular locking dependency detected

2007-11-29 Thread Aneesh Kumar K.V

===
[ INFO: possible circular locking dependency detected ]
2.6.24-rc3 #6
---
bash/2294 is trying to acquire lock:
(journal-j_list_lock){--..}, at: [c01eee2f] 
journal_try_to_free_buffers+0x76/0x10c

but task is already holding lock:
(inode_lock){--..}, at: [c01864b6] drop_pagecache+0x48/0xd8

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

- #1 (inode_lock){--..}:
[c0143929] __lock_acquire+0xa31/0xc1a
[c0143b8c] lock_acquire+0x7a/0x94
[c0447ae5] _spin_lock+0x2e/0x58
[c0185fae] __mark_inode_dirty+0xd8/0x15e
[c018996f] __set_page_dirty+0xfb/0x10a
[c01899fe] mark_buffer_dirty+0x80/0x86
[c01ed1b6] __journal_temp_unlink_buffer+0xc1/0xc5
[c01ed2f3] __journal_unfile_buffer+0xb/0x15
[c01ed338] __journal_refile_buffer+0x3b/0x85
[c01efe1b] journal_commit_transaction+0xe7f/0x10ec
[c01f34b0] kjournald+0x131/0x35f
[c0138ee9] kthread+0x3b/0x62
[c0105c7f] kernel_thread_helper+0x7/0x10
[] 0x

- #0 (journal-j_list_lock){--..}:
[c0143819] __lock_acquire+0x921/0xc1a
[c0143b8c] lock_acquire+0x7a/0x94
[c0447ae5] _spin_lock+0x2e/0x58
[c01eee2f] journal_try_to_free_buffers+0x76/0x10c
[c01cb7a2] ext3_releasepage+0x68/0x74
[c0151ff1] try_to_release_page+0x33/0x44
[c01578b0] __invalidate_mapping_pages+0x74/0xe0
[c01864de] drop_pagecache+0x70/0xd8
[c018657c] drop_caches_sysctl_handler+0x36/0x4e
[c019efaa] proc_sys_write+0x6b/0x85
[c016da2b] vfs_write+0x90/0x119
[c016dfe2] sys_write+0x3d/0x61
[c0104f6e] sysenter_past_esp+0x5f/0xa5
[] 0x

other info that might help us debug this:

2 locks held by bash/2294:
#0:  (type-s_umount_key#16){}, at: [c01864a6] drop_pagecache+0x38/0xd8
#1:  (inode_lock){--..}, at: [c01864b6] drop_pagecache+0x48/0xd8

stack backtrace:
[c0105ffb] show_trace_log_lvl+0x1a/0x2f
[c0106875] show_trace+0x12/0x14
[c0106967] dump_stack+0x16/0x18
[c0141b33] print_circular_bug_tail+0x5f/0x68
[c0143819] __lock_acquire+0x921/0xc1a
[c0143b8c] lock_acquire+0x7a/0x94
[c0447ae5] _spin_lock+0x2e/0x58
[c01eee2f] journal_try_to_free_buffers+0x76/0x10c
[c01cb7a2] ext3_releasepage+0x68/0x74
[c0151ff1] try_to_release_page+0x33/0x44
[c01578b0] __invalidate_mapping_pages+0x74/0xe0
[c01864de] drop_pagecache+0x70/0xd8
[c018657c] drop_caches_sysctl_handler+0x36/0x4e
[c019efaa] proc_sys_write+0x6b/0x85
[c016da2b] vfs_write+0x90/0x119
[c016dfe2] sys_write+0x3d/0x61
[c0104f6e] sysenter_past_esp+0x5f/0xa5
===
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ext4 still broken on arm (at least)

2007-11-27 Thread Aneesh Kumar K.V
On Tue, Nov 27, 2007 at 09:14:44PM -0800, Andrew Morton wrote:
 fs/ext4/mballoc.c: In function `ext4_mb_generate_buddy':
 fs/ext4/mballoc.c:836: error: implicit declaration of function 
 `ext2_find_next_bit'  



I actually sent in a patch which changes asking for review to
linux-arch. I haven't got the response yet. Attaching the patch
below 

Introduce ext4_find_next_bit

From: Aneesh Kumar K.V [EMAIL PROTECTED]

This gets used by the ext4 multi block allocator patches.

Also add generic_find_next_le_bit

Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED]
Cc: linux-ext4@vger.kernel.org
Signed-off-by: Andrew Morton [EMAIL PROTECTED]
---

 include/asm-arm/bitops.h |2 +
 include/asm-generic/bitops/ext2-non-atomic.h |2 +
 include/asm-generic/bitops/le.h  |4 ++
 include/asm-m68k/bitops.h|2 +
 include/asm-m68knommu/bitops.h   |2 +
 include/asm-powerpc/bitops.h |4 ++
 include/asm-s390/bitops.h|2 +
 include/linux/ext4_fs.h  |1 +
 lib/find_next_bit.c  |   43 ++
 9 files changed, 62 insertions(+), 0 deletions(-)


diff --git a/include/asm-arm/bitops.h b/include/asm-arm/bitops.h
index 47a6b08..5c60bfc 100644
--- a/include/asm-arm/bitops.h
+++ b/include/asm-arm/bitops.h
@@ -310,6 +310,8 @@ static inline int constant_fls(int x)
_find_first_zero_bit_le(p,sz)
 #define ext2_find_next_zero_bit(p,sz,off)  \
_find_next_zero_bit_le(p,sz,off)
+#define ext2_find_next_bit(p, sz, off) \
+   _find_next_bit_le(p, sz, off)
 
 /*
  * Minix is defined to use little-endian byte ordering.
diff --git a/include/asm-generic/bitops/ext2-non-atomic.h 
b/include/asm-generic/bitops/ext2-non-atomic.h
index 1697404..63cf822 100644
--- a/include/asm-generic/bitops/ext2-non-atomic.h
+++ b/include/asm-generic/bitops/ext2-non-atomic.h
@@ -14,5 +14,7 @@
generic_find_first_zero_le_bit((unsigned long *)(addr), (size))
 #define ext2_find_next_zero_bit(addr, size, off) \
generic_find_next_zero_le_bit((unsigned long *)(addr), (size), (off))
+#define ext2_find_next_bit(addr, size, off) \
+   generic_find_next_le_bit((unsigned long *)(addr), (size), (off))
 
 #endif /* _ASM_GENERIC_BITOPS_EXT2_NON_ATOMIC_H_ */
diff --git a/include/asm-generic/bitops/le.h b/include/asm-generic/bitops/le.h
index b9c7e5d..80e3bf1 100644
--- a/include/asm-generic/bitops/le.h
+++ b/include/asm-generic/bitops/le.h
@@ -20,6 +20,8 @@
 #define generic___test_and_clear_le_bit(nr, addr) __test_and_clear_bit(nr, 
addr)
 
 #define generic_find_next_zero_le_bit(addr, size, offset) 
find_next_zero_bit(addr, size, offset)
+#define generic_find_next_le_bit(addr, size, offset) \
+   find_next_bit(addr, size, offset)
 
 #elif defined(__BIG_ENDIAN)
 
@@ -42,6 +44,8 @@
 
 extern unsigned long generic_find_next_zero_le_bit(const unsigned long *addr,
unsigned long size, unsigned long offset);
+extern unsigned long generic_find_next_le_bit(const unsigned long *addr,
+   unsigned long size, unsigned long offset);
 
 #else
 #error Please fix asm/byteorder.h
diff --git a/include/asm-m68k/bitops.h b/include/asm-m68k/bitops.h
index 2976b5d..83d1f28 100644
--- a/include/asm-m68k/bitops.h
+++ b/include/asm-m68k/bitops.h
@@ -410,6 +410,8 @@ static inline int ext2_find_next_zero_bit(const void 
*vaddr, unsigned size,
res = ext2_find_first_zero_bit (p, size - 32 * (p - addr));
return (p - addr) * 32 + res;
 }
+#define ext2_find_next_bit(addr, size, off) \
+   generic_find_next_le_bit((unsigned long *)(addr), (size), (off))
 
 #endif /* __KERNEL__ */
 
diff --git a/include/asm-m68knommu/bitops.h b/include/asm-m68knommu/bitops.h
index f8dfb7b..f43afe1 100644
--- a/include/asm-m68knommu/bitops.h
+++ b/include/asm-m68knommu/bitops.h
@@ -294,6 +294,8 @@ found_middle:
return result + ffz(__swab32(tmp));
 }
 
+#define ext2_find_next_bit(addr, size, off) \
+   generic_find_next_le_bit((unsigned long *)(addr), (size), (off))
 #include asm-generic/bitops/minix.h
 
 #endif /* __KERNEL__ */
diff --git a/include/asm-powerpc/bitops.h b/include/asm-powerpc/bitops.h
index 733b4af..220d9a7 100644
--- a/include/asm-powerpc/bitops.h
+++ b/include/asm-powerpc/bitops.h
@@ -359,6 +359,8 @@ static __inline__ int test_le_bit(unsigned long nr,
 unsigned long generic_find_next_zero_le_bit(const unsigned long *addr,
unsigned long size, unsigned long offset);
 
+unsigned long generic_find_next_le_bit(const unsigned long *addr,
+   unsigned long size, unsigned long offset);
 /* Bitmap functions for the ext2 filesystem */
 
 #define ext2_set_bit(nr,addr) \
@@ -378,6 +380,8 @@ unsigned long generic_find_next_zero_le_bit(const unsigned 
long *addr

blk bitmap validation test results

2007-11-22 Thread Aneesh Kumar K.V

Andreas suggested me to get the iozone results after multiple runs.
I don't see any performance issue with the blk bitmap validation changes
now. 




v2.6.24-rc3-35-g2e12044

File  VM system latencies in microseconds - smaller is better
--
Host OS   0K File  10K File  MmapProtPage
   Create Delete Create Delete  Latency Fault   Fault 
- - -- -- -- --  --- -   - 
elm3a150. Linux 2.6.24-   28.8   24.9   90.3   52.213.4K 1.796 3.0

elm3a150. Linux 2.6.24-   29.2   25.1   90.3   52.013.4K 1.829 3.0
elm3a150. Linux 2.6.24-   28.7   24.8   90.2   51.213.3K 1.971 3.0


v2.6.24-rc3-35-g2e12044 + with block bitmap patch 




File  VM system latencies in microseconds - smaller is better
--
Host OS   0K File  10K File  MmapProtPage
   Create Delete Create Delete  Latency Fault   Fault 
- - -- -- -- --  --- -   - 
elm3a150. Linux 2.6.24-   28.5   24.7   89.3   52.113.4K 1.785 3.0

elm3a150. Linux 2.6.24-   28.7   24.9   89.4   52.213.3K 1.792 3.0
elm3a150. Linux 2.6.24-   29.1   24.8   90.1   52.413.3K 1.779 3.0
elm3a150. Linux 2.6.24-   29.2   24.9   90.4   52.013.9K 1.848 3.0


iozones results are attached.

Iozone: Performance Test of File I/O
Version $Revision: 3.226 $
Compiled for 64 bit mode.
Build: linux 

Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins
 Al Slater, Scott Rhine, Mike Wisner, Ken Goss
 Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
 Randy Dunlap, Mark Montague, Dan Million, 
 Jean-Marc Zucconi, Jeff Blomberg,
 Erik Habbinga, Kris Strecker.

Run began: Thu Nov 22 03:03:46 2007

Auto Mode
Command line used: 
/usr/local/autobench/sources/iozone/src/current/iozone -a -f /ext4/test.iozone
Output is in Kbytes/sec
Time Resolution = 0.01 seconds.
Processor cache size set to 1024 Kbytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.
random  random
bkwd  record  stride   
  KB  reclen   write rewritereadrereadread   write
read rewriteread   fwrite frewrite   fread  freread
  64   4  176276  432398  1279447  1484662 1066042  404998  
874936  387461 1163036   190947   471099 1083249  1421755
  64   8  192178  481234  1638743  1879725 1679761  512471 
1119387  429630 1828508   212570   520419 1492919  1768283
  64  16  206519  515423  1690338  2006158 2203800  561808 
1336792  394870 2133730   259156   633392 1492919  1947927
  64  32  209093  537079  1638743  2203800 2662899  621657 
1363961  316997 2537060   353760   998622 1947927  2067979
  64  64  213415  537079  1526887  2379626 2923952  592827 
1330167  219883 2892445   215471   548044 1363961  2278628
 128   4  184440  428193  1319723  1487977 1133103  402510  
934004  435134 1123617   191002   456986 1266785  1391557
 128   8  202837  504196  1504659  1827299 1705404  506097 
1231904  483316 1827299   210884   496272 1539168  1749872
 128  16  205398  554157  1827299  2058509 223  595328 
1802755  499970 2248149   233950   617941 1663139  1997245
 128  32  214079  603357  1827299  2166499 2608629  609522 
1543594  435134 2673584   273382   684954 1732927  2098744
 128  64  222693  568836  1727352  2248149 2985839  640042 
1543594  324039 2969325   364770  1007629 1689305  223
 128 128  224556  546819  1487977  2377579 3199360  624409 
1424795  220679 3199360   223342   592699 1421023  2326073
 256   4  191199  435452  1286217  1422540 1137672  408918  
904732  443730 1153561   193230   413008 1300235  1398455
 256   8  204140  501794  1652404  1802167 1683493  502733 
1384034  541253 1752173   208825   515279 1600674  1766587
 256  16  211208  508929  1649865  1985448 2247235  543720 
1479379  563701 2330140   224172   523570 1817419  1970871
 256  32  221262  555248  1894373  2170027 2665657  632056 
1639786  539079 2533571   247816   609801 1842366  2118645
 256  64  220670  568777  1868008  2247235 2975955  616807 
1729594  453092 3043436   278835   706964 2015259  2223962
 256 128  221627  584570  

[Take 2] ext2/3/4 block bitmap validation patches

2007-11-19 Thread Aneesh Kumar K.V

This is the updated ext2/3/4 block bitmap validation patches

Changes from the last post

a) moved the bh_uptodate_or_lock and bh_submit_read to
  fs/buffer.c and added EXPORT_SYMBOL

b) Updated bh_submit_read not to release buffer on
   failure. This handles one reference handling bug in the
   error patch.


-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] ext2: add block bitmap validation

2007-11-19 Thread Aneesh Kumar K.V
When a new block bitmap is read from disk in read_block_bitmap()
there are a few bits that should ALWAYS be set.  In particular,
the blocks given corresponding to block bitmap, inode bitmap and inode tables.
Validate the block bitmap against these blocks.

Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED]
---
 fs/ext2/balloc.c |   81 --
 1 files changed, 72 insertions(+), 9 deletions(-)

diff --git a/fs/ext2/balloc.c b/fs/ext2/balloc.c
index 377ad17..4b5c511 100644
--- a/fs/ext2/balloc.c
+++ b/fs/ext2/balloc.c
@@ -69,9 +69,53 @@ struct ext2_group_desc * ext2_get_group_desc(struct 
super_block * sb,
return desc + offset;
 }
 
+static int ext2_valid_block_bitmap(struct super_block *sb,
+   struct ext2_group_desc *desc,
+   unsigned int block_group,
+   struct buffer_head *bh)
+{
+   ext2_grpblk_t offset;
+   ext2_grpblk_t next_zero_bit;
+   ext2_fsblk_t bitmap_blk;
+   ext2_fsblk_t group_first_block;
+
+   group_first_block = ext2_group_first_block_no(sb, block_group);
+
+   /* check whether block bitmap block number is set */
+   bitmap_blk = le32_to_cpu(desc-bg_block_bitmap);
+   offset = bitmap_blk - group_first_block;
+   if (!ext2_test_bit(offset, bh-b_data))
+   /* bad block bitmap */
+   goto err_out;
+
+   /* check whether the inode bitmap block number is set */
+   bitmap_blk = le32_to_cpu(desc-bg_inode_bitmap);
+   offset = bitmap_blk - group_first_block;
+   if (!ext2_test_bit(offset, bh-b_data))
+   /* bad block bitmap */
+   goto err_out;
+
+   /* check whether the inode table block number is set */
+   bitmap_blk = le32_to_cpu(desc-bg_inode_table);
+   offset = bitmap_blk - group_first_block;
+   next_zero_bit = ext2_find_next_zero_bit(bh-b_data,
+   offset + EXT2_SB(sb)-s_itb_per_group,
+   offset);
+   if (next_zero_bit = offset + EXT2_SB(sb)-s_itb_per_group)
+   /* good bitmap for inode tables */
+   return 1;
+
+err_out:
+   ext2_error(sb, __FUNCTION__,
+   Invalid block bitmap - 
+   block_group = %d, block = %lu,
+   block_group, bitmap_blk);
+   return 0;
+}
+
 /*
- * Read the bitmap for a given block_group, reading into the specified 
- * slot in the superblock's bitmap cache.
+ * Read the bitmap for a given block_group,and validate the
+ * bits for block/inode/inode tables are set in the bitmaps
  *
  * Return buffer_head on success or NULL in case of failure.
  */
@@ -80,17 +124,36 @@ read_block_bitmap(struct super_block *sb, unsigned int 
block_group)
 {
struct ext2_group_desc * desc;
struct buffer_head * bh = NULL;
-   
-   desc = ext2_get_group_desc (sb, block_group, NULL);
+   ext2_fsblk_t bitmap_blk;
+
+   desc = ext2_get_group_desc(sb, block_group, NULL);
if (!desc)
-   goto error_out;
-   bh = sb_bread(sb, le32_to_cpu(desc-bg_block_bitmap));
-   if (!bh)
-   ext2_error (sb, read_block_bitmap,
+   return NULL;
+   bitmap_blk = le32_to_cpu(desc-bg_block_bitmap);
+   bh = sb_getblk(sb, bitmap_blk);
+   if (unlikely(!bh)) {
+   ext2_error(sb, __FUNCTION__,
Cannot read block bitmap - 
block_group = %d, block_bitmap = %u,
block_group, le32_to_cpu(desc-bg_block_bitmap));
-error_out:
+   return NULL;
+   }
+   if (likely(bh_uptodate_or_lock(bh)))
+   return bh;
+
+   if (bh_submit_read(bh)  0) {
+   brelse(bh);
+   ext2_error(sb, __FUNCTION__,
+   Cannot read block bitmap - 
+   block_group = %d, block_bitmap = %u,
+   block_group, le32_to_cpu(desc-bg_block_bitmap));
+   return NULL;
+   }
+   if (!ext2_valid_block_bitmap(sb, desc, block_group, bh)) {
+   brelse(bh);
+   return NULL;
+   }
+
return bh;
 }
 
-- 
1.5.3.5.652.gf192c-dirty

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] ext3: add block bitmap validation

2007-11-19 Thread Aneesh Kumar K.V
When a new block bitmap is read from disk in read_block_bitmap()
there are a few bits that should ALWAYS be set.  In particular,
the blocks given corresponding to block bitmap, inode bitmap and inode tables.
Validate the block bitmap against these blocks.

Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED]
---
 fs/ext3/balloc.c |   78 -
 1 files changed, 70 insertions(+), 8 deletions(-)

diff --git a/fs/ext3/balloc.c b/fs/ext3/balloc.c
index a8ba7e8..a26e683 100644
--- a/fs/ext3/balloc.c
+++ b/fs/ext3/balloc.c
@@ -80,13 +80,57 @@ struct ext3_group_desc * ext3_get_group_desc(struct 
super_block * sb,
return desc + offset;
 }
 
+static int ext3_valid_block_bitmap(struct super_block *sb,
+   struct ext3_group_desc *desc,
+   unsigned int block_group,
+   struct buffer_head *bh)
+{
+   ext3_grpblk_t offset;
+   ext3_grpblk_t next_zero_bit;
+   ext3_fsblk_t bitmap_blk;
+   ext3_fsblk_t group_first_block;
+
+   group_first_block = ext3_group_first_block_no(sb, block_group);
+
+   /* check whether block bitmap block number is set */
+   bitmap_blk = le32_to_cpu(desc-bg_block_bitmap);
+   offset = bitmap_blk - group_first_block;
+   if (!ext3_test_bit(offset, bh-b_data))
+   /* bad block bitmap */
+   goto err_out;
+
+   /* check whether the inode bitmap block number is set */
+   bitmap_blk = le32_to_cpu(desc-bg_inode_bitmap);
+   offset = bitmap_blk - group_first_block;
+   if (!ext3_test_bit(offset, bh-b_data))
+   /* bad block bitmap */
+   goto err_out;
+
+   /* check whether the inode table block number is set */
+   bitmap_blk = le32_to_cpu(desc-bg_inode_table);
+   offset = bitmap_blk - group_first_block;
+   next_zero_bit = ext3_find_next_zero_bit(bh-b_data,
+   offset + EXT3_SB(sb)-s_itb_per_group,
+   offset);
+   if (next_zero_bit = offset + EXT3_SB(sb)-s_itb_per_group)
+   /* good bitmap for inode tables */
+   return 1;
+
+err_out:
+   ext3_error(sb, __FUNCTION__,
+   Invalid block bitmap - 
+   block_group = %d, block = %lu,
+   block_group, bitmap_blk);
+   return 0;
+}
+
 /**
  * read_block_bitmap()
  * @sb:super block
  * @block_group:   given block group
  *
- * Read the bitmap for a given block_group, reading into the specified
- * slot in the superblock's bitmap cache.
+ * Read the bitmap for a given block_group,and validate the
+ * bits for block/inode/inode tables are set in the bitmaps
  *
  * Return buffer_head on success or NULL in case of failure.
  */
@@ -95,17 +139,35 @@ read_block_bitmap(struct super_block *sb, unsigned int 
block_group)
 {
struct ext3_group_desc * desc;
struct buffer_head * bh = NULL;
+   ext3_fsblk_t bitmap_blk;
 
-   desc = ext3_get_group_desc (sb, block_group, NULL);
+   desc = ext3_get_group_desc(sb, block_group, NULL);
if (!desc)
-   goto error_out;
-   bh = sb_bread(sb, le32_to_cpu(desc-bg_block_bitmap));
-   if (!bh)
-   ext3_error (sb, read_block_bitmap,
+   return NULL;
+   bitmap_blk = le32_to_cpu(desc-bg_block_bitmap);
+   bh = sb_getblk(sb, bitmap_blk);
+   if (unlikely(!bh)) {
+   ext3_error(sb, __FUNCTION__,
Cannot read block bitmap - 
block_group = %d, block_bitmap = %u,
block_group, le32_to_cpu(desc-bg_block_bitmap));
-error_out:
+   return NULL;
+   }
+   if (likely(bh_uptodate_or_lock(bh)))
+   return bh;
+
+   if (bh_submit_read(bh)  0) {
+   brelse(bh);
+   ext3_error(sb, __FUNCTION__,
+   Cannot read block bitmap - 
+   block_group = %d, block_bitmap = %u,
+   block_group, le32_to_cpu(desc-bg_block_bitmap));
+   return NULL;
+   }
+   if (!ext3_valid_block_bitmap(sb, desc, block_group, bh)) {
+   brelse(bh);
+   return NULL;
+   }
return bh;
 }
 /*
-- 
1.5.3.5.652.gf192c-dirty

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] ext4: add block bitmap validation

2007-11-19 Thread Aneesh Kumar K.V
When a new block bitmap is read from disk in read_block_bitmap()
there are a few bits that should ALWAYS be set.  In particular,
the blocks given corresponding to block bitmap, inode bitmap and inode tables.
Validate the block bitmap against these blocks.

Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED]
---
 fs/ext4/balloc.c |   97 -
 1 files changed, 80 insertions(+), 17 deletions(-)

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index ff3428e..3b9b07b 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -189,13 +189,65 @@ struct ext4_group_desc * ext4_get_group_desc(struct 
super_block * sb,
return desc;
 }
 
+static int ext4_valid_block_bitmap(struct super_block *sb,
+   struct ext4_group_desc *desc,
+   unsigned int block_group,
+   struct buffer_head *bh)
+{
+   ext4_grpblk_t offset;
+   ext4_grpblk_t next_zero_bit;
+   ext4_fsblk_t bitmap_blk;
+   ext4_fsblk_t group_first_block;
+
+   if (EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_FLEX_BG)) {
+   /* with FLEX_BG, the inode/block bitmaps and itable
+* blocks may not be in the group at all
+* so the bitmap validation will be skipped for those groups
+* or it has to also read the block group where the bitmaps
+* are located to verify they are set.
+*/
+   return 1;
+   }
+   group_first_block = ext4_group_first_block_no(sb, block_group);
+
+   /* check whether block bitmap block number is set */
+   bitmap_blk = ext4_block_bitmap(sb, desc);
+   offset = bitmap_blk - group_first_block;
+   if (!ext4_test_bit(offset, bh-b_data))
+   /* bad block bitmap */
+   goto err_out;
+
+   /* check whether the inode bitmap block number is set */
+   bitmap_blk = ext4_inode_bitmap(sb, desc);
+   offset = bitmap_blk - group_first_block;
+   if (!ext4_test_bit(offset, bh-b_data))
+   /* bad block bitmap */
+   goto err_out;
+
+   /* check whether the inode table block number is set */
+   bitmap_blk = ext4_inode_table(sb, desc);
+   offset = bitmap_blk - group_first_block;
+   next_zero_bit = ext4_find_next_zero_bit(bh-b_data,
+   offset + EXT4_SB(sb)-s_itb_per_group,
+   offset);
+   if (next_zero_bit = offset + EXT4_SB(sb)-s_itb_per_group)
+   /* good bitmap for inode tables */
+   return 1;
+
+err_out:
+   ext4_error(sb, __FUNCTION__,
+   Invalid block bitmap - 
+   block_group = %d, block = %llu,
+   block_group, bitmap_blk);
+   return 0;
+}
 /**
  * read_block_bitmap()
  * @sb:super block
  * @block_group:   given block group
  *
- * Read the bitmap for a given block_group, reading into the specified
- * slot in the superblock's bitmap cache.
+ * Read the bitmap for a given block_group,and validate the
+ * bits for block/inode/inode tables are set in the bitmaps
  *
  * Return buffer_head on success or NULL in case of failure.
  */
@@ -210,25 +262,36 @@ read_block_bitmap(struct super_block *sb, ext4_group_t 
block_group)
if (!desc)
return NULL;
bitmap_blk = ext4_block_bitmap(sb, desc);
+   bh = sb_getblk(sb, bitmap_blk);
+   if (unlikely(!bh)) {
+   ext4_error(sb, __FUNCTION__,
+   Cannot read block bitmap - 
+   block_group = %d, block_bitmap = %llu,
+   block_group, bitmap_blk);
+   return NULL;
+   }
+   if (bh_uptodate_or_lock(bh))
+   return bh;
+
if (desc-bg_flags  cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
-   bh = sb_getblk(sb, bitmap_blk);
-   if (!buffer_uptodate(bh)) {
-   lock_buffer(bh);
-   if (!buffer_uptodate(bh)) {
-   ext4_init_block_bitmap(sb, bh, block_group,
-  desc);
-   set_buffer_uptodate(bh);
-   }
-   unlock_buffer(bh);
-   }
-   } else {
-   bh = sb_bread(sb, bitmap_blk);
+   ext4_init_block_bitmap(sb, bh, block_group, desc);
+   set_buffer_uptodate(bh);
+   unlock_buffer(bh);
+   return bh;
}
-   if (!bh)
-   ext4_error (sb, __FUNCTION__,
+   if (bh_submit_read(bh)  0) {
+   brelse(bh);
+   ext4_error(sb, __FUNCTION__,
Cannot read block bitmap - 
-   block_group = %lu, block_bitmap = %llu

[PATCH] Add buffer head related helper functions

2007-11-19 Thread Aneesh Kumar K.V
Add buffer head related helper function
bh_uptodate_or_lock and bh_submit_read
which can be used by file system

Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED]
---
 fs/buffer.c |   41 +
 include/linux/buffer_head.h |2 ++
 2 files changed, 43 insertions(+), 0 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 7249e01..7593ff3 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -3213,6 +3213,47 @@ static int buffer_cpu_notify(struct notifier_block *self,
return NOTIFY_OK;
 }
 
+/**
+ * bh_uptodate_or_lock: Test whether the buffer is uptodate
+ * @bh: struct buffer_head
+ *
+ * Return true if the buffer is up-to-date and false,
+ * with the buffer locked, if not.
+ */
+int bh_uptodate_or_lock(struct buffer_head *bh)
+{
+   if (!buffer_uptodate(bh)) {
+   lock_buffer(bh);
+   if (!buffer_uptodate(bh))
+   return 0;
+   unlock_buffer(bh);
+   }
+   return 1;
+}
+EXPORT_SYMBOL(bh_uptodate_or_lock);
+/**
+ * bh_submit_read: Submit a locked buffer for reading
+ * @bh: struct buffer_head
+ *
+ * Returns a negative error
+ */
+int bh_submit_read(struct buffer_head *bh)
+{
+   if (!buffer_locked(bh))
+   lock_buffer(bh);
+
+   if (buffer_uptodate(bh))
+   return 0;
+
+   get_bh(bh);
+   bh-b_end_io = end_buffer_read_sync;
+   submit_bh(READ, bh);
+   wait_on_buffer(bh);
+   if (buffer_uptodate(bh))
+   return 0;
+   return -EIO;
+}
+EXPORT_SYMBOL(bh_submit_read);
 void __init buffer_init(void)
 {
int nrpages;
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index da0d83f..e98801f 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -192,6 +192,8 @@ int sync_dirty_buffer(struct buffer_head *bh);
 int submit_bh(int, struct buffer_head *);
 void write_boundary_block(struct block_device *bdev,
sector_t bblock, unsigned blocksize);
+int bh_uptodate_or_lock(struct buffer_head *bh);
+int bh_submit_read(struct buffer_head *bh);
 
 extern int buffer_heads_over_limit;
 
-- 
1.5.3.5.652.gf192c-dirty

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Take 2] ext2/3/4 block bitmap validation patches

2007-11-19 Thread Aneesh Kumar K.V



Aneesh Kumar K.V wrote:

This is the updated ext2/3/4 block bitmap validation patches

Changes from the last post

a) moved the bh_uptodate_or_lock and bh_submit_read to
  fs/buffer.c and added EXPORT_SYMBOL

b) Updated bh_submit_read not to release buffer on
   failure. This handles one reference handling bug in the
   error patch.




I forgot to pass --numbered to git format-patch
The order is 



[PATCH 1/4] Add buffer head related helper functions
[PATCH 2/4] ext2: add block bitmap validation
[PATCH 3/4] ext3: add block bitmap validation
[PATCH 4/4] ext4: add block bitmap validation


it is against patch queue and applies as the last set of
patches in the stable series



buffer_head_helper.patch
ext2_bitmap_validation.patch
ext3_bitmap_validation.patch
ext4_bitmap_validation.patch
stable-boundary


-aneesh
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/4] ext3: add block bitmap validation

2007-11-15 Thread Aneesh Kumar K.V
When a new block bitmap is read from disk in read_block_bitmap()
there are a few bits that should ALWAYS be set.  In particular,
the blocks given corresponding to block bitmap, inode bitmap and inode tables.
Validate the block bitmap against these blocks.

Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED]
---
 fs/ext3/balloc.c |   77 -
 1 files changed, 69 insertions(+), 8 deletions(-)

diff --git a/fs/ext3/balloc.c b/fs/ext3/balloc.c
index a8ba7e8..8f732b8 100644
--- a/fs/ext3/balloc.c
+++ b/fs/ext3/balloc.c
@@ -80,13 +80,52 @@ struct ext3_group_desc * ext3_get_group_desc(struct 
super_block * sb,
return desc + offset;
 }
 
+static int ext3_valid_block_bitmap(struct super_block *sb,
+   struct ext3_group_desc *desc,
+   unsigned int block_group,
+   struct buffer_head *bh)
+{
+   ext3_grpblk_t offset;
+   ext3_grpblk_t next_zero_bit;
+   ext3_fsblk_t bitmap_blk;
+   ext3_fsblk_t group_first_block;
+
+   group_first_block = ext3_group_first_block_no(sb, block_group);
+
+   /* check whether block bitmap block number is set */
+   bitmap_blk = le32_to_cpu(desc-bg_block_bitmap);
+   offset = bitmap_blk - group_first_block;
+   if (!ext3_test_bit(offset, bh-b_data)) {
+   /* bad block bitmap */
+   return 0;
+   }
+   /* check whether the inode bitmap block number is set */
+   bitmap_blk = le32_to_cpu(desc-bg_inode_bitmap);
+   offset = bitmap_blk - group_first_block;
+   if (!ext3_test_bit(offset, bh-b_data)) {
+   /* bad block bitmap */
+   return 0;
+   }
+   /* check whether the inode table block number is set */
+   bitmap_blk = le32_to_cpu(desc-bg_inode_table);
+   offset = bitmap_blk - group_first_block;
+   next_zero_bit = ext3_find_next_zero_bit(bh-b_data,
+   EXT3_SB(sb)-s_itb_per_group + 1,
+   offset);
+   if (next_zero_bit = offset + EXT3_SB(sb)-s_itb_per_group)
+   /* good bitmap for inode tables */
+   return 1;
+
+   return 0;
+}
+
 /**
  * read_block_bitmap()
  * @sb:super block
  * @block_group:   given block group
  *
- * Read the bitmap for a given block_group, reading into the specified
- * slot in the superblock's bitmap cache.
+ * Read the bitmap for a given block_group,and validate the
+ * bits for block/inode/inode tables are set in the bitmaps
  *
  * Return buffer_head on success or NULL in case of failure.
  */
@@ -95,17 +134,39 @@ read_block_bitmap(struct super_block *sb, unsigned int 
block_group)
 {
struct ext3_group_desc * desc;
struct buffer_head * bh = NULL;
+   ext3_fsblk_t bitmap_blk;
 
-   desc = ext3_get_group_desc (sb, block_group, NULL);
+   desc = ext3_get_group_desc(sb, block_group, NULL);
if (!desc)
-   goto error_out;
-   bh = sb_bread(sb, le32_to_cpu(desc-bg_block_bitmap));
-   if (!bh)
-   ext3_error (sb, read_block_bitmap,
+   return NULL;
+   bitmap_blk = le32_to_cpu(desc-bg_block_bitmap);
+   bh = sb_getblk(sb, bitmap_blk);
+   if (unlikely(!bh)) {
+   ext3_error(sb, __FUNCTION__,
+   Cannot read block bitmap - 
+   block_group = %d, block_bitmap = %u,
+   block_group, le32_to_cpu(desc-bg_block_bitmap));
+   return NULL;
+   }
+   if (likely(bh_uptodate_or_lock(bh)))
+   return bh;
+
+   if (bh_submit_read(bh)  0) {
+   brelse(bh);
+   ext3_error(sb, __FUNCTION__,
Cannot read block bitmap - 
block_group = %d, block_bitmap = %u,
block_group, le32_to_cpu(desc-bg_block_bitmap));
-error_out:
+   return NULL;
+   }
+   if (!ext3_valid_block_bitmap(sb, desc, block_group, bh)) {
+   brelse(bh);
+   ext3_error(sb, __FUNCTION__,
+   Invalid block bitmap - 
+   block_group = %d, block = %lu,
+   block_group, bitmap_blk);
+   return NULL;
+   }
return bh;
 }
 /*
-- 
1.5.3.5.652.gf192c-dirty

-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 4/4] ext4: add block bitmap validation

2007-11-15 Thread Aneesh Kumar K.V
When a new block bitmap is read from disk in read_block_bitmap()
there are a few bits that should ALWAYS be set.  In particular,
the blocks given corresponding to block bitmap, inode bitmap and inode tables.
Validate the block bitmap against these blocks.

Signed-off-by: Aneesh Kumar K.V [EMAIL PROTECTED]
---
 fs/ext4/balloc.c |   96 +-
 1 files changed, 80 insertions(+), 16 deletions(-)

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index 71ee95e..2657a1e 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -189,13 +189,62 @@ struct ext4_group_desc * ext4_get_group_desc(struct 
super_block * sb,
return desc;
 }
 
+static int ext4_valid_block_bitmap(struct super_block *sb,
+   struct ext4_group_desc *desc,
+   unsigned int block_group,
+   struct buffer_head *bh)
+{
+   ext4_grpblk_t offset;
+   ext4_grpblk_t next_zero_bit;
+   ext4_fsblk_t bitmap_blk;
+   ext4_fsblk_t group_first_block;
+
+   if (EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_FLEX_BG)) {
+   /* with FLEX_BG, the inode/block bitmaps and itable
+* blocks may not be in the group at all
+* so the bitmap validation will be skipped for those groups
+* or it has to also read the block group where the bitmaps
+* are located to verify they are set.
+*/
+   return 1;
+   }
+   group_first_block = ext4_group_first_block_no(sb, block_group);
+
+   /* check whether block bitmap block number is set */
+   bitmap_blk = ext4_block_bitmap(sb, desc);
+   offset = bitmap_blk - group_first_block;
+   if (!ext4_test_bit(offset, bh-b_data)) {
+   /* bad block bitmap */
+   return 0;
+   }
+
+   /* check whether the inode bitmap block number is set */
+   bitmap_blk = ext4_inode_bitmap(sb, desc);
+   offset = bitmap_blk - group_first_block;
+   if (!ext4_test_bit(offset, bh-b_data)) {
+   /* bad block bitmap */
+   return 0;
+   }
+
+   /* check whether the inode table block number is set */
+   bitmap_blk = ext4_inode_table(sb, desc);
+   offset = bitmap_blk - group_first_block;
+   next_zero_bit = ext4_find_next_zero_bit(bh-b_data,
+   EXT4_SB(sb)-s_itb_per_group + 1,
+   offset);
+   if (next_zero_bit = offset + EXT4_SB(sb)-s_itb_per_group)
+   /* good bitmap for inode tables */
+   return 1;
+
+   return 0;
+}
 /**
  * read_block_bitmap()
  * @sb:super block
  * @block_group:   given block group
  *
- * Read the bitmap for a given block_group, reading into the specified
- * slot in the superblock's bitmap cache.
+ * Read the bitmap for a given block_group,and validate the
+ * bits for block/inode/inode tables are set in the bitmaps
  *
  * Return buffer_head on success or NULL in case of failure.
  */
@@ -210,25 +259,40 @@ read_block_bitmap(struct super_block *sb, unsigned int 
block_group)
if (!desc)
return NULL;
bitmap_blk = ext4_block_bitmap(sb, desc);
+   bh = sb_getblk(sb, bitmap_blk);
+   if (unlikely(!bh)) {
+   ext4_error(sb, __FUNCTION__,
+   Cannot read block bitmap - 
+   block_group = %d, block_bitmap = %llu,
+   block_group, bitmap_blk);
+   return NULL;
+   }
+   if (bh_uptodate_or_lock(bh))
+   return bh;
+
if (desc-bg_flags  cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
-   bh = sb_getblk(sb, bitmap_blk);
-   if (!buffer_uptodate(bh)) {
-   lock_buffer(bh);
-   if (!buffer_uptodate(bh)) {
-   ext4_init_block_bitmap(sb, bh, block_group,
-  desc);
-   set_buffer_uptodate(bh);
-   }
-   unlock_buffer(bh);
-   }
-   } else {
-   bh = sb_bread(sb, bitmap_blk);
+   ext4_init_block_bitmap(sb, bh, block_group, desc);
+   set_buffer_uptodate(bh);
+   unlock_buffer(bh);
+   return bh;
}
-   if (!bh)
-   ext4_error (sb, __FUNCTION__,
+   if (bh_submit_read(bh)  0) {
+   brelse(bh);
+   ext4_error(sb, __FUNCTION__,
Cannot read block bitmap - 
block_group = %d, block_bitmap = %llu,
block_group, bitmap_blk);
+   return NULL;
+   }
+   if (!ext4_valid_block_bitmap(sb, desc, block_group, bh)) {
+   brelse(bh

delalloc and reservation.

2007-10-29 Thread Aneesh Kumar K.V

Hi All,

I looked at the delalloc and reservation differences that Valerie was observing.
Below is my understanding. I am not sure whether the below will result in 
higher fragmentation that Eric Sandeen is observing. I guess it should not. Even

though the reservation gets discarded during the clear inode due to memory 
pressure
the request for new reservation should get the blocks nearby and not break 
extents right ?


any how below is the simple case.

without delalloc the blocks are requested during prepare_write/write_begin.
That means we enter ext4_new_blocks_old which will call 
ext4_try_to_allocate_with_rsv.
Now if there is no reservation for this inode a new one will be allocated.  
After
using the blocks this reservation is destroyed during the close via 
ext4_release_file

With delalloc the blocks are not requested until we hit 
writeback/ext4_da_writepages
That means if we create new file and close them the reservation will be 
discarded
during close via ext4_release_file.( Actually there will be nothing to clear)
Now when we do a sync/or write back. We try to get the block, the inode will
request for new reservation. This reservation is not discarded untill we call 
clear_inode
and that results in the behavior we are seeing. 


Free blocks: 1440-8191, 8194-8199, 8202-8207, 8210-8215, 8218-8223, 8226-8231, 
8234-8239, 8242-8247, 8250-8255, 8258-8263, 8266-8271, 8274-8279, 8282-8287, 
8290-8295, 8298-8303, 8306-8311, 8314-8319, 8322-8327, 8330-8335, 8338-8343, 
8346-12799

So now the question is where do we discard the reservation in case of delalloc.

-aneesh
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: delalloc and reservation

2007-10-29 Thread Aneesh Kumar K.V

I guess the list dropped this mail. Sending again.

-aneesh
---BeginMessage---



Aneesh Kumar K.V wrote:

Hi All,

I looked at the delalloc and reservation differences that Valerie was 
observing.
Below is my understanding. I am not sure whether the below will result 
in higher fragmentation that Eric Sandeen is observing. I guess it 
should not. Even
though the reservation gets discarded during the clear inode due to 
memory pressure
the request for new reservation should get the blocks nearby and not 
break extents right ?



any how below is the simple case.

without delalloc the blocks are requested during prepare_write/write_begin.
That means we enter ext4_new_blocks_old which will call 
ext4_try_to_allocate_with_rsv.
Now if there is no reservation for this inode a new one will be 
allocated.  After
using the blocks this reservation is destroyed during the close via 
ext4_release_file


With delalloc the blocks are not requested until we hit 
writeback/ext4_da_writepages
That means if we create new file and close them the reservation will be 
discarded
during close via ext4_release_file.( Actually there will be nothing to 
clear)
Now when we do a sync/or write back. We try to get the block, the inode 
will
request for new reservation. This reservation is not discarded untill we 
call clear_inode

and that results in the behavior we are seeing.
Free blocks: 1440-8191, 8194-8199, 8202-8207, 8210-8215, 8218-8223, 
8226-8231, 8234-8239, 8242-8247, 8250-8255, 8258-8263, 8266-8271, 
8274-8279, 8282-8287, 8290-8295, 8298-8303, 8306-8311, 8314-8319, 
8322-8327, 8330-8335, 8338-8343, 8346-12799


So now the question is where do we discard the reservation in case of 
delalloc.


-


with respect to mballoc we are not seeing this because we are doing
allocation from group prealloc list which is per cpu. 


For most the case we have EXT4_MB_HINT_GROUP_ALLOC set in mballoc.

In ext4_mb_group_or_file i already have a FIXME!! regarding this.

currently we have

   /* request is so large that we don't care about
* streaming - it overweights any possible seek */
   if (ac-ac_o_ex.fe_len = sbi-s_mb_large_req)
   return;

   /* FIXME!!
* is this  =  considering the above ?
*/
   if (ac-ac_o_ex.fe_len = sbi-s_mb_small_req)
   return;

   .
   ..

  /* we're going to use group allocation */
   ac-ac_flags |= EXT4_MB_HINT_GROUP_ALLOC;
   
  

  .

So for small size we have the EXT4_MB_HINT_GROUP_ALLOC set . Now if
i change the the line below FIXME!! to = , that will force
small size to use inode prealloc and that cause

Free blocks: 1442-1443, 1446-1447, 1450-1451, 1454-1455, 1458-1459, 1462-1463, 
1466-1467, 1470-1471, 1474-1475, 1478-1479, 1482-1483, 1486-1487, 1490-1491, 
1494-1495, 1498-1499, 1502-1503, 1506-1507, 1510-1511, 1514-1515, 1518-12799


So the problem is generic.


-aneesh



---End Message---


  1   2   3   >