date:20070710

random ext3 image bugs

2007-07-10 Thread Michal Piotrowski

Hi all,

My insane file system test triggered a few bugs in ext3 image. Anyone 
interested?

[40859.103232] SELinux: initialized (dev loop7, type ext3), uses xattr
[40859.119575] EXT3-fs error (device loop7): ext3_xattr_block_get: inode 757: 
bad block 4829
[40859.129837] inode_doinit_with_dentry:  getxattr returned 5 for dev=loop7 
ino=757
[40859.139840] EXT3-fs error (device loop7): ext3_xattr_block_get: inode 8381: 
bad block 4829
[40859.149436] inode_doinit_with_dentry:  getxattr returned 5 for dev=loop7 
ino=8381
[40859.164064] EXT3-fs error (device loop7): ext3_xattr_block_get: inode 4314: 
bad block 4829
[40859.172919] inode_doinit_with_dentry:  getxattr returned 5 for dev=loop7 
ino=4314
[40859.183789] EXT3-fs error (device loop7): ext3_xattr_block_get: inode 760: 
bad block 4829
[40859.192601] inode_doinit_with_dentry:  getxattr returned 5 for dev=loop7 
ino=760
[40859.209274] EXT3-fs error (device loop7): ext3_xattr_block_get: inode 6390: 
bad block 4829
[40859.218258] inode_doinit_with_dentry:  getxattr returned 5 for dev=loop7 
ino=6390
[40859.256550] EXT3-fs error (device loop7): ext3_xattr_block_get: inode 14610: 
bad block 4829
[40859.266803] inode_doinit_with_dentry:  getxattr returned 5 for dev=loop7 
ino=14610
[40859.284814] EXT3-fs error (device loop7): ext3_xattr_block_get: inode 764: 
bad block 4829
[40859.293644] inode_doinit_with_dentry:  getxattr returned 5 for dev=loop7 
ino=764
[..]

[40861.091690] EXT3-fs error (device loop2): ext3_readdir: bad entry in 
directory #11: rec_len % 4 != 0 - offset=0, inode=3925999616, rec_len=1023, 
name_len=0
[40862.667815] EXT3-fs error (device loop2): ext3_readdir: bad entry in 
directory #11: rec_len % 4 != 0 - offset=0, inode=3925999616, rec_len=1023, 
name_len=0

[..]

25999616, rec_len=1023, name_len=0
[40870.459390] EXT3-fs error (device loop2): htree_dirblock_to_tree: bad entry 
in directory #6301: rec_len is smaller than minimal - offset=0, inode=0, 
rec_len=0, name_len=0
[40870.478123] EXT3-fs error (device loop2): htree_dirblock_to_tree: bad entry 
in directory #4168: rec_len is smaller than minimal - offset=0, inode=0, 
rec_len=0, name_len=0
[40870.514409] EXT3-fs error (device loop2): htree_dirblock_to_tree: bad entry 
in directory #10426: rec_len is smaller than minimal - offset=0, inode=0, 
rec_len=0, name_len=0
[40870.557496] EXT3-fs error (device loop2): htree_dirblock_to_tree: bad entry 
in directory #12477: rec_len is smaller than minimal - offset=0, inode=0, 
rec_len=0, name_len=0
[40870.575138] EXT3-fs error (device loop2): htree_dirblock_to_tree: bad entry 
in directory #12471: rec_len is smaller than minimal - offset=0, inode=0, 
rec_len=0, name_len=0
[40870.607312] EXT3-fs error (device loop2): htree_dirblock_to_tree: bad entry 
in directory #6301: rec_len is smaller than minimal - offset=0, inode=0, 
rec_len=0, name_len=0
[40870.624644] EXT3-fs error (device loop2): htree_dirblock_to_tree: bad entry 
in directory #4168: rec_len is smaller than minimal - offset=0, inode=0, 
rec_len=0, name_len=0
[40870.664405] EXT3-fs error (device loop2): htree_dirblock_to_tree: bad entry 
in directory #6301: rec_len is smaller than minimal - offset=0, inode=0, 
rec_len=0, name_len=0
[40870.681864] EXT3-fs error (device loop2): htree_dirblock_to_tree: bad entry 
in directory #4168: rec_len is smaller than minimal - offset=0, inode=0, 
rec_len=0, name_len=0
[40870.707222] EXT3-fs error (device loop2): htree_dirblock_to_tree: bad entry 
in directory #10431: rec_len is smaller than minimal - offset=0, inode=0, 
rec_len=0, name_len=0
[40870.736133] EXT3-fs error (device loop2): htree_dirblock_to_tree: bad entry 
in directory #6225: rec_len is smaller than minimal - offset=0, inode=0, 
rec_len=0, name_len=0
[40870.754564] EXT3-fs error (device loop2): htree_dirblock_to_tree: bad entry 
in directory #12466: rec_len is smaller than minimal - offset=0, inode=0, 
rec_len=0, name_len=0
[40870.775175] EXT3-fs error (device loop2): htree_dirblock_to_tree: bad entry 
in directory #6266: rec_len is smaller than minimal - offset=0, inode=0, 
rec_len=0, name_len=0
[40870.800203] EXT3-fs error (device loop2): htree_dirblock_to_tree: bad entry 
in directory #6301: rec_len is smaller than minimal - offset=0, inode=0, 
rec_len=0, name_len=0
[40870.818098] EXT3-fs error (device loop2): htree_dirblock_to_tree: bad entry 
in directory #4168: rec_len is smaller than minimal - offset=0, inode=0, 
rec_len=0, name_len=0
[40870.838409] EXT3-fs error (device loop2): htree_dirblock_to_tree: bad entry 
in directory #14524: rec_len is smaller than minimal - offset=0, inode=0, 
rec_len=0, name_len=0
[40870.860820] EXT3-fs error (device loop2): htree_dirblock_to_tree: bad entry 
in directory #6301: rec_len is smaller than minimal - offset=0, inode=0, 
rec_len=0, name_len=0
[40870.879037] EXT3-fs error (device loop2): htree_dirblock_to_tree: bad entry 
in directory #4168: rec_len is smaller than minimal - offset=0, inode=0, 
rec_len=0, name_len=0
[40870.899012] EXT3-fs error

Random corruption test for e2fsck

2007-07-10 Thread Kalpak Shah

Hi,

This is a random corruption test which can be included in the e2fsprogs
regression tests. It does the following:
1) Create an test fs and format it with ext2/3/4 and random selection of
features.
2) Mount it and copy data into it.
3) Move around the blocks of the filesystem randomly causing corruption.
Also overwrite some random blocks with garbage from /dev/urandom. Create
a copy of this corrupted filesystem.
4) Unmount and run e2fsck. If the first run of e2fsck produces any
errors like uncorrected errors, library error, segfault, usage error,
etc. then it is deemed a bug. But in any case, a second run of e2fsck is
done to check if it renders the filesystem clean. 
5) If the test went by without any errors the test image is deleted and
in case of any errors the user is notified that the log of this test run
should be mailed to linux-ext4@ and the image should be preserved.

Any comments are welcome.

---
Signed-off-by: Andreas Dilger <[EMAIL PROTECTED]>
Signed-off-by: Kalpak Shah <[EMAIL PROTECTED]>


Thanks,
Kalpak.

Index: e2fsprogs-regression/tests/f_random_corruption/script
===
--- /dev/null
+++ e2fsprogs-regression/tests/f_random_corruption/script
@@ -0,0 +1,277 @@
+# This is to make sure that if this test fails other tests can still be run
+# instead of doing an exit. We break before the end of the loop.
+while (( 1 )); do
+
+# choose block and inode sizes randomly
+BLK_SIZES=(1024 2048 4096)
+INODE_SIZES=(128 256 512 1024)
+
+SEED=$(head -1 /dev/urandom | od -N 1 | awk '{ print $2 }')
+RANDOM=$SEED
+
+IMAGE=${IMAGE:-$TMPFILE}
+DATE=`date '+%Y%m%d%H%M%S'`
+ARCHIVE=$IMAGE.$DATE
+SIZE=${SIZE:-$(( 192000 + RANDOM + RANDOM )) }
+FS_TYPE=${FS_TYPE:-ext3}
+BLK_SIZE=${BLK_SIZES[(( $RANDOM % ${#BLK_SIZES[*]} ))]}
+INODE_SIZE=${INODE_SIZES[(( $RANDOM % ${#INODE_SIZES[*]} ))]}
+DEF_FEATURES="sparse_super,filetype,resize_inode,dir_index"
+FEATURES=${FEATURES:-$DEF_FEATURES}
+MOUNT_OPTS="-o loop"
+MNTPT=$test_dir/temp
+OUT=$test_name.$DATE.log
+
+# Do you want to try and mount the filesystem?
+MOUNT_AFTER_CORRUPTION=${MOUNT_AFTER_CORRUPTION:-"no"}
+# Do you want to remove the files from the mounted filesystem? Ideally use it
+# only in test environment.
+REMOVE_FILES=${REMOVE_FILES:-"no"}
+
+# In KB
+CORRUPTION_SIZE=${CORRUPTION_SIZE:-16}
+CORRUPTION_ITERATIONS=${CORRUPTION_ITERATIONS:-5}
+
+MKFS=../misc/mke2fs
+E2FSCK=../e2fsck/e2fsck
+FIRST_FSCK_OPTS="-fyv"
+SECOND_FSCK_OPTS="-fyv"
+
+# Lets check if the image can fit in the current filesystem.
+BASE_BS=`stat -f . | grep "Block size:" | cut -d " " -f3`
+BASE_AVAIL_BLOCKS=`stat -f . | grep "Available:" | cut -d ":" -f5`
+
+if (( BASE_BS * BASE_AVAIL_BLOCKS < NUM_BLKS * BLK_SIZE )); then
+	echo "The base filesystem does not have enough space to accomodate the"
+	echo "test image. Aborting test"
+	break;
+fi
+
+# Lets have a journal more times than not.
+HAVE_JOURNAL=$(( $RANDOM % 12 ))
+if (( HAVE_JOURNAL == 0 )); then
+	FS_TYPE="ext2"
+	HAVE_JOURNAL=""
+else
+	HAVE_JOURNAL="-j"
+fi
+
+# Experimental features should not be used too often.
+LAZY_BG=$(( $RANDOM % 12 ))
+if (( LAZY_BG == 0 )); then
+   	FEATURES=$FEATURES,lazy_bg
+fi
+META_BG=$(( $RANDOM % 12 ))
+if (( META_BG == 0 )); then
+   	FEATURES=$FEATURES,meta_bg
+fi
+
+modprobe ext4 2> /dev/null
+modprobe ext4dev 2> /dev/null
+
+# If ext4 is present in the kernel then we can play with ext4 options
+EXT4=`grep ext4 /proc/filesystems`
+if [ -n "$EXT4" ]; then
+	USE_EXT4=$(( $RANDOM % 2 ))
+	if (( USE_EXT4 == 1 )); then
+		FS_TYPE="ext4dev"
+	fi
+fi
+
+if [ "$FS_TYPE" = "ext4dev" ]; then
+	UNINIT_GROUPS=$(( $RANDOM % 12 ))
+	if (( UNINIT_GROUPS == 0 )); then
+		FEATURES=$FEATURES,uninit_groups
+	fi
+	EXPAND_ESIZE=$(( $RANDOM % 12 ))
+	if (( EXPAND_EISIZE == 0 )); then
+		FIRST_FSCK_OPTS=$FIRST_FSCK_OPTS," -E expand_extra_isize"
+	fi
+fi
+
+MKFS_OPTS=" $HAVE_JOURNAL -b $BLK_SIZE -I $INODE_SIZE -O $FEATURES"
+
+NUM_BLKS=$(( (SIZE * 1024) / BLK_SIZE ))
+
+unset_vars()
+{
+	unset IMAGE DATE ARCHIVE FS_TYPE SIZE BLK_SIZE MKFS_OPTS MOUNT_OPTS
+	unset E2FSCK FIRST_FSCK_OPTS SECOND_FSCK_OPTS OUT
+}
+
+cleanup()
+{
+	echo "Error occured..." >> $OUT.failed
+umount -f $MNTPT > /dev/null 2>&1 | tee -a $OUT
+	echo " failed"
+	echo "*** This appears to be a bug in e2fsprogs ***"
+	echo "Please contact linux-ext4 for further assistance."
+	echo "Include $OUT as an attachment, and save $ARCHIVE locally for future reference."
+	unset_vars
+	break;
+}
+
+echo -n "Random corruption test for e2fsck:"
+# Truncate the output log file
+> $OUT
+
+get_random_location()
+{
+	total=$1
+
+	tmp=$(( (RANDOM * 32768) % total ))
+
+	# Try and have more corruption in metadata at the start of the
+	# filesystem.
+	if (( tmp % 3 == 0 || tmp % 5 == 0 || tmp % 7 == 0 )); then
+		tmp=$(( $tmp % 32768 ))
+	fi
+
+	echo $tmp
+}
+
+make_fs_dirty()
+{
+MAX_BLKS_TO_DIRTY=${1:-NUM_BLKS}
+from=$(( (RANDOM * RANDOM) % NUM_BLKS ))
+
+# Number

Re: ext4-patch-queue rebased to 2.6.22

2007-07-10 Thread Amit K. Arora

On Mon, Jul 09, 2007 at 01:37:56PM -0400, Theodore Ts'o wrote:
> So we're just waiting for Amit to make the minor on-disk format change
> Andreas suggested before we push to Linus.

I have commited following changes to the ext4 patch queue:

1. Updated ext4-fallocate-1-syscall_i386_amd64_ppc to add compat wrapper
   for x86_64/ia32. This addresses Heiko's concern that we did not have
   a wrapper for ia32.

2. Added a new patch ext4-fallocate-8-new-ondisk-format and updated
   the series file. This patch, as suggested by Andreas, will allow
   an initialized extent to be of max 2^15 length. Main purpose of this
   change is to have a better extent-to-group alignment.
   For uninitialized extents the max length remains same - i.e. 2^15 - 1.

Thanks!
--
Regards,
Amit Arora
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Random corruption test for e2fsck

2007-07-10 Thread Theodore Tso

On Tue, Jul 10, 2007 at 06:37:40PM +0530, Kalpak Shah wrote:
> Hi,
> 
> This is a random corruption test which can be included in the e2fsprogs
> regression tests. 
> 1) Create an test fs and format it with ext2/3/4 and random selection of
> features.
> 2) Mount it and copy data into it.

This requires root privileges in order to mount the loop filesystem.
Any chance you could change it to use debugfs to populate the
filesystem, so we don't need root privs in order to mount it.

This will increase the number of people that will actually run the
test, and more importantly not encourage people from running "make
check" as root.

> 3) Move around the blocks of the filesystem randomly causing corruption.
> Also overwrite some random blocks with garbage from /dev/urandom. Create
> a copy of this corrupted filesystem.
>
> 4) Unmount and run e2fsck. If the first run of e2fsck produces any
> errors like uncorrected errors, library error, segfault, usage error,
> etc. then it is deemed a bug. But in any case, a second run of e2fsck is
> done to check if it renders the filesystem clean. 

Err, you do unmount the filesystem first before you start corrupting
it, right?  (Checking script; sure looks like it.)

> 5) If the test went by without any errors the test image is deleted and
> in case of any errors the user is notified that the log of this test run
> should be mailed to linux-ext4@ and the image should be preserved.

I certainly like the general concept!!

I wonder if the code to create a random filesystem and corrupting it
should be kept as separate shell script, since it can be reused in
another of interesting ways.  One thought would be to write a test
script that mounts corrupted filesystems using UML and then does some
exercises on it (tar cvf on the filesyste, random renames on the
filesystem, rm -rf of all of the contents of the filesystems), to see
whether we can provoke a kernel oops.

Regards,

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Random corruption test for e2fsck

2007-07-10 Thread Eric Sandeen

Theodore Tso wrote:
>> 5) If the test went by without any errors the test image is deleted and
>> in case of any errors the user is notified that the log of this test run
>> should be mailed to linux-ext4@ and the image should be preserved.
> 
> I certainly like the general concept!!
> 
> I wonder if the code to create a random filesystem and corrupting it
> should be kept as separate shell script, since it can be reused in
> another of interesting ways.  One thought would be to write a test
> script that mounts corrupted filesystems using UML and then does some
> exercises on it (tar cvf on the filesyste, random renames on the
> filesystem, rm -rf of all of the contents of the filesystems), to see
> whether we can provoke a kernel oops.

FWIW, that's what fsfuzzer does, in an fs-agnostic way.

-Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: random ext3 image bugs

2007-07-10 Thread Eric Sandeen

Michal Piotrowski wrote:
> Hi all,
> 
> My insane file system test triggered a few bugs in ext3 image. Anyone 
> interested?

Not yet; all the errors below are ext3 properly coping with the
corrupted image, no?  But if you can make it oops, that's more
interesting.  :)

-Eric

> [40859.103232] SELinux: initialized (dev loop7, type ext3), uses xattr
> [40859.119575] EXT3-fs error (device loop7): ext3_xattr_block_get: inode 757: 
> bad block 4829
> [40859.129837] inode_doinit_with_dentry:  getxattr returned 5 for dev=loop7 
> ino=757
> [40859.139840] EXT3-fs error (device loop7): ext3_xattr_block_get: inode 
> 8381: bad block 4829
> [40859.149436] inode_doinit_with_dentry:  getxattr returned 5 for dev=loop7 
> ino=8381
> [40859.164064] EXT3-fs error (device loop7): ext3_xattr_block_get: inode 
> 4314: bad block 4829
> [40859.172919] inode_doinit_with_dentry:  getxattr returned 5 for dev=loop7 
> ino=4314
> [40859.183789] EXT3-fs error (device loop7): ext3_xattr_block_get: inode 760: 
> bad block 4829
> [40859.192601] inode_doinit_with_dentry:  getxattr returned 5 for dev=loop7 
> ino=760
> [40859.209274] EXT3-fs error (device loop7): ext3_xattr_block_get: inode 
> 6390: bad block 4829
> [40859.218258] inode_doinit_with_dentry:  getxattr returned 5 for dev=loop7 
> ino=6390
> [40859.256550] EXT3-fs error (device loop7): ext3_xattr_block_get: inode 
> 14610: bad block 4829
> [40859.266803] inode_doinit_with_dentry:  getxattr returned 5 for dev=loop7 
> ino=14610
> [40859.284814] EXT3-fs error (device loop7): ext3_xattr_block_get: inode 764: 
> bad block 4829
> [40859.293644] inode_doinit_with_dentry:  getxattr returned 5 for dev=loop7 
> ino=764
> [..]
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Random corruption test for e2fsck

2007-07-10 Thread Eric Sandeen

Kalpak Shah wrote:
> Hi,
> 
> This is a random corruption test which can be included in the e2fsprogs
> regression tests. It does the following:
> 1) Create an test fs and format it with ext2/3/4 and random selection of
> features.
> 2) Mount it and copy data into it.
> 3) Move around the blocks of the filesystem randomly causing corruption.
> Also overwrite some random blocks with garbage from /dev/urandom. Create
> a copy of this corrupted filesystem.
> 4) Unmount and run e2fsck. If the first run of e2fsck produces any
> errors like uncorrected errors, library error, segfault, usage error,
> etc. then it is deemed a bug. But in any case, a second run of e2fsck is
> done to check if it renders the filesystem clean. 
> 5) If the test went by without any errors the test image is deleted and
> in case of any errors the user is notified that the log of this test run
> should be mailed to linux-ext4@ and the image should be preserved.
> 
> Any comments are welcome.

Seems like a pretty good idea.  I had played with such a thing using
fsfuzzer... fsfuzzer always seemed at least as useful useful as a fsck
tester than a kernel code tester anyway.  (OOC, did you look at fsfuzzer
when you did this?)

My only concern is that since it's introducing random corruption, new
things will probably pop up from time to time; when we do an rpm build
for Fedora/RHEL, it automatically runs make check:

%check
make check

which seems like a reasonably good idea to me.  However, I'd rather not
have last-minute build failures introduced by new random collection of
bits that have never been seen before.  Maybe "make RANDOM=0 check" as
an option would be a good idea for automated builds...?

Thanks,
-Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Initial results of FLEX_BG feature.

2007-07-10 Thread Jose R. Santos

Hi folks,

I've started playing with the FLEX_BG feature (for now packing of
block group metadata closer together) and started doing some
preliminary benchmarking to see if the feature is worth pursuing.
I chose an FFSB profile that does single threaded small creates and
writes and then does an fsync.  This is something I ran for a customer
a while ago in which ext3 performed poorly.

Here are some of the results (in transactions/[EMAIL PROTECTED] util) on a 
single
[EMAIL PROTECTED] rpm disk.

ext4[EMAIL PROTECTED]
ext4(flex_bg)   [EMAIL PROTECTED] 20% improvement
ext4(data=writeback)[EMAIL PROTECTED] <- hum...
ext4(flex_bg data=writeback)[EMAIL PROTECTED] 28% over best ext4
ext3[EMAIL PROTECTED]
ext3(data=writeback)[EMAIL PROTECTED]
ext2[EMAIL PROTECTED]
xfs [EMAIL PROTECTED]
jfs [EMAIL PROTECTED]

The results are from packing the metadata of 64 block groups closer
together at fsck time.  Still need to clean up the e2fsprog patches,
but I hope to submit them to the list later this week for others to
try.  It seems like fsck doesn't quite like the new location of the
metadata and I'm not sure how big of an effort it will be to fix it.  I
mentioned this since one of the assumptions of implementing FLEX_BG was
the reduce time in fsck and it could be a while before I'm able to test
this.

-JRS
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: ext4-patch-queue rebased to 2.6.22

2007-07-10 Thread Andreas Dilger

On Jul 10, 2007  20:24 +0530, Amit K. Arora wrote:
> On Mon, Jul 09, 2007 at 01:37:56PM -0400, Theodore Ts'o wrote:
> > So we're just waiting for Amit to make the minor on-disk format change
> > Andreas suggested before we push to Linus.
> 
> 2. Added a new patch ext4-fallocate-8-new-ondisk-format and updated
>the series file. This patch, as suggested by Andreas, will allow
>an initialized extent to be of max 2^15 length. Main purpose of this
>change is to have a better extent-to-group alignment.
>For uninitialized extents the max length remains same - i.e. 2^15 - 1.

One tiny change I'd ask for in this patch (it isn't critical to get in
before the upstream submission as it is only code style) is instead of
using (EXT_MAX_LEN - 1) for uninitialized extents, instead use a separate
#define EXT_UNINIT_MAX_LEN (EXT_MAX_LEN - 1) and use that in the code.
While a minor change, this localizes the knowledge of the maximum length
of uninitialized extents into just one place - right after the maximum
length of initialized extents.

It might even make sense to change the other #define to be called
EXT_INIT_MAX_LEN so people have to think about this when using the #define.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

block groups with no inode tables

2007-07-10 Thread Jose R. Santos

Hi folks,

As I play with the allocation of the metadata for the FLEX_BG feature,
it seems that we could benefit from having block groups with no inode
tables.  Right now we allocate one inode table per bg base on the
inode_blocks_per_group.  For FLEX_BG though, it would make more sense
to have a larger inode tables that fully use the inode bitmap allocated
on the first few block groups.  Once we reach the number of inode per
FLEX_BG, then the remaining block groups could then have no inode
tables defined.

The idea here is that we better utilize the inode bitmaps and reduce the
number of inode tables to improve mkfs/fsck times. We could also
support expansion of inode since we have block groups that have empty
entries in the block group descriptors and as long as we can find
enough empty blocks for the inode table expanding the number of inodes
should be relatively easy.

Don't know if ext4 currently supports this.  Any thoughts?

-JRS
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: block groups with no inode tables

2007-07-10 Thread coly li

Hi, once we decide to do this, how about storing inode inside the
directory ?

IMHO, the latter one is more attractive :-)

Coly

在 2007-07-10二的 12:12 -0500，Jose R. Santos写道：
> Hi folks,
> 
> As I play with the allocation of the metadata for the FLEX_BG feature,
> it seems that we could benefit from having block groups with no inode
> tables.  Right now we allocate one inode table per bg base on the
> inode_blocks_per_group.  For FLEX_BG though, it would make more sense
> to have a larger inode tables that fully use the inode bitmap allocated
> on the first few block groups.  Once we reach the number of inode per
> FLEX_BG, then the remaining block groups could then have no inode
> tables defined.
> 
> The idea here is that we better utilize the inode bitmaps and reduce the
> number of inode tables to improve mkfs/fsck times. We could also
> support expansion of inode since we have block groups that have empty
> entries in the block group descriptors and as long as we can find
> enough empty blocks for the inode table expanding the number of inodes
> should be relatively easy.
> 
> Don't know if ext4 currently supports this.  Any thoughts?
> 
> -JRS
> -
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: block groups with no inode tables

2007-07-10 Thread Dave Kleikamp

On Wed, 2007-07-11 at 01:30 +0800, coly li wrote:
> Hi, once we decide to do this, how about storing inode inside the
> directory ?

Which directory?

> IMHO, the latter one is more attractive :-)

Sounds like a mess to me.  Consider ln and mv.

> Coly

-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

ext2 reservations (Re: -mm merge plans for 2.6.23)

2007-07-10 Thread Alexey Dobriyan

> ext2-reservations.patch
> 
>  Still needs decent testing.

Was this oops silently fixed?
http://lkml.org/lkml/2007/3/2/138
2.6.21-rc2-mm1: EIP is at ext2_discard_reservation+0x1c/0x52

I still have that ext2 partition backed up.

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: ext4-patch-queue rebased to 2.6.22

2007-07-10 Thread Amit K. Arora

On Tue, Jul 10, 2007 at 11:09:39AM -0600, Andreas Dilger wrote:
> On Jul 10, 2007  20:24 +0530, Amit K. Arora wrote:
> > On Mon, Jul 09, 2007 at 01:37:56PM -0400, Theodore Ts'o wrote:
> > > So we're just waiting for Amit to make the minor on-disk format change
> > > Andreas suggested before we push to Linus.
> > 
> > 2. Added a new patch ext4-fallocate-8-new-ondisk-format and updated
> >the series file. This patch, as suggested by Andreas, will allow
> >an initialized extent to be of max 2^15 length. Main purpose of this
> >change is to have a better extent-to-group alignment.
> >For uninitialized extents the max length remains same - i.e. 2^15 - 1.
> 
> One tiny change I'd ask for in this patch (it isn't critical to get in
> before the upstream submission as it is only code style) is instead of
> using (EXT_MAX_LEN - 1) for uninitialized extents, instead use a separate
> #define EXT_UNINIT_MAX_LEN (EXT_MAX_LEN - 1) and use that in the code.
> While a minor change, this localizes the knowledge of the maximum length
> of uninitialized extents into just one place - right after the maximum
> length of initialized extents.
> 
> It might even make sense to change the other #define to be called
> EXT_INIT_MAX_LEN so people have to think about this when using the #define.

Done. Changes are in ext4 patch queue.
Can you please have a quick look and see if this is what you preferred ?

--
Regards,
Amit Arora
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: block groups with no inode tables

2007-07-10 Thread Mingming Cao

On Tue, 2007-07-10 at 12:40 -0500, Dave Kleikamp wrote:
> On Wed, 2007-07-11 at 01:30 +0800, coly li wrote:
> > Hi, once we decide to do this, how about storing inode inside the
> > directory ?
> 
> Which directory?
I think Coly is refering to the idea of
store-inode-inside-in-directory-file.

It's one way to implement the dynamic inode table allocation. With it
you don't have system-wide inode tables anymore, but all inode
structures are directly stored in the directory file.

> 
> > IMHO, the latter one is more attractive :-)
> 
> Sounds like a mess to me.  Consider ln and mv.
> 
> > Coly
> 

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: block groups with no inode tables

2007-07-10 Thread Dave Kleikamp

On Tue, 2007-07-10 at 11:59 -0400, Mingming Cao wrote:
> On Tue, 2007-07-10 at 12:40 -0500, Dave Kleikamp wrote:
> > On Wed, 2007-07-11 at 01:30 +0800, coly li wrote:
> > > Hi, once we decide to do this, how about storing inode inside the
> > > directory ?
> > 
> > Which directory?
> I think Coly is refering to the idea of
> store-inode-inside-in-directory-file.
> 
> It's one way to implement the dynamic inode table allocation. With it
> you don't have system-wide inode tables anymore, but all inode
> structures are directly stored in the directory file.

Assuming you mean the parent directory?  An inode isn't tied to a
specific parent.

ln dir1/file1 dir2/
mv dir1/file1 dir3/
rmdir dir1

What is happens to the inode?  I really don't think that the directory
is the right place to store an inode.
-- 
David Kleikamp
IBM Linux Technology Center

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 0/7][TAKE6] fallocate system call

2007-07-10 Thread Amit K. Arora

This is the latest fallocate patchset and is rebased to 2.6.22.

Following are the changes from TAKE5:
1) Rebased to 2.6.22
2) Added compat wrapper for x86_64
3) Dropped s390 and ia64 patches, since the platform maintaners can
   add the support for fallocate once it is in mainline.
4) Added a change suggested by Andreas for better extent-to-group
   alignment in ext4 (Patch 6/6). Please refer following post:
http://www.mail-archive.com/linux-ext4@vger.kernel.org/msg02445.html
5) Renamed mode flags and values from "FA_" to "FALLOC_"
6) Added manpage (updated version of the one initially submitted by
   David Chinner).


Todos:
-
1> Implementation on other architectures (other than i386, x86_64,
   and ppc64). s390(x) and ia64 patches are ready and will be pushed
   by platform maintaners when the fallocate is in mainline.
2> A generic file system operation to handle fallocate
   (generic_fallocate), for filesystems that do _not_ have the fallocate
   inode operation implemented.
3> Changes to glibc,
   a) to support fallocate() system call
   b) to make posix_fallocate() and posix_fallocate64() call fallocate()
4> A testcase to test the system call. Will post it soon.


Following patches follow:
Patch 1/7 : manpage for fallocate
Patch 2/7 : fallocate() implementation in i386, x86_64 and powerpc
Patch 3/7 : support new modes in fallocate
Patch 4/7 : ext4: fallocate support in ext4
Patch 5/7 : ext4: write support for preallocated blocks
Patch 6/7 : ext4: support new modes in ext4
Patch 7/7 : ext4: change for better extent-to-group alignment


--
Regards,
Amit Arora
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 1/7] manpage for fallocate

2007-07-10 Thread Amit K. Arora

Following is the modified version of the manpage originally submitted by
David Chinner. Please use `nroff -man fallocate.2 | less` to view.


.TH fallocate 2
.SH NAME
fallocate \- allocate or remove file space
.SH SYNOPSIS
.nf
.B #include 
.PP
.BI "int syscall(int, int fd, int mode, loff_t offset, loff_t len);
.Op
.SH DESCRIPTION
The
.BR fallocate
syscall allows a user to directly manipulate the allocated disk space
for the file referred to by
.I fd
for the byte range starting at
.IR offset
and continuing for
.IR len
bytes.
The
.I mode
parameter determines the operation to be performed on the given range.
Currently there are four modes:
.TP
.B FALLOC_ALLOCATE
allocates and initialises to zero the disk space within the given range.
After a successful call, subsequent writes are guaranteed not to fail because
of lack of disk space.  If the size of the file is less than
.IR offset + len ,
then the file is increased to this size; otherwise the file size is left
unchanged.
.B FALLOC_ALLOCATE
closely resembles
.B posix_fallocate(3)
and is intended as a method of optimally implementing this function.
.B FALLOC_ALLOCATE
may allocate a larger range that was specified.
.TP
.B FALLOC_RESV_SPACE
provides the same functionality as
.B FALLOC_ALLOCATE
except it does not ever change the file size. This allows allocation
of zero blocks beyond the end of file and is useful for optimising
append workloads.
.TP
.B FALLOC_DEALLOCATE
removes any preallocated space within the given range. The file size
may change if deallocation is towards the end of the file.
.TP
.B FALLOC_UNRESV_SPACE
removes the underlying disk space within the given range. The disk space
shall be removed regardless of it's contents so both allocated space
from
.B FALLOC_ALLOCATE
and
.B FALLOC_RESV_SPACE
as well as from
.B write(3)
will be removed.
.B FALLOC_UNRESV_SPACE
shall never remove disk blocks outside the range specified.
.B FALLOC_UNRESV_SPACE
shall never change the file size. If changing the file size
is required when deallocating blocks from an offset to end
of file (or beyond end of file) is required,
.B ftuncate64(3)
or
.B FALLOC_DEALLOCATE
should be used.

.SH "RETURN VALUE"
.BR fallocate()
returns zero on success, or an error number on failure.
Note that
.IR errno
is not set.
.SH "ERRORS"
.TP
.B EBADF
.I fd
is not a valid file descriptor, or is not opened for writing.
.TP
.B EFBIG
.I offset+len
exceeds the maximum file size.
.TP
.B EINVAL
.I offset
or
.I len
was less than 0.
.TP
.B ENODEV
.I fd
does not refer to a regular file or a directory.
.TP
.B ENOSPC
There is not enough space left on the device containing the file
referred to by
.IR fd.
.TP
.B ESPIPE
.I fd
refers to a pipe of file descriptor.
.B ENOSYS
The filesystem underlying the file descriptor does not support this
operation.
.SH AVAILABILITY
The
.BR fallocate ()
system call is available since 2.6.XX
.SH "SEE ALSO"
.BR syscall (2),
.BR posix_fadvise (3)
.BR ftruncate (3)
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 2/7] fallocate() implementation in i386, x86_64 and powerpc

2007-07-10 Thread Amit K. Arora

From: Amit Arora <[EMAIL PROTECTED]>

sys_fallocate() implementation on i386, x86_64 and powerpc

fallocate() is a new system call being proposed here which will allow
applications to preallocate space to any file(s) in a file system.
Each file system implementation that wants to use this feature will need
to support an inode operation called ->fallocate().
Applications can use this feature to avoid fragmentation to certain
level and thus get faster access speed. With preallocation, applications
also get a guarantee of space for particular file(s) - even if later the
the system becomes full.

Currently, glibc provides an interface called posix_fallocate() which
can be used for similar cause. Though this has the advantage of working
on all file systems, but it is quite slow (since it writes zeroes to
each block that has to be preallocated). Without a doubt, file systems
can do this more efficiently within the kernel, by implementing
the proposed fallocate() system call. It is expected that
posix_fallocate() will be modified to call this new system call first
and incase the kernel/filesystem does not implement it, it should fall
back to the current implementation of writing zeroes to the new blocks.


Signed-off-by: Amit Arora <[EMAIL PROTECTED]>

Index: linux-2.6.22/arch/i386/kernel/syscall_table.S
===
--- linux-2.6.22.orig/arch/i386/kernel/syscall_table.S
+++ linux-2.6.22/arch/i386/kernel/syscall_table.S
@@ -323,3 +323,4 @@ ENTRY(sys_call_table)
.long sys_signalfd
.long sys_timerfd
.long sys_eventfd
+   .long sys_fallocate
Index: linux-2.6.22/arch/powerpc/kernel/sys_ppc32.c
===
--- linux-2.6.22.orig/arch/powerpc/kernel/sys_ppc32.c
+++ linux-2.6.22/arch/powerpc/kernel/sys_ppc32.c
@@ -773,6 +773,13 @@ asmlinkage int compat_sys_truncate64(con
return sys_truncate(path, (high << 32) | low);
 }
 
+asmlinkage long compat_sys_fallocate(int fd, int mode, u32 offhi, u32 offlo,
+u32 lenhi, u32 lenlo)
+{
+   return sys_fallocate(fd, mode, ((loff_t)offhi << 32) | offlo,
+((loff_t)lenhi << 32) | lenlo);
+}
+
 asmlinkage int compat_sys_ftruncate64(unsigned int fd, u32 reg4, unsigned long 
high,
 unsigned long low)
 {
Index: linux-2.6.22/arch/x86_64/ia32/ia32entry.S
===
--- linux-2.6.22.orig/arch/x86_64/ia32/ia32entry.S
+++ linux-2.6.22/arch/x86_64/ia32/ia32entry.S
@@ -719,4 +719,5 @@ ia32_sys_call_table:
.quad compat_sys_signalfd
.quad compat_sys_timerfd
.quad sys_eventfd
+   .quad sys32_fallocate
 ia32_syscall_end:
Index: linux-2.6.22/fs/open.c
===
--- linux-2.6.22.orig/fs/open.c
+++ linux-2.6.22/fs/open.c
@@ -353,6 +353,92 @@ asmlinkage long sys_ftruncate64(unsigned
 #endif
 
 /*
+ * sys_fallocate - preallocate blocks or free preallocated blocks
+ * @fd: the file descriptor
+ * @mode: mode specifies if fallocate should preallocate blocks OR free
+ *   (unallocate) preallocated blocks. Currently only FA_ALLOCATE and
+ *   FA_DEALLOCATE modes are supported.
+ * @offset: The offset within file, from where (un)allocation is being
+ * requested. It should not have a negative value.
+ * @len: The amount (in bytes) of space to be (un)allocated, from the offset.
+ *
+ * This system call, depending on the mode, preallocates or unallocates blocks
+ * for a file. The range of blocks depends on the value of offset and len
+ * arguments provided by the user/application. For FA_ALLOCATE mode, if this
+ * system call succeeds, subsequent writes to the file in the given range
+ * (specified by offset & len) should not fail - even if the file system
+ * later becomes full. Hence the preallocation done is persistent (valid
+ * even after reopen of the file and remount/reboot).
+ *
+ * It is expected that the ->fallocate() inode operation implemented by the
+ * individual file systems will update the file size and/or ctime/mtime
+ * depending on the mode and also on the success of the operation.
+ *
+ * Note: Incase the file system does not support preallocation,
+ * posix_fallocate() should fall back to the library implementation (i.e.
+ * allocating zero-filled new blocks to the file).
+ *
+ * Return Values
+ * 0   : On SUCCESS a value of zero is returned.
+ * error   : On Failure, an error code will be returned.
+ * An error code of -ENOSYS or -EOPNOTSUPP should make posix_fallocate()
+ * fall back on library implementation of fallocate.
+ *
+ *  Generic fallocate to be added for file systems that do not
+ *  support fallocate it.
+ */
+asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
+{
+   struct file *file;
+   struct inode *inode;
+   long ret = -EIN

[PATCH 3/7] support new modes in fallocate

2007-07-10 Thread Amit K. Arora

From:  Amit Arora <[EMAIL PROTECTED]>
Implement new flags and values for mode argument.

This patch implements the new flags and values for the "mode" argument
of the fallocate system call. It is based on the discussion between
Andreas Dilger and David Chinner on the man page proposed (by the later)
on fallocate.

Signed-off-by: Amit Arora <[EMAIL PROTECTED]>

Index: linux-2.6.22/include/linux/fs.h
===
--- linux-2.6.22.orig/include/linux/fs.h
+++ linux-2.6.22/include/linux/fs.h
@@ -267,15 +267,17 @@ extern int dir_notify_enable;
 #define SYNC_FILE_RANGE_WAIT_AFTER 4
 
 /*
- * sys_fallocate modes
- * Currently sys_fallocate supports two modes:
- * FA_ALLOCATE  : This is the preallocate mode, using which an application/user
- *   may request (pre)allocation of blocks.
- * FA_DEALLOCATE: This is the deallocate mode, which can be used to free
- *   the preallocated blocks.
+ * sys_fallocate mode flags and values
  */
-#define FA_ALLOCATE0x1
-#define FA_DEALLOCATE  0x2
+#define FALLOC_FL_DEALLOC  0x01 /* default is allocate */
+#define FALLOC_FL_KEEP_SIZE0x02 /* default is extend/shrink size */
+#define FALLOC_FL_DEL_DATA 0x04 /* default is keep written data on DEALLOC 
*/
+
+#define FALLOC_ALLOCATE0
+#define FALLOC_DEALLOCATE  FALLOC_FL_DEALLOC
+#define FALLOC_RESV_SPACE  FALLOC_FL_KEEP_SIZE
+#define FALLOC_UNRESV_SPACE(FALLOC_FL_DEALLOC | FALLOC_FL_KEEP_SIZE | \
+FALLOC_FL_DEL_DATA)
 
 #ifdef __KERNEL__
 
Index: linux-2.6.22/fs/open.c
===
--- linux-2.6.22.orig/fs/open.c
+++ linux-2.6.22/fs/open.c
@@ -356,23 +356,26 @@ asmlinkage long sys_ftruncate64(unsigned
  * sys_fallocate - preallocate blocks or free preallocated blocks
  * @fd: the file descriptor
  * @mode: mode specifies if fallocate should preallocate blocks OR free
- *   (unallocate) preallocated blocks. Currently only FA_ALLOCATE and
- *   FA_DEALLOCATE modes are supported.
+ *   (unallocate) preallocated blocks.
  * @offset: The offset within file, from where (un)allocation is being
  * requested. It should not have a negative value.
  * @len: The amount (in bytes) of space to be (un)allocated, from the offset.
  *
  * This system call, depending on the mode, preallocates or unallocates blocks
  * for a file. The range of blocks depends on the value of offset and len
- * arguments provided by the user/application. For FA_ALLOCATE mode, if this
+ * arguments provided by the user/application. For FALLOC_ALLOCATE and
+ * FALLOC_RESV_SPACE modes, if the sys_fallocate()
  * system call succeeds, subsequent writes to the file in the given range
  * (specified by offset & len) should not fail - even if the file system
  * later becomes full. Hence the preallocation done is persistent (valid
- * even after reopen of the file and remount/reboot).
+ * even after reopen of the file and remount/reboot). If FALLOC_RESV_SPACE mode
+ * is passed, the file size will not be changed even if the preallocation
+ * is beyond EOF.
  *
  * It is expected that the ->fallocate() inode operation implemented by the
  * individual file systems will update the file size and/or ctime/mtime
- * depending on the mode and also on the success of the operation.
+ * depending on the mode (change is visible to user or not - say file size)
+ * and obviously, on the success of the operation.
  *
  * Note: Incase the file system does not support preallocation,
  * posix_fallocate() should fall back to the library implementation (i.e.
@@ -398,7 +401,8 @@ asmlinkage long sys_fallocate(int fd, in
 
/* Return error if mode is not supported */
ret = -EOPNOTSUPP;
-   if (mode != FA_ALLOCATE && mode != FA_DEALLOCATE)
+   if (!(mode == FALLOC_ALLOCATE || mode == FALLOC_DEALLOCATE ||
+   mode == FALLOC_RESV_SPACE || mode == FALLOC_UNRESV_SPACE))
goto out;
 
ret = -EBADF;
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 4/7] ext4: fallocate support in ext4

2007-07-10 Thread Amit K. Arora

From: Amit Arora <[EMAIL PROTECTED]>

fallocate support in ext4

This patch implements ->fallocate() inode operation in ext4. With this
patch users of ext4 file systems will be able to use fallocate() system
call for persistent preallocation. Current implementation only supports
preallocation for regular files (directories not supported as of date)
with extent maps. This patch does not support block-mapped files currently.
Only FA_ALLOCATE mode is being supported as of now. Supporting
FA_DEALLOCATE mode is a  item.


Signed-off-by: Amit Arora <[EMAIL PROTECTED]>

Index: linux-2.6.22/fs/ext4/extents.c
===
--- linux-2.6.22.orig/fs/ext4/extents.c 2007-07-09 15:24:33.0 -0700
+++ linux-2.6.22/fs/ext4/extents.c  2007-07-09 15:24:39.0 -0700
@@ -282,7 +282,7 @@ static void ext4_ext_show_path(struct in
} else if (path->p_ext) {
ext_debug("  %d:%d:%llu ",
  le32_to_cpu(path->p_ext->ee_block),
- le16_to_cpu(path->p_ext->ee_len),
+ ext4_ext_get_actual_len(path->p_ext),
  ext_pblock(path->p_ext));
} else
ext_debug("  []");
@@ -305,7 +305,7 @@ static void ext4_ext_show_leaf(struct in
 
for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ex++) {
ext_debug("%d:%d:%llu ", le32_to_cpu(ex->ee_block),
- le16_to_cpu(ex->ee_len), ext_pblock(ex));
+ ext4_ext_get_actual_len(ex), ext_pblock(ex));
}
ext_debug("\n");
 }
@@ -425,7 +425,7 @@ ext4_ext_binsearch(struct inode *inode, 
ext_debug("  -> %d:%llu:%d ",
le32_to_cpu(path->p_ext->ee_block),
ext_pblock(path->p_ext),
-   le16_to_cpu(path->p_ext->ee_len));
+   ext4_ext_get_actual_len(path->p_ext));
 
 #ifdef CHECK_BINSEARCH
{
@@ -686,7 +686,7 @@ static int ext4_ext_split(handle_t *hand
ext_debug("move %d:%llu:%d in new leaf %llu\n",
le32_to_cpu(path[depth].p_ext->ee_block),
ext_pblock(path[depth].p_ext),
-   le16_to_cpu(path[depth].p_ext->ee_len),
+   ext4_ext_get_actual_len(path[depth].p_ext),
newblock);
/*memmove(ex++, path[depth].p_ext++,
sizeof(struct ext4_extent));
@@ -1106,7 +1106,19 @@ static int
 ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
struct ext4_extent *ex2)
 {
-   if (le32_to_cpu(ex1->ee_block) + le16_to_cpu(ex1->ee_len) !=
+   unsigned short ext1_ee_len, ext2_ee_len;
+
+   /*
+* Make sure that either both extents are uninitialized, or
+* both are _not_.
+*/
+   if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2))
+   return 0;
+
+   ext1_ee_len = ext4_ext_get_actual_len(ex1);
+   ext2_ee_len = ext4_ext_get_actual_len(ex2);
+
+   if (le32_to_cpu(ex1->ee_block) + ext1_ee_len !=
le32_to_cpu(ex2->ee_block))
return 0;
 
@@ -1115,14 +1127,14 @@ ext4_can_extents_be_merged(struct inode 
 * as an RO_COMPAT feature, refuse to merge to extents if
 * this can result in the top bit of ee_len being set.
 */
-   if (le16_to_cpu(ex1->ee_len) + le16_to_cpu(ex2->ee_len) > EXT_MAX_LEN)
+   if (ext1_ee_len + ext2_ee_len > EXT_MAX_LEN)
return 0;
 #ifdef AGGRESSIVE_TEST
if (le16_to_cpu(ex1->ee_len) >= 4)
return 0;
 #endif
 
-   if (ext_pblock(ex1) + le16_to_cpu(ex1->ee_len) == ext_pblock(ex2))
+   if (ext_pblock(ex1) + ext1_ee_len == ext_pblock(ex2))
return 1;
return 0;
 }
@@ -1144,7 +1156,7 @@ unsigned int ext4_ext_check_overlap(stru
unsigned int ret = 0;
 
b1 = le32_to_cpu(newext->ee_block);
-   len1 = le16_to_cpu(newext->ee_len);
+   len1 = ext4_ext_get_actual_len(newext);
depth = ext_depth(inode);
if (!path[depth].p_ext)
goto out;
@@ -1191,8 +1203,9 @@ int ext4_ext_insert_extent(handle_t *han
struct ext4_extent *nearex; /* nearest extent */
struct ext4_ext_path *npath = NULL;
int depth, len, err, next;
+   unsigned uninitialized = 0;
 
-   BUG_ON(newext->ee_len == 0);
+   BUG_ON(ext4_ext_get_actual_len(newext) == 0);
depth = ext_depth(inode);
ex = path[depth].p_ext;
BUG_ON(path[depth].p_hdr == NULL);
@@ -1200,14 +1213,24 @@ int ext4_ext_insert_extent(handle_t *han
/* try to insert block into found extent and return */
if (ex && ext4_can_extents_be_merged(inode, ex, newext)) {

[PATCH 5/7] ext4: write support for preallocated blocks

2007-07-10 Thread Amit K. Arora

From:  Amit Arora <[EMAIL PROTECTED]>

write support for preallocated blocks

This patch adds write support to the uninitialized extents that get
created when a preallocation is done using fallocate(). It takes care of
splitting the extents into multiple (upto three) extents and merging the
new split extents with neighbouring ones, if possible.

Signed-off-by: Amit Arora <[EMAIL PROTECTED]>

Index: linux-2.6.22/fs/ext4/extents.c
===
--- linux-2.6.22.orig/fs/ext4/extents.c 2007-07-09 15:24:39.0 -0700
+++ linux-2.6.22/fs/ext4/extents.c  2007-07-09 15:24:48.0 -0700
@@ -1140,6 +1140,53 @@ ext4_can_extents_be_merged(struct inode 
 }
 
 /*
+ * This function tries to merge the "ex" extent to the next extent in the tree.
+ * It always tries to merge towards right. If you want to merge towards
+ * left, pass "ex - 1" as argument instead of "ex".
+ * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns
+ * 1 if they got merged.
+ */
+int ext4_ext_try_to_merge(struct inode *inode,
+ struct ext4_ext_path *path,
+ struct ext4_extent *ex)
+{
+   struct ext4_extent_header *eh;
+   unsigned int depth, len;
+   int merge_done = 0;
+   int uninitialized = 0;
+
+   depth = ext_depth(inode);
+   BUG_ON(path[depth].p_hdr == NULL);
+   eh = path[depth].p_hdr;
+
+   while (ex < EXT_LAST_EXTENT(eh)) {
+   if (!ext4_can_extents_be_merged(inode, ex, ex + 1))
+   break;
+   /* merge with next extent! */
+   if (ext4_ext_is_uninitialized(ex))
+   uninitialized = 1;
+   ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex)
+   + ext4_ext_get_actual_len(ex + 1));
+   if (uninitialized)
+   ext4_ext_mark_uninitialized(ex);
+
+   if (ex + 1 < EXT_LAST_EXTENT(eh)) {
+   len = (EXT_LAST_EXTENT(eh) - ex - 1)
+   * sizeof(struct ext4_extent);
+   memmove(ex + 1, ex + 2, len);
+   }
+   eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries) - 1);
+   merge_done = 1;
+   WARN_ON(eh->eh_entries == 0);
+   if (!eh->eh_entries)
+   ext4_error(inode->i_sb, "ext4_ext_try_to_merge",
+  "inode#%lu, eh->eh_entries = 0!", inode->i_ino);
+   }
+
+   return merge_done;
+}
+
+/*
  * check if a portion of the "newext" extent overlaps with an
  * existing extent.
  *
@@ -1327,25 +1374,7 @@ has_space:
 
 merge:
/* try to merge extents to the right */
-   while (nearex < EXT_LAST_EXTENT(eh)) {
-   if (!ext4_can_extents_be_merged(inode, nearex, nearex + 1))
-   break;
-   /* merge with next extent! */
-   if (ext4_ext_is_uninitialized(nearex))
-   uninitialized = 1;
-   nearex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(nearex)
-   + ext4_ext_get_actual_len(nearex + 1));
-   if (uninitialized)
-   ext4_ext_mark_uninitialized(nearex);
-
-   if (nearex + 1 < EXT_LAST_EXTENT(eh)) {
-   len = (EXT_LAST_EXTENT(eh) - nearex - 1)
-   * sizeof(struct ext4_extent);
-   memmove(nearex + 1, nearex + 2, len);
-   }
-   eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1);
-   BUG_ON(eh->eh_entries == 0);
-   }
+   ext4_ext_try_to_merge(inode, path, nearex);
 
/* try to merge extents to the left */
 
@@ -2011,15 +2040,158 @@ void ext4_ext_release(struct super_block
 #endif
 }
 
+/*
+ * This function is called by ext4_ext_get_blocks() if someone tries to write
+ * to an uninitialized extent. It may result in splitting the uninitialized
+ * extent into multiple extents (upto three - one initialized and two
+ * uninitialized).
+ * There are three possibilities:
+ *   a> There is no split required: Entire extent should be initialized
+ *   b> Splits in two extents: Write is happening at either end of the extent
+ *   c> Splits in three extents: Somone is writing in middle of the extent
+ */
+int ext4_ext_convert_to_initialized(handle_t *handle, struct inode *inode,
+   struct ext4_ext_path *path,
+   ext4_fsblk_t iblock,
+   unsigned long max_blocks)
+{
+   struct ext4_extent *ex, newex;
+   struct ext4_extent *ex1 = NULL;
+   struct ext4_extent *ex2 = NULL;
+   struct ext4_extent *ex3 = NULL;
+   struct ext4_extent_header *eh;
+   unsigned int allocated, ee_block, ee_len, depth;
+   ext4_fsblk_t newblock;
+   int err

[PATCH 6/7] ext4: support new modes in ext4

2007-07-10 Thread Amit K. Arora

From: Amit Arora <[EMAIL PROTECTED]>
Support new values of mode in ext4.

This patch supports new mode values/flags in ext4. With this patch ext4
will be able to support FALLOC_ALLOCATE and FALLOC_RESV_SPACE modes. Supporting
FALLOC_DEALLOCATE and FALLOC_UNRESV_SPACE fallocate modes in ext4 is a work for
future.

Signed-off-by: Amit Arora <[EMAIL PROTECTED]>

Index: linux-2.6.22/fs/ext4/extents.c
===
--- linux-2.6.22.orig/fs/ext4/extents.c
+++ linux-2.6.22/fs/ext4/extents.c
@@ -2453,8 +2453,9 @@ int ext4_ext_writepage_trans_blocks(stru
 /*
  * preallocate space for a file. This implements ext4's fallocate inode
  * operation, which gets called from sys_fallocate system call.
- * Currently only FA_ALLOCATE mode is supported on extent based files.
- * We may have more modes supported in future - like FA_DEALLOCATE, which
+ * Currently only FALLOC_ALLOCATE  and FALLOC_RESV_SPACE modes are supported on
+ * extent based files.
+ * We may have more modes supported in future - like FALLOC_DEALLOCATE, which
  * tells fallocate to unallocate previously (pre)allocated blocks.
  * For block-mapped files, posix_fallocate should fall back to the method
  * of writing zeroes to the required new blocks (the same behavior which is
@@ -2475,7 +2476,8 @@ long ext4_fallocate(struct inode *inode,
 * currently supporting (pre)allocate mode for extent-based
 * files _only_
 */
-   if (mode != FA_ALLOCATE || !(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
+   if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL) ||
+   !(mode == FALLOC_ALLOCATE || mode == FALLOC_RESV_SPACE))
return -EOPNOTSUPP;
 
/* preallocation to directories is currently not supported */
@@ -2548,9 +2550,11 @@ retry:
 
/*
 * Time to update the file size.
-* Update only when preallocation was requested beyond the file size.
+* Update only when preallocation was requested beyond the file size
+* and when FALLOC_FL_KEEP_SIZE mode is not specified!
 */
-   if ((offset + len) > i_size_read(inode)) {
+   if (!(mode & FALLOC_FL_KEEP_SIZE) &&
+   (offset + len) > i_size_read(inode)) {
if (ret > 0) {
/*
 * if no error, we assume preallocation succeeded
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 7/7] ext4: change for better extent-to-group alignment

2007-07-10 Thread Amit K. Arora

From: Amit Arora <[EMAIL PROTECTED]>
Change on-disk format for extent to represent uninitialized/initialized extents

This change was suggested by Andreas Dilger as part of the following
post:
http://www.mail-archive.com/linux-ext4@vger.kernel.org/msg02445.html

This patch changes the EXT_MAX_LEN value and extent code which marks/checks
uninitialized extents. With this change it will be possible to have
initialized extents with 2^15 blocks (earlier the max blocks we could have
was 2^15 - 1). This way we can have better extent-to-block alignment.
Now, maximum number of blocks we can have in an initialized extent is 2^15
and in an uninitialized extent is 2^15 - 1.

Signed-off-by: Amit Arora <[EMAIL PROTECTED]>

Index: linux-2.6.22/fs/ext4/extents.c
===
--- linux-2.6.22.orig/fs/ext4/extents.c
+++ linux-2.6.22/fs/ext4/extents.c
@@ -1106,7 +1106,7 @@ static int
 ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
struct ext4_extent *ex2)
 {
-   unsigned short ext1_ee_len, ext2_ee_len;
+   unsigned short ext1_ee_len, ext2_ee_len, max_len;
 
/*
 * Make sure that either both extents are uninitialized, or
@@ -1115,6 +1115,11 @@ ext4_can_extents_be_merged(struct inode 
if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2))
return 0;
 
+   if (ext4_ext_is_uninitialized(ex1))
+   max_len = EXT_UNINIT_MAX_LEN;
+   else
+   max_len = EXT_INIT_MAX_LEN;
+
ext1_ee_len = ext4_ext_get_actual_len(ex1);
ext2_ee_len = ext4_ext_get_actual_len(ex2);
 
@@ -1127,7 +1132,7 @@ ext4_can_extents_be_merged(struct inode 
 * as an RO_COMPAT feature, refuse to merge to extents if
 * this can result in the top bit of ee_len being set.
 */
-   if (ext1_ee_len + ext2_ee_len > EXT_MAX_LEN)
+   if (ext1_ee_len + ext2_ee_len > max_len)
return 0;
 #ifdef AGGRESSIVE_TEST
if (le16_to_cpu(ex1->ee_len) >= 4)
@@ -1814,7 +1819,11 @@ ext4_ext_rm_leaf(handle_t *handle, struc
 
ex->ee_block = cpu_to_le32(block);
ex->ee_len = cpu_to_le16(num);
-   if (uninitialized)
+   /*
+* Do not mark uninitialized if all the blocks in the
+* extent have been removed.
+*/
+   if (uninitialized && num)
ext4_ext_mark_uninitialized(ex);
 
err = ext4_ext_dirty(handle, inode, path + depth);
@@ -2307,6 +2316,18 @@ int ext4_ext_get_blocks(handle_t *handle
/* allocate new block */
goal = ext4_ext_find_goal(inode, path, iblock);
 
+   /*
+* See if request is beyond maximum number of blocks we can have in
+* a single extent. For an initialized extent this limit is
+* EXT_INIT_MAX_LEN and for an uninitialized extent this limit is
+* EXT_UNINIT_MAX_LEN.
+*/
+   if (max_blocks > EXT_INIT_MAX_LEN && create != 
EXT4_CREATE_UNINITIALIZED_EXT)
+   max_blocks = EXT_INIT_MAX_LEN;
+   else if (max_blocks > EXT_UNINIT_MAX_LEN &&
+create == EXT4_CREATE_UNINITIALIZED_EXT)
+   max_blocks = EXT_UNINIT_MAX_LEN;
+
/* Check if we can really insert (iblock)::(iblock+max_blocks) extent */
newex.ee_block = cpu_to_le32(iblock);
newex.ee_len = cpu_to_le16(max_blocks);
Index: linux-2.6.22/include/linux/ext4_fs_extents.h
===
--- linux-2.6.22.orig/include/linux/ext4_fs_extents.h
+++ linux-2.6.22/include/linux/ext4_fs_extents.h
@@ -141,7 +141,25 @@ typedef int (*ext_prepare_callback)(stru
 
 #define EXT_MAX_BLOCK  0x
 
-#define EXT_MAX_LEN((1UL << 15) - 1)
+/*
+ * EXT_INIT_MAX_LEN is the maximum number of blocks we can have in an
+ * initialized extent. This is 2^15 and not (2^16 - 1), since we use the
+ * MSB of ee_len field in the extent datastructure to signify if this
+ * particular extent is an initialized extent or an uninitialized (i.e.
+ * preallocated).
+ * EXT_UNINIT_MAX_LEN is the maximum number of blocks we can have in an
+ * uninitialized extent.
+ * If ee_len is <= 0x8000, it is an initialized extent. Otherwise, it is an
+ * uninitialized one. In other words, if MSB of ee_len is set, it is an
+ * uninitialized extent with only one special scenario when ee_len = 0x8000.
+ * In this case we can not have an uninitialized extent of zero length and
+ * thus we make it as a special case of initialized extent with 0x8000 length.
+ * This way we get better extent-to-group alignment for initialized extents.
+ * Hence, the maximum number of blocks we can have in an *initialized*
+ * extent is 2^15 (32768) and in an *uninitialized* extent is 2^15-1 (32767).
+ */
+#define EXT_INIT_MAX_LEN   (1UL << 15)
+#define EXT_UNINIT_MAX_LEN (EXT_INIT_MAX_LEN - 1)
 
 
 #def

Re: ext4-patch-queue rebased to 2.6.22

2007-07-10 Thread Andreas Dilger

On Jul 10, 2007  23:25 +0530, Amit K. Arora wrote:
> On Tue, Jul 10, 2007 at 11:09:39AM -0600, Andreas Dilger wrote:
> > It might even make sense to change the other #define to be called
> > EXT_INIT_MAX_LEN so people have to think about this when using the #define.
> 
> Done. Changes are in ext4 patch queue.
> Can you please have a quick look and see if this is what you preferred ?

Yes, it looks good, though I wonder if it also makes sense to change the
ext4_ext_*_uninitialized() code to use EXT_INIT_MAX_LEN instead of 0x8000,
since that makes it a bit clearer that the two are related.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: block groups with no inode tables

2007-07-10 Thread Theodore Tso

On Tue, Jul 10, 2007 at 12:12:21PM -0500, Jose R. Santos wrote:
> Hi folks,
> 
> As I play with the allocation of the metadata for the FLEX_BG feature,
> it seems that we could benefit from having block groups with no inode
> tables.  Right now we allocate one inode table per bg base on the
> inode_blocks_per_group.  For FLEX_BG though, it would make more sense
> to have a larger inode tables that fully use the inode bitmap allocated
> on the first few block groups.  Once we reach the number of inode per
> FLEX_BG, then the remaining block groups could then have no inode
> tables defined.
> 
> The idea here is that we better utilize the inode bitmaps and reduce the
> number of inode tables to improve mkfs/fsck times. We could also
> support expansion of inode since we have block groups that have empty
> entries in the block group descriptors and as long as we can find
> enough empty blocks for the inode table expanding the number of inodes
> should be relatively easy.
> 
> Don't know if ext4 currently supports this.  Any thoughts?

Plans to support are there; Andreas sent a patch back in April to
implement this, using bg_itable_unused, which is already reserved in
the block group data structure.  The idea here is to speed up fsck by
specifying how many inodes are actually in use in the block group, so
we don't have to initialize them until they are to be used.  This is
tied with the checksum patches, since doing this means we need to
really worry about the accuracy of the block group descriptors or we
could lose a lot of data if the block group descriptors are corrupted.

We also have something already implemented which does this on a
per-blockgroup basis.  That's the LAZY_BG feature, which was intended
for testing really big filesystems without needing to initialize all
of the inode tables.  In fact mke2fs -O lazy_bg it only initializes
the first and last blockgroups, in order to make sure we can force the
use of blocks at the very end of the filesystem, so we can find any
2**32 bit cleanliness problems, or other problems with really big
block numbers.

Regards,

- Ted

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/7] manpage for fallocate

2007-07-10 Thread Heikki Orsila

On Wed, Jul 11, 2007 at 01:48:20AM +0530, Amit K. Arora wrote:
> .BI "int syscall(int, int fd, int mode, loff_t offset, loff_t len);

Correction: "int syscall(int fd, int mode, ...)",

> .SH "ERRORS"
> .TP
> .B EBADF
> .I fd
> is not a valid file descriptor, or is not opened for writing.
> .TP
> .B EFBIG
> .I offset+len
> exceeds the maximum file size.
> .TP
> .B EINVAL
> .I offset
> or
> .I len
> was less than 0.
> .TP
> .B ENODEV
> .I fd
> does not refer to a regular file or a directory.
> .TP
> .B ENOSPC
> There is not enough space left on the device containing the file
> referred to by
> .IR fd.
> .TP
> .B ESPIPE
> .I fd
> refers to a pipe of file descriptor.
> .B ENOSYS
> The filesystem underlying the file descriptor does not support this
> operation.

EINTR?

-- 
Heikki Orsila   Barbie's law:
[EMAIL PROTECTED]   "Math is hard, let's go shopping!"
http://www.iki.fi/shd
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATHC] Fix for ext2 reservation (Re: -mm merge plans for 2.6.23)

2007-07-10 Thread Badari Pulavarty

On Tue, 2007-07-10 at 12:39 -0700, Martin Bligh wrote:
> Andrew Morton wrote:
> > 
> > Begin forwarded message:
> > 
> > Date: Tue, 10 Jul 2007 21:49:23 +0400
> > From: Alexey Dobriyan <[EMAIL PROTECTED]>
> > To: Andrew Morton <[EMAIL PROTECTED]>
> > Cc: [EMAIL PROTECTED], linux-ext4@vger.kernel.org
> > Subject: ext2 reservations (Re: -mm merge plans for 2.6.23)
> > 
> > 
> >> ext2-reservations.patch
> >>
> >>  Still needs decent testing.
> > 
> > Was this oops silently fixed?
> > http://lkml.org/lkml/2007/3/2/138
> > 2.6.21-rc2-mm1: EIP is at ext2_discard_reservation+0x1c/0x52
> > 
> > I still have that ext2 partition backed up.
> 
> Now I'm confused - I thought there was a latent issue there, and
> then we went back and revisited it, and we decided there wasn't ;-(

Well, I looked at the problem now and here is the fix :)

Greg, Please consider this for stable release also.

Thanks,
Badari

ext2 reservation fix - Alexey Dobriyan reported ext2 discard
reservation panic while ago (http://lkml.org/lkml/2007/3/2/138).
If ext2_new_inode() fails for any reason it would end up calling 
ext2_discard_reservation() (due to last iput). Normally, it does
nothing since we don't have a reservation window structure
allocated. But the NULL pointer check wouldn't work with  slab 
poisioning, and causes oops.

Fix is to initialize i_block_alloc_info to NULL in ext2_alloc_inode()
code instead of assuming that it would be NULL. Same fix already
exists in ext3 and ext4.

Signed-off-by: Badari Pulavarty <[EMAIL PROTECTED]>

 fs/ext2/super.c |1 +
 1 file changed, 1 insertion(+)

Index: linux-2.6.22/fs/ext2/super.c
===
--- linux-2.6.22.orig/fs/ext2/super.c   2007-07-08 16:32:17.0 -0700
+++ linux-2.6.22/fs/ext2/super.c2007-07-10 16:36:42.0 -0700
@@ -147,6 +147,7 @@ static struct inode *ext2_alloc_inode(st
ei->i_acl = EXT2_ACL_NOT_CACHED;
ei->i_default_acl = EXT2_ACL_NOT_CACHED;
 #endif
+   ei->i_block_alloc_info = NULL;
ei->vfs_inode.i_version = 1;
return &ei->vfs_inode;
 }


-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 1][PATCH 1/2] Add noextents mount option

2007-07-10 Thread Andrew Morton

On Sun, 01 Jul 2007 03:35:48 -0400
Mingming Cao <[EMAIL PROTECTED]> wrote:

> Add a mount option to turn off extents.

Please update the changelog to describe the reason for making this change.


> Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>
> ---
> Index: linux-2.6.22-rc4/fs/ext4/super.c
> ===
> --- linux-2.6.22-rc4.orig/fs/ext4/super.c 2007-06-11 17:02:18.0 
> -0700
> +++ linux-2.6.22-rc4/fs/ext4/super.c  2007-06-11 17:02:22.0 -0700
> @@ -725,7 +725,7 @@
>   Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
>   Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota,
>   Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_usrquota,
> - Opt_grpquota, Opt_extents,
> + Opt_grpquota, Opt_extents, Opt_noextents,
>  };
>  
>  static match_table_t tokens = {
> @@ -776,6 +776,7 @@
>   {Opt_usrquota, "usrquota"},
>   {Opt_barrier, "barrier=%u"},
>   {Opt_extents, "extents"},
> + {Opt_noextents, "noextents"},
>   {Opt_err, NULL},
>   {Opt_resize, "resize"},
>  };
> @@ -,6 +1112,9 @@
>   case Opt_extents:
>   set_opt (sbi->s_mount_opt, EXTENTS);
>   break;
> + case Opt_noextents:
> + clear_opt (sbi->s_mount_opt, EXTENTS);
> + break;
>   default:
>   printk (KERN_ERR
>   "EXT4-fs: Unrecognized mount option \"%s\" "
> 
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 1][PATCH 2/2] Enable extents by default for ext4dev

2007-07-10 Thread Andrew Morton

On Sun, 01 Jul 2007 03:36:01 -0400
Mingming Cao <[EMAIL PROTECTED]> wrote:

> Turn on extents feature by default in ext4 filesystem. User could use
> -o noextents to turn it off.
> 

Oh, there you go.

> 
> Index: linux-2.6.22-rc4/fs/ext4/super.c
> ===
> --- linux-2.6.22-rc4.orig/fs/ext4/super.c 2007-06-11 17:02:22.0 
> -0700
> +++ linux-2.6.22-rc4/fs/ext4/super.c  2007-06-11 17:03:09.0 -0700
> @@ -1546,6 +1546,12 @@
>  
>   set_opt(sbi->s_mount_opt, RESERVATION);
>  
> + /*
> +  * turn on extents feature by default in ext4 filesystem
> +  * User -o noextents to turn it off
> +  */
> + set_opt (sbi->s_mount_opt, EXTENTS);
> +

Broken coding style.

Please feed all the ext4 patches through scripts/checkpatch.pl (preferably
version 0.07 - see Andy's patch on lkml) and then consider addressing the
(quite large) number of mistakes which are detected.


-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 2][PATCH 1/5] cleanups: Propagate some i_flags to disk

2007-07-10 Thread Andrew Morton

On Sun, 01 Jul 2007 03:36:12 -0400
Mingming Cao <[EMAIL PROTECTED]> wrote:

> Propagate flags such as S_APPEND, S_IMMUTABLE, etc. from i_flags into
> ext4-specific i_flags. Hence, when someone sets these flags via a different
> interface than ioctl, they are stored correctly.
> 

This changelog is inadequate.  I had to hunt down the equivalent ext3
patch's changelog to understand the reasons for this change.  Please update
this patch's changelog using the below: 

ext3: copy i_flags to inode flags on write

A patch that stores inode flags such as S_IMMUTABLE, S_APPEND, etc.  from
i_flags to EXT3_I(inode)->i_flags when inode is written to disk.  The same
thing is done on GETFLAGS ioctl.

Quota code changes these flags on quota files (to make it harder for
sysadmin to screw himself) and these changes were not correctly propagated
into the filesystem (especially, lsattr did not show them and users were
wondering...).

Propagate flags such as S_APPEND, S_IMMUTABLE, etc.  from i_flags into
ext3-specific i_flags.  Hence, when someone sets these flags via a
different interface than ioctl, they are stored correctly.

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 2][PATCH 3/5] cleanups: set_jbd2_64bit_feature for >16TB ext4 fs

2007-07-10 Thread Andrew Morton

On Sun, 01 Jul 2007 03:36:32 -0400
Mingming Cao <[EMAIL PROTECTED]> wrote:

> Set the journals JBD2_FEATURE_INCOMPAT_64BIT on devices with more
> than 32bit block sizes during mount time.  This ensure proper record
> lenth when writing to the journal.

This patch isn't in Ted's kernel.org directory and hasn't been in -mm. 
Where did it come from?  Is something amiss with ext4 patch management?

> Signed-off-by: Jose R. Santos <[EMAIL PROTECTED]>
> Signed-off-by: Andreas Dilger <[EMAIL PROTECTED]>
> Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>
> Signed-off-by: Laurent Vivier <[EMAIL PROTECTED]>
> ---
>  fs/ext4/super.c |   11 +++
>  1 file changed, 11 insertions(+)
> 
> Index: linux-2.6.22-rc4/fs/ext4/super.c
> ===
> --- linux-2.6.22-rc4.orig/fs/ext4/super.c 2007-06-11 16:15:54.0 
> -0700
> +++ linux-2.6.22-rc4/fs/ext4/super.c  2007-06-11 16:16:10.0 -0700
> @@ -1804,6 +1804,13 @@

Please prepare patches using `diff -p'

>   goto failed_mount3;
>   }
>  
> + if (ext4_blocks_count(es) > 0xULL &&
> + !jbd2_journal_set_features(EXT4_SB(sb)->s_journal, 0, 0,
> +JBD2_FEATURE_INCOMPAT_64BIT)) {
> + printk(KERN_ERR "ext4: Failed to set 64-bit journal feature\n");
> + goto failed_mount4;
> + }

It is not appropriate for the text "ext4" to appear in a JBD2 message.

>   /* We have now updated the journal if required, so we can
>* validate the data journaling mode. */
>   switch (test_opt(sb, DATA_FLAGS)) {


-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 2][PATCH 5/5] cleanups: Export jbd2-debug via debugfs

2007-07-10 Thread Andrew Morton

On Sun, 01 Jul 2007 03:36:48 -0400
Mingming Cao <[EMAIL PROTECTED]> wrote:

> > On Jun 07, 2007  23:45 -0500, Jose R. Santos wrote:
> > > The jbd2-debug file used to be located in /proc/sys/fs/jbd2-debug, but
> > > create_proc_entry() does not do lookups on file names with more that one
> > > directory deep.  This causes the entry creation to fail and hence, no proc
> > > file is created.  This patch moves the file to /proc/jbd2-degug.
> > > 
> > > The file could be move to /proc/fs/jbd2/jbd2-debug, but it would require
> > > some minor alterations to the jbd-stats patch.
> > 
> > I don't think we really want to be adding top-level files in /proc.
> > What about using the "debugfs" filesystem (not to be confused with
> > the e2fsprogs 'debugfs' command)?
> 
> How about this then?  Moved the file to use debugfs as well as having
> the nice effect of removing more lines than what it adds.
> 
> Signed-off-by: Jose R. Santos <[EMAIL PROTECTED]>

Please clean up the changelog.

The changelog should include information about the location and the content
of these debugfs files.  it should provide any instructions which users
will need to be able to create and use those files.

Alternatively (and preferably) do this via an update to
Documentation/filesystems/ext4.txt.

>  fs/jbd2/journal.c|   6220 +42 -0 !
>  include/linux/jbd2.h |21 + 1 - 0 !
>  2 files changed, 21 insertions(+), 43 deletions(-)

Again, this patch isn't in Ted's kernel.org directory and hasn't been in -mm.

Apart from the lack of testing and review which this causes, it means I
can't just do `pushpatch name-of-this-patch' and look at it in tkdiff.  So
I squint at the diff, but that's harder when the diff wasn't prepared with
`diff -p'.  Oh well.


> Index: linux-2.6.22-rc4/fs/jbd2/journal.c
> ===
> --- linux-2.6.22-rc4.orig/fs/jbd2/journal.c   2007-06-11 16:16:18.0 
> -0700
> +++ linux-2.6.22-rc4/fs/jbd2/journal.c2007-06-11 16:36:10.0 
> -0700
> @@ -35,6 +35,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -1954,60 +1955,37 @@
>   * /proc tunables
>   */
>  #if defined(CONFIG_JBD2_DEBUG)
> -int jbd2_journal_enable_debug;
> +u16 jbd2_journal_enable_debug;
>  EXPORT_SYMBOL(jbd2_journal_enable_debug);
>  #endif
>  
> -#if defined(CONFIG_JBD2_DEBUG) && defined(CONFIG_PROC_FS)
> +#if defined(CONFIG_JBD2_DEBUG) && defined(CONFIG_DEBUG_FS)

Has this been compile-tested with CONFIG_DEBUGFS=n?

>  
> -#define create_jbd_proc_entry() do {} while (0)
> -#define jbd2_remove_jbd_proc_entry() do {} while (0)
> +#define jbd2_create_debugfs_entry() do {} while (0)
> +#define jbd2_remove_debugfs_entry() do {} while (0)

I suggest that these be converted to (preferable) inline functions while
you're there.

>  #endif
>  
> @@ -2067,7 +2045,7 @@
>   ret = journal_init_caches();
>   if (ret != 0)
>   jbd2_journal_destroy_caches();
> - create_jbd_proc_entry();
> + jbd2_create_debugfs_entry();
>   return ret;
>  }
>  
> @@ -2078,7 +2056,7 @@
>   if (n)
>   printk(KERN_EMERG "JBD: leaked %d journal_heads!\n", n);
>  #endif
> - jbd2_remove_jbd_proc_entry();
> + jbd2_remove_debugfs_entry();
>   jbd2_journal_destroy_caches();
>  }
>  
> Index: linux-2.6.22-rc4/include/linux/jbd2.h
> ===
> --- linux-2.6.22-rc4.orig/include/linux/jbd2.h2007-06-11 
> 16:16:18.0 -0700
> +++ linux-2.6.22-rc4/include/linux/jbd2.h 2007-06-11 16:35:25.0 
> -0700
> @@ -57,7 +57,7 @@
>   * CONFIG_JBD2_DEBUG is on.
>   */
>  #define JBD_EXPENSIVE_CHECKING

JBD2?

> -extern int jbd2_journal_enable_debug;
> +extern u16 jbd2_journal_enable_debug;

Why was this made 16-bit?  To save 2 bytes?  Could have saved 3 if we're
going to do that.


Shoudln't all this debug info be a per-superblock thing rather than
kernel-wide?
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 3][PATCH 1/1] ext4 nanosecond timestamp

2007-07-10 Thread Andrew Morton

On Sun, 01 Jul 2007 03:36:56 -0400
Mingming Cao <[EMAIL PROTECTED]> wrote:

> This patch is a spinoff of the old nanosecond patches.

I don't know what the "old nanosecond patches" are.  A link to a suitable
changlog for those patches would do in a pinch.  Preferable would be to
write a proper changelog for this patch.

> It includes some cleanups and addition of a creation timestamp. The
> EXT3_FEATURE_RO_COMPAT_EXTRA_ISIZE flag has also been added along with
> s_{min, want}_extra_isize fields in struct ext3_super_block.
> 
> Signed-off-by: Andreas Dilger <[EMAIL PROTECTED]>
> Signed-off-by: Kalpak Shah <[EMAIL PROTECTED]>
> Signed-off-by: Eric Sandeen <[EMAIL PROTECTED]>
> Signed-off-by: Dave Kleikamp <[EMAIL PROTECTED]>
> Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>
> 
> Index: linux-2.6.22-rc4/fs/ext4/ialloc.c

Please include diffstat output when preparing patches.

> +
> +#define EXT4_FITS_IN_INODE(ext4_inode, einode, field)\
> + ((offsetof(typeof(*ext4_inode), field) +\
> +   sizeof((ext4_inode)->field))  \
> + <= (EXT4_GOOD_OLD_INODE_SIZE +  \
> + (einode)->i_extra_isize))   \

Please add explanatory commentary to EXT4_FITS_IN_INODE(): tell readers
under what circumstances something will not fit in an inode and what the
consequences of this are.

> +static inline __le32 ext4_encode_extra_time(struct timespec *time)
> +{
> +   return cpu_to_le32((sizeof(time->tv_sec) > 4 ?
> +time->tv_sec >> 32 : 0) |
> +((time->tv_nsec << 2) & EXT4_NSEC_MASK));
> +}
> +
> +static inline void ext4_decode_extra_time(struct timespec *time, __le32 
> extra)
> +{
> +   if (sizeof(time->tv_sec) > 4)
> +time->tv_sec |= (__u64)(le32_to_cpu(extra) & EXT4_EPOCH_MASK)
> +<< 32;
> +   time->tv_nsec = (le32_to_cpu(extra) & EXT4_NSEC_MASK) >> 2;
> +}

Consider uninlining these functions.

> +#define EXT4_INODE_SET_XTIME(xtime, inode, raw_inode)
>\
> +do {\
> + (raw_inode)->xtime = cpu_to_le32((inode)->xtime.tv_sec);   \
> + if (EXT4_FITS_IN_INODE(raw_inode, EXT4_I(inode), xtime ## _extra)) \
> + (raw_inode)->xtime ## _extra = \
> + ext4_encode_extra_time(&(inode)->xtime);   \
> +} while (0)
> +
> +#define EXT4_EINODE_SET_XTIME(xtime, einode, raw_inode)  
>\
> +do {\
> + if (EXT4_FITS_IN_INODE(raw_inode, einode, xtime))  \
> + (raw_inode)->xtime = cpu_to_le32((einode)->xtime.tv_sec);  \
> + if (EXT4_FITS_IN_INODE(raw_inode, einode, xtime ## _extra))\
> + (raw_inode)->xtime ## _extra = \
> + ext4_encode_extra_time(&(einode)->xtime);  \
> +} while (0)
> +
> +#define EXT4_INODE_GET_XTIME(xtime, inode, raw_inode)
>\
> +do {\
> + (inode)->xtime.tv_sec = le32_to_cpu((raw_inode)->xtime);   \
> + if (EXT4_FITS_IN_INODE(raw_inode, EXT4_I(inode), xtime ## _extra)) \
> + ext4_decode_extra_time(&(inode)->xtime,\
> +raw_inode->xtime ## _extra);\
> +} while (0)
> +
> +#define EXT4_EINODE_GET_XTIME(xtime, einode, raw_inode)  
>\
> +do {\
> + if (EXT4_FITS_IN_INODE(raw_inode, einode, xtime))  \
> + (einode)->xtime.tv_sec = le32_to_cpu((raw_inode)->xtime);  \
> + if (EXT4_FITS_IN_INODE(raw_inode, einode, xtime ## _extra))\
> + ext4_decode_extra_time(&(einode)->xtime,   \
> +raw_inode->xtime ## _extra);\
> +} while (0)

Ugly.  I expect these could be implemented as plain old C functions. 
Caller could pass in the address of the ext4_inode field which the function
is to operate upon.

>  #if defined(__KERNEL__) || defined(__linux__)

(What's the __linux__ for?)

>  #define i_reserved1  osd1.linux1.l_i_reserved1
>  #define i_frag   osd2.linux2.l_i_frag
> @@ -539,6 +603,13 @@
>   return container_of(inode, struct ext4_inode_info, vfs_inode);
>  }
>  
> +static inline struct timespec ext4_current_time(struct inode *inode)
> +{
> + return (inode->i_sb->s_time_gran < NSEC_PER_SEC) ?
> + current_fs_time(inode->i_sb) : CURRENT_TIME_SEC;
> +}

Now, I've forgotten how this works.  Remind me, please.  Can ext4
filesystems ever have one-second timestamp granularity?  If so, how

Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

2007-07-10 Thread Andrew Morton

On Sun, 01 Jul 2007 03:37:04 -0400
Mingming Cao <[EMAIL PROTECTED]> wrote:

> This patch converts the 32-bit i_version in the generic inode to a 64-bit
> i_version field.
> 

That's obvious from the patch.  But what was the reason for making this
(unrelated to ext4) change?

Please update the changelog for this.

> Index: linux-2.6.21/include/linux/fs.h
> ===
> --- linux-2.6.21.orig/include/linux/fs.h
> +++ linux-2.6.21/include/linux/fs.h
> @@ -549,7 +549,7 @@ struct inode {
>   uid_t   i_uid;
>   gid_t   i_gid;
>   dev_t   i_rdev;
> - unsigned long   i_version;
> + u64 i_version;
>   loff_t  i_size;
>  #ifdef __NEED_I_SIZE_ORDERED
>   seqcount_t  i_size_seqcount;

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 4][PATCH 2/5] i_version: Add hi 32 bit inode version on ext4 on-disk inode

2007-07-10 Thread Andrew Morton

On Sun, 01 Jul 2007 03:37:16 -0400
Mingming Cao <[EMAIL PROTECTED]> wrote:

> This patch adds a 32-bit i_version_hi field to ext4_inode, which can be used 
> for 64-bit inode versions. This field will store the higher 32 bits of the 
> version, while Jean Noel's patch has added support to store the lower 32-bits 
> in osd1.linux1.l_i_version.

Please wordwrap this changelog entry to less than 80 columns.  Well, less
than 258, anyway ;)

> Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>
> Signed-off-by: Andreas Dilger <[EMAIL PROTECTED]>
> Signed-off-by: Kalpak Shah <[EMAIL PROTECTED]>
> ---
> Index: linux-2.6.21/include/linux/ext4_fs.h
> ===
> --- linux-2.6.21.orig/include/linux/ext4_fs.h
> +++ linux-2.6.21/include/linux/ext4_fs.h
> @@ -342,6 +342,7 @@ struct ext4_inode {
>   __le32  i_atime_extra;  /* extra Access time  (nsec << 2 | epoch) */
>   __le32  i_crtime;   /* File Creation time */
>   __le32  i_crtime_extra; /* extra FileCreationtime (nsec << 2 | epoch) */
> + __le32  i_version_hi;   /* high 32 bits for 64-bit version */
>  };

Aren't there forward- backward-compatibility issues here?  How does the
filesystem driver work out whether this field is present and valid?

The changelog should describe this design issue.
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 2][PATCH 2/5] cleanups: Add extent sanity checks

2007-07-10 Thread Andrew Morton

On Sun, 01 Jul 2007 03:36:22 -0400
Mingming Cao <[EMAIL PROTECTED]> wrote:

> with the patch all headers are checked. the code should become
> more resistant to on-disk corruptions. needless BUG_ON() have
> been removed. please, review for inclusion.
> 
> ...

> @@ -269,6 +239,70 @@
>   return size;
>  }
>  
> +static inline int
> +ext4_ext_max_entries(struct inode *inode, int depth)

Please remove the `inline'.

`inline' is usually wrong.  It is wrong here.  If this function has a
single callsite (it has) then the compiler will inline it.  If the function
later gains more callsites then the compiler will know (correctly) not to
inline it.

We hope.  If we find the compiler still inlines a function as large as this
one then we'd need to use noinline and complain at the gcc developers.

> +{
> + int max;
> +
> + if (depth == ext_depth(inode)) {
> + if (depth == 0)
> + max = ext4_ext_space_root(inode);
> + else
> + max = ext4_ext_space_root_idx(inode);
> + } else {
> + if (depth == 0)
> + max = ext4_ext_space_block(inode);
> + else
> + max = ext4_ext_space_block_idx(inode);
> + }
> +
> + return max;
> +}
> +
> +static int __ext4_ext_check_header(const char *function, struct inode *inode,
> + struct ext4_extent_header *eh,
> + int depth)
> +{
> + const char *error_msg = NULL;

Unneeded initialisation.

> + int max = 0;
> +
> + if (unlikely(eh->eh_magic != EXT4_EXT_MAGIC)) {
> + error_msg = "invalid magic";
> + goto corrupted;
> + }
> + if (unlikely(le16_to_cpu(eh->eh_depth) != depth)) {
> + error_msg = "unexpected eh_depth";
> + goto corrupted;
> + }
> + if (unlikely(eh->eh_max == 0)) {
> + error_msg = "invalid eh_max";
> + goto corrupted;
> + }
> + max = ext4_ext_max_entries(inode, depth);
> + if (unlikely(le16_to_cpu(eh->eh_max) > max)) {
> + error_msg = "too large eh_max";
> + goto corrupted;
> + }
> + if (unlikely(le16_to_cpu(eh->eh_entries) > le16_to_cpu(eh->eh_max))) {
> + error_msg = "invalid eh_entries";
> + goto corrupted;
> + }
> + return 0;
> +
> +corrupted:
> + ext4_error(inode->i_sb, function,
> + "bad header in inode #%lu: %s - magic %x, "
> + "entries %u, max %u(%u), depth %u(%u)",
> + inode->i_ino, error_msg, le16_to_cpu(eh->eh_magic),
> + le16_to_cpu(eh->eh_entries), le16_to_cpu(eh->eh_max),
> + max, le16_to_cpu(eh->eh_depth), depth);
> +
> + return -EIO;
> +}
> +
>
> ...
>
> + i = depth = ext_depth(inode);
>

Mulitple assignments like this are somewhat unpopular from the coding-style
POV.  

We have a local variable called `i' which is not used as a simple counter
and which has no explanatory comment.  This is plain bad.  Please find a
better name for this variable.  Review the code for other such lack of
clarity - I'm sure there's more.


> - BUG_ON(le16_to_cpu(eh->eh_entries) > le16_to_cpu(eh->eh_max));
> - BUG_ON(eh->eh_magic != EXT4_EXT_MAGIC);

Yeah, this patch improves things a lot.

How well-tested is it?  Have any of these errors actually been triggered?

>   path[i].p_hdr = ext_block_hdr(path[i].p_bh);
> - if (ext4_ext_check_header(__FUNCTION__, inode,
> - path[i].p_hdr)) {
> - err = -EIO;
> - goto out;
> - }
>   }
>  
> - BUG_ON(le16_to_cpu(path[i].p_hdr->eh_entries)
> -> le16_to_cpu(path[i].p_hdr->eh_max));
> - BUG_ON(path[i].p_hdr->eh_magic != EXT4_EXT_MAGIC);
> -
>   if (!path[i].p_idx) {
>   /* this level hasn't been touched yet */
>   path[i].p_idx = EXT_LAST_INDEX(path[i].p_hdr);
> @@ -1873,17 +1890,24 @@
>   i, EXT_FIRST_INDEX(path[i].p_hdr),
>   path[i].p_idx);
>   if (ext4_ext_more_to_rm(path + i)) {
> + struct buffer_head *bh;
>   /* go to the next level */
>   ext_debug("move to level %d (block %llu)\n",
> i + 1, idx_pblock(path[i].p_idx));
>   memset(path + i + 1, 0, sizeof(*path));
> - path[i+1].p_bh =
> - sb_bread(sb, idx_pblock(path[i].p_idx));
> - if (!path[i+1].p_bh) {
> + bh = sb_bread(sb, idx_pblock(path[i].p_idx));
> + if (!bh) {
>   /* should we reset i_size? */
>

Re: [EXT4 set 4][PATCH 3/5] i_version:ext4 inode version read/store

2007-07-10 Thread Andrew Morton

On Sun, 01 Jul 2007 03:37:36 -0400
Mingming Cao <[EMAIL PROTECTED]> wrote:

> This patch adds 64-bit inode version support to ext4. The lower 32 bits
> are stored in the osd1.linux1.l_i_version field while the high 32 bits
> are stored in the i_version_hi field newly created in the ext4_inode.

So reading the code here does serve to answer the question I raised against
the earlier patch.  A bit.

I'd have thought that this patch and the one which adds i_version_hi should
be folded into a single diff?

> 
> Index: linux-2.6.21/fs/ext4/inode.c
> ===
> --- linux-2.6.21.orig/fs/ext4/inode.c
> +++ linux-2.6.21/fs/ext4/inode.c
> @@ -2709,6 +2709,13 @@ void ext4_read_inode(struct inode * inod
>   EXT4_INODE_GET_XTIME(i_atime, inode, raw_inode);
>   EXT4_EINODE_GET_XTIME(i_crtime, ei, raw_inode);
> 
> + inode->i_version = le32_to_cpu(raw_inode->i_disk_version);
> + if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE) {
> + if (EXT4_FITS_IN_INODE(raw_inode, ei, i_version_hi))
> + inode->i_version |=
> + (__u64)(le32_to_cpu(raw_inode->i_version_hi)) << 32;





> + }

I don't quite see how the above two tests are sufficient to unambiguously
determine that the i_version_hi field is present on-disk.

I guess we're implicitly assuming that if the on-disk inode is big enough
then it _must_ have i_version_hi in there?  If so, why is the comparison
with EXT4_GOOD_OLD_INODE_SIZE needed?

Some description of the overall approach to inode version identification
would be helpful here.

>   if (S_ISREG(inode->i_mode)) {
>   inode->i_op = &ext4_file_inode_operations;
>   inode->i_fop = &ext4_file_operations;
> @@ -2852,8 +2859,14 @@ static int ext4_do_update_inode(handle_t
>   } else for (block = 0; block < EXT4_N_BLOCKS; block++)
>   raw_inode->i_block[block] = ei->i_data[block];
> 
> - if (ei->i_extra_isize)
> + raw_inode->i_disk_version = cpu_to_le32(inode->i_version);
> + if (ei->i_extra_isize) {
> + if (EXT4_FITS_IN_INODE(raw_inode, ei, i_version_hi)) {

There's no comparison with EXT4_GOOD_OLD_INODE_SIZE here...

> + raw_inode->i_version_hi =
> + cpu_to_le32(inode->i_version >> 32);
> + }
>   raw_inode->i_extra_isize = cpu_to_le16(ei->i_extra_isize);
> + }
> 

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 4][PATCH 4/5] i_version:ext4 inode version update

2007-07-10 Thread Andrew Morton

On Sun, 01 Jul 2007 03:37:45 -0400
Mingming Cao <[EMAIL PROTECTED]> wrote:

> This patch is on top of i_version_update_vfs.
> The i_version field of the inode is set on inode creation and incremented when
> the inode is being modified.
> 

Again, I don't think I've ever seen this patch before.  It is at least a
month old.

> 
> Index: linux-2.6.22-rc4/fs/ext4/ialloc.c
> ===
> --- linux-2.6.22-rc4.orig/fs/ext4/ialloc.c2007-06-13 17:16:28.0 
> -0700
> +++ linux-2.6.22-rc4/fs/ext4/ialloc.c 2007-06-13 17:24:45.0 -0700
> @@ -565,6 +565,7 @@ got:
>   inode->i_blocks = 0;
>   inode->i_mtime = inode->i_atime = inode->i_ctime = ei->i_crtime =
>  ext4_current_time(inode);
> + inode->i_version = 1;
>  
>   memset(ei->i_data, 0, sizeof(ei->i_data));
>   ei->i_dir_start_lookup = 0;
> Index: linux-2.6.22-rc4/fs/ext4/inode.c
> ===
> --- linux-2.6.22-rc4.orig/fs/ext4/inode.c 2007-06-13 17:21:29.0 
> -0700
> +++ linux-2.6.22-rc4/fs/ext4/inode.c  2007-06-13 17:24:45.0 -0700
> @@ -3082,6 +3082,7 @@ int ext4_mark_iloc_dirty(handle_t *handl
>  {
>   int err = 0;
>  
> + inode->i_version++;
>   /* the do_update_inode consumes one bh->b_count */
>   get_bh(iloc->bh);
>  
> Index: linux-2.6.22-rc4/fs/ext4/super.c
> ===
> --- linux-2.6.22-rc4.orig/fs/ext4/super.c 2007-06-13 17:19:11.0 
> -0700
> +++ linux-2.6.22-rc4/fs/ext4/super.c  2007-06-13 17:24:45.0 -0700
> @@ -2846,8 +2846,8 @@ out:
>   i_size_write(inode, off+len-towrite);
>   EXT4_I(inode)->i_disksize = inode->i_size;
>   }
> - inode->i_version++;
>   inode->i_mtime = inode->i_ctime = CURRENT_TIME;
> + inode->i_version = 1;
>   ext4_mark_inode_dirty(handle, inode);
>   mutex_unlock(&inode->i_mutex);
>   return len - towrite;

ext4 already has code to update i_version on directories.  Here we appear
to be udpating it on regular files?

But for what reason?  The changelog doesn't say?

AFAICT the code forgets to update i_version during file overwrites (unless
the overwrite was over a hole).  But without a decent description of this
change I cannot tell whether this was a bug.

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 4][PATCH 5/5] i_version: noversion mount option to disable inode version updates

2007-07-10 Thread Andrew Morton

On Sun, 01 Jul 2007 03:37:53 -0400
Mingming Cao <[EMAIL PROTECTED]> wrote:

> Add a "noversion" mount option to disable inode version updates.

Why is this option being offered to our users?  To reduce disk traffic,
like noatime?

If so, what are the implications of this?  What would the user lose?

> Index: linux-2.6.21/fs/ext4/super.c
> ===
> --- linux-2.6.21.orig/fs/ext4/super.c
> +++ linux-2.6.21/fs/ext4/super.c
> @@ -725,7 +725,7 @@ enum {
>   Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
>   Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota,
>   Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_usrquota,
> - Opt_grpquota, Opt_extents, Opt_noextents,
> + Opt_grpquota, Opt_extents, Opt_noextents, Opt_noversion,
>  };
> 
>  static match_table_t tokens = {
> @@ -777,6 +777,7 @@ static match_table_t tokens = {
>   {Opt_barrier, "barrier=%u"},
>   {Opt_extents, "extents"},
>   {Opt_noextents, "noextents"},
> + {Opt_noversion, "noversion"},
>   {Opt_err, NULL},
>   {Opt_resize, "resize"},
>  };
> @@ -1115,6 +1116,9 @@ clear_qf_name:
>   case Opt_noextents:
>   clear_opt (sbi->s_mount_opt, EXTENTS);
>   break;
> + case Opt_noversion:
> + set_opt(sbi->s_mount_opt, NOVERSION);
> + break;
>   default:
>   printk (KERN_ERR
>   "EXT4-fs: Unrecognized mount option \"%s\" "
> Index: linux-2.6.21/include/linux/ext4_fs.h
> ===
> --- linux-2.6.21.orig/include/linux/ext4_fs.h
> +++ linux-2.6.21/include/linux/ext4_fs.h
> @@ -473,6 +473,7 @@ do {  
>\
>  #define EXT4_MOUNT_USRQUOTA  0x10 /* "old" user quota */
>  #define EXT4_MOUNT_GRPQUOTA  0x20 /* "old" group quota */
>  #define EXT4_MOUNT_EXTENTS   0x40 /* Extents support */
> +#define EXT4_MOUNT_NOVERSION 0x80 /* No inode version updates */
> 
>  /* Compatibility, for having both ext2_fs.h and ext4_fs.h included at once */
>  #ifndef _LINUX_EXT2_FS_H
> Index: linux-2.6.21/fs/ext4/inode.c
> ===
> --- linux-2.6.21.orig/fs/ext4/inode.c
> +++ linux-2.6.21/fs/ext4/inode.c
> @@ -3082,7 +3082,9 @@ int ext4_mark_iloc_dirty(handle_t *handl
>  {
>   int err = 0;
> 
> - inode->i_version++;
> + if (!test_opt(inode->i_sb, NOVERSION))
> + inode->i_version++;
> +
>   /* the do_update_inode consumes one bh->b_count */
>   get_bh(iloc->bh);

An update to Documentation/filesystems/ext4.txt would be an appropriate
way in which to address the above questions.

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 5][PATCH 1/1] expand inode i_extra_isize to support features in larger inode

2007-07-10 Thread Andrew Morton

On Sun, 01 Jul 2007 03:38:01 -0400
Mingming Cao <[EMAIL PROTECTED]> wrote:

> This patch is on top of the nanosecond timestamp and i_version_hi
> patches. 

This sort of information isn't needed (or desired) when this patch hits the
git tree.  Please ensure that things like this are cleaned up before the
patches go to Linus.

> We need to make sure that existing filesystems can also avail the new
> fields that have been added to the inode. We use s_want_extra_isize and
> s_min_extra_isize to decide by how much we should expand the inode. If
> EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE feature is set then we expand by
> max(s_want_extra_isize, s_min_extra_isize , sizeof(ext4_inode) -
> EXT4_GOOD_OLD_INODE_SIZE) bytes. Actually it is still an open question
> about whether users should be able to set s_*_extra_isize smaller than
> the known fields or not. 
> 
> This patch also adds the functionality to expand inodes to include the
> newly added fields. We start by trying to expand by s_want_extra_isize
> bytes and if its fails we try to expand by s_min_extra_isize bytes. This
> is done by changing the i_extra_isize if enough space is available in
> the inode and no EAs are present. If EAs are present and there is enough
> space in the inode then the EAs in the inode are shifted to make space.
> If enough space is not available in the inode due to the EAs then 1 or
> more EAs are shifted to the external EA block. In the worst case when
> even the external EA block does not have enough space we inform the user
> that some EA would need to be deleted or s_min_extra_isize would have to
> be reduced.
> 
> This would be online expansion of inodes. I am also working on adding an
> "expand_inodes" option to e2fsck which will expand all the used inodes.
> 
> Signed-off-by: Andreas Dilger <[EMAIL PROTECTED]>
> Signed-off-by: Kalpak Shah <[EMAIL PROTECTED]>
> Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>
> 
> Index: linux-2.6.22-rc4/fs/ext4/inode.c
> ===
> --- linux-2.6.22-rc4.orig/fs/ext4/inode.c 2007-06-14 17:32:27.0 
> -0700
> +++ linux-2.6.22-rc4/fs/ext4/inode.c  2007-06-14 17:32:41.0 -0700
> @@ -3120,6 +3120,40 @@ ext4_reserve_inode_write(handle_t *handl
>  }
>  
>  /*
> + * Expand an inode by new_extra_isize bytes.
> + * Returns 0 on success or negative error number on failure.
> + */
> +int ext4_expand_extra_isize(struct inode *inode, unsigned int 
> new_extra_isize,
> + struct ext4_iloc iloc, handle_t *handle)
> +{
> + struct ext4_inode *raw_inode;
> + struct ext4_xattr_ibody_header *header;
> + struct ext4_xattr_entry *entry;
> +
> + if (EXT4_I(inode)->i_extra_isize >= new_extra_isize) {
> + return 0;
> + }

This patch has lots of unneeded braces in it.  checkpatch appears to catch
them.


> + raw_inode = ext4_raw_inode(&iloc);
> +
> + header = IHDR(inode, raw_inode);
> + entry = IFIRST(header);
> +
> + /* No extended attributes present */
> + if (!(EXT4_I(inode)->i_state & EXT4_STATE_XATTR) ||
> + header->h_magic != cpu_to_le32(EXT4_XATTR_MAGIC)) {
> + memset((void *)raw_inode + EXT4_GOOD_OLD_INODE_SIZE, 0,
> + new_extra_isize);
> + EXT4_I(inode)->i_extra_isize = new_extra_isize;
> + return 0;
> + }
> +
> + /* try to expand with EA present */
> + return ext4_expand_extra_isize_ea(inode, new_extra_isize,
> + raw_inode, handle);
> +}
> +
> +/*
>   * What we do here is to mark the in-core inode as clean with respect to 
> inode
>   * dirtiness (it may still be data-dirty).
>   * This means that the in-core inode may be reaped by prune_icache
> @@ -3143,10 +3177,33 @@ ext4_reserve_inode_write(handle_t *handl
>  int ext4_mark_inode_dirty(handle_t *handle, struct inode *inode)
>  {
>   struct ext4_iloc iloc;
> - int err;
> + int err, ret;
> + static int expand_message;
>  
>   might_sleep();
>   err = ext4_reserve_inode_write(handle, inode, &iloc);
> + if (EXT4_I(inode)->i_extra_isize <
> + EXT4_SB(inode->i_sb)->s_want_extra_isize &&
> + !(EXT4_I(inode)->i_state & EXT4_STATE_NO_EXPAND)) {
> + /* We need extra buffer credits since we may write into EA block
> +  * with this same handle */
> + if ((jbd2_journal_extend(handle,
> +  EXT4_DATA_TRANS_BLOCKS(inode->i_sb))) == 0) {
> + ret = ext4_expand_extra_isize(inode,
> + EXT4_SB(inode->i_sb)->s_want_extra_isize,
> + iloc, handle);
> + if (ret) {
> + EXT4_I(inode)->i_state |= EXT4_STATE_NO_EXPAND;
> + if (!expand_message) {
> + ext4_warning(inode->i_sb, __FUNCTION__,
> + "Unable

Re: [EXT4 set 1][PATCH 1/2] Add noextents mount option

2007-07-10 Thread Mingming Cao

On Tue, 2007-07-10 at 16:30 -0700, Andrew Morton wrote:
> On Sun, 01 Jul 2007 03:35:48 -0400
> Mingming Cao <[EMAIL PROTECTED]> wrote:
> 
> > Add a mount option to turn off extents.
> 
> Please update the changelog to describe the reason for making this change.
> 
> 

Sure, I will update the changelog, mainly this is for wide testing of
extents feature in ext4dev.

> > Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>
> > ---
> > Index: linux-2.6.22-rc4/fs/ext4/super.c
> > ===
> > --- linux-2.6.22-rc4.orig/fs/ext4/super.c   2007-06-11 17:02:18.0 
> > -0700
> > +++ linux-2.6.22-rc4/fs/ext4/super.c2007-06-11 17:02:22.0 
> > -0700
> > @@ -725,7 +725,7 @@
> > Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
> > Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota,
> > Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_usrquota,
> > -   Opt_grpquota, Opt_extents,
> > +   Opt_grpquota, Opt_extents, Opt_noextents,
> >  };
> >  
> >  static match_table_t tokens = {
> > @@ -776,6 +776,7 @@
> > {Opt_usrquota, "usrquota"},
> > {Opt_barrier, "barrier=%u"},
> > {Opt_extents, "extents"},
> > +   {Opt_noextents, "noextents"},
> > {Opt_err, NULL},
> > {Opt_resize, "resize"},
> >  };
> > @@ -,6 +1112,9 @@
> > case Opt_extents:
> > set_opt (sbi->s_mount_opt, EXTENTS);
> > break;
> > +   case Opt_noextents:
> > +   clear_opt (sbi->s_mount_opt, EXTENTS);
> > +   break;
> > default:
> > printk (KERN_ERR
> > "EXT4-fs: Unrecognized mount option \"%s\" "
> > 
> > 
> > -
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to [EMAIL PROTECTED]
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 2][PATCH 3/5] cleanups: set_jbd2_64bit_feature for >16TB ext4 fs

2007-07-10 Thread Mingming Cao

On Tue, 2007-07-10 at 16:30 -0700, Andrew Morton wrote:
> On Sun, 01 Jul 2007 03:36:32 -0400
> Mingming Cao <[EMAIL PROTECTED]> wrote:
> 
> > Set the journals JBD2_FEATURE_INCOMPAT_64BIT on devices with more
> > than 32bit block sizes during mount time.  This ensure proper record
> > lenth when writing to the journal.
> 
> This patch isn't in Ted's kernel.org directory and hasn't been in -mm. 
> Where did it come from?  Is something amiss with ext4 patch management?
> 
Jose Santo posted it to linux-ext4 mailing list.

I agree this bug fix should included in Ted's git tree or mm tree. There
are other ext4 cleanups in this series that should goes to mm tree also.

> > Signed-off-by: Jose R. Santos <[EMAIL PROTECTED]>
> > Signed-off-by: Andreas Dilger <[EMAIL PROTECTED]>
> > Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>
> > Signed-off-by: Laurent Vivier <[EMAIL PROTECTED]>
> > ---
> >  fs/ext4/super.c |   11 +++
> >  1 file changed, 11 insertions(+)
> > 
> > Index: linux-2.6.22-rc4/fs/ext4/super.c
> > ===
> > --- linux-2.6.22-rc4.orig/fs/ext4/super.c   2007-06-11 16:15:54.0 
> > -0700
> > +++ linux-2.6.22-rc4/fs/ext4/super.c2007-06-11 16:16:10.0 
> > -0700
> > @@ -1804,6 +1804,13 @@
> 
> Please prepare patches using `diff -p'
> 

Will do.

> > goto failed_mount3;
> > }
> >  
> > +   if (ext4_blocks_count(es) > 0xULL &&
> > +   !jbd2_journal_set_features(EXT4_SB(sb)->s_journal, 0, 0,
> > +  JBD2_FEATURE_INCOMPAT_64BIT)) {
> > +   printk(KERN_ERR "ext4: Failed to set 64-bit journal feature\n");
> > +   goto failed_mount4;
> > +   }
> 
> It is not appropriate for the text "ext4" to appear in a JBD2 message.

This is part of ext4 code. Ext4 will set the 64-bit JBD2 flag if the fs
is larger than 32 bit blocks.

> > /* We have now updated the journal if required, so we can
> >  * validate the data journaling mode. */
> > switch (test_opt(sb, DATA_FLAGS)) {
> 
> 

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATHC] Fix for ext2 reservation (Re: -mm merge plans for 2.6.23)

2007-07-10 Thread Andrew Morton

On Tue, 10 Jul 2007 15:15:57 -0700
Badari Pulavarty <[EMAIL PROTECTED]> wrote:

> Well, I looked at the problem now and here is the fix :)

whee, thanks.

> Greg, Please consider this for stable release also.

No, it is only relevant to -mm's ext2-reservations.patch.

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATHC] Fix for ext2 reservation (Re: -mm merge plans for 2.6.23)

2007-07-10 Thread Badari Pulavarty

On Tue, 2007-07-10 at 16:55 -0700, Andrew Morton wrote:
> On Tue, 10 Jul 2007 15:15:57 -0700
> Badari Pulavarty <[EMAIL PROTECTED]> wrote:
> 
> > Well, I looked at the problem now and here is the fix :)
> 
> whee, thanks.
> 
> > Greg, Please consider this for stable release also.
> 
> No, it is only relevant to -mm's ext2-reservations.patch.

Yes. Sorry. I was looking at -mm tree and assumed that
ext2-reservation code made into mainline also :(

Greg, please ignore the patch for stable.

Thanks,
Badari

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATHC] Fix for ext2 reservation (Re: -mm merge plans for 2.6.23)

2007-07-10 Thread Greg KH

On Tue, Jul 10, 2007 at 05:04:17PM -0700, Badari Pulavarty wrote:
> On Tue, 2007-07-10 at 16:55 -0700, Andrew Morton wrote:
> > On Tue, 10 Jul 2007 15:15:57 -0700
> > Badari Pulavarty <[EMAIL PROTECTED]> wrote:
> > 
> > > Well, I looked at the problem now and here is the fix :)
> > 
> > whee, thanks.
> > 
> > > Greg, Please consider this for stable release also.
> > 
> > No, it is only relevant to -mm's ext2-reservations.patch.
> 
> Yes. Sorry. I was looking at -mm tree and assumed that
> ext2-reservation code made into mainline also :(
> 
> Greg, please ignore the patch for stable.

Ok, dropped from my -stable queue mbox.

thanks,

greg k-h
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 1][PATCH 2/2] Enable extents by default for ext4dev

2007-07-10 Thread Mingming Cao

On Tue, 2007-07-10 at 16:30 -0700, Andrew Morton wrote:
> On Sun, 01 Jul 2007 03:36:01 -0400
> Mingming Cao <[EMAIL PROTECTED]> wrote:
> 
> > Turn on extents feature by default in ext4 filesystem. User could use
> > -o noextents to turn it off.
> > 
> 
> Oh, there you go.
> 
> > 
> > Index: linux-2.6.22-rc4/fs/ext4/super.c
> > ===
> > --- linux-2.6.22-rc4.orig/fs/ext4/super.c   2007-06-11 17:02:22.0 
> > -0700
> > +++ linux-2.6.22-rc4/fs/ext4/super.c2007-06-11 17:03:09.0 
> > -0700
> > @@ -1546,6 +1546,12 @@
> >  
> > set_opt(sbi->s_mount_opt, RESERVATION);
> >  
> > +   /*
> > +* turn on extents feature by default in ext4 filesystem
> > +* User -o noextents to turn it off
> > +*/
> > +   set_opt (sbi->s_mount_opt, EXTENTS);
> > +
> 
> Broken coding style.
> 
> Please feed all the ext4 patches through scripts/checkpatch.pl (preferably
> version 0.07 - see Andy's patch on lkml) and then consider addressing the
> (quite large) number of mistakes which are detected.
> 

Sorry about this. I was using version 0.04. The latest one I can find
for now is 0.05(searching lkml), but it didn't catch this codling style
bug either. I appreciate if anyone can point me the version 0.07, thanks

Mingming

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

2007-07-10 Thread Mingming Cao

On Tue, 2007-07-10 at 16:30 -0700, Andrew Morton wrote:
> On Sun, 01 Jul 2007 03:37:04 -0400
> Mingming Cao <[EMAIL PROTECTED]> wrote:
> 
> > This patch converts the 32-bit i_version in the generic inode to a 64-bit
> > i_version field.
> > 
> 
> That's obvious from the patch.  But what was the reason for making this
> (unrelated to ext4) change?
> 

The need is came from NFSv4

On Fri, 2007-05-25 at 18:25 +0200, Jean noel Cordenner wrote: 
> Hi,
> 
> This is an update of the i_version patch.
> The i_version field is a 64bit counter that is set on every inode
> creation and that is incremented every time the inode data is modified
> (similarly to the "ctime" time-stamp).
> The aim is to fulfill a NFSv4 requirement for rfc3530:
> "5.5.  Mandatory Attributes - Definitions
> Name  #   DataType   Access   Description
> ___
> change3   uint64   READ A value created by the
>   server that the client can use to determine if file
>   data, directory contents or attributes of the object
>   have been modified.  The servermay return the object's
>   time_metadata attribute for this attribute's value but
>   only if the filesystem object can not be updated more
>   frequently than the resolution of time_metadata.
> "
> 

> Please update the changelog for this.
> 

Is above description clear to you?


> > Index: linux-2.6.21/include/linux/fs.h
> > ===
> > --- linux-2.6.21.orig/include/linux/fs.h
> > +++ linux-2.6.21/include/linux/fs.h
> > @@ -549,7 +549,7 @@ struct inode {
> > uid_t   i_uid;
> > gid_t   i_gid;
> > dev_t   i_rdev;
> > -   unsigned long   i_version;
> > +   u64 i_version;
> > loff_t  i_size;
> >  #ifdef __NEED_I_SIZE_ORDERED
> > seqcount_t  i_size_seqcount;
> 

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

2007-07-10 Thread Andrew Morton

On Tue, 10 Jul 2007 18:09:40 -0400 Mingming Cao <[EMAIL PROTECTED]> wrote:

> On Tue, 2007-07-10 at 16:30 -0700, Andrew Morton wrote:
> > On Sun, 01 Jul 2007 03:37:04 -0400
> > Mingming Cao <[EMAIL PROTECTED]> wrote:
> > 
> > > This patch converts the 32-bit i_version in the generic inode to a 64-bit
> > > i_version field.
> > > 
> > 
> > That's obvious from the patch.  But what was the reason for making this
> > (unrelated to ext4) change?
> > 
> 
> The need is came from NFSv4
> 
> On Fri, 2007-05-25 at 18:25 +0200, Jean noel Cordenner wrote: 
> > Hi,
> > 
> > This is an update of the i_version patch.
> > The i_version field is a 64bit counter that is set on every inode
> > creation and that is incremented every time the inode data is modified
> > (similarly to the "ctime" time-stamp).
> > The aim is to fulfill a NFSv4 requirement for rfc3530:
> > "5.5.  Mandatory Attributes - Definitions
> > Name#   DataType   Access   Description
> > ___
> > change  3   uint64   READ A value created by the
> > server that the client can use to determine if file
> > data, directory contents or attributes of the object
> > have been modified.  The servermay return the object's
> > time_metadata attribute for this attribute's value but
> > only if the filesystem object can not be updated more
> > frequently than the resolution of time_metadata.
> > "
> > 
> 
> > Please update the changelog for this.
> > 
> 
> Is above description clear to you?
> 

Yes, thanks.  It doesn't actually tell us why we want to implement
this attribute and it doesn't tell us what the implications of failing
to do so are, but I guess we can take that on trust from the NFS guys.

But I suspect the ext4 implementation doesn't actually do this.  afaict we
won't update i_version for file overwrites (especially if s_time_gran can
indeed be 1,000,000,000) and of course for MAP_SHARED modifications.  What
would be the implications of this?

And how does the NFS server know that the filesystem implements i_version? 
Will a zero-value of i_version have special significance, telling the
server to not send this attribute, perhaps?


-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/7] manpage for fallocate

2007-07-10 Thread Barry Naujok

On Wed, 11 Jul 2007 06:18:20 +1000, Amit K. Arora  
<[EMAIL PROTECTED]> wrote:



Following is the modified version of the manpage originally submitted by
David Chinner. Please use `nroff -man fallocate.2 | less` to view.


A few more touch-ups attached.

Regards,
Barry.

fallocate.2
Description: Binary data

Re: [PATCH 2/7] fallocate() implementation in i386, x86_64 and powerpc

2007-07-10 Thread Stephen Rothwell

On Wed, 11 Jul 2007 01:50:00 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> wrote:
>
> --- linux-2.6.22.orig/arch/x86_64/ia32/sys_ia32.c
> +++ linux-2.6.22/arch/x86_64/ia32/sys_ia32.c
> @@ -879,3 +879,11 @@ asmlinkage long sys32_fadvise64(int fd, 
>   return sys_fadvise64_64(fd, ((u64)offset_hi << 32) | offset_lo,
>   len, advice);
>  }
> +
> +asmlinkage long sys32_fallocate(int fd, int mode, unsigned offset_lo,
> + unsigned offset_hi, unsigned len_lo,
> + unsigned len_hi)

Please call this compat_sys_fallocate in line with the powerpc version -
it gives us a hint that maybe we should think about how to consolidate
them.  I know other stuff in that file is called sys32_ ... but it is time
for a change :-)

-- 
Cheers,
Stephen Rothwell[EMAIL PROTECTED]
http://www.canb.auug.org.au/~sfr/


pgpV3hr3WqoAk.pgp
Description: PGP signature

Re: [EXT4 set 6][PATCH 1/1]Export jbd stats through procfs

2007-07-10 Thread Andrew Morton

On Sun, 01 Jul 2007 03:38:10 -0400 Mingming Cao <[EMAIL PROTECTED]> wrote:

> [PATCH] jbd2 stats through procfs
> 
> The patch below updates the jbd stats patch to 2.6.20/jbd2.
> The initial patch was posted by Alex Tomas in December 2005
> (http://marc.info/?l=linux-ext4&m=113538565128617&w=2).
> It provides statistics via procfs such as transaction lifetime and size.
> 
> [ This probably should be rewritten to use debugfs?   -- Ted]
> 

I suppose that given that we're creating a spot in debugfs for the jbd2
debug code, yes, this also should be moved over.

But the jbd2 debug debugfs entries were kernel-wide whereas this is
per-superblock.  I think.

That email from Alex contains pretty important information.  I suggest that
it be added to the changelog after accuracy-checking.  Addition to
Documentation/filesystems/ext4.txt would be good.

> --
> 
> Index: linux-2.6.22-rc4/include/linux/jbd2.h
> ===
> --- linux-2.6.22-rc4.orig/include/linux/jbd2.h2007-06-11 
> 17:28:17.0 -0700
> +++ linux-2.6.22-rc4/include/linux/jbd2.h 2007-06-13 10:45:21.0 
> -0700
> @@ -408,6 +408,16 @@
>  };
>  
>  
> +/*
> + * Some stats for checkpoint phase
> + */
> +struct transaction_chp_stats_s {
> + unsigned long   cs_chp_time;
> + unsigned long   cs_forced_to_close;
> + unsigned long   cs_written;
> + unsigned long   cs_dropped;
> +};

It would be nice to document what units all these fields are in.  Jiffies,
I assume.

>  /* The transaction_t type is the guts of the journaling mechanism.  It
>   * tracks a compound transaction through its various states:
>   *
> @@ -543,6 +553,21 @@
>   spinlock_t  t_handle_lock;
>  
>   /*
> +  * Longest time some handle had to wait for running transaction
> +  */
> + unsigned long   t_max_wait;
> +
> + /*
> +  * When transaction started
> +  */
> + unsigned long   t_start;
> +
> + /*
> +  * Checkpointing stats [j_checkpoint_sem]
> +  */
> + struct transaction_chp_stats_s t_chp_stats;
> +
> + /*
>* Number of outstanding updates running on this transaction
>* [t_handle_lock]
>*/
> @@ -573,6 +598,57 @@
>  
>  };
>  
> +struct transaction_run_stats_s {
> + unsigned long   rs_wait;
> + unsigned long   rs_running;
> + unsigned long   rs_locked;
> + unsigned long   rs_flushing;
> + unsigned long   rs_logging;
> +
> + unsigned long   rs_handle_count;
> + unsigned long   rs_blocks;
> + unsigned long   rs_blocks_logged;
> +};
> +
> +struct transaction_stats_s
> +{
> + int ts_type;
> + unsigned long   ts_tid;
> + union {
> + struct transaction_run_stats_s run;
> + struct transaction_chp_stats_s chp;
> + } u;
> +};
> +
> +#define JBD2_STATS_RUN   1
> +#define JBD2_STATS_CHECKPOINT2
> +
> +#define ts_wait  u.run.rs_wait
> +#define ts_running   u.run.rs_running
> +#define ts_lockedu.run.rs_locked
> +#define ts_flushing  u.run.rs_flushing
> +#define ts_logging   u.run.rs_logging
> +#define ts_handle_count  u.run.rs_handle_count
> +#define ts_blocksu.run.rs_blocks
> +#define ts_blocks_logged u.run.rs_blocks_logged
> +
> +#define ts_chp_time  u.chp.cs_chp_time
> +#define ts_forced_to_close   u.chp.cs_forced_to_close
> +#define ts_written   u.chp.cs_written
> +#define ts_dropped   u.chp.cs_dropped

That's a bit sleazy.  We can drop the "u" from 'struct transaction_stats_s'
and make it an anonymous union, then open-code foo.run.rs_wait everywhere.

But that's a bit sleazy too, because the reader of the code would not know
that a write to foo.run.rs_wait will stomp on the value of
foo.chp.cs_chp_time.

So to minimize reader confusion it would be best I think to just open-code
the full u.run.rs_wait at all code-sites.

The macros above are the worst possible choice: they hide information from
the code-reader just to save the code-writer a bit of typing.  But we very
much want to optimise for code-readers, not for code-writers.

> +#define CURRENT_MSECS(jiffies_to_msecs(jiffies))

hm, that isn't something which should be in an ext4 header file.  And it
shouldn't be in UPPER CASE and it shouldn't pretend to be a constant (or a
global scalar).

IOW: yuk.

How's about raising a separate, standalone patch which creates a new
kernel-wide coded-in-C function such as

unsigned long jiffies_in_msecs(void);

?  (That's assuming we don't already have one.  Most likely we have seven
of them hiding in various dark corners).

> +static inline unsigned int
> +jbd2_time_diff(unsigned int start, unsigned int end)
> +{
> + if (unlikely(start > end))
> + end = end + (~

Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

2007-07-10 Thread Mingming Cao

On Tue, 2007-07-10 at 18:22 -0700, Andrew Morton wrote:
> On Tue, 10 Jul 2007 18:09:40 -0400 Mingming Cao <[EMAIL PROTECTED]> wrote:
> 
> > On Tue, 2007-07-10 at 16:30 -0700, Andrew Morton wrote:
> > > On Sun, 01 Jul 2007 03:37:04 -0400
> > > Mingming Cao <[EMAIL PROTECTED]> wrote:
> > > 
> > > > This patch converts the 32-bit i_version in the generic inode to a 
> > > > 64-bit
> > > > i_version field.
> > > > 
> > > 
> > > That's obvious from the patch.  But what was the reason for making this
> > > (unrelated to ext4) change?
> > > 
> > 
> > The need is came from NFSv4
> > 
> > On Fri, 2007-05-25 at 18:25 +0200, Jean noel Cordenner wrote: 
> > > Hi,
> > > 
> > > This is an update of the i_version patch.
> > > The i_version field is a 64bit counter that is set on every inode
> > > creation and that is incremented every time the inode data is modified
> > > (similarly to the "ctime" time-stamp).
> > > The aim is to fulfill a NFSv4 requirement for rfc3530:
> > > "5.5.  Mandatory Attributes - Definitions
> > > Name  #   DataType   Access   Description
> > > ___
> > > change3   uint64   READ A value created by the
> > >   server that the client can use to determine if file
> > >   data, directory contents or attributes of the object
> > >   have been modified.  The servermay return the object's
> > >   time_metadata attribute for this attribute's value but
> > >   only if the filesystem object can not be updated more
> > >   frequently than the resolution of time_metadata.
> > > "
> > > 
> > 
> > > Please update the changelog for this.
> > > 
> > 
> > Is above description clear to you?
> > 
> 
> Yes, thanks.  It doesn't actually tell us why we want to implement
> this attribute and it doesn't tell us what the implications of failing
> to do so are, but I guess we can take that on trust from the NFS guys.
> 
> But I suspect the ext4 implementation doesn't actually do this.  afaict we
> won't update i_version for file overwrites (especially if s_time_gran can
> indeed be 1,000,000,000) and of course for MAP_SHARED modifications.  What
> would be the implications of this?
> 

In the case of overwrite (file date updated), I assume the ctime/mtime
is being updated and the inode is being dirtied, so the version number
is being updated.

 vfs_write()->..
->__generic_file_aio_write_nolock()
->file_update_time()
->mark_inode_dirty_sync()
->__mark_inode_dirty(I_DIRTY_SYNC)
->ext4_dirty_inode()
->ext4_mark_inode_dirty()

> And how does the NFS server know that the filesystem implements i_version? 
> Will a zero-value of i_version have special significance, telling the
> server to not send this attribute, perhaps?

Bruce raised up this question a few days back when he reviewed this
patch, I think the solution is add a superblock flag for fs support
inode versioning, probably at VFS layer?

Mingming

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 6][PATCH 1/1]Export jbd stats through procfs

2007-07-10 Thread Cédric Augonnet


2007/7/10, Andrew Morton <[EMAIL PROTECTED]>:

Hi all,


> + size = sizeof(struct transaction_stats_s);
> + s->stats = kmalloc(size, GFP_KERNEL);
> + if (s == NULL) {
> + kfree(s);
> + return -EIO;

ENOMEM


I'm sorry if i missed some point, but i just don't see the use of such
a kfree here, not sure Andrew meant you should only return ENOMEM
instead, but why issuing those kfree(NULL), instead of just a if (!s)
return ENOMEM ?

Regards,
Cédric
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

2007-07-10 Thread Neil Brown

On Tuesday July 10, [EMAIL PROTECTED] wrote:
> 
> Yes, thanks.  It doesn't actually tell us why we want to implement
> this attribute and it doesn't tell us what the implications of failing
> to do so are, but I guess we can take that on trust from the NFS guys.

You would like to think so, but remember NFSv4 was designed by a
committee :-)

The 'change' number is used for cache consistency, and as the spec
makes very strong statements about the 'change' number, it is very
hard (or impossible) to implement a server correctly without storing a
change number in stable storage (just one of my grips about V4).

> 
> But I suspect the ext4 implementation doesn't actually do this.  afaict we
> won't update i_version for file overwrites (especially if s_time_gran can
> indeed be 1,000,000,000) and of course for MAP_SHARED modifications.  What
> would be the implications of this?

The first part sounds like a bug - i_version should really be updated
by every call to ->commit_write (if that is still what it is called).

The MAP_SHARED thing is less obvious.  I guess every time we notice
that the page might have been changed, we need to increment i_version.

> 
> And how does the NFS server know that the filesystem implements i_version? 
> Will a zero-value of i_version have special significance, telling the
> server to not send this attribute, perhaps?

That is a very important question.  Zero probably makes sense, but
what ever it is needs to be agreed and documented.
And just by-the-way, the server doesn't really have the option of not
sending the attribute.  If i_version isn't defined, it has to fake
something using mtime, and hope that is good enough.

Alternately we could mandate that i_version is always kept up-to-date
and if a filesystem doesn't have anything to load from storage, it
just sets it to the current time in nanoseconds.

That would mean that a client would need to flush it's cache whenever
the inode fell out of cache on the server, but I don't think we can
reliably do better than that.

I think I like that approach.

So my vote is to increment i_version in common code every time any
change is made to the file, and alloc_inode should initialise it to
current time, which might be changed by the filesystem before it calls
unlock_new_inode. 
... but doesn't lustre want to control its i_version... so maybe not :-(

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

2007-07-10 Thread Trond Myklebust

On Wed, 2007-07-11 at 13:21 +1000, Neil Brown wrote:
> On Tuesday July 10, [EMAIL PROTECTED] wrote:
> > 
> > Yes, thanks.  It doesn't actually tell us why we want to implement
> > this attribute and it doesn't tell us what the implications of failing
> > to do so are, but I guess we can take that on trust from the NFS guys.
> 
> You would like to think so, but remember NFSv4 was designed by a
> committee :-)
> 
> The 'change' number is used for cache consistency, and as the spec
> makes very strong statements about the 'change' number, it is very
> hard (or impossible) to implement a server correctly without storing a
> change number in stable storage (just one of my grips about V4).

Well... It reflects a requirement that was just as present in the
caching models that we use for NFSv2/v3, but that we glossed over for
other reasons.
The real difference here is that v2/v3 caching model assumes that you
have sufficient resolution in the ctime/mtime to allow clients to detect
any changes to the file/directory contents, whereas NFSv4 allows you to
use an arbitrary variable (that may be the ctime, if it has sufficient
resolution) for the same purposes.

> > But I suspect the ext4 implementation doesn't actually do this.  afaict we
> > won't update i_version for file overwrites (especially if s_time_gran can
> > indeed be 1,000,000,000) and of course for MAP_SHARED modifications.  What
> > would be the implications of this?
> 
> The first part sounds like a bug - i_version should really be updated
> by every call to ->commit_write (if that is still what it is called).
> 
> The MAP_SHARED thing is less obvious.  I guess every time we notice
> that the page might have been changed, we need to increment i_version.

You need to increment is any time that you detect remotely visible
changes.
IOW: any change that posix mandates should result in a ctime update,
must also result in an update of i_version. The only difference is that
i_version must not suffer from the time resolution issues that ctime
does.

> > And how does the NFS server know that the filesystem implements i_version? 
> > Will a zero-value of i_version have special significance, telling the
> > server to not send this attribute, perhaps?
> 
> That is a very important question.  Zero probably makes sense, but
> what ever it is needs to be agreed and documented.
> And just by-the-way, the server doesn't really have the option of not
> sending the attribute.  If i_version isn't defined, it has to fake
> something using mtime, and hope that is good enough.
> 
> Alternately we could mandate that i_version is always kept up-to-date
> and if a filesystem doesn't have anything to load from storage, it
> just sets it to the current time in nanoseconds.
> 
> That would mean that a client would need to flush it's cache whenever
> the inode fell out of cache on the server, but I don't think we can
> reliably do better than that.
> 
> I think I like that approach.
> 
> So my vote is to increment i_version in common code every time any
> change is made to the file, and alloc_inode should initialise it to
> current time, which might be changed by the filesystem before it calls
> unlock_new_inode. 
> ... but doesn't lustre want to control its i_version... so maybe not :-(

If lustre wants to be exportable via pNFS (as Peter Braam has suggested
it should), then it had better be able to return a change attribute that
is compatible with the NFSv4.1 spec...

Trond

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 1][PATCH 2/2] Enable extents by default for ext4dev

2007-07-10 Thread Dave Jones

On Tue, Jul 10, 2007 at 05:35:13PM -0400, Mingming Cao wrote:
 > 
 > Sorry about this. I was using version 0.04. The latest one I can find
 > for now is 0.05(searching lkml), but it didn't catch this codling style
 > bug either. I appreciate if anyone can point me the version 0.07, thanks

It's now in-tree in scripts/checkpatch.pl
(linus' tree is still at 0.06 though, I suspect Andrew has something
 newer in -mm)

Dave

-- 
http://www.codemonkey.org.uk
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Initial results of FLEX_BG feature.

2007-07-10 Thread Andreas Dilger

On Jul 10, 2007  11:23 -0500, Jose R. Santos wrote:
> I've started playing with the FLEX_BG feature (for now packing of
> block group metadata closer together) and started doing some
> preliminary benchmarking to see if the feature is worth pursuing.
> I chose an FFSB profile that does single threaded small creates and
> writes and then does an fsync.  This is something I ran for a customer
> a while ago in which ext3 performed poorly.

Jose,
thanks for the information and testing.  This is definitely very
interesting and shows this is an avenue we should pursue.

> Here are some of the results (in transactions/[EMAIL PROTECTED] util) on a 
> single
> [EMAIL PROTECTED] rpm disk.
> 
> ext4  [EMAIL PROTECTED]
> ext4(flex_bg) [EMAIL PROTECTED] 20% improvement
> ext4(data=writeback)  [EMAIL PROTECTED] <- hum...
> ext4(flex_bg data=writeback)  [EMAIL PROTECTED] 28% over best ext4
> ext3  [EMAIL PROTECTED]
> ext3(data=writeback)  [EMAIL PROTECTED]
> ext2  [EMAIL PROTECTED]
> xfs   [EMAIL PROTECTED]
> jfs   [EMAIL PROTECTED]
> 
> The results are from packing the metadata of 64 block groups closer
> together at fsck time.  Still need to clean up the e2fsprog patches,

Does this mean that you are just moving the bitmaps and inode table
at mke2fs time, or also such things as directory blocks at fsck time?

> but I hope to submit them to the list later this week for others to
> try.  It seems like fsck doesn't quite like the new location of the
> metadata and I'm not sure how big of an effort it will be to fix it.  I
> mentioned this since one of the assumptions of implementing FLEX_BG was
> the reduce time in fsck and it could be a while before I'm able to test
> this.

i think in the spirit of the original META_BG option, Ted had wanted to
put all the bitmaps from EXT4_DESC_PER_BLOCK groups somewhere within the
metagroup.  It would also be interesting to see if moving ALL of the
group metadata to a single location in the filesystem makes a bit difference.
If not, then we may as well keep it spread out for safety.

You might also want to test out placement of the journal in the middle
of the filesystem, the U. Wisconsin folks tested this in one of their
papers and showed some noticable improvements.  That isn't exactly
related, but it is a relatively simple tweak to mke2fs/tune2fs to give
it an allocation goal of group_desc[s_groups_count / 2].bg_inode_table
(to put it past inode table in middle group).

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

2007-07-10 Thread Andrew Morton

On Tue, 10 Jul 2007 20:19:16 -0400 Mingming Cao <[EMAIL PROTECTED]> wrote:

> On Tue, 2007-07-10 at 18:22 -0700, Andrew Morton wrote:
> > On Tue, 10 Jul 2007 18:09:40 -0400 Mingming Cao <[EMAIL PROTECTED]> wrote:
> > 
> > > On Tue, 2007-07-10 at 16:30 -0700, Andrew Morton wrote:
> > > > On Sun, 01 Jul 2007 03:37:04 -0400
> > > > Mingming Cao <[EMAIL PROTECTED]> wrote:
> > > > 
> > > > > This patch converts the 32-bit i_version in the generic inode to a 
> > > > > 64-bit
> > > > > i_version field.
> > > > > 
> > > > 
> > > > That's obvious from the patch.  But what was the reason for making this
> > > > (unrelated to ext4) change?
> > > > 
> > > 
> > > The need is came from NFSv4
> > > 
> > > On Fri, 2007-05-25 at 18:25 +0200, Jean noel Cordenner wrote: 
> > > > Hi,
> > > > 
> > > > This is an update of the i_version patch.
> > > > The i_version field is a 64bit counter that is set on every inode
> > > > creation and that is incremented every time the inode data is modified
> > > > (similarly to the "ctime" time-stamp).
> > > > The aim is to fulfill a NFSv4 requirement for rfc3530:
> > > > "5.5.  Mandatory Attributes - Definitions
> > > > Name#   DataType   Access   Description
> > > > ___
> > > > change  3   uint64   READ A value created by the
> > > > server that the client can use to determine if file
> > > > data, directory contents or attributes of the object
> > > > have been modified.  The servermay return the object's
> > > > time_metadata attribute for this attribute's value but
> > > > only if the filesystem object can not be updated more
> > > > frequently than the resolution of time_metadata.
> > > > "
> > > > 
> > > 
> > > > Please update the changelog for this.
> > > > 
> > > 
> > > Is above description clear to you?
> > > 
> > 
> > Yes, thanks.  It doesn't actually tell us why we want to implement
> > this attribute and it doesn't tell us what the implications of failing
> > to do so are, but I guess we can take that on trust from the NFS guys.
> > 
> > But I suspect the ext4 implementation doesn't actually do this.  afaict we
> > won't update i_version for file overwrites (especially if s_time_gran can
> > indeed be 1,000,000,000) and of course for MAP_SHARED modifications.  What
> > would be the implications of this?
> > 
> 
> In the case of overwrite (file date updated), I assume the ctime/mtime
> is being updated and the inode is being dirtied, so the version number
> is being updated.
> 
>  vfs_write()->..
> ->__generic_file_aio_write_nolock()
> ->file_update_time()
> ->mark_inode_dirty_sync()
> ->__mark_inode_dirty(I_DIRTY_SYNC)
> ->ext4_dirty_inode()
> ->ext4_mark_inode_dirty()

That assumes an mtime update for every write().  OK, so two writes in a
single nanosecond won't be happening.  But in that case why is this code:

static inline struct timespec ext4_current_time(struct inode *inode)
{
return (inode->i_sb->s_time_gran < NSEC_PER_SEC) ?
current_fs_time(inode->i_sb) : CURRENT_TIME_SEC;
}

checking (s_time_gran < NSEC_PER_SEC) ??

Overall it is a bit unpleasing to rely upon mtime updates for a correct NFS
server implementation: if we were to later decrease s_time_gran (as we
might do, for performance reasons), the NFS server implementation starts
reporting incorrect information.

> > And how does the NFS server know that the filesystem implements i_version? 
> > Will a zero-value of i_version have special significance, telling the
> > server to not send this attribute, perhaps?
> 
> Bruce raised up this question a few days back when he reviewed this
> patch, I think the solution is add a superblock flag for fs support
> inode versioning, probably at VFS layer?

That would work.
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: block groups with no inode tables

2007-07-10 Thread Andreas Dilger

On Jul 10, 2007  16:30 -0400, Theodore Tso wrote:
> On Tue, Jul 10, 2007 at 12:12:21PM -0500, Jose R. Santos wrote:
> > As I play with the allocation of the metadata for the FLEX_BG feature,
> > it seems that we could benefit from having block groups with no inode
> > tables.  Right now we allocate one inode table per bg base on the
> > inode_blocks_per_group.  For FLEX_BG though, it would make more sense
> > to have a larger inode tables that fully use the inode bitmap allocated
> > on the first few block groups.  Once we reach the number of inode per
> > FLEX_BG, then the remaining block groups could then have no inode
> > tables defined.
> > 
> > The idea here is that we better utilize the inode bitmaps and reduce the
> > number of inode tables to improve mkfs/fsck times. We could also
> > support expansion of inode since we have block groups that have empty
> > entries in the block group descriptors and as long as we can find
> > enough empty blocks for the inode table expanding the number of inodes
> > should be relatively easy.
> > 
> > Don't know if ext4 currently supports this.  Any thoughts?
> 
> Plans to support are there; Andreas sent a patch back in April to
> implement this, using bg_itable_unused, which is already reserved in
> the block group data structure.  The idea here is to speed up fsck by
> specifying how many inodes are actually in use in the block group, so
> we don't have to initialize them until they are to be used.  This is
> tied with the checksum patches, since doing this means we need to
> really worry about the accuracy of the block group descriptors or we
> could lose a lot of data if the block group descriptors are corrupted.

I think Jose means something slightly different, but in the end the
uninit_groups feature (patches in the patch queue, but disabled for
some reason) essentially implements this.  We don't need to read inode
bitmaps from disk if the INODE_UNINIT flag is in the group.

I think all that is needed to get the semantics Jose wants is to tune
the inode allocation in ext4_new_inode() to avoid inode bitmaps that
are not yet initialized.  I suppose the other incremental feature would
be to allow the blocks in the inode table become used for file allocation,
but this exposes us to potential malicious corruption in some cases if
users create "inode looking" data files (e.g. suid root inodes) on a full
filesystem and e2fsck is convinced to treat them as inodes.

We might instead limit this space to directories and indirect/index
blocks, which wouldn't be a bad idea but when we get to changing the
inode structures too much I'd like to combine several of the other
changes.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 6][PATCH 1/1]Export jbd stats through procfs

2007-07-10 Thread Alexey Dobriyan

On Tue, Jul 10, 2007 at 11:21:49PM -0400, Cédric Augonnet wrote:
> 2007/7/10, Andrew Morton <[EMAIL PROTECTED]>:
> 
> Hi all,
> 
> >> + size = sizeof(struct transaction_stats_s);
> >> + s->stats = kmalloc(size, GFP_KERNEL);
> >> + if (s == NULL) {
   ^
> >> + kfree(s);
> >> + return -EIO;
> >
> >ENOMEM

That, and

if (s->stats == NULL)
kfree(s);

> I'm sorry if i missed some point, but i just don't see the use of such
> a kfree here, not sure Andrew meant you should only return ENOMEM
> instead, but why issuing those kfree(NULL), instead of just a if (!s)
> return ENOMEM ?

kfree() is correct, check isn't.

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 6][PATCH 1/1]Export jbd stats through procfs

2007-07-10 Thread Andrew Morton

On Tue, 10 Jul 2007 23:21:49 -0400 "Cédric Augonnet" <[EMAIL PROTECTED]> wrote:

> 2007/7/10, Andrew Morton <[EMAIL PROTECTED]>:
> 
> Hi all,
> 
> > > + size = sizeof(struct transaction_stats_s);
> > > + s->stats = kmalloc(size, GFP_KERNEL);
> > > + if (s == NULL) {
> > > + kfree(s);
> > > + return -EIO;
> >
> > ENOMEM
> 
> I'm sorry if i missed some point, but i just don't see the use of such
> a kfree here, not sure Andrew meant you should only return ENOMEM
> instead, but why issuing those kfree(NULL), instead of just a if (!s)
> return ENOMEM ?
> 

You found a bug.  It was meant to be

if (s->stats == NULL)


-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: block groups with no inode tables

2007-07-10 Thread Andreas Dilger

On Jul 10, 2007  14:09 -0500, Dave Kleikamp wrote:
> Assuming you mean the parent directory?  An inode isn't tied to a
> specific parent.
> 
>   ln dir1/file1 dir2/
>   mv dir1/file1 dir3/
>   rmdir dir1
> 
> What is happens to the inode?

The inode stays in the same place, and the block map of the directories
are changed to enclose the inode.  In ideal (== normal) circumstances,
inodes are allocated within a directory in a sequential manner, and this
would also result in linear inode block allocation, great for extent-mapped
files.  In cases like the above, you will have fragmented IO patterns,
but those are already true for existing directories.

> I really don't think that the directory is the right place to store an inode.

There are actually some performance benefits from this, see
http://citeseer.ist.psu.edu/ganger97embedded.html

Each inode would be a disk block, or possibly a few (slightly larger than
now) inodes per block, on the order of 1kB or more.  This allows for
packing small files into the inode also (as an EA) or alternately having
many extents in the inode for huge files or lots of inline EAs.

I've also got a plan to overcome the hard-link limitations in that paper,
by storing the filename of an inode as an EA in the inode, prefixed by
the inode number & generation of the parent.  When doing a readdir or
lookup, we know the parent directory in which we are looking, so we can
only consider names in that directory.  When doing a readdir, we can
immediately list all of the names for this inode together.  The caveat
is that we need a flexible EA scheme to handle this, maybe a directory
with more EAs in it?

The one thing that I'm not sure about is how to handle the case where
inode blocks are allocated in relatively random order.  I'd _like_ to
be able to avoid the POSIX telldir/seekdir problem by doing readdir()
in block order, but that also means that if we allocate an inode block
between two other existing inode blocks in a directory that we should
"insert" the block into the directory instead of e.g. appending it.
That means the file offset in a directory is not constant, but maybe it
is OK to return the physical block number for telldir?

We would still have a hash for the files, but instead of per block
as it is now, it would need to have leaf entries for each name, since
an inode can have many names and would appear in multiple hash buckets.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/7] fallocate() implementation in i386, x86_64 and powerpc

2007-07-10 Thread Heiko Carstens

On Wed, Jul 11, 2007 at 12:10:34PM +1000, Stephen Rothwell wrote:
> On Wed, 11 Jul 2007 01:50:00 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> wrote:
> >
> > --- linux-2.6.22.orig/arch/x86_64/ia32/sys_ia32.c
> > +++ linux-2.6.22/arch/x86_64/ia32/sys_ia32.c
> > @@ -879,3 +879,11 @@ asmlinkage long sys32_fadvise64(int fd, 
> > return sys_fadvise64_64(fd, ((u64)offset_hi << 32) | offset_lo,
> > len, advice);
> >  }
> > +
> > +asmlinkage long sys32_fallocate(int fd, int mode, unsigned offset_lo,
> > +   unsigned offset_hi, unsigned len_lo,
> > +   unsigned len_hi)
> 
> Please call this compat_sys_fallocate in line with the powerpc version -
> it gives us a hint that maybe we should think about how to consolidate
> them.  I know other stuff in that file is called sys32_ ... but it is time
> for a change :-)

Maybe it would be worth to finally consider this:

http://marc.info/?l=linux-kernel&m=118411511620432&w=2

?
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 3][PATCH 1/1] ext4 nanosecond timestamp

2007-07-10 Thread Mingming Cao

On Tue, 2007-07-10 at 16:30 -0700, Andrew Morton wrote:
> On Sun, 01 Jul 2007 03:36:56 -0400
> Mingming Cao <[EMAIL PROTECTED]> wrote:
> 
> > This patch is a spinoff of the old nanosecond patches.
> 
> I don't know what the "old nanosecond patches" are.  A link to a suitable
> changlog for those patches would do in a pinch.  Preferable would be to
> write a proper changelog for this patch.
> 
I found the original patch
http://marc.info/?l=linux-ext4&m=115091699809181&w=2

Andreas or Kalpak, is changelog from the original patch is accurate to
apply here?

Mingming

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

2007-07-10 Thread Neil Brown


It just occurred to me:

 If i_version is 64bit, then knfsd would need to be careful when
 reading it on a 32bit host.  What are the locking rules?

 Presumably it is only updated under i_mutex protection, but having to
 get i_mutex to read it would seem a little heavy handed.

 Should it use a seqlock like i_size?
 Could we use the same seqlock that i_size uses, or would we need a
 separate one?

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

2007-07-10 Thread Mingming Cao

On Wed, 2007-07-11 at 13:21 +1000, Neil Brown wrote:
> On Tuesday July 10, [EMAIL PROTECTED] wrote:
> > 
> > Yes, thanks.  It doesn't actually tell us why we want to implement
> > this attribute and it doesn't tell us what the implications of failing
> > to do so are, but I guess we can take that on trust from the NFS guys.
> 
> You would like to think so, but remember NFSv4 was designed by a
> committee :-)
> 
> The 'change' number is used for cache consistency, and as the spec
> makes very strong statements about the 'change' number, it is very
> hard (or impossible) to implement a server correctly without storing a
> change number in stable storage (just one of my grips about V4).
> 
> > 
> > But I suspect the ext4 implementation doesn't actually do this.  afaict we
> > won't update i_version for file overwrites (especially if s_time_gran can
> > indeed be 1,000,000,000) and of course for MAP_SHARED modifications.  What
> > would be the implications of this?
> 
> The first part sounds like a bug - i_version should really be updated
> by every call to ->commit_write (if that is still what it is called).
> 
> The MAP_SHARED thing is less obvious.  I guess every time we notice
> that the page might have been changed, we need to increment i_version.
> 
> > 
> > And how does the NFS server know that the filesystem implements i_version? 
> > Will a zero-value of i_version have special significance, telling the
> > server to not send this attribute, perhaps?
> 
> That is a very important question.  Zero probably makes sense, but
> what ever it is needs to be agreed and documented.
> And just by-the-way, the server doesn't really have the option of not
> sending the attribute.  If i_version isn't defined, it has to fake
> something using mtime, and hope that is good enough.
> 
> Alternately we could mandate that i_version is always kept up-to-date
> and if a filesystem doesn't have anything to load from storage, it
> just sets it to the current time in nanoseconds.
> 
> That would mean that a client would need to flush it's cache whenever
> the inode fell out of cache on the server, but I don't think we can
> reliably do better than that.
> 
> I think I like that approach.
> 
> So my vote is to increment i_version in common code every time any
> change is made to the file, 

David Chinneer pointed that we need to journal the version number
updates together with the operations that causes the change of the inode
version number, in order to survive server crashes so clients won't see
the counter go backwards.

So increment i_version in fs code is probably the place to ensure the
inode version changes are stored to disk. It's seems update the ext4
inode version in every ext4_mark_inode_dirty() is the easiest way.

> and alloc_inode should initialise it to
> current time, which might be changed by the filesystem before it calls
> unlock_new_inode. 
> ... but doesn't lustre want to control its i_version... so maybe not :-(
> 
> NeilBrown

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

2007-07-10 Thread Andrew Morton

On Tue, 10 Jul 2007 22:09:08 -0400 Mingming Cao <[EMAIL PROTECTED]> wrote:

> David Chinneer pointed that we need to journal the version number
> updates together with the operations that causes the change of the inode
> version number, in order to survive server crashes so clients won't see
> the counter go backwards.
> 
> So increment i_version in fs code is probably the place to ensure the
> inode version changes are stored to disk. It's seems update the ext4
> inode version in every ext4_mark_inode_dirty() is the easiest way.

That still makes us dependent upon _something_ changing the inode.  For
overwrites the only something is mtime.

If we don't want to have a peculiar dependency upon s_time_gran=1e9 (and
I don't think we do) then I guess we'll need new code in or around
file_update_time() to do this.
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 6][PATCH 1/1]Export jbd stats through procfs

2007-07-10 Thread Mingming Cao

On Tue, 2007-07-10 at 21:42 -0700, Andrew Morton wrote:
> On Tue, 10 Jul 2007 23:21:49 -0400 "Cédric Augonnet" <[EMAIL PROTECTED]> 
> wrote:
> 
> > 2007/7/10, Andrew Morton <[EMAIL PROTECTED]>:
> > 
> > Hi all,
> > 
> > > > + size = sizeof(struct transaction_stats_s);
> > > > + s->stats = kmalloc(size, GFP_KERNEL);
> > > > + if (s == NULL) {
> > > > + kfree(s);
> > > > + return -EIO;
> > >
> > > ENOMEM
> > 
> > I'm sorry if i missed some point, but i just don't see the use of such
> > a kfree here, not sure Andrew meant you should only return ENOMEM
> > instead, but why issuing those kfree(NULL), instead of just a if (!s)
> > return ENOMEM ?
> > 
> 
> You found a bug.  It was meant to be
> 
>   if (s->stats == NULL)
> 
> 

Thanks, I will make sure this get fixed in ext4 patch queue.

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

2007-07-10 Thread Andrew Morton

On Wed, 11 Jul 2007 15:05:27 +1000 Neil Brown <[EMAIL PROTECTED]> wrote:

> 
> It just occurred to me:
> 
>  If i_version is 64bit, then knfsd would need to be careful when
>  reading it on a 32bit host.  What are the locking rules?
> 
>  Presumably it is only updated under i_mutex protection, but having to
>  get i_mutex to read it would seem a little heavy handed.
> 
>  Should it use a seqlock like i_size?
>  Could we use the same seqlock that i_size uses, or would we need a
>  separate one?
> 

seqlocks are a bit of a pain to use (we've had plenty of deadlocks on the
i_size one).  We could reuse inode.i_lock for this modification.  Its
mandate is "general purpose innermost lock to protect stuff in this inode".


-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

2007-07-10 Thread Mingming Cao

On Tue, 2007-07-10 at 21:22 -0700, Andrew Morton wrote:
> On Tue, 10 Jul 2007 20:19:16 -0400 Mingming Cao <[EMAIL PROTECTED]> wrote:
> 
> > On Tue, 2007-07-10 at 18:22 -0700, Andrew Morton wrote:
> > > On Tue, 10 Jul 2007 18:09:40 -0400 Mingming Cao <[EMAIL PROTECTED]> wrote:
> > > 
> > > > On Tue, 2007-07-10 at 16:30 -0700, Andrew Morton wrote:
> > > > > On Sun, 01 Jul 2007 03:37:04 -0400
> > > > > Mingming Cao <[EMAIL PROTECTED]> wrote:
> > > > > 
> > > > > > This patch converts the 32-bit i_version in the generic inode to a 
> > > > > > 64-bit
> > > > > > i_version field.
> > > > > > 
> > > > > 
> > > > > That's obvious from the patch.  But what was the reason for making 
> > > > > this
> > > > > (unrelated to ext4) change?
> > > > > 
> > > > 
> > > > The need is came from NFSv4
> > > > 
> > > > On Fri, 2007-05-25 at 18:25 +0200, Jean noel Cordenner wrote: 
> > > > > Hi,
> > > > > 
> > > > > This is an update of the i_version patch.
> > > > > The i_version field is a 64bit counter that is set on every inode
> > > > > creation and that is incremented every time the inode data is modified
> > > > > (similarly to the "ctime" time-stamp).
> > > > > The aim is to fulfill a NFSv4 requirement for rfc3530:
> > > > > "5.5.  Mandatory Attributes - Definitions
> > > > > Name  #   DataType   Access   Description
> > > > > ___
> > > > > change3   uint64   READ A value created 
> > > > > by the
> > > > >   server that the client can use to determine if file
> > > > >   data, directory contents or attributes of the object
> > > > >   have been modified.  The servermay return the object's
> > > > >   time_metadata attribute for this attribute's value but
> > > > >   only if the filesystem object can not be updated more
> > > > >   frequently than the resolution of time_metadata.
> > > > > "
> > > > > 
> > > > 
> > > > > Please update the changelog for this.
> > > > > 
> > > > 
> > > > Is above description clear to you?
> > > > 
> > > 
> > > Yes, thanks.  It doesn't actually tell us why we want to implement
> > > this attribute and it doesn't tell us what the implications of failing
> > > to do so are, but I guess we can take that on trust from the NFS guys.
> > > 
> > > But I suspect the ext4 implementation doesn't actually do this.  afaict we
> > > won't update i_version for file overwrites (especially if s_time_gran can
> > > indeed be 1,000,000,000) and of course for MAP_SHARED modifications.  What
> > > would be the implications of this?
> > > 
> > 
> > In the case of overwrite (file date updated), I assume the ctime/mtime
> > is being updated and the inode is being dirtied, so the version number
> > is being updated.
> > 
> >  vfs_write()->..
> > ->__generic_file_aio_write_nolock()
> > ->file_update_time()
> > ->mark_inode_dirty_sync()
> > ->__mark_inode_dirty(I_DIRTY_SYNC)
> > ->ext4_dirty_inode()
> > ->ext4_mark_inode_dirty()
> 
> That assumes an mtime update for every write().  OK, so two writes in a
> single nanosecond won't be happening.  But in that case why is this code:
> 
> static inline struct timespec ext4_current_time(struct inode *inode)
> {
>   return (inode->i_sb->s_time_gran < NSEC_PER_SEC) ?
>   current_fs_time(inode->i_sb) : CURRENT_TIME_SEC;
> }
> 
> checking (s_time_gran < NSEC_PER_SEC) ??
> 

Ext4 can still load/read ext3 fs (which by default with 128 bytes old
inode size, means doens't have support nanosecond timestamps), so it's
not always gurantee nanosecond timestamps granularity.(it depends on the
size of the inode (>128 bytes), by default, a fresh ext4  increase inode
size to 256 bytes to have the room to store nanoseond timestamps, inode
versioning etc)

> Overall it is a bit unpleasing to rely upon mtime updates for a correct NFS
> server implementation: if we were to later decrease s_time_gran (as we
> might do, for performance reasons), the NFS server implementation starts
> reporting incorrect information.
> 

:( that is a problem...

> > > And how does the NFS server know that the filesystem implements 
> > > i_version? 
> > > Will a zero-value of i_version have special significance, telling the
> > > server to not send this attribute, perhaps?
> > 
> > Bruce raised up this question a few days back when he reviewed this
> > patch, I think the solution is add a superblock flag for fs support
> > inode versioning, probably at VFS layer?
> 
> That would work.
> -
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMA

Re: Initial results of FLEX_BG feature.

2007-07-10 Thread Jose R. Santos

On Tue, 10 Jul 2007 22:12:14 -0600
Andreas Dilger <[EMAIL PROTECTED]> wrote:

> On Jul 10, 2007  11:23 -0500, Jose R. Santos wrote:
> > I've started playing with the FLEX_BG feature (for now packing of
> > block group metadata closer together) and started doing some
> > preliminary benchmarking to see if the feature is worth pursuing.
> > I chose an FFSB profile that does single threaded small creates and
> > writes and then does an fsync.  This is something I ran for a customer
> > a while ago in which ext3 performed poorly.
> 
> Jose,
> thanks for the information and testing.  This is definitely very
> interesting and shows this is an avenue we should pursue.
> 
> > Here are some of the results (in transactions/[EMAIL PROTECTED] util) on a 
> > single
> > [EMAIL PROTECTED] rpm disk.
> > 
> > ext4[EMAIL PROTECTED]
> > ext4(flex_bg)   [EMAIL PROTECTED] 20% improvement
> > ext4(data=writeback)[EMAIL PROTECTED] <- hum...
> > ext4(flex_bg data=writeback)[EMAIL PROTECTED] 28% over best ext4
> > ext3[EMAIL PROTECTED]
> > ext3(data=writeback)[EMAIL PROTECTED]
> > ext2[EMAIL PROTECTED]
> > xfs [EMAIL PROTECTED]
> > jfs [EMAIL PROTECTED]
> > 
> > The results are from packing the metadata of 64 block groups closer
> > together at fsck time.  Still need to clean up the e2fsprog patches,
> 
> Does this mean that you are just moving the bitmaps and inode table
> at mke2fs time, or also such things as directory blocks at fsck time?

Right now what I've done is allocate the bitmaps and inode tables at the
beginning of each group of 64 BG.  Still need to work on fsck since just
removing the restriction on were the bitmaps and inode table are
located still gives me errors of uninitialized inodes with dtime set.
Seems like fsck still expect inode information to be located at
specific locations within the disk.

> > but I hope to submit them to the list later this week for others to
> > try.  It seems like fsck doesn't quite like the new location of the
> > metadata and I'm not sure how big of an effort it will be to fix it.  I
> > mentioned this since one of the assumptions of implementing FLEX_BG was
> > the reduce time in fsck and it could be a while before I'm able to test
> > this.
> 
> i think in the spirit of the original META_BG option, Ted had wanted to
> put all the bitmaps from EXT4_DESC_PER_BLOCK groups somewhere within the
> metagroup.  It would also be interesting to see if moving ALL of the
> group metadata to a single location in the filesystem makes a bit difference.
> If not, then we may as well keep it spread out for safety.

This is by no means a final implementation, rather it's a means to
test whether this feature is worth pursuing.  I plan on testing various
thing before coming up with a final design of what the feature should
look like.

I did try moving all of the groups metadata at the beginning of the
disk but it was slightly slower on an rsync test.  Have not tried it
with FFSB yet.

Things on the TODO list of testing needed to be done are:

- More metadata intensive FFSB profile testing.  I've been meaning to
add more operations to FFSB in order to make this possible.  Now I have
an excuse.

- Testing of different ratios of groups per flex groups.

- Testing with storage devices with fast write cache.  When I did the
customer testing a couple of months ago with this FFSB profile, JFS was
the fastest of the filesystems when paired with a decent storage
subsystem with fast write cache.  It would be interesting to see what
effects do fast write caching have on such a feature.

- Testing fsck time once e2fsprogs understands how to read such a
filesystem.

- Testing an aged file systems to see what effects (if any) this
feature has in a fragmented filesystem.

 
> You might also want to test out placement of the journal in the middle
> of the filesystem, the U. Wisconsin folks tested this in one of their
> papers and showed some noticable improvements.  That isn't exactly
> related, but it is a relatively simple tweak to mke2fs/tune2fs to give
> it an allocation goal of group_desc[s_groups_count / 2].bg_inode_table
> (to put it past inode table in middle group).

Make sense.  Do you have a link to the paper?

> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.
> 

Thanks

-JRS
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 2][PATCH 5/5] cleanups: Export jbd2-debug via debugfs

2007-07-10 Thread Jose R. Santos

On Tue, 10 Jul 2007 16:30:25 -0700
Andrew Morton <[EMAIL PROTECTED]> wrote:

> On Sun, 01 Jul 2007 03:36:48 -0400
> Mingming Cao <[EMAIL PROTECTED]> wrote:
> 
> > > On Jun 07, 2007  23:45 -0500, Jose R. Santos wrote:
> > > > The jbd2-debug file used to be located in /proc/sys/fs/jbd2-debug, but
> > > > create_proc_entry() does not do lookups on file names with more that one
> > > > directory deep.  This causes the entry creation to fail and hence, no 
> > > > proc
> > > > file is created.  This patch moves the file to /proc/jbd2-degug.
> > > > 
> > > > The file could be move to /proc/fs/jbd2/jbd2-debug, but it would require
> > > > some minor alterations to the jbd-stats patch.
> > > 
> > > I don't think we really want to be adding top-level files in /proc.
> > > What about using the "debugfs" filesystem (not to be confused with
> > > the e2fsprogs 'debugfs' command)?
> > 
> > How about this then?  Moved the file to use debugfs as well as having
> > the nice effect of removing more lines than what it adds.
> > 
> > Signed-off-by: Jose R. Santos <[EMAIL PROTECTED]>
> 
> Please clean up the changelog.
> 
> The changelog should include information about the location and the content
> of these debugfs files.  it should provide any instructions which users
> will need to be able to create and use those files.

Will fix.

> Alternatively (and preferably) do this via an update to
> Documentation/filesystems/ext4.txt.

Seems like I also need to update the doc on Kconfig as well.  Do you
prefer this in separate patches? (current patch, kconfig patch, ext4
doc update patch?
 
> >  fs/jbd2/journal.c|   6220 +42 -0 !
> >  include/linux/jbd2.h |21 + 1 - 0 !
> >  2 files changed, 21 insertions(+), 43 deletions(-)
> 
> Again, this patch isn't in Ted's kernel.org directory and hasn't been in -mm.
> 
> Apart from the lack of testing and review which this causes, it means I
> can't just do `pushpatch name-of-this-patch' and look at it in tkdiff.  So
> I squint at the diff, but that's harder when the diff wasn't prepared with
> `diff -p'.  Oh well.

Will fix.

> 
> > Index: linux-2.6.22-rc4/fs/jbd2/journal.c
> > ===
> > --- linux-2.6.22-rc4.orig/fs/jbd2/journal.c 2007-06-11 16:16:18.0 
> > -0700
> > +++ linux-2.6.22-rc4/fs/jbd2/journal.c  2007-06-11 16:36:10.0 
> > -0700
> > @@ -35,6 +35,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  
> >  #include 
> >  #include 
> > @@ -1954,60 +1955,37 @@
> >   * /proc tunables
> >   */
> >  #if defined(CONFIG_JBD2_DEBUG)
> > -int jbd2_journal_enable_debug;
> > +u16 jbd2_journal_enable_debug;
> >  EXPORT_SYMBOL(jbd2_journal_enable_debug);
> >  #endif
> >  
> > -#if defined(CONFIG_JBD2_DEBUG) && defined(CONFIG_PROC_FS)
> > +#if defined(CONFIG_JBD2_DEBUG) && defined(CONFIG_DEBUG_FS)
> 
> Has this been compile-tested with CONFIG_DEBUGFS=n?

I think I did, but honestly don't remember.  Will check with the new
patch. :) 

> >  
> > -#define create_jbd_proc_entry() do {} while (0)
> > -#define jbd2_remove_jbd_proc_entry() do {} while (0)
> > +#define jbd2_create_debugfs_entry() do {} while (0)
> > +#define jbd2_remove_debugfs_entry() do {} while (0)
> 
> I suggest that these be converted to (preferable) inline functions while
> you're there.

OK.

> >  #endif
> >  
> > @@ -2067,7 +2045,7 @@
> > ret = journal_init_caches();
> > if (ret != 0)
> > jbd2_journal_destroy_caches();
> > -   create_jbd_proc_entry();
> > +   jbd2_create_debugfs_entry();
> > return ret;
> >  }
> >  
> > @@ -2078,7 +2056,7 @@
> > if (n)
> > printk(KERN_EMERG "JBD: leaked %d journal_heads!\n", n);
> >  #endif
> > -   jbd2_remove_jbd_proc_entry();
> > +   jbd2_remove_debugfs_entry();
> > jbd2_journal_destroy_caches();
> >  }
> >  
> > Index: linux-2.6.22-rc4/include/linux/jbd2.h
> > ===
> > --- linux-2.6.22-rc4.orig/include/linux/jbd2.h  2007-06-11 
> > 16:16:18.0 -0700
> > +++ linux-2.6.22-rc4/include/linux/jbd2.h   2007-06-11 16:35:25.0 
> > -0700
> > @@ -57,7 +57,7 @@
> >   * CONFIG_JBD2_DEBUG is on.
> >   */
> >  #define JBD_EXPENSIVE_CHECKING
> 
> JBD2?
>
> > -extern int jbd2_journal_enable_debug;
> > +extern u16 jbd2_journal_enable_debug;
> 
> Why was this made 16-bit?  To save 2 bytes?  Could have saved 3 if we're
> going to do that.

OK.
 
> Shoudln't all this debug info be a per-superblock thing rather than
> kernel-wide?

I don't think it is worth pursuing this feature since this seems to
have been broken for a while now (its been there since the first git
revission in ext3) and nobody has noticed it until now.  It could be
address on a later patch though, since the initial purpose of the patch
was to fix the broken JBD2_DEBUG option. Of course, this may not be
clearly express in the changelog. :)

-JRS
-
To unsubscribe from this list: send the line "un

Re: Initial results of FLEX_BG feature.

2007-07-10 Thread Eric Sandeen

Jose R. Santos wrote:
> On Tue, 10 Jul 2007 22:12:14 -0600
> Andreas Dilger <[EMAIL PROTECTED]> wrote:

...

>  
>> You might also want to test out placement of the journal in the middle
>> of the filesystem, the U. Wisconsin folks tested this in one of their
>> papers and showed some noticable improvements.  That isn't exactly
>> related, but it is a relatively simple tweak to mke2fs/tune2fs to give
>> it an allocation goal of group_desc[s_groups_count / 2].bg_inode_table
>> (to put it past inode table in middle group).
> 
> Make sense.  Do you have a link to the paper?

filesystem shrinking would need to be fixed to handle this too, right...

-Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 7][PATCH 1/1]Remove 32000 subdirs limit.

2007-07-10 Thread Andrew Morton

On Sun, 01 Jul 2007 03:38:18 -0400 Mingming Cao <[EMAIL PROTECTED]> wrote:

> >From [EMAIL PROTECTED] Thu May 17 17:21:08 2007
> Hi,
> 
> I have rebased this patch to 2.6.22-rc1 so that it can be added to the
> ext4 patch queue. It has been tested by creating more than 65000 subdirs
> and then deleting them and checking the nlinks. The e2fsprogs part of
> this patch was sent earlier by me to linux-ext4 and doesn't need any
> changes, so not submitting it again.
> 
> --
> This patch adds support to ext4 for allowing more than 65000
> subdirectories. Currently the maximum number of subdirectories is capped
> at 32000.
> 
> If we exceed 65000 subdirectories in an htree directory it sets the
> inode link count to 1 and no longer counts subdirectories.  The
> directory link count is not actually used when determining if a
> directory is empty, as that only counts subdirectories and not regular
> files that might be in there. 
> 
> A EXT4_FEATURE_RO_COMPAT_DIR_NLINK flag has been added and it is set if
> the subdir count for any directory crosses 65000.
> 

Would I be correct in assuming that a later fsck will clear
EXT4_FEATURE_RO_COMPAT_DIR_NLINK if there are no longer any >65000 subdir
directories?

If so, that is worth a mention in the changelog, perhaps?

Please remind us what is the behaviour of an RO_COMPAT flag?  It means that
old ext4, ext3 and ext2 can only mount this fs read-only, yes?

> 
> Index: linux-2.6.22-rc4/fs/ext4/namei.c
> ===
> --- linux-2.6.22-rc4.orig/fs/ext4/namei.c 2007-06-14 17:30:47.0 
> -0700
> +++ linux-2.6.22-rc4/fs/ext4/namei.c  2007-06-14 17:32:55.0 -0700
> @@ -1619,6 +1619,27 @@ static int ext4_delete_entry (handle_t *
>   return -ENOENT;
>  }
>  
> +static inline void ext4_inc_count(handle_t *handle, struct inode *inode)
> +{
> + inc_nlink(inode);
> + if (is_dx(inode) && inode->i_nlink > 1) {
> + /* limit is 16-bit i_links_count */
> + if (inode->i_nlink >= EXT4_LINK_MAX || inode->i_nlink == 2) {
> + inode->i_nlink = 1;
> + EXT4_SET_RO_COMPAT_FEATURE(inode->i_sb,
> +   EXT4_FEATURE_RO_COMPAT_DIR_NLINK);
> + }
> + }
> +}

Looks too big to be inlined.

Why do we set EXT4_FEATURE_RO_COMPAT_DIR_NLINK if i_nlink==2?

> +static inline void ext4_dec_count(handle_t *handle, struct inode *inode)
> +{
> + drop_nlink(inode);
> + if (S_ISDIR(inode->i_mode) && inode->i_nlink == 0)
> + inc_nlink(inode);
> +}

Probably too big to inline.

> +
>  static int ext4_add_nondir(handle_t *handle,
>   struct dentry *dentry, struct inode *inode)
>  {
> @@ -1715,7 +1736,7 @@ static int ext4_mkdir(struct inode * dir
>   struct ext4_dir_entry_2 * de;
>   int err, retries = 0;
>  
> - if (dir->i_nlink >= EXT4_LINK_MAX)
> + if (EXT4_DIR_LINK_MAX(dir))
>   return -EMLINK;
>  
>  retry:
> @@ -1738,7 +1759,7 @@ retry:
>   inode->i_size = EXT4_I(inode)->i_disksize = inode->i_sb->s_blocksize;
>   dir_block = ext4_bread (handle, inode, 0, 1, &err);
>   if (!dir_block) {
> - drop_nlink(inode); /* is this nlink == 0? */
> + ext4_dec_count(handle, inode); /* is this nlink == 0? */
>   ext4_mark_inode_dirty(handle, inode);
>   iput (inode);
>   goto out_stop;
> @@ -1770,7 +1791,7 @@ retry:
>   iput (inode);
>   goto out_stop;
>   }
> - inc_nlink(dir);
> + ext4_inc_count(handle, dir);
>   ext4_update_dx_flag(dir);
>   ext4_mark_inode_dirty(handle, dir);
>   d_instantiate(dentry, inode);
> @@ -2035,10 +2056,10 @@ static int ext4_rmdir (struct inode * di
>   retval = ext4_delete_entry(handle, dir, de, bh);
>   if (retval)
>   goto end_rmdir;
> - if (inode->i_nlink != 2)
> - ext4_warning (inode->i_sb, "ext4_rmdir",
> -   "empty directory has nlink!=2 (%d)",
> -   inode->i_nlink);
> + if (!EXT4_DIR_LINK_EMPTY(inode))
> + ext4_warning(inode->i_sb, "ext4_rmdir",
> +  "empty directory has too many links (%d)",
> +  inode->i_nlink);
>   inode->i_version++;
>   clear_nlink(inode);
>   /* There's no need to set i_disksize: the fact that i_nlink is
> @@ -2048,7 +2069,7 @@ static int ext4_rmdir (struct inode * di
>   ext4_orphan_add(handle, inode);
>   inode->i_ctime = dir->i_ctime = dir->i_mtime = ext4_current_time(inode);
>   ext4_mark_inode_dirty(handle, inode);
> - drop_nlink(dir);
> + ext4_dec_count(handle, dir);
>   ext4_update_dx_flag(dir);
>   ext4_mark_inode_dirty(handle, dir);
>  
> @@ -2099,7 +2120,7 @@ static int ext4_unlink(struct inode * di
>   dir->i_ctime = dir->i_mtime

Re: [EXT4 set 2][PATCH 5/5] cleanups: Export jbd2-debug via debugfs

2007-07-10 Thread Mingming Cao

On Wed, 2007-07-11 at 00:38 -0500, Jose R. Santos wrote:
> On Tue, 10 Jul 2007 16:30:25 -0700
> Andrew Morton <[EMAIL PROTECTED]> wrote:
> 
> > On Sun, 01 Jul 2007 03:36:48 -0400
> > Mingming Cao <[EMAIL PROTECTED]> wrote:
> > 
> > > > On Jun 07, 2007  23:45 -0500, Jose R. Santos wrote:
> > > > > The jbd2-debug file used to be located in /proc/sys/fs/jbd2-debug, but
> > > > > create_proc_entry() does not do lookups on file names with more that 
> > > > > one
> > > > > directory deep.  This causes the entry creation to fail and hence, no 
> > > > > proc
> > > > > file is created.  This patch moves the file to /proc/jbd2-degug.
> > > > > 
> > > > > The file could be move to /proc/fs/jbd2/jbd2-debug, but it would 
> > > > > require
> > > > > some minor alterations to the jbd-stats patch.
> > > > 
> > > > I don't think we really want to be adding top-level files in /proc.
> > > > What about using the "debugfs" filesystem (not to be confused with
> > > > the e2fsprogs 'debugfs' command)?
> > > 
> > > How about this then?  Moved the file to use debugfs as well as having
> > > the nice effect of removing more lines than what it adds.
> > > 
> > > Signed-off-by: Jose R. Santos <[EMAIL PROTECTED]>
> > 
> > Please clean up the changelog.
> > 
> > The changelog should include information about the location and the content
> > of these debugfs files.  it should provide any instructions which users
> > will need to be able to create and use those files.
> 
> Will fix.
> 
> > Alternatively (and preferably) do this via an update to
> > Documentation/filesystems/ext4.txt.
> 
> Seems like I also need to update the doc on Kconfig as well.  Do you
> prefer this in separate patches? (current patch, kconfig patch, ext4
> doc update patch?
> 
> > >  fs/jbd2/journal.c|   6220 +42 -0 !
> > >  include/linux/jbd2.h |21 + 1 - 0 !
> > >  2 files changed, 21 insertions(+), 43 deletions(-)
> > 
> > Again, this patch isn't in Ted's kernel.org directory and hasn't been in 
> > -mm.
> > 
> > Apart from the lack of testing and review which this causes, it means I
> > can't just do `pushpatch name-of-this-patch' and look at it in tkdiff.  So
> > I squint at the diff, but that's harder when the diff wasn't prepared with
> > `diff -p'.  Oh well.
> 
> Will fix.
> 
> > 
> > > Index: linux-2.6.22-rc4/fs/jbd2/journal.c
> > > ===
> > > --- linux-2.6.22-rc4.orig/fs/jbd2/journal.c   2007-06-11 
> > > 16:16:18.0 -0700
> > > +++ linux-2.6.22-rc4/fs/jbd2/journal.c2007-06-11 16:36:10.0 
> > > -0700
> > > @@ -35,6 +35,7 @@
> > >  #include 
> > >  #include 
> > >  #include 
> > > +#include 
> > >  
> > >  #include 
> > >  #include 
> > > @@ -1954,60 +1955,37 @@
> > >   * /proc tunables
> > >   */
> > >  #if defined(CONFIG_JBD2_DEBUG)
> > > -int jbd2_journal_enable_debug;
> > > +u16 jbd2_journal_enable_debug;
> > >  EXPORT_SYMBOL(jbd2_journal_enable_debug);
> > >  #endif
> > >  
> > > -#if defined(CONFIG_JBD2_DEBUG) && defined(CONFIG_PROC_FS)
> > > +#if defined(CONFIG_JBD2_DEBUG) && defined(CONFIG_DEBUG_FS)
> > 
> > Has this been compile-tested with CONFIG_DEBUGFS=n?
> 
> I think I did, but honestly don't remember.  Will check with the new
> patch. :) 
> 

Yes, I remember I did, that discovered some inconsistency in ext4 code,
which has already been fixed.

> > >  
> > > -#define create_jbd_proc_entry() do {} while (0)
> > > -#define jbd2_remove_jbd_proc_entry() do {} while (0)
> > > +#define jbd2_create_debugfs_entry() do {} while (0)
> > > +#define jbd2_remove_debugfs_entry() do {} while (0)
> > 
> > I suggest that these be converted to (preferable) inline functions while
> > you're there.
> 
> OK.
> 
> > >  #endif
> > >  
> > > @@ -2067,7 +2045,7 @@
> > >   ret = journal_init_caches();
> > >   if (ret != 0)
> > >   jbd2_journal_destroy_caches();
> > > - create_jbd_proc_entry();
> > > + jbd2_create_debugfs_entry();
> > >   return ret;
> > >  }
> > >  
> > > @@ -2078,7 +2056,7 @@
> > >   if (n)
> > >   printk(KERN_EMERG "JBD: leaked %d journal_heads!\n", n);
> > >  #endif
> > > - jbd2_remove_jbd_proc_entry();
> > > + jbd2_remove_debugfs_entry();
> > >   jbd2_journal_destroy_caches();
> > >  }
> > >  
> > > Index: linux-2.6.22-rc4/include/linux/jbd2.h
> > > ===
> > > --- linux-2.6.22-rc4.orig/include/linux/jbd2.h2007-06-11 
> > > 16:16:18.0 -0700
> > > +++ linux-2.6.22-rc4/include/linux/jbd2.h 2007-06-11 16:35:25.0 
> > > -0700
> > > @@ -57,7 +57,7 @@
> > >   * CONFIG_JBD2_DEBUG is on.
> > >   */
> > >  #define JBD_EXPENSIVE_CHECKING
> > 
> > JBD2?
> >
> > > -extern int jbd2_journal_enable_debug;
> > > +extern u16 jbd2_journal_enable_debug;
> > 
> > Why was this made 16-bit?  To save 2 bytes?  Could have saved 3 if we're
> > going to do that.
> 
> OK.
> 
> > Shoudln't all this debug info be a per-superblock thing rather than
> > kernel-wide

Re: [EXT4 set 1][PATCH 2/2] Enable extents by default for ext4dev

2007-07-10 Thread Mingming Cao

On Tue, 2007-07-10 at 23:35 -0400, Dave Jones wrote:
> On Tue, Jul 10, 2007 at 05:35:13PM -0400, Mingming Cao wrote:
>  > 
>  > Sorry about this. I was using version 0.04. The latest one I can find
>  > for now is 0.05(searching lkml), but it didn't catch this codling style
>  > bug either. I appreciate if anyone can point me the version 0.07, thanks
> 
> It's now in-tree in scripts/checkpatch.pl
> (linus' tree is still at 0.06 though, I suspect Andrew has something
>  newer in -mm)
> 

Thanks, Andy has uploaded the 0.07 version at
http://www.shadowen.org/~apw/public/checkpatch/checkpatch.pl-0.07


Mingming

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 8][PATCH 1/1]Add journal checksums

2007-07-10 Thread Andrew Morton

On Sun, 01 Jul 2007 03:38:25 -0400 Mingming Cao <[EMAIL PROTECTED]> wrote:

> Journal checksum feature has been added to detect corruption of journal.

That was brief.  No description of what it does, how it does it, why it
does it, how one operates it, why (or why not) one would choose to enable
it nor of the costs of enabling it.

> Signed-off-by: Andreas Dilger <[EMAIL PROTECTED]>
> Signed-off-by: Girish Shilamkar <[EMAIL PROTECTED]>
> Signed-off-by: Dave Kleikamp <[EMAIL PROTECTED]>
> 
> diff -Nurp linux024/fs/ext4/super.c linux/fs/ext4/super.c
> --- linux024/fs/ext4/super.c  2007-06-25 16:19:24.0 -0500
> +++ linux/fs/ext4/super.c 2007-06-26 08:35:16.0 -0500
> @@ -721,6 +721,7 @@ enum {
>   Opt_user_xattr, Opt_nouser_xattr, Opt_acl, Opt_noacl,
>   Opt_reservation, Opt_noreservation, Opt_noload, Opt_nobh, Opt_bh,
>   Opt_commit, Opt_journal_update, Opt_journal_inum, Opt_journal_dev,
> + Opt_journal_checksum, Opt_journal_async_commit,

A new user-visible feature should be accompanied by an update to
Documentation/filesystems/ext4.txt?

>   Opt_abort, Opt_data_journal, Opt_data_ordered, Opt_data_writeback,
>   Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
>   Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota,
> @@ -760,6 +761,8 @@ static match_table_t tokens = {
>   {Opt_journal_update, "journal=update"},
>   {Opt_journal_inum, "journal=%u"},
>   {Opt_journal_dev, "journal_dev=%u"},
> + {Opt_journal_checksum, "journal_checksum"},
> + {Opt_journal_async_commit, "journal_async_commit"},

What's journal_async_commit?

>   {Opt_abort, "abort"},
>   {Opt_data_journal, "data=journal"},
>   {Opt_data_ordered, "data=ordered"},
> @@ -948,6 +951,13 @@ static int parse_options (char *options,
>   return 0;
>   *journal_devnum = option;
>   break;
> + case Opt_journal_checksum:
> + set_opt (sbi->s_mount_opt, JOURNAL_CHECKSUM);
> + break;
> + case Opt_journal_async_commit:
> + set_opt (sbi->s_mount_opt, JOURNAL_ASYNC_COMMIT);
> + set_opt (sbi->s_mount_opt, JOURNAL_CHECKSUM);
> + break;
>   case Opt_noload:
>   set_opt (sbi->s_mount_opt, NOLOAD);
>   break;
> @@ -1817,6 +1827,21 @@ static int ext4_fill_super (struct super
>   goto failed_mount4;
>   }
>  
> + if (test_opt(sb, JOURNAL_ASYNC_COMMIT)) {
> + jbd2_journal_set_features(sbi->s_journal,
> + JBD2_FEATURE_COMPAT_CHECKSUM, 0,
> + JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT);
> + } else if (test_opt(sb, JOURNAL_CHECKSUM)) {
> + jbd2_journal_set_features(sbi->s_journal,
> + JBD2_FEATURE_COMPAT_CHECKSUM, 0, 0);
> + jbd2_journal_clear_features(sbi->s_journal, 0, 0,
> + JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT);
> + } else {
> + jbd2_journal_clear_features(sbi->s_journal,
> + JBD2_FEATURE_COMPAT_CHECKSUM, 0,
> + JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT);
> + }

Some discussion of the forward- and backward- compatibility design would be
appropriate.  Also a description of whether and how fsck can be used to fix
up these feature flags.

>   /* We have now updated the journal if required, so we can
>* validate the data journaling mode. */
>   switch (test_opt(sb, DATA_FLAGS)) {
> diff -Nurp linux024/fs/jbd2/commit.c linux/fs/jbd2/commit.c
> --- linux024/fs/jbd2/commit.c 2007-06-25 16:19:25.0 -0500
> +++ linux/fs/jbd2/commit.c2007-06-26 08:40:03.0 -0500
> @@ -21,6 +21,7 @@
>  #include 
>  #include 
>  #include 
> +#include 

I think we just broke CONFIG_EXT4=y, CONFIG_CRC32=n builds.

>  /*
>   * Default IO end handler for temporary BJ_IO buffer_heads.
> @@ -93,15 +94,18 @@ static int inverted_lock(journal_t *jour
>   return 1;
>  }
>  
> -/* Done it all: now write the commit record.  We should have
> +/*
> + * Done it all: now submit the commit record.  We should have
>   * cleaned up our previous buffers by now, so if we are in abort
>   * mode we can now just skip the rest of the journal write
>   * entirely.
>   *
>   * Returns 1 if the journal needs to be aborted or 0 on success
>   */
> -static int journal_write_commit_record(journal_t *journal,
> - transaction_t *commit_transaction)
> +static int journal_submit_commit_record(journal_t *journal,
> + transaction_t *commit_transaction,
> + struct buffer_head **cbh,
> + __u32 crc32_sum)
>  {
>   struct journal_head *descriptor;
>   struct buffer_head *bh;
> @@ -1

Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

2007-07-10 Thread Mingming Cao

On Tue, 2007-07-10 at 22:17 -0700, Andrew Morton wrote:
> On Tue, 10 Jul 2007 22:09:08 -0400 Mingming Cao <[EMAIL PROTECTED]> wrote:
> 
> > David Chinneer pointed that we need to journal the version number
> > updates together with the operations that causes the change of the inode
> > version number, in order to survive server crashes so clients won't see
> > the counter go backwards.
> > 
> > So increment i_version in fs code is probably the place to ensure the
> > inode version changes are stored to disk. It's seems update the ext4
> > inode version in every ext4_mark_inode_dirty() is the easiest way.
> 
> That still makes us dependent upon _something_ changing the inode.  For
> overwrites the only something is mtime.
> 
> If we don't want to have a peculiar dependency upon s_time_gran=1e9 (and
> I don't think we do) then I guess we'll need new code in or around
> file_update_time() to do this.

do you mean mark inode dirty all the times in file_update_time()? Not
sure about the overhead for ext3/4.

Mingming

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 9][PATCH 4/5]Morecleanups:ext4_extent_compilation_fixes

2007-07-10 Thread Andrew Morton

On Sun, 01 Jul 2007 03:38:51 -0400 Mingming Cao <[EMAIL PROTECTED]> wrote:

> Subject: [EXT4 set 9][PATCH 4/5]Morecleanups:ext4_extent_compilation_fixes
> Date: Sun, 01 Jul 2007 03:38:51 -0400
> Sender: [EMAIL PROTECTED]
> Organization: IBM Linux Technology Center
> X-Mailer: Evolution 2.8.0 (2.8.0-33.el5) 
> 
> From: Dmitry Monakhov <[EMAIL PROTECTED]>
> Subject: ext4: extent compilation fixes

I hope this patch series will hit git with nice titles like "ext4: extent
compilation fixes" and not funny ones like
"Morecleanups:ext4_extent_compilation_fixes".

> Fix compilation with EXT_DEBUG, also fix leXX_to_cpu convertions.

"conversions".
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 9][PATCH 5/5]Extent micro cleanups

2007-07-10 Thread Andrew Morton

On Sun, 01 Jul 2007 03:38:59 -0400 Mingming Cao <[EMAIL PROTECTED]> wrote:

> From: Dmitry Monakhov <[EMAIL PROTECTED]>
> Subject: ext4: extent macros cleanup
> 
> - Replace math equation to it's macro equivalent

s/it's/its/;)

> - make ext4_ext_grow_indepth() indexes/leaf correct

hm, what was wrong with it?

> Signed-off-by: Dmitry Monakhov <[EMAIL PROTECTED]>
> Acked-by: Alex Tomas <[EMAIL PROTECTED]>
> Signed-off-by: Dave Kleikamp <[EMAIL PROTECTED]>
> ---
>  fs/ext4/extents.c |   11 +++
>  1 files changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index 12fe3d7..1fd00ac 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -375,7 +375,7 @@ ext4_ext_binsearch_idx(struct inode *inode, struct 
> ext4_ext_path *path, int bloc
>   ext_debug("binsearch for %d(idx):  ", block);
> 
>   l = EXT_FIRST_INDEX(eh) + 1;
> - r = EXT_FIRST_INDEX(eh) + le16_to_cpu(eh->eh_entries) - 1;
> + r = EXT_LAST_INDEX(eh);
>   while (l <= r) {
>   m = l + (r - l) / 2;
>   if (block < le32_to_cpu(m->ei_block))
> @@ -440,7 +440,7 @@ ext4_ext_binsearch(struct inode *inode, struct 
> ext4_ext_path *path, int block)
>   ext_debug("binsearch for %d:  ", block);
> 
>   l = EXT_FIRST_EXTENT(eh) + 1;
> - r = EXT_FIRST_EXTENT(eh) + le16_to_cpu(eh->eh_entries) - 1;
> + r = EXT_LAST_EXTENT(eh);
> 
>   while (l <= r) {
>   m = l + (r - l) / 2;
> @@ -922,8 +922,11 @@ static int ext4_ext_grow_indepth(handle_t *handle, 
> struct inode *inode,
>   curp->p_hdr->eh_max = cpu_to_le16(ext4_ext_space_root_idx(inode));
>   curp->p_hdr->eh_entries = cpu_to_le16(1);
>   curp->p_idx = EXT_FIRST_INDEX(curp->p_hdr);
> - /* FIXME: it works, but actually path[0] can be index */
> - curp->p_idx->ei_block = EXT_FIRST_EXTENT(path[0].p_hdr)->ee_block;
> + 
> + if (path[0].p_hdr->eh_depth)
> +   curp->p_idx->ei_block = EXT_FIRST_INDEX(path[0].p_hdr)->ei_block;
> + else
> +   curp->p_idx->ei_block = EXT_FIRST_EXTENT(path[0].p_hdr)->ee_block;

whitespace bustage there.


-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 2][PATCH 5/5] cleanups: Export jbd2-debug via debugfs

2007-07-10 Thread Andrew Morton

On Wed, 11 Jul 2007 00:38:09 -0500 "Jose R. Santos" <[EMAIL PROTECTED]> wrote:

> 
> > Alternatively (and preferably) do this via an update to
> > Documentation/filesystems/ext4.txt.
> 
> Seems like I also need to update the doc on Kconfig as well.  Do you
> prefer this in separate patches? (current patch, kconfig patch, ext4
> doc update patch?

All these changes are logically connected (aren't they?).  A single patch
is fine.

> > Shoudln't all this debug info be a per-superblock thing rather than
> > kernel-wide?
> 
> I don't think it is worth pursuing this feature since this seems to
> have been broken for a while now (its been there since the first git
> revission in ext3) and nobody has noticed it until now.  It could be
> address on a later patch though, since the initial purpose of the patch
> was to fix the broken JBD2_DEBUG option. Of course, this may not be
> clearly express in the changelog. :)
> 

I don't think that making it all per-superblock is worth the effort - it's
a developer-only thing and developer will have the knowledge to test ext4
on an otherwise-ext3 setup if they're really fussed about the accuracy.

So yes, a bare make-it-work patch sounds appropriate.  Or remove it, but
hey, it might be useful.  The timestamping stuff certainly looks useful.

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

2007-07-10 Thread Andrew Morton

On Tue, 10 Jul 2007 23:18:50 -0400 Mingming Cao <[EMAIL PROTECTED]> wrote:

> On Tue, 2007-07-10 at 22:17 -0700, Andrew Morton wrote:
> > On Tue, 10 Jul 2007 22:09:08 -0400 Mingming Cao <[EMAIL PROTECTED]> wrote:
> > 
> > > David Chinneer pointed that we need to journal the version number
> > > updates together with the operations that causes the change of the inode
> > > version number, in order to survive server crashes so clients won't see
> > > the counter go backwards.
> > > 
> > > So increment i_version in fs code is probably the place to ensure the
> > > inode version changes are stored to disk. It's seems update the ext4
> > > inode version in every ext4_mark_inode_dirty() is the easiest way.
> > 
> > That still makes us dependent upon _something_ changing the inode.  For
> > overwrites the only something is mtime.
> > 
> > If we don't want to have a peculiar dependency upon s_time_gran=1e9 (and
> > I don't think we do) then I guess we'll need new code in or around
> > file_update_time() to do this.
> 
> do you mean mark inode dirty all the times in file_update_time()? Not
> sure about the overhead for ext3/4.
> 

hm, I hadn't thought about it in any detail.

Maybe something like

--- a/fs/inode.c~a
+++ a/fs/inode.c
@@ -1238,6 +1238,11 @@ void file_update_time(struct file *file)
sync_it = 1;
}
 
+   if (IS_I_VERSION_64(inode)) {
+   inode_inc_iversion(inode);  /* Takes i_lock on 32-bit */
+   sync_it = 1;
+   }
+
if (sync_it)
mark_inode_dirty_sync(inode);
 }
_

?
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

83 matches

Mail list logo