[RFC] basic delayed allocation in ext4

2007-07-26 Thread Alex Tomas

Good day,

please review ...

thanks, Alex

Basic delayed allocation in ext4

Two special -get_block() methods are introduced:

 * ext4_da_get_block_prep()
   to be used with -prepare_write(), defers allocation till flush
 * ext4_da_get_block_write()
   to be used with mpage_da_writepages(), allocate blocks and correct on-disk 
size

Current implementation works with data=writeback only, you should
mount filesystem with delalloc,data=writeback options.

TODO:
 * reservation
 * data=ordered
 * quota
 * bmap

Signed-off-by: Alex Tomas [EMAIL PROTECTED]


Index: linux-2.6.22/include/linux/ext4_fs.h
===
--- linux-2.6.22.orig/include/linux/ext4_fs.h   2007-07-26 12:30:25.0 
+0400
+++ linux-2.6.22/include/linux/ext4_fs.h2007-07-26 12:32:04.0 
+0400
@@ -488,6 +488,7 @@ do {
   \
 #define EXT4_MOUNT_EXTENTS 0x40 /* Extents support */
 #define EXT4_MOUNT_JOURNAL_CHECKSUM0x80 /* Journal checksums */
 #define EXT4_MOUNT_JOURNAL_ASYNC_COMMIT0x100 /* Journal Async 
Commit */
+#define EXT4_MOUNT_DELALLOC0x200 /* Delalloc support */
 /* Compatibility, for having both ext2_fs.h and ext4_fs.h included at once */
 #ifndef _LINUX_EXT2_FS_H
 #define clear_opt(o, opt)  o = ~EXT4_MOUNT_##opt
Index: linux-2.6.22/fs/ext4/super.c
===
--- linux-2.6.22.orig/fs/ext4/super.c   2007-07-26 12:30:25.0 +0400
+++ linux-2.6.22/fs/ext4/super.c2007-07-26 12:32:04.0 +0400
@@ -728,7 +728,7 @@ enum {
Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota,
Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_usrquota,
-   Opt_grpquota, Opt_extents, Opt_noextents,
+   Opt_grpquota, Opt_extents, Opt_noextents, Opt_delalloc,
 };

 static match_table_t tokens = {
@@ -782,6 +782,7 @@ static match_table_t tokens = {
{Opt_barrier, barrier=%u},
{Opt_extents, extents},
{Opt_noextents, noextents},
+   {Opt_delalloc, delalloc},
{Opt_err, NULL},
{Opt_resize, resize},
 };
@@ -1127,6 +1128,9 @@ clear_qf_name:
case Opt_noextents:
clear_opt (sbi-s_mount_opt, EXTENTS);
break;
+   case Opt_delalloc:
+   set_opt (sbi-s_mount_opt, DELALLOC);
+   break;
default:
printk (KERN_ERR
EXT4-fs: Unrecognized mount option \%s\ 
Index: linux-2.6.22/fs/ext4/inode.c
===
--- linux-2.6.22.orig/fs/ext4/inode.c   2007-07-26 12:30:22.0 +0400
+++ linux-2.6.22/fs/ext4/inode.c2007-07-26 12:32:04.0 +0400
@@ -39,6 +39,8 @@
 #include xattr.h
 #include acl.h

+static void ext4_invalidatepage(struct page *page, unsigned long offset);
+
 /*
  * Test whether an inode is a fast symlink.
  */
@@ -1291,6 +1293,142 @@ static int ext4_journalled_commit_write(
 }

 /*
+ * this is a special callback for -prepare_write() only
+ * it's intention is to return mapped block or reserve space
+ */
+static int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
+   struct buffer_head *bh_result, int create)
+{
+   int ret = 0;
+
+   BUG_ON(create == 0);
+   BUG_ON(bh_result-b_size != inode-i_sb-s_blocksize);
+
+   /* first, we need to know whether the block is allocated already
+* XXX: when the filesystem has a lot of free blocks, we could
+* reserve even allocated blocks to save this lookup */
+   ret = ext4_get_blocks_wrap(NULL, inode, iblock, 1,  bh_result, 0, 0);
+   if (ret = 0) {
+   if (buffer_mapped(bh_result)) {
+   bh_result-b_size = (ret  inode-i_blkbits);
+   } else {
+   /* OK, the block isn't allocated yet, let's reserve 
space */
+   /* XXX: call reservation here */
+   /* XXX: __block_prepare_write() unmaps passed block, is 
it OK? */
+   map_bh(bh_result, inode-i_sb, 0);
+   set_buffer_new(bh_result);
+   set_buffer_delay(bh_result);
+   }
+   ret = 0;
+   }
+
+   return ret;
+}
+
+
+static int ext4_da_prepare_write(struct file *file, struct page *page,
+   unsigned from, unsigned to)
+{
+   return block_prepare_write(page, from, to, ext4_da_get_block_prep);
+}
+
+static int ext4_da_get_block_write(struct inode *inode, sector_t iblock,
+   struct buffer_head *bh_result, int create)
+{
+   int ret, needed_blocks = 

[patch 25/39] fix inode_table test in ext234_check_descriptors

2007-07-26 Thread akpm
From: Eric Sandeen [EMAIL PROTECTED]

ext[234]_check_descriptors sanity checks block group descriptor geometry at
mount time, testing whether the block bitmap, inode bitmap, and inode table
reside wholly within the blockgroup.  However, the inode table test is off
by one so that if the last block in the inode table resides on the last
block of the block group, the test incorrectly fails.  This is because it
tests the last block as (start + length) rather than (start + length - 1).

This can be seen by trying to mount a filesystem made such as:

 mkfs.ext2 -F -b 1024 -m 0 -g 256 -N 3744 fsfile 1024

which yields:

 EXT2-fs error (device loop0): ext2_check_descriptors: Inode table for group 0 
not in group (block 101)!
 EXT2-fs: group descriptors corrupted!

There is a similar bug in e2fsprogs, patch already sent for that.

(I wonder if inside(), outside(), and/or in_range() should someday be
used in this and other tests throughout the ext filesystems...)

Signed-off-by: Eric Sandeen [EMAIL PROTECTED]
Cc: linux-ext4@vger.kernel.org
Signed-off-by: Andrew Morton [EMAIL PROTECTED]
---

 fs/ext2/super.c |2 +-
 fs/ext3/super.c |2 +-
 fs/ext4/super.c |2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff -puN fs/ext2/super.c~fix-inode_table-test-in-ext234_check_descriptors 
fs/ext2/super.c
--- a/fs/ext2/super.c~fix-inode_table-test-in-ext234_check_descriptors
+++ a/fs/ext2/super.c
@@ -580,7 +580,7 @@ static int ext2_check_descriptors (struc
return 0;
}
if (le32_to_cpu(gdp-bg_inode_table)  first_block ||
-   le32_to_cpu(gdp-bg_inode_table) + sbi-s_itb_per_group 
+   le32_to_cpu(gdp-bg_inode_table) + sbi-s_itb_per_group - 1 

last_block)
{
ext2_error (sb, ext2_check_descriptors,
diff -puN fs/ext3/super.c~fix-inode_table-test-in-ext234_check_descriptors 
fs/ext3/super.c
--- a/fs/ext3/super.c~fix-inode_table-test-in-ext234_check_descriptors
+++ a/fs/ext3/super.c
@@ -1221,7 +1221,7 @@ static int ext3_check_descriptors (struc
return 0;
}
if (le32_to_cpu(gdp-bg_inode_table)  first_block ||
-   le32_to_cpu(gdp-bg_inode_table) + sbi-s_itb_per_group 
+   le32_to_cpu(gdp-bg_inode_table) + sbi-s_itb_per_group - 1 

last_block)
{
ext3_error (sb, ext3_check_descriptors,
diff -puN fs/ext4/super.c~fix-inode_table-test-in-ext234_check_descriptors 
fs/ext4/super.c
--- a/fs/ext4/super.c~fix-inode_table-test-in-ext234_check_descriptors
+++ a/fs/ext4/super.c
@@ -1283,7 +1283,7 @@ static int ext4_check_descriptors (struc
}
inode_table = ext4_inode_table(sb, gdp);
if (inode_table  first_block ||
-   inode_table + sbi-s_itb_per_group  last_block)
+   inode_table + sbi-s_itb_per_group - 1  last_block)
{
ext4_error (sb, ext4_check_descriptors,
Inode table for group %d
_
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] basic delayed allocation in VFS

2007-07-26 Thread Jeff Garzik

Alex Tomas wrote:

Jeff Garzik wrote:

Is this based on Christoph's work?

Christoph, or some other XFS hacker, already did generic delalloc, 
modeled on the XFS delalloc code.


nope, this one is simple (something I'd prefer for ext4).


The XFS one is proven and the work was already completed.

What were the specific technical issues that made it unsuitable for ext4?

I would rather not reinvent the wheel, particularly if the reinvention 
is less capable than the existing work.


Jeff



-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] basic delayed allocation in VFS

2007-07-26 Thread Jeff Garzik

Alex Tomas wrote:

Good day,

please review ...

thanks, Alex


basic delayed allocation in VFS:

 * block_prepare_write() can be passed special -get_block() which
   doesn't allocate blocks, but reserve them and mark bh delayed
 * a filesystem can use mpage_da_writepages() with other -get_block()
   which doesn't defer allocation. mpage_da_writepages() finds all
   non-allocated blocks and try to allocate them with minimal calls
   to -get_block(), then submit IO using __mpage_writepage()


Signed-off-by: Alex Tomas [EMAIL PROTECTED]


Is this based on Christoph's work?

Christoph, or some other XFS hacker, already did generic delalloc, 
modeled on the XFS delalloc code.


Jeff


-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] e2fsprogs: Support for large inode migration.

2007-07-26 Thread Aneesh Kumar K.V



Theodore Tso wrote:

On Thu, Jul 26, 2007 at 05:15:30PM +0530, Aneesh Kumar K.V wrote:

Let me guess, you're testing with a filesystem with two block groups,
right?  And to date you've tested *only* by doubling the size of the
inode.

I tested this with multiple ( 1 and 7 ) groups. But yes all the
testing was to change inode size from 128 to 256.


Your patch comments stated As a part of increasing the inode size we
throw away the free inodes in the last block group.  This is *only*
true if the filesystem has two (and exactly two) block groups.  Hence
my guess that only tested on filesystems with two block groups.



Okey i didn't stress the last part in the above sentence. What i wanted
to convey was inodes in the block groups towards the end (well more than one).




If you tested with larger numbers of block groups, the filesystem must
have been mostly empty, or at least not a large number of directories,
or else you would have noticed that most of the time that your
algorithm wouldn't work.


I guess Undo I/O manager can go in because I have been using it for
the ext3 - ext4 inode migration testing and for testing the above patch.


Well, I really want a valid use case for undo code, and right now the
above patch is IMHO not suitable for merging into mainline, given all
of the problems that it has.



What are the issues you see with PATCH 1 and PATCH 2 which implement 
Undo I/O Manager and undoe2fs other than it is not hooked into any of

the existing tools. I will try to add it to mke2fs as you suggested. But
should that prevent it from going in ?


-aneesh
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] e2fsprogs: Support for large inode migration.

2007-07-26 Thread Theodore Tso
On Wed, Jul 25, 2007 at 01:46:25PM -0600, Andreas Dilger wrote:
 I was just going to write the same things.  However, it should be noted
 that we DO in fact use only the start of each inode table block in
 normal cases.  We did a survey of this across some hundreds of filesystems
 and because all kernels scan each block group's itable from the start of
 the bitmap we only find the beginning of each itable used.

Yep, granted.

 That said, if we were going to follow this approach (which isn't so bad,
 IMHO, because most filesystems are far over-provisioned in terms of
 inodes) then we shouldn't _require_ that the last inodes are unused, but
 rather that  1/2 of the inodes in each group are unused.  Also, we
 should still keep the inodes in the same block group, and do renumbering
 of the inodes in the directories.  At this point, we have a tool that
 could also allow changing the total number of inodes in the filesystem,
 which is something we've wanted in the past.

The problem is without the block relocation code, all we will be able
to do is shrink the total number of inodes in the filesystem --- and
most of the time, there's not a lot of value in shrinking the size of
the inode table.  Sure, you get a tiny amount of space back, and maybe
e2fsck times speed up, but that's about it.  Most of the time people
who want to change the total number of inodes want to increase it.

 Well, since this isn't exactly a common occurrance, I don't think we
 need to push everything into tune2fs.  Having a separate resize2fs
 seems reasonable (we are resizing the inodes after all), and keeping
 so much complexity out of tune2fs helps ensure that we don't introduce
 bugs into tune2fs itself (which is a far more used and critical tool IMHO).

Well, if you look at resize2fs, the major complexity can roughly be
broken down as follows:

* 30% -- Inode mover (and iterating over directory entries)
* 30% -- Block mover (and iterating over inodes to fix up entries)
* 5% -- Moving the inode table 
* 10% -- Updating the superblock/block group descriptors
* 25% -- Making sure nothing bad happens if resize2fs crashes in the 
middle (except when moving the inode table; then we cross 
our fingers and pray)

With the undo I/O manager, the last goes away, and if you need to deal
with iterating over the directory entries, what's left is mostly the
block mover, which really isn't that hard, since the inode mover and
the block mover shares a fair amount of the infrastructure to keep
track what had moved where.

  My suggestion would be to use something like /var/lib/e2fsprogs as the
  defalut directory.  And we should also do some tests to make sure
  something sane happens if we run out of room for the undo file.
  Presumably the only thing we can do is to abort the run and then back
  out the chnages using what was written out to the undo file.
 
 I was going to say /var/tmp, since we don't want to start having to
 manage old versions of these files, add entries to logrotate, etc.

Well, I wanted them in a separate directory so that we could
automatically find the undo files and deal with them automatically.
For example, e2fsck would be able to deal with recovering from an
interrupted resize2fs or tune2fs operation if the system crashed.

In some cases you wouldn't want to automatically reuse them after a
completed operation (i.e., an e2fsck undo file would rarely get
used), so we would need to tag the undo file with what program
generated them, and some kind of temporal identifer (i.e., the
superblock last write/mount time).

I also don't think logrotate entries are that bad of an idea

  - Ted
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


+ remove-unused-bh-in-calls-to-ext234_get_group_desc.patch added to -mm tree

2007-07-26 Thread akpm

The patch titled
 remove unused bh in calls to ext234_get_group_desc
has been added to the -mm tree.  Its filename is
 remove-unused-bh-in-calls-to-ext234_get_group_desc.patch

*** Remember to use Documentation/SubmitChecklist when testing your code ***

See http://www.zip.com.au/~akpm/linux/patches/stuff/added-to-mm.txt to find
out what to do about this

--
Subject: remove unused bh in calls to ext234_get_group_desc
From: Eric Sandeen [EMAIL PROTECTED]

ext[234]_get_group_desc never tests the bh argument, and only sets it if it
is passed in; it is perfectly happy with a NULL bh argument.  But, many
callers send one in and never use it.  May as well call with NULL like
other callers who don't use the bh.

Signed-off-by: Eric Sandeen [EMAIL PROTECTED]
Cc: linux-ext4@vger.kernel.org
Signed-off-by: Andrew Morton [EMAIL PROTECTED]
---

 fs/ext2/ialloc.c |   24 
 fs/ext2/inode.c  |2 +-
 fs/ext3/ialloc.c |   17 +++--
 fs/ext4/ialloc.c |   17 +++--
 4 files changed, 23 insertions(+), 37 deletions(-)

diff -puN fs/ext2/ialloc.c~remove-unused-bh-in-calls-to-ext234_get_group_desc 
fs/ext2/ialloc.c
--- a/fs/ext2/ialloc.c~remove-unused-bh-in-calls-to-ext234_get_group_desc
+++ a/fs/ext2/ialloc.c
@@ -177,7 +177,6 @@ static void ext2_preread_inode(struct in
unsigned long block_group;
unsigned long offset;
unsigned long block;
-   struct buffer_head *bh;
struct ext2_group_desc * gdp;
struct backing_dev_info *bdi;
 
@@ -188,7 +187,7 @@ static void ext2_preread_inode(struct in
return;
 
block_group = (inode-i_ino - 1) / EXT2_INODES_PER_GROUP(inode-i_sb);
-   gdp = ext2_get_group_desc(inode-i_sb, block_group, bh);
+   gdp = ext2_get_group_desc(inode-i_sb, block_group, NULL);
if (gdp == NULL)
return;
 
@@ -217,11 +216,10 @@ static int find_group_dir(struct super_b
int ngroups = EXT2_SB(sb)-s_groups_count;
int avefreei = ext2_count_free_inodes(sb) / ngroups;
struct ext2_group_desc *desc, *best_desc = NULL;
-   struct buffer_head *bh, *best_bh = NULL;
int group, best_group = -1;
 
for (group = 0; group  ngroups; group++) {
-   desc = ext2_get_group_desc (sb, group, bh);
+   desc = ext2_get_group_desc (sb, group, NULL);
if (!desc || !desc-bg_free_inodes_count)
continue;
if (le16_to_cpu(desc-bg_free_inodes_count)  avefreei)
@@ -231,7 +229,6 @@ static int find_group_dir(struct super_b
 le16_to_cpu(best_desc-bg_free_blocks_count))) {
best_group = group;
best_desc = desc;
-   best_bh = bh;
}
}
if (!best_desc)
@@ -284,7 +281,6 @@ static int find_group_orlov(struct super
int max_debt, max_dirs, min_blocks, min_inodes;
int group = -1, i;
struct ext2_group_desc *desc;
-   struct buffer_head *bh;
 
freei = percpu_counter_read_positive(sbi-s_freeinodes_counter);
avefreei = freei / ngroups;
@@ -295,7 +291,6 @@ static int find_group_orlov(struct super
if ((parent == sb-s_root-d_inode) ||
(EXT2_I(parent)-i_flags  EXT2_TOPDIR_FL)) {
struct ext2_group_desc *best_desc = NULL;
-   struct buffer_head *best_bh = NULL;
int best_ndir = inodes_per_group;
int best_group = -1;
 
@@ -303,7 +298,7 @@ static int find_group_orlov(struct super
parent_group = (unsigned)group % ngroups;
for (i = 0; i  ngroups; i++) {
group = (parent_group + i) % ngroups;
-   desc = ext2_get_group_desc (sb, group, bh);
+   desc = ext2_get_group_desc (sb, group, NULL);
if (!desc || !desc-bg_free_inodes_count)
continue;
if (le16_to_cpu(desc-bg_used_dirs_count) = best_ndir)
@@ -315,11 +310,9 @@ static int find_group_orlov(struct super
best_group = group;
best_ndir = le16_to_cpu(desc-bg_used_dirs_count);
best_desc = desc;
-   best_bh = bh;
}
if (best_group = 0) {
desc = best_desc;
-   bh = best_bh;
group = best_group;
goto found;
}
@@ -345,7 +338,7 @@ static int find_group_orlov(struct super
 
for (i = 0; i  ngroups; i++) {
group = (parent_group + i) % ngroups;
-   desc = ext2_get_group_desc (sb, group, bh);
+   desc = ext2_get_group_desc (sb, group, NULL);
if (!desc || !desc-bg_free_inodes_count)
continue;

[BUG?] ext4_ext_put_in_cache uses __u32 to receive physical block number.

2007-07-26 Thread Yan Zheng
Hi, all

I think I found a bug in ext4/extents.c, ext4_ext_put_in_cache uses
__u32 to receive physical block number.  ext4_ext_put_in_cache is
used in ext4_ext_get_blocks, it sets ext4 inode's extent cache
according most recently tree lookup (higher 16 bits of saved physical
block number are always zero). when serving a mapping request,
ext4_ext_get_blocks first check whether the logical block is in
inode's extent cache. if the logical block is in the cache and the
cached region isn't a gap, ext4_ext_get_blocks gets physical block
number by using cached region's physical block number and offset in
the cached region.  as described above, ext4_ext_get_blocks may
return wrong result when there are physical block numbers bigger than
0x.

Regards

YZ
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] basic delayed allocation in VFS

2007-07-26 Thread David Chinner
[please don't top post!]

On Thu, Jul 26, 2007 at 05:33:08PM +0400, Alex Tomas wrote:
 Jeff Garzik wrote:
 The XFS one is proven and the work was already completed.
 
 What were the specific technical issues that made it unsuitable for ext4?
 
 I would rather not reinvent the wheel, particularly if the reinvention 
 is less capable than the existing work.

 It duplicates fs/mpage.c in bio building and introduces new generic API
 (iomap, map_blocks_t, etc).

Using a new API for new functionality is a bad thing?

 In contrast, my trivial implementation re-use
 existing code in fs/mpage.c, doesn't introduce new API and I tend to think
 provides quite the same functionality. I can be wrong, of course ...

No, it doesn't provide the same functionality.

Firstly, XFS attaches a different I/O completion to delalloc writes
to allow us to update the file size when the write is beyond the
current on disk EOF. This code cannot do that as all it does is
allocation and present normal looking buffers to the generic code
path.

Secondly, apart from delalloc, XFS cannot use the generic code paths
for writeback because unwritten extent conversion also requires
custom I/O completion handlers. Given that __mpage_writepage() only
calls -writepage when it is confused, XFS simply cannot use this
API.

Also, looking at the way mpage_da_map_blocks() is done - if we have
an 128MB delalloc extent - ext4 will allocate that will allocate it
in one go, right? What happens if we then crash after only writing a
few megabytes of that extent? stale data exposure? XFS can allocate
multiple gigabytes in a single get_blocks call so even if ext4 can't
do this, it's a problem for XFS.

So without the ability to attach specific I/O completions to bios
or support for unwritten extents directly in __mpage_writepage,
there is no way XFS can use this generic delayed allocation code.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line unsubscribe linux-ext4 in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html