Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-26 Thread Valerie Henson
On Thu, Apr 26, 2007 at 10:47:38AM +0200, Jan Kara wrote:
>   Do I get it right that you just have in each cnode a pointer to the
> previous & next cnode? But then if two consecutive cnodes get corrupted,
> you have no way to connect the chain, do you? If each cnode contained
> some unique identifier of the file and a number identifying position of
> cnode, then there would be at least some way (though expensive) to
> link them together correctly...

You're right, it's easy to add a little more redundancy that would
make it possible to recover from two consecutive nodes being
corrupted.  Keeping a parent inode id in each continuation inode is
definitely a smart thing to do.

Some minor side notes: Continuation inodes aren't really in any
defined order - if you look at Jeff's ping-pong chunk allocation
example, you'll see that the data in each continuation inode won't be
in linearly increasing order.  Also, while the current implementation
is a simple doubly-linked list, this may not be the best solution
long-term.  What's important is that each continuation inode have a
back pointer to the parent and that there is some structure for
quickly looking up the continuation inode for a given file offset.
Suggestions for data structures that work well in this situation are
welcome. :)
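
As one possible starting point, here is a minimal sketch of an offset
index, assuming each entry records the parent inode id (the back pointer
discussed above), a byte range, and the chunk whose continuation inode
holds that range.  The struct and function names are made up for
illustration and are not taken from the chunkfs code; a sorted array with
binary search is only one of several structures that would work.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical index entry: one byte range of a file and where it lives. */
struct cnode_range {
    uint64_t parent_ino;  /* back pointer to the file's head inode */
    uint64_t start;       /* first byte offset covered by this range */
    uint64_t len;         /* number of bytes covered, must be > 0 */
    uint32_t chunk;       /* chunk whose continuation inode holds the range */
};

/*
 * Binary search over an index sorted by start offset, with ranges that
 * do not overlap.  Several ranges may name the same chunk, since data is
 * not laid out in linearly increasing order.  Returns the covering entry,
 * or NULL if the offset falls in a hole.
 */
const struct cnode_range *cnode_lookup(const struct cnode_range *idx,
                                       size_t n, uint64_t offset)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        assert(idx[mid].len > 0);
        if (offset < idx[mid].start)
            hi = mid;
        else if (offset - idx[mid].start >= idx[mid].len)
            lo = mid + 1;
        else
            return &idx[mid];
    }
    return NULL;
}
```

Because every entry carries parent_ino, a scan of all chunks could rebuild
the index even if neighbouring continuation inodes are lost, at the cost
of reading every chunk.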

-VAL
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-26 Thread Valerie Henson
On Thu, Apr 26, 2007 at 12:05:04PM -0400, Jeff Dike wrote:
> 
> No, I'm referring to a different file.  The scenario is that you have
> a growing file in a nearly full disk with files being deleted (and
> thus space being freed) such that allocations for the growing file
> bounce back and forth between chunks.

This is an excellent question.  I call this the ping-pong problem.
The solution is as Amit describes: You have a maximum of one
continuation inode per file per chunk, and you require sparse files.
Here's an example, spelled out:

Allocate file 1 in chunk A.
Grow file 1.
Chunk A fills up.
Allocate continuation inode for file 1 in chunk B.
Chunk A gets some free space.
Chunk B fills up.
Pick chunk A for allocating next block of file 1.
Try to look up a continuation inode for file 1 in chunk A.
Continuation inode for file 1 found in chunk A!
Attach newly allocated block to existing inode for file 1 in chunk A.

This is why the file format inside each chunk needs to support sparse
files.
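
The rule in the example above (look up before you create, so a file never
has more than one inode per chunk) can be sketched as a toy in-memory
model.  All names here are hypothetical, and real code would of course
allocate on-disk inodes rather than flip a flag.

```c
#include <assert.h>

#define MAX_CHUNKS 8

/* Toy model of one file spread over several chunks. */
struct file_model {
    int has_inode[MAX_CHUNKS];    /* 1 if the file has an inode in the chunk */
    unsigned blocks[MAX_CHUNKS];  /* blocks attached through that inode */
    unsigned inode_count;         /* head inode plus continuation inodes */
};

/*
 * Attach one newly allocated block in 'chunk'.  The file's existing inode
 * in that chunk is reused when there is one; otherwise a new one is
 * created.  Returns 1 if an inode had to be created, 0 if one was reused.
 */
int attach_block(struct file_model *f, unsigned chunk)
{
    int created = 0;

    assert(f != 0 && chunk < MAX_CHUNKS);
    if (!f->has_inode[chunk]) {
        f->has_inode[chunk] = 1;
        f->inode_count++;
        created = 1;
    }
    f->blocks[chunk]++;
    return created;
}
```

However often allocation bounces between chunk A and chunk B, inode_count
stays at two in this model; the price is that each per-chunk inode ends up
holding non-adjacent pieces of the file, hence the need for sparse files.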

I have a presentation that has a series of slides on problems and
potential resolutions that might help:

http://infohost.nmt.edu/~val/review/chunkfs_presentation.pdf

-VAL


Re: [patch] unprivileged mounts update

2007-04-26 Thread Eric W. Biederman
Miklos Szeredi <[EMAIL PROTECTED]> writes:

>> On Apr 25 2007 11:21, Eric W. Biederman wrote:
>> >>
>> >> Why did we want to use fsuid, exactly?
>> >
>> >- Because ruid is completely the wrong thing; we want mounts owned
>> >  by whomever's permissions we are using to perform the mount.
>> 
>> Think nfs. I access some nfs file as an unprivileged user. knfsd, by
>> nature, would run as euid=0, uid=0, but it needs fsuid=jengelh for
>> most permission logic to work as expected.
>
> I don't think knfsd will ever want to call mount(2).
>
> But yeah, I've been convinced, that using fsuid is the right thing to
> do.

Actually knfsd does call mount: when it crosses a mount point on the NFS
server, it generates an equivalent mount point in Linux.  At least I think
that is what it is doing.  It is very similar to our mount propagation
path.

However as a special case I don't think the permission checking is likely
to bite us there.  It is worth double checking once we have the other details
ironed out.

Eric


Re: [patch] unprivileged mounts update

2007-04-26 Thread Serge E. Hallyn
Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> > Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> > > > So then as far as you're concerned, the patches which were in -mm will
> > > > remain unchanged?
> > > 
> > > Basically yes. I've merged the update patch, which was not yet added
> > > to -mm, did some cosmetic code changes, and updated the patch headers.
> > > 
> > > There's one open point, that I think we haven't really explored, and
> > > that is the propagation semantics.  I think you had the idea, that a
> > > propagated mount should inherit ownership from the parent into which
> > > it was propagated.
> > 
> > Don't think that was me.  I stayed out of those early discussions
> > because I wasn't comfortable guessing at the proper semantics yet.
> 
> Yes, sorry, it was Eric's suggestion.
> 
> > But really, I, as admin, have to set up both propagation and user mounts
> > for a particular subtree, so why would I *not* want user mounts to be
> > propagated?
> > 
> > So, in my own situation, I have done
> > 
> > make / rshared
> > mount --bind /share /share
> > make /share unbindable
> > for u in $users; do
> > mount --rbind / /share/$u/root
> > make /share/$u/root rslave
> > make /share/$u/root rshared
> > mount --bind -o user=$u /share/$u/root/home/$u /share/$u/root/home/$u
> > done
> > 
> > All users get chrooted into /share/$USER/root, some also get their own
> > namespace.  Clearly if a user in a new namespace does
> > 
> > mount --bind -o user=me ~/somedir ~/otherdir
> > 
> > then logs out, and logs back in, I want the ~/otherdir in the new
> > namespace (and the one in the 'init' namespace) to also be owned by
> > 'me'.
> > 
> > > That sounds good if everyone agrees?
> > 
> > I've shown where I think propagating the mount owner is useful.  Can you
> > detail a scenario where doing so would be bad?  Then we can work toward
> > semantics that make sense...
> 
> But in your example, the "propagated mount inherits ownership from
> parent mount" would also work, since in all namespaces the owner of
> the parent would necessarily be "me".

true.

> The "inherits parent" semantics would work better for example in the
> "all nosuid" namespace, where the user is free to modify its mount
> namespace. 
> 
> If for example propagation is set up from the initial namespace to
> this user's namespace and a new mount is added to the initial
> namespace, it would be nice if the propagated new mount would also be
> owned by the user (and be "nosuid" of course).

ok, so in the example i gave, this would be the admin in the
initial namespace mounting something under /home/$USER/, which
gets propagated to slave /share/$USER/root/home/$USER, where
we would want a different mount owner.

> Does the above make sense?  I'm not sure I've explained clearly
> enough.

I think I see.  Sounds like inherit from parent does the right thing
all around, at least in cases we've thought of so far.

thanks,
-serge


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-26 Thread Jörn Engel
On Thu, 26 April 2007 10:47:40 +1000, David Chinner wrote:
> 
> This assumes that you know a chunk has been corrupted, though.
> How do you find that out?

Option 1: you notice something odd while serving userspace.
Option 2: a checking/scrubbing daemon of some sort.

The first will obviously miss any corruption in data that is not touched
for a long time (ever?).

> > What you need to make this go fast is (1) a pre-made list of which
> > chunks have links with which other chunks,
> 
> So you add a new on-disk structure that needs to be kept up to
> date? How do you trust that structure to be correct if you are
> not journalling it?

Only chance I see is to treat this list as hints.  It should contain all
chunks that possibly have links.  It may also contain chunks that don't
have links.  By keeping strict FFS-style ordering of all relevant
writes, any mismatch should only cost fsck time.

Managing this list appears to be less than trivial.  Might actually be
easier to have LogFS-style rmap for each object in the filesystem.

> What happens if fsfuzzer trashes part
> of this table as well and you can't trust it?

If you have 5000 redundant copies of data and all get corrupted, you are
doomed.  I don't expect my filesystem to recover after having written
0x00 over the whole device.

Being able to recover a single corruption happening anywhere on the
device is already a huge step forward.  Of course most current
filesystems wouldn't even be able to detect all possible corruptions.
That alone would be a step forward.

One of the smart things of ZFS is to checksum everything.  Among the
Linux filesystems only JFFS2 seems to do it, but it cannot distinguish
between corrupted data and incomplete writes before a crash.  It
definitely costs performance, but that is the price one has to pay if
errors are to be detected.
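
To make the detection idea concrete, here is a toy sketch that stores a
checksum next to each block and verifies it on read.  The checksum is an
Adler-32-style sum chosen only because it is short; it is not what ZFS or
JFFS2 actually use, and the block layout and function names are invented
for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy block: payload plus a checksum stored alongside it. */
struct block {
    uint8_t  data[64];
    uint32_t csum;
};

/* Adler-32-style checksum over n bytes. */
uint32_t toy_checksum(const uint8_t *p, size_t n)
{
    uint32_t a = 1, b = 0;

    assert(p != NULL);
    for (size_t i = 0; i < n; i++) {
        a = (a + p[i]) % 65521u;
        b = (b + a) % 65521u;
    }
    return (b << 16) | a;
}

/* Record the checksum when the block is written. */
void block_seal(struct block *b)
{
    assert(b != NULL);
    b->csum = toy_checksum(b->data, sizeof b->data);
}

/* Returns 1 if the stored checksum still matches the data, 0 otherwise. */
int block_verify(const struct block *b)
{
    assert(b != NULL);
    return b->csum == toy_checksum(b->data, sizeof b->data);
}
```

Note that a mismatch only says the block is bad; as with JFFS2, this alone
cannot tell corruption apart from a write that was cut short by a crash.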

Jörn

-- 
Do not stop an army on its way home.
-- Sun Tzu


Re: [patch] unprivileged mounts update

2007-04-26 Thread Miklos Szeredi
> On Apr 25 2007 11:21, Eric W. Biederman wrote:
> >>
> >> Why did we want to use fsuid, exactly?
> >
> >- Because ruid is completely the wrong thing; we want mounts owned
> >  by whomever's permissions we are using to perform the mount.
> 
> Think nfs. I access some nfs file as an unprivileged user. knfsd, by
> nature, would run as euid=0, uid=0, but it needs fsuid=jengelh for
> most permission logic to work as expected.

I don't think knfsd will ever want to call mount(2).

But yeah, I've been convinced, that using fsuid is the right thing to
do.

Miklos


Re: [patch] unprivileged mounts update

2007-04-26 Thread Miklos Szeredi
> Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> > > So then as far as you're concerned, the patches which were in -mm will
> > > remain unchanged?
> > 
> > Basically yes. I've merged the update patch, which was not yet added
> > to -mm, did some cosmetic code changes, and updated the patch headers.
> > 
> > There's one open point, that I think we haven't really explored, and
> > that is the propagation semantics.  I think you had the idea, that a
> > propagated mount should inherit ownership from the parent into which
> > it was propagated.
> 
> Don't think that was me.  I stayed out of those early discussions
> because I wasn't comfortable guessing at the proper semantics yet.

Yes, sorry, it was Eric's suggestion.

> But really, I, as admin, have to set up both propagation and user mounts
> for a particular subtree, so why would I *not* want user mounts to be
> propagated?
> 
> So, in my own situation, I have done
> 
>   make / rshared
>   mount --bind /share /share
>   make /share unbindable
>   for u in $users; do
>   mount --rbind / /share/$u/root
>   make /share/$u/root rslave
>   make /share/$u/root rshared
>   mount --bind -o user=$u /share/$u/root/home/$u /share/$u/root/home/$u
>   done
> 
> All users get chrooted into /share/$USER/root, some also get their own
> namespace.  Clearly if a user in a new namespace does
> 
>   mount --bind -o user=me ~/somedir ~/otherdir
> 
> then logs out, and logs back in, I want the ~/otherdir in the new
> namespace (and the one in the 'init' namespace) to also be owned by
> 'me'.
> 
> > That sounds good if everyone agrees?
> 
> I've shown where I think propagating the mount owner is useful.  Can you
> detail a scenario where doing so would be bad?  Then we can work toward
> semantics that make sense...

But in your example, the "propagated mount inherits ownership from
parent mount" would also work, since in all namespaces the owner of
the parent would necessarily be "me".

The "inherits parent" semantics would work better for example in the
"all nosuid" namespace, where the user is free to modify its mount
namespace. 

If for example propagation is set up from the initial namespace to
this user's namespace and a new mount is added to the initial
namespace, it would be nice if the propagated new mount would also be
owned by the user (and be "nosuid" of course).

Does the above make sense?  I'm not sure I've explained clearly
enough.

Miklos


Re: [patch] unprivileged mounts update

2007-04-26 Thread Serge E. Hallyn
Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> > So then as far as you're concerned, the patches which were in -mm will
> > remain unchanged?
> 
> Basically yes. I've merged the update patch, which was not yet added
> to -mm, did some cosmetic code changes, and updated the patch headers.
> 
> There's one open point, that I think we haven't really explored, and
> that is the propagation semantics.  I think you had the idea, that a
> propagated mount should inherit ownership from the parent into which
> it was propagated.

Don't think that was me.  I stayed out of those early discussions
because I wasn't comfortable guessing at the proper semantics yet.

But really, I, as admin, have to set up both propagation and user mounts
for a particular subtree, so why would I *not* want user mounts to be
propagated?

So, in my own situation, I have done

make / rshared
mount --bind /share /share
make /share unbindable
for u in $users; do
mount --rbind / /share/$u/root
make /share/$u/root rslave
make /share/$u/root rshared
mount --bind -o user=$u /share/$u/root/home/$u /share/$u/root/home/$u
done

All users get chrooted into /share/$USER/root, some also get their own
namespace.  Clearly if a user in a new namespace does

mount --bind -o user=me ~/somedir ~/otherdir

then logs out, and logs back in, I want the ~/otherdir in the new
namespace (and the one in the 'init' namespace) to also be owned by
'me'.

> That sounds good if everyone agrees?

I've shown where I think propagating the mount owner is useful.  Can you
detail a scenario where doing so would be bad?  Then we can work toward
semantics that make sense...

-serge


Re: [patch] unprivileged mounts update

2007-04-26 Thread Jan Engelhardt

On Apr 25 2007 11:21, Eric W. Biederman wrote:
>>
>> Why did we want to use fsuid, exactly?
>
>- Because ruid is completely the wrong thing; we want mounts owned
>  by whomever's permissions we are using to perform the mount.

Think nfs. I access some nfs file as an unprivileged user. knfsd, by
nature, would run as euid=0, uid=0, but it needs fsuid=jengelh for
most permission logic to work as expected.


Jan
-- 


Ext2/3 block remapping tool

2007-04-26 Thread Jan Kara
  Hello,

  I've lately been playing with remapping ext2/ext3 blocks (especially how
much it can give us in terms of speed of things like KDE start). For that
I've written two simple tools (you can get them from
ftp.suse.com/pub/people/jack/ext3remapper.tar.gz):
  e2block2file to transform (preparsed) output from blktrace into a list
of accessed files and offsets accessed
  e2remapblocks to use output from e2block2file and remap blocks into big
chunks in the order in which they were accessed.
  (see README in the tools archive for more details)

  So far the tools (especially e2remapblocks ;) work only on an unmounted
filesystem. The ultimate goal is to be able to do similar things for
mounted filesystems but I wanted to see whether block remapping is worth it
and what kernel interfaces would be useful for achieving the goal.
  BTW, the results for KDE startup are as follows:
The root partition was about 4.8 GB with around 1 GB free. The system has
1 GB of memory. All measurements (except for the warm-cache run) were performed after
  sync; echo 3 >/proc/sys/vm/drop_caches

Ordinary start: 19.2 20.3 19.5 19.8 19.3; avg. 19.62
Start with all data cached: 7 7.6 7.3 7.1 7.1; avg. 7.22
Start with fcache (see thread http://lkml.org/lkml/2006/5/15/46 for details
on fcache):
  11.3 11 10.3 10.8 10.6; avg. 10.8
Start with blocks remapped with e2remapblocks:
  13.5 15 13 14.5 14.5; avg. 14.1
(after remapping, data was stored in 20 contiguous extents on disk)

Honza


-- 
Jan Kara <[EMAIL PROTECTED]>
SuSE CR Labs


[PATCH 5/5] ext4: write support for preallocated blocks/extents

2007-04-26 Thread Amit K. Arora
This patch adds write support for preallocated (using fallocate system
call) blocks/extents. The preallocated extents in ext4 are marked
"uninitialized", hence they need special handling especially while
writing to them. This patch takes care of that.
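
As a rough illustration of the special handling, the sketch below models
the three cases listed in the comment above
ext4_ext_convert_to_initialized() in the diff: a write into an
uninitialized extent yields one initialized piece plus up to two
uninitialized remainders.  This is a standalone model with made-up names
(struct piece, split_for_write), not the patch's code, which also has to
update the extent tree and journal.

```c
#include <assert.h>
#include <stdint.h>

/* One resulting extent: logical start, length, and whether it stays uninitialized. */
struct piece {
    uint32_t start;
    uint32_t len;
    int uninit;
};

/*
 * Split the uninitialized extent [ee_block, ee_block + ee_len) for a
 * write of nblocks starting at iblock (which must lie inside the extent).
 * The write is clamped to the end of the extent.  Fills out[] in logical
 * order and returns the number of pieces (1, 2 or 3).
 */
int split_for_write(uint32_t ee_block, uint32_t ee_len,
                    uint32_t iblock, uint32_t nblocks,
                    struct piece out[3])
{
    uint64_t ext_end = (uint64_t)ee_block + ee_len;
    uint64_t wr_end = (uint64_t)iblock + nblocks;
    int n = 0;

    assert(ee_len > 0 && nblocks > 0);
    assert(iblock >= ee_block && iblock < ext_end);
    if (wr_end > ext_end)
        wr_end = ext_end;

    if (iblock > ee_block)      /* untouched head stays uninitialized */
        out[n++] = (struct piece){ ee_block, iblock - ee_block, 1 };
    out[n++] = (struct piece){ iblock, (uint32_t)(wr_end - iblock), 0 };
    if (wr_end < ext_end)       /* untouched tail stays uninitialized */
        out[n++] = (struct piece){ (uint32_t)wr_end,
                                   (uint32_t)(ext_end - wr_end), 1 };
    return n;
}
```

For an extent covering blocks 100..109, writing all ten blocks gives one
piece, writing the first four gives two, and writing blocks 103..106 gives
three.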

Signed-off-by: Amit Arora <[EMAIL PROTECTED]>
---
 fs/ext4/extents.c   |  228 +++-
 include/linux/ext4_fs_extents.h |1 
 2 files changed, 202 insertions(+), 27 deletions(-)

Index: linux-2.6.21/fs/ext4/extents.c
===
--- linux-2.6.21.orig/fs/ext4/extents.c
+++ linux-2.6.21/fs/ext4/extents.c
@@ -1141,6 +1141,51 @@ ext4_can_extents_be_merged(struct inode 
 }
 
 /*
+ * ext4_ext_try_to_merge:
+ * tries to merge the "ex" extent to the next extent in the tree.
+ * It always tries to merge towards right. If you want to merge towards
+ * left, pass "ex - 1" as argument instead of "ex".
+ * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns
+ * 1 if they got merged.
+ */
+int ext4_ext_try_to_merge(struct inode *inode,
+   struct ext4_ext_path *path,
+   struct ext4_extent *ex)
+{
+   struct ext4_extent_header *eh;
+   unsigned int depth, len;
+   int merge_done=0, uninitialized = 0;
+
+   depth = ext_depth(inode);
+   BUG_ON(path[depth].p_hdr == NULL);
+   eh = path[depth].p_hdr;
+
+   while (ex < EXT_LAST_EXTENT(eh)) {
+   if (!ext4_can_extents_be_merged(inode, ex, ex + 1))
+   break;
+   /* merge with next extent! */
+   if (ext4_ext_is_uninitialized(ex))
+   uninitialized = 1;
+   ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex)
+   + ext4_ext_get_actual_len(ex + 1));
+   if (uninitialized)
+   ext4_ext_mark_uninitialized(ex);
+
+   if (ex + 1 < EXT_LAST_EXTENT(eh)) {
+   len = (EXT_LAST_EXTENT(eh) - ex - 1)
+   * sizeof(struct ext4_extent);
+   memmove(ex + 1, ex + 2, len);
+   }
+   eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1);
+   merge_done = 1;
+   BUG_ON(eh->eh_entries == 0);
+   }
+
+   return merge_done;
+}
+
+
+/*
  * ext4_ext_check_overlap:
  * check if a portion of the "newext" extent overlaps with an
  * existing extent.
@@ -1316,25 +1361,7 @@ has_space:
 
 merge:
/* try to merge extents to the right */
-   while (nearex < EXT_LAST_EXTENT(eh)) {
-   if (!ext4_can_extents_be_merged(inode, nearex, nearex + 1))
-   break;
-   /* merge with next extent! */
-   if (ext4_ext_is_uninitialized(nearex))
-   uninitialized = 1;
-   nearex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(nearex)
-   + ext4_ext_get_actual_len(nearex + 1));
-   if (uninitialized)
-   ext4_ext_mark_uninitialized(nearex);
-
-   if (nearex + 1 < EXT_LAST_EXTENT(eh)) {
-   len = (EXT_LAST_EXTENT(eh) - nearex - 1)
-   * sizeof(struct ext4_extent);
-   memmove(nearex + 1, nearex + 2, len);
-   }
-   eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1);
-   BUG_ON(eh->eh_entries == 0);
-   }
+   ext4_ext_try_to_merge(inode, path, nearex);
 
/* try to merge extents to the left */
 
@@ -1999,15 +2026,149 @@ void ext4_ext_release(struct super_block
 #endif
 }
 
+/*
+ * ext4_ext_convert_to_initialized:
+ * this function is called by ext4_ext_get_blocks() if someone tries to write
+ * to an uninitialized extent. It may result in splitting the uninitialized
+ * extent into multiple extents (up to three). At least one initialized extent
+ * and at most two uninitialized extents can result.
+ * There are three possibilities:
+ *   a> No split required: Entire extent should be initialized.
+ *   b> Split into two extents: Only one end of the extent is being written to.
+ *   c> Split into three extents: Someone is writing in middle of the extent.
+ */
+int ext4_ext_convert_to_initialized(handle_t *handle, struct inode *inode,
+   struct ext4_ext_path *path,
+   ext4_fsblk_t iblock,
+   unsigned long max_blocks)
+{
+   struct ext4_extent *ex, *ex1 = NULL, *ex2 = NULL, *ex3 = NULL, newex;
+   struct ext4_extent_header *eh;
+   unsigned int allocated, ee_block, ee_len, depth;
+   ext4_fsblk_t newblock;
+   int err = 0, ret = 0;
+
+   depth = ext_depth(inode);
+   eh = path[depth].p_hdr;
+   ex = path[depth].p_ext;
+   ee_block = le3

[PATCH 4/5] ext4: fallocate support in ext4

2007-04-26 Thread Amit K. Arora
This patch has the ext4 implementation of the fallocate system call.
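
For readers skimming the diff, the merge rule that the patch changes in
ext4_can_extents_be_merged() can be modelled in isolation as below.  The
bit layout is an assumption made for this sketch: the "uninitialized" flag
is taken to be the top bit of the 16-bit length, which is what the
existing comment about "the top bit of ee_len" suggests, but the helper
definitions themselves are not visible in this excerpt.  The struct and
function names are simplified stand-ins, and byte-order conversion is
left out.

```c
#include <assert.h>
#include <stdint.h>

#define EXT_MAX_LEN    0x7fffu  /* assumed: low 15 bits carry the length */
#define EXT_UNINIT_BIT 0x8000u  /* assumed: top bit marks "uninitialized" */

/* Simplified extent: logical block, raw length field, physical block. */
struct ext {
    uint32_t block;
    uint16_t len_raw;
    uint64_t pblock;
};

static unsigned actual_len(const struct ext *e)
{
    return e->len_raw & EXT_MAX_LEN;
}

static int is_uninit(const struct ext *e)
{
    return (e->len_raw & EXT_UNINIT_BIT) != 0;
}

/* Returns 1 if b can be merged onto the end of a, 0 otherwise. */
int can_merge(const struct ext *a, const struct ext *b)
{
    unsigned la = actual_len(a), lb = actual_len(b);

    assert(la > 0 && lb > 0);
    if (is_uninit(a) != is_uninit(b))   /* never mix the two states */
        return 0;
    if ((uint64_t)a->block + la != b->block)   /* logically adjacent? */
        return 0;
    if (la + lb > EXT_MAX_LEN)          /* merged length must not hit the flag bit */
        return 0;
    return a->pblock + la == b->pblock; /* physically adjacent? */
}
```

The point of using the actual length everywhere is that the raw field of
an uninitialized extent would otherwise look like a huge length and break
both the adjacency and the overflow checks.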

Signed-off-by: Amit Arora <[EMAIL PROTECTED]>
---
 fs/ext4/extents.c   |  201 +++-
 fs/ext4/file.c  |1 
 include/linux/ext4_fs.h |7 +
 include/linux/ext4_fs_extents.h |   13 ++
 4 files changed, 179 insertions(+), 43 deletions(-)

Index: linux-2.6.21/fs/ext4/extents.c
===
--- linux-2.6.21.orig/fs/ext4/extents.c
+++ linux-2.6.21/fs/ext4/extents.c
@@ -283,7 +283,7 @@ static void ext4_ext_show_path(struct in
} else if (path->p_ext) {
ext_debug("  %d:%d:%llu ",
  le32_to_cpu(path->p_ext->ee_block),
- le16_to_cpu(path->p_ext->ee_len),
+ ext4_ext_get_actual_len(path->p_ext),
  ext_pblock(path->p_ext));
} else
ext_debug("  []");
@@ -306,7 +306,7 @@ static void ext4_ext_show_leaf(struct in
 
for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ex++) {
ext_debug("%d:%d:%llu ", le32_to_cpu(ex->ee_block),
- le16_to_cpu(ex->ee_len), ext_pblock(ex));
+ ext4_ext_get_actual_len(ex), ext_pblock(ex));
}
ext_debug("\n");
 }
@@ -426,7 +426,7 @@ ext4_ext_binsearch(struct inode *inode, 
ext_debug("  -> %d:%llu:%d ",
le32_to_cpu(path->p_ext->ee_block),
ext_pblock(path->p_ext),
-   le16_to_cpu(path->p_ext->ee_len));
+   ext4_ext_get_actual_len(path->p_ext));
 
 #ifdef CHECK_BINSEARCH
{
@@ -687,7 +687,7 @@ static int ext4_ext_split(handle_t *hand
ext_debug("move %d:%llu:%d in new leaf %llu\n",
le32_to_cpu(path[depth].p_ext->ee_block),
ext_pblock(path[depth].p_ext),
-   le16_to_cpu(path[depth].p_ext->ee_len),
+   ext4_ext_get_actual_len(path[depth].p_ext),
newblock);
/*memmove(ex++, path[depth].p_ext++,
sizeof(struct ext4_extent));
@@ -1107,7 +1107,19 @@ static int
 ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
struct ext4_extent *ex2)
 {
-   if (le32_to_cpu(ex1->ee_block) + le16_to_cpu(ex1->ee_len) !=
+   unsigned short ext1_ee_len, ext2_ee_len;
+
+   /*
+* Make sure that either both extents are uninitialized, or
+* both are _not_.
+*/
+   if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2))
+   return 0;
+
+   ext1_ee_len = ext4_ext_get_actual_len(ex1);
+   ext2_ee_len = ext4_ext_get_actual_len(ex2);
+
+   if (le32_to_cpu(ex1->ee_block) + ext1_ee_len !=
le32_to_cpu(ex2->ee_block))
return 0;
 
@@ -1116,14 +1128,14 @@ ext4_can_extents_be_merged(struct inode 
 * as an RO_COMPAT feature, refuse to merge to extents if
 * this can result in the top bit of ee_len being set.
 */
-   if (le16_to_cpu(ex1->ee_len) + le16_to_cpu(ex2->ee_len) > EXT_MAX_LEN)
+   if (ext1_ee_len + ext2_ee_len > EXT_MAX_LEN)
return 0;
 #ifdef AGGRESSIVE_TEST
if (le16_to_cpu(ex1->ee_len) >= 4)
return 0;
 #endif
 
-   if (ext_pblock(ex1) + le16_to_cpu(ex1->ee_len) == ext_pblock(ex2))
+   if (ext_pblock(ex1) + ext1_ee_len == ext_pblock(ex2))
return 1;
return 0;
 }
@@ -1145,7 +1157,7 @@ unsigned int ext4_ext_check_overlap(stru
unsigned int depth, len1;
 
b1 = le32_to_cpu(newext->ee_block);
-   len1 = le16_to_cpu(newext->ee_len);
+   len1 = ext4_ext_get_actual_len(newext);
depth = ext_depth(inode);
if (!path[depth].p_ext)
goto out;
@@ -1181,9 +1193,9 @@ int ext4_ext_insert_extent(handle_t *han
struct ext4_extent *ex, *fex;
struct ext4_extent *nearex; /* nearest extent */
struct ext4_ext_path *npath = NULL;
-   int depth, len, err, next;
+   int depth, len, err, next, uninitialized = 0;
 
-   BUG_ON(newext->ee_len == 0);
+   BUG_ON(ext4_ext_get_actual_len(newext) == 0);
depth = ext_depth(inode);
ex = path[depth].p_ext;
BUG_ON(path[depth].p_hdr == NULL);
@@ -1191,14 +1203,23 @@ int ext4_ext_insert_extent(handle_t *han
/* try to insert block into found extent and return */
if (ex && ext4_can_extents_be_merged(inode, ex, newext)) {
ext_debug("append %d block to %d:%d (from %llu)\n",
-   le16_to_cpu(newext->ee_len),
+   ext4_ext_get_actual_len(newext),

[PATCH 3/5] ext4: Extent overlap bugfix

2007-04-26 Thread Amit K. Arora
This is a fix for an extent-overlap bug. The fallocate() implementation
on ext4 depends on this bugfix. Although this fix was posted earlier, it
is still not part of mainline, so I have attached it here too.
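
The core of the fix can be shown without the extent tree: if the
requested range would run into the next allocated block, shorten it so
that it stops just before that block.  The sketch below is a simplified
model of that trimming step only.  NO_NEXT_BLOCK is a stand-in chosen for
this example for the "nothing allocated further on" value that the patch
calls EXT_MAX_BLOCK, and the caller is assumed to pass the next allocated
block b2 directly, with b2 > b1.

```c
#include <assert.h>
#include <stdint.h>

#define NO_NEXT_BLOCK UINT32_MAX  /* stand-in for EXT_MAX_BLOCK */

/*
 * b1/len describe the extent about to be inserted; b2 is the first
 * already-allocated logical block after b1, or NO_NEXT_BLOCK.
 * Returns 1 and shrinks *len if the new extent would overlap b2,
 * otherwise returns 0 and leaves *len alone.
 */
int trim_overlap(uint32_t b1, uint16_t *len, uint32_t b2)
{
    assert(len != 0 && *len > 0);
    if (b2 == NO_NEXT_BLOCK)
        return 0;
    assert(b2 > b1);
    if ((uint64_t)b1 + *len > b2) {
        *len = (uint16_t)(b2 - b1);
        return 1;
    }
    return 0;
}
```

For example, a request for 10 blocks at logical block 100 with the next
allocated block at 105 is cut down to 5 blocks, while the same request
with the next block at 110 is left unchanged.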

Signed-off-by: Amit Arora <[EMAIL PROTECTED]>
---
 fs/ext4/extents.c   |   50 ++--
 include/linux/ext4_fs_extents.h |1 
 2 files changed, 49 insertions(+), 2 deletions(-)

Index: linux-2.6.21/fs/ext4/extents.c
===
--- linux-2.6.21.orig/fs/ext4/extents.c
+++ linux-2.6.21/fs/ext4/extents.c
@@ -1129,6 +1129,45 @@ ext4_can_extents_be_merged(struct inode 
 }
 
 /*
+ * ext4_ext_check_overlap:
+ * check if a portion of the "newext" extent overlaps with an
+ * existing extent.
+ *
+ * If there is an overlap discovered, it updates the length of the newext
+ * such that there will be no overlap, and then returns 1.
+ * If there is no overlap found, it returns 0.
+ */
+unsigned int ext4_ext_check_overlap(struct inode *inode,
+   struct ext4_extent *newext,
+   struct ext4_ext_path *path)
+{
+   unsigned long b1, b2;
+   unsigned int depth, len1;
+
+   b1 = le32_to_cpu(newext->ee_block);
+   len1 = le16_to_cpu(newext->ee_len);
+   depth = ext_depth(inode);
+   if (!path[depth].p_ext)
+   goto out;
+   b2 = le32_to_cpu(path[depth].p_ext->ee_block);
+
+   /* get the next allocated block if the extent in the path
+* is before the requested block(s) */
+   if (b2 < b1) {
+   b2 = ext4_ext_next_allocated_block(path);
+   if (b2 == EXT_MAX_BLOCK)
+   goto out;
+   }
+
+   if (b1 + len1 > b2) {
+   newext->ee_len = cpu_to_le16(b2 - b1);
+   return 1;
+   }
+out:
+   return 0;
+}
+
+/*
  * ext4_ext_insert_extent:
  * tries to merge requsted extent into the existing extent or
  * inserts requested extent as new one into the tree,
@@ -2032,7 +2071,15 @@ int ext4_ext_get_blocks(handle_t *handle
 
/* allocate new block */
goal = ext4_ext_find_goal(inode, path, iblock);
-   allocated = max_blocks;
+
+   /* Check if we can really insert (iblock)::(iblock+max_blocks) extent */
+   newex.ee_block = cpu_to_le32(iblock);
+   newex.ee_len = cpu_to_le16(max_blocks);
+   err = ext4_ext_check_overlap(inode, &newex, path);
+   if (err)
+   allocated = le16_to_cpu(newex.ee_len);
+   else
+   allocated = max_blocks;
newblock = ext4_new_blocks(handle, inode, goal, &allocated, &err);
if (!newblock)
goto out2;
@@ -2040,7 +2087,6 @@ int ext4_ext_get_blocks(handle_t *handle
goal, newblock, allocated);
 
/* try to insert new extent into found leaf and return */
-   newex.ee_block = cpu_to_le32(iblock);
ext4_ext_store_pblock(&newex, newblock);
newex.ee_len = cpu_to_le16(allocated);
err = ext4_ext_insert_extent(handle, inode, path, &newex);
Index: linux-2.6.21/include/linux/ext4_fs_extents.h
===
--- linux-2.6.21.orig/include/linux/ext4_fs_extents.h
+++ linux-2.6.21/include/linux/ext4_fs_extents.h
@@ -190,6 +190,7 @@ ext4_ext_invalidate_cache(struct inode *
 
 extern int ext4_extent_tree_init(handle_t *, struct inode *);
 extern int ext4_ext_calc_credits_for_insert(struct inode *, struct ext4_ext_path *);
+extern unsigned int ext4_ext_check_overlap(struct inode *, struct ext4_extent *, struct ext4_ext_path *);
 extern int ext4_ext_insert_extent(handle_t *, struct inode *, struct ext4_ext_path *, struct ext4_extent *);
 extern int ext4_ext_walk_space(struct inode *, unsigned long, unsigned long, ext_prepare_callback, void *);
 extern struct ext4_ext_path * ext4_ext_find_extent(struct inode *, int, struct ext4_ext_path *);


[PATCH 2/5] fallocate() on s390

2007-04-26 Thread Amit K. Arora
This patch implements support for the fallocate system call on the s390(x)
platform. A wrapper is added to address the issue that the s390 ABI has
with the "preferred" ordering of arguments in this system call (i.e. int,
int, loff_t, loff_t).

I would ask the s390 experts to please review this code and verify that
this patch is correct. Thanks!
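
To make the review easier, here is what the compat wrapper in the diff is
meant to compute, written as plain C.  It is an illustrative model only:
a 31-bit caller passes each 64-bit loff_t as a high and a low 32-bit
word, and the wrapper joins them and hands the arguments on in the
(fd, offset, len, mode) order taken by s390_fallocate().  The struct and
function names are invented for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Arguments in the order sys_fallocate() finally receives them. */
struct fallocate_args {
    int fd;
    int mode;
    int64_t offset;
    int64_t len;
};

/* Join the two 32-bit halves of a 64-bit value: (hi << 32) | lo. */
int64_t join_halves(uint32_t hi, uint32_t lo)
{
    return (int64_t)(((uint64_t)hi << 32) | lo);
}

/* Model of the wrapper: fd, offset hi/lo, len hi/lo, then mode. */
void unpack_compat(int32_t fd, uint32_t off_hi, uint32_t off_lo,
                   uint32_t len_hi, uint32_t len_lo, uint32_t mode,
                   struct fallocate_args *out)
{
    assert(out != 0);
    out->fd = fd;
    out->offset = join_halves(off_hi, off_lo);
    out->len = join_halves(len_hi, len_lo);
    out->mode = (int)mode;
}
```

Note that the low word must be treated as unsigned when it is or-ed in;
sign-extending it would corrupt the high half for any low word with its
top bit set.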

Signed-off-by: Amit Arora <[EMAIL PROTECTED]>
---
 arch/s390/kernel/compat_wrapper.S |   10 ++
 arch/s390/kernel/sys_s390.c   |   10 ++
 arch/s390/kernel/syscalls.S   |1 +
 include/asm-s390/unistd.h |3 ++-
 4 files changed, 23 insertions(+), 1 deletion(-)

Index: linux-2.6.21/arch/s390/kernel/compat_wrapper.S
===
--- linux-2.6.21.orig/arch/s390/kernel/compat_wrapper.S
+++ linux-2.6.21/arch/s390/kernel/compat_wrapper.S
@@ -1682,3 +1682,13 @@ compat_sys_utimes_wrapper:
llgtr   %r2,%r2 # char *
llgtr   %r3,%r3 # struct compat_timeval *
jg  compat_sys_utimes
+
+   .globl  s390_fallocate_wrapper
+s390_fallocate_wrapper:
+   lgfr%r2,%r2 # int
+   sllg%r3,%r3,32  # get high word of 64bit loff_t
+   or  %r3,%r4 # get low word of 64bit loff_t
+   sllg%r4,%r5,32  # get high word of 64bit loff_t
+   or  %r4,%r6 # get low word of 64bit loff_t
+   llgf%r5,164(%r15)   # unsigned int
+   jg  s390_fallocate
Index: linux-2.6.21/arch/s390/kernel/sys_s390.c
===
--- linux-2.6.21.orig/arch/s390/kernel/sys_s390.c
+++ linux-2.6.21/arch/s390/kernel/sys_s390.c
@@ -268,6 +268,16 @@ s390_fadvise64_64(struct fadvise64_64_ar
 }
 
 /*
+ * This is a wrapper to call sys_fallocate(). Since s390 ABI has a problem
+ * with the int, int, loff_t, loff_t ordering of arguments, this wrapper
+ * is required.
+ */
+asmlinkage long s390_fallocate(int fd, loff_t offset, loff_t len, int mode)
+{
+   return sys_fallocate(fd, mode, offset, len);
+}
+
+/*
  * Do a system call from kernel instead of calling sys_execve so we
  * end up with proper pt_regs.
  */
Index: linux-2.6.21/arch/s390/kernel/syscalls.S
===
--- linux-2.6.21.orig/arch/s390/kernel/syscalls.S
+++ linux-2.6.21/arch/s390/kernel/syscalls.S
@@ -322,3 +322,4 @@ NI_SYSCALL  
/* 310 sys_move_pages *
 SYSCALL(sys_getcpu,sys_getcpu,sys_getcpu_wrapper)
 SYSCALL(sys_epoll_pwait,sys_epoll_pwait,compat_sys_epoll_pwait_wrapper)
 SYSCALL(sys_utimes,sys_utimes,compat_sys_utimes_wrapper)
+SYSCALL(s390_fallocate,s390_fallocate,s390_fallocate_wrapper)
Index: linux-2.6.21/include/asm-s390/unistd.h
===
--- linux-2.6.21.orig/include/asm-s390/unistd.h
+++ linux-2.6.21/include/asm-s390/unistd.h
@@ -251,8 +251,9 @@
 #define __NR_getcpu311
 #define __NR_epoll_pwait   312
 #define __NR_utimes313
+#define __NR_fallocate 314
 
-#define NR_syscalls 314
+#define NR_syscalls 315
 
 /* 
  * There are some system calls that are not present on 64 bit, some


[PATCH 1/5] fallocate() implementation in i386, x86_64 and powerpc

2007-04-26 Thread Amit K. Arora
This patch implements the fallocate() system call and adds support for
i386, x86_64 and powerpc.

NOTE: It is based on 2.6.21 kernel version.
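
For reviewers, the range checks that sys_fallocate() performs before
calling the inode operation can be sketched in plain userspace C as
below. This is only an illustration under stated assumptions:
check_fallocate_range() and SKETCH_MAXBYTES are hypothetical names,
SKETCH_MAXBYTES stands in for inode->i_sb->s_maxbytes, and the sketch
deliberately differs from the patch in two ways: it also rejects a
negative len, and it writes the EFBIG test as a subtraction so that
offset + len cannot overflow.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for inode->i_sb->s_maxbytes. */
#define SKETCH_MAXBYTES ((int64_t)1 << 40)

/*
 * Returns 0 if (offset, len) would pass the range checks,
 * otherwise a negative errno value.
 */
static long check_fallocate_range(int64_t offset, int64_t len,
				  int64_t maxbytes)
{
	/* Assumption: negative len is rejected too (the patch tests len == 0). */
	if (len <= 0 || offset < 0)
		return -EINVAL;

	/* Same meaning as "offset + len > maxbytes", without the addition. */
	if (len > maxbytes - offset)
		return -EFBIG;

	return 0;
}

/* Usage example: call this from main() to exercise the sketch. */
static void range_check_example(void)
{
	assert(check_fallocate_range(0, 4096, SKETCH_MAXBYTES) == 0);
	assert(check_fallocate_range(-1, 4096, SKETCH_MAXBYTES) == -EINVAL);
	assert(check_fallocate_range(SKETCH_MAXBYTES, 1, SKETCH_MAXBYTES) == -EFBIG);
}
```

A range that ends exactly at the limit is accepted, which matches the
"offset + len > s_maxbytes" comparison in the patch.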

Signed-off-by: Amit Arora <[EMAIL PROTECTED]>
---
 arch/i386/kernel/syscall_table.S |    1
 arch/powerpc/kernel/sys_ppc32.c  |    7 ++
 arch/x86_64/kernel/functionlist  |    1
 fs/open.c                        |   41 +++
 include/asm-i386/unistd.h        |    3 +-
 include/asm-powerpc/systbl.h     |    1
 include/asm-powerpc/unistd.h     |    3 +-
 include/asm-x86_64/unistd.h      |    4 ++-
 include/linux/fs.h               |    7 ++
 include/linux/syscalls.h         |    1
 10 files changed, 66 insertions(+), 3 deletions(-)

Index: linux-2.6.21/arch/i386/kernel/syscall_table.S
===
--- linux-2.6.21.orig/arch/i386/kernel/syscall_table.S
+++ linux-2.6.21/arch/i386/kernel/syscall_table.S
@@ -319,3 +319,4 @@ ENTRY(sys_call_table)
.long sys_move_pages
.long sys_getcpu
.long sys_epoll_pwait
+   .long sys_fallocate /* 320 */
Index: linux-2.6.21/arch/x86_64/kernel/functionlist
===
--- linux-2.6.21.orig/arch/x86_64/kernel/functionlist
+++ linux-2.6.21/arch/x86_64/kernel/functionlist
@@ -931,6 +931,7 @@
 *(.text.sys_getitimer)
 *(.text.sys_getgroups)
 *(.text.sys_ftruncate)
+*(.text.sys_fallocate)
 *(.text.sysfs_lookup)
 *(.text.sys_exit_group)
 *(.text.stub_fork)
Index: linux-2.6.21/fs/open.c
===
--- linux-2.6.21.orig/fs/open.c
+++ linux-2.6.21/fs/open.c
@@ -350,6 +350,47 @@ asmlinkage long sys_ftruncate64(unsigned
 }
 #endif
 
+asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
+{
+   struct file *file;
+   struct inode *inode;
+   long ret = -EINVAL;
+
+   if (len == 0 || offset < 0)
+   goto out;
+
+   ret = -EBADF;
+   file = fget(fd);
+   if (!file)
+   goto out;
+   if (!(file->f_mode & FMODE_WRITE))
+   goto out_fput;
+
+   inode = file->f_path.dentry->d_inode;
+
+   ret = -ESPIPE;
+   if (S_ISFIFO(inode->i_mode))
+   goto out_fput;
+
+   ret = -ENODEV;
+   if (!S_ISREG(inode->i_mode))
+   goto out_fput;
+
+   ret = -EFBIG;
+   if (offset + len > inode->i_sb->s_maxbytes)
+   goto out_fput;
+
+   if (inode->i_op && inode->i_op->fallocate)
+   ret = inode->i_op->fallocate(inode, mode, offset, len);
+   else
+   ret = -ENOSYS;
+out_fput:
+   fput(file);
+out:
+   return ret;
+}
+EXPORT_SYMBOL(sys_fallocate);
+
 /*
  * access() needs to use the real uid/gid, not the effective uid/gid.
  * We do this by temporarily clearing all FS-related capabilities and
Index: linux-2.6.21/include/asm-i386/unistd.h
===
--- linux-2.6.21.orig/include/asm-i386/unistd.h
+++ linux-2.6.21/include/asm-i386/unistd.h
@@ -325,10 +325,11 @@
 #define __NR_move_pages    317
 #define __NR_getcpu        318
 #define __NR_epoll_pwait   319
+#define __NR_fallocate 320
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 320
+#define NR_syscalls 321
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.21/include/asm-powerpc/systbl.h
===
--- linux-2.6.21.orig/include/asm-powerpc/systbl.h
+++ linux-2.6.21/include/asm-powerpc/systbl.h
@@ -307,3 +307,4 @@ COMPAT_SYS_SPU(set_robust_list)
 COMPAT_SYS_SPU(move_pages)
 SYSCALL_SPU(getcpu)
 COMPAT_SYS(epoll_pwait)
+COMPAT_SYS(fallocate)
Index: linux-2.6.21/include/asm-powerpc/unistd.h
===
--- linux-2.6.21.orig/include/asm-powerpc/unistd.h
+++ linux-2.6.21/include/asm-powerpc/unistd.h
@@ -326,10 +326,11 @@
 #define __NR_move_pages    301
 #define __NR_getcpu        302
 #define __NR_epoll_pwait   303
+#define __NR_fallocate 304
 
 #ifdef __KERNEL__
 
-#define __NR_syscalls  304
+#define __NR_syscalls  305
 
 #define __NR__exit __NR_exit
 #define NR_syscalls        __NR_syscalls
Index: linux-2.6.21/include/asm-x86_64/unistd.h
===
--- linux-2.6.21.orig/include/asm-x86_64/unistd.h
+++ linux-2.6.21/include/asm-x86_64/unistd.h
@@ -619,8 +619,10 @@ __SYSCALL(__NR_sync_file_range, sys_sync
 __SYSCALL(__NR_vmsplice, sys_vmsplice)
 #define __NR_move_pages    279
 __SYSCALL(__NR_move_pages, sys_move_pages)
+#define __NR_fallocate 280
+__SYSCALL(__NR_fallocate, sys_fallocate)
 
-#define __NR_syscall_max __NR_move_pages
+#define __NR_syscall_max __NR_fallocate
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.21/include/l

[PATCH 0/5] fallocate system call

2007-04-26 Thread Amit K. Arora
Based on the discussion, this new patchset uses the following
interface for the fallocate() system call:

 asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)

It seems that only the s390 architecture has a problem with such a layout of
arguments in fallocate(). Thus for s390, we plan to have a wrapper
(say, sys_s390_fallocate()) for the sys_fallocate(), which will get
called by glibc when an application issues a fallocate() system call
on s390. The s390 arch specific changes will be part of a separate
patch (PATCH 2/5). It will be great if some s390 expert can verify the
patch, since I have not been able to test it on s390 so far.
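
To make the s390 point concrete, here is a minimal userspace sketch (an
illustration only, not part of the patchset). It assumes, as the
comments in the compat wrapper of PATCH 2/5 indicate, that each 64-bit
loff_t arrives as a high and a low 32-bit word, and that the wrapper
takes mode as its last argument. The names join_loff(),
fake_sys_fallocate() and fake_s390_fallocate() are hypothetical
stand-ins, not kernel symbols.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: rebuild a 64-bit offset from two 32-bit halves,
 * the same shift-and-or that the compat wrapper does with sllg/or.
 */
static int64_t join_loff(uint32_t high, uint32_t low)
{
	return (int64_t)(((uint64_t)high << 32) | (uint64_t)low);
}

/* Records the arguments the stand-in "syscall" was last called with. */
struct call_record {
	int fd;
	int mode;
	int64_t offset;
	int64_t len;
};
static struct call_record last_call;

/* Stand-in for sys_fallocate(fd, mode, offset, len). */
static long fake_sys_fallocate(int fd, int mode, int64_t offset, int64_t len)
{
	last_call.fd = fd;
	last_call.mode = mode;
	last_call.offset = offset;
	last_call.len = len;
	return 0;
}

/* Stand-in for the s390 wrapper: mode comes last and is moved forward. */
static long fake_s390_fallocate(int fd, int64_t offset, int64_t len, int mode)
{
	return fake_sys_fallocate(fd, mode, offset, len);
}

/* Usage example: call this from main() to exercise the sketch. */
static void join_loff_example(void)
{
	/* high word 1, low word 2 gives 0x0000000100000002 */
	assert(join_loff(1, 2) == 0x100000002LL);

	fake_s390_fallocate(3, 0, 4096, 1);
	assert(last_call.mode == 1 && last_call.len == 4096);
}
```

The low word is widened without sign extension, so a low word with its
top bit set does not spill into the high half.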

It was also noted that minor changes might be required to the strace code
to take care of the "different arguments on s390" issue.

Currently we have two modes FA_ALLOCATE and FA_DEALLOCATE, for
preallocation and deallocation of preallocated blocks respectively. More
modes can be added, when required.

ToDos:
======
1>   Implementation on other architectures (other than i386, x86_64, 
ppc64 and s390(x)) 
2>   A generic file system operation to handle fallocate
(generic_fallocate), for filesystems that do _not_ have the fallocate
inode operation implemented.
3>   Changes to glibc,
a) to support fallocate() system call
b) so that posix_fallocate() and posix_fallocate64() call
   fallocate() system call
4>   Changes to XFS to implement the fallocate inode operation


The patches are as follows:

Patch 1/5 : fallocate() implementation in i386, x86_64 and powerpc
Patch 2/5 : fallocate() on s390
Patch 3/5 : ext4: Extent overlap bugfix
Patch 4/5 : ext4: fallocate support in ext4
Patch 5/5 : ext4: write support for preallocated blocks

--
Regards,
Amit Arora



Re: VFAT: slow fs corruption? [long]

2007-04-26 Thread Albrecht Dreß

Hi Ben!

Thanks a lot for your comments, and sorry for the late reply, I did more tests  
in the meantime...


On 19.04.07 at 01:00, Benjamin LaHaise wrote:

On Wed, Apr 18, 2007 at 07:58:40PM +0200, Albrecht Dreß wrote:
> - Are there known issues with VFAT in 2.6.11 which might lead to the
>   observed problems?  Were they fixed?
> - Is it possible to change the block size in ext2 to 16k (to match the
>   SD card's erase block size)?


Flash cards tend to be rather flaky given that they are low cost
consumer grade commodities these days.  I would recommend getting a new card
first and seeing if you can still replicate the problem.


I am afraid my explanation was not completely clear: I reformatted the *same*  
card which gave a lot of VFAT errors with ext2 and ran the test again.  Up to  
now, I filled the card several times and erased it, without any problem.  Only  
after killing (switch power off) the system in mid-air, I had a non-clean fs  
and one broken file (which is not surprising, afaik), giving input/output  
errors when I tried to read it.


So my first observation, that VFAT corrupts "somehow", but ext2 doesn't, still  
seems to be valid!


Out of the box, I've had to replace 2 of 8 flash cards in the last 6 months
when they showed similarly eerie data corruption when files disappeared.


Yes, I know... Unfortunately, it's not possible to get reliable data about the  
achievable life time.


Doing an md5sum on the device 2 times in a row and getting back different  
results is Not Good.


Good idea - will be the next check!

Thanks, Albrecht.


--
 Albrecht Dreß  -  Johanna-Kirchner-Straße 13  -  D-53123 Bonn (Germany)
   Phone (+49) 228 6199571  -  mailto:[EMAIL PROTECTED]
  GnuPG public key:  http://www.mynetcologne.de/~nc-dreszal/pubkey.asc




Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-26 Thread Amit Gud

Jeff Dike wrote:

On Thu, Apr 26, 2007 at 10:53:16AM -0500, Amit Gud wrote:

Jeff Dike wrote:

How about this case:

Growing file starts in chunk A.
Overflows into chunk B.
Delete file in chunk A.
Growing file overflows chunk B and spots new free space in
chunk A (and nothing anywhere else)
Overflows into chunk A
Delete file in chunk B.
Overflow into chunk B again.

Maybe this is not realistic, but in the absence of a mechanism to pull
data back from an overflow chunk, it seems at least a theoretical
possibility that there could be > 1 continuation inodes per file per
chunk.

Preventive measures are taken to limit only one continuation inode per 
file per chunk. This can be done easily in the chunk allocation 
algorithm for disk space. Although I'm not quite sure what you mean by
"Delete file in chunk A". If you are referring to the same file that's
growing, then deletion is not possible, because individual parts of any 
file in any chunk cannot be deleted.


No, I'm referring to a different file.  The scenario is that you have
a growing file in a nearly full disk with files being deleted (and
thus space being freed) such that allocations for the growing file
bounce back and forth between chunks.



In such a scenario, either a lot of continuation inodes (fewer than the
number of chunks) or a lot of sparse pieces would be created. But we can
certainly enforce the constraint of one continuation inode per file per
chunk, excluding the file's primary chunk in which it started.



AG
--
May the source be with you.
http://www.cis.ksu.edu/~gud



Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-26 Thread Amit Gud

Alan Cox wrote:
Preventive measures are taken to limit only one continuation inode per 
file per chunk. This can be done easily in the chunk allocation 
algorithm for disk space. Although I'm not quite sure what you mean by 


How are you handling the allocation in this situation, are you assuming
that a chunk is "out of bounds" because part of a file already lives on
it or simply keeping a single inode per chunk which has multiple sparse
pieces of the file on it ?

ie if I write 0-8MB to chunk A and then 8-16 to chunk B can I write
16-24MB to chunk A producing a single inode of 0-8 16-24, or does it have
to find another chunk to use ?


Hello Alan,

You re-use the same inode with multiple sparse pieces.

This way you avoid hopping around continuation inodes and coming back to
the same chunk you started with, but this time on a different
continuation inode. That may not be I/O intensive for successive
traversals if the continuation inodes are pinned in memory, but it is
certainly a waste of a resource: inodes. Not allowing this would make the
worst case, every file having a continuation inode in every chunk, even
worse: for example, only a single file might exist in the file system,
with all the remaining inodes in all chunks (including the file's own
chunk) being continuation inodes.
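
A toy userspace model of this rule may help (a sketch only, not the
ChunkFS code). It assumes one inode slot per chunk for a given file,
with the primary inode counted as the slot for the file's own chunk,
and it tracks pieces in whole megabytes. All names (toy_file,
chunk_inode, toy_write(), toy_inode_count(), MAX_CHUNKS, MAX_PIECES)
are made up for the example. Alan's sequence, 0-8MB to chunk A, 8-16MB
to chunk B, then 16-24MB to chunk A, ends up with two inodes, and the
one in chunk A holds two sparse pieces.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_CHUNKS 4	/* hypothetical number of chunks */
#define MAX_PIECES 8	/* hypothetical cap on pieces per inode */

struct piece {
	unsigned long start_mb;	/* inclusive */
	unsigned long end_mb;	/* exclusive */
};

/* The single inode a file may own inside one chunk. */
struct chunk_inode {
	int in_use;
	size_t npieces;
	struct piece pieces[MAX_PIECES];
};

struct toy_file {
	struct chunk_inode per_chunk[MAX_CHUNKS];
};

/*
 * Record that file range [start_mb, end_mb) was allocated in `chunk`.
 * The chunk's existing inode is reused, so a second visit adds a sparse
 * piece rather than a second inode. Returns 0 on success, -1 on error.
 */
static int toy_write(struct toy_file *f, int chunk,
		     unsigned long start_mb, unsigned long end_mb)
{
	struct chunk_inode *ci;

	if (chunk < 0 || chunk >= MAX_CHUNKS || start_mb >= end_mb)
		return -1;
	ci = &f->per_chunk[chunk];
	if (ci->npieces == MAX_PIECES)
		return -1;

	ci->in_use = 1;
	ci->pieces[ci->npieces].start_mb = start_mb;
	ci->pieces[ci->npieces].end_mb = end_mb;
	ci->npieces++;
	return 0;
}

/* Number of inodes (primary plus continuation) the file occupies. */
static int toy_inode_count(const struct toy_file *f)
{
	int i, n = 0;

	for (i = 0; i < MAX_CHUNKS; i++)
		n += f->per_chunk[i].in_use;
	return n;
}

/* Usage example: call this from main() to replay Alan's sequence. */
static void toy_example(void)
{
	struct toy_file f;

	memset(&f, 0, sizeof(f));
	assert(toy_write(&f, 0, 0, 8) == 0);	/* chunk A */
	assert(toy_write(&f, 1, 8, 16) == 0);	/* chunk B */
	assert(toy_write(&f, 0, 16, 24) == 0);	/* back to chunk A */
	assert(toy_inode_count(&f) == 2);
	assert(f.per_chunk[0].npieces == 2);
}
```

However often the allocations bounce between chunks, the inode count in
this model cannot exceed MAX_CHUNKS; only the piece lists grow.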



AG
--
May the source be with you.
http://www.cis.ksu.edu/~gud



Re: [patch] unprivileged mounts update

2007-04-26 Thread Miklos Szeredi
> So then as far as you're concerned, the patches which were in -mm will
> remain unchanged?

Basically yes. I've merged the update patch, which was not yet added
to -mm, did some cosmetic code changes, and updated the patch headers.

There's one open point, that I think we haven't really explored, and
that is the propagation semantics.  I think you had the idea, that a
propagated mount should inherit ownership from the parent into which
it was propagated.

That sounds good if everyone agrees?

Miklos


Re: [patch] unprivileged mounts update

2007-04-26 Thread Serge E. Hallyn
Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> > Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> > > > Right, I figure if the normal action is to always do
> > > > mnt->user = current->fsuid, then for the special case we
> > > > pass a uid in someplace.  Of course...  do we not have a
> > > > place to do that?  Would it be a no-no to use 'data' for
> > > > a non-fs-specific arg?
> > > 
> > > I guess it would be OK for bind, but not for new- and remounts, where
> > > 'data' is already used.
> > > 
> > > Maybe it's best to stay with fsuid after all, and live with having to
> > > restore capabilities.  It's not so bad after all, this seems to do the
> > > trick:
> > > 
> > >   cap_t cap = cap_get_proc();
> > >   setfsuid(uid);
> > >   cap_set_proc(cap);
> > > 
> > > Unfortunately these functions are not in libc, but in a separate
> > > "libcap" library.  Ugh.
> > 
> > Ok, are you still planning to nix the MS_SETUSER flag, though, as
> > Eric suggested?  I think it's cleanest - always set the mnt->user
> > field to current->fsuid, and require CAP_SYS_ADMIN if the
> > mountpoint->mnt->user != current->fsuid.
> 
> It would be a nice cleanup, but I think it's unworkable for the
> following reasons:
> 
> Up till now mount(2) and umount(2) always required CAP_SYS_ADMIN, and
> we must make sure, that unless there's some explicit action by the
> sysadmin, these rules are still enforced.
> 
> For example, with just a check for mnt->mnt_uid == current->fsuid, a
> fsuid=0 process could umount or submount all the "legacy" mounts even
> without CAP_SYS_ADMIN.
>
> This is a fundamental security problem, with getting rid of MS_SETUSER
> and MNT_USER.
> 
> Another, rather unlikely situation is if an existing program sets
> fsuid to non-zero before calling mount, hence unwantingly making that
> mount owned by some user after these patches.
> 
> Also adding "user=0" to the options in /proc/mounts would be an
> interface breakage, that is probably harmless, but people wouldn't like
> it.  Special casing the zero uid for this case is more ugly IMO, than
> the problem we are trying to solve.
> 
> If we didn't have existing systems to deal with, then of course I'd
> agree with Eric's suggestion.
> 
> Miklos

So then as far as you're concerned, the patches which were in -mm will
remain unchanged?

-serge



Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-26 Thread Alan Cox
> Preventive measures are taken to limit only one continuation inode per 
> file per chunk. This can be done easily in the chunk allocation 
> algorithm for disk space. Although I'm not quite sure what you mean by 

How are you handling the allocation in this situation, are you assuming
that a chunk is "out of bounds" because part of a file already lives on
it or simply keeping a single inode per chunk which has multiple sparse
pieces of the file on it ?

ie if I write 0-8MB to chunk A and then 8-16 to chunk B can I write
16-24MB to chunk A producing a single inode of 0-8 16-24, or does it have
to find another chunk to use ?


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-26 Thread Jeff Dike
On Thu, Apr 26, 2007 at 10:53:16AM -0500, Amit Gud wrote:
> Jeff Dike wrote:
> >How about this case:
> >
> > Growing file starts in chunk A.
> > Overflows into chunk B.
> > Delete file in chunk A.
> > Growing file overflows chunk B and spots new free space in
> >chunk A (and nothing anywhere else)
> > Overflows into chunk A
> > Delete file in chunk B.
> > Overflow into chunk B again.
> >
> >Maybe this is not realistic, but in the absence of a mechanism to pull
> >data back from an overflow chunk, it seems at least a theoretical
> >possibility that there could be > 1 continuation inodes per file per
> >chunk.
> >
> 
> Preventive measures are taken to limit only one continuation inode per 
> file per chunk. This can be done easily in the chunk allocation 
> algorithm for disk space. Although I'm not quite sure what you mean by
> "Delete file in chunk A". If you are referring to the same file that's
> growing, then deletion is not possible, because individual parts of any 
> file in any chunk cannot be deleted.

No, I'm referring to a different file.  The scenario is that you have
a growing file in a nearly full disk with files being deleted (and
thus space being freed) such that allocations for the growing file
bounce back and forth between chunks.

Jeff

-- 
Work email - jdike at linux dot intel dot com


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-26 Thread Amit Gud

Jeff Dike wrote:

How about this case:

Growing file starts in chunk A.
Overflows into chunk B.
Delete file in chunk A.
Growing file overflows chunk B and spots new free space in
chunk A (and nothing anywhere else)
Overflows into chunk A
Delete file in chunk B.
Overflow into chunk B again.

Maybe this is not realistic, but in the absence of a mechanism to pull
data back from an overflow chunk, it seems at least a theoretical
possibility that there could be > 1 continuation inodes per file per
chunk.



Preventive measures are taken to limit only one continuation inode per 
file per chunk. This can be done easily in the chunk allocation 
algorithm for disk space. Although I'm not quite sure what you mean by
"Delete file in chunk A". If you are referring to the same file that's
growing, then deletion is not possible, because individual parts of any 
file in any chunk cannot be deleted.



AG
--
May the source be with you.
http://www.cis.ksu.edu/~gud



Re: [patch] unprivileged mounts update

2007-04-26 Thread Miklos Szeredi
> Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> > > Right, I figure if the normal action is to always do
> > > mnt->user = current->fsuid, then for the special case we
> > > pass a uid in someplace.  Of course...  do we not have a
> > > place to do that?  Would it be a no-no to use 'data' for
> > > a non-fs-specific arg?
> > 
> > I guess it would be OK for bind, but not for new- and remounts, where
> > 'data' is already used.
> > 
> > Maybe it's best to stay with fsuid after all, and live with having to
> > restore capabilities.  It's not so bad after all, this seems to do the
> > trick:
> > 
> > cap_t cap = cap_get_proc();
> > setfsuid(uid);
> > cap_set_proc(cap);
> > 
> > Unfortunately these functions are not in libc, but in a separate
> > "libcap" library.  Ugh.
> 
> Ok, are you still planning to nix the MS_SETUSER flag, though, as
> Eric suggested?  I think it's cleanest - always set the mnt->user
> field to current->fsuid, and require CAP_SYS_ADMIN if the
> mountpoint->mnt->user != current->fsuid.

It would be a nice cleanup, but I think it's unworkable for the
following reasons:

Up till now mount(2) and umount(2) always required CAP_SYS_ADMIN, and
we must make sure, that unless there's some explicit action by the
sysadmin, these rules are still enforced.

For example, with just a check for mnt->mnt_uid == current->fsuid, a
fsuid=0 process could umount or submount all the "legacy" mounts even
without CAP_SYS_ADMIN.

This is a fundamental security problem, with getting rid of MS_SETUSER
and MNT_USER.
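
To spell out the example above, here is a small userspace model of the
two checks (a sketch only; none of this is kernel code). TOY_MNT_USER,
toy_mount, toy_task and the two may_umount_*() helpers are hypothetical
names, and the model assumes that a flag like MNT_USER is set only on
mounts that were explicitly created as user-owned. With the uid-only
check, a task with fsuid 0 and no CAP_SYS_ADMIN is allowed to unmount a
legacy mount; with the flag, it is not.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MNT_USER 0x1u	/* hypothetical "user-owned mount" flag */

struct toy_mount {
	unsigned int flags;
	unsigned int owner_uid;	/* 0 for legacy mounts in this model */
};

struct toy_task {
	unsigned int fsuid;
	bool cap_sys_admin;
};

/* Check that compares uids only, with no ownership flag. */
static bool may_umount_uid_only(const struct toy_mount *m,
				const struct toy_task *t)
{
	return t->cap_sys_admin || m->owner_uid == t->fsuid;
}

/* Check that honours ownership only on explicitly flagged mounts. */
static bool may_umount_flagged(const struct toy_mount *m,
			       const struct toy_task *t)
{
	if (t->cap_sys_admin)
		return true;
	return (m->flags & TOY_MNT_USER) && m->owner_uid == t->fsuid;
}

/* Usage example: call this from main() to see the difference. */
static void mount_check_example(void)
{
	struct toy_mount legacy = { 0, 0 };
	struct toy_task root_without_cap = { 0, false };

	/* uid-only: fsuid 0 matches the legacy mount, so it is allowed */
	assert(may_umount_uid_only(&legacy, &root_without_cap));
	/* flagged: a legacy mount still needs CAP_SYS_ADMIN */
	assert(!may_umount_flagged(&legacy, &root_without_cap));
}
```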

Another, rather unlikely situation is if an existing program sets
fsuid to non-zero before calling mount, hence unwantingly making that
mount owned by some user after these patches.

Also adding "user=0" to the options in /proc/mounts would be an
interface breakage, that is probably harmless, but people wouldn't like
it.  Special casing the zero uid for this case is more ugly IMO, than
the problem we are trying to solve.

If we didn't have existing systems to deal with, then of course I'd
agree with Eric's suggestion.

Miklos


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-26 Thread Jeff Dike
On Wed, Apr 25, 2007 at 03:47:10PM -0700, Valerie Henson wrote:
> Actually, there is an upper limit on the number of continuation
> inodes.  Each file can have a maximum of one continuation inode per
> chunk. (This is why we need to support sparse files.)

How about this case:

Growing file starts in chunk A.
Overflows into chunk B.
Delete file in chunk A.
Growing file overflows chunk B and spots new free space in
chunk A (and nothing anywhere else)
Overflows into chunk A
Delete file in chunk B.
Overflow into chunk B again.

Maybe this is not realistic, but in the absence of a mechanism to pull
data back from an overflow chunk, it seems at least a theoretical
possibility that there could be > 1 continuation inodes per file per
chunk.

Jeff

-- 
Work email - jdike at linux dot intel dot com


Re: [patch] unprivileged mounts update

2007-04-26 Thread Serge E. Hallyn
Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> > Right, I figure if the normal action is to always do
> > mnt->user = current->fsuid, then for the special case we
> > pass a uid in someplace.  Of course...  do we not have a
> > place to do that?  Would it be a no-no to use 'data' for
> > a non-fs-specific arg?
> 
> I guess it would be OK for bind, but not for new- and remounts, where
> 'data' is already used.
> 
> Maybe it's best to stay with fsuid after all, and live with having to
> restore capabilities.  It's not so bad after all, this seems to do the
> trick:
> 
>   cap_t cap = cap_get_proc();
>   setfsuid(uid);
>   cap_set_proc(cap);
> 
> Unfortunately these functions are not in libc, but in a separate
> "libcap" library.  Ugh.

Ok, are you still planning to nix the MS_SETUSER flag, though, as Eric
suggested?  I think it's cleanest - always set the mnt->user field to
current->fsuid, and require CAP_SYS_ADMIN if the mountpoint->mnt->user !=
current->fsuid.

-serge


[RFC][PATCH 2/4] Move the file data to the new blocks

2007-04-26 Thread Takashi Sato
Move the blocks on the temporary inode to the original inode
one page at a time.
1. Read the file data from the old blocks to the page
2. Move the block on the temporary inode to the original inode
3. Write the file data on the page into the new blocks

Signed-off-by: Takashi Sato <[EMAIL PROTECTED]>
---
diff -Nrup -X linux-2.6.19-rc6_move_data/Documentation/dontdiff linux-2.6.19-rc6_move_data/fs/ext4/extents.c linux-2.6.19-rc6-full-except_lg/fs/ext4/extents.c
--- linux-2.6.19-rc6_move_data/fs/ext4/extents.c 2007-04-26 19:36:34.0 +0900
+++ linux-2.6.19-rc6-full-except_lg/fs/ext4/extents.c 2007-04-26 19:17:59.0 +0900
@@ -2533,6 +2533,610 @@ ext4_ext_next_extent(struct inode *inode
 }
 
 /**
+ * ext4_ext_merge_across - merge extents across leaf block
+ *
+ * @handle     journal handle
+ * @inode      target file's inode
+ * @o_start    first original extent to be defraged
+ * @o_end      last original extent to be defraged
+ * @start_ext  first new extent to be merged
+ * @new_ext    middle of new extent to be merged
+ * @end_ext    last new extent to be merged
+ *
+ * This function returns 0 if succeed, otherwise returns error value.
+ */
+static int
+ext4_ext_merge_across_blocks(handle_t *handle, struct inode *inode,
+   struct ext4_extent *o_start,
+   struct ext4_extent *o_end, struct ext4_extent *start_ext,
+   struct ext4_extent *new_ext,struct ext4_extent *end_ext)
+{
+   struct ext4_ext_path *org_path = NULL;
+   unsigned long eblock = 0;
+   int err = 0;
+   int new_flag = 0;
+   int end_flag = 0;
+
+   if (le16_to_cpu(start_ext->ee_len) &&
+   le16_to_cpu(new_ext->ee_len) &&
+   le16_to_cpu(end_ext->ee_len)) {
+
+   if ((o_start) == (o_end)) {
+
+   /*   start_ext   new_ext   end_ext
+* dest |-|---||
+* org  |--|
+*/
+
+   end_flag = 1;
+   } else {
+
+   /*   start_ext   new_ext   end_ext
+* dest |-|--|-|
+* org  |---|--|
+*/
+
+   o_end->ee_block = end_ext->ee_block;
+   o_end->ee_len = end_ext->ee_len;
+   ext4_ext_store_pblock(o_end, ext_pblock(end_ext));
+   }
+
+   o_start->ee_len = start_ext->ee_len;
+   new_flag = 1;
+
+   } else if ((le16_to_cpu(start_ext->ee_len)) &&
+   (le16_to_cpu(new_ext->ee_len)) &&
+   (!le16_to_cpu(end_ext->ee_len)) &&
+   ((o_start) == (o_end))) {
+
+   /* start_ext    new_ext
+* dest |--|---|
+* org  |--|
+*/
+
+   o_start->ee_len = start_ext->ee_len;
+   new_flag = 1;
+
+   } else if ((!le16_to_cpu(start_ext->ee_len)) &&
+   (le16_to_cpu(new_ext->ee_len)) &&
+   (le16_to_cpu(end_ext->ee_len)) &&
+   ((o_start) == (o_end))) {
+
+   /*  new_ext    end_ext
+* dest |--|---|
+* org  |--|
+*/
+
+   o_end->ee_block = end_ext->ee_block;
+   o_end->ee_len = end_ext->ee_len;
+   ext4_ext_store_pblock(o_end, ext_pblock(end_ext));
+
+   /* If new_ext was first block */
+   if (!new_ext->ee_block)
+   eblock = 0;
+   else
+   eblock = le32_to_cpu(new_ext->ee_block);
+
+   new_flag = 1;
+   } else {
+   printk("Unexpected case \n");
+   return -EIO;
+   }
+
+   if (new_flag) {
+   org_path = ext4_ext_find_extent(inode, eblock, NULL);
+   if (IS_ERR(org_path)) {
+   err = PTR_ERR(org_path);
+   org_path = NULL;
+   goto ERR;
+   }
+   err = ext4_ext_insert_extent(handle, inode,
+   org_path, new_ext);
+   if (err)
+   goto ERR;
+   }
+
+   if (end_flag) {
+   org_path = ext4_ext_find_extent(inode,
+   end_ext->ee_block -1, org_path);
+   if (IS_ERR(org_path)) {
+   err = PTR_ERR(org_path);
+   org_path = NULL;
+   goto ERR;
+   }
+   err = ext4_ext_insert_extent(handle, inode,
+   org_path, end_ext);
+   if (err)
+  

[RFC][PATCH 3/4] Online defrag command

2007-04-26 Thread Takashi Sato
The defrag command.  Usage is as follows:
o Put the multiple files closer together.
  # e4defrag -r directory-name
o Defrag for a single file.
  # e4defrag file-name
o Defrag for all files on ext4.
  # e4defrag device-name

Signed-off-by: Takashi Sato <[EMAIL PROTECTED]>
---
/*
 * e4defrag, ext4 filesystem defragmenter
 *
 */

#ifndef _LARGEFILE_SOURCE
#define _LARGEFILE_SOURCE
#endif

#ifndef _LARGEFILE64_SOURCE
#define _LARGEFILE64_SOURCE
#endif

#define _XOPEN_SOURCE   500
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

#define EXT4_SUPER_MAGIC    0xEF53 /* magic number for ext4 */
#define DEFRAG_PAGES    128 /* the number of pages to defrag at one time */
#define MAX_BLOCKS_LEN  16384 /* Maximum length of contiguous blocks */

/* data type for filesystem-wide blocks number */
#define  ext4_fsblk_t unsigned long long

/* ioctl command */
#define EXT4_IOC_GET_DATA_BLOCK _IOW('f', 9, ext4_fsblk_t)
#define EXT4_IOC_DEFRAG _IOW('f', 10, struct ext4_ext_defrag_data)

#define DEVNAME     0
#define DIRNAME     1
#define FILENAME    2

#define RETURN_OK   0
#define RETURN_NG   -1
#define FTW_CONT    0
#define FTW_STOP    -1
#define FTW_OPEN_FD 2000
#define FILE_CHK_OK 0
#define FILE_CHK_NG -1
#define FS_EXT4     "ext4dev"
#define ROOT_UID    0
/* defrag block size, in bytes */
#define DEFRAG_SIZE 67108864

#define min(x,y) (((x) > (y)) ? (y) : (x))

#define PRINT_ERR_MSG(msg)  fprintf(stderr, "%s\n", (msg));
#define PRINT_FILE_NAME(file)   \
fprintf(stderr, "\t\t\"%s\"\n", (file));

#define MSG_USAGE   \
"Usage : e4defrag [-v] file...| directory...| device...\n  : e4defrag [-r] \
directory... | device... \n"
#define MSG_R_OPTION    \
" with regional block allocation mode.\n"
#define NGMSG_MTAB  \
"\te4defrag  : Can not access /etc/mtab."
#define NGMSG_UNMOUNT   "\te4defrag  : FS is not mounted."
#define NGMSG_EXT4  \
"\te4defrag  : FS is not ext4 File System."
#define NGMSG_FS_INFO   "\te4defrag  : get FSInfo fail."
#define NGMSG_FILE_INFO "\te4defrag  : get FileInfo fail."
#define NGMSG_FILE_OPEN "\te4defrag  : open fail."
#define NGMSG_FILE_SYNC "\te4defrag  : sync(fsync) fail."
#define NGMSG_FILE_DEFRAG   "\te4defrag  : defrag fail."
#define NGMSG_FILE_BLOCKSIZE    "\te4defrag  : can't get blocksize."
#define NGMSG_FILE_DATA "\te4defrag  : can't get data."
#define NGMSG_FILE_UNREG    \
"\te4defrag  : File is not regular file."
#define NGMSG_FILE_LARGE    \
"\te4defrag  : Defrag size is larger than FileSystem's free space."
#define NGMSG_FILE_PRIORITY \
"\te4defrag  : File is not current user's file or current user is not root."
#define NGMSG_FILE_LOCK "\te4defrag  : File is locked."
#define NGMSG_FILE_BLANK    "\te4defrag  : File size is 0."
#define NGMSG_GET_LCKINFO   "\te4defrag  : get LockInfo fail."
#define NGMSG_TYPE  \
"e4defrag  : Can not process %s."

struct ext4_ext_defrag_data {
ext4_fsblk_t start_offset; /* start offset to defrag in blocks */
ext4_fsblk_t defrag_size;  /* size of defrag in blocks */
ext4_fsblk_t goal;   /* block offset for allocation */
};

int detail_flag = 0;
int regional_flag = 0;
int amount_cnt = 0;
int succeed_cnt = 0;
ext4_fsblk_t goal = 0;

/*
 * Check if there's enough disk space
 */
int
check_free_size(int fd, off64_t fsize)
{
	struct statfs	fsbuf;
	off64_t		file_asize = 0;

	if (-1 == fstatfs(fd, &fsbuf)) {
		if (detail_flag) {
			perror(NGMSG_FS_INFO);
		}
		return RETURN_NG;
	}

	/* compute free space for root and normal user separately */
	if (ROOT_UID == getuid())
		file_asize = (off64_t)fsbuf.f_bsize * fsbuf.f_bfree;
	else
		file_asize = (off64_t)fsbuf.f_bsize * fsbuf.f_bavail;

	if (file_asize >= fsize)
		return RETURN_OK;

	return RETURN_NG;
}

/*
 * file tree walk callback function
 * check file attributes before ioctl call to avoid illegal operations
 */
int
ftw_fn(
const char* file,
const struct stat64 *sb,
int flag,
struct FTW*  

ext4 online defrag (ver 0.4)

2007-04-26 Thread Takashi Sato
Hi all,

I have made the following changes to the previous online defrag patchset
to improve it. Note that there is no functional change.

1. Change the handling of the temporary inode.
   Now ext4_ext_defrag() calls the ext4_new_inode()/iput() pair instead of
   new_inode()/delete_ext_defrag_inode(), because new_inode() does not
   initialize all of the entries that I need, such as i_extra_isize.

2. Change how to swap blocks.
   In this patchset, the original blocks of the target file are carefully
   swapped with those of the temporary inode so that they are released
   in iput().

3. Add an exclusive lock.
   Now ext4_inode_info.truncate_mutex is locked while the file is being
   defragmented.

4. Mark the locality group as dirty.
   The lg is moved to the s_locality_dirty list and marked as dirty
   if nr_to_write (the total count of pages not yet written to disk)
   is 0 or less and lg_io is not empty in ext4_lg_sync_single_group().
   This makes sure that the inode is written to disk.

Current status:
These patches are at the experimental stage, so they have issues and
items to improve, but I think they are worth examining as a trial.

Dependencies:
My patches depend on Alex's following multi-block allocation patches
for Linux 2.6.19-rc6.
"[RFC] delayed allocation, mballoc, etc"
http://marc.theaimsgroup.com/?l=linux-ext4&m=116493228301966&w=2

Outstanding issues:
Nothing for the moment.

Items to improve:
- Optimize the depth of the extent tree and the number of leaf nodes
  after defragmentation.
- The blocks on the temporary inode are moved to the original inode
  one page at a time in the current implementation.  I have to tune
  the number of pages moved per step for performance.
- Support indirect block files.

Next steps:
- Defragmentation of fragmented free space.
  If the filesystem has insufficient contiguous blocks, move other files
  to make sufficient space and allocate contiguous blocks for
  the target file.

Summary of patches:
*These patches apply on top of Alex's patches.
"[RFC] delayed allocation, mballoc, etc"
http://marc.theaimsgroup.com/?l=linux-ext4&m=116493228301966&w=2

[PATCH 1/4] Allocate new contiguous blocks with Alex's mballoc
- Search contiguous free blocks and allocate them for the temporary
  inode with Alex's multi-block allocation.

[PATCH 2/4] Move the file data to the new blocks
- Move the blocks on the temporary inode to the original inode
  one page at a time.

[PATCH 3/4] Online defrag command
- The defrag command.  Usage is as follows:
  o Put multiple files closer together.
    # e4defrag -r directory-name
  o Defrag a single file.
    # e4defrag file-name
  o Defrag all files on ext4.
    # e4defrag device-name

[PATCH 4/4] ext4_locality_group bug fix 
- Move lg_list to s_locality_dirty in ext4_lg_sync_single_group()
  to flush all dirty inodes.

Any comments from reviews or tests are very welcome.

Cheers, Takashi


[RFC][PATCH 4/4] ext4_locality_group bug fix

2007-04-26 Thread Takashi Sato
In ext4_lg_sync_single_group(), move lg_list to s_locality_dirty and
mark the lg as dirty if nr_to_write (the count of pages that have not
been written to disk yet) is 0 or less and lg_io is not empty.
This makes sure that the inode is written to disk.

Signed-off-by: Takashi Sato <[EMAIL PROTECTED]>
---
diff -Nrup -X linux-2.6.19-rc6-lg/Documentation/dontdiff 
linux-2.6.19-rc6-lg/fs/ext4/lg.c linux-2.6.19-rc6-full/fs/ext4/lg.c
--- linux-2.6.19-rc6-lg/fs/ext4/lg.c  2007-04-26 19:55:37.0 +0900
+++ linux-2.6.19-rc6-full/fs/ext4/lg.c  2007-04-26 19:17:59.0 +0900
@@ -389,6 +389,10 @@ int ext4_lg_sync_single_group(struct sup
 		cond_resched();
 		spin_lock(&inode_lock);
 		if (wbc->nr_to_write <= 0) {
+			if (!list_empty(&lg->lg_io)) {
+				set_bit(EXT4_LG_DIRTY, &lg->lg_flags);
+				list_move(&lg->lg_list, &sbi->s_locality_dirty);
+			}
 			rc = EXT4_STOP_WRITEBACK;
 			code = 6;
 			break;


[RFC][PATCH 1/4] Allocate new contiguous blocks

2007-04-26 Thread Takashi Sato
Search for contiguous free blocks with Alex's multi-block allocation
and allocate them for the temporary inode.

This patch applies on top of Alex's patches.
"[RFC] delayed allocation, mballoc, etc"
http://marc.theaimsgroup.com/?l=linux-ext4&m=116493228301966&w=2

Signed-off-by: Takashi Sato <[EMAIL PROTECTED]>
---
diff -Nrup -X linux-2.6.19-rc6-alloc_block/Documentation/dontdiff 
linux-2.6.19-rc6-full/fs/ext4/extents.c 
linux-2.6.19-rc6-alloc_block/fs/ext4/extents.c
--- linux-2.6.19-rc6-full/fs/ext4/extents.c 2007-04-26 20:36:50.0 +0900
+++ linux-2.6.19-rc6-alloc_block/fs/ext4/extents.c  2007-04-26 20:36:05.0 +0900
@@ -2335,10 +2335,635 @@ int ext4_ext_calc_metadata_amount(struct
return num;
 }
 
+/*
+ * this structure is used to gather extents from the tree via ioctl
+ */
+struct ext4_extent_buf {
+   ext4_fsblk_t start;
+   int buflen;
+   void *buffer;
+   void *cur;
+   int err;
+};
+
+/*
+ * this structure is used to collect stats info about the tree
+ */
+struct ext4_extent_tree_stats {
+   int depth;
+   int extents_num;
+   int leaf_num;
+};
+
+static int
+ext4_ext_store_extent_cb(struct inode *inode,
+   struct ext4_ext_path *path,
+   struct ext4_ext_cache *newex,
+   struct ext4_extent_buf *buf)
+{
+
+   if (newex->ec_type != EXT4_EXT_CACHE_EXTENT)
+   return EXT_CONTINUE;
+
+   if (buf->err < 0)
+   return EXT_BREAK;
+   if (buf->cur - buf->buffer + sizeof(*newex) > buf->buflen)
+   return EXT_BREAK;
+
+   if (!copy_to_user(buf->cur, newex, sizeof(*newex))) {
+   buf->err++;
+   buf->cur += sizeof(*newex);
+   } else {
+   buf->err = -EFAULT;
+   return EXT_BREAK;
+   }
+   return EXT_CONTINUE;
+}
+
+static int
+ext4_ext_collect_stats_cb(struct inode *inode,
+   struct ext4_ext_path *path,
+   struct ext4_ext_cache *ex,
+   struct ext4_extent_tree_stats *buf)
+{
+   int depth;
+
+   if (ex->ec_type != EXT4_EXT_CACHE_EXTENT)
+   return EXT_CONTINUE;
+
+   depth = ext_depth(inode);
+   buf->extents_num++;
+   if (path[depth].p_ext == EXT_FIRST_EXTENT(path[depth].p_hdr))
+   buf->leaf_num++;
+   return EXT_CONTINUE;
+}
+
+int ext4_ext_ioctl(struct inode *inode, struct file *filp, unsigned int cmd,
+   unsigned long arg)
+{
+   int err = 0;
+   if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
+   return -EINVAL;
+
+   if (cmd == EXT4_IOC_GET_EXTENTS) {
+   struct ext4_extent_buf buf;
+
+   if (copy_from_user(&buf, (void *) arg, sizeof(buf)))
+   return -EFAULT;
+
+   buf.cur = buf.buffer;
+   buf.err = 0;
+   mutex_lock(&EXT4_I(inode)->truncate_mutex);
+   err = ext4_ext_walk_space(inode, buf.start, EXT_MAX_BLOCK,
+   (void *)ext4_ext_store_extent_cb, &buf);
+   mutex_unlock(&EXT4_I(inode)->truncate_mutex);
+   if (err == 0)
+   err = buf.err;
+   } else if (cmd == EXT4_IOC_GET_TREE_STATS) {
+   struct ext4_extent_tree_stats buf;
+
+   mutex_lock(&EXT4_I(inode)->truncate_mutex);
+   buf.depth = ext_depth(inode);
+   buf.extents_num = 0;
+   buf.leaf_num = 0;
+   err = ext4_ext_walk_space(inode, 0, EXT_MAX_BLOCK,
+   (void *)ext4_ext_collect_stats_cb, &buf);
+   mutex_unlock(&EXT4_I(inode)->truncate_mutex);
+   if (!err)
+   err = copy_to_user((void *) arg, &buf, sizeof(buf));
+   } else if (cmd == EXT4_IOC_GET_TREE_DEPTH) {
+   mutex_lock(&EXT4_I(inode)->truncate_mutex);
+   err = ext_depth(inode);
+   mutex_unlock(&EXT4_I(inode)->truncate_mutex);
+   } else if (cmd == EXT4_IOC_FIBMAP) {
+   ext4_fsblk_t __user *p = (ext4_fsblk_t __user *)arg;
+   ext4_fsblk_t block = 0;
+   struct address_space *mapping = filp->f_mapping;
+
+   if (copy_from_user(&block, (ext4_fsblk_t __user *)arg,
+   sizeof(block)))
+   return -EFAULT;
+
+   lock_kernel();
+   block = ext4_bmap(mapping, block);
+   unlock_kernel();
+
+   return put_user(block, p);
+   } else if (cmd == EXT4_IOC_DEFRAG) {
+   struct ext4_ext_defrag_data defrag;
+
+   if (copy_from_user(&defrag,
+   (struct ext4_ext_defrag_data __user *)arg,
+   sizeof(defrag)))
+   return -EFAULT;
+
+   err = ext4_ext_defrag(filp, defrag.start_offset,
+   

Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-26 Thread Jan Kara
> On Wed, Apr 25, 2007 at 08:54:34PM +1000, David Chinner wrote:
> > On Tue, Apr 24, 2007 at 04:53:11PM -0500, Amit Gud wrote:
> > >  --   --
> > > | cnode 0  |-->| cnode 0  |--> to another cnode or NULL
> > >  --   --
> > > | cnode 1  |-  | cnode 1  |-
> > >  --   |   --  |
> > > | cnode 2  |-- |  | cnode 2  |--   |
> > >  --  | |  --  |   |
> > > | cnode 3  | | |  | cnode 3  | |   |
> > >  --  | |  --  |   |
> > > |  |  ||  |   |
> > > 
> > >  inodes   inodes or NULL
> > 
> > How do you recover if fsfuzzer takes out a cnode in the chain? The
> > chunk is marked clean, but clearly corrupted and needs fixing and
> > you don't know what it was pointing at.  Hence you have a pointer to
> > a trashed cnode *somewhere* that you need to find and fix, and a
> > bunch of orphaned cnodes that nobody points to *somewhere else* in
> > the filesystem that you have to find. That's a full scan fsck case,
> > isn't it?
> 
> Excellent question.  This is one of the trickier aspects of chunkfs -
> the orphan inode problem (tricky, but solvable).  The problem is what
> if you smash/lose/corrupt an inode in one chunk that has a
> continuation inode in another chunk?  A back pointer does you no good
> if the back pointer is corrupted.
> 
> What you do is keep tabs on whether you see damage that looks like
> this has occurred - e.g., inode use/free counts wrong, you had to zero
> a corrupted inode - and when this happens, you do a scan of all
> continuation inodes in chunks that have links to the corrupted chunk.
> What you need to make this go fast is (1) a pre-made list of which
> chunks have links with which other chunks, (2) a fast way to read all
> of the continuation inodes in a chunk (ignoring chunk-local inodes).
> This stage is O(fs size) approximately, but it should be quite swift.
  Do I get it right that you just have in each cnode a pointer to the
previous & next cnode? But then if two consecutive cnodes get corrupted,
you have no way to connect the chain, do you? If each cnode contained
some unique identifier of the file and a number identifying the position
of the cnode, then there would be at least some way (though expensive) to
link them together correctly...
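
To make the idea concrete, here is a minimal sketch of how a recovery
pass might relink cnodes found by a scan, assuming each one carries a
file identifier and a position number.  The struct cnode_rec layout and
the function names are hypothetical and are not taken from the chunkfs
patches.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record for one cnode found during a recovery scan. */
struct cnode_rec {
	uint64_t file_id;  /* identifier shared by all cnodes of one file */
	uint32_t seq;      /* position of this cnode in the file's chain */
	struct cnode_rec *prev, *next;
};

/* Order by file first, then by position within the file. */
static int cmp_rec(const void *a, const void *b)
{
	const struct cnode_rec *x = a, *y = b;

	if (x->file_id != y->file_id)
		return x->file_id < y->file_id ? -1 : 1;
	if (x->seq != y->seq)
		return x->seq < y->seq ? -1 : 1;
	return 0;
}

/*
 * Sort the surviving cnodes and rebuild prev/next from the sorted order.
 * Missing positions (corrupted cnodes) are simply bridged over.
 * Returns the number of distinct chains (files) found.
 */
static size_t relink_chains(struct cnode_rec *recs, size_t n)
{
	size_t chains = 0;

	qsort(recs, n, sizeof(*recs), cmp_rec);
	for (size_t i = 0; i < n; i++) {
		int first = (i == 0) || recs[i - 1].file_id != recs[i].file_id;
		int last = (i + 1 == n) || recs[i + 1].file_id != recs[i].file_id;

		recs[i].prev = first ? NULL : &recs[i - 1];
		recs[i].next = last ? NULL : &recs[i + 1];
		chains += first;
	}
	return chains;
}
```

The sort is the expensive part (O(n log n) over every cnode the scan
turns up), but it does not depend on any prev/next pointer surviving.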

Honza
-- 
Jan Kara <[EMAIL PROTECTED]>
SuSE CR Labs