date:20070515

Re: + knfsd-exportfs-add-exportfsh-header-fix.patch added to -mm tree

2007-05-15 Thread Christoph Hellwig

On Tue, May 15, 2007 at 02:49:14PM -0700, [EMAIL PROTECTED] wrote:
>  fs/cifs/export.c |3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff -puN fs/cifs/export.c~knfsd-exportfs-add-exportfsh-header-fix 
> fs/cifs/export.c
> --- a/fs/cifs/export.c~knfsd-exportfs-add-exportfsh-header-fix
> +++ a/fs/cifs/export.c
> @@ -29,7 +29,8 @@
>*/
>  
>  #include 
> - 
> +#include 
> +
>  #ifdef CONFIG_CIFS_EXPERIMENTAL

Looks like cifs has grown and export_operations table since I did the
patch.  But with only a get_parent method that returns and error it's
not useful at all, so we should rather remove the whole file:


Signed-off-by: Christoph Hellwig <[EMAIL PROTECTED]>

Index: linux-2.6/fs/cifs/cifsfs.c
===
--- linux-2.6.orig/fs/cifs/cifsfs.c 2007-05-16 07:55:35.0 +0200
+++ linux-2.6/fs/cifs/cifsfs.c  2007-05-16 07:55:50.0 +0200
@@ -49,10 +49,6 @@
 static struct quotactl_ops cifs_quotactl_ops;
 #endif /* QUOTA */
 
-#ifdef CONFIG_CIFS_EXPERIMENTAL
-extern struct export_operations cifs_export_ops;
-#endif /* EXPERIMENTAL */
-
 int cifsFYI = 0;
 int cifsERROR = 1;
 int traceSMB = 0;
@@ -114,10 +110,6 @@ cifs_read_super(struct super_block *sb, 
 
sb->s_magic = CIFS_MAGIC_NUMBER;
sb->s_op = &cifs_super_ops;
-#ifdef CONFIG_CIFS_EXPERIMENTAL
-   if (experimEnabled != 0)
-   sb->s_export_op = &cifs_export_ops;
-#endif /* EXPERIMENTAL */  
 /* if (cifs_sb->tcon->ses->server->maxBuf > MAX_CIFS_HDR_SIZE + 512)
sb->s_blocksize = cifs_sb->tcon->ses->server->maxBuf - 
MAX_CIFS_HDR_SIZE; */
 #ifdef CONFIG_CIFS_QUOTA
Index: linux-2.6/fs/cifs/export.c
===
--- linux-2.6.orig/fs/cifs/export.c 2007-05-16 07:55:59.0 +0200
+++ /dev/null   1970-01-01 00:00:00.0 +
@@ -1,52 +0,0 @@
-/*
- *   fs/cifs/export.c
- *
- *   Copyright (C) International Business Machines  Corp., 2007
- *   Author(s): Steve French ([EMAIL PROTECTED])
- *
- *   Common Internet FileSystem (CIFS) client
- * 
- *   Operations related to support for exporting files via NFSD
- *
- *   This library is free software; you can redistribute it and/or modify
- *   it under the terms of the GNU Lesser General Public License as published
- *   by the Free Software Foundation; either version 2.1 of the License, or
- *   (at your option) any later version.
- *
- *   This library is distributed in the hope that it will be useful,
- *   but WITHOUT ANY WARRANTY; without even the implied warranty of
- *   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See
- *   the GNU Lesser General Public License for more details.
- *
- *   You should have received a copy of the GNU Lesser General Public License
- *   along with this library; if not, write to the Free Software
- *   Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
- */
- 
- /* 
-  * See Documentation/filesystems/Exporting
-  * and examples in fs/exportfs
-  */
-
-#include 
- 
-#ifdef CONFIG_CIFS_EXPERIMENTAL
- 
-static struct dentry *cifs_get_parent(struct dentry *dentry)
-{
-   /* BB need to add code here eventually to enable export via NFSD */
-   return ERR_PTR(-EACCES);
-}
- 
-struct export_operations cifs_export_ops = {
-   .get_parent = cifs_get_parent,
-/* Following five export operations are unneeded so far and can default */ 

-/* .get_dentry =
-   .get_name =
-   .find_exported_dentry =
-   .decode_fh = 
-   .encode_fs =  */
- };
- 
-#endif /* EXPERIMENTAL */
- 
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] LogFS take three

2007-05-15 Thread Willy Tarreau

On Wed, May 16, 2007 at 02:06:31AM +0200, Jörn Engel wrote:
> On Tue, 15 May 2007 13:37:59 -0700, Andrew Morton wrote:
> > It's strange and a bit regrettable that an fs would have dependency on MTD,
> > really.
> 
> It is and changing this wouldn't be too hard.  All device access goes
> through five functions (read, write, erase, is_bad and mark_bad).  As
> soon as someone seriously cares I will add a struct logfs_device_ops and
> have a second set of these functions for block devices.
> 
> On hard disks it shouldn't make too much sense.  The filesystem will
> fragment like a splinter bomb and be just as popular.

On hard disks, yes, but as you suggested, there are lots of other flash
devices interfaced as block devices. CompactFlash comes to mind, USB
keys too. And on these ones, the most important is to reduce the number
of writes and to support large sizes. I already see LogFS as an interesting
alternative to JFFS2 on such devices, eventhough it does not (yet?) support
compression.

Regards,
Willy

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH 8/14] Union-mount lookup

2007-05-15 Thread Bharata B Rao

On Tue, May 15, 2007 at 09:57:24AM +0200, Jan Engelhardt wrote:
> 
> On May 14 2007 15:12, Bharata B Rao wrote:
> > 
> >+struct dentry * d_lookup_single(struct dentry *parent, struct qstr *name)
> >+{
> >+struct dentry *dentry;
> >+unsigned long seq;
> >+
> >+do {
> >+seq = read_seqbegin(&rename_lock);
> >+dentry = __d_lookup_single(parent, name);
> >+if (dentry)
> >+break;
> >+} while (read_seqretry(&rename_lock, seq));
> >+return dentry;
> >+}
> 
> Replace with tabs.

This is copied from fs/dcache.c:d_lookup() and the whitespaces came from there.
But that is not an excuse, will fix.

> 
> >+lookup_union:
> >+do {
> >+struct vfsmount *mnt = find_mnt(topmost);
> >+UM_DEBUG_DCACHE("name=\"%s\", inode=%p, device=%s\n",
> >+topmost->d_name.name, topmost->d_inode,
> >+mnt->mnt_devname);
> >+mntput(mnt);
> >+} while (0);
> 
> Why the extra do{}while? [elsewhere too]

Not sure, may be to get a scope to define 'mnt' here. Jan ?

> 
> >+if (topmost->d_union) {
> >+union_lock_spinlock(topmost, &topmost->d_lock);
> >+}
> 
> Extra {} could go [elsewhere too].
> 
> >+if (last->d_overlaid
> >+&& (last->d_overlaid != dentry)) {
> 
> As can these extra () [elsewhere too].
>

Sure, will fix all these.
 
> >+static inline struct dentry * __lookup_hash_single(struct qstr *name, 
> >struct dentry *base, struct nameidata *nd)
> >+{
> >+struct dentry *dentry;
> >+struct inode *inode;
> >+int err;
> >+
> >+inode = base->d_inode;
> >+
> >+err = permission(inode, MAY_EXEC, nd);
> >+dentry = ERR_PTR(err);
> >+if (err)
> >+goto out;
> >+
> >+dentry = __lookup_hash_kern_single(name, base, nd);
> >+out:
> >+return dentry;
> >+}
> 
> This looks a little big for being inline.

ok.

Regards,
Bharata.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH 7/14] Union-mount mounting

2007-05-15 Thread Bharata B Rao

On Tue, May 15, 2007 at 09:29:39AM +0200, Jan Engelhardt wrote:
> 
> On May 14 2007 15:11, Bharata B Rao wrote:
> >
> >TODO: bind and move mounts aren't yet supported with union mounts.
> 
> Are the semantics already set?

Not yet.

> 
> >@@ -294,6 +294,10 @@ static struct vfsmount *clone_mnt(struct
> > if (!mnt)
> > goto alloc_failed;
> > 
> >+/*
> >+ * As of now, cloning of union mounted mnt isn't permitted.
> >+ */
> >+BUG_ON(mnt->mnt_flags & MNT_UNION);
> 
> One, please avoid BUG_ONs. Now I am not sure if clone_mnt is called
> as part of kthread creation when CLONE_FS is it. If so, get rid of
> this one real fast. Also see chunk "@@ -1031,.. @@ do_loopback"
> below.

Looks like not CLONE_FS but CLONE_NEWNS ends up calling clone_mnt.

> 
>   if(mnt->mnt_flags & MNT_UNION)
>   goto return_einval;
> 
> or something.

Will do this for now, but eventually we need to get this working sanely anyway.

> 
> >+#ifdef CONFIG_UNION_MOUNT
> >+struct union_info *uinfo = NULL;
> >+#endif
> > 
> > retval = security_sb_umount(mnt, flags);
> > if (retval)
> >@@ -685,6 +696,14 @@ static int do_umount(struct vfsmount *mn
> > }
> > 
> > down_write(&namespace_sem);
> >+#ifdef CONFIG_UNION_MOUNT
> >+/*
> >+ * Grab a reference to the union_info which gets detached
> >+ * from the dentries in release_mounts().
> >+ */
> >+if (mnt->mnt_flags & MNT_UNION)
> >+uinfo = union_lock_and_get(mnt->mnt_root);
> >+#endif
> > spin_lock(&vfsmount_lock);
> > event++;
> > 
> >@@ -699,6 +718,15 @@ static int do_umount(struct vfsmount *mn
> > security_sb_umount_busy(mnt);
> > up_write(&namespace_sem);
> > release_mounts(&umount_list);
> >+#ifdef CONFIG_UNION_MOUNT
> >+if (uinfo) {
> >+if (atomic_read(&uinfo->u_count) == 1)
> >+/* We are the last user of this union_info */
> >+union_release(uinfo);
> >+else
> >+union_put_and_unlock(uinfo);
> >+}
> >+#endif
> > return retval;
> > }
> > 
> 
> Is it feasible to do with with some less #if/#endif magic?:
> 

Will try. We need union_info here which is available only with
CONFIG_UNION_MOUNT.

> >@@ -1031,6 +1070,15 @@ static int do_loopback(struct nameidata 
> > if (err)
> > return err;
> > 
> >+/*
> >+ * bind mounting to or from union mounts is not supported
> >+ */
> >+err = -EINVAL;
> >+if (nd->mnt->mnt_flags & MNT_UNION)
> >+goto out_unlocked;
> >+if (old_nd.mnt->mnt_flags & MNT_UNION)
> >+goto out_unlocked;
> >+
> 
> Do the same in clone_mnt.
> 
> > down_write(&namespace_sem);
> > err = -EINVAL;
> > if (IS_MNT_UNBINDABLE(old_nd.mnt))
> >@@ -1064,6 +1112,7 @@ static int do_loopback(struct nameidata 
> > 
> > out:
> > up_write(&namespace_sem);
> >+out_unlocked:
> > path_release(&old_nd);
> > return err;
> > }
> 
> >+++ b/include/linux/fs.h
> >@@ -1984,6 +1984,9 @@ static inline ino_t parent_ino(struct de
> > /* kernel/fork.c */
> > extern int unshare_files(void);
> > 
> >+/* fs/union.c */
> >+#include 
> >+
> > /* Transaction based IO helpers */
> > 
> > /*
> 
> This raises a big eyebrow. If linux/fs.h can compile without the
> inclusion of linux/union.h, do not put linux/union.h in fs.h.
> 

Ok, better to include union.h in the appropriate .c file which needs it.

> >+#ifdef CONFIG_UNION_MOUNT
> >+
> >+#include 
> >+
> >+/* namespace stuff used at mount time */
> >+extern void attach_mnt_union(struct vfsmount *, struct nameidata *);
> >+extern void detach_mnt_union(struct vfsmount *, struct path *);
> 
> You do not need that #include I suppose. Just predeclare the structs.
> 
>   struct path;
>   struct vfsmount;
>   extern void ...
> 
> Saves us the "compiler slurps in so many .h files" problem.

Sure. And thanks for the review.

Regards,
Bharata.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] LogFS take three

2007-05-15 Thread David Woodhouse

On Tue, 2007-05-15 at 21:19 +0200, Jörn Engel wrote:
> On Tue, 15 May 2007 15:07:05 -0400, John Stoffel wrote:
> > 
> > I've been semi watching this, and the only comment I really can give
> > is that I hate the name.  To me, logfs implies a filesystem for
> > logging purposes, not for Flash hardware with wear leveling issues to
> > be taken into account.
> 
> Yeah, well, ...
> 
> Two years ago when I started all this, I was looking for a good name.
> All I could come up with sounded stupid, so I picked "LogFS" as a code
> name.  As soon as I find a better name, the code name should get
> replaced.
> 
> By now I still don't have anything better.  All alternatives that were
> proposed are just as bad - with the added disadvantage of being new and
> not established yet.  My hope of ever finding a better name is nearly
> zero. 

Personally I'd just go for 'JFFS3'. After all, it has a better claim to
the name than either of its predecessors :)

-- 
dwmw2

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/5][TAKE3] fallocate() implementation on i86, x86_64 and powerpc

2007-05-15 Thread David Chinner

On Wed, May 16, 2007 at 01:33:59AM +0530, Amit K. Arora wrote:
> This patch implements sys_fallocate() and adds support on i386, x86_64
> and powerpc platforms.

Can you please pick up the ia64 support patch I posted as well?

> Changelog:
> -
> Note: The changes below are from the initial post (dated 26th April,
> 2007) and _not_ from TAKE2. The only difference from TAKE2 is the kernel
> version on which this patch is based. TAKE2 was based on 2.6.21 and this
> is based on 2.6.22-rc1.
> 
> Following changes were made to the previous version:
>  1) Added description before sys_fallocate() definition.
>  2) Return EINVAL for len<=0 (With new draft that Ulrich pointed to,
> posix_fallocate should return EINVAL for len <= 0.
>  3) Return EOPNOTSUPP if mode is not one of FA_ALLOCATE or FA_DEALLOCATE
>  4) Do not return ENODEV for dirs (let individual file systems decide if
> they want to support preallocation to directories or not.
>  5) Check for wrap through zero.
>  6) Update c/mtime if fallocate() succeeds.

Please don't make this always happen. c/mtime updates should be dependent
on the mode being used and whether there is visible change to the file. If no
userspace visible changes to the file occurred, then timestamps should not
be changed.

e.g. FA_ALLOCATE that changes file size requires same semantics of ftruncate()
extending the file, otherwise no change in timestamps should occur.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] LogFS take three

2007-05-15 Thread Roland Dreier

There are rather a lot of of FIXME comments, including scary stuff like

 > +/*
 > + * FIXME: this cannot be right but it does "fix" a bug of i_count
 > + * dropping too low.  Needs more thought.
 > + */
 > +atomic_inc(&old_dentry->d_inode->i_count);

and

 > +int __logfs_write_inode(struct inode *inode)
 > +{
 > +/*
 > + * FIXME: Those two inodes are 512 bytes in total.  Not good to
 > + * have on the stack.  Possibly the best solution would be to bite
 > + * the bullet and do another format change before release and
 > + * shrink the inodes.
 > + */
 > +struct logfs_disk_inode old, new;

are you going to change the format?  or fix this some other way?

I think a sweep through the code searching for FIXME and at least
rewriting all such comments to look like stuff that can be deferred
would be warranted ;)

 - R.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] LogFS take three

2007-05-15 Thread Jörn Engel

On Wed, 16 May 2007 02:06:31 +0200, Jörn Engel wrote:
>
> > > +
> > > + if (dest) {
> > > + /* symlink */
> > > + ret = logfs_inode_write(inode, dest, destlen, 0);
> > > + } else {
> > > + /* creat/mkdir/mknod */
> > > + ret = __logfs_write_inode(inode);
> > > + }
> > > + super->s_victim_ino = 0;
> > > + if (ret) {
> > > + if (!dest)
> > > + li->li_flags |= LOGFS_IF_STILLBORN;
> > > + /* FIXME: truncate symlink */
> > > + inode->i_nlink--;
> > > + iput(inode);
> > > + goto out;
> > > + }
> > > +
> > > + if (inode->i_mode & S_IFDIR)
> > > + dir->i_nlink++;
> > 
> > You have helper functions for i_nlink++, which remember to do
> > mark_inode_dirty()?
> 
> I do.  Looks like I should use them here and in at least one other
> place.  Will recheck them all.

Actually, this code was correct.  New inodes get written immediatly to
ensure catching -ENOSPC at a place where errors can get returned.  After
leaving this function, the inode is not dirty.  Either it has been
written or it is marked LOGFS_IF_STILLBORN never will be.

Your intuition was still good.  Link() did need the change you
indicated.

Jörn

-- 
Joern's library part 14:
http://www.sandpile.org/
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

RE: Software raid0 will crash the file-system, when each disk is 5TB

2007-05-15 Thread Jeff Zheng

Here is the information of the created raid0. Hope it is enough.

Jeff

The crashing one:
md: bind
md: bind
md: raid0 personality registered for level 0
md0: setting max_sectors to 4096, segment boundary to 1048575
raid0: looking at sde
raid0:   comparing sde(5859284992) with sde(5859284992)
raid0:   END
raid0:   ==> UNIQUE
raid0: 1 zones
raid0: looking at sdd
raid0:   comparing sdd(5859284992) with sde(5859284992)
raid0:   EQUAL
raid0: FINAL 1 zones
raid0: done.
raid0 : md_size is 11718569984 blocks.
raid0 : conf->hash_spacing is 11718569984 blocks.
raid0 : nb_zone is 2.
raid0 : Allocating 8 bytes for hash.
JFS: nTxBlock = 8192, nTxLock = 65536

The working one:
md: bind
md: bind
md: bind
md: bind
md0: setting max_sectors to 4096, segment boundary to 1048575
raid0: looking at sdd
raid0:   comparing sdd(2929641472) with sdd(2929641472)
raid0:   END
raid0:   ==> UNIQUE
raid0: 1 zones
raid0: looking at sdg
raid0:   comparing sdg(2929641472) with sdd(2929641472)
raid0:   EQUAL
raid0: looking at sdf
raid0:   comparing sdf(2929641472) with sdd(2929641472)
raid0:   EQUAL
raid0: looking at sde
raid0:   comparing sde(2929641472) with sdd(2929641472)
raid0:   EQUAL
raid0: FINAL 1 zones
raid0: done.
raid0 : md_size is 11718565888 blocks.
raid0 : conf->hash_spacing is 11718565888 blocks.
raid0 : nb_zone is 2.
raid0 : Allocating 8 bytes for hash.
JFS: nTxBlock = 8192, nTxLock = 65536

-Original Message-
From: Neil Brown [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, 16 May 2007 12:04 p.m.
To: Michal Piotrowski
Cc: Jeff Zheng; Ingo Molnar; [EMAIL PROTECTED];
[EMAIL PROTECTED]; linux-fsdevel@vger.kernel.org
Subject: Re: Software raid0 will crash the file-system, when each disk
is 5TB

On Wednesday May 16, [EMAIL PROTECTED] wrote:
> >
> > Anybody have a clue?
> >

No...
When a raid0 array is assemble, quite a lot of message get printed
about number of zones and hash_spacing etc.  Can you collect and post
those.  Both for the failing case (2*5.5T) and the working case
(4*2.55T) is possible.

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/5][TAKE3] fallocate() implementation on i86, x86_64 and powerpc

2007-05-15 Thread Mingming Cao

On Wed, 2007-05-16 at 01:33 +0530, Amit K. Arora wrote:
> This patch implements sys_fallocate() and adds support on i386, x86_64
> and powerpc platforms.

> @@ -1137,6 +1148,8 @@ struct inode_operations {
>   ssize_t (*listxattr) (struct dentry *, char *, size_t);
>   int (*removexattr) (struct dentry *, const char *);
>   void (*truncate_range)(struct inode *, loff_t, loff_t);
> + long (*fallocate)(struct inode *inode, int mode, loff_t offset,
> +   loff_t len);
>  };

Does the return value from fallocate inode operation has to be *long*?
It's not consistent with the ext4_fallocate() define in patch 4/5, 

+int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t
len)

thus cause compile warnings.



Mingming

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/2] AFS: Implement shared-writable mmap

2007-05-15 Thread Nick Piggin


Andrew Morton wrote:

On Tue, 15 May 2007 16:52:31 +0100
David Howells <[EMAIL PROTECTED]> wrote:



Implement shared-writable mmap for AFS.



This blows up in -mm:

fs/afs/file.c:59: error: 'filemap_nopage' undeclared here (not in a function)
fs/afs/file.c:60: error: unknown field 'populate' specified in initializer
fs/afs/file.c:60: error: 'filemap_populate' undeclared here (not in a function)

because Nick went and renamed half the VM and deleted the other half.


And page_mkwrite is next ;)



I need to work out what to do with

mm-fix-fault-vs-invalidate-race-for-linear-mappings.patch
mm-merge-populate-and-nopage-into-fault-fixes-nonlinear.patch
mm-merge-populate-and-nopage-into-fault-fixes-nonlinear-doc-fix.patch
mm-merge-populate-and-nopage-into-fault-fixes-nonlinear-fix.patch
mm-merge-nopfn-into-fault.patch
convert-hugetlbfs-to-use-vm_ops-fault.patch
mm-remove-legacy-cruft.patch
mm-debug-check-for-the-fault-vs-invalidate-race.patch
mm-fix-clear_page_dirty_for_io-vs-fault-race.patch

Probably merge them, I guess.  Hugh had concerns, I think over small
additional overhead from the lock_page()?


Yes he did. It seems to only be noticable in microbenchmarks. In my opinion
not enough to withhold pagecache corruption bug fixes.

Still, I have some lock_page speedup work that eliminates that regression
anyway.

However, Hugh hasn't exactly said yes or no yet...

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/2] AFS: Implement shared-writable mmap

2007-05-15 Thread Nick Piggin


David Howells wrote:

Implement shared-writable mmap for AFS.

The key with which to access the file is obtained from the VMA at the point
where the PTE is made writable by the page_mkwrite() VMA op and cached in the
affected page.

If there's an outstanding write on the page made with a different key, then
page_mkwrite() will flush it before attaching a record of the new key.


Good, will be nice to get a page_mkwrite() user in the tree.




+/*
+ * notification that a previously read-only page is about to become writable
+ * - if it returns an error, the caller will deliver a bus error signal
+ *
+ * we use this to make a record of the key with which the writeback should be
+ * performed and to flush any outstanding writes made with a different key
+ *
+ * the key to be used is attached to the file pinned by the VMA
+ */
+int afs_page_mkwrite(struct vm_area_struct *vma, struct page *page)
+{
+   struct afs_vnode *vnode = AFS_FS_I(vma->vm_file->f_mapping->host);
+   struct key *key = vma->vm_file->private_data;
+   int ret;
+
+   _enter("{{%x:%u},%x},{%lx}",
+  vnode->fid.vid, vnode->fid.vnode, key_serial(key), page->index);
+
+   lock_page(page);
+   ret = afs_prepare_write(vma->vm_file, page, 0, 0);
+   unlock_page(page);
+
+   _leave(" = %d", ret);
+   return ret;
+}


By the looks of afs_prepare_write, it is going to go bang when the page
gets truncated before lock_page.

Checking page->mapping after lock_page should do the trick.

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/2] AFS: Fix afs_prepare_write()

2007-05-15 Thread Nick Piggin


David Howells wrote:

afs_prepare_write() should not mark a page up to date if it only partially
fills it in, in expectation of the caller filling in the rest prior to calling
commit_write().  commit_write(), however, should mark the page up to date.


Acked-by: Nick Piggin <[EMAIL PROTECTED]>



Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 fs/afs/write.c |3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/fs/afs/write.c b/fs/afs/write.c
index 28f3751..a03b92a 100644
--- a/fs/afs/write.c
+++ b/fs/afs/write.c
@@ -206,7 +206,6 @@ int afs_prepare_write(struct file *file, struct page *page,
_leave(" = %d [prep]", ret);
return ret;
}
-   SetPageUptodate(page);
}
 
 try_again:

@@ -311,8 +310,8 @@ int afs_commit_write(struct file *file, struct page *page,
spin_unlock(&vnode->writeback_lock);
}
 
+	SetPageUptodate(page);

set_page_dirty(page);
-
if (PageDirty(page))
_debug("dirtied");
 






--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] LogFS take three

2007-05-15 Thread Jörn Engel

On Tue, 15 May 2007 19:26:17 -0400, Albert Cahalan wrote:
> 
> Please don't forget the immutable bit. ("man lsattr")
> Having both, BSD-style, would be even better.
> The immutable bit is important for working around
> software bugs and "features" that damage files.
> 
> I also can't find xattr support.

Not forgotten, just on a different list.  Bugs beat features. :)

Jörn

-- 
Data expands to fill the space available for storage.
-- Parkinson's Law
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] LogFS take three

2007-05-15 Thread Jörn Engel

On Tue, 15 May 2007 13:37:59 -0700, Andrew Morton wrote:
> > +
> > +config LOGFS_FSCK
> > +   bool "Run LogFS fsck at mount time"
> > +   depends on LOGFS
> > +   help
> > + Run a full filesystem check on every mount.  If any errors are
> > + found, mounting the filesystem will fail.  This is a debug option
> > + for developers.
> > +
> > + If unsure, say N.
> > +
> 
> No dependency on MTD,

Ack.

> Can this code be tested by people who don't have MTD hardware?  We used to
> ahve a fake-mtd-on-a-blockdev thing, whcih was in a state of some
> disrepair.  Maybe it got repaired.  Or removed.  I can't immediately locate
> it...

Got removed.  Block2mtd is the new one and works with logfs.  At least
my version of it does and I pushed my patches to David - not sure if
they made it to mainline already.  Mtdram should work as well.

> It's strange and a bit regrettable that an fs would have dependency on MTD,
> really.

It is and changing this wouldn't be too hard.  All device access goes
through five functions (read, write, erase, is_bad and mark_bad).  As
soon as someone seriously cares I will add a struct logfs_device_ops and
have a second set of these functions for block devices.

On hard disks it shouldn't make too much sense.  The filesystem will
fragment like a splinter bomb and be just as popular.

> ooh, documentation.  Quick, merge it!

:)

> > +/* memtree.c */
> > +void btree_init(struct btree_head *head);
> > +void *btree_lookup(struct btree_head *head, long val);
> > +int btree_insert(struct btree_head *head, long val, void *ptr);
> > +int btree_remove(struct btree_head *head, long val);
> 
> These names are too generic.  If we later add a btree library: blam.

My plan was to move this code to lib/ sooner or later.  If you consider
it useful in its current state, I can do it immediatly.  And if someone
else merged a superior btree library I'd happily remove mine and use the
new one instead.

Opinions?

> > +/* super.c */
> > +int mtdread(struct super_block *sb, loff_t ofs, size_t len, void *buf);
> > +int mtdwrite(struct super_block *sb, loff_t ofs, size_t len, void *buf);
> > +int mtderase(struct super_block *sb, loff_t ofs, size_t len);
> > +void logfs_crash_dump(struct super_block *sb);
> > +int all_ff(void *buf, size_t len);
> > +int logfs_statfs(struct dentry *dentry, struct kstatfs *stats);
> 
> Have you checked that all of this needs global scope?

Not within the last month.  Will recheck.

> > +
> > +/* progs/fsck.c */
> > +#ifdef CONFIG_LOGFS_FSCK
> > +int logfs_fsck(struct super_block *sb);
> > +#else
> > +#define logfs_fsck(sb) ({ 0; })
> 
> static inline int logfs_fsck(struct super_block *sb)
> {
>   return 0;
> }
> 
> is better: nicer to look at, has typechecking.

That's what I tried first.  But the brick I carry on my neck decided to
compile and link the read logfs_fsck() unconditionally and so this
collided.

Will change the Makefile and this.

> > +static inline struct logfs_super *LOGFS_SUPER(struct super_block *sb)
> > +{
> > +   return sb->s_fs_info;
> > +}
> > +
> > +static inline struct logfs_inode *LOGFS_INODE(struct inode *inode)
> > +{
> > +   return container_of(inode, struct logfs_inode, vfs_inode);
> > +}
> 
> Do these need to be uppercase?

No more than EXT2_SB() or EXT2_I().

Since you are the second person to ask, I sense a majority vote to
change the case.

> > +static inline pgoff_t logfs_index(u64 pos)
> > +{
> > +   return pos / LOGFS_BLOCKSIZE;
> > +}
> 
> If the compiler goofs up here we'll end up trying to do a 64/32 divide and
> it won't link on 32-bit machines.  It would be safer to do
> 
>   return pos >> LOGFS_BLOCKSHIFT;

Will change.

> > +int logfs_uncompress(void *in, void *out, size_t inlen, size_t outlen)
> > +{
> > +   int err, ret;
> > +
> > +   ret = -EIO;
> > +   mutex_lock(&compr_mutex);
> 
> A per-superblock lock and stream would be nicer.

Yes and no.  The zlib workspaces weigh about 300k.  On any decent SMP
machine, that hardly matters and getting the scalability is nice.  For
low-end embedded, that amount of memory per filesystem can matter.

I have no clue how many people would suffer one way or another.

> > +
> > +
> > +static inline loff_t file_end(struct inode *inode)
> > +{
> > +   return (i_size_read(inode) + inode->i_sb->s_blocksize - 1)
> > +   >> inode->i_sb->s_blocksize_bits;
> > +}
> > +static void logfs_set_name(struct logfs_disk_dentry *dd, struct qstr *name)
> > +{
> 
> The code has a strange mix of two-blank-lines-between-functions and
> no-blank-lines-between-functions.  One blank line is usual.

Will change.

> > +
> > +   if (dest) {
> > +   /* symlink */
> > +   ret = logfs_inode_write(inode, dest, destlen, 0);
> > +   } else {
> > +   /* creat/mkdir/mknod */
> > +   ret = __logfs_write_inode(inode);
> > +   }
> > +   super->s_victim_ino = 0;
> > +   if (ret) {
> > +   if (!dest)
> > +   li->li_flags |= LOGFS_IF_STILLBORN;
> > +

Re: Software raid0 will crash the file-system, when each disk is 5TB

2007-05-15 Thread Neil Brown

On Wednesday May 16, [EMAIL PROTECTED] wrote:
> >
> > Anybody have a clue?
> >

No...
When a raid0 array is assemble, quite a lot of message get printed
about number of zones and hash_spacing etc.  Can you collect and post
those.  Both for the failing case (2*5.5T) and the working case
(4*2.55T) is possible.

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/5][TAKE3] fallocate system call

2007-05-15 Thread Mingming Cao

On Wed, 2007-05-16 at 01:07 +0530, Amit K. Arora wrote:

> ToDos:
> -
> 1> Implementation on other architectures (other than i386, x86_64,
> ppc64 and s390(x)). David Chinner has already posted a patch for ia64.

Here is the 2.6.22-rc1 version of David's patch: add fallocate() on ia64

From: David Chinner <[EMAIL PROTECTED]>
Subject: [PATCH] ia64 fallocate syscall
Cc: "Amit K. Arora" <[EMAIL PROTECTED]>, 
[EMAIL PROTECTED], [EMAIL PROTECTED],
[EMAIL PROTECTED], [EMAIL PROTECTED]

ia64 fallocate syscall support.

Signed-Off-By: Dave Chinner <[EMAIL PROTECTED]>

---
 arch/ia64/kernel/entry.S  |1 +
 include/asm-ia64/unistd.h |3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

Index: linux-2.6.22-rc1/arch/ia64/kernel/entry.S
===
--- linux-2.6.22-rc1.orig/arch/ia64/kernel/entry.S  2007-05-12 
18:45:56.0 -0700
+++ linux-2.6.22-rc1/arch/ia64/kernel/entry.S   2007-05-15 15:36:48.0 
-0700
@@ -1585,5 +1585,6 @@
data8 sys_getcpu
data8 sys_epoll_pwait   // 1305
data8 sys_utimensat
+   data8 sys_fallocate
 
.org sys_call_table + 8*NR_syscalls // guard against failures to 
increase NR_syscalls
Index: linux-2.6.22-rc1/include/asm-ia64/unistd.h
===
--- linux-2.6.22-rc1.orig/include/asm-ia64/unistd.h 2007-05-12 
18:45:56.0 -0700
+++ linux-2.6.22-rc1/include/asm-ia64/unistd.h  2007-05-15 15:37:51.0 
-0700
@@ -296,6 +296,7 @@
 #define __NR_getcpu1304
 #define __NR_epoll_pwait   1305
 #define __NR_utimensat 1306
+#define __NR_fallocate 1307
 
 #ifdef __KERNEL__
 


-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Software raid0 will crash the file-system, when each disk is 5TB

2007-05-15 Thread Michal Piotrowski


[Ingo, Neil, linux-raid added to CC]

On 16/05/07, Jeff Zheng <[EMAIL PROTECTED]> wrote:

Hi everyone:

We are experiencing problems with software raid0, with very
large disk arrays.
We are using two 3ware disk array controllers, each of them is connected
8 750GB harddrives. And we build a software raid0 on top of that. The
total capacity is 5.5TB+5.5TB=11TB

We use jfs as the file-system, we have a test application that write
data continuously to the disks. After writing 52 10GB files, jfs
crashed. And we are not able to recover it, fsck doesn't recognise it
anymore.
We then tried xfs, same application, lasted a little longer, but gives
kernel crash later.

We then reconfigured the hardware array, this time we configured two
disk array from each controller, than we have 4 disk arrays, each of
them have 4 750GB harddrives. Than build a new software raid0 on top of
that. Total capacity is still the same, but 2.75T+2.75T+2.75T+2.75T=11T.

This time we managed to fill the whole 11T data without problem, we are
still doing validation on all 11TB of data written to the disks.

It happened on 2.6.20 and 2.6.13.

So I think the problem is in the way on software raid handling very
large disk, maybe a integer overflow or something. I've searched on the
web, only find another guy complaining the same thing on the xfs mailing
list.

Anybody have a clue?


Jeff
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Regards,
Michal

--
Michal K. K. Piotrowski
Kernel Monkeys
(http://kernel.wikidot.com/start)
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] LogFS take three

2007-05-15 Thread Albert Cahalan


Please don't forget the immutable bit. ("man lsattr")
Having both, BSD-style, would be even better.
The immutable bit is important for working around
software bugs and "features" that damage files.

I also can't find xattr support.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Software raid0 will crash the file-system, when each disk is 5TB

2007-05-15 Thread Jeff Zheng

Hi everyone:

We are experiencing problems with software raid0, with very
large disk arrays.
We are using two 3ware disk array controllers, each of them is connected
8 750GB harddrives. And we build a software raid0 on top of that. The
total capacity is 5.5TB+5.5TB=11TB

We use jfs as the file-system, we have a test application that write
data continuously to the disks. After writing 52 10GB files, jfs
crashed. And we are not able to recover it, fsck doesn't recognise it
anymore.
We then tried xfs, same application, lasted a little longer, but gives
kernel crash later.

We then reconfigured the hardware array, this time we configured two
disk array from each controller, than we have 4 disk arrays, each of
them have 4 750GB harddrives. Than build a new software raid0 on top of
that. Total capacity is still the same, but 2.75T+2.75T+2.75T+2.75T=11T.

This time we managed to fill the whole 11T data without problem, we are
still doing validation on all 11TB of data written to the disks.

It happened on 2.6.20 and 2.6.13.

So I think the problem is in the way on software raid handling very
large disk, maybe a integer overflow or something. I've searched on the
web, only find another guy complaining the same thing on the xfs mailing
list.

Anybody have a clue?


Jeff
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/2] AFS: Implement shared-writable mmap

2007-05-15 Thread Andrew Morton

On Tue, 15 May 2007 16:52:31 +0100
David Howells <[EMAIL PROTECTED]> wrote:

> Implement shared-writable mmap for AFS.

This blows up in -mm:

fs/afs/file.c:59: error: 'filemap_nopage' undeclared here (not in a function)
fs/afs/file.c:60: error: unknown field 'populate' specified in initializer
fs/afs/file.c:60: error: 'filemap_populate' undeclared here (not in a function)

because Nick went and renamed half the VM and deleted the other half.

I need to work out what to do with

mm-fix-fault-vs-invalidate-race-for-linear-mappings.patch
mm-merge-populate-and-nopage-into-fault-fixes-nonlinear.patch
mm-merge-populate-and-nopage-into-fault-fixes-nonlinear-doc-fix.patch
mm-merge-populate-and-nopage-into-fault-fixes-nonlinear-fix.patch
mm-merge-nopfn-into-fault.patch
convert-hugetlbfs-to-use-vm_ops-fault.patch
mm-remove-legacy-cruft.patch
mm-debug-check-for-the-fault-vs-invalidate-race.patch
mm-fix-clear_page_dirty_for_io-vs-fault-race.patch

Probably merge them, I guess.  Hugh had concerns, I think over small
additional overhead from the lock_page()?

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] LogFS take three

2007-05-15 Thread Andrew Morton

On Tue, 15 May 2007 17:19:20 +0200
J__rn Engel <[EMAIL PROTECTED]> wrote:

> Add LogFS, a scalable flash filesystem.
>
> ...
>
>  
> +config LOGFS
> + tristate "Log Filesystem (EXPERIMENTAL)"
> + depends on EXPERIMENTAL
> + select ZLIB_INFLATE
> + select ZLIB_DEFLATE
> + help
> +   Flash filesystem aimed to scale efficiently to large devices.
> +   In comparison to JFFS2 it offers significantly faster mount
> +   times and potentially less RAM usage, although the latter has
> +   not been measured yet.
> +
> +   In its current state it is still very experimental and should
> +   not be used for other than testing purposes.
> +
> +   If unsure, say N.
> +
> +config LOGFS_FSCK
> + bool "Run LogFS fsck at mount time"
> + depends on LOGFS
> + help
> +   Run a full filesystem check on every mount.  If any errors are
> +   found, mounting the filesystem will fail.  This is a debug option
> +   for developers.
> +
> +   If unsure, say N.
> +

No dependency on MTD,

> @@ -0,0 +1,373 @@
> +/*
> + * fs/logfs/logfs.h
> + *
> + * As should be obvious for Linux kernel code, license is GPLv2
> + *
> + * Copyright (c) 2005-2007 Joern Engel
> + *
> + * Private header for logfs.
> + */
> +#ifndef fs_logfs_logfs_h
> +#define fs_logfs_logfs_h
> +
> +#define __CHECK_ENDIAN__
> +
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 

But it includes an MTD header file.

Can this code be tested by people who don't have MTD hardware?  We used to
ahve a fake-mtd-on-a-blockdev thing, whcih was in a state of some
disrepair.  Maybe it got repaired.  Or removed.  I can't immediately locate
it...

It's strange and a bit regrettable that an fs would have dependency on MTD,
really.

> +
> +/**
> + * struct logfs_area - area management information
> + *
> + * @a_sb:the superblock this area belongs to
> + * @a_is_open:   1 if the area is currently open, else 0
> + * @a_segno: segment number of area
> + * @a_used_objects:  number of used objects (XXX: should get removed)
> + * @a_used_bytes:number of used bytes
> + * @a_ops:   area operations (either journal or ostore)
> + * @a_wbuf:  write buffer
> + * @a_erase_count:   erase count
> + * @a_level: GC level
> + */

ooh, documentation.  Quick, merge it!

> +/* memtree.c */
> +void btree_init(struct btree_head *head);
> +void *btree_lookup(struct btree_head *head, long val);
> +int btree_insert(struct btree_head *head, long val, void *ptr);
> +int btree_remove(struct btree_head *head, long val);

These names are too generic.  If we later add a btree library: blam.

> +
> +/* readwrite.c */
> +int logfs_inode_read(struct inode *inode, void *buf, size_t n, loff_t _pos);
> +int logfs_inode_write(struct inode *inode, const void *buf, size_t n,
> + loff_t pos);

It's a bit rude stealing the logfs* namespace, but I guess you got there
first ;)

> +int logfs_readpage_nolock(struct page *page);
> +int logfs_write_buf(struct inode *inode, pgoff_t index, void *buf);
> +int logfs_delete(struct inode *inode, pgoff_t index);
> +int logfs_rewrite_block(struct inode *inode, pgoff_t index, u64 ofs, int 
> level);
> +int logfs_is_valid_block(struct super_block *sb, u64 ofs, u64 ino, u64 pos);
> +void logfs_truncate(struct inode *inode);
> +u64 logfs_seek_data(struct inode *inode, u64 pos);
> +
> +int logfs_init_rw(struct logfs_super *super);
> +void logfs_cleanup_rw(struct logfs_super *super);
> +
> +/* segment.c */
> +int logfs_erase_segment(struct super_block *sb, u32 ofs);
> +int wbuf_read(struct super_block *sb, u64 ofs, size_t len, void *buf);
> +int logfs_segment_read(struct super_block *sb, void *buf, u64 ofs);
> +s64 logfs_segment_write(struct inode *inode, void *buf, u64 pos, int level,
> + int alloc);
> +int logfs_segment_delete(struct inode *inode, u64 ofs, u64 pos, int level);
> +void logfs_set_blocks(struct inode *inode, u64 no);
> +void __logfs_set_blocks(struct inode *inode);
> +/* area handling */
> +int logfs_init_areas(struct super_block *sb);
> +void logfs_cleanup_areas(struct logfs_super *super);
> +int logfs_open_area(struct logfs_area *area);
> +void logfs_close_area(struct logfs_area *area);
> +
> +/* super.c */
> +int mtdread(struct super_block *sb, loff_t ofs, size_t len, void *buf);
> +int mtdwrite(struct super_block *sb, loff_t ofs, size_t len, void *buf);
> +int mtderase(struct super_block *sb, loff_t ofs, size_t len);
> +void logfs_crash_dump(struct super_block *sb);
> +int all_ff(void *buf, size_t len);
> +int logfs_statfs(struct dentry *dentry, struct kstatfs *stats);

Have you checked that all of this needs global scope?

> +
> +/* progs/fsck.c */
> +#ifdef CONFIG_LOGFS_FSCK
> +int logfs_fsck(struct super_block *sb);
> +#else
> +#define logfs_fsck(sb) ({ 0; })

static inline int logfs_fsck(struct

[PATCH 5/5][TAKE3] ext4: write support for preallocated blocks

2007-05-15 Thread Amit K. Arora

This patch adds write support to the uninitialized extents that get
created when a preallocation is done using fallocate(). It takes care of
splitting the extents into multiple (upto three) extents and merging the
new split extents with neighbouring ones, if possible.

Changelog:
-
Note: The changes below are from the initial post (dated 26th April,
2007) and _not_ from TAKE2. The only difference from TAKE2 is the kernel
version on which this patch is based. TAKE2 was based on 2.6.21 and this
is based on 2.6.22-rc1.

 1) Replaced BUG_ON with WARN_ON & ext4_error.
 2) Added variable names to the function declaration of
ext4_ext_try_to_merge().
 3) Updated variable declarations to use multiple-definitions-per-line.
 4) "if((a=foo())).." was broken into "a=foo(); if(a).."
 5) Removed extra spaces.

Here is the updated patch:

Signed-off-by: Amit Arora <[EMAIL PROTECTED]>
---
 fs/ext4/extents.c   |  234 +++-
 include/linux/ext4_fs_extents.h |3 
 2 files changed, 210 insertions(+), 27 deletions(-)

Index: linux-2.6.22-rc1/fs/ext4/extents.c
===
--- linux-2.6.22-rc1.orig/fs/ext4/extents.c
+++ linux-2.6.22-rc1/fs/ext4/extents.c
@@ -1140,6 +1140,54 @@ ext4_can_extents_be_merged(struct inode 
 }
 
 /*
+ * This function tries to merge the "ex" extent to the next extent in the tree.
+ * It always tries to merge towards right. If you want to merge towards
+ * left, pass "ex - 1" as argument instead of "ex".
+ * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns
+ * 1 if they got merged.
+ */
+int ext4_ext_try_to_merge(struct inode *inode,
+ struct ext4_ext_path *path,
+ struct ext4_extent *ex)
+{
+   struct ext4_extent_header *eh;
+   unsigned int depth, len;
+   int merge_done = 0;
+   int uninitialized = 0;
+
+   depth = ext_depth(inode);
+   BUG_ON(path[depth].p_hdr == NULL);
+   eh = path[depth].p_hdr;
+
+   while (ex < EXT_LAST_EXTENT(eh))
+   {
+   if (!ext4_can_extents_be_merged(inode, ex, ex + 1))
+   break;
+   /* merge with next extent! */
+   if (ext4_ext_is_uninitialized(ex))
+   uninitialized = 1;
+   ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex)
+   + ext4_ext_get_actual_len(ex + 1));
+   if (uninitialized)
+   ext4_ext_mark_uninitialized(ex);
+
+   if (ex + 1 < EXT_LAST_EXTENT(eh)) {
+   len = (EXT_LAST_EXTENT(eh) - ex - 1)
+   * sizeof(struct ext4_extent);
+   memmove(ex + 1, ex + 2, len);
+   }
+   eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries) - 1);
+   merge_done = 1;
+   WARN_ON(eh->eh_entries == 0);
+   if (!eh->eh_entries)
+   ext4_error(inode->i_sb, "ext4_ext_try_to_merge",
+  "inode#%lu, eh->eh_entries = 0!", inode->i_ino);
+   }
+
+   return merge_done;
+}
+
+/*
  * check if a portion of the "newext" extent overlaps with an
  * existing extent.
  *
@@ -1327,25 +1375,7 @@ has_space:
 
 merge:
/* try to merge extents to the right */
-   while (nearex < EXT_LAST_EXTENT(eh)) {
-   if (!ext4_can_extents_be_merged(inode, nearex, nearex + 1))
-   break;
-   /* merge with next extent! */
-   if (ext4_ext_is_uninitialized(nearex))
-   uninitialized = 1;
-   nearex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(nearex)
-   + ext4_ext_get_actual_len(nearex + 1));
-   if (uninitialized)
-   ext4_ext_mark_uninitialized(nearex);
-
-   if (nearex + 1 < EXT_LAST_EXTENT(eh)) {
-   len = (EXT_LAST_EXTENT(eh) - nearex - 1)
-   * sizeof(struct ext4_extent);
-   memmove(nearex + 1, nearex + 2, len);
-   }
-   eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1);
-   BUG_ON(eh->eh_entries == 0);
-   }
+   ext4_ext_try_to_merge(inode, path, nearex);
 
/* try to merge extents to the left */
 
@@ -2011,15 +2041,152 @@ void ext4_ext_release(struct super_block
 #endif
 }
 
+/*
+ * This function is called by ext4_ext_get_blocks() if someone tries to write
+ * to an uninitialized extent. It may result in splitting the uninitialized
+ * extent into multiple extents (upto three - one initialized and two
+ * uninitialized).
+ * There are three possibilities:
+ *   a> There is no split required: Entire extent should be initialized
+ *   b> Splits in two extents: Write is happening at either end of the extent
+ *   c> Splits in three extents: S

[PATCH 4/5][TAKE3] ext4: fallocate support in ext4

2007-05-15 Thread Amit K. Arora

This patch implements ->fallocate() inode operation in ext4. With this
patch users of ext4 file systems will be able to use fallocate() system
call for persistent preallocation.

Current implementation only supports preallocation for regular files
(directories not supported as of date) with extent maps. This patch
does not support block-mapped files currently.

Only FA_ALLOCATE mode is being supported as of now. Supporting
FA_DEALLOCATE mode is a  item.

Changelog:
-
Note: The changes below are from the initial post (dated 26th April,
2007) and _not_ from TAKE2. The only difference from TAKE2 is the kernel
version on which this patch is based and point "8)" below.
TAKE2 was based on 2.6.21 and this is based on 2.6.22-rc1.

Here are the changes from the previous post:
 1) Added more description for ext4_fallocate().
 2) Now returning EOPNOTSUPP when files are block-mapped (non-extent).
 3) Moved journal_start & journal_stop inside the while loop.
 4) Replaced BUG_ON with WARN_ON & ext4_error.
 5) Make EXT4_BLOCK_ALIGN use ALIGN macro internally.
 6) Added variable names in the function declaration of ext4_fallocate()
 7) Converted macros that handle uninitialized extents into inline
functions.
 8) Removed unnecessary "EXPORT_SYMBOL(ext4_fallocate);".

Here is the updated patch:

Signed-off-by: Amit Arora <[EMAIL PROTECTED]>
---
 fs/ext4/extents.c   |  240 +---
 fs/ext4/file.c  |1 
 include/linux/ext4_fs.h |8 +
 include/linux/ext4_fs_extents.h |   12 ++
 4 files changed, 220 insertions(+), 41 deletions(-)

Index: linux-2.6.22-rc1/fs/ext4/extents.c
===
--- linux-2.6.22-rc1.orig/fs/ext4/extents.c
+++ linux-2.6.22-rc1/fs/ext4/extents.c
@@ -282,7 +282,7 @@ static void ext4_ext_show_path(struct in
} else if (path->p_ext) {
ext_debug("  %d:%d:%llu ",
  le32_to_cpu(path->p_ext->ee_block),
- le16_to_cpu(path->p_ext->ee_len),
+ ext4_ext_get_actual_len(path->p_ext),
  ext_pblock(path->p_ext));
} else
ext_debug("  []");
@@ -305,7 +305,7 @@ static void ext4_ext_show_leaf(struct in
 
for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ex++) {
ext_debug("%d:%d:%llu ", le32_to_cpu(ex->ee_block),
- le16_to_cpu(ex->ee_len), ext_pblock(ex));
+ ext4_ext_get_actual_len(ex), ext_pblock(ex));
}
ext_debug("\n");
 }
@@ -425,7 +425,7 @@ ext4_ext_binsearch(struct inode *inode, 
ext_debug("  -> %d:%llu:%d ",
le32_to_cpu(path->p_ext->ee_block),
ext_pblock(path->p_ext),
-   le16_to_cpu(path->p_ext->ee_len));
+   ext4_ext_get_actual_len(path->p_ext));
 
 #ifdef CHECK_BINSEARCH
{
@@ -686,7 +686,7 @@ static int ext4_ext_split(handle_t *hand
ext_debug("move %d:%llu:%d in new leaf %llu\n",
le32_to_cpu(path[depth].p_ext->ee_block),
ext_pblock(path[depth].p_ext),
-   le16_to_cpu(path[depth].p_ext->ee_len),
+   ext4_ext_get_actual_len(path[depth].p_ext),
newblock);
/*memmove(ex++, path[depth].p_ext++,
sizeof(struct ext4_extent));
@@ -1106,7 +1106,19 @@ static int
 ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
struct ext4_extent *ex2)
 {
-   if (le32_to_cpu(ex1->ee_block) + le16_to_cpu(ex1->ee_len) !=
+   unsigned short ext1_ee_len, ext2_ee_len;
+
+   /*
+* Make sure that either both extents are uninitialized, or
+* both are _not_.
+*/
+   if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2))
+   return 0;
+
+   ext1_ee_len = ext4_ext_get_actual_len(ex1);
+   ext2_ee_len = ext4_ext_get_actual_len(ex2);
+
+   if (le32_to_cpu(ex1->ee_block) + ext1_ee_len !=
le32_to_cpu(ex2->ee_block))
return 0;
 
@@ -1115,14 +1127,14 @@ ext4_can_extents_be_merged(struct inode 
 * as an RO_COMPAT feature, refuse to merge to extents if
 * this can result in the top bit of ee_len being set.
 */
-   if (le16_to_cpu(ex1->ee_len) + le16_to_cpu(ex2->ee_len) > EXT_MAX_LEN)
+   if (ext1_ee_len + ext2_ee_len > EXT_MAX_LEN)
return 0;
 #ifdef AGGRESSIVE_TEST
if (le16_to_cpu(ex1->ee_len) >= 4)
return 0;
 #endif
 
-   if (ext_pblock(ex1) + le16_to_cpu(ex1->ee_len) == ext_pblock(ex2))
+   if (ext_pblock(ex1) + ext1_ee_len == ext_pblock(ex2))
return 1;

[PATCH 3/5][TAKE3] ext4: Extent overlap bugfix

2007-05-15 Thread Amit K. Arora

This patch adds a check for overlap of extents and cuts short the
new extent to be inserted, if there is a chance of overlap.

Changelog:
-
Note: The changes below are from the initial post (dated 26th April,
2007) and _not_ from TAKE2. The only difference from TAKE2 is the kernel
version on which this patch is based. TAKE2 was based on 2.6.21 and this
is based on 2.6.22-rc1.
As suggested by Andrew, a check for wrap though zero has been added.

Here is the new patch:

Signed-off-by: Amit Arora <[EMAIL PROTECTED]>
---
 fs/ext4/extents.c   |   60 ++--
 include/linux/ext4_fs_extents.h |1 
 2 files changed, 59 insertions(+), 2 deletions(-)

Index: linux-2.6.22-rc1/fs/ext4/extents.c
===
--- linux-2.6.22-rc1.orig/fs/ext4/extents.c
+++ linux-2.6.22-rc1/fs/ext4/extents.c
@@ -1128,6 +1128,55 @@ ext4_can_extents_be_merged(struct inode 
 }
 
 /*
+ * check if a portion of the "newext" extent overlaps with an
+ * existing extent.
+ *
+ * If there is an overlap discovered, it updates the length of the newext
+ * such that there will be no overlap, and then returns 1.
+ * If there is no overlap found, it returns 0.
+ */
+unsigned int ext4_ext_check_overlap(struct inode *inode,
+   struct ext4_extent *newext,
+   struct ext4_ext_path *path)
+{
+   unsigned long b1, b2;
+   unsigned int depth, len1;
+   unsigned int ret = 0;
+
+   b1 = le32_to_cpu(newext->ee_block);
+   len1 = le16_to_cpu(newext->ee_len);
+   depth = ext_depth(inode);
+   if (!path[depth].p_ext)
+   goto out;
+   b2 = le32_to_cpu(path[depth].p_ext->ee_block);
+
+   /*
+* get the next allocated block if the extent in the path
+* is before the requested block(s) 
+*/
+   if (b2 < b1) {
+   b2 = ext4_ext_next_allocated_block(path);
+   if (b2 == EXT_MAX_BLOCK)
+   goto out;
+   }
+
+   /* check for wrap through zero */
+   if (b1 + len1 < b1) {
+   len1 = EXT_MAX_BLOCK - b1;
+   newext->ee_len = cpu_to_le16(len1);
+   ret = 1;
+   }
+
+   /* check for overlap */
+   if (b1 + len1 > b2) {
+   newext->ee_len = cpu_to_le16(b2 - b1);
+   ret = 1;
+   }
+out:
+   return ret;
+}
+
+/*
  * ext4_ext_insert_extent:
  * tries to merge requsted extent into the existing extent or
  * inserts requested extent as new one into the tree,
@@ -2031,7 +2080,15 @@ int ext4_ext_get_blocks(handle_t *handle
 
/* allocate new block */
goal = ext4_ext_find_goal(inode, path, iblock);
-   allocated = max_blocks;
+
+   /* Check if we can really insert (iblock)::(iblock+max_blocks) extent */
+   newex.ee_block = cpu_to_le32(iblock);
+   newex.ee_len = cpu_to_le16(max_blocks);
+   err = ext4_ext_check_overlap(inode, &newex, path);
+   if (err)
+   allocated = le16_to_cpu(newex.ee_len);
+   else
+   allocated = max_blocks;
newblock = ext4_new_blocks(handle, inode, goal, &allocated, &err);
if (!newblock)
goto out2;
@@ -2039,7 +2096,6 @@ int ext4_ext_get_blocks(handle_t *handle
goal, newblock, allocated);
 
/* try to insert new extent into found leaf and return */
-   newex.ee_block = cpu_to_le32(iblock);
ext4_ext_store_pblock(&newex, newblock);
newex.ee_len = cpu_to_le16(allocated);
err = ext4_ext_insert_extent(handle, inode, path, &newex);
Index: linux-2.6.22-rc1/include/linux/ext4_fs_extents.h
===
--- linux-2.6.22-rc1.orig/include/linux/ext4_fs_extents.h
+++ linux-2.6.22-rc1/include/linux/ext4_fs_extents.h
@@ -190,6 +190,7 @@ ext4_ext_invalidate_cache(struct inode *
 
 extern int ext4_extent_tree_init(handle_t *, struct inode *);
 extern int ext4_ext_calc_credits_for_insert(struct inode *, struct 
ext4_ext_path *);
+extern unsigned int ext4_ext_check_overlap(struct inode *, struct ext4_extent 
*, struct ext4_ext_path *);
 extern int ext4_ext_insert_extent(handle_t *, struct inode *, struct 
ext4_ext_path *, struct ext4_extent *);
 extern int ext4_ext_walk_space(struct inode *, unsigned long, unsigned long, 
ext_prepare_callback, void *);
 extern struct ext4_ext_path * ext4_ext_find_extent(struct inode *, int, struct 
ext4_ext_path *);
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 2/5][TAKE3] fallocate() on s390

2007-05-15 Thread Amit K. Arora

This is the patch suggested by Martin Schwidefsky to support
sys_fallocate() on s390(x) platform.

He also suggested a wrapper in glibc to handle this system call on
s390. Posting it here so that we get feedback for this too.

.globl __fallocate
ENTRY(__fallocate)
stm %r6,%r7,28(%r15)/* save %r6/%r7 on stack */
cfi_offset (%r7, -68)
cfi_offset (%r6, -72)
lm  %r6,%r7,96(%r15)/* load loff_t len from stack */
svc SYS_ify(fallocate)
lm  %r6,%r7,28(%r15)/* restore %r6/%r7 from stack */
br  %r14
PSEUDO_END(__fallocate)


Here are the comments and the patch to linux kernel from him.

-
From: Martin Schwidefsky <[EMAIL PROTECTED]>

This patch implements support of fallocate system call on s390(x)
platform. A wrapper is added to address the issue which s390 ABI has
with the arguments of this system call.

Signed-off-by: Martin Schwidefsky <[EMAIL PROTECTED]>
---
 arch/s390/kernel/compat_wrapper.S |   10 ++
 arch/s390/kernel/sys_s390.c   |   29 +
 arch/s390/kernel/syscalls.S   |1 +
 include/asm-s390/unistd.h |3 ++-
 4 files changed, 42 insertions(+), 1 deletion(-)

Index: linux-2.6.22-rc1/arch/s390/kernel/compat_wrapper.S
===
--- linux-2.6.22-rc1.orig/arch/s390/kernel/compat_wrapper.S
+++ linux-2.6.22-rc1/arch/s390/kernel/compat_wrapper.S
@@ -1682,3 +1682,13 @@ compat_sys_utimes_wrapper:
llgtr   %r2,%r2 # char *
llgtr   %r3,%r3 # struct compat_timeval *
jg  compat_sys_utimes
+
+   .globl  sys_fallocate_wrapper
+sys_fallocate_wrapper:
+   lgfr%r2,%r2 # int
+   lgfr%r3,%r3 # int
+   sllg%r4,%r4,32  # get high word of 64bit loff_t
+   lr  %r4,%r5 # get low word of 64bit loff_t
+   sllg%r5,%r6,32  # get high word of 64bit loff_t
+   l   %r5,164(%r15)   # get low word of 64bit loff_t
+   jg  sys_fallocate
Index: linux-2.6.22-rc1/arch/s390/kernel/sys_s390.c
===
--- linux-2.6.22-rc1.orig/arch/s390/kernel/sys_s390.c
+++ linux-2.6.22-rc1/arch/s390/kernel/sys_s390.c
@@ -265,3 +265,32 @@ s390_fadvise64_64(struct fadvise64_64_ar
return -EFAULT;
return sys_fadvise64_64(a.fd, a.offset, a.len, a.advice);
 }
+
+#ifndef CONFIG_64BIT
+/*
+ * This is a wrapper to call sys_fallocate(). For 31 bit s390 the last
+ * 64 bit argument "len" is split into the upper and lower 32 bits. The
+ * system call wrapper in the user space loads the value to %r6/%r7.
+ * The code in entry.S keeps the values in %r2 - %r6 where they are and
+ * stores %r7 to 96(%r15). But the standard C linkage requires that
+ * the whole 64 bit value for len is stored on the stack and doesn't
+ * use %r6 at all. So s390_fallocate has to convert the arguments from
+ *   %r2: fd, %r3: mode, %r4/%r5: offset, %r6/96(%r15)-99(%r15): len
+ * to
+ *   %r2: fd, %r3: mode, %r4/%r5: offset, 96(%r15)-103(%r15): len
+ */
+asmlinkage long s390_fallocate(int fd, int mode, loff_t offset,
+  u32 len_high, u32 len_low)
+{
+   union {
+   u64 len;
+   struct {
+   u32 high;
+   u32 low;
+   };
+   } cv;
+   cv.high = len_high;
+   cv.low = len_low;
+   return sys_fallocate(fd, mode, offset, cv.len);
+}
+#endif
Index: linux-2.6.22-rc1/arch/s390/kernel/syscalls.S
===
--- linux-2.6.22-rc1.orig/arch/s390/kernel/syscalls.S
+++ linux-2.6.22-rc1/arch/s390/kernel/syscalls.S
@@ -322,3 +322,4 @@ NI_SYSCALL  
/* 310 sys_move_pages *
 SYSCALL(sys_getcpu,sys_getcpu,sys_getcpu_wrapper)
 SYSCALL(sys_epoll_pwait,sys_epoll_pwait,compat_sys_epoll_pwait_wrapper)
 SYSCALL(sys_utimes,sys_utimes,compat_sys_utimes_wrapper)
+SYSCALL(s390_fallocate,sys_fallocate,sys_fallocate_wrapper)
Index: linux-2.6.22-rc1/include/asm-s390/unistd.h
===
--- linux-2.6.22-rc1.orig/include/asm-s390/unistd.h
+++ linux-2.6.22-rc1/include/asm-s390/unistd.h
@@ -251,8 +251,9 @@
 #define __NR_getcpu311
 #define __NR_epoll_pwait   312
 #define __NR_utimes313
+#define __NR_fallocate 314
 
-#define NR_syscalls 314
+#define NR_syscalls 315
 
 /* 
  * There are some system calls that are not present on 64 bit, some
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 1/5][TAKE3] fallocate() implementation on i86, x86_64 and powerpc

2007-05-15 Thread Amit K. Arora

This patch implements sys_fallocate() and adds support on i386, x86_64
and powerpc platforms.

Changelog:
-
Note: The changes below are from the initial post (dated 26th April,
2007) and _not_ from TAKE2. The only difference from TAKE2 is the kernel
version on which this patch is based. TAKE2 was based on 2.6.21 and this
is based on 2.6.22-rc1.

Following changes were made to the previous version:
 1) Added description before sys_fallocate() definition.
 2) Return EINVAL for len<=0 (With new draft that Ulrich pointed to,
posix_fallocate should return EINVAL for len <= 0.
 3) Return EOPNOTSUPP if mode is not one of FA_ALLOCATE or FA_DEALLOCATE
 4) Do not return ENODEV for dirs (let individual file systems decide if
they want to support preallocation to directories or not.
 5) Check for wrap through zero.
 6) Update c/mtime if fallocate() succeeds.
 7) Added mode descriptions in fs.h
 8) Added variable names to function definition (fallocate inode op)

Here is the new patch:

Signed-off-by: Amit Arora <[EMAIL PROTECTED]>
---
 arch/i386/kernel/syscall_table.S |1 
 arch/powerpc/kernel/sys_ppc32.c  |7 +++
 arch/x86_64/ia32/ia32entry.S |1 
 fs/open.c|   89 +++
 include/asm-i386/unistd.h|3 -
 include/asm-powerpc/systbl.h |1 
 include/asm-powerpc/unistd.h |3 -
 include/asm-x86_64/unistd.h  |2 
 include/linux/fs.h   |   13 +
 include/linux/syscalls.h |1 
 10 files changed, 119 insertions(+), 2 deletions(-)

Index: linux-2.6.22-rc1/arch/i386/kernel/syscall_table.S
===
--- linux-2.6.22-rc1.orig/arch/i386/kernel/syscall_table.S
+++ linux-2.6.22-rc1/arch/i386/kernel/syscall_table.S
@@ -323,3 +323,4 @@ ENTRY(sys_call_table)
.long sys_signalfd
.long sys_timerfd
.long sys_eventfd
+   .long sys_fallocate
Index: linux-2.6.22-rc1/arch/powerpc/kernel/sys_ppc32.c
===
--- linux-2.6.22-rc1.orig/arch/powerpc/kernel/sys_ppc32.c
+++ linux-2.6.22-rc1/arch/powerpc/kernel/sys_ppc32.c
@@ -773,6 +773,13 @@ asmlinkage int compat_sys_truncate64(con
return sys_truncate(path, (high << 32) | low);
 }
 
+asmlinkage long compat_sys_fallocate(int fd, int mode, u32 offhi, u32 offlo,
+u32 lenhi, u32 lenlo)
+{
+   return sys_fallocate(fd, mode, ((loff_t)offhi << 32) | offlo,
+((loff_t)lenhi << 32) | lenlo);
+}
+
 asmlinkage int compat_sys_ftruncate64(unsigned int fd, u32 reg4, unsigned long 
high,
 unsigned long low)
 {
Index: linux-2.6.22-rc1/fs/open.c
===
--- linux-2.6.22-rc1.orig/fs/open.c
+++ linux-2.6.22-rc1/fs/open.c
@@ -353,6 +353,95 @@ asmlinkage long sys_ftruncate64(unsigned
 #endif
 
 /*
+ * sys_fallocate - preallocate blocks or free preallocated blocks
+ * @fd: the file descriptor
+ * @mode: mode specifies if fallocate should preallocate blocks OR free
+ *   (unallocate) preallocated blocks. Currently only FA_ALLOCATE and
+ *   FA_DEALLOCATE modes are supported.
+ * @offset: The offset within file, from where (un)allocation is being
+ * requested. It should not have a negative value.
+ * @len: The amount (in bytes) of space to be (un)allocated, from the offset.
+ *
+ * This system call, depending on the mode, preallocates or unallocates blocks
+ * for a file. The range of blocks depends on the value of offset and len
+ * arguments provided by the user/application. For FA_ALLOCATE mode, if this
+ * system call succeeds, subsequent writes to the file in the given range
+ * (specified by offset & len) should not fail - even if the file system
+ * later becomes full. Hence the preallocation done is persistent (valid
+ * even after reopen of the file and remount/reboot).
+ *
+ * Note: Incase the file system does not support preallocation,
+ * posix_fallocate() should fall back to the library implementation (i.e.
+ * allocating zero-filled new blocks to the file).
+ *
+ * Return Values
+ * 0   : On SUCCESS a value of zero is returned.
+ * error   : On Failure, an error code will be returned.
+ * An error code of -ENOSYS or -EOPNOTSUPP should make posix_fallocate()
+ * fall back on library implementation of fallocate.
+ *
+ *  Generic fallocate to be added for file systems that do not
+ *  support fallocate it.
+ */
+asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
+{
+   struct file *file;
+   struct inode *inode;
+   long ret = -EINVAL;
+
+   if (offset < 0 || len <= 0)
+   goto out;
+
+   /* Return error if mode is not supported */
+   ret = -EOPNOTSUPP;
+   if (mode != FA_ALLOCATE && mode !=FA_DEALLOCATE)
+   goto out;
+
+   ret = -EBADF;
+   file =

[PATCH 0/5][TAKE3] fallocate system call

2007-05-15 Thread Amit K. Arora

-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
P L E A S EN O T E :
***
1. Patches have been now rebased to 2.6.22-rc1 kernel. Earlier they were
based on 2.6.21.
2. An unnecessary export of symbol is removed from the ext4 preallocate
patch. Details in the corresponding post (PATCH 4/5).
3. Return type now described in the interface description below.
4. Besides above points, everything is exactly same as TAKE2.
-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

This is the new set of patches which take care of the review comments
received from the community (mainly from Andrew).

Description:
---
fallocate() is a new system call being proposed here which will allow
applications to preallocate space to any file(s) in a file system.
Each file system implementation that wants to use this feature will need
to support an inode operation called fallocate.

Applications can use this feature to avoid fragmentation to certain
level and thus get faster access speed. With preallocation, applications
also get a guarantee of space for particular file(s) - even if later the
the system becomes full.

Currently, glibc provides an interface called posix_fallocate() which
can be used for similar cause. Though this has the advantage of working
on all file systems, but it is quite slow (since it writes zeroes to
each block that has to be preallocated). Without a doubt, file systems
can do this more efficiently within the kernel, by implementing
the proposed fallocate() system call. It is expected that
posix_fallocate() will be modified to call this new system call first
and incase the kernel/filesystem does not implement it, it should fall
back to the current implementation of writing zeroes to the new blocks.

Interface:
-
The proposed system call's layout is:

 asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)

fd: The descriptor of the open file.

mode*: This specifies the behavior of the system call. Currently the
  system call supports two modes - FA_ALLOCATE and FA_DEALLOCATE.
  FA_ALLOCATE: Applications can use this mode to preallocate blocks to
a given file (specified by fd). This mode changes the file size if
the preallocation is done beyond the EOF. It also updates the
ctime/mtime in the inode of the corresponding file, marking a
successfull allocation.
  FA_DEALLOCATE: This mode can be used by applications to deallocate the
previously preallocated blocks. This also may change the file size
and the ctime/mtime.
* New modes might get added in future. One such new mode which is
  already under discussion is FA_PREALLOCATE, which when used will
  preallocate space but will not change the filesize and [cm]time.
  Since the semantics of this new mode is not clear and agreed upon yet,
  this patchset does not implement it currently.

offset: This is the offset in bytes, from where the preallocation should
  start.

len: This is the number of bytes requested for preallocation (from
  offset).
 
RETURN VALUE: The system call returns 0 on success and an error on
failure. This is done to keep the semantics same as of
posix_fallocate(). 

sys_fallocate() on s390:
---
There is a problem with s390 ABI to implement sys_fallocate() with the
proposed order of arguments. Martin Schwidefsky has suggested a patch to
solve this problem which makes use of a wrapper in the kernel. This will
require special handling of this system call on s390 in glibc as well.
But, this seems to be the best solution so far.

Known Problem:
-
mmapped writes into uninitialized extents is a known problem with the
current ext4 patches. Like XFS, ext4 may need to implement
->page_mkwrite() to solve this. See:
http://lkml.org/lkml/2007/5/8/583

Since there is a talk of ->fault() replacing ->page_mkwrite() and also
with a generic block_page_mkwrite() implementation already posted, we
can implement this later some time. See:
http://lkml.org/lkml/2007/3/7/161
http://lkml.org/lkml/2007/3/18/198

ToDos:
-
1> Implementation on other architectures (other than i386, x86_64,
ppc64 and s390(x)). David Chinner has already posted a patch for ia64.
2> A generic file system operation to handle fallocate
(generic_fallocate), for filesystems that do _not_ have the fallocate
inode operation implemented.
3> Changes to glibc,
   a) to support fallocate() system call
   b) to make posix_fallocate() and posix_fallocate64() call fallocate()


Changelog:
-
Each post will have an individual changelog for a particular patch.


Following patches follow:
Patch 1/5 : fallocate() implementation on i86, x86_64 and powerpc
Patch 2/5 : fallocate() on s390
Patch 3/5 : ext4: Extent overlap bugfix
Patch 4/5 : ext4: fallocate support in ext4
Patch 5/5 : ext4: write support for preallocated blocks


--
Regards,
Amit Arora
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a m

Re: [PATCH] LogFS take three

2007-05-15 Thread Jörn Engel

On Tue, 15 May 2007 15:07:05 -0400, John Stoffel wrote:
> 
> I've been semi watching this, and the only comment I really can give
> is that I hate the name.  To me, logfs implies a filesystem for
> logging purposes, not for Flash hardware with wear leveling issues to
> be taken into account.

Yeah, well, ...

Two years ago when I started all this, I was looking for a good name.
All I could come up with sounded stupid, so I picked "LogFS" as a code
name.  As soon as I find a better name, the code name should get
replaced.

By now I still don't have anything better.  All alternatives that were
proposed are just as bad - with the added disadvantage of being new and
not established yet.  My hope of ever finding a better name is nearly
zero.

> Also, having scanned through the code, I find the name "cookie" using
> in logfs_inode(), logfs_iput(), logfs_iget() to be badly named.  It
> should really be something like *cached_inode, which would seem to
> give more natural semantics of
> 
>   if (cached_inode)
>   do_cached_inode_ops(...)
>   else
>   do_inode_ops(...)

Half-agreed.  For callers, the name "cookie" makes sense.  It is a
transparent thing they should not tough and hand back unchanged.  For
logfs_iget() and logfs_iput() something like "is_cached" would be
better.

Will change.

Jörn

-- 
Linux [...] existed just for discussion between people who wanted
to show off how geeky they were.
-- Rob Enderle
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] LogFS take three

2007-05-15 Thread Jörn Engel

On Tue, 15 May 2007 20:37:25 +0200, Sam Ravnborg wrote:
> On Tue, May 15, 2007 at 05:19:20PM +0200, Jörn Engel wrote:
> > [ I have put everyone that gave comments to the last patch on Cc:.  Hope
> > that doesn't offend anyone. ]
> > 
> > 
> > Add LogFS, a scalable flash filesystem.
> 
> Have you run sparse on this code?

Several thousand times. :)

> I do not recall if you have written something about it.
> I do not see any obvious things sparse would catch (just browsing quickly)
> but it's always a good thing to do.

Absolutely! I added this line for sparse:
+#define __CHECK_ENDIAN__ 

Not sure how much of the kernel is endian-clean.  Logfs should be.

> +++ linux-2.6.21logfs/fs/logfs/progs/fsck.c 2007-05-15 00:54:22.0 
> +0200
> @@ -0,0 +1,332 @@
> +/*
> + * fs/logfs/prog/fsck.c- filesystem check
> + *
> + * As should be obvious for Linux kernel code, license is GPLv2
> + *
> + * Copyright (c) 2005-2007 Joern Engel
> + *
> + * In principle this could get moved to userspace.  However it might still
> + * make some sense to keep it in the kernel.  It is a pure checker and will
> + * only report problems, not attempt to repair them.
> + */
> +#include "../logfs.h"
> +
> If potential userspace tools needs to include ../logfs.h then there is
> something that ought to be moved to include/linux/logfs.h instead.

There shouldn't be anything left in ../logfs.h that is needed for
userspace.  But the kernel fsck is lame and calls into functions like
iget(), which it pulls in from ../logfs.h.

Jörn

-- 
The only real mistake is the one from which we learn nothing.
-- John Powell
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [2.6.21] circular locking dependency found in QUOTA OFF

2007-05-15 Thread Jan Kara

On Tue 15-05-07 19:52:03, Folkert van Heusden wrote:
> I'm afraid it doesn't compile:
> 
>   CHK include/linux/version.h
>   CHK include/linux/utsrelease.h
>   CHK include/linux/compile.h
>   CC  fs/dquot.o
>   CC  fs/quota.o
> fs/quota.c: In function `quota_sync_sb':
> fs/quota.c:180: error: `I_MUTEX_NESTED' undeclared (first use in this 
> function)
> fs/quota.c:180: error: (Each undeclared identifier is reported only once
> fs/quota.c:180: error: for each function it appears in.)
> make[1]: *** [fs/quota.o] Error 1
> make: *** [fs] Error 2
  Fixed patch follows (I messed up testing as I thought that lockdep is
under config option Debug mutexes but it is not...). I'm sorry for the
stupid mistake. Andrew, please discard my previous patch and use this new
one...

Honza
> On Tue, May 15, 2007 at 04:09:39PM +0200, Jan Kara wrote:
> >   Hi,
> > 
> >   Thanks for report. It seems like a lockdep problem. i_mutex for quota
> > files is below dqonoff_sem. We were already fixing this for filesystem
> > specific quota IO functions but obviously we've missed a few cases. I
> > wonder why it showed up only now... Anyway, attached is a fix. Andrew,
> > would you add it? Thanks.
> > 
> > Honza
> > 
> > On Mon 14-05-07 19:43:18, Michal Piotrowski wrote:
> > > [adding Jan and fsdevel to CC]
> > > 
> > > Hi Folkert,
> > > 
> > > On 14/05/07, Folkert van Heusden <[EMAIL PROTECTED]> wrote:
> > > >Hi,
> > > >
> > > >When I cleanly reboot my pc running 2.6.21 on a P4 with HT and 2GB of ram
> > > >and system on an 1-filesystem IDE disk, I get the following circular
> > > >locking dependency error:
> > > >
> > > >[330961.226405] ===
> > > >[330961.226489] [ INFO: possible circular locking dependency detected ]
> > > >[330961.226531] 2.6.21 #5
> > > >[330961.226569] ---
> > > >[330961.226611] quotaoff/12249 is trying to acquire lock:
> > > >[330961.226652]  (&sb->s_type->i_mutex_key#4){--..}, at: []
> > > >mutex_lock+0x8/0xa
> > > >[330961.226861]
> > > >[330961.226862] but task is already holding lock:
> > > >[330961.226938]  (&s->s_dquot.dqonoff_mutex){--..}, at: []
> > > >mutex_lock+0x8/0xa
> > > >[330961.227111]
> > > >[330961.227111] which lock already depends on the new lock.
> > > >[330961.227112]
> > > >[330961.227225]
> > > >[330961.227225] the existing dependency chain (in reverse order) is:
> > > >[330961.227303]
> > > >[330961.227303] -> #1 (&s->s_dquot.dqonoff_mutex){--..}:
> > > >[330961.227473][] check_prev_add+0x15b/0x281
> > > >[330961.227766][] check_prevs_add+0x8b/0xe8
> > > >[330961.228056][] __lock_acquire+0x692/0xb81
> > > >[330961.228353][] lock_acquire+0x62/0x81
> > > >[330961.228643][] __mutex_lock_slowpath+0x75/0x28c
> > > >[330961.228934][] mutex_lock+0x8/0xa
> > > >[330961.229221][] vfs_quota_on_inode+0xc1/0x25f
> > > >[330961.229513][] vfs_quota_on+0x75/0x79
> > > >[330961.229803][] ext3_quota_on+0x95/0xb0
> > > >[330961.230093][] do_quotactl+0xc9/0x2dd
> > > >[330961.230384][] sys_quotactl+0x84/0xd6
> > > >[330961.230673][] syscall_call+0x7/0xb
> > > >[330961.230963][] 0x
> > > >[330961.231268]
> > > >[330961.231268] -> #0 (&sb->s_type->i_mutex_key#4){--..}:
> > > >[330961.231469][] check_prev_add+0x34/0x281
> > > >[330961.231759][] check_prevs_add+0x8b/0xe8
> > > >[330961.232049][] __lock_acquire+0x692/0xb81
> > > >[330961.232344][] lock_acquire+0x62/0x81
> > > >[330961.232632][] __mutex_lock_slowpath+0x75/0x28c
> > > >[330961.232923][] mutex_lock+0x8/0xa
> > > >[330961.233211][] vfs_quota_off+0x1cf/0x260
> > > >[330961.233500][] do_quotactl+0x29f/0x2dd
> > > >[330961.233792][] sys_quotactl+0x84/0xd6
> > > >[330961.234081][] syscall_call+0x7/0xb
> > > >[330961.234503][] 0x
> > > >[330961.234795]
> > > >[330961.234795] other info that might help us debug this:
> > > >[330961.234796]
> > > >[330961.234908] 2 locks held by quotaoff/12249:
> > > >[330961.234947]  #0:  (&type->s_umount_key#15){}, at: []
> > > >get_super+0x53/0x94
> > > >[330961.235183]  #1:  (&s->s_dquot.dqonoff_mutex){--..}, at: []
> > > >mutex_lock+0x8/0xa
> > > >[330961.235386]
> > > >[330961.235387] stack backtrace:
> > > >[330961.235462]  [] show_trace_log_lvl+0x1a/0x30
> > > >[330961.235535]  [] show_trace+0x12/0x14
> > > >[330961.235606]  [] dump_stack+0x16/0x18
> > > >[330961.235679]  [] print_circular_bug_tail+0x6f/0x71
> > > >[330961.235753]  [] check_prev_add+0x34/0x281
> > > >[330961.235825]  [] check_prevs_add+0x8b/0xe8
> > > >[330961.235897]  [] __lock_acquire+0x692/0xb81
> > > >[330961.235969]  [] lock_acquire+0x62/0x81
> > > >[330961.236041]  [] __mutex_lock_slo

Re: [PATCH] LogFS take three

2007-05-15 Thread Sam Ravnborg

On Tue, May 15, 2007 at 05:19:20PM +0200, Jörn Engel wrote:
> [ I have put everyone that gave comments to the last patch on Cc:.  Hope
> that doesn't offend anyone. ]
> 
> 
> Add LogFS, a scalable flash filesystem.

Have you run sparse on this code?
I do not recall if you have written something about it.
I do not see any obvious things sparse would catch (just browsing quickly)
but it's always a good thing to do.

 
+++ linux-2.6.21logfs/fs/logfs/progs/fsck.c 2007-05-15 00:54:22.0 
+0200
@@ -0,0 +1,332 @@
+/*
+ * fs/logfs/prog/fsck.c- filesystem check
+ *
+ * As should be obvious for Linux kernel code, license is GPLv2
+ *
+ * Copyright (c) 2005-2007 Joern Engel
+ *
+ * In principle this could get moved to userspace.  However it might still
+ * make some sense to keep it in the kernel.  It is a pure checker and will
+ * only report problems, not attempt to repair them.
+ */
+#include "../logfs.h"
+
If potential userspace tools needs to include ../logfs.h then there is
something that ought to be moved to include/linux/logfs.h instead.

Sam
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [2.6.21] circular locking dependency found in QUOTA OFF

2007-05-15 Thread Jan Kara

On Tue 15-05-07 19:52:03, Folkert van Heusden wrote:
> I'm afraid it doesn't compile:
> 
>   CHK include/linux/version.h
>   CHK include/linux/utsrelease.h
>   CHK include/linux/compile.h
>   CC  fs/dquot.o
>   CC  fs/quota.o
> fs/quota.c: In function `quota_sync_sb':
> fs/quota.c:180: error: `I_MUTEX_NESTED' undeclared (first use in this 
> function)
> fs/quota.c:180: error: (Each undeclared identifier is reported only once
> fs/quota.c:180: error: for each function it appears in.)
> make[1]: *** [fs/quota.o] Error 1
> make: *** [fs] Error 2
  Huh, I has compiled just fine for me... Can you send me your .config?
Thanks.

Honza

> On Tue, May 15, 2007 at 04:09:39PM +0200, Jan Kara wrote:
> >   Hi,
> > 
> >   Thanks for report. It seems like a lockdep problem. i_mutex for quota
> > files is below dqonoff_sem. We were already fixing this for filesystem
> > specific quota IO functions but obviously we've missed a few cases. I
> > wonder why it showed up only now... Anyway, attached is a fix. Andrew,
> > would you add it? Thanks.
> > 
> > Honza
> > 
> > On Mon 14-05-07 19:43:18, Michal Piotrowski wrote:
> > > [adding Jan and fsdevel to CC]
> > > 
> > > Hi Folkert,
> > > 
> > > On 14/05/07, Folkert van Heusden <[EMAIL PROTECTED]> wrote:
> > > >Hi,
> > > >
> > > >When I cleanly reboot my pc running 2.6.21 on a P4 with HT and 2GB of ram
> > > >and system on an 1-filesystem IDE disk, I get the following circular
> > > >locking dependency error:
> > > >
> > > >[330961.226405] ===
> > > >[330961.226489] [ INFO: possible circular locking dependency detected ]
> > > >[330961.226531] 2.6.21 #5
> > > >[330961.226569] ---
> > > >[330961.226611] quotaoff/12249 is trying to acquire lock:
> > > >[330961.226652]  (&sb->s_type->i_mutex_key#4){--..}, at: []
> > > >mutex_lock+0x8/0xa
> > > >[330961.226861]
> > > >[330961.226862] but task is already holding lock:
> > > >[330961.226938]  (&s->s_dquot.dqonoff_mutex){--..}, at: []
> > > >mutex_lock+0x8/0xa
> > > >[330961.227111]
> > > >[330961.227111] which lock already depends on the new lock.
> > > >[330961.227112]
> > > >[330961.227225]
> > > >[330961.227225] the existing dependency chain (in reverse order) is:
> > > >[330961.227303]
> > > >[330961.227303] -> #1 (&s->s_dquot.dqonoff_mutex){--..}:
> > > >[330961.227473][] check_prev_add+0x15b/0x281
> > > >[330961.227766][] check_prevs_add+0x8b/0xe8
> > > >[330961.228056][] __lock_acquire+0x692/0xb81
> > > >[330961.228353][] lock_acquire+0x62/0x81
> > > >[330961.228643][] __mutex_lock_slowpath+0x75/0x28c
> > > >[330961.228934][] mutex_lock+0x8/0xa
> > > >[330961.229221][] vfs_quota_on_inode+0xc1/0x25f
> > > >[330961.229513][] vfs_quota_on+0x75/0x79
> > > >[330961.229803][] ext3_quota_on+0x95/0xb0
> > > >[330961.230093][] do_quotactl+0xc9/0x2dd
> > > >[330961.230384][] sys_quotactl+0x84/0xd6
> > > >[330961.230673][] syscall_call+0x7/0xb
> > > >[330961.230963][] 0x
> > > >[330961.231268]
> > > >[330961.231268] -> #0 (&sb->s_type->i_mutex_key#4){--..}:
> > > >[330961.231469][] check_prev_add+0x34/0x281
> > > >[330961.231759][] check_prevs_add+0x8b/0xe8
> > > >[330961.232049][] __lock_acquire+0x692/0xb81
> > > >[330961.232344][] lock_acquire+0x62/0x81
> > > >[330961.232632][] __mutex_lock_slowpath+0x75/0x28c
> > > >[330961.232923][] mutex_lock+0x8/0xa
> > > >[330961.233211][] vfs_quota_off+0x1cf/0x260
> > > >[330961.233500][] do_quotactl+0x29f/0x2dd
> > > >[330961.233792][] sys_quotactl+0x84/0xd6
> > > >[330961.234081][] syscall_call+0x7/0xb
> > > >[330961.234503][] 0x
> > > >[330961.234795]
> > > >[330961.234795] other info that might help us debug this:
> > > >[330961.234796]
> > > >[330961.234908] 2 locks held by quotaoff/12249:
> > > >[330961.234947]  #0:  (&type->s_umount_key#15){}, at: []
> > > >get_super+0x53/0x94
> > > >[330961.235183]  #1:  (&s->s_dquot.dqonoff_mutex){--..}, at: []
> > > >mutex_lock+0x8/0xa
> > > >[330961.235386]
> > > >[330961.235387] stack backtrace:
> > > >[330961.235462]  [] show_trace_log_lvl+0x1a/0x30
> > > >[330961.235535]  [] show_trace+0x12/0x14
> > > >[330961.235606]  [] dump_stack+0x16/0x18
> > > >[330961.235679]  [] print_circular_bug_tail+0x6f/0x71
> > > >[330961.235753]  [] check_prev_add+0x34/0x281
> > > >[330961.235825]  [] check_prevs_add+0x8b/0xe8
> > > >[330961.235897]  [] __lock_acquire+0x692/0xb81
> > > >[330961.235969]  [] lock_acquire+0x62/0x81
> > > >[330961.236041]  [] __mutex_lock_slowpath+0x75/0x28c
> > > >[330961.236113]  [] mutex_lock+0x8/0xa
> > > >[330961.236185]  [] vfs_quota_off+0x1cf/0x260
> > > >[330961.236257]  []

Re: [2.6.21] circular locking dependency found in QUOTA OFF

2007-05-15 Thread Folkert van Heusden

I'm afraid it doesn't compile:

  CHK include/linux/version.h
  CHK include/linux/utsrelease.h
  CHK include/linux/compile.h
  CC  fs/dquot.o
  CC  fs/quota.o
fs/quota.c: In function `quota_sync_sb':
fs/quota.c:180: error: `I_MUTEX_NESTED' undeclared (first use in this function)
fs/quota.c:180: error: (Each undeclared identifier is reported only once
fs/quota.c:180: error: for each function it appears in.)
make[1]: *** [fs/quota.o] Error 1
make: *** [fs] Error 2


On Tue, May 15, 2007 at 04:09:39PM +0200, Jan Kara wrote:
>   Hi,
> 
>   Thanks for report. It seems like a lockdep problem. i_mutex for quota
> files is below dqonoff_sem. We were already fixing this for filesystem
> specific quota IO functions but obviously we've missed a few cases. I
> wonder why it showed up only now... Anyway, attached is a fix. Andrew,
> would you add it? Thanks.
> 
>   Honza
> 
> On Mon 14-05-07 19:43:18, Michal Piotrowski wrote:
> > [adding Jan and fsdevel to CC]
> > 
> > Hi Folkert,
> > 
> > On 14/05/07, Folkert van Heusden <[EMAIL PROTECTED]> wrote:
> > >Hi,
> > >
> > >When I cleanly reboot my pc running 2.6.21 on a P4 with HT and 2GB of ram
> > >and system on an 1-filesystem IDE disk, I get the following circular
> > >locking dependency error:
> > >
> > >[330961.226405] ===
> > >[330961.226489] [ INFO: possible circular locking dependency detected ]
> > >[330961.226531] 2.6.21 #5
> > >[330961.226569] ---
> > >[330961.226611] quotaoff/12249 is trying to acquire lock:
> > >[330961.226652]  (&sb->s_type->i_mutex_key#4){--..}, at: []
> > >mutex_lock+0x8/0xa
> > >[330961.226861]
> > >[330961.226862] but task is already holding lock:
> > >[330961.226938]  (&s->s_dquot.dqonoff_mutex){--..}, at: []
> > >mutex_lock+0x8/0xa
> > >[330961.227111]
> > >[330961.227111] which lock already depends on the new lock.
> > >[330961.227112]
> > >[330961.227225]
> > >[330961.227225] the existing dependency chain (in reverse order) is:
> > >[330961.227303]
> > >[330961.227303] -> #1 (&s->s_dquot.dqonoff_mutex){--..}:
> > >[330961.227473][] check_prev_add+0x15b/0x281
> > >[330961.227766][] check_prevs_add+0x8b/0xe8
> > >[330961.228056][] __lock_acquire+0x692/0xb81
> > >[330961.228353][] lock_acquire+0x62/0x81
> > >[330961.228643][] __mutex_lock_slowpath+0x75/0x28c
> > >[330961.228934][] mutex_lock+0x8/0xa
> > >[330961.229221][] vfs_quota_on_inode+0xc1/0x25f
> > >[330961.229513][] vfs_quota_on+0x75/0x79
> > >[330961.229803][] ext3_quota_on+0x95/0xb0
> > >[330961.230093][] do_quotactl+0xc9/0x2dd
> > >[330961.230384][] sys_quotactl+0x84/0xd6
> > >[330961.230673][] syscall_call+0x7/0xb
> > >[330961.230963][] 0x
> > >[330961.231268]
> > >[330961.231268] -> #0 (&sb->s_type->i_mutex_key#4){--..}:
> > >[330961.231469][] check_prev_add+0x34/0x281
> > >[330961.231759][] check_prevs_add+0x8b/0xe8
> > >[330961.232049][] __lock_acquire+0x692/0xb81
> > >[330961.232344][] lock_acquire+0x62/0x81
> > >[330961.232632][] __mutex_lock_slowpath+0x75/0x28c
> > >[330961.232923][] mutex_lock+0x8/0xa
> > >[330961.233211][] vfs_quota_off+0x1cf/0x260
> > >[330961.233500][] do_quotactl+0x29f/0x2dd
> > >[330961.233792][] sys_quotactl+0x84/0xd6
> > >[330961.234081][] syscall_call+0x7/0xb
> > >[330961.234503][] 0x
> > >[330961.234795]
> > >[330961.234795] other info that might help us debug this:
> > >[330961.234796]
> > >[330961.234908] 2 locks held by quotaoff/12249:
> > >[330961.234947]  #0:  (&type->s_umount_key#15){}, at: []
> > >get_super+0x53/0x94
> > >[330961.235183]  #1:  (&s->s_dquot.dqonoff_mutex){--..}, at: []
> > >mutex_lock+0x8/0xa
> > >[330961.235386]
> > >[330961.235387] stack backtrace:
> > >[330961.235462]  [] show_trace_log_lvl+0x1a/0x30
> > >[330961.235535]  [] show_trace+0x12/0x14
> > >[330961.235606]  [] dump_stack+0x16/0x18
> > >[330961.235679]  [] print_circular_bug_tail+0x6f/0x71
> > >[330961.235753]  [] check_prev_add+0x34/0x281
> > >[330961.235825]  [] check_prevs_add+0x8b/0xe8
> > >[330961.235897]  [] __lock_acquire+0x692/0xb81
> > >[330961.235969]  [] lock_acquire+0x62/0x81
> > >[330961.236041]  [] __mutex_lock_slowpath+0x75/0x28c
> > >[330961.236113]  [] mutex_lock+0x8/0xa
> > >[330961.236185]  [] vfs_quota_off+0x1cf/0x260
> > >[330961.236257]  [] do_quotactl+0x29f/0x2dd
> > >[330961.236330]  [] sys_quotactl+0x84/0xd6
> > >[330961.236402]  [] syscall_call+0x7/0xb
> > >[330961.236473]  ===
> > >
> > 
> > Is this a 2.6.21 regression?
> > 
> > Regards,
> > Michal
> > 
> > -- 
> > Michal K. K. Piotrowski
> > Kernel Monkeys
> > (http://kernel.wikidot.com/start)
> -- 
> Jan Kara <[EMAIL PROTECTED]>
> SuSE CR Labs

> i_mutex on quota files is

[PATCH] AFS: Write back dirty data on unmount

2007-05-15 Thread David Howells

Fix AFS to write back dirty on unmounting.  This didn't happen because
afs_super_ops.drop_inode was pointing to generic_delete_inode.  Now this
pointer is left set to NULL so that the default behaviour occurs instead.

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 fs/afs/super.c |1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/fs/afs/super.c b/fs/afs/super.c
index 579af63..370cecc 100644
--- a/fs/afs/super.c
+++ b/fs/afs/super.c
@@ -47,7 +47,6 @@ struct file_system_type afs_fs_type = {
 static const struct super_operations afs_super_ops = {
.statfs = afs_statfs,
.alloc_inode= afs_alloc_inode,
-   .drop_inode = generic_delete_inode,
.write_inode= afs_write_inode,
.destroy_inode  = afs_destroy_inode,
.clear_inode= afs_clear_inode,

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 2/2] AFS: Implement shared-writable mmap

2007-05-15 Thread David Howells

Implement shared-writable mmap for AFS.

The key with which to access the file is obtained from the VMA at the point
where the PTE is made writable by the page_mkwrite() VMA op and cached in the
affected page.

If there's an outstanding write on the page made with a different key, then
page_mkwrite() will flush it before attaching a record of the new key.

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 fs/afs/file.c |   22 --
 fs/afs/internal.h |1 +
 fs/afs/write.c|   28 
 3 files changed, 49 insertions(+), 2 deletions(-)

diff --git a/fs/afs/file.c b/fs/afs/file.c
index 9c0e721..da2a18b 100644
--- a/fs/afs/file.c
+++ b/fs/afs/file.c
@@ -22,6 +22,7 @@ static int afs_readpage(struct file *file, struct page *page);
 static void afs_invalidatepage(struct page *page, unsigned long offset);
 static int afs_releasepage(struct page *page, gfp_t gfp_flags);
 static int afs_launder_page(struct page *page);
+static int afs_mmap(struct file *file, struct vm_area_struct *vma);
 
 const struct file_operations afs_file_operations = {
.open   = afs_open,
@@ -31,7 +32,7 @@ const struct file_operations afs_file_operations = {
.write  = do_sync_write,
.aio_read   = generic_file_aio_read,
.aio_write  = afs_file_write,
-   .mmap   = generic_file_readonly_mmap,
+   .mmap   = afs_mmap,
.sendfile   = generic_file_sendfile,
.fsync  = afs_fsync,
 };
@@ -54,6 +55,12 @@ const struct address_space_operations afs_fs_aops = {
.writepages = afs_writepages,
 };
 
+static struct vm_operations_struct afs_file_vm_ops = {
+   .nopage = filemap_nopage,
+   .populate   = filemap_populate,
+   .page_mkwrite   = afs_page_mkwrite,
+};
+
 /*
  * open an AFS file or directory and attach a key to it
  */
@@ -266,7 +273,6 @@ static void afs_invalidatepage(struct page *page, unsigned 
long offset)
 static int afs_launder_page(struct page *page)
 {
_enter("{%lu}", page->index);
-
return 0;
 }
 
@@ -293,3 +299,15 @@ static int afs_releasepage(struct page *page, gfp_t 
gfp_flags)
_leave(" = 0");
return 0;
 }
+
+/*
+ * memory map part of an AFS file
+ */
+static int afs_mmap(struct file *file, struct vm_area_struct *vma)
+{
+   _enter("");
+
+   file_accessed(file);
+   vma->vm_ops = &afs_file_vm_ops;
+   return 0;
+}
diff --git a/fs/afs/internal.h b/fs/afs/internal.h
index 61038da..141e073 100644
--- a/fs/afs/internal.h
+++ b/fs/afs/internal.h
@@ -710,6 +710,7 @@ extern ssize_t afs_file_write(struct kiocb *, const struct 
iovec *,
  unsigned long, loff_t);
 extern int afs_writeback_all(struct afs_vnode *);
 extern int afs_fsync(struct file *, struct dentry *, int);
+extern int afs_page_mkwrite(struct vm_area_struct *, struct page *);
 
 
 /*/
diff --git a/fs/afs/write.c b/fs/afs/write.c
index a03b92a..6ef818d 100644
--- a/fs/afs/write.c
+++ b/fs/afs/write.c
@@ -174,6 +174,8 @@ static int afs_prepare_page(struct afs_vnode *vnode, struct 
page *page,
  * prepare to perform part of a write to a page
  * - the caller holds the page locked, preventing it from being written out or
  *   modified by anyone else
+ * - may be called from afs_page_mkwrite() to set up a page for modification
+ *   through shared-writable mmap
  */
 int afs_prepare_write(struct file *file, struct page *page,
  unsigned offset, unsigned to)
@@ -825,3 +827,29 @@ int afs_fsync(struct file *file, struct dentry *dentry, 
int datasync)
_leave(" = %d", ret);
return ret;
 }
+
+/*
+ * notification that a previously read-only page is about to become writable
+ * - if it returns an error, the caller will deliver a bus error signal
+ *
+ * we use this to make a record of the key with which the writeback should be
+ * performed and to flush any outstanding writes made with a different key
+ *
+ * the key to be used is attached to the file pinned by the VMA
+ */
+int afs_page_mkwrite(struct vm_area_struct *vma, struct page *page)
+{
+   struct afs_vnode *vnode = AFS_FS_I(vma->vm_file->f_mapping->host);
+   struct key *key = vma->vm_file->private_data;
+   int ret;
+
+   _enter("{{%x:%u},%x},{%lx}",
+  vnode->fid.vid, vnode->fid.vnode, key_serial(key), page->index);
+
+   lock_page(page);
+   ret = afs_prepare_write(vma->vm_file, page, 0, 0);
+   unlock_page(page);
+
+   _leave(" = %d", ret);
+   return ret;
+}

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 1/2] AFS: Fix afs_prepare_write()

2007-05-15 Thread David Howells

afs_prepare_write() should not mark a page up to date if it only partially
fills it in, in expectation of the caller filling in the rest prior to calling
commit_write().  commit_write(), however, should mark the page up to date.

Signed-off-by: David Howells <[EMAIL PROTECTED]>
---

 fs/afs/write.c |3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/fs/afs/write.c b/fs/afs/write.c
index 28f3751..a03b92a 100644
--- a/fs/afs/write.c
+++ b/fs/afs/write.c
@@ -206,7 +206,6 @@ int afs_prepare_write(struct file *file, struct page *page,
_leave(" = %d [prep]", ret);
return ret;
}
-   SetPageUptodate(page);
}
 
 try_again:
@@ -311,8 +310,8 @@ int afs_commit_write(struct file *file, struct page *page,
spin_unlock(&vnode->writeback_lock);
}
 
+   SetPageUptodate(page);
set_page_dirty(page);
-
if (PageDirty(page))
_debug("dirtied");
 

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Review status (Re: [PATCH] LogFS take three)

2007-05-15 Thread Jörn Engel

Most of my homework is done.  There are six items left plus another five
I believe should not get changed.


Changed:
o Kconfig description updated
o Spaces converted to tabs in Makefile
o Sorted/seperated #includes
o structures are __packed instead of packed, #define removed
o removed TRACE()
o 32bit magic is u instead of ull now
o #defines aligned
o Removed obsolete #defines, documented rest
o LOGFS_SUPER and LOGFS_INODE are inline functions now
o removed alloc_disk_sum and free_disk_sum
o moved dir_callback up
o removed pointless cast from max() parameter
o moved LOGFS_LINK_MAX to header
o license headers
o removed logfs_writepage()
o introduced and documented LOGFS_SEGMENT_RESERVE
o documented completion dance in mtderase()
o removed isbad() check from mtderase()
o removed logfs_device_getpage() and logfs_cached_read()
o resurrected code to pick correct device
o removed beXX typedefs
o removed safe_read()
o removed wrapper in logfs_cleanup_gc()
o device_read() returns error
o renamed logfs_alloc_blocks() to match current implementation
o kerneldoc comments for structures
o removed logfs_compress_none and logfs_uncompress_none
o Split header
o move fsck behind debug option


Changed many times (and I sincerely hope not to have missed any):
o removed // comments, usually dead code
o added newline between variable declarations and code
o moved tail comments
o minor code cleanups
o changed comment style
o added loglevels to printk


Unchanged:
o enum for OBJ_TOP_JOURNAL and friends
  Afaics all the potential enums end up being converted through cpu_to_64XX,
  I'll have to see if that causes more problems than the enums can hope to
  solve.
o generic function for logfs_type()
o generic helper for logfs_prepare_write()
o removed EOF
o error handling
o move rootdir generation back to mkfs


Won't happen (unless I get convinced to do otherwise):
o lowercase LOGFS_SUPER and LOGFS_INODE
  These simply match the common pattern used is many Linux filesystems.
o remove "extern"
  For structures, this keyword actually serves a purpose and makes sense.
  On the other hand, it is completely bogus for functions and I won't be
  bullied into adding bogus crud either.
o Move fs/logfs/Locking to Documentation/
  At least JFFS2 has such documentation in its own directory as well.
  I can see good arguments for either side and no strong consensus.  While
  I ultimately just don't case, doing nothing is a good option in cases
  of great uncertainty.
o change (void*) to "proper" cast
  Neither variant is much better than the other.  I'm open for suggestions
  to completely remove the casts, though.
o Change LOGFS_BUG() and LOGFS_BUG_ON() to inline functions
  These are macros for very much the same reasons BUG() and BUG_ON() are.


Jörn

-- 
To my face you have the audacity to advise me to become a thief - the worst
kind of thief that is conceivable, a thief of spiritual things, a thief of
ideas! It is insufferable, intolerable!
-- M. Binet in Scarabouche
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH 13/14] ext3 whiteout support

2007-05-15 Thread Theodore Tso

On Mon, May 14, 2007 at 10:35:47PM +0200, Jan Blunck wrote:
> 
> I don't know. I tried to contact him a few weeks ago but failed.
> Guess, maybe he isn't reading the @thunk.org email anymore which was
> reference in the e2fsprogs source I used.

I do pay more attention to mail sent to the @mit.edu address, but I
don't recall any reservation requests sent to my @thunk.org address
(nope, don't see it; maybe it got caught in my spam filters?  When did
you send it, precisely?)

> Ted,
> from ext2_fs.h I learn that the value 0x0020 is left unused.
> 
> #define EXT2_FEATURE_INCOMPAT_META_BG   0x0010
> #define EXT3_FEATURE_INCOMPAT_EXTENTS   0x0040
> #define EXT4_FEATURE_INCOMPAT_64BIT 0x0080
> 
> Is this intentionally?

I think 0x0020 was used at one point, but I can't remember what it was
used for.  It's almost certainly not in wide use at this point, so
it's probably safe to use it; OTOH it's not like we're running out of
bits at this point.

- Ted

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH 8/14] Union-mount lookup

2007-05-15 Thread Trond Myklebust

On Mon, 2007-05-14 at 15:12 +0530, Bharata B Rao wrote:
> From: Jan Blunck <[EMAIL PROTECTED]>
> Subject: Union-mount lookup
> 
> Modifies the vfs lookup routines to work with union mounted directories.
> 
> The existing lookup routines generally lookup for a pathname only in the
> topmost or given directory. The changed versions of the lookup routines
> search for the pathname in the entire union mounted stack. Also they have been
> modified to setup the union stack during lookup from dcache cache and from
> real_lookup().
> 
> Signed-off-by: Jan Blunck <[EMAIL PROTECTED]>
> Signed-off-by: Bharata B Rao <[EMAIL PROTECTED]>
> ---
>  fs/dcache.c|   16 +
>  fs/namei.c |   78 +-
>  fs/namespace.c |   35 ++
>  fs/union.c |  598 
> +
>  include/linux/dcache.h |   17 +
>  include/linux/namei.h  |4 
>  include/linux/union.h  |   49 
>  7 files changed, 786 insertions(+), 11 deletions(-)
> 
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -1286,7 +1286,7 @@ struct dentry * d_lookup(struct dentry *
>   return dentry;
>  }
>  
> -struct dentry * __d_lookup(struct dentry * parent, struct qstr * name)
> +struct dentry * __d_lookup_single(struct dentry *parent, struct qstr *name)
>  {
>   unsigned int len = name->len;
>   unsigned int hash = name->hash;
> @@ -1371,6 +1371,20 @@ out:
>   return dentry;
>  }
>  
> +struct dentry * d_lookup_single(struct dentry *parent, struct qstr *name)
> +{
> + struct dentry *dentry;
> + unsigned long seq;
> +
> +do {
> +seq = read_seqbegin(&rename_lock);
> +dentry = __d_lookup_single(parent, name);
> +if (dentry)
> + break;
> + } while (read_seqretry(&rename_lock, seq));
> + return dentry;
> +}
> +
>  /**
>   * d_validate - verify dentry provided from insecure source
>   * @dentry: The dentry alleged to be valid child of @dparent
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -374,6 +374,33 @@ void release_open_intent(struct nameidat
>  }
>  
>  static inline struct dentry *
> +do_revalidate_single(struct dentry *dentry, struct nameidata *nd)
> +{
> + int status = dentry->d_op->d_revalidate(dentry, nd);
> + if (unlikely(status <= 0)) {

d_revalidate() returns a 0 or 1 result, not an error.

> + /*
> +  * The dentry failed validation.
> +  * If d_revalidate returned 0 attempt to invalidate
> +  * the dentry otherwise d_revalidate is asking us
> +  * to return a fail status.
> +  */
> + if (!status) {
> + if (!d_invalidate(dentry)) {
> + __dput_single(dentry);
> + dentry = NULL;
> + }
> + } else {
> + __dput_single(dentry);
> + dentry = ERR_PTR(status);

See above

> + }
> + }
> + return dentry;
> +}
> +
> +/*
> + * FIXME: We need a union aware revalidate here!
> + */
> +static inline struct dentry *
>  do_revalidate(struct dentry *dentry, struct nameidata *nd)
>  {
>   int status = dentry->d_op->d_revalidate(dentry, nd);
> @@ -403,16 +430,16 @@ do_revalidate(struct dentry *dentry, str
>   */
>  static struct dentry * cached_lookup(struct dentry * parent, struct qstr * 
> name, struct nameidata *nd)
>  {
> - struct dentry * dentry = __d_lookup(parent, name);
> + struct dentry *dentry = __d_lookup_single(parent, name);
>  
>   /* lockess __d_lookup may fail due to concurrent d_move() 
>* in some unrelated directory, so try with d_lookup
>*/
>   if (!dentry)
> - dentry = d_lookup(parent, name);
> + dentry = d_lookup_single(parent, name);
>  
>   if (dentry && dentry->d_op && dentry->d_op->d_revalidate)
> - dentry = do_revalidate(dentry, nd);
> + dentry = do_revalidate_single(dentry, nd);
>  
>   return dentry;
>  }
> @@ -465,7 +492,7 @@ ok:
>   * make sure that nobody added the entry to the dcache in the meantime..
>   * SMP-safe
>   */
> -static struct dentry * real_lookup(struct dentry * parent, struct qstr * 
> name, struct nameidata *nd)
> +struct dentry * real_lookup_single(struct dentry *parent, struct qstr *name, 
> struct nameidata *nd)
>  {
>   struct dentry * result;
>   struct inode *dir = parent->d_inode;
> @@ -485,7 +512,7 @@ static struct dentry * real_lookup(struc
>*
>* so doing d_lookup() (with seqlock), instead of lockfree __d_lookup
>*/
> - result = d_lookup(parent, name);
> + result = d_lookup_single(parent, name);
>   if (!result) {
>   struct dentry * dentry = d_alloc(parent, name);
>   result = ERR_PTR(-ENOMEM);
> @@ -506,7 +533,7 @@ static struct dentry * real_lookup(struc
>*/
>   mutex_unlock(&dir->i_mutex);
>   if (result->d_op && resul

Re: [2.6.21] circular locking dependency found in QUOTA OFF

2007-05-15 Thread Jan Kara

  Hi,

  Thanks for report. It seems like a lockdep problem. i_mutex for quota
files is below dqonoff_sem. We were already fixing this for filesystem
specific quota IO functions but obviously we've missed a few cases. I
wonder why it showed up only now... Anyway, attached is a fix. Andrew,
would you add it? Thanks.

Honza

On Mon 14-05-07 19:43:18, Michal Piotrowski wrote:
> [adding Jan and fsdevel to CC]
> 
> Hi Folkert,
> 
> On 14/05/07, Folkert van Heusden <[EMAIL PROTECTED]> wrote:
> >Hi,
> >
> >When I cleanly reboot my pc running 2.6.21 on a P4 with HT and 2GB of ram
> >and system on an 1-filesystem IDE disk, I get the following circular
> >locking dependency error:
> >
> >[330961.226405] ===
> >[330961.226489] [ INFO: possible circular locking dependency detected ]
> >[330961.226531] 2.6.21 #5
> >[330961.226569] ---
> >[330961.226611] quotaoff/12249 is trying to acquire lock:
> >[330961.226652]  (&sb->s_type->i_mutex_key#4){--..}, at: []
> >mutex_lock+0x8/0xa
> >[330961.226861]
> >[330961.226862] but task is already holding lock:
> >[330961.226938]  (&s->s_dquot.dqonoff_mutex){--..}, at: []
> >mutex_lock+0x8/0xa
> >[330961.227111]
> >[330961.227111] which lock already depends on the new lock.
> >[330961.227112]
> >[330961.227225]
> >[330961.227225] the existing dependency chain (in reverse order) is:
> >[330961.227303]
> >[330961.227303] -> #1 (&s->s_dquot.dqonoff_mutex){--..}:
> >[330961.227473][] check_prev_add+0x15b/0x281
> >[330961.227766][] check_prevs_add+0x8b/0xe8
> >[330961.228056][] __lock_acquire+0x692/0xb81
> >[330961.228353][] lock_acquire+0x62/0x81
> >[330961.228643][] __mutex_lock_slowpath+0x75/0x28c
> >[330961.228934][] mutex_lock+0x8/0xa
> >[330961.229221][] vfs_quota_on_inode+0xc1/0x25f
> >[330961.229513][] vfs_quota_on+0x75/0x79
> >[330961.229803][] ext3_quota_on+0x95/0xb0
> >[330961.230093][] do_quotactl+0xc9/0x2dd
> >[330961.230384][] sys_quotactl+0x84/0xd6
> >[330961.230673][] syscall_call+0x7/0xb
> >[330961.230963][] 0x
> >[330961.231268]
> >[330961.231268] -> #0 (&sb->s_type->i_mutex_key#4){--..}:
> >[330961.231469][] check_prev_add+0x34/0x281
> >[330961.231759][] check_prevs_add+0x8b/0xe8
> >[330961.232049][] __lock_acquire+0x692/0xb81
> >[330961.232344][] lock_acquire+0x62/0x81
> >[330961.232632][] __mutex_lock_slowpath+0x75/0x28c
> >[330961.232923][] mutex_lock+0x8/0xa
> >[330961.233211][] vfs_quota_off+0x1cf/0x260
> >[330961.233500][] do_quotactl+0x29f/0x2dd
> >[330961.233792][] sys_quotactl+0x84/0xd6
> >[330961.234081][] syscall_call+0x7/0xb
> >[330961.234503][] 0x
> >[330961.234795]
> >[330961.234795] other info that might help us debug this:
> >[330961.234796]
> >[330961.234908] 2 locks held by quotaoff/12249:
> >[330961.234947]  #0:  (&type->s_umount_key#15){}, at: []
> >get_super+0x53/0x94
> >[330961.235183]  #1:  (&s->s_dquot.dqonoff_mutex){--..}, at: []
> >mutex_lock+0x8/0xa
> >[330961.235386]
> >[330961.235387] stack backtrace:
> >[330961.235462]  [] show_trace_log_lvl+0x1a/0x30
> >[330961.235535]  [] show_trace+0x12/0x14
> >[330961.235606]  [] dump_stack+0x16/0x18
> >[330961.235679]  [] print_circular_bug_tail+0x6f/0x71
> >[330961.235753]  [] check_prev_add+0x34/0x281
> >[330961.235825]  [] check_prevs_add+0x8b/0xe8
> >[330961.235897]  [] __lock_acquire+0x692/0xb81
> >[330961.235969]  [] lock_acquire+0x62/0x81
> >[330961.236041]  [] __mutex_lock_slowpath+0x75/0x28c
> >[330961.236113]  [] mutex_lock+0x8/0xa
> >[330961.236185]  [] vfs_quota_off+0x1cf/0x260
> >[330961.236257]  [] do_quotactl+0x29f/0x2dd
> >[330961.236330]  [] sys_quotactl+0x84/0xd6
> >[330961.236402]  [] syscall_call+0x7/0xb
> >[330961.236473]  ===
> >
> 
> Is this a 2.6.21 regression?
> 
> Regards,
> Michal
> 
> -- 
> Michal K. K. Piotrowski
> Kernel Monkeys
> (http://kernel.wikidot.com/start)
-- 
Jan Kara <[EMAIL PROTECTED]>
SuSE CR Labs
i_mutex on quota files is special. Unlike i_mutexes for other inodes it is
acquired under dqonoff_mutex. Tell lockdep about this lock ranking. Also
comment and code in quota_sync_sb() seem to be bogus (as i_mutex for quota
file can be acquired under dqonoff_mutex). Move truncate_inode_pages()
call under dqonoff_mutex and save some problems with races...

Signed-off-by: Jan Kara <[EMAIL PROTECTED]>

diff -rupX /home/jack/.kerndiffexclude linux-2.6.21/fs/dquot.c linux-2.6.21-1-quota_lockdep/fs/dquot.c
--- linux-2.6.21/fs/dquot.c	2007-05-15 14:18:47.0 +0200
+++ linux-2.6.21-1-quota_lockdep/fs/dquot.c	2007-05-15 14:22:47.0 +0200
@@ -1421,7 +1421,7 @@ int vfs_quota_off(struct super_block *sb
 			/* If quota was reenabled in the meantime, we have
 			 * nothing to

Re: [PATCH 1/5][TAKE2] fallocate() implementation on i86, x86_64 and powerpc

2007-05-15 Thread Amit K. Arora

On Tue, May 15, 2007 at 09:44:36AM +1000, Stephen Rothwell wrote:
> On Mon, 14 May 2007 20:15:24 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> wrote:
> >
> > This patch implements sys_fallocate() and adds support on i386, x86_64
> > and powerpc platforms.
> 
> This patch no longer applies to Linus' tree - for a start there is no file
> arch/x86_64/kernel/functionlist any more.
> 
> Can you rebase it, please?

I will rebase it to 2.6.22-rc1 and repost the patches soon.
Thanks!

--
Regards,
Amit Arora
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH 2/14] Add a new mount flag (MNT_UNION) for union mount

2007-05-15 Thread Jan Blunck

On 5/15/07, Eric Van Hensbergen <[EMAIL PROTECTED]> wrote:

On 5/15/07, Bharata B Rao <[EMAIL PROTECTED]> wrote:
>
> So there can be two cases in union mounts:
> 1. A file exists in topmost layer and also in one or more lower layers. 
Deleting
> the file would result in the top layer file being deleted and a whiteout being
> created in the top layer.
>
> 2. A file exists in one or more of lower layers, but not in the topmost layer.
> Deleting this file would result in just a whiteout being created in the
> topmost layer.
>

I'd imagine there is a third potential option, which I'll admit strays
a bit from the conventional UNIX semantic.

And this is exactly why I designed it as is: NOT to do anything that
anyone can imagine. This is the reason why most of the filesystem
unification approaches fail: too much unnecessary complexity.

If only one layer is
marked as writable, then any changes (including delete) only effect
that layer.  I could imagine this would be useful in situations like
overlaying a sandbox on an otherwise read-only source code tree (you
might want to just get rid of a modification by removing your file and
have it replaced by the original underlying source).

You just unmount the topmost writable layer and replace it by a clean
tmpfs. There you go. Removing a file and not creating a whiteout
totally breaks what the user of the filesystem expects.

BTW: Undoing your changes to source code is solved by many
applications: patch, backup files, version control systems ...

I suppose a further extension would be to have multiple layers marked
as mutable and functions such as delete would effect all mutable
layers, but functions like create would only affect the top mutable
layer.

You want per systemcall policies of what layer is affected??? Crazy
idea but how do you do this without letting the user know of what
layer the file is in the first place. This doesn't work without major
extension of the VFS/syscall interface.

As an aside, perhaps it would be useful to mark the mutable layer at
mount time (instead of having it always be the top layer).  Again this
could lead to some optional non-conventional file system semantics,
but its proven useful in Plan 9 union mount semantics and it seems a
fairly trivial extension to what you currently have.

I don't think so. Plan9 union directory semantic is different. Plan9
is different.

I'll stick to the conventional UNIX semantics since I think that
otherwise it will just make this thing too complex. First we need to
solve the existing problems like union readdir, bind mounts and
whiteout handling. If this is all solved and you are disappointed with
what you can achieve with only the topmost layer writable feel free to
come up with trivial extensions :)

Cheers,
Jan
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/5][TAKE2] fallocate system call

2007-05-15 Thread Amit K. Arora

On Tue, May 15, 2007 at 12:31:21AM -0600, Andreas Dilger wrote:
> On May 14, 2007  18:59 +0530, Amit K. Arora wrote:
> >  asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
> > 
> > fd: The descriptor of the open file.
> > 
> > mode*: This specifies the behavior of the system call. Currently the
> >   system call supports two modes - FA_ALLOCATE and FA_DEALLOCATE.
> >   FA_ALLOCATE: Applications can use this mode to preallocate blocks to
> > a given file (specified by fd). This mode changes the file size if
> > the preallocation is done beyond the EOF. It also updates the
> > ctime/mtime in the inode of the corresponding file, marking a
> > successfull allocation.
> >   FA_DEALLOCATE: This mode can be used by applications to deallocate the
> > previously preallocated blocks. This also may change the file size
> > and the ctime/mtime.
> > * New modes might get added in future. One such new mode which is
> >   already under discussion is FA_PREALLOCATE, which when used will
> >   preallocate space but will not change the filesize and [cm]time.
> >   Since the semantics of this new mode is not clear and agreed upon yet,
> >   this patchset does not implement it currently.
> > 
> > offset: This is the offset in bytes, from where the preallocation should
> >   start.
> > 
> > len: This is the number of bytes requested for preallocation (from
> >   offset).
> 
> What is the return value?  I'd hope it is the number of bytes preallocated,
> in case of interrupted preallocation for whatever reason (interrupt, out of
> space, etc) like a regular write(2) call.  In this case the return type needs
> to also be an loff_t to match @len.

The return value in current implementation has been kept as "long" where
zero is returned for success and an error on failure. This is done to
keep it inline with posix_fallocate behavior.

This point was brought up sometime back by Badari. At that time it was
decided to keep it the way posix_fallocate is designed. Here are the
posts related to this:
http://lkml.org/lkml/2007/3/2/18
http://lkml.org/lkml/2007/3/2/162
http://lkml.org/lkml/2007/3/2/208

Still if you feel that we should be returning number of bytes
preallocated, we can again ask for opinion here.

Thanks!
--
Regards,
Amit Arora
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH 2/14] Add a new mount flag (MNT_UNION) for union mount

2007-05-15 Thread Eric Van Hensbergen


On 5/15/07, Bharata B Rao <[EMAIL PROTECTED]> wrote:


So there can be two cases in union mounts:
1. A file exists in topmost layer and also in one or more lower layers. Deleting
the file would result in the top layer file being deleted and a whiteout being
created in the top layer.

2. A file exists in one or more of lower layers, but not in the topmost layer.
Deleting this file would result in just a whiteout being created in the
topmost layer.



I'd imagine there is a third potential option, which I'll admit strays
a bit from the conventional UNIX semantic.  If only one layer is
marked as writable, then any changes (including delete) only effect
that layer.  I could imagine this would be useful in situations like
overlaying a sandbox on an otherwise read-only source code tree (you
might want to just get rid of a modification by removing your file and
have it replaced by the original underlying source).

I suppose a further extension would be to have multiple layers marked
as mutable and functions such as delete would effect all mutable
layers, but functions like create would only affect the top mutable
layer.

As an aside, perhaps it would be useful to mark the mutable layer at
mount time (instead of having it always be the top layer).  Again this
could lead to some optional non-conventional file system semantics,
but its proven useful in Plan 9 union mount semantics and it seems a
fairly trivial extension to what you currently have.

-eric
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH 13/14] ext3 whiteout support

2007-05-15 Thread Jan Blunck

On 5/15/07, Bharata B Rao <[EMAIL PROTECTED]> wrote:

On Mon, May 14, 2007 at 01:16:57PM -0700, Badari Pulavarty wrote:
> On Mon, 2007-05-14 at 15:14 +0530, Bharata B Rao wrote:
> > From: Bharata B Rao <[EMAIL PROTECTED]>
> >
> > +static int ext3_whiteout(struct inode *dir, struct dentry *dentry)
> > +{
> > +   struct inode * inode;
> > +   int err, retries = 0;
> > +   handle_t *handle;
> > +
> > +retry:
> > +   handle = ext3_journal_start(dir, EXT3_DATA_TRANS_BLOCKS(dir->i_sb) +
> > +   EXT3_INDEX_EXTRA_TRANS_BLOCKS + 3 +
> > +   2*EXT3_QUOTA_INIT_BLOCKS(dir->i_sb));
> > +   if (IS_ERR(handle))
> > +   return PTR_ERR(handle);
> > +
> > +   if (IS_DIRSYNC(dir))
> > +   handle->h_sync = 1;
> > +
> > +   inode = ext3_new_inode (handle, dir, S_IFWHT | S_IRUGO);
> > +   err = PTR_ERR(inode);
> > +   if (IS_ERR(inode))
> > +   goto out_stop;
>
> Don't you need to call init_special_inode() here ?
> Or this is handled somewhere else ?

Whiteout doesn't have any attributes and hence we are not explicitly
doing init_special_inode() on this. Accesses to whiteout files are trapped
at the VFS lookup itself and creation and deletion of whiteouts are handled
automatically by VFS. So I believe init_special_inode() isn't necessary
on a whiteout file.

I added default whiteout file operations. So calling
init_special_inode() seems to make sense.

I know the ext2/ext3 whiteout patches are not really where they should
be. I plan to use a reserved inode number to reflect the case that the
inode itself doesn't have any attributes itself. It makes sense to
have a singleton whiteout inode per superblock.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH 2/14] Add a new mount flag (MNT_UNION) for union mount

2007-05-15 Thread Bharata B Rao

On Mon, May 14, 2007 at 10:38:46PM +0200, Jan Engelhardt wrote:
> 
> On May 14 2007 15:09, Bharata B Rao wrote:
> >
> >Introduce MNT_UNION, MS_UNION and FS_WHT flags. There are the necessary flags
> >for doing
> >
> >mount /dev/hda3 /mnt -o union
> >
> >You need additional patches for util-linux for that to work.
> >
> >Signed-off-by: Jan Blunck <[EMAIL PROTECTED]>
> >Signed-off-by: Bharata B Rao <[EMAIL PROTECTED]>
> >---
> > 
> >+/* Unions couldn't be writable if the filesystem
> >+ * doesn't know about whiteouts */
> >+err = -ENOTSUPP;
> >+if ((mnt_flags & MNT_UNION) &&
> >+!(newmnt->mnt_sb->s_flags & MS_RDONLY) &&
> >+!(newmnt->mnt_sb->s_type->fs_flags & FS_WHT))
> >+goto unlock;
> >+
> 
> Maybe I am too biased towards unionfs/aufs, but if I have an {rw,rw} union
> with whiteouts disabled (delete=all in unionfs speak), then FS_WHT
> does not need to be supported. Your patches do not seem to do
> delete=all semantics, do they?
> 

No, they don't support delete=all semantics.

Correct me if I am wrong, but a key difference from unionfs is that in
union mounts only the topmost layer is writable (and it doesn't matter if
the lower layers are mounted rw), while this not so with unionfs.
Hence when we delete any file which is present in a lower layer, we create
a whiteout for it on the topmost layer to mask it out.

So there can be two cases in union mounts:
1. A file exists in topmost layer and also in one or more lower layers. Deleting
the file would result in the top layer file being deleted and a whiteout being
created in the top layer.

2. A file exists in one or more of lower layers, but not in the topmost layer.
Deleting this file would result in just a whiteout being created in the
topmost layer.

Regards,
Bharata.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH 8/14] Union-mount lookup

2007-05-15 Thread Jan Engelhardt


On May 14 2007 15:12, Bharata B Rao wrote:
> 
>+struct dentry * d_lookup_single(struct dentry *parent, struct qstr *name)
>+{
>+  struct dentry *dentry;
>+  unsigned long seq;
>+
>+do {
>+seq = read_seqbegin(&rename_lock);
>+dentry = __d_lookup_single(parent, name);
>+if (dentry)
>+  break;
>+  } while (read_seqretry(&rename_lock, seq));
>+  return dentry;
>+}

Replace with tabs.

>+lookup_union:
>+  do {
>+  struct vfsmount *mnt = find_mnt(topmost);
>+  UM_DEBUG_DCACHE("name=\"%s\", inode=%p, device=%s\n",
>+  topmost->d_name.name, topmost->d_inode,
>+  mnt->mnt_devname);
>+  mntput(mnt);
>+  } while (0);

Why the extra do{}while? [elsewhere too]

>+  if (topmost->d_union) {
>+  union_lock_spinlock(topmost, &topmost->d_lock);
>+  }

Extra {} could go [elsewhere too].

>+  if (last->d_overlaid
>+  && (last->d_overlaid != dentry)) {

As can these extra () [elsewhere too].

>+static inline struct dentry * __lookup_hash_single(struct qstr *name, struct 
>dentry *base, struct nameidata *nd)
>+{
>+  struct dentry *dentry;
>+  struct inode *inode;
>+  int err;
>+
>+  inode = base->d_inode;
>+
>+  err = permission(inode, MAY_EXEC, nd);
>+  dentry = ERR_PTR(err);
>+  if (err)
>+  goto out;
>+
>+  dentry = __lookup_hash_kern_single(name, base, nd);
>+out:
>+  return dentry;
>+}

This looks a little big for being inline.



Jan
-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH 7/14] Union-mount mounting

2007-05-15 Thread Jan Engelhardt


On May 14 2007 15:11, Bharata B Rao wrote:
>
>TODO: bind and move mounts aren't yet supported with union mounts.

Are the semantics already set?

>@@ -294,6 +294,10 @@ static struct vfsmount *clone_mnt(struct
>   if (!mnt)
>   goto alloc_failed;
> 
>+  /*
>+   * As of now, cloning of union mounted mnt isn't permitted.
>+   */
>+  BUG_ON(mnt->mnt_flags & MNT_UNION);

One, please avoid BUG_ONs. Now I am not sure if clone_mnt is called
as part of kthread creation when CLONE_FS is it. If so, get rid of
this one real fast. Also see chunk "@@ -1031,.. @@ do_loopback"
below.

if(mnt->mnt_flags & MNT_UNION)
goto return_einval;

or something.

>+#ifdef CONFIG_UNION_MOUNT
>+  struct union_info *uinfo = NULL;
>+#endif
> 
>   retval = security_sb_umount(mnt, flags);
>   if (retval)
>@@ -685,6 +696,14 @@ static int do_umount(struct vfsmount *mn
>   }
> 
>   down_write(&namespace_sem);
>+#ifdef CONFIG_UNION_MOUNT
>+  /*
>+   * Grab a reference to the union_info which gets detached
>+   * from the dentries in release_mounts().
>+   */
>+  if (mnt->mnt_flags & MNT_UNION)
>+  uinfo = union_lock_and_get(mnt->mnt_root);
>+#endif
>   spin_lock(&vfsmount_lock);
>   event++;
> 
>@@ -699,6 +718,15 @@ static int do_umount(struct vfsmount *mn
>   security_sb_umount_busy(mnt);
>   up_write(&namespace_sem);
>   release_mounts(&umount_list);
>+#ifdef CONFIG_UNION_MOUNT
>+  if (uinfo) {
>+  if (atomic_read(&uinfo->u_count) == 1)
>+  /* We are the last user of this union_info */
>+  union_release(uinfo);
>+  else
>+  union_put_and_unlock(uinfo);
>+  }
>+#endif
>   return retval;
> }
> 

Is it feasible to do with with some less #if/#endif magic?:

>@@ -1031,6 +1070,15 @@ static int do_loopback(struct nameidata 
>   if (err)
>   return err;
> 
>+  /*
>+   * bind mounting to or from union mounts is not supported
>+   */
>+  err = -EINVAL;
>+  if (nd->mnt->mnt_flags & MNT_UNION)
>+  goto out_unlocked;
>+  if (old_nd.mnt->mnt_flags & MNT_UNION)
>+  goto out_unlocked;
>+

Do the same in clone_mnt.

>   down_write(&namespace_sem);
>   err = -EINVAL;
>   if (IS_MNT_UNBINDABLE(old_nd.mnt))
>@@ -1064,6 +1112,7 @@ static int do_loopback(struct nameidata 
> 
> out:
>   up_write(&namespace_sem);
>+out_unlocked:
>   path_release(&old_nd);
>   return err;
> }

>+++ b/include/linux/fs.h
>@@ -1984,6 +1984,9 @@ static inline ino_t parent_ino(struct de
> /* kernel/fork.c */
> extern int unshare_files(void);
> 
>+/* fs/union.c */
>+#include 
>+
> /* Transaction based IO helpers */
> 
> /*

This raises a big eyebrow. If linux/fs.h can compile without the
inclusion of linux/union.h, do not put linux/union.h in fs.h.

>+#ifdef CONFIG_UNION_MOUNT
>+
>+#include 
>+
>+/* namespace stuff used at mount time */
>+extern void attach_mnt_union(struct vfsmount *, struct nameidata *);
>+extern void detach_mnt_union(struct vfsmount *, struct path *);

You do not need that #include I suppose. Just predeclare the structs.

struct path;
struct vfsmount;
extern void ...

Saves us the "compiler slurps in so many .h files" problem.


Jan
-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH 5/14] Introduce union stack

2007-05-15 Thread Jan Blunck

On 5/14/07, Jan Engelhardt <[EMAIL PROTECTED]> wrote:

>+static inline void union_lock(struct dentry *dentry)
>+{
>+  if (unlikely(dentry && dentry->d_union)) {
>+  struct union_info *ui = dentry->d_union;
>+
>+  UM_DEBUG_LOCK("\"%s\" locking %p (count=%d)\n",
>+dentry->d_name.name, ui,
>+atomic_read(&ui->u_count));
>+  __union_lock(dentry->d_union);
>+  }
>+}
>+
>+static inline void union_unlock(struct dentry *dentry)
>+{
>+  if (unlikely(dentry && dentry->d_union)) {
>+  struct union_info *ui = dentry->d_union;
>+
>+  UM_DEBUG_LOCK("\"%s\" unlocking %p (count=%d)\n",
>+dentry->d_name.name, ui,
>+atomic_read(&ui->u_count));
>+  __union_unlock(dentry->d_union);
>+  }
>+}

Do we really need the unlikely()? d_union may be a new feature,
but it may very well be possible that someone puts the bigger
part of his/her files under a union. And when d_unions get
stable, people will probably begin making their root filesystem
unioned for livecds, and then unlikely() will rather be a
likely penalty. My stance: just
if (dentry != NULL && dentry->d_union != NULL)
This also goes for union_trylock.

Good question. My intention was that since most of the union code
costs performance (stack traversal, readdir) I optimize for the normal
(not unified) case.

>+static inline int union_trylock(struct dentry *dentry)
>+{
>+  int locked = 1;
>+
>+  if (unlikely(dentry && dentry->d_union)) {
>+  UM_DEBUG_LOCK("\"%s\" try locking %p (count=%d)\n",
>+dentry->d_name.name, dentry->d_union,
>+atomic_read(&dentry->d_union->u_count));
>+  BUG_ON(!atomic_read(&dentry->d_union->u_count));
>+  locked = mutex_trylock(&dentry->d_union->u_mutex);
>+  UM_DEBUG_LOCK("\"%s\" trylock %p %s\n", dentry->d_name.name,
>+dentry->d_union,
>+locked ? "succeeded" : "failed");
>+  }
>+  return (locked ? 1 : 0);
>+}

return locked ? 1 : 0
or even
return !!locked;
or since we're just passing up from mutex_trylock:
return locked;
?

Ahh, this seems to be a left-over of the semaphore -> mutex conversion.

>+/*
>+ * This is a *I can't get no sleep* helper

More commonly known as "insomnia". :)

:)

Before I forget this: thank you (and Badari) for reviewing the patches!

Cheers,
Jan
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

50 matches

Mail list logo