Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-27 Thread Valerie Henson
On Fri, Apr 27, 2007 at 12:53:34PM +0200, Jörn Engel wrote:
> 
> All this would get easier if continuation inodes were known to be rare.
> You can ditch the doubly-linked list in favor of a pointer to the main
> inode then - traversing the list again is cheap, after all.  And you can
> just try to read the same block once for every continuation inode.
> 
> If those lists can get long and you need a mapping from offset to
> continuation inode on the medium, you are basically fscked.  Storing the
> mapping requires space.  You need the mapping only when space (in some
> chunk) gets tight and you allocate continuation inodes.  So either you
> don't need the mapping or you don't have a good place to put it.

Any mapping structure will have to be pre-allocated.

> Having a mapping in memory is also questionable.  Either you scan the
> whole file on first access and spend a long time for large files.  Or
> you create the mapping on the fly.  In that case the page cache will
> already give you a 90% solution for free.

So in my secret heart of hearts, I do indeed hope that cnodes are rare
enough that we don't actually have to do anything smart to make them
go fast.  Either having no fast lookup structure or creating it in
memory as needed would be the nicest solution.  However, since I can't
guarantee this will be the case, it's nice to have some idea of what
we'll do if this does become important.

> You should spend a lot of effort trying to minimize cnodes. ;)

Yep.  It's much better to optimize away most cnodes instead of trying
to make them go fast.
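
As a minimal sketch of the "no fast lookup structure" option discussed
above: if cnodes are rare, mapping a file offset to its cnode can be a
plain walk over a short list.  The struct layout and the name
cnode_lookup() are assumptions made for illustration only; they do not
come from the ChunkFS patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory view of one continuation inode (cnode). */
struct cnode {
    uint32_t chunk;       /* chunk that holds this cnode (assumed field) */
    uint64_t start;       /* first file offset covered by this cnode */
    uint64_t len;         /* number of bytes covered */
    struct cnode *next;   /* next cnode of the same file, or NULL */
};

/*
 * Walk the list from the head and return the cnode covering 'offset',
 * or NULL if none does.  The cost is linear in the number of cnodes,
 * which is only acceptable while cnodes stay rare.
 */
static const struct cnode *cnode_lookup(const struct cnode *head,
                                        uint64_t offset)
{
    for (const struct cnode *c = head; c != NULL; c = c->next) {
        if (offset >= c->start && offset - c->start < c->len)
            return c;
    }
    return NULL;
}
```

If the lists did turn out to be long, this walk is the part that a
pre-allocated or on-the-fly mapping would have to replace.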

-VAL
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/5] fallocate system call

2007-04-27 Thread Chris Wedgwood
On Fri, Apr 27, 2007 at 07:46:13PM +0200, Heiko Carstens wrote:

> If one insists on having fd as the first argument, what is wrong with
> having u32 arguments only?

Well, I was one of those who objected as it seems *UGLY* to me.

> It's not that this syscall comes even close to what can be
> considered performance critical...

Right.

> It adds userspace overhead for one architecture. Every *trace and
> *libc needs special handling on s390 for this syscall. I would
> prefer to avoid this.

I'm not that bothered about it.  I would prefer it did use clean
64-bit arguments, but given it's a non-critical syscall I don't
think the aesthetics are worth imposing crud on s390 for.
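
To make the "u32 arguments only" alternative concrete, here is a
minimal sketch of how 64-bit offset and length values could be rebuilt
from 32-bit halves.  The names join_u32() and unpack_range() and the
argument order are hypothetical; this thread does not settle on any
such interface.

```c
#include <stdint.h>

/* Rebuild one 64-bit value from its high and low 32-bit halves. */
static inline int64_t join_u32(uint32_t hi, uint32_t lo)
{
    return (int64_t)(((uint64_t)hi << 32) | lo);
}

/* Hypothetical container for the two 64-bit quantities. */
struct range64 {
    int64_t offset;
    int64_t len;
};

/*
 * Hypothetical unpacking step for a syscall entry point that takes
 * only 32-bit arguments: (off_hi, off_lo, len_hi, len_lo).
 */
static struct range64 unpack_range(uint32_t off_hi, uint32_t off_lo,
                                   uint32_t len_hi, uint32_t len_lo)
{
    struct range64 r;

    r.offset = join_u32(off_hi, off_lo);
    r.len = join_u32(len_hi, len_lo);
    return r;
}
```

The cost is two extra shifts and ORs per call, which fits the point
above that this syscall is not performance critical.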




Re: Ext2/3 block remapping tool

2007-04-27 Thread Andreas Dilger
On Apr 26, 2007  21:29 +0200, Jan Kara wrote:
>   I've been lately playing with remapping ext2/ext3 blocks (especially how
> much it can give us in terms of speed of things like KDE start). For that
> I've written two simple tools (you can get them from
> ftp.suse.com/pub/people/jack/ext3remapper.tar.gz):
>   e2block2file to transform (preparsed) output from blktrace into a list
> of accessed files and offsets accessed
>   e2remapblocks to use output from e2block2file and remap blocks into big
> chunks in the order in which they were accessed.

Does it map the whole file contiguously, or does it interleave blocks of the
file in the order they are accessed?  I would hope that it maps the whole
file contiguously, and let readahead work properly to fetch the whole file.
Also, keeping the file contiguous avoids fragmentation later if that file is
updated, deleted, etc, and conflicts with allocator/defrag/etc.

>   (see README in the tools archive for more details)
> 
>   So far the tools (especially e2remapblocks ;) work on unmounted
> filesystem. The ultimate goal is to be able to do similar things for
> mounted filesystems but I wanted to see whether block remapping is worth it
> and what kernel interfaces would be useful for achieving the goal.

I'd prefer that such functionality be integrated with Takashi's online
defrag tool, since it needs virtually the same functionality.  For that
matter, this is also very similar to the block-mapped -> extents tool
from Aneesh.  It doesn't make sense to have so many separate tools for
users, especially if they start interfering with each other (i.e. defrag
undoes the remapping done by your tool).

>   BTW, the results for KDE startup are as follows:
> The root partition was about 4.8 GB with around 1 GB free. System has
> 1GB mem. All measurements (except for warmcache) were performed after
>   sync; echo 3 >/proc/sys/vm/drop_caches
> 
> Ordinary start: 19.2 20.3 19.5 19.8 19.3; avg. 19.62
> Start with all data cached: 7 7.6 7.3 7.1 7.1; avg. 7.22
> Start with fcache (see thread http://lkml.org/lkml/2006/5/15/46 for details
> on fcache):
>   11.3 11 10.3 10.8 10.6; avg. 10.8
> Start with blocks remapped with e2remapblocks:
>   13.5 15 13 14.5 14.5; avg. 14.1
> (after remapping, data was stored in 20 contiguous extents on disk)



Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.



Re: [PATCH 0/5] fallocate system call

2007-04-27 Thread Heiko Carstens
On Fri, Apr 27, 2007 at 04:43:28PM +0200, Jörn Engel wrote:
> On Fri, 27 April 2007 14:10:03 +0200, Heiko Carstens wrote:
> > 
> > After long discussions where at least two possible implementations
> > were suggested that would work on _all_ architectures you chose one
> > which doesn't and causes extra effort.
> 
> I believe the long discussion also showed that every possible
> implementation has drawbacks.  To me this one appeared to be the best of
> many bad choices.

If one insists on having fd as the first argument, what is wrong with
having u32 arguments only? It's not that this syscall comes even close
to what can be considered performance critical...

> Is this implementation worse than we thought?

It adds userspace overhead for one architecture. Every *trace and
*libc needs special handling on s390 for this syscall. I would
prefer to avoid this.


Re: [patch 37/44] hostfs convert to new aops

2007-04-27 Thread Jeff Dike
On Tue, Apr 24, 2007 at 11:24:23AM +1000, Nick Piggin wrote:
> This also gets rid of a lot of useless read_file stuff. And also
> optimises the full page write case by marking a !uptodate page uptodate.
> 
> Cc: Jeff Dike <[EMAIL PROTECTED]>
> Cc: Linux Filesystems 
> Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>

Looks good.

Acked-by: Jeff Dike <[EMAIL PROTECTED]>


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-27 Thread Jeff Dike
On Thu, Apr 26, 2007 at 09:58:25PM -0700, Valerie Henson wrote:
> Here's an example, spelled out:
> 
> Allocate file 1 in chunk A.
> Grow file 1.
> Chunk A fills up.
> Allocate continuation inode for file 1 in chunk B.
> Chunk A gets some free space.
> Chunk B fills up.
> Pick chunk A for allocating next block of file 1.
> Try to look up a continuation inode for file 1 in chunk A.
> Continuation inode for file 1 found in chunk A!
> Attach newly allocated block to existing inode for file 1 in chunk A.

So far, so good (and the slides are helpful, tx!).  What happens when
file 1 keeps growing and chunk A fills up (and chunk B is still full)?
Can the same continuation inode also point at chunk C, where the file
is going to grow to?
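
For reference, the lookup-and-reuse step in the quoted example can be
sketched roughly as below.  This only models "look for an inode of
this file in the chosen chunk, reuse it if present"; it says nothing
about the chunk C question.  All struct and function names are
hypothetical and are not taken from the ChunkFS code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-chunk inode of a file (main or continuation). */
struct chunk_inode {
    uint32_t chunk;              /* chunk this inode lives in */
    uint64_t nblocks;            /* blocks attached to this inode */
    struct chunk_inode *next;    /* next inode of the same file */
};

/* Hypothetical file: just the list of its per-chunk inodes. */
struct file_inode {
    struct chunk_inode *inodes;
};

/* Return this file's inode in 'chunk', or NULL if there is none. */
static struct chunk_inode *find_in_chunk(struct file_inode *f,
                                         uint32_t chunk)
{
    for (struct chunk_inode *c = f->inodes; c != NULL; c = c->next)
        if (c->chunk == chunk)
            return c;
    return NULL;
}

/*
 * Attach one newly allocated block in 'chunk'.  An existing inode in
 * that chunk is reused; otherwise the caller-supplied 'fresh' inode is
 * initialised and linked in.  Returns the inode that got the block.
 */
static struct chunk_inode *attach_block(struct file_inode *f,
                                        uint32_t chunk,
                                        struct chunk_inode *fresh)
{
    struct chunk_inode *c = find_in_chunk(f, chunk);

    if (c == NULL) {
        fresh->chunk = chunk;
        fresh->nblocks = 0;
        fresh->next = f->inodes;
        f->inodes = fresh;
        c = fresh;
    }
    c->nblocks++;
    return c;
}
```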

Jeff

-- 
Work email - jdike at linux dot intel dot com


Re: [PATCH 0/5] fallocate system call

2007-04-27 Thread Jörn Engel
On Fri, 27 April 2007 14:10:03 +0200, Heiko Carstens wrote:
> 
> After long discussions where at least two possible implementations
> were suggested that would work on _all_ architectures you chose one
> which doesn't and causes extra effort.

I believe the long discussion also showed that every possible
implementation has drawbacks.  To me this one appeared to be the best of
many bad choices.

Is this implementation worse than we thought?

Jörn

-- 
The grand essentials of happiness are: something to do, something to
love, and something to hope for.
-- Allan K. Chalmers


[patch 05/10] unprivileged mounts: allow unprivileged bind mounts

2007-04-27 Thread Miklos Szeredi
From: Miklos Szeredi <[EMAIL PROTECTED]>

Allow bind mounts to unprivileged users if the following conditions are met:

  - mountpoint is not a symlink
  - parent mount is owned by the user
  - the number of user mounts is below the maximum

Unprivileged mounts imply MS_SETUSER, and will also have the "nosuid" and
"nodev" mount flags set.

In particular, if the mounting process doesn't have the CAP_SETUID
capability, then the "nosuid" flag will be added, and if it doesn't have
the CAP_MKNOD capability, then the "nodev" flag will be added.

Signed-off-by: Miklos Szeredi <[EMAIL PROTECTED]>
---

Index: linux/fs/namespace.c
===
--- linux.orig/fs/namespace.c   2007-04-26 13:18:46.0 +0200
+++ linux/fs/namespace.c        2007-04-26 13:30:04.0 +0200
@@ -237,11 +237,34 @@ static void dec_nr_user_mounts(void)
spin_unlock(&vfsmount_lock);
 }
 
-static void set_mnt_user(struct vfsmount *mnt)
+static int reserve_user_mount(void)
+{
+   int err = 0;
+
+   spin_lock(&vfsmount_lock);
+   if (nr_user_mounts >= max_user_mounts && !capable(CAP_SYS_ADMIN))
+   err = -EPERM;
+   else
+   nr_user_mounts++;
+   spin_unlock(&vfsmount_lock);
+   return err;
+}
+
+static void __set_mnt_user(struct vfsmount *mnt)
 {
BUG_ON(mnt->mnt_flags & MNT_USER);
mnt->mnt_uid = current->fsuid;
mnt->mnt_flags |= MNT_USER;
+
+   if (!capable(CAP_SETUID))
+   mnt->mnt_flags |= MNT_NOSUID;
+   if (!capable(CAP_MKNOD))
+   mnt->mnt_flags |= MNT_NODEV;
+}
+
+static void set_mnt_user(struct vfsmount *mnt)
+{
+   __set_mnt_user(mnt);
spin_lock(&vfsmount_lock);
nr_user_mounts++;
spin_unlock(&vfsmount_lock);
@@ -260,10 +283,16 @@ static struct vfsmount *clone_mnt(struct
int flag)
 {
struct super_block *sb = old->mnt_sb;
-   struct vfsmount *mnt = alloc_vfsmnt(old->mnt_devname);
+   struct vfsmount *mnt;
 
+   if (flag & CL_SETUSER) {
+   int err = reserve_user_mount();
+   if (err)
+   return ERR_PTR(err);
+   }
+   mnt = alloc_vfsmnt(old->mnt_devname);
if (!mnt)
-   return ERR_PTR(-ENOMEM);
+   goto alloc_failed;
 
mnt->mnt_flags = old->mnt_flags;
atomic_inc(&sb->s_active);
@@ -275,7 +304,7 @@ static struct vfsmount *clone_mnt(struct
/* don't copy the MNT_USER flag */
mnt->mnt_flags &= ~MNT_USER;
if (flag & CL_SETUSER)
-   set_mnt_user(mnt);
+   __set_mnt_user(mnt);
 
if (flag & CL_SLAVE) {
list_add(&mnt->mnt_slave, &old->mnt_slave_list);
@@ -300,6 +329,11 @@ static struct vfsmount *clone_mnt(struct
spin_unlock(&vfsmount_lock);
}
return mnt;
+
+ alloc_failed:
+   if (flag & CL_SETUSER)
+   dec_nr_user_mounts();
+   return ERR_PTR(-ENOMEM);
 }
 
 static inline void __mntput(struct vfsmount *mnt)
@@ -748,22 +782,26 @@ asmlinkage long sys_oldumount(char __use
 
 #endif
 
-static int mount_is_safe(struct nameidata *nd)
+/*
+ * Conditions for unprivileged mounts are:
+ * - mountpoint is not a symlink
+ * - mountpoint is in a mount owned by the user
+ */
+static bool permit_mount(struct nameidata *nd, int *flags)
 {
+   struct inode *inode = nd->dentry->d_inode;
+
if (capable(CAP_SYS_ADMIN))
-   return 0;
-   return -EPERM;
-#ifdef notyet
-   if (S_ISLNK(nd->dentry->d_inode->i_mode))
-   return -EPERM;
-   if (nd->dentry->d_inode->i_mode & S_ISVTX) {
-   if (current->uid != nd->dentry->d_inode->i_uid)
-   return -EPERM;
-   }
-   if (vfs_permission(nd, MAY_WRITE))
-   return -EPERM;
-   return 0;
-#endif
+   return true;
+
+   if (S_ISLNK(inode->i_mode))
+   return false;
+
+   if (!is_mount_owner(nd->mnt, current->fsuid))
+   return false;
+
+   *flags |= MS_SETUSER;
+   return true;
 }
 
 static int lives_below_in_same_fs(struct dentry *d, struct dentry *dentry)
@@ -987,9 +1025,10 @@ static int do_loopback(struct nameidata 
int clone_flags;
struct nameidata old_nd;
struct vfsmount *mnt = NULL;
-   int err = mount_is_safe(nd);
-   if (err)
-   return err;
+   int err;
+
+   if (!permit_mount(nd, &flags))
+   return -EPERM;
if (!old_name || !*old_name)
return -EINVAL;
err = path_lookup(old_name, LOOKUP_FOLLOW, &old_nd);

--


[patch 08/10] unprivileged mounts: allow unprivileged fuse mounts

2007-04-27 Thread Miklos Szeredi
From: Miklos Szeredi <[EMAIL PROTECTED]>

Use FS_SAFE for "fuse" fs type, but not for "fuseblk".

FUSE was designed from the beginning to be safe for unprivileged users.  This
has also been verified in practice over many years.  In addition unprivileged
mounts require the parent mount to be owned by the user, which is more strict
than the current userspace policy.

This will enable future installations to remove the suid-root fusermount
utility.

Don't require the "user_id=" and "group_id=" options for unprivileged mounts,
but if they are present, verify them for sanity.

Disallow the "allow_other" option for unprivileged mounts.

Signed-off-by: Miklos Szeredi <[EMAIL PROTECTED]>
---

Index: linux/fs/fuse/inode.c
===
--- linux.orig/fs/fuse/inode.c  2007-04-26 13:07:11.0 +0200
+++ linux/fs/fuse/inode.c   2007-04-26 13:07:33.0 +0200
@@ -311,6 +311,19 @@ static int parse_fuse_opt(char *opt, str
d->max_read = ~0;
d->blksize = 512;
 
+   /*
+* For unprivileged mounts use current uid/gid.  Still allow
+* "user_id" and "group_id" options for compatibility, but
+* only if they match these values.
+*/
+   if (!capable(CAP_SYS_ADMIN)) {
+   d->user_id = current->uid;
+   d->user_id_present = 1;
+   d->group_id = current->gid;
+   d->group_id_present = 1;
+
+   }
+
while ((p = strsep(&opt, ",")) != NULL) {
int token;
int value;
@@ -339,6 +352,8 @@ static int parse_fuse_opt(char *opt, str
case OPT_USER_ID:
if (match_int(&args[0], &value))
return 0;
+   if (d->user_id_present && d->user_id != value)
+   return 0;
d->user_id = value;
d->user_id_present = 1;
break;
@@ -346,6 +361,8 @@ static int parse_fuse_opt(char *opt, str
case OPT_GROUP_ID:
if (match_int(&args[0], &value))
return 0;
+   if (d->group_id_present && d->group_id != value)
+   return 0;
d->group_id = value;
d->group_id_present = 1;
break;
@@ -536,6 +553,10 @@ static int fuse_fill_super(struct super_
if (!parse_fuse_opt((char *) data, &d, is_bdev))
return -EINVAL;
 
+   /* This is a privileged option */
+   if ((d.flags & FUSE_ALLOW_OTHER) && !capable(CAP_SYS_ADMIN))
+   return -EPERM;
+
if (is_bdev) {
 #ifdef CONFIG_BLOCK
if (!sb_set_blocksize(sb, d.blksize))
@@ -639,6 +660,7 @@ static struct file_system_type fuse_fs_t
.fs_flags   = FS_HAS_SUBTYPE,
.get_sb = fuse_get_sb,
.kill_sb= kill_anon_super,
+   .fs_flags   = FS_SAFE,
 };
 
 #ifdef CONFIG_BLOCK

--


[patch 00/10] mount ownership and unprivileged mount syscall (v5)

2007-04-27 Thread Miklos Szeredi
v4 -> v5:

 - fold back Andrew's changes
 - fold back my update patch:
o use fsuid instead of ruid
o allow forced unpriv. unmounts for "safe" filesystems
o allow mounting over special files, but not over symlinks
o set nosuid and nodev based on lack of specific capability
 - patch header updates
 - new patch: on propagation inherit owner from parent
 - new patch: add "no submounts" mount flag

The last two patches are up for discussion.

The rest I think is in pretty good shape for merging.  If somebody
feels otherwise, please complain now.

--


[patch 10/10] unprivileged mounts: add "no submounts" flag

2007-04-27 Thread Miklos Szeredi
From: Miklos Szeredi <[EMAIL PROTECTED]>

Add a new mount flag "nomnt", which denies submounts for the owner.
This would be useful, if we want to support traditional /etc/fstab
based user mounts.

In this case mount(8) would still have to be suid-root, to check the
mountpoint against the user/users flag in /etc/fstab, but /etc/mtab
would no longer be mandatory for storing the actual owner of the
mount.

Signed-off-by: Miklos Szeredi <[EMAIL PROTECTED]>
---

Index: linux/fs/namespace.c
===
--- linux.orig/fs/namespace.c   2007-04-27 12:57:11.0 +0200
+++ linux/fs/namespace.c        2007-04-27 12:57:14.0 +0200
@@ -449,6 +449,7 @@ static int show_vfsmnt(struct seq_file *
{ MNT_NOATIME, ",noatime" },
{ MNT_NODIRATIME, ",nodiratime" },
{ MNT_RELATIME, ",relatime" },
+   { MNT_NOMNT, ",nomnt" },
{ 0, NULL }
};
struct proc_fs_info *fs_infop;
@@ -806,6 +807,9 @@ static bool permit_mount(struct nameidat
if (S_ISLNK(inode->i_mode))
return false;
 
+   if (nd->mnt->mnt_flags & MNT_NOMNT)
+   return false;
+
if (!is_mount_owner(nd->mnt, current->fsuid))
return false;
 
@@ -1575,9 +1579,11 @@ long do_mount(char *dev_name, char *dir_
mnt_flags |= MNT_NODIRATIME;
if (flags & MS_RELATIME)
mnt_flags |= MNT_RELATIME;
+   if (flags & MS_NOMNT)
+   mnt_flags |= MNT_NOMNT;
 
flags &= ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_ACTIVE |
-  MS_NOATIME | MS_NODIRATIME | MS_RELATIME);
+  MS_NOATIME | MS_NODIRATIME | MS_RELATIME | MS_NOMNT);
 
/* ... and get the mountpoint */
retval = path_lookup(dir_name, LOOKUP_FOLLOW, &nd);
Index: linux/include/linux/fs.h
===
--- linux.orig/include/linux/fs.h   2007-04-27 12:57:11.0 +0200
+++ linux/include/linux/fs.h        2007-04-27 12:57:14.0 +0200
@@ -128,6 +128,7 @@ extern int dir_notify_enable;
 #define MS_SHARED  (1<<20) /* change to shared */
 #define MS_RELATIME (1<<21) /* Update atime relative to mtime/ctime. */
 #define MS_SETUSER (1<<22) /* set mnt_uid to current user */
+#define MS_NOMNT   (1<<23) /* don't allow unprivileged submounts */
 #define MS_ACTIVE  (1<<30)
 #define MS_NOUSER  (1<<31)
 
Index: linux/include/linux/mount.h
===
--- linux.orig/include/linux/mount.h    2007-04-27 12:57:01.0 +0200
+++ linux/include/linux/mount.h 2007-04-27 12:57:14.0 +0200
@@ -28,6 +28,7 @@ struct mnt_namespace;
 #define MNT_NOATIME 0x08
 #define MNT_NODIRATIME 0x10
 #define MNT_RELATIME   0x20
+#define MNT_NOMNT  0x40
 
 #define MNT_SHRINKABLE 0x100
 #define MNT_USER   0x200

--


[patch 02/10] unprivileged mounts: allow unprivileged umount

2007-04-27 Thread Miklos Szeredi
From: Miklos Szeredi <[EMAIL PROTECTED]>

The owner doesn't need sysadmin capabilities to call umount().

This is similar to the behavior of umount(8) on mounts having the
"user=UID" option in /etc/mtab.  The difference is that umount(8) also
checks /etc/fstab, presumably to exclude another mount on the same
mountpoint.

Signed-off-by: Miklos Szeredi <[EMAIL PROTECTED]>
---

Index: linux/fs/namespace.c
===
--- linux.orig/fs/namespace.c   2007-04-26 13:10:48.0 +0200
+++ linux/fs/namespace.c        2007-04-26 13:16:21.0 +0200
@@ -658,6 +658,27 @@ static int do_umount(struct vfsmount *mn
return retval;
 }
 
+static bool is_mount_owner(struct vfsmount *mnt, uid_t uid)
+{
+   return (mnt->mnt_flags & MNT_USER) && mnt->mnt_uid == uid;
+}
+
+/*
+ * umount is permitted for
+ *  - sysadmin
+ *  - mount owner, if not forced umount
+ */
+static bool permit_umount(struct vfsmount *mnt, int flags)
+{
+   if (capable(CAP_SYS_ADMIN))
+   return true;
+
+   if (flags & MNT_FORCE)
+   return false;
+
+   return is_mount_owner(mnt, current->fsuid);
+}
+
 /*
  * Now umount can handle mount points as well as block devices.
  * This is important for filesystems which use unnamed block devices.
@@ -681,7 +702,7 @@ asmlinkage long sys_umount(char __user *
goto dput_and_out;
 
retval = -EPERM;
-   if (!capable(CAP_SYS_ADMIN))
+   if (!permit_umount(nd.mnt, flags))
goto dput_and_out;
 
retval = do_umount(nd.mnt, flags);

--


Re: [PATCH 0/5] fallocate system call

2007-04-27 Thread Heiko Carstens
On Thu, Apr 26, 2007 at 11:20:56PM +0530, Amit K. Arora wrote:
> Based on the discussion, this new patchset uses following as the
> interface for fallocate() system call:
> 
>  asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
> 
> It seems that only s390 architecture has a problem with such a layout of
> arguments in fallocate(). Thus for s390, we plan to have a wrapper
> (say, sys_s390_fallocate()) for the sys_fallocate(), which will get
> called by glibc when an application issues a fallocate() system call
> on s390. The s390 arch specific changes will be part of a separate
> patch (PATCH 2/5). It will be great if some s390 expert can verify the
> patch, since I have not been able to test it on s390 so far.

After long discussions where at least two possible implementations
were suggested that would work on _all_ architectures you chose one
which doesn't and causes extra effort.

> It was also noted that minor changes might be required to strace code
> to take care of "different arguments on s390" issue.

This is not limited to strace...

Besides that the s390 backend looks ok.


[patch 01/10] unprivileged mounts: add user mounts to the kernel

2007-04-27 Thread Miklos Szeredi
From: Miklos Szeredi <[EMAIL PROTECTED]>

This patchset adds support for keeping mount ownership information in the
kernel, and allows unprivileged mount(2) and umount(2) in certain cases.

The mount owner has the following privileges:

  - unmount the owned mount
  - create a submount under the owned mount

The sysadmin can set the owner explicitly on mount and remount.  When an
unprivileged user creates a mount, then the owner is automatically set to the
user.

The following use cases are envisioned:

1) Private namespace, with selected mounts owned by user.  E.g. 
   /home/$USER is a good candidate for allowing unpriv mounts and unmounts
   within.

2) Private namespace, with all mounts owned by user and having the "nosuid"
   flag.  User can mount and umount anywhere within the namespace, but suid
   programs will not work.

3) Global namespace, with a designated directory, which is a mount owned by
   the user.  E.g.  /mnt/users/$USER is set up so that it is bind mounted onto
   itself, and set to be owned by $USER.  The user can add/remove mounts only
   under this directory.

The following extra security measures are taken for unprivileged mounts:

 - usermounts are limited by a sysctl tunable
 - force "nosuid,nodev" mount options on the created mount

For testing unprivileged mounts (and for other purposes) simple
mount/umount utilities are available from:

  http://www.kernel.org/pub/linux/kernel/people/mszeredi/mmount/

I'll also submit a patch to util-linux, to add the same functionality
to mount(8) and umount(8).


This patch:


A new mount flag, MS_SETUSER is used to make a mount owned by a user.  If this
flag is specified, then the owner will be set to the current fsuid and the
mount will be marked with the MNT_USER flag.  On remount don't preserve
previous owner, and treat MS_SETUSER as for a new mount.  The MS_SETUSER flag
is ignored on mount move.

The MNT_USER flag is not copied on any kind of mount cloning: namespace
creation, binding or propagation.  For bind mounts the cloned mount(s) are set
to MNT_USER depending on the MS_SETUSER mount flag.  In all the other cases
MNT_USER is always cleared.

For MNT_USER mounts a "user=UID" option is added to /proc/PID/mounts.  This is
compatible with how mount ownership is stored in /etc/mtab.
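
As a minimal sketch of how a userspace tool might read the owner back,
the helper below scans a comma-separated option string (the fourth
field of a /proc/PID/mounts line) for "user=UID".  The function name
parse_mount_owner() is hypothetical, and the sketch assumes the numeric
form that this patch emits.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Scan a comma-separated mount option string for "user=UID".
 * Returns 1 and stores the UID in *uid when a fully numeric value is
 * found, 0 otherwise.
 */
static int parse_mount_owner(const char *opts, unsigned long *uid)
{
    const char *p = opts;

    while (*p != '\0') {
        size_t n = strcspn(p, ",");   /* length of the current option */

        if (n > 5 && strncmp(p, "user=", 5) == 0) {
            char *end;
            unsigned long v = strtoul(p + 5, &end, 10);

            /* accept only if the number runs to the end of the option */
            if (end == p + n) {
                *uid = v;
                return 1;
            }
        }
        p += n;
        if (*p == ',')
            p++;
    }
    return 0;
}
```

For example, the option string "rw,nosuid,nodev,user=1000" yields UID
1000, while a string without a "user=" option yields no owner.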

The rationale for using MS_SETUSER and MNT_USER to distinguish "user"
mounts from "non-user" or "legacy" mounts is as follows:

  a) Mount(2) and umount(2) on legacy mounts always need the CAP_SYS_ADMIN
     capability, whereas user mounts only require that the mount owner
     matches the current fsuid.  So a process with fsuid=0 should not be
     able to mount/umount legacy mounts without the CAP_SYS_ADMIN
     capability.

  b) Legacy userspace programs may set fsuid to nonzero before calling
 mount(2).  In such an unlikely case, this patchset would cause
 an unintended side effect of making the mount owned by the fsuid.

  c) For legacy mounts, no "user=UID" option should be shown in
 /proc/mounts for backwards compatibility.

Signed-off-by: Miklos Szeredi <[EMAIL PROTECTED]>
---

Index: linux/fs/namespace.c
===
--- linux.orig/fs/namespace.c   2007-04-26 13:08:35.0 +0200
+++ linux/fs/namespace.c        2007-04-26 13:10:48.0 +0200
@@ -227,6 +227,13 @@ static struct vfsmount *skip_mnt_tree(st
return p;
 }
 
+static void set_mnt_user(struct vfsmount *mnt)
+{
+   BUG_ON(mnt->mnt_flags & MNT_USER);
+   mnt->mnt_uid = current->fsuid;
+   mnt->mnt_flags |= MNT_USER;
+}
+
 static struct vfsmount *clone_mnt(struct vfsmount *old, struct dentry *root,
int flag)
 {
@@ -241,6 +248,11 @@ static struct vfsmount *clone_mnt(struct
mnt->mnt_mountpoint = mnt->mnt_root;
mnt->mnt_parent = mnt;
 
+   /* don't copy the MNT_USER flag */
+   mnt->mnt_flags &= ~MNT_USER;
+   if (flag & CL_SETUSER)
+   set_mnt_user(mnt);
+
if (flag & CL_SLAVE) {
list_add(&mnt->mnt_slave, &old->mnt_slave_list);
mnt->mnt_master = old;
@@ -403,6 +415,8 @@ static int show_vfsmnt(struct seq_file *
if (mnt->mnt_flags & fs_infop->flag)
seq_puts(m, fs_infop->str);
}
+   if (mnt->mnt_flags & MNT_USER)
+   seq_printf(m, ",user=%i", mnt->mnt_uid);
if (mnt->mnt_sb->s_op->show_options)
err = mnt->mnt_sb->s_op->show_options(m, mnt);
seq_puts(m, " 0 0\n");
@@ -923,8 +937,9 @@ static int do_change_type(struct nameida
 /*
  * do loopback mount.
  */
-static int do_loopback(struct nameidata *nd, char *old_name, int recurse)
+static int do_loopback(struct nameidata *nd, char *old_name, int flags)
 {
+   int clone_flags;
struct nameidata old_nd;
struct vfsmount *mnt = NULL;
int err

[patch 09/10] unprivileged mounts: propagation: inherit owner from parent

2007-04-27 Thread Miklos Szeredi
From: Miklos Szeredi <[EMAIL PROTECTED]>

On mount propagation, let the owner of the clone be inherited from the
parent into which it has been propagated.  Also if the parent has the
"nosuid" flag, set this flag for the child as well.

This makes sense for example, when propagation is set up from the
initial namespace into a per-user namespace, where some or all of the
mounts may be owned by the user.

Signed-off-by: Miklos Szeredi <[EMAIL PROTECTED]>
---

Index: linux/fs/namespace.c
===
--- linux.orig/fs/namespace.c   2007-04-27 12:57:01.0 +0200
+++ linux/fs/namespace.c        2007-04-27 12:57:11.0 +0200
@@ -250,10 +250,10 @@ static int reserve_user_mount(void)
return err;
 }
 
-static void __set_mnt_user(struct vfsmount *mnt)
+static void __set_mnt_user(struct vfsmount *mnt, uid_t owner)
 {
BUG_ON(mnt->mnt_flags & MNT_USER);
-   mnt->mnt_uid = current->fsuid;
+   mnt->mnt_uid = owner;
mnt->mnt_flags |= MNT_USER;
 
if (!capable(CAP_SETUID))
@@ -264,7 +264,7 @@ static void __set_mnt_user(struct vfsmou
 
 static void set_mnt_user(struct vfsmount *mnt)
 {
-   __set_mnt_user(mnt);
+   __set_mnt_user(mnt, current->fsuid);
spin_lock(&vfsmount_lock);
nr_user_mounts++;
spin_unlock(&vfsmount_lock);
@@ -280,7 +280,7 @@ static void clear_mnt_user(struct vfsmou
 }
 
 static struct vfsmount *clone_mnt(struct vfsmount *old, struct dentry *root,
-   int flag)
+   int flag, uid_t owner)
 {
struct super_block *sb = old->mnt_sb;
struct vfsmount *mnt;
@@ -304,7 +304,10 @@ static struct vfsmount *clone_mnt(struct
/* don't copy the MNT_USER flag */
mnt->mnt_flags &= ~MNT_USER;
if (flag & CL_SETUSER)
-   __set_mnt_user(mnt);
+   __set_mnt_user(mnt, owner);
+
+   if (flag & CL_NOSUID)
+   mnt->mnt_flags |= MNT_NOSUID;
 
if (flag & CL_SLAVE) {
list_add(&mnt->mnt_slave, &old->mnt_slave_list);
@@ -822,7 +825,7 @@ static int lives_below_in_same_fs(struct
 }
 
 struct vfsmount *copy_tree(struct vfsmount *mnt, struct dentry *dentry,
-   int flag)
+   int flag, uid_t owner)
 {
struct vfsmount *res, *p, *q, *r, *s;
struct nameidata nd;
@@ -830,7 +833,7 @@ struct vfsmount *copy_tree(struct vfsmou
if (!(flag & CL_COPY_ALL) && IS_MNT_UNBINDABLE(mnt))
return ERR_PTR(-EPERM);
 
-   res = q = clone_mnt(mnt, dentry, flag);
+   res = q = clone_mnt(mnt, dentry, flag, owner);
if (IS_ERR(q))
goto error;
q->mnt_mountpoint = mnt->mnt_mountpoint;
@@ -852,7 +855,7 @@ struct vfsmount *copy_tree(struct vfsmou
p = s;
nd.mnt = q;
nd.dentry = p->mnt_mountpoint;
-   q = clone_mnt(p, p->mnt_root, flag);
+   q = clone_mnt(p, p->mnt_root, flag, owner);
if (IS_ERR(q))
goto error;
spin_lock(&vfsmount_lock);
@@ -1028,7 +1031,8 @@ static int do_change_type(struct nameida
  */
 static int do_loopback(struct nameidata *nd, char *old_name, int flags)
 {
-   int clone_flags;
+   int clone_flags = 0;
+   uid_t owner = 0;
struct nameidata old_nd;
struct vfsmount *mnt = NULL;
int err;
@@ -1049,11 +1053,15 @@ static int do_loopback(struct nameidata 
if (!check_mnt(nd->mnt) || !check_mnt(old_nd.mnt))
goto out;
 
-   clone_flags = (flags & MS_SETUSER) ? CL_SETUSER : 0;
+   if (flags & MS_SETUSER) {
+   clone_flags |= CL_SETUSER;
+   owner = current->fsuid;
+   }
+
if (flags & MS_REC)
-   mnt = copy_tree(old_nd.mnt, old_nd.dentry, clone_flags);
+   mnt = copy_tree(old_nd.mnt, old_nd.dentry, clone_flags, owner);
else
-   mnt = clone_mnt(old_nd.mnt, old_nd.dentry, clone_flags);
+   mnt = clone_mnt(old_nd.mnt, old_nd.dentry, clone_flags, owner);
 
err = PTR_ERR(mnt);
if (IS_ERR(mnt))
@@ -1227,7 +1235,7 @@ static int do_new_mount(struct nameidata
}
 
if (flags & MS_SETUSER)
-   __set_mnt_user(mnt);
+   __set_mnt_user(mnt, current->fsuid);
 
return do_add_mount(mnt, nd, mnt_flags, NULL);
 
@@ -1620,7 +1628,7 @@ static struct mnt_namespace *dup_mnt_ns(
down_write(&namespace_sem);
/* First pass: copy the tree topology */
new_ns->root = copy_tree(mnt_ns->root, mnt_ns->root->mnt_root,
-   CL_COPY_ALL | CL_EXPIRE);
+   CL_COPY_ALL | CL_EXPIRE, 0);
if (IS_ERR(new_ns->root)) {
up_write(&na

[patch 04/10] unprivileged mounts: propagate error values from clone_mnt

2007-04-27 Thread Miklos Szeredi
From: Miklos Szeredi <[EMAIL PROTECTED]>

Allow clone_mnt() to return errors other than ENOMEM.  This will be used for
returning a different error value when the number of user mounts goes over the
limit.

Fix copy_tree() to return EPERM for unbindable mounts.

Don't propagate further from dup_mnt_ns() as that copy_tree() can only fail
with -ENOMEM.
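
For readers unfamiliar with the idiom, the error propagation described
above relies on encoding a negative errno value in the returned
pointer.  The block below is a userspace re-creation of that idiom,
modeled on the kernel's ERR_PTR()/IS_ERR()/PTR_ERR() helpers; it is a
sketch, not the kernel's own definition, and clone_thing() is a
hypothetical stand-in for clone_mnt().

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Largest errno value that is encoded in a pointer (as in the kernel). */
#define MAX_ERRNO 4095

/* Encode a negative errno value as a pointer. */
static inline void *ERR_PTR(long error)
{
    return (void *)(intptr_t)error;
}

/* Decode the errno value from an error pointer. */
static inline long PTR_ERR(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

/* True if the pointer lies in the top MAX_ERRNO addresses. */
static inline int IS_ERR(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/*
 * Hypothetical allocator that can fail in two distinct ways, so the
 * caller can tell "over the limit" apart from "out of memory".
 */
static void *clone_thing(int over_limit)
{
    void *p;

    if (over_limit)
        return ERR_PTR(-EPERM);
    p = malloc(16);
    if (p == NULL)
        return ERR_PTR(-ENOMEM);
    return p;
}
```

A caller checks IS_ERR() on the result and forwards PTR_ERR() as its
own return value, instead of collapsing every failure into -ENOMEM.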

Signed-off-by: Miklos Szeredi <[EMAIL PROTECTED]>
---

Index: linux/fs/namespace.c
===
--- linux.orig/fs/namespace.c   2007-04-26 13:17:13.0 +0200
+++ linux/fs/namespace.c        2007-04-26 13:18:46.0 +0200
@@ -262,41 +262,42 @@ static struct vfsmount *clone_mnt(struct
struct super_block *sb = old->mnt_sb;
struct vfsmount *mnt = alloc_vfsmnt(old->mnt_devname);
 
-   if (mnt) {
-   mnt->mnt_flags = old->mnt_flags;
-   atomic_inc(&sb->s_active);
-   mnt->mnt_sb = sb;
-   mnt->mnt_root = dget(root);
-   mnt->mnt_mountpoint = mnt->mnt_root;
-   mnt->mnt_parent = mnt;
-
-   /* don't copy the MNT_USER flag */
-   mnt->mnt_flags &= ~MNT_USER;
-   if (flag & CL_SETUSER)
-   set_mnt_user(mnt);
-
-   if (flag & CL_SLAVE) {
-   list_add(&mnt->mnt_slave, &old->mnt_slave_list);
-   mnt->mnt_master = old;
-   CLEAR_MNT_SHARED(mnt);
-   } else {
-   if ((flag & CL_PROPAGATION) || IS_MNT_SHARED(old))
-   list_add(&mnt->mnt_share, &old->mnt_share);
-   if (IS_MNT_SLAVE(old))
-   list_add(&mnt->mnt_slave, &old->mnt_slave);
-   mnt->mnt_master = old->mnt_master;
-   }
-   if (flag & CL_MAKE_SHARED)
-   set_mnt_shared(mnt);
+   if (!mnt)
+   return ERR_PTR(-ENOMEM);
 
-   /* stick the duplicate mount on the same expiry list
-* as the original if that was on one */
-   if (flag & CL_EXPIRE) {
-   spin_lock(&vfsmount_lock);
-   if (!list_empty(&old->mnt_expire))
-   list_add(&mnt->mnt_expire, &old->mnt_expire);
-   spin_unlock(&vfsmount_lock);
-   }
+   mnt->mnt_flags = old->mnt_flags;
+   atomic_inc(&sb->s_active);
+   mnt->mnt_sb = sb;
+   mnt->mnt_root = dget(root);
+   mnt->mnt_mountpoint = mnt->mnt_root;
+   mnt->mnt_parent = mnt;
+
+   /* don't copy the MNT_USER flag */
+   mnt->mnt_flags &= ~MNT_USER;
+   if (flag & CL_SETUSER)
+   set_mnt_user(mnt);
+
+   if (flag & CL_SLAVE) {
+   list_add(&mnt->mnt_slave, &old->mnt_slave_list);
+   mnt->mnt_master = old;
+   CLEAR_MNT_SHARED(mnt);
+   } else {
+   if ((flag & CL_PROPAGATION) || IS_MNT_SHARED(old))
+   list_add(&mnt->mnt_share, &old->mnt_share);
+   if (IS_MNT_SLAVE(old))
+   list_add(&mnt->mnt_slave, &old->mnt_slave);
+   mnt->mnt_master = old->mnt_master;
+   }
+   if (flag & CL_MAKE_SHARED)
+   set_mnt_shared(mnt);
+
+   /* stick the duplicate mount on the same expiry list
+* as the original if that was on one */
+   if (flag & CL_EXPIRE) {
+   spin_lock(&vfsmount_lock);
+   if (!list_empty(&old->mnt_expire))
+   list_add(&mnt->mnt_expire, &old->mnt_expire);
+   spin_unlock(&vfsmount_lock);
}
return mnt;
 }
@@ -783,11 +784,11 @@ struct vfsmount *copy_tree(struct vfsmou
struct nameidata nd;
 
if (!(flag & CL_COPY_ALL) && IS_MNT_UNBINDABLE(mnt))
-   return NULL;
+   return ERR_PTR(-EPERM);
 
res = q = clone_mnt(mnt, dentry, flag);
-   if (!q)
-   goto Enomem;
+   if (IS_ERR(q))
+   goto error;
q->mnt_mountpoint = mnt->mnt_mountpoint;
 
p = mnt;
@@ -808,8 +809,8 @@ struct vfsmount *copy_tree(struct vfsmou
nd.mnt = q;
nd.dentry = p->mnt_mountpoint;
q = clone_mnt(p, p->mnt_root, flag);
-   if (!q)
-   goto Enomem;
+   if (IS_ERR(q))
+   goto error;
spin_lock(&vfsmount_lock);
list_add_tail(&q->mnt_list, &res->mnt_list);
attach_mnt(q, &nd);
@@ -817,7 +818,7 @@ struct vfsmount *copy_tree(struct vfsmou
}
}
return res;
-Enomem:
+ error:
if (res) {
LIST_HEAD(umount_list);
spin_lock(&vfsmount_lock);
@@ -825,7 +826,7 @@ Enomem:
sp

[patch 03/10] unprivileged mounts: account user mounts

2007-04-27 Thread Miklos Szeredi
From: Miklos Szeredi <[EMAIL PROTECTED]>

Add sysctl variables for accounting and limiting the number of user
mounts.

The maximum number of user mounts is set to 1024 by default.  This
won't in itself enable user mounts; a mount must first be set to be
owned by a user.

[akpm]
 - don't use enumerated sysctls
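
The accounting idea can be sketched in userspace roughly as follows. This is an illustration under assumptions, not code from the series: demo_reserve_user_mount() and demo_release_user_mount() are hypothetical names, a pthread mutex stands in for vfsmount_lock, and -EMFILE is an assumed error value (the patch text here does not say which errno is returned when the limit is hit).

```c
#include <errno.h>
#include <pthread.h>

/* Stand-in for vfsmount_lock, which protects the counter in the patch. */
static pthread_mutex_t demo_count_lock = PTHREAD_MUTEX_INITIALIZER;

static int nr_user_mounts;
static int max_user_mounts = 1024;   /* default from the patch */

/* Hypothetical helper: account one more user mount.
 * 'privileged' models a mount created with sysadmin privileges on behalf
 * of a user; such mounts are counted but not limited, so the counter may
 * exceed max_user_mounts. */
static int demo_reserve_user_mount(int privileged)
{
	int err = 0;

	pthread_mutex_lock(&demo_count_lock);
	if (!privileged && nr_user_mounts >= max_user_mounts)
		err = -EMFILE;   /* assumed errno, see the note above */
	else
		nr_user_mounts++;
	pthread_mutex_unlock(&demo_count_lock);
	return err;
}

/* Hypothetical counterpart of dec_nr_user_mounts(). */
static void demo_release_user_mount(void)
{
	pthread_mutex_lock(&demo_count_lock);
	nr_user_mounts--;
	pthread_mutex_unlock(&demo_count_lock);
}
```

Doing the limit check and the increment under the same lock keeps two concurrent callers from both passing the check when only one slot is left.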

Signed-off-by: Miklos Szeredi <[EMAIL PROTECTED]>
---

Index: linux/Documentation/filesystems/proc.txt
===
--- linux.orig/Documentation/filesystems/proc.txt   2007-04-26 13:08:35.0 +0200
+++ linux/Documentation/filesystems/proc.txt    2007-04-26 13:17:13.0 +0200
@@ -923,6 +923,15 @@ reaches aio-max-nr then io_setup will fa
 raising aio-max-nr does not result in the pre-allocation or re-sizing
 of any kernel data structures.
 
+nr_user_mounts and max_user_mounts
+--
+
+These represent the number of "user" mounts and the maximum number of
+"user" mounts respectively.  User mounts may be created by
+unprivileged users.  User mounts may also be created with sysadmin
+privileges on behalf of a user, in which case nr_user_mounts may
+exceed max_user_mounts.
+
 2.2 /proc/sys/fs/binfmt_misc - Miscellaneous binary formats
 ---
 
Index: linux/fs/namespace.c
===
--- linux.orig/fs/namespace.c   2007-04-26 13:16:21.0 +0200
+++ linux/fs/namespace.c2007-04-26 13:17:13.0 +0200
@@ -39,6 +39,9 @@ static int hash_mask __read_mostly, hash
 static struct kmem_cache *mnt_cache __read_mostly;
 static struct rw_semaphore namespace_sem;
 
+int nr_user_mounts;
+int max_user_mounts = 1024;
+
 /* /sys/fs */
 decl_subsys(fs, NULL, NULL);
 EXPORT_SYMBOL_GPL(fs_subsys);
@@ -227,11 +230,30 @@ static struct vfsmount *skip_mnt_tree(st
return p;
 }
 
+static void dec_nr_user_mounts(void)
+{
+   spin_lock(&vfsmount_lock);
+   nr_user_mounts--;
+   spin_unlock(&vfsmount_lock);
+}
+
 static void set_mnt_user(struct vfsmount *mnt)
 {
BUG_ON(mnt->mnt_flags & MNT_USER);
mnt->mnt_uid = current->fsuid;
mnt->mnt_flags |= MNT_USER;
+   spin_lock(&vfsmount_lock);
+   nr_user_mounts++;
+   spin_unlock(&vfsmount_lock);
+}
+
+static void clear_mnt_user(struct vfsmount *mnt)
+{
+   if (mnt->mnt_flags & MNT_USER) {
+   mnt->mnt_uid = 0;
+   mnt->mnt_flags &= ~MNT_USER;
+   dec_nr_user_mounts();
+   }
 }
 
 static struct vfsmount *clone_mnt(struct vfsmount *old, struct dentry *root,
@@ -283,6 +305,7 @@ static inline void __mntput(struct vfsmo
 {
struct super_block *sb = mnt->mnt_sb;
dput(mnt->mnt_root);
+   clear_mnt_user(mnt);
free_vfsmnt(mnt);
deactivate_super(sb);
 }
@@ -1028,6 +1051,7 @@ static int do_remount(struct nameidata *
down_write(&sb->s_umount);
err = do_remount_sb(sb, flags, data, 0);
if (!err) {
+   clear_mnt_user(nd->mnt);
nd->mnt->mnt_flags = mnt_flags;
if (flags & MS_SETUSER)
set_mnt_user(nd->mnt);
Index: linux/include/linux/fs.h
===
--- linux.orig/include/linux/fs.h   2007-04-26 13:08:36.0 +0200
+++ linux/include/linux/fs.h2007-04-26 13:17:13.0 +0200
@@ -50,6 +50,9 @@ extern struct inodes_stat_t inodes_stat;
 
 extern int leases_enable, lease_break_time;
 
+extern int nr_user_mounts;
+extern int max_user_mounts;
+
 #ifdef CONFIG_DNOTIFY
 extern int dir_notify_enable;
 #endif
Index: linux/kernel/sysctl.c
===
--- linux.orig/kernel/sysctl.c  2007-04-26 13:08:35.0 +0200
+++ linux/kernel/sysctl.c   2007-04-26 13:17:13.0 +0200
@@ -1064,6 +1064,22 @@ static ctl_table fs_table[] = {
 #endif 
 #endif
{
+   .ctl_name   = CTL_UNNUMBERED,
+   .procname   = "nr_user_mounts",
+   .data   = &nr_user_mounts,
+   .maxlen = sizeof(int),
+   .mode   = 0444,
+   .proc_handler   = &proc_dointvec,
+   },
+   {
+   .ctl_name   = CTL_UNNUMBERED,
+   .procname   = "max_user_mounts",
+   .data   = &max_user_mounts,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = &proc_dointvec,
+   },
+   {
.ctl_name   = KERN_SETUID_DUMPABLE,
.procname   = "suid_dumpable",
.data   = &suid_dumpable,



[patch 07/10] unprivileged mounts: allow unprivileged mounts

2007-04-27 Thread Miklos Szeredi
From: Miklos Szeredi <[EMAIL PROTECTED]>

Define a new fs flag, FS_SAFE, which denotes that unprivileged mounting of
this filesystem should not constitute a security problem.

Since most filesystems haven't been designed with unprivileged mounting in
mind, a thorough audit is needed before setting this flag.

For "safe" filesystems also allow unprivileged forced unmounting.
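
The resulting umount rule can be written as a small decision function. This is a userspace sketch only: struct demo_umount_req and demo_permit_umount() are hypothetical, the two booleans stand in for capable(CAP_SYS_ADMIN) and is_mount_owner(), DEMO_FS_SAFE takes the value 8 from the fs.h hunk below, and DEMO_MNT_FORCE uses the value 1 of the umount2(2) flag.

```c
#include <stdbool.h>

#define DEMO_FS_SAFE   8   /* value of FS_SAFE in this patch */
#define DEMO_MNT_FORCE 1   /* value of MNT_FORCE for umount2(2) */

/* Hypothetical flattened view of the inputs to permit_umount(). */
struct demo_umount_req {
	bool is_sysadmin;   /* stands in for capable(CAP_SYS_ADMIN) */
	bool is_owner;      /* stands in for is_mount_owner(mnt, fsuid) */
	int fs_flags;       /* fs_flags of the filesystem type */
	int umount_flags;   /* flags passed to umount */
};

/* Mirrors the order of checks in the patched permit_umount(). */
static bool demo_permit_umount(const struct demo_umount_req *r)
{
	/* sysadmin may always unmount */
	if (r->is_sysadmin)
		return true;

	/* forced umount by non-admins only for "safe" filesystems */
	if ((r->umount_flags & DEMO_MNT_FORCE) &&
	    !(r->fs_flags & DEMO_FS_SAFE))
		return false;

	/* otherwise only the mount owner may unmount */
	return r->is_owner;
}
```

Note that FS_SAFE only lifts the restriction on forced umount; ownership of the mount is still required for an unprivileged caller.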

Signed-off-by: Miklos Szeredi <[EMAIL PROTECTED]>
---

Index: linux/fs/namespace.c
===
--- linux.orig/fs/namespace.c   2007-04-26 13:30:04.0 +0200
+++ linux/fs/namespace.c2007-04-26 13:51:29.0 +0200
@@ -724,14 +724,16 @@ static bool is_mount_owner(struct vfsmou
 /*
  * umount is permitted for
  *  - sysadmin
- *  - mount owner, if not forced umount
+ *  - mount owner
+ *o if not forced umount,
+ *o if forced umount, and filesystem is "safe"
  */
 static bool permit_umount(struct vfsmount *mnt, int flags)
 {
if (capable(CAP_SYS_ADMIN))
return true;
 
-   if (flags & MNT_FORCE)
+   if ((flags & MNT_FORCE) && !(mnt->mnt_sb->s_type->fs_flags & FS_SAFE))
return false;
 
return is_mount_owner(mnt, current->fsuid);
@@ -787,13 +789,17 @@ asmlinkage long sys_oldumount(char __use
  * - mountpoint is not a symlink
  * - mountpoint is in a mount owned by the user
  */
-static bool permit_mount(struct nameidata *nd, int *flags)
+static bool permit_mount(struct nameidata *nd, struct file_system_type *type,
+int *flags)
 {
struct inode *inode = nd->dentry->d_inode;
 
if (capable(CAP_SYS_ADMIN))
return true;
 
+   if (type && !(type->fs_flags & FS_SAFE))
+   return false;
+
if (S_ISLNK(inode->i_mode))
return false;
 
@@ -1027,7 +1033,7 @@ static int do_loopback(struct nameidata 
struct vfsmount *mnt = NULL;
int err;
 
-   if (!permit_mount(nd, &flags))
+   if (!permit_mount(nd, NULL, &flags))
return -EPERM;
if (!old_name || !*old_name)
return -EINVAL;
@@ -1188,26 +1194,46 @@ out:
  * create a new mount for userspace and request it to be added into the
  * namespace's tree
  */
-static int do_new_mount(struct nameidata *nd, char *type, int flags,
+static int do_new_mount(struct nameidata *nd, char *fstype, int flags,
int mnt_flags, char *name, void *data)
 {
+   int err;
struct vfsmount *mnt;
+   struct file_system_type *type;
 
-   if (!type || !memchr(type, 0, PAGE_SIZE))
+   if (!fstype || !memchr(fstype, 0, PAGE_SIZE))
return -EINVAL;
 
-   /* we need capabilities... */
-   if (!capable(CAP_SYS_ADMIN))
-   return -EPERM;
-
-   mnt = do_kern_mount(type, flags & ~MS_SETUSER, name, data);
-   if (IS_ERR(mnt))
+   type = get_fs_type(fstype);
+   if (!type)
+   return -ENODEV;
+
+   err = -EPERM;
+   if (!permit_mount(nd, type, &flags))
+   goto out_put_filesystem;
+
+   if (flags & MS_SETUSER) {
+   err = reserve_user_mount();
+   if (err)
+   goto out_put_filesystem;
+   }
+
+   mnt = vfs_kern_mount(type, flags & ~MS_SETUSER, name, data);
+   put_filesystem(type);
+   if (IS_ERR(mnt)) {
+   if (flags & MS_SETUSER)
+   dec_nr_user_mounts();
return PTR_ERR(mnt);
+   }
 
if (flags & MS_SETUSER)
-   set_mnt_user(mnt);
+   __set_mnt_user(mnt);
 
return do_add_mount(mnt, nd, mnt_flags, NULL);
+
+ out_put_filesystem:
+   put_filesystem(type);
+   return err;
 }
 
 /*
@@ -1237,7 +1263,7 @@ int do_add_mount(struct vfsmount *newmnt
if (S_ISLNK(newmnt->mnt_root->d_inode->i_mode))
goto unlock;
 
-   /* MNT_USER was set earlier */
+   /* some flags may have been set earlier */
newmnt->mnt_flags |= mnt_flags;
if ((err = graft_tree(newmnt, nd)))
goto unlock;
Index: linux/include/linux/fs.h
===
--- linux.orig/include/linux/fs.h   2007-04-26 13:46:26.0 +0200
+++ linux/include/linux/fs.h2007-04-26 13:48:14.0 +0200
@@ -96,6 +96,7 @@ extern int dir_notify_enable;
 #define FS_REQUIRES_DEV 1 
 #define FS_BINARY_MOUNTDATA 2
 #define FS_HAS_SUBTYPE 4
+#define FS_SAFE 8  /* Safe to mount by unprivileged users */
 #define FS_REVAL_DOT   16384   /* Check the paths ".", ".." for staleness */
 #define FS_RENAME_DOES_D_MOVE  32768   /* FS will handle d_move()
 * during rename() internally.



[patch 06/10] unprivileged mounts: put declaration of put_filesystem() in fs.h

2007-04-27 Thread Miklos Szeredi
From: Miklos Szeredi <[EMAIL PROTECTED]>

Declarations go into headers.

Signed-off-by: Miklos Szeredi <[EMAIL PROTECTED]>
---

Index: linux/fs/super.c
===
--- linux.orig/fs/super.c   2007-04-26 13:08:34.0 +0200
+++ linux/fs/super.c2007-04-26 13:46:26.0 +0200
@@ -40,10 +40,6 @@
 #include 
 
 
-void get_filesystem(struct file_system_type *fs);
-void put_filesystem(struct file_system_type *fs);
-struct file_system_type *get_fs_type(const char *name);
-
 LIST_HEAD(super_blocks);
 DEFINE_SPINLOCK(sb_lock);
 
Index: linux/include/linux/fs.h
===
--- linux.orig/include/linux/fs.h   2007-04-26 13:17:13.0 +0200
+++ linux/include/linux/fs.h2007-04-26 13:46:26.0 +0200
@@ -1919,6 +1919,8 @@ extern int vfs_fstat(unsigned int, struc
 
 extern int vfs_ioctl(struct file *, unsigned int, unsigned int, unsigned long);
 
+extern void get_filesystem(struct file_system_type *fs);
+extern void put_filesystem(struct file_system_type *fs);
 extern struct file_system_type *get_fs_type(const char *name);
 extern struct super_block *get_super(struct block_device *);
 extern struct super_block *user_get_super(dev_t);



Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-27 Thread Jörn Engel
On Thu, 26 April 2007 22:07:43 -0700, Valerie Henson wrote:
>
> What's important is that each continuation inode have a
> back pointer to the parent and that there is some structure for
> quickly looking up the continuation inode for a given file offset.
> Suggestions for data structures that work well in this situation are
> welcome. :)

All this would get easier if continuation inodes were known to be rare.
You can ditch the doubly-linked list in favor of a pointer to the main
inode then - traversing the list again is cheap, after all.  And you can
just try to read the same block once for every continuation inode.
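
A minimal sketch of that "cnodes are rare" layout, assuming a singly-linked list with a back pointer to the main inode and a plain linear walk for lookup. All names (demo_inode, demo_cnode, demo_find_cnode) and the start/len fields are hypothetical and not taken from the ChunkFS code.

```c
#include <stddef.h>
#include <stdint.h>

struct demo_inode;

/* Hypothetical continuation inode: only a back pointer to the main inode
 * and a forward link, no doubly-linked list. */
struct demo_cnode {
	struct demo_inode *parent;   /* back pointer to the main inode */
	struct demo_cnode *next;     /* next continuation, or NULL */
	uint64_t start;              /* first file offset covered */
	uint64_t len;                /* number of bytes covered */
};

/* Hypothetical main inode holding the head of the list. */
struct demo_inode {
	uint64_t len;                /* bytes stored in the main inode itself */
	struct demo_cnode *first;    /* head of the continuation list */
};

/* Linear walk: acceptable only while the list stays short.
 * Returns NULL when the offset is not covered by any continuation inode. */
static const struct demo_cnode *
demo_find_cnode(const struct demo_inode *ino, uint64_t off)
{
	const struct demo_cnode *c;

	for (c = ino->first; c != NULL; c = c->next)
		if (off >= c->start && off - c->start < c->len)
			return c;
	return NULL;
}
```

The lookup cost grows with the number of continuation inodes, which is why this only makes sense if they really are rare.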

If those lists can get long and you need a mapping from offset to
continuation inode on the medium, you are basically fscked.  Storing the
mapping requires space.  You need the mapping only when space (in some
chunk) gets tight and you allocate continuation inodes.  So either you
don't need the mapping or you don't have a good place to put it.

Having a mapping in memory is also questionable.  Either you scan the
whole file on first access and spend a long time for large files.  Or
you create the mapping on the fly.  In that case the page cache will
already give you a 90% solution for free.

You should spend a lot of effort trying to minimize cnodes. ;)

Jörn

-- 
To recognize individual spam features you have to try to get into the
mind of the spammer, and frankly I want to spend as little time inside
the minds of spammers as possible.
-- Paul Graham


Re: [patch] unprivileged mounts update

2007-04-27 Thread Jan Engelhardt

On Apr 26 2007 22:27, Miklos Szeredi wrote:
>> On Apr 25 2007 11:21, Eric W. Biederman wrote:
>> >>
>> >> Why did we want to use fsuid, exactly?
>> >
>> >- Because ruid is completely the wrong thing we want mounts owned
>> >  by whomever's permissions we are using to perform the mount.
>> 
>> Think nfs. I access some nfs file as an unprivileged user. knfsd, by
>> nature, would run as euid=0, uid=0, but it needs fsuid=jengelh for
>> most permission logic to work as expected.
>
>I don't think knfsd will ever want to call mount(2).

I was actually getting at something different...

	/* Make sure a caller can chown. */
	if ((ia_valid & ATTR_UID) &&
	    (current->fsuid != inode->i_uid ||
	     attr->ia_uid != inode->i_uid) && !capable(CAP_CHOWN))
		goto error;

for example. Using current->[e]uid would not make sense here.

>But yeah, I've been convinced, that using fsuid is the right thing to
>do.

Jan