Re: [RFC] User CLONE_NEWNS permission and rlimits

2005-04-19 Thread Al Viro
On Tue, Apr 19, 2005 at 11:38:21PM -0400, Ritesh Kumar wrote:
> You are right. A more privileged process running as a child of
> another process should not be allowed to look at the same namespace as
> its parent.

No go.  That immediately breaks any suid program that takes a pathname
as an argument and is supposed to do something to the file in question.
Or uses dotfiles for per-user config.  gpg(1) fits both, for example,
and that's not something rarely used.  Moreover, used in fsckload of
scripts that are entirely out of your control, so something like "OK,
use it on stdin, then" is not an answer (and it still doesn't address
the second issue - gpg *does* need access to keyring, after all).

> Also, the access control for the filesystem is still in the kernel.
> What we change in the userspace is just the namespace and nothing
> else. If you are fundamentally denied access to a file (from the
> kernel) then you cannot access it no matter how you access it using
> userspace libraries.

The issue is not with being able to see something you shouldn't see.
It's being able to trick a more privileged process into accepting your
data as something it trusts.  OR not being able to use suid programs
on your own files at all.  Neither is acceptable.

BTW, your references to Plan 9 completely miss one very important thing -
they manage to live without any suid stuff at all.  Which is certainly
very nice, but not useful in our case, unless you volunteer to rewrite
suid applications to the form that would not need suid.
 
> Plus, it is not very clear to me what you mean by 'tasks'. If that
> is processes, then the child will inherit a separate copy of the
> namespace from the parent (Copy-on-write of the data structs of the
> user library probably... I'll have to think over this). So no race
> conditions here.

... and no working mount(8) either.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-19 Thread Mike Waychison
Eric Van Hensbergen wrote:
> Somewhat related question for Viro/the group:
> 
> Why is CLONE_NEWNS considered a privileged operation?  Would placing
> limits on the number of private namespaces a user can own solve any
> resource concerns or is there something more nefarious I'm missing?
> 

Likely because it's a chroot vulnerability.

It allows a process to obtain a reference to the root vfsmount that
doesn't have chroot checks performed on it.

Consider the following pseudo example:

main():
    chdir("/");
    fd = open(".", O_RDONLY);
    clone(cloned_func, cloned_stack, CLONE_NEWNS, NULL);

cloned_func():
    fchdir(fd);
    chdir("..");

If main is run within a chroot where its "/" is on the same vfsmount as
its "..", then the application can step out of the chroot using clone(2).

Note: using chdir in a vfsmount outside of your namespace works, however
you won't be able to walk off that vfsmount (to its parent or children).
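
The example depends on one property: a descriptor opened on a directory
keeps referring to that directory no matter where the process moves
afterwards.  A minimal user-space sketch of just that property, with no
chroot or clone involved (the function name return_via_dirfd is made up
for illustration):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <limits.h>
#include <string.h>
#include <unistd.h>

/*
 * Illustration only: hold a descriptor for the current directory, move
 * elsewhere, then return through the descriptor.  Returns 0 if we end
 * up back where we started, -1 on any error.
 */
int return_via_dirfd(void)
{
	char before[PATH_MAX], after[PATH_MAX];
	int fd;

	if (!getcwd(before, sizeof(before)))
		return -1;

	fd = open(".", O_RDONLY | O_DIRECTORY);
	if (fd < 0)
		return -1;

	/* Change cwd; the descriptor still refers to the old directory. */
	if (chdir("/") != 0 || fchdir(fd) != 0) {
		close(fd);
		return -1;
	}
	close(fd);

	if (!getcwd(after, sizeof(after)))
		return -1;

	return strcmp(before, after) == 0 ? 0 : -1;
}
```

The concern in this thread is what such a saved reference allows once the
holder ends up in a different namespace from the one the chroot checks
were written for.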

Mike Waychison


Re: [RFC] User CLONE_NEWNS permission and rlimits

2005-04-19 Thread Ritesh Kumar
You are right. A more privileged process running as a child of
another process should not be allowed to look at the same namespace as
its parent. However, there is at least one other example where
something like this exists and there is a counter for that. We can
learn from the counter.

Consider the LD_PRELOAD env variable. The dynamic linker ignores this
variable (well there are suitable exceptions clearly defined in the
ld.so manpage) the moment it sees that the child is a root process.
Thus, even though the parent had changed the effective behavior of
dynamic linking, the child doesn't suffer from the same.
In my prototype, currently, I am okay because the basis of the
functionality is an interposing library which is ignored if somebody
does an 'su'.
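
For reference, glibc calls this secure-execution mode, and it is keyed on
the AT_SECURE entry of the auxiliary vector (set for set-user-ID and
set-group-ID binaries, among other cases) rather than on the process
being root as such.  A sketch of how an interposing library might make
the same check explicitly; the policy function
may_use_user_namespace_map is hypothetical and not taken from the
prototype:

```c
#include <assert.h>
#include <sys/auxv.h>

/*
 * Returns 1 if the kernel flagged this process for secure execution
 * (for example, it was started from a set-user-ID binary), else 0.
 */
int in_secure_exec_mode(void)
{
	return getauxval(AT_SECURE) != 0;
}

/*
 * Hypothetical policy hook for an interposing library: only honour a
 * user-supplied namespace map when the process is not privileged.
 */
int may_use_user_namespace_map(void)
{
	return !in_secure_exec_mode();
}
```

A check like this only keeps the library out of privileged programs.  It
does not settle what those programs should see when handed a pathname
that was resolved in the user's own namespace.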

Also, the access control for the filesystem is still in the kernel.
What we change in the userspace is just the namespace and nothing
else. If you are fundamentally denied access to a file (from the
kernel) then you cannot access it no matter how you access it using
userspace libraries.

Plus, it is not very clear to me what you mean by 'tasks'. If that
is processes, then the child will inherit a separate copy of the
namespace from the parent (Copy-on-write of the data structs of the
user library probably... I'll have to think over this). So no race
conditions here. For multiple threads we will have to use mutual
exclusion on the 'userspace vfs' to keep race conditions out...
similar to many other things (like malloc et al).

Ritesh 

On 4/19/05, Al Viro <[EMAIL PROTECTED]> wrote:
> On Tue, Apr 19, 2005 at 11:02:53PM -0400, Ritesh Kumar wrote:
> > I am new to the list so please bear with me :-)
> >
> > I have also been thinking about filesystem namespaces which are
> > completely under the user's own control.
> 
> How do you deal with su(1) finding /etc/shadow in your namespace
> and seeing an entry for root there - with no password?
> 
> > I was also thinking of them
> > being inherited and changed along the process hierarchy.
> 
> We have that already...
> 
> > So a given
> > process is allowed to change its namespace any way it likes and map it
> > to its parent's namespace.
> 
> See above.
> 
> > More importantly, I was thinking in terms of having this entire
> > capability in the userspace itself. Instead of giving all the details
> > right here... let me redirect you to the page where I have set up the
> > prototype. You should be able to download the sample code (very small)
> > and browse through it to get an idea of what I had in mind. I also
> > have an article which explains what I was thinking. In essence, I was
> > thinking of splitting up the concepts of 1) accessing the filesystem on
> > the HDD/device and 2) setting up a namespace for accessing the files
> > into two separate concepts and bringing up 2) completely in the
> > userspace.
> > What do you think? I would like to have feedback on the idea.
> 
> That your library will leave any suid program seeing hell knows
> what.  Which gets very unpleasant when you are using it to do something
> with your files...
> 
> That's besides the issues with races when two tasks that share
> namespace attempt to change it.
> 
> > http://www.cs.unc.edu/~ritesh/projects/perprocessfs.html
> 


-- 
Rationality is the fundamental limitation to all human thought.


Re: [RFC][2.6 patch] Allow creation of new namespaces during mount system call

2005-04-19 Thread Al Viro
On Tue, Apr 19, 2005 at 06:53:29PM -0500, Eric Van Hensbergen wrote:
> On 4/19/05, Al Viro <[EMAIL PROTECTED]> wrote:
> > On Tue, Apr 19, 2005 at 05:13:32PM -0500, Eric Van Hensbergen wrote:
> > > The motivation behind this patch is to make private namespaces more
> > > accessible by allowing their creation at mount/bind time.
> > >
> > 
> > *UGH*
> > 
> > So what happens to those who happen to share task->fs with the parent?
> > 
> 
> Okay, I'll admit to being a bit too hasty with pushing out that patch
> - I was being particularly myopic looking for a solution only for a
> command-line mount.  Are you generally opposed to new namespace
> creation at mount time or just my slimy hack?  A shared task->fs seems
> like something which could be easily checked against and disallowed.

a) ability to create a private namespace without forking anything - sure,
that would be useful.  However, that's not something I would push into
mount(2) (already overloaded to hell and back).

There used to be a kinda-sorta agreement on a new syscall:
unshare(bitmap)
with arguments like those of clone(2).  That's not just for namespaces -
e.g. you might legitimately want to unshare VM in a thread and leave the
rest alone.  Or unshare ->fs (i.e. uncouple cwd from the rest of group).
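
A call of this shape was later merged into Linux as unshare(2), taking
the same CLONE_* flag names as clone(2).  A minimal sketch of the
namespace and ->fs cases, assuming a libc that exposes the wrapper; the
helper names are illustrative only:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>

/*
 * Give the calling task its own filesystem information (cwd, root,
 * umask) without forking.  Returns 0 on success or a negative errno.
 */
int unshare_fs_info(void)
{
	if (unshare(CLONE_FS) != 0)
		return -errno;
	return 0;
}

/*
 * Give the calling task a private copy of its mount namespace.
 * This requires CAP_SYS_ADMIN, so an unprivileged caller should
 * expect -EPERM.
 */
int unshare_mount_namespace(void)
{
	if (unshare(CLONE_NEWNS) != 0)
		return -errno;
	return 0;
}
```

Both helpers act on the caller only, so a parent shell is unaffected
unless it makes the call itself.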

Most of the code is already there - do_fork() has to do such stuff anyway.
So how about adding sys_unshare(flags) that would do that job?  Flags would
correspond to those of clone(2), except that all these guys would be
"what do we unshare" instead of "what do we leave shared".


b) I _really_ don't like the idea of messing with the parent.  Make it
a shell builtin if you want to affect shell behaviour; the same reason
why cd is a builtin and not an external command.


c) I would be really, really careful with implications of "let user
do whatever he wants" - that certainly should include bindings and
that can create heaps of fun for suid stuff.  More comments when
I get around to digging through FUSE thread...


Re: [RFC] User CLONE_NEWNS permission and rlimits

2005-04-19 Thread Al Viro
On Tue, Apr 19, 2005 at 11:02:53PM -0400, Ritesh Kumar wrote:
> I am new to the list so please bear with me :-)
> 
> I have also been thinking about filesystem namespaces which are
> completely under the user's own control.

How do you deal with su(1) finding /etc/shadow in your namespace
and seeing an entry for root there - with no password?

> I was also thinking of them
> being inherited and changed along the process hierarchy.

We have that already...

> So a given
> process is allowed to change its namespace any way it likes and map it
> to its parent's namespace.

See above.

> More importantly, I was thinking in terms of having this entire
> capability in the userspace itself. Instead of giving all the details
> right here... let me redirect you to the page where I have set up the
> prototype. You should be able to download the sample code (very small)
> and browse through it to get an idea of what I had in mind. I also
> have an article which explains what I was thinking. In essence, I was
> thinking of splitting up the concepts of 1) accessing the filesystem on
> the HDD/device and 2) setting up a namespace for accessing the files
> into two separate concepts and bringing up 2) completely in the
> userspace.
> What do you think? I would like to have feedback on the idea.

That your library will leave any suid program seeing hell knows
what.  Which gets very unpleasant when you are using it to do something
with your files...

That's besides the issues with races when two tasks that share
namespace attempt to change it.

> http://www.cs.unc.edu/~ritesh/projects/perprocessfs.html


Re: [RFC] User CLONE_NEWNS permission and rlimits

2005-04-19 Thread Ritesh Kumar
I am new to the list so please bear with me :-)

I have also been thinking about filesystem namespaces which are
completely under the user's own control. I was also thinking of them
being inherited and changed along the process hierarchy. So a given
process is allowed to change its namespace any way it likes and map it
to its parent's namespace.
More importantly, I was thinking in terms of having this entire
capability in the userspace itself. Instead of giving all the details
right here... let me redirect you to the page where I have set up the
prototype. You should be able to download the sample code (very small)
and browse through it to get an idea of what I had in mind. I also
have an article which explains what I was thinking. In essence, I was
thinking of splitting up the concepts of 1) accessing the filesystem on
the HDD/device and 2) setting up a namespace for accessing the files
into two separate concepts and bringing up 2) completely in the
userspace.
What do you think? I would like to have feedback on the idea.

http://www.cs.unc.edu/~ritesh/projects/perprocessfs.html

Ritesh

On 4/19/05, Ram <[EMAIL PROTECTED]> wrote:
> On Tue, 2005-04-19 at 18:24, Eric Van Hensbergen wrote:
> > This is again related to the FUSE permission thread, but a slightly
> > different idea and without a slimy hack patch.
> >
> > I really want to enable users to be able to create private namespaces,
> > but I want to try and avoid creating a vulnerability by allowing them
> > to abuse system resources.  It looks like this can be done by adding
> > RLIMIT_NEWNS as a per-user resource limit, and tracking the number of
> > private namespaces a user has in the user_struct.  Any time a user
> > creates a private namespace (either via clone with CLONE_NEWNS) or any
> > other method, this limit is checked and the per user count is
> > incremented (in copy_namespace).  When namespaces are cleaned up (in
> > __put_namespace), the per-user count is decremented.
> >
> > Is this sufficient to cover any exposure?  What's the correct solution
> > for the shared sub-trees RFC?  Should there be something similar for
> > user mounts/binds?
> 
> A new namespace in a shared subtree realm can create number-of-
> private-namespaces number of mounts or binds depending on the number of
> binds and mounts in the shared tree.
> 
> for example if  there were 10 shared vfsmounts in the original
> namespace, a new private namespace will duplicate 10 of these, and
> any mount or bind attempted in any of these vfsmounts will double the
> number of mounts and binds.
> 
> Hence probably you may want to keep a tab on the number of mounts and
> binds a user does, instead of keeping a tab on the number of namespaces
> a user creates.
> 
> RP
> 
> >
> >  -eric
> 


-- 
Rationality is the fundamental limitation to all human thought.


Re: [RFC] User CLONE_NEWNS permission and rlimits

2005-04-19 Thread Ram
On Tue, 2005-04-19 at 18:24, Eric Van Hensbergen wrote:
> This is again related to the FUSE permission thread, but a slightly
> different idea and without a slimy hack patch.
> 
> I really want to enable users to be able to create private namespaces,
> but I want to try and avoid creating a vulnerability by allowing them
> to abuse system resources.  It looks like this can be done by adding
> RLIMIT_NEWNS as a per-user resource limit, and tracking the number of
> private namespaces a user has in the user_struct.  Any time a user
> creates a private namespace (either via clone with CLONE_NEWNS) or any
> other method, this limit is checked and the per user count is
> incremented (in copy_namespace).  When namespaces are cleaned up (in
> __put_namespace), the per-user count is decremented.
> 
> Is this sufficient to cover any exposure?  What's the correct solution
> for the shared sub-trees RFC?  Should there be something similar for
> user mounts/binds?

A new namespace in a shared subtree realm can create number-of-
private-namespaces number of mounts or binds depending on the number of
binds and mounts in the shared tree.

for example if  there were 10 shared vfsmounts in the original
namespace, a new private namespace will duplicate 10 of these, and
any mount or bind attempted in any of these vfsmounts will double the
number of mounts and binds. 

Hence probably you may want to keep a tab on the number of mounts and
binds a user does, instead of keeping a tab on the number of namespaces
a user creates.
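
As a rough user-space model of that suggestion: charge the user for
vfsmounts rather than for namespaces, so that cloning a namespace costs
as many units as there are mounts in the copied tree.  The structure,
the limit and the choice of -ENOSPC are assumptions made for this
sketch, not existing kernel interfaces:

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical per-user mount budget; not an existing kernel structure. */
struct mount_acct {
	unsigned long mnt_count;	/* vfsmounts charged to the user */
	unsigned long mnt_limit;	/* assumed per-user ceiling */
};

/* Charge n vfsmounts at once, or fail without changing the count. */
int charge_mounts(struct mount_acct *u, unsigned long n)
{
	assert(u->mnt_count <= u->mnt_limit);
	if (n > u->mnt_limit - u->mnt_count)
		return -ENOSPC;
	u->mnt_count += n;
	return 0;
}

/* Cloning a namespace copies every vfsmount in the source tree. */
int charge_namespace_clone(struct mount_acct *u, unsigned long mounts_in_tree)
{
	return charge_mounts(u, mounts_in_tree);
}
```

With a limit of 25, two clones of a 10-mount tree succeed and a third is
refused, which a plain count of namespaces would not capture.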

RP

> 
>  -eric


[RFC] User CLONE_NEWNS permission and rlimits

2005-04-19 Thread Eric Van Hensbergen
This is again related to the FUSE permission thread, but a slightly
different idea and without a slimy hack patch.

I really want to enable users to be able to create private namespaces,
but I want to try and avoid creating a vulnerability by allowing them
to abuse system resources.  It looks like this can be done by adding
RLIMIT_NEWNS as a per-user resource limit, and tracking the number of
private namespaces a user has in the user_struct.  Any time a user
creates a private namespace (either via clone with CLONE_NEWNS) or any
other method, this limit is checked and the per user count is
incremented (in copy_namespace).  When namespaces are cleaned up (in
__put_namespace), the per-user count is decremented.
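
The accounting described above could look roughly like the following
user-space model.  struct user_acct stands in for the relevant part of
user_struct, and both the field names and the -EAGAIN return value are
assumptions made for illustration:

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for the per-user bookkeeping in user_struct. */
struct user_acct {
	unsigned long ns_count;	/* private namespaces currently owned */
	unsigned long ns_limit;	/* value of the proposed RLIMIT_NEWNS */
};

/* Called where a namespace is copied; refuses once the limit is hit. */
int charge_namespace(struct user_acct *u)
{
	if (u->ns_count >= u->ns_limit)
		return -EAGAIN;
	u->ns_count++;
	return 0;
}

/* Called where the last reference to a namespace is dropped. */
void uncharge_namespace(struct user_acct *u)
{
	assert(u->ns_count > 0);
	u->ns_count--;
}
```

In the kernel the counter would need to be atomic or protected by a
lock; the model leaves that out.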

Is this sufficient to cover any exposure?  What's the correct solution
for the shared sub-trees RFC?  Should there be something similar for
user mounts/binds?

 -eric


Re: Lazy block allocation and block_prepare_write?

2005-04-19 Thread Bryan Henderson
>> routines will fail - since they assume that page->private represents
>> bufferheads. So we need a better way to do this.
>
>They are not generic then. Some file systems store things completely
>different from buffer head ring in page->private.

I've seen these instances (and worked around them because I maintain
filesystem code that does in fact use private pages but does not use the
buffer cache to manage them).  I've always assumed they're just errors -- corners
that were cut in the original project to abstract out the buffer cache. 
Anyone who has a problem with them should just fix them.

>I think that one reasonable way to add generic support for journalling
>is to split struct address_space into two objects: lower layer that
>represents "file" (say, struct vm_file), in which pages are linearly
>ordered, and on top of this vm_cache (representing transaction) that
>keeps track of pages from various vm_file's. vm_file is embedded into
>inode, and vm_cache has a pointer to (the analog of) struct
>address_space_operations.
>
>vm_cache's are created by file system back-end as necessary (can be
>embedded into inode for non-journalled file systems). VM scanner and
>balance_dirty_pages() call vm_cache operations to do write-out.

That looks entirely reasonable to me, but should be combined with 
divorcing address spaces from files.  An address space (or the "lower 
level" above) should be a simple virtual memory object, managed by the 
virtual memory manager.  It can be used for a file data cache, but also 
for anything else you want to participate in system memory management / 
page replacement.

We're already practically there.  Address spaces are tied to files only in 
these ways:

  1) The code is in the fs/ directory.  It needs to be in mm/ .

  2) The "host" field is a struct inode *.  It needs to be void *.

  3) In a handful of places (and they keep moving), memory manager 
 code dereferences 'host' and looks in the inode.  I know these 
 are trivial connections, because I work around them by supplying
 a dummy inode (and sometimes a dummy superblock) with a few 
 fields filled in.

(Incidentally, _I_ am actually using address spaces for file caches; I 
just can't tie them to the files in the traditional way; the cache exists 
even when there are no inodes for the file).
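
To make point 2 concrete, here is a small user-space sketch of a cache
object whose "host" is opaque to the memory-manager side and is only
interpreted by the owner's callbacks.  All names (vm_object,
vm_object_ops, demo_owner) are invented for this sketch and do not
correspond to kernel structures:

```c
#include <assert.h>
#include <stddef.h>

struct vm_object;

/* Hypothetical operations table; the names are illustrative only. */
struct vm_object_ops {
	int (*writepage)(struct vm_object *obj, unsigned long index);
};

/* A cache object whose owner is opaque to the memory manager. */
struct vm_object {
	void *host;		/* owner's data, never dereferenced by mm code */
	const struct vm_object_ops *ops;
	unsigned long nr_dirty;
};

/* Memory-manager side: write one page back using only the ops table. */
int vm_object_writeback(struct vm_object *obj, unsigned long index)
{
	int err;

	assert(obj != NULL && obj->ops != NULL && obj->ops->writepage != NULL);
	err = obj->ops->writepage(obj, index);
	if (err == 0 && obj->nr_dirty > 0)
		obj->nr_dirty--;
	return err;
}

/* Example owner that is not an inode: it only records what was written. */
struct demo_owner {
	unsigned long last_index;
	unsigned long writes;
};

static int demo_writepage(struct vm_object *obj, unsigned long index)
{
	struct demo_owner *owner = obj->host;	/* only the owner interprets host */

	owner->last_index = index;
	owner->writes++;
	return 0;
}

const struct vm_object_ops demo_ops = { .writepage = demo_writepage };
```

Nothing in vm_object_writeback() needs to know whether the host is an
inode, which is the property a dummy inode currently has to fake.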

--
Bryan Henderson  IBM Almaden Research Center
San Jose CA  Filesystems



Re: [RFC][2.6 patch] Allow creation of new namespaces during mount system call

2005-04-19 Thread Eric Van Hensbergen
On 4/19/05, Al Viro <[EMAIL PROTECTED]> wrote:
> On Tue, Apr 19, 2005 at 05:13:32PM -0500, Eric Van Hensbergen wrote:
> > The motivation behind this patch is to make private namespaces more
> > accessible by allowing their creation at mount/bind time.
> >
> 
> *UGH*
> 
> So what happens to those who happen to share task->fs with the parent?
> 

Okay, I'll admit to being a bit too hasty with pushing out that patch
- I was being particularly myopic looking for a solution only for a
command-line mount.  Are you generally opposed to new namespace
creation at mount time or just my slimy hack?  A shared task->fs seems
like something which could be easily checked against and disallowed.

   -eric


Re: [RFC][2.6 patch] Allow creation of new namespaces during mount system call

2005-04-19 Thread Al Viro
On Tue, Apr 19, 2005 at 05:13:32PM -0500, Eric Van Hensbergen wrote:
> The motivation behind this patch is to make private namespaces more
> accessible by allowing their creation at mount/bind time.
> 
> Based on some of the FUSE permissions discussions, I wanted to check
> into modifying the mount system calls -- adding a flag which created a
> new namespace for the resulting mount.  I quickly discovered that what
> I typically wanted (for the case of running a mount command) was to
> actually create a new namespace for the parent thread (typically the
> shell), inherit that namespace, and then perform the mount.
> 
> It's not clear to me that both options are needed, cloning the parent's
> namespace seems to be what you want most of the time.
> 
> In order to minimize code impact I split the copy_namespace function,
> perhaps the right long term solution is to change its interface to
> accommodate the changes.  Things look a bit more invasive as I moved
> the copy_namespace function above do_mount.  The patch follows:

*UGH*

So what happens to those who happen to share task->fs with the parent?


[RFC][2.6 patch] Allow creation of new namespaces during mount system call

2005-04-19 Thread Eric Van Hensbergen
The motivation behind this patch is to make private namespaces more
accessible by allowing their creation at mount/bind time.

Based on some of the FUSE permissions discussions, I wanted to check
into modifying the mount system calls -- adding a flag which created a
new namespace for the resulting mount.  I quickly discovered that what
I typically wanted (for the case of running a mount command) was to
actually create a new namespace for the parent thread (typically the
shell), inherit that namespace, and then perform the mount.

It's not clear to me that both options are needed, cloning the parent's
namespace seems to be what you want most of the time.

In order to minimize code impact I split the copy_namespace function,
perhaps the right long term solution is to change its interface to
accommodate the changes.  Things look a bit more invasive as I moved
the copy_namespace function above do_mount.  The patch follows:

 fs/namespace.c     |  193 +
 include/linux/fs.h |    2
 2 files changed, 108 insertions(+), 87 deletions(-)

--- linux-2.5/include/linux/fs.h	2005-04-19 17:02:28.530152496 -0500
+++ newns-2.5/include/linux/fs.h	2005-04-19 17:03:52.619368992 -0500
@@ -103,6 +103,8 @@ extern int dir_notify_enable;
 #define MS_REC		16384
 #define MS_VERBOSE	32768
 #define MS_POSIXACL	(1<<16)	/* VFS does not apply the umask */
+#define MS_CLONE_NEWNS	(1<<17)	/* clone my namespace before mount */
+#define MS_CLONE_NEWPNS	(1<<18)	/* clone my & my parent namespace */
 #define MS_ACTIVE	(1<<30)
 #define MS_NOUSER	(1<<31)
 
--- linux-2.5/fs/namespace.c	2005-04-19 17:02:14.551277608 -0500
+++ newns-2.5/fs/namespace.c	2005-04-19 17:03:38.227556880 -0500
@@ -991,6 +991,104 @@ int copy_mount_options(const void __user
 	return 0;
 }
 
+int update_namespace(struct task_struct *tsk, struct namespace *new_ns )
+{
+	struct namespace *namespace = tsk->namespace;
+	struct vfsmount *rootmnt = NULL, *pwdmnt = NULL, *altrootmnt = NULL;
+	struct fs_struct *fs = tsk->fs;
+	struct vfsmount *p, *q;
+
+	if (!namespace)
+		return 0;
+
+	get_namespace(namespace);
+
+	if (!capable(CAP_SYS_ADMIN)) {
+		put_namespace(namespace);
+		return -EPERM;
+	}
+
+	down_write(&tsk->namespace->sem);
+	if(!new_ns) {
+		new_ns = kmalloc(sizeof(struct namespace), GFP_KERNEL);
+		if (!new_ns)
+			goto out;
+
+		atomic_set(&new_ns->count, 1);
+		init_rwsem(&new_ns->sem);
+		INIT_LIST_HEAD(&new_ns->list);
+
+		/* First pass: copy the tree topology */
+		new_ns->root = copy_tree(namespace->root, namespace->root->mnt_root);
+		if (!new_ns->root) {
+			up_write(&tsk->namespace->sem);
+			kfree(new_ns);
+			goto out;
+		}
+		spin_lock(&vfsmount_lock);
+		list_add_tail(&new_ns->list, &new_ns->root->mnt_list);
+		spin_unlock(&vfsmount_lock);
+	} else
+		get_namespace(new_ns);
+
+	/*
+	 * Second pass: switch the tsk->fs->* elements and mark new vfsmounts
+	 * as belonging to new namespace.  We have already acquired a private
+	 * fs_struct, so tsk->fs->lock is not needed.
+	 */
+	p = namespace->root;
+	q = new_ns->root;
+	while (p) {
+		q->mnt_namespace = new_ns;
+		if (fs) {
+			if (p == fs->rootmnt) {
+				rootmnt = p;
+				fs->rootmnt = mntget(q);
+			}
+			if (p == fs->pwdmnt) {
+				pwdmnt = p;
+				fs->pwdmnt = mntget(q);
+			}
+			if (p == fs->altrootmnt) {
+				altrootmnt = p;
+				fs->altrootmnt = mntget(q);
+			}
+		}
+		p = next_mnt(p, namespace->root);
+		q = next_mnt(q, new_ns->root);
+	}
+	up_write(&tsk->namespace->sem);
+
+	tsk->namespace = new_ns;
+
+	if (rootmnt)
+		mntput(rootmnt);
+	if (pwdmnt)
+		mntput(pwdmnt);
+	if (altrootmnt)
+		mntput(altrootmnt);
+
+	put_namespace(namespace);
+	return 0;
+
+out:
+	put_namespace(namespace);
+	return -ENOMEM;
+}
+
+int copy_namespace(int flags, struct task_struct *tsk)
+{
+	if (!tsk->namespace)
+		return 0;
+
+	if (!(flags & CLONE_NEWNS)) {
+		get_namespace(tsk->namespace);
+		return 0;
+	}
+
+	return update_namespace( tsk, NULL );
+}
+
 /*
  * Flags is a 32-bit value that allows up to 31 non-fs dependent fl

Re: Lazy block allocation and block_prepare_write?

2005-04-19 Thread Martin Jambor
Hi,

I see the scope of the discussion here got quickly beyond the scope
of my first posting :-) Anyway, the filesystem we're implementing is a
variant of a classic log-structured filesystem which is quite similar
to unix filesystems in many aspects (like inodes and stuff) and we
will have 0-1 (sort of) transactions so as far as this issue is
concerned our case is probably very similar to ext3 delayed
allocation.

On 4/19/05, Badari Pulavarty <[EMAIL PROTECTED]> wrote:
> The idea is to "reserve" a block at the prepare/commit write instead
> of allocating the block. Do the actual allocation in writepage().

Exactly.

> Here are the issues:
> 
> 
> 1) Currently none of the generic helper routines can handle this.
> We need to add support to do these, but still somehow make the
> routines generic enough for everyone's use.

I'm quite happy about most of them. I can't see how we could use any
generic form of writepage(s) as we write stuff in a quite different
way from almost anybody else but all the others except
block_prepare_write do pretty much exactly what we need (if I have
not missed something).

> 2) There is no easy way to find out if we "reserved" a block or
> not in writepage() correctly. There are 2 paths to writepage().
> 
> sys_write() -> prepare/commit()
> and later sync() -> writepage()
> 
> mmap() -> touch a page()
> and later --> writepage()
> 
> In order to do the correct accounting, we need to mark a page
> to indicate if we reserved a block or not. One way to do this,
> to use page->private to indicate this. But then, all the generic
> routines will fail - since they assume that page->private represents
> bufferheads. So we need a better way to do this.

I didn't hope for a special bit in struct page so I wanted to simply
fake the page/buffer mapping somehow. Since we don't really care
whether a page is mapped or reserved as long as it is at least one of
these when actually writing it (we write stuff to different places
from where we have read it from), the PG_mappedtodisk is fine for us
as long as no other kernel code thinks that having it set means we
also have buffers which point to meaningful positions on the device
because we don't. Is that the case?

Of course, having a PG_RESERVED flag would be a nice and clean thing
to use and we would be more than happy to do so.

> 3) We need add hooks into filesystem specific calls from these
> generic routines to handle "journaling mode" requirements

Our fs is basically one big journal so we don't need any of these. Or
at least I don't see any need for it at the moment.

> So, what are your requirements ?  I am looking for a common
> way to combine all the requirements and come out with a
> saner "generic" routines to handle these.

I'm happy with most generic functions. We need to implement
writepage(s) ourselves no matter what, the only problem is
block_prepare_write and I can currently only see two options for us:

1) Implement it ourselves and use a flag in the struct page to mark it reserved.

2) Use block_prepare_write but enable the get_block function to mark
an individual buffer as reserved so that it is treated as mapped (can
be dirty and stuff) but no code assumes it is located somewhere on the
disk (for example block_prepare_write would not call
unmap_underlying_metadata).

I think we'll go for the first method, but the second would make life
easier for filesystems which can have pages consisting of both mapped
and reserved blocks.
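
Either option comes down to a three-state block with reservation
accounting: space is promised when the page is prepared for writing and
only turned into a disk location at writepage time.  A user-space model
of that state machine follows; the enum, the struct and the error codes
are assumptions for illustration and not kernel interfaces:

```c
#include <assert.h>
#include <errno.h>

enum blk_state { BLK_UNMAPPED, BLK_RESERVED, BLK_MAPPED };

/* Hypothetical free-space accounting for a delayed-allocation scheme. */
struct space_acct {
	unsigned long free_blocks;	/* neither reserved nor allocated */
	unsigned long reserved_blocks;	/* promised to dirty pages */
};

/* prepare_write time: promise space, but pick no disk location yet. */
int reserve_block(struct space_acct *s, enum blk_state *st)
{
	if (*st != BLK_UNMAPPED)
		return 0;		/* already reserved or mapped */
	if (s->free_blocks == 0)
		return -ENOSPC;
	s->free_blocks--;
	s->reserved_blocks++;
	*st = BLK_RESERVED;
	return 0;
}

/* writepage time: turn the reservation into a real allocation. */
int allocate_reserved_block(struct space_acct *s, enum blk_state *st)
{
	if (*st == BLK_MAPPED)
		return 0;
	if (*st != BLK_RESERVED)
		return -EINVAL;		/* dirtied without a reservation */
	assert(s->reserved_blocks > 0);
	s->reserved_blocks--;
	*st = BLK_MAPPED;
	return 0;
}
```

The -EINVAL branch corresponds to the mmap path quoted earlier, where a
page can reach writepage without having gone through prepare/commit.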

Thank you very much for your reply, the whole thread has been well
worth reading.

Martin Jambor


Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-19 Thread Eric Van Hensbergen
Somewhat related question for Viro/the group:

Why is CLONE_NEWNS considered a privileged operation?  Would placing
limits on the number of private namespaces a user can own solve any
resource concerns or is there something more nefarious I'm missing?


Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-19 Thread Bodo Eggert
On Tue, 19 Apr 2005, Eric Van Hensbergen wrote:
> On 4/19/05, Bodo Eggert <[EMAIL PROTECTED]> wrote:

> > Allowing user mounts with no* should be always ok (no config needed
> > besides the ulimit), and mounting specified files to defined locations
> > is already supported by fstab.
> >
> 
> Do folks think that the limits should be per-user or per-process for
> user-mounts, what about separate limits for # of private namespaces
> and # of mounts?

Per-user.

> The fstab support doesn't seem to provide enough flexibility for
> certain situations, say I want to support mounting any remote file
> system, as long as its in the user's private hierarchy?
[...]

The dir is owned by the user, therefore it's allowed with no*.
-- 
Top 100 things you don't want the sysadmin to say:
11. Can you get VMS for this Sparc thingy?


Re: Lazy block allocation and block_prepare_write?

2005-04-19 Thread Nikita Danilov
Mingming Cao <[EMAIL PROTECTED]> writes:

> On Tue, 2005-04-19 at 19:55 +0400, Nikita Danilov wrote:
>> Badari Pulavarty <[EMAIL PROTECTED]> writes:
>> 
>> > On Tue, 2005-04-19 at 04:22, Nikita Danilov wrote:
>> >> Badari Pulavarty <[EMAIL PROTECTED]> writes:
>> >> 
>> >> [...]
>> >> 
>> >> >
>> >> > Yes. Its possible to do what you want to. I am currently working on
>> >> > adding "delayed allocation" support to ext3. As part of that, We
>> >> 
>> >> As you most likely already know, Alex Thomas already implemented delayed
>> >> block allocation for ext3.
>> >
>> > Yep. I reviewed Alex Thomas patches for delayed allocation. He handled
>> > all the cases in his code and did NOT use any mpage* routines to do
>> > the work. I was hoping to change the mpage infrastructure to handle
>> > these, so that every filesystem doesn't have to do their thing.
>> >
>> 
>> Just keep in mind that filesystem != ext3. :-) Generic support makes
>> sense only when it is usable by multiple file systems. This is not
>> always possible, e.g., there is no "generic block allocator" for
>> precisely the same reason: disk space allocation policies are tightly
>> intertwined with the rest of file system internals.
>> 
>
> This generic support should be useful for ext2 and xfs. From delayed

But it won't work for reiser4, which allocates blocks _across_ multiple
files. E.g., if many files were created in the same directory,
allocation (performed just before write-out) will assign block numbers
so that files are ordered on disk according to the readdir order
(with each file body being an interval in that ordering). This is done
by arranging all dirty blocks of a given transaction according to some
"ideal" ordering and then trying to map this ordering onto disk blocks.

As you see, in this case allocation is not done on an inode-by-inode basis
at all: instead, delayed allocation is done at the transaction level of
granularity, and I am trying to point out that this is a natural thing for
a journalled file system to do.
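
A minimal user-space sketch of transaction-wide allocation, for illustration only. It assumes the "ideal" ordering is simply (readdir position of the file, block offset within the file) and that a contiguous free run is available; `toy_dirty_block` and `toy_allocate` are hypothetical names and this is not how reiser4 is actually implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* One dirty block of a transaction; all names here are hypothetical. */
struct toy_dirty_block {
	unsigned dir_order;		/* owning file's position in readdir order */
	unsigned long offset;		/* block offset within that file */
	unsigned long disk_block;	/* filled in by toy_allocate() */
};

/* Order by file first, then by offset within the file. */
static int toy_cmp(const void *a, const void *b)
{
	const struct toy_dirty_block *x = a, *y = b;

	if (x->dir_order != y->dir_order)
		return x->dir_order < y->dir_order ? -1 : 1;
	if (x->offset != y->offset)
		return x->offset < y->offset ? -1 : 1;
	return 0;
}

/*
 * Sort every dirty block of the transaction into the "ideal" order, then
 * map that order onto a contiguous run of disk blocks beginning at start.
 * A real allocator would also have to cope with fragmented free space.
 */
void toy_allocate(struct toy_dirty_block *blks, size_t n, unsigned long start)
{
	if (n == 0)
		return;
	assert(blks);
	qsort(blks, n, sizeof(*blks), toy_cmp);
	for (size_t i = 0; i < n; i++)
		blks[i].disk_block = start + i;
}
```

The point of the sketch is that the input is the set of dirty blocks of a whole transaction, spanning several files, rather than the pages of one inode.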

The same goes for write-out: in ext3 there is only one "active"
transaction at any moment, and this means that ->writepages() calls can
go in arbitrary order, but for a file system type with multiple active
transactions that can be committed separately, the order of ->writepages()
calls has to follow the ordering between transactions. Again, this means
that write-out should be transaction-based rather than inode-based.

If we want really generic support for journalling and
delayed allocation, mpage_* functions are the wrong level. Instead, a
proper notion of transaction has to be introduced, and the file system IO
and disk space allocation interfaces adjusted appropriately.

Nikita.


Re: Lazy block allocation and block_prepare_write?

2005-04-19 Thread Mingming Cao
On Tue, 2005-04-19 at 19:55 +0400, Nikita Danilov wrote:
> Badari Pulavarty <[EMAIL PROTECTED]> writes:
> 
> > On Tue, 2005-04-19 at 04:22, Nikita Danilov wrote:
> >> Badari Pulavarty <[EMAIL PROTECTED]> writes:
> >> 
> >> [...]
> >> 
> >> >
> >> > Yes. Its possible to do what you want to. I am currently working on
> >> > adding "delayed allocation" support to ext3. As part of that, We
> >> 
> >> As you most likely already know, Alex Thomas already implemented delayed
> >> block allocation for ext3.
> >
> > Yep. I reviewed Alex Thomas patches for delayed allocation. He handled
> > all the cases in his code and did NOT use any mpage* routines to do
> > the work. I was hoping to change the mpage infrastructure to handle
> > these, so that every filesystem doesn't have to do their thing.
> >
> 
> Just keep in mind that filesystem != ext3. :-) Generic support makes
> sense only when it is usable by multiple file systems. This is not
> always possible, e.g., there is no "generic block allocator" for
> precisely the same reason: disk space allocation policies are tightly
> intertwined with the rest of file system internals.
> 

This generic support should be useful for ext2 and xfs. From the delayed
allocation point of view, it should not be aware of any filesystem-specific
block allocation policies, and it should not care. :)  It simply
gathers all pages that need to be mapped to blocks on disk, and calls the
filesystem's get_blocks() callback function, which takes care of the
filesystem-specific mapping of multiple blocks.

The current get_blocks() function for ext3 simply calls
ext3_get_block() in a loop.  I am trying to add a real ext3_get_blocks() to
reduce the CPU cost, reduce the number of metadata updates, and increase the
chance of getting contiguous blocks on disk.
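
The "loop over the single-block mapper" strategy can be sketched in user space as follows. This is an assumption-laden model: `toy_get_block_t`, `toy_extent`, `toy_get_blocks` and `toy_sample_map` are hypothetical names, and the sample mapping is made up for the example. Each iteration pays for a full single-block lookup, which is the cost a real multi-block implementation would try to avoid.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical single-block mapper: returns a disk block or -1 for a hole. */
typedef long (*toy_get_block_t)(unsigned long logical);

struct toy_extent {
	unsigned long logical;	/* first logical block */
	long physical;		/* first on-disk block, -1 if none */
	unsigned long len;	/* number of contiguous blocks found */
};

/*
 * Map up to count logical blocks starting at logical, one lookup per
 * block, stopping at the first hole or the first block that is not
 * contiguous on disk with the previous ones.
 */
struct toy_extent toy_get_blocks(toy_get_block_t get_block,
				 unsigned long logical, unsigned long count)
{
	struct toy_extent ext = { logical, -1, 0 };

	assert(get_block);
	for (unsigned long i = 0; i < count; i++) {
		long phys = get_block(logical + i);

		if (phys < 0)
			break;			/* hole or error */
		if (ext.len == 0)
			ext.physical = phys;
		else if (phys != ext.physical + (long)ext.len)
			break;			/* not contiguous on disk */
		ext.len++;
	}
	return ext;
}

/* Made-up layout: blocks 0..3 at 100..103, blocks 4..7 at 204..207. */
long toy_sample_map(unsigned long logical)
{
	if (logical < 4)
		return 100 + (long)logical;
	if (logical < 8)
		return 200 + (long)logical;
	return -1;
}
```

Calling `toy_get_blocks(toy_sample_map, 0, 8)` stops after four blocks because the fifth is not adjacent on disk; the caller would then issue a second call starting at logical block 4.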




Re: Lazy block allocation and block_prepare_write?

2005-04-19 Thread Badari Pulavarty
Alex Tomas wrote:
> Nikita Danilov (ND) writes:
>
>  >>> > In order to do the correct accounting, we need to mark a page
>  >>> > to indicate if we reserved a block or not. One way to do this,
>  >>> > to use page->private to indicate this. But then, all the generic
>  >>>
>  >>> I believe one can use PG_mappedtodisk bit in page->flags for this
>  >>> purpose. There was old Andrew Morton's patch that introduced new bit
>  >>> (PG_delalloc?) for this purpose.
>  >>
>  >> That would be good. But I don't feel like asking for a bit in page
>  >> if there is a way to get around it.
>
>  ND> Clarification: PG_mappedtodisk is already here, it seems you can reuse
>  ND> this already existing bit to implement delayed allocation support.
>
> I think we need another one, because mappedtodisk != reserved. we could use
> mappedtodisk, but this means in ->commit_write() we'd need to check that one
> more time (first time in ->prepare_write())

Yep. We need one more bit to indicate that we reserved a block for this page.
Another option I was thinking of, to avoid that, is "reserving" a block
when a mapped page changes from read -> write. Andrew's -mm tree has a
patch to give us a notification when that happens.

Thanks,
Badari


Re: Lazy block allocation and block_prepare_write?

2005-04-19 Thread Alex Tomas
> Nikita Danilov (ND) writes:

 >>> > In order to do the correct accounting, we need to mark a page
 >>> > to indicate if we reserved a block or not. One way to do this,
 >>> > to use page->private to indicate this. But then, all the generic
 >>> 
 >>> I believe one can use PG_mappedtodisk bit in page->flags for this
 >>> purpose. There was old Andrew Morton's patch that introduced new bit
 >>> (PG_delalloc?) for this purpose.
 >> 
 >> That would be good. But I don't feel like asking for a bit in page
 >> if there is a way to get around it.

 ND> Clarification: PG_mappedtodisk is already here, it seems you can reuse
 ND> this already existing bit to implement delayed allocation support.

I think we need another one, because mappedtodisk != reserved. we could use
mappedtodisk, but this means in ->commit_write() we'd need to check that one
more time (first time in ->prepare_write())
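
A small user-space model of why a separate bit helps the accounting. Everything here is hypothetical (`TOY_PG_RESERVED`, `toy_reserve_page`, `toy_writepage`, the flag values); it only illustrates that "reserved" and "mapped to disk" are distinct states with a one-way transition between them.

```c
#include <assert.h>

/* Hypothetical page flag bits; the values and the RESERVED bit are made up. */
enum {
	TOY_PG_MAPPEDTODISK = 1 << 0,	/* the page's blocks exist on disk */
	TOY_PG_RESERVED     = 1 << 1,	/* space reserved, nothing allocated */
};

struct toy_page { unsigned flags; };

struct toy_fs {
	long free_blocks;	/* blocks not yet allocated */
	long reserved_blocks;	/* promised to dirty pages, still counted free */
};

/* prepare_write step: reserve space once for a page with no backing. */
int toy_reserve_page(struct toy_fs *fs, struct toy_page *pg)
{
	assert(fs && pg);
	if (pg->flags & (TOY_PG_MAPPEDTODISK | TOY_PG_RESERVED))
		return 0;	/* nothing more to account for */
	if (fs->free_blocks - fs->reserved_blocks <= 0)
		return -1;	/* would be -ENOSPC */
	fs->reserved_blocks++;
	pg->flags |= TOY_PG_RESERVED;
	return 0;
}

/* write-out step: turn the reservation into a real allocation. */
void toy_writepage(struct toy_fs *fs, struct toy_page *pg)
{
	assert(fs && pg);
	if (pg->flags & TOY_PG_RESERVED) {
		fs->reserved_blocks--;
		fs->free_blocks--;
		pg->flags &= ~(unsigned)TOY_PG_RESERVED;
		pg->flags |= TOY_PG_MAPPEDTODISK;
	}
}
```

With only one bit, the second call to the reserve step could not tell "already reserved" from "not backed at all" without re-deriving it some other way.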

thanks, Alex



Re: Lazy block allocation and block_prepare_write?

2005-04-19 Thread Nikita Danilov
Badari Pulavarty <[EMAIL PROTECTED]> writes:

> On Tue, 2005-04-19 at 04:22, Nikita Danilov wrote:
>> Badari Pulavarty <[EMAIL PROTECTED]> writes:
>> 
>> [...]
>> 
>> >
>> > Yes. Its possible to do what you want to. I am currently working on
>> > adding "delayed allocation" support to ext3. As part of that, We
>> 
>> As you most likely already know, Alex Thomas already implemented delayed
>> block allocation for ext3.
>
> Yep. I reviewed Alex Thomas patches for delayed allocation. He handled
> all the cases in his code and did NOT use any mpage* routines to do
> the work. I was hoping to change the mpage infrastructure to handle
> these, so that every filesystem doesn't have to do their thing.
>

Just keep in mind that filesystem != ext3. :-) Generic support makes
sense only when it is usable by multiple file systems. This is not
always possible, e.g., there is no "generic block allocator" for
precisely the same reason: disk space allocation policies are tightly
intertwined with the rest of file system internals.

>
>> 
>> >
>> > In order to do the correct accounting, we need to mark a page
>> > to indicate if we reserved a block or not. One way to do this,
>> > to use page->private to indicate this. But then, all the generic
>> 
>> I believe one can use PG_mappedtodisk bit in page->flags for this
>> purpose. There was old Andrew Morton's patch that introduced new bit
>> (PG_delalloc?) for this purpose.
>
> That would be good. But I don't feel like asking for a bit in page
> if there is a way to get around it.

Clarification: PG_mappedtodisk is already here, it seems you can reuse
this already existing bit to implement delayed allocation support.

>

[...]

>> >
> Need to think some more. I guess you thought about this more than you
> do :)
>
> Thanks,
> Badari
>

Nikita.


Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-19 Thread Eric Van Hensbergen
On 4/19/05, Bodo Eggert <[EMAIL PROTECTED]> wrote:
> >
> > Well, that would kinda be the intent behind the permissions file  --
> > it can specify what restricted set of images/devices/whatever the user
> > can mount, I suppose the sensible thing would be to always enforce
> > nosuid and nsgid, but I'd rather keep these as the default version of
> > options (allowing admins to shoot themselves in the foot perhaps, but
> > in the single-user workstation case, is seems like there's less reason
> > to be so paranoid).
> 
> I think you shouldn't help the admins by creating shoes with target marks.
>

Fair enough.  Since I don't really have any cases I can think of that
require this sort of behavior, I'll back off on allowing user mounts
with suid or sgid enabled.

> 
> Allowing user mounts with no* should always be OK (no config needed
> besides the ulimit), and mounting specified files to defined locations
> is already supported by fstab.
>

Do folks think that the limits should be per-user or per-process for
user-mounts, what about separate limits for # of private namespaces
and # of mounts?

The fstab support doesn't seem to provide enough flexibility for
certain situations: say I want to support mounting any remote file
system, as long as it's in the user's private hierarchy?   What if I
want users to be able to mount FUSE, v9fs, etc. user-space file
systems, but only in a private namespace and only in their private
hierarchy?  Or are these situations which you think should "always be
okay" as long as nosuid and nogid (and newns?) are implicit?
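
One way to picture the "separate limits, charged per user" variant of the question is the user-space sketch below. It is purely illustrative: `toy_user` and the charge/uncharge helpers are hypothetical names, and nothing here corresponds to an existing kernel interface or rlimit.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-user accounting with two independent limits. */
struct toy_user {
	unsigned namespaces, max_namespaces;
	unsigned mounts, max_mounts;
};

/* Charge one private namespace to the user; false if over the limit. */
bool toy_charge_namespace(struct toy_user *u)
{
	assert(u);
	if (u->namespaces >= u->max_namespaces)
		return false;
	u->namespaces++;
	return true;
}

/* Charge one mount to the user; false if over the limit. */
bool toy_charge_mount(struct toy_user *u)
{
	assert(u);
	if (u->mounts >= u->max_mounts)
		return false;
	u->mounts++;
	return true;
}

/* Give the mount back on umount. */
void toy_uncharge_mount(struct toy_user *u)
{
	assert(u && u->mounts > 0);
	u->mounts--;
}
```

Because the counters live in the per-user structure, every process of that user draws from the same budget, which is what distinguishes this from a per-process limit.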

 -eric


Re: Lazy block allocation and block_prepare_write?

2005-04-19 Thread Alex Tomas
> Badari Pulavarty (BP) writes:

 >> 2) Andrew proposed the excelent solution

 BP> Well, I wasn't sure how heavy thats going to be. He was recommending
 BP> that we flush all dirty pages from all inodes for each transaction
 BP> commit. Isn't it ?

this is exactly what ext3 does when mounted with data=ordered:
each page that write(2) touches goes onto a jbd list, and the commit thread
flushes them all. the only reason we can't use the existing sync()
infrastructure is that we aren't permitted to touch metadata (in
our case, to allocate blocks) during commit. so one more
flag is added to wbc to tell sync to skip not-yet-allocated pages.
I like this a lot!
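
A minimal user-space sketch of the extra flag, under stated assumptions: `toy_wbc` stands in for struct writeback_control, and `skip_unallocated` is a hypothetical field name, not one taken from any posted patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_page {
	bool dirty;
	bool allocated;		/* has an on-disk block already */
};

/* Hypothetical stand-in for struct writeback_control. */
struct toy_wbc {
	bool skip_unallocated;	/* set by the commit path */
};

/* Flush dirty pages and return how many were written. */
size_t toy_flush(struct toy_page *pages, size_t n, const struct toy_wbc *wbc)
{
	size_t written = 0;

	assert(wbc && (pages || n == 0));
	for (size_t i = 0; i < n; i++) {
		struct toy_page *pg = &pages[i];

		if (!pg->dirty)
			continue;
		if (!pg->allocated) {
			if (wbc->skip_unallocated)
				continue;	/* commit: must not touch metadata */
			pg->allocated = true;	/* normal path: allocate now */
		}
		pg->dirty = false;
		written++;
	}
	return written;
}
```

Pages skipped during commit stay dirty and are picked up by a later, ordinary flush that is allowed to allocate.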

thanks, Alex



Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-19 Thread Miklos Szeredi
> 
> I think you shouldn't help the admins by creating shoes with target marks.
> 
> Allowing user mounts with no* should always be OK (no config needed
> besides the ulimit), and mounting specified files to defined locations
> is already supported by fstab.

I tend to agree.  It should be obvious which sort of mounts are safe
and which are not.  The exceptions can go into fstab.

In a private namespace environment, bind mounts (nodev,nosuid) should
be OK.  Network filesystems (with limitations on the ports used) are
OK as well.  Disk filesystems are usually not safe for users to mount,
because they have not been tested and verified against untrusted sources.

Miklos


Re: Lazy block allocation and block_prepare_write?

2005-04-19 Thread Badari Pulavarty
On Tue, 2005-04-19 at 08:04, Alex Tomas wrote:
> > Badari Pulavarty (BP) writes:
>  
>  >> you can introduce one more bit to page->flags
> 
>  BP> Agreed. I was hoping to avoid it as much as I can.
> 
> well, you're gonna modify mpage api anyway ...

Okay, I will give it a serious look then. Last time I tried to
go near page->flags, I got slapped :( This time we have a
valid reason, I guess :)

> 
>  BP> What I meant by jounalling mode is that - after the pages are submitted
>  BP> for IO, we need some way of waiting for the IOs to finish inorder to
>  BP> guarantee the ordering ? Is this not needed for anything other than
>  BP> ext3 ?
> 
> 1) i'm not sure anyone else supports this

Fair enough. If no one needs it, let's keep the interface simple.

> 2) Andrew proposed the excelent solution

Well, I wasn't sure how heavy that's going to be. He was recommending
that we flush all dirty pages from all inodes for each transaction
commit, wasn't he?

Thanks,
Badari



Re: Lazy block allocation and block_prepare_write?

2005-04-19 Thread Alex Tomas
> Badari Pulavarty (BP) writes:
 
 >> you can introduce one more bit to page->flags

 BP> Agreed. I was hoping to avoid it as much as I can.

well, you're gonna modify mpage api anyway ...

 BP> What I meant by jounalling mode is that - after the pages are submitted
 BP> for IO, we need some way of waiting for the IOs to finish inorder to
 BP> guarantee the ordering ? Is this not needed for anything other than
 BP> ext3 ?

1) i'm not sure anyone else supports this
2) Andrew proposed an excellent solution

thanks, Alex



Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-19 Thread Bodo Eggert
On Tue, 19 Apr 2005, Eric Van Hensbergen wrote:
> On 4/17/05, Bodo Eggert <[EMAIL PROTECTED]>

> > > I was thinking about this a while back and thought having a user-mount
> > > permissions file might be the right way to address lots of these
> > > issues.  Essentially it would contain information about what
> > > users/groups were allowed to mount what sources to what destinations
> > > and with what mandatory options.
> > 
> > Users being able to mount random fs containing suid or device nodes
> > are root whenever they want to. If you want to mount with dev or suid,
> > use sudo and restrict the mount to a limited set of images/devices/whatever.
> 
> Well, that would kinda be the intent behind the permissions file  --
> it can specify what restricted set of images/devices/whatever the user
> can mount, I suppose the sensible thing would be to always enforce
> nosuid and nsgid, but I'd rather keep these as the default version of
> options (allowing admins to shoot themselves in the foot perhaps, but
> in the single-user workstation case, is seems like there's less reason
> to be so paranoid).

I think you shouldn't help the admins by creating shoes with target marks.

Allowing user mounts with no* should always be OK (no config needed
besides the ulimit), and mounting specified files to defined locations
is already supported by fstab.
-- 
Top 100 things you don't want the sysadmin to say:
6. We prefer not to change the root password, it's an nice easy one


Re: Lazy block allocation and block_prepare_write?

2005-04-19 Thread Badari Pulavarty
On Tue, 2005-04-19 at 03:10, Alex Tomas wrote:
> > Badari Pulavarty (BP) writes:
> 
>  BP> In order to do the correct accounting, we need to mark a page
>  BP> to indicate if we reserved a block or not. One way to do this,
>  BP> to use page->private to indicate this. But then, all the generic
>  BP> routines will fail - since they assume that page->private represents
>  BP> bufferheads. So we need a better way to do this.
> 
> you can introduce one more bit to page->flags

Agreed. I was hoping to avoid it as much as I can.

> 
>  BP> 3) We need add hooks into filesystem specific calls from these
>  BP> generic routines to handle "journaling mode" requirements
>  BP> (for ext3 and may be others).
> 
> nobody uses journaling mode except ext3

What I meant by journalling mode is this: after the pages are submitted
for IO, we need some way of waiting for the IOs to finish in order to
guarantee the ordering. Is this not needed for anything other than
ext3?

Thanks,
Badari



Re: Lazy block allocation and block_prepare_write?

2005-04-19 Thread Badari Pulavarty
On Tue, 2005-04-19 at 04:22, Nikita Danilov wrote:
> Badari Pulavarty <[EMAIL PROTECTED]> writes:
> 
> [...]
> 
> >
> > Yes. Its possible to do what you want to. I am currently working on
> > adding "delayed allocation" support to ext3. As part of that, We
> 
> As you most likely already know, Alex Thomas already implemented delayed
> block allocation for ext3.

Yep. I reviewed Alex Tomas's patches for delayed allocation. He handled
all the cases in his code and did NOT use any mpage* routines to do
the work. I was hoping to change the mpage infrastructure to handle
these, so that every filesystem doesn't have to do its own thing.


> 
> >
> > In order to do the correct accounting, we need to mark a page
> > to indicate if we reserved a block or not. One way to do this,
> > to use page->private to indicate this. But then, all the generic
> 
> I believe one can use PG_mappedtodisk bit in page->flags for this
> purpose. There was old Andrew Morton's patch that introduced new bit
> (PG_delalloc?) for this purpose.

That would be good. But I don't feel like asking for a bit in page
if there is a way to get around it.

> 
> > routines will fail - since they assume that page->private represents
> > bufferheads. So we need a better way to do this.
> 
> They are not generic then. Some file systems store things completely
> different from buffer head ring in page->private.

Yep. Instead of changing the whole world, I was hoping to come up with
a few common interfaces (which don't assume anything about bufferheads
etc.) that are useful for more than one filesystem.


> >
> > 3) We need add hooks into filesystem specific calls from these
> > generic routines to handle "journaling mode" requirements
> > (for ext3 and may be others).
> 
> Please don't. There is no such thing as "generic
> journalling". Traditional WAL used by ext3, phase-trees of Tux2, and
> wandering logs of reiser4 are so much different that there is no hope
> for a single API to accommodate them all. Adding such API will only
> force more workarounds and hacks in non-ext3 file systems.
> 
> What _is_ common to all journalling file systems on the other hand, is
> the notion of transaction as the natural unit of caching and
> write-out. Currently in Linux, write-out is inode-based
> (->writepages()). Reiser4 already has a patch that replaces
> sync_sb_inodes() function with super-block operation. In reiser4 case,
> this operation scans the list of transactions (instead of the list of
> inodes) and writes some of them out, which is natural thing to do for a
> journalled file system.
> 
> Similarly, transaction is a unit of caching: it's often necessary to
> scan all pages of a given transaction, all dirty pages of a given
> transaction, or to check whether given page belongs to a given
> transaction. That is, transaction plays role similar to struct
> address_space. But currently there is 1-to-1 relation between inodes and
> address_spaces, and this forces file system to implement additional data
> structures to duplicate functionality already present in address_space.
> >
> > So, what are your requirements ?  I am looking for a common
> > way to combine all the requirements and come out with a
> > saner "generic" routines to handle these.
> >
> 
> I think that one reasonable way to add generic support for journalling
> is to split struct address_space into two objects: lower layer that
> represents "file" (say, struct vm_file), in which pages are linearly
> ordered, and on top of this vm_cache (representing transaction) that
> keeps track of pages from various vm_file's. vm_file is embedded into
> inode, and vm_cache has a pointer to (the analog of) struct
> address_space_operations.
> 
> vm_cache's are created by file system back-end as necessary (can be
> embedded into inode for non-journalled file systems). VM scanner and
> balance_dirty_pages() call vm_cache operations to do write-out.

Need to think some more. I guess you have thought about this more than
I have :)

Thanks,
Badari



[PATCH] VFS bugfix: two read_inode() calles without clear_inode() call between

2005-04-19 Thread Artem B. Bityuckiy
Hello,

here is a patch to fix the problem discussed at the "[PATC] small VFS
change for JFFS2" thread in LKML (http://lkml.org/lkml/2005/4/18/77).

The problem description:
~~

prune_icache() removes inodes from the inode hash (inode->i_hash) and
drops the inode_lock spinlock. If iget() is called at that moment, we end
up in a situation where VFS calls ->read_inode() twice for the same
inode without calling ->clear_inode() in between. This happens despite
the I_FREEING inode state because the inode is already removed from the
hash by the time find_inode_fast() is invoked.

The fix is: do not remove the inode from the hash too early.
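
The ordering problem can be modelled in a few lines of single-threaded user-space C. This is only a toy: `toy_inode`, `toy_prune`, `toy_iget` and `toy_dispose` are hypothetical names, locking is left out entirely, and the "must wait" outcome assumes that a lookup which finds a freeing inode in the hash waits for it to go away.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy inode state; not the VFS struct inode. */
struct toy_inode {
	bool hashed;	/* still reachable through the inode hash */
	bool freeing;	/* I_FREEING-like state */
	bool cleared;	/* ->clear_inode() has already run */
};

enum toy_lookup { TOY_MUST_WAIT, TOY_READ_INODE_AGAIN };

/* Pruning step; unhash_early selects the pre-patch behaviour. */
void toy_prune(struct toy_inode *inode, bool unhash_early)
{
	assert(inode);
	if (unhash_early)
		inode->hashed = false;
	inode->freeing = true;
}

/*
 * What an iget()-like lookup does when it runs after pruning marked the
 * inode as freeing but before the inode has been cleared.
 */
enum toy_lookup toy_iget(const struct toy_inode *old)
{
	assert(old && old->freeing && !old->cleared);
	if (old->hashed)
		return TOY_MUST_WAIT;		/* found: wait for teardown */
	return TOY_READ_INODE_AGAIN;		/* missed: second ->read_inode() */
}

/* Disposal step in the patched order: clear first, then unhash. */
void toy_dispose(struct toy_inode *inode)
{
	assert(inode);
	inode->cleared = true;
	inode->hashed = false;
}
```

With `unhash_early` set, the lookup misses the old inode and a second ->read_inode() runs before ->clear_inode(); with it clear, the lookup still finds the inode in the hash.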

The following patch fixes the problem. It was tested with JFFS2 (only)
and works perfectly.

Comments?



Signed-off-by: Artem B. Bityuckiy <[EMAIL PROTECTED]>


diff -auNrp linux-2.6.11.5/fs/inode.c linux-2.6.11.5_fixed/fs/inode.c
--- linux-2.6.11.5/fs/inode.c   2005-03-19 09:35:04.0 +0300
+++ linux-2.6.11.5_fixed/fs/inode.c 2005-04-18 17:54:16.0 +0400
@@ -284,6 +284,12 @@ static void dispose_list(struct list_hea
 		if (inode->i_data.nrpages)
 			truncate_inode_pages(&inode->i_data, 0);
 		clear_inode(inode);
+
+		spin_lock(&inode_lock);
+		hlist_del_init(&inode->i_hash);
+		list_del_init(&inode->i_sb_list);
+		spin_unlock(&inode_lock);
+
 		destroy_inode(inode);
 		nr_disposed++;
 	}
@@ -319,8 +325,6 @@ static int invalidate_list(struct list_h
 		inode = list_entry(tmp, struct inode, i_sb_list);
 		invalidate_inode_buffers(inode);
 		if (!atomic_read(&inode->i_count)) {
-			hlist_del_init(&inode->i_hash);
-			list_del(&inode->i_sb_list);
 			list_move(&inode->i_list, dispose);
 			inode->i_state |= I_FREEING;
 			count++;
@@ -455,8 +459,6 @@ static void prune_icache(int nr_to_scan)
 			if (!can_unuse(inode))
 				continue;
 		}
-		hlist_del_init(&inode->i_hash);
-		list_del_init(&inode->i_sb_list);
 		list_move(&inode->i_list, &freeable);
 		inode->i_state |= I_FREEING;
 		nr_pruned++;

-- 
Best Regards,
Artem B. Bityuckiy,
St.-Petersburg, Russia.



Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-19 Thread Eric Van Hensbergen
On 4/17/05, Bodo Eggert <[EMAIL PROTECTED]>
<[EMAIL PROTECTED]> wrote:
> 
> > I was thinking about this a while back and thought having a user-mount
> > permissions file might be the right way to address lots of these
> > issues.  Essentially it would contain information about what
> > users/groups were allowed to mount what sources to what destinations
> > and with what mandatory options.
> 
> Users being able to mount random fs containing suid or device nodes
> are root whenever they want to. If you want to mount with dev or suid,
> use sudo and restrict the mount to a limited set of images/devices/whatever.
>

Well, that would kinda be the intent behind the permissions file  --
it can specify what restricted set of images/devices/whatever the user
can mount. I suppose the sensible thing would be to always enforce
nosuid and nsgid, but I'd rather keep these as the default version of
options (allowing admins to shoot themselves in the foot perhaps, but
in the single-user workstation case, it seems like there's less reason
to be so paranoid).

   -eric


Re: Lazy block allocation and block_prepare_write?

2005-04-19 Thread Nikita Danilov
Badari Pulavarty <[EMAIL PROTECTED]> writes:

[...]

>
> Yes. Its possible to do what you want to. I am currently working on
> adding "delayed allocation" support to ext3. As part of that, We

As you most likely already know, Alex Tomas already implemented delayed
block allocation for ext3.

[...]

>
> In order to do the correct accounting, we need to mark a page
> to indicate if we reserved a block or not. One way to do this,
> to use page->private to indicate this. But then, all the generic

I believe one can use PG_mappedtodisk bit in page->flags for this
purpose. There was old Andrew Morton's patch that introduced new bit
(PG_delalloc?) for this purpose.

> routines will fail - since they assume that page->private represents
> bufferheads. So we need a better way to do this.

They are not generic then. Some file systems store things completely
different from buffer head ring in page->private.

>
> 3) We need add hooks into filesystem specific calls from these
> generic routines to handle "journaling mode" requirements
> (for ext3 and may be others).

Please don't. There is no such thing as "generic
journalling". The traditional WAL used by ext3, the phase-trees of Tux2, and
the wandering logs of reiser4 are so different that there is no hope
for a single API to accommodate them all. Adding such an API will only
force more workarounds and hacks in non-ext3 file systems.

What _is_ common to all journalling file systems, on the other hand, is
the notion of a transaction as the natural unit of caching and
write-out. Currently in Linux, write-out is inode-based
(->writepages()). Reiser4 already has a patch that replaces the
sync_sb_inodes() function with a super-block operation. In the reiser4 case,
this operation scans the list of transactions (instead of the list of
inodes) and writes some of them out, which is a natural thing to do for a
journalled file system.

Similarly, a transaction is a unit of caching: it's often necessary to
scan all pages of a given transaction, all dirty pages of a given
transaction, or to check whether a given page belongs to a given
transaction. That is, a transaction plays a role similar to struct
address_space. But currently there is a 1-to-1 relation between inodes and
address_spaces, and this forces the file system to implement additional data
structures to duplicate functionality already present in address_space.

>
> So, what are your requirements ?  I am looking for a common
> way to combine all the requirements and come out with a
> saner "generic" routines to handle these.
>

I think that one reasonable way to add generic support for journalling
is to split struct address_space into two objects: a lower layer that
represents a "file" (say, struct vm_file), in which pages are linearly
ordered, and on top of this a vm_cache (representing a transaction) that
keeps track of pages from various vm_files. vm_file is embedded into the
inode, and vm_cache has a pointer to (the analog of) struct
address_space_operations.

vm_caches are created by the file system back-end as necessary (and can be
embedded into the inode for non-journalled file systems). The VM scanner and
balance_dirty_pages() call vm_cache operations to do write-out.
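
A rough user-space sketch of what such a split could look like. Only the names vm_file and vm_cache come from the proposal above; every field, `vm_cache_ops`, `toy_page`, `vm_cache_add`, `toy_writeout` and the fixed-size array are assumptions made up for the illustration.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_PAGES 16

struct vm_cache;

/* Analog of address_space_operations, reduced to write-out. */
struct vm_cache_ops {
	int (*writeout)(struct vm_cache *cache);
};

/* Lower layer: one file, pages linearly ordered by index. */
struct vm_file {
	unsigned long nr_pages;
};

struct toy_page {
	struct vm_file *file;	/* owning file */
	unsigned long index;	/* position within that file */
	int dirty;
};

/* Upper layer: one transaction, tracking pages from many files. */
struct vm_cache {
	const struct vm_cache_ops *ops;
	struct toy_page *pages[TOY_MAX_PAGES];
	size_t nr;
};

/* Dirty a page and attach it to the transaction. */
int vm_cache_add(struct vm_cache *cache, struct toy_page *pg)
{
	assert(cache && pg);
	if (cache->nr == TOY_MAX_PAGES)
		return -1;
	pg->dirty = 1;
	cache->pages[cache->nr++] = pg;
	return 0;
}

/* Sample ->writeout(): clean every page of the transaction at once. */
int toy_writeout(struct vm_cache *cache)
{
	int written = 0;

	assert(cache);
	for (size_t i = 0; i < cache->nr; i++) {
		if (cache->pages[i]->dirty) {
			cache->pages[i]->dirty = 0;
			written++;
		}
	}
	cache->nr = 0;
	return written;
}

const struct vm_cache_ops toy_cache_ops = { .writeout = toy_writeout };
```

A caller standing in for the VM scanner would only ever invoke `cache->ops->writeout(cache)`, never a per-inode method, so the unit of write-out is the transaction.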

>
> Thanks,
> Badari

Nikita.


Re: Lazy block allocation and block_prepare_write?

2005-04-19 Thread Alex Tomas
> Badari Pulavarty (BP) writes:

 BP> In order to do the correct accounting, we need to mark a page
 BP> to indicate if we reserved a block or not. One way to do this,
 BP> to use page->private to indicate this. But then, all the generic
 BP> routines will fail - since they assume that page->private represents
 BP> bufferheads. So we need a better way to do this.

you can introduce one more bit to page->flags

 BP> 3) We need add hooks into filesystem specific calls from these
 BP> generic routines to handle "journaling mode" requirements
 BP> (for ext3 and may be others).

nobody uses journaling mode except ext3
