Re: [patch 00/10] mount ownership and unprivileged mount syscall (v8)

2008-02-23 Thread Al Viro
On Sat, Feb 23, 2008 at 06:33:13PM +0100, Miklos Szeredi wrote:
> > c) just what is limited by that sysctl?  AFAICS, rbind is allowed
> > if mountpoint is on user vfsmount and it seems to create vfsmounts without
> > eating into that limit just fine...  What's the point of limiting the
> > amount of vfsmounts marked user when you do not limit the number of vfsmount
> > one can allocate?
> 
> The limit is there, so that unprivileged users cannot create insane
> number of mounts.  It's just a safety thing, analogous to
> /proc/sys/fs/file-max.

Can't they?  Looks like one can create any number of vfsmounts without
getting more than one marked MNT_USER...

What are you trying to limit - vfsmounts or superblocks?  The former is
not limited in your patchset at all, AFAICS - you can
do while true; do
mount --bind /bin ~/my_directory;
mount --bind /sbin ~/my_directory;
done
indefinitely and all the bleeding stack of vfsmounts will be !MNT_USER.
Or any number of similar schemes, really, without overmounting if you
wish to avoid it.

If you are trying to limit the number of superblocks (i.e. active instances
of filesystems), then I'd say that vfsmounts make piss-poor proxies for
those and it would be better to count the objects you really want to count...
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 00/10] mount ownership and unprivileged mount syscall (v8)

2008-02-23 Thread Al Viro
On Mon, Feb 18, 2008 at 12:47:59PM +0100, Miklos Szeredi wrote:
> So what should I do?
> 
> Would Al be wanting to merge this into his VFS tree?  (Can't find it
> on git.kernel.org yet, BTW.)

FWIW, it's on hera right now, should propagate to git.kernel.org in a few.

Branches I'd pushed there: vfs-fixes.b0 and ro-bind.b0.  The latter is
on top of the former.  There will be more, but that at least takes care
of the most urgent stuff.  Again, apologies for things being too damn
slow ;-/

As for the unprivileged mounts...
a) why do we lose them on clone() in new namespace?  Bloody
inconvenient, to put it mildly.
b) why do we prohibit all kinds of remount?
c) just what is limited by that sysctl?  AFAICS, rbind is allowed
if mountpoint is on user vfsmount and it seems to create vfsmounts without
eating into that limit just fine...  What's the point of limiting the
amount of vfsmounts marked user when you do not limit the number of vfsmount
one can allocate?
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC 03/11] slim down debugfs

2008-02-23 Thread Al Viro
On Tue, Feb 19, 2008 at 05:04:38AM +0100, Arnd Bergmann wrote:
> With most of debugfs now copied to generic code in libfs,
> we can remove the original copy and replace it with thin
> wrappers around libfs.
> 
> Signed-off-by: Arnd Bergmann <[EMAIL PROTECTED]>
> Index: linux-2.6/fs/Kconfig
> ===
> --- linux-2.6.orig/fs/Kconfig
> +++ linux-2.6/fs/Kconfig
> @@ -1001,6 +1001,14 @@ config CONFIGFS_FS
> Both sysfs and configfs can and should exist together on the
> same system. One is not a replacement for the other.
>  
> +config LIBFS
> + tristate
> + default m
> + help
> +   libfs is a helper library used by many of the simpler file
> +   systems. Parts of libfs can be modular when all of its users
> +   are modules as well, and the users should select this symbol.

NAK.  For one thing, you need dependencies or selects and none of the
patches later in the series introduces them.  For another, neither
dependencies nor selects work well if you have non-modular users of
that sucker.  Seriously, try to add those and do allmodconfig.  Then
look what a mess it produces.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC 01/11] add generic versions of debugfs file operations

2008-02-23 Thread Al Viro
On Tue, Feb 19, 2008 at 05:04:36AM +0100, Arnd Bergmann wrote:
> The file operations in debugfs are rather generic and can
> be used by other file systems, so it can be interesting to
> include them in libfs, with more generic names, and exported
> to modules.
> 
> This patch adds a new copy of these operations to libfs,
> so that the debugfs version can later be cut down.

BTW, where does CONFIG_LIBFS come from?
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC 02/11] introduce simple_fs_type

2008-02-23 Thread Al Viro
On Tue, Feb 19, 2008 at 05:04:37AM +0100, Arnd Bergmann wrote:
> There is a number of pseudo file systems in the kernel
> that are basically copies of debugfs, all implementing the
> same boilerplate code, just with different bugs.
> 
> This adds yet another copy to the kernel in the libfs directory,
> with generalized helpers that can be used by any of them.
> 
> The most interesting function here is the new "struct dentry *
> simple_register_filesystem(struct simple_fs_type *type)", which
> returns the root directory of a new file system that can then
> be passed to simple_create_file() and similar functions as a
> parent.

Don't mix "split the file" with "add new stuff", please.  And frankly,
I'm not convinced that embedding file_system_type into another struct
is a good idea.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC 01/11] add generic versions of debugfs file operations

2008-02-23 Thread Al Viro
On Tue, Feb 19, 2008 at 05:04:36AM +0100, Arnd Bergmann wrote:
> + char buf[3];
> + u32 *val = file->private_data;
> +
> + if (*val)
> + buf[0] = 'Y';
> + else
> + buf[0] = 'N';
> + buf[1] = '\n';
> + buf[2] = 0x00;
> + return simple_read_from_buffer(user_buf, count, ppos, buf, 2);

Ewww - caps, \n...  BTW, \0 is pointless here - simple_read_from_buffer() will
not access it with these arguments)...

> +{
> + char buf[32];
> + int buf_size;
> + u32 *val = file->private_data;
> +
> + buf_size = min(count, (sizeof(buf)-1));
> + if (copy_from_user(buf, user_buf, buf_size))
> + return -EFAULT;
> +
> + switch (buf[0]) {
> + case 'y':
> + case 'Y':
> + case '1':
> + *val = 1;
> + break;
> + case 'n':
> + case 'N':
> + case '0':
> + *val = 0;
> + break;

Please, check the length; sloppy input grammar is a bad idea.  Hell, at the
very least you want -EINVAL if input is not recognized...
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: how to show propagation state for mounts

2008-02-20 Thread Al Viro
On Wed, Feb 20, 2008 at 11:29:13AM -0800, Ram Pai wrote:

> I wonder, what is wrong in reporting mounts in other namespaces that
> either receive and send propagation to mounts in our namespace?

A plenty.  E.g. if foo trusts control over /var/blah to bar, it's not
obvious that foo has any business knowing if bar gets it from somebody
else in turn.  And I'm not sure that bar has any business knowing that
foo has the damn thing attached in five places instead of just one,
let alone _where_ it has been attached.

If you get down to it, the thing is about delegating control over part
of namespace to somebody, without letting them control, see, etc. the
rest of it.  So I'd rather be very conservative about extra information
we allow to piggyback on that.  I don't know... perhaps with stable peer
group IDs it would be OK to show peer group ID by (our) vfsmount + peer
group ID of master + peer group ID of nearest dominating group that has
intersection with our namespace.  Then we don't leak information (AFAICS),
get full propagation information between our vfsmounts and cooperating
tasks in different namespaces can figure the things out as much as possible
without leaking 3rd-party information to either.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: how to show propagation state for mounts

2008-02-20 Thread Al Viro
On Wed, Feb 20, 2008 at 04:39:15PM +0100, Miklos Szeredi wrote:
> > mountinfo - IMO needs a sane discussion of what and how should be shown
> > wrt propagation state
> 
> Here's my take on the matter.
> 
> The propagation tree can be either be represented
> 
>  1) "from root to leaf" listing members of peer groups and their
>  slaves explicitly,
> 
>  2) or "from leaf to root" by identifying each peer group and then for
>  each mount showing the id of its own group and the id of the group's
>  master.
> 
> 2) can have two variants:
> 
>  2a) id of peer group is constant in time
> 
>  2b) id of peer group may change
> 
> The current patch does 2b).  Having a fixed id for each peer group
> would mean introducing a new object to anchor the peer group into,
> which would add complexity to the whole thing.
> 
> All of these are implementable, just need to decide which one we want.

Eh...  Much more interesting question: since the propagation tree spans
multiple namespaces in a lot of normal uses, how do we deal with
reconstructing propagation through the parts that are not present in
our namespace?  Moreover, what should and what should not be kept private
to namespace?  Full exposure of mount trees is definitely over the top
(it shows potentially sensitive information), so we probably want less
than that.

FWIW, my gut feeling is that for each peer group that intersects with our
namespace we ought to expose in some form
* all vfsmounts belonging to that intesection
* the nearest dominating peer group (== master (of master ...) of)
that also has a non-empty intersection with our namespace

It's less about the form of representation (after all, we generate poll
events when contents of that sucker changes, so one *can* get a consistent
snapshot of the entire thing) and more about having it self-contained
when we have namespaces in the play.

IOW, the data in there should give answers to questions that make sense.
"Do events get propagated from this vfsmount I have to that vfsmount I have?"
is a meaningful one; ditto for "are events here propagated to somewhere I
don't see?" or "are events getting propagated here from somewhere I don't
see?".

Dumping pieces of raw graph, with IDs of nodes we can't see and without
any way to connect those pieces, OTOH, doesn't make much sense.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: git tree with VFS stuff

2008-02-20 Thread Al Viro
On Thu, Feb 21, 2008 at 01:13:48AM +1100, Stephen Rothwell wrote:
> Hi Miklos,
> 
> On Tue, 19 Feb 2008 14:32:28 +0100 Miklos Szeredi <[EMAIL PROTECTED]> wrote:
> >
> > I've created a git tree with the following mounts related stuff:
> > 
> >   - read-only bind mounts
> >   - /proc//mountinfo
> >   - unprivileged mounts
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfsstuff.git master
> > 
> > I guess, giving these a spin in linux-next wouldn't hurt?
> 
> I don't think this is what we want to use linux-next for.  Linux-next is
> really a place for stuff that will pretty clearly go into the next kernel
> release i.e. 2.6.26 right now.  If you want to experiment on things for
> beyond that timeframe, then a snapshot of linux-next may be a good base.
> I will take them when they reach the appropriate subsystem tree and are
> ready for integration.

FWIW, I must apologize for delay with getting the damn tree open on
kernel.org ;-/   The last couple of weeks had been Not Fun(tm) in a
lot of respects.

Hopefully I'll finish putting the damn thing into publishable shape by
tomorrow.  As for the stuff mentioned above...  ro-bind series - definitely
yes, mountinfo - IMO needs a sane discussion of what and how should be shown
wrt propagation state, unprivileged mounts - in the "need to finish reviewing"
pile.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Add vfsmount to vfs helper functions.

2008-02-17 Thread Al Viro
On Mon, Feb 18, 2008 at 09:03:51AM +0900, Tetsuo Handa wrote:
> Hello.
> 
> > No printable comments, except for that:
> > 
> > (e) why don't you guys move the Linus' Serious Mistake to _callers_ of
> > vfs_mknod() and its ilk?
> > 
> > Which obviously solves all problems with having vfsmount.
> 
> Excuse me. I didn't understand what "the Linus' Serious Mistake to
> _callers_ of vfs_mknod()" is. Could you give me some URLs or hints?

Linus' Serious Mistake: LSM.

Moving that junk to callers for vfs_mknod(): should be obvious; remove
the calls of security_whatever() from vfs_...() and add them in places
that call the functions in question.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Add vfsmount to vfs helper functions.

2008-02-17 Thread Al Viro
On Sun, Feb 17, 2008 at 06:00:30PM +0900, Tetsuo Handa wrote:
> Hello.
> 
> This is "(c) Add new hooks." approach I proposed at
> http://www.mail-archive.com/linux-fsdevel@vger.kernel.org/msg11536.html .
> 
> Although this is an incomplete patch,
> I want to know whether you can tolerate this approach or not.
> 
> If you cannot tolerate this approach, may be we need to consider
> implementing "(d) Calculate pathname while doing name resolution" approach.
> 

No printable comments, except for that:

(e) why don't you guys move the Linus' Serious Mistake to _callers_ of
vfs_mknod() and its ilk?

Which obviously solves all problems with having vfsmount.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [UNIONFS] 00/29 Unionfs and related patches pre-merge review (v2)

2008-02-02 Thread Al Viro
On Sat, Feb 02, 2008 at 01:45:15PM -0500, Erez Zadok wrote:
> > You are thinking about non-interesting case.  _Files_ are not much
> > of a problem.  Directory tree is.  The real problems with all unionfs and
> > stacking implementations I've seen so far, all way back to Heidemann et.al.
> > start when topology of the underlying layer changes.
> 
> OK, so if I understand you, your concerns center around the fact that lower
> directories can be moved around (i.e., topology changes), what happens then
> for operations that go through the stackable f/s, and what should users
> expect to see.

Correct.
 
> > If you have clear
> > semantics for unionfs behaviour in presence of such things, by all means,
> > publish it - as far as I know *nobody* had done that; not even on the
> > "what should we see when..." level, nevermind the implementation.
> 
> Since stacking and NFS have some similarities, I first checked w/ the NFS
> people to see what are their semantics in a similar scenario: an NFS client
> could be validating a directory, then issue, say, a ->create; but in
> between, the server could have moved the directory that was validated.  In
> NFS, the ->create operation succeeds, and creates the file in the new
> location of the directory which was validated.
>
> Unionfs's behavior is similar: the newly created file will be successfully
> created in the moved directory.  The only exception is that if a lower
> branch is marked readonly by unionfs, a copyup will take place.

Er...  For unionfs it's a much more monumental problem.  Look - you have
a mapping between the directory trees of layers; everything else is
defined in terms of that mapping ("this files shadows that file", etc.).

If you allow a mix of old and new mappings, you can easily run into the
situations when at some moment X1 covers Y1, X2 covers Y2, X2 is a descendent
of X1 and Y1 is a descendent of Y2.  You *really* don't want to go there -
if nothing else, defining behaviour of copyup in face of that insanity
will be very painful.

What's more, and what makes NFS a bad model, NFS client doesn't lock
directories of NFS server.  unionfs *does* locking of directories in
underlying layers, which means that you have an entirely new set of
constraints - you can deadlock easily.  As the matter of fact, your
current code suffers from that problem - it violates the locking rules
in operations on covered layers.
 
> This had not been a problem for unionfs users to date.  The main reason is
> that when unionfs users modify lower files, they often do so while there's
> little to no activity going through the union itself.  And while it doesn't
> prevent directories from being moved around, this common usage mode does
> reduce the frequency in which topology changes can be an issue for unionfs
> users.

Ugh...  "Currently unionfs users tend to use it ways that do not trigger
fs corruption" is nice, but doesn't really address the "current unionfs
implementation contains races leadint to fs corruption"...

> > around, changing parents, etc.  Cross-directory rename() certainly rates
> > very high on the list of "WTF had they been smoking in UCB?" misfeatures,
> > but it's there and it has to be dealt with.
> 
> Well, it was UCB, UCLA, and Sun.  I don't think back in the early 90s they
> were too concerned about topology changes; f/s stacking was a new idea and
> they wanted to explore what can be done with it conceptually, not produce
> commercial-grade software (still, remind me to tell you the first-hand story
> I've learned about of how full-blown stacking _almost_ made it into Solaris
> 2.0 :-)
 
> The only known reference to try and address this coherency problem was
> Heidemann's SOSP'95 paper titled "Performance of cache coherence in
> stackable filing."  The paper advocated changing the whole VFS and the
> caches (page cache + dnlc) to create a "unified cache manager" that was
> aware of complex layering topologies (including fan-out and fan-in).  It was
> even able to handle compression layers, where file data offsets change b/t
> the layers (a nasty problem).  Code for this unified cache manager was never
> released AFAIK.  I think Heidemann's approach was elegant, but I don't think
> it was practical as it required radical VFS/VM surgery.  Ironically, MS
> Windows has a single I/O cache manager that all storage and filesystem
> modules talk to directly (they're not allowed to pass IRPs directly b/t
> layers): so Windows can handle such coherency better than most Unix systems
> can today.
 
Different problem.  IIRC, that paper implicitly assumed that mapping between
vnodes in different layers would _not_ be subject to massive surgeries from
operations involved.

> However, for this "view" idea to work, I'll need a way to lock-out or hide
> the lower directories from the namespace, so no one can access them in r-w
> mode any longer (if at all).  Moreover, even if such a method was available,
> one would have to decide what to do with any open files 

Re: [PATCH v2 8/9] bfs: remove multiple assignments

2008-01-30 Thread Al Viro
On Mon, Jan 28, 2008 at 01:02:03AM -0600, Joel Schopp wrote:
> >>>-inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME_SEC;
> >>>+inode->i_mtime = CURRENT_TIME_SEC;
> >>>+inode->i_atime = CURRENT_TIME_SEC;
> >>>+inode->i_ctime = CURRENT_TIME_SEC;
> >>multiple assignments like "x = y = z = value;" can potentially
> >>(depending on the compiler and arch) be faster than "x = value; y =
> >>value; z=value;"
> >>
> >>I am surprized that this script complains about them as it is a
> >>perfectly valid thing to do in C.
> >
> >I think it seems wise to ask the maintainers of checkpatch.pl to
> >comment on that. I'm Cc:ing them now.
> >
> 
> There are plenty of things that are valid to do in C that don't make for 
> maintainable code.  These scripts are designed to make your code easier for 
> real people to review and maintain.

Except that in this case the new variant is not equivalent to the old one...
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Add vfsmount to vfs helper functions.

2008-01-30 Thread Al Viro
On Fri, Jan 25, 2008 at 07:20:56PM +0900, Kentaro Takeda wrote:
> In the LSM ml, we are discussing about
> "how to know requested pathnames within LSM modules".
> 
> Currently, VFS helper functions don't pass "struct vfsmount" parameter.
> Therefore, we cannot calculate requested pathnames within LSM modules
> because LSM hooks can't know "struct vfsmount" parameter that corresponds with
> "struct dentry" passed to VFS helper functions.
> 
> AppArmor is proposing a patch that appends "struct vfsmount" parameters to
> VFS helper functions so that LSM modules (SELinux, AppArmor, TOMOYO) can
> calculate requested pathnames.
> 
> The changes in include/linux/fs.h are shown below.
> What do you think about these changes?

That they are bloody *wrong*.  You have not addressed any of the objections
that had been posted too many times to repeat.  Damn it, you've not even
bothered to deal with the specific obvious stupidity with vfs_rename() -
just reposted the dreck and devil take all feedback.  Wonderful...
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [UNIONFS] 00/29 Unionfs and related patches pre-merge review (v2)

2008-01-26 Thread Al Viro
On Sat, Jan 26, 2008 at 12:08:30AM -0500, Erez Zadok wrote:

> > * lock_parent(): who said that you won't get dentry moved
> > before managing to grab i_mutex on parent?  While we are at it,
> > who said that you won't get dentry moved between fetching d_parent
> > and doing dget()?  In that case parent could've been _freed_ before
> > you get to dget().
> 
> OK, so looks like I should use dget_parent() in my lock_parent(), as I've
> done elsewhere.  I'll also take a look at all instances in which I get
> dentry->d_parent and see if a d_lock is needed there.
 
dget_parent() doesn't deal with the problem of rename() done directly
in that layer while you'd been waiting for i_mutex.

> > +   lock_rename(lower_old_dir_dentry, lower_new_dir_dentry);
> > +   err = vfs_rename(lower_old_dir_dentry->d_inode, lower_old_dentry,
> > +lower_new_dir_dentry->d_inode, lower_new_dentry);
> > +   unlock_rename(lower_old_dir_dentry, lower_new_dir_dentry);
> > 
> > Uh-huh...  To start with, what guarantees that your lower_old_dentry
> > is still a child of your lower_old_dir_dentry?
> 
> We dget/dget_parent the old/new dentry and parents a few lines above
> (actually, it looked like I forgot to dget(lower_new_dentry) -- fixed).

And?  Having a reference to dentry does not prevent it being moved
elsewhere by direct rename(2) in that layer.  It will exist, that
much is guaranteed by grabbing a reference.  However, there is no
warranties whatsoever that by the time you get i_mutex on what had
once been its parent, it will still remain the parent of our dentry.

> BTW, my sense of the relationship b/t upper and lower objects and their
> validity in a stackable f/s, is that it's similar to the relationship b/t
> the NFS client and server -- the client can't be sure that a file on the
> server doesn't change b/t ->revalidate and ->op (hence nfs's reliance on dir
> mtime checks).

You are thinking about non-interesting case.  _Files_ are not much
of a problem.  Directory tree is.  The real problems with all unionfs and
stacking implementations I've seen so far, all way back to Heidemann et.al.
start when topology of the underlying layer changes.  If you have clear
semantics for unionfs behaviour in presence of such things, by all means,
publish it - as far as I know *nobody* had done that; not even on the
"what should we see when..." level, nevermind the implementation.
 
> Perhaps this general topic is a good one to discuss at more length at LSF?
> Suggestions are welcome.

It would; I honestly do not know if the problem is solvable with the
(lack of) constraints you apparently want.  Again, the real PITA begins
when you start dealing with pieces of underlying trees getting moved
around, changing parents, etc.  Cross-directory rename() certainly rates
very high on the list of "WTF had they been smoking in UCB?" misfeatures,
but it's there and it has to be dealt with.

BTW, and that's a completely unrelated story, I'd rather see whiteouts
done directly by filesystems involved - it would simplify the life big
way.  How about adding a dir->i_op->whiteout(dir, dentry) and seeing if
your variant could be turned into such a method to be used by really
piss-poor filesystems?  All UFS-related ones (including ext*) can trivially
support whiteouts without any PITA; adding them to tmpfs is also not a big
deal and anything that caches inode type in directory entries should be
easy to extend in the same way as ext*/ufs...
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [UNIONFS] 00/29 Unionfs and related patches pre-merge review (v2)

2008-01-16 Thread Al Viro
After grep for locking-related things:

* lock_parent(): who said that you won't get dentry moved
before managing to grab i_mutex on parent?  While we are at it,
who said that you won't get dentry moved between fetching d_parent
and doing dget()?  In that case parent could've been _freed_ before
you get to dget().

* in create_parents():
+   struct inode *inode = lower_dentry->d_inode;
+   /*
+* If we get here, it means that we created a new
+* dentry+inode, but copying permissions failed.
+* Therefore, we should delete this inode and dput
+* the dentry so as not to leave cruft behind.
+*/
+   if (lower_dentry->d_op && lower_dentry->d_op->d_iput)
+   lower_dentry->d_op->d_iput(lower_dentry,
+  inode);
+   else
+   iput(inode);
+   lower_dentry->d_inode = NULL;
+   dput(lower_dentry);
+   lower_dentry = ERR_PTR(err);
+   goto out;
Really?  So what happens if it had become positive after your test and
somebody had looked it up in lower layer and just now happens to be
in the middle of operations on it?  Will be thucking frilled by that...

* __unionfs_rename():
+   lock_rename(lower_old_dir_dentry, lower_new_dir_dentry);
+   err = vfs_rename(lower_old_dir_dentry->d_inode, lower_old_dentry,
+lower_new_dir_dentry->d_inode, lower_new_dentry);
+   unlock_rename(lower_old_dir_dentry, lower_new_dir_dentry);

Uh-huh...  To start with, what guarantees that your lower_old_dentry
is still a child of your lower_old_dir_dentry?  What's more, you are
not checking the result of lock_rename(), i.e. asking for serious trouble.

* revalidation stuff: err...  how the devil can it work for
directories, when there's nothing to prevent changes in underlying
layers between ->d_revalidate() and operation itself?  For the upper
layer (unionfs itself) everything's more or less fine, but the rest
of that...
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch] VFS: extend /proc/mounts

2008-01-16 Thread Al Viro
On Wed, Jan 16, 2008 at 04:09:30PM -0800, Andrew Morton wrote:
> On Thu, 17 Jan 2008 00:58:06 +0100 (CET) Jan Engelhardt <[EMAIL PROTECTED]> 
> wrote:
> 
> > 
> > On Jan 17 2008 00:43, Karel Zak wrote:
> > >> 
> > >> Seems like a plain bad idea to me.  There will be any number of home-made
> > >> /proc/mounts parsers and we don't know what they do.
> > >
> > > So, let's use /proc/mounts_v2  ;-)
> > 
> > Was not it like "don't use /proc for new things"?
> 
> Well yeah.  If we're going to do a brand new mechanism to expose
> per-mount data then we should hunker down and get it right.

Which automatically means "no sysfs".  We are NOT converting vfsmounts
to kobject-based lifetime rules.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch] VFS: extend /proc/mounts

2008-01-16 Thread Al Viro
On Wed, Jan 16, 2008 at 11:12:31PM +0100, Miklos Szeredi wrote:
> The alternative (and completely safe) solution is to add another file
> to proc.  Me no likey.

Since we need saner layout, I would strongly suggest exactly that.

> major:minor -- is the major minor number of the device hosting the filesystem

Bad description.  "Value of st_dev for files on that filesystem", please -
there might be no such thing as "the device hosting the filesystem" _and_
the value here may bloody well be unrelated to device actually holding
all data (for things like ext2meta, etc.).

> 1) The mount is a shared mount.
> 2) Its peer mount of mount with id 20
> 3) It is also a slave mount of the master-mount with the id  19
> 4) The filesystem on device with major/minor number 98:0 and subdirectory
>   mnt/1/abc makes the root directory of this mount.
> 5) And finally the mount with id 16 is its parent.

I'd suggest doing a new file that would *not* try to imitate /etc/mtab.
Another thing is, how much of propagation information do we want to
be exposed and what do we intend to do with it?  Note that "entire
propagation tree" is out of question - it spans many namespaces and
contains potentially sensitive information.  So we won't see all nodes.

What do we want to *do* with the information about propagation?
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] netns / sysfs interaction

2008-01-07 Thread Al Viro
On Mon, Jan 07, 2008 at 03:01:47AM -0700, Eric W. Biederman wrote:
> Al Viro <[EMAIL PROTECTED]> writes:

> What appears to be a clean solution is to have multiple sysfs superblocks
> and to capture the namespace at mount time.

It is not a clean solution at all.  In particular, it leaves you with hell
of a coherency issues between these trees.

>  For planning purposes there
> is a device namespace on the drawing board as well, so you can keep
> your same major minor numbers for devices (tty names, network attached
> disk) in a migration event.

Yes, I'm quite sure there's more coming.  Which is why I'm asking now,
before we are even deeper into that... area

>   This means netns isn't the only
> namespace we will have to worry about with sysfs before it is all
> done.

Exciting.
 
> > a) what happens if I do chdir("/sys/class/net/eth42/") and then
> > migrate?
> 
> It shouldn't be any better or worse then any other filesystem.  The
> prerequisite for a OS level migration is that the set of all
> namespaces and all of the processes that use them all go together.
> As we recreate the virtual filesystem and virtual devices we should
> recreate a sysfs that is essentially the same.  I doubt we will go
> to the trouble of keeping the unnamed device number we are mounted on
> and the inode numbers the same, but otherwise we should be able to
> recreate an identical looking sysfs (baring real hardware changes).

Have you even bothered to read the pathname in question?  Please, do so.

> > c) what happens to open files?  E.g. to /sys/class/net - say it,
> > if migration happens between two getdents(2).
> 
> How do we restore the internal state?  Hmm.The rule is that you
> are only guaranteed to see directory entries that existed
> both before you started to read the directory and after you finished.
> 
> The cheap solution is just to declared everything hotplugged and
> deleted and recreated.  Removing any meaningful guarantee of seeing
> anything.
> 
> Since we only depend upon the value of f_pos that should largely work.
> 
> If we ever figure out how to preserve inode numbers over a migration
> event the current scheme will work unmodified but that sounds like
> more pain then it is worth.
> 

Inode numbers?  Are you suggesting a wholesale replacement of all struct
file referenced by descriptor tables, all way down to inodes?  May I see
the patches for that, please?

> Third when the goal is isolation and not migration (a better chroot)
> then our hardware never changes.

... and you have quite a bit of system state (starting with those net:eth0
symlinks, etc.) visible in there, not just the hardware.

> The idea is supporting multiple superblocks for sysfs:
> 
>   Ultimately capturing the relevant namespace at mount time
>   and if we don't have a superblock for that namespace creating
>   a new one.
> 
>   So we have one sysfs dirent tree and multiple dentry trees.
> 
>   The tricky parts are rename/move and blocking mount/unmount requests
>   for sysfs until we complete the rename operation calling d_move
>   everywhere.

Excuse me, _what_?  Are you seriously suggesting going through all dentry
trees, doing d_move() in each?  I want to see your locking.  It's promising
to be worse than devfs had ever been.  Much worse.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC] netns / sysfs interaction

2008-01-06 Thread Al Viro
As much as I hate to touch either subject, let alone both at
once...  Eric, would you mind explaining what exactly do you want
sysfs to do in presense of your "namespaces"?  On the "what does user
see if we do <...>" level.

a) what happens if I do chdir("/sys/class/net/eth42/") and then
migrate?

b) what happens to /sys/class/net/eth0/device visibility/things
it points to/etc.?

c) what happens to open files?  E.g. to /sys/class/net - say it,
if migration happens between two getdents(2).

d) what happens to visibility in other parts of sysfs?  E.g. to
things like
$ ls  /sys/devices/pci\:00/\:00\:0a.0/
bus device  local_cpus  power  resource1 uevent
class   driver  modaliasresource   subsystem_device  vendor
config  irq net:eth0resource0  subsystem_vendor
$
See that net:eth0 in there?  Are all such suckers seen?

e) while we are at it, wouldn't seeing the information in
/sys/devices/pci in general defeat whatever purpose you have in mind
for your stuff?

Context: we need sane locking for sysfs.  I think I have a more or less
workable scheme, but its feasibility depends big way on what netns needs
to have.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 1/2] [RFC] Simple tamper-proof device filesystem.

2007-12-16 Thread Al Viro
On Sun, Dec 16, 2007 at 05:52:08PM +0100, Indan Zupancic wrote:

> What prevents them from mounting tmpfs on top of /dev, bypassing your fs?

Or binding /dev/null over nodes they want to get rid of...

> Also, if they have root there are plenty of ways to prevent an administrator
> from logging in, e.g. using iptables or changing the password.

Indeed.

BTW, tmpfs with root marked append-only and populated in normal ways on boot
would get a comparable effect without spending so much efforts.  Still won't
really help if attacker gains root, but then neither will your variant.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] [2.6.24-rc3-mm1] loop cleanup in fs/namespace.c - repost

2007-11-26 Thread Al Viro
On Wed, Nov 21, 2007 at 03:24:33PM -0800, Andrew Morton wrote:
> -repeat:
> - if (atomic_dec_and_lock(&mnt->mnt_count, &vfsmount_lock)) {
> + while (atomic_dec_and_lock(&mnt->mnt_count, &vfsmount_lock)) {
>   if (likely(!mnt->mnt_pinned)) {
>   spin_unlock(&vfsmount_lock);
>   __mntput(mnt);
> - return;
> + break;
>   }
>   atomic_add(mnt->mnt_pinned + 1, &mnt->mnt_count);
>   mnt->mnt_pinned = 0;
>   spin_unlock(&vfsmount_lock);
>   acct_auto_close_mnt(mnt);
>   security_sb_umount_close(mnt);
> - goto repeat;
>   }
>  }
>  
> This patch has no changelog which I can use.

BTW, I have a problem with that one.  The change is misleading; FWIW,
I'm very tempted to turn that into tail recursion, with

if (!atomic_dec_and_lock(&mnt->mnt_count, &vfsmount_lock))
return;

in the very beginning (i.e. invert the test and pull the body to top level).
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 0/5] Shadow directories

2007-10-18 Thread Al Viro
On Fri, Oct 19, 2007 at 12:27:16PM +0930, David Newall wrote:

> >Learn to read.  Linux has never allowed that.  Most of the Unix systems
> >do not allow that.
> 
> I did read the claim and it is ambiguous, in that it can reasonably be 
> read to mean that most UNIX systems never allowed such links, which is 
> wrong.  All UNIX systems allowed it until relatively recently.

FVO"relatively recently" exceeding a decade and half.  In any case,
it's _trivial_ to get fs corruption on any system with such links -
play with rename() races a bit and you'll get it.  And yes, it does
include 4.4BSD and quite a chunk of even later history.

Anyway, you are quite welcome to propose a sane locking scheme capable
of dealing with that mess.

As for the posted patch, AFAICS it's FUBAR in handling of .. in such
directories.  Moreover, how are you going to keep that shadow tree
in sync with the main one if somebody starts doing renames in the
latter?  Or mount --move, or...
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 0/5] Shadow directories

2007-10-18 Thread Al Viro
On Fri, Oct 19, 2007 at 06:07:45AM +0930, David Newall wrote:
> >considerations of this whole scheme. Linux, like most Unix systems, 
> >has never allowed hard links to directories for a number of reasons;
> 
> The claim is wrong.  UNIX systems have traditionally allowed the 
> superuser to create hard links to directories.  See link(2) for 2.10BSD 
> . 
> Having got that wrong throws doubt on the argument; perhaps a path can 
> simultaneously be a file and a directory.

Learn to read.  Linux has never allowed that.  Most of the Unix systems
do not allow that.  Original _did_ allow that, but at the cost of very
easily triggered fs corruption (and it didn't have things like rename(2) -
it _did_ have userland implementation, of course, in suid-root mv(1),
but that sucker had been extremely racy and could be easily used to
screw filesystem to hell and back; adding rename(2) to the set of primitives
combined with multiple links to directories leads to very nasty issues on
_any_ system).
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC] block_device_operations prototype changes

2007-08-27 Thread Al Viro
It's time to sanitize prototypes of bdev ->open(), ->release()
and ->ioctl().  This stuff had sat in "need to fix" for a long time
and there is a bunch of bugs hard to fix without dealing with it.

1) ->open() gets inode * and file *.  Almost all instances use only
inode->i_bdev
file->f_mode
file->f_flags - O_ACCMODE|O_NDELAY|O_EXCL
and nothing else.  Most actually use only inode->i_bdev->bd_disk and
those are easy; the trouble begins when we use f_mode and/or f_flags
in non-trivial ways.  One good example is cdrom.c - its behaviour
depends on whether we have O_NDELAY (doesn't lock the door, among
other things) and on whether we are opening for write.  Unfortunately,
it's not just something we care about at open() time - we need different
cleanups at close(), and that doesn't work at all.

At the very least, we need to know whether we want to unlock the door.
Suppose somebody opens it with O_NDELAY (ioctl-only) when it's already
opened for data (or mounted, for that matter).  Now we close it; how
do we know if we need to unlock?  Even if we did get struct file * (and
we do not, but that's a separate story), we could not rely on file->f_flags.
Why?  Because O_NDELAY can be flipped at will by fcntl() after the file
got opened.  IOW, we can't rely on it.

Most of that crap can be solved by saving the relevant information from
f_flags into f_mode (e.g. fs/block_dev.c::do_open()), where they'd remain
safe.  However, that's not quite enough - struct file we pass to ->open()
is often a fake one and it certainly won't live until ->release().

One thing that _can_ be solved that way, though, and it alone makes the
scheme worthwhile: floppy.c abuses can be finally killed.  Just saving
->f_flags & 2 in ->f_mode is enough to eliminate the need for floppy_open()
to play games with permission(9) and modify *file in any way.  Anything
close to current fdutils either passes O_RDWR |... or 3 | ... to open()
when it wants to do priveleged ioctls and then we get open_namei check
that we can write to file.  So we can check that bit saved in ->f_mode
for "can we do that kind of ioctl"; no need to mess with anything.

That's a *big* win, since this is eliminates the only place that uses
more than the fields mentioned above and the only place that tries to
modify struct file and check for modifications in later method calls.

So having done that, we get to the situation where ->open() only uses
inode->i_bdev and file->f_mode (and only reads these).  I.e. we will be
able to go for much saner
int (*open)(struct block_device *bdev, mode_t mode)

We even can kill that fake struct file.  At that point we have sanitized
->open(), no regressions and a chance to solve the problems with mode-dependent
cleanups on close().  However, to use that chance we'll need to do something
with ->release().

2) ->release() gets struct inode * and struct file * too.  It only uses
inode->i_bdev->bd_disk and, sometimes, file->f_mode.  Tries to use
file->f_mode, that is.  Even all way back in 2.2 when it was ->release()
of file_operations, we used to get NULL on operations like umount, etc.,
so users of file->f_mode/file->f_flags had to check for file != NULL
first.  And pray that all such users will want some reasonable default
behaviour.

For one thing, that's not true anymore.  E.g. pktcdvd.c does blkdev_get()
with O_NDELAY and blkdev_put() as matching close.  IOW, blkdev_put() can't
expect that matching open had no O_NDELAY.

For another, blkdev_put() is used by blkdev_close() since 2.3, so we _always_
get NULL now.  IOW, the cleanup-related logics doesn't work properly, period.

The obvious first step is to switch to
int (*release)(struct gendisk *disk, mode_t mode)
(there are good reasons why open wants bdev and release does not).  At least
that way we can start passing the right value without fake struct file, etc.

Now, the only thing we need is a way to pass that mode_t to blkdev_put().
IOW, blkdev_put() needs an extra argument.  And that's where the things
get really interesting.  blkdev_close() should just pass file->f_mode to
it; that's easy.  However, we have more callers and need case-by-case
analysis for those.

We don't have too many of those callers, fortunately, and almost all of them
are easy - we know what had matching blkdev_open() got and that's it.

Exceptions are interesting.
* we have reiserfs journal that could've been opened r/o or r/w.
BFD - we can store the mode_t just fine along with the reference to bdev.
* a bug (AFAICT) in md.c - we open raid components r/w and if it
fails, it fails.  Good for our purposes, but... how about raid0 on read-only
devices?  In any case, we have a ready place to store mode_t, so it's not a
problem for getting the right value to blkdev_put().  FWIW, I think that
allowing fallback to r/o (and making the entire array r/o, of course) would
be a good thing to have.  Any comments from drivers/md folks?
* open_bdev_excl()/close_bd

Re: Adding a security parameter to VFS functions

2007-08-16 Thread Al Viro
On Thu, Aug 16, 2007 at 03:57:24PM -0700, Linus Torvalds wrote:
> I personally consider this an affront to everythign that is decent.
> 
> Why the *hell* would mkdir() be so magical as to need something like that?
> 
> Make it something sane, like a "struct nameidata" instead, and make it at 
> least try to look like the path creation that is done by "open()".  Or 
> create a "struct file *" or something.

No.  I agree that mkdir/mknod/etc. should all take the same thing.
*However*, it should not be nameidata or file.  Why?  Because it
should not contain vfsmount.  Object creation should not *care*
where fs is mounted.  At all.  The same goes for open(), BTW - letting
it see the reference to vfsmount via nameidata is bloody wrong and
I really couldn't care less what kinds of perversions apparmour crowd
might like - pathnames simply do not belong there.

Said that, I agree that we need uniform prototypes for all these guys
and I can see arguments both for and against having creds passed by
reference stored in whatever struct we are passing (for: copy-on-write
refcounted cred objects would almost never have to be modified or
copied and normally we would just pick the current creds reference
from task_struct and assing it to a single field in whatever we pass
instead of nameidata; against: might be conceptually simpler to have
all that data directly in that struct and a helper filling it in).

Which way we do that and which struct we end up passing is a very good
question, but I want to make it very clear: having object creation/removal/
renaming methods[1] depend on which vfsmount we are dealing with is *wrong*.
IOW, I strongly object against the use of nameidata here.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] coda: file count cannot be used to discover last close

2007-07-19 Thread Al Viro
On Fri, Jul 20, 2007 at 12:10:00AM -0400, Jan Harkes wrote:
> I will try to find a clean way to block the close syscall until fput
> drops the last reference. However I realize that such an implementation
> would not be acceptable for other file systems, and there are some
> interesting unresolved details such as 'which close is going to be the
> last close'.

Simply impossible.

fd = open(...);
fd2 = dup(fd);

close(fd2);  <-- deadlock
close(fd);

Or
fd = open(...);
p = mmap(..., fd, ...);
close(fd);  <-- deadlock
...
munmap(p, ...);

The problem is that you are mixing two different operations:
* closing descriptor: descriptor doesn't refer to this opened file
anymore.  Can be done by close(), can be done by dup2(), can be done by
fcntl() analog of dup2(), can be done by execve(), can be done by exit().
And can be done by death-by-signal.
* closing opened file: no references (including descriptors) exist
anymore, no IO of any kind will be done on that IO channel.  Can happen
when descriptor gets closed, can happen when mapping gets destroyed (munmap,
etc.), can happen when SCM_RIGHTS datagram gets destroyed (including the
garbage collection), can happen when any syscall on that file finishes
after another thread has closed descriptor passed to that syscall, etc.

You _can't_ expect the errors from the latter to be reliably
passed to some process; for one thing, many files can get closed by
single system call (e.g. closing an AF_UNIX socket with pending SCM_RIGHTS
datagrams will flush those datagrams, dropping references to files we
tried to pass via them).

Why does CODA need special warranties in that area, anyway?
Related question: does fsync() force the writeback?
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] coda: kill file_count abuse

2007-07-19 Thread Al Viro
On Fri, Jul 20, 2007 at 12:36:01PM +1000, David Chinner wrote:
> To the context that dropped the last reference. It can't be
> reported to anything else

Oh, for fsck sake...

Send a datagram with SCM_RIGHTS in it.  Have all other references
to these files closed.  Now have close(2) kill the socket that
might eventually be used to receive the datagram.  Now, all these
files are closed.  Tell me, which error value would you report
to that caller of close() and how would it guess which files had
those been?

Consider munmap().  It can release the last reference to any number
of files (mmap them in different places, then munmap the entire
area).  Which error would you report?

Consider exit(), for crying out loud.  You have a file opened.
You fork a child.  You close the file in parent.  Child goes
away.  Who do you report that stuff?
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] coda: kill file_count abuse

2007-07-19 Thread Al Viro
On Fri, Jul 20, 2007 at 10:45:34AM +1000, David Chinner wrote:
> On Thu, Jul 19, 2007 at 06:16:00PM -0400, Jan Harkes wrote:
> > On Thu, Jul 19, 2007 at 11:45:08PM +0200, Christoph Hellwig wrote:
> > > ->release is the proper way to detect the last close of a file,
> > > file_count should never be used in filesystems.
> > 
> > Has been tried, the problem with that once ->release is called it is too
> > late to pass the the error back to close(2).
> 
> I think you'll find the problem is that fput() throws away the error
> from ->release, not that it's too late

Just where would that return value go?

BTW, the reason why checks for struct file refcount blow is not far
from that:

task A: write()
task B (sharing descriptor table with A): close()
task C (with another reference to struct file in question): close()
task A: return from write()

Now, the final fput() here happens in write().  In particular, no
call of close(2) sees refcount equal to 1.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] isofs: mounting to regular file may succeed

2007-07-17 Thread Al Viro
On Tue, Jul 17, 2007 at 12:04:07AM -0700, Andrew Morton wrote:

> I don't think any (all?) other filesystems perform checks like this.
> Is this something which can/should be performed at the VFS level?

I don't see why we _should_ check that; VFS checks that we don't
turn directory into non-directory and vice versa, same as for binding.
There is no reason why fs driver could not give you a single-node filesystem
with root being e.g. a block device node.  You could mount it on any
non-directory and get a block device seen there, etc.

IOW, if filesystem driver considers fs image it's working from
as invalid, it's up to filesystem driver.  _What_ is considered invalid
depends on fs type; what the driver cares to report as dubious is, again,
up to driver.  If isofs wants to warn about an ISO image that definitely
violates ISO9660, it's isofs decision.  VFS will work with it as long
as fs driver does...
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: *at syscalls for xattrs?

2007-07-16 Thread Al Viro
On Mon, Jul 16, 2007 at 09:56:10AM +0200, Jan Engelhardt wrote:
> >Just one question: what the bleeding hell for?  Not that the rest of
> >..at() family made any damn sense as an interface...
> 
> fd1 = open("dir1", O_DIRECTORY):
> fd2 = open("dir2", O_DIRECTORY);
> system("mount -t tmpfs none dir1");
> system("mount -t tmpfs none dir2");
> openat(fd1, "file1", O_RDWR | O_CREAT);
> openat(fd2, "file2", O_RDWR | O_CREAT);
> 
> If you have a better way to accomplish this, let me know. :)

To accomplish what, exactly?  Access to overmounted directory?
So bind it elsewhere and use that.  I still don't see the point -
neither of the interface nor of your example...
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: *at syscalls for xattrs?

2007-07-15 Thread Al Viro
On Sun, Jul 15, 2007 at 02:13:21PM -0700, Nicholas Miell wrote:
> 
> I suspect he was asking for 
> 
> int getxattrat(int fd, const char *path, const char *name, void *value, 
>   size_t size, int flags)
> int setxattrat(int fd, const char *path, const char *name, void *value,
>   size_t size, int xattrflags, int atflags)
> 
> rather than the ability to access xattrs as files.

Just one question: what the bleeding hell for?  Not that the rest of
..at() family made any damn sense as an interface...

> > > BTW, why is fstatat called fstatat and not statat? (Same goes for 
> > > futimesat.) It does not take a file descriptor for the file argument. 
> > > Otherwise we'd also need fopenat/funlinkat, etc. Any reasons?
> > 
> > Ulrich having an odd taste?
> 
> Solaris compatibility.

"Sun having no taste whatsoever"
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: *at syscalls for xattrs?

2007-07-15 Thread Al Viro
On Sun, Jul 15, 2007 at 09:46:27PM +0200, Jan Engelhardt wrote:
> Hi,
> 
> 
> recently, the family of *at() syscalls and functions (openat, fstatat, 
> etc.) have been added to Linux and Glibc, respectively.
> In short: I am missing xattr at functions :)

No.  They are not fscking forks.  They are almost as revolting, but
not quite on the same level.

> BTW, why is fstatat called fstatat and not statat? (Same goes for 
> futimesat.) It does not take a file descriptor for the file argument. 
> Otherwise we'd also need fopenat/funlinkat, etc. Any reasons?

Ulrich having an odd taste?
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC/PATCH 2/5] revoke: core code

2007-07-11 Thread Al Viro
On Wed, Jul 11, 2007 at 01:01:07PM +0300, Pekka J Enberg wrote:
> On Wed, 11 Jul 2007, Al Viro wrote:
> > BTW, read() or write() in progress might get rather unhappy if your
> > live replacement of ->f_mapping races with them...
> 
> For writes, we (1) never start any new operations after we've cleaned up 
> the file descriptor tables so (2) after we're done with do_fsync() we 
> never touch ->f_mapping again.

Er, no.  do_fsync() won't hit the sys_write() that is yet to enter
->write().  And you can't get rid of new callers _anyway_ (see
previous mail).
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC/PATCH 2/5] revoke: core code

2007-07-11 Thread Al Viro
On Wed, Jul 11, 2007 at 12:50:48PM +0300, Pekka J Enberg wrote:
> Hi Al,
> 
> On Wed, 11 Jul 2007, Al Viro wrote:
> > Better: I have the only opened descriptor for foo.  I send it to myself
> > as described above.  I close it.  revoke() is called, finds no opened
> > instances of foo in any descriptor tables and cheerfully does nothing.
> > I call recvmsg() and I have completely undamaged opened file back.
> 
> Uhm, nice. So, revoke() needs a proper inode -> struct files mapping 
> somewhere. Can we add a list of files to struct inode? Are there other 
> cases where a file can point to an inode but the file is not attached to 
> any file descriptor?

Umm...  Any number, really - it might be in the middle of syscall
while another task sharing descriptor table has closed the descriptor.
Then there's quota, then there's process accounting, then there's
execve() in progress, then there's knfsd working with that struct
file, etc.

The fundamental issue here is that even if you do find struct file,
you can't blindly rip its ->f_mapping since it can be in the middle
of ->read(), ->write(), pageout, etc.  And even if you do manage
that, you still have the ability to do fchmod() later.

I don't see how the ability to find all instances in SCM_RIGHTS
datagrams (for example) will help you with the race I've described
first.  Original state: task B has the only reference to file.
revoke() is called, passes task A.  B sends datagram to A and closes
file.  A receives datagram.  Now the only reference is in A's table
and you've already passed that.

So you can't avoid processes keeping pointers to struct file.
If you could find all struct file over given inode (which, I suspect,
will lead to interesting locking), you could call something on
that struct file, but you'd have zero exclusion with processes
calling methods on it.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC/PATCH 2/5] revoke: core code

2007-07-11 Thread Al Viro
On Wed, Jul 11, 2007 at 10:37:33AM +0100, Al Viro wrote:
> On Wed, Jul 11, 2007 at 12:01:06PM +0300, Pekka J Enberg wrote:
> > From: Pekka Enberg <[EMAIL PROTECTED]>
> > 
> > The revokeat(2) and frevoke(2) system calls invalidate open file descriptors
> > and shared mappings of an inode.  After an successful revocation, operations
> > on file descriptors fail with the EBADF or ENXIO error code for regular and
> > device files, respectively.  Attempting to read from or write to a revoked
> > mapping causes SIGBUS.
> > 
> > The actual operation is done in two passes:
> > 
> >  1. Revoke all file descriptors that point to the given inode. We do
> > this under tasklist_lock so that after this pass, we don't need
> > to worry about racing with close(2) or dup(2).
> 
> How does that deal with the following:
> 
> task A gets its descriptor table cleansed
> task B sends a descriptor to task A via SCM_RIGHTS datagram
> task B gets its descriptor table cleansed
> task A receives the datagram and gets the descriptor installed
> task A does any kind of IO on that descriptor
> ->f_mapping gets replaced in the most inconvenient time.
> 
> Come to think of that, what do you do if I create a socketpair,
> stuff the descriptor into SCM_RIGHTS datagram and send it to
> myself?  Then receive it a day after you've called revoke() and
> voila - I've got an almost undamaged struct file back...  At
> the very least, it allows me to fchmod().  Or fchdir() if that
> had been a directory...
> 
> BTW, read() or write() in progress might get rather unhappy if your
> live replacement of ->f_mapping races with them...

Better: I have the only opened descriptor for foo.  I send it to myself
as described above.  I close it.  revoke() is called, finds no opened
instances of foo in any descriptor tables and cheerfully does nothing.
I call recvmsg() and I have completely undamaged opened file back.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC/PATCH 2/5] revoke: core code

2007-07-11 Thread Al Viro
On Wed, Jul 11, 2007 at 12:01:06PM +0300, Pekka J Enberg wrote:
> From: Pekka Enberg <[EMAIL PROTECTED]>
> 
> The revokeat(2) and frevoke(2) system calls invalidate open file descriptors
> and shared mappings of an inode.  After an successful revocation, operations
> on file descriptors fail with the EBADF or ENXIO error code for regular and
> device files, respectively.  Attempting to read from or write to a revoked
> mapping causes SIGBUS.
> 
> The actual operation is done in two passes:
> 
>  1. Revoke all file descriptors that point to the given inode. We do
> this under tasklist_lock so that after this pass, we don't need
> to worry about racing with close(2) or dup(2).

How does that deal with the following:

task A gets its descriptor table cleansed
task B sends a descriptor to task A via SCM_RIGHTS datagram
task B gets its descriptor table cleansed
task A receives the datagram and gets the descriptor installed
task A does any kind of IO on that descriptor
->f_mapping gets replaced in the most inconvenient time.

Come to think of that, what do you do if I create a socketpair,
stuff the descriptor into SCM_RIGHTS datagram and send it to
myself?  Then receive it a day after you've called revoke() and
voila - I've got an almost undamaged struct file back...  At
the very least, it allows me to fchmod().  Or fchdir() if that
had been a directory...

BTW, read() or write() in progress might get rather unhappy if your
live replacement of ->f_mapping races with them...
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Adding subroot information to /proc/mounts, or obtaining that through other means

2007-06-20 Thread Al Viro
On Wed, Jun 20, 2007 at 01:57:33PM -0700, H. Peter Anvin wrote:
> ... or, alternatively, add a subfield to the first field (which would
> entail escaping whatever separator we choose):
> 
> /dev/md6 /export ext3 rw,data=ordered 0 0
> /dev/md6:/users/foo /home/foo ext3 rw,data=ordered 0 0
> /dev/md6:/users/bar /home/bar ext3 rw,data=ordered 0 0

Hell, no.  The first field is in principle impossible to parse unless
you know the fs type.

How about making a new file with sane format?  From the very
beginning.  E.g. mountpoint + ID + relative path + type + options,
where ID uniquely identifies superblock (e.g. numeric st_dev)
and backing device (if any) is sitting among the options...
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [AppArmor 01/41] Pass struct vfsmount to the inode_create LSM hook

2007-05-24 Thread Al Viro
On Thu, May 24, 2007 at 08:10:00PM +0200, Andreas Gruenbacher wrote:

> Read it like this: we don't have a good idea how to support multiple 
> namespaces so far. Currently, we interpret all pathnames relative to the 
> namespace a process is in. Confined processes don't have the privilege to 
> create or manipulate namespaces, which makes this safe. We may find a better 
> future solution.

You also don't have a solution for multiple chroot jails, since they
often have the same fs mounted in many places on given box *and* since
the pathnames from the confined processes' POVs have fsck-all to do
with each other.

It's really not kinder than multiple namespaces as far as your approach
is concerned.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] file as directory

2007-05-23 Thread Al Viro
On Wed, May 23, 2007 at 05:25:49PM +0200, Miklos Szeredi wrote:
> > > How will this work with copy_tree() and namespace duplication, which
> > > currently walk the tree with only namespace_sem held?
> > 
> > Easy - grab namespace_sem, grab vfsmount_lock, walk the subtree and bump
> > mnt_busy on everything (by 1 + number of non-busy children).  Then drop
> > vfsmount_lock and do as usual, dropping references in tree being copied
> > as you go.  Nothing will get attached or detached due to namespace_sem,
> > nothing will get evicted by anybody other than you since you've got all
> > that stuff pinned down.  End of story...
> 
> Right.
> 
> Do you have some code?
> 
> Should I try to code something up?

I hope to get some breathing space next week, then I'll get back to
VFS work.  I'd rather do that one myself, since it'll be a long series
of equivalent transformations - debugging such rewrite of refcounting
done as a single patch is going to be hell.  And yes, refcounting rewrite
is near the top of the list (another thing is wading through several
threads from hell and reviewing unionfs ;-/)
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] file as directory

2007-05-23 Thread Al Viro
On Wed, May 23, 2007 at 04:32:37PM +0200, Miklos Szeredi wrote:
> > Umm...  It is related to detached subtrees, but I'm not sure if it is what
> > you are thinking about.
> 
> I was thinking of a similar one by Mike Waychison.  It had the problem
> of requiring a spinlock for mntget/mntput.  It was also different in
> that it did not gradually dissolve detached trees, but kept them as
> whole blobs until the last ref went away.
 
Here the spinlock is needed only when mnt_busy goes to 0, so presumably
it won't be a serious problem on more or less common setups; however,
it certainly would need serious profiling.

> How will this work with copy_tree() and namespace duplication, which
> currently walk the tree with only namespace_sem held?

Easy - grab namespace_sem, grab vfsmount_lock, walk the subtree and bump
mnt_busy on everything (by 1 + number of non-busy children).  Then drop
vfsmount_lock and do as usual, dropping references in tree being copied
as you go.  Nothing will get attached or detached due to namespace_sem,
nothing will get evicted by anybody other than you since you've got all
that stuff pinned down.  End of story...
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] file as directory

2007-05-23 Thread Al Viro
On Wed, May 23, 2007 at 03:23:54PM +0200, Ph. Marek wrote:
> How about some special node in eg. /proc (or a new filesystem)?
> Eg.
>/fileAsDir/etc/passwd/owner ...
> would work for all *files*. For directories we do not know whether we're 
> still 
> climbing the hierarchy or would like to see meta-data.

So we need to make *anything* done anywhere in the namespace to modify
the dentry tree on that fs.  Could you spell "fuck, NO"?
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] file as directory

2007-05-23 Thread Al Viro
On Wed, May 23, 2007 at 03:01:38PM +0200, Miklos Szeredi wrote:
> Someone might think of a way to make those work with directories.
> Invisible directory entries, anyone?  

Not unless you manage to get working union-mount [*NOT* unionfs]
 
> > * invalidation on unlink is still an open problem.
> > * locking in final mntput() doesn't look nice; we probably need
> > a new refcounting scheme for vfsmounts to make that work.  I have a variant
> > that might work here (and make life much easier for expiry logics in
> > automount/shared trees, which is what it had been initially proposed for),
> 
> Which variant?  We had that "detached subtrees" thing, is that it?

Umm...  It is related to detached subtrees, but I'm not sure if it is what
you are thinking about.

Short version of the story: new counter (mnt_busy) that would be defined
in the following way: the number of external references (not due to the
vfsmount tree structure or from namespace to root) + the number of
children that have non-zero ->mnt_busy.  And a per-vfsmount flag ("goner").

The rules for handling ->mnt_busy:
* duplicating external reference: increment m->mnt_busy
* getting from m to child: increment child->mnt_busy, if it went
from 0 to non-zero - increment m->mnt_busy as well (that's done under
vfsmount_lock, so we can safely check for zero here).
* getting from m to parent: increment parent->mnt_busy.
* dropping external reference: decrement m->mnt_busy; if it's still
non-zero, we are done.  If it's zero, we are in for some work (and had
acquired vfsmount_lock by atomic_dec_and_lock()).  Here's what we do:
* go through ancestors, decrementing ->mnt_busy, until we
  hit the root or get to one with ->mnt_busy staying
  non-zero.
* find the most remote ancestor that has zero ->mnt_busy
  and is marked as goner (might be m itself).
* if no such beast exists, we are done.
* otherwise, detach the subtree rooted in that ancestor
  from its parent (if any) and unhash its root (if hashed).
  Now there is no external references to any vfsmount in that
  subtree.
* now we can kill all vfsmounts in that subtree.
* detaching m from parent: nothing; we trade a busy child of parent
for new external reference to parent.
* lazy umount: in addition to detaching everything from parents
and dropping resulting external references to parents, mark everything
in the subtree as goners.
* normal umount: check ->mnt_busy *and* lack of children, detach,
mark as goner, drop resulting external reference to parent.
* fun new stuff - umount of intact subtree: detach the subtree from
parent, do *not* dissolve it, mark everything in subtree as goners.  If
something we mark as goner is not busy, we can kill it and all its descendents.
The subtree will be shrinking as its pieces lose external references.
* check for expirability: "we hold an external reference to m and
m->mnt_busy is 1".  No need to look into children, etc.
* your vfsmounts: simply mark them goners from the very beginning.
 
> > but it still doesn't kill the need to deal with invalidation.  And
> > yes, NFS still needs it (and so do all network filesystems, really).
> > The question of caching is related to that.
> 
> So what's so special about invalidation?  Why not just treat
> dir-on-file mounts the same as any other ref on the dentry?

Because of the case of having something mounted in that subtree.  The
current code doesn't even try to evict such stuff.  NFS *does*, but
it's not in position to do that decently (not NFS fault, it's just that
we don't have the data needed for it).

Note that one problem we used to have back then is gone - namely, per-namespace
semaphores.  It's a global semaphore now, so we *can* do cross-namespace
rogering of mount trees without that kind of locking horrors.

What we really need is "go through dentry subtree, try to evict everything
we can, for anything that has stuff mounted on it go through all such
vfsmounts and kick them and all their descendents out".  That's what should
happen on invalidation.  From generic code, so that NFS wouldn't have to
bother.

And _that_ is what we could call from ->unlink() on your inode - would take
care of submounts.

Note that I'm not all that happy about this scheme; we might make it work,
but I still want to see a good use scenario for that kind of stuff.
Invalidation logics is a separate story - it's simply needed for existing
stuff; that area sucks *badly*, regardless of adding these hybrid objects.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] file as directory

2007-05-23 Thread Al Viro
On Wed, May 23, 2007 at 08:34:42AM -0400, Trond Myklebust wrote:
> On Wed, 2007-05-23 at 08:36 +0100, Al Viro wrote:
> > On Wed, May 23, 2007 at 09:19:17AM +0200, Miklos Szeredi wrote:
> > > > Eh...  Arbitrary limitations are fun, aren't they?
> > > 
> > > But these mounts _are_ special.  There is really no point in moving or
> > > pivoting them.
> > 
> > pivoting - probably true, moving... why not?
> 
> Moving would be an implementation artefact that doesn't really
> correspond to any useful operation on the filesyst
> 
> AFAIK, most filesystems that have implemented subfiles (excepting
> Reiser4 of course) do not allow you to rename or move the subfile
> directory or its contents from one parent file to another.

If that's about xattr and nothing else, colour me thoroughly uninterested.
If it might have other interesting uses, OTOH...
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] file as directory

2007-05-23 Thread Al Viro
On Wed, May 23, 2007 at 12:39:25PM +0100, Al Viro wrote:
> Then I do not understand what this mechanism could be used for, other
> than an odd way to twist POSIX behaviour and see how much of the userland
> would survive that.  Certainly not useful for your "look into tarball
> as a tree", unless you seriously want to scan the entire damn fs for
> tarballs at mount time and set up a superblock for each.  And for per-file
> extended attributes/forks/whatever-you-call-that-abomination it also
> obviously doesn't help, since you lose them for directories.
> 
> IOW, what uses do you have in mind?  Complete scenario, please...

Ah... After rereading the thread you've mentioned in the very beginning,
I think I understand what you are driving at.  However, in that case
* I really don't see why bother with returning vfsmount at all.
dentry alone is enough to create a new vfsmount, all in fs/namei.c.
* the lifetime rules look fscking scary.  You call that ->enter()
on nearly every damn lookup.  OK, so you'll recreate equivalent vfsmount,
but...  That's a lot of allocations/freeing.  Can we do some caching and
deal with it on memory pressure?
* invalidation on unlink is still an open problem.
* locking in final mntput() doesn't look nice; we probably need
a new refcounting scheme for vfsmounts to make that work.  I have a variant
that might work here (and make life much easier for expiry logics in
automount/shared trees, which is what it had been initially proposed for),
but it still doesn't kill the need to deal with invalidation.  And yes,
NFS still needs it (and so do all network filesystems, really).  The question
of caching is related to that.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] file as directory

2007-05-23 Thread Al Viro
> When the real superblock is created.  It could even be the _same_
> super block as the real one.  There'd be just the problem of anchoring
> the dir-on-file dentries somewhere...
> 
> Or with fuse the dir-on-file mount can just come from any mounted
> filesystem, again possibly the same one as the parent.  I do actually
> test with this.  The userspace filesystem supplies a file descriptor,
> from which the struct path is extracted and returned from ->enter().

Then I do not understand what this mechanism could be used for, other
than an odd way to twist POSIX behaviour and see how much of the userland
would survive that.  Certainly not useful for your "look into tarball
as a tree", unless you seriously want to scan the entire damn fs for
tarballs at mount time and set up a superblock for each.  And for per-file
extended attributes/forks/whatever-you-call-that-abomination it also
obviously doesn't help, since you lose them for directories.

IOW, what uses do you have in mind?  Complete scenario, please...
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] file as directory

2007-05-23 Thread Al Viro
On Wed, May 23, 2007 at 12:09:19PM +0200, Miklos Szeredi wrote:
> Right.  After locking vfsmount_lock, mount_dironfile() should recheck
> if there was a race and bail out.

Owww...  Not pretty, that...
 
> I don't think the filesystem ought to try _creating_ a vfsmount.  I
> imagine, that the fs has already a kernel-internal mounted for this
> kind of stuff, and it just supplies a dentry from that.  The vfsmount
> isn't actually important, but it should be readily available, and it's
> easier to clone from a vfsmount/dentry pair.

I don't get it.  What's the point of that exercise, then?  When do you
create that kernel-internal mount?

> As I said, the superblock should be persistent, so we'll get a stable
> st_dev for multiple mounts.

OK, but then I guess I don't understand the intended use.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] file as directory

2007-05-23 Thread Al Viro
On Wed, May 23, 2007 at 11:03:08AM +0200, Miklos Szeredi wrote:
> I still don't get it where the superblock comes in.  The locking is
> "interesting" in there, yes.  And I haven't completely convinced
> myself it's right, let alone something that won't easily be screwed up
> in the future.  So there's definitely room for thought there.
> 
> But how does it matter if two different paths have the same sb or a
> different sb mounted over them?
 
Because then you get a slew of fun issues with dropping the final reference
to vfsmount vs. lookup on another place.  What hold do you have on that
superblock and when do you switch from "oh, called ->enter() on the same
inode again, return vfsmount over the same superblock" to "need to
initialize that damn superblock, all mounts are gone"?

> The same dentry is mounted over each one.  The contents of the
> directory should only depend on the contents of the underlying inode.
> The path leading up to it is completely irrelevant.

So what kind of exclusion do you have for ->enter()?  None?
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] file as directory

2007-05-23 Thread Al Viro
On Tue, May 22, 2007 at 08:48:49PM +0200, Miklos Szeredi wrote:
>   */
> -static int __follow_mount(struct path *path)
> +static int __follow_mount(struct path *path, bool enter)
>  {
>   int res = 0;
>   while (d_mountpoint(path->dentry)) {
> - struct vfsmount *mounted = lookup_mnt(path->mnt, path->dentry);
> + struct vfsmount *mounted =
> + lookup_mnt(path->mnt, path->dentry, enter);
> +
>   if (!mounted)
>   break;
>   dput(path->dentry);
> @@ -689,27 +697,37 @@ static int __follow_mount(struct path *p
>   return res;
>  }
>  
> -static void follow_mount(struct vfsmount **mnt, struct dentry **dentry)
> +/*
> + * Follows mounts on the given nameidata.
> + *
> + * Only follows "directory on file" mounts if LOOKUP_ENTER is set.
> + */
> +void follow_mount(struct nameidata *nd)

BTW, I'd split that (and matching updates in callers) into separate
patch.

>  {
> - while (d_mountpoint(*dentry)) {
> - struct vfsmount *mounted = lookup_mnt(*mnt, *dentry);
> + while (d_mountpoint(nd->dentry)) {
> + bool enter = nd->flags & LOOKUP_ENTER;

int, surely?

> + * This is called if the object has no ->lookup() defined, yet the
> + * path contains a slash after the object name.
> + *
> + * If the filesystem defines an ->enter() method, this will be called,
> + * and the filesystem shall fill the supplied struct path or return an
> + * error.
> + *
> + * The returned path will be bind mounted on top of the object with
> + * the MNT_DIRONFILE flag, and the nameidata will descend into the
> + * mount.
> + */
> +static int enter_file(struct inode *inode, struct nameidata *nd)
> +{
> + int err;
> + struct path newpath;
> +
> + printk(KERN_DEBUG "%s/%d enter %s/\n", current->comm, current->pid,
> +nd->dentry->d_name.name);
> + if (!inode->i_op->enter)
> + return -ENOTDIR;
> +
> + newpath.mnt = NULL;
> + newpath.dentry = NULL;
> + err = inode->i_op->enter(nd, &newpath);
> + if (!err) {
> + err = mount_dironfile(nd, &newpath);
> + pathput(&newpath);
> + }
> + return err;

Ouch.  What guarantees that two lookups won't race right here?  You are
not holding any locks at that point, AFAICS...

BTW, why newpath?  What's wrong with simply returning a new vfsmount
with right ->mnt_root/->mnt_sb (instead of creating it inside
mount_dironfile())?  ERR_PTR() for error, struct vfsmount * for success...

> @@ -301,8 +310,8 @@ static struct vfsmount *clone_mnt(struct
>   mnt->mnt_mountpoint = mnt->mnt_root;
>   mnt->mnt_parent = mnt;
>  
> - /* don't copy the MNT_USER flag */
> - mnt->mnt_flags &= ~MNT_USER;
> + /* don't copy some flags */
> + mnt->mnt_flags &= ~(MNT_USER | MNT_DIRONFILE);
>   if (flag & CL_SETUSER)
>   __set_mnt_user(mnt, owner);

Hmm?  So you do copy them and strip your MNT_DIRONFILE from copies?

> + * This is tricky, because for namespace modification we must take the
> + * namespace semaphore.  But mntput() is called from various places,
> + * sometimes with namespace_sem held.  Fortunately in those places the
> + * mount cannot yet have MNT_DIRONFILE, or at least that's what I
> + * hope...
> + *
> + * The umounting is done in two stages, first the mount is removed
> + * from the hashes.  This is done atomically wrt other mount lookups,
> + * so it's not possible to acquire a new ref to this dead mount that
> + * way.
> + *
> + * Then after having locked namespace_sem and relocked vfsmount_lock,
> + * the mount is properly detached.
> + */
> +static void umount_dironfile(struct vfsmount *mnt)
> + __releases(vfsmount_lock)
> +{
> + struct nameidata nd;

You've got to be kidding.  nameidata is *big*.  If anything, we want
to make detach_mnt() take struct path * instead, but even that is
lousy due to recursion.

I really don't like what's going on here.  The thing is, current code
is based on assumption that presence in the mount tree => holding a
reference.  We _might_ deal with that (there was an old plan to change
refcounting logics for vfsmounts), but that sort of games with locks
spells trouble.  What happens, for example, if namespace gets cloned
before you grab namespace_sem?

There's another problem, BTW - a lot of stuff does stat + open + fstat +
compare kind of sequence.  You'll end up mounting/umounting between stat
and open, which opens you to race with somebody else.  Get a different
st_dev, eat a nice unreproducible error from application...
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] file as directory

2007-05-23 Thread Al Viro
On Wed, May 23, 2007 at 10:05:21AM +0200, Miklos Szeredi wrote:
> > Er...  These mounts might not be propagated, but what about a bind
> > over another instance of such file in master tree?
> 
> So your question is, which mount takes priority on the lookup?  It
> probably should be the propagated real mount, rather than the
> dir-on-file one, shouldn't it?

There might be dragons in that area...
 
> > > I think they should be the same superblock, same dentry.  What would
> > > be the advantage of doing otherwise?
> > 
> > Then you are going to have interesting time with locking in final mntput().
> 
> Final mntput of what?

When the last reference to your mount goes away.
 
> > BTW, what about having several links to the same file?  You have i_mutex
> > on the inode, so serialization of those is not a problem, but...
> 
> Sorry, I lost it...

Say /foo/bar/a is such a file.

cd /foo/bar
ln a b

now do lookups on a/ and b/

What happens?
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] file as directory

2007-05-23 Thread Al Viro
On Wed, May 23, 2007 at 09:19:17AM +0200, Miklos Szeredi wrote:
> > Eh...  Arbitrary limitations are fun, aren't they?
> 
> But these mounts _are_ special.  There is really no point in moving or
> pivoting them.

pivoting - probably true, moving... why not?
 
> > What about MNT_SLAVE stuff being set up prior to that lookup?
> 
> These mounts are not propagated.  Or at least I hope so.  Propagation
> stuff is a bit too complicated for my poor little brain.

Er...  These mounts might not be propagated, but what about a bind
over another instance of such file in master tree?
 
> I think they should be the same superblock, same dentry.  What would
> be the advantage of doing otherwise?

Then you are going to have interesting time with locking in final mntput().
BTW, what about having several links to the same file?  You have i_mutex
on the inode, so serialization of those is not a problem, but...
 
> I think doing this recursively should be allowed.  "Releasing last ref
> cleans up the mess" should work in that case.

Releasing the last reference will lead to cascade of umounts in that
case...  IOW, need to be careful with locking.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] file as directory

2007-05-23 Thread Al Viro
On Wed, May 23, 2007 at 08:36:04AM +0200, Miklos Szeredi wrote:
> > Interesting...  How do you deal with mount propagation and things like
> > mount --move?
> 
> Moving (or doing other mount operations on) an ancestor shouldn't be a
> problem.  Moving this mount itself is not allowed, and neither is
> doing bind or pivot_root.  Maybe bind could be allowed...

Eh...  Arbitrary limitations are fun, aren't they?
 
> When doing recursive bind on ancestor, these mounts are skipped.

What about clone copying your namespace?  What about MNT_SLAVE stuff being
set up prior to that lookup?  More interesting question: should independent
lookups of that sucker on different paths end up with the same superblock
(and vfsmount for each) or should we get fully independent mount on each?
The latter would be interesting wrt cache coherency...

> > As for unlink...  How do you deal with having that thing
> > mounted, mounting something _under_ it (so that vfsmount would be kept
> > busy) and then unlinking that sucker?
> 
> Yeah, that's a good point.  Current patch doesn't deal with that.
> Simplest solution could be to disallow submounting these.  Don't think
> it makes much sense anyway.

Arbitrary limitations... (and that's where revalidate horrors come in, BTW).
BTW^2: what if fs mounted that way will happen to have such node itself?

I'm not saying that it's unfeasible or won't lead to interesting things,
but it really needs semantics done right...
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] file as directory

2007-05-22 Thread Al Viro
On Tue, May 22, 2007 at 08:48:49PM +0200, Miklos Szeredi wrote:
> Why do we want this?
> 
> 
> That depends on who you ask.  My answer is this:
> 
>   'foo.tar.gz/foo/bar' or
>   'foo.tar.gz/contents/foo/bar'
> 
> or something similar.
> 
> Others might suggest accessing streams, resource forks or extended
> attributes through such an interface.  However this patch only deals
> with the non-directory case, so directories would be excluded from
> that interface.
> 
> But otherwise this patch doesn't limit the uses of the "file as
> directory" concept in any way.  It just adds the infrastructure to
> support these whacky beasts.
> 
> How is it done?
> ---
> 
> (See this [1] thread for more discussion on the subject)
> 
> When a non-directory object is accessed without a trailing slash, then
> path resolution returns the object itself as usual.
> 
> If a non-directory object is accessed with a trailing slash, then the
> filesystem may opt to let the file be accessed as a directory.  In
> this case "something" (as supplied by the filesystem) is mounted on
> top of the non-directory object.
> 
> This mount will have special properties:
> 
>  - If there's no trailing slash is after the file name, the mount
>won't be followed, even if the path resolution would otherwise
>follow mounts.
> 
>  - The mount only stays there while it is referenced by some external
>object, like a pwd or an open file.  When it is no longer
>referenced, it is automatically unmounted.
> 
>  - Unlike "real" mounts, this won't block unlink(2) or rename(2) on
>the underlying object.

Interesting...  How do you deal with mount propagation and things like
mount --move?  As for unlink...  How do you deal with having that thing
mounted, mounting something _under_ it (so that vfsmount would be kept
busy) and then unlinking that sucker?

I'll look through the patch tonight; it sounds interesting, assuming that
we don't run into serious crap with locking and  revalidation
logics.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [AppArmor 40/41] AppArmor: all the rest

2007-04-12 Thread Al Viro
On Thu, Apr 12, 2007 at 11:32:00AM +0100, Al Viro wrote:
> On Thu, Apr 12, 2007 at 02:08:49AM -0700, [EMAIL PROTECTED] wrote:
> > +   } else if (profile1 > profile2) {
> > +   /* profile1 cannot be NULL here. */
> > +   spin_lock_irqsave(&profile1->lock, profile1->int_flags);
> > +   if (profile2)
> > +   spin_lock(&profile2->lock);
> > +
> > +   } else {
> > +   /* profile2 cannot be NULL here. */
> > +   spin_lock_irqsave(&profile2->lock, profile2->int_flags);
> > +   spin_lock(&profile1->lock);
> > +   }
> 
> Ahem...
> 
> profile2 is locked individually.  profile1 > profile2.  profile1 is not
> locked.  We try to lock both.  profile1 is locked OK, flags (with interrupts
> disabled) are stored into it.  We spin trying to lock profile2.  Eventually,
> whoever had held profile2 unlocks it, restoring the flags from profile2.
> We happily grab the spinlock and move on.  When we unlock the pair, we
> restore flags from profile1.  I.e. we are left with interrupts disabled.

Please, ignore - shouldn't have posted without coffee...  Flags would be
for different CPUs in that case, obviously.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [AppArmor 33/41] Add d_namespace_path() to obtain namespace relative pathnames

2007-04-12 Thread Al Viro
> +char *d_namespace_path(struct dentry *dentry, struct vfsmount *vfsmnt,
> +char *buf, int buflen)
> +{
> + char *res;
> + struct vfsmount *rootmnt, *nsrootmnt;
> + struct dentry *root;
> +
> + read_lock(¤t->fs->lock);
> + rootmnt = mntget(current->fs->rootmnt);
> + read_unlock(¤t->fs->lock);
> + spin_lock(&vfsmount_lock);
> + nsrootmnt = mntget(rootmnt->mnt_ns->root);

... and when somebody does umount -l on your chroot jail, you get
NULL ->mnt_ns.  Oops...
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [AppArmor 40/41] AppArmor: all the rest

2007-04-12 Thread Al Viro
On Thu, Apr 12, 2007 at 02:08:49AM -0700, [EMAIL PROTECTED] wrote:
> + } else if (profile1 > profile2) {
> + /* profile1 cannot be NULL here. */
> + spin_lock_irqsave(&profile1->lock, profile1->int_flags);
> + if (profile2)
> + spin_lock(&profile2->lock);
> +
> + } else {
> + /* profile2 cannot be NULL here. */
> + spin_lock_irqsave(&profile2->lock, profile2->int_flags);
> + spin_lock(&profile1->lock);
> + }

Ahem...

profile2 is locked individually.  profile1 > profile2.  profile1 is not
locked.  We try to lock both.  profile1 is locked OK, flags (with interrupts
disabled) are stored into it.  We spin trying to lock profile2.  Eventually,
whoever had held profile2 unlocks it, restoring the flags from profile2.
We happily grab the spinlock and move on.  When we unlock the pair, we
restore flags from profile1.  I.e. we are left with interrupts disabled.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [AppArmor 01/41] Pass struct vfsmount to the inode_create LSM hook

2007-04-12 Thread Al Viro
On Thu, Apr 12, 2007 at 02:08:10AM -0700, [EMAIL PROTECTED] wrote:
> This is needed for computing pathnames in the AppArmor LSM.

Which is an argument against said LSM in current form.


> - error = security_inode_create(dir, dentry, mode);
> + error = security_inode_create(dir, dentry, nd ? nd->mnt : NULL, mode);

is a clear sign that interface is wrong.  Leaving aside the general
idiocy of "we prohibit to do something with file if mounted here,
but if there's another mountpoint, well, we just miss", an API of that
kind is just plain wrong.  Either you can live without seeing vfsmount
in that method (in which case you shouldn't pass it at all), or you
have a hole.

NACK.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Move across mount,sb

2007-03-14 Thread Al Viro
On Thu, Mar 15, 2007 at 02:20:25AM +0100, Jan Engelhardt wrote:
> Why is EXDEV returned when _vfs mountpoints_ are crossed? Should not it 
> be more like the following?
> 
>   error = -EXDEV;
>   if (oldnd.mnt->mnt_sb != newnd.mnt->mnt_sb)
>   goto exit2;
> 

No.  This is absolutely deliberate - mountpoint creates a boundary and
such boundaries are very useful for restricting modifications of filesystem.
IOW, it's not a bug and it applies to other operations as well (link(), for
example).
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ext2: conditional removal of NFSD code

2007-01-06 Thread Al Viro
On Sun, Jan 07, 2007 at 12:44:56AM +0300, Alexey Dobriyan wrote:
> > #if defined(CONFIG_EXPORTFS) || defined(CONFIG_EXPORTFS_MODULE)
> > #define set_export_ops(sb, ops) sb->s_export_op = ops
> > #else
> > #define set_export_ops(sb, ops) 0
> > #endif
> >
> > That way you can get rid of the function pointer from the struct
> > superblock too.
> 
> Exactly! I've just started with filesystems I use.
> 
> it should be wrapped in do {} while 0, of course.

What the hell for?  Repeat after me:
* do {} while(0) is always inferior to ((void)0)
* do { expr; } while(0) is always inferior to ((void)(expr))
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Is a NULL check missing in nfs_lookup?

2007-01-05 Thread Al Viro
On Fri, Jan 05, 2007 at 12:11:17PM -0500, Shaya Potter wrote:
> ok, I guess what I don't understand is when are
> 
> dentry->d_sb->s_root and nd->mnt->mnt_root not equivalent.
> 
> I guess it would be "when crossing a mountpoint on the server", so I
> look at nfs_follow_mountpoint() and basically see that you create a new
> mountpoint and a new nd for that.  And since superblock objects are only
> per "real" file system not per mount point, they will be different.
> 
> I guess the question is, why shouldn't a dentry object know what
> vfsmount it belongs to is?  Can it belong to multiple vfsmount objects?

Yes.  dentry belongs to superblock.  vfsmount refers to a subtree of some
superblock's dentry tree.  There can be any number of vfsmounts refering
to subtrees of the same fs.  Some might refer to the entire tree, some -
to its arbitrary subtrees (possibly as small as a single file).  There
can be any number of vfsmounts refering to any given subtree.

Think what happens when you create a binding.  Or when you clone an entire
namespace.  (Pieces of) the same filesystem might be mounted in a lot
of places.  vfsmount describes a part of unified tree delimited by
mountpoints; if the same fs (or its parts) are seen under several
mountpoints, you get vfsmounts that refer to the same fs instance, i.e.
the same superblock and dentry tree.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Finding hardlinks

2006-12-20 Thread Al Viro
On Wed, Dec 20, 2006 at 05:50:11PM +0100, Miklos Szeredi wrote:
> I don't see any problems with changing struct kstat.  There would be
> reservations against changing inode.i_ino though.
> 
> So filesystems that have 64bit inodes will need a specialized
> getattr() method instead of generic_fillattr().

And they are already free to do so.  And no, struct kstat doesn't need
to be changed - it has u64 ino already.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] ensure i_ino uniqueness in filesystems without permanent inode numbers (via idr)

2006-11-16 Thread Al Viro
On Wed, Nov 15, 2006 at 11:42:38AM -0500, Jeff Layton wrote:
> +{
> + int rv;
> +
> + rv = idr_pre_get(&inode->i_sb->s_inode_ids, GFP_KERNEL);
> + if (! rv)
> + return -ENOMEM;
> +
> + lock_super(inode->i_sb);

[EMAIL PROTECTED]

Please, use something saner.  Use of lock_super() for anything generic
is wrong; using it for something that'd better be fast is even dumber...

> @@ -1025,6 +1055,7 @@ void generic_delete_inode(struct inode *
>   spin_lock(&inode_lock);
>   hlist_del_init(&inode->i_hash);
>   spin_unlock(&inode_lock);
> + iunique_unregister(inode);

Unconditional?  Hitting every fs out there?  With that kind of locking?

>   wake_up_inode(inode);
>   BUG_ON(inode->i_state != I_CLEAR);
>   destroy_inode(inode);
> @@ -1057,6 +1088,7 @@ static void generic_forget_inode(struct 
>   inode->i_state |= I_FREEING;
>   inodes_stat.nr_inodes--;
>   spin_unlock(&inode_lock);
> + iunique_unregister(inode);

Ditto.

>   if (inode->i_data.nrpages)
>   truncate_inode_pages(&inode->i_data, 0);
>   clear_inode(inode);
> diff --git a/fs/pipe.c b/fs/pipe.c
> index b1626f2..d74ae65 100644
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -845,6 +845,9 @@ static struct inode * get_pipe_inode(voi
>   if (!inode)
>   goto fail_inode;
>  
> + if (iunique_register(inode, 0))
> + goto fail_iput;
> +

Humm...  I wonder what the overhead of that is going to be.  There
easily can be shitloads of pipes on the box, with all sorts of
lifetimes.  And we'd better be fast on that codepath...
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/3] New system call, unshare

2005-08-22 Thread Al Viro
On Wed, Aug 10, 2005 at 04:08:31PM +0200, Florian Weimer wrote:
> * Janak Desai:
> 
> > With unshare, namespace setup can be done using PAM session
> > management functions without patching individual commands.
> 
> I don't think it's a good idea to use security-critical code well
> without its original specification.  Clearly the current situation
> sucks, but this is mainly a lack of PAM functionality, IMHO.

Eh?  We are talking about a primitive that has far more uses than
PAM.  This is a missing piece of the stuff done by clone() and fork():
each task is a virtual machine with sharable components.  We can
get a copy of machine  with arbitrary set of components replaced with
private copies.  That's what clone() and fork() do.  The thing missing
from that set is taking a component (VM, descriptors, etc.) of process
itself and making it private.  The same thing we do on fork(), but
without creating a new process.

FWIW, I'm OK with that.  IIRC, Linus ACKed the concept some time ago.
PAM is one obvious use, but there's are other situations where the lack
of that primitive is inconvenient...
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] New system call, unshare

2005-08-22 Thread Al Viro
On Mon, Aug 08, 2005 at 03:46:06PM +0100, Alan Cox wrote:
> On Llu, 2005-08-08 at 09:33 -0400, Janak Desai wrote:
> > 
> > [PATCH 1/2] unshare system call: System Call handler function sys_unshare
> 
> 
> Given the complexity of the kernel code involved and the obscurity of
> the functionality why not just do another clone() in userspace to
> unshare the things you want to unshare and then _exit the parent ?

Because you want to keep children?  Because you don't want to deal with
the implications for sessions/groups/etc.?

FWIW, syscall makes sense.  It is a valid primitive and the only reason
to keep it out of clone() (i.e. not making it just another flag to clone())
is that clone() is already cluttered _and_ uses bad calling conventions
for that stuff ("I want to retain " rather than "I want private ").
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Kernel bug: Bad page state: related to generic symlink code and mmap

2005-08-19 Thread Al Viro
On Fri, Aug 19, 2005 at 06:08:12PM -0700, Linus Torvalds wrote:
> 
> 
> On Sat, 20 Aug 2005, Al Viro wrote:
> > 
> > That looks OK except for
> > * ncpfs fix is actually missing here
> 
> Well, the thing is, with the change to page_follow_link() and 
> page_put_link(), ncpfs should now work fine - it doesn't need any fixing 
> any more.

Ah - right, it's using normal methods...
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Kernel bug: Bad page state: related to generic symlink code and mmap

2005-08-19 Thread Al Viro
On Sat, Aug 20, 2005 at 12:15:42AM +0100, Al Viro wrote:
> 
> That looks OK except for
>   * jffs2 is b0rken (see patch in another mail)
>   * afs, autofs4, befs, devfs, freevxfs, jffs2, jfs, ncpfs, procfs,
> smbfs, sysvfs, ufs, xfs - prototype change for ->follow_link()
>   * befs, smbfs, xfs - same for ->put_link()
>   * ncpfs fix is actually missing here
> 
> Prototype changes are covered by patch below (incremental on top of your +
> jffs2 fix upthread).  No ncpfs changes - these will go separately, assuming
> you haven't done them yet; just a plain janitor stuff.


Gaack...  And here's the patch itself - sorry.

diff -urN RC13-rc6-git10-base/fs/afs/mntpt.c current/fs/afs/mntpt.c
--- RC13-rc6-git10-base/fs/afs/mntpt.c  2005-06-17 15:48:29.0 -0400
+++ current/fs/afs/mntpt.c  2005-08-19 19:02:48.0 -0400
@@ -30,7 +30,7 @@
   struct dentry *dentry,
   struct nameidata *nd);
 static int afs_mntpt_open(struct inode *inode, struct file *file);
-static int afs_mntpt_follow_link(struct dentry *dentry, struct nameidata *nd);
+static void *afs_mntpt_follow_link(struct dentry *dentry, struct nameidata 
*nd);
 
 struct file_operations afs_mntpt_file_operations = {
.open   = afs_mntpt_open,
@@ -233,7 +233,7 @@
 /*
  * follow a link from a mountpoint directory, thus causing it to be mounted
  */
-static int afs_mntpt_follow_link(struct dentry *dentry, struct nameidata *nd)
+static void *afs_mntpt_follow_link(struct dentry *dentry, struct nameidata *nd)
 {
struct vfsmount *newmnt;
struct dentry *old_dentry;
@@ -249,7 +249,7 @@
newmnt = afs_mntpt_do_automount(dentry);
if (IS_ERR(newmnt)) {
path_release(nd);
-   return PTR_ERR(newmnt);
+   return (void *)newmnt;
}
 
old_dentry = nd->dentry;
@@ -267,7 +267,7 @@
}
 
kleave(" = %d", err);
-   return err;
+   return ERR_PTR(err);
 } /* end afs_mntpt_follow_link() */
 
 /*/
diff -urN RC13-rc6-git10-base/fs/autofs4/symlink.c current/fs/autofs4/symlink.c
--- RC13-rc6-git10-base/fs/autofs4/symlink.c2005-06-17 15:48:29.0 
-0400
+++ current/fs/autofs4/symlink.c2005-08-19 19:03:11.0 -0400
@@ -12,11 +12,11 @@
 
 #include "autofs_i.h"
 
-static int autofs4_follow_link(struct dentry *dentry, struct nameidata *nd)
+static void *autofs4_follow_link(struct dentry *dentry, struct nameidata *nd)
 {
struct autofs_info *ino = autofs4_dentry_ino(dentry);
nd_set_link(nd, (char *)ino->u.symlink);
-   return 0;
+   return NULL;
 }
 
 struct inode_operations autofs4_symlink_inode_operations = {
diff -urN RC13-rc6-git10-base/fs/befs/linuxvfs.c current/fs/befs/linuxvfs.c
--- RC13-rc6-git10-base/fs/befs/linuxvfs.c  2005-06-17 15:48:29.0 
-0400
+++ current/fs/befs/linuxvfs.c  2005-08-19 19:03:52.0 -0400
@@ -41,8 +41,8 @@
 static void befs_destroy_inode(struct inode *inode);
 static int befs_init_inodecache(void);
 static void befs_destroy_inodecache(void);
-static int befs_follow_link(struct dentry *, struct nameidata *);
-static void befs_put_link(struct dentry *, struct nameidata *);
+static void *befs_follow_link(struct dentry *, struct nameidata *);
+static void befs_put_link(struct dentry *, struct nameidata *, void *);
 static int befs_utf2nls(struct super_block *sb, const char *in, int in_len,
char **out, int *out_len);
 static int befs_nls2utf(struct super_block *sb, const char *in, int in_len,
@@ -487,10 +487,10 @@
}
 
nd_set_link(nd, link);
-   return 0;
+   return NULL;
 }
 
-static void befs_put_link(struct dentry *dentry, struct nameidata *nd)
+static void befs_put_link(struct dentry *dentry, struct nameidata *nd, void *p)
 {
befs_inode_info *befs_ino = BEFS_I(dentry->d_inode);
if (befs_ino->i_flags & BEFS_LONG_SYMLINK) {
diff -urN RC13-rc6-git10-base/fs/devfs/base.c current/fs/devfs/base.c
--- RC13-rc6-git10-base/fs/devfs/base.c 2005-06-17 15:48:29.0 -0400
+++ current/fs/devfs/base.c 2005-08-19 19:04:17.0 -0400
@@ -2491,11 +2491,11 @@
return 0;
 }  /*  End Function devfs_mknod  */
 
-static int devfs_follow_link(struct dentry *dentry, struct nameidata *nd)
+static void *devfs_follow_link(struct dentry *dentry, struct nameidata *nd)
 {
struct devfs_entry *p = get_devfs_entry_from_vfs_inode(dentry->d_inode);
nd_set_link(nd, p ? p->u.symlink.linkname : ERR_PTR(-ENODEV));
-   return 0;
+   return NULL;
 }  /*  End Function devfs_follow_link  */
 
 static struct inode_operations devfs_iops = {
diff -urN RC13-rc6-git10-base/fs/freevxfs/vxfs_immed.c 
curre

Re: Kernel bug: Bad page state: related to generic symlink code and mmap

2005-08-19 Thread Al Viro
On Fri, Aug 19, 2005 at 03:04:52PM -0700, Linus Torvalds wrote:
> 
> 
> On Fri, 19 Aug 2005, Anton Altaparmakov wrote:
> > 
> > Yes, sure.  I have applied your patch to our 2.6.11.4 tree (with the one 
> > liner change I emailed you just now) and have kicked off a compile.
> 
> Actually, hold on. The original patch had another problem: it returned an
> uninitialized "page" pointer when page_getlink() failed.
> 
> This one should have that fixed, and has converted a few other 
> filesystems. Most of them trivially, but I took the opportunity to just 
> simplify NFS while I was at it, since it now has no reason to need to save 
> off the "struct page *" any more.
> 
> It's still not tested, but at least I've looked at it a bit more ;)

That looks OK except for
* jffs2 is b0rken (see patch in another mail)
* afs, autofs4, befs, devfs, freevxfs, jffs2, jfs, ncpfs, procfs,
smbfs, sysvfs, ufs, xfs - prototype change for ->follow_link()
* befs, smbfs, xfs - same for ->put_link()
* ncpfs fix is actually missing here

Prototype changes are covered by patch below (incremental on top of your +
jffs2 fix upthread).  No ncpfs changes - these will go separately, assuming
you haven't done them yet; just a plain janitor stuff.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Kernel bug: Bad page state: related to generic symlink code and mmap

2005-08-19 Thread Al Viro
On Fri, Aug 19, 2005 at 01:35:49PM -0700, Linus Torvalds wrote:
> I'm not sure if this merits a new file or new organization (hey,
> fs/lib/xxx might be good in theory). In particular, I had actually been
> hoping to release 2.6.13 today, but this seems like a valid thing to hold 
> things up for - but not if we're going to re-organize things.
> 
> The one thing that strikes me is that we might actually have less pain if
> we just changed the calling convention for follow_link/put_link slightly
> instead of creating a new library function. The existing "page cache"  
> thing really _does_ work very well, and would work fine for NFS and ncpfs
> too, if we just allowed an extra cookie to be passed along from
> "follow_link()" to "put_link()".

Er...  We can do that, but current calling conventions for ->follow_link()
are already too complex (and we had bugs in that area).  What we have is
1) you can return -error; then you must release nameidata yourself
and ->put_link() will _not_ be called.
2) you can do nd_set_link(nd, ERR_PTR(-error)) and return 0
3) you can do nd_set_link(nd, path) and return 0
4) you can return 0 (after having moved nameidata yourself)
Note that in practice (2) is far more convenient than (1).  So pretty much
everything does it for errors and (1) turns out to be a source of bugs -
people forget to drop nameidata.  The only real case for (1) is failed
(4) - when we went too far and already dropped nameidata naturally.

Let me take a look at the current situation...
* a bunch of filesystems do only (2) and (3).  That's the normal
case.
* jffs2 follow_link() is broken - it has an exit where it returns
-EIO and leaks nameidata.
* procfs special symlinks - (1) or (4), correctly used.
* afs magic - (1) or (4), correctly used.
* hppfs - badly broken for unrelated reasons.

Adding "you can return a cookie as long as it's not ERR_PTR() and it will
be treated like (2), (3) or (4), except that you won't want it in (4)" will
add to confusion.  And of course, we will have to update every ->put_link()
and ->follow_link() out there...

_If_ you want 2.6.13 with minimal PITA right now, I'd suggest leaving API
as is until 2.6.13-git1 and using stray_page_... + patch below to deal with
jffs2 until then.  Then we can deal with API change - immediately after
2.6.13 release.  Removing stray_... in process.

diff -urN RC13-rc6-git10-base/fs/jffs2/symlink.c current/fs/jffs2/symlink.c
--- RC13-rc6-git10-base/fs/jffs2/symlink.c  2005-08-10 10:37:52.0 
-0400
+++ current/fs/jffs2/symlink.c  2005-08-19 17:35:52.0 -0400
@@ -30,6 +30,7 @@
 static int jffs2_follow_link(struct dentry *dentry, struct nameidata *nd)
 {
struct jffs2_inode_info *f = JFFS2_INODE_INFO(dentry->d_inode);
+   char *p = (char *)f->dents;

/*
 * We don't acquire the f->sem mutex here since the only data we
@@ -45,13 +46,14 @@
 * nd_set_link() call.
 */

-   if (!f->dents) {
+   if (!p) {
printk(KERN_ERR "jffs2_follow_link(): can't find symlink 
taerget\n");
-   return -EIO;
+   p = ERR_PTR(-EIO);
+   } else {
+   D1(printk(KERN_DEBUG "jffs2_follow_link(): target path is 
'%s'\n", (char *) f->dents));
}
-   D1(printk(KERN_DEBUG "jffs2_follow_link(): target path is '%s'\n", 
(char *) f->dents));
 
-   nd_set_link(nd, (char *)f->dents);
+   nd_set_link(nd, p);

/*
 * We unlock the f->sem mutex but VFS will use the f->dents string. 
This is safe
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Kernel bug: Bad page state: related to generic symlink code and mmap

2005-08-19 Thread Al Viro
On Fri, Aug 19, 2005 at 08:41:29PM +0100, Matthew Wilcox wrote:
> > is getting crowded.  Linus, do you have any objections to that or 
> > suggestions
> > on filename here?
> 
> fs/symlink.c?

Or fs/lib/symlink.c...
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Kernel bug: Bad page state: related to generic symlink code and mmap

2005-08-19 Thread Al Viro
On Fri, Aug 19, 2005 at 10:16:47PM +0300, Mika Penttilä wrote:
> Just out of curiosity - what protects even local filesystems against 
> concurrent truncate and symlink resolving when using the page cache helpers?

How do you get truncate(2) or ftruncate(2) to do something with a symlink?
The former follows links, the latter takes an open file...
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Kernel bug: Bad page state: related to generic symlink code and mmap

2005-08-19 Thread Al Viro
On Fri, Aug 19, 2005 at 07:00:38PM +0100, Christoph Hellwig wrote:
> On Fri, Aug 19, 2005 at 07:02:18PM +0100, Al Viro wrote:
> > On Fri, Aug 19, 2005 at 05:53:32PM +0100, Al Viro wrote:
> > > I'm taking NFS helpers to libfs.c and switching ncpfs to them.  IMO that's
> > > better than copying the damn thing and other network filesystems might 
> > > have
> > > the same needs eventually...
> > 
> > [something like this - completely untested]
> > 
> > * stray_page_get_link(inode, filler) - returns ERR_PTR(error) or pointer
> > to symlink body.  Said symlink body sits in a page at offset equal to
> > offsetof(page, struct stray_page_link).  filler() is expected to put it
> > at such offset. Page is cached.
> > 
> > * stray_page_put_link() - ->put_link() suitable for links obtained from
> > stray_page_get_link().  Unlike the usual pagecache-based variants, this
> > sucker does _not_ rely on page staying cached.
> > 
> > * nfs and ncpfs switched to the helpers above.
> 
> Can you add some kerneldoc comments to describe them?  Especially as
> the name is not very descriptive.

Hey, if anybody has suggestions on names - they are very welcome ;-)

FWIW, I'd rather take page_symlink(), page_symlink_inode_operations,
page_put_link(), page_follow_link_light(), page_readlink(), page_getlink(),
generic_readlink() and vfs_readlink() to the same place where these guys
would live.  They all belong together and none of them has any business
in fs/namei.c.  Options: fs/libfs.c or separate library since fs/libfs.c
is getting crowded.  Linus, do you have any objections to that or suggestions
on filename here?
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Kernel bug: Bad page state: related to generic symlink code and mmap

2005-08-19 Thread Al Viro
On Fri, Aug 19, 2005 at 05:53:32PM +0100, Al Viro wrote:
> I'm taking NFS helpers to libfs.c and switching ncpfs to them.  IMO that's
> better than copying the damn thing and other network filesystems might have
> the same needs eventually...

[something like this - completely untested]

* stray_page_get_link(inode, filler) - returns ERR_PTR(error) or pointer
to symlink body.  Said symlink body sits in a page at offset equal to
offsetof(page, struct stray_page_link).  filler() is expected to put it
at such offset. Page is cached.

* stray_page_put_link() - ->put_link() suitable for links obtained from
stray_page_get_link().  Unlike the usual pagecache-based variants, this
sucker does _not_ rely on page staying cached.

* nfs and ncpfs switched to the helpers above.

Signed-off-by: Al Viro <[EMAIL PROTECTED]>

diff -urN RC14-rc6-git10-base/fs/libfs.c current/fs/libfs.c
--- RC13-rc6-git10-base/fs/libfs.c  2005-08-10 10:37:52.0 -0400
+++ current/fs/libfs.c  2005-08-19 13:29:39.0 -0400
@@ -7,6 +7,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 int simple_getattr(struct vfsmount *mnt, struct dentry *dentry,
@@ -616,6 +617,42 @@
return ret;
 }
 
+char *stray_page_get_link(struct inode *inode, int (*filler)(struct inode *, 
struct page *))
+{
+   struct stray_page_symlink *p;
+   struct page *page;
+
+   page = read_cache_page(&inode->i_data, 0, (filler_t *)filler, inode);
+   if (IS_ERR(page))
+   goto read_failed;
+   if (!PageUptodate(page))
+   goto getlink_read_error;
+   p = kmap(page);
+   p->page = page;
+   return p->body;
+
+getlink_read_error:
+   page_cache_release(page);
+   page = ERR_PTR(-EIO);
+read_failed:
+   return (char *)page;/* error */
+}
+
+void stray_page_put_link(struct dentry *dentry, struct nameidata *nd)
+{
+   char *s = nd_get_link(nd);
+   if (!IS_ERR(s)) {
+   struct stray_page_symlink *p;
+   struct page *page;
+
+   p = container_of(s, struct stray_page_symlink, body[0]);
+   page = p->page;
+
+   kunmap(page);
+   page_cache_release(page);
+   }
+}
+
 EXPORT_SYMBOL(dcache_dir_close);
 EXPORT_SYMBOL(dcache_dir_lseek);
 EXPORT_SYMBOL(dcache_dir_open);
@@ -648,3 +685,5 @@
 EXPORT_SYMBOL_GPL(simple_attr_close);
 EXPORT_SYMBOL_GPL(simple_attr_read);
 EXPORT_SYMBOL_GPL(simple_attr_write);
+EXPORT_SYMBOL(stray_page_get_link);
+EXPORT_SYMBOL(stray_page_put_link);
diff -urN RC13-rc6-git10-base/fs/ncpfs/inode.c current/fs/ncpfs/inode.c
--- RC13-rc6-git10-base/fs/ncpfs/inode.c2005-06-17 15:48:29.0 
-0400
+++ current/fs/ncpfs/inode.c2005-08-19 13:12:01.0 -0400
@@ -104,7 +104,7 @@
 
 extern struct dentry_operations ncp_root_dentry_operations;
 #if defined(CONFIG_NCPFS_EXTRAS) || defined(CONFIG_NCPFS_NFS_NS)
-extern struct address_space_operations ncp_symlink_aops;
+extern int ncp_follow_link(struct dentry *dentry, struct nameidata *nd);
 extern int ncp_symlink(struct inode*, struct dentry*, const char*);
 #endif
 
@@ -233,8 +233,8 @@
 #if defined(CONFIG_NCPFS_EXTRAS) || defined(CONFIG_NCPFS_NFS_NS)
 static struct inode_operations ncp_symlink_inode_operations = {
.readlink   = generic_readlink,
-   .follow_link= page_follow_link_light,
-   .put_link   = page_put_link,
+   .follow_link= ncp_follow_link,
+   .put_link   = stray_page_put_link,
.setattr= ncp_notify_change,
 };
 #endif
@@ -272,7 +272,6 @@
 #if defined(CONFIG_NCPFS_EXTRAS) || defined(CONFIG_NCPFS_NFS_NS)
} else if (S_ISLNK(inode->i_mode)) {
inode->i_op = &ncp_symlink_inode_operations;
-   inode->i_data.a_ops = &ncp_symlink_aops;
 #endif
} else {
make_bad_inode(inode);
diff -urN RC13-rc6-git10-base/fs/ncpfs/symlink.c current/fs/ncpfs/symlink.c
--- RC13-rc6-git10-base/fs/ncpfs/symlink.c  2005-06-17 15:48:29.0 
-0400
+++ current/fs/ncpfs/symlink.c  2005-08-19 13:26:26.0 -0400
@@ -30,6 +30,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "ncplib_kernel.h"
 
 
@@ -41,9 +42,8 @@
 
 /* - read a symbolic link -- */
 
-static int ncp_symlink_readpage(struct file *file, struct page *page)
+static int symlink_filler(struct inode *inode, struct page *page)
 {
-   struct inode *inode = page->mapping->host;
int error, length, len;
char *link, *rawlink;
char *buf = kmap(page);
@@ -77,6 +77,7 @@
}
 
len = NCP_MAX_SYMLINK_SIZE;
+   buf += offsetof(struct stray_page_symlink, body);
error = ncp_vol2io(NCP_SERVER(inode), buf, &len, link, length, 0);
kfree(rawlink);
if (error)
@@ -96,13 +97,12 @@
return error;
 }
 
-/*
- *

Re: Kernel bug: Bad page state: related to generic symlink code and mmap

2005-08-19 Thread Al Viro
On Fri, Aug 19, 2005 at 09:43:17AM -0700, Linus Torvalds wrote:
> Actually, looking at the ncpfs patch, I'd rather not apply that patch 
> as-is. It looks like it will totally disable symlink caching, which would 
> be kind of sad. Somebody willing to do the same thing NFS does?
> 
> NFS hides away the "struct page *" pointer at the end of the page data, so
> that when we pass the pointer to the virtual address around, we can 
> trivially look up the "struct page".
> 
> An alternative is to make the symlink address-space use a gfp_mask of 
> GFP_KERNEL (no highmem), at which point ncpfs_put_link() just becomes 
> something like
> 
>   void ncpfs_put_link(struct dentry *dentry, struct nameidata *nd)
>   {
>   char *addr = nd_get_link(nd);
> 
>   if (!IS_ERR(addr))
>   page_cache_release(virt_to_page(addr));
>   }
> 
> which is pretty ugly, but even simpler than the NFS trick.
> 
> Anybody?

I'm taking NFS helpers to libfs.c and switching ncpfs to them.  IMO that's
better than copying the damn thing and other network filesystems might have
the same needs eventually...
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Kernel bug: Bad page state: related to generic symlink code and mmap

2005-08-19 Thread Al Viro
On Fri, Aug 19, 2005 at 09:07:35AM -0700, Linus Torvalds wrote:
> Hmm.. NFS _does_ use the page cache for symlinks, but uses it slightly 
> differently: instead of relying on the page cache entry being the same 
> when freeing the page, it just caches the page it looked up in the page 
> cache (ie "nfs_follow_link()" does look up the page cache, but then hides 
> the page pointer inside the page data itself (uglee), and thus does not 
> depend on the mapping staying the same (nfs_put_link() just takes the page 
> from the symlink data).

For NFS that was done exactly to deal with cache invalidation.  IIRC, I've
convinced myself that it wasn't going to happen on ncpfs and happily
abstained from duplicating the NFS variant.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Kernel bug: Bad page state: related to generic symlink code and mmap

2005-08-19 Thread Al Viro
On Fri, Aug 19, 2005 at 04:44:06PM +0100, Anton Altaparmakov wrote:
> I tried stracing nautilus to answer your question.  And this time for
> the first time instead of a Bad page state I got a BUG() triggered in
> fs/namei.c, the arrow below marks the spot:
> 
> void page_put_link(struct dentry *dentry, struct nameidata *nd)
> {
>   if (!IS_ERR(nd_get_link(nd))) {
>   struct page *page;
>   page = find_get_page(dentry->d_inode->i_mapping, 0);
>   if (!page)
> > BUG();

Ergo, something had blown page cache for our inode.  Interesting...
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Kernel bug: Bad page state: related to generic symlink code and mmap

2005-08-19 Thread Al Viro
On Fri, Aug 19, 2005 at 12:14:48PM +0100, Anton Altaparmakov wrote:
> Hi,
> 
> There is a bug somewhere in 2.6.11.4 and I can't figure out where it is.
> I assume it is present in older and newer kernels, too as the related
> code hasn't changed much AFAICS and googling for "Bad page state"
> returns rather a lot of hits relating to both older (up to 2.5.70!) and
> newer kernels...
> 
> Note: PLEASE do not stop reading because you read ncpfs below as I am
> pretty sure it is not ncpfs related!  And looking at google a lot of
> people have reported such similar problems since 2.5.70 or so and they
> were all told to go away as they have bad ram.  That is impossible
> because this happens on well over 600 workstations and several servers
> 100% reproducible.  Many different types of hardware, different makes,
> difference age, all running smp kernels even if single cpu.  You can't
> tell me they all have bad ram.  Windows works fine and Linux works fine
> except for that one specific problem which is 100% reproducible...
> 
> The bug only appears, but it appears 100% reproducibly when a cross
> volume symlink on ncpfs is accessed using nautilus under gnome.  I.e.
> double click on a cross volume symlink on ncpfs in nautilus and the
> machine locks up solid.

Ugh...  Could you at least tell what does nautilus attempt to do at that
point?  Something that wouldn't show up with simple ls -l  or
cat  >/dev/null, judging by the above, but what?
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][2.6 patch] Allow creation of new namespaces during mount system call

2005-04-20 Thread Al Viro
On Wed, Apr 20, 2005 at 05:23:10PM -0700, Bryan Henderson wrote:
> >That assumes that everyone has the same stuff in the same places.  I.e.
> >that there is a universal tree with different subset hidden from 
> different
> >processes.  But that is obviously a wrong approach - e.g. it loses 
> ability
> >to bind different stuff on the same place in different namespaces.
> 
> Aren't you trying to boil another egg in my pot?  In Linux today, everyone 
> (every process on the same Linux system, that is) has the same stuff in 
> the same place.

Incorrect.  We have working namespaces right now and had them for quite
a while.  And no, they do not suffer from such braindamage.  Usable
right now and yes, I *do* use that.

> I'm trying to propose an incremental improvement, and 
> relaxing that restriction isn't part of it.

Funny, how that *introduces* the restriction you are talking about...

> lifetime of the mount.  But as between multiple processes on the same 
> system at the same time, yeah, the directory has one name.

Care to experiment?  Create a new namespace and do different bindings at
the same place in old and new one.  Or mount --move in either, etc.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][2.6 patch] Allow creation of new namespaces during mount system call

2005-04-20 Thread Al Viro
On Wed, Apr 20, 2005 at 07:53:04PM +0200, Miklos Szeredi wrote:
> The user expects to have the see the same files in all sessions,
> whether those be local logins, remote logins, ftp/scp/etc sessions.

Umm...  You know, I would be *very* unhappy if I found that some process
on parcelfarce was able to see the contents of ~/.ssh/* from the laptop
I'm using right now.  Even more so if that would apply to random ftp
sewer I happen to use at the moment...
 
> If I'm remotely logged into server X from Y, and want to use scp to
> copy some file from X to Y or vica versa, I will want my private
> mounts to be visible from the scp.

Do you?  Really?  OK, so I've got ~/bin/ and ~/bin/arch/ in my path on
my boxen.  The latter has ~/bin/{i386,alpha,sparc,amd64,hppa,ppc} bound
on it - depending on the host I'm using.  Tell me, why would I want that
private mount to be visible when I log in from one box to another?  To
make sure that wrong binaries would be picked?
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][2.6 patch] Allow creation of new namespaces during mount system call

2005-04-20 Thread Al Viro
On Wed, Apr 20, 2005 at 09:43:47PM +0100, Jamie Lokier wrote:
> Al Viro is right to point out that environment variables are not
> shared.  But he forgets that _files_ are shared.

So they are.  ~/.profile, for instance ;-)
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][2.6 patch] Allow creation of new namespaces during mount system call

2005-04-20 Thread Al Viro
On Wed, Apr 20, 2005 at 11:57:23AM -0700, Bryan Henderson wrote:
>   - When you mount(), you say whether the names should be visible by 
> default or not.  It takes system privilege to make them visible by 
> default, but an ordinary user can mount a willing filesystem over a 
> directory he's permitted to modify unconditionally, invisible by default

That assumes that everyone has the same stuff in the same places.  I.e.
that there is a universal tree with different subset hidden from different
processes.  But that is obviously a wrong approach - e.g. it loses ability
to bind different stuff on the same place in different namespaces.  _And_
it doesn't work accross the host boundary for even more obvious reasons
(/bin/sh is going to be a different binary on i386, alpha and sparc boxen,
no matter what).

IOW, notion that every directory has its "real" absolute pathname
(and that's what your approach boils down to) won't match the reality
anyway.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][2.6 patch] Allow creation of new namespaces during mount system call

2005-04-20 Thread Al Viro
On Wed, Apr 20, 2005 at 09:51:26AM -0700, Ram wrote:
> Reading through the thread I assume the requirement is:
> 
> 1) A User being able to create his own VFS-mount environment 
> 2) being able to use the same VFS-mount environment from 
> multiple login sessions.
> 3) Being able to switch some processes to some other
>   VFS-mount environment.

Excuse me, but could somebody give coherent rationale for such requirements?
_Especially_ for joining existing group by completely unrelated process -
something we don't do for any other component of process.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][2.6 patch] Allow creation of new namespaces during mount system call

2005-04-20 Thread Al Viro
On Wed, Apr 20, 2005 at 01:03:40PM +0100, Jamie Lokier wrote:
> It shouldn't be literally per-user - it should be possible for a user
> to have several environment _when_ they want that.  chroot-jail style
> virtual server environments require that too.
> 
> But that shouldn't be the only option - because it would be horrible
> to use.  If I login on multiple terminals, I normally want to mount
> filesystems in /home/jamie/mnt on one terminal, and use them on another.

And when you log in on several terminals you usually want same $PATH.
You don't do that by sharing VM between shell processes, do you?  Sure,
that would work with sufficient kernel-side hacks for joining thread
group and making e.g. bash multithreaded.  Nobody does it though - it
doesn't buy you anything really useful.
 
> How can libpam join the user's existing namespace?
> 
> Having a separate usermount-namespace for each login of the same user
> would not be nice to use.

I don't see why.  _IF_ you can change the set of mounts after you log in,
there's no more need to do any kernel tricks for that stuff than you would
need for environment, etc.  If you can't - well, the last point where you
can get something set up is login with no changes afterwards.  In that case
everything is just as trivial...
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][2.6 patch] Allow creation of new namespaces during mount system call

2005-04-20 Thread Al Viro
On Wed, Apr 20, 2005 at 10:45:58AM +0100, Jamie Lokier wrote:
> For FUSE, what's needed is that a user can mount something, and the
> mounted fs is visible only to that user, but it's visible to _all_ of
> the user's processes.

So get that namespace as soon as you log in.
 
> We think namespaces are a nice way to do that: making a user-owned
> filesystem only visible to a user.  But the mechanism of CLONE_NEWNS
> does not work, because it presumes namespace divisions are only
> propagated over parent-child divisions, like environment variables.
 
> What we really want is a mount point that propagates across all the
> processes owned by one user, but is not there for other users.

This is almost certainly bogus.  Same user can easily want several
different environments set on the same box.

>- Have a namespace per user.  The user's namespace will be entered
>  by the "login" program somehow.

Trivial right now - just have libpam do that.
 
>- All logins to the same user acquire the same per-user namespace.
>  This isn't possible at the moment; it would be a kernel extension
>  + administrative change to login.

No.  Identical setup at login time - sure.  Enforced _single_ namespace
is just plain wrong.  Moreover, "all user's processes" is the wrong answer
to practically any question (well, aside of "what processes do you kill
when you get rid of luser's account").
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] User CLONE_NEWNS permission and rlimits

2005-04-19 Thread Al Viro
On Tue, Apr 19, 2005 at 11:38:21PM -0400, Ritesh Kumar wrote:
> You are right. A more priviledged process running as a child of
> another process should not be allowed to look at the same namespace as
> its parent.

No go.  That immediately breaks any suid program that takes a pathname
as an argument and is supposed to do something to the file in question.
Or uses dotfiles for per-user config.  gpg(1) fits both, for example,
and that's not something rarely used.  Moreover, used in fsckload of
scripts that are entirely out of your control, so something like "OK,
use it on stdin, then" is not an answer (and it still doesn't address
the second issue - gpg *does* need access to keyring, after all).

> Also, the access control for the filesystem is still in the kernel.
> What we change in the userspace is just the namespace and nothing
> else. If you are fundamentally denied access to a file (from the
> kernel) then you cannot access it no matter how you access it using
> userspace libraries.

The issue is not with being able to see something you shouldn't see.
It's being able to trick more priveleged process into accepting your
data as something it trusts.  OR not being able to use suid programs
on your own files at all.  Neither is acceptable.

BTW, your references to Plan 9 completely miss one very important thing -
they manage to live without any suid stuff at all.  Which is certainly
very nice, but not useful in our case, unless you volunteer to rewrite
suid applications to the form that would not need suid.
 
> Plus, it is not very clear to me what to you mean by 'tasks'. If that
> is processes, then the child will inherit a separate copy of the
> namespace from the parent (Copy-on-write of the data structs of the
> user library probably... I'll have to think over this). So no race
> conditions here.

... and no working mount(8) either.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][2.6 patch] Allow creation of new namespaces during mount system call

2005-04-19 Thread Al Viro
On Tue, Apr 19, 2005 at 06:53:29PM -0500, Eric Van Hensbergen wrote:
> On 4/19/05, Al Viro <[EMAIL PROTECTED]> wrote:
> > On Tue, Apr 19, 2005 at 05:13:32PM -0500, Eric Van Hensbergen wrote:
> > > The motivation behind this patch is to make private namespaces more
> > > accessible by allowing their creation at mount/bind time.
> > >
> > 
> > *UGH*
> > 
> > So what happens to those who happen to share task->fs with the parent?
> > 
> 
> Okay, I'll admit to being a bit too hasty with pushing out that patch
> - I was being particularly myopic looking for a solution only for a
> command-line mount.  Are you generally opposed to new namespace
> creation at mount time or just my slimy hack?  A shared task->fs seems
> like something which could be easily checked against and disallowed.

a) ability to create a private namespace without forking anything - sure,
that would be useful.  However, that's not something I would push into
mount(2) (already overloaded to hell and back).

There used to be a kinda-sorta agreement on a new syscall:
unshare(bitmap)
with arguments like those of clone(2).  That's not just for namespaces -
e.g. you might legitimately want to unshare VM in a thread and leave the
rest alone.  Or unshare ->fs (i.e. uncouple cwd from the rest of group).

Most of the code is already there - do_fork() has to do such stuff anyway.
So how about adding sys_unshare(flags) that would do that job?  Flags would
correspond to those of clone(2), except that all these guys would be
"what do we unshare" instead of "what do we leave shared".


b) I _really_ don't like the idea of messing with the parent.  Make it
a shell builtin if you want to affect shell behaviour; the same reason
why cd is a builtin and not an external command.


c) I would be really, really careful with implications of "let user
do whatever he wants" - that certainly should include bindings and
that can create heaps of fun for suid stuff.  More comments when
I get around to digging through FUSE thread...
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] User CLONE_NEWNS permission and rlimits

2005-04-19 Thread Al Viro
On Tue, Apr 19, 2005 at 11:02:53PM -0400, Ritesh Kumar wrote:
> I am new to the list so please bear with me :-)
> 
> I have also be thinking about filesystem namespaces which are
> completely under the user's own control.

How do you deal with su(1) finding /etc/shadow in your namespace
and seeing an entry for root there - with no password?

> I was also thinking of them
> being inherited and changed along the process heirarchy.

We have that already...

> So a given
> process is allowed to change its namespace any way it likes and map it
> to its parent's namespace.

See above.

> More importantly, I was thinking in terms of having this entire
> capability in the userspace itself. Instead of giving all the details
> right here... let me redirect you to the page where I have set up the
> prototype. You should be able to download the sample code (very small)
> and browse through it to get an idea of what I had in mind. I also
> have an article which explains what I was thinking. In essense, I was
> thinking of splitting up the conceps of 1) accessing the filesystem on
> the HDD/device and 2) setting up a namespace for accessing the files
> into two separate concepts and bringing up 2) completely in the
> userspace.
> What do you think? I would like to have feedback on the idea.

That your library will leave any suid program seeing hell knows
what.  Which gets very unpleasant when you are using it to do something
with your files...

That's besides the issues with races when two tasks that share
namespace attempt to change it.

> http://www.cs.unc.edu/~ritesh/projects/perprocessfs.html
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][2.6 patch] Allow creation of new namespaces during mount system call

2005-04-19 Thread Al Viro
On Tue, Apr 19, 2005 at 05:13:32PM -0500, Eric Van Hensbergen wrote:
> The motivation behind this patch is to make private namespaces more
> accessible by allowing their creation at mount/bind time.
> 
> Based on some of the FUSE permissions discussions, I wanted to check
> into modifying the mount system calls -- adding a flag which created a
> new namespace for the resulting mount.  I quickly discovered that what
> I typically wanted (for the case of running a mount command) was to
> actually create a new namespace for the parent thread (typically the
> shell), inherit that namespace, and then perform the mount.
> 
> Its not clear to me that both options are needed, cloning the parent's
> namespace seems to be what you want most of the time.
> 
> In order to minimize code impact I split the copy_namespace function,
> perhaps the right long term solution is to change it's interface to
> accommodate the changes.  Things look a bit more invasive as I moved
> the copy_namespace function above do_mount.  The patch follows:

*UGH*

So what happens to those who happen to share task->fs with the parent?
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: NFS4 mount problem

2005-04-18 Thread Al Viro
On Mon, Apr 18, 2005 at 06:33:09PM +0100, David Howells wrote:
> Al Viro <[EMAIL PROTECTED]> wrote:
> 
> > 
> > Architecture-dependent blob passed to mount(2) (aka nfs4_mount_data).
> > If you want it to be a blob, at least have a decency to use encoding
> > that would not depend on alignment rules and word size.  Hell, you
> > could use XDR - it's not that nfs would need something new to handle
> > it.  Or, better yet, use a normal string.
> 
> Mount doesn't appear to permit a big enough blob though. It has a hard limit
> of PAGE_SIZE.

Excuse me?  Would the use of fixed offsets, field sizes and endianness
make the blob bigger?  And as for the length of string representation
going past 4Kb...  that could be easily dealt with in sys_mount() if it
really becomes a problem.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: NFS4 mount problem

2005-04-18 Thread Al Viro
On Mon, Apr 18, 2005 at 10:07:14AM -0700, Bryan Henderson wrote:
> >On Fri, Apr 15, 2005 at 01:22:59PM -0700, David S. Miller wrote:
> >> 
> >> Make a ->compat_read_super() just like we have a ->compat_ioctl()
> >> method for files, if you want to suggest a solution like what
> >> you describe.
> >
> >I don't think we should encourage filesystem writers to do such stupid
> >things as ncfps/smbfs do.  In fact I'm totally unhappy thay nfs4 went
> >down that road.
> 
> Which road is that?

Architecture-dependent blob passed to mount(2) (aka nfs4_mount_data).
If you want it to be a blob, at least have a decency to use encoding
that would not depend on alignment rules and word size.  Hell, you
could use XDR - it's not that nfs would need something new to handle
it.  Or, better yet, use a normal string.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] shared subtrees

2005-01-17 Thread Al Viro
On Mon, Jan 17, 2005 at 03:11:18PM -0500, Mike Waychison wrote:
 
> I don't think that solves the problem.  B should receive copies (with
> shared semantics if called for) of all mountpoints C1,..,Cn that are
> children of A if A->A.  This is regardless of whether or not propagation
> occurs before or after the attach.

... when that makes sense.  Do you see any real problems with the proposed
behaviour (i.e. propagation happens before attachment)?

BTW, you do realize that rbind also has "copy before attaching" semantics,
right?
 
> Allowing this is like allowing directory aliasing in the sense that an
> aliased directory that is nested within itself opens us to
> badness/headaches 8)
> 
> I still think the only way to handle this is to disallow vfsmounts in a
> p-node to have (grand)parent-child relationships.  This may have to be
> extended to the 'owned by' case as well.

Not feasible (and think what _that_ will do to --move, especially since
propagation can span namespace boundaries).
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] shared subtrees

2005-01-16 Thread Al Viro
On Sun, Jan 16, 2005 at 01:42:09PM -0500, J. Bruce Fields wrote:
> On Sun, Jan 16, 2005 at 06:06:56PM +0000, Al Viro wrote:
> > On Sun, Jan 16, 2005 at 11:02:13AM -0500, J. Bruce Fields wrote:
> > > On Thu, Jan 13, 2005 at 10:18:51PM +, Al Viro wrote:
> > > > 6. mount --move
> > > > prohibited if what we are moving is in some p-node, otherwise we move
> > > > as usual to intended mountpoint and create copies for everything that
> > > > gets propagation from there (as we would do for rbind).
> > > 
> > > Why this prohibition?
> > 
> > How do you propagate that?  We can weaken that to "in a p-node that
> > owns something or contains more than one vfsmount", but it's not
> > worth the trouble, AFAICS.
> 
> I guess I'm not seeing what there is to propagate.  If the vfsmount we
> are moving is mounted under a vfsmount that's in a p-node, then there'd
> be something to propagate, but since the --move doesn't change the
> structure of mounts underneath the moved mountpoint, I wouldn't expect
> any changes to be propagated from it to other mountpoints.
> 
> I must be missing something fundamental

No - I have been missing a typo.  Make that "if mountpoint of what we
are moving...".
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] shared subtrees

2005-01-16 Thread Al Viro
On Sun, Jan 16, 2005 at 11:02:13AM -0500, J. Bruce Fields wrote:
> On Thu, Jan 13, 2005 at 10:18:51PM +0000, Al Viro wrote:
> > 6. mount --move
> > prohibited if what we are moving is in some p-node, otherwise we move
> > as usual to intended mountpoint and create copies for everything that
> > gets propagation from there (as we would do for rbind).
> 
> Why this prohibition?

How do you propagate that?  We can weaken that to "in a p-node that
owns something or contains more than one vfsmount", but it's not
worth the trouble, AFAICS.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] shared subtrees

2005-01-15 Thread Al Viro
On Sat, Jan 15, 2005 at 07:46:59PM -0500, J. Bruce Fields wrote:
> On Thu, Jan 13, 2005 at 10:18:51PM +0000, Al Viro wrote:
> > 2. mount
> > 
> > We have a new vfsmount A and want to attach it to mountpoint somewhere in
> > vfsmount B.  If B does not belong to any p-node, everything is as usual; A
> > doesn't become a member or slave of any p-node and is simply attached to B.
> > 
> > If B belongs to a p-node p, consider all vfsmounts B1,...,Bn that get events
> > propagated from B and all p-nodes p1,...,pk that contain them.
> > * A gets cloned into n copies and these copies (A1,...,An) are attached
> > to corresponding points in B1,...,Bn.
> > * k new p-nodes (q1,...,qk) are created
> > * Ai is contained in qj <=> Bi is contained in qj
> 
> Minor typo: looks like that second qj should be pj.

ACK
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html