[ceph] locking fun with d_materialise_unique()

2013-01-28 Thread Al Viro
There's a fun potential problem with CEPH_MDS_OP_LOOKUPSNAP handling
in ceph_fill_trace().  Consider the following scenario:

A process calls stat(2).  Lookup locks the parent, allocates a dentry and calls
->lookup().  The request is created and sent over the wire.  Then we sit and
wait for completion.  Just as the reply arrives, the process gets SIGKILL.
OK, we get to
        /*
         * ensure we aren't running concurrently with
         * ceph_fill_trace or ceph_readdir_prepopulate, which
         * rely on locks (dir mutex) held by our caller.
         */
        mutex_lock(&req->r_fill_mutex);
        req->r_err = err;
        req->r_aborted = true;
        mutex_unlock(&req->r_fill_mutex);
and we got there before handle_reply() grabbed ->r_fill_mutex.  Then we return
to ceph_lookup(), drop the reference to the request and bugger off.  The parent
is unlocked by the caller.

Meanwhile, there's another thread sitting in handle_reply().  It
got ->r_fill_mutex and called ceph_fill_trace().  Had that been something
like a rename request, ceph_fill_trace() would've checked req->r_aborted and
that would've been it.  However, we hit this:
        } else if (req->r_op == CEPH_MDS_OP_LOOKUPSNAP ||
                   req->r_op == CEPH_MDS_OP_MKSNAP) {
                struct dentry *dn = req->r_dentry;
and proceed to
dout(" linking snapped dir %p to dn %p\n", in, dn);
dn = splice_dentry(dn, in, NULL, true);
which does
realdn = d_materialise_unique(dn, in);
and we are in trouble - d_materialise_unique() assumes that ->i_mutex on the parent
is held, which isn't guaranteed anymore.  Not that the d_delete() done a couple
of lines earlier was any better...

I'm not sure if we are guaranteed that ceph_readdir_prepopulate() won't
get to its splice_dentry() and d_delete() calls in similar situations -
I haven't checked that one yet.  If it isn't guaranteed, we have a problem
there as well.

I might very well be missing something - that code is seriously convoluted,
and the race wouldn't be easy to hit, so I don't have anything resembling
a candidate reproducer ;-/  IOW, this is just from RTFS and I'd really
appreciate comments from folks familiar with ceph.

The VFS side of the requirements is fairly simple:
* d_splice_alias(d, _), d_add_ci(d, _), d_add(d, _),
d_materialise_unique(d, _), d_delete(d), d_move(_, d) should be called
only with ->i_mutex held on the parent of d.
* d_move(d, _), d_add_unique(d, _), d_instantiate_unique(d, _),
d_instantiate(d, _) should be called only with d being parentless (i.e.
d->d_parent == d, aka. IS_ROOT(d)) or with ->i_mutex held on the parent of d.
* with the exception of "prepopulate dentry tree at ->get_sb() time"
kind of situations, d_alloc(d, _) and d_alloc_name(d, _) should be called
only with d->d_inode->i_mutex held (and it won't be too hard to get rid of
those exceptions, actually).
* lookup_one_len(_, d, _) should only be called with ->i_mutex held
on d->d_inode
* d_move(d1, d2) in case when d1 and d2 have different parents
should only be called with ->s_vfs_rename_mutex held on d1->d_sb (== d2->d_sb).

We are guaranteed that ->i_mutex is held on (inode of) parent of d in
->lookup(_, d, _)
->atomic_open(_, d, _, _, _, _)
->mkdir(_, d, _)
->symlink(_, d, _)
->create(_, d, _, _)
->mknod(_, d, _, _)
->link(_, _, d)
->unlink(_, d)
->rmdir(_, d)
->rename(_, d, _, _)
->rename(_, _, _, d)
Note that this is *not* guaranteed for another argument of ->link() - the
inode we are linking has ->i_mutex held, but nothing of that kind is promised
for its parent directory.
We also are guaranteed that ->i_mutex is held on the inode of opened
directory passed to ->readdir() and on victims of ->unlink(), ->rmdir() and
overwriting ->rename().

FWIW, I went through that stuff this weekend and we are fairly close to having
those requirements satisfied - I'll push a branch with the accumulated fixes
in a few, and after that we should be down to very few remaining violations and
dubious places (the ceph issues above being one of them).  And yes, this stuff
really needs to be in Documentation/filesystems somewhere, along with a
full description of the locking rules for ->d_parent and ->d_name accesses.  I'm
trying to put that together right now...
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ceph] locking fun with d_materialise_unique()

2013-01-28 Thread Sage Weil
Hi Al,

On Tue, 29 Jan 2013, Al Viro wrote:
>   There's a fun potential problem with CEPH_MDS_OP_LOOKUPSNAP handling
> in ceph_fill_trace().  Consider the following scenario:
> 
> Process calls stat(2).  Lookup locks parent, allocates dentry and calls
> ->lookup().  Request is created and sent over the wire.  Then we sit and
> wait for completion.  Just as the reply has arrived, process gets SIGKILL.
> OK, we get to
> /*
>  * ensure we aren't running concurrently with
>  * ceph_fill_trace or ceph_readdir_prepopulate, which
>  * rely on locks (dir mutex) held by our caller.
>  */
> mutex_lock(&req->r_fill_mutex);
> req->r_err = err;
> req->r_aborted = true;
> mutex_unlock(&req->r_fill_mutex);
> and we got there before handle_reply() grabbed ->r_fill_mutex.  Then we return
> to ceph_lookup(), drop the reference to request and bugger off.  Parent is
> unlocked by caller.
> 
> In the meanwhile, there's another thread sitting in handle_reply().  It
> got ->r_fill_mutex and called ceph_fill_trace().  Had that been something
> like rename request, ceph_fill_trace() would've checked req->r_aborted and
> that would've been it.  However, we hit this:
> } else if (req->r_op == CEPH_MDS_OP_LOOKUPSNAP ||
>req->r_op == CEPH_MDS_OP_MKSNAP) {
> struct dentry *dn = req->r_dentry;
> and proceed to
> dout(" linking snapped dir %p to dn %p\n", in, dn);
> dn = splice_dentry(dn, in, NULL, true);
> which does
> realdn = d_materialise_unique(dn, in);
> and we are in trouble - d_materialise_unique() assumes that ->i_mutex on 
> parent
> is held, which isn't guaranteed anymore.  Not that d_delete() done a couple
> of lines earlier was any better...

Yep, that is indeed a problem.  I think we just need to do the r_aborted 
and/or r_locked_dir check in the else if condition...

> I'm not sure if we are guaranteed that ceph_readdir_prepopulate() won't
> get to its splice_dentry() and d_delete() calls in similar situations -
> I hadn't checked that one yet.  If it isn't guaranteed, we have a problem
> there as well.

...and the condition guarding readdir_prepopulate().  :)

> I might very well be missing something - that code is seriously convoluted,
> and race wouldn't be easy to hit, so I don't have anything resembling
> a candidate reproducer ;-/  IOW, this is just from RTFS and I'd really
> appreciate comments from folks familiar with ceph.

I think you're reading it correctly.  The main thing to keep in mind here 
is that we *do* need to call fill_inode() for the inode metadata on these 
requests to keep the mds and client state in sync.  The dentry state is 
safe to ignore.

It would be great to have the dir i_mutex rules summarized somewhere, even 
if it is just a copy of the below.  It took a fair bit of trial and error 
to infer what was going on when writing this code.  :)

Ping me when you've pushed that branch and I'll take a look...

Thanks!
sage

> 
> VFS side of requirements is fairly simple:
>   * d_splice_alias(d, _), d_add_ci(d, _), d_add(d, _),
> d_materialise_unique(d, _), d_delete(d), d_move(_, d) should be called
> only with ->i_mutex held on the parent of d.
>   * d_move(d, _), d_add_unique(d, _), d_instantiate_unique(d, _),
> d_instantiate(d, _) should be called only with d being parentless (i.e.
> d->d_parent == d, aka. IS_ROOT(d)) or with ->i_mutex held on the parent of d.
>   * with the exception of "prepopulate dentry tree at ->get_sb() time"
> kind of situations, d_alloc(d, _) and d_alloc_name(d, _) should be called
> only with d->d_inode->i_mutex held (and it won't be too hard to get rid of
> those exceptions, actually).
>   * lookup_one_len(_, d, _) should only be called with ->i_mutex held
> on d->d_inode
>   * d_move(d1, d2) in case when d1 and d2 have different parents
> should only be called with ->s_vfs_rename_mutex held on d1->d_sb (== 
> d2->d_sb).
> 
> We are guaranteed that ->i_mutex is held on (inode of) parent of d in
> ->lookup(_, d, _)
> ->atomic_open(_, d, _, _, _, _)
> ->mkdir(_, d, _)
> ->symlink(_, d, _)
> ->create(_, d, _, _)
> ->mknod(_, d, _, _)
> ->link(_, _, d)
> ->unlink(_, d)
> ->rmdir(_, d)
> ->rename(_, d, _, _)
> ->rename(_, _, _, d)
> Note that this is *not* guaranteed for another argument of ->link() - the
> inode we are linking has ->i_mutex held, but nothing of that kind is promised
> for its parent directory.
> We also are guaranteed that ->i_mutex is held on the inode of opened
> directory passed to ->readdir() and on victims of ->unlink(), ->rmdir() and
> overwriting ->rename().
> 
> FWIW, I went through that stuff this weekend and we are fairly close to having
> those requirements satisfied - I'll push a branch with the accumulated f

Re: [ceph] locking fun with d_materialise_unique()

2013-01-29 Thread Al Viro
On Mon, Jan 28, 2013 at 09:20:34PM -0800, Sage Weil wrote:

> Yep, that is indeed a problem.  I think we just need to do the r_aborted 
> and/or r_locked_dir check in the else if condition...
> 
> > I'm not sure if we are guaranteed that ceph_readdir_prepopulate() won't
> > get to its splice_dentry() and d_delete() calls in similar situations -
> > I hadn't checked that one yet.  If it isn't guaranteed, we have a problem
> > there as well.
> 
> ...and the condition guarding readdir_prepopulate().  :)

> I think you're reading it correctly.  The main thing to keep in mind here 
> is that we *do* need to call fill_inode() for the inode metadata on these 
> requests to keep the mds and client state in sync.  The dentry state is 
> safe to ignore.

You mean the parts under
if (rinfo->head->is_dentry) {
and
if (rinfo->head->is_target) {
in there?  Because there's fill_inode() called from readdir_prepopulate()
and it's a lot more problematic than those two...

> It would be great to have the dir i_mutex rules summarized somewhere, even 
> if it is just a copy of the below.  It took a fair bit of trial and error 
> to infer what was going on when writing this code.  :)

Directory ->i_mutex rules are in part documented - "what VFS guarantees
to hold" side is in Documentation/filesystems/directory-locking.  It's
the other side ("what locks are expected to be held by callers of dcache.c
functions") that is badly missing...

> Ping me when you've pushed that branch and I'll take a look...

To gitol...@ra.kernel.org:/pub/scm/linux/kernel/git/viro/vfs.git
   01a88fa..4056362  master -> master

with tentative ceph patch in the very end.  Should be on git.kernel.org
shortly...


Re: [ceph] locking fun with d_materialise_unique()

2013-01-29 Thread Sage Weil
Hi Al,

On Tue, 29 Jan 2013, Al Viro wrote:

> On Mon, Jan 28, 2013 at 09:20:34PM -0800, Sage Weil wrote:
> 
> > Yep, that is indeed a problem.  I think we just need to do the r_aborted 
> > and/or r_locked_dir check in the else if condition...
> > 
> > > I'm not sure if we are guaranteed that ceph_readdir_prepopulate() won't
> > > get to its splice_dentry() and d_delete() calls in similar situations -
> > > I hadn't checked that one yet.  If it isn't guaranteed, we have a problem
> > > there as well.
> > 
> > ...and the condition guarding readdir_prepopulate().  :)
> 
> > I think you're reading it correctly.  The main thing to keep in mind here 
> > is that we *do* need to call fill_inode() for the inode metadata on these 
> > requests to keep the mds and client state in sync.  The dentry state is 
> > safe to ignore.
> 
> You mean the parts under
> if (rinfo->head->is_dentry) {
> and
> if (rinfo->head->is_target) {
> in there?  Because there's fill_inode() called from readdir_prepopulate()
> and it's a lot more problematic than those two...

Yep, patch below.
 
> > It would be great to have the dir i_mutex rules summarized somewhere, even 
> > if it is just a copy of the below.  It took a fair bit of trial and error 
> > to infer what was going on when writing this code.  :)
> 
> Directory ->i_mutex rules are in part documented - "what VFS guarantees
> to hold" side is in Documentation/filesystems/directory-locking.  It's
> the other side ("what locks are expected to be held by callers of dcache.c
> functions") that is badly missing...
> 
> > Ping me when you've pushed that branch and I'll take a look...
> 
> To gitol...@ra.kernel.org:/pub/scm/linux/kernel/git/viro/vfs.git
>01a88fa..4056362  master -> master
> 
> with tentative ceph patch in the very end.  Should be on git.kernel.org
> shortly...

We should drop the mds_client.c hunk from your patch, and then do 
something like the below.  I'll put it in the ceph tree so we can do some 
basic testing.  Unfortunately, the abort case is a bit annoying to 
trigger... we'll need to write a new test for it.

Thanks-
sage



From 80d70ef02d5277af8e1355db74fffe71f7217e76 Mon Sep 17 00:00:00 2001
From: Sage Weil 
Date: Tue, 29 Jan 2013 13:01:15 -0800
Subject: [PATCH] ceph: prepopulate inodes only when request is aborted

If r_aborted is true, we do not hold the dir i_mutex, and cannot touch
the dcache.  However, we still need to update the inodes with the state
returned by the MDS.

Reported-by: Al Viro 
Signed-off-by: Sage Weil 
---
 fs/ceph/inode.c |   36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 77b1b18..b51653e 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -1195,6 +1195,39 @@ done:
 /*
  * Prepopulate our cache with readdir results, leases, etc.
  */
+static int readdir_prepopulate_inodes_only(struct ceph_mds_request *req,
+                                           struct ceph_mds_session *session)
+{
+        struct ceph_mds_reply_info_parsed *rinfo = &req->r_reply_info;
+        int i, err = 0;
+
+        for (i = 0; i < rinfo->dir_nr; i++) {
+                struct ceph_vino vino;
+                struct inode *in;
+                int rc;
+
+                vino.ino = le64_to_cpu(rinfo->dir_in[i].in->ino);
+                vino.snap = le64_to_cpu(rinfo->dir_in[i].in->snapid);
+
+                in = ceph_get_inode(req->r_dentry->d_sb, vino);
+                if (IS_ERR(in)) {
+                        err = PTR_ERR(in);
+                        dout("new_inode badness got %d\n", err);
+                        continue;
+                }
+                rc = fill_inode(in, &rinfo->dir_in[i], NULL, session,
+                                req->r_request_started, -1,
+                                &req->r_caps_reservation);
+                if (rc < 0) {
+                        pr_err("fill_inode badness on %p got %d\n", in, rc);
+                        err = rc;
+                        continue;
+                }
+        }
+
+        return err;
+}
+
 int ceph_readdir_prepopulate(struct ceph_mds_request *req,
                              struct ceph_mds_session *session)
 {
@@ -1209,6 +1242,9 @@ int ceph_readdir_prepopulate(struct ceph_mds_request *req,
         u64 frag = le32_to_cpu(rhead->args.readdir.frag);
         struct ceph_dentry_info *di;
 
+        if (req->r_aborted)
+                return readdir_prepopulate_inodes_only(req, session);
+
         if (le32_to_cpu(rinfo->head->op) == CEPH_MDS_OP_LSSNAP) {
                 snapdir = ceph_get_snapdir(parent->d_inode);
                 parent = d_find_alias(snapdir);
-- 
1.7.9.5



Re: [ceph] locking fun with d_materialise_unique()

2013-01-30 Thread Al Viro
On Tue, Jan 29, 2013 at 01:03:23PM -0800, Sage Weil wrote:

> We should drop the mds_client.c hunk from your patch, and then do 
> something like the below.  I'll put it in the ceph tree so we can do some 
> basic testing.  Unfortunately, the abort case is a bit annoying to 
> trigger.. we'll need to write a new test for it.

hunk dropped, the rest folded into your patch, resulting branch pushed...


Re: [ceph] locking fun with d_materialise_unique()

2013-01-31 Thread Al Viro
On Wed, Jan 30, 2013 at 02:42:14PM +, Al Viro wrote:
> On Tue, Jan 29, 2013 at 01:03:23PM -0800, Sage Weil wrote:
> 
> > We should drop the mds_client.c hunk from your patch, and then do 
> > something like the below.  I'll put it in the ceph tree so we can do some 
> > basic testing.  Unfortunately, the abort case is a bit annoying to 
> > trigger.. we'll need to write a new test for it.
> 
> hunk dropped, the rest folded into your patch, resulting branch pushed...

BTW, moderately related issue - the uses of ceph_get_dentry_parent_inode()
look fishy.  In ceph_rename() it seems to be pointless.  We do have
the directory inode of the parent of old_dentry - it's old_dir, so it's just
a fancy way to spell igrab(old_dir).  And as far as I can tell, *all*
other callers are racy.
* link("a/foo", "b/bar") can happen in parallel with
rename("a/foo", "c/splat").  link(2) holds ->i_mutex on a/foo (so it can't race
with unlink) and on b/; rename(2) holds it on a/, c/ and c/splat (if the last
one exists).  It also holds ->s_vfs_rename_mutex, so cross-directory renames
are serialized.  For normal filesystems that's enough, and their ->link()
couldn't care less about a/; ceph_link() wants the inode of a/ for some
reason.  If it *really* needs the parent of a/foo for the operation, the
current code is SOL - the directory we grab can cease being that parent
just as it's getting returned to ceph_link().
* ->d_revalidate() can happen in parallel with rename().  You
don't seem to be using the parent inode much in there, so that should
be reasonably easy to deal with.
* ceph_open() can race with rename(); ->atomic_open() is
called with parent locked, but ->open() isn't.  AFAICS, open() with
O_TRUNC could step into that code...
* ceph_setattr() *definitely* can race with rename(); we have
the object itself locked, but not its parent (and I'm really surprised
by the need to know that parent for such operation).
* CEPH_IOC_SET_LAYOUT vs. rename() - no idea what that ioctl is
about, but opened files can be moved around, TYVM, even when they are
in the middle of ioctl(2).
* ->setxattr() and ->removexattr() - again, can happen in parallel
with rename(), and again I really wonder why we need the parent directory
for such operations.

What's going on there?


Re: [ceph] locking fun with d_materialise_unique()

2013-02-05 Thread Sage Weil
Hi Al-

I've returned and recovered from LCA and have no more excuse for keeping 
my brain turned off:

On Fri, 1 Feb 2013, Al Viro wrote:
> On Wed, Jan 30, 2013 at 02:42:14PM +, Al Viro wrote:
> > On Tue, Jan 29, 2013 at 01:03:23PM -0800, Sage Weil wrote:
> > 
> > > We should drop the mds_client.c hunk from your patch, and then do 
> > > something like the below.  I'll put it in the ceph tree so we can do some 
> > > basic testing.  Unfortunately, the abort case is a bit annoying to 
> > > trigger.. we'll need to write a new test for it.
> > 
> > hunk dropped, the rest folded into your patch, resulting branch pushed...
> 
> BTW, moderately related issue - ceph_get_dentry_parent_inode() uses
> look fishy.  In ceph_rename() it seems to be pointless.  We do have
> the directory inode of parent of old_dentry - it's old_dir, so it's just
> a fancy way to spell igrab(old_directory).  And as far as I can tell, *all*

Yeah, that's an easy fix.

> other callers are racy.
>   * link("a/foo", "b/bar") can happen in parallel with
> rename("a/foo", "c/splat").  link(2) holds ->i_mutex on a/foo (so it can't 
> race
> with unlink) and on b/; rename(2) holds it on a/, c/ and c/splat (if the last
> one exists).  It also holds ->s_vfs_rename_sem, so cross-directory renames
> are serialized.  For normal filesystems that's enough and there ->link()
> couldn't care less about a/; ceph_link() wants the inode of a/ for some
> reason.  If it *really* needs the parent of a/foo for the operation, the
> current code is SOL - the directory we grab can cease being that parent
> just as it's getting returned to ceph_link()

And I think this use is totally unnecessary.

>   * ->d_revalidate() can happen in parallel with rename().  You
> don't seem to be using the parent inode much in there, so that should
> be reasonably easy to deal with.

This is the tricky one.  We're looking at the parent because some of the 
directory inode state is effectively a lease over the validity of the 
directory's dentries.  In the general case, this is more efficient than a 
lot of per-dentry state in the client/server protocol.

Probably what we *should* be doing is linking to the parent from the 
child's ceph_dentry_info under the MDS session mutex, so that it is 
properly aligned with the state being passed from the server.  I 
considered this originally but was annoyed about essentially duplicating 
d_parent.  But... the locking being what it is, that is probably the only 
way to make the d_revalidate completely safe.  It'll be a bigger patch to 
change that behavior, though.

>   * ceph_open() can race with rename(); ->atomic_open() is
> called with parent locked, but ->open() isn't.  AFAICS, open() with
> O_TRUNC could step into that code...
>   * ceph_setattr() *definitely* can race with rename(); we have
> the object itself locked, but not its parent (and I'm really surprised
> by the need to know that parent for such operation).
>   * CEPH_IOC_SET_LAYOUT vs. rename() - no idea what that ioctl is
> about, but opened files can be moved around, TYVM, even when they are
> in the middle of ioctl(2).
>   * ->setxattr() and ->removexattr() - again, can happen in parallel
> with rename(), and again I really wonder why do we need the parent directory
> for such operations.

These are only there so that an fsync(dirfd) will wait for the file 
creation or other metadata to commit.  I suspect I am (heavily) 
over-reading what POSIX says here... and should be concerned only with the 
namespace modifications (link, unlink, rename, O_CREAT).

Assuming that's the case,

git://github.com/ceph/ceph-client.git #wip-dentries

has a set of patches that fix this up.  (Well, except for d_revalidate, 
which is a bigger change; opened #4023 in the ceph bug tracker.)

Thanks!
sage