Re: [ceph] locking fun with d_materialise_unique()

2013-02-05 Thread Sage Weil
Hi Al-

I've returned and recovered from LCA and have no more excuse for keeping 
my brain turned off:

On Fri, 1 Feb 2013, Al Viro wrote:
 On Wed, Jan 30, 2013 at 02:42:14PM +0000, Al Viro wrote:
  On Tue, Jan 29, 2013 at 01:03:23PM -0800, Sage Weil wrote:
  
   We should drop the mds_client.c hunk from your patch, and then do 
   something like the below.  I'll put it in the ceph tree so we can do some 
   basic testing.  Unfortunately, the abort case is a bit annoying to 
   trigger.. we'll need to write a new test for it.
  
  hunk dropped, the rest folded into your patch, resulting branch pushed...
 
 BTW, moderately related issue - ceph_get_dentry_parent_inode() uses
 look fishy.  In ceph_rename() it seems to be pointless.  We do have
 the inode of old_dentry's parent - it's old_dir, so it's just
 a fancy way to spell igrab(old_dir).  And as far as I can tell, *all*

Yeah, that's an easy fix.
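
Something like this, I mean (sketch only; "dir" is a stand-in for whatever
local the call site actually uses):

	/* before: re-derive the parent from old_dentry (racy vs. rename) */
	dir = ceph_get_dentry_parent_inode(old_dentry);

	/* after: in ceph_rename(), old_dir *is* the parent of old_dentry */
	dir = igrab(old_dir);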

 other callers are racy.
   * link(a/foo, b/bar) can happen in parallel with
 rename(a/foo, c/splat).  link(2) holds ->i_mutex on a/foo (so it can't race
 with unlink) and on b/; rename(2) holds it on a/, c/ and c/splat (if the last
 one exists).  It also holds ->s_vfs_rename_mutex, so cross-directory renames
 are serialized.  For normal filesystems that's enough and there ->link()
 couldn't care less about a/; ceph_link() wants the inode of a/ for some
 reason.  If it *really* needs the parent of a/foo for the operation, the
 current code is SOL - the directory we grab can cease being that parent
 just as it's getting returned to ceph_link().

And I think this use is totally unnecessary.

   * ->d_revalidate() can happen in parallel with rename().  You
 don't seem to be using the parent inode much in there, so that should
 be reasonably easy to deal with.

This is the tricky one.  We're looking at the parent because some of the 
directory inode state is effectively a lease over the validity of the 
directory's dentries.  In the general case, this is more efficient than a 
lot of per-dentry state in the client/server protocol.

Probably what we *should* be doing is linking to the parent from the 
child's ceph_dentry_info under the MDS session mutex, so that it is 
properly aligned with the state being passed from the server.  I 
considered this originally but was annoyed about essentially duplicating 
d_parent.  But... the locking being what it is, that is probably the only 
way to make the d_revalidate completely safe.  It'll be a bigger patch to 
change that behavior, though.
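
To make the idea concrete, a rough sketch (the field and helper names here
are made up for illustration, not what any eventual patch will use):

	/* pin the parent inode in the per-dentry info; only ever updated
	 * under the MDS session mutex, so it stays in step with the lease
	 * state the server handed out */
	struct ceph_dentry_info {
		struct ceph_mds_session *lease_session;
		struct inode *lease_parent;	/* hypothetical field */
		/* ... existing lease fields ... */
	};

	static void ceph_dentry_pin_parent(struct dentry *dentry)
	{
		struct ceph_dentry_info *di = ceph_dentry(dentry);
		struct inode *old = di->lease_parent;

		/* caller holds session->s_mutex */
		di->lease_parent = igrab(dentry->d_parent->d_inode);
		iput(old);	/* iput(NULL) is a no-op */
	}

d_revalidate would then consult di->lease_parent instead of chasing d_parent
at an arbitrary point in time.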

   * ceph_open() can race with rename(); ->atomic_open() is
 called with parent locked, but ->open() isn't.  AFAICS, open() with
 O_TRUNC could step into that code...
   * ceph_setattr() *definitely* can race with rename(); we have
 the object itself locked, but not its parent (and I'm really surprised
 by the need to know that parent for such an operation).
   * CEPH_IOC_SET_LAYOUT vs. rename() - no idea what that ioctl is
 about, but opened files can be moved around, TYVM, even when they are
 in the middle of ioctl(2).
   * ->setxattr() and ->removexattr() - again, can happen in parallel
 with rename(), and again I really wonder why we need the parent directory
 for such operations.

These are only there so that an fsync(dirfd) will wait for the file 
creation or other metadata to commit.  I suspect I am (heavily) 
over-reading what POSIX says here... and should be concerned only with the 
namespace modifications (link, unlink, rename, O_CREAT).
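
Concretely, the parent in question is the dir argument to
ceph_mdsc_do_request(), which is what fsync(dirfd) ends up waiting on; the
sort of change this implies for ceph_setattr() and friends (sketch, not the
actual patches on the branch below):

	/* before: tie the request to the parent so fsync(dirfd) waits on it */
	err = ceph_mdsc_do_request(mdsc,
				   ceph_get_dentry_parent_inode(dentry), req);

	/* after: setattr is not a namespace operation, so don't involve
	 * the parent directory at all */
	err = ceph_mdsc_do_request(mdsc, NULL, req);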

Assuming that's the case,

git://github.com/ceph/ceph-client.git #wip-dentries

has a set of patches that fix this up.  (Well, except for d_revalidate, 
which is a bigger change; opened #4023 in the ceph bug tracker.)

Thanks!
sage


Re: [ceph] locking fun with d_materialise_unique()

2013-01-31 Thread Al Viro
On Wed, Jan 30, 2013 at 02:42:14PM +0000, Al Viro wrote:
 On Tue, Jan 29, 2013 at 01:03:23PM -0800, Sage Weil wrote:
 
  We should drop the mds_client.c hunk from your patch, and then do 
  something like the below.  I'll put it in the ceph tree so we can do some 
  basic testing.  Unfortunately, the abort case is a bit annoying to 
  trigger.. we'll need to write a new test for it.
 
 hunk dropped, the rest folded into your patch, resulting branch pushed...

BTW, moderately related issue - ceph_get_dentry_parent_inode() uses
look fishy.  In ceph_rename() it seems to be pointless.  We do have
the inode of old_dentry's parent - it's old_dir, so it's just
a fancy way to spell igrab(old_dir).  And as far as I can tell, *all*
other callers are racy.
* link(a/foo, b/bar) can happen in parallel with
rename(a/foo, c/splat).  link(2) holds ->i_mutex on a/foo (so it can't race
with unlink) and on b/; rename(2) holds it on a/, c/ and c/splat (if the last
one exists).  It also holds ->s_vfs_rename_mutex, so cross-directory renames
are serialized.  For normal filesystems that's enough and there ->link()
couldn't care less about a/; ceph_link() wants the inode of a/ for some
reason.  If it *really* needs the parent of a/foo for the operation, the
current code is SOL - the directory we grab can cease being that parent
just as it's getting returned to ceph_link().
* ->d_revalidate() can happen in parallel with rename().  You
don't seem to be using the parent inode much in there, so that should
be reasonably easy to deal with.
* ceph_open() can race with rename(); ->atomic_open() is
called with parent locked, but ->open() isn't.  AFAICS, open() with
O_TRUNC could step into that code...
* ceph_setattr() *definitely* can race with rename(); we have
the object itself locked, but not its parent (and I'm really surprised
by the need to know that parent for such an operation).
* CEPH_IOC_SET_LAYOUT vs. rename() - no idea what that ioctl is
about, but opened files can be moved around, TYVM, even when they are
in the middle of ioctl(2).
* ->setxattr() and ->removexattr() - again, can happen in parallel
with rename(), and again I really wonder why we need the parent directory
for such operations.

What's going on there?


Re: [ceph] locking fun with d_materialise_unique()

2013-01-30 Thread Al Viro
On Tue, Jan 29, 2013 at 01:03:23PM -0800, Sage Weil wrote:

 We should drop the mds_client.c hunk from your patch, and then do 
 something like the below.  I'll put it in the ceph tree so we can do some 
 basic testing.  Unfortunately, the abort case is a bit annoying to 
 trigger.. we'll need to write a new test for it.

hunk dropped, the rest folded into your patch, resulting branch pushed...


Re: [ceph] locking fun with d_materialise_unique()

2013-01-29 Thread Al Viro
On Mon, Jan 28, 2013 at 09:20:34PM -0800, Sage Weil wrote:

 Yep, that is indeed a problem.  I think we just need to do the r_aborted 
 and/or r_locked_dir check in the else if condition...
 
  I'm not sure if we are guaranteed that ceph_readdir_prepopulate() won't
  get to its splice_dentry() and d_delete() calls in similar situations -
  I hadn't checked that one yet.  If it isn't guaranteed, we have a problem
  there as well.
 
 ...and the condition guarding readdir_prepopulate().  :)

 I think you're reading it correctly.  The main thing to keep in mind here 
 is that we *do* need to call fill_inode() for the inode metadata on these 
 requests to keep the mds and client state in sync.  The dentry state is 
 safe to ignore.

You mean the parts under
	if (rinfo->head->is_dentry) {
and
	if (rinfo->head->is_target) {
in there?  Because there's fill_inode() called from readdir_prepopulate()
and it's a lot more problematic than those two...

 It would be great to have the dir i_mutex rules summarized somewhere, even 
 if it is just a copy of the below.  It took a fair bit of trial and error 
 to infer what was going on when writing this code.  :)

Directory ->i_mutex rules are in part documented - the what-VFS-guarantees-to-hold
side is in Documentation/filesystems/directory-locking.  It's
the other side (what locks are expected to be held by callers of dcache.c
functions) that is badly missing...
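
For the cross-directory rename case, for instance, the held-locks side comes
from lock_rename(); roughly this (abridged - the real thing in fs/namei.c
picks the lock order via d_ancestor() so a parent is locked before a child):

	mutex_lock(&p1->d_sb->s_vfs_rename_mutex);
	mutex_lock_nested(&p1->d_inode->i_mutex, I_MUTEX_PARENT);
	mutex_lock_nested(&p2->d_inode->i_mutex, I_MUTEX_CHILD);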

 Ping me when you've pushed that branch and I'll take a look...

To gitol...@ra.kernel.org:/pub/scm/linux/kernel/git/viro/vfs.git
   01a88fa..4056362  master -> master

with a tentative ceph patch at the very end.  Should be on git.kernel.org
shortly...


Re: [ceph] locking fun with d_materialise_unique()

2013-01-29 Thread Sage Weil
Hi Al,

On Tue, 29 Jan 2013, Al Viro wrote:

 On Mon, Jan 28, 2013 at 09:20:34PM -0800, Sage Weil wrote:
 
  Yep, that is indeed a problem.  I think we just need to do the r_aborted 
  and/or r_locked_dir check in the else if condition...
  
   I'm not sure if we are guaranteed that ceph_readdir_prepopulate() won't
   get to its splice_dentry() and d_delete() calls in similar situations -
   I hadn't checked that one yet.  If it isn't guaranteed, we have a problem
   there as well.
  
  ...and the condition guarding readdir_prepopulate().  :)
 
  I think you're reading it correctly.  The main thing to keep in mind here 
  is that we *do* need to call fill_inode() for the inode metadata on these 
  requests to keep the mds and client state in sync.  The dentry state is 
  safe to ignore.
 
 You mean the parts under
 	if (rinfo->head->is_dentry) {
 and
 	if (rinfo->head->is_target) {
 in there?  Because there's fill_inode() called from readdir_prepopulate()
 and it's a lot more problematic than those two...

Yep, patch below.
 
  It would be great to have the dir i_mutex rules summarized somewhere, even 
  if it is just a copy of the below.  It took a fair bit of trial and error 
  to infer what was going on when writing this code.  :)
 
 Directory ->i_mutex rules are in part documented - the what-VFS-guarantees-to-hold
 side is in Documentation/filesystems/directory-locking.  It's
 the other side (what locks are expected to be held by callers of dcache.c
 functions) that is badly missing...
 
  Ping me when you've pushed that branch and I'll take a look...
 
 To gitol...@ra.kernel.org:/pub/scm/linux/kernel/git/viro/vfs.git
    01a88fa..4056362  master -> master
 
 with a tentative ceph patch at the very end.  Should be on git.kernel.org
 shortly...

We should drop the mds_client.c hunk from your patch, and then do 
something like the below.  I'll put it in the ceph tree so we can do some 
basic testing.  Unfortunately, the abort case is a bit annoying to 
trigger.. we'll need to write a new test for it.
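
The rough shape of such a test (sketch only - the path is made up, and the
hard part, stalling the MDS reply long enough for the SIGKILL to win the
race, is left out):

	#include <signal.h>
	#include <sys/stat.h>
	#include <sys/types.h>
	#include <sys/wait.h>
	#include <unistd.h>

	int main(void)
	{
		/* hypothetical path on a ceph mount with a delayed MDS */
		const char *path = "/mnt/ceph/dir/.snap/foo";
		pid_t pid = fork();

		if (pid == 0) {
			struct stat st;
			stat(path, &st);	/* blocks waiting on the MDS */
			_exit(0);
		}
		usleep(10000);		/* let the request hit the wire */
		kill(pid, SIGKILL);	/* abort just as the reply lands */
		waitpid(pid, NULL, 0);
		return 0;
	}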

Thanks-
sage



From 80d70ef02d5277af8e1355db74fffe71f7217e76 Mon Sep 17 00:00:00 2001
From: Sage Weil s...@inktank.com
Date: Tue, 29 Jan 2013 13:01:15 -0800
Subject: [PATCH] ceph: prepopulate inodes only when request is aborted

If r_aborted is true, we do not hold the dir i_mutex, and cannot touch
the dcache.  However, we still need to update the inodes with the state
returned by the MDS.

Reported-by: Al Viro v...@zeniv.linux.org.uk
Signed-off-by: Sage Weil s...@inktank.com
---
 fs/ceph/inode.c |   36 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 77b1b18..b51653e 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -1195,6 +1195,39 @@ done:
 /*
  * Prepopulate our cache with readdir results, leases, etc.
  */
+static int readdir_prepopulate_inodes_only(struct ceph_mds_request *req,
+					   struct ceph_mds_session *session)
+{
+	struct ceph_mds_reply_info_parsed *rinfo = &req->r_reply_info;
+	int i, err = 0;
+
+	for (i = 0; i < rinfo->dir_nr; i++) {
+		struct ceph_vino vino;
+		struct inode *in;
+		int rc;
+
+		vino.ino = le64_to_cpu(rinfo->dir_in[i].in->ino);
+		vino.snap = le64_to_cpu(rinfo->dir_in[i].in->snapid);
+
+		in = ceph_get_inode(req->r_dentry->d_sb, vino);
+		if (IS_ERR(in)) {
+			err = PTR_ERR(in);
+			dout("new_inode badness got %d\n", err);
+			continue;
+		}
+		rc = fill_inode(in, &rinfo->dir_in[i], NULL, session,
+				req->r_request_started, -1,
+				&req->r_caps_reservation);
+		if (rc < 0) {
+			pr_err("fill_inode badness on %p got %d\n", in, rc);
+			err = rc;
+			continue;
+		}
+	}
+
+	return err;
+}
+
 int ceph_readdir_prepopulate(struct ceph_mds_request *req,
 			     struct ceph_mds_session *session)
 {
@@ -1209,6 +1242,9 @@ int ceph_readdir_prepopulate(struct ceph_mds_request *req,
 	u64 frag = le32_to_cpu(rhead->args.readdir.frag);
 	struct ceph_dentry_info *di;
 
+	if (req->r_aborted)
+		return readdir_prepopulate_inodes_only(req, session);
+
 	if (le32_to_cpu(rinfo->head->op) == CEPH_MDS_OP_LSSNAP) {
 		snapdir = ceph_get_snapdir(parent->d_inode);
 		parent = d_find_alias(snapdir);
-- 
1.7.9.5



Re: [ceph] locking fun with d_materialise_unique()

2013-01-28 Thread Sage Weil
Hi Al,

On Tue, 29 Jan 2013, Al Viro wrote:
   There's a fun potential problem with CEPH_MDS_OP_LOOKUPSNAP handling
 in ceph_fill_trace().  Consider the following scenario:
 
 Process calls stat(2).  Lookup locks parent, allocates dentry and calls
 ->lookup().  Request is created and sent over the wire.  Then we sit and
 wait for completion.  Just as the reply has arrived, process gets SIGKILL.
 OK, we get to
 	/*
 	 * ensure we aren't running concurrently with
 	 * ceph_fill_trace or ceph_readdir_prepopulate, which
 	 * rely on locks (dir mutex) held by our caller.
 	 */
 	mutex_lock(&req->r_fill_mutex);
 	req->r_err = err;
 	req->r_aborted = true;
 	mutex_unlock(&req->r_fill_mutex);
 and we got there before handle_reply() grabbed ->r_fill_mutex.  Then we return
 to ceph_lookup(), drop the reference to request and bugger off.  Parent is
 unlocked by caller.
 
 In the meanwhile, there's another thread sitting in handle_reply().  It
 got ->r_fill_mutex and called ceph_fill_trace().  Had that been something
 like a rename request, ceph_fill_trace() would've checked req->r_aborted and
 that would've been it.  However, we hit this:
 	} else if (req->r_op == CEPH_MDS_OP_LOOKUPSNAP ||
 		   req->r_op == CEPH_MDS_OP_MKSNAP) {
 		struct dentry *dn = req->r_dentry;
 and proceed to
 	dout(" linking snapped dir %p to dn %p\n", in, dn);
 dn = splice_dentry(dn, in, NULL, true);
 which does
 realdn = d_materialise_unique(dn, in);
 and we are in trouble - d_materialise_unique() assumes that ->i_mutex on
 parent is held, which isn't guaranteed anymore.  Not that d_delete() done
 a couple of lines earlier was any better...

Yep, that is indeed a problem.  I think we just need to do the r_aborted 
and/or r_locked_dir check in the else if condition...
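
Something like this, that is (untested):

	} else if ((req->r_op == CEPH_MDS_OP_LOOKUPSNAP ||
		    req->r_op == CEPH_MDS_OP_MKSNAP) &&
		   !req->r_aborted && req->r_locked_dir) {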

 I'm not sure if we are guaranteed that ceph_readdir_prepopulate() won't
 get to its splice_dentry() and d_delete() calls in similar situations -
 I hadn't checked that one yet.  If it isn't guaranteed, we have a problem
 there as well.

...and the condition guarding readdir_prepopulate().  :)

 I might very well be missing something - that code is seriously convoluted,
 and race wouldn't be easy to hit, so I don't have anything resembling
 a candidate reproducer ;-/  IOW, this is just from RTFS and I'd really
 appreciate comments from folks familiar with ceph.

I think you're reading it correctly.  The main thing to keep in mind here 
is that we *do* need to call fill_inode() for the inode metadata on these 
requests to keep the mds and client state in sync.  The dentry state is 
safe to ignore.

It would be great to have the dir i_mutex rules summarized somewhere, even 
if it is just a copy of the below.  It took a fair bit of trial and error 
to infer what was going on when writing this code.  :)

Ping me when you've pushed that branch and I'll take a look...

Thanks!
sage





 
 VFS side of requirements is fairly simple:
 	* d_splice_alias(d, _), d_add_ci(d, _), d_add(d, _),
 d_materialise_unique(d, _), d_delete(d), d_move(_, d) should be called
 only with ->i_mutex held on the parent of d.
 	* d_move(d, _), d_add_unique(d, _), d_instantiate_unique(d, _),
 d_instantiate(d, _) should be called only with d being parentless (i.e.
 d->d_parent == d, aka. IS_ROOT(d)) or with ->i_mutex held on the parent of d.
 	* with the exception of "prepopulate dentry tree at ->get_sb() time"
 kind of situations, d_alloc(d, _) and d_alloc_name(d, _) should be called
 only with d->d_inode->i_mutex held (and it won't be too hard to get rid of
 those exceptions, actually).
 	* lookup_one_len(_, d, _) should only be called with ->i_mutex held
 on d->d_inode.
 	* d_move(d1, d2) in case when d1 and d2 have different parents
 should only be called with ->s_vfs_rename_mutex held on d1->d_sb (== d2->d_sb).
 
 We are guaranteed that ->i_mutex is held on (inode of) parent of d in
 	->lookup(_, d, _)
 	->atomic_open(_, d, _, _, _, _)
 	->mkdir(_, d, _)
 	->symlink(_, d, _)
 	->create(_, d, _, _)
 	->mknod(_, d, _, _)
 	->link(_, _, d)
 	->unlink(_, d)
 	->rmdir(_, d)
 	->rename(_, d, _, _)
 	->rename(_, _, _, d)
 Note that this is *not* guaranteed for another argument of ->link() - the
 inode we are linking has ->i_mutex held, but nothing of that kind is promised
 for its parent directory.
 We also are guaranteed that ->i_mutex is held on the inode of opened
 directory passed to ->readdir() and on victims of ->unlink(), ->rmdir() and
 overwriting ->rename().
 
 FWIW, I went through that stuff this weekend and we are fairly close to having
 those requirements satisfied - I'll push a branch with the accumulated fixes
 in a few and after that we should be down to very few remaining violations and
 dubious places (ceph issues above being one of them).