Re: [ceph] locking fun with d_materialise_unique()
Hi Al-

I've returned and recovered from LCA and have no more excuse for keeping
my brain turned off:

On Fri, 1 Feb 2013, Al Viro wrote:
> On Wed, Jan 30, 2013 at 02:42:14PM +, Al Viro wrote:
>> On Tue, Jan 29, 2013 at 01:03:23PM -0800, Sage Weil wrote:
>>> We should drop the mds_client.c hunk from your patch, and then do
>>> something like the below.  I'll put it in the ceph tree so we can do
>>> some basic testing.  Unfortunately, the abort case is a bit annoying
>>> to trigger.. we'll need to write a new test for it.
>>
>> hunk dropped, the rest folded into your patch, resulting branch pushed...
>
> BTW, moderately related issue - ceph_get_dentry_parent_inode() uses look
> fishy.  In ceph_rename() it seems to be pointless.  We do have the
> directory inode of parent of old_dentry - it's old_dir, so it's just a
> fancy way to spell igrab(old_directory).

Yeah, that's an easy fix.

> And as far as I can tell, *all* other callers are racy.
>
> * link(a/foo, b/bar) can happen in parallel with rename(a/foo, c/splat).
>   link(2) holds ->i_mutex on a/foo (so it can't race with unlink) and on
>   b/; rename(2) holds it on a/, c/ and c/splat (if the last one exists).
>   It also holds ->s_vfs_rename_sem, so cross-directory renames are
>   serialized.  For normal filesystems that's enough and there ->link()
>   couldn't care less about a/; ceph_link() wants the inode of a/ for
>   some reason.  If it *really* needs the parent of a/foo for the
>   operation, the current code is SOL - the directory we grab can cease
>   being that parent just as it's getting returned to ceph_link()

And I think this use is totally unnecessary.

> * ->d_revalidate() can happen in parallel with rename().  You don't
>   seem to be using the parent inode much in there, so that should be
>   reasonably easy to deal with.

This is the tricky one.  We're looking at the parent because some of the
directory inode state is effectively a lease over the validity of the
directory's dentries.  In the general case, this is more efficient than a
lot of per-dentry state in the client/server protocol.

Probably what we *should* be doing is linking to the parent from the
child's ceph_dentry_info under the MDS session mutex, so that it is
properly aligned with the state being passed from the server.  I
considered this originally but was annoyed about essentially duplicating
d_parent.  But... the locking being what it is, that is probably the only
way to make the d_revalidate completely safe.  It'll be a bigger patch to
change that behavior, though.

> * ceph_open() can race with rename(); ->atomic_open() is called with
>   parent locked, but ->open() isn't.  AFAICS, open() with O_TRUNC could
>   step into that code...
>
> * ceph_setattr() *definitely* can race with rename(); we have the
>   object itself locked, but not its parent (and I'm really surprised by
>   the need to know that parent for such operation).
>
> * CEPH_IOC_SET_LAYOUT vs. rename() - no idea what that ioctl is about,
>   but opened files can be moved around, TYVM, even when they are in the
>   middle of ioctl(2).
>
> * ->setxattr() and ->removexattr() - again, can happen in parallel with
>   rename(), and again I really wonder why do we need the parent
>   directory for such operations.

These are only there so that an fsync(dirfd) will wait for the file
creation or other metadata to commit.  I suspect I am (heavily)
over-reading what POSIX says here... and should be concerned only with
the namespace modifications (link, unlink, rename, O_CREAT).

Assuming that's the case, git://github.com/ceph/ceph-client.git
#wip-dentries has a set of patches that fix this up.  (Well, except for
d_revalidate, which is a bigger change; opened #4023 in the ceph bug
tracker.)

Thanks!
sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ceph] locking fun with d_materialise_unique()
On Wed, Jan 30, 2013 at 02:42:14PM +, Al Viro wrote:
> On Tue, Jan 29, 2013 at 01:03:23PM -0800, Sage Weil wrote:
>> We should drop the mds_client.c hunk from your patch, and then do
>> something like the below.  I'll put it in the ceph tree so we can do
>> some basic testing.  Unfortunately, the abort case is a bit annoying
>> to trigger.. we'll need to write a new test for it.
>
> hunk dropped, the rest folded into your patch, resulting branch pushed...

BTW, moderately related issue - ceph_get_dentry_parent_inode() uses look
fishy.  In ceph_rename() it seems to be pointless.  We do have the
directory inode of parent of old_dentry - it's old_dir, so it's just a
fancy way to spell igrab(old_directory).  And as far as I can tell, *all*
other callers are racy.

* link(a/foo, b/bar) can happen in parallel with rename(a/foo, c/splat).
  link(2) holds ->i_mutex on a/foo (so it can't race with unlink) and on
  b/; rename(2) holds it on a/, c/ and c/splat (if the last one exists).
  It also holds ->s_vfs_rename_sem, so cross-directory renames are
  serialized.  For normal filesystems that's enough and there ->link()
  couldn't care less about a/; ceph_link() wants the inode of a/ for some
  reason.  If it *really* needs the parent of a/foo for the operation,
  the current code is SOL - the directory we grab can cease being that
  parent just as it's getting returned to ceph_link()

* ->d_revalidate() can happen in parallel with rename().  You don't seem
  to be using the parent inode much in there, so that should be
  reasonably easy to deal with.

* ceph_open() can race with rename(); ->atomic_open() is called with
  parent locked, but ->open() isn't.  AFAICS, open() with O_TRUNC could
  step into that code...

* ceph_setattr() *definitely* can race with rename(); we have the object
  itself locked, but not its parent (and I'm really surprised by the need
  to know that parent for such operation).

* CEPH_IOC_SET_LAYOUT vs. rename() - no idea what that ioctl is about,
  but opened files can be moved around, TYVM, even when they are in the
  middle of ioctl(2).

* ->setxattr() and ->removexattr() - again, can happen in parallel with
  rename(), and again I really wonder why do we need the parent directory
  for such operations.

What's going on there?
Re: [ceph] locking fun with d_materialise_unique()
On Tue, Jan 29, 2013 at 01:03:23PM -0800, Sage Weil wrote:
> We should drop the mds_client.c hunk from your patch, and then do
> something like the below.  I'll put it in the ceph tree so we can do
> some basic testing.  Unfortunately, the abort case is a bit annoying to
> trigger.. we'll need to write a new test for it.

hunk dropped, the rest folded into your patch, resulting branch pushed...
Re: [ceph] locking fun with d_materialise_unique()
On Mon, Jan 28, 2013 at 09:20:34PM -0800, Sage Weil wrote:
> Yep, that is indeed a problem.  I think we just need to do the r_aborted
> and/or r_locked_dir check in the else if condition...
>
>> I'm not sure if we are guaranteed that ceph_readdir_prepopulate() won't
>> get to its splice_dentry() and d_delete() calls in similar situations -
>> I hadn't checked that one yet.  If it isn't guaranteed, we have a
>> problem there as well.
>
> ...and the condition guarding readdir_prepopulate(). :)
>
> I think you're reading it correctly.  The main thing to keep in mind
> here is that we *do* need to call fill_inode() for the inode metadata on
> these requests to keep the mds and client state in sync.  The dentry
> state is safe to ignore.

You mean the parts under if (rinfo->head->is_dentry) { and
if (rinfo->head->is_target) { in there?  Because there's fill_inode()
called from readdir_prepopulate() and it's a lot more problematic than
those two...

> It would be great to have the dir i_mutex rules summarized somewhere,
> even if it is just a copy of the below.  It took a fair bit of trial and
> error to infer what was going on when writing this code. :)

Directory ->i_mutex rules are in part documented - what VFS guarantees to
hold side is in Documentation/filesystems/directory-locking.  It's the
other side (what locks are expected to be held by callers of dcache.c
functions) that is badly missing...

> Ping me when you've pushed that branch and I'll take a look...

To gitol...@ra.kernel.org:/pub/scm/linux/kernel/git/viro/vfs.git
   01a88fa..4056362  master -> master

with tentative ceph patch in the very end.  Should be on git.kernel.org
shortly...
Re: [ceph] locking fun with d_materialise_unique()
Hi Al,

On Tue, 29 Jan 2013, Al Viro wrote:
>> Yep, that is indeed a problem.  I think we just need to do the
>> r_aborted and/or r_locked_dir check in the else if condition...
>>
>>> I'm not sure if we are guaranteed that ceph_readdir_prepopulate()
>>> won't get to its splice_dentry() and d_delete() calls in similar
>>> situations - I hadn't checked that one yet.  If it isn't guaranteed,
>>> we have a problem there as well.
>>
>> ...and the condition guarding readdir_prepopulate(). :)
>>
>> I think you're reading it correctly.  The main thing to keep in mind
>> here is that we *do* need to call fill_inode() for the inode metadata
>> on these requests to keep the mds and client state in sync.  The
>> dentry state is safe to ignore.
>
> You mean the parts under if (rinfo->head->is_dentry) { and
> if (rinfo->head->is_target) { in there?  Because there's fill_inode()
> called from readdir_prepopulate() and it's a lot more problematic than
> those two...

Yep, patch below.

>> It would be great to have the dir i_mutex rules summarized somewhere,
>> even if it is just a copy of the below.  It took a fair bit of trial
>> and error to infer what was going on when writing this code. :)
>
> Directory ->i_mutex rules are in part documented - what VFS guarantees
> to hold side is in Documentation/filesystems/directory-locking.  It's
> the other side (what locks are expected to be held by callers of
> dcache.c functions) that is badly missing...
>
>> Ping me when you've pushed that branch and I'll take a look...
>
> To gitol...@ra.kernel.org:/pub/scm/linux/kernel/git/viro/vfs.git
>    01a88fa..4056362  master -> master
>
> with tentative ceph patch in the very end.  Should be on git.kernel.org
> shortly...

We should drop the mds_client.c hunk from your patch, and then do
something like the below.  I'll put it in the ceph tree so we can do some
basic testing.  Unfortunately, the abort case is a bit annoying to
trigger.. we'll need to write a new test for it.
Thanks-
sage

From 80d70ef02d5277af8e1355db74fffe71f7217e76 Mon Sep 17 00:00:00 2001
From: Sage Weil <s...@inktank.com>
Date: Tue, 29 Jan 2013 13:01:15 -0800
Subject: [PATCH] ceph: prepopulate inodes only when request is aborted

If r_aborted is true, we do not hold the dir i_mutex, and cannot touch
the dcache.  However, we still need to update the inodes with the state
returned by the MDS.

Reported-by: Al Viro <v...@zeniv.linux.org.uk>
Signed-off-by: Sage Weil <s...@inktank.com>
---
 fs/ceph/inode.c | 36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 77b1b18..b51653e 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -1195,6 +1195,39 @@ done:
 /*
  * Prepopulate our cache with readdir results, leases, etc.
  */
+static int readdir_prepopulate_inodes_only(struct ceph_mds_request *req,
+					   struct ceph_mds_session *session)
+{
+	struct ceph_mds_reply_info_parsed *rinfo = &req->r_reply_info;
+	int i, err = 0;
+
+	for (i = 0; i < rinfo->dir_nr; i++) {
+		struct ceph_vino vino;
+		struct inode *in;
+		int rc;
+
+		vino.ino = le64_to_cpu(rinfo->dir_in[i].in->ino);
+		vino.snap = le64_to_cpu(rinfo->dir_in[i].in->snapid);
+
+		in = ceph_get_inode(req->r_dentry->d_sb, vino);
+		if (IS_ERR(in)) {
+			err = PTR_ERR(in);
+			dout("new_inode badness got %d\n", err);
+			continue;
+		}
+		rc = fill_inode(in, &rinfo->dir_in[i], NULL, session,
+				req->r_request_started, -1,
+				&req->r_caps_reservation);
+		if (rc < 0) {
+			pr_err("fill_inode badness on %p got %d\n", in, rc);
+			err = rc;
+			continue;
+		}
+	}
+
+	return err;
+}
+
 int ceph_readdir_prepopulate(struct ceph_mds_request *req,
			     struct ceph_mds_session *session)
 {
@@ -1209,6 +1242,9 @@ int ceph_readdir_prepopulate(struct ceph_mds_request *req,
 	u64 frag = le32_to_cpu(rhead->args.readdir.frag);
 	struct ceph_dentry_info *di;
 
+	if (req->r_aborted)
+		return readdir_prepopulate_inodes_only(req, session);
+
 	if (le32_to_cpu(rinfo->head->op) == CEPH_MDS_OP_LSSNAP) {
 		snapdir = ceph_get_snapdir(parent->d_inode);
 		parent = d_find_alias(snapdir);
-- 
1.7.9.5
Re: [ceph] locking fun with d_materialise_unique()
Hi Al,

On Tue, 29 Jan 2013, Al Viro wrote:
> There's a fun potential problem with CEPH_MDS_OP_LOOKUPSNAP handling in
> ceph_fill_trace().  Consider the following scenario:
>
> Process calls stat(2).  Lookup locks parent, allocates dentry and calls
> ->lookup().  Request is created and sent over the wire.  Then we sit
> and wait for completion.  Just as the reply has arrived, process gets
> SIGKILL.  OK, we get to
> 	/*
> 	 * ensure we aren't running concurrently with
> 	 * ceph_fill_trace or ceph_readdir_prepopulate, which
> 	 * rely on locks (dir mutex) held by our caller.
> 	 */
> 	mutex_lock(&req->r_fill_mutex);
> 	req->r_err = err;
> 	req->r_aborted = true;
> 	mutex_unlock(&req->r_fill_mutex);
> and we got there before handle_reply() grabbed ->r_fill_mutex.  Then we
> return to ceph_lookup(), drop the reference to request and bugger off.
> Parent is unlocked by caller.
>
> In the meanwhile, there's another thread sitting in handle_reply().  It
> got ->r_fill_mutex and called ceph_fill_trace().  Had that been
> something like rename request, ceph_fill_trace() would've checked
> req->r_aborted and that would've been it.  However, we hit this:
> 	} else if (req->r_op == CEPH_MDS_OP_LOOKUPSNAP ||
> 		   req->r_op == CEPH_MDS_OP_MKSNAP) {
> 		struct dentry *dn = req->r_dentry;
> and proceed to
> 		dout(" linking snapped dir %p to dn %p\n", in, dn);
> 		dn = splice_dentry(dn, in, NULL, true);
> which does
> 	realdn = d_materialise_unique(dn, in);
> and we are in trouble - d_materialise_unique() assumes that ->i_mutex
> on parent is held, which isn't guaranteed anymore.  Not that d_delete()
> done a couple of lines earlier was any better...

Yep, that is indeed a problem.  I think we just need to do the r_aborted
and/or r_locked_dir check in the else if condition...

> I'm not sure if we are guaranteed that ceph_readdir_prepopulate() won't
> get to its splice_dentry() and d_delete() calls in similar situations -
> I hadn't checked that one yet.  If it isn't guaranteed, we have a
> problem there as well.

...and the condition guarding readdir_prepopulate(). :)

> I might very well be missing something - that code is seriously
> convoluted, and race wouldn't be easy to hit, so I don't have anything
> resembling a candidate reproducer ;-/  IOW, this is just from RTFS and
> I'd really appreciate comments from folks familiar with ceph.

I think you're reading it correctly.  The main thing to keep in mind here
is that we *do* need to call fill_inode() for the inode metadata on these
requests to keep the mds and client state in sync.  The dentry state is
safe to ignore.

It would be great to have the dir i_mutex rules summarized somewhere,
even if it is just a copy of the below.  It took a fair bit of trial and
error to infer what was going on when writing this code. :)

Ping me when you've pushed that branch and I'll take a look...

Thanks!
sage

VFS side of requirements is fairly simple:

* d_splice_alias(d, _), d_add_ci(d, _), d_add(d, _),
  d_materialise_unique(d, _), d_delete(d), d_move(_, d) should be called
  only with ->i_mutex held on the parent of d.

* d_move(d, _), d_add_unique(d, _), d_instantiate_unique(d, _),
  d_instantiate(d, _) should be called only with d being parentless
  (i.e. d->d_parent == d, aka. IS_ROOT(d)) or with ->i_mutex held on the
  parent of d.

* with the exception of "prepopulate dentry tree at ->get_sb() time"
  kind of situations, d_alloc(d, _) and d_alloc_name(d, _) should be
  called only with d->d_inode->i_mutex held (and it won't be too hard to
  get rid of those exceptions, actually).

* lookup_one_len(_, d, _) should only be called with ->i_mutex held on
  d->d_inode

* d_move(d1, d2) in case when d1 and d2 have different parents should
  only be called with ->s_vfs_rename_mutex held on d1->d_sb
  (== d2->d_sb).
We are guaranteed that ->i_mutex is held on (inode of) parent of d in
	->lookup(_, d, _)
	->atomic_open(_, d, _, _, _, _)
	->mkdir(_, d, _)
	->symlink(_, d, _)
	->create(_, d, _, _)
	->mknod(_, d, _, _)
	->link(_, _, d)
	->unlink(_, d)
	->rmdir(_, d)
	->rename(_, d, _, _)
	->rename(_, _, _, d)
Note that this is *not* guaranteed for another argument of ->link() - the
inode we are linking has ->i_mutex held, but nothing of that kind is
promised for its parent directory.  We also are guaranteed that ->i_mutex
is held on the inode of opened directory passed to ->readdir() and on
victims of ->unlink(), ->rmdir() and overwriting ->rename().

FWIW, I went through that stuff this weekend and we are fairly close to
having those requirements satisfied - I'll push a branch with the
accumulated fixes in a few and after that we should be down to very few
remaining violations and dubious places (ceph issues above being one