Re: Vnode API cleanup pass 2a
On Jan 20, 2014, at 5:02 PM, Taylor R Campbell campbell+netbsd-tech-k...@mumble.net wrote:

Date: Mon, 20 Jan 2014 16:01:07 +0100
From: J. Hannken-Illjes hann...@eis.cs.tu-bs.de

Back to vnode creation operations returning unlocked vnodes: I put a new diff to http://www.netbsd.org/~hannken/vnode-pass2a-4.diff
Any more objections or OK to commit?

On further consideration, this looks to me like the right approach now. I gave it a quick skim, but not a thorough review. It looks pretty good, modulo a few little nits below. I assume all the atf tests and your other stress tests still pass?

Sure. Now even on a 20-core amd64 with linux/kvm as a host.

@@ -517,8 +517,10 @@ zfs_replay_create(zfsvfs_t *zfsvfs, lr_c
 		error = VOP_MKDIR(ZTOV(dzp), &vp, &cn, &xva.xva_vattr /*,vflg*/);
 		break;
 	case TX_MKXATTR:
 		error = zfs_make_xattrdir(dzp, &xva.xva_vattr, &vp, kcred);
+		if (error == 0 && vp != NULL)
+			VOP_UNLOCK(vp);

This looks suspicious -- does zfs_make_xattrdir return a locked vnode? I don't immediately see evidence that it does. (Also, if it turns out this branch is appropriate, shouldn't ((error == 0) == (vp != NULL)) be guaranteed?)

Yes, looks like zfs_make_xattrdir returns an unlocked vnode (which means the current implementation is wrong). Removed this part.

@@ -1563,10 +1553,11 @@ coda_symlink(void *v)
 	saved_cn_flags = cnp->cn_flags;
 	cnp->cn_flags &= ~(MODMASK | OPMASK);
 	cnp->cn_flags |= LOOKUP;
 	error = VOP_LOOKUP(dvp, ap->a_vpp, cnp);
+	if (error == 0)
+		VOP_UNLOCK(*ap->a_vpp);

Can you leave an XXX comment here marking it as a provisional kludge until VOP_LOOKUP becomes sane (i.e., returns the vnode unlocked)?

Done.

@@ -601,8 +601,10 @@ smbfs_create(void *v)
 	error = smbfs_smb_lookup(dnp, name, nmlen, &fattr, &scred);
 	if (error)
 		goto out;
 	error = smbfs_nget(VTOVFS(dvp), dvp, name, nmlen, &fattr, ap->a_vpp);
+	if (error == 0)
+		VOP_UNLOCK(*ap->a_vpp);
 	if (error)
 		goto out;

Can you do the VOP_UNLOCK unconditionally after the `if (error)' branch? Generally I find it much easier to read `branch on error' than `branch on success' and I try to adhere to that in new code. There are a few other cases of this in your patch, although some match the surrounding style of doing `if (error == 0)' or would require new labels so I wouldn't change those.

Done.

--- sys/rump/librump/rumpvfs/rumpfs.c	17 Jan 2014 10:55:03 -	1.122
+++ sys/rump/librump/rumpvfs/rumpfs.c	20 Jan 2014 09:10:53 -

@@ -564,9 +564,10 @@ makeprivate(enum vtype vt, mode_t mode,
 	return rn;
 }

 static int
-makevnode(struct mount *mp, struct rumpfs_node *rn, struct vnode **vpp)
+makevnode(struct mount *mp, struct rumpfs_node *rn, struct vnode **vpp,
+    bool lock_result)

There are only a couple cases of makevnode(..., true) in your patch, in rump_vop_lookup and rumpfs_mountfs. Instead of adding an argument to makevnode, why not just call vn_lock after makevnode in those cases?

Done. rump_vop_lookup changed as proposed -- this lock will go with the next pass doing the same for VOP_LOOKUP. rumpfs_mountfs doesn't need a lock -- an unmount cannot happen here:

RCS file: /cvsroot/src/sys/rump/librump/rumpvfs/rumpfs.c,v

@@ -813,4 +812,11 @@ rump_vop_lookup(void *v)
 		mutex_exit(&reclock);
 		rv = makevnode(dvp->v_mount, rn, vpp);
+		if (rv == 0) {
+			rv = vn_lock(*vpp, LK_EXCLUSIVE);
+			if (rv != 0) {
+				vrele(*vpp);
+				*vpp = NULL;
+			}
+		}
 	}
 	if (dotdot)

@@ -1762,5 +1768,4 @@ rumpfs_mountfs(struct mount *mp)
 	rfsmp->rfsmp_rvp->v_vflag |= VV_ROOT;
-	VOP_UNLOCK(rfsmp->rfsmp_rvp);
 	mp->mnt_data = rfsmp;

@@ -183,16 +184,17 @@ ext2fs_mknod(void *v)
 	 * Remove inode so that it will be reloaded by VFS_VGET and
 	 * checked to see if it is an alias of an existing entry in
 	 * the inode cache.
 	 */
-	VOP_UNLOCK(*vpp);
 	(*vpp)->v_type = VNON;
+	VOP_UNLOCK(*vpp);

Can you do this one in a separate tiny commit? It looks like an unrelated bug fix.

Done.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: Vnode API cleanup pass 2a
Back to vnode creation operations returning unlocked vnodes: I put a new diff to http://www.netbsd.org/~hannken/vnode-pass2a-4.diff
Any more objections or OK to commit?

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: Vnode API cleanup pass 2a
Date: Mon, 20 Jan 2014 16:01:07 +0100
From: J. Hannken-Illjes hann...@eis.cs.tu-bs.de

Back to vnode creation operations returning unlocked vnodes: I put a new diff to http://www.netbsd.org/~hannken/vnode-pass2a-4.diff
Any more objections or OK to commit?

On further consideration, this looks to me like the right approach now. I gave it a quick skim, but not a thorough review. It looks pretty good, modulo a few little nits below. I assume all the atf tests and your other stress tests still pass?

@@ -517,8 +517,10 @@ zfs_replay_create(zfsvfs_t *zfsvfs, lr_c
 		error = VOP_MKDIR(ZTOV(dzp), &vp, &cn, &xva.xva_vattr /*,vflg*/);
 		break;
 	case TX_MKXATTR:
 		error = zfs_make_xattrdir(dzp, &xva.xva_vattr, &vp, kcred);
+		if (error == 0 && vp != NULL)
+			VOP_UNLOCK(vp);

This looks suspicious -- does zfs_make_xattrdir return a locked vnode? I don't immediately see evidence that it does. (Also, if it turns out this branch is appropriate, shouldn't ((error == 0) == (vp != NULL)) be guaranteed?)

@@ -1563,10 +1553,11 @@ coda_symlink(void *v)
 	saved_cn_flags = cnp->cn_flags;
 	cnp->cn_flags &= ~(MODMASK | OPMASK);
 	cnp->cn_flags |= LOOKUP;
 	error = VOP_LOOKUP(dvp, ap->a_vpp, cnp);
+	if (error == 0)
+		VOP_UNLOCK(*ap->a_vpp);

Can you leave an XXX comment here marking it as a provisional kludge until VOP_LOOKUP becomes sane (i.e., returns the vnode unlocked)?

@@ -601,8 +601,10 @@ smbfs_create(void *v)
 	error = smbfs_smb_lookup(dnp, name, nmlen, &fattr, &scred);
 	if (error)
 		goto out;
 	error = smbfs_nget(VTOVFS(dvp), dvp, name, nmlen, &fattr, ap->a_vpp);
+	if (error == 0)
+		VOP_UNLOCK(*ap->a_vpp);
 	if (error)
 		goto out;

Can you do the VOP_UNLOCK unconditionally after the `if (error)' branch? Generally I find it much easier to read `branch on error' than `branch on success' and I try to adhere to that in new code. There are a few other cases of this in your patch, although some match the surrounding style of doing `if (error == 0)' or would require new labels so I wouldn't change those.

--- sys/rump/librump/rumpvfs/rumpfs.c	17 Jan 2014 10:55:03 -	1.122
+++ sys/rump/librump/rumpvfs/rumpfs.c	20 Jan 2014 09:10:53 -

@@ -564,9 +564,10 @@ makeprivate(enum vtype vt, mode_t mode,
 	return rn;
 }

 static int
-makevnode(struct mount *mp, struct rumpfs_node *rn, struct vnode **vpp)
+makevnode(struct mount *mp, struct rumpfs_node *rn, struct vnode **vpp,
+    bool lock_result)

There are only a couple cases of makevnode(..., true) in your patch, in rump_vop_lookup and rumpfs_mountfs. Instead of adding an argument to makevnode, why not just call vn_lock after makevnode in those cases?

@@ -183,16 +184,17 @@ ext2fs_mknod(void *v)
 	 * Remove inode so that it will be reloaded by VFS_VGET and
 	 * checked to see if it is an alias of an existing entry in
 	 * the inode cache.
 	 */
-	VOP_UNLOCK(*vpp);
 	(*vpp)->v_type = VNON;
+	VOP_UNLOCK(*vpp);

Can you do this one in a separate tiny commit? It looks like an unrelated bug fix.
Re: Vnode API cleanup pass 2a
Date: Thu, 16 Jan 2014 07:07:56 + From: David Holland dholland-t...@netbsd.org On Wed, Jan 15, 2014 at 04:31:07PM +0100, J. Hannken-Illjes wrote: I put a diff to http://www.netbsd.org/~hannken/vnode-pass2a-3.diff that changes the vnode creation operations to keep the dvp locked. Any objections or OK to commit? I don't understand where all of the zfs changes arise (without wading into the zfs code) but I'm not going to worry too much about that. Whoops, I missed those on my first reading -- in particular, the first two files, zfs_ioctl.c and zfs_replay.c (the zfs_vnops.c changes look fine and I'm glad we can now get rid of the comments on insanity I left there). It's not clear to me why the vop protocol change would induce these zfs changes. I'm particularly nervous because you've made the vops *stop* unlocking and releasing vnodes, so callers should at most have to *add* unlocks and vreles, but you've *removed* some vreles and left things locked. It could be that something causes changes to zfs_zget, but this seems unlikely. Did you run this code? (I suspect not -- it looks like the code had broken locking to begin with.)
Re: Vnode API cleanup pass 2a
On Jan 16, 2014, at 2:45 PM, Taylor R Campbell campbell+netbsd-tech-k...@mumble.net wrote:

Date: Thu, 16 Jan 2014 07:07:56 + From: David Holland dholland-t...@netbsd.org

On Wed, Jan 15, 2014 at 04:31:07PM +0100, J. Hannken-Illjes wrote:

I put a diff to http://www.netbsd.org/~hannken/vnode-pass2a-3.diff that changes the vnode creation operations to keep the dvp locked. Any objections or OK to commit?

I don't understand where all of the zfs changes arise (without wading into the zfs code) but I'm not going to worry too much about that.

Whoops, I missed those on my first reading -- in particular, the first two files, zfs_ioctl.c and zfs_replay.c (the zfs_vnops.c changes look fine and I'm glad we can now get rid of the comments on insanity I left there). It's not clear to me why the vop protocol change would induce these zfs changes. I'm particularly nervous because you've made the vops *stop* unlocking and releasing vnodes, so callers should at most have to *add* unlocks and vreles, but you've *removed* some vreles and left things locked. It could be that something causes changes to zfs_zget, but this seems unlikely. Did you run this code? (I suspect not -- it looks like the code had broken locking to begin with.)

Taylor, you're right.

- the change to zfs_ioctl.c was wrong and unneeded (this part is not used from NetBSD).
- the change to zfs_replay.c was wrong.

Looks like zfs_zget() returns a referenced and unlocked vnode so the attached diff should do it right as soon as VOP_CREATE and VOP_MKDIR get out of #ifdef TODO.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)

zfs.diff
Description: Binary data
Re: Vnode API cleanup pass 2a
Date: Thu, 16 Jan 2014 15:13:47 +0100 From: J. Hannken-Illjes hann...@eis.cs.tu-bs.de Taylor, you're right. - the change to zfs_ioctl.c was wrong and unneeded (this part is not used from NetBSD). - the change to zfs_replay.c was wrong. Looks like zfs_zget() returns a referenced and unlocked vnode so the attached diff should do it right as soon as VOP_CREATE and VOP_MKDIR get out of #ifdef TODO. OK, great. LGTM.
Re: Vnode API cleanup pass 2a
I'm not sure if I remember it correctly, but we do not use zfs_replay.c -- it is used with zfs snapshots, and the last time I worked on it it wasn't working.

On Thu, Jan 16, 2014 at 3:13 PM, J. Hannken-Illjes hann...@eis.cs.tu-bs.de wrote:

On Jan 16, 2014, at 2:45 PM, Taylor R Campbell campbell+netbsd-tech-k...@mumble.net wrote:

Date: Thu, 16 Jan 2014 07:07:56 + From: David Holland dholland-t...@netbsd.org

On Wed, Jan 15, 2014 at 04:31:07PM +0100, J. Hannken-Illjes wrote:

I put a diff to http://www.netbsd.org/~hannken/vnode-pass2a-3.diff that changes the vnode creation operations to keep the dvp locked. Any objections or OK to commit?

I don't understand where all of the zfs changes arise (without wading into the zfs code) but I'm not going to worry too much about that.

Whoops, I missed those on my first reading -- in particular, the first two files, zfs_ioctl.c and zfs_replay.c (the zfs_vnops.c changes look fine and I'm glad we can now get rid of the comments on insanity I left there). It's not clear to me why the vop protocol change would induce these zfs changes. I'm particularly nervous because you've made the vops *stop* unlocking and releasing vnodes, so callers should at most have to *add* unlocks and vreles, but you've *removed* some vreles and left things locked. It could be that something causes changes to zfs_zget, but this seems unlikely. Did you run this code? (I suspect not -- it looks like the code had broken locking to begin with.)

Taylor, you're right.

- the change to zfs_ioctl.c was wrong and unneeded (this part is not used from NetBSD).
- the change to zfs_replay.c was wrong.

Looks like zfs_zget() returns a referenced and unlocked vnode so the attached diff should do it right as soon as VOP_CREATE and VOP_MKDIR get out of #ifdef TODO.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)

--
Regards. Adam
Re: Vnode API cleanup pass 2a
On Jan 14, 2014, at 5:12 PM, J. Hannken-Illjes hann...@eis.cs.tu-bs.de wrote:

On Jan 14, 2014, at 5:54 AM, David Holland dholland-t...@netbsd.org wrote:

On Mon, Jan 13, 2014 at 04:06:02PM +0100, J. Hannken-Illjes wrote:

[snip]

Even though it looks like we don't need creation ops to return with the directory locked at the moment, it may be better to do this change now so atomic operations like the one seen in the NFS server are possible.

Yeah, that's what I was thinking.

Thus we would end up with: [snip]

Right. It would probably be better to make this a separate commit; if that's a nuisance for you, I can do it...

Give me a day (or maybe two) ...

I put a diff to http://www.netbsd.org/~hannken/vnode-pass2a-3.diff that changes the vnode creation operations to keep the dvp locked. Any objections or OK to commit?

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: Vnode API cleanup pass 2a
Date: Mon, 13 Jan 2014 16:06:02 +0100 From: J. Hannken-Illjes hann...@eis.cs.tu-bs.de On Jan 13, 2014, at 8:39 AM, David Holland dholland-t...@netbsd.org wrote: On Tue, Jan 07, 2014 at 11:30:40AM +0100, J. Hannken-Illjes wrote: (also, while this is minor I think I'd rather have vop_mkdir_args_v2 and/or vop_mkdir_desc_v2 rather than vop_mkdir2_args and vop_mkdir2_desc. The version doesn't need to be part of the operation name.) Looks very good -- changed vnode_if.sh to create xxx_v2_args. Diff attached. Would you mind committing the versioning logic right away? It has clear value and it would be nice for it to appear in a separate changeset. Done. As is, existing code calling, e.g., VOP_MKDIR will still compile. Is there a reason the versioning doesn't rename it to VOP_MKDIR_V2?
Re: Vnode API cleanup pass 2a
Date: Wed, 15 Jan 2014 16:08:04 + From: Taylor R Campbell campbell+netbsd-tech-k...@mumble.net As is, existing code calling, e.g., VOP_MKDIR will still compile. Is there a reason the versioning doesn't rename it to VOP_MKDIR_V2? For that matter, why new machinery for this versioning stuff at all? Why not just rename the vop from mkdir to mkdir_v2? That would take care of both struct vop_mkdir_v2_args and VOP_MKDIR_V2. Am I missing something?
Re: Vnode API cleanup pass 2a
On Wed, 15 Jan 2014, Taylor R Campbell wrote:

For that matter, why new machinery for this versioning stuff at all? Why not just rename the vop from mkdir to mkdir_v2? That would take care of both struct vop_mkdir_v2_args and VOP_MKDIR_V2. Am I missing something?

That would make the calling code ugly.

--apb (Alan Barrett)
Re: Vnode API cleanup pass 2a
On Wed, Jan 15, 2014 at 05:31:56PM +, Taylor R Campbell wrote: As is, existing code calling, e.g., VOP_MKDIR will still compile. Is there a reason the versioning doesn't rename it to VOP_MKDIR_V2? For that matter, why new machinery for this versioning stuff at all? Why not just rename the vop from mkdir to mkdir_v2? That would take care of both struct vop_mkdir_v2_args and VOP_MKDIR_V2. Am I missing something? Well, one thing is that there aren't expected to be out-of-tree callers of VOP_MKDIR, only out-of-tree callees. It is convenient to be able to break callers too, but I'm ok with doing it some other way temporarily while hacking. -- David A. Holland dholl...@netbsd.org
Re: Vnode API cleanup pass 2a
On Wed, Jan 15, 2014 at 04:31:07PM +0100, J. Hannken-Illjes wrote: I put a diff to http://www.netbsd.org/~hannken/vnode-pass2a-3.diff that changes the vnode creation operations to keep the dvp locked. Any objections or OK to commit? I don't understand where all of the zfs changes arise (without wading into the zfs code) but I'm not going to worry too much about that. Otherwise it looks ok. I'm kind of distressed by how far afield some of the vput calls are from the vop entry points in some of the fses, but I suppose that's the usual level of code quality. :-/ I suppose there might be a problem in one of those cases where leaving the directory locked causes something else wrong to happen. But if so I think I'd rather tackle it head on. (Function calls should be symmetric with respect to locking, all the more so in what's ostensibly a supported API.) Also I'm not sure that if there were something missing I'd necessarily spot it from reading a diff, but that can't be helped. Now somebody just has to fix the rest of the vops that do this... maybe this weekend I can finally deal with the fix for PR 47040 and then I can stop being wedged on vfs work. -- David A. Holland dholl...@netbsd.org
Re: Vnode API cleanup pass 2a
On Jan 14, 2014, at 5:54 AM, David Holland dholland-t...@netbsd.org wrote:

On Mon, Jan 13, 2014 at 04:06:02PM +0100, J. Hannken-Illjes wrote:

[snip]

Even though it looks like we don't need creation ops to return with the directory locked at the moment, it may be better to do this change now so atomic operations like the one seen in the NFS server are possible.

Yeah, that's what I was thinking.

Thus we would end up with: [snip]

Right. It would probably be better to make this a separate commit; if that's a nuisance for you, I can do it...

Give me a day (or maybe two) ...

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: Vnode API cleanup pass 2a
On Jan 13, 2014, at 8:39 AM, David Holland dholland-t...@netbsd.org wrote:

On Tue, Jan 07, 2014 at 11:30:40AM +0100, J. Hannken-Illjes wrote:

(also, while this is minor I think I'd rather have vop_mkdir_args_v2 and/or vop_mkdir_desc_v2 rather than vop_mkdir2_args and vop_mkdir2_desc. The version doesn't need to be part of the operation name.)

Looks very good -- changed vnode_if.sh to create xxx_v2_args. Diff attached.

Would you mind committing the versioning logic right away? It has clear value and it would be nice for it to appear in a separate changeset.

Done.

[snip]

Unlocking and relocking should not itself be a problem since AFAIK there isn't any reason for those locks to be held in the first place; however, I think I'd better go review all the call sites just in case. Plus I think it would be advisable to make this change first, so that in case we do find a site where it matters we can lock the returned object before anyone else can see it:

#
-#% mkdir	dvp	L U U
+#% mkdir	dvp	L L L

Where should it matter? VOP_MKDIR is only used from:

[snip]

sys/nfs/nfs_serv.c:nfsrv_mkdir()
	This extracts the nfs file handle for the new dir and calls VOP_GETATTR. If we don't leave the directory locked, these actions are no longer atomic with the directory creation. I don't know whether this is safe or not. My guess is not. In the absence of clear information it seems prudent to not introduce this race.

RFC 1813 is a bit vague here:

post_op_attr

This structure is used for returning attributes in those operations that are not directly involved with manipulating attributes. One of the principles of this revision of the NFS protocol is to return the real value from the indicated operation and not an error from an incidental operation. The post_op_attr structure was designed to allow the server to recover from errors encountered while getting attributes.

This appears to make returning attributes optional.
However, server implementors are strongly encouraged to make best effort to return attributes whenever possible, even when returning an error.

post_op_fh3

One of the principles of this revision of the NFS protocol is to return the real value from the indicated operation and not an error from an incidental operation. The post_op_fh3 structure was designed to allow the server to recover from errors encountered while constructing a file handle. This is the structure used to return a file handle from the CREATE, MKDIR, SYMLINK, MKNOD, and READDIRPLUS requests. In each case, the client can get the file handle by issuing a LOOKUP request after a successful return from one of the listed operations. Returning the file handle is an optimization so that the client is not forced to immediately issue a LOOKUP request to get the file handle.

The reference implementation from Solaris is not atomic:

	VOP_MKDIR(dvp, vp)
	VOP_GETATTR(vp)
	makefh(vp)
	VN_RELE(vp)
	VN_RELE(dvp)

sys/fs/union/union_subr.c:union_mkshadow()
	...but this doesn't, the caller is trying to create the directory and also bind it into the onionfs layering atomically. Of course, being onionfs, it's already only approximate and adding another race to it may not make things any worse; but it's probably better not to.

Looking closer it makes things better here. Currently unionfs will lock in the wrong order (uppervp -> upperdvp) while it now does it right. If the shadow directory we just created would disappear because another thread removed it outside of unionfs, we would have much bigger problems than a race here.

Even though it looks like we don't need creation ops to return with the directory locked at the moment, it may be better to do this change now so atomic operations like the one seen in the NFS server are possible.
Thus we would end up with:

-#% create	dvp	L U U
+#% create	dvp	L L L
-#% create	vpp	- L -
+#% create	vpp	- U -
-	IN LOCKED=YES WILLPUT struct vnode *dvp;
+	IN LOCKED=YES struct vnode *dvp;

-#% mknod	dvp	L U U
+#% mknod	dvp	L L L
-#% mknod	vpp	- L -
+#% mknod	vpp	- U -
-	IN LOCKED=YES WILLPUT struct vnode *dvp;
+	IN LOCKED=YES struct vnode *dvp;

-#% mkdir	dvp	L U U
+#% mkdir	dvp	L L L
-#% mkdir	vpp	- L -
+#% mkdir	vpp	- U -
-	IN LOCKED=YES WILLPUT struct vnode *dvp;
+	IN LOCKED=YES struct vnode *dvp;

-#% symlink	dvp	L U U
+#% symlink	dvp	L L L
-#% symlink	vpp	- L -
+#% symlink	vpp	- U -
-	IN LOCKED=YES WILLPUT struct vnode *dvp;
+	IN LOCKED=YES struct vnode *dvp;

Please comment soon, I will prepare a new diff then.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: Vnode API cleanup pass 2a
On Mon, Jan 13, 2014 at 04:06:02PM +0100, J. Hannken-Illjes wrote:

[various nfs stuff]

Yeah, I'm not sure what to make of that either.

The reference implementation from Solaris is not atomic:

	VOP_MKDIR(dvp, vp)
	VOP_GETATTR(vp)
	makefh(vp)
	VN_RELE(vp)
	VN_RELE(dvp)

They can't do this kind of thing atomically with their blinkered vfs locking :-) One should get the same file handle even if it isn't atomic and that's probably ok; on the other hand it would be weird if you created a directory mode 755 and the stat info you got back said 700 because a chmod raced with the getattr. On the other hand, it's NFS.

sys/fs/union/union_subr.c:union_mkshadow()
	...but this doesn't, the caller is trying to create the directory and also bind it into the onionfs layering atomically. Of course, being onionfs, it's already only approximate and adding another race to it may not make things any worse; but it's probably better not to.

Looking closer it makes things better here. Currently unionfs will lock in the wrong order (uppervp -> upperdvp) while it now does it right.

Oh lovely, I missed that.

If the shadow directory we just created would disappear because another thread removed it outside of unionfs, we would have much bigger problems than a race here.

Yeah, I think the more likely problem is that it might try to create it twice and have one of them fail unexpectedly. But maybe it's already capable of dealing with that.

Even though it looks like we don't need creation ops to return with the directory locked at the moment, it may be better to do this change now so atomic operations like the one seen in the NFS server are possible.

Yeah, that's what I was thinking.

Thus we would end up with: [snip]

Right. It would probably be better to make this a separate commit; if that's a nuisance for you, I can do it...

--
David A. Holland dholl...@netbsd.org
Re: Vnode API cleanup pass 2a
On Tue, Jan 07, 2014 at 11:32:28AM +0100, J. Hannken-Illjes wrote:

This seems to largely add code to file systems to mess with locking further, rather than reducing that code.

What do you mean by `mess' here? Can you explain how this change helps to clean up those dirty hacks?

This is the first of hopefully two steps. Once lookup returns its result unlocked the hacks can go as all operations adding or looking up layer nodes work on unlocked vnodes.

Are you planning to make lookup return the node unlocked too or something first?

Sure, changing the lookup operation to do the same is the next step.

also, maybe it would be better to tackle lookup first?

Why should the order matter?

Less mess of the kind Taylor was objecting to. Maybe. (Or maybe not, too.)

--
David A. Holland dholl...@netbsd.org
Re: Vnode API cleanup pass 2a
On Tue, Jan 07, 2014 at 11:30:40AM +0100, J. Hannken-Illjes wrote:

(also, while this is minor I think I'd rather have vop_mkdir_args_v2 and/or vop_mkdir_desc_v2 rather than vop_mkdir2_args and vop_mkdir2_desc. The version doesn't need to be part of the operation name.)

Looks very good -- changed vnode_if.sh to create xxx_v2_args. Diff attached.

Would you mind committing the versioning logic right away? It has clear value and it would be nice for it to appear in a separate changeset.

Also, since what you're doing is basically unlocking before return in every implementation, and then relocking after return at every call site, it's possible to do one op at a time; it seems like that would be a good idea in case unexpected problems crop up.

This is not what I'm doing.

It still seems like a good idea to do one op at a time, or at least do one op first.

This is missing any arguments for the second time

...because it's vfs and everything in vfs is delicate? Small incremental changes are safer; if something weird goes wrong, it's easier to deal with the smaller the change was. (If nothing else, it's easier to revert only the part that broke; it's also easier to bisect later.)

and you ignored my arguments after `This is not what I'm doing.'

That's because they aren't really relevant to working incrementally.

Unlocking and relocking should not itself be a problem since AFAIK there isn't any reason for those locks to be held in the first place; however, I think I'd better go review all the call sites just in case. Plus I think it would be advisable to make this change first, so that in case we do find a site where it matters we can lock the returned object before anyone else can see it:

#
-#% mkdir	dvp	L U U
+#% mkdir	dvp	L L L

Where should it matter? VOP_MKDIR is only used from:

In the case of VOP_SYMLINK it definitely matters, as the new symlink isn't filled until after VOP_SYMLINK returns. Since the locking protocol for the containing directory is to unlock it *before* returning, it then becomes possible to see the empty symlink. And that isn't so good.

mkdir is less clear, but:

external/cddl/osnet/dist/uts/common/fs/zfs/zfs_replay.c:zfs_replay_create()
	This doesn't appear to do anything with the new dir (right now it just panics...)

sys/kern/vfs_syscalls.c:do_sys_mkdirat()
	This does nothing and is ok.

sys/nfs/nfs_serv.c:nfsrv_mkdir()
	This extracts the nfs file handle for the new dir and calls VOP_GETATTR. If we don't leave the directory locked, these actions are no longer atomic with the directory creation. I don't know whether this is safe or not. My guess is not. In the absence of clear information it seems prudent to not introduce this race.

sys/fs/union/union_vnops.c:union_mkdir()
	This looks ok.

sys/fs/union/union_subr.c:union_mkshadow()
	...but this doesn't, the caller is trying to create the directory and also bind it into the onionfs layering atomically. Of course, being onionfs, it's already only approximate and adding another race to it may not make things any worse; but it's probably better not to.

sys/rump/librump/rumpvfs/rumpvnode_if.c:RUMP_VOP_MKDIR()
	This shouldn't matter.

I have been putting off answering this because I ought to examine all these, and I haven't managed to do so yet but figured I ought to answer the rest.

It looks to me like it would definitely be better to first fix the VOP_MKDIR interface to not release the directory lock within the call. Certainly for VOP_SYMLINK. Therefore, it's probably best to fix all the creation ops.

--
David A. Holland dholl...@netbsd.org
Re: Vnode API cleanup pass 2a
On Mon, Jan 13, 2014 at 07:39:48AM +, David Holland wrote: In the case of VOP_SYMLINK it definitely matters, as the new symlink isn't filled until after VOP_SYMLINK returns. Since the locking protocol for the containing directory is to unlock it *before* returning, it then becomes possible to see the empty symlink. And that isn't so good. Er. This isn't actually true - I think I was thinking of some other older vfs interface. It looks to me like it would definitely be better to first fix the VOP_MKDIR interface to not release the directory lock within the call. Certainly for VOP_SYMLINK. Therefore, it's probably best to fix all the creation ops. Therefore, this is less critical; but I think the mkdir cases warrant doing it. -- David A. Holland dholl...@netbsd.org
Re: Vnode API cleanup pass 2a
On Jan 7, 2014, at 5:29 AM, David Holland dholland-t...@netbsd.org wrote:

[snip]

Plus I think it would be advisable to make this change first, so that in case we do find a site where it matters we can lock the returned object before anyone else can see it:

#
-#% mkdir	dvp	L U U
+#% mkdir	dvp	L L L

Where should it matter? VOP_MKDIR is only used from:

external/cddl/osnet/dist/uts/common/fs/zfs/zfs_replay.c:zfs_replay_create()
sys/kern/vfs_syscalls.c:do_sys_mkdirat()
sys/nfs/nfs_serv.c:nfsrv_mkdir()
sys/fs/union/union_vnops.c:union_mkdir()
sys/fs/union/union_subr.c:union_mkshadow()
sys/rump/librump/rumpvfs/rumpvnode_if.c:RUMP_VOP_MKDIR()

and I don't see any problems here.

I have been putting off answering this because I ought to examine all these, and I haven't managed to do so yet but figured I ought to answer the rest.

Please do -- plan is to commit next Monday, 2014-01-13

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: Vnode API cleanup pass 2a
On Jan 7, 2014, at 5:29 AM, David Holland dholland-t...@netbsd.org wrote: On Tue, Dec 31, 2013 at 11:34:27AM +0100, J. Hannken-Illjes wrote: The layered file systems hashlists currently have to work on locked vnodes as the vnode operations returning vnodes return them locked. This leads to some very dirty hacks -- see layer_node_find() from sys/miscfs/genfs/layer_subr.c for example. The best solution is to change these vnode operations to return unlocked (and referenced) vnodes. The attached diff implements this change for operations create, mknod, mkdir and symlink. It passes the atf test suite and some file system stress tests I use here. Comments or objections anyone? My first thought is that I really don't think it's a good idea to change important semantic properties (particularly locking) of calls from what's effectively a public API without changing the signatures too so old code won't compile. Ok. Will add an API version argument to vnode_if.src so vop_mkdir { +VERSION 1 IN LOCKED=YES WILLPUT struct vnode *dvp; will rename vop_mkdir_desc to vop_mkdir1_desc and then change all file systems to use the new desc. While on the whole I suppose this is probably a good thing, it feels inelegant. Also, the ops table is typically a long way from the code that actually needs to be changed. Wouldn't it be better to attach the version number to the name of the args structure? Then it will fail to compile in the VOP function that needs attention. That is, int ufs_mkdir(void *v) { struct vop_mkdir_args /* { struct vnode*a_dvp; struct vnode**a_vpp; struct componentname*a_cnp; struct vattr*a_vap; } */ *ap = v; : : } would need to be changed to e.g. int ufs_mkdir(void *v) { struct vop_mkdir_args_v2 /* { struct vnode*a_dvp; struct vnode**a_vpp; struct componentname*a_cnp; struct vattr*a_vap; } */ *ap = v; : : } (also, while this is minor I think I'd rather have vop_mkdir_args_v2 and/or vop_mkdir_desc_v2 rather than vop_mkdir2_args and vop_mkdir2_desc. 
The version doesn't need to be part of the operation name.)

Looks very good -- changed vnode_if.sh to create xxx_v2_args. Diff attached.

Also, since what you're doing is basically unlocking before return in every implementation, and then relocking after return at every call site, it's possible to do one op at a time; it seems like that would be a good idea in case unexpected problems crop up. This is not what I'm doing.

It still seems like a good idea to do one op at a time, or at least do one op first.

This is missing any arguments for the second time, and you ignored my arguments after "This is not what I'm doing."

Unlocking and relocking should not itself be a problem since AFAIK there isn't any reason for those locks to be held in the first place; however, I think I'd better go review all the call sites just in case. Plus I think it would be advisable to make this change first, so that in case we do find a site where it matters we can lock the returned object before anyone else can see it:

     #
    -#% mkdir     dvp     L U U
    +#% mkdir     dvp     L L L

Where should it matter? VOP_MKDIR is only used from:

    external/cddl/osnet/dist/uts/common/fs/zfs/zfs_replay.c:zfs_replay_create()
    sys/kern/vfs_syscalls.c:do_sys_mkdirat()
    sys/nfs/nfs_serv.c:nfsrv_mkdir()
    sys/fs/union/union_vnops.c:union_mkdir()
    sys/fs/union/union_subr.c:union_mkshadow()
    sys/rump/librump/rumpvfs/rumpvnode_if.c:RUMP_VOP_MKDIR()

and I don't see any problems here.

I have been putting off answering this because I ought to examine all these, and I haven't managed to do so yet but figured I ought to answer the rest.

An updated diff is at http://www.netbsd.org/~hannken/vnode-pass2a-2.diff Diffs to vnode_if.{sh,src} attached.

-- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)

vnode-pass2a-new.diff Description: Binary data
Re: Vnode API cleanup pass 2a
On Jan 7, 2014, at 5:30 AM, David Holland dholland-t...@netbsd.org wrote:

On Mon, Dec 30, 2013 at 05:26:14PM +0100, J. Hannken-Illjes wrote: This seems to largely add code to file systems to mess with locking further, rather than reducing that code. What do you mean with "mess" here? Can you explain how this change helps to clean up those dirty hacks? This is the first of hopefully two steps. Once lookup returns its result unlocked, the hacks can go, as all operations adding or looking up layer nodes work on unlocked vnodes. Are you planning to make lookup return the node unlocked too or something first? Sure, changing the lookup operation to do the same is the next step. also, maybe it would be better to tackle lookup first?

Why should the order matter?

-- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: Vnode API cleanup pass 2a
On Mon, Dec 30, 2013 at 05:26:14PM +0100, J. Hannken-Illjes wrote: This seems to largely add code to file systems to mess with locking further, rather than reducing that code. What do you mean with "mess" here? Can you explain how this change helps to clean up those dirty hacks? This is the first of hopefully two steps. Once lookup returns its result unlocked, the hacks can go, as all operations adding or looking up layer nodes work on unlocked vnodes. Are you planning to make lookup return the node unlocked too or something first? Sure, changing the lookup operation to do the same is the next step.

also, maybe it would be better to tackle lookup first?

-- David A. Holland dholl...@netbsd.org
Re: Vnode API cleanup pass 2a
On Tue, Dec 31, 2013 at 11:34:27AM +0100, J. Hannken-Illjes wrote: The layered file systems' hashlists currently have to work on locked vnodes, as the vnode operations returning vnodes return them locked. This leads to some very dirty hacks -- see layer_node_find() from sys/miscfs/genfs/layer_subr.c for example. The best solution is to change these vnode operations to return unlocked (and referenced) vnodes. The attached diff implements this change for the operations create, mknod, mkdir and symlink. It passes the atf test suite and some file system stress tests I use here. Comments or objections anyone?

My first thought is that I really don't think it's a good idea to change important semantic properties (particularly locking) of calls from what's effectively a public API without changing the signatures too, so old code won't compile.

Ok. Will add an API version argument to vnode_if.src, so

    vop_mkdir {
    +       VERSION 1
            IN LOCKED=YES WILLPUT struct vnode *dvp;

will rename vop_mkdir_desc to vop_mkdir1_desc and then change all file systems to use the new desc.

While on the whole I suppose this is probably a good thing, it feels inelegant. Also, the ops table is typically a long way from the code that actually needs to be changed. Wouldn't it be better to attach the version number to the name of the args structure? Then it will fail to compile in the VOP function that needs attention. That is,

    int
    ufs_mkdir(void *v)
    {
            struct vop_mkdir_args /* {
                    struct vnode *a_dvp;
                    struct vnode **a_vpp;
                    struct componentname *a_cnp;
                    struct vattr *a_vap;
            } */ *ap = v;
            :
            :
    }

would need to be changed to e.g.

    int
    ufs_mkdir(void *v)
    {
            struct vop_mkdir_args_v2 /* {
                    struct vnode *a_dvp;
                    struct vnode **a_vpp;
                    struct componentname *a_cnp;
                    struct vattr *a_vap;
            } */ *ap = v;
            :
            :
    }

(also, while this is minor, I think I'd rather have vop_mkdir_args_v2 and/or vop_mkdir_desc_v2 rather than vop_mkdir2_args and vop_mkdir2_desc. The version doesn't need to be part of the operation name.)
Also, since what you're doing is basically unlocking before return in every implementation, and then relocking after return at every call site, it's possible to do one op at a time; it seems like that would be a good idea in case unexpected problems crop up. This is not what I'm doing.

It still seems like a good idea to do one op at a time, or at least do one op first.

Unlocking and relocking should not itself be a problem since AFAIK there isn't any reason for those locks to be held in the first place; however, I think I'd better go review all the call sites just in case. Plus I think it would be advisable to make this change first, so that in case we do find a site where it matters we can lock the returned object before anyone else can see it:

     #
    -#% mkdir     dvp     L U U
    +#% mkdir     dvp     L L L

Where should it matter? VOP_MKDIR is only used from:

    external/cddl/osnet/dist/uts/common/fs/zfs/zfs_replay.c:zfs_replay_create()
    sys/kern/vfs_syscalls.c:do_sys_mkdirat()
    sys/nfs/nfs_serv.c:nfsrv_mkdir()
    sys/fs/union/union_vnops.c:union_mkdir()
    sys/fs/union/union_subr.c:union_mkshadow()
    sys/rump/librump/rumpvfs/rumpvnode_if.c:RUMP_VOP_MKDIR()

and I don't see any problems here.

I have been putting off answering this because I ought to examine all these, and I haven't managed to do so yet but figured I ought to answer the rest.

-- David A. Holland dholl...@netbsd.org
Re: Vnode API cleanup pass 2a
On Dec 31, 2013, at 11:34 AM, J. Hannken-Illjes hann...@eis.cs.tu-bs.de wrote:

On Dec 30, 2013, at 10:58 PM, David Holland dholland-t...@netbsd.org wrote:

On Mon, Dec 30, 2013 at 11:35:48AM +0100, J. Hannken-Illjes wrote: The layered file systems' hashlists currently have to work on locked vnodes, as the vnode operations returning vnodes return them locked. This leads to some very dirty hacks -- see layer_node_find() from sys/miscfs/genfs/layer_subr.c for example. The best solution is to change these vnode operations to return unlocked (and referenced) vnodes. The attached diff implements this change for the operations create, mknod, mkdir and symlink. It passes the atf test suite and some file system stress tests I use here. Comments or objections anyone?

My first thought is that I really don't think it's a good idea to change important semantic properties (particularly locking) of calls from what's effectively a public API without changing the signatures too, so old code won't compile.

Ok. Will add an API version argument to vnode_if.src, so

    vop_mkdir {
    +       VERSION 1
            IN LOCKED=YES WILLPUT struct vnode *dvp;

will rename vop_mkdir_desc to vop_mkdir1_desc and then change all file systems to use the new desc.

Also, since what you're doing is basically unlocking before return in every implementation, and then relocking after return at every call site, it's possible to do one op at a time; it seems like that would be a good idea in case unexpected problems crop up.

This is not what I'm doing. Many file systems use one internal function to create a new inode, and where possible I change this function to return the node unlocked. Relocking is done only for the nfs server, vn_open() and, temporarily, for layered file systems. The operations from vfs_syscalls.c all change vput() to vrele().
Unlocking and relocking should not itself be a problem since AFAIK there isn't any reason for those locks to be held in the first place; however, I think I'd better go review all the call sites just in case. Plus I think it would be advisable to make this change first, so that in case we do find a site where it matters we can lock the returned object before anyone else can see it:

     #
    -#% mkdir     dvp     L U U
    +#% mkdir     dvp     L L L

Where should it matter? VOP_MKDIR is only used from:

    external/cddl/osnet/dist/uts/common/fs/zfs/zfs_replay.c:zfs_replay_create()
    sys/kern/vfs_syscalls.c:do_sys_mkdirat()
    sys/nfs/nfs_serv.c:nfsrv_mkdir()
    sys/fs/union/union_vnops.c:union_mkdir()
    sys/fs/union/union_subr.c:union_mkshadow()
    sys/rump/librump/rumpvfs/rumpvnode_if.c:RUMP_VOP_MKDIR()

and I don't see any problems here.

Put an updated diff to http://www.netbsd.org/~hannken/vnode-pass2a-1.diff This diff adds the API version; diffs to vnode_if.{sh,src} attached.

Further comments or objections anyone?

-- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)

vnode-pass2a-new.diff Description: Binary data
Re: Vnode API cleanup pass 2a
On Dec 30, 2013, at 10:58 PM, David Holland dholland-t...@netbsd.org wrote:

On Mon, Dec 30, 2013 at 11:35:48AM +0100, J. Hannken-Illjes wrote: The layered file systems' hashlists currently have to work on locked vnodes, as the vnode operations returning vnodes return them locked. This leads to some very dirty hacks -- see layer_node_find() from sys/miscfs/genfs/layer_subr.c for example. The best solution is to change these vnode operations to return unlocked (and referenced) vnodes. The attached diff implements this change for the operations create, mknod, mkdir and symlink. It passes the atf test suite and some file system stress tests I use here. Comments or objections anyone?

My first thought is that I really don't think it's a good idea to change important semantic properties (particularly locking) of calls from what's effectively a public API without changing the signatures too, so old code won't compile.

Ok. Will add an API version argument to vnode_if.src, so

    vop_mkdir {
    +       VERSION 1
            IN LOCKED=YES WILLPUT struct vnode *dvp;

will rename vop_mkdir_desc to vop_mkdir1_desc and then change all file systems to use the new desc.

Also, since what you're doing is basically unlocking before return in every implementation, and then relocking after return at every call site, it's possible to do one op at a time; it seems like that would be a good idea in case unexpected problems crop up.

This is not what I'm doing. Many file systems use one internal function to create a new inode, and where possible I change this function to return the node unlocked. Relocking is done only for the nfs server, vn_open() and, temporarily, for layered file systems. The operations from vfs_syscalls.c all change vput() to vrele().

Unlocking and relocking should not itself be a problem since AFAIK there isn't any reason for those locks to be held in the first place; however, I think I'd better go review all the call sites just in case.
Plus I think it would be advisable to make this change first, so that in case we do find a site where it matters we can lock the returned object before anyone else can see it:

     #
    -#% mkdir     dvp     L U U
    +#% mkdir     dvp     L L L

Where should it matter? VOP_MKDIR is only used from:

    external/cddl/osnet/dist/uts/common/fs/zfs/zfs_replay.c:zfs_replay_create()
    sys/kern/vfs_syscalls.c:do_sys_mkdirat()
    sys/nfs/nfs_serv.c:nfsrv_mkdir()
    sys/fs/union/union_vnops.c:union_mkdir()
    sys/fs/union/union_subr.c:union_mkshadow()
    sys/rump/librump/rumpvfs/rumpvnode_if.c:RUMP_VOP_MKDIR()

and I don't see any problems here.

Put an updated diff to http://www.netbsd.org/~hannken/vnode-pass2a-1.diff This diff adds the API version, diffs to vnode_if.{sh,src} attached.

Further comments or objections anyone?

-- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)

vnode-pass2a-new.diff Description: Binary data
Vnode API cleanup pass 2a
The layered file systems' hashlists currently have to work on locked vnodes, as the vnode operations returning vnodes return them locked. This leads to some very dirty hacks -- see layer_node_find() from sys/miscfs/genfs/layer_subr.c for example. The best solution is to change these vnode operations to return unlocked (and referenced) vnodes.

The attached diff implements this change for the operations create, mknod, mkdir and symlink. It passes the atf test suite and some file system stress tests I use here.

Comments or objections anyone?

-- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)

vnode-pass2a.diff Description: Binary data
Re: Vnode API cleanup pass 2a
Date: Mon, 30 Dec 2013 11:35:48 +0100 From: J. Hannken-Illjes hann...@eis.cs.tu-bs.de

The layered file systems' hashlists currently have to work on locked vnodes, as the vnode operations returning vnodes return them locked. This leads to some very dirty hacks -- see layer_node_find() from sys/miscfs/genfs/layer_subr.c for example. The best solution is to change these vnode operations to return unlocked (and referenced) vnodes. The attached diff implements this change for the operations create, mknod, mkdir and symlink. It passes the atf test suite and some file system stress tests I use here.

This seems to largely add code to file systems to mess with locking further, rather than reducing that code. Can you explain how this change helps to clean up those dirty hacks? Are you planning to make lookup return the node unlocked too or something first?
Re: Vnode API cleanup pass 2a
On Dec 30, 2013, at 4:12 PM, Taylor R Campbell campbell+netbsd-tech-k...@mumble.net wrote:

Date: Mon, 30 Dec 2013 11:35:48 +0100 From: J. Hannken-Illjes hann...@eis.cs.tu-bs.de

The layered file systems' hashlists currently have to work on locked vnodes, as the vnode operations returning vnodes return them locked. This leads to some very dirty hacks -- see layer_node_find() from sys/miscfs/genfs/layer_subr.c for example. The best solution is to change these vnode operations to return unlocked (and referenced) vnodes. The attached diff implements this change for the operations create, mknod, mkdir and symlink. It passes the atf test suite and some file system stress tests I use here.

This seems to largely add code to file systems to mess with locking further, rather than reducing that code.

What do you mean with "mess" here?

Can you explain how this change helps to clean up those dirty hacks?

This is the first of hopefully two steps. Once lookup returns its result unlocked, the hacks can go, as all operations adding or looking up layer nodes work on unlocked vnodes.

Are you planning to make lookup return the node unlocked too or something first?

Sure, changing the lookup operation to do the same is the next step.

-- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: Vnode API cleanup pass 2a
On Mon, Dec 30, 2013 at 11:35:48AM +0100, J. Hannken-Illjes wrote: The layered file systems' hashlists currently have to work on locked vnodes, as the vnode operations returning vnodes return them locked. This leads to some very dirty hacks -- see layer_node_find() from sys/miscfs/genfs/layer_subr.c for example. The best solution is to change these vnode operations to return unlocked (and referenced) vnodes. The attached diff implements this change for the operations create, mknod, mkdir and symlink. It passes the atf test suite and some file system stress tests I use here. Comments or objections anyone?

My first thought is that I really don't think it's a good idea to change important semantic properties (particularly locking) of calls from what's effectively a public API without changing the signatures too, so old code won't compile.

Also, since what you're doing is basically unlocking before return in every implementation, and then relocking after return at every call site, it's possible to do one op at a time; it seems like that would be a good idea in case unexpected problems crop up.

Unlocking and relocking should not itself be a problem since AFAIK there isn't any reason for those locks to be held in the first place; however, I think I'd better go review all the call sites just in case.

Plus I think it would be advisable to make this change first, so that in case we do find a site where it matters we can lock the returned object before anyone else can see it:

     #
    -#% mkdir     dvp     L U U
    +#% mkdir     dvp     L L L
     #% mkdir     vpp     - L -
     #% mkdir     vpp     - L -
     #

This is not, unfortunately, entirely trivial; I've been putting it off to combine with namei/componentname-related changes.

-- David A. Holland dholl...@netbsd.org