Move the vnode lock into file systems
The vnode lock operations currently work on a rw lock located inside the vnode. I propose to move this lock into the file system node. This place is more logical, as we lock a file system node and not a vnode. This becomes clear if we think of a file system where one file system node is attached to more than one vnode; ptyfs, which allows multiple mounts, is such a candidate.

A diff implementing this for ufs, lfs and ext2fs is attached. Comments or objections anyone?

--
Juergen Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)

Index: sys/ufs/ufs/inode.h
===================================================================
RCS file: /cvsroot/src/sys/ufs/ufs/inode.h,v
retrieving revision 1.56
diff -p -u -4 -r1.56 inode.h
--- sys/ufs/ufs/inode.h	22 Feb 2009 20:28:07 -0000	1.56
+++ sys/ufs/ufs/inode.h	25 Jun 2010 20:43:24 -0000
@@ -91,9 +91,9 @@ struct inode {
 #define	i_fs	inode_u.fs
 #define	i_lfs	inode_u.lfs
 #define	i_e2fs	inode_u.e2fs
-	void	*i_unused1;	/* Unused. */
+	krwlock_t i_lock;	/* Inode lock. */
 	struct dquot *i_dquot[MAXQUOTAS];	/* Dquot structures. */
 	u_quad_t i_modrev;	/* Revision level for NFS lease. */
 	struct lockf *i_lockf;	/* Head of byte-level lock list. */
Index: sys/ufs/ufs/ufs_extern.h
===================================================================
RCS file: /cvsroot/src/sys/ufs/ufs/ufs_extern.h,v
retrieving revision 1.62
diff -p -u -4 -r1.62 ufs_extern.h
--- sys/ufs/ufs/ufs_extern.h	13 Sep 2009 05:17:37 -0000	1.62
+++ sys/ufs/ufs/ufs_extern.h	25 Jun 2010 20:43:24 -0000
@@ -67,11 +67,11 @@
 int	ufs_create(void *);
 int	ufs_getattr(void *);
 int	ufs_inactive(void *);
 #define	ufs_fcntl	genfs_fcntl
 #define	ufs_ioctl	genfs_enoioctl
-#define	ufs_islocked	genfs_islocked
+int	ufs_islocked(void *);
 int	ufs_link(void *);
-#define	ufs_lock	genfs_lock
+int	ufs_lock(void *);
 int	ufs_lookup(void *);
 int	ufs_mkdir(void *);
 int	ufs_mknod(void *);
 #define	ufs_mmap	genfs_mmap
@@ -88,9 +88,9 @@
 int	ufs_rmdir(void *);
 #define	ufs_poll	genfs_poll
 int	ufs_setattr(void *);
 int	ufs_strategy(void *);
 int	ufs_symlink(void *);
-#define	ufs_unlock	genfs_unlock
+int	ufs_unlock(void *);
 int	ufs_whiteout(void *);
 int	ufsspec_close(void *);
 int	ufsspec_read(void *);
 int	ufsspec_write(void *);
Index: sys/ufs/ufs/ufs_ihash.c
===================================================================
RCS file: /cvsroot/src/sys/ufs/ufs/ufs_ihash.c,v
retrieving revision 1.28
diff -p -u -4 -r1.28 ufs_ihash.c
--- sys/ufs/ufs/ufs_ihash.c	5 Nov 2009 08:18:02 -0000	1.28
+++ sys/ufs/ufs/ufs_ihash.c	25 Jun 2010 20:43:24 -0000
@@ -170,9 +170,10 @@ ufs_ihashins(struct inode *ip)
 	KASSERT(mutex_owned(&ufs_hashlock));

 	/* lock the inode, then put it on the appropriate hash list */
-	vlockmgr(&ip->i_vnode->v_lock, LK_EXCLUSIVE);
+	rw_init(&ip->i_lock);
+	rw_enter(&ip->i_lock, RW_WRITER);

 	mutex_enter(&ufs_ihash_lock);
 	ipp = &ihashtbl[INOHASH(ip->i_dev, ip->i_number)];
 	LIST_INSERT_HEAD(ipp, ip, i_hash);
@@ -187,5 +188,6 @@ ufs_ihashrem(struct inode *ip)
 {
 	mutex_enter(&ufs_ihash_lock);
 	LIST_REMOVE(ip, i_hash);
 	mutex_exit(&ufs_ihash_lock);
+	rw_destroy(&ip->i_lock);
 }
Index: sys/ufs/ufs/ufs_vnops.c
===================================================================
RCS file: /cvsroot/src/sys/ufs/ufs/ufs_vnops.c,v
retrieving revision 1.183
diff -p -u -4 -r1.183 ufs_vnops.c
--- sys/ufs/ufs/ufs_vnops.c	24 Jun 2010 13:03:20 -0000	1.183
+++ sys/ufs/ufs/ufs_vnops.c	25 Jun 2010 20:43:25 -0000
@@ -2346,4 +2346,66 @@ ufs_gop_markupdate(struct vnode *vp, int
 		ip->i_flag |= mask;
 	}
 }
+
+/*
+ * Lock the node.
+ */
+int
+ufs_lock(void *v)
+{
+	struct vop_lock_args /* {
+		struct vnode *a_vp;
+		int a_flags;
+	} */ *ap = v;
+	struct inode *ip = VTOI(ap->a_vp);
+	int flags = ap->a_flags;
+	krw_t op;
+
+	KASSERT((flags & ~(LK_EXCLUSIVE | LK_SHARED | LK_NOWAIT)) == 0);
+
+	op = ((flags & LK_EXCLUSIVE) != 0 ? RW_WRITER : RW_READER);
+
+	if ((flags & LK_NOWAIT) != 0)
+		return (rw_tryenter(&ip->i_lock, op) ? 0 : EBUSY);
+
+	rw_enter(&ip->i_lock, op);
+
+	return 0;
+}
+
+/*
+ * Unlock the node.
+ */
+int
+ufs_unlock(void *v)
+{
+	struct vop_unlock_args /* {
+		struct vnode *a_vp;
+	} */ *ap = v;
+	struct inode *ip = VTOI(ap->a_vp);
+
+	rw_exit(&ip->i_lock);
+
+	return 0;
+}
+
+/*
+ * Return whether or not the node is locked.
+ */
+int
+ufs_islocked(void *v)
+{
+	struct vop_islocked_args /* {
+		struct vnode *a_vp;
+	} */ *ap = v;
+	struct inode *ip = VTOI(ap->a_vp);
+
+	if (rw_write_held(&ip->i_lock))
Re: why not remove AF_LOCAL sockets on last close?
> Can anyone tell me why, exactly, we shouldn't remove bound AF_LOCAL
> sockets from the filesystem on last close?

Well, I wouldn't expect it to be easy; it means that every AF_LOCAL socket needs to keep track of its (directory, name) pair. This leaves open the question of what to do if it has multiple links when whatever point you consider "last close" arrives. For most purposes it would probably be good enough to do something like "if the originally-bound name in its original directory still refers to the same socket, unlink it; ignore other issues".

> If you want to do that, wouldn't it be easier to just go the Linux
> route and move them into a separate (virtual) namespace completely?

Easier? Maybe. But you'd lose the (currently almost free) leveraging of the other attributes of overloading the filesystem namespace, such as permissions checking.

> Linux does that? I ran this test program on a Linux system and got a
> socket in the filesystem and the same results.

I haven't tried it myself, but I've been told that Linux does that *sometimes* - specifically, if the pathname begins with a NUL but the address length indicates that there's more pathname following it. I don't know whether there's anything like a hierarchical filesystem namespace there or whether they're just opaque octet-string blobs.

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTML		mo...@rodents-montreal.org
/ \ Email!	     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B
Re: why not remove AF_LOCAL sockets on last close?
> But I agree that if leaving the sockets around permits no interesting
> feature whatsoever (i.e. it doesn't even serve for SO_REUSEADDR),

I've been trying to think of any such feature since this discussion started. So far I've failed.

> it very well could be a design or implementation bug,

I suspect it is actually not so much a design or implementation bug as a historical accident: I suspect the initial implementation made them a new inode type because it was a cheap and easy way to get a namespace for free, and didn't bother with the impedance mismatch between filesystem semantics and (other-AF) socket semantics because it was just a quick experimental hack. Then, as often happens, the quick experimental hack persisted long beyond its initial experimental phase, with nobody fixing the semantics because everybody was used to them and software was written to deal.

As I mentioned in another message a few minutes ago, I think this is not easy to do fully right - indeed, it's not totally clear what "right" is in all cases - but I do think it is easy to come close enough to be useful.

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTML		mo...@rodents-montreal.org
/ \ Email!	     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B
Re: why not remove AF_LOCAL sockets on last close?
>>> A reason might be that every other system behaves the same way and
>>> being different will just lead to non-portable code.
>> Non-portable *how*? What exactly would happen?
> I don't know, and if you've got an argument that code written for
> either behavior will be ok both places I don't have a problem with it.
> The only thing I can think of is that code that does an explicit
> unlink and checks for error on the unlink may complain, which is
> pretty mild.

Well, code that expects the automatic unlink is liable to break unexpectedly (and probably cryptically) on traditional systems.

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTML		mo...@rodents-montreal.org
/ \ Email!	     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B
kqueue clarification
OK, this is from the kqueue man page:

    kevent() returns the number of events placed in the eventlist,
    up to the value given by nevents

What I find confusing is that kevent() returns the int data type, while the length of the eventlist given by nevents is a size_t and can therefore exceed INT_MAX. Suppose you have more than INT_MAX events in the kqueue; what are the semantics of kevent()?

1. kevent() returns up to INT_MAX events, and advances an internal pointer to the next segment of up to INT_MAX events, or

2. kevent() overflows int and returns a negative number.

This also raises another question: is there some sort of internal limit on the total number of events supported by kqueue, or can it be as big as your RAM allows?
Re: kqueue clarification
On Sat, 26 Jun 2010 14:04:18 +0100, Sad Clouds <cryintotheblue...@googlemail.com> wrote:

> What I find confusing is that kevent() returns int data type, however
> the length of eventlist given by nevents can be up to size_t. Suppose
> you have more than INT_MAX events in kqueue, what are the semantics
> of kevent()?

OK, I guess unless the Unix file descriptor type (which is a signed integer) changes, kqueue will be limited by that.
Re: Move the vnode lock into file systems
On Sat, Jun 26, 2010 at 10:39:27AM +0200, Juergen Hannken-Illjes wrote:

> The vnode lock operations currently work on a rw lock located inside
> the vnode. I propose to move this lock into the file system node.
> This place is more logical as we lock a file system node and not a
> vnode. This becomes clear if we think of a file system where one file
> system node is attached to more than one vnode. Ptyfs allowing
> multiple mounts is such a candidate.

I'm not convinced that sharing locks at the VFS level is a good idea for such cases. While ptyfs specifically is pretty limited and unlikely to get into trouble, something more complex easily could.

I don't think we ought to encourage this until we have a clear plan for rebind mounts and the resulting namespace-handling issues. (Since in a sense that's the general case of multiply-mounted ptyfs.) Since I'm pretty sure that a reasonable architecture for that will lead to sharing vnodes between the multiple mountpoints, and manifesting some kind of virtual name objects akin to Linux's dentries to keep track of which mountpoint is which, I don't see that it's necessary or desirable to specialize the locking...

Do you have another use case in mind?

(In the absence of some clear benefits I don't think it's a particularly good idea to paste a dozen or two copies of genfs_lock everywhere. But folding vlockmgr() into genfs_lock and genfs_unlock seems like a fine idea.)

--
David A. Holland
dholl...@netbsd.org