Move the vnode lock into file systems

2010-06-26 Thread Juergen Hannken-Illjes
The vnode lock operations currently work on a rw lock located inside the
vnode.  I propose to move this lock into the file system node.

This place is more logical as we lock a file system node and not a vnode.
This becomes clear if we think of a file system where one file system node
is attached to more than one vnode.  Ptyfs allowing multiple mounts is such
a candidate.

A diff implementing this for ufs, lfs and ext2fs is attached.

Comments or objections anyone?

-- 
Juergen Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)

Index: sys/ufs/ufs/inode.h
===================================================================
RCS file: /cvsroot/src/sys/ufs/ufs/inode.h,v
retrieving revision 1.56
diff -p -u -4 -r1.56 inode.h
--- sys/ufs/ufs/inode.h 22 Feb 2009 20:28:07 -  1.56
+++ sys/ufs/ufs/inode.h 25 Jun 2010 20:43:24 -
@@ -91,9 +91,9 @@ struct inode {
 #define	i_fs		inode_u.fs
 #define	i_lfs		inode_u.lfs
 #define	i_e2fs		inode_u.e2fs
 
-	void	*i_unused1;	/* Unused. */
+	krwlock_t i_lock;	/* Inode lock. */
 	struct	dquot *i_dquot[MAXQUOTAS]; /* Dquot structures. */
 	u_quad_t i_modrev;	/* Revision level for NFS lease. */
 	struct	lockf *i_lockf;	/* Head of byte-level lock list. */
 
Index: sys/ufs/ufs/ufs_extern.h
===================================================================
RCS file: /cvsroot/src/sys/ufs/ufs/ufs_extern.h,v
retrieving revision 1.62
diff -p -u -4 -r1.62 ufs_extern.h
--- sys/ufs/ufs/ufs_extern.h13 Sep 2009 05:17:37 -  1.62
+++ sys/ufs/ufs/ufs_extern.h25 Jun 2010 20:43:24 -
@@ -67,11 +67,11 @@ int	ufs_create(void *);
 int	ufs_getattr(void *);
 int	ufs_inactive(void *);
 #define	ufs_fcntl	genfs_fcntl
 #define	ufs_ioctl	genfs_enoioctl
-#define	ufs_islocked	genfs_islocked
+int	ufs_islocked(void *);
 int	ufs_link(void *);
-#define	ufs_lock	genfs_lock
+int	ufs_lock(void *);
 int	ufs_lookup(void *);
 int	ufs_mkdir(void *);
 int	ufs_mknod(void *);
 #define	ufs_mmap	genfs_mmap
@@ -88,9 +88,9 @@ int	ufs_rmdir(void *);
 #define	ufs_poll	genfs_poll
 int	ufs_setattr(void *);
 int	ufs_strategy(void *);
 int	ufs_symlink(void *);
-#define	ufs_unlock	genfs_unlock
+int	ufs_unlock(void *);
 int	ufs_whiteout(void *);
 int	ufsspec_close(void *);
 int	ufsspec_read(void *);
 int	ufsspec_write(void *);
Index: sys/ufs/ufs/ufs_ihash.c
===================================================================
RCS file: /cvsroot/src/sys/ufs/ufs/ufs_ihash.c,v
retrieving revision 1.28
diff -p -u -4 -r1.28 ufs_ihash.c
--- sys/ufs/ufs/ufs_ihash.c 5 Nov 2009 08:18:02 -   1.28
+++ sys/ufs/ufs/ufs_ihash.c 25 Jun 2010 20:43:24 -
@@ -170,9 +170,10 @@ ufs_ihashins(struct inode *ip)
 
 	KASSERT(mutex_owned(&ufs_hashlock));
 
 	/* lock the inode, then put it on the appropriate hash list */
-	vlockmgr(&ip->i_vnode->v_lock, LK_EXCLUSIVE);
+	rw_init(&ip->i_lock);
+	rw_enter(&ip->i_lock, RW_WRITER);
 
 	mutex_enter(&ufs_ihash_lock);
 	ipp = &ihashtbl[INOHASH(ip->i_dev, ip->i_number)];
 	LIST_INSERT_HEAD(ipp, ip, i_hash);
@@ -187,5 +188,6 @@ ufs_ihashrem(struct inode *ip)
 {
 	mutex_enter(&ufs_ihash_lock);
 	LIST_REMOVE(ip, i_hash);
 	mutex_exit(&ufs_ihash_lock);
+	rw_destroy(&ip->i_lock);
 }
Index: sys/ufs/ufs/ufs_vnops.c
===================================================================
RCS file: /cvsroot/src/sys/ufs/ufs/ufs_vnops.c,v
retrieving revision 1.183
diff -p -u -4 -r1.183 ufs_vnops.c
--- sys/ufs/ufs/ufs_vnops.c 24 Jun 2010 13:03:20 -  1.183
+++ sys/ufs/ufs/ufs_vnops.c 25 Jun 2010 20:43:25 -
@@ -2346,4 +2346,66 @@ ufs_gop_markupdate(struct vnode *vp, int
 
 		ip->i_flag |= mask;
 	}
 }
+
+/*
+ * Lock the node.
+ */
+int
+ufs_lock(void *v)
+{
+	struct vop_lock_args /* {
+		struct vnode *a_vp;
+		int a_flags;
+	} */ *ap = v;
+	struct inode *ip = VTOI(ap->a_vp);
+	int flags = ap->a_flags;
+	krw_t op;
+
+	KASSERT((flags & ~(LK_EXCLUSIVE | LK_SHARED | LK_NOWAIT)) == 0);
+
+	op = ((flags & LK_EXCLUSIVE) != 0 ? RW_WRITER : RW_READER);
+
+	if ((flags & LK_NOWAIT) != 0)
+		return (rw_tryenter(&ip->i_lock, op) ? 0 : EBUSY);
+
+	rw_enter(&ip->i_lock, op);
+
+	return 0;
+}
+
+/*
+ * Unlock the node.
+ */
+int
+ufs_unlock(void *v)
+{
+	struct vop_unlock_args /* {
+		struct vnode *a_vp;
+	} */ *ap = v;
+	struct inode *ip = VTOI(ap->a_vp);
+
+	rw_exit(&ip->i_lock);
+
+   return 0;
+}
+
+/*
+ * Return whether or not the node is locked.
+ */
+int
+ufs_islocked(void *v)
+{
+	struct vop_islocked_args /* {
+		struct vnode *a_vp;
+	} */ *ap = v;
+	struct inode *ip = VTOI(ap->a_vp);
+
+	if (rw_write_held(&ip->i_lock))
+		return LK_EXCLUSIVE;
+
+	if (rw_read_held(&ip->i_lock))
+		return LK_SHARED;
+
+	return 0;
+}

Re: why not remove AF_LOCAL sockets on last close?

2010-06-26 Thread der Mouse
> Can anyone tell me why, exactly, we shouldn't remove bound AF_LOCAL
> sockets from the filesystem on last close?

Well, I wouldn't expect it to be easy; it means that every AF_LOCAL
socket needs to keep track of its <directory,name> pair.  This leaves
open the question of what to do if it has multiple links when whatever
point you consider "last close" arrives.

For most purposes it would probably be good enough to do something like
"if the originally-bound name in its original directory still refers to
the same socket, unlink it", and ignore other issues.
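
In user-level terms that check might look something like the sketch
below; the helper names are made up for illustration, and a kernel
implementation would do the equivalent dev/ino comparison at last
close rather than from the application:

#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/un.h>
#include <string.h>
#include <unistd.h>

static struct stat bound_st;	/* dev/ino recorded right after bind */

int
bind_local(int s, const char *path)
{
	struct sockaddr_un sun;

	memset(&sun, 0, sizeof(sun));
	sun.sun_family = AF_LOCAL;
	strncpy(sun.sun_path, path, sizeof(sun.sun_path) - 1);
	if (bind(s, (struct sockaddr *)&sun, sizeof(sun)) == -1)
		return -1;
	/* remember which filesystem object the bind created */
	return stat(path, &bound_st);
}

void
cleanup_local(const char *path)
{
	struct stat st;

	/* unlink only if the name still refers to the socket we bound */
	if (stat(path, &st) == 0 &&
	    st.st_dev == bound_st.st_dev && st.st_ino == bound_st.st_ino)
		(void)unlink(path);
}

(Done from user space this is of course racy between the stat and the
unlink; a kernel-side check at last close wouldn't have that window.)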

> If you want to do that, wouldn't it be easier to just go the Linux
> route and move them into a separate (virtual) namespace completely?

Easier?  Maybe.  But you'd lose the (currently almost free) leveraging
of the other attributes of overloading the filesystem namespace, such
as permissions checking.

> Linux does that?  I ran this test program on a Linux system and got a
> socket in the filesystem and the same results.

I haven't tried it myself, but I've been told that Linux does that
*sometimes* - specifically, if the pathname begins with a NUL but the
address length indicates that there's more pathname following it.

I don't know whether there's anything like a hierarchical filesystem
namespace there or whether they're just opaque octet-string blobs.
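
For reference, the sketch below shows what that special case looks
like from the caller's side, as far as I understand it; the name
"example" is arbitrary, and the leading NUL plus the longer address
length is what selects the abstract namespace instead of the
filesystem:

#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

int
bind_abstract(int s)
{
	struct sockaddr_un sun;
	const char name[] = "example";	/* arbitrary abstract-namespace name */
	socklen_t len;

	memset(&sun, 0, sizeof(sun));
	sun.sun_family = AF_LOCAL;
	/* sun_path[0] stays '\0'; the name follows the NUL byte */
	memcpy(sun.sun_path + 1, name, sizeof(name) - 1);
	len = offsetof(struct sockaddr_un, sun_path) + 1 + sizeof(name) - 1;
	return bind(s, (struct sockaddr *)&sun, len);
}

Nothing appears in the filesystem for such a socket, and the name goes
away automatically when the last descriptor referring to it is closed.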

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTML	mo...@rodents-montreal.org
/ \ Email!   7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


Re: why not remove AF_LOCAL sockets on last close?

2010-06-26 Thread der Mouse
> But I agree that if leaving the sockets around permits no interesting
> feature whatsoever (i.e. it doesn't even serve for SO_REUSEADDR),

I've been trying to think of any such feature since this discussion
started.  So far I've failed.

> it very well could be a design or implementation bug,

I suspect it is actually not so much a design or implementation bug as
it is a historical accident: I suspect the initial implementation made
them a new inode type because it was a cheap and easy way to get a
namespace for free, and didn't bother with the impedance mismatch
between filesystem semantics and (other-AF) socket semantics because it
was just a quick experimental hack.

Then, as often happens, the quick experimental hack persisted long
beyond its initial experimental phase, with nobody fixing the semantics
because everybody was used to them and software was written to deal.

As I mentioned in another message a few minutes ago, I think this is
not easy to do fully right - indeed, it's not totally clear what
"right" is in all cases - but I do think it is easy to come close
enough to be useful.

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTML	mo...@rodents-montreal.org
/ \ Email!   7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


Re: why not remove AF_LOCAL sockets on last close?

2010-06-26 Thread der Mouse
>>> A reason might be that every other system behaves the same way and
>>> being different will just lead to non-portable code.
>> Non-portable *how*?  What exactly would happen?
> I don't know, and if you've got an argument that code written for
> either behavior will be ok both places I don't have a problem with
> it.  The only thing I can think of is that code that does an explicit
> unlink and checks for error on the unlink may complain, which is
> pretty mild.

Well, code that expects the automatic unlink is liable to break
unexpectedly (and probably cryptically) on traditional systems.

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTML	mo...@rodents-montreal.org
/ \ Email!   7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


kqueue clarification

2010-06-26 Thread Sad Clouds
OK, this is from the kqueue man page:

"kevent() returns the number of events placed in the eventlist, up to
the value given by nevents"

What I find confusing is that kevent() returns int data type, however
the length of eventlist given by nevents can be up to size_t.

Suppose you have more than INT_MAX events in kqueue, what are the
semantics of kevent()?

1. kevent() returns up to INT_MAX events, and advances an internal
pointer to the next segment of up to INT_MAX events.

or

2. kevent() overflows int and returns a negative number.

This also raises another question: is there some sort of internal limit
on the total number of events supported by kqueue, or can it be as big
as your RAM allows?
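
In practice the question may be moot: callers typically pass a fixed,
int-sized eventlist and call kevent() in a loop, so the return value
never gets anywhere near INT_MAX.  A minimal sketch of that pattern
(the batch size and the empty changelist are arbitrary choices):

#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>

#define NEVENTS	64			/* well under INT_MAX */

int
drain_events(int kq)
{
	struct kevent ev[NEVENTS];
	int i, n;

	for (;;) {
		/* block until up to NEVENTS events are available */
		n = kevent(kq, NULL, 0, ev, NEVENTS, NULL);
		if (n <= 0)
			return n;	/* -1 on error; 0 only with a timeout */
		for (i = 0; i < n; i++) {
			/* handle ev[i] here */
		}
	}
}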


Re: kqueue clarification

2010-06-26 Thread Sad Clouds
On Sat, 26 Jun 2010 14:04:18 +0100
Sad Clouds cryintotheblue...@googlemail.com wrote:

> What I find confusing is that kevent() returns int data type, however
> the length of eventlist given by nevents can be up to size_t.
>
> Suppose you have more than INT_MAX events in kqueue, what are the
> semantics of kevent()?

OK, I guess unless the Unix file descriptor type (which is a signed
integer) changes, kqueue will be limited by that.


Re: Move the vnode lock into file systems

2010-06-26 Thread David Holland
On Sat, Jun 26, 2010 at 10:39:27AM +0200, Juergen Hannken-Illjes wrote:
> The vnode lock operations currently work on a rw lock located inside the
> vnode.  I propose to move this lock into the file system node.
>
> This place is more logical as we lock a file system node and not a vnode.
> This becomes clear if we think of a file system where one file system node
> is attached to more than one vnode.  Ptyfs allowing multiple mounts is such
> a candidate.

I'm not convinced that sharing locks at the VFS level is a good idea
for such cases. While ptyfs specifically is pretty limited and
unlikely to get into trouble, something more complex easily could. I
don't think we ought to encourage this until we have a clear plan for
rebind mounts and the resulting namespace-handling issues. (Since
in a sense that's the general case of multiply-mounted ptyfs.)

Since I'm pretty sure that a reasonable architecture for that will
lead to sharing vnodes between the multiple mountpoints, and
manifesting some kind of virtual name objects akin to Linux's dentries
to keep track of which mountpoint is which, I don't see that it's
necessary or desirable to specialize the locking...

Do you have another use case in mind?

(In the absence of some clear benefits I don't think it's a
particularly good idea to paste a dozen or two copies of genfs_lock
everywhere. But folding vlockmgr() into genfs_lock and genfs_unlock
seems like a fine idea.)
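
For concreteness, a rough sketch of what that folding might look like
is below; it assumes (this is not the current code) that the vnode's
v_lock is, or wraps, a plain krwlock_t, and the flag handling just
mirrors the ufs_lock in the diff above:

int
genfs_lock(void *v)
{
	struct vop_lock_args /* {
		struct vnode *a_vp;
		int a_flags;
	} */ *ap = v;
	struct vnode *vp = ap->a_vp;
	int flags = ap->a_flags;
	krw_t op;

	/* take the vnode's own lock directly instead of calling vlockmgr() */
	op = ((flags & LK_EXCLUSIVE) != 0 ? RW_WRITER : RW_READER);
	if ((flags & LK_NOWAIT) != 0)
		return rw_tryenter(&vp->v_lock, op) ? 0 : EBUSY;
	rw_enter(&vp->v_lock, op);
	return 0;
}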

-- 
David A. Holland
dholl...@netbsd.org