The vnode recycle code is currently not able to recycle vnodes in an
    LRU fashion because we cannot move vnodes within the mount's vnode
    list without causing a number of filesystems, including UFS, to lose
    their place when scanning that list and start doing loop-restarts,
    resulting in O(dirty_vnode_count^2) behavior.
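
    (For reference, the scan pattern in question looks roughly like the
    sketch below.  This is just a simplified rendition of the loop the
    current ffs_sync() uses, shown for illustration; it is not part of
    the patch.)

        mtx_lock(&mntvnode_mtx);
    loop:
        for (vp = TAILQ_FIRST(&mp->mnt_nvnodelist); vp != NULL; vp = nvp) {
                /* vnode was reclaimed out from under us, start over */
                if (vp->v_mount != mp)
                        goto loop;
                nvp = TAILQ_NEXT(vp, v_nmntvnodes);
                mtx_unlock(&mntvnode_mtx);
                /* ... potentially blocking flush of vp ... */
                mtx_lock(&mntvnode_mtx);
                /*
                 * If the list changed while we were blocked (e.g. an LRU
                 * requeue moved vp), nvp can no longer be trusted and the
                 * scan must restart from the head -- this is where the
                 * O(dirty_vnode_count^2) behavior comes from.
                 */
                if (TAILQ_NEXT(vp, v_nmntvnodes) != nvp)
                        goto loop;
        }
        mtx_unlock(&mntvnode_mtx);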

    For example, if the #if 0'd section in vlruvp() in vfs_subr.c were to
    be enabled, the filesystem syncing daemon would suddenly start eating
    insane amounts of cpu whenever a significant number of dirty vnodes
    exist (e.g. during rm -rf /usr/ports, cp -r, cvs checkout, etc...).

    This patch is intended to begin solving this problem.  The main purpose
    of the patch is to generalize the vnode scan that virtually all
    filesystems do when they are sync'd (see *_sync() in [*/]*/*vfsops.c*).
    The patch defines a new API that does all the hard work of performing
    the vnode scan and keeping track of the scan position, and makes
    callbacks to flush individual vnodes as appropriate.
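
    To make the intended usage concrete, here is a rough sketch of how a
    filesystem's *_sync() routine would use the API.  It is modeled on the
    ffs_sync() conversion in the patch below; the myfs_* names are made up
    for illustration and are not part of the patch.

        static int
        myfs_sync_fast(struct vfs_scan_info *info, struct vnode *vp)
        {
                /*
                 * Non-blocking filter, called with the vnode list mutex
                 * held.  Returning < 0 tells the scanner to skip this
                 * vnode without calling the slow function.
                 */
                if (vp->v_type == VNON || TAILQ_EMPTY(&vp->v_dirtyblkhd))
                        return (-1);
                return (0);
        }

        static void
        myfs_sync_slow(struct vfs_scan_info *info, struct vnode *vp)
        {
                int error;

                /* Blocking flush, called without the vnode list mutex. */
                if ((error = vget(vp, info->vs_lockreq, info->vs_td)) != 0) {
                        if (error == ENOENT)
                                info->vs_tryagain = 1;
                        return;
                }
                if ((error = VOP_FSYNC(vp, info->vs_cred, info->vs_waitfor,
                    info->vs_td)) != 0)
                        info->vs_allerror = error;
                VOP_UNLOCK(vp, 0, info->vs_td);
                vrele(vp);
        }

        static int
        myfs_sync(struct mount *mp, int waitfor, struct ucred *cred,
            struct thread *td)
        {
                struct vfs_scan_info info;

                vfs_scan_init(&info);
                info.vs_waitfor = waitfor;
                info.vs_lockreq = (waitfor == MNT_WAIT) ?
                    LK_EXCLUSIVE : (LK_EXCLUSIVE | LK_NOWAIT);
                info.vs_cred = cred;
                info.vs_td = td;
                /* flushing of the device vnode, superblock, etc. follows */
                return (vfs_scan_vnodes(mp, &info, myfs_sync_fast,
                    myfs_sync_slow));
        }

    Fields like vs_lockreq and vs_waitfor live in the user-defined part of
    vfs_scan_info, so each filesystem can pass whatever it needs down to
    its own callbacks.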

    The patch implements this API and modifies UFS to use the new API to
    prove it out.  As part of this API, a new vnode type called VMARKER
    has been added.  This type allows us to initialize a dummy vnode and
    use it to mark our position in the mount's vnode list.  All other code
    that does not use the new API to scan the vnodes under the mount must
    ignore this dummy vnode in its scans.  A sysctl, debug.vlruvp_enable,
    is provided to enable the previously #if 0'd section of vlruvp() (i.e.
    turn on LRU ordering for the vnode recycle code) for testing purposes.
    It defaults to off (0) since only UFS has been fixed.
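
    For scans that have not been converted yet, the required change is
    small: the hand-rolled loop simply has to skip markers, along the
    lines of the hunks the patch adds to the other scans (sketch only,
    showing just the inside of an existing loop):

        for (vp = TAILQ_FIRST(&mp->mnt_nvnodelist); vp != NULL; vp = nvp) {
                nvp = TAILQ_NEXT(vp, v_nmntvnodes);
                /* placemarkers belong to some other scan; ignore them */
                if (vp->v_type == VMARKER)
                        continue;
                /* ... normal per-vnode work ... */
        }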

    I've done some moderate testing on -current.  I intend to commit this
    some time tomorrow unless there are hiccups or someone has a bright idea.
    It will eventually get into -stable as well.

    The main thing I am looking for, review-wise, is confirmation that the
    existing ffs_sync() code is equivalent to the new ffs_sync() code that
    uses the new API.  There should be no operational differences; the code
    should simply be cut up differently.  I.e. this commit stage is not
    supposed to make any major operational changes to the codebase.

                                                -Matt

Index: isofs/cd9660/cd9660_vnops.c
===================================================================
RCS file: /home/ncvs/src/sys/isofs/cd9660/cd9660_vnops.c,v
retrieving revision 1.73
diff -u -r1.73 cd9660_vnops.c
--- isofs/cd9660/cd9660_vnops.c 12 Sep 2001 08:37:43 -0000      1.73
+++ isofs/cd9660/cd9660_vnops.c 6 Mar 2002 08:14:05 -0000
@@ -110,6 +110,7 @@
                case VFIFO:
                case VNON:
                case VBAD:
+               case VMARKER:
                        return (0);
                }
        }
Index: kern/vfs_subr.c
===================================================================
RCS file: /home/ncvs/src/sys/kern/vfs_subr.c,v
retrieving revision 1.347
diff -u -r1.347 vfs_subr.c
--- kern/vfs_subr.c     5 Mar 2002 19:45:45 -0000       1.347
+++ kern/vfs_subr.c     6 Mar 2002 08:46:03 -0000
@@ -335,6 +335,92 @@
 }
 
 /*
+ * vfs_scan_init:      Initialize a vfs_scan_info structure for use.
+ *
+ *     The info structure is cleared and initialized for use.  The caller
+ *     may initialize additional fields after calling this function.
+ *
+ */
+void
+vfs_scan_init(struct vfs_scan_info *info)
+{
+       bzero(info, sizeof(struct vfs_scan_info));
+       info->vs_doloop = 1;
+}
+
+/*
+ * vfs_scan_vnodes:    Scan the vnodes under a mount point
+ *
+ *     The caller passes the mount point, pre-initialized information
+ *     structure, and two function callbacks.
+ *
+ *     The fast function is called as an optimization, with the mountlist 
+ *     mutex still held.   This function may not block or release the mutex.
+ *     A negative return value will cause the scan code to skip the vnode
+ *     (never call the slow function).  This function may be NULL.
+ *
+ *     The slow function is called if the fast function returns >= 0.  This
+ *     function will be called without the mountlist mutex held and is allowed
+ *     to block.  It may also be NULL.
+ *
+ *     If either function wishes a rescan to occur after the current scan
+ *     finishes, it should set vs_tryagain to 1.  If either function wishes
+ *     to abort the loop entirely, vs_doloop should be set <= 0 (and
+ *     vs_tryagain should also be set to 0 if you had previously modified
+ *     it).
+ */
+int
+vfs_scan_vnodes(struct mount *mp, struct vfs_scan_info *info, vfs_vnfast_t fastfunc,
+    vfs_vnslow_t slowfunc)
+{
+       struct vnode *vp;
+       struct vnode *nvp;
+
+       info->vs_mount = mp;
+       info->vs_marker.v_mount = mp;
+       info->vs_marker.v_type = VMARKER;
+
+       mtx_lock(&mntvnode_mtx);
+       info->vs_tryagain = 1;
+       while (info->vs_tryagain) {
+               info->vs_tryagain = 0;
+               for (vp = TAILQ_FIRST(&mp->mnt_nvnodelist); 
+                    vp != NULL && info->vs_doloop > 0;
+                    vp = nvp) {
+                       /*
+                        * sanity check, skip markers
+                        */
+                       KASSERT(vp->v_mount == mp, 
+                           ("vnode %p mount association changed", vp));
+                       nvp = TAILQ_NEXT(vp, v_nmntvnodes);
+                       if (vp->v_type == VMARKER)
+                               continue;
+
+                       /*
+                        * non-blocking skip function
+                        */
+                       if (fastfunc && fastfunc(info, vp) < 0)
+                               continue;
+
+                       /*
+                        * Use a marker to save/restore our scan position
+                        * when calling the blocking function.
+                        */
+                       TAILQ_INSERT_AFTER(&mp->mnt_nvnodelist,
+                           vp, &info->vs_marker, v_nmntvnodes);
+                       mtx_unlock(&mntvnode_mtx);
+                       if (slowfunc)
+                               slowfunc(info, vp);
+                       mtx_lock(&mntvnode_mtx);
+                       nvp = TAILQ_NEXT(&info->vs_marker, v_nmntvnodes);
+                       TAILQ_REMOVE(&mp->mnt_nvnodelist,
+                           &info->vs_marker, v_nmntvnodes);
+               }
+       }
+       mtx_unlock(&mntvnode_mtx);
+       return(info->vs_allerror);
+}
+
+/*
  * Lookup a filesystem type, and if found allocate and initialize
  * a mount structure for it.
  *
@@ -572,6 +658,17 @@
        done = 0;
        mtx_lock(&mntvnode_mtx);
        while (count && (vp = TAILQ_FIRST(&mp->mnt_nvnodelist)) != NULL) {
+               /*
+                * Don't move any markers we find!
+                */
+               if (vp->v_type == VMARKER) {
+                       do {
+                               vp = TAILQ_NEXT(vp, v_nmntvnodes);
+                       } while (vp && vp->v_type == VMARKER);
+                       if (vp == NULL)
+                               break;
+               }
+
                TAILQ_REMOVE(&mp->mnt_nvnodelist, vp, v_nmntvnodes);
                TAILQ_INSERT_TAIL(&mp->mnt_nvnodelist, vp, v_nmntvnodes);
 
@@ -1902,6 +1999,11 @@
                if (vp->v_mount != mp)
                        goto loop;
                nvp = TAILQ_NEXT(vp, v_nmntvnodes);
+               /*
+                * temporary until we replace this with vfs_scan_vnodes XXX
+                */
+               if (vp->v_type == VMARKER)
+                       continue;
 
                mtx_unlock(&mntvnode_mtx);
                mtx_lock(&vp->v_interlock);
Index: sys/vnode.h
===================================================================
RCS file: /home/ncvs/src/sys/sys/vnode.h,v
retrieving revision 1.171
diff -u -r1.171 vnode.h
--- sys/vnode.h 18 Feb 2002 16:17:57 -0000      1.171
+++ sys/vnode.h 6 Mar 2002 08:24:16 -0000
@@ -57,9 +57,10 @@
  */
 
 /*
- * Vnode types.  VNON means no type.
+ * Vnode types.  VNON means no type.  VMARKER is a scan placemarker
  */
-enum vtype     { VNON, VREG, VDIR, VBLK, VCHR, VLNK, VSOCK, VFIFO, VBAD };
+enum vtype     { VNON, VREG, VDIR, VBLK, VCHR, VLNK, 
+                 VSOCK, VFIFO, VBAD, VMARKER };
 
 /*
  * Vnode tag types.
@@ -80,6 +81,7 @@
 TAILQ_HEAD(buflists, buf);
 
 typedef        int     vop_t __P((void *));
+
 struct namecache;
 
 struct vpollinfo {
@@ -386,6 +388,33 @@
        caddr_t *vdesc_transports;
 };
 
+/*
+ * vfs_scan_vnodes() support.
+ */
+
+struct vfs_scan_info {
+       /*
+        * Integrated into scanning code
+        */
+       int     vs_allerror;
+       int     vs_tryagain;
+       int     vs_doloop;
+       struct vnode vs_marker;
+       struct mount *vs_mount;
+
+       /*
+        * totally user-defined
+        */
+       int     vs_lockreq;
+       int     vs_wait;
+       int     vs_waitfor;
+       struct thread *vs_td;
+       struct ucred *vs_cred;
+};
+
+typedef int    (*vfs_vnfast_t)(struct vfs_scan_info *info, struct vnode *vp);
+typedef void   (*vfs_vnslow_t)(struct vfs_scan_info *info, struct vnode *vp);
+
 #ifdef _KERNEL
 /*
  * A list of all the operation descs.
@@ -602,6 +631,8 @@
 void   vgone __P((struct vnode *vp));
 void   vgonel __P((struct vnode *vp, struct thread *td));
 void   vhold __P((struct vnode *));
+void   vfs_scan_init __P((struct vfs_scan_info *info));
+int    vfs_scan_vnodes __P((struct mount *mp, struct vfs_scan_info *info,
+           vfs_vnfast_t fastfunc, vfs_vnslow_t slowfunc));
 int    vinvalbuf __P((struct vnode *vp, int save, struct ucred *cred,
            struct thread *td, int slpflag, int slptimeo));
 int    vtruncbuf __P((struct vnode *vp, struct ucred *cred, struct thread *td,
Index: ufs/ffs/ffs_snapshot.c
===================================================================
RCS file: /home/ncvs/src/sys/ufs/ffs/ffs_snapshot.c,v
retrieving revision 1.30
diff -u -r1.30 ffs_snapshot.c
--- ufs/ffs/ffs_snapshot.c      27 Feb 2002 19:18:10 -0000      1.30
+++ ufs/ffs/ffs_snapshot.c      6 Mar 2002 08:36:50 -0000
@@ -376,6 +376,11 @@
                if (xvp->v_mount != mp)
                        goto loop;
                nvp = TAILQ_NEXT(xvp, v_nmntvnodes);
+               /*
+                * temporary until we replace this with vfs_scan_vnodes XXX
+                */
+               if (xvp->v_type == VMARKER)
+                       continue;
                mtx_unlock(&mntvnode_mtx);
                mtx_lock(&xvp->v_interlock);
                if (xvp->v_usecount == 0 || xvp->v_type == VNON ||
Index: ufs/ffs/ffs_vfsops.c
===================================================================
RCS file: /home/ncvs/src/sys/ufs/ffs/ffs_vfsops.c,v
retrieving revision 1.168
diff -u -r1.168 ffs_vfsops.c
--- ufs/ffs/ffs_vfsops.c        27 Feb 2002 18:32:22 -0000      1.168
+++ ufs/ffs/ffs_vfsops.c        6 Mar 2002 08:38:21 -0000
@@ -491,6 +491,11 @@
                        goto loop;
                }
                nvp = TAILQ_NEXT(vp, v_nmntvnodes);
+               /*
+                * temporary until we replace this with vfs_scan_vnodes XXX
+                */
+               if (vp->v_type == VMARKER)
+                       continue;
                mtx_unlock(&mntvnode_mtx);
                /*
                 * Step 4: invalidate all inactive vnodes.
@@ -988,7 +993,51 @@
  * initiate the writing of the super block if it has been modified.
  *
  * Note: we are always called with the filesystem marked `MPBUSY'.
+ *
+ * ffs_sync_ffast:     non-blocking check for fsync candidate.  The
+ *                     mountlist mutex will be held during this call.
+ *                     A negative return value indicates that the
+ *                     vnode may be skipped.
+ *
+ * ffs_sync_fslow:     potentially blocking; executes fsync on the vnode.
+ *                     The mountlist mutex will NOT be held.
  */
+
+static int
+ffs_sync_ffast(struct vfs_scan_info *info, struct vnode *vp)
+{
+       struct inode *ip;
+
+       ip = VTOI(vp);
+       if (vp->v_type == VNON || ((ip->i_flag &
+           (IN_ACCESS | IN_CHANGE | IN_MODIFIED | IN_UPDATE)) == 0 &&
+           TAILQ_EMPTY(&vp->v_dirtyblkhd))) {
+               return(-1);
+       }
+       return(0);
+}
+
+static void
+ffs_sync_fslow(struct vfs_scan_info *info, struct vnode *vp)
+{
+       if (vp->v_type != VCHR) {
+               int error;
+
+               if ((error = vget(vp, info->vs_lockreq, info->vs_td)) != 0) {
+                       if (error == ENOENT)
+                               info->vs_tryagain = 1;
+               } else {
+                       if ((error = VOP_FSYNC(vp, info->vs_cred, info->vs_waitfor,
+                           info->vs_td)) != 0) {
+                               info->vs_allerror = error;
+                       }
+                       VOP_UNLOCK(vp, 0, info->vs_td);
+                       vrele(vp);
+               }
+       } else {
+               UFS_UPDATE(vp, info->vs_wait);
+       }
+}
+
 int
 ffs_sync(mp, waitfor, cred, td)
        struct mount *mp;
@@ -996,71 +1045,38 @@
        struct ucred *cred;
        struct thread *td;
 {
-       struct vnode *nvp, *vp, *devvp;
-       struct inode *ip;
+       struct vnode *devvp;
        struct ufsmount *ump = VFSTOUFS(mp);
        struct fs *fs;
-       int error, count, wait, lockreq, allerror = 0;
+       int error, count, allerror = 0;
+       struct vfs_scan_info info;
 
        fs = ump->um_fs;
        if (fs->fs_fmod != 0 && fs->fs_ronly != 0) {            /* XXX */
                printf("fs = %s\n", fs->fs_fsmnt);
                panic("ffs_sync: rofs mod");
        }
+
        /*
-        * Write back each (modified) inode.
+        * Setup to scan the vnodes in the mountlist
         */
-       wait = 0;
-       lockreq = LK_EXCLUSIVE | LK_NOWAIT;
+       vfs_scan_init(&info);
+       info.vs_wait = 0;
+       info.vs_waitfor = waitfor;
+       info.vs_lockreq = LK_EXCLUSIVE | LK_NOWAIT;
        if (waitfor == MNT_WAIT) {
-               wait = 1;
-               lockreq = LK_EXCLUSIVE;
+               info.vs_wait = 1;
+               info.vs_lockreq = LK_EXCLUSIVE;
        }
-       mtx_lock(&mntvnode_mtx);
+       info.vs_td = td;
+       info.vs_cred = cred;
+
+       /*
+        * Write back each (modified) inode.
+        */
 loop:
-       for (vp = TAILQ_FIRST(&mp->mnt_nvnodelist); vp != NULL; vp = nvp) {
-               /*
-                * If the vnode that we are about to sync is no longer
-                * associated with this mount point, start over.
-                */
-               if (vp->v_mount != mp)
-                       goto loop;
+       allerror = vfs_scan_vnodes(mp, &info, ffs_sync_ffast, ffs_sync_fslow);
 
-               /*
-                * Depend on the mntvnode_slock to keep things stable enough
-                * for a quick test.  Since there might be hundreds of
-                * thousands of vnodes, we cannot afford even a subroutine
-                * call unless there's a good chance that we have work to do.
-                */
-               nvp = TAILQ_NEXT(vp, v_nmntvnodes);
-               ip = VTOI(vp);
-               if (vp->v_type == VNON || ((ip->i_flag &
-                   (IN_ACCESS | IN_CHANGE | IN_MODIFIED | IN_UPDATE)) == 0 &&
-                   TAILQ_EMPTY(&vp->v_dirtyblkhd))) {
-                       continue;
-               }
-               if (vp->v_type != VCHR) {
-                       mtx_unlock(&mntvnode_mtx);
-                       if ((error = vget(vp, lockreq, td)) != 0) {
-                               mtx_lock(&mntvnode_mtx);
-                               if (error == ENOENT)
-                                       goto loop;
-                       } else {
-                               if ((error = VOP_FSYNC(vp, cred, waitfor, td)) != 0)
-                                       allerror = error;
-                               VOP_UNLOCK(vp, 0, td);
-                               vrele(vp);
-                               mtx_lock(&mntvnode_mtx);
-                       }
-               } else {
-                       mtx_unlock(&mntvnode_mtx);
-                       UFS_UPDATE(vp, wait);
-                       mtx_lock(&mntvnode_mtx);
-               }
-               if (TAILQ_NEXT(vp, v_nmntvnodes) != nvp)
-                       goto loop;
-       }
-       mtx_unlock(&mntvnode_mtx);
        /*
         * Force stale file system control information to be flushed.
         */
@@ -1068,10 +1084,8 @@
                if ((error = softdep_flushworklist(ump->um_mountp, &count, td)))
                        allerror = error;
                /* Flushed work items may create new vnodes to clean */
-               if (count) {
-                       mtx_lock(&mntvnode_mtx);
+               if (count)
                        goto loop;
-               }
        }
 #ifdef QUOTA
        qsync(mp);
@@ -1085,12 +1099,11 @@
                if ((error = VOP_FSYNC(devvp, cred, waitfor, td)) != 0)
                        allerror = error;
                VOP_UNLOCK(devvp, 0, td);
-               if (waitfor == MNT_WAIT) {
-                       mtx_lock(&mntvnode_mtx);
+               if (waitfor == MNT_WAIT)
                        goto loop;
-               }
-       } else
+       } else {
                mtx_unlock(&devvp->v_interlock);
+       }
        /*
         * Write back modified superblock.
         */
Index: ufs/ufs/ufs_quota.c
===================================================================
RCS file: /home/ncvs/src/sys/ufs/ufs/ufs_quota.c,v
retrieving revision 1.51
diff -u -r1.51 ufs_quota.c
--- ufs/ufs/ufs_quota.c 27 Feb 2002 18:32:22 -0000      1.51
+++ ufs/ufs/ufs_quota.c 6 Mar 2002 08:39:17 -0000
@@ -443,6 +443,11 @@
                if (vp->v_mount != mp)
                        goto again;
                nextvp = TAILQ_NEXT(vp, v_nmntvnodes);
+               /*
+                * temporary until we replace this with vfs_scan_vnodes XXX
+                */
+               if (vp->v_type == VMARKER)
+                       continue;
                
                mtx_unlock(&mntvnode_mtx);
                mtx_lock(&vp->v_interlock);
@@ -499,6 +504,11 @@
                if (vp->v_mount != mp)
                        goto again;
                nextvp = TAILQ_NEXT(vp, v_nmntvnodes);
+               /*
+                * temporary until we replace this with vfs_scan_vnodes XXX
+                */
+               if (vp->v_type == VMARKER)
+                       continue;
 
                mtx_unlock(&mntvnode_mtx);
                mtx_lock(&vp->v_interlock);
@@ -698,6 +708,12 @@
                if (vp->v_mount != mp)
                        goto again;
                nextvp = TAILQ_NEXT(vp, v_nmntvnodes);
+               /*
+                * temporary until we replace this with vfs_scan_vnodes XXX
+                */
+               if (vp->v_type == VMARKER)
+                       continue;
+
                mtx_unlock(&mntvnode_mtx);
                mtx_lock(&vp->v_interlock);
                if (vp->v_type == VNON) {
