Problem

The domount() function that is called by the mount() system call erroneously calls lookupname() (a VFS filename lookup function) on ZFS filesystem names when mounting ZFS filesystems. This causes spurious, short-lived vnode holds to be placed on vnode_t's unrelated to the filesystem being mounted, as described below.
domount() maintains a VFS "mount-in-progress table" which contains a mapping of dev_t's being mounted to filesystems (see vfs_addmip()). The idea is that filesystems like UFS can, when processing a mount request, look at the table to determine whether a device is already in the process of being mounted (see vfs_devmounting()). In order to add a device to the mount-in-progress table, domount() calls lookupname() on the mount "spec" (e.g. for a UFS filesystem, that would be /dev/dsk/blabla) to obtain the vnode associated with that pathname, and adds the vnode's v_rdev to the table (under the assumption that it's a device...).

The lookupname() function does its work by starting at the root of the path and calling VOP_LOOKUP() on the "next" component of the path. Once it has held the vnode associated with the top-most directory, it iteratively calls VOP_LOOKUP() on the next component, and so on, until it has resolved the entire path and holds the vnode of the leaf filesystem entry (it also follows symbolic links, etc.). The guts of this algorithm are implemented in lookuppnvp(). The entire call chain for this lookupname() call from domount() looks like:

```
domount()
  lookupname()
    lookupnameatcred()
      lookuppnatcred()
        lookuppnvp()  <- loops over the components of the path,
                         doing VOP_LOOKUP() calls on each one
```

This all seems reasonable so far, but as you might have guessed from the summary for this bug, none of it makes a lick of sense when mounting a ZFS filesystem. The mount spec that is passed into the mount system call for ZFS isn't a path at all, but a ZFS filesystem name, which doesn't exist in the VFS namespace. What ends up happening in that case is that domount() passes a filesystem name into lookupname(). That filesystem name kind of looks like a path; specifically, it looks like a relative path name (a path without a leading '/').
As it turns out, the lookuppnatcred() call in the call chain above handles this case by prepending the current working directory to the path:

```c
	if (pnp->pn_path[0] == '/') {
		vp = rootvp;
	} else {
		vp = (startvp == NULL) ? PTOU(p)->u_cdir : startvp;
	}
```

This code, which is part of the generic lookup code, results in a fully-qualified path ultimately being passed in to lookuppnvp(). The end result is that if I mount a ZFS filesystem named "mypool/seb" and my CWD is "/home/seb", lookuppnvp() will attempt to get a hold on the path "/home/seb/mypool/seb", which is nonsensical, and fails. This failure is ultimately handled gracefully by domount(), which simply, and silently, doesn't enter a device into the mount-in-progress table in this case.

So you might then think, "if domount() silently handles this gracefully, then this could just be left alone, as it's harmless." Unfortunately, it's not harmless, because lookuppnvp() iteratively places vnode holds on each component of the path that it's looking up. Consider the following scenario, where we have these filesystems and associated mountpoints:

```
NAME      MOUNTPOINT
seb       /seb
seb/data  /data
```

These two mountpoints are unrelated to one another, so I should be able to mount these two filesystems simultaneously. Let's say I do that, and my CWD is "/". Unfortunately, due to the lookupname() call in domount() described above, the mounting of the "seb" filesystem might fail: if the mount for "seb/data" happens to be in the middle of attempting to resolve the bogus path "/seb/data", a very short-lived vnode hold on "/seb" will cause the mount of the "seb" filesystem to fail with EBUSY when it attempts to check whether its mountpoint is "busy".

As such, the lookupname() code in domount() should only be called if the mount spec passed in is a path, and not an opaque filesystem-dependent string as it is for ZFS.
Also note that this problem exists not only for ZFS, but for any filesystem whose mount spec isn't a device path (e.g. NFS, swapfs, tmpfs, ctfs, autofs, procfs, etc.).

Solution

There are a few possible approaches to a fix. One would be to check in domount() whether a mount is for a ZFS filesystem, and if so, skip the device path lookup. This would be fine for ZFS, but as mentioned in the description, this bug actually exists for a plethora of other filesystems that don't mount from a device path. As such, the fix is to add a VSW VFS flag that indicates that the filesystem is mounted from a device path. If the flag is not set, then domount() will know not to bother with the device path lookup and mount-in-progress table management for this mount.

How to Reproduce

Create two filesystems as described above on a test pool:

```
# zpool create seb /dev/dsk/c1t1d0
# zfs create -o mountpoint=/data seb/data
# zfs list -o name,mountpoint -r seb
NAME      MOUNTPOINT
seb       /seb
seb/data  /data
```

Now, create a simple unmount/mount looping script as follows:

```
# cat ~/mount-stress.sh
#!/usr/bin/bash

while true; do
	zfs unmount $1 || exit 1
	zfs mount $1 || exit 1
done
```

Now, simply invoke this script on each filesystem in parallel:

```
# ./mount-stress.sh seb &
[1] 734
# ./mount-stress.sh seb/data &
[2] 1212
```

I've found that in less than a second, one of the "zfs mount seb" invocations will fail with EBUSY as described above.

Notes

The root-cause analysis for this issue was facilitated by the static DTrace probes added in commit e15e026e38850c2df8dfce29873a46a089d5e215.
Upstream bugs: DLPX-49847

You can view, comment on, or merge this pull request online at:
https://github.com/openzfs/openzfs/pull/538

-- Commit Summary --

* 9074 domount() interprets ZFS filesystem names as relative paths

-- File Changes --

M usr/src/uts/common/fs/hsfs/hsfs_vfsops.c (3)
M usr/src/uts/common/fs/pcfs/pc_vfsops.c (5)
M usr/src/uts/common/fs/udfs/udf_vfsops.c (19)
M usr/src/uts/common/fs/ufs/ufs_vfsops.c (8)
M usr/src/uts/common/fs/vfs.c (8)
M usr/src/uts/common/sys/vfs.h (3)

-- Patch Links --

https://github.com/openzfs/openzfs/pull/538.patch
https://github.com/openzfs/openzfs/pull/538.diff