Problem

The domount() function that is called by the mount() system call erroneously
calls lookupname() (a VFS filename lookup function) on ZFS filesystem names
when mounting ZFS filesystems. This causes spurious short-lived vnode holds to
be placed on vnode_t's unrelated to the filesystem being mounted, as described
below.

domount() maintains a VFS "mount-in-progress table" which contains a mapping
of dev_t's being mounted to filesystems (see vfs_addmip()). The idea is that
filesystems like UFS can, when processing a mount request, look at the table
to determine if a device is already in the process of being mounted (see
vfs_devmounting()). In order to add a device to the "mount-in-progress table"
domount() calls lookupname() on the mount "spec" (e.g. for a UFS filesystem,
that would be /dev/dsk/blabla) to obtain the vnode associated with that
pathname, and adds the vnode's v_rdev to the table (under the assumption that
it's a device...).

The lookupname() function does its work by starting at the root of the path,
and calling VOP_LOOKUP() on the "next" component of the path. Once it has held
the vnode associated with the top-most directory, it iteratively calls
VOP_LOOKUP() on the next component, and so on, until it has resolved the
entire path, and has a vnode hold on the leaf filesystem entry (it also
follows symbolic links, etc.). The guts of this algorithm is implemented in
lookuppnvp(). The entire call chain for this lookupname() call from domount()
looks like:

    domount()
      lookupname()
        lookupnameatcred()
          lookuppnatcred()
            lookuppnvp()
              (loops over components of the path doing VOP_LOOKUP() calls on 
each one)

This all seems reasonable so far, but as you might have guessed from the
summary for this bug, none of this makes a lick of sense when mounting a ZFS
filesystem. The mount spec that is passed into the mount system call for ZFS
isn't a path at all, but a ZFS filesystem name, which doesn't exist in the VFS
namespace.

What ends up happening in that case is that domount() passes a filesystem name
into lookupname(). That filesystem name kind of looks like a path, and
specifically, it looks like a relative path name (a path without a leading
'/'). As it turns out, the lookuppnatcred() call in the call chain above
handles this by pre-pending the current working directory to the path in this
case:

    if (pnp->pn_path[0] == '/') {
            vp = rootvp;
    } else {
            vp = (startvp == NULL) ? PTOU(p)->u_cdir : startvp;
    }

This code results in a fully-qualified path getting ultimately passed in to
lookuppnvp(), and is part of the generic lookup code. The end result is that
if I mount a ZFS filesystem named "mypool/seb", and my PWD is "/home/seb",
lookuppnvp() will attempt to get a hold on the path named
"/home/seb/mypool/seb", which is non-sensical, and fails. This failure is
ultimately gracefully handled by domount(), which simply silently doesn't
enter a device into the "mount-in-progress table" in this case.

So you might then think, "if domount() silently handles this gracefully, then
this could just be left alone, as it's harmless." Unfortunately, it's not
harmless, as the lookuppnvp() function iteratively places vnode holds on each
component of the path that it's looking up. Consider the following scenario
where we have the following filesystems and associated mountpoints:

    NAME       MOUNTPOINT
    seb        /seb
    seb/data   /data

These two mountpoints are unrelated to one another. I should be able to
simultaneously mount these two filesystems. Let's say I do that, and my CWD is
"/". Unfortunately, due to the lookupname() call described above in domount(),
the mounting of the "seb" filesystem might fail with EBUSY if the mount for
"seb/data" happens to be in the middle of attempting to resolve the bogus path
"/seb/data", a very short-lived vnode hold on "/seb" will cause the mount for
the "seb" filesystem to fail with "EBUSY" when it attempts to check if its
mountpoint is "busy".

As such, the lookupname() code in domount() should only be called if the mount
spec passed in is a path, and not an opaque filesystem-dependent string as it
is for ZFS.

Also note that this problem exists not only for ZFS, but for any filesystem
whose mount spec isn't a device path (e.g. NFS, swapfs, tmpfs, ctfs, autofs,
procfs, etc.).

Solution

There are a few possible approaches to a fix. One way would be to check if a
mount is for a ZFS filesystem in domount(), and if it is, skip the device path
lookup. This would be fine for ZFS, but as mentioned in the description, this
bug actually exists for a plethora of other filesystems that don't mount from
a device path.

As such, the fix is to add a VSW VFS flag that indicates that the filesystem
is mounted from a device path. If the flag is set, then domount() will know
not to bother with the device path lookup and mount-in-progress table
management for this mount.

How to Reproduce

Create two filesystems as described above on a test pool:

    # zpool create seb /dev/dsk/c1t1d0
    # zfs create -o mountpoint=/data seb/data
    # zfs list -o name,mountpoint -r seb
    NAME      MOUNTPOINT
    seb       /seb
    seb/data  /data

Now, create a simple unmount/mount looping script as follows:

    # cat ~/mount-stress.sh
    #!/usr/bin/bash

    while true; do
        zfs unmount $1 || exit 1
        zfs mount $1 || exit 1
    done

Now, simply invoke this script on each filesystem in parallel as follows:

    # ./mount-stress.sh seb &
    [1] 734
    # ./mount-stress.sh seb/data &
    [2] 1212

I've found that in less than a second, one of the "zfs mount seb" invocations
will fail with EBUSY as described above:

Notes

The root-cause analysis for this issue was facilitated by the static DTrace
probes added in commit e15e026e38850c2df8dfce29873a46a089d5e215.

Upstream bugs: DLPX-49847
You can view, comment on, or merge this pull request online at:

  https://github.com/openzfs/openzfs/pull/538

-- Commit Summary --

  * 9074 domount() interprets ZFS filesystem names as relative paths

-- File Changes --

    M usr/src/uts/common/fs/hsfs/hsfs_vfsops.c (3)
    M usr/src/uts/common/fs/pcfs/pc_vfsops.c (5)
    M usr/src/uts/common/fs/udfs/udf_vfsops.c (19)
    M usr/src/uts/common/fs/ufs/ufs_vfsops.c (8)
    M usr/src/uts/common/fs/vfs.c (8)
    M usr/src/uts/common/sys/vfs.h (3)

-- Patch Links --

https://github.com/openzfs/openzfs/pull/538.patch
https://github.com/openzfs/openzfs/pull/538.diff

-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/openzfs/openzfs/pull/538

------------------------------------------
openzfs-developer
Archives: 
https://openzfs.topicbox.com/groups/developer/discussions/T7060187599fd888c-Me193c1949294839eed9e1c73
Powered by Topicbox: https://topicbox.com

Reply via email to