On Fri, Feb 06, 2026 at 09:09:16PM +0000, Al Viro wrote:
> On Fri, Feb 06, 2026 at 08:58:04PM +0000, Al Viro wrote:
> > On Fri, Feb 06, 2026 at 08:29:33PM +0000, Al Viro wrote:
> > 
> > > Look: the case where we might get passed current->fs down there is real.
> > > It can happen in one and only one situation - CLONE_NEWNS in unshare(2)
> > > arguments *and* current->fs->users being 1.
> > > 
> > > It wouldn't suffice, since there's chroot_fs_refs() that doesn't give
> > > a rat's arse for task->fs being ours - it goes and replaces every
> > > ->fs->pwd or ->fs->root that happens to point to old_root.
> > > 
> > > It's still not a real race, though - both chroot_fs_refs() and that area
> > > in copy_mnt_ns() are serialized on namespace_sem.
> > > 
> > > And yes, it's obscenely byzantine.  It gets even worse when you consider
> > > the fact that pivot_root(2) does not break only because the refcount
> > > drops in chroot_fs_refs() are guaranteed not to reach 0 - the caller is
> > > holding its own references to old_root.{mnt,dentry} and *thar* does not
> > > get dropped until we drop namespace_sem.
> > > 
> > > IOW, that shit is actually safe, but man, has its correctness grown 
> > > fucking
> > > convoluted...
> > > 
> > > Grabbing fs->seq in copy_mnt_ns() wouldn't make the things better, though 
> > > -
> > > it seriously relies upon the same exclusion with chroot_fs_refs() for
> > > correctness; unless you are willing to hold it over the entire walk 
> > > through
> > > the mount tree, the proof of correctness doesn't get any simpler.
> > 
> > Speaking of the race that _is_ there: pidfd setns() vs. pivot_root().
> > pivot_root() (well, chroot_fs_refs()) goes over all threads and flips their
> > ->fs->{root,pwd} for the ones that used to be at old_root.  The trouble is,
> > in case where we have setns() with more than just CLONE_NEWNS in flags, we
> > end up creating a temporary fs_struct, passing that to mntns_install() and
> > then copying its pwd and root back to the caller's if everything goes well.
> > 
> > That temporary is _not_ going to be found by chroot_fs_refs(), though, so
> > it misses the update by pivot_root().

I don't yet understand how it is substantially different from someone
holding a file descriptor to the old rootfs open and then re-fchdir()
and chroot("/proc/self/fd/<nr>)ing themselves. The main difference
surely is that during setns() it's unintentional. But the circumstances
under which this happens appear very arcane to me so I'm really
questioning whether we need to worry about setns() here. IOW, is this a
mere correctness thing or an actual bug.

Again, the scenarios where a pivot_root() happens with multiple tasks
already populating the mount namespace is very very low. So I'm fine
with just not caring about this particular thing especially if the fix
is involved. But I might be missing the actual bug here.

> BTW, in the same case of setns(2) (e.g. CLONE_NEWNS | CLONE_NEWUTS in flags)
> we end up defeating the check for fs->users == 1 in mntns_install() - for
> the temporary fs_struct it will always be true.  Unless I'm missing something
> elsewhere...  Christian?  Looks like that went in with 303cc571d107 ("nsproxy:
> attach to namespaces via pidfds")...

Sorry, just seeing that whole thread here now. No, there's no additional
checks. 

Reply via email to