On Thu, Apr 23, 2015 at 05:45:44PM +1000, NeilBrown wrote:

> follow_link calls  link_path_walk -> walk_component -> lookup_fast which sets
> nd->seq.  Is that not enough?  I guess not when nd_jump_link is called.  Is
> that what I missed?

No.  Potential nd_jump_link() callers are just refusing to go there in lazy
mode, end of story.  That's not the problem; see below.

> One thing that is clear to me is that I don't really understand all the
> handling of 'seq' numbers, making me unable to comment usefully.
> I'll try to go through the current code and you current patch with that issue
> in mind and make sure I do understand it.  Then I'll try to comment.

OK, here's the basic theory behind the lazy pathwalk:

* during the entire exercise we never drop rcu_read_lock, therefore any
objects that have all references to them removed before RCU-delayed
freeing (inodes, dentries, superblocks and vfsmounts are among such)
that we might find in process won't be freed until after we are done.

* invariant we are keeping:
        at some point between the beginning of walk and now the pathname
traversed so far would lead to nd->path, with nd->seq and nd->inode equal to
->d_seq and ->d_inode of the dentry in question.

* path_init() arranges for that to be true in the beginning of the walk.

* symlinks aside, walk_component() preserves that.
        + normal name components go through lookup_fast(), where we have
          __d_lookup_rcu() find a child of current nd->path with matching
          name and (atomically) picks ->d_seq of that child, which had the
          name matching our component.  Atomicity is provided by ->d_lock
          on child.  Then we proceed to pick ->d_inode (and verify that
          ->d_seq has not changed, thus making sure that ->d_inode value
          at the moment when the name matched had been the same and child
          is still hashed.  Then we verify that parent's ->d_seq has not
          changed, guaranteeing that parent hadn't been moved or unhashed
          from the moment we'd found it until after we'd found its child.
          Assuming nothing's mounted on top of that thing, and there's no
          problem with ->d_revalidate()), that's it - we have new nd->seq,
          nd->path and nd->inode satisfying our invariant for longer
          piece of pathname.
        + "." needs nothing - we just stay where we are
        + ".." is handled by follow_dotdot_rcu(), which in the normal case
          (no mountpoint crossing) picks ->d_parent of where we are,
          fetches its ->d_inode and ->d_seq and verifies that our directory
          still hadn't changed _its_ ->d_seq.  The last part guarantees that
          it hadn't been moved since the time we'd found it, and thus its
          ->d_parent had remained unchanged at least until that verification.
          Therefore, it remained pinned all along, and it ->d_inode had
          remained stable, including the moment when we fetched ->d_seq.
          Which means that the value we had picked *and* its ->d_inode and
          ->d_seq would satisfy the invariant for the longer piece of
          pathname.
        + mountpoint crossing towards leaves is handled in __follow_mount_rcu();
          it is simple (->mnt_root never changes and is always pinned,
          stabilizing its ->d_inode), but we might need to worry about
          automount points *and* need to make sure that we stop right
          there if mount_lock has been bumped.  See commit b37199e6 for
          the details on the last part - basically, making sure that false
          negatives from __lookup_mnt() won't end up with hard error when
          we walk into whatever had been under the mountpoint we'd missed.
        + mountpoint crossing towards root (in follow_dotdot_rcu()) is
          similar, but there we obviously don't care about automounts.
          Looking at it now, it might make sense to recheck mount_lock there
          as well, though - potential danger is to hit the moment when
          mnt_parent and mnt_mountpoint are out of sync, leaving us with
          mismatched vfsmount,dentry pair.  Generally, that will be caught
          when we try to leave lazy mode (legitimize_mnt() will fail) or
          to cross towards leaves, but the next crossing towards root
          might be bogus as well, and we could end up with unwarranted hard
          error.  Should be very hard to hit, but it's easy enough to check
          *and* backport, so it looks like a good idea for -stable.  Linus,
          do you have any objections against adding
                if (read_seqretry(&mount_lock, nd->m_seq))
                        goto failed;
          right after
                if (!follow_up_rcu(&nd->path))
                        break;
          in follow_dotdot_rcu()?  It's a very narrow race with mount --move
          and most of the time it ends up being completely harmless, but
          it's possible to construct a case when we'll get a bogus hard error
          instead of falling back to non-lazy walk...  OTOH, anyone doing
          that will get something inherently timing-dependent as the result,
          lazy mode or not.  I'm in favour of adding that check, but IMO it's
          not a critical problem.

* if we find that we can't continue in lazy mode because some verification
fails, we just fail with -ECHILD.  However, in cases when our current position
might be fine, but the next step can't be done in lazy mode, we attempt to
fall back to non-lazy without full restart that would be caused by -ECHILD.
That's what unlazy_walk() is.  Of course, if we reach the end of walk we
need to leave the lazy mode as well (complete_walk()).  Either can fail,
and such a failure means restart from scratch in non-lazy mode.  We need
to grab references on vfsmount and dentry (or two dentries, when we have
parent and child to deal with).  The interesting part is vfsmount - we need
to make sure we won't interfere with umount(2), having walked in that sucker
*after* umount(2) has checked that it's not busy.  See legitimize_mnt() for
details; basically, we have umount(2) mark the "I've verified it's not busy
and it's going to be killed no matter what" with MNT_SYNC_UMOUNT and rely
on RCU delays on that path - if we find one of those, we undo the
increment of its refcount we'd just done, without having dropped
rcu_read_lock().  Full-blown mntput() is done only on mismatches that
had _not_ been marked that way.

BTW, something similar to the above probably needs to be turned into coherent
text, either in parallel to Documentation/filesystems/path-lookup.txt, or
as update to it.

The reason why walk_component() in your call chain won't suffice is simple -
it will fail this
                /*
                 * This sequence count validates that the parent had no
                 * changes while we did the lookup of the dentry above.
                 *
                 * The memory barrier in read_seqcount_begin of child is
                 *  enough, we can use __read_seqcount_retry here.
                 */
                if (__read_seqcount_retry(&parent->d_seq, nd->seq))
                        return -ECHILD;
in lookup_fast(), just before updating nd->seq to new value.  Basically,
it has no way to tell if parent has been buggered to hell and back.

It's not _that_ hard to prevent - we just need to stop discarding the parent's
seq number in "need to follow a symlink" case of walk_component().  Will take
some massage, but not too much...
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Reply via email to