Re: [RFC][PATCH v2] mount: In propagate_umount handle overlapping mount propagation trees

2016-11-01 Thread Andrei Vagin
On Tue, Oct 25, 2016 at 04:45:44PM -0500, Eric W. Biederman wrote:
> Andrei Vagin  writes:
> >
> > From 8e0f45c0272aa1f789d1657a0acc98c58919dcc3 Mon Sep 17 00:00:00 2001
> > From: Andrei Vagin 
> > Date: Tue, 25 Oct 2016 13:57:31 -0700
> > Subject: [PATCH] mount: skip all mounts from a shared group if one is marked
> >
> > If we meet a marked mount, it means that all mounts from
> > its group have already been visited.
> >
> > Signed-off-by: Andrei Vagin 
> > ---
> >  fs/pnode.c | 18 +++---
> >  1 file changed, 15 insertions(+), 3 deletions(-)
> >
> > diff --git a/fs/pnode.c b/fs/pnode.c
> > index 8fd1a3f..ebb7134 100644
> > --- a/fs/pnode.c
> > +++ b/fs/pnode.c
> > @@ -426,10 +426,16 @@ static struct mount *propagation_visit_child(struct mount *last_child,
> >  		if (child && !IS_MNT_MARKED(child))
> >  			return child;
> >  
> > -		if (!child)
> > +		if (!child) {
> >  			m = propagation_next(m, origin);
> > -		else
> > +		} else {
> > +			if (IS_MNT_MARKED(child)) {
> > +				if (m->mnt_group_id == origin->mnt_group_id)
> > +					return NULL;
> > +				m = m->mnt_master;
> > +			}
> >  			m = propagation_next_sib(m, origin);
> > +		}
> >  	}
> >  	return NULL;
> >  }
> > @@ -456,8 +462,14 @@ static struct mount *propagation_revisit_child(struct mount *last_child,
> >  
> >  		if (!child)
> >  			m = propagation_next(m, origin);
> > -		else
> > +		else {
> > +			if (!IS_MNT_MARKED(child)) {
> > +				if (m->mnt_group_id == origin->mnt_group_id)
> > +					return NULL;
> > +				m = m->mnt_master;
> > +			}
> >  			m = propagation_next_sib(m, origin);
> > +		}
> >  	}
> >  	return NULL;
> >  }
> 
> That is certainly interesting.  The problem is that the reason we were
> going slow is that there were in fact mounts that had not been traversed
> in the share group.
> 
> And in fact the entire idea of visiting a vfsmount mountpoint pair
> exactly once is wrong in the face of shadow mounts.  For a vfsmount
> mountpoint pair that has shadow mounts the number of shadow mounts needs
> to be decreased by one each time the propagation tree is traversed
> during unmount. Which means that as far as I can see we have to kill
> shadow mounts to correctly optimize this code.  Once shadow mounts are
> gone I don't know of a case where we need your optimization.

I am not sure that shadow mounts currently work the way you describe
here. start_umount_propagation() doesn't remove a mount from mnt_hash,
so the second time around we will look up the same mount again.

Look at this script:

[root@fc24 mounts]# cat ./opus02.sh
set -e
mkdir -p /mnt
mount -t tmpfs zdtm /mnt
mkdir -p /mnt/A/a
mkdir -p /mnt/B/a
mount --bind --make-shared /mnt/A /mnt/A
mount --bind /mnt/A /mnt/B
mount --bind /mnt/A/a /mnt/A/a
mount --bind /mnt/A/a /mnt/A/a

umount -l /mnt/A
cat /proc/self/mountinfo | grep zdtm

[root@fc24 mounts]# unshare --propagation private -m ./opus02.sh
159 121 0:46 / /mnt rw,relatime - tmpfs zdtm rw
162 159 0:46 /A /mnt/B rw,relatime shared:67 - tmpfs zdtm rw
167 162 0:46 /A/a /mnt/B/a rw,relatime shared:67 - tmpfs zdtm rw

We mount nothing into /mnt/B, but when we umount everything from A, we
still have something in B.

Thanks,
Andrei
> 
> I am busily verifying my patch to kill shadow mounts but the following
> patch is the minimal version.  As far as I can see propagate_one
> is the only place we create shadow mounts, and holding the
> namespace_lock over attach_recursive_mnt, propagate_mnt, and
> propagate_one is sufficient for that __lookup_mnt to be completely safe.
> 
> diff --git a/fs/pnode.c b/fs/pnode.c
> index 234a9ac49958..b14119b370d4 100644
> --- a/fs/pnode.c
> +++ b/fs/pnode.c
> @@ -217,6 +217,9 @@ static int propagate_one(struct mount *m)
>  	/* skip if mountpoint isn't covered by it */
>  	if (!is_subdir(mp->m_dentry, m->mnt.mnt_root))
>  		return 0;
> +	/* skip if mountpoint already has a mount on it */
> +	if (__lookup_mnt(&m->mnt, mp->m_dentry))
> +		return 0;
>  	if (peers(m, last_dest)) {
>  		type = CL_MAKE_SHARED;
>  	} else {
> 
> If you run with that patch you will see that there are go faster stripes.
> 
> Eric
> 
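
The check added to propagate_one in the quoted hunk above can be
modelled in user space. This is only a sketch under stated assumptions:
struct mount_table, lookup_model() and propagate_one_model() are
hypothetical names invented for illustration, and a mount is reduced to
a (parent id, mountpoint string) pair rather than the kernel's real
structures. It shows the one idea of the hunk: if a lookup already
finds a mount on the mountpoint, propagation skips it instead of
stacking a shadow mount.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_MOUNTS 16

/* Hypothetical model: a mount is identified by its parent and mountpoint. */
struct mount_entry {
	int parent_id;
	const char *mountpoint;
};

struct mount_table {
	struct mount_entry entries[MAX_MOUNTS];
	size_t count;
};

/* Stand-in for __lookup_mnt(): first mount on (parent, mountpoint) or NULL. */
const struct mount_entry *lookup_model(const struct mount_table *t,
				       int parent_id, const char *mountpoint)
{
	size_t i;

	assert(t != NULL && mountpoint != NULL);
	for (i = 0; i < t->count; i++)
		if (t->entries[i].parent_id == parent_id &&
		    strcmp(t->entries[i].mountpoint, mountpoint) == 0)
			return &t->entries[i];
	return NULL;
}

/*
 * Stand-in for the patched propagate_one(): returns 1 if a mount was
 * added, 0 if the mountpoint was already covered (so no shadow mount is
 * created), and -1 if the table is full.
 */
int propagate_one_model(struct mount_table *t, int parent_id,
			const char *mountpoint)
{
	if (lookup_model(t, parent_id, mountpoint))
		return 0;
	if (t->count == MAX_MOUNTS)
		return -1;
	t->entries[t->count].parent_id = parent_id;
	t->entries[t->count].mountpoint = mountpoint;
	t->count++;
	return 1;
}
```

Propagating twice to the same parent and mountpoint adds one entry in
this model; the second call is a no-op, which is the user-visible
change in behaviour that the thread discusses.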


Re: [RFC][PATCH v2] mount: In propagate_umount handle overlapping mount propagation trees

2016-10-25 Thread Eric W. Biederman
Andrei Vagin  writes:

> On Tue, Oct 25, 2016 at 04:45:44PM -0500, Eric W. Biederman wrote:
>> That is certainly interesting.  The problem is that the reason we were
>> going slow is that there were in fact mounts that had not been traversed
>> in the share group.
>
> You are right.
>
>> 
>> And in fact the entire idea of visiting a vfsmount mountpoint pair
>> exactly once is wrong in the face of shadow mounts.  For a vfsmount
>> mountpoint pair that has shadow mounts the number of shadow mounts needs
>> to be decreased by one each time the propagation tree is traversed
>> during unmount. Which means that as far as I can see we have to kill
>> shadow mounts to correctly optimize this code.  Once shadow mounts are
>> gone I don't know of a case where we need your optimization.
>
> Without shadow mounts, it will be hard to preserve predictable behaviour
> for cases like this:
>
> $ unshare --propagation private -m sh test.sh
> + mount -t tmpfs --make-shared  A
> + mkdir A/a
> + mount -t tmpfs  A/a
> + mount --bind A B
> + mount -t tmpfs  B/a
> + grep 
> + cat /proc/self/mountinfo
> 155 123 0:44 / /root/tmp/A rw,relatime shared:70 - tmpfs  rw
> 156 155 0:45 / /root/tmp/A/a rw,relatime shared:71 - tmpfs  rw
> 157 123 0:44 / /root/tmp/B rw,relatime shared:70 - tmpfs  rw
> 158 157 0:46 / /root/tmp/B/a rw,relatime shared:72 - tmpfs  rw
> 159 155 0:46 / /root/tmp/A/a rw,relatime shared:72 - tmpfs  rw
> + umount B/a
> + grep 
> + cat /proc/self/mountinfo
> 155 123 0:44 / /root/tmp/A rw,relatime shared:70 - tmpfs  rw
> 156 155 0:45 / /root/tmp/A/a rw,relatime shared:71 - tmpfs  rw
> 157 123 0:44 / /root/tmp/B rw,relatime shared:70 - tmpfs  rw
>
> X + a - a = X
>
> Maybe we need to add another ID for propagated mounts and when we
> do umount, we will detach only mounts with the same propagation id.
>
> I support the idea to kill shadow mounts. I guess it will help us to
> simplify the algorithm of dumping and restoring a mount tree in CRIU.
>
> Currently it is a big pain for us.

Killing shadow mounts is not exactly a done deal as there are some user
visible effects.  The practical question becomes do we break anything
anyone cares about in userspace.  Answering those practical questions
sucks.

I definitely think we should try to kill shadow mounts because they are
such a big pain to deal with, and only provide very limited value.

So far the only thing I have seen shadow mounts being good for is
preserving unmount behavior in cases where someone has constructed an
artificially evil mount tree. I haven't figured out how any of those
mount trees are actually useful in real life.

Eric





Re: [RFC][PATCH v2] mount: In propagate_umount handle overlapping mount propagation trees

2016-10-25 Thread Andrei Vagin
On Sat, Oct 22, 2016 at 02:42:03PM -0500, Eric W. Biederman wrote:
> 
> Andrei,
> 
> This fixes the issue you have reported and through a refactoring
> makes the code simpler and easier to verify.  That said I find your
> last test case very interesting.   While looking at it in detail
> I have realized I don't fully understand why we have both lookup_mnt and
> lookup_mnt_last, so I can't say that this change is fully correct.
> 
> Outside of propagate_umount I don't have concerns but I am not 100%
> convinced that my change to lookup_mnt_last does the right thing
> in the case of propagate_umount.
> 
> I do see why your last test case scales badly.  Long chains of shared
> mounts that we can't skip.  At the same time I don't really understand
> that case.  Part of it has to do with multiple child mounts of the same
> mount on the same mountpoint.
> 
> So I am working through my concerns.  In the meantime I figured it
> would be useful to post this version, as this version is clearly better
> than the versions of this change that have come before it.

Hi Eric,

I have tested this version and it works fine.

As for the last test case, could you look at the attached patch?
The idea is that we can skip all mounts from a shared group if one
of them is already marked.

> 
> Eric
> 
> From: "Eric W. Biederman" 
> Date: Thu, 13 Oct 2016 13:27:19 -0500
> 
> Andrei Vagin pointed out that the time to execute propagate_umount can go
> non-linear (and take a ludicrous amount of time) when the mount
> propagation trees of the mounts to be unmounted by a lazy unmount
> overlap.
> 
> While investigating the horrible performance I realized that in
> the case of overlapping mount trees since the addition of locked
> mount support the code has been failing to unmount all of the
> mounts it should have been unmounting.
> 
> Make the walk of the mount propagation trees nearly linear by using
> MNT_MARK to mark pieces of the mount propagation trees that have
> already been visited, allowing subsequent walks to skip over
> subtrees.
> 
> Make the processing of mounts order independent by adding a list of
> mount entries that need to be unmounted, and simply adding a mount to
> that list when it becomes apparent the mount can safely be unmounted.
> For mounts that are locked on other mounts but otherwise could be
> unmounted move them from their parent's mnt_mounts to mnt_umounts so
> that if and when their parent becomes unmounted these mounts can be
> added to the list of mounts to unmount.
> 
> Add a final pass to clear MNT_MARK and to restore mnt_mounts
> from mnt_umounts for anything that did not get unmounted.
> 
> Add the functions propagation_visit_next and propagation_revisit_next
> to coordinate walking of the mount tree and setting and clearing the
> mount mark.
> 
> The skipping of already unmounted mounts has been moved from
> __lookup_mnt_last to mark_umount_candidates, so that the new
> propagation functions can notice when the propagation tree passes
> through the initial set of unmounted mounts.  Except in umount_tree as
> part of the unmounting process the only place where unmounted mounts
> should be found are in unmounted subtrees.  All of the other callers
> of __lookup_mnt_last are from mounted subtrees so not checking for
> unmounted mounts should not affect them.
> 
> A script to generate overlapping mount propagation trees:
> $ cat run.sh
> mount -t tmpfs test-mount /mnt
> mount --make-shared /mnt
> for i in `seq $1`; do
> mkdir /mnt/test.$i
> mount --bind /mnt /mnt/test.$i
> done
> cat /proc/mounts | grep test-mount | wc -l
> time umount -l /mnt
> $ for i in `seq 10 16`; do echo $i; unshare -Urm bash ./run.sh $i; done
> 
> Here are the performance numbers with and without the patch:
> 
>  mhash |    8192 |   8192 | 8192        | 131072 | 131072      | 104857 | 104857
> mounts |  before |  after | after (sys) |  after | after (sys) |  after | after (sys)
> -------+---------+--------+-------------+--------+-------------+--------+------------
>   1024 |  0.071s | 0.020s | 0.000s      | 0.022s | 0.004s      | 0.020s | 0.004s
>   2048 |  0.184s | 0.022s | 0.004s      | 0.023s | 0.004s      | 0.022s | 0.008s
>   4096 |  0.604s | 0.025s | 0.020s      | 0.029s | 0.008s      | 0.026s | 0.004s
>   8912 |  4.471s | 0.053s | 0.020s      | 0.051s | 0.024s      | 0.047s | 0.016s
>  16384 | 34.826s | 0.088s | 0.060s      | 0.081s | 0.048s      | 0.082s | 0.052s
>  32768 |         | 0.216s | 0.172s      | 0.160s | 0.124s      | 0.160s | 0.096s
>  65536 |         | 0.819s | 0.726s      | 0.330s | 0.260s      | 0.338s | 0.256s
> 131072 |         | 4.502s | 4.168s      | 0.707s | 0.580s      | 0.709s | 0.592s
> 
> Andrei Vagin reports fixing the performance problem is part of the
> work to fix CVE-2016-6213.
> 
> A script for a pathological set of mounts:
> 
> $ cat pathological.sh
> 
> mount -t tmpfs base /mnt
> mount --make-shared /mnt
> mkdir -p /mnt/b
> 
> mount -t tmpfs 

[RFC][PATCH v2] mount: In propagate_umount handle overlapping mount propagation trees

2016-10-22 Thread Eric W. Biederman

Andrei,

This fixes the issue you have reported and through a refactoring
makes the code simpler and easier to verify.  That said I find your
last test case very interesting.   While looking at it in detail
I have realized I don't fully understand why we have both lookup_mnt and
lookup_mnt_last, so I can't say that this change is fully correct.

Outside of propagate_umount I don't have concerns but I am not 100%
convinced that my change to lookup_mnt_last does the right thing
in the case of propagate_umount.

I do see why your last test case scales badly.  Long chains of shared
mounts that we can't skip.  At the same time I don't really understand
that case.  Part of it has to do with multiple child mounts of the same
mount on the same mountpoint.

So I am working through my concerns.  In the meantime I figured it
would be useful to post this version, as this version is clearly better
than the versions of this change that have come before it.

Eric

From: "Eric W. Biederman" 
Date: Thu, 13 Oct 2016 13:27:19 -0500

Andrei Vagin pointed out that the time to execute propagate_umount can go
non-linear (and take a ludicrous amount of time) when the mount
propagation trees of the mounts to be unmounted by a lazy unmount
overlap.

While investigating the horrible performance I realized that in
the case of overlapping mount trees since the addition of locked
mount support the code has been failing to unmount all of the
mounts it should have been unmounting.

Make the walk of the mount propagation trees nearly linear by using
MNT_MARK to mark pieces of the mount propagation trees that have
already been visited, allowing subsequent walks to skip over
subtrees.
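
As a rough illustration of why marking helps, here is a user-space
sketch. It is not the kernel code: struct node, walk() and
walk_all_overlapping() are hypothetical names, and the "propagation
tree" is simplified to a chain in which every node is the root of one
walk, so the walks overlap as much as possible. It assumes that a
marked node's whole subtree was covered by the walk that marked it,
which holds in this model.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a mount in a propagation tree. */
struct node {
	struct node *first_child;
	struct node *next_sibling;
	int marked;		/* plays the role of MNT_MARK */
};

/*
 * Walk the subtree rooted at n and return the number of nodes visited.
 * When use_marks is nonzero, a node that is already marked is skipped
 * together with its whole subtree, and every visited node is marked.
 */
unsigned long walk(struct node *n, int use_marks)
{
	unsigned long visited = 1;
	struct node *c;

	assert(n != NULL);
	if (use_marks) {
		if (n->marked)
			return 0;
		n->marked = 1;
	}
	for (c = n->first_child; c; c = c->next_sibling)
		visited += walk(c, use_marks);
	return visited;
}

/*
 * Link nodes[0..count-1] into a chain (each node is the only child of
 * the previous one), clear all marks, then start one walk from every
 * node. Returns the total number of node visits over all walks.
 */
unsigned long walk_all_overlapping(struct node *nodes, size_t count,
				   int use_marks)
{
	unsigned long total = 0;
	size_t i;

	for (i = 0; i < count; i++) {
		nodes[i].first_child = (i + 1 < count) ? &nodes[i + 1] : NULL;
		nodes[i].next_sibling = NULL;
		nodes[i].marked = 0;
	}
	for (i = 0; i < count; i++)
		total += walk(&nodes[i], use_marks);
	return total;
}
```

For a chain of n nodes the unmarked total is n(n+1)/2 visits, while the
marked total is n, since only the first walk does any work.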

Make the processing of mounts order independent by adding a list of
mount entries that need to be unmounted, and simply adding a mount to
that list when it becomes apparent the mount can safely be unmounted.
For mounts that are locked on other mounts but otherwise could be
unmounted move them from their parent's mnt_mounts to mnt_umounts so
that if and when their parent becomes unmounted these mounts can be
added to the list of mounts to unmount.
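
The deferral described above can be sketched in user space as follows.
This is an illustrative model only: struct mnt, enum state,
umount_candidate() and restore_deferred() are hypothetical names, and
the states DEFERRED and TO_UMOUNT merely stand in for a mount sitting
on its parent's mnt_umounts list or on the list of mounts to unmount.
The point it tries to show is that the result does not depend on the
order in which the candidates are seen.

```c
#include <assert.h>
#include <stddef.h>

enum state { MOUNTED, DEFERRED, TO_UMOUNT };

/* Hypothetical stand-in for a mount. */
struct mnt {
	struct mnt *parent;	/* NULL for the root of the model */
	int locked;		/* nonzero: may only go away with its parent */
	enum state state;
};

/*
 * Record that m could be unmounted. An unlocked mount, or a locked
 * mount whose parent is already queued, is queued for unmounting.
 * A locked mount whose parent is still mounted is only deferred.
 * Queuing a mount releases any children that were deferred on it.
 */
void umount_candidate(struct mnt *all, size_t count, struct mnt *m)
{
	size_t i;

	assert(all != NULL && m != NULL);
	if (m->state == TO_UMOUNT)
		return;
	if (m->locked && m->parent && m->parent->state != TO_UMOUNT) {
		m->state = DEFERRED;
		return;
	}
	m->state = TO_UMOUNT;
	for (i = 0; i < count; i++)
		if (all[i].parent == m && all[i].state == DEFERRED)
			umount_candidate(all, count, &all[i]);
}

/* Final pass: anything still deferred stays mounted after all. */
void restore_deferred(struct mnt *all, size_t count)
{
	size_t i;

	for (i = 0; i < count; i++)
		if (all[i].state == DEFERRED)
			all[i].state = MOUNTED;
}
```

With a locked child and its parent, calling umount_candidate() on the
child first and the parent second, or the other way round, queues both;
calling it on the child alone leaves both mounted once
restore_deferred() has run.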

Add a final pass to clear MNT_MARK and to restore mnt_mounts
from mnt_umounts for anything that did not get unmounted.

Add the functions propagation_visit_next and propagation_revisit_next
to coordinate walking of the mount tree and setting and clearing the
mount mark.

The skipping of already unmounted mounts has been moved from
__lookup_mnt_last to mark_umount_candidates, so that the new
propagation functions can notice when the propagation tree passes
through the initial set of unmounted mounts.  Except in umount_tree as
part of the unmounting process the only place where unmounted mounts
should be found are in unmounted subtrees.  All of the other callers
of __lookup_mnt_last are from mounted subtrees so not checking for
unmounted mounts should not affect them.

A script to generate overlapping mount propagation trees:
$ cat run.sh
mount -t tmpfs test-mount /mnt
mount --make-shared /mnt
for i in `seq $1`; do
mkdir /mnt/test.$i
mount --bind /mnt /mnt/test.$i
done
cat /proc/mounts | grep test-mount | wc -l
time umount -l /mnt
$ for i in `seq 10 16`; do echo $i; unshare -Urm bash ./run.sh $i; done

Here are the performance numbers with and without the patch:

 mhash |    8192 |   8192 | 8192        | 131072 | 131072      | 104857 | 104857
mounts |  before |  after | after (sys) |  after | after (sys) |  after | after (sys)
-------+---------+--------+-------------+--------+-------------+--------+------------
  1024 |  0.071s | 0.020s | 0.000s      | 0.022s | 0.004s      | 0.020s | 0.004s
  2048 |  0.184s | 0.022s | 0.004s      | 0.023s | 0.004s      | 0.022s | 0.008s
  4096 |  0.604s | 0.025s | 0.020s      | 0.029s | 0.008s      | 0.026s | 0.004s
  8912 |  4.471s | 0.053s | 0.020s      | 0.051s | 0.024s      | 0.047s | 0.016s
 16384 | 34.826s | 0.088s | 0.060s      | 0.081s | 0.048s      | 0.082s | 0.052s
 32768 |         | 0.216s | 0.172s      | 0.160s | 0.124s      | 0.160s | 0.096s
 65536 |         | 0.819s | 0.726s      | 0.330s | 0.260s      | 0.338s | 0.256s
131072 |         | 4.502s | 4.168s      | 0.707s | 0.580s      | 0.709s | 0.592s

Andrei Vagin reports fixing the performance problem is part of the
work to fix CVE-2016-6213.

A script for a pathological set of mounts:

$ cat pathological.sh

mount -t tmpfs base /mnt
mount --make-shared /mnt
mkdir -p /mnt/b

mount -t tmpfs test1 /mnt/b
mount --make-shared /mnt/b
mkdir -p /mnt/b/10

mount -t tmpfs test2 /mnt/b/10
mount --make-shared /mnt/b/10
mkdir -p /mnt/b/10/20

mount --rbind /mnt/b /mnt/b/10/20

unshare -Urm sleep 2
umount -l /mnt/b
wait %%

$ unshare -Urm pathological.sh

Cc: sta...@vger.kernel.org
Fixes: a05964f3917c ("[PATCH] shared mounts handling: umount")
Fixes: 0c56fe31420c ("mnt: Don't propagate unmounts to locked mounts")
Reported-by: Andrei Vagin 
Signed-off-by: "Eric W. Biederman" 
---
 fs/mount.h |   1 +
 fs/namespace.c