Re: [PATCH RFC] mnt: umount mounts one by one in umount_tree()

2017-06-14 Thread Eric W. Biederman
Ian Kent writes:

> On Fri, 2017-05-12 at 00:08 -0700, Andrei Vagin wrote:
>> With this patch, we don't try to umount all mounts of a tree together.
>> Instead, we umount them one by one. This gives a significant
>> improvement in performance for the worst case.
>
> Indeed, umount has been very slow for a while now.
> Even a moderately large number of mounts (~1) becomes painfully slow to umount.
>
> Are you still pursuing this?
> Anything I can do to help?
>
> Eric, what are your thoughts on this latest attempt?

I have something slightly more recent.  Please checkout my for-next
branch of my userns tree:

git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-next
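
(For testing, "git fetch <that URL> for-next" followed by
"git checkout FETCH_HEAD" in an existing kernel checkout should be
enough to end up on that branch.)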

There is one open area of semantics that I am looking at with Ram Pai in
the hopes we can drive consensus before we take any patches for
better checkpoint-restart support.

Eric



Re: [PATCH RFC] mnt: umount mounts one by one in umount_tree()

2017-06-13 Thread Ian Kent
On Fri, 2017-05-12 at 00:08 -0700, Andrei Vagin wrote:
> With this patch, we don't try to umount all mounts of a tree together.
> Instead, we umount them one by one. This gives a significant
> improvement in performance for the worst case.

Indeed, umount has been very slow for a while now.
Even a moderately large number of mounts (~1) becomes painfully slow to umount.

Are you still pursuing this?
Anything I can do to help?

Eric, what are your thoughts on this latest attempt?

> 
> The reason for this optimization is that umount() can hold namespace_sem
> for a long time. This semaphore is global, so it affects all users.
> Recently Eric W. Biederman added a per-mount-namespace limit on the
> number of mounts; the default is 100,000 mounts per namespace. That
> limit still allows constructing a tree which requires hours to be
> umounted.
> 
> In the worst case the current complexity of umount_tree() is O(n^3):
> * Enumerate all mounts in the target tree (propagate_umount)
> * Enumerate mounts to find where these changes have to
>   be propagated (mark_umount_candidates)
> * Enumerate mounts to find the required mount by parent and dentry
>   (__lookup_mnt). __lookup_mnt() searches for a mount in m_hash, but
>   the number of mounts is much bigger than the size of the hash.
> 
> The worst case is when all mounts of the tree live in the same shared
> group. In that case we have to enumerate all mounts on each step.
> 
> There is CVE-2016-6213 about this case.
> 
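
As a rough sanity check on that scaling, here is a tiny stand-alone model
(plain userspace C, not the kernel code): if each of the three enumerations
above is O(n) in this worst case, the total work grows as n^3, so doubling
the size of the tree should cost roughly 8x the time. The later jumps in
the "without this patch" timings below (0.75s -> 5.96s -> 34s -> 3m46s)
are in that ballpark.

 #include <stdio.h>

 /* Toy cost model, not kernel code: n mounts, all in one shared peer
  * group, and a mount hash table much smaller than n, so every
  * __lookup_mnt()-style lookup degenerates into an O(n) chain walk. */
 int main(void)
 {
         for (long long n = 1024; n <= 32768; n *= 2)
                 printf("n = %6lld  modelled steps = %15lld\n", n, n * n * n);
         return 0;
 }
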
> Here are results for the kernel with this patch
> $ for i in `seq 10 15`; do  unshare -m sh ./run.sh $i; done
> umount -l /mnt/1 ->   0:00.00
> umount -l /mnt/1 ->   0:00.01
> umount -l /mnt/1 ->   0:00.01
> umount -l /mnt/1 ->   0:00.03
> umount -l /mnt/1 ->   0:00.07
> umount -l /mnt/1 ->   0:00.14
> 
> Here are results for the kernel without this patch
> $ for i in `seq 10 15`; do  unshare -m sh ./run.sh $i; done
> umount -l /mnt/1 ->   0:00.04
> umount -l /mnt/1 ->   0:00.17
> umount -l /mnt/1 ->   0:00.75
> umount -l /mnt/1 ->   0:05.96
> umount -l /mnt/1 ->   0:34.40
> umount -l /mnt/1 ->   3:46.27
> 
> And here is a test script:
> $ cat run.sh
> set -e -m
> 
> mount -t tmpfs zdtm /mnt
> mkdir -p /mnt/1 /mnt/2
> mount -t tmpfs zdtm /mnt/1
> mount --make-shared /mnt/1
> mkdir /mnt/1/1
> 
> for i in `seq $1`; do
>   ./mount --bind /mnt/1/1 /mnt/1/1
> done
> 
> echo -n "umount -l /mnt/1 -> "
> /usr/bin/time -f '%E' ./umount -l /mnt/1
> 
> We need these simple mount and umount tools because the standard ones
> read /proc/self/mountinfo, which is extremely slow when there are
> thousands of mounts.
> $ cat mount.c
>  #include <stddef.h>
>  #include <sys/mount.h>
> 
>  int main(int argc, char **argv)
>  {
>   return mount(argv[2], argv[3], NULL, MS_BIND, NULL);
>  }
> 
> $ cat umount.c
>  #include <sys/mount.h>
> 
>  int main(int argc, char **argv)
>  {
>   return umount2(argv[2], MNT_DETACH);
>  }
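
(Assuming a C compiler is available, something like "cc -o mount mount.c"
and "cc -o umount umount.c" in the same directory as run.sh should be all
that is needed; run.sh invokes them as ./mount and ./umount.)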
> 
> Here is a previous attempt to optimize this code:
> https://lkml.org/lkml/2016/10/10/495
> 
> Signed-off-by: Andrei Vagin 
> ---
>  fs/namespace.c | 81 +++---
>  1 file changed, 43 insertions(+), 38 deletions(-)
> 
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 3bf0cd2..4e6f258 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -1474,56 +1474,61 @@ static bool disconnect_mount(struct mount *mnt, enum umount_tree_flags how)
>   */
>  static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
>  {
> - LIST_HEAD(tmp_list);
>   struct mount *p;
> + int done = 0;
>  
>   if (how & UMOUNT_PROPAGATE)
>   propagate_mount_unlock(mnt);
>  
>   /* Gather the mounts to umount */
> - for (p = mnt; p; p = next_mnt(p, mnt)) {
> + while (!done) {
> + LIST_HEAD(tmp_list);
> +
> + p = mnt;
> + while (!list_empty(&p->mnt_mounts))
> + p = list_entry(p->mnt_mounts.next, struct mount, mnt_child);
> +
>   p->mnt.mnt_flags |= MNT_UMOUNT;
>   list_move(&p->mnt_list, &tmp_list);
> - }
> -
> - /* Hide the mounts from mnt_mounts */
> - list_for_each_entry(p, &tmp_list, mnt_list) {
>   list_del_init(&p->mnt_child);
> - }
>  
> - /* Add propogated mounts to the tmp_list */
> - if (how & UMOUNT_PROPAGATE)
> - propagate_umount(&tmp_list);
> -
> - while (!list_empty(&tmp_list)) {
> - struct mnt_namespace *ns;
> - bool disconnect;
> - p = list_first_entry(&tmp_list, struct mount, mnt_list);
> - list_del_init(&p->mnt_expire);
> - list_del_init(&p->mnt_list);
> - ns = p->mnt_ns;
> - if (ns) {
> - ns->mounts--;
> - __touch_mnt_namespace(ns);
> - }
> - p->mnt_ns = NULL;
> - if (how & UMOUNT_SYNC)
> - p->mnt.mnt_flags |= MNT_SYNC_UMOUNT;
> -
>
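
As far as the quoted part of the hunk shows, the new loop repeatedly
descends the mnt_mounts child list to a leaf and takes that single mount
off the tree, instead of gathering the whole tree in one pass. A tiny
userspace sketch of that leaf-first, one-by-one order (illustrative only,
not the kernel code, and ignoring locking and propagation) might look like:

 #include <stdio.h>

 struct node {
         const char *name;
         struct node *parent;
         struct node *child;      /* first child */
         struct node *sibling;    /* next child of the same parent */
 };

 static void attach(struct node *parent, struct node *n)
 {
         n->parent = parent;
         n->sibling = parent->child;
         parent->child = n;
 }

 static void detach(struct node *n)
 {
         struct node **pp = &n->parent->child;

         while (*pp != n)
                 pp = &(*pp)->sibling;
         *pp = n->sibling;        /* unhook from the parent's child list */
 }

 int main(void)
 {
         struct node root = { "/mnt" }, m1 = { "/mnt/1" }, m11 = { "/mnt/1/1" };

         attach(&root, &m1);
         attach(&m1, &m11);

         /* "umount" the subtree rooted at /mnt/1, one mount per pass */
         for (;;) {
                 struct node *p = &m1;

                 while (p->child)         /* descend to a leaf */
                         p = p->child;
                 printf("umounting %s\n", p->name);
                 if (p == &m1)            /* reached the subtree root: done */
                         break;
                 detach(p);
         }
         return 0;
 }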