On Fri, May 20, 2016 at 05:19:27PM +0300, Konstantin Khorenko wrote:
> This patch is a step for CRIU to work under non-root user.

It allows us to restore memory mappings without using memfd in userns

> The idea is good, but there is a long-long way ahead and it even can finally 
> possible only after next "Virtuozzo 8" kernel appearance.
> 
> => let's not include preparation patches until feature of running CRIU under 
> non-root work on mainstream kernel as least.
> 
> Surely this is valid until such patches are not required due to some other 
> reasons as well.
> 
> --
> Best regards,
> 
> Konstantin Khorenko,
> Virtuozzo Linux Kernel Team
> 
> On 05/16/2016 10:42 PM, Andrey Vagin wrote:
> > Acked-by: Andrey Vagin <ava...@virtuozzo.com>
> > 
> > On Mon, May 16, 2016 at 11:28:51AM +0300, Cyrill Gorcunov wrote:
> > > This is a backport of commit
> > > 
> > > ML: bdb4d100afe9818aebd1d98ced575c5ef143456c
> > > 
> > > From: Calvin Owens <calvinow...@fb.com>
> > > 
> > > Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and is
> > > only exposed if CONFIG_CHECKPOINT_RESTORE is set.
> > > 
> > > Each mapped file region gets a symlink in /proc/<pid>/map_files/
> > > corresponding to the virtual address range at which it is mapped.  The
> > > symlinks work like the symlinks in /proc/<pid>/fd/, so you can follow them
> > > to the backing file even if that backing file has been unlinked.
> > > 
> > > Currently, files which are mapped, unlinked, and closed are impossible to
> > > stat() from userspace.  Exposing /proc/<pid>/map_files/ closes this
> > > functionality "hole".
> > > 
> > > Not being able to stat() such files makes noticing and explicitly
> > > accounting for the space they use on the filesystem impossible.  You can
> > > work around this by summing up the space used by every file in the
> > > filesystem and subtracting that total from what statfs() tells you, but
> > > that obviously isn't great, and it becomes unworkable once your filesystem
> > > becomes large enough.
> > > 
> > > This patch moves map_files/ out from behind CONFIG_CHECKPOINT_RESTORE, and
> > > adjusts the permissions enforced on it as follows:
> > > 
> > > * proc_map_files_lookup()
> > > * proc_map_files_readdir()
> > > * map_files_d_revalidate()
> > > 
> > >   Remove the CAP_SYS_ADMIN restriction, leaving only the current
> > >   restriction requiring PTRACE_MODE_READ. The information made
> > >   available to userspace by these three functions is already
> > >   available in /proc/PID/maps with MODE_READ, so I don't see any
> > >   reason to limit them any further (see below for more detail).
> > > 
> > > * proc_map_files_follow_link()
> > > 
> > >   This stub has been added, and requires that the user have
> > >   CAP_SYS_ADMIN in order to follow the links in map_files/,
> > >   since there was concern on LKML both about the potential for
> > >   bypassing permissions on ancestor directories in the path to
> > >   files pointed to, and about what happens with more exotic
> > >   memory mappings created by some drivers (ie dma-buf).
> > > 
> > > In older versions of this patch, I changed every permission check in
> > > the four functions above to enforce MODE_ATTACH instead of MODE_READ.
> > > This was an oversight on my part, and after revisiting the discussion
> > > it seems that nobody was concerned about anything outside of what is
> > > made possible by ->follow_link(). So in this version, I've left the
> > > checks for PTRACE_MODE_READ as-is.
> > > 
> > > [a...@linux-foundation.org: catch up with concurrent 
> > > proc_pid_follow_link() changes]
> > > Signed-off-by: Calvin Owens <calvinow...@fb.com>
> > > Reviewed-by: Kees Cook <keesc...@chromium.org>
> > > Cc: Andy Lutomirski <l...@amacapital.net>
> > > Cc: Cyrill Gorcunov <gorcu...@openvz.org>
> > > Cc: Joe Perches <j...@perches.com>
> > > Cc: Kirill A. Shutemov <kir...@shutemov.name>
> > > Signed-off-by: Andrew Morton <a...@linux-foundation.org>
> > > Signed-off-by: Linus Torvalds <torva...@linux-foundation.org>
> > > Signed-off-by: Cyrill Gorcunov <gorcu...@openvz.org>
> > > ---
> > > 
> > > Kostya, please wait for Ack from Andrew. The patch on its own is not
> > > bound to some of the bug we're working on now but usefull in general
> > > and probably will help us with renaming of memfd restored memory
> > > in criu (we use memfd to be able to restore anonymous shared memory
> > > in userns case but memfd mangles the backend name, we didn't find
> > > any problem with it yet, but been talking to Andrew and he agreed
> > > that we might need to do something with this problem, and this patch
> > > is first step).
> > > 
> > >   fs/proc/base.c |   44 +++++++++++++++++++++++---------------------
> > >   1 file changed, 23 insertions(+), 21 deletions(-)
> > > 
> > > Index: linux-pcs7.git/fs/proc/base.c
> > > ===================================================================
> > > --- linux-pcs7.git.orig/fs/proc/base.c
> > > +++ linux-pcs7.git/fs/proc/base.c
> > > @@ -1925,8 +1925,6 @@ end_instantiate:
> > >           return filldir(dirent, name, len, filp->f_pos, ino, type);
> > >   }
> > > 
> > > -#ifdef CONFIG_CHECKPOINT_RESTORE
> > > -
> > >   /*
> > >    * dname_to_vma_addr - maps a dentry name into two unsigned longs
> > >    * which represent vma start and end addresses.
> > > @@ -1953,11 +1951,6 @@ static int map_files_d_revalidate(struct
> > >           if (flags & LOOKUP_RCU)
> > >                   return -ECHILD;
> > > 
> > > - if (!capable(CAP_SYS_ADMIN)) {
> > > -         status = -EPERM;
> > > -         goto out_notask;
> > > - }
> > > -
> > >           inode = dentry->d_inode;
> > >           task = get_proc_task(inode);
> > >           if (!task)
> > > @@ -2048,6 +2041,28 @@ struct map_files_info {
> > >           unsigned char   name[4*sizeof(long)+2]; /* max: %lx-%lx\0 */
> > >   };
> > > 
> > > +/*
> > > + * Only allow CAP_SYS_ADMIN to follow the links, due to concerns about 
> > > how the
> > > + * symlinks may be used to bypass permissions on ancestor directories in 
> > > the
> > > + * path to the file in question.
> > > + */
> > > +static void *proc_map_files_follow_link(struct dentry *dentry, struct 
> > > nameidata *nd)
> > > +{
> > > + if (!capable(CAP_SYS_ADMIN))
> > > +         return ERR_PTR(-EPERM);
> > > +
> > > + return proc_pid_follow_link(dentry, nd);
> > > +}
> > > +
> > > +/*
> > > + * Identical to proc_pid_link_inode_operations except for follow_link()
> > > + */
> > > +static const struct inode_operations 
> > > proc_map_files_link_inode_operations = {
> > > + .readlink       = proc_pid_readlink,
> > > + .follow_link    = proc_map_files_follow_link,
> > > + .setattr        = proc_setattr,
> > > +};
> > > +
> > >   static struct dentry *
> > >   proc_map_files_instantiate(struct inode *dir, struct dentry *dentry,
> > >                              struct task_struct *task, const void *ptr)
> > > @@ -2063,7 +2078,7 @@ proc_map_files_instantiate(struct inode
> > >           ei = PROC_I(inode);
> > >           ei->op.proc_get_link = proc_map_files_get_link;
> > > 
> > > - inode->i_op = &proc_pid_link_inode_operations;
> > > + inode->i_op = &proc_map_files_link_inode_operations;
> > >           inode->i_size = 64;
> > >           inode->i_mode = S_IFLNK;
> > > 
> > > @@ -2087,10 +2102,6 @@ static struct dentry *proc_map_files_loo
> > >           struct dentry *result;
> > >           struct mm_struct *mm;
> > > 
> > > - result = ERR_PTR(-EPERM);
> > > - if (!capable(CAP_SYS_ADMIN))
> > > -         goto out;
> > > -
> > >           result = ERR_PTR(-ENOENT);
> > >           task = get_proc_task(dir);
> > >           if (!task)
> > > @@ -2143,10 +2154,6 @@ proc_map_files_readdir(struct file *filp
> > >           ino_t ino;
> > >           int ret;
> > > 
> > > - ret = -EPERM;
> > > - if (!capable(CAP_SYS_ADMIN))
> > > -         goto out;
> > > -
> > >           ret = -ENOENT;
> > >           task = get_proc_task(inode);
> > >           if (!task)
> > > @@ -2353,9 +2360,6 @@ static const struct file_operations proc
> > >           .release        = seq_release_private,
> > >   };
> > > 
> > > -
> > > -#endif /* CONFIG_CHECKPOINT_RESTORE */
> > > -
> > >   #ifdef CONFIG_VE
> > >   static long proc_aio_ioctl(struct file *file, unsigned int cmd, 
> > > unsigned long arg)
> > >   {
> > > @@ -2911,9 +2915,7 @@ static const struct inode_operations pro
> > >   static const struct pid_entry tgid_base_stuff[] = {
> > >           DIR("task",       S_IRUGO|S_IXUGO, proc_task_inode_operations, 
> > > proc_task_operations),
> > >           DIR("fd",         S_IRUSR|S_IXUSR, proc_fd_inode_operations, 
> > > proc_fd_operations),
> > > -#ifdef CONFIG_CHECKPOINT_RESTORE
> > >           DIR("map_files",  S_IRUSR|S_IXUSR, 
> > > proc_map_files_inode_operations, proc_map_files_operations),
> > > -#endif
> > >           DIR("fdinfo",     S_IRUSR|S_IXUSR, 
> > > proc_fdinfo_inode_operations, proc_fdinfo_operations),
> > >           DIR("ns",         S_IRUSR|S_IXUGO, 
> > > proc_ns_dir_inode_operations, proc_ns_dir_operations),
> > >   #ifdef CONFIG_NET
> > .
> > 
_______________________________________________
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel

Reply via email to