On 2019-11-05, Aleksa Sarai <cyp...@cyphar.com> wrote:
> This patchset is being developed here:
>   <https://github.com/cyphar/linux/tree/openat2/master>
> 
> Patch changelog:
>  v15:
>   * Fix code style for LOOKUP_IN_ROOT handling in path_init(). [Linus
>     Torvalds]
>   * Split out patches for each individual LOOKUP flag.
>   * Reword commit messages to give more background information about the
>     series, as well as mention the semantics of each flag in more detail.
>  v14: <https://lore.kernel.org/lkml/20191010054140.8483-1-cyp...@cyphar.com/>
>       <https://lore.kernel.org/lkml/20191026185700.10708-1-cyp...@cyphar.com>
>  v13: <https://lore.kernel.org/lkml/20190930183316.10190-1-cyp...@cyphar.com/>
>  v12: <https://lore.kernel.org/lkml/20190904201933.10736-1-cyp...@cyphar.com/>
>  v11: <https://lore.kernel.org/lkml/20190820033406.29796-1-cyp...@cyphar.com/>
>       <https://lore.kernel.org/lkml/20190728010207.9781-1-cyp...@cyphar.com/>
>  v10: <https://lore.kernel.org/lkml/20190719164225.27083-1-cyp...@cyphar.com/>
>  v09: <https://lore.kernel.org/lkml/20190706145737.5299-1-cyp...@cyphar.com/>
>  v08: <https://lore.kernel.org/lkml/20190520133305.11925-1-cyp...@cyphar.com/>
>  v07: <https://lore.kernel.org/lkml/20190507164317.13562-1-cyp...@cyphar.com/>
>  v06: <https://lore.kernel.org/lkml/20190506165439.9155-1-cyp...@cyphar.com/>
>  v05: <https://lore.kernel.org/lkml/20190320143717.2523-1-cyp...@cyphar.com/>
>  v04: <https://lore.kernel.org/lkml/20181112142654.341-1-cyp...@cyphar.com/>
>  v03: <https://lore.kernel.org/lkml/20181009070230.12884-1-cyp...@cyphar.com/>
>  v02: <https://lore.kernel.org/lkml/20181009065300.11053-1-cyp...@cyphar.com/>
>  v01: <https://lore.kernel.org/lkml/20180929103453.12025-1-cyp...@cyphar.com/>
> 
> For a very long time, extending openat(2) with new features has been
> incredibly frustrating. This stems from the fact that openat(2) is
> possibly the most famous counter-example to the mantra "don't silently
> accept garbage from userspace" -- it doesn't check whether unknown flags
> are present[1].
> 
> This means that (generally) the addition of new flags to openat(2) has
> been fraught with backwards-compatibility issues (O_TMPFILE has to be
> defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
> kernels gave errors, since it's insecure to silently ignore the
> flag[2]). All new security-related flags therefore have a tough road to
> being added to openat(2).
> 
> Furthermore, the need for some sort of control over VFS's path resolution (to
> avoid malicious paths resulting in inadvertent breakouts) has been a very
> long-standing desire of many userspace applications. This patchset is a
> revival of Al Viro's old AT_NO_JUMPS[3] patchset (which was a variant of David
> Drysdale's O_BENEATH patchset[4] which was a spin-off of the Capsicum
> project[5]) with a few additions and changes made based on the previous
> discussion within [6] as well as others I felt were useful.
> 
> In line with the conclusions of the original discussion of AT_NO_JUMPS, the
> flag has been split up into separate flags. However, instead of being an
> openat(2) flag it is provided through a new syscall openat2(2) which provides
> several other improvements to the openat(2) interface (see the patch
> description for more details). The following new LOOKUP_* flags are added:
> 
>   * LOOKUP_NO_XDEV blocks all mountpoint crossings (upwards, downwards,
>     or through absolute links). Absolute pathnames alone in openat(2) do not
>     trigger this. Magic-link traversal which implies a vfsmount jump is also
>     blocked (though magic-link jumps on the same vfsmount are permitted).
> 
>   * LOOKUP_NO_MAGICLINKS blocks resolution through /proc/$pid/fd-style
>     links. This is done by blocking the usage of nd_jump_link() during
>     resolution in a filesystem. The term "magic-links" is used to match
>     with the only reference to these links in Documentation/, but I'm
>     happy to change the name.
> 
>     It should be noted that this is different to the scope of
>     ~LOOKUP_FOLLOW in that it applies to all path components. However,
>     you can do openat2(NO_FOLLOW|NO_MAGICLINKS) on a magic-link and it
>     will *not* fail (assuming that no parent component was a
>     magic-link), and you will have an fd for the magic-link.
> 
>     In order to correctly detect magic-links, the introduction of a new
>     LOOKUP_MAGICLINK_JUMPED state flag was required.
> 
>   * LOOKUP_BENEATH disallows escapes to outside the starting dirfd's
>     tree, using techniques such as ".." or absolute links. Absolute
>     paths in openat(2) are also disallowed. Conceptually this flag is to
>     ensure you "stay below" a certain point in the filesystem tree --
>     but this requires some additional checks to protect against various
>     races that would allow escape using "..".
> 
>     Currently LOOKUP_BENEATH implies LOOKUP_NO_MAGICLINKS, because
>     magic-links can trivially beam you around the filesystem (breaking
>     the protection). In future, there might be similar safety checks done
>     as in LOOKUP_IN_ROOT, but that requires more discussion.
> 
> In addition, two new flags are added that expand on the above ideas:
> 
>   * LOOKUP_NO_SYMLINKS does what it says on the tin. No symlink
>     resolution is allowed at all, including magic-links. Just as with
>     LOOKUP_NO_MAGICLINKS this can still be used with NOFOLLOW to open an
>     fd for the symlink as long as no parent path had a symlink
>     component.
> 
>   * LOOKUP_IN_ROOT is an extension of LOOKUP_BENEATH that, rather than
>     blocking attempts to move past the root, forces all such movements
>     to be scoped to the starting point. This provides chroot(2)-like
>     protection but without the cost of a chroot(2) for each filesystem
>     operation, as well as being safe against race attacks that chroot(2)
>     is not.
> 
>     If a race is detected (as with LOOKUP_BENEATH) then an error is
>     generated, and similar to LOOKUP_BENEATH it is not permitted to cross
>     magic-links with LOOKUP_IN_ROOT.
> 
>     The primary need for this is from container runtimes, which
>     currently need to do symlink scoping in userspace[7] when opening
>     paths in a potentially malicious container. There is a long list of
>     CVEs that could have been mitigated by having RESOLVE_IN_ROOT
>     (such as CVE-2017-1002101, CVE-2017-1002102, CVE-2018-15664, and
>     CVE-2019-5736, just to name a few).
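
To make the difference between the two scoping flags concrete, here is a
minimal userspace sketch that models only the lexical part (".." and
absolute paths). It is not the kernel's algorithm: it ignores symlinks,
mounts and rename races entirely, and lexical_scope() and enum scope_mode
are hypothetical names made up for this example.

```c
#include <errno.h>
#include <string.h>

/* Hypothetical names, used only for this illustration. */
enum scope_mode { SCOPE_BENEATH, SCOPE_IN_ROOT };

/*
 * Walk the components of `path` and track how far below the starting
 * directory we are. Returns the final depth (>= 0), or -EXDEV if the
 * path would escape the starting directory in SCOPE_BENEATH mode.
 */
static int lexical_scope(const char *path, enum scope_mode mode)
{
    int depth = 0;
    const char *p = path;

    /* BENEATH rejects absolute paths; IN_ROOT treats "/" as the start. */
    if (*p == '/' && mode == SCOPE_BENEATH)
        return -EXDEV;

    while (*p) {
        size_t len = strcspn(p, "/");

        if (len == 2 && p[0] == '.' && p[1] == '.') {
            if (depth > 0)
                depth--;
            else if (mode == SCOPE_BENEATH)
                return -EXDEV;  /* ".." would leave the tree */
            /* IN_ROOT: ".." at the root stays at the root. */
        } else if (len > 0 && !(len == 1 && p[0] == '.')) {
            depth++;            /* ordinary component */
        }

        p += len;
        while (*p == '/')
            p++;                /* skip separators */
    }
    return depth;
}
```

With this model "a/../../etc" is rejected under SCOPE_BENEATH, while under
SCOPE_IN_ROOT the extra ".." is clamped and the walk ends one level below
the starting directory.
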
> 
> In order to make all of the above more usable, I'm working on
> libpathrs[8] which is a C-friendly library for safe path resolution. It
> features a userspace-emulated backend if the kernel doesn't support
> openat2(2). Hopefully we can get userspace to switch to using it, and
> thus get openat2(2) support for free once it's ready.
> 
> Future work would include implementing things like RESOLVE_NO_AUTOMOUNT and
> possibly a RESOLVE_NO_REMOTE (to allow programs to be sure they don't hit
> DoSes through stale NFS handles).
> 
> [1]: https://lwn.net/Articles/588444/
> [2]: https://lore.kernel.org/lkml/ca+55afyyxjl1lyxzebsf2ypriraj5ut1xkndsunrbqgvjzu...@mail.gmail.com
> [3]: https://lore.kernel.org/lkml/20170429220414.gt29...@zeniv.linux.org.uk
> [4]: https://lore.kernel.org/lkml/1415094884-18349-1-git-send-email-drysd...@google.com
> [5]: https://lore.kernel.org/lkml/1404124096-21445-1-git-send-email-drysd...@google.com
> [6]: https://lwn.net/Articles/723057/
> [7]: https://github.com/cyphar/filepath-securejoin
> [8]: https://github.com/openSUSE/libpathrs
> 
> The current draft of the openat2(2) man-page is included below.
> 
> --8<---------------------------------------------------------------------------
> OPENAT2(2)                Linux Programmer's Manual                OPENAT2(2)
> 
> NAME
>        openat2 - open and possibly create a file (extended)
> 
> SYNOPSIS
>        #include <sys/types.h>
>        #include <sys/stat.h>
>        #include <fcntl.h>
> 
>        int openat2(int dirfd, const char *pathname, struct open_how *how,
>                    size_t size);
> 
>        Note: There is no glibc wrapper for this system call; see NOTES.
> 
> DESCRIPTION
>        The openat2() system call opens the file specified by pathname. If the
>        specified file does not exist, it may optionally (if O_CREAT is
>        specified in how.flags) be created by openat2().
>
>        As with openat(2), if pathname is relative, then it is interpreted
>        relative to the directory referred to by the file descriptor dirfd (or
>        the current working directory of the calling process, if dirfd is the
>        special value AT_FDCWD). If pathname is absolute, then dirfd is ignored
>        (unless how.resolve contains RESOLVE_IN_ROOT, in which case pathname is
>        resolved relative to dirfd).
>
>        The openat2() system call is an extension of openat(2) and provides a
>        superset of its functionality. Rather than taking a single flag
>        argument, an extensible structure (how) is passed instead to allow for
>        future extensions. size must be set to sizeof(struct open_how), to
>        facilitate future extensions (see the "Extensibility" section of the
>        NOTES for more detail on how extensions are handled).
> 
>    The open_how structure
>        The following structure indicates how pathname should be opened, and
>        acts as a superset of the flag and mode arguments to openat(2).
> 
>            struct open_how {
>                __aligned_u64 flags;         /* O_* flags. */
>                __u16         mode;          /* Mode for O_{CREAT,TMPFILE}. */
>                __u16         __padding[3];  /* Must be zeroed. */
>                __aligned_u64 resolve;       /* RESOLVE_* flags. */
>            };
> 
>        Any future extensions to openat2() will be implemented as new fields
>        appended to the above structure (or through reuse of pre-existing
>        padding space), with the zero value of the new fields acting as though
>        the extension were not present.
> 
>        The meaning of each field is as follows:
> 
>               flags
>                      The file creation and status flags to use for this
>                      operation. All of the O_* flags defined for openat(2)
>                      are valid openat2() flag values.
>
>                      Unlike openat(2), it is an error to provide openat2()
>                      unknown or conflicting flags in flags.
>
>               mode
>                      File mode for the new file, with identical semantics to
>                      the mode argument to openat(2). However, unlike
>                      openat(2), it is an error to provide openat2() with a
>                      mode which contains bits other than 0777.
>
>                      It is an error to provide openat2() a non-zero mode if
>                      flags does not contain O_CREAT or O_TMPFILE.
>
>               resolve
>                      Change how the components of pathname will be resolved
>                      (see path_resolution(7) for background information).
>                      The primary use case for these flags is to allow
>                      trusted programs to restrict how untrusted paths (or
>                      paths inside untrusted directories) are resolved. The
>                      full list of resolve flags is given below.
> 
>                      RESOLVE_NO_XDEV
>                             Disallow traversal of mount points during path
>                             resolution (including all bind mounts).
>
>                             Users of this flag are encouraged to make its
>                             use configurable (unless it is used for a
>                             specific security purpose), as bind mounts are
>                             very widely used by end-users. Setting this
>                             flag indiscriminately for all uses of openat2()
>                             may result in spurious errors on
>                             previously-functional systems.
> 
>                      RESOLVE_NO_SYMLINKS
>                             Disallow resolution of symbolic links during
>                             path resolution. This option implies
>                             RESOLVE_NO_MAGICLINKS.
>
>                             If the trailing component is a symbolic link,
>                             and flags contains both O_PATH and O_NOFOLLOW,
>                             then an O_PATH file descriptor referencing the
>                             symbolic link will be returned.
>
>                             Users of this flag are encouraged to make its
>                             use configurable (unless it is used for a
>                             specific security purpose), as symbolic links
>                             are very widely used by end-users. Setting this
>                             flag indiscriminately for all uses of openat2()
>                             may result in spurious errors on
>                             previously-functional systems.
> 
>                      RESOLVE_NO_MAGICLINKS
>                             Disallow all magic link resolution during path
>                             resolution.
>
>                             If the trailing component is a magic link, and
>                             flags contains both O_PATH and O_NOFOLLOW, then
>                             an O_PATH file descriptor referencing the magic
>                             link will be returned.
>
>                             Magic links are symbolic link-like objects that
>                             are most notably found in proc(5) (examples
>                             include /proc/[pid]/exe and /proc/[pid]/fd/*).
>                             Due to the potential danger of unknowingly
>                             opening these magic links, it may be preferable
>                             for users to disable their resolution entirely
>                             (see symlink(7) for more details).
> 
>                      RESOLVE_BENEATH
>                             Do not permit the path resolution to succeed if
>                             any component of the resolution is not a
>                             descendant of the directory indicated by dirfd.
>                             This causes absolute symbolic links (and
>                             absolute values of pathname) to be rejected.
>
>                             Currently, this flag also disables magic link
>                             resolution. However, this may change in the
>                             future. The caller should explicitly specify
>                             RESOLVE_NO_MAGICLINKS to ensure that magic
>                             links are not resolved.
> 
>                      RESOLVE_IN_ROOT
>                             Treat dirfd as the root directory while
>                             resolving pathname (as though the user called
>                             chroot(2) with dirfd as the argument). Absolute
>                             symbolic links and ".." path components will be
>                             scoped to dirfd. If pathname is an absolute
>                             path, it is also treated relative to dirfd.
>
>                             However, unlike chroot(2) (which changes the
>                             filesystem root permanently for a process),
>                             RESOLVE_IN_ROOT allows a program to efficiently
>                             restrict path resolution for only certain
>                             operations. It also has several hardening
>                             features (such as detecting escape attempts
>                             during ".." resolution) which chroot(2) does
>                             not.
>
>                             Currently, this flag also disables magic link
>                             resolution. However, this may change in the
>                             future. The caller should explicitly specify
>                             RESOLVE_NO_MAGICLINKS to ensure that magic
>                             links are not resolved.
> 
>                      It is an error to provide openat2() unknown flags in
>                      resolve.
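
As a rough illustration of the argument checks described for mode, resolve
and the padding, here is a userspace model. It is only a sketch: the
DEMO_RESOLVE_* bit values are placeholders picked for the example (not
taken from the patches), struct demo_how and demo_check_how() are
hypothetical names, and the check for unknown or conflicting O_* flags is
left out.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>

#ifndef O_TMPFILE
/* Fallback if the libc headers hide O_TMPFILE; asm-generic value assumed. */
#define O_TMPFILE (020000000 | O_DIRECTORY)
#endif

/* Placeholder bit assignments, for this sketch only. */
#define DEMO_RESOLVE_NO_XDEV       0x01ULL
#define DEMO_RESOLVE_NO_MAGICLINKS 0x02ULL
#define DEMO_RESOLVE_NO_SYMLINKS   0x04ULL
#define DEMO_RESOLVE_BENEATH       0x08ULL
#define DEMO_RESOLVE_IN_ROOT       0x10ULL
#define DEMO_RESOLVE_KNOWN (DEMO_RESOLVE_NO_XDEV | DEMO_RESOLVE_NO_MAGICLINKS | \
                            DEMO_RESOLVE_NO_SYMLINKS | DEMO_RESOLVE_BENEATH | \
                            DEMO_RESOLVE_IN_ROOT)

/* Local stand-in mirroring the draft struct open_how layout. */
struct demo_how {
    uint64_t flags;
    uint16_t mode;
    uint16_t padding[3];
    uint64_t resolve;
};

/* Returns 0 if the arguments pass the modelled checks, -EINVAL otherwise. */
static int demo_check_how(const struct demo_how *how)
{
    /* O_TMPFILE is a multi-bit value, so compare against the full mask. */
    int creates = (how->flags & O_CREAT) ||
                  (how->flags & O_TMPFILE) == O_TMPFILE;

    if (how->resolve & ~DEMO_RESOLVE_KNOWN)
        return -EINVAL;             /* unknown resolve flag */
    if (how->mode & ~0777)
        return -EINVAL;             /* bits other than 0777 */
    if (how->mode != 0 && !creates)
        return -EINVAL;             /* mode without O_CREAT/O_TMPFILE */
    for (int i = 0; i < 3; i++)
        if (how->padding[i] != 0)
            return -EINVAL;         /* padding must be zeroed */
    return 0;
}
```

The point of the model is that every rejected case fails loudly with
EINVAL instead of being silently ignored, which is the behaviour openat(2)
cannot offer for unknown flags.
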
> 
> RETURN VALUE
>        On success, a new file descriptor is returned. On error, -1 is
>        returned, and errno is set appropriately.
> 
> ERRORS
>        The set of errors returned by openat2() includes all of the errors
>        returned by openat(2), as well as the following additional errors:
>
>        EINVAL An unknown flag or invalid value was specified in how.
>
>        EINVAL mode is non-zero, but flags does not contain O_CREAT or
>               O_TMPFILE.
>
>        EINVAL size was smaller than any known version of struct open_how.
>
>        E2BIG  An extension was specified in how, which the current kernel
>               does not support (see the "Extensibility" section of the NOTES
>               for more detail on how extensions are handled).
>
>        EAGAIN resolve contains either RESOLVE_IN_ROOT or RESOLVE_BENEATH, and
>               the kernel could not ensure that a ".." component didn't escape
>               (due to a race condition or potential attack). Callers may
>               choose to retry the openat2() call.
>
>        EXDEV  resolve contains either RESOLVE_IN_ROOT or RESOLVE_BENEATH, and
>               an escape from the root during path resolution was detected.
>
>        EXDEV  resolve contains RESOLVE_NO_XDEV, and a path component attempted
>               to cross a mount point.
>
>        ELOOP  resolve contains RESOLVE_NO_SYMLINKS, and one of the path
>               components was a symbolic link (or magic link).
>
>        ELOOP  resolve contains RESOLVE_NO_MAGICLINKS, and one of the path
>               components was a magic link.
> 
> VERSIONS
>        openat2() was added to Linux in kernel 5.FOO.
> 
> CONFORMING TO
>        This system call is Linux-specific.
> 
>        The semantics of RESOLVE_BENEATH were modelled after FreeBSD's
>        O_BENEATH.
> 
> NOTES
>        Glibc does not provide a wrapper for this system call; call it using
>        syscall(2).
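
Since there is no libc wrapper, a caller has to go through syscall(2)
directly. A minimal sketch of what that could look like follows. The
struct is a local copy of the draft layout quoted above and may not match
what is eventually merged; struct draft_open_how and draft_openat2() are
hypothetical names, and the syscall number is deliberately taken from the
system headers (SYS_openat2) rather than hard-coded.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Local copy of the draft layout; hypothetical name for this sketch. */
struct draft_open_how {
    uint64_t flags;        /* O_* flags */
    uint16_t mode;         /* mode for O_CREAT / O_TMPFILE */
    uint16_t padding[3];   /* must be zeroed */
    uint64_t resolve;      /* RESOLVE_* flags */
};

/*
 * Hypothetical wrapper: returns a new file descriptor, or -1 with errno
 * set. If the headers do not know the syscall, report ENOSYS so callers
 * can fall back to openat(2) or a userspace emulation.
 */
static int draft_openat2(int dirfd, const char *pathname,
                         const struct draft_open_how *how)
{
#ifdef SYS_openat2
    return (int)syscall(SYS_openat2, dirfd, pathname, how, sizeof(*how));
#else
    (void)dirfd;
    (void)pathname;
    (void)how;
    errno = ENOSYS;
    return -1;
#endif
}
```

A caller would zero-initialise the structure (so the padding is zero), set
how.flags and how.resolve, and treat ENOSYS as "this kernel has no
openat2()".
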
> 
>    Extensibility
>        In order to allow for struct open_how to be extended in future kernel
>        revisions, openat2() requires userspace to specify the size of the
>        struct open_how structure they are passing. By providing this
>        information, it is possible for openat2() to provide both forwards-
>        and backwards-compatibility, with size acting as an implicit version
>        number (because new extension fields will always be appended, the
>        size will always increase). This extensibility design is very similar
>        to other system calls such as sched_setattr(2), perf_event_open(2),
>        and clone3(2).
> 
>        If we let usize be the size of the structure according to userspace
>        and ksize be the size of the structure which the kernel supports, then
>        there are only three cases to consider:
>
>               *  If ksize equals usize, then there is no version mismatch and
>                  how can be used verbatim.
>
>               *  If ksize is larger than usize, then there are some
>                  extensions the kernel supports which the userspace program
>                  is unaware of. Because all extensions must have their zero
>                  values be a no-op, the kernel treats all of the extension
>                  fields not set by userspace to have zero values. This
>                  provides backwards-compatibility.
>
>               *  If ksize is smaller than usize, then there are some
>                  extensions which the userspace program is aware of but the
>                  kernel does not support. Because all extensions must have
>                  their zero values be a no-op, the kernel can safely ignore
>                  the unsupported extension fields if they are all-zero. If
>                  any unsupported extension fields are non-zero, then -1 is
>                  returned and errno is set to E2BIG. This provides
>                  forwards-compatibility.
>
>        Therefore, most userspace programs will not need to have any special
>        handling of extensions. However, if a userspace program wishes to
>        determine what extensions the running kernel supports, they may
>        conduct a binary search on size (to find the largest value which
>        doesn't produce an error of E2BIG).
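
The three ksize/usize cases can be modelled in plain userspace C. The
sketch below is only an illustration of the rule described in the
Extensibility section, not the kernel's implementation;
copy_struct_compat() is a hypothetical name.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy a caller-supplied structure of `usize` bytes into a destination
 * that understands `ksize` bytes.
 *
 *  - usize == ksize: plain copy.
 *  - usize <  ksize: copy what was given, zero-fill the rest, so newer
 *    fields read as "extension not used".
 *  - usize >  ksize: accept only if every byte beyond ksize is zero,
 *    otherwise fail with -E2BIG.
 */
static int copy_struct_compat(void *dst, size_t ksize,
                              const void *src, size_t usize)
{
    size_t size = usize < ksize ? usize : ksize;

    if (usize > ksize) {
        const unsigned char *tail = (const unsigned char *)src + ksize;

        for (size_t i = 0; i < usize - ksize; i++)
            if (tail[i] != 0)
                return -E2BIG;  /* unsupported extension in use */
    }

    memset(dst, 0, ksize);      /* unknown-to-caller fields become zero */
    memcpy(dst, src, size);
    return 0;
}
```

Because the trailing bytes only matter when they are non-zero, an old
program on a new kernel and a new program on an old kernel both keep
working as long as the newer fields are left at zero.
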
> 
> SEE ALSO
>        openat(2), path_resolution(7), symlink(7)
> 
> Linux                             2019-11-05                       OPENAT2(2)
> --8<---------------------------------------------------------------------------
> 
> Aleksa Sarai (9):
>   namei: LOOKUP_NO_SYMLINKS: block symlink resolution
>   namei: LOOKUP_NO_MAGICLINKS: block magic-link resolution
>   namei: LOOKUP_NO_XDEV: block mountpoint crossing
>   namei: LOOKUP_BENEATH: O_BENEATH-like scoped resolution
>   namei: LOOKUP_IN_ROOT: chroot-like scoped resolution
>   namei: LOOKUP_{IN_ROOT,BENEATH}: permit limited ".." resolution
>   open: introduce openat2(2) syscall
>   selftests: add openat2(2) selftests
>   Documentation: path-lookup: mention LOOKUP_MAGICLINK_JUMPED
> 
>  CREDITS                                       |   4 +-
>  Documentation/filesystems/path-lookup.rst     |  18 +-
>  arch/alpha/kernel/syscalls/syscall.tbl        |   1 +
>  arch/arm/tools/syscall.tbl                    |   1 +
>  arch/arm64/include/asm/unistd.h               |   2 +-
>  arch/arm64/include/asm/unistd32.h             |   2 +
>  arch/ia64/kernel/syscalls/syscall.tbl         |   1 +
>  arch/m68k/kernel/syscalls/syscall.tbl         |   1 +
>  arch/microblaze/kernel/syscalls/syscall.tbl   |   1 +
>  arch/mips/kernel/syscalls/syscall_n32.tbl     |   1 +
>  arch/mips/kernel/syscalls/syscall_n64.tbl     |   1 +
>  arch/mips/kernel/syscalls/syscall_o32.tbl     |   1 +
>  arch/parisc/kernel/syscalls/syscall.tbl       |   1 +
>  arch/powerpc/kernel/syscalls/syscall.tbl      |   1 +
>  arch/s390/kernel/syscalls/syscall.tbl         |   1 +
>  arch/sh/kernel/syscalls/syscall.tbl           |   1 +
>  arch/sparc/kernel/syscalls/syscall.tbl        |   1 +
>  arch/x86/entry/syscalls/syscall_32.tbl        |   1 +
>  arch/x86/entry/syscalls/syscall_64.tbl        |   1 +
>  arch/xtensa/kernel/syscalls/syscall.tbl       |   1 +
>  fs/namei.c                                    | 176 +++++-
>  fs/open.c                                     | 149 +++--
>  include/linux/fcntl.h                         |  12 +-
>  include/linux/namei.h                         |  11 +
>  include/linux/syscalls.h                      |   3 +
>  include/uapi/asm-generic/unistd.h             |   5 +-
>  include/uapi/linux/fcntl.h                    |  41 ++
>  tools/testing/selftests/Makefile              |   1 +
>  tools/testing/selftests/openat2/.gitignore    |   1 +
>  tools/testing/selftests/openat2/Makefile      |   8 +
>  tools/testing/selftests/openat2/helpers.c     | 109 ++++
>  tools/testing/selftests/openat2/helpers.h     | 107 ++++
>  .../testing/selftests/openat2/openat2_test.c  | 316 +++++++++++
>  .../selftests/openat2/rename_attack_test.c    | 160 ++++++
>  .../testing/selftests/openat2/resolve_test.c  | 523 ++++++++++++++++++
>  35 files changed, 1591 insertions(+), 73 deletions(-)
>  create mode 100644 tools/testing/selftests/openat2/.gitignore
>  create mode 100644 tools/testing/selftests/openat2/Makefile
>  create mode 100644 tools/testing/selftests/openat2/helpers.c
>  create mode 100644 tools/testing/selftests/openat2/helpers.h
>  create mode 100644 tools/testing/selftests/openat2/openat2_test.c
>  create mode 100644 tools/testing/selftests/openat2/rename_attack_test.c
>  create mode 100644 tools/testing/selftests/openat2/resolve_test.c
> 
> 
> base-commit: a99d8080aaf358d5d23581244e5da23b35e340b9

Ping -- this patch hasn't been touched for a week. Thanks.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
