On 11/08/2020 21:58, Eric W. Biederman wrote:
> Mickaël Salaün <[email protected]> writes:
> 
>> Allow for the enforcement of the O_MAYEXEC openat2(2) flag.  Thanks to
>> the noexec option from the underlying VFS mount, or to the file execute
>> permission, userspace can enforce these execution policies.  This may
>> allow script interpreters to check execution permission before reading
>> commands from a file, or dynamic linkers to allow shared object
>> loading.
> 
> Ick!!!!!
> 
> This feels like being so open minded your brains fall out.

Really? :)

> 
> I can see having a sysctl that allows the new open flag to be ignored
> so that the existing lack of enforcement when the flag is passed
> continues.

And that could break distros (i.e. multiple interpreters, with or
without O_MAYEXEC, different versions, and different kernels).

> 
> But having the sysctl be fine grained seems like way too much rope.

This follow the same rational: file permissions and mount options can
change but they are controled by the sysadmin, how also configure sysctl
values.

> 
> I don't think the code needs to do more than enforce or not enforce this
> logic.

I think this is the more viable behavior for an eclectic set of distros
(and different use cases).

> 
> You can test the sysctl once when you process O_MAYEXEC.  But code such
> as may_open should not have the conditional behavior.  It should get an
> appropriate set of flags that are always enforced.  With the madness of
> what to do left at the edge of userspace.

The problem is that this userspace is not in charge of the system-wide
policy, the sysadmin is. As pointed out by the commit message, please
take a look at the rational:
https://lore.kernel.org/lkml/[email protected]/

> 
> Anything else appears to be madness, overengineering, and a failure to
> separate concerns.

Oh! What a conclusion! :)

I'd say it is a pragmatic approach.

> 
> Eric
> 
> 
>> Add a new sysctl fs.open_mayexec_enforce to enable system administrators
>> to enforce two complementary security policies according to the
>> installed system: enforce the noexec mount option, and enforce
>> executable file permission.  Indeed, because of compatibility with
>> installed systems, only system administrators are able to check that
>> this new enforcement is in line with the system mount points and file
>> permissions.  A following patch adds documentation.
>>
>> Being able to restrict execution also enables to protect the kernel by
>> restricting arbitrary syscalls that an attacker could perform with a
>> crafted binary or certain script languages.  It also improves multilevel
>> isolation by reducing the ability of an attacker to use side channels
>> with specific code.  These restrictions can natively be enforced for ELF
>> binaries (with the noexec mount option) but require this kernel
>> extension to properly handle scripts (e.g., Python, Perl).  To get a
>> consistent execution policy, additional memory restrictions should also
>> be enforced (e.g. thanks to SELinux).
>>
>> Because the O_MAYEXEC flag is a meant to enforce a system-wide security
>> policy (but not application-centric policies), it does not make sense
>> for userland to check the sysctl value.  Indeed, this new flag only
>> enables to extend the system ability to enforce a policy thanks to (some
>> trusted) userland collaboration.  Moreover, additional security policies
>> could be managed by LSMs.  This is a best-effort approach from the
>> application developer point of view:
>> https://lore.kernel.org/lkml/[email protected]/
>>
>> Signed-off-by: Mickaël Salaün <[email protected]>
>> Reviewed-by: Thibaut Sautereau <[email protected]>
>> Cc: Aleksa Sarai <[email protected]>
>> Cc: Al Viro <[email protected]>
>> Cc: Jonathan Corbet <[email protected]>
>> Cc: Kees Cook <[email protected]>
>> Cc: Randy Dunlap <[email protected]>
>> ---
>>
>> Changes since v6:
>> * Allow opening pipes, block devices and character devices with
>>   O_MAYEXEC when there is no enforced policy, but forbid any non-regular
>>   file opened with O_MAYEXEC otherwise (i.e. for any enforced policy).
>> * Add a paragraph about the non-regular files policy.
>> * Move path_noexec() calls out of the fast-path (suggested by Kees
>>   Cook).
>>
>> Changes since v5:
>> * Remove the static enforcement configuration through Kconfig because it
>>   makes the code more simple like this, and because the current sysctl
>>   configuration can only be set with CAP_SYS_ADMIN, the same way mount
>>   options (i.e. noexec) can be set.  If an harden distro wants to
>>   enforce a configuration, it should restrict capabilities or sysctl
>>   configuration.  Furthermore, an LSM can easily leverage O_MAYEXEC to
>>   fit its need.
>> * Move checks from inode_permission() to may_open() and make the error
>>   codes more consistent according to file types (in line with a previous
>>   commit): opening a directory with O_MAYEXEC returns EISDIR and other
>>   non-regular file types may return EACCES.
>> * In may_open(), when OMAYEXEC_ENFORCE_FILE is set, replace explicit
>>   call to generic_permission() with an artificial MAY_EXEC to avoid
>>   double calls.  This makes sense especially when an LSM policy forbids
>>   execution of a file.
>> * Replace the custom proc_omayexec() with
>>   proc_dointvec_minmax_sysadmin(), and then replace the CAP_MAC_ADMIN
>>   check with a CAP_SYS_ADMIN one (suggested by Kees Cook and Stephen
>>   Smalley).
>> * Use BIT() (suggested by Kees Cook).
>> * Rename variables (suggested by Kees Cook).
>> * Reword the kconfig help.
>> * Import the documentation patch (suggested by Kees Cook):
>>   https://lore.kernel.org/lkml/[email protected]/
>> * Update documentation and add LWN.net article.
>>
>> Changes since v4:
>> * Add kernel configuration options to enforce O_MAYEXEC at build time,
>>   and disable the sysctl in such case (requested by James Morris).
>> * Reword commit message.
>>
>> Changes since v3:
>> * Update comment with O_MAYEXEC.
>>
>> Changes since v2:
>> * Cosmetic changes.
>>
>> Changes since v1:
>> * Move code from Yama to the FS subsystem (suggested by Kees Cook).
>> * Make omayexec_inode_permission() static (suggested by Jann Horn).
>> * Use mode 0600 for the sysctl.
>> * Only match regular files (not directories nor other types), which
>>   follows the same semantic as commit 73601ea5b7b1 ("fs/open.c: allow
>>   opening only regular files during execve()").
>> ---
>>  Documentation/admin-guide/sysctl/fs.rst | 49 +++++++++++++++++++++++++
>>  fs/namei.c                              | 24 ++++++++++++
>>  include/linux/fs.h                      |  1 +
>>  kernel/sysctl.c                         | 12 +++++-
>>  4 files changed, 84 insertions(+), 2 deletions(-)
>>
>> diff --git a/Documentation/admin-guide/sysctl/fs.rst 
>> b/Documentation/admin-guide/sysctl/fs.rst
>> index 2a45119e3331..ce6e2081d3a9 100644
>> --- a/Documentation/admin-guide/sysctl/fs.rst
>> +++ b/Documentation/admin-guide/sysctl/fs.rst
>> @@ -37,6 +37,7 @@ Currently, these files are in /proc/sys/fs:
>>  - inode-nr
>>  - inode-state
>>  - nr_open
>> +- open_mayexec_enforce
>>  - overflowuid
>>  - overflowgid
>>  - pipe-user-pages-hard
>> @@ -165,6 +166,54 @@ system needs to prune the inode list instead of 
>> allocating
>>  more.
>>  
>>  
>> +open_mayexec_enforce
>> +--------------------
>> +
>> +While being ignored by :manpage:`open(2)` and :manpage:`openat(2)`, the
>> +``O_MAYEXEC`` flag can be passed to :manpage:`openat2(2)` to only open 
>> regular
>> +files that are expected to be executable.  If the file is not identified as
>> +executable, then the syscall returns -EACCES.  This may allow a script
>> +interpreter to check executable permission before reading commands from a 
>> file,
>> +or a dynamic linker to only load executable shared objects.  One interesting
>> +use case is to enforce a "write xor execute" policy through interpreters.
>> +
>> +The ability to restrict code execution must be thought as a system-wide 
>> policy,
>> +which first starts by restricting mount points with the ``noexec`` option.
>> +This option is also automatically applied to special filesystems such as 
>> /proc .
>> +This prevents files on such mount points to be directly executed by the 
>> kernel
>> +or mapped as executable memory (e.g. libraries).  With script interpreters
>> +using the ``O_MAYEXEC`` flag, the executable permission can then be checked
>> +before reading commands from files. This makes it possible to enforce the
>> +``noexec`` at the interpreter level, and thus propagates this security 
>> policy
>> +to scripts.  To be fully effective, these interpreters also need to handle 
>> the
>> +other ways to execute code: command line parameters (e.g., option ``-e`` for
>> +Perl), module loading (e.g., option ``-m`` for Python), stdin, file 
>> sourcing,
>> +environment variables, configuration files, etc.  According to the threat
>> +model, it may be acceptable to allow some script interpreters (e.g. Bash) to
>> +interpret commands from stdin, may it be a TTY or a pipe, because it may 
>> not be
>> +enough to (directly) perform syscalls.
>> +
>> +There are two complementary security policies: enforce the ``noexec`` mount
>> +option, and enforce executable file permission.  These policies are handled 
>> by
>> +the ``fs.open_mayexec_enforce`` sysctl (writable only with 
>> ``CAP_SYS_ADMIN``)
>> +as a bitmask:
>> +
>> +1 - Mount restriction: checks that the mount options for the underlying VFS
>> +    mount do not prevent execution.
>> +
>> +2 - File permission restriction: checks that the to-be-opened file is 
>> marked as
>> +    executable for the current process (e.g., POSIX permissions).
>> +
>> +Note that as long as a policy is enforced, opening any non-regular file with
>> +``O_MAYEXEC`` is denied (e.g. TTYs, pipe), even when such a file is marked 
>> as
>> +executable or is on an executable mount point.
>> +
>> +Code samples can be found in tools/testing/selftests/openat2/omayexec_test.c
>> +and interpreter patches (for the original O_MAYEXEC version) may be found at
>> +https://github.com/clipos-archive/clipos4_portage-overlay/search?q=O_MAYEXEC
>>  .
>> +See also an overview article: https://lwn.net/Articles/820000/ .
>> +
>> +
>>  overflowgid & overflowuid
>>  -------------------------
>>  
>> diff --git a/fs/namei.c b/fs/namei.c
>> index 3f074ec77390..8ec13c7fd403 100644
>> --- a/fs/namei.c
>> +++ b/fs/namei.c
>> @@ -39,6 +39,7 @@
>>  #include <linux/bitops.h>
>>  #include <linux/init_task.h>
>>  #include <linux/uaccess.h>
>> +#include <linux/sysctl.h>
>>  
>>  #include "internal.h"
>>  #include "mount.h"
>> @@ -425,6 +426,11 @@ static int sb_permission(struct super_block *sb, struct 
>> inode *inode, int mask)
>>      return 0;
>>  }
>>  
>> +#define OPEN_MAYEXEC_ENFORCE_MOUNT  BIT(0)
>> +#define OPEN_MAYEXEC_ENFORCE_FILE   BIT(1)
>> +
>> +int sysctl_open_mayexec_enforce __read_mostly;
>> +
>>  /**
>>   * inode_permission - Check for access rights to a given inode
>>   * @inode: Inode to check permission on
>> @@ -2861,11 +2867,29 @@ static int may_open(const struct path *path, int 
>> acc_mode, int flag)
>>      case S_IFSOCK:
>>              if (acc_mode & MAY_EXEC)
>>                      return -EACCES;
>> +            /*
>> +             * Opening devices (e.g. TTYs) or pipes with O_MAYEXEC may be
>> +             * legitimate when there is no enforced policy.
>> +             */
>> +            if ((acc_mode & MAY_OPENEXEC) && sysctl_open_mayexec_enforce)
>> +                    return -EACCES;
>>              flag &= ~O_TRUNC;
>>              break;
>>      case S_IFREG:
>>              if ((acc_mode & MAY_EXEC) && path_noexec(path))
>>                      return -EACCES;
>> +            if (acc_mode & MAY_OPENEXEC) {
>> +                    if ((sysctl_open_mayexec_enforce & 
>> OPEN_MAYEXEC_ENFORCE_MOUNT)
>> +                                    && path_noexec(path))
>> +                            return -EACCES;
>> +                    if (sysctl_open_mayexec_enforce & 
>> OPEN_MAYEXEC_ENFORCE_FILE)
>> +                            /*
>> +                             * Because acc_mode may change here, the next 
>> and only
>> +                             * use of acc_mode should then be by the 
>> following call
>> +                             * to inode_permission().
>> +                             */
>> +                            acc_mode |= MAY_EXEC;
>> +            }
>>              break;
>>      }
>>  
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index 56f835c9a87a..071f37707ccc 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -83,6 +83,7 @@ extern int sysctl_protected_symlinks;
>>  extern int sysctl_protected_hardlinks;
>>  extern int sysctl_protected_fifos;
>>  extern int sysctl_protected_regular;
>> +extern int sysctl_open_mayexec_enforce;
>>  
>>  typedef __kernel_rwf_t rwf_t;
>>  
>> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
>> index db1ce7af2563..5008a2566e79 100644
>> --- a/kernel/sysctl.c
>> +++ b/kernel/sysctl.c
>> @@ -113,6 +113,7 @@ static int sixty = 60;
>>  
>>  static int __maybe_unused neg_one = -1;
>>  static int __maybe_unused two = 2;
>> +static int __maybe_unused three = 3;
>>  static int __maybe_unused four = 4;
>>  static unsigned long zero_ul;
>>  static unsigned long one_ul = 1;
>> @@ -888,7 +889,6 @@ static int proc_taint(struct ctl_table *table, int write,
>>      return err;
>>  }
>>  
>> -#ifdef CONFIG_PRINTK
>>  static int proc_dointvec_minmax_sysadmin(struct ctl_table *table, int write,
>>                              void *buffer, size_t *lenp, loff_t *ppos)
>>  {
>> @@ -897,7 +897,6 @@ static int proc_dointvec_minmax_sysadmin(struct 
>> ctl_table *table, int write,
>>  
>>      return proc_dointvec_minmax(table, write, buffer, lenp, ppos);
>>  }
>> -#endif
>>  
>>  /**
>>   * struct do_proc_dointvec_minmax_conv_param - proc_dointvec_minmax() range 
>> checking structure
>> @@ -3264,6 +3263,15 @@ static struct ctl_table fs_table[] = {
>>              .extra1         = SYSCTL_ZERO,
>>              .extra2         = &two,
>>      },
>> +    {
>> +            .procname       = "open_mayexec_enforce",
>> +            .data           = &sysctl_open_mayexec_enforce,
>> +            .maxlen         = sizeof(int),
>> +            .mode           = 0600,
>> +            .proc_handler   = proc_dointvec_minmax_sysadmin,
>> +            .extra1         = SYSCTL_ZERO,
>> +            .extra2         = &three,
>> +    },
>>  #if defined(CONFIG_BINFMT_MISC) || defined(CONFIG_BINFMT_MISC_MODULE)
>>      {
>>              .procname       = "binfmt_misc",

Reply via email to