On Tue, May 26, 2026 at 10:42 AM Jamie Hill-Daniel <[email protected]> wrote:
>
> It is currently impossible to enable `SECCOMP_MODE_STRICT` if
> `SECCOMP_MODE_FILTER` is enabled, and vice-versa. This makes using
> seccomp difficult in environments such as Docker, which installs a
> seccomp filter by default.

Some quick thoughts on your resend -- the original reasons for
this approach:
(a) filter policy != strict policy
(b) filter can implement strict, if layering is desired
(c) minimize checks in the syscall entry/return path

I'd expected folks to simply create the ~80 byte strict filter and install
it if they needed STRICT policy. More discussion below.

> Introduce a new internal `SECCOMP_MODE_COMBINED`
> that runs `strict` checks, followed by any installed filters.
>
> Link: https://github.com/moby/moby/issues/42082
> Signed-off-by: Jamie Hill-Daniel <[email protected]>
> ---
>  kernel/seccomp.c | 46 ++++++++++++++++++++++++++--------------------
>  1 file changed, 26 insertions(+), 20 deletions(-)
>
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 25f62867a16d..8201a050d358 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -33,6 +33,8 @@
>
>  /* Not exposed in headers: strictly internal use only. */
>  #define SECCOMP_MODE_DEAD      (SECCOMP_MODE_FILTER + 1)
> +/* Run SECCOMP_MODE_STRICT checks, followed by SECCOMP_MODE_FILTER  */
> +#define SECCOMP_MODE_COMBINED (SECCOMP_MODE_DEAD + 1)

I'm not sure if DEAD is meant to be fixed at the end of all modes or
not -- given that it's internal.

>
>  #ifdef CONFIG_SECCOMP_FILTER
>  #include <linux/file.h>
> @@ -432,14 +434,21 @@ static u32 seccomp_run_filters(const struct 
> seccomp_data *sd,
>  }
>  #endif /* CONFIG_SECCOMP_FILTER */
>
> -static inline bool seccomp_may_assign_mode(unsigned long seccomp_mode)
> +/**
> + * seccomp_needs_combined: internal function for checking if requested mode
> + * needs to be upgraded to `SECCOMP_MODE_COMBINED`.
> + *
> + */
> +static inline bool seccomp_needs_combined(unsigned long seccomp_mode)

It is easier to reason about if you keep the access checks separate from the
behavior-change decision. Setting seccomp modes isn't on a particularly
performance-critical path, so we can be more verbose and explicit.

>  {
>         assert_spin_locked(&current->sighand->siglock);
>
> -       if (current->seccomp.mode && current->seccomp.mode != seccomp_mode)
> -               return false;
> +       if ((current->seccomp.mode == SECCOMP_MODE_STRICT ||
> +            current->seccomp.mode == SECCOMP_MODE_FILTER) &&
> +           current->seccomp.mode != seccomp_mode)
> +               return true;

AFAICT, the only valid "combined" case is current->seccomp.mode==FILTER
and seccomp_mode==STRICT.   There are other allowed transitions, but
not allowed for transitioning from FILTER->STRICT via COMBINED.

If this wasn't subsuming the update access check, then the single allowed
transition could be checked in set_mode_strict() (or a helper).

>
> -       return true;
> +       return false;
>  }
>
>  void __weak arch_seccomp_spec_mitigate(struct task_struct *task) { }
> @@ -1407,6 +1416,9 @@ int __secure_computing(void)
>                 WARN_ON_ONCE(1);
>                 do_exit(SIGKILL);
>                 return -1;
> +       case SECCOMP_MODE_COMBINED:
> +               __secure_computing_strict(this_syscall);
> +               return __seccomp_filter(this_syscall, false);
>         default:
>                 BUG();
>         }
> @@ -1421,30 +1433,23 @@ long prctl_get_seccomp(void)
>  /**
>   * seccomp_set_mode_strict: internal function for setting strict seccomp
>   *
> - * Once current->seccomp.mode is non-zero, it may not be changed.
> + * Once current->seccomp.mode is non-zero, it may only be changed to 
> `COMBINED` or `DEAD`.
>   *
> - * Returns 0 on success or -EINVAL on failure.
>   */
> -static long seccomp_set_mode_strict(void)
> +static void seccomp_set_mode_strict(void)
>  {
> -       const unsigned long seccomp_mode = SECCOMP_MODE_STRICT;
> -       long ret = -EINVAL;
> +       unsigned long seccomp_mode = SECCOMP_MODE_STRICT;
>
>         spin_lock_irq(&current->sighand->siglock);
>
> -       if (!seccomp_may_assign_mode(seccomp_mode))
> -               goto out;
> +       if (seccomp_needs_combined(seccomp_mode))
> +               seccomp_mode = SECCOMP_MODE_COMBINED;
>
>  #ifdef TIF_NOTSC
>         disable_TSC();
>  #endif
>         seccomp_assign_mode(current, seccomp_mode, 0);
> -       ret = 0;
> -
> -out:
>         spin_unlock_irq(&current->sighand->siglock);
> -
> -       return ret;

Is this reachable when mode is DEAD? In its current form, it would fail
out on may_assign_mode(), but now you have to consider if a race
could silently clobber the DEAD value (let's say).

>  }
>
>  #ifdef CONFIG_SECCOMP_FILTER
> @@ -1956,7 +1961,7 @@ static bool has_duplicate_listener(struct 
> seccomp_filter *new_child)
>  static long seccomp_set_mode_filter(unsigned int flags,
>                                     const char __user *filter)
>  {
> -       const unsigned long seccomp_mode = SECCOMP_MODE_FILTER;
> +       long seccomp_mode = SECCOMP_MODE_FILTER;
>         struct seccomp_filter *prepared = NULL;
>         long ret = -EINVAL;
>         int listener = -1;
> @@ -2016,8 +2021,8 @@ static long seccomp_set_mode_filter(unsigned int flags,
>
>         spin_lock_irq(&current->sighand->siglock);
>
> -       if (!seccomp_may_assign_mode(seccomp_mode))
> -               goto out;
> +       if (seccomp_needs_combined(seccomp_mode))
> +               seccomp_mode = SECCOMP_MODE_COMBINED;

So I have two concerns here --

1. By layering, STRICT becomes subject to FILTER RET behaviors.
2. If we did want to layer them, it would be ideal to separate the 'upgrade'
    decision from the access checks and make the layering path explicit.

If you are running a legacy binary that uses STRICT in Docker, then I
understand the goal, but there are userspace options.

I think it's hard to want to open the door on changing STRICT without
a good reason to work through all the implications that come with it:
cross-checking every SECCOMP_MODE_FILTER reference, sorting
out thread sync interactions, ...

That said, this change could be streamlined and look for ways
to minimize any potential implications.

Am I missing something or overstating it?

Thanks!
will

>
>         if (has_duplicate_listener(prepared)) {
>                 ret = -EBUSY;
> @@ -2105,7 +2110,8 @@ static long do_seccomp(unsigned int op, unsigned int 
> flags,
>         case SECCOMP_SET_MODE_STRICT:
>                 if (flags != 0 || uargs != NULL)
>                         return -EINVAL;
> -               return seccomp_set_mode_strict();
> +               seccomp_set_mode_strict();
> +               return 0;
>         case SECCOMP_SET_MODE_FILTER:
>                 return seccomp_set_mode_filter(flags, uargs);
>         case SECCOMP_GET_ACTION_AVAIL:
>
> --
> 2.54.0
>

Reply via email to