On Fri, Jun 12, 2026 at 3:37 PM clubby789 <[email protected]> wrote:
>
> On Fri, Jun 12, 2026 at 8:25 PM Will Drewry <[email protected]> wrote:
> >
> > So I have two concerns here --
> >
> > 1. By layering, STRICT becomes subject to FILTER RET behaviors.
> > 2. If we did want to layer them, it would be ideal to separate the 'upgrade'
> > decision from the access checks and make the layering path explicit.
> >
> > If you are running a legacy binary that uses STRICT in Docker, then I
> > understand the goal, but there are userspace options.
> >
> > I think it's hard to want to open the door on changing STRICT without
> > a good reason to work through all the implications that come with it:
> > cross-checking every SECCOMP_MODE_FILTER reference, sorting
> > out thread sync interactions, ...
> >
> > That said, this change could be streamlined and look for ways
> > to minimize any potential implications.
> >
> > Am I missing something or overstating it?
> >
> > Thanks!
>
> Thanks for the review - I'm currently working on a new version which
> addresses some of the implementation
> issues. On the semantics side:
> The current version runs strict checks before filters. I think it
> makes more sense to run filters, run strict checks, then return the
> filter result (assuming strict checks were survived)
> Since the strict checks logically work as another filter layer,
> documentation says
> > Synchronization will fail if another thread in the same process is in
> > SECCOMP_MODE_STRICT
> So refusing to sync threads which are using both modes seems the most
> reasonable.

Can we take a moment to contemplate how filters compose? The current
rule is:

/*
 * All BPF programs must return a 32-bit value.
 * The bottom 16-bits are for optional return data.
 * The upper 16-bits are ordered from least permissive values to most,
 * as a signed value (so 0x8000000 is negative).
 *
 * The ordering ensures that a min_t() over composed return values always
 * selects the least permissive choice.
 */
#define SECCOMP_RET_KILL_PROCESS 0x80000000U /* kill the process */
#define SECCOMP_RET_KILL_THREAD  0x00000000U /* kill the thread */
#define SECCOMP_RET_KILL         SECCOMP_RET_KILL_THREAD
#define SECCOMP_RET_TRAP         0x00030000U /* disallow and force a SIGSYS */
#define SECCOMP_RET_ERRNO        0x00050000U /* returns an errno */
#define SECCOMP_RET_USER_NOTIF   0x7fc00000U /* notifies userspace */
#define SECCOMP_RET_TRACE        0x7ff00000U /* pass to a tracer or disallow */
#define SECCOMP_RET_LOG          0x7ffc0000U /* allow after logging */
#define SECCOMP_RET_ALLOW        0x7fff0000U /* allow */

This has always bothered me. In the absence of USER_NOTIF and TRACE,
fine, I guess -- we're choosing the least permissive, and this doesn't
seem too crazy. But TRACE and USER_NOTIF make this very strange,
especially as people keep wanting, quite reasonably, to make USER_NOTIF
fancier.

Shouldn't the actual behavior be that each filter, starting from the
innermost, gets to reject or transform a system call, and the outer
filters should act on the result *after transformation* of the syscall?

For example, if I run a container and set some syscall foobar() to
SECCOMP_RET_ERRNO in the container's policy, and the container then runs
a tool that sets foobar() to TRACE, then I think it would make a lot
more sense for the tracer to get notified if the task calls foobar().

Or if the container sets foobar() to USER_NOTIF and tries to emulate it,
then it should emulate (and not get ERRORed, because no actual foobar()
syscall has been attempted now that the inner filter was evaluated) and
then run the result of the USER_NOTIF emulation (assuming it tries to do
a syscall) through the outer filter?

And, however this gets done, presumably STRICT would be a different
variant of KILL_PROCESS that does SIGKILL.

--Andy
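
For reference, here is a minimal userspace sketch of the current
composition rule, i.e. the signed minimum over the action bits that the
header comment describes. It is only loosely modeled on what the kernel
does, not a copy of it: the RET_* names, action_only() and compose() are
hypothetical stand-ins, and the numeric values are the ones from the
header quoted above.

```c
#include <stddef.h>
#include <stdint.h>

/* Values copied from the header quoted above; the names are shortened. */
#define RET_KILL_PROCESS 0x80000000U
#define RET_KILL_THREAD  0x00000000U
#define RET_TRAP         0x00030000U
#define RET_ERRNO        0x00050000U
#define RET_USER_NOTIF   0x7fc00000U
#define RET_TRACE        0x7ff00000U
#define RET_LOG          0x7ffc0000U
#define RET_ALLOW        0x7fff0000U
#define RET_ACTION_FULL  0xffff0000U /* upper 16 bits: the action */

/*
 * Hypothetical helper: drop the 16 bits of return data and view the
 * action as a signed value, so that KILL_PROCESS (top bit set) sorts
 * below everything else.
 */
static int32_t action_only(uint32_t ret)
{
    return (int32_t)(ret & RET_ACTION_FULL);
}

/*
 * Hypothetical helper: combine the results of n stacked filters by
 * keeping the one with the smallest signed action. The order in which
 * the filters were installed plays no role in the outcome, except that
 * the first of several equal actions is the one kept.
 */
static uint32_t compose(const uint32_t *rets, size_t n)
{
    uint32_t best = RET_ALLOW; /* no filters: allow */

    for (size_t i = 0; i < n; i++) {
        if (action_only(rets[i]) < action_only(best))
            best = rets[i];
    }
    return best;
}
```

With the foobar() example above, compose() over { RET_TRACE,
RET_ERRNO | 1 } yields RET_ERRNO | 1: the tracer is never consulted,
regardless of which filter is the inner one.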

