Re: [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow

2020-10-08 Thread Kees Cook
On Thu, Oct 08, 2020 at 11:47:17PM -0500, YiFei Zhu wrote:
> On Wed, Sep 30, 2020 at 10:20 AM YiFei Zhu  wrote:
> > @@ -544,7 +577,8 @@ static struct seccomp_filter 
> > *seccomp_prepare_filter(struct sock_fprog *fprog)
> >  {
> > struct seccomp_filter *sfilter;
> > int ret;
> > -   const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE);
> > +   const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE) ||
> > +  IS_ENABLED(CONFIG_SECCOMP_CACHE_NR_ONLY);
> >
> > if (fprog->len == 0 || fprog->len > BPF_MAXINSNS)
> > return ERR_PTR(-EINVAL);
> 
> I'm trying to use __is_defined(SECCOMP_ARCH_NATIVE) here, and got this 
> message:
> 
> kernel/seccomp.c: In function ‘seccomp_prepare_filter’:
> ././include/linux/kconfig.h:44:44: error: pasting "__ARG_PLACEHOLDER_"
> and "(" does not give a valid preprocessing token
>44 | #define ___is_defined(val)  is_defined(__ARG_PLACEHOLDER_##val)
>   |^~
> ././include/linux/kconfig.h:43:27: note: in expansion of macro ‘___is_defined’
>43 | #define __is_defined(x)   ___is_defined(x)
>   |   ^
> kernel/seccomp.c:629:11: note: in expansion of macro ‘__is_defined’
>   629 |   __is_defined(SECCOMP_ARCH_NATIVE);
>   |   ^~~~
> 
> Looking at the implementation of __is_defined, it is:
> 
> #define __ARG_PLACEHOLDER_1 0,
> #define __take_second_arg(__ignored, val, ...) val
> #define __is_defined(x) ___is_defined(x)
> #define ___is_defined(val) is_defined(__ARG_PLACEHOLDER_##val)
> #define is_defined(arg1_or_junk) __take_second_arg(arg1_or_junk 1, 0)
> 
> Hence, when FOO is defined to be 1, then the expansion would be
> __is_defined(FOO) -> ___is_defined(1) ->
> is_defined(__ARG_PLACEHOLDER_1) -> __take_second_arg(0, 1, 0) ->
> 1,
> and when FOO is not defined, the expansion would be __is_defined(FOO)
> -> ___is_defined(FOO) -> is_defined(__ARG_PLACEHOLDER_FOO) ->
> __take_second_arg(__ARG_PLACEHOLDER_FOO 1, 0) -> 0
> 
> However, here SECCOMP_ARCH_NATIVE is an expression from an OR of some
> bits, and __is_defined(SECCOMP_ARCH_NATIVE) would not expand to
> __ARG_PLACEHOLDER_1 during any stage in the preprocessing.
> 
> Is there any better way to do this? I'm thinking of just doing #if
> defined(CONFIG_CHECKPOINT_RESTORE) || defined(SECCOMP_ARCH_NATIVE)
> like in Kee's patch.

Yeah, I think that's simplest.

-- 
Kees Cook


Re: [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow

2020-10-08 Thread YiFei Zhu
On Wed, Sep 30, 2020 at 10:20 AM YiFei Zhu  wrote:
> @@ -544,7 +577,8 @@ static struct seccomp_filter 
> *seccomp_prepare_filter(struct sock_fprog *fprog)
>  {
> struct seccomp_filter *sfilter;
> int ret;
> -   const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE);
> +   const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE) ||
> +  IS_ENABLED(CONFIG_SECCOMP_CACHE_NR_ONLY);
>
> if (fprog->len == 0 || fprog->len > BPF_MAXINSNS)
> return ERR_PTR(-EINVAL);

I'm trying to use __is_defined(SECCOMP_ARCH_NATIVE) here, and got this message:

kernel/seccomp.c: In function ‘seccomp_prepare_filter’:
././include/linux/kconfig.h:44:44: error: pasting "__ARG_PLACEHOLDER_"
and "(" does not give a valid preprocessing token
   44 | #define ___is_defined(val)  is_defined(__ARG_PLACEHOLDER_##val)
  |^~
././include/linux/kconfig.h:43:27: note: in expansion of macro ‘___is_defined’
   43 | #define __is_defined(x)   ___is_defined(x)
  |   ^
kernel/seccomp.c:629:11: note: in expansion of macro ‘__is_defined’
  629 |   __is_defined(SECCOMP_ARCH_NATIVE);
  |   ^~~~

Looking at the implementation of __is_defined, it is:

#define __ARG_PLACEHOLDER_1 0,
#define __take_second_arg(__ignored, val, ...) val
#define __is_defined(x) ___is_defined(x)
#define ___is_defined(val) is_defined(__ARG_PLACEHOLDER_##val)
#define is_defined(arg1_or_junk) __take_second_arg(arg1_or_junk 1, 0)

Hence, when FOO is defined to be 1, then the expansion would be
__is_defined(FOO) -> ___is_defined(1) ->
is_defined(__ARG_PLACEHOLDER_1) -> __take_second_arg(0, 1, 0) ->
1,
and when FOO is not defined, the expansion would be __is_defined(FOO)
-> ___is_defined(FOO) -> is_defined(__ARG_PLACEHOLDER_FOO) ->
__take_second_arg(__ARG_PLACEHOLDER_FOO 1, 0) -> 0

However, here SECCOMP_ARCH_NATIVE is an expression from an OR of some
bits, and __is_defined(SECCOMP_ARCH_NATIVE) would not expand to
__ARG_PLACEHOLDER_1 during any stage in the preprocessing.

Is there any better way to do this? I'm thinking of just doing #if
defined(CONFIG_CHECKPOINT_RESTORE) || defined(SECCOMP_ARCH_NATIVE)
like in Kee's patch.

YiFei Zhu


Re: [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow

2020-10-02 Thread YiFei Zhu
On Thu, Oct 1, 2020 at 4:05 PM Kees Cook  wrote:
> Right, but we depend on that test always doing the correct thing (and
> continuing to do so into the future). I'm looking at this from the
> perspective of future changes, maintenance, etc. I want the actions to
> match the design principles as closely as possible so that future
> evolutions of the code have lower risk to bugs causing security
> failures. Right now, the code is simple. I want to design this so that
> when it is complex, it will still fail toward safety in the face of
> bugs.
>
> I'd prefer this way because for the loop, the tests, and the results only
> make the bitmap more restrictive. The worst thing a bug in here can do is
> leave the bitmap unchanged (which is certainly bad), but it can't _undo_
> an earlier restriction.
>
> The proposed loop's leading test_bit() becomes only an optimization,
> rather than being required for policy enforcement.
>
> In other words, I prefer:
>
> inherit all prior prior bitmap restrictions
> for all syscalls
> if this filter not restricted
> continue
> set bitmap restricted
>
> within this loop (where the bulk of future logic may get added),
> the worse-case future bug-induced failure mode for the syscall
> bitmap is "skip *this* filter".
>
>
> Instead of:
>
> set bitmap all restricted
> for all syscalls
> if previous bitmap not restricted and
>filter not restricted
> set bitmap unrestricted
>
> within this loop the worst-case future bug-induced failure mode
> for the syscall bitmap is "skip *all* filters".
>
>
>
>
> Or, to reword again, this:
>
> retain restrictions from previous caching decisions
> for all syscalls
> [evaluate this filter, maybe continue]
> set restricted
>
> instead of:
>
> set new cache to all restricted
> for all syscalls
> [evaluate prior cache and this filter, maybe continue]
> set unrestricted
>
> I expect the future code changes for caching to be in the "evaluate"
> step, so I'd like the code designed to make things MORE restrictive not
> less from the start, and remove any prior cache state tests from the
> loop.
>
> At the end of the day I believe changing the design like this now lays
> the groundwork to the caching mechanism being more robust against having
> future bugs introduce security flaws.
>

I see. Makes sense. Thanks. Will do that in the v4

YiFei Zhu


Re: [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow

2020-10-01 Thread Jann Horn
On Thu, Oct 1, 2020 at 1:28 PM YiFei Zhu  wrote:
> On Wed, Sep 30, 2020 at 5:24 PM Jann Horn  wrote:
> > If you did the architecture enablement for X86 later in the series,
> > you could move this part over into that patch, that'd be cleaner.
>
> As in, patch 1: bitmap check logic. patch 2: emulator. patch 3: enable for 
> x86?

Yeah.


Re: [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow

2020-10-01 Thread Kees Cook
On Thu, Oct 01, 2020 at 06:52:50AM -0500, YiFei Zhu wrote:
> On Wed, Sep 30, 2020 at 5:40 PM Kees Cook  wrote:
> > The guiding principle with seccomp's designs is to always make things
> > _more_ restrictive, never less. While we can never escape the
> > consequences of having seccomp_is_const_allow() report the wrong
> > answer, we can at least follow the basic principles, hopefully
> > minimizing the impact.
> >
> > When the bitmap starts with "always allowed" and we only flip it towards
> > "run full filters", we're only ever making things more restrictive. If
> > we instead go from "run full filters" towards "always allowed", we run
> > the risk of making things less restrictive. For example: a process that
> > maliciously adds a filter that the emulator mistakenly evaluates to
> > "always allow" doesn't suddenly cause all the prior filters to stop running.
> > (i.e. this isolates the flaw outcome, and doesn't depend on the early
> > "do not emulate if we already know we have to run filters" case before
> > the emulation call: there is no code path that allows the cache to
> > weaken: it can only maintain it being wrong).
> >
> > Without any seccomp filter installed, all syscalls are "always allowed"
> > (from the perspective of the seccomp boundary), so the default of the
> > cache needs to be "always allowed".
> 
> I cannot follow this. If a 'process that maliciously adds a filter
> that the emulator mistakenly evaluates to "always allow" doesn't
> suddenly cause all the prior filters to stop running', hence, you
> want, by default, the cache to be as transparent as possible. You
> would lift the restriction if and only if you are absolutely sure it
> does not cause an impact.

Yes, right now, the v3 code pattern is entirely safe.

> 
> In this patch, if there are prior filters, it goes through this logic:
> 
> if (bitmap_prev && !test_bit(nr, bitmap_prev))
> continue;
> 
> Hence, if the malicious filter were to happen, and prior filters were
> supposed to run, then seccomp_is_const_allow is simply not invoked --
> what it returns cannot be used maliciously by an adversary.

Right, but we depend on that test always doing the correct thing (and
continuing to do so into the future). I'm looking at this from the
perspective of future changes, maintenance, etc. I want the actions to
match the design principles as closely as possible so that future
evolutions of the code have lower risk to bugs causing security
failures. Right now, the code is simple. I want to design this so that
when it is complex, it will still fail toward safety in the face of
bugs.

> > if (bitmap_prev) {
> > /* The new filter must be as restrictive as the last. */
> > bitmap_copy(bitmap, bitmap_prev, bitmap_size);
> > } else {
> > /* Before any filters, all syscalls are always allowed. */
> > bitmap_fill(bitmap, bitmap_size);
> > }
> >
> > for (nr = 0; nr < bitmap_size; nr++) {
> > /* No bitmap change: not a cacheable action. */
> > if (!test_bit(nr, bitmap_prev) ||
> > continue;
> >
> > /* No bitmap change: continue to always allow. */
> > if (seccomp_is_const_allow(fprog, ))
> > continue;
> >
> > /* Not a cacheable action: always run filters. */
> > clear_bit(nr, bitmap);
> 
> I'm not strongly against this logic. I just feel unconvinced that this
> is any different with a slightly increased complexity.

I'd prefer this way because for the loop, the tests, and the results only
make the bitmap more restrictive. The worst thing a bug in here can do is
leave the bitmap unchanged (which is certainly bad), but it can't _undo_
an earlier restriction.

The proposed loop's leading test_bit() becomes only an optimization,
rather than being required for policy enforcement.

In other words, I prefer:

inherit all prior prior bitmap restrictions
for all syscalls
if this filter not restricted
continue
set bitmap restricted

within this loop (where the bulk of future logic may get added),
the worse-case future bug-induced failure mode for the syscall
bitmap is "skip *this* filter".


Instead of:

set bitmap all restricted
for all syscalls
if previous bitmap not restricted and
   filter not restricted
set bitmap unrestricted

within this loop the worst-case future bug-induced failure mode
for the syscall bitmap is "skip *all* filters".




Or, to reword again, this:

retain restrictions from previous caching decisions
for all syscalls
[evaluate this filter, maybe continue]
set restricted

instead of:

set new cache to all restricted
for all 

Re: [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow

2020-10-01 Thread YiFei Zhu
On Wed, Sep 30, 2020 at 5:40 PM Kees Cook  wrote:
> I don't want this config: there is only 1 caching mechanism happening
> in this series and I do not want to have it buildable as "off": it
> should be available for all supported architectures. When further caching
> methods happen, the config can be introduced then (though I'll likely
> argue it should then be a boot param to allow distro kernels to make it
> selectable).

Alright, we can think about configuration (or boot param) when more
methods happen then.

> The guiding principle with seccomp's designs is to always make things
> _more_ restrictive, never less. While we can never escape the
> consequences of having seccomp_is_const_allow() report the wrong
> answer, we can at least follow the basic principles, hopefully
> minimizing the impact.
>
> When the bitmap starts with "always allowed" and we only flip it towards
> "run full filters", we're only ever making things more restrictive. If
> we instead go from "run full filters" towards "always allowed", we run
> the risk of making things less restrictive. For example: a process that
> maliciously adds a filter that the emulator mistakenly evaluates to
> "always allow" doesn't suddenly cause all the prior filters to stop running.
> (i.e. this isolates the flaw outcome, and doesn't depend on the early
> "do not emulate if we already know we have to run filters" case before
> the emulation call: there is no code path that allows the cache to
> weaken: it can only maintain it being wrong).
>
> Without any seccomp filter installed, all syscalls are "always allowed"
> (from the perspective of the seccomp boundary), so the default of the
> cache needs to be "always allowed".

I cannot follow this. If a 'process that maliciously adds a filter
that the emulator mistakenly evaluates to "always allow" doesn't
suddenly cause all the prior filters to stop running', hence, you
want, by default, the cache to be as transparent as possible. You
would lift the restriction if and only if you are absolutely sure it
does not cause an impact.

In this patch, if there are prior filters, it goes through this logic:

if (bitmap_prev && !test_bit(nr, bitmap_prev))
continue;

Hence, if the malicious filter were to happen, and prior filters were
supposed to run, then seccomp_is_const_allow is simply not invoked --
what it returns cannot be used maliciously by an adversary.

> if (bitmap_prev) {
> /* The new filter must be as restrictive as the last. */
> bitmap_copy(bitmap, bitmap_prev, bitmap_size);
> } else {
> /* Before any filters, all syscalls are always allowed. */
> bitmap_fill(bitmap, bitmap_size);
> }
>
> for (nr = 0; nr < bitmap_size; nr++) {
> /* No bitmap change: not a cacheable action. */
> if (!test_bit(nr, bitmap_prev) ||
> continue;
>
> /* No bitmap change: continue to always allow. */
> if (seccomp_is_const_allow(fprog, ))
> continue;
>
> /* Not a cacheable action: always run filters. */
> clear_bit(nr, bitmap);

I'm not strongly against this logic. I just feel unconvinced that this
is any different with a slightly increased complexity.

YiFei Zhu


Re: [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow

2020-10-01 Thread YiFei Zhu
On Wed, Sep 30, 2020 at 5:24 PM Jann Horn  wrote:
> If you did the architecture enablement for X86 later in the series,
> you could move this part over into that patch, that'd be cleaner.

As in, patch 1: bitmap check logic. patch 2: emulator. patch 3: enable for x86?

> > + * Tis struct is ordered to minimize padding holes.
>
> I think this comment can probably go away, there isn't really much
> trickery around padding holes in the struct as it is now.

Oh right, I was trying the locks and adding bits to indicate if
certain arches are primed, then I undid that.

> > +   set_bit(nr, bitmap);
>
> set_bit() is atomic, but since we only do this at filter setup, before
> the filter becomes globally visible, we don't need atomicity here. So
> this should probably use __set_bit() instead.

Right

YiFei Zhu


Re: [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow

2020-09-30 Thread Kees Cook
On Thu, Oct 01, 2020 at 12:24:32AM +0200, Jann Horn wrote:
> On Wed, Sep 30, 2020 at 5:20 PM YiFei Zhu  wrote:
> > SECCOMP_CACHE_NR_ONLY will only operate on syscalls that do not
> > access any syscall arguments or instruction pointer. To facilitate
> > this we need a static analyser to know whether a filter will
> > return allow regardless of syscall arguments for a given
> > architecture number / syscall number pair. This is implemented
> > here with a pseudo-emulator, and stored in a per-filter bitmap.
> >
> > Each common BPF instruction are emulated. Any weirdness or loading
> > from a syscall argument will cause the emulator to bail.
> >
> > The emulation is also halted if it reaches a return. In that case,
> > if it returns an SECCOMP_RET_ALLOW, the syscall is marked as good.
> >
> > Emulator structure and comments are from Kees [1] and Jann [2].
> >
> > Emulation is done at attach time. If a filter depends on more
> > filters, and if the dependee does not guarantee to allow the
> > syscall, then we skip the emulation of this syscall.
> >
> > [1] 
> > https://lore.kernel.org/lkml/20200923232923.3142503-5-keesc...@chromium.org/
> > [2] 
> > https://lore.kernel.org/lkml/CAG48ez1p=dR_2ikKq=xVxkoGg0fYpTBpkhJSv1w-6BG=76p...@mail.gmail.com/
> [...]
> > +static void seccomp_cache_prepare_bitmap(struct seccomp_filter *sfilter,
> > +void *bitmap, const void 
> > *bitmap_prev,
> > +size_t bitmap_size, int arch)
> > +{
> > +   struct sock_fprog_kern *fprog = sfilter->prog->orig_prog;
> > +   struct seccomp_data sd;
> > +   int nr;
> > +
> > +   for (nr = 0; nr < bitmap_size; nr++) {
> > +   if (bitmap_prev && !test_bit(nr, bitmap_prev))
> > +   continue;
> > +
> > +   sd.nr = nr;
> > +   sd.arch = arch;
> > +
> > +   if (seccomp_emu_is_const_allow(fprog, ))
> > +   set_bit(nr, bitmap);
> 
> set_bit() is atomic, but since we only do this at filter setup, before
> the filter becomes globally visible, we don't need atomicity here. So
> this should probably use __set_bit() instead.

Oh yes, excellent point! That will speed this up a bit. When you do
this, please include a comment here describing why its safe to do it
non-atomic. :)

-- 
Kees Cook


Re: [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow

2020-09-30 Thread Kees Cook
On Wed, Sep 30, 2020 at 10:19:13AM -0500, YiFei Zhu wrote:
> From: YiFei Zhu 
> 
> SECCOMP_CACHE_NR_ONLY will only operate on syscalls that do not
> access any syscall arguments or instruction pointer. To facilitate
> this we need a static analyser to know whether a filter will
> return allow regardless of syscall arguments for a given
> architecture number / syscall number pair. This is implemented
> here with a pseudo-emulator, and stored in a per-filter bitmap.
> 
> Each common BPF instruction are emulated. Any weirdness or loading
> from a syscall argument will cause the emulator to bail.
> 
> The emulation is also halted if it reaches a return. In that case,
> if it returns an SECCOMP_RET_ALLOW, the syscall is marked as good.
> 
> Emulator structure and comments are from Kees [1] and Jann [2].
> 
> Emulation is done at attach time. If a filter depends on more
> filters, and if the dependee does not guarantee to allow the
> syscall, then we skip the emulation of this syscall.
> 
> [1] 
> https://lore.kernel.org/lkml/20200923232923.3142503-5-keesc...@chromium.org/
> [2] 
> https://lore.kernel.org/lkml/CAG48ez1p=dR_2ikKq=xVxkoGg0fYpTBpkhJSv1w-6BG=76p...@mail.gmail.com/
> 
> Signed-off-by: YiFei Zhu 

See comments on patch 3 for reorganizing this a bit for the next
version.

For the infrastructure patch, I'd like to see much of the cover letter
in the commit log (otherwise those details are harder for people to
find). That will describe the _why_ for preparing this change, etc.

For the emulator patch, I'd like to see the discussion about how the
subset of BFP instructions was selected, what libraries  Jann and I
examined, etc.

(For all of these commit logs, I try to pretend that whoever is reading
it has not followed any lkml thread of discussion, etc.)

> ---
>  arch/Kconfig |  34 ++
>  arch/x86/Kconfig |   1 +
>  kernel/seccomp.c | 167 ++-
>  3 files changed, 201 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 21a3675a7a3a..ca867b2a5d71 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -471,6 +471,14 @@ config HAVE_ARCH_SECCOMP_FILTER
>   results in the system call being skipped immediately.
> - seccomp syscall wired up
>  
> +config HAVE_ARCH_SECCOMP_CACHE_NR_ONLY
> + bool
> + help
> +   An arch should select this symbol if it provides all of these things:
> +   - all the requirements for HAVE_ARCH_SECCOMP_FILTER
> +   - SECCOMP_ARCH_DEFAULT
> +   - SECCOMP_ARCH_DEFAULT_NR
> +

There's no need for this config and the per-arch Kconfig clutter:
SECCOMP_ARCH_NATIVE will be a sufficient gate.

>  config SECCOMP
>   prompt "Enable seccomp to safely execute untrusted bytecode"
>   def_bool y
> @@ -498,6 +506,32 @@ config SECCOMP_FILTER
>  
> See Documentation/userspace-api/seccomp_filter.rst for details.
>  
> +choice
> + prompt "Seccomp filter cache"
> + default SECCOMP_CACHE_NONE
> + depends on SECCOMP_FILTER
> + depends on HAVE_ARCH_SECCOMP_CACHE_NR_ONLY
> + help
> +   Seccomp filters can potentially incur large overhead for each
> +   system call. This can alleviate some of the overhead.
> +
> +   If in doubt, select 'syscall numbers only'.
> +
> +config SECCOMP_CACHE_NONE
> + bool "None"
> + help
> +   No caching is done. Seccomp filters will be called each time
> +   a system call occurs in a seccomp-guarded task.
> +
> +config SECCOMP_CACHE_NR_ONLY
> + bool "Syscall number only"
> + depends on HAVE_ARCH_SECCOMP_CACHE_NR_ONLY
> + help
> +   For each syscall number, if the seccomp filter has a fixed
> +   result, store that result in a bitmap to speed up system calls.
> +
> +endchoice

I don't want this config: there is only 1 caching mechanism happening
in this series and I do not want to have it buildable as "off": it
should be available for all supported architectures. When further caching
methods happen, the config can be introduced then (though I'll likely
argue it should then be a boot param to allow distro kernels to make it
selectable).

> +
>  config HAVE_ARCH_STACKLEAK
>   bool
>   help
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 1ab22869a765..ff5289228ea5 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -150,6 +150,7 @@ config X86
>   select HAVE_ARCH_COMPAT_MMAP_BASES  if MMU && COMPAT
>   select HAVE_ARCH_PREL32_RELOCATIONS
>   select HAVE_ARCH_SECCOMP_FILTER
> + select HAVE_ARCH_SECCOMP_CACHE_NR_ONLY
>   select HAVE_ARCH_THREAD_STRUCT_WHITELIST
>   select HAVE_ARCH_STACKLEAK
>   select HAVE_ARCH_TRACEHOOK
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index ae6b40cc39f4..f09c9e74ae05 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -143,6 +143,37 @@ struct notification {
>   struct list_head notifications;
>  };
>  
> +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY
> +/**
> + * struct 

Re: [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow

2020-09-30 Thread Jann Horn
On Wed, Sep 30, 2020 at 5:20 PM YiFei Zhu  wrote:
> SECCOMP_CACHE_NR_ONLY will only operate on syscalls that do not
> access any syscall arguments or instruction pointer. To facilitate
> this we need a static analyser to know whether a filter will
> return allow regardless of syscall arguments for a given
> architecture number / syscall number pair. This is implemented
> here with a pseudo-emulator, and stored in a per-filter bitmap.
>
> Each common BPF instruction are emulated. Any weirdness or loading
> from a syscall argument will cause the emulator to bail.
>
> The emulation is also halted if it reaches a return. In that case,
> if it returns an SECCOMP_RET_ALLOW, the syscall is marked as good.
>
> Emulator structure and comments are from Kees [1] and Jann [2].
>
> Emulation is done at attach time. If a filter depends on more
> filters, and if the dependee does not guarantee to allow the
> syscall, then we skip the emulation of this syscall.
>
> [1] 
> https://lore.kernel.org/lkml/20200923232923.3142503-5-keesc...@chromium.org/
> [2] 
> https://lore.kernel.org/lkml/CAG48ez1p=dR_2ikKq=xVxkoGg0fYpTBpkhJSv1w-6BG=76p...@mail.gmail.com/
[...]
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 1ab22869a765..ff5289228ea5 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -150,6 +150,7 @@ config X86
> select HAVE_ARCH_COMPAT_MMAP_BASES  if MMU && COMPAT
> select HAVE_ARCH_PREL32_RELOCATIONS
> select HAVE_ARCH_SECCOMP_FILTER
> +   select HAVE_ARCH_SECCOMP_CACHE_NR_ONLY
> select HAVE_ARCH_THREAD_STRUCT_WHITELIST
> select HAVE_ARCH_STACKLEAK
> select HAVE_ARCH_TRACEHOOK

If you did the architecture enablement for X86 later in the series,
you could move this part over into that patch, that'd be cleaner.

> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index ae6b40cc39f4..f09c9e74ae05 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -143,6 +143,37 @@ struct notification {
> struct list_head notifications;
>  };
>
> +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY
> +/**
> + * struct seccomp_cache_filter_data - container for cache's per-filter data
> + *
> + * Tis struct is ordered to minimize padding holes.

I think this comment can probably go away, there isn't really much
trickery around padding holes in the struct as it is now.

> + * @syscall_allow_default: A bitmap where each bit represents whether the
> + *filter willalways allow the syscall, for the

nit: s/willalways/will always/

[...]
> +static void seccomp_cache_prepare_bitmap(struct seccomp_filter *sfilter,
> +void *bitmap, const void 
> *bitmap_prev,
> +size_t bitmap_size, int arch)
> +{
> +   struct sock_fprog_kern *fprog = sfilter->prog->orig_prog;
> +   struct seccomp_data sd;
> +   int nr;
> +
> +   for (nr = 0; nr < bitmap_size; nr++) {
> +   if (bitmap_prev && !test_bit(nr, bitmap_prev))
> +   continue;
> +
> +   sd.nr = nr;
> +   sd.arch = arch;
> +
> +   if (seccomp_emu_is_const_allow(fprog, ))
> +   set_bit(nr, bitmap);

set_bit() is atomic, but since we only do this at filter setup, before
the filter becomes globally visible, we don't need atomicity here. So
this should probably use __set_bit() instead.