Re: [PATCH v8 3/6] stack: Optionally randomize kernel stack offset each syscall

2021-04-01 Thread Kees Cook
On Thu, Apr 01, 2021 at 11:15:43AM +0000, David Laight wrote:
> From: Will Deacon
> > Sent: 01 April 2021 09:31
> ...
> > > +/*
> > > + * These macros must be used during syscall entry when interrupts and
> > > + * preempt are disabled, and after user registers have been stored to
> > > + * the stack.
> > > + */
> > > +#define add_random_kstack_offset() do {				\
> > > +	if (static_branch_maybe(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT, \
> > > +				&randomize_kstack_offset)) {		\
> > > +		u32 offset = __this_cpu_read(kstack_offset);		\
> > > +		u8 *ptr = __builtin_alloca(KSTACK_OFFSET_MAX(offset));	\
> > > +		asm volatile("" : "=m"(*ptr) :: "memory");		\
> > 
> > Using the "m" constraint here is dangerous if you don't actually evaluate it
> > inside the asm. For example, if the compiler decides to generate an
> > addressing mode relative to the stack but with writeback (autodecrement), 
> > then
> > the stack pointer will be off by 8 bytes. Can you use "o" instead?

I see other examples of empty asm, but it's true, none are using "=m"
output constraints. But, yes, using "o" appears to work happily.

> Is it allowed to use such a mode?
> It would have to know that the "m" was substituted exactly once.
> I think there are quite a few examples with 'strange' uses of memory
> asm arguments.
> 
> However, in this case, isn't it enough to ensure the address is 'saved'?
> So:
>   asm volatile("" : "=r"(ptr) );
> should be enough.

It isn't, it seems.

Here's a comparison:

https://godbolt.org/z/xYGn9GfGY
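
For readers without the link, here is a minimal standalone sketch of the
same comparison (my reconstruction, not from the thread; assumes x86-64
GCC at -O2, and the function names are illustrative):

    /* Build with: gcc -O2 -S constraints.c */

    void with_m(unsigned long offset)
    {
            unsigned char *ptr = __builtin_alloca(offset & 0x3FF);
            /* "=m" keeps the allocation, but lets the compiler pick any
             * memory addressing mode for *ptr, including (on some
             * targets) auto-modifying ones that the empty template
             * never actually executes. */
            asm volatile("" : "=m"(*ptr) :: "memory");
    }

    void with_r(unsigned long offset)
    {
            unsigned char *ptr = __builtin_alloca(offset & 0x3FF);
            /* "=r" only promises to overwrite ptr, so nothing forces
             * the compiler to materialize the alloca at all: the stack
             * adjustment can be optimized away. */
            asm volatile("" : "=r"(ptr));
    }

    void with_o(unsigned long offset)
    {
            unsigned char *ptr = __builtin_alloca(offset & 0x3FF);
            /* "=o" (offsettable memory) keeps the allocation alive and
             * rules out writeback/auto-increment addressing modes. */
            asm volatile("" : "=o"(*ptr) :: "memory");
    }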

So, I'll resend with "o", and with raw_cpu_*().
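
For reference, this is the shape the macro takes with those two changes
applied (a sketch matching what was ultimately merged; the v9 hunk itself
is not quoted in this digest):

    #define add_random_kstack_offset() do {                                \
            if (static_branch_maybe(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,\
                                    &randomize_kstack_offset)) {           \
                    u32 offset = raw_cpu_read(kstack_offset);              \
                    u8 *ptr = __builtin_alloca(KSTACK_OFFSET_MAX(offset)); \
                    /* Keep allocation even after "ptr" loses scope. */    \
                    asm volatile("" : "=o"(*ptr) :: "memory");             \
            }                                                              \
    } while (0)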

Thanks!

-- 
Kees Cook


RE: [PATCH v8 3/6] stack: Optionally randomize kernel stack offset each syscall

2021-04-01 Thread David Laight
From: Will Deacon
> Sent: 01 April 2021 09:31
...
> > +/*
> > + * These macros must be used during syscall entry when interrupts and
> > + * preempt are disabled, and after user registers have been stored to
> > + * the stack.
> > + */
> > +#define add_random_kstack_offset() do {				\
> > +	if (static_branch_maybe(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT, \
> > +				&randomize_kstack_offset)) {		\
> > +		u32 offset = __this_cpu_read(kstack_offset);		\
> > +		u8 *ptr = __builtin_alloca(KSTACK_OFFSET_MAX(offset));	\
> > +		asm volatile("" : "=m"(*ptr) :: "memory");		\
> 
> Using the "m" constraint here is dangerous if you don't actually evaluate it
> inside the asm. For example, if the compiler decides to generate an
> addressing mode relative to the stack but with writeback (autodecrement), then
> the stack pointer will be off by 8 bytes. Can you use "o" instead?

Is it allowed to use such a mode?
It would have to know that the "m" was substituted exactly once.
I think there are quite a few examples with 'strange' uses of memory
asm arguments.

However, in this case, isn't it enough to ensure the address is 'saved'?
So:
asm volatile("" : "=r"(ptr) );
should be enough.

David




Re: [PATCH v8 3/6] stack: Optionally randomize kernel stack offset each syscall

2021-04-01 Thread Will Deacon
On Tue, Mar 30, 2021 at 01:57:47PM -0700, Kees Cook wrote:
> diff --git a/include/linux/randomize_kstack.h b/include/linux/randomize_kstack.h
> new file mode 100644
> index 000000000000..351520803006
> --- /dev/null
> +++ b/include/linux/randomize_kstack.h
> @@ -0,0 +1,55 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +#ifndef _LINUX_RANDOMIZE_KSTACK_H
> +#define _LINUX_RANDOMIZE_KSTACK_H
> +
> +#include <linux/kernel.h>
> +#include <linux/jump_label.h>
> +#include <linux/percpu-defs.h>
> +
> +DECLARE_STATIC_KEY_MAYBE(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,
> +  randomize_kstack_offset);
> +DECLARE_PER_CPU(u32, kstack_offset);
> +
> +/*
> + * Do not use this anywhere else in the kernel. This is used here because
> + * it provides an arch-agnostic way to grow the stack with correct
> + * alignment. Also, since this use is being explicitly masked to a max of
> + * 10 bits, stack-clash style attacks are unlikely. For more details see
> + * "VLAs" in Documentation/process/deprecated.rst
> + * The asm statement is designed to convince the compiler to keep the
> + * allocation around even after "ptr" goes out of scope.
> + */
> +void *__builtin_alloca(size_t size);
> +/*
> + * Use, at most, 10 bits of entropy. We explicitly cap this to keep the
> + * "VLA" from being unbounded (see above). 10 bits leaves enough room for
> + * per-arch offset masks to reduce entropy (by removing higher bits, since
> + * high entropy may overly constrain usable stack space), and for
> + * compiler/arch-specific stack alignment to remove the lower bits.
> + */
> +#define KSTACK_OFFSET_MAX(x) ((x) & 0x3FF)
> +
> +/*
> + * These macros must be used during syscall entry when interrupts and
> + * preempt are disabled, and after user registers have been stored to
> + * the stack.
> + */
> +#define add_random_kstack_offset() do {				\
> +	if (static_branch_maybe(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT, \
> +				&randomize_kstack_offset)) {		\
> +		u32 offset = __this_cpu_read(kstack_offset);		\
> +		u8 *ptr = __builtin_alloca(KSTACK_OFFSET_MAX(offset));	\
> +		asm volatile("" : "=m"(*ptr) :: "memory");		\

Using the "m" constraint here is dangerous if you don't actually evaluate it
inside the asm. For example, if the compiler decides to generate an
addressing mode relative to the stack but with writeback (autodecrement), then
the stack pointer will be off by 8 bytes. Can you use "o" instead?
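
To make the failure mode concrete, a hypothetical illustration (arm64
pre-index syntax; the mail names no specific instruction):

    /*
     * Suppose the compiler satisfied "=m"(*ptr) with a writeback mode:
     *
     *     str  wzr, [sp, #-8]!    // store at sp-8, then sp -= 8
     *
     * The compiler assumes the asm uses this operand exactly once and
     * lays out the code after the asm for the updated sp. But the
     * template is empty, the writeback never happens, and everything
     * that follows runs with a stack pointer off by 8 bytes.
     */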

Will


Re: [PATCH v8 3/6] stack: Optionally randomize kernel stack offset each syscall

2021-04-01 Thread Kees Cook
On Thu, Apr 01, 2021 at 12:38:31AM +0200, Thomas Gleixner wrote:
> On Wed, Mar 31 2021 at 14:54, Kees Cook wrote:
> > On Wed, Mar 31, 2021 at 09:53:26AM +0200, Thomas Gleixner wrote:
> >> On Tue, Mar 30 2021 at 13:57, Kees Cook wrote:
> >> > +/*
> >> > + * Do not use this anywhere else in the kernel. This is used here because
> >> > + * it provides an arch-agnostic way to grow the stack with correct
> >> > + * alignment. Also, since this use is being explicitly masked to a max of
> >> > + * 10 bits, stack-clash style attacks are unlikely. For more details see
> >> > + * "VLAs" in Documentation/process/deprecated.rst
> >> > + * The asm statement is designed to convince the compiler to keep the
> >> > + * allocation around even after "ptr" goes out of scope.
> >> 
> >> Nit. That explanation of "ptr" might be better placed right at the
> >> add_random...() macro.
> >
> > Ah, yes! Fixed in v9.
> 
Hmm, looking at V9 the "ptr" thing got lost.

I put the comment inline in the macro directly above the asm().

> > Do you want to take this via -tip (and leave off the arm64 patch until
> > it is acked), or would you rather it go via arm64? (I've sent v9 now...)
> 
> Either way is fine.

Since the arm64 folks have been a bit busy, can you just put this in
-tip and leave off the arm64 patch for now?

Thanks!

-- 
Kees Cook


Re: [PATCH v8 3/6] stack: Optionally randomize kernel stack offset each syscall

2021-03-31 Thread Thomas Gleixner
On Wed, Mar 31 2021 at 14:54, Kees Cook wrote:
> On Wed, Mar 31, 2021 at 09:53:26AM +0200, Thomas Gleixner wrote:
>> On Tue, Mar 30 2021 at 13:57, Kees Cook wrote:
>> > +/*
>> > + * Do not use this anywhere else in the kernel. This is used here because
>> > + * it provides an arch-agnostic way to grow the stack with correct
>> > + * alignment. Also, since this use is being explicitly masked to a max of
>> > + * 10 bits, stack-clash style attacks are unlikely. For more details see
>> > + * "VLAs" in Documentation/process/deprecated.rst
>> > + * The asm statement is designed to convince the compiler to keep the
>> > + * allocation around even after "ptr" goes out of scope.
>> 
>> Nit. That explanation of "ptr" might be better placed right at the
>> add_random...() macro.
>
> Ah, yes! Fixed in v9.

Hmm, looking at V9 the "ptr" thing got lost.

> +/*
> + * Do not use this anywhere else in the kernel. This is used here because
> + * it provides an arch-agnostic way to grow the stack with correct
> + * alignment. Also, since this use is being explicitly masked to a max of
> + * 10 bits, stack-clash style attacks are unlikely. For more details see
> + * "VLAs" in Documentation/process/deprecated.rst
> + */
> +void *__builtin_alloca(size_t size);
> +/*
> + * Use, at most, 10 bits of entropy. We explicitly cap this to keep the
> + * "VLA" from being unbounded (see above). 10 bits leaves enough room for
> + * per-arch offset masks to reduce entropy (by removing higher bits, since
> + * high entropy may overly constrain usable stack space), and for
> + * compiler/arch-specific stack alignment to remove the lower bits.
> + */
> +#define KSTACK_OFFSET_MAX(x) ((x) & 0x3FF)
> +
> +/*
> + * These macros must be used during syscall entry when interrupts and
> + * preempt are disabled, and after user registers have been stored to
> + * the stack.
> + */
> +#define add_random_kstack_offset() do {				\

> Do you want to take this via -tip (and leave off the arm64 patch until
> it is acked), or would you rather it go via arm64? (I've sent v9 now...)

Either way is fine.

Thanks,

tglx


Re: [PATCH v8 3/6] stack: Optionally randomize kernel stack offset each syscall

2021-03-31 Thread Kees Cook
On Wed, Mar 31, 2021 at 09:53:26AM +0200, Thomas Gleixner wrote:
> On Tue, Mar 30 2021 at 13:57, Kees Cook wrote:
> > +/*
> > + * Do not use this anywhere else in the kernel. This is used here because
> > + * it provides an arch-agnostic way to grow the stack with correct
> > + * alignment. Also, since this use is being explicitly masked to a max of
> > + * 10 bits, stack-clash style attacks are unlikely. For more details see
> > + * "VLAs" in Documentation/process/deprecated.rst
> > + * The asm statement is designed to convince the compiler to keep the
> > + * allocation around even after "ptr" goes out of scope.
> 
> Nit. That explanation of "ptr" might be better placed right at the
> add_random...() macro.

Ah, yes! Fixed in v9.

> Other than that.
> 
> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

Thank you for the reviews!

Do you want to take this via -tip (and leave off the arm64 patch until
it is acked), or would you rather it go via arm64? (I've sent v9 now...)

-- 
Kees Cook


Re: [PATCH v8 3/6] stack: Optionally randomize kernel stack offset each syscall

2021-03-31 Thread Thomas Gleixner
On Tue, Mar 30 2021 at 13:57, Kees Cook wrote:
> +/*
> + * Do not use this anywhere else in the kernel. This is used here because
> + * it provides an arch-agnostic way to grow the stack with correct
> + * alignment. Also, since this use is being explicitly masked to a max of
> + * 10 bits, stack-clash style attacks are unlikely. For more details see
> + * "VLAs" in Documentation/process/deprecated.rst
> + * The asm statement is designed to convince the compiler to keep the
> + * allocation around even after "ptr" goes out of scope.

Nit. That explanation of "ptr" might be better placed right at the
add_random...() macro.

> + */
> +void *__builtin_alloca(size_t size);
> +/*
> + * Use, at most, 10 bits of entropy. We explicitly cap this to keep the
> + * "VLA" from being unbounded (see above). 10 bits leaves enough room for
> + * per-arch offset masks to reduce entropy (by removing higher bits, since
> + * high entropy may overly constrain usable stack space), and for
> + * compiler/arch-specific stack alignment to remove the lower bits.
> + */
> +#define KSTACK_OFFSET_MAX(x) ((x) & 0x3FF)
> +
> +/*
> + * These macros must be used during syscall entry when interrupts and
> + * preempt are disabled, and after user registers have been stored to
> + * the stack.
> + */
> > +#define add_random_kstack_offset() do {				\
> > +	if (static_branch_maybe(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT, \
> > +				&randomize_kstack_offset)) {		\
> > +		u32 offset = __this_cpu_read(kstack_offset);		\
> > +		u8 *ptr = __builtin_alloca(KSTACK_OFFSET_MAX(offset));	\
> > +		asm volatile("" : "=m"(*ptr) :: "memory");		\
> > +	}							\
> > +} while (0)

Other than that.

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>


[PATCH v8 3/6] stack: Optionally randomize kernel stack offset each syscall

2021-03-30 Thread Kees Cook
This provides the ability for architectures to enable kernel stack base
address offset randomization. This feature is controlled by the boot
param "randomize_kstack_offset=on/off", with its default value set by
CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT.

This feature is based on the original idea from the last public release
of PaX's RANDKSTACK feature: https://pax.grsecurity.net/docs/randkstack.txt
All the credit for the original idea goes to the PaX team. Note that
the design and implementation of this upstream randomize_kstack_offset
feature differs greatly from the RANDKSTACK feature (see below).

Reasoning for the feature:

This feature aims to make the various stack-based attacks that rely on
deterministic stack structure harder to carry out. We have had many such
attacks in the past (to name just a few):

https://jon.oberheide.org/files/infiltrate12-thestackisback.pdf
https://jon.oberheide.org/files/stackjacking-infiltrate11.pdf
https://googleprojectzero.blogspot.com/2016/06/exploiting-recursion-in-linux-kernel_20.html

As Linux kernel stack protections have been constantly improving
(vmap-based stack allocation with guard pages, removal of thread_info,
STACKLEAK), attackers have had to find new ways for their exploits
to work. They have done so, continuing to rely on the kernel's stack
determinism, in situations where VMAP_STACK and THREAD_INFO_IN_TASK_STRUCT
were not relevant. For example, the following recent attacks would have
been hampered if the stack offset was non-deterministic between syscalls:

https://repositorio-aberto.up.pt/bitstream/10216/125357/2/374717.pdf
(page 70: targeting the pt_regs copy with linear stack overflow)

https://a13xp0p0v.github.io/2020/02/15/CVE-2019-18683.html
(leaked stack address from one syscall as a target during next syscall)

The main idea is that since the stack offset is randomized on each system
call, it is harder for an attack to reliably land in any particular place
on the thread stack, even with address exposures, as the stack base will
change on the next syscall. Also, since randomization is performed after
placing pt_regs, the ptrace-based approach[1] to discover the randomized
offset during a long-running syscall should not be possible.

Design description:

During most of the kernel's execution, it runs on the "thread stack",
which is pretty deterministic in its structure: it is fixed in size,
and on every entry from userspace to kernel on a syscall the thread
stack starts construction from an address fetched from the per-cpu
cpu_current_top_of_stack variable. The first element to be pushed to the
thread stack is the pt_regs struct that stores all required CPU registers
and syscall parameters. Finally the specific syscall function is called,
with the stack being used as the kernel executes the resulting request.

The goal of the randomize_kstack_offset feature is to add a random offset
after the pt_regs has been pushed to the stack and before the rest of the
thread stack is used during the syscall processing, and to change it every
time a process issues a syscall. The source of randomness is currently
architecture-defined (but x86 is using the low byte of rdtsc()). Future
improvements using different entropy sources are possible, but are out of
scope for this patch. Furthermore, to add more unpredictability, new offsets
are chosen at the end of syscalls (the timing of which should be less
easy to measure from userspace than at syscall entry time), and stored
in a per-CPU variable, so that the life of the value does not stay
explicitly tied to a single task.
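
For context, this end-of-syscall reseeding is done by a companion macro
in the same header. A sketch (matching the merged form; the hunk is not
quoted in this digest), where "rand" is the arch-provided entropy:

    #define choose_random_kstack_offset(rand) do {                         \
            if (static_branch_maybe(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,\
                                    &randomize_kstack_offset)) {           \
                    u32 offset = raw_cpu_read(kstack_offset);              \
                    offset ^= (rand);                                      \
                    raw_cpu_write(kstack_offset, offset);                  \
            }                                                              \
    } while (0)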

As suggested by Andy Lutomirski, the offset is added using alloca()
and an empty asm() statement with an output constraint, since it avoids
changes to assembly syscall entry code, to the unwinder, and provides
correct stack alignment as defined by the compiler.

In order to make this available by default with zero performance impact
for those that don't want it, it is boot-time selectable with static
branches. This way, if the overhead is not wanted, it can just be
left turned off with no performance impact.
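
Concretely, the boot-time switch is a static key flipped by an early_param
handler. A sketch of the init/main.c side (matching the merged form; this
hunk is also not quoted in the digest):

    DEFINE_STATIC_KEY_MAYBE(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,
                            randomize_kstack_offset);
    DEFINE_PER_CPU(u32, kstack_offset);

    static int __init early_randomize_kstack_offset(char *buf)
    {
            int ret;
            bool bool_result;

            /* Parse "randomize_kstack_offset=on/off" (plus the usual
             * kstrtobool aliases such as 1/0 and yes/no). */
            ret = kstrtobool(buf, &bool_result);
            if (ret)
                    return ret;

            if (bool_result)
                    static_branch_enable(&randomize_kstack_offset);
            else
                    static_branch_disable(&randomize_kstack_offset);
            return 0;
    }
    early_param("randomize_kstack_offset", early_randomize_kstack_offset);

Booting with "randomize_kstack_offset=on" thus enables the offset
regardless of the Kconfig default, with the static branch patched once
at boot.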

The generated assembly for x86_64 with GCC looks like this:

...
ffffffff81003977: 65 8b 05 02 ea 00 7f  mov %gs:0x7f00ea02(%rip),%eax
                                        # 12380 <kstack_offset>
ffffffff8100397e: 25 ff 03 00 00        and $0x3ff,%eax
ffffffff81003983: 48 83 c0 0f           add $0xf,%rax
ffffffff81003987: 25 f8 07 00 00        and $0x7f8,%eax
ffffffff8100398c: 48 29 c4              sub %rax,%rsp
ffffffff8100398f: 48 8d 44 24 0f        lea 0xf(%rsp),%rax
ffffffff81003994: 48 83 e0 f0           and $0xfffffffffffffff0,%rax
...

As a result of the above stack alignment, this patch introduces about
5 bits of randomness after pt_regs is spilled to the thread stack on
x86_64, and 6 bits on x86_32 (since it has 1 fewer bit required for
stack alignment). The amount of entropy could be adjusted based on how
much of the stack space we wish to trade for security.

My measure of syscall performance overhead (on x86_64):