RE: [RFC PATCH] x86/entry/64: randomize kernel stack offset upon syscall

2019-03-29 Thread Reshetova, Elena
> On Thu, Mar 28, 2019 at 9:29 AM Andy Lutomirski  wrote:
> > Doesn’t this just leak some of the canary to user code through side channels?
> 
> Erf, yes, good point. Let's just use prandom and be done with it.

And here I have some numbers on this. Actually, prandom turned out to be pretty
fast, even when called on every syscall. See the numbers below:

1) lmbench: ./lat_syscall -N 100 null
base: Simple syscall: 0.1774 microseconds
random_offset (prandom_u32() every syscall): Simple syscall: 0.1822 microseconds
random_offset (prandom_u32() every 4th syscall): Simple syscall: 0.1844 microseconds

2)  Andy's tests, misc-tests: ./timing_test_64 10M sys_enosys
base: 10000000 loops in 1.62224s = 162.22 nsec / loop
random_offset (prandom_u32() every syscall): 10000000 loops in 1.66260s = 166.26 nsec / loop
random_offset (prandom_u32() every 4th syscall): 10000000 loops in 1.69300s = 169.30 nsec / loop

The second case is the one where prandom is called only once per 4 syscalls and the
unused random bits are preserved in a per-cpu buffer. As you can see, it is actually
slower (modulo my maybe-not-so-optimized code in prandom, see below) than calling it
every time, so I would vote for calling it every time, saving the hassle and avoiding
additional code in prandom.

Below is what I was calling instead of prandom_u32() to preserve the unused random bits
(net_rand_state_buffer is a new per-cpu buffer I added to save them). Note that I didn't
include the check for bytes >= sizeof(u32), since this was just a PoC to test the base
speed, but for the generic case it would be needed.

+void prandom_bytes_preserve(void *buf, size_t bytes)
+{
+	u32 *buffer = &get_cpu_var(net_rand_state_buffer);
+	u8 *ptr = buf;
+
+	if (!(*buffer)) {
+		/* Buffer is empty: refill it from the per-cpu prandom state. */
+		struct rnd_state *state = &get_cpu_var(net_rand_state);
+
+		if (bytes > 0) {
+			*buffer = prandom_u32_state(state);
+			do {
+				*ptr++ = (u8) *buffer;
+				bytes--;
+				*buffer >>= BITS_PER_BYTE;
+			} while (bytes > 0);
+		}
+		put_cpu_var(net_rand_state);
+		put_cpu_var(net_rand_state_buffer);
+	} else {
+		/* Hand out the bits left over from a previous refill. */
+		if (bytes > 0) {
+			do {
+				*ptr++ = (u8) *buffer;
+				bytes--;
+				*buffer >>= BITS_PER_BYTE;
+			} while (bytes > 0);
+		}
+		put_cpu_var(net_rand_state_buffer);
+	}
+}

I will send the first version of the patch (calling prandom_u32() every time)
shortly, in case anyone wants to double-check the performance implications.
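
For illustration only (a sketch, not the posted patch): the two variants measured
above could derive the stack offset roughly as follows. The function names and the
0xFF0 mask are assumptions taken from the earlier discussion; variant 2 uses the
buffering helper shown above.

/*
 * Minimal sketch of the two variants compared above; names and the mask
 * value are assumptions, not code from the actual patch.
 */
#include <linux/random.h>
#include <linux/types.h>

#define STACK_OFFSET_MASK 0xFF0	/* bits 4-11 randomized, low 4 bits zero */

void prandom_bytes_preserve(void *buf, size_t bytes);	/* helper shown above */

/* Variant 1: a fresh prandom_u32() on every syscall. */
static inline unsigned long stack_offset_every_syscall(void)
{
	return prandom_u32() & STACK_OFFSET_MASK;
}

/* Variant 2: one prandom_u32() per four syscalls, 8 bits consumed per call. */
static inline unsigned long stack_offset_buffered(void)
{
	u8 rnd;

	prandom_bytes_preserve(&rnd, 1);
	return ((unsigned long)rnd << 4) & STACK_OFFSET_MASK;
}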

Best Regards,
Elena.


Re: [RFC PATCH] x86/entry/64: randomize kernel stack offset upon syscall

2019-03-28 Thread Kees Cook
On Thu, Mar 28, 2019 at 9:29 AM Andy Lutomirski  wrote:
> Doesn’t this just leak some of the canary to user code through side channels?

Erf, yes, good point. Let's just use prandom and be done with it.

-- 
Kees Cook


Re: [RFC PATCH] x86/entry/64: randomize kernel stack offset upon syscall

2019-03-28 Thread Andy Lutomirski



> On Mar 28, 2019, at 8:45 AM, Kees Cook  wrote:
> 
>> On Tue, Mar 26, 2019 at 9:31 PM Andy Lutomirski  wrote:
>> 
>> On Tue, Mar 26, 2019 at 3:35 AM Reshetova, Elena
>>  wrote:
>>> 
> On Mon, Mar 18, 2019 at 1:16 PM Andy Lutomirski  wrote:
> On Mon, Mar 18, 2019 at 2:41 AM Elena Reshetova
>  wrote:
>> Performance:
>> 
>> 1) lmbench: ./lat_syscall -N 100 null
>>base: Simple syscall: 0.1774 microseconds
>>random_offset (rdtsc): Simple syscall: 0.1803 microseconds
>>random_offset (rdrand): Simple syscall: 0.3702 microseconds
>> 
>> 2)  Andy's tests, misc-tests: ./timing_test_64 10M sys_enosys
>>base: 10000000 loops in 1.62224s = 162.22 nsec / loop
>>random_offset (rdtsc): 10000000 loops in 1.64660s = 164.66 nsec / loop
>>random_offset (rdrand): 10000000 loops in 3.51315s = 351.32 nsec / loop
>> 
> 
> Egads!  RDTSC is nice and fast but probably fairly easy to defeat.
> RDRAND is awful.  I had hoped for better.
 
 RDRAND can also fail.
 
> So perhaps we need a little percpu buffer that collects 64 bits of
> randomness at a time, shifts out the needed bits, and refills the
> buffer when we run out.
 
 I'd like to avoid saving the _exact_ details of where the next offset
 will be, but if nothing else works, this should be okay. We can use 8
 bits at a time and call prandom_u32() every 4th call. Something like
 prandom_bytes(), but where it doesn't throw away the unused bytes.
>>> 
>>> Actually I think this would make the end result even worse security-wise
>>> than simply using rdtsc() on every syscall. Saving the randomness in a percpu
>>> buffer, which is probably easily accessible and can be probed if needed,
>>> would supply an attacker with much more knowledge about the next 3-4
>>> random offsets than what he would get if we use "weak" rdtsc. Given
>>> that for a successful exploit an attacker only needs to have the stack aligned
>>> once, having knowledge of the next 3-4 offsets sounds like a present to an
>>> exploit writer...  Additionally, it creates complexity around the code that I
>>> have trouble justifying with the "security" argument, because of the above...
> 
> That certainly solidifies my concern against saving randomness. :)
> 
>>> I have the patch now with alloca() and rdtsc() working, I can post it
>>> (albeit it is very simple), but I am really hesitating on adding the percpu
>>> buffer randomness storage to it...
>>> 
>> 
>> Hmm.  I guess it depends on what types of attack you care about.  I
>> bet that, if you do a bunch of iterations of mfence;rdtsc;syscall,
>> you'll discover that the offset between the user rdtsc and the
>> syscall's rdtsc has several values that occur with high probability.
> 
> How about rdtsc xor with the middle word of the stack canary? (to
> avoid the 0-byte) Something like:
> 
>rdtsc
>xorl [%gs:...canary], %rax
>andq  $__MAX_STACK_RANDOM_OFFSET, %rax
> 
> I need to look at the right way to reference the canary during that
> code. Andy might know off the top of his head. :)
> 

Doesn’t this just leak some of the canary to user code through side channels?

Re: [RFC PATCH] x86/entry/64: randomize kernel stack offset upon syscall

2019-03-28 Thread Kees Cook
On Tue, Mar 26, 2019 at 9:31 PM Andy Lutomirski  wrote:
>
> On Tue, Mar 26, 2019 at 3:35 AM Reshetova, Elena
>  wrote:
> >
> > > On Mon, Mar 18, 2019 at 1:16 PM Andy Lutomirski  wrote:
> > > > On Mon, Mar 18, 2019 at 2:41 AM Elena Reshetova
> > > >  wrote:
> > > > > Performance:
> > > > >
> > > > > 1) lmbench: ./lat_syscall -N 100 null
> > > > > base: Simple syscall: 0.1774 microseconds
> > > > > random_offset (rdtsc): Simple syscall: 0.1803 microseconds
> > > > > random_offset (rdrand): Simple syscall: 0.3702 microseconds
> > > > >
> > > > > 2)  Andy's tests, misc-tests: ./timing_test_64 10M sys_enosys
> > > > > base: 10000000 loops in 1.62224s = 162.22 nsec / loop
> > > > > random_offset (rdtsc): 10000000 loops in 1.64660s = 164.66 nsec / loop
> > > > > random_offset (rdrand): 10000000 loops in 3.51315s = 351.32 nsec / loop
> > > > >
> > > >
> > > > Egads!  RDTSC is nice and fast but probably fairly easy to defeat.
> > > > RDRAND is awful.  I had hoped for better.
> > >
> > > RDRAND can also fail.
> > >
> > > > So perhaps we need a little percpu buffer that collects 64 bits of
> > > > randomness at a time, shifts out the needed bits, and refills the
> > > > buffer when we run out.
> > >
> > > I'd like to avoid saving the _exact_ details of where the next offset
> > > will be, but if nothing else works, this should be okay. We can use 8
> > > bits at a time and call prandom_u32() every 4th call. Something like
> > > prandom_bytes(), but where it doesn't throw away the unused bytes.
> >
> > Actually I think this would make the end result even worse security-wise
> > than simply using rdtsc() on every syscall. Saving the randomness in a percpu
> > buffer, which is probably easily accessible and can be probed if needed,
> > would supply an attacker with much more knowledge about the next 3-4
> > random offsets than what he would get if we use "weak" rdtsc. Given
> > that for a successful exploit an attacker only needs to have the stack aligned
> > once, having knowledge of the next 3-4 offsets sounds like a present to an
> > exploit writer...  Additionally, it creates complexity around the code that I
> > have trouble justifying with the "security" argument, because of the above...

That certainly solidifies my concern against saving randomness. :)

> > I have the patch now with alloca() and rdtsc() working, I can post it
> > (albeit it is very simple), but I am really hesitating on adding the percpu
> > buffer randomness storage to it...
> >
>
> Hmm.  I guess it depends on what types of attack you care about.  I
> bet that, if you do a bunch of iterations of mfence;rdtsc;syscall,
> you'll discover that the offset between the user rdtsc and the
> syscall's rdtsc has several values that occur with high probability.

How about rdtsc xor with the middle word of the stack canary? (to
avoid the 0-byte) Something like:

rdtsc
xorl [%gs:...canary], %rax
andq  $__MAX_STACK_RANDOM_OFFSET, %rax

I need to look at the right way to reference the canary during that
code. Andy might know off the top of his head. :)

-Kees

-- 
Kees Cook


Re: [RFC PATCH] x86/entry/64: randomize kernel stack offset upon syscall

2019-03-26 Thread Andy Lutomirski
On Tue, Mar 26, 2019 at 3:35 AM Reshetova, Elena
 wrote:
>
> > On Mon, Mar 18, 2019 at 1:16 PM Andy Lutomirski  wrote:
> > > On Mon, Mar 18, 2019 at 2:41 AM Elena Reshetova
> > >  wrote:
> > > > Performance:
> > > >
> > > > 1) lmbench: ./lat_syscall -N 100 null
> > > > base: Simple syscall: 0.1774 microseconds
> > > > random_offset (rdtsc): Simple syscall: 0.1803 microseconds
> > > > random_offset (rdrand): Simple syscall: 0.3702 microseconds
> > > >
> > > > 2)  Andy's tests, misc-tests: ./timing_test_64 10M sys_enosys
> > > > base: 10000000 loops in 1.62224s = 162.22 nsec / loop
> > > > random_offset (rdtsc): 10000000 loops in 1.64660s = 164.66 nsec / loop
> > > > random_offset (rdrand): 10000000 loops in 3.51315s = 351.32 nsec / loop
> > > >
> > >
> > > Egads!  RDTSC is nice and fast but probably fairly easy to defeat.
> > > RDRAND is awful.  I had hoped for better.
> >
> > RDRAND can also fail.
> >
> > > So perhaps we need a little percpu buffer that collects 64 bits of
> > > randomness at a time, shifts out the needed bits, and refills the
> > > buffer when we run out.
> >
> > I'd like to avoid saving the _exact_ details of where the next offset
> > will be, but if nothing else works, this should be okay. We can use 8
> > bits at a time and call prandom_u32() every 4th call. Something like
> > prandom_bytes(), but where it doesn't throw away the unused bytes.
>
> Actually I think this would make the end result even worse security-wise
> than simply using rdtsc() on every syscall. Saving the randomness in a percpu
> buffer, which is probably easily accessible and can be probed if needed,
> would supply an attacker with much more knowledge about the next 3-4
> random offsets than what he would get if we use "weak" rdtsc. Given
> that for a successful exploit an attacker only needs to have the stack aligned
> once, having knowledge of the next 3-4 offsets sounds like a present to an
> exploit writer...  Additionally, it creates complexity around the code that I
> have trouble justifying with the "security" argument, because of the above...
>
> I have the patch now with alloca() and rdtsc() working, I can post it
> (albeit it is very simple), but I am really hesitating on adding the percpu
> buffer randomness storage to it...
>

Hmm.  I guess it depends on what types of attack you care about.  I
bet that, if you do a bunch of iterations of mfence;rdtsc;syscall,
you'll discover that the offset between the user rdtsc and the
syscall's rdtsc has several values that occur with high probability.
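
Purely as an illustration of this point (not code from the thread): a rough
userspace probe might look like the sketch below. The kernel's own rdtsc value is
not visible from userspace, but if the measured deltas cluster around only a few
values, an rdtsc-derived offset is largely predictable.

/*
 * Hypothetical user-side probe: measures rdtsc-to-rdtsc latency around a
 * cheap syscall.  A narrow distribution of deltas suggests the offset an
 * rdtsc-based scheme would pick is guessable.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <x86intrin.h>

#define SAMPLES 1000

int main(void)
{
	unsigned long long delta[SAMPLES];

	for (int i = 0; i < SAMPLES; i++) {
		_mm_mfence();			/* order the TSC read vs. earlier ops */
		unsigned long long t0 = __rdtsc();
		syscall(SYS_getpid);		/* stand-in for any cheap syscall */
		delta[i] = __rdtsc() - t0;
	}

	for (int i = 0; i < SAMPLES; i++)
		printf("%llu\n", delta[i]);
	return 0;
}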

--Andy


RE: [RFC PATCH] x86/entry/64: randomize kernel stack offset upon syscall

2019-03-26 Thread Reshetova, Elena
> On Mon, Mar 18, 2019 at 1:16 PM Andy Lutomirski  wrote:
> > On Mon, Mar 18, 2019 at 2:41 AM Elena Reshetova
> >  wrote:
> > > Performance:
> > >
> > > 1) lmbench: ./lat_syscall -N 100 null
> > > base: Simple syscall: 0.1774 microseconds
> > > random_offset (rdtsc): Simple syscall: 0.1803 microseconds
> > > random_offset (rdrand): Simple syscall: 0.3702 microseconds
> > >
> > > 2)  Andy's tests, misc-tests: ./timing_test_64 10M sys_enosys
> > > base: 10000000 loops in 1.62224s = 162.22 nsec / loop
> > > random_offset (rdtsc): 10000000 loops in 1.64660s = 164.66 nsec / loop
> > > random_offset (rdrand): 10000000 loops in 3.51315s = 351.32 nsec / loop
> > >
> >
> > Egads!  RDTSC is nice and fast but probably fairly easy to defeat.
> > RDRAND is awful.  I had hoped for better.
> 
> RDRAND can also fail.
> 
> > So perhaps we need a little percpu buffer that collects 64 bits of
> > randomness at a time, shifts out the needed bits, and refills the
> > buffer when we run out.
> 
> I'd like to avoid saving the _exact_ details of where the next offset
> will be, but if nothing else works, this should be okay. We can use 8
> bits at a time and call prandom_u32() every 4th call. Something like
> prandom_bytes(), but where it doesn't throw away the unused bytes.

Actually I think this would make the end result even worse security-wise
than simply using rdtsc() on every syscall. Saving the randomness in a percpu
buffer, which is probably easily accessible and can be probed if needed,
would supply an attacker with much more knowledge about the next 3-4
random offsets than what he would get if we use "weak" rdtsc. Given
that for a successful exploit an attacker only needs to have the stack aligned
once, having knowledge of the next 3-4 offsets sounds like a present to an
exploit writer...  Additionally, it creates complexity around the code that I
have trouble justifying with the "security" argument, because of the above...

I have the patch now with alloca() and rdtsc() working, I can post it 
(albeit it is very simple), but I am really hesitating on adding the percpu
buffer randomness storage to it...

Best Regards,
Elena.


Re: [RFC PATCH] x86/entry/64: randomize kernel stack offset upon syscall

2019-03-20 Thread Andy Lutomirski


> On Mar 20, 2019, at 4:12 AM, David Laight  wrote:
> 
> From: Andy Lutomirski
>> Sent: 18 March 2019 20:16
> ...
>>> As a result this patch introduces 8 bits of randomness
>>> (bits 4 - 11 are randomized, bits 0-3 must be zero due to stack alignment)
>>> after pt_regs location on the thread stack.
>>> The amount of randomness can be adjusted based on how much of the
>>> stack space we wish/can trade for security.
>> 
>> Why do you need four zero bits at the bottom?  x86_64 Linux only
>> maintains 8 byte stack alignment.
> 
> ISTR that the gcc developers arbitrarily changed the alignment
> a few years ago.
> If the stack is only 8 byte aligned and you allocate a variable that
> requires 16 byte alignment you need gcc to generate the extra stack
> frame to align the stack.
> I don't remember seeing the relevant gcc options on the linux
> gcc command lines.
> 


On older gcc, you *can’t* set the relevant command line options because gcc was
daft.  So we just crossed our fingers and hoped for the best.  On newer gcc,
we set the options.  Fortunately, 32-byte stack variable alignment works
regardless.

AFAIK x86_64 Linux has never aligned the stack to 16 bytes.

RE: [RFC PATCH] x86/entry/64: randomize kernel stack offset upon syscall

2019-03-20 Thread Reshetova, Elena
> On Mon, Mar 18, 2019 at 01:15:44PM -0700, Andy Lutomirski wrote:
> > On Mon, Mar 18, 2019 at 2:41 AM Elena Reshetova
> >  wrote:
> > >
> > > If CONFIG_RANDOMIZE_KSTACK_OFFSET is selected,
> > > the kernel stack offset is randomized upon each
> > > entry to a system call after fixed location of pt_regs
> > > struct.
> > >
> > > This feature is based on the original idea from
> > > the PaX's RANDKSTACK feature:
> > > https://pax.grsecurity.net/docs/randkstack.txt
> > > All the credits for the original idea goes to the PaX team.
> > > However, the design and implementation of
> > > RANDOMIZE_KSTACK_OFFSET differs greatly from the RANDKSTACK
> > > feature (see below).
> > >
> > > Reasoning for the feature:
> > >
> > > This feature aims to make considerably harder various
> > > stack-based attacks that rely on deterministic stack
> > > structure.
> > > We have had many of such attacks in past [1],[2],[3]
> > > (just to name few), and as Linux kernel stack protections
> > > have been constantly improving (vmap-based stack
> > > allocation with guard pages, removal of thread_info,
> > > STACKLEAK), attackers have to find new ways for their
> > > exploits to work.
> > >
> > > It is important to note that we currently cannot show
> > > a concrete attack that would be stopped by this new
> > > feature (given that other existing stack protections
> > > are enabled), so this is an attempt to be on a proactive
> > > side vs. catching up with existing successful exploits.
> > >
> > > The main idea is that since the stack offset is
> > > randomized upon each system call, it is very hard for
> > > attacker to reliably land in any particular place on
> > > the thread stack when attack is performed.
> > > Also, since randomization is performed *after* pt_regs,
> > > the ptrace-based approach to discover randomization
> > > offset during a long-running syscall should not be
> > > possible.
> > >
> > > [1] jon.oberheide.org/files/infiltrate12-thestackisback.pdf
> > > [2] jon.oberheide.org/files/stackjacking-infiltrate11.pdf
> > > [3] googleprojectzero.blogspot.com/2016/06/exploiting-
> > > recursion-in-linux-kernel_20.html
> 
> Now that thread_info is off the stack, and vmap stack guard pages exist,
> it's not clear to me what the benefit is.

Yes, as it says above, this is an attempt to be proactive vs. reactive.
We cannot show a concrete attack now that would succeed with the vmap
stack enabled, thread_info removed and other protections enabled.
However, the fact remains that the kernel thread stack is still very
deterministic, and this property has been utilized many times in attacks.
We don't know where creative attackers will go next and what they will
use to mount the next kernel stack-based attack, but I think it is just
a question of time. I don't believe we can claim that the Linux kernel
thread stack is currently immune to attacks.

So, if we can add a protection that is not invasive, in either code or
performance, and which might make the attacker's life considerably harder,
why not do it?

> 
> > > The main issue with this approach is that it slightly breaks the
> > > processing of last frame in the unwinder, so I have made a simple
> > > fix to the frame pointer unwinder (I guess others should be fixed
> > > similarly) and stack dump functionality to "jump" over the random hole
> > > at the end. My way of solving this is probably far from ideal,
> > > so I would really appreciate feedback on how to improve it.
> >
> > That's probably a question for Josh :)
> >
> > Another way to do the dirty work would be to do:
> >
> > char *ptr = alloca(offset);
> > asm volatile ("" :: "m" (*ptr));
> >
> > in do_syscall_64() and adjust compiler flags as needed to avoid warnings.  
> > Hmm.
> 
> I like the alloca() idea a lot.  If you do the stack adjustment in C,
> then everything should just work, with no custom hacks in entry code or
> the unwinders.

Ok, so maybe this is what I am going to try next then. 

Best Regards,
Elena.


RE: [RFC PATCH] x86/entry/64: randomize kernel stack offset upon syscall

2019-03-20 Thread Reshetova, Elena
Something is really weird with my Intel mail: it only now delivered all the
messages to me in one go, and I was thinking that I wasn't getting any feedback...

> > If CONFIG_RANDOMIZE_KSTACK_OFFSET is selected,
> > the kernel stack offset is randomized upon each
> > entry to a system call after fixed location of pt_regs
> > struct.
> >
> > This feature is based on the original idea from
> > the PaX's RANDKSTACK feature:
> > https://pax.grsecurity.net/docs/randkstack.txt
> > All the credits for the original idea goes to the PaX team.
> > However, the design and implementation of
> > RANDOMIZE_KSTACK_OFFSET differs greatly from the RANDKSTACK
> > feature (see below).
> >
> > Reasoning for the feature:
> >
> > This feature aims to make considerably harder various
> > stack-based attacks that rely on deterministic stack
> > structure.
> > We have had many of such attacks in past [1],[2],[3]
> > (just to name few), and as Linux kernel stack protections
> > have been constantly improving (vmap-based stack
> > allocation with guard pages, removal of thread_info,
> > STACKLEAK), attackers have to find new ways for their
> > exploits to work.
> >
> > It is important to note that we currently cannot show
> > a concrete attack that would be stopped by this new
> > feature (given that other existing stack protections
> > are enabled), so this is an attempt to be on a proactive
> > side vs. catching up with existing successful exploits.
> >
> > The main idea is that since the stack offset is
> > randomized upon each system call, it is very hard for
> > attacker to reliably land in any particular place on
> > the thread stack when attack is performed.
> > Also, since randomization is performed *after* pt_regs,
> > the ptrace-based approach to discover randomization
> > offset during a long-running syscall should not be
> > possible.
> >
> > [1] jon.oberheide.org/files/infiltrate12-thestackisback.pdf
> > [2] jon.oberheide.org/files/stackjacking-infiltrate11.pdf
> > [3] googleprojectzero.blogspot.com/2016/06/exploiting-
> > recursion-in-linux-kernel_20.html
> >
> > Design description:
> >
> > During most of the kernel's execution, it runs on the "thread
> > stack", which is allocated at fork.c/dup_task_struct() and stored in
> > a per-task variable (tsk->stack). Since stack is growing downward,
> > the stack top can be always calculated using task_top_of_stack(tsk)
> > function, which essentially returns an address of tsk->stack + stack
> > size. When VMAP_STACK is enabled, the thread stack is allocated from
> > vmalloc space.
> >
> > Thread stack is pretty deterministic on its structure - fixed in size,
> > and upon every entry from a userspace to kernel on a
> > syscall the thread stack is started to be constructed from an
> > address fetched from a per-cpu cpu_current_top_of_stack variable.
> > The first element to be pushed to the thread stack is the pt_regs struct
> > that stores all required CPU registers and sys call parameters.
> >
> > The goal of RANDOMIZE_KSTACK_OFFSET feature is to add a random offset
> > after the pt_regs has been pushed to the stack and the rest of thread
> > stack (used during the syscall processing) every time a process issues
> > a syscall. The source of randomness can be taken either from rdtsc or
> > rdrand with performance implications listed below. The value of random
> > offset is stored in a callee-saved register (r15 currently) and the
> > maximum size of random offset is defined by __MAX_STACK_RANDOM_OFFSET
> > value, which currently equals to 0xFF0.
> >
> > As a result this patch introduces 8 bits of randomness
> > (bits 4 - 11 are randomized, bits 0-3 must be zero due to stack alignment)
> > after pt_regs location on the thread stack.
> > The amount of randomness can be adjusted based on how much of the
> > stack space we wish/can trade for security.
> 
> Why do you need four zero bits at the bottom?  x86_64 Linux only
> maintains 8 byte stack alignment.

I have to check this: it looked to me like this is needed to avoid
alignment issues, but maybe that is my mistake.

> >
> > The main issue with this approach is that it slightly breaks the
> > processing of last frame in the unwinder, so I have made a simple
> > fix to the frame pointer unwinder (I guess others should be fixed
> > similarly) and stack dump functionality to "jump" over the random hole
> > at the end. My way of solving this is probably far from ideal,
> > so I would really appreciate feedback on how to improve it.
> 
> That's probably a question for Josh :)
> 
> Another way to do the dirty work would be to do:
> 
> char *ptr = alloca(offset);
> asm volatile ("" :: "m" (*ptr));
> 
> in do_syscall_64() and adjust compiler flags as needed to avoid warnings.  
> Hmm.

I was hoping to get away with an assembly-only approach and minimal
changes, but if this approach seems better to you and Josh,
then I guess I can do it this way.

> 
> >
> > Performance:
> >
> > 1) lmbench: ./lat_syscall -N 100 null
> > base: 

RE: [RFC PATCH] x86/entry/64: randomize kernel stack offset upon syscall

2019-03-20 Thread David Laight
From: Andy Lutomirski
> Sent: 18 March 2019 20:16
...
> > As a result this patch introduces 8 bits of randomness
> > (bits 4 - 11 are randomized, bits 0-3 must be zero due to stack alignment)
> > after pt_regs location on the thread stack.
> > The amount of randomness can be adjusted based on how much of the
> > stack space we wish/can trade for security.
> 
> Why do you need four zero bits at the bottom?  x86_64 Linux only
> maintains 8 byte stack alignment.

ISTR that the gcc developers arbitrarily changed the alignment
a few years ago.
If the stack is only 8 byte aligned and you allocate a variable that
requires 16 byte alignment you need gcc to generate the extra stack
frame to align the stack.
I don't remember seeing the relevant gcc options on the linux
gcc command lines.

David



Re: [RFC PATCH] x86/entry/64: randomize kernel stack offset upon syscall

2019-03-18 Thread Josh Poimboeuf
On Mon, Mar 18, 2019 at 01:15:44PM -0700, Andy Lutomirski wrote:
> On Mon, Mar 18, 2019 at 2:41 AM Elena Reshetova
>  wrote:
> >
> > If CONFIG_RANDOMIZE_KSTACK_OFFSET is selected,
> > the kernel stack offset is randomized upon each
> > entry to a system call after fixed location of pt_regs
> > struct.
> >
> > This feature is based on the original idea from
> > the PaX's RANDKSTACK feature:
> > https://pax.grsecurity.net/docs/randkstack.txt
> > All the credits for the original idea goes to the PaX team.
> > However, the design and implementation of
> > RANDOMIZE_KSTACK_OFFSET differs greatly from the RANDKSTACK
> > feature (see below).
> >
> > Reasoning for the feature:
> >
> > This feature aims to make considerably harder various
> > stack-based attacks that rely on deterministic stack
> > structure.
> > We have had many of such attacks in past [1],[2],[3]
> > (just to name few), and as Linux kernel stack protections
> > have been constantly improving (vmap-based stack
> > allocation with guard pages, removal of thread_info,
> > STACKLEAK), attackers have to find new ways for their
> > exploits to work.
> >
> > It is important to note that we currently cannot show
> > a concrete attack that would be stopped by this new
> > feature (given that other existing stack protections
> > are enabled), so this is an attempt to be on a proactive
> > side vs. catching up with existing successful exploits.
> >
> > The main idea is that since the stack offset is
> > randomized upon each system call, it is very hard for
> > attacker to reliably land in any particular place on
> > the thread stack when attack is performed.
> > Also, since randomization is performed *after* pt_regs,
> > the ptrace-based approach to discover randomization
> > offset during a long-running syscall should not be
> > possible.
> >
> > [1] jon.oberheide.org/files/infiltrate12-thestackisback.pdf
> > [2] jon.oberheide.org/files/stackjacking-infiltrate11.pdf
> > [3] googleprojectzero.blogspot.com/2016/06/exploiting-
> > recursion-in-linux-kernel_20.html

Now that thread_info is off the stack, and vmap stack guard pages exist,
it's not clear to me what the benefit is.

> > The main issue with this approach is that it slightly breaks the
> > processing of last frame in the unwinder, so I have made a simple
> > fix to the frame pointer unwinder (I guess others should be fixed
> > similarly) and stack dump functionality to "jump" over the random hole
> > at the end. My way of solving this is probably far from ideal,
> > so I would really appreciate feedback on how to improve it.
> 
> That's probably a question for Josh :)
> 
> Another way to do the dirty work would be to do:
> 
> char *ptr = alloca(offset);
> asm volatile ("" :: "m" (*ptr));
> 
> in do_syscall_64() and adjust compiler flags as needed to avoid warnings.  
> Hmm.

I like the alloca() idea a lot.  If you do the stack adjustment in C,
then everything should just work, with no custom hacks in entry code or
the unwinders.

> >  /*
> >   * This does 'call enter_from_user_mode' unless we can avoid it based on
> >   * kernel config or using the static jump infrastructure.
> > diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> > index 1f0efdb7b629..0816ec680c21 100644
> > --- a/arch/x86/entry/entry_64.S
> > +++ b/arch/x86/entry/entry_64.S
> > @@ -167,13 +167,19 @@ GLOBAL(entry_SYSCALL_64_after_hwframe)
> >
> > 	PUSH_AND_CLEAR_REGS rax=$-ENOSYS
> >
> > +	RANDOMIZE_KSTACK		/* stores randomized offset in r15 */
> > +
> > 	TRACE_IRQS_OFF
> >
> > 	/* IRQs are off. */
> > 	movq	%rax, %rdi
> > 	movq	%rsp, %rsi
> > +	sub	%r15, %rsp		/* subtract random offset from rsp */
> > 	call	do_syscall_64		/* returns with IRQs disabled */
> >
> > +	/* need to restore the gap */
> > +	add	%r15, %rsp		/* add random offset back to rsp */
> 
> Off the top of my head, the nicer way to approach this would be to
> change this such that mov %rbp, %rsp; popq %rbp or something like that
> will do the trick.  Then the unwinder could just see it as a regular
> frame.  Maybe Josh will have a better idea.

Yes, we could probably do something like that.  Though I think I'd much
rather do the alloca() thing.  

-- 
Josh


Re: [RFC PATCH] x86/entry/64: randomize kernel stack offset upon syscall

2019-03-18 Thread Kees Cook
On Mon, Mar 18, 2019 at 1:16 PM Andy Lutomirski  wrote:
> On Mon, Mar 18, 2019 at 2:41 AM Elena Reshetova
>  wrote:
> > Performance:
> >
> > 1) lmbench: ./lat_syscall -N 100 null
> > base: Simple syscall: 0.1774 microseconds
> > random_offset (rdtsc): Simple syscall: 0.1803 microseconds
> > random_offset (rdrand): Simple syscall: 0.3702 microseconds
> >
> > 2)  Andy's tests, misc-tests: ./timing_test_64 10M sys_enosys
> > base: 10000000 loops in 1.62224s = 162.22 nsec / loop
> > random_offset (rdtsc): 10000000 loops in 1.64660s = 164.66 nsec / loop
> > random_offset (rdrand): 10000000 loops in 3.51315s = 351.32 nsec / loop
> >
>
> Egads!  RDTSC is nice and fast but probably fairly easy to defeat.
> RDRAND is awful.  I had hoped for better.

RDRAND can also fail.

> So perhaps we need a little percpu buffer that collects 64 bits of
> randomness at a time, shifts out the needed bits, and refills the
> buffer when we run out.

I'd like to avoid saving the _exact_ details of where the next offset
will be, but if nothing else works, this should be okay. We can use 8
bits at a time and call prandom_u32() every 4th call. Something like
prandom_bytes(), but where it doesn't throw away the unused bytes.

-- 
Kees Cook


Re: [RFC PATCH] x86/entry/64: randomize kernel stack offset upon syscall

2019-03-18 Thread Andy Lutomirski
On Mon, Mar 18, 2019 at 2:41 AM Elena Reshetova
 wrote:
>
> If CONFIG_RANDOMIZE_KSTACK_OFFSET is selected,
> the kernel stack offset is randomized upon each
> entry to a system call after fixed location of pt_regs
> struct.
>
> This feature is based on the original idea from
> the PaX's RANDKSTACK feature:
> https://pax.grsecurity.net/docs/randkstack.txt
> All the credits for the original idea goes to the PaX team.
> However, the design and implementation of
> RANDOMIZE_KSTACK_OFFSET differs greatly from the RANDKSTACK
> feature (see below).
>
> Reasoning for the feature:
>
> This feature aims to make considerably harder various
> stack-based attacks that rely on deterministic stack
> structure.
> We have had many of such attacks in past [1],[2],[3]
> (just to name few), and as Linux kernel stack protections
> have been constantly improving (vmap-based stack
> allocation with guard pages, removal of thread_info,
> STACKLEAK), attackers have to find new ways for their
> exploits to work.
>
> It is important to note that we currently cannot show
> a concrete attack that would be stopped by this new
> feature (given that other existing stack protections
> are enabled), so this is an attempt to be on a proactive
> side vs. catching up with existing successful exploits.
>
> The main idea is that since the stack offset is
> randomized upon each system call, it is very hard for
> attacker to reliably land in any particular place on
> the thread stack when attack is performed.
> Also, since randomization is performed *after* pt_regs,
> the ptrace-based approach to discover randomization
> offset during a long-running syscall should not be
> possible.
>
> [1] jon.oberheide.org/files/infiltrate12-thestackisback.pdf
> [2] jon.oberheide.org/files/stackjacking-infiltrate11.pdf
> [3] googleprojectzero.blogspot.com/2016/06/exploiting-
> recursion-in-linux-kernel_20.html
>
> Design description:
>
> During most of the kernel's execution, it runs on the "thread
> stack", which is allocated at fork.c/dup_task_struct() and stored in
> a per-task variable (tsk->stack). Since stack is growing downward,
> the stack top can be always calculated using task_top_of_stack(tsk)
> function, which essentially returns an address of tsk->stack + stack
> size. When VMAP_STACK is enabled, the thread stack is allocated from
> vmalloc space.
>
> Thread stack is pretty deterministic on its structure - fixed in size,
> and upon every entry from a userspace to kernel on a
> syscall the thread stack is started to be constructed from an
> address fetched from a per-cpu cpu_current_top_of_stack variable.
> The first element to be pushed to the thread stack is the pt_regs struct
> that stores all required CPU registers and sys call parameters.
>
> The goal of RANDOMIZE_KSTACK_OFFSET feature is to add a random offset
> after the pt_regs has been pushed to the stack and the rest of thread
> stack (used during the syscall processing) every time a process issues
> a syscall. The source of randomness can be taken either from rdtsc or
> rdrand with performance implications listed below. The value of random
> offset is stored in a callee-saved register (r15 currently) and the
> maximum size of random offset is defined by __MAX_STACK_RANDOM_OFFSET
> value, which currently equals to 0xFF0.
>
> As a result this patch introduces 8 bits of randomness
> (bits 4 - 11 are randomized, bits 0-3 must be zero due to stack alignment)
> after pt_regs location on the thread stack.
> The amount of randomness can be adjusted based on how much of the
> stack space we wish/can trade for security.

Why do you need four zero bits at the bottom?  x86_64 Linux only
maintains 8 byte stack alignment.

>
> The main issue with this approach is that it slightly breaks the
> processing of last frame in the unwinder, so I have made a simple
> fix to the frame pointer unwinder (I guess others should be fixed
> similarly) and stack dump functionality to "jump" over the random hole
> at the end. My way of solving this is probably far from ideal,
> so I would really appreciate feedback on how to improve it.

That's probably a question for Josh :)

Another way to do the dirty work would be to do:

char *ptr = alloca(offset);
asm volatile ("" :: "m" (*ptr));

in do_syscall_64() and adjust compiler flags as needed to avoid warnings.  Hmm.
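
To make the suggestion concrete, here is a hypothetical sketch of where that would
sit; this is not the real do_syscall_64() and not a posted patch, and
choose_random_offset() is an assumed stand-in for whichever randomness source ends
up being used.

#define __MAX_STACK_RANDOM_OFFSET 0xFF0		/* value used elsewhere in the thread */

unsigned long choose_random_offset(void);	/* assumed helper: rdtsc/rdrand/prandom */

static void syscall_entry_sketch(void)
{
	unsigned long offset = choose_random_offset() & __MAX_STACK_RANDOM_OFFSET;
	char *ptr = __builtin_alloca(offset);

	/* Touch the allocation so the compiler cannot optimize the hole away. */
	asm volatile("" :: "m" (*ptr));

	/* ... the regular syscall dispatch then runs on the shifted stack ... */
}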

>
> Performance:
>
> 1) lmbench: ./lat_syscall -N 100 null
> base: Simple syscall: 0.1774 microseconds
> random_offset (rdtsc): Simple syscall: 0.1803 microseconds
> random_offset (rdrand): Simple syscall: 0.3702 microseconds
>
> 2)  Andy's tests, misc-tests: ./timing_test_64 10M sys_enosys
> base: 10000000 loops in 1.62224s = 162.22 nsec / loop
> random_offset (rdtsc): 10000000 loops in 1.64660s = 164.66 nsec / loop
> random_offset (rdrand): 10000000 loops in 3.51315s = 351.32 nsec / loop
>

Egads!  RDTSC is nice and fast but probably fairly easy to defeat.
RDRAND is