Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete "struct pt_regs"

2015-02-26 Thread Andy Lutomirski
On Feb 26, 2015 1:55 AM, "Denys Vlasenko"  wrote:
>
> On Wed, Feb 25, 2015 at 10:59 PM, Andy Lutomirski  wrote:
> > On Wed, Feb 25, 2015 at 1:28 PM, Denys Vlasenko  wrote:
> >> On 02/25/2015 09:10 PM, Andy Lutomirski wrote:
> >> This part?
> >>
> >> .macro FORK_LIKE func
> >>  ENTRY(stub_\func)
> >> CFI_STARTPROC
> >> -   popq%r11/* save return address */
> >> -   PARTIAL_FRAME 0
> >> -   SAVE_REST
> >> -   pushq   %r11/* put it back on stack */
> >> +   DEFAULT_FRAME 0, 8  /* offset 8: return address */
> >> +   SAVE_EXTRA_REGS 8
> >> FIXUP_TOP_OF_STACK %r11, 8
> >> -   DEFAULT_FRAME 0 8   /* offset 8: return address */
> >> call sys_\func
> >> RESTORE_TOP_OF_STACK %r11, 8
> >> -   ret $REST_SKIP  /* pop extended registers */
> >> +   ret
> >> CFI_ENDPROC
> >>  END(stub_\func)
> >> .endm
> >>
> >> FORK_LIKE  clone
> >> FORK_LIKE  fork
> >> FORK_LIKE  vfork
> >>
> >> But the old code (SAVE_REST thing) was also saving registers here.
> >> It had to jump through hoops (pop return address, SAVE_REST,
> >> push return address) to do that.
> >> After the patch, "SAVE_EXTRA_REGS 8" does the same, just without
> >> pop/push pair.
> >>
> >> I just don't see what's wrong with it. Can you elaborate?
> >
> > SAVE_REST pushed the regs onto the stack, whereas SAVE_EXTRA_REGS just
> > writes them in place.  It's possible for this to be called when the
> > regs have already been saved.
>
> If that would be the case - that is, if SAVE_REST was saving extra copy
> of registers on stack, then FIXUP_TOP_OF_STACK %r11, 8 would be working
> on wrong locations. The "8" there says "we have full pt_regs on stack,
> plus extra 8 bytes (the return address)". Your conjecture would mean
> that in fact there would be more bytes on stack, and FIXUP_TOP_OF_STACK
> would corrupt iret stack. Evidently, since old code was not crashing,
> this wasn't happening. SAVE_REST was really creating the "tail" of pt_regs

Ugh, you're right.

The FIXUP_TOP_OF_STACK indeed looks duplicated, bit t that's less
harmful and was already the case.

--Andy
.
>
> In addition to my previous tests, I ran my home machine with
> patched kernel. Unfortunately, it works for me :(
>
> Will try on yet another machine.
>
> --
> vda
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete "struct pt_regs"

2015-02-26 Thread Sabrina Dubroca
2015-02-26, 14:54:33 +0100, Denys Vlasenko wrote:
> On Thu, Feb 26, 2015 at 1:11 PM, Denys Vlasenko
>  wrote:
> > On Thu, Feb 26, 2015 at 10:55 AM, Denys Vlasenko
> >  wrote:
> >> On Wed, Feb 25, 2015 at 10:59 PM, Andy Lutomirski  
> >> wrote:
> >> In addition to my previous tests, I ran my home machine with
> >> patched kernel. Unfortunately, it works for me :(
> >>
> >> Will try on yet another machine.
> >
> > And voila, it does happen on another machine :)
> >
> > I'm debugging it right now. Looks like 64-bit syscalls just stop working
> > at some point in new processes. That is, existing process is alive and well,
> > but children get SEGV after fork (most likely on any syscall64 they do,
> > not after fork per se. They eventually manage to kill themselves -
> > not trivial when exit syscall isn't working either - by tripping on HLT 
> > insn).
> >
> > 32-bit syscalls (int 80) continue to work. Fork, exec, whatever you want.
> > I have static 32-bit busybox binary and everything works there.
> >
> > Also, any 64-bit process which was under strace continues to work correctly,
> > including forks and execs.
> >
> > This points towards some bug on fast path sysret64 code. Looking for it.
> 
> audit=0 makes crashes disappear.

Ah, yes.

> I found the problem. If syscall_trace_enter_phase1 returns 0,
> I restore %rax from pt_regs->ax, but should restore it from
> pt_regs->orig_ax:
> 
> call syscall_trace_enter_phase1
> test %rax, %rax
> jnz tracesys_phase2 /* if needed, run the slow path */
> -   RESTORE_C_REGS  /* else restore clobbered regs */
> +   RESTORE_C_REGS_EXCEPT_RAX   /* else restore clobbered regs */
> +   movq ORIG_RAX-ARGOFFSET(%rsp),%rax
> jmp system_call_fastpath/*  and return to the fast path */

with s/-ARGOFFSET// on top of next-20150224, that works.

Thanks, Denys.

-- 
Sabrina
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete "struct pt_regs"

2015-02-26 Thread Denys Vlasenko
On Thu, Feb 26, 2015 at 1:11 PM, Denys Vlasenko
 wrote:
> On Thu, Feb 26, 2015 at 10:55 AM, Denys Vlasenko
>  wrote:
>> On Wed, Feb 25, 2015 at 10:59 PM, Andy Lutomirski  
>> wrote:
>> In addition to my previous tests, I ran my home machine with
>> patched kernel. Unfortunately, it works for me :(
>>
>> Will try on yet another machine.
>
> And voila, it does happen on another machine :)
>
> I'm debugging it right now. Looks like 64-bit syscalls just stop working
> at some point in new processes. That is, existing process is alive and well,
> but children get SEGV after fork (most likely on any syscall64 they do,
> not after fork per se. They eventually manage to kill themselves -
> not trivial when exit syscall isn't working either - by tripping on HLT insn).
>
> 32-bit syscalls (int 80) continue to work. Fork, exec, whatever you want.
> I have static 32-bit busybox binary and everything works there.
>
> Also, any 64-bit process which was under strace continues to work correctly,
> including forks and execs.
>
> This points towards some bug on fast path sysret64 code. Looking for it.

audit=0 makes crashes disappear.

I found the problem. If syscall_trace_enter_phase1 returns 0,
I restore %rax from pt_regs->ax, but should restore it from
pt_regs->orig_ax:

call syscall_trace_enter_phase1
test %rax, %rax
jnz tracesys_phase2 /* if needed, run the slow path */
-   RESTORE_C_REGS  /* else restore clobbered regs */
+   RESTORE_C_REGS_EXCEPT_RAX   /* else restore clobbered regs */
+   movq ORIG_RAX-ARGOFFSET(%rsp),%rax
jmp system_call_fastpath/*  and return to the fast path */
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete "struct pt_regs"

2015-02-26 Thread Denys Vlasenko
On Thu, Feb 26, 2015 at 10:55 AM, Denys Vlasenko
 wrote:
> On Wed, Feb 25, 2015 at 10:59 PM, Andy Lutomirski  wrote:
> In addition to my previous tests, I ran my home machine with
> patched kernel. Unfortunately, it works for me :(
>
> Will try on yet another machine.

And voila, it does happen on another machine :)

I'm debugging it right now. Looks like 64-bit syscalls just stop working
at some point in new processes. That is, existing process is alive and well,
but children get SEGV after fork (most likely on any syscall64 they do,
not after fork per se. They eventually manage to kill themselves -
not trivial when exit syscall isn't working either - by tripping on HLT insn).

32-bit syscalls (int 80) continue to work. Fork, exec, whatever you want.
I have static 32-bit busybox binary and everything works there.

Also, any 64-bit process which was under strace continues to work correctly,
including forks and execs.

This points towards some bug on fast path sysret64 code. Looking for it.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete "struct pt_regs"

2015-02-26 Thread Denys Vlasenko
On Wed, Feb 25, 2015 at 10:59 PM, Andy Lutomirski  wrote:
> On Wed, Feb 25, 2015 at 1:28 PM, Denys Vlasenko  wrote:
>> On 02/25/2015 09:10 PM, Andy Lutomirski wrote:
>> This part?
>>
>> .macro FORK_LIKE func
>>  ENTRY(stub_\func)
>> CFI_STARTPROC
>> -   popq%r11/* save return address */
>> -   PARTIAL_FRAME 0
>> -   SAVE_REST
>> -   pushq   %r11/* put it back on stack */
>> +   DEFAULT_FRAME 0, 8  /* offset 8: return address */
>> +   SAVE_EXTRA_REGS 8
>> FIXUP_TOP_OF_STACK %r11, 8
>> -   DEFAULT_FRAME 0 8   /* offset 8: return address */
>> call sys_\func
>> RESTORE_TOP_OF_STACK %r11, 8
>> -   ret $REST_SKIP  /* pop extended registers */
>> +   ret
>> CFI_ENDPROC
>>  END(stub_\func)
>> .endm
>>
>> FORK_LIKE  clone
>> FORK_LIKE  fork
>> FORK_LIKE  vfork
>>
>> But the old code (SAVE_REST thing) was also saving registers here.
>> It had to jump through hoops (pop return address, SAVE_REST,
>> push return address) to do that.
>> After the patch, "SAVE_EXTRA_REGS 8" does the same, just without
>> pop/push pair.
>>
>> I just don't see what's wrong with it. Can you elaborate?
>
> SAVE_REST pushed the regs onto the stack, whereas SAVE_EXTRA_REGS just
> writes them in place.  It's possible for this to be called when the
> regs have already been saved.

If that would be the case - that is, if SAVE_REST was saving extra copy
of registers on stack, then FIXUP_TOP_OF_STACK %r11, 8 would be working
on wrong locations. The "8" there says "we have full pt_regs on stack,
plus extra 8 bytes (the return address)". Your conjecture would mean
that in fact there would be more bytes on stack, and FIXUP_TOP_OF_STACK
would corrupt iret stack. Evidently, since old code was not crashing,
this wasn't happening. SAVE_REST was really creating the "tail" of pt_regs.

In addition to my previous tests, I ran my home machine with
patched kernel. Unfortunately, it works for me :(

Will try on yet another machine.

-- 
vda
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete struct pt_regs

2015-02-26 Thread Denys Vlasenko
On Thu, Feb 26, 2015 at 1:11 PM, Denys Vlasenko
vda.li...@googlemail.com wrote:
 On Thu, Feb 26, 2015 at 10:55 AM, Denys Vlasenko
 vda.li...@googlemail.com wrote:
 On Wed, Feb 25, 2015 at 10:59 PM, Andy Lutomirski l...@amacapital.net 
 wrote:
 In addition to my previous tests, I ran my home machine with
 patched kernel. Unfortunately, it works for me :(

 Will try on yet another machine.

 And voila, it does happen on another machine :)

 I'm debugging it right now. Looks like 64-bit syscalls just stop working
 at some point in new processes. That is, existing process is alive and well,
 but children get SEGV after fork (most likely on any syscall64 they do,
 not after fork per se. They eventually manage to kill themselves -
 not trivial when exit syscall isn't working either - by tripping on HLT insn).

 32-bit syscalls (int 80) continue to work. Fork, exec, whatever you want.
 I have static 32-bit busybox binary and everything works there.

 Also, any 64-bit process which was under strace continues to work correctly,
 including forks and execs.

 This points towards some bug on fast path sysret64 code. Looking for it.

audit=0 makes crashes disappear.

I found the problem. If syscall_trace_enter_phase1 returns 0,
I restore %rax from pt_regs-ax, but should restore it from
pt_regs-orig_ax:

call syscall_trace_enter_phase1
test %rax, %rax
jnz tracesys_phase2 /* if needed, run the slow path */
-   RESTORE_C_REGS  /* else restore clobbered regs */
+   RESTORE_C_REGS_EXCEPT_RAX   /* else restore clobbered regs */
+   movq ORIG_RAX-ARGOFFSET(%rsp),%rax
jmp system_call_fastpath/*  and return to the fast path */
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete struct pt_regs

2015-02-26 Thread Andy Lutomirski
On Feb 26, 2015 1:55 AM, Denys Vlasenko vda.li...@googlemail.com wrote:

 On Wed, Feb 25, 2015 at 10:59 PM, Andy Lutomirski l...@amacapital.net wrote:
  On Wed, Feb 25, 2015 at 1:28 PM, Denys Vlasenko dvlas...@redhat.com wrote:
  On 02/25/2015 09:10 PM, Andy Lutomirski wrote:
  This part?
 
  .macro FORK_LIKE func
   ENTRY(stub_\func)
  CFI_STARTPROC
  -   popq%r11/* save return address */
  -   PARTIAL_FRAME 0
  -   SAVE_REST
  -   pushq   %r11/* put it back on stack */
  +   DEFAULT_FRAME 0, 8  /* offset 8: return address */
  +   SAVE_EXTRA_REGS 8
  FIXUP_TOP_OF_STACK %r11, 8
  -   DEFAULT_FRAME 0 8   /* offset 8: return address */
  call sys_\func
  RESTORE_TOP_OF_STACK %r11, 8
  -   ret $REST_SKIP  /* pop extended registers */
  +   ret
  CFI_ENDPROC
   END(stub_\func)
  .endm
 
  FORK_LIKE  clone
  FORK_LIKE  fork
  FORK_LIKE  vfork
 
  But the old code (SAVE_REST thing) was also saving registers here.
  It had to jump through hoops (pop return address, SAVE_REST,
  push return address) to do that.
  After the patch, SAVE_EXTRA_REGS 8 does the same, just without
  pop/push pair.
 
  I just don't see what's wrong with it. Can you elaborate?
 
  SAVE_REST pushed the regs onto the stack, whereas SAVE_EXTRA_REGS just
  writes them in place.  It's possible for this to be called when the
  regs have already been saved.

 If that would be the case - that is, if SAVE_REST was saving extra copy
 of registers on stack, then FIXUP_TOP_OF_STACK %r11, 8 would be working
 on wrong locations. The 8 there says we have full pt_regs on stack,
 plus extra 8 bytes (the return address). Your conjecture would mean
 that in fact there would be more bytes on stack, and FIXUP_TOP_OF_STACK
 would corrupt iret stack. Evidently, since old code was not crashing,
 this wasn't happening. SAVE_REST was really creating the tail of pt_regs

Ugh, you're right.

The FIXUP_TOP_OF_STACK indeed looks duplicated, bit t that's less
harmful and was already the case.

--Andy
.

 In addition to my previous tests, I ran my home machine with
 patched kernel. Unfortunately, it works for me :(

 Will try on yet another machine.

 --
 vda
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete struct pt_regs

2015-02-26 Thread Sabrina Dubroca
2015-02-26, 14:54:33 +0100, Denys Vlasenko wrote:
 On Thu, Feb 26, 2015 at 1:11 PM, Denys Vlasenko
 vda.li...@googlemail.com wrote:
  On Thu, Feb 26, 2015 at 10:55 AM, Denys Vlasenko
  vda.li...@googlemail.com wrote:
  On Wed, Feb 25, 2015 at 10:59 PM, Andy Lutomirski l...@amacapital.net 
  wrote:
  In addition to my previous tests, I ran my home machine with
  patched kernel. Unfortunately, it works for me :(
 
  Will try on yet another machine.
 
  And voila, it does happen on another machine :)
 
  I'm debugging it right now. Looks like 64-bit syscalls just stop working
  at some point in new processes. That is, existing process is alive and well,
  but children get SEGV after fork (most likely on any syscall64 they do,
  not after fork per se. They eventually manage to kill themselves -
  not trivial when exit syscall isn't working either - by tripping on HLT 
  insn).
 
  32-bit syscalls (int 80) continue to work. Fork, exec, whatever you want.
  I have static 32-bit busybox binary and everything works there.
 
  Also, any 64-bit process which was under strace continues to work correctly,
  including forks and execs.
 
  This points towards some bug on fast path sysret64 code. Looking for it.
 
 audit=0 makes crashes disappear.

Ah, yes.

 I found the problem. If syscall_trace_enter_phase1 returns 0,
 I restore %rax from pt_regs-ax, but should restore it from
 pt_regs-orig_ax:
 
 call syscall_trace_enter_phase1
 test %rax, %rax
 jnz tracesys_phase2 /* if needed, run the slow path */
 -   RESTORE_C_REGS  /* else restore clobbered regs */
 +   RESTORE_C_REGS_EXCEPT_RAX   /* else restore clobbered regs */
 +   movq ORIG_RAX-ARGOFFSET(%rsp),%rax
 jmp system_call_fastpath/*  and return to the fast path */

with s/-ARGOFFSET// on top of next-20150224, that works.

Thanks, Denys.

-- 
Sabrina
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete struct pt_regs

2015-02-26 Thread Denys Vlasenko
On Wed, Feb 25, 2015 at 10:59 PM, Andy Lutomirski l...@amacapital.net wrote:
 On Wed, Feb 25, 2015 at 1:28 PM, Denys Vlasenko dvlas...@redhat.com wrote:
 On 02/25/2015 09:10 PM, Andy Lutomirski wrote:
 This part?

 .macro FORK_LIKE func
  ENTRY(stub_\func)
 CFI_STARTPROC
 -   popq%r11/* save return address */
 -   PARTIAL_FRAME 0
 -   SAVE_REST
 -   pushq   %r11/* put it back on stack */
 +   DEFAULT_FRAME 0, 8  /* offset 8: return address */
 +   SAVE_EXTRA_REGS 8
 FIXUP_TOP_OF_STACK %r11, 8
 -   DEFAULT_FRAME 0 8   /* offset 8: return address */
 call sys_\func
 RESTORE_TOP_OF_STACK %r11, 8
 -   ret $REST_SKIP  /* pop extended registers */
 +   ret
 CFI_ENDPROC
  END(stub_\func)
 .endm

 FORK_LIKE  clone
 FORK_LIKE  fork
 FORK_LIKE  vfork

 But the old code (SAVE_REST thing) was also saving registers here.
 It had to jump through hoops (pop return address, SAVE_REST,
 push return address) to do that.
 After the patch, SAVE_EXTRA_REGS 8 does the same, just without
 pop/push pair.

 I just don't see what's wrong with it. Can you elaborate?

 SAVE_REST pushed the regs onto the stack, whereas SAVE_EXTRA_REGS just
 writes them in place.  It's possible for this to be called when the
 regs have already been saved.

If that would be the case - that is, if SAVE_REST was saving extra copy
of registers on stack, then FIXUP_TOP_OF_STACK %r11, 8 would be working
on wrong locations. The 8 there says we have full pt_regs on stack,
plus extra 8 bytes (the return address). Your conjecture would mean
that in fact there would be more bytes on stack, and FIXUP_TOP_OF_STACK
would corrupt iret stack. Evidently, since old code was not crashing,
this wasn't happening. SAVE_REST was really creating the tail of pt_regs.

In addition to my previous tests, I ran my home machine with
patched kernel. Unfortunately, it works for me :(

Will try on yet another machine.

-- 
vda
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete struct pt_regs

2015-02-26 Thread Denys Vlasenko
On Thu, Feb 26, 2015 at 10:55 AM, Denys Vlasenko
vda.li...@googlemail.com wrote:
 On Wed, Feb 25, 2015 at 10:59 PM, Andy Lutomirski l...@amacapital.net wrote:
 In addition to my previous tests, I ran my home machine with
 patched kernel. Unfortunately, it works for me :(

 Will try on yet another machine.

And voila, it does happen on another machine :)

I'm debugging it right now. Looks like 64-bit syscalls just stop working
at some point in new processes. That is, existing process is alive and well,
but children get SEGV after fork (most likely on any syscall64 they do,
not after fork per se. They eventually manage to kill themselves -
not trivial when exit syscall isn't working either - by tripping on HLT insn).

32-bit syscalls (int 80) continue to work. Fork, exec, whatever you want.
I have static 32-bit busybox binary and everything works there.

Also, any 64-bit process which was under strace continues to work correctly,
including forks and execs.

This points towards some bug on fast path sysret64 code. Looking for it.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete "struct pt_regs"

2015-02-25 Thread Stephen Rothwell
Hi all,

On Wed, 25 Feb 2015 21:18:52 -0800 Andrew Morton  
wrote:
>
> On Thu, 26 Feb 2015 02:12:57 +0100 Denys Vlasenko  
> wrote:
> 
> > On Thu, Feb 26, 2015 at 12:34 AM, Sabrina Dubroca  
> > wrote:
> > > 2015-02-25, 23:40:55 +0100, Sabrina Dubroca wrote:
> > >> I can run some userspace programs, but I have no idea what would be
> > >> helpful.
> > >> I can also try booting a real machine with archlinux/systemd tomorrow.
> > >
> > > I got a good boot out of kernels that normally fail.  I booted
> > > systemd's emergency shell and enabled a few services, in the same
> > > order they normally start.  journald started cleanly, but after that,
> > > every single command produced a "traps:" output and an "audit:" line.
> > >
> > > I disabled systemd-journald (chmod -x, because `systemctl disable`
> > > didn't really disable it), and now it boots, no "traps:" in the log.
> > > If I run it, everything fails again (zsh has traps for simply pressing
> > > enter on an empty cmd).
> > 
> > That's some progress!
> > 
> > It's strange how one process manages to affect everything else.
> > 
> > "If I run it, everything fails again". How do you run it? Directly,
> > or via systemd services mechanism?
> > If you just run it directly, can you try running it under
> > "strace -f -tt -oLOG"? Does it have the same effect? What's in the LOG?
> 
> I'm hitting this bug as well, bisected to this commit.  On an old
> x64_64 box, no vms, paravirt, etc.  Running FC6 userspace (heh).
> 
> Quite late in initscripts, binaries start getting segmentation faults
> and init gives up.  Seems to only affect /usr/bin/rhgb-client.  There's
> one instance where /bin/rm is said to segfault, but I suspect that's
> init lying to me.

I note that that commit has been removed from today's version of the
luto-misc tree and thus linux-next.

-- 
Cheers,
Stephen Rothwells...@canb.auug.org.au


pgpu6vlymy8yz.pgp
Description: OpenPGP digital signature


Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete "struct pt_regs"

2015-02-25 Thread Andrew Morton
On Thu, 26 Feb 2015 02:12:57 +0100 Denys Vlasenko  
wrote:

> On Thu, Feb 26, 2015 at 12:34 AM, Sabrina Dubroca  
> wrote:
> > 2015-02-25, 23:40:55 +0100, Sabrina Dubroca wrote:
> >> I can run some userspace programs, but I have no idea what would be
> >> helpful.
> >> I can also try booting a real machine with archlinux/systemd tomorrow.
> >
> > I got a good boot out of kernels that normally fail.  I booted
> > systemd's emergency shell and enabled a few services, in the same
> > order they normally start.  journald started cleanly, but after that,
> > every single command produced a "traps:" output and an "audit:" line.
> >
> > I disabled systemd-journald (chmod -x, because `systemctl disable`
> > didn't really disable it), and now it boots, no "traps:" in the log.
> > If I run it, everything fails again (zsh has traps for simply pressing
> > enter on an empty cmd).
> 
> That's some progress!
> 
> It's strange how one process manages to affect everything else.
> 
> "If I run it, everything fails again". How do you run it? Directly,
> or via systemd services mechanism?
> If you just run it directly, can you try running it under
> "strace -f -tt -oLOG"? Does it have the same effect? What's in the LOG?

I'm hitting this bug as well, bisected to this commit.  On an old
x64_64 box, no vms, paravirt, etc.  Running FC6 userspace (heh).

Quite late in initscripts, binaries start getting segmentation faults
and init gives up.  Seems to only affect /usr/bin/rhgb-client.  There's
one instance where /bin/rm is said to segfault, but I suspect that's
init lying to me.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete "struct pt_regs"

2015-02-25 Thread Denys Vlasenko
On Thu, Feb 26, 2015 at 12:34 AM, Sabrina Dubroca  wrote:
> 2015-02-25, 23:40:55 +0100, Sabrina Dubroca wrote:
>> I can run some userspace programs, but I have no idea what would be
>> helpful.
>> I can also try booting a real machine with archlinux/systemd tomorrow.
>
> I got a good boot out of kernels that normally fail.  I booted
> systemd's emergency shell and enabled a few services, in the same
> order they normally start.  journald started cleanly, but after that,
> every single command produced a "traps:" output and an "audit:" line.
>
> I disabled systemd-journald (chmod -x, because `systemctl disable`
> didn't really disable it), and now it boots, no "traps:" in the log.
> If I run it, everything fails again (zsh has traps for simply pressing
> enter on an empty cmd).

That's some progress!

It's strange how one process manages to affect everything else.

"If I run it, everything fails again". How do you run it? Directly,
or via systemd services mechanism?
If you just run it directly, can you try running it under
"strace -f -tt -oLOG"? Does it have the same effect? What's in the LOG?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete "struct pt_regs"

2015-02-25 Thread Sabrina Dubroca
2015-02-25, 23:40:55 +0100, Sabrina Dubroca wrote:
> I can run some userspace programs, but I have no idea what would be
> helpful.
> I can also try booting a real machine with archlinux/systemd tomorrow.

I got a good boot out of kernels that normally fail.  I booted
systemd's emergency shell and enabled a few services, in the same
order they normally start.  journald started cleanly, but after that,
every single command produced a "traps:" output and an "audit:" line.

I disabled systemd-journald (chmod -x, because `systemctl disable`
didn't really disable it), and now it boots, no "traps:" in the log.
If I run it, everything fails again (zsh has traps for simply pressing
enter on an empty cmd).

-- 
Sabrina
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete "struct pt_regs"

2015-02-25 Thread Sabrina Dubroca
2015-02-25, 13:59:06 -0800, Andy Lutomirski wrote:
> On Wed, Feb 25, 2015 at 1:28 PM, Denys Vlasenko  wrote:
> > On 02/25/2015 09:10 PM, Andy Lutomirski wrote:
> >> On Wed, Feb 25, 2015 at 11:59 AM, Andrey Wagin  wrote:
> >>> 2015-02-25 21:42 GMT+03:00 Denys Vlasenko :
>  On 02/25/2015 01:37 PM, Andrey Wagin wrote:
> > 2015-02-13 0:54 GMT+03:00 Denys Vlasenko :
> >> 64-bit code was using six stack slots less by not saving/restoring
> >> registers which are callee-preserved according to C ABI,
> >> and not allocating space for them.
> >> Only when syscall needed a complete "struct pt_regs",
> >> the complete area was allocated and filled in.
> >> As an additional twist, on interrupt entry a "slightly less truncated 
> >> pt_regs"
> >> trick is used, to make nested interrupt stacks easier to unwind.
> >>
> >> This proved to be a source of significant obfuscation and subtle bugs.
> >> For example, stub_fork had to pop the return address,
> >> extend the struct, save registers, and push return address back. Ugly.
> >> ia32_ptregs_common pops return address and "returns" via jmp insn,
> >> throwing a wrench into CPU return stack cache.
> >>
> >> This patch changes code to always allocate a complete "struct pt_regs".
> >> The saving of registers is still done lazily.
> >>
> >> "Partial pt_regs" trick on interrupt stack is retained.
> >>
> >> Macros which manipulate "struct pt_regs" on stack are reworked:
> >> ALLOC_PT_GPREGS_ON_STACK allocates the structure.
> >> SAVE_C_REGS saves to it those registers which are clobbered by C code.
> >> SAVE_EXTRA_REGS saves to it all other registers.
> >> Corresponding RESTORE_* and REMOVE_PT_GPREGS_FROM_STACK macros reverse 
> >> it.
> >>
> >> ia32_ptregs_common, stub_fork and friends lost their ugly dance with
> >> return pointer.
> >>
> >> LOAD_ARGS32 in ia32entry.S now uses symbolic stack offsets
> >> instead of magic numbers.
> >>
> >> error_entry and save_paranoid now use SAVE_C_REGS + SAVE_EXTRA_REGS
> >> instead of having it open-coded yet again.
> >>
> >> Patch was run-tested: 64-bit executables, 32-bit executables,
> >> strace works.
> >> Timing tests did not show measurable difference in 32-bit
> >> and 64-bit syscalls.
> >
> > Hello Denys,
> >
> > My test vm doesn't boot with this patch. Could you help to investigate
> > this issue?
> 
>  I think I found it. This part of my patch is possibly wrong:
> 
>  @@ -171,9 +171,9 @@ static inline int arch_irqs_disabled(void)
>   #define ARCH_LOCKDEP_SYS_EXIT_IRQ  \
>  TRACE_IRQS_ON; \
>  sti; \
>  -   SAVE_REST; \
>  +   SAVE_EXTRA_REGS; \
>  LOCKDEP_SYS_EXIT; \
>  -   RESTORE_REST; \
>  +   RESTORE_EXTRA_REGS; \
>  cli; \
>  TRACE_IRQS_OFF;
> 
>  The "SAVE_REST" here is intended to really *push* extra regs on stack,
>  but the patch changed it so that they are written to existing stack
>  slots above.
> 
>  From code inspection it should work in almost all cases, but some
>  locations where it is used are really obscure.
> 
>  If there are places where *pushing* regs is really necessary,
>  this can corrupt rbp,rbx,r12-15 registers.
> 
>  Your config has CONFIG_LOCKDEP=y, I think it's worth trying whether the 
>  bug
>  was here.
>  Please find updated patch attached. Can you try it?
> >>>
> >>> It doesn't work
> >
> > Thanks for testing it anyway.
> >
> >
> >>> [3.016262] traps: systemd-cgroups[390] general protection
> >>> ip:7f456f7b6028 sp:7fffdc059718 error:0 in
> >>> ld-2.18.so[7f456f79e000+2]
> >
> > This is what I know about these crashes. The SEGV itself is caused by
> > HLT instruction executed by dynamic loader, ld-2.NN.so.
> > The instruction is in _exit function, and is only reachable if
> > exit_group and exit syscalls fail to terminate the process.
> > So it seems that syscall execution is getting badly broken somehow
> > at some point.
> >
> > This happens to both reporters.
> >
> > My theory that it is related to lockdep seems to be wrong, because
> > Sabrina's kernel is not lockdep-enabled, yet it sees the same failure.
> >
> > Both kernels are paravirtualized, both are booted under KVM,
> > Andrey runs it with four virtual CPUs, Sabrina runs with two.
> >
> > My next theory is that I missed something related to paravirt.
> > I am looking at that code, so far I don't see anything suspicious.
> >
> > Unfortunately, it doesn't happen to me: I have Sabrina's bzImage,
> > I run it under "qemu-system-x86_64 -enable-kvm -smp 2",
> > I see in dmesg that kernel does detect that it is being run under KVM,
> > but it works for me. No mysterious segfaults.
> >
> > Andrey, can you send me your bzImage? Maybe it will trigger
> > the problem for me.
> >

Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete "struct pt_regs"

2015-02-25 Thread Andy Lutomirski
On Wed, Feb 25, 2015 at 1:28 PM, Denys Vlasenko  wrote:
> On 02/25/2015 09:10 PM, Andy Lutomirski wrote:
>> On Wed, Feb 25, 2015 at 11:59 AM, Andrey Wagin  wrote:
>>> 2015-02-25 21:42 GMT+03:00 Denys Vlasenko :
 On 02/25/2015 01:37 PM, Andrey Wagin wrote:
> 2015-02-13 0:54 GMT+03:00 Denys Vlasenko :
>> 64-bit code was using six stack slots less by not saving/restoring
>> registers which are callee-preserved according to C ABI,
>> and not allocating space for them.
>> Only when syscall needed a complete "struct pt_regs",
>> the complete area was allocated and filled in.
>> As an additional twist, on interrupt entry a "slightly less truncated 
>> pt_regs"
>> trick is used, to make nested interrupt stacks easier to unwind.
>>
>> This proved to be a source of significant obfuscation and subtle bugs.
>> For example, stub_fork had to pop the return address,
>> extend the struct, save registers, and push return address back. Ugly.
>> ia32_ptregs_common pops return address and "returns" via jmp insn,
>> throwing a wrench into CPU return stack cache.
>>
>> This patch changes code to always allocate a complete "struct pt_regs".
>> The saving of registers is still done lazily.
>>
>> "Partial pt_regs" trick on interrupt stack is retained.
>>
>> Macros which manipulate "struct pt_regs" on stack are reworked:
>> ALLOC_PT_GPREGS_ON_STACK allocates the structure.
>> SAVE_C_REGS saves to it those registers which are clobbered by C code.
>> SAVE_EXTRA_REGS saves to it all other registers.
>> Corresponding RESTORE_* and REMOVE_PT_GPREGS_FROM_STACK macros reverse 
>> it.
>>
>> ia32_ptregs_common, stub_fork and friends lost their ugly dance with
>> return pointer.
>>
>> LOAD_ARGS32 in ia32entry.S now uses symbolic stack offsets
>> instead of magic numbers.
>>
>> error_entry and save_paranoid now use SAVE_C_REGS + SAVE_EXTRA_REGS
>> instead of having it open-coded yet again.
>>
>> Patch was run-tested: 64-bit executables, 32-bit executables,
>> strace works.
>> Timing tests did not show measurable difference in 32-bit
>> and 64-bit syscalls.
>
> Hello Denys,
>
> My test vm doesn't boot with this patch. Could you help to investigate
> this issue?

 I think I found it. This part of my patch is possibly wrong:

 @@ -171,9 +171,9 @@ static inline int arch_irqs_disabled(void)
  #define ARCH_LOCKDEP_SYS_EXIT_IRQ  \
 TRACE_IRQS_ON; \
 sti; \
 -   SAVE_REST; \
 +   SAVE_EXTRA_REGS; \
 LOCKDEP_SYS_EXIT; \
 -   RESTORE_REST; \
 +   RESTORE_EXTRA_REGS; \
 cli; \
 TRACE_IRQS_OFF;

 The "SAVE_REST" here is intended to really *push* extra regs on stack,
 but the patch changed it so that they are written to existing stack
 slots above.

 From code inspection it should work in almost all cases, but some
 locations where it is used are really obscure.

 If there are places where *pushing* regs is really necessary,
 this can corrupt rbp,rbx,r12-15 registers.

 Your config has CONFIG_LOCKDEP=y, I think it's worth trying whether the bug
 was here.
 Please find updated patch attached. Can you try it?
>>>
>>> It doesn't work
>
> Thanks for testing it anyway.
>
>
>>> [3.016262] traps: systemd-cgroups[390] general protection
>>> ip:7f456f7b6028 sp:7fffdc059718 error:0 in
>>> ld-2.18.so[7f456f79e000+2]
>
> This is what I know about these crashes. The SEGV itself is caused by
> HLT instruction executed by dynamic loader, ld-2.NN.so.
> The instruction is in _exit function, and is only reachable if
> exit_group and exit syscalls fail to terminate the process.
> So it seems that syscall execution is getting badly broken somehow
> at some point.
>
> This happens to both reporters.
>
> My theory that it is related to lockdep seems to be wrong, because
> Sabrina's kernel is not lockdep-enabled, yet it sees the same failure.
>
> Both kernels are paravirtualized, both are booted under KVM,
> Andrey runs it with four virtual CPUs, Sabrina runs with two.
>
> My next theory is that I missed something related to paravirt.
> I am looking at that code, so far I don't see anything suspicious.
>
> Unfortunately, it doesn't happen to me: I have Sabrina's bzImage,
> I run it under "qemu-system-x86_64 -enable-kvm -smp 2",
> I see in dmesg that kernel does detect that it is being run under KVM,
> but it works for me. No mysterious segfaults.
>
> Andrey, can you send me your bzImage? Maybe it will trigger
> the problem for me.
>
>
>> The change to stub_\func looks wrong to me.  It saves and restores
>> regs, but those regs might already have been saved if we're on the
>> slow path.  (Yes, all that code is quite buggy even without all these
>> patches.)  So is execve.
>>
>> This means that, for example, execve 

Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete "struct pt_regs"

2015-02-25 Thread Denys Vlasenko
On 02/25/2015 09:10 PM, Andy Lutomirski wrote:
> On Wed, Feb 25, 2015 at 11:59 AM, Andrey Wagin  wrote:
>> 2015-02-25 21:42 GMT+03:00 Denys Vlasenko :
>>> On 02/25/2015 01:37 PM, Andrey Wagin wrote:
 2015-02-13 0:54 GMT+03:00 Denys Vlasenko :
> 64-bit code was using six stack slots less by not saving/restoring
> registers which are callee-preserved according to C ABI,
> and not allocating space for them.
> Only when syscall needed a complete "struct pt_regs",
> the complete area was allocated and filled in.
> As an additional twist, on interrupt entry a "slightly less truncated 
> pt_regs"
> trick is used, to make nested interrupt stacks easier to unwind.
>
> This proved to be a source of significant obfuscation and subtle bugs.
> For example, stub_fork had to pop the return address,
> extend the struct, save registers, and push return address back. Ugly.
> ia32_ptregs_common pops return address and "returns" via jmp insn,
> throwing a wrench into CPU return stack cache.
>
> This patch changes code to always allocate a complete "struct pt_regs".
> The saving of registers is still done lazily.
>
> "Partial pt_regs" trick on interrupt stack is retained.
>
> Macros which manipulate "struct pt_regs" on stack are reworked:
> ALLOC_PT_GPREGS_ON_STACK allocates the structure.
> SAVE_C_REGS saves to it those registers which are clobbered by C code.
> SAVE_EXTRA_REGS saves to it all other registers.
> Corresponding RESTORE_* and REMOVE_PT_GPREGS_FROM_STACK macros reverse it.
>
> ia32_ptregs_common, stub_fork and friends lost their ugly dance with
> return pointer.
>
> LOAD_ARGS32 in ia32entry.S now uses symbolic stack offsets
> instead of magic numbers.
>
> error_entry and save_paranoid now use SAVE_C_REGS + SAVE_EXTRA_REGS
> instead of having it open-coded yet again.
>
> Patch was run-tested: 64-bit executables, 32-bit executables,
> strace works.
> Timing tests did not show measurable difference in 32-bit
> and 64-bit syscalls.

 Hello Denys,

 My test vm doesn't boot with this patch. Could you help to investigate
 this issue?
>>>
>>> I think I found it. This part of my patch is possibly wrong:
>>>
>>> @@ -171,9 +171,9 @@ static inline int arch_irqs_disabled(void)
>>>  #define ARCH_LOCKDEP_SYS_EXIT_IRQ  \
>>> TRACE_IRQS_ON; \
>>> sti; \
>>> -   SAVE_REST; \
>>> +   SAVE_EXTRA_REGS; \
>>> LOCKDEP_SYS_EXIT; \
>>> -   RESTORE_REST; \
>>> +   RESTORE_EXTRA_REGS; \
>>> cli; \
>>> TRACE_IRQS_OFF;
>>>
>>> The "SAVE_REST" here is intended to really *push* extra regs on stack,
>>> but the patch changed it so that they are written to existing stack
>>> slots above.
>>>
>>> From code inspection it should work in almost all cases, but some
>>> locations where it is used are really obscure.
>>>
>>> If there are places where *pushing* regs is really necessary,
>>> this can corrupt rbp,rbx,r12-15 registers.
>>>
>>> Your config has CONFIG_LOCKDEP=y, I think it's worth trying whether the bug
>>> was here.
>>> Please find updated patch attached. Can you try it?
>>
>> It doesn't work

Thanks for testing it anyway.


>> [3.016262] traps: systemd-cgroups[390] general protection
>> ip:7f456f7b6028 sp:7fffdc059718 error:0 in
>> ld-2.18.so[7f456f79e000+2]

This is what I know about these crashes. The SEGV itself is caused by
HLT instruction executed by dynamic loader, ld-2.NN.so.
The instruction is in _exit function, and is only reachable if
exit_group and exit syscalls fail to terminate the process.
So it seems that syscall execution is getting badly broken somehow
at some point.

This happens to both reporters.

My theory that it is related to lockdep seems to be wrong, because
Sabrina's kernel is not lockdep-enabled, yet it sees the same failure.

Both kernels are paravirtualized, both are booted under KVM,
Andrey runs it with four virtual CPUs, Sabrina runs with two.

My next theory is that I missed something related to paravirt.
I am looking at that code, so far I don't see anything suspicious.

Unfortunately, it doesn't happen to me: I have Sabrina's bzImage,
I run it under "qemu-system-x86_64 -enable-kvm -smp 2",
I see in dmesg that kernel does detect that it is being run under KVM,
but it works for me. No mysterious segfaults.

Andrey, can you send me your bzImage? Maybe it will trigger
the problem for me.


> The change to stub_\func looks wrong to me.  It saves and restores
> regs, but those regs might already have been saved if we're on the
> slow path.  (Yes, all that code is quite buggy even without all these
> patches.)  So is execve.
> 
> This means that, for example, execve called in the slow path will
> save/restore regs twice.  If the values in the regs after the first
> save and before the second save are different, then we corrupt user
> state.

This part?

.macro 

Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete "struct pt_regs"

2015-02-25 Thread Andy Lutomirski
On Wed, Feb 25, 2015 at 11:59 AM, Andrey Wagin  wrote:
> 2015-02-25 21:42 GMT+03:00 Denys Vlasenko :
>> On 02/25/2015 01:37 PM, Andrey Wagin wrote:
>>> 2015-02-13 0:54 GMT+03:00 Denys Vlasenko :
 64-bit code was using six stack slots less by not saving/restoring
 registers which are callee-preserved according to C ABI,
 and not allocating space for them.
 Only when syscall needed a complete "struct pt_regs",
 the complete area was allocated and filled in.
 As an additional twist, on interrupt entry a "slightly less truncated 
 pt_regs"
 trick is used, to make nested interrupt stacks easier to unwind.

 This proved to be a source of significant obfuscation and subtle bugs.
 For example, stub_fork had to pop the return address,
 extend the struct, save registers, and push return address back. Ugly.
 ia32_ptregs_common pops return address and "returns" via jmp insn,
 throwing a wrench into CPU return stack cache.

 This patch changes code to always allocate a complete "struct pt_regs".
 The saving of registers is still done lazily.

 "Partial pt_regs" trick on interrupt stack is retained.

 Macros which manipulate "struct pt_regs" on stack are reworked:
 ALLOC_PT_GPREGS_ON_STACK allocates the structure.
 SAVE_C_REGS saves to it those registers which are clobbered by C code.
 SAVE_EXTRA_REGS saves to it all other registers.
 Corresponding RESTORE_* and REMOVE_PT_GPREGS_FROM_STACK macros reverse it.

 ia32_ptregs_common, stub_fork and friends lost their ugly dance with
 return pointer.

 LOAD_ARGS32 in ia32entry.S now uses symbolic stack offsets
 instead of magic numbers.

 error_entry and save_paranoid now use SAVE_C_REGS + SAVE_EXTRA_REGS
 instead of having it open-coded yet again.

 Patch was run-tested: 64-bit executables, 32-bit executables,
 strace works.
 Timing tests did not show measurable difference in 32-bit
 and 64-bit syscalls.
>>>
>>> Hello Denys,
>>>
>>> My test vm doesn't boot with this patch. Could you help to investigate
>>> this issue?
>>
>> I think I found it. This part of my patch is possibly wrong:
>>
>> @@ -171,9 +171,9 @@ static inline int arch_irqs_disabled(void)
>>  #define ARCH_LOCKDEP_SYS_EXIT_IRQ  \
>> TRACE_IRQS_ON; \
>> sti; \
>> -   SAVE_REST; \
>> +   SAVE_EXTRA_REGS; \
>> LOCKDEP_SYS_EXIT; \
>> -   RESTORE_REST; \
>> +   RESTORE_EXTRA_REGS; \
>> cli; \
>> TRACE_IRQS_OFF;
>>
>> The "SAVE_REST" here is intended to really *push* extra regs on stack,
>> but the patch changed it so that they are written to existing stack
>> slots above.
>>
>> From code inspection it should work in almost all cases, but some
>> locations where it is used are really obscure.
>>
>> If there are places where *pushing* regs is really necessary,
>> this can corrupt rbp,rbx,r12-15 registers.
>>
>> Your config has CONFIG_LOCKDEP=y, I think it's worth trying whether the bug
>> was here.
>> Please find updated patch attached. Can you try it?
>
> It doesn't work
>
> [2.282198] microcode: CPU0 sig=0x623, pf=0x0, revision=0x1
> [2.288321] microcode: CPU1 sig=0x623, pf=0x0, revision=0x1
> [2.289139] microcode: CPU2 sig=0x623, pf=0x0, revision=0x1
> [2.290618] microcode: CPU3 sig=0x623, pf=0x0, revision=0x1
> [2.292417] microcode: Microcode Update Driver: v2.00
> , Peter Oruba
> [2.392592] piix4_smbus :00:01.3: SMBus Host Controller at
> 0xb100, revision 0
> [2.510882] systemd-fsck[349]: /dev/sda1: clean, 343/128016 files,
> 166911/512000 blocks
> [2.521463] Adding 4128764k swap on /dev/sda2.  Priority:-1
> extents:1 across:4128764k FS
> [2.573597] EXT4-fs (sda1): mounted filesystem with ordered data
> mode. Opts: (null)
> [2.597771] systemd-journald[283]: Received request to flush
> runtime journal from PID 1
> [2.802288] audit: type=1305 audit(1424892361.629:3): audit_pid=369
> old=0 auid=4294967295 ses=4294967295 res=1
> [2.819660] traps: systemd-cgroups[380] general protection
> ip:7fb227939028 sp:7fff6bac59c8 error:0 in
> ld-2.18.so[7fb227921000+2]
> [3.016262] traps: systemd-cgroups[390] general protection
> ip:7f456f7b6028 sp:7fffdc059718 error:0 in
> ld-2.18.so[7f456f79e000+2]

The change to stub_\func looks wrong to me.  It saves and restores
regs, but those regs might already have been saved if we're on the
slow path.  (Yes, all that code is quite buggy even without all these
patches.)  So is execve.

This means that, for example, execve called in the slow path will
save/restore regs twice.  If the values in the regs after the first
save and before the second save are different, then we corrupt user
state.

I think that the changes to all the stub-like things should be reverted.

--Andy

>
>>
>> --
>> vda
>>



-- 
Andy Lutomirski
AMA Capital Management, LLC
--
To unsubscribe from this list: send the line "unsubscribe 

Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete "struct pt_regs"

2015-02-25 Thread Andrey Wagin
2015-02-25 21:42 GMT+03:00 Denys Vlasenko :
> On 02/25/2015 01:37 PM, Andrey Wagin wrote:
>> 2015-02-13 0:54 GMT+03:00 Denys Vlasenko :
>>> 64-bit code was using six stack slots less by not saving/restoring
>>> registers which are callee-preserved according to C ABI,
>>> and not allocating space for them.
>>> Only when syscall needed a complete "struct pt_regs",
>>> the complete area was allocated and filled in.
>>> As an additional twist, on interrupt entry a "slightly less truncated 
>>> pt_regs"
>>> trick is used, to make nested interrupt stacks easier to unwind.
>>>
>>> This proved to be a source of significant obfuscation and subtle bugs.
>>> For example, stub_fork had to pop the return address,
>>> extend the struct, save registers, and push return address back. Ugly.
>>> ia32_ptregs_common pops return address and "returns" via jmp insn,
>>> throwing a wrench into CPU return stack cache.
>>>
>>> This patch changes code to always allocate a complete "struct pt_regs".
>>> The saving of registers is still done lazily.
>>>
>>> "Partial pt_regs" trick on interrupt stack is retained.
>>>
>>> Macros which manipulate "struct pt_regs" on stack are reworked:
>>> ALLOC_PT_GPREGS_ON_STACK allocates the structure.
>>> SAVE_C_REGS saves to it those registers which are clobbered by C code.
>>> SAVE_EXTRA_REGS saves to it all other registers.
>>> Corresponding RESTORE_* and REMOVE_PT_GPREGS_FROM_STACK macros reverse it.
>>>
>>> ia32_ptregs_common, stub_fork and friends lost their ugly dance with
>>> return pointer.
>>>
>>> LOAD_ARGS32 in ia32entry.S now uses symbolic stack offsets
>>> instead of magic numbers.
>>>
>>> error_entry and save_paranoid now use SAVE_C_REGS + SAVE_EXTRA_REGS
>>> instead of having it open-coded yet again.
>>>
>>> Patch was run-tested: 64-bit executables, 32-bit executables,
>>> strace works.
>>> Timing tests did not show measurable difference in 32-bit
>>> and 64-bit syscalls.
>>
>> Hello Denys,
>>
>> My test vm doesn't boot with this patch. Could you help to investigate
>> this issue?
>
> I think I found it. This part of my patch is possibly wrong:
>
> @@ -171,9 +171,9 @@ static inline int arch_irqs_disabled(void)
>  #define ARCH_LOCKDEP_SYS_EXIT_IRQ  \
> TRACE_IRQS_ON; \
> sti; \
> -   SAVE_REST; \
> +   SAVE_EXTRA_REGS; \
> LOCKDEP_SYS_EXIT; \
> -   RESTORE_REST; \
> +   RESTORE_EXTRA_REGS; \
> cli; \
> TRACE_IRQS_OFF;
>
> The "SAVE_REST" here is intended to really *push* extra regs on stack,
> but the patch changed it so that they are written to existing stack
> slots above.
>
> From code inspection it should work in almost all cases, but some
> locations where it is used are really obscure.
>
> If there are places where *pushing* regs is really necessary,
> this can corrupt rbp,rbx,r12-15 registers.
>
> Your config has CONFIG_LOCKDEP=y, I think it's worth trying whether the bug
> was here.
> Please find updated patch attached. Can you try it?

It doesn't work

[2.282198] microcode: CPU0 sig=0x623, pf=0x0, revision=0x1
[2.288321] microcode: CPU1 sig=0x623, pf=0x0, revision=0x1
[2.289139] microcode: CPU2 sig=0x623, pf=0x0, revision=0x1
[2.290618] microcode: CPU3 sig=0x623, pf=0x0, revision=0x1
[2.292417] microcode: Microcode Update Driver: v2.00
, Peter Oruba
[2.392592] piix4_smbus :00:01.3: SMBus Host Controller at
0xb100, revision 0
[2.510882] systemd-fsck[349]: /dev/sda1: clean, 343/128016 files,
166911/512000 blocks
[2.521463] Adding 4128764k swap on /dev/sda2.  Priority:-1
extents:1 across:4128764k FS
[2.573597] EXT4-fs (sda1): mounted filesystem with ordered data
mode. Opts: (null)
[2.597771] systemd-journald[283]: Received request to flush
runtime journal from PID 1
[2.802288] audit: type=1305 audit(1424892361.629:3): audit_pid=369
old=0 auid=4294967295 ses=4294967295 res=1
[2.819660] traps: systemd-cgroups[380] general protection
ip:7fb227939028 sp:7fff6bac59c8 error:0 in
ld-2.18.so[7fb227921000+2]
[3.016262] traps: systemd-cgroups[390] general protection
ip:7f456f7b6028 sp:7fffdc059718 error:0 in
ld-2.18.so[7f456f79e000+2]

>
> --
> vda
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete "struct pt_regs"

2015-02-25 Thread Denys Vlasenko
On 02/25/2015 01:37 PM, Andrey Wagin wrote:
> 2015-02-13 0:54 GMT+03:00 Denys Vlasenko :
>> 64-bit code was using six stack slots less by not saving/restoring
>> registers which are callee-preserved according to C ABI,
>> and not allocating space for them.
>> Only when syscall needed a complete "struct pt_regs",
>> the complete area was allocated and filled in.
>> As an additional twist, on interrupt entry a "slightly less truncated 
>> pt_regs"
>> trick is used, to make nested interrupt stacks easier to unwind.
>>
>> This proved to be a source of significant obfuscation and subtle bugs.
>> For example, stub_fork had to pop the return address,
>> extend the struct, save registers, and push return address back. Ugly.
>> ia32_ptregs_common pops return address and "returns" via jmp insn,
>> throwing a wrench into CPU return stack cache.
>>
>> This patch changes code to always allocate a complete "struct pt_regs".
>> The saving of registers is still done lazily.
>>
>> "Partial pt_regs" trick on interrupt stack is retained.
>>
>> Macros which manipulate "struct pt_regs" on stack are reworked:
>> ALLOC_PT_GPREGS_ON_STACK allocates the structure.
>> SAVE_C_REGS saves to it those registers which are clobbered by C code.
>> SAVE_EXTRA_REGS saves to it all other registers.
>> Corresponding RESTORE_* and REMOVE_PT_GPREGS_FROM_STACK macros reverse it.
>>
>> ia32_ptregs_common, stub_fork and friends lost their ugly dance with
>> return pointer.
>>
>> LOAD_ARGS32 in ia32entry.S now uses symbolic stack offsets
>> instead of magic numbers.
>>
>> error_entry and save_paranoid now use SAVE_C_REGS + SAVE_EXTRA_REGS
>> instead of having it open-coded yet again.
>>
>> Patch was run-tested: 64-bit executables, 32-bit executables,
>> strace works.
>> Timing tests did not show measurable difference in 32-bit
>> and 64-bit syscalls.
> 
> Hello Denys,
> 
> My test vm doesn't boot with this patch. Could you help to investigate
> this issue?

I think I found it. This part of my patch is possibly wrong:

@@ -171,9 +171,9 @@ static inline int arch_irqs_disabled(void)
 #define ARCH_LOCKDEP_SYS_EXIT_IRQ  \
TRACE_IRQS_ON; \
sti; \
-   SAVE_REST; \
+   SAVE_EXTRA_REGS; \
LOCKDEP_SYS_EXIT; \
-   RESTORE_REST; \
+   RESTORE_EXTRA_REGS; \
cli; \
TRACE_IRQS_OFF;

The "SAVE_REST" here is intended to really *push* extra regs on stack,
but the patch changed it so that they are written to existing stack
slots above.

>From code inspection it should work in almost all cases, but some
locations where it is used are really obscure.

If there are places where *pushing* regs is really necessary,
this can corrupt rbp,rbx,r12-15 registers.

Your config has CONFIG_LOCKDEP=y, I think it's worth trying whether the bug
was here.
Please find updated patch attached. Can you try it?

-- 
vda

>From 80c82370dcb8cd85c3b981ce62f63f74eff0df2c Mon Sep 17 00:00:00 2001
From: Denys Vlasenko 
Date: Wed, 25 Feb 2015 19:32:14 +0100
Subject: [PATCH v2] x86: entry_64.S: always allocate complete "struct pt_regs"

64-bit code was using six stack slots less by not saving/restoring
registers which are callee-preserved according to C ABI,
and not allocating space for them.
Only when syscall needed a complete "struct pt_regs",
the complete area was allocated and filled in.
As an additional twist, on interrupt entry a "slightly less truncated pt_regs"
trick is used, to make nested interrupt stacks easier to unwind.

This proved to be a source of significant obfuscation and subtle bugs.
For example, stub_fork had to pop the return address,
extend the struct, save registers, and push return address back. Ugly.
ia32_ptregs_common pops return address and "returns" via jmp insn,
throwing a wrench into CPU return stack cache.

This patch changes code to always allocate a complete "struct pt_regs".
The saving of registers is still done lazily.

"Partial pt_regs" trick on interrupt stack is retained.

Macros which manipulate "struct pt_regs" on stack are reworked:
ALLOC_PT_GPREGS_ON_STACK allocates the structure.
SAVE_C_REGS saves to it those registers which are clobbered by C code.
SAVE_EXTRA_REGS saves to it all other registers.
Corresponding RESTORE_* and REMOVE_PT_GPREGS_FROM_STACK macros reverse it.

ia32_ptregs_common, stub_fork and friends lost their ugly dance with
return pointer.

LOAD_ARGS32 in ia32entry.S now uses symbolic stack offsets
instead of magic numbers.

error_entry and save_paranoid now use SAVE_C_REGS + SAVE_EXTRA_REGS
instead of having it open-coded yet again.

Patch was run-tested: 64-bit executables, 32-bit executables,
strace works.
Timing tests did not show measurable difference in 32-bit
and 64-bit syscalls.

Signed-off-by: Denys Vlasenko 
CC: Linus Torvalds 
CC: Oleg Nesterov 
CC: Borislav Petkov 
CC: "H. Peter Anvin" 
CC: Andy Lutomirski 
CC: Frederic Weisbecker 
CC: X86 ML 
CC: Alexei Starovoitov 
CC: Will Drewry 
CC: Kees Cook 
CC: linux-kernel@vger.kernel.org
---

Changes since v1: fix a thinko 

Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete "struct pt_regs"

2015-02-25 Thread Denys Vlasenko
On 02/25/2015 01:37 PM, Andrey Wagin wrote:
> Hello Denys,
> 
> My test vm doesn't boot with this patch. Could you help to investigate
> this issue?
> 
> I have attached a kernel config and console log.
> 
> [2.508252] traps: systemd-cgroups[380] general protection ip:7f68ad096028 
> sp:7fffba298af8 error:0 in ld-2.18.so[7f68ad07e000+2][  OK
> [2.600179] traps: systemd-cgroups[384] general protection ip:7f11b9a9c028 
> sp:7fff4420f978 error:0 in ld-2.18.so[7f11b9a84000+2]
> [2.743790] traps: systemd-cgroups[392] general protection ip:7f7f40a44028 
> sp:7fffe1c1b8b8 error:0 in ld-2.18.so[7f7f40a2c000+2]
> [2.754576] traps: systemd-cgroups[393] general protection ip:7fd1314bd028 
> sp:776ecc88 error:0 in ld-2.18.so[7fd1314a5000+2]
> [2.765343] traps: systemd-cgroups[396] general protection ip:7ff4537b7028 
> sp:7fff05902378 error:0 in ld-2.18.so[7ff45379f000+2]
> [2.798782] traps: systemd-cgroups[399] general protection ip:7f4d5bc9c028 
> sp:7fff35cb3a48 error:0 in ld-2.18.so[7f4d5bc84000+2]

These SEGVs are always at the same location within ld-2.18.so:
(ip - load_address) is always 0x18028.

For Sabrine's case, SEGVing address in ld-2.21.so also the same
all the time: 0x184f8.

Please send me your ld-2.18.so / ld-2.21.so files.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete "struct pt_regs"

2015-02-25 Thread Sabrina Dubroca
Hello,

I'm seeing the same symptoms on next-2015022{4,5}, also with systemd in a VM:

traps: fsck[99] general protection ip:7fccb2401270 sp:7fffea3b8938 error:0 in 
libc-2.21.so[7fccb2349000+199000]
traps: systemd-cgroups[100] general protection ip:7fdd8ff784f8 sp:7ffcf6e27ad8 
error:0 in ld-2.21.so[7fdd8ff6+22000]
traps: systemd-cgroups[94] general protection ip:7f9f23bd24f8 sp:74fc5578 
error:0 in ld-2.21.so[7f9f23bba000+22000]
traps: systemd-cgroups[102] general protection ip:7f211e6574f8 sp:7ffdb8e0d538 
error:0 in ld-2.21.so[7f211e63f000+22000]
traps: systemd-cgroups[103] general protection ip:7f80627c34f8 sp:7ffc7fa4cff8 
error:0 in ld-2.21.so[7f80627ab000+22000]


2015-02-25, 14:55:34 +0100, Denys Vlasenko wrote:
> On 02/25/2015 01:37 PM, Andrey Wagin wrote:
> > 2015-02-13 0:54 GMT+03:00 Denys Vlasenko :
> > My test vm doesn't boot with this patch. Could you help to investigate
> > this issue?
> 
> Hi Andrey, thanks for testing!
> 
> > I have attached a kernel config and console log.
> 
> Looking at the logs, it seems that regular syscalls do work:
> systemd managed to function for some time, even spawned
> a few children.
> 
> It might be that the bug is somewhere in signal delivery code.
> This would explain why oops got delayed.

It doesn't oops here, it just tries to load other bits of systemd and hangs.
I've noticed that "ip:" - "the address after ld-2.21.so[" is always
the same value, I don't know if that's expected or relevant.

(full log below)

> I am trying to reproduce it. My gcc seems to be a bit old -
> it can't digest CONFIG_CC_STACKPROTECTOR_STRONG=y in your .config.
> 
> I switched to using "only" CONFIG_CC_STACKPROTECTOR_REGULAR=y:
> 
> CONFIG_CC_STACKPROTECTOR=y
> # CONFIG_CC_STACKPROTECTOR_NONE is not set
> CONFIG_CC_STACKPROTECTOR_REGULAR=y
> # CONFIG_CC_STACKPROTECTOR_STRONG is not set
> 
> and resulting kernel works for me.

I don't have any STACKPROTECTOR in my config:

# CONFIG_CC_STACKPROTECTOR is not set
CONFIG_CC_STACKPROTECTOR_NONE=y
# CONFIG_CC_STACKPROTECTOR_REGULAR is not set
# CONFIG_CC_STACKPROTECTOR_STRONG is not set

(full config after the log)


I can start systemd's emergency shell (systemd.unit=emergency.target),
if running test programs helps.


Thanks,
Sabrina

[0.00] Initializing cgroup subsys cpuset
[0.00] Initializing cgroup subsys cpu
[0.00] Initializing cgroup subsys cpuacct
[0.00] Linux version 4.0.0-rc1-next-20150225 (zappy@kria) (gcc version 
4.9.2 20150204 (prerelease) (GCC) ) #636 SMP PREEMPT Wed Feb 25 13:45:23 CET 
2015
[0.00] Command line: root=/dev/sda1 
netconsole=@10.0.1.23/,@10.0.1.10/ console=ttyS0
[0.00] e820: BIOS-provided physical RAM map:
[0.00] BIOS-e820: [mem 0x-0x0009fbff] usable
[0.00] BIOS-e820: [mem 0x0009fc00-0x0009] reserved
[0.00] BIOS-e820: [mem 0x000f-0x000f] reserved
[0.00] BIOS-e820: [mem 0x0010-0x1ffd] usable
[0.00] BIOS-e820: [mem 0x1ffe-0x1fff] reserved
[0.00] BIOS-e820: [mem 0xfeffc000-0xfeff] reserved
[0.00] BIOS-e820: [mem 0xfffc-0x] reserved
[0.00] NX (Execute Disable) protection: active
[0.00] SMBIOS 2.8 present.
[0.00] Hypervisor detected: KVM
[0.00] AGP: No AGP bridge found
[0.00] e820: last_pfn = 0x1ffe0 max_arch_pfn = 0x4
[0.00] PAT configuration [0-7]: WB  WC  UC- UC  WB  WC  UC- UC  
[0.00] found SMP MP-table at [mem 0x000f1010-0x000f101f] mapped at 
[880f1010]
[0.00] Scanning 1 areas for low memory corruption
[0.00] init_memory_mapping: [mem 0x-0x000f]
[0.00] init_memory_mapping: [mem 0x1fc0-0x1fdf]
[0.00] init_memory_mapping: [mem 0x0010-0x1fbf]
[0.00] init_memory_mapping: [mem 0x1fe0-0x1ffd]
[0.00] ACPI: Early table checksum verification disabled
[0.00] ACPI: RSDP 0x000F0DD0 14 (v00 BOCHS )
[0.00] ACPI: RSDT 0x1FFE18BC 34 (v01 BOCHS  BXPCRSDT 
0001 BXPC 0001)
[0.00] ACPI: FACP 0x1FFE0E48 74 (v01 BOCHS  BXPCFACP 
0001 BXPC 0001)
[0.00] ACPI: DSDT 0x1FFE0040 000E08 (v01 BOCHS  BXPCDSDT 
0001 BXPC 0001)
[0.00] ACPI: FACS 0x1FFE 40
[0.00] ACPI: SSDT 0x1FFE0EBC 000948 (v01 BOCHS  BXPCSSDT 
0001 BXPC 0001)
[0.00] ACPI: APIC 0x1FFE1804 80 (v01 BOCHS  BXPCAPIC 
0001 BXPC 0001)
[0.00] ACPI: HPET 0x1FFE1884 38 (v01 BOCHS  BXPCHPET 
0001 BXPC 0001)
[0.00] kvm-clock: Using msrs 4b564d01 and 4b564d00
[0.00] kvm-clock: cpu 0, msr 0:1ffdf001, primary cpu clock
[0.00] Zone ranges:
[0.00]   DMA  [mem 0x1000-0x00ff]
[0.00]   DMA32 

Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete "struct pt_regs"

2015-02-25 Thread Denys Vlasenko
On 02/25/2015 01:37 PM, Andrey Wagin wrote:
> 2015-02-13 0:54 GMT+03:00 Denys Vlasenko :
> My test vm doesn't boot with this patch. Could you help to investigate
> this issue?

Hi Andrey, thanks for testing!

> I have attached a kernel config and console log.

Looking at the logs, it seems that regular syscalls do work:
systemd managed to function for some time, even spawned
a few children.

It might be that the bug is somewhere in signal delivery code.
This would explain why oops got delayed.


I am trying to reproduce it. My gcc seems to be a bit old -
it can't digest CONFIG_CC_STACKPROTECTOR_STRONG=y in your .config.

I switched to using "only" CONFIG_CC_STACKPROTECTOR_REGULAR=y:

CONFIG_CC_STACKPROTECTOR=y
# CONFIG_CC_STACKPROTECTOR_NONE is not set
CONFIG_CC_STACKPROTECTOR_REGULAR=y
# CONFIG_CC_STACKPROTECTOR_STRONG is not set

and resulting kernel works for me.

Can you send me your bzImage (off-list, of course)?

I'll send you my qemu test setup, can you check whether crash
is happening for you when you run your kernel in it?

-- 
vda

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete struct pt_regs

2015-02-25 Thread Denys Vlasenko
On 02/25/2015 01:37 PM, Andrey Wagin wrote:
 2015-02-13 0:54 GMT+03:00 Denys Vlasenko dvlas...@redhat.com:
 My test vm doesn't boot with this patch. Could you help to investigate
 this issue?

Hi Andrey, thanks for testing!

 I have attached a kernel config and console log.

Looking at the logs, it seems that regular syscalls do work:
systemd managed to function for some time, even spawned
a few children.

It might be that the bug is somewhere in signal delivery code.
This would explain why oops got delayed.


I am trying to reproduce it. My gcc seems to be a bit old -
it can't digest CONFIG_CC_STACKPROTECTOR_STRONG=y in your .config.

I switched to using only CONFIG_CC_STACKPROTECTOR_REGULAR=y:

CONFIG_CC_STACKPROTECTOR=y
# CONFIG_CC_STACKPROTECTOR_NONE is not set
CONFIG_CC_STACKPROTECTOR_REGULAR=y
# CONFIG_CC_STACKPROTECTOR_STRONG is not set

and resulting kernel works for me.

Can you send me your bzImage (off-list, of course)?

I'll send you my qemu test setup, can you check whether crash
is happening for you when you run your kernel in it?

-- 
vda

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete struct pt_regs

2015-02-25 Thread Sabrina Dubroca
Hello,

I'm seeing the same symptoms on next-2015022{4,5}, also with systemd in a VM:

traps: fsck[99] general protection ip:7fccb2401270 sp:7fffea3b8938 error:0 in 
libc-2.21.so[7fccb2349000+199000]
traps: systemd-cgroups[100] general protection ip:7fdd8ff784f8 sp:7ffcf6e27ad8 
error:0 in ld-2.21.so[7fdd8ff6+22000]
traps: systemd-cgroups[94] general protection ip:7f9f23bd24f8 sp:74fc5578 
error:0 in ld-2.21.so[7f9f23bba000+22000]
traps: systemd-cgroups[102] general protection ip:7f211e6574f8 sp:7ffdb8e0d538 
error:0 in ld-2.21.so[7f211e63f000+22000]
traps: systemd-cgroups[103] general protection ip:7f80627c34f8 sp:7ffc7fa4cff8 
error:0 in ld-2.21.so[7f80627ab000+22000]


2015-02-25, 14:55:34 +0100, Denys Vlasenko wrote:
 On 02/25/2015 01:37 PM, Andrey Wagin wrote:
  2015-02-13 0:54 GMT+03:00 Denys Vlasenko dvlas...@redhat.com:
  My test vm doesn't boot with this patch. Could you help to investigate
  this issue?
 
 Hi Andrey, thanks for testing!
 
  I have attached a kernel config and console log.
 
 Looking at the logs, it seems that regular syscalls do work:
 systemd managed to function for some time, even spawned
 a few children.
 
 It might be that the bug is somewhere in signal delivery code.
 This would explain why oops got delayed.

It doesn't oops here, it just tries to load other bits of systemd and hangs.
I've noticed that ip: - the address after ld-2.21.so[ is always
the same value, I don't know if that's expected or relevant.

(full log below)

 I am trying to reproduce it. My gcc seems to be a bit old -
 it can't digest CONFIG_CC_STACKPROTECTOR_STRONG=y in your .config.
 
 I switched to using only CONFIG_CC_STACKPROTECTOR_REGULAR=y:
 
 CONFIG_CC_STACKPROTECTOR=y
 # CONFIG_CC_STACKPROTECTOR_NONE is not set
 CONFIG_CC_STACKPROTECTOR_REGULAR=y
 # CONFIG_CC_STACKPROTECTOR_STRONG is not set
 
 and resulting kernel works for me.

I don't have any STACKPROTECTOR in my config:

# CONFIG_CC_STACKPROTECTOR is not set
CONFIG_CC_STACKPROTECTOR_NONE=y
# CONFIG_CC_STACKPROTECTOR_REGULAR is not set
# CONFIG_CC_STACKPROTECTOR_STRONG is not set

(full config after the log)


I can start systemd's emergency shell (systemd.unit=emergency.target),
if running test programs helps.


Thanks,
Sabrina

[0.00] Initializing cgroup subsys cpuset
[0.00] Initializing cgroup subsys cpu
[0.00] Initializing cgroup subsys cpuacct
[0.00] Linux version 4.0.0-rc1-next-20150225 (zappy@kria) (gcc version 
4.9.2 20150204 (prerelease) (GCC) ) #636 SMP PREEMPT Wed Feb 25 13:45:23 CET 
2015
[0.00] Command line: root=/dev/sda1 
netconsole=@10.0.1.23/,@10.0.1.10/ console=ttyS0
[0.00] e820: BIOS-provided physical RAM map:
[0.00] BIOS-e820: [mem 0x-0x0009fbff] usable
[0.00] BIOS-e820: [mem 0x0009fc00-0x0009] reserved
[0.00] BIOS-e820: [mem 0x000f-0x000f] reserved
[0.00] BIOS-e820: [mem 0x0010-0x1ffd] usable
[0.00] BIOS-e820: [mem 0x1ffe-0x1fff] reserved
[0.00] BIOS-e820: [mem 0xfeffc000-0xfeff] reserved
[0.00] BIOS-e820: [mem 0xfffc-0x] reserved
[0.00] NX (Execute Disable) protection: active
[0.00] SMBIOS 2.8 present.
[0.00] Hypervisor detected: KVM
[0.00] AGP: No AGP bridge found
[0.00] e820: last_pfn = 0x1ffe0 max_arch_pfn = 0x4
[0.00] PAT configuration [0-7]: WB  WC  UC- UC  WB  WC  UC- UC  
[0.00] found SMP MP-table at [mem 0x000f1010-0x000f101f] mapped at 
[880f1010]
[0.00] Scanning 1 areas for low memory corruption
[0.00] init_memory_mapping: [mem 0x-0x000f]
[0.00] init_memory_mapping: [mem 0x1fc0-0x1fdf]
[0.00] init_memory_mapping: [mem 0x0010-0x1fbf]
[0.00] init_memory_mapping: [mem 0x1fe0-0x1ffd]
[0.00] ACPI: Early table checksum verification disabled
[0.00] ACPI: RSDP 0x000F0DD0 14 (v00 BOCHS )
[0.00] ACPI: RSDT 0x1FFE18BC 34 (v01 BOCHS  BXPCRSDT 
0001 BXPC 0001)
[0.00] ACPI: FACP 0x1FFE0E48 74 (v01 BOCHS  BXPCFACP 
0001 BXPC 0001)
[0.00] ACPI: DSDT 0x1FFE0040 000E08 (v01 BOCHS  BXPCDSDT 
0001 BXPC 0001)
[0.00] ACPI: FACS 0x1FFE 40
[0.00] ACPI: SSDT 0x1FFE0EBC 000948 (v01 BOCHS  BXPCSSDT 
0001 BXPC 0001)
[0.00] ACPI: APIC 0x1FFE1804 80 (v01 BOCHS  BXPCAPIC 
0001 BXPC 0001)
[0.00] ACPI: HPET 0x1FFE1884 38 (v01 BOCHS  BXPCHPET 
0001 BXPC 0001)
[0.00] kvm-clock: Using msrs 4b564d01 and 4b564d00
[0.00] kvm-clock: cpu 0, msr 0:1ffdf001, primary cpu clock
[0.00] Zone ranges:
[0.00]   DMA  [mem 0x1000-0x00ff]
[0.00]   DMA32[mem 

Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete struct pt_regs

2015-02-25 Thread Denys Vlasenko
On 02/25/2015 01:37 PM, Andrey Wagin wrote:
 Hello Denys,
 
 My test vm doesn't boot with this patch. Could you help to investigate
 this issue?
 
 I have attached a kernel config and console log.
 
 [2.508252] traps: systemd-cgroups[380] general protection ip:7f68ad096028 
 sp:7fffba298af8 error:0 in ld-2.18.so[7f68ad07e000+2][  OK
 [2.600179] traps: systemd-cgroups[384] general protection ip:7f11b9a9c028 
 sp:7fff4420f978 error:0 in ld-2.18.so[7f11b9a84000+2]
 [2.743790] traps: systemd-cgroups[392] general protection ip:7f7f40a44028 
 sp:7fffe1c1b8b8 error:0 in ld-2.18.so[7f7f40a2c000+2]
 [2.754576] traps: systemd-cgroups[393] general protection ip:7fd1314bd028 
 sp:776ecc88 error:0 in ld-2.18.so[7fd1314a5000+2]
 [2.765343] traps: systemd-cgroups[396] general protection ip:7ff4537b7028 
 sp:7fff05902378 error:0 in ld-2.18.so[7ff45379f000+2]
 [2.798782] traps: systemd-cgroups[399] general protection ip:7f4d5bc9c028 
 sp:7fff35cb3a48 error:0 in ld-2.18.so[7f4d5bc84000+2]

These SEGVs are always at the same location within ld-2.18.so:
(ip - load_address) is always 0x18028.

For Sabrine's case, SEGVing address in ld-2.21.so also the same
all the time: 0x184f8.

Please send me your ld-2.18.so / ld-2.21.so files.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete struct pt_regs

2015-02-25 Thread Andrey Wagin
2015-02-25 21:42 GMT+03:00 Denys Vlasenko dvlas...@redhat.com:
 On 02/25/2015 01:37 PM, Andrey Wagin wrote:
 2015-02-13 0:54 GMT+03:00 Denys Vlasenko dvlas...@redhat.com:
 64-bit code was using six stack slots less by not saving/restoring
 registers which are callee-preserved according to C ABI,
 and not allocating space for them.
 Only when syscall needed a complete struct pt_regs,
 the complete area was allocated and filled in.
 As an additional twist, on interrupt entry a slightly less truncated 
 pt_regs
 trick is used, to make nested interrupt stacks easier to unwind.

 This proved to be a source of significant obfuscation and subtle bugs.
 For example, stub_fork had to pop the return address,
 extend the struct, save registers, and push return address back. Ugly.
 ia32_ptregs_common pops return address and returns via jmp insn,
 throwing a wrench into CPU return stack cache.

 This patch changes code to always allocate a complete struct pt_regs.
 The saving of registers is still done lazily.

 Partial pt_regs trick on interrupt stack is retained.

 Macros which manipulate struct pt_regs on stack are reworked:
 ALLOC_PT_GPREGS_ON_STACK allocates the structure.
 SAVE_C_REGS saves to it those registers which are clobbered by C code.
 SAVE_EXTRA_REGS saves to it all other registers.
 Corresponding RESTORE_* and REMOVE_PT_GPREGS_FROM_STACK macros reverse it.

 ia32_ptregs_common, stub_fork and friends lost their ugly dance with
 return pointer.

 LOAD_ARGS32 in ia32entry.S now uses symbolic stack offsets
 instead of magic numbers.

 error_entry and save_paranoid now use SAVE_C_REGS + SAVE_EXTRA_REGS
 instead of having it open-coded yet again.

 Patch was run-tested: 64-bit executables, 32-bit executables,
 strace works.
 Timing tests did not show measurable difference in 32-bit
 and 64-bit syscalls.

 Hello Denys,

 My test vm doesn't boot with this patch. Could you help to investigate
 this issue?

 I think I found it. This part of my patch is possibly wrong:

 @@ -171,9 +171,9 @@ static inline int arch_irqs_disabled(void)
  #define ARCH_LOCKDEP_SYS_EXIT_IRQ  \
 TRACE_IRQS_ON; \
 sti; \
 -   SAVE_REST; \
 +   SAVE_EXTRA_REGS; \
 LOCKDEP_SYS_EXIT; \
 -   RESTORE_REST; \
 +   RESTORE_EXTRA_REGS; \
 cli; \
 TRACE_IRQS_OFF;

 The SAVE_REST here is intended to really *push* extra regs on stack,
 but the patch changed it so that they are written to existing stack
 slots above.

 From code inspection it should work in almost all cases, but some
 locations where it is used are really obscure.

 If there are places where *pushing* regs is really necessary,
 this can corrupt rbp,rbx,r12-15 registers.

 Your config has CONFIG_LOCKDEP=y, I think it's worth trying whether the bug
 was here.
 Please find updated patch attached. Can you try it?

It doesn't work

[2.282198] microcode: CPU0 sig=0x623, pf=0x0, revision=0x1
[2.288321] microcode: CPU1 sig=0x623, pf=0x0, revision=0x1
[2.289139] microcode: CPU2 sig=0x623, pf=0x0, revision=0x1
[2.290618] microcode: CPU3 sig=0x623, pf=0x0, revision=0x1
[2.292417] microcode: Microcode Update Driver: v2.00
tig...@aivazian.fsnet.co.uk, Peter Oruba
[2.392592] piix4_smbus :00:01.3: SMBus Host Controller at
0xb100, revision 0
[2.510882] systemd-fsck[349]: /dev/sda1: clean, 343/128016 files,
166911/512000 blocks
[2.521463] Adding 4128764k swap on /dev/sda2.  Priority:-1
extents:1 across:4128764k FS
[2.573597] EXT4-fs (sda1): mounted filesystem with ordered data
mode. Opts: (null)
[2.597771] systemd-journald[283]: Received request to flush
runtime journal from PID 1
[2.802288] audit: type=1305 audit(1424892361.629:3): audit_pid=369
old=0 auid=4294967295 ses=4294967295 res=1
[2.819660] traps: systemd-cgroups[380] general protection
ip:7fb227939028 sp:7fff6bac59c8 error:0 in
ld-2.18.so[7fb227921000+2]
[3.016262] traps: systemd-cgroups[390] general protection
ip:7f456f7b6028 sp:7fffdc059718 error:0 in
ld-2.18.so[7f456f79e000+2]


 --
 vda

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete struct pt_regs

2015-02-25 Thread Andy Lutomirski
On Wed, Feb 25, 2015 at 11:59 AM, Andrey Wagin ava...@gmail.com wrote:
 2015-02-25 21:42 GMT+03:00 Denys Vlasenko dvlas...@redhat.com:
 On 02/25/2015 01:37 PM, Andrey Wagin wrote:
 2015-02-13 0:54 GMT+03:00 Denys Vlasenko dvlas...@redhat.com:
 64-bit code was using six stack slots less by not saving/restoring
 registers which are callee-preserved according to C ABI,
 and not allocating space for them.
 Only when syscall needed a complete struct pt_regs,
 the complete area was allocated and filled in.
 As an additional twist, on interrupt entry a slightly less truncated 
 pt_regs
 trick is used, to make nested interrupt stacks easier to unwind.

 This proved to be a source of significant obfuscation and subtle bugs.
 For example, stub_fork had to pop the return address,
 extend the struct, save registers, and push return address back. Ugly.
 ia32_ptregs_common pops return address and returns via jmp insn,
 throwing a wrench into CPU return stack cache.

 This patch changes code to always allocate a complete struct pt_regs.
 The saving of registers is still done lazily.

 Partial pt_regs trick on interrupt stack is retained.

 Macros which manipulate struct pt_regs on stack are reworked:
 ALLOC_PT_GPREGS_ON_STACK allocates the structure.
 SAVE_C_REGS saves to it those registers which are clobbered by C code.
 SAVE_EXTRA_REGS saves to it all other registers.
 Corresponding RESTORE_* and REMOVE_PT_GPREGS_FROM_STACK macros reverse it.

 ia32_ptregs_common, stub_fork and friends lost their ugly dance with
 return pointer.

 LOAD_ARGS32 in ia32entry.S now uses symbolic stack offsets
 instead of magic numbers.

 error_entry and save_paranoid now use SAVE_C_REGS + SAVE_EXTRA_REGS
 instead of having it open-coded yet again.

 Patch was run-tested: 64-bit executables, 32-bit executables,
 strace works.
 Timing tests did not show measurable difference in 32-bit
 and 64-bit syscalls.

 Hello Denys,

 My test vm doesn't boot with this patch. Could you help to investigate
 this issue?

 I think I found it. This part of my patch is possibly wrong:

 @@ -171,9 +171,9 @@ static inline int arch_irqs_disabled(void)
  #define ARCH_LOCKDEP_SYS_EXIT_IRQ  \
 TRACE_IRQS_ON; \
 sti; \
 -   SAVE_REST; \
 +   SAVE_EXTRA_REGS; \
 LOCKDEP_SYS_EXIT; \
 -   RESTORE_REST; \
 +   RESTORE_EXTRA_REGS; \
 cli; \
 TRACE_IRQS_OFF;

 The SAVE_REST here is intended to really *push* extra regs on stack,
 but the patch changed it so that they are written to existing stack
 slots above.

 From code inspection it should work in almost all cases, but some
 locations where it is used are really obscure.

 If there are places where *pushing* regs is really necessary,
 this can corrupt rbp,rbx,r12-15 registers.

 Your config has CONFIG_LOCKDEP=y, I think it's worth trying whether the bug
 was here.
 Please find updated patch attached. Can you try it?

 It doesn't work

 [2.282198] microcode: CPU0 sig=0x623, pf=0x0, revision=0x1
 [2.288321] microcode: CPU1 sig=0x623, pf=0x0, revision=0x1
 [2.289139] microcode: CPU2 sig=0x623, pf=0x0, revision=0x1
 [2.290618] microcode: CPU3 sig=0x623, pf=0x0, revision=0x1
 [2.292417] microcode: Microcode Update Driver: v2.00
 tig...@aivazian.fsnet.co.uk, Peter Oruba
 [2.392592] piix4_smbus :00:01.3: SMBus Host Controller at
 0xb100, revision 0
 [2.510882] systemd-fsck[349]: /dev/sda1: clean, 343/128016 files,
 166911/512000 blocks
 [2.521463] Adding 4128764k swap on /dev/sda2.  Priority:-1
 extents:1 across:4128764k FS
 [2.573597] EXT4-fs (sda1): mounted filesystem with ordered data
 mode. Opts: (null)
 [2.597771] systemd-journald[283]: Received request to flush
 runtime journal from PID 1
 [2.802288] audit: type=1305 audit(1424892361.629:3): audit_pid=369
 old=0 auid=4294967295 ses=4294967295 res=1
 [2.819660] traps: systemd-cgroups[380] general protection
 ip:7fb227939028 sp:7fff6bac59c8 error:0 in
 ld-2.18.so[7fb227921000+2]
 [3.016262] traps: systemd-cgroups[390] general protection
 ip:7f456f7b6028 sp:7fffdc059718 error:0 in
 ld-2.18.so[7f456f79e000+2]

The change to stub_\func looks wrong to me.  It saves and restores
regs, but those regs might already have been saved if we're on the
slow path.  (Yes, all that code is quite buggy even without all these
patches.)  So is execve.

This means that, for example, execve called in the slow path will
save/restore regs twice.  If the values in the regs after the first
save and before the second save are different, then we corrupt user
state.

I think that the changes to all the stub-like things should be reverted.

--Andy



 --
 vda




-- 
Andy Lutomirski
AMA Capital Management, LLC
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete struct pt_regs

2015-02-25 Thread Denys Vlasenko
On 02/25/2015 01:37 PM, Andrey Wagin wrote:
 2015-02-13 0:54 GMT+03:00 Denys Vlasenko dvlas...@redhat.com:
 64-bit code was using six stack slots less by not saving/restoring
 registers which are callee-preserved according to C ABI,
 and not allocating space for them.
 Only when syscall needed a complete struct pt_regs,
 the complete area was allocated and filled in.
 As an additional twist, on interrupt entry a slightly less truncated 
 pt_regs
 trick is used, to make nested interrupt stacks easier to unwind.

 This proved to be a source of significant obfuscation and subtle bugs.
 For example, stub_fork had to pop the return address,
 extend the struct, save registers, and push return address back. Ugly.
 ia32_ptregs_common pops return address and returns via jmp insn,
 throwing a wrench into CPU return stack cache.

 This patch changes code to always allocate a complete struct pt_regs.
 The saving of registers is still done lazily.

 Partial pt_regs trick on interrupt stack is retained.

 Macros which manipulate struct pt_regs on stack are reworked:
 ALLOC_PT_GPREGS_ON_STACK allocates the structure.
 SAVE_C_REGS saves to it those registers which are clobbered by C code.
 SAVE_EXTRA_REGS saves to it all other registers.
 Corresponding RESTORE_* and REMOVE_PT_GPREGS_FROM_STACK macros reverse it.

 ia32_ptregs_common, stub_fork and friends lost their ugly dance with
 return pointer.

 LOAD_ARGS32 in ia32entry.S now uses symbolic stack offsets
 instead of magic numbers.

 error_entry and save_paranoid now use SAVE_C_REGS + SAVE_EXTRA_REGS
 instead of having it open-coded yet again.

 Patch was run-tested: 64-bit executables, 32-bit executables,
 strace works.
 Timing tests did not show measurable difference in 32-bit
 and 64-bit syscalls.
 
 Hello Denys,
 
 My test vm doesn't boot with this patch. Could you help to investigate
 this issue?

I think I found it. This part of my patch is possibly wrong:

@@ -171,9 +171,9 @@ static inline int arch_irqs_disabled(void)
 #define ARCH_LOCKDEP_SYS_EXIT_IRQ  \
TRACE_IRQS_ON; \
sti; \
-   SAVE_REST; \
+   SAVE_EXTRA_REGS; \
LOCKDEP_SYS_EXIT; \
-   RESTORE_REST; \
+   RESTORE_EXTRA_REGS; \
cli; \
TRACE_IRQS_OFF;

The SAVE_REST here is intended to really *push* extra regs on stack,
but the patch changed it so that they are written to existing stack
slots above.

From code inspection it should work in almost all cases, but some
locations where it is used are really obscure.

If there are places where *pushing* regs is really necessary,
this can corrupt rbp,rbx,r12-15 registers.

Your config has CONFIG_LOCKDEP=y, I think it's worth trying whether the bug
was here.
Please find updated patch attached. Can you try it?

-- 
vda

From 80c82370dcb8cd85c3b981ce62f63f74eff0df2c Mon Sep 17 00:00:00 2001
From: Denys Vlasenko dvlas...@redhat.com
Date: Wed, 25 Feb 2015 19:32:14 +0100
Subject: [PATCH v2] x86: entry_64.S: always allocate complete struct pt_regs

64-bit code was using six stack slots less by not saving/restoring
registers which are callee-preserved according to C ABI,
and not allocating space for them.
Only when syscall needed a complete struct pt_regs,
the complete area was allocated and filled in.
As an additional twist, on interrupt entry a slightly less truncated pt_regs
trick is used, to make nested interrupt stacks easier to unwind.

This proved to be a source of significant obfuscation and subtle bugs.
For example, stub_fork had to pop the return address,
extend the struct, save registers, and push return address back. Ugly.
ia32_ptregs_common pops return address and returns via jmp insn,
throwing a wrench into CPU return stack cache.

This patch changes code to always allocate a complete struct pt_regs.
The saving of registers is still done lazily.

Partial pt_regs trick on interrupt stack is retained.

Macros which manipulate struct pt_regs on stack are reworked:
ALLOC_PT_GPREGS_ON_STACK allocates the structure.
SAVE_C_REGS saves to it those registers which are clobbered by C code.
SAVE_EXTRA_REGS saves to it all other registers.
Corresponding RESTORE_* and REMOVE_PT_GPREGS_FROM_STACK macros reverse it.

ia32_ptregs_common, stub_fork and friends lost their ugly dance with
return pointer.

LOAD_ARGS32 in ia32entry.S now uses symbolic stack offsets
instead of magic numbers.

error_entry and save_paranoid now use SAVE_C_REGS + SAVE_EXTRA_REGS
instead of having it open-coded yet again.

Patch was run-tested: 64-bit executables, 32-bit executables,
strace works.
Timing tests did not show measurable difference in 32-bit
and 64-bit syscalls.

Signed-off-by: Denys Vlasenko dvlas...@redhat.com
CC: Linus Torvalds torva...@linux-foundation.org
CC: Oleg Nesterov o...@redhat.com
CC: Borislav Petkov b...@alien8.de
CC: H. Peter Anvin h...@zytor.com
CC: Andy Lutomirski l...@amacapital.net
CC: Frederic Weisbecker fweis...@gmail.com
CC: X86 ML x...@kernel.org
CC: Alexei Starovoitov a...@plumgrid.com
CC: Will Drewry 

Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete struct pt_regs

2015-02-25 Thread Denys Vlasenko
On 02/25/2015 09:10 PM, Andy Lutomirski wrote:
 On Wed, Feb 25, 2015 at 11:59 AM, Andrey Wagin ava...@gmail.com wrote:
 2015-02-25 21:42 GMT+03:00 Denys Vlasenko dvlas...@redhat.com:
 On 02/25/2015 01:37 PM, Andrey Wagin wrote:
 2015-02-13 0:54 GMT+03:00 Denys Vlasenko dvlas...@redhat.com:
 64-bit code was using six stack slots less by not saving/restoring
 registers which are callee-preserved according to C ABI,
 and not allocating space for them.
 Only when syscall needed a complete struct pt_regs,
 the complete area was allocated and filled in.
 As an additional twist, on interrupt entry a slightly less truncated 
 pt_regs
 trick is used, to make nested interrupt stacks easier to unwind.

 This proved to be a source of significant obfuscation and subtle bugs.
 For example, stub_fork had to pop the return address,
 extend the struct, save registers, and push return address back. Ugly.
 ia32_ptregs_common pops return address and returns via jmp insn,
 throwing a wrench into CPU return stack cache.

 This patch changes code to always allocate a complete struct pt_regs.
 The saving of registers is still done lazily.

 Partial pt_regs trick on interrupt stack is retained.

 Macros which manipulate struct pt_regs on stack are reworked:
 ALLOC_PT_GPREGS_ON_STACK allocates the structure.
 SAVE_C_REGS saves to it those registers which are clobbered by C code.
 SAVE_EXTRA_REGS saves to it all other registers.
 Corresponding RESTORE_* and REMOVE_PT_GPREGS_FROM_STACK macros reverse it.

 ia32_ptregs_common, stub_fork and friends lost their ugly dance with
 return pointer.

 LOAD_ARGS32 in ia32entry.S now uses symbolic stack offsets
 instead of magic numbers.

 error_entry and save_paranoid now use SAVE_C_REGS + SAVE_EXTRA_REGS
 instead of having it open-coded yet again.

 Patch was run-tested: 64-bit executables, 32-bit executables,
 strace works.
 Timing tests did not show measurable difference in 32-bit
 and 64-bit syscalls.

 Hello Denys,

 My test vm doesn't boot with this patch. Could you help to investigate
 this issue?

 I think I found it. This part of my patch is possibly wrong:

 @@ -171,9 +171,9 @@ static inline int arch_irqs_disabled(void)
  #define ARCH_LOCKDEP_SYS_EXIT_IRQ  \
 TRACE_IRQS_ON; \
 sti; \
 -   SAVE_REST; \
 +   SAVE_EXTRA_REGS; \
 LOCKDEP_SYS_EXIT; \
 -   RESTORE_REST; \
 +   RESTORE_EXTRA_REGS; \
 cli; \
 TRACE_IRQS_OFF;

 The SAVE_REST here is intended to really *push* extra regs on stack,
 but the patch changed it so that they are written to existing stack
 slots above.

 From code inspection it should work in almost all cases, but some
 locations where it is used are really obscure.

 If there are places where *pushing* regs is really necessary,
 this can corrupt rbp,rbx,r12-15 registers.

 Your config has CONFIG_LOCKDEP=y, I think it's worth trying whether the bug
 was here.
 Please find updated patch attached. Can you try it?

 It doesn't work

Thanks for testing it anyway.


 [3.016262] traps: systemd-cgroups[390] general protection
 ip:7f456f7b6028 sp:7fffdc059718 error:0 in
 ld-2.18.so[7f456f79e000+2]

This is what I know about these crashes. The SEGV itself is caused by
HLT instruction executed by dynamic loader, ld-2.NN.so.
The instruction is in _exit function, and is only reachable if
exit_group and exit syscalls fail to terminate the process.
So it seems that syscall execution is getting badly broken somehow
at some point.

This happens to both reporters.

My theory that it is related to lockdep seems to be wrong, because
Sabrina's kernel is not lockdep-enabled, yet it sees the same failure.

Both kernels are paravirtualized, both are booted under KVM,
Andrey runs it with four virtual CPUs, Sabrina runs with two.

My next theory is that I missed something related to paravirt.
I am looking at that code, so far I don't see anything suspicious.

Unfortunately, it doesn't happen to me: I have Sabrina's bzImage,
I run it under qemu-system-x86_64 -enable-kvm -smp 2,
I see in dmesg that kernel does detect that it is being run under KVM,
but it works for me. No mysterious segfaults.

Andrey, can you send me your bzImage? Maybe it will trigger
the problem for me.


 The change to stub_\func looks wrong to me.  It saves and restores
 regs, but those regs might already have been saved if we're on the
 slow path.  (Yes, all that code is quite buggy even without all these
 patches.)  So is execve.
 
 This means that, for example, execve called in the slow path will
 save/restore regs twice.  If the values in the regs after the first
 save and before the second save are different, then we corrupt user
 state.

This part?

.macro FORK_LIKE func
 ENTRY(stub_\func)
CFI_STARTPROC
-   popq%r11/* save return address */
-   PARTIAL_FRAME 0
-   SAVE_REST
-   pushq   %r11/* put it back on stack */
+   DEFAULT_FRAME 0, 8  /* offset 8: return 

Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete struct pt_regs

2015-02-25 Thread Andy Lutomirski
On Wed, Feb 25, 2015 at 1:28 PM, Denys Vlasenko dvlas...@redhat.com wrote:
 On 02/25/2015 09:10 PM, Andy Lutomirski wrote:
 On Wed, Feb 25, 2015 at 11:59 AM, Andrey Wagin ava...@gmail.com wrote:
 2015-02-25 21:42 GMT+03:00 Denys Vlasenko dvlas...@redhat.com:
 On 02/25/2015 01:37 PM, Andrey Wagin wrote:
 2015-02-13 0:54 GMT+03:00 Denys Vlasenko dvlas...@redhat.com:
 64-bit code was using six stack slots less by not saving/restoring
 registers which are callee-preserved according to C ABI,
 and not allocating space for them.
 Only when syscall needed a complete struct pt_regs,
 the complete area was allocated and filled in.
 As an additional twist, on interrupt entry a slightly less truncated 
 pt_regs
 trick is used, to make nested interrupt stacks easier to unwind.

 This proved to be a source of significant obfuscation and subtle bugs.
 For example, stub_fork had to pop the return address,
 extend the struct, save registers, and push return address back. Ugly.
 ia32_ptregs_common pops return address and returns via jmp insn,
 throwing a wrench into CPU return stack cache.

 This patch changes code to always allocate a complete struct pt_regs.
 The saving of registers is still done lazily.

 Partial pt_regs trick on interrupt stack is retained.

 Macros which manipulate struct pt_regs on stack are reworked:
 ALLOC_PT_GPREGS_ON_STACK allocates the structure.
 SAVE_C_REGS saves to it those registers which are clobbered by C code.
 SAVE_EXTRA_REGS saves to it all other registers.
 Corresponding RESTORE_* and REMOVE_PT_GPREGS_FROM_STACK macros reverse 
 it.

 ia32_ptregs_common, stub_fork and friends lost their ugly dance with
 return pointer.

 LOAD_ARGS32 in ia32entry.S now uses symbolic stack offsets
 instead of magic numbers.

 error_entry and save_paranoid now use SAVE_C_REGS + SAVE_EXTRA_REGS
 instead of having it open-coded yet again.

 Patch was run-tested: 64-bit executables, 32-bit executables,
 strace works.
 Timing tests did not show measurable difference in 32-bit
 and 64-bit syscalls.

 Hello Denys,

 My test vm doesn't boot with this patch. Could you help to investigate
 this issue?

 I think I found it. This part of my patch is possibly wrong:

 @@ -171,9 +171,9 @@ static inline int arch_irqs_disabled(void)
  #define ARCH_LOCKDEP_SYS_EXIT_IRQ  \
 TRACE_IRQS_ON; \
 sti; \
 -   SAVE_REST; \
 +   SAVE_EXTRA_REGS; \
 LOCKDEP_SYS_EXIT; \
 -   RESTORE_REST; \
 +   RESTORE_EXTRA_REGS; \
 cli; \
 TRACE_IRQS_OFF;

 The SAVE_REST here is intended to really *push* extra regs on stack,
 but the patch changed it so that they are written to existing stack
 slots above.

 From code inspection it should work in almost all cases, but some
 locations where it is used are really obscure.

 If there are places where *pushing* regs is really necessary,
 this can corrupt rbp,rbx,r12-15 registers.

 Your config has CONFIG_LOCKDEP=y, I think it's worth trying whether the bug
 was here.
 Please find updated patch attached. Can you try it?

 It doesn't work

 Thanks for testing it anyway.


 [3.016262] traps: systemd-cgroups[390] general protection
 ip:7f456f7b6028 sp:7fffdc059718 error:0 in
 ld-2.18.so[7f456f79e000+2]

 This is what I know about these crashes. The SEGV itself is caused by
 HLT instruction executed by dynamic loader, ld-2.NN.so.
 The instruction is in _exit function, and is only reachable if
 exit_group and exit syscalls fail to terminate the process.
 So it seems that syscall execution is getting badly broken somehow
 at some point.

 This happens to both reporters.

 My theory that it is related to lockdep seems to be wrong, because
 Sabrina's kernel is not lockdep-enabled, yet it sees the same failure.

 Both kernels are paravirtualized, both are booted under KVM,
 Andrey runs it with four virtual CPUs, Sabrina runs with two.

 My next theory is that I missed something related to paravirt.
 I am looking at that code, so far I don't see anything suspicious.

 Unfortunately, it doesn't happen to me: I have Sabrina's bzImage,
 I run it under qemu-system-x86_64 -enable-kvm -smp 2,
 I see in dmesg that kernel does detect that it is being run under KVM,
 but it works for me. No mysterious segfaults.

 Andrey, can you send me your bzImage? Maybe it will trigger
 the problem for me.


 The change to stub_\func looks wrong to me.  It saves and restores
 regs, but those regs might already have been saved if we're on the
 slow path.  (Yes, all that code is quite buggy even without all these
 patches.)  So is execve.

 This means that, for example, execve called in the slow path will
 save/restore regs twice.  If the values in the regs after the first
 save and before the second save are different, then we corrupt user
 state.

 This part?

 .macro FORK_LIKE func
  ENTRY(stub_\func)
 CFI_STARTPROC
 -   popq%r11/* save return address */
 -   PARTIAL_FRAME 0
 -   SAVE_REST
 -   pushq   %r11  

Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete struct pt_regs

2015-02-25 Thread Sabrina Dubroca
2015-02-25, 13:59:06 -0800, Andy Lutomirski wrote:
 On Wed, Feb 25, 2015 at 1:28 PM, Denys Vlasenko dvlas...@redhat.com wrote:
  On 02/25/2015 09:10 PM, Andy Lutomirski wrote:
  On Wed, Feb 25, 2015 at 11:59 AM, Andrey Wagin ava...@gmail.com wrote:
  2015-02-25 21:42 GMT+03:00 Denys Vlasenko dvlas...@redhat.com:
  On 02/25/2015 01:37 PM, Andrey Wagin wrote:
  2015-02-13 0:54 GMT+03:00 Denys Vlasenko dvlas...@redhat.com:
  64-bit code was using six stack slots less by not saving/restoring
  registers which are callee-preserved according to C ABI,
  and not allocating space for them.
  Only when syscall needed a complete struct pt_regs,
  the complete area was allocated and filled in.
  As an additional twist, on interrupt entry a slightly less truncated 
  pt_regs
  trick is used, to make nested interrupt stacks easier to unwind.
 
  This proved to be a source of significant obfuscation and subtle bugs.
  For example, stub_fork had to pop the return address,
  extend the struct, save registers, and push return address back. Ugly.
  ia32_ptregs_common pops return address and returns via jmp insn,
  throwing a wrench into CPU return stack cache.
 
  This patch changes code to always allocate a complete struct pt_regs.
  The saving of registers is still done lazily.
 
  Partial pt_regs trick on interrupt stack is retained.
 
  Macros which manipulate struct pt_regs on stack are reworked:
  ALLOC_PT_GPREGS_ON_STACK allocates the structure.
  SAVE_C_REGS saves to it those registers which are clobbered by C code.
  SAVE_EXTRA_REGS saves to it all other registers.
  Corresponding RESTORE_* and REMOVE_PT_GPREGS_FROM_STACK macros reverse 
  it.
 
  ia32_ptregs_common, stub_fork and friends lost their ugly dance with
  return pointer.
 
  LOAD_ARGS32 in ia32entry.S now uses symbolic stack offsets
  instead of magic numbers.
 
  error_entry and save_paranoid now use SAVE_C_REGS + SAVE_EXTRA_REGS
  instead of having it open-coded yet again.
 
  Patch was run-tested: 64-bit executables, 32-bit executables,
  strace works.
  Timing tests did not show measurable difference in 32-bit
  and 64-bit syscalls.
 
  Hello Denys,
 
  My test vm doesn't boot with this patch. Could you help to investigate
  this issue?
 
  I think I found it. This part of my patch is possibly wrong:
 
  @@ -171,9 +171,9 @@ static inline int arch_irqs_disabled(void)
   #define ARCH_LOCKDEP_SYS_EXIT_IRQ  \
  TRACE_IRQS_ON; \
  sti; \
  -   SAVE_REST; \
  +   SAVE_EXTRA_REGS; \
  LOCKDEP_SYS_EXIT; \
  -   RESTORE_REST; \
  +   RESTORE_EXTRA_REGS; \
  cli; \
  TRACE_IRQS_OFF;
 
  The SAVE_REST here is intended to really *push* extra regs on stack,
  but the patch changed it so that they are written to existing stack
  slots above.
 
  From code inspection it should work in almost all cases, but some
  locations where it is used are really obscure.
 
  If there are places where *pushing* regs is really necessary,
  this can corrupt rbp,rbx,r12-15 registers.
 
  Your config has CONFIG_LOCKDEP=y, I think it's worth trying whether the 
  bug
  was here.
  Please find updated patch attached. Can you try it?
 
  It doesn't work
 
  Thanks for testing it anyway.
 
 
  [3.016262] traps: systemd-cgroups[390] general protection
  ip:7f456f7b6028 sp:7fffdc059718 error:0 in
  ld-2.18.so[7f456f79e000+2]
 
  This is what I know about these crashes. The SEGV itself is caused by
  HLT instruction executed by dynamic loader, ld-2.NN.so.
  The instruction is in _exit function, and is only reachable if
  exit_group and exit syscalls fail to terminate the process.
  So it seems that syscall execution is getting badly broken somehow
  at some point.
 
  This happens to both reporters.
 
  My theory that it is related to lockdep seems to be wrong, because
  Sabrina's kernel is not lockdep-enabled, yet it sees the same failure.
 
  Both kernels are paravirtualized, both are booted under KVM,
  Andrey runs it with four virtual CPUs, Sabrina runs with two.
 
  My next theory is that I missed something related to paravirt.
  I am looking at that code, so far I don't see anything suspicious.
 
  Unfortunately, it doesn't happen to me: I have Sabrina's bzImage,
  I run it under qemu-system-x86_64 -enable-kvm -smp 2,
  I see in dmesg that kernel does detect that it is being run under KVM,
  but it works for me. No mysterious segfaults.
 
  Andrey, can you send me your bzImage? Maybe it will trigger
  the problem for me.
 
 
  The change to stub_\func looks wrong to me.  It saves and restores
  regs, but those regs might already have been saved if we're on the
  slow path.  (Yes, all that code is quite buggy even without all these
  patches.)  So is execve.
 
  This means that, for example, execve called in the slow path will
  save/restore regs twice.  If the values in the regs after the first
  save and before the second save are different, then we corrupt user
  state.
 
  This part?
 
  .macro 

Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete struct pt_regs

2015-02-25 Thread Sabrina Dubroca
2015-02-25, 23:40:55 +0100, Sabrina Dubroca wrote:
 I can run some userspace programs, but I have no idea what would be
 helpful.
 I can also try booting a real machine with archlinux/systemd tomorrow.

I got a good boot out of kernels that normally fail.  I booted
systemd's emergency shell and enabled a few services, in the same
order they normally start.  journald started cleanly, but after that,
every single command produced a traps: output and an audit: line.

I disabled systemd-journald (chmod -x, because `systemctl disable`
didn't really disable it), and now it boots, no traps: in the log.
If I run it, everything fails again (zsh has traps for simply pressing
enter on an empty cmd).

-- 
Sabrina
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete struct pt_regs

2015-02-25 Thread Denys Vlasenko
On Thu, Feb 26, 2015 at 12:34 AM, Sabrina Dubroca s...@queasysnail.net wrote:
 2015-02-25, 23:40:55 +0100, Sabrina Dubroca wrote:
 I can run some userspace programs, but I have no idea what would be
 helpful.
 I can also try booting a real machine with archlinux/systemd tomorrow.

 I got a good boot out of kernels that normally fail.  I booted
 systemd's emergency shell and enabled a few services, in the same
 order they normally start.  journald started cleanly, but after that,
 every single command produced a traps: output and an audit: line.

 I disabled systemd-journald (chmod -x, because `systemctl disable`
 didn't really disable it), and now it boots, no traps: in the log.
 If I run it, everything fails again (zsh has traps for simply pressing
 enter on an empty cmd).

That's some progress!

It's strange how one process manages to affect everything else.

If I run it, everything fails again. How do you run it? Directly,
or via systemd services mechanism?
If you just run it directly, can you try running it under
strace -f -tt -oLOG? Does it have the same effect? What's in the LOG?
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete struct pt_regs

2015-02-25 Thread Stephen Rothwell
Hi all,

On Wed, 25 Feb 2015 21:18:52 -0800 Andrew Morton a...@linux-foundation.org 
wrote:

 On Thu, 26 Feb 2015 02:12:57 +0100 Denys Vlasenko vda.li...@googlemail.com 
 wrote:
 
  On Thu, Feb 26, 2015 at 12:34 AM, Sabrina Dubroca s...@queasysnail.net 
  wrote:
   2015-02-25, 23:40:55 +0100, Sabrina Dubroca wrote:
   I can run some userspace programs, but I have no idea what would be
   helpful.
   I can also try booting a real machine with archlinux/systemd tomorrow.
  
   I got a good boot out of kernels that normally fail.  I booted
   systemd's emergency shell and enabled a few services, in the same
   order they normally start.  journald started cleanly, but after that,
   every single command produced a traps: output and an audit: line.
  
   I disabled systemd-journald (chmod -x, because `systemctl disable`
   didn't really disable it), and now it boots, no traps: in the log.
   If I run it, everything fails again (zsh has traps for simply pressing
   enter on an empty cmd).
  
  That's some progress!
  
  It's strange how one process manages to affect everything else.
  
  If I run it, everything fails again. How do you run it? Directly,
  or via systemd services mechanism?
  If you just run it directly, can you try running it under
  strace -f -tt -oLOG? Does it have the same effect? What's in the LOG?
 
 I'm hitting this bug as well, bisected to this commit.  On an old
 x64_64 box, no vms, paravirt, etc.  Running FC6 userspace (heh).
 
 Quite late in initscripts, binaries start getting segmentation faults
 and init gives up.  Seems to only affect /usr/bin/rhgb-client.  There's
 one instance where /bin/rm is said to segfault, but I suspect that's
 init lying to me.

I note that that commit has been removed from today's version of the
luto-misc tree and thus linux-next.

-- 
Cheers,
Stephen Rothwells...@canb.auug.org.au


pgpu6vlymy8yz.pgp
Description: OpenPGP digital signature


Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete struct pt_regs

2015-02-25 Thread Andrew Morton
On Thu, 26 Feb 2015 02:12:57 +0100 Denys Vlasenko vda.li...@googlemail.com 
wrote:

 On Thu, Feb 26, 2015 at 12:34 AM, Sabrina Dubroca s...@queasysnail.net 
 wrote:
  2015-02-25, 23:40:55 +0100, Sabrina Dubroca wrote:
  I can run some userspace programs, but I have no idea what would be
  helpful.
  I can also try booting a real machine with archlinux/systemd tomorrow.
 
  I got a good boot out of kernels that normally fail.  I booted
  systemd's emergency shell and enabled a few services, in the same
  order they normally start.  journald started cleanly, but after that,
  every single command produced a traps: output and an audit: line.
 
  I disabled systemd-journald (chmod -x, because `systemctl disable`
  didn't really disable it), and now it boots, no traps: in the log.
  If I run it, everything fails again (zsh has traps for simply pressing
  enter on an empty cmd).
 
 That's some progress!
 
 It's strange how one process manages to affect everything else.
 
 If I run it, everything fails again. How do you run it? Directly,
 or via systemd services mechanism?
 If you just run it directly, can you try running it under
 strace -f -tt -oLOG? Does it have the same effect? What's in the LOG?

I'm hitting this bug as well, bisected to this commit.  On an old
x64_64 box, no vms, paravirt, etc.  Running FC6 userspace (heh).

Quite late in initscripts, binaries start getting segmentation faults
and init gives up.  Seems to only affect /usr/bin/rhgb-client.  There's
one instance where /bin/rm is said to segfault, but I suspect that's
init lying to me.


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete "struct pt_regs"

2015-02-18 Thread Andy Lutomirski
On Thu, Feb 12, 2015 at 1:54 PM, Denys Vlasenko  wrote:
> 64-bit code was using six stack slots less by not saving/restoring
> registers which are callee-preserved according to C ABI,
> and not allocating space for them.
> Only when syscall needed a complete "struct pt_regs",
> the complete area was allocated and filled in.
> As an additional twist, on interrupt entry a "slightly less truncated pt_regs"
> trick is used, to make nested interrupt stacks easier to unwind.
>
> This proved to be a source of significant obfuscation and subtle bugs.
> For example, stub_fork had to pop the return address,
> extend the struct, save registers, and push return address back. Ugly.
> ia32_ptregs_common pops return address and "returns" via jmp insn,
> throwing a wrench into CPU return stack cache.
>
> This patch changes code to always allocate a complete "struct pt_regs".
> The saving of registers is still done lazily.
>
> "Partial pt_regs" trick on interrupt stack is retained.
>
> Macros which manipulate "struct pt_regs" on stack are reworked:
> ALLOC_PT_GPREGS_ON_STACK allocates the structure.
> SAVE_C_REGS saves to it those registers which are clobbered by C code.
> SAVE_EXTRA_REGS saves to it all other registers.
> Corresponding RESTORE_* and REMOVE_PT_GPREGS_FROM_STACK macros reverse it.
>
> ia32_ptregs_common, stub_fork and friends lost their ugly dance with
> return pointer.
>
> LOAD_ARGS32 in ia32entry.S now uses symbolic stack offsets
> instead of magic numbers.
>
> error_entry and save_paranoid now use SAVE_C_REGS + SAVE_EXTRA_REGS
> instead of having it open-coded yet again.
>
> Patch was run-tested: 64-bit executables, 32-bit executables,
> strace works.
> Timing tests did not show measurable difference in 32-bit
> and 64-bit syscalls.

This patch scares me, because it changes a lot of hairy code.  That
being said, I don't see anything wrong with it, and the end result is
much nicer than the status quo.  So I applied it, and I'll let the
kbuild bot have fun with it.  I confirmed that I can boot a 64-bit and
a 32-bit system with it, at least in my configuration.

Further reviews are encouraged :)

--Andy

>
> Signed-off-by: Denys Vlasenko 
> CC: Linus Torvalds 
> CC: Oleg Nesterov 
> CC: Borislav Petkov 
> CC: "H. Peter Anvin" 
> CC: Andy Lutomirski 
> CC: Frederic Weisbecker 
> CC: X86 ML 
> CC: Alexei Starovoitov 
> CC: Will Drewry 
> CC: Kees Cook 
> CC: linux-kernel@vger.kernel.org
> ---
>  arch/x86/ia32/ia32entry.S  |  47 +++
>  arch/x86/include/asm/calling.h | 222 
> -
>  arch/x86/include/asm/irqflags.h|   4 +-
>  arch/x86/include/uapi/asm/ptrace-abi.h |   1 -
>  arch/x86/kernel/entry_64.S | 195 +++--
>  5 files changed, 209 insertions(+), 260 deletions(-)
>
> diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
> index 156ebca..f4bed49 100644
> --- a/arch/x86/ia32/ia32entry.S
> +++ b/arch/x86/ia32/ia32entry.S
> @@ -62,12 +62,12 @@
>  */
> .macro LOAD_ARGS32 offset, _r9=0
> .if \_r9
> -   movl \offset+16(%rsp),%r9d
> +   movl \offset+R9(%rsp),%r9d
> .endif
> -   movl \offset+40(%rsp),%ecx
> -   movl \offset+48(%rsp),%edx
> -   movl \offset+56(%rsp),%esi
> -   movl \offset+64(%rsp),%edi
> +   movl \offset+RCX(%rsp),%ecx
> +   movl \offset+RDX(%rsp),%edx
> +   movl \offset+RSI(%rsp),%esi
> +   movl \offset+RDI(%rsp),%edi
> movl %eax,%eax  /* zero extension */
> .endm
>
> @@ -144,7 +144,8 @@ ENTRY(ia32_sysenter_target)
> CFI_REL_OFFSET rip,0
> pushq_cfi %rax
> cld
> -   SAVE_ARGS 0,1,0
> +   ALLOC_PT_GPREGS_ON_STACK
> +   SAVE_C_REGS_EXCEPT_R891011
> /* no need to do an access_ok check here because rbp has been
>32bit zero extended */
> ASM_STAC
> @@ -182,7 +183,8 @@ sysexit_from_sys_call:
> andl$~0x200,EFLAGS-ARGOFFSET(%rsp)
> movlRIP-ARGOFFSET(%rsp),%edx/* User %eip */
> CFI_REGISTER rip,rdx
> -   RESTORE_ARGS 0,24,0,0,0,0
> +   RESTORE_RSI_RDI
> +   REMOVE_PT_GPREGS_FROM_STACK 3*8
> xorq%r8,%r8
> xorq%r9,%r9
> xorq%r10,%r10
> @@ -256,13 +258,13 @@ sysenter_tracesys:
> testl   $(_TIF_WORK_SYSCALL_ENTRY & 
> ~_TIF_SYSCALL_AUDIT),TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
> jz  sysenter_auditsys
>  #endif
> -   SAVE_REST
> +   SAVE_EXTRA_REGS
> CLEAR_RREGS
> movq$-ENOSYS,RAX(%rsp)/* ptrace can change this for a bad syscall 
> */
> movq%rsp,%rdi/* _regs -> arg1 */
> callsyscall_trace_enter
> LOAD_ARGS32 ARGOFFSET  /* reload args from stack in case ptrace 
> changed it */
> -   RESTORE_REST
> +   RESTORE_EXTRA_REGS
> cmpq$(IA32_NR_syscalls-1),%rax
> ja  int_ret_from_sys_call /* sysenter_tracesys has set RAX(%rsp) 
> */

Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete struct pt_regs

2015-02-18 Thread Andy Lutomirski
On Thu, Feb 12, 2015 at 1:54 PM, Denys Vlasenko dvlas...@redhat.com wrote:
 64-bit code was using six stack slots less by not saving/restoring
 registers which are callee-preserved according to C ABI,
 and not allocating space for them.
 Only when syscall needed a complete struct pt_regs,
 the complete area was allocated and filled in.
 As an additional twist, on interrupt entry a slightly less truncated pt_regs
 trick is used, to make nested interrupt stacks easier to unwind.

 This proved to be a source of significant obfuscation and subtle bugs.
 For example, stub_fork had to pop the return address,
 extend the struct, save registers, and push return address back. Ugly.
 ia32_ptregs_common pops return address and returns via jmp insn,
 throwing a wrench into CPU return stack cache.

 This patch changes code to always allocate a complete struct pt_regs.
 The saving of registers is still done lazily.

 Partial pt_regs trick on interrupt stack is retained.

 Macros which manipulate struct pt_regs on stack are reworked:
 ALLOC_PT_GPREGS_ON_STACK allocates the structure.
 SAVE_C_REGS saves to it those registers which are clobbered by C code.
 SAVE_EXTRA_REGS saves to it all other registers.
 Corresponding RESTORE_* and REMOVE_PT_GPREGS_FROM_STACK macros reverse it.

 ia32_ptregs_common, stub_fork and friends lost their ugly dance with
 return pointer.

 LOAD_ARGS32 in ia32entry.S now uses symbolic stack offsets
 instead of magic numbers.

 error_entry and save_paranoid now use SAVE_C_REGS + SAVE_EXTRA_REGS
 instead of having it open-coded yet again.

 Patch was run-tested: 64-bit executables, 32-bit executables,
 strace works.
 Timing tests did not show measurable difference in 32-bit
 and 64-bit syscalls.

This patch scares me, because it changes a lot of hairy code.  That
being said, I don't see anything wrong with it, and the end result is
much nicer than the status quo.  So I applied it, and I'll let the
kbuild bot have fun with it.  I confirmed that I can boot a 64-bit and
a 32-bit system with it, at least in my configuration.

Further reviews are encouraged :)

--Andy


 Signed-off-by: Denys Vlasenko dvlas...@redhat.com
 CC: Linus Torvalds torva...@linux-foundation.org
 CC: Oleg Nesterov o...@redhat.com
 CC: Borislav Petkov b...@alien8.de
 CC: H. Peter Anvin h...@zytor.com
 CC: Andy Lutomirski l...@amacapital.net
 CC: Frederic Weisbecker fweis...@gmail.com
 CC: X86 ML x...@kernel.org
 CC: Alexei Starovoitov a...@plumgrid.com
 CC: Will Drewry w...@chromium.org
 CC: Kees Cook keesc...@chromium.org
 CC: linux-kernel@vger.kernel.org
 ---
  arch/x86/ia32/ia32entry.S  |  47 +++
  arch/x86/include/asm/calling.h | 222 
 -
  arch/x86/include/asm/irqflags.h|   4 +-
  arch/x86/include/uapi/asm/ptrace-abi.h |   1 -
  arch/x86/kernel/entry_64.S | 195 +++--
  5 files changed, 209 insertions(+), 260 deletions(-)

 diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
 index 156ebca..f4bed49 100644
 --- a/arch/x86/ia32/ia32entry.S
 +++ b/arch/x86/ia32/ia32entry.S
 @@ -62,12 +62,12 @@
  */
 .macro LOAD_ARGS32 offset, _r9=0
 .if \_r9
 -   movl \offset+16(%rsp),%r9d
 +   movl \offset+R9(%rsp),%r9d
 .endif
 -   movl \offset+40(%rsp),%ecx
 -   movl \offset+48(%rsp),%edx
 -   movl \offset+56(%rsp),%esi
 -   movl \offset+64(%rsp),%edi
 +   movl \offset+RCX(%rsp),%ecx
 +   movl \offset+RDX(%rsp),%edx
 +   movl \offset+RSI(%rsp),%esi
 +   movl \offset+RDI(%rsp),%edi
 movl %eax,%eax  /* zero extension */
 .endm

 @@ -144,7 +144,8 @@ ENTRY(ia32_sysenter_target)
 CFI_REL_OFFSET rip,0
 pushq_cfi %rax
 cld
 -   SAVE_ARGS 0,1,0
 +   ALLOC_PT_GPREGS_ON_STACK
 +   SAVE_C_REGS_EXCEPT_R891011
 /* no need to do an access_ok check here because rbp has been
32bit zero extended */
 ASM_STAC
 @@ -182,7 +183,8 @@ sysexit_from_sys_call:
 andl$~0x200,EFLAGS-ARGOFFSET(%rsp)
 movlRIP-ARGOFFSET(%rsp),%edx/* User %eip */
 CFI_REGISTER rip,rdx
 -   RESTORE_ARGS 0,24,0,0,0,0
 +   RESTORE_RSI_RDI
 +   REMOVE_PT_GPREGS_FROM_STACK 3*8
 xorq%r8,%r8
 xorq%r9,%r9
 xorq%r10,%r10
 @@ -256,13 +258,13 @@ sysenter_tracesys:
 testl   $(_TIF_WORK_SYSCALL_ENTRY  
 ~_TIF_SYSCALL_AUDIT),TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
 jz  sysenter_auditsys
  #endif
 -   SAVE_REST
 +   SAVE_EXTRA_REGS
 CLEAR_RREGS
 movq$-ENOSYS,RAX(%rsp)/* ptrace can change this for a bad syscall 
 */
 movq%rsp,%rdi/* pt_regs - arg1 */
 callsyscall_trace_enter
 LOAD_ARGS32 ARGOFFSET  /* reload args from stack in case ptrace 
 changed it */
 -   RESTORE_REST
 +   RESTORE_EXTRA_REGS
 cmpq$(IA32_NR_syscalls-1),%rax