Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete "struct pt_regs"
On Feb 26, 2015 1:55 AM, "Denys Vlasenko" wrote: > > On Wed, Feb 25, 2015 at 10:59 PM, Andy Lutomirski wrote: > > On Wed, Feb 25, 2015 at 1:28 PM, Denys Vlasenko wrote: > >> On 02/25/2015 09:10 PM, Andy Lutomirski wrote: > >> This part? > >> > >> .macro FORK_LIKE func > >> ENTRY(stub_\func) > >> CFI_STARTPROC > >> - popq%r11/* save return address */ > >> - PARTIAL_FRAME 0 > >> - SAVE_REST > >> - pushq %r11/* put it back on stack */ > >> + DEFAULT_FRAME 0, 8 /* offset 8: return address */ > >> + SAVE_EXTRA_REGS 8 > >> FIXUP_TOP_OF_STACK %r11, 8 > >> - DEFAULT_FRAME 0 8 /* offset 8: return address */ > >> call sys_\func > >> RESTORE_TOP_OF_STACK %r11, 8 > >> - ret $REST_SKIP /* pop extended registers */ > >> + ret > >> CFI_ENDPROC > >> END(stub_\func) > >> .endm > >> > >> FORK_LIKE clone > >> FORK_LIKE fork > >> FORK_LIKE vfork > >> > >> But the old code (SAVE_REST thing) was also saving registers here. > >> It had to jump through hoops (pop return address, SAVE_REST, > >> push return address) to do that. > >> After the patch, "SAVE_EXTRA_REGS 8" does the same, just without > >> pop/push pair. > >> > >> I just don't see what's wrong with it. Can you elaborate? > > > > SAVE_REST pushed the regs onto the stack, whereas SAVE_EXTRA_REGS just > > writes them in place. It's possible for this to be called when the > > regs have already been saved. > > If that would be the case - that is, if SAVE_REST was saving extra copy > of registers on stack, then FIXUP_TOP_OF_STACK %r11, 8 would be working > on wrong locations. The "8" there says "we have full pt_regs on stack, > plus extra 8 bytes (the return address)". Your conjecture would mean > that in fact there would be more bytes on stack, and FIXUP_TOP_OF_STACK > would corrupt iret stack. Evidently, since old code was not crashing, > this wasn't happening. SAVE_REST was really creating the "tail" of pt_regs Ugh, you're right. The FIXUP_TOP_OF_STACK indeed looks duplicated, bit t that's less harmful and was already the case. --Andy . > > In addition to my previous tests, I ran my home machine with > patched kernel. Unfortunately, it works for me :( > > Will try on yet another machine. > > -- > vda -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete "struct pt_regs"
2015-02-26, 14:54:33 +0100, Denys Vlasenko wrote: > On Thu, Feb 26, 2015 at 1:11 PM, Denys Vlasenko > wrote: > > On Thu, Feb 26, 2015 at 10:55 AM, Denys Vlasenko > > wrote: > >> On Wed, Feb 25, 2015 at 10:59 PM, Andy Lutomirski > >> wrote: > >> In addition to my previous tests, I ran my home machine with > >> patched kernel. Unfortunately, it works for me :( > >> > >> Will try on yet another machine. > > > > And voila, it does happen on another machine :) > > > > I'm debugging it right now. Looks like 64-bit syscalls just stop working > > at some point in new processes. That is, existing process is alive and well, > > but children get SEGV after fork (most likely on any syscall64 they do, > > not after fork per se. They eventually manage to kill themselves - > > not trivial when exit syscall isn't working either - by tripping on HLT > > insn). > > > > 32-bit syscalls (int 80) continue to work. Fork, exec, whatever you want. > > I have static 32-bit busybox binary and everything works there. > > > > Also, any 64-bit process which was under strace continues to work correctly, > > including forks and execs. > > > > This points towards some bug on fast path sysret64 code. Looking for it. > > audit=0 makes crashes disappear. Ah, yes. > I found the problem. If syscall_trace_enter_phase1 returns 0, > I restore %rax from pt_regs->ax, but should restore it from > pt_regs->orig_ax: > > call syscall_trace_enter_phase1 > test %rax, %rax > jnz tracesys_phase2 /* if needed, run the slow path */ > - RESTORE_C_REGS /* else restore clobbered regs */ > + RESTORE_C_REGS_EXCEPT_RAX /* else restore clobbered regs */ > + movq ORIG_RAX-ARGOFFSET(%rsp),%rax > jmp system_call_fastpath/* and return to the fast path */ with s/-ARGOFFSET// on top of next-20150224, that works. Thanks, Denys. -- Sabrina -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete "struct pt_regs"
On Thu, Feb 26, 2015 at 1:11 PM, Denys Vlasenko wrote: > On Thu, Feb 26, 2015 at 10:55 AM, Denys Vlasenko > wrote: >> On Wed, Feb 25, 2015 at 10:59 PM, Andy Lutomirski >> wrote: >> In addition to my previous tests, I ran my home machine with >> patched kernel. Unfortunately, it works for me :( >> >> Will try on yet another machine. > > And voila, it does happen on another machine :) > > I'm debugging it right now. Looks like 64-bit syscalls just stop working > at some point in new processes. That is, existing process is alive and well, > but children get SEGV after fork (most likely on any syscall64 they do, > not after fork per se. They eventually manage to kill themselves - > not trivial when exit syscall isn't working either - by tripping on HLT insn). > > 32-bit syscalls (int 80) continue to work. Fork, exec, whatever you want. > I have static 32-bit busybox binary and everything works there. > > Also, any 64-bit process which was under strace continues to work correctly, > including forks and execs. > > This points towards some bug on fast path sysret64 code. Looking for it. audit=0 makes crashes disappear. I found the problem. If syscall_trace_enter_phase1 returns 0, I restore %rax from pt_regs->ax, but should restore it from pt_regs->orig_ax: call syscall_trace_enter_phase1 test %rax, %rax jnz tracesys_phase2 /* if needed, run the slow path */ - RESTORE_C_REGS /* else restore clobbered regs */ + RESTORE_C_REGS_EXCEPT_RAX /* else restore clobbered regs */ + movq ORIG_RAX-ARGOFFSET(%rsp),%rax jmp system_call_fastpath/* and return to the fast path */ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete "struct pt_regs"
On Thu, Feb 26, 2015 at 10:55 AM, Denys Vlasenko wrote: > On Wed, Feb 25, 2015 at 10:59 PM, Andy Lutomirski wrote: > In addition to my previous tests, I ran my home machine with > patched kernel. Unfortunately, it works for me :( > > Will try on yet another machine. And voila, it does happen on another machine :) I'm debugging it right now. Looks like 64-bit syscalls just stop working at some point in new processes. That is, existing process is alive and well, but children get SEGV after fork (most likely on any syscall64 they do, not after fork per se. They eventually manage to kill themselves - not trivial when exit syscall isn't working either - by tripping on HLT insn). 32-bit syscalls (int 80) continue to work. Fork, exec, whatever you want. I have static 32-bit busybox binary and everything works there. Also, any 64-bit process which was under strace continues to work correctly, including forks and execs. This points towards some bug on fast path sysret64 code. Looking for it. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete "struct pt_regs"
On Wed, Feb 25, 2015 at 10:59 PM, Andy Lutomirski wrote: > On Wed, Feb 25, 2015 at 1:28 PM, Denys Vlasenko wrote: >> On 02/25/2015 09:10 PM, Andy Lutomirski wrote: >> This part? >> >> .macro FORK_LIKE func >> ENTRY(stub_\func) >> CFI_STARTPROC >> - popq%r11/* save return address */ >> - PARTIAL_FRAME 0 >> - SAVE_REST >> - pushq %r11/* put it back on stack */ >> + DEFAULT_FRAME 0, 8 /* offset 8: return address */ >> + SAVE_EXTRA_REGS 8 >> FIXUP_TOP_OF_STACK %r11, 8 >> - DEFAULT_FRAME 0 8 /* offset 8: return address */ >> call sys_\func >> RESTORE_TOP_OF_STACK %r11, 8 >> - ret $REST_SKIP /* pop extended registers */ >> + ret >> CFI_ENDPROC >> END(stub_\func) >> .endm >> >> FORK_LIKE clone >> FORK_LIKE fork >> FORK_LIKE vfork >> >> But the old code (SAVE_REST thing) was also saving registers here. >> It had to jump through hoops (pop return address, SAVE_REST, >> push return address) to do that. >> After the patch, "SAVE_EXTRA_REGS 8" does the same, just without >> pop/push pair. >> >> I just don't see what's wrong with it. Can you elaborate? > > SAVE_REST pushed the regs onto the stack, whereas SAVE_EXTRA_REGS just > writes them in place. It's possible for this to be called when the > regs have already been saved. If that would be the case - that is, if SAVE_REST was saving extra copy of registers on stack, then FIXUP_TOP_OF_STACK %r11, 8 would be working on wrong locations. The "8" there says "we have full pt_regs on stack, plus extra 8 bytes (the return address)". Your conjecture would mean that in fact there would be more bytes on stack, and FIXUP_TOP_OF_STACK would corrupt iret stack. Evidently, since old code was not crashing, this wasn't happening. SAVE_REST was really creating the "tail" of pt_regs. In addition to my previous tests, I ran my home machine with patched kernel. Unfortunately, it works for me :( Will try on yet another machine. -- vda -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete struct pt_regs
On Thu, Feb 26, 2015 at 1:11 PM, Denys Vlasenko vda.li...@googlemail.com wrote: On Thu, Feb 26, 2015 at 10:55 AM, Denys Vlasenko vda.li...@googlemail.com wrote: On Wed, Feb 25, 2015 at 10:59 PM, Andy Lutomirski l...@amacapital.net wrote: In addition to my previous tests, I ran my home machine with patched kernel. Unfortunately, it works for me :( Will try on yet another machine. And voila, it does happen on another machine :) I'm debugging it right now. Looks like 64-bit syscalls just stop working at some point in new processes. That is, existing process is alive and well, but children get SEGV after fork (most likely on any syscall64 they do, not after fork per se. They eventually manage to kill themselves - not trivial when exit syscall isn't working either - by tripping on HLT insn). 32-bit syscalls (int 80) continue to work. Fork, exec, whatever you want. I have static 32-bit busybox binary and everything works there. Also, any 64-bit process which was under strace continues to work correctly, including forks and execs. This points towards some bug on fast path sysret64 code. Looking for it. audit=0 makes crashes disappear. I found the problem. If syscall_trace_enter_phase1 returns 0, I restore %rax from pt_regs-ax, but should restore it from pt_regs-orig_ax: call syscall_trace_enter_phase1 test %rax, %rax jnz tracesys_phase2 /* if needed, run the slow path */ - RESTORE_C_REGS /* else restore clobbered regs */ + RESTORE_C_REGS_EXCEPT_RAX /* else restore clobbered regs */ + movq ORIG_RAX-ARGOFFSET(%rsp),%rax jmp system_call_fastpath/* and return to the fast path */ -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete struct pt_regs
On Feb 26, 2015 1:55 AM, Denys Vlasenko vda.li...@googlemail.com wrote: On Wed, Feb 25, 2015 at 10:59 PM, Andy Lutomirski l...@amacapital.net wrote: On Wed, Feb 25, 2015 at 1:28 PM, Denys Vlasenko dvlas...@redhat.com wrote: On 02/25/2015 09:10 PM, Andy Lutomirski wrote: This part? .macro FORK_LIKE func ENTRY(stub_\func) CFI_STARTPROC - popq%r11/* save return address */ - PARTIAL_FRAME 0 - SAVE_REST - pushq %r11/* put it back on stack */ + DEFAULT_FRAME 0, 8 /* offset 8: return address */ + SAVE_EXTRA_REGS 8 FIXUP_TOP_OF_STACK %r11, 8 - DEFAULT_FRAME 0 8 /* offset 8: return address */ call sys_\func RESTORE_TOP_OF_STACK %r11, 8 - ret $REST_SKIP /* pop extended registers */ + ret CFI_ENDPROC END(stub_\func) .endm FORK_LIKE clone FORK_LIKE fork FORK_LIKE vfork But the old code (SAVE_REST thing) was also saving registers here. It had to jump through hoops (pop return address, SAVE_REST, push return address) to do that. After the patch, SAVE_EXTRA_REGS 8 does the same, just without pop/push pair. I just don't see what's wrong with it. Can you elaborate? SAVE_REST pushed the regs onto the stack, whereas SAVE_EXTRA_REGS just writes them in place. It's possible for this to be called when the regs have already been saved. If that would be the case - that is, if SAVE_REST was saving extra copy of registers on stack, then FIXUP_TOP_OF_STACK %r11, 8 would be working on wrong locations. The 8 there says we have full pt_regs on stack, plus extra 8 bytes (the return address). Your conjecture would mean that in fact there would be more bytes on stack, and FIXUP_TOP_OF_STACK would corrupt iret stack. Evidently, since old code was not crashing, this wasn't happening. SAVE_REST was really creating the tail of pt_regs Ugh, you're right. The FIXUP_TOP_OF_STACK indeed looks duplicated, bit t that's less harmful and was already the case. --Andy . In addition to my previous tests, I ran my home machine with patched kernel. Unfortunately, it works for me :( Will try on yet another machine. -- vda -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete struct pt_regs
2015-02-26, 14:54:33 +0100, Denys Vlasenko wrote: On Thu, Feb 26, 2015 at 1:11 PM, Denys Vlasenko vda.li...@googlemail.com wrote: On Thu, Feb 26, 2015 at 10:55 AM, Denys Vlasenko vda.li...@googlemail.com wrote: On Wed, Feb 25, 2015 at 10:59 PM, Andy Lutomirski l...@amacapital.net wrote: In addition to my previous tests, I ran my home machine with patched kernel. Unfortunately, it works for me :( Will try on yet another machine. And voila, it does happen on another machine :) I'm debugging it right now. Looks like 64-bit syscalls just stop working at some point in new processes. That is, existing process is alive and well, but children get SEGV after fork (most likely on any syscall64 they do, not after fork per se. They eventually manage to kill themselves - not trivial when exit syscall isn't working either - by tripping on HLT insn). 32-bit syscalls (int 80) continue to work. Fork, exec, whatever you want. I have static 32-bit busybox binary and everything works there. Also, any 64-bit process which was under strace continues to work correctly, including forks and execs. This points towards some bug on fast path sysret64 code. Looking for it. audit=0 makes crashes disappear. Ah, yes. I found the problem. If syscall_trace_enter_phase1 returns 0, I restore %rax from pt_regs-ax, but should restore it from pt_regs-orig_ax: call syscall_trace_enter_phase1 test %rax, %rax jnz tracesys_phase2 /* if needed, run the slow path */ - RESTORE_C_REGS /* else restore clobbered regs */ + RESTORE_C_REGS_EXCEPT_RAX /* else restore clobbered regs */ + movq ORIG_RAX-ARGOFFSET(%rsp),%rax jmp system_call_fastpath/* and return to the fast path */ with s/-ARGOFFSET// on top of next-20150224, that works. Thanks, Denys. -- Sabrina -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete struct pt_regs
On Wed, Feb 25, 2015 at 10:59 PM, Andy Lutomirski l...@amacapital.net wrote: On Wed, Feb 25, 2015 at 1:28 PM, Denys Vlasenko dvlas...@redhat.com wrote: On 02/25/2015 09:10 PM, Andy Lutomirski wrote: This part? .macro FORK_LIKE func ENTRY(stub_\func) CFI_STARTPROC - popq%r11/* save return address */ - PARTIAL_FRAME 0 - SAVE_REST - pushq %r11/* put it back on stack */ + DEFAULT_FRAME 0, 8 /* offset 8: return address */ + SAVE_EXTRA_REGS 8 FIXUP_TOP_OF_STACK %r11, 8 - DEFAULT_FRAME 0 8 /* offset 8: return address */ call sys_\func RESTORE_TOP_OF_STACK %r11, 8 - ret $REST_SKIP /* pop extended registers */ + ret CFI_ENDPROC END(stub_\func) .endm FORK_LIKE clone FORK_LIKE fork FORK_LIKE vfork But the old code (SAVE_REST thing) was also saving registers here. It had to jump through hoops (pop return address, SAVE_REST, push return address) to do that. After the patch, SAVE_EXTRA_REGS 8 does the same, just without pop/push pair. I just don't see what's wrong with it. Can you elaborate? SAVE_REST pushed the regs onto the stack, whereas SAVE_EXTRA_REGS just writes them in place. It's possible for this to be called when the regs have already been saved. If that would be the case - that is, if SAVE_REST was saving extra copy of registers on stack, then FIXUP_TOP_OF_STACK %r11, 8 would be working on wrong locations. The 8 there says we have full pt_regs on stack, plus extra 8 bytes (the return address). Your conjecture would mean that in fact there would be more bytes on stack, and FIXUP_TOP_OF_STACK would corrupt iret stack. Evidently, since old code was not crashing, this wasn't happening. SAVE_REST was really creating the tail of pt_regs. In addition to my previous tests, I ran my home machine with patched kernel. Unfortunately, it works for me :( Will try on yet another machine. -- vda -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete struct pt_regs
On Thu, Feb 26, 2015 at 10:55 AM, Denys Vlasenko vda.li...@googlemail.com wrote: On Wed, Feb 25, 2015 at 10:59 PM, Andy Lutomirski l...@amacapital.net wrote: In addition to my previous tests, I ran my home machine with patched kernel. Unfortunately, it works for me :( Will try on yet another machine. And voila, it does happen on another machine :) I'm debugging it right now. Looks like 64-bit syscalls just stop working at some point in new processes. That is, existing process is alive and well, but children get SEGV after fork (most likely on any syscall64 they do, not after fork per se. They eventually manage to kill themselves - not trivial when exit syscall isn't working either - by tripping on HLT insn). 32-bit syscalls (int 80) continue to work. Fork, exec, whatever you want. I have static 32-bit busybox binary and everything works there. Also, any 64-bit process which was under strace continues to work correctly, including forks and execs. This points towards some bug on fast path sysret64 code. Looking for it. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete "struct pt_regs"
Hi all, On Wed, 25 Feb 2015 21:18:52 -0800 Andrew Morton wrote: > > On Thu, 26 Feb 2015 02:12:57 +0100 Denys Vlasenko > wrote: > > > On Thu, Feb 26, 2015 at 12:34 AM, Sabrina Dubroca > > wrote: > > > 2015-02-25, 23:40:55 +0100, Sabrina Dubroca wrote: > > >> I can run some userspace programs, but I have no idea what would be > > >> helpful. > > >> I can also try booting a real machine with archlinux/systemd tomorrow. > > > > > > I got a good boot out of kernels that normally fail. I booted > > > systemd's emergency shell and enabled a few services, in the same > > > order they normally start. journald started cleanly, but after that, > > > every single command produced a "traps:" output and an "audit:" line. > > > > > > I disabled systemd-journald (chmod -x, because `systemctl disable` > > > didn't really disable it), and now it boots, no "traps:" in the log. > > > If I run it, everything fails again (zsh has traps for simply pressing > > > enter on an empty cmd). > > > > That's some progress! > > > > It's strange how one process manages to affect everything else. > > > > "If I run it, everything fails again". How do you run it? Directly, > > or via systemd services mechanism? > > If you just run it directly, can you try running it under > > "strace -f -tt -oLOG"? Does it have the same effect? What's in the LOG? > > I'm hitting this bug as well, bisected to this commit. On an old > x64_64 box, no vms, paravirt, etc. Running FC6 userspace (heh). > > Quite late in initscripts, binaries start getting segmentation faults > and init gives up. Seems to only affect /usr/bin/rhgb-client. There's > one instance where /bin/rm is said to segfault, but I suspect that's > init lying to me. I note that that commit has been removed from today's version of the luto-misc tree and thus linux-next. -- Cheers, Stephen Rothwells...@canb.auug.org.au pgpu6vlymy8yz.pgp Description: OpenPGP digital signature
Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete "struct pt_regs"
On Thu, 26 Feb 2015 02:12:57 +0100 Denys Vlasenko wrote: > On Thu, Feb 26, 2015 at 12:34 AM, Sabrina Dubroca > wrote: > > 2015-02-25, 23:40:55 +0100, Sabrina Dubroca wrote: > >> I can run some userspace programs, but I have no idea what would be > >> helpful. > >> I can also try booting a real machine with archlinux/systemd tomorrow. > > > > I got a good boot out of kernels that normally fail. I booted > > systemd's emergency shell and enabled a few services, in the same > > order they normally start. journald started cleanly, but after that, > > every single command produced a "traps:" output and an "audit:" line. > > > > I disabled systemd-journald (chmod -x, because `systemctl disable` > > didn't really disable it), and now it boots, no "traps:" in the log. > > If I run it, everything fails again (zsh has traps for simply pressing > > enter on an empty cmd). > > That's some progress! > > It's strange how one process manages to affect everything else. > > "If I run it, everything fails again". How do you run it? Directly, > or via systemd services mechanism? > If you just run it directly, can you try running it under > "strace -f -tt -oLOG"? Does it have the same effect? What's in the LOG? I'm hitting this bug as well, bisected to this commit. On an old x64_64 box, no vms, paravirt, etc. Running FC6 userspace (heh). Quite late in initscripts, binaries start getting segmentation faults and init gives up. Seems to only affect /usr/bin/rhgb-client. There's one instance where /bin/rm is said to segfault, but I suspect that's init lying to me. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete "struct pt_regs"
On Thu, Feb 26, 2015 at 12:34 AM, Sabrina Dubroca wrote: > 2015-02-25, 23:40:55 +0100, Sabrina Dubroca wrote: >> I can run some userspace programs, but I have no idea what would be >> helpful. >> I can also try booting a real machine with archlinux/systemd tomorrow. > > I got a good boot out of kernels that normally fail. I booted > systemd's emergency shell and enabled a few services, in the same > order they normally start. journald started cleanly, but after that, > every single command produced a "traps:" output and an "audit:" line. > > I disabled systemd-journald (chmod -x, because `systemctl disable` > didn't really disable it), and now it boots, no "traps:" in the log. > If I run it, everything fails again (zsh has traps for simply pressing > enter on an empty cmd). That's some progress! It's strange how one process manages to affect everything else. "If I run it, everything fails again". How do you run it? Directly, or via systemd services mechanism? If you just run it directly, can you try running it under "strace -f -tt -oLOG"? Does it have the same effect? What's in the LOG? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete "struct pt_regs"
2015-02-25, 23:40:55 +0100, Sabrina Dubroca wrote: > I can run some userspace programs, but I have no idea what would be > helpful. > I can also try booting a real machine with archlinux/systemd tomorrow. I got a good boot out of kernels that normally fail. I booted systemd's emergency shell and enabled a few services, in the same order they normally start. journald started cleanly, but after that, every single command produced a "traps:" output and an "audit:" line. I disabled systemd-journald (chmod -x, because `systemctl disable` didn't really disable it), and now it boots, no "traps:" in the log. If I run it, everything fails again (zsh has traps for simply pressing enter on an empty cmd). -- Sabrina -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete "struct pt_regs"
2015-02-25, 13:59:06 -0800, Andy Lutomirski wrote: > On Wed, Feb 25, 2015 at 1:28 PM, Denys Vlasenko wrote: > > On 02/25/2015 09:10 PM, Andy Lutomirski wrote: > >> On Wed, Feb 25, 2015 at 11:59 AM, Andrey Wagin wrote: > >>> 2015-02-25 21:42 GMT+03:00 Denys Vlasenko : > On 02/25/2015 01:37 PM, Andrey Wagin wrote: > > 2015-02-13 0:54 GMT+03:00 Denys Vlasenko : > >> 64-bit code was using six stack slots less by not saving/restoring > >> registers which are callee-preserved according to C ABI, > >> and not allocating space for them. > >> Only when syscall needed a complete "struct pt_regs", > >> the complete area was allocated and filled in. > >> As an additional twist, on interrupt entry a "slightly less truncated > >> pt_regs" > >> trick is used, to make nested interrupt stacks easier to unwind. > >> > >> This proved to be a source of significant obfuscation and subtle bugs. > >> For example, stub_fork had to pop the return address, > >> extend the struct, save registers, and push return address back. Ugly. > >> ia32_ptregs_common pops return address and "returns" via jmp insn, > >> throwing a wrench into CPU return stack cache. > >> > >> This patch changes code to always allocate a complete "struct pt_regs". > >> The saving of registers is still done lazily. > >> > >> "Partial pt_regs" trick on interrupt stack is retained. > >> > >> Macros which manipulate "struct pt_regs" on stack are reworked: > >> ALLOC_PT_GPREGS_ON_STACK allocates the structure. > >> SAVE_C_REGS saves to it those registers which are clobbered by C code. > >> SAVE_EXTRA_REGS saves to it all other registers. > >> Corresponding RESTORE_* and REMOVE_PT_GPREGS_FROM_STACK macros reverse > >> it. > >> > >> ia32_ptregs_common, stub_fork and friends lost their ugly dance with > >> return pointer. > >> > >> LOAD_ARGS32 in ia32entry.S now uses symbolic stack offsets > >> instead of magic numbers. > >> > >> error_entry and save_paranoid now use SAVE_C_REGS + SAVE_EXTRA_REGS > >> instead of having it open-coded yet again. > >> > >> Patch was run-tested: 64-bit executables, 32-bit executables, > >> strace works. > >> Timing tests did not show measurable difference in 32-bit > >> and 64-bit syscalls. > > > > Hello Denys, > > > > My test vm doesn't boot with this patch. Could you help to investigate > > this issue? > > I think I found it. This part of my patch is possibly wrong: > > @@ -171,9 +171,9 @@ static inline int arch_irqs_disabled(void) > #define ARCH_LOCKDEP_SYS_EXIT_IRQ \ > TRACE_IRQS_ON; \ > sti; \ > - SAVE_REST; \ > + SAVE_EXTRA_REGS; \ > LOCKDEP_SYS_EXIT; \ > - RESTORE_REST; \ > + RESTORE_EXTRA_REGS; \ > cli; \ > TRACE_IRQS_OFF; > > The "SAVE_REST" here is intended to really *push* extra regs on stack, > but the patch changed it so that they are written to existing stack > slots above. > > From code inspection it should work in almost all cases, but some > locations where it is used are really obscure. > > If there are places where *pushing* regs is really necessary, > this can corrupt rbp,rbx,r12-15 registers. > > Your config has CONFIG_LOCKDEP=y, I think it's worth trying whether the > bug > was here. > Please find updated patch attached. Can you try it? > >>> > >>> It doesn't work > > > > Thanks for testing it anyway. > > > > > >>> [3.016262] traps: systemd-cgroups[390] general protection > >>> ip:7f456f7b6028 sp:7fffdc059718 error:0 in > >>> ld-2.18.so[7f456f79e000+2] > > > > This is what I know about these crashes. The SEGV itself is caused by > > HLT instruction executed by dynamic loader, ld-2.NN.so. > > The instruction is in _exit function, and is only reachable if > > exit_group and exit syscalls fail to terminate the process. > > So it seems that syscall execution is getting badly broken somehow > > at some point. > > > > This happens to both reporters. > > > > My theory that it is related to lockdep seems to be wrong, because > > Sabrina's kernel is not lockdep-enabled, yet it sees the same failure. > > > > Both kernels are paravirtualized, both are booted under KVM, > > Andrey runs it with four virtual CPUs, Sabrina runs with two. > > > > My next theory is that I missed something related to paravirt. > > I am looking at that code, so far I don't see anything suspicious. > > > > Unfortunately, it doesn't happen to me: I have Sabrina's bzImage, > > I run it under "qemu-system-x86_64 -enable-kvm -smp 2", > > I see in dmesg that kernel does detect that it is being run under KVM, > > but it works for me. No mysterious segfaults. > > > > Andrey, can you send me your bzImage? Maybe it will trigger > > the problem for me. > >
Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete "struct pt_regs"
On Wed, Feb 25, 2015 at 1:28 PM, Denys Vlasenko wrote: > On 02/25/2015 09:10 PM, Andy Lutomirski wrote: >> On Wed, Feb 25, 2015 at 11:59 AM, Andrey Wagin wrote: >>> 2015-02-25 21:42 GMT+03:00 Denys Vlasenko : On 02/25/2015 01:37 PM, Andrey Wagin wrote: > 2015-02-13 0:54 GMT+03:00 Denys Vlasenko : >> 64-bit code was using six stack slots less by not saving/restoring >> registers which are callee-preserved according to C ABI, >> and not allocating space for them. >> Only when syscall needed a complete "struct pt_regs", >> the complete area was allocated and filled in. >> As an additional twist, on interrupt entry a "slightly less truncated >> pt_regs" >> trick is used, to make nested interrupt stacks easier to unwind. >> >> This proved to be a source of significant obfuscation and subtle bugs. >> For example, stub_fork had to pop the return address, >> extend the struct, save registers, and push return address back. Ugly. >> ia32_ptregs_common pops return address and "returns" via jmp insn, >> throwing a wrench into CPU return stack cache. >> >> This patch changes code to always allocate a complete "struct pt_regs". >> The saving of registers is still done lazily. >> >> "Partial pt_regs" trick on interrupt stack is retained. >> >> Macros which manipulate "struct pt_regs" on stack are reworked: >> ALLOC_PT_GPREGS_ON_STACK allocates the structure. >> SAVE_C_REGS saves to it those registers which are clobbered by C code. >> SAVE_EXTRA_REGS saves to it all other registers. >> Corresponding RESTORE_* and REMOVE_PT_GPREGS_FROM_STACK macros reverse >> it. >> >> ia32_ptregs_common, stub_fork and friends lost their ugly dance with >> return pointer. >> >> LOAD_ARGS32 in ia32entry.S now uses symbolic stack offsets >> instead of magic numbers. >> >> error_entry and save_paranoid now use SAVE_C_REGS + SAVE_EXTRA_REGS >> instead of having it open-coded yet again. >> >> Patch was run-tested: 64-bit executables, 32-bit executables, >> strace works. >> Timing tests did not show measurable difference in 32-bit >> and 64-bit syscalls. > > Hello Denys, > > My test vm doesn't boot with this patch. Could you help to investigate > this issue? I think I found it. This part of my patch is possibly wrong: @@ -171,9 +171,9 @@ static inline int arch_irqs_disabled(void) #define ARCH_LOCKDEP_SYS_EXIT_IRQ \ TRACE_IRQS_ON; \ sti; \ - SAVE_REST; \ + SAVE_EXTRA_REGS; \ LOCKDEP_SYS_EXIT; \ - RESTORE_REST; \ + RESTORE_EXTRA_REGS; \ cli; \ TRACE_IRQS_OFF; The "SAVE_REST" here is intended to really *push* extra regs on stack, but the patch changed it so that they are written to existing stack slots above. From code inspection it should work in almost all cases, but some locations where it is used are really obscure. If there are places where *pushing* regs is really necessary, this can corrupt rbp,rbx,r12-15 registers. Your config has CONFIG_LOCKDEP=y, I think it's worth trying whether the bug was here. Please find updated patch attached. Can you try it? >>> >>> It doesn't work > > Thanks for testing it anyway. > > >>> [3.016262] traps: systemd-cgroups[390] general protection >>> ip:7f456f7b6028 sp:7fffdc059718 error:0 in >>> ld-2.18.so[7f456f79e000+2] > > This is what I know about these crashes. The SEGV itself is caused by > HLT instruction executed by dynamic loader, ld-2.NN.so. > The instruction is in _exit function, and is only reachable if > exit_group and exit syscalls fail to terminate the process. > So it seems that syscall execution is getting badly broken somehow > at some point. > > This happens to both reporters. > > My theory that it is related to lockdep seems to be wrong, because > Sabrina's kernel is not lockdep-enabled, yet it sees the same failure. > > Both kernels are paravirtualized, both are booted under KVM, > Andrey runs it with four virtual CPUs, Sabrina runs with two. > > My next theory is that I missed something related to paravirt. > I am looking at that code, so far I don't see anything suspicious. > > Unfortunately, it doesn't happen to me: I have Sabrina's bzImage, > I run it under "qemu-system-x86_64 -enable-kvm -smp 2", > I see in dmesg that kernel does detect that it is being run under KVM, > but it works for me. No mysterious segfaults. > > Andrey, can you send me your bzImage? Maybe it will trigger > the problem for me. > > >> The change to stub_\func looks wrong to me. It saves and restores >> regs, but those regs might already have been saved if we're on the >> slow path. (Yes, all that code is quite buggy even without all these >> patches.) So is execve. >> >> This means that, for example, execve
Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete "struct pt_regs"
On 02/25/2015 09:10 PM, Andy Lutomirski wrote: > On Wed, Feb 25, 2015 at 11:59 AM, Andrey Wagin wrote: >> 2015-02-25 21:42 GMT+03:00 Denys Vlasenko : >>> On 02/25/2015 01:37 PM, Andrey Wagin wrote: 2015-02-13 0:54 GMT+03:00 Denys Vlasenko : > 64-bit code was using six stack slots less by not saving/restoring > registers which are callee-preserved according to C ABI, > and not allocating space for them. > Only when syscall needed a complete "struct pt_regs", > the complete area was allocated and filled in. > As an additional twist, on interrupt entry a "slightly less truncated > pt_regs" > trick is used, to make nested interrupt stacks easier to unwind. > > This proved to be a source of significant obfuscation and subtle bugs. > For example, stub_fork had to pop the return address, > extend the struct, save registers, and push return address back. Ugly. > ia32_ptregs_common pops return address and "returns" via jmp insn, > throwing a wrench into CPU return stack cache. > > This patch changes code to always allocate a complete "struct pt_regs". > The saving of registers is still done lazily. > > "Partial pt_regs" trick on interrupt stack is retained. > > Macros which manipulate "struct pt_regs" on stack are reworked: > ALLOC_PT_GPREGS_ON_STACK allocates the structure. > SAVE_C_REGS saves to it those registers which are clobbered by C code. > SAVE_EXTRA_REGS saves to it all other registers. > Corresponding RESTORE_* and REMOVE_PT_GPREGS_FROM_STACK macros reverse it. > > ia32_ptregs_common, stub_fork and friends lost their ugly dance with > return pointer. > > LOAD_ARGS32 in ia32entry.S now uses symbolic stack offsets > instead of magic numbers. > > error_entry and save_paranoid now use SAVE_C_REGS + SAVE_EXTRA_REGS > instead of having it open-coded yet again. > > Patch was run-tested: 64-bit executables, 32-bit executables, > strace works. > Timing tests did not show measurable difference in 32-bit > and 64-bit syscalls. Hello Denys, My test vm doesn't boot with this patch. Could you help to investigate this issue? >>> >>> I think I found it. This part of my patch is possibly wrong: >>> >>> @@ -171,9 +171,9 @@ static inline int arch_irqs_disabled(void) >>> #define ARCH_LOCKDEP_SYS_EXIT_IRQ \ >>> TRACE_IRQS_ON; \ >>> sti; \ >>> - SAVE_REST; \ >>> + SAVE_EXTRA_REGS; \ >>> LOCKDEP_SYS_EXIT; \ >>> - RESTORE_REST; \ >>> + RESTORE_EXTRA_REGS; \ >>> cli; \ >>> TRACE_IRQS_OFF; >>> >>> The "SAVE_REST" here is intended to really *push* extra regs on stack, >>> but the patch changed it so that they are written to existing stack >>> slots above. >>> >>> From code inspection it should work in almost all cases, but some >>> locations where it is used are really obscure. >>> >>> If there are places where *pushing* regs is really necessary, >>> this can corrupt rbp,rbx,r12-15 registers. >>> >>> Your config has CONFIG_LOCKDEP=y, I think it's worth trying whether the bug >>> was here. >>> Please find updated patch attached. Can you try it? >> >> It doesn't work Thanks for testing it anyway. >> [3.016262] traps: systemd-cgroups[390] general protection >> ip:7f456f7b6028 sp:7fffdc059718 error:0 in >> ld-2.18.so[7f456f79e000+2] This is what I know about these crashes. The SEGV itself is caused by HLT instruction executed by dynamic loader, ld-2.NN.so. The instruction is in _exit function, and is only reachable if exit_group and exit syscalls fail to terminate the process. So it seems that syscall execution is getting badly broken somehow at some point. This happens to both reporters. My theory that it is related to lockdep seems to be wrong, because Sabrina's kernel is not lockdep-enabled, yet it sees the same failure. Both kernels are paravirtualized, both are booted under KVM, Andrey runs it with four virtual CPUs, Sabrina runs with two. My next theory is that I missed something related to paravirt. I am looking at that code, so far I don't see anything suspicious. Unfortunately, it doesn't happen to me: I have Sabrina's bzImage, I run it under "qemu-system-x86_64 -enable-kvm -smp 2", I see in dmesg that kernel does detect that it is being run under KVM, but it works for me. No mysterious segfaults. Andrey, can you send me your bzImage? Maybe it will trigger the problem for me. > The change to stub_\func looks wrong to me. It saves and restores > regs, but those regs might already have been saved if we're on the > slow path. (Yes, all that code is quite buggy even without all these > patches.) So is execve. > > This means that, for example, execve called in the slow path will > save/restore regs twice. If the values in the regs after the first > save and before the second save are different, then we corrupt user > state. This part? .macro
Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete "struct pt_regs"
On Wed, Feb 25, 2015 at 11:59 AM, Andrey Wagin wrote: > 2015-02-25 21:42 GMT+03:00 Denys Vlasenko : >> On 02/25/2015 01:37 PM, Andrey Wagin wrote: >>> 2015-02-13 0:54 GMT+03:00 Denys Vlasenko : 64-bit code was using six stack slots less by not saving/restoring registers which are callee-preserved according to C ABI, and not allocating space for them. Only when syscall needed a complete "struct pt_regs", the complete area was allocated and filled in. As an additional twist, on interrupt entry a "slightly less truncated pt_regs" trick is used, to make nested interrupt stacks easier to unwind. This proved to be a source of significant obfuscation and subtle bugs. For example, stub_fork had to pop the return address, extend the struct, save registers, and push return address back. Ugly. ia32_ptregs_common pops return address and "returns" via jmp insn, throwing a wrench into CPU return stack cache. This patch changes code to always allocate a complete "struct pt_regs". The saving of registers is still done lazily. "Partial pt_regs" trick on interrupt stack is retained. Macros which manipulate "struct pt_regs" on stack are reworked: ALLOC_PT_GPREGS_ON_STACK allocates the structure. SAVE_C_REGS saves to it those registers which are clobbered by C code. SAVE_EXTRA_REGS saves to it all other registers. Corresponding RESTORE_* and REMOVE_PT_GPREGS_FROM_STACK macros reverse it. ia32_ptregs_common, stub_fork and friends lost their ugly dance with return pointer. LOAD_ARGS32 in ia32entry.S now uses symbolic stack offsets instead of magic numbers. error_entry and save_paranoid now use SAVE_C_REGS + SAVE_EXTRA_REGS instead of having it open-coded yet again. Patch was run-tested: 64-bit executables, 32-bit executables, strace works. Timing tests did not show measurable difference in 32-bit and 64-bit syscalls. >>> >>> Hello Denys, >>> >>> My test vm doesn't boot with this patch. Could you help to investigate >>> this issue? >> >> I think I found it. This part of my patch is possibly wrong: >> >> @@ -171,9 +171,9 @@ static inline int arch_irqs_disabled(void) >> #define ARCH_LOCKDEP_SYS_EXIT_IRQ \ >> TRACE_IRQS_ON; \ >> sti; \ >> - SAVE_REST; \ >> + SAVE_EXTRA_REGS; \ >> LOCKDEP_SYS_EXIT; \ >> - RESTORE_REST; \ >> + RESTORE_EXTRA_REGS; \ >> cli; \ >> TRACE_IRQS_OFF; >> >> The "SAVE_REST" here is intended to really *push* extra regs on stack, >> but the patch changed it so that they are written to existing stack >> slots above. >> >> From code inspection it should work in almost all cases, but some >> locations where it is used are really obscure. >> >> If there are places where *pushing* regs is really necessary, >> this can corrupt rbp,rbx,r12-15 registers. >> >> Your config has CONFIG_LOCKDEP=y, I think it's worth trying whether the bug >> was here. >> Please find updated patch attached. Can you try it? > > It doesn't work > > [2.282198] microcode: CPU0 sig=0x623, pf=0x0, revision=0x1 > [2.288321] microcode: CPU1 sig=0x623, pf=0x0, revision=0x1 > [2.289139] microcode: CPU2 sig=0x623, pf=0x0, revision=0x1 > [2.290618] microcode: CPU3 sig=0x623, pf=0x0, revision=0x1 > [2.292417] microcode: Microcode Update Driver: v2.00 > , Peter Oruba > [2.392592] piix4_smbus :00:01.3: SMBus Host Controller at > 0xb100, revision 0 > [2.510882] systemd-fsck[349]: /dev/sda1: clean, 343/128016 files, > 166911/512000 blocks > [2.521463] Adding 4128764k swap on /dev/sda2. Priority:-1 > extents:1 across:4128764k FS > [2.573597] EXT4-fs (sda1): mounted filesystem with ordered data > mode. Opts: (null) > [2.597771] systemd-journald[283]: Received request to flush > runtime journal from PID 1 > [2.802288] audit: type=1305 audit(1424892361.629:3): audit_pid=369 > old=0 auid=4294967295 ses=4294967295 res=1 > [2.819660] traps: systemd-cgroups[380] general protection > ip:7fb227939028 sp:7fff6bac59c8 error:0 in > ld-2.18.so[7fb227921000+2] > [3.016262] traps: systemd-cgroups[390] general protection > ip:7f456f7b6028 sp:7fffdc059718 error:0 in > ld-2.18.so[7f456f79e000+2] The change to stub_\func looks wrong to me. It saves and restores regs, but those regs might already have been saved if we're on the slow path. (Yes, all that code is quite buggy even without all these patches.) So is execve. This means that, for example, execve called in the slow path will save/restore regs twice. If the values in the regs after the first save and before the second save are different, then we corrupt user state. I think that the changes to all the stub-like things should be reverted. --Andy > >> >> -- >> vda >> -- Andy Lutomirski AMA Capital Management, LLC -- To unsubscribe from this list: send the line "unsubscribe
Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete "struct pt_regs"
2015-02-25 21:42 GMT+03:00 Denys Vlasenko : > On 02/25/2015 01:37 PM, Andrey Wagin wrote: >> 2015-02-13 0:54 GMT+03:00 Denys Vlasenko : >>> 64-bit code was using six stack slots less by not saving/restoring >>> registers which are callee-preserved according to C ABI, >>> and not allocating space for them. >>> Only when syscall needed a complete "struct pt_regs", >>> the complete area was allocated and filled in. >>> As an additional twist, on interrupt entry a "slightly less truncated >>> pt_regs" >>> trick is used, to make nested interrupt stacks easier to unwind. >>> >>> This proved to be a source of significant obfuscation and subtle bugs. >>> For example, stub_fork had to pop the return address, >>> extend the struct, save registers, and push return address back. Ugly. >>> ia32_ptregs_common pops return address and "returns" via jmp insn, >>> throwing a wrench into CPU return stack cache. >>> >>> This patch changes code to always allocate a complete "struct pt_regs". >>> The saving of registers is still done lazily. >>> >>> "Partial pt_regs" trick on interrupt stack is retained. >>> >>> Macros which manipulate "struct pt_regs" on stack are reworked: >>> ALLOC_PT_GPREGS_ON_STACK allocates the structure. >>> SAVE_C_REGS saves to it those registers which are clobbered by C code. >>> SAVE_EXTRA_REGS saves to it all other registers. >>> Corresponding RESTORE_* and REMOVE_PT_GPREGS_FROM_STACK macros reverse it. >>> >>> ia32_ptregs_common, stub_fork and friends lost their ugly dance with >>> return pointer. >>> >>> LOAD_ARGS32 in ia32entry.S now uses symbolic stack offsets >>> instead of magic numbers. >>> >>> error_entry and save_paranoid now use SAVE_C_REGS + SAVE_EXTRA_REGS >>> instead of having it open-coded yet again. >>> >>> Patch was run-tested: 64-bit executables, 32-bit executables, >>> strace works. >>> Timing tests did not show measurable difference in 32-bit >>> and 64-bit syscalls. >> >> Hello Denys, >> >> My test vm doesn't boot with this patch. Could you help to investigate >> this issue? > > I think I found it. This part of my patch is possibly wrong: > > @@ -171,9 +171,9 @@ static inline int arch_irqs_disabled(void) > #define ARCH_LOCKDEP_SYS_EXIT_IRQ \ > TRACE_IRQS_ON; \ > sti; \ > - SAVE_REST; \ > + SAVE_EXTRA_REGS; \ > LOCKDEP_SYS_EXIT; \ > - RESTORE_REST; \ > + RESTORE_EXTRA_REGS; \ > cli; \ > TRACE_IRQS_OFF; > > The "SAVE_REST" here is intended to really *push* extra regs on stack, > but the patch changed it so that they are written to existing stack > slots above. > > From code inspection it should work in almost all cases, but some > locations where it is used are really obscure. > > If there are places where *pushing* regs is really necessary, > this can corrupt rbp,rbx,r12-15 registers. > > Your config has CONFIG_LOCKDEP=y, I think it's worth trying whether the bug > was here. > Please find updated patch attached. Can you try it? It doesn't work [2.282198] microcode: CPU0 sig=0x623, pf=0x0, revision=0x1 [2.288321] microcode: CPU1 sig=0x623, pf=0x0, revision=0x1 [2.289139] microcode: CPU2 sig=0x623, pf=0x0, revision=0x1 [2.290618] microcode: CPU3 sig=0x623, pf=0x0, revision=0x1 [2.292417] microcode: Microcode Update Driver: v2.00 , Peter Oruba [2.392592] piix4_smbus :00:01.3: SMBus Host Controller at 0xb100, revision 0 [2.510882] systemd-fsck[349]: /dev/sda1: clean, 343/128016 files, 166911/512000 blocks [2.521463] Adding 4128764k swap on /dev/sda2. Priority:-1 extents:1 across:4128764k FS [2.573597] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null) [2.597771] systemd-journald[283]: Received request to flush runtime journal from PID 1 [2.802288] audit: type=1305 audit(1424892361.629:3): audit_pid=369 old=0 auid=4294967295 ses=4294967295 res=1 [2.819660] traps: systemd-cgroups[380] general protection ip:7fb227939028 sp:7fff6bac59c8 error:0 in ld-2.18.so[7fb227921000+2] [3.016262] traps: systemd-cgroups[390] general protection ip:7f456f7b6028 sp:7fffdc059718 error:0 in ld-2.18.so[7f456f79e000+2] > > -- > vda > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete "struct pt_regs"
On 02/25/2015 01:37 PM, Andrey Wagin wrote: > 2015-02-13 0:54 GMT+03:00 Denys Vlasenko : >> 64-bit code was using six stack slots less by not saving/restoring >> registers which are callee-preserved according to C ABI, >> and not allocating space for them. >> Only when syscall needed a complete "struct pt_regs", >> the complete area was allocated and filled in. >> As an additional twist, on interrupt entry a "slightly less truncated >> pt_regs" >> trick is used, to make nested interrupt stacks easier to unwind. >> >> This proved to be a source of significant obfuscation and subtle bugs. >> For example, stub_fork had to pop the return address, >> extend the struct, save registers, and push return address back. Ugly. >> ia32_ptregs_common pops return address and "returns" via jmp insn, >> throwing a wrench into CPU return stack cache. >> >> This patch changes code to always allocate a complete "struct pt_regs". >> The saving of registers is still done lazily. >> >> "Partial pt_regs" trick on interrupt stack is retained. >> >> Macros which manipulate "struct pt_regs" on stack are reworked: >> ALLOC_PT_GPREGS_ON_STACK allocates the structure. >> SAVE_C_REGS saves to it those registers which are clobbered by C code. >> SAVE_EXTRA_REGS saves to it all other registers. >> Corresponding RESTORE_* and REMOVE_PT_GPREGS_FROM_STACK macros reverse it. >> >> ia32_ptregs_common, stub_fork and friends lost their ugly dance with >> return pointer. >> >> LOAD_ARGS32 in ia32entry.S now uses symbolic stack offsets >> instead of magic numbers. >> >> error_entry and save_paranoid now use SAVE_C_REGS + SAVE_EXTRA_REGS >> instead of having it open-coded yet again. >> >> Patch was run-tested: 64-bit executables, 32-bit executables, >> strace works. >> Timing tests did not show measurable difference in 32-bit >> and 64-bit syscalls. > > Hello Denys, > > My test vm doesn't boot with this patch. Could you help to investigate > this issue? I think I found it. This part of my patch is possibly wrong: @@ -171,9 +171,9 @@ static inline int arch_irqs_disabled(void) #define ARCH_LOCKDEP_SYS_EXIT_IRQ \ TRACE_IRQS_ON; \ sti; \ - SAVE_REST; \ + SAVE_EXTRA_REGS; \ LOCKDEP_SYS_EXIT; \ - RESTORE_REST; \ + RESTORE_EXTRA_REGS; \ cli; \ TRACE_IRQS_OFF; The "SAVE_REST" here is intended to really *push* extra regs on stack, but the patch changed it so that they are written to existing stack slots above. >From code inspection it should work in almost all cases, but some locations where it is used are really obscure. If there are places where *pushing* regs is really necessary, this can corrupt rbp,rbx,r12-15 registers. Your config has CONFIG_LOCKDEP=y, I think it's worth trying whether the bug was here. Please find updated patch attached. Can you try it? -- vda >From 80c82370dcb8cd85c3b981ce62f63f74eff0df2c Mon Sep 17 00:00:00 2001 From: Denys Vlasenko Date: Wed, 25 Feb 2015 19:32:14 +0100 Subject: [PATCH v2] x86: entry_64.S: always allocate complete "struct pt_regs" 64-bit code was using six stack slots less by not saving/restoring registers which are callee-preserved according to C ABI, and not allocating space for them. Only when syscall needed a complete "struct pt_regs", the complete area was allocated and filled in. As an additional twist, on interrupt entry a "slightly less truncated pt_regs" trick is used, to make nested interrupt stacks easier to unwind. This proved to be a source of significant obfuscation and subtle bugs. For example, stub_fork had to pop the return address, extend the struct, save registers, and push return address back. Ugly. ia32_ptregs_common pops return address and "returns" via jmp insn, throwing a wrench into CPU return stack cache. This patch changes code to always allocate a complete "struct pt_regs". The saving of registers is still done lazily. "Partial pt_regs" trick on interrupt stack is retained. Macros which manipulate "struct pt_regs" on stack are reworked: ALLOC_PT_GPREGS_ON_STACK allocates the structure. SAVE_C_REGS saves to it those registers which are clobbered by C code. SAVE_EXTRA_REGS saves to it all other registers. Corresponding RESTORE_* and REMOVE_PT_GPREGS_FROM_STACK macros reverse it. ia32_ptregs_common, stub_fork and friends lost their ugly dance with return pointer. LOAD_ARGS32 in ia32entry.S now uses symbolic stack offsets instead of magic numbers. error_entry and save_paranoid now use SAVE_C_REGS + SAVE_EXTRA_REGS instead of having it open-coded yet again. Patch was run-tested: 64-bit executables, 32-bit executables, strace works. Timing tests did not show measurable difference in 32-bit and 64-bit syscalls. Signed-off-by: Denys Vlasenko CC: Linus Torvalds CC: Oleg Nesterov CC: Borislav Petkov CC: "H. Peter Anvin" CC: Andy Lutomirski CC: Frederic Weisbecker CC: X86 ML CC: Alexei Starovoitov CC: Will Drewry CC: Kees Cook CC: linux-kernel@vger.kernel.org --- Changes since v1: fix a thinko
Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete "struct pt_regs"
On 02/25/2015 01:37 PM, Andrey Wagin wrote: > Hello Denys, > > My test vm doesn't boot with this patch. Could you help to investigate > this issue? > > I have attached a kernel config and console log. > > [2.508252] traps: systemd-cgroups[380] general protection ip:7f68ad096028 > sp:7fffba298af8 error:0 in ld-2.18.so[7f68ad07e000+2][ OK > [2.600179] traps: systemd-cgroups[384] general protection ip:7f11b9a9c028 > sp:7fff4420f978 error:0 in ld-2.18.so[7f11b9a84000+2] > [2.743790] traps: systemd-cgroups[392] general protection ip:7f7f40a44028 > sp:7fffe1c1b8b8 error:0 in ld-2.18.so[7f7f40a2c000+2] > [2.754576] traps: systemd-cgroups[393] general protection ip:7fd1314bd028 > sp:776ecc88 error:0 in ld-2.18.so[7fd1314a5000+2] > [2.765343] traps: systemd-cgroups[396] general protection ip:7ff4537b7028 > sp:7fff05902378 error:0 in ld-2.18.so[7ff45379f000+2] > [2.798782] traps: systemd-cgroups[399] general protection ip:7f4d5bc9c028 > sp:7fff35cb3a48 error:0 in ld-2.18.so[7f4d5bc84000+2] These SEGVs are always at the same location within ld-2.18.so: (ip - load_address) is always 0x18028. For Sabrine's case, SEGVing address in ld-2.21.so also the same all the time: 0x184f8. Please send me your ld-2.18.so / ld-2.21.so files. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete "struct pt_regs"
Hello, I'm seeing the same symptoms on next-2015022{4,5}, also with systemd in a VM: traps: fsck[99] general protection ip:7fccb2401270 sp:7fffea3b8938 error:0 in libc-2.21.so[7fccb2349000+199000] traps: systemd-cgroups[100] general protection ip:7fdd8ff784f8 sp:7ffcf6e27ad8 error:0 in ld-2.21.so[7fdd8ff6+22000] traps: systemd-cgroups[94] general protection ip:7f9f23bd24f8 sp:74fc5578 error:0 in ld-2.21.so[7f9f23bba000+22000] traps: systemd-cgroups[102] general protection ip:7f211e6574f8 sp:7ffdb8e0d538 error:0 in ld-2.21.so[7f211e63f000+22000] traps: systemd-cgroups[103] general protection ip:7f80627c34f8 sp:7ffc7fa4cff8 error:0 in ld-2.21.so[7f80627ab000+22000] 2015-02-25, 14:55:34 +0100, Denys Vlasenko wrote: > On 02/25/2015 01:37 PM, Andrey Wagin wrote: > > 2015-02-13 0:54 GMT+03:00 Denys Vlasenko : > > My test vm doesn't boot with this patch. Could you help to investigate > > this issue? > > Hi Andrey, thanks for testing! > > > I have attached a kernel config and console log. > > Looking at the logs, it seems that regular syscalls do work: > systemd managed to function for some time, even spawned > a few children. > > It might be that the bug is somewhere in signal delivery code. > This would explain why oops got delayed. It doesn't oops here, it just tries to load other bits of systemd and hangs. I've noticed that "ip:" - "the address after ld-2.21.so[" is always the same value, I don't know if that's expected or relevant. (full log below) > I am trying to reproduce it. My gcc seems to be a bit old - > it can't digest CONFIG_CC_STACKPROTECTOR_STRONG=y in your .config. > > I switched to using "only" CONFIG_CC_STACKPROTECTOR_REGULAR=y: > > CONFIG_CC_STACKPROTECTOR=y > # CONFIG_CC_STACKPROTECTOR_NONE is not set > CONFIG_CC_STACKPROTECTOR_REGULAR=y > # CONFIG_CC_STACKPROTECTOR_STRONG is not set > > and resulting kernel works for me. I don't have any STACKPROTECTOR in my config: # CONFIG_CC_STACKPROTECTOR is not set CONFIG_CC_STACKPROTECTOR_NONE=y # CONFIG_CC_STACKPROTECTOR_REGULAR is not set # CONFIG_CC_STACKPROTECTOR_STRONG is not set (full config after the log) I can start systemd's emergency shell (systemd.unit=emergency.target), if running test programs helps. Thanks, Sabrina [0.00] Initializing cgroup subsys cpuset [0.00] Initializing cgroup subsys cpu [0.00] Initializing cgroup subsys cpuacct [0.00] Linux version 4.0.0-rc1-next-20150225 (zappy@kria) (gcc version 4.9.2 20150204 (prerelease) (GCC) ) #636 SMP PREEMPT Wed Feb 25 13:45:23 CET 2015 [0.00] Command line: root=/dev/sda1 netconsole=@10.0.1.23/,@10.0.1.10/ console=ttyS0 [0.00] e820: BIOS-provided physical RAM map: [0.00] BIOS-e820: [mem 0x-0x0009fbff] usable [0.00] BIOS-e820: [mem 0x0009fc00-0x0009] reserved [0.00] BIOS-e820: [mem 0x000f-0x000f] reserved [0.00] BIOS-e820: [mem 0x0010-0x1ffd] usable [0.00] BIOS-e820: [mem 0x1ffe-0x1fff] reserved [0.00] BIOS-e820: [mem 0xfeffc000-0xfeff] reserved [0.00] BIOS-e820: [mem 0xfffc-0x] reserved [0.00] NX (Execute Disable) protection: active [0.00] SMBIOS 2.8 present. [0.00] Hypervisor detected: KVM [0.00] AGP: No AGP bridge found [0.00] e820: last_pfn = 0x1ffe0 max_arch_pfn = 0x4 [0.00] PAT configuration [0-7]: WB WC UC- UC WB WC UC- UC [0.00] found SMP MP-table at [mem 0x000f1010-0x000f101f] mapped at [880f1010] [0.00] Scanning 1 areas for low memory corruption [0.00] init_memory_mapping: [mem 0x-0x000f] [0.00] init_memory_mapping: [mem 0x1fc0-0x1fdf] [0.00] init_memory_mapping: [mem 0x0010-0x1fbf] [0.00] init_memory_mapping: [mem 0x1fe0-0x1ffd] [0.00] ACPI: Early table checksum verification disabled [0.00] ACPI: RSDP 0x000F0DD0 14 (v00 BOCHS ) [0.00] ACPI: RSDT 0x1FFE18BC 34 (v01 BOCHS BXPCRSDT 0001 BXPC 0001) [0.00] ACPI: FACP 0x1FFE0E48 74 (v01 BOCHS BXPCFACP 0001 BXPC 0001) [0.00] ACPI: DSDT 0x1FFE0040 000E08 (v01 BOCHS BXPCDSDT 0001 BXPC 0001) [0.00] ACPI: FACS 0x1FFE 40 [0.00] ACPI: SSDT 0x1FFE0EBC 000948 (v01 BOCHS BXPCSSDT 0001 BXPC 0001) [0.00] ACPI: APIC 0x1FFE1804 80 (v01 BOCHS BXPCAPIC 0001 BXPC 0001) [0.00] ACPI: HPET 0x1FFE1884 38 (v01 BOCHS BXPCHPET 0001 BXPC 0001) [0.00] kvm-clock: Using msrs 4b564d01 and 4b564d00 [0.00] kvm-clock: cpu 0, msr 0:1ffdf001, primary cpu clock [0.00] Zone ranges: [0.00] DMA [mem 0x1000-0x00ff] [0.00] DMA32
Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete "struct pt_regs"
On 02/25/2015 01:37 PM, Andrey Wagin wrote: > 2015-02-13 0:54 GMT+03:00 Denys Vlasenko : > My test vm doesn't boot with this patch. Could you help to investigate > this issue? Hi Andrey, thanks for testing! > I have attached a kernel config and console log. Looking at the logs, it seems that regular syscalls do work: systemd managed to function for some time, even spawned a few children. It might be that the bug is somewhere in signal delivery code. This would explain why oops got delayed. I am trying to reproduce it. My gcc seems to be a bit old - it can't digest CONFIG_CC_STACKPROTECTOR_STRONG=y in your .config. I switched to using "only" CONFIG_CC_STACKPROTECTOR_REGULAR=y: CONFIG_CC_STACKPROTECTOR=y # CONFIG_CC_STACKPROTECTOR_NONE is not set CONFIG_CC_STACKPROTECTOR_REGULAR=y # CONFIG_CC_STACKPROTECTOR_STRONG is not set and resulting kernel works for me. Can you send me your bzImage (off-list, of course)? I'll send you my qemu test setup, can you check whether crash is happening for you when you run your kernel in it? -- vda -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete struct pt_regs
On 02/25/2015 01:37 PM, Andrey Wagin wrote: 2015-02-13 0:54 GMT+03:00 Denys Vlasenko dvlas...@redhat.com: My test vm doesn't boot with this patch. Could you help to investigate this issue? Hi Andrey, thanks for testing! I have attached a kernel config and console log. Looking at the logs, it seems that regular syscalls do work: systemd managed to function for some time, even spawned a few children. It might be that the bug is somewhere in signal delivery code. This would explain why oops got delayed. I am trying to reproduce it. My gcc seems to be a bit old - it can't digest CONFIG_CC_STACKPROTECTOR_STRONG=y in your .config. I switched to using only CONFIG_CC_STACKPROTECTOR_REGULAR=y: CONFIG_CC_STACKPROTECTOR=y # CONFIG_CC_STACKPROTECTOR_NONE is not set CONFIG_CC_STACKPROTECTOR_REGULAR=y # CONFIG_CC_STACKPROTECTOR_STRONG is not set and resulting kernel works for me. Can you send me your bzImage (off-list, of course)? I'll send you my qemu test setup, can you check whether crash is happening for you when you run your kernel in it? -- vda -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete struct pt_regs
Hello, I'm seeing the same symptoms on next-2015022{4,5}, also with systemd in a VM: traps: fsck[99] general protection ip:7fccb2401270 sp:7fffea3b8938 error:0 in libc-2.21.so[7fccb2349000+199000] traps: systemd-cgroups[100] general protection ip:7fdd8ff784f8 sp:7ffcf6e27ad8 error:0 in ld-2.21.so[7fdd8ff6+22000] traps: systemd-cgroups[94] general protection ip:7f9f23bd24f8 sp:74fc5578 error:0 in ld-2.21.so[7f9f23bba000+22000] traps: systemd-cgroups[102] general protection ip:7f211e6574f8 sp:7ffdb8e0d538 error:0 in ld-2.21.so[7f211e63f000+22000] traps: systemd-cgroups[103] general protection ip:7f80627c34f8 sp:7ffc7fa4cff8 error:0 in ld-2.21.so[7f80627ab000+22000] 2015-02-25, 14:55:34 +0100, Denys Vlasenko wrote: On 02/25/2015 01:37 PM, Andrey Wagin wrote: 2015-02-13 0:54 GMT+03:00 Denys Vlasenko dvlas...@redhat.com: My test vm doesn't boot with this patch. Could you help to investigate this issue? Hi Andrey, thanks for testing! I have attached a kernel config and console log. Looking at the logs, it seems that regular syscalls do work: systemd managed to function for some time, even spawned a few children. It might be that the bug is somewhere in signal delivery code. This would explain why oops got delayed. It doesn't oops here, it just tries to load other bits of systemd and hangs. I've noticed that ip: - the address after ld-2.21.so[ is always the same value, I don't know if that's expected or relevant. (full log below) I am trying to reproduce it. My gcc seems to be a bit old - it can't digest CONFIG_CC_STACKPROTECTOR_STRONG=y in your .config. I switched to using only CONFIG_CC_STACKPROTECTOR_REGULAR=y: CONFIG_CC_STACKPROTECTOR=y # CONFIG_CC_STACKPROTECTOR_NONE is not set CONFIG_CC_STACKPROTECTOR_REGULAR=y # CONFIG_CC_STACKPROTECTOR_STRONG is not set and resulting kernel works for me. I don't have any STACKPROTECTOR in my config: # CONFIG_CC_STACKPROTECTOR is not set CONFIG_CC_STACKPROTECTOR_NONE=y # CONFIG_CC_STACKPROTECTOR_REGULAR is not set # CONFIG_CC_STACKPROTECTOR_STRONG is not set (full config after the log) I can start systemd's emergency shell (systemd.unit=emergency.target), if running test programs helps. Thanks, Sabrina [0.00] Initializing cgroup subsys cpuset [0.00] Initializing cgroup subsys cpu [0.00] Initializing cgroup subsys cpuacct [0.00] Linux version 4.0.0-rc1-next-20150225 (zappy@kria) (gcc version 4.9.2 20150204 (prerelease) (GCC) ) #636 SMP PREEMPT Wed Feb 25 13:45:23 CET 2015 [0.00] Command line: root=/dev/sda1 netconsole=@10.0.1.23/,@10.0.1.10/ console=ttyS0 [0.00] e820: BIOS-provided physical RAM map: [0.00] BIOS-e820: [mem 0x-0x0009fbff] usable [0.00] BIOS-e820: [mem 0x0009fc00-0x0009] reserved [0.00] BIOS-e820: [mem 0x000f-0x000f] reserved [0.00] BIOS-e820: [mem 0x0010-0x1ffd] usable [0.00] BIOS-e820: [mem 0x1ffe-0x1fff] reserved [0.00] BIOS-e820: [mem 0xfeffc000-0xfeff] reserved [0.00] BIOS-e820: [mem 0xfffc-0x] reserved [0.00] NX (Execute Disable) protection: active [0.00] SMBIOS 2.8 present. [0.00] Hypervisor detected: KVM [0.00] AGP: No AGP bridge found [0.00] e820: last_pfn = 0x1ffe0 max_arch_pfn = 0x4 [0.00] PAT configuration [0-7]: WB WC UC- UC WB WC UC- UC [0.00] found SMP MP-table at [mem 0x000f1010-0x000f101f] mapped at [880f1010] [0.00] Scanning 1 areas for low memory corruption [0.00] init_memory_mapping: [mem 0x-0x000f] [0.00] init_memory_mapping: [mem 0x1fc0-0x1fdf] [0.00] init_memory_mapping: [mem 0x0010-0x1fbf] [0.00] init_memory_mapping: [mem 0x1fe0-0x1ffd] [0.00] ACPI: Early table checksum verification disabled [0.00] ACPI: RSDP 0x000F0DD0 14 (v00 BOCHS ) [0.00] ACPI: RSDT 0x1FFE18BC 34 (v01 BOCHS BXPCRSDT 0001 BXPC 0001) [0.00] ACPI: FACP 0x1FFE0E48 74 (v01 BOCHS BXPCFACP 0001 BXPC 0001) [0.00] ACPI: DSDT 0x1FFE0040 000E08 (v01 BOCHS BXPCDSDT 0001 BXPC 0001) [0.00] ACPI: FACS 0x1FFE 40 [0.00] ACPI: SSDT 0x1FFE0EBC 000948 (v01 BOCHS BXPCSSDT 0001 BXPC 0001) [0.00] ACPI: APIC 0x1FFE1804 80 (v01 BOCHS BXPCAPIC 0001 BXPC 0001) [0.00] ACPI: HPET 0x1FFE1884 38 (v01 BOCHS BXPCHPET 0001 BXPC 0001) [0.00] kvm-clock: Using msrs 4b564d01 and 4b564d00 [0.00] kvm-clock: cpu 0, msr 0:1ffdf001, primary cpu clock [0.00] Zone ranges: [0.00] DMA [mem 0x1000-0x00ff] [0.00] DMA32[mem
Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete struct pt_regs
On 02/25/2015 01:37 PM, Andrey Wagin wrote: Hello Denys, My test vm doesn't boot with this patch. Could you help to investigate this issue? I have attached a kernel config and console log. [2.508252] traps: systemd-cgroups[380] general protection ip:7f68ad096028 sp:7fffba298af8 error:0 in ld-2.18.so[7f68ad07e000+2][ OK [2.600179] traps: systemd-cgroups[384] general protection ip:7f11b9a9c028 sp:7fff4420f978 error:0 in ld-2.18.so[7f11b9a84000+2] [2.743790] traps: systemd-cgroups[392] general protection ip:7f7f40a44028 sp:7fffe1c1b8b8 error:0 in ld-2.18.so[7f7f40a2c000+2] [2.754576] traps: systemd-cgroups[393] general protection ip:7fd1314bd028 sp:776ecc88 error:0 in ld-2.18.so[7fd1314a5000+2] [2.765343] traps: systemd-cgroups[396] general protection ip:7ff4537b7028 sp:7fff05902378 error:0 in ld-2.18.so[7ff45379f000+2] [2.798782] traps: systemd-cgroups[399] general protection ip:7f4d5bc9c028 sp:7fff35cb3a48 error:0 in ld-2.18.so[7f4d5bc84000+2] These SEGVs are always at the same location within ld-2.18.so: (ip - load_address) is always 0x18028. For Sabrine's case, SEGVing address in ld-2.21.so also the same all the time: 0x184f8. Please send me your ld-2.18.so / ld-2.21.so files. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete struct pt_regs
2015-02-25 21:42 GMT+03:00 Denys Vlasenko dvlas...@redhat.com: On 02/25/2015 01:37 PM, Andrey Wagin wrote: 2015-02-13 0:54 GMT+03:00 Denys Vlasenko dvlas...@redhat.com: 64-bit code was using six stack slots less by not saving/restoring registers which are callee-preserved according to C ABI, and not allocating space for them. Only when syscall needed a complete struct pt_regs, the complete area was allocated and filled in. As an additional twist, on interrupt entry a slightly less truncated pt_regs trick is used, to make nested interrupt stacks easier to unwind. This proved to be a source of significant obfuscation and subtle bugs. For example, stub_fork had to pop the return address, extend the struct, save registers, and push return address back. Ugly. ia32_ptregs_common pops return address and returns via jmp insn, throwing a wrench into CPU return stack cache. This patch changes code to always allocate a complete struct pt_regs. The saving of registers is still done lazily. Partial pt_regs trick on interrupt stack is retained. Macros which manipulate struct pt_regs on stack are reworked: ALLOC_PT_GPREGS_ON_STACK allocates the structure. SAVE_C_REGS saves to it those registers which are clobbered by C code. SAVE_EXTRA_REGS saves to it all other registers. Corresponding RESTORE_* and REMOVE_PT_GPREGS_FROM_STACK macros reverse it. ia32_ptregs_common, stub_fork and friends lost their ugly dance with return pointer. LOAD_ARGS32 in ia32entry.S now uses symbolic stack offsets instead of magic numbers. error_entry and save_paranoid now use SAVE_C_REGS + SAVE_EXTRA_REGS instead of having it open-coded yet again. Patch was run-tested: 64-bit executables, 32-bit executables, strace works. Timing tests did not show measurable difference in 32-bit and 64-bit syscalls. Hello Denys, My test vm doesn't boot with this patch. Could you help to investigate this issue? I think I found it. This part of my patch is possibly wrong: @@ -171,9 +171,9 @@ static inline int arch_irqs_disabled(void) #define ARCH_LOCKDEP_SYS_EXIT_IRQ \ TRACE_IRQS_ON; \ sti; \ - SAVE_REST; \ + SAVE_EXTRA_REGS; \ LOCKDEP_SYS_EXIT; \ - RESTORE_REST; \ + RESTORE_EXTRA_REGS; \ cli; \ TRACE_IRQS_OFF; The SAVE_REST here is intended to really *push* extra regs on stack, but the patch changed it so that they are written to existing stack slots above. From code inspection it should work in almost all cases, but some locations where it is used are really obscure. If there are places where *pushing* regs is really necessary, this can corrupt rbp,rbx,r12-15 registers. Your config has CONFIG_LOCKDEP=y, I think it's worth trying whether the bug was here. Please find updated patch attached. Can you try it? It doesn't work [2.282198] microcode: CPU0 sig=0x623, pf=0x0, revision=0x1 [2.288321] microcode: CPU1 sig=0x623, pf=0x0, revision=0x1 [2.289139] microcode: CPU2 sig=0x623, pf=0x0, revision=0x1 [2.290618] microcode: CPU3 sig=0x623, pf=0x0, revision=0x1 [2.292417] microcode: Microcode Update Driver: v2.00 tig...@aivazian.fsnet.co.uk, Peter Oruba [2.392592] piix4_smbus :00:01.3: SMBus Host Controller at 0xb100, revision 0 [2.510882] systemd-fsck[349]: /dev/sda1: clean, 343/128016 files, 166911/512000 blocks [2.521463] Adding 4128764k swap on /dev/sda2. Priority:-1 extents:1 across:4128764k FS [2.573597] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null) [2.597771] systemd-journald[283]: Received request to flush runtime journal from PID 1 [2.802288] audit: type=1305 audit(1424892361.629:3): audit_pid=369 old=0 auid=4294967295 ses=4294967295 res=1 [2.819660] traps: systemd-cgroups[380] general protection ip:7fb227939028 sp:7fff6bac59c8 error:0 in ld-2.18.so[7fb227921000+2] [3.016262] traps: systemd-cgroups[390] general protection ip:7f456f7b6028 sp:7fffdc059718 error:0 in ld-2.18.so[7f456f79e000+2] -- vda -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete struct pt_regs
On Wed, Feb 25, 2015 at 11:59 AM, Andrey Wagin ava...@gmail.com wrote: 2015-02-25 21:42 GMT+03:00 Denys Vlasenko dvlas...@redhat.com: On 02/25/2015 01:37 PM, Andrey Wagin wrote: 2015-02-13 0:54 GMT+03:00 Denys Vlasenko dvlas...@redhat.com: 64-bit code was using six stack slots less by not saving/restoring registers which are callee-preserved according to C ABI, and not allocating space for them. Only when syscall needed a complete struct pt_regs, the complete area was allocated and filled in. As an additional twist, on interrupt entry a slightly less truncated pt_regs trick is used, to make nested interrupt stacks easier to unwind. This proved to be a source of significant obfuscation and subtle bugs. For example, stub_fork had to pop the return address, extend the struct, save registers, and push return address back. Ugly. ia32_ptregs_common pops return address and returns via jmp insn, throwing a wrench into CPU return stack cache. This patch changes code to always allocate a complete struct pt_regs. The saving of registers is still done lazily. Partial pt_regs trick on interrupt stack is retained. Macros which manipulate struct pt_regs on stack are reworked: ALLOC_PT_GPREGS_ON_STACK allocates the structure. SAVE_C_REGS saves to it those registers which are clobbered by C code. SAVE_EXTRA_REGS saves to it all other registers. Corresponding RESTORE_* and REMOVE_PT_GPREGS_FROM_STACK macros reverse it. ia32_ptregs_common, stub_fork and friends lost their ugly dance with return pointer. LOAD_ARGS32 in ia32entry.S now uses symbolic stack offsets instead of magic numbers. error_entry and save_paranoid now use SAVE_C_REGS + SAVE_EXTRA_REGS instead of having it open-coded yet again. Patch was run-tested: 64-bit executables, 32-bit executables, strace works. Timing tests did not show measurable difference in 32-bit and 64-bit syscalls. Hello Denys, My test vm doesn't boot with this patch. Could you help to investigate this issue? I think I found it. This part of my patch is possibly wrong: @@ -171,9 +171,9 @@ static inline int arch_irqs_disabled(void) #define ARCH_LOCKDEP_SYS_EXIT_IRQ \ TRACE_IRQS_ON; \ sti; \ - SAVE_REST; \ + SAVE_EXTRA_REGS; \ LOCKDEP_SYS_EXIT; \ - RESTORE_REST; \ + RESTORE_EXTRA_REGS; \ cli; \ TRACE_IRQS_OFF; The SAVE_REST here is intended to really *push* extra regs on stack, but the patch changed it so that they are written to existing stack slots above. From code inspection it should work in almost all cases, but some locations where it is used are really obscure. If there are places where *pushing* regs is really necessary, this can corrupt rbp,rbx,r12-15 registers. Your config has CONFIG_LOCKDEP=y, I think it's worth trying whether the bug was here. Please find updated patch attached. Can you try it? It doesn't work [2.282198] microcode: CPU0 sig=0x623, pf=0x0, revision=0x1 [2.288321] microcode: CPU1 sig=0x623, pf=0x0, revision=0x1 [2.289139] microcode: CPU2 sig=0x623, pf=0x0, revision=0x1 [2.290618] microcode: CPU3 sig=0x623, pf=0x0, revision=0x1 [2.292417] microcode: Microcode Update Driver: v2.00 tig...@aivazian.fsnet.co.uk, Peter Oruba [2.392592] piix4_smbus :00:01.3: SMBus Host Controller at 0xb100, revision 0 [2.510882] systemd-fsck[349]: /dev/sda1: clean, 343/128016 files, 166911/512000 blocks [2.521463] Adding 4128764k swap on /dev/sda2. Priority:-1 extents:1 across:4128764k FS [2.573597] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null) [2.597771] systemd-journald[283]: Received request to flush runtime journal from PID 1 [2.802288] audit: type=1305 audit(1424892361.629:3): audit_pid=369 old=0 auid=4294967295 ses=4294967295 res=1 [2.819660] traps: systemd-cgroups[380] general protection ip:7fb227939028 sp:7fff6bac59c8 error:0 in ld-2.18.so[7fb227921000+2] [3.016262] traps: systemd-cgroups[390] general protection ip:7f456f7b6028 sp:7fffdc059718 error:0 in ld-2.18.so[7f456f79e000+2] The change to stub_\func looks wrong to me. It saves and restores regs, but those regs might already have been saved if we're on the slow path. (Yes, all that code is quite buggy even without all these patches.) So is execve. This means that, for example, execve called in the slow path will save/restore regs twice. If the values in the regs after the first save and before the second save are different, then we corrupt user state. I think that the changes to all the stub-like things should be reverted. --Andy -- vda -- Andy Lutomirski AMA Capital Management, LLC -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete struct pt_regs
On 02/25/2015 01:37 PM, Andrey Wagin wrote: 2015-02-13 0:54 GMT+03:00 Denys Vlasenko dvlas...@redhat.com: 64-bit code was using six stack slots less by not saving/restoring registers which are callee-preserved according to C ABI, and not allocating space for them. Only when syscall needed a complete struct pt_regs, the complete area was allocated and filled in. As an additional twist, on interrupt entry a slightly less truncated pt_regs trick is used, to make nested interrupt stacks easier to unwind. This proved to be a source of significant obfuscation and subtle bugs. For example, stub_fork had to pop the return address, extend the struct, save registers, and push return address back. Ugly. ia32_ptregs_common pops return address and returns via jmp insn, throwing a wrench into CPU return stack cache. This patch changes code to always allocate a complete struct pt_regs. The saving of registers is still done lazily. Partial pt_regs trick on interrupt stack is retained. Macros which manipulate struct pt_regs on stack are reworked: ALLOC_PT_GPREGS_ON_STACK allocates the structure. SAVE_C_REGS saves to it those registers which are clobbered by C code. SAVE_EXTRA_REGS saves to it all other registers. Corresponding RESTORE_* and REMOVE_PT_GPREGS_FROM_STACK macros reverse it. ia32_ptregs_common, stub_fork and friends lost their ugly dance with return pointer. LOAD_ARGS32 in ia32entry.S now uses symbolic stack offsets instead of magic numbers. error_entry and save_paranoid now use SAVE_C_REGS + SAVE_EXTRA_REGS instead of having it open-coded yet again. Patch was run-tested: 64-bit executables, 32-bit executables, strace works. Timing tests did not show measurable difference in 32-bit and 64-bit syscalls. Hello Denys, My test vm doesn't boot with this patch. Could you help to investigate this issue? I think I found it. This part of my patch is possibly wrong: @@ -171,9 +171,9 @@ static inline int arch_irqs_disabled(void) #define ARCH_LOCKDEP_SYS_EXIT_IRQ \ TRACE_IRQS_ON; \ sti; \ - SAVE_REST; \ + SAVE_EXTRA_REGS; \ LOCKDEP_SYS_EXIT; \ - RESTORE_REST; \ + RESTORE_EXTRA_REGS; \ cli; \ TRACE_IRQS_OFF; The SAVE_REST here is intended to really *push* extra regs on stack, but the patch changed it so that they are written to existing stack slots above. From code inspection it should work in almost all cases, but some locations where it is used are really obscure. If there are places where *pushing* regs is really necessary, this can corrupt rbp,rbx,r12-15 registers. Your config has CONFIG_LOCKDEP=y, I think it's worth trying whether the bug was here. Please find updated patch attached. Can you try it? -- vda From 80c82370dcb8cd85c3b981ce62f63f74eff0df2c Mon Sep 17 00:00:00 2001 From: Denys Vlasenko dvlas...@redhat.com Date: Wed, 25 Feb 2015 19:32:14 +0100 Subject: [PATCH v2] x86: entry_64.S: always allocate complete struct pt_regs 64-bit code was using six stack slots less by not saving/restoring registers which are callee-preserved according to C ABI, and not allocating space for them. Only when syscall needed a complete struct pt_regs, the complete area was allocated and filled in. As an additional twist, on interrupt entry a slightly less truncated pt_regs trick is used, to make nested interrupt stacks easier to unwind. This proved to be a source of significant obfuscation and subtle bugs. For example, stub_fork had to pop the return address, extend the struct, save registers, and push return address back. Ugly. ia32_ptregs_common pops return address and returns via jmp insn, throwing a wrench into CPU return stack cache. This patch changes code to always allocate a complete struct pt_regs. The saving of registers is still done lazily. Partial pt_regs trick on interrupt stack is retained. Macros which manipulate struct pt_regs on stack are reworked: ALLOC_PT_GPREGS_ON_STACK allocates the structure. SAVE_C_REGS saves to it those registers which are clobbered by C code. SAVE_EXTRA_REGS saves to it all other registers. Corresponding RESTORE_* and REMOVE_PT_GPREGS_FROM_STACK macros reverse it. ia32_ptregs_common, stub_fork and friends lost their ugly dance with return pointer. LOAD_ARGS32 in ia32entry.S now uses symbolic stack offsets instead of magic numbers. error_entry and save_paranoid now use SAVE_C_REGS + SAVE_EXTRA_REGS instead of having it open-coded yet again. Patch was run-tested: 64-bit executables, 32-bit executables, strace works. Timing tests did not show measurable difference in 32-bit and 64-bit syscalls. Signed-off-by: Denys Vlasenko dvlas...@redhat.com CC: Linus Torvalds torva...@linux-foundation.org CC: Oleg Nesterov o...@redhat.com CC: Borislav Petkov b...@alien8.de CC: H. Peter Anvin h...@zytor.com CC: Andy Lutomirski l...@amacapital.net CC: Frederic Weisbecker fweis...@gmail.com CC: X86 ML x...@kernel.org CC: Alexei Starovoitov a...@plumgrid.com CC: Will Drewry
Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete struct pt_regs
On 02/25/2015 09:10 PM, Andy Lutomirski wrote: On Wed, Feb 25, 2015 at 11:59 AM, Andrey Wagin ava...@gmail.com wrote: 2015-02-25 21:42 GMT+03:00 Denys Vlasenko dvlas...@redhat.com: On 02/25/2015 01:37 PM, Andrey Wagin wrote: 2015-02-13 0:54 GMT+03:00 Denys Vlasenko dvlas...@redhat.com: 64-bit code was using six stack slots less by not saving/restoring registers which are callee-preserved according to C ABI, and not allocating space for them. Only when syscall needed a complete struct pt_regs, the complete area was allocated and filled in. As an additional twist, on interrupt entry a slightly less truncated pt_regs trick is used, to make nested interrupt stacks easier to unwind. This proved to be a source of significant obfuscation and subtle bugs. For example, stub_fork had to pop the return address, extend the struct, save registers, and push return address back. Ugly. ia32_ptregs_common pops return address and returns via jmp insn, throwing a wrench into CPU return stack cache. This patch changes code to always allocate a complete struct pt_regs. The saving of registers is still done lazily. Partial pt_regs trick on interrupt stack is retained. Macros which manipulate struct pt_regs on stack are reworked: ALLOC_PT_GPREGS_ON_STACK allocates the structure. SAVE_C_REGS saves to it those registers which are clobbered by C code. SAVE_EXTRA_REGS saves to it all other registers. Corresponding RESTORE_* and REMOVE_PT_GPREGS_FROM_STACK macros reverse it. ia32_ptregs_common, stub_fork and friends lost their ugly dance with return pointer. LOAD_ARGS32 in ia32entry.S now uses symbolic stack offsets instead of magic numbers. error_entry and save_paranoid now use SAVE_C_REGS + SAVE_EXTRA_REGS instead of having it open-coded yet again. Patch was run-tested: 64-bit executables, 32-bit executables, strace works. Timing tests did not show measurable difference in 32-bit and 64-bit syscalls. Hello Denys, My test vm doesn't boot with this patch. Could you help to investigate this issue? I think I found it. This part of my patch is possibly wrong: @@ -171,9 +171,9 @@ static inline int arch_irqs_disabled(void) #define ARCH_LOCKDEP_SYS_EXIT_IRQ \ TRACE_IRQS_ON; \ sti; \ - SAVE_REST; \ + SAVE_EXTRA_REGS; \ LOCKDEP_SYS_EXIT; \ - RESTORE_REST; \ + RESTORE_EXTRA_REGS; \ cli; \ TRACE_IRQS_OFF; The SAVE_REST here is intended to really *push* extra regs on stack, but the patch changed it so that they are written to existing stack slots above. From code inspection it should work in almost all cases, but some locations where it is used are really obscure. If there are places where *pushing* regs is really necessary, this can corrupt rbp,rbx,r12-15 registers. Your config has CONFIG_LOCKDEP=y, I think it's worth trying whether the bug was here. Please find updated patch attached. Can you try it? It doesn't work Thanks for testing it anyway. [3.016262] traps: systemd-cgroups[390] general protection ip:7f456f7b6028 sp:7fffdc059718 error:0 in ld-2.18.so[7f456f79e000+2] This is what I know about these crashes. The SEGV itself is caused by HLT instruction executed by dynamic loader, ld-2.NN.so. The instruction is in _exit function, and is only reachable if exit_group and exit syscalls fail to terminate the process. So it seems that syscall execution is getting badly broken somehow at some point. This happens to both reporters. My theory that it is related to lockdep seems to be wrong, because Sabrina's kernel is not lockdep-enabled, yet it sees the same failure. Both kernels are paravirtualized, both are booted under KVM, Andrey runs it with four virtual CPUs, Sabrina runs with two. My next theory is that I missed something related to paravirt. I am looking at that code, so far I don't see anything suspicious. Unfortunately, it doesn't happen to me: I have Sabrina's bzImage, I run it under qemu-system-x86_64 -enable-kvm -smp 2, I see in dmesg that kernel does detect that it is being run under KVM, but it works for me. No mysterious segfaults. Andrey, can you send me your bzImage? Maybe it will trigger the problem for me. The change to stub_\func looks wrong to me. It saves and restores regs, but those regs might already have been saved if we're on the slow path. (Yes, all that code is quite buggy even without all these patches.) So is execve. This means that, for example, execve called in the slow path will save/restore regs twice. If the values in the regs after the first save and before the second save are different, then we corrupt user state. This part? .macro FORK_LIKE func ENTRY(stub_\func) CFI_STARTPROC - popq%r11/* save return address */ - PARTIAL_FRAME 0 - SAVE_REST - pushq %r11/* put it back on stack */ + DEFAULT_FRAME 0, 8 /* offset 8: return
Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete struct pt_regs
On Wed, Feb 25, 2015 at 1:28 PM, Denys Vlasenko dvlas...@redhat.com wrote: On 02/25/2015 09:10 PM, Andy Lutomirski wrote: On Wed, Feb 25, 2015 at 11:59 AM, Andrey Wagin ava...@gmail.com wrote: 2015-02-25 21:42 GMT+03:00 Denys Vlasenko dvlas...@redhat.com: On 02/25/2015 01:37 PM, Andrey Wagin wrote: 2015-02-13 0:54 GMT+03:00 Denys Vlasenko dvlas...@redhat.com: 64-bit code was using six stack slots less by not saving/restoring registers which are callee-preserved according to C ABI, and not allocating space for them. Only when syscall needed a complete struct pt_regs, the complete area was allocated and filled in. As an additional twist, on interrupt entry a slightly less truncated pt_regs trick is used, to make nested interrupt stacks easier to unwind. This proved to be a source of significant obfuscation and subtle bugs. For example, stub_fork had to pop the return address, extend the struct, save registers, and push return address back. Ugly. ia32_ptregs_common pops return address and returns via jmp insn, throwing a wrench into CPU return stack cache. This patch changes code to always allocate a complete struct pt_regs. The saving of registers is still done lazily. Partial pt_regs trick on interrupt stack is retained. Macros which manipulate struct pt_regs on stack are reworked: ALLOC_PT_GPREGS_ON_STACK allocates the structure. SAVE_C_REGS saves to it those registers which are clobbered by C code. SAVE_EXTRA_REGS saves to it all other registers. Corresponding RESTORE_* and REMOVE_PT_GPREGS_FROM_STACK macros reverse it. ia32_ptregs_common, stub_fork and friends lost their ugly dance with return pointer. LOAD_ARGS32 in ia32entry.S now uses symbolic stack offsets instead of magic numbers. error_entry and save_paranoid now use SAVE_C_REGS + SAVE_EXTRA_REGS instead of having it open-coded yet again. Patch was run-tested: 64-bit executables, 32-bit executables, strace works. Timing tests did not show measurable difference in 32-bit and 64-bit syscalls. Hello Denys, My test vm doesn't boot with this patch. Could you help to investigate this issue? I think I found it. This part of my patch is possibly wrong: @@ -171,9 +171,9 @@ static inline int arch_irqs_disabled(void) #define ARCH_LOCKDEP_SYS_EXIT_IRQ \ TRACE_IRQS_ON; \ sti; \ - SAVE_REST; \ + SAVE_EXTRA_REGS; \ LOCKDEP_SYS_EXIT; \ - RESTORE_REST; \ + RESTORE_EXTRA_REGS; \ cli; \ TRACE_IRQS_OFF; The SAVE_REST here is intended to really *push* extra regs on stack, but the patch changed it so that they are written to existing stack slots above. From code inspection it should work in almost all cases, but some locations where it is used are really obscure. If there are places where *pushing* regs is really necessary, this can corrupt rbp,rbx,r12-15 registers. Your config has CONFIG_LOCKDEP=y, I think it's worth trying whether the bug was here. Please find updated patch attached. Can you try it? It doesn't work Thanks for testing it anyway. [3.016262] traps: systemd-cgroups[390] general protection ip:7f456f7b6028 sp:7fffdc059718 error:0 in ld-2.18.so[7f456f79e000+2] This is what I know about these crashes. The SEGV itself is caused by HLT instruction executed by dynamic loader, ld-2.NN.so. The instruction is in _exit function, and is only reachable if exit_group and exit syscalls fail to terminate the process. So it seems that syscall execution is getting badly broken somehow at some point. This happens to both reporters. My theory that it is related to lockdep seems to be wrong, because Sabrina's kernel is not lockdep-enabled, yet it sees the same failure. Both kernels are paravirtualized, both are booted under KVM, Andrey runs it with four virtual CPUs, Sabrina runs with two. My next theory is that I missed something related to paravirt. I am looking at that code, so far I don't see anything suspicious. Unfortunately, it doesn't happen to me: I have Sabrina's bzImage, I run it under qemu-system-x86_64 -enable-kvm -smp 2, I see in dmesg that kernel does detect that it is being run under KVM, but it works for me. No mysterious segfaults. Andrey, can you send me your bzImage? Maybe it will trigger the problem for me. The change to stub_\func looks wrong to me. It saves and restores regs, but those regs might already have been saved if we're on the slow path. (Yes, all that code is quite buggy even without all these patches.) So is execve. This means that, for example, execve called in the slow path will save/restore regs twice. If the values in the regs after the first save and before the second save are different, then we corrupt user state. This part? .macro FORK_LIKE func ENTRY(stub_\func) CFI_STARTPROC - popq%r11/* save return address */ - PARTIAL_FRAME 0 - SAVE_REST - pushq %r11
Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete struct pt_regs
2015-02-25, 13:59:06 -0800, Andy Lutomirski wrote: On Wed, Feb 25, 2015 at 1:28 PM, Denys Vlasenko dvlas...@redhat.com wrote: On 02/25/2015 09:10 PM, Andy Lutomirski wrote: On Wed, Feb 25, 2015 at 11:59 AM, Andrey Wagin ava...@gmail.com wrote: 2015-02-25 21:42 GMT+03:00 Denys Vlasenko dvlas...@redhat.com: On 02/25/2015 01:37 PM, Andrey Wagin wrote: 2015-02-13 0:54 GMT+03:00 Denys Vlasenko dvlas...@redhat.com: 64-bit code was using six stack slots less by not saving/restoring registers which are callee-preserved according to C ABI, and not allocating space for them. Only when syscall needed a complete struct pt_regs, the complete area was allocated and filled in. As an additional twist, on interrupt entry a slightly less truncated pt_regs trick is used, to make nested interrupt stacks easier to unwind. This proved to be a source of significant obfuscation and subtle bugs. For example, stub_fork had to pop the return address, extend the struct, save registers, and push return address back. Ugly. ia32_ptregs_common pops return address and returns via jmp insn, throwing a wrench into CPU return stack cache. This patch changes code to always allocate a complete struct pt_regs. The saving of registers is still done lazily. Partial pt_regs trick on interrupt stack is retained. Macros which manipulate struct pt_regs on stack are reworked: ALLOC_PT_GPREGS_ON_STACK allocates the structure. SAVE_C_REGS saves to it those registers which are clobbered by C code. SAVE_EXTRA_REGS saves to it all other registers. Corresponding RESTORE_* and REMOVE_PT_GPREGS_FROM_STACK macros reverse it. ia32_ptregs_common, stub_fork and friends lost their ugly dance with return pointer. LOAD_ARGS32 in ia32entry.S now uses symbolic stack offsets instead of magic numbers. error_entry and save_paranoid now use SAVE_C_REGS + SAVE_EXTRA_REGS instead of having it open-coded yet again. Patch was run-tested: 64-bit executables, 32-bit executables, strace works. Timing tests did not show measurable difference in 32-bit and 64-bit syscalls. Hello Denys, My test vm doesn't boot with this patch. Could you help to investigate this issue? I think I found it. This part of my patch is possibly wrong: @@ -171,9 +171,9 @@ static inline int arch_irqs_disabled(void) #define ARCH_LOCKDEP_SYS_EXIT_IRQ \ TRACE_IRQS_ON; \ sti; \ - SAVE_REST; \ + SAVE_EXTRA_REGS; \ LOCKDEP_SYS_EXIT; \ - RESTORE_REST; \ + RESTORE_EXTRA_REGS; \ cli; \ TRACE_IRQS_OFF; The SAVE_REST here is intended to really *push* extra regs on stack, but the patch changed it so that they are written to existing stack slots above. From code inspection it should work in almost all cases, but some locations where it is used are really obscure. If there are places where *pushing* regs is really necessary, this can corrupt rbp,rbx,r12-15 registers. Your config has CONFIG_LOCKDEP=y, I think it's worth trying whether the bug was here. Please find updated patch attached. Can you try it? It doesn't work Thanks for testing it anyway. [3.016262] traps: systemd-cgroups[390] general protection ip:7f456f7b6028 sp:7fffdc059718 error:0 in ld-2.18.so[7f456f79e000+2] This is what I know about these crashes. The SEGV itself is caused by HLT instruction executed by dynamic loader, ld-2.NN.so. The instruction is in _exit function, and is only reachable if exit_group and exit syscalls fail to terminate the process. So it seems that syscall execution is getting badly broken somehow at some point. This happens to both reporters. My theory that it is related to lockdep seems to be wrong, because Sabrina's kernel is not lockdep-enabled, yet it sees the same failure. Both kernels are paravirtualized, both are booted under KVM, Andrey runs it with four virtual CPUs, Sabrina runs with two. My next theory is that I missed something related to paravirt. I am looking at that code, so far I don't see anything suspicious. Unfortunately, it doesn't happen to me: I have Sabrina's bzImage, I run it under qemu-system-x86_64 -enable-kvm -smp 2, I see in dmesg that kernel does detect that it is being run under KVM, but it works for me. No mysterious segfaults. Andrey, can you send me your bzImage? Maybe it will trigger the problem for me. The change to stub_\func looks wrong to me. It saves and restores regs, but those regs might already have been saved if we're on the slow path. (Yes, all that code is quite buggy even without all these patches.) So is execve. This means that, for example, execve called in the slow path will save/restore regs twice. If the values in the regs after the first save and before the second save are different, then we corrupt user state. This part? .macro
Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete struct pt_regs
2015-02-25, 23:40:55 +0100, Sabrina Dubroca wrote: I can run some userspace programs, but I have no idea what would be helpful. I can also try booting a real machine with archlinux/systemd tomorrow. I got a good boot out of kernels that normally fail. I booted systemd's emergency shell and enabled a few services, in the same order they normally start. journald started cleanly, but after that, every single command produced a traps: output and an audit: line. I disabled systemd-journald (chmod -x, because `systemctl disable` didn't really disable it), and now it boots, no traps: in the log. If I run it, everything fails again (zsh has traps for simply pressing enter on an empty cmd). -- Sabrina -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete struct pt_regs
On Thu, Feb 26, 2015 at 12:34 AM, Sabrina Dubroca s...@queasysnail.net wrote: 2015-02-25, 23:40:55 +0100, Sabrina Dubroca wrote: I can run some userspace programs, but I have no idea what would be helpful. I can also try booting a real machine with archlinux/systemd tomorrow. I got a good boot out of kernels that normally fail. I booted systemd's emergency shell and enabled a few services, in the same order they normally start. journald started cleanly, but after that, every single command produced a traps: output and an audit: line. I disabled systemd-journald (chmod -x, because `systemctl disable` didn't really disable it), and now it boots, no traps: in the log. If I run it, everything fails again (zsh has traps for simply pressing enter on an empty cmd). That's some progress! It's strange how one process manages to affect everything else. If I run it, everything fails again. How do you run it? Directly, or via systemd services mechanism? If you just run it directly, can you try running it under strace -f -tt -oLOG? Does it have the same effect? What's in the LOG? -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete struct pt_regs
Hi all, On Wed, 25 Feb 2015 21:18:52 -0800 Andrew Morton a...@linux-foundation.org wrote: On Thu, 26 Feb 2015 02:12:57 +0100 Denys Vlasenko vda.li...@googlemail.com wrote: On Thu, Feb 26, 2015 at 12:34 AM, Sabrina Dubroca s...@queasysnail.net wrote: 2015-02-25, 23:40:55 +0100, Sabrina Dubroca wrote: I can run some userspace programs, but I have no idea what would be helpful. I can also try booting a real machine with archlinux/systemd tomorrow. I got a good boot out of kernels that normally fail. I booted systemd's emergency shell and enabled a few services, in the same order they normally start. journald started cleanly, but after that, every single command produced a traps: output and an audit: line. I disabled systemd-journald (chmod -x, because `systemctl disable` didn't really disable it), and now it boots, no traps: in the log. If I run it, everything fails again (zsh has traps for simply pressing enter on an empty cmd). That's some progress! It's strange how one process manages to affect everything else. If I run it, everything fails again. How do you run it? Directly, or via systemd services mechanism? If you just run it directly, can you try running it under strace -f -tt -oLOG? Does it have the same effect? What's in the LOG? I'm hitting this bug as well, bisected to this commit. On an old x64_64 box, no vms, paravirt, etc. Running FC6 userspace (heh). Quite late in initscripts, binaries start getting segmentation faults and init gives up. Seems to only affect /usr/bin/rhgb-client. There's one instance where /bin/rm is said to segfault, but I suspect that's init lying to me. I note that that commit has been removed from today's version of the luto-misc tree and thus linux-next. -- Cheers, Stephen Rothwells...@canb.auug.org.au pgpu6vlymy8yz.pgp Description: OpenPGP digital signature
Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete struct pt_regs
On Thu, 26 Feb 2015 02:12:57 +0100 Denys Vlasenko vda.li...@googlemail.com wrote: On Thu, Feb 26, 2015 at 12:34 AM, Sabrina Dubroca s...@queasysnail.net wrote: 2015-02-25, 23:40:55 +0100, Sabrina Dubroca wrote: I can run some userspace programs, but I have no idea what would be helpful. I can also try booting a real machine with archlinux/systemd tomorrow. I got a good boot out of kernels that normally fail. I booted systemd's emergency shell and enabled a few services, in the same order they normally start. journald started cleanly, but after that, every single command produced a traps: output and an audit: line. I disabled systemd-journald (chmod -x, because `systemctl disable` didn't really disable it), and now it boots, no traps: in the log. If I run it, everything fails again (zsh has traps for simply pressing enter on an empty cmd). That's some progress! It's strange how one process manages to affect everything else. If I run it, everything fails again. How do you run it? Directly, or via systemd services mechanism? If you just run it directly, can you try running it under strace -f -tt -oLOG? Does it have the same effect? What's in the LOG? I'm hitting this bug as well, bisected to this commit. On an old x64_64 box, no vms, paravirt, etc. Running FC6 userspace (heh). Quite late in initscripts, binaries start getting segmentation faults and init gives up. Seems to only affect /usr/bin/rhgb-client. There's one instance where /bin/rm is said to segfault, but I suspect that's init lying to me. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete "struct pt_regs"
On Thu, Feb 12, 2015 at 1:54 PM, Denys Vlasenko wrote: > 64-bit code was using six stack slots less by not saving/restoring > registers which are callee-preserved according to C ABI, > and not allocating space for them. > Only when syscall needed a complete "struct pt_regs", > the complete area was allocated and filled in. > As an additional twist, on interrupt entry a "slightly less truncated pt_regs" > trick is used, to make nested interrupt stacks easier to unwind. > > This proved to be a source of significant obfuscation and subtle bugs. > For example, stub_fork had to pop the return address, > extend the struct, save registers, and push return address back. Ugly. > ia32_ptregs_common pops return address and "returns" via jmp insn, > throwing a wrench into CPU return stack cache. > > This patch changes code to always allocate a complete "struct pt_regs". > The saving of registers is still done lazily. > > "Partial pt_regs" trick on interrupt stack is retained. > > Macros which manipulate "struct pt_regs" on stack are reworked: > ALLOC_PT_GPREGS_ON_STACK allocates the structure. > SAVE_C_REGS saves to it those registers which are clobbered by C code. > SAVE_EXTRA_REGS saves to it all other registers. > Corresponding RESTORE_* and REMOVE_PT_GPREGS_FROM_STACK macros reverse it. > > ia32_ptregs_common, stub_fork and friends lost their ugly dance with > return pointer. > > LOAD_ARGS32 in ia32entry.S now uses symbolic stack offsets > instead of magic numbers. > > error_entry and save_paranoid now use SAVE_C_REGS + SAVE_EXTRA_REGS > instead of having it open-coded yet again. > > Patch was run-tested: 64-bit executables, 32-bit executables, > strace works. > Timing tests did not show measurable difference in 32-bit > and 64-bit syscalls. This patch scares me, because it changes a lot of hairy code. That being said, I don't see anything wrong with it, and the end result is much nicer than the status quo. So I applied it, and I'll let the kbuild bot have fun with it. I confirmed that I can boot a 64-bit and a 32-bit system with it, at least in my configuration. Further reviews are encouraged :) --Andy > > Signed-off-by: Denys Vlasenko > CC: Linus Torvalds > CC: Oleg Nesterov > CC: Borislav Petkov > CC: "H. Peter Anvin" > CC: Andy Lutomirski > CC: Frederic Weisbecker > CC: X86 ML > CC: Alexei Starovoitov > CC: Will Drewry > CC: Kees Cook > CC: linux-kernel@vger.kernel.org > --- > arch/x86/ia32/ia32entry.S | 47 +++ > arch/x86/include/asm/calling.h | 222 > - > arch/x86/include/asm/irqflags.h| 4 +- > arch/x86/include/uapi/asm/ptrace-abi.h | 1 - > arch/x86/kernel/entry_64.S | 195 +++-- > 5 files changed, 209 insertions(+), 260 deletions(-) > > diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S > index 156ebca..f4bed49 100644 > --- a/arch/x86/ia32/ia32entry.S > +++ b/arch/x86/ia32/ia32entry.S > @@ -62,12 +62,12 @@ > */ > .macro LOAD_ARGS32 offset, _r9=0 > .if \_r9 > - movl \offset+16(%rsp),%r9d > + movl \offset+R9(%rsp),%r9d > .endif > - movl \offset+40(%rsp),%ecx > - movl \offset+48(%rsp),%edx > - movl \offset+56(%rsp),%esi > - movl \offset+64(%rsp),%edi > + movl \offset+RCX(%rsp),%ecx > + movl \offset+RDX(%rsp),%edx > + movl \offset+RSI(%rsp),%esi > + movl \offset+RDI(%rsp),%edi > movl %eax,%eax /* zero extension */ > .endm > > @@ -144,7 +144,8 @@ ENTRY(ia32_sysenter_target) > CFI_REL_OFFSET rip,0 > pushq_cfi %rax > cld > - SAVE_ARGS 0,1,0 > + ALLOC_PT_GPREGS_ON_STACK > + SAVE_C_REGS_EXCEPT_R891011 > /* no need to do an access_ok check here because rbp has been >32bit zero extended */ > ASM_STAC > @@ -182,7 +183,8 @@ sysexit_from_sys_call: > andl$~0x200,EFLAGS-ARGOFFSET(%rsp) > movlRIP-ARGOFFSET(%rsp),%edx/* User %eip */ > CFI_REGISTER rip,rdx > - RESTORE_ARGS 0,24,0,0,0,0 > + RESTORE_RSI_RDI > + REMOVE_PT_GPREGS_FROM_STACK 3*8 > xorq%r8,%r8 > xorq%r9,%r9 > xorq%r10,%r10 > @@ -256,13 +258,13 @@ sysenter_tracesys: > testl $(_TIF_WORK_SYSCALL_ENTRY & > ~_TIF_SYSCALL_AUDIT),TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET) > jz sysenter_auditsys > #endif > - SAVE_REST > + SAVE_EXTRA_REGS > CLEAR_RREGS > movq$-ENOSYS,RAX(%rsp)/* ptrace can change this for a bad syscall > */ > movq%rsp,%rdi/* _regs -> arg1 */ > callsyscall_trace_enter > LOAD_ARGS32 ARGOFFSET /* reload args from stack in case ptrace > changed it */ > - RESTORE_REST > + RESTORE_EXTRA_REGS > cmpq$(IA32_NR_syscalls-1),%rax > ja int_ret_from_sys_call /* sysenter_tracesys has set RAX(%rsp) > */
Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete struct pt_regs
On Thu, Feb 12, 2015 at 1:54 PM, Denys Vlasenko dvlas...@redhat.com wrote: 64-bit code was using six stack slots less by not saving/restoring registers which are callee-preserved according to C ABI, and not allocating space for them. Only when syscall needed a complete struct pt_regs, the complete area was allocated and filled in. As an additional twist, on interrupt entry a slightly less truncated pt_regs trick is used, to make nested interrupt stacks easier to unwind. This proved to be a source of significant obfuscation and subtle bugs. For example, stub_fork had to pop the return address, extend the struct, save registers, and push return address back. Ugly. ia32_ptregs_common pops return address and returns via jmp insn, throwing a wrench into CPU return stack cache. This patch changes code to always allocate a complete struct pt_regs. The saving of registers is still done lazily. Partial pt_regs trick on interrupt stack is retained. Macros which manipulate struct pt_regs on stack are reworked: ALLOC_PT_GPREGS_ON_STACK allocates the structure. SAVE_C_REGS saves to it those registers which are clobbered by C code. SAVE_EXTRA_REGS saves to it all other registers. Corresponding RESTORE_* and REMOVE_PT_GPREGS_FROM_STACK macros reverse it. ia32_ptregs_common, stub_fork and friends lost their ugly dance with return pointer. LOAD_ARGS32 in ia32entry.S now uses symbolic stack offsets instead of magic numbers. error_entry and save_paranoid now use SAVE_C_REGS + SAVE_EXTRA_REGS instead of having it open-coded yet again. Patch was run-tested: 64-bit executables, 32-bit executables, strace works. Timing tests did not show measurable difference in 32-bit and 64-bit syscalls. This patch scares me, because it changes a lot of hairy code. That being said, I don't see anything wrong with it, and the end result is much nicer than the status quo. So I applied it, and I'll let the kbuild bot have fun with it. I confirmed that I can boot a 64-bit and a 32-bit system with it, at least in my configuration. Further reviews are encouraged :) --Andy Signed-off-by: Denys Vlasenko dvlas...@redhat.com CC: Linus Torvalds torva...@linux-foundation.org CC: Oleg Nesterov o...@redhat.com CC: Borislav Petkov b...@alien8.de CC: H. Peter Anvin h...@zytor.com CC: Andy Lutomirski l...@amacapital.net CC: Frederic Weisbecker fweis...@gmail.com CC: X86 ML x...@kernel.org CC: Alexei Starovoitov a...@plumgrid.com CC: Will Drewry w...@chromium.org CC: Kees Cook keesc...@chromium.org CC: linux-kernel@vger.kernel.org --- arch/x86/ia32/ia32entry.S | 47 +++ arch/x86/include/asm/calling.h | 222 - arch/x86/include/asm/irqflags.h| 4 +- arch/x86/include/uapi/asm/ptrace-abi.h | 1 - arch/x86/kernel/entry_64.S | 195 +++-- 5 files changed, 209 insertions(+), 260 deletions(-) diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S index 156ebca..f4bed49 100644 --- a/arch/x86/ia32/ia32entry.S +++ b/arch/x86/ia32/ia32entry.S @@ -62,12 +62,12 @@ */ .macro LOAD_ARGS32 offset, _r9=0 .if \_r9 - movl \offset+16(%rsp),%r9d + movl \offset+R9(%rsp),%r9d .endif - movl \offset+40(%rsp),%ecx - movl \offset+48(%rsp),%edx - movl \offset+56(%rsp),%esi - movl \offset+64(%rsp),%edi + movl \offset+RCX(%rsp),%ecx + movl \offset+RDX(%rsp),%edx + movl \offset+RSI(%rsp),%esi + movl \offset+RDI(%rsp),%edi movl %eax,%eax /* zero extension */ .endm @@ -144,7 +144,8 @@ ENTRY(ia32_sysenter_target) CFI_REL_OFFSET rip,0 pushq_cfi %rax cld - SAVE_ARGS 0,1,0 + ALLOC_PT_GPREGS_ON_STACK + SAVE_C_REGS_EXCEPT_R891011 /* no need to do an access_ok check here because rbp has been 32bit zero extended */ ASM_STAC @@ -182,7 +183,8 @@ sysexit_from_sys_call: andl$~0x200,EFLAGS-ARGOFFSET(%rsp) movlRIP-ARGOFFSET(%rsp),%edx/* User %eip */ CFI_REGISTER rip,rdx - RESTORE_ARGS 0,24,0,0,0,0 + RESTORE_RSI_RDI + REMOVE_PT_GPREGS_FROM_STACK 3*8 xorq%r8,%r8 xorq%r9,%r9 xorq%r10,%r10 @@ -256,13 +258,13 @@ sysenter_tracesys: testl $(_TIF_WORK_SYSCALL_ENTRY ~_TIF_SYSCALL_AUDIT),TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET) jz sysenter_auditsys #endif - SAVE_REST + SAVE_EXTRA_REGS CLEAR_RREGS movq$-ENOSYS,RAX(%rsp)/* ptrace can change this for a bad syscall */ movq%rsp,%rdi/* pt_regs - arg1 */ callsyscall_trace_enter LOAD_ARGS32 ARGOFFSET /* reload args from stack in case ptrace changed it */ - RESTORE_REST + RESTORE_EXTRA_REGS cmpq$(IA32_NR_syscalls-1),%rax