Re: 4.14.9 doesn't boot (regression)
On Sun, Dec 31, 2017 at 01:03:25AM +0300, Alexander Tsoy wrote: > > Turns out my previous code to print iret frames was a bit ... > > misguided, to put it nicely. Not sure what I was smoking. > > > > Hopefully the below patch should fix it (in place of the previous > > patch). Would you mind testing again? > > > > With that patch I get: > > [2.160017] NMI backtrace for cpu 0 > [2.160017] CPU: 0 PID: 1 Comm: init Not tainted 4.15.0-rc5 #1 > [2.160017] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS > 1.10.2-1.fc27 04/01/2014 > [2.160017] RIP: 0010:double_fault+0x0/0x30 > [2.160017] RSP: :fe807fd0 EFLAGS: 00010086 > [2.160017] RAX: ffc0 RBX: 0001 RCX: > c101 > [2.160017] RDX: 8edc RSI: RDI: > fe807f58 > [2.160017] RBP: R08: R09: > > [2.160017] R10: R11: R12: > a3c01426 > [2.160017] R13: R14: R15: > > [2.160017] FS: () GS:8edcffc0() > knlGS: > [2.160017] CS: 0010 DS: ES: CR0: 80050033 > [2.160017] CR2: fe806f08 CR3: 7c153000 CR4: > 06b0 > [2.160017] Call Trace: > [2.160017] <#DF> > [2.160017] RIP: 0010:do_double_fault+0xb/0x140 > [2.160017] RSP: :fe806f18 EFLAGS: 00010086 > [2.160017] Yes, that's more like it. I'll clean up the patches and submit them soon. These nasty bugs are always a good testcase for the stack dump code. Thanks for testing! -- Josh
Re: 4.14.9 doesn't boot (regression)
В Sat, 30 Dec 2017 11:57:46 -0600 Josh Poimboeuf пишет: > On Sat, Dec 30, 2017 at 11:09:46AM -0600, Josh Poimboeuf wrote: > > On Sat, Dec 30, 2017 at 11:45:13AM +0300, Alexander Tsoy wrote: > > > В Пт, 29/12/2017 в 21:49 -0600, Josh Poimboeuf пишет: > > > > On Fri, Dec 29, 2017 at 05:10:35PM -0700, Andy Lutomirski > > > > wrote: > > > > > (Also, Josh, the oops code should have printed the contents > > > > > of the struct pt_regs at the top of the DF stack. Any idea > > > > > why it didn't?) > > > > > > > > Looking at one of the dumps: > > > > > > > > [ 392.774879] NMI backtrace for cpu 0 > > > > [ 392.774881] CPU: 0 PID: 1 Comm: init Not tainted > > > > 4.14.9-gentoo #1 > > > > [ 392.774881] Hardware name: Red Hat KVM, BIOS 0.5.1 > > > > 01/01/2011 [ 392.774882] task: 8802368b8000 task.stack: > > > > c900c000 [ 392.774885] RIP: 0010:double_fault+0x0/0x30 > > > > [ 392.774886] RSP: :ff527fd0 EFLAGS: 0086 > > > > [ 392.774887] RAX: 3fc0 RBX: 0001 > > > > RCX: c101 > > > > [ 392.774887] RDX: 8802 RSI: > > > > RDI: ff527f58 > > > > [ 392.774887] RBP: R08: > > > > R09: > > > > [ 392.774888] R10: R11: > > > > R12: 816ae726 > > > > [ 392.774888] R13: R14: > > > > R15: > > > > [ 392.774889] FS: () > > > > GS:88023fc0() knlGS: > > > > [ 392.774889] CS: 0010 DS: ES: CR0: > > > > 80050033 [ 392.774890] CR2: ff526f08 CR3: > > > > 000235b48002 CR4: 001606f0 > > > > [ 392.774892] Call Trace: > > > > [ 392.774894] <#DF> > > > > [ 392.774897] do_double_fault+0xb/0x140 > > > > [ 392.774898] > > > > > > > > It should have at least printed the #DF iret frame registers, > > > > which I recently added support for in "x86/unwinder: Handle > > > > stack overflows more > > > > gracefully", which is in both 4.14.9 and 4.15-rc5. > > > > > > > > I think the missing iret regs are due to a bug in > > > > show_trace_log_lvl(), > > > > where if the unwind starts with two regs frames in a row, the > > > > second regs don't get printed. 
> > > > > > > > Alexander, would you mind reproducing again with the below > > > > patch? It should still fail, but this time it should hopefully > > > > show another RIP/RSP/EFLAGS instead of the > > > > "do_double_fault+0xb/0x140" line. > > > > > > Yes, it works: > > > > > > [ 23.058064] NMI backtrace for cpu 2 > > > [ 23.058068] CPU: 2 PID: 1 Comm: init Not tainted 4.15.0-rc5+ #1 > > > [ 23.058069] Hardware name: QEMU Standard PC (i440FX + PIIX, > > > 1996), BIOS 1.10.2-1.fc27 04/01/2014 > > > [ 23.058074] RIP: 0010:double_fault+0x0/0x30 > > > [ 23.058075] RSP: :fe85ffd0 EFLAGS: 0086 > > > [ 23.058077] RAX: 3fd0 RBX: 0001 RCX: > > > c101 > > > [ 23.058077] RDX: 9681 RSI: RDI: > > > fe85ff58 > > > [ 23.058078] RBP: R08: R09: > > > > > > [ 23.058079] R10: R11: R12: > > > 92001426 > > > [ 23.058080] R13: R14: R15: > > > > > > [ 23.058083] FS: () > > > GS:96813fd0() knlGS: > > > [ 23.058084] CS: 0010 DS: ES: CR0: 80050033 > > > [ 23.058085] CR2: fe85ef08 CR3: 000137a09000 CR4: > > > 000406a0 > > > [ 23.058089] Call Trace: > > > [ 23.058101] <#DF> > > > [ 23.058104] RIP: 0010:do_double_fault+0xb/0x140 > > > [ 23.058105] RSP: :fe85ef18 EFLAGS: 00010086 > > > ORIG_RAX: > > > [ 23.058106] RAX: 3fd0 RBX: 0001 RCX: > > > c101 > > > [ 23.058107] RDX: 9681 RSI: RDI: > > > fe85ff58 > > > [ 23.058107] RBP: R08: R09: > > > > > > [ 23.058108] R10: R11: R12: > > > 92001426 > > > [ 23.058108] R13: R14: R15: > > > > > > [ 23.058111] > > > [ 23.058111] Code: 05 00 00 48 89 e7 31 f6 e8 2e 8c 61 ff e9 69 > > > 06 00 00 e8 94 05 00 00 48 89 e7 31 f6 e8 1a 8c 61 ff e9 55 06 00 > > > 00 0f 1f 44 00 00 <0f> 1f 00 48 83 c4 88 e8 e4 04 00 00 48 89 e7 > > > 48 8b 74 24 78 48 > > > > That's better indeed, though still not quite right. It should have > > only shown a subset of those registers. One more bug to fix > > there... > > Turns out my previous code to print iret frames was a bit ... > misguided, to put it nicely. Not sure what I was smoking. 
> > Hopefully the below patch should fix it (in place of the previous >
Re: 4.14.9 doesn't boot (regression)
On Sat, Dec 30, 2017 at 11:09:46AM -0600, Josh Poimboeuf wrote: > On Sat, Dec 30, 2017 at 11:45:13AM +0300, Alexander Tsoy wrote: > > В Пт, 29/12/2017 в 21:49 -0600, Josh Poimboeuf пишет: > > > On Fri, Dec 29, 2017 at 05:10:35PM -0700, Andy Lutomirski wrote: > > > > (Also, Josh, the oops code should have printed the contents of the > > > > struct pt_regs at the top of the DF stack. Any idea why it > > > > didn't?) > > > > > > Looking at one of the dumps: > > > > > > [ 392.774879] NMI backtrace for cpu 0 > > > [ 392.774881] CPU: 0 PID: 1 Comm: init Not tainted 4.14.9-gentoo > > > #1 > > > [ 392.774881] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 > > > [ 392.774882] task: 8802368b8000 task.stack: c900c000 > > > [ 392.774885] RIP: 0010:double_fault+0x0/0x30 > > > [ 392.774886] RSP: :ff527fd0 EFLAGS: 0086 > > > [ 392.774887] RAX: 3fc0 RBX: 0001 RCX: > > > c101 > > > [ 392.774887] RDX: 8802 RSI: RDI: > > > ff527f58 > > > [ 392.774887] RBP: R08: R09: > > > > > > [ 392.774888] R10: R11: R12: > > > 816ae726 > > > [ 392.774888] R13: R14: R15: > > > > > > [ 392.774889] FS: () > > > GS:88023fc0() knlGS: > > > [ 392.774889] CS: 0010 DS: ES: CR0: 80050033 > > > [ 392.774890] CR2: ff526f08 CR3: 000235b48002 CR4: > > > 001606f0 > > > [ 392.774892] Call Trace: > > > [ 392.774894] <#DF> > > > [ 392.774897] do_double_fault+0xb/0x140 > > > [ 392.774898] > > > > > > It should have at least printed the #DF iret frame registers, which I > > > recently added support for in "x86/unwinder: Handle stack overflows > > > more > > > gracefully", which is in both 4.14.9 and 4.15-rc5. > > > > > > I think the missing iret regs are due to a bug in > > > show_trace_log_lvl(), > > > where if the unwind starts with two regs frames in a row, the second > > > regs don't get printed. > > > > > > Alexander, would you mind reproducing again with the below patch? 
It > > > should still fail, but this time it should hopefully show another > > > RIP/RSP/EFLAGS instead of the "do_double_fault+0xb/0x140" line. > > > > > > > Yes, it works: > > > > [ 23.058064] NMI backtrace for cpu 2 > > [ 23.058068] CPU: 2 PID: 1 Comm: init Not tainted 4.15.0-rc5+ #1 > > [ 23.058069] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), > > BIOS 1.10.2-1.fc27 04/01/2014 > > [ 23.058074] RIP: 0010:double_fault+0x0/0x30 > > [ 23.058075] RSP: :fe85ffd0 EFLAGS: 0086 > > [ 23.058077] RAX: 3fd0 RBX: 0001 RCX: > > c101 > > [ 23.058077] RDX: 9681 RSI: RDI: > > fe85ff58 > > [ 23.058078] RBP: R08: R09: > > > > [ 23.058079] R10: R11: R12: > > 92001426 > > [ 23.058080] R13: R14: R15: > > > > [ 23.058083] FS: () GS:96813fd0() > > knlGS: > > [ 23.058084] CS: 0010 DS: ES: CR0: 80050033 > > [ 23.058085] CR2: fe85ef08 CR3: 000137a09000 CR4: > > 000406a0 > > [ 23.058089] Call Trace: > > [ 23.058101] <#DF> > > [ 23.058104] RIP: 0010:do_double_fault+0xb/0x140 > > [ 23.058105] RSP: :fe85ef18 EFLAGS: 00010086 ORIG_RAX: > > > > [ 23.058106] RAX: 3fd0 RBX: 0001 RCX: > > c101 > > [ 23.058107] RDX: 9681 RSI: RDI: > > fe85ff58 > > [ 23.058107] RBP: R08: R09: > > > > [ 23.058108] R10: R11: R12: > > 92001426 > > [ 23.058108] R13: R14: R15: > > > > [ 23.058111] > > [ 23.058111] Code: 05 00 00 48 89 e7 31 f6 e8 2e 8c 61 ff e9 69 06 00 > > 00 e8 94 05 00 00 48 89 e7 31 f6 e8 1a 8c 61 ff e9 55 06 00 00 0f 1f 44 > > 00 00 <0f> 1f 00 48 83 c4 88 e8 e4 04 00 00 48 89 e7 48 8b 74 24 78 48 > > That's better indeed, though still not quite right. It should have only > shown a subset of those registers. One more bug to fix there... Turns out my previous code to print iret frames was a bit ... misguided, to put it nicely. Not sure what I was smoking. Hopefully the below patch should fix it (in place of the previous patch). Would you mind testing again? 
diff --git a/arch/x86/include/asm/unwind.h b/arch/x86/include/asm/unwind.h index c1688c2d0a12..1f86e1b0a5cd 100644 --- a/arch/x86/include/asm/unwind.h +++ b/arch/x86/include/asm/unwind.h @@ -56,18 +56,27 @@ void unwind_start(struct unwind_state *st
Re: 4.14.9 doesn't boot (regression)
On Sat, Dec 30, 2017 at 11:45:13AM +0300, Alexander Tsoy wrote: > В Пт, 29/12/2017 в 21:49 -0600, Josh Poimboeuf пишет: > > On Fri, Dec 29, 2017 at 05:10:35PM -0700, Andy Lutomirski wrote: > > > (Also, Josh, the oops code should have printed the contents of the > > > struct pt_regs at the top of the DF stack. Any idea why it > > > didn't?) > > > > Looking at one of the dumps: > > > > [ 392.774879] NMI backtrace for cpu 0 > > [ 392.774881] CPU: 0 PID: 1 Comm: init Not tainted 4.14.9-gentoo > > #1 > > [ 392.774881] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 > > [ 392.774882] task: 8802368b8000 task.stack: c900c000 > > [ 392.774885] RIP: 0010:double_fault+0x0/0x30 > > [ 392.774886] RSP: :ff527fd0 EFLAGS: 0086 > > [ 392.774887] RAX: 3fc0 RBX: 0001 RCX: > > c101 > > [ 392.774887] RDX: 8802 RSI: RDI: > > ff527f58 > > [ 392.774887] RBP: R08: R09: > > > > [ 392.774888] R10: R11: R12: > > 816ae726 > > [ 392.774888] R13: R14: R15: > > > > [ 392.774889] FS: () > > GS:88023fc0() knlGS: > > [ 392.774889] CS: 0010 DS: ES: CR0: 80050033 > > [ 392.774890] CR2: ff526f08 CR3: 000235b48002 CR4: > > 001606f0 > > [ 392.774892] Call Trace: > > [ 392.774894] <#DF> > > [ 392.774897] do_double_fault+0xb/0x140 > > [ 392.774898] > > > > It should have at least printed the #DF iret frame registers, which I > > recently added support for in "x86/unwinder: Handle stack overflows > > more > > gracefully", which is in both 4.14.9 and 4.15-rc5. > > > > I think the missing iret regs are due to a bug in > > show_trace_log_lvl(), > > where if the unwind starts with two regs frames in a row, the second > > regs don't get printed. > > > > Alexander, would you mind reproducing again with the below patch? It > > should still fail, but this time it should hopefully show another > > RIP/RSP/EFLAGS instead of the "do_double_fault+0xb/0x140" line. 
> > > > Yes, it works: > > [ 23.058064] NMI backtrace for cpu 2 > [ 23.058068] CPU: 2 PID: 1 Comm: init Not tainted 4.15.0-rc5+ #1 > [ 23.058069] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), > BIOS 1.10.2-1.fc27 04/01/2014 > [ 23.058074] RIP: 0010:double_fault+0x0/0x30 > [ 23.058075] RSP: :fe85ffd0 EFLAGS: 0086 > [ 23.058077] RAX: 3fd0 RBX: 0001 RCX: > c101 > [ 23.058077] RDX: 9681 RSI: RDI: > fe85ff58 > [ 23.058078] RBP: R08: R09: > > [ 23.058079] R10: R11: R12: > 92001426 > [ 23.058080] R13: R14: R15: > > [ 23.058083] FS: () GS:96813fd0() > knlGS: > [ 23.058084] CS: 0010 DS: ES: CR0: 80050033 > [ 23.058085] CR2: fe85ef08 CR3: 000137a09000 CR4: > 000406a0 > [ 23.058089] Call Trace: > [ 23.058101] <#DF> > [ 23.058104] RIP: 0010:do_double_fault+0xb/0x140 > [ 23.058105] RSP: :fe85ef18 EFLAGS: 00010086 ORIG_RAX: > > [ 23.058106] RAX: 3fd0 RBX: 0001 RCX: > c101 > [ 23.058107] RDX: 9681 RSI: RDI: > fe85ff58 > [ 23.058107] RBP: R08: R09: > > [ 23.058108] R10: R11: R12: > 92001426 > [ 23.058108] R13: R14: R15: > > [ 23.058111] > [ 23.058111] Code: 05 00 00 48 89 e7 31 f6 e8 2e 8c 61 ff e9 69 06 00 > 00 e8 94 05 00 00 48 89 e7 31 f6 e8 1a 8c 61 ff e9 55 06 00 00 0f 1f 44 > 00 00 <0f> 1f 00 48 83 c4 88 e8 e4 04 00 00 48 89 e7 48 8b 74 24 78 48 That's better indeed, though still not quite right. It should have only shown a subset of those registers. One more bug to fix there... -- Josh
Re: 4.14.9 doesn't boot (regression)
On Sat, 30 Dec 2017, Toralf Förster wrote:

> This made the issue go away:
>
> diff --git a/Makefile b/Makefile
> index ac8c441866b7..11a12947c550 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -414,7 +414,7 @@ LINUXINCLUDE:= \
>
>  KBUILD_AFLAGS := -D__ASSEMBLY__
>  KBUILD_CFLAGS := -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs \
> -                 -fno-strict-aliasing -fno-common -fshort-wchar \
> +                 -fno-strict-aliasing -fno-common -fshort-wchar -fstack-check=no \
>                   -Werror-implicit-function-declaration \
>                   -Wno-format-security \
>                   -std=gnu89
>
> But this doesn't solve the root cause, right? So if the root cause is
> "Gentoo hardened GCC is broken" please just let me know this - FWIW I'm
> in #gentoo-dev on freenode.

-fstack-check for the kernel is never going to work properly. That option is
purely for userspace, and assumes that all the logic around the 'stack guard
gap' and the auto-growing semantics is in place; that is the case for the
user stack VMA, but definitely not for the kernel stack.

It's probably the "hardened" flavor of your distro trying to push
'-fstack-check' to everything it compiles; so I actually think the Makefile
patch, sanitizing CFLAGS by force-disabling -fstack-check, is exactly what we
should be doing.

Thanks,

--
Jiri Kosina
SUSE Labs
Re: 4.14.9 doesn't boot (regression)
On 12/30/2017 02:13 AM, Alexander Tsoy wrote:
> You are right, It's due to fstack-check enabled in gentoo's gcc spec.
> "-fstack-check=no" in KBUILD_CFLAGS fixed this problem for me. =/

This made the issue go away:

diff --git a/Makefile b/Makefile
index ac8c441866b7..11a12947c550 100644
--- a/Makefile
+++ b/Makefile
@@ -414,7 +414,7 @@ LINUXINCLUDE:= \

 KBUILD_AFLAGS := -D__ASSEMBLY__
 KBUILD_CFLAGS := -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs \
-                 -fno-strict-aliasing -fno-common -fshort-wchar \
+                 -fno-strict-aliasing -fno-common -fshort-wchar -fstack-check=no \
                  -Werror-implicit-function-declaration \
                  -Wno-format-security \
                  -std=gnu89

But this doesn't solve the root cause, right? So if the root cause is
"Gentoo hardened GCC is broken" please just let me know this - FWIW I'm
in #gentoo-dev on freenode.

--
Toralf
PGP C4EACDDE 0076E94E
Re: 4.14.9 doesn't boot (regression)
On Fri, 29 Dec 2017, Linus Torvalds wrote:

> Ok, so what does seem to be consistent for everybody is that
> double-fault in the NMI backtrace.
>
> So the fact that the NMI always hits on a double-fault does make me
> suspect that it's an infinite stream of double-faults, and that is
> presumably also what causes the RCU timeout.

As I've been fighting with recursive double-faults lately (backporting PTI
to ancient kernels), I can tell you that this is not the symptom you'd be
seeing in such a case; a recursive double fault pretty quickly overflows the
interrupt stack and triple-faults.

--
Jiri Kosina
SUSE Labs
Re: 4.14.9 doesn't boot (regression)
On 12/30/2017 04:49 AM, Josh Poimboeuf wrote:
> Alexander, would you mind reproducing again with the below patch? It
> should still fail, but this time it should hopefully show another
> RIP/RSP/EFLAGS instead of the "do_double_fault+0xb/0x140" line.

I applied that too on top of v4.15-rc5-114-g2758b3e3e630 (no other patches
and no changes to cflags or the like), ran make clean, then built and booted
the kernel. It still gets stuck; the result is in [1].

[1] https://zwiebeltoralf.de/pub/IMG_20171230_102325.jpg

--
Toralf
PGP C4EACDDE 0076E94E
Re: 4.14.9 doesn't boot (regression)
On 12/30/2017 10:14 AM, Alexander Tsoy wrote:
> Yes, and only in hardened profile, so most users don't have
> -fstack-check by default. :)

Indeed, I do run hardened Gentoo only.

--
Toralf
PGP C4EACDDE 0076E94E
Re: 4.14.9 doesn't boot (regression)
On Fri, 29/12/2017 at 17:34 -0800, Linus Torvalds wrote:
> On Fri, Dec 29, 2017 at 5:00 PM, Linus Torvalds wrote:
> >
> > Good. I was not feeling so happy about this bug report, but now I can
> > firmly just blame the gentoo compiler for having some shit-for-brains
> > "feature".
>
> Looks like I can generate similar bad code with the F26 version of
> gcc, it's just not enabled by default.
>
> So all gentoo did was change the default options.

Yes, and only in the hardened profile, so most users don't have
-fstack-check by default. :)

> I suspect we should just add a
>
>     KBUILD_CFLAGS += $(call cc-option,-fno-stack-check,)
>
> somewhere to the main Makefile, just to make sure.
>
> Maybe like the appended?
>
> Toralf, Alexander, does this make things JustWork(tm) for you?

I can confirm that with your patch my gcc produces a working kernel.
Re: 4.14.9 doesn't boot (regression)
On Fri, 29/12/2017 at 21:49 -0600, Josh Poimboeuf wrote:
> On Fri, Dec 29, 2017 at 05:10:35PM -0700, Andy Lutomirski wrote:
> > (Also, Josh, the oops code should have printed the contents of the
> > struct pt_regs at the top of the DF stack. Any idea why it
> > didn't?)
>
> Looking at one of the dumps:
>
> [ 392.774879] NMI backtrace for cpu 0
> [ 392.774881] CPU: 0 PID: 1 Comm: init Not tainted 4.14.9-gentoo #1
> [ 392.774881] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
> [ 392.774882] task: 8802368b8000 task.stack: c900c000
> [ 392.774885] RIP: 0010:double_fault+0x0/0x30
> [ 392.774886] RSP: :ff527fd0 EFLAGS: 0086
> [ 392.774887] RAX: 3fc0 RBX: 0001 RCX: c101
> [ 392.774887] RDX: 8802 RSI: RDI: ff527f58
> [ 392.774887] RBP: R08: R09:
> [ 392.774888] R10: R11: R12: 816ae726
> [ 392.774888] R13: R14: R15:
> [ 392.774889] FS: () GS:88023fc0() knlGS:
> [ 392.774889] CS: 0010 DS: ES: CR0: 80050033
> [ 392.774890] CR2: ff526f08 CR3: 000235b48002 CR4: 001606f0
> [ 392.774892] Call Trace:
> [ 392.774894] <#DF>
> [ 392.774897] do_double_fault+0xb/0x140
> [ 392.774898]
>
> It should have at least printed the #DF iret frame registers, which I
> recently added support for in "x86/unwinder: Handle stack overflows
> more gracefully", which is in both 4.14.9 and 4.15-rc5.
>
> I think the missing iret regs are due to a bug in show_trace_log_lvl(),
> where if the unwind starts with two regs frames in a row, the second
> regs don't get printed.
>
> Alexander, would you mind reproducing again with the below patch? It
> should still fail, but this time it should hopefully show another
> RIP/RSP/EFLAGS instead of the "do_double_fault+0xb/0x140" line.

Yes, it works:

[ 23.058064] NMI backtrace for cpu 2
[ 23.058068] CPU: 2 PID: 1 Comm: init Not tainted 4.15.0-rc5+ #1
[ 23.058069] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc27 04/01/2014
[ 23.058074] RIP: 0010:double_fault+0x0/0x30
[ 23.058075] RSP: :fe85ffd0 EFLAGS: 0086
[ 23.058077] RAX: 3fd0 RBX: 0001 RCX: c101
[ 23.058077] RDX: 9681 RSI: RDI: fe85ff58
[ 23.058078] RBP: R08: R09:
[ 23.058079] R10: R11: R12: 92001426
[ 23.058080] R13: R14: R15:
[ 23.058083] FS: () GS:96813fd0() knlGS:
[ 23.058084] CS: 0010 DS: ES: CR0: 80050033
[ 23.058085] CR2: fe85ef08 CR3: 000137a09000 CR4: 000406a0
[ 23.058089] Call Trace:
[ 23.058101] <#DF>
[ 23.058104] RIP: 0010:do_double_fault+0xb/0x140
[ 23.058105] RSP: :fe85ef18 EFLAGS: 00010086 ORIG_RAX:
[ 23.058106] RAX: 3fd0 RBX: 0001 RCX: c101
[ 23.058107] RDX: 9681 RSI: RDI: fe85ff58
[ 23.058107] RBP: R08: R09:
[ 23.058108] R10: R11: R12: 92001426
[ 23.058108] R13: R14: R15:
[ 23.058111]
[ 23.058111] Code: 05 00 00 48 89 e7 31 f6 e8 2e 8c 61 ff e9 69 06 00 00 e8 94 05 00 00 48 89 e7 31 f6 e8 1a 8c 61 ff e9 55 06 00 00 0f 1f 44 00 00 <0f> 1f 00 48 83 c4 88 e8 e4 04 00 00 48 89 e7 48 8b 74 24 78 48
Re: 4.14.9 doesn't boot (regression)
On 12/30/2017 01:10 AM, Andy Lutomirski wrote: > Toralf, can you send the complete output of: > > objdump -dr arch/x86/kernel/traps.o > > From the build tree of a nonworking kernel? I attached it. FWIW: tfoerste@t44 ~/devel/linux $ gcc -v Using built-in specs. COLLECT_GCC=gcc COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-pc-linux-gnu/6.4.0/lto-wrapper Target: x86_64-pc-linux-gnu Configured with: /var/tmp/portage/sys-devel/gcc-6.4.0/work/gcc-6.4.0/configure --host=x86_64-pc-linux-gnu --build=x86_64-pc-linux-gnu --prefix=/usr --bindir=/usr/x86_64-pc-linux-gnu/gcc-bin/6.4.0 --includedir=/usr/lib/gcc/x86_64-pc-linux-gnu/6.4.0/include --datadir=/usr/share/gcc-data/x86_64-pc-linux-gnu/6.4.0 --mandir=/usr/share/gcc-data/x86_64-pc-linux-gnu/6.4.0/man --infodir=/usr/share/gcc-data/x86_64-pc-linux-gnu/6.4.0/info --with-gxx-include-dir=/usr/lib/gcc/x86_64-pc-linux-gnu/6.4.0/include/g++-v6 --with-python-dir=/share/gcc-data/x86_64-pc-linux-gnu/6.4.0/python --enable-languages=c,c++ --enable-obsolete --enable-secureplt --disable-werror --with-system-zlib --enable-nls --without-included-gettext --enable-checking=release --with-bugurl=https://bugs.gentoo.org/ --with-pkgversion='Gentoo Hardened 6.4.0 p1.1' --enable-esp --enable-libstdcxx-time --disable-libstdcxx-pch --enable-shared --enable-threads=posix --enable-__cxa_atexit --enable-clocale=gnu --enable-multilib --with-multilib-list=m32,m64 --disable-altivec --disable-fixed-point --enable-targets=all --disable-libgcj --enable-libgomp --disable-libmudflap --disable-libssp --disable-libcilkrts --disable-libmpx --enable-vtable-verify --enable-libvtv --disable-libquadmath --enable-lto --without-isl --disable-libsanitizer --enable-default-pie --enable-default-ssp Thread model: posix gcc version 6.4.0 (Gentoo Hardened 6.4.0 p1.1) -- Toralf PGP C4EACDDE 0076E94E arch/x86/kernel/traps.o: file format elf64-x86-64 Disassembly of section .text: : 0: 41 57 push %r15 2: 41 56 push %r14 4: 41 55 push %r13 6: 41 54 push %r12 8: 55 push %rbp 9: 53 
push %rbx a: 48 81 ec 28 10 00 00sub$0x1028,%rsp 11: 48 83 0c 24 00 orq$0x0,(%rsp) 16: 48 81 c4 20 10 00 00add$0x1020,%rsp 1d: 65 48 8b 2c 25 00 00mov%gs:0x0,%rbp 24: 00 00 22: R_X86_64_32Scurrent_task 26: f6 81 88 00 00 00 03testb $0x3,0x88(%rcx) 2d: 4c 63 efmovslq %edi,%r13 30: 41 89 f6mov%esi,%r14d 33: 48 89 14 24 mov%rdx,(%rsp) 37: 49 89 ccmov%rcx,%r12 3a: 4d 89 c7mov%r8,%r15 3d: 4c 89 cbmov%r9,%rbx 40: 75 3b jne7d 42: 44 89 eemov%r13d,%esi 45: 48 89 cfmov%rcx,%rdi 48: e8 00 00 00 00 callq 4d 49: R_X86_64_PC32 fixup_exception-0x4 4d: 85 c0 test %eax,%eax 4f: 74 0f je 60 51: 48 83 c4 08 add$0x8,%rsp 55: 5b pop%rbx 56: 5d pop%rbp 57: 41 5c pop%r12 59: 41 5d pop%r13 5b: 41 5e pop%r14 5d: 41 5f pop%r15 5f: c3 retq 60: 48 8b 3c 24 mov(%rsp),%rdi 64: 4c 89 bd c0 09 00 00mov%r15,0x9c0(%rbp) 6b: 4c 89 famov%r15,%rdx 6e: 4c 89 e6mov%r12,%rsi 71: 4c 89 ad b8 09 00 00mov%r13,0x9b8(%rbp) 78: e8 00 00 00 00 callq 7d 79: R_X86_64_PC32 die-0x4 7d: 8b 05 00 00 00 00 mov0x0(%rip),%eax# 83 7f: R_X86_64_PC32 show_unhandled_signals-0x4 83: 4c 89 bd c0 09 00 00mov%r15,0x9c0(%rbp) 8a: 4c 89 ad b8 09 00 00mov%r13,0x9b8(%rbp) 91: 85 c0 test %eax,%eax 93: 75 28 jnebd 95: 48 85 dbtest %rbx,%rbx 98: b8 01 00 00 00 mov$0x1,%eax 9d: 48 89 eamov%rbp,%rdx a0: 48 0f 44 d8 cmove %rax,%rbx a4: 48 83 c4 08 add$0x8,%rsp a8: 44 89 f7mov%r14d,%edi ab: 48 89 demov%rbx,%rsi ae: 5b pop%rbx af: 5d pop%rbp b0: 41 5c pop%r12 b2: 41 5d pop%r13 b4: 41 5e pop%r14 b6: 41 5f pop%r15 b8: e9 00 00 00 00 jmpq bd b9: R_X86_64_PC32 force_sig_info-0x4 bd: 44 89 f6
Re: 4.14.9 doesn't boot (regression)
On Fri, Dec 29, 2017 at 05:10:35PM -0700, Andy Lutomirski wrote:
> (Also, Josh, the oops code should have printed the contents of the
> struct pt_regs at the top of the DF stack. Any idea why it didn't?)

Looking at one of the dumps:

[ 392.774879] NMI backtrace for cpu 0
[ 392.774881] CPU: 0 PID: 1 Comm: init Not tainted 4.14.9-gentoo #1
[ 392.774881] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 392.774882] task: 8802368b8000 task.stack: c900c000
[ 392.774885] RIP: 0010:double_fault+0x0/0x30
[ 392.774886] RSP: :ff527fd0 EFLAGS: 0086
[ 392.774887] RAX: 3fc0 RBX: 0001 RCX: c101
[ 392.774887] RDX: 8802 RSI: RDI: ff527f58
[ 392.774887] RBP: R08: R09:
[ 392.774888] R10: R11: R12: 816ae726
[ 392.774888] R13: R14: R15:
[ 392.774889] FS: () GS:88023fc0() knlGS:
[ 392.774889] CS: 0010 DS: ES: CR0: 80050033
[ 392.774890] CR2: ff526f08 CR3: 000235b48002 CR4: 001606f0
[ 392.774892] Call Trace:
[ 392.774894] <#DF>
[ 392.774897] do_double_fault+0xb/0x140
[ 392.774898]

It should have at least printed the #DF iret frame registers, which I
recently added support for in "x86/unwinder: Handle stack overflows more
gracefully", which is in both 4.14.9 and 4.15-rc5.

I think the missing iret regs are due to a bug in show_trace_log_lvl(),
where if the unwind starts with two regs frames in a row, the second
regs don't get printed.

Alexander, would you mind reproducing again with the below patch? It
should still fail, but this time it should hopefully show another
RIP/RSP/EFLAGS instead of the "do_double_fault+0xb/0x140" line.

diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 36b17e0febe8..39a320d077aa 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -103,6 +103,7 @@ void show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,

 	unwind_start(&state, task, regs, stack);
 	stack = stack ? : get_stack_pointer(task, regs);
+	regs = unwind_get_entry_regs(&state);

 	/*
 	 * Iterate through the stacks, starting with the current stack pointer.
@@ -120,7 +121,7 @@ void show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
 	 * - hardirq stack
 	 * - entry stack
 	 */
-	for (regs = NULL; stack; stack = PTR_ALIGN(stack_info.next_sp, sizeof(long))) {
+	for ( ; stack; stack = PTR_ALIGN(stack_info.next_sp, sizeof(long))) {
 		const char *stack_name;

 		if (get_stack_info(stack, task, &stack_info, &visit_mask)) {
Re: 4.14.9 doesn't boot (regression)
On Fri, Dec 29, 2017 at 5:00 PM, Linus Torvalds wrote:
>
> Good. I was not feeling so happy about this bug report, but now I can
> firmly just blame the gentoo compiler for having some shit-for-brains
> "feature".

Looks like I can generate similar bad code with the F26 version of
gcc, it's just not enabled by default.

So all gentoo did was change the default options.

I suspect we should just add a

    KBUILD_CFLAGS += $(call cc-option,-fno-stack-check,)

somewhere to the main Makefile, just to make sure.

Maybe like the appended?

Toralf, Alexander, does this make things JustWork(tm) for you?

                Linus

 Makefile | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/Makefile b/Makefile
index ac8c441866b7..92b74bcd3c2a 100644
--- a/Makefile
+++ b/Makefile
@@ -789,6 +789,9 @@ KBUILD_CFLAGS += $(call cc-disable-warning, pointer-sign)
 # disable invalid "can't wrap" optimizations for signed / pointers
 KBUILD_CFLAGS += $(call cc-option,-fno-strict-overflow)

+# Make sure -fstack-check isn't enabled (like gentoo apparently did)
+KBUILD_CFLAGS += $(call cc-option,-fno-stack-check,)
+
 # conserve stack if available
 KBUILD_CFLAGS += $(call cc-option,-fconserve-stack)
Re: 4.14.9 doesn't boot (regression)
On Fri, 29/12/2017 at 17:10 -0700, Andy Lutomirski wrote:
>
> Also, you wouldn't happen to be using Gentoo perchance? I already
> have two reports of a Gentoo system miscompiling the vDSO due to
> Gentoo enabling -fstack-check and GCC generating stack check code
> that is highly suboptimal, actively incorrect, and doesn't even
> manage to check the stack in a particularly helpful way.
>
> If this is indeed what's going on, I'm going to try to come up with a
> patch to outright fail the build on these buggy systems. We could
> probably fudge the build options to avoid the problem, but Gentoo
> really just needs to fix its toolchain.

You are right, it's due to -fstack-check being enabled in Gentoo's gcc spec.
"-fstack-check=no" in KBUILD_CFLAGS fixed this problem for me. =/
Re: 4.14.9 doesn't boot (regression)
On Fri, Dec 29, 2017 at 4:10 PM, Andy Lutomirski wrote:
>
> Double faults use IST, so a double fault that double faults will effectively
> just start over rather than eventually running out of stack and triple
> faulting.
>
> But check out the registers. We have RSP = ...28fd8 and CR2 = ...27f08.
> IOW the double fault stack is ...28000 - ...28fff and we're somehow getting
> a failed page fault a couple hundred bytes below the bottom of the IST stack.
> IOW, I think we're just stuck in a neverending loop of stack overflows.

Ahh, good catch. This feels like it might finally be explaining things.

> (Also, Josh, the oops code should have printed the contents of the struct
> pt_regs at the top of the DF stack. Any idea why it didn't?)
>
> Toralf, can you send the complete output of:
>
>     objdump -dr arch/x86/kernel/traps.o
>
> From the build tree of a nonworking kernel?

Alexander made one of his failing kernels available earlier:

    https://www.dropbox.com/s/yesupqgig3uxf73/linux-4.15-rc5%2B.tar.xz?dl=0

and yes, there's something seriously wrong there. Doing a disassembly
on "do_double_fault()" shows:

  8101bda0 <do_double_fault>:
  8101bda0:  41 54                 push   %r12
  8101bda2:  55                    push   %rbp
  8101bda3:  53                    push   %rbx
  8101bda4:  48 81 ec 20 10 00 00  sub    $0x1020,%rsp
  8101bdab:  48 83 0c 24 00        orq    $0x0,(%rsp)
  8101bdb0:  48 81 c4 20 10 00 00  add    $0x1020,%rsp

WTF? That's bogus crap, and not ok in the kernel. Doing a stack probe
below the stack by subtracting 4128 from the stack pointer and then
oring it, and then resetting the stack pointer again is just crazy.

And it's definitely not ever going to work for the kernel that has a
limited stack.

So yes, it's a terminally broken compiler from hell. I assume gentoo
has applied some completely broken security patch to their compiler,
turning said compiler into complete garbage.

Doing some trivial grepping on the disassembly in that vmlinux file,
there's tons of those "let's probe more than a page below the stack"
issues. The biggest offset I found was 0x1400. That one happened to be
in do_sys_poll().

> Also, you wouldn't happen to be using Gentoo perchance?

Yes, several people involved are using gentoo. Maybe everybody.

> I already have two reports of a Gentoo system miscompiling the vDSO
> due to Gentoo enabling -fstack-check and GCC generating stack check
> code that is highly suboptimal, actively incorrect, and doesn't even
> manage to check the stack in a particularly helpful way.

Yes. Good. I think you root-caused it.

Good. I was not feeling so happy about this bug report, but now I can
firmly just blame the gentoo compiler for having some shit-for-brains
"feature".

                Linus
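As a rough illustration of why that probe is a problem, the sketch below redoes the arithmetic in C. It is not kernel code: the function names and the 4096-byte page size constant are assumptions made for this example. Only the offsets 0x1020 and 0x1400 are taken from the disassembly discussed above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption for this sketch: the x86-64 base page size of 4096 bytes. */
#define PAGE_SIZE_BYTES 4096u

/*
 * Address touched by `orq $0x0,(%rsp)` once `sub $frame,%rsp` has run.
 * `sp` is the stack pointer before the subtraction (hypothetical input).
 */
static inline uint64_t probe_address(uint64_t sp, uint64_t frame)
{
    assert(frame <= sp); /* avoid wrapping around zero */
    return sp - frame;
}

/* True if the probe reaches more than one full page below the stack pointer. */
static inline bool probe_exceeds_page(uint64_t frame)
{
    return frame > PAGE_SIZE_BYTES;
}
```

For instance, probe_exceeds_page(0x1020) is true, since 0x1020 is 4128 bytes. With a hypothetical stack pointer of 0x29000, just above a one-page stack spanning 0x28000 - 0x28fff, probe_address() gives 0x27fe0, which already lies below that stack.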
Re: 4.14.9 doesn't boot (regression)
> On Dec 29, 2017, at 3:53 PM, Linus Torvalds wrote:
>
>> On Fri, Dec 29, 2017 at 2:30 PM, Toralf Förster wrote:
>>
>> The bad news - the issue is not solved with the changed cflags.
>> The good news - I could eventually compile a working config for my desktop
>> (works fine with 4.14.10 with generic CPU) having a higher screen resolution
>> during boot.
>>
>> So I made a "make distclean", followed by a "sudo zcat /proc/config.gz >
>> .config", changed the .config to use MCORE2 instead of GENERIC and defined
>> the string "-local" to ensure that the modules directory is really unique.
>> Then I ran "time make -j4 && sudo make modules_install && sudo cp
>> arch/x86_64/boot/bzImage /boot/vmlinuz-0 && sudo grub-mkconfig -o
>> /boot/grub/grub.cfg", booted and took 3 photos which were uploaded to [1],
>> look for IMG_*
>
> Ok, so what does seem to be consistent for everybody is that
> double-fault in the NMI backtrace.
>
> So the fact that the NMI always hits on a double-fault does make me
> suspect that it's an infinite stream of double-faults, and that is
> presumably also what causes the RCU timeout.
>
> And as I pointed out elsewhere (damn two threads), I think that it
> would help to simply catch the *first* double-fault.
>
> And I *think* that the only thing that can make a double-fault
> silently be re-tried is the CONFIG_X86_ESPFIX64 case, so if you can
> build a failing kernel with the CONFIG_X86_ESPFIX64 case disabled in
> arch/x86/kernel/traps.c do_double_fault(), that would be interesting.

Double faults use IST, so a double fault that double faults will
effectively just start over rather than eventually running out of stack
and triple faulting.

But check out the registers. We have RSP = ...28fd8 and CR2 = ...27f08.
IOW the double fault stack is ...28000 - ...28fff and we're somehow
getting a failed page fault a couple hundred bytes below the bottom of
the IST stack. IOW, I think we're just stuck in a neverending loop of
stack overflows.

(Also, Josh, the oops code should have printed the contents of the
struct pt_regs at the top of the DF stack. Any idea why it didn't?)

Toralf, can you send the complete output of:

    objdump -dr arch/x86/kernel/traps.o

From the build tree of a nonworking kernel?

Also, you wouldn't happen to be using Gentoo perchance? I already have
two reports of a Gentoo system miscompiling the vDSO due to Gentoo
enabling -fstack-check and GCC generating stack check code that is
highly suboptimal, actively incorrect, and doesn't even manage to check
the stack in a particularly helpful way. If this is indeed what's going
on, I'm going to try to come up with a patch to outright fail the build
on these buggy systems. We could probably fudge the build options to
avoid the problem, but Gentoo really just needs to fix its toolchain.
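[Editorial note] The register arithmetic in the message above can be checked with a few lines of Python. This is only a sketch: the thread shows just the low bits of each address ("...28fd8"), so the code works on those low bits and assumes the elided high bits of RSP and CR2 are identical. The one-page stack range (...28000 - ...28fff) is taken from the message; the helper name `bytes_below_stack` is made up for illustration.

```python
# Low bits only; the elided high bits ("...") are assumed equal for RSP and CR2.
PAGE_SIZE = 0x1000  # the message gives the #DF stack as ...28000 - ...28fff

rsp = 0x28FD8  # RSP from the NMI backtrace (low bits)
cr2 = 0x27F08  # CR2: address of the failed access (low bits)

# Round RSP down to the page boundary to get the bottom of the stack page.
stack_bottom = rsp & ~(PAGE_SIZE - 1)
stack_top = stack_bottom + PAGE_SIZE - 1


def bytes_below_stack(fault_addr: int, bottom: int) -> int:
    """Return how far fault_addr lies below bottom (0 if it is not below)."""
    return max(0, bottom - fault_addr)


overrun = bytes_below_stack(cr2, stack_bottom)
print(hex(stack_bottom), hex(stack_top), overrun)  # 0x28000 0x28fff 248
```

Under those assumptions the faulting address sits 248 bytes (0xf8) below the bottom of the stack page, which matches the "couple hundred bytes" estimate.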
Re: 4.14.9 doesn't boot (regression)
On 12/29/2017 11:53 PM, Linus Torvalds wrote:
> So just change the
>
>     #ifdef CONFIG_X86_ESPFIX64
>
> into a
>
>     #if 0
>
> and see if instead of the RCU stall after 20 seconds, you get an
> immediate double fault error report instead?

well, 3 IMG_20171230_0008* should show the results

https://zwiebeltoralf.de/pub/

-- 
Toralf
PGP C4EACDDE 0076E94E
Re: 4.14.9 doesn't boot (regression)
On Fri, Dec 29, 2017 at 2:30 PM, Toralf Förster wrote:
>
> The bad news - the issue is not solved with the changed cflags.
> The good news - I could eventually compile a working config for my desktop
> (works fine with 4.14.10 with generic CPU) having a higher screen resolution
> during boot.
>
> So I made a "make distclean", followed by a "sudo zcat /proc/config.gz >
> .config", changed the .config to use MCORE2 instead of GENERIC and defined
> the string "-local" to ensure that the modules directory is really unique.
> Then I ran "time make -j4 && sudo make modules_install && sudo cp
> arch/x86_64/boot/bzImage /boot/vmlinuz-0 && sudo grub-mkconfig -o
> /boot/grub/grub.cfg", booted and took 3 photos which were uploaded to [1],
> look for IMG_*

Ok, so what does seem to be consistent for everybody is that
double-fault in the NMI backtrace.

So the fact that the NMI always hits on a double-fault does make me
suspect that it's an infinite stream of double-faults, and that is
presumably also what causes the RCU timeout.

And as I pointed out elsewhere (damn two threads), I think that it
would help to simply catch the *first* double-fault.

And I *think* that the only thing that can make a double-fault
silently be re-tried is the CONFIG_X86_ESPFIX64 case, so if you can
build a failing kernel with the CONFIG_X86_ESPFIX64 case disabled in
arch/x86/kernel/traps.c do_double_fault(), that would be interesting.

So just change the

    #ifdef CONFIG_X86_ESPFIX64

into a

    #if 0

and see if instead of the RCU stall after 20 seconds, you get an
immediate double fault error report instead?

I'm still entirely confused about why that MCORE2 would make _any_
difference what-so-ever, so this is all fishing for random clues in
the dark.

                Linus
Re: 4.14.9 doesn't boot (regression)
On 12/29/2017 10:17 PM, Linus Torvalds wrote:
> On Fri, Dec 29, 2017 at 1:02 PM, Toralf Förster wrote:
>> On 12/29/2017 09:12 PM, Linus Torvalds wrote:
>>> instead, and see if that makes a difference, that would narrow down
>>> the possible root cause of this problem.
>>
>> not at this ThinkPad T440s (didn't test at the server with an i7-3930).
>>
>> Boot stops just at:
>>
>> tsc: Refined TSC clocksource calibration: 2494.225 MHz
>> clocksource: tsc: mask: 0x max_cycles:
>> 0x23f3ea95b09, max_idle_ns: 440795287034 ns
>
> Uhhuh. So for Alexander Tsoy, just getting rid of the -march=core2
> fixed the boot.
>
> But not for you.
>
> Strange. It really looked like the exact same thing.
>
>> This is a "Intel(R) Core(TM) i5-4300U CPU @ 1.90GHz" with gcc-6.4
>
> Yeah, other reporters of this have used gcc-6.4.0 too.
>
> But there's been some muddying of the waters there too - changing
> compilers has fixed it for some cases, but there's at least one
> report that a kernel build with gcc-7.2.0 still had the issue (and
> another that said it didn't).
>
> But the MCORE2 was consistent for several people - including you.
> Until this point.
>
> Strange.
>
> The only other thing (apart from the compiler flag) that MCORE2
> results in is to enable
>
>     CONFIG_X86_INTEL_USERCOPY
>     CONFIG_X86_USE_PPRO_CHECKSUM
>     CONFIG_X86_P6_NOP
>
> and the first two of those shouldn't even matter on x86-64, and I
> don't see that last one making any difference either.
>
> So because it looks so impossible that the "-march=core2" didn't make
> a difference for you, I'll ask you to please double-check that you
> actually booted into the right kernel.
>
> Sorry for doubting you, but your report just broke the _one_
> consistent thing we've seen about this bug.
>
> Linus
>

I double-checked it.

The bad news - the issue is not solved with the changed cflags.
The good news - I could eventually compile a working config for my desktop
(works fine with 4.14.10 with generic CPU) having a higher screen resolution
during boot.

So I made a "make distclean", followed by a "sudo zcat /proc/config.gz >
.config", changed the .config to use MCORE2 instead of GENERIC and defined
the string "-local" to ensure that the modules directory is really unique.
Then I ran "time make -j4 && sudo make modules_install && sudo cp
arch/x86_64/boot/bzImage /boot/vmlinuz-0 && sudo grub-mkconfig -o
/boot/grub/grub.cfg", booted and took 3 photos which were uploaded to [1],
look for IMG_*

[1] https://zwiebeltoralf.de/pub/

-- 
Toralf
PGP C4EACDDE 0076E94E
Re: 4.14.9 doesn't boot (regression)
On Fri, 29/12/2017 at 13:39 -0800, Linus Torvalds wrote:
> On Fri, Dec 29, 2017 at 1:17 PM, Linus Torvalds wrote:
> >
> > Yeah, other reporters of this have used gcc-6.4.0 too.
> >
> > But there's been some muddying of the waters there too - changing
> > compilers has fixed it for some cases, but there's at least one
> > report that a kernel build with gcc-7.2.0 still had the issue (and
> > another that said it didn't).
>
> Side note: I'm not convinced that we will reliably catch a compiler
> version change in our dependency analysis, so it's probably best to
> "make clean" between switching compilers to make sure that you don't
> have old object files with the old compiler.

I did "make clean" after changing compiler flags.
Re: 4.14.9 doesn't boot (regression)
On Fri, Dec 29, 2017 at 1:17 PM, Linus Torvalds wrote:
>
> Yeah, other reporters of this have used gcc-6.4.0 too.
>
> But there's been some muddying of the waters there too - changing
> compilers has fixed it for some cases, but there's at least one
> report that a kernel build with gcc-7.2.0 still had the issue (and
> another that said it didn't).

Side note: I'm not convinced that we will reliably catch a compiler
version change in our dependency analysis, so it's probably best to
"make clean" between switching compilers to make sure that you don't
have old object files with the old compiler.

> But the MCORE2 was consistent for several people - including you.
> Until this point.

.. and our build infrastructure definitely _should_ catch compiler
switch changes automatically and force a re-build.

                Linus
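[Editorial note] One way to act on the "make clean between switching compilers" advice is to keep a small stamp file next to the build tree and compare it before each build. The sketch below is hypothetical and is not part of the kernel build system: the helper names `compiler_fingerprint` and `needs_clean`, and the idea of a stamp file, are assumptions made purely for illustration.

```python
import hashlib
from pathlib import Path


def compiler_fingerprint(version_text: str) -> str:
    """Hash the compiler's version banner so that any change is detected."""
    return hashlib.sha256(version_text.encode("utf-8")).hexdigest()


def needs_clean(stamp: Path, version_text: str) -> bool:
    """Return True if the compiler differs from the one recorded in stamp.

    The stamp file (a hypothetical location chosen by the caller) is
    rewritten on every call, so the next build compares against the
    compiler used this time. A missing stamp counts as "no change".
    """
    current = compiler_fingerprint(version_text)
    previous = stamp.read_text().strip() if stamp.exists() else None
    stamp.write_text(current + "\n")
    return previous is not None and previous != current
```

In practice `version_text` could be the output of `gcc --version`, captured by the caller, and a True result would be the cue to run "make clean" before rebuilding.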
Re: 4.14.9 doesn't boot (regression)
On Fri, Dec 29, 2017 at 1:02 PM, Toralf Förster wrote:
> On 12/29/2017 09:12 PM, Linus Torvalds wrote:
>> instead, and see if that makes a difference, that would narrow down
>> the possible root cause of this problem.
>
> not at this ThinkPad T440s (didn't test at the server with an i7-3930).
>
> Boot stops just at:
>
> tsc: Refined TSC clocksource calibration: 2494.225 MHz
> clocksource: tsc: mask: 0x max_cycles: 0x23f3ea95b09,
> max_idle_ns: 440795287034 ns

Uhhuh. So for Alexander Tsoy, just getting rid of the -march=core2
fixed the boot.

But not for you.

Strange. It really looked like the exact same thing.

> This is a "Intel(R) Core(TM) i5-4300U CPU @ 1.90GHz" with gcc-6.4

Yeah, other reporters of this have used gcc-6.4.0 too.

But there's been some muddying of the waters there too - changing
compilers has fixed it for some cases, but there's at least one
report that a kernel build with gcc-7.2.0 still had the issue (and
another that said it didn't).

But the MCORE2 was consistent for several people - including you.
Until this point.

Strange.

The only other thing (apart from the compiler flag) that MCORE2
results in is to enable

    CONFIG_X86_INTEL_USERCOPY
    CONFIG_X86_USE_PPRO_CHECKSUM
    CONFIG_X86_P6_NOP

and the first two of those shouldn't even matter on x86-64, and I
don't see that last one making any difference either.

So because it looks so impossible that the "-march=core2" didn't make
a difference for you, I'll ask you to please double-check that you
actually booted into the right kernel.

Sorry for doubting you, but your report just broke the _one_
consistent thing we've seen about this bug.

                Linus
Re: 4.14.9 doesn't boot (regression)
On 12/29/2017 09:12 PM, Linus Torvalds wrote:
> instead, and see if that makes a difference, that would narrow down
> the possible root cause of this problem.

not at this ThinkPad T440s (didn't test at the server with an i7-3930).

Boot stops just at:

tsc: Refined TSC clocksource calibration: 2494.225 MHz
clocksource: tsc: mask: 0x max_cycles: 0x23f3ea95b09, max_idle_ns: 440795287034 ns

I changed the Makefile according to your suggestion:

~/devel/linux $ git diff
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 3e73bc255e4e..fb695558821b 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -128,7 +128,7 @@ else
         cflags-$(CONFIG_MPSC) += $(call cc-option,-march=nocona)
 
         cflags-$(CONFIG_MCORE2) += \
-                $(call cc-option,-march=core2,$(call cc-option,-mtune=generic))
+                $(call cc-option,-mtune=generic)
         cflags-$(CONFIG_MATOM) += $(call cc-option,-march=atom) \
                 $(call cc-option,-mtune=atom,$(call cc-option,-mtune=generic))
         cflags-$(CONFIG_GENERIC_CPU) += $(call cc-option,-mtune=generic)
~/devel/linux $ git describe
v4.15-rc5-114-g2758b3e3e630

This is a "Intel(R) Core(TM) i5-4300U CPU @ 1.90GHz" with gcc-6.4

.config attached

-- 
Toralf
PGP C4EACDDE 0076E94E

# # Automatically generated file; DO NOT EDIT.
# Linux/x86 4.15.0-rc5 Kernel Configuration # CONFIG_64BIT=y CONFIG_X86_64=y CONFIG_X86=y CONFIG_INSTRUCTION_DECODER=y CONFIG_OUTPUT_FORMAT="elf64-x86-64" CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig" CONFIG_LOCKDEP_SUPPORT=y CONFIG_STACKTRACE_SUPPORT=y CONFIG_MMU=y CONFIG_ARCH_MMAP_RND_BITS_MIN=28 CONFIG_ARCH_MMAP_RND_BITS_MAX=32 CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MIN=8 CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MAX=16 CONFIG_NEED_DMA_MAP_STATE=y CONFIG_NEED_SG_DMA_LENGTH=y CONFIG_GENERIC_ISA_DMA=y CONFIG_GENERIC_BUG=y CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y CONFIG_GENERIC_HWEIGHT=y CONFIG_ARCH_MAY_HAVE_PC_FDC=y CONFIG_RWSEM_XCHGADD_ALGORITHM=y CONFIG_GENERIC_CALIBRATE_DELAY=y CONFIG_ARCH_HAS_CPU_RELAX=y CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y CONFIG_HAVE_SETUP_PER_CPU_AREA=y CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y CONFIG_ARCH_HIBERNATION_POSSIBLE=y CONFIG_ARCH_SUSPEND_POSSIBLE=y CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y CONFIG_ARCH_WANT_GENERAL_HUGETLB=y CONFIG_ZONE_DMA32=y CONFIG_AUDIT_ARCH=y CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y CONFIG_HAVE_INTEL_TXT=y CONFIG_X86_64_SMP=y CONFIG_ARCH_SUPPORTS_UPROBES=y CONFIG_FIX_EARLYCON_MEM=y CONFIG_PGTABLE_LEVELS=4 CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config" CONFIG_IRQ_WORK=y CONFIG_BUILDTIME_EXTABLE_SORT=y CONFIG_THREAD_INFO_IN_TASK=y # # General setup # CONFIG_INIT_ENV_ARG_LIMIT=32 CONFIG_CROSS_COMPILE="" # CONFIG_COMPILE_TEST is not set CONFIG_LOCALVERSION="-0" # CONFIG_LOCALVERSION_AUTO is not set CONFIG_HAVE_KERNEL_GZIP=y CONFIG_HAVE_KERNEL_BZIP2=y CONFIG_HAVE_KERNEL_LZMA=y CONFIG_HAVE_KERNEL_XZ=y CONFIG_HAVE_KERNEL_LZO=y CONFIG_HAVE_KERNEL_LZ4=y CONFIG_KERNEL_GZIP=y # CONFIG_KERNEL_BZIP2 is not set # CONFIG_KERNEL_LZMA is not set # CONFIG_KERNEL_XZ is not set # CONFIG_KERNEL_LZO is not set # CONFIG_KERNEL_LZ4 is not set CONFIG_DEFAULT_HOSTNAME="(none)" CONFIG_SWAP=y CONFIG_SYSVIPC=y CONFIG_SYSVIPC_SYSCTL=y CONFIG_POSIX_MQUEUE=y 
CONFIG_POSIX_MQUEUE_SYSCTL=y CONFIG_CROSS_MEMORY_ATTACH=y CONFIG_USELIB=y CONFIG_AUDIT=y CONFIG_HAVE_ARCH_AUDITSYSCALL=y CONFIG_AUDITSYSCALL=y CONFIG_AUDIT_WATCH=y CONFIG_AUDIT_TREE=y # # IRQ subsystem # CONFIG_GENERIC_IRQ_PROBE=y CONFIG_GENERIC_IRQ_SHOW=y CONFIG_GENERIC_IRQ_EFFECTIVE_AFF_MASK=y CONFIG_GENERIC_PENDING_IRQ=y CONFIG_GENERIC_IRQ_MIGRATION=y CONFIG_IRQ_DOMAIN=y CONFIG_IRQ_DOMAIN_HIERARCHY=y CONFIG_GENERIC_MSI_IRQ=y CONFIG_GENERIC_MSI_IRQ_DOMAIN=y CONFIG_GENERIC_IRQ_MATRIX_ALLOCATOR=y CONFIG_GENERIC_IRQ_RESERVATION_MODE=y # CONFIG_IRQ_DOMAIN_DEBUG is not set CONFIG_IRQ_FORCED_THREADING=y CONFIG_SPARSE_IRQ=y # CONFIG_GENERIC_IRQ_DEBUGFS is not set CONFIG_CLOCKSOURCE_WATCHDOG=y CONFIG_ARCH_CLOCKSOURCE_DATA=y CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE=y CONFIG_GENERIC_TIME_VSYSCALL=y CONFIG_GENERIC_CLOCKEVENTS=y CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y CONFIG_GENERIC_CMOS_UPDATE=y # # Timers subsystem # CONFIG_TICK_ONESHOT=y CONFIG_NO_HZ_COMMON=y # CONFIG_HZ_PERIODIC is not set CONFIG_NO_HZ_IDLE=y # CONFIG_NO_HZ_FULL is not set CONFIG_NO_HZ=y CONFIG_HIGH_RES_TIMERS=y # # CPU/Task time and stats accounting # CONFIG_TICK_CPU_ACCOUNTING=y # CONFIG_VIRT_CPU_ACCOUNTING_GEN is not set # CONFIG_IRQ_TIME_ACCOUNTING is not set CONFIG_BSD_PROCESS_ACCT=y # CONFIG_BSD_PROCESS_ACCT_V3 is not set CONFIG_TASKSTATS=y CONFIG_TASK_DELAY_ACCT=y CONFIG_TASK_XACCT=y CONFIG_TASK_IO_ACCOUNTING=y # CONFIG_CPU_ISOLATION is not set # # RCU Subsystem # CONFIG_TREE_RCU=y # CONFIG_RCU_EXPERT is not set CONFIG_SRCU=y CONFIG_TREE_SRCU=y # CONFIG_TASKS_RCU is not set CONFIG_RCU_STALL_COMMON=y CONFIG_RCU_NEED_SEG
Re: 4.14.9 doesn't boot (regression)
* Linus Torvalds wrote:

> On Fri, Dec 29, 2017 at 3:14 AM, Toralf Förster wrote:
> >
> > For the server the attached .config works fine but switching from
> > CONFIG_GENERIC_CPU to CONFIG_MCORE2 lets them hang at boot w/o any
> > messages. Similar picture at the desktop.
>
> Ok, so there's another thread ("4.14.9 with CONFIG_MCORE2 fails to
> boot") about this same thing, but one thing to try is to see if it's
> just the
>
>         cflags-$(CONFIG_MCORE2) += \
>                 $(call cc-option,-march=core2,$(call cc-option,-mtune=generic))
>
> in arch/x86/Makefile that causes this.
>
> The MCORE2 option does potentially have a few other effects (see
> arch/x86/Kconfig.cpu), but the first one to check might be just that
> compiler command line effect.
>
> So if you can edit arch/x86/Makefile, and just make that say
>
>         cflags-$(CONFIG_MCORE2) += $(call cc-option,-mtune=generic)
>
> instead, and see if that makes a difference, that would narrow down
> the possible root cause of this problem.

Or, if it's more convenient, you can try Linus's suggestion by applying
the patch below.

Thanks,

        Ingo

===>
 arch/x86/Makefile | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 3e73bc255e4e..1835752fffc9 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -127,8 +127,8 @@ else
         cflags-$(CONFIG_MK8) += $(call cc-option,-march=k8)
         cflags-$(CONFIG_MPSC) += $(call cc-option,-march=nocona)
 
-        cflags-$(CONFIG_MCORE2) += \
-                $(call cc-option,-march=core2,$(call cc-option,-mtune=generic))
+        cflags-$(CONFIG_MCORE2) += $(call cc-option,-mtune=generic)
+
         cflags-$(CONFIG_MATOM) += $(call cc-option,-march=atom) \
                 $(call cc-option,-mtune=atom,$(call cc-option,-mtune=generic))
         cflags-$(CONFIG_GENERIC_CPU) += $(call cc-option,-mtune=generic)
Re: 4.14.9 doesn't boot (regression)
On Fri, Dec 29, 2017 at 3:14 AM, Toralf Förster wrote:
>
> For the server the attached .config works fine but switching from
> CONFIG_GENERIC_CPU to CONFIG_MCORE2 lets them hang at boot w/o any
> messages. Similar picture at the desktop.

Ok, so there's another thread ("4.14.9 with CONFIG_MCORE2 fails to
boot") about this same thing, but one thing to try is to see if it's
just the

        cflags-$(CONFIG_MCORE2) += \
                $(call cc-option,-march=core2,$(call cc-option,-mtune=generic))

in arch/x86/Makefile that causes this.

The MCORE2 option does potentially have a few other effects (see
arch/x86/Kconfig.cpu), but the first one to check might be just that
compiler command line effect.

So if you can edit arch/x86/Makefile, and just make that say

        cflags-$(CONFIG_MCORE2) += $(call cc-option,-mtune=generic)

instead, and see if that makes a difference, that would narrow down
the possible root cause of this problem.

                Linus
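[Editorial note] For readers unfamiliar with kbuild's cc-option: it expands to its first argument if the compiler accepts that flag, and to the optional second argument otherwise. The Python model below is only a sketch of that fallback logic as it applies to the two Makefile variants discussed in the message above. The function names and the `supported` set are invented for illustration; the real macro probes the compiler rather than consulting a set.

```python
from typing import Collection


def cc_option(flag: str, fallback: str, supported: Collection[str]) -> str:
    """Model of kbuild's cc-option: use flag if supported, else fallback."""
    return flag if flag in supported else fallback


def mcore2_cflags(supported: Collection[str], patched: bool = False) -> str:
    """Flag added for CONFIG_MCORE2, before and after the suggested edit."""
    # Inner call: $(call cc-option,-mtune=generic) has no fallback.
    generic = cc_option("-mtune=generic", "", supported)
    if patched:
        # Suggested test: cflags-$(CONFIG_MCORE2) += $(call cc-option,-mtune=generic)
        return generic
    # Original: $(call cc-option,-march=core2,$(call cc-option,-mtune=generic))
    return cc_option("-march=core2", generic, supported)


modern_gcc = {"-march=core2", "-mtune=generic"}
print(mcore2_cflags(modern_gcc))                # -march=core2
print(mcore2_cflags(modern_gcc, patched=True))  # -mtune=generic
```

With any compiler that accepts -march=core2, the original line therefore always selects it, and the suggested edit removes it while leaving the rest of the MCORE2 configuration untouched. That is what makes the edit a clean test of the compiler flag alone.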
Re: 4.14.9 doesn't boot (regression)
On 12/29/2017 04:48 PM, Alexander Tsoy wrote:
> On Fri, 29/12/2017 at 12:14 +0100, Toralf Förster wrote:
>> I can now confirm that this kernel breaks both a desktop (a
>> ThinkPad T440s i5) and a headless server (i3930) setup. For the
>> server the attached .config works fine but switching from
>> CONFIG_GENERIC_CPU to CONFIG_MCORE2 lets them hang at boot w/o any
>> messages. Similar picture at the desktop.
>
> You most likely have the same problem as me:
> https://lkml.org/lkml/2017/12/29/279
>

Indeed, I got a similar message at my ThinkPad too when I tried to bisect it:

>[ 21.776011] INFO: rcu_preempt detected stalls on CPUs/tasks:
>[ 21.777008] 0-...!: (0 ticks this GP) idle=c56/140/0 softirq=73/73 fqs=0
>[ 21.777008] (detected by 1, t=21002 jiffies, g=-255, c=-256, q=4)

-- 
Toralf
PGP C4EACDDE 0076E94E
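[Editorial note] The "t=21002 jiffies" field in the stall report above can be turned into wall-clock time once the tick rate is known. The sketch below assumes HZ=1000, which is an assumption: the relevant part of the posted .config is not visible in this thread. The helper name `stall_seconds` is made up for illustration.

```python
import re

# Last line of the stall report quoted in the message above.
STALL_LINE = "(detected by 1, t=21002 jiffies, g=-255, c=-256, q=4)"


def stall_seconds(line: str, hz: int) -> float:
    """Convert the 't=<n> jiffies' field of an RCU stall line to seconds."""
    match = re.search(r"\bt=(\d+) jiffies", line)
    if match is None:
        raise ValueError("no 't=<n> jiffies' field in line")
    return int(match.group(1)) / hz


# HZ=1000 is assumed here, not taken from the reporter's configuration.
print(stall_seconds(STALL_LINE, hz=1000))  # 21.002
```

Under that assumption the stall was flagged about 21 seconds into the grace period, which lines up with the ~21.78 s log timestamps and with the "RCU stall after 20 seconds" mentioned elsewhere in the thread.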
Re: 4.14.9 doesn't boot (regression)
On Fri, 29/12/2017 at 12:14 +0100, Toralf Förster wrote:
> I can now confirm that this kernel breaks both a desktop (a
> ThinkPad T440s i5) and a headless server (i3930) setup. For the
> server the attached .config works fine but switching from
> CONFIG_GENERIC_CPU to CONFIG_MCORE2 lets them hang at boot w/o any
> messages. Similar picture at the desktop.

You most likely have the same problem as me:
https://lkml.org/lkml/2017/12/29/279
Re: 4.14.9 doesn't boot (regression)
On Fri, Dec 29, 2017 at 3:38 PM, Toralf Förster wrote:
> On 12/29/2017 02:33 PM, Sebastian Gottschall wrote:
>> bootlog?
>>
> nothing in any logs, hang happens very early in the boot process

Does it have serial? Does it use EFI?

You may try earlyprintk for EFI case or legacy UART. There was support
for PCI UARTs, though it wasn't really what I ever used.

-- 
With Best Regards,
Andy Shevchenko
Re: 4.14.9 doesn't boot (regression)
On 12/29/2017 02:33 PM, Sebastian Gottschall wrote:
> bootlog?
>

nothing in any logs, hang happens very early in the boot process

-- 
Toralf
PGP C4EACDDE 0076E94E
Re: 4.14.9 doesn't boot (regression)
bootlog?

On 29.12.2017 at 12:14, Toralf Förster wrote:
> I can now confirm that this kernel breaks both a desktop (a
> ThinkPad T440s i5) and a headless server (i3930) setup. For the
> server the attached .config works fine but switching from
> CONFIG_GENERIC_CPU to CONFIG_MCORE2 lets them hang at boot w/o any
> messages. Similar picture at the desktop.
>
> Both are stable Gentoo Linux hardened systems.
>
> This issue seems to exist in mainline too, probably visible with
> d120cd749 (stable) and 9aaefe7b59 (upstream).

-- 
Kind regards
Sebastian Gottschall / CTO

NewMedia-NET GmbH - DD-WRT
Registered office: Stubenwaldallee 21a, 64625 Bensheim
Register court: Amtsgericht Darmstadt, HRB 25473
Managing directors: Peter Steinhäuser, Christian Scheele
http://www.dd-wrt.com
email: s.gottsch...@dd-wrt.com
Tel.: +496251-582650 / Fax: +496251-5826565
4.14.9 doesn't boot (regression)
I can now confirm that this kernel breaks both a desktop (a ThinkPad
T440s i5) and a headless server (i3930) setup. For the server the
attached .config works fine but switching from CONFIG_GENERIC_CPU to
CONFIG_MCORE2 lets them hang at boot w/o any messages. Similar picture
at the desktop.

Both are stable Gentoo Linux hardened systems.

This issue seems to exist in mainline too, probably visible with
d120cd749 (stable) and 9aaefe7b59 (upstream).

-- 
Toralf
PGP C4EACDDE 0076E94E

# # Automatically generated file; DO NOT EDIT. # Linux/x86 4.14.9 Kernel Configuration # CONFIG_64BIT=y CONFIG_X86_64=y CONFIG_X86=y CONFIG_INSTRUCTION_DECODER=y CONFIG_OUTPUT_FORMAT="elf64-x86-64" CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig" CONFIG_LOCKDEP_SUPPORT=y CONFIG_STACKTRACE_SUPPORT=y CONFIG_MMU=y CONFIG_ARCH_MMAP_RND_BITS_MIN=28 CONFIG_ARCH_MMAP_RND_BITS_MAX=32 CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MIN=8 CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MAX=16 CONFIG_NEED_DMA_MAP_STATE=y CONFIG_NEED_SG_DMA_LENGTH=y CONFIG_GENERIC_ISA_DMA=y CONFIG_GENERIC_BUG=y CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y CONFIG_GENERIC_HWEIGHT=y CONFIG_ARCH_MAY_HAVE_PC_FDC=y CONFIG_RWSEM_XCHGADD_ALGORITHM=y CONFIG_GENERIC_CALIBRATE_DELAY=y CONFIG_ARCH_HAS_CPU_RELAX=y CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y CONFIG_HAVE_SETUP_PER_CPU_AREA=y CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y CONFIG_ARCH_HIBERNATION_POSSIBLE=y CONFIG_ARCH_SUSPEND_POSSIBLE=y CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y CONFIG_ARCH_WANT_GENERAL_HUGETLB=y CONFIG_ZONE_DMA32=y CONFIG_AUDIT_ARCH=y CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y CONFIG_HAVE_INTEL_TXT=y CONFIG_X86_64_SMP=y CONFIG_ARCH_SUPPORTS_UPROBES=y CONFIG_FIX_EARLYCON_MEM=y CONFIG_PGTABLE_LEVELS=4 CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config" CONFIG_IRQ_WORK=y CONFIG_BUILDTIME_EXTABLE_SORT=y CONFIG_THREAD_INFO_IN_TASK=y # # General setup # CONFIG_INIT_ENV_ARG_LIMIT=32 CONFIG_CROSS_COMPILE="" #
CONFIG_COMPILE_TEST is not set CONFIG_LOCALVERSION="" # CONFIG_LOCALVERSION_AUTO is not set CONFIG_HAVE_KERNEL_GZIP=y CONFIG_HAVE_KERNEL_BZIP2=y CONFIG_HAVE_KERNEL_LZMA=y CONFIG_HAVE_KERNEL_XZ=y CONFIG_HAVE_KERNEL_LZO=y CONFIG_HAVE_KERNEL_LZ4=y CONFIG_KERNEL_GZIP=y # CONFIG_KERNEL_BZIP2 is not set # CONFIG_KERNEL_LZMA is not set # CONFIG_KERNEL_XZ is not set # CONFIG_KERNEL_LZO is not set # CONFIG_KERNEL_LZ4 is not set CONFIG_DEFAULT_HOSTNAME="(none)" CONFIG_SWAP=y CONFIG_SYSVIPC=y CONFIG_SYSVIPC_SYSCTL=y CONFIG_POSIX_MQUEUE=y CONFIG_POSIX_MQUEUE_SYSCTL=y CONFIG_CROSS_MEMORY_ATTACH=y CONFIG_FHANDLE=y CONFIG_USELIB=y CONFIG_AUDIT=y CONFIG_HAVE_ARCH_AUDITSYSCALL=y CONFIG_AUDITSYSCALL=y CONFIG_AUDIT_WATCH=y CONFIG_AUDIT_TREE=y # # IRQ subsystem # CONFIG_GENERIC_IRQ_PROBE=y CONFIG_GENERIC_IRQ_SHOW=y CONFIG_GENERIC_IRQ_EFFECTIVE_AFF_MASK=y CONFIG_GENERIC_PENDING_IRQ=y CONFIG_GENERIC_IRQ_MIGRATION=y CONFIG_IRQ_DOMAIN=y CONFIG_IRQ_DOMAIN_HIERARCHY=y CONFIG_GENERIC_MSI_IRQ=y CONFIG_GENERIC_MSI_IRQ_DOMAIN=y # CONFIG_IRQ_DOMAIN_DEBUG is not set CONFIG_IRQ_FORCED_THREADING=y CONFIG_SPARSE_IRQ=y # CONFIG_GENERIC_IRQ_DEBUGFS is not set CONFIG_CLOCKSOURCE_WATCHDOG=y CONFIG_ARCH_CLOCKSOURCE_DATA=y CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE=y CONFIG_GENERIC_TIME_VSYSCALL=y CONFIG_GENERIC_CLOCKEVENTS=y CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y CONFIG_GENERIC_CMOS_UPDATE=y # # Timers subsystem # CONFIG_TICK_ONESHOT=y CONFIG_NO_HZ_COMMON=y # CONFIG_HZ_PERIODIC is not set CONFIG_NO_HZ_IDLE=y # CONFIG_NO_HZ_FULL is not set CONFIG_NO_HZ=y CONFIG_HIGH_RES_TIMERS=y # # CPU/Task time and stats accounting # CONFIG_TICK_CPU_ACCOUNTING=y # CONFIG_VIRT_CPU_ACCOUNTING_GEN is not set # CONFIG_IRQ_TIME_ACCOUNTING is not set CONFIG_BSD_PROCESS_ACCT=y # CONFIG_BSD_PROCESS_ACCT_V3 is not set CONFIG_TASKSTATS=y CONFIG_TASK_DELAY_ACCT=y CONFIG_TASK_XACCT=y CONFIG_TASK_IO_ACCOUNTING=y # # RCU Subsystem # CONFIG_TREE_RCU=y # CONFIG_RCU_EXPERT is not set CONFIG_SRCU=y 
CONFIG_TREE_SRCU=y # CONFIG_TASKS_RCU is not set CONFIG_RCU_STALL_COMMON=y CONFIG_RCU_NEED_SEGCBLIST=y CONFIG_BUILD_BIN2C=y CONFIG_IKCONFIG=y CONFIG_IKCONFIG_PROC=y CONFIG_LOG_BUF_SHIFT=18 CONFIG_LOG_CPU_MAX_BUF_SHIFT=12 CONFIG_PRINTK_SAFE_LOG_BUF_SHIFT=13 CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH=y CONFIG_ARCH_SUPPORTS_INT128=y # CONFIG_NUMA_BALANCING is not set CONFIG_CGROUPS=y CONFIG_PAGE_COUNTER=y CONFIG_MEMCG=y CONFIG_MEMCG_SWAP=y CONFIG_MEMCG_SWAP_ENABLED=y # CONFIG_BLK_CGROUP is not set CONFIG_CGROUP_SCHED=y CONFIG_FAIR_GROUP_SCHED=y # CONFIG_CFS_BANDWIDTH is not set # CONFIG_RT_GROUP_SCHED is not set # CONFIG_CGROUP_PIDS is not set # CONFIG_CGROUP_RDMA is not set CONFIG_CGROUP_FREEZER=y # CONFIG_CPUSETS is not set # CONFIG_CGROUP_DEVICE is not set CONFIG_CGROUP_CPUACCT=y # CONFIG_CGROUP_PERF is not set # CONFIG_CGROUP_DEBUG is not set # CONFIG_SOCK_CGROUP_DATA is not set # CONFIG_CHECKPOINT_RESTORE is not set CONFIG