At Mon, 23 Mar 2015 17:07:15 +0100, Denys Vlasenko wrote: > > On 03/23/2015 02:22 PM, Takashi Iwai wrote: > > At Mon, 23 Mar 2015 10:35:41 +0100, > > Takashi Iwai wrote: > >> > >> At Mon, 23 Mar 2015 10:02:52 +0100, > >> Takashi Iwai wrote: > >>> > >>> At Fri, 20 Mar 2015 19:16:53 +0100, > >>> Denys Vlasenko wrote: > >>>> Takashi, are you willing to reproduce the panic one more time, > >>>> with this patch? I would like to see whether oops messages > >>>> are more informative with it. > >>> > >>> It can't be applied to 4.0-rc5, unfortunately. > >>> > >>> arch/x86/kernel/entry_64.S: Assembler messages: > >>> arch/x86/kernel/entry_64.S:1725: Error: no such instruction: > >>> `alloc_pt_gpregs_on_stack' > >>> arch/x86/kernel/entry_64.S:1716: Error: invalid operands (*UND* and *UND* > >>> sections) for `+' > >>> scripts/Makefile.build:294: recipe for target > >>> 'arch/x86/kernel/entry_64.o' failed > >> > >> I pulled tip tree on top of 4.0-rc5, built with your patch and now > >> succeeded to get a better message: > >> > >> kvm: zapping shadow pages for mmio generation wraparound > >> kvm [5126]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0xffff > >> Exception on user stack 00007ffd22c23ef0: RSP: 0018:00007ffd22c23f28 > >> EFLAGS: 00010006 > >> RIP: 0010:[<ffffffff8162681d>] [<ffffffff8162681d>] > >> netlink_attachskb+0x1d/0x1d0 > >> PANIC: double fault, error_code: 0x0 > >> CPU: 1 PID: 10819 Comm: cc1 Tainted: G W 4.0.0-rc5-debug1+ #2 > >> Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013 > >> task: ffff8800d1b34b10 ti: ffff8800d1b30000 task.ti: ffff8800d1b30000 > >> RIP: 0010:[<ffffffff8162681d>] [<ffffffff8162681d>] > >> netlink_attachskb+0x1d/0x1d0 > >> RSP: 0018:00007ffd22c23f28 EFLAGS: 00010006 > >> RAX: 0000000000000000 RBX: 0000000000000005 RCX: 00000000c0000101 > >> RDX: 0000000000000000 RSI: 0000000000000001 RDI: 00007ffd22c23ef0 > >> RBP: 0000000000000ea7 R08: 0000000000001ea7 R09: ffffffffffffffff > >> R10: 000000000309dbf8 R11: 0000000000000246 R12: 0000000000000001 > >> R13: 0000000000000000 R14: 0000000003026e40 R15: 000000000309cd50 > >> FS: 00007f89c83c2800(0000) GS:ffff88021d240000(0000) > >> knlGS:0000000000000000 > >> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > >> CR2: 000000000000016d CR3: 00000000d90a0000 CR4: 00000000001427e0 > >> Stack: > >> 0000000000000ea7 0000000000000000 0000000003099c10 0000000000000ea7 > >> 0000000000000ea7 0000000000000001 0000000003099c10 0000000000000ea7 > >> 0000000000c84696 0000000003099c88 00007f0122c23fb8 000000000302f610 > >> Call Trace: > >> <UNK> > >> Code: > >> 10 75 ee f0 ff 42 6c 48 89 d0 5d c3 66 90 0f 1f 44 00 00 55 48 89 e5 41 > >> 56 41 55 49 89 d5 41 54 49 89 f4 53 48 89 fb 48 83 ec 30 <8b> 87 68 01 00 > >> 00 39 87 9c 01 00 00 7c 25 48 8b 87 88 04 00 00 > >> Kernel panic - not syncing: Machine halted. > >> CPU: 1 PID: 10819 Comm: cc1 Tainted: G W 4.0.0-rc5-debug1+ #2 > >> Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013 > >> 0000000000000000 ffff8800d1b33e28 ffffffff816f80d2 0000000000000000 > >> ffffffff81a22f81 ffff8800d1b33ea8 ffffffff816f2358 00000000000058d7 > >> 0000000000000008 ffff8800d1b33eb8 ffff8800d1b33e58 ffff8800d1b33ea8 > >> Call Trace: > >> [<ffffffff816f80d2>] dump_stack+0x4c/0x6e > >> [<ffffffff816f2358>] panic+0xc0/0x1f3 > >> [<ffffffff81046e65>] df_debug+0x35/0x40 > >> [<ffffffff81003fe7>] do_double_fault+0x87/0x100 > >> [<ffffffff81004167>] do_userpsace_rsp_in_kernel+0x107/0x140 > >> [<ffffffff8162681d>] ? netlink_attachskb+0x1d/0x1d0 > >> [<ffffffff81703ca6>] userpsace_rsp_in_kernel+0x36/0x40 > >> [<ffffffff8162681d>] ? netlink_attachskb+0x1d/0x1d0 > >> > >> > >> So, it seems hitting in netlink_attachskb(). > >> I'd need to check whether this consistently hits there or just at > >> random. > > > > I managed to reproduce the bug two more times, and all three show the > > very same stack trace like the above. So, it's well reproducible. > > FYI: the disassembly of netlink_attachskb (from "Code:" line) is: > > 0: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) > 5: 55 push %rbp > 6: 48 89 e5 mov %rsp,%rbp > 9: 41 56 push %r14 > b: 41 55 push %r13 > d: 49 89 d5 mov %rdx,%r13 > 10: 41 54 push %r12 > 12: 49 89 f4 mov %rsi,%r12 > 15: 53 push %rbx > 16: 48 89 fb mov %rdi,%rbx > 19: 48 83 ec 30 sub $0x30,%rsp > 1d: 8b 87 68 01 00 00 mov 0x168(%rdi),%eax > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > 23: 39 87 9c 01 00 00 cmp %eax,0x19c(%rdi) > 29: 7c 25 jl 50 <_start+0x50> > 2b: 48 8b 87 88 04 00 00 mov 0x488(%rdi),%rax > > The ^^^^^ instruction is the one which faults. Since you said it > consistently happens here, this should be a page fault, not an external > hardware interrupt. > > The code corresponds to the comparison in if(): > > int netlink_attachskb(struct sock *sk, struct sk_buff *skb, > long *timeo, struct sock *ssk) > { > struct netlink_sock *nlk; > > nlk = nlk_sk(sk); > > if ((atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf || > > %rdi (which is 1st param, "struct sock *sk") is 00007ffd22c23ef0 > (userspace address), but it's just because my patch clobbers %rdi, :( > we don't know which value it had at that moment. > > > I'm really puzzled now. We have a few pieces of information: > > > > - git bisection pointed the commit 96b6352c1271: > > x86_64, entry: Remove the syscall exit audit and schedule optimizations > > and reverting this "fixes" the problem indeed. Even just moving two > > lines > > LOCKDEP_SYS_EXIT > > DISABLE_INTERRUPTS(CLBR_NONE) > > at the beginning of ret_from_sys_call already fixes. (Of course I > > can't prove the fix but it stabilizes for a day without crash while > > usually I hit the bug in 10 minutes in full test running.) > > The commit 96b6352c1271 moved TIF_ALLWORK_MASK check from > interrupt-disabled region to interrupt-enabled: > > cmpq $__NR_syscall_max,%rax > ja ret_from_sys_call > movq %r10,%rcx > call *sys_call_table(,%rax,8) # XXX: rip relative > movq %rax,RAX-ARGOFFSET(%rsp) > ret_from_sys_call: > testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET) > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > jnz int_ret_from_sys_call_fixup /* Go the the slow path */ > LOCKDEP_SYS_EXIT > DISABLE_INTERRUPTS(CLBR_NONE) > TRACE_IRQS_OFF > ... > ... > int_ret_from_sys_call_fixup: > FIXUP_TOP_OF_STACK %r11, -ARGOFFSET > jmp int_ret_from_sys_call > ... > ... > GLOBAL(int_ret_from_sys_call) > DISABLE_INTERRUPTS(CLBR_NONE) > TRACE_IRQS_OFF > > You reverted that by moving this insn to be after first > DISABLE_INTERRUPTS(CLBR_NONE).
Oh yes. I forgot to mention that I tested also only moving DISABLE_INTERRUPTS(CLBR_NONE) at the beginning. But this didn't help by some reason. And, I tested also without all kernel debug options (both CONFIG_DEBUG_LOCK_ALLOC and CONFIG_TRACE_IRQFLAGS are off), but this kernel also showed the same crash. > I also don't see how moving that check (even if it is wrong in a more > benign way) can have such a drastic effect. > > > Shot-in-the-dark idea. At this code revision we did not yet > store user's %rsp in pt_regs->sp, we used a fixup to populate it: > > .macro FIXUP_TOP_OF_STACK tmp offset=0 > movq PER_CPU_VAR(old_rsp),\tmp > movq \tmp,RSP+\offset(%rsp) > > (There are pending patches to fix this mess). > > If an interrupt interrupting *kernel code* would go into a code path > which does FIXUP_TOP_OF_STACK, it'd overwrite the correct saved %rsp > with a user's one. The iret from interrupt would work, > but the resulting CPU state would be inconsistent. But I don't see > such a code path from interrupts to FIXUP_TOP_OF_STACK... > > > > - Another piece is that the bug happens only when a KVM is running. > > The kernel ran without problem over days with similar tasks > > (compiling kernel, etc) when no KVM was used. > > Conceivably virtualization support in CPUs can have nasty erratas. > However, you and other reporter have different CPUs - yours > is Ivy Bridge, his CPU is a Penryn. > > I don't see the path how KVM helps to trigger this. > > > - And now I get the trace as above, pointing netlink_attachskb(). > > > > I have a difficulty to imagine how all these pieces fit into a single > > picture. Is something already screwed up before that? > > Well, a tiny bit more info will be seen if you'd change %rdi > to, say, %r15 in these two lines in my patch: > > /* Save bogus RSP value */ > movq %rsp,%rdi > ... > push %rdi /* pt_regs->sp */ > > Then original %rdi will be visible in the crash message. OK, here we go. kvm: zapping shadow pages for mmio generation wraparound kvm [5490]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0xffff Exception on user stack 00007fff1d7e5ec0: RSP: 0018:00007fff1d7e5ef8 EFLAGS: 00010002 RIP: 0010:[<ffffffff8162681d>] [<ffffffff8162681d>] netlink_attachskb+0x1d/0x1d0 PANIC: double fault, error_code: 0x0 CPU: 5 PID: 14285 Comm: fixdep Tainted: G W 4.0.0-rc5-debug1+ #3 Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013 task: ffff88020ba1c690 ti: ffff880206ba4000 task.ti: ffff880206ba4000 RIP: 0010:[<ffffffff8162681d>] [<ffffffff8162681d>] netlink_attachskb+0x1d/0x1d0 RSP: 0018:00007fff1d7e5ef8 EFLAGS: 00010002 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00000000c0000101 RDX: 0000000000000000 RSI: 0000000000001ebb RDI: 0000000000000000 RBP: 0000000000000022 R08: 0000000000000004 R09: 0000000000000000 R10: 0000000000000002 R11: 0000000000000246 R12: 0000000000001ebb R13: 00007fb642fcc6e4 R14: 00007fb642fcdc18 R15: 00007fff1d7e5ec0 FS: 00007fb642fa9700(0000) GS:ffff88021d340000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000168 CR3: 00000000ce1b4000 CR4: 00000000001427e0 Stack: 0000000000000005 0000000000401582 00007fb642fcb180 0000000000000053 0000000000400d8a 0000000000000000 0000000000000000 000000005d152a17 0000000000400f8c 0000000000000000 0000000100000004 00007fb642fcb000 Call Trace: <UNK> Code: 10 75 ee f0 ff 42 6c 48 89 d0 5d c3 66 90 0f 1f 44 00 00 55 48 89 e5 41 56 41 55 49 89 d5 41 54 49 89 f4 53 48 89 fb 48 83 ec 30 <8b> 87 68 01 00 00 39 87 9c 01 00 00 7c 25 48 8b 87 88 04 00 00 Kernel panic - not syncing: Machine halted. CPU: 5 PID: 14285 Comm: fixdep Tainted: G W 4.0.0-rc5-debug1+ #3 Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013 0000000000000000 ffff880206ba7e28 ffffffff816f80d2 0000000000000000 ffffffff81a22f81 ffff880206ba7ea8 ffffffff816f2358 00000000000050da 0000000000000008 ffff880206ba7eb8 ffff880206ba7e58 ffff880206ba7ea8 Call Trace: [<ffffffff816f80d2>] dump_stack+0x4c/0x6e [<ffffffff816f2358>] panic+0xc0/0x1f3 [<ffffffff81046e65>] df_debug+0x35/0x40 [<ffffffff81003fe7>] do_double_fault+0x87/0x100 [<ffffffff81004167>] do_userpsace_rsp_in_kernel+0x107/0x140 [<ffffffff8162681d>] ? netlink_attachskb+0x1d/0x1d0 [<ffffffff81703ca7>] userpsace_rsp_in_kernel+0x37/0x40 [<ffffffff8162681d>] ? netlink_attachskb+0x1d/0x1d0 I have to leave my office now. If you need any further tests, let me know; I'll do it tomorrow. In anyway I'll need to double-check whether I tested properly. thanks, Takashi -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/