On 03/23/2015 02:22 PM, Takashi Iwai wrote: > At Mon, 23 Mar 2015 10:35:41 +0100, > Takashi Iwai wrote: >> >> At Mon, 23 Mar 2015 10:02:52 +0100, >> Takashi Iwai wrote: >>> >>> At Fri, 20 Mar 2015 19:16:53 +0100, >>> Denys Vlasenko wrote: >>>> Takashi, are you willing to reproduce the panic one more time, >>>> with this patch? I would like to see whether oops messages >>>> are more informative with it. >>> >>> It can't be applied to 4.0-rc5, unfortunately. >>> >>> arch/x86/kernel/entry_64.S: Assembler messages: >>> arch/x86/kernel/entry_64.S:1725: Error: no such instruction: >>> `alloc_pt_gpregs_on_stack' >>> arch/x86/kernel/entry_64.S:1716: Error: invalid operands (*UND* and *UND* >>> sections) for `+' >>> scripts/Makefile.build:294: recipe for target 'arch/x86/kernel/entry_64.o' >>> failed >> >> I pulled tip tree on top of 4.0-rc5, built with your patch and now >> succeeded to get a better message: >> >> kvm: zapping shadow pages for mmio generation wraparound >> kvm [5126]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0xffff >> Exception on user stack 00007ffd22c23ef0: RSP: 0018:00007ffd22c23f28 >> EFLAGS: 00010006 >> RIP: 0010:[<ffffffff8162681d>] [<ffffffff8162681d>] >> netlink_attachskb+0x1d/0x1d0 >> PANIC: double fault, error_code: 0x0 >> CPU: 1 PID: 10819 Comm: cc1 Tainted: G W 4.0.0-rc5-debug1+ #2 >> Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013 >> task: ffff8800d1b34b10 ti: ffff8800d1b30000 task.ti: ffff8800d1b30000 >> RIP: 0010:[<ffffffff8162681d>] [<ffffffff8162681d>] >> netlink_attachskb+0x1d/0x1d0 >> RSP: 0018:00007ffd22c23f28 EFLAGS: 00010006 >> RAX: 0000000000000000 RBX: 0000000000000005 RCX: 00000000c0000101 >> RDX: 0000000000000000 RSI: 0000000000000001 RDI: 00007ffd22c23ef0 >> RBP: 0000000000000ea7 R08: 0000000000001ea7 R09: ffffffffffffffff >> R10: 000000000309dbf8 R11: 0000000000000246 R12: 0000000000000001 >> R13: 0000000000000000 R14: 0000000003026e40 R15: 000000000309cd50 >> FS: 00007f89c83c2800(0000) GS:ffff88021d240000(0000) knlGS:0000000000000000 >> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >> CR2: 000000000000016d CR3: 00000000d90a0000 CR4: 00000000001427e0 >> Stack: >> 0000000000000ea7 0000000000000000 0000000003099c10 0000000000000ea7 >> 0000000000000ea7 0000000000000001 0000000003099c10 0000000000000ea7 >> 0000000000c84696 0000000003099c88 00007f0122c23fb8 000000000302f610 >> Call Trace: >> <UNK> >> Code: >> 10 75 ee f0 ff 42 6c 48 89 d0 5d c3 66 90 0f 1f 44 00 00 55 48 89 e5 41 56 >> 41 55 49 89 d5 41 54 49 89 f4 53 48 89 fb 48 83 ec 30 <8b> 87 68 01 00 00 39 >> 87 9c 01 00 00 7c 25 48 8b 87 88 04 00 00 >> Kernel panic - not syncing: Machine halted. >> CPU: 1 PID: 10819 Comm: cc1 Tainted: G W 4.0.0-rc5-debug1+ #2 >> Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013 >> 0000000000000000 ffff8800d1b33e28 ffffffff816f80d2 0000000000000000 >> ffffffff81a22f81 ffff8800d1b33ea8 ffffffff816f2358 00000000000058d7 >> 0000000000000008 ffff8800d1b33eb8 ffff8800d1b33e58 ffff8800d1b33ea8 >> Call Trace: >> [<ffffffff816f80d2>] dump_stack+0x4c/0x6e >> [<ffffffff816f2358>] panic+0xc0/0x1f3 >> [<ffffffff81046e65>] df_debug+0x35/0x40 >> [<ffffffff81003fe7>] do_double_fault+0x87/0x100 >> [<ffffffff81004167>] do_userpsace_rsp_in_kernel+0x107/0x140 >> [<ffffffff8162681d>] ? netlink_attachskb+0x1d/0x1d0 >> [<ffffffff81703ca6>] userpsace_rsp_in_kernel+0x36/0x40 >> [<ffffffff8162681d>] ? netlink_attachskb+0x1d/0x1d0 >> >> >> So, it seems hitting in netlink_attachskb(). >> I'd need to check whether this consistently hits there or just at >> random. > > I managed to reproduce the bug two more times, and all three show the > very same stack trace like the above. So, it's well reproducible.
FYI: the disassembly of netlink_attachskb (from "Code:" line) is: 0: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) 5: 55 push %rbp 6: 48 89 e5 mov %rsp,%rbp 9: 41 56 push %r14 b: 41 55 push %r13 d: 49 89 d5 mov %rdx,%r13 10: 41 54 push %r12 12: 49 89 f4 mov %rsi,%r12 15: 53 push %rbx 16: 48 89 fb mov %rdi,%rbx 19: 48 83 ec 30 sub $0x30,%rsp 1d: 8b 87 68 01 00 00 mov 0x168(%rdi),%eax ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 23: 39 87 9c 01 00 00 cmp %eax,0x19c(%rdi) 29: 7c 25 jl 50 <_start+0x50> 2b: 48 8b 87 88 04 00 00 mov 0x488(%rdi),%rax The ^^^^^ instruction is the one which faults. Since you said it consistently happens here, this should be a page fault, not an external hardware interrupt. The code corresponds to the comparison in if(): int netlink_attachskb(struct sock *sk, struct sk_buff *skb, long *timeo, struct sock *ssk) { struct netlink_sock *nlk; nlk = nlk_sk(sk); if ((atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf || %rdi (which is 1st param, "struct sock *sk") is 00007ffd22c23ef0 (userspace address), but it's just because my patch clobbers %rdi, :( we don't know which value it had at that moment. > I'm really puzzled now. We have a few pieces of information: > > - git bisection pointed the commit 96b6352c1271: > x86_64, entry: Remove the syscall exit audit and schedule optimizations > and reverting this "fixes" the problem indeed. Even just moving two > lines > LOCKDEP_SYS_EXIT > DISABLE_INTERRUPTS(CLBR_NONE) > at the beginning of ret_from_sys_call already fixes. (Of course I > can't prove the fix but it stabilizes for a day without crash while > usually I hit the bug in 10 minutes in full test running.) The commit 96b6352c1271 moved TIF_ALLWORK_MASK check from interrupt-disabled region to interrupt-enabled: cmpq $__NR_syscall_max,%rax ja ret_from_sys_call movq %r10,%rcx call *sys_call_table(,%rax,8) # XXX: rip relative movq %rax,RAX-ARGOFFSET(%rsp) ret_from_sys_call: testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ jnz int_ret_from_sys_call_fixup /* Go the the slow path */ LOCKDEP_SYS_EXIT DISABLE_INTERRUPTS(CLBR_NONE) TRACE_IRQS_OFF ... ... int_ret_from_sys_call_fixup: FIXUP_TOP_OF_STACK %r11, -ARGOFFSET jmp int_ret_from_sys_call ... ... GLOBAL(int_ret_from_sys_call) DISABLE_INTERRUPTS(CLBR_NONE) TRACE_IRQS_OFF You reverted that by moving this insn to be after first DISABLE_INTERRUPTS(CLBR_NONE). I also don't see how moving that check (even if it is wrong in a more benign way) can have such a drastic effect. Shot-in-the-dark idea. At this code revision we did not yet store user's %rsp in pt_regs->sp, we used a fixup to populate it: .macro FIXUP_TOP_OF_STACK tmp offset=0 movq PER_CPU_VAR(old_rsp),\tmp movq \tmp,RSP+\offset(%rsp) (There are pending patches to fix this mess). If an interrupt interrupting *kernel code* would go into a code path which does FIXUP_TOP_OF_STACK, it'd overwrite the correct saved %rsp with a user's one. The iret from interrupt would work, but the resulting CPU state would be inconsistent. But I don't see such a code path from interrupts to FIXUP_TOP_OF_STACK... > - Another piece is that the bug happens only when a KVM is running. > The kernel ran without problem over days with similar tasks > (compiling kernel, etc) when no KVM was used. Conceivably virtualization support in CPUs can have nasty erratas. However, you and other reporter have different CPUs - yours is Ivy Bridge, his CPU is a Penryn. I don't see the path how KVM helps to trigger this. > - And now I get the trace as above, pointing netlink_attachskb(). > > I have a difficulty to imagine how all these pieces fit into a single > picture. Is something already screwed up before that? Well, a tiny bit more info will be seen if you'd change %rdi to, say, %r15 in these two lines in my patch: /* Save bogus RSP value */ movq %rsp,%rdi ... push %rdi /* pt_regs->sp */ Then original %rdi will be visible in the crash message. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/