Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

Takashi Iwai Mon, 23 Mar 2015 06:23:21 -0700

At Mon, 23 Mar 2015 10:35:41 +0100,
Takashi Iwai wrote:
> 
> At Mon, 23 Mar 2015 10:02:52 +0100,
> Takashi Iwai wrote:
> > 
> > At Fri, 20 Mar 2015 19:16:53 +0100,
> > Denys Vlasenko wrote:
> > > Takashi, are you willing to reproduce the panic one more time,
> > > with this patch? I would like to see whether oops messages
> > > are more informative with it.
> > 
> > It can't be applied to 4.0-rc5, unfortunately.
> > 
> > arch/x86/kernel/entry_64.S: Assembler messages:
> > arch/x86/kernel/entry_64.S:1725: Error: no such instruction: `alloc_pt_gpregs_on_stack'
> > arch/x86/kernel/entry_64.S:1716: Error: invalid operands (*UND* and *UND* sections) for `+'
> > scripts/Makefile.build:294: recipe for target 'arch/x86/kernel/entry_64.o' failed
> 
> I pulled tip tree on top of 4.0-rc5, built with your patch and now
> succeeded to get a better message:
> 
>  kvm: zapping shadow pages for mmio generation wraparound
>  kvm [5126]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0xffff
>  Exception on user stack 00007ffd22c23ef0: RSP: 0018:00007ffd22c23f28  EFLAGS: 00010006
>  RIP: 0010:[<ffffffff8162681d>]  [<ffffffff8162681d>] netlink_attachskb+0x1d/0x1d0
>  PANIC: double fault, error_code: 0x0
>  CPU: 1 PID: 10819 Comm: cc1 Tainted: G        W       4.0.0-rc5-debug1+ #2
>  Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013
>  task: ffff8800d1b34b10 ti: ffff8800d1b30000 task.ti: ffff8800d1b30000
>  RIP: 0010:[<ffffffff8162681d>]  [<ffffffff8162681d>] netlink_attachskb+0x1d/0x1d0
>  RSP: 0018:00007ffd22c23f28  EFLAGS: 00010006
>  RAX: 0000000000000000 RBX: 0000000000000005 RCX: 00000000c0000101
>  RDX: 0000000000000000 RSI: 0000000000000001 RDI: 00007ffd22c23ef0
>  RBP: 0000000000000ea7 R08: 0000000000001ea7 R09: ffffffffffffffff
>  R10: 000000000309dbf8 R11: 0000000000000246 R12: 0000000000000001
>  R13: 0000000000000000 R14: 0000000003026e40 R15: 000000000309cd50
>  FS:  00007f89c83c2800(0000) GS:ffff88021d240000(0000) knlGS:0000000000000000
>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  CR2: 000000000000016d CR3: 00000000d90a0000 CR4: 00000000001427e0
>  Stack:
>   0000000000000ea7 0000000000000000 0000000003099c10 0000000000000ea7
>   0000000000000ea7 0000000000000001 0000000003099c10 0000000000000ea7
>   0000000000c84696 0000000003099c88 00007f0122c23fb8 000000000302f610
>  Call Trace:
>   <UNK> 
>  Code: 10 75 ee f0 ff 42 6c 48 89 d0 5d c3 66 90 0f 1f 44 00 00 55 48 89 e5 41 56 41 55 49 89 d5 41 54 49 89 f4 53 48 89 fb 48 83 ec 30 <8b> 87 68 01 00 00 39 87 9c 01 00 00 7c 25 48 8b 87 88 04 00 00 
>  Kernel panic - not syncing: Machine halted.
>  CPU: 1 PID: 10819 Comm: cc1 Tainted: G        W       4.0.0-rc5-debug1+ #2
>  Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013
>   0000000000000000 ffff8800d1b33e28 ffffffff816f80d2 0000000000000000
>   ffffffff81a22f81 ffff8800d1b33ea8 ffffffff816f2358 00000000000058d7
>   0000000000000008 ffff8800d1b33eb8 ffff8800d1b33e58 ffff8800d1b33ea8
>  Call Trace:
>   [<ffffffff816f80d2>] dump_stack+0x4c/0x6e
>   [<ffffffff816f2358>] panic+0xc0/0x1f3
>   [<ffffffff81046e65>] df_debug+0x35/0x40
>   [<ffffffff81003fe7>] do_double_fault+0x87/0x100
>   [<ffffffff81004167>] do_userpsace_rsp_in_kernel+0x107/0x140
>   [<ffffffff8162681d>] ? netlink_attachskb+0x1d/0x1d0
>   [<ffffffff81703ca6>] userpsace_rsp_in_kernel+0x36/0x40
>   [<ffffffff8162681d>] ? netlink_attachskb+0x1d/0x1d0
> 
> 
> So, it seems hitting in netlink_attachskb().
> I'd need to check whether this consistently hits there or just at
> random.


I managed to reproduce the bug two more times, and all three runs show
the very same stack trace as above.  So it is reliably reproducible.

I'm really puzzled now.  We have a few pieces of information:

- git bisection pointed to commit 96b6352c1271:
    x86_64, entry: Remove the syscall exit audit and schedule optimizations
  and reverting it does indeed "fix" the problem.  Even just moving the
  two lines
    LOCKDEP_SYS_EXIT
    DISABLE_INTERRUPTS(CLBR_NONE)
  to the beginning of ret_from_sys_call is already enough.  (Of course
  I can't prove that this is a fix, but the system stays stable for a
  day without a crash, while I usually hit the bug within 10 minutes of
  running the full test.)

- Another piece is that the bug happens only when KVM is running.
  The kernel ran without problems for days with similar tasks
  (compiling the kernel, etc.) when KVM was not in use.

- And now I get the trace as above, pointing at netlink_attachskb().

I have difficulty imagining how all these pieces fit into a single
picture.  Was something already screwed up before that point?
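
The odd combination in the dump above (CS 0010 and a kernel-text RIP,
but an RSP in the user half of the address space) can be checked
mechanically.  Below is a minimal sketch in Python, assuming 48-bit
canonical addressing on x86_64.  The helper names are hypothetical and
only for illustration; they are not kernel functions.

```python
# Sketch: classify the addresses from the oops above.
# Assumption: 48-bit virtual addresses (4-level paging), so the user half
# ends at 0x00007fffffffffff and the kernel half starts at 0xffff800000000000.
USER_MAX = 0x00007FFFFFFFFFFF
KERNEL_MIN = 0xFFFF800000000000


def classify(addr):
    """Return which half of the canonical address space addr falls in."""
    if addr <= USER_MAX:
        return "user"
    if addr >= KERNEL_MIN:
        return "kernel"
    return "non-canonical"


def cpl(cs):
    """The low two bits of the CS selector give the privilege level."""
    return cs & 3


def kernel_on_user_stack(cs, rip, rsp):
    """True if ring-0 code at a kernel RIP is running with a user-space RSP."""
    return (cpl(cs) == 0
            and classify(rip) == "kernel"
            and classify(rsp) == "user")


# Values copied from the register dump in the trace above.
CS = 0x0010
RIP = 0xFFFFFFFF8162681D   # netlink_attachskb+0x1d
RSP = 0x00007FFD22C23F28

print("RIP is in", classify(RIP), "space; RSP is in", classify(RSP), "space")
print("kernel code on user stack:", kernel_on_user_stack(CS, RIP, RSP))
```

This only restates what the "Exception on user stack" line already
reports.  It says nothing about how RSP ended up pointing there.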


Takashi
