On Thu, Mar 19, 2015 at 8:51 AM, Takashi Iwai <ti...@suse.de> wrote:
> At Thu, 19 Mar 2015 08:41:57 -0700,
> Andy Lutomirski wrote:
>>
>> On Thu, Mar 19, 2015 at 8:22 AM, Takashi Iwai <ti...@suse.de> wrote:
>> > At Thu, 19 Mar 2015 15:55:26 +0100,
>> > Takashi Iwai wrote:
>> >>
>> >> At Thu, 19 Mar 2015 14:47:12 +0100,
>> >> Takashi Iwai wrote:
>> >> >
>> >> > At Thu, 19 Mar 2015 13:48:56 +0100,
>> >> > Denys Vlasenko wrote:
>> >> > >
>> >> > > Having no more ideas at the moment, here is a tarball of 13 patches
>> >> > > of commits touching entry_64.S up to 4.0.0-rc1.
>> >> > >
>> >> > > x0001.patch is the latest, x0015.patch is the oldest.
>> >> > >
>> >> > > Patches 0003 and 0008 are not there since 0003 is empty merge patch
>> >> > > and 0008 does some PCI fixup.
>> >> > >
>> >> > > If this breakage is recent, it ought to be one of these.
>> >> > > Most of them do some non-trivial surgery.
>> >> > >
>> >> > > Even though I did not spot anything suspicious in them,
>> >> > > entry.S is notorious for subtle breakage.
>> >> > >
>> >> > > Try reverting them in sequence starting from x0001.patch
>> >> > > and see reverting which one makes crash disappear.
>> >> >
>> >> > OK, I'm going to check these git series.
>> >>
>> >> Reverting the commit
>> >> 96b6352c12711d5c0bb7157f49c92580248e8146
>> >> x86_64, entry: Remove the syscall exit audit and schedule
>> >> optimizations
>> >>
>> >> seems enough. After reverting this one, the machine runs stable with
>> >> the kvm stress test.
>> >>
>> >> (I'll keep test running for a while; at the previous bisection, I hit
>> >> the bug right after posting the mail ;)
>> >
>> > It survived long enough, so this looks like the spot.
>> >
>> > Also, I checked the patch below instead of reverting the commit, and
>> > this seems working, too.
>> >
>> >
>> > Takashi
>> >
>> > diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
>> > index 1d74d161687c..5340ac7f88a9 100644
>> > --- a/arch/x86/kernel/entry_64.S
>> > +++ b/arch/x86/kernel/entry_64.S
>> > @@ -364,12 +364,12 @@ system_call_fastpath:
>> >   * Has incomplete stack frame and undefined top of stack.
>> >   */
>> >  ret_from_sys_call:
>> > -       testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
>> > -       jnz int_ret_from_sys_call_fixup /* Go the the slow path */
>> > -
>> >         LOCKDEP_SYS_EXIT
>> >         DISABLE_INTERRUPTS(CLBR_NONE)
>> >         TRACE_IRQS_OFF
>> > +       testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
>> > +       jnz int_ret_from_sys_call_fixup /* Go the the slow path */
>> > +
>> >         CFI_REMEMBER_STATE
>> >         /*
>> >          * sysretq will re-enable interrupts:
>>
>> The crash you're seeing could certainly be caused by an IRQ at the
>> wrong time. However:
>>
>> int_ret_from_sys_call_fixup:
>>         FIXUP_TOP_OF_STACK %r11, -ARGOFFSET
>>         jmp int_ret_from_sys_call
>>
>> and
>>
>> GLOBAL(int_ret_from_sys_call)
>>         DISABLE_INTERRUPTS(CLBR_NONE)
>>         TRACE_IRQS_OFF
>>
>> so with or without your little patch, we're turning off IRQs very
>> quickly. retint_swapgs also turnes off interrupts before doing
>> anything. So I don't see how your patch would have any effect.
>
> What about LOCKDEP_SYS_EXIT?
>
There's a LOCKDEP_SYS_EXIT_IRQ a few lines down in
int_ret_from_sys_call, and the syscall slow path falls through
directly to int_ret_from_sys_call.

I'm going to try to write a diagnostic patch now. I have four
separate contractors coming starting half an hour ago*, so it might
take a while.

* Yeah, right.

--Andy