On Thu, Mar 19, 2015 at 8:22 AM, Takashi Iwai <ti...@suse.de> wrote:
> At Thu, 19 Mar 2015 15:55:26 +0100,
> Takashi Iwai wrote:
>>
>> At Thu, 19 Mar 2015 14:47:12 +0100,
>> Takashi Iwai wrote:
>> >
>> > At Thu, 19 Mar 2015 13:48:56 +0100,
>> > Denys Vlasenko wrote:
>> > >
>> > > Having no more ideas at the moment, here is a tarball of 13 patches
>> > > of commits touching entry_64.S up to 4.0.0-rc1.
>> > >
>> > > x0001.patch is the latest, x0015.patch is the oldest.
>> > >
>> > > Patches 0003 and 0008 are not there since 0003 is empty merge patch
>> > > and 0008 does some PCI fixup.
>> > >
>> > > If this breakage is recent, it ought to be one of these.
>> > > Most of them do some non-trivial surgery.
>> > >
>> > > Even though I did not spot anything suspicious in them,
>> > > entry.S is notorious for subtle breakage.
>> > >
>> > > Try reverting them in sequence starting from x0001.patch
>> > > and see reverting which one makes crash disappear.
>> >
>> > OK, I'm going to check these git series.
>>
>> Reverting the commit
>> 96b6352c12711d5c0bb7157f49c92580248e8146
>> x86_64, entry: Remove the syscall exit audit and schedule optimizations
>>
>> seems enough. After reverting this one, the machine runs stable with
>> the kvm stress test.
>>
>> (I'll keep test running for a while; at the previous bisection, I hit
>> the bug right after posting the mail ;)
>
> It survived long enough, so this looks like the spot.
>
> Also, I checked the patch below instead of reverting the commit, and
> this seems working, too.
>
>
> Takashi
>
> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> index 1d74d161687c..5340ac7f88a9 100644
> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -364,12 +364,12 @@ system_call_fastpath:
>   * Has incomplete stack frame and undefined top of stack.
>   */
>  ret_from_sys_call:
> -       testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
> -       jnz int_ret_from_sys_call_fixup /* Go the the slow path */
> -
>         LOCKDEP_SYS_EXIT
>         DISABLE_INTERRUPTS(CLBR_NONE)
>         TRACE_IRQS_OFF
> +       testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
> +       jnz int_ret_from_sys_call_fixup /* Go the the slow path */
> +
>         CFI_REMEMBER_STATE
>         /*
>          * sysretq will re-enable interrupts:
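
To make the ordering change in the quoted diff concrete, here is a toy
model in Python (not kernel code). The step names follow the diff; the
single simulated interrupt, the work_flag variable, and the helper
names run_exit_path and missed_windows are all assumptions made for
illustration. It only shows where a flag set by an interrupt would go
unseen by the flags check under each ordering. It does not establish
that such a window is what causes the reported crash.

```python
# Toy model of the two orderings in the quoted diff. This is NOT kernel
# code: step names mirror the diff, everything else is hypothetical.

ORIGINAL = ["test_flags", "lockdep_sys_exit", "disable_irqs", "sysret"]
PATCHED = ["lockdep_sys_exit", "disable_irqs", "test_flags", "sysret"]


def run_exit_path(steps, irq_before_step):
    """Walk `steps` in order with one simulated interrupt.

    The interrupt fires just before step index `irq_before_step`, but
    only if interrupts are still enabled there. Its handler sets a work
    flag (a stand-in for any _TIF_ALLWORK_MASK bit).

    Returns (path, missed): path is "slow" if the flags check saw the
    flag and "fast" otherwise; missed is True if the flag was set but
    the check did not see it.
    """
    irqs_enabled = True
    work_flag = False
    path = "fast"
    for i, step in enumerate(steps):
        if i == irq_before_step and irqs_enabled:
            work_flag = True  # handler requests exit work
        if step == "test_flags" and work_flag:
            path = "slow"
        elif step == "disable_irqs":
            irqs_enabled = False
    missed = work_flag and path == "fast"
    return path, missed


def missed_windows(steps):
    """Step indices before which an interrupt would set an unseen flag."""
    return [i for i in range(len(steps)) if run_exit_path(steps, i)[1]]


if __name__ == "__main__":
    print("original:", missed_windows(ORIGINAL))
    print("patched: ", missed_windows(PATCHED))
```

Under these assumptions, the original ordering has a window between the
flags check and the IRQ-disable, and the patched ordering has none,
because the check runs only after interrupts are off.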

The crash you're seeing could certainly be caused by an IRQ at the
wrong time. However:

int_ret_from_sys_call_fixup:
        FIXUP_TOP_OF_STACK %r11, -ARGOFFSET
        jmp int_ret_from_sys_call

and

GLOBAL(int_ret_from_sys_call)
        DISABLE_INTERRUPTS(CLBR_NONE)
        TRACE_IRQS_OFF

so with or without your little patch, we're turning off IRQs very
quickly. retint_swapgs also turns off interrupts before doing
anything. So I don't see how your patch would have any effect.

I'm starting to wonder if the problem has something to do with running
fire_user_return_notifiers with IRQs on. We appear to do that, and it
seems rather questionable to me that it's safe, given the sneaky
things that KVM does in there. If we end up in user mode with a bad
MSR_SYSCALL_MASK, we could see your crash, although I don't see how
that would happen either.

I'll try to write a diagnostic patch later this morning.

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC