At Mon, 23 Mar 2015 11:38:30 -0700,
Andy Lutomirski wrote:
> 
> On Mon, Mar 23, 2015 at 9:07 AM, Denys Vlasenko <dvlas...@redhat.com> wrote:
> > On 03/23/2015 02:22 PM, Takashi Iwai wrote:
> >> At Mon, 23 Mar 2015 10:35:41 +0100,
> >> Takashi Iwai wrote:
> >>>
> >>> At Mon, 23 Mar 2015 10:02:52 +0100,
> >>> Takashi Iwai wrote:
> >>>>
> >>>> At Fri, 20 Mar 2015 19:16:53 +0100,
> >>>> Denys Vlasenko wrote:
> 
> >> I'm really puzzled now. We have a few pieces of information:
> >>
> >> - git bisection pointed the commit 96b6352c1271:
> >>       x86_64, entry: Remove the syscall exit audit and schedule optimizations
> >>   and reverting this "fixes" the problem indeed. Even just moving two
> >>   lines
> >>       LOCKDEP_SYS_EXIT
> >>       DISABLE_INTERRUPTS(CLBR_NONE)
> >>   at the beginning of ret_from_sys_call already fixes. (Of course I
> >>   can't prove the fix but it stabilizes for a day without crash while
> >>   usually I hit the bug in 10 minutes in full test running.)
> >
> > The commit 96b6352c1271 moved TIF_ALLWORK_MASK check from
> > interrupt-disabled region to interrupt-enabled:
> >
> >         cmpq $__NR_syscall_max,%rax
> >         ja ret_from_sys_call
> >         movq %r10,%rcx
> >         call *sys_call_table(,%rax,8)  # XXX: rip relative
> >         movq %rax,RAX-ARGOFFSET(%rsp)
> > ret_from_sys_call:
> >         testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
> >         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >         jnz int_ret_from_sys_call_fixup /* Go the the slow path */
> >         LOCKDEP_SYS_EXIT
> >         DISABLE_INTERRUPTS(CLBR_NONE)
> >         TRACE_IRQS_OFF
> >         ...
> >         ...
> > int_ret_from_sys_call_fixup:
> >         FIXUP_TOP_OF_STACK %r11, -ARGOFFSET
> >         jmp int_ret_from_sys_call
> >         ...
> >         ...
> > GLOBAL(int_ret_from_sys_call)
> >         DISABLE_INTERRUPTS(CLBR_NONE)
> >         TRACE_IRQS_OFF
> >
> > You reverted that by moving this insn to be after first
> > DISABLE_INTERRUPTS(CLBR_NONE).
> >
> > I also don't see how moving that check (even if it is wrong in a more
> > benign way) can have such a drastic effect.
> 
> I bet I see it. I have the advantage of having stared at KVM code and
> cursed at it more recently than you, I suspect. KVM does awful, awful
> things to CPU state, and, as an optimization, it allows kernel code to
> run with CPU state that would be totally invalid in user mode. This
> happens through a bunch of hooks, including this bit in __switch_to:
> 
>         /*
>          * Now maybe reload the debug registers and handle I/O bitmaps
>          */
>         if (unlikely(task_thread_info(next_p)->flags & _TIF_WORK_CTXSW_NEXT ||
>                      task_thread_info(prev_p)->flags & _TIF_WORK_CTXSW_PREV))
>                 __switch_to_xtra(prev_p, next_p, tss);
> 
> IOW, we *change* tif during context switches.
> 
> The race looks like this:
> 
>         testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP)
>         jnz int_ret_from_sys_call_fixup /* Go the the slow path */
> 
> --- preempted here, switch to KVM guest ---
> 
> KVM guest enters and screws up, say, MSR_SYSCALL_MASK. This wouldn't
> happen to be a *32-bit* KVM guest, perhaps?
> 
> Now KVM schedules, calling __switch_to. __switch_to sets
> _TIF_USER_RETURN_NOTIFY. We IRET back to the syscall exit code, turn
> off interrupts, and do sysret. We are now screwed.
Thanks for enlightening! That looks like a feasible scenario.
(I tested only a 64-bit KVM guest, BTW.)

> I don't know why this manifests in this particular failure, but any
> number of terrible things could happen now.
> 
> FWIW, this will affect things other than KVM. For example, SIGKILL
> sent while a process is sleeping in that two-instruction window won't
> work.
> 
> Takashi, can you re-send your patch so we can review it for real in
> light of this race?

The patch below worked. I'll double-check tomorrow whether this really
cures reliably.


thanks,

Takashi

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 1d74d161687c..5340ac7f88a9 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -364,12 +364,12 @@ system_call_fastpath:
  * Has incomplete stack frame and undefined top of stack.
  */
 ret_from_sys_call:
-	testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
-	jnz int_ret_from_sys_call_fixup	/* Go the the slow path */
-
 	LOCKDEP_SYS_EXIT
 	DISABLE_INTERRUPTS(CLBR_NONE)
 	TRACE_IRQS_OFF
+	testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
+	jnz int_ret_from_sys_call_fixup	/* Go the the slow path */
+
 	CFI_REMEMBER_STATE
 	/*
 	 * sysretq will re-enable interrupts: