On Thu, Sep 18, 2025 at 07:24:14PM +0200, Peter Zijlstra wrote:
> So we have:
>
> do_syscall_64()
>   ... do stuff ...
>   syscall_exit_to_user_mode(regs)
>     syscall_exit_to_user_mode_work(regs)
>       syscall_exit_work()
>       exit_to_user_mode_prepare()
>         exit_to_user_mode_loop()
>           resume_user_mode_work()
>             task_work_run()
>     exit_to_user_mode()
>       unwind_reset_info();
>       user_enter_irqoff();
>       arch_exit_to_user_mode();
>       lockdep_hardirqs_on();
>   SYSRET/IRET
>
>
> and
>
> DEFINE_IDTENTRY*()
>   irqentry_enter();
>   ... stuff ...
>   irqentry_exit()
>     irqentry_exit_to_user_mode()
>       exit_to_user_mode_prepare()
>         exit_to_user_mode_loop();
>           resume_user_mode_work()
>             task_work_run()
>       exit_to_user_mode()
>         unwind_reset_info();
>         ...
>   IRET
>
> Now, task_work_run() is in the exit_to_user_mode_loop() which is notably
> *before* exit_to_user_mode() which does the unwind_reset_info().
>
> What happens if we get an NMI requesting an unwind after
> unwind_reset_info() while still very much being in the kernel on the way
> out?

AFAICT it will try and do a task_work_add(TWA_RESUME) from NMI context,
and this will fail horribly.

If you do something like:

	twa_mode = in_nmi() ? TWA_NMI_CURRENT : TWA_RESUME;
	task_work_add(foo, twa_mode);

it might actually work.