On Mon, May 11, 2026 at 08:00:21PM -0700, Dylan Hatch wrote:
> Hi Mark,
> 
> Thanks for all the feedback and help on this. I'm planning on getting
> your comments addressed in the coming days, but I have some initial
> clarifying questions.
> 
> On Fri, May 1, 2026 at 9:46 AM Mark Rutland <[email protected]> wrote:
> >
> > Hi Dylan,
> >
> > Thanks for putting this together. I think this is looking pretty good.
> > However, there are some things that aren't quite right and need some
> > work, which I've commented on below.
> >
> > More generally, there are a few things that aren't addressed by this
> > series that we will also need to address. Importantly:
> >
> > (1) For correctness, we'll need to address a latent issue with unwinding
> >     across an fgraph return trampoline, where the return address is
> >     transiently unrecoverable.
> >
> >     Before this series, that doesn't matter for livepatching because the
> >     livepatching code isn't called synchronously within the fgraph
> >     handler, and unwinds which cross an exception boundary are marked as
> >     unreliable.
> >
> >     After this series, that does matter as we can unwind across an
> >     exception boundary, and might happen to interrupt that transient
> >     window.
> >
> >     I think we can solve that with some restructuring of that code,
> >     restoring the original address *before* removing that from the
> >     fgraph return stack, and ensuring that the unwinder can find it.
> 
> If my understanding is correct, the issue arrises in return_to_handler
> as the return address is recovered:
> 
> mov x0, sp
> bl ftrace_return_to_handler // addr = ftrace_return_to_hander(fregs);
> mov x30, x0 // restore the original return address
> 
> Because ftrace_return_to_handler pops the return address from the
> return stack before it can be restored into the LR, it cannot be
> recovered.

Yes.

To be clear, please don't worry about solving that for the next version
of this series; let's get the SFrame bits into shape first. I just
wanted to highlight that there's some more general work that we'll need
to do.

I think we can *detect* this case (and mark the unwind as unreliable)
with some tiny changes to the arm64 code. I'm happy to put that
together.

> Based on this, I believe you are suggesting to restructure this code
> path such that the return address is removed from the return stack
> only after it has been restored to LR. But since kernel/trace/fgraph.c
> is core kernel code, will this end up requiring either (1) a similar
> restructuring of other arches supporting ftrace, or (2) an
> arm64-specific implementation of this recovery logic?

Yes, I am say that to *recover* the address we'd need to make changes to
core code.

In the mean time we can *detect* this case with some minimal changes to
arm64 code, and abort. As above, I'm happy to go put that together.

> It looks to me like there is essentially the same recovery pattern on
> other arches; is there a reason this transient unrecoverability isn't
> an issue for reliable unwind on other platforms?

Yep; on all architectures there's a transient period where the address
cannot be recovered. It's not a correctness issue so long as the
architecture detects this case and marks the unwind as unreliable.

IIUC x86 will mark the unwind as unreliable in this case.

I don't know whether other architectures detect this reliably. That's a
question for loonarch, parisc, powerpc, and s390 folk.

> >     I'm not immediately sure whether kretprobes has a similar issue.
> >
> > (2) To make unwinding generally possible, we'll need to annotate some
> >     assembly functions as unwindable. We'll need to do that for string
> >     routines under lib/, and probably some crypto code, but we don't
> >     need to do that for most code in head.S, entry.S, etc.
> >
> >     The vast majority of relevant assembly functions are leaf functions
> >     (where the return address is never moved out of the LR), so we can
> >     probably get away with a simple annotation for those that avoids the
> >     need for open-coded CFI directives everywhere.
> 
> Are you suggesting something like a SYM_LEAF_FUNC_(START|END), that
> wraps CFI directives for leaf functions?

Yep; that's exactly the sort of thing I was thinking of.

That or have a seaprate annotation we can add, e.g.

SYM_FUNC_START(foo)
SYM_FUN_END(foo)
SYM_FUNC_IS_LEAF_AND_DOES_NOT_TOUCH_LR(foo)

> > I've pushed some reliable stacktrace tests to:
> >
> >   git://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git 
> > stacktrace/tests
> >
> > That finds the fgraph issue (regardless of this series). When merged
> > with this series triggers a warning in kunwind_next_frame_record_meta(),
> > where unwind_next_frame_sframe() calls that erroneously as a fallback.
> 
> Thanks for the pointer on these tests, they're super useful! I've been
> able to reproduce the fgraph failure you mentioned.

Great!

Mark.

Reply via email to