On Mon, May 11, 2026 at 08:00:21PM -0700, Dylan Hatch wrote: > Hi Mark, > > Thanks for all the feedback and help on this. I'm planning on getting > your comments addressed in the coming days, but I have some initial > clarifying questions. > > On Fri, May 1, 2026 at 9:46 AM Mark Rutland <[email protected]> wrote: > > > > Hi Dylan, > > > > Thanks for putting this together. I think this is looking pretty good. > > However, there are some things that aren't quite right and need some > > work, which I've commented on below. > > > > More generally, there are a few things that aren't addressed by this > > series that we will also need to address. Importantly: > > > > (1) For correctness, we'll need to address a latent issue with unwinding > > across an fgraph return trampoline, where the return address is > > transiently unrecoverable. > > > > Before this series, that doesn't matter for livepatching because the > > livepatching code isn't called synchronously within the fgraph > > handler, and unwinds which cross an exception boundary are marked as > > unreliable. > > > > After this series, that does matter as we can unwind across an > > exception boundary, and might happen to interrupt that transient > > window. > > > > I think we can solve that with some restructuring of that code, > > restoring the original address *before* removing that from the > > fgraph return stack, and ensuring that the unwinder can find it. > > If my understanding is correct, the issue arrises in return_to_handler > as the return address is recovered: > > mov x0, sp > bl ftrace_return_to_handler // addr = ftrace_return_to_hander(fregs); > mov x30, x0 // restore the original return address > > Because ftrace_return_to_handler pops the return address from the > return stack before it can be restored into the LR, it cannot be > recovered.
Yes. To be clear, please don't worry about solving that for the next version of this series; let's get the SFrame bits into shape first. I just wanted to highlight that there's some more general work that we'll need to do. I think we can *detect* this case (and mark the unwind as unreliable) with some tiny changes to the arm64 code. I'm happy to put that together. > Based on this, I believe you are suggesting to restructure this code > path such that the return address is removed from the return stack > only after it has been restored to LR. But since kernel/trace/fgraph.c > is core kernel code, will this end up requiring either (1) a similar > restructuring of other arches supporting ftrace, or (2) an > arm64-specific implementation of this recovery logic? Yes, I am say that to *recover* the address we'd need to make changes to core code. In the mean time we can *detect* this case with some minimal changes to arm64 code, and abort. As above, I'm happy to go put that together. > It looks to me like there is essentially the same recovery pattern on > other arches; is there a reason this transient unrecoverability isn't > an issue for reliable unwind on other platforms? Yep; on all architectures there's a transient period where the address cannot be recovered. It's not a correctness issue so long as the architecture detects this case and marks the unwind as unreliable. IIUC x86 will mark the unwind as unreliable in this case. I don't know whether other architectures detect this reliably. That's a question for loonarch, parisc, powerpc, and s390 folk. > > I'm not immediately sure whether kretprobes has a similar issue. > > > > (2) To make unwinding generally possible, we'll need to annotate some > > assembly functions as unwindable. We'll need to do that for string > > routines under lib/, and probably some crypto code, but we don't > > need to do that for most code in head.S, entry.S, etc. > > > > The vast majority of relevant assembly functions are leaf functions > > (where the return address is never moved out of the LR), so we can > > probably get away with a simple annotation for those that avoids the > > need for open-coded CFI directives everywhere. > > Are you suggesting something like a SYM_LEAF_FUNC_(START|END), that > wraps CFI directives for leaf functions? Yep; that's exactly the sort of thing I was thinking of. That or have a seaprate annotation we can add, e.g. SYM_FUNC_START(foo) SYM_FUN_END(foo) SYM_FUNC_IS_LEAF_AND_DOES_NOT_TOUCH_LR(foo) > > I've pushed some reliable stacktrace tests to: > > > > git://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git > > stacktrace/tests > > > > That finds the fgraph issue (regardless of this series). When merged > > with this series triggers a warning in kunwind_next_frame_record_meta(), > > where unwind_next_frame_sframe() calls that erroneously as a fallback. > > Thanks for the pointer on these tests, they're super useful! I've been > able to reproduce the fgraph failure you mentioned. Great! Mark.

