> Why does the unwinding need to happen in the kernel?  The kernel can
> already asynchronously invoke userspace code in the form of signal
> handlers.  Is the problem that it is necessary to collect profiling
> information in the middle of a system call, where another syscall
> would see inconsistent (and potentially exploitable) kernel state?

One of the primary values of system-wide profiling is being able to see how a 
particular library call might have caused undesirable code paths due to a 
syscall, and where/what was reached (given high enough sampling rate).

Does that need to happen in kernel space? I don't know, perse, other than perf 
needs to be able to do that work as it is what gives us the array of 
instruction pointers back. There was some chatter a number of years ago in perf 
about how to handle ORC from user-space, and if I'm summing this up correctly, 
it was basically..

 - When sampling in PERF_CONTEXT_KERNEL, stop unwinding at the syscall boundary
 - Append stacktrace samples to perf buffer ring
 - Upon rescheduling, backtrace a single time into user-space, and expect the 
consumer to know that N previous samples with matching task-id all have the 
user-space backtrace.

That's a pretty significant behavior change, and all tools would need surgery 
to support it. I have no idea if that is paletable to either side of the 
debate, but it was the one possible direction I saw.

It does have a number of pros, in that you can save a lot of unwinding time on 
syscall-heavy workloads by doing user-space unwinding once, and from scheduler 
task queues (so you can take faults), and can avoid the NMI context being the 
cost-center for accounting. But the cons are significant in that the behavior 
change is expansive, effects all tooling, and will require ORC data across the 
platform.

> Ouch.  That is a serious problem for a number of reasons, not least
> of which is security.  Having the kernel parse even more complex
> untrusted input in C is a horrible idea.

It might seem that way by the description I gave, but we're just talking an 
array of intptr_t or similar. There is no dereferencing or state machines like 
you have with DWARF. Runtime resolution is also essentially bsearch() on 
interval arrays. I really don't think it's the sort of thing that requires Rust.

As for eBPF, we'd still probably be in NMI context with this route, and would 
fail if we had to page in ORC tables. So that means we'd either have to take a 
per-task memory overhead to maintain the mutated form (probably unreasonable) 
or find a way for that to be done from the task's space when returning from the 
syscall.

> Christian, would this be sufficient for your needs?

I don't think so without significant work. The best case I see here is for perf 
to support user-space unwinding within the task, be it ORC or DWARF, and that 
unwinder not have to necessarily be in-tree with the kernel because we know 
they won't accept a DWARF unwinder again.

It does pose some questions on what would happen with carefully crafted DWARF 
data.
_______________________________________________
devel mailing list -- devel@lists.fedoraproject.org
To unsubscribe send an email to devel-le...@lists.fedoraproject.org
Fedora Code of Conduct: 
https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org
Do not reply to spam on the list, report it: 
https://pagure.io/fedora-infrastructure

Reply via email to