Re: [RFC PATCH 0/4] perf: Correlating user process data to samples
On Sat, Apr 13, 2024 at 08:48:57AM -0400, Steven Rostedt wrote:
> On Sat, 13 Apr 2024 12:53:38 +0200
> Peter Zijlstra wrote:
>
> > On Fri, Apr 12, 2024 at 09:37:24AM -0700, Beau Belgrave wrote:
> >
> > > > Anyway, since we typically run stuff from NMI context, accessing user
> > > > data is 'interesting'. As such I would really like to make this work
> > > > depend on the call-graph rework that pushes all the user access bits
> > > > into return-to-user.
> > >
> > > Cool, I assume that's the SFRAME work? Are there pointers to work I
> > > could look at and think about what a rebase looks like? Or do you have
> > > someone in mind I should work with for this?
> >
> > I've been offline for a little while and still need to catch up with
> > things myself.
> >
> > Josh was working on that when I dropped off IIRC, I'm not entirely sure
> > where things are at currently (and there is no way I can ever hope to
> > process the backlog).
> >
> > Anybody know where we are with that?
>
> It's still very much on my RADAR, but with layoffs and such, my
> priorities have unfortunately changed. I'm hoping to start helping out
> in the near future though (in a month or two).
>
> Josh was working on it, but I think he got pulled off onto other
> priorities too :-p

Yeah, this is still a priority for me and I hope to get back to it over
the next few weeks (crosses fingers).

--
Josh
Re: [RFC PATCH 0/4] perf: Correlating user process data to samples
On Sat, 13 Apr 2024 12:53:38 +0200
Peter Zijlstra wrote:

> On Fri, Apr 12, 2024 at 09:37:24AM -0700, Beau Belgrave wrote:
>
> > > Anyway, since we typically run stuff from NMI context, accessing user
> > > data is 'interesting'. As such I would really like to make this work
> > > depend on the call-graph rework that pushes all the user access bits
> > > into return-to-user.
> >
> > Cool, I assume that's the SFRAME work? Are there pointers to work I
> > could look at and think about what a rebase looks like? Or do you have
> > someone in mind I should work with for this?
>
> I've been offline for a little while and still need to catch up with
> things myself.
>
> Josh was working on that when I dropped off IIRC, I'm not entirely sure
> where things are at currently (and there is no way I can ever hope to
> process the backlog).
>
> Anybody know where we are with that?

It's still very much on my RADAR, but with layoffs and such, my
priorities have unfortunately changed. I'm hoping to start helping out
in the near future though (in a month or two).

Josh was working on it, but I think he got pulled off onto other
priorities too :-p

--
Steve
Re: [RFC PATCH 0/4] perf: Correlating user process data to samples
On Fri, Apr 12, 2024 at 09:37:24AM -0700, Beau Belgrave wrote:

> > Anyway, since we typically run stuff from NMI context, accessing user
> > data is 'interesting'. As such I would really like to make this work
> > depend on the call-graph rework that pushes all the user access bits
> > into return-to-user.
>
> Cool, I assume that's the SFRAME work? Are there pointers to work I
> could look at and think about what a rebase looks like? Or do you have
> someone in mind I should work with for this?

I've been offline for a little while and still need to catch up with
things myself.

Josh was working on that when I dropped off IIRC, I'm not entirely sure
where things are at currently (and there is no way I can ever hope to
process the backlog).

Anybody know where we are with that?
Re: [RFC PATCH 0/4] perf: Correlating user process data to samples
On 2024-04-12 12:28, Beau Belgrave wrote:
> On Thu, Apr 11, 2024 at 09:52:22PM -0700, Ian Rogers wrote:
> > On Thu, Apr 11, 2024 at 5:17 PM Beau Belgrave wrote:
> > >
> > > In the Open Telemetry profiling SIG [1], we are trying to find a way to
> > > grab a tracing association quickly on a per-sample basis. The team at
> > > Elastic has a bespoke way to do this [2], however, I'd like to see a
> > > more general way to achieve this. The folks I've been talking with seem
> > > open to the idea of just having a TLS value for this we could capture
> >
> > Presumably TLS == Thread Local Storage.
>
> Yes, the initial idea is to use thread local storage (TLS). It seems to
> be the fastest option to save a per-thread value that changes at a fast
> rate.
>
> > > upon each sample. We could then just state, Open Telemetry SDKs should
> > > have a TLS value for span correlation. However, we need a way to sample
> > > the TLS or other value(s) when a sampling event is generated. This is
> > > supported today on Windows via EventActivityIdControl() [3]. Since
> > > Open Telemetry works on both Windows and Linux, ideally we can do
> > > something as efficient for Linux based workloads.
> > >
> > > This series is to explore how it would be best possible to collect
> > > supporting data from a user process when a profile sample is collected.
> > > Having a value stored in TLS makes a lot of sense for this however
> > > there are other ways to explore. Whatever is chosen, kernel samples
> > > taken in process context should be able to get this supporting data.
> > > In these patches on X64 the fsbase and gsbase are used for this.
> > >
> > > An option to explore suggested by Mathieu Desnoyers is to utilize rseq
> > > for processes to register a value location that can be included when
> > > profiling if desired. This would allow a tighter contract between user
> > > processes and a profiler. It would allow better labeling/categorizing
> > > the correlation values.
> >
> > It is hard to understand this idea. Are you saying stash a cookie in
> > TLS for samples to capture to indicate an activity? Restartable
> > sequences are about preemption on a CPU not of a thread, so at least
> > my intuition is that they feel different. You could stash information
> > like this today by changing the thread name which generates comm
> > events. I've wondered about having similar information in some form of
> > reserved for profiling stack slot, for example, to stash a pointer to
> > the name of a function being interpreted. Snapshotting all of a stack
> > is bad performance wise and for security. A stack slot would be able
> > to deal with nesting.
>
> You are getting the idea. A slot or tag for a thread would be great! I'm
> not a fan of overriding the thread comm name (as that already has a
> use). TLS would be fine, if we could also pass an offset + size + type.
>
> Maybe a stack slot that just points to parts of TLS? That way you could
> have a set of slots that don't require much memory and selectively copy
> them out of TLS (or where ever those slots point to in user memory).
>
> When I was talking to Mathieu about this, it seems that rseq already had
> a place to potentially put these slots. I'm unsure though how the per
> thread aspects would work.
>
> Mathieu, can you post your ideas here about that?

Sure. I'll try to summarize my thoughts here. By all means, let me
know if I'm missing important pieces of the puzzle.

First of all, here is my understanding of what information we want to
share between userspace and kernel.

A 128-bit activity ID identifies "uniquely" (as far as a 128-bit random
UUID allows) a portion of the dependency chain involved in doing some
work (e.g. answer a HTTP request) across one or many participating hosts.

Activity IDs have a parent/child relationship: a parent activity ID can
create children activity IDs.

For instance, if one host has the service "dispatch", another host has a
"web server", and a third host has a SQL database, we should be able to
follow the chain of activities needed to answer a web query by following
those activity IDs, linking them together through parent/child
relationships. This usually requires the communication protocols to
convey those activity IDs across hosts.

The reason why this information must be provided from userspace is
because it's userspace that knows where to find those activity IDs
within its application-layer communication protocols.

With tracing, taking a full trace of the activity ID spans begin/end
from all hosts allow reconstructing the activity IDs parent/child
relationships, so we typically only need to extract information about
activity ID span begin/end with parent/child info to a tracer.

Using activity IDs from a kernel profiler is trickier, because we do not
have access to the complete span begin/end trace to reconstruct the
activity ID parent/child relationship. This is where I suspect we'd want
to introduce a notion of "activity ID stack", so a profiler could
reconstruct the currently active stack of activity IDs for the current
thread by walking that stack.

This profiling could be triggered either from an interrupt (sampling
use-case), which would then walk
Re: [RFC PATCH 0/4] perf: Correlating user process data to samples
On Fri, Apr 12, 2024 at 09:12:45AM +0200, Peter Zijlstra wrote:
>
> On Fri, Apr 12, 2024 at 12:17:28AM +, Beau Belgrave wrote:
>
> > An idea flow would look like this:
> > User Task                Profile
> > do_work();               sample() -> IP + No activity
> > ...
> > set_activity(123);
> > ...
> > do_work();               sample() -> IP + activity (123)
> > ...
> > set_activity(124);
> > ...
> > do_work();               sample() -> IP + activity (124)
>
> This, start with this, because until I saw this, I was utterly confused
> as to what the heck you were on about.
>

Will do.

> I started by thinking we already have TID in samples so you can already
> associate back to user processes and got increasingly confused the
> further I went.
>
> What you seem to want to do however is have some task-state included so
> you can see what the thread is doing.
>

Yeah, there is typically an external context (not on the machine) that
wants to be tied to each sample. The context could be a simple integer,
UUID, or something else entirely. For OTel, this is a 16-byte array [1].

> Anyway, since we typically run stuff from NMI context, accessing user
> data is 'interesting'. As such I would really like to make this work
> depend on the call-graph rework that pushes all the user access bits
> into return-to-user.

Cool, I assume that's the SFRAME work? Are there pointers to work I
could look at and think about what a rebase looks like? Or do you have
someone in mind I should work with for this?

Thanks,
-Beau

1. https://www.w3.org/TR/trace-context/#version-format
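To make the "16-byte array" above concrete, here is a minimal sketch of
pulling the trace-id out of a W3C Trace Context traceparent value (the
format behind link [1]). It is only an illustration and is not part of
the series: the names hex_nibble(), traceparent_trace_id() and
TRACE_ID_LEN are made up here, and only the version-00 layout is
handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TRACE_ID_LEN 16 /* the trace-id is 16 bytes (32 hex characters) */

/* Decode one lowercase hex digit; the format only allows lowercase. */
static int hex_nibble(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	return -1;
}

/*
 * Hypothetical helper: extract the trace-id from a version-00 value,
 *   "00-" <32 hex trace-id> "-" <16 hex parent-id> "-" <2 hex flags>
 * Returns 0 on success, -1 if the value is malformed or the trace-id is
 * all zeroes. The parent-id and flags fields are not validated here.
 */
static int traceparent_trace_id(const char *tp, uint8_t out[TRACE_ID_LEN])
{
	uint8_t seen = 0;
	size_t i;

	assert(tp != NULL && out != NULL);

	if (strlen(tp) != 55 || strncmp(tp, "00-", 3) != 0 ||
	    tp[35] != '-' || tp[52] != '-')
		return -1;

	for (i = 0; i < TRACE_ID_LEN; i++) {
		int hi = hex_nibble(tp[3 + 2 * i]);
		int lo = hex_nibble(tp[4 + 2 * i]);

		if (hi < 0 || lo < 0)
			return -1;
		out[i] = (uint8_t)((hi << 4) | lo);
		seen |= out[i];
	}

	return seen ? 0 : -1;
}
```

The resulting 16 bytes are the kind of value an SDK would publish per
thread for a sample to pick up.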
Re: [RFC PATCH 0/4] perf: Correlating user process data to samples
On Thu, Apr 11, 2024 at 09:52:22PM -0700, Ian Rogers wrote:
> On Thu, Apr 11, 2024 at 5:17 PM Beau Belgrave wrote:
> >
> > In the Open Telemetry profiling SIG [1], we are trying to find a way to
> > grab a tracing association quickly on a per-sample basis. The team at
> > Elastic has a bespoke way to do this [2], however, I'd like to see a
> > more general way to achieve this. The folks I've been talking with seem
> > open to the idea of just having a TLS value for this we could capture
>
> Presumably TLS == Thread Local Storage.
>

Yes, the initial idea is to use thread local storage (TLS). It seems to
be the fastest option to save a per-thread value that changes at a fast
rate.

> > upon each sample. We could then just state, Open Telemetry SDKs should
> > have a TLS value for span correlation. However, we need a way to sample
> > the TLS or other value(s) when a sampling event is generated. This is
> > supported today on Windows via EventActivityIdControl() [3]. Since
> > Open Telemetry works on both Windows and Linux, ideally we can do
> > something as efficient for Linux based workloads.
> >
> > This series is to explore how it would be best possible to collect
> > supporting data from a user process when a profile sample is collected.
> > Having a value stored in TLS makes a lot of sense for this however
> > there are other ways to explore. Whatever is chosen, kernel samples
> > taken in process context should be able to get this supporting data.
> > In these patches on X64 the fsbase and gsbase are used for this.
> >
> > An option to explore suggested by Mathieu Desnoyers is to utilize rseq
> > for processes to register a value location that can be included when
> > profiling if desired. This would allow a tighter contract between user
> > processes and a profiler. It would allow better labeling/categorizing
> > the correlation values.
>
> It is hard to understand this idea. Are you saying stash a cookie in
> TLS for samples to capture to indicate an activity? Restartable
> sequences are about preemption on a CPU not of a thread, so at least
> my intuition is that they feel different. You could stash information
> like this today by changing the thread name which generates comm
> events. I've wondered about having similar information in some form of
> reserved for profiling stack slot, for example, to stash a pointer to
> the name of a function being interpreted. Snapshotting all of a stack
> is bad performance wise and for security. A stack slot would be able
> to deal with nesting.
>

You are getting the idea. A slot or tag for a thread would be great! I'm
not a fan of overriding the thread comm name (as that already has a
use). TLS would be fine, if we could also pass an offset + size + type.

Maybe a stack slot that just points to parts of TLS? That way you could
have a set of slots that don't require much memory and selectively copy
them out of TLS (or where ever those slots point to in user memory).

When I was talking to Mathieu about this, it seems that rseq already had
a place to potentially put these slots. I'm unsure though how the per
thread aspects would work.

Mathieu, can you post your ideas here about that?

> > An idea flow would look like this:
> > User Task                Profile
> > do_work();               sample() -> IP + No activity
> > ...
> > set_activity(123);
> > ...
> > do_work();               sample() -> IP + activity (123)
> > ...
> > set_activity(124);
> > ...
> > do_work();               sample() -> IP + activity (124)
> >
> > Ideally, the set_activity() method would not be a syscall. It needs to
> > be very cheap as this should not bottleneck work. Ideally this is just
> > a memcpy of 16-20 bytes as it is on Windows via EventActivityIdControl()
> > using EVENT_ACTIVITY_CTRL_SET_ID.
> >
> > For those not aware, Open Telemetry allows collecting data from multiple
> > machines and show where time was spent. The tracing context is already
> > available for logs, but not for profiling samples. The idea is to show
> > where slowdowns occur and have profile samples to explain why they
> > slowed down. This must be possible without having to track context
> > switches to do this correlation. This is because the profiling rates
> > are typically 20hz - 1Khz, while the context switching rates are much
> > higher. We do not want to have to consume high context switch rates
> > just to know a correlation for a 20hz signal. Often these 20hz signals
> > are always enabled in some environments.
> >
> > Regardless if TLS, rseq, or other source is used I believe we will need
> > a way for perf_events to include it within a sample. The changes in this
> > series show how it could be done with TLS. There is some factoring work
> > under perf to make it easier to add more dump types using the existing
> > ABI. This is mostly to make the patches clearer, certainly the refactor
> > parts could get dropped and we could have duplicated/specialized paths.
>
> fs and gs may be used
Re: [RFC PATCH 0/4] perf: Correlating user process data to samples
On Fri, Apr 12, 2024 at 12:17:28AM +, Beau Belgrave wrote:

> An idea flow would look like this:
> User Task                Profile
> do_work();               sample() -> IP + No activity
> ...
> set_activity(123);
> ...
> do_work();               sample() -> IP + activity (123)
> ...
> set_activity(124);
> ...
> do_work();               sample() -> IP + activity (124)

This, start with this, because until I saw this, I was utterly confused
as to what the heck you were on about.

I started by thinking we already have TID in samples so you can already
associate back to user processes and got increasingly confused the
further I went.

What you seem to want to do however is have some task-state included so
you can see what the thread is doing.

Anyway, since we typically run stuff from NMI context, accessing user
data is 'interesting'. As such I would really like to make this work
depend on the call-graph rework that pushes all the user access bits
into return-to-user.
Re: [RFC PATCH 0/4] perf: Correlating user process data to samples
On Thu, Apr 11, 2024 at 5:17 PM Beau Belgrave wrote:
>
> In the Open Telemetry profiling SIG [1], we are trying to find a way to
> grab a tracing association quickly on a per-sample basis. The team at
> Elastic has a bespoke way to do this [2], however, I'd like to see a
> more general way to achieve this. The folks I've been talking with seem
> open to the idea of just having a TLS value for this we could capture

Presumably TLS == Thread Local Storage.

> upon each sample. We could then just state, Open Telemetry SDKs should
> have a TLS value for span correlation. However, we need a way to sample
> the TLS or other value(s) when a sampling event is generated. This is
> supported today on Windows via EventActivityIdControl() [3]. Since
> Open Telemetry works on both Windows and Linux, ideally we can do
> something as efficient for Linux based workloads.
>
> This series is to explore how it would be best possible to collect
> supporting data from a user process when a profile sample is collected.
> Having a value stored in TLS makes a lot of sense for this however
> there are other ways to explore. Whatever is chosen, kernel samples
> taken in process context should be able to get this supporting data.
> In these patches on X64 the fsbase and gsbase are used for this.
>
> An option to explore suggested by Mathieu Desnoyers is to utilize rseq
> for processes to register a value location that can be included when
> profiling if desired. This would allow a tighter contract between user
> processes and a profiler. It would allow better labeling/categorizing
> the correlation values.

It is hard to understand this idea. Are you saying stash a cookie in
TLS for samples to capture to indicate an activity? Restartable
sequences are about preemption on a CPU not of a thread, so at least
my intuition is that they feel different. You could stash information
like this today by changing the thread name which generates comm
events. I've wondered about having similar information in some form of
reserved for profiling stack slot, for example, to stash a pointer to
the name of a function being interpreted. Snapshotting all of a stack
is bad performance wise and for security. A stack slot would be able
to deal with nesting.

> An idea flow would look like this:
> User Task                Profile
> do_work();               sample() -> IP + No activity
> ...
> set_activity(123);
> ...
> do_work();               sample() -> IP + activity (123)
> ...
> set_activity(124);
> ...
> do_work();               sample() -> IP + activity (124)
>
> Ideally, the set_activity() method would not be a syscall. It needs to
> be very cheap as this should not bottleneck work. Ideally this is just
> a memcpy of 16-20 bytes as it is on Windows via EventActivityIdControl()
> using EVENT_ACTIVITY_CTRL_SET_ID.
>
> For those not aware, Open Telemetry allows collecting data from multiple
> machines and show where time was spent. The tracing context is already
> available for logs, but not for profiling samples. The idea is to show
> where slowdowns occur and have profile samples to explain why they
> slowed down. This must be possible without having to track context
> switches to do this correlation. This is because the profiling rates
> are typically 20hz - 1Khz, while the context switching rates are much
> higher. We do not want to have to consume high context switch rates
> just to know a correlation for a 20hz signal. Often these 20hz signals
> are always enabled in some environments.
>
> Regardless if TLS, rseq, or other source is used I believe we will need
> a way for perf_events to include it within a sample. The changes in this
> series show how it could be done with TLS. There is some factoring work
> under perf to make it easier to add more dump types using the existing
> ABI. This is mostly to make the patches clearer, certainly the refactor
> parts could get dropped and we could have duplicated/specialized paths.

fs and gs may be used for more than just the C runtime's TLS. For
example, they may be used by emulators or managed runtimes. I'm not
clear why this specific case couldn't be handled through BPF.

Thanks,
Ian

> 1. https://opentelemetry.io/blog/2024/profiling/
> 2.
> https://www.elastic.co/blog/continuous-profiling-distributed-tracing-correlation
> 3.
> https://learn.microsoft.com/en-us/windows/win32/api/evntprov/nf-evntprov-eventactivityidcontrol
>
> Beau Belgrave (4):
>   perf/core: Introduce perf_prepare_dump_data()
>   perf: Introduce PERF_SAMPLE_TLS_USER sample type
>   perf/core: Factor perf_output_sample_udump()
>   perf/x86/core: Add tls dump support
>
>  arch/Kconfig                      |   7 ++
>  arch/x86/Kconfig                  |   1 +
>  arch/x86/events/core.c            |  14 +++
>  arch/x86/include/asm/perf_event.h |   5 +
>  include/linux/perf_event.h        |   7 ++
>  include/uapi/linux/perf_event.h   |   5 +-
>  kernel/events/core.c              | 166 +++---
>
[RFC PATCH 0/4] perf: Correlating user process data to samples
In the Open Telemetry profiling SIG [1], we are trying to find a way to
grab a tracing association quickly on a per-sample basis. The team at
Elastic has a bespoke way to do this [2]; however, I'd like to see a
more general way to achieve this. The folks I've been talking with seem
open to the idea of just having a TLS (thread local storage) value for
this that we could capture upon each sample. We could then just state
that Open Telemetry SDKs should have a TLS value for span correlation.
However, we need a way to sample the TLS or other value(s) when a
sampling event is generated. This is supported today on Windows via
EventActivityIdControl() [3]. Since Open Telemetry works on both Windows
and Linux, ideally we can do something as efficient for Linux-based
workloads.

This series explores how best to collect supporting data from a user
process when a profile sample is collected. Having a value stored in TLS
makes a lot of sense for this; however, there are other ways to explore.
Whatever is chosen, kernel samples taken in process context should be
able to get this supporting data. In these patches, on X64 the fsbase
and gsbase are used for this.

An option to explore, suggested by Mathieu Desnoyers, is to utilize rseq
for processes to register a value location that can be included when
profiling if desired. This would allow a tighter contract between user
processes and a profiler. It would allow better labeling/categorizing of
the correlation values.

An ideal flow would look like this:
User Task                Profile
do_work();               sample() -> IP + No activity
...
set_activity(123);
...
do_work();               sample() -> IP + activity (123)
...
set_activity(124);
...
do_work();               sample() -> IP + activity (124)

Ideally, the set_activity() method would not be a syscall. It needs to
be very cheap as this should not bottleneck work. Ideally this is just
a memcpy of 16-20 bytes as it is on Windows via EventActivityIdControl()
using EVENT_ACTIVITY_CTRL_SET_ID.

For those not aware, Open Telemetry allows collecting data from multiple
machines and showing where time was spent. The tracing context is
already available for logs, but not for profiling samples. The idea is
to show where slowdowns occur and have profile samples to explain why
they slowed down. This must be possible without having to track context
switches to do this correlation. This is because the profiling rates
are typically 20 Hz - 1 kHz, while the context switching rates are much
higher. We do not want to have to consume high context switch rates
just to know a correlation for a 20 Hz signal. In some environments
these 20 Hz signals are always enabled.

Regardless of whether TLS, rseq, or another source is used, I believe we
will need a way for perf_events to include it within a sample. The
changes in this series show how it could be done with TLS. There is some
factoring work under perf to make it easier to add more dump types using
the existing ABI. This is mostly to make the patches clearer; certainly
the refactor parts could get dropped and we could have
duplicated/specialized paths.

1. https://opentelemetry.io/blog/2024/profiling/
2. https://www.elastic.co/blog/continuous-profiling-distributed-tracing-correlation
3. https://learn.microsoft.com/en-us/windows/win32/api/evntprov/nf-evntprov-eventactivityidcontrol

Beau Belgrave (4):
  perf/core: Introduce perf_prepare_dump_data()
  perf: Introduce PERF_SAMPLE_TLS_USER sample type
  perf/core: Factor perf_output_sample_udump()
  perf/x86/core: Add tls dump support

 arch/Kconfig                      |   7 ++
 arch/x86/Kconfig                  |   1 +
 arch/x86/events/core.c            |  14 +++
 arch/x86/include/asm/perf_event.h |   5 +
 include/linux/perf_event.h        |   7 ++
 include/uapi/linux/perf_event.h   |   5 +-
 kernel/events/core.c              | 166 +++---
 kernel/events/internal.h          |  16 +++
 8 files changed, 180 insertions(+), 41 deletions(-)

base-commit: fec50db7033ea478773b159e0e2efb135270e3b7
--
2.34.1
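
For readers who want to picture the userspace half of the flow in the
cover letter, here is a minimal sketch of set_activity() as a plain copy
into a thread-local slot. It is an illustration under assumptions, not
code from the series: the 16-byte size, the current_activity variable
and the sample_activity() helper (a userspace stand-in for what a sample
would read) are all hypothetical, and how a profiler would locate the
slot (fsbase/gsbase offset, rseq, or something else) is exactly what the
thread leaves open.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ACTIVITY_ID_LEN 16 /* assumed size; the cover letter says 16-20 bytes */

/* Hypothetical per-thread slot that a profiler would be told how to find. */
static _Thread_local uint8_t current_activity[ACTIVITY_ID_LEN];

/* Publish the calling thread's activity: no syscall, just a small memcpy. */
static inline void set_activity(const uint8_t id[ACTIVITY_ID_LEN])
{
	assert(id != NULL);
	memcpy(current_activity, id, ACTIVITY_ID_LEN);
}

/*
 * Userspace stand-in for the sampling side: copy out the slot and report
 * whether an activity is set (an all-zero slot is treated as "No activity").
 */
static inline int sample_activity(uint8_t out[ACTIVITY_ID_LEN])
{
	static const uint8_t none[ACTIVITY_ID_LEN];

	assert(out != NULL);
	memcpy(out, current_activity, ACTIVITY_ID_LEN);
	return memcmp(out, none, ACTIVITY_ID_LEN) != 0;
}
```

A 16-byte memcpy is not atomic with respect to an interrupting sample,
so a sampler could in principle observe a half-written ID. A real design
would need to decide whether that is acceptable or whether the slot
needs a sequence counter or similar.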