Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint
On 09/04/19 19:48, Peter Zijlstra wrote:
> On Wed, Sep 04, 2019 at 03:37:11PM +0100, Qais Yousef wrote:
> > I managed to hook into sched_switch to get the nr_running of cfs tasks
> > via eBPF.
> >
> > ```
> > int on_switch(struct sched_switch_args *args) {
> > 	struct task_struct *prev = (struct task_struct *)bpf_get_current_task();
> > 	struct cgroup *prev_cgroup =
> > 		prev->cgroups->subsys[cpuset_cgrp_id]->cgroup;
> > 	const char *prev_cgroup_name = prev_cgroup->kn->name;
> >
> > 	if (prev_cgroup->kn->parent) {
> > 		bpf_trace_printk("sched_switch_ext: nr_running=%d prev_cgroup=%s\\n",
> > 				 prev->se.cfs_rq->nr_running,
> > 				 prev_cgroup_name);
> > 	} else {
> > 		bpf_trace_printk("sched_switch_ext: nr_running=%d prev_cgroup=/\\n",
> > 				 prev->se.cfs_rq->nr_running);
> > 	}
> > 	return 0;
> > };
> > ```
> >
> > You can do something similar by attaching to the sched_switch tracepoint
> > from a module and creating a new event to get the nr_running.
> >
> > Now this is not as accurate as your proposed new tracepoint in terms of
> > where you sample nr_running, but it should be good enough?
>
> The above is after deactivate() and gives an up-to-date count for
> decrements. Attach something to trace_sched_wakeup() to get the
> increment update.

I just remembered that sched_switch and sched_wakeup aren't
EXPORT_TRACEPOINT*(), so they can't be attached to from an out-of-tree
module. They are still accessible via eBPF, though.

There have been several attempts to export these tracepoints, but they
were NACKed because there was no in-kernel module that needed them.

https://lore.kernel.org/lkml/20150422130052.4996e...@gandalf.local.home/

--
Qais Yousef
Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint
On Thu, Sep 05, 2019 at 10:13:10AM +0200, Ingo Molnar wrote:
>
> * Alexei Starovoitov wrote:
>
> > On Wed, Sep 4, 2019 at 10:47 AM Peter Zijlstra wrote:
> > >
> > > On Wed, Sep 04, 2019 at 08:51:21AM -0700, Alexei Starovoitov wrote:
> > > > Anything in tracing can be deleted.
> > > > Tracing is about debugging and introspection.
> > > > When underlying kernel code changes, the introspection points
> > > > change as well.
> > >
> > > Right; except when it breaks widely used tools; like say powertop.
> > > Been there, done that.
> >
> > powertop was a lesson learned, but it's not a relevant example anymore.
> > There are more widely used tools today. Like the bcc tools.
> > And bpftrace is quickly gaining momentum and a large user base.
> > The bcc tools did break already several times and people fixed them.
>
> Are these tools using libtraceevents?

The bcc tools and bpftrace are using libbcc, which in turn is using
libbpf. libtraceevents is not used.

An interesting example is
https://github.com/iovisor/bcc/blob/master/tools/tcplife.py

It's using the "inet_sock_set_state" tracepoint when available on newer
kernels and a kprobe in the tcp_set_state() function on older kernels.
That tracepoint changed significantly over time: it had a different name,
'tcp_set_state', and slightly different semantics. Hence the tool was
fixed when that change in the tracepoint happened:
https://github.com/iovisor/bcc/commit/fd93dc0409b626b749b90f115d3d550a870ed125

Note that the tcp:tcp_set_state tracepoint existed for a full kernel
release, yet people didn't make a fuss about the fact that it disappeared
in 4.16. Though tcplife.py is a simple tool, there are more complex tools
based on the same idea deployed at Netflix and Facebook that went through
the same tcp_set_state -> inet_sock_set_state fix.
Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint
* Alexei Starovoitov wrote:

> On Wed, Sep 4, 2019 at 10:47 AM Peter Zijlstra wrote:
> >
> > On Wed, Sep 04, 2019 at 08:51:21AM -0700, Alexei Starovoitov wrote:
> > > Anything in tracing can be deleted.
> > > Tracing is about debugging and introspection.
> > > When underlying kernel code changes, the introspection points change
> > > as well.
> >
> > Right; except when it breaks widely used tools; like say powertop.
> > Been there, done that.
>
> powertop was a lesson learned, but it's not a relevant example anymore.
> There are more widely used tools today. Like the bcc tools.
> And bpftrace is quickly gaining momentum and a large user base.
> The bcc tools did break already several times and people fixed them.

Are these tools using libtraceevents?

Thanks,

	Ingo
Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint
On Wed, Sep 4, 2019 at 10:47 AM Peter Zijlstra wrote:
>
> On Wed, Sep 04, 2019 at 08:51:21AM -0700, Alexei Starovoitov wrote:
> > Anything in tracing can be deleted.
> > Tracing is about debugging and introspection.
> > When underlying kernel code changes, the introspection points change
> > as well.
>
> Right; except when it breaks widely used tools; like say powertop. Been
> there, done that.

powertop was a lesson learned, but it's not a relevant example anymore.
There are more widely used tools today. Like the bcc tools.
And bpftrace is quickly gaining momentum and a large user base.
The bcc tools did break already several times and people fixed them.
Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint
On Wed, Sep 04, 2019 at 03:37:11PM +0100, Qais Yousef wrote:
> I managed to hook into sched_switch to get the nr_running of cfs tasks
> via eBPF.
>
> ```
> int on_switch(struct sched_switch_args *args) {
> 	struct task_struct *prev = (struct task_struct *)bpf_get_current_task();
> 	struct cgroup *prev_cgroup =
> 		prev->cgroups->subsys[cpuset_cgrp_id]->cgroup;
> 	const char *prev_cgroup_name = prev_cgroup->kn->name;
>
> 	if (prev_cgroup->kn->parent) {
> 		bpf_trace_printk("sched_switch_ext: nr_running=%d prev_cgroup=%s\\n",
> 				 prev->se.cfs_rq->nr_running,
> 				 prev_cgroup_name);
> 	} else {
> 		bpf_trace_printk("sched_switch_ext: nr_running=%d prev_cgroup=/\\n",
> 				 prev->se.cfs_rq->nr_running);
> 	}
> 	return 0;
> };
> ```
>
> You can do something similar by attaching to the sched_switch tracepoint
> from a module and creating a new event to get the nr_running.
>
> Now this is not as accurate as your proposed new tracepoint in terms of
> where you sample nr_running, but it should be good enough?

The above is after deactivate() and gives an up-to-date count for
decrements. Attach something to trace_sched_wakeup() to get the
increment update.
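[Editor's note: the increment side Peter suggests can be sketched in the same bcc style as Qais's snippet above. This is a hedged sketch, not code from the thread: `struct sched_wakeup_args` and its fields are assumed by analogy with the sched:sched_wakeup event format (pid, prio, target_cpu). Note that at wakeup time `bpf_get_current_task()` returns the *waker*, not the woken task, so reaching the woken task's cfs_rq->nr_running needs extra work compared to the switch-side probe.]

```c
/*
 * Sketch: probe on the sched:sched_wakeup tracepoint to observe the
 * increment side. The args struct is a user-declared mirror of the
 * event format (an assumption, mirroring the on_switch example).
 */
int on_wakeup(struct sched_wakeup_args *args)
{
	/* pid/target_cpu are fields of the sched_wakeup event format */
	bpf_trace_printk("sched_wakeup_ext: pid=%d target_cpu=%d\\n",
			 args->pid, args->target_cpu);
	return 0;
}
```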
Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint
On Wed, Sep 04, 2019 at 08:51:21AM -0700, Alexei Starovoitov wrote:
> Anything in tracing can be deleted.
> Tracing is about debugging and introspection.
> When underlying kernel code changes, the introspection points change
> as well.

Right; except when it breaks widely used tools; like say powertop. Been
there, done that.
Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint
On Wed, Sep 4, 2019 at 8:40 AM Joel Fernandes wrote:
>
> On Wed, Sep 04, 2019 at 08:25:27AM -0700, Alexei Starovoitov wrote:
> > On Wed, Sep 4, 2019 at 6:10 AM Joel Fernandes wrote:
> > >
> > > I wonder if this distinction of "tracepoint" being non-ABI can be
> > > documented somewhere. I would be happy to do that if there is a place
> > > for the same. I really want some general "policy" in the kernel on
> > > where we draw a line in the sand with respect to tracepoints and ABI :).
> >
> > It's been discussed a million times. Tracepoints are not ABI.
> > Example: android folks started abusing tracepoints inside the bpf core
> > and we _deleted_ them.
>
> This is news to me, which ones?

those that your android teammates abused!

> > The same thing can be done with _any_ tracepoint.
> > Do not abuse them and stop the FUD about ABI.
>
> I don't know what FUD you are referring to. At least it is not coming
> from me. This thread is dealing with the issue of ABI specifically; I
> jumped in just now. As I was saying earlier, I don't have a strong
> opinion about this. I just want to know what the agreed upon approach is
> so that we can stick to it.
>
> It sounds like the agreement here is that tracepoints can be added and
> used without ABI guarantees, however the same is not true with trace
> events. Where's the FUD in that?

Anything in tracing can be deleted.
Tracing is about debugging and introspection.
When underlying kernel code changes, the introspection points change
as well.
Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint
On Wed, Sep 04, 2019 at 03:57:59PM +0100, Qais Yousef wrote:
> On 09/04/19 10:41, Joel Fernandes wrote:
> > On Wed, Sep 04, 2019 at 03:20:17PM +0100, Qais Yousef wrote:
> > > On 09/04/19 09:06, Joel Fernandes wrote:
> > > > > It is actually true.
> > > > >
> > > > > But you need to make the distinction between a tracepoint
> > > > > and a trace event first.
> > > >
> > > > I know this distinction well.
> > > >
> > > > > What Valentin is talking about here is the *bare* tracepoint
> > > > > without any event associated with them, like the ones I added to
> > > > > the scheduler recently. These ones are not accessible via eBPF,
> > > > > unless something has changed since I last tried.
> > > >
> > > > Can this tracepoint be registered on with tracepoint_probe_register()?
> > > > Quickly looking at these new tracepoints, they can be - otherwise
> > > > how would they even work, right? If so, then eBPF can very well
> > > > access it. Look at __bpf_probe_register() and
> > > > bpf_raw_tracepoint_open() which implement BPF_RAW_TRACEPOINT_OPEN.
> > >
> > > Humm okay. I tried to use a raw tracepoint with bcc but failed to
> > > attach. But maybe I missed something in the way it should be used.
> > > AFAICT it was missing the bits that I implemented in [1]. Maybe the
> > > method you mention is lower level than bcc.
> >
> > Oh, Ok. Not sure about BCC. I know that facebook folks are using
> > *existing* tracepoints (not trace events) to probe context switches
> > and such (probably not through BCC but some other BPF tracing code).
> > Peter had rejected trace events they were trying to add IIRC, so they
> > added BPF_RAW_TRACEPOINT_OPEN then IIRC.
>
> Looking at the history, BPF_RAW_TRACEPOINT_OPEN was added with the
> support for RAW_TRACEPOINT in c4f6699dfcb8 (bpf: introduce
> BPF_RAW_TRACEPOINT).
>
> Anyway, if you ever get a chance please try it and let me know. I might
> have done something wrong and you're more of an eBPF guru than I am :-)

eBPF guru and me? no way ;-)

I have tried out BPF_RAW_TRACEPOINT_OPEN before and it works as expected.
Are there not any in-kernel samples? Perhaps Alexei can post some if
there are not.

thanks,

- Joel
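[Editor's note: the BPF_RAW_TRACEPOINT_OPEN path discussed above boils down to a single bpf(2) command. A minimal userspace sketch follows; it assumes a BPF_PROG_TYPE_RAW_TRACEPOINT program has already been loaded and that `prog_fd` refers to it, and error handling is omitted.]

```c
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

/*
 * Attach an already-loaded BPF_PROG_TYPE_RAW_TRACEPOINT program to a
 * raw tracepoint by name via the bpf(2) syscall. The returned fd keeps
 * the attachment alive; closing it detaches the program.
 */
static int raw_tp_open(const char *name, int prog_fd)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.raw_tracepoint.name = (__u64)(unsigned long)name;
	attr.raw_tracepoint.prog_fd = prog_fd;

	return syscall(__NR_bpf, BPF_RAW_TRACEPOINT_OPEN, &attr,
		       sizeof(attr));
}
```

Usage would be e.g. `raw_tp_open("sched_switch", prog_fd)`; this is the same mechanism libbpf wraps for its raw tracepoint attach helpers.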
Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint
On Wed, Sep 04, 2019 at 08:37:22AM -0700, Alexei Starovoitov wrote:
> On Wed, Sep 4, 2019 at 8:33 AM Joel Fernandes wrote:
> >
> > On Wed, Sep 04, 2019 at 08:26:52AM -0700, Alexei Starovoitov wrote:
> > > On Wed, Sep 4, 2019 at 6:14 AM Joel Fernandes wrote:
> > > >
> > > > True. However, a kprobes-based BPF program does check the kernel
> > > > version to ensure that the BPF program is built against the right
> > > > kernel version (in order to ensure the program is built against
> > > > the right set of kernel headers). If it is not, then BPF refuses
> > > > to load the program.
> > >
> > > This is not true anymore. Users found a few ways to work around that
> > > check in practice. It became useless and it was deleted some time ago.
> >
> > Wow, Ok! Interesting!
>
> The other part of your email talks about the kernel header requirement.
> This is not true anymore either :)
> BTF relocations are already supported by the kernel, llvm, libbpf,
> bpftool and pahole.
> We'll be posting sample code soon.

Ok, this landscape seems to be changing quite a bit. I was going by what
I already knew... Looking forward to catching up with the latest.

Sorry, thanks,

- Joel
Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint
On Wed, Sep 4, 2019 at 8:33 AM Joel Fernandes wrote:
>
> On Wed, Sep 04, 2019 at 08:26:52AM -0700, Alexei Starovoitov wrote:
> > On Wed, Sep 4, 2019 at 6:14 AM Joel Fernandes wrote:
> > >
> > > True. However, a kprobes-based BPF program does check the kernel
> > > version to ensure that the BPF program is built against the right
> > > kernel version (in order to ensure the program is built against the
> > > right set of kernel headers). If it is not, then BPF refuses to load
> > > the program.
> >
> > This is not true anymore. Users found a few ways to work around that
> > check in practice. It became useless and it was deleted some time ago.
>
> Wow, Ok! Interesting!

The other part of your email talks about the kernel header requirement.
This is not true anymore either :)
BTF relocations are already supported by the kernel, llvm, libbpf,
bpftool and pahole.
We'll be posting sample code soon.
Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint
On Wed, Sep 04, 2019 at 08:25:27AM -0700, Alexei Starovoitov wrote:
> On Wed, Sep 4, 2019 at 6:10 AM Joel Fernandes wrote:
> >
> > I wonder if this distinction of "tracepoint" being non-ABI can be
> > documented somewhere. I would be happy to do that if there is a place
> > for the same. I really want some general "policy" in the kernel on
> > where we draw a line in the sand with respect to tracepoints and ABI :).
>
> It's been discussed a million times. Tracepoints are not ABI.
> Example: android folks started abusing tracepoints inside the bpf core
> and we _deleted_ them.

This is news to me, which ones?

> The same thing can be done with _any_ tracepoint.
> Do not abuse them and stop the FUD about ABI.

I don't know what FUD you are referring to. At least it is not coming
from me. This thread is dealing with the issue of ABI specifically; I
jumped in just now. As I was saying earlier, I don't have a strong
opinion about this. I just want to know what the agreed upon approach is
so that we can stick to it.

It sounds like the agreement here is that tracepoints can be added and
used without ABI guarantees, however the same is not true with trace
events. Where's the FUD in that?

thanks,

- Joel
Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint
On Wed, Sep 04, 2019 at 08:26:52AM -0700, Alexei Starovoitov wrote:
> On Wed, Sep 4, 2019 at 6:14 AM Joel Fernandes wrote:
> >
> > True. However, a kprobes-based BPF program does check the kernel
> > version to ensure that the BPF program is built against the right
> > kernel version (in order to ensure the program is built against the
> > right set of kernel headers). If it is not, then BPF refuses to load
> > the program.
>
> This is not true anymore. Users found a few ways to work around that
> check in practice. It became useless and it was deleted some time ago.

Wow, Ok! Interesting!

thanks,

- Joel
Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint
On Wed, Sep 4, 2019 at 6:14 AM Joel Fernandes wrote:
>
> True. However, a kprobes-based BPF program does check the kernel version
> to ensure that the BPF program is built against the right kernel version
> (in order to ensure the program is built against the right set of kernel
> headers). If it is not, then BPF refuses to load the program.

This is not true anymore. Users found a few ways to work around that
check in practice. It became useless and it was deleted some time ago.
Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint
On Wed, Sep 4, 2019 at 6:10 AM Joel Fernandes wrote:
>
> I wonder if this distinction of "tracepoint" being non-ABI can be
> documented somewhere. I would be happy to do that if there is a place
> for the same. I really want some general "policy" in the kernel on
> where we draw a line in the sand with respect to tracepoints and ABI :).

It's been discussed a million times. Tracepoints are not ABI.
Example: android folks started abusing tracepoints inside the bpf core
and we _deleted_ them.
The same thing can be done with _any_ tracepoint.
Do not abuse them and stop the FUD about ABI.
Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint
On 09/04/19 10:41, Joel Fernandes wrote:
> On Wed, Sep 04, 2019 at 03:20:17PM +0100, Qais Yousef wrote:
> > On 09/04/19 09:06, Joel Fernandes wrote:
> > > > It is actually true.
> > > >
> > > > But you need to make the distinction between a tracepoint
> > > > and a trace event first.
> > >
> > > I know this distinction well.
> > >
> > > > What Valentin is talking about here is the *bare* tracepoint
> > > > without any event associated with them, like the ones I added to
> > > > the scheduler recently. These ones are not accessible via eBPF,
> > > > unless something has changed since I last tried.
> > >
> > > Can this tracepoint be registered on with tracepoint_probe_register()?
> > > Quickly looking at these new tracepoints, they can be - otherwise
> > > how would they even work, right? If so, then eBPF can very well
> > > access it. Look at __bpf_probe_register() and
> > > bpf_raw_tracepoint_open() which implement BPF_RAW_TRACEPOINT_OPEN.
> >
> > Humm okay. I tried to use a raw tracepoint with bcc but failed to
> > attach. But maybe I missed something in the way it should be used.
> > AFAICT it was missing the bits that I implemented in [1]. Maybe the
> > method you mention is lower level than bcc.
>
> Oh, Ok. Not sure about BCC. I know that facebook folks are using
> *existing* tracepoints (not trace events) to probe context switches and
> such (probably not through BCC but some other BPF tracing code). Peter
> had rejected trace events they were trying to add IIRC, so they added
> BPF_RAW_TRACEPOINT_OPEN then IIRC.

Looking at the history, BPF_RAW_TRACEPOINT_OPEN was added with the
support for RAW_TRACEPOINT in c4f6699dfcb8 (bpf: introduce
BPF_RAW_TRACEPOINT).

Anyway, if you ever get a chance please try it and let me know. I might
have done something wrong and you're more of an eBPF guru than I am :-)

> > > > The current infrastructure needs to be expanded to allow eBPF to
> > > > attach to these bare tracepoints. Something similar to what I have
> > > > in [1] is needed - but instead of creating a new macro it needs to
> > > > expand the current macro. [2] might give full context of when I
> > > > was trying to come up with alternatives to using trace events.
> > > >
> > > > [1] https://github.com/qais-yousef/linux/commit/fb9fea29edb8af327e6b2bf3bc41469a8e66df8b
> > > > [2] https://lore.kernel.org/lkml/20190415144945.tumeop4djyj45...@e107158-lin.cambridge.arm.com/
> > >
> > > As I was mentioning, tracepoints, not "trace events", can already be
> > > opened directly with BPF. I don't see how these new tracepoints are
> > > different.
> > >
> > > I wonder if this distinction of "tracepoint" being non-ABI can be
> > > documented somewhere. I would be happy to do that if there is a
> > > place for the same. I really want some general "policy" in the
> > > kernel on where we draw a line in the sand with respect to
> > > tracepoints and ABI :).
> > >
> > > For instance, perhaps VFS can also start having non-ABI tracepoints
> > > for the benefit of people tracing the VFS.
> >
> > Good question. I did consider that but failed to come up with a place.
> > AFAIU the history moved from tracepoints to trace events and is now
> > moving back to tracepoints. Something Steve is not very enthusiastic
> > about.
>
> Yeah this is a bit of a mess. I think for every recent LPC this has come
> up. But the DECLARE_TRACE approach you did is interesting in that it
> reduces/removes the API surface for trace events at least.

Yes. And you have the flexibility to add more info to the tracepoint
without worrying about breaking current users. Another nice feature we
discovered is that you can create several trace events from the same
tracepoint, each exposing a different set of info. You can achieve the
same with the current trace events if you use tracepoint_probe_register()
of course, but not if you use the macros. The tendency for in-kernel
trace events is to have a 1:1 mapping even if the events can be extracted
from a single tracepoint.

--
Qais Yousef
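[Editor's note: the tracepoint_probe_register() route described above can be sketched as a small out-of-tree module probing one of the exported bare scheduler tracepoints. This is a sketch, not code from the thread; sched_overutilized_tp is used only because it is one of the EXPORT_TRACEPOINT'ed hooks quoted later in the thread, and the probe signature must match that tracepoint's TP_PROTO.]

```c
#include <linux/module.h>
#include <linux/tracepoint.h>
#include <trace/events/sched.h>

/*
 * Probe signature: a leading context pointer (the 'data' passed at
 * registration) followed by the tracepoint's TP_PROTO arguments.
 */
static void overutilized_probe(void *data, struct root_domain *rd,
			       bool overutilized)
{
	/*
	 * From here the module can emit its own custom trace event(s) --
	 * several different ones from this single tracepoint if desired.
	 */
	pr_debug("overutilized=%d\n", overutilized);
}

static int __init probe_init(void)
{
	/* register_trace_<name>() is generated by DECLARE_TRACE() */
	return register_trace_sched_overutilized_tp(overutilized_probe, NULL);
}

static void __exit probe_exit(void)
{
	unregister_trace_sched_overutilized_tp(overutilized_probe);
	/* wait for in-flight probes before the module text goes away */
	tracepoint_synchronize_unregister();
}

module_init(probe_init);
module_exit(probe_exit);
MODULE_LICENSE("GPL");
```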
Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint
On Wed, Sep 04, 2019 at 03:20:17PM +0100, Qais Yousef wrote:
> On 09/04/19 09:06, Joel Fernandes wrote:
> > > It is actually true.
> > >
> > > But you need to make the distinction between a tracepoint
> > > and a trace event first.
> >
> > I know this distinction well.
> >
> > > What Valentin is talking about here is the *bare* tracepoint without
> > > any event associated with them, like the ones I added to the
> > > scheduler recently. These ones are not accessible via eBPF, unless
> > > something has changed since I last tried.
> >
> > Can this tracepoint be registered on with tracepoint_probe_register()?
> > Quickly looking at these new tracepoints, they can be - otherwise how
> > would they even work, right? If so, then eBPF can very well access it.
> > Look at __bpf_probe_register() and bpf_raw_tracepoint_open() which
> > implement BPF_RAW_TRACEPOINT_OPEN.
>
> Humm okay. I tried to use a raw tracepoint with bcc but failed to
> attach. But maybe I missed something in the way it should be used.
> AFAICT it was missing the bits that I implemented in [1]. Maybe the
> method you mention is lower level than bcc.

Oh, Ok. Not sure about BCC. I know that facebook folks are using
*existing* tracepoints (not trace events) to probe context switches and
such (probably not through BCC but some other BPF tracing code). Peter
had rejected trace events they were trying to add IIRC, so they added
BPF_RAW_TRACEPOINT_OPEN then IIRC.

> > The current infrastructure needs to be expanded to allow eBPF to
> > attach to these bare tracepoints. Something similar to what I have in
> > [1] is needed - but instead of creating a new macro it needs to expand
> > the current macro. [2] might give full context of when I was trying to
> > come up with alternatives to using trace events.
> >
> > [1] https://github.com/qais-yousef/linux/commit/fb9fea29edb8af327e6b2bf3bc41469a8e66df8b
> > [2] https://lore.kernel.org/lkml/20190415144945.tumeop4djyj45...@e107158-lin.cambridge.arm.com/

As I was mentioning, tracepoints, not "trace events", can already be
opened directly with BPF. I don't see how these new tracepoints are
different.

> > I wonder if this distinction of "tracepoint" being non-ABI can be
> > documented somewhere. I would be happy to do that if there is a place
> > for the same. I really want some general "policy" in the kernel on
> > where we draw a line in the sand with respect to tracepoints and ABI :).
> >
> > For instance, perhaps VFS can also start having non-ABI tracepoints
> > for the benefit of people tracing the VFS.
>
> Good question. I did consider that but failed to come up with a place.
> AFAIU the history moved from tracepoints to trace events and is now
> moving back to tracepoints. Something Steve is not very enthusiastic
> about.

Yeah this is a bit of a mess. I think for every recent LPC this has come
up. But the DECLARE_TRACE approach you did is interesting in that it
reduces/removes the API surface for trace events at least.

thanks,

- Joel
Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint
On 09/03/19 17:43, Radim Krčmář wrote:
> The paper "The Linux Scheduler: a Decade of Wasted Cores" used several
> custom data gathering points to better understand what was going on in
> the scheduler.
> Red Hat adapted one of them for the tracepoint framework and created a
> tool to plot a heatmap of nr_running, where the sched_update_nr_running
> tracepoint is being used for fine grained monitoring of scheduling
> imbalance.
> The tool is available from https://github.com/jirvoz/plot-nr-running.
>
> The best place for the tracepoints is inside the add/sub_nr_running,
> which requires some shenanigans to make it work as they are defined
> inside sched.h.

I managed to hook into sched_switch to get the nr_running of cfs tasks
via eBPF.

```
int on_switch(struct sched_switch_args *args) {
	struct task_struct *prev = (struct task_struct *)bpf_get_current_task();
	struct cgroup *prev_cgroup =
		prev->cgroups->subsys[cpuset_cgrp_id]->cgroup;
	const char *prev_cgroup_name = prev_cgroup->kn->name;

	if (prev_cgroup->kn->parent) {
		bpf_trace_printk("sched_switch_ext: nr_running=%d prev_cgroup=%s\\n",
				 prev->se.cfs_rq->nr_running,
				 prev_cgroup_name);
	} else {
		bpf_trace_printk("sched_switch_ext: nr_running=%d prev_cgroup=/\\n",
				 prev->se.cfs_rq->nr_running);
	}
	return 0;
};
```

You can do something similar by attaching to the sched_switch tracepoint
from a module and creating a new event to get the nr_running.

Now this is not as accurate as your proposed new tracepoint in terms of
where you sample nr_running, but it should be good enough?

Cheers

--
Qais Yousef

> The tracepoints have to be included from sched.h, which means that
> CREATE_TRACE_POINTS has to be defined for the whole header and this
> might cause problems if tree-wide headers expose tracepoints in sched.h
> dependencies, but I'd argue it's the other side's misuse of tracepoints.
>
> Moving the import sched.h line lower would require fixes in s390 and ppc
> headers, because they don't include dependencies properly and expect
> sched.h to do it, so it is simpler to keep sched.h there and
> preventively undefine CREATE_TRACE_POINTS right after.
>
> Exports of the pelt tracepoints remain because they don't need to be
> protected by CREATE_TRACE_POINTS and moving them closer would be
> unsightly.
>
> Tested-by: Jirka Hladký
> Tested-by: Jiří Vozár
> Signed-off-by: Radim Krčmář
> ---
>  include/trace/events/sched.h | 22 ++
>  kernel/sched/core.c          |  7 ++-
>  kernel/sched/fair.c          |  2 --
>  kernel/sched/sched.h         |  7 +++
>  4 files changed, 31 insertions(+), 7 deletions(-)
>
> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index 420e80e56e55..1527fc695609 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -625,6 +625,28 @@ DECLARE_TRACE(sched_overutilized_tp,
>  	TP_PROTO(struct root_domain *rd, bool overutilized),
>  	TP_ARGS(rd, overutilized));
>
> +TRACE_EVENT(sched_update_nr_running,
> +
> +	TP_PROTO(int cpu, int change, unsigned int nr_running),
> +
> +	TP_ARGS(cpu, change, nr_running),
> +
> +	TP_STRUCT__entry(
> +		__field(int,          cpu)
> +		__field(int,          change)
> +		__field(unsigned int, nr_running)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->cpu        = cpu;
> +		__entry->change     = change;
> +		__entry->nr_running = nr_running;
> +	),
> +
> +	TP_printk("cpu=%u nr_running=%u (%d)",
> +		  __entry->cpu, __entry->nr_running, __entry->change)
> +);
> +
>  #endif /* _TRACE_SCHED_H */
>
>  /* This part must be outside protection */
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 71981ce84231..31ac37b9f6f7 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6,7 +6,9 @@
>   *
>   *  Copyright (C) 1991-2002  Linus Torvalds
>   */
> +#define CREATE_TRACE_POINTS
>  #include "sched.h"
> +#undef CREATE_TRACE_POINTS
>
>  #include <linux/nospec.h>
>
> @@ -20,9 +22,6 @@
>
>  #include "pelt.h"
>
> -#define CREATE_TRACE_POINTS
> -#include <trace/events/sched.h>
> -
>  /*
>   * Export tracepoints that act as a bare tracehook (ie: have no trace event
>   * associated with them) to allow external modules to probe them.
> @@ -7618,5 +7617,3 @@ const u32 sched_prio_to_wmult[40] = {
>   /* 10 */  39045157,  49367440,  61356676,  76695844,  95443717,
>   /* 15 */ 119304647, 148102320, 186737708, 238609294, 286331153,
>  };
> -
> -#undef CREATE_TRACE_POINTS
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 84959d3285d1..421236d39902 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -22,8 +22,6 @@
>   */
>  #include "sched.h"
>
> -#include <trace/events/sched.h>
> -
>  /*
>   * Targeted preemption latency for CPU-bound tasks:
>   *
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
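[Editor's note: the quoted diff is cut off before the kernel/sched/sched.h hunk. Based only on the cover text ("the best place for the tracepoints is inside the add/sub_nr_running"), the instrumented helpers presumably end up looking something like the sketch below. This is an assumption, not the actual hunk; the exact placement relative to the SMP/overload handling in add_nr_running() is guessed.]

```c
/* Sketch of kernel/sched/sched.h after the patch (assumed, not quoted): */
static inline void add_nr_running(struct rq *rq, unsigned count)
{
	unsigned prev_nr = rq->nr_running;

	rq->nr_running = prev_nr + count;
	trace_sched_update_nr_running(cpu_of(rq), count, rq->nr_running);
	/* ... existing SMP/overload handling continues here ... */
}

static inline void sub_nr_running(struct rq *rq, unsigned count)
{
	rq->nr_running -= count;
	trace_sched_update_nr_running(cpu_of(rq), -(int)count, rq->nr_running);
	/* ... */
}
```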
Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint
On 09/04/19 09:06, Joel Fernandes wrote:
> > It is actually true.
> >
> > But you need to make the distinction between a tracepoint
> > and a trace event first.
>
> I know this distinction well.
>
> > What Valentin is talking about here is the *bare* tracepoint without
> > any event associated with them, like the ones I added to the scheduler
> > recently. These ones are not accessible via eBPF, unless something has
> > changed since I last tried.
>
> Can this tracepoint be registered on with tracepoint_probe_register()?
> Quickly looking at these new tracepoints, they can be - otherwise how
> would they even work, right? If so, then eBPF can very well access it.
> Look at __bpf_probe_register() and bpf_raw_tracepoint_open() which
> implement BPF_RAW_TRACEPOINT_OPEN.

Humm okay. I tried to use a raw tracepoint with bcc but failed to attach.
But maybe I missed something in the way it should be used. AFAICT it was
missing the bits that I implemented in [1]. Maybe the method you mention
is lower level than bcc.

> > The current infrastructure needs to be expanded to allow eBPF to
> > attach to these bare tracepoints. Something similar to what I have in
> > [1] is needed - but instead of creating a new macro it needs to expand
> > the current macro. [2] might give full context of when I was trying to
> > come up with alternatives to using trace events.
> >
> > [1] https://github.com/qais-yousef/linux/commit/fb9fea29edb8af327e6b2bf3bc41469a8e66df8b
> > [2] https://lore.kernel.org/lkml/20190415144945.tumeop4djyj45...@e107158-lin.cambridge.arm.com/
>
> As I was mentioning, tracepoints, not "trace events", can already be
> opened directly with BPF. I don't see how these new tracepoints are
> different.
>
> I wonder if this distinction of "tracepoint" being non-ABI can be
> documented somewhere. I would be happy to do that if there is a place
> for the same. I really want some general "policy" in the kernel on
> where we draw a line in the sand with respect to tracepoints and ABI :).
>
> For instance, perhaps VFS can also start having non-ABI tracepoints for
> the benefit of people tracing the VFS.

Good question. I did consider that but failed to come up with a place.
AFAIU the history moved from tracepoints to trace events and is now
moving back to tracepoints. Something Steve is not very enthusiastic
about.

LPC is coming, sounds like a good venue to discuss this :-)

Cheers

--
Qais Yousef
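[Editor's note: for readers new to the tracepoint vs. trace event distinction being debated here, the bare scheduler tracepoints are declared with DECLARE_TRACE() rather than TRACE_EVENT(). The first declaration below is quoted from include/trace/events/sched.h as it appears in the patch context later in this thread; the commentary is editorial.]

```c
/*
 * A bare tracepoint: only the hook and its register/unregister helpers
 * exist. Nothing appears under /sys/kernel/debug/tracing/events/, so
 * there is no event format for userspace to latch onto.
 */
DECLARE_TRACE(sched_overutilized_tp,
	TP_PROTO(struct root_domain *rd, bool overutilized),
	TP_ARGS(rd, overutilized));

/*
 * By contrast, TRACE_EVENT() additionally generates the ring-buffer
 * record layout and the tracefs format/enable files -- the part that
 * userspace tools end up treating as ABI.
 */
```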
Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint
On 04/09/2019 14:13, Peter Zijlstra wrote:
> On Tue, Sep 03, 2019 at 05:43:40PM +0200, Radim Krčmář wrote:
>
> > Red Hat adapted one of them for the tracepoint framework and created a
> > tool to plot a heatmap of nr_running, where the sched_update_nr_running
> > tracepoint is being used for fine grained monitoring of scheduling
> > imbalance.
>
> You should be able to reconstruct this from wakeup and switch
> tracepoints.

I never tried to do it myself, but I know some folks around me gave it a
go (Morten, Qais and maybe even Chris). I can't remember the exact
details, but there was something about not getting the right input
(duplicated wakeups, or no migration information - I think sched_migrate
doesn't cover all of the cases), so the information you're building
diverges as time progresses.
Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint
On Tue, Sep 03, 2019 at 05:43:40PM +0200, Radim Krčmář wrote:

> Red Hat adapted one of them for the tracepoint framework and created a
> tool to plot a heatmap of nr_running, where the sched_update_nr_running
> tracepoint is being used for fine grained monitoring of scheduling
> imbalance.

You should be able to reconstruct this from wakeup and switch
tracepoints.
Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint
On Wed, Sep 04, 2019 at 10:14:48AM +0200, Peter Zijlstra wrote: > On Wed, Sep 04, 2019 at 12:23:10AM -0400, Joel Fernandes wrote: > > On Tue, Sep 03, 2019 at 05:05:47PM +0100, Valentin Schneider wrote: > > > On 03/09/2019 16:43, Radim Krčmář wrote: > > > > The paper "The Linux Scheduler: a Decade of Wasted Cores" used several > > > > custom data gathering points to better understand what was going on in > > > > the scheduler. > > > > Red Hat adapted one of them for the tracepoint framework and created a > > > > tool to plot a heatmap of nr_running, where the sched_update_nr_running > > > > tracepoint is being used for fine grained monitoring of scheduling > > > > imbalance. > > > > The tool is available from https://github.com/jirvoz/plot-nr-running. > > > > > > > > The best place for the tracepoints is inside the add/sub_nr_running, > > > > which requires some shenanigans to make it work as they are defined > > > > inside sched.h. > > > > The tracepoints have to be included from sched.h, which means that > > > > CREATE_TRACE_POINTS has to be defined for the whole header and this > > > > might cause problems if tree-wide headers expose tracepoints in sched.h > > > > dependencies, but I'd argue it's the other side's misuse of tracepoints. > > > > > > > > Moving the import sched.h line lower would require fixes in s390 and ppc > > > > headers, because they don't include dependecies properly and expect > > > > sched.h to do it, so it is simpler to keep sched.h there and > > > > preventively undefine CREATE_TRACE_POINTS right after. > > > > > > > > Exports of the pelt tracepoints remain because they don't need to be > > > > protected by CREATE_TRACE_POINTS and moving them closer would be > > > > unsightly. > > > > > > > > > > Pure trace events are frowned upon in scheduler world, try going with > > > trace points. > > Quite; I hate tracepoints for the API constraints they impose. Been > bitten by that, not want to ever have to deal with that again. 
Your NACKs on trace patches over the years have spoken out loud about this point ;-) > > > Qais did something very similar recently: > > > > > > https://lore.kernel.org/lkml/20190604111459.2862-1-qais.you...@arm.com/ > > > > > > You'll have to implement the associated trace events in a module, which > > > lets you define your own event format and doesn't form an ABI :). > > > > Is that really true? eBPF programs loaded from userspace can access > > tracepoints through BPF_RAW_TRACEPOINT_OPEN, which is UAPI: > > https://github.com/torvalds/linux/blob/master/include/uapi/linux/bpf.h#L103 > > > > I don't have a strong opinion about considering tracepoints as ABI / API or > > not, but just want to get the facts straight :) > > eBPF can access all sorts of kernel internals; if we were to deem eBPF > an API we'd be fscked.

True. However, for kprobes-based BPF programs it does check the kernel version, to ensure that the BPF program was built against the right set of kernel headers. If it was not, then BPF refuses to load the program. But, to your point, bpf_probe_read() can anyway access kernel memory and assume structure layouts; so I guess a badly written BPF program can still do all sorts of ABI-unstable things. Good to know that the raw tracepoint being OK, since it is non-ABI, is where we draw the line.

thanks,

- Joel
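[Editor's note] The version check Joel mentions is the old load-time comparison of a kprobe program's declared kern_version (a field in union bpf_attr) against the running kernel's LINUX_VERSION_CODE. A small sketch of that encoding; only the KERNEL_VERSION() bit layout below is the kernel's, the helper names are mine:

```python
# Sketch of the kernel's KERNEL_VERSION(a, b, c) encoding that kprobe-based
# BPF programs were checked against at load time (bpf_attr.kern_version).
# Helper names here are illustrative, not a real libbpf API.

def kernel_version(major: int, minor: int, patch: int) -> int:
    """Mirror of the kernel's KERNEL_VERSION() macro: (a << 16) | (b << 8) | c."""
    return (major << 16) | (minor << 8) | patch

def versions_match(prog_kern_version: int, running: tuple) -> bool:
    """The old-style load check: an exact match was required for kprobe programs."""
    return prog_kern_version == kernel_version(*running)

# A program declaring 5.3.0 loads on a 5.3.0 kernel but not on 5.2.14:
print(versions_match(kernel_version(5, 3, 0), (5, 3, 0)))
print(versions_match(kernel_version(5, 3, 0), (5, 2, 14)))
```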
Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint
On Wed, Sep 04, 2019 at 11:43:33AM +0100, Qais Yousef wrote: > On 09/04/19 00:23, Joel Fernandes wrote: > > On Tue, Sep 03, 2019 at 05:05:47PM +0100, Valentin Schneider wrote: > > > On 03/09/2019 16:43, Radim Krčmář wrote: > > > > The paper "The Linux Scheduler: a Decade of Wasted Cores" used several > > > > custom data gathering points to better understand what was going on in > > > > the scheduler. > > > > Red Hat adapted one of them for the tracepoint framework and created a > > > > tool to plot a heatmap of nr_running, where the sched_update_nr_running > > > > tracepoint is being used for fine grained monitoring of scheduling > > > > imbalance. > > > > The tool is available from https://github.com/jirvoz/plot-nr-running. > > > > > > > > The best place for the tracepoints is inside the add/sub_nr_running, > > > > which requires some shenanigans to make it work as they are defined > > > > inside sched.h. > > > > The tracepoints have to be included from sched.h, which means that > > > > CREATE_TRACE_POINTS has to be defined for the whole header and this > > > > might cause problems if tree-wide headers expose tracepoints in sched.h > > > > dependencies, but I'd argue it's the other side's misuse of tracepoints. > > > > > > > > Moving the import sched.h line lower would require fixes in s390 and ppc > > > > headers, because they don't include dependecies properly and expect > > > > sched.h to do it, so it is simpler to keep sched.h there and > > > > preventively undefine CREATE_TRACE_POINTS right after. > > > > > > > > Exports of the pelt tracepoints remain because they don't need to be > > > > protected by CREATE_TRACE_POINTS and moving them closer would be > > > > unsightly. > > > > > > > > > > Pure trace events are frowned upon in scheduler world, try going with > > > trace points. 
Qais did something very similar recently: > > > > > > https://lore.kernel.org/lkml/20190604111459.2862-1-qais.you...@arm.com/ > > > > > > You'll have to implement the associated trace events in a module, which > > > lets you define your own event format and doesn't form an ABI :). > > > > Is that really true? eBPF programs loaded from userspace can access > > tracepoints through BPF_RAW_TRACEPOINT_OPEN, which is UAPI: > > https://github.com/torvalds/linux/blob/master/include/uapi/linux/bpf.h#L103 > > > > I don't have a strong opinion about considering tracepoints as ABI / API or > > not, but just want to get the facts straight :) > > It is actually true. > > But you need to make the distinction between a tracepoint > and a trace event first. I know this distinction well. > What Valentin is talking about here is the *bare* > tracepoint without any event associated with them like the one I added to the > scheduler recently. These ones are not accessible via eBPF, unless something > has changed since I last tried.

Can this tracepoint be registered with tracepoint_probe_register()? Quickly looking at these new tracepoints, they can be; otherwise how would they even work, right? If so, then eBPF can very well access it. Look at __bpf_probe_register() and bpf_raw_tracepoint_open(), which implement BPF_RAW_TRACEPOINT_OPEN.

> The current infrastructure needs to be expanded to allow eBPF to attach these > bare tracepoints. Something similar to what I have in [1] is needed - but > instead of creating a new macro it needs to expand the current macro. [2] > might > give full context of when I was trying to come up with alternatives to using > trace events. > > [1] > https://github.com/qais-yousef/linux/commit/fb9fea29edb8af327e6b2bf3bc41469a8e66df8b > [2] > https://lore.kernel.org/lkml/20190415144945.tumeop4djyj45...@e107158-lin.cambridge.arm.com/

As I was mentioning, tracepoints, not "trace events", can already be opened directly with BPF.
I don't see how these new tracepoints are different. I wonder if this distinction of "tracepoint" being non-ABI can be documented somewhere. I would be happy to do that if there is a place for it. I really want some general "policy" in the kernel on where we draw a line in the sand with respect to tracepoints and ABI :). For instance, perhaps VFS can also start having non-ABI tracepoints for the benefit of people tracing the VFS. thanks, - Joel
Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint
On 09/04/19 10:14, Peter Zijlstra wrote: > On Wed, Sep 04, 2019 at 12:23:10AM -0400, Joel Fernandes wrote: > > On Tue, Sep 03, 2019 at 05:05:47PM +0100, Valentin Schneider wrote: > > > On 03/09/2019 16:43, Radim Krčmář wrote: > > > > The paper "The Linux Scheduler: a Decade of Wasted Cores" used several > > > > custom data gathering points to better understand what was going on in > > > > the scheduler. > > > > Red Hat adapted one of them for the tracepoint framework and created a > > > > tool to plot a heatmap of nr_running, where the sched_update_nr_running > > > > tracepoint is being used for fine grained monitoring of scheduling > > > > imbalance. > > > > The tool is available from https://github.com/jirvoz/plot-nr-running. > > > > > > > > The best place for the tracepoints is inside the add/sub_nr_running, > > > > which requires some shenanigans to make it work as they are defined > > > > inside sched.h. > > > > The tracepoints have to be included from sched.h, which means that > > > > CREATE_TRACE_POINTS has to be defined for the whole header and this > > > > might cause problems if tree-wide headers expose tracepoints in sched.h > > > > dependencies, but I'd argue it's the other side's misuse of tracepoints. > > > > > > > > Moving the import sched.h line lower would require fixes in s390 and ppc > > > > headers, because they don't include dependecies properly and expect > > > > sched.h to do it, so it is simpler to keep sched.h there and > > > > preventively undefine CREATE_TRACE_POINTS right after. > > > > > > > > Exports of the pelt tracepoints remain because they don't need to be > > > > protected by CREATE_TRACE_POINTS and moving them closer would be > > > > unsightly. > > > > > > > > > > Pure trace events are frowned upon in scheduler world, try going with > > > trace points. > > Quite; I hate tracepoints for the API constraints they impose. Been > bitten by that, not want to ever have to deal with that again. s/tracepoints/trace events/ ? 
They used to be one and the same but I think using them interchangeably might cause some confusion now since we have tracepoints without trace events associated with them. Not trying to be picky, but the missing distinction confused the hell out of me when I first started looking at this :-) -- Qais Yousef
Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint
On 09/04/19 00:23, Joel Fernandes wrote: > On Tue, Sep 03, 2019 at 05:05:47PM +0100, Valentin Schneider wrote: > > On 03/09/2019 16:43, Radim Krčmář wrote: > > > The paper "The Linux Scheduler: a Decade of Wasted Cores" used several > > > custom data gathering points to better understand what was going on in > > > the scheduler. > > > Red Hat adapted one of them for the tracepoint framework and created a > > > tool to plot a heatmap of nr_running, where the sched_update_nr_running > > > tracepoint is being used for fine grained monitoring of scheduling > > > imbalance. > > > The tool is available from https://github.com/jirvoz/plot-nr-running. > > > > > > The best place for the tracepoints is inside the add/sub_nr_running, > > > which requires some shenanigans to make it work as they are defined > > > inside sched.h. > > > The tracepoints have to be included from sched.h, which means that > > > CREATE_TRACE_POINTS has to be defined for the whole header and this > > > might cause problems if tree-wide headers expose tracepoints in sched.h > > > dependencies, but I'd argue it's the other side's misuse of tracepoints. > > > > > > Moving the import sched.h line lower would require fixes in s390 and ppc > > > headers, because they don't include dependecies properly and expect > > > sched.h to do it, so it is simpler to keep sched.h there and > > > preventively undefine CREATE_TRACE_POINTS right after. > > > > > > Exports of the pelt tracepoints remain because they don't need to be > > > protected by CREATE_TRACE_POINTS and moving them closer would be > > > unsightly. > > > > > > > Pure trace events are frowned upon in scheduler world, try going with > > trace points. Qais did something very similar recently: > > > > https://lore.kernel.org/lkml/20190604111459.2862-1-qais.you...@arm.com/ > > > > You'll have to implement the associated trace events in a module, which > > lets you define your own event format and doesn't form an ABI :). > > Is that really true? 
eBPF programs loaded from userspace can access > tracepoints through BPF_RAW_TRACEPOINT_OPEN, which is UAPI: > https://github.com/torvalds/linux/blob/master/include/uapi/linux/bpf.h#L103 > > I don't have a strong opinion about considering tracepoints as ABI / API or > not, but just want to get the facts straight :)

It is actually true. But you need to make the distinction between a tracepoint and a trace event first. What Valentin is talking about here is the *bare* tracepoints without any events associated with them, like the one I added to the scheduler recently. These ones are not accessible via eBPF, unless something has changed since I last tried.

The current infrastructure needs to be expanded to allow eBPF to attach to these bare tracepoints. Something similar to what I have in [1] is needed - but instead of creating a new macro it needs to expand the current macro. [2] might give the full context of when I was trying to come up with alternatives to using trace events.

[1] https://github.com/qais-yousef/linux/commit/fb9fea29edb8af327e6b2bf3bc41469a8e66df8b
[2] https://lore.kernel.org/lkml/20190415144945.tumeop4djyj45...@e107158-lin.cambridge.arm.com/

HTH

-- Qais Yousef
Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint
On Wed, Sep 04, 2019 at 12:23:10AM -0400, Joel Fernandes wrote: > On Tue, Sep 03, 2019 at 05:05:47PM +0100, Valentin Schneider wrote: > > On 03/09/2019 16:43, Radim Krčmář wrote: > > > The paper "The Linux Scheduler: a Decade of Wasted Cores" used several > > > custom data gathering points to better understand what was going on in > > > the scheduler. > > > Red Hat adapted one of them for the tracepoint framework and created a > > > tool to plot a heatmap of nr_running, where the sched_update_nr_running > > > tracepoint is being used for fine grained monitoring of scheduling > > > imbalance. > > > The tool is available from https://github.com/jirvoz/plot-nr-running. > > > > > > The best place for the tracepoints is inside the add/sub_nr_running, > > > which requires some shenanigans to make it work as they are defined > > > inside sched.h. > > > The tracepoints have to be included from sched.h, which means that > > > CREATE_TRACE_POINTS has to be defined for the whole header and this > > > might cause problems if tree-wide headers expose tracepoints in sched.h > > > dependencies, but I'd argue it's the other side's misuse of tracepoints. > > > > > > Moving the import sched.h line lower would require fixes in s390 and ppc > > > headers, because they don't include dependencies properly and expect > > > sched.h to do it, so it is simpler to keep sched.h there and > > > preventively undefine CREATE_TRACE_POINTS right after. > > > > > > Exports of the pelt tracepoints remain because they don't need to be > > > protected by CREATE_TRACE_POINTS and moving them closer would be > > > unsightly. > > > > > > > Pure trace events are frowned upon in scheduler world, try going with > > trace points.

Quite; I hate tracepoints for the API constraints they impose. Been bitten by that; I don't want to ever have to deal with that again.
> > Qais did something very similar recently: > > > > https://lore.kernel.org/lkml/20190604111459.2862-1-qais.you...@arm.com/ > > > > You'll have to implement the associated trace events in a module, which > > lets you define your own event format and doesn't form an ABI :). > > Is that really true? eBPF programs loaded from userspace can access > tracepoints through BPF_RAW_TRACEPOINT_OPEN, which is UAPI: > https://github.com/torvalds/linux/blob/master/include/uapi/linux/bpf.h#L103 > > I don't have a strong opinion about considering tracepoints as ABI / API or > not, but just want to get the facts straight :) eBPF can access all sorts of kernel internals; if we were to deem eBPF an API we'd be fscked.
Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint
On 2019/9/4 0:05, Valentin Schneider wrote:
> On 03/09/2019 16:43, Radim Krčmář wrote:
> > The paper "The Linux Scheduler: a Decade of Wasted Cores" used several custom data gathering points to better understand what was going on in the scheduler. Red Hat adapted one of them for the tracepoint framework and created a tool to plot a heatmap of nr_running, where the sched_update_nr_running tracepoint is being used for fine grained monitoring of scheduling imbalance. The tool is available from https://github.com/jirvoz/plot-nr-running.
> >
> > The best place for the tracepoints is inside the add/sub_nr_running, which requires some shenanigans to make it work as they are defined inside sched.h. The tracepoints have to be included from sched.h, which means that CREATE_TRACE_POINTS has to be defined for the whole header and this might cause problems if tree-wide headers expose tracepoints in sched.h dependencies, but I'd argue it's the other side's misuse of tracepoints.
> >
> > Moving the import sched.h line lower would require fixes in s390 and ppc headers, because they don't include dependencies properly and expect sched.h to do it, so it is simpler to keep sched.h there and preventively undefine CREATE_TRACE_POINTS right after.
> >
> > Exports of the pelt tracepoints remain because they don't need to be protected by CREATE_TRACE_POINTS and moving them closer would be unsightly.
>
> Pure trace events are frowned upon in scheduler world, try going with trace points. Qais did something very similar recently:
>
> https://lore.kernel.org/lkml/20190604111459.2862-1-qais.you...@arm.com/
>
> You'll have to implement the associated trace events in a module, which lets you define your own event format and doesn't form an ABI :).

Hi Radim,

Why not try Chrome Tracing?

1. Record trace data (scheduling, irq, etc.). You can produce a JSON file in the format expected by Chrome, or just save the captured data directly as "xxx.html"; it still works:

   cat /sys/kernel/debug/tracing/trace > trace.html

2. Go to chrome://tracing in Chrome.

3. Click "Load" and open your file, or alternatively drag the file into Chrome.

4. Profit!

Systrace (Android System Trace) captures and displays execution times of your app's processes and other Android system processes; fortunately, there are some ported versions for Linux Desktop/Server:

https://github.com/gatieme/systrace

References:
https://www.chromium.org/developers/how-tos/trace-event-profiling-tool
https://www.chromium.org/developers/how-tos/trace-event-profiling-tool/trace-event-reading

Thanks,
- Cheng Jian.
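[Editor's note] The "format expected by Chrome" here is the Trace Event Format: a JSON array of event objects keyed by name/ph/ts/pid. A rough, hypothetical converter from default ftrace text lines to that format could look like the sketch below; the line layout and field choices are assumptions for illustration, not part of any tool in this thread:

```python
import json
import re

# Default ftrace text line, e.g.:
#   bash-2412  [002]  105.123456: sched_switch: prev_comm=bash ...
FTRACE_LINE = re.compile(
    r"^\s*(?P<comm>.+)-(?P<pid>\d+)\s+\[(?P<cpu>\d+)\]\s+"
    r"(?P<ts>[\d.]+):\s+(?P<event>\w+):"
)

def ftrace_to_chrome(lines):
    """Convert ftrace text lines to Chrome Trace Event Format objects.

    Emits 'i' (instant) events; 'ts' is in microseconds, as Chrome expects.
    """
    events = []
    for line in lines:
        m = FTRACE_LINE.match(line)
        if not m:
            continue
        events.append({
            "name": m.group("event"),
            "ph": "i",            # instant event
            "ts": float(m.group("ts")) * 1e6,
            "pid": int(m.group("pid")),
            "tid": int(m.group("cpu")),  # abuse tid to get a per-CPU row
            "s": "t",             # thread-scoped instant
        })
    return events

sample = ["  bash-2412  [002]  105.123456: sched_switch: prev_comm=bash"]
print(json.dumps(ftrace_to_chrome(sample)))
```

Writing the resulting array to a file and loading it in chrome://tracing gives one row per CPU with an instant marker per event.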
Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint
On Tue, Sep 03, 2019 at 05:05:47PM +0100, Valentin Schneider wrote: > On 03/09/2019 16:43, Radim Krčmář wrote: > > The paper "The Linux Scheduler: a Decade of Wasted Cores" used several > > custom data gathering points to better understand what was going on in > > the scheduler. > > Red Hat adapted one of them for the tracepoint framework and created a > > tool to plot a heatmap of nr_running, where the sched_update_nr_running > > tracepoint is being used for fine grained monitoring of scheduling > > imbalance. > > The tool is available from https://github.com/jirvoz/plot-nr-running. > > > > The best place for the tracepoints is inside the add/sub_nr_running, > > which requires some shenanigans to make it work as they are defined > > inside sched.h. > > The tracepoints have to be included from sched.h, which means that > > CREATE_TRACE_POINTS has to be defined for the whole header and this > > might cause problems if tree-wide headers expose tracepoints in sched.h > > dependencies, but I'd argue it's the other side's misuse of tracepoints. > > > > Moving the import sched.h line lower would require fixes in s390 and ppc > > headers, because they don't include dependecies properly and expect > > sched.h to do it, so it is simpler to keep sched.h there and > > preventively undefine CREATE_TRACE_POINTS right after. > > > > Exports of the pelt tracepoints remain because they don't need to be > > protected by CREATE_TRACE_POINTS and moving them closer would be > > unsightly. > > > > Pure trace events are frowned upon in scheduler world, try going with > trace points. Qais did something very similar recently: > > https://lore.kernel.org/lkml/20190604111459.2862-1-qais.you...@arm.com/ > > You'll have to implement the associated trace events in a module, which > lets you define your own event format and doesn't form an ABI :). Is that really true? 
eBPF programs loaded from userspace can access tracepoints through BPF_RAW_TRACEPOINT_OPEN, which is UAPI: https://github.com/torvalds/linux/blob/master/include/uapi/linux/bpf.h#L103 I don't have a strong opinion about considering tracepoints as ABI / API or not, but just want to get the facts straight :) thanks, - Joel
Re: [PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint
On 03/09/2019 16:43, Radim Krčmář wrote: > The paper "The Linux Scheduler: a Decade of Wasted Cores" used several > custom data gathering points to better understand what was going on in > the scheduler. > Red Hat adapted one of them for the tracepoint framework and created a > tool to plot a heatmap of nr_running, where the sched_update_nr_running > tracepoint is being used for fine grained monitoring of scheduling > imbalance. > The tool is available from https://github.com/jirvoz/plot-nr-running. > > The best place for the tracepoints is inside the add/sub_nr_running, > which requires some shenanigans to make it work as they are defined > inside sched.h. > The tracepoints have to be included from sched.h, which means that > CREATE_TRACE_POINTS has to be defined for the whole header and this > might cause problems if tree-wide headers expose tracepoints in sched.h > dependencies, but I'd argue it's the other side's misuse of tracepoints. > > Moving the import sched.h line lower would require fixes in s390 and ppc > headers, because they don't include dependecies properly and expect > sched.h to do it, so it is simpler to keep sched.h there and > preventively undefine CREATE_TRACE_POINTS right after. > > Exports of the pelt tracepoints remain because they don't need to be > protected by CREATE_TRACE_POINTS and moving them closer would be > unsightly. > Pure trace events are frowned upon in scheduler world, try going with trace points. Qais did something very similar recently: https://lore.kernel.org/lkml/20190604111459.2862-1-qais.you...@arm.com/ You'll have to implement the associated trace events in a module, which lets you define your own event format and doesn't form an ABI :).
[PATCH 2/2] sched/debug: add sched_update_nr_running tracepoint
The paper "The Linux Scheduler: a Decade of Wasted Cores" used several custom data gathering points to better understand what was going on in the scheduler. Red Hat adapted one of them for the tracepoint framework and created a tool to plot a heatmap of nr_running, where the sched_update_nr_running tracepoint is being used for fine-grained monitoring of scheduling imbalance. The tool is available from https://github.com/jirvoz/plot-nr-running.

The best place for the tracepoints is inside add/sub_nr_running, which requires some shenanigans to make it work as they are defined inside sched.h. The tracepoints have to be included from sched.h, which means that CREATE_TRACE_POINTS has to be defined for the whole header and this might cause problems if tree-wide headers expose tracepoints in sched.h dependencies, but I'd argue it's the other side's misuse of tracepoints.

Moving the sched.h include lower would require fixes in s390 and ppc headers, because they don't include dependencies properly and expect sched.h to do it, so it is simpler to keep sched.h there and preventively undefine CREATE_TRACE_POINTS right after.

Exports of the pelt tracepoints remain because they don't need to be protected by CREATE_TRACE_POINTS and moving them closer would be unsightly.
Tested-by: Jirka Hladký
Tested-by: Jiří Vozár
Signed-off-by: Radim Krčmář
---
 include/trace/events/sched.h | 22 ++
 kernel/sched/core.c          |  7 ++-
 kernel/sched/fair.c          |  2 --
 kernel/sched/sched.h         |  7 +++
 4 files changed, 31 insertions(+), 7 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 420e80e56e55..1527fc695609 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -625,6 +625,28 @@ DECLARE_TRACE(sched_overutilized_tp,
 	TP_PROTO(struct root_domain *rd, bool overutilized),
 	TP_ARGS(rd, overutilized));
 
+TRACE_EVENT(sched_update_nr_running,
+
+	TP_PROTO(int cpu, int change, unsigned int nr_running),
+
+	TP_ARGS(cpu, change, nr_running),
+
+	TP_STRUCT__entry(
+		__field(int, cpu)
+		__field(int, change)
+		__field(unsigned int, nr_running)
+	),
+
+	TP_fast_assign(
+		__entry->cpu = cpu;
+		__entry->change = change;
+		__entry->nr_running = nr_running;
+	),
+
+	TP_printk("cpu=%u nr_running=%u (%d)",
+		__entry->cpu, __entry->nr_running, __entry->change)
+);
+
 #endif /* _TRACE_SCHED_H */
 
 /* This part must be outside protection */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 71981ce84231..31ac37b9f6f7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6,7 +6,9 @@
  *
  *  Copyright (C) 1991-2002  Linus Torvalds
  */
+#define CREATE_TRACE_POINTS
 #include "sched.h"
+#undef CREATE_TRACE_POINTS
 
 #include <linux/nospec.h>
 
@@ -20,9 +22,6 @@
 
 #include "pelt.h"
 
-#define CREATE_TRACE_POINTS
-#include <trace/events/sched.h>
-
 /*
  * Export tracepoints that act as a bare tracehook (ie: have no trace event
  * associated with them) to allow external modules to probe them.
@@ -7618,5 +7617,3 @@ const u32 sched_prio_to_wmult[40] = {
  /* 10 */ 39045157, 49367440, 61356676, 76695844, 95443717,
  /* 15 */ 119304647, 148102320, 186737708, 238609294, 286331153,
 };
-
-#undef CREATE_TRACE_POINTS
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 84959d3285d1..421236d39902 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -22,8 +22,6 @@
  */
 #include "sched.h"
 
-#include <trace/events/sched.h>
-
 /*
  * Targeted preemption latency for CPU-bound tasks:
  *
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c4915f46035a..b89d7786109a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -75,6 +75,8 @@
 #include "cpupri.h"
 #include "cpudeadline.h"
 
+#include <trace/events/sched.h>
+
 #ifdef CONFIG_SCHED_DEBUG
 # define SCHED_WARN_ON(x)	WARN_ONCE(x, #x)
 #else
@@ -1887,6 +1889,8 @@ static inline void add_nr_running(struct rq *rq, unsigned count)
 
 	rq->nr_running = prev_nr + count;
 
+	trace_sched_update_nr_running(cpu_of(rq), count, rq->nr_running);
+
 #ifdef CONFIG_SMP
 	if (prev_nr < 2 && rq->nr_running >= 2) {
 		if (!READ_ONCE(rq->rd->overload))
@@ -1900,6 +1904,9 @@ static inline void add_nr_running(struct rq *rq, unsigned count)
 static inline void sub_nr_running(struct rq *rq, unsigned count)
 {
 	rq->nr_running -= count;
+
+	trace_sched_update_nr_running(cpu_of(rq), -count, rq->nr_running);
+
 	/* Check if we still need preemption */
 	sched_update_tick_dependency(rq);
 }
-- 
2.23.0
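[Editor's note] As a sketch of what a consumer like plot-nr-running has to do with this event: the TP_printk format in the patch ("cpu=%u nr_running=%u (%d)") can be folded into a per-CPU timeline that a heatmap then bins by time. The parsing code below is illustrative, not taken from the actual tool:

```python
import re
from collections import defaultdict

# Matches the payload produced by the tracepoint's TP_printk, e.g.:
#   "cpu=2 nr_running=3 (1)"
EVENT = re.compile(r"cpu=(\d+) nr_running=(\d+) \((-?\d+)\)")

def nr_running_timeline(samples):
    """Build {cpu: [(timestamp, nr_running), ...]} from (ts, payload) pairs.

    This is the raw data a heatmap tool would bin by time and CPU.
    """
    timeline = defaultdict(list)
    for ts, payload in samples:
        m = EVENT.search(payload)
        if m:
            cpu, nr, _change = map(int, m.groups())
            timeline[cpu].append((ts, nr))
    return dict(timeline)

samples = [
    (100.001, "cpu=0 nr_running=2 (1)"),
    (100.002, "cpu=1 nr_running=1 (1)"),
    (100.005, "cpu=0 nr_running=1 (-1)"),
]
print(nr_running_timeline(samples))
```

Note that because the event fires inside add/sub_nr_running(), each sample carries the post-update nr_running, so the timeline needs no reconstruction from deltas; the change field is only a cross-check.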