We have been consistently triggering the warning WARN_ON_ONCE(cpuctx->cgrp) in perf_cgroup_switch() for a rather long time, although we still have no clue how to reproduce it.
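
For reference, the WARN sits in the sched-in half of perf_cgroup_switch(); that branch assumes any previously scheduled-in cgroup was already switched out, roughly (excerpt, annotation mine):

	if (mode & PERF_CGROUP_SWIN) {
		/* A stale cpuctx->cgrp from a skipped sched-out trips this. */
		WARN_ON_ONCE(cpuctx->cgrp);
		cpuctx->cgrp = perf_cgroup_from_task(task, &cpuctx->ctx);
		cpu_ctx_sched_in(cpuctx, EVENT_ALL, task);
	}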
Looking into the code, the only possibility I can see is that the process which called perf_event_open() with a cgroup target exits after a process in that cgroup has gained the CPU to run, but before it is scheduled out again. This is because we use the per-CPU atomic counter perf_cgroup_events as an indication of whether cgroup perf events are enabled, which is inaccurate, as illustrated below:

CPU 0                                   CPU 1

// open perf events with a cgroup
// target for all CPUs
perf_event_open():
  account_event_cpu()
  // perf_cgroup_events == 1
                                        // schedule in a process in the
                                        // target cgroup
                                        perf_cgroup_switch()
perf_event_release_kernel():
  unaccount_event_cpu()
  // perf_cgroup_events == 0
                                        // schedule out
                                        // but perf_cgroup_sched_out() is skipped
                                        // cpuctx->cgrp left as non-NULL

                                        // schedule in another process
                                        perf_cgroup_switch()
                                        // WARN triggered

The proposed fix is kinda ugly: it adds a per-task flag to indicate whether the task still has to go through perf_cgroup_sched_out() when perf_cgroup_events gives a false negative. The other possible fix is to force a reschedule on each target CPU before decreasing the perf_cgroup_events counter, but that is expensive.

Suggestions? Thoughts?

Cc: Ingo Molnar <mi...@redhat.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Arnaldo Carvalho de Melo <a...@kernel.org>
Cc: Alexander Shishkin <alexander.shish...@linux.intel.com>
Cc: Jiri Olsa <jo...@redhat.com>
Cc: Namhyung Kim <namhy...@kernel.org>
Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com>
---
 include/linux/sched.h | 3 +++
 kernel/events/core.c  | 5 ++++-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a2cd15855bad..835bdf15f92c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -733,6 +733,9 @@ struct task_struct {
 	/* to be used once the psi infrastructure lands upstream. */
 	unsigned			use_memdelay:1;
 #endif
+#ifdef CONFIG_PERF_EVENTS
+	unsigned			perf_cgroup_sched_in:1;
+#endif
 
 	unsigned long			atomic_flags; /* Flags requiring atomic access. */
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index abbd4b3b96c2..9b86b043018e 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -817,6 +817,7 @@ static void perf_cgroup_switch(struct task_struct *task, int mode)
 			 * to event_filter_match() in event_sched_out()
 			 */
 			cpuctx->cgrp = NULL;
+			task->perf_cgroup_sched_in = 0;
 		}
 
 		if (mode & PERF_CGROUP_SWIN) {
@@ -831,6 +832,7 @@ static void perf_cgroup_switch(struct task_struct *task, int mode)
 			cpuctx->cgrp = perf_cgroup_from_task(task,
 							     &cpuctx->ctx);
 			cpu_ctx_sched_in(cpuctx, EVENT_ALL, task);
+			task->perf_cgroup_sched_in = 1;
 		}
 		perf_pmu_enable(cpuctx->ctx.pmu);
 		perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
@@ -3233,7 +3235,8 @@ void __perf_event_task_sched_out(struct task_struct *task,
 	 * to check if we have to switch out PMU state.
 	 * cgroup event are system-wide mode only
 	 */
-	if (atomic_read(this_cpu_ptr(&perf_cgroup_events)))
+	if (atomic_read(this_cpu_ptr(&perf_cgroup_events)) ||
+	    task->perf_cgroup_sched_in)
 		perf_cgroup_sched_out(task, next);
 }
 
-- 
2.21.0
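
As a note for anyone trying to reproduce this: the events in question are cgroup-scoped, i.e. opened from userspace roughly along these lines (a minimal sketch, one event per CPU; the cgroup path and attribute choice are only illustrative):

	#include <linux/perf_event.h>
	#include <sys/syscall.h>
	#include <string.h>
	#include <fcntl.h>
	#include <unistd.h>

	/* Open a cycles counter on one CPU, scoped to the given cgroup. */
	static int open_cgroup_event(const char *cgrp_path, int cpu)
	{
		struct perf_event_attr attr;
		int cgrp_fd;

		memset(&attr, 0, sizeof(attr));
		attr.size   = sizeof(attr);
		attr.type   = PERF_TYPE_HARDWARE;
		attr.config = PERF_COUNT_HW_CPU_CYCLES;

		cgrp_fd = open(cgrp_path, O_RDONLY);
		if (cgrp_fd < 0)
			return -1;

		/*
		 * With PERF_FLAG_PID_CGROUP the "pid" argument is a cgroup
		 * directory fd; the event counts only while tasks of that
		 * cgroup run on the given CPU.  Opening such an event is
		 * what bumps perf_cgroup_events, and closing it (or the
		 * opener exiting) drops the counter back to zero.
		 */
		return syscall(__NR_perf_event_open, &attr, cgrp_fd, cpu,
			       -1, PERF_FLAG_PID_CGROUP);
	}

To hit the race described above, the opener then has to exit (dropping perf_cgroup_events) while a task of the target cgroup is still on the CPU.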