On Thu, May 01, 2025 at 04:57:30PM -0400, Steven Rostedt wrote:
> On Thu, 1 May 2025 13:14:11 -0700
> Namhyung Kim <[email protected]> wrote:
> 
> > Hi Steve,
> > 
> > On Wed, Apr 30, 2025 at 09:32:07PM -0400, Steven Rostedt wrote:
> 
> > > To solve this, when a per CPU event is created that has defer_callchain
> > > attribute set, it will do a lookup from a global list
> > > (unwind_deferred_list), for a perf_unwind_deferred descriptor that has the
> > > id that matches the PID of the current task's group_leader.  
> > 
> > Nice, it'd work well with the perf tools at least.
> 
> Cool!
> 
> 
> 
> > > +static void perf_event_deferred_cpu(struct unwind_work *work,
> > > +                             struct unwind_stacktrace *trace, u64 cookie)
> > > +{
> > > + struct perf_unwind_deferred *defer =
> > > +         container_of(work, struct perf_unwind_deferred, unwind_work);
> > > + struct perf_unwind_cpu *cpu_unwind;
> > > + struct perf_event *event;
> > > + int cpu;
> > > +
> > > + guard(rcu)();
> > > + guard(preempt)();
> > > +
> > > + cpu = smp_processor_id();
> > > + cpu_unwind = &defer->cpu_events[cpu];
> > > +
> > > + WRITE_ONCE(cpu_unwind->processing, 1);
> > > + /*
> > > +  * Make sure the above is seen for the rcuwait in
> > > +  * perf_remove_unwind_deferred() before iterating the loop.
> > > +  */
> > > + smp_mb();
> > > +
> > > + list_for_each_entry_rcu(event, &cpu_unwind->list, unwind_list) {
> > > +         perf_event_callchain_deferred(event, trace);
> > > +         /* Only the first CPU event gets the trace */
> > > +         break;  
> > 
> > I guess this is to emit a callchain record when more than one events
> > requested the deferred callchains for the same task like:
> > 
> >   $ perf record -a -e cycles,instructions
> > 
> > right?
> 
> Yeah. If perf assigns more than one per-CPU event, we only need one of
> those events to record the deferred trace, not both of them.
> 
> But I keep a linked list so that if the program closes the first one and
> keeps the second active, this will still work, as the first one would be
> removed from the list, and the second one would pick up the tracing after
> that.

Makes sense.

> 
> > 
> > 
> > > + }
> > > +
> > > + WRITE_ONCE(cpu_unwind->processing, 0);
> > > + rcuwait_wake_up(&cpu_unwind->pending_unwind_wait);
> > > +}
> > > +
> > >  static void perf_free_addr_filters(struct perf_event *event);
> > >  
> > >  /* vs perf_event_alloc() error */
> > > @@ -8198,6 +8355,15 @@ static int deferred_request_nmi(struct perf_event *event)
> > >   return 0;
> > >  }
> > >  
> > > +static int deferred_unwind_request(struct perf_unwind_deferred *defer)
> > > +{
> > > + u64 cookie;
> > > + int ret;
> > > +
> > > + ret = unwind_deferred_request(&defer->unwind_work, &cookie);
> > > + return ret < 0 ? ret : 0;
> > > +}
> > > +
> > >  /*
> > >   * Returns:
> > >   *     > 0 : if already queued.
> > > @@ -8210,11 +8376,14 @@ static int deferred_request(struct perf_event *event)
> > >   int pending;
> > >   int ret;
> > >  
> > > - /* Only defer for task events */
> > > - if (!event->ctx->task)
> > > + if ((current->flags & PF_KTHREAD) || !user_mode(task_pt_regs(current)))
> > >           return -EINVAL;
> > >  
> > > - if ((current->flags & PF_KTHREAD) || !user_mode(task_pt_regs(current)))
> > > + if (event->unwind_deferred)
> > > +         return deferred_unwind_request(event->unwind_deferred);
> > > +
> > > + /* Per CPU events should have had unwind_deferred set! */
> > > + if (WARN_ON_ONCE(!event->ctx->task))
> > >           return -EINVAL;
> > >  
> > >   if (in_nmi())
> > > @@ -13100,13 +13269,20 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
> > >           }
> > >   }
> > >  
> > > + /* Setup unwind deferring for per CPU events */
> > > + if (event->attr.defer_callchain && !task) {  
> > 
> > As I said it should handle per-task and per-CPU events.  How about this?
> 
> Hmm, I just added some printk()s in this code, and it seems that perf
> record always uses per-CPU events.

Right, that's the default behavior.

> 
> But if an event is per-CPU and per-task, will it still only trace that
> task? It will never trace another task, right?

Yes, the event can be inherited by a child, but then the child will create
a new event, so each task will have its own events.

> 
> Because the way this is currently implemented is that the event that
> requested the callback is the one that records it, even if it runs on
> another CPU:
> 
> In defer_request_nmi():
> 
>       struct callback_head *work = &event->pending_unwind_work;
>       int ret;
> 
>       if (event->pending_unwind_callback)
>               return 1;
> 
>       ret = task_work_add(current, work, TWA_NMI_CURRENT);
>       if (ret)
>               return ret;
> 
>       event->pending_unwind_callback = 1;
> 
> The task_work_add() adds the work from the event's pending_unwind_work.
> 
> Now the callback will be:
> 
> static void perf_event_deferred_task(struct callback_head *work)
> {
> 	struct perf_event *event = container_of(work, struct perf_event, pending_unwind_work);
> 
> // the above is the event that requested this. This may run on another CPU.
> 
>       struct unwind_stacktrace trace;
> 
>       if (!event->pending_unwind_callback)
>               return;
> 
>       if (unwind_deferred_trace(&trace) >= 0) {
> 
> 		/*
> 		 * All accesses to the event must belong to the same implicit RCU
> 		 * read-side critical section as the ->pending_unwind_callback
> 		 * reset.  See comment in perf_pending_unwind_sync().
> 		 */
>               guard(rcu)();
>               perf_event_callchain_deferred(event, &trace);
> 
> // The above records the stack trace to that event.
> // Again, this may happen on another CPU.
> 
>       }
> 
>       event->pending_unwind_callback = 0;
>       local_dec(&event->ctx->nr_no_switch_fast);
>       rcuwait_wake_up(&event->pending_unwind_wait);
> }
> 
> Is the recording to an event from one CPU to another CPU an issue, if that
> event also is only tracing a task?

IIUC it should be fine as long as you use the unwind descriptor logic
like in the per-CPU case.  The data should be written to the current
CPU's ring buffer for per-task and per-CPU events.

> 
> > 
> >     if (event->attr.defer_callchain) {
> >             if (event->cpu >= 0) {
> >                     err = perf_add_unwind_deferred(event);
> >                     if (err)
> >                             return ERR_PTR(err);
> >             } else {
> >                     init_task_work(&event->pending_unwind_work,
> >                                     perf_event_callchain_deferred,
> >                                     perf_event_deferred_task);
> >             }
> >     }
> > 
> > > +         err = perf_add_unwind_deferred(event);
> > > +         if (err)
> > > +                 return ERR_PTR(err);
> > > + }
> > > +
> > >   err = security_perf_event_alloc(event);
> > >   if (err)
> > >           return ERR_PTR(err);
> > >  
> > >   if (event->attr.defer_callchain)
> > >           init_task_work(&event->pending_unwind_work,
> > > -                        perf_event_callchain_deferred);
> > > +                        perf_event_deferred_task);  
> > 
> > And you can remove here.
> 
> There's nothing wrong with always initializing it. It will just never be
> called.

Ok.

> 
> What situation do we have where cpu is negative? What's the perf command?
> Is there one?

Yep, there's the --per-thread option for just per-task events.

Thanks,
Namhyung
