On 2026-01-09 12:21, Steven Rostedt wrote:
On Fri, 9 Jan 2026 09:40:17 -0500
Mathieu Desnoyers <[email protected]> wrote:
On 2026-01-08 22:05, Steven Rostedt wrote:
From: "Paul E. McKenney" <[email protected]>
[...]
I disagree with many elements of the proposed approach.
On one hand we have BPF wanting to hook on arbitrary tracepoints without
adding significant latency to PREEMPT_RT kernels.
On the other hand, we have high-speed tracers which execute very short
critical sections to serialize trace data into ring buffers.
All of those users register to the tracepoint API.
We also have to consider that migrate disable is *not* cheap at all
compared to preempt disable.
To be fair, every spin_lock() converted into a mutex in PREEMPT_RT now
calls migrate_disable() instead of preempt_disable(). I'm just saying the
overhead of migrate_disable() in PREEMPT_RT is not limited to tracepoints.
That's a fair point. That being said, the kernel code which relies on
spin locks for fast paths tends to use raw spinlocks (e.g. the scheduler),
which AFAIU do not need migrate disable because they still disable
preemption even on preempt RT.
So I'm wondering why all tracepoint users need to pay the migrate
disable runtime overhead on preempt RT kernels for the sake of BPF ?
We could argue that it is to keep the same paradigm as non-RT, where
the code expects to stay on the same CPU. This is why we needed to add it
to the spin_lock() code. Only a few places in the kernel expect spin_lock()
to pin the current task to the CPU, but because of those few cases, we
needed to make all callers of spin_lock() call migrate_disable() :-/
For tracepoints we have a handful of tracer users, so I don't consider
the scope of the audit to be at the same scale as spinlocks. Changing
each tracer user to either disable preemption or migration if they need
it on preempt-RT is fairly straightforward. And if tracers don't need
to stay on a single CPU, that's even better.
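For instance, a tracer that really does need to stay on one CPU for its
serialization fast path can simply take care of it in its own probe.
A rough sketch (the probe name and arguments are made up here):

static void example_tracer_probe(void *data, unsigned long arg)
{
	/*
	 * Keep the short ring-buffer serialization window preempt-off
	 * ourselves rather than relying on the tracepoint caller.
	 */
	preempt_disable_notrace();
	/* ... serialize the event into a per-CPU ring buffer ... */
	preempt_enable_notrace();
}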
I'm also inclined to consider moving !RT kernels to srcu-fast rather
than RCU to make RT and !RT more alike. The performance of
srcu-fast read-side is excellent, at least on my EPYC test machine.
This would open the door to relaxing the preempt-off/migrate-off
requirements on tracer callbacks whose implementations can deal with
it.
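To give an idea, the read side around the callback iteration boils down
to something like this (a sketch assuming the srcu_read_lock_fast() /
srcu_read_unlock_fast() API; the tracepoint_srcu domain name is only a
placeholder here):

	struct srcu_ctr __percpu *scp;

	scp = srcu_read_lock_fast(&tracepoint_srcu);
	/* ... iterate the tracepoint funcs array and invoke the probes ... */
	srcu_read_unlock_fast(&tracepoint_srcu, scp);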
Using SRCU-fast to protect tracepoint callback iteration makes sense
for preempt-rt, but I'd recommend moving the migrate disable guard
within the bpf callback code rather than slowing down other tracers
which execute within a short amount of time. Other tracers can then
choose to disable preemption rather than migration if that's a better
fit for their needs.
This is a discussion to have with the BPF folks.
FWIW, the approach I'm proposing would be similar to what I've done for
faultable syscall tracepoints.
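In other words, something along these lines on the BPF side (only a
sketch to illustrate the idea; the wrapper name is made up, but
migrate_disable()/migrate_enable() and bpf_prog_run() are the real
primitives):

static void example_bpf_tp_probe(struct bpf_prog *prog, void *ctx)
{
	/* BPF expects to stay on the same CPU while the program runs. */
	migrate_disable();
	bpf_prog_run(prog, ctx);
	migrate_enable();
}

That way only the BPF callback pays for migrate disable, and tracers
with short preempt-off critical sections are left alone.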
[...]
+
+ trace_ctx = tracing_gen_ctx_dec();
+ /* The migration counter starts at bit 4 */
+ return trace_ctx - (1 << 4);
We should turn this hardcoded "4" value into an enum label or a
define. That define should be exposed by tracepoint.h. We should
not hardcode expectations about the implementation of distinct APIs
across the tracing subsystem.
This is exposed to user space already, so that 4 will never change. And
this is specifically for "trace events" which are what are attached to
tracepoints. No other tracepoint caller needs to know about this "4". This
value goes into the common_preempt_count of the event. libtraceevent
already parses this.
I have no problem making this an enum (or define) and using it here and
where it is set in trace.c:tracing_gen_ctx_irq_test(). But it belongs in
trace_event.h not tracepoint.h.
I can call it TRACE_MIGRATION_SHIFT
#define TRACE_MIGRATION_SHIFT 4
I'm OK with that.
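i.e. something along these lines (just to illustrate; the value itself
is ABI and cannot change):

#define TRACE_MIGRATION_SHIFT	4

	trace_ctx = tracing_gen_ctx_dec();
	/* Remove the migrate-disable count added by the tracepoint itself. */
	return trace_ctx - (1 << TRACE_MIGRATION_SHIFT);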
[...]
}
/*
diff --git a/include/trace/trace_events.h b/include/trace/trace_events.h
index 4f22136fd465..6fb58387e9f1 100644
--- a/include/trace/trace_events.h
+++ b/include/trace/trace_events.h
@@ -429,6 +429,22 @@ do_trace_event_raw_event_##call(void *__data, proto)
\
trace_event_buffer_commit(&fbuffer); \
}
+/*
+ * When PREEMPT_RT is enabled, the tracepoint does not disable preemption
+ * but instead disables migration. The callbacks for the trace events
+ * need to have a consistent state so that they can reflect the proper
+ * preempt_disabled counter.
Having those defines within trace_events.h pokes holes in any hope of
abstraction we can have from the tracepoint.h API. This adds strong
coupling between tracepoint.h and trace_event.h.
OK, I see you are worried about the coupling between the behavior of
tracepoint.h and trace_event.h.
Rather than hardcoding preempt counter expectations across tracepoint
and trace-events, we should expose a #define in tracepoint.h which
will make the preempt counter nesting level available to other
parts of the kernel such as trace_events.h. This way we keep everything
in one place and we don't add cross-references about subtle preempt
counter nesting level details.
OK, so how do we pass this information from tracepoint.h to the users? I
hate to add another field to task_struct for this.
This could be as simple as adding something akin to
/*
* Level of preempt disable nesting between tracepoint caller code and
* tracer callback.
*/
#ifdef CONFIG_PREEMPT_RT
# define TRACEPOINT_PREEMPT_DISABLE_NESTING 0
#else
# define TRACEPOINT_PREEMPT_DISABLE_NESTING 1
#endif
to tracepoint.h and then use that from trace_event.h.
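trace_event.h could then stop hardcoding the tracepoint behaviour and
do something like this (illustrative only, not a tested patch):

	trace_ctx = tracing_gen_ctx_dec();
	if (!TRACEPOINT_PREEMPT_DISABLE_NESTING) {
		/*
		 * The tracepoint disabled migration rather than preemption:
		 * remove that count (at bit TRACE_MIGRATION_SHIFT) instead.
		 */
		trace_ctx -= 1 << TRACE_MIGRATION_SHIFT;
	}
	return trace_ctx;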
[...]
+ *
+ * Returns a pointer to the data on the ring buffer or NULL if the
+ * event was not reserved (event was filtered, too big, or the buffer
+ * simply was disabled for write).
odd spaces here.
You mean the indentation? I could indent it more and also add a colon:
* Returns: A pointer to the data on the ring buffer or NULL if the
* event was not reserved (event was filtered, too big, or the
* buffer simply was disabled for write).
Would that work better?
Yes.
[...]
So I don't understand this comment, and also I don't understand why we
need to chain the callbacks rather than just call the appropriate
call_rcu based on the tracepoint "is_faultable" state.
What am I missing ?
Ah, you are right. I think this was the result of trying different ways of
synchronization. The non-faultable version should be wrapped to either call
normal RCU synchronization or SRCU synchronization.
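Something like this, roughly (a sketch, not the actual patch; the
tp_is_faultable() helper and the tracepoint_srcu / rcu_free_old_probes
names are approximations of the kernel/tracepoint.c conventions):

static void release_probes(struct tracepoint *tp, struct tracepoint_func *old)
{
	struct tp_probes *tp_probes;

	if (!old)
		return;
	tp_probes = container_of(old, struct tp_probes, probes[0]);
	if (tp_is_faultable(tp)) {
		/* Faultable (sleepable) probes: wait for SRCU readers. */
		call_srcu(&tracepoint_srcu, &tp_probes->rcu, rcu_free_old_probes);
	} else {
		/* Regular probes: wait for classic RCU readers. */
		call_rcu(&tp_probes->rcu, rcu_free_old_probes);
	}
}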
Yes. Or we switch even !RT kernels' non-faultable tracepoints to
srcu-fast, considering its low overhead. This could come as a future
step though.
I can send a new version, or do we want to wait if the BPF folks have a
better idea about the "migrate disable" issue?
I'd recommend going ahead and moving the migrate disable into the BPF
callback, similarly to what I did for the faultable syscall tracepoints,
and letting the bpf folks comment on the resulting patch. It's not that
much code, and showing the change will ground the discussion in
something more tangible.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com