On 2026-01-08 22:05, Steven Rostedt wrote:
From: "Paul E. McKenney" <[email protected]>
[...]

I disagree with many elements of the proposed approach.

On one hand, we have BPF wanting to hook onto arbitrary tracepoints
without adding significant latency to PREEMPT_RT kernels.

On the other hand, we have high-speed tracers which execute very short
critical sections to serialize trace data into ring buffers.

All of those users register with the tracepoint API.

We also have to consider that migrate disable is *not* cheap at all
compared to preempt disable.

So I'm wondering why all tracepoint users need to pay the migrate
disable runtime overhead on PREEMPT_RT kernels for the sake of BPF?

Using SRCU-fast to protect tracepoint callback iteration makes sense
for PREEMPT_RT, but I'd recommend moving the migrate disable guard
into the BPF callback code rather than slowing down the other tracers,
which execute within a very short amount of time. Other tracers can
then choose to disable preemption rather than migration if that is a
better fit for their needs, along the lines of the sketch below.
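
Roughly what I have in mind, as an untested sketch (the function name
is made up; the real BPF tracepoint entry point lives in
kernel/trace/bpf_trace.c and may look different):

/* BPF-side callback: pay the migrate-disable cost here only. */
static void bpf_tracepoint_callback(struct bpf_prog *prog, u64 *args)
{
        /* BPF programs expect to stay on the same CPU while running. */
        guard(migrate)();

        rcu_read_lock();
        (void) bpf_prog_run(prog, args);
        rcu_read_unlock();
}

That keeps the tracepoint callback iteration itself as cheap as
possible for everyone else.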

diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index 3690221ba3d8..a2704c35eda8 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -222,6 +222,26 @@ static inline unsigned int tracing_gen_ctx_dec(void)
        return trace_ctx;
  }
+/*
+ * When PREEMPT_RT is enabled, trace events are called with disabled
+ * migration. The trace events need to know if the tracepoint disabled
+ * migration or not so that what is recorded to the ring buffer shows
+ * the state of when the trace event triggered, and not the state caused
+ * by the trace event.
+ */
+#ifdef CONFIG_PREEMPT_RT
+static inline unsigned int tracing_gen_ctx_dec_cond(void)
+{
+       unsigned int trace_ctx;
+
+       trace_ctx = tracing_gen_ctx_dec();
+       /* The migration counter starts at bit 4 */
+       return trace_ctx - (1 << 4);

We should turn this hardcoded "4" value into an enum label or a
define. That define should be exposed by tracepoint.h. We should
not hardcode expectations about the implementation of distinct APIs
across the tracing subsystem.
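
Something along these lines, with names that are only suggestions:

/* include/linux/tracepoint.h */
/* Offset of the migration counter within the trace_ctx word. */
#define TRACEPOINT_MIGRATE_SHIFT        4

/* include/linux/trace_events.h */
static inline unsigned int tracing_gen_ctx_dec_cond(void)
{
        unsigned int trace_ctx;

        trace_ctx = tracing_gen_ctx_dec();
        /* Undo the migrate-disable count added by the tracepoint. */
        return trace_ctx - (1U << TRACEPOINT_MIGRATE_SHIFT);
}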

[...]

--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -100,6 +100,25 @@ void for_each_tracepoint_in_module(struct module *mod,
  }
  #endif /* CONFIG_MODULES */
+/*
+ * BPF programs can attach to the tracepoint callbacks. But if the
+ * callbacks are called with preemption disabled, the BPF programs
+ * can cause quite a bit of latency. When PREEMPT_RT is enabled,
+ * instead of disabling preemption, use srcu_fast_notrace() for
+ * synchronization. As BPF programs that are attached to tracepoints
+ * expect to stay on the same CPU, also disable migration.
+ */
+#ifdef CONFIG_PREEMPT_RT
+extern struct srcu_struct tracepoint_srcu;
+# define tracepoint_sync() synchronize_srcu(&tracepoint_srcu);
+# define tracepoint_guard()                            \
+       guard(srcu_fast_notrace)(&tracepoint_srcu); \
+       guard(migrate)()
+#else
+# define tracepoint_sync() synchronize_rcu();
+# define tracepoint_guard() guard(preempt_notrace)()
+#endif

Doing migrate disable on PREEMPT_RT for BPF vs preempt disable in other
tracers should come in a separate preparation patch. It belongs to the
tracers, not to tracepoints.
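
In other words, I would rather see something like this in tracepoint.h
(same names as in the patch, with the migrate guard dropped from the
common path; untested):

#ifdef CONFIG_PREEMPT_RT
extern struct srcu_struct tracepoint_srcu;
# define tracepoint_sync()  synchronize_srcu(&tracepoint_srcu)
# define tracepoint_guard() guard(srcu_fast_notrace)(&tracepoint_srcu)
#else
# define tracepoint_sync()  synchronize_rcu()
# define tracepoint_guard() guard(preempt_notrace)()
#endif

and let the BPF callback do its own guard(migrate)() as sketched above.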

[...]
diff --git a/include/trace/perf.h b/include/trace/perf.h
index a1754b73a8f5..348ad1d9b556 100644
--- a/include/trace/perf.h
+++ b/include/trace/perf.h
@@ -71,6 +71,7 @@ perf_trace_##call(void *__data, proto)                        \
        u64 __count __attribute__((unused));                            \
        struct task_struct *__task __attribute__((unused));             \
                                                                        \
+       guard(preempt_notrace)();                                       \
        do_perf_trace_##call(__data, args);                             \
  }
@@ -85,9 +86,8 @@ perf_trace_##call(void *__data, proto) \
        struct task_struct *__task __attribute__((unused));             \
                                                                        \
        might_fault();                                                  \
-       preempt_disable_notrace();                                      \
+       guard(preempt_notrace)();                                       \
        do_perf_trace_##call(__data, args);                             \
-       preempt_enable_notrace();                                       \

Move this to a perf-specific preparation patch.

  }
/*
diff --git a/include/trace/trace_events.h b/include/trace/trace_events.h
index 4f22136fd465..6fb58387e9f1 100644
--- a/include/trace/trace_events.h
+++ b/include/trace/trace_events.h
@@ -429,6 +429,22 @@ do_trace_event_raw_event_##call(void *__data, proto)       \
        trace_event_buffer_commit(&fbuffer);                                \
  }
+/*
+ * When PREEMPT_RT is enabled, the tracepoint does not disable preemption
+ * but instead disables migration. The callbacks for the trace events
+ * need to have a consistent state so that it can reflect the proper
+ * preempt_disabled counter.

Having those defines within trace_events.h pokes holes in any hope of
abstraction we can have from the tracepoint.h API. It adds strong
coupling between tracepoint.h and trace_events.h.

Rather than hardcoding preempt counter expectations across tracepoint.h
and trace_events.h, we should expose a #define in tracepoint.h which
makes the preempt counter nesting level available to other parts of the
kernel such as trace_events.h. This way we keep everything in one place
and we don't scatter cross-references about subtle preempt counter
nesting level details; see the strawman after the quoted hunk below.

+ */
+#ifdef CONFIG_PREEMPT_RT
+/* disable preemption for RT so that the counters still match */
+# define trace_event_guard() guard(preempt_notrace)()
+/* Have syscalls up the migrate disable counter to emulate non-syscalls */
+# define trace_syscall_event_guard() guard(migrate)()
+#else
+# define trace_event_guard()
+# define trace_syscall_event_guard()
+#endif
This should be moved to separate tracer-specific prep patches.
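
As a strawman for how the coupling could be kept in tracepoint.h only
(the define name is made up; untested):

/* include/linux/tracepoint.h */
#ifdef CONFIG_PREEMPT_RT
# define TRACEPOINT_DISABLES_MIGRATION  1
#else
# define TRACEPOINT_DISABLES_MIGRATION  0
#endif

/* include/trace/trace_events.h: no CONFIG_PREEMPT_RT knowledge here. */
#if TRACEPOINT_DISABLES_MIGRATION
# define trace_event_guard()            guard(preempt_notrace)()
# define trace_syscall_event_guard()    guard(migrate)()
#else
# define trace_event_guard()
# define trace_syscall_event_guard()
#endif

That way trace_events.h only depends on what tracepoint.h documents
and exposes, not on how it happens to be implemented on PREEMPT_RT.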

[...]

+ * The @trace_file is the desrciptor with information about the status

descriptor

[...]

+ *
+ * Returns a pointer to the data on the ring buffer or NULL if the
+ *   event was not reserved (event was filtered, too big, or the buffer
+ *   simply was disabled for write).

odd spaces here.

[...]

+#ifdef CONFIG_PREEMPT_RT
+static void srcu_free_old_probes(struct rcu_head *head)
+{
+       kfree(container_of(head, struct tp_probes, rcu));
+}
+
+static void rcu_free_old_probes(struct rcu_head *head)
+{
+       call_srcu(&tracepoint_srcu, head, srcu_free_old_probes);
+}
+#else
  static void rcu_free_old_probes(struct rcu_head *head)
  {
        kfree(container_of(head, struct tp_probes, rcu));
  }
+#endif
static inline void release_probes(struct tracepoint *tp, struct tracepoint_func *old)
  {
@@ -112,6 +149,13 @@ static inline void release_probes(struct tracepoint *tp, struct tracepoint_func
                struct tp_probes *tp_probes = container_of(old,
                        struct tp_probes, probes[0]);
+               /*
+                * Tracepoint probes are protected by either RCU or
+                * Tasks Trace RCU and also by SRCU.  By calling the SRCU

I'm confused.

My understanding is that in !RT we have:

- RCU (!tracepoint_is_faultable(tp))
- RCU tasks trace (tracepoint_is_faultable(tp))

And for RT:

- SRCU-fast (!tracepoint_is_faultable(tp))
- RCU tasks trace (tracepoint_is_faultable(tp))

So I don't understand this comment, and I also don't understand why we
need to chain the callbacks rather than just calling the appropriate
call_rcu/call_srcu flavor based on the tracepoint "is_faultable" state;
a sketch of what I would have expected follows the quoted hunk below.

What am I missing?

+                * callback in the [Tasks Trace] RCU callback we cover
+                * both cases. So let us chain the SRCU and [Tasks Trace]
+                * RCU callbacks to wait for both grace periods.
+                */
                if (tracepoint_is_faultable(tp))
                        call_rcu_tasks_trace(&tp_probes->rcu, rcu_free_old_probes);
                else
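
To make the question concrete, here is the kind of thing I would have
expected instead, picking the flavor directly rather than chaining
callbacks. Untested sketch, reusing the tracepoint_srcu introduced by
this patch and assuming it is visible in both configurations (or use an
#ifdef instead of IS_ENABLED()):

static void free_old_probes(struct rcu_head *head)
{
        kfree(container_of(head, struct tp_probes, rcu));
}

static inline void release_probes(struct tracepoint *tp,
                                  struct tracepoint_func *old)
{
        if (old) {
                struct tp_probes *tp_probes = container_of(old,
                                struct tp_probes, probes[0]);

                if (tracepoint_is_faultable(tp))
                        call_rcu_tasks_trace(&tp_probes->rcu,
                                             free_old_probes);
                else if (IS_ENABLED(CONFIG_PREEMPT_RT))
                        call_srcu(&tracepoint_srcu, &tp_probes->rcu,
                                  free_old_probes);
                else
                        call_rcu(&tp_probes->rcu, free_old_probes);
        }
}

If both the SRCU and RCU grace periods really need to be waited for
here, it would be good to spell out why in the comment.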

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
