Hi Joel,
On Thu, May 7, 2026 at 7:59 PM Joel Fernandes <[email protected]> wrote:
>
>
>
> On 5/7/2026 1:37 PM, Gustavo Luiz Duarte wrote:
> > There is currently no easy way to monitor how many RCU callbacks are
> > pending system-wide. The existing trace points provide per-event data
> > but require active tracing, which makes them awkward for fleet-wide
> > monitoring. Knowing the depth and stage of pending callbacks helps
> > admins reason about RCU health, gives an indirect signal of memory
> > held back by RCU, and is useful when tuning RCU parameters.
> >
> > This series adds a debugfs file at:
> >
> > /sys/kernel/debug/rcu/pending_cbs
> >
> > that reports per-CPU pending callback counts with a "total" row.
> >
> > Patch 1 introduces the file with per-CPU columns for each segcblist
> > segment (done, wait, next_ready, next) plus a "lazy" column.
> >
> > Patch 2 extends the file with a "kfree_rcu" column reporting objects
> > queued in the batched kfree_rcu()/kvfree_rcu() path
> > (CONFIG_KVFREE_RCU_BATCHED), which has its own per-CPU queues outside
> > the main segmented callback list.
> >
> > Signed-off-by: Gustavo Luiz Duarte <[email protected]>
>
> You actually don't need debugfs for this. You can just use bpftrace and
> instrument trace_rcu_ (with other RCU tracing Kconfig options enabled?). I had
> something like that working sometime ago.

My first attempt with tracepoints probed trace_rcu_segcb_stats, but that
adds overhead to every callback enqueue/dequeue event, which is too
expensive for a production environment. I played a bit more with bpftrace
and managed to get this working with an interval probe plus some
__per_cpu_offset pointer arithmetic (see below). It is not the most
maintainable code and has some race issues, but it is probably acceptable
for us if you believe that having this information easily available
doesn't add value for other use cases.

If anyone is interested, here is what I came up with:

interval:s:5 {
	printf("===== %s =====\n", strftime("%H:%M:%S", nsecs));

	// Addresses of the per-CPU variables and of the per-CPU offset table.
	$rdp_base = kaddr("rcu_data");
	$krc_base = kaddr("krc");
	$offsets = (uint64 *)kaddr("__per_cpu_offset");

	for ($cpu : 0..ncpus) {
		// This CPU's copy = variable address + __per_cpu_offset[cpu].
		$rdp = (struct rcu_data *)($rdp_base + $offsets[$cpu]);
		$krcp = (struct kfree_rcu_cpu *)($krc_base + $offsets[$cpu]);

		// Objects queued in the batched kfree_rcu()/kvfree_rcu() path.
		$kfree = $krcp->head_count.counter
			+ $krcp->bulk_count[0].counter
			+ $krcp->bulk_count[1].counter;

		// seglen[0..3]: done, wait, next_ready ("nr"), next.
		printf("cpu: %d done: %ld wait: %ld nr: %ld next: %ld lazy: %ld kfree: %d\n",
			$cpu,
			$rdp->cblist.seglen[0],
			$rdp->cblist.seglen[1],
			$rdp->cblist.seglen[2],
			$rdp->cblist.seglen[3],
			$rdp->lazy_len,
			$kfree);
	}
}
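
The script prints only per-CPU lines, with no "total" row like the one the
debugfs file has. If a total is wanted, it can be computed by post-processing
the output. Here is a rough sketch in Python, not something I have run against
a real system. It assumes each per-CPU line sits on a single line in the form
"cpu: N done: N wait: N nr: N next: N lazy: N kfree: N" and that each sample
starts with a "=====" header line. The helper name totals_per_sample is made
up for illustration.

```python
import re
from collections import Counter

# Matches "name: value" pairs such as "done: 12" on a per-CPU line.
FIELD_RE = re.compile(r"(\w+):\s*(-?\d+)")


def totals_per_sample(lines):
    """Yield one {column: total} dict per sample of the bpftrace output.

    A sample is assumed to start at a "=====" header line and to consist
    of lines beginning with "cpu:"; any other line is ignored.
    """
    totals = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("====="):
            # A new sample begins: flush the previous one, if it had data.
            if totals:
                yield dict(totals)
            totals = Counter()
        elif line.startswith("cpu:"):
            if totals is None:
                totals = Counter()
            for name, value in FIELD_RE.findall(line):
                if name != "cpu":  # the CPU number is a label, not a count
                    totals[name] += int(value)
    # Flush the last sample at end of input.
    if totals:
        yield dict(totals)
```

With the bpftrace output piped into a script that passes sys.stdin to
totals_per_sample(), each yielded dict is the "total" row for one 5-second
sample.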
>
> Generally RCU doesn't add userspace interfaces randomly like that. I remember
> Paul ripped similar things out some time ago.

Debugfs is intentionally not a stable ABI, so the bar for adding things
that are useful for debugging and tuning seems lower than it is for /proc
or /sys, which is why I went with debugfs here.