On Tue, 2026-05-12 at 02:24 +0800, [email protected] wrote:
> From: Wen Yang <[email protected]>
> 
> diff --git a/Documentation/trace/rv/monitor_tlob.rst
> b/Documentation/trace/rv/monitor_tlob.rst
> new file mode 100644
> index 000000000000..91b592630b3f
> --- /dev/null
> +++ b/Documentation/trace/rv/monitor_tlob.rst
> +Usage
> +-----
> +
> +tracefs interface (uprobe-based external monitoring)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +The ``monitor`` tracefs file instruments an unmodified binary via uprobes.
> +The format follows the ftrace ``uprobe_events`` convention (``PATH:OFFSET``
> +for the probe location, ``key=value`` for configuration parameters)::
> +
> +  p PATH:OFFSET_START OFFSET_STOP threshold=US
> +
> +The uprobe at ``OFFSET_START`` fires ``tlob_start_task()``; the uprobe at
> +``OFFSET_STOP`` fires ``tlob_stop_task()``.  Both offsets are ELF file
> +offsets of entry points in ``PATH``.  ``PATH`` may contain ``:``; the last
> +``:`` in the ``PATH:OFFSET_START`` token is the separator.
> +
> +To remove a binding, use ``-PATH:OFFSET_START``::
> +
> +  echo 1 > /sys/kernel/tracing/rv/monitors/tlob/enable
> +
> +  echo "p /usr/bin/myapp:0x12a0 0x12f0 threshold=5000" \
> +      > /sys/kernel/tracing/rv/monitors/tlob/monitor
> +
> +  # Remove a binding
> +  echo "-/usr/bin/myapp:0x12a0" >
> /sys/kernel/tracing/rv/monitors/tlob/monitor
> +
> +  # List registered bindings
> +  cat /sys/kernel/tracing/rv/monitors/tlob/monitor
> +
> +  # Read violations from the trace buffer
> +  cat /sys/kernel/tracing/trace
> +
> +ioctl self-instrumentation (/dev/rv)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

I'm not particularly fond of ioctls, they aren't that flexible and in
this way I don't really see an added value.

In short, you're adding this so a program could instrument itself using
ioctls instead of using uprobes, cannot the same thing be achieved using
uprobes alone, e.g. by registering a function address or the current
instruction pointer?

If you really cannot do it with uprobes alone, wouldn't a sysfs/tracefs file
achieve a similar purpose without much of the boilerplate code?

> +
> +``/dev/rv`` is a shared RV character device.  Before using any monitor-
> specific
> +ioctl, the fd must be bound to a monitor via ``RV_IOCTL_BIND_MONITOR``.  Each
> +open fd has independent per-fd monitoring state::
> +
> +  int fd = open("/dev/rv", O_RDWR);
> +
> +  /* Bind this fd to the tlob monitor. */
> +  struct rv_bind_args bind = { .monitor_name = "tlob" };
> +  ioctl(fd, RV_IOCTL_BIND_MONITOR, &bind);
> +
> +  struct tlob_start_args args = {
> +      .threshold_us = 50000,   /* 50 ms in microseconds */
> +  };
> +  ioctl(fd, TLOB_IOCTL_TRACE_START, &args);
> +
> +  /* ... code path under observation ... */
> +
> +  int ret = ioctl(fd, TLOB_IOCTL_TRACE_STOP, NULL);
> +  /* ret == 0:          within budget  */
> +  /* ret == -EOVERFLOW: budget exceeded */
> +
> +  close(fd);
> +
> +``TRACE_STOP`` returns ``-EOVERFLOW`` whenever the budget was exceeded.
> +The HA timer calls ``da_monitor_reset()`` (storage remains); the
> +synchronous ``ha_cancel_timer_sync()`` in ``tlob_stop_task()`` ensures the
> +callback has completed before checking ``da_monitoring()``.
> +
> +Violation events
> +~~~~~~~~~~~~~~~~

Since you are not documenting the detail_env_tlob tracepoint, is it
something really required?
It's deviating from the original RV purpose (run a model and spot
violations) by adding further accounting, I'm fine with that if there is a
documented need.
In such case I would at the very least document its usage (thought I'd
really like to be rid of it and let the curious user implement the
accounting themselves).
> +
> +Budget violations are always reported via the ``error_env_tlob`` RV
> +tracepoint (HA clock-invariant violation), regardless of which interface
> +triggered them::
> +
> +  cat /sys/kernel/tracing/trace
> +
> +To capture violations in a file::
> +
> +  trace-cmd record -e error_env_tlob &
> +  # ... run workload ...
> +  trace-cmd report
> +

This is standard tracepoints usage, there's nothing about tlob we should
document here. If you feel the existing RV documentation should expand
this subject, feel free to contribute there.

> +tracefs files
> +-------------
> +
> +The following files are created under
> +``/sys/kernel/tracing/rv/monitors/tlob/``:
> +
> +``enable`` (rw)
> +  Write ``1`` to enable the monitor; write ``0`` to disable it.
> +
> +``desc`` (ro)
> +  Human-readable description of the monitor.

Same here, standard RV.

> +
> +``monitor`` (rw)
> +  Write ``p PATH:OFFSET_START OFFSET_STOP threshold=US``
> +  to bind two entry uprobes.  Write ``-PATH:OFFSET_START`` to remove a
> +  binding.  Read to list registered bindings in the same format.

And this is duplicating what mentioned above about uprobes, isn't it?

> +
> +Kernel API
> +----------
> +
> +.. kernel-doc:: kernel/trace/rv/monitors/tlob/tlob.c
> +   :functions: tlob_start_task tlob_stop_task
> +
> +``tlob_start_task(task, threshold_us)``
> +  Begin monitoring *task* with a total latency budget of *threshold_us*
> +  microseconds.  Allocates per-task state, sets initial DA state to
> +  ``running``, resets ``clk_elapsed``, and arms the HA budget timer.
> +  Returns 0, -ENODEV (monitor disabled), -ERANGE (zero threshold),
> +  -EALREADY (already monitoring), -ENOSPC (at capacity), or -ENOMEM.
> +
> +``tlob_stop_task(task)``
> +  Stop monitoring *task*.  Synchronously cancels the HA timer via
> +  ``ha_cancel_timer_sync()``, checks ``da_monitoring()`` to determine
> outcome.
> +  Returns 0 (clean stop, within budget), -EOVERFLOW (budget was exceeded),
> +  -ESRCH (not monitored), or -EAGAIN (concurrent stop racing).
> +

Is kernel code going to use this API? RV monitors are meant to be
enabled by userspace. What's the use-case here?

> +Design notes
> +------------
> +
> +State transitions are driven by two tracepoints:
> +
> +- ``sched_switch``: ``prev_state == 0`` (``TASK_RUNNING``, preempted,
> +  stays on runqueue) → running→waiting; ``prev_state != 0`` (voluntarily
> +  blocked, leaves runqueue) → running→sleeping; ``next`` pointer →
> +  waiting→running.
> +- ``sched_wakeup``: task moves back onto the runqueue → sleeping→waiting.
> +
> +No ``waiting → sleeping`` edge exists because a task can only block
> +itself while executing on CPU.  ``try_to_wake_up()`` is also a no-op
> +when ``__state == TASK_RUNNING``, so ``sched_wakeup`` never fires while
> +the task is in ``waiting`` state.

That's probably a bit too detailed for this page.
If you really want this information somewhere couldn't it stay in the
code?

Thanks,
Gabriele


Reply via email to