On Mon, 2026-04-13 at 03:27 +0800, [email protected] wrote:
> From: Wen Yang <[email protected]>
>
> Add the tlob (task latency over budget) RV monitor. tlob tracks the
> monotonic elapsed time (CLOCK_MONOTONIC) of a marked per-task code
> path, including time off-CPU, and fires a per-task hrtimer when the
> elapsed time exceeds a configurable budget.
>
> Three-state DA (unmonitored/on_cpu/off_cpu) driven by trace_start,
> switch_in/out, and budget_expired events. Per-task state lives in a
> fixed-size hash table (TLOB_MAX_MONITORED slots) with RCU-deferred
> free.
>
> Two userspace interfaces:
> - tracefs: uprobe pair registration via the monitor file using the
>   format "pid:threshold_us:offset_start:offset_stop:binary_path"
> - /dev/rv ioctls (CONFIG_RV_CHARDEV): TLOB_IOCTL_TRACE_START /
>   TRACE_STOP; TRACE_STOP returns -EOVERFLOW on violation
>
> Each /dev/rv fd has a per-fd mmap ring buffer (physically contiguous
> pages). A control page (struct tlob_mmap_page) at offset 0 exposes
> head/tail/dropped for lockless userspace reads; struct tlob_event
> records follow at data_offset. Drop-new policy on overflow.
>
> UAPI: include/uapi/linux/rv.h (tlob_start_args, tlob_event,
> tlob_mmap_page, ioctl numbers), monitor_tlob.rst,
> ioctl-number.rst (RV_IOC_MAGIC=0xB9).
>
I'm not fully grasping all the requirements for the monitors yet, but I
see you are reimplementing a lot of functionality in the monitor itself
rather than within RV; let's see if we can consolidate some of it:

* you're using timer expirations, can we do it with timed automata? [1]
* RV automata usually don't have an /unmonitored/ state: your
  trace_start event would be the start condition (da_event_start) and
  the monitor would go non-running at each violation (it calls
  da_monitor_reset() automatically), so all setup/cleanup logic should
  be handled implicitly within RV. I believe that would also save you
  that ugly trace_event_tlob() redefinition.
* you're maintaining a local hash table keyed by task_struct; that
  could use the per-object monitors [2], where your "object" is in
  fact your struct, allocated when you start the monitor with all the
  appropriate fields and indexed by pid.
* you are handling violations manually; considering timed automata
  trigger a full-fledged violation on timeouts, can you use the RV way
  (error tracepoints or reactors only)? Do you need the additional
  reporting within the tracepoint/ioctl? Could the userspace consumer
  deduce all of that from other events and let RV do just the
  monitoring?
* I like the uprobe thing; we could probably move all of that to a
  common helper once we figure out how to make it generic.

Note: [1] and [2] haven't reached upstream yet, but should reach
linux-next soon.
Thanks,
Gabriele

[1] - https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git/commit/?h=rv/for-next&id=f5587d1b6ec938afb2f74fe399a68020d66923e4
[2] - https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git/commit/?h=rv/for-next&id=da282bf7fadb095ee0a40c32ff0126429c769b45

> Signed-off-by: Wen Yang <[email protected]>
> ---
>  Documentation/trace/rv/index.rst              |   1 +
>  Documentation/trace/rv/monitor_tlob.rst       | 381 +++++++
>  .../userspace-api/ioctl/ioctl-number.rst      |   1 +
>  include/uapi/linux/rv.h                       | 181 ++++
>  kernel/trace/rv/Kconfig                       |  17 +
>  kernel/trace/rv/Makefile                      |   2 +
>  kernel/trace/rv/monitors/tlob/Kconfig         |  51 +
>  kernel/trace/rv/monitors/tlob/tlob.c          | 986 ++++++++++++++++++
>  kernel/trace/rv/monitors/tlob/tlob.h          | 145 +++
>  kernel/trace/rv/monitors/tlob/tlob_trace.h    |  42 +
>  kernel/trace/rv/rv.c                          |   4 +
>  kernel/trace/rv/rv_dev.c                      | 602 +++++++++++
>  kernel/trace/rv/rv_trace.h                    |  50 +
>  13 files changed, 2463 insertions(+)
>  create mode 100644 Documentation/trace/rv/monitor_tlob.rst
>  create mode 100644 include/uapi/linux/rv.h
>  create mode 100644 kernel/trace/rv/monitors/tlob/Kconfig
>  create mode 100644 kernel/trace/rv/monitors/tlob/tlob.c
>  create mode 100644 kernel/trace/rv/monitors/tlob/tlob.h
>  create mode 100644 kernel/trace/rv/monitors/tlob/tlob_trace.h
>  create mode 100644 kernel/trace/rv/rv_dev.c
>
> diff --git a/Documentation/trace/rv/index.rst b/Documentation/trace/rv/index.rst
> index a2812ac5c..4f2bfaf38 100644
> --- a/Documentation/trace/rv/index.rst
> +++ b/Documentation/trace/rv/index.rst
> @@ -15,3 +15,4 @@ Runtime Verification
>     monitor_wwnr.rst
>     monitor_sched.rst
>     monitor_rtapp.rst
> +   monitor_tlob.rst
> diff --git a/Documentation/trace/rv/monitor_tlob.rst b/Documentation/trace/rv/monitor_tlob.rst
> new file mode 100644
> index 000000000..d498e9894
> --- /dev/null
> +++ b/Documentation/trace/rv/monitor_tlob.rst
> @@ -0,0 +1,381 @@
> +..
SPDX-License-Identifier: GPL-2.0 > + > +Monitor tlob > +============ > + > +- Name: tlob - task latency over budget > +- Type: per-task deterministic automaton > +- Author: Wen Yang <[email protected]> > + > +Description > +----------- > + > +The tlob monitor tracks per-task elapsed time (CLOCK_MONOTONIC, including > +both on-CPU and off-CPU time) and reports a violation when the monitored > +task exceeds a configurable latency budget threshold. > + > +The monitor implements a three-state deterministic automaton:: > + > + | > + | (initial) > + v > + +--------------+ > + +-------> | unmonitored | > + | +--------------+ > + | | > + | trace_start > + | v > + | +--------------+ > + | | on_cpu | > + | +--------------+ > + | | | > + | switch_out| | trace_stop / budget_expired > + | v v > + | +--------------+ (unmonitored) > + | | off_cpu | > + | +--------------+ > + | | | > + | | switch_in| trace_stop / budget_expired > + | v v > + | (on_cpu) (unmonitored) > + | > + +-- trace_stop (from on_cpu or off_cpu) > + > + Key transitions: > + unmonitored --(trace_start)--> on_cpu > + on_cpu --(switch_out)--> off_cpu > + off_cpu --(switch_in)--> on_cpu > + on_cpu --(trace_stop)--> unmonitored > + off_cpu --(trace_stop)--> unmonitored > + on_cpu --(budget_expired)-> unmonitored [violation] > + off_cpu --(budget_expired)-> unmonitored [violation] > + > + sched_wakeup self-loops in on_cpu and unmonitored; switch_out and > + sched_wakeup self-loop in off_cpu. budget_expired is fired by the one-shot > hrtimer; it always > + transitions to unmonitored regardless of whether the task is on-CPU > + or off-CPU when the timer fires. > + > +State Descriptions > +------------------ > + > +- **unmonitored**: Task is not being traced. Scheduling events > + (``switch_in``, ``switch_out``, ``sched_wakeup``) are silently > + ignored (self-loop). The monitor waits for a ``trace_start`` event > + to begin a new observation window. 
> + > +- **on_cpu**: Task is running on the CPU with the deadline timer armed. > + A one-shot hrtimer was set for ``threshold_us`` microseconds at > + ``trace_start`` time. A ``switch_out`` event transitions to > + ``off_cpu``; the hrtimer keeps running (off-CPU time counts toward > + the budget). A ``trace_stop`` cancels the timer and returns to > + ``unmonitored`` (normal completion). If the hrtimer fires > + (``budget_expired``) the violation is recorded and the automaton > + transitions to ``unmonitored``. > + > +- **off_cpu**: Task was preempted or blocked. The one-shot hrtimer > + continues to run. A ``switch_in`` event returns to ``on_cpu``. > + A ``trace_stop`` cancels the timer and returns to ``unmonitored``. > + If the hrtimer fires (``budget_expired``) while the task is off-CPU, > + the violation is recorded and the automaton transitions to > + ``unmonitored``. > + > +Rationale > +--------- > + > +The per-task latency budget threshold allows operators to express timing > +requirements in microseconds and receive an immediate ftrace event when a > +task exceeds its budget. This is useful for real-time tasks > +(``SCHED_FIFO`` / ``SCHED_DEADLINE``) where total elapsed time must > +remain within a known bound. > + > +Each task has an independent threshold, so up to ``TLOB_MAX_MONITORED`` > +(64) tasks with different timing requirements can be monitored > +simultaneously. > + > +On threshold violation the automaton records a ``tlob_budget_exceeded`` > +ftrace event carrying the final on-CPU / off-CPU time breakdown, but does > +not kill or throttle the task. Monitoring can be restarted by issuing a > +new ``trace_start`` event (or a new ``TLOB_IOCTL_TRACE_START`` ioctl). > + > +A per-task one-shot hrtimer is armed at ``trace_start`` for exactly > +``threshold_us`` microseconds. It fires at most once per monitoring > +window, performs an O(1) hash lookup, records the violation, and injects > +the ``budget_expired`` event into the DA. 
When ``CONFIG_RV_MON_TLOB`` > +is not set there is zero runtime cost. > + > +Usage > +----- > + > +tracefs interface (uprobe-based external monitoring) > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > + > +The ``monitor`` tracefs file allows any privileged user to instrument an > +unmodified binary via uprobes, without changing its source code. Write a > +four-field record to attach two plain entry uprobes: one at > +``offset_start`` fires ``tlob_start_task()`` and one at ``offset_stop`` > +fires ``tlob_stop_task()``, so the latency budget covers exactly the code > +region between the two offsets:: > + > + threshold_us:offset_start:offset_stop:binary_path > + > +``binary_path`` comes last so it may freely contain ``:`` (e.g. paths > +inside a container namespace). > + > +The uprobes fire for every task that executes the probed instruction in > +the binary, consistent with the native uprobe semantics. All tasks that > +execute the code region get independent per-task monitoring slots. > + > +Using two plain entry uprobes (rather than a uretprobe for the stop) means > +that a mistyped offset can never corrupt the call stack; the worst outcome > +of a bad ``offset_stop`` is a missed stop that causes the hrtimer to fire > +and report a budget violation. 
> +
> +Example -- monitor a code region in ``/usr/bin/myapp`` with a 5 ms
> +budget, where the region starts at offset 0x12a0 and ends at 0x12f0::
> +
> +  echo 1 > /sys/kernel/tracing/rv/monitors/tlob/enable
> +
> +  # Bind uprobes: start probe starts the clock, stop probe stops it
> +  echo "5000:0x12a0:0x12f0:/usr/bin/myapp" \
> +      > /sys/kernel/tracing/rv/monitors/tlob/monitor
> +
> +  # Remove the uprobe binding for this code region
> +  echo "-0x12a0:/usr/bin/myapp" > /sys/kernel/tracing/rv/monitors/tlob/monitor
> +
> +  # List registered uprobe bindings (mirrors the write format)
> +  cat /sys/kernel/tracing/rv/monitors/tlob/monitor
> +  # -> 5000:0x12a0:0x12f0:/usr/bin/myapp
> +
> +  # Read violations from the trace buffer
> +  cat /sys/kernel/tracing/trace
> +
> +Up to ``TLOB_MAX_MONITORED`` tasks may be monitored simultaneously.
> +
> +The offsets can be obtained with ``nm`` or ``readelf``::
> +
> +  nm -n /usr/bin/myapp | grep my_function
> +  # -> 00000000000012a0 T my_function
> +
> +  readelf -s /usr/bin/myapp | grep my_function
> +  # -> 42: 00000000000012a0   336 FUNC GLOBAL DEFAULT 13 my_function
> +
> +  # offset_start = 0x12a0 (function entry)
> +  # offset_stop = 0x12a0 + 0x50 = 0x12f0 (or any instruction before return)
> +
> +Notes:
> +
> +- The uprobes fire for every task that executes the probed instruction,
> +  so concurrent calls from different threads each get independent
> +  monitoring slots.
> +- ``offset_stop`` need not be a function return; it can be any instruction
> +  within the region. If the stop probe is never reached (e.g. an early exit
> +  path bypasses it), the hrtimer fires and a budget violation is reported.
> +- Each ``(binary_path, offset_start)`` pair may only be registered once.
> +  A second write with the same ``offset_start`` for the same binary is
> +  rejected with ``-EEXIST``.
Two entry uprobes at the same address would > + both fire for every task, causing ``tlob_start_task()`` to be called > + twice; the second call would silently fail with ``-EEXIST`` and the > + second binding's threshold would never take effect. Different code > + regions that share the same ``offset_stop`` (common exit point) are > + explicitly allowed. > +- The uprobe binding is removed when ``-offset_start:binary_path`` is > + written to ``monitor``, or when the monitor is disabled. > +- The ``tag`` field in every ``tlob_budget_exceeded`` event is > + automatically set to ``offset_start`` for the tracefs path, so > + violation events for different code regions are immediately > + distinguishable even when ``threshold_us`` values are identical. > + > +ftrace ring buffer (budget violation events) > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > + > +When a monitored task exceeds its latency budget the hrtimer fires, > +records the violation, and emits a single ``tlob_budget_exceeded`` event > +into the ftrace ring buffer. **Nothing is written to the ftrace ring > +buffer while the task is within budget.** > + > +The event carries the on-CPU / off-CPU time breakdown so that root-cause > +analysis (CPU-bound vs. scheduling / I/O overrun) is immediate:: > + > + cat /sys/kernel/tracing/trace > + > +Example output:: > + > + myapp-1234 [003] .... 12345.678: tlob_budget_exceeded: \ > + myapp[1234]: budget exceeded threshold=5000 \ > + on_cpu=820 off_cpu=4500 switches=3 state=off_cpu tag=0x00000000000012a0 > + > +Field descriptions: > + > +``threshold`` > + Configured latency budget in microseconds. > + > +``on_cpu`` > + Cumulative on-CPU time since ``trace_start``, in microseconds. > + > +``off_cpu`` > + Cumulative off-CPU (scheduling + I/O wait) time since ``trace_start``, > + in microseconds. > + > +``switches`` > + Number of times the task was scheduled out during this window. 
> + > +``state`` > + DA state when the hrtimer fired: ``on_cpu`` means the task was executing > + when the budget expired (CPU-bound overrun); ``off_cpu`` means the task > + was preempted or blocked (scheduling / I/O overrun). > + > +``tag`` > + Opaque 64-bit cookie supplied by the caller via ``tlob_start_args.tag`` > + (ioctl path) or automatically set to ``offset_start`` (tracefs uprobe > + path). Use it to distinguish violations from different code regions > + monitored by the same thread. Zero when not set. > + > +To capture violations in a file:: > + > + trace-cmd record -e tlob_budget_exceeded & > + # ... run workload ... > + trace-cmd report > + > +/dev/rv ioctl interface (self-instrumentation) > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > + > +Tasks can self-instrument their own code paths via the ``/dev/rv`` misc > +device (requires ``CONFIG_RV_CHARDEV``). The kernel key is > +``task_struct``; multiple threads sharing a single fd each get their own > +independent monitoring slot. > + > +**Synchronous mode** -- the calling thread checks its own result:: > + > + int fd = open("/dev/rv", O_RDWR); > + > + struct tlob_start_args args = { > + .threshold_us = 50000, /* 50 ms */ > + .tag = 0, /* optional; 0 = don't care */ > + .notify_fd = -1, /* no fd notification */ > + }; > + ioctl(fd, TLOB_IOCTL_TRACE_START, &args); > + > + /* ... code path under observation ... */ > + > + int ret = ioctl(fd, TLOB_IOCTL_TRACE_STOP, NULL); > + /* ret == 0: within budget */ > + /* ret == -EOVERFLOW: budget exceeded */ > + > + close(fd); > + > +**Asynchronous mode** -- a dedicated monitor thread receives violation > +records via ``read()`` on a shared fd, decoupling the observation from > +the critical path:: > + > + /* Monitor thread: open a dedicated fd. */ > + int monitor_fd = open("/dev/rv", O_RDWR); > + > + /* Worker thread: set notify_fd = monitor_fd in TRACE_START args. 
*/ > + int work_fd = open("/dev/rv", O_RDWR); > + struct tlob_start_args args = { > + .threshold_us = 10000, /* 10 ms */ > + .tag = REGION_A, > + .notify_fd = monitor_fd, > + }; > + ioctl(work_fd, TLOB_IOCTL_TRACE_START, &args); > + /* ... critical section ... */ > + ioctl(work_fd, TLOB_IOCTL_TRACE_STOP, NULL); > + > + /* Monitor thread: blocking read() returns one or more tlob_event records. > */ > + struct tlob_event ntfs[8]; > + ssize_t n = read(monitor_fd, ntfs, sizeof(ntfs)); > + for (int i = 0; i < n / sizeof(struct tlob_event); i++) { > + struct tlob_event *ntf = &ntfs[i]; > + printf("tid=%u tag=0x%llx exceeded budget=%llu us " > + "(on_cpu=%llu off_cpu=%llu switches=%u state=%s)\n", > + ntf->tid, ntf->tag, ntf->threshold_us, > + ntf->on_cpu_us, ntf->off_cpu_us, ntf->switches, > + ntf->state ? "on_cpu" : "off_cpu"); > + } > + > +**mmap ring buffer** -- zero-copy consumption of violation events:: > + > + int fd = open("/dev/rv", O_RDWR); > + struct tlob_start_args args = { > + .threshold_us = 1000, /* 1 ms */ > + .notify_fd = fd, /* push violations to own ring buffer */ > + }; > + ioctl(fd, TLOB_IOCTL_TRACE_START, &args); > + > + /* Map the ring: one control page + capacity data records. */ > + size_t pagesize = sysconf(_SC_PAGESIZE); > + size_t cap = 64; /* read from page->capacity after mmap */ > + size_t len = pagesize + cap * sizeof(struct tlob_event); > + void *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); > + > + struct tlob_mmap_page *page = map; > + struct tlob_event *data = > + (struct tlob_event *)((char *)map + page->data_offset); > + > + /* Consumer loop: poll for events, read without copying. 
*/ > + while (1) { > + poll(&(struct pollfd){fd, POLLIN, 0}, 1, -1); > + > + uint32_t head = __atomic_load_n(&page->data_head, __ATOMIC_ACQUIRE); > + uint32_t tail = page->data_tail; > + while (tail != head) { > + handle(&data[tail & (page->capacity - 1)]); > + tail++; > + } > + __atomic_store_n(&page->data_tail, tail, __ATOMIC_RELEASE); > + } > + > +Note: ``read()`` and ``mmap()`` share the same ring and ``data_tail`` > +cursor. Do not use both simultaneously on the same fd. > + > +``tlob_event`` fields: > + > +``tid`` > + Thread ID (``task_pid_vnr``) of the violating task. > + > +``threshold_us`` > + Budget that was exceeded, in microseconds. > + > +``on_cpu_us`` > + Cumulative on-CPU time at violation time, in microseconds. > + > +``off_cpu_us`` > + Cumulative off-CPU time at violation time, in microseconds. > + > +``switches`` > + Number of context switches since ``TRACE_START``. > + > +``state`` > + 1 = timer fired while task was on-CPU; 0 = timer fired while off-CPU. > + > +``tag`` > + Cookie from ``tlob_start_args.tag``; for the tracefs uprobe path this > + equals ``offset_start``. Zero when not set. > + > +tracefs files > +------------- > + > +The following files are created under > +``/sys/kernel/tracing/rv/monitors/tlob/``: > + > +``enable`` (rw) > + Write ``1`` to enable the monitor; write ``0`` to disable it and > + stop all currently monitored tasks. > + > +``desc`` (ro) > + Human-readable description of the monitor. > + > +``monitor`` (rw) > + Write ``threshold_us:offset_start:offset_stop:binary_path`` to bind two > + plain entry uprobes in *binary_path*. The uprobe at *offset_start* fires > + ``tlob_start_task()``; the uprobe at *offset_stop* fires > + ``tlob_stop_task()``. Returns ``-EEXIST`` if a binding with the same > + *offset_start* already exists for *binary_path*. Write > + ``-offset_start:binary_path`` to remove the binding. 
Read to list > + registered bindings, one > + ``threshold_us:0xoffset_start:0xoffset_stop:binary_path`` entry per line. > + > +Specification > +------------- > + > +Graphviz DOT file in tools/verification/models/tlob.dot > diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst > b/Documentation/userspace-api/ioctl/ioctl-number.rst > index 331223761..8d3af68db 100644 > --- a/Documentation/userspace-api/ioctl/ioctl-number.rst > +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst > @@ -385,6 +385,7 @@ Code Seq# Include > File Comments > 0xB8 01-02 uapi/misc/mrvl_cn10k_dpi.h > Marvell CN10K DPI driver > 0xB8 all uapi/linux/mshv.h > Microsoft Hyper-V /dev/mshv driver > > <mailto:[email protected]> > +0xB9 00-3F linux/rv.h > Runtime Verification (RV) monitors > 0xBA 00-0F uapi/linux/liveupdate.h Pasha > Tatashin > > <mailto:[email protected]> > 0xC0 00-0F linux/usb/iowarrior.h > diff --git a/include/uapi/linux/rv.h b/include/uapi/linux/rv.h > new file mode 100644 > index 000000000..d1b96d8cd > --- /dev/null > +++ b/include/uapi/linux/rv.h > @@ -0,0 +1,181 @@ > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ > +/* > + * UAPI definitions for Runtime Verification (RV) monitors. > + * > + * All RV monitors that expose an ioctl self-instrumentation interface > + * share the magic byte RV_IOC_MAGIC (0xB9), registered in > + * Documentation/userspace-api/ioctl/ioctl-number.rst. > + * > + * A single /dev/rv misc device serves as the entry point. ioctl numbers > + * encode both the monitor identity and the operation: > + * > + * 0x01 - 0x1F tlob (task latency over budget) > + * 0x20 - 0x3F reserved for future RV monitors > + * > + * Usage examples and design rationale are in: > + * Documentation/trace/rv/monitor_tlob.rst > + */ > + > +#ifndef _UAPI_LINUX_RV_H > +#define _UAPI_LINUX_RV_H > + > +#include <linux/ioctl.h> > +#include <linux/types.h> > + > +/* Magic byte shared by all RV monitor ioctls. 
*/ > +#define RV_IOC_MAGIC 0xB9 > + > +/* ----------------------------------------------------------------------- > + * tlob: task latency over budget monitor (nr 0x01 - 0x1F) > + * ----------------------------------------------------------------------- > + */ > + > +/** > + * struct tlob_start_args - arguments for TLOB_IOCTL_TRACE_START > + * @threshold_us: Latency budget for this critical section, in microseconds. > + * Must be greater than zero. > + * @tag: Opaque 64-bit cookie supplied by the caller. Echoed back > + * verbatim in the tlob_budget_exceeded ftrace event and in any > + * tlob_event record delivered via @notify_fd. Use it to > identify > + * which code region triggered a violation when the same thread > + * monitors multiple regions sequentially. Set to 0 if not > + * needed. > + * @notify_fd: File descriptor that will receive a tlob_event record on > + * violation. Must refer to an open /dev/rv fd. May equal > + * the calling fd (self-notification, useful for retrieving the > + * on_cpu_us / off_cpu_us breakdown after TRACE_STOP returns > + * -EOVERFLOW). Set to -1 to disable fd notification; in that > + * case violations are only signalled via the TRACE_STOP return > + * value and the tlob_budget_exceeded ftrace event. > + * @flags: Must be 0. Reserved for future extensions. > + */ > +struct tlob_start_args { > + __u64 threshold_us; > + __u64 tag; > + __s32 notify_fd; > + __u32 flags; > +}; > + > +/** > + * struct tlob_event - one budget-exceeded event > + * > + * Consumed by read() on the notify_fd registered at TLOB_IOCTL_TRACE_START. > + * Each record describes a single budget exceedance for one task. > + * > + * @tid: Thread ID (task_pid_vnr) of the violating task. > + * @threshold_us: Budget that was exceeded, in microseconds. > + * @on_cpu_us: Cumulative on-CPU time at violation time, in microseconds. > + * @off_cpu_us: Cumulative off-CPU (scheduling + I/O wait) time at > + * violation time, in microseconds. 
> + * @switches: Number of context switches since TRACE_START. > + * @state: DA state at violation: 1 = on_cpu, 0 = off_cpu. > + * @tag: Cookie from tlob_start_args.tag; for the tracefs uprobe > path > + * this is the offset_start value. Zero when not set. > + */ > +struct tlob_event { > + __u32 tid; > + __u32 pad; > + __u64 threshold_us; > + __u64 on_cpu_us; > + __u64 off_cpu_us; > + __u32 switches; > + __u32 state; /* 1 = on_cpu, 0 = off_cpu */ > + __u64 tag; > +}; > + > +/** > + * struct tlob_mmap_page - control page for the mmap'd violation ring buffer > + * > + * Mapped at offset 0 of the mmap region returned by mmap(2) on a /dev/rv fd. > + * The data array of struct tlob_event records begins at offset @data_offset > + * (always one page from the mmap base; use this field rather than hard- > coding > + * PAGE_SIZE so the code remains correct across architectures). > + * > + * Ring layout: > + * > + * mmap base + 0 : struct tlob_mmap_page (one page) > + * mmap base + data_offset : struct tlob_event[capacity] > + * > + * The mmap length determines the ring capacity. Compute it as: > + * > + * raw = sysconf(_SC_PAGESIZE) + capacity * sizeof(struct tlob_event) > + * length = (raw + sysconf(_SC_PAGESIZE) - 1) & ~(sysconf(_SC_PAGESIZE) - > 1) > + * > + * i.e. round the raw byte count up to the next page boundary before > + * passing it to mmap(2). The kernel requires a page-aligned length. > + * capacity must be a power of 2. Read @capacity after a successful > + * mmap(2) for the actual value. 
> + * > + * Producer/consumer ordering contract: > + * > + * Kernel (producer): > + * data[data_head & (capacity - 1)] = event; > + * // pairs with load-acquire in userspace: > + * smp_store_release(&page->data_head, data_head + 1); > + * > + * Userspace (consumer): > + * // pairs with store-release in kernel: > + * head = __atomic_load_n(&page->data_head, __ATOMIC_ACQUIRE); > + * for (tail = page->data_tail; tail != head; tail++) > + * handle(&data[tail & (capacity - 1)]); > + * __atomic_store_n(&page->data_tail, tail, __ATOMIC_RELEASE); > + * > + * @data_head and @data_tail are monotonically increasing __u32 counters > + * in units of records. Unsigned 32-bit wrap-around is handled correctly > + * by modular arithmetic; the ring is full when > + * (data_head - data_tail) == capacity. > + * > + * When the ring is full the kernel drops the incoming record and increments > + * @dropped. The consumer should check @dropped periodically to detect loss. > + * > + * read() and mmap() share the same ring buffer. Do not use both > + * simultaneously on the same fd. > + * > + * @data_head: Next write slot index. Updated by the kernel with > + * store-release ordering. Read by userspace with load- > acquire. > + * @data_tail: Next read slot index. Updated by userspace. Read by the > + * kernel to detect overflow. > + * @capacity: Actual ring capacity in records (power of 2). Written once > + * by the kernel at mmap time; read-only for userspace > thereafter. > + * @version: Ring buffer ABI version; currently 1. > + * @data_offset: Byte offset from the mmap base to the data array. > + * Always equal to sysconf(_SC_PAGESIZE) on the running kernel. > + * @record_size: sizeof(struct tlob_event) as seen by the kernel. Verify > + * this matches userspace's sizeof before indexing the array. > + * @dropped: Number of events dropped because the ring was full. > + * Monotonically increasing; read with __ATOMIC_RELAXED. 
> + */ > +struct tlob_mmap_page { > + __u32 data_head; > + __u32 data_tail; > + __u32 capacity; > + __u32 version; > + __u32 data_offset; > + __u32 record_size; > + __u64 dropped; > +}; > + > +/* > + * TLOB_IOCTL_TRACE_START - begin monitoring the calling task. > + * > + * Arms a per-task hrtimer for threshold_us microseconds. If args.notify_fd > + * is >= 0, a tlob_event record is pushed into that fd's ring buffer on > + * violation in addition to the tlob_budget_exceeded ftrace event. > + * args.notify_fd == -1 disables fd notification. > + * > + * Violation records are consumed by read() on the notify_fd (blocking or > + * non-blocking depending on O_NONBLOCK). On violation, > TLOB_IOCTL_TRACE_STOP > + * also returns -EOVERFLOW regardless of whether notify_fd is set. > + * > + * args.flags must be 0. > + */ > +#define TLOB_IOCTL_TRACE_START _IOW(RV_IOC_MAGIC, 0x01, struct > tlob_start_args) > + > +/* > + * TLOB_IOCTL_TRACE_STOP - end monitoring the calling task. > + * > + * Returns 0 if within budget, -EOVERFLOW if the budget was exceeded. > + */ > +#define TLOB_IOCTL_TRACE_STOP _IO(RV_IOC_MAGIC, 0x02) > + > +#endif /* _UAPI_LINUX_RV_H */ > diff --git a/kernel/trace/rv/Kconfig b/kernel/trace/rv/Kconfig > index 5b4be87ba..227573cda 100644 > --- a/kernel/trace/rv/Kconfig > +++ b/kernel/trace/rv/Kconfig > @@ -65,6 +65,7 @@ source "kernel/trace/rv/monitors/pagefault/Kconfig" > source "kernel/trace/rv/monitors/sleep/Kconfig" > # Add new rtapp monitors here > > +source "kernel/trace/rv/monitors/tlob/Kconfig" > # Add new monitors here > > config RV_REACTORS > @@ -93,3 +94,19 @@ config RV_REACT_PANIC > help > Enables the panic reactor. The panic reactor emits a printk() > message if an exception is found and panic()s the system. > + > +config RV_CHARDEV > + bool "RV ioctl interface via /dev/rv" > + depends on RV > + default n > + help > + Register a /dev/rv misc device that exposes an ioctl interface > + for RV monitor self-instrumentation. 
All RV monitors share the > + single device node; ioctl numbers encode the monitor identity. > + > + When enabled, user-space programs can open /dev/rv and use > + monitor-specific ioctl commands to bracket code regions they > + want the kernel RV subsystem to observe. > + > + Say Y here if you want to use the tlob self-instrumentation > + ioctl interface; otherwise say N. > diff --git a/kernel/trace/rv/Makefile b/kernel/trace/rv/Makefile > index 750e4ad6f..cc3781a3b 100644 > --- a/kernel/trace/rv/Makefile > +++ b/kernel/trace/rv/Makefile > @@ -3,6 +3,7 @@ > ccflags-y += -I $(src) # needed for trace events > > obj-$(CONFIG_RV) += rv.o > +obj-$(CONFIG_RV_CHARDEV) += rv_dev.o > obj-$(CONFIG_RV_MON_WIP) += monitors/wip/wip.o > obj-$(CONFIG_RV_MON_WWNR) += monitors/wwnr/wwnr.o > obj-$(CONFIG_RV_MON_SCHED) += monitors/sched/sched.o > @@ -17,6 +18,7 @@ obj-$(CONFIG_RV_MON_STS) += monitors/sts/sts.o > obj-$(CONFIG_RV_MON_NRP) += monitors/nrp/nrp.o > obj-$(CONFIG_RV_MON_SSSW) += monitors/sssw/sssw.o > obj-$(CONFIG_RV_MON_OPID) += monitors/opid/opid.o > +obj-$(CONFIG_RV_MON_TLOB) += monitors/tlob/tlob.o > # Add new monitors here > obj-$(CONFIG_RV_REACTORS) += rv_reactors.o > obj-$(CONFIG_RV_REACT_PRINTK) += reactor_printk.o > diff --git a/kernel/trace/rv/monitors/tlob/Kconfig > b/kernel/trace/rv/monitors/tlob/Kconfig > new file mode 100644 > index 000000000..010237480 > --- /dev/null > +++ b/kernel/trace/rv/monitors/tlob/Kconfig > @@ -0,0 +1,51 @@ > +# SPDX-License-Identifier: GPL-2.0-only > +# > +config RV_MON_TLOB > + depends on RV > + depends on UPROBES > + select DA_MON_EVENTS_ID > + bool "tlob monitor" > + help > + Enable the tlob (task latency over budget) monitor. This monitor > + tracks the elapsed time (CLOCK_MONOTONIC) of a marked code path > within a > + task (including both on-CPU and off-CPU time) and reports a > + violation when the elapsed time exceeds a configurable budget > + threshold. > + > + The monitor implements a three-state deterministic automaton. 
> +       States: unmonitored, on_cpu, off_cpu.
> +       Key transitions:
> +         unmonitored --(trace_start)--> on_cpu
> +         on_cpu --(switch_out)--> off_cpu
> +         off_cpu --(switch_in)--> on_cpu
> +         on_cpu --(trace_stop)--> unmonitored
> +         off_cpu --(trace_stop)--> unmonitored
> +         on_cpu --(budget_expired)--> unmonitored
> +         off_cpu --(budget_expired)--> unmonitored
> +
> +       External configuration is done via the tracefs "monitor" file:
> +         echo threshold_us:offset_start:offset_stop:binary_path > .../rv/monitors/tlob/monitor
> +         echo -offset_start:binary_path > .../rv/monitors/tlob/monitor  (remove binding)
> +         cat .../rv/monitors/tlob/monitor                               (list bindings)
> +
> +       The uprobe binding places two plain entry uprobes at offset_start and
> +       offset_stop in the binary; these trigger tlob_start_task() and
> +       tlob_stop_task() respectively. Using two entry uprobes (rather than a
> +       uretprobe) means that a mistyped offset can never corrupt the call
> +       stack; the worst outcome is a missed stop, which causes the hrtimer to
> +       fire and report a budget violation.
> +
> +       Violation events are delivered via a lock-free mmap ring buffer on
> +       /dev/rv (enabled by CONFIG_RV_CHARDEV). The consumer mmap()s the
> +       device, reads records from the data array using the head/tail indices
> +       in the control page, and advances data_tail when done.
> +
> +       For self-instrumentation, use TLOB_IOCTL_TRACE_START /
> +       TLOB_IOCTL_TRACE_STOP via the /dev/rv misc device (enabled by
> +       CONFIG_RV_CHARDEV).
> +
> +       Up to TLOB_MAX_MONITORED tasks may be monitored simultaneously.
> + > + For further information, see: > + Documentation/trace/rv/monitor_tlob.rst > + > diff --git a/kernel/trace/rv/monitors/tlob/tlob.c > b/kernel/trace/rv/monitors/tlob/tlob.c > new file mode 100644 > index 000000000..a6e474025 > --- /dev/null > +++ b/kernel/trace/rv/monitors/tlob/tlob.c > @@ -0,0 +1,986 @@ > +// SPDX-License-Identifier: GPL-2.0 > +/* > + * tlob: task latency over budget monitor > + * > + * Track the elapsed wall-clock time of a marked code path and detect when > + * a monitored task exceeds its per-task latency budget. CLOCK_MONOTONIC > + * is used so both on-CPU and off-CPU time count toward the budget. > + * > + * Per-task state is maintained in a spinlock-protected hash table. A > + * one-shot hrtimer fires at the deadline; if the task has not called > + * trace_stop by then, a violation is recorded. > + * > + * Up to TLOB_MAX_MONITORED tasks may be tracked simultaneously. > + * > + * Copyright (C) 2026 Wen Yang <[email protected]> > + */ > +#include <linux/file.h> > +#include <linux/fs.h> > +#include <linux/ftrace.h> > +#include <linux/hash.h> > +#include <linux/hrtimer.h> > +#include <linux/kernel.h> > +#include <linux/ktime.h> > +#include <linux/module.h> > +#include <linux/init.h> > +#include <linux/namei.h> > +#include <linux/poll.h> > +#include <linux/rv.h> > +#include <linux/sched.h> > +#include <linux/slab.h> > +#include <linux/atomic.h> > +#include <linux/rcupdate.h> > +#include <linux/spinlock.h> > +#include <linux/tracefs.h> > +#include <linux/uaccess.h> > +#include <linux/uprobes.h> > +#include <kunit/visibility.h> > +#include <rv/instrumentation.h> > + > +/* rv_interface_lock is defined in kernel/trace/rv/rv.c */ > +extern struct mutex rv_interface_lock; > + > +#define MODULE_NAME "tlob" > + > +#include <rv_trace.h> > +#include <trace/events/sched.h> > + > +#define RV_MON_TYPE RV_MON_PER_TASK > +#include "tlob.h" > +#include <rv/da_monitor.h> > + > +/* Hash table size; must be a power of two. 
*/ > +#define TLOB_HTABLE_BITS 6 > +#define TLOB_HTABLE_SIZE (1 << TLOB_HTABLE_BITS) > + > +/* Maximum binary path length for uprobe binding. */ > +#define TLOB_MAX_PATH 256 > + > +/* Per-task latency monitoring state. */ > +struct tlob_task_state { > + struct hlist_node hlist; > + struct task_struct *task; > + u64 threshold_us; > + u64 tag; > + struct hrtimer deadline_timer; > + int canceled; /* protected by entry_lock */ > + struct file *notify_file; /* NULL or held reference */ > + > + /* > + * entry_lock serialises the mutable accounting fields below. > + * Lock order: tlob_table_lock -> entry_lock (never reverse). > + */ > + raw_spinlock_t entry_lock; > + u64 on_cpu_us; > + u64 off_cpu_us; > + ktime_t last_ts; > + u32 switches; > + u8 da_state; > + > + struct rcu_head rcu; /* for call_rcu() teardown */ > +}; > + > +/* Per-uprobe-binding state: a start + stop probe pair for one binary region. > */ > +struct tlob_uprobe_binding { > + struct list_head list; > + u64 threshold_us; > + struct path path; > + char binpath[TLOB_MAX_PATH]; /* canonical > path for read/remove */ > + loff_t offset_start; > + loff_t offset_stop; > + struct uprobe_consumer entry_uc; > + struct uprobe_consumer stop_uc; > + struct uprobe *entry_uprobe; > + struct uprobe *stop_uprobe; > +}; > + > +/* Object pool for tlob_task_state. */ > +static struct kmem_cache *tlob_state_cache; > + > +/* Hash table and lock protecting table structure (insert/delete/canceled). > */ > +static struct hlist_head tlob_htable[TLOB_HTABLE_SIZE]; > +static DEFINE_RAW_SPINLOCK(tlob_table_lock); > +static atomic_t tlob_num_monitored = ATOMIC_INIT(0); > + > +/* Uprobe binding list; protected by tlob_uprobe_mutex. 
*/ > +static LIST_HEAD(tlob_uprobe_list); > +static DEFINE_MUTEX(tlob_uprobe_mutex); > + > +/* Forward declaration */ > +static enum hrtimer_restart tlob_deadline_timer_fn(struct hrtimer *timer); > + > +/* Hash table helpers */ > + > +static unsigned int tlob_hash_task(const struct task_struct *task) > +{ > + return hash_ptr((void *)task, TLOB_HTABLE_BITS); > +} > + > +/* > + * tlob_find_rcu - look up per-task state. > + * Must be called under rcu_read_lock() or with tlob_table_lock held. > + */ > +static struct tlob_task_state *tlob_find_rcu(struct task_struct *task) > +{ > + struct tlob_task_state *ws; > + unsigned int h = tlob_hash_task(task); > + > + hlist_for_each_entry_rcu(ws, &tlob_htable[h], hlist, > + lockdep_is_held(&tlob_table_lock)) > + if (ws->task == task) > + return ws; > + return NULL; > +} > + > +/* Allocate and initialise a new per-task state entry. */ > +static struct tlob_task_state *tlob_alloc(struct task_struct *task, > + u64 threshold_us, u64 tag) > +{ > + struct tlob_task_state *ws; > + > + ws = kmem_cache_zalloc(tlob_state_cache, GFP_ATOMIC); > + if (!ws) > + return NULL; > + > + ws->task = task; > + get_task_struct(task); > + ws->threshold_us = threshold_us; > + ws->tag = tag; > + ws->last_ts = ktime_get(); > + ws->da_state = on_cpu_tlob; > + raw_spin_lock_init(&ws->entry_lock); > + hrtimer_setup(&ws->deadline_timer, tlob_deadline_timer_fn, > + CLOCK_MONOTONIC, HRTIMER_MODE_REL); > + return ws; > +} > + > +/* RCU callback: free the slab once no readers remain. */ > +static void tlob_free_rcu_slab(struct rcu_head *head) > +{ > + struct tlob_task_state *ws = > + container_of(head, struct tlob_task_state, rcu); > + kmem_cache_free(tlob_state_cache, ws); > +} > + > +/* Arm the one-shot deadline timer for threshold_us microseconds. 
 */
> +static void tlob_arm_deadline(struct tlob_task_state *ws)
> +{
> +	hrtimer_start(&ws->deadline_timer,
> +		      ns_to_ktime(ws->threshold_us * NSEC_PER_USEC),
> +		      HRTIMER_MODE_REL);
> +}
> +
> +/*
> + * Push a violation record into a monitor fd's ring buffer (hrtimer, i.e.
> + * hard interrupt context). Drop-new policy: discard the incoming record
> + * when full. smp_store_release() on data_head pairs with
> + * smp_load_acquire() in the consumer.
> + */
> +static void tlob_event_push(struct rv_file_priv *priv,
> +			    const struct tlob_event *info)
> +{
> +	struct tlob_ring *ring = &priv->ring;
> +	unsigned long flags;
> +	u32 head, tail;
> +
> +	spin_lock_irqsave(&ring->lock, flags);
> +
> +	head = ring->page->data_head;
> +	tail = READ_ONCE(ring->page->data_tail);
> +
> +	if (head - tail > ring->mask) {
> +		/* Ring full: drop incoming record. */
> +		ring->page->dropped++;
> +		spin_unlock_irqrestore(&ring->lock, flags);
> +		return;
> +	}
> +
> +	ring->data[head & ring->mask] = *info;
> +	/* pairs with smp_load_acquire() in the consumer */
> +	smp_store_release(&ring->page->data_head, head + 1);
> +
> +	spin_unlock_irqrestore(&ring->lock, flags);
> +
> +	wake_up_interruptible_poll(&priv->waitq, EPOLLIN | EPOLLRDNORM);
> +}
> +
> +#if IS_ENABLED(CONFIG_KUNIT)
> +void tlob_event_push_kunit(struct rv_file_priv *priv,
> +			   const struct tlob_event *info)
> +{
> +	tlob_event_push(priv, info);
> +}
> +EXPORT_SYMBOL_IF_KUNIT(tlob_event_push_kunit);
> +#endif /* CONFIG_KUNIT */
> +
> +/*
> + * Budget exceeded: remove the entry, record the violation, and inject
> + * budget_expired into the DA.
> + *
> + * Lock order: tlob_table_lock -> entry_lock. tlob_stop_task() sets
> + * ws->canceled under both locks; if we see it here the stop path owns
> + * cleanup. fput/put_task_struct are done before call_rcu(); the RCU
> + * callback only reclaims the slab.
> + */ > +static enum hrtimer_restart tlob_deadline_timer_fn(struct hrtimer *timer) > +{ > + struct tlob_task_state *ws = > + container_of(timer, struct tlob_task_state, deadline_timer); > + struct tlob_event info = {}; > + struct file *notify_file; > + struct task_struct *task; > + unsigned long flags; > + /* snapshots taken under entry_lock */ > + u64 on_cpu_us, off_cpu_us, threshold_us, tag; > + u32 switches; > + bool on_cpu; > + bool push_event = false; > + > + raw_spin_lock_irqsave(&tlob_table_lock, flags); > + /* stop path sets canceled under both locks; if set it owns cleanup > */ > + if (ws->canceled) { > + raw_spin_unlock_irqrestore(&tlob_table_lock, flags); > + return HRTIMER_NORESTART; > + } > + > + /* Finalize accounting and snapshot all fields under entry_lock. */ > + raw_spin_lock(&ws->entry_lock); > + > + { > + ktime_t now = ktime_get(); > + u64 delta_us = ktime_to_us(ktime_sub(now, ws->last_ts)); > + > + if (ws->da_state == on_cpu_tlob) > + ws->on_cpu_us += delta_us; > + else > + ws->off_cpu_us += delta_us; > + } > + > + ws->canceled = 1; > + on_cpu_us = ws->on_cpu_us; > + off_cpu_us = ws->off_cpu_us; > + threshold_us = ws->threshold_us; > + tag = ws->tag; > + switches = ws->switches; > + on_cpu = (ws->da_state == on_cpu_tlob); > + notify_file = ws->notify_file; > + if (notify_file) { > + info.tid = task_pid_vnr(ws->task); > + info.threshold_us = threshold_us; > + info.on_cpu_us = on_cpu_us; > + info.off_cpu_us = off_cpu_us; > + info.switches = switches; > + info.state = on_cpu ? 1 : 0; > + info.tag = tag; > + push_event = true; > + } > + > + raw_spin_unlock(&ws->entry_lock); > + > + hlist_del_rcu(&ws->hlist); > + atomic_dec(&tlob_num_monitored); > + /* > + * Hold a reference so task remains valid across da_handle_event() > + * after we drop tlob_table_lock. 
> + */ > + task = ws->task; > + get_task_struct(task); > + raw_spin_unlock_irqrestore(&tlob_table_lock, flags); > + > + /* > + * Both locks are now released; ws is exclusively owned (removed from > + * the hash table with canceled=1). Emit the tracepoint and push the > + * violation record. > + */ > + trace_tlob_budget_exceeded(ws->task, threshold_us, on_cpu_us, > + off_cpu_us, switches, on_cpu, tag); > + > + if (push_event) { > + struct rv_file_priv *priv = notify_file->private_data; > + > + if (priv) > + tlob_event_push(priv, &info); > + } > + > + da_handle_event(task, budget_expired_tlob); > + > + if (notify_file) > + fput(notify_file); /* ref from fget() at > TRACE_START */ > + put_task_struct(ws->task); /* ref from tlob_alloc() */ > + put_task_struct(task); /* extra ref from > get_task_struct() above */ > + call_rcu(&ws->rcu, tlob_free_rcu_slab); > + return HRTIMER_NORESTART; > +} > + > +/* Tracepoint handlers */ > + > +/* > + * handle_sched_switch - advance the DA and accumulate on/off-CPU time. > + * > + * RCU read-side for lock-free lookup; entry_lock for per-task accounting. > + * da_handle_event() is called after rcu_read_unlock() to avoid holding the > + * read-side critical section across the RV framework. 
> + */ > +static void handle_sched_switch(void *data, bool preempt, > + struct task_struct *prev, > + struct task_struct *next, > + unsigned int prev_state) > +{ > + struct tlob_task_state *ws; > + unsigned long flags; > + bool do_prev = false, do_next = false; > + ktime_t now; > + > + rcu_read_lock(); > + > + ws = tlob_find_rcu(prev); > + if (ws) { > + raw_spin_lock_irqsave(&ws->entry_lock, flags); > + if (!ws->canceled) { > + now = ktime_get(); > + ws->on_cpu_us += ktime_to_us(ktime_sub(now, ws- > >last_ts)); > + ws->last_ts = now; > + ws->switches++; > + ws->da_state = off_cpu_tlob; > + do_prev = true; > + } > + raw_spin_unlock_irqrestore(&ws->entry_lock, flags); > + } > + > + ws = tlob_find_rcu(next); > + if (ws) { > + raw_spin_lock_irqsave(&ws->entry_lock, flags); > + if (!ws->canceled) { > + now = ktime_get(); > + ws->off_cpu_us += ktime_to_us(ktime_sub(now, ws- > >last_ts)); > + ws->last_ts = now; > + ws->da_state = on_cpu_tlob; > + do_next = true; > + } > + raw_spin_unlock_irqrestore(&ws->entry_lock, flags); > + } > + > + rcu_read_unlock(); > + > + if (do_prev) > + da_handle_event(prev, switch_out_tlob); > + if (do_next) > + da_handle_event(next, switch_in_tlob); > +} > + > +static void handle_sched_wakeup(void *data, struct task_struct *p) > +{ > + struct tlob_task_state *ws; > + unsigned long flags; > + bool found = false; > + > + rcu_read_lock(); > + ws = tlob_find_rcu(p); > + if (ws) { > + raw_spin_lock_irqsave(&ws->entry_lock, flags); > + found = !ws->canceled; > + raw_spin_unlock_irqrestore(&ws->entry_lock, flags); > + } > + rcu_read_unlock(); > + > + if (found) > + da_handle_event(p, sched_wakeup_tlob); > +} > + > +/* ----------------------------------------------------------------------- > + * Core start/stop helpers (also called from rv_dev.c) > + * ----------------------------------------------------------------------- > + */ > + > +/* > + * __tlob_insert - insert @ws into the hash table and arm its deadline timer. 
> + * > + * Re-checks for duplicates and capacity under tlob_table_lock; the caller > + * may have done a lock-free pre-check before allocating @ws. On failure @ws > + * is freed directly (never in table, so no call_rcu needed). > + */ > +static int __tlob_insert(struct task_struct *task, struct tlob_task_state > *ws) > +{ > + unsigned int h; > + unsigned long flags; > + > + raw_spin_lock_irqsave(&tlob_table_lock, flags); > + if (tlob_find_rcu(task)) { > + raw_spin_unlock_irqrestore(&tlob_table_lock, flags); > + if (ws->notify_file) > + fput(ws->notify_file); > + put_task_struct(ws->task); > + kmem_cache_free(tlob_state_cache, ws); > + return -EEXIST; > + } > + if (atomic_read(&tlob_num_monitored) >= TLOB_MAX_MONITORED) { > + raw_spin_unlock_irqrestore(&tlob_table_lock, flags); > + if (ws->notify_file) > + fput(ws->notify_file); > + put_task_struct(ws->task); > + kmem_cache_free(tlob_state_cache, ws); > + return -ENOSPC; > + } > + h = tlob_hash_task(task); > + hlist_add_head_rcu(&ws->hlist, &tlob_htable[h]); > + atomic_inc(&tlob_num_monitored); > + raw_spin_unlock_irqrestore(&tlob_table_lock, flags); > + > + da_handle_start_run_event(task, trace_start_tlob); > + tlob_arm_deadline(ws); > + return 0; > +} > + > +/** > + * tlob_start_task - begin monitoring @task with latency budget > @threshold_us. > + * > + * @notify_file: /dev/rv fd whose ring buffer receives a tlob_event on > + * violation; caller transfers the fget() reference to tlob.c. > + * Pass NULL for synchronous mode (violations only via > + * TRACE_STOP return value and the tlob_budget_exceeded event). > + * > + * Returns 0, -ENODEV, -EEXIST, -ENOSPC, or -ENOMEM. On failure the caller > + * retains responsibility for any @notify_file reference. 
> + */ > +int tlob_start_task(struct task_struct *task, u64 threshold_us, > + struct file *notify_file, u64 tag) > +{ > + struct tlob_task_state *ws; > + unsigned long flags; > + > + if (!tlob_state_cache) > + return -ENODEV; > + > + if (threshold_us > (u64)KTIME_MAX / NSEC_PER_USEC) > + return -ERANGE; > + > + /* Quick pre-check before allocation. */ > + raw_spin_lock_irqsave(&tlob_table_lock, flags); > + if (tlob_find_rcu(task)) { > + raw_spin_unlock_irqrestore(&tlob_table_lock, flags); > + return -EEXIST; > + } > + if (atomic_read(&tlob_num_monitored) >= TLOB_MAX_MONITORED) { > + raw_spin_unlock_irqrestore(&tlob_table_lock, flags); > + return -ENOSPC; > + } > + raw_spin_unlock_irqrestore(&tlob_table_lock, flags); > + > + ws = tlob_alloc(task, threshold_us, tag); > + if (!ws) > + return -ENOMEM; > + > + ws->notify_file = notify_file; > + return __tlob_insert(task, ws); > +} > +EXPORT_SYMBOL_GPL(tlob_start_task); > + > +/** > + * tlob_stop_task - stop monitoring @task before the deadline fires. > + * > + * Sets canceled under entry_lock (inside tlob_table_lock) before calling > + * hrtimer_cancel(), racing safely with the timer callback. > + * > + * Returns 0 if within budget, -ESRCH if the entry is gone (deadline already > + * fired, or TRACE_START was never called). > + */ > +int tlob_stop_task(struct task_struct *task) > +{ > + struct tlob_task_state *ws; > + struct file *notify_file; > + unsigned long flags; > + > + raw_spin_lock_irqsave(&tlob_table_lock, flags); > + ws = tlob_find_rcu(task); > + if (!ws) { > + raw_spin_unlock_irqrestore(&tlob_table_lock, flags); > + return -ESRCH; > + } > + > + /* Prevent handle_sched_switch from updating accounting after > removal. 
 */
> +	raw_spin_lock(&ws->entry_lock);
> +	ws->canceled = 1;
> +	raw_spin_unlock(&ws->entry_lock);
> +
> +	hlist_del_rcu(&ws->hlist);
> +	atomic_dec(&tlob_num_monitored);
> +	raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
> +
> +	hrtimer_cancel(&ws->deadline_timer);
> +
> +	da_handle_event(task, trace_stop_tlob);
> +
> +	notify_file = ws->notify_file;
> +	if (notify_file)
> +		fput(notify_file);
> +	put_task_struct(ws->task);
> +	call_rcu(&ws->rcu, tlob_free_rcu_slab);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(tlob_stop_task);
> +
> +/* Stop monitoring all tracked tasks; called on monitor disable. */
> +static void tlob_stop_all(void)
> +{
> +	struct tlob_task_state *batch[TLOB_MAX_MONITORED];
> +	struct tlob_task_state *ws;
> +	struct hlist_node *tmp;
> +	unsigned long flags;
> +	int n = 0, i;
> +
> +	raw_spin_lock_irqsave(&tlob_table_lock, flags);
> +	for (i = 0; i < TLOB_HTABLE_SIZE; i++) {
> +		hlist_for_each_entry_safe(ws, tmp, &tlob_htable[i], hlist) {
> +			raw_spin_lock(&ws->entry_lock);
> +			ws->canceled = 1;
> +			raw_spin_unlock(&ws->entry_lock);
> +			hlist_del_rcu(&ws->hlist);
> +			atomic_dec(&tlob_num_monitored);
> +			if (n < TLOB_MAX_MONITORED)
> +				batch[n++] = ws;
> +		}
> +	}
> +	raw_spin_unlock_irqrestore(&tlob_table_lock, flags);
> +
> +	for (i = 0; i < n; i++) {
> +		ws = batch[i];
> +		hrtimer_cancel(&ws->deadline_timer);
> +		da_handle_event(ws->task, trace_stop_tlob);
> +		if (ws->notify_file)
> +			fput(ws->notify_file);
> +		put_task_struct(ws->task);
> +		call_rcu(&ws->rcu, tlob_free_rcu_slab);
> +	}
> +}
> +
> +/* uprobe binding helpers */
> +
> +static int tlob_uprobe_entry_handler(struct uprobe_consumer *uc,
> +				     struct pt_regs *regs, __u64 *data)
> +{
> +	struct tlob_uprobe_binding *b =
> +		container_of(uc, struct tlob_uprobe_binding, entry_uc);
> +
> +	tlob_start_task(current, b->threshold_us, NULL, (u64)b->offset_start);
> +	return 0;
> +}
> +
> +static int tlob_uprobe_stop_handler(struct uprobe_consumer *uc,
> +				    struct pt_regs *regs, __u64
*data) > +{ > + tlob_stop_task(current); > + return 0; > +} > + > +/* > + * Register start + stop entry uprobes for a binding. > + * Both are plain entry uprobes (no uretprobe), so a wrong offset never > + * corrupts the call stack; the worst outcome is a missed stop (hrtimer > + * fires and reports a budget violation). > + * Called with tlob_uprobe_mutex held. > + */ > +static int tlob_add_uprobe(u64 threshold_us, const char *binpath, > + loff_t offset_start, loff_t offset_stop) > +{ > + struct tlob_uprobe_binding *b, *tmp_b; > + char pathbuf[TLOB_MAX_PATH]; > + struct inode *inode; > + char *canon; > + int ret; > + > + b = kzalloc(sizeof(*b), GFP_KERNEL); > + if (!b) > + return -ENOMEM; > + > + if (binpath[0] != '/') { > + kfree(b); > + return -EINVAL; > + } > + > + b->threshold_us = threshold_us; > + b->offset_start = offset_start; > + b->offset_stop = offset_stop; > + > + ret = kern_path(binpath, LOOKUP_FOLLOW, &b->path); > + if (ret) > + goto err_free; > + > + if (!d_is_reg(b->path.dentry)) { > + ret = -EINVAL; > + goto err_path; > + } > + > + /* Reject duplicate start offset for the same binary. */ > + list_for_each_entry(tmp_b, &tlob_uprobe_list, list) { > + if (tmp_b->offset_start == offset_start && > + tmp_b->path.dentry == b->path.dentry) { > + ret = -EEXIST; > + goto err_path; > + } > + } > + > + /* Store canonical path for read-back and removal matching. 
 */
> +	canon = d_path(&b->path, pathbuf, sizeof(pathbuf));
> +	if (IS_ERR(canon)) {
> +		ret = PTR_ERR(canon);
> +		goto err_path;
> +	}
> +	strscpy(b->binpath, canon, sizeof(b->binpath));
> +
> +	b->entry_uc.handler = tlob_uprobe_entry_handler;
> +	b->stop_uc.handler = tlob_uprobe_stop_handler;
> +
> +	inode = d_real_inode(b->path.dentry);
> +
> +	b->entry_uprobe = uprobe_register(inode, offset_start, 0, &b->entry_uc);
> +	if (IS_ERR(b->entry_uprobe)) {
> +		ret = PTR_ERR(b->entry_uprobe);
> +		b->entry_uprobe = NULL;
> +		goto err_path;
> +	}
> +
> +	b->stop_uprobe = uprobe_register(inode, offset_stop, 0, &b->stop_uc);
> +	if (IS_ERR(b->stop_uprobe)) {
> +		ret = PTR_ERR(b->stop_uprobe);
> +		b->stop_uprobe = NULL;
> +		goto err_entry;
> +	}
> +
> +	list_add_tail(&b->list, &tlob_uprobe_list);
> +	return 0;
> +
> +err_entry:
> +	uprobe_unregister_nosync(b->entry_uprobe, &b->entry_uc);
> +	uprobe_unregister_sync();
> +err_path:
> +	path_put(&b->path);
> +err_free:
> +	kfree(b);
> +	return ret;
> +}
> +
> +/*
> + * Remove the uprobe binding for (offset_start, binpath).
> + * binpath is resolved to a dentry for comparison so symlinks are handled
> + * correctly. Called with tlob_uprobe_mutex held.
> + */
> +static void tlob_remove_uprobe_by_key(loff_t offset_start, const char *binpath)
> +{
> +	struct tlob_uprobe_binding *b, *tmp;
> +	struct path remove_path;
> +
> +	if (kern_path(binpath, LOOKUP_FOLLOW, &remove_path))
> +		return;
> +
> +	list_for_each_entry_safe(b, tmp, &tlob_uprobe_list, list) {
> +		if (b->offset_start != offset_start)
> +			continue;
> +		if (b->path.dentry != remove_path.dentry)
> +			continue;
> +		uprobe_unregister_nosync(b->entry_uprobe, &b->entry_uc);
> +		uprobe_unregister_nosync(b->stop_uprobe, &b->stop_uc);
> +		list_del(&b->list);
> +		uprobe_unregister_sync();
> +		path_put(&b->path);
> +		kfree(b);
> +		break;
> +	}
> +
> +	path_put(&remove_path);
> +}
> +
> +/* Unregister all uprobe bindings; called from disable_tlob().
*/ > +static void tlob_remove_all_uprobes(void) > +{ > + struct tlob_uprobe_binding *b, *tmp; > + > + mutex_lock(&tlob_uprobe_mutex); > + list_for_each_entry_safe(b, tmp, &tlob_uprobe_list, list) { > + uprobe_unregister_nosync(b->entry_uprobe, &b->entry_uc); > + uprobe_unregister_nosync(b->stop_uprobe, &b->stop_uc); > + list_del(&b->list); > + path_put(&b->path); > + kfree(b); > + } > + mutex_unlock(&tlob_uprobe_mutex); > + uprobe_unregister_sync(); > +} > + > +/* > + * tracefs "monitor" file > + * > + * Read: one "threshold_us:0xoffset_start:0xoffset_stop:binary_path\n" > + * line per registered uprobe binding. > + * Write: "threshold_us:offset_start:offset_stop:binary_path" - add uprobe > binding > + * "-offset_start:binary_path" - remove uprobe > binding > + */ > + > +static ssize_t tlob_monitor_read(struct file *file, > + char __user *ubuf, > + size_t count, loff_t *ppos) > +{ > + /* pid(10) + threshold(20) + 2 offsets(2*18) + path(256) + delimiters > */ > + const int line_sz = TLOB_MAX_PATH + 72; > + struct tlob_uprobe_binding *b; > + char *buf, *p; > + int n = 0, buf_sz, pos = 0; > + ssize_t ret; > + > + mutex_lock(&tlob_uprobe_mutex); > + list_for_each_entry(b, &tlob_uprobe_list, list) > + n++; > + mutex_unlock(&tlob_uprobe_mutex); > + > + buf_sz = (n ? n : 1) * line_sz + 1; > + buf = kmalloc(buf_sz, GFP_KERNEL); > + if (!buf) > + return -ENOMEM; > + > + mutex_lock(&tlob_uprobe_mutex); > + list_for_each_entry(b, &tlob_uprobe_list, list) { > + p = b->binpath; > + pos += scnprintf(buf + pos, buf_sz - pos, > + "%llu:0x%llx:0x%llx:%s\n", > + b->threshold_us, > + (unsigned long long)b->offset_start, > + (unsigned long long)b->offset_stop, > + p); > + } > + mutex_unlock(&tlob_uprobe_mutex); > + > + ret = simple_read_from_buffer(ubuf, count, ppos, buf, pos); > + kfree(buf); > + return ret; > +} > + > +/* > + * Parse "threshold_us:offset_start:offset_stop:binary_path". > + * binary_path comes last so it may freely contain ':'. > + * Returns 0 on success. 
> + */ > +VISIBLE_IF_KUNIT int tlob_parse_uprobe_line(char *buf, u64 *thr_out, > + char **path_out, > + loff_t *start_out, loff_t > *stop_out) > +{ > + unsigned long long thr; > + long long start, stop; > + int n = 0; > + > + /* > + * %llu : decimal-only (microseconds) > + * %lli : auto-base, accepts 0x-prefixed hex for offsets > + * %n : records the byte offset of the first path character > + */ > + if (sscanf(buf, "%llu:%lli:%lli:%n", &thr, &start, &stop, &n) != 3) > + return -EINVAL; > + if (thr == 0 || n == 0 || buf[n] == '\0') > + return -EINVAL; > + if (start < 0 || stop < 0) > + return -EINVAL; > + > + *thr_out = thr; > + *start_out = start; > + *stop_out = stop; > + *path_out = buf + n; > + return 0; > +} > + > +static ssize_t tlob_monitor_write(struct file *file, > + const char __user *ubuf, > + size_t count, loff_t *ppos) > +{ > + char buf[TLOB_MAX_PATH + 64]; > + loff_t offset_start, offset_stop; > + u64 threshold_us; > + char *binpath; > + int ret; > + > + if (count >= sizeof(buf)) > + return -EINVAL; > + if (copy_from_user(buf, ubuf, count)) > + return -EFAULT; > + buf[count] = '\0'; > + > + if (count > 0 && buf[count - 1] == '\n') > + buf[count - 1] = '\0'; > + > + /* Remove request: "-offset_start:binary_path" */ > + if (buf[0] == '-') { > + long long off; > + int n = 0; > + > + if (sscanf(buf + 1, "%lli:%n", &off, &n) != 1 || n == 0) > + return -EINVAL; > + binpath = buf + 1 + n; > + if (binpath[0] != '/') > + return -EINVAL; > + > + mutex_lock(&tlob_uprobe_mutex); > + tlob_remove_uprobe_by_key((loff_t)off, binpath); > + mutex_unlock(&tlob_uprobe_mutex); > + > + return (ssize_t)count; > + } > + > + /* > + * Uprobe binding: > "threshold_us:offset_start:offset_stop:binary_path" > + * binpath points into buf at the start of the path field. 
> + */ > + ret = tlob_parse_uprobe_line(buf, &threshold_us, > + &binpath, &offset_start, &offset_stop); > + if (ret) > + return ret; > + > + mutex_lock(&tlob_uprobe_mutex); > + ret = tlob_add_uprobe(threshold_us, binpath, offset_start, > offset_stop); > + mutex_unlock(&tlob_uprobe_mutex); > + return ret ? ret : (ssize_t)count; > +} > + > +static const struct file_operations tlob_monitor_fops = { > + .open = simple_open, > + .read = tlob_monitor_read, > + .write = tlob_monitor_write, > + .llseek = noop_llseek, > +}; > + > +/* > + * __tlob_init_monitor / __tlob_destroy_monitor - called with > rv_interface_lock > + * held (required by da_monitor_init/destroy via > rv_get/put_task_monitor_slot). > + */ > +static int __tlob_init_monitor(void) > +{ > + int i, retval; > + > + tlob_state_cache = kmem_cache_create("tlob_task_state", > + sizeof(struct tlob_task_state), > + 0, 0, NULL); > + if (!tlob_state_cache) > + return -ENOMEM; > + > + for (i = 0; i < TLOB_HTABLE_SIZE; i++) > + INIT_HLIST_HEAD(&tlob_htable[i]); > + atomic_set(&tlob_num_monitored, 0); > + > + retval = da_monitor_init(); > + if (retval) { > + kmem_cache_destroy(tlob_state_cache); > + tlob_state_cache = NULL; > + return retval; > + } > + > + rv_this.enabled = 1; > + return 0; > +} > + > +static void __tlob_destroy_monitor(void) > +{ > + rv_this.enabled = 0; > + tlob_stop_all(); > + tlob_remove_all_uprobes(); > + /* > + * Drain pending call_rcu() callbacks from tlob_stop_all() before > + * destroying the kmem_cache. > + */ > + synchronize_rcu(); > + da_monitor_destroy(); > + kmem_cache_destroy(tlob_state_cache); > + tlob_state_cache = NULL; > +} > + > +/* > + * tlob_init_monitor / tlob_destroy_monitor - KUnit wrappers that acquire > + * rv_interface_lock, satisfying the lockdep_assert_held() inside > + * rv_get/put_task_monitor_slot(). 
> + */ > +VISIBLE_IF_KUNIT int tlob_init_monitor(void) > +{ > + int ret; > + > + mutex_lock(&rv_interface_lock); > + ret = __tlob_init_monitor(); > + mutex_unlock(&rv_interface_lock); > + return ret; > +} > +EXPORT_SYMBOL_IF_KUNIT(tlob_init_monitor); > + > +VISIBLE_IF_KUNIT void tlob_destroy_monitor(void) > +{ > + mutex_lock(&rv_interface_lock); > + __tlob_destroy_monitor(); > + mutex_unlock(&rv_interface_lock); > +} > +EXPORT_SYMBOL_IF_KUNIT(tlob_destroy_monitor); > + > +VISIBLE_IF_KUNIT int tlob_enable_hooks(void) > +{ > + rv_attach_trace_probe("tlob", sched_switch, handle_sched_switch); > + rv_attach_trace_probe("tlob", sched_wakeup, handle_sched_wakeup); > + return 0; > +} > +EXPORT_SYMBOL_IF_KUNIT(tlob_enable_hooks); > + > +VISIBLE_IF_KUNIT void tlob_disable_hooks(void) > +{ > + rv_detach_trace_probe("tlob", sched_switch, handle_sched_switch); > + rv_detach_trace_probe("tlob", sched_wakeup, handle_sched_wakeup); > +} > +EXPORT_SYMBOL_IF_KUNIT(tlob_disable_hooks); > + > +/* > + * enable_tlob / disable_tlob - called by rv_enable/disable_monitor() which > + * already holds rv_interface_lock; call the __ variants directly. 
> + */ > +static int enable_tlob(void) > +{ > + int retval; > + > + retval = __tlob_init_monitor(); > + if (retval) > + return retval; > + > + return tlob_enable_hooks(); > +} > + > +static void disable_tlob(void) > +{ > + tlob_disable_hooks(); > + __tlob_destroy_monitor(); > +} > + > +static struct rv_monitor rv_this = { > + .name = "tlob", > + .description = "Per-task latency-over-budget monitor.", > + .enable = enable_tlob, > + .disable = disable_tlob, > + .reset = da_monitor_reset_all, > + .enabled = 0, > +}; > + > +static int __init register_tlob(void) > +{ > + int ret; > + > + ret = rv_register_monitor(&rv_this, NULL); > + if (ret) > + return ret; > + > + if (rv_this.root_d) { > + tracefs_create_file("monitor", 0644, rv_this.root_d, NULL, > + &tlob_monitor_fops); > + } > + > + return 0; > +} > + > +static void __exit unregister_tlob(void) > +{ > + rv_unregister_monitor(&rv_this); > +} > + > +module_init(register_tlob); > +module_exit(unregister_tlob); > + > +MODULE_LICENSE("GPL"); > +MODULE_AUTHOR("Wen Yang <[email protected]>"); > +MODULE_DESCRIPTION("tlob: task latency over budget per-task monitor."); > diff --git a/kernel/trace/rv/monitors/tlob/tlob.h > b/kernel/trace/rv/monitors/tlob/tlob.h > new file mode 100644 > index 000000000..3438a6175 > --- /dev/null > +++ b/kernel/trace/rv/monitors/tlob/tlob.h > @@ -0,0 +1,145 @@ > +/* SPDX-License-Identifier: GPL-2.0 */ > +#ifndef _RV_TLOB_H > +#define _RV_TLOB_H > + > +/* > + * C representation of the tlob automaton, generated from tlob.dot via rvgen > + * and extended with tlob_start_task()/tlob_stop_task() declarations. 
> + * For the format description see > Documentation/trace/rv/deterministic_automata.rst > + */ > + > +#include <linux/rv.h> > +#include <uapi/linux/rv.h> > + > +#define MONITOR_NAME tlob > + > +enum states_tlob { > + unmonitored_tlob, > + on_cpu_tlob, > + off_cpu_tlob, > + state_max_tlob, > +}; > + > +#define INVALID_STATE state_max_tlob > + > +enum events_tlob { > + trace_start_tlob, > + switch_in_tlob, > + switch_out_tlob, > + sched_wakeup_tlob, > + trace_stop_tlob, > + budget_expired_tlob, > + event_max_tlob, > +}; > + > +struct automaton_tlob { > + char *state_names[state_max_tlob]; > + char *event_names[event_max_tlob]; > + unsigned char function[state_max_tlob][event_max_tlob]; > + unsigned char initial_state; > + bool final_states[state_max_tlob]; > +}; > + > +static const struct automaton_tlob automaton_tlob = { > + .state_names = { > + "unmonitored", > + "on_cpu", > + "off_cpu", > + }, > + .event_names = { > + "trace_start", > + "switch_in", > + "switch_out", > + "sched_wakeup", > + "trace_stop", > + "budget_expired", > + }, > + .function = { > + /* unmonitored */ > + { > + on_cpu_tlob, /* trace_start */ > + unmonitored_tlob, /* switch_in */ > + unmonitored_tlob, /* switch_out */ > + unmonitored_tlob, /* sched_wakeup */ > + INVALID_STATE, /* trace_stop */ > + INVALID_STATE, /* budget_expired */ > + }, > + /* on_cpu */ > + { > + INVALID_STATE, /* trace_start */ > + INVALID_STATE, /* switch_in */ > + off_cpu_tlob, /* switch_out */ > + on_cpu_tlob, /* sched_wakeup */ > + unmonitored_tlob, /* trace_stop */ > + unmonitored_tlob, /* budget_expired */ > + }, > + /* off_cpu */ > + { > + INVALID_STATE, /* trace_start */ > + on_cpu_tlob, /* switch_in */ > + off_cpu_tlob, /* switch_out */ > + off_cpu_tlob, /* sched_wakeup */ > + unmonitored_tlob, /* trace_stop */ > + unmonitored_tlob, /* budget_expired */ > + }, > + }, > + /* > + * final_states: unmonitored is the sole accepting state. > + * Violations are recorded via ntf_push and tlob_budget_exceeded. 
> + */ > + .initial_state = unmonitored_tlob, > + .final_states = { 1, 0, 0 }, > +}; > + > +/* Exported for use by the RV ioctl layer (rv_dev.c) */ > +int tlob_start_task(struct task_struct *task, u64 threshold_us, > + struct file *notify_file, u64 tag); > +int tlob_stop_task(struct task_struct *task); > + > +/* Maximum number of concurrently monitored tasks (also used by KUnit). */ > +#define TLOB_MAX_MONITORED 64U > + > +/* > + * Ring buffer constants (also published in UAPI for mmap size calculation). > + */ > +#define TLOB_RING_DEFAULT_CAP 64U /* records allocated at open() > */ > +#define TLOB_RING_MIN_CAP 8U /* minimum accepted by mmap() */ > +#define TLOB_RING_MAX_CAP 4096U /* maximum accepted by mmap() */ > + > +/** > + * struct tlob_ring - per-fd mmap-capable violation ring buffer. > + * > + * Allocated as a contiguous page range at rv_open() time: > + * page 0: struct tlob_mmap_page (shared with userspace) > + * pages 1-N: struct tlob_event[capacity] > + */ > +struct tlob_ring { > + struct tlob_mmap_page *page; > + struct tlob_event *data; > + u32 mask; > + spinlock_t lock; > + unsigned long base; > + unsigned int order; > +}; > + > +/** > + * struct rv_file_priv - per-fd private data for /dev/rv. 
> + */ > +struct rv_file_priv { > + struct tlob_ring ring; > + wait_queue_head_t waitq; > +}; > + > +#if IS_ENABLED(CONFIG_KUNIT) > +int tlob_init_monitor(void); > +void tlob_destroy_monitor(void); > +int tlob_enable_hooks(void); > +void tlob_disable_hooks(void); > +void tlob_event_push_kunit(struct rv_file_priv *priv, > + const struct tlob_event *info); > +int tlob_parse_uprobe_line(char *buf, u64 *thr_out, > + char **path_out, > + loff_t *start_out, loff_t *stop_out); > +#endif /* CONFIG_KUNIT */ > + > +#endif /* _RV_TLOB_H */ > diff --git a/kernel/trace/rv/monitors/tlob/tlob_trace.h > b/kernel/trace/rv/monitors/tlob/tlob_trace.h > new file mode 100644 > index 000000000..b08d67776 > --- /dev/null > +++ b/kernel/trace/rv/monitors/tlob/tlob_trace.h > @@ -0,0 +1,42 @@ > +/* SPDX-License-Identifier: GPL-2.0 */ > + > +/* > + * Snippet to be included in rv_trace.h > + */ > + > +#ifdef CONFIG_RV_MON_TLOB > +/* > + * tlob uses the generic event_da_monitor_id and error_da_monitor_id event > + * classes so that both event classes are instantiated. This avoids a > + * -Werror=unused-variable warning that the compiler emits when a > + * DECLARE_EVENT_CLASS has no corresponding DEFINE_EVENT instance. > + * > + * The event_tlob tracepoint is defined here but the call-site in > + * da_handle_event() is overridden with a no-op macro below so that no > + * trace record is emitted on every scheduler context switch. Budget > + * violations are reported via the dedicated tlob_budget_exceeded event. > + * > + * error_tlob IS kept active so that invalid DA transitions (programming > + * errors) are still visible in the ftrace ring buffer for debugging. 
> + */ > +DEFINE_EVENT(event_da_monitor_id, event_tlob, > + TP_PROTO(int id, char *state, char *event, char *next_state, > + bool final_state), > + TP_ARGS(id, state, event, next_state, final_state)); > + > +DEFINE_EVENT(error_da_monitor_id, error_tlob, > + TP_PROTO(int id, char *state, char *event), > + TP_ARGS(id, state, event)); > + > +/* > + * Override the trace_event_tlob() call-site with a no-op after the > + * DEFINE_EVENT above has satisfied the event class instantiation > + * requirement. The tracepoint symbol itself exists (and can be enabled > + * via tracefs) but the automatic call from da_handle_event() is silenced > + * to avoid per-context-switch ftrace noise during normal operation. > + */ > +#undef trace_event_tlob > +#define trace_event_tlob(id, state, event, next_state, final_state) \ > + do { (void)(id); (void)(state); (void)(event); \ > + (void)(next_state); (void)(final_state); } while (0) > +#endif /* CONFIG_RV_MON_TLOB */ > diff --git a/kernel/trace/rv/rv.c b/kernel/trace/rv/rv.c > index ee4e68102..e754e76d5 100644 > --- a/kernel/trace/rv/rv.c > +++ b/kernel/trace/rv/rv.c > @@ -148,6 +148,10 @@ > #include <rv_trace.h> > #endif > > +#ifdef CONFIG_RV_MON_TLOB > +EXPORT_TRACEPOINT_SYMBOL_GPL(tlob_budget_exceeded); > +#endif > + > #include "rv.h" > > DEFINE_MUTEX(rv_interface_lock); > diff --git a/kernel/trace/rv/rv_dev.c b/kernel/trace/rv/rv_dev.c > new file mode 100644 > index 000000000..a052f3203 > --- /dev/null > +++ b/kernel/trace/rv/rv_dev.c > @@ -0,0 +1,602 @@ > +// SPDX-License-Identifier: GPL-2.0 > +/* > + * rv_dev.c - /dev/rv misc device for RV monitor self-instrumentation > + * > + * A single misc device (MISC_DYNAMIC_MINOR) serves all RV monitors. > + * ioctl numbers encode the monitor identity: > + * > + * 0x01 - 0x1F tlob (task latency over budget) > + * 0x20 - 0x3F reserved > + * > + * Each monitor exports tlob_start_task() / tlob_stop_task() which are > + * called here. The calling task is identified by current. 
> + *
> + * Magic: RV_IOC_MAGIC (0xB9), defined in include/uapi/linux/rv.h
> + *
> + * Per-fd private data (rv_file_priv)
> + * ------------------------------------
> + * Every open() of /dev/rv allocates an rv_file_priv (defined in tlob.h).
> + * When TLOB_IOCTL_TRACE_START is called with args.notify_fd >= 0, violations
> + * are pushed as tlob_event records into that fd's per-fd ring buffer (tlob_ring)
> + * and its poll/epoll waitqueue is woken.
> + *
> + * Consumers drain records with read() on the notify_fd; read() blocks until
> + * at least one record is available (unless O_NONBLOCK is set).
> + *
> + * Per-thread "started" tracking (tlob_task_handle)
> + * -------------------------------------------------
> + * tlob_stop_task() returns -ESRCH in two distinct situations:
> + *
> + * (a) The deadline timer already fired and removed the tlob hash-table
> + * entry before TRACE_STOP arrived -> budget was exceeded -> -EOVERFLOW
> + *
> + * (b) TRACE_START was never called for this thread -> programming error
> + * -> -ESRCH
> + *
> + * To distinguish them, rv_dev.c maintains a lightweight hash table
> + * (tlob_handles) that records a tlob_task_handle for every task_struct *
> + * for which a successful TLOB_IOCTL_TRACE_START has been
> + * issued but the corresponding TLOB_IOCTL_TRACE_STOP has not yet arrived.
> + *
> + * tlob_task_handle is a thin "session ticket" -- it carries only the
> + * task pointer and the owning file descriptor. The heavy per-task state
> + * (hrtimer, DA state, threshold) lives in tlob_task_state inside tlob.c.
> + *
> + * The table is keyed on task_struct * (same key as tlob.c), protected
> + * by tlob_handles_lock (spinlock, irq-safe). No get_task_struct()
> + * refcount is needed here because tlob.c already holds a reference for
> + * each live entry.
> + *
> + * Multiple threads may share the same fd.
Each thread has its own > + * tlob_task_handle in the table, so concurrent TRACE_START / TRACE_STOP > + * calls from different threads do not interfere. > + * > + * The fd release path (rv_release) calls tlob_stop_task() for every > + * handle in tlob_handles that belongs to the closing fd, ensuring cleanup > + * even if the user forgets to call TRACE_STOP. > + */ > +#include <linux/file.h> > +#include <linux/fs.h> > +#include <linux/gfp.h> > +#include <linux/hash.h> > +#include <linux/mm.h> > +#include <linux/miscdevice.h> > +#include <linux/module.h> > +#include <linux/poll.h> > +#include <linux/sched.h> > +#include <linux/slab.h> > +#include <linux/spinlock.h> > +#include <linux/uaccess.h> > +#include <uapi/linux/rv.h> > + > +#ifdef CONFIG_RV_MON_TLOB > +#include "monitors/tlob/tlob.h" > +#endif > + > +/* ----------------------------------------------------------------------- > + * tlob_task_handle - per-thread session ticket for the ioctl interface > + * > + * One handle is allocated by TLOB_IOCTL_TRACE_START and freed by > + * TLOB_IOCTL_TRACE_STOP (or by rv_release if the fd is closed). > + * > + * @hlist: Hash-table linkage in tlob_handles (keyed on task pointer). > + * @task: The monitored thread. Plain pointer; no refcount held here > + * because tlob.c holds one for the lifetime of the monitoring > + * window, which encompasses the lifetime of this handle. > + * @file: The /dev/rv file descriptor that issued TRACE_START. > + * Used by rv_release() to sweep orphaned handles on close(). 
> + * ----------------------------------------------------------------------- > + */ > +#define TLOB_HANDLES_BITS 5 > +#define TLOB_HANDLES_SIZE (1 << TLOB_HANDLES_BITS) > + > +struct tlob_task_handle { > + struct hlist_node hlist; > + struct task_struct *task; > + struct file *file; > +}; > + > +static struct hlist_head tlob_handles[TLOB_HANDLES_SIZE]; > +static DEFINE_SPINLOCK(tlob_handles_lock); > + > +static unsigned int tlob_handle_hash(const struct task_struct *task) > +{ > + return hash_ptr((void *)task, TLOB_HANDLES_BITS); > +} > + > +/* Must be called with tlob_handles_lock held. */ > +static struct tlob_task_handle * > +tlob_handle_find_locked(struct task_struct *task) > +{ > + struct tlob_task_handle *h; > + unsigned int slot = tlob_handle_hash(task); > + > + hlist_for_each_entry(h, &tlob_handles[slot], hlist) { > + if (h->task == task) > + return h; > + } > + return NULL; > +} > + > +/* > + * tlob_handle_alloc - record that @task has an active monitoring session > + * opened via @file. > + * > + * Returns 0 on success, -EEXIST if @task already has a handle (double > + * TRACE_START without TRACE_STOP), -ENOMEM on allocation failure. > + */ > +static int tlob_handle_alloc(struct task_struct *task, struct file *file) > +{ > + struct tlob_task_handle *h; > + unsigned long flags; > + unsigned int slot; > + > + h = kmalloc(sizeof(*h), GFP_KERNEL); > + if (!h) > + return -ENOMEM; > + h->task = task; > + h->file = file; > + > + spin_lock_irqsave(&tlob_handles_lock, flags); > + if (tlob_handle_find_locked(task)) { > + spin_unlock_irqrestore(&tlob_handles_lock, flags); > + kfree(h); > + return -EEXIST; > + } > + slot = tlob_handle_hash(task); > + hlist_add_head(&h->hlist, &tlob_handles[slot]); > + spin_unlock_irqrestore(&tlob_handles_lock, flags); > + return 0; > +} > + > +/* > + * tlob_handle_free - remove the handle for @task and free it. 
> + * > + * Returns 1 if a handle existed (TRACE_START was called), 0 if not found > + * (TRACE_START was never called for this thread). > + */ > +static int tlob_handle_free(struct task_struct *task) > +{ > + struct tlob_task_handle *h; > + unsigned long flags; > + > + spin_lock_irqsave(&tlob_handles_lock, flags); > + h = tlob_handle_find_locked(task); > + if (h) { > + hlist_del_init(&h->hlist); > + spin_unlock_irqrestore(&tlob_handles_lock, flags); > + kfree(h); > + return 1; > + } > + spin_unlock_irqrestore(&tlob_handles_lock, flags); > + return 0; > +} > + > +/* > + * tlob_handle_sweep_file - release all handles owned by @file. > + * > + * Called from rv_release() when the fd is closed without TRACE_STOP. > + * Calls tlob_stop_task() for each orphaned handle to drain the tlob > + * monitoring entries and prevent resource leaks in tlob.c. > + * > + * Handles are collected under the lock (short critical section), then > + * processed outside it (tlob_stop_task() may sleep/spin internally). > + */ > +#ifdef CONFIG_RV_MON_TLOB > +static void tlob_handle_sweep_file(struct file *file) > +{ > + struct tlob_task_handle *batch[TLOB_HANDLES_SIZE]; > + struct tlob_task_handle *h; > + struct hlist_node *tmp; > + unsigned long flags; > + int i, n = 0; > + > + spin_lock_irqsave(&tlob_handles_lock, flags); > + for (i = 0; i < TLOB_HANDLES_SIZE; i++) { > + hlist_for_each_entry_safe(h, tmp, &tlob_handles[i], hlist) { > + if (h->file == file) { > + hlist_del_init(&h->hlist); > + batch[n++] = h; > + } > + } > + } > + spin_unlock_irqrestore(&tlob_handles_lock, flags); > + > + for (i = 0; i < n; i++) { > + /* > + * Ignore -ESRCH: the deadline timer may have already fired > + * and cleaned up the tlob entry. 
> + */
> + tlob_stop_task(batch[i]->task);
> + kfree(batch[i]);
> + }
> +}
> +#else
> +static inline void tlob_handle_sweep_file(struct file *file) {}
> +#endif /* CONFIG_RV_MON_TLOB */
> +
> +/* -----------------------------------------------------------------------
> + * Ring buffer lifecycle
> + * -----------------------------------------------------------------------
> + */
> +
> +/*
> + * tlob_ring_alloc - allocate a ring of @cap records (must be a power of 2).
> + *
> + * Allocates a physically contiguous block of pages:
> + * page 0 : struct tlob_mmap_page (control page, shared with userspace)
> + * pages 1..N : struct tlob_event[cap] (data pages)
> + *
> + * Each page is marked reserved so it can be mapped to userspace via mmap().
> + */
> +static int tlob_ring_alloc(struct tlob_ring *ring, u32 cap)
> +{
> + unsigned int total = PAGE_SIZE + cap * sizeof(struct tlob_event);
> + unsigned int order = get_order(total);
> + unsigned long base;
> + unsigned int i;
> +
> + base = __get_free_pages(GFP_KERNEL | __GFP_ZERO, order);
> + if (!base)
> + return -ENOMEM;
> +
> + for (i = 0; i < (1u << order); i++)
> + SetPageReserved(virt_to_page((void *)(base + i * PAGE_SIZE)));
> +
> + ring->base = base;
> + ring->order = order;
> + ring->page = (struct tlob_mmap_page *)base;
> + ring->data = (struct tlob_event *)(base + PAGE_SIZE);
> + ring->mask = cap - 1;
> + spin_lock_init(&ring->lock);
> +
> + ring->page->capacity = cap;
> + ring->page->version = 1;
> + ring->page->data_offset = PAGE_SIZE;
> + ring->page->record_size = sizeof(struct tlob_event);
> + return 0;
> +}
> +
> +static void tlob_ring_free(struct tlob_ring *ring)
> +{
> + unsigned int i;
> +
> + if (!ring->base)
> + return;
> +
> + for (i = 0; i < (1u << ring->order); i++)
> + ClearPageReserved(virt_to_page((void *)(ring->base + i * PAGE_SIZE)));
> +
> + free_pages(ring->base, ring->order);
> + ring->base = 0;
> + ring->page = NULL;
> + ring->data = NULL;
> +}
> +
> +/* -----------------------------------------------------------------------
> + * File operations
> + * -----------------------------------------------------------------------
> + */
> +
> +static int rv_open(struct inode *inode, struct file *file)
> +{
> + struct rv_file_priv *priv;
> + int ret;
> +
> + priv = kzalloc(sizeof(*priv), GFP_KERNEL);
> + if (!priv)
> + return -ENOMEM;
> +
> + ret = tlob_ring_alloc(&priv->ring, TLOB_RING_DEFAULT_CAP);
> + if (ret) {
> + kfree(priv);
> + return ret;
> + }
> +
> + init_waitqueue_head(&priv->waitq);
> + file->private_data = priv;
> + return 0;
> +}
> +
> +static int rv_release(struct inode *inode, struct file *file)
> +{
> + struct rv_file_priv *priv = file->private_data;
> +
> + tlob_handle_sweep_file(file);
> + tlob_ring_free(&priv->ring);
> + kfree(priv);
> + file->private_data = NULL;
> + return 0;
> +}
> +
> +static __poll_t rv_poll(struct file *file, poll_table *wait)
> +{
> + struct rv_file_priv *priv = file->private_data;
> +
> + if (!priv)
> + return EPOLLERR;
> +
> + poll_wait(file, &priv->waitq, wait);
> +
> + /*
> + * Pairs with smp_store_release(&ring->page->data_head, ...) in
> + * tlob_event_push(). No lock needed: head is written by the kernel
> + * producer and read here; tail is written by the consumer and we only
> + * need an approximate check for the poll fast path.
> + */
> + if (smp_load_acquire(&priv->ring.page->data_head) != READ_ONCE(priv->ring.page->data_tail))
> + return EPOLLIN | EPOLLRDNORM;
> +
> + return 0;
> +}
> +
> +/*
> + * rv_read - consume tlob_event violation records from this fd's ring buffer.
> + *
> + * Each read() returns a whole number of struct tlob_event records. @count must
> + * be at least sizeof(struct tlob_event); partial-record sizes are rejected with
> + * -EINVAL.
> + *
> + * Blocking behaviour follows O_NONBLOCK on the fd:
> + * O_NONBLOCK clear: blocks until at least one record is available.
> + * O_NONBLOCK set: returns -EAGAIN immediately if the ring is empty.
> + *
> + * Returns the number of bytes copied (always a multiple of sizeof tlob_event),
> + * -EAGAIN if non-blocking and empty, or a negative error code.
> + *
> + * read() and mmap() share the same ring and data_tail cursor; do not use
> + * both simultaneously on the same fd.
> + */
> +static ssize_t rv_read(struct file *file, char __user *buf, size_t count,
> + loff_t *ppos)
> +{
> + struct rv_file_priv *priv = file->private_data;
> + struct tlob_ring *ring;
> + size_t rec = sizeof(struct tlob_event);
> + unsigned long irqflags;
> + ssize_t done = 0;
> + int ret;
> +
> + if (!priv)
> + return -ENODEV;
> +
> + ring = &priv->ring;
> +
> + if (count < rec)
> + return -EINVAL;
> +
> + /* Blocking path: sleep until the producer advances data_head. */
> + if (!(file->f_flags & O_NONBLOCK)) {
> + ret = wait_event_interruptible(priv->waitq,
> + /* pairs with smp_store_release() in the producer */
> + smp_load_acquire(&ring->page->data_head) !=
> + READ_ONCE(ring->page->data_tail));
> + if (ret)
> + return ret;
> + }
> +
> + /*
> + * Drain records into the caller's buffer. ring->lock serialises
> + * concurrent read() callers and the softirq producer.
> + */
> + while (done + rec <= count) {
> + struct tlob_event record;
> + u32 head, tail;
> +
> + spin_lock_irqsave(&ring->lock, irqflags);
> + /* pairs with smp_store_release() in the producer */
> + head = smp_load_acquire(&ring->page->data_head);
> + tail = ring->page->data_tail;
> + if (head == tail) {
> + spin_unlock_irqrestore(&ring->lock, irqflags);
> + break;
> + }
> + record = ring->data[tail & ring->mask];
> + WRITE_ONCE(ring->page->data_tail, tail + 1);
> + spin_unlock_irqrestore(&ring->lock, irqflags);
> +
> + if (copy_to_user(buf + done, &record, rec))
> + return done ? done : -EFAULT;
> + done += rec;
> + }
> +
> + return done ? done : -EAGAIN;
> +}
> +
> +/*
> + * rv_mmap - map the per-fd violation ring buffer into userspace.
> + *
> + * The mmap region covers the full ring allocation:
> + *
> + * offset 0 : struct tlob_mmap_page (control page)
> + * offset PAGE_SIZE : struct tlob_event[capacity] (data pages)
> + *
> + * The caller must map exactly PAGE_SIZE + capacity * sizeof(struct tlob_event)
> + * bytes starting at offset 0 (vm_pgoff must be 0). The actual capacity is
> + * read from tlob_mmap_page.capacity after a successful mmap(2).
> + *
> + * Private mappings (MAP_PRIVATE) are rejected: the shared data_tail field
> + * written by userspace must be visible to the kernel producer.
> + */
> +static int rv_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> + struct rv_file_priv *priv = file->private_data;
> + struct tlob_ring *ring;
> + unsigned long size = vma->vm_end - vma->vm_start;
> + unsigned long ring_size;
> +
> + if (!priv)
> + return -ENODEV;
> +
> + ring = &priv->ring;
> +
> + if (vma->vm_pgoff != 0)
> + return -EINVAL;
> +
> + ring_size = PAGE_ALIGN(PAGE_SIZE + ((unsigned long)(ring->mask + 1) *
> + sizeof(struct tlob_event)));
> + if (size != ring_size)
> + return -EINVAL;
> +
> + if (!(vma->vm_flags & VM_SHARED))
> + return -EINVAL;
> +
> + return remap_pfn_range(vma, vma->vm_start,
> + page_to_pfn(virt_to_page((void *)ring->base)),
> + ring_size, vma->vm_page_prot);
> +}
> +
> +/* -----------------------------------------------------------------------
> + * ioctl dispatcher
> + * -----------------------------------------------------------------------
> + */
> +
> +static long rv_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> +{
> + unsigned int nr = _IOC_NR(cmd);
> +
> + /*
> + * Verify the magic byte so we don't accidentally handle ioctls
> + * intended for a different device.
> + */ > + if (_IOC_TYPE(cmd) != RV_IOC_MAGIC) > + return -ENOTTY; > + > +#ifdef CONFIG_RV_MON_TLOB > + /* tlob: ioctl numbers 0x01 - 0x1F */ > + switch (cmd) { > + case TLOB_IOCTL_TRACE_START: { > + struct tlob_start_args args; > + struct file *notify_file = NULL; > + int ret, hret; > + > + if (copy_from_user(&args, > + (struct tlob_start_args __user *)arg, > + sizeof(args))) > + return -EFAULT; > + if (args.threshold_us == 0) > + return -EINVAL; > + if (args.flags != 0) > + return -EINVAL; > + > + /* > + * If notify_fd >= 0, resolve it to a file pointer. > + * fget() bumps the reference count; tlob.c drops it > + * via fput() when the monitoring window ends. > + * Reject non-/dev/rv fds to prevent type confusion. > + */ > + if (args.notify_fd >= 0) { > + notify_file = fget(args.notify_fd); > + if (!notify_file) > + return -EBADF; > + if (notify_file->f_op != file->f_op) { > + fput(notify_file); > + return -EINVAL; > + } > + } > + > + ret = tlob_start_task(current, args.threshold_us, > + notify_file, args.tag); > + if (ret != 0) { > + /* tlob.c did not take ownership; drop ref. */ > + if (notify_file) > + fput(notify_file); > + return ret; > + } > + > + /* > + * Record session handle. Free any stale handle left by > + * a previous window whose deadline timer fired (timer > + * removes tlob_task_state but cannot touch tlob_handles). > + */ > + tlob_handle_free(current); > + hret = tlob_handle_alloc(current, file); > + if (hret < 0) { > + tlob_stop_task(current); > + return hret; > + } > + return 0; > + } > + case TLOB_IOCTL_TRACE_STOP: { > + int had_handle; > + int ret; > + > + /* > + * Atomically remove the session handle for current. > + * > + * had_handle == 0: TRACE_START was never called for > + * this thread -> caller bug -> -ESRCH > + * > + * had_handle == 1: TRACE_START was called. 
If > + * tlob_stop_task() now returns > + * -ESRCH, the deadline timer already > + * fired -> budget exceeded -> -EOVERFLOW > + */ > + had_handle = tlob_handle_free(current); > + if (!had_handle) > + return -ESRCH; > + > + ret = tlob_stop_task(current); > + return (ret == -ESRCH) ? -EOVERFLOW : ret; > + } > + default: > + break; > + } > +#endif /* CONFIG_RV_MON_TLOB */ > + > + return -ENOTTY; > +} > + > +/* ----------------------------------------------------------------------- > + * Module init / exit > + * ----------------------------------------------------------------------- > + */ > + > +static const struct file_operations rv_fops = { > + .owner = THIS_MODULE, > + .open = rv_open, > + .release = rv_release, > + .read = rv_read, > + .poll = rv_poll, > + .mmap = rv_mmap, > + .unlocked_ioctl = rv_ioctl, > +#ifdef CONFIG_COMPAT > + .compat_ioctl = rv_ioctl, > +#endif > + .llseek = noop_llseek, > +}; > + > +/* > + * 0666: /dev/rv is a self-instrumentation device. All ioctls operate > + * exclusively on the calling task (current); no task can monitor another > + * via this interface. Opening the device does not grant any privilege > + * beyond observing one's own latency, so world-read/write is appropriate. 
> + */ > +static struct miscdevice rv_miscdev = { > + .minor = MISC_DYNAMIC_MINOR, > + .name = "rv", > + .fops = &rv_fops, > + .mode = 0666, > +}; > + > +static int __init rv_ioctl_init(void) > +{ > + int i; > + > + for (i = 0; i < TLOB_HANDLES_SIZE; i++) > + INIT_HLIST_HEAD(&tlob_handles[i]); > + > + return misc_register(&rv_miscdev); > +} > + > +static void __exit rv_ioctl_exit(void) > +{ > + misc_deregister(&rv_miscdev); > +} > + > +module_init(rv_ioctl_init); > +module_exit(rv_ioctl_exit); > + > +MODULE_LICENSE("GPL"); > +MODULE_DESCRIPTION("RV ioctl interface via /dev/rv"); > diff --git a/kernel/trace/rv/rv_trace.h b/kernel/trace/rv/rv_trace.h > index 4a6faddac..65d6c6485 100644 > --- a/kernel/trace/rv/rv_trace.h > +++ b/kernel/trace/rv/rv_trace.h > @@ -126,6 +126,7 @@ DECLARE_EVENT_CLASS(error_da_monitor_id, > #include <monitors/snroc/snroc_trace.h> > #include <monitors/nrp/nrp_trace.h> > #include <monitors/sssw/sssw_trace.h> > +#include <monitors/tlob/tlob_trace.h> > // Add new monitors based on CONFIG_DA_MON_EVENTS_ID here > > #endif /* CONFIG_DA_MON_EVENTS_ID */ > @@ -202,6 +203,55 @@ TRACE_EVENT(rv_retries_error, > __get_str(event), __get_str(name)) > ); > #endif /* CONFIG_RV_MON_MAINTENANCE_EVENTS */ > + > +#ifdef CONFIG_RV_MON_TLOB > +/* > + * tlob_budget_exceeded - emitted when a monitored task exceeds its latency > + * budget. Carries the on-CPU / off-CPU time breakdown so that the cause > + * of the overrun (CPU-bound vs. scheduling/I/O latency) is immediately > + * visible in the ftrace ring buffer without post-processing. 
> + */
> +TRACE_EVENT(tlob_budget_exceeded,
> +
> + TP_PROTO(struct task_struct *task, u64 threshold_us,
> + u64 on_cpu_us, u64 off_cpu_us, u32 switches,
> + bool state_is_on_cpu, u64 tag),
> +
> + TP_ARGS(task, threshold_us, on_cpu_us, off_cpu_us, switches,
> + state_is_on_cpu, tag),
> +
> + TP_STRUCT__entry(
> + __string(comm, task->comm)
> + __field(pid_t, pid)
> + __field(u64, threshold_us)
> + __field(u64, on_cpu_us)
> + __field(u64, off_cpu_us)
> + __field(u32, switches)
> + __field(bool, state_is_on_cpu)
> + __field(u64, tag)
> + ),
> +
> + TP_fast_assign(
> + __assign_str(comm);
> + __entry->pid = task->pid;
> + __entry->threshold_us = threshold_us;
> + __entry->on_cpu_us = on_cpu_us;
> + __entry->off_cpu_us = off_cpu_us;
> + __entry->switches = switches;
> + __entry->state_is_on_cpu = state_is_on_cpu;
> + __entry->tag = tag;
> + ),
> +
> + TP_printk("%s[%d]: budget exceeded threshold=%llu on_cpu=%llu off_cpu=%llu switches=%u state=%s tag=0x%016llx",
> + __get_str(comm), __entry->pid,
> + __entry->threshold_us,
> + __entry->on_cpu_us, __entry->off_cpu_us,
> + __entry->switches,
> + __entry->state_is_on_cpu ? "on_cpu" : "off_cpu",
> + __entry->tag)
> +);
> +#endif /* CONFIG_RV_MON_TLOB */
> +
> #endif /* _TRACE_RV_H */
> 
> /* This part must be outside protection */
