Re: [PATCH] drm/amd/pm: Change gpu_metrics over to binary with per-reader snapshots (v4)
Reading GPU metrics has several constraints that weren't being honoured before: - reading from offset >0 should be from the sampled data when offset==0 was read so it's coherent - support >4KB gpu metrics - use binary attributes - ideally for a given file handle you should invalidate the data periodically (so you can keep the handle open and seek to 0 to capture new data though the invalidation should only happen on an offset==0 read) Not really arguing one way or another if this patch is correct/ideal just saying what/why we're doing this. Tom From: Koenig, Christian Sent: Friday, April 10, 2026 10:00 To: StDenis, Tom; [email protected] Cc: Prosyak, Vitaly Subject: Re: [PATCH] drm/amd/pm: Change gpu_metrics over to binary with per-reader snapshots (v4) On 4/6/26 18:25, Tom St Denis wrote: > The gpu_metrics sysfs file carries a binary blob but was implemented as > a text device_attribute, which imposes a hard PAGE_SIZE-1 cap and makes > it impossible for userspace to distinguish a successful short read from > a truncated one (stat reports 4 KiB regardless of actual payload size). > > Convert gpu_metrics to a bin_attribute. The declared file size is an > upper bound (128 KiB); the real payload length is the byte count > returned before EOF. > > To guarantee that multi-chunk reads (which kernfs splits at PAGE_SIZE > boundaries) return a coherent snapshot, each reader (identified by its > struct file pointer) gets its own cached metrics buffer via an xarray. > When offset is 0, PMFW is sampled and stored in that reader's entry; > subsequent offsets are served from the same snapshot. The entry is > freed on EOF. A mutex serialises xarray mutations and PMFW access. > > Stale entries from readers that close without reaching EOF (e.g., > killed processes) are lazily evicted after 30 seconds whenever any > new reader starts. > > This per-reader approach avoids the cross-contamination problem of > a single shared cache: concurrent readers each see their own coherent > PMFW snapshot rather than risking one reader's off=0 overwriting the > buffer while another reader is mid-way through a multi-chunk read. > > V3: Per-reader snapshot state, so each open fd gets its own cached > gpu_metrics buffer and all chunks for that reader come from the > same sample (Vitaly) > > V4: Remove the residual PAGE_SIZE cap that v3 still carried over from > v1. The whole point of the bin_attribute conversion is to > lift the old PAGE_SIZE-1 limit, but v3 still had: > - WARN_ON_ONCE(len > PAGE_SIZE) with silent truncation to 4K > - kzalloc(PAGE_SIZE) for the snapshot buffer (fixed allocation) > - a separate memcpy from metrics into that fixed buffer > In v4 the snapshot buffer is allocated with kmemdup() sized to the > actual payload returned by amdgpu_dpm_get_gpu_metrics(), so payloads > larger than PAGE_SIZE work correctly. The only remaining upper bound > is AMDGPU_GPU_METRICS_BIN_ATTR_MAX_SZ (128 KiB), which is the > declared bin_attribute file size and serves as a sanity check — > exceeding it now returns -EOVERFLOW instead of silently truncating. > (Tom) > > Signed-off-by: Tom St Denis > Acked-by: Vitaly Prosyak > Change-Id: Ib4fa233d9a25a396f1cc7d5fcf74f5f6578329f5 > --- > drivers/gpu/drm/amd/pm/amdgpu_pm.c | 208 +--- > drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h | 4 + > drivers/gpu/drm/amd/pm/inc/amdgpu_pm.h | 1 - > 3 files changed, 188 insertions(+), 25 deletions(-) > > diff --git a/drivers/gpu/drm/amd/pm/amdgpu_pm.c > b/drivers/gpu/drm/amd/pm/amdgpu_pm.c > index a4d8e667eafb..303a2d643b10 100644 > --- a/drivers/gpu/drm/amd/pm/amdgpu_pm.c > +++ b/drivers/gpu/drm/amd/pm/amdgpu_pm.c > @@ -34,8 +34,19 @@ > #include > #include > #include > +#include > +#include > +#include > #include > > +/* > + * Sysfs reports this as the file size (stat/ls); kernfs also uses it to cap > + * read offsets. Actual payload length is the return value of > + * amdgpu_dpm_get_gpu_metrics() and must not exceed this. > + */ > +#define AMDGPU_GPU_METRICS_BIN_ATTR_MAX_SZ SZ_128K > + > + > #define MAX_NUM_OF_FEATURES_PER_SUBSET 8 > #define MAX_NUM_OF_SUBSETS 8 > > @@ -1734,43 +1745,170 @@ static ssize_t amdgpu_get_pm_metrics(struct device > *dev, > * DOC: gpu_metrics > * > * The amdgpu driver provides a sysfs API for retrieving current gpu > - * metrics data. The file gpu_metrics is used for this. Reading the > - * file will dump all the current gpu metrics data. > + * metrics data. The binary sysfs file gpu_metrics is used for this. > + * Reading the file
Re: [PATCH] drm/amd/pm: Change gpu_metrics over to binary with per-reader snapshots (v4)
On 4/6/26 18:25, Tom St Denis wrote:
> The gpu_metrics sysfs file carries a binary blob but was implemented as
> a text device_attribute, which imposes a hard PAGE_SIZE-1 cap and makes
> it impossible for userspace to distinguish a successful short read from
> a truncated one (stat reports 4 KiB regardless of actual payload size).
>
> Convert gpu_metrics to a bin_attribute. The declared file size is an
> upper bound (128 KiB); the real payload length is the byte count
> returned before EOF.
>
> To guarantee that multi-chunk reads (which kernfs splits at PAGE_SIZE
> boundaries) return a coherent snapshot, each reader (identified by its
> struct file pointer) gets its own cached metrics buffer via an xarray.
> When offset is 0, PMFW is sampled and stored in that reader's entry;
> subsequent offsets are served from the same snapshot. The entry is
> freed on EOF. A mutex serialises xarray mutations and PMFW access.
>
> Stale entries from readers that close without reaching EOF (e.g.,
> killed processes) are lazily evicted after 30 seconds whenever any
> new reader starts.
>
> This per-reader approach avoids the cross-contamination problem of
> a single shared cache: concurrent readers each see their own coherent
> PMFW snapshot rather than risking one reader's off=0 overwriting the
> buffer while another reader is mid-way through a multi-chunk read.
>
> V3: Per-reader snapshot state, so each open fd gets its own cached
> gpu_metrics buffer and all chunks for that reader come from the
> same sample (Vitaly)
>
> V4: Remove the residual PAGE_SIZE cap that v3 still carried over from
> v1. The whole point of the bin_attribute conversion is to
> lift the old PAGE_SIZE-1 limit, but v3 still had:
> - WARN_ON_ONCE(len > PAGE_SIZE) with silent truncation to 4K
> - kzalloc(PAGE_SIZE) for the snapshot buffer (fixed allocation)
> - a separate memcpy from metrics into that fixed buffer
> In v4 the snapshot buffer is allocated with kmemdup() sized to the
> actual payload returned by amdgpu_dpm_get_gpu_metrics(), so payloads
> larger than PAGE_SIZE work correctly. The only remaining upper bound
> is AMDGPU_GPU_METRICS_BIN_ATTR_MAX_SZ (128 KiB), which is the
> declared bin_attribute file size and serves as a sanity check —
> exceeding it now returns -EOVERFLOW instead of silently truncating.
> (Tom)
>
> Signed-off-by: Tom St Denis
> Acked-by: Vitaly Prosyak
> Change-Id: Ib4fa233d9a25a396f1cc7d5fcf74f5f6578329f5
> ---
> drivers/gpu/drm/amd/pm/amdgpu_pm.c | 208 +---
> drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h | 4 +
> drivers/gpu/drm/amd/pm/inc/amdgpu_pm.h | 1 -
> 3 files changed, 188 insertions(+), 25 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/pm/amdgpu_pm.c
> b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
> index a4d8e667eafb..303a2d643b10 100644
> --- a/drivers/gpu/drm/amd/pm/amdgpu_pm.c
> +++ b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
> @@ -34,8 +34,19 @@
> #include
> #include
> #include
> +#include
> +#include
> +#include
> #include
>
> +/*
> + * Sysfs reports this as the file size (stat/ls); kernfs also uses it to cap
> + * read offsets. Actual payload length is the return value of
> + * amdgpu_dpm_get_gpu_metrics() and must not exceed this.
> + */
> +#define AMDGPU_GPU_METRICS_BIN_ATTR_MAX_SZ SZ_128K
> +
> +
> #define MAX_NUM_OF_FEATURES_PER_SUBSET 8
> #define MAX_NUM_OF_SUBSETS 8
>
> @@ -1734,43 +1745,170 @@ static ssize_t amdgpu_get_pm_metrics(struct device
> *dev,
> * DOC: gpu_metrics
> *
> * The amdgpu driver provides a sysfs API for retrieving current gpu
> - * metrics data. The file gpu_metrics is used for this. Reading the
> - * file will dump all the current gpu metrics data.
> + * metrics data. The binary sysfs file gpu_metrics is used for this.
> + * Reading the file returns the raw metrics blob. The sysfs file size
> + * is an upper bound for inode metadata; the real length is the amount
> + * returned before EOF.
> + *
> + * Metrics are sampled atomically per reader: the first read at offset 0
> + * captures a snapshot into a per-fd cache; subsequent reads at higher
> + * offsets (for payloads that span multiple pages) are served from that
> + * same snapshot. Concurrent readers each get their own cache.
> *
> * These data include temperature, frequency, engines utilization,
> * power consume, throttler status, fan speed and cpu core statistics(
> * available for APU only). That's it will give a snapshot of all sensors
> * at the same time.
> */
> -static ssize_t amdgpu_get_gpu_metrics(struct device *dev,
> - struct device_attribute *attr,
> - char *buf)
> +
> +/* Per-reader snapshot entry, keyed by struct file pointer in the xarray */
> +struct gpu_metrics_snap_entry {
> + void*data;
> + size_t size;
> + ktime_t timestamp;
> +};
> +
> +/* Evict stale entri
[PATCH] drm/amd/pm: Change gpu_metrics over to binary with per-reader snapshots (v4)
The gpu_metrics sysfs file carries a binary blob but was implemented as
a text device_attribute, which imposes a hard PAGE_SIZE-1 cap and makes
it impossible for userspace to distinguish a successful short read from
a truncated one (stat reports 4 KiB regardless of actual payload size).
Convert gpu_metrics to a bin_attribute. The declared file size is an
upper bound (128 KiB); the real payload length is the byte count
returned before EOF.
To guarantee that multi-chunk reads (which kernfs splits at PAGE_SIZE
boundaries) return a coherent snapshot, each reader (identified by its
struct file pointer) gets its own cached metrics buffer via an xarray.
When offset is 0, PMFW is sampled and stored in that reader's entry;
subsequent offsets are served from the same snapshot. The entry is
freed on EOF. A mutex serialises xarray mutations and PMFW access.
Stale entries from readers that close without reaching EOF (e.g.,
killed processes) are lazily evicted after 30 seconds whenever any
new reader starts.
This per-reader approach avoids the cross-contamination problem of
a single shared cache: concurrent readers each see their own coherent
PMFW snapshot rather than risking one reader's off=0 overwriting the
buffer while another reader is mid-way through a multi-chunk read.
V3: Per-reader snapshot state, so each open fd gets its own cached
gpu_metrics buffer and all chunks for that reader come from the
same sample (Vitaly)
V4: Remove the residual PAGE_SIZE cap that v3 still carried over from
v1. The whole point of the bin_attribute conversion is to
lift the old PAGE_SIZE-1 limit, but v3 still had:
- WARN_ON_ONCE(len > PAGE_SIZE) with silent truncation to 4K
- kzalloc(PAGE_SIZE) for the snapshot buffer (fixed allocation)
- a separate memcpy from metrics into that fixed buffer
In v4 the snapshot buffer is allocated with kmemdup() sized to the
actual payload returned by amdgpu_dpm_get_gpu_metrics(), so payloads
larger than PAGE_SIZE work correctly. The only remaining upper bound
is AMDGPU_GPU_METRICS_BIN_ATTR_MAX_SZ (128 KiB), which is the
declared bin_attribute file size and serves as a sanity check —
exceeding it now returns -EOVERFLOW instead of silently truncating.
(Tom)
Signed-off-by: Tom St Denis
Acked-by: Vitaly Prosyak
Change-Id: Ib4fa233d9a25a396f1cc7d5fcf74f5f6578329f5
---
drivers/gpu/drm/amd/pm/amdgpu_pm.c | 208 +---
drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h | 4 +
drivers/gpu/drm/amd/pm/inc/amdgpu_pm.h | 1 -
3 files changed, 188 insertions(+), 25 deletions(-)
diff --git a/drivers/gpu/drm/amd/pm/amdgpu_pm.c
b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
index a4d8e667eafb..303a2d643b10 100644
--- a/drivers/gpu/drm/amd/pm/amdgpu_pm.c
+++ b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
@@ -34,8 +34,19 @@
#include
#include
#include
+#include
+#include
+#include
#include
+/*
+ * Sysfs reports this as the file size (stat/ls); kernfs also uses it to cap
+ * read offsets. Actual payload length is the return value of
+ * amdgpu_dpm_get_gpu_metrics() and must not exceed this.
+ */
+#define AMDGPU_GPU_METRICS_BIN_ATTR_MAX_SZ SZ_128K
+
+
#define MAX_NUM_OF_FEATURES_PER_SUBSET 8
#define MAX_NUM_OF_SUBSETS 8
@@ -1734,43 +1745,170 @@ static ssize_t amdgpu_get_pm_metrics(struct device
*dev,
* DOC: gpu_metrics
*
* The amdgpu driver provides a sysfs API for retrieving current gpu
- * metrics data. The file gpu_metrics is used for this. Reading the
- * file will dump all the current gpu metrics data.
+ * metrics data. The binary sysfs file gpu_metrics is used for this.
+ * Reading the file returns the raw metrics blob. The sysfs file size
+ * is an upper bound for inode metadata; the real length is the amount
+ * returned before EOF.
+ *
+ * Metrics are sampled atomically per reader: the first read at offset 0
+ * captures a snapshot into a per-fd cache; subsequent reads at higher
+ * offsets (for payloads that span multiple pages) are served from that
+ * same snapshot. Concurrent readers each get their own cache.
*
* These data include temperature, frequency, engines utilization,
* power consume, throttler status, fan speed and cpu core statistics(
* available for APU only). That's it will give a snapshot of all sensors
* at the same time.
*/
-static ssize_t amdgpu_get_gpu_metrics(struct device *dev,
- struct device_attribute *attr,
- char *buf)
+
+/* Per-reader snapshot entry, keyed by struct file pointer in the xarray */
+struct gpu_metrics_snap_entry {
+ void*data;
+ size_t size;
+ ktime_t timestamp;
+};
+
+/* Evict stale entries from readers that closed without reaching EOF */
+#define GPU_METRICS_SNAP_STALE_NS (30ULL * NSEC_PER_SEC)
+
+static void gpu_metrics_evict_stale_locked(struct xarray *xa,
+ unsigned long skip_key)
+{
+
