ras: Introduce the DRM RAS infrastructure over generic netlink

Riana Tauro Thu, 15 Jan 2026 21:57:09 -0800



On 1/16/2026 5:09 AM, Zack McKevitt wrote:

On 1/13/2026 1:20 AM, Riana Tauro wrote:
diff --git a/Documentation/netlink/specs/drm_ras.yaml b/Documentation/netlink/specs/drm_ras.yaml
new file mode 100644
index 000000000000..be0e379c5bc9
--- /dev/null
+++ b/Documentation/netlink/specs/drm_ras.yaml
@@ -0,0 +1,130 @@
+# SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)
+---
+name: drm-ras
+protocol: genetlink
+uapi-header: drm/drm_ras.h
+
+doc: >-
+  DRM RAS (Reliability, Availability, Serviceability) over Generic Netlink.
+  Provides a standardized mechanism for DRM drivers to register "nodes"
+  representing hardware/software components capable of reporting error counters.
+  Userspace tools can query the list of nodes or individual error counters
+  via the Generic Netlink interface.
+
+definitions:
+  -
+    type: enum
+    name: node-type
+    value-start: 1
+    entries: [error-counter]
+    doc: >-
+         Type of the node. Currently, only error-counter nodes are
+         supported, which expose reliability counters for a hardware/software
+         component.
+
+attribute-sets:
+  -
+    name: node-attrs
+    attributes:
+      -
+        name: node-id
+        type: u32
+        doc: >-
+             Unique identifier for the node.
+             Assigned dynamically by the DRM RAS core upon registration.
+      -
+        name: device-name
+        type: string
+        doc: >-
+             Device name chosen by the driver at registration.
+             Can be a PCI BDF, UUID, or module name if unique.
+      -
+        name: node-name
+        type: string
+        doc: >-
+             Node name chosen by the driver at registration.
+             Can be an IP block name, or any name that identifies the
+             RAS node inside the device.
+      -
+        name: node-type
+        type: u32
+        doc: Type of this node, identifying its function.
+        enum: node-type
+  -
+    name: error-counter-attrs
+    attributes:
+      -
+        name: node-id
+        type: u32
+        doc:  Node ID targeted by this error counter operation.
+      -
+        name: error-id
+        type: u32
+        doc: Unique identifier for a specific error counter within an node.
+      -
+        name: error-name
+        type: string
+        doc: Name of the error.
+      -
+        name: error-value
+        type: u32
+        doc: Current value of the requested error counter.
+
+operations:
+  list:
+    -
+      name: list-nodes
+      doc: >-
+           Retrieve the full list of currently registered DRM RAS nodes.
+           Each node includes its dynamically assigned ID, name, and type.
+           **Important:** User space must call this operation first to obtain
+           the node IDs. These IDs are required for all subsequent
+           operations on nodes, such as querying error counters.
I am curious about security implications of this design.
hmm... very good point you are raising here.

This current design relies entirely on CAP_NET_ADMIN.
No driver would have the flexibility to choose anything differently.
Please notice that the flag admin-perm is hardcoded in this yaml file.
If the complete
list of RAS nodes is visible for any process on the system (and one wants to
avoid requiring CAP_NET_ADMIN), there should be some way to enforce
permission checks when performing these operations if desired.
Right now, there's no way for a driver to avoid requiring
CAP_NET_ADMIN...

The only way would be for the admin to grant cap_net_admin to the tool with:

$ sudo setcap cap_net_admin+ep /bin/drm_ras_tool

but not ideal and not granular anyway...
For example, this might be implemented in the driver's definition of
callback functions like query_error_counter; some drivers may want to ensure
that the process can in fact open the file descriptor corresponding to the
queried device before serving a netlink request. Is it enough for a driver
to simply return -EPERM in this case? Any driver that doesn't wish to protect
its RAS nodes need not implement checks in their callbacks.
Fair enough. If we want to give the option to the drivers, then we need:
1. to first remove all the admin-perm flags below and leave the driver to
pick its own policy on when to return something or -EPERM.
2. Document this security responsibility and list a few possibilities.
3. In our Xe case here I believe the easiest option is to use something like:
struct scm_creds *creds = NETLINK_CREDS(cb->skb);
if (!gid_eq(creds->gid, GLOBAL_ROOT_GID))
     return -EPERM;
The driver currently does not have access to the callback or the skbuffer. Sending these details as parameters to the driver wouldn't be right, as
drm_ras needs to handle all the netlink buffers.
How about using the pre_doit and start calls? If the driver has a pre callback, it is the driver's responsibility to check permissions and any preconditions; otherwise the CAP_NET_ADMIN permission will be checked.
@Zack / @Rodrigo thoughts?
@Zack Will this work for your usecase?

yaml
+    dump:
+        pre: drm-ras-nl-pre-list-nodes


drm_ras.c :

+       if (node->pre_list_nodes)
+                return node->pre_list_nodes(node);
+
+        return check_permissions(cb->skb);  //Checks creds

Thanks
Riana
I agree that a pre_doit is probably the best solution for this.
Not entirely sure what a driver specific implementation would look like yet, but I think that as long as the driver callback has a way to access the 'current' task_struct pointer corresponding to the user process then this approach seems like the best option from the netlink side.
Since this is mostly a concern for our specific use case, perhaps we can incorporate this functionality in our change down the road when we expand the interface for telemetry?



Yeah, using pre_doit we can allow the driver to make decisions based on the private data or driver-specific permissions. However, we will need to check CAP_NET_ADMIN in the netlink layer as a default when no driver callback is implemented.
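
To make the fallback rule concrete, here is a minimal user-space sketch of the dispatch, not kernel code. Every name in it (ras_caller, ras_node, ras_check_access, ras_default_check, example_driver_pre) is hypothetical and only models the idea; the boolean has_net_admin stands in for the real CAP_NET_ADMIN check, and can_open_device stands in for whatever a driver would actually inspect.

```c
/*
 * User-space model of "driver pre callback, else default capability check".
 * All types and functions here are hypothetical illustrations; none of them
 * are part of the drm_ras or generic netlink kernel API.
 */
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct ras_caller {
	bool has_net_admin;   /* stands in for a CAP_NET_ADMIN check */
	bool can_open_device; /* stands in for a driver-specific policy input */
};

struct ras_node {
	const char *name;
	/* Optional driver hook: returns 0 to allow, negative errno to deny. */
	int (*pre_list_nodes)(const struct ras_node *node,
			      const struct ras_caller *caller);
};

/* Default policy used when the driver supplies no hook. */
static int ras_default_check(const struct ras_caller *caller)
{
	return caller->has_net_admin ? 0 : -EPERM;
}

/* The driver hook decides if present; otherwise fall back to the default. */
static int ras_check_access(const struct ras_node *node,
			    const struct ras_caller *caller)
{
	if (node->pre_list_nodes)
		return node->pre_list_nodes(node, caller);

	return ras_default_check(caller);
}

/* Example driver policy: allow callers that could open the device. */
static int example_driver_pre(const struct ras_node *node,
			      const struct ras_caller *caller)
{
	(void)node; /* a real policy would likely look at per-node private data */
	return caller->can_open_device ? 0 : -EPERM;
}
```

Note that in this model a driver hook fully replaces the default, so a caller with the admin capability can still be rejected by the driver's own policy. Whether that is the desired semantics is one of the things still to be settled.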


Thank you, you can incorporate the changes when you add telemetry nodes.

For now, I will retain the admin-perm in flags.

I will address the rest of the review comments and send out a new revision shortly.


Thank you
Riana

Let me know what you think.

Zack
or something like that?!
perhaps drivers could implement some form of cookie or pre-authorization with
ioctls or sysfs, and then store in the priv?

Thoughts?
Other options?
I don't see any such permission checks in your driver implementation, which
is understandable given that it may not be necessary for your use cases.
However, this would be a concern for our driver if we were to adopt this
interface.
yeap, this case was entirely with admin-perm, so not needed at all...
But I see your point and this is really not giving any flexibility to
other drivers.
Thanks,

Zack

Re: [PATCH v3 1/4] drm/ras: Introduce the DRM RAS infrastructure over generic netlink
