[Public] >From what I understand, all KFD events will also eventually be moved to drm >node based uevents. Also, these events are pre/post reset ones, not when the >device is in reset state.
Thanks, Lijo ________________________________ From: Wang, Yang(Kevin) <[email protected]> Sent: Friday, September 26, 2025 3:13:39 PM To: Lazar, Lijo <[email protected]>; [email protected] <[email protected]> Cc: Zhang, Hawking <[email protected]>; Deucher, Alexander <[email protected]> Subject: RE: [PATCH] drm/amdgpu: notify amdgpu gpu reset state via uevent [Public] >> I guess the primary reason to have drm_ event and amdgpu having that is >> because all the 'users' interested in GPU events come through drm interface. in fact, that such devices like drm render/kfd/i2c controller/hwmon/device nodes are attached to pci devices may not respond if device in reset state. so, this is useful event for user mode application. and please conduct some research before making any comments to avoid wasting review resources. KERNEL[11438.593689] remove /devices/virtual/kfd/kfd (kfd) KERNEL[11438.593757] remove /class/kfd (class) KERNEL[11438.614767] remove /devices/pci0000:00/0000:00:03.1/0000:03:00.0/0000:04:00.0/0000:05:00.0/drm/card1/card1-DP-3/i2c-13/i2c-dev/i2c-13 (i2c-dev) KERNEL[11438.615100] remove /devices/pci0000:00/0000:00:03.1/0000:03:00.0/0000:04:00.0/0000:05:00.0/drm/card1/card1-DP-3/i2c-13 (i2c) KERNEL[11438.615624] remove /devices/pci0000:00/0000:00:03.1/0000:03:00.0/0000:04:00.0/0000:05:00.0/drm/card1/card1-DP-3 (drm) KERNEL[11438.615951] remove /devices/pci0000:00/0000:00:03.1/0000:03:00.0/0000:04:00.0/0000:05:00.0/drm/card1/card1-HDMI-A-1 (drm) KERNEL[11438.617227] remove /devices/pci0000:00/0000:00:03.1/0000:03:00.0/0000:04:00.0/0000:05:00.0/drm/card1/card1-Writeback-1 (drm) KERNEL[11438.618336] remove /devices/pci0000:00/0000:00:03.1/0000:03:00.0/0000:04:00.0/0000:05:00.0/drm/card1 (drm) KERNEL[11438.618429] remove /devices/pci0000:00/0000:00:03.1/0000:03:00.0/0000:04:00.0/0000:05:00.0/drm/renderD128 (drm) KERNEL[11438.622178] remove /devices/pci0000:00/0000:00:03.1/0000:03:00.0/0000:04:00.0/0000:05:00.0/hwmon/hwmon0 (hwmon) KERNEL[11438.784296] remove /devices/pci0000:00/0000:00:03.1/0000:03:00.0/0000:04:00.0/0000:05:00.0/i2c-10/i2c-dev/i2c-10 (i2c-dev) KERNEL[11438.784346] remove /devices/pci0000:00/0000:00:03.1/0000:03:00.0/0000:04:00.0/0000:05:00.0/i2c-10 (i2c) KERNEL[11438.784386] remove /devices/pci0000:00/0000:00:03.1/0000:03:00.0/0000:04:00.0/0000:05:00.0/i2c-9/i2c-dev/i2c-9 (i2c-dev) KERNEL[11438.784417] remove /devices/pci0000:00/0000:00:03.1/0000:03:00.0/0000:04:00.0/0000:05:00.0/i2c-9 (i2c) KERNEL[11438.784508] remove /devices/pci0000:00/0000:00:03.1/0000:03:00.0/0000:04:00.0/0000:05:00.0/i2c-8/i2c-dev/i2c-8 (i2c-dev) KERNEL[11438.784540] remove /devices/pci0000:00/0000:00:03.1/0000:03:00.0/0000:04:00.0/0000:05:00.0/i2c-8 (i2c) KERNEL[11438.784634] remove /devices/pci0000:00/0000:00:03.1/0000:03:00.0/0000:04:00.0/0000:05:00.0/i2c-7/i2c-dev/i2c-7 (i2c-dev) KERNEL[11438.784664] remove /devices/pci0000:00/0000:00:03.1/0000:03:00.0/0000:04:00.0/0000:05:00.0/i2c-7 (i2c) KERNEL[11438.784803] remove /devices/pci0000:00/0000:00:03.1/0000:03:00.0/0000:04:00.0/0000:05:00.0/i2c-6/i2c-dev/i2c-6 (i2c-dev) KERNEL[11438.784934] remove /devices/pci0000:00/0000:00:03.1/0000:03:00.0/0000:04:00.0/0000:05:00.0/i2c-6 (i2c) KERNEL[11438.785151] remove /devices/pci0000:00/0000:00:03.1/0000:03:00.0/0000:04:00.0/0000:05:00.0/i2c-5/i2c-dev/i2c-5 (i2c-dev) KERNEL[11438.785335] remove /devices/pci0000:00/0000:00:03.1/0000:03:00.0/0000:04:00.0/0000:05:00.0/i2c-5 (i2c) Best Regards, Kevin From: Lazar, Lijo <[email protected]> Sent: Friday, September 26, 2025 17:22 To: Wang, Yang(Kevin) <[email protected]>; [email protected] Cc: Zhang, Hawking <[email protected]>; Deucher, Alexander <[email protected]> Subject: Re: [PATCH] drm/amdgpu: notify amdgpu gpu reset state via uevent [Public] The intention is to notify users of the device about the event. I guess the primary reason to have drm_ event and amdgpu having that is because all the 'users' interested in GPU events come through drm interface. Thanks, Lijo ________________________________ From: Wang, Yang(Kevin) <[email protected]<mailto:[email protected]>> Sent: Friday, September 26, 2025 1:04:56 PM To: Lazar, Lijo <[email protected]<mailto:[email protected]>>; [email protected]<mailto:[email protected]> <[email protected]<mailto:[email protected]>> Cc: Zhang, Hawking <[email protected]<mailto:[email protected]>>; Deucher, Alexander <[email protected]<mailto:[email protected]>> Subject: RE: [PATCH] drm/amdgpu: notify amdgpu gpu reset state via uevent [Public] KERNEL[173.150476] change /devices/pci0000:00/0000:00:03.1/0000:03:00.0/0000:04:00.0/0000:05:00.0/drm/card1 (drm) ACTION=change DEVPATH=/devices/pci0000:00/0000:00:03.1/0000:03:00.0/0000:04:00.0/0000:05:00.0/drm/card1 SUBSYSTEM=drm WEDGED=none DEVNAME=/dev/dri/card1 DEVTYPE=drm_minor SEQNUM=6237 MAJOR=226 MINOR=1 a "drm_dev_wedget_event()" uevent example above. You shouldn't discuss these together; they are two separate events occurring on different type devices (pci device and drm device). software-defined devices and physical devices don't have a strict one-to-one mapping, and the device initiating the reset and the device that need to reset are different on an XGMI system. so, all independent PCI devices in same XGMI link need to independently report events. Best Regards, Kevin -----Original Message----- From: Lazar, Lijo <[email protected]<mailto:[email protected]>> Sent: Friday, September 26, 2025 14:55 To: Wang, Yang(Kevin) <[email protected]<mailto:[email protected]>>; [email protected]<mailto:[email protected]> Cc: Zhang, Hawking <[email protected]<mailto:[email protected]>>; Deucher, Alexander <[email protected]<mailto:[email protected]>> Subject: RE: [PATCH] drm/amdgpu: notify amdgpu gpu reset state via uevent [Public] Presently, there is this one also - drm_dev_wedged_event. Perhaps it's better to modify this to include additional info like pre and post reset along with cause of reset? Thanks, Lijo -----Original Message----- From: amd-gfx <[email protected]<mailto:[email protected]>> On Behalf Of Yang Wang Sent: Friday, September 26, 2025 12:04 PM To: [email protected]<mailto:[email protected]> Cc: Zhang, Hawking <[email protected]<mailto:[email protected]>>; Deucher, Alexander <[email protected]<mailto:[email protected]>> Subject: [PATCH] drm/amdgpu: notify amdgpu gpu reset state via uevent Use the uevent mechanism to expose the GPU reset state, so that the system tool can more accurately monitor the device reset status. example: $ sudo cat /sys/kernel/debug/dri/<minor>/amdgpu_gpu_recover KERNEL[172.053149] change /devices/pci0000:00/0000:00:03.1/0000:03:00.0/0000:04:00.0/0000:05:00.0 (pci) ACTION=change DEVPATH=/devices/pci0000:00/0000:00:03.1/0000:03:00.0/0000:04:00.0/0000:05:00.0 SUBSYSTEM=pci RESET=1 DRIVER=amdgpu PCI_CLASS=30000 PCI_ID=1002:73BF PCI_SUBSYS_ID=1002:0E3A PCI_SLOT_NAME=0000:05:00.0 MODALIAS=pci:v00001002d000073BFsv00001002sd00000E3Abc03sc00i00 SEQNUM=6235 KERNEL[173.137681] change /devices/pci0000:00/0000:00:03.1/0000:03:00.0/0000:04:00.0/0000:05:00.0 (pci) ACTION=change DEVPATH=/devices/pci0000:00/0000:00:03.1/0000:03:00.0/0000:04:00.0/0000:05:00.0 SUBSYSTEM=pci RESET=0 DRIVER=amdgpu PCI_CLASS=30000 PCI_ID=1002:73BF PCI_SUBSYS_ID=1002:0E3A PCI_SLOT_NAME=0000:05:00.0 MODALIAS=pci:v00001002d000073BFsv00001002sd00000E3Abc03sc00i00 SEQNUM=6236 Signed-off-by: Yang Wang <[email protected]<mailto:[email protected]>> --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 3 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 39 ++++++++++++++++++++++ 2 files changed, 42 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index 2a0df4cabb99..73c946d9cbe1 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -1805,4 +1805,7 @@ void amdgpu_device_set_uid(struct amdgpu_uid *uid_info, uint64_t uid); uint64_t amdgpu_device_get_uid(struct amdgpu_uid *uid_info, enum amdgpu_uid_type type, uint8_t inst); + +int amdgpu_device_uevent_reset(struct amdgpu_device *adev, bool done); + #endif diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index a77000c2e0bb..300cc22dad91 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -6318,6 +6318,7 @@ static int amdgpu_device_asic_reset(struct amdgpu_device *adev, retry: /* Rest of adevs pre asic reset from XGMI hive. */ list_for_each_entry(tmp_adev, device_list, reset_list) { + amdgpu_device_uevent_reset(tmp_adev, false); r = amdgpu_device_pre_asic_reset(tmp_adev, reset_context); /*TODO Should we stop ?*/ if (r) { @@ -6362,6 +6363,8 @@ static int amdgpu_device_asic_reset(struct amdgpu_device *adev, * in before drm_sched_start. */ amdgpu_device_stop_pending_resets(tmp_adev); + + amdgpu_device_uevent_reset(tmp_adev, true); } return r; @@ -7669,3 +7672,39 @@ u64 amdgpu_device_get_uid(struct amdgpu_uid *uid_info, return uid_info->uid[type][inst]; } + +__printf(3, 4) +static int amdgpu_device_uevent_emit(struct amdgpu_device *adev, enum kobject_action action, + char *fmt, ...) { + struct kobject *kobj = &adev->dev->kobj; + char *uevent_env[2], *tmp; + va_list ap; + int ret; + + va_start(ap, fmt); + tmp = kvasprintf(GFP_KERNEL, fmt, ap); + va_end(ap); + + if (!tmp) { + ret = -ENOMEM; + goto out; + } + + uevent_env[0] = tmp; + uevent_env[1] = NULL; + + ret = kobject_uevent_env(kobj, action, uevent_env); + + kvfree(tmp); + +out: + return ret; +} + +int amdgpu_device_uevent_reset(struct amdgpu_device *adev, bool done) { + int val = done ? 0 : 1; + + return amdgpu_device_uevent_emit(adev, KOBJ_CHANGE, "RESET=%d", +val); } -- 2.34.1
