** Tags added: jira-sutton-5174 oem-priority sutton
** Tags added: jira-sutton-5194
** Also affects: linux-oem-6.17 (Ubuntu Resolute)
Importance: Undecided
Status: New
** Also affects: linux-oem-6.17 (Ubuntu Stonking)
Importance: Undecided
Status: New
** Changed in: linux-oem-6.17 (Ubuntu Resolute)
Status: New => Fix Released
** Changed in: linux-oem-6.17 (Ubuntu Stonking)
Status: New => Fix Released
** Changed in: linux-oem-6.17 (Ubuntu Noble)
Status: New => In Progress
** Changed in: hwe-next
Status: New => In Progress
** Changed in: hwe-next
Assignee: (unassigned) => AaronMa (mapengyu)
** Changed in: hwe-next
Importance: Undecided => Medium
** Changed in: linux-oem-6.17 (Ubuntu Noble)
Importance: Undecided => Medium
** Changed in: linux-oem-6.17 (Ubuntu Resolute)
Importance: Undecided => Medium
** Changed in: linux-oem-6.17 (Ubuntu Stonking)
Importance: Undecided => Medium
** Description changed:
[Impact]
System hangs on amdgpu during exec. systemd blocked for 368s in D state in
amdgpu_ctx_mgr_entity_flush (mutex_lock) via filp_close -> amdgpu_flush during
do_close_on_exec -> load_elf_binary. Reproduces on 6.17.0-1020-oem.
- May 29 10:47:30 London-AMD-SIT5 kernel: INFO: task systemd:16426 blocked for
more than 368 seconds.
- May 29 10:47:30 London-AMD-SIT5 kernel: Tainted: G O
6.17.0-1020-oem #20-Ubuntu
- May 29 10:47:30 London-AMD-SIT5 kernel: task:systemd state:D pid:16426
- May 29 10:47:30 London-AMD-SIT5 kernel: mutex_lock
- May 29 10:47:30 London-AMD-SIT5 kernel:
amdgpu_ctx_mgr_entity_flush+0x46/0x1f0 [amdgpu]
- May 29 10:47:30 London-AMD-SIT5 kernel: amdgpu_flush+0x26/0x50 [amdgpu]
- May 29 10:47:30 London-AMD-SIT5 kernel: filp_flush+0x6f/0xb0
- May 29 10:47:30 London-AMD-SIT5 kernel: filp_close+0x14/0x30
- May 29 10:47:30 London-AMD-SIT5 kernel: do_close_on_exec+0xe7/0x140
- May 29 10:47:30 London-AMD-SIT5 kernel: begin_new_exec+0x1ab/0x420
- May 29 10:47:30 London-AMD-SIT5 kernel: load_elf_binary+0x32d/0xf40
+ kernel: INFO: task systemd:16426 blocked for more than 368 seconds.
+ kernel: Tainted: G O 6.17.0-1020-oem #20-Ubuntu
+ kernel: task:systemd state:D pid:16426
+ kernel: mutex_lock
+ kernel: amdgpu_ctx_mgr_entity_flush+0x46/0x1f0 [amdgpu]
+ kernel: amdgpu_flush+0x26/0x50 [amdgpu]
+ kernel: filp_flush+0x6f/0xb0
+ kernel: filp_close+0x14/0x30
+ kernel: do_close_on_exec+0xe7/0x140
+ kernel: begin_new_exec+0x1ab/0x420
+ kernel: load_elf_binary+0x32d/0xf40
[Fix]
Cherry-pick upstream commits:
- b18fc0ab837381 "drm/amdgpu: fix sync handling in amdgpu_dma_buf_move_notify"
- 930595df251c "drm/amdgpu: remove check for BO reservation add assert
instead"
Pass NULL ticket to amdgpu_vm_handle_moved in move_notify so the clear=true
path is used, avoiding a PT update while another process's job is still
running — the contention that blocks amdgpu_ctx_mgr_entity_flush.
[Test Plan]
Boot oem-6.17 on dual-GPU AMD system without P2P PCI; glxgears on GPU0,
Xorg on GPU1 sharing a dmabuf. Without fix: systemd hung >368s in
amdgpu_ctx_mgr_entity_flush. With fix: no hang.
[Where problems could occur]
- Patch 1 changes only the ticket arg in amdgpu_dma_buf_move_notify; normal VM
- update path untouched. Risk limited to dmabuf move-notify sync on multi-GPU
- shared-BO setups.
+ update path untouched. Risk limited to dmabuf move-notify sync on multi-GPU
+ shared-BO setups.
- Patch 2 replaces a runtime warning + -EINVAL with a lockdep assert (no-op on
- production kernels with PROVE_LOCKING off). Low risk.
+ production kernels with PROVE_LOCKING off). Low risk.
** Description changed:
[Impact]
System hangs on amdgpu during exec. systemd blocked for 368s in D state in
amdgpu_ctx_mgr_entity_flush (mutex_lock) via filp_close -> amdgpu_flush during
do_close_on_exec -> load_elf_binary. Reproduces on 6.17.0-1020-oem.
- kernel: INFO: task systemd:16426 blocked for more than 368 seconds.
- kernel: Tainted: G O 6.17.0-1020-oem #20-Ubuntu
- kernel: task:systemd state:D pid:16426
- kernel: mutex_lock
- kernel: amdgpu_ctx_mgr_entity_flush+0x46/0x1f0 [amdgpu]
- kernel: amdgpu_flush+0x26/0x50 [amdgpu]
- kernel: filp_flush+0x6f/0xb0
- kernel: filp_close+0x14/0x30
- kernel: do_close_on_exec+0xe7/0x140
- kernel: begin_new_exec+0x1ab/0x420
- kernel: load_elf_binary+0x32d/0xf40
+ kernel: INFO: task systemd:16426 blocked for more than 368 seconds.
+ kernel: Tainted: G O 6.17.0-1020-oem #20-Ubuntu
+ kernel: task:systemd state:D pid:16426
+ kernel: mutex_lock
+ kernel: amdgpu_ctx_mgr_entity_flush+0x46/0x1f0 [amdgpu]
+ kernel: amdgpu_flush+0x26/0x50 [amdgpu]
+ kernel: filp_flush+0x6f/0xb0
+ kernel: filp_close+0x14/0x30
+ kernel: do_close_on_exec+0xe7/0x140
+ kernel: begin_new_exec+0x1ab/0x420
+ kernel: load_elf_binary+0x32d/0xf40
[Fix]
Cherry-pick upstream commits:
- b18fc0ab837381 "drm/amdgpu: fix sync handling in amdgpu_dma_buf_move_notify"
- 930595df251c "drm/amdgpu: remove check for BO reservation add assert
instead"
Pass NULL ticket to amdgpu_vm_handle_moved in move_notify so the clear=true
path is used, avoiding a PT update while another process's job is still
running — the contention that blocks amdgpu_ctx_mgr_entity_flush.
[Test Plan]
Boot oem-6.17 on dual-GPU AMD system without P2P PCI; glxgears on GPU0,
Xorg on GPU1 sharing a dmabuf. Without fix: systemd hung >368s in
amdgpu_ctx_mgr_entity_flush. With fix: no hang.
[Where problems could occur]
- - Patch 1 changes only the ticket arg in amdgpu_dma_buf_move_notify; normal VM
- update path untouched. Risk limited to dmabuf move-notify sync on multi-GPU
- shared-BO setups.
- - Patch 2 replaces a runtime warning + -EINVAL with a lockdep assert (no-op on
- production kernels with PROVE_LOCKING off). Low risk.
+ Patch 1 changes only the ticket arg in amdgpu_dma_buf_move_notify; normal VM
+ update path untouched. Risk limited to dmabuf move-notify sync on multi-GPU
+ shared-BO setups.
+ A regression here may break the amdgpu driver.
** Summary changed:
- amdgpu: hang in amdgpu_ctx_mgr_entity_flush during exec
+ Fix amdgpu hang in amdgpu_ctx_mgr_entity_flush during exec
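The pass/fail check in the test plan (hung task present or absent) can be automated by scanning the kernel log for the hung-task report quoted in [Impact]. The following is a minimal sketch, not part of the bug report: the function name find_flush_hangs, the 40-line look-ahead window, and the assumption that task names contain no whitespace are all illustrative choices.

```python
import re

# Matches the hung-task header quoted in [Impact], e.g.
# "INFO: task systemd:16426 blocked for more than 368 seconds."
# Assumption: the task name contains no whitespace.
HUNG_RE = re.compile(
    r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds"
)


def find_flush_hangs(log_text, symbol="amdgpu_ctx_mgr_entity_flush", window=40):
    """Return (task, pid, seconds) for each hung-task report whose
    following trace lines mention `symbol`.

    `window` is a hypothetical cap on how many lines after the header
    are treated as part of the same report.
    """
    lines = log_text.splitlines()
    hits = []
    for i, line in enumerate(lines):
        match = HUNG_RE.search(line)
        if match is None:
            continue
        # Collect trace lines until the next hung-task header or the cap.
        trace = []
        for following in lines[i + 1 : i + 1 + window]:
            if HUNG_RE.search(following):
                break
            trace.append(following)
        if any(symbol in t for t in trace):
            hits.append(
                (match.group(1), int(match.group(2)), int(match.group(3)))
            )
    return hits
```

Feeding the function the text of the kernel log (for example the output of `journalctl -k`) after running the reproducer should return an empty list on a fixed kernel, and at least one entry such as the systemd report above on an affected one.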
--
https://bugs.launchpad.net/bugs/2158236
Title:
Fix amdgpu hang in amdgpu_ctx_mgr_entity_flush during exec