From: Antonino Maniscalco <[email protected]>

[ Upstream commit 10fb1b2fcaee5545a5e54db1ed4d7b15c2db50c8 ]

If two fault IRQs arrive in short succession, recovery work will be
queued up twice.

When recovery runs a second time it may end up killing an unrelated
context.

Prevent this by masking off interrupts when triggering recovery.

Signed-off-by: Antonino Maniscalco <[email protected]>
Reviewed-by: Akhil P Oommen <[email protected]>
Patchwork: https://patchwork.freedesktop.org/patch/670023/
Signed-off-by: Rob Clark <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
---

LLM Generated explanations, may be completely bogus:

YES

- What it fixes: Two fault IRQs arriving back-to-back can queue
  `recover_work` twice; the second recovery may kill an unrelated
  context. The change masks interrupts before queuing recovery so
  subsequent fault IRQs don’t re-queue recovery.
- Core change: In `drivers/gpu/drm/msm/adreno/a6xx_gpu.c:1821-1824`,
  `a6xx_fault_detect_irq()` now issues `gpu_write(gpu,
  REG_A6XX_RBBM_INT_0_MASK, 0);` to mask all RBBM interrupts before
  calling `kthread_queue_work(gpu->worker, &gpu->recover_work);`. The
  write is placed right after the hangcheck timer is disabled, ensuring
  no further spurious recovery triggers while the first recovery
  proceeds.
- Call path impact: `a6xx_irq()` invokes `a6xx_fault_detect_irq()` on
  hang detect (`drivers/gpu/drm/msm/adreno/a6xx_gpu.c:1891-1900`). With
  the new mask, subsequent IRQs won’t retrigger recovery for the same
  incident.
- Interrupts are safely restored: During recovery, the GPU is
  reinitialized and interrupts are re-enabled in `a6xx_hw_init()` via
  `REG_A6XX_RBBM_INT_0_MASK` (sets `A6XX_INT_MASK`/`A7XX_INT_MASK`) at
  `drivers/gpu/drm/msm/adreno/a6xx_gpu.c:1410-1413`. `a6xx_recover()`
  calls `msm_gpu_hw_init(gpu)` to perform this re-init
  (`drivers/gpu/drm/msm/adreno/a6xx_gpu.c:1614`). Additionally, the top-
  level wrapper brackets `hw_init` with `disable_irq/enable_irq`
  (`drivers/gpu/drm/msm/msm_gpu.c:168-174`), so the flow cleanly unmasks
  after reset.
- Consistency with existing patterns: A similar mask-on-fault pattern
  already exists for a7xx SW fuse violations
  (`drivers/gpu/drm/msm/adreno/a6xx_gpu.c:1831-1834`), indicating this
  is the established approach to prevent repeated fault handling.
- Stable suitability:
  - User-visible bugfix: Prevents erroneous second recovery that can
    kill unrelated contexts.
  - Small and contained: One register write in an error path; no ABI or
    feature changes.
  - Low regression risk: Interrupts are restored during the normal
    recovery/reinit path; only affects a6xx hang/fault handling.
  - No architectural churn, limited to DRM/MSM Adreno a6xx driver.

Conclusion: This is a minimal, targeted fix for a real correctness issue
with low risk and clear recovery restore points, making it a good
candidate for backporting to all supported stable kernels that include
the a6xx driver.

 drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
index 45dd5fd1c2bfc..f8992a68df7fb 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
@@ -1727,6 +1727,9 @@ static void a6xx_fault_detect_irq(struct msm_gpu *gpu)
        /* Turn off the hangcheck timer to keep it from bothering us */
        timer_delete(&gpu->hangcheck_timer);
 
+       /* Turn off interrupts to avoid triggering recovery again */
+       gpu_write(gpu, REG_A6XX_RBBM_INT_0_MASK, 0);
+
        kthread_queue_work(gpu->worker, &gpu->recover_work);
 }
 
-- 
2.51.0
