amdgpu: introduce a kind of halt state for amdgpu device

Andrey Grodzovsky Thu, 09 Dec 2021 09:01:45 -0800


On 2021-12-09 4:00 a.m., Christian König wrote:



On 09.12.21 at 09:49, Lang Yu wrote:

It is useful to maintain error context when debugging
SW/FW issues. We introduce amdgpu_device_halt() for this
purpose. It will bring hardware to a kind of halt state,
so that no one can touch it any more.

Compared to a simple hang, the system will stay stable
at least for SSH access. Then it should be trivial to
inspect the hardware state and see what's going on.

Suggested-by: Christian Koenig <christian.koe...@amd.com>
Suggested-by: Andrey Grodzovsky <andrey.grodzov...@amd.com>
Signed-off-by: Lang Yu <lang...@amd.com>
---
  drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  2 ++
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 39 ++++++++++++++++++++++
  2 files changed, 41 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h

index c5cfe2926ca1..3f5f8f62aa5c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h

@@ -1317,6 +1317,8 @@ void amdgpu_device_flush_hdp(struct amdgpu_device *adev,

  void amdgpu_device_invalidate_hdp(struct amdgpu_device *adev,
          struct amdgpu_ring *ring);
+void amdgpu_device_halt(struct amdgpu_device *adev);
+
  /* atpx handler */
  #if defined(CONFIG_VGA_SWITCHEROO)
  void amdgpu_register_atpx_handler(void);

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

index a1c14466f23d..62216627cc83 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

@@ -5634,3 +5634,42 @@ void amdgpu_device_invalidate_hdp(struct amdgpu_device *adev,

        amdgpu_asic_invalidate_hdp(adev, ring);
  }
+
+/**
+ * amdgpu_device_halt() - bring hardware to some kind of halt state
+ *
+ * @adev: amdgpu_device pointer
+ *

+ * Bring hardware to some kind of halt state so that no one can touch it
+ * any more. It will help to maintain error context when an error occurs.
+ * Compared to a simple hang, the system will stay stable at least for SSH

+ * access. Then it should be trivial to inspect the hardware state and
+ * see what's going on. Implemented as follows:
+ *

+ * 1. drm_dev_unplug() makes device inaccessible to userspace (IOCTLs, etc),
+ *    clears all CPU mappings to device, disallows remappings through page faults

+ * 2. amdgpu_irq_disable_all() disables all interrupts
+ * 3. amdgpu_fence_driver_hw_fini() signals all HW fences
+ * 4. amdgpu_device_unmap_mmio() clears all MMIO mappings
+ * 5. pci_disable_device() and pci_wait_for_pending_transaction()
+ *    flush any in flight DMA operations
+ * 6. set adev->no_hw_access to true
+ */
+void amdgpu_device_halt(struct amdgpu_device *adev)
+{
+    struct pci_dev *pdev = adev->pdev;
+    struct drm_device *ddev = &adev->ddev;
+
+    drm_dev_unplug(ddev);
+
+    amdgpu_irq_disable_all(adev);
+
+    amdgpu_fence_driver_hw_fini(adev);
+
+    amdgpu_device_unmap_mmio(adev);

Note that this one will cause a page fault on any subsequent MMIO access (through registers or by direct VRAM access).

+
+    pci_disable_device(pdev);
+    pci_wait_for_pending_transaction(pdev);
+
+    adev->no_hw_access = true;
I think we need to reorder this, e.g. set adev->no_hw_access much earlier. Andrey, what do you think?

Earlier can be OK, but at least after the last HW configuration we actually want to do, such as disabling IRQs.


Andrey


Apart from that sounds like the right idea to me.

Regards,
Christian.

+}
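
To make the ordering question concrete, here is a small user-space mock of the sequence. It is only a sketch under stated assumptions, not driver code: struct mock_adev, every mock_* function, and the "skip the access when no_hw_access is set" guard in mock_rreg() are hypothetical stand-ins invented for illustration. The placement of the flag (after the fence teardown, before the MMIO unmap) is one possible reading of "after the last HW configuration", assuming the fence teardown may still touch hardware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct amdgpu_device: only what this sketch needs. */
struct mock_adev {
	bool no_hw_access;   /* mirrors adev->no_hw_access from the patch */
	bool mmio_mapped;    /* false once the mock unmap step has run */
	uint32_t fake_reg;   /* pretend register contents */
	char order[160];     /* records the sequence of steps, for inspection */
};

enum mock_read_result { MOCK_READ_OK, MOCK_READ_SKIPPED, MOCK_READ_FAULT };

static void record(struct mock_adev *adev, const char *step)
{
	strncat(adev->order, step, sizeof(adev->order) - strlen(adev->order) - 1);
}

/* Stubs named after the steps in the patch; they only record that they ran. */
static void mock_drm_dev_unplug(struct mock_adev *adev)       { record(adev, "unplug,"); }
static void mock_irq_disable_all(struct mock_adev *adev)      { record(adev, "irq_disable,"); }
static void mock_fence_driver_hw_fini(struct mock_adev *adev) { record(adev, "fence_hw_fini,"); }
static void mock_pci_disable(struct mock_adev *adev)          { record(adev, "pci_disable,"); }

static void mock_unmap_mmio(struct mock_adev *adev)
{
	adev->mmio_mapped = false;
	record(adev, "unmap_mmio,");
}

/* Mock register read: the flag is checked first, so the mapping is never touched. */
enum mock_read_result mock_rreg(struct mock_adev *adev, uint32_t *val)
{
	assert(adev && val);
	if (adev->no_hw_access)
		return MOCK_READ_SKIPPED;
	if (!adev->mmio_mapped)
		return MOCK_READ_FAULT;  /* stands in for the page fault noted above */
	*val = adev->fake_reg;
	return MOCK_READ_OK;
}

/* Reordered halt sequence (assumed placement of the flag, see lead-in). */
void mock_device_halt(struct mock_adev *adev)
{
	mock_drm_dev_unplug(adev);
	mock_irq_disable_all(adev);
	mock_fence_driver_hw_fini(adev);

	/* Last step that may program hardware is done: forbid access before unmapping. */
	adev->no_hw_access = true;
	record(adev, "no_hw_access,");

	mock_unmap_mmio(adev);
	mock_pci_disable(adev);
}
```

With the flag set before the unmap step, a late mock_rreg() call in this mock is skipped instead of reaching the unmapped range. With the original ordering there would be a window between the unmap and the flag assignment in which such a read hits the fault path.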

Re: [PATCH 1/2] drm/amdgpu: introduce a kind of halt state for amdgpu device
