Re: [PATCH 4/6] drm/amdgpu: address remove from fault filter

2021-04-22 Thread philip yang

  


On 2021-04-21 3:22 a.m., Christian
  König wrote:

Am
  20.04.21 um 22:21 schrieb Philip Yang:
  
  Add interface to remove address from fault
filter ring by resetting

fault ring entry of the fault address timestamp to 0, then
future vm

fault on the address will be processed to recover.


Check fault address from fault ring, add address into fault ring
and

remove address from fault ring are serialized in same interrupt
deferred

work, don't have race condition.

  
  
  That might not work on Vega20.
  
  
  We call amdgpu_gmc_filter_faults() from the the IH while the fault
  handling id done from the delegated IH processing.
  

Added spinlock for VG20.

  
  More comments below.
  
  
  Signed-off-by: Philip Yang


---

  drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 24


  drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h |  2 ++

  2 files changed, 26 insertions(+)


diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c

index c39ed9eb0987..338e45fa66cb 100644

--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c

+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c

@@ -387,6 +387,30 @@ bool amdgpu_gmc_filter_faults(struct
amdgpu_device *adev, uint64_t addr,

  return false;

  }

  +/**

+ * amdgpu_gmc_filter_faults_remove - remove address from VM
faults filter

+ *

+ * @adev: amdgpu device structure

+ * @addr: address of the VM fault

+ * @pasid: PASID of the process causing the fault

+ *

+ * Remove the address from fault filter, then future vm fault
on this address

+ * will pass to retry fault handler to recover.

+ */

+void amdgpu_gmc_filter_faults_remove(struct amdgpu_device
*adev, uint64_t addr,

+ uint16_t pasid)

+{

+    struct amdgpu_gmc *gmc = >gmc;

+

+    uint64_t key = addr << 4 | pasid;

  
  
  We should probably have a function for this now.
  

add function fault_key in v2.

  
  +    struct amdgpu_gmc_fault *fault;

+    uint32_t hash;

+

+    hash = hash_64(key, AMDGPU_GMC_FAULT_HASH_ORDER);

+    fault =
>fault_ring[gmc->fault_hash[hash].idx];

+    fault->timestamp = 0;

  
  
  There is no guarantee that the ring entry you found for the fault
  is the one for this address.
  
  
  After all that is just an 8 bit hash for a 64bit values :)
  
  
  You need to double check the key and walk the chain by looking at
  the next entry to eventually find the right one.
  
  

I am not completely understand how fault->next and
  gmc->last_fault works, as it keep increasing. Please help
  review patch v2.
Thanks,
Philip

Christian.
  
  
  +}

+

  int amdgpu_gmc_ras_late_init(struct amdgpu_device *adev)

  {

  int r;

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h

index 9d11c02a3938..498a7a0d5a9e 100644

--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h

+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h

@@ -318,6 +318,8 @@ void amdgpu_gmc_agp_location(struct
amdgpu_device *adev,

   struct amdgpu_gmc *mc);

  bool amdgpu_gmc_filter_faults(struct amdgpu_device *adev,
uint64_t addr,

    uint16_t pasid, uint64_t timestamp);

+void amdgpu_gmc_filter_faults_remove(struct amdgpu_device
*adev, uint64_t addr,

+ uint16_t pasid);

  int amdgpu_gmc_ras_late_init(struct amdgpu_device *adev);

  void amdgpu_gmc_ras_fini(struct amdgpu_device *adev);

  int amdgpu_gmc_allocate_vm_inv_eng(struct amdgpu_device
*adev);

  
  

  

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH 4/6] drm/amdgpu: address remove from fault filter

2021-04-22 Thread philip yang

  


On 2021-04-21 11:29 a.m., Felix
  Kuehling wrote:

On
  2021-04-21 3:55 a.m., Christian König wrote:
  
  Am 21.04.21 um 03:20 schrieb Felix
Kuehling:

Am 2021-04-20 um 4:21 p.m. schrieb
  Philip Yang:
  
  Add interface to remove address from
fault filter ring by resetting

fault ring entry of the fault address timestamp to 0, then
future vm

fault on the address will be processed to recover.


Check fault address from fault ring, add address into fault
ring and

remove address from fault ring are serialized in same
interrupt deferred

work, don't have race condition.


Signed-off-by: Philip Yang 

---

  drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 24


  drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h |  2 ++

  2 files changed, 26 insertions(+)


diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c

index c39ed9eb0987..338e45fa66cb 100644

--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c

+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c

@@ -387,6 +387,30 @@ bool amdgpu_gmc_filter_faults(struct
amdgpu_device *adev, uint64_t addr,

  return false;

  }

  +/**

+ * amdgpu_gmc_filter_faults_remove - remove address from VM
faults filter

+ *

+ * @adev: amdgpu device structure

+ * @addr: address of the VM fault

+ * @pasid: PASID of the process causing the fault

+ *

+ * Remove the address from fault filter, then future vm
fault on this address

+ * will pass to retry fault handler to recover.

+ */

+void amdgpu_gmc_filter_faults_remove(struct amdgpu_device
*adev, uint64_t addr,

+ uint16_t pasid)

+{

+    struct amdgpu_gmc *gmc = >gmc;

+

+    uint64_t key = addr << 4 | pasid;

+    struct amdgpu_gmc_fault *fault;

+    uint32_t hash;

+

+    hash = hash_64(key, AMDGPU_GMC_FAULT_HASH_ORDER);

+    fault =
>fault_ring[gmc->fault_hash[hash].idx];

  
  You need to loop over the fault ring to find a fault with the
  matching
  
  key since there may be hash collisions.
  
  
  You also need to make sure you don't break the single link
  list of keys
  
  with the same hash when you remove an entry. I think the
  easier way to
  
  remove an entry without breaking this ring+closed hashing
  structure is
  
  to reset the fault->key rather than the
  fault->timestamp.
  
  
  Finally, you need to add locking to the fault ring structure.
  Currently
  
  it's not protected by any locks because only one thread (the
  interrupt
  
  handler) accesses it. Now you have another thread that can
  remove
  
  entries, so you need to protect it with a lock. If you are
  handling
  
  retry faults, you know that the interrupt handler is really a
  worker
  
  thread, so you can use a mutex or a spin-lock, but it doesn't
  need to be
  
  interrupt-safe.
  


I don't think you need a lock at all.


Just using cmpxchg() to update the key should do it.


Something like this:


    hash = hash_64(key, AMDGPU_GMC_FAULT_HASH_ORDER);

    fault =
>fault_ring[gmc->fault_hash[hash].idx];

    while (fault->timestamp >= stamp) {

    uint64_t tmp;


                cmpxchg(>key, key, 0);

  
  
  Good idea. Then we should probably use READ_ONCE and WRITE_ONCE to
  access fault->key in other places.
  

I will use spinlock to protect fault ring 

Re: [PATCH 4/6] drm/amdgpu: address remove from fault filter

2021-04-21 Thread Felix Kuehling

On 2021-04-21 3:55 a.m., Christian König wrote:

Am 21.04.21 um 03:20 schrieb Felix Kuehling:

Am 2021-04-20 um 4:21 p.m. schrieb Philip Yang:

Add interface to remove address from fault filter ring by resetting
fault ring entry of the fault address timestamp to 0, then future vm
fault on the address will be processed to recover.

Check fault address from fault ring, add address into fault ring and
remove address from fault ring are serialized in same interrupt 
deferred

work, don't have race condition.

Signed-off-by: Philip Yang 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 24 
  drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h |  2 ++
  2 files changed, 26 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c

index c39ed9eb0987..338e45fa66cb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
@@ -387,6 +387,30 @@ bool amdgpu_gmc_filter_faults(struct 
amdgpu_device *adev, uint64_t addr,

  return false;
  }
  +/**
+ * amdgpu_gmc_filter_faults_remove - remove address from VM faults 
filter

+ *
+ * @adev: amdgpu device structure
+ * @addr: address of the VM fault
+ * @pasid: PASID of the process causing the fault
+ *
+ * Remove the address from fault filter, then future vm fault on 
this address

+ * will pass to retry fault handler to recover.
+ */
+void amdgpu_gmc_filter_faults_remove(struct amdgpu_device *adev, 
uint64_t addr,

+ uint16_t pasid)
+{
+    struct amdgpu_gmc *gmc = >gmc;
+
+    uint64_t key = addr << 4 | pasid;
+    struct amdgpu_gmc_fault *fault;
+    uint32_t hash;
+
+    hash = hash_64(key, AMDGPU_GMC_FAULT_HASH_ORDER);
+    fault = >fault_ring[gmc->fault_hash[hash].idx];

You need to loop over the fault ring to find a fault with the matching
key since there may be hash collisions.

You also need to make sure you don't break the single link list of keys
with the same hash when you remove an entry. I think the easier way to
remove an entry without breaking this ring+closed hashing structure is
to reset the fault->key rather than the fault->timestamp.

Finally, you need to add locking to the fault ring structure. Currently
it's not protected by any locks because only one thread (the interrupt
handler) accesses it. Now you have another thread that can remove
entries, so you need to protect it with a lock. If you are handling
retry faults, you know that the interrupt handler is really a worker
thread, so you can use a mutex or a spin-lock, but it doesn't need to be
interrupt-safe.


I don't think you need a lock at all.

Just using cmpxchg() to update the key should do it.

Something like this:

    hash = hash_64(key, AMDGPU_GMC_FAULT_HASH_ORDER);
    fault = >fault_ring[gmc->fault_hash[hash].idx];
    while (fault->timestamp >= stamp) {
    uint64_t tmp;

                cmpxchg(>key, key, 0);


Good idea. Then we should probably use READ_ONCE and WRITE_ONCE to 
access fault->key in other places.


Regards,
  Felix




    tmp = fault->timestamp;
    fault = >fault_ring[fault->next];

    /* Check if the entry was reused */
    if (fault->timestamp >= tmp)
    break;
    }

Regards,
Christian.



Regards,
   Felix



+    fault->timestamp = 0;
+}
+
  int amdgpu_gmc_ras_late_init(struct amdgpu_device *adev)
  {
  int r;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h

index 9d11c02a3938..498a7a0d5a9e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
@@ -318,6 +318,8 @@ void amdgpu_gmc_agp_location(struct 
amdgpu_device *adev,

   struct amdgpu_gmc *mc);
  bool amdgpu_gmc_filter_faults(struct amdgpu_device *adev, uint64_t 
addr,

    uint16_t pasid, uint64_t timestamp);
+void amdgpu_gmc_filter_faults_remove(struct amdgpu_device *adev, 
uint64_t addr,

+ uint16_t pasid);
  int amdgpu_gmc_ras_late_init(struct amdgpu_device *adev);
  void amdgpu_gmc_ras_fini(struct amdgpu_device *adev);
  int amdgpu_gmc_allocate_vm_inv_eng(struct amdgpu_device *adev);

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfxdata=04%7C01%7Cfelix.kuehling%40amd.com%7C7d19870014ff40b6d80b08d9049ae8a0%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637545885642357593%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000sdata=b0LnO9SQPgYskIRH0vCjKebvh%2FYzFfidRne15q2WIXw%3Dreserved=0 




___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH 4/6] drm/amdgpu: address remove from fault filter

2021-04-21 Thread Christian König

Am 21.04.21 um 03:20 schrieb Felix Kuehling:

Am 2021-04-20 um 4:21 p.m. schrieb Philip Yang:

Add interface to remove address from fault filter ring by resetting
fault ring entry of the fault address timestamp to 0, then future vm
fault on the address will be processed to recover.

Check fault address from fault ring, add address into fault ring and
remove address from fault ring are serialized in same interrupt deferred
work, don't have race condition.

Signed-off-by: Philip Yang 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 24 
  drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h |  2 ++
  2 files changed, 26 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
index c39ed9eb0987..338e45fa66cb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
@@ -387,6 +387,30 @@ bool amdgpu_gmc_filter_faults(struct amdgpu_device *adev, 
uint64_t addr,
return false;
  }
  
+/**

+ * amdgpu_gmc_filter_faults_remove - remove address from VM faults filter
+ *
+ * @adev: amdgpu device structure
+ * @addr: address of the VM fault
+ * @pasid: PASID of the process causing the fault
+ *
+ * Remove the address from fault filter, then future vm fault on this address
+ * will pass to retry fault handler to recover.
+ */
+void amdgpu_gmc_filter_faults_remove(struct amdgpu_device *adev, uint64_t addr,
+uint16_t pasid)
+{
+   struct amdgpu_gmc *gmc = >gmc;
+
+   uint64_t key = addr << 4 | pasid;
+   struct amdgpu_gmc_fault *fault;
+   uint32_t hash;
+
+   hash = hash_64(key, AMDGPU_GMC_FAULT_HASH_ORDER);
+   fault = >fault_ring[gmc->fault_hash[hash].idx];

You need to loop over the fault ring to find a fault with the matching
key since there may be hash collisions.

You also need to make sure you don't break the single link list of keys
with the same hash when you remove an entry. I think the easier way to
remove an entry without breaking this ring+closed hashing structure is
to reset the fault->key rather than the fault->timestamp.

Finally, you need to add locking to the fault ring structure. Currently
it's not protected by any locks because only one thread (the interrupt
handler) accesses it. Now you have another thread that can remove
entries, so you need to protect it with a lock. If you are handling
retry faults, you know that the interrupt handler is really a worker
thread, so you can use a mutex or a spin-lock, but it doesn't need to be
interrupt-safe.


I don't think you need a lock at all.

Just using cmpxchg() to update the key should do it.

Something like this:

    hash = hash_64(key, AMDGPU_GMC_FAULT_HASH_ORDER);
    fault = >fault_ring[gmc->fault_hash[hash].idx];
    while (fault->timestamp >= stamp) {
    uint64_t tmp;

                cmpxchg(>key, key, 0);

    tmp = fault->timestamp;
    fault = >fault_ring[fault->next];

    /* Check if the entry was reused */
    if (fault->timestamp >= tmp)
    break;
    }

Regards,
Christian.



Regards,
   Felix



+   fault->timestamp = 0;
+}
+
  int amdgpu_gmc_ras_late_init(struct amdgpu_device *adev)
  {
int r;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
index 9d11c02a3938..498a7a0d5a9e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
@@ -318,6 +318,8 @@ void amdgpu_gmc_agp_location(struct amdgpu_device *adev,
 struct amdgpu_gmc *mc);
  bool amdgpu_gmc_filter_faults(struct amdgpu_device *adev, uint64_t addr,
  uint16_t pasid, uint64_t timestamp);
+void amdgpu_gmc_filter_faults_remove(struct amdgpu_device *adev, uint64_t addr,
+uint16_t pasid);
  int amdgpu_gmc_ras_late_init(struct amdgpu_device *adev);
  void amdgpu_gmc_ras_fini(struct amdgpu_device *adev);
  int amdgpu_gmc_allocate_vm_inv_eng(struct amdgpu_device *adev);

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH 4/6] drm/amdgpu: address remove from fault filter

2021-04-21 Thread Christian König

Am 20.04.21 um 22:21 schrieb Philip Yang:

Add interface to remove address from fault filter ring by resetting
fault ring entry of the fault address timestamp to 0, then future vm
fault on the address will be processed to recover.

Check fault address from fault ring, add address into fault ring and
remove address from fault ring are serialized in same interrupt deferred
work, don't have race condition.


That might not work on Vega20.

We call amdgpu_gmc_filter_faults() from the the IH while the fault 
handling id done from the delegated IH processing.


More comments below.


Signed-off-by: Philip Yang 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 24 
  drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h |  2 ++
  2 files changed, 26 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
index c39ed9eb0987..338e45fa66cb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
@@ -387,6 +387,30 @@ bool amdgpu_gmc_filter_faults(struct amdgpu_device *adev, 
uint64_t addr,
return false;
  }
  
+/**

+ * amdgpu_gmc_filter_faults_remove - remove address from VM faults filter
+ *
+ * @adev: amdgpu device structure
+ * @addr: address of the VM fault
+ * @pasid: PASID of the process causing the fault
+ *
+ * Remove the address from fault filter, then future vm fault on this address
+ * will pass to retry fault handler to recover.
+ */
+void amdgpu_gmc_filter_faults_remove(struct amdgpu_device *adev, uint64_t addr,
+uint16_t pasid)
+{
+   struct amdgpu_gmc *gmc = >gmc;
+
+   uint64_t key = addr << 4 | pasid;


We should probably have a function for this now.


+   struct amdgpu_gmc_fault *fault;
+   uint32_t hash;
+
+   hash = hash_64(key, AMDGPU_GMC_FAULT_HASH_ORDER);
+   fault = >fault_ring[gmc->fault_hash[hash].idx];
+   fault->timestamp = 0;


There is no guarantee that the ring entry you found for the fault is the 
one for this address.


After all that is just an 8 bit hash for a 64bit values :)

You need to double check the key and walk the chain by looking at the 
next entry to eventually find the right one.


Christian.


+}
+
  int amdgpu_gmc_ras_late_init(struct amdgpu_device *adev)
  {
int r;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
index 9d11c02a3938..498a7a0d5a9e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
@@ -318,6 +318,8 @@ void amdgpu_gmc_agp_location(struct amdgpu_device *adev,
 struct amdgpu_gmc *mc);
  bool amdgpu_gmc_filter_faults(struct amdgpu_device *adev, uint64_t addr,
  uint16_t pasid, uint64_t timestamp);
+void amdgpu_gmc_filter_faults_remove(struct amdgpu_device *adev, uint64_t addr,
+uint16_t pasid);
  int amdgpu_gmc_ras_late_init(struct amdgpu_device *adev);
  void amdgpu_gmc_ras_fini(struct amdgpu_device *adev);
  int amdgpu_gmc_allocate_vm_inv_eng(struct amdgpu_device *adev);


___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH 4/6] drm/amdgpu: address remove from fault filter

2021-04-20 Thread Felix Kuehling
Am 2021-04-20 um 4:21 p.m. schrieb Philip Yang:
> Add interface to remove address from fault filter ring by resetting
> fault ring entry of the fault address timestamp to 0, then future vm
> fault on the address will be processed to recover.
>
> Check fault address from fault ring, add address into fault ring and
> remove address from fault ring are serialized in same interrupt deferred
> work, don't have race condition.
>
> Signed-off-by: Philip Yang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 24 
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h |  2 ++
>  2 files changed, 26 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> index c39ed9eb0987..338e45fa66cb 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> @@ -387,6 +387,30 @@ bool amdgpu_gmc_filter_faults(struct amdgpu_device 
> *adev, uint64_t addr,
>   return false;
>  }
>  
> +/**
> + * amdgpu_gmc_filter_faults_remove - remove address from VM faults filter
> + *
> + * @adev: amdgpu device structure
> + * @addr: address of the VM fault
> + * @pasid: PASID of the process causing the fault
> + *
> + * Remove the address from fault filter, then future vm fault on this address
> + * will pass to retry fault handler to recover.
> + */
> +void amdgpu_gmc_filter_faults_remove(struct amdgpu_device *adev, uint64_t 
> addr,
> +  uint16_t pasid)
> +{
> + struct amdgpu_gmc *gmc = >gmc;
> +
> + uint64_t key = addr << 4 | pasid;
> + struct amdgpu_gmc_fault *fault;
> + uint32_t hash;
> +
> + hash = hash_64(key, AMDGPU_GMC_FAULT_HASH_ORDER);
> + fault = >fault_ring[gmc->fault_hash[hash].idx];

You need to loop over the fault ring to find a fault with the matching
key since there may be hash collisions.

You also need to make sure you don't break the single link list of keys
with the same hash when you remove an entry. I think the easier way to
remove an entry without breaking this ring+closed hashing structure is
to reset the fault->key rather than the fault->timestamp.

Finally, you need to add locking to the fault ring structure. Currently
it's not protected by any locks because only one thread (the interrupt
handler) accesses it. Now you have another thread that can remove
entries, so you need to protect it with a lock. If you are handling
retry faults, you know that the interrupt handler is really a worker
thread, so you can use a mutex or a spin-lock, but it doesn't need to be
interrupt-safe.

Regards,
  Felix


> + fault->timestamp = 0;
> +}
> +
>  int amdgpu_gmc_ras_late_init(struct amdgpu_device *adev)
>  {
>   int r;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
> index 9d11c02a3938..498a7a0d5a9e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
> @@ -318,6 +318,8 @@ void amdgpu_gmc_agp_location(struct amdgpu_device *adev,
>struct amdgpu_gmc *mc);
>  bool amdgpu_gmc_filter_faults(struct amdgpu_device *adev, uint64_t addr,
> uint16_t pasid, uint64_t timestamp);
> +void amdgpu_gmc_filter_faults_remove(struct amdgpu_device *adev, uint64_t 
> addr,
> +  uint16_t pasid);
>  int amdgpu_gmc_ras_late_init(struct amdgpu_device *adev);
>  void amdgpu_gmc_ras_fini(struct amdgpu_device *adev);
>  int amdgpu_gmc_allocate_vm_inv_eng(struct amdgpu_device *adev);
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx