[AMD Official Use Only - AMD Internal Distribution Only]

-----Original Message-----
From: amd-gfx <[email protected]> On Behalf Of 
[email protected]
Sent: Tuesday, January 20, 2026 1:12 PM
To: [email protected]
Subject: amd-gfx Digest, Vol 116, Issue 317

Send amd-gfx mailing list submissions to
        [email protected]

To subscribe or unsubscribe via the World Wide Web, visit
        https://lists.freedesktop.org/mailman/listinfo/amd-gfx
or, via email, send a message with subject or body 'help' to
        [email protected]

You can reach the person managing the list at
        [email protected]

When replying, please edit your Subject line so it is more specific than "Re: 
Contents of amd-gfx digest..."


Today's Topics:

   1. Re: [PATCH 10/10] drm/amdgpu: rework ring reset backup and
      reemit (Timur Kristóf)
   2. [PATCH] drm/amdgpu: mark invalid records with U64_MAX
      (Gangliang Xie)
   3. RE: [PATCH] drm/amdgpu: mark invalid records with U64_MAX
      (Zhou1, Tao)


----------------------------------------------------------------------

Message: 1
Date: Tue, 20 Jan 2026 03:41:02 +0100
From: Timur Kristóf <[email protected]>
To: [email protected]
Cc: Alex Deucher <[email protected]>, Alex Deucher
        <[email protected]>
Subject: Re: [PATCH 10/10] drm/amdgpu: rework ring reset backup and
        reemit
Message-ID: <27258781.1r3eYUQgxm@timur-max>
Content-Type: text/plain; charset="utf-8"

On Friday, 16 January 2026 at 17:20:27 Central European Standard Time, Alex Deucher wrote:
> Store the start and end wptrs in the IB fence. On queue reset, only
> save the ring contents of the non-guilty contexts.
> Since the VM fence is a sub-fence of the IB fence, the IB fence
> stores the start and end wptrs for both fences as the IB state
> encapsulates the VM fence state.
>
> For reemit, only reemit the state for non-guilty contexts; for guilty
> contexts, just emit the fences.  Split the reemit per fence; when
> we reemit, update the start and end wptrs with the new values from
> reemit.  This allows us to reemit jobs repeatedly as the wptrs get
> properly updated each time.  Submitting the fence directly also
> simplifies the logic as we no longer need to store additional wptrs
> for each fence within the IB ring sequence.  If the context is guilty,
> we emit the fence(s) and if not, we reemit the whole IB ring sequence
> for that IB.
>

Hi Alex,

I have a few ideas that may help simplify this:

1. We could rework the pipeline sync so that it doesn't depend on the fence of 
the previous submission, similarly to how you already did for gfx9.
- Then it becomes unnecessary to re-emit the guilty fence or track where the 
fence was emitted.
- This also fixes queue reset for the case when the maximum number of scheduled 
submissions is greater than two.
- For those engines that do not have something like VS/PS/CS_PARTIAL_FLUSH and 
ACQUIRE_MEM, you can emit a fence + wait_reg inside the pipeline sync.

2. When backing up the ring contents, you could replace submissions that belong 
to the guilty process with NOPs, and remember the start position of the first 
submission inside the ring buffer. When restoring, you can pad the ring buffer 
with NOPs until you get back to the same position.
- This way, all the pointers to things inside the ring will remain valid and no 
extra logic is needed when re-emitting the same content repeatedly.

Hope this helps,
Timur

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 98 +++++++++--------------
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c    |  9 +--
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c  | 37 +--------
> drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h  | 20 ++---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c    |  1 +
>  5 files changed, 54 insertions(+), 111 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index d48f61076c06a..541cdbe1e28bf 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -89,16 +89,6 @@ static u32 amdgpu_fence_read(struct amdgpu_ring *ring)
>       return seq;
>  }
>
> -static void amdgpu_fence_save_fence_wptr_start(struct amdgpu_fence *af)
> -{
> -     af->fence_wptr_start = af->ring->wptr;
> -}
> -
> -static void amdgpu_fence_save_fence_wptr_end(struct amdgpu_fence *af)
> -{
> -     af->fence_wptr_end = af->ring->wptr;
> -}
> -
>  /**
>   * amdgpu_fence_emit - emit a fence on the requested ring
>   *
> @@ -126,11 +116,10 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, struct amdgpu_fence *af,
>                      &ring->fence_drv.lock,
>                      adev->fence_context + ring->idx, seq);
>
> -     amdgpu_fence_save_fence_wptr_start(af);
> +     af->flags = flags | AMDGPU_FENCE_FLAG_INT;
>       amdgpu_ring_emit_fence(ring, ring->fence_drv.gpu_addr,
> -                            seq, flags | AMDGPU_FENCE_FLAG_INT);
> -     amdgpu_fence_save_fence_wptr_end(af);
> -     amdgpu_fence_save_wptr(af);
> +                            seq, af->flags);
> +
>       pm_runtime_get_noresume(adev_to_drm(adev)->dev);
>       ptr = &ring->fence_drv.fences[seq & ring->fence_drv.num_fences_mask];
>       if (unlikely(rcu_dereference_protected(*ptr, 1))) {
> @@ -241,7 +230,6 @@ bool amdgpu_fence_process(struct amdgpu_ring *ring)
>
>       do {
>               struct dma_fence *fence, **ptr;
> -             struct amdgpu_fence *am_fence;
>
>               ++last_seq;
>               last_seq &= drv->num_fences_mask;
> @@ -254,12 +242,6 @@ bool amdgpu_fence_process(struct amdgpu_ring *ring)
>               if (!fence)
>                       continue;
>
> -             /* Save the wptr in the fence driver so we know what the last processed
> -              * wptr was.  This is required for re-emitting the ring state for
> -              * queues that are reset but are not guilty and thus have no guilty fence.
> -              */
> -             am_fence = container_of(fence, struct amdgpu_fence, base);
> -             drv->signalled_wptr = am_fence->wptr;
>               dma_fence_signal(fence);
>               dma_fence_put(fence);
>               pm_runtime_mark_last_busy(adev_to_drm(adev)->dev);
> @@ -727,25 +709,27 @@ void amdgpu_fence_driver_force_completion(struct amdgpu_ring *ring,
>  */
>
>  /**
> - * amdgpu_fence_driver_update_timedout_fence_state - Update fence state and set errors
> + * amdgpu_ring_set_fence_errors_and_reemit - Set dma_fence errors and reemit
>   *
> - * @af: fence of the ring to update
> + * @guilty_fence: fence of the ring to update
>   *
>   */
> -void amdgpu_fence_driver_update_timedout_fence_state(struct amdgpu_fence *af)
> +void amdgpu_ring_set_fence_errors_and_reemit(struct amdgpu_ring *ring,
> +                                             struct amdgpu_fence *guilty_fence)
>  {
>       struct dma_fence *unprocessed;
>       struct dma_fence __rcu **ptr;
>       struct amdgpu_fence *fence;
> -     struct amdgpu_ring *ring = af->ring;
>       unsigned long flags;
>       u32 seq, last_seq;
> -     bool reemitted = false;
> +     unsigned int i;
>
>       last_seq = amdgpu_fence_read(ring) & ring->fence_drv.num_fences_mask;
>       seq = ring->fence_drv.sync_seq & ring->fence_drv.num_fences_mask;
>
>       /* mark all fences from the guilty context with an error */
> +     amdgpu_ring_alloc(ring, ring->ring_backup_entries_to_copy +
> +                       20 * ring->guilty_fence_count);
>       spin_lock_irqsave(&ring->fence_drv.lock, flags);
>       do {
>               last_seq++;
> @@ -758,39 +742,41 @@ void amdgpu_fence_driver_update_timedout_fence_state(struct amdgpu_fence *af)
>               if (unprocessed && !dma_fence_is_signaled_locked(unprocessed)) {
>                       fence = container_of(unprocessed, struct amdgpu_fence, base);
>
> -                     if (fence->reemitted > 1)
> -                             reemitted = true;
> -                     else if (fence == af)
> +                     if (fence == guilty_fence)
>                               dma_fence_set_error(&fence->base, -ETIME);
> -                     else if (fence->context == af->context)
> +                     else if (fence->context == guilty_fence->context)
>                               dma_fence_set_error(&fence->base, -ECANCELED);
> +
> +                     if (fence->context == guilty_fence->context) {
> +                             /* just emit the fence for the guilty context */
> +                             amdgpu_ring_emit_fence(ring, ring->fence_drv.gpu_addr,
> +                                                    fence->base.seqno, fence->flags);
> +                     } else {
> +                             /* reemit the packet stream and update wptrs */
> +                             fence->wptr_start = ring->wptr;
> +                             for (i = 0; i < fence->backup_size; i++)
> +                                     amdgpu_ring_write(ring, ring->ring_backup[fence->backup_idx + i]);
> +                             fence->wptr_end = ring->wptr;
> +                     }
>               }
>               rcu_read_unlock();
>       } while (last_seq != seq);
>       spin_unlock_irqrestore(&ring->fence_drv.lock, flags);
> -
> -     if (reemitted) {
> -             /* if we've already reemitted once then just cancel everything */
> -             amdgpu_fence_driver_force_completion(af->ring, &af->base);
> -             af->ring->ring_backup_entries_to_copy = 0;
> -     }
> -}
> -
> -void amdgpu_fence_save_wptr(struct amdgpu_fence *af)
> -{
> -     af->wptr = af->ring->wptr;
> +     amdgpu_ring_commit(ring);
>  }
>
>  static void amdgpu_ring_backup_unprocessed_command(struct amdgpu_ring *ring,
> -                                                 u64 start_wptr, u64 end_wptr)
> +                                                struct amdgpu_fence *af)
>  {
> -     unsigned int first_idx = start_wptr & ring->buf_mask;
> -     unsigned int last_idx = end_wptr & ring->buf_mask;
> +     unsigned int first_idx = af->wptr_start & ring->buf_mask;
> +     unsigned int last_idx = af->wptr_end & ring->buf_mask;
>       unsigned int i;
>
>       /* Backup the contents of the ring buffer. */
> +     af->backup_idx = ring->ring_backup_entries_to_copy;
>       for (i = first_idx; i != last_idx; ++i, i &= ring->buf_mask)
>               ring->ring_backup[ring->ring_backup_entries_to_copy++] = ring->ring[i];
> +     af->backup_size = ring->ring_backup_entries_to_copy - af->backup_idx;
>  }
>
>  void amdgpu_ring_backup_unprocessed_commands(struct amdgpu_ring *ring,
> @@ -799,13 +785,12 @@ void amdgpu_ring_backup_unprocessed_commands(struct amdgpu_ring *ring,
>       struct dma_fence *unprocessed;
>       struct dma_fence __rcu **ptr;
>       struct amdgpu_fence *fence;
> -     u64 wptr;
>       u32 seq, last_seq;
>
>       last_seq = amdgpu_fence_read(ring) & ring->fence_drv.num_fences_mask;
>       seq = ring->fence_drv.sync_seq & ring->fence_drv.num_fences_mask;
> -     wptr = ring->fence_drv.signalled_wptr;
>       ring->ring_backup_entries_to_copy = 0;
> +     ring->guilty_fence_count = 0;
>
>       do {
>               last_seq++;
> @@ -818,21 +803,12 @@ void amdgpu_ring_backup_unprocessed_commands(struct amdgpu_ring *ring,
>               if (unprocessed && !dma_fence_is_signaled(unprocessed)) {
>                       fence = container_of(unprocessed, struct amdgpu_fence, base);
>
> -                     /* save everything if the ring is not guilty, otherwise
> -                      * just save the content from other contexts.
> -                      */
> -                     if (!fence->reemitted &&
> -                         (!guilty_fence || (fence->context != guilty_fence->context))) {
> -                             amdgpu_ring_backup_unprocessed_command(ring, wptr,
> -                                                                    fence->wptr);
> -                     } else if (!fence->reemitted) {
> -                             /* always save the fence */
> -                             amdgpu_ring_backup_unprocessed_command(ring,
> -                                                                    fence->fence_wptr_start,
> -                                                                    fence->fence_wptr_end);
> -                     }
> -                     wptr = fence->wptr;
> -                     fence->reemitted++;
> +                     /* keep track of the guilty fences for reemit */
> +                     if (fence->context == guilty_fence->context)
> +                             ring->guilty_fence_count++;
> +                     /* Non-vm fence has all the state.  Backup the non-guilty contexts */
> +                     if (!fence->is_vm_fence && (fence->context != guilty_fence->context))
> +                             amdgpu_ring_backup_unprocessed_command(ring, fence);
>               }
>               rcu_read_unlock();
>       } while (last_seq != seq);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
> index fb58637ed1507..865a803026c3b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
> @@ -185,6 +185,7 @@ int amdgpu_ib_schedule(struct amdgpu_ring *ring, struct amdgpu_job *job,
>               dev_err(adev->dev, "scheduling IB failed (%d).\n", r);
>               return r;
>       }
> +     af->wptr_start = ring->wptr;
>
>       need_ctx_switch = ring->current_ctx != fence_ctx;
>       if (ring->funcs->emit_pipeline_sync && job &&
> @@ -303,13 +304,7 @@ int amdgpu_ib_schedule(struct amdgpu_ring *ring, struct amdgpu_job *job,
>               ring->hw_prio == AMDGPU_GFX_PIPE_PRIO_HIGH)
>               ring->funcs->emit_wave_limit(ring, false);
>
> -     /* Save the wptr associated with this fence.
> -      * This must be last for resets to work properly
> -      * as we need to save the wptr associated with this
> -      * fence so we know what rings contents to backup
> -      * after we reset the queue.
> -      */
> -     amdgpu_fence_save_wptr(af);
> +     af->wptr_end = ring->wptr;
>
>       amdgpu_ring_ib_end(ring);
>       amdgpu_ring_commit(ring);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> index b82357c657237..df37335127979 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> @@ -104,29 +104,6 @@ int amdgpu_ring_alloc(struct amdgpu_ring *ring, unsigned int ndw)
>       return 0;
>  }
>
> -/**
> - * amdgpu_ring_alloc_reemit - allocate space on the ring buffer for reemit
> - *
> - * @ring: amdgpu_ring structure holding ring information
> - * @ndw: number of dwords to allocate in the ring buffer
> - *
> - * Allocate @ndw dwords in the ring buffer (all asics).
> - * doesn't check the max_dw limit as we may be reemitting
> - * several submissions.
> - */
> -static void amdgpu_ring_alloc_reemit(struct amdgpu_ring *ring, unsigned int ndw)
> -{
> -     /* Align requested size with padding so unlock_commit can
> -      * pad safely */
> -     ndw = (ndw + ring->funcs->align_mask) & ~ring->funcs->align_mask;
> -
> -     ring->count_dw = ndw;
> -     ring->wptr_old = ring->wptr;
> -
> -     if (ring->funcs->begin_use)
> -             ring->funcs->begin_use(ring);
> -}
> -
>  /**
>   * amdgpu_ring_insert_nop - insert NOP packets
>   *
> @@ -877,7 +854,6 @@ void amdgpu_ring_reset_helper_begin(struct amdgpu_ring *ring,
>  int amdgpu_ring_reset_helper_end(struct amdgpu_ring *ring,
>                                struct amdgpu_fence *guilty_fence)
>  {
> -     unsigned int i;
>       int r;
>
>       /* verify that the ring is functional */
> @@ -885,16 +861,9 @@ int amdgpu_ring_reset_helper_end(struct amdgpu_ring *ring,
>       if (r)
>               return r;
>
> -     /* set an error on all fences from the context */
> -     if (guilty_fence)
> -             amdgpu_fence_driver_update_timedout_fence_state(guilty_fence);
> -     /* Re-emit the non-guilty commands */
> -     if (ring->ring_backup_entries_to_copy) {
> -             amdgpu_ring_alloc_reemit(ring, ring->ring_backup_entries_to_copy);
> -             for (i = 0; i < ring->ring_backup_entries_to_copy; i++)
> -                     amdgpu_ring_write(ring, ring->ring_backup[i]);
> -             amdgpu_ring_commit(ring);
> -     }
> +     /* set an error on all fences from the context and reemit */
> +     amdgpu_ring_set_fence_errors_and_reemit(ring, guilty_fence);
> +
>       /* Start the scheduler again */
>       drm_sched_wqueue_start(&ring->sched);
>       return 0;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> index ce095427611fb..6dca1ccd2fc2e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> @@ -121,7 +121,6 @@ struct amdgpu_fence_driver {
>       /* sync_seq is protected by ring emission lock */
>       uint32_t                        sync_seq;
>       atomic_t                        last_seq;
> -     u64                             signalled_wptr;
>       bool                            initialized;
>       struct amdgpu_irq_src           *irq_src;
>       unsigned                        irq_type;
> @@ -146,15 +145,17 @@ struct amdgpu_fence {
>       struct amdgpu_ring              *ring;
>       ktime_t                         start_timestamp;
>
> +     bool                            is_vm_fence;
> +     unsigned int                    flags;
>       /* wptr for the total submission for resets */
> -     u64                             wptr;
> +     u64                             wptr_start;
> +     u64                             wptr_end;
>       /* fence context for resets */
>       u64                             context;
> -     /* has this fence been reemitted */
> -     unsigned int                    reemitted;
> -     /* wptr for the fence for the submission */
> -     u64                             fence_wptr_start;
> -     u64                             fence_wptr_end;
> +     /* idx into the ring backup */
> +     unsigned int                    backup_idx;
> +     unsigned int                    backup_size;
> +
>  };
>
>  extern const struct drm_sched_backend_ops amdgpu_sched_ops;
> @@ -162,8 +163,8 @@ extern const struct drm_sched_backend_ops amdgpu_sched_ops;
>  void amdgpu_fence_driver_set_error(struct amdgpu_ring *ring, int error);
>  void amdgpu_fence_driver_force_completion(struct amdgpu_ring *ring,
>                                           struct dma_fence *timedout_fence);
> -void amdgpu_fence_driver_update_timedout_fence_state(struct amdgpu_fence *af);
> -void amdgpu_fence_save_wptr(struct amdgpu_fence *af);
> +void amdgpu_ring_set_fence_errors_and_reemit(struct amdgpu_ring *ring,
> +                                          struct amdgpu_fence *guilty_fence);
>
>  int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring);
>  int amdgpu_fence_driver_start_ring(struct amdgpu_ring *ring,
> @@ -314,6 +315,7 @@ struct amdgpu_ring {
>       /* backups for resets */
>       uint32_t                *ring_backup;
>       unsigned int            ring_backup_entries_to_copy;
> +     unsigned int            guilty_fence_count;
>       unsigned                rptr_offs;
>       u64                     rptr_gpu_addr;
>       u32                     *rptr_cpu_addr;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> index 6a2ea200d90c8..bd2c01b1ef18f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> @@ -848,6 +848,7 @@ int amdgpu_vm_flush(struct amdgpu_ring *ring, struct amdgpu_job *job,
>               r = amdgpu_fence_emit(ring, job->hw_vm_fence, 0);
>               if (r)
>                       return r;
> +             job->hw_vm_fence->is_vm_fence = true;
>               fence = &job->hw_vm_fence->base;
>               /* get a ref for the job */
>               dma_fence_get(fence);






------------------------------

Message: 2
Date: Tue, 20 Jan 2026 12:03:39 +0800
From: Gangliang Xie <[email protected]>
To: <[email protected]>
Cc: <[email protected]>, Gangliang Xie <[email protected]>
Subject: [PATCH] drm/amdgpu: mark invalid records with U64_MAX
Message-ID: <[email protected]>
Content-Type: text/plain

Set the retired_page field of invalid RAS records to U64_MAX, and skip them when 
reading RAS records.

Signed-off-by: Gangliang Xie <[email protected]>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 0c84c7ab36d1..13d901c0ed6e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -2784,6 +2784,9 @@ static int amdgpu_ras_badpages_read(struct amdgpu_device *adev,
                        if (!data->bps[i].ts)
                                continue;

+                       if (data->bps[i].retired_page == U64_MAX)
+                               continue;
+
                        bps[r].bp = data->bps[i].retired_page;
                        r++;
                        if (r >= count)
@@ -3090,6 +3093,7 @@ static int __amdgpu_ras_restore_bad_pages(struct amdgpu_device *adev,

                if (amdgpu_ras_check_bad_page_unlock(con,
                        bps[j].retired_page << AMDGPU_GPU_PAGE_SHIFT)) {
+                       data->bps[data->count].retired_page = U64_MAX;
                        data->count++;
                        data->space_left--;
                        continue;
--
2.34.1



------------------------------

Message: 3
Date: Tue, 20 Jan 2026 05:12:17 +0000
From: "Zhou1, Tao" <[email protected]>
To: "Xie, Patrick" <[email protected]>,
        "[email protected]" <[email protected]>
Subject: RE: [PATCH] drm/amdgpu: mark invalid records with U64_MAX
Message-ID:
        <ph7pr12mb8796f233c5a112dd0b94a051b0...@ph7pr12mb8796.namprd12.prod.outlook.com>
Content-Type: text/plain; charset="us-ascii"


> -----Original Message-----
> From: Xie, Patrick <[email protected]>
> Sent: Tuesday, January 20, 2026 12:04 PM
> To: [email protected]
> Cc: Zhou1, Tao <[email protected]>; Xie, Patrick
> <[email protected]>
> Subject: [PATCH] drm/amdgpu: mark invalid records with U64_MAX
>
> set retired_page of invalid ras records to U64_MAX, and skip them when
> reading ras records
>
> Signed-off-by: Gangliang Xie <[email protected]>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 0c84c7ab36d1..13d901c0ed6e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -2784,6 +2784,9 @@ static int amdgpu_ras_badpages_read(struct amdgpu_device *adev,
>                       if (!data->bps[i].ts)
>                               continue;
>
> +                     if (data->bps[i].retired_page == U64_MAX)
> +                             continue;

[Tao] The change is fine with me, but it would be better to add a comment explaining 
that U64_MAX is used as the invalid flag.
With that fixed, the patch is:  Reviewed-by: Tao Zhou <[email protected]>

[Patrick] Thanks, I will add this comment.
> +
>                       bps[r].bp = data->bps[i].retired_page;
>                       r++;
>                       if (r >= count)
> @@ -3090,6 +3093,7 @@ static int __amdgpu_ras_restore_bad_pages(struct amdgpu_device *adev,
>
>               if (amdgpu_ras_check_bad_page_unlock(con,
>                       bps[j].retired_page << AMDGPU_GPU_PAGE_SHIFT)) {
> +                     data->bps[data->count].retired_page = U64_MAX;
>                       data->count++;
>                       data->space_left--;
>                       continue;
> --
> 2.34.1



------------------------------

Subject: Digest Footer

_______________________________________________
amd-gfx mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


------------------------------

End of amd-gfx Digest, Vol 116, Issue 317
*****************************************
