On 1/9/26 15:41, Srinivasan Shanmugam wrote:
> When GPU memory mappings are updated, the driver returns a fence so
> userspace knows when the update is finished.
> 
> The previous refactor could pick the wrong fence or rely on checks that
> are not safe for GPU mappings that stay valid even when memory is
> missing. In some cases this could return an invalid fence or cause fence
> reference counting problems.
> 
> Fix this by (v6, per Christian):
> - Starting from the VM’s existing last update fence, so a valid and
>   meaningful fence is always returned even when no new work is required.
> - Selecting the VM-level fence only for always-valid / PRT mappings using
>   the required combined bo_va + bo guard.
> - Using the per-BO page table update fence for normal MAP and REPLACE
>   operations.
> - For UNMAP and CLEAR, returning the fence provided by
>   amdgpu_vm_clear_freed(), which may remain unchanged when nothing needs
>   clearing.
> - Keeping fence reference counting balanced.
> 
> This makes VM timeline fences correct and prevents crashes caused by
> incorrect fence handling.
> 
> Fixes: 463b33e780ae ("drm/amdgpu: Refactor amdgpu_gem_va_ioctl for Handling 
> Last Fence Update and Timeline Management v4")
> Suggested-by: Christian König <[email protected]>
> Signed-off-by: Srinivasan Shanmugam <[email protected]>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 75 +++++++++++++------------
>  1 file changed, 39 insertions(+), 36 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
> index e5fb8f5346b6..c1dc0cae0cfb 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
> @@ -732,15 +732,23 @@ amdgpu_gem_va_update_vm(struct amdgpu_device *adev,
>                       struct amdgpu_bo_va *bo_va,
>                       uint32_t operation)
>  {
> -     struct dma_fence *clear_fence = dma_fence_get_stub();
> -     struct dma_fence *last_update = NULL;
> -     int r;
> +     struct dma_fence *fence;
> +     int r = 0;
> +
> +     /* Always start from the VM's existing last update fence. */
> +     fence = dma_fence_get(vm->last_update);
>  
>       if (!amdgpu_vm_ready(vm))
> -             return clear_fence;
> +             return fence;
>  
> -     /* First clear freed BOs and get a fence for that work, if any. */
> -     r = amdgpu_vm_clear_freed(adev, vm, &clear_fence);
> +     /*
> +      * First clean up any freed mappings in the VM.
> +      *
> +      * amdgpu_vm_clear_freed() may replace @fence with a new fence if it
> +      * schedules GPU work. If nothing needs clearing, @fence can remain as
> +      * the original vm->last_update.
> +      */
> +     r = amdgpu_vm_clear_freed(adev, vm, &fence);
>       if (r)
>               goto error;
>  
> @@ -758,35 +766,40 @@ amdgpu_gem_va_update_vm(struct amdgpu_device *adev,
>               goto error;
>  
>       /*
> -      * Decide which fence represents the "last update" for this VM/BO:
> +      * Decide which fence best represents the last update:
> +      *
> +      * MAP/REPLACE:
> +      *   - For always-valid / PRT mappings, export vm->last_update.
> +      *   - Otherwise, export bo_va->last_pt_update.

Not correct, PRT mappings also use bo_va->last_pt_update (but I had to double 
check that as well :)

Mhm, that might be the reason why we had problems with PRT mapping in the past.

Please just change like this:
- For always-valid use vm->last_update.
- Otherwise, export bo_va->last_pt_update.

>        *
> -      * - For MAP/REPLACE we want the PT update fence, which is tracked as
> -      *   either vm->last_update (for always-valid BOs) or 
> bo_va->last_pt_update
> -      *   (for per-BO updates).
> +      * UNMAP/CLEAR:
> +      *   Keep the fence returned by amdgpu_vm_clear_freed(). If no work was
> +      *   needed, it can remain as vm->last_pt_update.
>        *
> -      * - For UNMAP/CLEAR we rely on the fence returned by
> -      *   amdgpu_vm_clear_freed(), which already covers the page table work
> -      *   for the removed mappings.
> +      * The VM and BO update fences are always initialized to a valid value.
> +      * vm->last_update and bo_va->last_pt_update always start as valid 
> fences.
> +      * and are never expected to be NULL.
>        */
>       switch (operation) {
>       case AMDGPU_VA_OP_MAP:
>       case AMDGPU_VA_OP_REPLACE:
> -             if (bo_va && bo_va->base.bo) {
> -                     if (amdgpu_vm_is_bo_always_valid(vm, bo_va->base.bo)) {
> -                             if (vm->last_update)
> -                                     last_update = 
> dma_fence_get(vm->last_update);
> -                     } else {
> -                             if (bo_va->last_pt_update)
> -                                     last_update = 
> dma_fence_get(bo_va->last_pt_update);
> -                     }
> -             }
> +             /*
> +              * For MAP/REPLACE, return the page table update fence for the
> +              * mapping we just modified. bo_va is expected to be valid here.
> +              */
> +             dma_fence_put(fence);
> +
> +             /* Combined guard required for always-valid / PRT correctness */
> +             if (bo_va && bo_va->base.bo &&
> +                 amdgpu_vm_is_bo_always_valid(vm, bo_va->base.bo))

I've just double checked and amdgpu_vm_is_bo_always_valid() can take NULL as BO 
and does the right thing for PRT mappings.

So I think we can just change this to "if (amdgpu_vm_is_bo_always_valid(vm, 
bo_va->base.bo))".

Apart from that looks good to me.

Regards,
Christian.

> +                     fence = dma_fence_get(vm->last_update);
> +             else
> +                     fence = dma_fence_get(bo_va->last_pt_update);
>               break;
>       case AMDGPU_VA_OP_UNMAP:
>       case AMDGPU_VA_OP_CLEAR:
> -             if (clear_fence)
> -                     last_update = dma_fence_get(clear_fence);
> -             break;
>       default:
> +             /* keep @fence as returned by amdgpu_vm_clear_freed() */
>               break;
>       }
>  
> @@ -794,17 +807,7 @@ amdgpu_gem_va_update_vm(struct amdgpu_device *adev,
>       if (r && r != -ERESTARTSYS)
>               DRM_ERROR("Couldn't update BO_VA (%d)\n", r);
>  
> -     /*
> -      * If we managed to pick a more specific last-update fence, prefer it
> -      * over the generic clear_fence and drop the extra reference to the
> -      * latter.
> -      */
> -     if (last_update) {
> -             dma_fence_put(clear_fence);
> -             return last_update;
> -     }
> -
> -     return clear_fence;
> +     return fence;
>  }
>  
>  int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,

Reply via email to