[PATCH] drm/amdkfd: make sure VM is ready for updating operations

2024-04-06 Thread Lang Yu
When a VM is in the evicting state, amdgpu_vm_update_range returns -EBUSY,
and restore_process_worker then retries forever in an infinite loop.

Fixes: 2fdba514ad5a ("drm/amdgpu: Auto-validate DMABuf imports in compute VMs")

Signed-off-by: Lang Yu 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 0ae9fd844623..8c71fe07807a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -2900,6 +2900,12 @@ int amdgpu_amdkfd_gpuvm_restore_process_bos(void *info, struct dma_fence __rcu *
 
 	amdgpu_sync_create(&sync_obj);
 
+	ret = process_validate_vms(process_info, NULL);
+	if (ret) {
+		pr_debug("Validating VMs failed, ret: %d\n", ret);
+		goto validate_map_fail;
+	}
+
 	/* Validate BOs and map them to GPUVM (update VM page tables). */
 	list_for_each_entry(mem, &process_info->kfd_bo_list,
 			    validate_list) {
-- 
2.25.1



RE: [PATCH] drm/amdgpu: Fix incorrect return value

2024-04-06 Thread Chai, Thomas
[AMD Official Use Only - General]

-
Best Regards,
Thomas

-----Original Message-----
From: Zhou1, Tao 
Sent: Wednesday, April 3, 2024 6:36 PM
To: Chai, Thomas ; amd-gfx@lists.freedesktop.org
Cc: Zhang, Hawking ; Li, Candice ; 
Wang, Yang(Kevin) ; Yang, Stanley 
Subject: RE: [PATCH] drm/amdgpu: Fix incorrect return value

> -----Original Message-----
> From: Chai, Thomas 
> Sent: Wednesday, April 3, 2024 3:07 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Chai, Thomas ; Zhang, Hawking
> ; Zhou1, Tao ; Li, Candice
> ; Wang, Yang(Kevin) ;
> Yang, Stanley ; Chai, Thomas
> 
> Subject: [PATCH] drm/amdgpu: Fix incorrect return value
>
> [Why]
>   After calling amdgpu_vram_mgr_reserve_range multiple times with the
> same address, calling amdgpu_vram_mgr_query_page_status will always
> return -EBUSY.

> [Tao] Could you explain why we call amdgpu_vram_mgr_reserve_range multiple
> times with the same address? IIRC, we skip duplicate addresses before
> reserving memory.

[Thomas]
   When a poison creation interrupt is received, some of the poisoned addresses
may already be allocated to processes, so reserving that memory fails.
The reservation is retried in the subsequent poison consumption interrupt
handler, after the poisoned process has been killed.
So amdgpu_vram_mgr_reserve_range needs to be called multiple times with the
same address.

>   From the second call to amdgpu_vram_mgr_reserve_range onward, the same
> address is added to the reservations_pending list again and is never
> moved to the reserved_pages list because the address has already been
> reserved.
>
> [How]
>   First add an address status check before calling
> amdgpu_vram_mgr_do_reserve. If the address is already reserved, do
> nothing; if the address is already in the reservations_pending list,
> reserve the memory directly; only add new nodes for addresses that are
> in neither the reserved_pages list nor the reservations_pending list.
>
> Signed-off-by: YiPeng Chai 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 28
> +---
>  1 file changed, 19 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> index 1e36c428d254..0bf3f4092900 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> @@ -317,7 +317,6 @@ static void amdgpu_vram_mgr_do_reserve(struct ttm_resource_manager *man)
>
>  		dev_dbg(adev->dev, "Reservation 0x%llx - %lld, Succeeded\n",
>  			rsv->start, rsv->size);
> -
>  		vis_usage = amdgpu_vram_mgr_vis_size(adev, block);
>  		atomic64_add(vis_usage, &mgr->vis_usage);
>  		spin_lock(&man->bdev->lru_lock);
> @@ -340,19 +339,30 @@ int amdgpu_vram_mgr_reserve_range(struct amdgpu_vram_mgr *mgr,
>  				  uint64_t start, uint64_t size)
>  {
>  	struct amdgpu_vram_reservation *rsv;
> +	int ret = 0;
>
> -	rsv = kzalloc(sizeof(*rsv), GFP_KERNEL);
> -	if (!rsv)
> -		return -ENOMEM;
> +	ret = amdgpu_vram_mgr_query_page_status(mgr, start);
> +	if (!ret)
> +		return 0;
> +
> +	if (ret == -ENOENT) {
> +		rsv = kzalloc(sizeof(*rsv), GFP_KERNEL);
> +		if (!rsv)
> +			return -ENOMEM;
>
> -	INIT_LIST_HEAD(&rsv->allocated);
> -	INIT_LIST_HEAD(&rsv->blocks);
> +		INIT_LIST_HEAD(&rsv->allocated);
> +		INIT_LIST_HEAD(&rsv->blocks);
>
> -	rsv->start = start;
> -	rsv->size = size;
> +		rsv->start = start;
> +		rsv->size = size;
> +
> +		mutex_lock(&mgr->lock);
> +		list_add_tail(&rsv->blocks, &mgr->reservations_pending);
> +		mutex_unlock(&mgr->lock);
> +
> +	}
>
>  	mutex_lock(&mgr->lock);
> -	list_add_tail(&rsv->blocks, &mgr->reservations_pending);
>  	amdgpu_vram_mgr_do_reserve(&mgr->manager);
>  	mutex_unlock(&mgr->lock);
>
> --
> 2.34.1




[PATCH] Documentation/gpu: correct path of reference

2024-04-06 Thread Simon Horman
The path to GPU documentation is Documentation/gpu
rather than Documentation/GPU

This appears to have been introduced by commit ba162ae749a5
("Documentation/gpu: Introduce a simple contribution list for display code")

Flagged by make htmldocs.

Signed-off-by: Simon Horman 
---
 Documentation/gpu/amdgpu/display/display-contributing.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/gpu/amdgpu/display/display-contributing.rst b/Documentation/gpu/amdgpu/display/display-contributing.rst
index fdb2bea01d53..36f3077eee00 100644
--- a/Documentation/gpu/amdgpu/display/display-contributing.rst
+++ b/Documentation/gpu/amdgpu/display/display-contributing.rst
@@ -135,7 +135,7 @@ Enable underlay
 ---------------
 
 AMD display has this feature called underlay (which you can read more about at
-'Documentation/GPU/amdgpu/display/mpo-overview.rst') which is intended to
+'Documentation/gpu/amdgpu/display/mpo-overview.rst') which is intended to
 save power when playing a video. The basic idea is to put a video in the
 underlay plane at the bottom and the desktop in the plane above it with a hole
 in the video area. This feature is enabled in ChromeOS, and from our data