amdgpu: New VM state for evicted user BOs

Christian König Thu, 09 Nov 2023 00:13:04 -0800

Am 08.11.23 um 22:23 schrieb Felix Kuehling:

On 2023-11-08 07:28, Christian König wrote:
Not necessary objections to this patch here, but rather how this newstate is used later on.
The fundamental problem is that re-validating things inamdgpu_vm_handle_moved() won't work in all cases.
The general approach for both classic CS IOCTL as well as user queuesis the following:
1. Grab all the necessary locks
The VM lock + either everything inside the VM (user queues) orjust the BOs in the submission (CS IOCTL).
2. Validate everything locked
It can be that additional BO locks are grabbed for eviction andeven locked BOs are shuffled around during that.
3. Update the PDEs and PTEs
This can be done only after validating everything because thingscan still shuffle around during validation.
4. Do your CS or make the userqueu / process runable etc...

5. Attach fences to the locked BOs.

6. Unlock everything again.
I think that is part of the problem why handling all updates inamdgpu_vm_handle_moved() for the CS doesn't work and sooner or lateryou might run into the same issue during process restore as well.
If I'm not completely mistaken this should be solvable by moving allvalidation to amdgpu_vm_validate_pt_bos() instead (but let us renamethat function).
I'm trying to understand what the problem is. As far as I can tell,amdgpu_vm_handle_moved just runs a bit later in the CS andprocess-restore code than amdgpu_vm_validate_pt_bos. But it runs withall the same reservations locked. My current implementation inamdgpu_vm_handle_moved has some failure cases when reservations cannotbe acquired. I think your drm_exec patch series would make it possibleto handle this more robustly. That said, in case of KFD processrestore, we can retry in case of failures already.
Anyway, moving my re-validation code to a renamed version ofamdgpu_vm_validate_pt_bos shouldn't be a big problem. I just don'tunderstand what problem that solves. Maybe it just makes the codecleaner to handle both evicted states in one place? Or maybe you'repointing to a way to merge the two states into one?

Well yeah merging both state into one might be nice to have, but thatisn't the fundamental problem.

What basically happens during validation of a BO is that this BO is moveto a desired place (as described by the placement parameter). Duringthis it can happen that we shuffle out other BOs.

Now the issue is that when you validate in amdgpu_vm_handle_moved() itcan be that by validating BO we shuffle out other BOs which then end upof the evicted list again.

This most likely won't happen with KFD, but for GFX CS it's certainlypossible. So the idea is that we first validate everything and thenupdate all the page tables and not inter mix the two operations.


Regards,
Christian.


Regards,
  Felix


Regards,
Christian.

Am 07.11.23 um 23:11 schrieb Felix Kuehling:

Hi Christian,

I know you have objected to this patch before. I still think this isthe best solution for what I need. I can talk you through myreasoning by email or offline. If I can't convince you, I will needyour guidance for a better solution.


Thanks,
  Felix


On 2023-11-07 11:58, Felix Kuehling wrote:

Create a new VM state to track user BOs that are in the system domain.

In the next patch this will be used do conditionally re-validatethem in

amdgpu_vm_handle_moved.

Signed-off-by: Felix Kuehling <felix.kuehl...@amd.com>
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 17 +++++++++++++++++
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h |  5 ++++-
  2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.cb/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c

index 1442d97ddd0f..0d685577243c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c

@@ -232,6 +232,22 @@ static void amdgpu_vm_bo_invalidated(structamdgpu_vm_bo_base *vm_bo)

      spin_unlock(&vm_bo->vm->status_lock);
  }
  +/**
+ * amdgpu_vm_bo_evicted_user - vm_bo is evicted
+ *
+ * @vm_bo: vm_bo which is evicted
+ *

+ * State for BOs used by user mode queues which are not at thelocation they

+ * should be.
+ */

+static void amdgpu_vm_bo_evicted_user(struct amdgpu_vm_bo_base*vm_bo)

+{
+    vm_bo->moved = true;
+    spin_lock(&vm_bo->vm->status_lock);
+    list_move(&vm_bo->vm_status, &vm_bo->vm->evicted_user);
+    spin_unlock(&vm_bo->vm->status_lock);
+}
+
  /**
   * amdgpu_vm_bo_relocated - vm_bo is reloacted
   *

@@ -2110,6 +2126,7 @@ int amdgpu_vm_init(struct amdgpu_device*adev, struct amdgpu_vm *vm, int32_t xcp

      for (i = 0; i < AMDGPU_MAX_VMHUBS; i++)
          vm->reserved_vmid[i] = NULL;
      INIT_LIST_HEAD(&vm->evicted);
+    INIT_LIST_HEAD(&vm->evicted_user);
      INIT_LIST_HEAD(&vm->relocated);
      INIT_LIST_HEAD(&vm->moved);
      INIT_LIST_HEAD(&vm->idle);

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.hb/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h

index 12cac06fa289..939d0c2219c0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
@@ -286,9 +286,12 @@ struct amdgpu_vm {
      /* Lock to protect vm_bo add/del/move on all lists of vm */
      spinlock_t        status_lock;
  -    /* BOs who needs a validation */
+    /* Per VM and PT BOs who needs a validation */
      struct list_head    evicted;
  +    /* BOs for user mode queues that need a validation */
+    struct list_head    evicted_user;
+
      /* PT BOs which relocated and their parent need an update */
      struct list_head    relocated;

Re: [PATCH 2/6] drm/amdgpu: New VM state for evicted user BOs

Reply via email to