Re: [PATCH v2 2/3] drm/amdgpu: Add flag to disable implicit sync for GEM operations.

2023-11-06 Thread Christian König

On 06.11.23 at 16:47, Tatsuyuki Ishi wrote:

On Nov 6, 2023, at 22:44, Christian König wrote:

On 02.11.23 at 15:04, Tatsuyuki Ishi wrote:

In Vulkan, it is the application's responsibility to perform adequate
synchronization before a sparse unmap, replace or BO destroy operation.
Until now, the kernel applied the same rule as implicitly-synchronized
APIs like OpenGL, which with per-VM BOs made page table updates stall the
queue completely. The newly added AMDGPU_VM_EXPLICIT_SYNC flag allows
drivers to opt out of this behavior, while still ensuring adequate implicit
sync happens for kernel-initiated updates (e.g. BO moves).

We record whether to use implicit sync or not for each freed mapping. To
avoid increasing the mapping struct's size, this is union-ized with the
interval tree field, which is unused after the unmap.

The reason this is done with a GEM ioctl flag, instead of being a VM /
context global setting, is that the current libdrm implementation shares
the DRM handle even between different kinds of drivers (radeonsi vs radv).


It would be nice if we could make this more future-proof by not using
a flag, but rather a drm_syncobj.


There is asynchronous VM_BIND and synchronous VM_BIND. Using syncobjs
addresses asynchronous binds, but what this patch set adds is an
explicitly synced synchronous bind.


All VM updates are asynchronous in the sense that they are queued up and
don't execute immediately.


If you don't add input/output fences and don't sync implicitly with
command submission any more, you actually have no idea in userspace when
they execute.


That doesn't sound like a good idea to me.



Even within Vulkan, there are use cases for synchronous binds. This is
when a non-sparse BO is destroyed (or created, but that’s not
synchronized). Such operations should still be explicit sync, unlike
OpenGL where it syncs to previous submissions. So adding asynchronous
bind doesn’t supersede this need.


I’ve also considered whether we can just make the unmap asynchronous,
since the spec requires that destroyed resources are not accessed in any
way, but I think it will complicate behavior when the destruction of the
BO immediately follows.


We should implement asynchronous bind someday to make
vkQueueBindSparse work (even) better, but that will likely involve a
larger scope including the scheduler. Getting synchronous but
explicitly synced binds should be simpler and a good incremental step.


That's the whole point: I don't think that the flag is simpler/cleaner
than a fence.


We still need to take the implicit sync which can come from kernel
operations into account, but at the same time disable the implicit sync
which comes from userspace submissions.


As far as I can see, the easiest way to do this, which both doesn't
break anything currently and still leaves room to extend the interface,
is to add an input dependency fence.


If you then use a signaled sync point as input, you get exactly the
semantics you desire, while we are still able to add an output fence later on.


Regards,
Christian.



Tatsuyuki.

You can extend the drm_amdgpu_gem_va structure by adding a
drm_syncobj handle and timeline point at the end.


If the syncobj/timeline point results in a fence, we give that as an
input dependency the operation has to wait for.


An output fence can come later on as well, but that one is much
harder to handle.


Regards,
Christian.



Signed-off-by: Tatsuyuki Ishi 
---
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c   |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c   | 14 --
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.h    |  7 ++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h |  6 ++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c    | 47 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h    | 23 +
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c  | 18 +++
 include/uapi/drm/amdgpu_drm.h |  2 +
 9 files changed, 71 insertions(+), 50 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 7d6daf8d2bfa..10e129bff977 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -1196,7 +1196,7 @@ static void unmap_bo_from_gpuvm(struct kgd_mem *mem,
 	struct amdgpu_device *adev = entry->adev;
 	struct amdgpu_vm *vm = bo_va->base.vm;

-	amdgpu_vm_bo_unmap(adev, bo_va, entry->va);
+	amdgpu_vm_bo_unmap(adev, bo_va, entry->va, true);

 	amdgpu_vm_clear_freed(adev, vm, &bo_va->last_pt_update);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
index 720011019741..612279e65bff 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
@@ -122,7 +122,7 @@ int amdgpu_unmap_static_csa(struct amdgpu_device *adev, struct amdgpu_vm *vm,
 	}
 	}

-	r = amdgpu_vm_bo_unmap(adev, bo_va, csa_addr);

Re: [PATCH v2 2/3] drm/amdgpu: Add flag to disable implicit sync for GEM operations.

2023-11-06 Thread Tatsuyuki Ishi
> On Nov 6, 2023, at 22:44, Christian König  wrote:
> 
>  Am 02.11.23 um 15:04 schrieb Tatsuyuki Ishi:
>> In Vulkan, it is the application's responsibility to perform adequate
>> synchronization before a sparse unmap, replace or BO destroy operation.
>> Until now, the kernel applied the same rule as implicitly-synchronized
>> APIs like OpenGL, which with per-VM BOs made page table updates stall the
>> queue completely. The newly added AMDGPU_VM_EXPLICIT_SYNC flag allows
>> drivers to opt out of this behavior, while still ensuring adequate implicit
>> sync happens for kernel-initiated updates (e.g. BO moves).
>> 
>> We record whether to use implicit sync or not for each freed mapping. To
>> avoid increasing the mapping struct's size, this is union-ized with the
>> interval tree field which is unused after the unmap.
>> 
>> The reason this is done with a GEM ioctl flag, instead of being a VM /
>> context global setting, is that the current libdrm implementation shares
>> the DRM handle even between different kinds of drivers (radeonsi vs radv).
> 
> It would be nice if we could make this more future-proof by not using a
> flag, but rather a drm_syncobj.

There is asynchronous VM_BIND and synchronous VM_BIND. Using syncobjs
addresses asynchronous binds, but what this patch set adds is an
explicitly synced synchronous bind.

Even within Vulkan, there are use cases for synchronous binds. This is when a 
non-sparse BO is destroyed (or created but that’s not synchronized). Such 
operations should still be explicit sync, unlike OpenGL where it syncs to 
previous submissions. So adding asynchronous bind doesn’t supersede this need.

I’ve also considered whether we can just make the unmap asynchronous, since the
spec requires that destroyed resources are not accessed in any way, but I think
it will complicate behavior when the destruction of the BO immediately follows.

We should implement asynchronous bind someday to make vkQueueBindSparse work 
(even) better, but that will likely involve a larger scope including the 
scheduler. Getting synchronous but explicitly synced binds should be simpler 
and a good incremental step.

Tatsuyuki.

> You can extend the drm_amdgpu_gem_va structure by adding a drm_syncobj handle 
> and timeline point at the end.
> 
> If the syncobj/timeline point results in a fence, we give that as an input
> dependency the operation has to wait for.
> 
> An output fence can come later on as well, but that one is much harder
> to handle.
> 
> Regards,
> Christian.
> 
>> 
>> Signed-off-by: Tatsuyuki Ishi 
>> ---
>>  .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |  2 +-
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c   |  2 +-
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c   | 14 --
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.h|  7 ++-
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h |  6 ++-
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c| 47 +++
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h| 23 +
>>  drivers/gpu/drm/amd/amdkfd/kfd_svm.c  | 18 +++
>>  include/uapi/drm/amdgpu_drm.h |  2 +
>>  9 files changed, 71 insertions(+), 50 deletions(-)
>> 
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> index 7d6daf8d2bfa..10e129bff977 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> @@ -1196,7 +1196,7 @@ static void unmap_bo_from_gpuvm(struct kgd_mem *mem,
>>  struct amdgpu_device *adev = entry->adev;
>>  struct amdgpu_vm *vm = bo_va->base.vm;
>> -	amdgpu_vm_bo_unmap(adev, bo_va, entry->va);
>> +	amdgpu_vm_bo_unmap(adev, bo_va, entry->va, true);
>>  	amdgpu_vm_clear_freed(adev, vm, &bo_va->last_pt_update);
>>  diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
>> index 720011019741..612279e65bff 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
>> @@ -122,7 +122,7 @@ int amdgpu_unmap_static_csa(struct amdgpu_device *adev, 
>> struct amdgpu_vm *vm,
>>  }
>>  }
>> -	r = amdgpu_vm_bo_unmap(adev, bo_va, csa_addr);
>> +	r = amdgpu_vm_bo_unmap(adev, bo_va, csa_addr, true);
>>  if (r) {
>>  DRM_ERROR("failed to do bo_unmap on static CSA, err=%d\n", r);
>>  goto error;
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>> index a1b15d0d6c48..cca68b89754e 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>> @@ -667,9 +667,9 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void 
>> *data,
>>  const uint32_t valid_flags = AMDGPU_VM_DELAY_UPDATE |
>>  AMDGPU_VM_PAGE_READABLE | AMDGPU_VM_PAGE_WRITEABLE |
>>  AMDGPU_VM_PAGE_EXECUTABLE | AMDGPU_VM_MTYPE_MASK |
>> -AMDGPU_VM_PAGE_NOALLOC;

Re: [PATCH v2 2/3] drm/amdgpu: Add flag to disable implicit sync for GEM operations.

2023-11-06 Thread Christian König

On 02.11.23 at 15:04, Tatsuyuki Ishi wrote:

In Vulkan, it is the application's responsibility to perform adequate
synchronization before a sparse unmap, replace or BO destroy operation.
Until now, the kernel applied the same rule as implicitly-synchronized
APIs like OpenGL, which with per-VM BOs made page table updates stall the
queue completely. The newly added AMDGPU_VM_EXPLICIT_SYNC flag allows
drivers to opt out of this behavior, while still ensuring adequate implicit
sync happens for kernel-initiated updates (e.g. BO moves).

We record whether to use implicit sync or not for each freed mapping. To
avoid increasing the mapping struct's size, this is union-ized with the
interval tree field which is unused after the unmap.

The reason this is done with a GEM ioctl flag, instead of being a VM /
context global setting, is that the current libdrm implementation shares
the DRM handle even between different kinds of drivers (radeonsi vs radv).


It would be nice if we could make this more future-proof by not using a
flag, but rather a drm_syncobj.


You can extend the drm_amdgpu_gem_va structure by adding a drm_syncobj 
handle and timeline point at the end.


If the syncobj/timeline point results in a fence, we give that as an input
dependency the operation has to wait for.


An output fence can come later on as well, but that one is much
harder to handle.


Regards,
Christian.



Signed-off-by: Tatsuyuki Ishi 
---
  .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |  2 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c   |  2 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c   | 14 --
  drivers/gpu/drm/amd/amdgpu/amdgpu_object.h|  7 ++-
  drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h |  6 ++-
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c| 47 +++
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h| 23 +
  drivers/gpu/drm/amd/amdkfd/kfd_svm.c  | 18 +++
  include/uapi/drm/amdgpu_drm.h |  2 +
  9 files changed, 71 insertions(+), 50 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 7d6daf8d2bfa..10e129bff977 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -1196,7 +1196,7 @@ static void unmap_bo_from_gpuvm(struct kgd_mem *mem,
struct amdgpu_device *adev = entry->adev;
struct amdgpu_vm *vm = bo_va->base.vm;
  
-	amdgpu_vm_bo_unmap(adev, bo_va, entry->va);
+	amdgpu_vm_bo_unmap(adev, bo_va, entry->va, true);

 	amdgpu_vm_clear_freed(adev, vm, &bo_va->last_pt_update);
  
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c

index 720011019741..612279e65bff 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
@@ -122,7 +122,7 @@ int amdgpu_unmap_static_csa(struct amdgpu_device *adev, struct amdgpu_vm *vm,
}
}
  
-	r = amdgpu_vm_bo_unmap(adev, bo_va, csa_addr);
+	r = amdgpu_vm_bo_unmap(adev, bo_va, csa_addr, true);
if (r) {
DRM_ERROR("failed to do bo_unmap on static CSA, err=%d\n", r);
goto error;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
index a1b15d0d6c48..cca68b89754e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
@@ -667,9 +667,9 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
const uint32_t valid_flags = AMDGPU_VM_DELAY_UPDATE |
AMDGPU_VM_PAGE_READABLE | AMDGPU_VM_PAGE_WRITEABLE |
AMDGPU_VM_PAGE_EXECUTABLE | AMDGPU_VM_MTYPE_MASK |
-   AMDGPU_VM_PAGE_NOALLOC;
+   AMDGPU_VM_PAGE_NOALLOC | AMDGPU_VM_EXPLICIT_SYNC;
const uint32_t prt_flags = AMDGPU_VM_DELAY_UPDATE |
-   AMDGPU_VM_PAGE_PRT;
+   AMDGPU_VM_PAGE_PRT | AMDGPU_VM_EXPLICIT_SYNC;
  
  	struct drm_amdgpu_gem_va *args = data;

struct drm_gem_object *gobj;
@@ -680,6 +680,7 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
struct drm_exec exec;
uint64_t va_flags;
uint64_t vm_size;
+   bool sync_unmap;
int r = 0;
  
  	if (args->va_address < AMDGPU_VA_RESERVED_SIZE) {

@@ -715,6 +716,8 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
return -EINVAL;
}
  
+	sync_unmap = !(args->flags & AMDGPU_VM_EXPLICIT_SYNC);

+
switch (args->operation) {
case AMDGPU_VA_OP_MAP:
case AMDGPU_VA_OP_UNMAP:
@@ -774,19 +777,20 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
 va_flags);
break;
case AMDGPU_VA_OP_UNMAP:
-   r = amdgpu_vm_bo_unmap(adev, bo_va, args->va_address);
+   r = amdgpu_vm_bo_unmap(adev, bo_va, args->va_address,
+