Re: [PATCH 1/2][RFC] amdgpu: fix a race in kfd_mem_export_dmabuf()

2024-06-06 Thread Felix Kuehling
On 2024-06-05 05:14, Christian König wrote: Am 04.06.24 um 20:08 schrieb Felix Kuehling: On 2024-06-03 22:13, Al Viro wrote: Using drm_gem_prime_handle_to_fd() to set dmabuf up and insert it into descriptor table, only to have it looked up by file descriptor and remove it from descriptor

Re: [PATCH 3/3] drm/amdgpu: nuke the VM PD/PT shadow handling

2024-06-06 Thread Felix Kuehling
case left is SVM and that is most likely not recoverable in any way when VRAM is lost. I agree. The series is Acked-by: Felix Kuehling Signed-off-by: Christian König --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 4 - drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 87

Re: [PATCH v4 9/9] drm/amdgpu: add lock in kfd_process_dequeue_from_device

2024-06-06 Thread Felix Kuehling
-off-by: Yunxiang Li Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c | 9 +++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c b/drivers/gpu/drm/amd/amdkfd

Re: [PATCH 10/12] drm/amdkfd: remove dead code in kq_initialize

2024-06-04 Thread Felix Kuehling
values to this function. I don't think C compilers are that strict. You could pass a random integer to the function. That said, this function only has two callers, and both of them use a proper enum value. Signed-off-by: Jesse Zhang Acked-by: Felix Kuehling --- drivers/gpu/drm/amd

Re: [PATCH 11/12] drm/amdkfd: remove logically dead code

2024-06-04 Thread Felix Kuehling
On 2024-06-03 04:49, Jesse Zhang wrote: idr_for_each_entry can ensure that mem is not empty during the loop. So don't need check mem again. Signed-off-by: Jesse Zhang Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 5 - 1 file changed, 5 deletions

Re: [PATCH] Revert "drm/amdgpu: init iommu after amdkfd device init"

2024-06-04 Thread Felix Kuehling
28 12:20:12 2023 -0400     drm/amdkfd: drop IOMMUv2 support     Now that we use the dGPU path for all APUs, drop the     IOMMUv2 support.     v2: drop the now unused queue manager functions for gfx7/8 APUs     Reviewed-by: Felix Kuehling     Acked-by: Christian König     Tested-by: Mike

Re: [PATCH 2/2][RFC] amdkfd CRIU fixes

2024-06-04 Thread Felix Kuehling
references into the corresponding slots of descriptor table, or drop all those file references and free the unused descriptors. Signed-off-by: Al Viro Thank you for the patches and the explanation. One minor nit-pick inline. With that fixed, this patch is Reviewed-by: Felix Kuehling I can

Re: [PATCH 1/2][RFC] amdgpu: fix a race in kfd_mem_export_dmabuf()

2024-06-04 Thread Felix Kuehling
the descriptor table alone. Signed-off-by: Al Viro This patch looks good to me on the amdgpu side. For the DRM side I'm adding dri-devel. Acked-by: Felix Kuehling --- diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c index

Re: [PATCH 8/8] drm/amdkfd: remove dead code in kfd_create_vcrat_image_gpu

2024-05-31 Thread Felix Kuehling
384). if (!pcrat_image || avail_size < VCRAT_SIZE_FOR_GPU) return -EINVAL; Ok, I missed that. Makes sense. Maybe mention it in the commit description that kfd_create_vcrat_image_gpu itself checks the avail_size at the start. The patch is Reviewed-by: Felix Ku

Re: [PATCH 7/8] drm/amdkfd: Comment out the unused variable use_static in pm_map_queues_v9

2024-05-31 Thread Felix Kuehling
On 2024-05-30 22:51, Jesse Zhang wrote: To fix the warning about unused value, remove the use_static and use the parameter is_static directly. Signed-off-by: Jesse Zhang Suggested-by: Felix Kuehling Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdkfd/kfd_packet_manager_v9.c

Re: [PATCH 2/8] drm/amdkfd: fix the kdf debugger issue

2024-05-31 Thread Felix Kuehling
e alu ops for gfx12") Signed-off-by: Jesse Zhang Suggested-by: Felix Kuehling Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdkfd/kfd_debug.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_debug.c b/drivers/gpu/drm/

Re: [PATCH v2 09/10] drm/amdgpu: fix missing reset domain locks

2024-05-31 Thread Felix Kuehling
On 2024-05-31 2:52, Christian König wrote: > Am 31.05.24 um 00:02 schrieb Felix Kuehling: >> On 2024-05-28 13:23, Yunxiang Li wrote: >>> These functions are missing the lock for reset domain. >>> >>> Signed-off-by: Yunxiang Li >>> ---

Re: [PATCH v2 09/10] drm/amdgpu: fix missing reset domain locks

2024-05-30 Thread Felix Kuehling
On 2024-05-28 13:23, Yunxiang Li wrote: These functions are missing the lock for reset domain. Signed-off-by: Yunxiang Li --- drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c | 4 +++- drivers/gpu/drm/amd/amdgpu/amdgpu_job.c| 8 ++--

Re: [PATCH 6/8] drm/amdkfd: remove dead code in the function svm_range_get_pte_flags

2024-05-30 Thread Felix Kuehling
On 2024-05-29 23:49, Jesse Zhang wrote: The varible uncached set false, the condition uncached cannot be true. So remove the dead code, mapping flags will set the flag AMDGPU_VM_MTYPE_UC in else. Signed-off-by: Jesse Zhang Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdkfd

Re: [PATCH 7/8] drm/amdkfd: Comment out the unused variable use_static in pm_map_queues_v9

2024-05-30 Thread Felix Kuehling
On 2024-05-30 10:12, Christian König wrote: Am 30.05.24 um 05:50 schrieb Jesse Zhang: To fix the warning about unused value, comment out the variable use_static. Commenting out variables with // will just get you another warning from checkpatch. Christian. Signed-off-by: Jesse Zhang

Re: [PATCH 2/8] drm/amdkfd: fix the kdf debugger issue

2024-05-30 Thread Felix Kuehling
On 2024-05-29 23:47, Jesse Zhang wrote: the expression caps | HSA_CAP_TRAP_DEBUG_PRECISE_MEMORY_OPERATIONS_SUPPORTED is always 1/true regardless of the values of its operand. Signed-off-by: Jesse Zhang Please add a Fixes tag. I think this is the commit that introduced the problem:
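Illustration (not part of the original message): a minimal C sketch of why an expression of the form caps | FLAG is always true when FLAG is a non-zero constant. EXAMPLE_CAP_FLAG and both helper names are hypothetical; the real value of HSA_CAP_TRAP_DEBUG_PRECISE_MEMORY_OPERATIONS_SUPPORTED is not shown in this thread, and the bitwise-AND form is only assumed to be the intended test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the capability bit; any non-zero value
 * shows the same behaviour. */
#define EXAMPLE_CAP_FLAG 0x40000000u

/* Buggy form: OR-ing with a non-zero constant can never yield zero,
 * so the result is true for every value of caps. */
static bool has_flag_buggy(uint32_t caps)
{
	return (caps | EXAMPLE_CAP_FLAG) != 0;
}

/* Assumed intended form: AND masks out the bit and tests whether it
 * is actually set in caps. */
static bool has_flag_fixed(uint32_t caps)
{
	return (caps & EXAMPLE_CAP_FLAG) != 0;
}
```

With caps == 0 the OR form still reports the capability as present, while the AND form reports it as absent.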

Re: [PATCH 3/8] drm/amdkfd: fix overflow for the function criu_restore_bos

2024-05-30 Thread Felix Kuehling
On 2024-05-29 23:47, Jesse Zhang wrote: When copying the information from the user fails, it will goto exit. But the variable i remains at 0, and do i-- will overflow. i-- may underflow, but the loop will still exit. Why is the underflow a problem? Signed-off-by: Jesse Zhang ---
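Illustration (not part of the original message): a minimal C sketch of the point made in the reply, assuming the counter is an unsigned int and the cleanup uses a while (i--) style loop. The function unwind() is hypothetical and is not the criu_restore_bos() code. The post-decrement tests the old value first, so with i == 0 the loop body never runs, even though i itself wraps around afterwards.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical unwind loop: counts how many entries would be undone
 * and reports the value left in i after the loop. */
static unsigned int unwind(unsigned int i, unsigned int *final_i)
{
	unsigned int undone = 0;

	/* i-- yields the old value of i, then decrements it. When the
	 * old value is 0 the condition is false and the loop exits;
	 * only then does i wrap to UINT_MAX. */
	while (i--)
		undone++;

	*final_i = i;
	return undone;
}
```

Whether the wrapped value matters depends on whether i is read again after the loop, which is what the question in the reply is asking.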

Re: [PATCH 5/8] drm/amdkfd: fix the return for the function kfd_dbg_trap_set_flags

2024-05-30 Thread Felix Kuehling
On 2024-05-29 23:48, Jesse Zhang wrote: If the rewind flag is set, it should return the final result of setting mes debug mode or refresh the run list. No. We're rewinding because an error occurred. We want to return that error, not the success probably returned by refreshing the runlist.

Re: [PATCH 8/8] drm/amdkfd: remove dead code in kfd_create_vcrat_image_gpu

2024-05-30 Thread Felix Kuehling
On 2024-05-29 23:50, Jesse Zhang wrote: Since the value of avail_size is at least VCRAT_SIZE_FOR_GPU(16384), minus struct crat_header(40UL) and struct crat_subtype_compute(40UL) it cannot be less than 0. Signed-off-by: Jesse Zhang --- drivers/gpu/drm/amd/amdkfd/kfd_crat.c | 6 -- 1

Re: [PATCH v2 04/10] drm/amdgpu/kfd: remove is_hws_hang and is_resetting

2024-05-29 Thread Felix Kuehling
pts at HW access that detect an error or time out, which may get the HW into a worse state or delay the actual reset. At a minimum, I'd recommend testing this with /sys/kernel/debug/hang_hws on a pre-MES GPU, while some ROCm workload is running. Reviewed-by: Felix Kuehling > --- > driv

Re: [PATCH] drm/amdgpu: Make CPX mode auto default in NPS4

2024-05-27 Thread Felix Kuehling
-by: Felix Kuehling --- drivers/gpu/drm/amd/amdgpu/aqua_vanjaram.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/aqua_vanjaram.c b/drivers/gpu/drm/amd/amdgpu/aqua_vanjaram.c index d62cfa4e2d2b..2c9a0aa41e2d 100644 --- a/drivers/gpu/drm/amd/amdgpu

Re: [PATCH] drm/amdkfd: simplify APU VRAM handling

2024-05-27 Thread Felix Kuehling
code. v2: clean up a few more places (Lang) Signed-off-by: Alex Deucher This is a lot cleaner, thanks. I was looking for something like this when I reviewed the original patch but missed it. I found it now in amdgpu_discovery_set_ip_blocks (I think). Acked-by: Felix Kuehling ---

Re: [PATCH] drm/amdgpu: Update the impelmentation of AMDGPU_PTE_MTYPE_GFX12

2024-05-21 Thread Felix Kuehling
On 2024-05-20 5:14, Shane Xiao wrote: > This patch changes the implementation of AMDGPU_PTE_MTYPE_GFX12, > clear the bits before setting the new one. > This fixed the potential issue that GFX12 setting memory to NC. > > v2: Clear mtype field before setting the new one (Alex) > > Signed-off-by:

Re: [PATCH] drm/kfd: Correct pined buffer handling at kfd restore and validate process

2024-05-13 Thread Felix Kuehling
415,6 +415,10 @@ static int amdgpu_amdkfd_bo_validate(struct amdgpu_bo > *bo, uint32_t domain, >"Called with userptr BO")) > return -EINVAL; > > + /* bo has been pined, not need validate it */ pined -> pinned With those typos fixed,

Re: [PATCH v2] drm/amdkfd: Check correct memory types for is_system variable

2024-05-10 Thread Felix Kuehling
Fixes tag. It should be a single line and no single quotes. Other than that, the patch is Reviewed-by: Felix Kuehling Signed-off-by: Sreekant Somasekharan --- drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/

Re: [PATCH] drm/amdkfd: Ensure gpu_id is unique

2024-05-10 Thread Felix Kuehling
Changed commit header to reflect the above v3: Use crc16 as suggested-by: Lijo Lazar Ensure that gpu_id != 0 Signed-off-by: Harish Kasiviswanathan Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdkfd/kfd_topology.c | 40 +++ 1 file changed, 34

Re: [PATCH 11/11] drm/tegra: Use fbdev client helpers

2024-05-07 Thread Felix Kuehling
On 2024-05-07 07:58, Thomas Zimmermann wrote: Implement struct drm_client_funcs with the respective helpers and remove the custom code from the emulation. The generic helpers are equivalent in functionality. Signed-off-by: Thomas Zimmermann --- drivers/gpu/drm/radeon/radeon_fbdev.c | 66

Re: [PATCH] drm/amdkfd: Ensure gpu_id is unique

2024-05-06 Thread Felix Kuehling
On 2024-05-06 17:10, Harish Kasiviswanathan wrote: On 2024-05-06 16:30, Felix Kuehling wrote: On 2024-05-03 18:06, Harish Kasiviswanathan wrote: gpu_id needs to be unique for user space to identify GPUs via KFD interface. In the current implementation there is a very small probability

Re: [PATCH] drm/amdkfd: Ensure gpu_id is unique

2024-05-06 Thread Felix Kuehling
On 2024-05-03 18:06, Harish Kasiviswanathan wrote: gpu_id needs to be unique for user space to identify GPUs via KFD interface. In the current implementation there is a very small probability of having non unique gpu_ids. v2: Add check to confirm if gpu_id is unique. If not unique, find one

Re: [PATCH] drm/amdkfd: Refactor kfd CRIU into its own file

2024-05-06 Thread Felix Kuehling
/amdkfd/kfd_chardev.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c @@ -45,6 +45,7 @@ Can you remove #include and "amdgpu_dma_buf.h" here? Or is it still needed by something else left in kfd_chardev.c? Other than that, this patch is Reviewed-by: Felix Kuehling #include "k

Re: [PATCH] drm/amdkfd: Remove arbitrary timeout for hmm_range_fault

2024-05-06 Thread Felix Kuehling
urn EAGAIN to application if hmm_range_fault return EBUSY, then userspace libdrm and Thunk will call ioctl again. Change EAGAIN to debug message as this is not error. Signed-off-by: Philip Yang Assuming this passes your stress testing without CPU stall warnings, this patch is Reviewed-by: Fe

Re: Proposal to add CRIU support to DRM render nodes

2024-05-03 Thread Felix Kuehling
On 2024-04-16 10:04, Tvrtko Ursulin wrote: > > On 01/04/2024 18:58, Felix Kuehling wrote: >> >> On 2024-04-01 12:56, Tvrtko Ursulin wrote: >>> >>> On 01/04/2024 17:37, Felix Kuehling wrote: >>>> On 2024-04-01 11:09, Tvrtko Ursulin wrote: >

Re: [PATCH v3 2/3] drm/amdgpu: Reduce mem_type to domain double indirection

2024-05-02 Thread Felix Kuehling
-by: Tvrtko Ursulin Reviewed-by: Christian König # v1 Reviewed-by: Felix Kuehling # v2 I'm waiting for Christian to review patches 1 and 3. Then I can apply the whole series. Regards,   Felix --- drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 3 +-- drivers/gpu/drm/amd/amdgpu

Re: [PATCH 1/2] drm/amdkfd: Use dev_error intead of pr_error

2024-05-01 Thread Felix Kuehling
On 2024-05-01 21:08, Harish Kasiviswanathan wrote: > No functional change. This will help in moving gpu_id creation to next > step while still being able to identify the correct GPU > > Signed-off-by: Harish Kasiviswanathan Reviewed-by: Felix Kuehling > --- > drivers/

Re: [PATCH 2/2] drm/amdkfd: Improve chances of unique gpu_id

2024-05-01 Thread Felix Kuehling
On 2024-05-01 21:08, Harish Kasiviswanathan wrote: > gpu_id needs to be unique for user space to identify GPUs via KFD > interface. Do a single pass search to detect collision. If > detected, increment gpu_id by one. > > Probability of collisons are very rare. Hence, no more complexity is >

Re: [PATCH v2] drm/amd/amdkfd: Fix a resource leak in svm_range_validate_and_map()

2024-05-01 Thread Felix Kuehling
-by: Felix Kuehling --- drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 9 +++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c index 386875e6eb96..481cb958e165 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c

Re: [PATCH] drm/amd/amdkfd: Fix a resource leak in svm_range_validate_and_map()

2024-05-01 Thread Felix Kuehling
On 2024-05-01 14:34, Felix Kuehling wrote: On 2024-04-30 19:29, Ramesh Errabolu wrote: Analysis of code by Coverity, a static code analyser, has identified a resource leak in the symbol hmm_range. This leak occurs when one of the prior steps before it is released encounters an error

Re: [PATCH] drm/amd/amdkfd: Fix a resource leak in svm_range_validate_and_map()

2024-05-01 Thread Felix Kuehling
On 2024-04-30 19:29, Ramesh Errabolu wrote: Analysis of code by Coverity, a static code analyser, has identified a resource leak in the symbol hmm_range. This leak occurs when one of the prior steps before it is released encounters an error. Signed-off-by: Ramesh Errabolu ---

Re: [PATCH v2] drm/amdkfd: Let VRAM allocations go to GTT domain on small APUs

2024-04-30 Thread Felix Kuehling
hen device and host can effectively share system memory. v2: Report local_mem_size_private as 0. (Felix) Signed-off-by: Lang Yu Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c| 5 + .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 20 ++-

Re: [PATCH 2/3] drm/amdgpu: Reduce mem_type to domain double indirection

2024-04-29 Thread Felix Kuehling
Reviewed-by: Christian König # v1 Reviewed-by: Felix Kuehling I also ran kfdtest on a multi-GPU system just to make sure this didn't break our multi-GPU support. BTW, I had to fix up some things when I tried to apply your patch to the current amd-staging-drm-next branch. That branch

Re: [PATCH] drm/amdkfd: update buffer_{store,load}_* modifiers for gfx940

2024-04-29 Thread Felix Kuehling
offset VMEM_MODIFIERS offset:256*3   s_waitcnt vmcnt(0)   end base-commit: cf743996352e327f483dc7d66606c90276f57380 Reviewed-by: Jay Cornwall Acked-by: Felix Kuehling Do you need me to submit the patch to amd-staging-drm-next? Thanks,   Felix

Re: [PATCH 2/2] drm/amdkfd: Allow memory oversubscription on small APUs

2024-04-29 Thread Felix Kuehling
On 2024-04-29 06:38, Yu, Lang wrote: [Public] -Original Message- From: Kuehling, Felix Sent: Saturday, April 27, 2024 6:45 AM To: Yu, Lang ; amd-gfx@lists.freedesktop.org Cc: Yang, Philip ; Koenig, Christian ; Zhang, Yifan ; Liu, Aaron Subject: Re: [PATCH 2/2] drm/amdkfd: Allow

Re: [PATCH 3/3] drm/amdgpu: Fix pinned GART area accounting and fdinfo reporting

2024-04-29 Thread Felix Kuehling
On 2024-04-29 5:43, Tvrtko Ursulin wrote: On 26/04/2024 23:24, Felix Kuehling wrote: On 2024-04-26 12:43, Tvrtko Ursulin wrote: From: Tvrtko Ursulin When commit b453e42a6e8b ("drm/amdgpu: Add new placement for preemptible SG BOs") added a new TTM region it missed to notice the

Re: [PATCH 3/3] drm/amdgpu: Fix pinned GART area accounting and fdinfo reporting

2024-04-29 Thread Felix Kuehling
On 2024-04-29 9:45, Tvrtko Ursulin wrote: On 29/04/2024 12:11, Christian König wrote: Am 29.04.24 um 11:43 schrieb Tvrtko Ursulin: On 26/04/2024 23:24, Felix Kuehling wrote: On 2024-04-26 12:43, Tvrtko Ursulin wrote: From: Tvrtko Ursulin When commit b453e42a6e8b ("drm/amdgpu: Ad

Re: [PATCH 1/2] drm/amdkfd: Let VRAM allocations go to GTT domain on small APUs

2024-04-26 Thread Felix Kuehling
On 2024-04-26 04:37, Lang Yu wrote: Small APUs(i.e., consumer, embedded products) usually have a small carveout device memory which can't satisfy most compute workloads memory allocation requirements. We can't even run a Basic MNIST Example with a default 512MB carveout.

Re: [PATCH 2/2] drm/amdkfd: Allow memory oversubscription on small APUs

2024-04-26 Thread Felix Kuehling
On 2024-04-26 04:37, Lang Yu wrote: The default ttm_tt_pages_limit is 1/2 of system memory. It is prone to out of memory with such a configuration. Indiscriminately allowing the violation of all memory limits is not a good solution. It will lead to poor performance once you actually reach

Re: [PATCH 3/3] drm/amdgpu: Fix pinned GART area accounting and fdinfo reporting

2024-04-26 Thread Felix Kuehling
"drm/amdgpu: Add new placement for preemptible SG BOs") Cc: Felix Kuehling Cc: Christian König --- drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/am

Re: [PATCH] drm/amdkfd: Flush the process wq before creating a kfd_process

2024-04-26 Thread Felix Kuehling
-off-by: Lancelot SIX Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdkfd/kfd_process.c | 8 1 file changed, 8 insertions(+) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c index 58c1fe542193..451bb058cc62 100644 --- a/drivers

Re: [PATCH] drm/amdkfd: Enforce queue BO's adev

2024-04-24 Thread Felix Kuehling
this doesn't break existing user mode. It only makes it fail in a more obvious way. If that's the case, the patch is Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 5 + 1 file changed, 5 insertions(+) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c b

Re: [PATCH v6 0/5] Best effort contiguous VRAM allocation

2024-04-24 Thread Felix Kuehling
The series is Reviewed-by: Felix Kuehling On 2024-04-24 11:27, Philip Yang wrote: This patch series implement new KFD memory alloc flag for best effort contiguous VRAM allocation, to support peer direct access RDMA device with limited scatter-gather dma capability. v2: rebase on patch ("

Re: [PATCH v5 1/6] drm/amdgpu: Support contiguous VRAM allocation

2024-04-23 Thread Felix Kuehling
On 2024-04-23 11:28, Philip Yang wrote: RDMA device with limited scatter-gather ability requires contiguous VRAM buffer allocation for RDMA peer direct support. Add a new KFD alloc memory flag and store as bo alloc flag AMDGPU_GEM_CREATE_VRAM_CONTIGUOUS. When pin this bo to export for RDMA

Re: [PATCH v5 3/6] drm/amdgpu: Evict BOs from same process for contiguous allocation

2024-04-23 Thread Felix Kuehling
contiguous VRAM, allow TTM evict KFD BOs from the same process, this will evict the user queues first, and restore the queues later after contiguous VRAM allocation. Signed-off-by: Philip Yang Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 3 ++- 1 file changed

Re: [PATCH v5 4/6] drm/amdkfd: Evict BO itself for contiguous allocation

2024-04-23 Thread Felix Kuehling
On 2024-04-23 11:28, Philip Yang wrote: If the BO pages pinned for RDMA is not contiguous on VRAM, evict it to system memory first to free the VRAM space, then allocate contiguous VRAM space, and then move it from system memory back to VRAM. Signed-off-by: Philip Yang ---

Re: [PATCH v5 5/6] drm/amdkfd: Increase KFD bo restore wait time

2024-04-23 Thread Felix Kuehling
On 2024-04-23 11:28, Philip Yang wrote: TTM allocate contiguous VRAM may takes more than 1 second to evict BOs for larger size RDMA buffer. Because KFD restore bo worker reserves all KFD BOs, then TTM cannot hold the remainning KFD BOs lock to evict them, this causes TTM failed to alloc

Re: [PATCH] drm/amdkfd: handle duplicate BOs in reserve_bo_and_cond_vms

2024-04-23 Thread Felix Kuehling
] ? __pfx_kthread+0x10/0x10 [ 57.794184] ret_from_fork_asm+0x1b/0x30 Signed-off-by: Lang Yu Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c

Re: [PATCH] drm/amdgpu: Fix VRAM memory accounting

2024-04-23 Thread Felix Kuehling
On 2024-04-23 14:56, Mukul Joshi wrote: Subtract the VRAM pinned memory when checking for available memory in amdgpu_amdkfd_reserve_mem_limit function since that memory is not available for use. Signed-off-by: Mukul Joshi Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdgpu

Re: [PATCH] drm/amdgpu: Fix two reset triggered in a row

2024-04-23 Thread Felix Kuehling
On 2024-04-23 01:50, Christian König wrote: Am 22.04.24 um 21:45 schrieb Yunxiang Li: Reset request from KFD is missing a check for if a reset is already in progress, this causes a second reset to be triggered right after the previous one finishes. Add the check to align with the other reset

Re: [PATCH] drm/amdgpu: Fix two reset triggered in a row

2024-04-22 Thread Felix Kuehling
reset sources. Acked-by: Alex Deucher Reviewed-by: Felix Kuehling Signed-off-by: Yunxiang Li --- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu

Re: [PATCH] drm/amdkfd: Add VRAM accounting for SVM migration

2024-04-19 Thread Felix Kuehling
. Signed-off-by: Mukul Joshi Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 16 +++- drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 2 +- 2 files changed, 16 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu

[PATCH] drm/amdkfd: Fix rescheduling of restore worker

2024-04-19 Thread Felix Kuehling
Handle the case that the restore worker was already scheduled by another eviction while the restore was in progress. Fixes: 9a1c1339abf9 ("drm/amdkfd: Run restore_workers on freezable WQs") Signed-off-by: Felix Kuehling --- drivers/gpu/drm/amd/amdkfd/kfd_process.c | 6 +++--- 1 file

Re: [PATCH v2] drm/amdkfd: make sure VM is ready for updating operations

2024-04-18 Thread Felix Kuehling
separate loops in amdgpu_amdkfd_restore_process_bos. (Felix) 1.Validate BOs 2.Validate VM (and DMABuf attachments) 3.Update page tables for the BOs validated above Fixes: 2fdba514ad5a ("drm/amdgpu: Auto-validate DMABuf imports in compute VMs") Signed-off-by: Lang Yu Reviewed

[PATCH] drm/amdgpu: Update BO eviction priorities

2024-04-18 Thread Felix Kuehling
Make SVM BOs more likely to get evicted than other BOs. These BOs opportunistically use available VRAM, but can fall back relatively seamlessly to system memory. It also avoids SVM migrations evicting other, more important BOs as they will evict other SVM allocations first. Signed-off-by: Felix

Re: [PATCH] drm/amdgpu/mes11: print MES opcodes rather than numbers

2024-04-18 Thread Felix Kuehling
+ "MES_SCH_API_UPDATE_ROOT_PAGE_TABLE", + "MES_SCH_API_AMD_LOG", Maybe drop the prefixes. They don't add any information value and only bloat the log messages and module binary size. Other than that, the patch is Acked-by: Felix Kuehling +}; + +static const

[PATCH] drm/amdkfd: Fix eviction fence handling

2024-04-17 Thread Felix Kuehling
Handle case that dma_fence_get_rcu_safe returns NULL. If restore work is already scheduled, only update its timer. The same work item cannot be queued twice, so undo the extra queue eviction. Fixes: 9a1c1339abf9 ("drm/amdkfd: Run restore_workers on freezable WQs") Signed-off-by: Feli

Re: [PATCH] rock-dgb_defconfig: Update for Linux 6.7 with UBSAN

2024-04-16 Thread Felix Kuehling
On 2024-04-16 13:02, Chen, Xiaogang wrote: On 4/15/2024 2:49 PM, Felix Kuehling wrote: Caution: This message originated from an External Source. Use proper caution when opening attachments, clicking links, or responding. make rock-dbg_defconfig make savedefconfig cp defconfig arch/x86

Re: [PATCH] drm/amdkfd: fix NULL pointer dereference

2024-04-15 Thread Felix Kuehling
alling dma_fence_signal and dma_fence_put with zero fences to rely on checking parameters in DMA API. Cc: Alex Deucher Cc: Christian Koenig Cc: Xiaogang Chen Cc: Felix Kuehling Signed-off-by: Vitaly Prosyak --- drivers/gpu/drm/amd/amdkfd/kfd_process.c | 10 ++ 1 file changed, 6 insertions

[PATCH] rock-dgb_defconfig: Update for Linux 6.7 with UBSAN

2024-04-15 Thread Felix Kuehling
make rock-dbg_defconfig make savedefconfig cp defconfig arch/x86/config/rock-dbg_defconfig This also enables UBSAN, which can help catch some types of bugs at compile time. Signed-off-by: Felix Kuehling --- arch/x86/configs/rock-dbg_defconfig | 46 + 1 file changed

[PATCH] drm/amdkfd: Fix memory leak in create_process failure

2024-04-10 Thread Felix Kuehling
Fix memory leak due to a leaked mmget reference on an error handling code path that is triggered when attempting to create KFD processes while a GPU reset is in progress. Fixes: 0ab2d7532b05 ("drm/amdkfd: prepare per-process debug enable and disable") CC: Xiaogang Chen Signed-off

Re: [PATCH] drm/amdkfd: make sure VM is ready for updating operations

2024-04-09 Thread Felix Kuehling
On 2024-04-08 3:55, Christian König wrote: Am 07.04.24 um 06:52 schrieb Lang Yu: When VM is in evicting state, amdgpu_vm_update_range would return -EBUSY. Then restore_process_worker runs into a dead loop. Fixes: 2fdba514ad5a ("drm/amdgpu: Auto-validate DMABuf imports in compute VMs")

Re: [PATCH 1/2] amd/amdkfd: sync all devices to wait all processes being evicted

2024-04-03 Thread Felix Kuehling
. if the process has not been evicted before doing recover, it will be restored, then caused page fault. Signed-off-by: Zhigang Luo This patch is Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdkfd/kfd_device.c | 17 ++--- 1 file changed, 6 insertions(+), 11 deletions(-) diff

Re: [PATCH 1/2] amd/amdkfd: sync all devices to wait all processes being evicted

2024-04-02 Thread Felix Kuehling
On 2024-04-01 17:53, Zhigang Luo wrote: If there are more than one device doing reset in parallel, the first device will call kfd_suspend_all_processes() to evict all processes on all devices, this call takes time to finish. other device will start reset and recover without waiting. if the

Re: Proposal to add CRIU support to DRM render nodes

2024-04-01 Thread Felix Kuehling
On 2024-04-01 12:56, Tvrtko Ursulin wrote: On 01/04/2024 17:37, Felix Kuehling wrote: On 2024-04-01 11:09, Tvrtko Ursulin wrote: On 28/03/2024 20:42, Felix Kuehling wrote: On 2024-03-28 12:03, Tvrtko Ursulin wrote: Hi Felix, I had one more thought while browsing around the amdgpu CRIU

Re: Proposal to add CRIU support to DRM render nodes

2024-04-01 Thread Felix Kuehling
On 2024-04-01 11:09, Tvrtko Ursulin wrote: On 28/03/2024 20:42, Felix Kuehling wrote: On 2024-03-28 12:03, Tvrtko Ursulin wrote: Hi Felix, I had one more thought while browsing around the amdgpu CRIU plugin. It appears it relies on the KFD support being compiled in and /dev/kfd present

Re: Proposal to add CRIU support to DRM render nodes

2024-03-28 Thread Felix Kuehling
a few weeks. Regards,   Felix Regards, Tvrtko On 15/03/2024 18:36, Tvrtko Ursulin wrote: On 15/03/2024 02:33, Felix Kuehling wrote: On 2024-03-12 5:45, Tvrtko Ursulin wrote: On 11/03/2024 14:48, Tvrtko Ursulin wrote: Hi Felix, On 06/12/2023 21:23, Felix Kuehling wrote: Executive

Re: [PATCH] drm/amdgpu: use vm_update_mode=0 as default in sriov for gfx10.3 onwards

2024-03-28 Thread Felix Kuehling
, the patch is Reviewed-by: Felix Kuehling + /* VF MMIO access (except mailbox range) from CPU +* will be blocked during sriov runtime +*/ + adev->virt.caps |= AMDGPU_VF_MMIO_ACCESS_PROTECT; + amdgpu_gmc_noretry_set(adev);

Re: [PATCH 1/2] drm/amdgpu: always allocate cleared VRAM for KFD allocations

2024-03-26 Thread Felix Kuehling
On 2024-03-26 11:52, Alex Deucher wrote: This adds allocation latency, but aligns better with user expectations. The latency should improve with the drm buddy clearing patches that Arun has been working on. If we submit this before the clear-page-tracking patches are in, this will cause

Re: [PATCH] drm/amdgpu : Increase the mes log buffer size as per new MES FW version

2024-03-26 Thread Felix Kuehling
On 2024-03-25 19:33, Liu, Shaoyun wrote: [AMD Official Use Only - General] It can cause page fault when the log size exceed the page size . I'd consider that a breaking change in the firmware that should be avoided. Is there a way the updated driver can tell the FW the log size that

Re: [PATCH] drm/amd/amdgpu: Enable IH Retry CAM by register read

2024-03-26 Thread Felix Kuehling
On 2024-03-26 12:04, Alam, Dewan wrote: [AMD Official Use Only - General] Looping in +@Zhang, Zhaochen CAM control register can only be written by PF. VF can only read the register. In SRIOV VF, the write won't work. In SRIOV case, CAM's enablement is controlled by the host. Hence, we think

Re: [PATCH 2/3] amd/amdgpu: wait no process running in kfd before resuming device

2024-03-26 Thread Felix Kuehling
On 2024-03-26 10:53, Philip Yang wrote: On 2024-03-25 14:45, Felix Kuehling wrote: On 2024-03-22 15:57, Zhigang Luo wrote: it will cause page fault after device recovered if there is a process running. Signed-off-by: Zhigang Luo Change-Id: Ib1eddb56b69ecd41fe703abd169944154f48b0cd

Re: [PATCH] drm/amdgpu : Increase the mes log buffer size as per new MES FW version

2024-03-25 Thread Felix Kuehling
On 2024-03-22 12:49, shaoyunl wrote: From MES version 0x54, the log entry increased and require the log buffer size to be increased. The 16k is maximum size agreed What happens when you run the new firmware on an old kernel that only allocates 4KB? Regards,   Felix Signed-off-by:

Re: [PATCH 2/3] amd/amdgpu: wait no process running in kfd before resuming device

2024-03-25 Thread Felix Kuehling
On 2024-03-22 15:57, Zhigang Luo wrote: it will cause page fault after device recovered if there is a process running. Signed-off-by: Zhigang Luo Change-Id: Ib1eddb56b69ecd41fe703abd169944154f48b0cd --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 ++ 1 file changed, 2 insertions(+) diff

Re: [PATCH] drm/amdkfd: Cleanup workqueue during module unload

2024-03-21 Thread Felix Kuehling
On 2024-03-20 18:52, Mukul Joshi wrote: Destroy the high priority workqueue that handles interrupts during KFD node cleanup. Signed-off-by: Mukul Joshi Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdkfd/kfd_interrupt.c | 2 ++ 1 file changed, 2 insertions(+) diff --git

Re: [PATCH] drm/amdkfd: range check cp bad op exception interrupts

2024-03-21 Thread Felix Kuehling
Tested-by: Jesse Zhang Reviewed-by: Felix Kuehling --- .../gpu/drm/amd/amdkfd/kfd_int_process_v10.c| 3 ++- .../gpu/drm/amd/amdkfd/kfd_int_process_v11.c| 3 ++- drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c | 3 ++- include/uapi/linux/kfd_ioctl.h | 17

Re: [PATCH] drm/amdkfd: Check cgroup when returning DMABuf info

2024-03-20 Thread Felix Kuehling
On 2024-03-18 16:12, Felix Kuehling wrote: On 2024-03-15 14:17, Mukul Joshi wrote: Check cgroup permissions when returning DMA-buf info and based on cgroup check return the id of the GPU that has access to the BO. Signed-off-by: Mukul Joshi ---   drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 4

Re: [PATCH] drm/amdkfd: Check cgroup when returning DMABuf info

2024-03-20 Thread Felix Kuehling
On 2024-03-20 15:09, Joshi, Mukul wrote: [AMD Official Use Only - General] -Original Message- From: Kuehling, Felix Sent: Monday, March 18, 2024 4:13 PM To: Joshi, Mukul ; amd-gfx@lists.freedesktop.org Subject: Re: [PATCH] drm/amdkfd: Check cgroup when returning DMABuf info On

Re: [PATCH] drm/amdkfd: Check cgroup when returning DMABuf info

2024-03-18 Thread Felix Kuehling
On 2024-03-15 14:17, Mukul Joshi wrote: Check cgroup permissions when returning DMA-buf info and based on cgroup check return the id of the GPU that has access to the BO. Signed-off-by: Mukul Joshi --- drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 4 ++-- 1 file changed, 2 insertions(+), 2

Re: [PATCH 05/10] drivers: use new capable_any functionality

2024-03-15 Thread Felix Kuehling
On 2024-03-15 7:37, Christian Göttsche wrote: Use the new added capable_any function in appropriate cases, where a task is required to have any of two capabilities. Reorder CAP_SYS_ADMIN last. Signed-off-by: Christian Göttsche Acked-by: Alexander Gordeev (s390 portion) Acked-by: Felix

Re: Proposal to add CRIU support to DRM render nodes

2024-03-14 Thread Felix Kuehling
On 2024-03-12 5:45, Tvrtko Ursulin wrote: On 11/03/2024 14:48, Tvrtko Ursulin wrote: Hi Felix, On 06/12/2023 21:23, Felix Kuehling wrote: Executive Summary: We need to add CRIU support to DRM render nodes in order to maintain CRIU support for ROCm application once they start relying

Re: [PATCH 2/2] drm/amdkfd: Check preemption status on all XCDs

2024-03-14 Thread Felix Kuehling
int32_t inst) +{ + if (doorbell_id) { + struct device *dev = node->adev->dev; + + if (KFD_GC_VERSION(node) == IP_VERSION(9, 4, 3)) Could this be made more generic? E.g.: if (node->adev->xcp_mgr && node->adev->xcp_mgr->num_x

Re: [PATCH AUTOSEL 5.15 3/5] drm/amdgpu: Enable gpu reset for S3 abort cases on Raven series

2024-03-13 Thread Felix Kuehling
On 2024-03-11 11:14, Sasha Levin wrote: From: Prike Liang [ Upstream commit c671ec01311b4744b377f98b0b4c6d033fe569b3 ] Currently, GPU resets can now be performed successfully on the Raven series. While GPU reset is required for the S3 suspend abort case. So now can enable gpu reset for S3

Re: [PATCH] drm/amdgpu: Do a basic health check before reset

2024-03-13 Thread Felix Kuehling
On 2024-03-13 5:41, Lijo Lazar wrote: Check if the device is present in the bus before trying to recover. It could be that device itself is lost from the bus in some hang situations. Signed-off-by: Lijo Lazar --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 24 ++ 1

Re: [PATCH] drm/amd/amdgpu: Enable IH Retry CAM by register read

2024-03-13 Thread Felix Kuehling
On 2024-03-13 13:43, Dewan Alam wrote: IH Retry CAM should be enabled by register reads instead of always being set to true. This explanation sounds odd. Your code is still writing the register first. What's the reason for reading back the register? I assume it's not needed for enabling the

Re: [PATCH v3] drm/amdgpu: Init zone device and drm client after mode-1 reset on reload

2024-03-12 Thread Felix Kuehling
causes VM clear to SDMA before SDMA init. Adding the condition in drm client creation, on top of v1, to guard against drm client creation being called multiple times. Signed-off-by: Ahmad Rehman Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 4 ++-- drivers/gpu/drm

Re: [PATCH] drm/amdgpu: Handle duplicate BOs during process restore

2024-03-11 Thread Felix Kuehling
On 2024-03-11 12:33, Christian König wrote: Am 11.03.24 um 16:33 schrieb Felix Kuehling: On 2024-03-11 11:25, Joshi, Mukul wrote: [AMD Official Use Only - General] -Original Message- From: Christian König Sent: Monday, March 11, 2024 2:50 AM To: Joshi, Mukul ; amd-gfx

Re: [PATCH] drm/amdgpu: Handle duplicate BOs during process restore

2024-03-11 Thread Felix Kuehling
On 2024-03-11 11:25, Joshi, Mukul wrote: [AMD Official Use Only - General] -Original Message- From: Christian König Sent: Monday, March 11, 2024 2:50 AM To: Joshi, Mukul ; amd-gfx@lists.freedesktop.org Cc: Kuehling, Felix Subject: Re: [PATCH] drm/amdgpu: Handle duplicate BOs during

Re: [PATCH] drm/amdgpu: Handle duplicate BOs during process restore

2024-03-08 Thread Felix Kuehling
e log. validation can fail intermittently and rescheduling the worker is there to handle it. With that fixed, the patch is Reviewed-by: Felix Kuehling goto validate_map_fail; + } /* Update mappings not managed by KFD */ list_for_each_entry(peer_vm, _info->vm_list_head,

Re: [PATCH v5 1/2] drm/amdgpu: implement TLB flush fence

2024-03-07 Thread Felix Kuehling
On 2024-03-07 1:39, Sharma, Shashank wrote: On 07/03/2024 00:54, Felix Kuehling wrote: On 2024-03-06 09:41, Shashank Sharma wrote: From: Christian König The problem is that when (for example) 4k pages are replaced with a single 2M page we need to wait for change to be flushed out

Re: [PATCH v5 1/2] drm/amdgpu: implement TLB flush fence

2024-03-06 Thread Felix Kuehling
(f->dependency) in tlb_fence_work (Christian) - move the misplaced fence_create call to the end (Philip) V5: - free the f->dependency properly (Christian) Cc: Christian Koenig Cc: Felix Kuehling Cc: Rajneesh Bhardwaj Cc: Alex Deucher Reviewed-by: Shashank Sharma Signed-off-by:

Re: [PATCH 2/3] drm/amdgpu: sdma support for sriov cpx mode

2024-03-05 Thread Felix Kuehling
On 2024-03-05 14:49, Dhume, Samir wrote: [AMD Official Use Only - General] -Original Message- From: Kuehling, Felix Sent: Monday, March 4, 2024 6:47 PM To: Dhume, Samir ; amd-gfx@lists.freedesktop.org Cc: Lazar, Lijo ; Wan, Gavin ; Liu, Leo ; Deucher, Alexander Subject: Re: [PATCH

Re: [PATCH] drm/amdkfd: make kfd_class constant

2024-03-05 Thread Felix Kuehling
nly memory, instead of having to be dynamically allocated at boot time. Cc: Greg Kroah-Hartman Suggested-by: Greg Kroah-Hartman Signed-off-by: Ricardo B. Marliere The patch looks good to me. Do you want me to apply this to Alex's amd-staging-drm-next? Reviewed-by: Felix Kuehling --- d
