Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-03 Thread Christian König
On 03.05.23 at 21:14, André Almeida wrote: On 03/05/2023 14:43, Timur Kristóf wrote: Hi Felix, On Wed, 2023-05-03 at 11:08 -0400, Felix Kuehling wrote: That's the worst-case scenario where you're debugging HW or FW issues. Those should be pretty rare post-bringup. But are there hangs cause

RE: [PATCH v4] drm/amdgpu: drop gfx_v11_0_cp_ecc_error_irq_funcs

2023-05-03 Thread Zhang, Horatio
[AMD Official Use Only - General] Hi Hawking, Thank you for your review. I will change the judgment criteria to if (adev->gfx.cp_error_irq.funcs), and submit this patch to amd-staging-drm-next. Regards, Horatio -----Original Message----- From: Zhang, Hawking Sent: Friday, April 28, 2023 1:

[pull] amdgpu drm-fixes-6.4

2023-05-03 Thread Alex Deucher
Hi Dave, Daniel, Fixes for 6.4. The following changes since commit d893f39320e1248d1c97fde0d6e51e5ea008a76b: drm/amd/display: Lowering min Z8 residency time (2023-04-26 22:53:58 -0400) are available in the Git repository at: https://gitlab.freedesktop.org/agd5f/linux.git tags/amd-drm-fixe

[PATCH] drm:amd:amdgpu: Fix missing buffer object unlock in failure path

2023-05-03 Thread Sukrut Bellary
smatch warning - 1) drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c:3615 gfx_v9_0_kiq_resume() warn: inconsistent returns 'ring->mqd_obj->tbo.base.resv'. 2) drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c:6901 gfx_v10_0_kiq_resume() warn: inconsistent returns 'ring->mqd_obj->tbo.base.resv'. Signed-off-by: Sukrut Be

Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-03 Thread Marek Olšák
On Wed, May 3, 2023, 14:53 André Almeida wrote: > On 03/05/2023 14:08, Marek Olšák wrote: > > GPU hangs are pretty common post-bringup. They are not common per user, > > but if we gather all hangs from all users, we can have lots and lots of > > them. > > > > GPU hangs are indeed not very debu

Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-03 Thread André Almeida
On 03/05/2023 14:43, Timur Kristóf wrote: Hi Felix, On Wed, 2023-05-03 at 11:08 -0400, Felix Kuehling wrote: That's the worst-case scenario where you're debugging HW or FW issues. Those should be pretty rare post-bringup. But are there hangs caused by user mode driver or application bugs tha

Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-03 Thread André Almeida
On 03/05/2023 14:08, Marek Olšák wrote: GPU hangs are pretty common post-bringup. They are not common per user, but if we gather all hangs from all users, we can have lots and lots of them. GPU hangs are indeed not very debuggable. There are however some things we can do: - Identify the h

Re: drm/amdgpu: fix an amdgpu_irq_put() issue in gmc_v9_0_hw_fini()

2023-05-03 Thread Limonciello, Mario
On 5/2/2023 11:51 AM, Hamza Mahfooz wrote: As made mention of, in commit 9128e6babf10 ("drm/amdgpu: fix amdgpu_irq_put call trace in gmc_v10_0_hw_fini") and commit c094b8923bdd ("drm/amdgpu: fix amdgpu_irq_put call trace in gmc_v11_0_hw_fini"). It is meaningless to call amdgpu_irq_put() for gmc

Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-03 Thread Timur Kristóf
Hi Felix, On Wed, 2023-05-03 at 11:08 -0400, Felix Kuehling wrote: > That's the worst-case scenario where you're debugging HW or FW > issues. > Those should be pretty rare post-bringup. But are there hangs caused > by > user mode driver or application bugs that are easier to debug and > probabl

Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-03 Thread Marek Olšák
WRITE_DATA with ENGINE=PFP will execute the packet on the frontend engine, while ENGINE=ME will execute the packet on the backend engine. Marek On Wed, May 3, 2023 at 1:08 PM Marek Olšák wrote: > GPU hangs are pretty common post-bringup. They are not common per user, > but if we gather all hang

Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-03 Thread Marek Olšák
GPU hangs are pretty common post-bringup. They are not common per user, but if we gather all hangs from all users, we can have lots and lots of them. GPU hangs are indeed not very debuggable. There are however some things we can do: - Identify the hanging IB by its VA (the kernel should know it) -

RE: [PATCH v2] drm/amdkfd: Expose proc sysfs folder contents after permission check

2023-05-03 Thread Kasiviswanathan, Harish
[AMD Official Use Only - General] One minor comment inline. -----Original Message----- From: amd-gfx On Behalf Of Sreekant Somasekharan Sent: Friday, April 28, 2023 3:12 PM To: amd-gfx@lists.freedesktop.org Cc: Somasekharan, Sreekant Subject: [PATCH v2] drm/amdkfd: Expose proc sysfs folder con

Re: [PATCH] drm/amdgpu: unlock on error in gfx_v9_4_3_kiq_resume()

2023-05-03 Thread Alex Deucher
Applied. Thanks! Alex On Wed, May 3, 2023 at 11:29 AM Dan Carpenter wrote: > > Smatch complains that we need to drop this lock before returning. > > drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c:1838 gfx_v9_4_3_kiq_resume() > warn: inconsistent returns 'ring->mqd_obj->tbo.base.resv'. > > Fixe

Re: [PATCH] drm/amdgpu: unlock the correct lock in amdgpu_gfx_enable_kcq()

2023-05-03 Thread Alex Deucher
Applied. Thanks! On Wed, May 3, 2023 at 11:29 AM Dan Carpenter wrote: > > We changed which lock we are supposed to take but this error path > was accidentally over looked so it still drops the old lock. > > Fixes: def799c6596d ("drm/amdgpu: add multi-xcc support to amdgpu_gfx > interfaces (v4)"

Re: [Intel-gfx] [RFC PATCH 2/4] drm/cgroup: Add memory accounting to DRM cgroup

2023-05-03 Thread Maarten Lankhorst
On 2023-05-03 17:31, Tvrtko Ursulin wrote: On 03/05/2023 09:34, Maarten Lankhorst wrote: Based roughly on the rdma and misc cgroup controllers, with a lot of the accounting code borrowed from rdma. The interface is simple: - populate drmcgroup_device->regions[..] name and size for each activ

Re: [Intel-gfx] [RFC PATCH 2/4] drm/cgroup: Add memory accounting to DRM cgroup

2023-05-03 Thread Tvrtko Ursulin
On 03/05/2023 09:34, Maarten Lankhorst wrote: Based roughly on the rdma and misc cgroup controllers, with a lot of the accounting code borrowed from rdma. The interface is simple: - populate drmcgroup_device->regions[..] name and size for each active region. - Call drm(m)cg_register_device(

[PATCH] drm/amdgpu: unlock on error in gfx_v9_4_3_kiq_resume()

2023-05-03 Thread Dan Carpenter
Smatch complains that we need to drop this lock before returning. drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c:1838 gfx_v9_4_3_kiq_resume() warn: inconsistent returns 'ring->mqd_obj->tbo.base.resv'. Fixes: 86301129698b ("drm/amdgpu: split gc v9_4_3 functionality from gc v9_0") Signed-off-by: D

[PATCH] drm/amdgpu: unlock the correct lock in amdgpu_gfx_enable_kcq()

2023-05-03 Thread Dan Carpenter
We changed which lock we are supposed to take but this error path was accidentally over looked so it still drops the old lock. Fixes: def799c6596d ("drm/amdgpu: add multi-xcc support to amdgpu_gfx interfaces (v4)") Signed-off-by: Dan Carpenter --- drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 2 +-

Re: [PATCH 2/2] drm/amdgpu: drop unused function

2023-05-03 Thread Christian König
On 03.05.23 at 17:24, Alex Deucher wrote: On Wed, May 3, 2023 at 11:20 AM Christian König wrote: Reviewed-by: Christian König for this one. Can't say much about the first one. That was just the hack because some bit in the IP version was re-used on SRIOV, wasn't it? Yes, the high 2 bits of

Re: [PATCH 2/2] drm/amdgpu: drop unused function

2023-05-03 Thread Alex Deucher
On Wed, May 3, 2023 at 11:20 AM Christian König wrote: > > Reviewed-by: Christian König for this one. > > Can't say much about the first one. That was just the hack because some > bit in the IP version was re-used on SRIOV, wasn't it? Yes, the high 2 bits of the revision number were reused for a

Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-03 Thread Christian König
On 03.05.23 at 17:08, Felix Kuehling wrote: On 2023-05-03 at 03:59, Christian König wrote: On 02.05.23 at 20:41, Alex Deucher wrote: On Tue, May 2, 2023 at 11:22 AM Timur Kristóf wrote: [SNIP] In my opinion, the correct solution to those problems would be if the kernel could give userspac

Re: [PATCH 2/2] drm/amdgpu: drop unused function

2023-05-03 Thread Christian König
Reviewed-by: Christian König for this one. Can't say much about the first one. That was just the hack because some bit in the IP version was re-used on SRIOV, wasn't it? Christian. On 03.05.23 at 17:02, Alex Deucher wrote: Ping? On Thu, Apr 27, 2023 at 2:34 PM Alex Deucher wrote: amdgpu

Re: [PATCH 2/2] drm/amdgpu: drop unused function

2023-05-03 Thread Luben Tuikov
I suppose we have this information elsewhere. Series is: Reviewed-by: Luben Tuikov Regards, Luben On 2023-05-03 11:02, Alex Deucher wrote: > Ping? > > On Thu, Apr 27, 2023 at 2:34 PM Alex Deucher > wrote: >> >> amdgpu_discovery_get_ip_version() has not been used since >> commit c40bdfb2ffa4

Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-03 Thread Felix Kuehling
On 2023-05-03 at 03:59, Christian König wrote: On 02.05.23 at 20:41, Alex Deucher wrote: On Tue, May 2, 2023 at 11:22 AM Timur Kristóf wrote: [SNIP] In my opinion, the correct solution to those problems would be if the kernel could give userspace the necessary information about a GPU hang b

Re: [PATCH 2/2] drm/amdgpu: drop unused function

2023-05-03 Thread Alex Deucher
Ping? On Thu, Apr 27, 2023 at 2:34 PM Alex Deucher wrote: > > amdgpu_discovery_get_ip_version() has not been used since > commit c40bdfb2ffa4 ("drm/amdgpu: fix incorrect VCN revision in SRIOV") > so drop it. > > Signed-off-by: Alex Deucher > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c |

Re: [PATCH 1/2] drm/amdgpu: drop invalid IP revision

2023-05-03 Thread Alex Deucher
Ping? On Thu, Apr 27, 2023 at 2:34 PM Alex Deucher wrote: > > This was already fixed and dropped in: > commit baf3f8f37406 ("drm/amdgpu: handle SRIOV VCN revision parsing") > commit c40bdfb2ffa4 ("drm/amdgpu: fix incorrect VCN revision in SRIOV") > But seems to have been accidently been left arou

Re: [PATCH] drm/amdgpu: put MQDs in VRAM

2023-05-03 Thread Luben Tuikov
Reviewed-by: Luben Tuikov Regards, Luben On 2023-05-01 10:55, Alex Deucher wrote: > Ping? > > Alex > > On Fri, Apr 28, 2023 at 11:57 AM Alex Deucher > wrote: >> >> Reduces preemption latency. >> Only enable this for gfx10 and 11 for now >> to avoid changing behavior on gfx 8 and 9. >> >> v2:

Re: [Intel-xe] [RFC PATCH 3/4] drm/ttm: Handle -EAGAIN in ttm_resource_alloc as -ENOSPC.

2023-05-03 Thread Thomas Hellström
Hi, Maarten On 5/3/23 10:34, Maarten Lankhorst wrote: This allows the drm cgroup controller to return no space is available.. XXX: This is a hopeless simplification that changes behavior, and returns -ENOSPC even if we could evict ourselves from the current cgroup. Ideally, the eviction code b

[PATCH] drm/amdgpu: drop redundant sched job cleanup when cs is aborted

2023-05-03 Thread Guchun Chen
Once command submission failed due to userptr invalidation in amdgpu_cs_submit, legacy code will perform cleanup of scheduler job. However, it's not needed at all, as former commit has integrated job cleanup stuff into amdgpu_job_free. Otherwise, because of double free, a NULL pointer dereference w

Re: [Intel-xe] [RFC PATCH 3/4] drm/ttm: Handle -EAGAIN in ttm_resource_alloc as -ENOSPC.

2023-05-03 Thread Maarten Lankhorst
On 2023-05-03 11:11, Thomas Hellström wrote: Hi, Maarten On 5/3/23 10:34, Maarten Lankhorst wrote: This allows the drm cgroup controller to return no space is available.. XXX: This is a hopeless simplification that changes behavior, and returns -ENOSPC even if we could evict ourselves from th

Re: [PATCH v2] drm/amd/amdgpu: Fix errors & warnings in amdgpu _bios, _cs, _dma_buf, _fence.c

2023-05-03 Thread Christian König
On 03.05.23 at 11:00, Srinivasan Shanmugam wrote: The following checkpatch errors & warning is removed. ERROR: else should follow close brace '}' ERROR: trailing statements should be on next line WARNING: Prefer 'unsigned int' to bare use of 'unsigned' WARNING: Possible repeated word: 'Fences'

[PATCH] drm/amd/amdgpu: Remove unnecessary OOM messages

2023-05-03 Thread Srinivasan Shanmugam
The following checkpatch warning is removed. WARNING: Possible unnecessary 'out of memory' message Cc: Christian König Cc: Alex Deucher Signed-off-by: Srinivasan Shanmugam --- drivers/gpu/drm/amd/amdgpu/amdgpu_bios.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/drive

[PATCH v2] drm/amd/amdgpu: Fix errors & warnings in amdgpu _bios, _cs, _dma_buf, _fence.c

2023-05-03 Thread Srinivasan Shanmugam
The following checkpatch errors & warning is removed. ERROR: else should follow close brace '}' ERROR: trailing statements should be on next line WARNING: Prefer 'unsigned int' to bare use of 'unsigned' WARNING: Possible repeated word: 'Fences' WARNING: Missing a blank line after declarations WARN

Re: [PATCH] drm/amd/amdgpu: Fix errors & warnings in amdgpu _bios, _cs, _dma_buf, _fence.c

2023-05-03 Thread Christian König
On 03.05.23 at 10:46, Srinivasan Shanmugam wrote: The following checkpatch errors & warning is removed. ERROR: else should follow close brace '}' ERROR: trailing statements should be on next line WARNING: Prefer 'unsigned int' to bare use of 'unsigned' WARNING: Possible repeated word: 'Fences'

[PATCH] drm/amd/amdgpu: Fix errors & warnings in amdgpu _bios, _cs, _dma_buf, _fence.c

2023-05-03 Thread Srinivasan Shanmugam
The following checkpatch errors & warning is removed. ERROR: else should follow close brace '}' ERROR: trailing statements should be on next line WARNING: Prefer 'unsigned int' to bare use of 'unsigned' WARNING: Possible repeated word: 'Fences' WARNING: Missing a blank line after declarations WARN

[RFC PATCH 2/4] drm/cgroup: Add memory accounting to DRM cgroup

2023-05-03 Thread Maarten Lankhorst
Based roughly on the rdma and misc cgroup controllers, with a lot of the accounting code borrowed from rdma. The interface is simple: - populate drmcgroup_device->regions[..] name and size for each active region. - Call drm(m)cg_register_device() - Use drmcg_try_charge to check if you can alloca

[RFC PATCH 4/4] drm/xe: Add support for the drm cgroup

2023-05-03 Thread Maarten Lankhorst
Add some code to implement basic support for the vram0, vram1 and stolen memory regions. I fear the try_charge code should probably be done inside TTM. This code should interact with the shrinker, but for a simple RFC it's good enough. Signed-off-by: Maarten Lankhorst --- drivers/gpu/drm/xe/xe_

[RFC PATCH 3/4] drm/ttm: Handle -EAGAIN in ttm_resource_alloc as -ENOSPC.

2023-05-03 Thread Maarten Lankhorst
This allows the drm cgroup controller to return no space is available.. XXX: This is a hopeless simplification that changes behavior, and returns -ENOSPC even if we could evict ourselves from the current cgroup. Ideally, the eviction code becomes cgroup aware, and will force eviction from the cur

[RFC PATCH 1/4] cgroup: Add the DRM cgroup controller

2023-05-03 Thread Maarten Lankhorst
From: Tvrtko Ursulin Skeleton controller without any functionality. Signed-off-by: Tvrtko Ursulin Signed-off-by: Maarten Lankhorst --- include/linux/cgroup_drm.h| 9 ++ include/linux/cgroup_subsys.h | 4 +++ init/Kconfig | 7 kernel/cgroup/Makefile| 1

[RFC PATCH 0/4] Add support for DRM cgroup memory accounting.

2023-05-03 Thread Maarten Lankhorst
RFC as I'm looking for comments. For long running compute, it can be beneficial to partition the GPU memory between cgroups, so each cgroup can use its maximum amount of memory without interfering with other scheduled jobs. Done properly, this can alleviate the need for eviction, which might resul

Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

2023-05-03 Thread Christian König
On 02.05.23 at 20:41, Alex Deucher wrote: On Tue, May 2, 2023 at 11:22 AM Timur Kristóf wrote: [SNIP] In my opinion, the correct solution to those problems would be if the kernel could give userspace the necessary information about a GPU hang before a GPU reset. The fundamental problem he