RE: [PATCH] mm: Remove double faults once write a device pfn

2024-01-25 Thread Zhou, Xianrong
[AMD Official Use Only - General]

>  The vmf_insert_pfn_prot could cause unnecessary double faults
>  on a device pfn. Because currently the vmf_insert_pfn_prot does
>  not make the pfn writable so the pte entry is normally
>  read-only or dirty catching.
> >>> What? How do you got to this conclusion?
> >> Sorry. I did not mention that this problem only exists on arm64
> platform.
> > Ok, that makes at least a little bit more sense.
> >
> >> Because on arm64 platform the PTE_RDONLY is automatically
> >> attached to the userspace pte entries even through VM_WRITE +
> VM_SHARE.
> >> The  PTE_RDONLY needs to be cleared in vmf_insert_pfn_prot.
> >> However vmf_insert_pfn_prot do not make the pte writable passing
> >> false @mkwrite to insert_pfn.
> > Question is why is arm64 doing this? As far as I can see they must
> > have some hardware reason for that.
> >
> > The mkwrite parameter to insert_pfn() was added by commit
> > b2770da642540 to make insert_pfn() look more like insert_pfn_pmd()
> > so that the DAX code can insert PTEs which are writable and dirty
> > at the same
> >>> time.
>  This is one scenario to do so. In fact on arm64 there are many
>  scenarios could be to do so. So we can let vmf_insert_pfn_prot
>  supporting @mkwrite for drivers at core layer and let drivers to
>  decide whether or not to make writable and dirty at one time. The
>  patch did this. Otherwise double faults on arm64 when call
> >>> vmf_insert_pfn_prot.
> >>>
> >>> Well, that doesn't answer my question why arm64 is double faulting
> >>> in the first place,.
> >>>
> >>
> >> Eh.
> >>
> >> On arm64 When userspace mmap() with PROT_WRITE and MAP_SHARED
> the
> >> vma->vm_page_prot has the PTE_RDONLY and PTE_WRITE within
> >> PAGE_SHARED_EXEC. (seeing arm64 protection_map)
>
> Well that's your observation, but not the explanation why arm64 is doing this.
>
> See this would have quite some negative impact on performance, not only for
> gfx drivers but in general.
>
> So either the observation is incorrect or there is a *really* good reason why
> arm64 is taking this performance penalty.
>
> >> When write the userspace virtual address the first fault happen and
> >> call into driver's .fault->ttm_bo_vm_fault_reserved->vmf_insert_pfn_prot-
> >insert_pfn.
> >> The insert_pfn will establish the pte entry. However the
> >> vmf_insert_pfn_prot pass false @mkwrite to insert_pfn by default and
> >> so insert_pfn could not make the pfn writable and it do not call
> >> maybe_mkwrite(pte_mkdirty(entry), vma) to clear the PTE_RDONLY bit. So
> the pte entry is actually write protection for mmu.
> >> So when the first fault return and re-execute the store instruction
> >> the second fault happen again. And in second fault it just only do
> >> pte_mkdirty(entry) which clear the PTE_RDONLY.
> > It depends if the ARM64 CPU in question supports hardware dirty bit
> > management (DBM). If that is the case and PTE_DBM (ie. PTE_WRITE) is
> > set HW will automatically clear PTE_RDONLY bit to mark the entry dirty
> > instead of raising a write fault. So you shouldn't see a double fault
> > if PTE_DBM/WRITE is set.

Thanks. This is reasonable. But I still really had the double faults in my 
project
platform.

> >
> > On ARM64 you can kind of think of PTE_RDONLY as the HW dirty bit and
> > PTE_DBM as the read/write permission bit with SW being responsible for
> > updating PTE_RDONLY via the fault handler if DBM is not supported by HW.
> >
> > At least that's my understanding from having hacked on this in the
> > past. You can see all this weirdness happening in the definitions of
> > pte_dirty() and pte_write() for ARM64.
>
> +1
>
> Thanks a lot for that, this was exactly the information I was looking for.
>
> In this light it makes this patch here look unnecessary and questionable at
> best.
>
> Xianrong if you have an arm64 platform which really double faults (confirmed
> through a debugger for example) then you have to ask why this platform
> shows this behavior and not try to work around it.
>
> Behaviors like those usually have a very very good reason and without a
> confirmed explanation I'm not allowing any patch in which would disable stuff
> like that.

Thanks. You are very right. I found CONFIG_ARM64_HW_AFDBM is not enabled
in my project. So actually PTE_DBM is not hardware bit. It is software bit.
Now i understand why arm64 to do this attaching PTE_RDONLY automatically.
It is compatible for PTE_DBM hardware or not. This answers your question.

So I met double faults in my project when CONFIG_ARM64_HW_AFDBM is false.
However these double faults can be eliminated when i replace the 
vmf_insert_mixed
to vmf_insert_mixed_mkwrite in drivers under CONFIG_ARM64_HW_AFDBM = false.
The vmf_insert_pfn_prot is similar with vmf_insert_mixed. It should supply 
@mkwrite
parameter rather than false mkwrite by default.

So i think if you forgot to e

RE: [PATCH v2] drm/amdkfd: reserve the BO before validating it

2024-01-25 Thread Yu, Lang
[AMD Official Use Only - General]

>-Original Message-
>From: Kuehling, Felix 
>Sent: Thursday, January 25, 2024 5:41 AM
>To: Yu, Lang ; amd-gfx@lists.freedesktop.org
>Cc: Francis, David 
>Subject: Re: [PATCH v2] drm/amdkfd: reserve the BO before validating it
>
>On 2024-01-22 4:08, Lang Yu wrote:
>> Fixes: 410f08516e0f ("drm/amdkfd: Move dma unmapping after TLB flush")
>>
>> v2:
>> Avoid unmapping attachment twice when ERESTARTSYS.
>>
>> [   41.708711] WARNING: CPU: 0 PID: 1463 at
>drivers/gpu/drm/ttm/ttm_bo.c:846 ttm_bo_validate+0x146/0x1b0 [ttm]
>> [   41.708989] Call Trace:
>> [   41.708992]  
>> [   41.708996]  ? show_regs+0x6c/0x80
>> [   41.709000]  ? ttm_bo_validate+0x146/0x1b0 [ttm]
>> [   41.709008]  ? __warn+0x93/0x190
>> [   41.709014]  ? ttm_bo_validate+0x146/0x1b0 [ttm]
>> [   41.709024]  ? report_bug+0x1f9/0x210
>> [   41.709035]  ? handle_bug+0x46/0x80
>> [   41.709041]  ? exc_invalid_op+0x1d/0x80
>> [   41.709048]  ? asm_exc_invalid_op+0x1f/0x30
>> [   41.709057]  ? amdgpu_amdkfd_gpuvm_dmaunmap_mem+0x2c/0x80
>[amdgpu]
>> [   41.709185]  ? ttm_bo_validate+0x146/0x1b0 [ttm]
>> [   41.709197]  ? amdgpu_amdkfd_gpuvm_dmaunmap_mem+0x2c/0x80
>[amdgpu]
>> [   41.709337]  ? srso_alias_return_thunk+0x5/0x7f
>> [   41.709346]  kfd_mem_dmaunmap_attachment+0x9e/0x1e0 [amdgpu]
>> [   41.709467]  amdgpu_amdkfd_gpuvm_dmaunmap_mem+0x56/0x80
>[amdgpu]
>> [   41.709586]  kfd_ioctl_unmap_memory_from_gpu+0x1b7/0x300 [amdgpu]
>> [   41.709710]  kfd_ioctl+0x1ec/0x650 [amdgpu]
>> [   41.709822]  ? __pfx_kfd_ioctl_unmap_memory_from_gpu+0x10/0x10
>[amdgpu]
>> [   41.709945]  ? srso_alias_return_thunk+0x5/0x7f
>> [   41.709949]  ? tomoyo_file_ioctl+0x20/0x30
>> [   41.709959]  __x64_sys_ioctl+0x9c/0xd0
>> [   41.709967]  do_syscall_64+0x3f/0x90
>> [   41.709973]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
>>
>> Signed-off-by: Lang Yu 
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h|  2 +-
>>   .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  | 28
>+--
>>   drivers/gpu/drm/amd/amdkfd/kfd_chardev.c  |  4 ++-
>>   3 files changed, 29 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>> index 584a0cea5572..41854417e487 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>> @@ -311,7 +311,7 @@ int
>amdgpu_amdkfd_gpuvm_map_memory_to_gpu(struct amdgpu_device *adev,
>>struct kgd_mem *mem, void
>*drm_priv);
>>   int amdgpu_amdkfd_gpuvm_unmap_memory_from_gpu(
>>  struct amdgpu_device *adev, struct kgd_mem *mem, void
>*drm_priv);
>> -void amdgpu_amdkfd_gpuvm_dmaunmap_mem(struct kgd_mem *mem, void
>> *drm_priv);
>> +int amdgpu_amdkfd_gpuvm_dmaunmap_mem(struct kgd_mem *mem, void
>> +*drm_priv);
>>   int amdgpu_amdkfd_gpuvm_sync_memory(
>>  struct amdgpu_device *adev, struct kgd_mem *mem, bool intr);
>>   int amdgpu_amdkfd_gpuvm_map_gtt_bo_to_kernel(struct kgd_mem *mem,
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> index 6f3a4cb2a9ef..7a050d46fa4d 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> @@ -2088,21 +2088,43 @@ int
>amdgpu_amdkfd_gpuvm_map_memory_to_gpu(
>>  return ret;
>>   }
>>
>> -void amdgpu_amdkfd_gpuvm_dmaunmap_mem(struct kgd_mem *mem, void
>> *drm_priv)
>> +int amdgpu_amdkfd_gpuvm_dmaunmap_mem(struct kgd_mem *mem, void
>> +*drm_priv)
>>   {
>>  struct kfd_mem_attachment *entry;
>>  struct amdgpu_vm *vm;
>> +bool reserved = false;
>> +int ret = 0;
>>
>>  vm = drm_priv_to_vm(drm_priv);
>>
>>  mutex_lock(&mem->lock);
>>
>>  list_for_each_entry(entry, &mem->attachments, list) {
>> -if (entry->bo_va->base.vm == vm)
>> -kfd_mem_dmaunmap_attachment(mem, entry);
>> +if (entry->bo_va->base.vm != vm)
>> +continue;
>> +if (entry->type == KFD_MEM_ATT_SHARED ||
>> +entry->type == KFD_MEM_ATT_DMABUF)
>> +continue;
>> +if (!entry->bo_va->base.bo->tbo.ttm->sg)
>> +continue;
>
>You're going to great lengths to avoid the reservation when it's not needed by
>kfd_mem_dmaunmap_attachment. However, this feels a bit fragile. Any changes
>in the kfd_mem_dmaunmap_* functions could break this.
>
>> +
>> +if (!reserved) {
>> +ret = amdgpu_bo_reserve(mem->bo, true);
>> +if (ret)
>> +goto out;
>> +reserved = true;
>> +}
>> +
>> +kfd_mem_dmaunmap_attachment(mem, entry);
>>  }
>>
>> +if (reserved)
>> +amdgpu_bo_unreserve(mem->bo);
>> +
>> +out:
>>  mutex_unlock(&mem->lock);
>> +
>> +return ret;
>>   }
>>
>>   int amdgpu_amdkfd_gpuv

RE: [PATCH 1/2] drm/amdgpu: skip to program GFXDEC registers for PM abort case

2024-01-25 Thread Liang, Prike
[AMD Official Use Only - General]

> From: Alex Deucher 
> Sent: Thursday, January 25, 2024 11:24 PM
> To: Liang, Prike 
> Cc: amd-gfx@lists.freedesktop.org; Deucher, Alexander
> ; Sharma, Deepak
> 
> Subject: Re: [PATCH 1/2] drm/amdgpu: skip to program GFXDEC registers for
> PM abort case
>
> On Thu, Jan 25, 2024 at 10:22 AM Alex Deucher 
> wrote:
> >
> > On Wed, Jan 24, 2024 at 9:39 PM Liang, Prike 
> wrote:
> > >
> > > [AMD Official Use Only - General]
> > >
> > > Hi, Alex
> > > > -Original Message-
> > > > From: Alex Deucher 
> > > > Sent: Wednesday, January 24, 2024 11:59 PM
> > > > To: Liang, Prike 
> > > > Cc: amd-gfx@lists.freedesktop.org; Deucher, Alexander
> > > > ; Sharma, Deepak
> > > > 
> > > > Subject: Re: [PATCH 1/2] drm/amdgpu: skip to program GFXDEC
> > > > registers for PM abort case
> > > >
> > > > On Wed, Jan 24, 2024 at 2:12 AM Prike Liang 
> wrote:
> > > > >
> > > > > In the PM abort cases, the gfx power rail doesn't turn off so
> > > > > some GFXDEC registers/CSB can't reset to default vaule. In order
> > > > > to avoid unexpected problem now need skip to program GFXDEC
> > > > > registers and bypass issue CSB packet for PM abort case.
> > > > >
> > > > > Signed-off-by: Prike Liang 
> > > > > ---
> > > > >  drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 +
> > > > >  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 1 +
> > > > >  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c   | 4 
> > > > >  3 files changed, 6 insertions(+)
> > > > >
> > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > > > > index c5f3859fd682..26d983eb831b 100644
> > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > > > > @@ -1079,6 +1079,7 @@ struct amdgpu_device {
> > > > > boolin_s3;
> > > > > boolin_s4;
> > > > > boolin_s0ix;
> > > > > +   boolpm_complete;
> > > > >
> > > > > enum pp_mp1_state   mp1_state;
> > > > > struct amdgpu_doorbell_index doorbell_index; diff --git
> > > > > a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > > > index 475bd59c9ac2..a01f9b0c2f30 100644
> > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > > > @@ -2486,6 +2486,7 @@ static int
> > > > > amdgpu_pmops_suspend_noirq(struct
> > > > device *dev)
> > > > > struct drm_device *drm_dev = dev_get_drvdata(dev);
> > > > > struct amdgpu_device *adev = drm_to_adev(drm_dev);
> > > > >
> > > > > +   adev->pm_complete = true;
> > > >
> > > > This needs to be cleared somewhere on resume.
> > > [Liang, Prike]  This flag is designed to indicate the amdgpu device
> suspension process status and will update the patch and clear it at the
> amdgpu suspension beginning point.
> > > >
> > > > > if (amdgpu_acpi_should_gpu_reset(adev))
> > > > > return amdgpu_asic_reset(adev);
> > > > >
> > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > > > > b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > > > > index 57808be6e3ec..3bf51f18e13c 100644
> > > > > --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > > > > +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > > > > @@ -3034,6 +3034,10 @@ static int gfx_v9_0_cp_gfx_start(struct
> > > > > amdgpu_device *adev)
> > > > >
> > > > > gfx_v9_0_cp_gfx_enable(adev, true);
> > > > >
> > > > > +   if (adev->in_suspend && !adev->pm_complete) {
> > > > > +   DRM_INFO(" will skip the csb ring write\n");
> > > > > +   return 0;
> > > > > +   }
> > > >
> > > > We probably want a similar fix for other gfx generations as well.
> > > >
> > > > Alex
> > > >
> > > [Liang, Prike] IIRC, there's no issue on the Mendocino side even without
> the fix. How about keep the other gfx generations unchanged firstly and after
> sort out the failed case will add the quirk for each specific gfx 
> respectively?
> >
> > Mendocino only supports S0i3 so we don't touch gfx on suspend/resume.
> > This would only happen on platforms that support S3.
>
> E.g., try an aborted suspend on Raphael or PHX2.
>
> Alex
>
[Liang, Prike]  Thanks for the reminder, but the Mendocino also was verified on 
the system with S3 enabled from BIOS. I will double confirm if there need the 
quirk on the RPL or PHX2.

> >
> > Alex
> >
> > >
> > > > > r = amdgpu_ring_alloc(ring, gfx_v9_0_get_csb_size(adev) + 4 + 
> > > > > 3);
> > > > > if (r) {
> > > > > DRM_ERROR("amdgpu: cp failed to lock ring
> > > > > (%d).\n", r);
> > > > > --
> > > > > 2.34.1
> > > > >


RE: [PATCH] drm/amdgpu: update documentation on new chips

2024-01-25 Thread Xu, Feifei
[AMD Official Use Only - General]

Acked-by: Feifei Xu 

-Original Message-
From: amd-gfx  On Behalf Of Alex Deucher
Sent: Friday, January 19, 2024 3:51 AM
To: amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander 
Subject: [PATCH] drm/amdgpu: update documentation on new chips

These have been released now, so add them to the documentation.

Signed-off-by: Alex Deucher 
---
 Documentation/gpu/amdgpu/dgpu-asic-info-table.csv | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/Documentation/gpu/amdgpu/dgpu-asic-info-table.csv 
b/Documentation/gpu/amdgpu/dgpu-asic-info-table.csv
index 882d2518f8ed..3825f00ca9fe 100644
--- a/Documentation/gpu/amdgpu/dgpu-asic-info-table.csv
+++ b/Documentation/gpu/amdgpu/dgpu-asic-info-table.csv
@@ -16,6 +16,7 @@ Radeon (RX|TM) (PRO|WX) Vega /MI25 /V320 /V340L /8200 /9100 
/SSG MxGPU, VEGA10,  AMD Radeon (Pro) VII /MI50 /MI60, VEGA20, DCE 12, 9.4.0, 
VCE 4.1.0 / UVD 7.2.0, 4.2.0  MI100, ARCTURUS, *, 9.4.1, VCN 2.5.0, 4.2.2  
MI200, ALDEBARAN, *, 9.4.2, VCN 2.6.0, 4.4.0
+MI300, AQUA_VANGARAM, *, 9.4.3, VCN 4.0.3, 4.4.2
 AMD Radeon (RX|Pro) 5600(M|XT) /5700 (M|XT|XTB) /W5700, NAVI10, DCN 2.0.0, 
10.1.10, VCN 2.0.0, 5.0.0  AMD Radeon (Pro) 5300 /5500XTB/5500(XT|M) /W5500M 
/W5500, NAVI14, DCN 2.0.0, 10.1.1, VCN 2.0.2, 5.0.2  AMD Radeon RX 6800(XT) 
/6900(XT) /W6800, SIENNA_CICHLID, DCN 3.0.0, 10.3.0, VCN 3.0.0, 5.2.0 @@ -23,4 
+24,5 @@ AMD Radeon RX 6700 XT / 6800M / 6700M, NAVY_FLOUNDER, DCN 3.0.0, 
10.3.2, VCN 3.0  AMD Radeon RX 6600(XT) /6600M /W6600 /W6600M, 
DIMGREY_CAVEFISH, DCN 3.0.2, 10.3.4, VCN 3.0.16, 5.2.4  AMD Radeon RX 6500M 
/6300M /W6500M /W6300M, BEIGE_GOBY, DCN 3.0.3, 10.3.5, VCN 3.0.33, 5.2.5  AMD 
Radeon RX 7900 XT /XTX, , DCN 3.2.0, 11.0.0, VCN 4.0.0, 6.0.0
+AMD Radeon RX 7800 XT, , DCN 3.2.0, 11.0.3, VCN 4.0.0, 6.0.3
 AMD Radeon RX 7600M (XT) /7700S /7600S, , DCN 3.2.1, 11.0.2, VCN 4.0.4, 6.0.2
--
2.42.0



Re: [PATCH 2/2] drm/amdgpu: reset gpu for pm abort case

2024-01-25 Thread Alex Deucher
On Wed, Jan 24, 2024 at 11:11 PM Prike Liang  wrote:
>
> In the pm abort case the gfx power rail not turn off from FCH side and
> this will lead to the gfx reinitialized failed base on the unknown gfx
> HW status, so let's reset the gpu to a known good power state.
>
> Signed-off-by: Prike Liang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 5 +
>  drivers/gpu/drm/amd/amdgpu/soc15.c | 8 +++-
>  2 files changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 56d9dfa61290..4c40ffaaa5c2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4627,6 +4627,11 @@ int amdgpu_device_resume(struct drm_device *dev, bool 
> fbcon)
> return r;
> }
>
> +   if(amdgpu_asic_need_reset_on_init(adev)) {

space between if and (.

Also, I think we need to check that we are not in S0ix as well
otherwise I think we'll always reset in S0ix.  We could probably do
away with the GPU reset in the suspend_noirq callback with this
change, but maybe make that a separate follow up patch.

Alex

> +   DRM_INFO("PM abort case and let's reset asic \n");
> +   amdgpu_asic_reset(adev);
> +   }
> +
> if (dev->switch_power_state == DRM_SWITCH_POWER_OFF)
> return 0;
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c 
> b/drivers/gpu/drm/amd/amdgpu/soc15.c
> index 15033efec2ba..9329a00b6abc 100644
> --- a/drivers/gpu/drm/amd/amdgpu/soc15.c
> +++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
> @@ -804,9 +804,16 @@ static bool soc15_need_reset_on_init(struct 
> amdgpu_device *adev)
> if (adev->asic_type == CHIP_RENOIR)
> return true;
>
> +   sol_reg = RREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_81);
> +
> /* Just return false for soc15 GPUs.  Reset does not seem to
>  * be necessary.
>  */
> +   if (adev->in_suspend && !adev->in_s0ix &&
> +   !adev->pm_complete &&
> +   sol_reg)
> +   return true;
> +
> if (!amdgpu_passthrough(adev))
> return false;
>
> @@ -816,7 +823,6 @@ static bool soc15_need_reset_on_init(struct amdgpu_device 
> *adev)
> /* Check sOS sign of life register to confirm sys driver and sOS
>  * are already been loaded.
>  */
> -   sol_reg = RREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_81);
> if (sol_reg)
> return true;
>
> --
> 2.34.1
>


[PATCH] drm/amdkfd: Relocate TBA/TMA to opposite side of VM hole (v2)

2024-01-25 Thread Felix Kuehling
The TBA and TMA, along with an unused IB allocation, reside at low
addresses in the VM address space. A stray VM fault which hits these
pages must be serviced by making their page table entries invalid.
The scheduler depends upon these pages being resident and fails,
preventing a debugger from inspecting the failure state.

By relocating these pages above 47 bits in the VM address space they
can only be reached when bits [63:48] are set to 1. This makes it much
less likely for a misbehaving program to generate accesses to them.
The current placement at VA (PAGE_SIZE*2) is readily hit by a NULL
access with a small offset.

v2:
- Move it to the reserved space to avoid concflicts with Mesa
- Add macros to make reserved space management easier

Cc: Arunpravin Paneer Selvam 
Cc: Christian Koenig 
Signed-off-by: Jay Cornwall 
Signed-off-by: Felix Kuehling 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c  |  4 +--
 drivers/gpu/drm/amd/amdgpu/amdgpu_seq64.c|  7 ++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h   | 12 ++--
 drivers/gpu/drm/amd/amdkfd/kfd_flat_memory.c | 30 +++-
 4 files changed, 30 insertions(+), 23 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
index 823d31f4a2a3..53d0a458d78e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
@@ -28,9 +28,9 @@
 
 uint64_t amdgpu_csa_vaddr(struct amdgpu_device *adev)
 {
-   uint64_t addr = adev->vm_manager.max_pfn << AMDGPU_GPU_PAGE_SHIFT;
+   uint64_t addr = AMDGPU_VA_RESERVED_CSA_START(
+   adev->vm_manager.max_pfn << AMDGPU_GPU_PAGE_SHIFT);
 
-   addr -= AMDGPU_VA_RESERVED_CSA_SIZE;
addr = amdgpu_gmc_sign_extend(addr);
 
return addr;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_seq64.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_seq64.c
index 3d0d56087d41..9e769ef50f2e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_seq64.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_seq64.c
@@ -45,11 +45,8 @@
  */
 static inline u64 amdgpu_seq64_get_va_base(struct amdgpu_device *adev)
 {
-   u64 addr = adev->vm_manager.max_pfn << AMDGPU_GPU_PAGE_SHIFT;
-
-   addr -= AMDGPU_VA_RESERVED_TOP;
-
-   return addr;
+   return AMDGPU_VA_RESERVED_SEQ64_START(
+   adev->vm_manager.max_pfn << AMDGPU_GPU_PAGE_SHIFT);
 }
 
 /**
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
index 98a57192..f23b6153d310 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
@@ -135,11 +135,19 @@ struct amdgpu_mem_stats;
 #define AMDGPU_IS_MMHUB0(x) ((x) >= AMDGPU_MMHUB0_START && (x) < 
AMDGPU_MMHUB1_START)
 #define AMDGPU_IS_MMHUB1(x) ((x) >= AMDGPU_MMHUB1_START && (x) < 
AMDGPU_MAX_VMHUBS)
 
-/* Reserve 2MB at top/bottom of address space for kernel use */
+/* Reserve space at top/bottom of address space for kernel use */
 #define AMDGPU_VA_RESERVED_CSA_SIZE(2ULL << 20)
+#define AMDGPU_VA_RESERVED_CSA_START(top)  ((top) \
+- AMDGPU_VA_RESERVED_CSA_SIZE)
 #define AMDGPU_VA_RESERVED_SEQ64_SIZE  (2ULL << 20)
+#define AMDGPU_VA_RESERVED_SEQ64_START(top)
(AMDGPU_VA_RESERVED_CSA_START(top) \
+- 
AMDGPU_VA_RESERVED_SEQ64_SIZE)
+#define AMDGPU_VA_RESERVED_TRAP_SIZE   (2ULL << 12)
+#define AMDGPU_VA_RESERVED_TRAP_START(top) 
(AMDGPU_VA_RESERVED_SEQ64_START(top) \
+- AMDGPU_VA_RESERVED_TRAP_SIZE)
 #define AMDGPU_VA_RESERVED_BOTTOM  (2ULL << 20)
-#define AMDGPU_VA_RESERVED_TOP (AMDGPU_VA_RESERVED_SEQ64_SIZE 
+ \
+#define AMDGPU_VA_RESERVED_TOP (AMDGPU_VA_RESERVED_TRAP_SIZE + 
\
+AMDGPU_VA_RESERVED_SEQ64_SIZE 
+ \
 AMDGPU_VA_RESERVED_CSA_SIZE)
 
 /* See vm_update_mode */
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_flat_memory.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_flat_memory.c
index 6604a3f99c5e..f899cce25b2a 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_flat_memory.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_flat_memory.c
@@ -36,6 +36,7 @@
 #include 
 #include 
 #include 
+#include "amdgpu_vm.h"
 
 /*
  * The primary memory I/O features being added for revisions of gfxip
@@ -326,10 +327,16 @@ static void kfd_init_apertures_vi(struct 
kfd_process_device *pdd, uint8_t id)
 * with small reserved space for kernel.
 * Set them to CANONICAL addresses.
 */
-   pdd->gpuvm_base = SVM_USER_BASE;
+   pdd->gpuvm_base = max(SVM_USER_BASE, AMDGPU_VA_RESERVED_BOTTOM);
pdd->gpuvm_limit =
pdd->dev->kfd->shared_resources.gpuvm_size - 1;
 
+   /* dGPUs: the reserved space for kernel
+* before SVM
+*/
+   pdd->qpd.cwsr_base = SVM_CWSR_BASE;
+   pdd->qpd.ib_base = S

[pull] amdgpu drm-fixes-6.8

2024-01-25 Thread Alex Deucher
Hi Dave, Sima,

Fixes for 6.8.

The following changes since commit b16702be210bb49256f8a32df2c310383134dd57:

  Merge tag 'exynos-drm-fixes-for-v6.8-rc2' of 
git://git.kernel.org/pub/scm/linux/kernel/git/daeinki/drm-exynos into drm-fixes 
(2024-01-25 14:22:15 +1000)

are available in the Git repository at:

  https://gitlab.freedesktop.org/agd5f/linux.git 
tags/amd-drm-fixes-6.8-2024-01-25

for you to fetch changes up to c82eb25c5f005b33aebb1415a8472fc2eeea0deb:

  drm/amd/display: "Enable IPS by default" (2024-01-25 16:00:24 -0500)


amd-drm-fixes-6.8-2024-01-25:

amdgpu:
- AC/DC power supply tracking fix
- Don't show invalid vram vendor data
- SMU 13.0.x fixes
- GART fix for umr on systems without VRAM
- GFX 10/11 UNORD_DISPATCH fixes
- IPS display fixes (required for S0ix on some platforms)
- Misc fixes


Alex Deucher (2):
  drm/amdgpu/gfx10: set UNORD_DISPATCH in compute MQDs
  drm/amdgpu/gfx11: set UNORD_DISPATCH in compute MQDs

Alvin Lee (1):
  drm/amd/display: Add Replay IPS register for DMUB command table

ChunTao Tso (1):
  drm/amd/display: Replay + IPS + ABM in Full Screen VPB

Hawking Zhang (1):
  drm/amdgpu: Fix null pointer dereference

Kenneth Feng (1):
  drm/amd/pm: update the power cap setting

Lijo Lazar (3):
  drm/amdgpu: Avoid fetching vram vendor information
  drm/amdgpu: Show vram vendor only if available
  drm/amd/pm: Fetch current power limit from FW

Ma Jun (1):
  drm/amdgpu/pm: Fix the power source flag error

Nicholas Kazlauskas (1):
  drm/amd/display: Allow IPS2 during Replay

Roman Li (4):
  drm/amd/display: Add IPS checks before dcn register access
  drm/amd/display: Disable ips before dc interrupt setting
  drm/amd: Add a DC debug mask for IPS
  drm/amd/display: "Enable IPS by default"

Srinivasan Shanmugam (1):
  drm/amd/display: Fix uninitialized variable usage in core_link_ 
'read_dpcd() & write_dpcd()' functions

Tom St Denis (1):
  drm/amd/amdgpu: Assign GART pages to AMD device mapping

Yang Wang (1):
  drm/amd/pm: udpate smu v13.0.6 message permission

 drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c   |  8 
 drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c   | 17 ++-
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c |  2 +-
 drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c |  2 +-
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c  |  3 +-
 drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_v10.c   |  1 +
 drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_v11.c   |  1 +
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c  | 21 -
 .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_irq.c  |  5 +-
 drivers/gpu/drm/amd/display/dc/dc.h|  1 +
 drivers/gpu/drm/amd/display/dc/dc_types.h  |  5 ++
 .../drm/amd/display/dc/hwss/dcn35/dcn35_hwseq.c|  9 +++-
 .../drm/amd/display/dc/link/protocols/link_dpcd.c  |  4 +-
 drivers/gpu/drm/amd/display/dmub/inc/dmub_cmd.h| 47 +++
 .../drm/amd/display/modules/power/power_helpers.c  |  5 ++
 .../drm/amd/display/modules/power/power_helpers.h  |  1 +
 drivers/gpu/drm/amd/include/amd_shared.h   |  1 +
 drivers/gpu/drm/amd/include/amdgpu_reg_state.h |  2 +-
 drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c  | 14 ++
 drivers/gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c |  2 +
 drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c |  2 +
 .../gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_0_ppt.c   | 54 +-
 .../gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c   |  4 +-
 .../gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_7_ppt.c   | 54 +-
 24 files changed, 229 insertions(+), 36 deletions(-)


Re: [PATCH] drm/amdgpu: update documentation on new chips

2024-01-25 Thread Alex Deucher
Ping?

On Thu, Jan 18, 2024 at 3:47 PM Alex Deucher  wrote:
>
> These have been released now, so add them to the documentation.
>
> Signed-off-by: Alex Deucher 
> ---
>  Documentation/gpu/amdgpu/dgpu-asic-info-table.csv | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/Documentation/gpu/amdgpu/dgpu-asic-info-table.csv 
> b/Documentation/gpu/amdgpu/dgpu-asic-info-table.csv
> index 882d2518f8ed..3825f00ca9fe 100644
> --- a/Documentation/gpu/amdgpu/dgpu-asic-info-table.csv
> +++ b/Documentation/gpu/amdgpu/dgpu-asic-info-table.csv
> @@ -16,6 +16,7 @@ Radeon (RX|TM) (PRO|WX) Vega /MI25 /V320 /V340L /8200 /9100 
> /SSG MxGPU, VEGA10,
>  AMD Radeon (Pro) VII /MI50 /MI60, VEGA20, DCE 12, 9.4.0, VCE 4.1.0 / UVD 
> 7.2.0, 4.2.0
>  MI100, ARCTURUS, *, 9.4.1, VCN 2.5.0, 4.2.2
>  MI200, ALDEBARAN, *, 9.4.2, VCN 2.6.0, 4.4.0
> +MI300, AQUA_VANGARAM, *, 9.4.3, VCN 4.0.3, 4.4.2
>  AMD Radeon (RX|Pro) 5600(M|XT) /5700 (M|XT|XTB) /W5700, NAVI10, DCN 2.0.0, 
> 10.1.10, VCN 2.0.0, 5.0.0
>  AMD Radeon (Pro) 5300 /5500XTB/5500(XT|M) /W5500M /W5500, NAVI14, DCN 2.0.0, 
> 10.1.1, VCN 2.0.2, 5.0.2
>  AMD Radeon RX 6800(XT) /6900(XT) /W6800, SIENNA_CICHLID, DCN 3.0.0, 10.3.0, 
> VCN 3.0.0, 5.2.0
> @@ -23,4 +24,5 @@ AMD Radeon RX 6700 XT / 6800M / 6700M, NAVY_FLOUNDER, DCN 
> 3.0.0, 10.3.2, VCN 3.0
>  AMD Radeon RX 6600(XT) /6600M /W6600 /W6600M, DIMGREY_CAVEFISH, DCN 3.0.2, 
> 10.3.4, VCN 3.0.16, 5.2.4
>  AMD Radeon RX 6500M /6300M /W6500M /W6300M, BEIGE_GOBY, DCN 3.0.3, 10.3.5, 
> VCN 3.0.33, 5.2.5
>  AMD Radeon RX 7900 XT /XTX, , DCN 3.2.0, 11.0.0, VCN 4.0.0, 6.0.0
> +AMD Radeon RX 7800 XT, , DCN 3.2.0, 11.0.3, VCN 4.0.0, 6.0.3
>  AMD Radeon RX 7600M (XT) /7700S /7600S, , DCN 3.2.1, 11.0.2, VCN 4.0.4, 6.0.2
> --
> 2.42.0
>


Re: [PATCH] drm/amd/include: Add missing registers/mask for DCN316 and 350

2024-01-25 Thread Pillai, Aurabindo
[AMD Official Use Only - General]

Reviewed-by: Aurabindo Pillai 

--

Regards,
Jay

From: Siqueira, Rodrigo 
Sent: Thursday, January 25, 2024 1:37 PM
To: amd-gfx@lists.freedesktop.org 
Cc: Siqueira, Rodrigo ; Lei, Jun ; 
Pillai, Aurabindo ; Mahfooz, Hamza 
; Wentland, Harry ; Deucher, 
Alexander 
Subject: [PATCH] drm/amd/include: Add missing registers/mask for DCN316 and 350

Cc: Jun Lei 
Cc: Aurabindo Pillai 
Cc: Hamza Mahfooz 
Cc: Harry Wentland 
Cc: Alex Deucher 
Signed-off-by: Rodrigo Siqueira 
---
 .../include/asic_reg/dcn/dcn_3_1_6_offset.h   |  4 ++
 .../include/asic_reg/dcn/dcn_3_1_6_sh_mask.h  | 10 +++
 .../include/asic_reg/dcn/dcn_3_5_0_offset.h   | 24 +++
 .../include/asic_reg/dcn/dcn_3_5_0_sh_mask.h  | 65 +++
 4 files changed, 103 insertions(+)

diff --git a/drivers/gpu/drm/amd/include/asic_reg/dcn/dcn_3_1_6_offset.h 
b/drivers/gpu/drm/amd/include/asic_reg/dcn/dcn_3_1_6_offset.h
index 222fa8d13269..a05bf8e4f58d 100644
--- a/drivers/gpu/drm/amd/include/asic_reg/dcn/dcn_3_1_6_offset.h
+++ b/drivers/gpu/drm/amd/include/asic_reg/dcn/dcn_3_1_6_offset.h
@@ -626,6 +626,8 @@
 #define regDTBCLK_DTO2_MODULO_BASE_IDX 
 2
 #define regDTBCLK_DTO3_MODULO  
 0x0022
 #define regDTBCLK_DTO3_MODULO_BASE_IDX 
 2
+#define regHDMICHARCLK0_CLOCK_CNTL 
 0x004a
+#define regHDMICHARCLK0_CLOCK_CNTL_BASE_IDX
 2
 #define regPHYASYMCLK_CLOCK_CNTL   
 0x0052
 #define regPHYASYMCLK_CLOCK_CNTL_BASE_IDX  
 2
 #define regPHYBSYMCLK_CLOCK_CNTL   
 0x0053
@@ -638,6 +640,8 @@
 #define regPHYESYMCLK_CLOCK_CNTL_BASE_IDX  
 2
 #define regPHYFSYMCLK_CLOCK_CNTL   
 0x0057
 #define regPHYFSYMCLK_CLOCK_CNTL_BASE_IDX  
 2
+#define regHDMISTREAMCLK_CNTL  
 0x0059
+#define regHDMISTREAMCLK_CNTL_BASE_IDX 
 2
 #define regDCCG_GATE_DISABLE_CNTL3 
 0x005a
 #define regDCCG_GATE_DISABLE_CNTL3_BASE_IDX
 2
 #define regHDMISTREAMCLK0_DTO_PARAM
 0x005b
diff --git a/drivers/gpu/drm/amd/include/asic_reg/dcn/dcn_3_1_6_sh_mask.h 
b/drivers/gpu/drm/amd/include/asic_reg/dcn/dcn_3_1_6_sh_mask.h
index 8ddb03a1dc39..df84941bbe5b 100644
--- a/drivers/gpu/drm/amd/include/asic_reg/dcn/dcn_3_1_6_sh_mask.h
+++ b/drivers/gpu/drm/amd/include/asic_reg/dcn/dcn_3_1_6_sh_mask.h
@@ -1933,6 +1933,11 @@
 //DTBCLK_DTO3_MODULO
 #define DTBCLK_DTO3_MODULO__DTBCLK_DTO3_MODULO__SHIFT  
   0x0
 #define DTBCLK_DTO3_MODULO__DTBCLK_DTO3_MODULO_MASK
   0xL
+//HDMICHARCLK0_CLOCK_CNTL
+#define HDMICHARCLK0_CLOCK_CNTL__HDMICHARCLK0_EN__SHIFT
   0x0
+#define HDMICHARCLK0_CLOCK_CNTL__HDMICHARCLK0_SRC_SEL__SHIFT   
   0x4
+#define HDMICHARCLK0_CLOCK_CNTL__HDMICHARCLK0_EN_MASK  
   0x0001L
+#define HDMICHARCLK0_CLOCK_CNTL__HDMICHARCLK0_SRC_SEL_MASK 
   0x0070L
 //PHYASYMCLK_CLOCK_CNTL
 #define PHYASYMCLK_CLOCK_CNTL__PHYASYMCLK_FORCE_EN__SHIFT  
   0x0
 #define PHYASYMCLK_CLOCK_CNTL__PHYASYMCLK_FORCE_SRC_SEL__SHIFT 
   0x4
@@ -1967,6 +1972,11 @@
 #define PHYFSYMCLK_CLOCK_CNTL__PHYFSYMCLK_FORCE_SRC_SEL__SHIFT 
   0x4
 #define PHYFSYMCLK_CLOCK_CNTL__PHYFSYMCLK_FORCE_EN_MASK
   0x0001L
 #define PHYFSYMCLK_CLOCK_CNTL__PHYFSYMCLK_FORCE_SRC_SEL_MASK   
   0x0030L
+//HDMISTREAMCLK_CNTL
+#define HDMISTREAMCLK_CNTL__HDMISTREAMCLK0_SRC_SEL__SHIFT  
   0x0
+#define HDMISTREAMCLK_CNTL__HDMISTREAMCLK0_DTO_FORCE_DIS__SHIFT
   0x10
+#define HDMISTREAMCLK_CNTL__HDMISTREAMCLK0_SRC_SEL_MASK   

Re: [PATCH v2] drm/amdgpu: move the drm client creation behind drm device registration

2024-01-25 Thread Felix Kuehling



On 2024-01-25 08:15, Lazar, Lijo wrote:

On 1/25/2024 1:37 PM, Le Ma wrote:

This patch is to eliminate interrupt warning below:

   "[drm] Fence fallback timer expired on ring sdma0.0".

An early vm pt clearing job is sent to SDMA ahead of interrupt enabled,
introduced by patch below:

   - drm/amdkfd: Export DMABufs from KFD using GEM handles


The drm client creation logic in this patch won't work well for dynamic
partition switch cases. Better add a ''Fixes' tag also.


I agree. It would also be problematic with GPU resets. amdgpu_pci_probe 
seems to be the right place for this to ensure it only gets called once. 
The fix looks reasonable to me.


Reviewed-by: Felix Kuehling 




This looks fine to me, needs to be checked by Felix anyway.

Thanks,
Lijo


And re-locating the drm client creation following after drm_dev_register
looks like a more proper flow.

v2: wrap the drm client creation

Signed-off-by: Le Ma 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 32 ++
  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  2 ++
  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c|  4 +++
  3 files changed, 27 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index 77e263660288..41db030ddc4e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -141,11 +141,31 @@ static void amdgpu_amdkfd_reset_work(struct work_struct 
*work)
  static const struct drm_client_funcs kfd_client_funcs = {
.unregister = drm_client_release,
  };
+
+int amdgpu_amdkfd_drm_client_create(struct amdgpu_device *adev)
+{
+   int ret;
+
+   if (!adev->kfd.init_complete)
+   return 0;
+
+   ret = drm_client_init(&adev->ddev, &adev->kfd.client, "kfd",
+ &kfd_client_funcs);
+   if (ret) {
+   dev_err(adev->dev, "Failed to init DRM client: %d\n",
+   ret);
+   return ret;
+   }
+
+   drm_client_register(&adev->kfd.client);
+
+   return 0;
+}
+
  void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)
  {
int i;
int last_valid_bit;
-   int ret;
  
  	amdgpu_amdkfd_gpuvm_init_mem_limits();
  
@@ -164,12 +184,6 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)

.enable_mes = adev->enable_mes,
};
  
-		ret = drm_client_init(&adev->ddev, &adev->kfd.client, "kfd", &kfd_client_funcs);

-   if (ret) {
-   dev_err(adev->dev, "Failed to init DRM client: %d\n", 
ret);
-   return;
-   }
-
/* this is going to have a few of the MSBs set that we need to
 * clear
 */
@@ -208,10 +222,6 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)
  
  		adev->kfd.init_complete = kgd2kfd_device_init(adev->kfd.dev,

&gpu_resources);
-   if (adev->kfd.init_complete)
-   drm_client_register(&adev->kfd.client);
-   else
-   drm_client_release(&adev->kfd.client);
  
  		amdgpu_amdkfd_total_mem_size += adev->gmc.real_vram_size;
  
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h

index 584a0cea5572..da175c384adf 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -182,6 +182,8 @@ int amdgpu_queue_mask_bit_to_set_resource_bit(struct 
amdgpu_device *adev,
  struct amdgpu_amdkfd_fence *amdgpu_amdkfd_fence_create(u64 context,
struct mm_struct *mm,
struct svm_range_bo *svm_bo);
+
+int amdgpu_amdkfd_drm_client_create(struct amdgpu_device *adev);
  #if defined(CONFIG_DEBUG_FS)
  int kfd_debugfs_kfd_mem_limits(struct seq_file *m, void *data);
  #endif
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 475bd59c9ac2..91d5d9435067 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -2255,6 +2255,10 @@ static int amdgpu_pci_probe(struct pci_dev *pdev,
if (ret)
goto err_pci;
  
+	ret = amdgpu_amdkfd_drm_client_create(adev);

+   if (ret)
+   goto err_pci;
+
/*
 * 1. don't init fbdev on hw without DCE
 * 2. don't init fbdev if there are no connectors


Re: [PATCH v2 0/8] Expand and improve AMDGPU documentation

2024-01-25 Thread Rodrigo Siqueira Jordao




On 1/23/24 09:19, Hamza Mahfooz wrote:

On 1/22/24 16:24, Rodrigo Siqueira wrote:

This patchset improves how the AMDGPU display documentation is
organized, expands the kernel-doc to extract information from the
source, and adds more context about DC workflow. Finally, at the end of
this series, we also introduce a contribution section for those
interested in contributing to the display code.

Changes since V1:
- Remove unprecise information about the DC process.
- Expand the contribution list.
- Rebase.

Thanks
Siqueira


Series is:
Acked-by: Hamza Mahfooz 



Rodrigo Siqueira (8):
   Documentation/gpu: Add basic page for HUBP
   Documentation/gpu: Add simple doc page for DCHUBBUB
   Documentation/gpu: Add kernel doc entry for DPP
   Documentation/gpu: Add kernel doc entry for MPC
   Documentation/gpu: Add entry for OPP in the kernel doc
   Documentation/gpu: Add entry for the DIO component
   Documentation/gpu: Add an explanation about the DC weekly patches
   Documentation/gpu: Introduce a simple contribution list for display
 code

  .../gpu/amdgpu/display/dcn-blocks.rst |  78 ++
  .../amdgpu/display/display-contributing.rst   | 168 
  .../gpu/amdgpu/display/display-manager.rst    |   3 -
  Documentation/gpu/amdgpu/display/index.rst    |  78 +-
  drivers/gpu/drm/amd/display/TODO  | 110 
  .../gpu/drm/amd/display/dc/inc/hw/dchubbub.h  |   6 +
  drivers/gpu/drm/amd/display/dc/inc/hw/dpp.h   |  26 ++
  drivers/gpu/drm/amd/display/dc/inc/hw/hubp.h  |  13 +-
  drivers/gpu/drm/amd/display/dc/inc/hw/mpc.h   | 250 --
  drivers/gpu/drm/amd/display/dc/inc/hw/opp.h   |  16 ++
  .../amd/display/dc/link/hwss/link_hwss_dio.h  |  10 +
  11 files changed, 560 insertions(+), 198 deletions(-)
  create mode 100644 Documentation/gpu/amdgpu/display/dcn-blocks.rst
  create mode 100644 
Documentation/gpu/amdgpu/display/display-contributing.rst

  delete mode 100644 drivers/gpu/drm/amd/display/TODO



Thanks!

Change pushed to amd-staging-drm-next.



[PATCH] drm/amd/include: Add missing registers/mask for DCN316 and 350

2024-01-25 Thread Rodrigo Siqueira
Cc: Jun Lei 
Cc: Aurabindo Pillai 
Cc: Hamza Mahfooz 
Cc: Harry Wentland 
Cc: Alex Deucher 
Signed-off-by: Rodrigo Siqueira 
---
 .../include/asic_reg/dcn/dcn_3_1_6_offset.h   |  4 ++
 .../include/asic_reg/dcn/dcn_3_1_6_sh_mask.h  | 10 +++
 .../include/asic_reg/dcn/dcn_3_5_0_offset.h   | 24 +++
 .../include/asic_reg/dcn/dcn_3_5_0_sh_mask.h  | 65 +++
 4 files changed, 103 insertions(+)

diff --git a/drivers/gpu/drm/amd/include/asic_reg/dcn/dcn_3_1_6_offset.h 
b/drivers/gpu/drm/amd/include/asic_reg/dcn/dcn_3_1_6_offset.h
index 222fa8d13269..a05bf8e4f58d 100644
--- a/drivers/gpu/drm/amd/include/asic_reg/dcn/dcn_3_1_6_offset.h
+++ b/drivers/gpu/drm/amd/include/asic_reg/dcn/dcn_3_1_6_offset.h
@@ -626,6 +626,8 @@
 #define regDTBCLK_DTO2_MODULO_BASE_IDX 
 2
 #define regDTBCLK_DTO3_MODULO  
 0x0022
 #define regDTBCLK_DTO3_MODULO_BASE_IDX 
 2
+#define regHDMICHARCLK0_CLOCK_CNTL 
 0x004a
+#define regHDMICHARCLK0_CLOCK_CNTL_BASE_IDX
 2
 #define regPHYASYMCLK_CLOCK_CNTL   
 0x0052
 #define regPHYASYMCLK_CLOCK_CNTL_BASE_IDX  
 2
 #define regPHYBSYMCLK_CLOCK_CNTL   
 0x0053
@@ -638,6 +640,8 @@
 #define regPHYESYMCLK_CLOCK_CNTL_BASE_IDX  
 2
 #define regPHYFSYMCLK_CLOCK_CNTL   
 0x0057
 #define regPHYFSYMCLK_CLOCK_CNTL_BASE_IDX  
 2
+#define regHDMISTREAMCLK_CNTL  
 0x0059
+#define regHDMISTREAMCLK_CNTL_BASE_IDX 
 2
 #define regDCCG_GATE_DISABLE_CNTL3 
 0x005a
 #define regDCCG_GATE_DISABLE_CNTL3_BASE_IDX
 2
 #define regHDMISTREAMCLK0_DTO_PARAM
 0x005b
diff --git a/drivers/gpu/drm/amd/include/asic_reg/dcn/dcn_3_1_6_sh_mask.h 
b/drivers/gpu/drm/amd/include/asic_reg/dcn/dcn_3_1_6_sh_mask.h
index 8ddb03a1dc39..df84941bbe5b 100644
--- a/drivers/gpu/drm/amd/include/asic_reg/dcn/dcn_3_1_6_sh_mask.h
+++ b/drivers/gpu/drm/amd/include/asic_reg/dcn/dcn_3_1_6_sh_mask.h
@@ -1933,6 +1933,11 @@
 //DTBCLK_DTO3_MODULO
 #define DTBCLK_DTO3_MODULO__DTBCLK_DTO3_MODULO__SHIFT  
   0x0
 #define DTBCLK_DTO3_MODULO__DTBCLK_DTO3_MODULO_MASK
   0xL
+//HDMICHARCLK0_CLOCK_CNTL
+#define HDMICHARCLK0_CLOCK_CNTL__HDMICHARCLK0_EN__SHIFT
   0x0
+#define HDMICHARCLK0_CLOCK_CNTL__HDMICHARCLK0_SRC_SEL__SHIFT   
   0x4
+#define HDMICHARCLK0_CLOCK_CNTL__HDMICHARCLK0_EN_MASK  
   0x0001L
+#define HDMICHARCLK0_CLOCK_CNTL__HDMICHARCLK0_SRC_SEL_MASK 
   0x0070L
 //PHYASYMCLK_CLOCK_CNTL
 #define PHYASYMCLK_CLOCK_CNTL__PHYASYMCLK_FORCE_EN__SHIFT  
   0x0
 #define PHYASYMCLK_CLOCK_CNTL__PHYASYMCLK_FORCE_SRC_SEL__SHIFT 
   0x4
@@ -1967,6 +1972,11 @@
 #define PHYFSYMCLK_CLOCK_CNTL__PHYFSYMCLK_FORCE_SRC_SEL__SHIFT 
   0x4
 #define PHYFSYMCLK_CLOCK_CNTL__PHYFSYMCLK_FORCE_EN_MASK
   0x0001L
 #define PHYFSYMCLK_CLOCK_CNTL__PHYFSYMCLK_FORCE_SRC_SEL_MASK   
   0x0030L
+//HDMISTREAMCLK_CNTL
+#define HDMISTREAMCLK_CNTL__HDMISTREAMCLK0_SRC_SEL__SHIFT  
   0x0
+#define HDMISTREAMCLK_CNTL__HDMISTREAMCLK0_DTO_FORCE_DIS__SHIFT
   0x10
+#define HDMISTREAMCLK_CNTL__HDMISTREAMCLK0_SRC_SEL_MASK
   0x0003L
+#define HDMISTREAMCLK_CNTL__HDMISTREAMCLK0_DTO_FORCE_DIS_MASK  
   0x0001L
 //DCCG_GATE_DISABLE_CNTL3
 #define DCCG_GATE_DISABLE_CNTL3__HDMISTREAMCLK0_GATE_DISABLE__SHIFT
   0x0
 #define DCCG_GATE_DISABLE_CNTL3__HDMISTREAMCLK1_GATE_DISABLE__SHIFT

RE: [PATCH] drm/amd/display: Fix potential NULL pointer dereferences in 'dcn10_set_output_transfer_func()'

2024-01-25 Thread Koo, Anthony
[AMD Official Use Only - General]

Reviewed-by: Anthony Koo 

Thanks,
Anthony

-Original Message-
From: SHANMUGAM, SRINIVASAN 
Sent: Thursday, January 25, 2024 10:55 AM
To: Siqueira, Rodrigo ; Pillai, Aurabindo 

Cc: amd-gfx@lists.freedesktop.org; SHANMUGAM, SRINIVASAN 
; Wyatt Wood ; Koo, Anthony 

Subject: [PATCH] drm/amd/display: Fix potential NULL pointer dereferences in 
'dcn10_set_output_transfer_func()'

The 'stream' pointer is used in dcn10_set_output_transfer_func() before the 
check if 'stream' is NULL.

Fixes the below:
drivers/gpu/drm/amd/amdgpu/../display/dc/hwss/dcn10/dcn10_hwseq.c:1892 
dcn10_set_output_transfer_func() warn: variable dereferenced before check 
'stream' (see line 1875)

Fixes: ddef02de0d71 ("drm/amd/display: add null checks before logging")
Cc: Wyatt Wood 
Cc: Anthony Koo 
Cc: Rodrigo Siqueira 
Cc: Aurabindo Pillai 
Signed-off-by: Srinivasan Shanmugam 
---
 drivers/gpu/drm/amd/display/dc/hwss/dcn10/dcn10_hwseq.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/dc/hwss/dcn10/dcn10_hwseq.c 
b/drivers/gpu/drm/amd/display/dc/hwss/dcn10/dcn10_hwseq.c
index d923d8d915f9..22cce2b58f95 100644
--- a/drivers/gpu/drm/amd/display/dc/hwss/dcn10/dcn10_hwseq.c
+++ b/drivers/gpu/drm/amd/display/dc/hwss/dcn10/dcn10_hwseq.c
@@ -1890,6 +1890,9 @@ bool dcn10_set_output_transfer_func(struct dc *dc, struct 
pipe_ctx *pipe_ctx,  {
struct dpp *dpp = pipe_ctx->plane_res.dpp;

+   if (!stream)
+   return false;
+
if (dpp == NULL)
return false;

@@ -1912,8 +1915,8 @@ bool dcn10_set_output_transfer_func(struct dc *dc, struct 
pipe_ctx *pipe_ctx,
} else
dpp->funcs->dpp_program_regamma_pwl(dpp, NULL, 
OPP_REGAMMA_BYPASS);

-   if (stream != NULL && stream->ctx != NULL &&
-   stream->out_transfer_func != NULL) {
+   if (stream->ctx &&
+   stream->out_transfer_func) {
log_tf(stream->ctx,
stream->out_transfer_func,
dpp->regamma_params.hw_points_num);
--
2.34.1



[PATCH] drm/amdgpu: Fix the warning info in mode1 reset

2024-01-25 Thread Alex Deucher
From: Ma Jun 

Fix the warning info below during mode1 reset.
[  +0.04] Call Trace:
[  +0.04]  
[  +0.06]  ? show_regs+0x6e/0x80
[  +0.11]  ? __flush_work.isra.0+0x2e8/0x390
[  +0.05]  ? __warn+0x91/0x150
[  +0.09]  ? __flush_work.isra.0+0x2e8/0x390
[  +0.06]  ? report_bug+0x19d/0x1b0
[  +0.13]  ? handle_bug+0x46/0x80
[  +0.12]  ? exc_invalid_op+0x1d/0x80
[  +0.11]  ? asm_exc_invalid_op+0x1f/0x30
[  +0.14]  ? __flush_work.isra.0+0x2e8/0x390
[  +0.07]  ? __flush_work.isra.0+0x208/0x390
[  +0.07]  ? _prb_read_valid+0x216/0x290
[  +0.08]  __cancel_work_timer+0x11d/0x1a0
[  +0.07]  ? try_to_grab_pending+0xe8/0x190
[  +0.12]  cancel_work_sync+0x14/0x20
[  +0.08]  amddrm_sched_stop+0x3c/0x1d0 [amd_sched]
[  +0.32]  amdgpu_device_gpu_recover+0x29a/0xe90 [amdgpu]

This warning info was printed after applying the patch
"drm/sched: Convert drm scheduler to use a work queue rather than kthread".
The root cause is that amdgpu driver tries to use the uninitialized
work_struct in the struct drm_gpu_scheduler

v2:
 - Rename the function to amdgpu_ring_sched_ready and move it to
amdgpu_ring.c (Alex)
v3:
- Fix a few more checks based on Vitaly's patch (Alex)

Fixes: 11b3b9f461c5 ("drm/sched: Check scheduler ready before calling timeout 
handling")
Reviewed-by: Alex Deucher 
Signed-off-by: Vitaly Prosyak 
Signed-off-by: Ma Jun 
Signed-off-by: Alex Deucher 
---
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c|  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c|  8 
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 ++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c   | 14 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |  2 +-
 5 files changed, 25 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c
index 899e31e3a5e8..3a3f3ce09f00 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c
@@ -290,7 +290,7 @@ static int suspend_resume_compute_scheduler(struct 
amdgpu_device *adev, bool sus
for (i = 0; i < adev->gfx.num_compute_rings; i++) {
struct amdgpu_ring *ring = &adev->gfx.compute_ring[i];
 
-   if (!(ring && drm_sched_wqueue_ready(&ring->sched)))
+   if (!amdgpu_ring_sched_ready(ring))
continue;
 
/* stop secheduler and drain ring. */
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
index e485dd3357c6..1afbb2e932c6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
@@ -1678,7 +1678,7 @@ static int amdgpu_debugfs_test_ib_show(struct seq_file 
*m, void *unused)
for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
struct amdgpu_ring *ring = adev->rings[i];
 
-   if (!ring || !drm_sched_wqueue_ready(&ring->sched))
+   if (!amdgpu_ring_sched_ready(ring))
continue;
drm_sched_wqueue_stop(&ring->sched);
}
@@ -1694,7 +1694,7 @@ static int amdgpu_debugfs_test_ib_show(struct seq_file 
*m, void *unused)
for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
struct amdgpu_ring *ring = adev->rings[i];
 
-   if (!ring || !drm_sched_wqueue_ready(&ring->sched))
+   if (!amdgpu_ring_sched_ready(ring))
continue;
drm_sched_wqueue_start(&ring->sched);
}
@@ -1916,8 +1916,8 @@ static int amdgpu_debugfs_ib_preempt(void *data, u64 val)
 
ring = adev->rings[val];
 
-   if (!ring || !ring->funcs->preempt_ib ||
-   !drm_sched_wqueue_ready(&ring->sched))
+   if (!amdgpu_ring_sched_ready(ring) ||
+   !ring->funcs->preempt_ib)
return -EINVAL;
 
/* the last preemption failed */
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 1a04ccba9542..7ff17df7a5ce 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5042,7 +5042,7 @@ bool amdgpu_device_has_job_running(struct amdgpu_device 
*adev)
for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
struct amdgpu_ring *ring = adev->rings[i];
 
-   if (!ring || !drm_sched_wqueue_ready(&ring->sched))
+   if (!amdgpu_ring_sched_ready(ring))
continue;
 
spin_lock(&ring->sched.job_list_lock);
@@ -5181,7 +5181,7 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device 
*adev,
for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
struct amdgpu_ring *ring = adev->rings[i];
 
-   if (!ring || !drm_sched_wqueue_ready(&ring->sched))
+   if (!amdgpu_ring_sched_ready(ring))
continu

[PATCH] drm/amd/display: Fix potential NULL pointer dereferences in 'dcn10_set_output_transfer_func()'

2024-01-25 Thread Srinivasan Shanmugam
The 'stream' pointer is used in dcn10_set_output_transfer_func() before
the check if 'stream' is NULL.

Fixes the below:
drivers/gpu/drm/amd/amdgpu/../display/dc/hwss/dcn10/dcn10_hwseq.c:1892 
dcn10_set_output_transfer_func() warn: variable dereferenced before check 
'stream' (see line 1875)

Fixes: ddef02de0d71 ("drm/amd/display: add null checks before logging")
Cc: Wyatt Wood 
Cc: Anthony Koo 
Cc: Rodrigo Siqueira 
Cc: Aurabindo Pillai 
Signed-off-by: Srinivasan Shanmugam 
---
 drivers/gpu/drm/amd/display/dc/hwss/dcn10/dcn10_hwseq.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/dc/hwss/dcn10/dcn10_hwseq.c 
b/drivers/gpu/drm/amd/display/dc/hwss/dcn10/dcn10_hwseq.c
index d923d8d915f9..22cce2b58f95 100644
--- a/drivers/gpu/drm/amd/display/dc/hwss/dcn10/dcn10_hwseq.c
+++ b/drivers/gpu/drm/amd/display/dc/hwss/dcn10/dcn10_hwseq.c
@@ -1890,6 +1890,9 @@ bool dcn10_set_output_transfer_func(struct dc *dc, struct 
pipe_ctx *pipe_ctx,
 {
struct dpp *dpp = pipe_ctx->plane_res.dpp;
 
+   if (!stream)
+   return false;
+
if (dpp == NULL)
return false;
 
@@ -1912,8 +1915,8 @@ bool dcn10_set_output_transfer_func(struct dc *dc, struct 
pipe_ctx *pipe_ctx,
} else
dpp->funcs->dpp_program_regamma_pwl(dpp, NULL, 
OPP_REGAMMA_BYPASS);
 
-   if (stream != NULL && stream->ctx != NULL &&
-   stream->out_transfer_func != NULL) {
+   if (stream->ctx &&
+   stream->out_transfer_func) {
log_tf(stream->ctx,
stream->out_transfer_func,
dpp->regamma_params.hw_points_num);
-- 
2.34.1



Re: [PATCH 1/2] drm/amdgpu: skip to program GFXDEC registers for PM abort case

2024-01-25 Thread Alex Deucher
On Thu, Jan 25, 2024 at 10:22 AM Alex Deucher  wrote:
>
> On Wed, Jan 24, 2024 at 9:39 PM Liang, Prike  wrote:
> >
> > [AMD Official Use Only - General]
> >
> > Hi, Alex
> > > -Original Message-
> > > From: Alex Deucher 
> > > Sent: Wednesday, January 24, 2024 11:59 PM
> > > To: Liang, Prike 
> > > Cc: amd-gfx@lists.freedesktop.org; Deucher, Alexander
> > > ; Sharma, Deepak
> > > 
> > > Subject: Re: [PATCH 1/2] drm/amdgpu: skip to program GFXDEC registers for
> > > PM abort case
> > >
> > > On Wed, Jan 24, 2024 at 2:12 AM Prike Liang  wrote:
> > > >
> > > > In the PM abort cases, the gfx power rail doesn't turn off so some
> > > > GFXDEC registers/CSB can't reset to default vaule. In order to avoid
> > > > unexpected problem now need skip to program GFXDEC registers and
> > > > bypass issue CSB packet for PM abort case.
> > > >
> > > > Signed-off-by: Prike Liang 
> > > > ---
> > > >  drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 +
> > > >  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 1 +
> > > >  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c   | 4 
> > > >  3 files changed, 6 insertions(+)
> > > >
> > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > > > index c5f3859fd682..26d983eb831b 100644
> > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > > > @@ -1079,6 +1079,7 @@ struct amdgpu_device {
> > > > boolin_s3;
> > > > boolin_s4;
> > > > boolin_s0ix;
> > > > +   boolpm_complete;
> > > >
> > > > enum pp_mp1_state   mp1_state;
> > > > struct amdgpu_doorbell_index doorbell_index; diff --git
> > > > a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > > index 475bd59c9ac2..a01f9b0c2f30 100644
> > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > > @@ -2486,6 +2486,7 @@ static int amdgpu_pmops_suspend_noirq(struct
> > > device *dev)
> > > > struct drm_device *drm_dev = dev_get_drvdata(dev);
> > > > struct amdgpu_device *adev = drm_to_adev(drm_dev);
> > > >
> > > > +   adev->pm_complete = true;
> > >
> > > This needs to be cleared somewhere on resume.
> > [Liang, Prike]  This flag is designed to indicate the amdgpu device 
> > suspension process status and will update the patch and clear it at the 
> > amdgpu suspension beginning point.
> > >
> > > > if (amdgpu_acpi_should_gpu_reset(adev))
> > > > return amdgpu_asic_reset(adev);
> > > >
> > > > diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > > > b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > > > index 57808be6e3ec..3bf51f18e13c 100644
> > > > --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > > > +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > > > @@ -3034,6 +3034,10 @@ static int gfx_v9_0_cp_gfx_start(struct
> > > > amdgpu_device *adev)
> > > >
> > > > gfx_v9_0_cp_gfx_enable(adev, true);
> > > >
> > > > +   if (adev->in_suspend && !adev->pm_complete) {
> > > > +   DRM_INFO(" will skip the csb ring write\n");
> > > > +   return 0;
> > > > +   }
> > >
> > > We probably want a similar fix for other gfx generations as well.
> > >
> > > Alex
> > >
> > [Liang, Prike] IIRC, there's no issue on the Mendocino side even without 
> > the fix. How about keep the other gfx generations unchanged firstly and 
> > after sort out the failed case will add the quirk for each specific gfx 
> > respectively?
>
> Mendocino only supports S0i3 so we don't touch gfx on suspend/resume.
> This would only happen on platforms that support S3.

E.g., try an aborted suspend on Raphael or PHX2.

Alex

>
> Alex
>
> >
> > > > r = amdgpu_ring_alloc(ring, gfx_v9_0_get_csb_size(adev) + 4 + 
> > > > 3);
> > > > if (r) {
> > > > DRM_ERROR("amdgpu: cp failed to lock ring (%d).\n",
> > > > r);
> > > > --
> > > > 2.34.1
> > > >


Re: [PATCH 1/2] drm/amdgpu: skip to program GFXDEC registers for PM abort case

2024-01-25 Thread Alex Deucher
On Wed, Jan 24, 2024 at 10:46 PM Prike Liang  wrote:
>
> In the PM abort cases, the gfx power rail doesn't turn off so
> some GFXDEC registers/CSB can't reset to default vaule. In order
> to avoid unexpected problem now need skip to program GFXDEC registers
> and bypass issue CSB packet for PM abort case.
>
> Signed-off-by: Prike Liang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h | 2 ++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 2 ++
>  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c   | 6 ++
>  3 files changed, 10 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index c5f3859fd682..9e9e3385b5d3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -1079,6 +1079,8 @@ struct amdgpu_device {
> boolin_s3;
> boolin_s4;
> boolin_s0ix;
> +   /* indicate amdgpu suspension status */
> +   boolpm_complete;

Maybe rename this suspend_complete to better specify that it only
applies on the suspend side.

Alex

>
> enum pp_mp1_state   mp1_state;
> struct amdgpu_doorbell_index doorbell_index;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 475bd59c9ac2..9cb8f7fe55cf 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -2472,6 +2472,7 @@ static int amdgpu_pmops_suspend(struct device *dev)
> struct drm_device *drm_dev = dev_get_drvdata(dev);
> struct amdgpu_device *adev = drm_to_adev(drm_dev);
>
> +   adev->pm_complete = false;
> if (amdgpu_acpi_is_s0ix_active(adev))
> adev->in_s0ix = true;
> else if (amdgpu_acpi_is_s3_active(adev))
> @@ -2486,6 +2487,7 @@ static int amdgpu_pmops_suspend_noirq(struct device 
> *dev)
> struct drm_device *drm_dev = dev_get_drvdata(dev);
> struct amdgpu_device *adev = drm_to_adev(drm_dev);
>
> +   adev->pm_complete = true;
> if (amdgpu_acpi_should_gpu_reset(adev))
> return amdgpu_asic_reset(adev);
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c 
> b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> index 57808be6e3ec..0abdc85eda77 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> @@ -3034,6 +3034,12 @@ static int gfx_v9_0_cp_gfx_start(struct amdgpu_device 
> *adev)
>
> gfx_v9_0_cp_gfx_enable(adev, true);
>
> +   /* TODO: double confirm whether need to reinitialize gfxedc and 
> submit csb packet
> +* on the other gfx generations for the pm suspend abort case. */
> +   if (adev->in_suspend && !adev->pm_complete) {
> +   DRM_INFO(" will skip the csb ring write\n");
> +   return 0;
> +   }
> r = amdgpu_ring_alloc(ring, gfx_v9_0_get_csb_size(adev) + 4 + 3);
> if (r) {
> DRM_ERROR("amdgpu: cp failed to lock ring (%d).\n", r);
> --
> 2.34.1
>


Re: [PATCH 1/2] drm/amdgpu: skip to program GFXDEC registers for PM abort case

2024-01-25 Thread Alex Deucher
On Wed, Jan 24, 2024 at 9:39 PM Liang, Prike  wrote:
>
> [AMD Official Use Only - General]
>
> Hi, Alex
> > -Original Message-
> > From: Alex Deucher 
> > Sent: Wednesday, January 24, 2024 11:59 PM
> > To: Liang, Prike 
> > Cc: amd-gfx@lists.freedesktop.org; Deucher, Alexander
> > ; Sharma, Deepak
> > 
> > Subject: Re: [PATCH 1/2] drm/amdgpu: skip to program GFXDEC registers for
> > PM abort case
> >
> > On Wed, Jan 24, 2024 at 2:12 AM Prike Liang  wrote:
> > >
> > > In the PM abort cases, the gfx power rail doesn't turn off so some
> > > GFXDEC registers/CSB can't reset to default vaule. In order to avoid
> > > unexpected problem now need skip to program GFXDEC registers and
> > > bypass issue CSB packet for PM abort case.
> > >
> > > Signed-off-by: Prike Liang 
> > > ---
> > >  drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 +
> > >  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 1 +
> > >  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c   | 4 
> > >  3 files changed, 6 insertions(+)
> > >
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > > b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > > index c5f3859fd682..26d983eb831b 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > > @@ -1079,6 +1079,7 @@ struct amdgpu_device {
> > > boolin_s3;
> > > boolin_s4;
> > > boolin_s0ix;
> > > +   boolpm_complete;
> > >
> > > enum pp_mp1_state   mp1_state;
> > > struct amdgpu_doorbell_index doorbell_index; diff --git
> > > a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > index 475bd59c9ac2..a01f9b0c2f30 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > @@ -2486,6 +2486,7 @@ static int amdgpu_pmops_suspend_noirq(struct
> > device *dev)
> > > struct drm_device *drm_dev = dev_get_drvdata(dev);
> > > struct amdgpu_device *adev = drm_to_adev(drm_dev);
> > >
> > > +   adev->pm_complete = true;
> >
> > This needs to be cleared somewhere on resume.
> [Liang, Prike]  This flag is designed to indicate the amdgpu device 
> suspension process status and will update the patch and clear it at the 
> amdgpu suspension beginning point.
> >
> > > if (amdgpu_acpi_should_gpu_reset(adev))
> > > return amdgpu_asic_reset(adev);
> > >
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > > b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > > index 57808be6e3ec..3bf51f18e13c 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > > @@ -3034,6 +3034,10 @@ static int gfx_v9_0_cp_gfx_start(struct
> > > amdgpu_device *adev)
> > >
> > > gfx_v9_0_cp_gfx_enable(adev, true);
> > >
> > > +   if (adev->in_suspend && !adev->pm_complete) {
> > > +   DRM_INFO(" will skip the csb ring write\n");
> > > +   return 0;
> > > +   }
> >
> > We probably want a similar fix for other gfx generations as well.
> >
> > Alex
> >
> [Liang, Prike] IIRC, there's no issue on the Mendocino side even without the 
> fix. How about keep the other gfx generations unchanged firstly and after 
> sort out the failed case will add the quirk for each specific gfx 
> respectively?

Mendocino only supports S0i3 so we don't touch gfx on suspend/resume.
This would only happen on platforms that support S3.

Alex

>
> > > r = amdgpu_ring_alloc(ring, gfx_v9_0_get_csb_size(adev) + 4 + 3);
> > > if (r) {
> > > DRM_ERROR("amdgpu: cp failed to lock ring (%d).\n",
> > > r);
> > > --
> > > 2.34.1
> > >


[PATCH v4 3/3] drm/buddy: Add defragmentation support

2024-01-25 Thread Arunpravin Paneer Selvam
Add a function to support defragmentation.

Signed-off-by: Arunpravin Paneer Selvam 
Suggested-by: Matthew Auld 
---
 drivers/gpu/drm/drm_buddy.c | 48 -
 1 file changed, 37 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/drm_buddy.c b/drivers/gpu/drm/drm_buddy.c
index d44172f23f05..3cffa9cc12d7 100644
--- a/drivers/gpu/drm/drm_buddy.c
+++ b/drivers/gpu/drm/drm_buddy.c
@@ -277,7 +277,8 @@ drm_get_buddy(struct drm_buddy_block *block)
 EXPORT_SYMBOL(drm_get_buddy);
 
 static void __drm_buddy_free(struct drm_buddy *mm,
-struct drm_buddy_block *block)
+struct drm_buddy_block *block,
+bool defrag)
 {
struct drm_buddy_block *parent;
 
@@ -289,12 +290,14 @@ static void __drm_buddy_free(struct drm_buddy *mm,
if (!drm_buddy_block_is_free(buddy))
break;
 
-   if (drm_buddy_block_is_clear(block) !=
-   drm_buddy_block_is_clear(buddy))
-   break;
+   if (!defrag) {
+   if (drm_buddy_block_is_clear(block) !=
+   drm_buddy_block_is_clear(buddy))
+   break;
 
-   if (drm_buddy_block_is_clear(block))
-   mark_cleared(parent);
+   if (drm_buddy_block_is_clear(block))
+   mark_cleared(parent);
+   }
 
list_del(&buddy->link);
 
@@ -307,6 +310,19 @@ static void __drm_buddy_free(struct drm_buddy *mm,
mark_free(mm, block);
 }
 
+static void drm_buddy_defrag(struct drm_buddy *mm)
+{
+   struct drm_buddy_block *block;
+   struct list_head *list;
+
+   list = &mm->free_list[0];
+   if (list_empty(list))
+   return;
+
+   list_for_each_entry_reverse(block, list, link)
+   __drm_buddy_free(mm, block, 1);
+}
+
 /**
  * drm_buddy_free_block - free a block
  *
@@ -321,7 +337,7 @@ void drm_buddy_free_block(struct drm_buddy *mm,
if (drm_buddy_block_is_clear(block))
mm->clear_avail += drm_buddy_block_size(mm, block);
 
-   __drm_buddy_free(mm, block);
+   __drm_buddy_free(mm, block, 0);
 }
 EXPORT_SYMBOL(drm_buddy_free_block);
 
@@ -447,7 +463,7 @@ __alloc_range_bias(struct drm_buddy *mm,
if (buddy &&
(drm_buddy_block_is_free(block) &&
 drm_buddy_block_is_free(buddy)))
-   __drm_buddy_free(mm, block);
+   __drm_buddy_free(mm, block, 0);
return ERR_PTR(err);
 }
 
@@ -577,7 +593,7 @@ alloc_from_freelist(struct drm_buddy *mm,
 
 err_undo:
if (tmp != order)
-   __drm_buddy_free(mm, block);
+   __drm_buddy_free(mm, block, 0);
return ERR_PTR(err);
 }
 
@@ -657,7 +673,7 @@ static int __alloc_range(struct drm_buddy *mm,
if (buddy &&
(drm_buddy_block_is_free(block) &&
 drm_buddy_block_is_free(buddy)))
-   __drm_buddy_free(mm, block);
+   __drm_buddy_free(mm, block, 0);
 
 err_free:
if (err == -ENOSPC && total_allocated_on_err) {
@@ -903,7 +919,7 @@ int drm_buddy_alloc_blocks(struct drm_buddy *mm,
 
if (order-- == min_order) {
if (flags & DRM_BUDDY_CONTIGUOUS_ALLOCATION &&
-   !(flags & DRM_BUDDY_RANGE_ALLOCATION))
+   !(flags & DRM_BUDDY_RANGE_ALLOCATION)) {
/*
 * Try contiguous block allocation 
through
 * try harder method
@@ -912,6 +928,16 @@ int drm_buddy_alloc_blocks(struct drm_buddy *mm,
 
original_size,
 
original_min_size,
 
blocks);
+
+   /*
+* Defragment the freelist and try 
contiguous block
+* allocation
+*/
+   drm_buddy_defrag(mm);
+   if (!IS_ERR(alloc_from_freelist(mm, 
min_order, flags)))
+   break;
+   }
+
err = -ENOSPC;
goto err_free;
}
-- 
2.25.1



[PATCH v4 2/3] drm/amdgpu: Enable clear page functionality

2024-01-25 Thread Arunpravin Paneer Selvam
Add clear page support in vram memory region.

v1:(Christian)
  - Dont handle clear page as TTM flag since when moving the BO back
in from GTT again we don't need that.
  - Make a specialized version of amdgpu_fill_buffer() which only
clears the VRAM areas which are not already cleared
  - Drop the TTM_PL_FLAG_WIPE_ON_RELEASE check in
amdgpu_object.c

v2:
  - Modify the function name amdgpu_ttm_* (Alex)
  - Drop the delayed parameter (Christian)
  - handle amdgpu_res_cleared(&cursor) just above the size
calculation (Christian)
  - Use AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE for clearing the buffers
in the free path to properly wait for fences etc.. (Christian)

v3:(Christian)
  - Remove buffer clear code in VRAM manager instead change the
AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE handling to set
the DRM_BUDDY_CLEARED flag.
  - Remove ! from amdgpu_res_cleared(&cursor) check.

Signed-off-by: Arunpravin Paneer Selvam 
Suggested-by: Christian König 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c| 22 ---
 .../gpu/drm/amd/amdgpu/amdgpu_res_cursor.h| 25 
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c   | 61 ++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h   |  5 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c  |  6 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.h  |  5 ++
 6 files changed, 111 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index b2e9a2f81d82..be32f9852d19 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -39,6 +39,7 @@
 #include "amdgpu.h"
 #include "amdgpu_trace.h"
 #include "amdgpu_amdkfd.h"
+#include "amdgpu_vram_mgr.h"
 
 /**
  * DOC: amdgpu_object
@@ -595,8 +596,7 @@ int amdgpu_bo_create(struct amdgpu_device *adev,
if (!amdgpu_bo_support_uswc(bo->flags))
bo->flags &= ~AMDGPU_GEM_CREATE_CPU_GTT_USWC;
 
-   if (adev->ras_enabled)
-   bo->flags |= AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE;
+   bo->flags |= AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE;
 
bo->tbo.bdev = &adev->mman.bdev;
if (bp->domain & (AMDGPU_GEM_DOMAIN_GWS | AMDGPU_GEM_DOMAIN_OA |
@@ -626,15 +626,17 @@ int amdgpu_bo_create(struct amdgpu_device *adev,
 
if (bp->flags & AMDGPU_GEM_CREATE_VRAM_CLEARED &&
bo->tbo.resource->mem_type == TTM_PL_VRAM) {
-   struct dma_fence *fence;
+   struct dma_fence *fence = NULL;
 
-   r = amdgpu_fill_buffer(bo, 0, bo->tbo.base.resv, &fence, true);
+   r = amdgpu_ttm_clear_buffer(bo, bo->tbo.base.resv, &fence);
if (unlikely(r))
goto fail_unreserve;
 
-   dma_resv_add_fence(bo->tbo.base.resv, fence,
-  DMA_RESV_USAGE_KERNEL);
-   dma_fence_put(fence);
+   if (fence) {
+   dma_resv_add_fence(bo->tbo.base.resv, fence,
+  DMA_RESV_USAGE_KERNEL);
+   dma_fence_put(fence);
+   }
}
if (!bp->resv)
amdgpu_bo_unreserve(bo);
@@ -1357,8 +1359,12 @@ void amdgpu_bo_release_notify(struct ttm_buffer_object 
*bo)
if (WARN_ON_ONCE(!dma_resv_trylock(bo->base.resv)))
return;
 
-   r = amdgpu_fill_buffer(abo, AMDGPU_POISON, bo->base.resv, &fence, true);
+   r = amdgpu_fill_buffer(abo, 0, bo->base.resv, &fence, true);
if (!WARN_ON(r)) {
+   struct amdgpu_vram_mgr_resource *vres;
+
+   vres = to_amdgpu_vram_mgr_resource(bo->resource);
+   vres->flags |= DRM_BUDDY_CLEARED;
amdgpu_bo_fence(abo, fence, false);
dma_fence_put(fence);
}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_res_cursor.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_res_cursor.h
index 381101d2bf05..50fcd86e1033 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_res_cursor.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_res_cursor.h
@@ -164,4 +164,29 @@ static inline void amdgpu_res_next(struct 
amdgpu_res_cursor *cur, uint64_t size)
}
 }
 
+/**
+ * amdgpu_res_cleared - check if blocks are cleared
+ *
+ * @cur: the cursor to extract the block
+ *
+ * Check if the @cur block is cleared
+ */
+static inline bool amdgpu_res_cleared(struct amdgpu_res_cursor *cur)
+{
+   struct drm_buddy_block *block;
+
+   switch (cur->mem_type) {
+   case TTM_PL_VRAM:
+   block = cur->node;
+
+   if (!amdgpu_vram_mgr_is_cleared(block))
+   return false;
+   break;
+   default:
+   return false;
+   }
+
+   return true;
+}
+
 #endif
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index 46a24d2308aa..15cdda77573c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/

[PATCH v4 1/3] drm/buddy: Implement tracking clear page feature

2024-01-25 Thread Arunpravin Paneer Selvam
- Add tracking clear page feature.

- Driver should enable the DRM_BUDDY_CLEARED flag if it
  successfully clears the blocks in the free path. On the otherhand,
  DRM buddy marks each block as cleared.

- Track the available cleared pages size

- If driver requests cleared memory we prefer cleared memory
  but fallback to uncleared if we can't find the cleared blocks.
  when driver requests uncleared memory we try to use uncleared but
  fallback to cleared memory if necessary.

- When a block gets freed we clear it and mark the freed block as cleared,
  when there are buddies which are cleared as well we can merge them.
  Otherwise, we prefer to keep the blocks as separated.

v1: (Christian)
  - Depends on the flag check DRM_BUDDY_CLEARED, enable the block as
cleared. Else, reset the clear flag for each block in the list.

  - For merging the 2 cleared blocks compare as below,
drm_buddy_is_clear(block) != drm_buddy_is_clear(buddy)

Signed-off-by: Arunpravin Paneer Selvam 
Suggested-by: Christian König 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c  |   6 +-
 drivers/gpu/drm/drm_buddy.c   | 169 +++---
 drivers/gpu/drm/i915/i915_ttm_buddy_manager.c |   6 +-
 drivers/gpu/drm/tests/drm_buddy_test.c|  10 +-
 include/drm/drm_buddy.h   |  18 +-
 5 files changed, 168 insertions(+), 41 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
index 08916538a615..d0e199cc8f17 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
@@ -556,7 +556,7 @@ static int amdgpu_vram_mgr_new(struct ttm_resource_manager 
*man,
return 0;
 
 error_free_blocks:
-   drm_buddy_free_list(mm, &vres->blocks);
+   drm_buddy_free_list(mm, &vres->blocks, 0);
mutex_unlock(&mgr->lock);
 error_fini:
ttm_resource_fini(man, &vres->base);
@@ -589,7 +589,7 @@ static void amdgpu_vram_mgr_del(struct ttm_resource_manager 
*man,
 
amdgpu_vram_mgr_do_reserve(man);
 
-   drm_buddy_free_list(mm, &vres->blocks);
+   drm_buddy_free_list(mm, &vres->blocks, 0);
mutex_unlock(&mgr->lock);
 
atomic64_sub(vis_usage, &mgr->vis_usage);
@@ -897,7 +897,7 @@ void amdgpu_vram_mgr_fini(struct amdgpu_device *adev)
kfree(rsv);
 
list_for_each_entry_safe(rsv, temp, &mgr->reserved_pages, blocks) {
-   drm_buddy_free_list(&mgr->mm, &rsv->allocated);
+   drm_buddy_free_list(&mgr->mm, &rsv->allocated, 0);
kfree(rsv);
}
if (!adev->gmc.is_app_apu)
diff --git a/drivers/gpu/drm/drm_buddy.c b/drivers/gpu/drm/drm_buddy.c
index f57e6d74fb0e..d44172f23f05 100644
--- a/drivers/gpu/drm/drm_buddy.c
+++ b/drivers/gpu/drm/drm_buddy.c
@@ -57,6 +57,16 @@ static void list_insert_sorted(struct drm_buddy *mm,
__list_add(&block->link, node->link.prev, &node->link);
 }
 
+static void clear_reset(struct drm_buddy_block *block)
+{
+   block->header &= ~DRM_BUDDY_HEADER_CLEAR;
+}
+
+static void mark_cleared(struct drm_buddy_block *block)
+{
+   block->header |= DRM_BUDDY_HEADER_CLEAR;
+}
+
 static void mark_allocated(struct drm_buddy_block *block)
 {
block->header &= ~DRM_BUDDY_HEADER_STATE;
@@ -223,6 +233,12 @@ static int split_block(struct drm_buddy *mm,
mark_free(mm, block->left);
mark_free(mm, block->right);
 
+   if (drm_buddy_block_is_clear(block)) {
+   mark_cleared(block->left);
+   mark_cleared(block->right);
+   clear_reset(block);
+   }
+
mark_split(block);
 
return 0;
@@ -273,6 +289,13 @@ static void __drm_buddy_free(struct drm_buddy *mm,
if (!drm_buddy_block_is_free(buddy))
break;
 
+   if (drm_buddy_block_is_clear(block) !=
+   drm_buddy_block_is_clear(buddy))
+   break;
+
+   if (drm_buddy_block_is_clear(block))
+   mark_cleared(parent);
+
list_del(&buddy->link);
 
drm_block_free(mm, block);
@@ -295,6 +318,9 @@ void drm_buddy_free_block(struct drm_buddy *mm,
 {
BUG_ON(!drm_buddy_block_is_allocated(block));
mm->avail += drm_buddy_block_size(mm, block);
+   if (drm_buddy_block_is_clear(block))
+   mm->clear_avail += drm_buddy_block_size(mm, block);
+
__drm_buddy_free(mm, block);
 }
 EXPORT_SYMBOL(drm_buddy_free_block);
@@ -305,10 +331,20 @@ EXPORT_SYMBOL(drm_buddy_free_block);
  * @mm: DRM buddy manager
  * @objects: input list head to free blocks
  */
-void drm_buddy_free_list(struct drm_buddy *mm, struct list_head *objects)
+void drm_buddy_free_list(struct drm_buddy *mm,
+struct list_head *objects,
+unsigned long flags)
 {
struct drm_buddy_block *block, *on;
 
+   if (flags & DRM_BUDDY_CLEARED) {
+   lis

Re: [PATCH] mm: Remove double faults once write a device pfn

2024-01-25 Thread Christian König

Am 24.01.24 um 12:04 schrieb Alistair Popple:

"Zhou, Xianrong"  writes:


[AMD Official Use Only - General]


The vmf_insert_pfn_prot could cause unnecessary double faults on a
device pfn. Because currently the vmf_insert_pfn_prot does not
make the pfn writable so the pte entry is normally read-only or
dirty catching.

What? How do you got to this conclusion?

Sorry. I did not mention that this problem only exists on arm64 platform.

Ok, that makes at least a little bit more sense.


Because on arm64 platform the PTE_RDONLY is automatically attached
to the userspace pte entries even through VM_WRITE + VM_SHARE.
The  PTE_RDONLY needs to be cleared in vmf_insert_pfn_prot. However
vmf_insert_pfn_prot do not make the pte writable passing false
@mkwrite to insert_pfn.

Question is why is arm64 doing this? As far as I can see they must
have some hardware reason for that.

The mkwrite parameter to insert_pfn() was added by commit
b2770da642540 to make insert_pfn() look more like insert_pfn_pmd() so
that the DAX code can insert PTEs which are writable and dirty at the same

time.

This is one scenario to do so. In fact on arm64 there are many
scenarios could be to do so. So we can let vmf_insert_pfn_prot
supporting @mkwrite for drivers at core layer and let drivers to
decide whether or not to make writable and dirty at one time. The
patch did this. Otherwise double faults on arm64 when call

vmf_insert_pfn_prot.

Well, that doesn't answer my question why arm64 is double faulting in the
first place,.



Eh.

On arm64 When userspace mmap() with PROT_WRITE and MAP_SHARED the
vma->vm_page_prot has the PTE_RDONLY and PTE_WRITE within
PAGE_SHARED_EXEC. (seeing arm64 protection_map)


Well that's your observation, but not the explanation why arm64 is doing 
this.


See this would have quite some negative impact on performance, not only 
for gfx drivers but in general.


So either the observation is incorrect or there is a *really* good 
reason why arm64 is taking this performance penalty.



When write the userspace virtual address the first fault happen and call
into driver's .fault->ttm_bo_vm_fault_reserved->vmf_insert_pfn_prot->insert_pfn.
The insert_pfn will establish the pte entry. However the vmf_insert_pfn_prot
pass false @mkwrite to insert_pfn by default and so insert_pfn could not make
the pfn writable and it do not call maybe_mkwrite(pte_mkdirty(entry), vma)
to clear the PTE_RDONLY bit. So the pte entry is actually write protection for 
mmu.
So when the first fault return and re-execute the store instruction the second
fault happen again. And in second fault it just only do pte_mkdirty(entry) which
clear the PTE_RDONLY.

It depends if the ARM64 CPU in question supports hardware dirty bit
management (DBM). If that is the case and PTE_DBM (ie. PTE_WRITE) is set
HW will automatically clear PTE_RDONLY bit to mark the entry dirty
instead of raising a write fault. So you shouldn't see a double fault if
PTE_DBM/WRITE is set.

On ARM64 you can kind of think of PTE_RDONLY as the HW dirty bit and
PTE_DBM as the read/write permission bit with SW being responsible for
updating PTE_RDONLY via the fault handler if DBM is not supported by HW.

At least that's my understanding from having hacked on this in the
past. You can see all this weirdness happening in the definitions of
pte_dirty() and pte_write() for ARM64.


+1

Thanks a lot for that, this was exactly the information I was looking for.

In this light it makes this patch here look unnecessary and questionable 
at best.


Xianrong if you have an arm64 platform which really double faults 
(confirmed through a debugger for example) then you have to ask why this 
platform shows this behavior and not try to work around it.


Behaviors like those usually have a very very good reason and without a 
confirmed explanation I'm not allowing any patch in which would disable 
stuff like that.


Regards,
Christian.




I think so and hope no wrong.


So as long as this isn't sorted out I'm going to reject this patch.

Regards,
Christian.


This is a completely different use case to what you try to use it
here for and that looks extremely fishy to me.

Regards,
Christian.


The first fault only sets up the pte entry which actually is dirty
catching. And the second immediate fault to the pfn due to first
dirty catching when the cpu re-execute the store instruction.

It could be that this is done to work around some hw behavior, but
not because of dirty catching.


Normally if the drivers call vmf_insert_pfn_prot and also supply
'pfn_mkwrite' callback within vm_operations_struct which requires
the pte to be dirty catching then the vmf_insert_pfn_prot and the
double fault are reasonable. It is not a problem.

Well, as far as I can see that behavior absolutely doesn't make sense.

When pfn_mkwrite is requested then the driver should use PAGE_COPY,
which is exactly what VMWGFX (the only driver using dirty tracking)
is

doing.

Everybody else uses PAGE_SHARED which should make the pte w

Re: [PATCH v2] drm/amdgpu: move the drm client creation behind drm device registration

2024-01-25 Thread Lazar, Lijo



On 1/25/2024 1:37 PM, Le Ma wrote:
> This patch is to eliminate interrupt warning below:
> 
>   "[drm] Fence fallback timer expired on ring sdma0.0".
> 
> An early vm pt clearing job is sent to SDMA ahead of interrupt enabled,
> introduced by patch below:
> 
>   - drm/amdkfd: Export DMABufs from KFD using GEM handles
> 
The drm client creation logic in this patch won't work well for dynamic
partition switch cases. Better add a ''Fixes' tag also.

This looks fine to me, needs to be checked by Felix anyway.

Thanks,
Lijo

> And re-locating the drm client creation following after drm_dev_register
> looks like a more proper flow.
> 
> v2: wrap the drm client creation
> 
> Signed-off-by: Le Ma 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 32 ++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  2 ++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c|  4 +++
>  3 files changed, 27 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> index 77e263660288..41db030ddc4e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> @@ -141,11 +141,31 @@ static void amdgpu_amdkfd_reset_work(struct work_struct 
> *work)
>  static const struct drm_client_funcs kfd_client_funcs = {
>   .unregister = drm_client_release,
>  };
> +
> +int amdgpu_amdkfd_drm_client_create(struct amdgpu_device *adev)
> +{
> + int ret;
> +
> + if (!adev->kfd.init_complete)
> + return 0;
> +
> + ret = drm_client_init(&adev->ddev, &adev->kfd.client, "kfd",
> +   &kfd_client_funcs);
> + if (ret) {
> + dev_err(adev->dev, "Failed to init DRM client: %d\n",
> + ret);
> + return ret;
> + }
> +
> + drm_client_register(&adev->kfd.client);
> +
> + return 0;
> +}
> +
>  void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)
>  {
>   int i;
>   int last_valid_bit;
> - int ret;
>  
>   amdgpu_amdkfd_gpuvm_init_mem_limits();
>  
> @@ -164,12 +184,6 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device 
> *adev)
>   .enable_mes = adev->enable_mes,
>   };
>  
> - ret = drm_client_init(&adev->ddev, &adev->kfd.client, "kfd", 
> &kfd_client_funcs);
> - if (ret) {
> - dev_err(adev->dev, "Failed to init DRM client: %d\n", 
> ret);
> - return;
> - }
> -
>   /* this is going to have a few of the MSBs set that we need to
>* clear
>*/
> @@ -208,10 +222,6 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device 
> *adev)
>  
>   adev->kfd.init_complete = kgd2kfd_device_init(adev->kfd.dev,
>   &gpu_resources);
> - if (adev->kfd.init_complete)
> - drm_client_register(&adev->kfd.client);
> - else
> - drm_client_release(&adev->kfd.client);
>  
>   amdgpu_amdkfd_total_mem_size += adev->gmc.real_vram_size;
>  
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> index 584a0cea5572..da175c384adf 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> @@ -182,6 +182,8 @@ int amdgpu_queue_mask_bit_to_set_resource_bit(struct 
> amdgpu_device *adev,
>  struct amdgpu_amdkfd_fence *amdgpu_amdkfd_fence_create(u64 context,
>   struct mm_struct *mm,
>   struct svm_range_bo *svm_bo);
> +
> +int amdgpu_amdkfd_drm_client_create(struct amdgpu_device *adev);
>  #if defined(CONFIG_DEBUG_FS)
>  int kfd_debugfs_kfd_mem_limits(struct seq_file *m, void *data);
>  #endif
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 475bd59c9ac2..91d5d9435067 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -2255,6 +2255,10 @@ static int amdgpu_pci_probe(struct pci_dev *pdev,
>   if (ret)
>   goto err_pci;
>  
> + ret = amdgpu_amdkfd_drm_client_create(adev);
> + if (ret)
> + goto err_pci;
> +
>   /*
>* 1. don't init fbdev on hw without DCE
>* 2. don't init fbdev if there are no connectors


Re: [PATCH 2/2] drm/amdgpu: reset gpu for pm abort case

2024-01-25 Thread Lazar, Lijo



On 1/25/2024 8:52 AM, Prike Liang wrote:
> In the pm abort case the gfx power rail not turn off from FCH side and
> this will lead to the gfx reinitialized failed base on the unknown gfx
> HW status, so let's reset the gpu to a known good power state.
> 

>From the description, this an APU only problem (or this patch could only
resolve APU abort sequence). However, there is no check for APU in the
patch below.


> Signed-off-by: Prike Liang 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 5 +
>  drivers/gpu/drm/amd/amdgpu/soc15.c | 8 +++-
>  2 files changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 56d9dfa61290..4c40ffaaa5c2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4627,6 +4627,11 @@ int amdgpu_device_resume(struct drm_device *dev, bool 
> fbcon)
>   return r;
>   }
>  
> + if(amdgpu_asic_need_reset_on_init(adev)) {
> + DRM_INFO("PM abort case and let's reset asic \n");
> + amdgpu_asic_reset(adev);
> + }
> +

suspend_noirq is specific for suspend scenarios and not valid for
freeze/thaw. I guess this could trigger reset for successful restore on
APUs.

>   if (dev->switch_power_state == DRM_SWITCH_POWER_OFF)
>   return 0;
>  
> diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c 
> b/drivers/gpu/drm/amd/amdgpu/soc15.c
> index 15033efec2ba..9329a00b6abc 100644
> --- a/drivers/gpu/drm/amd/amdgpu/soc15.c
> +++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
> @@ -804,9 +804,16 @@ static bool soc15_need_reset_on_init(struct 
> amdgpu_device *adev)
>   if (adev->asic_type == CHIP_RENOIR)
>   return true;
>  
> + sol_reg = RREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_81);
> +
>   /* Just return false for soc15 GPUs.  Reset does not seem to
>* be necessary.
>*/

The comment now doesn't make sense.

Thanks,
Lijo

> + if (adev->in_suspend && !adev->in_s0ix &&
> + !adev->pm_complete &&
> + sol_reg)
> + return true;
> +
>   if (!amdgpu_passthrough(adev))
>   return false;
>  
> @@ -816,7 +823,6 @@ static bool soc15_need_reset_on_init(struct amdgpu_device 
> *adev)
>   /* Check sOS sign of life register to confirm sys driver and sOS
>* are already been loaded.
>*/
> - sol_reg = RREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_81);
>   if (sol_reg)
>   return true;
>  


Re: [PATCH v2 2/2] drm/amdgpu: Implement check_async_props for planes

2024-01-25 Thread Ville Syrjälä
On Wed, Jan 24, 2024 at 11:14:40AM -0300, André Almeida wrote:
> Hi Ville,
> 
> Em 19/01/2024 15:25, Ville Syrjälä escreveu:
> > On Fri, Jan 19, 2024 at 03:12:35PM -0300, André Almeida wrote:
> >> AMD GPUs can do async flips with changes on more properties than just
> >> the FB ID, so implement a custom check_async_props for AMD planes.
> >>
> >> Allow amdgpu to do async flips with IN_FENCE_ID and FB_DAMAGE_CLIPS
> >> properties. For userspace to check if a driver support this two
> >> properties, the strategy for now is to use TEST_ONLY commits.
> >>
> >> Signed-off-by: André Almeida 
> >> ---
> >> v2: Drop overlay plane option for now
> >>
> >>   .../amd/display/amdgpu_dm/amdgpu_dm_plane.c   | 29 +++
> >>   1 file changed, 29 insertions(+)
> >>
> >> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_plane.c 
> >> b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_plane.c
> >> index 116121e647ca..7afe8c1b62d4 100644
> >> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_plane.c
> >> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_plane.c
> >> @@ -25,6 +25,7 @@
> >>*/
> >>   
> >>   #include 
> >> +#include 
> >>   #include 
> >>   #include 
> >>   #include 
> >> @@ -1430,6 +1431,33 @@ static void 
> >> amdgpu_dm_plane_drm_plane_destroy_state(struct drm_plane *plane,
> >>drm_atomic_helper_plane_destroy_state(plane, state);
> >>   }
> >>   
> >> +static int amdgpu_dm_plane_check_async_props(struct drm_property *prop,
> >> +struct drm_plane *plane,
> >> +struct drm_plane_state *plane_state,
> >> +struct drm_mode_object *obj,
> >> +u64 prop_value, u64 old_val)
> >> +{
> >> +  struct drm_mode_config *config = &plane->dev->mode_config;
> >> +  int ret;
> >> +
> >> +  if (prop != config->prop_fb_id &&
> >> +  prop != config->prop_in_fence_fd &&
> > 
> > IN_FENCE should just be allowed always.
> 
> Do you mean that the common path should allow IN_FENCE_FD for all drivers?

Yes.

-- 
Ville Syrjälä
Intel


Re: [PATCH] mm: Remove double faults once write a device pfn

2024-01-25 Thread Alistair Popple


"Zhou, Xianrong"  writes:

> [AMD Official Use Only - General]
>
>> > The vmf_insert_pfn_prot could cause unnecessary double faults on a
>> > device pfn. Because currently the vmf_insert_pfn_prot does not
>> > make the pfn writable so the pte entry is normally read-only or
>> > dirty catching.
>>  What? How do you got to this conclusion?
>> >>> Sorry. I did not mention that this problem only exists on arm64 platform.
>> >> Ok, that makes at least a little bit more sense.
>> >>
>> >>> Because on arm64 platform the PTE_RDONLY is automatically attached
>> >>> to the userspace pte entries even through VM_WRITE + VM_SHARE.
>> >>> The  PTE_RDONLY needs to be cleared in vmf_insert_pfn_prot. However
>> >>> vmf_insert_pfn_prot do not make the pte writable passing false
>> >>> @mkwrite to insert_pfn.
>> >> Question is why is arm64 doing this? As far as I can see they must
>> >> have some hardware reason for that.
>> >>
>> >> The mkwrite parameter to insert_pfn() was added by commit
>> >> b2770da642540 to make insert_pfn() look more like insert_pfn_pmd() so
>> >> that the DAX code can insert PTEs which are writable and dirty at the same
>> time.
>> >>
>> > This is one scenario to do so. In fact on arm64 there are many
>> > scenarios could be to do so. So we can let vmf_insert_pfn_prot
>> > supporting @mkwrite for drivers at core layer and let drivers to
>> > decide whether or not to make writable and dirty at one time. The
>> > patch did this. Otherwise double faults on arm64 when call
>> vmf_insert_pfn_prot.
>>
>> Well, that doesn't answer my question why arm64 is double faulting in the
>> first place,.
>>
>
>
> Eh.
>
> On arm64 When userspace mmap() with PROT_WRITE and MAP_SHARED the
> vma->vm_page_prot has the PTE_RDONLY and PTE_WRITE within
> PAGE_SHARED_EXEC. (seeing arm64 protection_map)
>
> When write the userspace virtual address the first fault happen and call
> into driver's 
> .fault->ttm_bo_vm_fault_reserved->vmf_insert_pfn_prot->insert_pfn.
> The insert_pfn will establish the pte entry. However the vmf_insert_pfn_prot
> pass false @mkwrite to insert_pfn by default and so insert_pfn could not make
> the pfn writable and it do not call maybe_mkwrite(pte_mkdirty(entry), vma)
> to clear the PTE_RDONLY bit. So the pte entry is actually write protection 
> for mmu.
> So when the first fault return and re-execute the store instruction the second
> fault happen again. And in second fault it just only do pte_mkdirty(entry) 
> which
> clear the PTE_RDONLY.

It depends if the ARM64 CPU in question supports hardware dirty bit
management (DBM). If that is the case and PTE_DBM (ie. PTE_WRITE) is set
HW will automatically clear PTE_RDONLY bit to mark the entry dirty
instead of raising a write fault. So you shouldn't see a double fault if
PTE_DBM/WRITE is set.

On ARM64 you can kind of think of PTE_RDONLY as the HW dirty bit and
PTE_DBM as the read/write permission bit with SW being responsible for
updating PTE_RDONLY via the fault handler if DBM is not supported by HW.

At least that's my understanding from having hacked on this in the
past. You can see all this weirdness happening in the definitions of
pte_dirty() and pte_write() for ARM64.

> I think so and hope no wrong.
>
>> So as long as this isn't sorted out I'm going to reject this patch.
>>
>> Regards,
>> Christian.
>>
>> >
>> >> This is a completely different use case to what you try to use it
>> >> here for and that looks extremely fishy to me.
>> >>
>> >> Regards,
>> >> Christian.
>> >>
>> > The first fault only sets up the pte entry which actually is dirty
>> > catching. And the second immediate fault to the pfn due to first
>> > dirty catching when the cpu re-execute the store instruction.
>>  It could be that this is done to work around some hw behavior, but
>>  not because of dirty catching.
>> 
>> > Normally if the drivers call vmf_insert_pfn_prot and also supply
>> > 'pfn_mkwrite' callback within vm_operations_struct which requires
>> > the pte to be dirty catching then the vmf_insert_pfn_prot and the
>> > double fault are reasonable. It is not a problem.
>>  Well, as far as I can see that behavior absolutely doesn't make sense.
>> 
>>  When pfn_mkwrite is requested then the driver should use PAGE_COPY,
>>  which is exactly what VMWGFX (the only driver using dirty tracking)
>>  is
>> >> doing.
>>  Everybody else uses PAGE_SHARED which should make the pte writeable
>>  immediately.
>> 
>>  Regards,
>>  Christian.
>> 
>> > However the most of drivers calling vmf_insert_pfn_prot do not
>> > supply the 'pfn_mkwrite' callback so that the second fault is
>> unnecessary.
>> >
>> > So just like vmf_insert_mixed and vmf_insert_mixed_mkwrite pair,
>> > we should also supply vmf_insert_pfn_mkwrite for drivers as well.
>> >
>> > Signed-off-by: Xianrong Zhou 
>> > ---
>> > arch/x86/entry/vdso/vma.c  |  3 +

Re: BUG [RESEND][NEW BUG]: kernel NULL pointer dereference, address: 0000000000000008

2024-01-25 Thread Mirsad Todorovac

Hi, Ma Jun,

Normally, I would reply under the quoted text, but I will adjust to your 
convention.

I have just discovered that your patch causes Ubuntu 22.04 LTS GNOME XWayland 
session
to block at typing password and ENTER in the graphical logon screen (tested 
several times).

After that, I was not able to even log from another box with ssh, or the 
session would
block (tested one time, second time too, thrid time it passed after I connected 
before
attempt to login on XWayland console).

You might find useful syslog and dmesg of the freeze on this link (they were 
+100K):

https://magrf.grf.hr/~mtodorov/linux/bugreports/6.7.0/amdgpu/6.7.0-xway-09721-g61da593f4458/

The exact applied patch was this:

marvin@defiant:~/linux/kernel/linux_torvalds$ git diff
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index 73f6d7e72c73..6ef333df9adf 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -3996,16 +3996,13 @@ static int gfx_v10_0_init_microcode(struct 
amdgpu_device *adev)
  
 if (!amdgpu_sriov_vf(adev)) {

 snprintf(fw_name, sizeof(fw_name), "amdgpu/%s_rlc.bin", 
ucode_prefix);
-   err = amdgpu_ucode_request(adev, &adev->gfx.rlc_fw, fw_name);
-   /* don't check this.  There are apparently firmwares in the 
wild with
-* incorrect size in the header
-*/
-   if (err == -ENODEV)
-   goto out;
+   err = request_firmware(&adev->gfx.rlc_fw, fw_name, adev->dev);
 if (err)
-   dev_dbg(adev->dev,
-   "gfx10: amdgpu_ucode_request() failed \"%s\"\n",
-   fw_name);
+   goto out;
+
+   /* don't validate this firmware.  There are apparently firmwares
+* in the wild with incorrect size in the header
+*/
 rlc_hdr = (const struct rlc_firmware_header_v2_0 
*)adev->gfx.rlc_fw->data;
 version_major = 
le16_to_cpu(rlc_hdr->header.header_version_major);
 version_minor = 
le16_to_cpu(rlc_hdr->header.header_version_minor);
marvin@defiant:~/linux/kernel/linux_torvalds$ uname -rms
Linux 6.7.0-xway-09721-g61da593f4458 x86_64
marvin@defiant:~/linux/kernel/linux_torvalds$

So, there seems to be a problem with the way the patch affects XWayland.

Checked multiple times the exact commit with and without the diff.

Hope this helps, because I am not familiar with the amdgpu driver.

Best regards,
Mirsad Todorovac

On 1/22/24 09:34, Ma, Jun wrote:

Perhaps similar to the problem I encountered earlier, you can
try the following patch

https://lists.freedesktop.org/archives/amd-gfx/2024-January/103259.html

Regards,
Ma Jun

On 1/21/2024 3:54 AM, Mirsad Todorovac wrote:

Hi,

The last email did not pass to the most of the recipients due to banned .xz 
attachment.

As the .config is too big to send inline or uncompressed either, I will omit it 
in this
attempt. In the meantime, I had some success in decoding the stack trace, but 
sadly not
complete.

I don't think this Oops is deterministic, but I am working on a reproducer.

The platform is Ubuntu 22.04 LTS.

Complete list of hardware and .config is available here:

https://domac.alu.unizg.hr/~mtodorov/linux/bugreports/amdgpu/6.7.0-rtl-v02-nokcsan-09928-g052d534373b7/

Best regards,
Mirsad

---
kernel: [5.576702] BUG: kernel NULL pointer dereference, address: 
0008
kernel: [5.576707] #PF: supervisor read access in kernel mode
kernel: [5.576710] #PF: error_code(0x) - not-present page
kernel: [5.576712] PGD 0 P4D 0
kernel: [5.576715] Oops:  [#1] PREEMPT SMP NOPTI
kernel: [5.576718] CPU: 9 PID: 650 Comm: systemd-udevd Not tainted 
6.7.0-rtl-v0.2-nokcsan-09928-g052d534373b7 #2
kernel: [5.576723] Hardware name: ASRock X670E PG Lightning/X670E PG 
Lightning, BIOS 1.21 04/26/2023
kernel: [5.576726] RIP: 0010:gfx_v10_0_early_init 
(drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c:4009 
drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c:7478) amdgpu
kernel: [ 5.576872] Code: 8d 55 a8 4c 89 ff e8 e4 83 ec ff 41 89 c2 83 f8 ed 0f 84 b3 
fd ff ff 85 c0 74 05 0f 1f 44 00 00 49 8b 87 08 87 01 00 4c 89 ff <48> 8b 40 08 
0f b7 50 0a 0f b7 70 08 e8 e4 42 fb ff 41 89 c2 85 c0
All code

 0: 8d 55 a8lea-0x58(%rbp),%edx
 3: 4c 89 ffmov%r15,%rdi
 6: e8 e4 83 ec ff  call   0xffec83ef
 b: 41 89 c2mov%eax,%r10d
 e: 83 f8 edcmp$0xffed,%eax
11: 0f 84 b3 fd ff ff   je 0xfdca
17: 85 c0   test   %eax,%eax
19: 74 05   je 0x20
1b: 0f 1f 44 00 00  nopl   0x0(%rax,%rax,1)
20: 49 8b 87 08 

Re: [PATCH v3] drm/amdgpu/pm: Use macro definitions in the smu IH process function

2024-01-25 Thread Lazar, Lijo



On 1/25/2024 2:20 PM, Ma Jun wrote:
> Replace the hard-coded numbers with macro definition
> 
> Signed-off-by: Ma Jun 

Reviewed-by: Lijo Lazar 

Thanks,
Lijo

> ---
> v3:
> - Add new SMU_IH_INTERRUPT_* macros for smu, keeping the original
> macro definitions in sync with pmfw (kevin)
> ---
>  drivers/gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c | 10 +-
>  drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c | 14 +++---
>  drivers/gpu/drm/amd/pm/swsmu/smu14/smu_v14_0.c |  2 +-
>  drivers/gpu/drm/amd/pm/swsmu/smu_cmn.h | 10 ++
>  4 files changed, 23 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c 
> b/drivers/gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c
> index fbeb31bf9e48..f6545093bfc1 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c
> @@ -1432,24 +1432,24 @@ static int smu_v11_0_irq_process(struct amdgpu_device 
> *adev,
>   dev_emerg(adev->dev, "ERROR: System is going to shutdown due to 
> GPU HW CTF!\n");
>   orderly_poweroff(true);
>   } else if (client_id == SOC15_IH_CLIENTID_MP1) {
> - if (src_id == 0xfe) {
> + if (src_id == SMU_IH_INTERRUPT_ID_TO_DRIVER) {
>   /* ACK SMUToHost interrupt */
>   data = RREG32_SOC15(MP1, 0, mmMP1_SMN_IH_SW_INT_CTRL);
>   data = REG_SET_FIELD(data, MP1_SMN_IH_SW_INT_CTRL, 
> INT_ACK, 1);
>   WREG32_SOC15(MP1, 0, mmMP1_SMN_IH_SW_INT_CTRL, data);
>  
>   switch (ctxid) {
> - case 0x3:
> + case SMU_IH_INTERRUPT_CONTEXT_ID_AC:
>   dev_dbg(adev->dev, "Switched to AC mode!\n");
>   schedule_work(&smu->interrupt_work);
>   adev->pm.ac_power = true;
>   break;
> - case 0x4:
> + case SMU_IH_INTERRUPT_CONTEXT_ID_DC:
>   dev_dbg(adev->dev, "Switched to DC mode!\n");
>   schedule_work(&smu->interrupt_work);
>   adev->pm.ac_power = false;
>   break;
> - case 0x7:
> + case SMU_IH_INTERRUPT_CONTEXT_ID_THERMAL_THROTTLING:
>   /*
>* Increment the throttle interrupt counter
>*/
> @@ -1508,7 +1508,7 @@ int smu_v11_0_register_irq_handler(struct smu_context 
> *smu)
>   return ret;
>  
>   ret = amdgpu_irq_add_id(adev, SOC15_IH_CLIENTID_MP1,
> - 0xfe,
> + SMU_IH_INTERRUPT_ID_TO_DRIVER,
>   irq_src);
>   if (ret)
>   return ret;
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c 
> b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c
> index 1db74c0b5257..c13364830500 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c
> @@ -1368,24 +1368,24 @@ static int smu_v13_0_irq_process(struct amdgpu_device 
> *adev,
>   dev_emerg(adev->dev, "ERROR: System is going to shutdown due to 
> GPU HW CTF!\n");
>   orderly_poweroff(true);
>   } else if (client_id == SOC15_IH_CLIENTID_MP1) {
> - if (src_id == 0xfe) {
> + if (src_id == SMU_IH_INTERRUPT_ID_TO_DRIVER) {
>   /* ACK SMUToHost interrupt */
>   data = RREG32_SOC15(MP1, 0, regMP1_SMN_IH_SW_INT_CTRL);
>   data = REG_SET_FIELD(data, MP1_SMN_IH_SW_INT_CTRL, 
> INT_ACK, 1);
>   WREG32_SOC15(MP1, 0, regMP1_SMN_IH_SW_INT_CTRL, data);
>  
>   switch (ctxid) {
> - case 0x3:
> + case SMU_IH_INTERRUPT_CONTEXT_ID_AC:
>   dev_dbg(adev->dev, "Switched to AC mode!\n");
>   smu_v13_0_ack_ac_dc_interrupt(smu);
>   adev->pm.ac_power = true;
>   break;
> - case 0x4:
> + case SMU_IH_INTERRUPT_CONTEXT_ID_DC:
>   dev_dbg(adev->dev, "Switched to DC mode!\n");
>   smu_v13_0_ack_ac_dc_interrupt(smu);
>   adev->pm.ac_power = false;
>   break;
> - case 0x7:
> + case SMU_IH_INTERRUPT_CONTEXT_ID_THERMAL_THROTTLING:
>   /*
>* Increment the throttle interrupt counter
>*/
> @@ -1398,7 +1398,7 @@ static int smu_v13_0_irq_process(struct amdgpu_device 
> *adev,
>   
> schedule_wor

RE: [PATCH v3] drm/amdgpu/pm: Use macro definitions in the smu IH process function

2024-01-25 Thread Wang, Yang(Kevin)
[AMD Official Use Only - General]

Reviewed-by: Yang Wang 

Best Regards,
Kevin

-Original Message-
From: Ma, Jun 
Sent: Thursday, January 25, 2024 4:50 PM
To: amd-gfx@lists.freedesktop.org
Cc: Lazar, Lijo ; Feng, Kenneth ; 
Deucher, Alexander ; Wang, Yang(Kevin) 
; Ma, Jun 
Subject: [PATCH v3] drm/amdgpu/pm: Use macro definitions in the smu IH process 
function

Replace the hard-coded numbers with macro definition

Signed-off-by: Ma Jun 
---
v3:
- Add new SMU_IH_INTERRUPT_* macros for smu, keeping the original macro 
definitions in sync with pmfw (kevin)
---
 drivers/gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c | 10 +-  
drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c | 14 +++---  
drivers/gpu/drm/amd/pm/swsmu/smu14/smu_v14_0.c |  2 +-
 drivers/gpu/drm/amd/pm/swsmu/smu_cmn.h | 10 ++
 4 files changed, 23 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c
index fbeb31bf9e48..f6545093bfc1 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c
@@ -1432,24 +1432,24 @@ static int smu_v11_0_irq_process(struct amdgpu_device 
*adev,
dev_emerg(adev->dev, "ERROR: System is going to shutdown due to 
GPU HW CTF!\n");
orderly_poweroff(true);
} else if (client_id == SOC15_IH_CLIENTID_MP1) {
-   if (src_id == 0xfe) {
+   if (src_id == SMU_IH_INTERRUPT_ID_TO_DRIVER) {
/* ACK SMUToHost interrupt */
data = RREG32_SOC15(MP1, 0, mmMP1_SMN_IH_SW_INT_CTRL);
data = REG_SET_FIELD(data, MP1_SMN_IH_SW_INT_CTRL, 
INT_ACK, 1);
WREG32_SOC15(MP1, 0, mmMP1_SMN_IH_SW_INT_CTRL, data);

switch (ctxid) {
-   case 0x3:
+   case SMU_IH_INTERRUPT_CONTEXT_ID_AC:
dev_dbg(adev->dev, "Switched to AC mode!\n");
schedule_work(&smu->interrupt_work);
adev->pm.ac_power = true;
break;
-   case 0x4:
+   case SMU_IH_INTERRUPT_CONTEXT_ID_DC:
dev_dbg(adev->dev, "Switched to DC mode!\n");
schedule_work(&smu->interrupt_work);
adev->pm.ac_power = false;
break;
-   case 0x7:
+   case SMU_IH_INTERRUPT_CONTEXT_ID_THERMAL_THROTTLING:
/*
 * Increment the throttle interrupt counter
 */
@@ -1508,7 +1508,7 @@ int smu_v11_0_register_irq_handler(struct smu_context 
*smu)
return ret;

ret = amdgpu_irq_add_id(adev, SOC15_IH_CLIENTID_MP1,
-   0xfe,
+   SMU_IH_INTERRUPT_ID_TO_DRIVER,
irq_src);
if (ret)
return ret;
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c
index 1db74c0b5257..c13364830500 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c
@@ -1368,24 +1368,24 @@ static int smu_v13_0_irq_process(struct amdgpu_device 
*adev,
dev_emerg(adev->dev, "ERROR: System is going to shutdown due to 
GPU HW CTF!\n");
orderly_poweroff(true);
} else if (client_id == SOC15_IH_CLIENTID_MP1) {
-   if (src_id == 0xfe) {
+   if (src_id == SMU_IH_INTERRUPT_ID_TO_DRIVER) {
/* ACK SMUToHost interrupt */
data = RREG32_SOC15(MP1, 0, regMP1_SMN_IH_SW_INT_CTRL);
data = REG_SET_FIELD(data, MP1_SMN_IH_SW_INT_CTRL, 
INT_ACK, 1);
WREG32_SOC15(MP1, 0, regMP1_SMN_IH_SW_INT_CTRL, data);

switch (ctxid) {
-   case 0x3:
+   case SMU_IH_INTERRUPT_CONTEXT_ID_AC:
dev_dbg(adev->dev, "Switched to AC mode!\n");
smu_v13_0_ack_ac_dc_interrupt(smu);
adev->pm.ac_power = true;
break;
-   case 0x4:
+   case SMU_IH_INTERRUPT_CONTEXT_ID_DC:
dev_dbg(adev->dev, "Switched to DC mode!\n");
smu_v13_0_ack_ac_dc_interrupt(smu);
adev->pm.ac_power = false;
break;
-   case 0x7:
+   case SMU_IH_INTERRUPT_CONTEXT_ID_THERMAL_THROTTLING:
/*
 * 

[PATCH v3] drm/amdgpu/pm: Use macro definitions in the smu IH process function

2024-01-25 Thread Ma Jun
Replace the hard-coded numbers with macro definition

Signed-off-by: Ma Jun 
---
v3:
- Add new SMU_IH_INTERRUPT_* macros for smu, keeping the original
macro definitions in sync with pmfw (kevin)
---
 drivers/gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c | 10 +-
 drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c | 14 +++---
 drivers/gpu/drm/amd/pm/swsmu/smu14/smu_v14_0.c |  2 +-
 drivers/gpu/drm/amd/pm/swsmu/smu_cmn.h | 10 ++
 4 files changed, 23 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c
index fbeb31bf9e48..f6545093bfc1 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c
@@ -1432,24 +1432,24 @@ static int smu_v11_0_irq_process(struct amdgpu_device 
*adev,
dev_emerg(adev->dev, "ERROR: System is going to shutdown due to 
GPU HW CTF!\n");
orderly_poweroff(true);
} else if (client_id == SOC15_IH_CLIENTID_MP1) {
-   if (src_id == 0xfe) {
+   if (src_id == SMU_IH_INTERRUPT_ID_TO_DRIVER) {
/* ACK SMUToHost interrupt */
data = RREG32_SOC15(MP1, 0, mmMP1_SMN_IH_SW_INT_CTRL);
data = REG_SET_FIELD(data, MP1_SMN_IH_SW_INT_CTRL, 
INT_ACK, 1);
WREG32_SOC15(MP1, 0, mmMP1_SMN_IH_SW_INT_CTRL, data);
 
switch (ctxid) {
-   case 0x3:
+   case SMU_IH_INTERRUPT_CONTEXT_ID_AC:
dev_dbg(adev->dev, "Switched to AC mode!\n");
schedule_work(&smu->interrupt_work);
adev->pm.ac_power = true;
break;
-   case 0x4:
+   case SMU_IH_INTERRUPT_CONTEXT_ID_DC:
dev_dbg(adev->dev, "Switched to DC mode!\n");
schedule_work(&smu->interrupt_work);
adev->pm.ac_power = false;
break;
-   case 0x7:
+   case SMU_IH_INTERRUPT_CONTEXT_ID_THERMAL_THROTTLING:
/*
 * Increment the throttle interrupt counter
 */
@@ -1508,7 +1508,7 @@ int smu_v11_0_register_irq_handler(struct smu_context 
*smu)
return ret;
 
ret = amdgpu_irq_add_id(adev, SOC15_IH_CLIENTID_MP1,
-   0xfe,
+   SMU_IH_INTERRUPT_ID_TO_DRIVER,
irq_src);
if (ret)
return ret;
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c
index 1db74c0b5257..c13364830500 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c
@@ -1368,24 +1368,24 @@ static int smu_v13_0_irq_process(struct amdgpu_device 
*adev,
dev_emerg(adev->dev, "ERROR: System is going to shutdown due to 
GPU HW CTF!\n");
orderly_poweroff(true);
} else if (client_id == SOC15_IH_CLIENTID_MP1) {
-   if (src_id == 0xfe) {
+   if (src_id == SMU_IH_INTERRUPT_ID_TO_DRIVER) {
/* ACK SMUToHost interrupt */
data = RREG32_SOC15(MP1, 0, regMP1_SMN_IH_SW_INT_CTRL);
data = REG_SET_FIELD(data, MP1_SMN_IH_SW_INT_CTRL, 
INT_ACK, 1);
WREG32_SOC15(MP1, 0, regMP1_SMN_IH_SW_INT_CTRL, data);
 
switch (ctxid) {
-   case 0x3:
+   case SMU_IH_INTERRUPT_CONTEXT_ID_AC:
dev_dbg(adev->dev, "Switched to AC mode!\n");
smu_v13_0_ack_ac_dc_interrupt(smu);
adev->pm.ac_power = true;
break;
-   case 0x4:
+   case SMU_IH_INTERRUPT_CONTEXT_ID_DC:
dev_dbg(adev->dev, "Switched to DC mode!\n");
smu_v13_0_ack_ac_dc_interrupt(smu);
adev->pm.ac_power = false;
break;
-   case 0x7:
+   case SMU_IH_INTERRUPT_CONTEXT_ID_THERMAL_THROTTLING:
/*
 * Increment the throttle interrupt counter
 */
@@ -1398,7 +1398,7 @@ static int smu_v13_0_irq_process(struct amdgpu_device 
*adev,

schedule_work(&smu->throttling_logging_work);
 
break;
-   case 0x8:
+   case SMU_IH_INTERRUPT_CONTE

[PATCH v2] drm/amdgpu: move the drm client creation behind drm device registration

2024-01-25 Thread Le Ma
This patch is to eliminate interrupt warning below:

  "[drm] Fence fallback timer expired on ring sdma0.0".

An early vm pt clearing job is sent to SDMA ahead of interrupt enabled,
introduced by patch below:

  - drm/amdkfd: Export DMABufs from KFD using GEM handles

And re-locating the drm client creation following after drm_dev_register
looks like a more proper flow.

v2: wrap the drm client creation

Signed-off-by: Le Ma 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 32 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  2 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c|  4 +++
 3 files changed, 27 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index 77e263660288..41db030ddc4e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -141,11 +141,31 @@ static void amdgpu_amdkfd_reset_work(struct work_struct 
*work)
 static const struct drm_client_funcs kfd_client_funcs = {
.unregister = drm_client_release,
 };
+
+int amdgpu_amdkfd_drm_client_create(struct amdgpu_device *adev)
+{
+   int ret;
+
+   if (!adev->kfd.init_complete)
+   return 0;
+
+   ret = drm_client_init(&adev->ddev, &adev->kfd.client, "kfd",
+ &kfd_client_funcs);
+   if (ret) {
+   dev_err(adev->dev, "Failed to init DRM client: %d\n",
+   ret);
+   return ret;
+   }
+
+   drm_client_register(&adev->kfd.client);
+
+   return 0;
+}
+
 void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)
 {
int i;
int last_valid_bit;
-   int ret;
 
amdgpu_amdkfd_gpuvm_init_mem_limits();
 
@@ -164,12 +184,6 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)
.enable_mes = adev->enable_mes,
};
 
-   ret = drm_client_init(&adev->ddev, &adev->kfd.client, "kfd", 
&kfd_client_funcs);
-   if (ret) {
-   dev_err(adev->dev, "Failed to init DRM client: %d\n", 
ret);
-   return;
-   }
-
/* this is going to have a few of the MSBs set that we need to
 * clear
 */
@@ -208,10 +222,6 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)
 
adev->kfd.init_complete = kgd2kfd_device_init(adev->kfd.dev,
&gpu_resources);
-   if (adev->kfd.init_complete)
-   drm_client_register(&adev->kfd.client);
-   else
-   drm_client_release(&adev->kfd.client);
 
amdgpu_amdkfd_total_mem_size += adev->gmc.real_vram_size;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index 584a0cea5572..da175c384adf 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -182,6 +182,8 @@ int amdgpu_queue_mask_bit_to_set_resource_bit(struct 
amdgpu_device *adev,
 struct amdgpu_amdkfd_fence *amdgpu_amdkfd_fence_create(u64 context,
struct mm_struct *mm,
struct svm_range_bo *svm_bo);
+
+int amdgpu_amdkfd_drm_client_create(struct amdgpu_device *adev);
 #if defined(CONFIG_DEBUG_FS)
 int kfd_debugfs_kfd_mem_limits(struct seq_file *m, void *data);
 #endif
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 475bd59c9ac2..91d5d9435067 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -2255,6 +2255,10 @@ static int amdgpu_pci_probe(struct pci_dev *pdev,
if (ret)
goto err_pci;
 
+   ret = amdgpu_amdkfd_drm_client_create(adev);
+   if (ret)
+   goto err_pci;
+
/*
 * 1. don't init fbdev on hw without DCE
 * 2. don't init fbdev if there are no connectors
-- 
2.38.1