[PATCH] drm/amdgpu/mmsch: Correct the definition for mmsch init header

2023-06-05 Thread Emily Deng
The header layout is tied to the MMSCH interface version, so it shouldn't use
AMDGPU_MAX_VCN_INSTANCES.

Signed-off-by: Emily Deng 
---
 drivers/gpu/drm/amd/amdgpu/mmsch_v3_0.h | 4 +++-
 drivers/gpu/drm/amd/amdgpu/mmsch_v4_0.h | 4 +++-
 drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c   | 2 +-
 drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c   | 2 +-
 4 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/mmsch_v3_0.h 
b/drivers/gpu/drm/amd/amdgpu/mmsch_v3_0.h
index 3e4e858a6965..a773ef61b78c 100644
--- a/drivers/gpu/drm/amd/amdgpu/mmsch_v3_0.h
+++ b/drivers/gpu/drm/amd/amdgpu/mmsch_v3_0.h
@@ -30,6 +30,8 @@
 #define MMSCH_VERSION_MINOR 0
 #define MMSCH_VERSION  (MMSCH_VERSION_MAJOR << 16 | MMSCH_VERSION_MINOR)
 
+#define MMSCH_V3_0_VCN_INSTANCES 0x2
+
 enum mmsch_v3_0_command_type {
MMSCH_COMMAND__DIRECT_REG_WRITE = 0,
MMSCH_COMMAND__DIRECT_REG_POLLING = 2,
@@ -47,7 +49,7 @@ struct mmsch_v3_0_table_info {
 struct mmsch_v3_0_init_header {
uint32_t version;
uint32_t total_size;
-   struct mmsch_v3_0_table_info inst[AMDGPU_MAX_VCN_INSTANCES];
+   struct mmsch_v3_0_table_info inst[MMSCH_V3_0_VCN_INSTANCES];
 };
 
 struct mmsch_v3_0_cmd_direct_reg_header {
diff --git a/drivers/gpu/drm/amd/amdgpu/mmsch_v4_0.h 
b/drivers/gpu/drm/amd/amdgpu/mmsch_v4_0.h
index 83653a50a1a2..796d4f8791e5 100644
--- a/drivers/gpu/drm/amd/amdgpu/mmsch_v4_0.h
+++ b/drivers/gpu/drm/amd/amdgpu/mmsch_v4_0.h
@@ -43,6 +43,8 @@
 #define MMSCH_VF_MAILBOX_RESP__OK 0x1
 #define MMSCH_VF_MAILBOX_RESP__INCOMPLETE 0x2
 
+#define MMSCH_V4_0_VCN_INSTANCES 0x2
+
 enum mmsch_v4_0_command_type {
MMSCH_COMMAND__DIRECT_REG_WRITE = 0,
MMSCH_COMMAND__DIRECT_REG_POLLING = 2,
@@ -60,7 +62,7 @@ struct mmsch_v4_0_table_info {
 struct mmsch_v4_0_init_header {
uint32_t version;
uint32_t total_size;
-   struct mmsch_v4_0_table_info inst[AMDGPU_MAX_VCN_INSTANCES];
+   struct mmsch_v4_0_table_info inst[MMSCH_V4_0_VCN_INSTANCES];
struct mmsch_v4_0_table_info jpegdec;
 };
 
diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c 
b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
index 70fefbf26c48..c8f63b3c6f69 100644
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
@@ -1313,7 +1313,7 @@ static int vcn_v3_0_start_sriov(struct amdgpu_device 
*adev)
 
header.version = MMSCH_VERSION;
header.total_size = sizeof(struct mmsch_v3_0_init_header) >> 2;
-   for (i = 0; i < AMDGPU_MAX_VCN_INSTANCES; i++) {
+   for (i = 0; i < MMSCH_V3_0_VCN_INSTANCES; i++) {
header.inst[i].init_status = 0;
header.inst[i].table_offset = 0;
header.inst[i].table_size = 0;
diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c 
b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c
index 60c3fd20e8ce..8d371faaa2b3 100644
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c
@@ -1239,7 +1239,7 @@ static int vcn_v4_0_start_sriov(struct amdgpu_device 
*adev)
 
header.version = MMSCH_VERSION;
header.total_size = sizeof(struct mmsch_v4_0_init_header) >> 2;
-   for (i = 0; i < AMDGPU_MAX_VCN_INSTANCES; i++) {
+   for (i = 0; i < MMSCH_V4_0_VCN_INSTANCES; i++) {
header.inst[i].init_status = 0;
header.inst[i].table_offset = 0;
header.inst[i].table_size = 0;
-- 
2.36.1
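
Editor's note: the reason the array bound matters is that total_size is derived from sizeof() of the header, so any change to a driver-wide maximum silently changes the layout handed to the MMSCH firmware. The following is a minimal user-space sketch of that effect, not the driver code. All sketch_ names are hypothetical, the three-field table_info layout is inferred from the fields the patch initialises, and the driver-wide maximum of 4 is an assumption chosen only to show the difference.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; the real definitions live in mmsch_v3_0.h. */
#define SKETCH_MMSCH_VCN_INSTANCES  2  /* fixed by the MMSCH interface version */
#define SKETCH_DRIVER_MAX_INSTANCES 4  /* assumed driver-wide maximum, may grow */

/* Field names taken from the loop in the patch; layout is an assumption. */
struct sketch_table_info {
	uint32_t init_status;
	uint32_t table_offset;
	uint32_t table_size;
};

/* Header sized by the version-specific constant (what the patch does). */
struct sketch_header_fixed {
	uint32_t version;
	uint32_t total_size;
	struct sketch_table_info inst[SKETCH_MMSCH_VCN_INSTANCES];
};

/* Header sized by a driver-wide maximum (what the patch removes). */
struct sketch_header_driver_max {
	uint32_t version;
	uint32_t total_size;
	struct sketch_table_info inst[SKETCH_DRIVER_MAX_INSTANCES];
};

/* Header size in 32-bit words, mirroring "sizeof(...) >> 2" in the patch. */
static inline uint32_t sketch_fixed_dwords(void)
{
	return (uint32_t)(sizeof(struct sketch_header_fixed) >> 2);
}

static inline uint32_t sketch_driver_max_dwords(void)
{
	return (uint32_t)(sizeof(struct sketch_header_driver_max) >> 2);
}
```

With these assumed values the two helpers return different sizes, which is the mismatch the version-specific constant is meant to prevent.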



[PATCH] drm/amdgpu: disable virtual display support on APP device

2023-06-05 Thread Yang Wang
Virtual display is not supported on APP devices.

Signed-off-by: Yang Wang 
Signed-off-by: Gavin Wan 
Reviewed-by: Hawking Zhang 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
index 2c1fbed24535..0f1ca0136f50 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
@@ -56,7 +56,8 @@ void amdgpu_virt_init_setting(struct amdgpu_device *adev)
 
/* enable virtual display */
if (adev->asic_type != CHIP_ALDEBARAN &&
-   adev->asic_type != CHIP_ARCTURUS) {
+   adev->asic_type != CHIP_ARCTURUS &&
+   ((adev->pdev->class >> 8) != AMD_ACCELERATOR_PROCESSING)) {
if (adev->mode_info.num_crtc == 0)
adev->mode_info.num_crtc = 1;
adev->enable_virtual_display = true;
-- 
2.34.1
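
Editor's note: the new condition compares the upper bits of the PCI class word against a class code. Below is a small stand-alone sketch of that comparison. The sketch_ names are hypothetical, and the value 0x1200 is an assumption (PCI base class 0x12, processing accelerator, sub-class 0x00); the patch itself only shows the symbolic name AMD_ACCELERATOR_PROCESSING.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed value: base class 0x12 (processing accelerator), sub-class 0x00. */
#define SKETCH_ACCELERATOR_PROCESSING 0x1200u

/*
 * The PCI class word carries base class, sub-class and programming
 * interface in its low 24 bits. Shifting right by 8 drops the
 * programming interface byte and leaves base class plus sub-class.
 */
static bool sketch_is_accelerator(uint32_t pci_class)
{
	return (pci_class >> 8) == SKETCH_ACCELERATOR_PROCESSING;
}

/* Mirrors the shape of the patched condition with plain booleans. */
static bool sketch_wants_virtual_display(bool is_aldebaran, bool is_arcturus,
					 uint32_t pci_class)
{
	return !is_aldebaran && !is_arcturus &&
	       !sketch_is_accelerator(pci_class);
}
```

A device reporting class word 0x120000 would be skipped, while a VGA-class device (0x030000) would still get a virtual CRTC under this sketch.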



Re: [Intel-gfx] [PATCH v2 1/2] vgaarb: various coding style and comments fix

2023-06-05 Thread Sui Jingfeng

Hi,

On 2023/6/6 06:16, Andi Shyti wrote:

Hi Sui,

On Mon, Jun 05, 2023 at 04:58:30AM +0800, Sui Jingfeng wrote:

From: Sui Jingfeng 

To keep consistent with vga_iostate_to_str() function, the third argument
of vga_str_to_iostate() function should be 'unsigned int *'.

I think the real reason is not to keep consistent with
vga_iostate_to_str() but because vga_str_to_iostate() is actually
only taking "unsigned int *" parameters.


Yes, right.

My wording was not completely correct; I will update it in the next version.


I think we have the same opinion. That is what I originally wanted to express,

because it makes no sense to interpret the returned value

(VGA_RSRC_LEGACY_IO | VGA_RSRC_LEGACY_MEM) as an int.


The IO state should be denoted by an unsigned type.

vga_iostate_to_str() also takes an unsigned type:

static const char *vga_iostate_to_str(unsigned int iostate)


Signed-off-by: Sui Jingfeng 
---
  drivers/pci/vgaarb.c   | 29 +++--
  include/linux/vgaarb.h |  8 +++-
  2 files changed, 18 insertions(+), 19 deletions(-)

diff --git a/drivers/pci/vgaarb.c b/drivers/pci/vgaarb.c
index 5a696078b382..e40e6e5e5f03 100644
--- a/drivers/pci/vgaarb.c
+++ b/drivers/pci/vgaarb.c
@@ -61,7 +61,6 @@ static bool vga_arbiter_used;
  static DEFINE_SPINLOCK(vga_lock);
  static DECLARE_WAIT_QUEUE_HEAD(vga_wait_queue);
  
-

drop this change


OK,

This is a double blank line.

Originally, I intended to accumulate all the tiny fixes and commit them

together, as they are trivial.

Should I split this patch now?

The patch set would then contain two trivial patches.


  static const char *vga_iostate_to_str(unsigned int iostate)
  {
/* Ignore VGA_RSRC_IO and VGA_RSRC_MEM */
@@ -77,10 +76,12 @@ static const char *vga_iostate_to_str(unsigned int iostate)
return "none";
  }
  
-static int vga_str_to_iostate(char *buf, int str_size, int *io_state)

+static int vga_str_to_iostate(char *buf, int str_size, unsigned int *io_state)

this is OK, it's actually what you are describing in the commit
log, but...


  {
-   /* we could in theory hand out locks on IO and mem
-* separately to userspace but it can cause deadlocks */
+   /*
+* we could in theory hand out locks on IO and mem
+* separately to userspace but it can cause deadlocks
+*/

... all the rest needs to go on different patches as it doesn't
have anything to do with what you describe.


OK,

I will wait a few days for more reviews

and then process them together, which also avoids the version growing too fast.

Thanks.


Andi


--
Jingfeng
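
Editor's note: the point both sides agree on is that the parsed IO state is a bitmask, so the out-parameter is naturally unsigned. The following is a simplified user-space sketch of such a parser, not the kernel function. The sketch_ names are hypothetical, the flag values are assumed to match include/linux/vgaarb.h, and the str_size argument of the real function is left out.

```c
#include <string.h>

/* Flag values assumed to match include/linux/vgaarb.h. */
#define SKETCH_VGA_RSRC_NONE       0x00u
#define SKETCH_VGA_RSRC_LEGACY_IO  0x01u
#define SKETCH_VGA_RSRC_LEGACY_MEM 0x02u

/*
 * Parses "none", "io", "mem" or "io+mem" into a bitmask.
 * Returns 1 on success and 0 if the prefix is not recognised.
 * IO and memory are always handed out together, as the comment
 * in the patch explains (separate locks could cause deadlocks).
 */
static int sketch_str_to_iostate(const char *buf, unsigned int *io_state)
{
	if (strncmp(buf, "none", 4) == 0) {
		*io_state = SKETCH_VGA_RSRC_NONE;
		return 1;
	}

	if (strncmp(buf, "io+mem", 6) == 0 ||
	    strncmp(buf, "io", 2) == 0 ||
	    strncmp(buf, "mem", 3) == 0) {
		*io_state = SKETCH_VGA_RSRC_LEGACY_IO |
			    SKETCH_VGA_RSRC_LEGACY_MEM;
		return 1;
	}

	return 0;
}
```

Because the result is only ever combined with bitwise operators, an unsigned out-parameter avoids any sign-related surprises at the call sites.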



RE: [PATCH 1/2] drm/amdgpu: make sure BOs are locked in amdgpu_vm_get_memory

2023-06-05 Thread Chen, Guchun
[Public]

Acked-by: Guchun Chen  for this series.

A simple question: do we not need to hold the lock if the BO locations are
not changing?

Regards,
Guchun

> -----Original Message-----
> From: Christian König 
> Sent: Monday, June 5, 2023 5:11 PM
> To: amd-gfx@lists.freedesktop.org; mikhail.v.gavri...@gmail.com; Chen,
> Guchun 
> Subject: [PATCH 1/2] drm/amdgpu: make sure BOs are locked in
> amdgpu_vm_get_memory
>
> We need to grab the lock of the BO or otherwise can run into a crash when
> we try to inspect the current location.
>
> Signed-off-by: Christian König 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 69 +++-
> --
>  1 file changed, 39 insertions(+), 30 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> index 3c0310576b3b..2c8cafec48a4 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> @@ -920,42 +920,51 @@ int amdgpu_vm_update_range(struct
> amdgpu_device *adev, struct amdgpu_vm *vm,
>   return r;
>  }
>
> +static void amdgpu_vm_bo_get_memory(struct amdgpu_bo_va *bo_va,
> + struct amdgpu_mem_stats *stats) {
> + struct amdgpu_vm *vm = bo_va->base.vm;
> + struct amdgpu_bo *bo = bo_va->base.bo;
> +
> + if (!bo)
> + return;
> +
> + /*
> +  * For now ignore BOs which are currently locked and potentially
> +  * changing their location.
> +  */
> + if (bo->tbo.base.resv != vm->root.bo->tbo.base.resv &&
> + !dma_resv_trylock(bo->tbo.base.resv))
> + return;
> +
> + amdgpu_bo_get_memory(bo, stats);
> + if (bo->tbo.base.resv != vm->root.bo->tbo.base.resv)
> + dma_resv_unlock(bo->tbo.base.resv);
> +}
> +
>  void amdgpu_vm_get_memory(struct amdgpu_vm *vm,
> struct amdgpu_mem_stats *stats)
>  {
>   struct amdgpu_bo_va *bo_va, *tmp;
>
>   spin_lock(&vm->status_lock);
> - list_for_each_entry_safe(bo_va, tmp, &vm->idle, base.vm_status) {
> - if (!bo_va->base.bo)
> - continue;
> - amdgpu_bo_get_memory(bo_va->base.bo, stats);
> - }
> - list_for_each_entry_safe(bo_va, tmp, &vm->evicted, base.vm_status)
> {
> - if (!bo_va->base.bo)
> - continue;
> - amdgpu_bo_get_memory(bo_va->base.bo, stats);
> - }
> - list_for_each_entry_safe(bo_va, tmp, &vm->relocated,
> base.vm_status) {
> - if (!bo_va->base.bo)
> - continue;
> - amdgpu_bo_get_memory(bo_va->base.bo, stats);
> - }
> - list_for_each_entry_safe(bo_va, tmp, &vm->moved, base.vm_status)
> {
> - if (!bo_va->base.bo)
> - continue;
> - amdgpu_bo_get_memory(bo_va->base.bo, stats);
> - }
> - list_for_each_entry_safe(bo_va, tmp, &vm->invalidated,
> base.vm_status) {
> - if (!bo_va->base.bo)
> - continue;
> - amdgpu_bo_get_memory(bo_va->base.bo, stats);
> - }
> - list_for_each_entry_safe(bo_va, tmp, &vm->done, base.vm_status) {
> - if (!bo_va->base.bo)
> - continue;
> - amdgpu_bo_get_memory(bo_va->base.bo, stats);
> - }
> + list_for_each_entry_safe(bo_va, tmp, &vm->idle, base.vm_status)
> + amdgpu_vm_bo_get_memory(bo_va, stats);
> +
> + list_for_each_entry_safe(bo_va, tmp, &vm->evicted, base.vm_status)
> + amdgpu_vm_bo_get_memory(bo_va, stats);
> +
> + list_for_each_entry_safe(bo_va, tmp, &vm->relocated,
> base.vm_status)
> + amdgpu_vm_bo_get_memory(bo_va, stats);
> +
> + list_for_each_entry_safe(bo_va, tmp, &vm->moved, base.vm_status)
> + amdgpu_vm_bo_get_memory(bo_va, stats);
> +
> + list_for_each_entry_safe(bo_va, tmp, &vm->invalidated,
> base.vm_status)
> + amdgpu_vm_bo_get_memory(bo_va, stats);
> +
> + list_for_each_entry_safe(bo_va, tmp, &vm->done, base.vm_status)
> + amdgpu_vm_bo_get_memory(bo_va, stats);
>   spin_unlock(&vm->status_lock);
>  }
>
> --
> 2.34.1
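
Editor's note: the helper in the quoted patch follows a "trylock or skip" pattern: objects that share the container's lock are read directly, objects with their own lock are only read if that lock can be taken without waiting. Below is a minimal user-space sketch of that pattern using a C11 atomic_flag as a stand-in for the reservation lock. Every sketch_ name is hypothetical, and nothing here reflects the real dma_resv API.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Tiny try-lock built on atomic_flag; stands in for a reservation lock. */
struct sketch_lock {
	atomic_flag held;
};

static bool sketch_trylock(struct sketch_lock *l)
{
	/* test_and_set returns the previous state: false means we got it. */
	return !atomic_flag_test_and_set(&l->held);
}

static void sketch_unlock(struct sketch_lock *l)
{
	atomic_flag_clear(&l->held);
}

struct sketch_obj {
	struct sketch_lock *lock; /* may be the same as the root lock */
	size_t size;
};

/*
 * Adds obj->size to *total. The caller is assumed to hold root_lock.
 * Objects with their own lock are skipped when that lock is busy,
 * since they may be changing location at that moment.
 */
static void sketch_account(struct sketch_obj *obj,
			   struct sketch_lock *root_lock, size_t *total)
{
	bool own_lock = obj->lock != root_lock;

	if (own_lock && !sketch_trylock(obj->lock))
		return;

	*total += obj->size;

	if (own_lock)
		sketch_unlock(obj->lock);
}
```

The trade-off is that a busy object is simply not counted, so the statistics are a best-effort snapshot rather than an exact total.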



Re: [Intel-gfx] [PATCH v2 1/2] vgaarb: various coding style and comments fix

2023-06-05 Thread Andi Shyti
Hi Sui,

On Mon, Jun 05, 2023 at 04:58:30AM +0800, Sui Jingfeng wrote:
> From: Sui Jingfeng 
> 
> To keep consistent with vga_iostate_to_str() function, the third argument
> of vga_str_to_iostate() function should be 'unsigned int *'.

I think the real reason is not to keep consistent with
vga_iostate_to_str() but because vga_str_to_iostate() is actually
only taking "unsigned int *" parameters.

> Signed-off-by: Sui Jingfeng 
> ---
>  drivers/pci/vgaarb.c   | 29 +++--
>  include/linux/vgaarb.h |  8 +++-
>  2 files changed, 18 insertions(+), 19 deletions(-)
> 
> diff --git a/drivers/pci/vgaarb.c b/drivers/pci/vgaarb.c
> index 5a696078b382..e40e6e5e5f03 100644
> --- a/drivers/pci/vgaarb.c
> +++ b/drivers/pci/vgaarb.c
> @@ -61,7 +61,6 @@ static bool vga_arbiter_used;
>  static DEFINE_SPINLOCK(vga_lock);
>  static DECLARE_WAIT_QUEUE_HEAD(vga_wait_queue);
>  
> -

drop this change

>  static const char *vga_iostate_to_str(unsigned int iostate)
>  {
>   /* Ignore VGA_RSRC_IO and VGA_RSRC_MEM */
> @@ -77,10 +76,12 @@ static const char *vga_iostate_to_str(unsigned int 
> iostate)
>   return "none";
>  }
>  
> -static int vga_str_to_iostate(char *buf, int str_size, int *io_state)
> +static int vga_str_to_iostate(char *buf, int str_size, unsigned int 
> *io_state)

this is OK, it's actually what you are describing in the commit
log, but...

>  {
> - /* we could in theory hand out locks on IO and mem
> -  * separately to userspace but it can cause deadlocks */
> + /*
> +  * we could in theory hand out locks on IO and mem
> +  * separately to userspace but it can cause deadlocks
> +  */

... all the rest needs to go on different patches as it doesn't
have anything to do with what you describe.

Andi


RE: [PATCH] drm/amd: Check that a system is a NUMA system before looking for SRAT

2023-06-05 Thread Limonciello, Mario
[Public]

> On 2023-06-02 08:18, Mario Limonciello wrote:
> > It's pointless on laptops to look for the SRAT table as these are not
> > NUMA.  Check the number of possible nodes is > 1 to decide whether to
> > look for SRAT.
> >
> > Suggested-by: Felix Kuehling 
> > Signed-off-by: Mario Limonciello 
>
> I think we discussed this a while ago and I don't remember the exact
> issue that this was meant to fix. Was it just to get rid of an irritating
> warning in the kernel log? Anyway, the patch looks good to me.

Yeah I forgot all about sending out the fix until I noticed it again recently.

>
> Reviewed-by: Felix Kuehling 

Thanks!

>
>
> > ---
> >   drivers/gpu/drm/amd/amdkfd/kfd_crat.c | 3 ++-
> >   1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
> b/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
> > index 950af6820153..3dcd8f8bc98e 100644
> > --- a/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
> > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
> > @@ -2041,7 +2041,8 @@ static int kfd_fill_gpu_direct_io_link_to_cpu(int
> *avail_size,
> > sub_type_hdr->proximity_domain_from = proximity_domain;
> >
> >   #ifdef CONFIG_ACPI_NUMA
> > -   if (kdev->adev->pdev->dev.numa_node == NUMA_NO_NODE)
> > +   if (kdev->adev->pdev->dev.numa_node == NUMA_NO_NODE &&
> > +   num_possible_nodes() > 1)
> > kfd_find_numa_node_in_srat(kdev);
> >   #endif
> >   #ifdef CONFIG_NUMA


Re: [PATCH] drm/amd: Check that a system is a NUMA system before looking for SRAT

2023-06-05 Thread Felix Kuehling

On 2023-06-02 08:18, Mario Limonciello wrote:

It's pointless on laptops to look for the SRAT table as these are not
NUMA.  Check the number of possible nodes is > 1 to decide whether to
look for SRAT.

Suggested-by: Felix Kuehling 
Signed-off-by: Mario Limonciello 


I think we discussed this a while ago and I don't remember the exact 
issue that this was meant to fix. Was it just to get rid of an irritating 
warning in the kernel log? Anyway, the patch looks good to me.


Reviewed-by: Felix Kuehling 



---
  drivers/gpu/drm/amd/amdkfd/kfd_crat.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_crat.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
index 950af6820153..3dcd8f8bc98e 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
@@ -2041,7 +2041,8 @@ static int kfd_fill_gpu_direct_io_link_to_cpu(int 
*avail_size,
sub_type_hdr->proximity_domain_from = proximity_domain;
  
  #ifdef CONFIG_ACPI_NUMA

-   if (kdev->adev->pdev->dev.numa_node == NUMA_NO_NODE)
+   if (kdev->adev->pdev->dev.numa_node == NUMA_NO_NODE &&
+   num_possible_nodes() > 1)
kfd_find_numa_node_in_srat(kdev);
  #endif
  #ifdef CONFIG_NUMA
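
Editor's note: the patched condition only falls back to the SRAT lookup when the device has no NUMA node assigned and the system can have more than one node. A stand-alone sketch of that decision follows. The sketch_ names are hypothetical, the node count is passed in as a plain integer instead of calling the kernel's num_possible_nodes(), and -1 is used for "no node", which is the value NUMA_NO_NODE has in the kernel.

```c
#include <stdbool.h>

#define SKETCH_NUMA_NO_NODE (-1) /* same value as the kernel's NUMA_NO_NODE */

/*
 * Returns true only when a SRAT lookup could add information:
 * the device has no node yet, and more than one node is possible.
 */
static bool sketch_should_search_srat(int dev_numa_node, int possible_nodes)
{
	return dev_numa_node == SKETCH_NUMA_NO_NODE && possible_nodes > 1;
}
```

On a single-node machine such as a typical laptop the function returns false, so the lookup and the kernel log message that comes with it are skipped.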


Re: PROBLEM: AMD Ryzen 9 7950X iGPU - Blinking Issue

2023-06-05 Thread Felix Richter
I will apply this patch and see if it fixes the issue for me. Will let you 
know when I am done.


Felix

On 05.06.23 16:11, Alex Deucher wrote:

On Sat, Jun 3, 2023 at 10:52 AM Felix Richter  wrote:

Hi Guys,

sorry for the silence from my side. I had a lot of things to take care
of after returning from vacation. Also I had to wait on the zfs modules
to be updated to support kernel 6.3 for further testing.

The bad news is that I am still experiencing issues. I have been able to
get a reproducible trigger for the buggy behavior. The moment I take a
screenshot or any other program like `wdisplays` accesses the screen
buffer the screen starts flickering. The only way to reset it is to
reboot the machine or log out of the desktop.

With this I did a bisection to figure out which commit is responsible
for this. I attached the logs to the mail. The short version is that I
identified commit 81d0bcf9900932633d270d5bc4a54ff599c6ebdb as the
culprit. Seems that there are side effects of having more flexible
buffer placement for the case of the internal GPU. To verify that this
actually is the cause of the issue I built the current archlinux kernel
with an extra patch to revert the commit:
https://github.com/ju6ge/linux/tree/v6.3.5-ju6ge. The result is that the
bug is fixed!

+ Hamza

This is a known issue.  You can work around it by setting
amdgpu.sg_display=0.  It should be fixed in:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=08da182175db4c7f80850354849d95f2670e8cd9

Alex




Now if this is the desired long term fix I do not know …

Kind regards,
Felix Richter

On 02.05.23 16:12, Linux regression tracking (Thorsten Leemhuis) wrote:

On 02.05.23 15:48, Felix Richter wrote:

On 5/2/23 15:34, Linux regression tracking (Thorsten Leemhuis) wrote:

On 02.05.23 15:13, Alex Deucher wrote:

On Tue, May 2, 2023 at 7:45 AM Linux regression tracking (Thorsten
Leemhuis)  wrote:


On 30.04.23 13:44, Felix Richter wrote:

Hi,

I am running into an issue with the integrated GPU of the Ryzen 9
7950X. It seems to be a regression from kernel version 6.1 to 6.2.
The bug manifests in the form of my monitor blinking, meaning it
turns fully white for a moment. This happens very often so that the
system becomes unpleasant to use.

I am running the Archlinux Kernel:
The Issue happens on the bleeding edge kernel: 6.2.13
Switching back to the LTS kernel resolves the issue: 6.1.26

I have two monitors attached to the system. One 42 inch 4k Display
and a 24 inch 1080p Display and am running sway as my desktop.

Let me know if there is more information I could provide to help
narrow down the issue.

Thanks for the report. To be sure the issue doesn't fall through the
cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
tracking bot:

#regzbot ^introduced v6.1..v6.2
#regzbot title drm: amdgpu: system becomes unpleasant to use after
monitor starts blinking and turns full white
#regzbot ignore-activity

This isn't a regression? This issue or a fix for it are already
discussed somewhere else? It was fixed already? You want to clarify
when
the regression started to happen? Or point out I got the title or
something else totally wrong? Then just reply and tell me -- ideally
while also telling regzbot about it, as explained by the page listed in
the footer of this mail.

Developers: When fixing the issue, remember to add 'Link:' tags
pointing
to the report (the parent of this mail). See page linked in footer for
details.

This sounds exactly like the issue that was fixed in this patch which
is already on its way to Linus:
https://gitlab.freedesktop.org/agd5f/linux/-/commit/08da182175db4c7f80850354849d95f2670e8cd9

FWIW, you in the flood of emails likely missed that this is the same
thread where you yesterday replied "If the module parameter didn't help
then perhaps you are seeing some other issue.  Can you bisect?". That's
why I decided to add this to the tracking. Or am I missing something
obvious here?

/me looks around again and can't see anything, but that doesn't have to
mean anything...

Felix, btw, this guide might help you with the bisection, even if it's
just for kernel compilation:

https://docs.kernel.org/next/admin-guide/quickly-build-trimmed-linux.html

And to indirectly reply to your mail from yesterday[1]. You might want
to ignore the arch linux kernel git repo and just do a bisection between
6.1 and the latest 6.2.y kernel using upstream repos; and if I were you
I'd also try 6.3 or even mainline before that, in case the issue was
fixed already.

[1]
https://lore.kernel.org/all/04749ee4-0728-92fe-bcb0-a7320279e...@felixrichter.tech/


Thanks for the pointers, I'll do a bisection on my desktop from 6.1 to
the newest commit.

FWIW, I wonder what you actually mean with "newest commit" here: a
bisection between 6.1 and mainline HEAD might be a waste of time, *if*
this is something that only happens in 6.2.y (say due to a broken or
incomplete backport)


That was the part I was mostly

Re: PROBLEM: AMD Ryzen 9 7950X iGPU - Blinking Issue

2023-06-05 Thread Felix Richter

Hi,

I can confirm that setting amdgpu.sg_display=0 does not fix the issue 
for me.


I have 64GB of Kingston memory running with XMP at 5200MHz. I attached 
the result of `dmidecode --type=memory` to this email.


Kind regards
Felix Richter

On 05.06.23 17:27, Hamza Mahfooz wrote:


On 6/3/23 10:52, Felix Richter wrote:

Hi Guys,

sorry for the silence from my side. I had a lot of things to take 
care of after returning from vacation. Also I had to wait on the zfs 
modules to be updated to support kernel 6.3 for further testing.


The bad news is that I am still experiencing issues. I have been able 
to get a reproducible trigger for the buggy behavior. The moment I 
take a screenshot or any other program like `wdisplays` accesses the 
screen buffer the screen starts flickering. The only way to reset it 
is to reboot the machine or log out of the desktop.


With this I did a bisection to figure out which commit is responsible 
for this. I attached the logs to the mail. The short version is that 
I identified commit 81d0bcf9900932633d270d5bc4a54ff599c6ebdb as the 
culprit. Seems that there are side effects of having more flexible 
buffer placement for the case of the internal GPU. To verify that 
this actually is the cause of the issue I built the current archlinux 
kernel with an extra patch to revert the commit: 
https://github.com/ju6ge/linux/tree/v6.3.5-ju6ge. The result is that 
the bug is fixed!


Now if this is the desired long term fix I do not know …


Can you provide a dmidecode of your RAM (i.e. # dmidecode --type=memory)?

The current trend seems to suggest that if you have 64 or more gigs of
RAM, you will probably still experience issues with S/G mode enabled
even with my fix applied.



Kind regards,
Felix Richter


[PATCH 2/2] drm/amd/display: mark dml314's UseMinimumDCFCLK() as noinline_for_stack

2023-06-05 Thread Hamza Mahfooz
clang reports:
drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn314/display_mode_vba_314.c:3892:6:
 error: stack frame size (2632) exceeds limit (2048) in 
'dml314_ModeSupportAndSystemConfigurationFull' [-Werror,-Wframe-larger-than]
 3892 | void dml314_ModeSupportAndSystemConfigurationFull(struct 
display_mode_lib *mode_lib)
  |  ^
1 error generated.

So, since UseMinimumDCFCLK() consumes a lot of stack space, mark it as
noinline_for_stack to prevent it from blowing up
dml314_ModeSupportAndSystemConfigurationFull()'s stack size.

Signed-off-by: Hamza Mahfooz 
---
 .../gpu/drm/amd/display/dc/dml/dcn314/display_mode_vba_314.c| 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/display/dc/dml/dcn314/display_mode_vba_314.c 
b/drivers/gpu/drm/amd/display/dc/dml/dcn314/display_mode_vba_314.c
index 27b83162ae45..1532a7e0ed6c 100644
--- a/drivers/gpu/drm/amd/display/dc/dml/dcn314/display_mode_vba_314.c
+++ b/drivers/gpu/drm/amd/display/dc/dml/dcn314/display_mode_vba_314.c
@@ -7061,7 +7061,7 @@ static double CalculateUrgentLatency(
return ret;
 }
 
-static void UseMinimumDCFCLK(
+static noinline_for_stack void UseMinimumDCFCLK(
struct display_mode_lib *mode_lib,
int MaxPrefetchMode,
int ReorderingBytes)
-- 
2.40.1



[PATCH 1/2] drm/amd/display: mark dml31's UseMinimumDCFCLK() as noinline_for_stack

2023-06-05 Thread Hamza Mahfooz
clang reports:
drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn31/display_mode_vba_31.c:3797:6:
 error: stack frame size (2632) exceeds limit (2048) in 
'dml31_ModeSupportAndSystemConfigurationFull' [-Werror,-Wframe-larger-than]
 3797 | void dml31_ModeSupportAndSystemConfigurationFull(struct 
display_mode_lib *mode_lib)
  |  ^
1 error generated.

So, since UseMinimumDCFCLK() consumes a lot of stack space, mark it as
noinline_for_stack to prevent it from blowing up
dml31_ModeSupportAndSystemConfigurationFull()'s stack size.

Signed-off-by: Hamza Mahfooz 
---
 drivers/gpu/drm/amd/display/dc/dml/dcn31/display_mode_vba_31.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/display/dc/dml/dcn31/display_mode_vba_31.c 
b/drivers/gpu/drm/amd/display/dc/dml/dcn31/display_mode_vba_31.c
index 01603abd75bb..43016c462251 100644
--- a/drivers/gpu/drm/amd/display/dc/dml/dcn31/display_mode_vba_31.c
+++ b/drivers/gpu/drm/amd/display/dc/dml/dcn31/display_mode_vba_31.c
@@ -7032,7 +7032,7 @@ static double CalculateUrgentLatency(
return ret;
 }
 
-static void UseMinimumDCFCLK(
+static noinline_for_stack void UseMinimumDCFCLK(
struct display_mode_lib *mode_lib,
int MaxPrefetchMode,
int ReorderingBytes)
-- 
2.40.1
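
Editor's note: both patches in this series rely on the same idea. When a helper with large locals is inlined, its frame is merged into the caller's frame; keeping it out of line means that space is only in use while the helper runs. The sketch below illustrates the annotation in user-space C. The sketch_ names are hypothetical, and the macro assumes a GCC or Clang compatible compiler; it is meant to mirror the kernel's noinline_for_stack, which to my knowledge expands to the compiler's noinline attribute.

```c
#include <string.h>

/* Assumption: GCC/Clang attribute syntax, mirroring noinline_for_stack. */
#define sketch_noinline_for_stack __attribute__((__noinline__))

/*
 * Helper with a large local buffer. Because it is never inlined,
 * its 1 KiB scratch area does not count towards the caller's frame.
 */
static sketch_noinline_for_stack unsigned int sketch_big_helper(unsigned char seed)
{
	unsigned char scratch[1024];
	unsigned int sum = 0;
	size_t i;

	memset(scratch, seed, sizeof(scratch));
	for (i = 0; i < sizeof(scratch); i++)
		sum += scratch[i];

	return sum;
}

/* The caller's own frame stays small even though it calls the helper twice. */
static unsigned int sketch_caller(void)
{
	return sketch_big_helper(1) + sketch_big_helper(2);
}
```

The cost is one real function call per use, which is usually negligible next to a build failure from -Wframe-larger-than.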



Re: drm/amd: Drop messages in init for radeon, amdgpu

2023-06-05 Thread Limonciello, Mario



On 6/5/2023 9:28 AM, Alex Deucher wrote:

Since there is overlap in supported devices, both
modules load, but only one will bind to a particular
device depending on the user's configuration.  Drop
the message in the module init function as this can
be confusing to users.

Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2608
Signed-off-by: Alex Deucher 

Reviewed-by: Mario Limonciello 

---
  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 1 -
  drivers/gpu/drm/radeon/radeon_drv.c | 1 -
  2 files changed, 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 7eda4f039224..94509b76fa6c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -3065,7 +3065,6 @@ static int __init amdgpu_init(void)
if (r)
goto error_fence;
  
-	DRM_INFO("amdgpu kernel modesetting enabled.\n");

amdgpu_register_atpx_handler();
amdgpu_acpi_detect();
  
diff --git a/drivers/gpu/drm/radeon/radeon_drv.c b/drivers/gpu/drm/radeon/radeon_drv.c

index e4374814f0ef..16b9eab90185 100644
--- a/drivers/gpu/drm/radeon/radeon_drv.c
+++ b/drivers/gpu/drm/radeon/radeon_drv.c
@@ -634,7 +634,6 @@ static int __init radeon_module_init(void)
if (radeon_modeset == 0)
return -EINVAL;
  
-	DRM_INFO("radeon kernel modesetting enabled.\n");

radeon_register_atpx_handler();
  
  	return pci_register_driver(&radeon_kms_pci_driver);


Re: [PATCH] drm/amdkfd: mark some clear_address_watch() callback static

2023-06-05 Thread Alex Deucher
On Mon, Jun 5, 2023 at 6:58 AM Arnd Bergmann  wrote:
>
> From: Arnd Bergmann 
>
> Some of the newly introduced clear_address_watch callback handlers have
> no prototype because they are only used in one file, which causes a W=1
> warning:
>
> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c:164:10: error: no 
> previous prototype for 'kgd_gfx_aldebaran_clear_address_watch' 
> [-Werror=missing-prototypes]
> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v11.c:782:10: error: no previous 
> prototype for 'kgd_gfx_v11_clear_address_watch' [-Werror=missing-prototypes]
>
> Mark these ones static. If another user comes up in the future, that
> can be reverted along with adding the prototype.
>
> Fixes: cfd9715f741a1 ("drm/amdkfd: add debug set and clear address watch 
> points operation")
> Signed-off-by: Arnd Bergmann 

Thanks.  Srinivasan already sent out a fix for this.

Alex


> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c | 2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v11.c   | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c
> index efd6a72aab4eb..bdda8744398fe 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c
> @@ -161,7 +161,7 @@ static uint32_t kgd_gfx_aldebaran_set_address_watch(
> return watch_address_cntl;
>  }
>
> -uint32_t kgd_gfx_aldebaran_clear_address_watch(struct amdgpu_device *adev,
> +static uint32_t kgd_gfx_aldebaran_clear_address_watch(struct amdgpu_device 
> *adev,
> uint32_t watch_id)
>  {
> return 0;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v11.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v11.c
> index 52efa690a3c21..131859ce3e7e9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v11.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v11.c
> @@ -779,7 +779,7 @@ static uint32_t kgd_gfx_v11_set_address_watch(struct 
> amdgpu_device *adev,
> return watch_address_cntl;
>  }
>
> -uint32_t kgd_gfx_v11_clear_address_watch(struct amdgpu_device *adev,
> +static uint32_t kgd_gfx_v11_clear_address_watch(struct amdgpu_device *adev,
> uint32_t watch_id)
>  {
> return 0;
> --
> 2.39.2
>
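Note for readers of this archive: the warning quoted above fires when a function with external linkage is defined without a prior declaration. The sketch below shows the two usual ways to satisfy it. The function names are hypothetical and are not taken from the driver; only the `static` approach is what the quoted patch does.

```c
/* Build check (assumed invocation): cc -Wmissing-prototypes -Werror -c example.c */
#include <stdint.h>

/* Option 1: the function is only used inside this file, so give it
 * internal linkage. A static function needs no separate prototype. */
static uint32_t clear_watch_local(uint32_t watch_id)
{
	(void)watch_id;
	return 0;
}

/* Option 2: the function is meant to be called from other files, so a
 * prototype (normally kept in a shared header) precedes the definition. */
uint32_t clear_watch_shared(uint32_t watch_id);

uint32_t clear_watch_shared(uint32_t watch_id)
{
	/* Using the static helper here also avoids an unused-function warning. */
	return clear_watch_local(watch_id);
}
```

Option 1 keeps the symbol out of the global namespace, which is why it is the smaller change when there is only one user.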


Re: [PATCH] drm/amdgpu: Report ras_num_recs in debugfs

2023-06-05 Thread Alex Deucher
On Sat, Jun 3, 2023 at 1:11 AM Luben Tuikov  wrote:
>
> Report the number of records stored in the RAS EEPROM table in debugfs.
>
> This can be used by user-space to calculate the capacity of the RAS EEPROM
> table since "bad_page_cnt_threshold" is also reported in the same place in
> debugfs.
>
> See commit reference 7fb6407145479d (drm/amdgpu: Add bad_page_cnt_threshold to
> debugfs, 2021-04-13).
>
> ras_num_recs can already be inferred by dumping the RAS EEPROM table, also in
> the same debugfs location, see commit reference c65b0805e77919 (drm/amdgpu:
> RAS EEPROM table is now in debugfs, 2021-04-08). This commit makes it an
> integer value easily shown in a single file.
>
> Cc: Alex Deucher 
> Cc: Hawking Zhang 
> Cc: Tao Zhou 
> Cc: Stanley Yang 
> Cc: John Clements 
> Signed-off-by: Luben Tuikov 

Acked-by: Alex Deucher 

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index f2da69adcd9d48..68163890f9632d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -1487,6 +1487,7 @@ static int amdgpu_ras_sysfs_remove_all(struct 
> amdgpu_device *adev)
>  static struct dentry *amdgpu_ras_debugfs_create_ctrl_node(struct 
> amdgpu_device *adev)
>  {
> struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> +   struct amdgpu_ras_eeprom_control *eeprom = &con->eeprom_control;
> struct drm_minor  *minor = adev_to_drm(adev)->primary;
> struct dentry *dir;
>
> @@ -1497,6 +1498,7 @@ static struct dentry 
> *amdgpu_ras_debugfs_create_ctrl_node(struct amdgpu_device *
> &amdgpu_ras_debugfs_eeprom_ops);
> debugfs_create_u32("bad_page_cnt_threshold", 0444, dir,
>&con->bad_page_cnt_threshold);
> +   debugfs_create_u32("ras_num_recs", 0444, dir, &eeprom->ras_num_recs);
> debugfs_create_x32("ras_hw_enabled", 0444, dir, 
> &adev->ras_hw_enabled);
> debugfs_create_x32("ras_enabled", 0444, dir, &adev->ras_enabled);
> debugfs_create_file("ras_eeprom_size", S_IRUGO, dir, adev,
>
> base-commit: e82c20a8755677528a5e01e58b7763a42edf
> --
> 2.41.0
>
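Note for readers of this archive: the commit message says user space can combine "ras_num_recs" with "bad_page_cnt_threshold". A minimal sketch of such a reader follows. The helper names are hypothetical, and treating "threshold minus records" as the remaining headroom is an assumption about how the two values relate, not something the patch states.

```c
#include <stdint.h>
#include <stdio.h>

/* Read one unsigned decimal value from a debugfs-style text file.
 * Returns 0 on success, -1 if the file cannot be opened or parsed. */
static int read_u32_file(const char *path, uint32_t *out)
{
	unsigned int val;
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (fscanf(f, "%u", &val) != 1) {
		fclose(f);
		return -1;
	}
	fclose(f);
	*out = val;
	return 0;
}

/* Assumed interpretation: how many more records fit before the
 * bad page threshold is reached. Clamped at zero. */
static uint32_t ras_records_left(uint32_t threshold, uint32_t num_recs)
{
	return num_recs >= threshold ? 0 : threshold - num_recs;
}
```

On a real system the two paths would point into the device's RAS debugfs directory; the exact location depends on the DRM minor and on where debugfs is mounted, so it is left to the caller here.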


RE: WARNING: CPU: 5 PID: 1464 at drivers/gpu/drm/ttm/ttm_bo.c:326 ttm_bo_release+0x27e/0x2d0 [ttm]

2023-06-05 Thread Deucher, Alexander
[Public]

+ Christian

> -Original Message-
> From: Borislav Petkov 
> Sent: Saturday, June 3, 2023 1:48 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander ; dri-
> de...@lists.freedesktop.org; lkml 
> Subject: WARNING: CPU: 5 PID: 1464 at drivers/gpu/drm/ttm/ttm_bo.c:326
> ttm_bo_release+0x27e/0x2d0 [ttm]
>
> Hi,
>
> this below triggers with the latest Linus tree:
>
> 51f269a6ecc7 ("Merge tag 'probes-fixes-6.4-rc4' of
> git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace")
>
> ...
> [   16.173593] [drm] radeon kernel modesetting enabled.
> [   16.173743] radeon :29:00.0: vgaarb: deactivate vga console
> [   16.174300] MCE: In-kernel MCE decoding enabled.
> [   16.175695] EDAC DEBUG: umc_read_base_mask:   DCSB0[0]=0x0001
> reg: 0x5
> [   16.175698] EDAC DEBUG: umc_read_base_mask:
> DCSB_SEC0[0]=0x reg: 0x50010
> [   16.175700] EDAC DEBUG: umc_read_base_mask:   DCSB0[1]=0x
> reg: 0x50004
> [   16.175702] EDAC DEBUG: umc_read_base_mask:
> DCSB_SEC0[1]=0x reg: 0x50014
> [   16.175703] EDAC DEBUG: umc_read_base_mask:   DCSB0[2]=0x0201
> reg: 0x50008
> [   16.175705] EDAC DEBUG: umc_read_base_mask:
> DCSB_SEC0[2]=0x reg: 0x50018
> [   16.175706] EDAC DEBUG: umc_read_base_mask:   DCSB0[3]=0x
> reg: 0x5000c
> [   16.175707] EDAC DEBUG: umc_read_base_mask:
> DCSB_SEC0[3]=0x reg: 0x5001c
> [   16.175709] EDAC DEBUG: umc_read_base_mask:   DCSM0[0]=0x03fffdfe
> reg: 0x50020
> [   16.175710] EDAC DEBUG: umc_read_base_mask:
> DCSM_SEC0[0]=0x reg: 0x50028
> [   16.175712] EDAC DEBUG: umc_read_base_mask:   DCSM0[1]=0x03fffdfe
> reg: 0x50024
> [   16.175713] EDAC DEBUG: umc_read_base_mask:
> DCSM_SEC0[1]=0x reg: 0x5002c
> [   16.175715] EDAC DEBUG: umc_read_base_mask:   DCSB1[0]=0x0001
> reg: 0x15
> [   16.175716] EDAC DEBUG: umc_read_base_mask:
> DCSB_SEC1[0]=0x reg: 0x150010
> [   16.175718] EDAC DEBUG: umc_read_base_mask:   DCSB1[1]=0x
> reg: 0x150004
> [   16.175719] EDAC DEBUG: umc_read_base_mask:
> DCSB_SEC1[1]=0x reg: 0x150014
> [   16.175720] EDAC DEBUG: umc_read_base_mask:   DCSB1[2]=0x0201
> reg: 0x150008
> [   16.175722] EDAC DEBUG: umc_read_base_mask:
> DCSB_SEC1[2]=0x reg: 0x150018
> [   16.175723] EDAC DEBUG: umc_read_base_mask:   DCSB1[3]=0x
> reg: 0x15000c
> [   16.175725] EDAC DEBUG: umc_read_base_mask:
> DCSB_SEC1[3]=0x reg: 0x15001c
> [   16.175726] EDAC DEBUG: umc_read_base_mask:   DCSM1[0]=0x03fffdfe
> reg: 0x150020
> [   16.175728] EDAC DEBUG: umc_read_base_mask:
> DCSM_SEC1[0]=0x reg: 0x150028
> [   16.175729] EDAC DEBUG: umc_read_base_mask:   DCSM1[1]=0x03fffdfe
> reg: 0x150024
> [   16.175730] EDAC DEBUG: umc_read_base_mask:
> DCSM_SEC1[1]=0x reg: 0x15002c
> [   16.175741] EDAC DEBUG: umc_determine_memory_type:   UMC0 DIMM
> type: Unbuffered-DDR4
> [   16.175742] EDAC DEBUG: umc_determine_memory_type:   UMC1 DIMM
> type: Unbuffered-DDR4
> [   16.177514] Console: switching to colour dummy device 80x25
> [   16.177693] [drm] initializing kernel modesetting (CEDAR 0x1002:0x68E1
> 0x174B:0x3000 0x00).
> [   16.177733] ATOM BIOS: AMD
> [   16.177795] radeon :29:00.0: VRAM: 1024M 0x
> - 0x3FFF (1024M used)
> [   16.177798] radeon :29:00.0: GTT: 1024M 0x4000 -
> 0x7FFF
> [   16.177800] [drm] Detected VRAM RAM=1024M, BAR=256M
> [   16.177802] [drm] RAM width 64bits DDR
> [   16.177835] [drm] radeon: 1024M of VRAM memory ready
> [   16.177836] [drm] radeon: 1024M of GTT memory ready.
> [   16.177839] [drm] Loading CEDAR Microcode
> [   16.179106] [drm] Internal thermal controller without fan control
> [   16.199812] [drm] radeon: dpm initialized
> [   16.200179] [drm] GART: num cpu pages 262144, num gpu pages 262144
> [   16.200399] [drm] enabling PCIE gen 2 link speeds, disable with
> radeon.pcie_gen2=0
> [   16.218135] [drm] PCIE GART of 1024M enabled (table at
> 0x0014C000).
> [   16.218239] radeon :29:00.0: WB enabled
> [   16.218240] radeon :29:00.0: fence driver on ring 0 use gpu addr
> 0x4c00
> [   16.218242] radeon :29:00.0: fence driver on ring 3 use gpu addr
> 0x4c0c
> [   16.218606] radeon :29:00.0: fence driver on ring 5 use gpu addr
> 0x0005c418
> [   16.218657] radeon :29:00.0: radeon: MSI limited to 32-bit
> [   16.218689] radeon :29:00.0: radeon: using MSI.
> [   16.218707] [drm] radeon: irq initialized.
> [   16.234730] [drm] ring test on 0 succeeded in 0 usecs
> [   16.234738] [drm] ring test on 3 succeeded in 2 usecs
> [   16.317725] r8169 :25:00.0 eth0: Link is Down
> [   16.410486] [drm] ring test on 5 succeeded in 1 usecs
> [   16.410492] [drm] UVD initialized successfully.
> [   16.410555] [drm] ib test on ring 0 succeeded in 0 usecs
> [   16.410596] [drm] ib test on ring 3 succeeded in 0 usecs
> [   17.077422] [drm] ib test on ring 5 succeeded
> [   17.077581] [drm] Radeo

Re: [PATCH] drm/amdgpu: fix xclk freq on CHIP_STONEY

2023-06-05 Thread Alex Deucher
Applied.  Thanks!

On Fri, Jun 2, 2023 at 11:13 PM Chia-I Wu  wrote:
>
> On Fri, Jun 2, 2023 at 11:50 AM Alex Deucher  wrote:
> >
> > Nevermind, missing your Signed-off-by.  Please add and I'll apply.
> Sorry that I keep forgetting...  This patch is
>
>   Signed-off-by: Chia-I Wu 
>
> I can send v2 if necessary.
> >
> > Alex
> >


Re: [PATCH 2/4] drm/amdkfd: Signal page table fence after KFD flush tlb

2023-06-05 Thread Christian König

On 05.06.23 at 17:40, Shashank Sharma wrote:


On 05/06/2023 17:18, Christian König wrote:

On 05.06.23 at 17:13, Shashank Sharma wrote:


On 02/06/2023 16:54, Felix Kuehling wrote:

On 2023-06-02 at 07:57, Christian König wrote:

On 01.06.23 at 21:31, Philip Yang wrote:

To free page table BOs which are freed when updating page table, for
example PTE BOs when PDE0 used as PTE.

Signed-off-by: Philip Yang 
---
  drivers/gpu/drm/amd/amdkfd/kfd_process.c | 5 +
  1 file changed, 5 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_process.c

index af0a4b5257cc..0ff007a74d03 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
@@ -2101,6 +2101,11 @@ void kfd_flush_tlb(struct 
kfd_process_device *pdd, enum TLB_FLUSH_TYPE type)

  amdgpu_amdkfd_flush_gpu_tlb_pasid(
  dev->adev, pdd->process->pasid, type, xcc);
  }
+
+    /* Signal page table fence to free page table BOs */
+    dma_fence_signal(vm->pt_fence);


That's not something you can do here.

Signaling a fence can never depend on anything except for hardware 
work. In other words dma_fence_signal() is supposed to be called 
only from interrupt context!


We are signaling eviction fences from normal user context, too. 
There is no practical way to put this into an interrupt handler 
when the TLB flush is being done synchronously on a user thread. We 
have to do this in such a context for user mode queues.



We are currently working on adding a high-level kernel API 
which can be called directly to perform a TLB flush. Internally this 
API will add a deferred work item that uses the SDMA engine to perform 
the GPU TLB flush (to compensate for a HW bug in Navi chips). If my 
understanding is right, by interrupt context Christian means to 
perform this flush and signal from that deferred work, is that so 
@Christian?


Well more or less. Ideally you put the TLB flush in a work item (or 
use the SDMA for the hw bug workaround on Navi 1x).


The point is that you shouldn't have it in the same execution thread 
as the VM page table updates, because any memory allocation or 
grabbing a lock could potentially depend on the TLB flush as soon as 
you have published the dma_fence (by adding it to the VM page table 
BOs for example).


Would it work for everyone if we add this generic API (say 
amdgpu_flush_tlb_async()) which queues the TLB flush work and also calls 
dma_fence_signal() from within it? KFD can simply consume it from 
wherever they want. Do you see a race condition if we do it like this?


Yes, that's pretty much the whole idea. amdgpu_flush_tlb() should just 
return a dma_fence object.


This dma_fence object should either come from the SDMA workaround or be 
signaled from a work item.


We can then fence the BOs or just wait for the dma_fence object to signal.

Regards,
Christian.




- Shashank


Christian.



- Shashank



Regards,
  Felix




What we can do is to put the TLB flushing into an irq worker or 
work item and let the signaling happen from there.


Amar and Shashank are already working on this, I strongly suggest 
to sync up with them.


Regards,
Christian.


+    dma_fence_put(vm->pt_fence);
+    vm->pt_fence = amdgpu_pt_fence_create();
  }
    struct kfd_process_device 
*kfd_process_device_data_by_id(struct kfd_process *p, uint32_t 
gpu_id)
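Note for readers of this archive: the design discussed in this thread (the flush call hands back a fence, and the fence is signaled from a work item rather than by the caller) can be modeled in user space. The sketch below is only an analogy built on POSIX threads. `toy_fence`, `toy_flush_tlb_async` and the other names are hypothetical; none of this is the kernel dma_fence API or the eventual amdgpu implementation.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy stand-in for a fence: a flag guarded by a mutex and condvar. */
struct toy_fence {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	bool signaled;
	pthread_t worker;
};

/* Work item: performs the (simulated) flush, then signals the fence. */
static void *toy_flush_work(void *arg)
{
	struct toy_fence *f = arg;

	/* ... the actual flush would happen here ... */

	pthread_mutex_lock(&f->lock);
	f->signaled = true;
	pthread_cond_broadcast(&f->cond);
	pthread_mutex_unlock(&f->lock);
	return NULL;
}

/* Queue the flush and return a fence. The caller never signals it. */
static struct toy_fence *toy_flush_tlb_async(void)
{
	struct toy_fence *f = calloc(1, sizeof(*f));

	if (!f)
		return NULL;
	pthread_mutex_init(&f->lock, NULL);
	pthread_cond_init(&f->cond, NULL);
	if (pthread_create(&f->worker, NULL, toy_flush_work, f)) {
		pthread_cond_destroy(&f->cond);
		pthread_mutex_destroy(&f->lock);
		free(f);
		return NULL;
	}
	return f;
}

/* Wait for the signal, then release the fence. Returns the final state. */
static bool toy_fence_wait_and_put(struct toy_fence *f)
{
	bool done;

	pthread_mutex_lock(&f->lock);
	while (!f->signaled)
		pthread_cond_wait(&f->cond, &f->lock);
	done = f->signaled;
	pthread_mutex_unlock(&f->lock);

	pthread_join(f->worker, NULL);
	pthread_cond_destroy(&f->cond);
	pthread_mutex_destroy(&f->lock);
	free(f);
	return done;
}
```

The point of the split is that the thread which updates page tables only creates and waits on the fence. Signaling lives in separate work, so it cannot end up waiting on a lock or an allocation held by the thread that published the fence.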








Re: [PATCH 2/4] drm/amdkfd: Signal page table fence after KFD flush tlb

2023-06-05 Thread Shashank Sharma



On 05/06/2023 17:18, Christian König wrote:

On 05.06.23 at 17:13, Shashank Sharma wrote:


On 02/06/2023 16:54, Felix Kuehling wrote:

On 2023-06-02 at 07:57, Christian König wrote:

On 01.06.23 at 21:31, Philip Yang wrote:

To free page table BOs which are freed when updating page table, for
example PTE BOs when PDE0 used as PTE.

Signed-off-by: Philip Yang 
---
  drivers/gpu/drm/amd/amdkfd/kfd_process.c | 5 +
  1 file changed, 5 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_process.c

index af0a4b5257cc..0ff007a74d03 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
@@ -2101,6 +2101,11 @@ void kfd_flush_tlb(struct 
kfd_process_device *pdd, enum TLB_FLUSH_TYPE type)

  amdgpu_amdkfd_flush_gpu_tlb_pasid(
  dev->adev, pdd->process->pasid, type, xcc);
  }
+
+    /* Signal page table fence to free page table BOs */
+    dma_fence_signal(vm->pt_fence);


That's not something you can do here.

Signaling a fence can never depend on anything except for hardware 
work. In other words dma_fence_signal() is supposed to be called 
only from interrupt context!


We are signaling eviction fences from normal user context, too. 
There is no practical way to put this into an interrupt handler when 
the TLB flush is being done synchronously on a user thread. We have 
to do this in such a context for user mode queues.



We are currently working on adding a high-level kernel API 
which can be called directly to perform a TLB flush. Internally this 
API will add a deferred work item that uses the SDMA engine to perform 
the GPU TLB flush (to compensate for a HW bug in Navi chips). If my 
understanding is right, by interrupt context Christian means to 
perform this flush and signal from that deferred work, is that so 
@Christian?


Well more or less. Ideally you put the TLB flush in a work item (or 
use the SDMA for the hw bug workaround on Navi 1x).


The point is that you shouldn't have it in the same execution thread 
as the VM page table updates, because any memory allocation or 
grabbing a lock could potentially depend on the TLB flush as soon as 
you have published the dma_fence (by adding it to the VM page table 
BOs for example).


Would it work for everyone if we add this generic API (say 
amdgpu_flush_tlb_async()) which queues the TLB flush work and also calls 
dma_fence_signal() from within it? KFD can simply consume it from 
wherever they want. Do you see a race condition if we do it like this?


- Shashank


Christian.



- Shashank



Regards,
  Felix




What we can do is to put the TLB flushing into an irq worker or 
work item and let the signaling happen from there.


Amar and Shashank are already working on this, I strongly suggest 
to sync up with them.


Regards,
Christian.


+    dma_fence_put(vm->pt_fence);
+    vm->pt_fence = amdgpu_pt_fence_create();
  }
    struct kfd_process_device 
*kfd_process_device_data_by_id(struct kfd_process *p, uint32_t 
gpu_id)






[PATCH] drm/amdgpu: Log if device is unsupported

2023-06-05 Thread Paul Menzel
Since there is overlap in supported devices, both modules load, but only
one will bind to a particular device depending on the model and user's
configuration.

amdgpu binds to all display class devices with VID 0x1002 and then
determines whether or not to bind to a device based on whether the
individual device is supported by the driver or not. Log that case, so
users looking at the logs know what is going on.

Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2608
Signed-off-by: Paul Menzel 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 86fbb4138285..410ff918c350 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -2062,8 +2062,10 @@ static int amdgpu_pci_probe(struct pci_dev *pdev,
 
/* skip devices which are owned by radeon */
for (i = 0; i < ARRAY_SIZE(amdgpu_unsupported_pciidlist); i++) {
-   if (amdgpu_unsupported_pciidlist[i] == pdev->device)
+   if (amdgpu_unsupported_pciidlist[i] == pdev->device) {
+   DRM_INFO("This hardware is only supported by radeon.");
return -ENODEV;
+   }
}
 
if (amdgpu_aspm == -1 && !pcie_aspm_enabled(pdev))
-- 
2.40.1
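Note for readers of this archive: the probe-time check that this patch touches follows a simple pattern, sketched below in plain C. The device IDs and the names `toy_unsupported_ids` and `toy_probe` are placeholders for illustration; they are not the driver's real list or function.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder IDs for illustration only, not the driver's real list. */
static const uint16_t toy_unsupported_ids[] = { 0x1111, 0x2222 };

/* Returns 0 if probing should continue, -ENODEV to decline the device
 * so that another driver can bind to it. */
static int toy_probe(uint16_t device_id)
{
	size_t n = sizeof(toy_unsupported_ids) / sizeof(toy_unsupported_ids[0]);
	size_t i;

	for (i = 0; i < n; i++) {
		if (toy_unsupported_ids[i] == device_id)
			return -ENODEV;
	}
	return 0;
}
```

Returning -ENODEV from probe is a normal "not mine" answer rather than a failure, which is part of why logging at that point is debated in the reply that follows.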



RE: [PATCH] drm/amdgpu: Log if device is unsupported

2023-06-05 Thread Deucher, Alexander
[AMD Official Use Only - General]

> -Original Message-
> From: Paul Menzel 
> Sent: Monday, June 5, 2023 11:23 AM
> To: Deucher, Alexander ; Koenig, Christian
> ; Pan, Xinhui ; David
> Airlie ; Daniel Vetter 
> Cc: Paul Menzel ; amd-gfx@lists.freedesktop.org;
> dri-de...@lists.freedesktop.org; linux-ker...@vger.kernel.org
> Subject: [PATCH] drm/amdgpu: Log if device is unsupported
>
> Since there is overlap in supported devices, both modules load, but only one
> will bind to a particular device depending on the model and user's
> configuration.
>
> amdgpu binds to all display class devices with VID 0x1002 and then
> determines whether or not to bind to a device based on whether the
> individual device is supported by the driver or not. Log that case, so users
> looking at the logs know what is going on.
>
> Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2608
> Signed-off-by: Paul Menzel 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 86fbb4138285..410ff918c350 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -2062,8 +2062,10 @@ static int amdgpu_pci_probe(struct pci_dev
> *pdev,
>
>   /* skip devices which are owned by radeon */
>   for (i = 0; i < ARRAY_SIZE(amdgpu_unsupported_pciidlist); i++) {
> - if (amdgpu_unsupported_pciidlist[i] == pdev->device)
> + if (amdgpu_unsupported_pciidlist[i] == pdev->device) {
> + DRM_INFO("This hardware is only supported by
> radeon.");
>   return -ENODEV;

I think this will confuse users even more, as there will be a new "error" 
message reported.  I'd suggest either dropping the message in init per my 
proposed patch or just leaving things as is.

Alex

> + }
>   }
>
>   if (amdgpu_aspm == -1 && !pcie_aspm_enabled(pdev))
> --
> 2.40.1



Re: PROBLEM: AMD Ryzen 9 7950X iGPU - Blinking Issue

2023-06-05 Thread Hamza Mahfooz



On 6/3/23 10:52, Felix Richter wrote:

Hi Guys,

sorry for the silence from my side. I had a lot of things to take care 
of after returning from vacation. Also I had to wait on the zfs modules 
to be updated to support kernel 6.3 for further testing.


The bad news is that I am still experiencing issues. I have been able to 
get a reproducible trigger for the buggy behavior. The moment I take a 
screenshot or any other program like `wdisplays` accesses the screen 
buffer the screen starts flickering. The only way to reset it is to 
reboot the machine or log out of the desktop.


With this I did a bisection to figure out which commit is responsible 
for this. I attached the logs to the mail. The short version is that I 
identified commit 81d0bcf9900932633d270d5bc4a54ff599c6ebdb as the 
culprit. Seems that there are side effects of having more flexible 
buffer placement for the case of the internal GPU. To verify that this 
actually is the cause of the issue I built the current archlinux kernel 
with an extra patch to revert the commit: 
https://github.com/ju6ge/linux/tree/v6.3.5-ju6ge. The result is that the 
bug is fixed!


Now if this is the desired long term fix I do not know …


Can you provide a dmidecode of your RAM (i.e. # dmidecode --type=memory)?

The current trend seems to suggest that if you have 64 or more gigs of
RAM, you will probably still experience issues with S/G mode enabled
even with my fix applied.



Kind regards,
Felix Richter

On 02.05.23 16:12, Linux regression tracking (Thorsten Leemhuis) wrote:

On 02.05.23 15:48, Felix Richter wrote:

On 5/2/23 15:34, Linux regression tracking (Thorsten Leemhuis) wrote:

On 02.05.23 15:13, Alex Deucher wrote:

On Tue, May 2, 2023 at 7:45 AM Linux regression tracking (Thorsten
Leemhuis)  wrote:


On 30.04.23 13:44, Felix Richter wrote:

Hi,

I am running into an issue with the integrated GPU of the Ryzen 9
7950X. It seems to be a regression from kernel version 6.1 to 6.2.
The bug materializes in the form of my monitor blinking, meaning it
briefly turns fully white. This happens so often that the
system becomes unpleasant to use.

I am running the Archlinux Kernel:
The Issue happens on the bleeding edge kernel: 6.2.13
Switching back to the LTS kernel resolves the issue: 6.1.26

I have two monitors attached to the system. One 42 inch 4k Display
and a 24 inch 1080p Display and am running sway as my desktop.

Let me know if there is more information I could provide to help
narrow down the issue.

Thanks for the report. To be sure the issue doesn't fall through the
cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
tracking bot:

#regzbot ^introduced v6.1..v6.2
#regzbot title drm: amdgpu: system becomes unpleasant to use after
monitor starts blinking and turns full white
#regzbot ignore-activity

This isn't a regression? This issue or a fix for it are already
discussed somewhere else? It was fixed already? You want to clarify when
the regression started to happen? Or point out I got the title or
something else totally wrong? Then just reply and tell me -- ideally
while also telling regzbot about it, as explained by the page listed in
the footer of this mail.

Developers: When fixing the issue, remember to add 'Link:' tags pointing
to the report (the parent of this mail). See page linked in footer for
details.

This sounds exactly like the issue that was fixed in this patch which
is already on its way to Linus:
https://gitlab.freedesktop.org/agd5f/linux/-/commit/08da182175db4c7f80850354849d95f2670e8cd9

FWIW, you in the flood of emails likely missed that this is the same
thread where you yesterday replied "If the module parameter didn't help
then perhaps you are seeing some other issue.  Can you bisect?". That's
why I decided to add this to the tracking. Or am I missing something
obvious here?

/me looks around again and can't see anything, but that doesn't have to
mean anything...

Felix, btw, this guide might help you with the bisection, even if it's
just for kernel compilation:

https://docs.kernel.org/next/admin-guide/quickly-build-trimmed-linux.html

And to indirectly reply to your mail from yesterday[1]. You might want
to ignore the arch linux kernel git repo and just do a bisection between
6.1 and the latest 6.2.y kernel using upstream repos; and if I were you
I'd also try 6.3 or even mainline before that, in case the issue was
fixed already.

[1]
https://lore.kernel.org/all/04749ee4-0728-92fe-bcb0-a7320279e...@felixrichter.tech/


Thanks for the pointers, I'll do a bisection on my desktop from 6.1 to
the newest commit.

FWIW, I wonder what you actually mean with "newest commit" here: a
bisection between 6.1 and mainline HEAD might be a waste of time, *if*
this is something that only happens in 6.2.y (say due to a broken or
incomplete backport)


That was the part I was mostly unsure about … where
to start from.

I was planning to use PKGBUILD scripts from arch to achieve the same
configuration as I would when inst

Re: [PATCH v2] drm/radeon: fix race condition UAF in radeon_gem_set_domain_ioctl

2023-06-05 Thread Alex Deucher
Applied.  Thanks!

On Mon, Jun 5, 2023 at 4:13 AM Christian König  wrote:
>
> On 03.06.23 at 09:43, Min Li wrote:
> > Userspace can race to free the gobj (from which robj is derived), so robj
> > should not be accessed again after drm_gem_object_put(); otherwise it will
> > result in a use-after-free.
> >
> > Signed-off-by: Min Li 
>
> Reviewed-by: Christian König 
>
> > ---
> > Changes in v2:
> > - Remove the now-unused robj to avoid a compiler warning
> >
> >   drivers/gpu/drm/radeon/radeon_gem.c | 4 +---
> >   1 file changed, 1 insertion(+), 3 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/radeon/radeon_gem.c 
> > b/drivers/gpu/drm/radeon/radeon_gem.c
> > index bdc5af23f005..d3f5ddbc1704 100644
> > --- a/drivers/gpu/drm/radeon/radeon_gem.c
> > +++ b/drivers/gpu/drm/radeon/radeon_gem.c
> > @@ -459,7 +459,6 @@ int radeon_gem_set_domain_ioctl(struct drm_device *dev, 
> > void *data,
> >   struct radeon_device *rdev = dev->dev_private;
> >   struct drm_radeon_gem_set_domain *args = data;
> >   struct drm_gem_object *gobj;
> > - struct radeon_bo *robj;
> >   int r;
> >
> >   /* for now if someone requests domain CPU -
> > @@ -472,13 +471,12 @@ int radeon_gem_set_domain_ioctl(struct drm_device 
> > *dev, void *data,
> >   up_read(&rdev->exclusive_lock);
> >   return -ENOENT;
> >   }
> > - robj = gem_to_radeon_bo(gobj);
> >
> >   r = radeon_gem_set_domain(gobj, args->read_domains, 
> > args->write_domain);
> >
> >   drm_gem_object_put(gobj);
> >   up_read(&rdev->exclusive_lock);
> > - r = radeon_gem_handle_lockup(robj->rdev, r);
> > + r = radeon_gem_handle_lockup(rdev, r);
> >   return r;
> >   }
> >
>
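Note for readers of this archive: the fix above follows a general rule: after dropping a reference, use only data that does not live inside the released object. The sketch below models that rule in user space. All `toy_*` names are hypothetical, and the plain integer refcount is a simplification (real kernel reference counts are atomic).

```c
#include <stdlib.h>

struct toy_dev {
	int lockups_handled;
};

struct toy_obj {
	int refcount;
	struct toy_dev *dev;
};

static struct toy_obj *toy_obj_create(struct toy_dev *dev)
{
	struct toy_obj *o = malloc(sizeof(*o));

	if (!o)
		return NULL;
	o->refcount = 1;
	o->dev = dev;
	return o;
}

/* Drop one reference; the object is freed when the count reaches zero. */
static void toy_obj_put(struct toy_obj *o)
{
	if (--o->refcount == 0)
		free(o);
}

static int toy_handle_lockup(struct toy_dev *dev, int r)
{
	dev->lockups_handled++;
	return r;
}

/* Safe ordering: everything needed after the put comes from `dev`,
 * which the caller already holds, not from the object being released. */
static int toy_set_domain(struct toy_dev *dev, struct toy_obj *o)
{
	int r = 0; /* pretend the domain change succeeded */

	toy_obj_put(o);
	/* `o` may be freed now; reading o->dev here would be a use-after-free. */
	return toy_handle_lockup(dev, r);
}
```

In the quoted patch the same idea appears as replacing `robj->rdev` with the `rdev` pointer the ioctl already had.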


Re: [PATCH 2/4] drm/amdkfd: Signal page table fence after KFD flush tlb

2023-06-05 Thread Christian König

On 05.06.23 at 17:13, Shashank Sharma wrote:


On 02/06/2023 16:54, Felix Kuehling wrote:

On 2023-06-02 at 07:57, Christian König wrote:

On 01.06.23 at 21:31, Philip Yang wrote:

To free page table BOs which are freed when updating page table, for
example PTE BOs when PDE0 used as PTE.

Signed-off-by: Philip Yang 
---
  drivers/gpu/drm/amd/amdkfd/kfd_process.c | 5 +
  1 file changed, 5 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_process.c

index af0a4b5257cc..0ff007a74d03 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
@@ -2101,6 +2101,11 @@ void kfd_flush_tlb(struct kfd_process_device 
*pdd, enum TLB_FLUSH_TYPE type)

  amdgpu_amdkfd_flush_gpu_tlb_pasid(
  dev->adev, pdd->process->pasid, type, xcc);
  }
+
+    /* Signal page table fence to free page table BOs */
+    dma_fence_signal(vm->pt_fence);


That's not something you can do here.

Signaling a fence can never depend on anything except for hardware 
work. In other words dma_fence_signal() is supposed to be called 
only from interrupt context!


We are signaling eviction fences from normal user context, too. There 
is no practical way to put this into an interrupt handler when the 
TLB flush is being done synchronously on a user thread. We have to do 
this in such a context for user mode queues.



We are currently working on adding a high-level kernel API 
which can be called directly to perform a TLB flush. Internally this 
API will add a deferred work item that uses the SDMA engine to perform 
the GPU TLB flush (to compensate for a HW bug in Navi chips). If my 
understanding is right, by interrupt context Christian means to 
perform this flush and signal from that deferred work, is that so 
@Christian?


Well more or less. Ideally you put the TLB flush in a work item (or use 
the SDMA for the hw bug workaround on Navi 1x).


The point is that you shouldn't have it in the same execution thread as 
the VM page table updates, because any memory allocation or grabbing a 
lock could potentially depend on the TLB flush as soon as you have 
published the dma_fence (by adding it to the VM page table BOs for example).


Christian.



- Shashank



Regards,
  Felix




What we can do is to put the TLB flushing into an irq worker or work 
item and let the signaling happen from there.


Amar and Shashank are already working on this, I strongly suggest to 
sync up with them.


Regards,
Christian.


+    dma_fence_put(vm->pt_fence);
+    vm->pt_fence = amdgpu_pt_fence_create();
  }
    struct kfd_process_device *kfd_process_device_data_by_id(struct 
kfd_process *p, uint32_t gpu_id)






Re: [PATCH 2/2] drm/amdgpu: make sure that BOs have a backing store

2023-06-05 Thread Alex Deucher
On Mon, Jun 5, 2023 at 5:11 AM Christian König
 wrote:
>
> It's perfectly possible that the BO is about to be destroyed and doesn't
> have a backing store associated with it.
>
> Signed-off-by: Christian König 

Series is:
Reviewed-by: Alex Deucher 

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> index 2bd1a54ee866..249385985a4f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> @@ -1268,8 +1268,12 @@ void amdgpu_bo_move_notify(struct ttm_buffer_object 
> *bo,
>  void amdgpu_bo_get_memory(struct amdgpu_bo *bo,
>   struct amdgpu_mem_stats *stats)
>  {
> -   unsigned int domain;
> uint64_t size = amdgpu_bo_size(bo);
> +   unsigned int domain;
> +
> +   /* Abort if the BO doesn't currently have a backing store */
> +   if (!bo->tbo.resource)
> +   return;
>
> domain = amdgpu_mem_type_to_domain(bo->tbo.resource->mem_type);
> switch (domain) {
> --
> 2.34.1
>


Re: [PATCH 2/4] drm/amdkfd: Signal page table fence after KFD flush tlb

2023-06-05 Thread Shashank Sharma



On 02/06/2023 16:54, Felix Kuehling wrote:

On 2023-06-02 at 07:57, Christian König wrote:

On 01.06.23 at 21:31, Philip Yang wrote:

To free page table BOs which are freed when updating page table, for
example PTE BOs when PDE0 used as PTE.

Signed-off-by: Philip Yang 
---
  drivers/gpu/drm/amd/amdkfd/kfd_process.c | 5 +
  1 file changed, 5 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_process.c

index af0a4b5257cc..0ff007a74d03 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
@@ -2101,6 +2101,11 @@ void kfd_flush_tlb(struct kfd_process_device 
*pdd, enum TLB_FLUSH_TYPE type)

  amdgpu_amdkfd_flush_gpu_tlb_pasid(
  dev->adev, pdd->process->pasid, type, xcc);
  }
+
+    /* Signal page table fence to free page table BOs */
+    dma_fence_signal(vm->pt_fence);


That's not something you can do here.

Signaling a fence can never depend on anything except for hardware 
work. In other words dma_fence_signal() is supposed to be called only 
from interrupt context!


We are signaling eviction fences from normal user context, too. There 
is no practical way to put this into an interrupt handler when the TLB 
flush is being done synchronously on a user thread. We have to do this 
in such a context for user mode queues.



We are currently working on adding a high-level kernel API 
which can be called directly to perform a TLB flush. Internally this API 
will add a deferred work item that uses the SDMA engine to perform the GPU 
TLB flush (to compensate for a HW bug in Navi chips). If my 
understanding is right, by interrupt context Christian means to perform 
this flush and signal from that deferred work, is that so @Christian?


- Shashank



Regards,
  Felix




What we can do is to put the TLB flushing into an irq worker or work 
item and let the signaling happen from there.


Amar and Shashank are already working on this, I strongly suggest to 
sync up with them.


Regards,
Christian.


+    dma_fence_put(vm->pt_fence);
+    vm->pt_fence = amdgpu_pt_fence_create();
  }
    struct kfd_process_device *kfd_process_device_data_by_id(struct 
kfd_process *p, uint32_t gpu_id)




[PATCH] drm/amd: Drop messages in init for radeon, amdgpu

2023-06-05 Thread Alex Deucher
Since there is overlap in supported devices, both
modules load, but only one will bind to a particular
device depending on the user's configuration.  Drop
the message in the module init function as this can
be confusing to users.

Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2608
Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 1 -
 drivers/gpu/drm/radeon/radeon_drv.c | 1 -
 2 files changed, 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 7eda4f039224..94509b76fa6c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -3065,7 +3065,6 @@ static int __init amdgpu_init(void)
if (r)
goto error_fence;
 
-   DRM_INFO("amdgpu kernel modesetting enabled.\n");
amdgpu_register_atpx_handler();
amdgpu_acpi_detect();
 
diff --git a/drivers/gpu/drm/radeon/radeon_drv.c 
b/drivers/gpu/drm/radeon/radeon_drv.c
index e4374814f0ef..16b9eab90185 100644
--- a/drivers/gpu/drm/radeon/radeon_drv.c
+++ b/drivers/gpu/drm/radeon/radeon_drv.c
@@ -634,7 +634,6 @@ static int __init radeon_module_init(void)
if (radeon_modeset == 0)
return -EINVAL;
 
-   DRM_INFO("radeon kernel modesetting enabled.\n");
radeon_register_atpx_handler();
 
return pci_register_driver(&radeon_kms_pci_driver);
-- 
2.40.1



Re: PROBLEM: AMD Ryzen 9 7950X iGPU - Blinking Issue

2023-06-05 Thread Felix Richter

Hi Guys,

sorry for the silence from my side. I had a lot of things to take care 
of after returning from vacation. Also I had to wait on the zfs modules 
to be updated to support kernel 6.3 for further testing.


The bad news is that I am still experiencing issues. I have been able to 
get a reproducible trigger for the buggy behavior. The moment I take a 
screenshot or any other program like `wdisplays` accesses the screen 
buffer the screen starts flickering. The only way to reset it is to 
reboot the machine or log out of the desktop.


With this I did a bisection to figure out which commit is responsible 
for this. I attached the logs to the mail. The short version is that I 
identified commit 81d0bcf9900932633d270d5bc4a54ff599c6ebdb as the 
culprit. Seems that there are side effects of having more flexible 
buffer placement for the case of the internal GPU. To verify that this 
actually is the cause of the issue I built the current archlinux kernel 
with an extra patch to revert the commit: 
https://github.com/ju6ge/linux/tree/v6.3.5-ju6ge. The result is that the 
bug is fixed!


Now if this is the desired long term fix I do not know …

Kind regards,
Felix Richter

On 02.05.23 16:12, Linux regression tracking (Thorsten Leemhuis) wrote:

On 02.05.23 15:48, Felix Richter wrote:

On 5/2/23 15:34, Linux regression tracking (Thorsten Leemhuis) wrote:

On 02.05.23 15:13, Alex Deucher wrote:

On Tue, May 2, 2023 at 7:45 AM Linux regression tracking (Thorsten
Leemhuis)  wrote:


On 30.04.23 13:44, Felix Richter wrote:

Hi,

I am running into an issue with the integrated GPU of the Ryzen 9
7950X. It seems to be a regression from kernel version 6.1 to 6.2.
The bug materializes in the form of my monitor blinking, meaning it
briefly turns fully white. This happens so often that the
system becomes unpleasant to use.

I am running the Archlinux Kernel:
The Issue happens on the bleeding edge kernel: 6.2.13
Switching back to the LTS kernel resolves the issue: 6.1.26

I have two monitors attached to the system. One 42 inch 4k Display
and a 24 inch 1080p Display and am running sway as my desktop.

Let me know if there is more information I could provide to help
narrow down the issue.

Thanks for the report. To be sure the issue doesn't fall through the
cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
tracking bot:

#regzbot ^introduced v6.1..v6.2
#regzbot title drm: amdgpu: system becomes unpleasant to use after
monitor starts blinking and turns full white
#regzbot ignore-activity

This isn't a regression? This issue or a fix for it are already
discussed somewhere else? It was fixed already? You want to clarify
when
the regression started to happen? Or point out I got the title or
something else totally wrong? Then just reply and tell me -- ideally
while also telling regzbot about it, as explained by the page listed in
the footer of this mail.

Developers: When fixing the issue, remember to add 'Link:' tags
pointing
to the report (the parent of this mail). See page linked in footer for
details.

This sounds exactly like the issue that was fixed in this patch which
is already on its way to Linus:
https://gitlab.freedesktop.org/agd5f/linux/-/commit/08da182175db4c7f80850354849d95f2670e8cd9

FWIW, you in the flood of emails likely missed that this is the same
thread where you yesterday replied "If the module parameter didn't help
then perhaps you are seeing some other issue.  Can you bisect?". That's
why I decided to add this to the tracking. Or am I missing something
obvious here?

/me looks around again and can't see anything, but that doesn't have to
mean anything...

Felix, btw, this guide might help you with the bisection, even if it's
just for kernel compilation:

https://docs.kernel.org/next/admin-guide/quickly-build-trimmed-linux.html

And to indirectly reply to your mail from yesterday[1]. You might want
to ignore the arch linux kernel git repo and just do a bisection between
6.1 and the latest 6.2.y kernel using upstream repos; and if I were you
I'd also try 6.3 or even mainline before that, in case the issue was
fixed already.

[1]
https://lore.kernel.org/all/04749ee4-0728-92fe-bcb0-a7320279e...@felixrichter.tech/


Thanks for the pointers, I'll do a bisection on my desktop from 6.1 to
the newest commit.

FWIW, I wonder what you actually mean with "newest commit" here: a
bisection between 6.1 and mainline HEAD might be a waste of time, *if*
this is something that only happens in 6.2.y (say due to a broken or
incomplete backport)


That was the part I was mostly unsure about … where
to start from.

I was planning to use PKGBUILD scripts from arch to achieve the same
configuration as I would when installing
the package and just rewrite the script to use a local copy of the
source code instead of the repository.
That way I can just use the bisect command, rebuild the package and test
again.

In my experience trying to deal with Linux distro's package managers
creates more trouble than it's 

[PATCH v2] drm/radeon: fix race condition UAF in radeon_gem_set_domain_ioctl

2023-06-05 Thread Min Li
Userspace can race to free the gobj (from which robj is converted), so robj
should not be accessed again after drm_gem_object_put(); otherwise it will
result in a use-after-free.

Signed-off-by: Min Li 
---
Changes in v2:
- Remove the unused robj to avoid a compiler complaint

 drivers/gpu/drm/radeon/radeon_gem.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/radeon/radeon_gem.c 
b/drivers/gpu/drm/radeon/radeon_gem.c
index bdc5af23f005..d3f5ddbc1704 100644
--- a/drivers/gpu/drm/radeon/radeon_gem.c
+++ b/drivers/gpu/drm/radeon/radeon_gem.c
@@ -459,7 +459,6 @@ int radeon_gem_set_domain_ioctl(struct drm_device *dev, 
void *data,
struct radeon_device *rdev = dev->dev_private;
struct drm_radeon_gem_set_domain *args = data;
struct drm_gem_object *gobj;
-   struct radeon_bo *robj;
int r;
 
/* for now if someone requests domain CPU -
@@ -472,13 +471,12 @@ int radeon_gem_set_domain_ioctl(struct drm_device *dev, 
void *data,
up_read(&rdev->exclusive_lock);
return -ENOENT;
}
-   robj = gem_to_radeon_bo(gobj);
 
r = radeon_gem_set_domain(gobj, args->read_domains, args->write_domain);
 
drm_gem_object_put(gobj);
up_read(&rdev->exclusive_lock);
-   r = radeon_gem_handle_lockup(robj->rdev, r);
+   r = radeon_gem_handle_lockup(rdev, r);
return r;
 }
 
-- 
2.34.1
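
The ordering problem this patch addresses can be sketched in plain userspace C. This is only an illustration under stated assumptions: `demo_device`, `demo_object` and the `demo_*` helpers are hypothetical stand-ins, not the radeon or DRM types, and the reference count is a bare integer rather than a kref. The point is that once the last reference may have been dropped, the code uses the device pointer it already holds instead of reaching through the object.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel objects; not the radeon types. */
struct demo_device {
	int lockups_handled;
};

struct demo_object {
	int refcount;
	struct demo_device *dev;
};

/* Drop one reference; the object is freed when the count reaches zero. */
static void demo_object_put(struct demo_object *obj)
{
	assert(obj->refcount > 0);
	if (--obj->refcount == 0)
		free(obj);
}

/* Stand-in for a lockup handler that needs only the device and a status. */
static int demo_handle_lockup(struct demo_device *dev, int r)
{
	dev->lockups_handled++;
	return r;
}

/*
 * The caller-owned device pointer is used after the put; obj is never
 * dereferenced once its reference has been dropped.
 */
static int demo_set_domain(struct demo_device *dev, struct demo_object *obj,
			   int r)
{
	demo_object_put(obj);              /* obj may be freed from here on */
	return demo_handle_lockup(dev, r); /* dev, not obj->dev */
}
```

Calling `demo_set_domain(&dev, obj, r)` with an object that holds its last reference frees the object and still reaches the handler safely, because nothing reads `obj->dev` after the put.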



WARNING: CPU: 5 PID: 1464 at drivers/gpu/drm/ttm/ttm_bo.c:326 ttm_bo_release+0x27e/0x2d0 [ttm]

2023-06-05 Thread Borislav Petkov
Hi,

this below triggers with the latest Linus tree:

51f269a6ecc7 ("Merge tag 'probes-fixes-6.4-rc4' of 
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace")

...
[   16.173593] [drm] radeon kernel modesetting enabled.
[   16.173743] radeon :29:00.0: vgaarb: deactivate vga console
[   16.174300] MCE: In-kernel MCE decoding enabled.
[   16.175695] EDAC DEBUG: umc_read_base_mask:   DCSB0[0]=0x0001 reg: 
0x5
[   16.175698] EDAC DEBUG: umc_read_base_mask: DCSB_SEC0[0]=0x reg: 
0x50010
[   16.175700] EDAC DEBUG: umc_read_base_mask:   DCSB0[1]=0x reg: 
0x50004
[   16.175702] EDAC DEBUG: umc_read_base_mask: DCSB_SEC0[1]=0x reg: 
0x50014
[   16.175703] EDAC DEBUG: umc_read_base_mask:   DCSB0[2]=0x0201 reg: 
0x50008
[   16.175705] EDAC DEBUG: umc_read_base_mask: DCSB_SEC0[2]=0x reg: 
0x50018
[   16.175706] EDAC DEBUG: umc_read_base_mask:   DCSB0[3]=0x reg: 
0x5000c
[   16.175707] EDAC DEBUG: umc_read_base_mask: DCSB_SEC0[3]=0x reg: 
0x5001c
[   16.175709] EDAC DEBUG: umc_read_base_mask:   DCSM0[0]=0x03fffdfe reg: 
0x50020
[   16.175710] EDAC DEBUG: umc_read_base_mask: DCSM_SEC0[0]=0x reg: 
0x50028
[   16.175712] EDAC DEBUG: umc_read_base_mask:   DCSM0[1]=0x03fffdfe reg: 
0x50024
[   16.175713] EDAC DEBUG: umc_read_base_mask: DCSM_SEC0[1]=0x reg: 
0x5002c
[   16.175715] EDAC DEBUG: umc_read_base_mask:   DCSB1[0]=0x0001 reg: 
0x15
[   16.175716] EDAC DEBUG: umc_read_base_mask: DCSB_SEC1[0]=0x reg: 
0x150010
[   16.175718] EDAC DEBUG: umc_read_base_mask:   DCSB1[1]=0x reg: 
0x150004
[   16.175719] EDAC DEBUG: umc_read_base_mask: DCSB_SEC1[1]=0x reg: 
0x150014
[   16.175720] EDAC DEBUG: umc_read_base_mask:   DCSB1[2]=0x0201 reg: 
0x150008
[   16.175722] EDAC DEBUG: umc_read_base_mask: DCSB_SEC1[2]=0x reg: 
0x150018
[   16.175723] EDAC DEBUG: umc_read_base_mask:   DCSB1[3]=0x reg: 
0x15000c
[   16.175725] EDAC DEBUG: umc_read_base_mask: DCSB_SEC1[3]=0x reg: 
0x15001c
[   16.175726] EDAC DEBUG: umc_read_base_mask:   DCSM1[0]=0x03fffdfe reg: 
0x150020
[   16.175728] EDAC DEBUG: umc_read_base_mask: DCSM_SEC1[0]=0x reg: 
0x150028
[   16.175729] EDAC DEBUG: umc_read_base_mask:   DCSM1[1]=0x03fffdfe reg: 
0x150024
[   16.175730] EDAC DEBUG: umc_read_base_mask: DCSM_SEC1[1]=0x reg: 
0x15002c
[   16.175741] EDAC DEBUG: umc_determine_memory_type:   UMC0 DIMM type: 
Unbuffered-DDR4
[   16.175742] EDAC DEBUG: umc_determine_memory_type:   UMC1 DIMM type: 
Unbuffered-DDR4
[   16.177514] Console: switching to colour dummy device 80x25
[   16.177693] [drm] initializing kernel modesetting (CEDAR 0x1002:0x68E1 
0x174B:0x3000 0x00).
[   16.177733] ATOM BIOS: AMD
[   16.177795] radeon :29:00.0: VRAM: 1024M 0x - 
0x3FFF (1024M used)
[   16.177798] radeon :29:00.0: GTT: 1024M 0x4000 - 
0x7FFF
[   16.177800] [drm] Detected VRAM RAM=1024M, BAR=256M
[   16.177802] [drm] RAM width 64bits DDR
[   16.177835] [drm] radeon: 1024M of VRAM memory ready
[   16.177836] [drm] radeon: 1024M of GTT memory ready.
[   16.177839] [drm] Loading CEDAR Microcode
[   16.179106] [drm] Internal thermal controller without fan control
[   16.199812] [drm] radeon: dpm initialized
[   16.200179] [drm] GART: num cpu pages 262144, num gpu pages 262144
[   16.200399] [drm] enabling PCIE gen 2 link speeds, disable with 
radeon.pcie_gen2=0
[   16.218135] [drm] PCIE GART of 1024M enabled (table at 0x0014C000).
[   16.218239] radeon :29:00.0: WB enabled
[   16.218240] radeon :29:00.0: fence driver on ring 0 use gpu addr 
0x4c00
[   16.218242] radeon :29:00.0: fence driver on ring 3 use gpu addr 
0x4c0c
[   16.218606] radeon :29:00.0: fence driver on ring 5 use gpu addr 
0x0005c418
[   16.218657] radeon :29:00.0: radeon: MSI limited to 32-bit
[   16.218689] radeon :29:00.0: radeon: using MSI.
[   16.218707] [drm] radeon: irq initialized.
[   16.234730] [drm] ring test on 0 succeeded in 0 usecs
[   16.234738] [drm] ring test on 3 succeeded in 2 usecs
[   16.317725] r8169 :25:00.0 eth0: Link is Down
[   16.410486] [drm] ring test on 5 succeeded in 1 usecs
[   16.410492] [drm] UVD initialized successfully.
[   16.410555] [drm] ib test on ring 0 succeeded in 0 usecs
[   16.410596] [drm] ib test on ring 3 succeeded in 0 usecs
[   17.077422] [drm] ib test on ring 5 succeeded
[   17.077581] [drm] Radeon Display Connectors
[   17.077584] [drm] Connector 0:
[   17.077585] [drm]   HDMI-A-1
[   17.077586] [drm]   HPD4
[   17.077588] [drm]   DDC: 0x6440 0x6440 0x6444 0x6444 0x6448 0x6448 0x644c 
0x644c
[   17.077590] [drm]   Encoders:
[   17.077591] [drm] DFP1: INTERNAL_UNIPHY1
[   17.077593] [drm] Connector 1:
[   17.077594] [drm]   DVI-I-1
[   17.077595] [drm]   HPD1
[   17.077597] [drm]   DDC: 0x6460 0x6460 0x6464 0x6464 0x6468 0x6468 0x646c

Re: Build regressions/improvements in v6.4-rc5

2023-06-05 Thread Geert Uytterhoeven

On Mon, 5 Jun 2023, Geert Uytterhoeven wrote:

JFYI, when comparing v6.4-rc5[1] to v6.4-rc4[3], the summaries are:
 - build errors: +2/-4


arm64-gcc5/arm64-allmodconfig (seen before)


[1] 
http://kisskb.ellerman.id.au/kisskb/branch/linus/head/9561de3a55bed6bdd44a12820ba81ec416e705a7/
 (151 out of 152 configs)
[3] 
http://kisskb.ellerman.id.au/kisskb/branch/linus/head/7877cb91f1081754a1487c144d85dc0d2e2e7fc4/
 (151 out of 152 configs)


Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


Re: PROBLEM: AMD Ryzen 9 7950X iGPU - Blinking Issue

2023-06-05 Thread Alex Deucher
On Sat, Jun 3, 2023 at 10:52 AM Felix Richter  wrote:
>
> Hi Guys,
>
> sorry for the silence from my side. I had a lot of things to take care
> of after returning from vacation. Also I had to wait on the zfs modules
> to be updated to support kernel 6.3 for further testing.
>
> The bad news is that I am still experiencing issues. I have been able to
> get a reproducible trigger for the buggy behavior. The moment I take a
> screenshot or any other program like `wdisplays` accesses the screen
> buffer the screen starts flickering. The only way to reset it is to
> reboot the machine or log out of the desktop.
>
> With this I did a bisection to figure out which commit is responsible
> for this. I attached the logs to the mail. The short version is that I
> identified commit 81d0bcf9900932633d270d5bc4a54ff599c6ebdb as the
> culprit. Seems that there are side effects of having more flexible
> buffer placement for the case of the internal GPU. To verify that this
> actually is the cause of the issue I built the current archlinux kernel
> with an extra patch to revert the commit:
> https://github.com/ju6ge/linux/tree/v6.3.5-ju6ge. The result is that the
> bug is fixed!

+ Hamza

This is a known issue.  You can work around it by setting
amdgpu.sg_display=0.  The issue should be fixed in:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=08da182175db4c7f80850354849d95f2670e8cd9

Alex



>
> Now if this is the desired long term fix I do not know …
>
> Kind regards,
> Felix Richter
>
> On 02.05.23 16:12, Linux regression tracking (Thorsten Leemhuis) wrote:
> > On 02.05.23 15:48, Felix Richter wrote:
> >> On 5/2/23 15:34, Linux regression tracking (Thorsten Leemhuis) wrote:
> >>> On 02.05.23 15:13, Alex Deucher wrote:
>  On Tue, May 2, 2023 at 7:45 AM Linux regression tracking (Thorsten
>  Leemhuis)  wrote:
> 
> > On 30.04.23 13:44, Felix Richter wrote:
> >> Hi,
> >>
> >> I am running into an issue with the integrated GPU of the Ryzen 9
> >> 7950X. It seems to be a regression from kernel version 6.1 to 6.2.
> >> The bug materializes in the form of my monitor blinking, meaning it
> >> briefly turns fully white. This happens so often that the
> >> system becomes unpleasant to use.
> >>
> >> I am running the Archlinux Kernel:
> >> The Issue happens on the bleeding edge kernel: 6.2.13
> >> Switching back to the LTS kernel resolves the issue: 6.1.26
> >>
> >> I have two monitors attached to the system. One 42 inch 4k Display
> >> and a 24 inch 1080p Display and am running sway as my desktop.
> >>
> >> Let me know if there is more information I could provide to help
> >> narrow down the issue.
> > Thanks for the report. To be sure the issue doesn't fall through the
> > cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
> > tracking bot:
> >
> > #regzbot ^introduced v6.1..v6.2
> > #regzbot title drm: amdgpu: system becomes unpleasant to use after
> > monitor starts blinking and turns full white
> > #regzbot ignore-activity
> >
> > This isn't a regression? This issue or a fix for it are already
> > discussed somewhere else? It was fixed already? You want to clarify
> > when
> > the regression started to happen? Or point out I got the title or
> > something else totally wrong? Then just reply and tell me -- ideally
> > while also telling regzbot about it, as explained by the page listed in
> > the footer of this mail.
> >
> > Developers: When fixing the issue, remember to add 'Link:' tags
> > pointing
> > to the report (the parent of this mail). See page linked in footer for
> > details.
>  This sounds exactly like the issue that was fixed in this patch which
>  is already on its way to Linus:
>  https://gitlab.freedesktop.org/agd5f/linux/-/commit/08da182175db4c7f80850354849d95f2670e8cd9
> >>> FWIW, you in the flood of emails likely missed that this is the same
> >>> thread where you yesterday replied "If the module parameter didn't help
> >>> then perhaps you are seeing some other issue.  Can you bisect?". That's
> >>> why I decided to add this to the tracking. Or am I missing something
> >>> obvious here?
> >>>
> >>> /me looks around again and can't see anything, but that doesn't have to
> >>> mean anything...
> >>>
> >>> Felix, btw, this guide might help you with the bisection, even if it's
> >>> just for kernel compilation:
> >>>
> >>> https://docs.kernel.org/next/admin-guide/quickly-build-trimmed-linux.html
> >>>
> >>> And to indirectly reply to your mail from yesterday[1]. You might want
> >>> to ignore the arch linux kernel git repo and just do a bisection between
> >>> 6.1 and the latest 6.2.y kernel using upstream repos; and if I were you
> >>> I'd also try 6.3 or even mainline before that, in case the issue was
> >>> fixed already.
> >>>
> >>> [1]
> >>> https://lore.kernel.org/all/0474

RE: [PATCH 00/14] DC Patches June 2, 2023

2023-06-05 Thread Wheeler, Daniel
[Public]

Hi all,

This week this patchset was tested on the following systems:
* Lenovo ThinkBook T13s Gen4 with AMD Ryzen 5 6600U
* MSI Gaming X Trio RX 6800
* Gigabyte Gaming OC RX 7900 XTX

These systems were tested on the following display/connection types:
* eDP, (1080p 60hz [5650U]) (1920x1200 60hz [6600U]) (2560x1600 
120hz[6600U])
* VGA and DVI (1680x1050 60hz [DP to VGA/DVI, USB-C to VGA/DVI])
* DP/HDMI/USB-C (1440p 170hz, 4k 60hz, 4k 144hz, 4k 240hz [Includes 
USB-C to DP/HDMI adapters])
* Thunderbolt (LG Ultrafine 5k)
* MST (Startech MST14DP123DP [DP to 3x DP] and 2x 4k 60Hz displays)
* DSC (with Cable Matters 101075 [DP to 3x DP] with 3x 4k60 displays, 
and HP Hook G2 with 1 4k60 display)
* USB 4 (Kensington SD5700T and 1x 4k 60Hz display)
* PCON (Club3D CAC-1085 and 1x 4k 144Hz display [at 4k 120HZ, as that 
is the max the adapter supports])

The testing is a mix of automated and manual tests. Manual testing includes 
(but is not limited to):
* Changing display configurations and settings
* Benchmark testing
* Feature testing (Freesync, etc.)

Automated testing includes (but is not limited to):
* Script testing (scripts to automate some of the manual checks)
* IGT testing

The patchset consists of the amd-staging-drm-next branch (Head commit - 
3e54d382a51b71bd08702a10c0864a60f0108c66 -> drm/amd/amdgpu: Fix up locking etc 
in amdgpu_debugfs_gprwave_ioctl()) with new patches added on top of it. This 
branch is used for both Ubuntu and Chrome OS testing (ChromeOS on a bi-weekly 
basis).


Tested on Ubuntu 22.04.2

Tested-by: Daniel Wheeler 


Thank you,

Dan Wheeler
Sr. Technologist | AMD
SW Display
--
1 Commerce Valley Dr E, Thornhill, ON L3T 7X6
amd.com

-Original Message-
From: Wang, Chao-kai (Stylon) 
Sent: Wednesday, May 31, 2023 12:48 AM
To: amd-gfx@lists.freedesktop.org
Cc: Wentland, Harry ; Li, Sun peng (Leo) 
; Lakha, Bhawanpreet ; Siqueira, 
Rodrigo ; Pillai, Aurabindo 
; Zhuo, Qingqing (Lillian) ; 
Li, Roman ; Lin, Wayne ; Wang, Chao-kai 
(Stylon) ; Chiu, Solomon ; Kotarac, 
Pavle ; Gutierrez, Agustin ; 
Wheeler, Daniel 
Subject: [PATCH 00/14] DC Patches June 2, 2023

This DC patchset brings improvements in multiple areas. In summary, we have:

* Clock optimization for DCN 3.1.4
* Performance improvements
* Improvements on power saving
* Fix screen flash in high resolution displays
* Enable Freesync video mode by default
* Bug fixes for hangs or crashes in various cases
* Improved code robustness in corner cases

Cc: Daniel Wheeler 

Alvin Lee (2):
  drm/amd/display: Refactor fast update to use new HWSS build sequence
  drm/amd/display: Reduce sdp bw after urgent to 90%

Aurabindo Pillai (1):
  drm/amd/display: Enable Freesync Video Mode by default

Austin Zheng (1):
  drm/amd/display: Filter out AC mode frequencies on DC mode systems

Charlene Liu (1):
  drm/amd/display: add NULL pointer check

Daniel Miess (1):
  drm/amd/display: Enable dcn314 DPP RCO

Dmytro Laktyushkin (2):
  drm/amd/display: fix seamless odm transitions
  drm/amd/display: fix dcn315 single stream crb allocation

Leo Ma (1):
  Revert "drm/amd/display: cache trace buffer size"

Max Tseng (1):
  drm/amd/display: Add control flag to dc_stream_state to skip eDP BL
off/link off

Nicholas Kazlauskas (1):
  drm/amd/display: Skip DPP DTO update if root clock is gated

Saaem Rizvi (1):
  drm/amd/display: Wrong index type for pipe iterator

Samson Tam (1):
  drm/amd/display: add ODM case when looking for first split pipe

Sridevi (1):
  drm/amd/display: DSC Programming Deltas

 .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c |  12 +-  
.../display/dc/clk_mgr/dcn32/dcn32_clk_mgr.c  |  13 +-
 drivers/gpu/drm/amd/display/dc/core/dc.c  | 307 --
 .../drm/amd/display/dc/core/dc_hw_sequencer.c | 255 +++  
.../gpu/drm/amd/display/dc/core/dc_resource.c |  20 ++
 .../gpu/drm/amd/display/dc/core/dc_stream.c   |   4 +-
 drivers/gpu/drm/amd/display/dc/dc.h   |   2 +
 drivers/gpu/drm/amd/display/dc/dc_stream.h|   1 +
 .../amd/display/dc/dce100/dce100_resource.c   |   5 +
 .../display/dc/dce110/dce110_hw_sequencer.c   |   3 +-
 .../amd/display/dc/dce110/dce110_resource.c   |   5 +
 .../amd/display/dc/dce112/dce112_resource.c   |   5 +
 .../amd/display/dc/dce120/dce120_resource.c   |   1 +
 .../drm/amd/display/dc/dce80/dce80_resource.c |   6 +
 .../drm/amd/display/dc/dcn10/dcn10_resource.c |   1 +
 .../gpu/drm/amd/display/dc/dcn20/dcn20_dsc.c  |  29 +-  
.../gpu/drm/amd/display/dc/dcn20/dcn20_dsc.h  |  28 ++
 .../drm/amd/display/dc/dcn20/dcn20_hwseq.c|  11 +
 .../drm/amd/display/dc/dcn20/dcn20_resource.c |   1 +
 .../amd/display/dc/dcn201/dcn201_resource.c   |   1 +
 .../drm/amd/display/dc/dcn21/dcn21_resource.c |   1 +
 .../gpu/drm/amd/display/dc/d

[PATCH Review V2 3/3] drm/amdgpu: convert vcn/jpeg logical mask to physical mask

2023-06-05 Thread Stanley . Yang
Changed from V1:
Remove amdgpu_ras_logical_mask_to_physical_mask
because GET_MASK provides the same feature.
Support converting the VCN/JPEG logical mask to a physical
mask.

Signed-off-by: Stanley.Yang 
Reviewed-by: Tao Zhou 
Reviewed-by: Hawking Zhang 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index 2ad3b93bf530..1fa024a94314 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -1698,6 +1698,10 @@ int psp_ras_trigger_error(struct psp_context *psp,
case TA_RAS_BLOCK__SDMA:
dev_mask = GET_MASK(SDMA0, instance_mask);
break;
+   case TA_RAS_BLOCK__VCN:
+   case TA_RAS_BLOCK__JPEG:
+   dev_mask = GET_MASK(VCN, instance_mask);
+   break;
default:
dev_mask = instance_mask;
break;
-- 
2.17.1



[PATCH Review V2 1/3] drm/amdgpu: pass xcc mask to ras ta

2023-06-05 Thread Stanley . Yang
Pass the xcc mask to the RAS TA; the RAS TA will compare
the mask with the one from the chiplet topology.

Changed from V1:
Remove IP version checking.
Set ras_cmd->ras_in_message.init_flags.xcc_mask
directly because xcc_mask is a common structure for
all the devices.

Signed-off-by: Stanley.Yang 
Reviewed-by: Tao Zhou 
Reviewed-by: Hawking Zhang 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 2 ++
 drivers/gpu/drm/amd/amdgpu/ta_ras_if.h  | 1 +
 2 files changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index 2175bfc89e7d..2ad3b93bf530 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -1662,6 +1662,8 @@ int psp_ras_initialize(struct psp_context *psp)
ras_cmd->ras_in_message.init_flags.poison_mode_en = 1;
if (!adev->gmc.xgmi.connected_to_cpu)
ras_cmd->ras_in_message.init_flags.dgpu_mode = 1;
+   ras_cmd->ras_in_message.init_flags.xcc_mask =
+   adev->gfx.xcc_mask;
 
ret = psp_ta_load(psp, &psp->ras_context.context);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/ta_ras_if.h 
b/drivers/gpu/drm/amd/amdgpu/ta_ras_if.h
index 30d0482ac466..be2984ac00a5 100644
--- a/drivers/gpu/drm/amd/amdgpu/ta_ras_if.h
+++ b/drivers/gpu/drm/amd/amdgpu/ta_ras_if.h
@@ -129,6 +129,7 @@ struct ta_ras_trigger_error_input {
 struct ta_ras_init_flags {
uint8_t poison_mode_en;
uint8_t dgpu_mode;
+   uint16_t xcc_mask;
 };
 
 struct ta_ras_output_flags {
-- 
2.17.1



[PATCH Review V2 2/3] drm/amdgpu: support check vcn jpeg block mask

2023-06-05 Thread Stanley . Yang
Support VCN/JPEG instance mask checking; pass the logical
mask through directly for all blocks except GFX/SDMA/VCN/JPEG.

Changed from V1:
Correct a typo

Signed-off-by: Stanley.Yang 
Reviewed-by: Tao Zhou 
Reviewed-by: Hawking Zhang 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index eb3630e480f3..c56a5a6f9e83 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -360,8 +360,12 @@ static void amdgpu_ras_instance_mask_check(struct 
amdgpu_device *adev,
case AMDGPU_RAS_BLOCK__SDMA:
mask = GENMASK(adev->sdma.num_instances - 1, 0);
break;
+   case AMDGPU_RAS_BLOCK__VCN:
+   case AMDGPU_RAS_BLOCK__JPEG:
+   mask = GENMASK(adev->vcn.num_vcn_inst - 1, 0);
+   break;
default:
-   mask = 0;
+   mask = inst_mask;
break;
}
 
-- 
2.17.1
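
As a rough illustration of the mask arithmetic in this hunk, here is a userspace sketch. It assumes that the function goes on to AND the requested instance mask with the computed `mask`; that step is not visible in the hunk above, so treat it as an assumption. `demo_genmask32` is a hypothetical 32-bit stand-in for the kernel's `GENMASK(h, l)`, and `demo_clamp_instance_mask` is an invented helper name.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Userspace stand-in for the kernel's GENMASK(h, l), limited to 32 bits:
 * sets bits l..h inclusive. Requires l <= h <= 31.
 */
static uint32_t demo_genmask32(unsigned int h, unsigned int l)
{
	assert(l <= h && h <= 31);
	return (UINT32_MAX >> (31 - h)) & (UINT32_MAX << l);
}

/*
 * Drop bits that name instances which do not exist.
 * Assumption: the real check ANDs the requested mask with the valid mask.
 */
static uint32_t demo_clamp_instance_mask(uint32_t requested,
					 unsigned int num_instances)
{
	assert(num_instances >= 1 && num_instances <= 32);
	return requested & demo_genmask32(num_instances - 1, 0);
}
```

With two instances the valid mask is 0x3, so a request of 0xF would be reduced to 0x3 while a request of 0x1 passes through unchanged. Under the same assumption, the new `default:` case (`mask = inst_mask`) makes the AND a no-op for blocks that have no per-instance count.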



[PATCH] drm/amdkfd: mark some clear_address_watch() callbacks static

2023-06-05 Thread Arnd Bergmann
From: Arnd Bergmann 

Some of the newly introduced clear_address_watch callback handlers have
no prototype because they are only used in one file, which causes a W=1
warning:

drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c:164:10: error: no previous 
prototype for 'kgd_gfx_aldebaran_clear_address_watch' 
[-Werror=missing-prototypes]
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v11.c:782:10: error: no previous 
prototype for 'kgd_gfx_v11_clear_address_watch' [-Werror=missing-prototypes]

Mark these ones static. If another user comes up in the future, that
can be reverted along with adding the prototype.

Fixes: cfd9715f741a1 ("drm/amdkfd: add debug set and clear address watch points 
operation")
Signed-off-by: Arnd Bergmann 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c | 2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v11.c   | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c
index efd6a72aab4eb..bdda8744398fe 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c
@@ -161,7 +161,7 @@ static uint32_t kgd_gfx_aldebaran_set_address_watch(
return watch_address_cntl;
 }
 
-uint32_t kgd_gfx_aldebaran_clear_address_watch(struct amdgpu_device *adev,
+static uint32_t kgd_gfx_aldebaran_clear_address_watch(struct amdgpu_device 
*adev,
uint32_t watch_id)
 {
return 0;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v11.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v11.c
index 52efa690a3c21..131859ce3e7e9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v11.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v11.c
@@ -779,7 +779,7 @@ static uint32_t kgd_gfx_v11_set_address_watch(struct 
amdgpu_device *adev,
return watch_address_cntl;
 }
 
-uint32_t kgd_gfx_v11_clear_address_watch(struct amdgpu_device *adev,
+static uint32_t kgd_gfx_v11_clear_address_watch(struct amdgpu_device *adev,
uint32_t watch_id)
 {
return 0;
-- 
2.39.2
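
A small sketch of why `static` is the natural fix here: a callback that is only reached through a function table does not need external linkage, so there is no prototype to add. All names below (`demo_watch_ops`, `demo_ops`, `demo_clear_watch`) are hypothetical and only loosely modeled on the kfd callback tables.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback table; loosely modeled on a per-ASIC function table. */
struct demo_watch_ops {
	uint32_t (*clear_address_watch)(uint32_t watch_id);
};

/* In a real tree these two declarations would live in a shared header. */
extern const struct demo_watch_ops demo_ops;
uint32_t demo_clear_watch(const struct demo_watch_ops *ops, uint32_t watch_id);

/*
 * Only referenced through the table below, so it gets internal linkage.
 * Without "static", -Wmissing-prototypes would ask for a prior prototype.
 */
static uint32_t demo_clear_address_watch(uint32_t watch_id)
{
	(void)watch_id;
	return 0;
}

const struct demo_watch_ops demo_ops = {
	.clear_address_watch = demo_clear_address_watch,
};

/* Other files go through the table, never the function name. */
uint32_t demo_clear_watch(const struct demo_watch_ops *ops, uint32_t watch_id)
{
	assert(ops != NULL && ops->clear_address_watch != NULL);
	return ops->clear_address_watch(watch_id);
}
```

If a second file ever needed to call the callback by name, the `static` could be dropped and a prototype added to the header, which matches the note in the commit message.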



[PATCH 1/2] drm/amdgpu: make sure BOs are locked in amdgpu_vm_get_memory

2023-06-05 Thread Christian König
We need to grab the BO's lock, otherwise we can run into a crash
when we try to inspect its current location.

Signed-off-by: Christian König 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 69 +++---
 1 file changed, 39 insertions(+), 30 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 3c0310576b3b..2c8cafec48a4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -920,42 +920,51 @@ int amdgpu_vm_update_range(struct amdgpu_device *adev, 
struct amdgpu_vm *vm,
return r;
 }
 
+static void amdgpu_vm_bo_get_memory(struct amdgpu_bo_va *bo_va,
+   struct amdgpu_mem_stats *stats)
+{
+   struct amdgpu_vm *vm = bo_va->base.vm;
+   struct amdgpu_bo *bo = bo_va->base.bo;
+
+   if (!bo)
+   return;
+
+   /*
+* For now ignore BOs which are currently locked and potentially
+* changing their location.
+*/
+   if (bo->tbo.base.resv != vm->root.bo->tbo.base.resv &&
+   !dma_resv_trylock(bo->tbo.base.resv))
+   return;
+
+   amdgpu_bo_get_memory(bo, stats);
+   if (bo->tbo.base.resv != vm->root.bo->tbo.base.resv)
+   dma_resv_unlock(bo->tbo.base.resv);
+}
+
 void amdgpu_vm_get_memory(struct amdgpu_vm *vm,
  struct amdgpu_mem_stats *stats)
 {
struct amdgpu_bo_va *bo_va, *tmp;
 
spin_lock(&vm->status_lock);
-   list_for_each_entry_safe(bo_va, tmp, &vm->idle, base.vm_status) {
-   if (!bo_va->base.bo)
-   continue;
-   amdgpu_bo_get_memory(bo_va->base.bo, stats);
-   }
-   list_for_each_entry_safe(bo_va, tmp, &vm->evicted, base.vm_status) {
-   if (!bo_va->base.bo)
-   continue;
-   amdgpu_bo_get_memory(bo_va->base.bo, stats);
-   }
-   list_for_each_entry_safe(bo_va, tmp, &vm->relocated, base.vm_status) {
-   if (!bo_va->base.bo)
-   continue;
-   amdgpu_bo_get_memory(bo_va->base.bo, stats);
-   }
-   list_for_each_entry_safe(bo_va, tmp, &vm->moved, base.vm_status) {
-   if (!bo_va->base.bo)
-   continue;
-   amdgpu_bo_get_memory(bo_va->base.bo, stats);
-   }
-   list_for_each_entry_safe(bo_va, tmp, &vm->invalidated, base.vm_status) {
-   if (!bo_va->base.bo)
-   continue;
-   amdgpu_bo_get_memory(bo_va->base.bo, stats);
-   }
-   list_for_each_entry_safe(bo_va, tmp, &vm->done, base.vm_status) {
-   if (!bo_va->base.bo)
-   continue;
-   amdgpu_bo_get_memory(bo_va->base.bo, stats);
-   }
+   list_for_each_entry_safe(bo_va, tmp, &vm->idle, base.vm_status)
+   amdgpu_vm_bo_get_memory(bo_va, stats);
+
+   list_for_each_entry_safe(bo_va, tmp, &vm->evicted, base.vm_status)
+   amdgpu_vm_bo_get_memory(bo_va, stats);
+
+   list_for_each_entry_safe(bo_va, tmp, &vm->relocated, base.vm_status)
+   amdgpu_vm_bo_get_memory(bo_va, stats);
+
+   list_for_each_entry_safe(bo_va, tmp, &vm->moved, base.vm_status)
+   amdgpu_vm_bo_get_memory(bo_va, stats);
+
+   list_for_each_entry_safe(bo_va, tmp, &vm->invalidated, base.vm_status)
+   amdgpu_vm_bo_get_memory(bo_va, stats);
+
+   list_for_each_entry_safe(bo_va, tmp, &vm->done, base.vm_status)
+   amdgpu_vm_bo_get_memory(bo_va, stats);
spin_unlock(&vm->status_lock);
 }
 
-- 
2.34.1
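
The locking rule in the new helper can be modeled in userspace as follows. This is a sketch under stated assumptions: `demo_resv` is a hypothetical try-lock built on a C11 `atomic_flag`, not the kernel's `dma_resv`, and `held` represents the root reservation that the caller is assumed to hold already. A buffer that shares the held lock is inspected directly; any other buffer is inspected only if its lock can be taken without waiting, and is skipped otherwise.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a reservation lock; not the kernel's dma_resv. */
struct demo_resv {
	atomic_flag locked;
};

static bool demo_resv_trylock(struct demo_resv *r)
{
	/* test_and_set returns the previous value: false means we got it */
	return !atomic_flag_test_and_set(&r->locked);
}

static void demo_resv_unlock(struct demo_resv *r)
{
	atomic_flag_clear(&r->locked);
}

struct demo_buffer {
	struct demo_resv *resv;
	uint64_t size;
};

/*
 * Add bo->size to *total if the buffer can be inspected safely.
 * "held" is a lock the caller already owns (the root reservation).
 */
static void demo_account_buffer(const struct demo_buffer *bo,
				struct demo_resv *held, uint64_t *total)
{
	assert(total != NULL);
	if (!bo)
		return;

	/* Busy buffers are skipped instead of waited for. */
	if (bo->resv != held && !demo_resv_trylock(bo->resv))
		return;

	*total += bo->size;

	if (bo->resv != held)
		demo_resv_unlock(bo->resv);
}
```

Skipping busy buffers means the reported totals can be momentarily low. The comment in the patch ("For now ignore BOs which are currently locked") suggests that trade-off is accepted.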



[PATCH 2/2] drm/amdgpu: make sure that BOs have a backing store

2023-06-05 Thread Christian König
It's perfectly possible that the BO is about to be destroyed and doesn't
have a backing store associated with it.

Signed-off-by: Christian König 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index 2bd1a54ee866..249385985a4f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -1268,8 +1268,12 @@ void amdgpu_bo_move_notify(struct ttm_buffer_object *bo,
 void amdgpu_bo_get_memory(struct amdgpu_bo *bo,
  struct amdgpu_mem_stats *stats)
 {
-   unsigned int domain;
uint64_t size = amdgpu_bo_size(bo);
+   unsigned int domain;
+
+   /* Abort if the BO doesn't currently have a backing store */
+   if (!bo->tbo.resource)
+   return;
 
domain = amdgpu_mem_type_to_domain(bo->tbo.resource->mem_type);
switch (domain) {
-- 
2.34.1
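
The guard added by this patch amounts to an early return before the resource pointer is dereferenced. A minimal userspace model, with hypothetical `demo_*` types standing in for the amdgpu and TTM structures, might look like this:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical memory domains; not the AMDGPU_GEM_DOMAIN_* values. */
enum demo_domain { DEMO_DOMAIN_CPU, DEMO_DOMAIN_GTT, DEMO_DOMAIN_VRAM };

struct demo_resource {
	enum demo_domain domain;
};

struct demo_bo {
	struct demo_resource *resource; /* NULL while no backing store exists */
	uint64_t size;
};

struct demo_mem_stats {
	uint64_t cpu;
	uint64_t gtt;
	uint64_t vram;
};

/* Account a buffer's size to the domain of its backing store, if any. */
static void demo_bo_get_memory(const struct demo_bo *bo,
			       struct demo_mem_stats *stats)
{
	assert(bo != NULL && stats != NULL);

	/* A buffer that is being torn down may have no resource at all. */
	if (!bo->resource)
		return;

	switch (bo->resource->domain) {
	case DEMO_DOMAIN_VRAM:
		stats->vram += bo->size;
		break;
	case DEMO_DOMAIN_GTT:
		stats->gtt += bo->size;
		break;
	case DEMO_DOMAIN_CPU:
		stats->cpu += bo->size;
		break;
	}
}
```

A buffer without a backing store contributes nothing to any counter, which is the behavior the early return in the patch is meant to give.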



Re: [PATCH 1/3] Revert "drm/amdgpu: change the reference clock for raven/raven2"

2023-06-05 Thread Michel Dänzer
On 6/2/23 20:43, Alex Deucher wrote:
> This reverts commit fbc24293ca16b3b9ef891fe32ccd04735a6f8dc1.
> 
> This results in inconsistent timing reported via asynchronous
> GPU queries.
> 
> Link: https://lists.freedesktop.org/archives/amd-gfx/2023-May/093731.html
> Cc: jesse.zh...@amd.com
> Cc: mic...@daenzer.net
> Signed-off-by: Alex Deucher 

The series is

Reviewed-by: Michel Dänzer 

Thanks!


-- 
Earthling Michel Dänzer|  https://redhat.com
Libre software enthusiast  | Mesa and Xwayland developer



Re: [PATCH v2] drm/radeon: fix race condition UAF in radeon_gem_set_domain_ioctl

2023-06-05 Thread Christian König

On 03.06.23 at 09:43, Min Li wrote:

Userspace can race to free the gobj (from which robj is converted), so robj
should not be accessed again after drm_gem_object_put(); otherwise it will
result in a use-after-free.

Signed-off-by: Min Li 


Reviewed-by: Christian König 


---
Changes in v2:
- Remove the unused robj to avoid a compiler complaint

  drivers/gpu/drm/radeon/radeon_gem.c | 4 +---
  1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/radeon/radeon_gem.c 
b/drivers/gpu/drm/radeon/radeon_gem.c
index bdc5af23f005..d3f5ddbc1704 100644
--- a/drivers/gpu/drm/radeon/radeon_gem.c
+++ b/drivers/gpu/drm/radeon/radeon_gem.c
@@ -459,7 +459,6 @@ int radeon_gem_set_domain_ioctl(struct drm_device *dev, 
void *data,
struct radeon_device *rdev = dev->dev_private;
struct drm_radeon_gem_set_domain *args = data;
struct drm_gem_object *gobj;
-   struct radeon_bo *robj;
int r;
  
  	/* for now if someone requests domain CPU -

@@ -472,13 +471,12 @@ int radeon_gem_set_domain_ioctl(struct drm_device *dev, 
void *data,
up_read(&rdev->exclusive_lock);
return -ENOENT;
}
-   robj = gem_to_radeon_bo(gobj);
  
  	r = radeon_gem_set_domain(gobj, args->read_domains, args->write_domain);
  
  	drm_gem_object_put(gobj);

up_read(&rdev->exclusive_lock);
-   r = radeon_gem_handle_lockup(robj->rdev, r);
+   r = radeon_gem_handle_lockup(rdev, r);
return r;
  }