Re: [PATCH v11] drm/amdgpu: add drm buddy support to amdgpu

2022-03-29 Thread Christian König

Am 29.03.22 um 21:18 schrieb Arunpravin Paneer Selvam:


On 29/03/22 9:30 pm, Arunpravin Paneer Selvam wrote:


On 29/03/22 4:54 pm, Christian König wrote:

Am 29.03.22 um 13:19 schrieb Arunpravin Paneer Selvam:

[SNIP]

+   pages_left = node->base.num_pages;

	i = 0;

-   spin_lock(&mgr->lock);
while (pages_left) {
-   uint32_t alignment = tbo->page_alignment;
+   if (tbo->page_alignment)
+   min_page_size = tbo->page_alignment << PAGE_SHIFT;
+   else
+   min_page_size = mgr->default_page_size;

The handling here looks extremely awkward to me.

min_page_size should be determined outside of the loop, based on 
default_page_size, alignment and contiguous flag.

I had kept the min_page_size determination logic inside the loop for 2GiB+
requirements. Since we now round up the size to the required alignment,
I moved the min_page_size determination outside of the loop in v12.
Please review.

Ah! So do we only have the loop so that each allocation isn't bigger
than 2GiB? If yes couldn't we instead add a max_alloc_size or something
similar?

Yes, we have the loop to limit each allocation to no more than 2GiB. I
think we cannot avoid the loop, since we need to allocate any remaining
pages to complete a 2GiB+ request. In other words, in the first
iteration we limit the max size to 2GiB, and in the second iteration we
allocate the leftover pages, if any.
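The chunking described above can be sketched in userspace C. This is only an illustration of the loop's arithmetic, not the driver code; `count_chunks` is a made-up name, and `MAX_SG_PAGES` mirrors the patch's `2UL << (30 - PAGE_SHIFT)` assuming 4 KiB pages:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the chunking discussed above: a request larger than 2 GiB is
 * split so that each piece fits the SG-table limit, while the total can
 * still exceed 2 GiB. */
#define PAGE_SHIFT 12
#define MAX_SG_PAGES (2UL << (30 - PAGE_SHIFT))   /* 2 GiB worth of pages */

/* Returns how many loop iterations a request of `pages_left` pages needs. */
static unsigned int count_chunks(uint64_t pages_left)
{
        unsigned int i = 0;

        while (pages_left) {
                /* Limit each allocation to the SG-table maximum. */
                uint64_t pages = pages_left < MAX_SG_PAGES ?
                                 pages_left : MAX_SG_PAGES;

                pages_left -= pages;
                i++;
        }
        return i;
}
```

A request of exactly 2GiB takes one iteration; anything larger takes at least two, which is why "i" ends up at 2 for 2GiB+ requests in the discussion below.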

Hi Christian,

My understanding here might be incorrect: should we limit the max size to
2GiB and skip all the remaining pages for a 2GiB+ request?


No, the total size can be bigger than 2GiB. Only the contained pages 
should be a maximum of 2GiB.


See, drm_buddy_alloc_blocks() already has the loop you need inside of it;
all you need to do is restrict the maximum allocation order.


In other words you got this line here in drm_buddy_alloc_blocks:

order = fls(pages) - 1;

Which then would become:

order = min(fls(pages), mm->max_order) - 1;

You then just need to give mm->max_order as parameter to 
drm_buddy_init() instead of trying to figure it out yourself. This would 
then also make the following BUG_ON() superfluous.


And btw: IIRC fls() uses only 32 bit! You should either use fls64() or 
directly ilog2(), which optimizes to fls(), fls64(), or a constant log2 
based on the data type.
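The truncation Christian warns about can be demonstrated with a small userspace model. `my_fls`/`my_fls64` are stand-in implementations written for this sketch (the kernel's real `fls()`/`fls64()` are architecture-optimized), but they follow the same convention: return the 1-based position of the highest set bit, or 0 for no bits set:

```c
#include <assert.h>
#include <stdint.h>

/* Models the kernel's 32-bit fls(): a page count above 2^32 silently
 * loses its high bits when passed through the 32-bit argument. */
static int my_fls(uint32_t x)
{
        int r = 0;

        while (x) {
                x >>= 1;
                r++;
        }
        return r;
}

/* Models fls64(), which keeps the full 64-bit value. */
static int my_fls64(uint64_t x)
{
        int r = 0;

        while (x) {
                x >>= 1;
                r++;
        }
        return r;
}
```

Passing a 2^40-page count through the 32-bit variant yields 0, so the computed allocation order would be nonsense; hence the suggestion to use fls64() or ilog2().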


Regards,
Christian.



Thanks,
Arun

BTW: I strongly suggest that you rename min_page_size to min_alloc_size.
Otherwise somebody could think that those numbers are in pages and not
bytes.

modified in v12

Then why do you drop the lock and grab it again inside the loop? And what is 
"i" actually good for?

modified the lock/unlock placement in v12.

"i" is to track when there is 2GiB+ contiguous allocation request, first
we allocate 2GiB (due to SG table limit) continuously and the remaining
pages in the next iteration, hence this request can't be a continuous.
To set the placement flag we make use of "i" value. In our case "i"
value becomes 2 and we don't set the below flag.
node->base.placement |= TTM_PL_FLAG_CONTIGUOUS;

If we don't get such requests, I will remove "i".

I'm not sure if that works.

As far as I can see drm_buddy_alloc_blocks() can allocate multiple
blocks at the same time, but i is only incremented when we loop.

So what you should do instead is to check if node->blocks contains
exactly one element after the allocation but before the trim.

ok

+
+   /* Limit maximum size to 2GB due to SG table limitations */
+   pages = min(pages_left, 2UL << (30 - PAGE_SHIFT));

		if (pages >= pages_per_node)

-   alignment = pages_per_node;
-
-   r = drm_mm_insert_node_in_range(mm, &node->mm_nodes[i], pages,
-   alignment, 0, place->fpfn,
-   lpfn, mode);
-   if (unlikely(r)) {
-   if (pages > pages_per_node) {
-   if (is_power_of_2(pages))
-   pages = pages / 2;
-   else
-   pages = rounddown_pow_of_two(pages);
-   continue;
-   }
-   goto error_free;
+   min_page_size = pages_per_node << PAGE_SHIFT;
+
+   if (!is_contiguous && !IS_ALIGNED(pages, min_page_size >> PAGE_SHIFT))
+   is_contiguous = 1;
+
+   if (is_contiguous) {
+   pages = roundup_pow_of_two(pages);
+   min_page_size = pages << PAGE_SHIFT;
+
+   if (pages > lpfn)
+   lpfn = pages;
}

-		vis_usage += amdgpu_vram_mgr_vis_size(adev, &node->mm_nodes[i]);

-   amdgpu_vram_mgr_virt_start(&node->base, &node->mm_nodes[i]);
-   pages_left -= pages;

Re: [PATCH V2] drm/amdgpu: fix incorrect GCR_GENERAL_CNTL address

2022-03-29 Thread Paul Menzel

Dear Ruili,


Am 29.03.22 um 11:17 schrieb Ji, Ruili:


This is not related to any issue.


I didn’t mean an issue (where I’d use Resolves to differentiate the two 
cases), but the commit introducing the incorrect address.



Kind regards,

Paul


PS: Please use interleaved style when replying instead of top-posting.


Re: [PATCH v2] drm/amdgpu: fix that issue that the number of the crtc of the 3250c is not correct

2022-03-29 Thread Paul Menzel

[Cc: Remove undeliverable Chun Ming Zhou ]

Am 30.03.22 um 08:34 schrieb Paul Menzel:

Dear Tsung-Hua,


Thank you for your patch.

Am 30.03.22 um 04:46 schrieb Ryan Lin:

The commit message summary is quite long and confusing. Maybe:

Use 3 CRTC for 3250c to get internal display working


[Why]
External displays take priority over internal display when there are fewer
display controllers than displays.


This causes the internal display to not work on the Chromebook google/zork.


[How]
The root cause is because of that number of the crtc is not correct.


The root cause is the incorrect number of four configured CRTCs.


The number of the crtc on the 3250c is 3, but on the 3500c is 4.
On the source code, we can see that number of the crtc has been fixed at 4.

Needs to set the num_crtc to 3 for 3250c platform.


Please do not wrap lines after each sentence, and use a text width of 75 
characters.



v2:
    - remove unnecessary comments and Id

Signed-off-by: Ryan Lin 

---
  drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 12 +---
  1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c

index 40c91b448f7da..455a2c45e8cda 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -2738,9 +2738,15 @@ static int dm_early_init(void *handle)
  break;
  #if defined(CONFIG_DRM_AMD_DC_DCN1_0)
  case CHIP_RAVEN:
-    adev->mode_info.num_crtc = 4;
-    adev->mode_info.num_hpd = 4;
-    adev->mode_info.num_dig = 4;
+    if (adev->rev_id >= 8) {


Is there some define for that number? Maybe add a comment, that it’s for 
3250c?



Kind regards,

Paul



+    adev->mode_info.num_crtc = 3;
+    adev->mode_info.num_hpd = 3;
+    adev->mode_info.num_dig = 3;
+    } else {
+    adev->mode_info.num_crtc = 4;
+    adev->mode_info.num_hpd = 4;
+    adev->mode_info.num_dig = 4;
+    }
  break;
  #endif
  #if defined(CONFIG_DRM_AMD_DC_DCN2_0)



RE: [PATCH v2] drm/amdgpu: fix that issue that the number of the crtc of the 3250c is not correct

2022-03-29 Thread Lin, Tsung-hua (Ryan)
[AMD Official Use Only]

Hi Chandan,

We found this issue on the Zork project, which uses kernel 5.4, so I
implemented it on kernel 5.4.
To find out which chip is the 3250c, I referenced the check implemented
in another function.
Below is the part where I found it.

drivers/gpu/drm/amd/amdgpu/amdgpu_device.c-
case CHIP_RAVEN:
if (adev->rev_id >= 8)
chip_name = "raven2";
else if (adev->pdev->device == 0x15d8)
chip_name = "picasso";
else
chip_name = "raven";
break;
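The revision check Ryan quotes can be restated as a small standalone helper. This is a sketch for illustration only (`raven_chip_name` is a made-up name); the rev_id threshold and PCI device ID 0x15d8 are taken directly from the quoted snippet:

```c
#include <assert.h>
#include <string.h>

/* Sketch of the Raven-family revision check quoted above:
 * rev_id >= 8 selects "raven2" (the 3250c), PCI device 0x15d8 selects
 * "picasso", and everything else is plain "raven". */
static const char *raven_chip_name(int rev_id, unsigned int pci_device)
{
        if (rev_id >= 8)
                return "raven2";
        if (pci_device == 0x15d8)
                return "picasso";
        return "raven";
}
```

This is the same `rev_id >= 8` condition the patch reuses in dm_early_init() to pick 3 CRTCs instead of 4.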

BR,
Ryan Lin.

-Original Message-
From: VURDIGERENATARAJ, CHANDAN  
Sent: Wednesday, March 30, 2022 12:30 PM
To: Lin, Tsung-hua (Ryan) 
Cc: David (ChunMing) Zhou ; Drew Davenport 
; Li, Sun peng (Leo) ; Li, Leon 
; dri-de...@lists.freedesktop.org; Siqueira, Rodrigo 
; linux-ker...@vger.kernel.org; 
amd-gfx@lists.freedesktop.org; Koenig, Christian ; 
David Airlie ; Sean Paul ; Louis Li 
; Daniel Vetter ; 
Bas Nieuwenhuizen ; Deucher, Alexander 
; Stéphane Marchesin ; 
Kazlauskas, Nicholas ; Wentland, Harry 
; Lin, Tsung-hua (Ryan) ; Mark 
Yacoub 
Subject: RE: [PATCH v2] drm/amdgpu: fix that issue that the number of the crtc 
of the 3250c is not correct


Hi Ryan,

Is this change applicable on a specific kernel version?
On the latest kernel I see an IP-discovery-based implementation for CHIP_RAVEN.

>[Why]
>External displays take priority over internal display when there are fewer 
>display controllers than displays.
>
> [How]
>The root cause is because of that number of the crtc is not correct.
>The number of the crtc on the 3250c is 3, but on the 3500c is 4.
>On the source code, we can see that number of the crtc has been fixed at 4.
>Needs to set the num_crtc to 3 for 3250c platform.
>
>v2:
>   - remove unnecessary comments and Id
>
>Signed-off-by: Ryan Lin 
>
>---
> drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 12 +---
> 1 file changed, 9 insertions(+), 3 deletions(-)
>
>diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
>b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>index 40c91b448f7da..455a2c45e8cda 100644
>--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>@@ -2738,9 +2738,15 @@ static int dm_early_init(void *handle)
>   break;
> #if defined(CONFIG_DRM_AMD_DC_DCN1_0)
>   case CHIP_RAVEN:
>-  adev->mode_info.num_crtc = 4;
>-  adev->mode_info.num_hpd = 4;
>-  adev->mode_info.num_dig = 4;
>+  if (adev->rev_id >= 8) {

May I know what this ">=8" indicates? Also, should it be external_rev_id if it's 
based on an old version?

>+  adev->mode_info.num_crtc = 3;
>+  adev->mode_info.num_hpd = 3;
>+  adev->mode_info.num_dig = 3;
>+  } else {
>+  adev->mode_info.num_crtc = 4;
>+  adev->mode_info.num_hpd = 4;
>+  adev->mode_info.num_dig = 4;
>+  }
>   break;
> #endif
> #if defined(CONFIG_DRM_AMD_DC_DCN2_0)
>--
>2.25.1
>

BR,
Chandan V N


RE: [PATCH v2] drm/amdgpu: fix that issue that the number of the crtc of the 3250c is not correct

2022-03-29 Thread VURDIGERENATARAJ, CHANDAN


Hi Ryan,

Is this change applicable on a specific kernel version?
On the latest kernel I see an IP-discovery-based implementation for CHIP_RAVEN.

>[Why]
>External displays take priority over internal display when there are fewer 
>display controllers than displays.
>
> [How]
>The root cause is because of that number of the crtc is not correct.
>The number of the crtc on the 3250c is 3, but on the 3500c is 4.
>On the source code, we can see that number of the crtc has been fixed at 4.
>Needs to set the num_crtc to 3 for 3250c platform.
>
>v2:
>   - remove unnecessary comments and Id
>
>Signed-off-by: Ryan Lin 
>
>---
> drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 12 +---
> 1 file changed, 9 insertions(+), 3 deletions(-)
>
>diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
>b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>index 40c91b448f7da..455a2c45e8cda 100644
>--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>@@ -2738,9 +2738,15 @@ static int dm_early_init(void *handle)
>   break;
> #if defined(CONFIG_DRM_AMD_DC_DCN1_0)
>   case CHIP_RAVEN:
>-  adev->mode_info.num_crtc = 4;
>-  adev->mode_info.num_hpd = 4;
>-  adev->mode_info.num_dig = 4;
>+  if (adev->rev_id >= 8) {

May I know what this ">=8" indicates? Also, should it be external_rev_id if it's 
based on an old version?

>+  adev->mode_info.num_crtc = 3;
>+  adev->mode_info.num_hpd = 3;
>+  adev->mode_info.num_dig = 3;
>+  } else {
>+  adev->mode_info.num_crtc = 4;
>+  adev->mode_info.num_hpd = 4;
>+  adev->mode_info.num_dig = 4;
>+  }
>   break;
> #endif
> #if defined(CONFIG_DRM_AMD_DC_DCN2_0)
>--
>2.25.1
>

BR,
Chandan V N


Re: [PATCH v2] drm/amdgpu: fix that issue that the number of the crtc of the 3250c is not correct

2022-03-29 Thread Alex Deucher
On Tue, Mar 29, 2022 at 10:57 PM Ryan Lin  wrote:
>
> [Why]
> External displays take priority over internal display when there are fewer
> display controllers than displays.
>
> [How]
> The root cause is because of that number of the crtc is not correct.
> The number of the crtc on the 3250c is 3, but on the 3500c is 4.
> On the source code, we can see that number of the crtc has been fixed at 4.
> Needs to set the num_crtc to 3 for 3250c platform.
>
> v2:
>- remove unnecessary comments and Id
>
> Signed-off-by: Ryan Lin 

Reviewed-by: Alex Deucher 

>
> ---
>  drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 12 +---
>  1 file changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
> b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> index 40c91b448f7da..455a2c45e8cda 100644
> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> @@ -2738,9 +2738,15 @@ static int dm_early_init(void *handle)
> break;
>  #if defined(CONFIG_DRM_AMD_DC_DCN1_0)
> case CHIP_RAVEN:
> -   adev->mode_info.num_crtc = 4;
> -   adev->mode_info.num_hpd = 4;
> -   adev->mode_info.num_dig = 4;
> +   if (adev->rev_id >= 8) {
> +   adev->mode_info.num_crtc = 3;
> +   adev->mode_info.num_hpd = 3;
> +   adev->mode_info.num_dig = 3;
> +   } else {
> +   adev->mode_info.num_crtc = 4;
> +   adev->mode_info.num_hpd = 4;
> +   adev->mode_info.num_dig = 4;
> +   }
> break;
>  #endif
>  #if defined(CONFIG_DRM_AMD_DC_DCN2_0)
> --
> 2.25.1
>


[PATCH v2] drm/amdgpu: fix that issue that the number of the crtc of the 3250c is not correct

2022-03-29 Thread Ryan Lin
[Why]
External displays take priority over internal display when there are fewer
display controllers than displays.

[How]
The root cause is because of that number of the crtc is not correct.
The number of the crtc on the 3250c is 3, but on the 3500c is 4.
On the source code, we can see that number of the crtc has been fixed at 4.
Needs to set the num_crtc to 3 for 3250c platform.

v2:
   - remove unnecessary comments and Id

Signed-off-by: Ryan Lin 

---
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 40c91b448f7da..455a2c45e8cda 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -2738,9 +2738,15 @@ static int dm_early_init(void *handle)
break;
 #if defined(CONFIG_DRM_AMD_DC_DCN1_0)
case CHIP_RAVEN:
-   adev->mode_info.num_crtc = 4;
-   adev->mode_info.num_hpd = 4;
-   adev->mode_info.num_dig = 4;
+   if (adev->rev_id >= 8) {
+   adev->mode_info.num_crtc = 3;
+   adev->mode_info.num_hpd = 3;
+   adev->mode_info.num_dig = 3;
+   } else {
+   adev->mode_info.num_crtc = 4;
+   adev->mode_info.num_hpd = 4;
+   adev->mode_info.num_dig = 4;
+   }
break;
 #endif
 #if defined(CONFIG_DRM_AMD_DC_DCN2_0)
-- 
2.25.1



RE: [PATCH] drm/amd/display: Fix by adding FPU protection for dcn30_internal_validate_bw

2022-03-29 Thread VURDIGERENATARAJ, CHANDAN
Hi Paul,

>Am 29.03.22 um 10:29 schrieb CHANDAN VURDIGERE NATARAJ:
>
>Is it common to spell your name all uppercase? If not, please use Chandan 
>Vurdigere Nataraj.
>
>> [WHY]
>
>The [] already emphasize the word, so Why could be used.
>
>> Below general protection fault observed when WebGL Aquarium is run for 
>> longer duration. If drm debug logs are enabled and set to 0x1f then 
>> the
>
>In what browser and what version?
The issue was observed on ChromiumOS and Chromium Browser version 100.0.4877.0
>
>> issue is observed within 10 minutes of run.
>
>Were you able to reproduce it without drm debug logs?
Yes. It took 34 hours to reproduce without drm debug logs. Using drm debug logs 
was a faster way to reproduce the issue.
>
>> [  100.717056] general protection fault, probably for non-canonical address 
>> 0x2d33302d32323032:  [#1] PREEMPT SMP NOPTI
>> [  100.727921] CPU: 3 PID: 1906 Comm: DrmThread Tainted: GW 
>> 5.15.30 #12 d726c6a2d6ebe5cf9223931cbca6892f916fe18b
>> [  100.754419] RIP: 0010:CalculateSwathWidth+0x1f7/0x44f
>> [  100.767109] Code: 00 00 00 f2 42 0f 11 04 f0 48 8b 85 88 00 00 00 
>> f2 42 0f 10 04 f0 48 8b 85 98 00 00 00 f2 42 0f 11 04 f0 48 8b 45 10 
>> 0f 57 c0  42 0f 2a 04 b0 0f 57 c9 f3 43 0f 2a 0c b4 e8 8c e2 f3 ff 
>> 48 8b [  100.781269] RSP: 0018:a9230079eeb0 EFLAGS: 00010246 [  
>> 100.812528] RAX: 2d33302d32323032 RBX: 0500 RCX: 
>>  [  100.819656] RDX: 0001 RSI: 
>> 99deb712c49c RDI:  [  100.826781] RBP: 
>> a9230079ef50 R08: 99deb712460c R09: 99deb712462c [  
>> 100.833907] R10: 99deb7124940 R11: 99deb7124d70 R12: 
>> 99deb712ae44 [  100.841033] R13: 0001 R14: 
>>  R15: a9230079f0a0 [  100.848159] FS:  
>> 7af121212640() GS:99deba78() knlGS: [  
>> 100.856240] CS:  0010 DS:  ES:  CR0: 80050033 [  100.861980] 
>> CR2: 209000fe1000 CR3: >00011b18c000 CR4: 00350ee0 [  
>> 100.869106] Call Trace:
>> [  100.871555]  
>> [  100.873655]  ? asm_sysvec_reschedule_ipi+0x12/0x20
>> [  100.878449]  CalculateSwathAndDETConfiguration+0x1a3/0x6dd
>> [  100.883937]  
>> dml31_ModeSupportAndSystemConfigurationFull+0x2ce4/0x76da
>> [  100.890467]  ? kallsyms_lookup_buildid+0xc8/0x163
>> [  100.895173]  ? kallsyms_lookup_buildid+0xc8/0x163
>> [  100.899874]  ? __sprint_symbol+0x80/0x135 [  100.903883]  ? 
>> dm_update_plane_state+0x3f9/0x4d2 [  100.908500]  ? 
>> symbol_string+0xb7/0xde [  100.912250]  ? number+0x145/0x29b [  
>> 100.915566]  ? vsnprintf+0x341/0x5ff [  100.919141]  ? 
>> desc_read_finalized_seq+0x39/0x87 [  100.923755]  ? 
>> update_load_avg+0x1b9/0x607 [  100.927849]  ? 
>> compute_mst_dsc_configs_for_state+0x7d/0xd5b
>> [  100.933416]  ? fetch_pipe_params+0xa4d/0xd0c [  100.937686]  ? 
>> dc_fpu_end+0x3d/0xa8 [  100.941175]  dml_get_voltage_level+0x16b/0x180 
>> [  100.945619]  dcn30_internal_validate_bw+0x10e/0x89b
>> [  100.950495]  ? dcn31_validate_bandwidth+0x68/0x1fc
>> [  100.955285]  ? resource_build_scaling_params+0x98b/0xb8c
>> [  100.960595]  ? dcn31_validate_bandwidth+0x68/0x1fc
>> [  100.965384]  dcn31_validate_bandwidth+0x9a/0x1fc
>> [  100.970001]  dc_validate_global_state+0x238/0x295
>> [  100.974703]  amdgpu_dm_atomic_check+0x9c1/0xbce
>> [  100.979235]  ? _printk+0x59/0x73
>> [  100.982467]  drm_atomic_check_only+0x403/0x78b [  100.986912]  
>> drm_mode_atomic_ioctl+0x49b/0x546 [  100.991358]  ? 
>> drm_ioctl+0x1c1/0x3b3 [  100.994936]  ? 
>> drm_atomic_set_property+0x92a/0x92a
>> [  100.999725]  drm_ioctl_kernel+0xdc/0x149 [  101.003648]  
>> drm_ioctl+0x27f/0x3b3 [  101.007051]  ? 
>> drm_atomic_set_property+0x92a/0x92a
>> [  101.011842]  amdgpu_drm_ioctl+0x49/0x7d [  101.015679]  
>> __se_sys_ioctl+0x7c/0xb8 [  101.015685]  do_syscall_64+0x5f/0xb8 [  
>> 101.015690]  ? __irq_exit_rcu+0x34/0x96
>> 
>> [HOW]
>> It calles populate_dml_pipes which uses doubles to initialize.
>
>calls
>
>Excuse my ignorance. So using doubles causes a context switch?
If we don't add FPU protection then a context switch can happen. DC_FP_START 
would in turn call preempt_disable.
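The pairing Chandan describes can be modeled with a depth counter. This is purely a toy userspace model of the DC_FP_START()/DC_FP_END() discipline (the real macros wrap kernel_fpu_begin()/preempt_disable() machinery); the function names here are mine:

```c
#include <assert.h>

/* Toy model of DC_FP_START()/DC_FP_END(): like
 * preempt_disable()/preempt_enable(), the calls nest, and FPU use (here,
 * the doubles in populate_dml_pipes) is only safe while depth > 0. */
static int fp_depth;

static void dc_fp_start(void) { fp_depth++; }

static void dc_fp_end(void)
{
        assert(fp_depth > 0);   /* unbalanced END is a bug */
        fp_depth--;
}

static int fpu_safe(void) { return fp_depth > 0; }
```

The patch below simply ensures dcn30_internal_validate_bw() runs with depth > 0, so a context switch cannot clobber the in-progress FPU/vba state.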

>> Adding FPU protection avoids context switch and probable loss of vba 
>> context as there is potential contention while drm debug logs are enabled.
>> 
>> Signed-off-by: CHANDAN VURDIGERE NATARAJ 
>> 
>> 
>> diff --git a/drivers/gpu/drm/amd/display/dc/dcn31/dcn31_resource.c 
>> b/drivers/gpu/drm/amd/display/dc/dcn31/dcn31_resource.c
>> index 826970f2bd0a..f27262417abe 100644
>> --- a/drivers/gpu/drm/amd/display/dc/dcn31/dcn31_resource.c
>> +++ b/drivers/gpu/drm/amd/display/dc/dcn31/dcn31_resource.c
>> @@ -1750,7 +1750,9 @@ bool dcn31_validate_bandwidth(struct dc *dc,
>>   
>>  BW_VAL_TRACE_COUNT();
>>   
>> +DC_FP_START();
>>  out = dcn30_internal_validate_bw(dc, context, pipes, &pipe_cnt, 
>> &vlevel, fast_validate);
>> +DC_FP_END();
>>   
>>  // Disable fast_validate to se

[PATCH v2] drm/amd/display: don't ignore alpha property on pre-multiplied mode

2022-03-29 Thread Melissa Wen
"Pre-multiplied" is the default pixel blend mode for KMS/DRM, as
documented in supported_modes of drm_plane_create_blend_mode_property():
https://cgit.freedesktop.org/drm/drm-misc/tree/drivers/gpu/drm/drm_blend.c

In this mode, both 'pixel alpha' and 'plane alpha' participate in the
calculation, as described by the pixel blend mode formula in KMS/DRM
documentation:

out.rgb = plane_alpha * fg.rgb +
  (1 - (plane_alpha * fg.alpha)) * bg.rgb
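The formula above can be checked numerically with a tiny helper. This is only an illustrative evaluation of the documented blend equation for one color channel (the function name is mine, values normalized to [0, 1], with fg.rgb already pre-multiplied by fg.alpha):

```c
#include <assert.h>

/* One-channel evaluation of the KMS pre-multiplied blend formula:
 *   out.rgb = plane_alpha * fg.rgb + (1 - plane_alpha * fg.alpha) * bg.rgb
 * fg_rgb is assumed to be pre-multiplied by fg_alpha. */
static double blend_premultiplied(double plane_alpha, double fg_rgb,
                                  double fg_alpha, double bg_rgb)
{
        return plane_alpha * fg_rgb +
               (1.0 - plane_alpha * fg_alpha) * bg_rgb;
}
```

Note that both the per-pixel alpha (fg_alpha) and the plane alpha participate, which is why the combined global-gain mode below is the right fit: ignoring plane_alpha (as before this patch) gives a different result whenever it is not 1.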

Considering the blend config mechanisms we have in the driver so far,
the alpha mode that better fits this blend mode is the
_PER_PIXEL_ALPHA_COMBINED_GLOBAL_GAIN, where the value for global_gain
is the plane alpha (global_alpha).

With this change, the alpha property is no longer ignored. It also addresses
Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1734

v2:
 * keep the 8-bit value for global_alpha_value (Nicholas)
 * correct the logical ordering for combined global gain (Nicholas)
 * apply to dcn10 too (Nicholas)

Signed-off-by: Melissa Wen 
---
 .../drm/amd/display/dc/dcn10/dcn10_hw_sequencer.c  | 14 +-
 drivers/gpu/drm/amd/display/dc/dcn20/dcn20_hwseq.c | 14 +-
 2 files changed, 18 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/dc/dcn10/dcn10_hw_sequencer.c 
b/drivers/gpu/drm/amd/display/dc/dcn10/dcn10_hw_sequencer.c
index ad757b59e00e..b1034e6258c8 100644
--- a/drivers/gpu/drm/amd/display/dc/dcn10/dcn10_hw_sequencer.c
+++ b/drivers/gpu/drm/amd/display/dc/dcn10/dcn10_hw_sequencer.c
@@ -2528,14 +2528,18 @@ void dcn10_update_mpcc(struct dc *dc, struct pipe_ctx 
*pipe_ctx)
struct mpc *mpc = dc->res_pool->mpc;
	struct mpc_tree *mpc_tree_params = &(pipe_ctx->stream_res.opp->mpc_tree_params);
 
-   if (per_pixel_alpha)
-   blnd_cfg.alpha_mode = MPCC_ALPHA_BLEND_MODE_PER_PIXEL_ALPHA;
-   else
-   blnd_cfg.alpha_mode = MPCC_ALPHA_BLEND_MODE_GLOBAL_ALPHA;
-
blnd_cfg.overlap_only = false;
blnd_cfg.global_gain = 0xff;
 
+   if (per_pixel_alpha && pipe_ctx->plane_state->global_alpha) {
+   blnd_cfg.alpha_mode = MPCC_ALPHA_BLEND_MODE_PER_PIXEL_ALPHA_COMBINED_GLOBAL_GAIN;
+   blnd_cfg.global_gain = pipe_ctx->plane_state->global_alpha_value;
+   } else if (per_pixel_alpha) {
+   blnd_cfg.alpha_mode = MPCC_ALPHA_BLEND_MODE_PER_PIXEL_ALPHA;
+   } else {
+   blnd_cfg.alpha_mode = MPCC_ALPHA_BLEND_MODE_GLOBAL_ALPHA;
+   }
+
if (pipe_ctx->plane_state->global_alpha)
	blnd_cfg.global_alpha = pipe_ctx->plane_state->global_alpha_value;
else
diff --git a/drivers/gpu/drm/amd/display/dc/dcn20/dcn20_hwseq.c 
b/drivers/gpu/drm/amd/display/dc/dcn20/dcn20_hwseq.c
index 4290eaf11a04..b627c41713cc 100644
--- a/drivers/gpu/drm/amd/display/dc/dcn20/dcn20_hwseq.c
+++ b/drivers/gpu/drm/amd/display/dc/dcn20/dcn20_hwseq.c
@@ -2344,14 +2344,18 @@ void dcn20_update_mpcc(struct dc *dc, struct pipe_ctx 
*pipe_ctx)
struct mpc *mpc = dc->res_pool->mpc;
	struct mpc_tree *mpc_tree_params = &(pipe_ctx->stream_res.opp->mpc_tree_params);
 
-   if (per_pixel_alpha)
-   blnd_cfg.alpha_mode = MPCC_ALPHA_BLEND_MODE_PER_PIXEL_ALPHA;
-   else
-   blnd_cfg.alpha_mode = MPCC_ALPHA_BLEND_MODE_GLOBAL_ALPHA;
-
blnd_cfg.overlap_only = false;
blnd_cfg.global_gain = 0xff;
 
+   if (per_pixel_alpha && pipe_ctx->plane_state->global_alpha) {
+   blnd_cfg.alpha_mode = MPCC_ALPHA_BLEND_MODE_PER_PIXEL_ALPHA_COMBINED_GLOBAL_GAIN;
+   blnd_cfg.global_gain = pipe_ctx->plane_state->global_alpha_value;
+   } else if (per_pixel_alpha) {
+   blnd_cfg.alpha_mode = MPCC_ALPHA_BLEND_MODE_PER_PIXEL_ALPHA;
+   } else {
+   blnd_cfg.alpha_mode = MPCC_ALPHA_BLEND_MODE_GLOBAL_ALPHA;
+   }
+
if (pipe_ctx->plane_state->global_alpha)
	blnd_cfg.global_alpha = pipe_ctx->plane_state->global_alpha_value;
else
-- 
2.35.1



Re: [PATCH v11] drm/amdgpu: add drm buddy support to amdgpu

2022-03-29 Thread Arunpravin Paneer Selvam



On 29/03/22 9:30 pm, Arunpravin Paneer Selvam wrote:
> 
> 
> On 29/03/22 4:54 pm, Christian König wrote:
>> Am 29.03.22 um 13:19 schrieb Arunpravin Paneer Selvam:
>>> [SNIP]
> + pages_left = node->base.num_pages;
>
>   i = 0;
> - spin_lock(&mgr->lock);
>   while (pages_left) {
> - uint32_t alignment = tbo->page_alignment;
> + if (tbo->page_alignment)
> + min_page_size = tbo->page_alignment << PAGE_SHIFT;
> + else
> + min_page_size = mgr->default_page_size;
 The handling here looks extremely awkward to me.

 min_page_size should be determined outside of the loop, based on 
 default_page_size, alignment and contiguous flag.
>>> I had kept the min_page_size determination logic inside the loop for 2GiB+
>>> requirements. Since we now round up the size to the required alignment,
>>> I moved the min_page_size determination outside of the loop in v12.
>>> Please review.
>>
>> Ah! So do we only have the loop so that each allocation isn't bigger 
>> than 2GiB? If yes couldn't we instead add a max_alloc_size or something 
>> similar?
> Yes, we have the loop to limit each allocation to no more than 2GiB. I
> think we cannot avoid the loop, since we need to allocate any remaining
> pages to complete a 2GiB+ request. In other words, in the first
> iteration we limit the max size to 2GiB, and in the second iteration we
> allocate the leftover pages, if any.

Hi Christian,

My understanding here might be incorrect: should we limit the max size to
2GiB and skip all the remaining pages for a 2GiB+ request?

Thanks,
Arun
>>
>> BTW: I strongly suggest that you rename min_page_size to min_alloc_size. 
>> Otherwise somebody could think that those numbers are in pages and not 
>> bytes.
> modified in v12
>>
 Then why do you drop the lock and grab it again inside the loop? And what 
 is "i" actually good for?
>>> modified the lock/unlock placement in v12.
>>>
>>> "i" is to track when there is 2GiB+ contiguous allocation request, first
>>> we allocate 2GiB (due to SG table limit) continuously and the remaining
>>> pages in the next iteration, hence this request can't be a continuous.
>>> To set the placement flag we make use of "i" value. In our case "i"
>>> value becomes 2 and we don't set the below flag.
>>> node->base.placement |= TTM_PL_FLAG_CONTIGUOUS;
>>>
>>> If we don't get such requests, I will remove "i".
>>
>> I'm not sure if that works.
>>
>> As far as I can see drm_buddy_alloc_blocks() can allocate multiple 
>> blocks at the same time, but i is only incremented when we loop.
>>
>> So what you should do instead is to check if node->blocks contains
>> exactly one element after the allocation but before the trim.
> ok
>>
> +
> + /* Limit maximum size to 2GB due to SG table limitations */
> + pages = min(pages_left, 2UL << (30 - PAGE_SHIFT));
>
>   if (pages >= pages_per_node)
> - alignment = pages_per_node;
> -
> - r = drm_mm_insert_node_in_range(mm, &node->mm_nodes[i], pages,
> - alignment, 0, place->fpfn,
> - lpfn, mode);
> - if (unlikely(r)) {
> - if (pages > pages_per_node) {
> - if (is_power_of_2(pages))
> - pages = pages / 2;
> - else
> - pages = rounddown_pow_of_two(pages);
> - continue;
> - }
> - goto error_free;
> + min_page_size = pages_per_node << PAGE_SHIFT;
> +
> + if (!is_contiguous && !IS_ALIGNED(pages, min_page_size >> PAGE_SHIFT))
> + is_contiguous = 1;
> +
> + if (is_contiguous) {
> + pages = roundup_pow_of_two(pages);
> + min_page_size = pages << PAGE_SHIFT;
> +
> + if (pages > lpfn)
> + lpfn = pages;
>   }
>
> - vis_usage += amdgpu_vram_mgr_vis_size(adev, &node->mm_nodes[i]);
> - amdgpu_vram_mgr_virt_start(&node->base, &node->mm_nodes[i]);
> - pages_left -= pages;
> + BUG_ON(min_page_size < mm->chunk_size);
> +
> + mutex_lock(&mgr->lock);
> + r = drm_buddy_alloc_blocks(mm, (u64)place->fpfn << PAGE_SHIFT,
> +(u64)lpfn << PAGE_SHIFT,
> +(u64)pages << PAGE_SHIFT,
> +min_page_size,
> +&node->blocks,
> +node->flags);
> + mutex_unlock(&mgr->lock);
> + if (unlikely(r)

RE: [PATCH 4/4] drm/amdgpu: Add unique_id support for sienna cichlid

2022-03-29 Thread Russell, Kent
[AMD Official Use Only]

> -Original Message-
> From: Alex Deucher 
> Sent: Tuesday, March 29, 2022 11:28 AM
> To: Russell, Kent 
> Cc: amd-gfx list 
> Subject: Re: [PATCH 4/4] drm/amdgpu: Add unique_id support for sienna cichlid
>
> On Tue, Mar 29, 2022 at 11:10 AM Kent Russell  wrote:
> >
> > This is being added to SMU Metrics, so add the required tie-ins in the
> > kernel. Also create the corresponding unique_id sysfs file.
> >
> > v2: Add FW version check, remove SMU mutex
> > v3: Fix style warning
> > v4: Add MP1 IP_VERSION check to FW version check
> >
> > Signed-off-by: Kent Russell 
> > ---
> >  drivers/gpu/drm/amd/pm/amdgpu_pm.c|  1 +
> >  .../pmfw_if/smu11_driver_if_sienna_cichlid.h  | 13 ++--
> >  .../amd/pm/swsmu/smu11/sienna_cichlid_ppt.c   | 33 +++
> >  3 files changed, 45 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/pm/amdgpu_pm.c
> b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
> > index 4151db2678fb..4a9aabc16fbc 100644
> > --- a/drivers/gpu/drm/amd/pm/amdgpu_pm.c
> > +++ b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
> > @@ -1993,6 +1993,7 @@ static int default_attr_update(struct amdgpu_device 
> > *adev,
> struct amdgpu_device_
> > case IP_VERSION(9, 4, 0):
> > case IP_VERSION(9, 4, 1):
> > case IP_VERSION(9, 4, 2):
> > +   case IP_VERSION(10, 3, 0):
> > *states = ATTR_STATE_SUPPORTED;
> > break;
> > default:
> > diff --git
> a/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu11_driver_if_sienna_cichlid.h
> b/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu11_driver_if_sienna_cichlid.h
> > index 3e4a314ef925..5831145646e6 100644
> > --- 
> > a/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu11_driver_if_sienna_cichlid.h
> > +++ 
> > b/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu11_driver_if_sienna_cichlid.h
> > @@ -1419,8 +1419,11 @@ typedef struct {
> >uint8_t  PcieRate   ;
> >uint8_t  PcieWidth  ;
> >uint16_t AverageGfxclkFrequencyTarget;
> > -  uint16_t Padding16_2;
> >
> > +  uint32_t PublicSerialNumLower32;
> > +  uint32_t PublicSerialNumUpper32;
> > +
> > +  uint16_t Padding16_2;
> >  } SmuMetrics_t;
> >
> >  typedef struct {
> > @@ -1476,8 +1479,11 @@ typedef struct {
> >uint8_t  PcieRate   ;
> >uint8_t  PcieWidth  ;
> >uint16_t AverageGfxclkFrequencyTarget;
> > -  uint16_t Padding16_2;
> >
> > +  uint32_t PublicSerialNumLower32;
> > +  uint32_t PublicSerialNumUpper32;
> > +
> > +  uint16_t Padding16_2;
> >  } SmuMetrics_V2_t;
> >
> >  typedef struct {
> > @@ -1535,6 +1541,9 @@ typedef struct {
> >uint8_t  PcieWidth;
> >uint16_t AverageGfxclkFrequencyTarget;
> >
> > +  uint32_t PublicSerialNumLower32;
> > +  uint32_t PublicSerialNumUpper32;
> > +
> >  } SmuMetrics_V3_t;
> >
>
> Was this really added to all three versions of the metrics table?  If
> it's a new addition, presumably it's only in v3?  Other than that, the
> series is:
> Reviewed-by: Alex Deucher 
Apparently it was. I checked with the PMFW dev and he said it was there, so I'm 
trusting him on that one. Thanks for the reviews!

 Kent


>
> Alex
>
> >  typedef struct {
> > diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
> b/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
> > index 38f04836c82f..b2f3d80e5945 100644
> > --- a/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
> > +++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
> > @@ -715,6 +715,16 @@ static int sienna_cichlid_get_smu_metrics_data(struct
> smu_context *smu,
> > *value = use_metrics_v3 ? metrics_v3->CurrFanSpeed :
> > use_metrics_v2 ? metrics_v2->CurrFanSpeed : 
> > metrics->CurrFanSpeed;
> > break;
> > +   case METRICS_UNIQUE_ID_UPPER32:
> > +   *value = use_metrics_v3 ? 
> > metrics_v3->PublicSerialNumUpper32 :
> > +   use_metrics_v2 ? metrics_v2->PublicSerialNumUpper32 
> > :
> > +   metrics->PublicSerialNumUpper32;
> > +   break;
> > +   case METRICS_UNIQUE_ID_LOWER32:
> > +   *value = use_metrics_v3 ? 
> > metrics_v3->PublicSerialNumLower32 :
> > +   use_metrics_v2 ? metrics_v2->PublicSerialNumLower32 
> > :
> > +   metrics->PublicSerialNumLower32;
> > +   break;
> > default:
> > *value = UINT_MAX;
> > break;
> > @@ -1773,6 +1783,28 @@ static int sienna_cichlid_read_sensor(struct 
> > smu_context
> *smu,
> > return ret;
> >  }
> >
> > +static void sienna_cichlid_get_unique_id(struct smu_context *smu)
> > +{
> > +   struct amdgpu_device *adev = smu->adev;
> > +   uint32_t upper32 = 0, lower32 = 0;
> > +
> > +   /* Only supported as of version 0.58.83.0 and only on Sienna 
> > Cichlid */
> > +   if (smu->smc_fw_version < 0x

Re: [PATCH] drm/amd/display: Fix by adding FPU protection for dcn30_internal_validate_bw

2022-03-29 Thread Paul Menzel

Dear Chandan,


On 29.03.22 at 10:29, CHANDAN VURDIGERE NATARAJ wrote:

Is it common to spell your name all uppercase? If not, please use 
Chandan Vurdigere Nataraj.



[WHY]


The [] already emphasize the word, so plain "Why" could be used.


Below general protection fault observed when WebGL Aquarium is run for
longer duration. If drm debug logs are enabled and set to 0x1f then the


In what browser and what version?


issue is observed within 10 minutes of run.


Were you able to reproduce it without drm debug logs?


[  100.717056] general protection fault, probably for non-canonical address 
0x2d33302d32323032:  [#1] PREEMPT SMP NOPTI
[  100.727921] CPU: 3 PID: 1906 Comm: DrmThread Tainted: GW 
5.15.30 #12 d726c6a2d6ebe5cf9223931cbca6892f916fe18b
[  100.754419] RIP: 0010:CalculateSwathWidth+0x1f7/0x44f
[  100.767109] Code: 00 00 00 f2 42 0f 11 04 f0 48 8b 85 88 00 00 00 f2 42 0f 10 04 
f0 48 8b 85 98 00 00 00 f2 42 0f 11 04 f0 48 8b 45 10 0f 57 c0  42 0f 2a 04 
b0 0f 57 c9 f3 43 0f 2a 0c b4 e8 8c e2 f3 ff 48 8b
[  100.781269] RSP: 0018:a9230079eeb0 EFLAGS: 00010246
[  100.812528] RAX: 2d33302d32323032 RBX: 0500 RCX: 
[  100.819656] RDX: 0001 RSI: 99deb712c49c RDI: 
[  100.826781] RBP: a9230079ef50 R08: 99deb712460c R09: 99deb712462c
[  100.833907] R10: 99deb7124940 R11: 99deb7124d70 R12: 99deb712ae44
[  100.841033] R13: 0001 R14:  R15: a9230079f0a0
[  100.848159] FS:  7af121212640() GS:99deba78() 
knlGS:
[  100.856240] CS:  0010 DS:  ES:  CR0: 80050033
[  100.861980] CR2: 209000fe1000 CR3: 00011b18c000 CR4: 00350ee0
[  100.869106] Call Trace:
[  100.871555]  
[  100.873655]  ? asm_sysvec_reschedule_ipi+0x12/0x20
[  100.878449]  CalculateSwathAndDETConfiguration+0x1a3/0x6dd
[  100.883937]  dml31_ModeSupportAndSystemConfigurationFull+0x2ce4/0x76da
[  100.890467]  ? kallsyms_lookup_buildid+0xc8/0x163
[  100.895173]  ? kallsyms_lookup_buildid+0xc8/0x163
[  100.899874]  ? __sprint_symbol+0x80/0x135
[  100.903883]  ? dm_update_plane_state+0x3f9/0x4d2
[  100.908500]  ? symbol_string+0xb7/0xde
[  100.912250]  ? number+0x145/0x29b
[  100.915566]  ? vsnprintf+0x341/0x5ff
[  100.919141]  ? desc_read_finalized_seq+0x39/0x87
[  100.923755]  ? update_load_avg+0x1b9/0x607
[  100.927849]  ? compute_mst_dsc_configs_for_state+0x7d/0xd5b
[  100.933416]  ? fetch_pipe_params+0xa4d/0xd0c
[  100.937686]  ? dc_fpu_end+0x3d/0xa8
[  100.941175]  dml_get_voltage_level+0x16b/0x180
[  100.945619]  dcn30_internal_validate_bw+0x10e/0x89b
[  100.950495]  ? dcn31_validate_bandwidth+0x68/0x1fc
[  100.955285]  ? resource_build_scaling_params+0x98b/0xb8c
[  100.960595]  ? dcn31_validate_bandwidth+0x68/0x1fc
[  100.965384]  dcn31_validate_bandwidth+0x9a/0x1fc
[  100.970001]  dc_validate_global_state+0x238/0x295
[  100.974703]  amdgpu_dm_atomic_check+0x9c1/0xbce
[  100.979235]  ? _printk+0x59/0x73
[  100.982467]  drm_atomic_check_only+0x403/0x78b
[  100.986912]  drm_mode_atomic_ioctl+0x49b/0x546
[  100.991358]  ? drm_ioctl+0x1c1/0x3b3
[  100.994936]  ? drm_atomic_set_property+0x92a/0x92a
[  100.999725]  drm_ioctl_kernel+0xdc/0x149
[  101.003648]  drm_ioctl+0x27f/0x3b3
[  101.007051]  ? drm_atomic_set_property+0x92a/0x92a
[  101.011842]  amdgpu_drm_ioctl+0x49/0x7d
[  101.015679]  __se_sys_ioctl+0x7c/0xb8
[  101.015685]  do_syscall_64+0x5f/0xb8
[  101.015690]  ? __irq_exit_rcu+0x34/0x96

[HOW]
It calles populate_dml_pipes which uses doubles to initialize.


calls

Excuse my ignorance. So using doubles causes a context switch?


Adding FPU protection avoids context switch and probable loss of vba context
as there is potential contention while drm debug logs are enabled.

Signed-off-by: CHANDAN VURDIGERE NATARAJ 

diff --git a/drivers/gpu/drm/amd/display/dc/dcn31/dcn31_resource.c 
b/drivers/gpu/drm/amd/display/dc/dcn31/dcn31_resource.c
index 826970f2bd0a..f27262417abe 100644
--- a/drivers/gpu/drm/amd/display/dc/dcn31/dcn31_resource.c
+++ b/drivers/gpu/drm/amd/display/dc/dcn31/dcn31_resource.c
@@ -1750,7 +1750,9 @@ bool dcn31_validate_bandwidth(struct dc *dc,
  
  	BW_VAL_TRACE_COUNT();
  
+	DC_FP_START();

out = dcn30_internal_validate_bw(dc, context, pipes, &pipe_cnt, 
&vlevel, fast_validate);
+   DC_FP_END();
  
  	// Disable fast_validate to set min dcfclk in alculate_wm_and_dlg

if (pipe_cnt == 0)



Kind regards,

Paul


Re: [RFC PATCH] drm/amd/display: dont ignore alpha property

2022-03-29 Thread Melissa Wen
On 03/28, Melissa Wen wrote:
> On 03/28, Kazlauskas, Nicholas wrote:
> > [AMD Official Use Only]
> > 
> > > -Original Message-
> > > From: Melissa Wen 
> > > Sent: Friday, March 25, 2022 4:45 PM
> > > To: amd-gfx@lists.freedesktop.org; Wentland, Harry
> > > ; Deucher, Alexander
> > > ; Siqueira, Rodrigo
> > > ; Kazlauskas, Nicholas
> > > ; Gutierrez, Agustin
> > > ; Liu, Zhan 
> > > Cc: dri-de...@lists.freedesktop.org; Simon Ser 
> > > Subject: [RFC PATCH] drm/amd/display: dont ignore alpha property
> > > Importance: High
> > >
> > > Hi all,
> > >
> > > I'm examining the IGT kms_plane_alpha_blend test, specifically the
> > > alpha-7efc. It fails on AMD and Intel gen8 hw, but passes on Intel
> > > gen11. At first, I thought it was a rounding issue. In fact, it may be
> > > the problem for different results between intel hw generations.
> > >
> > > However, I changed the test locally to compare CRCs for all alpha values
> > > in the range before the test fails. Interestingly, it fails for all
> > > values when running on AMD, even when comparing planes with zero alpha
> > > (fully transparent). Moreover, I see the same CRC values regardless of
> > > the value set in the alpha property.
> > >
> > > To ensure that the blending mode is as expected, I explicitly set the
> > > Pre-multiplied blending mode in the test. Then I tried using different
> > > framebuffer data and alpha values. I've tried obvious comparisons too,
> > > such as fully opaque and fully transparent.
> > >
> > > As far as I could verify and understand, the value set for the ALPHA
> > > property is totally ignored by AMD drivers. I'm not sure if this is a
> > > matter of how we interpret the meaning of the premultiplied blend mode
> > > or the driver's assumptions about the data in these blend modes.
> > > For example, I made a change in the test as here:
> > > https://paste.debian.net/1235620/
> > > That basically means same framebuffer, but different alpha values for
> > > each plane. And the result was successful (but I expected it to fail).
> > >
> > 
> > The intent was that we don't enable global plane alpha along with anything 
> > that requires per pixel alpha.
> > 
> > The HW does have bits to specify a mode that's intended to work like this, 
> > but I don't think we've ever fully supported it in software.
> > 
> > I wouldn't necessarily expect that the blending result is correct, but 
> > maybe the IGT test result says otherwise.
> 
> hmm... afaiu, I think the issue here is that no formula of pixel blend
> mode ignores the "global plane alpha". Looking at the description here:
> https://cgit.freedesktop.org/drm/drm-misc/tree/drivers/gpu/drm/drm_blend.c#n142
> I understand the global plane alpha is the plane_alpha, described as:
> "Plane alpha value set by the plane "alpha" property."
> And the pixel alpha is the fg.alpha, right? So, the "None" mode doesn't
> care about pixel alpha, but still considers (global) plane alpha, and
> "Pre-multiplied" mode is considering plane alpha and pixel alpha to
> calculate how background RGB affects the resulted composition...
> 
> > 
> > > Besides that, I see that other subtests in kms_plane_alpha_blend are
> > > skipped, use "None" pixel blend mode, or are not changing the
> > > IGT_PLANE_ALPHA property. So, this alpha-7efc seems to be the only one
> > > in the subset that is checking changes on alpha property under a
> > > Pre-multiplied blend mode, and it is failing.
> > >
> > > I see some inputs in this issue:
> > > https://gitlab.freedesktop.org/drm/amd/-/issues/1769.
> > > But then, I guess there are different interpretations for handling
> > > plane alpha in the pre-multiplied blend mode. Tbh, I'm not clear, but
> > > there's always a chance of different interpretations, and I don't have
> > > a third driver with CRC capabilities for further comparisons.
> > >
> > > I made some experiments on blnd_cfg values, changing alpha_mode vs
> > > global_gain and global_alpha. I think the expected behaviour for the
> > > Pre-multiplied blend mode is achieved by applying this RFC patch (for
> > > Cezanne).
> > >
> > > Does it seems reasonable? Can anyone help me with more inputs to guide
> > > me the right direction or point out what I misunderstood about these
> > > concepts?
> > >
> > > Thanks,
> > >
> > > Signed-off-by: Melissa Wen 
> > > ---
> > >  drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c  | 2 +-
> > >  drivers/gpu/drm/amd/display/dc/dcn20/dcn20_hwseq.c | 4 
> > >  2 files changed, 5 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > > b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > > index 6633df7682ce..821ffafa441e 100644
> > > --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > > +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > > @@ -5438,7 +5438,7 @@ fill_blending_from_plane_state(const struct
> > > drm_plane_state *plane_state,
> > >
> > >   if (plane_state->alpha < 0x) {
> > > 

Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-29 Thread Marek Olšák
I don't know what iris does, but I would guess that the same problems as
with AMD GPUs apply, making GPU resets very fragile.

Marek

On Tue., Mar. 29, 2022, 08:14 Christian König, 
wrote:

> My main question is what does the iris driver better than radeonsi when
> the client doesn't support the robustness extension?
>
> From Daniels description it sounds like they have at least a partial
> recovery mechanism in place.
>
> Apart from that I completely agree to what you said below.
>
> Christian.
>
> On 26.03.22 at 01:53, Olsak, Marek wrote:
>
> [AMD Official Use Only]
>
> amdgpu has 2 resets: soft reset and hard reset.
>
> The soft reset is able to recover from an infinite loop and even some GPU
> hangs due to bad shaders or bad states. The soft reset uses a signal that
> kills all currently-running shaders of a certain process (VM context),
> which unblocks the graphics pipeline, so draws and command buffers finish
> but not correctly. This can then cause a hard hang if the shader was
> supposed to signal work completion through a shader store instruction and a
> non-shader consumer is waiting for it (skipping the store instruction by
> killing the shader won't signal the work, and thus the consumer will be
> stuck, requiring a hard reset).
>
> The hard reset can recover from other hangs, which is great, but it may
> use a PCI reset, which erases VRAM on dGPUs. APUs don't lose memory
> contents, but we should assume that any process that had running jobs on
> the GPU during a GPU reset has its memory resources in an inconsistent
> state, and thus following command buffers can cause another GPU hang. The
> shader store example above is enough to cause another hard hang due to
> incorrect content in memory resources, which can contain synchronization
> primitives that are used internally by the hardware.
>
> Asking the driver to replay a command buffer that caused a hang is a sure
> way to hang it again. Unrelated processes can be affected due to lost VRAM
> or the misfortune of using the GPU while the GPU hang occurred. The window
> system should recreate GPU resources and redraw everything without
> affecting applications. If apps use GL, they should do the same. Processes
> that can't recover by redrawing content can be terminated or left alone,
> but they shouldn't be allowed to submit work to the GPU anymore.
>
> dEQP only exercises the soft reset. I think WebGL is only able to trigger
> a soft reset at this point, but Vulkan can also trigger a hard reset.
>
> Marek
> --
> *From:* Koenig, Christian 
> 
> *Sent:* March 23, 2022 11:25
> *To:* Daniel Vetter  ; Daniel Stone
>  ; Olsak, Marek
>  ; Grodzovsky, Andrey
>  
> *Cc:* Rob Clark  ; Rob Clark
>  ; Sharma, Shashank
>  ; Christian König
>  ;
> Somalapuram, Amaranath 
> ; Abhinav Kumar 
> ; dri-devel 
> ; amd-gfx list
>  ; Deucher,
> Alexander  ;
> Shashank Sharma 
> 
> *Subject:* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
>
> [Adding Marek and Andrey as well]
>
> On 23.03.22 at 16:14, Daniel Vetter wrote:
> > On Wed, 23 Mar 2022 at 15:07, Daniel Stone 
>  wrote:
> >> Hi,
> >>
> >> On Mon, 21 Mar 2022 at 16:02, Rob Clark 
>  wrote:
> >>> On Mon, Mar 21, 2022 at 2:30 AM Christian König
> >>>   wrote:
>  Well you can, it just means that their contexts are lost as well.
> >>> Which is rather inconvenient when deqp-egl reset tests, for example,
> >>> take down your compositor ;-)
> >> Yeah. Or anything WebGL.
> >>
> >> System-wide collateral damage is definitely a non-starter. If that
> >> means that the userspace driver has to do what iris does and ensure
> >> everything's recreated and resubmitted, that works too, just as long
> >> as the response to 'my adblocker didn't detect a crypto miner ad'  is
> >> something better than 'shoot the entire user session'.
> > Not sure where that idea came from, I thought at least I made it clear
> > that legacy gl _has_ to recover. It's only vk and arb_robustness gl
> > which should die without recovery attempt.
> >
> > The entire discussion here is who should be responsible for replay and
> > at least if you can decide the uapi, then punting that entirely to
> > userspace is a good approach.
>
> Yes, completely agree. We have the approach of re-submitting things in
> the kernel and that failed quite miserable.
>
> In other words currently a GPU reset has something like a 99% chance to
> take down your whole desktop.
>
> Daniel can you briefly explain what exactly iris does when a lost
> context is detected without gl robustness?
>
> It sounds like you guys got that working quite well.
>
> Thanks,
> Christian.
>
> >
> > Ofc it'd be nice if the collateral damage is limited, i.e. requests
> > not currently on the gpu, or on different engines and all that
> > shouldn't be nuked, if possible.
> >
> > Also ofc since msm uapi is that the kernel tries to recover there's
> > not much we can do there, contexts cannot be shot. But still trying to
> > replay them as much as possible 

Re: [PATCH v11] drm/amdgpu: add drm buddy support to amdgpu

2022-03-29 Thread Arunpravin Paneer Selvam



On 29/03/22 4:54 pm, Christian König wrote:
> On 29.03.22 at 13:19, Arunpravin Paneer Selvam wrote:
>> [SNIP]
 +  pages_left = node->base.num_pages;

i = 0;
 -  spin_lock(&mgr->lock);
while (pages_left) {
 -  uint32_t alignment = tbo->page_alignment;
 +  if (tbo->page_alignment)
 +  min_page_size = tbo->page_alignment << PAGE_SHIFT;
 +  else
 +  min_page_size = mgr->default_page_size;
>>> The handling here looks extremely awkward to me.
>>>
>>> min_page_size should be determined outside of the loop, based on 
>>> default_page_size, alignment and contiguous flag.
>> I kept min_page_size determine logic inside the loop for cases 2GiB+
>> requirements, Since now we are round up the size to the required
>> alignment, I modified the min_page_size determine logic outside of the
>> loop in v12. Please review.
> 
> Ah! So do we only have the loop so that each allocation isn't bigger 
> than 2GiB? If yes couldn't we instead add a max_alloc_size or something 
> similar?
yes, we have the loop to keep each allocation no bigger than 2GiB. I
think we cannot avoid the loop since we need to allocate the remaining
pages, if any, to complete a 2GiB+ size request. In other words, in the
first iteration we limit the max size to 2GiB and in the second
iteration we allocate the leftover pages, if any.
> 
> BTW: I strongly suggest that you rename min_page_size to min_alloc_size. 
> Otherwise somebody could think that those numbers are in pages and not 
> bytes.
modified in v12
> 
>>> Then why do you drop the lock and grab it again inside the loop? And what 
>>> is "i" actually good for?
>> modified the lock/unlock placement in v12.
>>
>> "i" is to track when there is 2GiB+ contiguous allocation request, first
>> we allocate 2GiB (due to SG table limit) continuously and the remaining
>> pages in the next iteration, hence this request can't be contiguous.
>> To set the placement flag we make use of "i" value. In our case "i"
>> value becomes 2 and we don't set the below flag.
>> node->base.placement |= TTM_PL_FLAG_CONTIGUOUS;
>>
>> If we don't get such requests, I will remove "i".
> 
> I'm not sure if that works.
> 
> As far as I can see drm_buddy_alloc_blocks() can allocate multiple 
> blocks at the same time, but i is only incremented when we loop.
> 
> So what you should do instead is to check if node->blocks just contain 
> exactly one element after the allocation but before the trim.
ok
> 
 +
 +  /* Limit maximum size to 2GB due to SG table limitations */
 +  pages = min(pages_left, 2UL << (30 - PAGE_SHIFT));

if (pages >= pages_per_node)
 -  alignment = pages_per_node;
 -
 -  r = drm_mm_insert_node_in_range(mm, &node->mm_nodes[i], pages,
 -  alignment, 0, place->fpfn,
 -  lpfn, mode);
 -  if (unlikely(r)) {
 -  if (pages > pages_per_node) {
 -  if (is_power_of_2(pages))
 -  pages = pages / 2;
 -  else
 -  pages = rounddown_pow_of_two(pages);
 -  continue;
 -  }
 -  goto error_free;
 +  min_page_size = pages_per_node << PAGE_SHIFT;
 +
 +  if (!is_contiguous && !IS_ALIGNED(pages, min_page_size >> 
 PAGE_SHIFT))
 +  is_contiguous = 1;
 +
 +  if (is_contiguous) {
 +  pages = roundup_pow_of_two(pages);
 +  min_page_size = pages << PAGE_SHIFT;
 +
 +  if (pages > lpfn)
 +  lpfn = pages;
}

 -  vis_usage += amdgpu_vram_mgr_vis_size(adev, &node->mm_nodes[i]);
 -  amdgpu_vram_mgr_virt_start(&node->base, &node->mm_nodes[i]);
 -  pages_left -= pages;
 +  BUG_ON(min_page_size < mm->chunk_size);
 +
 +  mutex_lock(&mgr->lock);
 +  r = drm_buddy_alloc_blocks(mm, (u64)place->fpfn << PAGE_SHIFT,
 + (u64)lpfn << PAGE_SHIFT,
 + (u64)pages << PAGE_SHIFT,
 + min_page_size,
 + &node->blocks,
 + node->flags);
 +  mutex_unlock(&mgr->lock);
 +  if (unlikely(r))
 +  goto error_free_blocks;
 +
++i;

if (pages > pages_left)
 -  pages = pages_left;
 +  pages_left = 0;
 +  else
 +  pages_le

Re: [PATCH 18/23] drm/amdgpu: remove dma_resv workaround

2022-03-29 Thread Daniel Vetter
On Mon, Mar 21, 2022 at 02:58:51PM +0100, Christian König wrote:
> We can now add multiple writers to the dma_resv object.
> 
> Also enable the check for not adding containers in dma_resv.c again.
> 
> Signed-off-by: Christian König 
> Cc: amd-gfx@lists.freedesktop.org

It's a bit much magic, but that's the entire point of your huge prep
series to be able to have all the fences on a dma-resv :-)

Reviewed-by: Daniel Vetter 

> ---
>  drivers/dma-buf/dma-resv.c  |  6 +--
>  drivers/gpu/drm/amd/amdgpu/amdgpu_bo_list.h |  1 -
>  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c  | 51 ++---
>  3 files changed, 8 insertions(+), 50 deletions(-)
> 
> diff --git a/drivers/dma-buf/dma-resv.c b/drivers/dma-buf/dma-resv.c
> index 26257ba1527e..10d70812373c 100644
> --- a/drivers/dma-buf/dma-resv.c
> +++ b/drivers/dma-buf/dma-resv.c
> @@ -308,10 +308,10 @@ void dma_resv_add_fence(struct dma_resv *obj, struct 
> dma_fence *fence,
>  
>   dma_resv_assert_held(obj);
>  
> - /* TODO: Drivers should not add containers here, instead add each fence
> -  * individually. Disabled for now until we cleaned up amdgpu/ttm.
> + /* Drivers should not add containers here, instead add each fence
> +  * individually.
>*/
> - /* WARN_ON(dma_fence_is_container(fence)); */
> + WARN_ON(dma_fence_is_container(fence));
>  
>   fobj = dma_resv_fences_list(obj);
>   count = fobj->num_fences;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_bo_list.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_bo_list.h
> index 044b41f0bfd9..529d52a204cf 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_bo_list.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_bo_list.h
> @@ -34,7 +34,6 @@ struct amdgpu_fpriv;
>  struct amdgpu_bo_list_entry {
>   struct ttm_validate_buffer  tv;
>   struct amdgpu_bo_va *bo_va;
> - struct dma_fence_chain  *chain;
>   uint32_tpriority;
>   struct page **user_pages;
>   booluser_invalidated;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> index 1c039db976a9..88009833f523 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> @@ -575,14 +575,6 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser 
> *p,
>   struct amdgpu_bo *bo = ttm_to_amdgpu_bo(e->tv.bo);
>  
>   e->bo_va = amdgpu_vm_bo_find(vm, bo);
> -
> - if (bo->tbo.base.dma_buf && !amdgpu_bo_explicit_sync(bo)) {
> - e->chain = dma_fence_chain_alloc();
> - if (!e->chain) {
> - r = -ENOMEM;
> - goto error_validate;
> - }
> - }
>   }
>  
>   amdgpu_cs_get_threshold_for_moves(p->adev, &p->bytes_moved_threshold,
> @@ -633,13 +625,8 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser 
> *p,
>   }
>  
>  error_validate:
> - if (r) {
> - amdgpu_bo_list_for_each_entry(e, p->bo_list) {
> - dma_fence_chain_free(e->chain);
> - e->chain = NULL;
> - }
> + if (r)
>   ttm_eu_backoff_reservation(&p->ticket, &p->validated);
> - }
>  out:
>   return r;
>  }
> @@ -679,17 +666,9 @@ static void amdgpu_cs_parser_fini(struct 
> amdgpu_cs_parser *parser, int error,
>  {
>   unsigned i;
>  
> - if (error && backoff) {
> - struct amdgpu_bo_list_entry *e;
> -
> - amdgpu_bo_list_for_each_entry(e, parser->bo_list) {
> - dma_fence_chain_free(e->chain);
> - e->chain = NULL;
> - }
> -
> + if (error && backoff)
>   ttm_eu_backoff_reservation(&parser->ticket,
>  &parser->validated);
> - }
>  
>   for (i = 0; i < parser->num_post_deps; i++) {
>   drm_syncobj_put(parser->post_deps[i].syncobj);
> @@ -1264,29 +1243,9 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p,
>  
>   amdgpu_vm_move_to_lru_tail(p->adev, &fpriv->vm);
>  
> - amdgpu_bo_list_for_each_entry(e, p->bo_list) {
> - struct dma_resv *resv = e->tv.bo->base.resv;
> - struct dma_fence_chain *chain = e->chain;
> - struct dma_resv_iter cursor;
> - struct dma_fence *fence;
> -
> - if (!chain)
> - continue;
> -
> - /*
> -  * Work around dma_resv shortcommings by wrapping up the
> -  * submission in a dma_fence_chain and add it as exclusive
> -  * fence.
> -  */
> - dma_resv_for_each_fence(&cursor, resv,
> - DMA_RESV_USAGE_WRITE,
> - fence) {
> - break;
> - }
> - dma_fence_cha

Re: [PATCH 4/4] drm/amdgpu: Add unique_id support for sienna cichlid

2022-03-29 Thread Alex Deucher
On Tue, Mar 29, 2022 at 11:10 AM Kent Russell  wrote:
>
> This is being added to SMU Metrics, so add the required tie-ins in the
> kernel. Also create the corresponding unique_id sysfs file.
>
> v2: Add FW version check, remove SMU mutex
> v3: Fix style warning
> v4: Add MP1 IP_VERSION check to FW version check
>
> Signed-off-by: Kent Russell 
> ---
>  drivers/gpu/drm/amd/pm/amdgpu_pm.c|  1 +
>  .../pmfw_if/smu11_driver_if_sienna_cichlid.h  | 13 ++--
>  .../amd/pm/swsmu/smu11/sienna_cichlid_ppt.c   | 33 +++
>  3 files changed, 45 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/pm/amdgpu_pm.c 
> b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
> index 4151db2678fb..4a9aabc16fbc 100644
> --- a/drivers/gpu/drm/amd/pm/amdgpu_pm.c
> +++ b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
> @@ -1993,6 +1993,7 @@ static int default_attr_update(struct amdgpu_device 
> *adev, struct amdgpu_device_
> case IP_VERSION(9, 4, 0):
> case IP_VERSION(9, 4, 1):
> case IP_VERSION(9, 4, 2):
> +   case IP_VERSION(10, 3, 0):
> *states = ATTR_STATE_SUPPORTED;
> break;
> default:
> diff --git 
> a/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu11_driver_if_sienna_cichlid.h 
> b/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu11_driver_if_sienna_cichlid.h
> index 3e4a314ef925..5831145646e6 100644
> --- 
> a/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu11_driver_if_sienna_cichlid.h
> +++ 
> b/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu11_driver_if_sienna_cichlid.h
> @@ -1419,8 +1419,11 @@ typedef struct {
>uint8_t  PcieRate   ;
>uint8_t  PcieWidth  ;
>uint16_t AverageGfxclkFrequencyTarget;
> -  uint16_t Padding16_2;
>
> +  uint32_t PublicSerialNumLower32;
> +  uint32_t PublicSerialNumUpper32;
> +
> +  uint16_t Padding16_2;
>  } SmuMetrics_t;
>
>  typedef struct {
> @@ -1476,8 +1479,11 @@ typedef struct {
>uint8_t  PcieRate   ;
>uint8_t  PcieWidth  ;
>uint16_t AverageGfxclkFrequencyTarget;
> -  uint16_t Padding16_2;
>
> +  uint32_t PublicSerialNumLower32;
> +  uint32_t PublicSerialNumUpper32;
> +
> +  uint16_t Padding16_2;
>  } SmuMetrics_V2_t;
>
>  typedef struct {
> @@ -1535,6 +1541,9 @@ typedef struct {
>uint8_t  PcieWidth;
>uint16_t AverageGfxclkFrequencyTarget;
>
> +  uint32_t PublicSerialNumLower32;
> +  uint32_t PublicSerialNumUpper32;
> +
>  } SmuMetrics_V3_t;
>

Was this really added to all three versions of the metrics table?  If
it's a new addition, presumably it's only in v3?  Other than that, the
series is:
Reviewed-by: Alex Deucher 

Alex

>  typedef struct {
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c 
> b/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
> index 38f04836c82f..b2f3d80e5945 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
> @@ -715,6 +715,16 @@ static int sienna_cichlid_get_smu_metrics_data(struct 
> smu_context *smu,
> *value = use_metrics_v3 ? metrics_v3->CurrFanSpeed :
> use_metrics_v2 ? metrics_v2->CurrFanSpeed : 
> metrics->CurrFanSpeed;
> break;
> +   case METRICS_UNIQUE_ID_UPPER32:
> +   *value = use_metrics_v3 ? metrics_v3->PublicSerialNumUpper32 :
> +   use_metrics_v2 ? metrics_v2->PublicSerialNumUpper32 :
> +   metrics->PublicSerialNumUpper32;
> +   break;
> +   case METRICS_UNIQUE_ID_LOWER32:
> +   *value = use_metrics_v3 ? metrics_v3->PublicSerialNumLower32 :
> +   use_metrics_v2 ? metrics_v2->PublicSerialNumLower32 :
> +   metrics->PublicSerialNumLower32;
> +   break;
> default:
> *value = UINT_MAX;
> break;
> @@ -1773,6 +1783,28 @@ static int sienna_cichlid_read_sensor(struct 
> smu_context *smu,
> return ret;
>  }
>
> +static void sienna_cichlid_get_unique_id(struct smu_context *smu)
> +{
> +   struct amdgpu_device *adev = smu->adev;
> +   uint32_t upper32 = 0, lower32 = 0;
> +
> +   /* Only supported as of version 0.58.83.0 and only on Sienna Cichlid 
> */
> +   if (smu->smc_fw_version < 0x3A5300 ||
> +   smu->adev->ip_versions[MP1_HWIP][0] != IP_VERSION(11, 0, 7))
> +   return;
> +
> +   if (sienna_cichlid_get_smu_metrics_data(smu, 
> METRICS_UNIQUE_ID_UPPER32, &upper32))
> +   goto out;
> +   if (sienna_cichlid_get_smu_metrics_data(smu, 
> METRICS_UNIQUE_ID_LOWER32, &lower32))
> +   goto out;
> +
> +out:
> +
> +   adev->unique_id = ((uint64_t)upper32 << 32) | lower32;
> +   if (adev->serial[0] == '\0')
> +   sprintf(adev->serial, "%016llx", adev->unique_id);
> +}
> +
>  static int sienna_cichlid_get_uclk_dpm_states(struct smu_con

Re: [PATCH] drm/amd/display: Fix by adding FPU protection for dcn30_internal_validate_bw

2022-03-29 Thread Deucher, Alexander
[Public]

Acked-by: Alex Deucher 

From: amd-gfx  on behalf of CHANDAN 
VURDIGERE NATARAJ 
Sent: Tuesday, March 29, 2022 4:29 AM
To: Wentland, Harry ; Li, Sun peng (Leo) 
; Deucher, Alexander 
Cc: VURDIGERENATARAJ, CHANDAN ; 
amd-gfx@lists.freedesktop.org 
Subject: [PATCH] drm/amd/display: Fix by adding FPU protection for 
dcn30_internal_validate_bw

[WHY]
Below general protection fault observed when WebGL Aquarium is run for
longer duration. If drm debug logs are enabled and set to 0x1f then the
issue is observed within 10 minutes of run.

[  100.717056] general protection fault, probably for non-canonical address 
0x2d33302d32323032:  [#1] PREEMPT SMP NOPTI
[  100.727921] CPU: 3 PID: 1906 Comm: DrmThread Tainted: GW 
5.15.30 #12 d726c6a2d6ebe5cf9223931cbca6892f916fe18b
[  100.754419] RIP: 0010:CalculateSwathWidth+0x1f7/0x44f
[  100.767109] Code: 00 00 00 f2 42 0f 11 04 f0 48 8b 85 88 00 00 00 f2 42 0f 
10 04 f0 48 8b 85 98 00 00 00 f2 42 0f 11 04 f0 48 8b 45 10 0f 57 c0  42 0f 
2a 04 b0 0f 57 c9 f3 43 0f 2a 0c b4 e8 8c e2 f3 ff 48 8b
[  100.781269] RSP: 0018:a9230079eeb0 EFLAGS: 00010246
[  100.812528] RAX: 2d33302d32323032 RBX: 0500 RCX: 
[  100.819656] RDX: 0001 RSI: 99deb712c49c RDI: 
[  100.826781] RBP: a9230079ef50 R08: 99deb712460c R09: 99deb712462c
[  100.833907] R10: 99deb7124940 R11: 99deb7124d70 R12: 99deb712ae44
[  100.841033] R13: 0001 R14:  R15: a9230079f0a0
[  100.848159] FS:  7af121212640() GS:99deba78() 
knlGS:
[  100.856240] CS:  0010 DS:  ES:  CR0: 80050033
[  100.861980] CR2: 209000fe1000 CR3: 00011b18c000 CR4: 00350ee0
[  100.869106] Call Trace:
[  100.871555]  
[  100.873655]  ? asm_sysvec_reschedule_ipi+0x12/0x20
[  100.878449]  CalculateSwathAndDETConfiguration+0x1a3/0x6dd
[  100.883937]  dml31_ModeSupportAndSystemConfigurationFull+0x2ce4/0x76da
[  100.890467]  ? kallsyms_lookup_buildid+0xc8/0x163
[  100.895173]  ? kallsyms_lookup_buildid+0xc8/0x163
[  100.899874]  ? __sprint_symbol+0x80/0x135
[  100.903883]  ? dm_update_plane_state+0x3f9/0x4d2
[  100.908500]  ? symbol_string+0xb7/0xde
[  100.912250]  ? number+0x145/0x29b
[  100.915566]  ? vsnprintf+0x341/0x5ff
[  100.919141]  ? desc_read_finalized_seq+0x39/0x87
[  100.923755]  ? update_load_avg+0x1b9/0x607
[  100.927849]  ? compute_mst_dsc_configs_for_state+0x7d/0xd5b
[  100.933416]  ? fetch_pipe_params+0xa4d/0xd0c
[  100.937686]  ? dc_fpu_end+0x3d/0xa8
[  100.941175]  dml_get_voltage_level+0x16b/0x180
[  100.945619]  dcn30_internal_validate_bw+0x10e/0x89b
[  100.950495]  ? dcn31_validate_bandwidth+0x68/0x1fc
[  100.955285]  ? resource_build_scaling_params+0x98b/0xb8c
[  100.960595]  ? dcn31_validate_bandwidth+0x68/0x1fc
[  100.965384]  dcn31_validate_bandwidth+0x9a/0x1fc
[  100.970001]  dc_validate_global_state+0x238/0x295
[  100.974703]  amdgpu_dm_atomic_check+0x9c1/0xbce
[  100.979235]  ? _printk+0x59/0x73
[  100.982467]  drm_atomic_check_only+0x403/0x78b
[  100.986912]  drm_mode_atomic_ioctl+0x49b/0x546
[  100.991358]  ? drm_ioctl+0x1c1/0x3b3
[  100.994936]  ? drm_atomic_set_property+0x92a/0x92a
[  100.999725]  drm_ioctl_kernel+0xdc/0x149
[  101.003648]  drm_ioctl+0x27f/0x3b3
[  101.007051]  ? drm_atomic_set_property+0x92a/0x92a
[  101.011842]  amdgpu_drm_ioctl+0x49/0x7d
[  101.015679]  __se_sys_ioctl+0x7c/0xb8
[  101.015685]  do_syscall_64+0x5f/0xb8
[  101.015690]  ? __irq_exit_rcu+0x34/0x96

[HOW]
It calls populate_dml_pipes, which uses doubles for initialization.
Adding FPU protection avoids a context switch and the probable loss of the
vba context, as there is potential contention while drm debug logs are
enabled.

Signed-off-by: CHANDAN VURDIGERE NATARAJ 

diff --git a/drivers/gpu/drm/amd/display/dc/dcn31/dcn31_resource.c 
b/drivers/gpu/drm/amd/display/dc/dcn31/dcn31_resource.c
index 826970f2bd0a..f27262417abe 100644
--- a/drivers/gpu/drm/amd/display/dc/dcn31/dcn31_resource.c
+++ b/drivers/gpu/drm/amd/display/dc/dcn31/dcn31_resource.c
@@ -1750,7 +1750,9 @@ bool dcn31_validate_bandwidth(struct dc *dc,

 BW_VAL_TRACE_COUNT();

+   DC_FP_START();
 out = dcn30_internal_validate_bw(dc, context, pipes, &pipe_cnt, 
&vlevel, fast_validate);
+   DC_FP_END();

 // Disable fast_validate to set min dcfclk in calculate_wm_and_dlg
 if (pipe_cnt == 0)
--
2.25.1



[PATCH 4/4] drm/amdgpu: Add unique_id support for sienna cichlid

2022-03-29 Thread Kent Russell
This is being added to SMU Metrics, so add the required tie-ins in the
kernel. Also create the corresponding unique_id sysfs file.

v2: Add FW version check, remove SMU mutex
v3: Fix style warning
v4: Add MP1 IP_VERSION check to FW version check

Signed-off-by: Kent Russell 
---
 drivers/gpu/drm/amd/pm/amdgpu_pm.c|  1 +
 .../pmfw_if/smu11_driver_if_sienna_cichlid.h  | 13 ++--
 .../amd/pm/swsmu/smu11/sienna_cichlid_ppt.c   | 33 +++
 3 files changed, 45 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/amdgpu_pm.c 
b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
index 4151db2678fb..4a9aabc16fbc 100644
--- a/drivers/gpu/drm/amd/pm/amdgpu_pm.c
+++ b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
@@ -1993,6 +1993,7 @@ static int default_attr_update(struct amdgpu_device 
*adev, struct amdgpu_device_
case IP_VERSION(9, 4, 0):
case IP_VERSION(9, 4, 1):
case IP_VERSION(9, 4, 2):
+   case IP_VERSION(10, 3, 0):
*states = ATTR_STATE_SUPPORTED;
break;
default:
diff --git 
a/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu11_driver_if_sienna_cichlid.h 
b/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu11_driver_if_sienna_cichlid.h
index 3e4a314ef925..5831145646e6 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu11_driver_if_sienna_cichlid.h
+++ b/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu11_driver_if_sienna_cichlid.h
@@ -1419,8 +1419,11 @@ typedef struct {
   uint8_t  PcieRate   ;
   uint8_t  PcieWidth  ;
   uint16_t AverageGfxclkFrequencyTarget;
-  uint16_t Padding16_2;
 
+  uint32_t PublicSerialNumLower32;
+  uint32_t PublicSerialNumUpper32;
+
+  uint16_t Padding16_2;
 } SmuMetrics_t;
 
 typedef struct {
@@ -1476,8 +1479,11 @@ typedef struct {
   uint8_t  PcieRate   ;
   uint8_t  PcieWidth  ;
   uint16_t AverageGfxclkFrequencyTarget;
-  uint16_t Padding16_2;
 
+  uint32_t PublicSerialNumLower32;
+  uint32_t PublicSerialNumUpper32;
+
+  uint16_t Padding16_2;
 } SmuMetrics_V2_t;
 
 typedef struct {
@@ -1535,6 +1541,9 @@ typedef struct {
   uint8_t  PcieWidth;
   uint16_t AverageGfxclkFrequencyTarget;
 
+  uint32_t PublicSerialNumLower32;
+  uint32_t PublicSerialNumUpper32;
+
 } SmuMetrics_V3_t;
 
 typedef struct {
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
index 38f04836c82f..b2f3d80e5945 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
@@ -715,6 +715,16 @@ static int sienna_cichlid_get_smu_metrics_data(struct 
smu_context *smu,
*value = use_metrics_v3 ? metrics_v3->CurrFanSpeed :
use_metrics_v2 ? metrics_v2->CurrFanSpeed : 
metrics->CurrFanSpeed;
break;
+   case METRICS_UNIQUE_ID_UPPER32:
+   *value = use_metrics_v3 ? metrics_v3->PublicSerialNumUpper32 :
+   use_metrics_v2 ? metrics_v2->PublicSerialNumUpper32 :
+   metrics->PublicSerialNumUpper32;
+   break;
+   case METRICS_UNIQUE_ID_LOWER32:
+   *value = use_metrics_v3 ? metrics_v3->PublicSerialNumLower32 :
+   use_metrics_v2 ? metrics_v2->PublicSerialNumLower32 :
+   metrics->PublicSerialNumLower32;
+   break;
default:
*value = UINT_MAX;
break;
@@ -1773,6 +1783,28 @@ static int sienna_cichlid_read_sensor(struct smu_context 
*smu,
return ret;
 }
 
+static void sienna_cichlid_get_unique_id(struct smu_context *smu)
+{
+   struct amdgpu_device *adev = smu->adev;
+   uint32_t upper32 = 0, lower32 = 0;
+
+   /* Only supported as of version 0.58.83.0 and only on Sienna Cichlid */
+   if (smu->smc_fw_version < 0x3A5300 ||
+   smu->adev->ip_versions[MP1_HWIP][0] != IP_VERSION(11, 0, 7))
+   return;
+
+   if (sienna_cichlid_get_smu_metrics_data(smu, METRICS_UNIQUE_ID_UPPER32, 
&upper32))
+   goto out;
+   if (sienna_cichlid_get_smu_metrics_data(smu, METRICS_UNIQUE_ID_LOWER32, 
&lower32))
+   goto out;
+
+out:
+
+   adev->unique_id = ((uint64_t)upper32 << 32) | lower32;
+   if (adev->serial[0] == '\0')
+   sprintf(adev->serial, "%016llx", adev->unique_id);
+}
+
 static int sienna_cichlid_get_uclk_dpm_states(struct smu_context *smu, 
uint32_t *clocks_in_khz, uint32_t *num_states)
 {
uint32_t num_discrete_levels = 0;
@@ -4182,6 +4214,7 @@ static const struct pptable_funcs 
sienna_cichlid_ppt_funcs = {
.get_ecc_info = sienna_cichlid_get_ecc_info,
.get_default_config_table_settings = 
sienna_cichlid_get_default_config_table_settings,
.set_config_table = sienna_cichlid_set_config_table,
+   .get_unique_id = sienna_cichlid_get_unique_id,
 };
 
 void sienna_cichlid_s

[PATCH 3/4] drm/amdgpu: Use metrics data function to get unique_id for Aldebaran

2022-03-29 Thread Kent Russell
This is abstracted well enough in the get_metrics_data function, so use
that function.

Signed-off-by: Kent Russell 
---
 .../gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c   | 16 +---
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
index cd81f848d45a..38af648cb857 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
@@ -650,6 +650,12 @@ static int aldebaran_get_smu_metrics_data(struct 
smu_context *smu,
case METRICS_THROTTLER_STATUS:
*value = metrics->ThrottlerStatus;
break;
+   case METRICS_UNIQUE_ID_UPPER32:
+   *value = metrics->PublicSerialNumUpper32;
+   break;
+   case METRICS_UNIQUE_ID_LOWER32:
+   *value = metrics->PublicSerialNumLower32;
+   break;
default:
*value = UINT_MAX;
break;
@@ -1614,16 +1620,12 @@ static void aldebaran_i2c_control_fini(struct 
smu_context *smu)
 static void aldebaran_get_unique_id(struct smu_context *smu)
 {
struct amdgpu_device *adev = smu->adev;
-   SmuMetrics_t *metrics = smu->smu_table.metrics_table;
uint32_t upper32 = 0, lower32 = 0;
-   int ret;
 
-   ret = smu_cmn_get_metrics_table(smu, NULL, false);
-   if (ret)
+   if (aldebaran_get_smu_metrics_data(smu, METRICS_UNIQUE_ID_UPPER32, 
&upper32))
+   goto out;
+   if (aldebaran_get_smu_metrics_data(smu, METRICS_UNIQUE_ID_LOWER32, 
&lower32))
goto out;
-
-   upper32 = metrics->PublicSerialNumUpper32;
-   lower32 = metrics->PublicSerialNumLower32;
 
 out:
adev->unique_id = ((uint64_t)upper32 << 32) | lower32;
-- 
2.25.1



[PATCH 1/4] drm/amdgpu: Use switch case for unique_id

2022-03-29 Thread Kent Russell
To ease readability, use a switch statement to mark unique_id as supported
for the supported IP_VERSIONs, and mark it unsupported by default for all
other ASICs.
This makes it easier to add IP_VERSIONs later on, and makes it obvious
that unique_id is unsupported by default, instead of the current logic
that assumes support unless the version fails to match one of the
specified IP_VERSIONs.

v2: Rebase onto previous IP_VERSION change

Signed-off-by: Kent Russell 
Reviewed-by: Alex Deucher 
Reviewed-by: Kevin Wang 
---
 drivers/gpu/drm/amd/pm/amdgpu_pm.c | 13 +
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/amdgpu_pm.c 
b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
index 9ce597ded31d..4151db2678fb 100644
--- a/drivers/gpu/drm/amd/pm/amdgpu_pm.c
+++ b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
@@ -1988,11 +1988,16 @@ static int default_attr_update(struct amdgpu_device 
*adev, struct amdgpu_device_
if (adev->flags & AMD_IS_APU)
*states = ATTR_STATE_UNSUPPORTED;
} else if (DEVICE_ATTR_IS(unique_id)) {
-   if (gc_ver != IP_VERSION(9, 0, 1) &&
-   gc_ver != IP_VERSION(9, 4, 0) &&
-   gc_ver != IP_VERSION(9, 4, 1) &&
-   gc_ver != IP_VERSION(9, 4, 2))
+   switch (gc_ver) {
+   case IP_VERSION(9, 0, 1):
+   case IP_VERSION(9, 4, 0):
+   case IP_VERSION(9, 4, 1):
+   case IP_VERSION(9, 4, 2):
+   *states = ATTR_STATE_SUPPORTED;
+   break;
+   default:
*states = ATTR_STATE_UNSUPPORTED;
+   }
} else if (DEVICE_ATTR_IS(pp_features)) {
if (adev->flags & AMD_IS_APU || gc_ver < IP_VERSION(9, 0, 0))
*states = ATTR_STATE_UNSUPPORTED;
-- 
2.25.1



[PATCH 2/4] drm/amdgpu: Add UNIQUE_ID to MetricsMember_t

2022-03-29 Thread Kent Russell
This will allow us to use the generic *_get_metrics_data functions for
ASICs that support unique_id.

Signed-off-by: Kent Russell 
---
 drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h 
b/drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h
index ef57b6089c69..46e34ed8a3c8 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h
+++ b/drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h
@@ -1333,6 +1333,8 @@ typedef enum {
METRICS_VOLTAGE_VDDGFX,
METRICS_SS_APU_SHARE,
METRICS_SS_DGPU_SHARE,
+   METRICS_UNIQUE_ID_UPPER32,
+   METRICS_UNIQUE_ID_LOWER32,
 } MetricsMember_t;
 
 enum smu_cmn2asic_mapping_type {
-- 
2.25.1



Re: [PATCH] UPSTREAM: drm/amdgpu: Disable ABM when AC mode

2022-03-29 Thread Harry Wentland



On 2022-03-29 10:33, Harry Wentland wrote:
> 
> 
> On 2022-03-24 19:10, Ryan Lin wrote:
>> Disable the ABM feature when the system is running on AC mode to get
>> better contrast on the display.
>>
>> Signed-off-by: Ryan Lin 
>>
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c  |  4 ++
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c|  1 +
>>  drivers/gpu/drm/amd/display/dc/dce/dmub_abm.c | 58 ---
>>  drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h   |  1 +
>>  4 files changed, 42 insertions(+), 22 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
>> index c560c1ab62ecb..bc8bb9aad2e36 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
>> @@ -822,6 +822,10 @@ static int amdgpu_acpi_event(struct notifier_block *nb,
>>  struct amdgpu_device *adev = container_of(nb, struct amdgpu_device, 
>> acpi_nb);
>>  struct acpi_bus_event *entry = (struct acpi_bus_event *)data;
>>  
>> +if (strcmp(entry->device_class, "battery") == 0) {
>> +adev->pm.ac_power = power_supply_is_system_supplied() > 0;
>> +}
>> +
>>  if (strcmp(entry->device_class, ACPI_AC_CLASS) == 0) {
>>  if (power_supply_is_system_supplied() > 0)
>>  DRM_DEBUG_DRIVER("pm: AC\n");
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index abfcc1304ba0c..3a0afe7602727 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -3454,6 +3454,7 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>>  
>>  adev->gfx.gfx_off_req_count = 1;
>>  adev->pm.ac_power = power_supply_is_system_supplied() > 0;
>> +adev->pm.old_ac_power = true;
>>  
>>  atomic_set(&adev->throttling_logging_enabled, 1);
>>  /*
>> diff --git a/drivers/gpu/drm/amd/display/dc/dce/dmub_abm.c 
>> b/drivers/gpu/drm/amd/display/dc/dce/dmub_abm.c
>> index 54a1408c8015c..478a734b66926 100644
>> --- a/drivers/gpu/drm/amd/display/dc/dce/dmub_abm.c
>> +++ b/drivers/gpu/drm/amd/display/dc/dce/dmub_abm.c
>> @@ -23,6 +23,8 @@
>>   *
>>   */
>>  
>> +#include 
>> +#include "amdgpu.h"
>>  #include "dmub_abm.h"
>>  #include "dce_abm.h"
>>  #include "dc.h"
>> @@ -51,6 +53,7 @@
>>  #define DISABLE_ABM_IMMEDIATELY 255
>>  
>>  
>> +extern uint amdgpu_dm_abm_level;
>>  
>>  static void dmub_abm_enable_fractional_pwm(struct dc_context *dc)
>>  {
>> @@ -117,28 +120,6 @@ static void dmub_abm_init(struct abm *abm, uint32_t 
>> backlight)
>>  dmub_abm_enable_fractional_pwm(abm->ctx);
>>  }
>>  
>> -static unsigned int dmub_abm_get_current_backlight(struct abm *abm)
>> -{
>> -struct dce_abm *dce_abm = TO_DMUB_ABM(abm);
>> -unsigned int backlight = REG_READ(BL1_PWM_CURRENT_ABM_LEVEL);
>> -
>> -/* return backlight in hardware format which is unsigned 17 bits, with
>> - * 1 bit integer and 16 bit fractional
>> - */
>> -return backlight;
>> -}
>> -
>> -static unsigned int dmub_abm_get_target_backlight(struct abm *abm)
>> -{
>> -struct dce_abm *dce_abm = TO_DMUB_ABM(abm);
>> -unsigned int backlight = REG_READ(BL1_PWM_TARGET_ABM_LEVEL);
>> -
>> -/* return backlight in hardware format which is unsigned 17 bits, with
>> - * 1 bit integer and 16 bit fractional
>> - */
>> -return backlight;
>> -}
>> -
>>  static bool dmub_abm_set_level(struct abm *abm, uint32_t level)
>>  {
>>  union dmub_rb_cmd cmd;
>> @@ -148,6 +129,9 @@ static bool dmub_abm_set_level(struct abm *abm, uint32_t 
>> level)
>>  int edp_num;
>>  uint8_t panel_mask = 0;
>>  
>> +if (power_supply_is_system_supplied() > 0)
>> +level = 0;
>> +
>>  get_edp_links(dc->dc, edp_links, &edp_num);
>>  
>>  for (i = 0; i < edp_num; i++) {
>> @@ -170,6 +154,36 @@ static bool dmub_abm_set_level(struct abm *abm, 
>> uint32_t level)
>>  return true;
>>  }
>>  
>> +static unsigned int dmub_abm_get_current_backlight(struct abm *abm)
>> +{
>> +struct dce_abm *dce_abm = TO_DMUB_ABM(abm);
>> +unsigned int backlight = REG_READ(BL1_PWM_CURRENT_ABM_LEVEL);
>> +struct dc_context *dc = abm->ctx;
>> +struct amdgpu_device *adev = dc->driver_context;
>> +
>> +if (adev->pm.ac_power != adev->pm.old_ac_power) {
> 
> This patch still has the problem of accessing adev from within DC.
> That'll break things on other platforms. This information needs to
> come in through the DC interface if we want to enable/disable ABM in
> this function.
> 
> After a closer look I also don't think that amdgpu is the right place
> to control the logic to disable ABM in AC mode, i.e. to switch between
> ABM levels. Take a look at dm_connector_state.abm_level and the
> abm_level_property. It's exposed to userspace as "abm level".
> 
> The "abm level" defaults to "0" unless userspace sets the "abm level"
> to something else. The same component that sets the "abm level"
> initially is the one that should set it to "0" when in AC mode.

Re: [PATCH] UPSTREAM: drm/amdgpu: Disable ABM when AC mode

2022-03-29 Thread Harry Wentland



On 2022-03-24 19:10, Ryan Lin wrote:
> Disable the ABM feature when the system is running on AC mode to get
> better contrast on the display.
> 
> Signed-off-by: Ryan Lin 
> 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c  |  4 ++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c|  1 +
>  drivers/gpu/drm/amd/display/dc/dce/dmub_abm.c | 58 ---
>  drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h   |  1 +
>  4 files changed, 42 insertions(+), 22 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
> index c560c1ab62ecb..bc8bb9aad2e36 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
> @@ -822,6 +822,10 @@ static int amdgpu_acpi_event(struct notifier_block *nb,
>   struct amdgpu_device *adev = container_of(nb, struct amdgpu_device, 
> acpi_nb);
>   struct acpi_bus_event *entry = (struct acpi_bus_event *)data;
>  
> + if (strcmp(entry->device_class, "battery") == 0) {
> + adev->pm.ac_power = power_supply_is_system_supplied() > 0;
> + }
> +
>   if (strcmp(entry->device_class, ACPI_AC_CLASS) == 0) {
>   if (power_supply_is_system_supplied() > 0)
>   DRM_DEBUG_DRIVER("pm: AC\n");
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index abfcc1304ba0c..3a0afe7602727 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3454,6 +3454,7 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>  
>   adev->gfx.gfx_off_req_count = 1;
>   adev->pm.ac_power = power_supply_is_system_supplied() > 0;
> + adev->pm.old_ac_power = true;
>  
>   atomic_set(&adev->throttling_logging_enabled, 1);
>   /*
> diff --git a/drivers/gpu/drm/amd/display/dc/dce/dmub_abm.c 
> b/drivers/gpu/drm/amd/display/dc/dce/dmub_abm.c
> index 54a1408c8015c..478a734b66926 100644
> --- a/drivers/gpu/drm/amd/display/dc/dce/dmub_abm.c
> +++ b/drivers/gpu/drm/amd/display/dc/dce/dmub_abm.c
> @@ -23,6 +23,8 @@
>   *
>   */
>  
> +#include 
> +#include "amdgpu.h"
>  #include "dmub_abm.h"
>  #include "dce_abm.h"
>  #include "dc.h"
> @@ -51,6 +53,7 @@
>  #define DISABLE_ABM_IMMEDIATELY 255
>  
>  
> +extern uint amdgpu_dm_abm_level;
>  
>  static void dmub_abm_enable_fractional_pwm(struct dc_context *dc)
>  {
> @@ -117,28 +120,6 @@ static void dmub_abm_init(struct abm *abm, uint32_t 
> backlight)
>   dmub_abm_enable_fractional_pwm(abm->ctx);
>  }
>  
> -static unsigned int dmub_abm_get_current_backlight(struct abm *abm)
> -{
> - struct dce_abm *dce_abm = TO_DMUB_ABM(abm);
> - unsigned int backlight = REG_READ(BL1_PWM_CURRENT_ABM_LEVEL);
> -
> - /* return backlight in hardware format which is unsigned 17 bits, with
> -  * 1 bit integer and 16 bit fractional
> -  */
> - return backlight;
> -}
> -
> -static unsigned int dmub_abm_get_target_backlight(struct abm *abm)
> -{
> - struct dce_abm *dce_abm = TO_DMUB_ABM(abm);
> - unsigned int backlight = REG_READ(BL1_PWM_TARGET_ABM_LEVEL);
> -
> - /* return backlight in hardware format which is unsigned 17 bits, with
> -  * 1 bit integer and 16 bit fractional
> -  */
> - return backlight;
> -}
> -
>  static bool dmub_abm_set_level(struct abm *abm, uint32_t level)
>  {
>   union dmub_rb_cmd cmd;
> @@ -148,6 +129,9 @@ static bool dmub_abm_set_level(struct abm *abm, uint32_t 
> level)
>   int edp_num;
>   uint8_t panel_mask = 0;
>  
> + if (power_supply_is_system_supplied() > 0)
> + level = 0;
> +
>   get_edp_links(dc->dc, edp_links, &edp_num);
>  
>   for (i = 0; i < edp_num; i++) {
> @@ -170,6 +154,36 @@ static bool dmub_abm_set_level(struct abm *abm, uint32_t 
> level)
>   return true;
>  }
>  
> +static unsigned int dmub_abm_get_current_backlight(struct abm *abm)
> +{
> + struct dce_abm *dce_abm = TO_DMUB_ABM(abm);
> + unsigned int backlight = REG_READ(BL1_PWM_CURRENT_ABM_LEVEL);
> + struct dc_context *dc = abm->ctx;
> + struct amdgpu_device *adev = dc->driver_context;
> +
> + if (adev->pm.ac_power != adev->pm.old_ac_power) {

This patch still has the problem of accessing adev from within DC.
That'll break things on other platforms. This information needs to
come in through the DC interface if we want to enable/disable ABM in
this function.

After a closer look I also don't think that amdgpu is the right place
to control the logic to disable ABM in AC mode, i.e. to switch between
ABM levels. Take a look at dm_connector_state.abm_level and the
abm_level_property. It's exposed to userspace as "abm level".

The "abm level" defaults to "0" unless userspace sets the "abm level"
to something else. The same component that sets the "abm level"
initially is the one that should set it to "0" when in AC mode.

Harry

> + dmub_abm_set_level(abm, amdgpu_dm_abm_level);
> +  

Re: drm/amdgpu: Disable ABM when AC mode

2022-03-29 Thread Alex Deucher
On Tue, Mar 29, 2022 at 4:56 AM Ryan Lin  wrote:
>
> Disable the ABM feature when the system is running on AC mode to get
> better contrast on the display.
>
> v2: remove "UPSTREAM" from the subject.
>
> Signed-off-by: Ryan Lin 
>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c  |  4 ++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c|  1 +
>  drivers/gpu/drm/amd/display/dc/dce/dmub_abm.c | 58 ---
>  drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h   |  1 +
>  4 files changed, 42 insertions(+), 22 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
> index c560c1ab62ecb..bc8bb9aad2e36 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
> @@ -822,6 +822,10 @@ static int amdgpu_acpi_event(struct notifier_block *nb,
> struct amdgpu_device *adev = container_of(nb, struct amdgpu_device, 
> acpi_nb);
> struct acpi_bus_event *entry = (struct acpi_bus_event *)data;
>
> +   if (strcmp(entry->device_class, "battery") == 0) {
> +   adev->pm.ac_power = power_supply_is_system_supplied() > 0;
> +   }
> +

Is this change necessary?  As I said before, we already update
adev->pm.ac_power a few lines later in amdgpu_pm_acpi_event_handler().
If there is something wrong with that code, please adjust as
necessary.

Alex

> if (strcmp(entry->device_class, ACPI_AC_CLASS) == 0) {
> if (power_supply_is_system_supplied() > 0)
> DRM_DEBUG_DRIVER("pm: AC\n");
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index abfcc1304ba0c..3a0afe7602727 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3454,6 +3454,7 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>
> adev->gfx.gfx_off_req_count = 1;
> adev->pm.ac_power = power_supply_is_system_supplied() > 0;
> +   adev->pm.old_ac_power = true;
>
> atomic_set(&adev->throttling_logging_enabled, 1);
> /*
> diff --git a/drivers/gpu/drm/amd/display/dc/dce/dmub_abm.c 
> b/drivers/gpu/drm/amd/display/dc/dce/dmub_abm.c
> index 54a1408c8015c..478a734b66926 100644
> --- a/drivers/gpu/drm/amd/display/dc/dce/dmub_abm.c
> +++ b/drivers/gpu/drm/amd/display/dc/dce/dmub_abm.c
> @@ -23,6 +23,8 @@
>   *
>   */
>
> +#include 
> +#include "amdgpu.h"
>  #include "dmub_abm.h"
>  #include "dce_abm.h"
>  #include "dc.h"
> @@ -51,6 +53,7 @@
>  #define DISABLE_ABM_IMMEDIATELY 255
>
>
> +extern uint amdgpu_dm_abm_level;
>
>  static void dmub_abm_enable_fractional_pwm(struct dc_context *dc)
>  {
> @@ -117,28 +120,6 @@ static void dmub_abm_init(struct abm *abm, uint32_t 
> backlight)
> dmub_abm_enable_fractional_pwm(abm->ctx);
>  }
>
> -static unsigned int dmub_abm_get_current_backlight(struct abm *abm)
> -{
> -   struct dce_abm *dce_abm = TO_DMUB_ABM(abm);
> -   unsigned int backlight = REG_READ(BL1_PWM_CURRENT_ABM_LEVEL);
> -
> -   /* return backlight in hardware format which is unsigned 17 bits, with
> -* 1 bit integer and 16 bit fractional
> -*/
> -   return backlight;
> -}
> -
> -static unsigned int dmub_abm_get_target_backlight(struct abm *abm)
> -{
> -   struct dce_abm *dce_abm = TO_DMUB_ABM(abm);
> -   unsigned int backlight = REG_READ(BL1_PWM_TARGET_ABM_LEVEL);
> -
> -   /* return backlight in hardware format which is unsigned 17 bits, with
> -* 1 bit integer and 16 bit fractional
> -*/
> -   return backlight;
> -}
> -
>  static bool dmub_abm_set_level(struct abm *abm, uint32_t level)
>  {
> union dmub_rb_cmd cmd;
> @@ -148,6 +129,9 @@ static bool dmub_abm_set_level(struct abm *abm, uint32_t 
> level)
> int edp_num;
> uint8_t panel_mask = 0;
>
> +   if (power_supply_is_system_supplied() > 0)
> +   level = 0;
> +
> get_edp_links(dc->dc, edp_links, &edp_num);
>
> for (i = 0; i < edp_num; i++) {
> @@ -170,6 +154,36 @@ static bool dmub_abm_set_level(struct abm *abm, uint32_t 
> level)
> return true;
>  }
>
> +static unsigned int dmub_abm_get_current_backlight(struct abm *abm)
> +{
> +   struct dce_abm *dce_abm = TO_DMUB_ABM(abm);
> +   unsigned int backlight = REG_READ(BL1_PWM_CURRENT_ABM_LEVEL);
> +   struct dc_context *dc = abm->ctx;
> +   struct amdgpu_device *adev = dc->driver_context;
> +
> +   if (adev->pm.ac_power != adev->pm.old_ac_power) {
> +   dmub_abm_set_level(abm, amdgpu_dm_abm_level);
> +   adev->pm.ac_power = power_supply_is_system_supplied() > 0;
> +   adev->pm.old_ac_power = adev->pm.ac_power;
> +   }
> +
> +   /* return backlight in hardware format which is unsigned 17 bits, with
> +* 1 bit integer and 16 bit fractional
> +*/
> +   return backlight;
> +}
> +
> +static unsigned int dmub_abm_get_targ

Re: [PATCH] drm/amdgpu: Support AMDGPU RAS debugfs poll interface

2022-03-29 Thread Christian König

Well that is a really bad idea.

Tools should only use sysfs functionality, not debugfs since that isn't 
stable.


And I totally agree to Hawking here. As I wrote on the other mail this 
patch is a no-go.


Regards,
Christian.

Am 29.03.22 um 15:18 schrieb Chai, Thomas:

[AMD Official Use Only]

Sorry for the confusing commit message.
This interface is only for a new function of the amdgpu ras tool. There is no
impact on currently existing tools and scripts.

-Original Message-
From: Zhang, Hawking 
Sent: Tuesday, March 29, 2022 4:33 PM
To: Chai, Thomas ; amd-gfx@lists.freedesktop.org
Cc: Zhou1, Tao ; Clements, John 
Subject: RE: [PATCH] drm/amdgpu: Support AMDGPU RAS debugfs poll interface

[AMD Official Use Only]

I'm not sure I understand the fix correctly - it seems to me it is trying to
stop user/test cases that initiate error injection requests back-to-back? But
anyway, we shouldn't make the change or leverage debugfs for that purpose, and
there is no guarantee test scripts/applications will follow the rule as well.

I guess we need to identify the root cause case by case and stop the invalid 
request in kernel driver.

Regards,
Hawking

-Original Message-
From: Chai, Thomas 
Sent: Tuesday, March 29, 2022 15:39
To: amd-gfx@lists.freedesktop.org
Cc: Chai, Thomas ; Zhang, Hawking ; Zhou1, Tao 
; Clements, John ; Chai, Thomas 

Subject: [PATCH] drm/amdgpu: Support AMDGPU RAS debugfs poll interface

Some AMDGPU RAS debugfs operations, like UE injection, can cause a GPU reset.
Before issuing the next debugfs operation, the application should call poll()
to check whether the GPU has finished recovering.

Signed-off-by: yipechai 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 38 -  
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  6 
  2 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 4bbed76b79c8..337e3e247a45 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -452,6 +452,12 @@ static ssize_t amdgpu_ras_debugfs_ctrl_write(struct file 
*f,
  
  		/* data.inject.address is offset instead of absolute gpu address */

ret = amdgpu_ras_error_inject(adev, &data.inject);
+
+   if (!ret && (data.head.type == 
AMDGPU_RAS_ERROR__MULTI_UNCORRECTABLE)) {
+   struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
+
+   con->ras_ue_injected = 1;
+   }
break;
default:
ret = -EINVAL;
@@ -464,6 +470,30 @@ static ssize_t amdgpu_ras_debugfs_ctrl_write(struct file 
*f,
return size;
  }
  
+/**

+ * DOC: Support AMDGPU RAS debugfs poll interface
+ *
+ * Some AMDGPU RAS debugfs operations like UE injection
+ * can cause gpu reset. Before doing the next debugfs
+ * operation, the application should call poll to check
+ * if gpu is in recovering status.
+ */
+static __poll_t amdgpu_ras_debugfs_ctrl_poll(struct file *f, struct
+poll_table_struct *wait) {
+   struct amdgpu_device *adev = (struct amdgpu_device 
*)file_inode(f)->i_private;
+   struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
+   __poll_t mask = 0;
+
+   /* For UE injection, wait for gpu to finish recovery */
+   if (con->ras_ue_injected)
+   poll_wait(f, &con->gpu_ready_wait_wq, wait);
+
+   if (!atomic_read(&con->in_recovery))
+   mask = EPOLLIN | EPOLLRDNORM;
+
+   return mask;
+}
+
  /**
   * DOC: AMDGPU RAS debugfs EEPROM table reset interface
   *
@@ -503,6 +533,7 @@ static ssize_t amdgpu_ras_debugfs_eeprom_write(struct file 
*f,
  
  static const struct file_operations amdgpu_ras_debugfs_ctrl_ops = {

.owner = THIS_MODULE,
+   .poll = amdgpu_ras_debugfs_ctrl_poll,
.read = NULL,
.write = amdgpu_ras_debugfs_ctrl_write,
.llseek = default_llseek
@@ -1837,6 +1868,11 @@ static void amdgpu_ras_do_recovery(struct work_struct 
*work)
if (amdgpu_device_should_recover_gpu(ras->adev))
amdgpu_device_gpu_recover(ras->adev, NULL);
atomic_set(&ras->in_recovery, 0);
+
+   if (ras->ras_ue_injected) {
+   ras->ras_ue_injected = 0;
+   wake_up_all(&ras->gpu_ready_wait_wq);
+   }
  }
  
  /* alloc/realloc bps array */

@@ -2279,7 +2315,7 @@ int amdgpu_ras_init(struct amdgpu_device *adev)
INIT_DELAYED_WORK(&con->ras_counte_delay_work, amdgpu_ras_counte_dw);
atomic_set(&con->ras_ce_count, 0);
atomic_set(&con->ras_ue_count, 0);
-
+   init_waitqueue_head(&con->gpu_ready_wait_wq);
con->objs = (struct ras_manager *)(con + 1);
  
  	amdgpu_ras_set_context(adev, con);

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
index 606df8869b89..aea6bbb71501 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
@@ -379,

RE: [PATCH] drm/amdgpu: Support AMDGPU RAS debugfs poll interface

2022-03-29 Thread Chai, Thomas
[AMD Official Use Only]

Sorry for the confusing commit message.
This interface is only for a new function of the amdgpu ras tool. There is no
impact on currently existing tools and scripts.

-Original Message-
From: Zhang, Hawking  
Sent: Tuesday, March 29, 2022 4:33 PM
To: Chai, Thomas ; amd-gfx@lists.freedesktop.org
Cc: Zhou1, Tao ; Clements, John 
Subject: RE: [PATCH] drm/amdgpu: Support AMDGPU RAS debugfs poll interface

[AMD Official Use Only]

I'm not sure I understand the fix correctly - it seems to me it is trying to
stop user/test cases that initiate error injection requests back-to-back? But
anyway, we shouldn't make the change or leverage debugfs for that purpose, and
there is no guarantee test scripts/applications will follow the rule as well.

I guess we need to identify the root cause case by case and stop the invalid 
request in kernel driver.

Regards,
Hawking

-Original Message-
From: Chai, Thomas 
Sent: Tuesday, March 29, 2022 15:39
To: amd-gfx@lists.freedesktop.org
Cc: Chai, Thomas ; Zhang, Hawking ; 
Zhou1, Tao ; Clements, John ; Chai, 
Thomas 
Subject: [PATCH] drm/amdgpu: Support AMDGPU RAS debugfs poll interface

Some AMDGPU RAS debugfs operations, like UE injection, can cause a GPU reset.
Before issuing the next debugfs operation, the application should call poll()
to check whether the GPU has finished recovering.

Signed-off-by: yipechai 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 38 -  
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  6 
 2 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 4bbed76b79c8..337e3e247a45 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -452,6 +452,12 @@ static ssize_t amdgpu_ras_debugfs_ctrl_write(struct file 
*f,
 
/* data.inject.address is offset instead of absolute gpu 
address */
ret = amdgpu_ras_error_inject(adev, &data.inject);
+
+   if (!ret && (data.head.type == 
AMDGPU_RAS_ERROR__MULTI_UNCORRECTABLE)) {
+   struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
+
+   con->ras_ue_injected = 1;
+   }
break;
default:
ret = -EINVAL;
@@ -464,6 +470,30 @@ static ssize_t amdgpu_ras_debugfs_ctrl_write(struct file 
*f,
return size;
 }
 
+/**
+ * DOC: Support AMDGPU RAS debugfs poll interface
+ *
+ * Some AMDGPU RAS debugfs operations like UE injection
+ * can cause gpu reset. Before doing the next debugfs
+ * operation, the application should call poll to check
+ * if gpu is in recovering status.
+ */
+static __poll_t amdgpu_ras_debugfs_ctrl_poll(struct file *f, struct poll_table_struct *wait)
+{
+   struct amdgpu_device *adev = (struct amdgpu_device *)file_inode(f)->i_private;
+   struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
+   __poll_t mask = 0;
+
+   /* For UE injection, wait for gpu to finish recovery */
+   if (con->ras_ue_injected)
+   poll_wait(f, &con->gpu_ready_wait_wq, wait);
+
+   if (!atomic_read(&con->in_recovery))
+   mask = EPOLLIN | EPOLLRDNORM;
+
+   return mask;
+}
+
 /**
  * DOC: AMDGPU RAS debugfs EEPROM table reset interface
  *
@@ -503,6 +533,7 @@ static ssize_t amdgpu_ras_debugfs_eeprom_write(struct file *f,
 
 static const struct file_operations amdgpu_ras_debugfs_ctrl_ops = {
.owner = THIS_MODULE,
+   .poll = amdgpu_ras_debugfs_ctrl_poll,
.read = NULL,
.write = amdgpu_ras_debugfs_ctrl_write,
.llseek = default_llseek
@@ -1837,6 +1868,11 @@ static void amdgpu_ras_do_recovery(struct work_struct *work)
if (amdgpu_device_should_recover_gpu(ras->adev))
amdgpu_device_gpu_recover(ras->adev, NULL);
atomic_set(&ras->in_recovery, 0);
+
+   if (ras->ras_ue_injected) {
+   ras->ras_ue_injected = 0;
+   wake_up_all(&ras->gpu_ready_wait_wq);
+   }
 }
 
 /* alloc/realloc bps array */
@@ -2279,7 +2315,7 @@ int amdgpu_ras_init(struct amdgpu_device *adev)
INIT_DELAYED_WORK(&con->ras_counte_delay_work, amdgpu_ras_counte_dw);
atomic_set(&con->ras_ce_count, 0);
atomic_set(&con->ras_ue_count, 0);
-
+   init_waitqueue_head(&con->gpu_ready_wait_wq);
con->objs = (struct ras_manager *)(con + 1);
 
amdgpu_ras_set_context(adev, con);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
index 606df8869b89..aea6bbb71501 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
@@ -379,6 +379,12 @@ struct amdgpu_ras {
 
/* Indicates smu whether need update bad channel info */
bool update_channel_flag;
+
+   /* UE injection flag */
+   uint32_t  ras_ue_injected;
+
+   /* Waiting for gpu ready work queue */
+   wait_queue_head_t gpu_ready_wait_wq;

Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-29 Thread Christian König
My main question is what does the iris driver do better than radeonsi when 
the client doesn't support the robustness extension?


From Daniels description it sounds like they have at least a partial 
recovery mechanism in place.


Apart from that I completely agree to what you said below.

Christian.

Am 26.03.22 um 01:53 schrieb Olsak, Marek:


[AMD Official Use Only]


amdgpu has 2 resets: soft reset and hard reset.

The soft reset is able to recover from an infinite loop and even some 
GPU hangs due to bad shaders or bad states. The soft reset uses a 
signal that kills all currently-running shaders of a certain process 
(VM context), which unblocks the graphics pipeline, so draws and 
command buffers finish, but with incorrect results. This can then cause a 
hard hang if the shader was supposed to signal work completion through 
a shader store instruction and a non-shader consumer is waiting for it 
(skipping the store instruction by killing the shader won't signal the 
work, and thus the consumer will be stuck, requiring a hard reset).


The hard reset can recover from other hangs, which is great, but it 
may use a PCI reset, which erases VRAM on dGPUs. APUs don't lose 
memory contents, but we should assume that any process that had 
running jobs on the GPU during a GPU reset has its memory resources in 
an inconsistent state, and thus following command buffers can cause 
another GPU hang. The shader store example above is enough to cause 
another hard hang due to incorrect content in memory resources, which 
can contain synchronization primitives that are used internally by the 
hardware.


Asking the driver to replay a command buffer that caused a hang is a 
sure way to hang it again. Unrelated processes can be affected due to 
lost VRAM or the misfortune of using the GPU while the GPU hang 
occurred. The window system should recreate GPU resources and redraw 
everything without affecting applications. If apps use GL, they should 
do the same. Processes that can't recover by redrawing content can be 
terminated or left alone, but they shouldn't be allowed to submit work 
to the GPU anymore.


dEQP only exercises the soft reset. I think WebGL is only able to 
trigger a soft reset at this point, but Vulkan can also trigger a hard 
reset.


Marek

*From:* Koenig, Christian 
*Sent:* March 23, 2022 11:25
*To:* Daniel Vetter ; Daniel Stone 
; Olsak, Marek ; 
Grodzovsky, Andrey 
*Cc:* Rob Clark ; Rob Clark 
; Sharma, Shashank ; 
Christian König ; Somalapuram, 
Amaranath ; Abhinav Kumar 
; dri-devel 
; amd-gfx list 
; Deucher, Alexander 
; Shashank Sharma 


*Subject:* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
[Adding Marek and Andrey as well]

Am 23.03.22 um 16:14 schrieb Daniel Vetter:
> On Wed, 23 Mar 2022 at 15:07, Daniel Stone  wrote:
>> Hi,
>>
>> On Mon, 21 Mar 2022 at 16:02, Rob Clark  wrote:
>>> On Mon, Mar 21, 2022 at 2:30 AM Christian König
>>>  wrote:
 Well you can, it just means that their contexts are lost as well.
>>> Which is rather inconvenient when deqp-egl reset tests, for example,
>>> take down your compositor ;-)
>> Yeah. Or anything WebGL.
>>
>> System-wide collateral damage is definitely a non-starter. If that
>> means that the userspace driver has to do what iris does and ensure
>> everything's recreated and resubmitted, that works too, just as long
>> as the response to 'my adblocker didn't detect a crypto miner ad'  is
>> something better than 'shoot the entire user session'.
> Not sure where that idea came from, I thought at least I made it clear
> that legacy gl _has_ to recover. It's only vk and arb_robustness gl
> which should die without recovery attempt.
>
> The entire discussion here is who should be responsible for replay and
> at least if you can decide the uapi, then punting that entirely to
> userspace is a good approach.

Yes, completely agree. We have the approach of re-submitting things in
the kernel and that failed quite miserably.

In other words, currently a GPU reset has something like a 99% chance of
taking down your whole desktop.

Daniel can you briefly explain what exactly iris does when a lost
context is detected without gl robustness?

It sounds like you guys got that working quite well.

Thanks,
Christian.

>
> Ofc it'd be nice if the collateral damage is limited, i.e. requests
> not currently on the gpu, or on different engines and all that
> shouldn't be nuked, if possible.
>
> Also ofc since msm uapi is that the kernel tries to recover there's
> not much we can do there, contexts cannot be shot. But still trying to
> replay them as much as possible feels a bit like overkill.
> -Daniel
>
>> Cheers,
>> Daniel
>
>



RE: [PATCH 2/2] drm/amdgpu: Add unique_id support for sienna cichlid

2022-03-29 Thread Russell, Kent
[AMD Official Use Only]

Ah I didn’t see that we added a Metrics v3 in there between now and when I made 
my first version of this patch. I'll test it out and make sure that things are 
still looking alright. Thanks Lijo!

 Kent

> -Original Message-
> From: Lazar, Lijo 
> Sent: Tuesday, March 29, 2022 12:34 AM
> To: Russell, Kent ; amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander 
> Subject: Re: [PATCH 2/2] drm/amdgpu: Add unique_id support for sienna cichlid
>
>
>
> On 3/28/2022 10:37 PM, Kent Russell wrote:
> > This is being added to SMU Metrics, so add the required tie-ins in the
> > kernel. Also create the corresponding unique_id sysfs file.
> >
> > v2: Add FW version check, remove SMU mutex
> > v3: Fix style warning
> > v4: Add MP1 IP_VERSION check to FW version check
> >
> > Signed-off-by: Kent Russell 
> > Reviewed-by: Alex Deucher 
> > ---
> >   drivers/gpu/drm/amd/pm/amdgpu_pm.c|  1 +
> >   .../pmfw_if/smu11_driver_if_sienna_cichlid.h  | 10 -
> >   .../amd/pm/swsmu/smu11/sienna_cichlid_ppt.c   | 37 +++
> >   3 files changed, 46 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/pm/amdgpu_pm.c
> b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
> > index 4151db2678fb..4a9aabc16fbc 100644
> > --- a/drivers/gpu/drm/amd/pm/amdgpu_pm.c
> > +++ b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
> > @@ -1993,6 +1993,7 @@ static int default_attr_update(struct amdgpu_device 
> > *adev,
> struct amdgpu_device_
> > case IP_VERSION(9, 4, 0):
> > case IP_VERSION(9, 4, 1):
> > case IP_VERSION(9, 4, 2):
> > +   case IP_VERSION(10, 3, 0):
> > *states = ATTR_STATE_SUPPORTED;
> > break;
> > default:
> > diff --git
> a/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu11_driver_if_sienna_cichlid.h
> b/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu11_driver_if_sienna_cichlid.h
> > index 3e4a314ef925..037f38b0fa15 100644
> > --- 
> > a/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu11_driver_if_sienna_cichlid.h
> > +++ 
> > b/drivers/gpu/drm/amd/pm/swsmu/inc/pmfw_if/smu11_driver_if_sienna_cichlid.h
> > @@ -1419,8 +1419,11 @@ typedef struct {
> > uint8_t  PcieRate   ;
> > uint8_t  PcieWidth  ;
> > uint16_t AverageGfxclkFrequencyTarget;
> > -  uint16_t Padding16_2;
> >
> > +  uint32_t PublicSerialNumLower32;
> > +  uint32_t PublicSerialNumUpper32;
> > +
> > +  uint16_t Padding16_2;
> >   } SmuMetrics_t;
> >
> >   typedef struct {
> > @@ -1476,8 +1479,11 @@ typedef struct {
> > uint8_t  PcieRate   ;
> > uint8_t  PcieWidth  ;
> > uint16_t AverageGfxclkFrequencyTarget;
> > -  uint16_t Padding16_2;
> >
> > +  uint32_t PublicSerialNumLower32;
> > +  uint32_t PublicSerialNumUpper32;
> > +
> > +  uint16_t Padding16_2;
> >   } SmuMetrics_V2_t;
> >
>
> Hi Kent,
>
> Are you using the latest code?
>
>uint8_t  PcieWidth;
>uint16_t AverageGfxclkFrequencyTarget;
>
> } SmuMetrics_V3_t;
>
>
> V3 is what I see as the latest in our source and this struct is
> consistent with the latest PMFW.
>
> Thanks,
> Lijo
>
> >   typedef struct {
> > diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
> b/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
> > index 38f04836c82f..140005bf6d9e 100644
> > --- a/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
> > +++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
> > @@ -481,6 +481,42 @@ static int sienna_cichlid_setup_pptable(struct 
> > smu_context
> *smu)
> > return sienna_cichlid_patch_pptable_quirk(smu);
> >   }
> >
> > +static void sienna_cichlid_get_unique_id(struct smu_context *smu)
> > +{
> > +   struct amdgpu_device *adev = smu->adev;
> > +   struct smu_table_context *smu_table = &smu->smu_table;
> > +   SmuMetrics_t *metrics =
> > +   &(((SmuMetricsExternal_t 
> > *)(smu_table->metrics_table))->SmuMetrics);
> > +   SmuMetrics_V2_t *metrics_v2 =
> > +   &(((SmuMetricsExternal_t 
> > *)(smu_table->metrics_table))->SmuMetrics_V2);
> > +   uint32_t upper32 = 0, lower32 = 0;
> > +   bool use_metrics_v2;
> > +   int ret;
> > +
> > +   /* Only supported as of version 0.58.83.0 and only on Sienna Cichlid */
> > +   if (smu->smc_fw_version < 0x3A5300 ||
> > +   smu->adev->ip_versions[MP1_HWIP][0] != IP_VERSION(11, 0, 7))
> > +   return;
> > +
> > +   ret = smu_cmn_get_metrics_table(smu, NULL, false);
> > +   if (ret)
> > +   goto out_unlock;
> > +
> > +   use_metrics_v2 = ((smu->adev->ip_versions[MP1_HWIP][0] == 
> > IP_VERSION(11, 0,
> 7)) &&
> > +   (smu->smc_fw_version >= 0x3A4300)) ? true : false;
> > +
> > +   upper32 = use_metrics_v2 ? metrics_v2->PublicSerialNumUpper32 :
> > +  metrics->PublicSerialNumUpper32;
> > +   lower32 = use_metrics_v2 ? metrics_v2->PublicSerialNumLower32 :
> > +  metrics->PublicSerialNumLower32;
> > +
> > +out_unlock:
> > +
> 

Re: [PATCH v2] drm: add a check to verify the size alignment

2022-03-29 Thread Christian König

Am 29.03.22 um 13:28 schrieb Arunpravin Paneer Selvam:

On 23/03/22 1:15 pm, Christian König wrote:

Am 23.03.22 um 08:34 schrieb Arunpravin Paneer Selvam:

Add a simple check to reject any size not aligned to the
min_page_size.

Handle instances where the size is not aligned with min_page_size.
For example, Unigine Heaven issues allocation requests where the
required pages are 257 and the alignment request is 256. To allocate
the left-over 1 page, the iteration continues to find the order
value, which is 0, and when it is compared with min_order = 8 it
triggers the BUG_ON(order < min_order). To avoid this problem, we
added a simple check to return -EINVAL if the size is not aligned
to min_page_size.

v2: Added more details to the commit description

Signed-off-by: Arunpravin Paneer Selvam 
Suggested-by: Matthew Auld 
---
   drivers/gpu/drm/drm_buddy.c | 3 +++
   1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/drm_buddy.c b/drivers/gpu/drm/drm_buddy.c
index 72f52f293249..b503c88786b0 100644
--- a/drivers/gpu/drm/drm_buddy.c
+++ b/drivers/gpu/drm/drm_buddy.c
@@ -661,6 +661,9 @@ int drm_buddy_alloc_blocks(struct drm_buddy *mm,
if (range_overflows(start, size, mm->size))
return -EINVAL;
   
+	if (WARN_ON(!IS_ALIGNED(size, min_page_size)))

+   return -EINVAL;
+

I'm not that happy with the handling here.

See a minimum page size larger than the requested size is perfectly
valid, it just means that the remaining pages need to be trimmed.

In other words when the request is to allocate 1 page with an alignment
of 256 we just need to give the remaining 255 pages back to the allocator.

In one of the previous mails Matthew explained that i915 expects
-EINVAL to be returned if the size is not aligned to min_page_size.

Can we just modify the amdgpu code to round_up the size to the
min_page_size value and keep this error handling in drm_buddy.c?


Yeah, I'm fine with that as well now.

I realized that this is probably the easiest option to check if an 
allocation is contiguous or not between the alloc and trim.


So having two functions for that sounds like a good idea to me.

Thanks,
Christian.


Regards,
Christian.


/* Actual range allocation */
if (start + size == end)
return __drm_buddy_alloc_range(mm, start, size, blocks);

base-commit: 056d47eaf6ea753fa2e21da31f9cbd8b721bbb7b
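
The round_up approach discussed above can be sketched like this: a userspace model of the kernel's round_up() macro for power-of-two alignments, with illustrative function names that are not taken from the patch.

```c
#include <stdint.h>

/* Power-of-two round up, same shape as the kernel's round_up() macro.
 * 'align' must be a power of two, as min_page_size is in this discussion. */
static inline uint64_t round_up_pow2(uint64_t x, uint64_t align)
{
	return (x + align - 1) & ~(align - 1);
}

/* Sketch: align the request on the amdgpu side before calling
 * drm_buddy_alloc_blocks(), so the "257 pages with 256-page alignment"
 * case from the commit message never reaches BUG_ON(order < min_order).
 * The surplus pages would then be handed back by a trim step. */
static uint64_t aligned_request_pages(uint64_t pages, uint64_t min_pages)
{
	return round_up_pow2(pages, min_pages);
}
```

With the Unigine Heaven case, a 257-page request with a 256-page minimum becomes 512 pages, and the remaining 255 pages are returned to the allocator by the trim.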




Re: [PATCH v2] drm: add a check to verify the size alignment

2022-03-29 Thread Matthew Auld
On Tue, 29 Mar 2022 at 12:17, Arunpravin Paneer Selvam
 wrote:
>
>
>
> On 23/03/22 1:15 pm, Christian König wrote:
> > Am 23.03.22 um 08:34 schrieb Arunpravin Paneer Selvam:
> >> Add a simple check to reject any size not aligned to the
> >> min_page_size.
> >>
> >> handle instances when size is not aligned with the min_page_size.
> >> Unigine Heaven has allocation requests for example required pages
> >> are 257 and alignment request is 256. To allocate the left over 1
> >> page, continues the iteration to find the order value which is 0
> >> and when it compares with min_order = 8, triggers the BUG_ON(order
> >> < min_order). To avoid this problem, we added a simple check to
> >> return -EINVAL if size is not aligned to the min_page_size.
> >>
> >> v2: Added more details to the commit description
> >>
> >> Signed-off-by: Arunpravin Paneer Selvam 
> >> Suggested-by: Matthew Auld 
> >> ---
> >>   drivers/gpu/drm/drm_buddy.c | 3 +++
> >>   1 file changed, 3 insertions(+)
> >>
> >> diff --git a/drivers/gpu/drm/drm_buddy.c b/drivers/gpu/drm/drm_buddy.c
> >> index 72f52f293249..b503c88786b0 100644
> >> --- a/drivers/gpu/drm/drm_buddy.c
> >> +++ b/drivers/gpu/drm/drm_buddy.c
> >> @@ -661,6 +661,9 @@ int drm_buddy_alloc_blocks(struct drm_buddy *mm,
> >>  if (range_overflows(start, size, mm->size))
> >>  return -EINVAL;
> >>
> >> +if (WARN_ON(!IS_ALIGNED(size, min_page_size)))
> >> +return -EINVAL;
> >> +
> >
> > I'm not that happy with the handling here.
> >
> > See a minimum page size larger than the requested size is perfectly
> > valid, it just means that the remaining pages needs to be trimmed.
> >
> > In other words when the request is to allocate 1 page with an alignment
> > of 256 we just need to give the remaining 255 pages back to the allocator.
>
> In one of the previous mail Matthew explained that i915 expects to
> return -EINVAL error code if size is not aligned to min_page_size.

We could also move the WARN_ON() into i915 as a separate patch, and
just change the default buddy behaviour to transparently handle the
rounding + trim, if you prefer. I don't have a strong opinion.

>
> can we just modify in amdgpu code where we round_up the size to the
> min_page_size value and keep this error handling in drm_buddy.c?
> >
> > Regards,
> > Christian.
> >
> >>  /* Actual range allocation */
> >>  if (start + size == end)
> >>  return __drm_buddy_alloc_range(mm, start, size, blocks);
> >>
> >> base-commit: 056d47eaf6ea753fa2e21da31f9cbd8b721bbb7b
> >


Re: [PATCH v11] drm/amdgpu: add drm buddy support to amdgpu

2022-03-29 Thread Christian König

Am 29.03.22 um 13:19 schrieb Arunpravin Paneer Selvam:

[SNIP]

+   pages_left = node->base.num_pages;
   
   	i = 0;

-   spin_lock(&mgr->lock);
while (pages_left) {
-   uint32_t alignment = tbo->page_alignment;
+   if (tbo->page_alignment)
+   min_page_size = tbo->page_alignment << PAGE_SHIFT;
+   else
+   min_page_size = mgr->default_page_size;

The handling here looks extremely awkward to me.

min_page_size should be determined outside of the loop, based on 
default_page_size, alignment and contiguous flag.

I kept the min_page_size determination logic inside the loop for the
2GiB+ requirement cases. Since we now round up the size to the required
alignment, I moved the min_page_size determination logic outside of the
loop in v12. Please review.


Ah! So do we only have the loop so that each allocation isn't bigger 
than 2GiB? If yes couldn't we instead add a max_alloc_size or something 
similar?


BTW: I strongly suggest that you rename min_page_size to min_alloc_size. 
Otherwise somebody could think that those numbers are in pages and not 
bytes.
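
Capping the maximum allocation order inside the allocator, as suggested, could be sketched as follows. fls_u32 models the kernel's fls(), and max_order is an assumed cap (e.g. order 19 for 2GiB of 4KiB pages); none of these names come from the actual drm_buddy code.

```c
#include <stdint.h>

/* fls() as in the kernel: 1-based index of the most significant set
 * bit; fls(0) == 0, fls(1) == 1. */
static int fls_u32(uint32_t x)
{
	int r = 0;

	while (x) {
		r++;
		x >>= 1;
	}
	return r;
}

/* Sketch: instead of an external loop that limits each allocation to
 * 2GiB, clamp the starting order inside the allocator. drm_buddy's own
 * internal loop then splits an oversized request into multiple blocks.
 * Assumes 'pages' fits in 32 bits for this illustration. */
static int starting_order(uint64_t pages, int max_order)
{
	int order = fls_u32((uint32_t)pages) - 1;

	return order > max_order ? max_order : order;
}
```

With this shape, a 4GiB request (order 20 of 4KiB pages) against max_order 19 simply yields two order-19 blocks from the allocator's internal loop.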



Then why do you drop the lock and grab it again inside the loop? And what is 
"i" actually good for?

modified the lock/unlock placement in v12.

"i" is to track when there is a 2GiB+ contiguous allocation request:
first we allocate 2GiB (due to the SG table limit) contiguously and the
remaining pages in the next iteration, hence this request can't be
contiguous. To set the placement flag we make use of the "i" value; in
our case the "i" value becomes 2 and we don't set the flag below.
node->base.placement |= TTM_PL_FLAG_CONTIGUOUS;

If we don't get such requests, I will remove "i".


I'm not sure if that works.

As far as I can see drm_buddy_alloc_blocks() can allocate multiple 
blocks at the same time, but i is only incremented when we loop.


So what you should do instead is to check if node->blocks just contain 
exactly one element after the allocation but before the trim.
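
That check can be sketched like this. A minimal singly linked stand-in replaces the kernel list here so the sketch is self-contained; the real code would simply call list_is_singular(&node->blocks).

```c
#include <stddef.h>

/* Minimal stand-in for the kernel's list of allocated buddy blocks. */
struct block {
	struct block *next;
};

/* True iff the list contains exactly one element, i.e. the buddy
 * allocator satisfied the whole request with a single block. Only then
 * is the allocation physically contiguous and TTM_PL_FLAG_CONTIGUOUS
 * may be set, independent of how often the outer loop iterated. */
static int is_single_block(const struct block *head)
{
	return head != NULL && head->next == NULL;
}
```

This decouples the contiguous decision from the loop counter "i", which only counts iterations, not blocks.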



+
+   /* Limit maximum size to 2GB due to SG table limitations */
+   pages = min(pages_left, 2UL << (30 - PAGE_SHIFT));
   
   		if (pages >= pages_per_node)

-   alignment = pages_per_node;
-
-   r = drm_mm_insert_node_in_range(mm, &node->mm_nodes[i], pages,
-   alignment, 0, place->fpfn,
-   lpfn, mode);
-   if (unlikely(r)) {
-   if (pages > pages_per_node) {
-   if (is_power_of_2(pages))
-   pages = pages / 2;
-   else
-   pages = rounddown_pow_of_two(pages);
-   continue;
-   }
-   goto error_free;
+   min_page_size = pages_per_node << PAGE_SHIFT;
+
+   if (!is_contiguous && !IS_ALIGNED(pages, min_page_size >> PAGE_SHIFT))
+   is_contiguous = 1;
+
+   if (is_contiguous) {
+   pages = roundup_pow_of_two(pages);
+   min_page_size = pages << PAGE_SHIFT;
+
+   if (pages > lpfn)
+   lpfn = pages;
}
   
-		vis_usage += amdgpu_vram_mgr_vis_size(adev, &node->mm_nodes[i]);

-   amdgpu_vram_mgr_virt_start(&node->base, &node->mm_nodes[i]);
-   pages_left -= pages;
+   BUG_ON(min_page_size < mm->chunk_size);
+
+   mutex_lock(&mgr->lock);
+   r = drm_buddy_alloc_blocks(mm, (u64)place->fpfn << PAGE_SHIFT,
+  (u64)lpfn << PAGE_SHIFT,
+  (u64)pages << PAGE_SHIFT,
+  min_page_size,
+  &node->blocks,
+  node->flags);
+   mutex_unlock(&mgr->lock);
+   if (unlikely(r))
+   goto error_free_blocks;
+
++i;
   
   		if (pages > pages_left)

-   pages = pages_left;
+   pages_left = 0;
+   else
+   pages_left -= pages;
}
-   spin_unlock(&mgr->lock);
   
-	if (i == 1)

+   /* Free unused pages for contiguous allocation */
+   if (is_contiguous) {

Well that looks really odd, why is trimming not part of
drm_buddy_alloc_blocks() ?

We didn't make the trim function part of drm_buddy_alloc_blocks() since
we thought this function can be a generic one that other users can call
as well. For example, we now use it for trimming the last block in case
the size is not aligned with min_page_size.


Good argument. Another thing I just realized is that we probably want to 
double check if we 

Re: [PATCH v2] drm: add a check to verify the size alignment

2022-03-29 Thread Arunpravin Paneer Selvam



On 23/03/22 1:15 pm, Christian König wrote:
> Am 23.03.22 um 08:34 schrieb Arunpravin Paneer Selvam:
>> Add a simple check to reject any size not aligned to the
>> min_page_size.
>>
>> handle instances when size is not aligned with the min_page_size.
>> Unigine Heaven has allocation requests for example required pages
>> are 257 and alignment request is 256. To allocate the left over 1
>> page, continues the iteration to find the order value which is 0
>> and when it compares with min_order = 8, triggers the BUG_ON(order
>> < min_order). To avoid this problem, we added a simple check to
>> return -EINVAL if size is not aligned to the min_page_size.
>>
>> v2: Added more details to the commit description
>>
>> Signed-off-by: Arunpravin Paneer Selvam 
>> Suggested-by: Matthew Auld 
>> ---
>>   drivers/gpu/drm/drm_buddy.c | 3 +++
>>   1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/drm_buddy.c b/drivers/gpu/drm/drm_buddy.c
>> index 72f52f293249..b503c88786b0 100644
>> --- a/drivers/gpu/drm/drm_buddy.c
>> +++ b/drivers/gpu/drm/drm_buddy.c
>> @@ -661,6 +661,9 @@ int drm_buddy_alloc_blocks(struct drm_buddy *mm,
>>  if (range_overflows(start, size, mm->size))
>>  return -EINVAL;
>>   
>> +if (WARN_ON(!IS_ALIGNED(size, min_page_size)))
>> +return -EINVAL;
>> +
> 
> I'm not that happy with the handling here.
> 
> See a minimum page size larger than the requested size is perfectly 
> valid, it just means that the remaining pages needs to be trimmed.
> 
> In other words when the request is to allocate 1 page with an alignment 
> of 256 we just need to give the remaining 255 pages back to the allocator.

In one of the previous mails Matthew explained that i915 expects
-EINVAL to be returned if the size is not aligned to min_page_size.

Can we just modify the amdgpu code to round_up the size to the
min_page_size value and keep this error handling in drm_buddy.c?
> 
> Regards,
> Christian.
> 
>>  /* Actual range allocation */
>>  if (start + size == end)
>>  return __drm_buddy_alloc_range(mm, start, size, blocks);
>>
>> base-commit: 056d47eaf6ea753fa2e21da31f9cbd8b721bbb7b
> 


Re: [PATCH v11] drm/amdgpu: add drm buddy support to amdgpu

2022-03-29 Thread Arunpravin Paneer Selvam


> -Original Message-
> From: amd-gfx  On Behalf Of Christian 
> König
> Sent: Wednesday, March 23, 2022 1:07 PM
> To: Paneer Selvam, Arunpravin ; 
> intel-...@lists.freedesktop.org; dri-de...@lists.freedesktop.org; 
> amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander ; matthew.a...@intel.com; 
> dan...@ffwll.ch; Koenig, Christian 
> Subject: Re: [PATCH v11] drm/amdgpu: add drm buddy support to amdgpu
> 
> Am 23.03.22 um 07:25 schrieb Arunpravin Paneer Selvam:
>> [SNIP]
>> @@ -415,48 +409,86 @@ static int amdgpu_vram_mgr_new(struct 
>> ttm_resource_manager *man,
>>  goto error_fini;
>>  }
>>   
>> -mode = DRM_MM_INSERT_BEST;
>> +INIT_LIST_HEAD(&node->blocks);
>> +
>>  if (place->flags & TTM_PL_FLAG_TOPDOWN)
>> -mode = DRM_MM_INSERT_HIGH;
>> +node->flags |= DRM_BUDDY_TOPDOWN_ALLOCATION;
>>   
>> -pages_left = node->base.num_pages;
>> +if (place->fpfn || lpfn != man->size >> PAGE_SHIFT)
>> +/* Allocate blocks in desired range */
>> +node->flags |= DRM_BUDDY_RANGE_ALLOCATION;
>>   
>> -/* Limit maximum size to 2GB due to SG table limitations */
>> -pages = min(pages_left, 2UL << (30 - PAGE_SHIFT));
>> +BUG_ON(!node->base.num_pages);
> 
> Please drop this BUG_ON(). This is not something which prevents further data 
> corruption, so the BUG_ON() is not justified.

ok
> 
>> +pages_left = node->base.num_pages;
>>   
>>  i = 0;
>> -spin_lock(&mgr->lock);
>>  while (pages_left) {
>> -uint32_t alignment = tbo->page_alignment;
>> +if (tbo->page_alignment)
>> +min_page_size = tbo->page_alignment << PAGE_SHIFT;
>> +else
>> +min_page_size = mgr->default_page_size;
> 
> The handling here looks extremely awkward to me.
> 
> min_page_size should be determined outside of the loop, based on 
> default_page_size, alignment and contiguous flag.
I kept the min_page_size determination logic inside the loop for the
2GiB+ requirement cases. Since we now round up the size to the required
alignment, I modified the min_page_size determination logic outside of
the loop in v12. Please review.
> 
> Then why do you drop the lock and grab it again inside the loop? And what is 
> "i" actually good for?
modified the lock/unlock placement in v12.

"i" is to track when there is a 2GiB+ contiguous allocation request:
first we allocate 2GiB (due to the SG table limit) contiguously and the
remaining pages in the next iteration, hence this request can't be
contiguous. To set the placement flag we make use of the "i" value; in
our case the "i" value becomes 2 and we don't set the flag below.
node->base.placement |= TTM_PL_FLAG_CONTIGUOUS;

If we don't get such requests, I will remove "i".

>



>> +
>> +/* Limit maximum size to 2GB due to SG table limitations */
>> +pages = min(pages_left, 2UL << (30 - PAGE_SHIFT));
>>   
>>  if (pages >= pages_per_node)
>> -alignment = pages_per_node;
>> -
>> -r = drm_mm_insert_node_in_range(mm, &node->mm_nodes[i], pages,
>> -alignment, 0, place->fpfn,
>> -lpfn, mode);
>> -if (unlikely(r)) {
>> -if (pages > pages_per_node) {
>> -if (is_power_of_2(pages))
>> -pages = pages / 2;
>> -else
>> -pages = rounddown_pow_of_two(pages);
>> -continue;
>> -}
>> -goto error_free;
>> +min_page_size = pages_per_node << PAGE_SHIFT;
>> +
>> +if (!is_contiguous && !IS_ALIGNED(pages, min_page_size >> 
>> PAGE_SHIFT))
>> +is_contiguous = 1;
>> +
>> +if (is_contiguous) {
>> +pages = roundup_pow_of_two(pages);
>> +min_page_size = pages << PAGE_SHIFT;
>> +
>> +if (pages > lpfn)
>> +lpfn = pages;
>>  }
>>   
>> -vis_usage += amdgpu_vram_mgr_vis_size(adev, &node->mm_nodes[i]);
>> -amdgpu_vram_mgr_virt_start(&node->base, &node->mm_nodes[i]);
>> -pages_left -= pages;
>> +BUG_ON(min_page_size < mm->chunk_size);
>> +
>> +mutex_lock(&mgr->lock);
>> +r = drm_buddy_alloc_blocks(mm, (u64)place->fpfn << PAGE_SHIFT,
>> +   (u64)lpfn << PAGE_SHIFT,
>> +   (u64)pages << PAGE_SHIFT,
>> +   min_page_size,
>> +   &node->blocks,
>> +   node->flags);
>> +mutex_unlock(&mgr->lock);
>> +if (unlikely(r))
>> +goto error_free_blocks;
>> +
>>  ++i

RE: [PATCH V2] drm/amdgpu: fix incorrect GCR_GENERAL_CNTL address

2022-03-29 Thread Ji, Ruili
[AMD Official Use Only]

Hi Paul,

This is not related to any issue.

Kind regards,
Ruili

-Original Message-
From: Paul Menzel 
Sent: March 29, 2022 16:16
To: Ji, Ruili 
Cc: amd-gfx@lists.freedesktop.org; Zhang, Yifan ; Liu, 
Aaron ; Liang, Prike ; Huang, Ray 
; Deucher, Alexander ; Ji, Ruili 

Subject: Re: [PATCH V2] drm/amdgpu: fix incorrect GCR_GENERAL_CNTL address


Dear Ruili,


Thank you for your patch.

Am 28.03.22 um 06:58 schrieb Ji, Ruili:
> From: Ruili Ji 
>
> gfx10.3.3/gfx10.3.6/gfx10.3.7 shall use 0x1580 address for
> GCR_GENERAL_CNTL

Is any "user-visible" problem fixed by this?

Please add a Fixes tag.


Kind regards,

Paul


> Signed-off-by: Ruili Ji 
> ---
>   drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 6 +++---
>   1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> index 99df18ae7316..e4c9d92ac381 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> @@ -3300,7 +3300,7 @@ static const struct soc15_reg_golden 
> golden_settings_gc_10_3_3[] =
>   SOC15_REG_GOLDEN_VALUE(GC, 0, mmDB_DEBUG3, 0x, 0x0280),
>   SOC15_REG_GOLDEN_VALUE(GC, 0, mmDB_DEBUG4, 0x, 0x0080),
>   SOC15_REG_GOLDEN_VALUE(GC, 0, mmGB_ADDR_CONFIG, 0x0c1807ff, 0x0242),
> - SOC15_REG_GOLDEN_VALUE(GC, 0, mmGCR_GENERAL_CNTL, 0x1ff1, 
> 0x0500),
> + SOC15_REG_GOLDEN_VALUE(GC, 0, mmGCR_GENERAL_CNTL_Vangogh,
> + 0x1ff1, 0x0500),
>   SOC15_REG_GOLDEN_VALUE(GC, 0, mmGL1_PIPE_STEER, 0x00ff, 0x00e4),
>   SOC15_REG_GOLDEN_VALUE(GC, 0, mmGL2_PIPE_STEER_0, 0x, 
> 0x32103210),
>   SOC15_REG_GOLDEN_VALUE(GC, 0, mmGL2_PIPE_STEER_1, 0x,
> 0x32103210), @@ -3436,7 +3436,7 @@ static const struct soc15_reg_golden 
> golden_settings_gc_10_3_6[] =
>   SOC15_REG_GOLDEN_VALUE(GC, 0, mmDB_DEBUG3, 0x, 0x0280),
>   SOC15_REG_GOLDEN_VALUE(GC, 0, mmDB_DEBUG4, 0x, 0x0080),
>   SOC15_REG_GOLDEN_VALUE(GC, 0, mmGB_ADDR_CONFIG, 0x0c1807ff, 0x0042),
> - SOC15_REG_GOLDEN_VALUE(GC, 0, mmGCR_GENERAL_CNTL, 0x1ff1, 
> 0x0500),
> + SOC15_REG_GOLDEN_VALUE(GC, 0, mmGCR_GENERAL_CNTL_Vangogh,
> + 0x1ff1, 0x0500),
>   SOC15_REG_GOLDEN_VALUE(GC, 0, mmGL1_PIPE_STEER, 0x00ff, 0x0044),
>   SOC15_REG_GOLDEN_VALUE(GC, 0, mmGL2_PIPE_STEER_0, 0x, 
> 0x32103210),
>   SOC15_REG_GOLDEN_VALUE(GC, 0, mmGL2_PIPE_STEER_1, 0x,
> 0x32103210), @@ -3461,7 +3461,7 @@ static const struct soc15_reg_golden 
> golden_settings_gc_10_3_7[] = {
>   SOC15_REG_GOLDEN_VALUE(GC, 0, mmDB_DEBUG3, 0x, 0x0280),
>   SOC15_REG_GOLDEN_VALUE(GC, 0, mmDB_DEBUG4, 0x, 0x0080),
>   SOC15_REG_GOLDEN_VALUE(GC, 0, mmGB_ADDR_CONFIG, 0x0c1807ff, 0x0041),
> - SOC15_REG_GOLDEN_VALUE(GC, 0, mmGCR_GENERAL_CNTL, 0x1ff1, 
> 0x0500),
> + SOC15_REG_GOLDEN_VALUE(GC, 0, mmGCR_GENERAL_CNTL_Vangogh,
> + 0x1ff1, 0x0500),
>   SOC15_REG_GOLDEN_VALUE(GC, 0, mmGL1_PIPE_STEER, 0x00ff, 0x00e4),
>   SOC15_REG_GOLDEN_VALUE(GC, 0, mmGL2_PIPE_STEER_0, 0x, 
> 0x32103210),
>   SOC15_REG_GOLDEN_VALUE(GC, 0, mmGL2_PIPE_STEER_1, 0x,
> 0x32103210),


drm/amdgpu: Disable ABM when AC mode

2022-03-29 Thread Ryan Lin
Disable the ABM feature when the system is running on AC power to get
better contrast on the display.

v2: remove "UPSTREAM" from the subject.

Signed-off-by: Ryan Lin 

---
 drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c  |  4 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c|  1 +
 drivers/gpu/drm/amd/display/dc/dce/dmub_abm.c | 58 ---
 drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h   |  1 +
 4 files changed, 42 insertions(+), 22 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
index c560c1ab62ecb..bc8bb9aad2e36 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
@@ -822,6 +822,10 @@ static int amdgpu_acpi_event(struct notifier_block *nb,
struct amdgpu_device *adev = container_of(nb, struct amdgpu_device, 
acpi_nb);
struct acpi_bus_event *entry = (struct acpi_bus_event *)data;
 
+   if (strcmp(entry->device_class, "battery") == 0) {
+   adev->pm.ac_power = power_supply_is_system_supplied() > 0;
+   }
+
if (strcmp(entry->device_class, ACPI_AC_CLASS) == 0) {
if (power_supply_is_system_supplied() > 0)
DRM_DEBUG_DRIVER("pm: AC\n");
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index abfcc1304ba0c..3a0afe7602727 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3454,6 +3454,7 @@ int amdgpu_device_init(struct amdgpu_device *adev,
 
adev->gfx.gfx_off_req_count = 1;
adev->pm.ac_power = power_supply_is_system_supplied() > 0;
+   adev->pm.old_ac_power = true;
 
atomic_set(&adev->throttling_logging_enabled, 1);
/*
diff --git a/drivers/gpu/drm/amd/display/dc/dce/dmub_abm.c 
b/drivers/gpu/drm/amd/display/dc/dce/dmub_abm.c
index 54a1408c8015c..478a734b66926 100644
--- a/drivers/gpu/drm/amd/display/dc/dce/dmub_abm.c
+++ b/drivers/gpu/drm/amd/display/dc/dce/dmub_abm.c
@@ -23,6 +23,8 @@
  *
  */
 
+#include 
+#include "amdgpu.h"
 #include "dmub_abm.h"
 #include "dce_abm.h"
 #include "dc.h"
@@ -51,6 +53,7 @@
 #define DISABLE_ABM_IMMEDIATELY 255
 
 
+extern uint amdgpu_dm_abm_level;
 
 static void dmub_abm_enable_fractional_pwm(struct dc_context *dc)
 {
@@ -117,28 +120,6 @@ static void dmub_abm_init(struct abm *abm, uint32_t 
backlight)
dmub_abm_enable_fractional_pwm(abm->ctx);
 }
 
-static unsigned int dmub_abm_get_current_backlight(struct abm *abm)
-{
-   struct dce_abm *dce_abm = TO_DMUB_ABM(abm);
-   unsigned int backlight = REG_READ(BL1_PWM_CURRENT_ABM_LEVEL);
-
-   /* return backlight in hardware format which is unsigned 17 bits, with
-* 1 bit integer and 16 bit fractional
-*/
-   return backlight;
-}
-
-static unsigned int dmub_abm_get_target_backlight(struct abm *abm)
-{
-   struct dce_abm *dce_abm = TO_DMUB_ABM(abm);
-   unsigned int backlight = REG_READ(BL1_PWM_TARGET_ABM_LEVEL);
-
-   /* return backlight in hardware format which is unsigned 17 bits, with
-* 1 bit integer and 16 bit fractional
-*/
-   return backlight;
-}
-
 static bool dmub_abm_set_level(struct abm *abm, uint32_t level)
 {
union dmub_rb_cmd cmd;
@@ -148,6 +129,9 @@ static bool dmub_abm_set_level(struct abm *abm, uint32_t 
level)
int edp_num;
uint8_t panel_mask = 0;
 
+   if (power_supply_is_system_supplied() > 0)
+   level = 0;
+
get_edp_links(dc->dc, edp_links, &edp_num);
 
for (i = 0; i < edp_num; i++) {
@@ -170,6 +154,36 @@ static bool dmub_abm_set_level(struct abm *abm, uint32_t 
level)
return true;
 }
 
+static unsigned int dmub_abm_get_current_backlight(struct abm *abm)
+{
+   struct dce_abm *dce_abm = TO_DMUB_ABM(abm);
+   unsigned int backlight = REG_READ(BL1_PWM_CURRENT_ABM_LEVEL);
+   struct dc_context *dc = abm->ctx;
+   struct amdgpu_device *adev = dc->driver_context;
+
+   if (adev->pm.ac_power != adev->pm.old_ac_power) {
+   dmub_abm_set_level(abm, amdgpu_dm_abm_level);
+   adev->pm.ac_power = power_supply_is_system_supplied() > 0;
+   adev->pm.old_ac_power = adev->pm.ac_power;
+   }
+
+   /* return backlight in hardware format which is unsigned 17 bits, with
+* 1 bit integer and 16 bit fractional
+*/
+   return backlight;
+}
+
+static unsigned int dmub_abm_get_target_backlight(struct abm *abm)
+{
+   struct dce_abm *dce_abm = TO_DMUB_ABM(abm);
+   unsigned int backlight = REG_READ(BL1_PWM_TARGET_ABM_LEVEL);
+
+   /* return backlight in hardware format which is unsigned 17 bits, with
+* 1 bit integer and 16 bit fractional
+*/
+   return backlight;
+}
+
 static bool dmub_abm_init_config(struct abm *abm,
const char *src,
unsigned int bytes,
diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h 
b/drivers/gpu/drm/am

Re: [PATCH next, v2] kernel: Add 1 ms delay to init handler to fix s3 resume hang

2022-03-29 Thread Daniel Vetter
On Tue, Mar 29, 2022 at 08:20:24AM +0200, Christian König wrote:
> Am 29.03.22 um 05:05 schrieb Zhenneng Li:
> > This is a workaround for s3 resume hang for r7 340(amdgpu).
> > When we test s3 with r7 340 on arm64 platform, graphics card will hang up,
> > the error message are as follows:
> > Mar  4 01:14:11 greatwall-GW-XX-XXX kernel: [1.599374][ 7] [  T291] 
> > amdgpu :02:00.0: fb0: amdgpudrmfb frame buffer device
> > Mar  4 01:14:11 greatwall-GW-XX-XXX kernel: [1.612869][ 7] [  T291] 
> > [drm:amdgpu_device_ip_late_init [amdgpu]] *ERROR* late_init of IP block 
> >  failed -22
> > Mar  4 01:14:11 greatwall-GW-XX-XXX kernel: [1.623392][ 7] [  T291] 
> > amdgpu :02:00.0: amdgpu_device_ip_late_init failed
> > Mar  4 01:14:11 greatwall-GW-XX-XXX kernel: [1.630696][ 7] [  T291] 
> > amdgpu :02:00.0: Fatal error during GPU init
> > Mar  4 01:14:11 greatwall-GW-XX-XXX kernel: [1.637477][ 7] [  T291] 
> > [drm] amdgpu: finishing device.
> > 
> > On the following hardware:
> > lspci -nn -s 05:00.0
> > 05:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. 
> > [AMD/ATI] Oland [Radeon HD 8570 / R7 240/340 / Radeon 520 OEM] [1002:6611] 
> > (rev 87)
> 
> Well that's rather funny and certainly a NAK. To recap you are adding a
> delay to a delayed work handler. In other words you could delay the work
> handler in the first place :)
> 
> But this is not the reason why that here is a NAK. The more obvious problem
> is that we seem to have a race between the DPM code kicking in to save power
> after driver load and the asynchronous testing if userspace command
> submission works.
> 
> Adding the delay here works around that for the IB submission, but there can
> be other things going on in parallel which can fail as well.

Yeah standard pattern for this is to refcount your dpm code (using power
domains or runtime pm ideally or hand-rolled if you have to). And then
grabbing a dpm reference before you launch that work, and dropping that
when the work has finished.

That gives you a nice clean way to handle all these problems around "right
now I'm really not ready to allow low power states" in a very clean
fashion. arm-soc drivers go totally overboard on this with runtime pm on
all the chip components, that's maybe a bit much but afaiui we could do it
on big pci drivers with power domains too :-)

Also with power domains you get autosuspend delay timers for free and
tunable in sysfs ...

Cheers, Daniel

> 
> Please rather open up a bug report instead.
> 
> Regards,
> Christian.
> 
> > 
> > Signed-off-by: Zhenneng Li 
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 ++
> >   1 file changed, 2 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index 3987ecb24ef4..1eced991b5b2 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -2903,6 +2903,8 @@ static void 
> > amdgpu_device_delayed_init_work_handler(struct work_struct *work)
> > container_of(work, struct amdgpu_device, 
> > delayed_init_work.work);
> > int r;
> > +   mdelay(1);
> > +
> > r = amdgpu_ib_ring_tests(adev);
> > if (r)
> > DRM_ERROR("ib ring test failed (%d).\n", r);
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


RE: [PATCH] drm/amdgpu: Support AMDGPU RAS debugfs poll interface

2022-03-29 Thread Zhang, Hawking
[AMD Official Use Only]

I'm not sure I understand the fix correctly - it seems to me it is trying to 
stop user/test cases that initiate error injection requests back-to-back? But 
anyway, we shouldn't make that change or leverage debugfs for that purpose, and 
there is no guarantee that test scripts/applications would follow the rule anyway. 

I guess we need to identify the root cause case by case and stop the invalid 
requests in the kernel driver.

Regards,
Hawking

-Original Message-
From: Chai, Thomas  
Sent: Tuesday, March 29, 2022 15:39
To: amd-gfx@lists.freedesktop.org
Cc: Chai, Thomas ; Zhang, Hawking ; 
Zhou1, Tao ; Clements, John ; Chai, 
Thomas 
Subject: [PATCH] drm/amdgpu: Support AMDGPU RAS debugfs poll interface

Some AMDGPU RAS debugfs operations like UE injection can cause gpu reset. 
Before doing the next debugfs operation, the application should call poll to 
check if the gpu has finished recovering.

Signed-off-by: yipechai 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 38 -  
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  6 
 2 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 4bbed76b79c8..337e3e247a45 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -452,6 +452,12 @@ static ssize_t amdgpu_ras_debugfs_ctrl_write(struct file 
*f,
 
/* data.inject.address is offset instead of absolute gpu 
address */
ret = amdgpu_ras_error_inject(adev, &data.inject);
+
+   if (!ret && (data.head.type == 
AMDGPU_RAS_ERROR__MULTI_UNCORRECTABLE)) {
+   struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
+
+   con->ras_ue_injected = 1;
+   }
break;
default:
ret = -EINVAL;
@@ -464,6 +470,30 @@ static ssize_t amdgpu_ras_debugfs_ctrl_write(struct file 
*f,
return size;
 }
 
+/**
+ * DOC: Support AMDGPU RAS debugfs poll interface
+ *
+ * Some AMDGPU RAS debugfs operations like UE injection
+ * can cause gpu reset. Before doing the next debugfs
+ * operation, the application should call poll to check
+ * if gpu is in recovering status.
+ */
+static __poll_t amdgpu_ras_debugfs_ctrl_poll(struct file *f, struct 
+poll_table_struct *wait) {
+   struct amdgpu_device *adev = (struct amdgpu_device 
*)file_inode(f)->i_private;
+   struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
+   __poll_t mask = 0;
+
+   /* For UE injection, wait for gpu to finish recovery */
+   if (con->ras_ue_injected)
+   poll_wait(f, &con->gpu_ready_wait_wq, wait);
+
+   if (!atomic_read(&con->in_recovery))
+   mask = EPOLLIN | EPOLLRDNORM;
+
+   return mask;
+}
+
 /**
  * DOC: AMDGPU RAS debugfs EEPROM table reset interface
  *
@@ -503,6 +533,7 @@ static ssize_t amdgpu_ras_debugfs_eeprom_write(struct file 
*f,
 
 static const struct file_operations amdgpu_ras_debugfs_ctrl_ops = {
.owner = THIS_MODULE,
+   .poll = amdgpu_ras_debugfs_ctrl_poll,
.read = NULL,
.write = amdgpu_ras_debugfs_ctrl_write,
.llseek = default_llseek
@@ -1837,6 +1868,11 @@ static void amdgpu_ras_do_recovery(struct work_struct 
*work)
if (amdgpu_device_should_recover_gpu(ras->adev))
amdgpu_device_gpu_recover(ras->adev, NULL);
atomic_set(&ras->in_recovery, 0);
+
+   if (ras->ras_ue_injected) {
+   ras->ras_ue_injected = 0;
+   wake_up_all(&ras->gpu_ready_wait_wq);
+   }
 }
 
 /* alloc/realloc bps array */
@@ -2279,7 +2315,7 @@ int amdgpu_ras_init(struct amdgpu_device *adev)
INIT_DELAYED_WORK(&con->ras_counte_delay_work, amdgpu_ras_counte_dw);
atomic_set(&con->ras_ce_count, 0);
atomic_set(&con->ras_ue_count, 0);
-
+   init_waitqueue_head(&con->gpu_ready_wait_wq);
con->objs = (struct ras_manager *)(con + 1);
 
amdgpu_ras_set_context(adev, con);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
index 606df8869b89..aea6bbb71501 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
@@ -379,6 +379,12 @@ struct amdgpu_ras {
 
/* Indicates smu whether need update bad channel info */
bool update_channel_flag;
+
+   /* UE injection flag */
+   uint32_t  ras_ue_injected;
+
+   /* Waiting for gpu ready work queue */
+   wait_queue_head_t gpu_ready_wait_wq;
 };
 
 struct ras_fs_data {
--
2.25.1


[PATCH] drm/amd/display: Fix by adding FPU protection for dcn30_internal_validate_bw

2022-03-29 Thread CHANDAN VURDIGERE NATARAJ
[WHY]
The general protection fault below is observed when WebGL Aquarium runs for a
longer duration. If drm debug logs are enabled and set to 0x1f, the issue is
observed within 10 minutes of the run.

[  100.717056] general protection fault, probably for non-canonical address 
0x2d33302d32323032:  [#1] PREEMPT SMP NOPTI
[  100.727921] CPU: 3 PID: 1906 Comm: DrmThread Tainted: GW 
5.15.30 #12 d726c6a2d6ebe5cf9223931cbca6892f916fe18b
[  100.754419] RIP: 0010:CalculateSwathWidth+0x1f7/0x44f
[  100.767109] Code: 00 00 00 f2 42 0f 11 04 f0 48 8b 85 88 00 00 00 f2 42 0f 
10 04 f0 48 8b 85 98 00 00 00 f2 42 0f 11 04 f0 48 8b 45 10 0f 57 c0  42 0f 
2a 04 b0 0f 57 c9 f3 43 0f 2a 0c b4 e8 8c e2 f3 ff 48 8b
[  100.781269] RSP: 0018:a9230079eeb0 EFLAGS: 00010246
[  100.812528] RAX: 2d33302d32323032 RBX: 0500 RCX: 
[  100.819656] RDX: 0001 RSI: 99deb712c49c RDI: 
[  100.826781] RBP: a9230079ef50 R08: 99deb712460c R09: 99deb712462c
[  100.833907] R10: 99deb7124940 R11: 99deb7124d70 R12: 99deb712ae44
[  100.841033] R13: 0001 R14:  R15: a9230079f0a0
[  100.848159] FS:  7af121212640() GS:99deba78() 
knlGS:
[  100.856240] CS:  0010 DS:  ES:  CR0: 80050033
[  100.861980] CR2: 209000fe1000 CR3: 00011b18c000 CR4: 00350ee0
[  100.869106] Call Trace:
[  100.871555]  <TASK>
[  100.873655]  ? asm_sysvec_reschedule_ipi+0x12/0x20
[  100.878449]  CalculateSwathAndDETConfiguration+0x1a3/0x6dd
[  100.883937]  dml31_ModeSupportAndSystemConfigurationFull+0x2ce4/0x76da
[  100.890467]  ? kallsyms_lookup_buildid+0xc8/0x163
[  100.895173]  ? kallsyms_lookup_buildid+0xc8/0x163
[  100.899874]  ? __sprint_symbol+0x80/0x135
[  100.903883]  ? dm_update_plane_state+0x3f9/0x4d2
[  100.908500]  ? symbol_string+0xb7/0xde
[  100.912250]  ? number+0x145/0x29b
[  100.915566]  ? vsnprintf+0x341/0x5ff
[  100.919141]  ? desc_read_finalized_seq+0x39/0x87
[  100.923755]  ? update_load_avg+0x1b9/0x607
[  100.927849]  ? compute_mst_dsc_configs_for_state+0x7d/0xd5b
[  100.933416]  ? fetch_pipe_params+0xa4d/0xd0c
[  100.937686]  ? dc_fpu_end+0x3d/0xa8
[  100.941175]  dml_get_voltage_level+0x16b/0x180
[  100.945619]  dcn30_internal_validate_bw+0x10e/0x89b
[  100.950495]  ? dcn31_validate_bandwidth+0x68/0x1fc
[  100.955285]  ? resource_build_scaling_params+0x98b/0xb8c
[  100.960595]  ? dcn31_validate_bandwidth+0x68/0x1fc
[  100.965384]  dcn31_validate_bandwidth+0x9a/0x1fc
[  100.970001]  dc_validate_global_state+0x238/0x295
[  100.974703]  amdgpu_dm_atomic_check+0x9c1/0xbce
[  100.979235]  ? _printk+0x59/0x73
[  100.982467]  drm_atomic_check_only+0x403/0x78b
[  100.986912]  drm_mode_atomic_ioctl+0x49b/0x546
[  100.991358]  ? drm_ioctl+0x1c1/0x3b3
[  100.994936]  ? drm_atomic_set_property+0x92a/0x92a
[  100.999725]  drm_ioctl_kernel+0xdc/0x149
[  101.003648]  drm_ioctl+0x27f/0x3b3
[  101.007051]  ? drm_atomic_set_property+0x92a/0x92a
[  101.011842]  amdgpu_drm_ioctl+0x49/0x7d
[  101.015679]  __se_sys_ioctl+0x7c/0xb8
[  101.015685]  do_syscall_64+0x5f/0xb8
[  101.015690]  ? __irq_exit_rcu+0x34/0x96

[HOW]
It calls populate_dml_pipes, which uses doubles for initialization.
Adding FPU protection avoids a context switch and the probable loss of the VBA
context, as there is potential contention while drm debug logs are enabled.

Signed-off-by: CHANDAN VURDIGERE NATARAJ 

diff --git a/drivers/gpu/drm/amd/display/dc/dcn31/dcn31_resource.c 
b/drivers/gpu/drm/amd/display/dc/dcn31/dcn31_resource.c
index 826970f2bd0a..f27262417abe 100644
--- a/drivers/gpu/drm/amd/display/dc/dcn31/dcn31_resource.c
+++ b/drivers/gpu/drm/amd/display/dc/dcn31/dcn31_resource.c
@@ -1750,7 +1750,9 @@ bool dcn31_validate_bandwidth(struct dc *dc,
 
BW_VAL_TRACE_COUNT();
 
+   DC_FP_START();
out = dcn30_internal_validate_bw(dc, context, pipes, &pipe_cnt, 
&vlevel, fast_validate);
+   DC_FP_END();
 
// Disable fast_validate to set min dcfclk in calculate_wm_and_dlg
if (pipe_cnt == 0)
-- 
2.25.1



Re: [PATCH] drm/amdgpu: Support AMDGPU RAS debugfs poll interface

2022-03-29 Thread Paul Menzel

Dear Yi Peng,


Thank you for the patch. Two nits regarding the commit message.

Am 29.03.22 um 09:38 schrieb yipechai:

Some AMDGPU RAS debugfs operations like UE injection
can cause gpu reset. Before doing the next debugfs
operation, the application should call poll to check
if the gpu has finished recovering.


Please use a text width of 75 characters per line for the commit message 
body.



Signed-off-by: yipechai 


Could you please configure git (and your email program) to use your full 
name?


$ git config --global user.name "Yi Peng Chai" # no idea if correct

[…]


Kind regards,

Paul


Re: [PATCH V2] drm/amdgpu: fix incorrect GCR_GENERAL_CNTL address

2022-03-29 Thread Paul Menzel

Dear Ruili,


Thank you for your patch.

Am 28.03.22 um 06:58 schrieb Ji, Ruili:

From: Ruili Ji 

gfx10.3.3/gfx10.3.6/gfx10.3.7 shall use 0x1580 address for GCR_GENERAL_CNTL


Is any “user-visible” problem fixed by this?

Please add a Fixes tag.


Kind regards,

Paul



Signed-off-by: Ruili Ji 
---
  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 6 +++---
  1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index 99df18ae7316..e4c9d92ac381 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -3300,7 +3300,7 @@ static const struct soc15_reg_golden 
golden_settings_gc_10_3_3[] =
SOC15_REG_GOLDEN_VALUE(GC, 0, mmDB_DEBUG3, 0x, 0x0280),
SOC15_REG_GOLDEN_VALUE(GC, 0, mmDB_DEBUG4, 0x, 0x0080),
SOC15_REG_GOLDEN_VALUE(GC, 0, mmGB_ADDR_CONFIG, 0x0c1807ff, 0x0242),
-   SOC15_REG_GOLDEN_VALUE(GC, 0, mmGCR_GENERAL_CNTL, 0x1ff1, 
0x0500),
+   SOC15_REG_GOLDEN_VALUE(GC, 0, mmGCR_GENERAL_CNTL_Vangogh, 0x1ff1, 
0x0500),
SOC15_REG_GOLDEN_VALUE(GC, 0, mmGL1_PIPE_STEER, 0x00ff, 0x00e4),
SOC15_REG_GOLDEN_VALUE(GC, 0, mmGL2_PIPE_STEER_0, 0x, 
0x32103210),
SOC15_REG_GOLDEN_VALUE(GC, 0, mmGL2_PIPE_STEER_1, 0x, 
0x32103210),
@@ -3436,7 +3436,7 @@ static const struct soc15_reg_golden 
golden_settings_gc_10_3_6[] =
SOC15_REG_GOLDEN_VALUE(GC, 0, mmDB_DEBUG3, 0x, 0x0280),
SOC15_REG_GOLDEN_VALUE(GC, 0, mmDB_DEBUG4, 0x, 0x0080),
SOC15_REG_GOLDEN_VALUE(GC, 0, mmGB_ADDR_CONFIG, 0x0c1807ff, 0x0042),
-   SOC15_REG_GOLDEN_VALUE(GC, 0, mmGCR_GENERAL_CNTL, 0x1ff1, 
0x0500),
+   SOC15_REG_GOLDEN_VALUE(GC, 0, mmGCR_GENERAL_CNTL_Vangogh, 0x1ff1, 
0x0500),
SOC15_REG_GOLDEN_VALUE(GC, 0, mmGL1_PIPE_STEER, 0x00ff, 0x0044),
SOC15_REG_GOLDEN_VALUE(GC, 0, mmGL2_PIPE_STEER_0, 0x, 
0x32103210),
SOC15_REG_GOLDEN_VALUE(GC, 0, mmGL2_PIPE_STEER_1, 0x, 
0x32103210),
@@ -3461,7 +3461,7 @@ static const struct soc15_reg_golden 
golden_settings_gc_10_3_7[] = {
SOC15_REG_GOLDEN_VALUE(GC, 0, mmDB_DEBUG3, 0x, 0x0280),
SOC15_REG_GOLDEN_VALUE(GC, 0, mmDB_DEBUG4, 0x, 0x0080),
SOC15_REG_GOLDEN_VALUE(GC, 0, mmGB_ADDR_CONFIG, 0x0c1807ff, 0x0041),
-   SOC15_REG_GOLDEN_VALUE(GC, 0, mmGCR_GENERAL_CNTL, 0x1ff1, 
0x0500),
+   SOC15_REG_GOLDEN_VALUE(GC, 0, mmGCR_GENERAL_CNTL_Vangogh, 0x1ff1, 
0x0500),
SOC15_REG_GOLDEN_VALUE(GC, 0, mmGL1_PIPE_STEER, 0x00ff, 0x00e4),
SOC15_REG_GOLDEN_VALUE(GC, 0, mmGL2_PIPE_STEER_0, 0x, 
0x32103210),
SOC15_REG_GOLDEN_VALUE(GC, 0, mmGL2_PIPE_STEER_1, 0x, 
0x32103210),


Re: [PATCH] drm/amdgpu: Support AMDGPU RAS debugfs poll interface

2022-03-29 Thread Christian König

Am 29.03.22 um 09:38 schrieb yipechai:

Some AMDGPU RAS debugfs operations like UE injection
can cause gpu reset. Before doing the next debugfs
operation, the application should call poll to check
if the gpu has finished recovering.


Well NAK. debugfs files are designed to be used from the command-line 
and can be opened and read/written at any time.


If we need to prevent issuing the next operation before the previous one 
has finished we need to use locks for that.


Especially if not doing so results in a hard system lockup.

Regards,
Christian.



Signed-off-by: yipechai 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 38 -
  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  6 
  2 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 4bbed76b79c8..337e3e247a45 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -452,6 +452,12 @@ static ssize_t amdgpu_ras_debugfs_ctrl_write(struct file 
*f,
  
  		/* data.inject.address is offset instead of absolute gpu address */

ret = amdgpu_ras_error_inject(adev, &data.inject);
+
+   if (!ret && (data.head.type == 
AMDGPU_RAS_ERROR__MULTI_UNCORRECTABLE)) {
+   struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
+
+   con->ras_ue_injected = 1;
+   }
break;
default:
ret = -EINVAL;
@@ -464,6 +470,30 @@ static ssize_t amdgpu_ras_debugfs_ctrl_write(struct file 
*f,
return size;
  }
  
+/**

+ * DOC: Support AMDGPU RAS debugfs poll interface
+ *
+ * Some AMDGPU RAS debugfs operations like UE injection
+ * can cause gpu reset. Before doing the next debugfs
+ * operation, the application should call poll to check
+ * if gpu is in recovering status.
+ */
+static __poll_t amdgpu_ras_debugfs_ctrl_poll(struct file *f, struct 
poll_table_struct *wait)
+{
+   struct amdgpu_device *adev = (struct amdgpu_device 
*)file_inode(f)->i_private;
+   struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
+   __poll_t mask = 0;
+
+   /* For UE injection, wait for gpu to finish recovery */
+   if (con->ras_ue_injected)
+   poll_wait(f, &con->gpu_ready_wait_wq, wait);
+
+   if (!atomic_read(&con->in_recovery))
+   mask = EPOLLIN | EPOLLRDNORM;
+
+   return mask;
+}
+
  /**
   * DOC: AMDGPU RAS debugfs EEPROM table reset interface
   *
@@ -503,6 +533,7 @@ static ssize_t amdgpu_ras_debugfs_eeprom_write(struct file 
*f,
  
  static const struct file_operations amdgpu_ras_debugfs_ctrl_ops = {

.owner = THIS_MODULE,
+   .poll = amdgpu_ras_debugfs_ctrl_poll,
.read = NULL,
.write = amdgpu_ras_debugfs_ctrl_write,
.llseek = default_llseek
@@ -1837,6 +1868,11 @@ static void amdgpu_ras_do_recovery(struct work_struct 
*work)
if (amdgpu_device_should_recover_gpu(ras->adev))
amdgpu_device_gpu_recover(ras->adev, NULL);
atomic_set(&ras->in_recovery, 0);
+
+   if (ras->ras_ue_injected) {
+   ras->ras_ue_injected = 0;
+   wake_up_all(&ras->gpu_ready_wait_wq);
+   }
  }
  
  /* alloc/realloc bps array */

@@ -2279,7 +2315,7 @@ int amdgpu_ras_init(struct amdgpu_device *adev)
INIT_DELAYED_WORK(&con->ras_counte_delay_work, amdgpu_ras_counte_dw);
atomic_set(&con->ras_ce_count, 0);
atomic_set(&con->ras_ue_count, 0);
-
+   init_waitqueue_head(&con->gpu_ready_wait_wq);
con->objs = (struct ras_manager *)(con + 1);
  
  	amdgpu_ras_set_context(adev, con);

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
index 606df8869b89..aea6bbb71501 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
@@ -379,6 +379,12 @@ struct amdgpu_ras {
  
  	/* Indicates smu whether need update bad channel info */

bool update_channel_flag;
+
+   /* UE injection flag */
+   uint32_t  ras_ue_injected;
+
+   /* Waiting for gpu ready work queue */
+   wait_queue_head_t gpu_ready_wait_wq;
  };
  
  struct ras_fs_data {




Re: [Intel-gfx] [PATCH v2 03/11] drm/edid: slightly restructure timing and non-timing descriptor structs

2022-03-29 Thread Jani Nikula


I think I'm just going to revert back to my original plan of leaving the
struct restructuring to another time in the future.

BR,
Jani.


On Mon, 28 Mar 2022, kernel test robot  wrote:
> Hi Jani,
>
> I love your patch! Perhaps something to improve:
>
> [auto build test WARNING on drm/drm-next]
> [also build test WARNING on drm-intel/for-linux-next drm-tip/drm-tip v5.17 
> next-20220328]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch]
>
> url:
> https://github.com/intel-lab-lkp/linux/commits/Jani-Nikula/drm-edid-constify-EDID-parsing-with-some-fixes/20220328-171858
> base:   git://anongit.freedesktop.org/drm/drm drm-next
> config: x86_64-randconfig-a003-20220328 
> (https://download.01.org/0day-ci/archive/20220328/202203281926.athxjpnk-...@intel.com/config)
> compiler: gcc-9 (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
> reproduce (this is a W=1 build):
> # 
> https://github.com/intel-lab-lkp/linux/commit/f538c9296c54ce8f878432153584a68939ffc111
> git remote add linux-review https://github.com/intel-lab-lkp/linux
> git fetch --no-tags linux-review 
> Jani-Nikula/drm-edid-constify-EDID-parsing-with-some-fixes/20220328-171858
> git checkout f538c9296c54ce8f878432153584a68939ffc111
> # save the config file to linux build tree
> mkdir build_dir
> make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash drivers/gpu/drm/tiny/
>
> If you fix the issue, kindly add following tag as appropriate
> Reported-by: kernel test robot 
>
> All warnings (new ones prefixed by >>):
>
>drivers/gpu/drm/tiny/gm12u320.c:478:4: error: 'struct detailed_timing' has 
> no member named 'pixel_clock'
>  478 |   .pixel_clock = 3383,
>  |^~~
>>> drivers/gpu/drm/tiny/gm12u320.c:464:36: warning: missing braces around 
>>> initializer [-Wmissing-braces]
>  464 | static struct edid gm12u320_edid = {
>  |^
>..
>  478 |   .pixel_clock = 3383,
>  |  {{  }}
>  479 |   /* hactive = 848, hblank = 256 */
>  480 |   .data.pixel_data.hactive_lo = 0x50,
>  | }}
>  481 |   .data.pixel_data.hblank_lo = 0x00,
>  |}}
>  482 |   .data.pixel_data.hactive_hblank_hi = 0x31,
>  |}}
>  483 |   /* vactive = 480, vblank = 28 */
>  484 |   .data.pixel_data.vactive_lo = 0xe0,
>  | }}
>  485 |   .data.pixel_data.vblank_lo = 0x1c,
>  |}}
>  486 |   .data.pixel_data.vactive_vblank_hi = 0x10,
>  |}}
>  487 |   /* hsync offset 40 pw 128, vsync offset 1 pw 4 */
>  488 |   .data.pixel_data.hsync_offset_lo = 0x28,
>  |  }}
>  489 |   .data.pixel_data.hsync_pulse_width_lo = 0x80,
>  |   }}
>  490 |   .data.pixel_data.vsync_offset_pulse_width_lo = 0x14,
>  |  }}
>  491 |   .data.pixel_data.hsync_vsync_offset_pulse_width_hi = 0x00,
>  |}}
>..
>  494 |  }, {
>  |  }}
>drivers/gpu/drm/tiny/gm12u320.c:495:4: error: 'struct detailed_timing' has 
> no member named 'pixel_clock'
>  495 |   .pixel_clock = 0,
>  |^~~
>drivers/gpu/drm/tiny/gm12u320.c:496:9: error: 'union <anonymous>' has no 
> member named 'other_data'
>  496 |   .data.other_data.type = 0xfd, /* Monitor ranges */
>  | ^~
>drivers/gpu/drm/tiny/gm12u320.c:496:27: warning: initialized field 
> overwritten [-Woverride-init]
>  496 |   .data.other_data.type = 0xfd, /* Monitor ranges */
>  |   ^~~~
>drivers/gpu/drm/tiny/gm12u320.c:496:27: note: (near initialization for 
> 'gm12u320_edid.detailed_timings[1].data.pixel_data.pixel_clock')
>drivers/gpu/drm/tiny/gm12u320.c:497:9: error: 'union <anonymous>' has no 
> member named 'other_data'
>  497 |   .data.other_data.data.range.min_vfreq = 59,
>  | ^~
>drivers/gpu/drm/tiny/gm12u320.c:497:43: warning: initialized field 
> overwritten [-Woverride-init]
>  497 |   .data.other_data.data.range.min_vfreq = 59,
>  |   ^~
>drivers/gpu/drm/tiny/gm12u320.c:497:43: note: (near initialization for 
> 'gm12u320_edid.detailed_timings[1].data.pixel_data.pixel_clock')
>drivers/gpu/drm/tiny/gm12u320.c:498:9: error: 'union <anonymous>' has no 
> member named 'other_data'
>  498 |   .data.other_data.data.range.max_vfreq = 61,
>  | ^~
>dr

[PATCH next, v2] kernel: Add 1 ms delay to init handler to fix s3 resume hang

2022-03-29 Thread Zhenneng Li
This is a workaround for an s3 resume hang with the r7 340 (amdgpu).
When we test s3 with an r7 340 on an arm64 platform, the graphics card hangs up;
the error messages are as follows:
Mar  4 01:14:11 greatwall-GW-XX-XXX kernel: [1.599374][ 7] [  T291] 
amdgpu :02:00.0: fb0: amdgpudrmfb frame buffer device
Mar  4 01:14:11 greatwall-GW-XX-XXX kernel: [1.612869][ 7] [  T291] 
[drm:amdgpu_device_ip_late_init [amdgpu]] *ERROR* late_init of IP block 
 failed -22
Mar  4 01:14:11 greatwall-GW-XX-XXX kernel: [1.623392][ 7] [  T291] 
amdgpu :02:00.0: amdgpu_device_ip_late_init failed
Mar  4 01:14:11 greatwall-GW-XX-XXX kernel: [1.630696][ 7] [  T291] 
amdgpu :02:00.0: Fatal error during GPU init
Mar  4 01:14:11 greatwall-GW-XX-XXX kernel: [1.637477][ 7] [  T291] 
[drm] amdgpu: finishing device.

On the following hardware:
lspci -nn -s 05:00.0
05:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. 
[AMD/ATI] Oland [Radeon HD 8570 / R7 240/340 / Radeon 520 OEM] [1002:6611] (rev 
87)

Signed-off-by: Zhenneng Li 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 3987ecb24ef4..1eced991b5b2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2903,6 +2903,8 @@ static void 
amdgpu_device_delayed_init_work_handler(struct work_struct *work)
container_of(work, struct amdgpu_device, 
delayed_init_work.work);
int r;
 
+   mdelay(1);
+
r = amdgpu_ib_ring_tests(adev);
if (r)
DRM_ERROR("ib ring test failed (%d).\n", r);
-- 
2.25.1



Re: Re: [PATCH] drm/amdgpu: resolve s3 hang for r7340

2022-03-29 Thread 李真能 (Zhenneng Li)

Subject: Re: [PATCH] drm/amdgpu: resolve s3 hang for r7340
Date: 2022-03-28 15:38
From: Paul Menzel
To: Zhenneng Li

[Cc: -Jack Zhang (invalid address)]

> Am 28.03.22 um 09:36 schrieb Paul Menzel:
> Dear Zhenneng,
>
> Thank you for your patch.
>
> Am 28.03.22 um 06:05 schrieb Zhenneng Li:
>> This is a workaround for s3 hang for r7340(amdgpu).
>
> Is it hanging when resuming from S3?

Yes, this func is a delayed work after init of the graphics card.

> Maybe also use the line below for the commit message summary:
>
> drm/amdgpu: Add 1 ms delay to init handler to fix s3 resume hang
>
> Also, please add a space before the ( in “r7340(amdgpu)”.
>
>> When we test s3 with r7340 on arm64 platform, graphics card will hang up,
>> the error message are as follows:
>> Mar  4 01:14:11 greatwall-GW-XX-XXX kernel: [1.599374][ 7] [  T291]
>> amdgpu :02:00.0: fb0: amdgpudrmfb frame buffer device
>> Mar  4 01:14:11 greatwall-GW-XX-XXX kernel: [1.612869][ 7] [  T291]
>> [drm:amdgpu_device_ip_late_init [amdgpu]] *ERROR* late_init of IP block
>>  failed -22
>> Mar  4 01:14:11 greatwall-GW-XX-XXX kernel: [1.623392][ 7] [  T291]
>> amdgpu :02:00.0: amdgpu_device_ip_late_init failed
>> Mar  4 01:14:11 greatwall-GW-XX-XXX kernel: [1.630696][ 7] [  T291]
>> amdgpu :02:00.0: Fatal error during GPU init
>> Mar  4 01:14:11 greatwall-GW-XX-XXX kernel: [1.637477][ 7] [  T291]
>> [drm] amdgpu: finishing device.
>
> The prefix in the beginning is not really needed. Only the stuff after
> `kernel: `.
>
> Maybe also add the output of `lspci -nn -s …` for that r7340 device.
>
>> Change-Id: I5048b3894c0ca9faf2f4847ddab61f9eb17b4823
>
> Without the Gerrit instance this belongs to, the Change-Id is of no use
> in the public.
>
>> Signed-off-by: Zhenneng Li
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 ++
>>   1 file changed, 2 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 3987ecb24ef4..1eced991b5b2 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -2903,6 +2903,8 @@ static void
>> amdgpu_device_delayed_init_work_handler(struct work_struct *work)
>>   container_of(work, struct amdgpu_device, delayed_init_work.work);
>>   int r;
>> +    mdelay(1);
>> +
>
> Wow, I wonder how long it took you to find that workaround.

About 3 months. I tried adjusting this delayed work
(amdgpu_device_delayed_init_work_handler) from 2000 ms to 2500 ms, and
using mb() instead of mdelay(1), but it was useless. I don't know the
reason. The occurrence probability of this bug is one in ten thousand.
Do you know the possible reasons?

>>   r = amdgpu_ib_ring_tests(adev);
>>   if (r)
>>   DRM_ERROR("ib ring test failed (%d).\n", r);
>
> Kind regards,
>
> Paul

[PATCH] drm/amdgpu: Support AMDGPU RAS debugfs poll interface

2022-03-29 Thread yipechai
Some AMDGPU RAS debugfs operations like UE injection
can cause gpu reset. Before doing the next debugfs
operation, the application should call poll to check
if the gpu has finished recovering.

Signed-off-by: yipechai 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 38 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  6 
 2 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 4bbed76b79c8..337e3e247a45 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -452,6 +452,12 @@ static ssize_t amdgpu_ras_debugfs_ctrl_write(struct file 
*f,
 
/* data.inject.address is offset instead of absolute gpu 
address */
ret = amdgpu_ras_error_inject(adev, &data.inject);
+
+   if (!ret && (data.head.type == 
AMDGPU_RAS_ERROR__MULTI_UNCORRECTABLE)) {
+   struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
+
+   con->ras_ue_injected = 1;
+   }
break;
default:
ret = -EINVAL;
@@ -464,6 +470,30 @@ static ssize_t amdgpu_ras_debugfs_ctrl_write(struct file 
*f,
return size;
 }
 
+/**
+ * DOC: Support AMDGPU RAS debugfs poll interface
+ *
+ * Some AMDGPU RAS debugfs operations like UE injection
+ * can cause gpu reset. Before doing the next debugfs
+ * operation, the application should call poll to check
+ * if gpu is in recovering status.
+ */
+static __poll_t amdgpu_ras_debugfs_ctrl_poll(struct file *f, struct 
poll_table_struct *wait)
+{
+   struct amdgpu_device *adev = (struct amdgpu_device 
*)file_inode(f)->i_private;
+   struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
+   __poll_t mask = 0;
+
+   /* For UE injection, wait for gpu to finish recovery */
+   if (con->ras_ue_injected)
+   poll_wait(f, &con->gpu_ready_wait_wq, wait);
+
+   if (!atomic_read(&con->in_recovery))
+   mask = EPOLLIN | EPOLLRDNORM;
+
+   return mask;
+}
+
 /**
  * DOC: AMDGPU RAS debugfs EEPROM table reset interface
  *
@@ -503,6 +533,7 @@ static ssize_t amdgpu_ras_debugfs_eeprom_write(struct file 
*f,
 
 static const struct file_operations amdgpu_ras_debugfs_ctrl_ops = {
.owner = THIS_MODULE,
+   .poll = amdgpu_ras_debugfs_ctrl_poll,
.read = NULL,
.write = amdgpu_ras_debugfs_ctrl_write,
.llseek = default_llseek
@@ -1837,6 +1868,11 @@ static void amdgpu_ras_do_recovery(struct work_struct 
*work)
if (amdgpu_device_should_recover_gpu(ras->adev))
amdgpu_device_gpu_recover(ras->adev, NULL);
atomic_set(&ras->in_recovery, 0);
+
+   if (ras->ras_ue_injected) {
+   ras->ras_ue_injected = 0;
+   wake_up_all(&ras->gpu_ready_wait_wq);
+   }
 }
 
 /* alloc/realloc bps array */
@@ -2279,7 +2315,7 @@ int amdgpu_ras_init(struct amdgpu_device *adev)
INIT_DELAYED_WORK(&con->ras_counte_delay_work, amdgpu_ras_counte_dw);
atomic_set(&con->ras_ce_count, 0);
atomic_set(&con->ras_ue_count, 0);
-
+   init_waitqueue_head(&con->gpu_ready_wait_wq);
con->objs = (struct ras_manager *)(con + 1);
 
amdgpu_ras_set_context(adev, con);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
index 606df8869b89..aea6bbb71501 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
@@ -379,6 +379,12 @@ struct amdgpu_ras {
 
/* Indicates smu whether need update bad channel info */
bool update_channel_flag;
+
+   /* UE injection flag */
+   uint32_t  ras_ue_injected;
+
+   /* Waiting for gpu ready work queue */
+   wait_queue_head_t gpu_ready_wait_wq;
 };
 
 struct ras_fs_data {
-- 
2.25.1



RE: [PATCH V2] drm/amdgpu: fix incorrect GCR_GENERAL_CNTL address

2022-03-29 Thread Zhang, Yifan
[AMD Official Use Only]

Reviewed-by: Yifan Zhang 

-Original Message-
From: amd-gfx  On Behalf Of Ji, Ruili
Sent: Monday, March 28, 2022 12:59 PM
To: amd-gfx@lists.freedesktop.org
Cc: Zhang, Yifan ; Liu, Aaron ; Liang, 
Prike ; Huang, Ray ; Deucher, Alexander 
; Ji, Ruili 
Subject: [PATCH V2] drm/amdgpu: fix incorrect GCR_GENERAL_CNTL address

From: Ruili Ji 

gfx10.3.3/gfx10.3.6/gfx10.3.7 shall use 0x1580 address for GCR_GENERAL_CNTL

Signed-off-by: Ruili Ji 
---
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index 99df18ae7316..e4c9d92ac381 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -3300,7 +3300,7 @@ static const struct soc15_reg_golden 
golden_settings_gc_10_3_3[] =
SOC15_REG_GOLDEN_VALUE(GC, 0, mmDB_DEBUG3, 0x, 0x0280),
SOC15_REG_GOLDEN_VALUE(GC, 0, mmDB_DEBUG4, 0x, 0x0080),
SOC15_REG_GOLDEN_VALUE(GC, 0, mmGB_ADDR_CONFIG, 0x0c1807ff, 0x0242),
-   SOC15_REG_GOLDEN_VALUE(GC, 0, mmGCR_GENERAL_CNTL, 0x1ff1, 
0x0500),
+   SOC15_REG_GOLDEN_VALUE(GC, 0, mmGCR_GENERAL_CNTL_Vangogh, 0x1ff1, 
+0x0500),
SOC15_REG_GOLDEN_VALUE(GC, 0, mmGL1_PIPE_STEER, 0x00ff, 0x00e4),
SOC15_REG_GOLDEN_VALUE(GC, 0, mmGL2_PIPE_STEER_0, 0x, 
0x32103210),
SOC15_REG_GOLDEN_VALUE(GC, 0, mmGL2_PIPE_STEER_1, 0x, 
0x32103210), @@ -3436,7 +3436,7 @@ static const struct soc15_reg_golden 
golden_settings_gc_10_3_6[] =
SOC15_REG_GOLDEN_VALUE(GC, 0, mmDB_DEBUG3, 0x, 0x0280),
SOC15_REG_GOLDEN_VALUE(GC, 0, mmDB_DEBUG4, 0x, 0x0080),
SOC15_REG_GOLDEN_VALUE(GC, 0, mmGB_ADDR_CONFIG, 0x0c1807ff, 0x0042),
-   SOC15_REG_GOLDEN_VALUE(GC, 0, mmGCR_GENERAL_CNTL, 0x1ff1, 
0x0500),
+   SOC15_REG_GOLDEN_VALUE(GC, 0, mmGCR_GENERAL_CNTL_Vangogh, 0x1ff1, 
+0x0500),
SOC15_REG_GOLDEN_VALUE(GC, 0, mmGL1_PIPE_STEER, 0x00ff, 0x0044),
SOC15_REG_GOLDEN_VALUE(GC, 0, mmGL2_PIPE_STEER_0, 0x, 
0x32103210),
SOC15_REG_GOLDEN_VALUE(GC, 0, mmGL2_PIPE_STEER_1, 0x, 
0x32103210), @@ -3461,7 +3461,7 @@ static const struct soc15_reg_golden 
golden_settings_gc_10_3_7[] = {
SOC15_REG_GOLDEN_VALUE(GC, 0, mmDB_DEBUG3, 0x, 0x0280),
SOC15_REG_GOLDEN_VALUE(GC, 0, mmDB_DEBUG4, 0x, 0x0080),
SOC15_REG_GOLDEN_VALUE(GC, 0, mmGB_ADDR_CONFIG, 0x0c1807ff, 0x0041),
-   SOC15_REG_GOLDEN_VALUE(GC, 0, mmGCR_GENERAL_CNTL, 0x1ff1, 
0x0500),
+   SOC15_REG_GOLDEN_VALUE(GC, 0, mmGCR_GENERAL_CNTL_Vangogh, 0x1ff1, 
+0x0500),
SOC15_REG_GOLDEN_VALUE(GC, 0, mmGL1_PIPE_STEER, 0x00ff, 0x00e4),
SOC15_REG_GOLDEN_VALUE(GC, 0, mmGL2_PIPE_STEER_0, 0x, 
0x32103210),
SOC15_REG_GOLDEN_VALUE(GC, 0, mmGL2_PIPE_STEER_1, 0x, 
0x32103210),
--
2.25.1