Re: [PATCH Review 1/1] drm/amdgpu: fix smu not match warning

2021-11-15 Thread Lazar, Lijo




On 11/16/2021 1:13 PM, Stanley.Yang wrote:

update smu driver if version to avoid mismatch log

Change-Id: I97f2bc4ed9a9cba313b744e2ff6812c90b244935
Signed-off-by: Stanley.Yang 
---
  drivers/gpu/drm/amd/pm/inc/smu_v13_0.h | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/pm/inc/smu_v13_0.h 
b/drivers/gpu/drm/amd/pm/inc/smu_v13_0.h
index e5d3b0d1a032..2e35885c7287 100644
--- a/drivers/gpu/drm/amd/pm/inc/smu_v13_0.h
+++ b/drivers/gpu/drm/amd/pm/inc/smu_v13_0.h
@@ -27,7 +27,7 @@
  
  #define SMU13_DRIVER_IF_VERSION_INV 0x

  #define SMU13_DRIVER_IF_VERSION_YELLOW_CARP 0x04
-#define SMU13_DRIVER_IF_VERSION_ALDE 0x07
+#define SMU13_DRIVER_IF_VERSION_ALDE 0x08



This is not an independent change, it should go along with a change in 
interface file. Please post the changes in smu13_driver_if_aldebaran.h 
along with this as one patch.


Thanks,
Lijo


  /* MP Apertures */
  #define MP0_Public0x0380



Re: [PATCH] drm/amdgpu: support new mode-1 reset interface

2021-11-15 Thread Lazar, Lijo




On 11/16/2021 12:53 PM, Tao Zhou wrote:

If gpu reset is triggered by ras fatal error, tell it to smu in mode-1
reset message.

Signed-off-by: Tao Zhou 
---
  .../gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c| 21 ---
  1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c
index 35145db6eedf..6f3d064a8232 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c
@@ -1426,16 +1426,31 @@ int smu_v13_0_set_azalia_d3_pme(struct smu_context *smu)
  
  int smu_v13_0_mode1_reset(struct smu_context *smu)

  {
-   u32 smu_version;
+   u32 smu_version, fatal_err, param;
int ret = 0;
+   struct amdgpu_device *adev = smu->adev;
+   struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
+
+   fatal_err = 0;
+   param = SMU_RESET_MODE_1;
+
/*
* PM FW support SMU_MSG_GfxDeviceDriverReset from 68.07
*/
smu_cmn_get_smc_version(smu, NULL, _version);
if (smu_version < 0x00440700)
ret = smu_cmn_send_smc_msg(smu, SMU_MSG_Mode1Reset, NULL);
-   else
-   ret = smu_cmn_send_smc_msg_with_param(smu, 
SMU_MSG_GfxDeviceDriverReset, SMU_RESET_MODE_1, NULL);
+   else {
+   /* fatal error triggered by ras, PMFW supports the flag
+  from 68.44.0 */
+   if ((smu_version >= 0x00442c00) && ras &&
+   atomic_read(>in_recovery))
+   fatal_err = 1;
+


From PMFW version, this looks specific to aldebaran. Since there is 
version check as well, the implementation needs to be moved to 
aldebaran_ppt.c


Thanks,
Lijo


+   param |= (fatal_err << 16);
+   ret = smu_cmn_send_smc_msg_with_param(smu,
+   SMU_MSG_GfxDeviceDriverReset, param, 
NULL);
+   }
  
  	if (!ret)

msleep(SMU13_MODE1_RESET_WAIT_TIME_IN_MS);



[PATCH Review 1/1] drm/amdgpu: fix smu not match warning

2021-11-15 Thread Stanley . Yang
update smu driver if version to avoid mismatch log

Change-Id: I97f2bc4ed9a9cba313b744e2ff6812c90b244935
Signed-off-by: Stanley.Yang 
---
 drivers/gpu/drm/amd/pm/inc/smu_v13_0.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/pm/inc/smu_v13_0.h 
b/drivers/gpu/drm/amd/pm/inc/smu_v13_0.h
index e5d3b0d1a032..2e35885c7287 100644
--- a/drivers/gpu/drm/amd/pm/inc/smu_v13_0.h
+++ b/drivers/gpu/drm/amd/pm/inc/smu_v13_0.h
@@ -27,7 +27,7 @@
 
 #define SMU13_DRIVER_IF_VERSION_INV 0x
 #define SMU13_DRIVER_IF_VERSION_YELLOW_CARP 0x04
-#define SMU13_DRIVER_IF_VERSION_ALDE 0x07
+#define SMU13_DRIVER_IF_VERSION_ALDE 0x08
 
 /* MP Apertures */
 #define MP0_Public 0x0380
-- 
2.17.1



RE: [PATCH] drm/amdgpu: support new mode-1 reset interface

2021-11-15 Thread Zhang, Hawking
[AMD Official Use Only]

Reviewed-by: Hawking Zhang 

Regards,
Hawking
-Original Message-
From: Zhou1, Tao  
Sent: Tuesday, November 16, 2021 15:24
To: amd-gfx@lists.freedesktop.org; Zhang, Hawking ; 
Clements, John ; Yang, Stanley ; 
Quan, Evan 
Cc: Zhou1, Tao 
Subject: [PATCH] drm/amdgpu: support new mode-1 reset interface

If gpu reset is triggered by ras fatal error, tell it to smu in mode-1 reset 
message.

Signed-off-by: Tao Zhou 
---
 .../gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c| 21 ---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c
index 35145db6eedf..6f3d064a8232 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c
@@ -1426,16 +1426,31 @@ int smu_v13_0_set_azalia_d3_pme(struct smu_context *smu)
 
 int smu_v13_0_mode1_reset(struct smu_context *smu)  {
-   u32 smu_version;
+   u32 smu_version, fatal_err, param;
int ret = 0;
+   struct amdgpu_device *adev = smu->adev;
+   struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
+
+   fatal_err = 0;
+   param = SMU_RESET_MODE_1;
+
/*
* PM FW support SMU_MSG_GfxDeviceDriverReset from 68.07
*/
smu_cmn_get_smc_version(smu, NULL, _version);
if (smu_version < 0x00440700)
ret = smu_cmn_send_smc_msg(smu, SMU_MSG_Mode1Reset, NULL);
-   else
-   ret = smu_cmn_send_smc_msg_with_param(smu, 
SMU_MSG_GfxDeviceDriverReset, SMU_RESET_MODE_1, NULL);
+   else {
+   /* fatal error triggered by ras, PMFW supports the flag
+  from 68.44.0 */
+   if ((smu_version >= 0x00442c00) && ras &&
+   atomic_read(>in_recovery))
+   fatal_err = 1;
+
+   param |= (fatal_err << 16);
+   ret = smu_cmn_send_smc_msg_with_param(smu,
+   SMU_MSG_GfxDeviceDriverReset, param, 
NULL);
+   }
 
if (!ret)
msleep(SMU13_MODE1_RESET_WAIT_TIME_IN_MS);
--
2.17.1


[PATCH] drm/amdgpu: support new mode-1 reset interface

2021-11-15 Thread Tao Zhou
If gpu reset is triggered by ras fatal error, tell it to smu in mode-1
reset message.

Signed-off-by: Tao Zhou 
---
 .../gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c| 21 ---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c
index 35145db6eedf..6f3d064a8232 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c
@@ -1426,16 +1426,31 @@ int smu_v13_0_set_azalia_d3_pme(struct smu_context *smu)
 
 int smu_v13_0_mode1_reset(struct smu_context *smu)
 {
-   u32 smu_version;
+   u32 smu_version, fatal_err, param;
int ret = 0;
+   struct amdgpu_device *adev = smu->adev;
+   struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
+
+   fatal_err = 0;
+   param = SMU_RESET_MODE_1;
+
/*
* PM FW support SMU_MSG_GfxDeviceDriverReset from 68.07
*/
smu_cmn_get_smc_version(smu, NULL, _version);
if (smu_version < 0x00440700)
ret = smu_cmn_send_smc_msg(smu, SMU_MSG_Mode1Reset, NULL);
-   else
-   ret = smu_cmn_send_smc_msg_with_param(smu, 
SMU_MSG_GfxDeviceDriverReset, SMU_RESET_MODE_1, NULL);
+   else {
+   /* fatal error triggered by ras, PMFW supports the flag
+  from 68.44.0 */
+   if ((smu_version >= 0x00442c00) && ras &&
+   atomic_read(>in_recovery))
+   fatal_err = 1;
+
+   param |= (fatal_err << 16);
+   ret = smu_cmn_send_smc_msg_with_param(smu,
+   SMU_MSG_GfxDeviceDriverReset, param, 
NULL);
+   }
 
if (!ret)
msleep(SMU13_MODE1_RESET_WAIT_TIME_IN_MS);
-- 
2.17.1



Re: [PATCH] drm/radeon:WARNING opportunity for max()

2021-11-15 Thread Christian König

Am 16.11.21 um 06:50 schrieb zhaoxiao:

Fix following coccicheck warning:
drivers/gpu/drm/radeon/r100.c:3450:26-27: WARNING opportunity for max()
drivers/gpu/drm/radeon/r100.c:2812:23-24: WARNING opportunity for max()

Signed-off-by: zhaoxiao 


In general feel free to add my Acked-by, but I'm not sure if we want 
such cleanups in a driver which is only in maintenance mode.


If you do this as part of a general and automated cleanup then it is 
probably ok.


Regards,
Christian.


---
  drivers/gpu/drm/radeon/r100.c | 10 ++
  1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/radeon/r100.c b/drivers/gpu/drm/radeon/r100.c
index 2dd85ba1faa2..c65ee6f44af2 100644
--- a/drivers/gpu/drm/radeon/r100.c
+++ b/drivers/gpu/drm/radeon/r100.c
@@ -2809,10 +2809,7 @@ void r100_vram_init_sizes(struct radeon_device *rdev)
if (rdev->mc.aper_size > config_aper_size)
config_aper_size = rdev->mc.aper_size;
  
-		if (config_aper_size > rdev->mc.real_vram_size)

-   rdev->mc.mc_vram_size = config_aper_size;
-   else
-   rdev->mc.mc_vram_size = rdev->mc.real_vram_size;
+   rdev->mc.mc_vram_size = max(config_aper_size, 
rdev->mc.real_vram_size);
}
  }
  
@@ -3447,10 +3444,7 @@ void r100_bandwidth_update(struct radeon_device *rdev)

mc_latency_mclk.full += disp_latency_overhead.full + 
cur_latency_mclk.full;
mc_latency_sclk.full += disp_latency_overhead.full + 
cur_latency_sclk.full;
  
-	if (mc_latency_mclk.full > mc_latency_sclk.full)

-   disp_latency.full = mc_latency_mclk.full;
-   else
-   disp_latency.full = mc_latency_sclk.full;
+   disp_latency.full = max(mc_latency_mclk.full, mc_latency_sclk.full);
  
  	/* setup Max GRPH_STOP_REQ default value */

if (ASIC_IS_RV100(rdev))




Re: Questions about KMS flip

2021-11-15 Thread Christian König

Am 16.11.21 um 04:27 schrieb Lang Yu:

On Mon, Nov 15, 2021 at 01:04:15PM +0100, Michel DDDnzer wrote:

[SNIP]

Though a single call to dce_v*_0_crtc_do_set_base() will
only pin the BO, I found it will be unpinned in next call to
dce_v*_0_crtc_do_set_base().

Yeah, that's the normal case when the new BO is different from the old one.

To catch the case I described, try something like

diff --git a/drivers/gpu/drm/amd/amdgpu/dce_v11_0.c 
b/drivers/gpu/drm/amd/amdgpu/dce_v11_0.c

index 18a7b3bd633b..5726bd87a355 100644

--- a/drivers/gpu/drm/amd/amdgpu/dce_v11_0.c

+++ b/drivers/gpu/drm/amd/amdgpu/dce_v11_0.c

@@ -1926,6 +1926,7 @@ static int dce_v11_0_crtc_do_set_base(struct drm_crtc 
*crtc,

 return r;



 if (!atomic) {

+   WARN_ON_ONCE(target_fb == fb);

 r = amdgpu_bo_pin(abo, AMDGPU_GEM_DOMAIN_VRAM);

 if (unlikely(r != 0)) {

 amdgpu_bo_unreserve(abo);


I did some tests, the warning can be triggered.

pin/unpin operations in *_crtc_do_set_base() and
amdgpu_display_crtc_page_flip_target() are mixed.


Ok sounds like we narrowed down the root cause pretty well.

Question is now how can we fix this? Just not pin the BO when target_fb 
== fb?


Thanks,
Christian.



Regards,
Lang





RE: [PATCH] drm/amdgpu: always reset the asic in suspend (v2)

2021-11-15 Thread Quan, Evan
[AMD Official Use Only]

Acked-by: Evan Quan 

> -Original Message-
> From: amd-gfx  On Behalf Of Alex
> Deucher
> Sent: Saturday, November 13, 2021 12:26 AM
> To: amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander 
> Subject: [PATCH] drm/amdgpu: always reset the asic in suspend (v2)
> 
> If the platform suspend happens to fail and the power rail
> is not turned off, the GPU will be in an unknown state on
> resume, so reset the asic so that it will be in a known
> good state on resume even if the platform suspend failed.
> 
> v2: handle s0ix
> 
> Signed-off-by: Alex Deucher 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 1db76429a673..b4591f6e82dd 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -2165,7 +2165,10 @@ static int amdgpu_pmops_suspend(struct device
> *dev)
>   adev->in_s3 = true;
>   r = amdgpu_device_suspend(drm_dev, true);
>   adev->in_s3 = false;
> -
> + if (r)
> + return r;
> + if (!adev->in_s0ix)
> + r = amdgpu_asic_reset(adev);
>   return r;
>  }
> 
> --
> 2.31.1


RE: [PATCH] drm/amd/pm: Remove artificial freq level on Navi1x

2021-11-15 Thread Quan, Evan
[AMD Official Use Only]

Reviewed-by: Evan Quan 

> -Original Message-
> From: Lazar, Lijo 
> Sent: Monday, November 15, 2021 3:42 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander ; Zhang, Hawking
> ; Wang, Yang(Kevin)
> ; Quan, Evan 
> Subject: [PATCH] drm/amd/pm: Remove artificial freq level on Navi1x
> 
> Print Navi1x fine grained clocks in a consistent manner with other SOCs.
> Don't show aritificial DPM level when the current clock equals min or max.
> 
> Signed-off-by: Lijo Lazar 
> ---
>  drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c | 13 -
>  1 file changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c
> b/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c
> index 71161f6b78fe..60a557068ea4 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c
> @@ -1265,7 +1265,7 @@ static int navi10_print_clk_levels(struct
> smu_context *smu,
>   enum smu_clk_type clk_type, char *buf)  {
>   uint16_t *curve_settings;
> - int i, size = 0, ret = 0;
> + int i, levels, size = 0, ret = 0;
>   uint32_t cur_value = 0, value = 0, count = 0;
>   uint32_t freq_values[3] = {0};
>   uint32_t mark_index = 0;
> @@ -1319,14 +1319,17 @@ static int navi10_print_clk_levels(struct
> smu_context *smu,
>   freq_values[1] = cur_value;
>   mark_index = cur_value == freq_values[0] ? 0 :
>cur_value == freq_values[2] ? 2 : 1;
> - if (mark_index != 1)
> - freq_values[1] = (freq_values[0] +
> freq_values[2]) / 2;
> 
> - for (i = 0; i < 3; i++) {
> + levels = 3;
> + if (mark_index != 1) {
> + levels = 2;
> + freq_values[1] = freq_values[2];
> + }
> +
> + for (i = 0; i < levels; i++) {
>   size += sysfs_emit_at(buf, size,
> "%d: %uMhz %s\n", i, freq_values[i],
>   i == mark_index ? "*" : "");
>   }
> -
>   }
>   break;
>   case SMU_PCIE:
> --
> 2.17.1


Re: [PATCH v2 2/8] drm: improve drm_buddy_alloc function

2021-11-15 Thread Arunpravin
Hi Matthew,
I am preparing version3 of the buddy allocator,
Please find the updated comments.

SNIP

>>> -int drm_buddy_alloc_range(struct drm_buddy_mm *mm,
>>> - struct list_head *blocks,
>>> - u64 start, u64 size)
>>> +static struct drm_buddy_block *
>>> +alloc_range(struct drm_buddy_mm *mm,
>>> +   u64 start, u64 end,
>>> +   unsigned int order)
>>>   {
>>> struct drm_buddy_block *block;
>>> struct drm_buddy_block *buddy;
>>> -   LIST_HEAD(allocated);
>>> LIST_HEAD(dfs);
>>> -   u64 end;
>>> int err;
>>> int i;
>>>   
>>
>>> if (!block)
>>> break;
>>>   
>>> list_del(>tmp_link);
>>>   
>>> +   if (drm_buddy_block_order(block) < order)
>>> +   continue;
>>> +
>>> block_start = drm_buddy_block_offset(block);
>>> block_end = block_start + drm_buddy_block_size(mm, block) - 1;
>>>   
>>> if (!overlaps(start, end, block_start, block_end))
>>> continue;
>>>   
>>> -   if (drm_buddy_block_is_allocated(block)) {
>>> -   err = -ENOSPC;
>>> -   goto err_free;
>>> -   }
>>> +   if (drm_buddy_block_is_allocated(block))
>>> +   continue;
>>>   
>>> -   if (contains(start, end, block_start, block_end)) {
>>> -   if (!drm_buddy_block_is_free(block)) {
>>> -   err = -ENOSPC;
>>> -   goto err_free;
>>> -   }
>>> +   if (contains(start, end, block_start, block_end)
>>> +   && order == drm_buddy_block_order(block)) {
>>
>> Alignment looks off, also && should be on the line above.
> 
> [Arun] ok
>>
>>> +   /*
>>> +* Find the free block within the range.
>>> +*/
>>> +   if (drm_buddy_block_is_free(block))
>>> +   return block;
>>
>> Would it make sense to keep searching here, rather than restarting the 
>> search from scratch every time? Would it work if we pass in the total 
>> size and min order?
> [Arun] yes, I will rewrite this function

I tried to rewrite the function, AFAIK, in case of end bias allocation,
we have to restart the search on every new order computed value from the
requested total size since we have to find a free node in the required
order level traversing from left to right, here continuing the search
for the subsequent order value would skip the free nodes present in the
beginning of the tree.

In case of actual range allocation, as handled at
i915_buddy_alloc_range, we can continue the search from where the
previous allocation happened since we allocate all the blocks
progressively within the start and end address values.

alloc_range() handles both the cases, having a penalty of restarting the
search in case of actual range allocation. Please let me know if any
suggestions?

>>> +int drm_buddy_alloc(struct drm_buddy_mm *mm,
>>> +   u64 start, u64 end, u64 size,
>>> +   u64 min_page_size,
>>> +   struct list_head *blocks,
>>> +   unsigned long flags)
>>
>> Do we need to validate the flags somewhere?
> [Arun] I will move 'unsigned long flags' to enum type declaration
I tried to move 'unsigned long flags' to enum type declaration, it
creates an ambiguity in i915 driver as both DRM_BUDDY_ALLOC_TOPDOWN and
DRM_BUDDY_ALLOC_RANGE are mutually non-exclusive. So I think its better
to have 'unsigned long flags'.

AFAIK, we don't need to validate the flags since we check flags using
'flags & DRM_BUDDY_RANGE_ALLOCATION'

>>
>>> +   BUG_ON(order > mm->max_order);
>>> +   BUG_ON(order < min_order);
>>> +
>>> +   do {
>>> +   if (flags & DRM_BUDDY_RANGE_ALLOCATION)
>>> +   /* Allocate traversing within the range */
>>> +   block = alloc_range(mm, start, end, order);
>>
>> Ok, so blocks might be in a random order, which is a slight concern for 
>> actual range allocations(not the bias thing). Can we somehow make 
>> alloc_range just do the old behaviour when end - start == size? Not the 
>> end of the world though if not.
> [Arun] I will change the alloc_range() block allocations to bottom-up,
> so both actual range allocation and end bias allocation blocks will
> start from lowest address. And, since we are traversing the tree from
> left to right, blocks will be in order.
> 
> And, alloc_range() handles actual range allocation demands the same way
> as in the old i915_buddy_alloc_range() function except alloc_range()
> make use of order value to find the blocks within the actual range
> allocation.

Correction - I will change alloc_range() block allocations to bottom-up,
so actual range allocation blocks will start from lowest address (not
the bias thing), and since we are traversing the tree from left to 

Re: [PATCH] drm/amd/pm: Add sysfs interface for retrieving gpu metrics(V2)

2021-11-15 Thread Alex Deucher
On Mon, Nov 15, 2021 at 3:49 PM Kakarya, Surbhi  wrote:
>
> [AMD Official Use Only]
>
> Hi Alex,
>
> I am porting the patches 
> (http://gerrit-git.amd.com/c/brahma/ec/linux/+/396997 and 
> http://gerrit-git.amd.com/c/brahma/ec/linux/+/528745) to provide the 
> necessary SMU utils (basic and system_status) support in this branch.
>

If you just want to populate the new fields in the metrics table, that
is all you need to do.  No need for any of the other changes.  I think
the last hunk of your patch is all you need.

Alex


> Thanks
> Surbhi
>
> -Original Message-
> From: Alex Deucher 
> Sent: Friday, November 12, 2021 2:41 PM
> To: Kakarya, Surbhi 
> Cc: amd-gfx list ; Zhang, Bokun 
> ; Chang, HaiJun ; Liu, Monk 
> ; Deucher, Alexander 
> Subject: Re: [PATCH] drm/amd/pm: Add sysfs interface for retrieving gpu 
> metrics(V2)
>
> On Fri, Nov 12, 2021 at 12:46 PM Surbhi Kakarya  
> wrote:
> >
> > A new interface for UMD to retrieve gpu metrics data. This patch is
> > based on an existing patch If7f3523915505c0ece0a56dfd476d2b8473440d4.
> >
>
> It's not clear what you are trying to do here.
>
> > Signed-off-by: Surbhi Kakarya 
> > Change-Id: I701110d78a85c092f5dda167a52350cc6dda7557
> > ---
> >  drivers/gpu/drm/amd/pm/amdgpu_pm.c |  6 +-
> >  drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h|  2 +-
> >  drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c  |  4 +---
> >  .../gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c| 10 ++
> >  4 files changed, 17 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/pm/amdgpu_pm.c
> > b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
> > index 01cca08a774f..d60426daddae 100644
> > --- a/drivers/gpu/drm/amd/pm/amdgpu_pm.c
> > +++ b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
> > @@ -1800,8 +1800,12 @@ static ssize_t amdgpu_get_gpu_metrics(struct device 
> > *dev,
> > return ret;
> > }
> >
> > -   if (adev->powerplay.pp_funcs->get_gpu_metrics)
> > +   down_read(>reset_sem);
> > +   if (is_support_sw_smu(adev))
> > +   size = smu_sys_get_gpu_metrics(>smu, _metrics);
> > +   else if (adev->powerplay.pp_funcs->get_gpu_metrics)
> > size = amdgpu_dpm_get_gpu_metrics(adev, _metrics);
> > +   up_read(>reset_sem);
> >
>
> Why are you changing this code?
> adev->powerplay.pp_funcs->get_gpu_metrics already points to
> smu_sys_get_gpu_metrics().  Also why do you need to add the semaphore locking?
>
> > if (size <= 0)
> > goto out;
> > diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> > b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> > index 3557f4e7fc30..5ffe7e3bf1aa 100644
> > --- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> > +++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> > @@ -1397,6 +1397,6 @@ int smu_set_light_sbr(struct smu_context *smu,
> > bool enable);
> >
> >  int smu_wait_for_event(struct amdgpu_device *adev, enum smu_event_type 
> > event,
> >uint64_t event_arg);
> > -
> > +ssize_t smu_sys_get_gpu_metrics(struct smu_context *smu, void
> > +**table);
> >  #endif
> >  #endif
> > diff --git a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> > b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> > index b06c59dcc1b4..ec81abe385e3 100644
> > --- a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> > +++ b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> > @@ -3005,9 +3005,8 @@ static int smu_get_dpm_clock_table(void *handle,
> > return ret;
> >  }
> >
> > -static ssize_t smu_sys_get_gpu_metrics(void *handle, void **table)
> > +ssize_t smu_sys_get_gpu_metrics(struct smu_context *smu, void
> > +**table)
> >  {
> > -   struct smu_context *smu = handle;
> > ssize_t size;
> >
> > if (!smu->pm_enabled || !smu->adev->pm.dpm_enabled) @@ -3135,7
> > +3134,6 @@ static const struct amd_pm_funcs swsmu_pm_funcs = {
> > .asic_reset_mode_2= smu_mode2_reset,
> > .set_df_cstate= smu_set_df_cstate,
> > .set_xgmi_pstate  = smu_set_xgmi_pstate,
> > -   .get_gpu_metrics  = smu_sys_get_gpu_metrics,
>
> Why are you removing this?
>
> > .set_watermarks_for_clock_ranges = 
> > smu_set_watermarks_for_clock_ranges,
> > .display_disable_memory_clock_switch = 
> > smu_display_disable_memory_clock_switch,
> > .get_max_sustainable_clocks_by_dc= 
> > smu_get_max_sustainable_clocks_by_dc,
> > diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
> > b/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
> > index 3b1bf270ebc6..97d18e764665 100644
> > --- a/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
> > +++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
> > @@ -3619,6 +3619,16 @@ static ssize_t sienna_cichlid_get_gpu_metrics(struct 
> > smu_context *smu,
> > gpu_metrics->energy_accumulator =
> > use_metrics_v2 ? metrics_v2->EnergyAccumulator :
> > 

Re: [PATCH 4/5] drm/amdkfd: restore pages race with process termination

2021-11-15 Thread philip yang

  


On 2021-11-09 11:16 p.m., Felix
  Kuehling wrote:

On
  2021-11-09 6:04 p.m., Philip Yang wrote:
  
  restore pages work can not find kfd
process or mm struct if process is

destroyed before drain retry fault work schedule to run, this is
not

failure, return 0 to avoid dump GPU vm fault kernel log.

  
  I wonder if this could also be solved by draining page fault
  interrupts in kfd_process_notifier_release before we remove the
  process from the hash table. Because that function runs while
  holding the mmap lock, we'd need to detect the draining condition
  for the process in the page fault handler before it tries to take
  the mmap lock. Maybe that's even a helpful optimization that
  speeds up interrupt draining by just ignoring all retry faults
  during that time.
  
  

Call svm_range_drain_retry_fault in kfd_process_notifier_release
before removing the process from hash table solve this race
condition.
That would
  also allow draining faults in svm_range_unmap_from_cpu instead of
  the delayed worker. And I believe that would also elegantly fix
  the vma removal race.
  

vma maybe removed from rbtree before unmap call back (refer to
  below __do_munmap), draining fault in svm_range_unmap_from_cpu can
  not fix vma removal race, also cause mmap write and read lock
  deadlock issue, will change to check if svms->drain_pagefaults
  is set, restore_pages returns true if no VMA is found.

int __do_munmap


    /* Detach vmas from rbtree */
      if (!detach_vmas_to_be_unmapped(mm, vma, prev, end))
          downgrade = false;
  
      if (downgrade)
          mmap_write_downgrade(mm);
  
      unmap_region(mm, vma, prev, start, end);
  
      /* Fix up all other VM information */
      remove_vma_list(mm, vma);
  
  Regards,
Philip


  
  Regards,
  
    Felix
  
  
  
  

Signed-off-by: Philip Yang 

---

  drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 4 ++--

  1 file changed, 2 insertions(+), 2 deletions(-)


diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c

index 8f77d5746b2c..2083a10b35c2 100644

--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c

+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c

@@ -2596,7 +2596,7 @@ svm_range_restore_pages(struct
amdgpu_device *adev, unsigned int pasid,

  p = kfd_lookup_process_by_pasid(pasid);

  if (!p) {

  pr_debug("kfd process not founded pasid 0x%x\n",
pasid);

-    return -ESRCH;

+    return 0;

  }

  if (!p->xnack_enabled) {

  pr_debug("XNACK not enabled for pasid 0x%x\n", pasid);

@@ -2610,7 +2610,7 @@ svm_range_restore_pages(struct
amdgpu_device *adev, unsigned int pasid,

  mm = get_task_mm(p->lead_thread);

  if (!mm) {

  pr_debug("svms 0x%p failed to get mm\n", svms);

-    r = -ESRCH;

+    r = 0;

  goto out;

  }

  

  



RE: [PATCH] drm/amd/pm: Add sysfs interface for retrieving gpu metrics(V2)

2021-11-15 Thread Kakarya, Surbhi
[AMD Official Use Only]

Hi Alex,

I am porting the patches (http://gerrit-git.amd.com/c/brahma/ec/linux/+/396997 
and http://gerrit-git.amd.com/c/brahma/ec/linux/+/528745) to provide the 
necessary SMU utils (basic and system_status) support in this branch.

Thanks
Surbhi

-Original Message-
From: Alex Deucher 
Sent: Friday, November 12, 2021 2:41 PM
To: Kakarya, Surbhi 
Cc: amd-gfx list ; Zhang, Bokun 
; Chang, HaiJun ; Liu, Monk 
; Deucher, Alexander 
Subject: Re: [PATCH] drm/amd/pm: Add sysfs interface for retrieving gpu 
metrics(V2)

On Fri, Nov 12, 2021 at 12:46 PM Surbhi Kakarya  wrote:
>
> A new interface for UMD to retrieve gpu metrics data. This patch is
> based on an existing patch If7f3523915505c0ece0a56dfd476d2b8473440d4.
>

It's not clear what you are trying to do here.

> Signed-off-by: Surbhi Kakarya 
> Change-Id: I701110d78a85c092f5dda167a52350cc6dda7557
> ---
>  drivers/gpu/drm/amd/pm/amdgpu_pm.c |  6 +-
>  drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h|  2 +-
>  drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c  |  4 +---
>  .../gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c| 10 ++
>  4 files changed, 17 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/pm/amdgpu_pm.c
> b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
> index 01cca08a774f..d60426daddae 100644
> --- a/drivers/gpu/drm/amd/pm/amdgpu_pm.c
> +++ b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
> @@ -1800,8 +1800,12 @@ static ssize_t amdgpu_get_gpu_metrics(struct device 
> *dev,
> return ret;
> }
>
> -   if (adev->powerplay.pp_funcs->get_gpu_metrics)
> +   down_read(>reset_sem);
> +   if (is_support_sw_smu(adev))
> +   size = smu_sys_get_gpu_metrics(>smu, _metrics);
> +   else if (adev->powerplay.pp_funcs->get_gpu_metrics)
> size = amdgpu_dpm_get_gpu_metrics(adev, _metrics);
> +   up_read(>reset_sem);
>

Why are you changing this code?
adev->powerplay.pp_funcs->get_gpu_metrics already points to
smu_sys_get_gpu_metrics().  Also why do you need to add the semaphore locking?

> if (size <= 0)
> goto out;
> diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> index 3557f4e7fc30..5ffe7e3bf1aa 100644
> --- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> +++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> @@ -1397,6 +1397,6 @@ int smu_set_light_sbr(struct smu_context *smu,
> bool enable);
>
>  int smu_wait_for_event(struct amdgpu_device *adev, enum smu_event_type event,
>uint64_t event_arg);
> -
> +ssize_t smu_sys_get_gpu_metrics(struct smu_context *smu, void
> +**table);
>  #endif
>  #endif
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> index b06c59dcc1b4..ec81abe385e3 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> @@ -3005,9 +3005,8 @@ static int smu_get_dpm_clock_table(void *handle,
> return ret;
>  }
>
> -static ssize_t smu_sys_get_gpu_metrics(void *handle, void **table)
> +ssize_t smu_sys_get_gpu_metrics(struct smu_context *smu, void
> +**table)
>  {
> -   struct smu_context *smu = handle;
> ssize_t size;
>
> if (!smu->pm_enabled || !smu->adev->pm.dpm_enabled) @@ -3135,7
> +3134,6 @@ static const struct amd_pm_funcs swsmu_pm_funcs = {
> .asic_reset_mode_2= smu_mode2_reset,
> .set_df_cstate= smu_set_df_cstate,
> .set_xgmi_pstate  = smu_set_xgmi_pstate,
> -   .get_gpu_metrics  = smu_sys_get_gpu_metrics,

Why are you removing this?

> .set_watermarks_for_clock_ranges = 
> smu_set_watermarks_for_clock_ranges,
> .display_disable_memory_clock_switch = 
> smu_display_disable_memory_clock_switch,
> .get_max_sustainable_clocks_by_dc= 
> smu_get_max_sustainable_clocks_by_dc,
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
> b/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
> index 3b1bf270ebc6..97d18e764665 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
> @@ -3619,6 +3619,16 @@ static ssize_t sienna_cichlid_get_gpu_metrics(struct 
> smu_context *smu,
> gpu_metrics->energy_accumulator =
> use_metrics_v2 ? metrics_v2->EnergyAccumulator :
> metrics->EnergyAccumulator;
>
> +   if (metrics->CurrGfxVoltageOffset)
> +   gpu_metrics->voltage_gfx =
> +   (155000 - 625 * metrics->CurrGfxVoltageOffset) / 100;
> +   if (metrics->CurrMemVidOffset)
> +   gpu_metrics->voltage_mem =
> +   (155000 - 625 * metrics->CurrMemVidOffset) / 100;
> +   if (metrics->CurrSocVoltageOffset)
> +   gpu_metrics->voltage_soc =
> +   (155000 - 625 * 

Re: [PATCH] drm/amd/display: Fix LTTPR not Enabled on HP ZBook G8 laptop

2021-11-15 Thread Lakha, Bhawanpreet
[AMD Official Use Only]

Reviewed-by: Bhawanpreet Lakha 

From: Aurabindo Pillai 
Sent: November 15, 2021 2:59 PM
To: amd-gfx@lists.freedesktop.org 
Cc: Lakha, Bhawanpreet ; Siqueira, Rodrigo 
; Pillai, Aurabindo ; Wang, 
Angus ; Chalmers, Wesley ; Leung, 
Martin ; Zhuo, Qingqing 
Subject: [PATCH] drm/amd/display: Fix LTTPR not Enabled on HP ZBook G8 laptop

From: Angus Wang 

[WHY]
Previous LTTPR change has caused a regression that led to an
issue where LTTPR is disabled

[HOW]
Extended changes from previous fix to DCN30X

Reviewed-by: Wesley Chalmers 
Reviewed-by: Martin Leung 
Acked-by: Qingqing Zhuo 
Signed-off-by: Angus Wang 
---
 .../amd/display/dc/dcn301/dcn301_resource.c| 18 ++
 .../amd/display/dc/dcn302/dcn302_resource.c| 18 ++
 .../amd/display/dc/dcn303/dcn303_resource.c| 17 +
 3 files changed, 53 insertions(+)

diff --git a/drivers/gpu/drm/amd/display/dc/dcn301/dcn301_resource.c 
b/drivers/gpu/drm/amd/display/dc/dcn301/dcn301_resource.c
index 2650d3bd50ec..9cc1610360bd 100644
--- a/drivers/gpu/drm/amd/display/dc/dcn301/dcn301_resource.c
+++ b/drivers/gpu/drm/amd/display/dc/dcn301/dcn301_resource.c
@@ -1485,6 +1485,24 @@ static bool dcn301_resource_construct(
 dc->caps.color.mpc.ogam_rom_caps.hlg = 0;
 dc->caps.color.mpc.ocsc = 1;

+   /* read VBIOS LTTPR caps */
+
+   if (ctx->dc_bios->funcs->get_lttpr_caps) {
+   enum bp_result bp_query_result;
+   uint8_t is_vbios_lttpr_enable = 0;
+
+   bp_query_result = 
ctx->dc_bios->funcs->get_lttpr_caps(ctx->dc_bios, _vbios_lttpr_enable);
+   dc->caps.vbios_lttpr_enable = (bp_query_result == BP_RESULT_OK) 
&& !!is_vbios_lttpr_enable;
+   }
+
+   if (ctx->dc_bios->funcs->get_lttpr_interop) {
+   enum bp_result bp_query_result;
+   uint8_t is_vbios_interop_enabled = 0;
+
+   bp_query_result = 
ctx->dc_bios->funcs->get_lttpr_interop(ctx->dc_bios, _vbios_interop_enabled);
+   dc->caps.vbios_lttpr_aware = (bp_query_result == BP_RESULT_OK) 
&& !!is_vbios_interop_enabled;
+   }
+
 if (dc->ctx->dce_environment == DCE_ENV_PRODUCTION_DRV)
 dc->debug = debug_defaults_drv;
 else if (dc->ctx->dce_environment == DCE_ENV_FPGA_MAXIMUS) {
diff --git a/drivers/gpu/drm/amd/display/dc/dcn302/dcn302_resource.c 
b/drivers/gpu/drm/amd/display/dc/dcn302/dcn302_resource.c
index fcf96cf08c76..058f5d71e037 100644
--- a/drivers/gpu/drm/amd/display/dc/dcn302/dcn302_resource.c
+++ b/drivers/gpu/drm/amd/display/dc/dcn302/dcn302_resource.c
@@ -1557,6 +1557,24 @@ static bool dcn302_resource_construct(
 dc->caps.color.mpc.ogam_rom_caps.hlg = 0;
 dc->caps.color.mpc.ocsc = 1;

+   /* read VBIOS LTTPR caps */
+   if (ctx->dc_bios->funcs->get_lttpr_caps) {
+   enum bp_result bp_query_result;
+   uint8_t is_vbios_lttpr_enable = 0;
+
+   bp_query_result = 
ctx->dc_bios->funcs->get_lttpr_caps(ctx->dc_bios, _vbios_lttpr_enable);
+   dc->caps.vbios_lttpr_enable = (bp_query_result == BP_RESULT_OK) 
&& !!is_vbios_lttpr_enable;
+   }
+
+   if (ctx->dc_bios->funcs->get_lttpr_interop) {
+   enum bp_result bp_query_result;
+   uint8_t is_vbios_interop_enabled = 0;
+
+   bp_query_result = 
ctx->dc_bios->funcs->get_lttpr_interop(ctx->dc_bios,
+   _vbios_interop_enabled);
+   dc->caps.vbios_lttpr_aware = (bp_query_result == BP_RESULT_OK) 
&& !!is_vbios_interop_enabled;
+   }
+
 if (dc->ctx->dce_environment == DCE_ENV_PRODUCTION_DRV)
 dc->debug = debug_defaults_drv;
 else
diff --git a/drivers/gpu/drm/amd/display/dc/dcn303/dcn303_resource.c 
b/drivers/gpu/drm/amd/display/dc/dcn303/dcn303_resource.c
index 4a9b64023675..7024aeb0884c 100644
--- a/drivers/gpu/drm/amd/display/dc/dcn303/dcn303_resource.c
+++ b/drivers/gpu/drm/amd/display/dc/dcn303/dcn303_resource.c
@@ -1500,6 +1500,23 @@ static bool dcn303_resource_construct(
 dc->caps.color.mpc.ogam_rom_caps.hlg = 0;
 dc->caps.color.mpc.ocsc = 1;

+   /* read VBIOS LTTPR caps */
+   if (ctx->dc_bios->funcs->get_lttpr_caps) {
+   enum bp_result bp_query_result;
+   uint8_t is_vbios_lttpr_enable = 0;
+
+   bp_query_result = 
ctx->dc_bios->funcs->get_lttpr_caps(ctx->dc_bios, _vbios_lttpr_enable);
+   dc->caps.vbios_lttpr_enable = (bp_query_result == BP_RESULT_OK) 
&& !!is_vbios_lttpr_enable;
+   }
+
+   if (ctx->dc_bios->funcs->get_lttpr_interop) {
+   enum bp_result bp_query_result;
+   uint8_t is_vbios_interop_enabled = 0;
+
+   bp_query_result = 
ctx->dc_bios->funcs->get_lttpr_interop(ctx->dc_bios, _vbios_interop_enabled);
+   dc->caps.vbios_lttpr_aware = (bp_query_result == BP_RESULT_OK) 
&& 

[PATCH] drm/amd/display: Fix LTTPR not Enabled on HP ZBook G8 laptop

2021-11-15 Thread Aurabindo Pillai
From: Angus Wang 

[WHY]
Previous LTTPR change has caused a regression that led to an
issue where LTTPR is disabled

[HOW]
Extended changes from previous fix to DCN30X

Reviewed-by: Wesley Chalmers 
Reviewed-by: Martin Leung 
Acked-by: Qingqing Zhuo 
Signed-off-by: Angus Wang 
---
 .../amd/display/dc/dcn301/dcn301_resource.c| 18 ++
 .../amd/display/dc/dcn302/dcn302_resource.c| 18 ++
 .../amd/display/dc/dcn303/dcn303_resource.c| 17 +
 3 files changed, 53 insertions(+)

diff --git a/drivers/gpu/drm/amd/display/dc/dcn301/dcn301_resource.c 
b/drivers/gpu/drm/amd/display/dc/dcn301/dcn301_resource.c
index 2650d3bd50ec..9cc1610360bd 100644
--- a/drivers/gpu/drm/amd/display/dc/dcn301/dcn301_resource.c
+++ b/drivers/gpu/drm/amd/display/dc/dcn301/dcn301_resource.c
@@ -1485,6 +1485,24 @@ static bool dcn301_resource_construct(
dc->caps.color.mpc.ogam_rom_caps.hlg = 0;
dc->caps.color.mpc.ocsc = 1;
 
+   /* read VBIOS LTTPR caps */
+
+   if (ctx->dc_bios->funcs->get_lttpr_caps) {
+   enum bp_result bp_query_result;
+   uint8_t is_vbios_lttpr_enable = 0;
+
+   bp_query_result = 
ctx->dc_bios->funcs->get_lttpr_caps(ctx->dc_bios, _vbios_lttpr_enable);
+   dc->caps.vbios_lttpr_enable = (bp_query_result == BP_RESULT_OK) 
&& !!is_vbios_lttpr_enable;
+   }
+
+   if (ctx->dc_bios->funcs->get_lttpr_interop) {
+   enum bp_result bp_query_result;
+   uint8_t is_vbios_interop_enabled = 0;
+
+   bp_query_result = 
ctx->dc_bios->funcs->get_lttpr_interop(ctx->dc_bios, _vbios_interop_enabled);
+   dc->caps.vbios_lttpr_aware = (bp_query_result == BP_RESULT_OK) 
&& !!is_vbios_interop_enabled;
+   }
+
if (dc->ctx->dce_environment == DCE_ENV_PRODUCTION_DRV)
dc->debug = debug_defaults_drv;
else if (dc->ctx->dce_environment == DCE_ENV_FPGA_MAXIMUS) {
diff --git a/drivers/gpu/drm/amd/display/dc/dcn302/dcn302_resource.c 
b/drivers/gpu/drm/amd/display/dc/dcn302/dcn302_resource.c
index fcf96cf08c76..058f5d71e037 100644
--- a/drivers/gpu/drm/amd/display/dc/dcn302/dcn302_resource.c
+++ b/drivers/gpu/drm/amd/display/dc/dcn302/dcn302_resource.c
@@ -1557,6 +1557,24 @@ static bool dcn302_resource_construct(
dc->caps.color.mpc.ogam_rom_caps.hlg = 0;
dc->caps.color.mpc.ocsc = 1;
 
+   /* read VBIOS LTTPR caps */
+   if (ctx->dc_bios->funcs->get_lttpr_caps) {
+   enum bp_result bp_query_result;
+   uint8_t is_vbios_lttpr_enable = 0;
+
+   bp_query_result = 
ctx->dc_bios->funcs->get_lttpr_caps(ctx->dc_bios, _vbios_lttpr_enable);
+   dc->caps.vbios_lttpr_enable = (bp_query_result == BP_RESULT_OK) 
&& !!is_vbios_lttpr_enable;
+   }
+
+   if (ctx->dc_bios->funcs->get_lttpr_interop) {
+   enum bp_result bp_query_result;
+   uint8_t is_vbios_interop_enabled = 0;
+
+   bp_query_result = 
ctx->dc_bios->funcs->get_lttpr_interop(ctx->dc_bios,
+   _vbios_interop_enabled);
+   dc->caps.vbios_lttpr_aware = (bp_query_result == BP_RESULT_OK) 
&& !!is_vbios_interop_enabled;
+   }
+
if (dc->ctx->dce_environment == DCE_ENV_PRODUCTION_DRV)
dc->debug = debug_defaults_drv;
else
diff --git a/drivers/gpu/drm/amd/display/dc/dcn303/dcn303_resource.c 
b/drivers/gpu/drm/amd/display/dc/dcn303/dcn303_resource.c
index 4a9b64023675..7024aeb0884c 100644
--- a/drivers/gpu/drm/amd/display/dc/dcn303/dcn303_resource.c
+++ b/drivers/gpu/drm/amd/display/dc/dcn303/dcn303_resource.c
@@ -1500,6 +1500,23 @@ static bool dcn303_resource_construct(
dc->caps.color.mpc.ogam_rom_caps.hlg = 0;
dc->caps.color.mpc.ocsc = 1;
 
+   /* read VBIOS LTTPR caps */
+   if (ctx->dc_bios->funcs->get_lttpr_caps) {
+   enum bp_result bp_query_result;
+   uint8_t is_vbios_lttpr_enable = 0;
+
+   bp_query_result = 
ctx->dc_bios->funcs->get_lttpr_caps(ctx->dc_bios, _vbios_lttpr_enable);
+   dc->caps.vbios_lttpr_enable = (bp_query_result == BP_RESULT_OK) 
&& !!is_vbios_lttpr_enable;
+   }
+
+   if (ctx->dc_bios->funcs->get_lttpr_interop) {
+   enum bp_result bp_query_result;
+   uint8_t is_vbios_interop_enabled = 0;
+
+   bp_query_result = 
ctx->dc_bios->funcs->get_lttpr_interop(ctx->dc_bios, _vbios_interop_enabled);
+   dc->caps.vbios_lttpr_aware = (bp_query_result == BP_RESULT_OK) 
&& !!is_vbios_interop_enabled;
+   }
+
if (dc->ctx->dce_environment == DCE_ENV_PRODUCTION_DRV)
dc->debug = debug_defaults_drv;
else
-- 
2.30.2



[PATCH v1 1/2] ext4/xfs: add page refcount helper

2021-11-15 Thread Alex Sierra
From: Ralph Campbell 

There are several places where ZONE_DEVICE struct pages assume a reference
count == 1 means the page is idle and free. Instead of open coding this,
add a helper function to hide this detail.

Signed-off-by: Ralph Campbell 
Signed-off-by: Alex Sierra 
Reviewed-by: Christoph Hellwig 
Acked-by: Theodore Ts'o 
Acked-by: Darrick J. Wong 
---
v3:
[AS]: rename dax_layout_is_idle_page func to dax_page_unused

v4:
[AS]: This ref count functionality was missing on fuse/dax.c.
---
 fs/dax.c|  4 ++--
 fs/ext4/inode.c |  5 +
 fs/fuse/dax.c   |  4 +---
 fs/xfs/xfs_file.c   |  4 +---
 include/linux/dax.h | 10 ++
 5 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 4e3e5a283a91..84803c835650 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -369,7 +369,7 @@ static void dax_disassociate_entry(void *entry, struct 
address_space *mapping,
for_each_mapped_pfn(entry, pfn) {
struct page *page = pfn_to_page(pfn);
 
-   WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
+   WARN_ON_ONCE(trunc && !dax_page_unused(page));
WARN_ON_ONCE(page->mapping && page->mapping != mapping);
page->mapping = NULL;
page->index = 0;
@@ -383,7 +383,7 @@ static struct page *dax_busy_page(void *entry)
for_each_mapped_pfn(entry, pfn) {
struct page *page = pfn_to_page(pfn);
 
-   if (page_ref_count(page) > 1)
+   if (!dax_page_unused(page))
return page;
}
return NULL;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 0f06305167d5..068e8f78ec02 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3913,10 +3913,7 @@ int ext4_break_layouts(struct inode *inode)
if (!page)
return 0;
 
-   error = ___wait_var_event(>_refcount,
-   atomic_read(>_refcount) == 1,
-   TASK_INTERRUPTIBLE, 0, 0,
-   ext4_wait_dax_page(inode));
+   error = dax_wait_page(inode, page, ext4_wait_dax_page);
} while (error == 0);
 
return error;
diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
index 281d79f8b3d3..f6d2a36e56e2 100644
--- a/fs/fuse/dax.c
+++ b/fs/fuse/dax.c
@@ -676,9 +676,7 @@ static int __fuse_dax_break_layouts(struct inode *inode, 
bool *retry,
return 0;
 
*retry = true;
-   return ___wait_var_event(>_refcount,
-   atomic_read(>_refcount) == 1, TASK_INTERRUPTIBLE,
-   0, 0, fuse_wait_dax_page(inode));
+   return dax_wait_page(inode, page, fuse_wait_dax_page);
 }
 
 /* dmap_end == 0 leads to unmapping of whole file */
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 7aa943edfc02..fb13b12fd032 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -860,9 +860,7 @@ xfs_break_dax_layouts(
return 0;
 
*retry = true;
-   return ___wait_var_event(>_refcount,
-   atomic_read(>_refcount) == 1, TASK_INTERRUPTIBLE,
-   0, 0, xfs_wait_dax_page(inode));
+   return dax_wait_page(inode, page, xfs_wait_dax_page);
 }
 
 int
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 2619d94c308d..c9c27fbf0b98 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -216,6 +216,16 @@ static inline bool dax_mapping(struct address_space 
*mapping)
return mapping->host && IS_DAX(mapping->host);
 }
 
+static inline bool dax_page_unused(struct page *page)
+{
+   return page_ref_count(page) == 1;
+}
+
+#define dax_wait_page(_inode, _page, _wait_cb) \
+   ___wait_var_event(&(_page)->_refcount,  \
+   dax_page_unused(_page), \
+   TASK_INTERRUPTIBLE, 0, 0, _wait_cb(_inode))
+
 #ifdef CONFIG_DEV_DAX_HMEM_DEVICES
 void hmem_register_device(int target_nid, struct resource *r);
 #else
-- 
2.32.0



[PATCH v1 2/2] mm: remove extra ZONE_DEVICE struct page refcount

2021-11-15 Thread Alex Sierra
From: Ralph Campbell 

ZONE_DEVICE struct pages have an extra reference count that complicates the
code for put_page() and several places in the kernel that need to check the
reference count to see that a page is not being used (gup, compaction,
migration, etc.). Clean up the code so the reference count doesn't need to
be treated specially for ZONE_DEVICE.

Signed-off-by: Ralph Campbell 
Signed-off-by: Alex Sierra 
Reviewed-by: Christoph Hellwig 
---
 arch/powerpc/kvm/book3s_hv_uvmem.c   |  2 +-
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |  3 +-
 drivers/gpu/drm/nouveau/nouveau_dmem.c   |  2 +-
 fs/dax.c |  4 +-
 include/linux/dax.h  |  2 +-
 include/linux/memremap.h |  7 ++-
 include/linux/mm.h   | 11 
 lib/test_hmm.c   |  2 +-
 mm/internal.h|  7 +++
 mm/memcontrol.c  |  6 +-
 mm/memremap.c| 72 +++-
 mm/migrate.c |  5 --
 mm/page_alloc.c  |  3 +
 mm/swap.c| 45 ++-
 14 files changed, 48 insertions(+), 123 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index a7061ee3b157..4de7c77ea17b 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -712,7 +712,7 @@ static struct page *kvmppc_uvmem_get_page(unsigned long 
gpa, struct kvm *kvm)
 
dpage = pfn_to_page(uvmem_pfn);
dpage->zone_device_data = pvt;
-   get_page(dpage);
+   init_page_count(dpage);
lock_page(dpage);
return dpage;
 out_clear:
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index 3e405f078ade..c1b41c301a67 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -224,7 +224,8 @@ svm_migrate_get_vram_page(struct svm_range *prange, 
unsigned long pfn)
page = pfn_to_page(pfn);
svm_range_bo_ref(prange->svm_bo);
page->zone_device_data = prange->svm_bo;
-   get_page(page);
+   VM_BUG_ON_PAGE(page_ref_count(page), page);
+   init_page_count(page);
lock_page(page);
 }
 
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c 
b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index 92987daa5e17..8bc7120e1216 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -324,7 +324,7 @@ nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm)
return NULL;
}
 
-   get_page(page);
+   init_page_count(page);
lock_page(page);
return page;
 }
diff --git a/fs/dax.c b/fs/dax.c
index 84803c835650..72bb6f1e155c 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -572,14 +572,14 @@ static void *grab_mapping_entry(struct xa_state *xas,
 
 /**
  * dax_layout_busy_page_range - find first pinned page in @mapping
- * @mapping: address space to scan for a page with ref count > 1
+ * @mapping: address space to scan for a page with ref count > 0
  * @start: Starting offset. Page containing 'start' is included.
  * @end: End offset. Page containing 'end' is included. If 'end' is LLONG_MAX,
  *   pages from 'start' till the end of file are included.
  *
  * DAX requires ZONE_DEVICE mapped pages. These pages are never
  * 'onlined' to the page allocator so they are considered idle when
- * page->count == 1. A filesystem uses this interface to determine if
+ * page->count == 0. A filesystem uses this interface to determine if
  * any page in the mapping is busy, i.e. for DMA, or other
  * get_user_pages() usages.
  *
diff --git a/include/linux/dax.h b/include/linux/dax.h
index c9c27fbf0b98..1d3e25d51fc6 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -218,7 +218,7 @@ static inline bool dax_mapping(struct address_space 
*mapping)
 
 static inline bool dax_page_unused(struct page *page)
 {
-   return page_ref_count(page) == 1;
+   return page_ref_count(page) == 0;
 }
 
 #define dax_wait_page(_inode, _page, _wait_cb) \
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index ff4d398edf35..8340e39241fc 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -74,9 +74,10 @@ enum memory_type {
 
 struct dev_pagemap_ops {
/*
-* Called once the page refcount reaches 1.  (ZONE_DEVICE pages never
-* reach 0 refcount unless there is a refcount bug. This allows the
-* device driver to implement its own memory management.)
+* Called once the page refcount reaches 0. The reference count
+* should be reset to one with init_page_count(page) before reusing
+* the page. This allows the device driver to implement its own
+* memory management.
 */
void (*page_free)(struct page *page);
 
diff --git 

[PATCH v1 0/2] Remove extra ZONE_DEVICE page refcount

2021-11-15 Thread Alex Sierra
This patch includes Ralph Campbell’s ZONE_DEVICE refcount cleanup and
additionally the changes necessary to avoid breaking the separately
submitted MEMORY_DEVICE_COHERENT page migration code.
Ralph’s original description:
ZONE_DEVICE struct pages have an extra reference count that complicates
the code for put_page() and several places in the kernel that need to
check the reference count to see that a page is not being used (gup,
compaction, migration, etc.). Clean up the code so the reference count
doesn't need to be treated specially for ZONE_DEVICE.
Following a suggestion by Christoph, we attempted to combine this
cleanup with the device patch migration patch series, however this
caused xftests 413 to fail with a warning, the root cause of which has
large kernel footprint than just device memory. Fixing this issue
properly will require cooperation between multiple development groups
working across multiple kernel subsystems, as is apparent from the
discussion under the earlier, combined patch submission.
We therefore propose to break this work out separately as its own patch,
so it can receive the cooperative development work it needs. The deep
problem arises from the get_user_pages API, which has proved troublesome
for many years. It is possible that a concerted effort to repair this
particular refcount issue properly will be a step in the direction of
rationalizing this popular and problematic API.
In the larger picture, this API rationalization work probably deserves
an agenda item at the upcoming Filesystem, MM & BPF Summit:
https://events.linuxfoundation.org/lsfmm/

The wide ranging discussion following previous iterations of the
migration patchset focused almost exclusively on the refcount cleanup
patch. The thread is here:
https://lore.kernel.org/linux-mm/20211014153928.16805-3-alex.sie...@amd.com/
and links a number of previous threads. It is apparent that there is a
lot of work in progress by a number of developer groups in parallel,
and that there are issues with the order in which this work should be
attempted and merged.
Jason provided his list of “balls in the air”:
 - Joao's compound page support for device_dax and more
 - Alex's DEVICE_COHERENT
 - The refcount normalization
 - Removing the pgmap test from GUP
 - Removing the need for the PUD/PMD/PTE special bit
 - Removing the need for the PUD/PMD/PTE devmap bit
 - Remove PUD/PMD vma_is_special
 - folios for fsdax
 - shootdown for fsdax
It is not clear that the refcount cleanup in this patch should be the
first item on the list to be merged, however it has proved to be a good
starting point for a cooperative effort to address the underlying
issues.
Ralph, if you would prefer to take back “ownership” of this patch, it’s
yours, otherwise we will be happy to keep it in play and get it merged
when some other pieces of the puzzle fall into place.

Ralph Campbell (2):
  ext4/xfs: add page refcount helper
  mm: remove extra ZONE_DEVICE struct page refcount

 arch/powerpc/kvm/book3s_hv_uvmem.c   |  2 +-
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |  3 +-
 drivers/gpu/drm/nouveau/nouveau_dmem.c   |  2 +-
 fs/dax.c |  8 +--
 fs/ext4/inode.c  |  5 +-
 fs/fuse/dax.c|  4 +-
 fs/xfs/xfs_file.c|  4 +-
 include/linux/dax.h  | 10 
 include/linux/memremap.h |  7 ++-
 include/linux/mm.h   | 11 
 lib/test_hmm.c   |  2 +-
 mm/internal.h|  7 +++
 mm/memcontrol.c  |  6 +-
 mm/memremap.c| 72 +++-
 mm/migrate.c |  5 --
 mm/page_alloc.c  |  3 +
 mm/swap.c| 45 ++-
 17 files changed, 62 insertions(+), 134 deletions(-)

-- 
2.32.0



[PATCH v1 8/9] tools: update hmm-test to support device coherent type

2021-11-15 Thread Alex Sierra
Test cases such as migrate_fault and migrate_multiple,
were modified to explicit migrate from device to sys memory
without the need of page faults, when using device coherent
type.

Snapshot test case updated to read memory device type
first and based on that, get the proper returned results
migrate_ping_pong test case added to test explicit migration
from device to sys memory for both private and coherent zone
types.

Helpers to migrate from device to sys memory and vicerversa
were also added.

Signed-off-by: Alex Sierra 
---
 tools/testing/selftests/vm/hmm-tests.c | 156 ++---
 1 file changed, 138 insertions(+), 18 deletions(-)

diff --git a/tools/testing/selftests/vm/hmm-tests.c 
b/tools/testing/selftests/vm/hmm-tests.c
index 864f126ffd78..6091e30636d5 100644
--- a/tools/testing/selftests/vm/hmm-tests.c
+++ b/tools/testing/selftests/vm/hmm-tests.c
@@ -44,6 +44,7 @@ struct hmm_buffer {
int fd;
uint64_tcpages;
uint64_tfaults;
+   int zone_device_type;
 };
 
 #define TWOMEG (1 << 21)
@@ -144,6 +145,7 @@ static int hmm_dmirror_cmd(int fd,
}
buffer->cpages = cmd.cpages;
buffer->faults = cmd.faults;
+   buffer->zone_device_type = cmd.zone_device_type;
 
return 0;
 }
@@ -211,6 +213,32 @@ static void hmm_nanosleep(unsigned int n)
nanosleep(, NULL);
 }
 
+static int hmm_migrate_sys_to_dev(int fd,
+  struct hmm_buffer *buffer,
+  unsigned long npages)
+{
+   return hmm_dmirror_cmd(fd, HMM_DMIRROR_MIGRATE_TO_DEV, buffer, npages);
+}
+
+static int hmm_migrate_dev_to_sys(int fd,
+  struct hmm_buffer *buffer,
+  unsigned long npages)
+{
+   return hmm_dmirror_cmd(fd, HMM_DMIRROR_MIGRATE_TO_SYS, buffer, npages);
+}
+
+static int hmm_is_private_device(int fd, bool *res)
+{
+   struct hmm_buffer buffer;
+   int ret;
+
+   buffer.ptr = 0;
+   ret = hmm_dmirror_cmd(fd, HMM_DMIRROR_GET_MEM_DEV_TYPE, , 1);
+   *res = (buffer.zone_device_type == HMM_DMIRROR_MEMORY_DEVICE_PRIVATE);
+
+   return ret;
+}
+
 /*
  * Simple NULL test of device open/close.
  */
@@ -875,7 +903,7 @@ TEST_F(hmm, migrate)
ptr[i] = i;
 
/* Migrate memory to device. */
-   ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages);
+   ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
ASSERT_EQ(ret, 0);
ASSERT_EQ(buffer->cpages, npages);
 
@@ -923,7 +951,7 @@ TEST_F(hmm, migrate_fault)
ptr[i] = i;
 
/* Migrate memory to device. */
-   ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages);
+   ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
ASSERT_EQ(ret, 0);
ASSERT_EQ(buffer->cpages, npages);
 
@@ -936,7 +964,7 @@ TEST_F(hmm, migrate_fault)
ASSERT_EQ(ptr[i], i);
 
/* Migrate memory to the device again. */
-   ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages);
+   ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
ASSERT_EQ(ret, 0);
ASSERT_EQ(buffer->cpages, npages);
 
@@ -976,7 +1004,7 @@ TEST_F(hmm, migrate_shared)
ASSERT_NE(buffer->ptr, MAP_FAILED);
 
/* Migrate memory to device. */
-   ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages);
+   ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
ASSERT_EQ(ret, -ENOENT);
 
hmm_buffer_free(buffer);
@@ -1015,7 +1043,7 @@ TEST_F(hmm2, migrate_mixed)
p = buffer->ptr;
 
/* Migrating a protected area should be an error. */
-   ret = hmm_dmirror_cmd(self->fd1, HMM_DMIRROR_MIGRATE, buffer, npages);
+   ret = hmm_migrate_sys_to_dev(self->fd1, buffer, npages);
ASSERT_EQ(ret, -EINVAL);
 
/* Punch a hole after the first page address. */
@@ -1023,7 +1051,7 @@ TEST_F(hmm2, migrate_mixed)
ASSERT_EQ(ret, 0);
 
/* We expect an error if the vma doesn't cover the range. */
-   ret = hmm_dmirror_cmd(self->fd1, HMM_DMIRROR_MIGRATE, buffer, 3);
+   ret = hmm_migrate_sys_to_dev(self->fd1, buffer, 3);
ASSERT_EQ(ret, -EINVAL);
 
/* Page 2 will be a read-only zero page. */
@@ -1055,13 +1083,13 @@ TEST_F(hmm2, migrate_mixed)
 
/* Now try to migrate pages 2-5 to device 1. */
buffer->ptr = p + 2 * self->page_size;
-   ret = hmm_dmirror_cmd(self->fd1, HMM_DMIRROR_MIGRATE, buffer, 4);
+   ret = hmm_migrate_sys_to_dev(self->fd1, buffer, 4);
ASSERT_EQ(ret, 0);
ASSERT_EQ(buffer->cpages, 4);
 
/* Page 5 won't be migrated to device 0 because it's on device 1. */
buffer->ptr = p + 5 * self->page_size;
-   ret = hmm_dmirror_cmd(self->fd0, HMM_DMIRROR_MIGRATE, buffer, 1);
+   ret = hmm_migrate_sys_to_dev(self->fd0, buffer, 1);

[PATCH v1 0/9] Add MEMORY_DEVICE_COHERENT for coherent device memory mapping

2021-11-15 Thread Alex Sierra
This patch series introduces MEMORY_DEVICE_COHERENT, a type of memory
owned by a device that can be mapped into CPU page tables like
MEMORY_DEVICE_GENERIC and can also be migrated like
MEMORY_DEVICE_PRIVATE.
Christoph, the suggestion to incorporate Ralph Campbell’s refcount
cleanup patch into our hardware page migration patchset originally came
from you, but it proved impractical to do things in that order because
the refcount cleanup introduced a bug with wide ranging structural
implications. Instead, we amended Ralph’s patch so that it could be
applied after merging the migration work. As we saw from the recent
discussion, merging the refcount work is going to take some time and
cooperation between multiple development groups, while the migration
work is ready now and is needed now. So we propose to merge this
patchset first and continue to work with Ralph and others to merge the
refcount cleanup separately, when it is ready.
This patch series is mostly self-contained except for a few places where
it needs to update other subsystems to handle the new memory type.
System stability and performance are not affected according to our
ongoing testing, including xfstests.
How it works: The system BIOS advertises the GPU device memory
(aka VRAM) as SPM (special purpose memory) in the UEFI system address
map.
The amdgpu driver registers the memory with devmap as
MEMORY_DEVICE_COHERENT using devm_memremap_pages. The initial user for
this hardware page migration capability is the Frontier supercomputer
project. This functionality is not AMD-specific. We expect other GPU
vendors to find this functionality useful, and possibly other hardware
types in the future.
Our test nodes in the lab are similar to the Frontier configuration,
with .5 TB of system memory plus 256 GB of device memory split across
4 GPUs, all in a single coherent address space. Page migration is
expected to improve application efficiency significantly. We will
report empirical results as they become available.
We extended hmm_test to cover migration of MEMORY_DEVICE_COHERENT. This
patch set builds on HMM and our SVM memory manager already merged in
5.15.

Alex Sierra (9):
  mm: add zone device coherent type memory support
  mm: add device coherent vma selection for memory migration
  drm/amdkfd: add SPM support for SVM
  drm/amdkfd: coherent type as sys mem on migration to ram
  lib: test_hmm add ioctl to get zone device type
  lib: test_hmm add module param for zone device type
  lib: add support for device coherent type in test_hmm
  tools: update hmm-test to support device coherent type
  tools: update test_hmm script to support SP config

 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |  34 ++-
 include/linux/memremap.h |   8 +
 include/linux/migrate.h  |   1 +
 include/linux/mm.h   |   9 +
 lib/test_hmm.c   | 270 +--
 lib/test_hmm_uapi.h  |  21 +-
 mm/memcontrol.c  |   6 +-
 mm/memory-failure.c  |   6 +-
 mm/memremap.c|   5 +-
 mm/migrate.c |  30 ++-
 tools/testing/selftests/vm/hmm-tests.c   | 156 +++--
 tools/testing/selftests/vm/test_hmm.sh   |  20 +-
 12 files changed, 446 insertions(+), 120 deletions(-)

-- 
2.32.0



[PATCH v1 7/9] lib: add support for device coherent type in test_hmm

2021-11-15 Thread Alex Sierra
Device Coherent type uses device memory that is coherently accesible by
the CPU. This could be shown as SP (special purpose) memory range
at the BIOS-e820 memory enumeration. If no SP memory is supported in
system, this could be faked by setting CONFIG_EFI_FAKE_MEMMAP.

Currently, test_hmm only supports two different SP ranges of at least
256MB size. This could be specified in the kernel parameter variable
efi_fake_mem. Ex. Two SP ranges of 1GB starting at 0x1 &
0x14000 physical address. Ex.
efi_fake_mem=1G@0x1:0x4,1G@0x14000:0x4

Signed-off-by: Alex Sierra 
---
 lib/test_hmm.c  | 191 +---
 lib/test_hmm_uapi.h |  15 ++--
 2 files changed, 153 insertions(+), 53 deletions(-)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 45334df28d7b..9e2cc0cd4412 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -471,6 +471,7 @@ static int dmirror_allocate_chunk(struct dmirror_device 
*mdevice,
unsigned long pfn_first;
unsigned long pfn_last;
void *ptr;
+   int ret = -ENOMEM;
 
devmem = kzalloc(sizeof(*devmem), GFP_KERNEL);
if (!devmem)
@@ -553,7 +554,7 @@ static int dmirror_allocate_chunk(struct dmirror_device 
*mdevice,
}
spin_unlock(>lock);
 
-   return true;
+   return 0;
 
 err_release:
mutex_unlock(>devmem_lock);
@@ -562,7 +563,7 @@ static int dmirror_allocate_chunk(struct dmirror_device 
*mdevice,
 err_devmem:
kfree(devmem);
 
-   return false;
+   return ret;
 }
 
 static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
@@ -571,8 +572,10 @@ static struct page *dmirror_devmem_alloc_page(struct 
dmirror_device *mdevice)
struct page *rpage;
 
/*
-* This is a fake device so we alloc real system memory to store
-* our device memory.
+* For ZONE_DEVICE private type, this is a fake device so we alloc real
+* system memory to store our device memory.
+* For ZONE_DEVICE coherent type we use the actual dpage to store the 
data
+* and ignore rpage.
 */
rpage = alloc_page(GFP_HIGHUSER);
if (!rpage)
@@ -623,12 +626,17 @@ static void dmirror_migrate_alloc_and_copy(struct 
migrate_vma *args,
 * unallocated pte_none() or read-only zero page.
 */
spage = migrate_pfn_to_page(*src);
+   if (spage && is_zone_device_page(spage))
+   pr_err("page already in device spage pfn: 0x%lx\n",
+   page_to_pfn(spage));
+   WARN_ON(spage && is_zone_device_page(spage));
 
dpage = dmirror_devmem_alloc_page(mdevice);
if (!dpage)
continue;
 
-   rpage = dpage->zone_device_data;
+   rpage = is_device_private_page(dpage) ? dpage->zone_device_data 
:
+   dpage;
if (spage)
copy_highpage(rpage, spage);
else
@@ -640,8 +648,11 @@ static void dmirror_migrate_alloc_and_copy(struct 
migrate_vma *args,
 * the simulated device memory and that page holds the pointer
 * to the mirror.
 */
+   rpage = dpage->zone_device_data;
rpage->zone_device_data = dmirror;
 
+   pr_debug("migrating from sys to dev pfn src: 0x%lx pfn dst: 
0x%lx\n",
+page_to_pfn(spage), page_to_pfn(dpage));
*dst = migrate_pfn(page_to_pfn(dpage)) |
MIGRATE_PFN_LOCKED;
if ((*src & MIGRATE_PFN_WRITE) ||
@@ -721,10 +732,13 @@ static int dmirror_migrate_finalize_and_map(struct 
migrate_vma *args,
continue;
 
/*
-* Store the page that holds the data so the page table
-* doesn't have to deal with ZONE_DEVICE private pages.
+* For ZONE_DEVICE private pages we store the page that
+* holds the data so the page table doesn't have to deal it.
+* For ZONE_DEVICE coherent pages we store the actual page, 
since
+* the CPU has coherent access to the page.
 */
-   entry = dpage->zone_device_data;
+   entry = is_device_private_page(dpage) ? dpage->zone_device_data 
:
+   dpage;
if (*dst & MIGRATE_PFN_WRITE)
entry = xa_tag_pointer(entry, DPT_XA_TAG_WRITE);
entry = xa_store(>pt, pfn, entry, GFP_ATOMIC);
@@ -804,8 +818,110 @@ static int dmirror_exclusive(struct dmirror *dmirror,
return ret;
 }
 
-static int dmirror_migrate(struct dmirror *dmirror,
-  struct hmm_dmirror_cmd *cmd)
+static vm_fault_t dmirror_devmem_fault_alloc_and_copy(struct migrate_vma 

[PATCH v1 6/9] lib: test_hmm add module param for zone device type

2021-11-15 Thread Alex Sierra
In order to configure device coherent in test_hmm, two module parameters
should be passed, which correspond to the SP start address of each
device (2) spm_addr_dev0 & spm_addr_dev1. If no parameters are passed,
private device type is configured.

Signed-off-by: Alex Sierra 
---
 lib/test_hmm.c  | 66 +++--
 lib/test_hmm_uapi.h |  1 +
 2 files changed, 47 insertions(+), 20 deletions(-)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 2daaa7b3710b..45334df28d7b 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -34,6 +34,16 @@
 #define DEVMEM_CHUNK_SIZE  (256 * 1024 * 1024U)
 #define DEVMEM_CHUNKS_RESERVE  16
 
+static unsigned long spm_addr_dev0;
+module_param(spm_addr_dev0, long, 0644);
+MODULE_PARM_DESC(spm_addr_dev0,
+   "Specify start address for SPM (special purpose memory) used 
for device 0. By setting this Coherent device type will be used. Make sure 
spm_addr_dev1 is set too");
+
+static unsigned long spm_addr_dev1;
+module_param(spm_addr_dev1, long, 0644);
+MODULE_PARM_DESC(spm_addr_dev1,
+   "Specify start address for SPM (special purpose memory) used 
for device 1. By setting this Coherent device type will be used. Make sure 
spm_addr_dev0 is set too");
+
 static const struct dev_pagemap_ops dmirror_devmem_ops;
 static const struct mmu_interval_notifier_ops dmirror_min_ops;
 static dev_t dmirror_dev;
@@ -452,11 +462,11 @@ static int dmirror_write(struct dmirror *dmirror, struct 
hmm_dmirror_cmd *cmd)
return ret;
 }
 
-static bool dmirror_allocate_chunk(struct dmirror_device *mdevice,
+static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
   struct page **ppage)
 {
struct dmirror_chunk *devmem;
-   struct resource *res;
+   struct resource *res = NULL;
unsigned long pfn;
unsigned long pfn_first;
unsigned long pfn_last;
@@ -464,17 +474,29 @@ static bool dmirror_allocate_chunk(struct dmirror_device 
*mdevice,
 
devmem = kzalloc(sizeof(*devmem), GFP_KERNEL);
if (!devmem)
-   return false;
+   return -ENOMEM;
 
-   res = request_free_mem_region(_resource, DEVMEM_CHUNK_SIZE,
- "hmm_dmirror");
-   if (IS_ERR(res))
-   goto err_devmem;
+   if (!spm_addr_dev0 && !spm_addr_dev1) {
+   res = request_free_mem_region(_resource, 
DEVMEM_CHUNK_SIZE,
+ "hmm_dmirror");
+   if (IS_ERR_OR_NULL(res))
+   goto err_devmem;
+   devmem->pagemap.range.start = res->start;
+   devmem->pagemap.range.end = res->end;
+   devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
+   mdevice->zone_device_type = HMM_DMIRROR_MEMORY_DEVICE_PRIVATE;
+   } else if (spm_addr_dev0 && spm_addr_dev1) {
+   devmem->pagemap.range.start = MINOR(mdevice->cdevice.dev) ?
+   spm_addr_dev0 :
+   spm_addr_dev1;
+   devmem->pagemap.range.end = devmem->pagemap.range.start +
+   DEVMEM_CHUNK_SIZE - 1;
+   devmem->pagemap.type = MEMORY_DEVICE_COHERENT;
+   mdevice->zone_device_type = HMM_DMIRROR_MEMORY_DEVICE_COHERENT;
+   } else {
+   pr_err("Both spm_addr_dev parameters should be set\n");
+   }
 
-   mdevice->zone_device_type = HMM_DMIRROR_MEMORY_DEVICE_PRIVATE;
-   devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
-   devmem->pagemap.range.start = res->start;
-   devmem->pagemap.range.end = res->end;
devmem->pagemap.nr_range = 1;
devmem->pagemap.ops = _devmem_ops;
devmem->pagemap.owner = mdevice;
@@ -495,10 +517,14 @@ static bool dmirror_allocate_chunk(struct dmirror_device 
*mdevice,
mdevice->devmem_capacity = new_capacity;
mdevice->devmem_chunks = new_chunks;
}
-
ptr = memremap_pages(>pagemap, numa_node_id());
-   if (IS_ERR(ptr))
+   if (IS_ERR_OR_NULL(ptr)) {
+   if (ptr)
+   ret = PTR_ERR(ptr);
+   else
+   ret = -EFAULT;
goto err_release;
+   }
 
devmem->mdevice = mdevice;
pfn_first = devmem->pagemap.range.start >> PAGE_SHIFT;
@@ -531,7 +557,8 @@ static bool dmirror_allocate_chunk(struct dmirror_device 
*mdevice,
 
 err_release:
mutex_unlock(>devmem_lock);
-   release_mem_region(devmem->pagemap.range.start, 
range_len(>pagemap.range));
+   if (res)
+   release_mem_region(devmem->pagemap.range.start, 
range_len(>pagemap.range));
 err_devmem:
kfree(devmem);
 
@@ -1219,10 +1246,8 @@ static int dmirror_device_init(struct dmirror_device 
*mdevice, int id)
if (ret)
return ret;
 
-  

[PATCH v1 2/9] mm: add device coherent vma selection for memory migration

2021-11-15 Thread Alex Sierra
This case is used to migrate pages from device memory, back to system
memory. Device coherent type memory is cache coherent from device and CPU
point of view.

Signed-off-by: Alex Sierra 
---
v2:
condition added when migrations from device coherent pages.
---
 include/linux/migrate.h | 1 +
 mm/migrate.c| 9 +++--
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index c8077e936691..e74bb0978f6f 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -138,6 +138,7 @@ static inline unsigned long migrate_pfn(unsigned long pfn)
 enum migrate_vma_direction {
MIGRATE_VMA_SELECT_SYSTEM = 1 << 0,
MIGRATE_VMA_SELECT_DEVICE_PRIVATE = 1 << 1,
+   MIGRATE_VMA_SELECT_DEVICE_COHERENT = 1 << 2,
 };
 
 struct migrate_vma {
diff --git a/mm/migrate.c b/mm/migrate.c
index f74422a42192..166bfec7d85e 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2340,8 +2340,6 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
if (is_writable_device_private_entry(entry))
mpfn |= MIGRATE_PFN_WRITE;
} else {
-   if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
-   goto next;
pfn = pte_pfn(pte);
if (is_zero_pfn(pfn)) {
mpfn = MIGRATE_PFN_MIGRATE;
@@ -2349,6 +2347,13 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
goto next;
}
page = vm_normal_page(migrate->vma, addr, pte);
+   if (!is_zone_device_page(page) &&
+   !(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
+   goto next;
+   if (is_zone_device_page(page) &&
+   (!(migrate->flags & 
MIGRATE_VMA_SELECT_DEVICE_COHERENT) ||
+page->pgmap->owner != migrate->pgmap_owner))
+   goto next;
mpfn = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
}
-- 
2.32.0



[PATCH v1 9/9] tools: update test_hmm script to support SP config

2021-11-15 Thread Alex Sierra
Add two more parameters to set spm_addr_dev0 & spm_addr_dev1
addresses. These two parameters configure the start SP
addresses for each device in test_hmm driver.
Consequently, this configures zone device type as coherent.

Signed-off-by: Alex Sierra 
---
 tools/testing/selftests/vm/test_hmm.sh | 20 +---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/vm/test_hmm.sh 
b/tools/testing/selftests/vm/test_hmm.sh
index 0647b525a625..3eeabe94399f 100755
--- a/tools/testing/selftests/vm/test_hmm.sh
+++ b/tools/testing/selftests/vm/test_hmm.sh
@@ -40,7 +40,18 @@ check_test_requirements()
 
 load_driver()
 {
-   modprobe $DRIVER > /dev/null 2>&1
+   if [ $# -eq 0 ]; then
+   modprobe $DRIVER > /dev/null 2>&1
+   else
+   if [ $# -eq 2 ]; then
+   modprobe $DRIVER spm_addr_dev0=$1 spm_addr_dev1=$2
+   > /dev/null 2>&1
+   else
+   echo "Missing module parameters. Make sure pass"\
+   "spm_addr_dev0 and spm_addr_dev1"
+   usage
+   fi
+   fi
if [ $? == 0 ]; then
major=$(awk "\$2==\"HMM_DMIRROR\" {print \$1}" /proc/devices)
mknod /dev/hmm_dmirror0 c $major 0
@@ -58,7 +69,7 @@ run_smoke()
 {
echo "Running smoke test. Note, this test provides basic coverage."
 
-   load_driver
+   load_driver $1 $2
$(dirname "${BASH_SOURCE[0]}")/hmm-tests
unload_driver
 }
@@ -75,6 +86,9 @@ usage()
echo "# Smoke testing"
echo "./${TEST_NAME}.sh smoke"
echo
+   echo "# Smoke testing with SPM enabled"
+   echo "./${TEST_NAME}.sh smoke  "
+   echo
exit 0
 }
 
@@ -84,7 +98,7 @@ function run_test()
usage
else
if [ "$1" = "smoke" ]; then
-   run_smoke
+   run_smoke $2 $3
else
usage
fi
-- 
2.32.0



[PATCH v1 4/9] drm/amdkfd: coherent type as sys mem on migration to ram

2021-11-15 Thread Alex Sierra
Coherent device type memory on VRAM to RAM migration, has similar access
as System RAM from the CPU. This flag sets the source from the sender.
Which in Coherent type case, should be set as
MIGRATE_VMA_SELECT_DEVICE_COHERENT.

Signed-off-by: Alex Sierra 
Reviewed-by: Felix Kuehling 
---
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index 9e36fe8aea0f..3e405f078ade 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -661,9 +661,12 @@ svm_migrate_vma_to_ram(struct amdgpu_device *adev, struct 
svm_range *prange,
migrate.vma = vma;
migrate.start = start;
migrate.end = end;
-   migrate.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
migrate.pgmap_owner = SVM_ADEV_PGMAP_OWNER(adev);
 
+   if (adev->gmc.xgmi.connected_to_cpu)
+   migrate.flags = MIGRATE_VMA_SELECT_DEVICE_COHERENT;
+   else
+   migrate.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
size = 2 * sizeof(*migrate.src) + sizeof(uint64_t) + sizeof(dma_addr_t);
size *= npages;
buf = kvmalloc(size, GFP_KERNEL | __GFP_ZERO);
-- 
2.32.0



[PATCH v1 5/9] lib: test_hmm add ioctl to get zone device type

2021-11-15 Thread Alex Sierra
new ioctl cmd added to query zone device type. This will be
used once the test_hmm adds zone device coherent type.

Signed-off-by: Alex Sierra 
---
 lib/test_hmm.c  | 15 ++-
 lib/test_hmm_uapi.h |  7 +++
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index c259842f6d44..2daaa7b3710b 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -84,6 +84,7 @@ struct dmirror_chunk {
 struct dmirror_device {
struct cdev cdevice;
struct hmm_devmem   *devmem;
+   unsigned intzone_device_type;
 
unsigned intdevmem_capacity;
unsigned intdevmem_count;
@@ -470,6 +471,7 @@ static bool dmirror_allocate_chunk(struct dmirror_device 
*mdevice,
if (IS_ERR(res))
goto err_devmem;
 
+   mdevice->zone_device_type = HMM_DMIRROR_MEMORY_DEVICE_PRIVATE;
devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
devmem->pagemap.range.start = res->start;
devmem->pagemap.range.end = res->end;
@@ -1025,6 +1027,15 @@ static int dmirror_snapshot(struct dmirror *dmirror,
return ret;
 }
 
+static int dmirror_get_device_type(struct dmirror *dmirror,
+   struct hmm_dmirror_cmd *cmd)
+{
+   mutex_lock(>mutex);
+   cmd->zone_device_type = dmirror->mdevice->zone_device_type;
+   mutex_unlock(>mutex);
+
+   return 0;
+}
 static long dmirror_fops_unlocked_ioctl(struct file *filp,
unsigned int command,
unsigned long arg)
@@ -1074,7 +1085,9 @@ static long dmirror_fops_unlocked_ioctl(struct file *filp,
case HMM_DMIRROR_SNAPSHOT:
ret = dmirror_snapshot(dmirror, );
break;
-
+   case HMM_DMIRROR_GET_MEM_DEV_TYPE:
+   ret = dmirror_get_device_type(dmirror, );
+   break;
default:
return -EINVAL;
}
diff --git a/lib/test_hmm_uapi.h b/lib/test_hmm_uapi.h
index f14dea5dcd06..c42e57a6a71e 100644
--- a/lib/test_hmm_uapi.h
+++ b/lib/test_hmm_uapi.h
@@ -26,6 +26,7 @@ struct hmm_dmirror_cmd {
__u64   npages;
__u64   cpages;
__u64   faults;
+   __u64   zone_device_type;
 };
 
 /* Expose the address space of the calling process through hmm device file */
@@ -35,6 +36,7 @@ struct hmm_dmirror_cmd {
 #define HMM_DMIRROR_SNAPSHOT   _IOWR('H', 0x03, struct hmm_dmirror_cmd)
 #define HMM_DMIRROR_EXCLUSIVE  _IOWR('H', 0x04, struct hmm_dmirror_cmd)
 #define HMM_DMIRROR_CHECK_EXCLUSIVE_IOWR('H', 0x05, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_GET_MEM_DEV_TYPE   _IOWR('H', 0x06, struct hmm_dmirror_cmd)
 
 /*
  * Values returned in hmm_dmirror_cmd.ptr for HMM_DMIRROR_SNAPSHOT.
@@ -62,4 +64,9 @@ enum {
HMM_DMIRROR_PROT_DEV_PRIVATE_REMOTE = 0x30,
 };
 
+enum {
+   /* 0 is reserved to catch uninitialized type fields */
+   HMM_DMIRROR_MEMORY_DEVICE_PRIVATE = 1,
+};
+
 #endif /* _LIB_TEST_HMM_UAPI_H */
-- 
2.32.0



[PATCH v1 3/9] drm/amdkfd: add SPM support for SVM

2021-11-15 Thread Alex Sierra
When CPU is connected throug XGMI, it has coherent
access to VRAM resource. In this case that resource
is taken from a table in the device gmc aperture base.
This resource is used along with the device type, which could
be DEVICE_PRIVATE or DEVICE_COHERENT to create the device
page map region.

Signed-off-by: Alex Sierra 
Reviewed-by: Felix Kuehling 
---
v7:
Remove lookup_resource call, so export symbol for this function
is not longer required. Patch dropped "kernel: resource:
lookup_resource as exported symbol"
---
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 29 +++-
 1 file changed, 18 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index aeade32ec298..9e36fe8aea0f 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -935,7 +935,7 @@ int svm_migrate_init(struct amdgpu_device *adev)
 {
struct kfd_dev *kfddev = adev->kfd.dev;
struct dev_pagemap *pgmap;
-   struct resource *res;
+   struct resource *res = NULL;
unsigned long size;
void *r;
 
@@ -950,28 +950,34 @@ int svm_migrate_init(struct amdgpu_device *adev)
 * should remove reserved size
 */
size = ALIGN(adev->gmc.real_vram_size, 2ULL << 20);
-   res = devm_request_free_mem_region(adev->dev, _resource, size);
-   if (IS_ERR(res))
-   return -ENOMEM;
+   if (adev->gmc.xgmi.connected_to_cpu) {
+   pgmap->range.start = adev->gmc.aper_base;
+   pgmap->range.end = adev->gmc.aper_base + adev->gmc.aper_size - 
1;
+   pgmap->type = MEMORY_DEVICE_COHERENT;
+   } else {
+   res = devm_request_free_mem_region(adev->dev, _resource, 
size);
+   if (IS_ERR(res))
+   return -ENOMEM;
+   pgmap->range.start = res->start;
+   pgmap->range.end = res->end;
+   pgmap->type = MEMORY_DEVICE_PRIVATE;
+   }
 
-   pgmap->type = MEMORY_DEVICE_PRIVATE;
pgmap->nr_range = 1;
-   pgmap->range.start = res->start;
-   pgmap->range.end = res->end;
pgmap->ops = _migrate_pgmap_ops;
pgmap->owner = SVM_ADEV_PGMAP_OWNER(adev);
-   pgmap->flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
-
+   pgmap->flags = 0;
/* Device manager releases device-specific resources, memory region and
 * pgmap when driver disconnects from device.
 */
r = devm_memremap_pages(adev->dev, pgmap);
if (IS_ERR(r)) {
pr_err("failed to register HMM device memory\n");
-
/* Disable SVM support capability */
pgmap->type = 0;
-   devm_release_mem_region(adev->dev, res->start, 
resource_size(res));
+   if (pgmap->type == MEMORY_DEVICE_PRIVATE)
+   devm_release_mem_region(adev->dev, res->start,
+   res->end - res->start + 1);
return PTR_ERR(r);
}
 
@@ -984,3 +990,4 @@ int svm_migrate_init(struct amdgpu_device *adev)
 
return 0;
 }
+
-- 
2.32.0



[PATCH v1 1/9] mm: add zone device coherent type memory support

2021-11-15 Thread Alex Sierra
Device memory that is cache coherent from device and CPU point of view.
This is used on platforms that have an advanced system bus (like CAPI
or CXL). Any page of a process can be migrated to such memory. However,
no one should be allowed to pin such memory so that it can always be
evicted.

Signed-off-by: Alex Sierra 
---
 include/linux/memremap.h |  8 
 include/linux/mm.h   |  9 +
 mm/memcontrol.c  |  6 +++---
 mm/memory-failure.c  |  6 +-
 mm/memremap.c|  5 -
 mm/migrate.c | 21 +
 6 files changed, 42 insertions(+), 13 deletions(-)

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index c0e9d35889e8..ff4d398edf35 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -39,6 +39,13 @@ struct vmem_altmap {
  * A more complete discussion of unaddressable memory may be found in
  * include/linux/hmm.h and Documentation/vm/hmm.rst.
  *
+ * MEMORY_DEVICE_COHERENT:
+ * Device memory that is cache coherent from device and CPU point of view. This
+ * is used on platforms that have an advanced system bus (like CAPI or CXL). A
+ * driver can hotplug the device memory using ZONE_DEVICE and with that memory
+ * type. Any page of a process can be migrated to such memory. However no one
+ * should be allowed to pin such memory so that it can always be evicted.
+ *
  * MEMORY_DEVICE_FS_DAX:
  * Host memory that has similar access semantics as System RAM i.e. DMA
  * coherent and supports page pinning. In support of coordinating page
@@ -59,6 +66,7 @@ struct vmem_altmap {
 enum memory_type {
/* 0 is reserved to catch uninitialized type fields */
MEMORY_DEVICE_PRIVATE = 1,
+   MEMORY_DEVICE_COHERENT,
MEMORY_DEVICE_FS_DAX,
MEMORY_DEVICE_GENERIC,
MEMORY_DEVICE_PCI_P2PDMA,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 73a52aba448f..d23b6ebd2177 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1162,6 +1162,7 @@ static inline bool page_is_devmap_managed(struct page 
*page)
return false;
switch (page->pgmap->type) {
case MEMORY_DEVICE_PRIVATE:
+   case MEMORY_DEVICE_COHERENT:
case MEMORY_DEVICE_FS_DAX:
return true;
default:
@@ -1191,6 +1192,14 @@ static inline bool is_device_private_page(const struct 
page *page)
page->pgmap->type == MEMORY_DEVICE_PRIVATE;
 }
 
+static inline bool is_device_page(const struct page *page)
+{
+   return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
+   is_zone_device_page(page) &&
+   (page->pgmap->type == MEMORY_DEVICE_PRIVATE ||
+   page->pgmap->type == MEMORY_DEVICE_COHERENT);
+}
+
 static inline bool is_pci_p2pdma_page(const struct page *page)
 {
return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6da5020a8656..e0275ebe1198 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5695,8 +5695,8 @@ static int mem_cgroup_move_account(struct page *page,
  *   2(MC_TARGET_SWAP): if the swap entry corresponding to this pte is a
  * target for charge migration. if @target is not NULL, the entry is stored
  * in target->ent.
- *   3(MC_TARGET_DEVICE): like MC_TARGET_PAGE  but page is 
MEMORY_DEVICE_PRIVATE
- * (so ZONE_DEVICE page and thus not on the lru).
+ *   3(MC_TARGET_DEVICE): like MC_TARGET_PAGE  but page is 
MEMORY_DEVICE_COHERENT
+ * or MEMORY_DEVICE_PRIVATE (so ZONE_DEVICE page and thus not on the lru).
  * For now we such page is charge like a regular page would be as for all
  * intent and purposes it is just special memory taking the place of a
  * regular page.
@@ -5730,7 +5730,7 @@ static enum mc_target_type get_mctgt_type(struct 
vm_area_struct *vma,
 */
if (page_memcg(page) == mc.from) {
ret = MC_TARGET_PAGE;
-   if (is_device_private_page(page))
+   if (is_device_page(page))
ret = MC_TARGET_DEVICE;
if (target)
target->page = page;
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 3e6449f2102a..51b55fc5172c 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1554,12 +1554,16 @@ static int memory_failure_dev_pagemap(unsigned long 
pfn, int flags,
goto unlock;
}
 
-   if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
+   switch (pgmap->type) {
+   case MEMORY_DEVICE_PRIVATE:
+   case MEMORY_DEVICE_COHERENT:
/*
 * TODO: Handle HMM pages which may need coordination
 * with device-side memory.
 */
goto unlock;
+   default:
+   break;
}
 
/*
diff --git a/mm/memremap.c b/mm/memremap.c
index ed593bf87109..94d6a1e01d42 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ 

[PATCH 5.15 350/917] drm/amd/display: Pass display_pipe_params_st as const in DML

2021-11-15 Thread Greg Kroah-Hartman
From: Harry Wentland 

[ Upstream commit 22667e6ec6b2ce9ca706e9061660b059725d009c ]

[Why]
This neither needs to be on the stack nor passed by value
to each function call. In fact, when building with clang
it seems to break the Linux's default 1024 byte stack
frame limit.

[How]
We can simply pass this as a const pointer.

This patch fixes these Coverity IDs
Addresses-Coverity-ID: 1424031: ("Big parameter passed by value")
Addresses-Coverity-ID: 1423970: ("Big parameter passed by value")
Addresses-Coverity-ID: 1423941: ("Big parameter passed by value")
Addresses-Coverity-ID: 1451742: ("Big parameter passed by value")
Addresses-Coverity-ID: 1451887: ("Big parameter passed by value")
Addresses-Coverity-ID: 1454146: ("Big parameter passed by value")
Addresses-Coverity-ID: 1454152: ("Big parameter passed by value")
Addresses-Coverity-ID: 1454413: ("Big parameter passed by value")
Addresses-Coverity-ID: 1466144: ("Big parameter passed by value")
Addresses-Coverity-ID: 1487237: ("Big parameter passed by value")

Signed-off-by: Harry Wentland 
Fixes: 3fe617ccafd6 ("Enable '-Werror' by default for all kernel builds")
Cc: Nick Desaulniers 
Cc: Linus Torvalds 
Cc: amd-gfx@lists.freedesktop.org
Cc: Linux Kernel Mailing List 
Cc: Arnd Bergmann 
Cc: Leo Li 
Cc: Alex Deucher 
Cc: Christian König 
Cc: Xinhui Pan 
Cc: Nathan Chancellor 
Cc: Guenter Roeck 
Cc: l...@lists.linux.dev
Acked-by: Christian König 
Build-tested-by: Nathan Chancellor 
Reviewed-by: Leo Li 
Signed-off-by: Alex Deucher 
Signed-off-by: Sasha Levin 
---
 .../drm/amd/display/dc/dcn20/dcn20_resource.c |  2 +-
 .../dc/dml/dcn20/display_rq_dlg_calc_20.c |  6 +-
 .../dc/dml/dcn20/display_rq_dlg_calc_20.h |  4 +-
 .../dc/dml/dcn20/display_rq_dlg_calc_20v2.c   |  6 +-
 .../dc/dml/dcn20/display_rq_dlg_calc_20v2.h   |  4 +-
 .../dc/dml/dcn21/display_rq_dlg_calc_21.c | 62 
 .../dc/dml/dcn21/display_rq_dlg_calc_21.h |  4 +-
 .../dc/dml/dcn30/display_rq_dlg_calc_30.c | 72 +--
 .../dc/dml/dcn30/display_rq_dlg_calc_30.h |  4 +-
 .../dc/dml/dcn31/display_rq_dlg_calc_31.c | 68 +-
 .../dc/dml/dcn31/display_rq_dlg_calc_31.h |  4 +-
 .../drm/amd/display/dc/dml/display_mode_lib.h |  4 +-
 12 files changed, 120 insertions(+), 120 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/dc/dcn20/dcn20_resource.c 
b/drivers/gpu/drm/amd/display/dc/dcn20/dcn20_resource.c
index f2f258e70f9da..34a126816133e 100644
--- a/drivers/gpu/drm/amd/display/dc/dcn20/dcn20_resource.c
+++ b/drivers/gpu/drm/amd/display/dc/dcn20/dcn20_resource.c
@@ -3152,7 +3152,7 @@ void dcn20_calculate_dlg_params(
 

context->bw_ctx.dml.funcs.rq_dlg_get_rq_reg(>bw_ctx.dml,
>res_ctx.pipe_ctx[i].rq_regs,
-   pipes[pipe_idx].pipe);
+   [pipe_idx].pipe);
pipe_idx++;
}
 }
diff --git a/drivers/gpu/drm/amd/display/dc/dml/dcn20/display_rq_dlg_calc_20.c 
b/drivers/gpu/drm/amd/display/dc/dml/dcn20/display_rq_dlg_calc_20.c
index 2091dd8c252da..8c168f348a27f 100644
--- a/drivers/gpu/drm/amd/display/dc/dml/dcn20/display_rq_dlg_calc_20.c
+++ b/drivers/gpu/drm/amd/display/dc/dml/dcn20/display_rq_dlg_calc_20.c
@@ -768,12 +768,12 @@ static void dml20_rq_dlg_get_rq_params(struct 
display_mode_lib *mode_lib,
 
 void dml20_rq_dlg_get_rq_reg(struct display_mode_lib *mode_lib,
display_rq_regs_st *rq_regs,
-   const display_pipe_params_st pipe_param)
+   const display_pipe_params_st *pipe_param)
 {
display_rq_params_st rq_param = {0};
 
memset(rq_regs, 0, sizeof(*rq_regs));
-   dml20_rq_dlg_get_rq_params(mode_lib, _param, pipe_param.src);
+   dml20_rq_dlg_get_rq_params(mode_lib, _param, pipe_param->src);
extract_rq_regs(mode_lib, rq_regs, rq_param);
 
print__rq_regs_st(mode_lib, *rq_regs);
@@ -1549,7 +1549,7 @@ static void dml20_rq_dlg_get_dlg_params(struct 
display_mode_lib *mode_lib,
 void dml20_rq_dlg_get_dlg_reg(struct display_mode_lib *mode_lib,
display_dlg_regs_st *dlg_regs,
display_ttu_regs_st *ttu_regs,
-   display_e2e_pipe_params_st *e2e_pipe_param,
+   const display_e2e_pipe_params_st *e2e_pipe_param,
const unsigned int num_pipes,
const unsigned int pipe_idx,
const bool cstate_en,
diff --git a/drivers/gpu/drm/amd/display/dc/dml/dcn20/display_rq_dlg_calc_20.h 
b/drivers/gpu/drm/amd/display/dc/dml/dcn20/display_rq_dlg_calc_20.h
index d0b90947f5409..8b23867e97c18 100644
--- a/drivers/gpu/drm/amd/display/dc/dml/dcn20/display_rq_dlg_calc_20.h
+++ b/drivers/gpu/drm/amd/display/dc/dml/dcn20/display_rq_dlg_calc_20.h
@@ -43,7 +43,7 @@ struct display_mode_lib;
 void dml20_rq_dlg_get_rq_reg(
struct display_mode_lib *mode_lib,
display_rq_regs_st *rq_regs,
-   const display_pipe_params_st pipe_param);
+ 

Re: [PATCH] drm/amd/amdkfd: Fix kernel panic when reset failed and been triggered again

2021-11-15 Thread Felix Kuehling
Am 2021-11-15 um 11:20 a.m. schrieb shaoyunl:
> In SRIOV configuration, the reset may failed to bring asic back to normal but 
> stop cpsch
> already been called, the start_cpsch will not be called since there is no 
> resume in this
> case.  When reset been triggered again, driver should avoid to do 
> uninitialization again.
>
> Signed-off-by: shaoyunl 

Reviewed-by: Felix Kuehling 


> ---
>  drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 5 +
>  1 file changed, 5 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c 
> b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> index 42b2cc999434..62fe28244a80 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> @@ -1225,6 +1225,11 @@ static int stop_cpsch(struct device_queue_manager *dqm)
>   bool hanging;
>  
>   dqm_lock(dqm);
> + if (!dqm->sched_running) {
> + dqm_unlock(dqm);
> + return 0;
> + }
> +
>   if (!dqm->is_hws_hang)
>   unmap_queues_cpsch(dqm, KFD_UNMAP_QUEUES_FILTER_ALL_QUEUES, 0);
>   hanging = dqm->is_hws_hang || dqm->is_resetting;


[PATCH] drm/amd/amdkfd: Fix kernel panic when reset failed and been triggered again

2021-11-15 Thread shaoyunl
In SRIOV configuration, the reset may failed to bring asic back to normal but 
stop cpsch
already been called, the start_cpsch will not be called since there is no 
resume in this
case.  When reset been triggered again, driver should avoid to do 
uninitialization again.

Signed-off-by: shaoyunl 
---
 drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
index 42b2cc999434..62fe28244a80 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
@@ -1225,6 +1225,11 @@ static int stop_cpsch(struct device_queue_manager *dqm)
bool hanging;
 
dqm_lock(dqm);
+   if (!dqm->sched_running) {
+   dqm_unlock(dqm);
+   return 0;
+   }
+
if (!dqm->is_hws_hang)
unmap_queues_cpsch(dqm, KFD_UNMAP_QUEUES_FILTER_ALL_QUEUES, 0);
hanging = dqm->is_hws_hang || dqm->is_resetting;
-- 
2.17.1



RE: [PATCH] drm/amd/amdkfd: Fix kernel panic when reset failed and been triggered again

2021-11-15 Thread Liu, Shaoyun
[AMD Official Use Only]

Om, sounds reasonable 

Thanks 
Shaoyun.liu 

-Original Message-
From: Kuehling, Felix  
Sent: Monday, November 15, 2021 11:07 AM
To: amd-gfx@lists.freedesktop.org; Liu, Shaoyun 
Subject: Re: [PATCH] drm/amd/amdkfd: Fix kernel panic when reset failed and 
been triggered again

Am 2021-11-14 um 12:55 p.m. schrieb shaoyunl:
> In SRIOV configuration, the reset may failed to bring asic back to 
> normal but stop cpsch already been called, the start_cpsch will not be 
> called since there is no resume in this case.  When reset been triggered 
> again, driver should avoid to do uninitialization again.
>
> Signed-off-by: shaoyunl 

If there is a possibility that stop_cpsch is called multiple times, I think the 
check for that should be at the start of the function.
Something like:

    if (!dqm->sched_running)
        return 0;

Regards,
  Felix


> ---
>  drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 10 ++
>  1 file changed, 6 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c 
> b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> index 42b2cc999434..bcc8980d77e0 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> @@ -1228,12 +1228,14 @@ static int stop_cpsch(struct device_queue_manager 
> *dqm)
>   if (!dqm->is_hws_hang)
>   unmap_queues_cpsch(dqm, KFD_UNMAP_QUEUES_FILTER_ALL_QUEUES, 0);
>   hanging = dqm->is_hws_hang || dqm->is_resetting;
> - dqm->sched_running = false;
>  
> - pm_release_ib(>packet_mgr);
> + if (dqm->sched_running) {
> + dqm->sched_running = false;
> + pm_release_ib(>packet_mgr);
> + kfd_gtt_sa_free(dqm->dev, dqm->fence_mem);
> + pm_uninit(>packet_mgr, hanging);
> + }
>  
> - kfd_gtt_sa_free(dqm->dev, dqm->fence_mem);
> - pm_uninit(>packet_mgr, hanging);
>   dqm_unlock(dqm);
>  
>   return 0;


Re: [PATCH] drm/amd/amdkfd: Fix kernel panic when reset failed and been triggered again

2021-11-15 Thread Felix Kuehling
Am 2021-11-14 um 12:55 p.m. schrieb shaoyunl:
> In SRIOV configuration, the reset may failed to bring asic back to normal but 
> stop cpsch
> already been called, the start_cpsch will not be called since there is no 
> resume in this
> case.  When reset been triggered again, driver should avoid to do 
> uninitialization again.
>
> Signed-off-by: shaoyunl 

If there is a possibility that stop_cpsch is called multiple times, I
think the check for that should be at the start of the function.
Something like:

    if (!dqm->sched_running)
        return 0;

Regards,
  Felix


> ---
>  drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 10 ++
>  1 file changed, 6 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c 
> b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> index 42b2cc999434..bcc8980d77e0 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> @@ -1228,12 +1228,14 @@ static int stop_cpsch(struct device_queue_manager 
> *dqm)
>   if (!dqm->is_hws_hang)
>   unmap_queues_cpsch(dqm, KFD_UNMAP_QUEUES_FILTER_ALL_QUEUES, 0);
>   hanging = dqm->is_hws_hang || dqm->is_resetting;
> - dqm->sched_running = false;
>  
> - pm_release_ib(>packet_mgr);
> + if (dqm->sched_running) {
> + dqm->sched_running = false;
> + pm_release_ib(>packet_mgr);
> + kfd_gtt_sa_free(dqm->dev, dqm->fence_mem);
> + pm_uninit(>packet_mgr, hanging);
> + }
>  
> - kfd_gtt_sa_free(dqm->dev, dqm->fence_mem);
> - pm_uninit(>packet_mgr, hanging);
>   dqm_unlock(dqm);
>  
>   return 0;


Re: [PATCH] drm/amd/amdgpu: fix potential memleak

2021-11-15 Thread Felix Kuehling
Am 2021-11-14 um 9:58 p.m. schrieb Bernard Zhao:
> In function amdgpu_get_xgmi_hive, when kobject_init_and_add failed
> There is a potential memleak if not call kobject_put.
>
> Signed-off-by: Bernard Zhao 

Reviewed-by: Felix Kuehling 


> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> index 0fad2bf854ae..567df2db23ac 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> @@ -386,6 +386,7 @@ struct amdgpu_hive_info *amdgpu_get_xgmi_hive(struct 
> amdgpu_device *adev)
>   "%s", "xgmi_hive_info");
>   if (ret) {
>   dev_err(adev->dev, "XGMI: failed initializing kobject for xgmi 
> hive\n");
> + kobject_put(>kobj);
>   kfree(hive);
>   hive = NULL;
>   goto pro_end;


Re: Backlight control broken on UM325 (OLED) on 5.15 (bisected)

2021-11-15 Thread Alex Deucher
On Mon, Nov 15, 2021 at 2:35 AM Samuel Čavoj  wrote:
>
> Hello,
>
> the backlight control no longer works on my ASUS UM325 (Ryzen 5700U)
> OLED laptop. I have bisected the breakage to commit 7fd13baeb7a3a48.
>
> commit 7fd13baeb7a3a48cae12c36c52f06bf4e9e7d728 (HEAD, refs/bisect/bad)
> Author: Alex Deucher 
> Date:   Thu Jul 8 16:31:10 2021 -0400
>
> drm/amdgpu/display: add support for multiple backlights
>
> On platforms that support multiple backlights, register
> each one separately.  This lets us manage them independently
> rather than registering a single backlight and applying the
> same settings to both.
>
> v2: fix typo:
> Reported-by: kernel test robot 
>
> Reviewed-by: Roman Li 
> Signed-off-by: Alex Deucher 
>
> I have encountered another user with the same issue on reddit[0].
>
> The node in /sys/class/backlight exists, writing to it just does
> nothing. I would be glad to help debugging the issue. Thank you very
> much.

That patch adds support for systems with multiple backlights.  Do you
have multiple backlight devices now?  If so, does the other one work?

Can you also try this patch?

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
index 4811b0faafd9..67163c9d49e6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
@@ -854,8 +854,8 @@ int amdgpu_acpi_init(struct amdgpu_device *adev)
if (amdgpu_device_has_dc_support(adev)) {
 #if defined(CONFIG_DRM_AMD_DC)
struct amdgpu_display_manager *dm = >dm;
-   if (dm->backlight_dev[0])
-   atif->bd = dm->backlight_dev[0];
+   if (dm->backlight_dev[1])
+   atif->bd = dm->backlight_dev[1];
 #endif
} else {
struct drm_encoder *tmp;


Alex

>
> Regards,
> Samuel Čavoj
>
> [0]: 
> https://www.reddit.com/r/AMDLaptops/comments/qst0fm/after_updating_to_linux_515_my_brightness/


Re: [PATCH v1 1/3] string: Consolidate yesno() helpers under string.h hood

2021-11-15 Thread Andy Shevchenko
On Mon, Nov 15, 2021 at 01:43:02PM +0200, Jani Nikula wrote:
> On Mon, 15 Nov 2021, Andy Shevchenko  
> wrote:
> > On Tue, Oct 05, 2021 at 02:34:23PM -0700, Lucas De Marchi wrote:
> >> On Mon, Feb 15, 2021 at 04:21:35PM +0200, Andy Shevchenko wrote:
> >> > We have already few similar implementation and a lot of code that can 
> >> > benefit
> >> > of the yesno() helper.  Consolidate yesno() helpers under string.h hood.
> >> 
> >> I was taking a look on i915_utils.h to reduce it and move some of it
> >> elsewhere to be shared with others.  I was starting with these helpers
> >> and had [1] done, then Jani pointed me to this thread and also his
> >> previous tentative. I thought the natural place for this would be
> >> include/linux/string_helpers.h, but I will leave it up to you.
> >
> > Seems reasonable to use string_helpers (headers and/or C-file).
> >
> >> After reading the threads, I don't see real opposition to it.
> >> Is there a tree you plan to take this through?
> >
> > I rest my series in favour of Jani's approach, so I suppose there is no go
> > for _this_ series.
> 
> If you want to make it happen, please pick it up and drive it. I'm
> thoroughly had enough of this.

My point is still the same, so it's more to Lucas.

I'm not going to drive this activity due to lack of time.

> >> [1] 
> >> https://lore.kernel.org/lkml/20211005212634.3223113-1-lucas.demar...@intel.com/T/#u

-- 
With Best Regards,
Andy Shevchenko




答复: [PATCH] drm/amd/amdgpu: cleanup the code style a bit

2021-11-15 Thread 赵军奎

-邮件原件-
发件人: bern...@vivo.com  代表 Christian K?nig
发送时间: 2021年11月15日 19:49
收件人: 赵军奎 ; Alex Deucher ; 
Christian König ; Pan, Xinhui ; 
David Airlie ; Daniel Vetter ; Jingwen Chen 
; Candice Li ; John Clements 
; Monk liu ; Peng Ju Zhou 
; Jiawei Gu ; Bokun Zhang 
; Zhigang Luo ; Lee Jones 
; amd-gfx@lists.freedesktop.org; 
dri-de...@lists.freedesktop.org; linux-ker...@vger.kernel.org
主题: Re: [PATCH] drm/amd/amdgpu: cleanup the code style a bit

Am 15.11.21 um 08:07 schrieb Bernard Zhao:
> This change is to cleanup the code style a bit.

>To be honest I think the old style looked better. It took me a moment to 
>validate this now.

>What you could to instead is to have goto style error handling which would 
>make this a bit more cleaner I think.
Hi 
Looks like a good idea, thank you for your comments!
I will resubmit a version!
BR//Bernard

>Christian.

>
> Signed-off-by: Bernard Zhao 
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c | 10 ++
>   1 file changed, 6 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> index 04cf9b207e62..90070b41136a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> @@ -286,12 +286,14 @@ static int amdgpu_virt_init_ras_err_handler_data(struct 
> amdgpu_device *adev)
>   return -ENOMEM;
>   
>   bps = kmalloc_array(align_space, sizeof((*data)->bps), GFP_KERNEL);
> + if (!bps) {
> + kfree(*data);
> + return -ENOMEM;
> + }
>   bps_bo = kmalloc_array(align_space, sizeof((*data)->bps_bo), 
> GFP_KERNEL);
> -
> - if (!bps || !bps_bo) {
> - kfree(bps);
> - kfree(bps_bo);
> + if (!bps_bo) {
>   kfree(*data);
> + kfree(bps);
>   return -ENOMEM;
>   }
>   



Re: [PATCH v1 1/3] string: Consolidate yesno() helpers under string.h hood

2021-11-15 Thread Andy Shevchenko
On Tue, Oct 05, 2021 at 02:34:23PM -0700, Lucas De Marchi wrote:
> On Mon, Feb 15, 2021 at 04:21:35PM +0200, Andy Shevchenko wrote:
> > We have already few similar implementation and a lot of code that can 
> > benefit
> > of the yesno() helper.  Consolidate yesno() helpers under string.h hood.
> 
> I was taking a look on i915_utils.h to reduce it and move some of it
> elsewhere to be shared with others.  I was starting with these helpers
> and had [1] done, then Jani pointed me to this thread and also his
> previous tentative. I thought the natural place for this would be
> include/linux/string_helpers.h, but I will leave it up to you.

Seems reasonable to use string_helpers (headers and/or C-file).

> After reading the threads, I don't see real opposition to it.
> Is there a tree you plan to take this through?

I rest my series in favour of Jani's approach, so I suppose there is no go
for _this_ series.

> [1] 
> https://lore.kernel.org/lkml/20211005212634.3223113-1-lucas.demar...@intel.com/T/#u

-- 
With Best Regards,
Andy Shevchenko




[PATCH v2] drm/amd/amdgpu: cleanup the code style a bit

2021-11-15 Thread Bernard Zhao
This change is to cleanup the code style a bit.

Signed-off-by: Bernard Zhao 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c | 21 +
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
index 04cf9b207e62..3fc49823f527 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
@@ -283,17 +283,15 @@ static int amdgpu_virt_init_ras_err_handler_data(struct 
amdgpu_device *adev)
 
*data = kmalloc(sizeof(struct amdgpu_virt_ras_err_handler_data), 
GFP_KERNEL);
if (!*data)
-   return -ENOMEM;
+   goto data_failure;
 
bps = kmalloc_array(align_space, sizeof((*data)->bps), GFP_KERNEL);
-   bps_bo = kmalloc_array(align_space, sizeof((*data)->bps_bo), 
GFP_KERNEL);
+   if (!bps)
+   goto bps_failure;
 
-   if (!bps || !bps_bo) {
-   kfree(bps);
-   kfree(bps_bo);
-   kfree(*data);
-   return -ENOMEM;
-   }
+   bps_bo = kmalloc_array(align_space, sizeof((*data)->bps_bo), 
GFP_KERNEL);
+   if (!bps_bo)
+   goto bps_bo_failure;
 
(*data)->bps = bps;
(*data)->bps_bo = bps_bo;
@@ -303,6 +301,13 @@ static int amdgpu_virt_init_ras_err_handler_data(struct 
amdgpu_device *adev)
virt->ras_init_done = true;
 
return 0;
+
+bps_bo_failure:
+   kfree(bps);
+bps_failure:
+   kfree(*data);
+data_failure:
+   return -ENOMEM;
 }
 
 static void amdgpu_virt_ras_release_bp(struct amdgpu_device *adev)
-- 
2.33.1



Re: [PATCH] drm/amdgpu: always reset the asic in suspend

2021-11-15 Thread Deucher, Alexander
[AMD Official Use Only]

Well, that handles the case of the GPU needing to be reset on driver (e.g., 
virtualization), but doesn't handle the interrupted suspend case (e.g., when 
suspend is unwound before the power rail was turned off).  We already so 
something similar for hibernate to deal with the multiple freeze and thaw 
cycles.

Alex


From: Christian König 
Sent: Monday, November 15, 2021 8:41 AM
To: Alex Deucher ; Deucher, Alexander 

Cc: amd-gfx list 
Subject: Re: [PATCH] drm/amdgpu: always reset the asic in suspend

I was just about to write up my concern as well.

IIRC we used to have that and it didn't really worked that well and we
switched to resetting the GPU on driver load instead if initializing it
doesn't work of hand.

Christian.

Am 12.11.21 um 17:19 schrieb Alex Deucher:
> Actually, ignore this for now.  This will likely cause problems with S0ix.
>
> Alex
>
> On Fri, Nov 12, 2021 at 11:18 AM Alex Deucher  
> wrote:
>> If the platform suspend happens to fail and the power rail
>> is not turned off, the GPU will be in an unknown state on
>> resume, so reset the asic so that it will be in a known
>> good state on resume even if the platform suspend failed.
>>
>> Signed-off-by: Alex Deucher 
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 5 +++--
>>   1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> index 1db76429a673..42af3d88e0ba 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> @@ -2165,8 +2165,9 @@ static int amdgpu_pmops_suspend(struct device *dev)
>>  adev->in_s3 = true;
>>  r = amdgpu_device_suspend(drm_dev, true);
>>  adev->in_s3 = false;
>> -
>> -   return r;
>> +   if (r)
>> +   return r;
>> +   return amdgpu_asic_reset(adev);
>>   }
>>
>>   static int amdgpu_pmops_resume(struct device *dev)
>> --
>> 2.31.1
>>



Re: [PATCH] drm/amdgpu: always reset the asic in suspend (v2)

2021-11-15 Thread Alex Deucher
On Mon, Nov 15, 2021 at 2:50 AM Lazar, Lijo  wrote:
>
>
>
> On 11/12/2021 9:55 PM, Alex Deucher wrote:
> > If the platform suspend happens to fail and the power rail
> > is not turned off, the GPU will be in an unknown state on
> > resume, so reset the asic so that it will be in a known
> > good state on resume even if the platform suspend failed.
> >
>
> Any more background info on the issue? Is there a need to trigger BACO
> or D3cold entry similar to how it's done for runtime suspend?

Basically something like the following, user requests S3, drivers
start to do their suspend thing, but then something interrupts it
(e.g., user plugs/unplugs a usb device or S3 gets interrupted for
something).  At that point, the power rail has not been turned off.
The kernel then starts calling the resume functions for each device
because the suspend was aborted.  However, since the power rail was
not turned off, the GPU is still initialized so the driver can't
properly re-init it without a reset.

Alex


>
> Thanks,
> Lijo
>
> > v2: handle s0ix
> >
> > Signed-off-by: Alex Deucher 
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 5 -
> >   1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > index 1db76429a673..b4591f6e82dd 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > @@ -2165,7 +2165,10 @@ static int amdgpu_pmops_suspend(struct device *dev)
> >   adev->in_s3 = true;
> >   r = amdgpu_device_suspend(drm_dev, true);
> >   adev->in_s3 = false;
> > -
> > + if (r)
> > + return r;
> > + if (!adev->in_s0ix)
> > + r = amdgpu_asic_reset(adev);
> >   return r;
> >   }
> >
> >


Re: [PATCH] drm/amd/pm: Remove artificial freq level on Navi1x

2021-11-15 Thread Deucher, Alexander
[AMD Official Use Only]

Acked-by: Alex Deucher 

From: Lazar, Lijo 
Sent: Monday, November 15, 2021 2:42 AM
To: amd-gfx@lists.freedesktop.org 
Cc: Deucher, Alexander ; Zhang, Hawking 
; Wang, Yang(Kevin) ; Quan, Evan 

Subject: [PATCH] drm/amd/pm: Remove artificial freq level on Navi1x

Print Navi1x fine grained clocks in a consistent manner with other SOCs.
Don't show aritificial DPM level when the current clock equals min or max.

Signed-off-by: Lijo Lazar 
---
 drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c
index 71161f6b78fe..60a557068ea4 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c
@@ -1265,7 +1265,7 @@ static int navi10_print_clk_levels(struct smu_context 
*smu,
 enum smu_clk_type clk_type, char *buf)
 {
 uint16_t *curve_settings;
-   int i, size = 0, ret = 0;
+   int i, levels, size = 0, ret = 0;
 uint32_t cur_value = 0, value = 0, count = 0;
 uint32_t freq_values[3] = {0};
 uint32_t mark_index = 0;
@@ -1319,14 +1319,17 @@ static int navi10_print_clk_levels(struct smu_context 
*smu,
 freq_values[1] = cur_value;
 mark_index = cur_value == freq_values[0] ? 0 :
  cur_value == freq_values[2] ? 2 : 1;
-   if (mark_index != 1)
-   freq_values[1] = (freq_values[0] + 
freq_values[2]) / 2;

-   for (i = 0; i < 3; i++) {
+   levels = 3;
+   if (mark_index != 1) {
+   levels = 2;
+   freq_values[1] = freq_values[2];
+   }
+
+   for (i = 0; i < levels; i++) {
 size += sysfs_emit_at(buf, size, "%d: %uMhz 
%s\n", i, freq_values[i],
 i == mark_index ? "*" : "");
 }
-
 }
 break;
 case SMU_PCIE:
--
2.17.1



RE: [PATCH 00/14] DC Patches November 12, 2021

2021-11-15 Thread Wheeler, Daniel
[Public]

Hi all,
 
This week this patchset was tested on the following systems:
 
HP Envy 360, with Ryzen 5 4500U, with the following display types: eDP 1080p 
60hz, 4k 60hz  (via USB-C to DP/HDMI), 1440p 144hz (via USB-C to DP/HDMI), 
1680*1050 60hz (via USB-C to DP and then DP to DVI/VGA)
 
AMD Ryzen 9 5900H, with the following display types: eDP 1080p 60hz, 4k 60hz  
(via USB-C to DP/HDMI), 1440p 144hz (via USB-C to DP/HDMI), 1680*1050 60hz (via 
USB-C to DP and then DP to DVI/VGA)
 
Sapphire Pulse RX5700XT with the following display types:
4k 60hz  (via DP/HDMI), 1440p 144hz (via DP/HDMI), 1680*1050 60hz (via DP to 
DVI/VGA)
 
Reference AMD RX6800 with the following display types:
4k 60hz  (via DP/HDMI and USB-C to DP/HDMI), 1440p 144hz (via USB-C to DP/HDMI 
and USB-C to DP/HDMI), 1680*1050 60hz (via DP to DVI/VGA)
 
Included testing using a Startech DP 1.4 MST hub at 2x 4k 60hz, and 3x 1080p 
60hz on all systems. Also tested DSC via USB-C to DP DSC Hub with 3x 4k 60hz on 
Ryzen 9 5900h and Ryzen 5 4500u.
 
Tested on Ubuntu 20.04.3 with Kernel Version 5.13 and ChromeOS
 
Tested-by: Daniel Wheeler 
 
 
Thank you,
 
Dan Wheeler
Technologist  |  AMD
SW Display
--
1 Commerce Valley Dr E, Thornhill, ON L3T 7X6
Facebook |  Twitter |  amd.com  

-Original Message-
From: amd-gfx  On Behalf Of Wayne Lin
Sent: November 11, 2021 7:54 PM
To: amd-gfx@lists.freedesktop.org
Cc: Wang, Chao-kai (Stylon) ; Li, Sun peng (Leo) 
; Wentland, Harry ; Zhuo, Qingqing 
; Siqueira, Rodrigo ; Li, 
Roman ; Chiu, Solomon ; Pillai, 
Aurabindo ; Lin, Wayne ; Lipski, 
Mikita ; Lakha, Bhawanpreet ; 
Gutierrez, Agustin ; Kotarac, Pavle 

Subject: [PATCH 00/14] DC Patches November 12, 2021

This DC patchset brings improvements in multiple areas. In summary, we 
highlight:

- Fix issue that secondary display goes blank on Non DCN31.
- Adjust flushing data in DMCUB
- Revert patches which cause regression in hadnling MPO/Link encoder assignment
- Correct the setting within MSA of DP2.0
- Adjustment for DML isolation
- Fix FIFO erro in fast boot sequence
- Enable DSC over eDP
- Adjust the DSC power off sequence

---

Ahmad Othman (1):
  drm/amd/display: Secondary display goes blank on Non DCN31

Angus Wang (1):
  drm/amd/display: Revert changes for MPO underflow

Anthony Koo (2):
  drm/amd/display: [FW Promotion] Release 0.0.92
  drm/amd/display: [FW Promotion] Release 0.0.93

Aric Cyr (1):
  drm/amd/display: 3.2.162

Brandon Syu (1):
  drm/amd/display: Fix eDP will flash when boot to OS

Jun Lei (1):
  drm/amd/display: Code change for DML isolation

Mikita Lipski (1):
  drm/amd/display: Enable DSC over eDP

Nicholas Kazlauskas (1):
  drm/amd/display: Only flush delta from last command execution

Sung Joon Kim (1):
  drm/amd/display: Revert "retain/release stream pointer in link enc
table"

Wenjing Liu (1):
  drm/amd/display: set MSA vsp/hsp to 0 for positive polarity for DP
128b/132b

Xu, Jinze (1):
  drm/amd/display: Reset fifo after enable otg

Yi-Ling Chen (1):
  drm/amd/display: fixed the DSC power off sequence during Driver PnP

hvanzyll (1):
  drm/amd/display: Visual Confirm Bar Height Adjust

 .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c |   73 +-
 .../amd/display/amdgpu_dm/amdgpu_dm_helpers.c |2 +-
 drivers/gpu/drm/amd/display/dc/core/dc_link.c |5 +-
 .../gpu/drm/amd/display/dc/core/dc_link_dp.c  |  167 +-
 .../drm/amd/display/dc/core/dc_link_enc_cfg.c |2 -
 drivers/gpu/drm/amd/display/dc/dc.h   |7 +-
 drivers/gpu/drm/amd/display/dc/dce/dmub_psr.c |1 +
 .../display/dc/dce110/dce110_hw_sequencer.c   |7 +-
 .../drm/amd/display/dc/dcn10/dcn10_dpp_dscl.c |   14 +-
 .../amd/display/dc/dcn10/dcn10_hw_sequencer.c |   37 +
 .../display/dc/dcn10/dcn10_stream_encoder.c   |   15 +
 .../display/dc/dcn10/dcn10_stream_encoder.h   |3 +
 .../gpu/drm/amd/display/dc/dcn20/dcn20_dsc.c  |2 +
 .../gpu/drm/amd/display/dc/dcn20/dcn20_optc.c |   14 +
 .../gpu/drm/amd/display/dc/dcn20/dcn20_optc.h |3 +
 .../drm/amd/display/dc/dcn20/dcn20_resource.c |2 +-
 .../display/dc/dcn20/dcn20_stream_encoder.c   |2 +
 .../dc/dcn30/dcn30_dio_stream_encoder.c   |2 +
 .../gpu/drm/amd/display/dc/dcn30/dcn30_optc.c |1 +
 .../drm/amd/display/dc/dcn30/dcn30_resource.c |2 +-
 .../amd/display/dc/dcn302/dcn302_resource.c   |2 +-
 .../amd/display/dc/dcn303/dcn303_resource.c   |2 +-
 .../dc/dcn31/dcn31_hpo_dp_stream_encoder.c|4 +-
 .../drm/amd/display/dc/dcn31/dcn31_hwseq.c|5 -
 .../gpu/drm/amd/display/dc/dcn31/dcn31_optc.c |1 +
 .../drm/amd/display/dc/dcn31/dcn31_resource.c |1 +
 .../drm/amd/display/dc/dml/display_mode_lib.h |1 +
 .../gpu/drm/amd/display/dc/dml/dml_wrapper.c  | 1889 +  
.../display/dc/dml/dml_wrapper_translation.c  |  284 +++
 drivers/gpu/drm/amd/display/dc/dsc/dc_dsc.c   |8 +
 

Re: [PATCH 1/2] drm/amdgpu: IH process reset count when restart

2021-11-15 Thread Christian König

Am 13.11.21 um 01:05 schrieb Philip Yang:

Otherwise when IH process restart, count is zero, the loop will
not exit to wake_up_all after processing AMDGPU_IH_MAX_NUM_IVS
interrupts.

Signed-off-by: Philip Yang 


Reviewed-by: Christian König 

Maybe even CC: stable?

Regards,
Christian.


---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
index f3d62e196901..0c7963dfacad 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
@@ -223,7 +223,7 @@ int amdgpu_ih_wait_on_checkpoint_process(struct 
amdgpu_device *adev,
   */
  int amdgpu_ih_process(struct amdgpu_device *adev, struct amdgpu_ih_ring *ih)
  {
-   unsigned int count = AMDGPU_IH_MAX_NUM_IVS;
+   unsigned int count;
u32 wptr;
  
  	if (!ih->enabled || adev->shutdown)

@@ -232,6 +232,7 @@ int amdgpu_ih_process(struct amdgpu_device *adev, struct 
amdgpu_ih_ring *ih)
wptr = amdgpu_ih_get_wptr(adev, ih);
  
  restart_ih:

+   count  = AMDGPU_IH_MAX_NUM_IVS;
DRM_DEBUG("%s: rptr %d, wptr %d\n", __func__, ih->rptr, wptr);
  
  	/* Order reading of wptr vs. reading of IH ring data */




Re: [PATCH] drm/amdgpu: always reset the asic in suspend

2021-11-15 Thread Christian König

I was just about to write up my concern as well.

IIRC we used to have that and it didn't really worked that well and we 
switched to resetting the GPU on driver load instead if initializing it 
doesn't work of hand.


Christian.

Am 12.11.21 um 17:19 schrieb Alex Deucher:

Actually, ignore this for now.  This will likely cause problems with S0ix.

Alex

On Fri, Nov 12, 2021 at 11:18 AM Alex Deucher  wrote:

If the platform suspend happens to fail and the power rail
is not turned off, the GPU will be in an unknown state on
resume, so reset the asic so that it will be in a known
good state on resume even if the platform suspend failed.

Signed-off-by: Alex Deucher 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 5 +++--
  1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 1db76429a673..42af3d88e0ba 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -2165,8 +2165,9 @@ static int amdgpu_pmops_suspend(struct device *dev)
 adev->in_s3 = true;
 r = amdgpu_device_suspend(drm_dev, true);
 adev->in_s3 = false;
-
-   return r;
+   if (r)
+   return r;
+   return amdgpu_asic_reset(adev);
  }

  static int amdgpu_pmops_resume(struct device *dev)
--
2.31.1





Re: Questions about KMS flip

2021-11-15 Thread Michel Dänzer
On 2021-11-15 12:31, Lang Yu wrote:
> On Mon, Nov 15, 2021 at 10:49:39AM +0100, Michel DDDnzer wrote:
>> On 2021-11-15 10:04, Lang Yu wrote:
>>> On Mon, Nov 15, 2021 at 09:38:47AM +0100, Michel DDDnzer wrote:
 On 2021-11-15 07:41, Lang Yu wrote:
> On Fri, Nov 12, 2021 at 05:10:27PM +0100, Michel DDDnzer wrote:
>> On 2021-11-12 16:03, Christian König wrote:
>>> Am 12.11.21 um 15:30 schrieb Michel Dänzer:
 On 2021-11-12 15:29, Michel Dänzer wrote:
> On 2021-11-12 13:47, Christian König wrote:
>> Anyway this unfortunately turned out to be work for Harray and 
>> Nicholas. In detail it's about this bug report here: 
>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbugzilla.kernel.org%2Fshow_bug.cgi%3Fid%3D214621data=04%7C01%7CLang.Yu%40amd.com%7Cee54c4d055d040ef9f8b08d9a81d3eb9%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637725665833112900%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000sdata=7nwIYd1um420XHVpOzeIvz37%2FLQqHF%2F6aRKfzgxUTnM%3Dreserved=0
>>
>> Lang was able to reproduce the issue and narrow it down to the pin 
>> in amdgpu_display_crtc_page_flip_target().
>>
>> In other words we somehow have an unbalanced pinning of the scanout 
>> buffer in DC.
> DC doesn't use amdgpu_display_crtc_page_flip_target AFAICT. The 
> corresponding pin with DC would be in dm_plane_helper_prepare_fb, 
> paired with the unpin in
> dm_plane_helper_cleanup_fb.
>
>
> With non-DC, the pin in amdgpu_display_crtc_page_flip_target is 
> paired with the unpin in dm_plane_helper_cleanup_fb
 This should say amdgpu_display_unpin_work_func.
>>>
>>> Ah! So that is the classic (e.g. non atomic) path?
>>
>> Presumably.
>>
>>
> & dce_v*_crtc_disable. One thing I notice is that the pin is guarded 
> by if (!adev->enable_virtual_display), but the unpins seem 
> unconditional. So could this be about virtual display, and the 
> problem is actually trying to unpin a BO that was never pinned?
>>>
>>> Nope, my educated guess is rather that we free up the BO before 
>>> amdgpu_display_unpin_work_func is called.
>>>
>>> E.g. not pin unbalance, but rather use after free.
>>
>> amdgpu_display_crtc_page_flip_target calls amdgpu_bo_ref(work->old_abo), 
>> and amdgpu_display_unpin_work_func calls amdgpu_bo_unref(>old_abo) 
>> only after amdgpu_bo_unpin. So what you describe could only happen if 
>> there's an imbalance elsewhere such that amdgpu_bo_unref is called more 
>> often than amdgpu_bo_ref, or maybe if amdgpu_bo_reserve fails in 
>> amdgpu_display_unpin_work_func (in which case the "failed to reserve 
>> buffer after flip" error message should appear in dmesg).
>
>
> Actually, each call to amdgpu_display_crtc_page_flip_target() will
>
> 1, init a work(amdgpu_display_unpin_work_func) to unpin an old buffer
>(crtc->primary->fb), the work will be queued in 
> dce_vX_0_pageflip_irq().
>
> 2, pin a new buffer, assign it to crtc->primary->fb. But how to unpin it?
>Next call.
>
> The problem is the pinned buffer of last call to 
> amdgpu_display_crtc_page_flip_target() is not unpinned.

 It's unpinned in dce_v*_0_crtc_disable.
>>>
>>> I just found crtc->primary->fb is NULL when came in dce_v*_0_crtc_disable().
>>> So it's not unpinned...
>>
>> __drm_helper_disable_unused_functions sets crtc->primary->fb = NULL only 
>> after calling crtc_funcs->disable. Maybe this path can get hit for a CRTC 
>> which was already disabled, in which case crtc->primary->fb == NULL in 
>> dce_v*_0_crtc_disable is harmless.
>>
>> Have you checked for the issue I described below? Should be pretty easy to 
>> catch.
>>
>>
 I think I've found the problem though: dce_v*_0_crtc_do_set_base pin the 
 BO from target_fb unconditionally, but unpin the BO from the fb parameter 
 only if it's different from the former. So if they're the same, the BO's 
 pin count is incremented by 1.
> 
> Form my observations, amdgpu_bo_unpin() in dce_v*_0_crtc_disable() is
> never called.

It would be expected to happen when the screen turns off, e.g. due to DPMS.


> Though a single call to dce_v*_0_crtc_do_set_base() will 
> only pin the BO, I found it will be unpinned in next call to 
> dce_v*_0_crtc_do_set_base().

Yeah, that's the normal case when the new BO is different from the old one.

To catch the case I described, try something like

diff --git a/drivers/gpu/drm/amd/amdgpu/dce_v11_0.c 
b/drivers/gpu/drm/amd/amdgpu/dce_v11_0.c

index 18a7b3bd633b..5726bd87a355 100644

--- a/drivers/gpu/drm/amd/amdgpu/dce_v11_0.c

+++ b/drivers/gpu/drm/amd/amdgpu/dce_v11_0.c

@@ -1926,6 +1926,7 @@ static int dce_v11_0_crtc_do_set_base(struct drm_crtc 
*crtc,

 

Re: [PATCH] drm/amd/amdgpu: cleanup the code style a bit

2021-11-15 Thread Christian König

Am 15.11.21 um 08:07 schrieb Bernard Zhao:

This change is to cleanup the code style a bit.


To be honest I think the old style looked better. It took me a moment to 
validate this now.


What you could to instead is to have goto style error handling which 
would make this a bit more cleaner I think.


Christian.



Signed-off-by: Bernard Zhao 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c | 10 ++
  1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
index 04cf9b207e62..90070b41136a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
@@ -286,12 +286,14 @@ static int amdgpu_virt_init_ras_err_handler_data(struct 
amdgpu_device *adev)
return -ENOMEM;
  
  	bps = kmalloc_array(align_space, sizeof((*data)->bps), GFP_KERNEL);

+   if (!bps) {
+   kfree(*data);
+   return -ENOMEM;
+   }
bps_bo = kmalloc_array(align_space, sizeof((*data)->bps_bo), 
GFP_KERNEL);
-
-   if (!bps || !bps_bo) {
-   kfree(bps);
-   kfree(bps_bo);
+   if (!bps_bo) {
kfree(*data);
+   kfree(bps);
return -ENOMEM;
}
  




Re: [PATCH v1 1/3] string: Consolidate yesno() helpers under string.h hood

2021-11-15 Thread Jani Nikula
On Mon, 15 Nov 2021, Andy Shevchenko  wrote:
> On Tue, Oct 05, 2021 at 02:34:23PM -0700, Lucas De Marchi wrote:
>> On Mon, Feb 15, 2021 at 04:21:35PM +0200, Andy Shevchenko wrote:
>> > We have already few similar implementation and a lot of code that can 
>> > benefit
>> > of the yesno() helper.  Consolidate yesno() helpers under string.h hood.
>> 
>> I was taking a look on i915_utils.h to reduce it and move some of it
>> elsewhere to be shared with others.  I was starting with these helpers
>> and had [1] done, then Jani pointed me to this thread and also his
>> previous tentative. I thought the natural place for this would be
>> include/linux/string_helpers.h, but I will leave it up to you.
>
> Seems reasonable to use string_helpers (headers and/or C-file).
>
>> After reading the threads, I don't see real opposition to it.
>> Is there a tree you plan to take this through?
>
> I rest my series in favour of Jani's approach, so I suppose there is no go
> for _this_ series.

If you want to make it happen, please pick it up and drive it. I'm
thoroughly had enough of this.

BR,
Jani.


>
>> [1] 
>> https://lore.kernel.org/lkml/20211005212634.3223113-1-lucas.demar...@intel.com/T/#u

-- 
Jani Nikula, Intel Open Source Graphics Center


[PATCH] drm/amd/amdgpu: remove useless break after return

2021-11-15 Thread Bernard Zhao
This change is to remove useless break after return.

Signed-off-by: Bernard Zhao 
---
 drivers/gpu/drm/amd/amdgpu/dce_v8_0.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/dce_v8_0.c 
b/drivers/gpu/drm/amd/amdgpu/dce_v8_0.c
index b200b9e722d9..8318ee8339f1 100644
--- a/drivers/gpu/drm/amd/amdgpu/dce_v8_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/dce_v8_0.c
@@ -2092,22 +2092,18 @@ static int dce_v8_0_pick_dig_encoder(struct drm_encoder 
*encoder)
return 1;
else
return 0;
-   break;
case ENCODER_OBJECT_ID_INTERNAL_UNIPHY1:
if (dig->linkb)
return 3;
else
return 2;
-   break;
case ENCODER_OBJECT_ID_INTERNAL_UNIPHY2:
if (dig->linkb)
return 5;
else
return 4;
-   break;
case ENCODER_OBJECT_ID_INTERNAL_UNIPHY3:
return 6;
-   break;
default:
DRM_ERROR("invalid encoder_id: 0x%x\n", 
amdgpu_encoder->encoder_id);
return 0;
-- 
2.33.1



[PATCH] drm/amd/amdgpu: cleanup the code style a bit

2021-11-15 Thread Bernard Zhao
This change is to cleanup the code style a bit.

Signed-off-by: Bernard Zhao 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
index 04cf9b207e62..90070b41136a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
@@ -286,12 +286,14 @@ static int amdgpu_virt_init_ras_err_handler_data(struct 
amdgpu_device *adev)
return -ENOMEM;
 
bps = kmalloc_array(align_space, sizeof((*data)->bps), GFP_KERNEL);
+   if (!bps) {
+   kfree(*data);
+   return -ENOMEM;
+   }
bps_bo = kmalloc_array(align_space, sizeof((*data)->bps_bo), 
GFP_KERNEL);
-
-   if (!bps || !bps_bo) {
-   kfree(bps);
-   kfree(bps_bo);
+   if (!bps_bo) {
kfree(*data);
+   kfree(bps);
return -ENOMEM;
}
 
-- 
2.33.1



[PATCH] drm/amd/amdgpu: fix potential memleak

2021-11-15 Thread Bernard Zhao
In function amdgpu_get_xgmi_hive, when kobject_init_and_add failed
There is a potential memleak if not call kobject_put.

Signed-off-by: Bernard Zhao 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
index 0fad2bf854ae..567df2db23ac 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
@@ -386,6 +386,7 @@ struct amdgpu_hive_info *amdgpu_get_xgmi_hive(struct 
amdgpu_device *adev)
"%s", "xgmi_hive_info");
if (ret) {
dev_err(adev->dev, "XGMI: failed initializing kobject for xgmi 
hive\n");
+   kobject_put(>kobj);
kfree(hive);
hive = NULL;
goto pro_end;
-- 
2.33.1



[PATCH -next] drm/amd/display: check top_pipe_to_program pointer

2021-11-15 Thread Yang Li
Clang static analysis reports this error

drivers/gpu/drm/amd/amdgpu/../display/dc/core/dc.c:2870:7: warning:
Dereference of null pointer [clang-analyzer-core.NullDereference]
if
(top_pipe_to_program->stream_res.tg->funcs->lock_doublebuffer_enable) {
^

top_pipe_to_program being NULL is caught as an error
But then it is used to report the error.

So add a check before using it.

Reported-by: Abaci Robot 
Signed-off-by: Yang Li 
---
 drivers/gpu/drm/amd/display/dc/core/dc.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/display/dc/core/dc.c 
b/drivers/gpu/drm/amd/display/dc/core/dc.c
index 39ad385..34382d0 100644
--- a/drivers/gpu/drm/amd/display/dc/core/dc.c
+++ b/drivers/gpu/drm/amd/display/dc/core/dc.c
@@ -2867,7 +2867,8 @@ static void commit_planes_for_stream(struct dc *dc,
 #endif
 
if ((update_type != UPDATE_TYPE_FAST) && 
stream->update_flags.bits.dsc_changed)
-   if 
(top_pipe_to_program->stream_res.tg->funcs->lock_doublebuffer_enable) {
+   if (top_pipe_to_program &&
+   
top_pipe_to_program->stream_res.tg->funcs->lock_doublebuffer_enable) {
if (should_use_dmub_lock(stream->link)) {
union dmub_hw_lock_flags hw_locks = { 0 };
struct dmub_hw_lock_inst_flags inst_flags = { 0 
};
-- 
1.8.3.1



Backlight control broken on UM325 (OLED) on 5.15 (bisected)

2021-11-15 Thread Samuel Čavoj
Hello,

the backlight control no longer works on my ASUS UM325 (Ryzen 5700U)
OLED laptop. I have bisected the breakage to commit 7fd13baeb7a3a48.

commit 7fd13baeb7a3a48cae12c36c52f06bf4e9e7d728 (HEAD, refs/bisect/bad)
Author: Alex Deucher 
Date:   Thu Jul 8 16:31:10 2021 -0400
 
drm/amdgpu/display: add support for multiple backlights

On platforms that support multiple backlights, register
each one separately.  This lets us manage them independently
rather than registering a single backlight and applying the
same settings to both.

v2: fix typo:
Reported-by: kernel test robot 

Reviewed-by: Roman Li 
Signed-off-by: Alex Deucher 

I have encountered another user with the same issue on reddit[0].

The node in /sys/class/backlight exists, writing to it just does
nothing. I would be glad to help debugging the issue. Thank you very
much.

Regards,
Samuel Čavoj

[0]: 
https://www.reddit.com/r/AMDLaptops/comments/qst0fm/after_updating_to_linux_515_my_brightness/


Re: [PATCH] drm/amd/amdgpu: remove useless break after return

2021-11-15 Thread Joe Perches
On Sun, 2021-11-14 at 23:14 -0800, Bernard Zhao wrote:
> This change is to remove useless break after return.
[]
> diff --git a/drivers/gpu/drm/amd/amdgpu/dce_v8_0.c 
> b/drivers/gpu/drm/amd/amdgpu/dce_v8_0.c
[]
> @@ -2092,22 +2092,18 @@ static int dce_v8_0_pick_dig_encoder(struct 
> drm_encoder *encoder)
>   return 1;
>   else
>   return 0;
> - break;
>   case ENCODER_OBJECT_ID_INTERNAL_UNIPHY1:
>   if (dig->linkb)
>   return 3;
>   else
>   return 2;
> - break;
>   case ENCODER_OBJECT_ID_INTERNAL_UNIPHY2:
>   if (dig->linkb)
>   return 5;
>   else
>   return 4;
> - break;
>   case ENCODER_OBJECT_ID_INTERNAL_UNIPHY3:
>   return 6;
> - break;
>   default:
>   DRM_ERROR("invalid encoder_id: 0x%x\n", 
> amdgpu_encoder->encoder_id);
>   return 0;

Perhaps rewrite to make the sequential numbering more obvious.
---
 drivers/gpu/drm/amd/amdgpu/dce_v8_0.c | 19 +++
 1 file changed, 3 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/dce_v8_0.c 
b/drivers/gpu/drm/amd/amdgpu/dce_v8_0.c
index b200b9e722d97..7307524b706b4 100644
--- a/drivers/gpu/drm/amd/amdgpu/dce_v8_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/dce_v8_0.c
@@ -2088,26 +2088,13 @@ static int dce_v8_0_pick_dig_encoder(struct drm_encoder 
*encoder)
 
switch (amdgpu_encoder->encoder_id) {
case ENCODER_OBJECT_ID_INTERNAL_UNIPHY:
-   if (dig->linkb)
-   return 1;
-   else
-   return 0;
-   break;
+   return !dig->linkb ? 0 : 1;
case ENCODER_OBJECT_ID_INTERNAL_UNIPHY1:
-   if (dig->linkb)
-   return 3;
-   else
-   return 2;
-   break;
+   return !dig->linkb ? 2 : 3;
case ENCODER_OBJECT_ID_INTERNAL_UNIPHY2:
-   if (dig->linkb)
-   return 5;
-   else
-   return 4;
-   break;
+   return !dig->linkb ? 4 : 5;
case ENCODER_OBJECT_ID_INTERNAL_UNIPHY3:
return 6;
-   break;
default:
DRM_ERROR("invalid encoder_id: 0x%x\n", 
amdgpu_encoder->encoder_id);
return 0;




Re: Questions about KMS flip

2021-11-15 Thread Michel Dänzer
On 2021-11-15 10:04, Lang Yu wrote:
> On Mon, Nov 15, 2021 at 09:38:47AM +0100, Michel DDDnzer wrote:
>> On 2021-11-15 07:41, Lang Yu wrote:
>>> On Fri, Nov 12, 2021 at 05:10:27PM +0100, Michel DDDnzer wrote:
 On 2021-11-12 16:03, Christian König wrote:
> Am 12.11.21 um 15:30 schrieb Michel Dänzer:
>> On 2021-11-12 15:29, Michel Dänzer wrote:
>>> On 2021-11-12 13:47, Christian König wrote:
 Anyway this unfortunately turned out to be work for Harray and 
 Nicholas. In detail it's about this bug report here: 
 https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbugzilla.kernel.org%2Fshow_bug.cgi%3Fid%3D214621data=04%7C01%7CLang.Yu%40amd.com%7C766e41a1c3052b6b08d9a81358b0%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637725623316543180%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000sdata=od%2BNksWOff%2FtsuAYZLX7lIJFQJMY2OScVqhLclPYWAQ%3Dreserved=0

 Lang was able to reproduce the issue and narrow it down to the pin in 
 amdgpu_display_crtc_page_flip_target().

 In other words we somehow have an unbalanced pinning of the scanout 
 buffer in DC.
>>> DC doesn't use amdgpu_display_crtc_page_flip_target AFAICT. The 
>>> corresponding pin with DC would be in dm_plane_helper_prepare_fb, 
>>> paired with the unpin in
>>> dm_plane_helper_cleanup_fb.
>>>
>>>
>>> With non-DC, the pin in amdgpu_display_crtc_page_flip_target is paired 
>>> with the unpin in dm_plane_helper_cleanup_fb
>> This should say amdgpu_display_unpin_work_func.
>
> Ah! So that is the classic (e.g. non atomic) path?

 Presumably.


>>> & dce_v*_crtc_disable. One thing I notice is that the pin is guarded by 
>>> if (!adev->enable_virtual_display), but the unpins seem unconditional. 
>>> So could this be about virtual display, and the problem is actually 
>>> trying to unpin a BO that was never pinned?
>
> Nope, my educated guess is rather that we free up the BO before 
> amdgpu_display_unpin_work_func is called.
>
> E.g. not pin unbalance, but rather use after free.

 amdgpu_display_crtc_page_flip_target calls amdgpu_bo_ref(work->old_abo), 
 and amdgpu_display_unpin_work_func calls amdgpu_bo_unref(>old_abo) 
 only after amdgpu_bo_unpin. So what you describe could only happen if 
 there's an imbalance elsewhere such that amdgpu_bo_unref is called more 
 often than amdgpu_bo_ref, or maybe if amdgpu_bo_reserve fails in 
 amdgpu_display_unpin_work_func (in which case the "failed to reserve 
 buffer after flip" error message should appear in dmesg).
>>>
>>>
>>> Actually, each call to amdgpu_display_crtc_page_flip_target() will
>>>
>>> 1, init a work(amdgpu_display_unpin_work_func) to unpin an old buffer
>>>(crtc->primary->fb), the work will be queued in dce_vX_0_pageflip_irq().
>>>
>>> 2, pin a new buffer, assign it to crtc->primary->fb. But how to unpin it?
>>>Next call.
>>>
>>> The problem is the pinned buffer of last call to 
>>> amdgpu_display_crtc_page_flip_target() is not unpinned.
>>
>> It's unpinned in dce_v*_0_crtc_disable.
> 
> I just found crtc->primary->fb is NULL when came in dce_v*_0_crtc_disable().
> So it's not unpinned...

__drm_helper_disable_unused_functions sets crtc->primary->fb = NULL only after 
calling crtc_funcs->disable. Maybe this path can get hit for a CRTC which was 
already disabled, in which case crtc->primary->fb == NULL in 
dce_v*_0_crtc_disable is harmless.

Have you checked for the issue I described below? Should be pretty easy to 
catch.


>> I think I've found the problem though: dce_v*_0_crtc_do_set_base pin the BO 
>> from target_fb unconditionally, but unpin the BO from the fb parameter only 
>> if it's different from the former. So if they're the same, the BO's pin 
>> count is incremented by 1.


-- 
Earthling Michel Dänzer|  https://redhat.com
Libre software enthusiast  | Mesa and Xwayland developer


Re: Questions about KMS flip

2021-11-15 Thread Michel Dänzer
On 2021-11-15 07:41, Lang Yu wrote:
> On Fri, Nov 12, 2021 at 05:10:27PM +0100, Michel DDDnzer wrote:
>> On 2021-11-12 16:03, Christian König wrote:
>>> Am 12.11.21 um 15:30 schrieb Michel Dänzer:
 On 2021-11-12 15:29, Michel Dänzer wrote:
> On 2021-11-12 13:47, Christian König wrote:
>> Anyway this unfortunately turned out to be work for Harray and Nicholas. 
>> In detail it's about this bug report here: 
>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbugzilla.kernel.org%2Fshow_bug.cgi%3Fid%3D214621data=04%7C01%7CLang.Yu%40amd.com%7Cca557eab16864ab544a108d9a5f6f288%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637723302338811983%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000sdata=9oG6k22BVp%2BkSvRX6uMlGXc6G%2BA5UL0Qy3W5iDTpvzs%3Dreserved=0
>>
>> Lang was able to reproduce the issue and narrow it down to the pin in 
>> amdgpu_display_crtc_page_flip_target().
>>
>> In other words we somehow have an unbalanced pinning of the scanout 
>> buffer in DC.
> DC doesn't use amdgpu_display_crtc_page_flip_target AFAICT. The 
> corresponding pin with DC would be in dm_plane_helper_prepare_fb, paired 
> with the unpin in
> dm_plane_helper_cleanup_fb.
>
>
> With non-DC, the pin in amdgpu_display_crtc_page_flip_target is paired 
> with the unpin in dm_plane_helper_cleanup_fb
 This should say amdgpu_display_unpin_work_func.
>>>
>>> Ah! So that is the classic (e.g. non atomic) path?
>>
>> Presumably.
>>
>>
> & dce_v*_crtc_disable. One thing I notice is that the pin is guarded by 
> if (!adev->enable_virtual_display), but the unpins seem unconditional. So 
> could this be about virtual display, and the problem is actually trying 
> to unpin a BO that was never pinned?
>>>
>>> Nope, my educated guess is rather that we free up the BO before 
>>> amdgpu_display_unpin_work_func is called.
>>>
>>> E.g. not pin unbalance, but rather use after free.
>>
>> amdgpu_display_crtc_page_flip_target calls amdgpu_bo_ref(work->old_abo), and 
>> amdgpu_display_unpin_work_func calls amdgpu_bo_unref(>old_abo) only 
>> after amdgpu_bo_unpin. So what you describe could only happen if there's an 
>> imbalance elsewhere such that amdgpu_bo_unref is called more often than 
>> amdgpu_bo_ref, or maybe if amdgpu_bo_reserve fails in 
>> amdgpu_display_unpin_work_func (in which case the "failed to reserve buffer 
>> after flip" error message should appear in dmesg).
> 
> 
> Actually, each call to amdgpu_display_crtc_page_flip_target() will
> 
> 1, init a work(amdgpu_display_unpin_work_func) to unpin an old buffer
>(crtc->primary->fb), the work will be queued in dce_vX_0_pageflip_irq().
> 
> 2, pin a new buffer, assign it to crtc->primary->fb. But how to unpin it?
>Next call.
> 
> The problem is the pinned buffer of last call to 
> amdgpu_display_crtc_page_flip_target() is not unpinned.

It's unpinned in dce_v*_0_crtc_disable.


I think I've found the problem though: dce_v*_0_crtc_do_set_base pin the BO 
from target_fb unconditionally, but unpin the BO from the fb parameter only if 
it's different from the former. So if they're the same, the BO's pin count is 
incremented by 1.


-- 
Earthling Michel Dänzer|  https://redhat.com
Libre software enthusiast  | Mesa and Xwayland developer