date:20210225

Re: [RFC] a new approach to detect which ring is the real black sheep upon TDR reported

2021-02-25 Thread Christian König


Hi Monk,

in general an interesting idea, but I see two major problems with that:

1. It would make the reset take much longer.

2. Things get often stuck because of timing issues, so a guilty job 
might pass perfectly when run a second time.


Apart from that the whole ring mirror list turned out to be a really bad 
idea. E.g. we still struggle with object life time because the concept 
doesn't fit into the object model of the GPU scheduler under Linux.


We should probably work on this separately and straighten up the job 
destruction once more and keep the recovery information in the fence 
instead.


Regards,
Christian.

Am 26.02.21 um 06:58 schrieb Liu, Monk:


[AMD Public Use]


Hi all

NAVI2X  project hit a really hard to solve issue now, and it is turned 
out to be a general headache of our TDR mechanism , check below scenario:


 1. There is a job1 running on compute1 ring at timestamp
 2. There is a job2 running on gfx ring at timestamp
 3. Job1 is the guilty one, and job1/job2 were scheduled to their
rings at almost the same timestamp
 4. After 2 seconds we receive two TDR reporting from both GFX ring
and compute ring
 5. *Current scheme is that in drm scheduler all the head jobs of
those two rings are considered “bad job” and taken away from the
mirror list *
 6. The result is both the real guilty job (job1) and the innocent job
(job2) were all deleted from mirror list, and their corresponding
contexts were also treated as guilty*(so the innocent process
remains running is not secured)*

**

But by our wish the ideal case is TDR mechanism can detect which ring 
is the guilty ring and the innocent ring can resubmits all its pending 
jobs:


 1. Job1 to be deleted from compute1 ring’s mirror list
 2. Job2 is kept and resubmitted later and its belonging
process/context are even not aware of this TDR at all

Here I have a proposal tend to achieve above goal and it rough 
procedure is :


 1. Once any ring reports a TDR, the head job is **not** treated as
“bad job”, and it is **not** deleted from the mirror list in drm
sched functions
 2. In vendor’s function (our amdgpu driver here):
  * reset GPU
  * repeat below actions on each RINGS * one by one *:

1.take the head job and submit it on this ring

2.see if it completes, if not then this job is the real “bad job”

3. take it away from mirror list if this head job is “bad job”

  * After above iteration on all RINGS, we already clears all the
bad job(s)
 2. Resubmit all jobs from each mirror list to their corresponding
rings (this is the existed logic)

The idea of this is to use “serial” way to re-run and re-check each 
head job of each RING, in order to take out the real black sheep and 
its guilty context.


P.S.: we can use this approaches only on GFX/KCQ ring reports TDR , 
since those rings are intermutually affected to each other. For SDMA 
ring timeout it definitely proves the head job on SDMA ring is really 
guilty.


Thanks

--

Monk Liu | Cloud-GPU Core team

--



___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Re: [PATCH 0/3] drm/ttm: constify static vm_operations_structs

2021-02-25 Thread Christian König


Am 23.02.21 um 18:31 schrieb Alex Deucher:

On Wed, Feb 10, 2021 at 8:14 AM Daniel Vetter  wrote:

On Wed, Feb 10, 2021 at 08:45:56AM +0100, Christian König wrote:

Reviewed-by: Christian König  for the series.

Smash it into -misc?

@Christian Koenig did these ever land?  I don't see them in drm-misc.


I've just pushed them to drm-misc-next. Sorry for the delay, totally 
forgot about them.


Christian.



Alex


-Daniel


Am 10.02.21 um 00:48 schrieb Rikard Falkeborn:

Constify a few static vm_operations_struct that are never modified. Their
only usage is to assign their address to the vm_ops field in the
vm_area_struct, which is a pointer to const vm_operations_struct. Make them
const to allow the compiler to put them in read-only memory.

With this series applied, all static struct vm_operations_struct in the
kernel tree are const.

Rikard Falkeborn (3):
drm/amdgpu/ttm: constify static vm_operations_struct
drm/radeon/ttm: constify static vm_operations_struct
drm/nouveau/ttm: constify static vm_operations_struct

   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 2 +-
   drivers/gpu/drm/nouveau/nouveau_ttm.c   | 2 +-
   drivers/gpu/drm/radeon/radeon_ttm.c | 2 +-
   3 files changed, 3 insertions(+), 3 deletions(-)


--
Daniel Vetter
Software Engineer, Intel Corporation
https://nam11.safelinks.protection.outlook.com/?url=http%3A%2F%2Fblog.ffwll.ch%2F&data=04%7C01%7Cchristian.koenig%40amd.com%7C9d730e56efe54d3215ee08d8d820d642%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637496982837619645%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=b4UU8bzeX%2Ba1VfObe8mta7fwtjVv%2F1wo4%2FPVuGZFW8Q%3D&reserved=0
___
dri-devel mailing list
dri-de...@lists.freedesktop.org
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Fdri-devel&data=04%7C01%7Cchristian.koenig%40amd.com%7C9d730e56efe54d3215ee08d8d820d642%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637496982837629638%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=RKJh6p%2BTxaD0lH6M%2B0s3nah3tBatRFqoTvy3Mh7Lz5M%3D&reserved=0


___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Re: [PATCH] radeon: ERROR: space prohibited before that ','

2021-02-25 Thread Christian König

Well coding style clean ups are usually welcome, but not necessarily one 
by one.


We can probably merge this if you clean up all checkpatch.pl warnings in 
the whole file.


Christian.

Am 26.02.21 um 07:05 schrieb wangjingyu:

drm_property_create_range(rdev->ddev, 0 , "coherent", 0, 1);

Signed-off-by: wangjingyu 
---
  drivers/gpu/drm/radeon/radeon_display.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/radeon/radeon_display.c 
b/drivers/gpu/drm/radeon/radeon_display.c
index 3a6fedad002d..439d1b3e87d8 100644
--- a/drivers/gpu/drm/radeon/radeon_display.c
+++ b/drivers/gpu/drm/radeon/radeon_display.c
@@ -1396,7 +1396,7 @@ static int radeon_modeset_create_props(struct 
radeon_device *rdev)
  
  	if (rdev->is_atom_bios) {

rdev->mode_info.coherent_mode_property =
-   drm_property_create_range(rdev->ddev, 0 , "coherent", 
0, 1);
+   drm_property_create_range(rdev->ddev, 0, "coherent", 0, 
1);
if (!rdev->mode_info.coherent_mode_property)
return -ENOMEM;
}


___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

[PATCH] drm/amd/pm: optimize the link width/speed retrieving V2

2021-02-25 Thread Evan Quan

By using the information provided by PMFW when available.

V2: put those structures shared around SMU V11 ASICs in
smu_v11_0.h

Change-Id: I1afec4cd07ac9608861ee07c449e320e3f36a932
Signed-off-by: Evan Quan 
Acked-by: Alex Deucher 
---
 drivers/gpu/drm/amd/pm/inc/smu_v11_0.h| 10 --
 .../gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c   | 12 ---
 .../amd/pm/swsmu/smu11/sienna_cichlid_ppt.c   | 20 +++
 .../gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c| 10 ++
 4 files changed, 30 insertions(+), 22 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/inc/smu_v11_0.h 
b/drivers/gpu/drm/amd/pm/inc/smu_v11_0.h
index 50dd1529b994..d400f75e9202 100644
--- a/drivers/gpu/drm/amd/pm/inc/smu_v11_0.h
+++ b/drivers/gpu/drm/amd/pm/inc/smu_v11_0.h
@@ -58,6 +58,12 @@
 #define CTF_OFFSET_HOTSPOT 5
 #define CTF_OFFSET_MEM 5
 
+#define LINK_WIDTH_MAX 6
+#define LINK_SPEED_MAX 3
+
+static __maybe_unused uint8_t link_width[] = {0, 1, 2, 4, 8, 12, 16};
+static __maybe_unused uint8_t link_speed[] = {25, 50, 80, 160};
+
 static const
 struct smu_temperature_range __maybe_unused smu11_thermal_policy[] =
 {
@@ -284,11 +290,11 @@ int smu_v11_0_get_dpm_level_range(struct smu_context *smu,
 
 int smu_v11_0_get_current_pcie_link_width_level(struct smu_context *smu);
 
-int smu_v11_0_get_current_pcie_link_width(struct smu_context *smu);
+uint8_t smu_v11_0_get_current_pcie_link_width(struct smu_context *smu);
 
 int smu_v11_0_get_current_pcie_link_speed_level(struct smu_context *smu);
 
-int smu_v11_0_get_current_pcie_link_speed(struct smu_context *smu);
+uint8_t smu_v11_0_get_current_pcie_link_speed(struct smu_context *smu);
 
 int smu_v11_0_gfx_ulv_control(struct smu_context *smu,
  bool enablement);
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c
index ffd37b8a3882..f71723c345a8 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c
@@ -2716,10 +2716,8 @@ static ssize_t navi10_get_gpu_metrics(struct smu_context 
*smu,
 
gpu_metrics->current_fan_speed = metrics.CurrFanSpeed;
 
-   gpu_metrics->pcie_link_width =
-   smu_v11_0_get_current_pcie_link_width(smu);
-   gpu_metrics->pcie_link_speed =
-   smu_v11_0_get_current_pcie_link_speed(smu);
+   gpu_metrics->pcie_link_width = metrics.PcieWidth;
+   gpu_metrics->pcie_link_speed = link_speed[metrics.PcieRate];
 
gpu_metrics->system_clock_counter = ktime_get_boottime_ns();
 
@@ -2856,10 +2854,8 @@ static ssize_t navi12_get_gpu_metrics(struct smu_context 
*smu,
 
gpu_metrics->current_fan_speed = metrics.CurrFanSpeed;
 
-   gpu_metrics->pcie_link_width =
-   smu_v11_0_get_current_pcie_link_width(smu);
-   gpu_metrics->pcie_link_speed =
-   smu_v11_0_get_current_pcie_link_speed(smu);
+   gpu_metrics->pcie_link_width = metrics.PcieWidth;
+   gpu_metrics->pcie_link_speed = link_speed[metrics.PcieRate];
 
gpu_metrics->system_clock_counter = ktime_get_boottime_ns();
 
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
index e74299da1739..527e02b578af 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
@@ -3124,6 +3124,8 @@ static ssize_t sienna_cichlid_get_gpu_metrics(struct 
smu_context *smu,
SmuMetricsExternal_t metrics_external;
SmuMetrics_t *metrics =
&(metrics_external.SmuMetrics);
+   struct amdgpu_device *adev = smu->adev;
+   uint32_t smu_version;
int ret = 0;
 
ret = smu_cmn_get_metrics_table(smu,
@@ -3170,10 +3172,20 @@ static ssize_t sienna_cichlid_get_gpu_metrics(struct 
smu_context *smu,
 
gpu_metrics->current_fan_speed = metrics->CurrFanSpeed;
 
-   gpu_metrics->pcie_link_width =
-   smu_v11_0_get_current_pcie_link_width(smu);
-   gpu_metrics->pcie_link_speed =
-   smu_v11_0_get_current_pcie_link_speed(smu);
+   ret = smu_cmn_get_smc_version(smu, NULL, &smu_version);
+   if (ret)
+   return ret;
+
+   if (((adev->asic_type == CHIP_SIENNA_CICHLID) && smu_version > 
0x003A1E00) ||
+ ((adev->asic_type == CHIP_NAVY_FLOUNDER) && smu_version > 
0x00410400)) {
+   gpu_metrics->pcie_link_width = metrics->PcieWidth;
+   gpu_metrics->pcie_link_speed = link_speed[metrics->PcieRate];
+   } else {
+   gpu_metrics->pcie_link_width =
+   smu_v11_0_get_current_pcie_link_width(smu);
+   gpu_metrics->pcie_link_speed =
+   smu_v11_0_get_current_pcie_link_speed(smu);
+   }
 
gpu_metrics->system_clock_cou

[PATCH] drm/amd/pm: correct gpu metrics related data structures V2

2021-02-25 Thread Evan Quan

To make sure they are naturally aligned.

V2: minimum the possible influence to existing applications which
were developed based on those data structures. With this change,
only 32bit OS are affected while 64bit OS not.

Change-Id: I0a139e1e1f09fe27deffdce1cec6ea4594947625
Signed-off-by: Evan Quan 
---
 drivers/gpu/drm/amd/include/kgd_pp_interface.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/include/kgd_pp_interface.h 
b/drivers/gpu/drm/amd/include/kgd_pp_interface.h
index 41c89f7d6412..ca38a204beb0 100644
--- a/drivers/gpu/drm/amd/include/kgd_pp_interface.h
+++ b/drivers/gpu/drm/amd/include/kgd_pp_interface.h
@@ -339,6 +339,8 @@ struct metrics_table_header {
uint16_tstructure_size;
uint8_t format_revision;
uint8_t content_revision;
+   /* make the data structure naturely aligned for 64bit OS */
+   uint16_tpadding[2];
 };
 
 struct gpu_metrics_v1_0 {
-- 
2.29.0

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

RE: [PATCH v2] drm/amdgpu: remove unnecessary reading for epprom header

2021-02-25 Thread Zhang, Hawking

[AMD Public Use]

Reviewed-by: Hawking Zhang 

Regards,
Hawking
-Original Message-
From: Dennis Li  
Sent: Friday, February 26, 2021 14:42
To: amd-gfx@lists.freedesktop.org; Chen, Guchun ; Zhang, 
Hawking ; Koenig, Christian 
Cc: Li, Dennis 
Subject: [PATCH v2] drm/amdgpu: remove unnecessary reading for epprom header

If the number of badpage records exceed the threshold, driver has updated both 
epprom header and control->tbl_hdr.header before gpu reset, therefore GPU 
recovery thread no need to read epprom header directly.

v2: merge amdgpu_ras_check_err_threshold into 
amdgpu_ras_eeprom_check_err_threshold

Signed-off-by: Dennis Li 

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index f0f7ed42ee7f..f2ff10403d93 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4397,7 +4397,7 @@ static int amdgpu_do_asic_reset(struct amdgpu_hive_info 
*hive,
 * bad_page_threshold value to fix this once
 * probing driver again.
 */
-   if (!amdgpu_ras_check_err_threshold(tmp_adev)) {
+   if 
(!amdgpu_ras_eeprom_check_err_threshold(tmp_adev)) {
/* must succeed. */
amdgpu_ras_resume(tmp_adev);
} else {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 09546dec40ff..c669435ccc74 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -2189,19 +2189,3 @@ bool amdgpu_ras_need_emergency_restart(struct 
amdgpu_device *adev)
 
return false;
 }
-
-bool amdgpu_ras_check_err_threshold(struct amdgpu_device *adev) -{
-   struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
-   bool exc_err_limit = false;
-
-   if (con && (amdgpu_bad_page_threshold != 0))
-   amdgpu_ras_eeprom_check_err_threshold(&con->eeprom_control,
-   &exc_err_limit);
-
-   /*
-* We are only interested in variable exc_err_limit,
-* as it says if GPU is in bad state or not.
-*/
-   return exc_err_limit;
-}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
index aed0716efa5a..42aab9adc263 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
@@ -491,8 +491,6 @@ void amdgpu_ras_suspend(struct amdgpu_device *adev);  
unsigned long amdgpu_ras_query_error_count(struct amdgpu_device *adev,
bool is_ce);
 
-bool amdgpu_ras_check_err_threshold(struct amdgpu_device *adev);
-
 /* error handling functions */
 int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
struct eeprom_table_record *bps, int pages); diff --git 
a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
index 19d9aa76cfbf..7f527f8bbdb1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
@@ -434,47 +434,21 @@ static uint32_t __correct_eeprom_dest_address(uint32_t 
curr_address)
return curr_address;
 }
 
-int amdgpu_ras_eeprom_check_err_threshold(
-   struct amdgpu_ras_eeprom_control *control,
-   bool *exceed_err_limit)
+bool amdgpu_ras_eeprom_check_err_threshold(struct amdgpu_device *adev)
 {
-   struct amdgpu_device *adev = to_amdgpu_device(control);
-   unsigned char buff[EEPROM_ADDRESS_SIZE +
-   EEPROM_TABLE_HEADER_SIZE] = { 0 };
-   struct amdgpu_ras_eeprom_table_header *hdr = &control->tbl_hdr;
-   struct i2c_msg msg = {
-   .addr = control->i2c_address,
-   .flags = I2C_M_RD,
-   .len = EEPROM_ADDRESS_SIZE + EEPROM_TABLE_HEADER_SIZE,
-   .buf = buff,
-   };
-   int ret;
-
-   *exceed_err_limit = false;
+   struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
 
if (!__is_ras_eeprom_supported(adev))
-   return 0;
-
-   /* read EEPROM table header */
-   mutex_lock(&control->tbl_mutex);
-   ret = i2c_transfer(&adev->pm.smu_i2c, &msg, 1);
-   if (ret < 1) {
-   dev_err(adev->dev, "Failed to read EEPROM table header.\n");
-   goto err;
-   }
-
-   __decode_table_header_from_buff(hdr, &buff[2]);
+   return false;
 
-   if (hdr->header == EEPROM_TABLE_HDR_BAD) {
+   if (con->eeprom_control.tbl_hdr.header == EEPROM_TABLE_HDR_BAD) {
dev_warn(adev->dev, "This GPU is in BAD status.");
dev_warn(adev->dev, "Please retire it or setting one bigger "
"threshold val

[PATCH v2] drm/amdgpu: remove unnecessary reading for epprom header

2021-02-25 Thread Dennis Li

If the number of badpage records exceed the threshold, driver has
updated both epprom header and control->tbl_hdr.header before gpu reset,
therefore GPU recovery thread no need to read epprom header directly.

v2: merge amdgpu_ras_check_err_threshold into 
amdgpu_ras_eeprom_check_err_threshold

Signed-off-by: Dennis Li 

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index f0f7ed42ee7f..f2ff10403d93 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4397,7 +4397,7 @@ static int amdgpu_do_asic_reset(struct amdgpu_hive_info 
*hive,
 * bad_page_threshold value to fix this once
 * probing driver again.
 */
-   if (!amdgpu_ras_check_err_threshold(tmp_adev)) {
+   if 
(!amdgpu_ras_eeprom_check_err_threshold(tmp_adev)) {
/* must succeed. */
amdgpu_ras_resume(tmp_adev);
} else {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 09546dec40ff..c669435ccc74 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -2189,19 +2189,3 @@ bool amdgpu_ras_need_emergency_restart(struct 
amdgpu_device *adev)
 
return false;
 }
-
-bool amdgpu_ras_check_err_threshold(struct amdgpu_device *adev)
-{
-   struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
-   bool exc_err_limit = false;
-
-   if (con && (amdgpu_bad_page_threshold != 0))
-   amdgpu_ras_eeprom_check_err_threshold(&con->eeprom_control,
-   &exc_err_limit);
-
-   /*
-* We are only interested in variable exc_err_limit,
-* as it says if GPU is in bad state or not.
-*/
-   return exc_err_limit;
-}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
index aed0716efa5a..42aab9adc263 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
@@ -491,8 +491,6 @@ void amdgpu_ras_suspend(struct amdgpu_device *adev);
 unsigned long amdgpu_ras_query_error_count(struct amdgpu_device *adev,
bool is_ce);
 
-bool amdgpu_ras_check_err_threshold(struct amdgpu_device *adev);
-
 /* error handling functions */
 int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
struct eeprom_table_record *bps, int pages);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
index 19d9aa76cfbf..7f527f8bbdb1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
@@ -434,47 +434,21 @@ static uint32_t __correct_eeprom_dest_address(uint32_t 
curr_address)
return curr_address;
 }
 
-int amdgpu_ras_eeprom_check_err_threshold(
-   struct amdgpu_ras_eeprom_control *control,
-   bool *exceed_err_limit)
+bool amdgpu_ras_eeprom_check_err_threshold(struct amdgpu_device *adev)
 {
-   struct amdgpu_device *adev = to_amdgpu_device(control);
-   unsigned char buff[EEPROM_ADDRESS_SIZE +
-   EEPROM_TABLE_HEADER_SIZE] = { 0 };
-   struct amdgpu_ras_eeprom_table_header *hdr = &control->tbl_hdr;
-   struct i2c_msg msg = {
-   .addr = control->i2c_address,
-   .flags = I2C_M_RD,
-   .len = EEPROM_ADDRESS_SIZE + EEPROM_TABLE_HEADER_SIZE,
-   .buf = buff,
-   };
-   int ret;
-
-   *exceed_err_limit = false;
+   struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
 
if (!__is_ras_eeprom_supported(adev))
-   return 0;
-
-   /* read EEPROM table header */
-   mutex_lock(&control->tbl_mutex);
-   ret = i2c_transfer(&adev->pm.smu_i2c, &msg, 1);
-   if (ret < 1) {
-   dev_err(adev->dev, "Failed to read EEPROM table header.\n");
-   goto err;
-   }
-
-   __decode_table_header_from_buff(hdr, &buff[2]);
+   return false;
 
-   if (hdr->header == EEPROM_TABLE_HDR_BAD) {
+   if (con->eeprom_control.tbl_hdr.header == EEPROM_TABLE_HDR_BAD) {
dev_warn(adev->dev, "This GPU is in BAD status.");
dev_warn(adev->dev, "Please retire it or setting one bigger "
"threshold value when reloading driver.\n");
-   *exceed_err_limit = true;
+   return true;
}
 
-err:
-   mutex_unlock(&control->tbl_mutex);
-   return 0;
+   return false;
 }
 
 int amdgpu_ras_eeprom_process_recods(struct amdgpu_ras_eeprom_control *control,
diff --git a/drivers/gpu/drm/amd/amdgpu/amd

[PATCH] drm/amd/pm: bump Navi1x driver if version and related data structures V2

2021-02-25 Thread Evan Quan

New changes were involved for the SmuMetrics structure.

Change-Id: Ib45443db03977ccd18618bcfdfd3574ac13d50d1
Signed-off-by: Evan Quan 
---
 .../drm/amd/pm/inc/smu11_driver_if_navi10.h   |  98 ++-
 drivers/gpu/drm/amd/pm/inc/smu_v11_0.h|   6 +-
 .../gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c   | 609 +-
 3 files changed, 673 insertions(+), 40 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/inc/smu11_driver_if_navi10.h 
b/drivers/gpu/drm/amd/pm/inc/smu11_driver_if_navi10.h
index 246d3951a78a..04752ade1016 100644
--- a/drivers/gpu/drm/amd/pm/inc/smu11_driver_if_navi10.h
+++ b/drivers/gpu/drm/amd/pm/inc/smu11_driver_if_navi10.h
@@ -843,11 +843,15 @@ typedef struct {
   uint16_t  FanMaximumRpm;
   uint16_t  FanMinimumPwm;
   uint16_t  FanTargetTemperature; // Degree Celcius 
+  uint16_t  FanMode;
+  uint16_t  FanMaxPwm;
+  uint16_t  FanMinPwm;
+  uint16_t  FanMaxTemp; // Degree Celcius
+  uint16_t  FanMinTemp; // Degree Celcius
   uint16_t  MaxOpTemp;// Degree Celcius
   uint16_t  FanZeroRpmEnable;
-  uint16_t  Padding;
 
-  uint32_t MmHubPadding[8]; // SMU internal use  
+  uint32_t MmHubPadding[6]; // SMU internal use
 
 } OverDriveTable_t; 
 
@@ -880,6 +884,45 @@ typedef struct {
   uint8_t  Padding8_2;
   uint16_t CurrFanSpeed;
 
+  // Padding - ignore
+  uint32_t MmHubPadding[8]; // SMU internal use
+} SmuMetrics_legacy_t;
+
+typedef struct {
+  uint16_t CurrClock[PPCLK_COUNT];
+  uint16_t AverageGfxclkFrequencyPostDs;
+  uint16_t AverageSocclkFrequency;
+  uint16_t AverageUclkFrequencyPostDs;
+  uint16_t AverageGfxActivity;
+  uint16_t AverageUclkActivity   ;
+  uint8_t  CurrSocVoltageOffset  ;
+  uint8_t  CurrGfxVoltageOffset  ;
+  uint8_t  CurrMemVidOffset  ;
+  uint8_t  Padding8  ;
+  uint16_t AverageSocketPower;
+  uint16_t TemperatureEdge   ;
+  uint16_t TemperatureHotspot;
+  uint16_t TemperatureMem;
+  uint16_t TemperatureVrGfx  ;
+  uint16_t TemperatureVrMem0 ;
+  uint16_t TemperatureVrMem1 ;  
+  uint16_t TemperatureVrSoc  ;  
+  uint16_t TemperatureLiquid0;
+  uint16_t TemperatureLiquid1;  
+  uint16_t TemperaturePlx;
+  uint16_t Padding16 ;
+  uint32_t ThrottlerStatus   ; 
+ 
+  uint8_t  LinkDpmLevel;
+  uint8_t  Padding8_2;
+  uint16_t CurrFanSpeed;
+
+  uint16_t AverageGfxclkFrequencyPreDs;
+  uint16_t AverageUclkFrequencyPreDs;
+  uint8_t  PcieRate;
+  uint8_t  PcieWidth;
+  uint8_t  Padding8_3[2];
+
   // Padding - ignore
   uint32_t MmHubPadding[8]; // SMU internal use
 } SmuMetrics_t;
@@ -919,10 +962,61 @@ typedef struct {
   uint16_t VcnActivityPercentage ;
   uint16_t padding16_2;
 
+  // Padding - ignore
+  uint32_t MmHubPadding[8]; // SMU internal use
+} SmuMetrics_NV12_legacy_t;
+
+typedef struct {
+  uint16_t CurrClock[PPCLK_COUNT];
+  uint16_t AverageGfxclkFrequencyPostDs;
+  uint16_t AverageSocclkFrequency;
+  uint16_t AverageUclkFrequencyPostDs;
+  uint16_t AverageGfxActivity;
+  uint16_t AverageUclkActivity   ;
+  uint8_t  CurrSocVoltageOffset  ;
+  uint8_t  CurrGfxVoltageOffset  ;
+  uint8_t  CurrMemVidOffset  ;
+  uint8_t  Padding8  ;
+  uint16_t AverageSocketPower;
+  uint16_t TemperatureEdge   ;
+  uint16_t TemperatureHotspot;
+  uint16_t TemperatureMem;
+  uint16_t TemperatureVrGfx  ;
+  uint16_t TemperatureVrMem0 ;
+  uint16_t TemperatureVrMem1 ;
+  uint16_t TemperatureVrSoc  ;
+  uint16_t TemperatureLiquid0;
+  uint16_t TemperatureLiquid1;
+  uint16_t TemperaturePlx;
+  uint16_t Padding16 ;
+  uint32_t ThrottlerStatus   ;
+
+  uint8_t  LinkDpmLevel;
+  uint8_t  Padding8_2;
+  uint16_t CurrFanSpeed;
+
+  uint16_t AverageVclkFrequency  ;
+  uint16_t AverageDclkFrequency  ;
+  uint16_t VcnActivityPercentage ;
+  uint16_t AverageGfxclkFrequencyPreDs;
+  uint16_t AverageUclkFrequencyPreDs;
+  uint8_t  PcieRate;
+  uint8_t  PcieWidth;
+
+  uint32_t Padding32_1;
+  uint64_t EnergyAccumulator;
+
   // Padding - ignore
   uint32_t MmHubPadding[8]; // SMU internal use
 } SmuMetrics_NV12_t;
 
+typedef union SmuMetrics {
+   SmuMetrics_legacy_t nv10_legacy_metrics;
+   SmuMetrics_tnv10_metrics;
+   SmuMetrics_NV12_legacy_tnv12_legacy_metrics;
+   SmuMetrics_NV12_t   nv12_metrics;
+} SmuMetrics_NV1X_t;
+
 typedef struct {
   uint16_t MinClock; // This is either DCEFCLK or SOCCLK (in MHz)
   uint16_t MaxClock; // This is either DCEFCLK or SOCCLK (in MHz)
diff --git a/drivers/gpu/drm/amd/pm/inc/smu_v11_0.h 
b/drivers/gpu/drm/amd/pm/inc/smu_v11_0.h
index 281835f23f6d..50dd1529b994 100644
--- a/drivers/gpu/drm/amd/pm/inc/smu_v11_0.h
+++ b/drivers/gpu/drm/amd/pm/inc/smu_v11_0.h
@@ -27,9 +27,9 @@
 
 #define SMU11_DRIVER_IF_VERSION_INV 0x
 #define SMU11_DRIVER_IF_VERSION_ARCT 0x17
-#define SMU11_DRIVER_IF_VERSION_NV10 0x36
-#define SMU11_DRIVE

[RFC] a new approach to detect which ring is the real black sheep upon TDR reported

2021-02-25 Thread Liu, Monk

[AMD Public Use]

Hi all

NAVI2X  project hit a really hard to solve issue now, and it is turned out to 
be a general headache of our TDR mechanism , check below scenario:


  1.  There is a job1 running on compute1 ring at timestamp
  2.  There is a job2 running on gfx ring at timestamp
  3.  Job1 is the guilty one, and job1/job2 were scheduled to their rings at 
almost the same timestamp
  4.  After 2 seconds we receive two TDR reporting from both GFX ring and 
compute ring
  5.  Current scheme is that in drm scheduler all the head jobs of those two 
rings are considered "bad job" and taken away from the mirror list
  6.  The result is both the real guilty job (job1) and the innocent job (job2) 
were all deleted from mirror list, and their corresponding contexts were also 
treated as guilty (so the innocent process remains running is not secured)


But by our wish the ideal case is TDR mechanism can detect which ring is the 
guilty ring and the innocent ring can resubmits all its pending jobs:

  1.  Job1 to be deleted from compute1 ring's mirror list
  2.  Job2 is kept and resubmitted later and its belonging process/context are 
even not aware of this TDR at all


Here I have a proposal tend to achieve above goal and it rough procedure is :

  1.  Once any ring reports a TDR, the head job is *not* treated as "bad job", 
and it is *not* deleted from the mirror list in drm sched functions
  2.  In vendor's function (our amdgpu driver here):
 *   reset GPU
 *   repeat below actions on each RINGS * one by one *:

1. take the head job and submit it on this ring

2. see if it completes, if not then this job is the real "bad job"

3.  take it away from mirror list if this head job is "bad job"

 *   After above iteration on all RINGS, we already clears all the bad 
job(s)
  1.  Resubmit all jobs from each mirror list to their corresponding rings 
(this is the existed logic)

The idea of this is to use "serial" way to re-run and re-check each head job of 
each RING, in order to take out the real black sheep and its guilty context.

P.S.: we can use this approaches only on GFX/KCQ ring reports TDR , since those 
rings are intermutually affected to each other. For SDMA ring timeout it 
definitely proves the head job on SDMA ring is really guilty.

Thanks

--
Monk Liu | Cloud-GPU Core team
--

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

RE: [PATCH] amdgpu/pm: read_sensor() report failure apporpriately

2021-02-25 Thread Quan, Evan

[AMD Official Use Only - Internal Distribution Only]

Reviewed-by: Evan Quan 

-Original Message-
From: amd-gfx  On Behalf Of Shirish S
Sent: Thursday, February 25, 2021 11:45 PM
To: Deucher, Alexander ; 
amd-gfx@lists.freedesktop.org
Cc: S, Shirish 
Subject: [PATCH] amdgpu/pm: read_sensor() report failure apporpriately

report -ENOTSUPP instead of -EINVAL, so that if userspace
fails to read sensor data can figure it out the failure correctly.

Signed-off-by: Shirish S 
---
 drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu10_hwmgr.c  | 2 +-
 drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu7_hwmgr.c   | 2 +-
 drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu8_hwmgr.c   | 2 +-
 drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega10_hwmgr.c | 2 +-
 drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega12_hwmgr.c | 2 +-
 drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega20_hwmgr.c | 2 +-
 drivers/gpu/drm/amd/pm/powerplay/kv_dpm.c | 2 +-
 drivers/gpu/drm/amd/pm/powerplay/si_dpm.c | 2 +-
 8 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu10_hwmgr.c 
b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu10_hwmgr.c
index 2c90f715ee0a..c932b632ddd4 100644
--- a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu10_hwmgr.c
+++ b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu10_hwmgr.c
@@ -1285,7 +1285,7 @@ static int smu10_read_sensor(struct pp_hwmgr *hwmgr, int 
idx,
*size = 4;
break;
default:
-   ret = -EINVAL;
+   ret = -EOPNOTSUPP;
break;
}
 
diff --git a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu7_hwmgr.c 
b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu7_hwmgr.c
index c57dc9ae81f2..a58f92249c53 100644
--- a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu7_hwmgr.c
+++ b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu7_hwmgr.c
@@ -3945,7 +3945,7 @@ static int smu7_read_sensor(struct pp_hwmgr *hwmgr, int 
idx,
*((uint32_t *)value) = (uint32_t)convert_to_vddc(val_vid);
return 0;
default:
-   return -EINVAL;
+   return -EOPNOTSUPP;
}
 }
 
diff --git a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu8_hwmgr.c 
b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu8_hwmgr.c
index ed9b89980184..2cef9c0c6d6f 100644
--- a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu8_hwmgr.c
+++ b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu8_hwmgr.c
@@ -1805,7 +1805,7 @@ static int smu8_read_sensor(struct pp_hwmgr *hwmgr, int 
idx,
*((uint32_t *)value) = smu8_thermal_get_temperature(hwmgr);
return 0;
default:
-   return -EINVAL;
+   return -EOPNOTSUPP;
}
 }
 
diff --git a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega10_hwmgr.c 
b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega10_hwmgr.c
index 29c99642d22d..5e875ad8d633 100644
--- a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega10_hwmgr.c
+++ b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega10_hwmgr.c
@@ -3890,7 +3890,7 @@ static int vega10_read_sensor(struct pp_hwmgr *hwmgr, int 
idx,
*size = 8;
break;
default:
-   ret = -EINVAL;
+   ret = -EOPNOTSUPP;
break;
}
 
diff --git a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega12_hwmgr.c 
b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega12_hwmgr.c
index c0753029a8e2..a827f2bc7904 100644
--- a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega12_hwmgr.c
+++ b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega12_hwmgr.c
@@ -1429,7 +1429,7 @@ static int vega12_read_sensor(struct pp_hwmgr *hwmgr, int 
idx,
*size = 8;
break;
default:
-   ret = -EINVAL;
+   ret = -EOPNOTSUPP;
break;
}
return ret;
diff --git a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega20_hwmgr.c 
b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega20_hwmgr.c
index 87811b005b85..e8eec2539c17 100644
--- a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega20_hwmgr.c
+++ b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega20_hwmgr.c
@@ -2240,7 +2240,7 @@ static int vega20_read_sensor(struct pp_hwmgr *hwmgr, int 
idx,
*size = 8;
break;
default:
-   ret = -EINVAL;
+   ret = -EOPNOTSUPP;
break;
}
return ret;
diff --git a/drivers/gpu/drm/amd/pm/powerplay/kv_dpm.c 
b/drivers/gpu/drm/amd/pm/powerplay/kv_dpm.c
index 66daabebee35..bcae42cef374 100644
--- a/drivers/gpu/drm/amd/pm/powerplay/kv_dpm.c
+++ b/drivers/gpu/drm/amd/pm/powerplay/kv_dpm.c
@@ -3305,7 +3305,7 @@ static int kv_dpm_read_sensor(void *handle, int idx,
*size = 4;
return 0;
default:
-   return -EINVAL;
+   return -EOPNOTSUPP;
}
 }
 
diff --git a/drivers/gpu/drm/amd/pm/powerplay/si_dpm.c 
b/drivers/gpu/drm/amd/pm/powerplay/si_dpm.c
index 62291358fb1c..26a5321e621b 100644
--- a/drivers/gpu/drm/amd/pm/po

RE: [PATCH] drm/amdgpu: enable one vf mode on navi21 vf

2021-02-25 Thread Quan, Evan

[AMD Public Use]

Acked-by: Evan Quan 

-Original Message-
From: Horace Chen  
Sent: Thursday, February 25, 2021 8:05 PM
To: amd-gfx@lists.freedesktop.org
Cc: Grodzovsky, Andrey ; Quan, Evan 
; Chen, Horace ; Tuikov, Luben 
; Koenig, Christian ; Deucher, 
Alexander ; Xiao, Jack ; Zhang, 
Hawking ; Liu, Monk ; Xu, Feifei 
; Wang, Kevin(Yang) ; Xiaojie Yuan 

Subject: [PATCH] drm/amdgpu: enable one vf mode on navi21 vf

navi21 vf needs one vf mode which allows vf to set and get
clock status from guest vm. So now expose the required interface
and allow some smu request on VF mode. Also since navi21 blocked
direct MMIO access, use KIQ to send SMU request under sriov vf.

OD use same command as getting pp table which is not allowed for
 navi21, so remove OD feature under sriov vf.

Signed-off-by: Horace Chen 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c   |  2 ++
 drivers/gpu/drm/amd/pm/amdgpu_pm.c   |  2 +-
 drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c| 10 ++
 .../gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c  | 10 +-
 drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c   | 12 ++--
 5 files changed, 20 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index f0f7ed42ee7f..dfbf2f2db0de 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2043,6 +2043,8 @@ static int amdgpu_device_ip_early_init(struct 
amdgpu_device *adev)
adev->pm.pp_feature = amdgpu_pp_feature_mask;
if (amdgpu_sriov_vf(adev) || sched_policy == KFD_SCHED_POLICY_NO_HWS)
adev->pm.pp_feature &= ~PP_GFXOFF_MASK;
+   if (amdgpu_sriov_vf(adev) && adev->asic_type == CHIP_SIENNA_CICHLID)
+   adev->pm.pp_feature &= ~PP_OVERDRIVE_MASK;
 
for (i = 0; i < adev->num_ip_blocks; i++) {
if ((amdgpu_ip_block_mask & (1 << i)) == 0) {
diff --git a/drivers/gpu/drm/amd/pm/amdgpu_pm.c 
b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
index b770dd634ab6..1866cbaf70c3 100644
--- a/drivers/gpu/drm/amd/pm/amdgpu_pm.c
+++ b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
@@ -2167,7 +2167,7 @@ static ssize_t amdgpu_get_gpu_metrics(struct device *dev,
 
 static struct amdgpu_device_attr amdgpu_device_attrs[] = {
AMDGPU_DEVICE_ATTR_RW(power_dpm_state,  
ATTR_FLAG_BASIC|ATTR_FLAG_ONEVF),
-   AMDGPU_DEVICE_ATTR_RW(power_dpm_force_performance_level,
ATTR_FLAG_BASIC),
+   AMDGPU_DEVICE_ATTR_RW(power_dpm_force_performance_level,
ATTR_FLAG_BASIC|ATTR_FLAG_ONEVF),
AMDGPU_DEVICE_ATTR_RO(pp_num_states,
ATTR_FLAG_BASIC),
AMDGPU_DEVICE_ATTR_RO(pp_cur_state, 
ATTR_FLAG_BASIC),
AMDGPU_DEVICE_ATTR_RW(pp_force_state,   
ATTR_FLAG_BASIC),
diff --git a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c 
b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
index d143ef1b460b..7033d52eb4d0 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
@@ -612,10 +612,12 @@ static int smu_late_init(void *handle)
return ret;
}
 
-   ret = smu_set_default_od_settings(smu);
-   if (ret) {
-   dev_err(adev->dev, "Failed to setup default OD settings!\n");
-   return ret;
+   if (smu->od_enabled) {
+   ret = smu_set_default_od_settings(smu);
+   if (ret) {
+   dev_err(adev->dev, "Failed to setup default OD 
settings!\n");
+   return ret;
+   }
}
 
ret = smu_populate_umd_state_clk(smu);
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
index af73e1430af5..441effc23625 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
@@ -89,17 +89,17 @@ static struct cmn2asic_msg_mapping 
sienna_cichlid_message_map[SMU_MSG_MAX_COUNT]
MSG_MAP(GetEnabledSmuFeaturesHigh,  
PPSMC_MSG_GetRunningSmuFeaturesHigh,   1),
MSG_MAP(SetWorkloadMask,PPSMC_MSG_SetWorkloadMask,  
   1),
MSG_MAP(SetPptLimit,PPSMC_MSG_SetPptLimit,  
   0),
-   MSG_MAP(SetDriverDramAddrHigh,  
PPSMC_MSG_SetDriverDramAddrHigh,   0),
-   MSG_MAP(SetDriverDramAddrLow,   PPSMC_MSG_SetDriverDramAddrLow, 
   0),
+   MSG_MAP(SetDriverDramAddrHigh,  
PPSMC_MSG_SetDriverDramAddrHigh,   1),
+   MSG_MAP(SetDriverDramAddrLow,   PPSMC_MSG_SetDriverDramAddrLow, 
   1),
MSG_MAP(SetToolsDramAddrHigh,   PPSMC_MSG_SetToolsDramAddrHigh, 
   0),
MSG_MAP(SetToolsDramAddrLow,PPSMC_MSG_SetToolsDramAddrLow,  
   0),
-   MSG_MAP(TransferTableSmu2Dram,

RE: [PATCH] drm/amdgpu: remove unnecessary reading for epprom header

2021-02-25 Thread Li, Dennis

Hi, Hawking,
  Agree with your suggestion, and it could further simplify our codes. I 
will refactor them again. 

Best Regards
Dennis Li
-Original Message-
From: Zhang, Hawking  
Sent: Friday, February 26, 2021 12:30 PM
To: Li, Dennis ; amd-gfx@lists.freedesktop.org; Chen, Guchun 
; Koenig, Christian 
Cc: Li, Dennis 
Subject: RE: [PATCH] drm/amdgpu: remove unnecessary reading for epprom header

[AMD Public Use]

What about merge this function with amdgpu_ras_check_err_threshold?

Regards,
Hawking

-Original Message-
From: Dennis Li  
Sent: Friday, February 26, 2021 09:26
To: amd-gfx@lists.freedesktop.org; Chen, Guchun ; Zhang, 
Hawking ; Koenig, Christian 
Cc: Li, Dennis 
Subject: [PATCH] drm/amdgpu: remove unnecessary reading for epprom header

If the number of badpage records exceed the threshold, driver has updated both 
epprom header and control->tbl_hdr.header before gpu reset, therefore GPU 
recovery thread no need to read epprom header directly.

Signed-off-by: Dennis Li 

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
index 19d9aa76cfbf..4310ad63890c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
@@ -439,41 +439,19 @@ int amdgpu_ras_eeprom_check_err_threshold(
bool *exceed_err_limit)
 {
struct amdgpu_device *adev = to_amdgpu_device(control);
-   unsigned char buff[EEPROM_ADDRESS_SIZE +
-   EEPROM_TABLE_HEADER_SIZE] = { 0 };
-   struct amdgpu_ras_eeprom_table_header *hdr = &control->tbl_hdr;
-   struct i2c_msg msg = {
-   .addr = control->i2c_address,
-   .flags = I2C_M_RD,
-   .len = EEPROM_ADDRESS_SIZE + EEPROM_TABLE_HEADER_SIZE,
-   .buf = buff,
-   };
-   int ret;
 
*exceed_err_limit = false;
 
if (!__is_ras_eeprom_supported(adev))
return 0;
 
-   /* read EEPROM table header */
-   mutex_lock(&control->tbl_mutex);
-   ret = i2c_transfer(&adev->pm.smu_i2c, &msg, 1);
-   if (ret < 1) {
-   dev_err(adev->dev, "Failed to read EEPROM table header.\n");
-   goto err;
-   }
-
-   __decode_table_header_from_buff(hdr, &buff[2]);
-
-   if (hdr->header == EEPROM_TABLE_HDR_BAD) {
+   if (control->tbl_hdr.header == EEPROM_TABLE_HDR_BAD) {
dev_warn(adev->dev, "This GPU is in BAD status.");
dev_warn(adev->dev, "Please retire it or setting one bigger "
"threshold value when reloading driver.\n");
*exceed_err_limit = true;
}
 
-err:
-   mutex_unlock(&control->tbl_mutex);
return 0;
 }
 
--
2.17.1
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

RE: [PATCH] drm/amdgpu: remove unnecessary reading for epprom header

2021-02-25 Thread Zhang, Hawking

[AMD Public Use]

What about merge this function with amdgpu_ras_check_err_threshold?

Regards,
Hawking

-Original Message-
From: Dennis Li  
Sent: Friday, February 26, 2021 09:26
To: amd-gfx@lists.freedesktop.org; Chen, Guchun ; Zhang, 
Hawking ; Koenig, Christian 
Cc: Li, Dennis 
Subject: [PATCH] drm/amdgpu: remove unnecessary reading for epprom header

If the number of badpage records exceed the threshold, driver has updated both 
epprom header and control->tbl_hdr.header before gpu reset, therefore GPU 
recovery thread no need to read epprom header directly.

Signed-off-by: Dennis Li 

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
index 19d9aa76cfbf..4310ad63890c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
@@ -439,41 +439,19 @@ int amdgpu_ras_eeprom_check_err_threshold(
bool *exceed_err_limit)
 {
struct amdgpu_device *adev = to_amdgpu_device(control);
-   unsigned char buff[EEPROM_ADDRESS_SIZE +
-   EEPROM_TABLE_HEADER_SIZE] = { 0 };
-   struct amdgpu_ras_eeprom_table_header *hdr = &control->tbl_hdr;
-   struct i2c_msg msg = {
-   .addr = control->i2c_address,
-   .flags = I2C_M_RD,
-   .len = EEPROM_ADDRESS_SIZE + EEPROM_TABLE_HEADER_SIZE,
-   .buf = buff,
-   };
-   int ret;
 
*exceed_err_limit = false;
 
if (!__is_ras_eeprom_supported(adev))
return 0;
 
-   /* read EEPROM table header */
-   mutex_lock(&control->tbl_mutex);
-   ret = i2c_transfer(&adev->pm.smu_i2c, &msg, 1);
-   if (ret < 1) {
-   dev_err(adev->dev, "Failed to read EEPROM table header.\n");
-   goto err;
-   }
-
-   __decode_table_header_from_buff(hdr, &buff[2]);
-
-   if (hdr->header == EEPROM_TABLE_HDR_BAD) {
+   if (control->tbl_hdr.header == EEPROM_TABLE_HDR_BAD) {
dev_warn(adev->dev, "This GPU is in BAD status.");
dev_warn(adev->dev, "Please retire it or setting one bigger "
"threshold value when reloading driver.\n");
*exceed_err_limit = true;
}
 
-err:
-   mutex_unlock(&control->tbl_mutex);
return 0;
 }
 
--
2.17.1
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

RE: [PATCH] drm/amdgpu: remove unused variable in amdgpu_dma_buf_unmap()

2021-02-25 Thread Zhang, Hawking

[AMD Public Use]

Reviewed-by: Hawking Zhang 

Regards,
Hawking
-Original Message-
From: amd-gfx  On Behalf Of Kevin Wang
Sent: Friday, February 26, 2021 12:25
To: amd-gfx@lists.freedesktop.org
Cc: Wang, Kevin(Yang) 
Subject: [PATCH] drm/amdgpu: remove unused variable in amdgpu_dma_buf_unmap()

clean up unsued variable in amdgpu_dma_buf_unmap().

Fixes:
drm/amdgpu: Remove amdgpu_device arg from free_sgt api

Signed-off-by: Kevin Wang 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
index e83d73ec0e9d..5ec6556ebf7a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
@@ -356,10 +356,6 @@ static void amdgpu_dma_buf_unmap(struct dma_buf_attachment 
*attach,
 struct sg_table *sgt,
 enum dma_data_direction dir)
 {
-   struct dma_buf *dma_buf = attach->dmabuf;
-   struct drm_gem_object *obj = dma_buf->priv;
-   struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj);
-
if (sgt->sgl->page_link) {
dma_unmap_sgtable(attach->dev, sgt, dir, 0);
sg_free_table(sgt);
-- 
2.17.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&data=04%7C01%7Chawking.zhang%40amd.com%7C9555021bae2f415e1bef08d8da0e8125%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637499103125405709%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=y21rsObIUGSeS2XiFfVMrUNuD5%2Fc%2BrfXjOd%2BamdLa60%3D&reserved=0
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

RE: [PATCH] drm/amdgpu: add RAP TA version print in amdgpu_firmware_info

2021-02-25 Thread Zhang, Hawking

[AMD Public Use]

Reviewed-by: Hawking Zhang 

-Original Message-
From: Wang, Kevin(Yang)  
Sent: Friday, February 26, 2021 12:24
To: amd-gfx@lists.freedesktop.org
Cc: Zhang, Hawking ; Li, Candice ; 
Wang, Kevin(Yang) 
Subject: [PATCH] drm/amdgpu: add RAP TA version print in amdgpu_firmware_info

add RAP TA version print in amdgpu_firmware_info.

Signed-off-by: Kevin Wang 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
index ce031a77cda5..a5ed9530f542 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
@@ -305,6 +305,10 @@ static int amdgpu_firmware_info(struct 
drm_amdgpu_info_firmware *fw_info,
fw_info->ver = adev->psp.ta_fw_version;
fw_info->feature = adev->psp.ta_dtm_ucode_version;
break;
+   case 4:
+   fw_info->ver = adev->psp.ta_fw_version;
+   fw_info->feature = adev->psp.ta_rap_ucode_version;
+   break;
default:
return -EINVAL;
}
@@ -1490,6 +1494,10 @@ static int amdgpu_debugfs_firmware_info_show(struct 
seq_file *m, void *unused)
seq_printf(m, "TA %s feature version: 0x%08x, firmware 
version: 0x%08x\n",
"DTM", fw_info.feature, fw_info.ver);
break;
+   case 4:
+   seq_printf(m, "TA %s feature version: 0x%08x, firmware 
version: 0x%08x\n",
+   "RAP", fw_info.feature, fw_info.ver);
+   break;
default:
return -EINVAL;
}
-- 
2.17.1
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

[PATCH] drm/amdgpu: remove unused variable in amdgpu_dma_buf_unmap()

2021-02-25 Thread Kevin Wang

clean up unsued variable in amdgpu_dma_buf_unmap().

Fixes:
drm/amdgpu: Remove amdgpu_device arg from free_sgt api

Signed-off-by: Kevin Wang 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
index e83d73ec0e9d..5ec6556ebf7a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
@@ -356,10 +356,6 @@ static void amdgpu_dma_buf_unmap(struct dma_buf_attachment 
*attach,
 struct sg_table *sgt,
 enum dma_data_direction dir)
 {
-   struct dma_buf *dma_buf = attach->dmabuf;
-   struct drm_gem_object *obj = dma_buf->priv;
-   struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj);
-
if (sgt->sgl->page_link) {
dma_unmap_sgtable(attach->dev, sgt, dir, 0);
sg_free_table(sgt);
-- 
2.17.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

[PATCH] drm/amdgpu: add RAP TA version print in amdgpu_firmware_info

2021-02-25 Thread Kevin Wang

add RAP TA version print in amdgpu_firmware_info.

Signed-off-by: Kevin Wang 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
index ce031a77cda5..a5ed9530f542 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
@@ -305,6 +305,10 @@ static int amdgpu_firmware_info(struct 
drm_amdgpu_info_firmware *fw_info,
fw_info->ver = adev->psp.ta_fw_version;
fw_info->feature = adev->psp.ta_dtm_ucode_version;
break;
+   case 4:
+   fw_info->ver = adev->psp.ta_fw_version;
+   fw_info->feature = adev->psp.ta_rap_ucode_version;
+   break;
default:
return -EINVAL;
}
@@ -1490,6 +1494,10 @@ static int amdgpu_debugfs_firmware_info_show(struct 
seq_file *m, void *unused)
seq_printf(m, "TA %s feature version: 0x%08x, firmware 
version: 0x%08x\n",
"DTM", fw_info.feature, fw_info.ver);
break;
+   case 4:
+   seq_printf(m, "TA %s feature version: 0x%08x, firmware 
version: 0x%08x\n",
+   "RAP", fw_info.feature, fw_info.ver);
+   break;
default:
return -EINVAL;
}
-- 
2.17.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

[PATCH] drm/amdgpu: remove unnecessary reading for epprom header

2021-02-25 Thread Dennis Li

If the number of badpage records exceed the threshold, driver has
updated both epprom header and control->tbl_hdr.header before gpu reset,
therefore GPU recovery thread no need to read epprom header directly.

Signed-off-by: Dennis Li 

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
index 19d9aa76cfbf..4310ad63890c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
@@ -439,41 +439,19 @@ int amdgpu_ras_eeprom_check_err_threshold(
bool *exceed_err_limit)
 {
struct amdgpu_device *adev = to_amdgpu_device(control);
-   unsigned char buff[EEPROM_ADDRESS_SIZE +
-   EEPROM_TABLE_HEADER_SIZE] = { 0 };
-   struct amdgpu_ras_eeprom_table_header *hdr = &control->tbl_hdr;
-   struct i2c_msg msg = {
-   .addr = control->i2c_address,
-   .flags = I2C_M_RD,
-   .len = EEPROM_ADDRESS_SIZE + EEPROM_TABLE_HEADER_SIZE,
-   .buf = buff,
-   };
-   int ret;
 
*exceed_err_limit = false;
 
if (!__is_ras_eeprom_supported(adev))
return 0;
 
-   /* read EEPROM table header */
-   mutex_lock(&control->tbl_mutex);
-   ret = i2c_transfer(&adev->pm.smu_i2c, &msg, 1);
-   if (ret < 1) {
-   dev_err(adev->dev, "Failed to read EEPROM table header.\n");
-   goto err;
-   }
-
-   __decode_table_header_from_buff(hdr, &buff[2]);
-
-   if (hdr->header == EEPROM_TABLE_HDR_BAD) {
+   if (control->tbl_hdr.header == EEPROM_TABLE_HDR_BAD) {
dev_warn(adev->dev, "This GPU is in BAD status.");
dev_warn(adev->dev, "Please retire it or setting one bigger "
"threshold value when reloading driver.\n");
*exceed_err_limit = true;
}
 
-err:
-   mutex_unlock(&control->tbl_mutex);
return 0;
 }
 
-- 
2.17.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

[PATCH] drm/amdgpu: remove unused variables

2021-02-25 Thread Alex Deucher

Not used so remove them.

Fixes: d2d0c920a127 ("drm/amdgpu: Remove amdgpu_device arg from free_sgt api")
Cc: Ramesh Errabolu 
Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
index e83d73ec0e9d..5ec6556ebf7a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
@@ -356,10 +356,6 @@ static void amdgpu_dma_buf_unmap(struct dma_buf_attachment 
*attach,
 struct sg_table *sgt,
 enum dma_data_direction dir)
 {
-   struct dma_buf *dma_buf = attach->dmabuf;
-   struct drm_gem_object *obj = dma_buf->priv;
-   struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj);
-
if (sgt->sgl->page_link) {
dma_unmap_sgtable(attach->dev, sgt, dir, 0);
sg_free_table(sgt);
-- 
2.29.2

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Re: [PATCH] drm/amd/display: Fix an uninitialized index variable

2021-02-25 Thread Alex Deucher

On Thu, Feb 25, 2021 at 10:01 AM Arnd Bergmann  wrote:
>
> From: Arnd Bergmann 
>
> clang points out that the new logic uses an always-uninitialized
> array index:
>
> drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:9810:38: warning: 
> variable 'i' is uninitialized when used here [-Wuninitialized]
> timing  = &edid->detailed_timings[i];
>   ^
> drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:9720:7: note: 
> initialize the variable 'i' to silence this warning
>
> My best guess is that the index should have been returned by the
> parse_hdmi_amd_vsdb() function that walks an array here, so do that.
>
> Fixes: f9b4f20c4777 ("drm/amd/display: Add Freesync HDMI support to DM")
> Signed-off-by: Arnd Bergmann 

This looks correct to me.  Stylon can you verify?

Thanks,

Alex


> ---
>  .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c| 16 
>  1 file changed, 8 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
> b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> index b19b93c74bae..667c0d52dbfa 100644
> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> @@ -9736,7 +9736,7 @@ static bool parse_edid_cea(struct amdgpu_dm_connector 
> *aconnector,
> return false;
>  }
>
> -static bool parse_hdmi_amd_vsdb(struct amdgpu_dm_connector *aconnector,
> +static int parse_hdmi_amd_vsdb(struct amdgpu_dm_connector *aconnector,
> struct edid *edid, struct amdgpu_hdmi_vsdb_info *vsdb_info)
>  {
> uint8_t *edid_ext = NULL;
> @@ -9746,7 +9746,7 @@ static bool parse_hdmi_amd_vsdb(struct 
> amdgpu_dm_connector *aconnector,
> /*- drm_find_cea_extension() -*/
> /* No EDID or EDID extensions */
> if (edid == NULL || edid->extensions == 0)
> -   return false;
> +   return -ENODEV;
>
> /* Find CEA extension */
> for (i = 0; i < edid->extensions; i++) {
> @@ -9756,14 +9756,15 @@ static bool parse_hdmi_amd_vsdb(struct 
> amdgpu_dm_connector *aconnector,
> }
>
> if (i == edid->extensions)
> -   return false;
> +   return -ENODEV;
>
> /*- cea_db_offsets() -*/
> if (edid_ext[0] != CEA_EXT)
> -   return false;
> +   return -ENODEV;
>
> valid_vsdb_found = parse_edid_cea(aconnector, edid_ext, EDID_LENGTH, 
> vsdb_info);
> -   return valid_vsdb_found;
> +
> +   return valid_vsdb_found ? i : -ENODEV;
>  }
>
>  void amdgpu_dm_update_freesync_caps(struct drm_connector *connector,
> @@ -9781,7 +9782,6 @@ void amdgpu_dm_update_freesync_caps(struct 
> drm_connector *connector,
> struct amdgpu_device *adev = drm_to_adev(dev);
> bool freesync_capable = false;
> struct amdgpu_hdmi_vsdb_info vsdb_info = {0};
> -   bool hdmi_valid_vsdb_found = false;
>
> if (!connector->state) {
> DRM_ERROR("%s - Connector has no state", __func__);
> @@ -9857,8 +9857,8 @@ void amdgpu_dm_update_freesync_caps(struct 
> drm_connector *connector,
> }
> }
> } else if (edid && amdgpu_dm_connector->dc_sink->sink_signal == 
> SIGNAL_TYPE_HDMI_TYPE_A) {
> -   hdmi_valid_vsdb_found = 
> parse_hdmi_amd_vsdb(amdgpu_dm_connector, edid, &vsdb_info);
> -   if (hdmi_valid_vsdb_found && vsdb_info.freesync_supported) {
> +   i = parse_hdmi_amd_vsdb(amdgpu_dm_connector, edid, 
> &vsdb_info);
> +   if (i >= 0 && vsdb_info.freesync_supported) {
> timing  = &edid->detailed_timings[i];
> data= &timing->data.other_data;
>
> --
> 2.29.2
>
> ___
> dri-devel mailing list
> dri-de...@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Re: [PATCH] drm/amd/display: Fix an uninitialized index variable

2021-02-25 Thread Arnd Bergmann

On Thu, Feb 25, 2021 at 10:34 PM 'Nick Desaulniers' via Clang Built
Linux  wrote:
> return parse_edid_cea(aconnector, edid_ext, EDID_LENGTH, vsdb_info) ? i : 
> -ENODEV;
>
> would suffice, but the patch is still fine as is.

Right, I did not want to change more than necessary here, and the
original code already had the extra assignment instead of returning
the value.

> > @@ -9857,8 +9857,8 @@ void amdgpu_dm_update_freesync_caps(struct 
> > drm_connector *connector,
> > }
> > }
> > } else if (edid && amdgpu_dm_connector->dc_sink->sink_signal == 
> > SIGNAL_TYPE_HDMI_TYPE_A) {
> > -   hdmi_valid_vsdb_found = 
> > parse_hdmi_amd_vsdb(amdgpu_dm_connector, edid, &vsdb_info);
> > -   if (hdmi_valid_vsdb_found && vsdb_info.freesync_supported) {
> > +   i = parse_hdmi_amd_vsdb(amdgpu_dm_connector, edid, 
> > &vsdb_info);
> > +   if (i >= 0 && vsdb_info.freesync_supported) {
>
> reusing `i` here is safe, for now, but reuse of variables like this in
> separate branches like this might not get noticed if the function is
> amended in the future.
>
> > timing  = &edid->detailed_timings[i];
> > data= &timing->data.other_data;

The entire point of the patch is that 'i' is in fact used in the following line,
but was lacking an intialization.

   Arnd
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Re: [PATCH] drm/amd/display: fix 64-bit integer division

2021-02-25 Thread Stempen, Vladimir

[AMD Official Use Only - Internal Distribution Only]

Hi Arnd,
I have all the patches ready and I have tested them with the feature/platform 
I'm working on and Bindu helped to test the 32bit build.
I'm in process of submitting the latest change.
Thanks,
Vladimir.

On 2021-02-25, 4:36 PM, "Arnd Bergmann"  wrote:

On Thu, Feb 25, 2021 at 3:33 PM Arnd Bergmann  wrote:
>
> From: Arnd Bergmann 
>
> The new display synchronization code caused a regression
> on all 32-bit architectures:
>
> ld.lld: error: undefined symbol: __aeabi_uldivmod
> >>> referenced by dce_clock_source.c
> >>>   
gpu/drm/amd/display/dc/dce/dce_clock_source.o:(get_pixel_clk_frequency_100hz) 
in archive drivers/built-in.a
>
> ld.lld: error: undefined symbol: __aeabi_ldivmod
> >>> referenced by dc_resource.c
> >>>   
gpu/drm/amd/display/dc/core/dc_resource.o:(resource_are_vblanks_synchronizable) 
in archive drivers/built-in.a
> >>> referenced by dc_resource.c
> >>>   
gpu/drm/amd/display/dc/core/dc_resource.o:(resource_are_vblanks_synchronizable) 
in archive drivers/built-in.a
> >>> referenced by dc_resource.c
> >>>   
gpu/drm/amd/display/dc/core/dc_resource.o:(resource_are_vblanks_synchronizable) 
in archive drivers/built-in.a
>
> This is not a fast path, so the use of an explicit div_u64/div_s64
> seems appropriate.

I found two more instances:

>>> referenced by dcn20_optc.c
>>>   
gpu/drm/amd/display/dc/dcn20/dcn20_optc.o:(optc2_align_vblanks) in archive 
drivers/built-in.a

>>> referenced by dcn10_hw_sequencer.c
>>>   
gpu/drm/amd/display/dc/dcn10/dcn10_hw_sequencer.o:(reduceSizeAndFraction) in 
archive drivers/built-in.a

I have patches for both, but will let the randconfig build box keep working
on it over night to see if there are any others. Let me know if you want a
combined patch or one per file once there are no more regressions.

Arnd

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

[PATCH v3] drm/scheduler: Fix hang when sched_entity released

2021-02-25 Thread Andrey Grodzovsky

Problem: If scheduler is already stopped by the time sched_entity
is released and entity's job_queue not empty I encountred
a hang in drm_sched_entity_flush. This is because drm_sched_entity_is_idle
never becomes false.

Fix: In drm_sched_fini detach all sched_entities from the
scheduler's run queues. This will satisfy drm_sched_entity_is_idle.
Also wakeup all those processes stuck in sched_entity flushing
as the scheduler main thread which wakes them up is stopped by now.

v2:
Reverse order of drm_sched_rq_remove_entity and marking
s_entity as stopped to prevent reinserion back to rq due
to race.

v3:
Drop drm_sched_rq_remove_entity, only modify entity->stopped
and check for it in drm_sched_entity_is_idle

Signed-off-by: Andrey Grodzovsky 
---
 drivers/gpu/drm/scheduler/sched_entity.c |  3 ++-
 drivers/gpu/drm/scheduler/sched_main.c   | 23 +++
 2 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c 
b/drivers/gpu/drm/scheduler/sched_entity.c
index 92d965b629c6..68b10813129a 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -116,7 +116,8 @@ static bool drm_sched_entity_is_idle(struct 
drm_sched_entity *entity)
rmb(); /* for list_empty to work without lock */
 
if (list_empty(&entity->list) ||
-   spsc_queue_count(&entity->job_queue) == 0)
+   spsc_queue_count(&entity->job_queue) == 0 ||
+   entity->stopped)
return true;
 
return false;
diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
b/drivers/gpu/drm/scheduler/sched_main.c
index 908b0b56032d..b50fab472734 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -897,9 +897,32 @@ EXPORT_SYMBOL(drm_sched_init);
  */
 void drm_sched_fini(struct drm_gpu_scheduler *sched)
 {
+   int i;
+   struct drm_sched_entity *s_entity;
if (sched->thread)
kthread_stop(sched->thread);
 
+   for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; 
i--) {
+   struct drm_sched_rq *rq = &sched->sched_rq[i];
+
+   if (!rq)
+   continue;
+
+   spin_lock(&rq->lock);
+   list_for_each_entry(s_entity, &rq->entities, list)
+   /*
+* Prevents reinsertion and marks job_queue as idle,
+* it will removed from rq in drm_sched_entity_fini
+* eventually
+*/
+   s_entity->stopped = true;
+   spin_unlock(&rq->lock);
+
+   }
+
+   /* Wakeup everyone stuck in drm_sched_entity_flush for this scheduler */
+   wake_up_all(&sched->job_scheduled);
+
/* Confirm no work left behind accessing device structures */
cancel_delayed_work_sync(&sched->work_tdr);
 
-- 
2.25.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Re: [PATCH] drm/amd/display: fix 64-bit integer division

2021-02-25 Thread Arnd Bergmann

On Thu, Feb 25, 2021 at 3:33 PM Arnd Bergmann  wrote:
>
> From: Arnd Bergmann 
>
> The new display synchronization code caused a regression
> on all 32-bit architectures:
>
> ld.lld: error: undefined symbol: __aeabi_uldivmod
> >>> referenced by dce_clock_source.c
> >>>   
> >>> gpu/drm/amd/display/dc/dce/dce_clock_source.o:(get_pixel_clk_frequency_100hz)
> >>>  in archive drivers/built-in.a
>
> ld.lld: error: undefined symbol: __aeabi_ldivmod
> >>> referenced by dc_resource.c
> >>>   
> >>> gpu/drm/amd/display/dc/core/dc_resource.o:(resource_are_vblanks_synchronizable)
> >>>  in archive drivers/built-in.a
> >>> referenced by dc_resource.c
> >>>   
> >>> gpu/drm/amd/display/dc/core/dc_resource.o:(resource_are_vblanks_synchronizable)
> >>>  in archive drivers/built-in.a
> >>> referenced by dc_resource.c
> >>>   
> >>> gpu/drm/amd/display/dc/core/dc_resource.o:(resource_are_vblanks_synchronizable)
> >>>  in archive drivers/built-in.a
>
> This is not a fast path, so the use of an explicit div_u64/div_s64
> seems appropriate.

I found two more instances:

>>> referenced by dcn20_optc.c
>>>   
>>> gpu/drm/amd/display/dc/dcn20/dcn20_optc.o:(optc2_align_vblanks) in archive 
>>> drivers/built-in.a

>>> referenced by dcn10_hw_sequencer.c
>>>   
>>> gpu/drm/amd/display/dc/dcn10/dcn10_hw_sequencer.o:(reduceSizeAndFraction) 
>>> in archive drivers/built-in.a

I have patches for both, but will let the randconfig build box keep working
on it over night to see if there are any others. Let me know if you want a
combined patch or one per file once there are no more regressions.

Arnd
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Re: [PATCH] drm/amd/display: remove unnecessary conversion to bool

2021-02-25 Thread Alex Deucher

On Thu, Feb 25, 2021 at 4:19 AM Jiapeng Chong
 wrote:
>
> Fix the following coccicheck warnings:
>
> ./drivers/gpu/drm/amd/display/dc/dcn30/dcn30_dpp_cm.c:243:67-72:
> WARNING: conversion to bool not needed here.
>
> Reported-by: Abaci Robot 
> Signed-off-by: Jiapeng Chong 

Applied.  Thanks!

Alex

> ---
>  drivers/gpu/drm/amd/display/dc/dcn30/dcn30_dpp_cm.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/display/dc/dcn30/dcn30_dpp_cm.c 
> b/drivers/gpu/drm/amd/display/dc/dcn30/dcn30_dpp_cm.c
> index 3398540..102f6a0 100644
> --- a/drivers/gpu/drm/amd/display/dc/dcn30/dcn30_dpp_cm.c
> +++ b/drivers/gpu/drm/amd/display/dc/dcn30/dcn30_dpp_cm.c
> @@ -240,7 +240,7 @@ bool dpp3_program_gamcor_lut(
> next_mode = LUT_RAM_A;
>
> dpp3_power_on_gamcor_lut(dpp_base, true);
> -   dpp3_configure_gamcor_lut(dpp_base, next_mode == LUT_RAM_A ? 
> true:false);
> +   dpp3_configure_gamcor_lut(dpp_base, next_mode == LUT_RAM_A);
>
> if (next_mode == LUT_RAM_B) {
> gam_regs.start_cntl_b = REG(CM_GAMCOR_RAMB_START_CNTL_B);
> --
> 1.8.3.1
>
> ___
> dri-devel mailing list
> dri-de...@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Re: [PATCH] drm/amdgpu: Replace DEFINE_SIMPLE_ATTRIBUTE with DEFINE_DEBUGFS_ATTRIBUTE

2021-02-25 Thread Alex Deucher

On Thu, Feb 25, 2021 at 4:02 AM Yang Li  wrote:
>
> Fix the following coccicheck warning:
> ./drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c:1589:0-23: WARNING:
> fops_ib_preempt should be defined with DEFINE_DEBUGFS_ATTRIBUTE
> ./drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c:1592:0-23: WARNING:
> fops_sclk_set should be defined with DEFINE_DEBUGFS_ATTRIBUTE
>
> Reported-by: Abaci Robot 
> Signed-off-by: Yang Li 

Applied.  Thanks!

Alex


> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> index 0a25fec..52ef488 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> @@ -1586,10 +1586,10 @@ static int amdgpu_debugfs_sclk_set(void *data, u64 
> val)
> return 0;
>  }
>
> -DEFINE_SIMPLE_ATTRIBUTE(fops_ib_preempt, NULL,
> +DEFINE_DEBUGFS_ATTRIBUTE(fops_ib_preempt, NULL,
> amdgpu_debugfs_ib_preempt, "%llu\n");
>
> -DEFINE_SIMPLE_ATTRIBUTE(fops_sclk_set, NULL,
> +DEFINE_DEBUGFS_ATTRIBUTE(fops_sclk_set, NULL,
> amdgpu_debugfs_sclk_set, "%llu\n");
>
>  int amdgpu_debugfs_init(struct amdgpu_device *adev)
> --
> 1.8.3.1
>
> ___
> dri-devel mailing list
> dri-de...@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Re: [PATCH v2] drm/scheduler: Fix hang when sched_entity released

2021-02-25 Thread Andrey Grodzovsky



On 2021-02-25 1:42 p.m., Christian König wrote:



Am 25.02.21 um 17:03 schrieb Andrey Grodzovsky:


On 2021-02-25 2:53 a.m., Christian König wrote:

Am 24.02.21 um 16:13 schrieb Andrey Grodzovsky:

Ping


Sorry, I've been on vacation this week.



Andrey

On 2021-02-20 7:12 a.m., Andrey Grodzovsky wrote:


On 2/20/21 3:38 AM, Christian König wrote:



Am 18.02.21 um 17:41 schrieb Andrey Grodzovsky:


On 2/18/21 10:15 AM, Christian König wrote:

Am 18.02.21 um 16:05 schrieb Andrey Grodzovsky:


On 2/18/21 3:07 AM, Christian König wrote:



Am 17.02.21 um 22:59 schrieb Andrey Grodzovsky:
Problem: If scheduler is already stopped by the time 
sched_entity

is released and entity's job_queue not empty I encountred
a hang in drm_sched_entity_flush. This is because 
drm_sched_entity_is_idle

never becomes false.

Fix: In drm_sched_fini detach all sched_entities from the
scheduler's run queues. This will satisfy 
drm_sched_entity_is_idle.

Also wakeup all those processes stuck in sched_entity flushing
as the scheduler main thread which wakes them up is stopped 
by now.


v2:
Reverse order of drm_sched_rq_remove_entity and marking
s_entity as stopped to prevent reinserion back to rq due
to race.

Signed-off-by: Andrey Grodzovsky 
---
  drivers/gpu/drm/scheduler/sched_main.c | 31 
+++

  1 file changed, 31 insertions(+)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
b/drivers/gpu/drm/scheduler/sched_main.c

index 908b0b5..c6b7947 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -897,9 +897,40 @@ EXPORT_SYMBOL(drm_sched_init);
   */
  void drm_sched_fini(struct drm_gpu_scheduler *sched)
  {
+    int i;
+    struct drm_sched_entity *s_entity;


BTW: Please order that so that i is declared last.


  if (sched->thread)
  kthread_stop(sched->thread);
  +    /* Detach all sched_entites from this scheduler once 
it's stopped */
+    for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= 
DRM_SCHED_PRIORITY_MIN; i--) {

+    struct drm_sched_rq *rq = &sched->sched_rq[i];
+
+    if (!rq)
+    continue;
+
+    /* Loop this way because rq->lock is taken in 
drm_sched_rq_remove_entity */

+    spin_lock(&rq->lock);
+    while ((s_entity = 
list_first_entry_or_null(&rq->entities,

+    struct drm_sched_entity,
+    list))) {
+    spin_unlock(&rq->lock);
+
+    /* Prevent reinsertion and remove */
+ spin_lock(&s_entity->rq_lock);
+    s_entity->stopped = true;
+    drm_sched_rq_remove_entity(rq, s_entity);
+ spin_unlock(&s_entity->rq_lock);


Well this spin_unlock/lock dance here doesn't look correct at 
all now.


Christian.



In what way ? It's in the same same order as in other call 
sites (see drm_sched_entity_push_job and drm_sched_entity_flush).
If i just locked rq->lock and did list_for_each_entry_safe 
while manually deleting entity->list instead of calling
drm_sched_rq_remove_entity this still would not be possible as 
the order of lock acquisition between s_entity->rq_lock
and rq->lock would be reverse compared to the call sites 
mentioned above.


Ah, now I understand. You need this because 
drm_sched_rq_remove_entity() will grab the rq lock again!


Problem is now what prevents the entity from being destroyed 
while you remove it?


Christian.


Right, well, since (unfortunately) sched_entity is part of 
amdgpu_ctx_entity and amdgpu_ctx_entity is refcounted
there is a problem here that we don't increment 
amdgpu_ctx.refcount when assigning  sched_entity
to new rq (e.g. before drm_sched_rq_add_entity) and not 
decrement before removing. We do it for
amdgpu_cs_parser.entity for example (in amdgpu_cs_parser_init 
and amdgpu_cs_parser_fini by
calling amdgpu_ctx_get and amdgpu_ctx_put). But this seems a bit 
tricky due to all the drm_sched_entity_select_rq

logic.

Another, kind of a band aid fix, would probably be just locking 
amdgpu_ctx_mgr.lock around drm_sched_fini
when finalizing the fence driver and around idr iteration in 
amdgpu_ctx_mgr_fini (which should be lock protected
anyway as I see from other idr usages in the code) ... This 
should prevent this use after free.


Puh, that's rather complicated as well. Ok let's look at it from 
the other side for a moment.


Why do we have to remove the entities from the rq in the first 
place?


Wouldn't it be sufficient to just set all of them to stopped?

Christian.



And adding it as another  condition in drm_sched_entity_is_idle ?


Yes, I think that should work.



In this case reverse locking order is created, In 
drm_sched_entity_push_job and drm_sched_entity_flush lock 
entity->rq_lock locked first and rq->lock locked second. In 
drm_sched_fini on the other hand, I need to lock rq->lock first to 
iterate rq->entities and then lock s_entity->rq_lock for each entity 
to modify s_entity->stopped. I guess we could change 
s_entity->stopped to atomic ?


Good point.

Re: [PATCH] drm/amdgpu: add missing df counter disable write

2021-02-25 Thread Alex Deucher

On Tue, Feb 23, 2021 at 4:34 PM Jonathan Kim  wrote:
>
> Request to stop DF performance counters is missing the actual write to the
> controller register.
>
> Reported-by: Chris Freehill 
> Signed-off-by: Jonathan Kim 

Acked-by: Alex Deucher 

> ---
>  drivers/gpu/drm/amd/amdgpu/df_v3_6.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/df_v3_6.c 
> b/drivers/gpu/drm/amd/amdgpu/df_v3_6.c
> index 6b4b30a8dce5..44109a6b8f44 100644
> --- a/drivers/gpu/drm/amd/amdgpu/df_v3_6.c
> +++ b/drivers/gpu/drm/amd/amdgpu/df_v3_6.c
> @@ -568,6 +568,8 @@ static int df_v3_6_pmc_stop(struct amdgpu_device *adev, 
> uint64_t config,
> if (ret)
> return ret;
>
> +   df_v3_6_perfmon_wreg(adev, lo_base_addr, lo_val,
> +   hi_base_addr, hi_val);
>
> if (is_remove) {
> df_v3_6_reset_perfmon_cntr(adev, config, counter_idx);
> --
> 2.17.1
>
> ___
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Re: [PATCH 1/5] drm/amdgpu: add asic callback for querying video codec info (v3)

2021-02-25 Thread Christian König


Am 25.02.21 um 21:16 schrieb Alex Deucher:

This will be used by a new INFO ioctl query to fetch the decode
and encode capabilities from the kernel driver rather than
hardcoding them in mesa.  This gives us more fine grained control
of capabilities using information that is only availabl in the
kernel (e.g., platform limitations or bandwidth restrictions).

v2: reorder the codecs to better align with mesa
v3: add max_pixels_per_frame to handle the portrait case

Reviewed-by: Leo Liu  (v2)
Signed-off-by: Alex Deucher 


Reviewed-by: Christian König  for the series.


---
  drivers/gpu/drm/amd/amdgpu/amdgpu.h | 26 ++
  1 file changed, 26 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 22e5d9f284c3..09aec16c8feb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -583,6 +583,28 @@ enum amd_reset_method {
AMD_RESET_METHOD_PCI,
  };
  
+#define AMDGPU_VIDEO_CODEC_TYPE_MPEG2			0

+#define AMDGPU_VIDEO_CODEC_TYPE_MPEG4  1
+#define AMDGPU_VIDEO_CODEC_TYPE_VC12
+#define AMDGPU_VIDEO_CODEC_TYPE_MPEG4_AVC  3
+#define AMDGPU_VIDEO_CODEC_TYPE_HEVC   4
+#define AMDGPU_VIDEO_CODEC_TYPE_JPEG   5
+#define AMDGPU_VIDEO_CODEC_TYPE_VP96
+#define AMDGPU_VIDEO_CODEC_TYPE_AV17
+
+struct amdgpu_video_codec_info {
+   u32 codec_type;
+   u32 max_width;
+   u32 max_height;
+   u32 max_pixels_per_frame;
+   u32 max_level;
+};
+
+struct amdgpu_video_codecs {
+   const u32 codec_count;
+   const struct amdgpu_video_codec_info *codec_array;
+};
+
  /*
   * ASIC specific functions.
   */
@@ -627,6 +649,9 @@ struct amdgpu_asic_funcs {
void (*pre_asic_init)(struct amdgpu_device *adev);
/* enter/exit umd stable pstate */
int (*update_umd_stable_pstate)(struct amdgpu_device *adev, bool enter);
+   /* query video codecs */
+   int (*query_video_codecs)(struct amdgpu_device *adev, bool encode,
+ const struct amdgpu_video_codecs **codecs);
  };
  
  /*

@@ -1221,6 +1246,7 @@ int emu_soc_asic_init(struct amdgpu_device *adev);
  #define amdgpu_asic_pre_asic_init(adev) 
(adev)->asic_funcs->pre_asic_init((adev))
  #define amdgpu_asic_update_umd_stable_pstate(adev, enter) \
((adev)->asic_funcs->update_umd_stable_pstate ? 
(adev)->asic_funcs->update_umd_stable_pstate((adev), (enter)) : 0)
+#define amdgpu_asic_query_video_codecs(adev, e, c) 
(adev)->asic_funcs->query_video_codecs((adev), (e), (c))
  
  #define amdgpu_inc_vram_lost(adev) atomic_inc(&((adev)->vram_lost_counter));
  


___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

[PATCH 5/5] drm/amdgpu/codec: drop the internal codec index

2021-02-25 Thread Alex Deucher

And just use the ioctl index.  They are the same.

Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h |  9 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 16 -
 drivers/gpu/drm/amd/amdgpu/cik.c| 12 ---
 drivers/gpu/drm/amd/amdgpu/nv.c | 36 ++-
 drivers/gpu/drm/amd/amdgpu/si.c | 12 ---
 drivers/gpu/drm/amd/amdgpu/soc15.c  | 46 +
 drivers/gpu/drm/amd/amdgpu/vi.c | 28 ---
 7 files changed, 80 insertions(+), 79 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 09aec16c8feb..9ec99a2df666 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -583,15 +583,6 @@ enum amd_reset_method {
AMD_RESET_METHOD_PCI,
 };
 
-#define AMDGPU_VIDEO_CODEC_TYPE_MPEG2  0
-#define AMDGPU_VIDEO_CODEC_TYPE_MPEG4  1
-#define AMDGPU_VIDEO_CODEC_TYPE_VC12
-#define AMDGPU_VIDEO_CODEC_TYPE_MPEG4_AVC  3
-#define AMDGPU_VIDEO_CODEC_TYPE_HEVC   4
-#define AMDGPU_VIDEO_CODEC_TYPE_JPEG   5
-#define AMDGPU_VIDEO_CODEC_TYPE_VP96
-#define AMDGPU_VIDEO_CODEC_TYPE_AV17
-
 struct amdgpu_video_codec_info {
u32 codec_type;
u32 max_width;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
index 9f35e8a6c421..a5ed84bc83f7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
@@ -1012,14 +1012,14 @@ int amdgpu_info_ioctl(struct drm_device *dev, void 
*data, struct drm_file *filp)
int idx = codecs->codec_array[i].codec_type;
 
switch (idx) {
-   case AMDGPU_VIDEO_CODEC_TYPE_MPEG2:
-   case AMDGPU_VIDEO_CODEC_TYPE_MPEG4:
-   case AMDGPU_VIDEO_CODEC_TYPE_MPEG4_AVC:
-   case AMDGPU_VIDEO_CODEC_TYPE_VC1:
-   case AMDGPU_VIDEO_CODEC_TYPE_HEVC:
-   case AMDGPU_VIDEO_CODEC_TYPE_JPEG:
-   case AMDGPU_VIDEO_CODEC_TYPE_VP9:
-   case AMDGPU_VIDEO_CODEC_TYPE_AV1:
+   case AMDGPU_INFO_VIDEO_CAPS_CODEC_IDX_MPEG2:
+   case AMDGPU_INFO_VIDEO_CAPS_CODEC_IDX_MPEG4:
+   case AMDGPU_INFO_VIDEO_CAPS_CODEC_IDX_VC1:
+   case AMDGPU_INFO_VIDEO_CAPS_CODEC_IDX_MPEG4_AVC:
+   case AMDGPU_INFO_VIDEO_CAPS_CODEC_IDX_HEVC:
+   case AMDGPU_INFO_VIDEO_CAPS_CODEC_IDX_JPEG:
+   case AMDGPU_INFO_VIDEO_CAPS_CODEC_IDX_VP9:
+   case AMDGPU_INFO_VIDEO_CAPS_CODEC_IDX_AV1:
caps->codec_info[idx].valid = 1;
caps->codec_info[idx].max_width =
codecs->codec_array[i].max_width;
diff --git a/drivers/gpu/drm/amd/amdgpu/cik.c b/drivers/gpu/drm/amd/amdgpu/cik.c
index 72abfad2fd67..c0fcc41ee574 100644
--- a/drivers/gpu/drm/amd/amdgpu/cik.c
+++ b/drivers/gpu/drm/amd/amdgpu/cik.c
@@ -26,6 +26,8 @@
 #include 
 #include 
 
+#include 
+
 #include "amdgpu.h"
 #include "amdgpu_atombios.h"
 #include "amdgpu_ih.h"
@@ -73,7 +75,7 @@
 static const struct amdgpu_video_codec_info cik_video_codecs_encode_array[] =
 {
{
-   .codec_type = AMDGPU_VIDEO_CODEC_TYPE_MPEG4_AVC,
+   .codec_type = AMDGPU_INFO_VIDEO_CAPS_CODEC_IDX_MPEG4_AVC,
.max_width = 2048,
.max_height = 1152,
.max_pixels_per_frame = 2048 * 1152,
@@ -90,28 +92,28 @@ static const struct amdgpu_video_codecs 
cik_video_codecs_encode =
 static const struct amdgpu_video_codec_info cik_video_codecs_decode_array[] =
 {
{
-   .codec_type = AMDGPU_VIDEO_CODEC_TYPE_MPEG2,
+   .codec_type = AMDGPU_INFO_VIDEO_CAPS_CODEC_IDX_MPEG2,
.max_width = 2048,
.max_height = 1152,
.max_pixels_per_frame = 2048 * 1152,
.max_level = 3,
},
{
-   .codec_type = AMDGPU_VIDEO_CODEC_TYPE_MPEG4,
+   .codec_type = AMDGPU_INFO_VIDEO_CAPS_CODEC_IDX_MPEG4,
.max_width = 2048,
.max_height = 1152,
.max_pixels_per_frame = 2048 * 1152,
.max_level = 5,
},
{
-   .codec_type = AMDGPU_VIDEO_CODEC_TYPE_MPEG4_AVC,
+   .codec_type = AMDGPU_INFO_VIDEO_CAPS_CODEC_IDX_MPEG4_AVC,
.max_width = 2048,
.max_height = 1152,
.max_pixels_per_frame = 2048 * 1152,
.max_level = 41,
},
{
-   .codec_type = AMDGPU_VIDEO_CODEC_TYPE_VC1,
+   .codec_type = AMDGPU_INFO_VIDEO_CAPS_CODEC_I

[PATCH 2/5] drm/amdgpu: add video decode/encode cap tables and asic callbacks (v3)

2021-02-25 Thread Alex Deucher

For each asic family.  Will be used to populate tables
for the new INFO ioctl query.

v2: add max_pixels_per_frame to handle the portrait case
v3: fix copy paste typos

Reviewed-by: Leo Liu  (v1)
Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/cik.c   |  75 ++
 drivers/gpu/drm/amd/amdgpu/nv.c| 179 ++
 drivers/gpu/drm/amd/amdgpu/si.c| 109 ++
 drivers/gpu/drm/amd/amdgpu/soc15.c | 230 +
 drivers/gpu/drm/amd/amdgpu/vi.c| 188 +++
 5 files changed, 781 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/cik.c b/drivers/gpu/drm/amd/amdgpu/cik.c
index 4d6832cc7fb0..72abfad2fd67 100644
--- a/drivers/gpu/drm/amd/amdgpu/cik.c
+++ b/drivers/gpu/drm/amd/amdgpu/cik.c
@@ -70,6 +70,80 @@
 #include "amdgpu_amdkfd.h"
 #include "dce_virtual.h"
 
+static const struct amdgpu_video_codec_info cik_video_codecs_encode_array[] =
+{
+   {
+   .codec_type = AMDGPU_VIDEO_CODEC_TYPE_MPEG4_AVC,
+   .max_width = 2048,
+   .max_height = 1152,
+   .max_pixels_per_frame = 2048 * 1152,
+   .max_level = 0,
+   },
+};
+
+static const struct amdgpu_video_codecs cik_video_codecs_encode =
+{
+   .codec_count = ARRAY_SIZE(cik_video_codecs_encode_array),
+   .codec_array = cik_video_codecs_encode_array,
+};
+
+static const struct amdgpu_video_codec_info cik_video_codecs_decode_array[] =
+{
+   {
+   .codec_type = AMDGPU_VIDEO_CODEC_TYPE_MPEG2,
+   .max_width = 2048,
+   .max_height = 1152,
+   .max_pixels_per_frame = 2048 * 1152,
+   .max_level = 3,
+   },
+   {
+   .codec_type = AMDGPU_VIDEO_CODEC_TYPE_MPEG4,
+   .max_width = 2048,
+   .max_height = 1152,
+   .max_pixels_per_frame = 2048 * 1152,
+   .max_level = 5,
+   },
+   {
+   .codec_type = AMDGPU_VIDEO_CODEC_TYPE_MPEG4_AVC,
+   .max_width = 2048,
+   .max_height = 1152,
+   .max_pixels_per_frame = 2048 * 1152,
+   .max_level = 41,
+   },
+   {
+   .codec_type = AMDGPU_VIDEO_CODEC_TYPE_VC1,
+   .max_width = 2048,
+   .max_height = 1152,
+   .max_pixels_per_frame = 2048 * 1152,
+   .max_level = 4,
+   },
+};
+
+static const struct amdgpu_video_codecs cik_video_codecs_decode =
+{
+   .codec_count = ARRAY_SIZE(cik_video_codecs_decode_array),
+   .codec_array = cik_video_codecs_decode_array,
+};
+
+static int cik_query_video_codecs(struct amdgpu_device *adev, bool encode,
+ const struct amdgpu_video_codecs **codecs)
+{
+   switch (adev->asic_type) {
+   case CHIP_BONAIRE:
+   case CHIP_HAWAII:
+   case CHIP_KAVERI:
+   case CHIP_KABINI:
+   case CHIP_MULLINS:
+   if (encode)
+   *codecs = &cik_video_codecs_encode;
+   else
+   *codecs = &cik_video_codecs_decode;
+   return 0;
+   default:
+   return -EINVAL;
+   }
+}
+
 /*
  * Indirect registers accessor
  */
@@ -1933,6 +2007,7 @@ static const struct amdgpu_asic_funcs cik_asic_funcs =
.get_pcie_replay_count = &cik_get_pcie_replay_count,
.supports_baco = &cik_asic_supports_baco,
.pre_asic_init = &cik_pre_asic_init,
+   .query_video_codecs = &cik_query_video_codecs,
 };
 
 static int cik_common_early_init(void *handle)
diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c b/drivers/gpu/drm/amd/amdgpu/nv.c
index 160fa5f59805..c7ea779e7d6e 100644
--- a/drivers/gpu/drm/amd/amdgpu/nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/nv.c
@@ -65,6 +65,184 @@
 
 static const struct amd_ip_funcs nv_common_ip_funcs;
 
+/* Navi */
+static const struct amdgpu_video_codec_info nv_video_codecs_encode_array[] =
+{
+   {
+   .codec_type = AMDGPU_VIDEO_CODEC_TYPE_MPEG4_AVC,
+   .max_width = 4096,
+   .max_height = 2304,
+   .max_pixels_per_frame = 4096 * 2304,
+   .max_level = 0,
+   },
+   {
+   .codec_type = AMDGPU_VIDEO_CODEC_TYPE_HEVC,
+   .max_width = 4096,
+   .max_height = 2304,
+   .max_pixels_per_frame = 4096 * 2304,
+   .max_level = 0,
+   },
+};
+
+static const struct amdgpu_video_codecs nv_video_codecs_encode =
+{
+   .codec_count = ARRAY_SIZE(nv_video_codecs_encode_array),
+   .codec_array = nv_video_codecs_encode_array,
+};
+
+/* Navi1x */
+static const struct amdgpu_video_codec_info nv_video_codecs_decode_array[] =
+{
+   {
+   .codec_type = AMDGPU_VIDEO_CODEC_TYPE_MPEG2,
+   .max_width = 4096,
+   .max_height = 4096,
+   .max_pixels_per_frame = 4096 * 4096,
+   .max_level = 3,
+   },
+   {
+   .codec_type =

[PATCH 4/5] drm/amdgpu: bump driver version for new video codec INFO ioctl query

2021-02-25 Thread Alex Deucher

So mesa can check when to query the kernel vs use hardcoded
codec bandwidth data.

Reviewed-by: Leo Liu 
Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 4575192d9b08..4acfb136a295 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -90,9 +90,10 @@
  * - 3.38.0 - Add AMDGPU_IB_FLAG_EMIT_MEM_SYNC
  * - 3.39.0 - DMABUF implicit sync does a full pipeline sync
  * - 3.40.0 - Add AMDGPU_IDS_FLAGS_TMZ
+ * - 3.41.0 - Add video codec query
  */
 #define KMS_DRIVER_MAJOR   3
-#define KMS_DRIVER_MINOR   40
+#define KMS_DRIVER_MINOR   41
 #define KMS_DRIVER_PATCHLEVEL  0
 
 int amdgpu_vram_limit;
-- 
2.29.2

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

[PATCH 3/5] drm/amdgpu: add INFO ioctl support for querying video caps (v4)

2021-02-25 Thread Alex Deucher

We currently hardcode these in mesa, but querying them from
the kernel makes more sense since there may be board specific
limitations that the kernel driver is better suited to
determining.

Userpace patches that use this interface:
https://gitlab.freedesktop.org/leoliu/drm/-/commits/info_video_caps
https://gitlab.freedesktop.org/leoliu/mesa/-/commits/info_video_caps

v2: reorder the codecs to better align with mesa
v3: add max_pixels_per_frame to handle the portrait case, squash in
memory leak fix
v4: drop extra break

Reviewed-by: Leo Liu  (v2)
Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 57 +
 include/uapi/drm/amdgpu_drm.h   | 34 +++
 2 files changed, 91 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
index cf81ade31703..9f35e8a6c421 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
@@ -982,6 +982,63 @@ int amdgpu_info_ioctl(struct drm_device *dev, void *data, 
struct drm_file *filp)
min_t(u64, size, sizeof(ras_mask))) ?
-EFAULT : 0;
}
+   case AMDGPU_INFO_VIDEO_CAPS: {
+   const struct amdgpu_video_codecs *codecs;
+   struct drm_amdgpu_info_video_caps *caps;
+   int r;
+
+   switch (info->video_cap.type) {
+   case AMDGPU_INFO_VIDEO_CAPS_DECODE:
+   r = amdgpu_asic_query_video_codecs(adev, false, 
&codecs);
+   if (r)
+   return -EINVAL;
+   break;
+   case AMDGPU_INFO_VIDEO_CAPS_ENCODE:
+   r = amdgpu_asic_query_video_codecs(adev, true, &codecs);
+   if (r)
+   return -EINVAL;
+   break;
+   default:
+   DRM_DEBUG_KMS("Invalid request %d\n",
+ info->video_cap.type);
+   return -EINVAL;
+   }
+
+   caps = kzalloc(sizeof(*caps), GFP_KERNEL);
+   if (!caps)
+   return -ENOMEM;
+
+   for (i = 0; i < codecs->codec_count; i++) {
+   int idx = codecs->codec_array[i].codec_type;
+
+   switch (idx) {
+   case AMDGPU_VIDEO_CODEC_TYPE_MPEG2:
+   case AMDGPU_VIDEO_CODEC_TYPE_MPEG4:
+   case AMDGPU_VIDEO_CODEC_TYPE_MPEG4_AVC:
+   case AMDGPU_VIDEO_CODEC_TYPE_VC1:
+   case AMDGPU_VIDEO_CODEC_TYPE_HEVC:
+   case AMDGPU_VIDEO_CODEC_TYPE_JPEG:
+   case AMDGPU_VIDEO_CODEC_TYPE_VP9:
+   case AMDGPU_VIDEO_CODEC_TYPE_AV1:
+   caps->codec_info[idx].valid = 1;
+   caps->codec_info[idx].max_width =
+   codecs->codec_array[i].max_width;
+   caps->codec_info[idx].max_height =
+   codecs->codec_array[i].max_height;
+   caps->codec_info[idx].max_pixels_per_frame =
+   
codecs->codec_array[i].max_pixels_per_frame;
+   caps->codec_info[idx].max_level =
+   codecs->codec_array[i].max_level;
+   break;
+   default:
+   break;
+   }
+   }
+   r = copy_to_user(out, caps,
+min((size_t)size, sizeof(*caps))) ? -EFAULT : 
0;
+   kfree(caps);
+   return r;
+   }
default:
DRM_DEBUG_KMS("Invalid request %d\n", info->query);
return -EINVAL;
diff --git a/include/uapi/drm/amdgpu_drm.h b/include/uapi/drm/amdgpu_drm.h
index 7fb9c09ee93f..728566542f8a 100644
--- a/include/uapi/drm/amdgpu_drm.h
+++ b/include/uapi/drm/amdgpu_drm.h
@@ -782,6 +782,12 @@ struct drm_amdgpu_cs_chunk_data {
 #define AMDGPU_INFO_VRAM_LOST_COUNTER  0x1F
 /* query ras mask of enabled features*/
 #define AMDGPU_INFO_RAS_ENABLED_FEATURES   0x20
+/* query video encode/decode caps */
+#define AMDGPU_INFO_VIDEO_CAPS 0x21
+   /* Subquery id: Decode */
+   #define AMDGPU_INFO_VIDEO_CAPS_DECODE   0
+   /* Subquery id: Encode */
+   #define AMDGPU_INFO_VIDEO_CAPS_ENCODE   1
 
 /* RAS MASK: UMC (VRAM) */
 #define AMDGPU_INFO_RAS_ENABLED_UMC(1 << 0)
@@ -878,6 +884,10 @@ struct drm_amdgpu_info {
struct {
__u32 type;
} sensor_info;
+
+   struct {
+   __u32 type;
+   } vi

[PATCH 1/5] drm/amdgpu: add asic callback for querying video codec info (v3)

2021-02-25 Thread Alex Deucher

This will be used by a new INFO ioctl query to fetch the decode
and encode capabilities from the kernel driver rather than
hardcoding them in mesa.  This gives us more fine grained control
of capabilities using information that is only availabl in the
kernel (e.g., platform limitations or bandwidth restrictions).

v2: reorder the codecs to better align with mesa
v3: add max_pixels_per_frame to handle the portrait case

Reviewed-by: Leo Liu  (v2)
Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h | 26 ++
 1 file changed, 26 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 22e5d9f284c3..09aec16c8feb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -583,6 +583,28 @@ enum amd_reset_method {
AMD_RESET_METHOD_PCI,
 };
 
+#define AMDGPU_VIDEO_CODEC_TYPE_MPEG2  0
+#define AMDGPU_VIDEO_CODEC_TYPE_MPEG4  1
+#define AMDGPU_VIDEO_CODEC_TYPE_VC12
+#define AMDGPU_VIDEO_CODEC_TYPE_MPEG4_AVC  3
+#define AMDGPU_VIDEO_CODEC_TYPE_HEVC   4
+#define AMDGPU_VIDEO_CODEC_TYPE_JPEG   5
+#define AMDGPU_VIDEO_CODEC_TYPE_VP96
+#define AMDGPU_VIDEO_CODEC_TYPE_AV17
+
+struct amdgpu_video_codec_info {
+   u32 codec_type;
+   u32 max_width;
+   u32 max_height;
+   u32 max_pixels_per_frame;
+   u32 max_level;
+};
+
+struct amdgpu_video_codecs {
+   const u32 codec_count;
+   const struct amdgpu_video_codec_info *codec_array;
+};
+
 /*
  * ASIC specific functions.
  */
@@ -627,6 +649,9 @@ struct amdgpu_asic_funcs {
void (*pre_asic_init)(struct amdgpu_device *adev);
/* enter/exit umd stable pstate */
int (*update_umd_stable_pstate)(struct amdgpu_device *adev, bool enter);
+   /* query video codecs */
+   int (*query_video_codecs)(struct amdgpu_device *adev, bool encode,
+ const struct amdgpu_video_codecs **codecs);
 };
 
 /*
@@ -1221,6 +1246,7 @@ int emu_soc_asic_init(struct amdgpu_device *adev);
 #define amdgpu_asic_pre_asic_init(adev) 
(adev)->asic_funcs->pre_asic_init((adev))
 #define amdgpu_asic_update_umd_stable_pstate(adev, enter) \
((adev)->asic_funcs->update_umd_stable_pstate ? 
(adev)->asic_funcs->update_umd_stable_pstate((adev), (enter)) : 0)
+#define amdgpu_asic_query_video_codecs(adev, e, c) 
(adev)->asic_funcs->query_video_codecs((adev), (e), (c))
 
 #define amdgpu_inc_vram_lost(adev) atomic_inc(&((adev)->vram_lost_counter));
 
-- 
2.29.2

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Re: [PATCH] drm/amdgpu: add ih call to process until checkpoint

2021-02-25 Thread Christian König


Am 25.02.21 um 19:33 schrieb Felix Kuehling:

[SNIP]

This in turn can lead to starvation of the work handler and so a life
lock as well.




I won't touch rptr or wptr at all for this.

Not sure what's your idea here, using ih->lock. Is it to completely
drain all IRQs until the IH ring is completely empty?

Correct.


That can
live-lock, if the GPU produces IRQs faster than the kernel can process
them. Therefore I was looking at rptr and wptr to drain only IRQs that
were already in the queue when the drain call was made. That also
ensures that the wait time is bounded and should be short (unless the
ring size is huge).

Correct as well, but the problem here is that Jonathan's
implementation is not even remotely correct.

See when you look at the rptr and wptr you can't be sure that they
haven't wrapped around between two looks.

What you could do is look at both the rptr as well as the original
wptr, but that is tricky.

The code in Jon's patch was suggested by me. I check for wrapping by
comparing rptr with the rptr from the previous loop iteration. Comparing
rptr with wptr you can never be sure whether rptr has already passed
wptr, or whether rptr has to wrap first.


Exactly that's my concern as well.



I could see a compromise where we sleep and wake up the waiting threads when

  1. the IH ring is empty
  2. the IH rptr wraps

That should be good enough to ensure a quick response in the common case
(no interrupt storm), and a reasonable response in the interrupt storm case.

But then it's still not clear what's the correct condition for checking
that the interrupts the caller cares about have been drained. Looking at
just rptr and wptr is always ambiguous. Maybe we could use timestamps of
the last processed interrupt? Those 48-bit time stamps wrap much less
frequently. The idea is this:

   * At the start get the timestamp of the last written IH ring entry
 (checkpoint)
   * Wait until the last_processed IH timestamp passes the checkpoint:
 sign_extend(last_processed - checkpoint) >= 0


Yeah that could work. Alternatively we could just have a rptr wrap 
around counter.


This way you do something like this:
1. get the wrap around counter.
2. get wptr
3. get rptr
4. compare the wrap around counter with the old one, if it has changed 
start over with #1

5. Use wrap around counter and rptr/wptr to come up with 64bit values.
6. Compare wptr with rptr/wrap around counter until we are sure the IHs 
are processed.


Regards,
Christian.



Regards,
   Felix



Regards,
Christian.


Regards,
    Felix



Regards,
Christian.


Regards,
    Felix



Suggested-by: Felix Kuehling 
Signed-off-by: Jonathan Kim 
---
     drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c | 46

+-

drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h |  2 ++
     2 files changed, 47 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
index dc852af4f3b7..cae50af9559d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
@@ -22,7 +22,7 @@
  */

     #include 
-
+#include 
     #include "amdgpu.h"
     #include "amdgpu_ih.h"

@@ -160,6 +160,50 @@ void amdgpu_ih_ring_write(struct

amdgpu_ih_ring *ih, const uint32_t *iv,

     }
     }

+/**
+ * amdgpu_ih_wait_on_checkpoint_process - wait to process IVs
up to
+checkpoint
+ *
+ * @adev: amdgpu_device pointer
+ * @ih: ih ring to process
+ *
+ * Used to ensure ring has processed IVs up to the checkpoint
write

pointer.

+ */
+int amdgpu_ih_wait_on_checkpoint_process(struct amdgpu_device

*adev,

+struct amdgpu_ih_ring *ih)
+{
+u32 prev_rptr, cur_rptr, checkpoint_wptr;
+
+if (!ih->enabled || adev->shutdown)
+return -ENODEV;
+
+cur_rptr = READ_ONCE(ih->rptr);
+/* Order read of current rptr with checktpoint wptr. */
+mb();
+checkpoint_wptr = amdgpu_ih_get_wptr(adev, ih);
+
+/* allow rptr to wrap around  */
+if (cur_rptr > checkpoint_wptr) {
+spin_begin();
+do {
+spin_cpu_relax();
+prev_rptr = cur_rptr;
+cur_rptr = READ_ONCE(ih->rptr);
+} while (cur_rptr >= prev_rptr);
+spin_end();

That's a certain NAK since it busy waits for IH processing. We need
some
event to trigger here.

The function is meant to be just a waiter up to the checkpoint.
There's a need to guarantee that "stale" interrupts have been
processed on check before doing other stuff after call.
The description could be improved to clarify that.

Would busy waiting only on a locked ring help?  I assume an unlocked
ring means nothing to process so no need to wait and we can exit
early.  Or is it better to just to process the entries up to the
checkpoint (maybe adjust amdgpu_ih_process for this need like adding
a bool arg to skip restart or something)?

Thanks,

Jon


+}
+
+/* wait for rptr to catch up to or pass checkpoint. */
+spin_begin();
+do {
+spin_cpu_relax();
+prev_rptr = cur_rptr;
+cur_rptr = READ_ONCE(ih->rptr);
+} while (cur_rptr >= prev_rptr && cur_rptr < checkpoint_wptr);

Same of course here.

Chr

[PATCH 135/159] drm/amdgpu: workaround the TMR MC address issue (v2)

2021-02-25 Thread Alex Deucher

From: Oak Zeng 

With the 2-level gart page table,  vram is squeezed into gart aperture
and FB aperture is disabled. Therefore all VRAM virtual addresses are
 in the GART aperture. However currently PSP requires TMR addresses
in FB aperture. So we need some design change at PSP FW level to support
this 2-level gart table driver change. Right now this PSP FW support
doesn't exist. To workaround this issue temporarily, FB aperture is
added back and the gart aperture address is converted back to FB aperture
for this PSP TMR address.

Will revert it after we get a fix from PSP FW.

v2: squash in tmr fix for other asics (Kevin)

Signed-off-by: Oak Zeng 
Reviewed-by: Felix Kuehling 
Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h  |  9 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c  | 10 ++
 drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c | 21 +++--
 drivers/gpu/drm/amd/amdgpu/mmhub_v1_7.c  | 10 ++
 4 files changed, 40 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
index d5f3825cd479..cd4592ff70ae 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
@@ -208,6 +208,15 @@ struct amdgpu_gmc {
 */
u64 fb_start;
u64 fb_end;
+   /* In the case of use GART table for vmid0 FB access, [fb_start, fb_end]
+* will be squeezed to GART aperture. But we have a PSP FW issue to fix
+* for now. To temporarily workaround the PSP FW issue, added below two
+* variables to remember the original fb_start/end to re-enable FB
+* aperture to workaround the PSP FW issue. Will delete it after we
+* get a proper PSP FW fix.
+*/
+   u64 fb_start_original;
+   u64 fb_end_original;
unsignedvram_width;
u64 real_vram_size;
int vram_mtrr;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index cf8cfe620d8c..a4f96d931573 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -407,6 +407,16 @@ static int psp_tmr_init(struct psp_context *psp)
  AMDGPU_GEM_DOMAIN_VRAM,
  &psp->tmr_bo, &psp->tmr_mc_addr, pptr);
 
+   /* workaround the tmr_mc_addr:
+* PSP requires an address in FB aperture. Right now driver produce
+* tmr_mc_addr in the GART aperture. Convert it back to FB aperture
+* for PSP. Will revert it after we get a fix from PSP FW.
+*/
+   if (psp->adev->asic_type == CHIP_ALDEBARAN) {
+   psp->tmr_mc_addr -= psp->adev->gmc.fb_start;
+   psp->tmr_mc_addr += psp->adev->gmc.fb_start_original;
+   }
+
return ret;
 }
 
diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c
index 62019885bda5..d189507dcef0 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c
@@ -141,12 +141,21 @@ static void gfxhub_v1_0_init_system_aperture_regs(struct 
amdgpu_device *adev)
 * FB aperture and AGP aperture. Disable them.
 */
if (adev->gmc.pdb0_bo) {
-   WREG32_SOC15(GC, 0, mmMC_VM_FB_LOCATION_TOP, 0);
-   WREG32_SOC15(GC, 0, mmMC_VM_FB_LOCATION_BASE, 0x00FF);
-   WREG32_SOC15(GC, 0, mmMC_VM_AGP_TOP, 0);
-   WREG32_SOC15(GC, 0, mmMC_VM_AGP_BOT, 0xFF);
-   WREG32_SOC15(GC, 0, mmMC_VM_SYSTEM_APERTURE_LOW_ADDR, 
0x3FFF);
-   WREG32_SOC15(GC, 0, mmMC_VM_SYSTEM_APERTURE_HIGH_ADDR, 0);
+   if (adev->asic_type == CHIP_ALDEBARAN) {
+   WREG32_SOC15(GC, 0, mmMC_VM_FB_LOCATION_TOP, 
adev->gmc.fb_end_original >> 24);
+   WREG32_SOC15(GC, 0, mmMC_VM_FB_LOCATION_BASE, 
adev->gmc.fb_start_original >> 24);
+   WREG32_SOC15(GC, 0, mmMC_VM_AGP_TOP, 0);
+   WREG32_SOC15(GC, 0, mmMC_VM_AGP_BOT, 0xFF);
+   WREG32_SOC15(GC, 0, mmMC_VM_SYSTEM_APERTURE_LOW_ADDR, 
adev->gmc.fb_start_original >> 18);
+   WREG32_SOC15(GC, 0, mmMC_VM_SYSTEM_APERTURE_HIGH_ADDR, 
adev->gmc.fb_end_original >> 18);
+   } else {
+   WREG32_SOC15(GC, 0, mmMC_VM_FB_LOCATION_TOP, 0);
+   WREG32_SOC15(GC, 0, mmMC_VM_FB_LOCATION_BASE, 
0x00FF);
+   WREG32_SOC15(GC, 0, mmMC_VM_AGP_TOP, 0);
+   WREG32_SOC15(GC, 0, mmMC_VM_AGP_BOT, 0xFF);
+   WREG32_SOC15(GC, 0, mmMC_VM_SYSTEM_APERTURE_LOW_ADDR, 
0x3FFF);
+   WREG32_SOC15(GC, 0, mmMC_VM_SYSTEM_APERTURE_HIGH_ADDR, 
0);
+   }
}
 }

[PATCH 3/3] drm/amdgpu: harvest edc status when connected to host via xGMI

2021-02-25 Thread Alex Deucher

From: Dennis Li 

When connected to a host via xGMI, system fatal errors may trigger
warm reset, driver has no change to query edc status before reset.
Therefore in this case, driver should harvest previous error loging
registers during boot, instead of only resetting them.

v2:
1. IP's ras_manager object is created when its ras feature is enabled,
so change to query edc status after amdgpu_ras_late_init called

2. change to enable watchdog timer after finishing gfx edc init

Signed-off-by: Dennis Li 
Reivewed-by: Hawking Zhang 
Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c |  9 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 50 ++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  5 ++-
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c   | 12 ++
 drivers/gpu/drm/amd/amdgpu/gfx_v9_4_2.c | 54 ++---
 drivers/gpu/drm/amd/amdgpu/gfx_v9_4_2.h |  3 +-
 7 files changed, 107 insertions(+), 28 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
index 8e0a6c62322e..689addb1520d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
@@ -601,6 +601,7 @@ int amdgpu_gfx_ras_late_init(struct amdgpu_device *adev)
struct ras_ih_if ih_info = {
.cb = amdgpu_gfx_process_ras_data_cb,
};
+   struct ras_query_if info = { 0 };
 
if (!adev->gfx.ras_if) {
adev->gfx.ras_if = kmalloc(sizeof(struct ras_common_if), 
GFP_KERNEL);
@@ -612,13 +613,19 @@ int amdgpu_gfx_ras_late_init(struct amdgpu_device *adev)
strcpy(adev->gfx.ras_if->name, "gfx");
}
fs_info.head = ih_info.head = *adev->gfx.ras_if;
-
r = amdgpu_ras_late_init(adev, adev->gfx.ras_if,
 &fs_info, &ih_info);
if (r)
goto free;
 
if (amdgpu_ras_is_supported(adev, adev->gfx.ras_if->block)) {
+   if (adev->gmc.xgmi.connected_to_cpu) {
+   info.head = *adev->gfx.ras_if;
+   amdgpu_ras_query_error_status(adev, &info);
+   } else {
+   amdgpu_ras_reset_error_status(adev, 
AMDGPU_RAS_BLOCK__GFX);
+   }
+
r = amdgpu_irq_get(adev, &adev->gfx.cp_ecc_error_irq, 0);
if (r)
goto late_fini;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
index d92f0f14cbeb..38af93f501e1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
@@ -225,9 +225,9 @@ struct amdgpu_gfx_funcs {
void (*reset_ras_error_count) (struct amdgpu_device *adev);
void (*init_spm_golden)(struct amdgpu_device *adev);
void (*query_ras_error_status) (struct amdgpu_device *adev);
+   void (*reset_ras_error_status) (struct amdgpu_device *adev);
void (*update_perfmon_mgcg)(struct amdgpu_device *adev, bool enable);
void (*enable_watchdog_timer)(struct amdgpu_device *adev);
-   void (*query_sq_timeout_status)(struct amdgpu_device *adev);
 };
 
 struct sq_work {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 5805c78c356b..517e19fae34f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -109,7 +109,7 @@ static ssize_t amdgpu_ras_debugfs_read(struct file *f, char 
__user *buf,
ssize_t s;
char val[128];
 
-   if (amdgpu_ras_error_query(obj->adev, &info))
+   if (amdgpu_ras_query_error_status(obj->adev, &info))
return -EINVAL;
 
s = snprintf(val, sizeof(val), "%s: %lu\n%s: %lu\n",
@@ -434,7 +434,7 @@ static ssize_t amdgpu_ras_sysfs_read(struct device *dev,
return snprintf(buf, PAGE_SIZE,
"Query currently inaccessible\n");
 
-   if (amdgpu_ras_error_query(obj->adev, &info))
+   if (amdgpu_ras_query_error_status(obj->adev, &info))
return -EINVAL;
 
return snprintf(buf, PAGE_SIZE, "%s: %lu\n%s: %lu\n",
@@ -757,8 +757,8 @@ static int amdgpu_ras_enable_all_features(struct 
amdgpu_device *adev,
 /* feature ctl end */
 
 /* query/inject/cure begin */
-int amdgpu_ras_error_query(struct amdgpu_device *adev,
-   struct ras_query_if *info)
+int amdgpu_ras_query_error_status(struct amdgpu_device *adev,
+   struct ras_query_if *info)
 {
struct ras_manager *obj = amdgpu_ras_find_obj(adev, &info->head);
struct ras_err_data err_data = {0, 0, 0, NULL};
@@ -787,10 +787,16 @@ int amdgpu_ras_error_query(struct amdgpu_device *adev,
case AMDGPU_RAS_BLOCK__GFX:
if (adev->gfx.funcs->query_ras_error_count)
adev->gfx.funcs->query_ras_error_count(adev, &err_data);
+
+   if (adev->gfx.funcs->query_ra

[PATCH 2/3] drm/amdgpu: Make noretry the default on Aldebaran

2021-02-25 Thread Alex Deucher

From: Felix Kuehling 

This is needed for best machine learning performance. XNACK can still
be enabled per-process if needed.

Signed-off-by: Felix Kuehling 
Reviewed-by: Alex Deucher 
Reviewed-by: Philip Yang 
Tested-by: Alex Sierra 
Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
index 6d9c660da27a..8a64f5e49cb2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
@@ -508,6 +508,7 @@ void amdgpu_gmc_noretry_set(struct amdgpu_device *adev)
switch (adev->asic_type) {
case CHIP_VEGA10:
case CHIP_VEGA20:
+   case CHIP_ALDEBARAN:
/*
 * noretry = 0 will cause kfd page fault tests fail
 * for some ASICs, so set default to 1 for these ASICs.
-- 
2.29.2

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

[PATCH 1/3] drm/amdgpu: update default timeout of Aldebaran SQ watchdog

2021-02-25 Thread Alex Deucher

From: Harish Kasiviswanathan 

Signed-off-by: Harish Kasiviswanathan 
Reivewed-by: Hawking Zhang 
Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 4 ++--
 drivers/gpu/drm/amd/amdgpu/gfx_v9_4_2.c | 7 +++
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 42654deb9c55..29a2fc56e51b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -174,7 +174,7 @@ uint amdgpu_ras_mask = 0x;
 int amdgpu_bad_page_threshold = 100;
 struct amdgpu_watchdog_timer amdgpu_watchdog_timer = {
.timeout_fatal_disable = false,
-   .period = 0x3f, /* about 8s */
+   .period = 0x23, /* default to max. timeout = 1 << 0x23 cycles */
 };
 
 /**
@@ -542,7 +542,7 @@ module_param_named(timeout_fatal_disable, 
amdgpu_watchdog_timer.timeout_fatal_di
  * DOC: timeout_period (uint)
  * Modify the watchdog timeout max_cycles as (1 << period)
  */
-MODULE_PARM_DESC(timeout_period, "watchdog timeout period (0x1F = default), 
timeout maxCycles = (1 << period)");
+MODULE_PARM_DESC(timeout_period, "watchdog timeout period (1 to 0x23(default), 
timeout maxCycles = (1 << period)");
 module_param_named(timeout_period, amdgpu_watchdog_timer.period, uint, 0644);
 
 /**
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_2.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_2.c
index 1faeae14ead9..44024ab93577 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_2.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_2.c
@@ -1138,6 +1138,13 @@ void gfx_v9_4_2_enable_watchdog_timer(struct 
amdgpu_device *adev)
data = REG_SET_FIELD(0, SQ_TIMEOUT_CONFIG, TIMEOUT_FATAL_DISABLE,
 amdgpu_watchdog_timer.timeout_fatal_disable ? 1 :
   0);
+
+   if (amdgpu_watchdog_timer.timeout_fatal_disable &&
+   (amdgpu_watchdog_timer.period < 1 ||
+amdgpu_watchdog_timer.period > 0x23)) {
+   dev_warn(adev->dev, "Watchdog period range is 1 to 0x23\n");
+   amdgpu_watchdog_timer.period = 0x23;
+   }
data = REG_SET_FIELD(data, SQ_TIMEOUT_CONFIG, PERIOD_SEL,
 amdgpu_watchdog_timer.period);
 
-- 
2.29.2

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Re: [PATCH 147/159] drm/amdgpu: restore aldebaran save ttmp and trap config on init (v2)

2021-02-25 Thread Felix Kuehling


Am 2021-02-25 um 1:32 p.m. schrieb Deucher, Alexander:
>
> [AMD Official Use Only - Internal Distribution Only]
>
>
> I dropped the KFD debugger hunks and just added the gfx 9.4.2 changes
> since these were required for a bunch of later patches that build on
> that file that are not dependent on debugger. 

I see.


> I can rework the commit message if you'd like.

No, that's fine. I'd prefer to keep the original message to correlate it
with the remaining change on the DKMS branch.


Thanks,
  Felix


>
> Alex
>
> 
> *From:* Kuehling, Felix 
> *Sent:* Wednesday, February 24, 2021 10:22 PM
> *To:* Deucher, Alexander ;
> amd-gfx@lists.freedesktop.org 
> *Cc:* Kim, Jonathan 
> *Subject:* Re: [PATCH 147/159] drm/amdgpu: restore aldebaran save ttmp
> and trap config on init (v2)
>  
> This patch is for the debugger functionality that's not on
> amd-staging-drm-next yet. You can probably drop this patch for now.
>
> Regards,
>    Felix
>
> On 2021-02-24 5:18 p.m., Alex Deucher wrote:
> > From: Jonathan Kim 
> >
> > Initialization of TRAP_DATA0/1 is still required for the debugger to
> detect
> > new waves on Aldebaran.  Also, per-vmid global trap enablement may be
> > required outside of debugger scope so move to init phase.
> >
> > v2: just add the gfx 9.4.2 changes (Alex)
> >
> > Signed-off-by: Jonathan Kim 
> > Reviewed-by: Felix Kuehling 
> > Signed-off-by: Alex Deucher 
> > ---
> >   drivers/gpu/drm/amd/amdgpu/Makefile |  1 +
> >   drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c   |  1 +
> >   drivers/gpu/drm/amd/amdgpu/gfx_v9_4_2.c | 50 +
> >   drivers/gpu/drm/amd/amdgpu/gfx_v9_4_2.h | 30 +++
> >   4 files changed, 82 insertions(+)
> >   create mode 100644 drivers/gpu/drm/amd/amdgpu/gfx_v9_4_2.c
> >   create mode 100644 drivers/gpu/drm/amd/amdgpu/gfx_v9_4_2.h
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/Makefile
> b/drivers/gpu/drm/amd/amdgpu/Makefile
> > index c5ec926bc6d5..741b68874e53 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/Makefile
> > +++ b/drivers/gpu/drm/amd/amdgpu/Makefile
> > @@ -123,6 +123,7 @@ amdgpu-y += \
> >    gfx_v8_0.o \
> >    gfx_v9_0.o \
> >    gfx_v9_4.o \
> > + gfx_v9_4_2.o \
> >    gfx_v10_0.o
> >  
> >   # add async DMA block
> > diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > index 5bac5659e707..78bb4e28c27c 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > @@ -49,6 +49,7 @@
> >  
> >   #include "gfx_v9_4.h"
> >   #include "gfx_v9_0.h"
> > +#include "gfx_v9_4_2.h"
> >  
> >   #include "asic_reg/pwr/pwr_10_0_offset.h"
> >   #include "asic_reg/pwr/pwr_10_0_sh_mask.h"
> > diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_2.c
> b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_2.c
> > new file mode 100644
> > index ..0c2ccbe327ab
> > --- /dev/null
> > +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_2.c
> > @@ -0,0 +1,50 @@
> > +/*
> > + * Copyright 2020 Advanced Micro Devices, Inc.
> > + *
> > + * Permission is hereby granted, free of charge, to any person
> obtaining a
> > + * copy of this software and associated documentation files (the
> "Software"),
> > + * to deal in the Software without restriction, including without
> limitation
> > + * the rights to use, copy, modify, merge, publish, distribute,
> sublicense,
> > + * and/or sell copies of the Software, and to permit persons to
> whom the
> > + * Software is furnished to do so, subject to the following conditions:
> > + *
> > + * The above copyright notice and this permission notice shall be
> included in
> > + * all copies or substantial portions of the Software.
> > + *
> > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> EXPRESS OR
> > + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> MERCHANTABILITY,
> > + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO
> EVENT SHALL
> > + * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM,
> DAMAGES OR
> > + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
> OTHERWISE,
> > + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE
> USE OR
> > + * OTHER DEALINGS IN THE SOFTWARE.
> > + *
> > + */
> > +#include "amdgpu.h"
> > +#include "soc15.h"
> > +
> > +#include "gc/gc_9_4_2_offset.h"
> > +#include "gc/gc_9_4_2_sh_mask.h"
> > +
> > +void gfx_v9_4_2_debug_trap_config_init(struct amdgpu_device *adev,
> > + uint32_t first_vmid,
> > + uint32_t last_vmid)
> > +{
> > + uint32_t data;
> > + int i;
> > +
> > + mutex_lock(&adev->srbm_mutex);
> > +
> > + for (i = first_vmid; i < last_vmid; i++) {
> > + data = 0;
> > + soc15_grbm_select(adev, 0, 0, 0, i);
> > + data = REG_SET_FIELD(data, SPI_GDBG_PER_VMID_CNTL,
> TRAP_EN, 1);
> > + data = REG_SET_FIELD(data, SPI_GDBG_PER_VMID_CNT

Re: [PATCH v2] drm/scheduler: Fix hang when sched_entity released

2021-02-25 Thread Christian König




Am 25.02.21 um 17:03 schrieb Andrey Grodzovsky:


On 2021-02-25 2:53 a.m., Christian König wrote:

Am 24.02.21 um 16:13 schrieb Andrey Grodzovsky:

Ping


Sorry, I've been on vacation this week.



Andrey

On 2021-02-20 7:12 a.m., Andrey Grodzovsky wrote:


On 2/20/21 3:38 AM, Christian König wrote:



Am 18.02.21 um 17:41 schrieb Andrey Grodzovsky:


On 2/18/21 10:15 AM, Christian König wrote:

Am 18.02.21 um 16:05 schrieb Andrey Grodzovsky:


On 2/18/21 3:07 AM, Christian König wrote:



Am 17.02.21 um 22:59 schrieb Andrey Grodzovsky:
Problem: If scheduler is already stopped by the time 
sched_entity

is released and entity's job_queue not empty I encountred
a hang in drm_sched_entity_flush. This is because 
drm_sched_entity_is_idle

never becomes false.

Fix: In drm_sched_fini detach all sched_entities from the
scheduler's run queues. This will satisfy 
drm_sched_entity_is_idle.

Also wakeup all those processes stuck in sched_entity flushing
as the scheduler main thread which wakes them up is stopped 
by now.


v2:
Reverse order of drm_sched_rq_remove_entity and marking
s_entity as stopped to prevent reinserion back to rq due
to race.

Signed-off-by: Andrey Grodzovsky 
---
  drivers/gpu/drm/scheduler/sched_main.c | 31 
+++

  1 file changed, 31 insertions(+)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
b/drivers/gpu/drm/scheduler/sched_main.c

index 908b0b5..c6b7947 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -897,9 +897,40 @@ EXPORT_SYMBOL(drm_sched_init);
   */
  void drm_sched_fini(struct drm_gpu_scheduler *sched)
  {
+    int i;
+    struct drm_sched_entity *s_entity;


BTW: Please order that so that i is declared last.


  if (sched->thread)
  kthread_stop(sched->thread);
  +    /* Detach all sched_entites from this scheduler once 
it's stopped */
+    for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= 
DRM_SCHED_PRIORITY_MIN; i--) {

+    struct drm_sched_rq *rq = &sched->sched_rq[i];
+
+    if (!rq)
+    continue;
+
+    /* Loop this way because rq->lock is taken in 
drm_sched_rq_remove_entity */

+    spin_lock(&rq->lock);
+    while ((s_entity = 
list_first_entry_or_null(&rq->entities,

+    struct drm_sched_entity,
+    list))) {
+    spin_unlock(&rq->lock);
+
+    /* Prevent reinsertion and remove */
+ spin_lock(&s_entity->rq_lock);
+    s_entity->stopped = true;
+    drm_sched_rq_remove_entity(rq, s_entity);
+ spin_unlock(&s_entity->rq_lock);


Well this spin_unlock/lock dance here doesn't look correct at 
all now.


Christian.



In what way ? It's in the same same order as in other call 
sites (see drm_sched_entity_push_job and drm_sched_entity_flush).
If i just locked rq->lock and did list_for_each_entry_safe 
while manually deleting entity->list instead of calling
drm_sched_rq_remove_entity this still would not be possible as 
the order of lock acquisition between s_entity->rq_lock
and rq->lock would be reverse compared to the call sites 
mentioned above.


Ah, now I understand. You need this because 
drm_sched_rq_remove_entity() will grab the rq lock again!


Problem is now what prevents the entity from being destroyed 
while you remove it?


Christian.


Right, well, since (unfortunately) sched_entity is part of 
amdgpu_ctx_entity and amdgpu_ctx_entity is refcounted
there is a problem here that we don't increment 
amdgpu_ctx.refcount when assigning  sched_entity
to new rq (e.g. before drm_sched_rq_add_entity) and not decrement 
before removing. We do it for
amdgpu_cs_parser.entity for example (in amdgpu_cs_parser_init and 
amdgpu_cs_parser_fini by
calling amdgpu_ctx_get and amdgpu_ctx_put). But this seems a bit 
tricky due to all the drm_sched_entity_select_rq

logic.

Another, kind of a band aid fix, would probably be just locking 
amdgpu_ctx_mgr.lock around drm_sched_fini
when finalizing the fence driver and around idr iteration in 
amdgpu_ctx_mgr_fini (which should be lock protected
anyway as I see from other idr usages in the code) ... This 
should prevent this use after free.


Puh, that's rather complicated as well. Ok let's look at it from 
the other side for a moment.


Why do we have to remove the entities from the rq in the first place?

Wouldn't it be sufficient to just set all of them to stopped?

Christian.



And adding it as another  condition in drm_sched_entity_is_idle ?


Yes, I think that should work.



In this case reverse locking order is created, In 
drm_sched_entity_push_job and drm_sched_entity_flush lock 
entity->rq_lock locked first and rq->lock locked second. In 
drm_sched_fini on the other hand, I need to lock rq->lock first to 
iterate rq->entities and then lock s_entity->rq_lock for each entity 
to modify s_entity->stopped. I guess we could change s_entity->stopped 
to atomic ?


Good point. But I think the memory barrier inserted by 
wake_up

Re: [PATCH] drm/amdgpu: add ih call to process until checkpoint

2021-02-25 Thread Felix Kuehling

Am 2021-02-25 um 11:48 a.m. schrieb Christian König:
>
>
> Am 25.02.21 um 16:35 schrieb Felix Kuehling:
>> Am 2021-02-25 um 8:53 a.m. schrieb Christian König:
>>>
>>> Am 25.02.21 um 04:15 schrieb Felix Kuehling:
 On 2021-02-24 10:54 a.m., Kim, Jonathan wrote:
> [AMD Official Use Only - Internal Distribution Only]
>
>> -Original Message-
>> From: Koenig, Christian 
>> Sent: Wednesday, February 24, 2021 4:17 AM
>> To: Kim, Jonathan ; amd-
>> g...@lists.freedesktop.org
>> Cc: Yang, Philip ; Kuehling, Felix
>> 
>> Subject: Re: [PATCH] drm/amdgpu: add ih call to process until
>> checkpoint
>>
>> Am 23.02.21 um 22:10 schrieb Jonathan Kim:
>>> Add IH function to allow caller to process ring entries until the
>>> checkpoint write pointer.
>> This needs a better description of what this will be used for.
> Felix or Philip could elaborate better for HMM needs.
> Debugging tools requires this but it's in experimental mode at the
> moment so probably not the best place to describe here.
 On the HMM side we're planning to use this to drain pending page
 fault interrupts before we unmap memory. That should address phantom
 VM faults after memory is unmapped.
>>> Thought so. I suggest to use a wait_event() here which on the waiter
>>> side checks ih->lock and add a wake_up_all() at the end of
>>> amdgpu_ih_process.
>> Right. I thought about that and it should be easy to add. The reason to
>> suggest busy waiting first is, that interrupt processing is supposed to
>> be fast. The IRQ handler itself doesn't sleep. So I'd expect the wait
>> time to be short enough that sleeping and scheduling is not worth it.
>
> Well the page fault IRQs are processed in a work item, so we busy wait
> for another thread here and not interrupt context.

Good point.


>
> This in turn can lead to starvation of the work handler and so a life
> lock as well.
>
>>
>>
>>> I won't touch rptr or wptr at all for this.
>> Not sure what's your idea here, using ih->lock. Is it to completely
>> drain all IRQs until the IH ring is completely empty?
>
> Correct.
>
>> That can
>> live-lock, if the GPU produces IRQs faster than the kernel can process
>> them. Therefore I was looking at rptr and wptr to drain only IRQs that
>> were already in the queue when the drain call was made. That also
>> ensures that the wait time is bounded and should be short (unless the
>> ring size is huge).
>
> Correct as well, but the problem here is that Jonathan's
> implementation is not even remotely correct.
>
> See when you look at the rptr and wptr you can't be sure that they
> haven't wrapped around between two looks.
>
> What you could do is look at both the rptr as well as the original
> wptr, but that is tricky.

The code in Jon's patch was suggested by me. I check for wrapping by
comparing rptr with the rptr from the previous loop iteration. Comparing
rptr with wptr you can never be sure whether rptr has already passed
wptr, or whether rptr has to wrap first.

I could see a compromise where we sleep and wake up the waiting threads when

 1. the IH ring is empty
 2. the IH rptr wraps

That should be good enough to ensure a quick response in the common case
(no interrupt storm), and a reasonable response in the interrupt storm case.

But then it's still not clear what's the correct condition for checking
that the interrupts the caller cares about have been drained. Looking at
just rptr and wptr is always ambiguous. Maybe we could use timestamps of
the last processed interrupt? Those 48-bit time stamps wrap much less
frequently. The idea is this:

  * At the start get the timestamp of the last written IH ring entry
(checkpoint)
  * Wait until the last_processed IH timestamp passes the checkpoint:
sign_extend(last_processed - checkpoint) >= 0

Regards,
  Felix


>
> Regards,
> Christian.
>
>>
>> Regards,
>>    Felix
>>
>>
>>> Regards,
>>> Christian.
>>>
 Regards,
    Felix


>>> Suggested-by: Felix Kuehling 
>>> Signed-off-by: Jonathan Kim 
>>> ---
>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c | 46
>> +-
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h |  2 ++
>>>     2 files changed, 47 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
>>> index dc852af4f3b7..cae50af9559d 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
>>> @@ -22,7 +22,7 @@
>>>  */
>>>
>>>     #include 
>>> -
>>> +#include 
>>>     #include "amdgpu.h"
>>>     #include "amdgpu_ih.h"
>>>
>>> @@ -160,6 +160,50 @@ void amdgpu_ih_ring_write(struct
>> amdgpu_ih_ring *ih, const uint32_t *iv,
>>>     }
>>>     }
>>>
>>> +/**
>>> + * amdgpu_ih_wait_on_checkpoint_process - wait to process IVs
>>> up to
>>> +che

Re: [PATCH 147/159] drm/amdgpu: restore aldebaran save ttmp and trap config on init (v2)

2021-02-25 Thread Deucher, Alexander

[AMD Official Use Only - Internal Distribution Only]

I dropped the KFD debugger hunks and just added the gfx 9.4.2 changes since 
these were required for a bunch of later patches that build on that file that 
are not dependent on debugger.  I can rework the commit message if you'd like.

Alex


From: Kuehling, Felix 
Sent: Wednesday, February 24, 2021 10:22 PM
To: Deucher, Alexander ; 
amd-gfx@lists.freedesktop.org 
Cc: Kim, Jonathan 
Subject: Re: [PATCH 147/159] drm/amdgpu: restore aldebaran save ttmp and trap 
config on init (v2)

This patch is for the debugger functionality that's not on
amd-staging-drm-next yet. You can probably drop this patch for now.

Regards,
   Felix

On 2021-02-24 5:18 p.m., Alex Deucher wrote:
> From: Jonathan Kim 
>
> Initialization of TRAP_DATA0/1 is still required for the debugger to detect
> new waves on Aldebaran.  Also, per-vmid global trap enablement may be
> required outside of debugger scope so move to init phase.
>
> v2: just add the gfx 9.4.2 changes (Alex)
>
> Signed-off-by: Jonathan Kim 
> Reviewed-by: Felix Kuehling 
> Signed-off-by: Alex Deucher 
> ---
>   drivers/gpu/drm/amd/amdgpu/Makefile |  1 +
>   drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c   |  1 +
>   drivers/gpu/drm/amd/amdgpu/gfx_v9_4_2.c | 50 +
>   drivers/gpu/drm/amd/amdgpu/gfx_v9_4_2.h | 30 +++
>   4 files changed, 82 insertions(+)
>   create mode 100644 drivers/gpu/drm/amd/amdgpu/gfx_v9_4_2.c
>   create mode 100644 drivers/gpu/drm/amd/amdgpu/gfx_v9_4_2.h
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/Makefile 
> b/drivers/gpu/drm/amd/amdgpu/Makefile
> index c5ec926bc6d5..741b68874e53 100644
> --- a/drivers/gpu/drm/amd/amdgpu/Makefile
> +++ b/drivers/gpu/drm/amd/amdgpu/Makefile
> @@ -123,6 +123,7 @@ amdgpu-y += \
>gfx_v8_0.o \
>gfx_v9_0.o \
>gfx_v9_4.o \
> + gfx_v9_4_2.o \
>gfx_v10_0.o
>
>   # add async DMA block
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c 
> b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> index 5bac5659e707..78bb4e28c27c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> @@ -49,6 +49,7 @@
>
>   #include "gfx_v9_4.h"
>   #include "gfx_v9_0.h"
> +#include "gfx_v9_4_2.h"
>
>   #include "asic_reg/pwr/pwr_10_0_offset.h"
>   #include "asic_reg/pwr/pwr_10_0_sh_mask.h"
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_2.c 
> b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_2.c
> new file mode 100644
> index ..0c2ccbe327ab
> --- /dev/null
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_2.c
> @@ -0,0 +1,50 @@
> +/*
> + * Copyright 2020 Advanced Micro Devices, Inc.
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a
> + * copy of this software and associated documentation files (the "Software"),
> + * to deal in the Software without restriction, including without limitation
> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> + * and/or sell copies of the Software, and to permit persons to whom the
> + * Software is furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice shall be included in
> + * all copies or substantial portions of the Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
> + * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
> + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
> + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
> + * OTHER DEALINGS IN THE SOFTWARE.
> + *
> + */
> +#include "amdgpu.h"
> +#include "soc15.h"
> +
> +#include "gc/gc_9_4_2_offset.h"
> +#include "gc/gc_9_4_2_sh_mask.h"
> +
> +void gfx_v9_4_2_debug_trap_config_init(struct amdgpu_device *adev,
> + uint32_t first_vmid,
> + uint32_t last_vmid)
> +{
> + uint32_t data;
> + int i;
> +
> + mutex_lock(&adev->srbm_mutex);
> +
> + for (i = first_vmid; i < last_vmid; i++) {
> + data = 0;
> + soc15_grbm_select(adev, 0, 0, 0, i);
> + data = REG_SET_FIELD(data, SPI_GDBG_PER_VMID_CNTL, TRAP_EN, 1);
> + data = REG_SET_FIELD(data, SPI_GDBG_PER_VMID_CNTL, EXCP_EN, 0);
> + data = REG_SET_FIELD(data, SPI_GDBG_PER_VMID_CNTL, EXCP_REPLACE,
> + 0);
> + WREG32(SOC15_REG_OFFSET(GC, 0, regSPI_GDBG_PER_VMID_CNTL), 
> data);
> + }
> +
> + soc15_grbm_select(adev, 0, 0, 0, 0);
> + mutex_unlock(&adev->srbm_mutex);
> +}
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_2.h 
> b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_2.h
> new file mode 100644
> index ..5b175c10de23
> --- /dev/null
> +++ b/driv

Re: [PATCH v6 3/3] drm/amd/display: Skip modeset for front porch change

2021-02-25 Thread Kazlauskas, Nicholas


On 2021-02-12 8:08 p.m., Aurabindo Pillai wrote:

[Why]
A seamless transition between modes can be performed if the new incoming
mode has the same timing parameters as the optimized mode on a display with a
variable vtotal min/max.

Smooth video playback usecases can be enabled with this seamless transition by
switching to a new mode which has a refresh rate matching the video.

[How]
Skip full modeset if userspace requested a compatible freesync mode which only
differs in the front porch timing from the current mode.

Signed-off-by: Aurabindo Pillai 
Acked-by: Christian König 
---
  .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 220 ++
  1 file changed, 180 insertions(+), 40 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index c472905c7d72..628fec855e14 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -212,6 +212,9 @@ static bool amdgpu_dm_psr_disable_all(struct 
amdgpu_display_manager *dm);
  static const struct drm_format_info *
  amd_get_format_info(const struct drm_mode_fb_cmd2 *cmd);
  
+static bool

+is_timing_unchanged_for_freesync(struct drm_crtc_state *old_crtc_state,
+struct drm_crtc_state *new_crtc_state);
  /*
   * dm_vblank_get_counter
   *
@@ -335,6 +338,17 @@ static inline bool amdgpu_dm_vrr_active(struct 
dm_crtc_state *dm_state)
   dm_state->freesync_config.state == VRR_STATE_ACTIVE_FIXED;
  }
  
+static inline bool is_dc_timing_adjust_needed(struct dm_crtc_state *old_state,

+ struct dm_crtc_state *new_state)
+{
+   if (new_state->freesync_config.state ==  VRR_STATE_ACTIVE_FIXED)
+   return true;
+   else if (amdgpu_dm_vrr_active(old_state) != 
amdgpu_dm_vrr_active(new_state))
+   return true;
+   else
+   return false;
+}
+
  /**
   * dm_pflip_high_irq() - Handle pageflip interrupt
   * @interrupt_params: ignored
@@ -5008,19 +5022,16 @@ static void 
fill_stream_properties_from_drm_display_mode(
timing_out->hdmi_vic = hv_frame.vic;
}
  
-	timing_out->h_addressable = mode_in->crtc_hdisplay;

-   timing_out->h_total = mode_in->crtc_htotal;
-   timing_out->h_sync_width =
-   mode_in->crtc_hsync_end - mode_in->crtc_hsync_start;
-   timing_out->h_front_porch =
-   mode_in->crtc_hsync_start - mode_in->crtc_hdisplay;
-   timing_out->v_total = mode_in->crtc_vtotal;
-   timing_out->v_addressable = mode_in->crtc_vdisplay;
-   timing_out->v_front_porch =
-   mode_in->crtc_vsync_start - mode_in->crtc_vdisplay;
-   timing_out->v_sync_width =
-   mode_in->crtc_vsync_end - mode_in->crtc_vsync_start;
-   timing_out->pix_clk_100hz = mode_in->crtc_clock * 10;
+   timing_out->h_addressable = mode_in->hdisplay;
+   timing_out->h_total = mode_in->htotal;
+   timing_out->h_sync_width = mode_in->hsync_end - mode_in->hsync_start;
+   timing_out->h_front_porch = mode_in->hsync_start - mode_in->hdisplay;
+   timing_out->v_total = mode_in->vtotal;
+   timing_out->v_addressable = mode_in->vdisplay;
+   timing_out->v_front_porch = mode_in->vsync_start - mode_in->vdisplay;
+   timing_out->v_sync_width = mode_in->vsync_end - mode_in->vsync_start;
+   timing_out->pix_clk_100hz = mode_in->clock * 10;
+
timing_out->aspect_ratio = get_aspect_ratio(mode_in);
  
  	stream->output_color_space = get_output_color_space(timing_out);

@@ -5240,6 +5251,33 @@ get_highest_refresh_rate_mode(struct amdgpu_dm_connector 
*aconnector,
return m_pref;
  }
  
+static bool is_freesync_video_mode(struct drm_display_mode *mode,

+  struct amdgpu_dm_connector *aconnector)
+{
+   struct drm_display_mode *high_mode;
+   int timing_diff;
+
+   high_mode = get_highest_refresh_rate_mode(aconnector, false);
+   if (!high_mode || !mode)
+   return false;
+
+   timing_diff = high_mode->vtotal - mode->vtotal;
+
+   if (high_mode->clock == 0 || high_mode->clock != mode->clock ||
+   high_mode->hdisplay != mode->hdisplay ||
+   high_mode->vdisplay != mode->vdisplay ||
+   high_mode->hsync_start != mode->hsync_start ||
+   high_mode->hsync_end != mode->hsync_end ||
+   high_mode->htotal != mode->htotal ||
+   high_mode->hskew != mode->hskew ||
+   high_mode->vscan != mode->vscan ||
+   high_mode->vsync_start - mode->vsync_start != timing_diff ||
+   high_mode->vsync_end - mode->vsync_end != timing_diff)
+   return false;
+   else
+   return true;
+}
+
  static struct dc_stream_state *
  create_stream_for_sink(struct amdgpu_dm_connector *aconnector,
   const struct drm_display_mode *drm_mode,
@@ -5253,8 +5291,10 @@

Re: [PATCH] drm/amdgpu: Remove amdgpu_device arg from free_sgt api

2021-02-25 Thread Christian König


Am 25.02.21 um 03:49 schrieb Ramesh Errabolu:

Currently callers have to provide handle of amdgpu_device,
which is not used by the implementation. It is unlikely this
parameter will become useful in future, thus removing it

Signed-off-by: Ramesh Errabolu 


Reviewed-by: Christian König 


---
  drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c  | 3 +--
  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h  | 3 +--
  drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 4 +---
  3 files changed, 3 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
index 2808d5752de1..e83d73ec0e9d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
@@ -359,14 +359,13 @@ static void amdgpu_dma_buf_unmap(struct 
dma_buf_attachment *attach,
struct dma_buf *dma_buf = attach->dmabuf;
struct drm_gem_object *obj = dma_buf->priv;
struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj);
-   struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
  
  	if (sgt->sgl->page_link) {

dma_unmap_sgtable(attach->dev, sgt, dir, 0);
sg_free_table(sgt);
kfree(sgt);
} else {
-   amdgpu_vram_mgr_free_sgt(adev, attach->dev, dir, sgt);
+   amdgpu_vram_mgr_free_sgt(attach->dev, dir, sgt);
}
  }
  
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h

index 927d33d75c22..028f200a3509 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
@@ -121,8 +121,7 @@ int amdgpu_vram_mgr_alloc_sgt(struct amdgpu_device *adev,
  struct device *dev,
  enum dma_data_direction dir,
  struct sg_table **sgt);
-void amdgpu_vram_mgr_free_sgt(struct amdgpu_device *adev,
- struct device *dev,
+void amdgpu_vram_mgr_free_sgt(struct device *dev,
  enum dma_data_direction dir,
  struct sg_table *sgt);
  uint64_t amdgpu_vram_mgr_usage(struct ttm_resource_manager *man);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
index b325b067926b..14936bc713b0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
@@ -734,15 +734,13 @@ int amdgpu_vram_mgr_alloc_sgt(struct amdgpu_device *adev,
  /**
   * amdgpu_vram_mgr_free_sgt - allocate and fill a sg table
   *
- * @adev: amdgpu device pointer
   * @dev: device pointer
   * @dir: data direction of resource to unmap
   * @sgt: sg table to free
   *
   * Free a previously allocate sg table.
   */
-void amdgpu_vram_mgr_free_sgt(struct amdgpu_device *adev,
- struct device *dev,
+void amdgpu_vram_mgr_free_sgt(struct device *dev,
  enum dma_data_direction dir,
  struct sg_table *sgt)
  {


___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Re: [PATCH] drm/amdgpu: add ih call to process until checkpoint

2021-02-25 Thread Christian König




Am 25.02.21 um 16:35 schrieb Felix Kuehling:

Am 2021-02-25 um 8:53 a.m. schrieb Christian König:


Am 25.02.21 um 04:15 schrieb Felix Kuehling:

On 2021-02-24 10:54 a.m., Kim, Jonathan wrote:

[AMD Official Use Only - Internal Distribution Only]


-Original Message-
From: Koenig, Christian 
Sent: Wednesday, February 24, 2021 4:17 AM
To: Kim, Jonathan ; amd-
g...@lists.freedesktop.org
Cc: Yang, Philip ; Kuehling, Felix

Subject: Re: [PATCH] drm/amdgpu: add ih call to process until
checkpoint

Am 23.02.21 um 22:10 schrieb Jonathan Kim:

Add IH function to allow caller to process ring entries until the
checkpoint write pointer.

This needs a better description of what this will be used for.

Felix or Philip could elaborate better for HMM needs.
Debugging tools requires this but it's in experimental mode at the
moment so probably not the best place to describe here.

On the HMM side we're planning to use this to drain pending page
fault interrupts before we unmap memory. That should address phantom
VM faults after memory is unmapped.

Thought so. I suggest to use a wait_event() here which on the waiter
side checks ih->lock and add a wake_up_all() at the end of
amdgpu_ih_process.

Right. I thought about that and it should be easy to add. The reason to
suggest busy waiting first is, that interrupt processing is supposed to
be fast. The IRQ handler itself doesn't sleep. So I'd expect the wait
time to be short enough that sleeping and scheduling is not worth it.


Well the page fault IRQs are processed in a work item, so we busy wait 
for another thread here and not interrupt context.


This in turn can lead to starvation of the work handler and so a life 
lock as well.






I won't touch rptr or wptr at all for this.

Not sure what's your idea here, using ih->lock. Is it to completely
drain all IRQs until the IH ring is completely empty?


Correct.


That can
live-lock, if the GPU produces IRQs faster than the kernel can process
them. Therefore I was looking at rptr and wptr to drain only IRQs that
were already in the queue when the drain call was made. That also
ensures that the wait time is bounded and should be short (unless the
ring size is huge).


Correct as well, but the problem here is that Jonathan's implementation 
is not even remotely correct.


See when you look at the rptr and wptr you can't be sure that they 
haven't wrapped around between two looks.


What you could do is look at both the rptr as well as the original wptr, 
but that is tricky.


Regards,
Christian.



Regards,
   Felix



Regards,
Christian.


Regards,
   Felix



Suggested-by: Felix Kuehling 
Signed-off-by: Jonathan Kim 
---
    drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c | 46

+-

drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h |  2 ++
    2 files changed, 47 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
index dc852af4f3b7..cae50af9559d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
@@ -22,7 +22,7 @@
     */

    #include 
-
+#include 
    #include "amdgpu.h"
    #include "amdgpu_ih.h"

@@ -160,6 +160,50 @@ void amdgpu_ih_ring_write(struct

amdgpu_ih_ring *ih, const uint32_t *iv,

    }
    }

+/**
+ * amdgpu_ih_wait_on_checkpoint_process - wait to process IVs up to
+checkpoint
+ *
+ * @adev: amdgpu_device pointer
+ * @ih: ih ring to process
+ *
+ * Used to ensure ring has processed IVs up to the checkpoint write

pointer.

+ */
+int amdgpu_ih_wait_on_checkpoint_process(struct amdgpu_device

*adev,

+struct amdgpu_ih_ring *ih)
+{
+u32 prev_rptr, cur_rptr, checkpoint_wptr;
+
+if (!ih->enabled || adev->shutdown)
+return -ENODEV;
+
+cur_rptr = READ_ONCE(ih->rptr);
+/* Order read of current rptr with checktpoint wptr. */
+mb();
+checkpoint_wptr = amdgpu_ih_get_wptr(adev, ih);
+
+/* allow rptr to wrap around  */
+if (cur_rptr > checkpoint_wptr) {
+spin_begin();
+do {
+spin_cpu_relax();
+prev_rptr = cur_rptr;
+cur_rptr = READ_ONCE(ih->rptr);
+} while (cur_rptr >= prev_rptr);
+spin_end();

That's a certain NAK since it busy waits for IH processing. We need
some
event to trigger here.

The function is meant to be just a waiter up to the checkpoint.
There's a need to guarantee that "stale" interrupts have been
processed on check before doing other stuff after call.
The description could be improved to clarify that.

Would busy waiting only on a locked ring help?  I assume an unlocked
ring means nothing to process so no need to wait and we can exit
early.  Or is it better to just to process the entries up to the
checkpoint (maybe adjust amdgpu_ih_process for this need like adding
a bool arg to skip restart or something)?

Thanks,

Jon


+}
+
+/* wait for rptr to catch up to or pass checkpoint. */
+spin_begin();
+do {
+spin_cpu_relax();
+prev_rptr = cur_rptr;
+cur_rptr = READ_ONCE(ih->rptr);
+} while (cur_rptr >= prev_rptr && cur_rptr < checkpoint_wptr);

Same of course here.

Christian.


+sp

[PATCH] drm/amdgpu: enable TMZ by default on Raven asics

2021-02-25 Thread Alex Deucher

This has been stable for a while.

Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c |  4 ++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 10 ++
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 2f71d36d2856..21504ea9085f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -164,7 +164,7 @@ int amdgpu_discovery = -1;
 int amdgpu_mes;
 int amdgpu_noretry = -1;
 int amdgpu_force_asic_type = -1;
-int amdgpu_tmz;
+int amdgpu_tmz = -1; /* auto */
 int amdgpu_reset_method = -1; /* auto */
 int amdgpu_num_kcq = -1;
 
@@ -790,7 +790,7 @@ module_param_named(abmlevel, amdgpu_dm_abm_level, uint, 
0444);
  *
  * The default value: 0 (off).  TODO: change to auto till it is completed.
  */
-MODULE_PARM_DESC(tmz, "Enable TMZ feature (-1 = auto, 0 = off (default), 1 = 
on)");
+MODULE_PARM_DESC(tmz, "Enable TMZ feature (-1 = auto (default), 0 = off, 1 = 
on)");
 module_param_named(tmz, amdgpu_tmz, int, 0444);
 
 /**
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
index fe1a39ffda72..1a892526d020 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
@@ -384,6 +384,16 @@ void amdgpu_gmc_tmz_set(struct amdgpu_device *adev)
 {
switch (adev->asic_type) {
case CHIP_RAVEN:
+   if (amdgpu_tmz == 0) {
+   adev->gmc.tmz_enabled = false;
+   dev_info(adev->dev,
+"Trusted Memory Zone (TMZ) feature disabled 
(cmd line)\n");
+   } else {
+   adev->gmc.tmz_enabled = true;
+   dev_info(adev->dev,
+"Trusted Memory Zone (TMZ) feature enabled\n");
+   }
+   break;
case CHIP_RENOIR:
case CHIP_NAVI10:
case CHIP_NAVI14:
-- 
2.29.2

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Re: [PATCH v4 00/14] RFC Support hot device unplug in amdgpu

2021-02-25 Thread Andrey Grodzovsky




On 2021-02-25 5:25 a.m., Daniel Vetter wrote:

On Wed, Feb 24, 2021 at 11:30:50AM -0500, Andrey Grodzovsky wrote:

On 2021-02-19 5:24 a.m., Daniel Vetter wrote:

On Thu, Feb 18, 2021 at 9:03 PM Andrey Grodzovsky
 wrote:

Looked a bit into it, I want to export sync_object to FD and import  from that 
FD
such that I will wait on the imported sync object handle from one thread while
signaling the exported sync object handle from another (post device unplug) ?

My problem is how to create a sync object with a non signaled 'fake' fence ?
I only see API that creates it with already signaled fence (or none) -
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Felixir.bootlin.com%2Flinux%2Flatest%2Fsource%2Fdrivers%2Fgpu%2Fdrm%2Fdrm_syncobj.c%23L56&data=04%7C01%7Candrey.grodzovsky%40amd.com%7Cc6828a032b80464fc0f008d8d977bc32%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637498455582209331%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=Izca%2BYNliCefXqAgOIX%2Bs3XQ1vWVGXbfAh28B%2F51blQ%3D&reserved=0

P.S I expect the kernel to crash since unlike with dma_bufs we don't hold
drm device reference here on export.

Well maybe there's no crash. I think if you go through all your
dma_fence that you have and force-complete them, then I think external
callers wont go into the driver anymore. But there's still pointers
potentially pointing at your device struct and all that, but should
work. Still needs some audit ofc.

Wrt how you get such a free-standing fence, that's amdgpu specific. Roughly
- submit cs
- get the fence for that (either sync_file, but I don't think amdgpu
supports that, or maybe through drm_syncobj)
- hotunplug
- wait on that fence somehow (drm_syncobj has direct uapi for this,
same for sync_file I think)

Cheers, Daniel


Indeed worked fine, did with 2 devices. Since syncobj is refcounted, even
after I
destroyed the original syncobj and unplugged the device, the exported
syncobj and the
fence inside didn't go anywhere.

See my 3 tests in my branch on Gitlab
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgitlab.freedesktop.org%2Fagrodzov%2Figt-gpu-tools%2F-%2Fcommits%2Fmaster&data=04%7C01%7Candrey.grodzovsky%40amd.com%7Cc6828a032b80464fc0f008d8d977bc32%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637498455582209331%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=IvoCIggDUV3EgDqOhrJokWei%2B6byg%2Be9cfaJel9y2RY%3D&reserved=0
and let me know if I should go ahead and do a merge request (into which
target project/branch ?) or you
have more comments.

igt still works with patch submission.
-Daniel



I see, Need to divert to other work for a while, will get to it once I 
am back to device unplug.


Andrey





Andrey



Andrey

On 2/9/21 4:50 AM, Daniel Vetter wrote:

Yeah in the end we'd need 2 hw devices for testing full fence
functionality. A useful intermediate step would be to just export the
fence (either as sync_file, which I think amdgpu doesn't support because
no android egl support in mesa) or drm_syncobj (which you can do as
standalone fd too iirc), and then just using the fence a bit from
userspace (like wait on it or get its status) after the device is
unplugged.



___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Re: [PATCH v2] drm/scheduler: Fix hang when sched_entity released

2021-02-25 Thread Andrey Grodzovsky



On 2021-02-25 2:53 a.m., Christian König wrote:

Am 24.02.21 um 16:13 schrieb Andrey Grodzovsky:

Ping


Sorry, I've been on vacation this week.



Andrey

On 2021-02-20 7:12 a.m., Andrey Grodzovsky wrote:


On 2/20/21 3:38 AM, Christian König wrote:



Am 18.02.21 um 17:41 schrieb Andrey Grodzovsky:


On 2/18/21 10:15 AM, Christian König wrote:

Am 18.02.21 um 16:05 schrieb Andrey Grodzovsky:


On 2/18/21 3:07 AM, Christian König wrote:



Am 17.02.21 um 22:59 schrieb Andrey Grodzovsky:

Problem: If scheduler is already stopped by the time sched_entity
is released and entity's job_queue not empty I encountred
a hang in drm_sched_entity_flush. This is because 
drm_sched_entity_is_idle

never becomes false.

Fix: In drm_sched_fini detach all sched_entities from the
scheduler's run queues. This will satisfy 
drm_sched_entity_is_idle.

Also wakeup all those processes stuck in sched_entity flushing
as the scheduler main thread which wakes them up is stopped by 
now.


v2:
Reverse order of drm_sched_rq_remove_entity and marking
s_entity as stopped to prevent reinserion back to rq due
to race.

Signed-off-by: Andrey Grodzovsky 
---
  drivers/gpu/drm/scheduler/sched_main.c | 31 
+++

  1 file changed, 31 insertions(+)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
b/drivers/gpu/drm/scheduler/sched_main.c

index 908b0b5..c6b7947 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -897,9 +897,40 @@ EXPORT_SYMBOL(drm_sched_init);
   */
  void drm_sched_fini(struct drm_gpu_scheduler *sched)
  {
+    int i;
+    struct drm_sched_entity *s_entity;


BTW: Please order that so that i is declared last.


  if (sched->thread)
  kthread_stop(sched->thread);
  +    /* Detach all sched_entites from this scheduler once 
it's stopped */
+    for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= 
DRM_SCHED_PRIORITY_MIN; i--) {

+    struct drm_sched_rq *rq = &sched->sched_rq[i];
+
+    if (!rq)
+    continue;
+
+    /* Loop this way because rq->lock is taken in 
drm_sched_rq_remove_entity */

+    spin_lock(&rq->lock);
+    while ((s_entity = 
list_first_entry_or_null(&rq->entities,

+    struct drm_sched_entity,
+    list))) {
+    spin_unlock(&rq->lock);
+
+    /* Prevent reinsertion and remove */
+    spin_lock(&s_entity->rq_lock);
+    s_entity->stopped = true;
+    drm_sched_rq_remove_entity(rq, s_entity);
+ spin_unlock(&s_entity->rq_lock);


Well this spin_unlock/lock dance here doesn't look correct at 
all now.


Christian.



In what way ? It's in the same same order as in other call sites 
(see drm_sched_entity_push_job and drm_sched_entity_flush).
If i just locked rq->lock and did list_for_each_entry_safe while 
manually deleting entity->list instead of calling
drm_sched_rq_remove_entity this still would not be possible as 
the order of lock acquisition between s_entity->rq_lock
and rq->lock would be reverse compared to the call sites 
mentioned above.


Ah, now I understand. You need this because 
drm_sched_rq_remove_entity() will grab the rq lock again!


Problem is now what prevents the entity from being destroyed 
while you remove it?


Christian.


Right, well, since (unfortunately) sched_entity is part of 
amdgpu_ctx_entity and amdgpu_ctx_entity is refcounted
there is a problem here that we don't increment 
amdgpu_ctx.refcount when assigning  sched_entity
to new rq (e.g. before drm_sched_rq_add_entity) and not decrement 
before removing. We do it for
amdgpu_cs_parser.entity for example (in amdgpu_cs_parser_init and 
amdgpu_cs_parser_fini by
calling amdgpu_ctx_get and amdgpu_ctx_put). But this seems a bit 
tricky due to all the drm_sched_entity_select_rq

logic.

Another, kind of a band aid fix, would probably be just locking 
amdgpu_ctx_mgr.lock around drm_sched_fini
when finalizing the fence driver and around idr iteration in 
amdgpu_ctx_mgr_fini (which should be lock protected
anyway as I see from other idr usages in the code) ... This should 
prevent this use after free.


Puh, that's rather complicated as well. Ok let's look at it from 
the other side for a moment.


Why do we have to remove the entities from the rq in the first place?

Wouldn't it be sufficient to just set all of them to stopped?

Christian.



And adding it as another  condition in drm_sched_entity_is_idle ?


Yes, I think that should work.



In this case reverse locking order is created, In 
drm_sched_entity_push_job and drm_sched_entity_flush lock 
entity->rq_lock locked first and rq->lock locked second. In 
drm_sched_fini on the other hand, I need to lock rq->lock first to 
iterate rq->entities and then lock s_entity->rq_lock for each entity to 
modify s_entity->stopped. I guess we could change s_entity->stopped to 
atomic ?


Andrey




Christian.




Andrey






Andrey






Andrey






+
+    spin_lock(&rq->lock);
+

[PATCH] amdgpu/pm: read_sensor() report failure apporpriately

2021-02-25 Thread Shirish S

report -ENOTSUPP instead of -EINVAL, so that if userspace
fails to read sensor data can figure it out the failure correctly.

Signed-off-by: Shirish S 
---
 drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu10_hwmgr.c  | 2 +-
 drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu7_hwmgr.c   | 2 +-
 drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu8_hwmgr.c   | 2 +-
 drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega10_hwmgr.c | 2 +-
 drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega12_hwmgr.c | 2 +-
 drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega20_hwmgr.c | 2 +-
 drivers/gpu/drm/amd/pm/powerplay/kv_dpm.c | 2 +-
 drivers/gpu/drm/amd/pm/powerplay/si_dpm.c | 2 +-
 8 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu10_hwmgr.c 
b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu10_hwmgr.c
index 2c90f715ee0a..c932b632ddd4 100644
--- a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu10_hwmgr.c
+++ b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu10_hwmgr.c
@@ -1285,7 +1285,7 @@ static int smu10_read_sensor(struct pp_hwmgr *hwmgr, int 
idx,
*size = 4;
break;
default:
-   ret = -EINVAL;
+   ret = -EOPNOTSUPP;
break;
}
 
diff --git a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu7_hwmgr.c 
b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu7_hwmgr.c
index c57dc9ae81f2..a58f92249c53 100644
--- a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu7_hwmgr.c
+++ b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu7_hwmgr.c
@@ -3945,7 +3945,7 @@ static int smu7_read_sensor(struct pp_hwmgr *hwmgr, int 
idx,
*((uint32_t *)value) = (uint32_t)convert_to_vddc(val_vid);
return 0;
default:
-   return -EINVAL;
+   return -EOPNOTSUPP;
}
 }
 
diff --git a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu8_hwmgr.c 
b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu8_hwmgr.c
index ed9b89980184..2cef9c0c6d6f 100644
--- a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu8_hwmgr.c
+++ b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu8_hwmgr.c
@@ -1805,7 +1805,7 @@ static int smu8_read_sensor(struct pp_hwmgr *hwmgr, int 
idx,
*((uint32_t *)value) = smu8_thermal_get_temperature(hwmgr);
return 0;
default:
-   return -EINVAL;
+   return -EOPNOTSUPP;
}
 }
 
diff --git a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega10_hwmgr.c 
b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega10_hwmgr.c
index 29c99642d22d..5e875ad8d633 100644
--- a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega10_hwmgr.c
+++ b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega10_hwmgr.c
@@ -3890,7 +3890,7 @@ static int vega10_read_sensor(struct pp_hwmgr *hwmgr, int 
idx,
*size = 8;
break;
default:
-   ret = -EINVAL;
+   ret = -EOPNOTSUPP;
break;
}
 
diff --git a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega12_hwmgr.c 
b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega12_hwmgr.c
index c0753029a8e2..a827f2bc7904 100644
--- a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega12_hwmgr.c
+++ b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega12_hwmgr.c
@@ -1429,7 +1429,7 @@ static int vega12_read_sensor(struct pp_hwmgr *hwmgr, int 
idx,
*size = 8;
break;
default:
-   ret = -EINVAL;
+   ret = -EOPNOTSUPP;
break;
}
return ret;
diff --git a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega20_hwmgr.c 
b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega20_hwmgr.c
index 87811b005b85..e8eec2539c17 100644
--- a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega20_hwmgr.c
+++ b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega20_hwmgr.c
@@ -2240,7 +2240,7 @@ static int vega20_read_sensor(struct pp_hwmgr *hwmgr, int 
idx,
*size = 8;
break;
default:
-   ret = -EINVAL;
+   ret = -EOPNOTSUPP;
break;
}
return ret;
diff --git a/drivers/gpu/drm/amd/pm/powerplay/kv_dpm.c 
b/drivers/gpu/drm/amd/pm/powerplay/kv_dpm.c
index 66daabebee35..bcae42cef374 100644
--- a/drivers/gpu/drm/amd/pm/powerplay/kv_dpm.c
+++ b/drivers/gpu/drm/amd/pm/powerplay/kv_dpm.c
@@ -3305,7 +3305,7 @@ static int kv_dpm_read_sensor(void *handle, int idx,
*size = 4;
return 0;
default:
-   return -EINVAL;
+   return -EOPNOTSUPP;
}
 }
 
diff --git a/drivers/gpu/drm/amd/pm/powerplay/si_dpm.c 
b/drivers/gpu/drm/amd/pm/powerplay/si_dpm.c
index 62291358fb1c..26a5321e621b 100644
--- a/drivers/gpu/drm/amd/pm/powerplay/si_dpm.c
+++ b/drivers/gpu/drm/amd/pm/powerplay/si_dpm.c
@@ -8014,7 +8014,7 @@ static int si_dpm_read_sensor(void *handle, int idx,
*size = 4;
return 0;
default:
-   return -EINVAL;
+   return -EOPNOTSUPP;
}
 }
 
-- 
2.17.1

__

Re: [PATCH] drm/amdgpu: add ih call to process until checkpoint

2021-02-25 Thread Felix Kuehling

Am 2021-02-25 um 8:53 a.m. schrieb Christian König:
>
>
> Am 25.02.21 um 04:15 schrieb Felix Kuehling:
>> On 2021-02-24 10:54 a.m., Kim, Jonathan wrote:
>>> [AMD Official Use Only - Internal Distribution Only]
>>>
 -Original Message-
 From: Koenig, Christian 
 Sent: Wednesday, February 24, 2021 4:17 AM
 To: Kim, Jonathan ; amd-
 g...@lists.freedesktop.org
 Cc: Yang, Philip ; Kuehling, Felix
 
 Subject: Re: [PATCH] drm/amdgpu: add ih call to process until
 checkpoint

 Am 23.02.21 um 22:10 schrieb Jonathan Kim:
> Add IH function to allow caller to process ring entries until the
> checkpoint write pointer.
 This needs a better description of what this will be used for.
>>> Felix or Philip could elaborate better for HMM needs.
>>> Debugging tools requires this but it's in experimental mode at the
>>> moment so probably not the best place to describe here.
>>
>> On the HMM side we're planning to use this to drain pending page
>> fault interrupts before we unmap memory. That should address phantom
>> VM faults after memory is unmapped.
>
> Thought so. I suggest to use a wait_event() here which on the waiter
> side checks ih->lock and add a wake_up_all() at the end of
> amdgpu_ih_process. 

Right. I thought about that and it should be easy to add. The reason to
suggest busy waiting first is, that interrupt processing is supposed to
be fast. The IRQ handler itself doesn't sleep. So I'd expect the wait
time to be short enough that sleeping and scheduling is not worth it.


> I won't touch rptr or wptr at all for this.

Not sure what's your idea here, using ih->lock. Is it to completely
drain all IRQs until the IH ring is completely empty? That can
live-lock, if the GPU produces IRQs faster than the kernel can process
them. Therefore I was looking at rptr and wptr to drain only IRQs that
were already in the queue when the drain call was made. That also
ensures that the wait time is bounded and should be short (unless the
ring size is huge).

Regards,
  Felix


>
> Regards,
> Christian.
>
>>
>> Regards,
>>   Felix
>>
>>
>>>
> Suggested-by: Felix Kuehling 
> Signed-off-by: Jonathan Kim 
> ---
>    drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c | 46
 +-
> drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h |  2 ++
>    2 files changed, 47 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
> index dc852af4f3b7..cae50af9559d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
> @@ -22,7 +22,7 @@
>     */
>
>    #include 
> -
> +#include 
>    #include "amdgpu.h"
>    #include "amdgpu_ih.h"
>
> @@ -160,6 +160,50 @@ void amdgpu_ih_ring_write(struct
 amdgpu_ih_ring *ih, const uint32_t *iv,
>    }
>    }
>
> +/**
> + * amdgpu_ih_wait_on_checkpoint_process - wait to process IVs up to
> +checkpoint
> + *
> + * @adev: amdgpu_device pointer
> + * @ih: ih ring to process
> + *
> + * Used to ensure ring has processed IVs up to the checkpoint write
 pointer.
> + */
> +int amdgpu_ih_wait_on_checkpoint_process(struct amdgpu_device
 *adev,
> +struct amdgpu_ih_ring *ih)
> +{
> +u32 prev_rptr, cur_rptr, checkpoint_wptr;
> +
> +if (!ih->enabled || adev->shutdown)
> +return -ENODEV;
> +
> +cur_rptr = READ_ONCE(ih->rptr);
> +/* Order read of current rptr with checktpoint wptr. */
> +mb();
> +checkpoint_wptr = amdgpu_ih_get_wptr(adev, ih);
> +
> +/* allow rptr to wrap around  */
> +if (cur_rptr > checkpoint_wptr) {
> +spin_begin();
> +do {
> +spin_cpu_relax();
> +prev_rptr = cur_rptr;
> +cur_rptr = READ_ONCE(ih->rptr);
> +} while (cur_rptr >= prev_rptr);
> +spin_end();
 That's a certain NAK since it busy waits for IH processing. We need
 some
 event to trigger here.
>>> The function is meant to be just a waiter up to the checkpoint.
>>> There's a need to guarantee that "stale" interrupts have been
>>> processed on check before doing other stuff after call.
>>> The description could be improved to clarify that.
>>>
>>> Would busy waiting only on a locked ring help?  I assume an unlocked
>>> ring means nothing to process so no need to wait and we can exit
>>> early.  Or is it better to just to process the entries up to the
>>> checkpoint (maybe adjust amdgpu_ih_process for this need like adding
>>> a bool arg to skip restart or something)?
>>>
>>> Thanks,
>>>
>>> Jon
>>>
> +}
> +
> +/* wait for rptr to catch up to or pass checkpoint. */
> +spin_begin();
> +do {
> +spin_cpu_relax();
> +prev_rptr = cur_rptr;
> +cur_rptr = READ_ONCE(ih->rptr);
> +} while (cur_rptr >= prev_rptr && cur_rptr < checkpoint_wptr);
 Same of course here.

 Christian.

> +spin_e

[PATCH] drm/amd/display: Fix an uninitialized index variable

2021-02-25 Thread Arnd Bergmann

From: Arnd Bergmann 

clang points out that the new logic uses an always-uninitialized
array index:

drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:9810:38: warning: 
variable 'i' is uninitialized when used here [-Wuninitialized]
timing  = &edid->detailed_timings[i];
  ^
drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:9720:7: note: 
initialize the variable 'i' to silence this warning

My best guess is that the index should have been returned by the
parse_hdmi_amd_vsdb() function that walks an array here, so do that.

Fixes: f9b4f20c4777 ("drm/amd/display: Add Freesync HDMI support to DM")
Signed-off-by: Arnd Bergmann 
---
 .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c| 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index b19b93c74bae..667c0d52dbfa 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -9736,7 +9736,7 @@ static bool parse_edid_cea(struct amdgpu_dm_connector 
*aconnector,
return false;
 }
 
-static bool parse_hdmi_amd_vsdb(struct amdgpu_dm_connector *aconnector,
+static int parse_hdmi_amd_vsdb(struct amdgpu_dm_connector *aconnector,
struct edid *edid, struct amdgpu_hdmi_vsdb_info *vsdb_info)
 {
uint8_t *edid_ext = NULL;
@@ -9746,7 +9746,7 @@ static bool parse_hdmi_amd_vsdb(struct 
amdgpu_dm_connector *aconnector,
/*- drm_find_cea_extension() -*/
/* No EDID or EDID extensions */
if (edid == NULL || edid->extensions == 0)
-   return false;
+   return -ENODEV;
 
/* Find CEA extension */
for (i = 0; i < edid->extensions; i++) {
@@ -9756,14 +9756,15 @@ static bool parse_hdmi_amd_vsdb(struct 
amdgpu_dm_connector *aconnector,
}
 
if (i == edid->extensions)
-   return false;
+   return -ENODEV;
 
/*- cea_db_offsets() -*/
if (edid_ext[0] != CEA_EXT)
-   return false;
+   return -ENODEV;
 
valid_vsdb_found = parse_edid_cea(aconnector, edid_ext, EDID_LENGTH, 
vsdb_info);
-   return valid_vsdb_found;
+
+   return valid_vsdb_found ? i : -ENODEV;
 }
 
 void amdgpu_dm_update_freesync_caps(struct drm_connector *connector,
@@ -9781,7 +9782,6 @@ void amdgpu_dm_update_freesync_caps(struct drm_connector 
*connector,
struct amdgpu_device *adev = drm_to_adev(dev);
bool freesync_capable = false;
struct amdgpu_hdmi_vsdb_info vsdb_info = {0};
-   bool hdmi_valid_vsdb_found = false;
 
if (!connector->state) {
DRM_ERROR("%s - Connector has no state", __func__);
@@ -9857,8 +9857,8 @@ void amdgpu_dm_update_freesync_caps(struct drm_connector 
*connector,
}
}
} else if (edid && amdgpu_dm_connector->dc_sink->sink_signal == 
SIGNAL_TYPE_HDMI_TYPE_A) {
-   hdmi_valid_vsdb_found = 
parse_hdmi_amd_vsdb(amdgpu_dm_connector, edid, &vsdb_info);
-   if (hdmi_valid_vsdb_found && vsdb_info.freesync_supported) {
+   i = parse_hdmi_amd_vsdb(amdgpu_dm_connector, edid, &vsdb_info);
+   if (i >= 0 && vsdb_info.freesync_supported) {
timing  = &edid->detailed_timings[i];
data= &timing->data.other_data;
 
-- 
2.29.2

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Re: [PATCH] drm/amdgpu: enable one vf mode on navi21 vf

2021-02-25 Thread Deucher, Alexander

[AMD Official Use Only - Internal Distribution Only]

Acked-by: Alex Deucher 

From: Horace Chen 
Sent: Thursday, February 25, 2021 7:04 AM
To: amd-gfx@lists.freedesktop.org 
Cc: Grodzovsky, Andrey ; Quan, Evan 
; Chen, Horace ; Tuikov, Luben 
; Koenig, Christian ; Deucher, 
Alexander ; Xiao, Jack ; Zhang, 
Hawking ; Liu, Monk ; Xu, Feifei 
; Wang, Kevin(Yang) ; Xiaojie Yuan 

Subject: [PATCH] drm/amdgpu: enable one vf mode on navi21 vf

navi21 vf needs one vf mode which allows vf to set and get
clock status from guest vm. So now expose the required interface
and allow some smu request on VF mode. Also since navi21 blocked
direct MMIO access, use KIQ to send SMU request under sriov vf.

OD use same command as getting pp table which is not allowed for
 navi21, so remove OD feature under sriov vf.

Signed-off-by: Horace Chen 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c   |  2 ++
 drivers/gpu/drm/amd/pm/amdgpu_pm.c   |  2 +-
 drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c| 10 ++
 .../gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c  | 10 +-
 drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c   | 12 ++--
 5 files changed, 20 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index f0f7ed42ee7f..dfbf2f2db0de 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2043,6 +2043,8 @@ static int amdgpu_device_ip_early_init(struct 
amdgpu_device *adev)
 adev->pm.pp_feature = amdgpu_pp_feature_mask;
 if (amdgpu_sriov_vf(adev) || sched_policy == KFD_SCHED_POLICY_NO_HWS)
 adev->pm.pp_feature &= ~PP_GFXOFF_MASK;
+   if (amdgpu_sriov_vf(adev) && adev->asic_type == CHIP_SIENNA_CICHLID)
+   adev->pm.pp_feature &= ~PP_OVERDRIVE_MASK;

 for (i = 0; i < adev->num_ip_blocks; i++) {
 if ((amdgpu_ip_block_mask & (1 << i)) == 0) {
diff --git a/drivers/gpu/drm/amd/pm/amdgpu_pm.c 
b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
index b770dd634ab6..1866cbaf70c3 100644
--- a/drivers/gpu/drm/amd/pm/amdgpu_pm.c
+++ b/drivers/gpu/drm/amd/pm/amdgpu_pm.c
@@ -2167,7 +2167,7 @@ static ssize_t amdgpu_get_gpu_metrics(struct device *dev,

 static struct amdgpu_device_attr amdgpu_device_attrs[] = {
 AMDGPU_DEVICE_ATTR_RW(power_dpm_state,  
ATTR_FLAG_BASIC|ATTR_FLAG_ONEVF),
-   AMDGPU_DEVICE_ATTR_RW(power_dpm_force_performance_level,
ATTR_FLAG_BASIC),
+   AMDGPU_DEVICE_ATTR_RW(power_dpm_force_performance_level,
ATTR_FLAG_BASIC|ATTR_FLAG_ONEVF),
 AMDGPU_DEVICE_ATTR_RO(pp_num_states,
ATTR_FLAG_BASIC),
 AMDGPU_DEVICE_ATTR_RO(pp_cur_state, 
ATTR_FLAG_BASIC),
 AMDGPU_DEVICE_ATTR_RW(pp_force_state,   
ATTR_FLAG_BASIC),
diff --git a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c 
b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
index d143ef1b460b..7033d52eb4d0 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
@@ -612,10 +612,12 @@ static int smu_late_init(void *handle)
 return ret;
 }

-   ret = smu_set_default_od_settings(smu);
-   if (ret) {
-   dev_err(adev->dev, "Failed to setup default OD settings!\n");
-   return ret;
+   if (smu->od_enabled) {
+   ret = smu_set_default_od_settings(smu);
+   if (ret) {
+   dev_err(adev->dev, "Failed to setup default OD 
settings!\n");
+   return ret;
+   }
 }

 ret = smu_populate_umd_state_clk(smu);
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
index af73e1430af5..441effc23625 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
@@ -89,17 +89,17 @@ static struct cmn2asic_msg_mapping 
sienna_cichlid_message_map[SMU_MSG_MAX_COUNT]
 MSG_MAP(GetEnabledSmuFeaturesHigh,  
PPSMC_MSG_GetRunningSmuFeaturesHigh,   1),
 MSG_MAP(SetWorkloadMask,PPSMC_MSG_SetWorkloadMask, 
1),
 MSG_MAP(SetPptLimit,PPSMC_MSG_SetPptLimit, 
0),
-   MSG_MAP(SetDriverDramAddrHigh,  
PPSMC_MSG_SetDriverDramAddrHigh,   0),
-   MSG_MAP(SetDriverDramAddrLow,   PPSMC_MSG_SetDriverDramAddrLow, 
   0),
+   MSG_MAP(SetDriverDramAddrHigh,  
PPSMC_MSG_SetDriverDramAddrHigh,   1),
+   MSG_MAP(SetDriverDramAddrLow,   PPSMC_MSG_SetDriverDramAddrLow, 
   1),
 MSG_MAP(SetToolsDramAddrHigh,   
PPSMC_MSG_SetToolsDramAddrHigh,0),
 MSG_MAP(SetToolsDramAddrLow,PPSMC_MSG_SetToolsDramAd

[PATCH] drm/amd/display: fix 64-bit integer division

2021-02-25 Thread Arnd Bergmann

From: Arnd Bergmann 

The new display synchronization code caused a regression
on all 32-bit architectures:

ld.lld: error: undefined symbol: __aeabi_uldivmod
>>> referenced by dce_clock_source.c
>>>   
>>> gpu/drm/amd/display/dc/dce/dce_clock_source.o:(get_pixel_clk_frequency_100hz)
>>>  in archive drivers/built-in.a

ld.lld: error: undefined symbol: __aeabi_ldivmod
>>> referenced by dc_resource.c
>>>   
>>> gpu/drm/amd/display/dc/core/dc_resource.o:(resource_are_vblanks_synchronizable)
>>>  in archive drivers/built-in.a
>>> referenced by dc_resource.c
>>>   
>>> gpu/drm/amd/display/dc/core/dc_resource.o:(resource_are_vblanks_synchronizable)
>>>  in archive drivers/built-in.a
>>> referenced by dc_resource.c
>>>   
>>> gpu/drm/amd/display/dc/core/dc_resource.o:(resource_are_vblanks_synchronizable)
>>>  in archive drivers/built-in.a

This is not a fast path, so the use of an explicit div_u64/div_s64
seems appropriate.

Fixes: 77a2b7265f20 ("drm/amd/display: Synchronize displays with different 
timings")
Signed-off-by: Arnd Bergmann 
---
 drivers/gpu/drm/amd/display/dc/core/dc_resource.c| 12 ++--
 .../gpu/drm/amd/display/dc/dce/dce_clock_source.c|  6 +++---
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/dc/core/dc_resource.c 
b/drivers/gpu/drm/amd/display/dc/core/dc_resource.c
index 0241c9d96d7a..49214c59c836 100644
--- a/drivers/gpu/drm/amd/display/dc/core/dc_resource.c
+++ b/drivers/gpu/drm/amd/display/dc/core/dc_resource.c
@@ -441,15 +441,15 @@ bool resource_are_vblanks_synchronizable(
if (stream2->timing.pix_clk_100hz*100/stream2->timing.h_total/
stream2->timing.v_total > 60)
return false;
-   frame_time_diff = (int64_t)1 *
+   frame_time_diff = div_s64(1ll *
stream1->timing.h_total *
stream1->timing.v_total *
-   stream2->timing.pix_clk_100hz /
-   stream1->timing.pix_clk_100hz /
-   stream2->timing.h_total /
-   stream2->timing.v_total;
+   stream2->timing.pix_clk_100hz,
+   stream1->timing.pix_clk_100hz *
+   stream2->timing.h_total *
+   stream2->timing.v_total);
for (i = 0; i < rr_count; i++) {
-   int64_t diff = (frame_time_diff * 
base60_refresh_rates[i]) / 10 - 1;
+   int64_t diff = div_s64(frame_time_diff * 
base60_refresh_rates[i], 10) - 1;
 
if (diff < 0)
diff = -diff;
diff --git a/drivers/gpu/drm/amd/display/dc/dce/dce_clock_source.c 
b/drivers/gpu/drm/amd/display/dc/dce/dce_clock_source.c
index 6f47f9bab5ee..85ed6f2c9647 100644
--- a/drivers/gpu/drm/amd/display/dc/dce/dce_clock_source.c
+++ b/drivers/gpu/drm/amd/display/dc/dce/dce_clock_source.c
@@ -1013,9 +1013,9 @@ static bool get_pixel_clk_frequency_100hz(
 * not be programmed equal to DPREFCLK
 */
modulo_hz = REG_READ(MODULO[inst]);
-   *pixel_clk_khz = ((uint64_t)clock_hz*
-   
clock_source->ctx->dc->clk_mgr->dprefclk_khz*10)/
-   modulo_hz;
+   *pixel_clk_khz = div_u64((uint64_t)clock_hz * 10 *
+   clock_source->ctx->dc->clk_mgr->dprefclk_khz,
+   modulo_hz);
} else {
/* NOTE: There is agreement with VBIOS here that MODULO 
is
 * programmed equal to DPREFCLK, in which case PHASE 
will be
-- 
2.29.2

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

[PATCH] drm/amd/display: remove unnecessary conversion to bool

2021-02-25 Thread Jiapeng Chong

Fix the following coccicheck warnings:

./drivers/gpu/drm/amd/display/dc/dcn30/dcn30_dpp_cm.c:243:67-72:
WARNING: conversion to bool not needed here.

Reported-by: Abaci Robot 
Signed-off-by: Jiapeng Chong 
---
 drivers/gpu/drm/amd/display/dc/dcn30/dcn30_dpp_cm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/display/dc/dcn30/dcn30_dpp_cm.c 
b/drivers/gpu/drm/amd/display/dc/dcn30/dcn30_dpp_cm.c
index 3398540..102f6a0 100644
--- a/drivers/gpu/drm/amd/display/dc/dcn30/dcn30_dpp_cm.c
+++ b/drivers/gpu/drm/amd/display/dc/dcn30/dcn30_dpp_cm.c
@@ -240,7 +240,7 @@ bool dpp3_program_gamcor_lut(
next_mode = LUT_RAM_A;
 
dpp3_power_on_gamcor_lut(dpp_base, true);
-   dpp3_configure_gamcor_lut(dpp_base, next_mode == LUT_RAM_A ? 
true:false);
+   dpp3_configure_gamcor_lut(dpp_base, next_mode == LUT_RAM_A);
 
if (next_mode == LUT_RAM_B) {
gam_regs.start_cntl_b = REG(CM_GAMCOR_RAMB_START_CNTL_B);
-- 
1.8.3.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Re: [PATCH] drm/amdgpu: add ih call to process until checkpoint

2021-02-25 Thread Christian König




Am 25.02.21 um 04:15 schrieb Felix Kuehling:

On 2021-02-24 10:54 a.m., Kim, Jonathan wrote:

[AMD Official Use Only - Internal Distribution Only]


-Original Message-
From: Koenig, Christian 
Sent: Wednesday, February 24, 2021 4:17 AM
To: Kim, Jonathan ; amd-
g...@lists.freedesktop.org
Cc: Yang, Philip ; Kuehling, Felix

Subject: Re: [PATCH] drm/amdgpu: add ih call to process until 
checkpoint


Am 23.02.21 um 22:10 schrieb Jonathan Kim:

Add IH function to allow caller to process ring entries until the
checkpoint write pointer.

This needs a better description of what this will be used for.

Felix or Philip could elaborate better for HMM needs.
Debugging tools requires this but it's in experimental mode at the 
moment so probably not the best place to describe here.


On the HMM side we're planning to use this to drain pending page fault 
interrupts before we unmap memory. That should address phantom VM 
faults after memory is unmapped.


Thought so. I suggest to use a wait_event() here which on the waiter 
side checks ih->lock and add a wake_up_all() at the end of 
amdgpu_ih_process. I won't touch rptr or wptr at all for this.


Regards,
Christian.



Regards,
  Felix





Suggested-by: Felix Kuehling 
Signed-off-by: Jonathan Kim 
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c | 46

+-

drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h |  2 ++
   2 files changed, 47 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
index dc852af4f3b7..cae50af9559d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
@@ -22,7 +22,7 @@
    */

   #include 
-
+#include 
   #include "amdgpu.h"
   #include "amdgpu_ih.h"

@@ -160,6 +160,50 @@ void amdgpu_ih_ring_write(struct

amdgpu_ih_ring *ih, const uint32_t *iv,

   }
   }

+/**
+ * amdgpu_ih_wait_on_checkpoint_process - wait to process IVs up to
+checkpoint
+ *
+ * @adev: amdgpu_device pointer
+ * @ih: ih ring to process
+ *
+ * Used to ensure ring has processed IVs up to the checkpoint write

pointer.

+ */
+int amdgpu_ih_wait_on_checkpoint_process(struct amdgpu_device

*adev,

+struct amdgpu_ih_ring *ih)
+{
+u32 prev_rptr, cur_rptr, checkpoint_wptr;
+
+if (!ih->enabled || adev->shutdown)
+return -ENODEV;
+
+cur_rptr = READ_ONCE(ih->rptr);
+/* Order read of current rptr with checktpoint wptr. */
+mb();
+checkpoint_wptr = amdgpu_ih_get_wptr(adev, ih);
+
+/* allow rptr to wrap around  */
+if (cur_rptr > checkpoint_wptr) {
+spin_begin();
+do {
+spin_cpu_relax();
+prev_rptr = cur_rptr;
+cur_rptr = READ_ONCE(ih->rptr);
+} while (cur_rptr >= prev_rptr);
+spin_end();
That's a certain NAK since it busy waits for IH processing. We need 
some

event to trigger here.

The function is meant to be just a waiter up to the checkpoint.
There's a need to guarantee that "stale" interrupts have been 
processed on check before doing other stuff after call.

The description could be improved to clarify that.

Would busy waiting only on a locked ring help?  I assume an unlocked 
ring means nothing to process so no need to wait and we can exit 
early.  Or is it better to just to process the entries up to the 
checkpoint (maybe adjust amdgpu_ih_process for this need like adding 
a bool arg to skip restart or something)?


Thanks,

Jon


+}
+
+/* wait for rptr to catch up to or pass checkpoint. */
+spin_begin();
+do {
+spin_cpu_relax();
+prev_rptr = cur_rptr;
+cur_rptr = READ_ONCE(ih->rptr);
+} while (cur_rptr >= prev_rptr && cur_rptr < checkpoint_wptr);

Same of course here.

Christian.


+spin_end();
+
+return 0;
+}
+
   /**
    * amdgpu_ih_process - interrupt handler
    *
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h
index 6ed4a85fc7c3..6817f0a812d2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h
@@ -87,6 +87,8 @@ int amdgpu_ih_ring_init(struct amdgpu_device *adev,

struct amdgpu_ih_ring *ih,

   void amdgpu_ih_ring_fini(struct amdgpu_device *adev, struct

amdgpu_ih_ring *ih);
   void amdgpu_ih_ring_write(struct amdgpu_ih_ring *ih, const 
uint32_t *iv,

 unsigned int num_dw);
+int amdgpu_ih_wait_on_checkpoint_process(struct amdgpu_device

*adev,

+struct amdgpu_ih_ring *ih);
   int amdgpu_ih_process(struct amdgpu_device *adev, struct

amdgpu_ih_ring *ih);

   void amdgpu_ih_decode_iv_helper(struct amdgpu_device *adev,
   struct amdgpu_ih_ring *ih,

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Re: [PATCH 088/159] drm/ttm: ioremap buffer properly according to TTM placement flag

2021-02-25 Thread Christian König

The whole patch set needs a rebase since the TTM_PL_FLAG_* for 
controlling the caching doesn't exists any more upstream.


How should we approach that?

Thanks,
Christian.

Am 24.02.21 um 23:17 schrieb Alex Deucher:

From: Oak Zeng 

If TTM placement flag is cached, buffer is intended to be mapped
as cached from CPU. Map it with ioremap_cache.

This wasn't necessary before as device memory was never mapped
as cached from CPU side. It becomes necessary for aldebaran as
device memory is mapped cached from CPU.

Signed-off-by: Oak Zeng 
Reviewed-by: Christian Koenig 
Tested-by: Amber Lin 
Signed-off-by: Alex Deucher 
---
  drivers/gpu/drm/ttm/ttm_bo_util.c | 4 
  1 file changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c 
b/drivers/gpu/drm/ttm/ttm_bo_util.c
index ee04716b2603..e11ec1ff5d0b 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_util.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_util.c
@@ -519,6 +519,10 @@ static int ttm_bo_ioremap(struct ttm_buffer_object *bo,
map->virtual = ioremap_wc(bo->mem.bus.base +
  bo->mem.bus.offset + offset,
  size);
+   else if (mem->placement & TTM_PL_FLAG_CACHED)
+   map->virtual = ioremap_cache(bo->mem.bus.base +
+ bo->mem.bus.offset + offset,
+ size);
else
map->virtual = ioremap(bo->mem.bus.base +
   bo->mem.bus.offset + offset,


___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Re: [PATCH v4 00/14] RFC Support hot device unplug in amdgpu

2021-02-25 Thread Daniel Vetter

On Wed, Feb 24, 2021 at 11:30:50AM -0500, Andrey Grodzovsky wrote:
> 
> On 2021-02-19 5:24 a.m., Daniel Vetter wrote:
> > On Thu, Feb 18, 2021 at 9:03 PM Andrey Grodzovsky
> >  wrote:
> > > Looked a bit into it, I want to export sync_object to FD and import  from 
> > > that FD
> > > such that I will wait on the imported sync object handle from one thread 
> > > while
> > > signaling the exported sync object handle from another (post device 
> > > unplug) ?
> > > 
> > > My problem is how to create a sync object with a non signaled 'fake' 
> > > fence ?
> > > I only see API that creates it with already signaled fence (or none) -
> > > https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Felixir.bootlin.com%2Flinux%2Flatest%2Fsource%2Fdrivers%2Fgpu%2Fdrm%2Fdrm_syncobj.c%23L56&data=04%7C01%7CAndrey.Grodzovsky%40amd.com%7C5085bdd151c642514d2408d8d4c08e56%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637493270792459284%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=sZWIn0Lo7ZujBq0e7MdFPhJDARXWpOlLgLzANMS8cCY%3D&reserved=0
> > > 
> > > P.S I expect the kernel to crash since unlike with dma_bufs we don't hold
> > > drm device reference here on export.
> > Well maybe there's no crash. I think if you go through all your
> > dma_fence that you have and force-complete them, then I think external
> > callers wont go into the driver anymore. But there's still pointers
> > potentially pointing at your device struct and all that, but should
> > work. Still needs some audit ofc.
> > 
> > Wrt how you get such a free-standing fence, that's amdgpu specific. Roughly
> > - submit cs
> > - get the fence for that (either sync_file, but I don't think amdgpu
> > supports that, or maybe through drm_syncobj)
> > - hotunplug
> > - wait on that fence somehow (drm_syncobj has direct uapi for this,
> > same for sync_file I think)
> > 
> > Cheers, Daniel
> 
> 
> Indeed worked fine, did with 2 devices. Since syncobj is refcounted, even
> after I
> destroyed the original syncobj and unplugged the device, the exported
> syncobj and the
> fence inside didn't go anywhere.
> 
> See my 3 tests in my branch on Gitlab
> https://gitlab.freedesktop.org/agrodzov/igt-gpu-tools/-/commits/master
> and let me know if I should go ahead and do a merge request (into which
> target project/branch ?) or you
> have more comments.

igt still works with patch submission.
-Daniel

> 
> Andrey
> 
> 
> > 
> > > Andrey
> > > 
> > > On 2/9/21 4:50 AM, Daniel Vetter wrote:
> > > > Yeah in the end we'd need 2 hw devices for testing full fence
> > > > functionality. A useful intermediate step would be to just export the
> > > > fence (either as sync_file, which I think amdgpu doesn't support because
> > > > no android egl support in mesa) or drm_syncobj (which you can do as
> > > > standalone fd too iirc), and then just using the fence a bit from
> > > > userspace (like wait on it or get its status) after the device is
> > > > unplugged.
> > 
> > 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Re: [PATCH] drm/amd/amdgpu: move inc gpu_reset_counter after drm_sched_stop

2021-02-25 Thread Christian König


Am 25.02.21 um 10:16 schrieb Jingwen Chen:

Move gpu_reset_counter after drm_sched_stop to avoid race
condition caused by job submitted between reset_count +1 and
drm_sched_stop.

Signed-off-by: Jingwen Chen 


Reviewed-by: Christian König 


---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index f0f7ed42ee7f..703b96cf3560 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4447,7 +4447,6 @@ static bool amdgpu_device_lock_adev(struct amdgpu_device 
*adev,
down_write(&adev->reset_sem);
}
  
-	atomic_inc(&adev->gpu_reset_counter);

switch (amdgpu_asic_reset_method(adev)) {
case AMD_RESET_METHOD_MODE1:
adev->mp1_state = PP_MP1_STATE_SHUTDOWN;
@@ -4708,6 +4707,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
if (need_emergency_restart)
amdgpu_job_stop_all_jobs_on_sched(&ring->sched);
}
+   atomic_inc(&tmp_adev->gpu_reset_counter);
}
  
  	if (need_emergency_restart)

@@ -5050,6 +5050,7 @@ pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev 
*pdev, pci_channel_sta
  
  			drm_sched_stop(&ring->sched, NULL);

}
+   atomic_inc(&adev->gpu_reset_counter);
return PCI_ERS_RESULT_NEED_RESET;
case pci_channel_io_perm_failure:
/* Permanent error, prepare for device removal */


___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

[PATCH] drm/amd/amdgpu: move inc gpu_reset_counter after drm_sched_stop

2021-02-25 Thread Jingwen Chen

Move gpu_reset_counter after drm_sched_stop to avoid race
condition caused by job submitted between reset_count +1 and
drm_sched_stop.

Signed-off-by: Jingwen Chen 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index f0f7ed42ee7f..703b96cf3560 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4447,7 +4447,6 @@ static bool amdgpu_device_lock_adev(struct amdgpu_device 
*adev,
down_write(&adev->reset_sem);
}
 
-   atomic_inc(&adev->gpu_reset_counter);
switch (amdgpu_asic_reset_method(adev)) {
case AMD_RESET_METHOD_MODE1:
adev->mp1_state = PP_MP1_STATE_SHUTDOWN;
@@ -4708,6 +4707,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
if (need_emergency_restart)
amdgpu_job_stop_all_jobs_on_sched(&ring->sched);
}
+   atomic_inc(&tmp_adev->gpu_reset_counter);
}
 
if (need_emergency_restart)
@@ -5050,6 +5050,7 @@ pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev 
*pdev, pci_channel_sta
 
drm_sched_stop(&ring->sched, NULL);
}
+   atomic_inc(&adev->gpu_reset_counter);
return PCI_ERS_RESULT_NEED_RESET;
case pci_channel_io_perm_failure:
/* Permanent error, prepare for device removal */
-- 
2.25.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

[PATCH] drm/amdgpu: Replace DEFINE_SIMPLE_ATTRIBUTE with DEFINE_DEBUGFS_ATTRIBUTE

2021-02-25 Thread Yang Li

Fix the following coccicheck warning:
./drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c:1589:0-23: WARNING:
fops_ib_preempt should be defined with DEFINE_DEBUGFS_ATTRIBUTE
./drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c:1592:0-23: WARNING:
fops_sclk_set should be defined with DEFINE_DEBUGFS_ATTRIBUTE

Reported-by: Abaci Robot 
Signed-off-by: Yang Li 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
index 0a25fec..52ef488 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
@@ -1586,10 +1586,10 @@ static int amdgpu_debugfs_sclk_set(void *data, u64 val)
return 0;
 }
 
-DEFINE_SIMPLE_ATTRIBUTE(fops_ib_preempt, NULL,
+DEFINE_DEBUGFS_ATTRIBUTE(fops_ib_preempt, NULL,
amdgpu_debugfs_ib_preempt, "%llu\n");
 
-DEFINE_SIMPLE_ATTRIBUTE(fops_sclk_set, NULL,
+DEFINE_DEBUGFS_ATTRIBUTE(fops_sclk_set, NULL,
amdgpu_debugfs_sclk_set, "%llu\n");
 
 int amdgpu_debugfs_init(struct amdgpu_device *adev)
-- 
1.8.3.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Re: [PATCH v3 01/11] drm/atomic: Pass the full state to planes async atomic check and update

2021-02-25 Thread Maxime Ripard

Hi,

On Wed, Feb 24, 2021 at 12:33:45PM +0100, Thomas Zimmermann wrote:
> Hi Maxime,
> 
> for the whole series:
> 
> Acked-by: Thomas Zimmermann 

Applied the whole series, thanks to everyone involved in the review,
it's been a pretty daunting one :)

Maxime


signature.asc
Description: PGP signature
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

RE: [PATCH 2/2] drm/amd/amdgpu: force flush resubmit job

2021-02-25 Thread Liu, Monk

[AMD Official Use Only - Internal Distribution Only]

Yeah, that sounds better than original fix 

Thanks Christian

--
Monk Liu | Cloud-GPU Core team
--

-Original Message-
From: Koenig, Christian  
Sent: Thursday, February 25, 2021 4:08 PM
To: Chen, JingWen ; amd-gfx@lists.freedesktop.org
Cc: Liu, Monk 
Subject: Re: [PATCH 2/2] drm/amd/amdgpu: force flush resubmit job

Good catch, but the approach for the fix is incorrect.

The device reset count can only be incremented after taking the reset lock and 
stopping the scheduler, otherwise a whole bunch of different race conditions 
can happen.

Christian.

Am 25.02.21 um 08:56 schrieb Chen, JingWen:
> [AMD Official Use Only - Internal Distribution Only]
>
> Consider this sequence:
> 1. GPU reset begin
> 2. device reset count + 1
> 3. job id 1 scheduled with vm_need_flush=false 4. When handling this 
> job in vm_flush, amdgpu_vmid_had_gpu_reset will return true, thus this 
> job will be flush and the vmid_reset_count will be set equals to 
> device_reset_count 5. stop drm scheduler 6. GPU do real reset 7. 
> resubmit job id 1 with vm_need_flush=false 8. Job id 1 is the first 
> job to resubmit after reset. This time amdgpu_vmid_had_gpu_reset 
> returns false and the vm_need_flush==false
>
> Then no one will flush pd_addr and vmid for jobs after resubmit. In this 
> sequence amdgpu_vmid_had_gpu_reset doesn't work.
>
>
> Best Regards,
> JingWen Chen
>
> -Original Message-
> From: Koenig, Christian 
> Sent: Thursday, February 25, 2021 3:46 PM
> To: Chen, JingWen ; 
> amd-gfx@lists.freedesktop.org
> Cc: Liu, Monk 
> Subject: Re: [PATCH 2/2] drm/amd/amdgpu: force flush resubmit job
>
>
>
> Am 25.02.21 um 06:27 schrieb Jingwen Chen:
>> [Why]
>> when a job is scheduled during TDR(after device reset count increase 
>> and before drm_sched_stop), this job won't do vm_flush when resubmit 
>> itself after GPU reset done. This can lead to a page fault.
>>
>> [How]
>> Always do vm_flush for resubmit job.
> NAK, what do you think amdgpu_vmid_had_gpu_reset already does?
>
> Christian.
>
>> Signed-off-by: Jingwen Chen 
>> ---
>>drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 3 ++-
>>1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> index fdbe7d4e8b8b..4af2c5d15950 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> @@ -1088,7 +1088,8 @@ int amdgpu_vm_flush(struct amdgpu_ring *ring, struct 
>> amdgpu_job *job,
>>if (update_spm_vmid_needed && adev->gfx.rlc.funcs->update_spm_vmid)
>>adev->gfx.rlc.funcs->update_spm_vmid(adev, job->vmid);
>>
>> -if (amdgpu_vmid_had_gpu_reset(adev, id)) {
>> +if (amdgpu_vmid_had_gpu_reset(adev, id)|| (job->base.flags & 
>> +DRM_FLAG_RESUBMIT_JOB)) {
>>gds_switch_needed = true;
>>vm_flush_needed = true;
>>pasid_mapping_needed = true;
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Re: [PATCH 2/2] drm/amd/amdgpu: force flush resubmit job

2021-02-25 Thread Christian König


Good catch, but the approach for the fix is incorrect.

The device reset count can only be incremented after taking the reset 
lock and stopping the scheduler, otherwise a whole bunch of different 
race conditions can happen.


Christian.

Am 25.02.21 um 08:56 schrieb Chen, JingWen:

[AMD Official Use Only - Internal Distribution Only]

Consider this sequence:
1. GPU reset begin
2. device reset count + 1
3. job id 1 scheduled with vm_need_flush=false
4. When handling this job in vm_flush, amdgpu_vmid_had_gpu_reset will return 
true, thus this job will be flush and the vmid_reset_count will be set equals 
to device_reset_count
5. stop drm scheduler
6. GPU do real reset
7. resubmit job id 1 with vm_need_flush=false
8. Job id 1 is the first job to resubmit after reset. This time 
amdgpu_vmid_had_gpu_reset returns false and the vm_need_flush==false

Then no one will flush pd_addr and vmid for jobs after resubmit. In this 
sequence amdgpu_vmid_had_gpu_reset doesn't work.


Best Regards,
JingWen Chen

-Original Message-
From: Koenig, Christian 
Sent: Thursday, February 25, 2021 3:46 PM
To: Chen, JingWen ; amd-gfx@lists.freedesktop.org
Cc: Liu, Monk 
Subject: Re: [PATCH 2/2] drm/amd/amdgpu: force flush resubmit job



Am 25.02.21 um 06:27 schrieb Jingwen Chen:

[Why]
when a job is scheduled during TDR(after device reset count increase
and before drm_sched_stop), this job won't do vm_flush when resubmit
itself after GPU reset done. This can lead to a page fault.

[How]
Always do vm_flush for resubmit job.

NAK, what do you think amdgpu_vmid_had_gpu_reset already does?

Christian.


Signed-off-by: Jingwen Chen 
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 3 ++-
   1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index fdbe7d4e8b8b..4af2c5d15950 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -1088,7 +1088,8 @@ int amdgpu_vm_flush(struct amdgpu_ring *ring, struct 
amdgpu_job *job,
   if (update_spm_vmid_needed && adev->gfx.rlc.funcs->update_spm_vmid)
   adev->gfx.rlc.funcs->update_spm_vmid(adev, job->vmid);

-if (amdgpu_vmid_had_gpu_reset(adev, id)) {
+if (amdgpu_vmid_had_gpu_reset(adev, id)||
+(job->base.flags & DRM_FLAG_RESUBMIT_JOB)) {
   gds_switch_needed = true;
   vm_flush_needed = true;
   pasid_mapping_needed = true;


___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

64 matches

Mail list logo