RE: [PATCH] drm/amdgpu/powerplay: add some documentation about memory clock

2020-07-22 Thread Quan, Evan
[AMD Official Use Only - Internal Distribution Only]

Acked-by: Evan Quan 

-Original Message-
From: amd-gfx  On Behalf Of Alex Deucher
Sent: Wednesday, July 22, 2020 12:52 AM
To: amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander 
Subject: [PATCH] drm/amdgpu/powerplay: add some documentation about memory clock

We expose the actual memory controller clock rate in Linux,
not the effective memory clock of the DRAMs.  To translate
between the two, use the following formulas:

Clock conversion (MHz):
HBM: effective_memory_clock = memory_controller_clock * 1
G5:  effective_memory_clock = memory_controller_clock * 1
G6:  effective_memory_clock = memory_controller_clock * 2

DRAM data rate (MT/s):
HBM: effective_memory_clock * 2 = data_rate
G5:  effective_memory_clock * 4 = data_rate
G6:  effective_memory_clock * 8 = data_rate

Bandwidth (MB/s):
data_rate * vram_bit_width / 8 = memory_bandwidth

Some examples:
G5 on RX460:
memory_controller_clock = 1750 MHz
effective_memory_clock = 1750 MHz * 1 = 1750 MHz
data rate = 1750 * 4 = 7000 MT/s
memory_bandwidth = 7000 * 128 bits / 8 = 112000 MB/s

G6 on RX5600:
memory_controller_clock = 900 MHz
effective_memory_clock = 900 MHz * 2 = 1800 MHz
data rate = 1800 * 8 = 14400 MT/s
memory_bandwidth = 14400 * 192 bits / 8 = 345600 MB/s
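
For reference, a small standalone C sketch of the conversion above (the enum and
helper name are invented for this example; it is not part of the patch):

#include <stdint.h>
#include <stdio.h>

enum vram_type { VRAM_HBM, VRAM_GDDR5, VRAM_GDDR6 };

/* memory controller clock (MHz) + VRAM type + bus width -> bandwidth (MB/s) */
static uint64_t mem_bandwidth_mbs(uint32_t mc_clk_mhz, enum vram_type type,
                                  uint32_t vram_bit_width)
{
	/* clock conversion: only GDDR6 runs the DRAMs at 2x the controller clock */
	uint32_t eff_clk_mhz = (type == VRAM_GDDR6) ? mc_clk_mhz * 2 : mc_clk_mhz;
	/* DRAM data rate multiplier: HBM x2, G5 x4, G6 x8 */
	uint32_t mult = (type == VRAM_HBM) ? 2 : (type == VRAM_GDDR5) ? 4 : 8;
	uint64_t data_rate_mts = (uint64_t)eff_clk_mhz * mult;

	return data_rate_mts * vram_bit_width / 8;
}

int main(void)
{
	/* RX460 example: 1750 MHz G5, 128-bit bus -> 112000 MB/s */
	printf("%llu MB/s\n",
	       (unsigned long long)mem_bandwidth_mbs(1750, VRAM_GDDR5, 128));
	/* RX5600 example: 900 MHz G6, 192-bit bus -> 345600 MB/s */
	printf("%llu MB/s\n",
	       (unsigned long long)mem_bandwidth_mbs(900, VRAM_GDDR6, 192));
	return 0;
}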

Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c | 46 ++
 1 file changed, 46 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c
index 20f39aa04fb9..dd05751f6b35 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c
@@ -694,6 +694,52 @@ static ssize_t amdgpu_set_pp_table(struct device *dev,
  * in each power level within a power state.  The pp_od_clk_voltage is used for
  * this.
  *
+ * Note that the actual memory controller clock rate are exposed, not
+ * the effective memory clock of the DRAMs. To translate it, use the
+ * following formula:
+ *
+ * Clock conversion (Mhz):
+ *
+ * HBM: effective_memory_clock = memory_controller_clock * 1
+ *
+ * G5: effective_memory_clock = memory_controller_clock * 1
+ *
+ * G6: effective_memory_clock = memory_controller_clock * 2
+ *
+ * DRAM data rate (MT/s):
+ *
+ * HBM: effective_memory_clock * 2 = data_rate
+ *
+ * G5: effective_memory_clock * 4 = data_rate
+ *
+ * G6: effective_memory_clock * 8 = data_rate
+ *
+ * Bandwidth (MB/s):
+ *
+ * data_rate * vram_bit_width / 8 = memory_bandwidth
+ *
+ * Some examples:
+ *
+ * G5 on RX460:
+ *
+ * memory_controller_clock = 1750 Mhz
+ *
+ * effective_memory_clock = 1750 Mhz * 1 = 1750 Mhz
+ *
+ * data rate = 1750 * 4 = 7000 MT/s
+ *
+ * memory_bandwidth = 7000 * 128 bits / 8 = 112000 MB/s
+ *
+ * G6 on RX5700:
+ *
+ * memory_controller_clock = 875 Mhz
+ *
+ * effective_memory_clock = 875 Mhz * 2 = 1750 Mhz
+ *
+ * data rate = 1750 * 8 = 14000 MT/s
+ *
+ * memory_bandwidth = 14000 * 256 bits / 8 = 448000 MB/s
+ *
  * < For Vega10 and previous ASICs >
  *
  * Reading the file will display:
--
2.25.4

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [PATCH] Revert "drm/amd/powerplay: drop unnecessary message support check"

2020-07-22 Thread Quan, Evan
No,  the change below is not a good idea.
-#if defined(SWSMU_CODE_LAYER_L2) || defined(SWSMU_CODE_LAYER_L3) || 
defined(SWSMU_CODE_LAYER_L4)
+#if defined(SWSMU_CODE_LAYER_L1) || defined(SWSMU_CODE_LAYER_L2) || 
+defined(SWSMU_CODE_LAYER_L3) || defined(SWSMU_CODE_LAYER_L4)
These are used to guard a clear code layer. I recently cleaned up the code
layering mess (which produced lots of redundant code and made lock protection
a disaster).

For this specific case, whether or not to use the smu_cmn API should be decided
at the ASIC level (${asic}_ppt.c).
Each ASIC can have its own implementation, use the smu v11/v12 common API, or
even the smu_cmn API (shared by all ASICs).
But all of this should be decided at the ASIC level; a toy sketch of that wiring
follows below.
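
A toy, self-contained sketch of where that decision lives (all names below are
invented purely for illustration; the real table is each ASIC's pptable_funcs
in ${asic}_ppt.c):

#include <stdio.h>

struct toy_smu;

/* per-ASIC function table, analogous to pptable_funcs */
struct toy_ppt_funcs {
	int (*set_mp1_state)(struct toy_smu *smu, int state);
};

struct toy_smu {
	const struct toy_ppt_funcs *ppt_funcs;
};

/* shared helper, analogous to the smu_cmn layer */
static int toy_cmn_set_mp1_state(struct toy_smu *smu, int state)
{
	(void)smu;
	printf("common path, state=%d\n", state);
	return 0;
}

/* ASIC-specific implementation */
static int toy_renoir_set_mp1_state(struct toy_smu *smu, int state)
{
	(void)smu;
	printf("renoir-specific path, state=%d\n", state);
	return 0;
}

/* the choice is made here, per ASIC, not in the core code */
static const struct toy_ppt_funcs toy_navi_funcs = {
	.set_mp1_state = toy_cmn_set_mp1_state,
};

static const struct toy_ppt_funcs toy_renoir_funcs = {
	.set_mp1_state = toy_renoir_set_mp1_state,
};

int main(void)
{
	struct toy_smu navi = { &toy_navi_funcs };
	struct toy_smu renoir = { &toy_renoir_funcs };

	navi.ppt_funcs->set_mp1_state(&navi, 1);
	renoir.ppt_funcs->set_mp1_state(&renoir, 1);
	return 0;
}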

P.S. Likun sent out a patch which should fix the issue here.

BR,
Evan
-Original Message-
From: amd-gfx  On Behalf Of Feng, Kenneth
Sent: Wednesday, July 22, 2020 4:12 PM
To: Zhu, Changfeng ; amd-gfx@lists.freedesktop.org; 
Huang, Ray 
Subject: RE: [PATCH] Revert "drm/amd/powerplay: drop unnecessary message 
support check"

[AMD Official Use Only - Internal Distribution Only]

Reviewed-by: Kenneth Feng 


-Original Message-
From: Zhu, Changfeng  
Sent: Wednesday, July 22, 2020 4:01 PM
To: amd-gfx@lists.freedesktop.org; Feng, Kenneth ; Huang, 
Ray 
Cc: Zhu, Changfeng 
Subject: [PATCH] Revert "drm/amd/powerplay: drop unnecessary message support 
check"

From: changzhu 

From: Changfeng 

The below 3 messages are not supported on Renoir:
SMU_MSG_PrepareMp1ForShutdown
SMU_MSG_PrepareMp1ForUnload
SMU_MSG_PrepareMp1ForReset

Revert the patch "drm/amd/powerplay: drop unnecessary message support check"
to avoid a set-mp1-state failure during GPU reset on Renoir.

Change-Id: Ib34d17ab88e1c88173827cca962d8154ad883ab7
Signed-off-by: changfeng 
---
 drivers/gpu/drm/amd/powerplay/amdgpu_smu.c | 9 +
 drivers/gpu/drm/amd/powerplay/smu_cmn.h| 2 +-
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c 
b/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c
index 838a369c9ec3..f778b00e49eb 100644
--- a/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c
+++ b/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c
@@ -34,6 +34,7 @@
 #include "sienna_cichlid_ppt.h"
 #include "renoir_ppt.h"
 #include "amd_pcie.h"
+#include "smu_cmn.h"
 
 /*
  * DO NOT use these for err/warn/info/debug messages.
@@ -1589,6 +1590,14 @@ int smu_set_mp1_state(struct smu_context *smu,
return 0;
}
 
+   /* some asics may not support those messages */
+   if (smu_cmn_to_asic_specific_index(smu,
+  CMN2ASIC_MAPPING_MSG,
+  msg) < 0) {
+   mutex_unlock(&smu->mutex);
+   return 0;
+   }
+
ret = smu_send_smc_msg(smu, msg, NULL);
if (ret)
dev_err(smu->adev->dev, "[PrepareMp1] Failed!\n");
diff --git a/drivers/gpu/drm/amd/powerplay/smu_cmn.h 
b/drivers/gpu/drm/amd/powerplay/smu_cmn.h
index 98face8c5fd6..f9e63f18b157 100644
--- a/drivers/gpu/drm/amd/powerplay/smu_cmn.h
+++ b/drivers/gpu/drm/amd/powerplay/smu_cmn.h
@@ -25,7 +25,7 @@
 
 #include "amdgpu_smu.h"
 
-#if defined(SWSMU_CODE_LAYER_L2) || defined(SWSMU_CODE_LAYER_L3) || 
defined(SWSMU_CODE_LAYER_L4)
+#if defined(SWSMU_CODE_LAYER_L1) || defined(SWSMU_CODE_LAYER_L2) || 
+defined(SWSMU_CODE_LAYER_L3) || defined(SWSMU_CODE_LAYER_L4)
 int smu_cmn_send_smc_msg_with_param(struct smu_context *smu,
enum smu_message_type msg,
uint32_t param,
--
2.17.1
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH -next 1/2] drm: Remove redundant NULL check

2020-07-22 Thread Li Heng
Fix below warnings reported by coccicheck:
./drivers/gpu/drm/drm_drv.c:819:2-7: WARNING: NULL check before some freeing 
functions is not needed.
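(kfree() already accepts a NULL pointer as a no-op, which is why the explicit
check is redundant.)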

Fixes: 5dad34f3c444 ("drm: Cleanups after drmm_add_final_kfree rollout")
Signed-off-by: Li Heng 
---
 drivers/gpu/drm/drm_drv.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
index bc38322..13068fd 100644
--- a/drivers/gpu/drm/drm_drv.c
+++ b/drivers/gpu/drm/drm_drv.c
@@ -815,8 +815,7 @@ static void drm_dev_release(struct kref *ref)
 
drm_managed_release(dev);
 
-   if (dev->managed.final_kfree)
-   kfree(dev->managed.final_kfree);
+   kfree(dev->managed.final_kfree);
 }
 
 /**
-- 
2.7.4

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH -next 2/2] drm: Remove redundant NULL check

2020-07-22 Thread Li Heng
Fix below warnings reported by coccicheck:
./drivers/gpu/drm/amd/display/dc/clk_mgr/dcn30/dcn30_clk_mgr.c:557:2-7: 
WARNING: NULL check before some freeing functions is not needed.

Fixes: 4d55b0dd1cdd ("drm/amd/display: Add DCN3 CLK_MGR")
Signed-off-by: Li Heng 
---
 drivers/gpu/drm/amd/display/dc/clk_mgr/dcn30/dcn30_clk_mgr.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/dc/clk_mgr/dcn30/dcn30_clk_mgr.c 
b/drivers/gpu/drm/amd/display/dc/clk_mgr/dcn30/dcn30_clk_mgr.c
index d94fdc5..d8af56a 100644
--- a/drivers/gpu/drm/amd/display/dc/clk_mgr/dcn30/dcn30_clk_mgr.c
+++ b/drivers/gpu/drm/amd/display/dc/clk_mgr/dcn30/dcn30_clk_mgr.c
@@ -553,8 +553,7 @@ void dcn3_clk_mgr_construct(
 
 void dcn3_clk_mgr_destroy(struct clk_mgr_internal *clk_mgr)
 {
-   if (clk_mgr->base.bw_params)
-   kfree(clk_mgr->base.bw_params);
+   kfree(clk_mgr->base.bw_params);
 
if (clk_mgr->wm_range_table)
dm_helpers_free_gpu_mem(clk_mgr->base.ctx, 
DC_MEM_ALLOC_TYPE_GART,
-- 
2.7.4

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [PATCH 3/5] drm/amdgpu: conduct bad gpu check during bootup/reset

2020-07-22 Thread Zhou1, Tao
Please see my comment about the patch organization

> -Original Message-
> From: Chen, Guchun 
> Sent: Thursday, July 23, 2020 11:39 AM
> To: Zhou1, Tao ; amd-gfx@lists.freedesktop.org;
> Deucher, Alexander ; Zhang, Hawking
> ; Li, Dennis ; Yang, Stanley
> ; Clements, John 
> Subject: RE: [PATCH 3/5] drm/amdgpu: conduct bad gpu check during
> bootup/reset
> 
> [AMD Public Use]
> 
> Thanks for your review, Tao.
> Please check my comments after yours.
> 
> Regards,
> Guchun
> 
> -Original Message-
> From: Zhou1, Tao 
> Sent: Thursday, July 23, 2020 10:51 AM
> To: Chen, Guchun ; amd-gfx@lists.freedesktop.org;
> Deucher, Alexander ; Zhang, Hawking
> ; Li, Dennis ; Yang, Stanley
> ; Clements, John 
> Subject: RE: [PATCH 3/5] drm/amdgpu: conduct bad gpu check during
> bootup/reset
> 
> [AMD Official Use Only - Internal Distribution Only]
> 
> > -Original Message-
> > From: Chen, Guchun 
> > Sent: Wednesday, July 22, 2020 11:14 AM
> > To: amd-gfx@lists.freedesktop.org; Deucher, Alexander
> > ; Zhang, Hawking
> ;
> > Li, Dennis ; Yang, Stanley ;
> > Zhou1, Tao ; Clements, John
> 
> > Cc: Chen, Guchun 
> > Subject: [PATCH 3/5] drm/amdgpu: conduct bad gpu check during
> > bootup/reset
> >
> > During boot up, when init eeprom, RAS will check if the BAD GPU tag is
> > saved or not. And will break boot up and notify user to RMA it.
> >
> > At the meanwhile, when saved bad page count to eeprom exceeds
> > threshold, one ras recovery will be triggered immediately, and bad gpu
> > check will be executed and reset will fail as well to remind user.
> >
> > User could set bad_page_threshold = 0 as module parameter when probing
> > driver to disable the bad feature check.
> >
> > Signed-off-by: Guchun Chen 
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c| 21 -
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c   | 25 +-
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h   | 16 +++-
> >  .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c| 87
> ++-
> >  .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h|  6 +-
> >  5 files changed, 142 insertions(+), 13 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index 2662cd7c8685..d529bf7917f5 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -2065,7 +2065,9 @@ static int amdgpu_device_ip_init(struct
> > amdgpu_device *adev)
> >   * Note: theoretically, this should be called before all vram allocations
> >   * to protect retired page from abusing
> >   */
> 
> [Tao] The comment above should be also updated [Guchun]yeah, will update it
> later.
> 
> > -amdgpu_ras_recovery_init(adev);
> > +r = amdgpu_ras_recovery_init(adev);
> > +if (r)
> > +goto init_failed;
> 
> [Tao] Are you sure "r != 0" equals to "bad gpu"? Other errors of recovery_init
> should not block gpu init.
> [Guchun]Good point. This should be addressed carefully.
> >
> >  if (adev->gmc.xgmi.num_physical_nodes > 1)
> > amdgpu_xgmi_add_device(adev); @@ -4133,8 +4135,20 @@ static int
> > amdgpu_do_asic_reset(struct amdgpu_hive_info *hive,
> >
> >  amdgpu_fbdev_set_suspend(tmp_adev, 0);
> >
> > -/* must succeed. */
> > -amdgpu_ras_resume(tmp_adev);
> > +/*
> > + * The CPU is BAD once faulty pages by ECC has
> > + * reached the threshold, and ras recovery is
> > + * scheduled. So add one check here to prevent
> > + * recovery if it's one BAD GPU, and remind
> > + * user to RMA this GPU.
> > + */
> > +if (!amdgpu_ras_check_bad_gpu(tmp_adev)) {
> > +/* must succeed. */
> > +amdgpu_ras_resume(tmp_adev);
> > +} else {
> > +r = -EINVAL;
> > +goto out;
> > +}
> >
> >  /* Update PSP FW topology after reset */  if (hive && tmp_adev-
> > >gmc.xgmi.num_physical_nodes > 1) @@ -4142,7 +4156,6 @@ static int
> > amdgpu_do_asic_reset(struct amdgpu_hive_info *hive,  }  }
> >
> > -
> >  out:
> >  if (!r) {
> >  amdgpu_irq_gpu_reset_resume_helper(tmp_adev);
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > index e3d67d85c55f..818d4154e4a3 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > @@ -62,8 +62,6 @@ const char *ras_block_string[] = {  #define
> > ras_err_str(i)
> > (ras_error_string[ffs(i)])  #define ras_block_str(i)
> > (ras_block_string[i])
> >
> > -#define AMDGPU_RAS_FLAG_INIT_BY_VBIOS1 -#define
> > AMDGPU_RAS_FLAG_INIT_NEED_RESET2  #define RAS_DEFAULT_FLAGS
> > (AMDGPU_RAS_FLAG_INIT_BY_VBIOS)
> >
> >  /* inject address is 52 bits */
> > @@ -1817,6 +1815,7 @@ int amdgpu_ras_recovery_init(struct
> > amdgpu_device
> > *adev)
> >  struct amdgpu_ras *con = amdgpu_ras_get_context(adev);  struct
> > ras_err_handler_data **data;  uint32_t max_eeprom_records_len = 0;
> > +bool bad_gpu = false;
> >  int ret;
> >
> >  if (con)
> > @@ -1838,9 +1837,14 @@ int amdgpu_ras_recovery_init(struct
> > amdgpu_device
> > *adev)
> >  

RE: [PATCH] drm/amd/powerplay: add msg map for mode1 reset

2020-07-22 Thread Quan, Evan
It's OK to have the valid_in_vf setting as 1 for now.
As far as I know, SR-IOV support for sienna_cichlid is not ready yet.
We can revise this when that support lands.

Reviewed-by: Evan Quan 

BR,
Evan
-Original Message-
From: Sheng, Wenhui  
Sent: Tuesday, July 21, 2020 4:09 PM
To: Gao, Likun ; amd-gfx@lists.freedesktop.org
Cc: Zhang, Hawking ; Quan, Evan 
Subject: RE: [PATCH] drm/amd/powerplay: add msg map for mode1 reset

[AMD Official Use Only - Internal Distribution Only]

Commit 81a5f33903a30574 already contains this change; not sure why it
disappeared in the current branch.

And does it make sense to set valid_in_vf to true? Mode1 reset shouldn't be
supported in VF mode, I think.

Brs
Wenhui
-Original Message-
From: Gao, Likun  
Sent: Tuesday, July 21, 2020 1:21 PM
To: amd-gfx@lists.freedesktop.org
Cc: Zhang, Hawking ; Sheng, Wenhui 
; Quan, Evan ; Gao, Likun 

Subject: [PATCH] drm/amd/powerplay: add msg map for mode1 reset

From: Likun Gao 

Mapping Mode1Reset message for sienna_cichlid.

Signed-off-by: Likun Gao 
Change-Id: I9b8d39b10c7723af4589577fdbfa4acd5af6e85d
---
 drivers/gpu/drm/amd/powerplay/sienna_cichlid_ppt.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/amd/powerplay/sienna_cichlid_ppt.c 
b/drivers/gpu/drm/amd/powerplay/sienna_cichlid_ppt.c
index cae8e776fafe..bf3d6bbba930 100644
--- a/drivers/gpu/drm/amd/powerplay/sienna_cichlid_ppt.c
+++ b/drivers/gpu/drm/amd/powerplay/sienna_cichlid_ppt.c
@@ -118,6 +118,7 @@ static struct cmn2asic_msg_mapping 
sienna_cichlid_message_map[SMU_MSG_MAX_COUNT]
MSG_MAP(PowerDownJpeg,  PPSMC_MSG_PowerDownJpeg,
   1),
MSG_MAP(BacoAudioD3PME, PPSMC_MSG_BacoAudioD3PME,   
   1),
MSG_MAP(ArmD3,  PPSMC_MSG_ArmD3,
   1),
+   MSG_MAP(Mode1Reset, PPSMC_MSG_Mode1Reset,   
   1),
 };
 
 static struct cmn2asic_mapping sienna_cichlid_clk_map[SMU_CLK_COUNT] = {
-- 
2.25.1
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [PATCH 1/5] drm/amdgpu: add bad page count threshold in module parameter

2020-07-22 Thread Chen, Guchun
[AMD Public Use]

Hi Dennis,

To be honest, I considered your suggestion when I started the design. My thought
is that, in the real world, the bad page threshold is a static configuration
that should be set once at probe time.
So a module parameter is the ideal choice for this.
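For example (illustrative only), the policy would typically be pinned at probe
time with the parameter from patch 1/5, either as
amdgpu.bad_page_threshold=100 on the kernel command line (explicit threshold)
or with "modprobe amdgpu bad_page_threshold=0" to disable the bad-GPU
retirement check entirely.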

Regards,
Guchun

-Original Message-
From: Li, Dennis  
Sent: Thursday, July 23, 2020 8:32 AM
To: Chen, Guchun ; amd-gfx@lists.freedesktop.org; Deucher, 
Alexander ; Zhang, Hawking ; 
Yang, Stanley ; Zhou1, Tao ; Clements, 
John 
Subject: RE: [PATCH 1/5] drm/amdgpu: add bad page count threshold in module 
parameter

[AMD Official Use Only - Internal Distribution Only]

Hi, Guchun,
  It is better to let the user change amdgpu_bad_page_threshold via sysfs, so
that users do not need to reboot the system when they want to change their
strategy.

Best Regards
Dennis Li
-Original Message-
From: Chen, Guchun 
Sent: Wednesday, July 22, 2020 11:14 AM
To: amd-gfx@lists.freedesktop.org; Deucher, Alexander 
; Zhang, Hawking ; Li, Dennis 
; Yang, Stanley ; Zhou1, Tao 
; Clements, John 
Cc: Chen, Guchun 
Subject: [PATCH 1/5] drm/amdgpu: add bad page count threshold in module 
parameter

bad_page_threshold can be specified to detect and retire a bad GPU if the
number of faulty pages exceeds it.

When it's -1, RAS will use a typical bad page failure value.

Signed-off-by: Guchun Chen 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 11 +++
 2 files changed, 12 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 06bfb8658dec..bb83ffb5e26a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -181,6 +181,7 @@ extern uint amdgpu_dm_abm_level;  extern struct 
amdgpu_mgpu_info mgpu_info;  extern int amdgpu_ras_enable;  extern uint 
amdgpu_ras_mask;
+extern int amdgpu_bad_page_threshold;
 extern int amdgpu_async_gfx_ring;
 extern int amdgpu_mcbp;
 extern int amdgpu_discovery;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index d28b95f721c4..f99671101746 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -161,6 +161,7 @@ struct amdgpu_mgpu_info mgpu_info = {  };  int 
amdgpu_ras_enable = -1;  uint amdgpu_ras_mask = 0x;
+int amdgpu_bad_page_threshold = -1;
 
 /**
  * DOC: vramlimit (int)
@@ -801,6 +802,16 @@ module_param_named(tmz, amdgpu_tmz, int, 0444);  
MODULE_PARM_DESC(reset_method, "GPU reset method (-1 = auto (default), 0 = 
legacy, 1 = mode0, 2 = mode1, 3 = mode2, 4 = baco)");  
module_param_named(reset_method, amdgpu_reset_method, int, 0444);
 
+/**
+ * DOC: bad_page_threshold (int)
+ * Bad page threshold configuration is driven by RMA(Return Merchandise
+ * Authorization) policy, which is to specify the threshold value of 
+faulty
+ * pages detected by ECC, which may result in GPU's retirement if total
+ * faulty pages by ECC exceed threshold value.
+ */
+MODULE_PARM_DESC(bad_page_threshold, "Bad page threshold(-1 = 
+auto(default typical value))"); module_param_named(bad_page_threshold,
+amdgpu_bad_page_threshold, int, 0444);
+
 static const struct pci_device_id pciidlist[] = {  #ifdef  CONFIG_DRM_AMDGPU_SI
{0x1002, 0x6780, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_TAHITI},
--
2.17.1
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [PATCH] drm/amd/powerplay: skip invalid msg when smu set mp1 state

2020-07-22 Thread Quan, Evan
[AMD Official Use Only - Internal Distribution Only]

Reviewed-by: Evan Quan 

-Original Message-
From: Gao, Likun 
Sent: Tuesday, July 21, 2020 2:30 PM
To: amd-gfx@lists.freedesktop.org
Cc: Zhang, Hawking ; Sheng, Wenhui 
; Quan, Evan ; Gao, Likun 

Subject: [PATCH] drm/amd/powerplay: skip invalid msg when smu set mp1 state

From: Likun Gao 

Some ASICs may not support some of the set-mp1-state messages.
If smu_send_smc_msg returns -EINVAL, it failed to send the message to the SMC
because no valid message could be mapped for that ASIC. In that case,
smu_set_mp1_state should be skipped, as those ASICs in fact do not support it.

Signed-off-by: Likun Gao 
Change-Id: I31b40b87532a1d7549b26155d4ec8145b5e3f101
---
 drivers/gpu/drm/amd/powerplay/amdgpu_smu.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c 
b/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c
index b197dcaed064..237d8ab8b40d 100644
--- a/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c
+++ b/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c
@@ -1590,6 +1590,9 @@ int smu_set_mp1_state(struct smu_context *smu,
 }

 ret = smu_send_smc_msg(smu, msg, NULL);
+/* some asics may not support those messages */
+if (ret == -EINVAL)
+ret = 0;
 if (ret)
 dev_err(smu->adev->dev, "[PrepareMp1] Failed!\n");

--
2.25.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [PATCH 2/5] drm/amdgpu: validate bad page threshold in ras

2020-07-22 Thread Chen, Guchun
[AMD Public Use]

Thanks for review, Stanley.

Re: [Yang, Stanley] : It's better to compare con->bad_page_cnt_threshold with 
max_length, the value of bad_page_cnt_threshold should not exceed max_length.

Correct, one guard is necessary. It will be addressed in patch v2.
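
For reference, with the typical rate used in the hunk below (one bad page per
100MB of VRAM), an 8GB board works out to bad_page_cnt_threshold = 8192MB /
100MB ≈ 81, and a 16GB board to ≈ 163.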

Regards,
Guchun

-Original Message-
From: Yang, Stanley  
Sent: Wednesday, July 22, 2020 3:52 PM
To: Chen, Guchun ; amd-gfx@lists.freedesktop.org; Deucher, 
Alexander ; Zhang, Hawking ; 
Li, Dennis ; Zhou1, Tao ; Clements, John 

Subject: RE: [PATCH 2/5] drm/amdgpu: validate bad page threshold in ras

[AMD Official Use Only - Internal Distribution Only]

Hi Guchun,

Please see my comment inline.

> -Original Message-
> From: Chen, Guchun 
> Sent: Wednesday, July 22, 2020 11:14 AM
> To: amd-gfx@lists.freedesktop.org; Deucher, Alexander 
> ; Zhang, Hawking ; 
> Li, Dennis ; Yang, Stanley ; 
> Zhou1, Tao ; Clements, John 
> Cc: Chen, Guchun 
> Subject: [PATCH 2/5] drm/amdgpu: validate bad page threshold in ras
> 
> Bad page threshold value should be valid in the range between
> -1 and max records length of eeprom. It could determine when the GPU 
> should be retired.
> 
> Signed-off-by: Guchun Chen 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c   | 43
> +++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h   |  3 ++
>  .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c|  5 +++
>  .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h|  2 +
>  4 files changed, 53 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 6f06e1214622..e3d67d85c55f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -69,6 +69,9 @@ const char *ras_block_string[] = {
>  /* inject address is 52 bits */
>  #define  RAS_UMC_INJECT_ADDR_LIMIT   (0x1ULL << 52)
> 
> +/* typical ECC bad page rate (1 bad page per 100MB VRAM) */
> +#define RAS_BAD_PAGE_RATE (100 * 1024 * 1024ULL)
> +
>  enum amdgpu_ras_retire_page_reservation {
>   AMDGPU_RAS_RETIRE_PAGE_RESERVED,
>   AMDGPU_RAS_RETIRE_PAGE_PENDING,
> @@ -1700,6 +1703,42 @@ static bool amdgpu_ras_check_bad_page(struct 
> amdgpu_device *adev,
>   return ret;
>  }
> 
> +static void amdgpu_ras_validate_threshold(struct amdgpu_device *adev,
> + uint32_t max_length)
> +{
> + struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> + int tmp_threshold = amdgpu_bad_page_threshold;
> + u64 val;
> +
> + /*
> +  * Justification of value bad_page_cnt_threshold in ras structure
> +  *
> +  * 1. -1 <= amdgpu_bad_page_threshold <= max record length in
> eeprom
> +  * 2. if amdgpu_bad_page_threshold = -1,
> +  *bad_page_cnt_threshold = typical value by formula.
> +  * 3. if amdgpu_bad_page_threshold = 0,
> +  *bad_page_cnt_threshold = 0x,
> +  *and disable RMA feature accordingly.
> +  * 4. use the value specified from user when
> (amdgpu_bad_page_threshold
> +  *> 0 && < max record length in eeprom).
> +  */
> +
> + if (tmp_threshold < -1)
> + tmp_threshold = -1;
> + else if (tmp_threshold > max_length)
> + tmp_threshold = max_length;
> +
> + if (tmp_threshold == -1) {
> + val = adev->gmc.mc_vram_size;
> + do_div(val, RAS_BAD_PAGE_RATE);
> + con->bad_page_cnt_threshold = lower_32_bits(val);
[Yang, Stanley] : It's better to compare con->bad_page_cnt_threshold with 
max_length, the value of bad_page_cnt_threshold should not exceed max_length.

> + } else if (tmp_threshold == 0) {
> + con->bad_page_cnt_threshold = 0x;
> + } else {
> + con->bad_page_cnt_threshold = tmp_threshold;
> + }
> +}
> +
>  /* called in gpu recovery/init */
>  int amdgpu_ras_reserve_bad_pages(struct amdgpu_device *adev)  { @@ -
> 1777,6 +1816,7 @@ int amdgpu_ras_recovery_init(struct amdgpu_device
> *adev)  {
>   struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
>   struct ras_err_handler_data **data;
> + uint32_t max_eeprom_records_len = 0;
>   int ret;
> 
>   if (con)
> @@ -1795,6 +1835,9 @@ int amdgpu_ras_recovery_init(struct 
> amdgpu_device *adev)
>   atomic_set(&con->in_recovery, 0);
>   con->adev = adev;
> 
> + max_eeprom_records_len =
> amdgpu_ras_eeprom_get_record_max_length();
> + amdgpu_ras_validate_threshold(adev, max_eeprom_records_len);
> +
>   ret = amdgpu_ras_eeprom_init(&con->eeprom_control);
>   if (ret)
>   goto free;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> index b2667342cf67..4672649a9293 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> @@ -336,6 +336,9 @@ struct amdgpu_ras {
>   struct amdgpu_ras_eeprom_control eeprom_control;
> 
>   bool error_query_ready;
> +
> + /* bad page 

RE: [PATCH 3/5] drm/amdgpu: conduct bad gpu check during bootup/reset

2020-07-22 Thread Chen, Guchun
[AMD Public Use]

Thanks for your review, Tao.
Please check my comments after yours.

Regards,
Guchun

-Original Message-
From: Zhou1, Tao  
Sent: Thursday, July 23, 2020 10:51 AM
To: Chen, Guchun ; amd-gfx@lists.freedesktop.org; Deucher, 
Alexander ; Zhang, Hawking ; 
Li, Dennis ; Yang, Stanley ; Clements, 
John 
Subject: RE: [PATCH 3/5] drm/amdgpu: conduct bad gpu check during bootup/reset

[AMD Official Use Only - Internal Distribution Only]

> -Original Message-
> From: Chen, Guchun 
> Sent: Wednesday, July 22, 2020 11:14 AM
> To: amd-gfx@lists.freedesktop.org; Deucher, Alexander 
> ; Zhang, Hawking ; 
> Li, Dennis ; Yang, Stanley ; 
> Zhou1, Tao ; Clements, John 
> Cc: Chen, Guchun 
> Subject: [PATCH 3/5] drm/amdgpu: conduct bad gpu check during 
> bootup/reset
>
> During boot up, when init eeprom, RAS will check if the BAD GPU tag is 
> saved or not. And will break boot up and notify user to RMA it.
>
> At the meanwhile, when saved bad page count to eeprom exceeds 
> threshold, one ras recovery will be triggered immediately, and bad gpu 
> check will be executed and reset will fail as well to remind user.
>
> User could set bad_page_threshold = 0 as module parameter when probing 
> driver to disable the bad feature check.
>
> Signed-off-by: Guchun Chen 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c| 21 -
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c   | 25 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h   | 16 +++-
>  .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c| 87 ++-
>  .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h|  6 +-
>  5 files changed, 142 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 2662cd7c8685..d529bf7917f5 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2065,7 +2065,9 @@ static int amdgpu_device_ip_init(struct 
> amdgpu_device *adev)
>   * Note: theoretically, this should be called before all vram allocations
>   * to protect retired page from abusing
>   */

[Tao] The comment above should be also updated
[Guchun]yeah, will update it later.

> -amdgpu_ras_recovery_init(adev);
> +r = amdgpu_ras_recovery_init(adev);
> +if (r)
> +goto init_failed;

[Tao] Are you sure "r != 0" equals to "bad gpu"? Other errors of recovery_init 
should not block gpu init.
[Guchun]Good point. This should be addressed carefully.
>
>  if (adev->gmc.xgmi.num_physical_nodes > 1)  
> amdgpu_xgmi_add_device(adev); @@ -4133,8 +4135,20 @@ static int 
> amdgpu_do_asic_reset(struct amdgpu_hive_info *hive,
>
>  amdgpu_fbdev_set_suspend(tmp_adev, 0);
>
> -/* must succeed. */
> -amdgpu_ras_resume(tmp_adev);
> +/*
> + * The CPU is BAD once faulty pages by ECC has
> + * reached the threshold, and ras recovery is
> + * scheduled. So add one check here to prevent
> + * recovery if it's one BAD GPU, and remind
> + * user to RMA this GPU.
> + */
> +if (!amdgpu_ras_check_bad_gpu(tmp_adev)) {
> +/* must succeed. */
> +amdgpu_ras_resume(tmp_adev);
> +} else {
> +r = -EINVAL;
> +goto out;
> +}
>
>  /* Update PSP FW topology after reset */  if (hive && tmp_adev-
> >gmc.xgmi.num_physical_nodes > 1) @@ -4142,7 +4156,6 @@ static int
> amdgpu_do_asic_reset(struct amdgpu_hive_info *hive,  }  }
>
> -
>  out:
>  if (!r) {
>  amdgpu_irq_gpu_reset_resume_helper(tmp_adev);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index e3d67d85c55f..818d4154e4a3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -62,8 +62,6 @@ const char *ras_block_string[] = {  #define ras_err_str(i)
> (ras_error_string[ffs(i)])  #define ras_block_str(i) (ras_block_string[i])
>
> -#define AMDGPU_RAS_FLAG_INIT_BY_VBIOS 1
> -#define AMDGPU_RAS_FLAG_INIT_NEED_RESET 2
>  #define RAS_DEFAULT_FLAGS (AMDGPU_RAS_FLAG_INIT_BY_VBIOS)
>
>  /* inject address is 52 bits */
> @@ -1817,6 +1815,7 @@ int amdgpu_ras_recovery_init(struct amdgpu_device
> *adev)
>  struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
>  struct ras_err_handler_data **data;
>  uint32_t max_eeprom_records_len = 0;
> +bool bad_gpu = false;
>  int ret;
>
>  if (con)
> @@ -1838,9 +1837,14 @@ int amdgpu_ras_recovery_init(struct amdgpu_device
> *adev)
>  max_eeprom_records_len =
> amdgpu_ras_eeprom_get_record_max_length();
>  amdgpu_ras_validate_threshold(adev, max_eeprom_records_len);
>
> -ret = amdgpu_ras_eeprom_init(&con->eeprom_control);
> -if (ret)
> +ret = amdgpu_ras_eeprom_init(&con->eeprom_control, &bad_gpu);
> +/*
> + * we only fail this calling and halt booting when bad_gpu is true.
> + */
> +if (bad_gpu) {
> +ret = -EINVAL;
>  goto free;
> +}
>
>  if (con->eeprom_control.num_recs) {
>  ret = amdgpu_ras_load_bad_pages(adev); @@ -2189,3
> +2193,16 @@ bool amdgpu_ras_need_emergency_restart(struct
> amdgpu_device *adev)
>
>  return false;
>  }
> +
> +bool amdgpu_ras_check_bad_gpu(struct 

[pull] amdgpu drm-fixes-5.8

2020-07-22 Thread Alex Deucher
Hi Dave, Daniel,

Couple of fixes for 5.8.

The following changes since commit adbe8a3cae94a63e9f416795c750237a9b789124:

  Merge tag 'amd-drm-fixes-5.8-2020-07-15' of 
git://people.freedesktop.org/~agd5f/linux into drm-fixes (2020-07-17 13:29:00 
+1000)

are available in the Git repository at:

  git://people.freedesktop.org/~agd5f/linux tags/amd-drm-fixes-5.8-2020-07-22

for you to fetch changes up to 38e0c89a19fd13f28d2b4721035160a3e66e270b:

  drm/amdgpu: Fix NULL dereference in dpm sysfs handlers (2020-07-21 16:00:01 
-0400)


amd-drm-fixes-5.8-2020-07-22:

amdgpu:
- Fix crash when overclocking VegaM
- Fix possible crash when editing dpm levels


Paweł Gronowski (1):
  drm/amdgpu: Fix NULL dereference in dpm sysfs handlers

Qiu Wenbo (1):
  drm/amd/powerplay: fix a crash when overclocking Vega M

 drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c  |  9 +++--
 drivers/gpu/drm/amd/powerplay/smumgr/vegam_smumgr.c | 10 ++
 2 files changed, 9 insertions(+), 10 deletions(-)
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [PATCH 3/5] drm/amdgpu: conduct bad gpu check during bootup/reset

2020-07-22 Thread Zhou1, Tao
[AMD Official Use Only - Internal Distribution Only]

> -Original Message-
> From: Chen, Guchun 
> Sent: Wednesday, July 22, 2020 11:14 AM
> To: amd-gfx@lists.freedesktop.org; Deucher, Alexander
> ; Zhang, Hawking ;
> Li, Dennis ; Yang, Stanley ;
> Zhou1, Tao ; Clements, John
> 
> Cc: Chen, Guchun 
> Subject: [PATCH 3/5] drm/amdgpu: conduct bad gpu check during bootup/reset
>
> During boot up, when init eeprom, RAS will check if the BAD GPU tag is saved 
> or
> not. And will break boot up and notify user to RMA it.
>
> At the meanwhile, when saved bad page count to eeprom exceeds threshold,
> one ras recovery will be triggered immediately, and bad gpu check will be
> executed and reset will fail as well to remind user.
>
> User could set bad_page_threshold = 0 as module parameter when probing
> driver to disable the bad feature check.
>
> Signed-off-by: Guchun Chen 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c| 21 -
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c   | 25 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h   | 16 +++-
>  .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c| 87 ++-
>  .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h|  6 +-
>  5 files changed, 142 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 2662cd7c8685..d529bf7917f5 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2065,7 +2065,9 @@ static int amdgpu_device_ip_init(struct
> amdgpu_device *adev)
>   * Note: theoretically, this should be called before all vram allocations
>   * to protect retired page from abusing
>   */

[Tao] The comment above should be also updated

> -amdgpu_ras_recovery_init(adev);
> +r = amdgpu_ras_recovery_init(adev);
> +if (r)
> +goto init_failed;

[Tao] Are you sure "r != 0" equals to "bad gpu"? Other errors of recovery_init 
should not block gpu init.

>
>  if (adev->gmc.xgmi.num_physical_nodes > 1)
>  amdgpu_xgmi_add_device(adev);
> @@ -4133,8 +4135,20 @@ static int amdgpu_do_asic_reset(struct
> amdgpu_hive_info *hive,
>
>  amdgpu_fbdev_set_suspend(tmp_adev, 0);
>
> -/* must succeed. */
> -amdgpu_ras_resume(tmp_adev);
> +/*
> + * The CPU is BAD once faulty pages by ECC has
> + * reached the threshold, and ras recovery is
> + * scheduled. So add one check here to prevent
> + * recovery if it's one BAD GPU, and remind
> + * user to RMA this GPU.
> + */
> +if (!amdgpu_ras_check_bad_gpu(tmp_adev)) {
> +/* must succeed. */
> +amdgpu_ras_resume(tmp_adev);
> +} else {
> +r = -EINVAL;
> +goto out;
> +}
>
>  /* Update PSP FW topology after reset */
>  if (hive && tmp_adev-
> >gmc.xgmi.num_physical_nodes > 1) @@ -4142,7 +4156,6 @@ static int
> amdgpu_do_asic_reset(struct amdgpu_hive_info *hive,
>  }
>  }
>
> -
>  out:
>  if (!r) {
>  amdgpu_irq_gpu_reset_resume_helper(tmp_adev);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index e3d67d85c55f..818d4154e4a3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -62,8 +62,6 @@ const char *ras_block_string[] = {  #define ras_err_str(i)
> (ras_error_string[ffs(i)])  #define ras_block_str(i) (ras_block_string[i])
>
> -#define AMDGPU_RAS_FLAG_INIT_BY_VBIOS 1
> -#define AMDGPU_RAS_FLAG_INIT_NEED_RESET 2
>  #define RAS_DEFAULT_FLAGS (AMDGPU_RAS_FLAG_INIT_BY_VBIOS)
>
>  /* inject address is 52 bits */
> @@ -1817,6 +1815,7 @@ int amdgpu_ras_recovery_init(struct amdgpu_device
> *adev)
>  struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
>  struct ras_err_handler_data **data;
>  uint32_t max_eeprom_records_len = 0;
> +bool bad_gpu = false;
>  int ret;
>
>  if (con)
> @@ -1838,9 +1837,14 @@ int amdgpu_ras_recovery_init(struct amdgpu_device
> *adev)
>  max_eeprom_records_len =
> amdgpu_ras_eeprom_get_record_max_length();
>  amdgpu_ras_validate_threshold(adev, max_eeprom_records_len);
>
> -ret = amdgpu_ras_eeprom_init(&con->eeprom_control);
> -if (ret)
> +ret = amdgpu_ras_eeprom_init(&con->eeprom_control, &bad_gpu);
> +/*
> + * we only fail this calling and halt booting when bad_gpu is true.
> + */
> +if (bad_gpu) {
> +ret = -EINVAL;
>  goto free;
> +}
>
>  if (con->eeprom_control.num_recs) {
>  ret = amdgpu_ras_load_bad_pages(adev); @@ -2189,3
> +2193,16 @@ bool amdgpu_ras_need_emergency_restart(struct
> amdgpu_device *adev)
>
>  return false;
>  }
> +
> +bool amdgpu_ras_check_bad_gpu(struct amdgpu_device *adev) {
> +struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> +bool bad_gpu = false;
> +
> +if (con && (con->bad_page_cnt_threshold != 0x))
> +amdgpu_ras_eeprom_check_bad_gpu(&con->eeprom_control,
> +&bad_gpu);
> +
> +/* We are only interested in variable bad_gpu. */
> +return bad_gpu;
> +}
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> index 4672649a9293..d7a363b37feb 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h

RE: [PATCH 1/5] drm/amdgpu: add bad page count threshold in module parameter

2020-07-22 Thread Li, Dennis
[AMD Official Use Only - Internal Distribution Only]

Hi, Guchun,
  It is better to let the user change amdgpu_bad_page_threshold via sysfs, so
that users do not need to reboot the system when they want to change their
strategy.

Best Regards
Dennis Li
-Original Message-
From: Chen, Guchun  
Sent: Wednesday, July 22, 2020 11:14 AM
To: amd-gfx@lists.freedesktop.org; Deucher, Alexander 
; Zhang, Hawking ; Li, Dennis 
; Yang, Stanley ; Zhou1, Tao 
; Clements, John 
Cc: Chen, Guchun 
Subject: [PATCH 1/5] drm/amdgpu: add bad page count threshold in module 
parameter

bad_page_threshold can be specified to detect and retire a bad GPU if the
number of faulty pages exceeds it.

When it's -1, RAS will use a typical bad page failure value.

Signed-off-by: Guchun Chen 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 11 +++
 2 files changed, 12 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 06bfb8658dec..bb83ffb5e26a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -181,6 +181,7 @@ extern uint amdgpu_dm_abm_level;  extern struct 
amdgpu_mgpu_info mgpu_info;  extern int amdgpu_ras_enable;  extern uint 
amdgpu_ras_mask;
+extern int amdgpu_bad_page_threshold;
 extern int amdgpu_async_gfx_ring;
 extern int amdgpu_mcbp;
 extern int amdgpu_discovery;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index d28b95f721c4..f99671101746 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -161,6 +161,7 @@ struct amdgpu_mgpu_info mgpu_info = {  };  int 
amdgpu_ras_enable = -1;  uint amdgpu_ras_mask = 0x;
+int amdgpu_bad_page_threshold = -1;
 
 /**
  * DOC: vramlimit (int)
@@ -801,6 +802,16 @@ module_param_named(tmz, amdgpu_tmz, int, 0444);  
MODULE_PARM_DESC(reset_method, "GPU reset method (-1 = auto (default), 0 = 
legacy, 1 = mode0, 2 = mode1, 3 = mode2, 4 = baco)");  
module_param_named(reset_method, amdgpu_reset_method, int, 0444);
 
+/**
+ * DOC: bad_page_threshold (int)
+ * Bad page threshold configuration is driven by RMA(Return Merchandise
+ * Authorization) policy, which is to specify the threshold value of 
+faulty
+ * pages detected by ECC, which may result in GPU's retirement if total
+ * faulty pages by ECC exceed threshold value.
+ */
+MODULE_PARM_DESC(bad_page_threshold, "Bad page threshold(-1 = 
+auto(default typical value))"); module_param_named(bad_page_threshold, 
+amdgpu_bad_page_threshold, int, 0444);
+
 static const struct pci_device_id pciidlist[] = {  #ifdef  CONFIG_DRM_AMDGPU_SI
{0x1002, 0x6780, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_TAHITI},
--
2.17.1
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [PATCH] drm/amdgpu: update dec ring test for VCN 3.0

2020-07-22 Thread Liu, Leo
[AMD Official Use Only - Internal Distribution Only]

Reviewed-by: Leo Liu 


From: Zhang, Boyuan 
Sent: July 22, 2020 5:50 PM
To: amd-gfx@lists.freedesktop.org
Cc: Liu, Leo 
Subject: [PATCH] drm/amdgpu: update dec ring test for VCN 3.0


[AMD Official Use Only - Internal Distribution Only]


To enable SW ring for VCN 3.0

Signed-off-by: Boyuan Zhang <boyuan.zh...@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c 
b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
index ddc1c43e09a8..8adebb3b2a3f 100644
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
@@ -1372,7 +1372,7 @@ static const struct amdgpu_ring_funcs 
vcn_v3_0_dec_ring_vm_funcs = {
  .emit_ib = vcn_v2_0_dec_ring_emit_ib,
  .emit_fence = vcn_v2_0_dec_ring_emit_fence,
  .emit_vm_flush = vcn_v2_0_dec_ring_emit_vm_flush,
- .test_ring = amdgpu_vcn_dec_ring_test_ring,
+ .test_ring = vcn_v2_0_dec_ring_test_ring,
  .test_ib = amdgpu_vcn_dec_ring_test_ib,
  .insert_nop = vcn_v2_0_dec_ring_insert_nop,
  .insert_start = vcn_v2_0_dec_ring_insert_start,
--
2.17.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH] drm/amdgpu: update dec ring test for VCN 3.0

2020-07-22 Thread Zhang, Boyuan
[AMD Official Use Only - Internal Distribution Only]


To enable SW ring for VCN 3.0

Signed-off-by: Boyuan Zhang <boyuan.zh...@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c 
b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
index ddc1c43e09a8..8adebb3b2a3f 100644
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
@@ -1372,7 +1372,7 @@ static const struct amdgpu_ring_funcs 
vcn_v3_0_dec_ring_vm_funcs = {
  .emit_ib = vcn_v2_0_dec_ring_emit_ib,
  .emit_fence = vcn_v2_0_dec_ring_emit_fence,
  .emit_vm_flush = vcn_v2_0_dec_ring_emit_vm_flush,
- .test_ring = amdgpu_vcn_dec_ring_test_ring,
+ .test_ring = vcn_v2_0_dec_ring_test_ring,
  .test_ib = amdgpu_vcn_dec_ring_test_ib,
  .insert_nop = vcn_v2_0_dec_ring_insert_nop,
  .insert_start = vcn_v2_0_dec_ring_insert_start,
--
2.17.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH v6 01/12] iommu: Change type of pasid to u32

2020-07-22 Thread Fenghua Yu
Hi, Joerg,

On Wed, Jul 22, 2020 at 04:03:40PM +0200, Joerg Roedel wrote:
> On Mon, Jul 13, 2020 at 04:47:56PM -0700, Fenghua Yu wrote:
> > PASID is defined as a few different types in iommu including "int",
> > "u32", and "unsigned int". To be consistent and to match with uapi
> > definitions, define PASID and its variations (e.g. max PASID) as "u32".
> > "u32" is also shorter and a little more explicit than "unsigned int".
> > 
> > No PASID type change in uapi although it defines PASID as __u64 in
> > some places.
> > 
> > Suggested-by: Thomas Gleixner 
> > Signed-off-by: Fenghua Yu 
> > Reviewed-by: Tony Luck 
> > Reviewed-by: Lu Baolu 
> > Acked-by: Felix Kuehling 
> 
> For the IOMMU parts:
> 
> Acked-by: Joerg Roedel 

Thank you!

-Fenghua 
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] drm/amdgpu/dc: Simplify drm_crtc_state::active checks

2020-07-22 Thread Alex Deucher
On Wed, Jul 22, 2020 at 10:25 AM Michel Dänzer  wrote:
>
> On 2020-07-22 3:10 p.m., Kazlauskas, Nicholas wrote:
> > On 2020-07-22 8:51 a.m., Daniel Vetter wrote:
> >> On Wed, Jul 22, 2020 at 2:38 PM Michel Dänzer  wrote:
> >>>
> >>> From: Michel Dänzer 
> >>>
> >>> drm_atomic_crtc_check enforces that ::active can only be true if
> >>> ::enable is as well.
> >>>
> >>> Signed-off-by: Michel Dänzer 
> >
> > Looks fine to me. The check is sufficiently old enough that I don't mind
> > relying on the core for this either.
> >
> > Reviewed-by: Nicholas Kazlauskas 
> >
> >>
> >> modeset vs modereset is a bit an inglorious name choice ... since this
> >> seems to be glue code and not part of core dc, maybe rename to
> >> enable_required/disable_required to keep it consistent with the
> >> wording atomic helpers use? DC also seems to use reset for a lot of
> >> other things already (state reset, like atomic, or gpu reset like
> >> drm/scheduler's td_r_), so I think this would also help clarity from a
> >> DC perspective.
> >>
> >> Patch itself is good, above just an idea for another patch on top.
> >>
> >> Reviewed-by: Daniel Vetter 
>
> Thanks for the reviews! I assume this will get picked up by a DC
> developer or Alex/Christian.

Applied.  Thanks!

Alex

>
>
> > That sounds like a reasonable idea to me. These are used more as a
> > stream_changed / stream_removed flag, but I don't think these helpers
> > really need to exist at all.
> >
> > That could come as a follow up patch.
>
> Yeah, I'm leaving that to you guys. :)
>
>
> --
> Earthling Michel Dänzer   |   https://redhat.com
> Libre software enthusiast | Mesa and X developer
> ___
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH v2] drm/amdkfd: Add thermal throttling SMI event

2020-07-22 Thread Mukul Joshi
Add support for reporting thermal throttling events through SMI.
Also, add a counter to count the number of throttling interrupts
observed and report the count in the SMI event message.
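
With the format used below, a client reading the SMI fd will see a line of the
form "<KFD_SMI_EVENT_THERMAL_THROTTLE> <gpu_id>:<throttle_bitmask>:<counter>\n",
with all fields printed in hex.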

Signed-off-by: Mukul Joshi 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c|  4 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h|  1 +
 drivers/gpu/drm/amd/amdkfd/kfd_device.c   |  7 ++
 drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c   | 68 ++-
 drivers/gpu/drm/amd/amdkfd/kfd_smi_events.h   |  2 +
 drivers/gpu/drm/amd/powerplay/amdgpu_smu.c|  1 +
 drivers/gpu/drm/amd/powerplay/arcturus_ppt.c  |  1 +
 .../gpu/drm/amd/powerplay/inc/amdgpu_smu.h|  1 +
 drivers/gpu/drm/amd/powerplay/smu_v11_0.c |  5 ++
 include/uapi/linux/kfd_ioctl.h|  3 +-
 10 files changed, 75 insertions(+), 18 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index 1b865fed74ca..19e4658756d8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -755,4 +755,8 @@ void kgd2kfd_interrupt(struct kfd_dev *kfd, const void 
*ih_ring_entry)
 void kgd2kfd_set_sram_ecc_flag(struct kfd_dev *kfd)
 {
 }
+
+void kgd2kfd_smi_event_throttle(struct kfd_dev *kfd, uint32_t throttle_bitmask)
+{
+}
 #endif
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index 3f2b695cf19e..e8b0258aae24 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -269,5 +269,6 @@ int kgd2kfd_resume_mm(struct mm_struct *mm);
 int kgd2kfd_schedule_evict_and_restore_process(struct mm_struct *mm,
   struct dma_fence *fence);
 void kgd2kfd_set_sram_ecc_flag(struct kfd_dev *kfd);
+void kgd2kfd_smi_event_throttle(struct kfd_dev *kfd, uint32_t 
throttle_bitmask);
 
 #endif /* AMDGPU_AMDKFD_H_INCLUDED */
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 4bfedaab183f..d5e790f046b4 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -29,6 +29,7 @@
 #include "cwsr_trap_handler.h"
 #include "kfd_iommu.h"
 #include "amdgpu_amdkfd.h"
+#include "kfd_smi_events.h"
 
 #define MQD_SIZE_ALIGNED 768
 
@@ -1245,6 +1246,12 @@ void kfd_dec_compute_active(struct kfd_dev *kfd)
WARN_ONCE(count < 0, "Compute profile ref. count error");
 }
 
+void kgd2kfd_smi_event_throttle(struct kfd_dev *kfd, uint32_t throttle_bitmask)
+{
+   if (kfd)
+   kfd_smi_event_update_thermal_throttling(kfd, throttle_bitmask);
+}
+
 #if defined(CONFIG_DEBUG_FS)
 
 /* This function will send a package to HIQ to hang the HWS
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c
index 7b348bf9df21..00c90b47155b 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include "amdgpu.h"
 #include "amdgpu_vm.h"
 #include "kfd_priv.h"
 #include "kfd_smi_events.h"
@@ -148,6 +149,55 @@ static int kfd_smi_ev_release(struct inode *inode, struct 
file *filep)
return 0;
 }
 
+static void add_event_to_kfifo(struct kfd_dev *dev, unsigned long long 
smi_event,
+ char *event_msg, int len)
+{
+   struct kfd_smi_client *client;
+
+   rcu_read_lock();
+
+   list_for_each_entry_rcu(client, &dev->smi_clients, list) {
+   if (!(READ_ONCE(client->events) & smi_event))
+   continue;
+   spin_lock(&client->lock);
+   if (kfifo_avail(&client->fifo) >= len) {
+   kfifo_in(&client->fifo, event_msg, len);
+   wake_up_all(&client->wait_queue);
+   } else {
+   pr_debug("smi_event(EventID: %llu): no space left\n",
+   smi_event);
+   }
+   spin_unlock(&client->lock);
+   }
+
+   rcu_read_unlock();
+}
+
+void kfd_smi_event_update_thermal_throttling(struct kfd_dev *dev,
+uint32_t throttle_bitmask)
+{
+   struct amdgpu_device *adev = (struct amdgpu_device *)dev->kgd;
+   /*
+* ThermalThrottle msg = gpu_id(4):throttle_bitmask(4):
+*   thermal_interrupt_count(8):
+* 16 bytes event + 1 byte space + 4 bytes gpu_id + 1 byte : +
+* 4 byte throttle_bitmask + 1 byte : +
+* 8 byte thermal_interupt_counter + 1 byte \n = 36
+*/
+   char fifo_in[36];
+   int len;
+
+   if (list_empty(&dev->smi_clients))
+   return;
+
+   len = snprintf(fifo_in, 36, "%x %x:%x:%llx\n",
+  KFD_SMI_EVENT_THERMAL_THROTTLE,
+  dev->id, throttle_bitmask,
+  atomic64_read(&adev->smu.throttle_int_counter));
+
+   add_event_to_kfifo(dev, 

Re: [PATCH] drm/amdgpu/dc: Simplify drm_crtc_state::active checks

2020-07-22 Thread Daniel Vetter
On Wed, Jul 22, 2020 at 4:25 PM Michel Dänzer  wrote:
>
> On 2020-07-22 3:10 p.m., Kazlauskas, Nicholas wrote:
> > On 2020-07-22 8:51 a.m., Daniel Vetter wrote:
> >> On Wed, Jul 22, 2020 at 2:38 PM Michel Dänzer  wrote:
> >>>
> >>> From: Michel Dänzer 
> >>>
> >>> drm_atomic_crtc_check enforces that ::active can only be true if
> >>> ::enable is as well.
> >>>
> >>> Signed-off-by: Michel Dänzer 
> >
> > Looks fine to me. The check is sufficiently old enough that I don't mind
> > relying on the core for this either.

"active implies enabled" has been a hard assumption of atomic from day
1. So should work anywhere you have atomic.
-Daniel

> > Reviewed-by: Nicholas Kazlauskas 
> >
> >>
> >> modeset vs modereset is a bit an inglorious name choice ... since this
> >> seems to be glue code and not part of core dc, maybe rename to
> >> enable_required/disable_required to keep it consistent with the
> >> wording atomic helpers use? DC also seems to use reset for a lot of
> >> other things already (state reset, like atomic, or gpu reset like
> >> drm/scheduler's td_r_), so I think this would also help clarity from a
> >> DC perspective.
> >>
> >> Patch itself is good, above just an idea for another patch on top.
> >>
> >> Reviewed-by: Daniel Vetter 
>
> Thanks for the reviews! I assume this will get picked up by a DC
> developer or Alex/Christian.
>
>
> > That sounds like a reasonable idea to me. These are used more as a
> > stream_changed / stream_removed flag, but I don't think these helpers
> > really need to exist at all.
> >
> > That could come as a follow up patch.
>
> Yeah, I'm leaving that to you guys. :)
>
>
> --
> Earthling Michel Dänzer   |   https://redhat.com
> Libre software enthusiast | Mesa and X developer



-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea

2020-07-22 Thread Christian König

Am 22.07.20 um 16:30 schrieb Thomas Hellström (Intel):


On 2020-07-22 16:23, Christian König wrote:

Am 22.07.20 um 16:07 schrieb Daniel Vetter:

On Wed, Jul 22, 2020 at 3:12 PM Thomas Hellström (Intel)
 wrote:

On 2020-07-22 14:41, Daniel Vetter wrote:
I'm pretty sure there's more bugs, I just haven't heard from them 
yet.

Also due to the opt-in nature of dma-fence we can limit the scope of
what we fix fairly naturally, just don't put them where no one cares
:-) Of course that also hides general locking issues in dma_fence
signalling code, but well *shrug*.

Hmm, yes. Another potential big problem would be drivers that want to
use gpu page faults in the dma-fence critical sections with the
batch-based programming model.

Yeah that's a massive can of worms. But luckily there's no such driver
merged in upstream, so hopefully we can think about all the
constraints and how to best annotate this before we land any
code and have big regrets.


Do you want some bad news? I once made a prototype for that when Vega10 
came out.


But we abandoned that approach in favor of the batch-based approach 
because of the horrible performance.


In context of the previous discussion I'd consider the fact that it's 
not performant in the batch-based model good news :)


Well the Vega10 had such a horrible page fault performance because it 
was the first generation which enabled it.


Later hardware versions are much better, but we just didn't push for 
this feature on them any more.


But yeah, now you mentioned it we did discuss this locking problem on 
tons of team calls as well.


Our solution at that time was to just not allow waiting if we do any 
allocation in the page fault handler. But this is of course not 
practical for a production environment.


Christian.



Thomas




KFD is going to see that, but this is only with user queues and no 
dma_fence involved whatsoever.


Christian.


-Daniel



--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch 


___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx 



___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea

2020-07-22 Thread Intel


On 2020-07-22 16:23, Christian König wrote:

Am 22.07.20 um 16:07 schrieb Daniel Vetter:

On Wed, Jul 22, 2020 at 3:12 PM Thomas Hellström (Intel)
 wrote:

On 2020-07-22 14:41, Daniel Vetter wrote:

I'm pretty sure there's more bugs, I just haven't heard from them yet.
Also due to the opt-in nature of dma-fence we can limit the scope of
what we fix fairly naturally, just don't put them where no one cares
:-) Of course that also hides general locking issues in dma_fence
signalling code, but well *shrug*.

Hmm, yes. Another potential big problem would be drivers that want to
use gpu page faults in the dma-fence critical sections with the
batch-based programming model.

Yeah that's a massive can of worms. But luckily there's no such driver
merged in upstream, so hopefully we can think about all the
constraints and how to best annotate this before we land any
code and have big regrets.


Do you want some bad news? I once made a prototype for that when Vega10 
came out.


But we abandoned that approach in favor of the batch-based approach 
because of the horrible performance.


In context of the previous discussion I'd consider the fact that it's 
not performant in the batch-based model good news :)


Thomas




KFD is going to see that, but this is only with user queues and no 
dma_fence involved whatsoever.


Christian.


-Daniel



--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [PATCH 5/5] drm/amdgpu: calculate actual size instead of hardcode size

2020-07-22 Thread Chen, Guchun
[AMD Public Use]

Hi Andrey,

Aha, thanks for the reminder, I missed that comment. Let me update it later.

Regards,
Guchun

-Original Message-
From: Grodzovsky, Andrey  
Sent: Wednesday, July 22, 2020 10:26 PM
To: Chen, Guchun ; amd-gfx@lists.freedesktop.org; Deucher, 
Alexander ; Zhang, Hawking ; 
Li, Dennis ; Yang, Stanley ; Zhou1, 
Tao ; Clements, John 
Subject: Re: [PATCH 5/5] drm/amdgpu: calculate actual size instead of hardcode 
size


On 7/21/20 11:14 PM, Guchun Chen wrote:
> Use sizeof to get actual size.
>
> Signed-off-by: Guchun Chen 
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 6 ++
>   1 file changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> index 96b63c026bad..a5da108e43c3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> @@ -34,11 +34,9 @@
>   /*
>* The 2 macros bellow represent the actual size in bytes that
>* those entities occupy in the EEPROM memory.
> - * EEPROM_TABLE_RECORD_SIZE is different than 
> sizeof(eeprom_table_record) which
> - * uses uint64 to store 6b fields such as retired_page.


Did you find out that the comment above was wrong?

Andrey


>*/
> -#define EEPROM_TABLE_HEADER_SIZE 20
> -#define EEPROM_TABLE_RECORD_SIZE 24
> +#define EEPROM_TABLE_HEADER_SIZE (sizeof(struct 
> +amdgpu_ras_eeprom_table_header)) #define EEPROM_TABLE_RECORD_SIZE 
> +(sizeof(struct eeprom_table_record))
>   
>   #define EEPROM_ADDRESS_SIZE 0x2
>   
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH 5/5] drm/amdgpu: calculate actual size instead of hardcode size

2020-07-22 Thread Andrey Grodzovsky



On 7/21/20 11:14 PM, Guchun Chen wrote:

Use sizeof to get actual size.

Signed-off-by: Guchun Chen 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 6 ++
  1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
index 96b63c026bad..a5da108e43c3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
@@ -34,11 +34,9 @@
  /*
   * The 2 macros bellow represent the actual size in bytes that
   * those entities occupy in the EEPROM memory.
- * EEPROM_TABLE_RECORD_SIZE is different than sizeof(eeprom_table_record) which
- * uses uint64 to store 6b fields such as retired_page.



Did you find out that the comment above was wrong?

Andrey
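
For illustration only, with a hypothetical struct (not the real
eeprom_table_record): sizeof() reflects the in-memory layout, including
widened fields and padding, so it need not match a hand-counted
on-EEPROM byte count.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical record: a field that occupies 6 bytes in the EEPROM image
 * is stored as a full uint64_t in memory, and the compiler may add
 * trailing padding.
 */
struct example_record {
        uint64_t retired_page; /* 6 bytes on EEPROM, 8 bytes in memory */
        uint64_t ts;
        uint32_t mem_channel;
        uint8_t  err_type;     /* likely followed by padding */
};

int main(void)
{
        /* layout- and ABI-dependent; the point is only that this can
         * differ from the serialized size
         */
        printf("sizeof(struct example_record) = %zu\n",
               sizeof(struct example_record));
        return 0;
}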



   */
-#define EEPROM_TABLE_HEADER_SIZE 20
-#define EEPROM_TABLE_RECORD_SIZE 24
+#define EEPROM_TABLE_HEADER_SIZE (sizeof(struct 
amdgpu_ras_eeprom_table_header))
+#define EEPROM_TABLE_RECORD_SIZE (sizeof(struct eeprom_table_record))
  
  #define EEPROM_ADDRESS_SIZE 0x2
  

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] drm/amdgpu/dc: Simplify drm_crtc_state::active checks

2020-07-22 Thread Michel Dänzer
On 2020-07-22 3:10 p.m., Kazlauskas, Nicholas wrote:
> On 2020-07-22 8:51 a.m., Daniel Vetter wrote:
>> On Wed, Jul 22, 2020 at 2:38 PM Michel Dänzer  wrote:
>>>
>>> From: Michel Dänzer 
>>>
>>> drm_atomic_crtc_check enforces that ::active can only be true if
>>> ::enable is as well.
>>>
>>> Signed-off-by: Michel Dänzer 
> 
> Looks fine to me. The check is sufficiently old that I don't mind
> relying on the core for this either.
> 
> Reviewed-by: Nicholas Kazlauskas 
> 
>>
>> modeset vs modereset is a bit of an inglorious name choice ... since this
>> seems to be glue code and not part of core dc, maybe rename to
>> enable_required/disable_required to keep it consistent with the
>> wording atomic helpers use? DC also seems to use reset for a lot of
>> other things already (state reset, like atomic, or gpu reset like
>> drm/scheduler's tdr), so I think this would also help clarity from a
>> DC perspective.
>>
>> Patch itself is good, above just an idea for another patch on top.
>>
>> Reviewed-by: Daniel Vetter 

Thanks for the reviews! I assume this will get picked up by a DC
developer or Alex/Christian.


> That sounds like a reasonable idea to me. These are used more as a
> stream_changed / stream_removed flag, but I don't think these helpers
> really need to exist at all.
> 
> That could come as a follow up patch.

Yeah, I'm leaving that to you guys. :)


-- 
Earthling Michel Dänzer   |   https://redhat.com
Libre software enthusiast | Mesa and X developer
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea

2020-07-22 Thread Christian König

Am 22.07.20 um 16:07 schrieb Daniel Vetter:

On Wed, Jul 22, 2020 at 3:12 PM Thomas Hellström (Intel)
 wrote:

On 2020-07-22 14:41, Daniel Vetter wrote:

I'm pretty sure there's more bugs, I just haven't heard from them yet.
Also due to the opt-in nature of dma-fence we can limit the scope of
what we fix fairly naturally, just don't put them where no one cares
:-) Of course that also hides general locking issues in dma_fence
signalling code, but well *shrug*.

Hmm, yes. Another potential big problem would be drivers that want to
use gpu page faults in the dma-fence critical sections with the
batch-based programming model.

Yeah that's a massive can of worms. But luckily there's no such driver
merged in upstream, so hopefully we can think about all the
constraints and how to best annotate this before we land any
code and have big regrets.


Do you want some bad news? I once made a prototype for that when Vega10
came out.


But we abandoned this approach in favor of the batch-based approach
because of the horrible performance.


KFD is going to see that, but this is only with user queues and no 
dma_fence involved whatsoever.


Christian.


-Daniel



--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH v6 01/12] iommu: Change type of pasid to u32

2020-07-22 Thread Joerg Roedel
On Mon, Jul 13, 2020 at 04:47:56PM -0700, Fenghua Yu wrote:
> PASID is defined as a few different types in iommu including "int",
> "u32", and "unsigned int". To be consistent and to match with uapi
> definitions, define PASID and its variations (e.g. max PASID) as "u32".
> "u32" is also shorter and a little more explicit than "unsigned int".
> 
> No PASID type change in uapi although it defines PASID as __u64 in
> some places.
> 
> Suggested-by: Thomas Gleixner 
> Signed-off-by: Fenghua Yu 
> Reviewed-by: Tony Luck 
> Reviewed-by: Lu Baolu 
> Acked-by: Felix Kuehling 

For the IOMMU parts:

Acked-by: Joerg Roedel 

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea

2020-07-22 Thread Daniel Vetter
On Wed, Jul 22, 2020 at 3:12 PM Thomas Hellström (Intel)
 wrote:
> On 2020-07-22 14:41, Daniel Vetter wrote:
> > Ah I think I misunderstood which options you want to compare here. I'm
> > not sure how much pain fixing up "dma-fence as memory fence" really
> > is. That's kinda why I want a lot more testing on my annotation
> > patches, to figure that out. Not much feedback aside from amdgpu and
> > intel, and those two drivers pretty much need to sort out their memory
> > fence issues anyway (because of userptr and stuff like that).
> >
> > The only other issues outside of these two drivers I'm aware of:
> > - various scheduler drivers doing allocations in the drm/scheduler
> > critical section. Since all arm-soc drivers have a mildly shoddy
> > memory model of "we just pin everything" they don't really have to
> > deal with this. So we might just declare arm as a platform broken and
> > not taint the dma-fence critical sections with fs_reclaim. Otoh we
> > need to fix this for drm/scheduler anyway, I think best option would
> > be to have a mempool for hw fences in the scheduler itself, and at
> > that point fixing the other drivers shouldn't be too onerous.
> >
> > - vmwgfx doing a dma_resv in the atomic commit tail. Entirely
> > orthogonal to the entire memory fence discussion.
>
> With vmwgfx there is another issue that is hit when the gpu signals an
> error. At that point the batch might be restarted with a new meta
> command buffer that needs to be allocated out of a dma pool, in the
> fence critical section. That's probably a bit nasty to fix, but not
> impossible.

Yeah, reset is fun. From what I've seen this isn't any worse than the
hw allocation issue for drm/scheduler drivers; they just allocate
another hw fence with all that it drags along. So the same mempool should
be sufficient.
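
As an illustration of that idea, a rough sketch of a per-scheduler hw
fence reserve (hypothetical names, not actual drm/scheduler code):

#include <linux/mempool.h>
#include <linux/dma-fence.h>

/*
 * Hypothetical example: reserve a small pool of hw fence objects at
 * scheduler init so the run_job/signalling path never has to enter
 * memory reclaim (and thus can never wait on other dma-fences).
 */
struct example_hw_fence {
        struct dma_fence base;
        /* driver-specific payload */
};

static mempool_t *example_hw_fence_pool;

static int example_init_hw_fence_pool(void)
{
        example_hw_fence_pool =
                mempool_create_kmalloc_pool(16, sizeof(struct example_hw_fence));
        return example_hw_fence_pool ? 0 : -ENOMEM;
}

static struct example_hw_fence *example_hw_fence_alloc(void)
{
        /*
         * GFP_NOWAIT: fall back to the pre-allocated reserve instead of
         * blocking on reclaim inside the dma-fence critical section;
         * returns NULL only once the reserve is exhausted.
         */
        return mempool_alloc(example_hw_fence_pool, GFP_NOWAIT);
}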

The really nasty thing around reset is display interactions, because
you just can't take drm_modeset_lock. amdgpu fixed that now (at least
the modeset_lock side, not yet the memory allocations it brings
along). i915 has the same problem for gen2/3 (so really old stuff),
and we've solved that by breaking all i915 fence waits, but
that predates multi-gpu and won't work for shared fences ofc. But it's
so old and predates all multi-gpu laptops that I think wontfix is the
right take.

Other drm/scheduler drivers don't have that problem since they're all
render-only, so no display driver interaction.

> > I'm pretty sure there's more bugs, I just haven't heard from them yet.
> > Also due to the opt-in nature of dma-fence we can limit the scope of
> > what we fix fairly naturally, just don't put them where no one cares
> > :-) Of course that also hides general locking issues in dma_fence
> > signalling code, but well *shrug*.
> Hmm, yes. Another potential big problem would be drivers that want to
> use gpu page faults in the dma-fence critical sections with the
> batch-based programming model.

Yeah that's a massive can of worms. But luckily there's no such driver
merged in upstream, so hopefully we can think about all the
constraints and how to best annotate this before we land any
code and have big regrets.
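
For context, a minimal sketch of the opt-in annotation from the proposed
patches, as seen from a driver's completion path (treat the exact API
names as an assumption of those patches):

#include <linux/dma-fence.h>

/*
 * Sketch only: annotate the signalling path so lockdep can catch code
 * that waits (blocking allocations, the wrong locks) while others may
 * already be blocked on this fence.
 */
static void example_job_irq_complete(struct dma_fence *hw_fence)
{
        bool cookie = dma_fence_begin_signalling();

        /*
         * Everything in here must be safe against fs_reclaim and any
         * lock that is also held around dma_fence_wait().
         */
        dma_fence_signal(hw_fence);

        dma_fence_end_signalling(cookie);
}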
-Daniel



--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] drm/amd/powerplay: remove the dpm checking in the boot sequence

2020-07-22 Thread Wang, Kevin(Yang)
Reviewed-by: Kevin Wang 


> On Jul 22, 2020, at 9:37 PM, Kenneth Feng  wrote:
> 
> It's not necessary to retrieve the power feature status when the
> asic is booted up for the first time. This patch keeps the feature
> enablement status check for the suspend/resume case and removes it
> from the first boot-up sequence.
> 
> Signed-off-by: Kenneth Feng 
> ---
> drivers/gpu/drm/amd/powerplay/amdgpu_smu.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c 
> b/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c
> index f778b00e49eb..6b03f750e63b 100644
> --- a/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c
> +++ b/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c
> @@ -735,7 +735,7 @@ static int smu_smc_hw_setup(struct smu_context *smu)
>uint32_t pcie_gen = 0, pcie_width = 0;
>int ret;
> 
> -if (smu_is_dpm_running(smu) && adev->in_suspend) {
> +if (adev->in_suspend && smu_is_dpm_running(smu)) {
>dev_info(adev->dev, "dpm has been enabled\n");
>return 0;
>}
> -- 
> 2.17.1
> 
> ___
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH] drm/amd/powerplay: remove the dpm checking in the boot sequence

2020-07-22 Thread Kenneth Feng
It's not necessary to retrieve the power feature status when the
asic is booted up for the first time. This patch keeps the feature
enablement status check for the suspend/resume case and removes it
from the first boot-up sequence.

Signed-off-by: Kenneth Feng 
---
 drivers/gpu/drm/amd/powerplay/amdgpu_smu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c 
b/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c
index f778b00e49eb..6b03f750e63b 100644
--- a/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c
+++ b/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c
@@ -735,7 +735,7 @@ static int smu_smc_hw_setup(struct smu_context *smu)
uint32_t pcie_gen = 0, pcie_width = 0;
int ret;
 
-   if (smu_is_dpm_running(smu) && adev->in_suspend) {
+   if (adev->in_suspend && smu_is_dpm_running(smu)) {
dev_info(adev->dev, "dpm has been enabled\n");
return 0;
}
-- 
2.17.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] drm/amd/amdgpu: Fix compiler warning in df driver

2020-07-22 Thread Deucher, Alexander
[AMD Official Use Only - Internal Distribution Only]

Reviewed-by: Alex Deucher 

From: amd-gfx  on behalf of Tom St Denis 

Sent: Wednesday, July 22, 2020 7:40 AM
To: amd-gfx@lists.freedesktop.org 
Cc: StDenis, Tom 
Subject: [PATCH] drm/amd/amdgpu: Fix compiler warning in df driver

Fix this warning:

  CC [M]  drivers/gpu/drm/amd/amdgpu/gfxhub_v1_1.o
In file included from drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h:29,
 from drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h:26,
 from drivers/gpu/drm/amd/amdgpu/amdgpu.h:43,
 from drivers/gpu/drm/amd/amdgpu/df_v3_6.c:23:
drivers/gpu/drm/amd/amdgpu/df_v3_6.c: In function ‘df_v3_6_pmc_get_count’:
./include/drm/drm_print.h:487:2: warning: ‘hi_base_addr’ may be used 
uninitialized in this function [-Wmaybe-uninitialized]
  487 |  __drm_dbg(DRM_UT_DRIVER, fmt, ##__VA_ARGS__)
  |  ^
drivers/gpu/drm/amd/amdgpu/df_v3_6.c:649:25: note: ‘hi_base_addr’ was declared 
here
  649 |  uint32_t lo_base_addr, hi_base_addr, lo_val = 0, hi_val = 0;
  | ^~~~
In file included from drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h:29,
 from drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h:26,
 from drivers/gpu/drm/amd/amdgpu/amdgpu.h:43,
 from drivers/gpu/drm/amd/amdgpu/df_v3_6.c:23:
./include/drm/drm_print.h:487:2: warning: ‘lo_base_addr’ may be used 
uninitialized in this function [-Wmaybe-uninitialized]
  487 |  __drm_dbg(DRM_UT_DRIVER, fmt, ##__VA_ARGS__)
  |  ^
drivers/gpu/drm/amd/amdgpu/df_v3_6.c:649:11: note: ‘lo_base_addr’ was declared 
here
  649 |  uint32_t lo_base_addr, hi_base_addr, lo_val = 0, hi_val = 0;

Signed-off-by: Tom St Denis 
---
 drivers/gpu/drm/amd/amdgpu/df_v3_6.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/df_v3_6.c 
b/drivers/gpu/drm/amd/amdgpu/df_v3_6.c
index 1ab261836983..0aa1ac1accd6 100644
--- a/drivers/gpu/drm/amd/amdgpu/df_v3_6.c
+++ b/drivers/gpu/drm/amd/amdgpu/df_v3_6.c
@@ -646,7 +646,7 @@ static void df_v3_6_pmc_get_count(struct amdgpu_device 
*adev,
   uint64_t config,
   uint64_t *count)
 {
-   uint32_t lo_base_addr, hi_base_addr, lo_val = 0, hi_val = 0;
+   uint32_t lo_base_addr = 0, hi_base_addr = 0, lo_val = 0, hi_val = 0;
 *count = 0;

 switch (adev->asic_type) {
--
2.26.2

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea

2020-07-22 Thread Intel



On 2020-07-22 14:41, Daniel Vetter wrote:


Ah I think I misunderstood which options you want to compare here. I'm
not sure how much pain fixing up "dma-fence as memory fence" really
is. That's kinda why I want a lot more testing on my annotation
patches, to figure that out. Not much feedback aside from amdgpu and
intel, and those two drivers pretty much need to sort out their memory
fence issues anyway (because of userptr and stuff like that).

The only other issues outside of these two drivers I'm aware of:
- various scheduler drivers doing allocations in the drm/scheduler
critical section. Since all arm-soc drivers have a mildly shoddy
memory model of "we just pin everything" they don't really have to
deal with this. So we might just declare arm as a platform broken and
not taint the dma-fence critical sections with fs_reclaim. Otoh we
need to fix this for drm/scheduler anyway, I think best option would
be to have a mempool for hw fences in the scheduler itself, and at
that point fixing the other drivers shouldn't be too onerous.

- vmwgfx doing a dma_resv in the atomic commit tail. Entirely
orthogonal to the entire memory fence discussion.


With vmwgfx there is another issue that is hit when the gpu signals an 
error. At that point the batch might be restarted with a new meta 
command buffer that needs to be allocated out of a dma pool, in the
fence critical section. That's probably a bit nasty to fix, but not
impossible.




I'm pretty sure there's more bugs, I just haven't heard from them yet.
Also due to the opt-in nature of dma-fence we can limit the scope of
what we fix fairly naturally, just don't put them where no one cares
:-) Of course that also hides general locking issues in dma_fence
signalling code, but well *shrug*.
Hmm, yes. Another potential big problem would be drivers that want to 
use gpu page faults in the dma-fence critical sections with the 
batch-based programming model.


/Thomas


___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] drm/amdgpu/dc: Simplify drm_crtc_state::active checks

2020-07-22 Thread Kazlauskas, Nicholas

On 2020-07-22 8:51 a.m., Daniel Vetter wrote:

On Wed, Jul 22, 2020 at 2:38 PM Michel Dänzer  wrote:


From: Michel Dänzer 

drm_atomic_crtc_check enforces that ::active can only be true if
::enable is as well.

Signed-off-by: Michel Dänzer 


Looks fine to me. The check is sufficiently old that I don't mind
relying on the core for this either.


Reviewed-by: Nicholas Kazlauskas 



modeset vs modereset is a bit of an inglorious name choice ... since this
seems to be glue code and not part of core dc, maybe rename to
enable_required/disable_required to keep it consistent with the
wording atomic helpers use? DC also seems to use reset for a lot of
other things already (state reset, like atomic, or gpu reset like
drm/scheduler's tdr), so I think this would also help clarity from a
DC perspective.

Patch itself is good, above just an idea for another patch on top.

Reviewed-by: Daniel Vetter 


That sounds like a reasonable idea to me. These are used more as a 
stream_changed / stream_removed flag, but I don't think these helpers 
really need to exist at all.


That could come as a follow up patch.
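
For reference, a quick sketch of what that follow-up rename could look
like (hypothetical, simply reusing the simplified helper bodies from
this patch with the names suggested above):

#include <drm/drm_atomic.h>

static bool enable_required(struct drm_crtc_state *crtc_state)
{
        return crtc_state->active && drm_atomic_crtc_needs_modeset(crtc_state);
}

static bool disable_required(struct drm_crtc_state *crtc_state)
{
        return !crtc_state->active && drm_atomic_crtc_needs_modeset(crtc_state);
}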

Regards,
Nicholas Kazlauskas


---
  .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c| 16 +++-
  1 file changed, 3 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 312c543b258f..dabef307a74f 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -3415,21 +3415,12 @@ static bool modeset_required(struct drm_crtc_state 
*crtc_state,
  struct dc_stream_state *new_stream,
  struct dc_stream_state *old_stream)
  {
-   if (!drm_atomic_crtc_needs_modeset(crtc_state))
-   return false;
-
-   if (!crtc_state->enable)
-   return false;
-
-   return crtc_state->active;
+   return crtc_state->active && drm_atomic_crtc_needs_modeset(crtc_state);
  }

  static bool modereset_required(struct drm_crtc_state *crtc_state)
  {
-   if (!drm_atomic_crtc_needs_modeset(crtc_state))
-   return false;
-
-   return !crtc_state->enable || !crtc_state->active;
+   return !crtc_state->active && drm_atomic_crtc_needs_modeset(crtc_state);
  }

  static void amdgpu_dm_encoder_destroy(struct drm_encoder *encoder)
@@ -8108,8 +8099,7 @@ static int dm_update_crtc_state(struct 
amdgpu_display_manager *dm,
  * We want to do dc stream updates that do not require a
  * full modeset below.
  */
-   if (!(enable && aconnector && new_crtc_state->enable &&
- new_crtc_state->active))
+   if (!(enable && aconnector && new_crtc_state->active))
 return 0;
 /*
  * Given above conditions, the dc state cannot be NULL because:
--
2.28.0.rc0

___
dri-devel mailing list
dri-de...@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel






___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] drm/amdgpu/dc: Simplify drm_crtc_state::active checks

2020-07-22 Thread Daniel Vetter
On Wed, Jul 22, 2020 at 2:38 PM Michel Dänzer  wrote:
>
> From: Michel Dänzer 
>
> drm_atomic_crtc_check enforces that ::active can only be true if
> ::enable is as well.
>
> Signed-off-by: Michel Dänzer 

modeset vs modereset is a bit of an inglorious name choice ... since this
seems to be glue code and not part of core dc, maybe rename to
enable_required/disable_required to keep it consistent with the
wording atomic helpers use? DC also seems to use reset for a lot of
other things already (state reset, like atomic, or gpu reset like
drm/scheduler's tdr), so I think this would also help clarity from a
DC perspective.

Patch itself is good, above just an idea for another patch on top.

Reviewed-by: Daniel Vetter 
> ---
>  .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c| 16 +++-
>  1 file changed, 3 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
> b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> index 312c543b258f..dabef307a74f 100644
> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> @@ -3415,21 +3415,12 @@ static bool modeset_required(struct drm_crtc_state 
> *crtc_state,
>  struct dc_stream_state *new_stream,
>  struct dc_stream_state *old_stream)
>  {
> -   if (!drm_atomic_crtc_needs_modeset(crtc_state))
> -   return false;
> -
> -   if (!crtc_state->enable)
> -   return false;
> -
> -   return crtc_state->active;
> +   return crtc_state->active && 
> drm_atomic_crtc_needs_modeset(crtc_state);
>  }
>
>  static bool modereset_required(struct drm_crtc_state *crtc_state)
>  {
> -   if (!drm_atomic_crtc_needs_modeset(crtc_state))
> -   return false;
> -
> -   return !crtc_state->enable || !crtc_state->active;
> +   return !crtc_state->active && 
> drm_atomic_crtc_needs_modeset(crtc_state);
>  }
>
>  static void amdgpu_dm_encoder_destroy(struct drm_encoder *encoder)
> @@ -8108,8 +8099,7 @@ static int dm_update_crtc_state(struct 
> amdgpu_display_manager *dm,
>  * We want to do dc stream updates that do not require a
>  * full modeset below.
>  */
> -   if (!(enable && aconnector && new_crtc_state->enable &&
> - new_crtc_state->active))
> +   if (!(enable && aconnector && new_crtc_state->active))
> return 0;
> /*
>  * Given above conditions, the dc state cannot be NULL because:
> --
> 2.28.0.rc0
>
> ___
> dri-devel mailing list
> dri-de...@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel



-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea

2020-07-22 Thread Daniel Vetter
On Wed, Jul 22, 2020 at 2:22 PM Thomas Hellström (Intel)
 wrote:
>
>
> On 2020-07-22 13:39, Daniel Vetter wrote:
> > On Wed, Jul 22, 2020 at 12:31 PM Thomas Hellström (Intel)
> >  wrote:
> >>
> >> On 2020-07-22 11:45, Daniel Vetter wrote:
> >>> On Wed, Jul 22, 2020 at 10:05 AM Thomas Hellström (Intel)
> >>>  wrote:
>  On 2020-07-22 09:11, Daniel Vetter wrote:
> > On Wed, Jul 22, 2020 at 8:45 AM Thomas Hellström (Intel)
> >  wrote:
> >> On 2020-07-22 00:45, Dave Airlie wrote:
> >>> On Tue, 21 Jul 2020 at 18:47, Thomas Hellström (Intel)
> >>>  wrote:
>  On 7/21/20 9:45 AM, Christian König wrote:
> > Am 21.07.20 um 09:41 schrieb Daniel Vetter:
> >> On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel)
> >> wrote:
> >>> Hi,
> >>>
> >>> On 7/9/20 2:33 PM, Daniel Vetter wrote:
>  Comes up every few years, gets somewhat tedious to discuss, let's
>  write this down once and for all.
> 
>  What I'm not sure about is whether the text should be more 
>  explicit in
>  flat out mandating the amdkfd eviction fences for long running 
>  compute
>  workloads or workloads where userspace fencing is allowed.
> >>> Although (in my humble opinion) it might be possible to completely
> >>> untangle
> >>> kernel-introduced fences for resource management and dma-fences 
> >>> used
> >>> for
> >>> completion- and dependency tracking and lift a lot of restrictions
> >>> for the
> >>> dma-fences, including prohibiting infinite ones, I think this 
> >>> makes
> >>> sense
> >>> describing the current state.
> >> Yeah I think a future patch needs to type up how we want to make 
> >> that
> >> happen (for some cross driver consistency) and what needs to be
> >> considered. Some of the necessary parts are already there (with 
> >> like the
> >> preemption fences amdkfd has as an example), but I think some 
> >> clear docs
> >> on what's required from both hw, drivers and userspace would be 
> >> really
> >> good.
> > I'm currently writing that up, but probably still need a few days 
> > for
> > this.
>  Great! I put down some (very) initial thoughts a couple of weeks ago
>  building on eviction fences for various hardware complexity levels 
>  here:
> 
>  https://gitlab.freedesktop.org/thomash/docs/-/blob/master/Untangling%20dma-fence%20and%20memory%20allocation.odt
> >>> We are seeing HW that has recoverable GPU page faults but only for
> >>> compute tasks, and scheduler without semaphores hw for graphics.
> >>>
> >>> So a single driver may have to expose both models to userspace and
> >>> also introduces the problem of how to interoperate between the two
> >>> models on one card.
> >>>
> >>> Dave.
> >> Hmm, yes to begin with it's important to note that this is not a
> >> replacement for new programming models or APIs, This is something that
> >> takes place internally in drivers to mitigate many of the restrictions
> >> that are currently imposed on dma-fence and documented in this and
> >> previous series. It's basically the driver-private narrow completions
> >> Jason suggested in the lockdep patches discussions implemented the same
> >> way as eviction-fences.
> >>
> >> The memory fence API would be local to helpers and middle-layers like
> >> TTM, and the corresponding drivers.  The only cross-driver-like
> >> visibility would be that the dma-buf move_notify() callback would not 
> >> be
> >> allowed to wait on dma-fences or something that depends on a dma-fence.
> > Because we can't preempt (on some engines at least) we already have
> > the requirement that cross driver buffer management can get stuck on a
> > dma-fence. Not even taking into account the horrors we do with
> > userptr, which are cross driver no matter what. Limiting move_notify
> > to memory fences only doesn't work, since the pte clearing might need
> > to wait for a dma_fence first. Hence this becomes a full end-of-batch
> > fence, not just a limited kernel-internal memory fence.
>  For non-preemptible hardware the memory fence typically *is* the
>  end-of-batch fence. (Unless, as documented, there is a scheduler
>  consuming sync-file dependencies in which case the memory fence wait
>  needs to be able to break out of that). The key thing is not that we can
>  break out of execution, but that we can break out of dependencies, since
>  when we're executing all dependecies (modulo semaphores) are already
>  fulfilled. That's what's eliminating the deadlocks.
> 
> > That's kinda why I think only 

[PATCH] drm/amdgpu/dc: Simplify drm_crtc_state::active checks

2020-07-22 Thread Michel Dänzer
From: Michel Dänzer 

drm_atomic_crtc_check enforces that ::active can only be true if
::enable is as well.

Signed-off-by: Michel Dänzer 
---
 .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c| 16 +++-
 1 file changed, 3 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 312c543b258f..dabef307a74f 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -3415,21 +3415,12 @@ static bool modeset_required(struct drm_crtc_state 
*crtc_state,
 struct dc_stream_state *new_stream,
 struct dc_stream_state *old_stream)
 {
-   if (!drm_atomic_crtc_needs_modeset(crtc_state))
-   return false;
-
-   if (!crtc_state->enable)
-   return false;
-
-   return crtc_state->active;
+   return crtc_state->active && drm_atomic_crtc_needs_modeset(crtc_state);
 }
 
 static bool modereset_required(struct drm_crtc_state *crtc_state)
 {
-   if (!drm_atomic_crtc_needs_modeset(crtc_state))
-   return false;
-
-   return !crtc_state->enable || !crtc_state->active;
+   return !crtc_state->active && drm_atomic_crtc_needs_modeset(crtc_state);
 }
 
 static void amdgpu_dm_encoder_destroy(struct drm_encoder *encoder)
@@ -8108,8 +8099,7 @@ static int dm_update_crtc_state(struct 
amdgpu_display_manager *dm,
 * We want to do dc stream updates that do not require a
 * full modeset below.
 */
-   if (!(enable && aconnector && new_crtc_state->enable &&
- new_crtc_state->active))
+   if (!(enable && aconnector && new_crtc_state->active))
return 0;
/*
 * Given above conditions, the dc state cannot be NULL because:
-- 
2.28.0.rc0

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea

2020-07-22 Thread Intel


On 2020-07-22 13:39, Daniel Vetter wrote:

On Wed, Jul 22, 2020 at 12:31 PM Thomas Hellström (Intel)
 wrote:


On 2020-07-22 11:45, Daniel Vetter wrote:

On Wed, Jul 22, 2020 at 10:05 AM Thomas Hellström (Intel)
 wrote:

On 2020-07-22 09:11, Daniel Vetter wrote:

On Wed, Jul 22, 2020 at 8:45 AM Thomas Hellström (Intel)
 wrote:

On 2020-07-22 00:45, Dave Airlie wrote:

On Tue, 21 Jul 2020 at 18:47, Thomas Hellström (Intel)
 wrote:

On 7/21/20 9:45 AM, Christian König wrote:

Am 21.07.20 um 09:41 schrieb Daniel Vetter:

On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel)
wrote:

Hi,

On 7/9/20 2:33 PM, Daniel Vetter wrote:

Comes up every few years, gets somewhat tedious to discuss, let's
write this down once and for all.

What I'm not sure about is whether the text should be more explicit in
flat out mandating the amdkfd eviction fences for long running compute
workloads or workloads where userspace fencing is allowed.

Although (in my humble opinion) it might be possible to completely
untangle
kernel-introduced fences for resource management and dma-fences used
for
completion- and dependency tracking and lift a lot of restrictions
for the
dma-fences, including prohibiting infinite ones, I think this makes
sense
describing the current state.

Yeah I think a future patch needs to type up how we want to make that
happen (for some cross driver consistency) and what needs to be
considered. Some of the necessary parts are already there (with like the
preemption fences amdkfd has as an example), but I think some clear docs
on what's required from both hw, drivers and userspace would be really
good.

I'm currently writing that up, but probably still need a few days for
this.

Great! I put down some (very) initial thoughts a couple of weeks ago
building on eviction fences for various hardware complexity levels here:

https://gitlab.freedesktop.org/thomash/docs/-/blob/master/Untangling%20dma-fence%20and%20memory%20allocation.odt

We are seeing HW that has recoverable GPU page faults but only for
compute tasks, and scheduler without semaphores hw for graphics.

So a single driver may have to expose both models to userspace and
also introduces the problem of how to interoperate between the two
models on one card.

Dave.

Hmm, yes to begin with it's important to note that this is not a
replacement for new programming models or APIs. This is something that
takes place internally in drivers to mitigate many of the restrictions
that are currently imposed on dma-fence and documented in this and
previous series. It's basically the driver-private narrow completions
Jason suggested in the lockdep patches discussions implemented the same
way as eviction-fences.

The memory fence API would be local to helpers and middle-layers like
TTM, and the corresponding drivers.  The only cross-driver-like
visibility would be that the dma-buf move_notify() callback would not be
allowed to wait on dma-fences or something that depends on a dma-fence.

Because we can't preempt (on some engines at least) we already have
the requirement that cross driver buffer management can get stuck on a
dma-fence. Not even taking into account the horrors we do with
userptr, which are cross driver no matter what. Limiting move_notify
to memory fences only doesn't work, since the pte clearing might need
to wait for a dma_fence first. Hence this becomes a full end-of-batch
fence, not just a limited kernel-internal memory fence.

For non-preemptible hardware the memory fence typically *is* the
end-of-batch fence. (Unless, as documented, there is a scheduler
consuming sync-file dependencies in which case the memory fence wait
needs to be able to break out of that). The key thing is not that we can
break out of execution, but that we can break out of dependencies, since
when we're executing, all dependencies (modulo semaphores) are already
fulfilled. That's what's eliminating the deadlocks.


That's kinda why I think only reasonable option is to toss in the
towel and declare dma-fence to be the memory fence (and suck up all
the consequences of that decision as uapi, which is kinda where we
are), and construct something new free-wheeling for userspace
fencing. But only for engines that allow enough preempt/gpu page
faulting to make that possible. Free wheeling userspace fences/gpu
semaphores or whatever you want to call them (on windows I think it's
monitored fence) only work if you can preempt to decouple the memory
fences from your gpu command execution.

There's the in-between step of just decoupling the batchbuffer
submission prep for hw without any preempt (but a scheduler), but that
seems kinda pointless. Modern execbuf should be O(1) fastpath, with
all the allocation/mapping work pulled out ahead. vk exposes that
model directly to clients, GL drivers could use it internally too, so
I see zero value in spending lots of time engineering very tricky
kernel code just for old userspace. Much more reasonable to do that in
userspace, where we have 

[PATCH] drm/amd/amdgpu: Fix compiler warning in df driver

2020-07-22 Thread Tom St Denis
Fix this warning:

  CC [M]  drivers/gpu/drm/amd/amdgpu/gfxhub_v1_1.o
In file included from drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h:29,
 from drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h:26,
 from drivers/gpu/drm/amd/amdgpu/amdgpu.h:43,
 from drivers/gpu/drm/amd/amdgpu/df_v3_6.c:23:
drivers/gpu/drm/amd/amdgpu/df_v3_6.c: In function ‘df_v3_6_pmc_get_count’:
./include/drm/drm_print.h:487:2: warning: ‘hi_base_addr’ may be used 
uninitialized in this function [-Wmaybe-uninitialized]
  487 |  __drm_dbg(DRM_UT_DRIVER, fmt, ##__VA_ARGS__)
  |  ^
drivers/gpu/drm/amd/amdgpu/df_v3_6.c:649:25: note: ‘hi_base_addr’ was declared 
here
  649 |  uint32_t lo_base_addr, hi_base_addr, lo_val = 0, hi_val = 0;
  | ^~~~
In file included from drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h:29,
 from drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h:26,
 from drivers/gpu/drm/amd/amdgpu/amdgpu.h:43,
 from drivers/gpu/drm/amd/amdgpu/df_v3_6.c:23:
./include/drm/drm_print.h:487:2: warning: ‘lo_base_addr’ may be used 
uninitialized in this function [-Wmaybe-uninitialized]
  487 |  __drm_dbg(DRM_UT_DRIVER, fmt, ##__VA_ARGS__)
  |  ^
drivers/gpu/drm/amd/amdgpu/df_v3_6.c:649:11: note: ‘lo_base_addr’ was declared 
here
  649 |  uint32_t lo_base_addr, hi_base_addr, lo_val = 0, hi_val = 0;
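
A minimal, hypothetical reproduction of the warning pattern: when no
case of the switch writes the locals, gcc cannot prove they are
initialized before the later read, and initializing them at declaration
(as the patch below does) silences it.

#include <stdint.h>

/* Hypothetical example of the -Wmaybe-uninitialized pattern. */
static void example_get_count(int asic_type, uint64_t *count)
{
        uint32_t lo = 0, hi = 0;        /* the fix: initialize up front */

        switch (asic_type) {
        case 1:
                lo = 0x10;
                hi = 0x20;
                break;
        default:
                break;                  /* locals untouched on this path */
        }

        *count = ((uint64_t)hi << 32) | lo;
}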

Signed-off-by: Tom St Denis 
---
 drivers/gpu/drm/amd/amdgpu/df_v3_6.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/df_v3_6.c 
b/drivers/gpu/drm/amd/amdgpu/df_v3_6.c
index 1ab261836983..0aa1ac1accd6 100644
--- a/drivers/gpu/drm/amd/amdgpu/df_v3_6.c
+++ b/drivers/gpu/drm/amd/amdgpu/df_v3_6.c
@@ -646,7 +646,7 @@ static void df_v3_6_pmc_get_count(struct amdgpu_device 
*adev,
  uint64_t config,
  uint64_t *count)
 {
-   uint32_t lo_base_addr, hi_base_addr, lo_val = 0, hi_val = 0;
+   uint32_t lo_base_addr = 0, hi_base_addr = 0, lo_val = 0, hi_val = 0;
*count = 0;
 
switch (adev->asic_type) {
-- 
2.26.2

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea

2020-07-22 Thread Daniel Vetter
On Wed, Jul 22, 2020 at 12:31 PM Thomas Hellström (Intel)
 wrote:
>
>
> On 2020-07-22 11:45, Daniel Vetter wrote:
> > On Wed, Jul 22, 2020 at 10:05 AM Thomas Hellström (Intel)
> >  wrote:
> >>
> >> On 2020-07-22 09:11, Daniel Vetter wrote:
> >>> On Wed, Jul 22, 2020 at 8:45 AM Thomas Hellström (Intel)
> >>>  wrote:
>  On 2020-07-22 00:45, Dave Airlie wrote:
> > On Tue, 21 Jul 2020 at 18:47, Thomas Hellström (Intel)
> >  wrote:
> >> On 7/21/20 9:45 AM, Christian König wrote:
> >>> Am 21.07.20 um 09:41 schrieb Daniel Vetter:
>  On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel)
>  wrote:
> > Hi,
> >
> > On 7/9/20 2:33 PM, Daniel Vetter wrote:
> >> Comes up every few years, gets somewhat tedious to discuss, let's
> >> write this down once and for all.
> >>
> >> What I'm not sure about is whether the text should be more 
> >> explicit in
> >> flat out mandating the amdkfd eviction fences for long running 
> >> compute
> >> workloads or workloads where userspace fencing is allowed.
> > Although (in my humble opinion) it might be possible to completely
> > untangle
> > kernel-introduced fences for resource management and dma-fences used
> > for
> > completion- and dependency tracking and lift a lot of restrictions
> > for the
> > dma-fences, including prohibiting infinite ones, I think this makes
> > sense
> > describing the current state.
>  Yeah I think a future patch needs to type up how we want to make that
>  happen (for some cross driver consistency) and what needs to be
>  considered. Some of the necessary parts are already there (with like 
>  the
>  preemption fences amdkfd has as an example), but I think some clear 
>  docs
>  on what's required from both hw, drivers and userspace would be 
>  really
>  good.
> >>> I'm currently writing that up, but probably still need a few days for
> >>> this.
> >> Great! I put down some (very) initial thoughts a couple of weeks ago
> >> building on eviction fences for various hardware complexity levels 
> >> here:
> >>
> >> https://gitlab.freedesktop.org/thomash/docs/-/blob/master/Untangling%20dma-fence%20and%20memory%20allocation.odt
> > We are seeing HW that has recoverable GPU page faults but only for
> > compute tasks, and scheduler without semaphores hw for graphics.
> >
> > So a single driver may have to expose both models to userspace and
> > also introduces the problem of how to interoperate between the two
> > models on one card.
> >
> > Dave.
>  Hmm, yes to begin with it's important to note that this is not a
>  replacement for new programming models or APIs, This is something that
>  takes place internally in drivers to mitigate many of the restrictions
>  that are currently imposed on dma-fence and documented in this and
>  previous series. It's basically the driver-private narrow completions
>  Jason suggested in the lockdep patches discussions implemented the same
>  way as eviction-fences.
> 
>  The memory fence API would be local to helpers and middle-layers like
>  TTM, and the corresponding drivers.  The only cross-driver-like
>  visibility would be that the dma-buf move_notify() callback would not be
>  allowed to wait on dma-fences or something that depends on a dma-fence.
> >>> Because we can't preempt (on some engines at least) we already have
> >>> the requirement that cross driver buffer management can get stuck on a
> >>> dma-fence. Not even taking into account the horrors we do with
> >>> userptr, which are cross driver no matter what. Limiting move_notify
> >>> to memory fences only doesn't work, since the pte clearing might need
> >>> to wait for a dma_fence first. Hence this becomes a full end-of-batch
> >>> fence, not just a limited kernel-internal memory fence.
> >> For non-preemptible hardware the memory fence typically *is* the
> >> end-of-batch fence. (Unless, as documented, there is a scheduler
> >> consuming sync-file dependencies in which case the memory fence wait
> >> needs to be able to break out of that). The key thing is not that we can
> >> break out of execution, but that we can break out of dependencies, since
> >> when we're executing, all dependencies (modulo semaphores) are already
> >> fulfilled. That's what's eliminating the deadlocks.
> >>
> >>> That's kinda why I think only reasonable option is to toss in the
> >>> towel and declare dma-fence to be the memory fence (and suck up all
> >>> the consequences of that decision as uapi, which is kinda where we
> >>> are), and construct something new free-wheeling for userspace
> >>> fencing. But only for engines that allow enough preempt/gpu page
> >>> faulting to make that possible. Free 

Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea

2020-07-22 Thread Intel


On 2020-07-22 11:45, Daniel Vetter wrote:

On Wed, Jul 22, 2020 at 10:05 AM Thomas Hellström (Intel)
 wrote:


On 2020-07-22 09:11, Daniel Vetter wrote:

On Wed, Jul 22, 2020 at 8:45 AM Thomas Hellström (Intel)
 wrote:

On 2020-07-22 00:45, Dave Airlie wrote:

On Tue, 21 Jul 2020 at 18:47, Thomas Hellström (Intel)
 wrote:

On 7/21/20 9:45 AM, Christian König wrote:

Am 21.07.20 um 09:41 schrieb Daniel Vetter:

On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel)
wrote:

Hi,

On 7/9/20 2:33 PM, Daniel Vetter wrote:

Comes up every few years, gets somewhat tedious to discuss, let's
write this down once and for all.

What I'm not sure about is whether the text should be more explicit in
flat out mandating the amdkfd eviction fences for long running compute
workloads or workloads where userspace fencing is allowed.

Although (in my humble opinion) it might be possible to completely
untangle
kernel-introduced fences for resource management and dma-fences used
for
completion- and dependency tracking and lift a lot of restrictions
for the
dma-fences, including prohibiting infinite ones, I think this makes
sense
describing the current state.

Yeah I think a future patch needs to type up how we want to make that
happen (for some cross driver consistency) and what needs to be
considered. Some of the necessary parts are already there (with like the
preemption fences amdkfd has as an example), but I think some clear docs
on what's required from both hw, drivers and userspace would be really
good.

I'm currently writing that up, but probably still need a few days for
this.

Great! I put down some (very) initial thoughts a couple of weeks ago
building on eviction fences for various hardware complexity levels here:

https://gitlab.freedesktop.org/thomash/docs/-/blob/master/Untangling%20dma-fence%20and%20memory%20allocation.odt

We are seeing HW that has recoverable GPU page faults but only for
compute tasks, and scheduler without semaphores hw for graphics.

So a single driver may have to expose both models to userspace and
also introduces the problem of how to interoperate between the two
models on one card.

Dave.

Hmm, yes to begin with it's important to note that this is not a
replacement for new programming models or APIs. This is something that
takes place internally in drivers to mitigate many of the restrictions
that are currently imposed on dma-fence and documented in this and
previous series. It's basically the driver-private narrow completions
Jason suggested in the lockdep patches discussions implemented the same
way as eviction-fences.

The memory fence API would be local to helpers and middle-layers like
TTM, and the corresponding drivers.  The only cross-driver-like
visibility would be that the dma-buf move_notify() callback would not be
allowed to wait on dma-fences or something that depends on a dma-fence.

Because we can't preempt (on some engines at least) we already have
the requirement that cross driver buffer management can get stuck on a
dma-fence. Not even taking into account the horrors we do with
userptr, which are cross driver no matter what. Limiting move_notify
to memory fences only doesn't work, since the pte clearing might need
to wait for a dma_fence first. Hence this becomes a full end-of-batch
fence, not just a limited kernel-internal memory fence.

For non-preemptible hardware the memory fence typically *is* the
end-of-batch fence. (Unless, as documented, there is a scheduler
consuming sync-file dependencies in which case the memory fence wait
needs to be able to break out of that). The key thing is not that we can
break out of execution, but that we can break out of dependencies, since
when we're executing, all dependencies (modulo semaphores) are already
fulfilled. That's what's eliminating the deadlocks.


That's kinda why I think only reasonable option is to toss in the
towel and declare dma-fence to be the memory fence (and suck up all
the consequences of that decision as uapi, which is kinda where we
are), and construct something new free-wheeling for userspace
fencing. But only for engines that allow enough preempt/gpu page
faulting to make that possible. Free wheeling userspace fences/gpu
semaphores or whatever you want to call them (on windows I think it's
monitored fence) only work if you can preempt to decouple the memory
fences from your gpu command execution.

There's the in-between step of just decoupling the batchbuffer
submission prep for hw without any preempt (but a scheduler), but that
seems kinda pointless. Modern execbuf should be O(1) fastpath, with
all the allocation/mapping work pulled out ahead. vk exposes that
model directly to clients, GL drivers could use it internally too, so
I see zero value in spending lots of time engineering very tricky
kernel code just for old userspace. Much more reasonable to do that in
userspace, where we have real debuggers and no panics about security
bugs (or well, a lot less, webgl is still a thing, but at least

Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea

2020-07-22 Thread Daniel Vetter
On Wed, Jul 22, 2020 at 10:05 AM Thomas Hellström (Intel)
 wrote:
>
>
> On 2020-07-22 09:11, Daniel Vetter wrote:
> > On Wed, Jul 22, 2020 at 8:45 AM Thomas Hellström (Intel)
> >  wrote:
> >>
> >> On 2020-07-22 00:45, Dave Airlie wrote:
> >>> On Tue, 21 Jul 2020 at 18:47, Thomas Hellström (Intel)
> >>>  wrote:
>  On 7/21/20 9:45 AM, Christian König wrote:
> > Am 21.07.20 um 09:41 schrieb Daniel Vetter:
> >> On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel)
> >> wrote:
> >>> Hi,
> >>>
> >>> On 7/9/20 2:33 PM, Daniel Vetter wrote:
>  Comes up every few years, gets somewhat tedious to discuss, let's
>  write this down once and for all.
> 
>  What I'm not sure about is whether the text should be more explicit 
>  in
>  flat out mandating the amdkfd eviction fences for long running 
>  compute
>  workloads or workloads where userspace fencing is allowed.
> >>> Although (in my humble opinion) it might be possible to completely
> >>> untangle
> >>> kernel-introduced fences for resource management and dma-fences used
> >>> for
> >>> completion- and dependency tracking and lift a lot of restrictions
> >>> for the
> >>> dma-fences, including prohibiting infinite ones, I think this makes
> >>> sense
> >>> describing the current state.
> >> Yeah I think a future patch needs to type up how we want to make that
> >> happen (for some cross driver consistency) and what needs to be
> >> considered. Some of the necessary parts are already there (with like 
> >> the
> >> preemption fences amdkfd has as an example), but I think some clear 
> >> docs
> >> on what's required from both hw, drivers and userspace would be really
> >> good.
> > I'm currently writing that up, but probably still need a few days for
> > this.
>  Great! I put down some (very) initial thoughts a couple of weeks ago
>  building on eviction fences for various hardware complexity levels here:
> 
>  https://gitlab.freedesktop.org/thomash/docs/-/blob/master/Untangling%20dma-fence%20and%20memory%20allocation.odt
> >>> We are seeing HW that has recoverable GPU page faults but only for
> >>> compute tasks, and scheduler without semaphores hw for graphics.
> >>>
> >>> So a single driver may have to expose both models to userspace and
> >>> also introduces the problem of how to interoperate between the two
> >>> models on one card.
> >>>
> >>> Dave.
> >> Hmm, yes to begin with it's important to note that this is not a
> >> replacement for new programming models or APIs, This is something that
> >> takes place internally in drivers to mitigate many of the restrictions
> >> that are currently imposed on dma-fence and documented in this and
> >> previous series. It's basically the driver-private narrow completions
> >> Jason suggested in the lockdep patches discussions implemented the same
> >> way as eviction-fences.
> >>
> >> The memory fence API would be local to helpers and middle-layers like
> >> TTM, and the corresponding drivers.  The only cross-driver-like
> >> visibility would be that the dma-buf move_notify() callback would not be
> >> allowed to wait on dma-fences or something that depends on a dma-fence.
> > Because we can't preempt (on some engines at least) we already have
> > the requirement that cross driver buffer management can get stuck on a
> > dma-fence. Not even taking into account the horrors we do with
> > userptr, which are cross driver no matter what. Limiting move_notify
> > to memory fences only doesn't work, since the pte clearing might need
> > to wait for a dma_fence first. Hence this becomes a full end-of-batch
> > fence, not just a limited kernel-internal memory fence.
>
> For non-preemptible hardware the memory fence typically *is* the
> end-of-batch fence. (Unless, as documented, there is a scheduler
> consuming sync-file dependencies in which case the memory fence wait
> needs to be able to break out of that). The key thing is not that we can
> break out of execution, but that we can break out of dependencies, since
> when we're executing, all dependencies (modulo semaphores) are already
> fulfilled. That's what's eliminating the deadlocks.
>
> > That's kinda why I think only reasonable option is to toss in the
> > towel and declare dma-fence to be the memory fence (and suck up all
> > the consequences of that decision as uapi, which is kinda where we
> > are), and construct something new free-wheeling for userspace
> > fencing. But only for engines that allow enough preempt/gpu page
> > faulting to make that possible. Free wheeling userspace fences/gpu
> > semaphores or whatever you want to call them (on windows I think it's
> > monitored fence) only work if you can preempt to decouple the memory
> > fences from your gpu command execution.
> >
> > There's the in-between step of just decoupling the batchbuffer
> > submission prep for 

Re: [PATCH] Revert "drm/amd/powerplay: drop unnecessary message support check"

2020-07-22 Thread Huang Rui
On Wed, Jul 22, 2020 at 04:00:45PM +0800, Zhu, Changfeng wrote:
> From: changzhu 
> 
> From: Changfeng 
> 
> The below 3 messages are not supported on Renoir
> SMU_MSG_PrepareMp1ForShutdown
> SMU_MSG_PrepareMp1ForUnload
> SMU_MSG_PrepareMp1ForReset
> 
> It needs to revert the patch:
> drm/amd/powerplay: drop unnecessary message support check
> to avoid a set-mp1-state failure during gpu reset on renoir.
> 
> Change-Id: Ib34d17ab88e1c88173827cca962d8154ad883ab7
> Signed-off-by: changfeng 

Acked-by: Huang Rui 

> ---
>  drivers/gpu/drm/amd/powerplay/amdgpu_smu.c | 9 +
>  drivers/gpu/drm/amd/powerplay/smu_cmn.h| 2 +-
>  2 files changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c 
> b/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c
> index 838a369c9ec3..f778b00e49eb 100644
> --- a/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c
> +++ b/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c
> @@ -34,6 +34,7 @@
>  #include "sienna_cichlid_ppt.h"
>  #include "renoir_ppt.h"
>  #include "amd_pcie.h"
> +#include "smu_cmn.h"
>  
>  /*
>   * DO NOT use these for err/warn/info/debug messages.
> @@ -1589,6 +1590,14 @@ int smu_set_mp1_state(struct smu_context *smu,
>   return 0;
>   }
>  
> + /* some asics may not support those messages */
> + if (smu_cmn_to_asic_specific_index(smu,
> +CMN2ASIC_MAPPING_MSG,
> +msg) < 0) {
> + mutex_unlock(>mutex);
> + return 0;
> + }
> +
>   ret = smu_send_smc_msg(smu, msg, NULL);
>   if (ret)
>   dev_err(smu->adev->dev, "[PrepareMp1] Failed!\n");
> diff --git a/drivers/gpu/drm/amd/powerplay/smu_cmn.h 
> b/drivers/gpu/drm/amd/powerplay/smu_cmn.h
> index 98face8c5fd6..f9e63f18b157 100644
> --- a/drivers/gpu/drm/amd/powerplay/smu_cmn.h
> +++ b/drivers/gpu/drm/amd/powerplay/smu_cmn.h
> @@ -25,7 +25,7 @@
>  
>  #include "amdgpu_smu.h"
>  
> -#if defined(SWSMU_CODE_LAYER_L2) || defined(SWSMU_CODE_LAYER_L3) || 
> defined(SWSMU_CODE_LAYER_L4)
> +#if defined(SWSMU_CODE_LAYER_L1) || defined(SWSMU_CODE_LAYER_L2) || 
> defined(SWSMU_CODE_LAYER_L3) || defined(SWSMU_CODE_LAYER_L4)
>  int smu_cmn_send_smc_msg_with_param(struct smu_context *smu,
>   enum smu_message_type msg,
>   uint32_t param,
> -- 
> 2.17.1
> 
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH 3/5] drm/amdgpu: add vmhub funcs helper

2020-07-22 Thread Huang Rui
On Wed, Jul 22, 2020 at 04:15:52PM +0800, Christian König wrote:
> Am 21.07.20 um 12:29 schrieb Huang Rui:
> > This patch introduces a vmhub funcs helper to add the following callback
> > (print_l2_protection_fault_status). Each GC/MMHUB register-specific
> > programming should be at the gfxhub/mmhub level.
> >
> > Signed-off-by: Huang Rui 
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h  |  7 +++
> >   drivers/gpu/drm/amd/amdgpu/gfxhub_v2_0.c | 34 
> > 
> >   drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c | 34 
> > 
> >   drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c   | 25 ++-
> >   drivers/gpu/drm/amd/amdgpu/mmhub_v2_0.c  | 34 
> > 
> >   5 files changed, 111 insertions(+), 23 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h 
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
> > index 1785a0e..bbecd87 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
> > @@ -74,6 +74,11 @@ struct amdgpu_gmc_fault {
> >   /*
> >* VMHUB structures, functions & helpers
> >*/
> > +struct amdgpu_vmhub_funcs {
> > +   void (*print_l2_protection_fault_status)(struct amdgpu_device *adev,
> > +uint32_t status);
> > +};
> > +
> >   struct amdgpu_vmhub {
> > uint32_tctx0_ptb_addr_lo32;
> > uint32_tctx0_ptb_addr_hi32;
> > @@ -94,6 +99,8 @@ struct amdgpu_vmhub {
> > uint32_teng_addr_distance; /* include LO32/HI32 */
> >   
> > uint32_tvm_cntx_cntl_vm_fault;
> > +
> > +   const struct amdgpu_vmhub_funcs *vmhub_funcs;
> >   };
> >   
> >   /*
> > diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_0.c 
> > b/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_0.c
> > index 993185f..14268ea 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_0.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_0.c
> > @@ -31,6 +31,33 @@
> >   
> >   #include "soc15_common.h"
> >   
> > +static void
> > +gfxhub_v2_0_print_l2_protection_fault_status(struct amdgpu_device *adev,
> > +uint32_t status)
> > +{
> > +   dev_err(adev->dev,
> > +   "GCVM_L2_PROTECTION_FAULT_STATUS:0x%08X\n",
> > +   status);
> > +   dev_err(adev->dev, "\t Faulty UTCL2 client ID: 0x%lx\n",
> > +   REG_GET_FIELD(status,
> > +   GCVM_L2_PROTECTION_FAULT_STATUS, CID));
> > +   dev_err(adev->dev, "\t MORE_FAULTS: 0x%lx\n",
> > +   REG_GET_FIELD(status,
> > +   GCVM_L2_PROTECTION_FAULT_STATUS, MORE_FAULTS));
> > +   dev_err(adev->dev, "\t WALKER_ERROR: 0x%lx\n",
> > +   REG_GET_FIELD(status,
> > +   GCVM_L2_PROTECTION_FAULT_STATUS, WALKER_ERROR));
> > +   dev_err(adev->dev, "\t PERMISSION_FAULTS: 0x%lx\n",
> > +   REG_GET_FIELD(status,
> > +   GCVM_L2_PROTECTION_FAULT_STATUS, PERMISSION_FAULTS));
> > +   dev_err(adev->dev, "\t MAPPING_ERROR: 0x%lx\n",
> > +   REG_GET_FIELD(status,
> > +   GCVM_L2_PROTECTION_FAULT_STATUS, MAPPING_ERROR));
> > +   dev_err(adev->dev, "\t RW: 0x%lx\n",
> > +   REG_GET_FIELD(status,
> > +   GCVM_L2_PROTECTION_FAULT_STATUS, RW));
> > +}
> > +
> >   u64 gfxhub_v2_0_get_fb_location(struct amdgpu_device *adev)
> >   {
> > u64 base = RREG32_SOC15(GC, 0, mmGCMC_VM_FB_LOCATION_BASE);
> > @@ -360,6 +387,10 @@ void gfxhub_v2_0_set_fault_enable_default(struct 
> > amdgpu_device *adev,
> > WREG32_SOC15(GC, 0, mmGCVM_L2_PROTECTION_FAULT_CNTL, tmp);
> >   }
> >   
> > +static const struct amdgpu_vmhub_funcs gfxhub_v2_0_vmhub_funcs = {
> > +   .print_l2_protection_fault_status = 
> > gfxhub_v2_0_print_l2_protection_fault_status,
> > +};
> > +
> >   void gfxhub_v2_0_init(struct amdgpu_device *adev)
> >   {
> > struct amdgpu_vmhub *hub = &adev->vmhub[AMDGPU_GFXHUB_0];
> > @@ -398,4 +429,7 @@ void gfxhub_v2_0_init(struct amdgpu_device *adev)
> > GCVM_CONTEXT1_CNTL__READ_PROTECTION_FAULT_ENABLE_INTERRUPT_MASK 
> > |
> > 
> > GCVM_CONTEXT1_CNTL__WRITE_PROTECTION_FAULT_ENABLE_INTERRUPT_MASK |
> > 
> > GCVM_CONTEXT1_CNTL__EXECUTE_PROTECTION_FAULT_ENABLE_INTERRUPT_MASK;
> > +
> > +   if (hub->vmhub_funcs == NULL)
> > +   hub->vmhub_funcs = &gfxhub_v2_0_vmhub_funcs;
> >   }
> > diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c 
> > b/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c
> > index 07cae64..45fbce7 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c
> > @@ -31,6 +31,33 @@
> >   
> >   #include "soc15_common.h"
> >   
> > +static void
> > +gfxhub_v2_1_print_l2_protection_fault_status(struct amdgpu_device *adev,
> > +uint32_t status)
> > +{
> > +   dev_err(adev->dev,
> > +   "GCVM_L2_PROTECTION_FAULT_STATUS:0x%08X\n",
> > +   status);
> > +   dev_err(adev->dev, "\t Faulty UTCL2 client ID: 0x%lx\n",
> > + 

Re: [PATCH 3/5] drm/amdgpu: add vmhub funcs helper

2020-07-22 Thread Christian König

Am 21.07.20 um 12:29 schrieb Huang Rui:

This patch introduces a vmhub funcs helper to add the following callback
(print_l2_protection_fault_status). GC/MMHUB register-specific programming
should live at the gfxhub/mmhub level.

Signed-off-by: Huang Rui 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h  |  7 +++
  drivers/gpu/drm/amd/amdgpu/gfxhub_v2_0.c | 34 
  drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c | 34 
  drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c   | 25 ++-
  drivers/gpu/drm/amd/amdgpu/mmhub_v2_0.c  | 34 
  5 files changed, 111 insertions(+), 23 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
index 1785a0e..bbecd87 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
@@ -74,6 +74,11 @@ struct amdgpu_gmc_fault {
  /*
   * VMHUB structures, functions & helpers
   */
+struct amdgpu_vmhub_funcs {
+   void (*print_l2_protection_fault_status)(struct amdgpu_device *adev,
+uint32_t status);
+};
+
  struct amdgpu_vmhub {
uint32_tctx0_ptb_addr_lo32;
uint32_tctx0_ptb_addr_hi32;
@@ -94,6 +99,8 @@ struct amdgpu_vmhub {
uint32_teng_addr_distance; /* include LO32/HI32 */
  
  	uint32_t	vm_cntx_cntl_vm_fault;

+
+   const struct amdgpu_vmhub_funcs *vmhub_funcs;
  };
  
  /*

diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_0.c
index 993185f..14268ea 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_0.c
@@ -31,6 +31,33 @@
  
  #include "soc15_common.h"
  
+static void

+gfxhub_v2_0_print_l2_protection_fault_status(struct amdgpu_device *adev,
+uint32_t status)
+{
+   dev_err(adev->dev,
+   "GCVM_L2_PROTECTION_FAULT_STATUS:0x%08X\n",
+   status);
+   dev_err(adev->dev, "\t Faulty UTCL2 client ID: 0x%lx\n",
+   REG_GET_FIELD(status,
+   GCVM_L2_PROTECTION_FAULT_STATUS, CID));
+   dev_err(adev->dev, "\t MORE_FAULTS: 0x%lx\n",
+   REG_GET_FIELD(status,
+   GCVM_L2_PROTECTION_FAULT_STATUS, MORE_FAULTS));
+   dev_err(adev->dev, "\t WALKER_ERROR: 0x%lx\n",
+   REG_GET_FIELD(status,
+   GCVM_L2_PROTECTION_FAULT_STATUS, WALKER_ERROR));
+   dev_err(adev->dev, "\t PERMISSION_FAULTS: 0x%lx\n",
+   REG_GET_FIELD(status,
+   GCVM_L2_PROTECTION_FAULT_STATUS, PERMISSION_FAULTS));
+   dev_err(adev->dev, "\t MAPPING_ERROR: 0x%lx\n",
+   REG_GET_FIELD(status,
+   GCVM_L2_PROTECTION_FAULT_STATUS, MAPPING_ERROR));
+   dev_err(adev->dev, "\t RW: 0x%lx\n",
+   REG_GET_FIELD(status,
+   GCVM_L2_PROTECTION_FAULT_STATUS, RW));
+}
+
  u64 gfxhub_v2_0_get_fb_location(struct amdgpu_device *adev)
  {
u64 base = RREG32_SOC15(GC, 0, mmGCMC_VM_FB_LOCATION_BASE);
@@ -360,6 +387,10 @@ void gfxhub_v2_0_set_fault_enable_default(struct 
amdgpu_device *adev,
WREG32_SOC15(GC, 0, mmGCVM_L2_PROTECTION_FAULT_CNTL, tmp);
  }
  
+static const struct amdgpu_vmhub_funcs gfxhub_v2_0_vmhub_funcs = {

+   .print_l2_protection_fault_status = 
gfxhub_v2_0_print_l2_protection_fault_status,
+};
+
  void gfxhub_v2_0_init(struct amdgpu_device *adev)
  {
struct amdgpu_vmhub *hub = &adev->vmhub[AMDGPU_GFXHUB_0];
@@ -398,4 +429,7 @@ void gfxhub_v2_0_init(struct amdgpu_device *adev)
GCVM_CONTEXT1_CNTL__READ_PROTECTION_FAULT_ENABLE_INTERRUPT_MASK 
|

GCVM_CONTEXT1_CNTL__WRITE_PROTECTION_FAULT_ENABLE_INTERRUPT_MASK |

GCVM_CONTEXT1_CNTL__EXECUTE_PROTECTION_FAULT_ENABLE_INTERRUPT_MASK;
+
+   if (hub->vmhub_funcs == NULL)
+   hub->vmhub_funcs = &gfxhub_v2_0_vmhub_funcs;
  }
diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c 
b/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c
index 07cae64..45fbce7 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c
@@ -31,6 +31,33 @@
  
  #include "soc15_common.h"
  
+static void

+gfxhub_v2_1_print_l2_protection_fault_status(struct amdgpu_device *adev,
+uint32_t status)
+{
+   dev_err(adev->dev,
+   "GCVM_L2_PROTECTION_FAULT_STATUS:0x%08X\n",
+   status);
+   dev_err(adev->dev, "\t Faulty UTCL2 client ID: 0x%lx\n",
+   REG_GET_FIELD(status,
+   GCVM_L2_PROTECTION_FAULT_STATUS, CID));
+   dev_err(adev->dev, "\t MORE_FAULTS: 0x%lx\n",
+   REG_GET_FIELD(status,
+   GCVM_L2_PROTECTION_FAULT_STATUS, MORE_FAULTS));
+   dev_err(adev->dev, "\t WALKER_ERROR: 0x%lx\n",
+   REG_GET_FIELD(status,
+   GCVM_L2_PROTECTION_FAULT_STATUS, 
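
Both replies in this thread review the same callback-table approach: the hub carries a
const ops pointer and the generic GMC code calls through it, while the GC/MMHUB-specific
register decoding stays in the respective hub files. A tiny self-contained sketch of the
pattern (userspace C with hypothetical toy types, not the driver's code):

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* Illustrative stand-ins for amdgpu_device / amdgpu_vmhub. */
struct toy_device;

struct toy_vmhub_funcs {
	void (*print_l2_protection_fault_status)(struct toy_device *dev,
						 uint32_t status);
};

struct toy_vmhub {
	const struct toy_vmhub_funcs *vmhub_funcs;
};

struct toy_device {
	struct toy_vmhub gfxhub;
};

/* gfxhub-specific decoding lives with the gfxhub code ... */
static void gfxhub_print_fault(struct toy_device *dev, uint32_t status)
{
	(void)dev;
	printf("FAULT_STATUS: 0x%08X, CID: 0x%x\n", status, status & 0xffu);
}

static const struct toy_vmhub_funcs gfxhub_vmhub_funcs = {
	.print_l2_protection_fault_status = gfxhub_print_fault,
};

static void gfxhub_init(struct toy_device *dev)
{
	/* ... and is hooked up once, guarded like in the patch. */
	if (dev->gfxhub.vmhub_funcs == NULL)
		dev->gfxhub.vmhub_funcs = &gfxhub_vmhub_funcs;
}

int main(void)
{
	struct toy_device dev = { { NULL } };

	gfxhub_init(&dev);
	/* generic gmc code only calls through the table: */
	dev.gfxhub.vmhub_funcs->print_l2_protection_fault_status(&dev, 0x3041u);
	return 0;
}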

RE: [PATCH] Revert "drm/amd/powerplay: drop unnecessary message support check"

2020-07-22 Thread Feng, Kenneth
[AMD Official Use Only - Internal Distribution Only]

Reviewed-by: Kenneth Feng 


-Original Message-
From: Zhu, Changfeng  
Sent: Wednesday, July 22, 2020 4:01 PM
To: amd-gfx@lists.freedesktop.org; Feng, Kenneth ; Huang, 
Ray 
Cc: Zhu, Changfeng 
Subject: [PATCH] Revert "drm/amd/powerplay: drop unnecessary message support 
check"

From: changzhu 

From: Changfeng 

The below 3 messages are not supported on Renoir:
SMU_MSG_PrepareMp1ForShutdown
SMU_MSG_PrepareMp1ForUnload
SMU_MSG_PrepareMp1ForReset

We need to revert the patch:
drm/amd/powerplay: drop unnecessary message support check
to avoid a failure to set the mp1 state during GPU reset on Renoir.

Change-Id: Ib34d17ab88e1c88173827cca962d8154ad883ab7
Signed-off-by: changfeng 
---
 drivers/gpu/drm/amd/powerplay/amdgpu_smu.c | 9 +
 drivers/gpu/drm/amd/powerplay/smu_cmn.h| 2 +-
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c 
b/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c
index 838a369c9ec3..f778b00e49eb 100644
--- a/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c
+++ b/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c
@@ -34,6 +34,7 @@
 #include "sienna_cichlid_ppt.h"
 #include "renoir_ppt.h"
 #include "amd_pcie.h"
+#include "smu_cmn.h"
 
 /*
  * DO NOT use these for err/warn/info/debug messages.
@@ -1589,6 +1590,14 @@ int smu_set_mp1_state(struct smu_context *smu,
return 0;
}
 
+   /* some asics may not support those messages */
+   if (smu_cmn_to_asic_specific_index(smu,
+  CMN2ASIC_MAPPING_MSG,
+  msg) < 0) {
+   mutex_unlock(&smu->mutex);
+   return 0;
+   }
+
ret = smu_send_smc_msg(smu, msg, NULL);
if (ret)
	dev_err(smu->adev->dev, "[PrepareMp1] Failed!\n");
diff --git a/drivers/gpu/drm/amd/powerplay/smu_cmn.h b/drivers/gpu/drm/amd/powerplay/smu_cmn.h
index 98face8c5fd6..f9e63f18b157 100644
--- a/drivers/gpu/drm/amd/powerplay/smu_cmn.h
+++ b/drivers/gpu/drm/amd/powerplay/smu_cmn.h
@@ -25,7 +25,7 @@
 
 #include "amdgpu_smu.h"
 
-#if defined(SWSMU_CODE_LAYER_L2) || defined(SWSMU_CODE_LAYER_L3) || defined(SWSMU_CODE_LAYER_L4)
+#if defined(SWSMU_CODE_LAYER_L1) || defined(SWSMU_CODE_LAYER_L2) || defined(SWSMU_CODE_LAYER_L3) || defined(SWSMU_CODE_LAYER_L4)
 int smu_cmn_send_smc_msg_with_param(struct smu_context *smu,
enum smu_message_type msg,
uint32_t param,
--
2.17.1
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea

2020-07-22 Thread Intel


On 2020-07-22 09:11, Daniel Vetter wrote:

On Wed, Jul 22, 2020 at 8:45 AM Thomas Hellström (Intel)
 wrote:


On 2020-07-22 00:45, Dave Airlie wrote:

On Tue, 21 Jul 2020 at 18:47, Thomas Hellström (Intel)
 wrote:

On 7/21/20 9:45 AM, Christian König wrote:

Am 21.07.20 um 09:41 schrieb Daniel Vetter:

On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel)
wrote:

Hi,

On 7/9/20 2:33 PM, Daniel Vetter wrote:

Comes up every few years, gets somewhat tedious to discuss, let's
write this down once and for all.

What I'm not sure about is whether the text should be more explicit in
flat out mandating the amdkfd eviction fences for long running compute
workloads or workloads where userspace fencing is allowed.

Although (in my humble opinion) it might be possible to completely
untangle
kernel-introduced fences for resource management and dma-fences used
for
completion- and dependency tracking and lift a lot of restrictions
for the
dma-fences, including prohibiting infinite ones, I think this makes
sense
describing the current state.

Yeah I think a future patch needs to type up how we want to make that
happen (for some cross driver consistency) and what needs to be
considered. Some of the necessary parts are already there (with like the
preemption fences amdkfd has as an example), but I think some clear docs
on what's required from both hw, drivers and userspace would be really
good.

I'm currently writing that up, but probably still need a few days for
this.

Great! I put down some (very) initial thoughts a couple of weeks ago
building on eviction fences for various hardware complexity levels here:

https://gitlab.freedesktop.org/thomash/docs/-/blob/master/Untangling%20dma-fence%20and%20memory%20allocation.odt

We are seeing HW that has recoverable GPU page faults but only for
compute tasks, and scheduler without semaphores hw for graphics.

So a single driver may have to expose both models to userspace and
also introduces the problem of how to interoperate between the two
models on one card.

Dave.

Hmm, yes to begin with it's important to note that this is not a
replacement for new programming models or APIs, This is something that
takes place internally in drivers to mitigate many of the restrictions
that are currently imposed on dma-fence and documented in this and
previous series. It's basically the driver-private narrow completions
Jason suggested in the lockdep patches discussions implemented the same
way as eviction-fences.

The memory fence API would be local to helpers and middle-layers like
TTM, and the corresponding drivers.  The only cross-driver-like
visibility would be that the dma-buf move_notify() callback would not be
allowed to wait on dma-fences or something that depends on a dma-fence.

Because we can't preempt (on some engines at least) we already have
the requirement that cross driver buffer management can get stuck on a
dma-fence. Not even taking into account the horrors we do with
userptr, which are cross driver no matter what. Limiting move_notify
to memory fences only doesn't work, since the pte clearing might need
to wait for a dma_fence first. Hence this becomes a full end-of-batch
fence, not just a limited kernel-internal memory fence.


For non-preemptible hardware the memory fence typically *is* the 
end-of-batch fence. (Unless, as documented, there is a scheduler 
consuming sync-file dependencies in which case the memory fence wait 
needs to be able to break out of that). The key thing is not that we can 
break out of execution, but that we can break out of dependencies, since 
when we're executing, all dependencies (modulo semaphores) are already 
fulfilled. That's what eliminates the deadlocks.




That's kinda why I think only reasonable option is to toss in the
towel and declare dma-fence to be the memory fence (and suck up all
the consequences of that decision as uapi, which is kinda where we
are), and construct something new free-wheeling for userspace
fencing. But only for engines that allow enough preempt/gpu page
faulting to make that possible. Free wheeling userspace fences/gpu
semaphores or whatever you want to call them (on windows I think it's
monitored fence) only work if you can preempt to decouple the memory
fences from your gpu command execution.

There's the in-between step of just decoupling the batchbuffer
submission prep for hw without any preempt (but a scheduler), but that
seems kinda pointless. Modern execbuf should be O(1) fastpath, with
all the allocation/mapping work pulled out ahead. vk exposes that
model directly to clients, GL drivers could use it internally too, so
I see zero value in spending lots of time engineering very tricky
kernel code just for old userspace. Much more reasonable to do that in
userspace, where we have real debuggers and no panics about security
bugs (or well, a lot less, webgl is still a thing, but at least
browsers realized you need to contain that completely).


Sure, it's definitely a big chunk of work. 

Re:

2020-07-22 Thread Mauro Rossi
Hello,
re-sending and copying full DL

On Wed, Jul 22, 2020 at 4:51 AM Alex Deucher  wrote:

> On Mon, Jul 20, 2020 at 6:00 AM Mauro Rossi  wrote:
> >
> > Hi Christian,
> >
> > On Mon, Jul 20, 2020 at 11:00 AM Christian König
> >  wrote:
> > >
> > > Hi Mauro,
> > >
> > > I'm not deep into the whole DC design, so just some general high level
> > > comments on the cover letter:
> > >
> > > 1. Please add a subject line to the cover letter, my spam filter thinks
> > > that this is suspicious otherwise.
> >
> > My mistake in editing the cover letter with git send-email;
> > I may have forgotten to keep the Subject at the top
> >
> > >
> > > 2. Then you should probably note how well (badly?) that is tested. Since
> > > you noted proof of concept it might not even work.
> >
> > The Changelog is to be read as:
> >
> > [RFC] was the initial proof of concept and [PATCH v2] was
> > just a rebase onto amd-staging-drm-next
> >
> > this series [PATCH v3] has all the known changes required for DCE6 specificity,
> > based on a long offline thread with Alexander Deucher and past
> > dri-devel chats with Harry Wentland.
> >
> > It was tested as far as my setup allows, with an HD7750 and an HD7950,
> > checking the dmesg output for the absence of "missing registers/masks"
> > kernel WARNINGs,
> > and with kernel builds on Ubuntu 20.04 and with android-x86
> >
> > The proposal I made to Alex is that AMD testing systems will be used
> > for further regression testing,
> > as part of review and validation for eligibility to amd-staging-drm-next
> >
>
> We will certainly test it once it lands, but presumably this is
> working on the SI cards you have access to?
>

Yes, most of my testing was done with android-x86, running the Android CTS (EGL,
GLES2, GLES3, VK)

I am also in contact with a person with Firepro W5130M who is running a
piglit session

I had bought an HD7850 to test with Pitcairn, but it arrived as defective,
so I could not test with Pitcairn



> > >
> > > 3. How feature complete (HDMI audio?, Freesync?) is it?
> >
> > All the changes in DC impacting DCE8 (dc/dce80 path) were ported to
> > DCE6 (dc/dce60 path) over the two years since the initial submission
> >
> > >
> > > Apart from that it looks like a rather impressive piece of work :)
> > >
> > > Cheers,
> > > Christian.
> >
> > Thanks,
> > please consider that most of the latest DCE6 specific parts were
> > possible thanks to Alex's recent support in getting the correct DCE6
> > headers,
> > and to his suggestions and continuous feedback.
> >
> > I would suggest that Alex comments on the proposed next steps to follow.
>
> The code looks pretty good to me.  I'd like to get some feedback from
> the display team to see if they have any concerns, but beyond that I
> think we can pull it into the tree and continue improving it there.
> Do you have a link to a git tree I can pull directly that contains
> these patches?  Is this the right branch?
> https://github.com/maurossi/linux/commits/kernel-5.8rc4_si_next
>
> Thanks!
>
> Alex
>

The following branch was pushed with the series on top of
amd-staging-drm-next

https://github.com/maurossi/linux/commits/kernel-5.6_si_drm-next


>
> >
> > Mauro
> >
> > >
> > > Am 16.07.20 um 23:22 schrieb Mauro Rossi:
> > > > The series adds SI support to AMD DC
> > > >
> > > > Changelog:
> > > >
> > > > [RFC]
> > > > Preliminary Proof Of Concept, with DCE8 headers still used in
> dce60_resources.c
> > > >
> > > > [PATCH v2]
> > > > Rebase on amd-staging-drm-next dated 17-Oct-2018
> > > >
> > > > [PATCH v3]
> > > > Add support for DCE6 specific headers,
> > > > ad hoc DCE6 macros, functions and fixes,
> > > > rebase on current amd-staging-drm-next
> > > >
> > > >
> > > > Commits [01/27]..[08/27] SI support added in various DC components
> > > >
> > > > [PATCH v3 01/27] drm/amdgpu: add some required DCE6 registers (v6)
> > > > [PATCH v3 02/27] drm/amd/display: add asics info for SI parts
> > > > [PATCH v3 03/27] drm/amd/display: dc/dce: add initial DCE6 support
> (v9b)
> > > > [PATCH v3 04/27] drm/amd/display: dc/core: add SI/DCE6 support (v2)
> > > > [PATCH v3 05/27] drm/amd/display: dc/bios: add support for DCE6
> > > > [PATCH v3 06/27] drm/amd/display: dc/gpio: add support for DCE6 (v2)
> > > > [PATCH v3 07/27] drm/amd/display: dc/irq: add support for DCE6 (v4)
> > > > [PATCH v3 08/27] drm/amd/display: amdgpu_dm: add SI support (v4)
> > > >
> > > > Commits [09/27]..[24/27] DCE6 specific code adaptions
> > > >
> > > > [PATCH v3 09/27] drm/amd/display: dc/clk_mgr: add support for SI
> parts (v2)
> > > > [PATCH v3 10/27] drm/amd/display: dc/dce60: set max_cursor_size to 64
> > > > [PATCH v3 11/27] drm/amd/display: dce_audio: add DCE6 specific
> macros,functions
> > > > [PATCH v3 12/27] drm/amd/display: dce_dmcu: add DCE6 specific macros
> > > > [PATCH v3 13/27] drm/amd/display: dce_hwseq: add DCE6 specific
> macros,functions
> > > > [PATCH v3 14/27] drm/amd/display: dce_ipp: add DCE6 specific
> macros,functions
> > > > [PATCH v3 15/27] 

RE: [PATCH 2/5] drm/amdgpu: validate bad page threshold in ras

2020-07-22 Thread Yang, Stanley
[AMD Official Use Only - Internal Distribution Only]

Hi Guchun,

Please see my comment inline.

> -Original Message-
> From: Chen, Guchun 
> Sent: Wednesday, July 22, 2020 11:14 AM
> To: amd-gfx@lists.freedesktop.org; Deucher, Alexander
> ; Zhang, Hawking
> ; Li, Dennis ; Yang,
> Stanley ; Zhou1, Tao ;
> Clements, John 
> Cc: Chen, Guchun 
> Subject: [PATCH 2/5] drm/amdgpu: validate bad page threshold in ras
> 
> The bad page threshold value should be valid in the range between
> -1 and the maximum record length of the EEPROM. It determines when the GPU
> should be retired.
> 
> Signed-off-by: Guchun Chen 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c   | 43
> +++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h   |  3 ++
>  .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c|  5 +++
>  .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h|  2 +
>  4 files changed, 53 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 6f06e1214622..e3d67d85c55f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -69,6 +69,9 @@ const char *ras_block_string[] = {
>  /* inject address is 52 bits */
>  #define  RAS_UMC_INJECT_ADDR_LIMIT   (0x1ULL << 52)
> 
> +/* typical ECC bad page rate(1 bad page per 100MB VRAM) */
> +#define RAS_BAD_PAGE_RATE(100 * 1024 * 1024ULL)
> +
>  enum amdgpu_ras_retire_page_reservation {
>   AMDGPU_RAS_RETIRE_PAGE_RESERVED,
>   AMDGPU_RAS_RETIRE_PAGE_PENDING,
> @@ -1700,6 +1703,42 @@ static bool amdgpu_ras_check_bad_page(struct
> amdgpu_device *adev,
>   return ret;
>  }
> 
> +static void amdgpu_ras_validate_threshold(struct amdgpu_device *adev,
> + uint32_t max_length)
> +{
> + struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> + int tmp_threshold = amdgpu_bad_page_threshold;
> + u64 val;
> +
> + /*
> +  * Justification of value bad_page_cnt_threshold in ras structure
> +  *
> +  * 1. -1 <= amdgpu_bad_page_threshold <= max record length in
> eeprom
> +  * 2. if amdgpu_bad_page_threshold = -1,
> +  *bad_page_cnt_threshold = typical value by formula.
> +  * 3. if amdgpu_bad_page_threshold = 0,
> +  *bad_page_cnt_threshold = 0xFFFFFFFF,
> +  *and disable RMA feature accordingly.
> +  * 4. use the value specified from user when
> (amdgpu_bad_page_threshold
> +  *> 0 && < max record length in eeprom).
> +  */
> +
> + if (tmp_threshold < -1)
> + tmp_threshold = -1;
> + else if (tmp_threshold > max_length)
> + tmp_threshold = max_length;
> +
> + if (tmp_threshold == -1) {
> + val = adev->gmc.mc_vram_size;
> + do_div(val, RAS_BAD_PAGE_RATE);
> + con->bad_page_cnt_threshold = lower_32_bits(val);
[Yang, Stanley] : It's better to compare con->bad_page_cnt_threshold with
max_length, since the value of bad_page_cnt_threshold should not exceed
max_length (a small sketch of this clamping follows the quoted patch).

> + } else if (tmp_threshold == 0) {
> + con->bad_page_cnt_threshold = 0xFFFFFFFF;
> + } else {
> + con->bad_page_cnt_threshold = tmp_threshold;
> + }
> +}
> +
>  /* called in gpu recovery/init */
>  int amdgpu_ras_reserve_bad_pages(struct amdgpu_device *adev)  {
> @@ -1777,6 +1816,7 @@ int amdgpu_ras_recovery_init(struct amdgpu_device *adev)  {
>   struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
>   struct ras_err_handler_data **data;
> + uint32_t max_eeprom_records_len = 0;
>   int ret;
> 
>   if (con)
> @@ -1795,6 +1835,9 @@ int amdgpu_ras_recovery_init(struct
> amdgpu_device *adev)
>   atomic_set(&con->in_recovery, 0);
>   con->adev = adev;
> 
> + max_eeprom_records_len =
> amdgpu_ras_eeprom_get_record_max_length();
> + amdgpu_ras_validate_threshold(adev, max_eeprom_records_len);
> +
>   ret = amdgpu_ras_eeprom_init(&con->eeprom_control);
>   if (ret)
>   goto free;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> index b2667342cf67..4672649a9293 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> @@ -336,6 +336,9 @@ struct amdgpu_ras {
>   struct amdgpu_ras_eeprom_control eeprom_control;
> 
>   bool error_query_ready;
> +
> + /* bad page count threshold */
> + uint32_t bad_page_cnt_threshold;
>  };
> 
>  struct ras_fs_data {
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> index c0096097bbcf..a2c982b1eac6 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> @@ -499,6 +499,11 @@ int amdgpu_ras_eeprom_process_recods(struct
> amdgpu_ras_eeprom_control *control,
>   return ret == num ? 0 : -EIO;
>  }
> 
> +inline uint32_t amdgpu_ras_eeprom_get_record_max_length(void)
> 
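
To illustrate the validation rules spelled out in the patch comment above, plus the
extra clamp Stanley suggests, here is a small self-contained model in plain C. The
names, the 512-record limit, and the 16 GB VRAM size are made up for the example;
only the -1/0/positive-threshold behaviour mirrors the patch:

#include <stdio.h>
#include <stdint.h>

#define BAD_PAGE_RATE	(100ULL * 1024 * 1024)	/* 1 bad page per 100MB VRAM */

static uint32_t validate_threshold(int user_threshold,
				   uint64_t vram_size,
				   uint32_t max_length)
{
	if (user_threshold < -1)
		user_threshold = -1;
	else if (user_threshold > (int)max_length)
		user_threshold = (int)max_length;

	if (user_threshold == -1) {
		uint64_t val = vram_size / BAD_PAGE_RATE;
		if (val > max_length)	/* the extra clamp suggested in review */
			val = max_length;
		return (uint32_t)val;
	} else if (user_threshold == 0) {
		return 0xFFFFFFFFu;	/* effectively disables retirement */
	}
	return (uint32_t)user_threshold;
}

int main(void)
{
	/* 16 GB of VRAM, EEPROM assumed to hold 512 records */
	printf("%u\n", validate_threshold(-1, 16ULL << 30, 512));  /* 163 (by formula) */
	printf("%u\n", validate_threshold(0,  16ULL << 30, 512));  /* 4294967295 */
	printf("%u\n", validate_threshold(900, 16ULL << 30, 512)); /* clamped to 512 */
	return 0;
}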

RE: [PATCH] drm/amdgpu: add RAS EEPROM support for sienna cichlid

2020-07-22 Thread Clements, John
[AMD Public Use]

Good points, I'll reorganize and resubmit the patch after Guchun's changes are in

From: Zhang, Hawking 
Sent: Tuesday, July 21, 2020 7:11 PM
To: Clements, John ; amd-gfx list 
; Chen, Guchun 
Subject: RE: [PATCH] drm/amdgpu: add RAS EEPROM support for sienna cichlid


[AMD Public Use]

-	if (adev->asic_type != CHIP_VEGA20 && adev->asic_type != CHIP_ARCTURUS)
+	if (adev->asic_type != CHIP_VEGA20   &&
+	    adev->asic_type != CHIP_ARCTURUS &&
+	    adev->asic_type != CHIP_SIENNA_CICHLID)
 		return 0;

Does it make sense to check UMC RAS availability through
amdgpu_ras_is_supported, instead of checking specific ASIC types one by one?

Also, it would be good to merge the upcoming logic from Guchun where we have a
global flag to mark the availability of bad page retirement.

Thoughts?

Regards,
Hawking
From: Clements, John 
Sent: Tuesday, July 21, 2020 18:02
To: amd-gfx list ; Zhang, Hawking 
Subject: [PATCH] drm/amdgpu: add RAS EEPROM support for sienna cichlid


[AMD Public Use]

Submitting patch to enable RAS EEPROM support for sienna cichlid
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
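
A rough sketch of Hawking's suggestion, gating on UMC RAS support rather than on a
growing list of ASIC types. The toy types below are illustrative only; in the driver
this role would be played by the amdgpu_ras_is_supported() helper named in the reply,
queried for the UMC block:

#include <stdbool.h>
#include <stdio.h>

enum ras_block { RAS_BLOCK_UMC = 0, RAS_BLOCK_GFX, RAS_BLOCK_COUNT };

struct toy_device {
	unsigned int ras_supported;	/* bitmask of supported RAS blocks */
};

static bool ras_is_supported(const struct toy_device *dev, enum ras_block b)
{
	return (dev->ras_supported & (1u << b)) != 0;
}

static int eeprom_init(struct toy_device *dev)
{
	if (!ras_is_supported(dev, RAS_BLOCK_UMC))
		return 0;	/* nothing to do on parts without UMC RAS */
	printf("initializing RAS EEPROM\n");
	return 0;
}

int main(void)
{
	struct toy_device with_umc_ras = { 1u << RAS_BLOCK_UMC };
	struct toy_device without_ras  = { 0 };

	eeprom_init(&with_umc_ras);	/* runs */
	eeprom_init(&without_ras);	/* skipped */
	return 0;
}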


Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea

2020-07-22 Thread Daniel Vetter
On Wed, Jul 22, 2020 at 8:45 AM Thomas Hellström (Intel)
 wrote:
>
>
> On 2020-07-22 00:45, Dave Airlie wrote:
> > On Tue, 21 Jul 2020 at 18:47, Thomas Hellström (Intel)
> >  wrote:
> >>
> >> On 7/21/20 9:45 AM, Christian König wrote:
> >>> Am 21.07.20 um 09:41 schrieb Daniel Vetter:
>  On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel)
>  wrote:
> > Hi,
> >
> > On 7/9/20 2:33 PM, Daniel Vetter wrote:
> >> Comes up every few years, gets somewhat tedious to discuss, let's
> >> write this down once and for all.
> >>
> >> What I'm not sure about is whether the text should be more explicit in
> >> flat out mandating the amdkfd eviction fences for long running compute
> >> workloads or workloads where userspace fencing is allowed.
> > Although (in my humble opinion) it might be possible to completely
> > untangle
> > kernel-introduced fences for resource management and dma-fences used
> > for
> > completion- and dependency tracking and lift a lot of restrictions
> > for the
> > dma-fences, including prohibiting infinite ones, I think this makes
> > sense
> > describing the current state.
>  Yeah I think a future patch needs to type up how we want to make that
>  happen (for some cross driver consistency) and what needs to be
>  considered. Some of the necessary parts are already there (with like the
>  preemption fences amdkfd has as an example), but I think some clear docs
>  on what's required from both hw, drivers and userspace would be really
>  good.
> >>> I'm currently writing that up, but probably still need a few days for
> >>> this.
> >> Great! I put down some (very) initial thoughts a couple of weeks ago
> >> building on eviction fences for various hardware complexity levels here:
> >>
> >> https://gitlab.freedesktop.org/thomash/docs/-/blob/master/Untangling%20dma-fence%20and%20memory%20allocation.odt
> > We are seeing HW that has recoverable GPU page faults but only for
> > compute tasks, and scheduler without semaphores hw for graphics.
> >
> > So a single driver may have to expose both models to userspace and
> > also introduces the problem of how to interoperate between the two
> > models on one card.
> >
> > Dave.
>
> Hmm, yes to begin with it's important to note that this is not a
> replacement for new programming models or APIs; this is something that
> takes place internally in drivers to mitigate many of the restrictions
> that are currently imposed on dma-fence and documented in this and
> previous series. It's basically the driver-private narrow completions
> Jason suggested in the lockdep patches discussions implemented the same
> way as eviction-fences.
>
> The memory fence API would be local to helpers and middle-layers like
> TTM, and the corresponding drivers.  The only cross-driver-like
> visibility would be that the dma-buf move_notify() callback would not be
> allowed to wait on dma-fences or something that depends on a dma-fence.

Because we can't preempt (on some engines at least) we already have
the requirement that cross driver buffer management can get stuck on a
dma-fence. Not even taking into account the horrors we do with
userptr, which are cross driver no matter what. Limiting move_notify
to memory fences only doesn't work, since the pte clearing might need
to wait for a dma_fence first. Hence this becomes a full end-of-batch
fence, not just a limited kernel-internal memory fence.

That's kinda why I think only reasonable option is to toss in the
towel and declare dma-fence to be the memory fence (and suck up all
the consequences of that decision as uapi, which is kinda where we
are), and construct something new free-wheeling for userspace
fencing. But only for engines that allow enough preempt/gpu page
faulting to make that possible. Free wheeling userspace fences/gpu
semaphores or whatever you want to call them (on windows I think it's
monitored fence) only work if you can preempt to decouple the memory
fences from your gpu command execution.

There's the in-between step of just decoupling the batchbuffer
submission prep for hw without any preempt (but a scheduler), but that
seems kinda pointless. Modern execbuf should be O(1) fastpath, with
all the allocation/mapping work pulled out ahead. vk exposes that
model directly to clients, GL drivers could use it internally too, so
I see zero value in spending lots of time engineering very tricky
kernel code just for old userspace. Much more reasonable to do that in
userspace, where we have real debuggers and no panics about security
bugs (or well, a lot less, webgl is still a thing, but at least
browsers realized you need to contain that completely).

Cheers, Daniel

> So with that in mind, I don't foresee engines with different
> capabilities on the same card being a problem.
>
> /Thomas
>
>


-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
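
The constraint Daniel describes, that cross-driver memory management may end up waiting
on a dma-fence, is exactly why fences that only userspace will ever signal cannot be
allowed into that path. A deliberately oversimplified, self-contained toy model of the
resulting circular wait (plain C, nothing here is kernel API):

#include <stdbool.h>
#include <stdio.h>

struct toy_fence { bool signalled; };	/* stands in for a fence object */

/* "Kernel" side: eviction can only finish once the fence is signalled. */
static bool try_evict(struct toy_fence *batch_done)
{
	if (!batch_done->signalled)
		return false;	/* a real move_notify() would block here */
	printf("buffer evicted, memory freed\n");
	return true;
}

/* "Userspace" side: only signals once it got the memory it asked for. */
static void userspace_step(struct toy_fence *batch_done, bool got_memory)
{
	if (got_memory) {
		batch_done->signalled = true;
		printf("batch completed, fence signalled\n");
	}
}

int main(void)
{
	struct toy_fence f = { false };
	bool memory_free = false;

	/* Neither side can make the first move: */
	memory_free = try_evict(&f);		/* false: fence not signalled */
	userspace_step(&f, memory_free);	/* no-op: no memory yet       */
	memory_free = try_evict(&f);		/* still false: circular wait */
	printf("deadlocked: %s\n", memory_free ? "no" : "yes");
	return 0;
}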

Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea

2020-07-22 Thread Intel


On 2020-07-22 00:45, Dave Airlie wrote:

On Tue, 21 Jul 2020 at 18:47, Thomas Hellström (Intel)
 wrote:


On 7/21/20 9:45 AM, Christian König wrote:

Am 21.07.20 um 09:41 schrieb Daniel Vetter:

On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel)
wrote:

Hi,

On 7/9/20 2:33 PM, Daniel Vetter wrote:

Comes up every few years, gets somewhat tedious to discuss, let's
write this down once and for all.

What I'm not sure about is whether the text should be more explicit in
flat out mandating the amdkfd eviction fences for long running compute
workloads or workloads where userspace fencing is allowed.

Although (in my humble opinion) it might be possible to completely
untangle
kernel-introduced fences for resource management and dma-fences used
for
completion- and dependency tracking and lift a lot of restrictions
for the
dma-fences, including prohibiting infinite ones, I think this makes
sense
describing the current state.

Yeah I think a future patch needs to type up how we want to make that
happen (for some cross driver consistency) and what needs to be
considered. Some of the necessary parts are already there (with like the
preemption fences amdkfd has as an example), but I think some clear docs
on what's required from both hw, drivers and userspace would be really
good.

I'm currently writing that up, but probably still need a few days for
this.

Great! I put down some (very) initial thoughts a couple of weeks ago
building on eviction fences for various hardware complexity levels here:

https://gitlab.freedesktop.org/thomash/docs/-/blob/master/Untangling%20dma-fence%20and%20memory%20allocation.odt

We are seeing HW that has recoverable GPU page faults but only for
compute tasks, and scheduler without semaphores hw for graphics.

So a single driver may have to expose both models to userspace and
also introduces the problem of how to interoperate between the two
models on one card.

Dave.


Hmm, yes to begin with it's important to note that this is not a 
replacement for new programming models or APIs; this is something that 
takes place internally in drivers to mitigate many of the restrictions 
that are currently imposed on dma-fence and documented in this and 
previous series. It's basically the driver-private narrow completions 
Jason suggested in the lockdep patches discussions implemented the same 
way as eviction-fences.


The memory fence API would be local to helpers and middle-layers like 
TTM, and the corresponding drivers.  The only cross-driver-like 
visibility would be that the dma-buf move_notify() callback would not be 
allowed to wait on dma-fences or something that depends on a dma-fence.


So with that in mind, I don't foresee engines with different 
capabilities on the same card being a problem.


/Thomas


___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx