[PATCH] drm/amdgpu: enable VCN PG and CG for yellow carp

2021-01-19 Thread Aaron Liu
Enable VCN 3.0 PG and CG for Yellow Carp by setting the corresponding pg/cg support flags.

Signed-off-by: Aaron Liu 
---
 drivers/gpu/drm/amd/amdgpu/nv.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c b/drivers/gpu/drm/amd/amdgpu/nv.c
index 801cf79353dd..903e1ae166c5 100644
--- a/drivers/gpu/drm/amd/amdgpu/nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/nv.c
@@ -1020,9 +1020,13 @@ static int nv_common_early_init(void *handle)
AMD_CG_SUPPORT_HDP_LS |
AMD_CG_SUPPORT_ATHUB_MGCG |
AMD_CG_SUPPORT_ATHUB_LS |
-   AMD_CG_SUPPORT_IH_CG;
+   AMD_CG_SUPPORT_IH_CG |
+   AMD_CG_SUPPORT_VCN_MGCG |
+   AMD_CG_SUPPORT_JPEG_MGCG;
adev->pg_flags = AMD_PG_SUPPORT_GFX_PG |
-   AMD_PG_SUPPORT_VCN_DPG;
+   AMD_PG_SUPPORT_VCN |
+   AMD_PG_SUPPORT_VCN_DPG |
+   AMD_PG_SUPPORT_JPEG;
adev->external_rev_id = adev->rev_id + 0x01;
break;
default:
-- 
2.25.1
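
For context, a minimal sketch of how these masks are consulted at runtime by
the per-IP clock-gating code elsewhere in amdgpu (the helper name below is
hypothetical, not part of this patch):

    /* Sketch: feature code engages VCN MGCG only if the ASIC set the flag.
     * vcn_enable_mgcg() is a hypothetical helper used for illustration. */
    static void vcn_update_clock_gating_sketch(struct amdgpu_device *adev,
                                               bool enable)
    {
            vcn_enable_mgcg(adev, enable &&
                            (adev->cg_flags & AMD_CG_SUPPORT_VCN_MGCG));
    }

Without the new AMD_CG_SUPPORT_VCN_MGCG / AMD_PG_SUPPORT_VCN bits set for
Yellow Carp, checks like this would leave the feature permanently off.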



[PATCH] drm/amdgpu: Assign boolean values to a bool variable

2021-01-19 Thread Jiapeng Zhong
Fix the following coccicheck warnings:

./drivers/gpu/drm/amd/display/dc/dml/dcn30/display_rq_dlg_calc_30.c:
1009:6-16: WARNING: Assignment of 0/1 to bool variable.

./drivers/gpu/drm/amd/display/dc/dml/dcn30/display_rq_dlg_calc_30.c:
200:2-10: WARNING: Assignment of 0/1 to bool variable.
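
The flagged pattern in miniature (illustrative C, not taken from the driver):

    #include <stdbool.h>

    bool flagged   = 0;     /* what coccicheck warns about: 0/1 into a bool */
    bool preferred = false; /* the bool literals false/true are preferred */

The generated code is identical; the change is purely about readability and
type consistency.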

Reported-by: Abaci Robot 
Signed-off-by: Jiapeng Zhong 
---
 .../display/dc/dml/dcn30/display_rq_dlg_calc_30.c  | 32 +++---
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/dc/dml/dcn30/display_rq_dlg_calc_30.c 
b/drivers/gpu/drm/amd/display/dc/dml/dcn30/display_rq_dlg_calc_30.c
index 5b5916b..0f14f20 100644
--- a/drivers/gpu/drm/amd/display/dc/dml/dcn30/display_rq_dlg_calc_30.c
+++ b/drivers/gpu/drm/amd/display/dc/dml/dcn30/display_rq_dlg_calc_30.c
@@ -165,8 +165,8 @@ static void handle_det_buf_split(struct display_mode_lib 
*mode_lib,
unsigned int swath_bytes_c = 0;
unsigned int full_swath_bytes_packed_l = 0;
unsigned int full_swath_bytes_packed_c = 0;
-   bool req128_l = 0;
-   bool req128_c = 0;
+   bool req128_l = false;
+   bool req128_c = false;
bool surf_linear = (pipe_src_param.sw_mode == dm_sw_linear);
bool surf_vert = (pipe_src_param.source_scan == dm_vert);
unsigned int log2_swath_height_l = 0;
@@ -191,37 +191,37 @@ static void handle_det_buf_split(struct display_mode_lib 
*mode_lib,
total_swath_bytes = 2 * full_swath_bytes_packed_l;
 
if (total_swath_bytes <= detile_buf_size_in_bytes) { //full 256b request
-   req128_l = 0;
-   req128_c = 0;
+   req128_l = false;
+   req128_c = false;
swath_bytes_l = full_swath_bytes_packed_l;
swath_bytes_c = full_swath_bytes_packed_c;
} else if (!rq_param->yuv420) {
-   req128_l = 1;
-   req128_c = 0;
+   req128_l = true;
+   req128_c = false;
swath_bytes_c = full_swath_bytes_packed_c;
swath_bytes_l = full_swath_bytes_packed_l / 2;
} else if ((double)full_swath_bytes_packed_l / 
(double)full_swath_bytes_packed_c < 1.5) {
-   req128_l = 0;
-   req128_c = 1;
+   req128_l = false;
+   req128_c = true;
swath_bytes_l = full_swath_bytes_packed_l;
swath_bytes_c = full_swath_bytes_packed_c / 2;
 
total_swath_bytes = 2 * swath_bytes_l + 2 * swath_bytes_c;
 
if (total_swath_bytes > detile_buf_size_in_bytes) {
-   req128_l = 1;
+   req128_l = true;
swath_bytes_l = full_swath_bytes_packed_l / 2;
}
} else {
-   req128_l = 1;
-   req128_c = 0;
+   req128_l = true;
+   req128_c = false;
swath_bytes_l = full_swath_bytes_packed_l/2;
swath_bytes_c = full_swath_bytes_packed_c;
 
total_swath_bytes = 2 * swath_bytes_l + 2 * swath_bytes_c;
 
if (total_swath_bytes > detile_buf_size_in_bytes) {
-   req128_c = 1;
+   req128_c = true;
swath_bytes_c = full_swath_bytes_packed_c/2;
}
}
@@ -1006,8 +1006,8 @@ static void dml_rq_dlg_get_dlg_params(struct 
display_mode_lib *mode_lib,
 
double min_dst_y_ttu_vblank = 0;
unsigned int dlg_vblank_start = 0;
-   bool dual_plane = 0;
-   bool mode_422 = 0;
+   bool dual_plane = false;
+   bool mode_422 = false;
unsigned int access_dir = 0;
unsigned int vp_height_l = 0;
unsigned int vp_width_l = 0;
@@ -1021,7 +1021,7 @@ static void dml_rq_dlg_get_dlg_params(struct 
display_mode_lib *mode_lib,
double hratio_c = 0;
double vratio_l = 0;
double vratio_c = 0;
-   bool scl_enable = 0;
+   bool scl_enable = false;
 
double line_time_in_us = 0;
//  double vinit_l;
@@ -1156,7 +1156,7 @@ static void dml_rq_dlg_get_dlg_params(struct 
display_mode_lib *mode_lib,
// Source
//   dcc_en   = src.dcc;
dual_plane = is_dual_plane((enum 
source_format_class)(src->source_format));
-   mode_422 = 0; // TODO
+   mode_422 = false; // TODO
access_dir = (src->source_scan == dm_vert); // vp access direction: 
horizontal or vertical accessed
vp_height_l = src->viewport_height;
vp_width_l = src->viewport_width;
-- 
1.8.3.1



[pull] amdgpu drm-next-5.12

2021-01-19 Thread Alex Deucher
Hi Dave, Daniel,

More new stuff for 5.12.  Now with non-x86 fixed.

The following changes since commit 044a48f420b9d3c19a135b821c34de5b2bee4075:

  drm/amdgpu: fix DRM_INFO flood if display core is not supported (bug 210921) 
(2021-01-08 15:18:57 -0500)

are available in the Git repository at:

  https://gitlab.freedesktop.org/agd5f/linux.git 
tags/amd-drm-next-5.12-2021-01-20

for you to fetch changes up to 4aef0ebc6b65e8583bc3d96e05c7a039912b3ee6:

  drm/amdgpu: fix build error without x86 kconfig (v2) (2021-01-19 15:16:10 
-0500)


amd-drm-next-5.12-2021-01-20:

amdgpu:
- Fix non-x86 build
- W=1 fixes from Lee Jones
- Enable GPU reset on Navy Flounder
- Kernel doc fixes
- SMU workload profile fixes for APUs
- Display updates
- SR-IOV fixes
- Vangogh SMU feature enablement and bug fixes
- GPU reset support for Vangogh
- Misc cleanups


Alex Deucher (5):
  MAINTAINERS: update radeon/amdgpu/amdkfd git trees
  drm/amdgpu: add mode2 reset support for vangogh
  drm/amdgpu/nv: add mode2 reset handling
  drm/amdgpu: fix mode2 reset sequence for vangogh
  drm/amdgpu: Enable GPU reset for vangogh

Aric Cyr (2):
  drm/amd/display: 3.2.117
  drm/amd/display: 3.2.118

Bhawanpreet Lakha (2):
  drm/amd/display: enable HUBP blank behaviour
  drm/amd/display: Fix deadlock during gpu reset v3

Charlene Liu (1):
  drm/amd/display: change SMU repsonse timeout to 2s

Chiawen Huang (1):
  drm/amd/display: removed unnecessary check when dpp clock increasing

Colin Ian King (1):
  drm/amdgpu: Add missing BOOTUP_DEFAULT to profile_name[]

Emily.Deng (1):
  drm/amdgpu: Decrease compute timeout to 10 s for sriov multiple VF

Guchun Chen (1):
  drm/amdgpu: toggle on DF Cstate after finishing xgmi injection

Huang Rui (13):
  drm/amd/pm: remove vcn/jpeg powergating feature checking for vangogh
  drm/amd/pm: enhance the real response for smu message (v2)
  drm/amd/pm: clean up get_allowed_feature_mask function
  drm/amd/pm: initial feature_enabled/feature_support bitmap for vangogh
  drm/amd/pm: don't mark all apu as true on feature mask
  drm/amdgpu: revise the mode2 reset for vangogh
  drm/amd/pm: fix the return value of pm message
  drm/amd/pm: implement the processor clocks which read by metric
  drm/amd/pm: implement processor fine grain feature for vangogh (v3)
  drm/amdgpu: fix vram type and bandwidth error for DDR5 and DDR4
  drm/amd/display: fix the system memory page fault because of copy overflow
  drm/amd/display: fix the coding style issue of integrated_info
  drm/amdgpu: fix build error without x86 kconfig (v2)

Jack Zhang (1):
  drm/amdgpu/sriov Stop data exchange for wholegpu reset

Jacky Liao (1):
  drm/amd/display: Fix assert being hit with GAMCOR memory shut down

Jeremy Cline (1):
  drm/amdkfd: Fix out-of-bounds read in kdf_create_vcrat_image_cpu()

Jiansong Chen (2):
  drm/amdgpu: enable gpu recovery for navy_flounder
  drm/amd/pm: update driver if version for navy_flounder

Jinzhou Su (4):
  drm/amd/pm: Add GFXOFF interface for Vangogh
  drm/amd/pm: Enable GfxOff for Vangogh
  drm/amdgpu: Add Secure Display TA header file
  drm/amdgpu: Add secure display TA interface

John Clements (1):
  drm/amdgpu: updated fw attestation interface

Jun Lei (1):
  drm/amd/display: implement T12 compliance

Lee Jones (90):
  drm/amd/amdgpu/amdgpu_ih: Update 'amdgpu_ih_decode_iv_helper()'s function 
header
  drm/amd/amdgpu/vega20_ih: Add missing descriptions for 'ih' and fix 
spelling error
  drm/amd/pm/powerplay/hwmgr/process_pptables_v1_0: Provide description of 
'call_back_func'
  drm/amd/pm/powerplay/hwmgr/ppatomctrl: Fix documentation for 'mpll_param'
  drm/amd/pm/powerplay/hwmgr/vega12_hwmgr: Fix legacy function header 
formatting
  drm/amd/pm/powerplay/hwmgr/vega20_hwmgr: Fix legacy function header 
formatting
  drm/amd/pm/powerplay/hwmgr/smu7_hwmgr: Fix formatting and spelling issues
  drm/amd/pm/powerplay/hwmgr/hwmgr: Move prototype into shared header
  drm/amd/pm/powerplay/hwmgr/vega10_hwmgr: Fix a bunch of kernel-doc 
formatting issues
  drm/amd/display/dc/basics/conversion: Demote obvious kernel-doc abuse
  drm/amd/display/amdgpu_dm/amdgpu_dm_debugfs: Demote non-kernel-doc 
comment blocks
  drm/amd/display/dc/bios/command_table_helper: Fix kernel-doc formatting
  drm/amd/display/dc/bios/command_table_helper2: Fix legacy formatting 
problems
  drm/amd/display/dc/bios/bios_parser: Make local functions static
  drm/amd/display/dc/bios/bios_parser: Fix a whole bunch of legacy doc 
formatting
  drm/amd/display/dc/bios/bios_parser2: Fix some formatting issues and 
missing parameter docs
  drm/amd/display/dc/dce/dce_audio: Make function invoked by reference 
static
  drm/amd/dis

Re: [PATCH v4 07/14] drm/amdgpu: Register IOMMU topology notifier per device.

2021-01-19 Thread Andrey Grodzovsky


On 1/19/21 3:48 AM, Christian König wrote:

On 18.01.21 at 22:01, Andrey Grodzovsky wrote:

Handle all DMA IOMMU group related dependencies before the
group is removed.

Signed-off-by: Andrey Grodzovsky 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu.h    |  5 
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 46 ++
  drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c   |  2 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h   |  1 +
  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 10 +++
  drivers/gpu/drm/amd/amdgpu/amdgpu_object.h |  2 ++
  6 files changed, 65 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h

index 478a7d8..2953420 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -51,6 +51,7 @@
  #include 
  #include 
  #include 
+#include 
    #include 
  #include 
@@ -1041,6 +1042,10 @@ struct amdgpu_device {
    bool    in_pci_err_recovery;
  struct pci_saved_state  *pci_state;
+
+    struct notifier_block    nb;
+    struct blocking_notifier_head    notifier;
+    struct list_head    device_bo_list;
  };
    static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

index 45e23e3..e99f4f1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -70,6 +70,8 @@
  #include 
  #include 
  +#include 
+
  MODULE_FIRMWARE("amdgpu/vega10_gpu_info.bin");
  MODULE_FIRMWARE("amdgpu/vega12_gpu_info.bin");
  MODULE_FIRMWARE("amdgpu/raven_gpu_info.bin");
@@ -3200,6 +3202,39 @@ static const struct attribute *amdgpu_dev_attributes[] 
= {

  };
    +static int amdgpu_iommu_group_notifier(struct notifier_block *nb,
+ unsigned long action, void *data)
+{
+    struct amdgpu_device *adev = container_of(nb, struct amdgpu_device, nb);
+    struct amdgpu_bo *bo = NULL;
+
+    /*
+ * Following is a set of IOMMU group dependencies taken care of before
+ * device's IOMMU group is removed
+ */
+    if (action == IOMMU_GROUP_NOTIFY_DEL_DEVICE) {
+
+    spin_lock(&ttm_bo_glob.lru_lock);
+    list_for_each_entry(bo, &adev->device_bo_list, bo) {
+    if (bo->tbo.ttm)
+    ttm_tt_unpopulate(bo->tbo.bdev, bo->tbo.ttm);
+    }
+    spin_unlock(&ttm_bo_glob.lru_lock);


That approach won't work. ttm_tt_unpopulate() might sleep on an IOMMU lock.

You need to use a mutex here or even better make sure you can access the 
device_bo_list without a lock in this moment.


Christian.



I can think of switching to an RCU list? Otherwise, elements are added
on BO create and deleted on BO destroy; how can I prevent either of those from
happening while in this section, besides a mutex? Make a copy of the list and run
over it instead?


Andrey
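
A minimal sketch of the mutex variant Christian suggests (the lock name
device_bo_list_lock is hypothetical and would have to be added to struct
amdgpu_device):

    /* A mutex may be held across ttm_tt_unpopulate(), which can sleep
     * on an IOMMU lock; the lru_lock spinlock above cannot. */
    mutex_lock(&adev->device_bo_list_lock);
    list_for_each_entry(bo, &adev->device_bo_list, bo) {
            if (bo->tbo.ttm)
                    ttm_tt_unpopulate(bo->tbo.bdev, bo->tbo.ttm);
    }
    mutex_unlock(&adev->device_bo_list_lock);

BO create/destroy would take the same mutex when adding to or removing from
device_bo_list.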





+
+    if (adev->irq.ih.use_bus_addr)
+    amdgpu_ih_ring_fini(adev, &adev->irq.ih);
+    if (adev->irq.ih1.use_bus_addr)
+    amdgpu_ih_ring_fini(adev, &adev->irq.ih1);
+    if (adev->irq.ih2.use_bus_addr)
+    amdgpu_ih_ring_fini(adev, &adev->irq.ih2);
+
+    amdgpu_gart_dummy_page_fini(adev);
+    }
+
+    return NOTIFY_OK;
+}
+
+
  /**
   * amdgpu_device_init - initialize the driver
   *
@@ -3304,6 +3339,8 @@ int amdgpu_device_init(struct amdgpu_device *adev,
    INIT_WORK(&adev->xgmi_reset_work, amdgpu_device_xgmi_reset_func);
  +    INIT_LIST_HEAD(&adev->device_bo_list);
+
  adev->gfx.gfx_off_req_count = 1;
  adev->pm.ac_power = power_supply_is_system_supplied() > 0;
  @@ -3575,6 +3612,15 @@ int amdgpu_device_init(struct amdgpu_device *adev,
  if (amdgpu_device_cache_pci_state(adev->pdev))
  pci_restore_state(pdev);
  +    BLOCKING_INIT_NOTIFIER_HEAD(&adev->notifier);
+    adev->nb.notifier_call = amdgpu_iommu_group_notifier;
+
+    if (adev->dev->iommu_group) {
+    r = iommu_group_register_notifier(adev->dev->iommu_group, &adev->nb);
+    if (r)
+    goto failed;
+    }
+
  return 0;
    failed:
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c

index 0db9330..486ad6d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
@@ -92,7 +92,7 @@ static int amdgpu_gart_dummy_page_init(struct amdgpu_device 
*adev)

   *
   * Frees the dummy page used by the driver (all asics).
   */
-static void amdgpu_gart_dummy_page_fini(struct amdgpu_device *adev)
+void amdgpu_gart_dummy_page_fini(struct amdgpu_device *adev)
  {
  if (!adev->dummy_page_addr)
  return;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h

index afa2e28..5678d9c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h
@@ -61,6 +61,7 @@ int amdgpu_gart_table_vram_pin(

RE: [PATCH] amdgpu/ras: Fix bug disabling DF_CSTATE when injecting xgmi_wafl errors via debugfs

2021-01-19 Thread Chen, Guchun

Thanks Darren. We already have a fix for this from a few days ago:

drm/amdgpu: toggle on DF Cstate after finishing xgmi injection

Regards,
Guchun

-----Original Message-----
From: amd-gfx  On Behalf Of Darren Powell
Sent: Wednesday, January 20, 2021 12:46 PM
To: amd-gfx@lists.freedesktop.org
Cc: Powell, Darren 
Subject: [PATCH] amdgpu/ras: Fix bug disabling DF_CSTATE when injecting 
xgmi_wafl errors via debugfs

[snip]



[PATCH] amdgpu/ras: Fix bug disabling DF_CSTATE when injecting xgmi_wafl errors via debugfs

2021-01-19 Thread Darren Powell
A typo in amdgpu_ras_error_inject_xgmi() means df_state is not set back to
ALLOW after the test. This can be verified with the command:
 echo inject xgmi_wafl ue 0x0 0x0 0x0 > /sys/kernel/debug/dri/0/ras/ras_ctrl

Fixes patch 5c23e9e05e42b5ea56a87a17f1da9ccf9b100465

Signed-off-by: Darren Powell 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index c136bd449744..a6ec28fead07 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -846,7 +846,7 @@ static int amdgpu_ras_error_inject_xgmi(struct 
amdgpu_device *adev,
if (amdgpu_dpm_allow_xgmi_power_down(adev, true))
dev_warn(adev->dev, "Failed to allow XGMI power down");
 
-   if (amdgpu_dpm_set_df_cstate(adev, DF_CSTATE_DISALLOW))
+   if (amdgpu_dpm_set_df_cstate(adev, DF_CSTATE_ALLOW))
dev_warn(adev->dev, "Failed to allow df cstate");
 
return ret;

base-commit: ed94c622f91453aaca80029b0afdd2551a12e777
-- 
2.25.1
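
For clarity, a condensed sketch of the corrected flow in
amdgpu_ras_error_inject_xgmi() after this fix (the injection step and the
earlier DISALLOW call are summarized from the patch context, not quoted):

    /* DF C-states are disallowed around the injection, then re-allowed. */
    if (amdgpu_dpm_set_df_cstate(adev, DF_CSTATE_DISALLOW))
            dev_warn(adev->dev, "Failed to disallow df cstate");
    /* ... perform the xgmi/wafl error injection ... */
    if (amdgpu_dpm_allow_xgmi_power_down(adev, true))
            dev_warn(adev->dev, "Failed to allow XGMI power down");
    if (amdgpu_dpm_set_df_cstate(adev, DF_CSTATE_ALLOW))
            dev_warn(adev->dev, "Failed to allow df cstate");

The bug was that the final call passed DF_CSTATE_DISALLOW, leaving DF
C-states disabled after the test completed.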



Re: [PATCH v4 07/14] drm/amdgpu: Register IOMMU topology notifier per device.

2021-01-19 Thread Andrey Grodzovsky


On 1/19/21 5:01 PM, Daniel Vetter wrote:

On Tue, Jan 19, 2021 at 10:22 PM Andrey Grodzovsky
 wrote:


On 1/19/21 8:45 AM, Daniel Vetter wrote:

On Tue, Jan 19, 2021 at 09:48:03AM +0100, Christian König wrote:

On 18.01.21 at 22:01, Andrey Grodzovsky wrote:

Handle all DMA IOMMU group related dependencies before the
group is removed.

Signed-off-by: Andrey Grodzovsky 
---
   drivers/gpu/drm/amd/amdgpu/amdgpu.h|  5 
   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 46 
++
   drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c   |  2 +-
   drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h   |  1 +
   drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 10 +++
   drivers/gpu/drm/amd/amdgpu/amdgpu_object.h |  2 ++
   6 files changed, 65 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 478a7d8..2953420 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -51,6 +51,7 @@
   #include 
   #include 
   #include 
+#include 
   #include 
   #include 
@@ -1041,6 +1042,10 @@ struct amdgpu_device {
   boolin_pci_err_recovery;
   struct pci_saved_state  *pci_state;
+
+ struct notifier_block nb;
+ struct blocking_notifier_head notifier;
+ struct list_head device_bo_list;
   };
   static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 45e23e3..e99f4f1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -70,6 +70,8 @@
   #include 
   #include 
+#include 
+
   MODULE_FIRMWARE("amdgpu/vega10_gpu_info.bin");
   MODULE_FIRMWARE("amdgpu/vega12_gpu_info.bin");
   MODULE_FIRMWARE("amdgpu/raven_gpu_info.bin");
@@ -3200,6 +3202,39 @@ static const struct attribute *amdgpu_dev_attributes[] = 
{
   };
+static int amdgpu_iommu_group_notifier(struct notifier_block *nb,
+ unsigned long action, void *data)
+{
+ struct amdgpu_device *adev = container_of(nb, struct amdgpu_device, nb);
+ struct amdgpu_bo *bo = NULL;
+
+ /*
+ * Following is a set of IOMMU group dependencies taken care of before
+ * device's IOMMU group is removed
+ */
+ if (action == IOMMU_GROUP_NOTIFY_DEL_DEVICE) {
+
+ spin_lock(&ttm_bo_glob.lru_lock);
+ list_for_each_entry(bo, &adev->device_bo_list, bo) {
+ if (bo->tbo.ttm)
+ ttm_tt_unpopulate(bo->tbo.bdev, bo->tbo.ttm);
+ }
+ spin_unlock(&ttm_bo_glob.lru_lock);

That approach won't work. ttm_tt_unpopulate() might sleep on an IOMMU lock.

You need to use a mutex here or even better make sure you can access the
device_bo_list without a lock in this moment.

I'd also be worried about the notifier mutex getting really badly in the
way.

Plus I'm worried why we even need this, it sounds a bit like papering over
the iommu subsystem. Assuming we clean up all our iommu mappings in our
device hotunplug/unload code, why do we still need to have an additional
iommu notifier on top, with all kinds of additional headaches? The iommu
shouldn't clean up before the devices in its group have cleaned up.

I think we need more info here on what the exact problem is first.
-Daniel


Originally I experienced the crash below on an IOMMU enabled device; it happens
post device removal from the PCI topology,
during shutdown of the user client holding the last reference to the drm device
file (X in my case).
The crash is because by the time I get to this point the struct
device->iommu_group pointer is already NULL,
since the IOMMU group for the device is unset during PCI removal. So this
contradicts what you said above,
that the iommu shouldn't clean up before the devices in its group have cleaned
up.
So instead of guessing the right place to put all the IOMMU related cleanups,
it makes sense to get a notification from the IOMMU subsystem, in the form of
the IOMMU_GROUP_NOTIFY_DEL_DEVICE event, and use that place to do all the
relevant cleanups.

Yeah that goes boom, but you shouldn't need this special iommu cleanup
handler. Making sure that all the dma-api mappings are gone needs to
be done as part of the device hotunplug, you can't delay that to the
last drm_device cleanup.

So most of the patch here, with pulling that out (it should even be outright
removed from the final release code), is good -- just not yet how
you call that new code. Probably these bits (aside from walking all
buffers and unpopulating the tt) should be done from the early_free
callback you're adding.

Also what I just realized: For normal unload you need to make sure the
hw is actually stopped first, before we unmap buffers. Otherwise
driver unload will likely result in wedged hw, probably not what you
want for debugging.
-Daniel


Since device removal from the IOMMU group, and this hook in particular,
takes place before the call to amdgpu_pci_remove, it essentially means
that for the IOMMU use case the entire amdgpu_device_fini_hw function
should be called here.
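
A hedged sketch of the direction Andrey describes -- running the full hw
teardown from the group notifier for the IOMMU case (the wiring is
illustrative; amdgpu_device_fini_hw() is the function named in the mail):

    static int amdgpu_iommu_group_notifier(struct notifier_block *nb,
                                           unsigned long action, void *data)
    {
            struct amdgpu_device *adev =
                    container_of(nb, struct amdgpu_device, nb);

            if (action == IOMMU_GROUP_NOTIFY_DEL_DEVICE)
                    amdgpu_device_fini_hw(adev);

            return NOTIFY_OK;
    }

As Daniel notes above, the hardware must actually be stopped before any
mappings are torn down for this ordering to be safe.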

Re: [PATCH] drm/amd/pm: make the error log more clear for fine grain tuning function

2021-01-19 Thread Wang, Kevin(Yang)



From: Du, Xiaojian 
Sent: Wednesday, January 20, 2021 11:48 AM
To: amd-gfx@lists.freedesktop.org 
Cc: Huang, Ray ; Quan, Evan ; Wang, 
Kevin(Yang) ; Lazar, Lijo ; Du, 
Xiaojian ; Du, Xiaojian 
Subject: [PATCH] drm/amd/pm: make the error log more clear for fine grain 
tuning function

From: Xiaojian Du 

This patch makes the error log clearer for the fine grain tuning
function; it covers Raven/Raven2/Picasso/Renoir/Vangogh.
The fine grain tuning function uses the sysfs file -- pp_od_clk_voltage,
but it is only allowed to access "pp_od_clk_voltage" when another sysfs
file -- power_dpm_force_performance_level -- is switched to "manual" mode.

Signed-off-by: Xiaojian Du 
---
 drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu10_hwmgr.c | 2 +-
 drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c | 3 ++-
 drivers/gpu/drm/amd/pm/swsmu/smu12/renoir_ppt.c  | 3 ++-
 3 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu10_hwmgr.c 
b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu10_hwmgr.c
index 88322781e447..ed05a30d1139 100644
--- a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu10_hwmgr.c
+++ b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu10_hwmgr.c
@@ -1487,7 +1487,7 @@ static int smu10_set_fine_grain_clk_vol(struct pp_hwmgr 
*hwmgr,
 }

 if (!smu10_data->fine_grain_enabled) {
-   pr_err("Fine grain not started\n");
+   pr_err("pp_od_clk_voltage is not accessible if 
power_dpm_force_performance_level is not in manual mode!\n");
[kevin]:
For the above code, the old message looks better to me; I prefer to keep the
current design.

Best Regards,
Kevin
 return -EINVAL;
 }

diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c
index 6d3c556dbe6b..a847fa66797e 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c
@@ -1452,7 +1452,8 @@ static int vangogh_od_edit_dpm_table(struct smu_context 
*smu, enum PP_OD_DPM_TAB
 struct smu_dpm_context *smu_dpm_ctx = &(smu->smu_dpm);

 if (!(smu_dpm_ctx->dpm_level == AMD_DPM_FORCED_LEVEL_MANUAL)) {
-   dev_warn(smu->adev->dev, "Fine grain is not enabled!\n");
+   dev_warn(smu->adev->dev,
+   "pp_od_clk_voltage is not accessible if 
power_dpm_force_performance_level is not in manual mode!\n");
 return -EINVAL;
 }

diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu12/renoir_ppt.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu12/renoir_ppt.c
index ab15570305f7..4ce8fb1d5ce9 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu12/renoir_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu12/renoir_ppt.c
@@ -350,7 +350,8 @@ static int renoir_od_edit_dpm_table(struct smu_context *smu,
 struct smu_dpm_context *smu_dpm_ctx = &(smu->smu_dpm);

 if (!(smu_dpm_ctx->dpm_level == AMD_DPM_FORCED_LEVEL_MANUAL)) {
-   dev_warn(smu->adev->dev, "Fine grain is not enabled!\n");
+   dev_warn(smu->adev->dev,
+   "pp_od_clk_voltage is not accessible if 
power_dpm_force_performance_level is not in manual mode!\n");
 return -EINVAL;
[Kevin]:
Just tell the user what's going on, not why.
Also, we'd better make a function to check manual mode, then embed it in every
sysfs node in amdgpu_pm.c, using a unified interface to return the result to
the user (a sketch of such a helper follows the patch).

Best Regards,
Kevin
 }

--
2.17.1
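
A minimal sketch of the helper Kevin proposes (the name and placement are
hypothetical; the dpm_level test is taken from the quoted code):

    /* Candidate unified check for the od_edit_dpm_table handlers. */
    static bool smu_dpm_is_manual_mode(struct smu_context *smu)
    {
            struct smu_dpm_context *smu_dpm_ctx = &(smu->smu_dpm);

            return smu_dpm_ctx->dpm_level == AMD_DPM_FORCED_LEVEL_MANUAL;
    }

Each pp_od_clk_voltage handler could then early-return -EINVAL with one
shared message when this returns false.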



[PATCH] drm/amd/pm: make the error log more clear for fine grain tuning function

2021-01-19 Thread Xiaojian Du
From: Xiaojian Du 

This patch makes the error log clearer for the fine grain tuning
function; it covers Raven/Raven2/Picasso/Renoir/Vangogh.
The fine grain tuning function uses the sysfs file -- pp_od_clk_voltage,
but it is only allowed to access "pp_od_clk_voltage" when another sysfs
file -- power_dpm_force_performance_level -- is switched to "manual" mode.

Signed-off-by: Xiaojian Du 
---
 drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu10_hwmgr.c | 2 +-
 drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c | 3 ++-
 drivers/gpu/drm/amd/pm/swsmu/smu12/renoir_ppt.c  | 3 ++-
 3 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu10_hwmgr.c 
b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu10_hwmgr.c
index 88322781e447..ed05a30d1139 100644
--- a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu10_hwmgr.c
+++ b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu10_hwmgr.c
@@ -1487,7 +1487,7 @@ static int smu10_set_fine_grain_clk_vol(struct pp_hwmgr 
*hwmgr,
}
 
if (!smu10_data->fine_grain_enabled) {
-   pr_err("Fine grain not started\n");
+   pr_err("pp_od_clk_voltage is not accessible if 
power_dpm_force_performance_level is not in manual mode!\n");
return -EINVAL;
}
 
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c
index 6d3c556dbe6b..a847fa66797e 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c
@@ -1452,7 +1452,8 @@ static int vangogh_od_edit_dpm_table(struct smu_context 
*smu, enum PP_OD_DPM_TAB
struct smu_dpm_context *smu_dpm_ctx = &(smu->smu_dpm);
 
if (!(smu_dpm_ctx->dpm_level == AMD_DPM_FORCED_LEVEL_MANUAL)) {
-   dev_warn(smu->adev->dev, "Fine grain is not enabled!\n");
+   dev_warn(smu->adev->dev,
+   "pp_od_clk_voltage is not accessible if 
power_dpm_force_performance_level is not in manual mode!\n");
return -EINVAL;
}
 
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu12/renoir_ppt.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu12/renoir_ppt.c
index ab15570305f7..4ce8fb1d5ce9 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu12/renoir_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu12/renoir_ppt.c
@@ -350,7 +350,8 @@ static int renoir_od_edit_dpm_table(struct smu_context *smu,
struct smu_dpm_context *smu_dpm_ctx = &(smu->smu_dpm);
 
if (!(smu_dpm_ctx->dpm_level == AMD_DPM_FORCED_LEVEL_MANUAL)) {
-   dev_warn(smu->adev->dev, "Fine grain is not enabled!\n");
+   dev_warn(smu->adev->dev,
+   "pp_od_clk_voltage is not accessible if 
power_dpm_force_performance_level is not in manual mode!\n");
return -EINVAL;
}
 
-- 
2.17.1



Re: [PATCH] drm/amdgpu: Add RLC_PG_DELAY_3 for Vangogh

2021-01-19 Thread Huang Rui
On Wed, Jan 20, 2021 at 11:09:11AM +0800, Su, Jinzhou (Joe) wrote:
> Driver should enable the CGPG feature for RLC in safe mode to
> prevent any misalignment or conflict in the middle of any power
> feature entry/exit sequence.
> Achieved by setting RLC_PG_CNTL.GFX_POWER_GATING_ENABLE = 0x1,
> and RLC_PG_DELAY_3.CGCG_ACTIVE_BEFORE_CGPG to the desired CGPG
> hysteresis value in refclk count.
> 
> Signed-off-by: Jinzhou Su 

Reviewed-by: Huang Rui 

> ---
>  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 15 +++
>  1 file changed, 15 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c 
> b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> index c4314e25f560..dd102cc2516a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> @@ -120,6 +120,7 @@
>  #define mmSPI_CONFIG_CNTL_Vangogh_BASE_IDX   1
>  #define mmGCR_GENERAL_CNTL_Vangogh   0x1580
>  #define mmGCR_GENERAL_CNTL_Vangogh_BASE_IDX  0
> +#define RLC_PG_DELAY_3__CGCG_ACTIVE_BEFORE_CGPG_MASK_Vangogh   0xL
>  
>  #define mmCP_HYP_PFP_UCODE_ADDR  0x5814
>  #define mmCP_HYP_PFP_UCODE_ADDR_BASE_IDX 1
> @@ -7829,6 +7830,20 @@ static void gfx_v10_cntl_power_gating(struct 
> amdgpu_device *adev, bool enable)
>   data &= ~RLC_PG_CNTL__GFX_POWER_GATING_ENABLE_MASK;
>  
>   WREG32_SOC15(GC, 0, mmRLC_PG_CNTL, data);
> +
> + /*
> +  * CGPG enablement required and the register to program the hysteresis 
> value
> +  * RLC_PG_DELAY_3.CGCG_ACTIVE_BEFORE_CGPG to the desired CGPG 
> hysteresis value
> +  * in refclk count. Note that RLC FW is modified to take 16 bits from
> +  * RLC_PG_DELAY_3[15:0] as the hysteresis instead of just 8 bits.
> +  *
> +  * The recommendation from RLC team is setting RLC_PG_DELAY_3 to 
> 200us(0x4E20)
> +  * as part of CGPG enablement starting point.
> +  */
> + if (enable && (adev->pg_flags & AMD_PG_SUPPORT_GFX_PG) && 
> adev->asic_type == CHIP_VANGOGH) {
> + data = 0x4E20 & 
> RLC_PG_DELAY_3__CGCG_ACTIVE_BEFORE_CGPG_MASK_Vangogh;
> + WREG32_SOC15(GC, 0, mmRLC_PG_DELAY_3, data);
> + }
>  }
>  
>  static void gfx_v10_cntl_pg(struct amdgpu_device *adev, bool enable)
> -- 
> 2.17.1
> 


Re: [PATCH] drm/amdgpu: update mmhub mgcg&ls for mmhub_v2_3

2021-01-19 Thread Huang Rui
On Wed, Jan 20, 2021 at 11:11:26AM +0800, Liu, Aaron wrote:
> This patch has been verified on Van Gogh.
> 

Thanks.

Reviewed-by: Huang Rui 

> --
> Best Regards
> Aaron Liu
> 
> > -----Original Message-----
> > From: Huang, Ray 
> > Sent: Wednesday, January 20, 2021 10:06 AM
> > To: Liu, Aaron 
> > Cc: amd-gfx@lists.freedesktop.org; Deucher, Alexander
> > 
> > Subject: Re: [PATCH] drm/amdgpu: update mmhub mgcg&ls for
> > mmhub_v2_3
> > 
> > On Wed, Jan 20, 2021 at 09:57:32AM +0800, Liu, Aaron wrote:
> > > Starting from vangogh, the ATCL2 and DAGB0 registers related to
> > > mgcg/ls have changed.
> > >
> > > For MGCG:
> > > Replace mmMM_ATC_L2_MISC_CG with mmMM_ATC_L2_CGTT_CLK_CTRL.
> > >
> > > For MGLS:
> > > Replace mmMM_ATC_L2_MISC_CG with mmMM_ATC_L2_CGTT_CLK_CTRL.
> > > Add DAGB0_(WR/RD)_CGTT_CLK_CTRL registers.
> > >
> > > Signed-off-by: Aaron Liu 
> > 
> > Could you double verify it on vangogh as well?
> > 
> > After that, patch is
> > 
> > Acked-by: Huang Rui 
> > 
> > > [snip]

RE: [PATCH] drm/amdgpu: update mmhub mgcg&ls for mmhub_v2_3

2021-01-19 Thread Liu, Aaron

This patch has been verified on Van Gogh.

--
Best Regards
Aaron Liu

> -----Original Message-----
> From: Huang, Ray 
> Sent: Wednesday, January 20, 2021 10:06 AM
> To: Liu, Aaron 
> Cc: amd-gfx@lists.freedesktop.org; Deucher, Alexander
> 
> Subject: Re: [PATCH] drm/amdgpu: update mmhub mgcg&ls for
> mmhub_v2_3
> 
> On Wed, Jan 20, 2021 at 09:57:32AM +0800, Liu, Aaron wrote:
> > Starting from vangogh, the ATCL2 and DAGB0 registers related to
> > mgcg/ls have changed.
> >
> > For MGCG:
> > Replace mmMM_ATC_L2_MISC_CG with mmMM_ATC_L2_CGTT_CLK_CTRL.
> >
> > For MGLS:
> > Replace mmMM_ATC_L2_MISC_CG with mmMM_ATC_L2_CGTT_CLK_CTRL.
> > Add DAGB0_(WR/RD)_CGTT_CLK_CTRL registers.
> >
> > Signed-off-by: Aaron Liu 
> 
> Could you double verify it on vangogh as well?
> 
> After that, patch is
> 
> Acked-by: Huang Rui 
> 
> > [snip]

[PATCH] drm/amdgpu: Add RLC_PG_DELAY_3 for Vangogh

2021-01-19 Thread Jinzhou Su
Driver should enable the CGPG feature for RLC in safe mode to
prevent any misalignment or conflict in the middle of any power
feature entry/exit sequence.
Achieved by setting RLC_PG_CNTL.GFX_POWER_GATING_ENABLE = 0x1,
and RLC_PG_DELAY_3.CGCG_ACTIVE_BEFORE_CGPG to the desired CGPG
hysteresis value in refclk count.

Signed-off-by: Jinzhou Su 
---
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index c4314e25f560..dd102cc2516a 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -120,6 +120,7 @@
 #define mmSPI_CONFIG_CNTL_Vangogh_BASE_IDX   1
 #define mmGCR_GENERAL_CNTL_Vangogh   0x1580
 #define mmGCR_GENERAL_CNTL_Vangogh_BASE_IDX  0
+#define RLC_PG_DELAY_3__CGCG_ACTIVE_BEFORE_CGPG_MASK_Vangogh   0xL
 
 #define mmCP_HYP_PFP_UCODE_ADDR0x5814
 #define mmCP_HYP_PFP_UCODE_ADDR_BASE_IDX   1
@@ -7829,6 +7830,20 @@ static void gfx_v10_cntl_power_gating(struct 
amdgpu_device *adev, bool enable)
data &= ~RLC_PG_CNTL__GFX_POWER_GATING_ENABLE_MASK;
 
WREG32_SOC15(GC, 0, mmRLC_PG_CNTL, data);
+
+   /*
+* CGPG enablement required and the register to program the hysteresis 
value
+* RLC_PG_DELAY_3.CGCG_ACTIVE_BEFORE_CGPG to the desired CGPG 
hysteresis value
+* in refclk count. Note that RLC FW is modified to take 16 bits from
+* RLC_PG_DELAY_3[15:0] as the hysteresis instead of just 8 bits.
+*
+* The recommendation from RLC team is setting RLC_PG_DELAY_3 to 
200us(0x4E20)
+* as part of CGPG enablement starting point.
+*/
+   if (enable && (adev->pg_flags & AMD_PG_SUPPORT_GFX_PG) && 
adev->asic_type == CHIP_VANGOGH) {
+   data = 0x4E20 & 
RLC_PG_DELAY_3__CGCG_ACTIVE_BEFORE_CGPG_MASK_Vangogh;
+   WREG32_SOC15(GC, 0, mmRLC_PG_DELAY_3, data);
+   }
 }
 
 static void gfx_v10_cntl_pg(struct amdgpu_device *adev, bool enable)
-- 
2.17.1
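
As a sanity check on the magic number: 0x4E20 is 20000 decimal, and
20000 refclk ticks / 100 MHz = 200 us, matching the 200us(0x4E20)
recommendation in the comment (assuming the 100 MHz reference clock that
figure implies).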



Re: [PATCH] drm/amdgpu: update mmhub mgcg&ls for mmhub_v2_3

2021-01-19 Thread Huang Rui
On Wed, Jan 20, 2021 at 09:57:32AM +0800, Liu, Aaron wrote:
> Starting from vangogh, the ATCL2 and DAGB0 registers related
> to mgcg/ls have changed.
> 
> For MGCG:
> Replace mmMM_ATC_L2_MISC_CG with mmMM_ATC_L2_CGTT_CLK_CTRL.
> 
> For MGLS:
> Replace mmMM_ATC_L2_MISC_CG with mmMM_ATC_L2_CGTT_CLK_CTRL.
> Add DAGB0_(WR/RD)_CGTT_CLK_CTRL registers.
> 
> Signed-off-by: Aaron Liu 

Could you double verify it on vangogh as well?

After that, patch is

Acked-by: Huang Rui 

> [snip]

[PATCH] drm/amdgpu: update mmhub mgcg&ls for mmhub_v2_3

2021-01-19 Thread Aaron Liu
Starting from vangogh, the ATCL2 and DAGB0 registers related
to mgcg/ls have changed.

For MGCG:
Replace mmMM_ATC_L2_MISC_CG with mmMM_ATC_L2_CGTT_CLK_CTRL.

For MGLS:
Replace mmMM_ATC_L2_MISC_CG with mmMM_ATC_L2_CGTT_CLK_CTRL.
Add DAGB0_(WR/RD)_CGTT_CLK_CTRL registers.

Signed-off-by: Aaron Liu 
---
 drivers/gpu/drm/amd/amdgpu/mmhub_v2_3.c | 84 ++---
 1 file changed, 61 insertions(+), 23 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/mmhub_v2_3.c 
b/drivers/gpu/drm/amd/amdgpu/mmhub_v2_3.c
index 92f02883daa3..8f2edba5bc9e 100644
--- a/drivers/gpu/drm/amd/amdgpu/mmhub_v2_3.c
+++ b/drivers/gpu/drm/amd/amdgpu/mmhub_v2_3.c
@@ -492,12 +492,11 @@ mmhub_v2_3_update_medium_grain_clock_gating(struct 
amdgpu_device *adev,
 {
uint32_t def, data, def1, data1;
 
-   def  = data  = RREG32_SOC15(MMHUB, 0, mmMM_ATC_L2_MISC_CG);
+   def  = data  = RREG32_SOC15(MMHUB, 0, mmMM_ATC_L2_CGTT_CLK_CTRL);
def1 = data1 = RREG32_SOC15(MMHUB, 0, mmDAGB0_CNTL_MISC2);
 
if (enable && (adev->cg_flags & AMD_CG_SUPPORT_MC_MGCG)) {
-   data |= MM_ATC_L2_MISC_CG__ENABLE_MASK;
-
+   data &= ~MM_ATC_L2_CGTT_CLK_CTRL__SOFT_OVERRIDE_MASK;
data1 &= ~(DAGB0_CNTL_MISC2__DISABLE_WRREQ_CG_MASK |
   DAGB0_CNTL_MISC2__DISABLE_WRRET_CG_MASK |
   DAGB0_CNTL_MISC2__DISABLE_RDREQ_CG_MASK |
@@ -506,8 +505,7 @@ mmhub_v2_3_update_medium_grain_clock_gating(struct 
amdgpu_device *adev,
   DAGB0_CNTL_MISC2__DISABLE_TLBRD_CG_MASK);
 
} else {
-   data &= ~MM_ATC_L2_MISC_CG__ENABLE_MASK;
-
+   data |= MM_ATC_L2_CGTT_CLK_CTRL__SOFT_OVERRIDE_MASK;
data1 |= (DAGB0_CNTL_MISC2__DISABLE_WRREQ_CG_MASK |
  DAGB0_CNTL_MISC2__DISABLE_WRRET_CG_MASK |
  DAGB0_CNTL_MISC2__DISABLE_RDREQ_CG_MASK |
@@ -517,7 +515,7 @@ mmhub_v2_3_update_medium_grain_clock_gating(struct 
amdgpu_device *adev,
}
 
if (def != data)
-   WREG32_SOC15(MMHUB, 0, mmMM_ATC_L2_MISC_CG, data);
+   WREG32_SOC15(MMHUB, 0, mmMM_ATC_L2_CGTT_CLK_CTRL, data);
if (def1 != data1)
WREG32_SOC15(MMHUB, 0, mmDAGB0_CNTL_MISC2, data1);
 }
@@ -526,17 +524,44 @@ static void
 mmhub_v2_3_update_medium_grain_light_sleep(struct amdgpu_device *adev,
   bool enable)
 {
-   uint32_t def, data;
-
-   def  = data  = RREG32_SOC15(MMHUB, 0, mmMM_ATC_L2_MISC_CG);
-
-   if (enable && (adev->cg_flags & AMD_CG_SUPPORT_MC_LS))
-   data |= MM_ATC_L2_MISC_CG__MEM_LS_ENABLE_MASK;
-   else
-   data &= ~MM_ATC_L2_MISC_CG__MEM_LS_ENABLE_MASK;
+   uint32_t def, data, def1, data1, def2, data2;
+
+   def  = data  = RREG32_SOC15(MMHUB, 0, mmMM_ATC_L2_CGTT_CLK_CTRL);
+   def1 = data1 = RREG32_SOC15(MMHUB, 0, mmDAGB0_WR_CGTT_CLK_CTRL);
+   def2 = data2 = RREG32_SOC15(MMHUB, 0, mmDAGB0_RD_CGTT_CLK_CTRL);
+
+   if (enable && (adev->cg_flags & AMD_CG_SUPPORT_MC_LS)) {
+   data &= ~MM_ATC_L2_CGTT_CLK_CTRL__MGLS_OVERRIDE_MASK;
+   data1 &= ~(DAGB0_WR_CGTT_CLK_CTRL__LS_OVERRIDE_MASK |
+   DAGB0_WR_CGTT_CLK_CTRL__LS_OVERRIDE_WRITE_MASK |
+   DAGB0_WR_CGTT_CLK_CTRL__LS_OVERRIDE_READ_MASK |
+   DAGB0_WR_CGTT_CLK_CTRL__LS_OVERRIDE_RETURN_MASK |
+   DAGB0_WR_CGTT_CLK_CTRL__LS_OVERRIDE_REGISTER_MASK);
+   data2 &= ~(DAGB0_RD_CGTT_CLK_CTRL__LS_OVERRIDE_MASK |
+   DAGB0_RD_CGTT_CLK_CTRL__LS_OVERRIDE_WRITE_MASK |
+   DAGB0_RD_CGTT_CLK_CTRL__LS_OVERRIDE_READ_MASK |
+   DAGB0_RD_CGTT_CLK_CTRL__LS_OVERRIDE_RETURN_MASK |
+   DAGB0_RD_CGTT_CLK_CTRL__LS_OVERRIDE_REGISTER_MASK);
+   } else {
+   data |= MM_ATC_L2_CGTT_CLK_CTRL__MGLS_OVERRIDE_MASK;
+   data1 |= (DAGB0_WR_CGTT_CLK_CTRL__LS_OVERRIDE_MASK |
+   DAGB0_WR_CGTT_CLK_CTRL__LS_OVERRIDE_WRITE_MASK |
+   DAGB0_WR_CGTT_CLK_CTRL__LS_OVERRIDE_READ_MASK |
+   DAGB0_WR_CGTT_CLK_CTRL__LS_OVERRIDE_RETURN_MASK |
+   DAGB0_WR_CGTT_CLK_CTRL__LS_OVERRIDE_REGISTER_MASK);
+   data2 |= (DAGB0_RD_CGTT_CLK_CTRL__LS_OVERRIDE_MASK |
+   DAGB0_RD_CGTT_CLK_CTRL__LS_OVERRIDE_WRITE_MASK |
+   DAGB0_RD_CGTT_CLK_CTRL__LS_OVERRIDE_READ_MASK |
+   DAGB0_RD_CGTT_CLK_CTRL__LS_OVERRIDE_RETURN_MASK |
+   DAGB0_RD_CGTT_CLK_CTRL__LS_OVERRIDE_REGISTER_MASK);
+   }
 
if (def != data)
-   WREG32_SOC15(MMHUB, 0, mmMM_ATC_L2_MISC_CG, data);
+   WREG32_SOC15(MMHUB, 0, mmMM_ATC_L2_CGTT_CLK_CTRL, data);
+   if (def1 != data1)
+   WREG32_SOC15(MMHUB, 0, mmDAGB0_WR_CGTT_CLK_CTRL, data1);
+   i

RE: [PATCH] drm/amd/display: Implement functions to let DC allocate GPU memory

2021-01-19 Thread Chen, Guchun

+da = kzalloc(sizeof(struct dal_allocation), GFP_KERNEL);

This looks like a coding style issue. It's better to change it to
kzalloc(sizeof(*da), ...).

https://www.kernel.org/doc/html/latest/process/coding-style.html#allocating-memory

Regards,
Guchun
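
For reference, a minimal before/after of the requested style (illustrative):

    da = kzalloc(sizeof(struct dal_allocation), GFP_KERNEL); /* as posted */
    da = kzalloc(sizeof(*da), GFP_KERNEL); /* preferred form */

The preferred form stays correct even if the type of da ever changes.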

-----Original Message-----
From: amd-gfx  On Behalf Of Bhawanpreet 
Lakha
Sent: Wednesday, January 20, 2021 4:41 AM
To: Deucher, Alexander 
Cc: Wentland, Harry ; amd-gfx@lists.freedesktop.org
Subject: [PATCH] drm/amd/display: Implement functions to let DC allocate GPU 
memory

From: Harry Wentland 

[Why]
DC needs to communicate with PM FW through GPU memory. In order to do so we 
need to be able to allocate memory from within DC.

[How]
Call amdgpu_bo_create_kernel to allocate GPU memory and use a list in 
amdgpu_display_manager to track our allocations so we can clean them up later.

Signed-off-by: Harry Wentland 
---
 .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c |  2 +  
.../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h |  9 +  
.../amd/display/amdgpu_dm/amdgpu_dm_helpers.c | 40 +--
 3 files changed, 48 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index e490fc2486f7..83ec92a69cba 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -1017,6 +1017,8 @@ static int amdgpu_dm_init(struct amdgpu_device *adev)
 
init_data.soc_bounding_box = adev->dm.soc_bounding_box;
 
+   INIT_LIST_HEAD(&adev->dm.da_list);
+
/* Display Core create. */
adev->dm.dc = dc_create(&init_data);
 
diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h
index 38bc0f88b29c..49137924a855 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h
@@ -130,6 +130,13 @@ struct amdgpu_dm_backlight_caps {
bool aux_support;
 };
 
+struct dal_allocation {
+   struct list_head list;
+   struct amdgpu_bo *bo;
+   void *cpu_ptr;
+   u64 gpu_addr;
+};
+
 /**
  * struct amdgpu_display_manager - Central amdgpu display manager device
  *
@@ -350,6 +357,8 @@ struct amdgpu_display_manager {
 */
struct amdgpu_encoder mst_encoders[AMDGPU_DM_MAX_CRTC];
bool force_timing_sync;
+
+   struct list_head da_list;
 };
 
 enum dsc_clock_force_state {
diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_helpers.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_helpers.c
index 3244a6ea7a65..5dc426e6e785 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_helpers.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_helpers.c
@@ -652,8 +652,31 @@ void *dm_helpers_allocate_gpu_mem(
size_t size,
long long *addr)
 {
-   // TODO
-   return NULL;
+   struct amdgpu_device *adev = ctx->driver_context;
+   struct dal_allocation *da;
+   u32 domain = (type == DC_MEM_ALLOC_TYPE_GART) ?
+   AMDGPU_GEM_DOMAIN_GTT : AMDGPU_GEM_DOMAIN_VRAM;
+   int ret;
+
+   da = kzalloc(sizeof(struct dal_allocation), GFP_KERNEL);
+   if (!da)
+   return NULL;
+
+   ret = amdgpu_bo_create_kernel(adev, size, PAGE_SIZE,
+ domain, &da->bo,
+ &da->gpu_addr, &da->cpu_ptr);
+
+   *addr = da->gpu_addr;
+
+   if (ret) {
+   kfree(da);
+   return NULL;
+   }
+
+   /* add da to list in dm */
+   list_add(&da->list, &adev->dm.da_list);
+
+   return da->cpu_ptr;
 }
 
 void dm_helpers_free_gpu_mem(
@@ -661,5 +684,16 @@ void dm_helpers_free_gpu_mem(
enum dc_gpu_mem_alloc_type type,
void *pvMem)
 {
-   // TODO
+   struct amdgpu_device *adev = ctx->driver_context;
+   struct dal_allocation *da;
+
+   /* walk the da list in DM */
+   list_for_each_entry(da, &adev->dm.da_list, list) {
+   if (pvMem == da->cpu_ptr) {
+   amdgpu_bo_free_kernel(&da->bo, &da->gpu_addr, 
&da->cpu_ptr);
+   list_del(&da->list);
+   kfree(da);
+   break;
+   }
+   }
 }
--
2.25.1
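
A hedged usage sketch of the two helpers added above (the buffer size and the
ctx variable are illustrative; the first parameter of the free helper is
assumed to be the same dc_context used for allocation):

    long long gpu_addr;
    void *cpu_ptr = dm_helpers_allocate_gpu_mem(ctx, DC_MEM_ALLOC_TYPE_GART,
                                                4096, &gpu_addr);
    if (cpu_ptr) {
            /* ... hand gpu_addr to PM FW, write through cpu_ptr ... */
            dm_helpers_free_gpu_mem(ctx, DC_MEM_ALLOC_TYPE_GART, cpu_ptr);
    }

Allocation failures return NULL, and the free path walks dm.da_list to find
the matching entry, so callers only need to keep the CPU pointer.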


[PATCH AUTOSEL 5.4 21/26] drm/amd/display: Fix to be able to stop crc calculation

2021-01-19 Thread Sasha Levin
From: Wayne Lin 

[ Upstream commit 02ce73b01e09e388614b22b7ebc71debf4a588f0 ]

[Why]
We found that when we try to disable CRC calculation,
crc generation is still enabled. The main reason is
that dc_stream_configure_crc() never gets
called when the source is AMDGPU_DM_PIPE_CRC_SOURCE_NONE.

[How]
Add a check so that when the source is
AMDGPU_DM_PIPE_CRC_SOURCE_NONE, we also call
dc_stream_configure_crc() to disable crc calculation.
Also, clean up the crc window when disabling crc calculation.

Signed-off-by: Wayne Lin 
Reviewed-by: Nicholas Kazlauskas 
Acked-by: Qingqing Zhuo 
Signed-off-by: Alex Deucher 
Signed-off-by: Sasha Levin 
---
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crc.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crc.c
index a549c7c717ddc..f0b001b3af578 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crc.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crc.c
@@ -113,7 +113,7 @@ int amdgpu_dm_crtc_configure_crc_source(struct drm_crtc 
*crtc,
mutex_lock(&adev->dm.dc_lock);
 
/* Enable CRTC CRC generation if necessary. */
-   if (dm_is_crc_source_crtc(source)) {
+   if (dm_is_crc_source_crtc(source) || source == 
AMDGPU_DM_PIPE_CRC_SOURCE_NONE) {
if (!dc_stream_configure_crc(stream_state->ctx->dc,
 stream_state, enable, enable)) {
ret = -EINVAL;
-- 
2.27.0

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH AUTOSEL 5.4 20/26] drm/amdgpu/psp: fix psp gfx ctrl cmds

2021-01-19 Thread Sasha Levin
From: Victor Zhao 

[ Upstream commit f14a5c34d143f6627f0be70c0de1d962f3a6ff1c ]

The PSP GFX_CTRL_CMD_ID_CONSUME_CMD value differs between Windows and
Linux; according to the PSP team, the Linux command values are not correct.

v2: only correct GFX_CTRL_CMD_ID_CONSUME_CMD.

Signed-off-by: Victor Zhao 
Reviewed-by: Emily.Deng 
Signed-off-by: Alex Deucher 
Signed-off-by: Sasha Levin 
---
 drivers/gpu/drm/amd/amdgpu/psp_gfx_if.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/psp_gfx_if.h 
b/drivers/gpu/drm/amd/amdgpu/psp_gfx_if.h
index 74a9fe8e0cfb9..8c54f0be51bab 100644
--- a/drivers/gpu/drm/amd/amdgpu/psp_gfx_if.h
+++ b/drivers/gpu/drm/amd/amdgpu/psp_gfx_if.h
@@ -44,7 +44,7 @@ enum psp_gfx_crtl_cmd_id
 GFX_CTRL_CMD_ID_DISABLE_INT = 0x0006,   /* disable PSP-to-Gfx 
interrupt */
 GFX_CTRL_CMD_ID_MODE1_RST   = 0x0007,   /* trigger the Mode 1 
reset */
 GFX_CTRL_CMD_ID_GBR_IH_SET  = 0x0008,   /* set Gbr IH_RB_CNTL 
registers */
-GFX_CTRL_CMD_ID_CONSUME_CMD = 0x000A,   /* send interrupt to psp 
for updating write pointer of vf */
+GFX_CTRL_CMD_ID_CONSUME_CMD = 0x0009,   /* send interrupt to psp 
for updating write pointer of vf */
 GFX_CTRL_CMD_ID_DESTROY_GPCOM_RING = 0x000C, /* destroy GPCOM ring */
 
 GFX_CTRL_CMD_ID_MAX = 0x000F,   /* max command ID */
-- 
2.27.0

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH AUTOSEL 5.10 35/45] drm/amd/display: Fix to be able to stop crc calculation

2021-01-19 Thread Sasha Levin
From: Wayne Lin 

[ Upstream commit 02ce73b01e09e388614b22b7ebc71debf4a588f0 ]

[Why]
Found that when we try to disable CRC calculation, CRC
generation is still enabled. The main reason is that
dc_stream_configure_crc() never gets called when the
source is AMDGPU_DM_PIPE_CRC_SOURCE_NONE.

[How]
Add a check so that when the source is
AMDGPU_DM_PIPE_CRC_SOURCE_NONE, we also call
dc_stream_configure_crc() to disable CRC calculation.
Also, clean up the CRC window when disabling CRC calculation.

Signed-off-by: Wayne Lin 
Reviewed-by: Nicholas Kazlauskas 
Acked-by: Qingqing Zhuo 
Signed-off-by: Alex Deucher 
Signed-off-by: Sasha Levin 
---
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crc.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crc.c
index d0699e98db929..e00a30e7d2529 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crc.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crc.c
@@ -113,7 +113,7 @@ int amdgpu_dm_crtc_configure_crc_source(struct drm_crtc 
*crtc,
mutex_lock(&adev->dm.dc_lock);
 
/* Enable CRTC CRC generation if necessary. */
-   if (dm_is_crc_source_crtc(source)) {
+   if (dm_is_crc_source_crtc(source) || source == 
AMDGPU_DM_PIPE_CRC_SOURCE_NONE) {
if (!dc_stream_configure_crc(stream_state->ctx->dc,
 stream_state, enable, enable)) {
ret = -EINVAL;
-- 
2.27.0

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH AUTOSEL 5.10 32/45] drm/amdgpu/psp: fix psp gfx ctrl cmds

2021-01-19 Thread Sasha Levin
From: Victor Zhao 

[ Upstream commit f14a5c34d143f6627f0be70c0de1d962f3a6ff1c ]

The PSP GFX_CTRL_CMD_ID_CONSUME_CMD value differs between Windows and
Linux; according to the PSP team, the Linux command values are not correct.

v2: only correct GFX_CTRL_CMD_ID_CONSUME_CMD.

Signed-off-by: Victor Zhao 
Reviewed-by: Emily.Deng 
Signed-off-by: Alex Deucher 
Signed-off-by: Sasha Levin 
---
 drivers/gpu/drm/amd/amdgpu/psp_gfx_if.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/psp_gfx_if.h 
b/drivers/gpu/drm/amd/amdgpu/psp_gfx_if.h
index 4137dc710aafd..7ad0434be293b 100644
--- a/drivers/gpu/drm/amd/amdgpu/psp_gfx_if.h
+++ b/drivers/gpu/drm/amd/amdgpu/psp_gfx_if.h
@@ -47,7 +47,7 @@ enum psp_gfx_crtl_cmd_id
 GFX_CTRL_CMD_ID_DISABLE_INT = 0x0006,   /* disable PSP-to-Gfx 
interrupt */
 GFX_CTRL_CMD_ID_MODE1_RST   = 0x0007,   /* trigger the Mode 1 
reset */
 GFX_CTRL_CMD_ID_GBR_IH_SET  = 0x0008,   /* set Gbr IH_RB_CNTL 
registers */
-GFX_CTRL_CMD_ID_CONSUME_CMD = 0x000A,   /* send interrupt to psp 
for updating write pointer of vf */
+GFX_CTRL_CMD_ID_CONSUME_CMD = 0x0009,   /* send interrupt to psp 
for updating write pointer of vf */
 GFX_CTRL_CMD_ID_DESTROY_GPCOM_RING = 0x000C, /* destroy GPCOM ring */
 
 GFX_CTRL_CMD_ID_MAX = 0x000F,   /* max command ID */
-- 
2.27.0

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH AUTOSEL 5.10 33/45] drm/amd/display: disable dcn10 pipe split by default

2021-01-19 Thread Sasha Levin
From: "Li, Roman" 

[ Upstream commit 9d03bb102028b4a3f4a64d6069b219e2e1c1f306 ]

[Why]
The initial purpose of dcn10 pipe split was to support high-bandwidth
modes that require a dispclk greater than the max dispclk. Power
measurement data from initial bring-up showed lower power consumption
for the DCN block with pipe split enabled, which is likely why it was
enabled by default. However, battery life measurements on some
Chromebooks show that battery life is longer with pipe split disabled.

[How]
Disable pipe split by default. Pipe split can still be enabled when
the required dispclk is greater than the max dispclk.

Tested-by: Daniel Wheeler 
Signed-off-by: Hersen Wu 
Signed-off-by: Roman Li 
Reviewed-by: Roman Li 
Signed-off-by: Alex Deucher 
Signed-off-by: Sasha Levin 
---
 drivers/gpu/drm/amd/display/dc/dcn10/dcn10_resource.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/dc/dcn10/dcn10_resource.c 
b/drivers/gpu/drm/amd/display/dc/dcn10/dcn10_resource.c
index a78712caf1244..0524d6f1adba6 100644
--- a/drivers/gpu/drm/amd/display/dc/dcn10/dcn10_resource.c
+++ b/drivers/gpu/drm/amd/display/dc/dcn10/dcn10_resource.c
@@ -608,8 +608,8 @@ static const struct dc_debug_options debug_defaults_drv = {
.disable_pplib_clock_request = false,
.disable_pplib_wm_range = false,
.pplib_wm_report_mode = WM_REPORT_DEFAULT,
-   .pipe_split_policy = MPC_SPLIT_DYNAMIC,
-   .force_single_disp_pipe_split = true,
+   .pipe_split_policy = MPC_SPLIT_AVOID,
+   .force_single_disp_pipe_split = false,
.disable_dcc = DCC_ENABLE,
.voltage_align_fclk = true,
.disable_stereo_support = true,
-- 
2.27.0

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [drm:dm_plane_helper_prepare_fb [amdgpu]] *ERROR* Failed to pin framebuffer with error -12

2021-01-19 Thread Mikhail Gavrilov
On Fri, 15 Jan 2021 at 03:43, Mikhail Gavrilov
 wrote:
>

In rc4, the number of warnings has dropped dramatically.
There are no more "kasan slab-out-of-bounds" errors and no "DMA-API
device driver failed to check map error".
But "sleeping function called from invalid context at
include/linux/sched/mm.h:196" and "BUG: key 88810b0d9148 has not
been registered!" are still not fixed.
The second issue is Navi-specific, because it started happening on the
5.10 kernel after replacing a Radeon VII with a 6900 XT.

1.
BUG: sleeping function called from invalid context at
include/linux/sched/mm.h:196
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 500, name: systemd-udevd
1 lock held by systemd-udevd/500:
 #0: 888107690258 (&dev->mutex){}-{3:3}, at:
device_driver_attach+0xa3/0x250
CPU: 9 PID: 500 Comm: systemd-udevd Not tainted
5.11.0-0.rc4.129.fc34.x86_64+debug #1
Hardware name: System manufacturer System Product Name/ROG STRIX
X570-I GAMING, BIOS 2802 10/21/2020
Call Trace:
 dump_stack+0xae/0xe5
 ___might_sleep.cold+0x150/0x17e
 ? dcn30_clock_source_create+0x53/0x110 [amdgpu]
 kmem_cache_alloc_trace+0x23f/0x270
 dcn30_clock_source_create+0x53/0x110 [amdgpu]
 dcn30_create_resource_pool+0x998/0x4890 [amdgpu]
 ? dcn30_calc_max_scaled_time+0x40/0x40 [amdgpu]
 ? lock_is_held_type+0xb8/0xf0
 ? unpoison_range+0x3a/0x60
 ? kasan_kmalloc.constprop.0+0x84/0xa0
 ? dc_create_resource_pool+0x26e/0x5e0 [amdgpu]
 dc_create_resource_pool+0x26e/0x5e0 [amdgpu]
 dc_create+0x636/0x1bc0 [amdgpu]
 ? lock_acquire+0x2dd/0x7a0
 ? sched_clock+0x5/0x10
 ? sched_clock_cpu+0x18/0x170
 ? find_held_lock+0x33/0x110
 ? dc_create_state+0xa0/0xa0 [amdgpu]
 ? lock_downgrade+0x6b0/0x6b0
 ? module_assert_mutex_or_preempt+0x3e/0x70
 ? lock_is_held_type+0xb8/0xf0
 ? unpoison_range+0x3a/0x60
 ? kasan_kmalloc.constprop.0+0x84/0xa0
 amdgpu_dm_init.isra.0+0x479/0x640 [amdgpu]
 ? vprintk_emit+0x1c0/0x460
 ? dev_vprintk_emit+0x2d8/0x31a
 ? sched_clock+0x5/0x10
 ? dm_resume+0x13b0/0x13b0 [amdgpu]
 ? dev_attr_show.cold+0x35/0x35
 ? lock_downgrade+0x6b0/0x6b0
 ? dev_printk_emit+0x8c/0xa8
 ? dev_vprintk_emit+0x31a/0x31a
 ? wait_for_completion_io+0x240/0x240
 ? __dev_printk+0x71/0xdf
 ? smu_hw_init.cold+0x16b/0x18a [amdgpu]
 ? smu_suspend+0x240/0x240 [amdgpu]
 ? navi10_ih_irq_init+0xea3/0x2420 [amdgpu]
 dm_hw_init+0xe/0x20 [amdgpu]
 amdgpu_device_init.cold+0x3031/0x4940 [amdgpu]
 ? amdgpu_device_cache_pci_state+0xf0/0xf0 [amdgpu]
 ? pci_bus_read_config_byte+0x140/0x140
 ? do_pci_enable_device+0x1f8/0x260
 ? pci_find_saved_ext_cap+0x110/0x110
 ? pci_enable_bridge+0xf9/0x1e0
 ? pci_dev_check_d3cold+0x107/0x250
 ? pci_enable_device_flags+0x201/0x340
 amdgpu_driver_load_kms+0x167/0x8a0 [amdgpu]
 amdgpu_pci_probe+0x235/0x360 [amdgpu]
 ? amdgpu_pci_remove+0xd0/0xd0 [amdgpu]
 local_pci_probe+0xd8/0x170
 pci_device_probe+0x318/0x5c0
 ? kernfs_create_link+0x16c/0x230
 ? pci_device_remove+0x1d0/0x1d0
 really_probe+0x224/0xc40
 driver_probe_device+0x1f2/0x380
 device_driver_attach+0x1df/0x250
 __driver_attach+0xf6/0x260
 ? device_driver_attach+0x250/0x250
 bus_for_each_dev+0x114/0x180
 ? subsys_dev_iter_exit+0x10/0x10
 bus_add_driver+0x352/0x570
 driver_register+0x20f/0x390
 ? __pci_register_driver+0x13a/0x210
 ? 0xc1d8d000
 do_one_initcall+0xfb/0x530
 ? perf_trace_initcall_level+0x3d0/0x3d0
 ? __memset+0x2b/0x30
 ? unpoison_range+0x3a/0x60
 do_init_module+0x1ce/0x7a0
 load_module+0x9841/0xa380
 ? module_frob_arch_sections+0x20/0x20
 ? lockdep_hardirqs_on_prepare+0x3e0/0x3e0
 ? sched_clock_cpu+0x18/0x170
 ? sched_clock+0x5/0x10
 ? lock_acquire+0x2dd/0x7a0
 ? sched_clock+0x5/0x10
 ? lock_is_held_type+0xb8/0xf0
 ? __do_sys_init_module+0x18b/0x220
 __do_sys_init_module+0x18b/0x220
 ? load_module+0xa380/0xa380
 ? ktime_get_coarse_real_ts64+0x12f/0x160
 do_syscall_64+0x33/0x40
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f2c109da07e
Code: 48 8b 0d f5 1d 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f
84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d
01 f0 ff ff 73 01 c3 48 8b 0d c2 1d 0c 00 f7 d8 64 89 01 48
RSP: 002b:7ffc84d33f88 EFLAGS: 0246 ORIG_RAX: 00af
RAX: ffda RBX: 55b87f8260a0 RCX: 7f2c109da07e
RDX: 55b87f834060 RSI: 01e2cbf6 RDI: 7f2c0b7e0010
RBP: 7f2c0b7e0010 R08: 55b87f8281e0 R09: 7ffc84d30a26
R10: 55bd2404cc18 R11: 0246 R12: 55b87f834060
R13: 55b87f831ca0 R14:  R15: 55b87f832640
[drm] Display Core initialized with v3.2.116!
[drm] DMUB hardware initialized: version=0x0201
usb 1-3.2: Device not responding to setup address.
usb 1-3.2: device not accepting address 5, error -71
[drm] REG_WAIT timeout 1us * 10 tries - mpc2_assert_idle_mpcc line:480


2.
BUG: key 88810b0d9148 has not been registered!
[ cut here ]
DEBUG_LOCKS_WARN_ON(1)
WARNING: CPU: 25 PID: 500 at kernel/locking/lockdep.c:4618
lockdep_init_map_waits+0x592/0x770
Modules linked in: amdgpu(+) drm_ttm_helper ttm iommu_v2 gpu_sched
drm

Re: [PATCH 2/2] drm/amdgpu/display: buffer INTERRUPT_LOW_IRQ_CONTEXT interrupt work

2021-01-19 Thread Andrey Grodzovsky


On 1/15/21 2:21 AM, Chen, Xiaogang wrote:

On 1/14/2021 1:24 AM, Grodzovsky, Andrey wrote:


On 1/14/21 12:11 AM, Chen, Xiaogang wrote:

On 1/12/2021 10:54 PM, Grodzovsky, Andrey wrote:

On 1/4/21 1:01 AM, Xiaogang.Chen wrote:

From: Xiaogang Chen 

amdgpu DM handles INTERRUPT_LOW_IRQ_CONTEXT interrupts (hpd, hpd_rx)
using a work queue with a single work_struct. If a previous interrupt
has not been handled, new interrupts of the same type are discarded and
the driver just sends an "amdgpu_dm_irq_schedule_work FAILED" message.
If the driver misses important hpd or hpd_rx interrupts, hot-(un)plugged
devices can hang the system or make it unstable, for example when the
system resumes from S3 sleep with an MST device connected.

This patch dynamically allocates a new amdgpu_dm_irq_handler_data for
a new interrupt if the previous INTERRUPT_LOW_IRQ_CONTEXT interrupt work
has not been handled yet. The new interrupt work can then be queued on
the same workqueue_struct instead of being discarded.
All allocated amdgpu_dm_irq_handler_data are kept on a single linked
list and reused afterwards.


I believe this creates a possible concurrency between already
executing work item
and the new incoming one for which you allocate a new work item on
the fly. While
handle_hpd_irq is serialized with aconnector->hpd_lock I am seeing
that for handle_hpd_rx_irq
it's not locked for MST use case (which is the most frequently used
with this interrupt). Did you
verify that handle_hpd_rx_irq is reentrant?


handle_hpd_rx_irq is put on a work queue. Its execution is serialized
by the work queue, so it is not reentered.


You are using system_highpri_wq, which has the property that it has a
pool of multiple worker threads spread across all the active CPUs; see
the work queue definitions here:
https://elixir.bootlin.com/linux/v5.11-rc3/source/include/linux/workqueue.h#L358
I believe what you are saying about no chance of reentrancy would be
correct if it were the same work item being dequeued for execution
while the previous instance is still running; see the explanation here:
https://elixir.bootlin.com/linux/v5.11-rc3/source/kernel/workqueue.c#L1435.
Non-reentrancy is guaranteed only for the same work item. If you want
non-reentrancy (full serialization) across different work items, you
should create your own single-threaded workqueue using
create_singlethread_workqueue().
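
For example, a minimal sketch (the workqueue name is illustrative;
handler_data is the per-interrupt data your patch already allocates):

	/* one ordered worker thread: items queued here never run concurrently */
	static struct workqueue_struct *dm_irq_wq;

	dm_irq_wq = create_singlethread_workqueue("amdgpu_dm_irq");
	if (!dm_irq_wq)
		return -ENOMEM;

	/* queue low-context IRQ work here instead of on system_highpri_wq */
	queue_work(dm_irq_wq, &handler_data->work);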



Thank you. I think the easiest way is to take aconnector->hpd_lock in
handle_hpd_rx_irq for the dc_link->type == dc_connection_mst_branch
case, right? I will do that in the next version if you think it is ok.
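
Roughly like this (sketch only, untested; the non-MST handling already
in handle_hpd_rx_irq is elided):

	static void handle_hpd_rx_irq(void *param)
	{
		struct amdgpu_dm_connector *aconnector = param;
		struct dc_link *dc_link = aconnector->dc_link;

		if (dc_link->type == dc_connection_mst_branch) {
			mutex_lock(&aconnector->hpd_lock);
			/* existing MST sideband-message handling goes here */
			mutex_unlock(&aconnector->hpd_lock);
		}
	}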



I am not sure what the consequences of using the hpd lock there are with
regard to other locks acquired in the DRM MST code during MST-related HPD
transactions, since I haven't dealt with this for a very long time. Maybe
Harry or Nick can advise on this?


Andrey




amdgpu_dm_irq_schedule_work does queuing of work(put
handle_hpd_rx_irq into work queue). The first call is
dm_irq_work_func, then call handle_hpd_rx_irq.

Signed-off-by: Xiaogang Chen 
---
   drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h  |  14 +--
   .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_irq.c  | 114
++---
   2 files changed, 80 insertions(+), 48 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h
index c9d82b9..730e540 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h
@@ -69,18 +69,6 @@ struct common_irq_params {
   };
     /**
- * struct irq_list_head - Linked-list for low context IRQ handlers.
- *
- * @head: The list_head within &struct handler_data
- * @work: A work_struct containing the deferred handler work
- */
-struct irq_list_head {
-    struct list_head head;
-    /* In case this interrupt needs post-processing, 'work' will
be queued*/
-    struct work_struct work;
-};
-
-/**
    * struct dm_compressor_info - Buffer info used by frame buffer
compression
    * @cpu_addr: MMIO cpu addr
    * @bo_ptr: Pointer to the buffer object
@@ -270,7 +258,7 @@ struct amdgpu_display_manager {
    * Note that handlers are called in the same order as they were
    * registered (FIFO).
    */
-    struct irq_list_head
irq_handler_list_low_tab[DAL_IRQ_SOURCES_NUMBER];
+    struct list_head
irq_handler_list_low_tab[DAL_IRQ_SOURCES_NUMBER];
     /**
    * @irq_handler_list_high_tab:
diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_irq.c
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_irq.c
index 3577785..ada344a 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_irq.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_irq.c
@@ -82,6 +82,7 @@ struct amdgpu_dm_irq_handler_data {
   struct amdgpu_display_manager *dm;
   /* DAL irq source which registered for this interrupt. */
   enum dc_irq_source irq_source;
+    struct work_struct work;
   };
     #define DM_IRQ_TABLE_LOCK(adev, flags) \
@@ -111,20 +112,10 @@ 

Re: [PATCH] drm/amd/display: Implement functions to let DC allocate GPU memory

2021-01-19 Thread Kazlauskas, Nicholas

On 2021-01-19 3:40 p.m., Bhawanpreet Lakha wrote:

From: Harry Wentland 

[Why]
DC needs to communicate with PM FW through GPU memory. In order
to do so we need to be able to allocate memory from within DC.

[How]
Call amdgpu_bo_create_kernel to allocate GPU memory and use a
list in amdgpu_display_manager to track our allocations so we
can clean them up later.

Signed-off-by: Harry Wentland 


Reviewed-by: Nicholas Kazlauskas 

Regards,
Nicholas Kazlauskas


---
  .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c |  2 +
  .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h |  9 +
  .../amd/display/amdgpu_dm/amdgpu_dm_helpers.c | 40 +--
  3 files changed, 48 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index e490fc2486f7..83ec92a69cba 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -1017,6 +1017,8 @@ static int amdgpu_dm_init(struct amdgpu_device *adev)
  
  	init_data.soc_bounding_box = adev->dm.soc_bounding_box;
  
+	INIT_LIST_HEAD(&adev->dm.da_list);

+
/* Display Core create. */
adev->dm.dc = dc_create(&init_data);
  
diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h

index 38bc0f88b29c..49137924a855 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h
@@ -130,6 +130,13 @@ struct amdgpu_dm_backlight_caps {
bool aux_support;
  };
  
+struct dal_allocation {

+   struct list_head list;
+   struct amdgpu_bo *bo;
+   void *cpu_ptr;
+   u64 gpu_addr;
+};
+
  /**
   * struct amdgpu_display_manager - Central amdgpu display manager device
   *
@@ -350,6 +357,8 @@ struct amdgpu_display_manager {
 */
struct amdgpu_encoder mst_encoders[AMDGPU_DM_MAX_CRTC];
bool force_timing_sync;
+
+   struct list_head da_list;
  };
  
  enum dsc_clock_force_state {

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_helpers.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_helpers.c
index 3244a6ea7a65..5dc426e6e785 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_helpers.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_helpers.c
@@ -652,8 +652,31 @@ void *dm_helpers_allocate_gpu_mem(
size_t size,
long long *addr)
  {
-   // TODO
-   return NULL;
+   struct amdgpu_device *adev = ctx->driver_context;
+   struct dal_allocation *da;
+   u32 domain = (type == DC_MEM_ALLOC_TYPE_GART) ?
+   AMDGPU_GEM_DOMAIN_GTT : AMDGPU_GEM_DOMAIN_VRAM;
+   int ret;
+
+   da = kzalloc(sizeof(struct dal_allocation), GFP_KERNEL);
+   if (!da)
+   return NULL;
+
+   ret = amdgpu_bo_create_kernel(adev, size, PAGE_SIZE,
+ domain, &da->bo,
+ &da->gpu_addr, &da->cpu_ptr);
+
+   *addr = da->gpu_addr;
+
+   if (ret) {
+   kfree(da);
+   return NULL;
+   }
+
+   /* add da to list in dm */
+   list_add(&da->list, &adev->dm.da_list);
+
+   return da->cpu_ptr;
  }
  
  void dm_helpers_free_gpu_mem(

@@ -661,5 +684,16 @@ void dm_helpers_free_gpu_mem(
enum dc_gpu_mem_alloc_type type,
void *pvMem)
  {
-   // TODO
+   struct amdgpu_device *adev = ctx->driver_context;
+   struct dal_allocation *da;
+
+   /* walk the da list in DM */
+   list_for_each_entry(da, &adev->dm.da_list, list) {
+   if (pvMem == da->cpu_ptr) {
+   amdgpu_bo_free_kernel(&da->bo, &da->gpu_addr, 
&da->cpu_ptr);
+   list_del(&da->list);
+   kfree(da);
+   break;
+   }
+   }
  }



___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH 3/3] drm/amd/display: Update dcn30_apply_idle_power_optimizations() code

2021-01-19 Thread Kazlauskas, Nicholas

On 2021-01-19 3:38 p.m., Bhawanpreet Lakha wrote:

Update the function for idle optimizations:
- remove hardcoded size
- enable the no-memory-request case
- add cursor copy
- update the MALL eligibility check case

Signed-off-by: Bhawanpreet Lakha 
Signed-off-by: Joshua Aberback 


Series is:

Reviewed-by: Nicholas Kazlauskas 

Though you might want to update patch 1's commit message to explain
watermark set D in a little more detail.


Regards,
Nicholas Kazlauskas


---
  drivers/gpu/drm/amd/display/dc/dc.h   |   2 +
  .../drm/amd/display/dc/dcn30/dcn30_hwseq.c| 157 +-
  .../amd/display/dc/dcn302/dcn302_resource.c   |   4 +-
  .../gpu/drm/amd/display/dmub/inc/dmub_cmd.h   |   5 +
  4 files changed, 129 insertions(+), 39 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/dc/dc.h 
b/drivers/gpu/drm/amd/display/dc/dc.h
index e21d4602e427..71d46ade24e5 100644
--- a/drivers/gpu/drm/amd/display/dc/dc.h
+++ b/drivers/gpu/drm/amd/display/dc/dc.h
@@ -502,6 +502,8 @@ struct dc_debug_options {
  #if defined(CONFIG_DRM_AMD_DC_DCN)
bool disable_idle_power_optimizations;
unsigned int mall_size_override;
+   unsigned int mall_additional_timer_percent;
+   bool mall_error_as_fatal;
  #endif
bool dmub_command_table; /* for testing only */
struct dc_bw_validation_profile bw_val_profile;
diff --git a/drivers/gpu/drm/amd/display/dc/dcn30/dcn30_hwseq.c 
b/drivers/gpu/drm/amd/display/dc/dcn30/dcn30_hwseq.c
index 5c546b06f551..dff83c6a142a 100644
--- a/drivers/gpu/drm/amd/display/dc/dcn30/dcn30_hwseq.c
+++ b/drivers/gpu/drm/amd/display/dc/dcn30/dcn30_hwseq.c
@@ -710,8 +710,11 @@ void dcn30_program_dmdata_engine(struct pipe_ctx *pipe_ctx)
  bool dcn30_apply_idle_power_optimizations(struct dc *dc, bool enable)
  {
union dmub_rb_cmd cmd;
-   unsigned int surface_size, refresh_hz, denom;
uint32_t tmr_delay = 0, tmr_scale = 0;
+   struct dc_cursor_attributes cursor_attr;
+   bool cursor_cache_enable = false;
+   struct dc_stream_state *stream = NULL;
+   struct dc_plane_state *plane = NULL;
  
  	if (!dc->ctx->dmub_srv)

return false;
@@ -722,72 +725,150 @@ bool dcn30_apply_idle_power_optimizations(struct dc *dc, 
bool enable)
  
  			/* First, check no-memory-requests case */

for (i = 0; i < dc->current_state->stream_count; i++) {
-   if (dc->current_state->stream_status[i]
-   .plane_count)
+   if 
(dc->current_state->stream_status[i].plane_count)
/* Fail eligibility on a visible stream 
*/
break;
}
  
-			if (dc->current_state->stream_count == 1 // single display only

-   && dc->current_state->stream_status[0].plane_count 
== 1 // single surface only
-   && 
dc->current_state->stream_status[0].plane_states[0]->address.page_table_base.quad_part 
== 0 // no VM
-   // Only 8 and 16 bit formats
-   && 
dc->current_state->stream_status[0].plane_states[0]->format <= 
SURFACE_PIXEL_FORMAT_GRPH_ABGR16161616F
-   && 
dc->current_state->stream_status[0].plane_states[0]->format >= 
SURFACE_PIXEL_FORMAT_GRPH_ARGB) {
-   surface_size = 
dc->current_state->stream_status[0].plane_states[0]->plane_size.surface_pitch *
-   
dc->current_state->stream_status[0].plane_states[0]->plane_size.surface_size.height
 *
-   
(dc->current_state->stream_status[0].plane_states[0]->format >= 
SURFACE_PIXEL_FORMAT_GRPH_ARGB16161616 ?
-8 : 4);
-   } else {
-   // TODO: remove hard code size
-   surface_size = 128 * 1024 * 1024;
+   if (i == dc->current_state->stream_count) {
+   /* Enable no-memory-requests case */
+   memset(&cmd, 0, sizeof(cmd));
+   cmd.mall.header.type = DMUB_CMD__MALL;
+   cmd.mall.header.sub_type = 
DMUB_CMD__MALL_ACTION_NO_DF_REQ;
+   cmd.mall.header.payload_bytes = 
sizeof(cmd.mall) - sizeof(cmd.mall.header);
+
+   dc_dmub_srv_cmd_queue(dc->ctx->dmub_srv, &cmd);
+   dc_dmub_srv_cmd_execute(dc->ctx->dmub_srv);
+
+   return true;
}
  
-			// TODO: remove hard code size

-   if (surface_size < 128 * 1024 * 1024) {
-   refresh_hz = div_u64((unsigned long long) 
dc->current_state->streams[0]->timing.pix_clk_100hz *
- 

Re: [PATCH v4 07/14] drm/amdgpu: Register IOMMU topology notifier per device.

2021-01-19 Thread Daniel Vetter
On Tue, Jan 19, 2021 at 10:22 PM Andrey Grodzovsky
 wrote:
>
>
> On 1/19/21 8:45 AM, Daniel Vetter wrote:
>
> On Tue, Jan 19, 2021 at 09:48:03AM +0100, Christian König wrote:
>
> Am 18.01.21 um 22:01 schrieb Andrey Grodzovsky:
>
> Handle all DMA IOMMU group related dependencies before the
> group is removed.
>
> Signed-off-by: Andrey Grodzovsky 
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h|  5 
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 46 
> ++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c   |  2 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h   |  1 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 10 +++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_object.h |  2 ++
>   6 files changed, 65 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 478a7d8..2953420 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -51,6 +51,7 @@
>   #include 
>   #include 
>   #include 
> +#include 
>   #include 
>   #include 
> @@ -1041,6 +1042,10 @@ struct amdgpu_device {
>   boolin_pci_err_recovery;
>   struct pci_saved_state  *pci_state;
> +
> + struct notifier_block nb;
> + struct blocking_notifier_head notifier;
> + struct list_head device_bo_list;
>   };
>   static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 45e23e3..e99f4f1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -70,6 +70,8 @@
>   #include 
>   #include 
> +#include 
> +
>   MODULE_FIRMWARE("amdgpu/vega10_gpu_info.bin");
>   MODULE_FIRMWARE("amdgpu/vega12_gpu_info.bin");
>   MODULE_FIRMWARE("amdgpu/raven_gpu_info.bin");
> @@ -3200,6 +3202,39 @@ static const struct attribute *amdgpu_dev_attributes[] 
> = {
>   };
> +static int amdgpu_iommu_group_notifier(struct notifier_block *nb,
> + unsigned long action, void *data)
> +{
> + struct amdgpu_device *adev = container_of(nb, struct amdgpu_device, nb);
> + struct amdgpu_bo *bo = NULL;
> +
> + /*
> + * Following is a set of IOMMU group dependencies taken care of before
> + * device's IOMMU group is removed
> + */
> + if (action == IOMMU_GROUP_NOTIFY_DEL_DEVICE) {
> +
> + spin_lock(&ttm_bo_glob.lru_lock);
> + list_for_each_entry(bo, &adev->device_bo_list, bo) {
> + if (bo->tbo.ttm)
> + ttm_tt_unpopulate(bo->tbo.bdev, bo->tbo.ttm);
> + }
> + spin_unlock(&ttm_bo_glob.lru_lock);
>
> That approach won't work. ttm_tt_unpopulate() might sleep on an IOMMU lock.
>
> You need to use a mutex here or even better make sure you can access the
> device_bo_list without a lock in this moment.
>
> I'd also be worried about the notifier mutex getting really badly in the
> way.
>
> Plus I'm worried why we even need this, it sounds a bit like papering over
> the iommu subsystem. Assuming we clean up all our iommu mappings in our
> device hotunplug/unload code, why do we still need to have an additional
> iommu notifier on top, with all kinds of additional headaches? The iommu
> shouldn't clean up before the devices in its group have cleaned up.
>
> I think we need more info here on what the exact problem is first.
> -Daniel
>
>
> Originally I experienced the crash below on an IOMMU-enabled device; it
> happens after device removal from the PCI topology, while shutting down
> the user client holding the last reference to the drm device file (X in
> my case).
> The crash occurs because by the time I get to this point, the struct
> device->iommu_group pointer is already NULL, since the IOMMU group for
> the device is unset during PCI removal. So this contradicts what you
> said above, that the iommu shouldn't clean up before the devices in its
> group have cleaned up.
> So instead of guessing where the right place for all the IOMMU-related
> cleanups is, it makes sense to get a notification from the IOMMU
> subsystem in the form of the IOMMU_GROUP_NOTIFY_DEL_DEVICE event and do
> all the relevant cleanups there.

Yeah that goes boom, but you shouldn't need this special iommu cleanup
handler. Making sure that all the dma-api mappings are gone needs to
be done as part of the device hotunplug, you can't delay that to the
last drm_device cleanup.

So most of the patch here, with pulling that out (it should even be
outright removed from the final release code), is good; just not how
you call that new code yet. Probably these bits (aside from walking all
buffers and unpopulating the tt) should be done from the early_free
callback you're adding.

Also what I just realized: For normal unload you need to make sure the
hw is actually stopped first, before we unmap buffers. Otherwise
driver unload will likely result in wedged hw, probably not what you
want for debugging.
-Daniel

> Andrey
>
>
> [  123.810074 <   28.126960>] BUG: kernel NULL pointer dereferenc

Re: [PATCH v4 07/14] drm/amdgpu: Register IOMMU topology notifier per device.

2021-01-19 Thread Andrey Grodzovsky


On 1/19/21 8:45 AM, Daniel Vetter wrote:

On Tue, Jan 19, 2021 at 09:48:03AM +0100, Christian König wrote:

Am 18.01.21 um 22:01 schrieb Andrey Grodzovsky:

Handle all DMA IOMMU group related dependencies before the
group is removed.

Signed-off-by: Andrey Grodzovsky 
---
   drivers/gpu/drm/amd/amdgpu/amdgpu.h|  5 
   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 46 
++
   drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c   |  2 +-
   drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h   |  1 +
   drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 10 +++
   drivers/gpu/drm/amd/amdgpu/amdgpu_object.h |  2 ++
   6 files changed, 65 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 478a7d8..2953420 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -51,6 +51,7 @@
   #include 
   #include 
   #include 
+#include 
   #include 
   #include 
@@ -1041,6 +1042,10 @@ struct amdgpu_device {
boolin_pci_err_recovery;
struct pci_saved_state  *pci_state;
+
+   struct notifier_block   nb;
+   struct blocking_notifier_head   notifier;
+   struct list_headdevice_bo_list;
   };
   static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 45e23e3..e99f4f1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -70,6 +70,8 @@
   #include 
   #include 
+#include 
+
   MODULE_FIRMWARE("amdgpu/vega10_gpu_info.bin");
   MODULE_FIRMWARE("amdgpu/vega12_gpu_info.bin");
   MODULE_FIRMWARE("amdgpu/raven_gpu_info.bin");
@@ -3200,6 +3202,39 @@ static const struct attribute *amdgpu_dev_attributes[] = 
{
   };
+static int amdgpu_iommu_group_notifier(struct notifier_block *nb,
+unsigned long action, void *data)
+{
+   struct amdgpu_device *adev = container_of(nb, struct amdgpu_device, nb);
+   struct amdgpu_bo *bo = NULL;
+
+   /*
+* Following is a set of IOMMU group dependencies taken care of before
+* device's IOMMU group is removed
+*/
+   if (action == IOMMU_GROUP_NOTIFY_DEL_DEVICE) {
+
+   spin_lock(&ttm_bo_glob.lru_lock);
+   list_for_each_entry(bo, &adev->device_bo_list, bo) {
+   if (bo->tbo.ttm)
+   ttm_tt_unpopulate(bo->tbo.bdev, bo->tbo.ttm);
+   }
+   spin_unlock(&ttm_bo_glob.lru_lock);

That approach won't work. ttm_tt_unpopulate() might sleep on an IOMMU lock.

You need to use a mutex here or even better make sure you can access the
device_bo_list without a lock in this moment.

I'd also be worried about the notifier mutex getting really badly in the
way.

Plus I'm worried why we even need this, it sounds a bit like papering over
the iommu subsystem. Assuming we clean up all our iommu mappings in our
device hotunplug/unload code, why do we still need to have an additional
iommu notifier on top, with all kinds of additional headaches? The iommu
shouldn't clean up before the devices in its group have cleaned up.

I think we need more info here on what the exact problem is first.
-Daniel



Originally I experienced the crash below on an IOMMU-enabled device; it happens
after device removal from the PCI topology, while shutting down the user client
holding the last reference to the drm device file (X in my case).
The crash occurs because by the time I get to this point, the struct
device->iommu_group pointer is already NULL, since the IOMMU group for the
device is unset during PCI removal. So this contradicts what you said above,
that the iommu shouldn't clean up before the devices in its group have cleaned
up.
So instead of guessing where the right place for all the IOMMU-related cleanups
is, it makes sense to get a notification from the IOMMU subsystem in the form
of the IOMMU_GROUP_NOTIFY_DEL_DEVICE event and do all the relevant cleanups
there.
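
For completeness, the registration side would look roughly like this
(sketch only; error handling is omitted, and the availability of
iommu_group_register_notifier() depends on the kernel version):

	struct iommu_group *group;

	adev->nb.notifier_call = amdgpu_iommu_group_notifier;

	group = iommu_group_get(adev->dev);
	if (group)
		iommu_group_register_notifier(group, &adev->nb);
	iommu_group_put(group);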

Andrey


[  123.810074 <   28.126960>] BUG: kernel NULL pointer dereference, address: 
00c8

[  123.810080 <    0.06>] #PF: supervisor read access in kernel mode
[  123.810082 <    0.02>] #PF: error_code(0x) - not-present page
[  123.810085 <    0.03>] PGD 0 P4D 0
[  123.810089 <    0.04>] Oops:  [#1] SMP NOPTI
[  123.810094 <    0.05>] CPU: 5 PID: 1418 Comm: Xorg:shlo4 Tainted: 
G   O  5.9.0-rc2-dev+ #59
[  123.810096 <    0.02>] Hardware name: System manufacturer System Product 
Name/PRIME X470-PRO, BIOS 4406 02/28/2019

[  123.810105 <    0.09>] RIP: 0010:iommu_get_dma_domain+0x10/0x20
[  123.810108 <    0.03>] Code: b0 48 c7 87 98 00 00 00 00 00 00 00 31 c0 c3 
b8 f4 ff ff ff eb a6 0f 1f 40 00 0f 1f 44 00 00 48 8b 87 d0 02 00 00 55 48 89 e5 
<48> 8b 80 c8 00 00 00 5d c3 

[PATCH] drm/amd/display: Implement functions to let DC allocate GPU memory

2021-01-19 Thread Bhawanpreet Lakha
From: Harry Wentland 

[Why]
DC needs to communicate with PM FW through GPU memory. In order
to do so we need to be able to allocate memory from within DC.

[How]
Call amdgpu_bo_create_kernel to allocate GPU memory and use a
list in amdgpu_display_manager to track our allocations so we
can clean them up later.
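
For example, a DC-side caller could use the helpers roughly as follows
(illustrative sketch; the buffer size and its use are hypothetical):

	long long gpu_addr;
	void *cpu_ptr;

	/* allocate a GART-backed, CPU-visible buffer for PM FW messaging */
	cpu_ptr = dm_helpers_allocate_gpu_mem(dc->ctx, DC_MEM_ALLOC_TYPE_GART,
					      4096, &gpu_addr);
	if (!cpu_ptr)
		return false;

	/* hand gpu_addr to PM FW; access the buffer from the CPU via cpu_ptr */

	dm_helpers_free_gpu_mem(dc->ctx, DC_MEM_ALLOC_TYPE_GART, cpu_ptr);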

Signed-off-by: Harry Wentland 
---
 .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c |  2 +
 .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h |  9 +
 .../amd/display/amdgpu_dm/amdgpu_dm_helpers.c | 40 +--
 3 files changed, 48 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index e490fc2486f7..83ec92a69cba 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -1017,6 +1017,8 @@ static int amdgpu_dm_init(struct amdgpu_device *adev)
 
init_data.soc_bounding_box = adev->dm.soc_bounding_box;
 
+   INIT_LIST_HEAD(&adev->dm.da_list);
+
/* Display Core create. */
adev->dm.dc = dc_create(&init_data);
 
diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h
index 38bc0f88b29c..49137924a855 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h
@@ -130,6 +130,13 @@ struct amdgpu_dm_backlight_caps {
bool aux_support;
 };
 
+struct dal_allocation {
+   struct list_head list;
+   struct amdgpu_bo *bo;
+   void *cpu_ptr;
+   u64 gpu_addr;
+};
+
 /**
  * struct amdgpu_display_manager - Central amdgpu display manager device
  *
@@ -350,6 +357,8 @@ struct amdgpu_display_manager {
 */
struct amdgpu_encoder mst_encoders[AMDGPU_DM_MAX_CRTC];
bool force_timing_sync;
+
+   struct list_head da_list;
 };
 
 enum dsc_clock_force_state {
diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_helpers.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_helpers.c
index 3244a6ea7a65..5dc426e6e785 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_helpers.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_helpers.c
@@ -652,8 +652,31 @@ void *dm_helpers_allocate_gpu_mem(
size_t size,
long long *addr)
 {
-   // TODO
-   return NULL;
+   struct amdgpu_device *adev = ctx->driver_context;
+   struct dal_allocation *da;
+   u32 domain = (type == DC_MEM_ALLOC_TYPE_GART) ?
+   AMDGPU_GEM_DOMAIN_GTT : AMDGPU_GEM_DOMAIN_VRAM;
+   int ret;
+
+   da = kzalloc(sizeof(struct dal_allocation), GFP_KERNEL);
+   if (!da)
+   return NULL;
+
+   ret = amdgpu_bo_create_kernel(adev, size, PAGE_SIZE,
+ domain, &da->bo,
+ &da->gpu_addr, &da->cpu_ptr);
+
+   *addr = da->gpu_addr;
+
+   if (ret) {
+   kfree(da);
+   return NULL;
+   }
+
+   /* add da to list in dm */
+   list_add(&da->list, &adev->dm.da_list);
+
+   return da->cpu_ptr;
 }
 
 void dm_helpers_free_gpu_mem(
@@ -661,5 +684,16 @@ void dm_helpers_free_gpu_mem(
enum dc_gpu_mem_alloc_type type,
void *pvMem)
 {
-   // TODO
+   struct amdgpu_device *adev = ctx->driver_context;
+   struct dal_allocation *da;
+
+   /* walk the da list in DM */
+   list_for_each_entry(da, &adev->dm.da_list, list) {
+   if (pvMem == da->cpu_ptr) {
+   amdgpu_bo_free_kernel(&da->bo, &da->gpu_addr, 
&da->cpu_ptr);
+   list_del(&da->list);
+   kfree(da);
+   break;
+   }
+   }
 }
-- 
2.25.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH 3/3] drm/amd/display: Update dcn30_apply_idle_power_optimizations() code

2021-01-19 Thread Bhawanpreet Lakha
Update the function for idle optimizations:
- remove hardcoded size
- enable the no-memory-request case
- add cursor copy
- update the MALL eligibility check case

Signed-off-by: Bhawanpreet Lakha 
Signed-off-by: Joshua Aberback 
---
 drivers/gpu/drm/amd/display/dc/dc.h   |   2 +
 .../drm/amd/display/dc/dcn30/dcn30_hwseq.c| 157 +-
 .../amd/display/dc/dcn302/dcn302_resource.c   |   4 +-
 .../gpu/drm/amd/display/dmub/inc/dmub_cmd.h   |   5 +
 4 files changed, 129 insertions(+), 39 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/dc/dc.h 
b/drivers/gpu/drm/amd/display/dc/dc.h
index e21d4602e427..71d46ade24e5 100644
--- a/drivers/gpu/drm/amd/display/dc/dc.h
+++ b/drivers/gpu/drm/amd/display/dc/dc.h
@@ -502,6 +502,8 @@ struct dc_debug_options {
 #if defined(CONFIG_DRM_AMD_DC_DCN)
bool disable_idle_power_optimizations;
unsigned int mall_size_override;
+   unsigned int mall_additional_timer_percent;
+   bool mall_error_as_fatal;
 #endif
bool dmub_command_table; /* for testing only */
struct dc_bw_validation_profile bw_val_profile;
diff --git a/drivers/gpu/drm/amd/display/dc/dcn30/dcn30_hwseq.c 
b/drivers/gpu/drm/amd/display/dc/dcn30/dcn30_hwseq.c
index 5c546b06f551..dff83c6a142a 100644
--- a/drivers/gpu/drm/amd/display/dc/dcn30/dcn30_hwseq.c
+++ b/drivers/gpu/drm/amd/display/dc/dcn30/dcn30_hwseq.c
@@ -710,8 +710,11 @@ void dcn30_program_dmdata_engine(struct pipe_ctx *pipe_ctx)
 bool dcn30_apply_idle_power_optimizations(struct dc *dc, bool enable)
 {
union dmub_rb_cmd cmd;
-   unsigned int surface_size, refresh_hz, denom;
uint32_t tmr_delay = 0, tmr_scale = 0;
+   struct dc_cursor_attributes cursor_attr;
+   bool cursor_cache_enable = false;
+   struct dc_stream_state *stream = NULL;
+   struct dc_plane_state *plane = NULL;
 
if (!dc->ctx->dmub_srv)
return false;
@@ -722,72 +725,150 @@ bool dcn30_apply_idle_power_optimizations(struct dc *dc, 
bool enable)
 
/* First, check no-memory-requests case */
for (i = 0; i < dc->current_state->stream_count; i++) {
-   if (dc->current_state->stream_status[i]
-   .plane_count)
+   if 
(dc->current_state->stream_status[i].plane_count)
/* Fail eligibility on a visible stream 
*/
break;
}
 
-   if (dc->current_state->stream_count == 1 // single 
display only
-   && dc->current_state->stream_status[0].plane_count 
== 1 // single surface only
-   && 
dc->current_state->stream_status[0].plane_states[0]->address.page_table_base.quad_part
 == 0 // no VM
-   // Only 8 and 16 bit formats
-   && 
dc->current_state->stream_status[0].plane_states[0]->format <= 
SURFACE_PIXEL_FORMAT_GRPH_ABGR16161616F
-   && 
dc->current_state->stream_status[0].plane_states[0]->format >= 
SURFACE_PIXEL_FORMAT_GRPH_ARGB) {
-   surface_size = 
dc->current_state->stream_status[0].plane_states[0]->plane_size.surface_pitch *
-   
dc->current_state->stream_status[0].plane_states[0]->plane_size.surface_size.height
 *
-   
(dc->current_state->stream_status[0].plane_states[0]->format >= 
SURFACE_PIXEL_FORMAT_GRPH_ARGB16161616 ?
-8 : 4);
-   } else {
-   // TODO: remove hard code size
-   surface_size = 128 * 1024 * 1024;
+   if (i == dc->current_state->stream_count) {
+   /* Enable no-memory-requests case */
+   memset(&cmd, 0, sizeof(cmd));
+   cmd.mall.header.type = DMUB_CMD__MALL;
+   cmd.mall.header.sub_type = 
DMUB_CMD__MALL_ACTION_NO_DF_REQ;
+   cmd.mall.header.payload_bytes = 
sizeof(cmd.mall) - sizeof(cmd.mall.header);
+
+   dc_dmub_srv_cmd_queue(dc->ctx->dmub_srv, &cmd);
+   dc_dmub_srv_cmd_execute(dc->ctx->dmub_srv);
+
+   return true;
}
 
-   // TODO: remove hard code size
-   if (surface_size < 128 * 1024 * 1024) {
-   refresh_hz = div_u64((unsigned long long) 
dc->current_state->streams[0]->timing.pix_clk_100hz *
-100LL,
-
(dc->current_state->streams[0]->timing.v_total *
- 
dc->current_state->streams[0]->

[PATCH 2/3] drm/amd/display: Dynamic cursor cache size for MALL eligibility check

2021-01-19 Thread Bhawanpreet Lakha
[Why]
Currently we use the maximum possible cursor cache size when deciding if we
should attempt to enable MALL, but this prevents us from enabling the
feature for certain key use cases.

[How]
 - consider cursor bpp when calculating if the cursor fits
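
For example (illustrative numbers): with a 256x256 maximum cursor, a
pre-multiplied-alpha 32bpp cursor needs 256 * 256 * 4 = 256 KiB of cache,
a 64-bit FP cursor needs 512 KiB, and a mono cursor only 32 KiB; sizing
by the actual format instead of the worst case is what frees up MALL for
more configurations.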

Signed-off-by: Bhawanpreet Lakha 
Signed-off-by: Joshua Aberback 
Reviewed-by: Aric Cyr 
---
 drivers/gpu/drm/amd/display/dc/core/dc.c  |  6 ++---
 drivers/gpu/drm/amd/display/dc/dc.h   |  4 +--
 .../drm/amd/display/dc/dcn30/dcn30_hwseq.c| 25 +--
 .../drm/amd/display/dc/dcn30/dcn30_hwseq.h|  3 ++-
 .../amd/display/dc/dcn302/dcn302_resource.c   |  1 +
 .../gpu/drm/amd/display/dc/inc/hw_sequencer.h |  3 ++-
 6 files changed, 33 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/dc/core/dc.c 
b/drivers/gpu/drm/amd/display/dc/core/dc.c
index 89e8e3e11862..1efc67befad4 100644
--- a/drivers/gpu/drm/amd/display/dc/core/dc.c
+++ b/drivers/gpu/drm/amd/display/dc/core/dc.c
@@ -3156,11 +3156,11 @@ void dc_lock_memory_clock_frequency(struct dc *dc)
core_link_enable_stream(dc->current_state, 
&dc->current_state->res_ctx.pipe_ctx[i]);
 }
 
-bool dc_is_plane_eligible_for_idle_optimizaitons(struct dc *dc, struct 
dc_plane_state *plane)
+bool dc_is_plane_eligible_for_idle_optimizations(struct dc *dc, struct 
dc_plane_state *plane,
+   struct dc_cursor_attributes *cursor_attr)
 {
-   if (dc->hwss.does_plane_fit_in_mall && 
dc->hwss.does_plane_fit_in_mall(dc, plane))
+   if (dc->hwss.does_plane_fit_in_mall && 
dc->hwss.does_plane_fit_in_mall(dc, plane, cursor_attr))
return true;
-
return false;
 }
 
diff --git a/drivers/gpu/drm/amd/display/dc/dc.h 
b/drivers/gpu/drm/amd/display/dc/dc.h
index 28e0b6ac1f50..e21d4602e427 100644
--- a/drivers/gpu/drm/amd/display/dc/dc.h
+++ b/drivers/gpu/drm/amd/display/dc/dc.h
@@ -1272,8 +1272,8 @@ enum dc_status dc_set_clock(struct dc *dc, enum 
dc_clock_type clock_type, uint32
 void dc_get_clock(struct dc *dc, enum dc_clock_type clock_type, struct 
dc_clock_config *clock_cfg);
 #if defined(CONFIG_DRM_AMD_DC_DCN)
 
-bool dc_is_plane_eligible_for_idle_optimizations(struct dc *dc,
-struct dc_plane_state *plane);
+bool dc_is_plane_eligible_for_idle_optimizations(struct dc *dc, struct 
dc_plane_state *plane,
+   struct dc_cursor_attributes *cursor_attr);
 
 void dc_allow_idle_optimizations(struct dc *dc, bool allow);
 
diff --git a/drivers/gpu/drm/amd/display/dc/dcn30/dcn30_hwseq.c 
b/drivers/gpu/drm/amd/display/dc/dcn30/dcn30_hwseq.c
index e5cc8f8c363f..5c546b06f551 100644
--- a/drivers/gpu/drm/amd/display/dc/dcn30/dcn30_hwseq.c
+++ b/drivers/gpu/drm/amd/display/dc/dcn30/dcn30_hwseq.c
@@ -814,17 +814,38 @@ bool dcn30_apply_idle_power_optimizations(struct dc *dc, 
bool enable)
return true;
 }
 
-bool dcn30_does_plane_fit_in_mall(struct dc *dc, struct dc_plane_state *plane)
+bool dcn30_does_plane_fit_in_mall(struct dc *dc, struct dc_plane_state *plane, 
struct dc_cursor_attributes *cursor_attr)
 {
// add meta size?
unsigned int surface_size = plane->plane_size.surface_pitch * 
plane->plane_size.surface_size.height *
(plane->format >= 
SURFACE_PIXEL_FORMAT_GRPH_ARGB16161616 ? 8 : 4);
unsigned int mall_size = dc->caps.mall_size_total;
+   unsigned int cursor_size = 0;
 
if (dc->debug.mall_size_override)
mall_size = 1024 * 1024 * dc->debug.mall_size_override;
 
-   return (surface_size + dc->caps.cursor_cache_size) < mall_size;
+   if (cursor_attr) {
+   cursor_size = dc->caps.max_cursor_size * 
dc->caps.max_cursor_size;
+
+   switch (cursor_attr->color_format) {
+   case CURSOR_MODE_MONO:
+   cursor_size /= 2;
+   break;
+   case CURSOR_MODE_COLOR_1BIT_AND:
+   case CURSOR_MODE_COLOR_PRE_MULTIPLIED_ALPHA:
+   case CURSOR_MODE_COLOR_UN_PRE_MULTIPLIED_ALPHA:
+   cursor_size *= 4;
+   break;
+
+   case CURSOR_MODE_COLOR_64BIT_FP_PRE_MULTIPLIED:
+   case CURSOR_MODE_COLOR_64BIT_FP_UN_PRE_MULTIPLIED:
+   cursor_size *= 8;
+   break;
+   }
+   }
+
+   return (surface_size + cursor_size) < mall_size;
 }
 
 void dcn30_hardware_release(struct dc *dc)
diff --git a/drivers/gpu/drm/amd/display/dc/dcn30/dcn30_hwseq.h 
b/drivers/gpu/drm/amd/display/dc/dcn30/dcn30_hwseq.h
index 1103f6356e90..3b7d4812e311 100644
--- a/drivers/gpu/drm/amd/display/dc/dcn30/dcn30_hwseq.h
+++ b/drivers/gpu/drm/amd/display/dc/dcn30/dcn30_hwseq.h
@@ -65,7 +65,8 @@ void dcn30_set_avmute(struct pipe_ctx *pipe_ctx, bool enable);
 void dcn30_update_info_frame(struct pipe_ctx *pipe_ctx);
 void dcn30_program_dmdata_engine(struct pipe_ctx *pipe_ctx);
 
-bool dcn30_does_plane_fi

[PATCH 1/3] drm/amd/display: Enable programming of MALL watermarks

2021-01-19 Thread Bhawanpreet Lakha
Uncomment the programming of watermark set D (MALL - SR enter and exit
times adjusted for MALL) so that it is reported to PM FW.

Signed-off-by: Bhawanpreet Lakha 
---
 .../display/dc/clk_mgr/dcn30/dcn30_clk_mgr.c   | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/dc/clk_mgr/dcn30/dcn30_clk_mgr.c 
b/drivers/gpu/drm/amd/display/dc/clk_mgr/dcn30/dcn30_clk_mgr.c
index ab98c259ef69..c7e5a64e06af 100644
--- a/drivers/gpu/drm/amd/display/dc/clk_mgr/dcn30/dcn30_clk_mgr.c
+++ b/drivers/gpu/drm/amd/display/dc/clk_mgr/dcn30/dcn30_clk_mgr.c
@@ -146,15 +146,15 @@ static noinline void dcn3_build_wm_range_table(struct 
clk_mgr_internal *clk_mgr)

clk_mgr->base.bw_params->wm_table.nv_entries[WM_C].pmfw_breakdown.max_uclk = 
0x;
 
/* Set D - MALL - SR enter and exit times adjusted for MALL */
-// clk_mgr->base.bw_params->wm_table.nv_entries[WM_D].valid = true;
-// 
clk_mgr->base.bw_params->wm_table.nv_entries[WM_D].dml_input.pstate_latency_us 
= pstate_latency_us;
-// 
clk_mgr->base.bw_params->wm_table.nv_entries[WM_D].dml_input.sr_exit_time_us = 
2;
-// 
clk_mgr->base.bw_params->wm_table.nv_entries[WM_D].dml_input.sr_enter_plus_exit_time_us
 = 4;
-// 
clk_mgr->base.bw_params->wm_table.nv_entries[WM_D].pmfw_breakdown.wm_type = 
WATERMARKS_MALL;
-// 
clk_mgr->base.bw_params->wm_table.nv_entries[WM_D].pmfw_breakdown.min_dcfclk = 
0;
-// 
clk_mgr->base.bw_params->wm_table.nv_entries[WM_D].pmfw_breakdown.max_dcfclk = 
0x;
-// 
clk_mgr->base.bw_params->wm_table.nv_entries[WM_D].pmfw_breakdown.min_uclk = 
min_uclk_mhz;
-// 
clk_mgr->base.bw_params->wm_table.nv_entries[WM_D].pmfw_breakdown.max_uclk = 
0x;
+   clk_mgr->base.bw_params->wm_table.nv_entries[WM_D].valid = true;
+   
clk_mgr->base.bw_params->wm_table.nv_entries[WM_D].dml_input.pstate_latency_us 
= pstate_latency_us;
+   
clk_mgr->base.bw_params->wm_table.nv_entries[WM_D].dml_input.sr_exit_time_us = 
2;
+   
clk_mgr->base.bw_params->wm_table.nv_entries[WM_D].dml_input.sr_enter_plus_exit_time_us
 = 4;
+   
clk_mgr->base.bw_params->wm_table.nv_entries[WM_D].pmfw_breakdown.wm_type = 
WATERMARKS_MALL;
+   
clk_mgr->base.bw_params->wm_table.nv_entries[WM_D].pmfw_breakdown.min_dcfclk = 
0;
+   
clk_mgr->base.bw_params->wm_table.nv_entries[WM_D].pmfw_breakdown.max_dcfclk = 
0x;
+   
clk_mgr->base.bw_params->wm_table.nv_entries[WM_D].pmfw_breakdown.min_uclk = 
min_uclk_mhz;
+   
clk_mgr->base.bw_params->wm_table.nv_entries[WM_D].pmfw_breakdown.max_uclk = 
0x;
 }
 
 void dcn3_init_clocks(struct clk_mgr *clk_mgr_base)
-- 
2.25.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH 0/3] idle optimization patches (mall)

2021-01-19 Thread Bhawanpreet Lakha
Some MALL code is missing; this series updates it:
- enable watermark programming
- dynamic cursor cache
- updates to the MALL eligibility check

Bhawanpreet Lakha (3):
  drm/amd/display: Enable programming of MALL watermarks
  drm/amd/display: Dynamic cursor cache size for MALL eligibility check
  drm/amd/display: Update dcn30_apply_idle_power_optimizations() code

 .../display/dc/clk_mgr/dcn30/dcn30_clk_mgr.c  |  18 +-
 drivers/gpu/drm/amd/display/dc/core/dc.c  |   6 +-
 drivers/gpu/drm/amd/display/dc/dc.h   |   6 +-
 .../drm/amd/display/dc/dcn30/dcn30_hwseq.c| 182 ++
 .../drm/amd/display/dc/dcn30/dcn30_hwseq.h|   3 +-
 .../amd/display/dc/dcn302/dcn302_resource.c   |   5 +-
 .../gpu/drm/amd/display/dc/inc/hw_sequencer.h |   3 +-
 .../gpu/drm/amd/display/dmub/inc/dmub_cmd.h   |   5 +
 8 files changed, 171 insertions(+), 57 deletions(-)

-- 
2.25.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH v4 10/14] drm/amdgpu: Move some sysfs attrs creation to default_attr

2021-01-19 Thread Greg KH
On Tue, Jan 19, 2021 at 02:04:48PM -0500, Alex Deucher wrote:
> On Tue, Jan 19, 2021 at 1:26 PM Greg KH  wrote:
> >
> > On Tue, Jan 19, 2021 at 11:36:01AM -0500, Andrey Grodzovsky wrote:
> > >
> > > On 1/19/21 2:34 AM, Greg KH wrote:
> > > > On Mon, Jan 18, 2021 at 04:01:19PM -0500, Andrey Grodzovsky wrote:
> > > > >   static struct pci_driver amdgpu_kms_pci_driver = {
> > > > >   .name = DRIVER_NAME,
> > > > >   .id_table = pciidlist,
> > > > > @@ -1595,6 +1607,7 @@ static struct pci_driver amdgpu_kms_pci_driver 
> > > > > = {
> > > > >   .shutdown = amdgpu_pci_shutdown,
> > > > >   .driver.pm = &amdgpu_pm_ops,
> > > > >   .err_handler = &amdgpu_pci_err_handler,
> > > > > + .driver.dev_groups = amdgpu_sysfs_groups,
> > > > Shouldn't this just be:
> > > > groups = amdgpu_sysfs_groups,
> > > >
> > > > Why go to the "driver root" here?
> > >
> > >
> > > Because I still didn't get to your suggestion to propose a patch to add 
> > > groups to
> > > pci_driver, it's located in 'base' driver struct.
> >
> > You are a pci driver, you should never have to mess with the "base"
> > driver struct.  Look at commit 92d50fc1602e ("PCI/IB: add support for
> > pci driver attribute groups") which got merged in 4.14, way back in
> > 2017 :)
> 
> Per the previous discussion of this patch set:
> https://www.mail-archive.com/amd-gfx@lists.freedesktop.org/msg56019.html

Hey, at least I'm consistent :)
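
For illustration, the driver-level attribute-group wiring that commit
enables looks roughly like this (the attribute here is hypothetical):

	static ssize_t foo_show(struct device_driver *drv, char *buf)
	{
		return sysfs_emit(buf, "bar\n");
	}
	static DRIVER_ATTR_RO(foo);

	static struct attribute *amdgpu_sysfs_attrs[] = {
		&driver_attr_foo.attr,
		NULL,
	};
	ATTRIBUTE_GROUPS(amdgpu_sysfs);

	static struct pci_driver amdgpu_kms_pci_driver = {
		.name     = DRIVER_NAME,
		.id_table = pciidlist,
		.groups   = amdgpu_sysfs_groups,
	};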
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH v4 11/14] drm/amdgpu: Guard against write accesses after device removal

2021-01-19 Thread Andrey Grodzovsky


On 1/19/21 1:59 PM, Christian König wrote:

Am 19.01.21 um 19:22 schrieb Andrey Grodzovsky:


On 1/19/21 1:05 PM, Daniel Vetter wrote:

On Tue, Jan 19, 2021 at 4:35 PM Andrey Grodzovsky
 wrote:

There is really no other way according to this article
https://lwn.net/Articles/767885/



"A perfect solution seems nearly impossible though; we cannot acquire a 
mutex on

the user
to prevent them from yanking a device and we cannot check for a presence 
change

after every
device access for performance reasons. "

But I assumed srcu_read_lock should be pretty seamless performance wise, no ?

The read side is supposed to be dirt cheap, the write side is were we
just stall for all readers to eventually complete on their own.
Definitely should be much cheaper than mmio read, on the mmio write
side it might actually hurt a bit. Otoh I think those don't stall the
cpu by default when they're timing out, so maybe if the overhead is
too much for those, we could omit them?

Maybe just do a small microbenchmark for these for testing, with a
register that doesn't change hw state. So with and without
drm_dev_enter/exit, and also one with the hw plugged out so that we
have actual timeouts in the transactions.
-Daniel



So, say, writing to some harmless scratch register in a loop many times,
both for the plugged and the unplugged case, and measuring the total
time delta?


I think we should at least measure the following:

1. Writing X times to a scratch reg without your patch.
2. Writing X times to a scratch reg with your patch.
3. Writing X times to a scratch reg with the hardware physically disconnected.

I suggest to repeat that once for Polaris (or older) and once for Vega or Navi.

The SRBM on Polaris is meant to introduce some delay in each access, so it 
might react differently then the newer hardware.
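
A rough sketch of such a loop (the register offset and iteration count
are placeholders):

	const int n = 1000000;
	ktime_t start;
	s64 elapsed_ns;
	int i;

	start = ktime_get();
	for (i = 0; i < n; i++)
		WREG32(scratch_reg_offset, 0xCAFEDEAD); /* harmless scratch reg */
	elapsed_ns = ktime_to_ns(ktime_sub(ktime_get(), start));

	dev_info(adev->dev, "%d scratch writes: %lld ns (%lld ns/write)\n",
		 n, elapsed_ns, elapsed_ns / n);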


Christian.



Will do.

Andrey






Andrey





The other solution would be as I suggested to keep all the device IO ranges
reserved and system
memory pages unfreed until the device is finalized in the driver but Daniel 
said

this would upset the PCI layer (the MMIO ranges reservation part).

Andrey




On 1/19/21 3:55 AM, Christian König wrote:

Am 18.01.21 um 22:01 schrieb Andrey Grodzovsky:

This should prevent writing to memory or IO ranges possibly
already allocated for other uses after our device is removed.

Wow, that adds quite some overhead to every register access. I'm not sure we
can do this.

Christian.


Signed-off-by: Andrey Grodzovsky 
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 57 
   drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c    |  9 
   drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c    | 53 +-
   drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h    |  3 ++
   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c   | 70 
++

   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   | 49 ++---
   drivers/gpu/drm/amd/amdgpu/psp_v11_0.c | 16 ++-
   drivers/gpu/drm/amd/amdgpu/psp_v12_0.c |  8 +---
   drivers/gpu/drm/amd/amdgpu/psp_v3_1.c  |  8 +---
   9 files changed, 184 insertions(+), 89 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index e99f4f1..0a9d73c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -72,6 +72,8 @@
 #include 
   +#include 
+
   MODULE_FIRMWARE("amdgpu/vega10_gpu_info.bin");
   MODULE_FIRMWARE("amdgpu/vega12_gpu_info.bin");
   MODULE_FIRMWARE("amdgpu/raven_gpu_info.bin");
@@ -404,13 +406,21 @@ uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev,
uint32_t offset)
    */
   void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t
value)
   {
+    int idx;
+
   if (adev->in_pci_err_recovery)
   return;
   +
+    if (!drm_dev_enter(&adev->ddev, &idx))
+    return;
+
   if (offset < adev->rmmio_size)
   writeb(value, adev->rmmio + offset);
   else
   BUG();
+
+    drm_dev_exit(idx);
   }
 /**
@@ -427,9 +437,14 @@ void amdgpu_device_wreg(struct amdgpu_device *adev,
   uint32_t reg, uint32_t v,
   uint32_t acc_flags)
   {
+    int idx;
+
   if (adev->in_pci_err_recovery)
   return;
   +    if (!drm_dev_enter(&adev->ddev, &idx))
+    return;
+
   if ((reg * 4) < adev->rmmio_size) {
   if (!(acc_flags & AMDGPU_REGS_NO_KIQ) &&
   amdgpu_sriov_runtime(adev) &&
@@ -444,6 +459,8 @@ void amdgpu_device_wreg(struct amdgpu_device *adev,
   }
trace_amdgpu_device_wreg(adev->pdev->device, reg, v);
+
+    drm_dev_exit(idx);

Re: [PATCH v4 10/14] drm/amdgpu: Move some sysfs attrs creation to default_attr

2021-01-19 Thread Andrey Grodzovsky



On 1/19/21 2:04 PM, Alex Deucher wrote:

On Tue, Jan 19, 2021 at 1:26 PM Greg KH  wrote:

On Tue, Jan 19, 2021 at 11:36:01AM -0500, Andrey Grodzovsky wrote:

On 1/19/21 2:34 AM, Greg KH wrote:

On Mon, Jan 18, 2021 at 04:01:19PM -0500, Andrey Grodzovsky wrote:

   static struct pci_driver amdgpu_kms_pci_driver = {
   .name = DRIVER_NAME,
   .id_table = pciidlist,
@@ -1595,6 +1607,7 @@ static struct pci_driver amdgpu_kms_pci_driver = {
   .shutdown = amdgpu_pci_shutdown,
   .driver.pm = &amdgpu_pm_ops,
   .err_handler = &amdgpu_pci_err_handler,
+ .driver.dev_groups = amdgpu_sysfs_groups,

Shouldn't this just be:
 groups = amdgpu_sysfs_groups,

Why go to the "driver root" here?


Because I still didn't get to your suggestion to propose a patch adding
groups to pci_driver; for now it's located in the 'base' driver struct.

You are a pci driver, you should never have to mess with the "base"
driver struct.  Look at commit 92d50fc1602e ("PCI/IB: add support for
pci driver attribute groups") which got merged in 4.14, way back in
2017 :)

Per the previous discussion of this patch set:
https://www.mail-archive.com/amd-gfx@lists.freedesktop.org/msg56019.html

Alex



Got it. Next iteration I will include a patch like the above to pci-devel as
part of the series and will update this patch accordingly.


Andrey





driver.pm also looks odd, but I'm just going to ignore that for now...

thanks,

greg k-h
___
dri-devel mailing list
dri-de...@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH v4 10/14] drm/amdgpu: Move some sysfs attrs creation to default_attr

2021-01-19 Thread Greg KH
On Tue, Jan 19, 2021 at 11:36:01AM -0500, Andrey Grodzovsky wrote:
> 
> On 1/19/21 2:34 AM, Greg KH wrote:
> > On Mon, Jan 18, 2021 at 04:01:19PM -0500, Andrey Grodzovsky wrote:
> > >   static struct pci_driver amdgpu_kms_pci_driver = {
> > >   .name = DRIVER_NAME,
> > >   .id_table = pciidlist,
> > > @@ -1595,6 +1607,7 @@ static struct pci_driver amdgpu_kms_pci_driver = {
> > >   .shutdown = amdgpu_pci_shutdown,
> > >   .driver.pm = &amdgpu_pm_ops,
> > >   .err_handler = &amdgpu_pci_err_handler,
> > > + .driver.dev_groups = amdgpu_sysfs_groups,
> > Shouldn't this just be:
> > groups = amdgpu_sysfs_groups,
> > 
> > Why go to the "driver root" here?
> 
> 
> Because I still didn't get to your suggestion to propose a patch adding
> groups to pci_driver; for now it's located in the 'base' driver struct.

You are a pci driver, you should never have to mess with the "base"
driver struct.  Look at commit 92d50fc1602e ("PCI/IB: add support for
pci driver attribute groups") which got merged in 4.14, way back in
2017 :)
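
For illustration, the attribute-group wiring being discussed looks roughly
like this (a sketch; the example_info attribute is made up, and the patch
under review wires the groups via the embedded struct device_driver):

static ssize_t example_info_show(struct device *dev,
				 struct device_attribute *attr, char *buf)
{
	return sysfs_emit(buf, "example\n");	/* hypothetical attribute */
}
static DEVICE_ATTR_RO(example_info);

static struct attribute *amdgpu_sysfs_attrs[] = {
	&dev_attr_example_info.attr,
	NULL,
};
ATTRIBUTE_GROUPS(amdgpu_sysfs);		/* emits amdgpu_sysfs_groups[] */

static struct pci_driver amdgpu_kms_pci_driver = {
	.name = DRIVER_NAME,
	/* ... */
	.driver.dev_groups = amdgpu_sysfs_groups,
};

The point of default attribute groups is that the driver core creates and
removes these files around bind/unbind, so the driver never has to manage
their lifetime (or the associated races) by hand.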

driver.pm also looks odd, but I'm just going to ignore that for now...

thanks,

greg k-h
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH v4 10/14] drm/amdgpu: Move some sysfs attrs creation to default_attr

2021-01-19 Thread Alex Deucher
On Tue, Jan 19, 2021 at 1:26 PM Greg KH  wrote:
>
> On Tue, Jan 19, 2021 at 11:36:01AM -0500, Andrey Grodzovsky wrote:
> >
> > On 1/19/21 2:34 AM, Greg KH wrote:
> > > On Mon, Jan 18, 2021 at 04:01:19PM -0500, Andrey Grodzovsky wrote:
> > > >   static struct pci_driver amdgpu_kms_pci_driver = {
> > > >   .name = DRIVER_NAME,
> > > >   .id_table = pciidlist,
> > > > @@ -1595,6 +1607,7 @@ static struct pci_driver amdgpu_kms_pci_driver = {
> > > >   .shutdown = amdgpu_pci_shutdown,
> > > >   .driver.pm = &amdgpu_pm_ops,
> > > >   .err_handler = &amdgpu_pci_err_handler,
> > > > + .driver.dev_groups = amdgpu_sysfs_groups,
> > > Shouldn't this just be:
> > > groups = amdgpu_sysfs_groups,
> > >
> > > Why go to the "driver root" here?
> >
> >
> > Because I still didn't get to your suggestion to propose a patch adding
> > groups to pci_driver; for now it's located in the 'base' driver struct.
>
> You are a pci driver, you should never have to mess with the "base"
> driver struct.  Look at commit 92d50fc1602e ("PCI/IB: add support for
> pci driver attribute groups") which got merged in 4.14, way back in
> 2017 :)

Per the previous discussion of this patch set:
https://www.mail-archive.com/amd-gfx@lists.freedesktop.org/msg56019.html

Alex

>
> driver.pm also looks odd, but I'm just going to ignore that for now...
>
> thanks,
>
> greg k-h
> ___
> dri-devel mailing list
> dri-de...@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH v4 11/14] drm/amdgpu: Guard against write accesses after device removal

2021-01-19 Thread Christian König

Am 19.01.21 um 19:22 schrieb Andrey Grodzovsky:


On 1/19/21 1:05 PM, Daniel Vetter wrote:

On Tue, Jan 19, 2021 at 4:35 PM Andrey Grodzovsky
 wrote:

There is really no other way according to this article
https://lwn.net/Articles/767885/



"A perfect solution seems nearly impossible though; we cannot 
acquire a mutex on

the user
to prevent them from yanking a device and we cannot check for a 
presence change

after every
device access for performance reasons. "

But I assumed srcu_read_lock should be pretty seamless performance-wise,
no?

The read side is supposed to be dirt cheap, the write side is where we
just stall for all readers to eventually complete on their own.
Definitely should be much cheaper than mmio read, on the mmio write
side it might actually hurt a bit. Otoh I think those don't stall the
cpu by default when they're timing out, so maybe if the overhead is
too much for those, we could omit them?

Maybe just do a small microbenchmark for these for testing, with a
register that doesn't change hw state. So with and without
drm_dev_enter/exit, and also one with the hw plugged out so that we
have actual timeouts in the transactions.
-Daniel



So, say, writing in a loop to some harmless scratch register many times,
both for the plugged and unplugged cases, and measuring the total time
delta?


I think we should at least measure the following:

1. Writing X times to a scratch reg without your patch.
2. Writing X times to a scratch reg with your patch.
3. Writing X times to a scratch reg with the hardware physically 
disconnected.


I suggest repeating that once for Polaris (or older) and once for Vega
or Navi.


The SRBM on Polaris is meant to introduce some delay in each access, so
it might react differently than the newer hardware.


Christian.



Andrey




The other solution would be, as I suggested, to keep all the device IO
ranges reserved and system memory pages unfreed until the device is
finalized in the driver, but Daniel said this would upset the PCI layer
(the MMIO ranges reservation part).

Andrey




On 1/19/21 3:55 AM, Christian König wrote:

Am 18.01.21 um 22:01 schrieb Andrey Grodzovsky:

This should prevent writing to memory or IO ranges possibly
already allocated for other uses after our device is removed.
Wow, that adds quite some overhead to every register access. I'm 
not sure we

can do this.

Christian.


Signed-off-by: Andrey Grodzovsky 
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 57 


   drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c    |  9 
   drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c    | 53 
+-

   drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h    |  3 ++
   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c   | 70 
++
   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   | 49 
++---

   drivers/gpu/drm/amd/amdgpu/psp_v11_0.c | 16 ++-
   drivers/gpu/drm/amd/amdgpu/psp_v12_0.c |  8 +---
   drivers/gpu/drm/amd/amdgpu/psp_v3_1.c  |  8 +---
   9 files changed, 184 insertions(+), 89 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index e99f4f1..0a9d73c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -72,6 +72,8 @@
 #include 
   +#include 
+
   MODULE_FIRMWARE("amdgpu/vega10_gpu_info.bin");
   MODULE_FIRMWARE("amdgpu/vega12_gpu_info.bin");
   MODULE_FIRMWARE("amdgpu/raven_gpu_info.bin");
@@ -404,13 +406,21 @@ uint8_t amdgpu_mm_rreg8(struct amdgpu_device 
*adev,

uint32_t offset)
    */
   void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t 
offset, uint8_t

value)
   {
+    int idx;
+
   if (adev->in_pci_err_recovery)
   return;
   +
+    if (!drm_dev_enter(&adev->ddev, &idx))
+    return;
+
   if (offset < adev->rmmio_size)
   writeb(value, adev->rmmio + offset);
   else
   BUG();
+
+    drm_dev_exit(idx);
   }
 /**
@@ -427,9 +437,14 @@ void amdgpu_device_wreg(struct amdgpu_device 
*adev,

   uint32_t reg, uint32_t v,
   uint32_t acc_flags)
   {
+    int idx;
+
   if (adev->in_pci_err_recovery)
   return;
   +    if (!drm_dev_enter(&adev->ddev, &idx))
+    return;
+
   if ((reg * 4) < adev->rmmio_size) {
   if (!(acc_flags & AMDGPU_REGS_NO_KIQ) &&
   amdgpu_sriov_runtime(adev) &&
@@ -444,6 +459,8 @@ void amdgpu_device_wreg(struct amdgpu_device 
*adev,

   }
trace_amdgpu_device_wreg(adev->pdev->device, reg, v);
+
+    drm_dev_exit(idx);
   }
 /*
@@ -454,9 +471,14 @@ void amdgpu_de

Re: [PATCH v3 1/3] drm/amd/display: Add module parameter for freesync video mode

2021-01-19 Thread Daniel Vetter
On Tue, Jan 19, 2021 at 5:08 PM Pillai, Aurabindo
 wrote:
>
> [AMD Official Use Only - Internal Distribution Only]
>
>
> Hi Daniel,
>
> Could you please be more specific about the _unsafe API options you
> mentioned?

module_param_named_unsafe()

Cheers, Daniel
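
As a sketch of what that would look like (the parameter name here is
hypothetical; module_param_named_unsafe() warns and taints the kernel when
the option is actually set):

static bool freesync_video;
module_param_named_unsafe(freesync_video, freesync_video, bool, 0444);
MODULE_PARM_DESC(freesync_video,
		 "Enable experimental freesync video mode generation (taints kernel when set)");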

>
> --
>
> Thanks & Regards,
> Aurabindo Pillai
> 
> From: Daniel Vetter 
> Sent: Tuesday, January 19, 2021 8:11 AM
> To: Pekka Paalanen 
> Cc: Pillai, Aurabindo ; amd-gfx list 
> ; dri-devel ; 
> Kazlauskas, Nicholas ; Wang, Chao-kai (Stylon) 
> ; Thai, Thong ; Sharma, Shashank 
> ; Lin, Wayne ; Deucher, Alexander 
> ; Koenig, Christian 
> Subject: Re: [PATCH v3 1/3] drm/amd/display: Add module parameter for 
> freesync video mode
>
> On Tue, Jan 19, 2021 at 9:35 AM Pekka Paalanen  wrote:
> >
> > On Mon, 18 Jan 2021 09:36:47 -0500
> > Aurabindo Pillai  wrote:
> >
> > > On Thu, 2021-01-14 at 11:14 +0200, Pekka Paalanen wrote:
> > > >
> > > > Hi,
> > > >
> > > > please document somewhere that ends up in git history (commit
> > > > message,
> > > > code comments, description of the parameter would be the best but
> > > > maybe
> > > > there isn't enough space?) what Christian König explained in
> > > >
> > > >
> > > > https://lists.freedesktop.org/archives/dri-devel/2020-December/291254.html
> > > >
> > > > that this is a stop-gap feature intended to be removed as soon as
> > > > possible (when a better solution comes up, which could be years).
> > > >
> > > > So far I have not seen a single mention of this intention in your
> > > > patch
> > > > submissions, and I think it is very important to make known.
> > >
> > > Hi,
> > >
> > > Thanks for the headsup, I shall add the relevant info in the next
> > > verison.
> > >
> > > >
> > > > I also did not see an explanation of why this instead of
> > > > manufacturing
> > > > these video modes in userspace (an idea mentioned by Christian in the
> > > > referenced email). I think that too should be part of a commit
> > > > message.
> > >
> > > This is an opt-in feature, which shall be superseded by a better
> > > solution. We also add a set of common modes for scaling similarly.
> > > Userspace can still add whatever mode they want. So I don't see a reason
> > > why this can't be in the kernel.
> >
> > Hi,
> >
> > sorry, I think that kind of thinking is backwards. There needs to be a
> > reason to put something in the kernel, and if there is no reason, then
> > it remains in userspace. So what's the reason to put this in the kernel?
> >
> > One example reason why this should not be in the kernel is that the set
> > of video modes to manufacture is a kind of policy, which modes to add
> > and which not. Userspace knows what modes it needs, and establishing
> > the modes in the kernel instead is second-guessing what the userspace
> > would want. So if userspace needs to manufacture modes in userspace
> > anyway as some modes might be missed by the kernel, then why bother in
> > the kernel to begin with? Why should the kernel play catch-up with what
> > modes userspace wants when we already have everything userspace needs
> > to make its own modes, even to add them to the kernel mode list?
> >
> > Does manufacturing these extra video modes to achieve fast timing
> > changes require AMD hardware-specific knowledge, as opposed to the
> > general VRR approach of simply adjusting the front porch?
> >
> > Something like this should also be documented in a commit message. Or
> > if you insist that "no reason to not put this in the kernel" is reason
> > enough, then write that down, because it does not seem obvious to me or
> > others that this feature needs to be in the kernel.
>
> One reason might be debugging, if a feature is known to cause issues.
> But imo in that case the knob should be using the _unsafe variants so
> it taints the kernel, since otherwise we get stuck in this very cozy
> place where kernel maintainers don't have to care much for bugs
> "because it's off by default", but also not really care about
> polishing the feature "since users can just enable it if they want
> it". Just a slightly different flavour of what you're explaining above
> already.
> -Daniel
>
> > Thanks,
> > pq
>
>
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

Re: [PATCH v4 12/14] drm/scheduler: Job timeout handler returns status

2021-01-19 Thread Christian König

Am 19.01.21 um 18:47 schrieb Luben Tuikov:

On 2021-01-19 2:53 a.m., Christian König wrote:

Am 18.01.21 um 22:01 schrieb Andrey Grodzovsky:

From: Luben Tuikov 

This patch does not change current behaviour.

The driver's job timeout handler now returns
status indicating back to the DRM layer whether
the task (job) was successfully aborted or whether
more time should be given to the task to complete.

Default behaviour as of this patch, is preserved,
except in obvious-by-comment case in the Panfrost
driver, as documented below.

All drivers which make use of the
drm_sched_backend_ops' .timedout_job() callback
have been accordingly renamed and return the
would've-been default value of
DRM_TASK_STATUS_ALIVE to restart the task's
timeout timer--this is the old behaviour, and
is preserved by this patch.

In the case of the Panfrost driver, its timedout
callback correctly first checks if the job had
completed in due time and if so, it now returns
DRM_TASK_STATUS_COMPLETE to notify the DRM layer
that the task can be moved to the done list, to be
freed later. In the other two subsequent checks,
the value of DRM_TASK_STATUS_ALIVE is returned, as
per the default behaviour.

More involved driver solutions can be had
in subsequent patches.

v2: Use enum as the status of a driver's job
  timeout callback method.

v4: (By Andrey Grodzovsky)
Replace DRM_TASK_STATUS_COMPLETE with DRM_TASK_STATUS_ENODEV
to enable a hint to the scheduler for when NOT to rearm the
timeout timer.

As Lukas pointed out, returning the job (or task) status doesn't make
much sense.

What we return here is the status of the scheduler.

I would either rename the enum or completely drop it and return a
negative error status.

Yes, that could be had.

Although dropping the enum and returning [-1, 0] might
make the meaning of the return status vague. Using an enum with an
appropriate name makes the intention clear to the next programmer.


Completely agree, but -ENODEV and 0 could work.

On the other hand using DRM_SCHED_* is perfectly fine with me as well.

Christian.



Now, Andrey did rename one of the enumerated values to
DRM_TASK_STATUS_ENODEV, perhaps the same but with:

enum drm_sched_status {
     DRM_SCHED_STAT_NONE, /* Reserve 0 */
     DRM_SCHED_STAT_NOMINAL,
     DRM_SCHED_STAT_ENODEV,
};

and also renaming the enum to the above would be acceptable?
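
With the renamed enum, a driver's handler would then read roughly like this
(illustrative sketch only; the example_* checks are hypothetical, and the
patch below still uses the DRM_TASK_STATUS_* names):

static enum drm_sched_status example_job_timedout(struct drm_sched_job *s_job)
{
	if (example_job_signaled(s_job))	/* hypothetical: job finished after all */
		return DRM_SCHED_STAT_NOMINAL;

	if (example_device_unplugged(s_job))	/* hypothetical hotplug check */
		return DRM_SCHED_STAT_ENODEV;	/* hint: do not rearm the timeout timer */

	example_recover_gpu(s_job);		/* hypothetical reset path */
	return DRM_SCHED_STAT_NOMINAL;
}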

Regards,
Luben


Apart from that looks fine to me,
Christian.



Cc: Alexander Deucher 
Cc: Andrey Grodzovsky 
Cc: Christian König 
Cc: Daniel Vetter 
Cc: Lucas Stach 
Cc: Russell King 
Cc: Christian Gmeiner 
Cc: Qiang Yu 
Cc: Rob Herring 
Cc: Tomeu Vizoso 
Cc: Steven Price 
Cc: Alyssa Rosenzweig 
Cc: Eric Anholt 
Reported-by: kernel test robot 
Signed-off-by: Luben Tuikov 
Signed-off-by: Andrey Grodzovsky 
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c |  6 --
   drivers/gpu/drm/etnaviv/etnaviv_sched.c | 10 +-
   drivers/gpu/drm/lima/lima_sched.c   |  4 +++-
   drivers/gpu/drm/panfrost/panfrost_job.c |  9 ++---
   drivers/gpu/drm/scheduler/sched_main.c  |  4 +---
   drivers/gpu/drm/v3d/v3d_sched.c | 32 +---
   include/drm/gpu_scheduler.h | 17 ++---
   7 files changed, 54 insertions(+), 28 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index ff48101..a111326 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -28,7 +28,7 @@
   #include "amdgpu.h"
   #include "amdgpu_trace.h"
   
-static void amdgpu_job_timedout(struct drm_sched_job *s_job)

+static enum drm_task_status amdgpu_job_timedout(struct drm_sched_job *s_job)
   {
struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched);
struct amdgpu_job *job = to_amdgpu_job(s_job);
@@ -41,7 +41,7 @@ static void amdgpu_job_timedout(struct drm_sched_job *s_job)
	amdgpu_ring_soft_recovery(ring, job->vmid, s_job->s_fence->parent))
{
DRM_ERROR("ring %s timeout, but soft recovered\n",
  s_job->sched->name);
-   return;
+   return DRM_TASK_STATUS_ALIVE;
}
   
	amdgpu_vm_get_task_info(ring->adev, job->pasid, &ti);

@@ -53,10 +53,12 @@ static void amdgpu_job_timedout(struct drm_sched_job *s_job)
   
   	if (amdgpu_device_should_recover_gpu(ring->adev)) {

amdgpu_device_gpu_recover(ring->adev, job);
+   return DRM_TASK_STATUS_ALIVE;
} else {
drm_sched_suspend_timeout(&ring->sched);
if (amdgpu_sriov_vf(adev))
adev->virt.tdr_debug = true;
+   return DRM_TASK_STATUS_ALIVE;
}
   }
   
diff --git a/drivers/gpu/drm/etnaviv/etnaviv_sched.c b/drivers/gpu/drm/etnaviv/etnaviv_sched.c

index cd46c88..c495169 100644
--- a/drivers/gpu/drm/etnaviv/etnaviv_sched.c
+++ b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
@@ -82,7 +82,8 @@ static struct dma_fence *etnaviv_sched_run_jo

Re: [PATCH v4 11/14] drm/amdgpu: Guard against write accesses after device removal

2021-01-19 Thread Andrey Grodzovsky


On 1/19/21 1:05 PM, Daniel Vetter wrote:

On Tue, Jan 19, 2021 at 4:35 PM Andrey Grodzovsky
 wrote:

There is really no other way according to this article
https://lwn.net/Articles/767885/

"A perfect solution seems nearly impossible though; we cannot acquire a mutex on
the user
to prevent them from yanking a device and we cannot check for a presence change
after every
device access for performance reasons. "

But I assumed srcu_read_lock should be pretty seamless performance-wise, no?

The read side is supposed to be dirt cheap, the write side is where we
just stall for all readers to eventually complete on their own.
Definitely should be much cheaper than mmio read, on the mmio write
side it might actually hurt a bit. Otoh I think those don't stall the
cpu by default when they're timing out, so maybe if the overhead is
too much for those, we could omit them?

Maybe just do a small microbenchmark for these for testing, with a
register that doesn't change hw state. So with and without
drm_dev_enter/exit, and also one with the hw plugged out so that we
have actual timeouts in the transactions.
-Daniel



So, say, writing in a loop to some harmless scratch register many times,
both for the plugged and unplugged cases, and measuring the total time
delta?

Andrey





The other solution would be, as I suggested, to keep all the device IO
ranges reserved and system memory pages unfreed until the device is
finalized in the driver, but Daniel said this would upset the PCI layer
(the MMIO ranges reservation part).

Andrey




On 1/19/21 3:55 AM, Christian König wrote:

Am 18.01.21 um 22:01 schrieb Andrey Grodzovsky:

This should prevent writing to memory or IO ranges possibly
already allocated for other uses after our device is removed.

Wow, that adds quite some overhead to every register access. I'm not sure we
can do this.

Christian.


Signed-off-by: Andrey Grodzovsky 
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 57 
   drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c|  9 
   drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c| 53 +-
   drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h|  3 ++
   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c   | 70 
++
   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   | 49 ++---
   drivers/gpu/drm/amd/amdgpu/psp_v11_0.c | 16 ++-
   drivers/gpu/drm/amd/amdgpu/psp_v12_0.c |  8 +---
   drivers/gpu/drm/amd/amdgpu/psp_v3_1.c  |  8 +---
   9 files changed, 184 insertions(+), 89 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index e99f4f1..0a9d73c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -72,6 +72,8 @@
 #include 
   +#include 
+
   MODULE_FIRMWARE("amdgpu/vega10_gpu_info.bin");
   MODULE_FIRMWARE("amdgpu/vega12_gpu_info.bin");
   MODULE_FIRMWARE("amdgpu/raven_gpu_info.bin");
@@ -404,13 +406,21 @@ uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev,
uint32_t offset)
*/
   void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t
value)
   {
+int idx;
+
   if (adev->in_pci_err_recovery)
   return;
   +
+if (!drm_dev_enter(&adev->ddev, &idx))
+return;
+
   if (offset < adev->rmmio_size)
   writeb(value, adev->rmmio + offset);
   else
   BUG();
+
+drm_dev_exit(idx);
   }
 /**
@@ -427,9 +437,14 @@ void amdgpu_device_wreg(struct amdgpu_device *adev,
   uint32_t reg, uint32_t v,
   uint32_t acc_flags)
   {
+int idx;
+
   if (adev->in_pci_err_recovery)
   return;
   +if (!drm_dev_enter(&adev->ddev, &idx))
+return;
+
   if ((reg * 4) < adev->rmmio_size) {
   if (!(acc_flags & AMDGPU_REGS_NO_KIQ) &&
   amdgpu_sriov_runtime(adev) &&
@@ -444,6 +459,8 @@ void amdgpu_device_wreg(struct amdgpu_device *adev,
   }
 trace_amdgpu_device_wreg(adev->pdev->device, reg, v);
+
+drm_dev_exit(idx);
   }
 /*
@@ -454,9 +471,14 @@ void amdgpu_device_wreg(struct amdgpu_device *adev,
   void amdgpu_mm_wreg_mmio_rlc(struct amdgpu_device *adev,
uint32_t reg, uint32_t v)
   {
+int idx;
+
   if (adev->in_pci_err_recovery)
   return;
   +if (!drm_dev_enter(&adev->ddev, &idx))
+return;
+
    if (amdgpu_sriov_fullaccess(adev) &&
    adev->gfx.rlc.funcs &&
   adev->gfx.rlc.funcs->is_rlcg_access_range) {
@@ -465,6 +487,8 @@ void amdgpu_mm_wreg_mmio_rlc(struct amdgpu_device *adev,
   } else {
   writel(

Re: [PATCH v4 00/14] RFC Support hot device unplug in amdgpu

2021-01-19 Thread Andrey Grodzovsky



On 1/19/21 1:08 PM, Daniel Vetter wrote:

On Tue, Jan 19, 2021 at 6:31 PM Andrey Grodzovsky
 wrote:


On 1/19/21 9:16 AM, Daniel Vetter wrote:

On Mon, Jan 18, 2021 at 04:01:09PM -0500, Andrey Grodzovsky wrote:

Until now extracting a card either by physical extraction (e.g. eGPU with
thunderbolt connection or by emulation through sysfs ->
/sys/bus/pci/devices/device_id/remove)
would cause random crashes in user apps. The random crashes in apps were
mostly due to the app having mapped a device backed BO into its address
space was still trying to access the BO while the backing device was gone.
To answer this first problem, Christian suggested to fix the handling of mapped
memory in the clients when the device goes away by forcibly unmapping all
buffers the user processes have, by clearing their respective VMAs mapping the
device BOs. Then, when the VMAs try to fill in the page tables again, we check
in the fault handler if the device is removed and, if so, return an error. This
will generate a SIGBUS to the application, which can then cleanly terminate.
This indeed was done, but it in turn created a problem of kernel OOPSes, where
the OOPSes were due to the fact that while the app was terminating because of
the SIGBUS it would trigger a use-after-free in the driver by accessing device
structures that were already released from the pci remove sequence. This was
handled by introducing a 'flush' sequence during device removal where we wait
for the drm file reference to drop to 0, meaning all user clients directly
using this device have terminated.

v2:
Based on discussions in the mailing list with Daniel and Pekka [1] and based on 
the document
produced by Pekka from those discussions [2] the whole approach with returning 
SIGBUS and
waiting for all user clients having CPU mapping of device BOs to die was 
dropped.
Instead as per the document suggestion the device structures are kept alive 
until
the last reference to the device is dropped by user client and in the meanwhile 
all existing and new CPU mappings of the BOs
belonging to the device directly or by dma-buf import are rerouted to per user
process dummy rw page. Also, I skipped the 'Requirements for KMS UAPI' section
of [2] since I am trying to get the minimal set of requirements that still
give a useful solution to work, and this is the 'Requirements for Render and
Cross-Device UAPI' section, and so my
test case is removing a secondary device, which is render only and is not 
involved
in KMS.

v3:
More updates following comments from v2 such as removing loop to find DRM file 
when rerouting
page faults to dummy page, getting rid of unnecessary sysfs handling refactoring
and moving
prevention of GPU recovery post device unplug from amdgpu to scheduler layer.
On top of that added unplug support for the IOMMU enabled system.

v4:
Drop last sysfs hack and use sysfs default attribute.
Guard against write accesses after device removal to avoid modifying released 
memory.
Update dummy pages handling to on demand allocation and release through drm 
managed framework.
Add return value to scheduler job TO handler (by Luben Tuikov) and use this in 
amdgpu for prevention
of GPU recovery post device unplug
Also rebase on top of drm-misc-next instead of amd-staging-drm-next

With these patches I am able to gracefully remove the secondary card using 
sysfs remove hook while glxgears
is running off of secondary card (DRI_PRIME=1) without kernel oopses or hangs 
and keep working
with the primary card or soft reset the device without hangs or oopses

TODOs for followup work:
Convert AMDGPU code to use devm (for hw stuff) and drmm (for sw stuff and 
allocations) (Daniel)
Support plugging the secondary device back after unplug - currently still 
experiencing HW error on plugging back.
Add support for 'Requirements for KMS UAPI' section of [2] - unplugging 
primary, display connected card.

[1] - Discussions during v3 of the patchset 
https://www.spinics.net/lists/amd-gfx/msg55576.html
[2] - drm/doc: device hot-unplug for userspace 
https://www.spinics.net/lists/dri-devel/msg259755.html
[3] - Related gitlab ticket 
https://gitlab.freedesktop.org/drm/amd/-/issues/1081
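
The dummy-page rerouting described in v2 above could look roughly like this
in the CPU fault handler (a hedged sketch; all example_* helpers are
hypothetical):

static vm_fault_t example_bo_vm_fault(struct vm_fault *vmf)
{
	struct drm_device *ddev = example_vma_to_drm_dev(vmf->vma);	/* hypothetical */
	vm_fault_t ret;
	int idx;

	if (drm_dev_enter(ddev, &idx)) {
		ret = example_fault_in_bo_pages(vmf);	/* normal path: map device BO pages */
		drm_dev_exit(idx);
	} else {
		/* Device is gone: back the mapping with a dummy rw page instead
		 * of raising SIGBUS, so the app can keep running until it exits. */
		ret = vmf_insert_pfn(vmf->vma, vmf->address,
				     page_to_pfn(example_get_dummy_page(vmf->vma)));
	}
	return ret;
}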

Re: [PATCH v4 00/14] RFC Support hot device unplug in amdgpu

2021-01-19 Thread Daniel Vetter
On Tue, Jan 19, 2021 at 6:31 PM Andrey Grodzovsky
 wrote:
>
>
> On 1/19/21 9:16 AM, Daniel Vetter wrote:
> > On Mon, Jan 18, 2021 at 04:01:09PM -0500, Andrey Grodzovsky wrote:
> >> Until now extracting a card either by physical extraction (e.g. eGPU with
> >> thunderbolt connection or by emulation through sysfs ->
> >> /sys/bus/pci/devices/device_id/remove)
> >> would cause random crashes in user apps. The random crashes in apps were
> >> mostly due to the app having mapped a device backed BO into its address
> >> space was still trying to access the BO while the backing device was gone.
> >> To answer this first problem, Christian suggested to fix the handling of
> >> mapped memory in the clients when the device goes away by forcibly
> >> unmapping all buffers the user processes have, by clearing their
> >> respective VMAs mapping the device BOs. Then, when the VMAs try to fill
> >> in the page tables again, we check in the fault handler if the device is
> >> removed and, if so, return an error. This will generate a SIGBUS to the
> >> application, which can then cleanly terminate. This indeed was done, but
> >> it in turn created a problem of kernel OOPSes, where the OOPSes were due
> >> to the fact that while the app was terminating because of the SIGBUS it
> >> would trigger a use-after-free in the driver by accessing device
> >> structures that were already released from the pci remove sequence. This
> >> was handled by introducing a 'flush' sequence during device removal where
> >> we wait for the drm file reference to drop to 0, meaning all user clients
> >> directly using this device have terminated.
> >>
> >> v2:
> >> Based on discussions in the mailing list with Daniel and Pekka [1] and 
> >> based on the document
> >> produced by Pekka from those discussions [2] the whole approach with 
> >> returning SIGBUS and
> >> waiting for all user clients having CPU mapping of device BOs to die was 
> >> dropped.
> >> Instead as per the document suggestion the device structures are kept 
> >> alive until
> >> the last reference to the device is dropped by user client and in the 
> >> meanwhile all existing and new CPU mappings of the BOs
> >> belonging to the device directly or by dma-buf import are rerouted to per 
> >> user
> >> process dummy rw page. Also, I skipped the 'Requirements for KMS UAPI'
> >> section of [2] since I am trying to get the minimal set of requirements
> >> that still give a useful solution to work, and this is the 'Requirements
> >> for Render and Cross-Device UAPI' section, and so my
> >> test case is removing a secondary device, which is render only and is not 
> >> involved
> >> in KMS.
> >>
> >> v3:
> >> More updates following comments from v2 such as removing loop to find DRM 
> >> file when rerouting
> >> page faults to dummy page, getting rid of unnecessary sysfs handling
> >> refactoring and moving
> >> prevention of GPU recovery post device unplug from amdgpu to scheduler 
> >> layer.
> >> On top of that added unplug support for the IOMMU enabled system.
> >>
> >> v4:
> >> Drop last sysfs hack and use sysfs default attribute.
> >> Guard against write accesses after device removal to avoid modifying 
> >> released memory.
> >> Update dummy pages handling to on demand allocation and release through 
> >> drm managed framework.
> >> Add return value to scheduler job TO handler (by Luben Tuikov) and use 
> >> this in amdgpu for prevention
> >> of GPU recovery post device unplug
> >> Also rebase on top of drm-misc-next instead of amd-staging-drm-next
> >>
> >> With these patches I am able to gracefully remove the secondary card using 
> >> sysfs remove hook while glxgears
> >> is running off of secondary card (DRI_PRIME=1) without kernel oopses or 
> >> hangs and keep working
> >> with the primary card or soft reset the device without hangs or oopses
> >>
> >> TODOs for followup work:
> >> Convert AMDGPU code to use devm (for hw stuff) and drmm (for sw stuff and 
> >> allocations) (Daniel)
> >> Support plugging the secondary device back after unplug - currently still 
> >> experiencing HW error on plugging back.
> >> Add support for 'Requirements for KMS UAPI' section of [2] - unplugging 
> >> primary, display connected card.
> >>
> >> [1] - Discussions during v3 of the patchset 
> >> https://www.spinics.net/lists/amd-gfx/msg55576.html
> >> [2] - drm/doc: device hot-unplug for userspace 
> >> https://www.spinics.net/lists/dri-devel/msg259755.html

Re: [PATCH v4 11/14] drm/amdgpu: Guard against write accesses after device removal

2021-01-19 Thread Daniel Vetter
On Tue, Jan 19, 2021 at 4:35 PM Andrey Grodzovsky
 wrote:
>
> There is really no other way according to this article
> https://lwn.net/Articles/767885/
>
> "A perfect solution seems nearly impossible though; we cannot acquire a mutex 
> on
> the user
> to prevent them from yanking a device and we cannot check for a presence 
> change
> after every
> device access for performance reasons. "
>
> But I assumed srcu_read_lock should be pretty seamless performance-wise, no?

The read side is supposed to be dirt cheap, the write side is where we
just stall for all readers to eventually complete on their own.
Definitely should be much cheaper than mmio read, on the mmio write
side it might actually hurt a bit. Otoh I think those don't stall the
cpu by default when they're timing out, so maybe if the overhead is
too much for those, we could omit them?

Maybe just do a small microbenchmark for these for testing, with a
register that doesn't change hw state. So with and without
drm_dev_enter/exit, and also one with the hw plugged out so that we
have actual timeouts in the transactions.
-Daniel

> The other solution would be, as I suggested, to keep all the device IO
> ranges reserved and system memory pages unfreed until the device is
> finalized in the driver, but Daniel said this would upset the PCI layer
> (the MMIO ranges reservation part).
>
> Andrey
>
>
>
>
> On 1/19/21 3:55 AM, Christian König wrote:
> > Am 18.01.21 um 22:01 schrieb Andrey Grodzovsky:
> >> This should prevent writing to memory or IO ranges possibly
> >> already allocated for other uses after our device is removed.
> >
> > Wow, that adds quite some overhead to every register access. I'm not sure we
> > can do this.
> >
> > Christian.
> >
> >>
> >> Signed-off-by: Andrey Grodzovsky 
> >> ---
> >>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 57 
> >>   drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c|  9 
> >>   drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c| 53 +-
> >>   drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h|  3 ++
> >>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c   | 70 
> >> ++
> >>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   | 49 ++---
> >>   drivers/gpu/drm/amd/amdgpu/psp_v11_0.c | 16 ++-
> >>   drivers/gpu/drm/amd/amdgpu/psp_v12_0.c |  8 +---
> >>   drivers/gpu/drm/amd/amdgpu/psp_v3_1.c  |  8 +---
> >>   9 files changed, 184 insertions(+), 89 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >> index e99f4f1..0a9d73c 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >> @@ -72,6 +72,8 @@
> >> #include 
> >>   +#include 
> >> +
> >>   MODULE_FIRMWARE("amdgpu/vega10_gpu_info.bin");
> >>   MODULE_FIRMWARE("amdgpu/vega12_gpu_info.bin");
> >>   MODULE_FIRMWARE("amdgpu/raven_gpu_info.bin");
> >> @@ -404,13 +406,21 @@ uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev,
> >> uint32_t offset)
> >>*/
> >>   void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t
> >> value)
> >>   {
> >> +int idx;
> >> +
> >>   if (adev->in_pci_err_recovery)
> >>   return;
> >>   +
> >> +if (!drm_dev_enter(&adev->ddev, &idx))
> >> +return;
> >> +
> >>   if (offset < adev->rmmio_size)
> >>   writeb(value, adev->rmmio + offset);
> >>   else
> >>   BUG();
> >> +
> >> +drm_dev_exit(idx);
> >>   }
> >> /**
> >> @@ -427,9 +437,14 @@ void amdgpu_device_wreg(struct amdgpu_device *adev,
> >>   uint32_t reg, uint32_t v,
> >>   uint32_t acc_flags)
> >>   {
> >> +int idx;
> >> +
> >>   if (adev->in_pci_err_recovery)
> >>   return;
> >>   +if (!drm_dev_enter(&adev->ddev, &idx))
> >> +return;
> >> +
> >>   if ((reg * 4) < adev->rmmio_size) {
> >>   if (!(acc_flags & AMDGPU_REGS_NO_KIQ) &&
> >>   amdgpu_sriov_runtime(adev) &&
> >> @@ -444,6 +459,8 @@ void amdgpu_device_wreg(struct amdgpu_device *adev,
> >>   }
> >> trace_amdgpu_device_wreg(adev->pdev->device, reg, v);
> >> +
> >> +drm_dev_exit(idx);
> >>   }
> >> /*
> >> @@ -454,9 +471,14 @@ void amdgpu_device_wreg(struct amdgpu_device *adev,
> >>   void amdgpu_mm_wreg_mmio_rlc(struct amdgpu_device *adev,
> >>uint32_t reg, uint32_t v)
> >>   {
> >> +int idx;
> >> +
> >>   if (adev->in_pci_err_recovery)
> >>   return;
> >>   +if (!drm_dev_enter(&adev->ddev, &idx))
> >> +return;
> >> +
> >>   if (amdgpu_sriov_fullaccess(adev) &&
> >>   adev->gfx.rlc.funcs &&
> >>   adev->gfx.rlc.funcs->is_rlcg_access_range) {
> >> @@ -465,6 +487,8 @@ void amdgpu_mm_wreg_mmio_rlc(struct amdgpu_device 
> >> *adev,
> >>   } else {
> >>   writel(v, ((void __iomem *)adev->rmmio) + (reg * 4));
> >>   }
> >> +
> >> +drm_dev_exit(idx);
> >>   }

Re: [PATCH v4 12/14] drm/scheduler: Job timeout handler returns status

2021-01-19 Thread Luben Tuikov
On 2021-01-19 2:53 a.m., Christian König wrote:
> Am 18.01.21 um 22:01 schrieb Andrey Grodzovsky:
>> From: Luben Tuikov 
>>
>> This patch does not change current behaviour.
>>
>> The driver's job timeout handler now returns
>> status indicating back to the DRM layer whether
>> the task (job) was successfully aborted or whether
>> more time should be given to the task to complete.
>>
>> Default behaviour as of this patch, is preserved,
>> except in obvious-by-comment case in the Panfrost
>> driver, as documented below.
>>
>> All drivers which make use of the
>> drm_sched_backend_ops' .timedout_job() callback
>> have been accordingly renamed and return the
>> would've-been default value of
>> DRM_TASK_STATUS_ALIVE to restart the task's
>> timeout timer--this is the old behaviour, and
>> is preserved by this patch.
>>
>> In the case of the Panfrost driver, its timedout
>> callback correctly first checks if the job had
>> completed in due time and if so, it now returns
>> DRM_TASK_STATUS_COMPLETE to notify the DRM layer
>> that the task can be moved to the done list, to be
>> freed later. In the other two subsequent checks,
>> the value of DRM_TASK_STATUS_ALIVE is returned, as
>> per the default behaviour.
>>
>> More involved driver solutions can be had
>> in subsequent patches.
>>
>> v2: Use enum as the status of a driver's job
>>  timeout callback method.
>>
>> v4: (By Andrey Grodzovsky)
>> Replace DRM_TASK_STATUS_COMPLETE with DRM_TASK_STATUS_ENODEV
>> to enable a hint to the scheduler for when NOT to rearm the
>> timeout timer.
> As Lukas pointed out, returning the job (or task) status doesn't make
> much sense.
>
> What we return here is the status of the scheduler.
>
> I would either rename the enum or completely drop it and return a 
> negative error status.

Yes, that could be had.

Although dropping the enum and returning [-1, 0] might
make the meaning of the return status vague. Using an enum with an
appropriate name makes the intention clear to the next programmer.

Now, Andrey did rename one of the enumerated values to
DRM_TASK_STATUS_ENODEV, perhaps the same but with:

enum drm_sched_status {
    DRM_SCHED_STAT_NONE, /* Reserve 0 */
    DRM_SCHED_STAT_NOMINAL,
    DRM_SCHED_STAT_ENODEV,
};

and also renaming the enum to the above would be acceptable?

Regards,
Luben

> Apart from that looks fine to me,
> Christian.
>
>
>> Cc: Alexander Deucher 
>> Cc: Andrey Grodzovsky 
>> Cc: Christian König 
>> Cc: Daniel Vetter 
>> Cc: Lucas Stach 
>> Cc: Russell King 
>> Cc: Christian Gmeiner 
>> Cc: Qiang Yu 
>> Cc: Rob Herring 
>> Cc: Tomeu Vizoso 
>> Cc: Steven Price 
>> Cc: Alyssa Rosenzweig 
>> Cc: Eric Anholt 
>> Reported-by: kernel test robot 
>> Signed-off-by: Luben Tuikov 
>> Signed-off-by: Andrey Grodzovsky 
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c |  6 --
>>   drivers/gpu/drm/etnaviv/etnaviv_sched.c | 10 +-
>>   drivers/gpu/drm/lima/lima_sched.c   |  4 +++-
>>   drivers/gpu/drm/panfrost/panfrost_job.c |  9 ++---
>>   drivers/gpu/drm/scheduler/sched_main.c  |  4 +---
>>   drivers/gpu/drm/v3d/v3d_sched.c | 32 
>> +---
>>   include/drm/gpu_scheduler.h | 17 ++---
>>   7 files changed, 54 insertions(+), 28 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> index ff48101..a111326 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> @@ -28,7 +28,7 @@
>>   #include "amdgpu.h"
>>   #include "amdgpu_trace.h"
>>   
>> -static void amdgpu_job_timedout(struct drm_sched_job *s_job)
>> +static enum drm_task_status amdgpu_job_timedout(struct drm_sched_job *s_job)
>>   {
>>  struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched);
>>  struct amdgpu_job *job = to_amdgpu_job(s_job);
>> @@ -41,7 +41,7 @@ static void amdgpu_job_timedout(struct drm_sched_job 
>> *s_job)
>>  amdgpu_ring_soft_recovery(ring, job->vmid, s_job->s_fence->parent)) 
>> {
>>  DRM_ERROR("ring %s timeout, but soft recovered\n",
>>s_job->sched->name);
>> -return;
>> +return DRM_TASK_STATUS_ALIVE;
>>  }
>>   
>>  amdgpu_vm_get_task_info(ring->adev, job->pasid, &ti);
>> @@ -53,10 +53,12 @@ static void amdgpu_job_timedout(struct drm_sched_job 
>> *s_job)
>>   
>>  if (amdgpu_device_should_recover_gpu(ring->adev)) {
>>  amdgpu_device_gpu_recover(ring->adev, job);
>> +return DRM_TASK_STATUS_ALIVE;
>>  } else {
>>  drm_sched_suspend_timeout(&ring->sched);
>>  if (amdgpu_sriov_vf(adev))
>>  adev->virt.tdr_debug = true;
>> +return DRM_TASK_STATUS_ALIVE;
>>  }
>>   }
>>   
>> diff --git a/drivers/gpu/drm/etnaviv/etnaviv_sched.c 
>> b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
>> index cd46c88..c495169 100644
>> --- a/drivers/gpu/drm/etnaviv/etnaviv_sched.c
>> +++ b/drivers/g

Re: [PATCH v4 00/14] RFC Support hot device unplug in amdgpu

2021-01-19 Thread Andrey Grodzovsky


On 1/19/21 9:16 AM, Daniel Vetter wrote:

On Mon, Jan 18, 2021 at 04:01:09PM -0500, Andrey Grodzovsky wrote:

Until now extracting a card either by physical extraction (e.g. eGPU with
thunderbolt connection or by emulation through sysfs ->
/sys/bus/pci/devices/device_id/remove)
would cause random crashes in user apps. The random crashes in apps were
mostly due to the app having mapped a device backed BO into its address
space was still trying to access the BO while the backing device was gone.
To answer this first problem, Christian suggested to fix the handling of mapped
memory in the clients when the device goes away by forcibly unmapping all
buffers the user processes have, by clearing their respective VMAs mapping the
device BOs. Then, when the VMAs try to fill in the page tables again, we check
in the fault handler if the device is removed and, if so, return an error. This
will generate a SIGBUS to the application, which can then cleanly terminate.
This indeed was done, but it in turn created a problem of kernel OOPSes, where
the OOPSes were due to the fact that while the app was terminating because of
the SIGBUS it would trigger a use-after-free in the driver by accessing device
structures that were already released from the pci remove sequence. This was
handled by introducing a 'flush' sequence during device removal where we wait
for the drm file reference to drop to 0, meaning all user clients directly
using this device have terminated.

v2:
Based on discussions in the mailing list with Daniel and Pekka [1] and based on 
the document
produced by Pekka from those discussions [2] the whole approach with returning 
SIGBUS and
waiting for all user clients having CPU mapping of device BOs to die was 
dropped.
Instead as per the document suggestion the device structures are kept alive 
until
the last reference to the device is dropped by user client and in the meanwhile 
all existing and new CPU mappings of the BOs
belonging to the device directly or by dma-buf import are rerouted to per user
process dummy rw page. Also, I skipped the 'Requirements for KMS UAPI' section
of [2] since I am trying to get the minimal set of requirements that still
give a useful solution to work, and this is the 'Requirements for Render and
Cross-Device UAPI' section, and so my
test case is removing a secondary device, which is render only and is not 
involved
in KMS.

v3:
More updates following comments from v2 such as removing loop to find DRM file 
when rerouting
page faults to dummy page, getting rid of unnecessary sysfs handling refactoring
and moving
prevention of GPU recovery post device unplug from amdgpu to scheduler layer.
On top of that added unplug support for the IOMMU enabled system.

v4:
Drop last sysfs hack and use sysfs default attribute.
Guard against write accesses after device removal to avoid modifying released 
memory.
Update dummy pages handling to on demand allocation and release through drm 
managed framework.
Add return value to scheduler job TO handler (by Luben Tuikov) and use this in 
amdgpu for prevention
of GPU recovery post device unplug
Also rebase on top of drm-misc-next instead of amd-staging-drm-next

With these patches I am able to gracefully remove the secondary card using 
sysfs remove hook while glxgears
is running off of secondary card (DRI_PRIME=1) without kernel oopses or hangs 
and keep working
with the primary card or soft reset the device without hangs or oopses

TODOs for followup work:
Convert AMDGPU code to use devm (for hw stuff) and drmm (for sw stuff and 
allocations) (Daniel)
Support plugging the secondary device back after unplug - currently still 
experiencing HW error on plugging back.
Add support for 'Requirements for KMS UAPI' section of [2] - unplugging 
primary, display connected card.

[1] - Discussions during v3 of the patchset 
https://www.spinics.net/lists/amd-gfx/msg55576.html
[2] - drm/doc: device hot-unplug for userspace 
https://www.spinics.net/lists/dri-devel/msg259755.html
[3] - Related gitlab ticket 
https://gitlab.freedesktop.org/drm/amd/-/issues/1081

RE: Re: Re: [PATCH 1/2] drm/amdgpu: race issue when jobs on 2 ring timeout

2021-01-19 Thread Lazar, Lijo
[AMD Public Use]

What about changing the lock-hive logic to something like:
if (this device locked) return;
lock hive -> lock this device.

In the regular flow, lock everything in the list except this device.
Thanks,
Lijo

From: amd-gfx  On Behalf Of Andrey 
Grodzovsky
Sent: Tuesday, January 19, 2021 10:45 PM
To: Chen, Horace ; amd-gfx@lists.freedesktop.org
Cc: Xiao, Jack ; Xu, Feifei ; Wang, 
Kevin(Yang) ; Tuikov, Luben ; 
Deucher, Alexander ; Quan, Evan ; 
Koenig, Christian ; Liu, Monk ; 
Zhang, Hawking 
Subject: Re: Re: Re: [PATCH 1/2] drm/amdgpu: race issue when jobs on 2 ring 
timeout


Well, it shouldn't happen with the hive locked, as far as I can tell from
browsing the code, but then your code should reflect that: if you do fail
to lock a particular adev AFTER the hive is locked, you should not
silently break the iteration but throw an error, WARN_ON or BUG_ON.
Or alternatively, bail out, unlocking all already-locked devices.
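
A rough sketch of that bail-out-with-rollback variant (the helper is
hypothetical, modeled on amdgpu_device_lock_adev()/amdgpu_device_unlock_adev(),
and assumes the device list is stable under hive->hive_lock):

static int example_lock_hive_devices(struct amdgpu_hive_info *hive)
{
	struct amdgpu_device *adev, *locked;

	list_for_each_entry(adev, &hive->device_list, gmc.xgmi.head) {
		if (!amdgpu_device_lock_adev(adev, hive))
			goto unwind;
	}
	return 0;

unwind:
	/* Release everything we managed to lock before the failing device. */
	list_for_each_entry(locked, &hive->device_list, gmc.xgmi.head) {
		if (locked == adev)
			break;
		amdgpu_device_unlock_adev(locked);
	}
	return -EBUSY;
}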



Andrey


On 1/19/21 12:09 PM, Chen, Horace wrote:

[AMD Official Use Only - Internal Distribution Only]

OK, I understand. You mean one device in the hive may be locked
independently without locking the whole hive.

It could happen, I'll change my code.

Thanks & Regards,
Horace.


From: Grodzovsky, Andrey 

Sent: January 20, 2021 0:58
To: Chen, Horace ; 
amd-gfx@lists.freedesktop.org 

Cc: Quan, Evan ; Tuikov, Luben 
; Koenig, Christian 
; Deucher, Alexander 
; Xiao, Jack 
; Zhang, Hawking 
; Liu, Monk 
; Xu, Feifei 
; Wang, Kevin(Yang) 
; Xiaojie Yuan 

Subject: Re: Re: [PATCH 1/2] drm/amdgpu: race issue when jobs on 2 ring timeout



On 1/19/21 11:39 AM, Chen, Horace wrote:

[AMD Official Use Only - Internal Distribution Only]

Hi Andrey,

I think the list in the XGMI hive won't be broken in the middle if we lock the
device before we change the list, because if 2 devices in 1 hive go into the
function, they will follow the same sequence to lock the devices, so one of
them will definitely break at the first device. I added iterating over the
devices here just to lock all devices in the hive, since we will change the
device sequence in the hive soon after.



I didn't mean break in the sense of breaking the list itself, I just meant
the literal 'break' instruction
to terminate the iteration once you failed to lock a particular device.



The reason to break the iteration in the middle is that the list is changed
during the iteration without taking any lock, which is quite bad since I'm
fixing one such issue. And for the XGMI hive, there are 2 locks protecting the
list: one is the device lock I changed here; the other one is in front of my
change, where there is a hive->lock to protect the hive.

Even if the bad thing really happened, I think moving back through the list is
also very dangerous since we don't know what the list will finally be, unless
we stack the devices we have iterated through in a mirrored list. That can be
a big change.



Not sure we are on the same page; my concern is: let's say your XGMI hive
consists of 2 devices, and you managed to successfully call
amdgpu_device_lock_adev for dev1 but then failed for dev2. In this case you
will bail out without releasing dev1, no?



Andrey




I'm OK to separate the locking in amdgpu_device_lock_adev here; I'll do some
tests and update the code later.

Thanks & Regards,
Horace.

From: Grodzovsky, Andrey 

Sent: January 19, 2021 22:33
To: Chen, Horace ; 
amd-gfx@lists.freedesktop.org 

Cc: Quan, Evan ; Tuikov, Luben 
; Koenig, Christian 
; Deucher, Alexander 
; Xiao, Jack 
; Zhang, Hawking 
; Liu, Monk 
; Xu, Feifei 
; Wang, Kevin(Yang) 
; Xiaojie Yuan 

Subject: Re: [PATCH 1/2] drm/amdgpu: race issue when jobs on 2 ring timeout


On 1/19/21 7:22 AM, Horace Chen wrote:
> Fix a racing issue when jobs on 2 rings timeout simultaneously.
>
> If 2 rings timed out at the same time, the amdgpu_device_gpu_recover
> will be reentered. Then the adev->gmc.xgmi.head will be grabbed
> by 2 local linked lists, which may cause a wild pointer issue in
> iterating.
>
> Lock the device early to prevent the node from being added to 2
> different lists.
>
> Signed-off-by: Horace Chen 

Re: Re: Re: [PATCH 1/2] drm/amdgpu: race issue when jobs on 2 ring timeout

2021-01-19 Thread Andrey Grodzovsky
Well, it shouldn't happen with the hive locked, as far as I can tell from
browsing the code, but then your code should reflect that: if you do fail
to lock a particular adev AFTER the hive is locked, you should not
silently break the iteration but throw an error, WARN_ON or BUG_ON.
Or alternatively, bail out, unlocking all already-locked devices.
locked devices.


Andrey


On 1/19/21 12:09 PM, Chen, Horace wrote:


[AMD Official Use Only - Internal Distribution Only]


OK, I understand. You mean one device in the hive may be locked
independently without locking the whole hive.


It could happen, I'll change my code.

Thanks & Regards,
Horace.


*From:* Grodzovsky, Andrey 
*Sent:* January 20, 2021 0:58
*To:* Chen, Horace ; amd-gfx@lists.freedesktop.org 

*Cc:* Quan, Evan ; Tuikov, Luben ; 
Koenig, Christian ; Deucher, Alexander 
; Xiao, Jack ; Zhang, Hawking 
; Liu, Monk ; Xu, Feifei 
; Wang, Kevin(Yang) ; Xiaojie Yuan 


*Subject:* Re: Re: [PATCH 1/2] drm/amdgpu: race issue when jobs on 2 ring timeout


On 1/19/21 11:39 AM, Chen, Horace wrote:


[AMD Official Use Only - Internal Distribution Only]


Hi Andrey,

I think the list in the XGMI hive won't be broken in the middle if we lock
the device before we change the list, because if 2 devices in 1 hive go into
the function, they will follow the same sequence to lock the devices, so one
of them will definitely break at the first device. I added iterating over
the devices here just to lock all devices in the hive, since we will change
the device sequence in the hive soon after.



I didn't mean break in the sense of breaking the list itself, I just meant
the literal 'break' instruction

to terminate the iteration once you failed to lock a particular device.




The reason for breaking the iteration in the middle is that the list is changed
during the iteration without taking any lock, which is quite bad since I'm fixing
one of these issues. And for the XGMI hive there are 2 locks protecting the list:
one is the device lock I changed here; the other, just before my change, is the
hive->lock that protects the hive.


Even if the bad thing really happened, I think moving back through the list is
also very dangerous, since we don't know what the list will finally be, unless we
stack the devices we have iterated over in a mirrored list. That would be a big
change.



Not sure we are on the same page. My concern is: let's say your XGMI hive
consists of 2 devices, and you managed to successfully call
amdgpu_device_lock_adev for dev1 but then failed for dev2; in this case you
will bail out without releasing dev1, no?



Andrey





I'm OK to separate the locking in amdgpu_device_lock_adev here, I'll do some
tests and update the code later.


Thanks & Regards,
Horace.

*From:* Grodzovsky, Andrey
*Sent:* January 19, 2021 22:33
*To:* Chen, Horace; amd-gfx@lists.freedesktop.org
*Cc:* Quan, Evan; Tuikov, Luben; Koenig, Christian; Deucher, Alexander; Xiao, Jack; Zhang, Hawking; Liu, Monk; Xu, Feifei; Wang, Kevin(Yang); Xiaojie Yuan
*Subject:* Re: [PATCH 1/2] drm/amdgpu: race issue when jobs on 2 ring timeout

On 1/19/21 7:22 AM, Horace Chen wrote:
> Fix a racing issue when jobs on 2 rings time out simultaneously.
>
> If 2 rings timed out at the same time, amdgpu_device_gpu_recover
> will be reentered. Then the adev->gmc.xgmi.head will be grabbed
> by 2 local linked lists, which may cause a wild pointer issue when
> iterating.
>
> Lock the device early to prevent the node from being added to 2
> different lists.
>
> Signed-off-by: Horace Chen 
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 42 +++---
>   1 file changed, 30 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

> index 4d434803fb49..9574da3abc32 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4540,6 +4540,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
*adev,
>    int i, r = 0;
>    bool need_emergency_restart = false;
>    bool audio_suspended = false;
> + bool get_dev_lock = false;
>
>    /*
> * Special case: RAS triggered and full reset isn't supported
> @@ -4582,28 +4583,45 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
*adev,

> * Build list of devices to reset.
> * In case we are in XGMI hive mode, resort the device list

Re: Re: [PATCH 1/2] drm/amdgpu: race issue when jobs on 2 ring timeout

2021-01-19 Thread Chen, Horace
[AMD Official Use Only - Internal Distribution Only]

OK, I understand. You mean one device in the hive may be locked independently,
without the whole hive being locked.

It could happen, I'll change my code.

Thanks & Regards,
Horace.


From: Grodzovsky, Andrey
Sent: January 20, 2021 0:58
To: Chen, Horace; amd-gfx@lists.freedesktop.org
Cc: Quan, Evan; Tuikov, Luben; Koenig, Christian; Deucher, Alexander; Xiao, Jack; Zhang, Hawking; Liu, Monk; Xu, Feifei; Wang, Kevin(Yang); Xiaojie Yuan
Subject: Re: Re: [PATCH 1/2] drm/amdgpu: race issue when jobs on 2 ring timeout



On 1/19/21 11:39 AM, Chen, Horace wrote:

[AMD Official Use Only - Internal Distribution Only]

Hi Andrey,

I think the list in the XGMI hive won't be broken in the middle if we lock the
device before we change the list, because if 2 devices in 1 hive enter the
function, they will follow the same sequence to lock the devices, so one of them
will definitely stop at the first device. I added the device iteration here just
to lock all devices in the hive, since we will change the device order in the
hive soon after.


I didn't mean 'break' in the sense of breaking the list itself, I just meant the
literal 'break' instruction that terminates the iteration once you failed to lock
a particular device.


The reason for breaking the iteration in the middle is that the list is changed
during the iteration without taking any lock, which is quite bad since I'm fixing
one of these issues. And for the XGMI hive there are 2 locks protecting the list:
one is the device lock I changed here; the other, just before my change, is the
hive->lock that protects the hive.

Even if the bad thing really happened, I think moving back through the list is
also very dangerous, since we don't know what the list will finally be, unless we
stack the devices we have iterated over in a mirrored list. That would be a big
change.


Not sure we are on the same page. My concern is: let's say your XGMI hive
consists of 2 devices, and you managed to successfully call
amdgpu_device_lock_adev for dev1 but then failed for dev2; in this case you
will bail out without releasing dev1, no?


Andrey



I'm OK to separate the locking in amdgpu_device_lock_adev here, I'll do some
tests and update the code later.

Thanks & Regards,
Horace.

From: Grodzovsky, Andrey
Sent: January 19, 2021 22:33
To: Chen, Horace; amd-gfx@lists.freedesktop.org
Cc: Quan, Evan; Tuikov, Luben; Koenig, Christian; Deucher, Alexander; Xiao, Jack; Zhang, Hawking; Liu, Monk; Xu, Feifei; Wang, Kevin(Yang); Xiaojie Yuan
Subject: Re: [PATCH 1/2] drm/amdgpu: race issue when jobs on 2 ring timeout


On 1/19/21 7:22 AM, Horace Chen wrote:
> Fix a racing issue when jobs on 2 rings time out simultaneously.
>
> If 2 rings timed out at the same time, amdgpu_device_gpu_recover
> will be reentered. Then the adev->gmc.xgmi.head will be grabbed
> by 2 local linked lists, which may cause a wild pointer issue when
> iterating.
>
> Lock the device early to prevent the node from being added to 2
> different lists.
>
> Signed-off-by: Horace Chen 
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 42 +++---
>   1 file changed, 30 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 4d434803fb49..9574da3abc32 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4540,6 +4540,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
> *adev,
>int i, r = 0;
>bool need_emergency_restart = false;
>bool audio_suspended = false;
> + bool get_dev_lock = false;
>
>/*
> * Special case: RAS triggered and full reset isn't supported
> @@ -4582,28 +4583,45 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
> *adev,
> * Build list of devices to reset.
> * In case we are in XGMI hive mode, resort the device list
> * to put adev in the 1st position.
> +  *
> +  * lock the device before we try to operate the linked list
> +  * if didn't get the device lock, don't touch the linked list since
> +  * others may iterating it.
> */
>INIT_LIST_HEAD(&device_list);
>if (adev->gmc.xgmi.num_physical_nodes > 1) {
>if (!hive)
>return -ENODEV;
> - if (!list_is_first(&adev->gmc.xgmi.head, &hive->device_list))
> -

Re: Re: [PATCH 1/2] drm/amdgpu: race issue when jobs on 2 ring timeout

2021-01-19 Thread Andrey Grodzovsky


On 1/19/21 11:39 AM, Chen, Horace wrote:


[AMD Official Use Only - Internal Distribution Only]


Hi Andrey,

I think the list in the XGMI hive won't be broken in the middle if we lock the
device before we change the list, because if 2 devices in 1 hive enter the
function, they will follow the same sequence to lock the devices, so one of them
will definitely stop at the first device. I added the device iteration here just
to lock all devices in the hive, since we will change the device order in the
hive soon after.



I didn't mean 'break' in the sense of breaking the list itself, I just meant the
literal 'break' instruction that terminates the iteration once you failed to lock
a particular device.




The reason for breaking the iteration in the middle is that the list is changed
during the iteration without taking any lock, which is quite bad since I'm fixing
one of these issues. And for the XGMI hive there are 2 locks protecting the list:
one is the device lock I changed here; the other, just before my change, is the
hive->lock that protects the hive.


Even if the bad thing really happened, I think moving back through the list is
also very dangerous, since we don't know what the list will finally be, unless we
stack the devices we have iterated over in a mirrored list. That would be a big
change.



Not sure we are on the same page. My concern is: let's say your XGMI hive
consists of 2 devices, and you managed to successfully call
amdgpu_device_lock_adev for dev1 but then failed for dev2; in this case you
will bail out without releasing dev1, no?



Andrey





I'm OK to separate the locking in amdgpu_device_lock_adev here, I'll do some
tests and update the code later.


Thanks & Regards,
Horace.

*From:* Grodzovsky, Andrey
*Sent:* January 19, 2021 22:33
*To:* Chen, Horace; amd-gfx@lists.freedesktop.org
*Cc:* Quan, Evan; Tuikov, Luben; Koenig, Christian; Deucher, Alexander; Xiao, Jack; Zhang, Hawking; Liu, Monk; Xu, Feifei; Wang, Kevin(Yang); Xiaojie Yuan
*Subject:* Re: [PATCH 1/2] drm/amdgpu: race issue when jobs on 2 ring timeout

On 1/19/21 7:22 AM, Horace Chen wrote:
> Fix a racing issue when jobs on 2 rings time out simultaneously.
>
> If 2 rings timed out at the same time, amdgpu_device_gpu_recover
> will be reentered. Then the adev->gmc.xgmi.head will be grabbed
> by 2 local linked lists, which may cause a wild pointer issue when
> iterating.
>
> Lock the device early to prevent the node from being added to 2
> different lists.
>
> Signed-off-by: Horace Chen 
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 42 +++---
>   1 file changed, 30 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

> index 4d434803fb49..9574da3abc32 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4540,6 +4540,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
*adev,
>    int i, r = 0;
>    bool need_emergency_restart = false;
>    bool audio_suspended = false;
> + bool get_dev_lock = false;
>
>    /*
> * Special case: RAS triggered and full reset isn't supported
> @@ -4582,28 +4583,45 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
*adev,

> * Build list of devices to reset.
> * In case we are in XGMI hive mode, resort the device list
> * to put adev in the 1st position.
> +  *
> +  * lock the device before we try to operate the linked list
> +  * if didn't get the device lock, don't touch the linked list since
> +  * others may iterating it.
> */
>    INIT_LIST_HEAD(&device_list);
>    if (adev->gmc.xgmi.num_physical_nodes > 1) {
>    if (!hive)
>    return -ENODEV;
> - if (!list_is_first(&adev->gmc.xgmi.head, &hive->device_list))
> - list_rotate_to_front(&adev->gmc.xgmi.head, &hive->device_list);
> - device_list_handle = &hive->device_list;
> +
> + list_for_each_entry(tmp_adev, &hive->device_list, 
gmc.xgmi.head) {
> + get_dev_lock = amdgpu_device_lock_adev(tmp_adev, hive);
> + if (!get_dev_lock)
> + break;


What about unlocking all the devices you already locked if the break
happens in the middle of the iteration?
Note that at skip_recovery: we don't do it. BTW, I see this issue is already in
the current code.

Also, maybe now it's better to separate the actual locking in
amdgpu_device_lock_adev from the other stuff going on there, since I don't
think you would want to toggle stuff like adev->mp1_state back and forth,
and also the function name is not descriptive of the other stuff going on
there anyway.

Andrey


> + }
> + if (get_dev_lock) {
> + if (!list_is_first(&adev->gmc.xgmi.head, 
&hive->d

Re: [PATCH 1/2] drm/amdgpu: race issue when jobs on 2 ring timeout

2021-01-19 Thread Chen, Horace
[AMD Official Use Only - Internal Distribution Only]

Hi Andrey,

I think the list in the XGMI hive won't be broken in the middle if we lock the
device before we change the list, because if 2 devices in 1 hive enter the
function, they will follow the same sequence to lock the devices, so one of them
will definitely stop at the first device. I added the device iteration here just
to lock all devices in the hive, since we will change the device order in the
hive soon after.

The reason for breaking the iteration in the middle is that the list is changed
during the iteration without taking any lock, which is quite bad since I'm fixing
one of these issues. And for the XGMI hive there are 2 locks protecting the list:
one is the device lock I changed here; the other, just before my change, is the
hive->lock that protects the hive.

Even if the bad thing really happened, I think moving back through the list is
also very dangerous, since we don't know what the list will finally be, unless we
stack the devices we have iterated over in a mirrored list. That would be a big
change.


I'm OK to separate the locking in amdgpu_device_lock_adev here, I'll do some
tests and update the code later.

Thanks & Regards,
Horace.

From: Grodzovsky, Andrey
Sent: January 19, 2021 22:33
To: Chen, Horace; amd-gfx@lists.freedesktop.org
Cc: Quan, Evan; Tuikov, Luben; Koenig, Christian; Deucher, Alexander; Xiao, Jack; Zhang, Hawking; Liu, Monk; Xu, Feifei; Wang, Kevin(Yang); Xiaojie Yuan
Subject: Re: [PATCH 1/2] drm/amdgpu: race issue when jobs on 2 ring timeout


On 1/19/21 7:22 AM, Horace Chen wrote:
> Fix a racing issue when jobs on 2 rings time out simultaneously.
>
> If 2 rings timed out at the same time, amdgpu_device_gpu_recover
> will be reentered. Then the adev->gmc.xgmi.head will be grabbed
> by 2 local linked lists, which may cause a wild pointer issue when
> iterating.
>
> Lock the device early to prevent the node from being added to 2
> different lists.
>
> Signed-off-by: Horace Chen 
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 42 +++---
>   1 file changed, 30 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 4d434803fb49..9574da3abc32 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4540,6 +4540,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
> *adev,
>int i, r = 0;
>bool need_emergency_restart = false;
>bool audio_suspended = false;
> + bool get_dev_lock = false;
>
>/*
> * Special case: RAS triggered and full reset isn't supported
> @@ -4582,28 +4583,45 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
> *adev,
> * Build list of devices to reset.
> * In case we are in XGMI hive mode, resort the device list
> * to put adev in the 1st position.
> +  *
> +  * lock the device before we try to operate the linked list
> +  * if didn't get the device lock, don't touch the linked list since
> +  * others may iterating it.
> */
>INIT_LIST_HEAD(&device_list);
>if (adev->gmc.xgmi.num_physical_nodes > 1) {
>if (!hive)
>return -ENODEV;
> - if (!list_is_first(&adev->gmc.xgmi.head, &hive->device_list))
> - list_rotate_to_front(&adev->gmc.xgmi.head, 
> &hive->device_list);
> - device_list_handle = &hive->device_list;
> +
> + list_for_each_entry(tmp_adev, &hive->device_list, 
> gmc.xgmi.head) {
> + get_dev_lock = amdgpu_device_lock_adev(tmp_adev, hive);
> + if (!get_dev_lock)
> + break;


What about unlocking all the devices you already locked if the break
happens in the middle of the iteration?
Note that at skip_recovery: we don't do it. BTW, I see this issue is already in
the current code.

Also, maybe now it's better to separate the actual locking in
amdgpu_device_lock_adev from the other stuff going on there, since I don't
think you would want to toggle stuff like adev->mp1_state back and forth,
and also the function name is not descriptive of the other stuff going on
there anyway.

Andrey


> + }
> + if (get_dev_lock) {
> + if (!list_is_first(&adev->gmc.xgmi.head, 
> &hive->device_list))
> + list_rotate_to_front(&adev->gmc.xgmi.head, 
> &hive->device_list);
> + device_list_handle = &hive->device_list;
> + }
>} else {
> - list_add_tail(&adev->gmc.xgmi.head, &device_list);
> - device_list_handle = &device_list;
> + get_dev_lock = amdgpu_device_lock_adev(adev, hive);
> + tmp_adev = adev;
> + if (get_dev_lock) {
> + list_add_tail(&adev->gmc.xgmi.head, 

Re: [PATCH v4 10/14] drm/amdgpu: Move some sysfs attrs creation to default_attr

2021-01-19 Thread Andrey Grodzovsky



On 1/19/21 2:34 AM, Greg KH wrote:

On Mon, Jan 18, 2021 at 04:01:19PM -0500, Andrey Grodzovsky wrote:

  static struct pci_driver amdgpu_kms_pci_driver = {
.name = DRIVER_NAME,
.id_table = pciidlist,
@@ -1595,6 +1607,7 @@ static struct pci_driver amdgpu_kms_pci_driver = {
.shutdown = amdgpu_pci_shutdown,
.driver.pm = &amdgpu_pm_ops,
.err_handler = &amdgpu_pci_err_handler,
+   .driver.dev_groups = amdgpu_sysfs_groups,

Shouldn't this just be:
	groups = amdgpu_sysfs_groups,

Why go to the "driver root" here?



Because I still didn't get to your suggestion to propose a patch adding groups
to pci_driver, so for now it's located in the 'base' driver struct.

Andrey




Other than that tiny thing, looks good to me, nice cleanup!

greg k-h



Re: [PATCH v3 1/3] drm/amd/display: Add module parameter for freesync video mode

2021-01-19 Thread Pillai, Aurabindo
[AMD Official Use Only - Internal Distribution Only]

Hi Daniel,

Could you please be more specific about the _unsafe API options you mentioned?

--

Thanks & Regards,
Aurabindo Pillai

From: Daniel Vetter 
Sent: Tuesday, January 19, 2021 8:11 AM
To: Pekka Paalanen 
Cc: Pillai, Aurabindo ; amd-gfx list 
; dri-devel ; 
Kazlauskas, Nicholas ; Wang, Chao-kai (Stylon) 
; Thai, Thong ; Sharma, Shashank 
; Lin, Wayne ; Deucher, Alexander 
; Koenig, Christian 
Subject: Re: [PATCH v3 1/3] drm/amd/display: Add module parameter for freesync 
video mode

On Tue, Jan 19, 2021 at 9:35 AM Pekka Paalanen  wrote:
>
> On Mon, 18 Jan 2021 09:36:47 -0500
> Aurabindo Pillai  wrote:
>
> > On Thu, 2021-01-14 at 11:14 +0200, Pekka Paalanen wrote:
> > >
> > > Hi,
> > >
> > > please document somewhere that ends up in git history (commit
> > > message,
> > > code comments, description of the parameter would be the best but
> > > maybe
> > > there isn't enough space?) what Christian König explained in
> > >
> > >
> > > https://lists.freedesktop.org/archives/dri-devel/2020-December/291254.html
> > >
> > > that this is a stop-gap feature intended to be removed as soon as
> > > possible (when a better solution comes up, which could be years).
> > >
> > > So far I have not seen a single mention of this intention in your
> > > patch
> > > submissions, and I think it is very important to make known.
> >
> > Hi,
> >
> > Thanks for the headsup, I shall add the relevant info in the next
> > verison.
> >
> > >
> > > I also did not see an explanation of why this instead of
> > > manufacturing
> > > these video modes in userspace (an idea mentioned by Christian in the
> > > referenced email). I think that too should be part of a commit
> > > message.
> >
> > This is an opt-in feature, which shall be superseded by a better
> > solution. We also add a set of common modes for scaling similarly.
> > Userspace can still add whatever mode they want. So I don't see a reason
> > why this can't be in the kernel.
>
> Hi,
>
> sorry, I think that kind of thinking is backwards. There needs to be a
> reason to put something in the kernel, and if there is no reason, then
> it remains in userspace. So what's the reason to put this in the kernel?
>
> One example reason why this should not be in the kernel is that the set
> of video modes to manufacture is a kind of policy, which modes to add
> and which not. Userspace knows what modes it needs, and establishing
> the modes in the kernel instead is second-guessing what the userspace
> would want. So if userspace needs to manufacture modes in userspace
> anyway as some modes might be missed by the kernel, then why bother in
> the kernel to begin with? Why should the kernel play catch-up with what
> modes userspace wants when we already have everything userspace needs
> to make its own modes, even to add them to the kernel mode list?
>
> Does manufacturing these extra video modes to achieve fast timing
> changes require AMD hardware-specific knowledge, as opposed to the
> general VRR approach of simply adjusting the front porch?
>
> Something like this should also be documented in a commit message. Or
> if you insist that "no reason to not put this in the kernel" is reason
> enough, then write that down, because it does not seem obvious to me or
> others that this feature needs to be in the kernel.

One reason might be debugging, if a feature is known to cause issues.
But imo in that case the knob should be using the _unsafe variants so
it taints the kernel, since otherwise we get stuck in this very cozy
place where kernel maintainers don't have to care much for bugs
"because it's off by default", but also not really care about
polishing the feature "since users can just enable it if they want
it". Just a slightly different flavour of what you're explaining above
already.
-Daniel

> Thanks,
> pq
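
For reference, a minimal sketch of the _unsafe parameter variant Daniel
mentions, assuming the standard helpers from include/linux/moduleparam.h;
switching freesync_video over to it is only a suggestion here, not part of the
posted series:

/* Hypothetical: the _unsafe variant warns and taints the kernel
 * (TAINT_USER) as soon as a user sets the parameter, marking the
 * configuration as experimental/unsupported. */
static uint amdgpu_freesync_vid_mode;
module_param_named_unsafe(freesync_video, amdgpu_freesync_vid_mode,
			  uint, 0444);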



--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch/


[PATCH 3/3] drm/amd/display: Skip modeset for front porch change

2021-01-19 Thread Aurabindo Pillai
[Why]
A seamless transition between modes can be performed if the new incoming
mode has the same timing parameters as the optimized mode on a display with a
variable vtotal min/max.

Smooth video playback usecases can be enabled with this seamless transition by
switching to a new mode which has a refresh rate matching the video.

[How]
Skip full modeset if userspace requested a compatible freesync mode which only
differs in the front porch timing from the current mode.

Signed-off-by: Aurabindo Pillai 
Acked-by: Christian König 
---
 .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 233 +++---
 .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h |   1 +
 2 files changed, 198 insertions(+), 36 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index aaef2fb528fd..d66494cdd8c8 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -213,6 +213,9 @@ static bool amdgpu_dm_psr_disable_all(struct 
amdgpu_display_manager *dm);
 static const struct drm_format_info *
 amd_get_format_info(const struct drm_mode_fb_cmd2 *cmd);
 
+static bool
+is_timing_unchanged_for_freesync(struct drm_crtc_state *old_crtc_state,
+struct drm_crtc_state *new_crtc_state);
 /*
  * dm_vblank_get_counter
  *
@@ -4940,7 +4943,8 @@ static void fill_stream_properties_from_drm_display_mode(
const struct drm_connector *connector,
const struct drm_connector_state *connector_state,
const struct dc_stream_state *old_stream,
-   int requested_bpc)
+   int requested_bpc,
+   bool is_in_modeset)
 {
struct dc_crtc_timing *timing_out = &stream->timing;
const struct drm_display_info *info = &connector->display_info;
@@ -4995,19 +4999,28 @@ static void 
fill_stream_properties_from_drm_display_mode(
timing_out->hdmi_vic = hv_frame.vic;
}
 
-   timing_out->h_addressable = mode_in->crtc_hdisplay;
-   timing_out->h_total = mode_in->crtc_htotal;
-   timing_out->h_sync_width =
-   mode_in->crtc_hsync_end - mode_in->crtc_hsync_start;
-   timing_out->h_front_porch =
-   mode_in->crtc_hsync_start - mode_in->crtc_hdisplay;
-   timing_out->v_total = mode_in->crtc_vtotal;
-   timing_out->v_addressable = mode_in->crtc_vdisplay;
-   timing_out->v_front_porch =
-   mode_in->crtc_vsync_start - mode_in->crtc_vdisplay;
-   timing_out->v_sync_width =
-   mode_in->crtc_vsync_end - mode_in->crtc_vsync_start;
-   timing_out->pix_clk_100hz = mode_in->crtc_clock * 10;
+   if (is_in_modeset) {
+   timing_out->h_addressable = mode_in->hdisplay;
+   timing_out->h_total = mode_in->htotal;
+   timing_out->h_sync_width = mode_in->hsync_end - 
mode_in->hsync_start;
+   timing_out->h_front_porch = mode_in->hsync_start - 
mode_in->hdisplay;
+   timing_out->v_total = mode_in->vtotal;
+   timing_out->v_addressable = mode_in->vdisplay;
+   timing_out->v_front_porch = mode_in->vsync_start - 
mode_in->vdisplay;
+   timing_out->v_sync_width = mode_in->vsync_end - 
mode_in->vsync_start;
+   timing_out->pix_clk_100hz = mode_in->clock * 10;
+   } else {
+   timing_out->h_addressable = mode_in->crtc_hdisplay;
+   timing_out->h_total = mode_in->crtc_htotal;
+   timing_out->h_sync_width = mode_in->crtc_hsync_end - 
mode_in->crtc_hsync_start;
+   timing_out->h_front_porch = mode_in->crtc_hsync_start - 
mode_in->crtc_hdisplay;
+   timing_out->v_total = mode_in->crtc_vtotal;
+   timing_out->v_addressable = mode_in->crtc_vdisplay;
+   timing_out->v_front_porch = mode_in->crtc_vsync_start - 
mode_in->crtc_vdisplay;
+   timing_out->v_sync_width = mode_in->crtc_vsync_end - 
mode_in->crtc_vsync_start;
+   timing_out->pix_clk_100hz = mode_in->crtc_clock * 10;
+   }
+
timing_out->aspect_ratio = get_aspect_ratio(mode_in);
 
stream->output_color_space = get_output_color_space(timing_out);
@@ -5227,6 +5240,33 @@ get_highest_refresh_rate_mode(struct amdgpu_dm_connector 
*aconnector,
return m_pref;
 }
 
+static bool is_freesync_video_mode(struct drm_display_mode *mode,
+  struct amdgpu_dm_connector *aconnector)
+{
+   struct drm_display_mode *high_mode;
+   int timing_diff;
+
+   high_mode = get_highest_refresh_rate_mode(aconnector, false);
+   if (!high_mode || !mode)
+   return false;
+
+   timing_diff = high_mode->vtotal - mode->vtotal;
+
+   if (high_mode->clock == 0 || high_mode->clock != mode->clock ||
+   high_mode->hdisplay != mode->hdisplay ||
+   high_mode->vdisplay != mode->vdisplay ||
+   high_mode->hsync_start != mode->hsync_sta

[PATCH 2/3] drm/amd/display: Add freesync video modes based on preferred modes

2021-01-19 Thread Aurabindo Pillai
[Why]
While it is possible for userspace to create and add a custom mode based off the
optimized mode for the connected display, differing only in front porch
timing, this patch set adds a list of common video modes in advance.

The list of common video refresh rates is small and well known, and the optimized
mode has specific requirements to be able to enable HW frame doubling and
tripling, so it makes most sense to create the modes that video players will need
in advance. The optimized mode matches the preferred mode resolution but has the
highest refresh rate available to enable the largest front porch extension.

[How]
Find the optimized mode and store it on the connector so we can check it
later during our optimized modeset.

Prepopulate the mode list with a list of common video modes based on the
optimized mode (but with a longer front porch) if the panel doesn't support a
variant of the mode natively.

Signed-off-by: Aurabindo Pillai 
Acked-by: Christian König 
Reviewed-by: Shashank Sharma 
---
 .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 170 ++
 .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h |   2 +
 2 files changed, 172 insertions(+)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 245bd1284e5f..aaef2fb528fd 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -5174,6 +5174,59 @@ static void dm_enable_per_frame_crtc_master_sync(struct 
dc_state *context)
set_master_stream(context->streams, context->stream_count);
 }
 
+static struct drm_display_mode *
+get_highest_refresh_rate_mode(struct amdgpu_dm_connector *aconnector,
+ bool use_probed_modes)
+{
+   struct drm_display_mode *m, *m_pref = NULL;
+   u16 current_refresh, highest_refresh;
+   struct list_head *list_head = use_probed_modes ?
+   &aconnector->base.probed_modes :
+   &aconnector->base.modes;
+
+   if (aconnector->freesync_vid_base.clock != 0)
+   return &aconnector->freesync_vid_base;
+
+   /* Find the preferred mode */
+   list_for_each_entry (m, list_head, head) {
+   if (m->type & DRM_MODE_TYPE_PREFERRED) {
+   m_pref = m;
+   break;
+   }
+   }
+
+   if (!m_pref) {
+   /* Probably an EDID with no preferred mode. Fallback to first 
entry */
+   m_pref = list_first_entry_or_null(
+   &aconnector->base.modes, struct drm_display_mode, head);
+   if (!m_pref) {
+   DRM_DEBUG_DRIVER("No preferred mode found in EDID\n");
+   return NULL;
+   }
+   }
+
+   highest_refresh = drm_mode_vrefresh(m_pref);
+
+   /*
+* Find the mode with highest refresh rate with same resolution.
+* For some monitors, preferred mode is not the mode with highest
+* supported refresh rate.
+*/
+   list_for_each_entry (m, list_head, head) {
+   current_refresh  = drm_mode_vrefresh(m);
+
+   if (m->hdisplay == m_pref->hdisplay &&
+   m->vdisplay == m_pref->vdisplay &&
+   highest_refresh < current_refresh) {
+   highest_refresh = current_refresh;
+   m_pref = m;
+   }
+   }
+
+   aconnector->freesync_vid_base = *m_pref;
+   return m_pref;
+}
+
 static struct dc_stream_state *
 create_stream_for_sink(struct amdgpu_dm_connector *aconnector,
   const struct drm_display_mode *drm_mode,
@@ -6999,6 +7052,122 @@ static void amdgpu_dm_connector_ddc_get_modes(struct 
drm_connector *connector,
}
 }
 
+static bool is_duplicate_mode(struct amdgpu_dm_connector *aconnector,
+ struct drm_display_mode *mode)
+{
+   struct drm_display_mode *m;
+
+   list_for_each_entry (m, &aconnector->base.probed_modes, head) {
+   if (drm_mode_equal(m, mode))
+   return true;
+   }
+
+   return false;
+}
+
+static uint add_fs_modes(struct amdgpu_dm_connector *aconnector,
+struct detailed_data_monitor_range *range)
+{
+   const struct drm_display_mode *m;
+   struct drm_display_mode *new_mode;
+   uint i;
+   uint32_t new_modes_count = 0;
+
+   /* Standard FPS values
+*
+* 23.976   - TV/NTSC
+* 24   - Cinema
+* 25   - TV/PAL
+* 29.97- TV/NTSC
+* 30   - TV/NTSC
+* 48   - Cinema HFR
+* 50   - TV/PAL
+* 60   - Commonly used
+* 48,72,96 - Multiples of 24
+*/
+   const uint32_t common_rates[] = { 23976, 24000, 25000, 29970, 30000,
+ 48000, 50000, 60000 };

[PATCH 1/3] drm/amd/display: Add module parameter for freesync video mode

2021-01-19 Thread Aurabindo Pillai
[Why]
This option shall be opt-in by default since it is a temporary solution
until a long-term solution is agreed upon, which may require userspace interface
changes. There is precedent for manufacturing modes in the kernel: in
AMDGPU, the existing usages are for common modes and scaling modes. Other drivers
have a similar approach as well.

[How]
Add a module parameter to enable the freesync video mode modeset
optimization. Enabling this mode allows the driver to skip a full modeset when a
freesync compatible mode is requested by userspace. This parameter will also
add some additional modes that are within the connected monitor's VRR range,
corresponding to common video modes, which media players can use for a seamless
experience while making use of freesync.

Signed-off-by: Aurabindo Pillai 
Acked-by: Christian König 
Reviewed-by: Shashank Sharma 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 12 
 2 files changed, 13 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 100a431f0792..770e42fcaa62 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -177,6 +177,7 @@ extern int amdgpu_gpu_recovery;
 extern int amdgpu_emu_mode;
 extern uint amdgpu_smu_memory_pool_size;
 extern uint amdgpu_dc_feature_mask;
+extern uint amdgpu_freesync_vid_mode;
 extern uint amdgpu_dc_debug_mask;
 extern uint amdgpu_dm_abm_level;
 extern struct amdgpu_mgpu_info mgpu_info;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index b48d7a3c2a11..5c6dc8362e6d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -158,6 +158,7 @@ int amdgpu_mes;
 int amdgpu_noretry = -1;
 int amdgpu_force_asic_type = -1;
 int amdgpu_tmz;
+uint amdgpu_freesync_vid_mode;
 int amdgpu_reset_method = -1; /* auto */
 int amdgpu_num_kcq = -1;
 
@@ -786,6 +787,17 @@ module_param_named(abmlevel, amdgpu_dm_abm_level, uint, 
0444);
 MODULE_PARM_DESC(tmz, "Enable TMZ feature (-1 = auto, 0 = off (default), 1 = 
on)");
 module_param_named(tmz, amdgpu_tmz, int, 0444);
 
+/**
+ * DOC: freesync_video (uint)
+ * Enable the optimization to adjust the front porch timing to achieve a seamless mode change experience
+ * when setting a freesync supported mode for which full modeset is not needed.
+ * The default value: 0 (off).
+ */
+MODULE_PARM_DESC(
+   freesync_video,
+   "Enable freesync modesetting optimization feature (0 = off (default), 1 
= on)");
+module_param_named(freesync_video, amdgpu_freesync_vid_mode, uint, 0444);
+
 /**
  * DOC: reset_method (int)
  * GPU reset method (-1 = auto (default), 0 = legacy, 1 = mode0, 2 = mode1, 3 
= mode2, 4 = baco)
-- 
2.30.0



[PATCH 0/3] Experimental freesync video mode optimization

2021-01-19 Thread Aurabindo Pillai
Changes in V5
=

* More info in commit messages on the rationale of changes being added
to the kernel.
* Minor fixes

Changes in V4
=

1) Add module parameter for freesync video mode

* Change module parameter name to freesync_video

2) Add freesync video modes based on preferred modes:

* Cosmetic fixes
* Added comments about all modes being added by the driver.

3) Skip modeset for front porch change

* Added more conditions for checking freesync video mode

Changes in V3
=

1) Add freesync video modes based on preferred modes:

* Cache base freesync video mode during the first iteration to avoid
  iterating over modelist again later.
* Add mode for 60 fps videos

2) Skip modeset for front porch change

* Fixes for bug exposed by caching of modes.

Changes in V2
=

1) Add freesync video modes based on preferred modes:

* Remove check for connector type before adding freesync compatible
  modes as VRR support is being checked, and there is no reason to block
  freesync video support on eDP.
* use drm_mode_equal() instead of creating same functionality.
* Additional null pointer deference check
* Removed unnecessary variables.
* Cosmetic fixes.

2) Skip modeset for front porch change

* Remove _FSV string being appended to freesync video modes so as to not
  define new policies or break existing application that might use the
  mode name to figure out mode resolution.
* Remove unnecessary variables
* Cosmetic fixes.

--

This patchset enables freesync video mode usecase where the userspace
can request a freesync compatible video mode such that switching to this
mode does not trigger blanking.

This feature is guarded by a module parameter which is disabled by
default. Enabling this parameter adds additional modes to the driver
modelist, and also enables the optimization to skip a modeset when using
one of these modes.

--

Aurabindo Pillai (3):
  drm/amd/display: Add module parameter for freesync video mode
  drm/amd/display: Add freesync video modes based on preferred modes
  drm/amd/display: Skip modeset for front porch change

 drivers/gpu/drm/amd/amdgpu/amdgpu.h   |   1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c   |  12 +
 .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 401 --
 .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h |   3 +
 4 files changed, 382 insertions(+), 35 deletions(-)

-- 
2.30.0



Re: [PATCH v4 11/14] drm/amdgpu: Guard against write accesses after device removal

2021-01-19 Thread Christian König
There is also the possibility to have the drm_dev_enter/exit at a much
higher level.


E.g. we should have it anyway on every IOCTL, and what remains are work
items, scheduler threads and interrupts.
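
A minimal sketch of that higher-level placement, using the standard
drm_dev_enter()/drm_dev_exit() API; the handler name is hypothetical, not
something from the posted series:

/* Guard a whole ioctl once instead of every register access. */
static int amdgpu_example_ioctl(struct drm_device *dev, void *data,
				struct drm_file *filp)
{
	int idx, r = 0;

	if (!drm_dev_enter(dev, &idx))
		return -ENODEV; /* device already unplugged */

	/* ... the real work; MMIO stays valid inside the section ... */

	drm_dev_exit(idx);
	return r;
}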


Christian.

On 1/19/21 4:35 PM, Andrey Grodzovsky wrote:
There is really no other way according to this article 
https://lwn.net/Articles/767885/


"A perfect solution seems nearly impossible though; we cannot acquire 
a mutex on the user
to prevent them from yanking a device and we cannot check for a 
presence change after every

device access for performance reasons. "

But I assumed srcu_read_lock should be pretty seamless performance-wise, no?
The other solution would be, as I suggested, to keep all the device IO
ranges reserved and the system memory pages unfreed until the device is
finalized in the driver, but Daniel said this would upset the PCI layer
(the MMIO ranges reservation part).
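
For context, drm_dev_enter/exit is built on SRCU; a minimal sketch of the
read-side pattern being assumed cheap here, using the standard linux/srcu.h API
with illustrative names:

#include <linux/srcu.h>

DEFINE_STATIC_SRCU(example_srcu);

static void example_reader(void)
{
	/* The read-side fast path is little more than per-CPU counter
	 * bookkeeping, which is why the per-access cost should be small. */
	int idx = srcu_read_lock(&example_srcu);
	/* ... touch state that the unplug path may retire ... */
	srcu_read_unlock(&example_srcu, idx);
}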


Andrey




On 1/19/21 3:55 AM, Christian König wrote:

On 1/18/21 10:01 PM, Andrey Grodzovsky wrote:

This should prevent writing to memory or IO ranges possibly
already allocated for other uses after our device is removed.


Wow, that adds quite some overhead to every register access. I'm not 
sure we can do this.


Christian.



Signed-off-by: Andrey Grodzovsky 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 57 


  drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c    |  9 
  drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c    | 53 
+-

  drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h    |  3 ++
  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c   | 70 
++

  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   | 49 ++---
  drivers/gpu/drm/amd/amdgpu/psp_v11_0.c | 16 ++-
  drivers/gpu/drm/amd/amdgpu/psp_v12_0.c |  8 +---
  drivers/gpu/drm/amd/amdgpu/psp_v3_1.c  |  8 +---
  9 files changed, 184 insertions(+), 89 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

index e99f4f1..0a9d73c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -72,6 +72,8 @@
    #include 
  +#include <drm/drm_drv.h>
+
  MODULE_FIRMWARE("amdgpu/vega10_gpu_info.bin");
  MODULE_FIRMWARE("amdgpu/vega12_gpu_info.bin");
  MODULE_FIRMWARE("amdgpu/raven_gpu_info.bin");
@@ -404,13 +406,21 @@ uint8_t amdgpu_mm_rreg8(struct amdgpu_device 
*adev, uint32_t offset)

   */
  void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, 
uint8_t value)

  {
+    int idx;
+
  if (adev->in_pci_err_recovery)
  return;
  +
+    if (!drm_dev_enter(&adev->ddev, &idx))
+    return;
+
  if (offset < adev->rmmio_size)
  writeb(value, adev->rmmio + offset);
  else
  BUG();
+
+    drm_dev_exit(idx);
  }
    /**
@@ -427,9 +437,14 @@ void amdgpu_device_wreg(struct amdgpu_device 
*adev,

  uint32_t reg, uint32_t v,
  uint32_t acc_flags)
  {
+    int idx;
+
  if (adev->in_pci_err_recovery)
  return;
  +    if (!drm_dev_enter(&adev->ddev, &idx))
+    return;
+
  if ((reg * 4) < adev->rmmio_size) {
  if (!(acc_flags & AMDGPU_REGS_NO_KIQ) &&
  amdgpu_sriov_runtime(adev) &&
@@ -444,6 +459,8 @@ void amdgpu_device_wreg(struct amdgpu_device *adev,
  }
    trace_amdgpu_device_wreg(adev->pdev->device, reg, v);
+
+    drm_dev_exit(idx);
  }
    /*
@@ -454,9 +471,14 @@ void amdgpu_device_wreg(struct amdgpu_device 
*adev,

  void amdgpu_mm_wreg_mmio_rlc(struct amdgpu_device *adev,
   uint32_t reg, uint32_t v)
  {
+    int idx;
+
  if (adev->in_pci_err_recovery)
  return;
  +    if (!drm_dev_enter(&adev->ddev, &idx))
+    return;
+
  if (amdgpu_sriov_fullaccess(adev) &&
  adev->gfx.rlc.funcs &&
  adev->gfx.rlc.funcs->is_rlcg_access_range) {
@@ -465,6 +487,8 @@ void amdgpu_mm_wreg_mmio_rlc(struct 
amdgpu_device *adev,

  } else {
  writel(v, ((void __iomem *)adev->rmmio) + (reg * 4));
  }
+
+    drm_dev_exit(idx);
  }
    /**
@@ -499,15 +523,22 @@ u32 amdgpu_io_rreg(struct amdgpu_device *adev, 
u32 reg)

   */
  void amdgpu_io_wreg(struct amdgpu_device *adev, u32 reg, u32 v)
  {
+    int idx;
+
  if (adev->in_pci_err_recovery)
  return;
  +    if (!drm_dev_enter(&adev->ddev, &idx))
+    return;
+
  if ((reg * 4) < adev->rio_mem_size)
  iowrite32(v, adev->rio_mem + (reg * 4));
  else {
  iowrite32((reg * 4), adev->rio_mem + (mmMM_INDEX * 4));
  iowrite32(v, adev->rio_mem + (mmMM_DATA * 4));
  }
+
+    drm_dev_exit(idx);
  }
    /**
@@ -544,14 +575,21 @@ u32 amdgpu_mm_rdoorbell(struct amdgpu_device 
*adev, u32 index)

   */
  void amdgpu_mm_wdoorbell(struct amdgpu_device *adev, u32 index, 
u32 v)

  {
+    int idx;
+
  if (adev->in_pci_err_recovery)
  return;
  +    if (!drm_dev_enter(&adev->ddev, &idx))
+    return;
+
  if (index < adev->doorbell.num_doorbells) {

Re: [PATCH v4 11/14] drm/amdgpu: Guard against write accesses after device removal

2021-01-19 Thread Andrey Grodzovsky
There is really no other way according to this article 
https://lwn.net/Articles/767885/


"A perfect solution seems nearly impossible though; we cannot acquire a mutex on 
the user
to prevent them from yanking a device and we cannot check for a presence change 
after every

device access for performance reasons. "

But I assumed srcu_read_lock should be pretty seamless performance-wise, no?
The other solution would be, as I suggested, to keep all the device IO ranges
reserved and the system memory pages unfreed until the device is finalized in
the driver, but Daniel said this would upset the PCI layer (the MMIO ranges
reservation part).


Andrey




On 1/19/21 3:55 AM, Christian König wrote:

On 1/18/21 10:01 PM, Andrey Grodzovsky wrote:

This should prevent writing to memory or IO ranges possibly
already allocated for other uses after our device is removed.


Wow, that adds quite some overhead to every register access. I'm not sure we 
can do this.


Christian.



Signed-off-by: Andrey Grodzovsky 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 57 
  drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c    |  9 
  drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c    | 53 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h    |  3 ++
  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c   | 70 ++
  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   | 49 ++---
  drivers/gpu/drm/amd/amdgpu/psp_v11_0.c | 16 ++-
  drivers/gpu/drm/amd/amdgpu/psp_v12_0.c |  8 +---
  drivers/gpu/drm/amd/amdgpu/psp_v3_1.c  |  8 +---
  9 files changed, 184 insertions(+), 89 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

index e99f4f1..0a9d73c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -72,6 +72,8 @@
    #include 
  +#include <drm/drm_drv.h>
+
  MODULE_FIRMWARE("amdgpu/vega10_gpu_info.bin");
  MODULE_FIRMWARE("amdgpu/vega12_gpu_info.bin");
  MODULE_FIRMWARE("amdgpu/raven_gpu_info.bin");
@@ -404,13 +406,21 @@ uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, 
uint32_t offset)

   */
  void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t 
value)

  {
+    int idx;
+
  if (adev->in_pci_err_recovery)
  return;
  +
+    if (!drm_dev_enter(&adev->ddev, &idx))
+    return;
+
  if (offset < adev->rmmio_size)
  writeb(value, adev->rmmio + offset);
  else
  BUG();
+
+    drm_dev_exit(idx);
  }
    /**
@@ -427,9 +437,14 @@ void amdgpu_device_wreg(struct amdgpu_device *adev,
  uint32_t reg, uint32_t v,
  uint32_t acc_flags)
  {
+    int idx;
+
  if (adev->in_pci_err_recovery)
  return;
  +    if (!drm_dev_enter(&adev->ddev, &idx))
+    return;
+
  if ((reg * 4) < adev->rmmio_size) {
  if (!(acc_flags & AMDGPU_REGS_NO_KIQ) &&
  amdgpu_sriov_runtime(adev) &&
@@ -444,6 +459,8 @@ void amdgpu_device_wreg(struct amdgpu_device *adev,
  }
    trace_amdgpu_device_wreg(adev->pdev->device, reg, v);
+
+    drm_dev_exit(idx);
  }
    /*
@@ -454,9 +471,14 @@ void amdgpu_device_wreg(struct amdgpu_device *adev,
  void amdgpu_mm_wreg_mmio_rlc(struct amdgpu_device *adev,
   uint32_t reg, uint32_t v)
  {
+    int idx;
+
  if (adev->in_pci_err_recovery)
  return;
  +    if (!drm_dev_enter(&adev->ddev, &idx))
+    return;
+
  if (amdgpu_sriov_fullaccess(adev) &&
  adev->gfx.rlc.funcs &&
  adev->gfx.rlc.funcs->is_rlcg_access_range) {
@@ -465,6 +487,8 @@ void amdgpu_mm_wreg_mmio_rlc(struct amdgpu_device *adev,
  } else {
  writel(v, ((void __iomem *)adev->rmmio) + (reg * 4));
  }
+
+    drm_dev_exit(idx);
  }
    /**
@@ -499,15 +523,22 @@ u32 amdgpu_io_rreg(struct amdgpu_device *adev, u32 reg)
   */
  void amdgpu_io_wreg(struct amdgpu_device *adev, u32 reg, u32 v)
  {
+    int idx;
+
  if (adev->in_pci_err_recovery)
  return;
  +    if (!drm_dev_enter(&adev->ddev, &idx))
+    return;
+
  if ((reg * 4) < adev->rio_mem_size)
  iowrite32(v, adev->rio_mem + (reg * 4));
  else {
  iowrite32((reg * 4), adev->rio_mem + (mmMM_INDEX * 4));
  iowrite32(v, adev->rio_mem + (mmMM_DATA * 4));
  }
+
+    drm_dev_exit(idx);
  }
    /**
@@ -544,14 +575,21 @@ u32 amdgpu_mm_rdoorbell(struct amdgpu_device *adev, u32 
index)

   */
  void amdgpu_mm_wdoorbell(struct amdgpu_device *adev, u32 index, u32 v)
  {
+    int idx;
+
  if (adev->in_pci_err_recovery)
  return;
  +    if (!drm_dev_enter(&adev->ddev, &idx))
+    return;
+
  if (index < adev->doorbell.num_doorbells) {
  writel(v, adev->doorbell.ptr + index);
  } else {
  DRM_ERROR("writing beyond doorbell aperture: 0x%08x!\n", index);
  }
+
+    drm_dev_exit(idx);
  }
    /**
@@ -588,14 +626,21 @@ u64 amdgpu_mm_rdoorbell64(struct amdgpu_device *adev, 
u32 index)

  

Re: [PATCH 2/2] drm/amdgpu: set job guilty if reset skipped

2021-01-19 Thread Andrey Grodzovsky

Reviewed-by: Andrey Grodzovsky 

Andrey

On 1/19/21 7:22 AM, Horace Chen wrote:

If 2 jobs on 2 different rings timed out within a very short
period, the reset for the second job will be skipped because the
reset is already in progress.

But that doesn't mean the second job is not guilty, since it
also timed out and can be a bad job. So before skipping out
of the reset, we need to increase the karma for this job too.

Signed-off-by: Horace Chen 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 
  1 file changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 9574da3abc32..1d6ff9fe37de 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4574,6 +4574,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
DRM_INFO("Bailing on TDR for s_job:%llx, hive: %llx as 
another already in progress",
job ? job->base.id : -1, hive->hive_id);
amdgpu_put_xgmi_hive(hive);
+   if (job)
+   drm_sched_increase_karma(&job->base);
return 0;
}
mutex_lock(&hive->hive_lock);
@@ -4617,6 +4619,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
job ? job->base.id : -1);
r = 0;
/* even we skipped this reset, still need to set the job to 
guilty */
+   if (job)
+   drm_sched_increase_karma(&job->base);
goto skip_recovery;
}
  



Re: [PATCH 1/2] drm/amdgpu: race issue when jobs on 2 ring timeout

2021-01-19 Thread Andrey Grodzovsky



On 1/19/21 7:22 AM, Horace Chen wrote:

Fix a racing issue when jobs on 2 rings time out simultaneously.

If 2 rings timed out at the same time, amdgpu_device_gpu_recover
will be reentered. Then the adev->gmc.xgmi.head will be grabbed
by 2 local linked lists, which may cause a wild pointer issue when
iterating.

Lock the device early to prevent the node from being added to 2
different lists.

Signed-off-by: Horace Chen 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 42 +++---
  1 file changed, 30 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 4d434803fb49..9574da3abc32 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4540,6 +4540,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
int i, r = 0;
bool need_emergency_restart = false;
bool audio_suspended = false;
+   bool get_dev_lock = false;
  
  	/*

 * Special case: RAS triggered and full reset isn't supported
@@ -4582,28 +4583,45 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
*adev,
 * Build list of devices to reset.
 * In case we are in XGMI hive mode, resort the device list
 * to put adev in the 1st position.
+*
+* lock the device before we try to operate the linked list
+* if didn't get the device lock, don't touch the linked list since
+* others may iterating it.
 */
INIT_LIST_HEAD(&device_list);
if (adev->gmc.xgmi.num_physical_nodes > 1) {
if (!hive)
return -ENODEV;
-   if (!list_is_first(&adev->gmc.xgmi.head, &hive->device_list))
-   list_rotate_to_front(&adev->gmc.xgmi.head, 
&hive->device_list);
-   device_list_handle = &hive->device_list;
+
+   list_for_each_entry(tmp_adev, &hive->device_list, 
gmc.xgmi.head) {
+   get_dev_lock = amdgpu_device_lock_adev(tmp_adev, hive);
+   if (!get_dev_lock)
+   break;



What about unlocking all the devices you already locked if the break
happens in the middle of the iteration?
Note that at skip_recovery: we don't do it. BTW, I see this issue is already in
the current code.


Also, maybe now it's better to separate the actual locking in
amdgpu_device_lock_adev from the other stuff going on there, since I don't
think you would want to toggle stuff like adev->mp1_state back and forth,
and also the function name is not descriptive of the other stuff going on
there anyway.

Andrey



+   }
+   if (get_dev_lock) {
+   if (!list_is_first(&adev->gmc.xgmi.head, 
&hive->device_list))
+   list_rotate_to_front(&adev->gmc.xgmi.head, 
&hive->device_list);
+   device_list_handle = &hive->device_list;
+   }
} else {
-   list_add_tail(&adev->gmc.xgmi.head, &device_list);
-   device_list_handle = &device_list;
+   get_dev_lock = amdgpu_device_lock_adev(adev, hive);
+   tmp_adev = adev;
+   if (get_dev_lock) {
+   list_add_tail(&adev->gmc.xgmi.head, &device_list);
+   device_list_handle = &device_list;
+   }
+   }
+
+   if (!get_dev_lock) {
+   dev_info(tmp_adev->dev, "Bailing on TDR for s_job:%llx, as another 
already in progress",
+   job ? job->base.id : -1);
+   r = 0;
+   /* even we skipped this reset, still need to set the job to 
guilty */
+   goto skip_recovery;
}
  
  	/* block all schedulers and reset given job's ring */

list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
-   if (!amdgpu_device_lock_adev(tmp_adev, hive)) {
-   dev_info(tmp_adev->dev, "Bailing on TDR for s_job:%llx, as 
another already in progress",
- job ? job->base.id : -1);
-   r = 0;
-   goto skip_recovery;
-   }
-
/*
 * Try to put the audio codec into suspend state
 * before gpu reset started.



Re: [PATCH v4 00/14] RFC Support hot device unplug in amdgpu

2021-01-19 Thread Daniel Vetter
On Mon, Jan 18, 2021 at 04:01:09PM -0500, Andrey Grodzovsky wrote:
> Until now extracting a card, either by physical extraction (e.g. an eGPU with a
> thunderbolt connection) or by emulation through sysfs ->
> /sys/bus/pci/devices/device_id/remove,
> would cause random crashes in user apps. The random crashes in apps were
> mostly due to the app having mapped a device-backed BO into its address
> space and still trying to access the BO while the backing device was gone.
> To answer this first problem Christian suggested to fix the handling of mapped
> memory in the clients when the device goes away by forcibly unmapping all
> buffers the user processes have, by clearing their respective VMAs mapping the
> device BOs. Then when the VMAs try to fill in the page tables again we check in
> the fault handler if the device is removed and if so, return an error. This
> will generate a SIGBUS to the application, which can then cleanly terminate.
> This indeed was done, but it in turn created a problem of kernel OOPSes, where
> the OOPSes were due to the fact that while the app was terminating because of
> the SIGBUS it would trigger a use-after-free in the driver by accessing device
> structures that were already released by the pci remove sequence. This was
> handled by introducing a 'flush' sequence during device removal where we wait
> for the drm file reference to drop to 0, meaning all user clients directly
> using this device have terminated.
> released from the pci remove sequence.This was handled by introducing a 
> 'flush' 
> sequence during device removal were we wait for drm file reference to drop to 
> 0 
> meaning all user clients directly using this device terminated.
> 
> v2:
> Based on discussions in the mailing list with Daniel and Pekka [1] and based 
> on the document 
> produced by Pekka from those discussions [2] the whole approach with 
> returning SIGBUS and 
> waiting for all user clients having CPU mapping of device BOs to die was 
> dropped. 
> Instead as per the document suggestion the device structures are kept alive 
> until 
> the last reference to the device is dropped by user client and in the 
> meanwhile all existing and new CPU mappings of the BOs 
> belonging to the device directly or by dma-buf import are rerouted to per 
> user 
> process dummy rw page.Also, I skipped the 'Requirements for KMS UAPI' section 
> of [2] 
> since i am trying to get the minimal set of requirements that still give 
> useful solution 
> to work and this is the'Requirements for Render and Cross-Device UAPI' 
> section and so my 
> test case is removing a secondary device, which is render only and is not 
> involved 
> in KMS.
> 
> v3:
> More updates following comments from v2, such as removing the loop to find the
> DRM file when rerouting
> page faults to the dummy page, getting rid of unnecessary sysfs handling
> refactoring, and moving
> prevention of GPU recovery post device unplug from amdgpu to the scheduler layer.
> On top of that added unplug support for the IOMMU enabled system.
> 
> v4:
> Drop last sysfs hack and use sysfs default attribute.
> Guard against write accesses after device removal to avoid modifying released 
> memory.
> Update dummy pages handling to on demand allocation and release through drm 
> managed framework.
> Add return value to scheduler job TO handler (by Luben Tuikov) and use this 
> in amdgpu for prevention 
> of GPU recovery post device unplug
> Also rebase on top of drm-misc-mext instead of amd-staging-drm-next
> 
> With these patches I am able to gracefully remove the secondary card using 
> sysfs remove hook while glxgears 
> is running off of secondary card (DRI_PRIME=1) without kernel oopses or hangs 
> and keep working 
> with the primary card or soft reset the device without hangs or oopses
> 
> TODOs for followup work:
> Convert AMDGPU code to use devm (for hw stuff) and drmm (for sw stuff and 
> allocations) (Daniel)
> Support plugging the secondary device back after unplug - currently still 
> experiencing HW error on plugging back.
> Add support for 'Requirements for KMS UAPI' section of [2] - unplugging 
> primary, display connected card.
> 
> [1] - Discussions during v3 of the patchset 
> https://www.spinics.net/lists/amd-gfx/msg55576.html
> [2] - drm/doc: device hot-unplug for userspace 
> https://www.spinics.net/lists/dri-devel/msg259755.html
> [3] - Related gitlab ticket 
> https://gitlab.freedesktop.org/drm/amd/-/issues/1081

btw have you tried this out with some of the igts we have? core_hotunplug
is the one I'm thinking of. Might be worth to extend this for amdgpu
specific stuff (like run some batches on it while hotunplugging).

Since there's so many corner cases we need to test here (shared dma-buf,
shared dma_fence) I think it would make sense to have a shared testcase
across drivers. Only specific thing would be some hooks to keep the gpu
busy in some fashion while we yank the driver. But just to get it started
you can throw in entirely amdgpu specific subtests and just share some of
the test code.
-Daniel

> 
> Andrey Grodzovsky (13):
>   drm/ttm: Remap all page faults to per process dummy page.
>   drm: Unamp the entire device address space on device unplug
>   drm/ttm

Re: [PATCH] drm/amdgpu: remove gpu info firmware of green sardine

2021-01-19 Thread Alex Deucher
On Tue, Jan 19, 2021 at 2:20 AM Liang, Prike  wrote:
>
> [AMD Official Use Only - Internal Distribution Only]
>
> > -Original Message-
> > From: Huang, Ray 
> > Sent: Tuesday, January 19, 2021 2:57 PM
> > To: Liang, Prike 
> > Cc: amd-gfx@lists.freedesktop.org; Deucher, Alexander
> > 
> > Subject: Re: [PATCH] drm/amdgpu: remove gpu info firmware of green
> > sardine
> >
> > On Tue, Jan 19, 2021 at 02:25:36PM +0800, Liang, Prike wrote:
> > > [AMD Official Use Only - Internal Distribution Only]
> > >
> > > Thanks for helping clean up. Generally that seems fine, but could we keep the
> > green sardine chip name to retrieve the GPU info FW when the IP discovery
> > falls back to legacy mode?
> >
> > Do you want to only clean MODULE_FIRMWARE(gpu_info.bin)? That's fine
> > for me.
> [Prike]  Yeah, it seems enough to just remove the green sardine GPU info FW 
> declaration from the amdgpu driver module.
> >

We can probably remove the renoir gpu_info firmware as well; we use
the IP discovery table there too at this point.

Alex


> > Thanks,
> > Ray
> >
> > >
> > > Anyway this patch is Reviewed-by: Prike Liang 
> > >
> > > Thanks,
> > > Prike
> > > > -Original Message-
> > > > From: Huang, Ray 
> > > > Sent: Tuesday, January 19, 2021 1:52 PM
> > > > To: amd-gfx@lists.freedesktop.org
> > > > Cc: Deucher, Alexander ; Liang, Prike
> > > > ; Huang, Ray 
> > > > Subject: [PATCH] drm/amdgpu: remove gpu info firmware of green
> > > > sardine
> > > >
> > > > The ip discovery is supported on green sardine, it doesn't need gpu
> > > > info firmware anymore.
> > > >
> > > > Signed-off-by: Huang Rui 
> > > > ---
> > > >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +--
> > > >  1 file changed, 1 insertion(+), 2 deletions(-)
> > > >
> > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > index 4d434803fb49..f1a426d8861d 100644
> > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > @@ -81,7 +81,6 @@
> > MODULE_FIRMWARE("amdgpu/navi10_gpu_info.bin");
> > > >  MODULE_FIRMWARE("amdgpu/navi14_gpu_info.bin");
> > > >  MODULE_FIRMWARE("amdgpu/navi12_gpu_info.bin");
> > > >  MODULE_FIRMWARE("amdgpu/vangogh_gpu_info.bin");
> > > > -MODULE_FIRMWARE("amdgpu/green_sardine_gpu_info.bin");
> > > >
> > > >  #define AMDGPU_RESUME_MS 2000
> > > >
> > > > @@ -1825,7 +1824,7 @@ static int amdgpu_device_parse_gpu_info_fw(struct amdgpu_device *adev)
> > > >  if (adev->apu_flags & AMD_APU_IS_RENOIR)
> > > >  chip_name = "renoir";
> > > >  else
> > > > -chip_name = "green_sardine";
> > > > +return 0;
> > > >  break;
> > > >  case CHIP_NAVI10:
> > > >  chip_name = "navi10";
> > > > --
> > > > 2.25.1
> > >
> ___
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] drm/amdgpu: Add RLC_PG_DELAY_3 for Vangogh

2021-01-19 Thread Huang Rui
On Tue, Jan 19, 2021 at 06:48:54PM +0800, Su, Jinzhou (Joe) wrote:
> Copy from RLC MAS:

Remove this line.

> 
> Driver should enable the CGPG feature for RLC while it is in
> safe mode to prevent any misalignment or conflict while it is
> in middle of any power feature entry/exit sequence. This can
> be achieved by setting RLC_PG_CNTL.GFX_POWER_GATING_ENABLE = 0x1,
> and RLC_PG_DELAY_3.CGCG_ACTIVE_BEFORE_CGPG to the desired CGPG
> hysteresis value in refclk count.
> 
> Signed-off-by: Jinzhou Su 
> ---
>  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 12 
>  1 file changed, 12 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c 
> b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> index c4314e25f560..23a11ec40c33 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> @@ -120,6 +120,7 @@
>  #define mmSPI_CONFIG_CNTL_Vangogh_BASE_IDX   1
>  #define mmGCR_GENERAL_CNTL_Vangogh   0x1580
>  #define mmGCR_GENERAL_CNTL_Vangogh_BASE_IDX  0
> > +#define RLC_PG_DELAY_3__CGCG_ACTIVE_BEFORE_CGPG_MASK_Vangogh   0x0000FFFFL
>  
>  #define mmCP_HYP_PFP_UCODE_ADDR  0x5814
>  #define mmCP_HYP_PFP_UCODE_ADDR_BASE_IDX 1
> @@ -7829,6 +7830,17 @@ static void gfx_v10_cntl_power_gating(struct 
> amdgpu_device *adev, bool enable)
>   data &= ~RLC_PG_CNTL__GFX_POWER_GATING_ENABLE_MASK;
>  
>   WREG32_SOC15(GC, 0, mmRLC_PG_CNTL, data);
> +
> + /*
> +  * CGPG enablement required and the register to program the hysteresis 
> value
> +  * RLC_PG_DELAY_3.CGCG_ACTIVE_BEFORE_CGPG to the desired CGPG 
> hysteresis value
> +  * in refclk count. Note that RLC FW is modified to take 16 bits from
> +  * RLC_PG_DELAY_3[15:0] as the hysteresis instead of just 8 bits.
> +  */
> + if (enable && (adev->pg_flags & AMD_PG_SUPPORT_GFX_PG) && 
> adev->asic_type == CHIP_VANGOGH) {
> + data = 0x4E20 & 
> RLC_PG_DELAY_3__CGCG_ACTIVE_BEFORE_CGPG_MASK_Vangogh;

How did you get the "0x4E20" here?

Thanks,
Ray
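
(For reference: 0x4E20 is 20000 decimal, so the programmed hysteresis is
20000 refclk cycles. Assuming a 100 MHz reference clock (an assumption,
the patch does not state the refclk frequency), that is 20000 / 100 MHz =
200 us of CGCG-active time before CGPG. It also shows why the widened
field matters: 20000 does not fit in the old 8-bit field (max 255), but
fits easily in RLC_PG_DELAY_3[15:0] (max 65535).)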

> + WREG32_SOC15(GC, 0, mmRLC_PG_DELAY_3, data);
> + }
>  }
>  
>  static void gfx_v10_cntl_pg(struct amdgpu_device *adev, bool enable)
> -- 
> 2.17.1
> 
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH v4 01/14] drm/ttm: Remap all page faults to per process dummy page.

2021-01-19 Thread Daniel Vetter
On Mon, Jan 18, 2021 at 04:01:10PM -0500, Andrey Grodzovsky wrote:
> On device removal reroute all CPU mappings to dummy page.
> 
> v3:
> Remove loop to find DRM file and instead access it
> by vma->vm_file->private_data. Move dummy page installation
> into a separate function.
> 
> v4:
> Map the entire BOs VA space into on demand allocated dummy page
> on the first fault for that BO.
> 
> Signed-off-by: Andrey Grodzovsky 
> ---
>  drivers/gpu/drm/ttm/ttm_bo_vm.c | 82 
> -
>  include/drm/ttm/ttm_bo_api.h|  2 +
>  2 files changed, 83 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
> index 6dc96cf..ed89da3 100644
> --- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
> +++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
> @@ -34,6 +34,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -380,25 +382,103 @@ vm_fault_t ttm_bo_vm_fault_reserved(struct vm_fault 
> *vmf,
>  }
>  EXPORT_SYMBOL(ttm_bo_vm_fault_reserved);
>  
> +static void ttm_bo_release_dummy_page(struct drm_device *dev, void *res)
> +{
> + struct page *dummy_page = (struct page *)res;
> +
> + __free_page(dummy_page);
> +}
> +
> +vm_fault_t ttm_bo_vm_dummy_page(struct vm_fault *vmf, pgprot_t prot)
> +{
> + struct vm_area_struct *vma = vmf->vma;
> + struct ttm_buffer_object *bo = vma->vm_private_data;
> + struct ttm_bo_device *bdev = bo->bdev;
> + struct drm_device *ddev = bo->base.dev;
> + vm_fault_t ret = VM_FAULT_NOPAGE;
> + unsigned long address = vma->vm_start;
> + unsigned long num_prefault = (vma->vm_end - vma->vm_start) >> 
> PAGE_SHIFT;
> + unsigned long pfn;
> + struct page *page;
> + int i;
> +
> + /*
> +  * Wait for buffer data in transit, due to a pipelined
> +  * move.
> +  */
> + ret = ttm_bo_vm_fault_idle(bo, vmf);
> + if (unlikely(ret != 0))
> + return ret;
> +
> + /* Allocate new dummy page to map all the VA range in this VMA to it*/
> + page = alloc_page(GFP_KERNEL | __GFP_ZERO);
> + if (!page)
> + return VM_FAULT_OOM;
> +
> + pfn = page_to_pfn(page);
> +
> + /*
> +  * Prefault the entire VMA range right away to avoid further faults
> +  */
> + for (i = 0; i < num_prefault; ++i) {
> +
> + if (unlikely(address >= vma->vm_end))
> + break;
> +
> + if (vma->vm_flags & VM_MIXEDMAP)
> + ret = vmf_insert_mixed_prot(vma, address,
> + __pfn_to_pfn_t(pfn, 
> PFN_DEV),
> + prot);
> + else
> + ret = vmf_insert_pfn_prot(vma, address, pfn, prot);
> +
> + /* Never error on prefaulted PTEs */
> + if (unlikely((ret & VM_FAULT_ERROR))) {
> + if (i == 0)
> + return VM_FAULT_NOPAGE;
> + else
> + break;
> + }
> +
> + address += PAGE_SIZE;
> + }
> +
> + /* Set the page to be freed using drmm release action */
> + if (drmm_add_action_or_reset(ddev, ttm_bo_release_dummy_page, page))
> + return VM_FAULT_OOM;
> +
> + return ret;
> +}
> +EXPORT_SYMBOL(ttm_bo_vm_dummy_page);

I think we can lift this entire thing (once the ttm_bo_vm_fault_idle is
gone) to the drm level, since there's nothing ttm-specific in here. Probably
stuff it into drm_gem.c (but really it's not even gem-specific, it's a fully
generic "replace this vma with dummy pages pls" function).

Aside from this nit I think the overall approach you have here is starting
to look good. Lots of work & polish, but imo we're getting there and can
start landing stuff soon.
-Daniel
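
A rough sketch of what such a lifted helper could look like; names are
illustrative, not an existing DRM API, the VM_MIXEDMAP branch from the
patch is dropped for brevity, and vm_private_data is assumed to be the
GEM object as set up by drm_gem_mmap():

static void drm_release_dummy_page(struct drm_device *dev, void *res)
{
	__free_page((struct page *)res);
}

static vm_fault_t drm_vma_map_dummy_page(struct vm_fault *vmf, pgprot_t prot)
{
	struct vm_area_struct *vma = vmf->vma;
	struct drm_gem_object *obj = vma->vm_private_data;
	struct drm_device *ddev = obj->dev;
	vm_fault_t ret = VM_FAULT_NOPAGE;
	unsigned long addr, pfn;
	struct page *page;

	/* One zeroed page backs the whole VMA after unplug. */
	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
	if (!page)
		return VM_FAULT_OOM;
	pfn = page_to_pfn(page);

	for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
		ret = vmf_insert_pfn_prot(vma, addr, pfn, prot);
		if (ret & VM_FAULT_ERROR)
			break;
	}

	/* Free the dummy page together with the drm_device. */
	if (drmm_add_action_or_reset(ddev, drm_release_dummy_page, page))
		return VM_FAULT_OOM;

	return ret;
}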

> +
>  vm_fault_t ttm_bo_vm_fault(struct vm_fault *vmf)
>  {
>   struct vm_area_struct *vma = vmf->vma;
>   pgprot_t prot;
>   struct ttm_buffer_object *bo = vma->vm_private_data;
> + struct drm_device *ddev = bo->base.dev;
>   vm_fault_t ret;
> + int idx;
>  
>   ret = ttm_bo_vm_reserve(bo, vmf);
>   if (ret)
>   return ret;
>  
>   prot = vma->vm_page_prot;
> - ret = ttm_bo_vm_fault_reserved(vmf, prot, TTM_BO_VM_NUM_PREFAULT, 1);
> + if (drm_dev_enter(ddev, &idx)) {
> + ret = ttm_bo_vm_fault_reserved(vmf, prot, 
> TTM_BO_VM_NUM_PREFAULT, 1);
> + drm_dev_exit(idx);
> + } else {
> + ret = ttm_bo_vm_dummy_page(vmf, prot);
> + }
>   if (ret == VM_FAULT_RETRY && !(vmf->flags & FAULT_FLAG_RETRY_NOWAIT))
>   return ret;
>  
>   dma_resv_unlock(bo->base.resv);
>  
>   return ret;
> +
> + return ret;
>  }
>  EXPORT_SYMBOL(ttm_bo_vm_fault);
>  
> diff --git a/include/drm/ttm/ttm_bo_api.h b/include/drm/ttm/ttm_bo_api.h
> index e17be32..12fb240 100644
> --- a/include/drm/ttm/ttm_bo_api.h
> +++ b/includ

Re: [PATCH v4 07/14] drm/amdgpu: Register IOMMU topology notifier per device.

2021-01-19 Thread Daniel Vetter
On Tue, Jan 19, 2021 at 09:48:03AM +0100, Christian König wrote:
> Am 18.01.21 um 22:01 schrieb Andrey Grodzovsky:
> > Handle all DMA IOMMU group related dependencies before the
> > group is removed.
> > 
> > Signed-off-by: Andrey Grodzovsky 
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu.h|  5 
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 46 
> > ++
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c   |  2 +-
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h   |  1 +
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 10 +++
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_object.h |  2 ++
> >   6 files changed, 65 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > index 478a7d8..2953420 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > @@ -51,6 +51,7 @@
> >   #include 
> >   #include 
> >   #include 
> > +#include 
> >   #include 
> >   #include 
> > @@ -1041,6 +1042,10 @@ struct amdgpu_device {
> > boolin_pci_err_recovery;
> > struct pci_saved_state  *pci_state;
> > +
> > +   struct notifier_block   nb;
> > +   struct blocking_notifier_head   notifier;
> > +   struct list_headdevice_bo_list;
> >   };
> >   static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index 45e23e3..e99f4f1 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -70,6 +70,8 @@
> >   #include 
> >   #include 
> > +#include 
> > +
> >   MODULE_FIRMWARE("amdgpu/vega10_gpu_info.bin");
> >   MODULE_FIRMWARE("amdgpu/vega12_gpu_info.bin");
> >   MODULE_FIRMWARE("amdgpu/raven_gpu_info.bin");
> > @@ -3200,6 +3202,39 @@ static const struct attribute 
> > *amdgpu_dev_attributes[] = {
> >   };
> > +static int amdgpu_iommu_group_notifier(struct notifier_block *nb,
> > +unsigned long action, void *data)
> > +{
> > +   struct amdgpu_device *adev = container_of(nb, struct amdgpu_device, nb);
> > +   struct amdgpu_bo *bo = NULL;
> > +
> > +   /*
> > +* Following is a set of IOMMU group dependencies taken care of before
> > +* device's IOMMU group is removed
> > +*/
> > +   if (action == IOMMU_GROUP_NOTIFY_DEL_DEVICE) {
> > +
> > +   spin_lock(&ttm_bo_glob.lru_lock);
> > +   list_for_each_entry(bo, &adev->device_bo_list, bo) {
> > +   if (bo->tbo.ttm)
> > +   ttm_tt_unpopulate(bo->tbo.bdev, bo->tbo.ttm);
> > +   }
> > +   spin_unlock(&ttm_bo_glob.lru_lock);
> 
> That approach won't work. ttm_tt_unpopulate() might sleep on an IOMMU lock.
> 
> You need to use a mutex here or even better make sure you can access the
> device_bo_list without a lock in this moment.

I'd also be worried about the notifier mutex getting really badly in the
way.

Plus I'm worried why we even need this, it sounds a bit like papering over
the iommu subsystem. Assuming we clean up all our iommu mappings in our
device hotunplug/unload code, why do we still need to have an additional
iommu notifier on top, with all kinds of additional headaches? The iommu
shouldn't clean up before the devices in its group have cleaned up.

I think we need more info here on what the exact problem is first.
-Daniel

> 
> Christian.
> 
> > +
> > +   if (adev->irq.ih.use_bus_addr)
> > +   amdgpu_ih_ring_fini(adev, &adev->irq.ih);
> > +   if (adev->irq.ih1.use_bus_addr)
> > +   amdgpu_ih_ring_fini(adev, &adev->irq.ih1);
> > +   if (adev->irq.ih2.use_bus_addr)
> > +   amdgpu_ih_ring_fini(adev, &adev->irq.ih2);
> > +
> > +   amdgpu_gart_dummy_page_fini(adev);
> > +   }
> > +
> > +   return NOTIFY_OK;
> > +}
> > +
> > +
> >   /**
> >* amdgpu_device_init - initialize the driver
> >*
> > @@ -3304,6 +3339,8 @@ int amdgpu_device_init(struct amdgpu_device *adev,
> > INIT_WORK(&adev->xgmi_reset_work, amdgpu_device_xgmi_reset_func);
> > +   INIT_LIST_HEAD(&adev->device_bo_list);
> > +
> > adev->gfx.gfx_off_req_count = 1;
> > adev->pm.ac_power = power_supply_is_system_supplied() > 0;
> > @@ -3575,6 +3612,15 @@ int amdgpu_device_init(struct amdgpu_device *adev,
> > if (amdgpu_device_cache_pci_state(adev->pdev))
> > pci_restore_state(pdev);
> > +   BLOCKING_INIT_NOTIFIER_HEAD(&adev->notifier);
> > +   adev->nb.notifier_call = amdgpu_iommu_group_notifier;
> > +
> > +   if (adev->dev->iommu_group) {
> > +   r = iommu_group_register_notifier(adev->dev->iommu_group, 
> > &adev->nb);
> > +   if (r)
> > +   goto failed;
> > +   }
> > +
> > return 0;
> >   failed:
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c 
> > 

Re: [PATCH v3 1/3] drm/amd/display: Add module parameter for freesync video mode

2021-01-19 Thread Daniel Vetter
On Tue, Jan 19, 2021 at 9:35 AM Pekka Paalanen  wrote:
>
> On Mon, 18 Jan 2021 09:36:47 -0500
> Aurabindo Pillai  wrote:
>
> > On Thu, 2021-01-14 at 11:14 +0200, Pekka Paalanen wrote:
> > >
> > > Hi,
> > >
> > > please document somewhere that ends up in git history (commit
> > > message,
> > > code comments, description of the parameter would be the best but
> > > maybe
> > > there isn't enough space?) what Christian König explained in
> > >
> > >
> > > https://lists.freedesktop.org/archives/dri-devel/2020-December/291254.html
> > >
> > > that this is a stop-gap feature intended to be removed as soon as
> > > possible (when a better solution comes up, which could be years).
> > >
> > > So far I have not seen a single mention of this intention in your
> > > patch
> > > submissions, and I think it is very important to make known.
> >
> > Hi,
> >
> > Thanks for the headsup, I shall add the relevant info in the next
> > version.
> >
> > >
> > > I also did not see an explanation of why this instead of
> > > manufacturing
> > > these video modes in userspace (an idea mentioned by Christian in the
> > > referenced email). I think that too should be part of a commit
> > > message.
> >
> > This is an opt-in feature, which shall be superseded by a better
> > solution. We also add a set of common modes for scaling similarly.
> > Userspace can still add whatever mode they want. So I don't see a reason
> > why this can't be in the kernel.
>
> Hi,
>
> sorry, I think that kind of thinking is backwards. There needs to be a
> reason to put something in the kernel, and if there is no reason, then
> it remains in userspace. So what's the reason to put this in the kernel?
>
> One example reason why this should not be in the kernel is that the set
> of video modes to manufacture is a kind of policy, which modes to add
> and which not. Userspace knows what modes it needs, and establishing
> the modes in the kernel instead is second-guessing what the userspace
> would want. So if userspace needs to manufacture modes in userspace
> anyway as some modes might be missed by the kernel, then why bother in
> the kernel to begin with? Why should the kernel play catch-up with what
> modes userspace wants when we already have everything userspace needs
> to make its own modes, even to add them to the kernel mode list?
>
> Does manufacturing these extra video modes to achieve fast timing
> changes require AMD hardware-specific knowledge, as opposed to the
> general VRR approach of simply adjusting the front porch?
>
> Something like this should also be documented in a commit message. Or
> if you insist that "no reason to not put this in the kernel" is reason
> enough, then write that down, because it does not seem obvious to me or
> others that this feature needs to be in the kernel.

One reason might be debugging, if a feature is known to cause issues.
But imo in that case the knob should be using the _unsafe variants so
it taints the kernel, since otherwise we get stuck in this very cozy
place where kernel maintainers don't have to care much for bugs
"because it's off by default", but also not really care about
polishing the feature "since users can just enable it if they want
it". Just a slightly different flavour of what you're explaining above
already.
-Daniel
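
For reference, the tainting variant already exists in moduleparam.h; a
minimal sketch of such an opt-in (the parameter name is illustrative,
not the one from this series):

/* Setting an unsafe parameter taints the kernel (TAINT_USER), so bug
 * reports make it visible that the experimental knob was flipped. */
static bool fs_video_modes;
module_param_unsafe(fs_video_modes, bool, 0444);
MODULE_PARM_DESC(fs_video_modes,
		 "Experimental: expose manufactured freesync video modes (default: false)");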

> Thanks,
> pq



-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH] drm/amdgpu: Add RLC_PG_DELAY_3 for Vangogh

2021-01-19 Thread Jinzhou Su
Copy from RLC MAS:

Driver should enable the CGPG feature for RLC while it is in
safe mode to prevent any misalignment or conflict while it is
in middle of any power feature entry/exit sequence. This can
be achieved by setting RLC_PG_CNTL.GFX_POWER_GATING_ENABLE = 0x1,
and RLC_PG_DELAY_3.CGCG_ACTIVE_BEFORE_CGPG to the desired CGPG
hysteresis value in refclk count.

Signed-off-by: Jinzhou Su 
---
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index c4314e25f560..23a11ec40c33 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -120,6 +120,7 @@
 #define mmSPI_CONFIG_CNTL_Vangogh_BASE_IDX   1
 #define mmGCR_GENERAL_CNTL_Vangogh   0x1580
 #define mmGCR_GENERAL_CNTL_Vangogh_BASE_IDX  0
+#define RLC_PG_DELAY_3__CGCG_ACTIVE_BEFORE_CGPG_MASK_Vangogh   0x0000FFFFL
 
 #define mmCP_HYP_PFP_UCODE_ADDR  0x5814
 #define mmCP_HYP_PFP_UCODE_ADDR_BASE_IDX   1
@@ -7829,6 +7830,17 @@ static void gfx_v10_cntl_power_gating(struct 
amdgpu_device *adev, bool enable)
data &= ~RLC_PG_CNTL__GFX_POWER_GATING_ENABLE_MASK;
 
WREG32_SOC15(GC, 0, mmRLC_PG_CNTL, data);
+
+   /*
+* CGPG enablement required and the register to program the hysteresis 
value
+* RLC_PG_DELAY_3.CGCG_ACTIVE_BEFORE_CGPG to the desired CGPG 
hysteresis value
+* in refclk count. Note that RLC FW is modified to take 16 bits from
+* RLC_PG_DELAY_3[15:0] as the hysteresis instead of just 8 bits.
+*/
+   if (enable && (adev->pg_flags & AMD_PG_SUPPORT_GFX_PG) && 
adev->asic_type == CHIP_VANGOGH) {
+   data = 0x4E20 & 
RLC_PG_DELAY_3__CGCG_ACTIVE_BEFORE_CGPG_MASK_Vangogh;
+   WREG32_SOC15(GC, 0, mmRLC_PG_DELAY_3, data);
+   }
 }
 
 static void gfx_v10_cntl_pg(struct amdgpu_device *adev, bool enable)
-- 
2.17.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH v4 04/14] drm/sched: Cancel and flush all outstanding jobs before finish.

2021-01-19 Thread Christian König

Added a CC: stable tag and pushed it.

Thanks,
Christian.

Am 19.01.21 um 09:42 schrieb Christian König:
This is a bug fix and should probably be pushed separately to 
drm-misc-next.


Christian.

Am 18.01.21 um 22:01 schrieb Andrey Grodzovsky:

To avoid any possible use after free.

Signed-off-by: Andrey Grodzovsky 
Reviewed-by: Christian König 
---
  drivers/gpu/drm/scheduler/sched_main.c | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
b/drivers/gpu/drm/scheduler/sched_main.c

index 997aa15..92637b7 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -899,6 +899,9 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
  if (sched->thread)
  kthread_stop(sched->thread);
  +    /* Confirm no work left behind accessing device structures */
+    cancel_delayed_work_sync(&sched->work_tdr);
+
  sched->ready = false;
  }
  EXPORT_SYMBOL(drm_sched_fini);




___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH] drm/amd/amdgpu: add error handling to amdgpu_virt_read_pf2vf_data

2021-01-19 Thread Jingwen Chen
[Why]
When VRAM loss happens in the guest, trying to write to VRAM can get
the kernel stuck.

[How]
When the readback data is invalid, skip the write work and directly
reschedule a new work item.

Signed-off-by: Jingwen Chen 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
index c649944e49da..3dd7eec52344 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
@@ -558,10 +558,14 @@ static int amdgpu_virt_write_vf2pf_data(struct 
amdgpu_device *adev)
 static void amdgpu_virt_update_vf2pf_work_item(struct work_struct *work)
 {
struct amdgpu_device *adev = container_of(work, struct amdgpu_device, 
virt.vf2pf_work.work);
+   int ret;
 
-   amdgpu_virt_read_pf2vf_data(adev);
+   ret = amdgpu_virt_read_pf2vf_data(adev);
+   if (ret)
+   goto out;
amdgpu_virt_write_vf2pf_data(adev);
 
+out:
schedule_delayed_work(&(adev->virt.vf2pf_work), 
adev->virt.vf2pf_update_interval_ms);
 }
 
-- 
2.25.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH v4 11/14] drm/amdgpu: Guard against write accesses after device removal

2021-01-19 Thread Christian König

Am 18.01.21 um 22:01 schrieb Andrey Grodzovsky:

This should prevent writing to memory or IO ranges possibly
already allocated for other uses after our device is removed.


Wow, that adds quite some overhead to every register access. I'm not 
sure we can do this.


Christian.
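
For context on the cost: drm_dev_enter() is an SRCU read-side section,
roughly the following (paraphrased from drm_drv.c from memory, so treat
as a sketch); an SRCU read lock is close to a per-CPU counter increment,
so the per-access overhead is small next to the MMIO itself:

bool drm_dev_enter(struct drm_device *dev, int *idx)
{
	*idx = srcu_read_lock(&drm_unplug_srcu);

	if (dev->unplugged) {
		srcu_read_unlock(&drm_unplug_srcu, *idx);
		return false;
	}

	return true;
}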



Signed-off-by: Andrey Grodzovsky 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 57 
  drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c|  9 
  drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c| 53 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_psp.h|  3 ++
  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c   | 70 ++
  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   | 49 ++---
  drivers/gpu/drm/amd/amdgpu/psp_v11_0.c | 16 ++-
  drivers/gpu/drm/amd/amdgpu/psp_v12_0.c |  8 +---
  drivers/gpu/drm/amd/amdgpu/psp_v3_1.c  |  8 +---
  9 files changed, 184 insertions(+), 89 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index e99f4f1..0a9d73c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -72,6 +72,8 @@
  
  #include 
  
+#include 

+
  MODULE_FIRMWARE("amdgpu/vega10_gpu_info.bin");
  MODULE_FIRMWARE("amdgpu/vega12_gpu_info.bin");
  MODULE_FIRMWARE("amdgpu/raven_gpu_info.bin");
@@ -404,13 +406,21 @@ uint8_t amdgpu_mm_rreg8(struct amdgpu_device *adev, 
uint32_t offset)
   */
  void amdgpu_mm_wreg8(struct amdgpu_device *adev, uint32_t offset, uint8_t 
value)
  {
+   int idx;
+
if (adev->in_pci_err_recovery)
return;
  
+

+   if (!drm_dev_enter(&adev->ddev, &idx))
+   return;
+
if (offset < adev->rmmio_size)
writeb(value, adev->rmmio + offset);
else
BUG();
+
+   drm_dev_exit(idx);
  }
  
  /**

@@ -427,9 +437,14 @@ void amdgpu_device_wreg(struct amdgpu_device *adev,
uint32_t reg, uint32_t v,
uint32_t acc_flags)
  {
+   int idx;
+
if (adev->in_pci_err_recovery)
return;
  
+	if (!drm_dev_enter(&adev->ddev, &idx))

+   return;
+
if ((reg * 4) < adev->rmmio_size) {
if (!(acc_flags & AMDGPU_REGS_NO_KIQ) &&
amdgpu_sriov_runtime(adev) &&
@@ -444,6 +459,8 @@ void amdgpu_device_wreg(struct amdgpu_device *adev,
}
  
  	trace_amdgpu_device_wreg(adev->pdev->device, reg, v);

+
+   drm_dev_exit(idx);
  }
  
  /*

@@ -454,9 +471,14 @@ void amdgpu_device_wreg(struct amdgpu_device *adev,
  void amdgpu_mm_wreg_mmio_rlc(struct amdgpu_device *adev,
 uint32_t reg, uint32_t v)
  {
+   int idx;
+
if (adev->in_pci_err_recovery)
return;
  
+	if (!drm_dev_enter(&adev->ddev, &idx))

+   return;
+
if (amdgpu_sriov_fullaccess(adev) &&
adev->gfx.rlc.funcs &&
adev->gfx.rlc.funcs->is_rlcg_access_range) {
@@ -465,6 +487,8 @@ void amdgpu_mm_wreg_mmio_rlc(struct amdgpu_device *adev,
} else {
writel(v, ((void __iomem *)adev->rmmio) + (reg * 4));
}
+
+   drm_dev_exit(idx);
  }
  
  /**

@@ -499,15 +523,22 @@ u32 amdgpu_io_rreg(struct amdgpu_device *adev, u32 reg)
   */
  void amdgpu_io_wreg(struct amdgpu_device *adev, u32 reg, u32 v)
  {
+   int idx;
+
if (adev->in_pci_err_recovery)
return;
  
+	if (!drm_dev_enter(&adev->ddev, &idx))

+   return;
+
if ((reg * 4) < adev->rio_mem_size)
iowrite32(v, adev->rio_mem + (reg * 4));
else {
iowrite32((reg * 4), adev->rio_mem + (mmMM_INDEX * 4));
iowrite32(v, adev->rio_mem + (mmMM_DATA * 4));
}
+
+   drm_dev_exit(idx);
  }
  
  /**

@@ -544,14 +575,21 @@ u32 amdgpu_mm_rdoorbell(struct amdgpu_device *adev, u32 
index)
   */
  void amdgpu_mm_wdoorbell(struct amdgpu_device *adev, u32 index, u32 v)
  {
+   int idx;
+
if (adev->in_pci_err_recovery)
return;
  
+	if (!drm_dev_enter(&adev->ddev, &idx))

+   return;
+
if (index < adev->doorbell.num_doorbells) {
writel(v, adev->doorbell.ptr + index);
} else {
DRM_ERROR("writing beyond doorbell aperture: 0x%08x!\n", index);
}
+
+   drm_dev_exit(idx);
  }
  
  /**

@@ -588,14 +626,21 @@ u64 amdgpu_mm_rdoorbell64(struct amdgpu_device *adev, u32 
index)
   */
  void amdgpu_mm_wdoorbell64(struct amdgpu_device *adev, u32 index, u64 v)
  {
+   int idx;
+
if (adev->in_pci_err_recovery)
return;
  
+	if (!drm_dev_enter(&adev->ddev, &idx))

+   return;
+
if (index < adev->doorbell.num_doorbells) {
atomic64_set((atomic64_t *)(adev->doorbell.ptr + index), v);
} else {
DRM_ERROR("writing beyond doorbell aperture: 0x%08x!\n", index);
  

Re: [PATCH v4 10/14] drm/amdgpu: Move some sysfs attrs creation to default_attr

2021-01-19 Thread Christian König

Am 18.01.21 um 22:01 schrieb Andrey Grodzovsky:

This allows removing the explicit creation and destruction
of those attrs, and thereby avoids warnings on device
finalization post physical device extraction.

Signed-off-by: Andrey Grodzovsky 


Acked-by: Christian König 


---
  drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c | 17 +
  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c  | 13 +
  drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c  | 25 ++---
  drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 14 +-
  4 files changed, 37 insertions(+), 32 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c
index 86add0f..0346e12 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c
@@ -1953,6 +1953,15 @@ static ssize_t amdgpu_atombios_get_vbios_version(struct 
device *dev,
  static DEVICE_ATTR(vbios_version, 0444, amdgpu_atombios_get_vbios_version,
   NULL);
  
+static struct attribute *amdgpu_vbios_version_attrs[] = {

+   &dev_attr_vbios_version.attr,
+   NULL
+};
+
+const struct attribute_group amdgpu_vbios_version_attr_group = {
+   .attrs = amdgpu_vbios_version_attrs
+};
+
  /**
   * amdgpu_atombios_fini - free the driver info and callbacks for atombios
   *
@@ -1972,7 +1981,6 @@ void amdgpu_atombios_fini(struct amdgpu_device *adev)
adev->mode_info.atom_context = NULL;
kfree(adev->mode_info.atom_card_info);
adev->mode_info.atom_card_info = NULL;
-   device_remove_file(adev->dev, &dev_attr_vbios_version);
  }
  
  /**

@@ -1989,7 +1997,6 @@ int amdgpu_atombios_init(struct amdgpu_device *adev)
  {
struct card_info *atom_card_info =
kzalloc(sizeof(struct card_info), GFP_KERNEL);
-   int ret;
  
  	if (!atom_card_info)

return -ENOMEM;
@@ -2027,12 +2034,6 @@ int amdgpu_atombios_init(struct amdgpu_device *adev)
amdgpu_atombios_allocate_fb_scratch(adev);
}
  
-	ret = device_create_file(adev->dev, &dev_attr_vbios_version);

-   if (ret) {
-   DRM_ERROR("Failed to create device file for VBIOS version\n");
-   return ret;
-   }
-
return 0;
  }
  
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c

index 9c0cd00..8fddd74 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -1587,6 +1587,18 @@ static struct pci_error_handlers amdgpu_pci_err_handler 
= {
.resume = amdgpu_pci_resume,
  };
  
+extern const struct attribute_group amdgpu_vram_mgr_attr_group;

+extern const struct attribute_group amdgpu_gtt_mgr_attr_group;
+extern const struct attribute_group amdgpu_vbios_version_attr_group;
+
+static const struct attribute_group *amdgpu_sysfs_groups[] = {
+   &amdgpu_vram_mgr_attr_group,
+   &amdgpu_gtt_mgr_attr_group,
+   &amdgpu_vbios_version_attr_group,
+   NULL,
+};
+
+
  static struct pci_driver amdgpu_kms_pci_driver = {
.name = DRIVER_NAME,
.id_table = pciidlist,
@@ -1595,6 +1607,7 @@ static struct pci_driver amdgpu_kms_pci_driver = {
.shutdown = amdgpu_pci_shutdown,
.driver.pm = &amdgpu_pm_ops,
.err_handler = &amdgpu_pci_err_handler,
+   .driver.dev_groups = amdgpu_sysfs_groups,
  };
  
  static int __init amdgpu_init(void)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c
index 8980329..3b7150e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c
@@ -77,6 +77,16 @@ static DEVICE_ATTR(mem_info_gtt_total, S_IRUGO,
  static DEVICE_ATTR(mem_info_gtt_used, S_IRUGO,
   amdgpu_mem_info_gtt_used_show, NULL);
  
+static struct attribute *amdgpu_gtt_mgr_attributes[] = {

+   &dev_attr_mem_info_gtt_total.attr,
+   &dev_attr_mem_info_gtt_used.attr,
+   NULL
+};
+
+const struct attribute_group amdgpu_gtt_mgr_attr_group = {
+   .attrs = amdgpu_gtt_mgr_attributes
+};
+
  static const struct ttm_resource_manager_func amdgpu_gtt_mgr_func;
  /**
   * amdgpu_gtt_mgr_init - init GTT manager and DRM MM
@@ -91,7 +101,6 @@ int amdgpu_gtt_mgr_init(struct amdgpu_device *adev, uint64_t 
gtt_size)
struct amdgpu_gtt_mgr *mgr = &adev->mman.gtt_mgr;
struct ttm_resource_manager *man = &mgr->manager;
uint64_t start, size;
-   int ret;
  
  	man->use_tt = true;

man->func = &amdgpu_gtt_mgr_func;
@@ -104,17 +113,6 @@ int amdgpu_gtt_mgr_init(struct amdgpu_device *adev, 
uint64_t gtt_size)
spin_lock_init(&mgr->lock);
atomic64_set(&mgr->available, gtt_size >> PAGE_SHIFT);
  
-	ret = device_create_file(adev->dev, &dev_attr_mem_info_gtt_total);

-   if (ret) {
-   DRM_ERROR("Failed to create device file mem_info_gtt_total\n");
-   return ret;
-   }
-  

Re: [PATCH v4 09/14] drm/amdgpu: Remap all page faults to per process dummy page.

2021-01-19 Thread Christian König

Am 18.01.21 um 22:01 schrieb Andrey Grodzovsky:

On device removal reroute all CPU mappings to dummy page
per drm_file instance or imported GEM object.

v4:
Update for modified ttm_bo_vm_dummy_page

Signed-off-by: Andrey Grodzovsky 


Reviewed-by: Christian König 


---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 21 -
  1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index 9fd2157..550dc5e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -49,6 +49,7 @@
  
  #include 

  #include 
+#include 
  
  #include "amdgpu.h"

  #include "amdgpu_object.h"
@@ -1982,18 +1983,28 @@ void amdgpu_ttm_set_buffer_funcs_status(struct 
amdgpu_device *adev, bool enable)
  static vm_fault_t amdgpu_ttm_fault(struct vm_fault *vmf)
  {
struct ttm_buffer_object *bo = vmf->vma->vm_private_data;
+   struct drm_device *ddev = bo->base.dev;
vm_fault_t ret;
+   int idx;
  
  	ret = ttm_bo_vm_reserve(bo, vmf);

if (ret)
return ret;
  
-	ret = amdgpu_bo_fault_reserve_notify(bo);

-   if (ret)
-   goto unlock;
+   if (drm_dev_enter(ddev, &idx)) {
+   ret = amdgpu_bo_fault_reserve_notify(bo);
+   if (ret) {
+   drm_dev_exit(idx);
+   goto unlock;
+   }
  
-	ret = ttm_bo_vm_fault_reserved(vmf, vmf->vma->vm_page_prot,

-  TTM_BO_VM_NUM_PREFAULT, 1);
+ret = ttm_bo_vm_fault_reserved(vmf, vmf->vma->vm_page_prot,
+   TTM_BO_VM_NUM_PREFAULT, 1);
+
+drm_dev_exit(idx);
+   } else {
+   ret = ttm_bo_vm_dummy_page(vmf, vmf->vma->vm_page_prot);
+   }
if (ret == VM_FAULT_RETRY && !(vmf->flags & FAULT_FLAG_RETRY_NOWAIT))
return ret;
  


___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH v4 08/14] drm/amdgpu: Fix a bunch of sdma code crash post device unplug

2021-01-19 Thread Christian König

Am 18.01.21 um 22:01 schrieb Andrey Grodzovsky:

We can't allocate and submit IBs post device unplug.

Signed-off-by: Andrey Grodzovsky 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 8 +++-
  1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index ad91c0c..5096351 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -31,6 +31,7 @@
  #include 
  
  #include 

+#include 
  #include "amdgpu.h"
  #include "amdgpu_trace.h"
  #include "amdgpu_amdkfd.h"
@@ -1604,7 +1605,10 @@ static int amdgpu_vm_bo_update_mapping(struct 
amdgpu_device *adev,
struct amdgpu_vm_update_params params;
enum amdgpu_sync_mode sync_mode;
uint64_t pfn;
-   int r;
+   int r, idx;
+
+   if (!drm_dev_enter(&adev->ddev, &idx))
+   return -ENOENT;


Why not -ENODEV?

  
  	memset(¶ms, 0, sizeof(params));

params.adev = adev;
@@ -1647,6 +1651,8 @@ static int amdgpu_vm_bo_update_mapping(struct 
amdgpu_device *adev,
if (r)
goto error_unlock;
  
+

+   drm_dev_exit(idx);


That's too early. You probably need to do this much further below, after 
the commit.


Christian.
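
Roughly what is being asked for, sketched against the tail of
amdgpu_vm_bo_update_mapping() (context lines approximated, so treat as
illustrative):

	r = vm->update_funcs->commit(&params, fence);

error_unlock:
	amdgpu_vm_eviction_unlock(vm);
	drm_dev_exit(idx);	/* leave the section only after the commit */
	return r;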


do {
uint64_t tmp, num_entries, addr;
  


___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH v4 07/14] drm/amdgpu: Register IOMMU topology notifier per device.

2021-01-19 Thread Christian König

Am 18.01.21 um 22:01 schrieb Andrey Grodzovsky:

Handle all DMA IOMMU group related dependencies before the
group is removed.

Signed-off-by: Andrey Grodzovsky 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu.h|  5 
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 46 ++
  drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c   |  2 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h   |  1 +
  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 10 +++
  drivers/gpu/drm/amd/amdgpu/amdgpu_object.h |  2 ++
  6 files changed, 65 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 478a7d8..2953420 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -51,6 +51,7 @@
  #include 
  #include 
  #include 
+#include 
  
  #include 

  #include 
@@ -1041,6 +1042,10 @@ struct amdgpu_device {
  
  	boolin_pci_err_recovery;

struct pci_saved_state  *pci_state;
+
+   struct notifier_block   nb;
+   struct blocking_notifier_head   notifier;
+   struct list_headdevice_bo_list;
  };
  
  static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 45e23e3..e99f4f1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -70,6 +70,8 @@
  #include 
  #include 
  
+#include 

+
  MODULE_FIRMWARE("amdgpu/vega10_gpu_info.bin");
  MODULE_FIRMWARE("amdgpu/vega12_gpu_info.bin");
  MODULE_FIRMWARE("amdgpu/raven_gpu_info.bin");
@@ -3200,6 +3202,39 @@ static const struct attribute *amdgpu_dev_attributes[] = 
{
  };
  
  
+static int amdgpu_iommu_group_notifier(struct notifier_block *nb,

+unsigned long action, void *data)
+{
+   struct amdgpu_device *adev = container_of(nb, struct amdgpu_device, nb);
+   struct amdgpu_bo *bo = NULL;
+
+   /*
+* Following is a set of IOMMU group dependencies taken care of before
+* device's IOMMU group is removed
+*/
+   if (action == IOMMU_GROUP_NOTIFY_DEL_DEVICE) {
+
+   spin_lock(&ttm_bo_glob.lru_lock);
+   list_for_each_entry(bo, &adev->device_bo_list, bo) {
+   if (bo->tbo.ttm)
+   ttm_tt_unpopulate(bo->tbo.bdev, bo->tbo.ttm);
+   }
+   spin_unlock(&ttm_bo_glob.lru_lock);


That approach won't work. ttm_tt_unpopulate() might sleep on an IOMMU lock.

You need to use a mutex here, or even better, make sure you can access the 
device_bo_list without a lock at this point.


Christian.
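
A minimal sketch of the first option, with a hypothetical device_bo_lock
mutex kept next to device_bo_list (the mutex is not in the patch):

	/* ttm_tt_unpopulate() can sleep, so no spinlock may be held here. */
	mutex_lock(&adev->device_bo_lock);
	list_for_each_entry(bo, &adev->device_bo_list, bo) {
		if (bo->tbo.ttm)
			ttm_tt_unpopulate(bo->tbo.bdev, bo->tbo.ttm);
	}
	mutex_unlock(&adev->device_bo_lock);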


+
+   if (adev->irq.ih.use_bus_addr)
+   amdgpu_ih_ring_fini(adev, &adev->irq.ih);
+   if (adev->irq.ih1.use_bus_addr)
+   amdgpu_ih_ring_fini(adev, &adev->irq.ih1);
+   if (adev->irq.ih2.use_bus_addr)
+   amdgpu_ih_ring_fini(adev, &adev->irq.ih2);
+
+   amdgpu_gart_dummy_page_fini(adev);
+   }
+
+   return NOTIFY_OK;
+}
+
+
  /**
   * amdgpu_device_init - initialize the driver
   *
@@ -3304,6 +3339,8 @@ int amdgpu_device_init(struct amdgpu_device *adev,
  
  	INIT_WORK(&adev->xgmi_reset_work, amdgpu_device_xgmi_reset_func);
  
+	INIT_LIST_HEAD(&adev->device_bo_list);

+
adev->gfx.gfx_off_req_count = 1;
adev->pm.ac_power = power_supply_is_system_supplied() > 0;
  
@@ -3575,6 +3612,15 @@ int amdgpu_device_init(struct amdgpu_device *adev,

if (amdgpu_device_cache_pci_state(adev->pdev))
pci_restore_state(pdev);
  
+	BLOCKING_INIT_NOTIFIER_HEAD(&adev->notifier);

+   adev->nb.notifier_call = amdgpu_iommu_group_notifier;
+
+   if (adev->dev->iommu_group) {
+   r = iommu_group_register_notifier(adev->dev->iommu_group, 
&adev->nb);
+   if (r)
+   goto failed;
+   }
+
return 0;
  
  failed:

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
index 0db9330..486ad6d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
@@ -92,7 +92,7 @@ static int amdgpu_gart_dummy_page_init(struct amdgpu_device 
*adev)
   *
   * Frees the dummy page used by the driver (all asics).
   */
-static void amdgpu_gart_dummy_page_fini(struct amdgpu_device *adev)
+void amdgpu_gart_dummy_page_fini(struct amdgpu_device *adev)
  {
if (!adev->dummy_page_addr)
return;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h
index afa2e28..5678d9c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.h
@@ -61,6 +61,7 @@ int amdgpu_gart_table_vram_pin(struct amdgpu_device *adev);
  void amdgpu_gart_table_vram_u

Re: [PATCH v4 05/14] drm/amdgpu: Split amdgpu_device_fini into early and late

2021-01-19 Thread Christian König

Am 18.01.21 um 22:01 schrieb Andrey Grodzovsky:

Some of the stuff in amdgpu_device_fini, such as disabling HW interrupts
and finalizing pending fences, must be done right away on
pci_remove, while most of the stuff which relates to finalizing and
releasing driver data structures can be kept until the
drm_driver.release hook is called, i.e. when the last device
reference is dropped.

v4: Change functions prefix early->hw and late->sw

Signed-off-by: Andrey Grodzovsky 


The fence and irq changes look sane to me, no idea for the rest.

Acked-by: Christian König 


---
  drivers/gpu/drm/amd/amdgpu/amdgpu.h|  6 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 26 ++
  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c|  7 ++-
  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 15 ++-
  drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c| 26 --
  drivers/gpu/drm/amd/amdgpu/amdgpu_irq.h|  3 ++-
  drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c| 12 +++-
  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c|  1 +
  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |  3 ++-
  drivers/gpu/drm/amd/amdgpu/cik_ih.c|  2 +-
  drivers/gpu/drm/amd/amdgpu/cz_ih.c |  2 +-
  drivers/gpu/drm/amd/amdgpu/iceland_ih.c|  2 +-
  drivers/gpu/drm/amd/amdgpu/navi10_ih.c |  2 +-
  drivers/gpu/drm/amd/amdgpu/si_ih.c |  2 +-
  drivers/gpu/drm/amd/amdgpu/tonga_ih.c  |  2 +-
  drivers/gpu/drm/amd/amdgpu/vega10_ih.c |  2 +-
  16 files changed, 78 insertions(+), 35 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index f77443c..478a7d8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1060,7 +1060,9 @@ static inline struct amdgpu_device 
*amdgpu_ttm_adev(struct ttm_bo_device *bdev)
  
  int amdgpu_device_init(struct amdgpu_device *adev,

   uint32_t flags);
-void amdgpu_device_fini(struct amdgpu_device *adev);
+void amdgpu_device_fini_hw(struct amdgpu_device *adev);
+void amdgpu_device_fini_sw(struct amdgpu_device *adev);
+
  int amdgpu_gpu_wait_for_idle(struct amdgpu_device *adev);
  
  void amdgpu_device_vram_access(struct amdgpu_device *adev, loff_t pos,

@@ -1273,6 +1275,8 @@ void amdgpu_driver_lastclose_kms(struct drm_device *dev);
  int amdgpu_driver_open_kms(struct drm_device *dev, struct drm_file 
*file_priv);
  void amdgpu_driver_postclose_kms(struct drm_device *dev,
 struct drm_file *file_priv);
+void amdgpu_driver_release_kms(struct drm_device *dev);
+
  int amdgpu_device_ip_suspend(struct amdgpu_device *adev);
  int amdgpu_device_suspend(struct drm_device *dev, bool fbcon);
  int amdgpu_device_resume(struct drm_device *dev, bool fbcon);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 348ac67..90c8353 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3579,14 +3579,12 @@ int amdgpu_device_init(struct amdgpu_device *adev,
   * Tear down the driver info (all asics).
   * Called at driver shutdown.
   */
-void amdgpu_device_fini(struct amdgpu_device *adev)
+void amdgpu_device_fini_hw(struct amdgpu_device *adev)
  {
dev_info(adev->dev, "amdgpu: finishing device.\n");
flush_delayed_work(&adev->delayed_init_work);
adev->shutdown = true;
  
-	kfree(adev->pci_state);

-
/* make sure IB test finished before entering exclusive mode
 * to avoid preemption on IB test
 * */
@@ -3603,11 +3601,24 @@ void amdgpu_device_fini(struct amdgpu_device *adev)
else
drm_atomic_helper_shutdown(adev_to_drm(adev));
}
-   amdgpu_fence_driver_fini(adev);
+   amdgpu_fence_driver_fini_hw(adev);
+
if (adev->pm_sysfs_en)
amdgpu_pm_sysfs_fini(adev);
+   if (adev->ucode_sysfs_en)
+   amdgpu_ucode_sysfs_fini(adev);
+   sysfs_remove_files(&adev->dev->kobj, amdgpu_dev_attributes);
+
+
amdgpu_fbdev_fini(adev);
+
+   amdgpu_irq_fini_hw(adev);
+}
+
+void amdgpu_device_fini_sw(struct amdgpu_device *adev)
+{
amdgpu_device_ip_fini(adev);
+   amdgpu_fence_driver_fini_sw(adev);
release_firmware(adev->firmware.gpu_info_fw);
adev->firmware.gpu_info_fw = NULL;
adev->accel_working = false;
@@ -3636,14 +3647,13 @@ void amdgpu_device_fini(struct amdgpu_device *adev)
adev->rmmio = NULL;
amdgpu_device_doorbell_fini(adev);
  
-	if (adev->ucode_sysfs_en)

-   amdgpu_ucode_sysfs_fini(adev);
-
-   sysfs_remove_files(&adev->dev->kobj, amdgpu_dev_attributes);
if (IS_ENABLED(CONFIG_PERF_EVENTS))
amdgpu_pmu_fini(adev);
if (adev->mman.discovery_bin)
amdgpu_discovery_fini(adev);
+
+   kfree(adev->pci_state);
+
  }
  
  
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/d

Re: [PATCH v4 04/14] drm/sched: Cancel and flush all outstanding jobs before finish.

2021-01-19 Thread Christian König

This is a bug fix and should probably be pushed separately to drm-misc-next.

Christian.

Am 18.01.21 um 22:01 schrieb Andrey Grodzovsky:

To avoid any possible use after free.

Signed-off-by: Andrey Grodzovsky 
Reviewed-by: Christian König 
---
  drivers/gpu/drm/scheduler/sched_main.c | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
b/drivers/gpu/drm/scheduler/sched_main.c
index 997aa15..92637b7 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -899,6 +899,9 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
if (sched->thread)
kthread_stop(sched->thread);
  
+	/* Confirm no work left behind accessing device structures */

+   cancel_delayed_work_sync(&sched->work_tdr);
+
sched->ready = false;
  }
  EXPORT_SYMBOL(drm_sched_fini);


___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH v4 01/14] drm/ttm: Remap all page faults to per process dummy page.

2021-01-19 Thread Christian König

Am 18.01.21 um 22:01 schrieb Andrey Grodzovsky:

On device removal reroute all CPU mappings to dummy page.

v3:
Remove loop to find DRM file and instead access it
by vma->vm_file->private_data. Move dummy page installation
into a separate function.

v4:
Map the entire BOs VA space into on demand allocated dummy page
on the first fault for that BO.

Signed-off-by: Andrey Grodzovsky 
---
  drivers/gpu/drm/ttm/ttm_bo_vm.c | 82 -
  include/drm/ttm/ttm_bo_api.h|  2 +
  2 files changed, 83 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
index 6dc96cf..ed89da3 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
@@ -34,6 +34,8 @@
  #include 
  #include 
  #include 
+#include 
+#include 
  #include 
  #include 
  #include 
@@ -380,25 +382,103 @@ vm_fault_t ttm_bo_vm_fault_reserved(struct vm_fault *vmf,
  }
  EXPORT_SYMBOL(ttm_bo_vm_fault_reserved);
  
+static void ttm_bo_release_dummy_page(struct drm_device *dev, void *res)

+{
+   struct page *dummy_page = (struct page *)res;
+
+   __free_page(dummy_page);
+}
+
+vm_fault_t ttm_bo_vm_dummy_page(struct vm_fault *vmf, pgprot_t prot)
+{
+   struct vm_area_struct *vma = vmf->vma;
+   struct ttm_buffer_object *bo = vma->vm_private_data;
+   struct ttm_bo_device *bdev = bo->bdev;
+   struct drm_device *ddev = bo->base.dev;
+   vm_fault_t ret = VM_FAULT_NOPAGE;
+   unsigned long address = vma->vm_start;
+   unsigned long num_prefault = (vma->vm_end - vma->vm_start) >> 
PAGE_SHIFT;
+   unsigned long pfn;
+   struct page *page;
+   int i;
+
+   /*
+* Wait for buffer data in transit, due to a pipelined
+* move.
+*/
+   ret = ttm_bo_vm_fault_idle(bo, vmf);
+   if (unlikely(ret != 0))
+   return ret;


This is superfluous and probably quite harmful here because we wait for 
the hardware to do something.


We map a dummy page instead of the real BO content to the whole range 
anyway, so no need to wait for the real BO content to show up.



+
+   /* Allocate new dummy page to map all the VA range in this VMA to it*/
+   page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+   if (!page)
+   return VM_FAULT_OOM;
+
+   pfn = page_to_pfn(page);
+
+   /*
+* Prefault the entire VMA range right away to avoid further faults
+*/
+   for (i = 0; i < num_prefault; ++i) {


Maybe rename the variable to num_pages. I was confused for a moment why 
we still prefault.


Alternative you can just drop i and do "for (addr = vma->vm_start; addr 
< vma->vm_end; addr += PAGE_SIZE)".



+
+   if (unlikely(address >= vma->vm_end))
+   break;
+
+   if (vma->vm_flags & VM_MIXEDMAP)
+   ret = vmf_insert_mixed_prot(vma, address,
+   __pfn_to_pfn_t(pfn, 
PFN_DEV),
+   prot);
+   else
+   ret = vmf_insert_pfn_prot(vma, address, pfn, prot);
+
+   /* Never error on prefaulted PTEs */
+   if (unlikely((ret & VM_FAULT_ERROR))) {
+   if (i == 0)
+   return VM_FAULT_NOPAGE;
+   else
+   break;


This should probably be modified to either always return the error or 
always ignore it.


Apart from that looks good to me.

Christian.
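
For instance, the "always ignore" flavour could look like this (a sketch;
it drops the i == 0 special case and reports VM_FAULT_NOPAGE uniformly):

	/* Best-effort prefault: every PTE points at the same dummy page,
	 * so per-PTE insertion errors are ignored uniformly. */
	for (address = vma->vm_start; address < vma->vm_end; address += PAGE_SIZE)
		(void)vmf_insert_pfn_prot(vma, address, pfn, prot);

	return VM_FAULT_NOPAGE;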


+   }
+
+   address += PAGE_SIZE;
+   }
+
+   /* Set the page to be freed using drmm release action */
+   if (drmm_add_action_or_reset(ddev, ttm_bo_release_dummy_page, page))
+   return VM_FAULT_OOM;
+
+   return ret;
+}
+EXPORT_SYMBOL(ttm_bo_vm_dummy_page);
+
  vm_fault_t ttm_bo_vm_fault(struct vm_fault *vmf)
  {
struct vm_area_struct *vma = vmf->vma;
pgprot_t prot;
struct ttm_buffer_object *bo = vma->vm_private_data;
+   struct drm_device *ddev = bo->base.dev;
vm_fault_t ret;
+   int idx;
  
  	ret = ttm_bo_vm_reserve(bo, vmf);

if (ret)
return ret;
  
  	prot = vma->vm_page_prot;

-   ret = ttm_bo_vm_fault_reserved(vmf, prot, TTM_BO_VM_NUM_PREFAULT, 1);
+   if (drm_dev_enter(ddev, &idx)) {
+   ret = ttm_bo_vm_fault_reserved(vmf, prot, 
TTM_BO_VM_NUM_PREFAULT, 1);
+   drm_dev_exit(idx);
+   } else {
+   ret = ttm_bo_vm_dummy_page(vmf, prot);
+   }
if (ret == VM_FAULT_RETRY && !(vmf->flags & FAULT_FLAG_RETRY_NOWAIT))
return ret;
  
  	dma_resv_unlock(bo->base.resv);
  
  	return ret;

+
+   return ret;
  }
  EXPORT_SYMBOL(ttm_bo_vm_fault);
  
diff --git a/include/drm/ttm/ttm_bo_api.h b/include/drm/ttm/ttm_bo_api.h

index e17be32..12fb240 100644
--- a/include/drm/ttm/ttm_bo_api.h
+++ b/include/drm/ttm/ttm_bo_api.h

Re: [PATCH v4 10/14] drm/amdgpu: Move some sysfs attrs creation to default_attr

2021-01-19 Thread Greg KH
On Mon, Jan 18, 2021 at 04:01:19PM -0500, Andrey Grodzovsky wrote:
>  static struct pci_driver amdgpu_kms_pci_driver = {
>   .name = DRIVER_NAME,
>   .id_table = pciidlist,
> @@ -1595,6 +1607,7 @@ static struct pci_driver amdgpu_kms_pci_driver = {
>   .shutdown = amdgpu_pci_shutdown,
>   .driver.pm = &amdgpu_pm_ops,
>   .err_handler = &amdgpu_pci_err_handler,
> + .driver.dev_groups = amdgpu_sysfs_groups,

Shouldn't this just be:
groups = amdgpu_sysfs_groups,

Why go to the "driver root" here?

Other than that tiny thing, looks good to me, nice cleanup!

greg k-h
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH v3 1/3] drm/amd/display: Add module parameter for freesync video mode

2021-01-19 Thread Pekka Paalanen
On Mon, 18 Jan 2021 09:36:47 -0500
Aurabindo Pillai  wrote:

> On Thu, 2021-01-14 at 11:14 +0200, Pekka Paalanen wrote:
> > 
> > Hi,
> > 
> > please document somewhere that ends up in git history (commit
> > message,
> > code comments, description of the parameter would be the best but
> > maybe
> > there isn't enough space?) what Christian König explained in
> > 
> >  
> > https://lists.freedesktop.org/archives/dri-devel/2020-December/291254.html
> > 
> > that this is a stop-gap feature intended to be removed as soon as
> > possible (when a better solution comes up, which could be years).
> > 
> > So far I have not seen a single mention of this intention in your
> > patch
> > submissions, and I think it is very important to make known.  
> 
> Hi,
> 
> Thanks for the headsup, I shall add the relevant info in the next
> version.
> 
> > 
> > I also did not see an explanation of why this instead of
> > manufacturing
> > these video modes in userspace (an idea mentioned by Christian in the
> > referenced email). I think that too should be part of a commit
> > message.  
> 
> This is an opt-in feature, which shall be superseded by a better
> solution. We also add a set of common modes for scaling similarly.
> Userspace can still add whatever mode they want. So I don't see a reason
> why this can't be in the kernel.

Hi,

sorry, I think that kind of thinking is backwards. There needs to be a
reason to put something in the kernel, and if there is no reason, then
it remains in userspace. So what's the reason to put this in the kernel?

One example reason why this should not be in the kernel is that the set
of video modes to manufacture is a kind of policy, which modes to add
and which not. Userspace knows what modes it needs, and establishing
the modes in the kernel instead is second-guessing what the userspace
would want. So if userspace needs to manufacture modes in userspace
anyway as some modes might be missed by the kernel, then why bother in
the kernel to begin with? Why should the kernel play catch-up with what
modes userspace wants when we already have everything userspace needs
to make its own modes, even to add them to the kernel mode list?

Does manufacturing these extra video modes to achieve fast timing
changes require AMD hardware-specific knowledge, as opposed to the
general VRR approach of simply adjusting the front porch?

Something like this should also be documented in a commit message. Or
if you insist that "no reason to not put this in the kernel" is reason
enough, then write that down, because it does not seem obvious to me or
others that this feature needs to be in the kernel.


Thanks,
pq
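
To make the userspace side concrete, a minimal libdrm sketch of
manufacturing a reduced-refresh variant of a probed mode and handing it
straight to a modeset (the naively halved timings are purely illustrative):

#include <stdio.h>
#include <xf86drm.h>
#include <xf86drmMode.h>

static int set_half_rate_mode(int fd, uint32_t crtc_id, uint32_t fb_id,
			      uint32_t connector_id,
			      const drmModeModeInfo *base)
{
	drmModeModeInfo mode = *base;	/* start from a probed mode */

	mode.clock /= 2;		/* halving the pixel clock halves vrefresh */
	mode.vrefresh /= 2;
	snprintf(mode.name, sizeof(mode.name), "%dx%d@%u",
		 mode.hdisplay, mode.vdisplay, mode.vrefresh);

	/* The kernel validates the timings at commit time; userspace
	 * owns the policy of which modes to manufacture. */
	return drmModeSetCrtc(fd, crtc_id, fb_id, 0, 0,
			      &connector_id, 1, &mode);
}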


___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] drm/amdgpu: change the fence ring wait timeout

2021-01-19 Thread Christian König

Am 19.01.21 um 04:23 schrieb Deng, Emily:

[AMD Official Use Only - Internal Distribution Only]


-Original Message-
From: amd-gfx  On Behalf Of
Christian König
Sent: Tuesday, January 19, 2021 12:04 AM
To: Deng, Emily ; Koenig, Christian
; Sun, Roy ; amd-
g...@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu: change the fence ring wait timeout

Am 18.01.21 um 12:56 schrieb Deng, Emily:

[AMD Official Use Only - Internal Distribution Only]


-Original Message-
From: Koenig, Christian 
Sent: Monday, January 18, 2021 3:49 PM
To: Deng, Emily ; Sun, Roy ;
amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu: change the fence ring wait timeout

Mhm, we could change amdgpu_fence_wait_empty() to timeout. But I
think that waiting forever here is intentional and the right thing to do.

What happens is that we wait for the hardware to make sure that
nothing is writing to any memory before we unload the driver.

Now the VCN block has crashed and doesn't respond, but we can't
guarantee that it is not accidentally writing anywhere.

The only alternative we have is to time out and proceed with the
driver unload, risking corrupting the memory we free during that
should the hardware continue to do something.

Hi Christian,
  Thanks for your suggestion, but it's still not quite clear to me. Could you
detail the solution that avoids the kernel lockup?

Well, as I said, the kernel locking up is intentional here.

So you think the lockup is better than just some memory corruption?


Yes exactly.


Because we could give it more time, such as waiting 60s. I don't think the fence 
would fail to be signaled within 60s if the engine is good. So 
when the engine is OK, the timeout won't cause memory corruption. When the 
engine is bad, the fence will never be signaled, so even if we force completion, 
it still won't cause memory corruption.


Unfortunately it's not that easy. See for example that some engines keep 
updating statistics (how many bytes written, samples encoded etc...) 
even when the engine is stuck in an endless loop.



As for SR-IOV, when the engine
is bad, we can still recover and reload the driver to make the driver 
work OK again, so we don't want the kernel lockup.


Well that is certainly something you could try, e.g. wait for empty with 
a timeout and, if that times out, trigger a GPU reset and wait again.


An alternative would be to see if we could disable system memory access 
from the device altogether with the PCIe config regs, if we can't get 
it idle again.


Regards,
Christian.
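
Sketching that escalation (amdgpu_fence_last() is a hypothetical helper
returning a reference to the ring's most recent fence;
dma_fence_wait_timeout() and amdgpu_device_gpu_recover() are existing
APIs):

static int amdgpu_fence_drain_or_reset(struct amdgpu_device *adev,
				       struct amdgpu_ring *ring)
{
	struct dma_fence *f = amdgpu_fence_last(ring);	/* hypothetical */
	signed long t;

	if (!f)
		return 0;	/* ring already idle */

	t = dma_fence_wait_timeout(f, false, msecs_to_jiffies(60 * 1000));
	if (t <= 0) {
		/* Engine looks dead: reset the GPU, then wait once more. */
		amdgpu_device_gpu_recover(adev, NULL);
		t = dma_fence_wait_timeout(f, false, msecs_to_jiffies(60 * 1000));
	}
	dma_fence_put(f);

	return t > 0 ? 0 : -ETIMEDOUT;
}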



Am 18.01.21 um 03:01 schrieb Deng, Emily:

[AMD Official Use Only - Internal Distribution Only]


-Original Message-
From: Koenig, Christian 
Sent: Thursday, January 14, 2021 9:50 PM
To: Deng, Emily ; Sun, Roy

;

amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu: change the fence ring wait timeout

Am 14.01.21 um 03:00 schrieb Deng, Emily:

[AMD Official Use Only - Internal Distribution Only]


-Original Message-
From: amd-gfx  On Behalf
Of Christian König
Sent: Wednesday, January 13, 2021 10:03 PM
To: Sun, Roy ; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu: change the fence ring wait
timeout

Am 13.01.21 um 07:36 schrieb Roy Sun:

This fixes a bug where, when the engine hangs, the fence ring will
wait without quitting and cause a kernel crash.

NAK, this blocking is intentionally unlimited, because otherwise we
will cause memory corruption.

What is the actual bug you are trying to fix here?

When some engine hangs during initialization, the IB test will
fail, and the fence will never come back, so we can never wait the fence
back.

Why do we need to wait here forever? We'd better not use a forever wait,
which will cause a kernel crash and lockup. And we have
amdgpu_fence_driver_force_completion, which will let the memory be freed. We
should remove all those forever waits and replace them with one
timeout, and do the correct error handling on timeout.

This wait here is to make sure we never overwrite the software
fence ring buffer. Without it we would not signal all fences in
amdgpu_fence_driver_force_completion() and would cause either a memory
leak or corruption.

Waiting here forever is the right thing to do even when that means
that the submission thread gets stuck forever, because that is still
better than memory corruption.

But this should never happen in practice and is only here for
precaution. So do you really see that in practice?

Yes, we hit the issue when the VCN IB test fails. Could you give some
suggestions about how to fix this?

[  958.301685] failed to read reg:1a6c0

[  959.036645] gmc_v10_0_process_interrupt: 42 callbacks suppressed

[  959.036653] amdgpu 0000:00:07.0: [mmhub] page fault (src_id:0
ring:0 vmid:0 pasid:0, for process  pid 0 thread  pid 0)

[  959.038043] amdgpu 0000:00:07.0:   in page starting at address
0x00567000 from client 18

[  959.039014] amdgpu 0000:00:07.0: [mmhub] page fault (src_id:0
ring:0 vmid:0 pasid:0, for process  pid 0