[PATCH] drm/amdgpu: drop log message in amdgpu_dpm_baco_reset()
The caller does this now for all reset types, so this is now a duplicate call.

Signed-off-by: Alex Deucher
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_dpm.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dpm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_dpm.c
index 2082c0acd216..1c661c7ab49f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dpm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dpm.c
@@ -1110,8 +1110,6 @@ int amdgpu_dpm_baco_reset(struct amdgpu_device *adev)
 	struct smu_context *smu = &adev->smu;
 	int ret = 0;
 
-	dev_info(adev->dev, "GPU BACO reset\n");
-
 	if (is_support_sw_smu(adev)) {
 		ret = smu_baco_enter(smu);
 		if (ret)
-- 
2.25.4
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
[pull] amdgpu drm-fixes-5.9
Hi Dave, Daniel,

Fixes for 5.9.

The following changes since commit f87812284172a9809820d10143b573d833cd3f75:

  drm/amdgpu: Fix bug where DPM is not enabled after hibernate and resume (2020-08-07 17:52:15 -0400)

are available in the Git repository at:

  git://people.freedesktop.org/~agd5f/linux tags/amd-drm-fixes-5.9-2020-08-12

for you to fetch changes up to f41ed88cbd6f025f7a683a11a74f901555fba11c:

  drm/amdgpu/display: use GFP_ATOMIC in dcn20_validate_bandwidth_internal (2020-08-10 18:09:46 -0400)

amd-drm-fixes-5.9-2020-08-12:

amdgpu:
- Fix allocation size
- SR-IOV fixes
- Vega20 SMU feature state caching fix
- Fix custom pptable handling
- Arcturus golden settings update
- Several display fixes

Anthony Koo (2):
      drm/amd/display: Fix LFC multiplier changing erratically
      drm/amd/display: Switch to immediate mode for updating infopackets

Aric Cyr (1):
      drm/amd/display: Fix incorrect backlight register offset for DCN

Christophe JAILLET (1):
      drm: amdgpu: Use the correct size when allocating memory

Daniel Kolesa (1):
      drm/amdgpu/display: use GFP_ATOMIC in dcn20_validate_bandwidth_internal

Evan Quan (2):
      drm/amd/powerplay: correct Vega20 cached smu feature state
      drm/amd/powerplay: correct UVD/VCE PG state on custom pptable uploading

Jaehyun Chung (1):
      drm/amd/display: Blank stream before destroying HDCP session

Liu ChengZhe (1):
      drm/amdgpu: Skip some registers config for SRIOV

Stylon Wang (1):
      drm/amd/display: Fix EDID parsing after resume from suspend

shiwu.zhang (1):
      drm/amdgpu: update gc golden register for arcturus

 drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c       |  2 +-
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c              |  1 +
 drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c           | 19 ++
 drivers/gpu/drm/amd/amdgpu/mmhub_v2_0.c            | 19 ++
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c  |  1 +
 drivers/gpu/drm/amd/display/dc/core/dc_link.c      |  3 +-
 .../gpu/drm/amd/display/dc/dce/dce_panel_cntl.h    |  2 +-
 .../amd/display/dc/dcn10/dcn10_stream_encoder.c    | 16
 .../amd/display/dc/dcn10/dcn10_stream_encoder.h    | 14 +++
 .../gpu/drm/amd/display/dc/dcn20/dcn20_resource.c  |  2 +-
 .../drm/amd/display/modules/freesync/freesync.c    | 36 ++
 drivers/gpu/drm/amd/powerplay/hwmgr/vega20_hwmgr.c | 44 ++
 12 files changed, 114 insertions(+), 45 deletions(-)
Re: [Nouveau] [PATCH 1/4] drm: retrieve EDID via ACPI _DDC method
On Wed, Aug 12, 2020 at 10:31 PM Daniel Dadap wrote: > > Thanks, Lukas. I've incorporated your feedback into my local tree, but > will wait for additional feedback from the individual DRM driver > maintainers before sending out a series v2. > > On 8/8/20 5:11 PM, Lukas Wunner wrote: > > On Mon, Jul 27, 2020 at 03:53:54PM -0500, Daniel Dadap wrote: > >> + for (i = 0; i < num_dod_entries; i++) { > >> + if (adr == dod_entries[i]) { > >> + ret = do_acpi_ddc(child->handle); > >> + > >> + if (ret != NULL) > >> + goto done; > > I guess ideally we'd want to correlate the display objects with > > drm_connectors or at least constrain the search to Display Type > > "Internal/Integrated Digital Flat Panel" instead of picking the > > first EDID found. Otherwise we might erroneously use the DDC > > for an externally attached display. > > > Yes, we'd definitely need a way to do this if this functionality ever > needs to be extended to systems with more than one _DDC method. > Unfortunately, this will be much easier said than done, since I'm not > aware of any way to reliably do map _DOD entries to connectors in a GPU > driver, especially when we're talking about possibly correlating > connectors on multiple GPUs which mux to the same internal display or > external connector. All systems which I am aware of that implement ACPI > _DDC do so for a single internal panel. I don't believe there's any > reason to ever retrieve an EDID via ACPI _DDC for an external panel, but > a hypothetical design with multiple internal panels, more than one of > which needs to retrieve an EDID via ACPI _DDC, would certainly be > problematic. > > > On at least the system I'm working with for the various switcheroo and > platform-x86 driver patches I've recently sent off, the dGPU has an ACPI > _DOD table and one _DDC method corresponding to one of the _DOD entries, > but the iGPU has neither a _DOD table nor a _DDC method. 
Either GPU can > be connected to the internal panel via the dynamically switchable mux, > and the internal panel's EDID is available via _DDC to allow a > disconnected GPU to read the EDID. Since only the DGPU has _DOD and > _DDC, and there's no obvious way to associate connectors on the iGPU > with connectors on the dGPU, I've implemented the ACPI _DDC EDID > retrieval with the "first available" implementation you see here. I'm > open to other ideas if you have them, but didn't see a good way to > search for the "right" _DDC implementation should there be more than one. > > > As for preventing the ACPI EDID retrieval from being used for external > panels, I've done this in the individual DRM drivers that call into the > new drm_edid_acpi() API since it seemed that each DRM driver had its own > way of distinguishing display connector types. If there's a good way to > filter for internal panels in DRM core, I'd be happy to do that instead. I can double check with our ACPI and vbios teams, but I'm not sure that we ever used the _DDC method on any AMD platforms. Even if we did, the driver is still able to get the integrated panel's mode info via other means. In general, external connectors would always get the EDID via i2c or aux. The only use case we ever had to hardcoding an EDID was for some really old server chips so they would always think something was connected if you wanted to attach a crash cart. That EDID was stored in the vbios, not ACPI. As for enumerating which displays were muxed or not and the local mappings to acpi ids, etc., we had a whole set of ATPX methods to do that if anyone is interested: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/amd/include/amd_acpi.h Alex ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
[PATCH v4] drm/amdgpu: add new trace event for page table update v3
This patch adds a new trace event to track the PTE update events. This specific event will provide information like:
- start and end of the virtual memory mapping
- HW engine flags for the map
- physical address for the mapping

This will be particularly useful for memory profiling tools (like RMV) which are monitoring the page table update events.

V2: Added physical address lookup logic in trace point
V3: switch to use __dynamic_array
    added nptes in the TP_printk arguments list
    added page size in the arg list
V4: Addressed Christian's review comments
    add start/end instead of seg
    use incr instead of page_sz to be accurate

Cc: Christian König
Cc: Alex Deucher
Signed-off-by: Christian König
Signed-off-by: Shashank Sharma
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h | 37 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c    |  9 --
 2 files changed, 44 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
index 63e734a125fb..df12cf8466c2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
@@ -321,6 +321,43 @@ DEFINE_EVENT(amdgpu_vm_mapping, amdgpu_vm_bo_cs,
 	    TP_ARGS(mapping)
 );
 
+TRACE_EVENT(amdgpu_vm_update_ptes,
+	    TP_PROTO(struct amdgpu_vm_update_params *p,
+		     uint64_t start, uint64_t end,
+		     unsigned int nptes, uint64_t dst,
+		     uint64_t incr, uint64_t flags),
+	TP_ARGS(p, start, end, nptes, dst, incr, flags),
+	TP_STRUCT__entry(
+		 __field(u64, start)
+		 __field(u64, end)
+		 __field(u64, flags)
+		 __field(unsigned int, nptes)
+		 __field(u64, incr)
+		 __dynamic_array(u64, dst, nptes)
+	),
+
+	TP_fast_assign(
+		unsigned int i;
+
+		__entry->start = start;
+		__entry->end = end;
+		__entry->flags = flags;
+		__entry->incr = incr;
+		__entry->nptes = nptes;
+		for (i = 0; i < nptes; ++i) {
+			u64 addr = p->pages_addr ? amdgpu_vm_map_gart(
+				p->pages_addr, dst) : dst;
+
+			((u64 *)__get_dynamic_array(dst))[i] = addr;
+			dst += incr;
+		}
+	),
+	TP_printk("start:0x%010llx end:0x%010llx, flags:0x%llx, incr:%llu,"
+		  " dst:\n%s", __entry->start, __entry->end, __entry->flags,
+		  __entry->incr, __print_array(
+		  __get_dynamic_array(dst), __entry->nptes, 8))
+);
+
 TRACE_EVENT(amdgpu_vm_set_ptes,
 	    TP_PROTO(uint64_t pe, uint64_t addr, unsigned count,
 		     uint32_t incr, uint64_t flags, bool direct),
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 71e005cf2952..b5dbb5e8bc61 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -1513,17 +1513,22 @@ static int amdgpu_vm_update_ptes(struct amdgpu_vm_update_params *params,
 		do {
 			uint64_t upd_end = min(entry_end, frag_end);
 			unsigned nptes = (upd_end - frag_start) >> shift;
+			uint64_t upd_flags = flags | AMDGPU_PTE_FRAG(frag);
 
 			/* This can happen when we set higher level PDs to
 			 * silent to stop fault floods.
 			 */
 			nptes = max(nptes, 1u);
+
+			trace_amdgpu_vm_update_ptes(params, frag_start, upd_end,
+						    nptes, dst, incr,
+						    upd_flags);
 			amdgpu_vm_update_flags(params, pt, cursor.level,
 					       pe_start, dst, nptes, incr,
-					       flags | AMDGPU_PTE_FRAG(frag));
+					       upd_flags);
 
 			pe_start += nptes * 8;
-			dst += (uint64_t)nptes * AMDGPU_GPU_PAGE_SIZE << shift;
+			dst += nptes * incr;
 
 			frag_start = upd_end;
 			if (frag_start >= frag_end) {
-- 
2.25.1
Re: [PATCH] drm/amd/powerplay: Fix uninitialized warning in arcturus ppt driver
[AMD Official Use Only - Internal Distribution Only] Hi Tom, drm/amdgpu: fix uninit-value in arcturus_log_thermal_throttling_event() the fixed patch has been merged into drm-next branch. Best Regards, Kevin From: amd-gfx on behalf of Quan, Evan Sent: Thursday, August 13, 2020 10:07 AM To: StDenis, Tom ; amd-gfx@lists.freedesktop.org Cc: StDenis, Tom Subject: RE: [PATCH] drm/amd/powerplay: Fix uninitialized warning in arcturus ppt driver [AMD Official Use Only - Internal Distribution Only] [AMD Official Use Only - Internal Distribution Only] Your change below should be able to suppress the compile warning. -arcturus_get_smu_metrics_data(smu, +ret = arcturus_get_smu_metrics_data(smu, METRICS_THROTTLER_STATUS, _status); +if (ret) { +dev_err(adev->dev, "Could not read from arcturus_get_smu_metrics_data()\n"); +return; +} + Setting *value as 0 may be unnecessary. However anyway this patch is reviewed-by: Evan Quan BR Evan -Original Message- From: amd-gfx On Behalf Of Tom St Denis Sent: Wednesday, August 12, 2020 8:21 PM To: amd-gfx@lists.freedesktop.org Cc: StDenis, Tom Subject: [PATCH] drm/amd/powerplay: Fix uninitialized warning in arcturus ppt driver Fixes: CC [M] drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm_mst_types.o drivers/gpu/drm/amd/amdgpu/../powerplay/arcturus_ppt.c: In function ‘arcturus_log_thermal_throttling_event’: drivers/gpu/drm/amd/amdgpu/../powerplay/arcturus_ppt.c:2223:24: warning: ‘throttler_status’ may be used uninitialized in this function [-Wmaybe-uninitialized] 2223 | if (throttler_status & logging_label[throttler_idx].feature_mask) { by making arcturus_get_smu_metrics_data() assign a default value (of zero) before any possible return point as well as simply error out of arcturus_log_thermal_throttling_event() if it fails. 
Signed-off-by: Tom St Denis --- drivers/gpu/drm/amd/powerplay/arcturus_ppt.c | 12 ++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c b/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c index 8b1025dc54fd..78f7ec95e4f5 100644 --- a/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c +++ b/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c @@ -551,6 +551,9 @@ static int arcturus_get_smu_metrics_data(struct smu_context *smu, mutex_lock(>metrics_lock); +// assign default value +*value = 0; + ret = smu_cmn_get_metrics_table_locked(smu, NULL, false); @@ -2208,15 +2211,20 @@ static const struct throttling_logging_label { }; static void arcturus_log_thermal_throttling_event(struct smu_context *smu) { -int throttler_idx, throtting_events = 0, buf_idx = 0; +int throttler_idx, throtting_events = 0, buf_idx = 0, ret; struct amdgpu_device *adev = smu->adev; uint32_t throttler_status; char log_buf[256]; -arcturus_get_smu_metrics_data(smu, +ret = arcturus_get_smu_metrics_data(smu, METRICS_THROTTLER_STATUS, _status); +if (ret) { +dev_err(adev->dev, "Could not read from arcturus_get_smu_metrics_data()\n"); +return; +} + memset(log_buf, 0, sizeof(log_buf)); for (throttler_idx = 0; throttler_idx < ARRAY_SIZE(logging_label); throttler_idx++) { -- 2.26.2 ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfxdata=02%7C01%7CKevin1.Wang%40amd.com%7Cf47512097dfc40168e1a08d83f2db359%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637328812813110771sdata=xedbRrZeOi0PK3EM%2FKBYhfXxdfpOkocXPjQjcQ5ErI0%3Dreserved=0 ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org 
Re: [Nouveau] [PATCH 1/4] drm: retrieve EDID via ACPI _DDC method
Thanks, Lukas. I've incorporated your feedback into my local tree, but will wait for additional feedback from the individual DRM driver maintainers before sending out a series v2. On 8/8/20 5:11 PM, Lukas Wunner wrote: On Mon, Jul 27, 2020 at 03:53:54PM -0500, Daniel Dadap wrote: + for (i = 0; i < num_dod_entries; i++) { + if (adr == dod_entries[i]) { + ret = do_acpi_ddc(child->handle); + + if (ret != NULL) + goto done; I guess ideally we'd want to correlate the display objects with drm_connectors or at least constrain the search to Display Type "Internal/Integrated Digital Flat Panel" instead of picking the first EDID found. Otherwise we might erroneously use the DDC for an externally attached display. Yes, we'd definitely need a way to do this if this functionality ever needs to be extended to systems with more than one _DDC method. Unfortunately, this will be much easier said than done, since I'm not aware of any way to reliably do map _DOD entries to connectors in a GPU driver, especially when we're talking about possibly correlating connectors on multiple GPUs which mux to the same internal display or external connector. All systems which I am aware of that implement ACPI _DDC do so for a single internal panel. I don't believe there's any reason to ever retrieve an EDID via ACPI _DDC for an external panel, but a hypothetical design with multiple internal panels, more than one of which needs to retrieve an EDID via ACPI _DDC, would certainly be problematic. On at least the system I'm working with for the various switcheroo and platform-x86 driver patches I've recently sent off, the dGPU has an ACPI _DOD table and one _DDC method corresponding to one of the _DOD entries, but the iGPU has neither a _DOD table nor a _DDC method. Either GPU can be connected to the internal panel via the dynamically switchable mux, and the internal panel's EDID is available via _DDC to allow a disconnected GPU to read the EDID. 
Since only the DGPU has _DOD and _DDC, and there's no obvious way to associate connectors on the iGPU with connectors on the dGPU, I've implemented the ACPI _DDC EDID retrieval with the "first available" implementation you see here. I'm open to other ideas if you have them, but didn't see a good way to search for the "right" _DDC implementation should there be more than one. As for preventing the ACPI EDID retrieval from being used for external panels, I've done this in the individual DRM drivers that call into the new drm_edid_acpi() API since it seemed that each DRM driver had its own way of distinguishing display connector types. If there's a good way to filter for internal panels in DRM core, I'd be happy to do that instead. ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
RE: [PATCH] drm/amd/powerplay: Fix uninitialized warning in arcturus ppt driver
[AMD Official Use Only - Internal Distribution Only]

Your change below should be able to suppress the compile warning.

-	arcturus_get_smu_metrics_data(smu,
+	ret = arcturus_get_smu_metrics_data(smu,
 					  METRICS_THROTTLER_STATUS,
 					  &throttler_status);
+	if (ret) {
+		dev_err(adev->dev, "Could not read from arcturus_get_smu_metrics_data()\n");
+		return;
+	}
+

Setting *value as 0 may be unnecessary. Anyway, this patch is
Reviewed-by: Evan Quan

BR
Evan

-----Original Message-----
From: amd-gfx On Behalf Of Tom St Denis
Sent: Wednesday, August 12, 2020 8:21 PM
To: amd-gfx@lists.freedesktop.org
Cc: StDenis, Tom
Subject: [PATCH] drm/amd/powerplay: Fix uninitialized warning in arcturus ppt driver

Fixes:

  CC [M]  drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm_mst_types.o
drivers/gpu/drm/amd/amdgpu/../powerplay/arcturus_ppt.c: In function 'arcturus_log_thermal_throttling_event':
drivers/gpu/drm/amd/amdgpu/../powerplay/arcturus_ppt.c:2223:24: warning: 'throttler_status' may be used uninitialized in this function [-Wmaybe-uninitialized]
 2223 |   if (throttler_status & logging_label[throttler_idx].feature_mask) {

by making arcturus_get_smu_metrics_data() assign a default value (of zero) before any possible return point as well as simply erroring out of arcturus_log_thermal_throttling_event() if it fails.

Signed-off-by: Tom St Denis
---
 drivers/gpu/drm/amd/powerplay/arcturus_ppt.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c b/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c
index 8b1025dc54fd..78f7ec95e4f5 100644
--- a/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c
+++ b/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c
@@ -551,6 +551,9 @@ static int arcturus_get_smu_metrics_data(struct smu_context *smu,
 	mutex_lock(&smu->metrics_lock);
 
+	// assign default value
+	*value = 0;
+
 	ret = smu_cmn_get_metrics_table_locked(smu, NULL, false);
@@ -2208,15 +2211,20 @@ static const struct throttling_logging_label {
 };
 static void arcturus_log_thermal_throttling_event(struct smu_context *smu)
 {
-	int throttler_idx, throtting_events = 0, buf_idx = 0;
+	int throttler_idx, throtting_events = 0, buf_idx = 0, ret;
 	struct amdgpu_device *adev = smu->adev;
 	uint32_t throttler_status;
 	char log_buf[256];
 
-	arcturus_get_smu_metrics_data(smu,
+	ret = arcturus_get_smu_metrics_data(smu,
 					  METRICS_THROTTLER_STATUS,
 					  &throttler_status);
+	if (ret) {
+		dev_err(adev->dev, "Could not read from arcturus_get_smu_metrics_data()\n");
+		return;
+	}
+
 	memset(log_buf, 0, sizeof(log_buf));
 	for (throttler_idx = 0; throttler_idx < ARRAY_SIZE(logging_label);
 	     throttler_idx++) {
-- 
2.26.2
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Re: [PATCH] drm/amdgpu: load ta firmware for navy_flounder
[AMD Public Use] Reviewed-by: Alex Deucher From: Bhawanpreet Lakha Sent: Wednesday, August 12, 2020 4:17 PM To: Kazlauskas, Nicholas ; Deucher, Alexander ; amd-gfx@lists.freedesktop.org Cc: Clements, John ; Lakha, Bhawanpreet Subject: [PATCH] drm/amdgpu: load ta firmware for navy_flounder call psp_int_ta_microcode() to parse the ta firmware. Signed-off-by: Bhawanpreet Lakha --- drivers/gpu/drm/amd/amdgpu/psp_v11_0.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c b/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c index d488d250805d..6c5d9612abcb 100644 --- a/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c +++ b/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c @@ -58,7 +58,7 @@ MODULE_FIRMWARE("amdgpu/arcturus_ta.bin"); MODULE_FIRMWARE("amdgpu/sienna_cichlid_sos.bin"); MODULE_FIRMWARE("amdgpu/sienna_cichlid_ta.bin"); MODULE_FIRMWARE("amdgpu/navy_flounder_sos.bin"); -MODULE_FIRMWARE("amdgpu/navy_flounder_asd.bin"); +MODULE_FIRMWARE("amdgpu/navy_flounder_ta.bin"); /* address block */ #define smnMP1_FIRMWARE_FLAGS 0x3010024 @@ -179,12 +179,11 @@ static int psp_v11_0_init_microcode(struct psp_context *psp) } break; case CHIP_SIENNA_CICHLID: + case CHIP_NAVY_FLOUNDER: err = psp_init_ta_microcode(>psp, chip_name); if (err) return err; break; - case CHIP_NAVY_FLOUNDER: - break; default: BUG(); } -- 2.17.1 ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
[PATCH] drm/amdgpu: load ta firmware for navy_flounder
call psp_init_ta_microcode() to parse the ta firmware.

Signed-off-by: Bhawanpreet Lakha
---
 drivers/gpu/drm/amd/amdgpu/psp_v11_0.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c b/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
index d488d250805d..6c5d9612abcb 100644
--- a/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
@@ -58,7 +58,7 @@ MODULE_FIRMWARE("amdgpu/arcturus_ta.bin");
 MODULE_FIRMWARE("amdgpu/sienna_cichlid_sos.bin");
 MODULE_FIRMWARE("amdgpu/sienna_cichlid_ta.bin");
 MODULE_FIRMWARE("amdgpu/navy_flounder_sos.bin");
-MODULE_FIRMWARE("amdgpu/navy_flounder_asd.bin");
+MODULE_FIRMWARE("amdgpu/navy_flounder_ta.bin");
 
 /* address block */
 #define smnMP1_FIRMWARE_FLAGS		0x3010024
@@ -179,12 +179,11 @@ static int psp_v11_0_init_microcode(struct psp_context *psp)
 		}
 		break;
 	case CHIP_SIENNA_CICHLID:
+	case CHIP_NAVY_FLOUNDER:
 		err = psp_init_ta_microcode(&adev->psp, chip_name);
 		if (err)
 			return err;
 		break;
-	case CHIP_NAVY_FLOUNDER:
-		break;
 	default:
 		BUG();
 	}
-- 
2.17.1
Re: [PATCH] drm/amdgpu: revert "fix system hang issue during GPU reset"
On Wed, Aug 12, 2020 at 11:54 AM Christian König wrote: > > The whole approach wasn't thought through till the end. > > We already had a reset lock like this in the past and it caused the same > problems like this one. > > Completely revert the patch for now and add individual trylock protection to > the hardware access functions as necessary. > > This reverts commit edad8312cbbf9a33c86873fc4093664f150dd5c1. > > Signed-off-by: Christian König This also broke GPU overclocking. Acked-by: Alex Deucher > --- > drivers/gpu/drm/amd/amdgpu/amdgpu.h | 9 +- > drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c| 40 +- > .../drm/amd/amdgpu/amdgpu_amdkfd_gfx_v10.c| 2 +- > .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v7.c | 2 +- > .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v8.c | 2 +- > .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c | 2 +- > .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 7 - > drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c| 4 - > drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 4 - > drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 14 +- > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c| 57 ++- > drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 4 - > drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 6 +- > drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 14 +- > drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 4 - > drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c| 353 -- > drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 4 +- > drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 4 +- > drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 2 +- > drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c | 3 +- > drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c | 2 +- > drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h | 4 +- > drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 11 +- > drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h | 3 +- > drivers/gpu/drm/amd/amdgpu/atom.c | 1 - > drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c| 10 +- > drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c | 6 +- > drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 10 +- > drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c| 4 +- > drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c | 2 +- 
> drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c | 2 +- > drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 7 +- > drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 13 +- > drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 13 +- > .../drm/amd/amdkfd/kfd_device_queue_manager.c | 16 +- > drivers/gpu/drm/amd/amdkfd/kfd_process.c | 4 - > .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 4 +- > drivers/gpu/drm/amd/powerplay/amdgpu_smu.c| 2 +- > .../drm/amd/powerplay/hwmgr/vega20_hwmgr.c| 2 +- > 39 files changed, 184 insertions(+), 469 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h > b/drivers/gpu/drm/amd/amdgpu/amdgpu.h > index 1f9d97f61aa5..9c6fb38ce59d 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h > @@ -952,9 +952,9 @@ struct amdgpu_device { > boolin_suspend; > boolin_hibernate; > > - atomic_tin_gpu_reset; > + boolin_gpu_reset; > enum pp_mp1_state mp1_state; > - struct rw_semaphore reset_sem; > + struct mutex lock_reset; > struct amdgpu_doorbell_index doorbell_index; > > struct mutexnotifier_lock; > @@ -1269,9 +1269,4 @@ static inline bool amdgpu_is_tmz(struct amdgpu_device > *adev) > return adev->gmc.tmz_enabled; > } > > -static inline bool amdgpu_in_reset(struct amdgpu_device *adev) > -{ > - return atomic_read(>in_gpu_reset) ? 
true : false; > -} > - > #endif > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c > index 9738dccb1c2c..0effc1d46824 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c > @@ -244,14 +244,11 @@ int amdgpu_amdkfd_alloc_gtt_mem(struct kgd_dev *kgd, > size_t size, > if (cp_mqd_gfx9) > bp.flags |= AMDGPU_GEM_CREATE_CP_MQD_GFX9; > > - if (!down_read_trylock(>reset_sem)) > - return -EIO; > - > r = amdgpu_bo_create(adev, , ); > if (r) { > dev_err(adev->dev, > "failed to allocate BO for amdkfd (%d)\n", r); > - goto err; > + return r; > } > > /* map the buffer */ > @@ -286,7 +283,6 @@ int amdgpu_amdkfd_alloc_gtt_mem(struct kgd_dev *kgd, > size_t size, > > amdgpu_bo_unreserve(bo); > > - up_read(>reset_sem); > return 0; > > allocate_mem_kmap_bo_failed: > @@ -295,25 +291,19 @@ int amdgpu_amdkfd_alloc_gtt_mem(struct kgd_dev *kgd, > size_t size, >
[PATCH] drm/amdgpu: revert "fix system hang issue during GPU reset"
The whole approach wasn't thought through till the end. We already had a reset lock like this in the past and it caused the same problems like this one. Completely revert the patch for now and add individual trylock protection to the hardware access functions as necessary. This reverts commit edad8312cbbf9a33c86873fc4093664f150dd5c1. Signed-off-by: Christian König --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 9 +- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c| 40 +- .../drm/amd/amdgpu/amdgpu_amdkfd_gfx_v10.c| 2 +- .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v7.c | 2 +- .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v8.c | 2 +- .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c | 2 +- .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 7 - drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c| 4 - drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 4 - drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 14 +- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c| 57 ++- drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 4 - drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 6 +- drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 14 +- drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 4 - drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c| 353 -- drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 4 +- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 4 +- drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c | 3 +- drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h | 4 +- drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 11 +- drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h | 3 +- drivers/gpu/drm/amd/amdgpu/atom.c | 1 - drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c| 10 +- drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c | 6 +- drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 10 +- drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c| 4 +- drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c | 2 +- drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c | 2 +- drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 7 +- drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 13 +- drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 13 +- 
.../drm/amd/amdkfd/kfd_device_queue_manager.c | 16 +- drivers/gpu/drm/amd/amdkfd/kfd_process.c | 4 - .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 4 +- drivers/gpu/drm/amd/powerplay/amdgpu_smu.c| 2 +- .../drm/amd/powerplay/hwmgr/vega20_hwmgr.c| 2 +- 39 files changed, 184 insertions(+), 469 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index 1f9d97f61aa5..9c6fb38ce59d 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -952,9 +952,9 @@ struct amdgpu_device { boolin_suspend; boolin_hibernate; - atomic_tin_gpu_reset; + boolin_gpu_reset; enum pp_mp1_state mp1_state; - struct rw_semaphore reset_sem; + struct mutex lock_reset; struct amdgpu_doorbell_index doorbell_index; struct mutexnotifier_lock; @@ -1269,9 +1269,4 @@ static inline bool amdgpu_is_tmz(struct amdgpu_device *adev) return adev->gmc.tmz_enabled; } -static inline bool amdgpu_in_reset(struct amdgpu_device *adev) -{ - return atomic_read(>in_gpu_reset) ? 
true : false; -} - #endif diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c index 9738dccb1c2c..0effc1d46824 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c @@ -244,14 +244,11 @@ int amdgpu_amdkfd_alloc_gtt_mem(struct kgd_dev *kgd, size_t size, if (cp_mqd_gfx9) bp.flags |= AMDGPU_GEM_CREATE_CP_MQD_GFX9; - if (!down_read_trylock(>reset_sem)) - return -EIO; - r = amdgpu_bo_create(adev, , ); if (r) { dev_err(adev->dev, "failed to allocate BO for amdkfd (%d)\n", r); - goto err; + return r; } /* map the buffer */ @@ -286,7 +283,6 @@ int amdgpu_amdkfd_alloc_gtt_mem(struct kgd_dev *kgd, size_t size, amdgpu_bo_unreserve(bo); - up_read(>reset_sem); return 0; allocate_mem_kmap_bo_failed: @@ -295,25 +291,19 @@ int amdgpu_amdkfd_alloc_gtt_mem(struct kgd_dev *kgd, size_t size, amdgpu_bo_unreserve(bo); allocate_mem_reserve_bo_failed: amdgpu_bo_unref(); -err: - up_read(>reset_sem); + return r; } void amdgpu_amdkfd_free_gtt_mem(struct kgd_dev *kgd, void *mem_obj) { - struct amdgpu_device *adev = (struct amdgpu_device *)kgd; struct amdgpu_bo *bo = (struct amdgpu_bo *) mem_obj; -
Re: [PATCH] drm/amdgpu: adjust the pid in the grab_id trace point
Am 12.08.20 um 17:19 schrieb Steven Rostedt: On Wed, 12 Aug 2020 16:36:36 +0200 Christian König wrote: Am 12.08.20 um 16:17 schrieb Steven Rostedt: On Fri, Aug 07, 2020 at 03:36:58PM +0200, Christian König wrote: Trace something useful instead of the pid of a kernel thread here. Signed-off-by: Christian König --- drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h index 5da20fc166d9..07f99ef69d91 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h @@ -228,6 +228,7 @@ TRACE_EVENT(amdgpu_vm_grab_id, ), TP_fast_assign( + __entry->ent.pid = vm->task_info.pid; If the ent.pid is not the pid you are interested in for this trace event, just add a "pid" field to the trace event and place it there. Do not modify the generic pid that is recorded, as we would like that to be consistent for all trace events. The problem my userspace guys have is that this doesn't work with "trace-cmd -P $pid". But I think I can teach them how filters work :) Yep, trace-cmd record -e event -f "pid == $pid" The "ent.pid" turns into "common_pid" in the field, leaving "pid" free to use. Other trace events (like sched_waking) record a pid field that is not the same as the pid of the executing task. Yes, we thought about this alternative as well. The "ent.pid" should always be the pid of the task that executed the event. Why? For the case here we just execute a work item in the background for an userspace process. Tracing the pid of the worker pool which executes it doesn't seem to make to much sense. Maybe not for you, but it does for me. All trace events show what happened when it happened and who executed it. I like to see what worker threads are executing. I may filter on the worker thread, and by changing the ent.pid, I wont see what it is doing. That's enough explanation for me. Going with the separate pid field then. 
Thanks, Christian. That said, I think I may add a feature to a trace event for a special filter to say, "test field to the set_event_pid", and if it exists in that file to include that event in the filtered trace. This would include sched_waking trace events as well. That way "trace-cmd record -P $pid" will still work for your case. -- Steve ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Re: [PATCH] drm/amdgpu: fix a potential circular locking dependency
On 12.08.20 at 17:07, Felix Kuehling wrote: On 2020-08-12 at 4:53 a.m., Christian König wrote: On 12.08.20 at 03:19, Li, Dennis wrote: [AMD Official Use Only - Internal Distribution Only] Hi, Felix, Re: It may be better to fix it the other way around in amdgpu_amdkfd_alloc_gtt_mem. Always take the reset_sem inside the reservation. Otherwise you will never be able to take the reset_sem while any BOs are reserved. That's probably going to cause you other problems later. [Dennis Li] Thanks for finding the potential issue, I will change it in version 2. Re: That makes me wonder, why do you need the reset_sem in amdgpu_amdkfd_alloc_gtt_mem in the first place? There is no obvious hardware access in that function. Is it for amdgpu_ttm_alloc_gart updating the GART table through HDP? [Dennis Li] Yes, amdgpu_gart_bind will flush HDP and TLB. I have considered only protecting amdgpu_ttm_alloc_gart before. That access is irrelevant and the lock should be removed or changed into a trylock. See we need the HDP flush only because the hardware could have accessed the data before. But after a GPU reset the HDP is known to be clean, so this doesn't need any protection. But I worry other functions will access hardware in the future. Therefore I select an aggressive approach which locks reset_sem at the beginning of entry functions of the amdgpu driver. This is not a good idea. We used to have such a global lock before and removed it because it caused all kinds of problems. In most cases it's a global read-lock, so most of the time it should not impact concurrency. The only "writer" that blocks all readers is the GPU reset. When was this added? Looks like it slipped under my radar or I wasn't awake enough at that moment. The change that added the reset_sem added read-locks in about 70 places. I'm still concerned that this will be hard to maintain, to make sure future HW access will do the necessary locking. I guess that's the motivation for doing the locking more coarse-grained. 
Taking the lock higher up the call chains means fewer places that take the lock and new low-level code may not need its own locking in the future because it's already covered by higher-level callers. On the other hand, it needs to be low enough in the call chains to avoid recursive locking or circular dependencies with other locks. Well the whole approach is a NAK. We already tried this and it didn't work at all. See how much effort it took to remove the global reset rwsem again in the past. We have only minimal functions accessing the hardware directly which can run concurrently with a GPU reset. Everything else works through ring buffers where the processing is stopped before we reset the GPU. So the whole thing is just a bad idea as far as I can see. Regards, Christian. Regards, Felix Christian. Best Regards Dennis Li -Original Message- From: Kuehling, Felix Sent: Tuesday, August 11, 2020 9:57 PM To: Li, Dennis ; amd-gfx@lists.freedesktop.org; Deucher, Alexander ; Zhang, Hawking ; Koenig, Christian Subject: Re: [PATCH] drm/amdgpu: fix a potential circular locking dependency On 2020-08-11 at 5:32 a.m., Dennis Li wrote: [ 653.902305] == [ 653.902928] WARNING: possible circular locking dependency detected [ 653.903517] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G OE [ 653.904098] -- [ 653.904675] amdgpu_test/3975 is trying to acquire lock: [ 653.905241] 97848f8647a0 (&adev->reset_sem){.+.+}, at: amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu] [ 653.905953] but task is already holding lock: [ 653.907087] 9744adbee1f8 (reservation_ww_class_mutex){+.+.}, at: ttm_eu_reserve_buffers+0x1ae/0x520 [ttm] [ 653.907694] which lock already depends on the new lock. 
[ 653.909423] the existing dependency chain (in reverse order) is: [ 653.910594] -> #1 (reservation_ww_class_mutex){+.+.}: [ 653.911759] __ww_mutex_lock.constprop.15+0xca/0x1120 [ 653.912350] ww_mutex_lock+0x73/0x80 [ 653.913044] amdgpu_amdkfd_alloc_gtt_mem+0xde/0x380 [amdgpu] [ 653.913724] kgd2kfd_device_init+0x13f/0x5e0 [amdgpu] [ 653.914388] amdgpu_amdkfd_device_init+0x155/0x190 [amdgpu] [ 653.915033] amdgpu_device_init+0x1303/0x1e10 [amdgpu] [ 653.915685] amdgpu_driver_load_kms+0x5c/0x2c0 [amdgpu] [ 653.916349] amdgpu_pci_probe+0x11d/0x200 [amdgpu] [ 653.916959] local_pci_probe+0x47/0xa0 [ 653.917570] work_for_cpu_fn+0x1a/0x30 [ 653.918184] process_one_work+0x29e/0x630 [ 653.918803] worker_thread+0x22b/0x3f0 [ 653.919427] kthread+0x12f/0x150 [ 653.920047] ret_from_fork+0x3a/0x50 [ 653.920661] -> #0 (&adev->reset_sem){.+.+}: [ 653.921893] __lock_acquire+0x13ec/0x16e0 [ 653.922531] lock_acquire+0xb8/0x1c0 [
Re: [PATCH] drm/amdgpu: adjust the pid in the grab_id trace point
On Wed, 12 Aug 2020 16:36:36 +0200 Christian König wrote: > On 12.08.20 at 16:17, Steven Rostedt wrote: > > On Fri, Aug 07, 2020 at 03:36:58PM +0200, Christian König wrote: > >> Trace something useful instead of the pid of a kernel thread here. > >> > >> Signed-off-by: Christian König > >> --- > >> drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h | 1 + > >> 1 file changed, 1 insertion(+) > >> > >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h > >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h > >> index 5da20fc166d9..07f99ef69d91 100644 > >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h > >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h > >> @@ -228,6 +228,7 @@ TRACE_EVENT(amdgpu_vm_grab_id, > >> ), > >> > >>TP_fast_assign( > >> + __entry->ent.pid = vm->task_info.pid; > > If the ent.pid is not the pid you are interested in for this trace event, > > just > > add a "pid" field to the trace event and place it there. Do not modify the > > generic pid that is recorded, as we would like that to be consistent for all > > trace events. > > The problem my userspace guys have is that this doesn't work with > "trace-cmd -P $pid". > > But I think I can teach them how filters work :) Yep, trace-cmd record -e event -f "pid == $pid" > > > The "ent.pid" turns into "common_pid" in the field, leaving "pid" free to > > use. > > Other trace events (like sched_waking) record a pid field that is not the > > same > > as the pid of the executing task. > > Yes, we thought about this alternative as well. > > > The "ent.pid" should always be the pid of the task that executed the event. > > > > Why? For the case here we just execute a work item in the background for > a userspace process. > > Tracing the pid of the worker pool which executes it doesn't seem to > make too much sense. Maybe not for you, but it does for me. All trace events show what happened when it happened and who executed it. I like to see what worker threads are executing. 
I may filter on the worker thread, and by changing the ent.pid, I won't see what it is doing. That said, I think I may add a feature to a trace event for a special filter to say, "test field to the set_event_pid", and if it exists in that file to include that event in the filtered trace. This would include sched_waking trace events as well. That way "trace-cmd record -P $pid" will still work for your case. -- Steve
Re: [PATCH] drm/amd/powerplay: Fix uninitialized warning in arcturus ppt driver
On 8/12/20 2:43 PM, StDenis, Tom wrote: [AMD Official Use Only - Internal Distribution Only] Possibly, but since the arcturus_get_smu_metrics_data() can error out we should check that return value, no? Yes, we need that return check. (also setting *value to 0 avoids this bug in the future...). I think the caller should initialize the result value before passing it to arcturus_get_smu_metrics_data, as the warning is generated from the calling function. My comment only concerns setting "value" to 0, which seems wrong. The rest of the patch is fine. Nirmoy Tom From: Das, Nirmoy Sent: Wednesday, August 12, 2020 08:40 To: StDenis, Tom; amd-gfx@lists.freedesktop.org Subject: Re: [PATCH] drm/amd/powerplay: Fix uninitialized warning in arcturus ppt driver On 8/12/20 2:20 PM, Tom St Denis wrote: Fixes: CC [M] drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm_mst_types.o drivers/gpu/drm/amd/amdgpu/../powerplay/arcturus_ppt.c: In function ‘arcturus_log_thermal_throttling_event’: drivers/gpu/drm/amd/amdgpu/../powerplay/arcturus_ppt.c:2223:24: warning: ‘throttler_status’ may be used uninitialized in this function [-Wmaybe-uninitialized] 2223 | if (throttler_status & logging_label[throttler_idx].feature_mask) { by making arcturus_get_smu_metrics_data() assign a default value (of zero) before any possible return point as well as simply error out of arcturus_log_thermal_throttling_event() if it fails. 
Signed-off-by: Tom St Denis --- drivers/gpu/drm/amd/powerplay/arcturus_ppt.c | 12 ++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c b/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c index 8b1025dc54fd..78f7ec95e4f5 100644 --- a/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c +++ b/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c @@ -551,6 +551,9 @@ static int arcturus_get_smu_metrics_data(struct smu_context *smu, mutex_lock(&smu->metrics_lock); + // assign default value + *value = 0; + ret = smu_cmn_get_metrics_table_locked(smu, NULL, false); @@ -2208,15 +2211,20 @@ static const struct throttling_logging_label { }; static void arcturus_log_thermal_throttling_event(struct smu_context *smu) { - int throttler_idx, throtting_events = 0, buf_idx = 0; + int throttler_idx, throtting_events = 0, buf_idx = 0, ret; struct amdgpu_device *adev = smu->adev; uint32_t throttler_status; I think initializing throttler_status here should resolve the warning. char log_buf[256]; - arcturus_get_smu_metrics_data(smu, + ret = arcturus_get_smu_metrics_data(smu, METRICS_THROTTLER_STATUS, &throttler_status); + if (ret) { + dev_err(adev->dev, "Could not read from arcturus_get_smu_metrics_data()\n"); + return; + } + memset(log_buf, 0, sizeof(log_buf)); for (throttler_idx = 0; throttler_idx < ARRAY_SIZE(logging_label); throttler_idx++) {
Re: [PATCH] drm/amdgpu: fix a potential circular locking dependency
On 2020-08-12 at 4:53 a.m., Christian König wrote: > On 12.08.20 at 03:19, Li, Dennis wrote: >> [AMD Official Use Only - Internal Distribution Only] >> >> Hi, Felix, >> >> Re: It may be better to fix it the other way around in >> amdgpu_amdkfd_alloc_gtt_mem. Always take the reset_sem inside the >> reservation. Otherwise you will never be able to take the reset_sem >> while any BOs are reserved. That's probably going to cause you other >> problems later. >> [Dennis Li] Thanks for finding the potential issue, I will change >> it in version 2. >> >> Re: That makes me wonder, why do you need the reset_sem in >> amdgpu_amdkfd_alloc_gtt_mem in the first place? There is no obvious >> hardware access in that function. Is it for amdgpu_ttm_alloc_gart >> updating the GART table through HDP? >> [Dennis Li] Yes, amdgpu_gart_bind will flush HDP and TLB. I have >> considered only protecting amdgpu_ttm_alloc_gart before. > > That access is irrelevant and the lock should be removed or changed > into a trylock. > > See we need the HDP flush only because the hardware could have > accessed the data before. > > But after a GPU reset the HDP is known to be clean, so this doesn't > need any protection. > >> But I worry other functions will access hardware in the future. >> Therefore I select an aggressive approach which locks reset_sem at the >> beginning of entry functions of the amdgpu driver. > > This is not a good idea. We used to have such a global lock before and > removed it because it caused all kinds of problems. In most cases it's a global read-lock, so most of the time it should not impact concurrency. The only "writer" that blocks all readers is the GPU reset. > > When was this added? Looks like it slipped under my radar or I wasn't > awake enough at that moment. The change that added the reset_sem added read-locks in about 70 places. I'm still concerned that this will be hard to maintain, to make sure future HW access will do the necessary locking. 
I guess that's the motivation for doing the locking more coarse-grained. Taking the lock higher up the call chains means fewer places that take the lock and new low-level code may not need its own locking in the future because it's already covered by higher-level callers. On the other hand, it needs to be low enough in the call chains to avoid recursive locking or circular dependencies with other locks. Regards, Felix > > Christian. > >> >> Best Regards >> Dennis Li >> -Original Message- >> From: Kuehling, Felix >> Sent: Tuesday, August 11, 2020 9:57 PM >> To: Li, Dennis ; amd-gfx@lists.freedesktop.org; >> Deucher, Alexander ; Zhang, Hawking >> ; Koenig, Christian >> Subject: Re: [PATCH] drm/amdgpu: fix a potential circular locking >> dependency >> >> On 2020-08-11 at 5:32 a.m., Dennis Li wrote: >>> [ 653.902305] == >>> [ 653.902928] WARNING: possible circular locking dependency detected >>> [ 653.903517] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: >>> G OE >>> [ 653.904098] -- >>> [ 653.904675] amdgpu_test/3975 is trying to acquire lock: >>> [ 653.905241] 97848f8647a0 (&adev->reset_sem){.+.+}, at: >>> amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu] [ 653.905953] >>> but task is already holding lock: >>> [ 653.907087] 9744adbee1f8 (reservation_ww_class_mutex){+.+.}, >>> at: ttm_eu_reserve_buffers+0x1ae/0x520 [ttm] [ 653.907694] >>> which lock already depends on the new lock. 
>>> >>> [ 653.909423] >>> the existing dependency chain (in reverse order) is: >>> [ 653.910594] >>> -> #1 (reservation_ww_class_mutex){+.+.}: >>> [ 653.911759] __ww_mutex_lock.constprop.15+0xca/0x1120 >>> [ 653.912350] ww_mutex_lock+0x73/0x80 >>> [ 653.913044] amdgpu_amdkfd_alloc_gtt_mem+0xde/0x380 [amdgpu] >>> [ 653.913724] kgd2kfd_device_init+0x13f/0x5e0 [amdgpu] >>> [ 653.914388] amdgpu_amdkfd_device_init+0x155/0x190 [amdgpu] >>> [ 653.915033] amdgpu_device_init+0x1303/0x1e10 [amdgpu] >>> [ 653.915685] amdgpu_driver_load_kms+0x5c/0x2c0 [amdgpu] >>> [ 653.916349] amdgpu_pci_probe+0x11d/0x200 [amdgpu] >>> [ 653.916959] local_pci_probe+0x47/0xa0 >>> [ 653.917570] work_for_cpu_fn+0x1a/0x30 >>> [ 653.918184] process_one_work+0x29e/0x630 >>> [ 653.918803] worker_thread+0x22b/0x3f0 >>> [ 653.919427] kthread+0x12f/0x150 >>> [ 653.920047] ret_from_fork+0x3a/0x50 >>> [ 653.920661] >>> -> #0 (&adev->reset_sem){.+.+}: >>> [ 653.921893] __lock_acquire+0x13ec/0x16e0 >>> [ 653.922531] lock_acquire+0xb8/0x1c0 >>> [ 653.923174] down_read+0x48/0x230 >>> [ 653.923886] amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu] >>> [ 653.924588] drm_ioctl_kernel+0xb6/0x100 [drm] >>> [ 653.925283] drm_ioctl+0x389/0x450 [drm] >>> [ 653.926013]
Re: [RFC PATCH 1/1] drm/amdgpu: add initial support for pci error handler
On 8/11/20 9:30 AM, Nirmoy Das wrote: This patch will ignore non-fatal errors and try to stop amdgpu's sw stack on fatal errors. Signed-off-by: Nirmoy Das --- drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 56 - 1 file changed, 54 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c index c1219af2e7d6..2b9ede3000ee 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c @@ -35,6 +35,7 @@ #include #include #include +#include #include #include "amdgpu.h" @@ -1516,6 +1517,58 @@ static struct drm_driver kms_driver = { .patchlevel = KMS_DRIVER_PATCHLEVEL, }; +static pci_ers_result_t amdgpu_pci_err_detected(struct pci_dev *pdev, + pci_channel_state_t state) +{ + struct drm_device *dev = pci_get_drvdata(pdev); + struct amdgpu_device *adev = dev->dev_private; + int i; + int ret = PCI_ERS_RESULT_DISCONNECT; + + switch (state) { + case pci_channel_io_normal: + ret = PCI_ERS_RESULT_CAN_RECOVER; + break; + default: + /* Disable power management */ + adev->runpm = 0; + /* Suspend all IO operations */ + amdgpu_fbdev_set_suspend(adev, 1); + cancel_delayed_work_sync(&adev->delayed_init_work); + for (i = 0; i < AMDGPU_MAX_RINGS; ++i) { + struct amdgpu_ring *ring = adev->rings[i]; + + if (!ring || !ring->sched.thread) + continue; + + amdgpu_job_stop_all_jobs_on_sched(&ring->sched); You need to call drm_sched_stop first before calling this + } + + if (adev->mode_info.mode_config_initialized) { + if (!amdgpu_device_has_dc_support(adev)) + drm_helper_force_disable_all(adev->ddev); + else + drm_atomic_helper_shutdown(adev->ddev); + } + + amdgpu_fence_driver_fini(adev); + amdgpu_fbdev_fini(adev); + /* Try to close drm device to stop applications +* from opening dri files for further IO operations. 
+* TODO: This will throw a warning as ttm is not +* cleaned up properly */ drm_dev_fini(dev); I think user mode applications might still hold a reference to the drm device through drm_dev_get, either by directly opening the device file or indirectly through importing a DMA-buf; if so, when the last of them terminates, drm_dev_put->drm_dev_release->...->drm_dev_fini might get called again, causing use-after-free etc. issues. Maybe it's better to call drm_dev_put here instead, so that drm_dev_fini gets called when the last user client releases its reference. Also a general question - in my work on the DPC recovery feature, which tries to recover after a PCIe error - once the PCI error has happened, MMIO registers become inaccessible for r/w as the PCI link is dead until the PCI link is reset by the DPC driver (see https://www.kernel.org/doc/html/latest/PCI/pci-error-recovery.html section 6.1.4). Your case tries to gracefully close the drm device once a fatal error happened; didn't you encounter errors or warnings when accessing HW registers during any of the operations above? Andrey + break; + } + + return ret; +} + +static const struct pci_error_handlers amdgpu_err_handler = { + .error_detected = amdgpu_pci_err_detected, +}; + + static struct pci_driver amdgpu_kms_pci_driver = { .name = DRIVER_NAME, .id_table = pciidlist, @@ -1523,10 +1576,9 @@ static struct pci_driver amdgpu_kms_pci_driver = { .remove = amdgpu_pci_remove, .shutdown = amdgpu_pci_shutdown, .driver.pm = &amdgpu_pm_ops, + .err_handler = &amdgpu_err_handler, }; - - static int __init amdgpu_init(void) { int r; 
Re: [PATCH] drm/amdgpu: adjust the pid in the grab_id trace point
On Fri, Aug 07, 2020 at 03:36:58PM +0200, Christian König wrote: > Trace something useful instead of the pid of a kernel thread here. > > Signed-off-by: Christian König > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h > b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h > index 5da20fc166d9..07f99ef69d91 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h > @@ -228,6 +228,7 @@ TRACE_EVENT(amdgpu_vm_grab_id, >), > > TP_fast_assign( > +__entry->ent.pid = vm->task_info.pid; If the ent.pid is not the pid you are interested in for this trace event, just add a "pid" field to the trace event and place it there. Do not modify the generic pid that is recorded, as we would like that to be consistent for all trace events. The "ent.pid" turns into "common_pid" in the field, leaving "pid" free to use. Other trace events (like sched_waking) record a pid field that is not the same as the pid of the executing task. The "ent.pid" should always be the pid of the task that executed the event. -- Steve > __entry->pasid = vm->pasid; > __assign_str(ring, ring->name) > __entry->vmid = job->vmid; > -- > 2.17.1
Re: [PATCH] drm/amdgpu: adjust the pid in the grab_id trace point
On 12.08.20 at 16:17, Steven Rostedt wrote: On Fri, Aug 07, 2020 at 03:36:58PM +0200, Christian König wrote: Trace something useful instead of the pid of a kernel thread here. Signed-off-by: Christian König --- drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h index 5da20fc166d9..07f99ef69d91 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h @@ -228,6 +228,7 @@ TRACE_EVENT(amdgpu_vm_grab_id, ), TP_fast_assign( + __entry->ent.pid = vm->task_info.pid; If the ent.pid is not the pid you are interested in for this trace event, just add a "pid" field to the trace event and place it there. Do not modify the generic pid that is recorded, as we would like that to be consistent for all trace events. The problem my userspace guys have is that this doesn't work with "trace-cmd -P $pid". But I think I can teach them how filters work :) The "ent.pid" turns into "common_pid" in the field, leaving "pid" free to use. Other trace events (like sched_waking) record a pid field that is not the same as the pid of the executing task. Yes, we thought about this alternative as well. The "ent.pid" should always be the pid of the task that executed the event. Why? For the case here we just execute a work item in the background for a userspace process. Tracing the pid of the worker pool which executes it doesn't seem to make too much sense. Thanks for the quick reply, Christian. -- Steve __entry->pasid = vm->pasid; __assign_str(ring, ring->name) __entry->vmid = job->vmid; -- 2.17.1
Re: [PATCH 3/7] drm/amd/display: Avoid using unvalidated tiling_flags and tmz_surface in prepare_planes
On Tue, Aug 11, 2020 at 09:42:11AM -0400, Marek Olšák wrote: > There are a few cases when the flags can change, for example DCC can be > disabled due to a hw limitation in the 3d engine. Modifiers give the > misleading impression that they help with that, but they don't. They don't > really help with anything. But if that happens, how do you tell the other side that it needs to sample new flags? Does that just happen all the time? Also do the DCC state changes happen for shared buffers too? -Daniel > > Marek > > On Mon., Aug. 10, 2020, 08:30 Christian König, < > ckoenig.leichtzumer...@gmail.com> wrote: > > > Am 10.08.20 um 14:25 schrieb Daniel Vetter: > > > On Fri, Aug 07, 2020 at 10:29:09AM -0400, Kazlauskas, Nicholas wrote: > > >> On 2020-08-07 4:30 a.m., dan...@ffwll.ch wrote: > > >>> On Thu, Jul 30, 2020 at 04:36:38PM -0400, Nicholas Kazlauskas wrote: > > [Why] > > We're racing with userspace as the flags could potentially change > > from when we acquired and validated them in commit_check. > > >>> Uh ... I didn't know these could change. I think my comments on Bas' > > >>> series are even more relevant now. I think long term would be best to > > bake > > >>> these flags in at addfb time when modifiers aren't set. And otherwise > > >>> always use the modifiers flag, and completely ignore the legacy flags > > >>> here. > > >>> -Daniel > > >>> > > >> There's a set tiling/mod flags IOCTL that can be called after addfb > > happens, > > >> so unless there's some sort of driver magic preventing this from working > > >> when it's already been pinned for scanout then I don't see anything > > stopping > > >> this from happening. > > >> > > >> I still need to review the modifiers series in a little more detail but > > that > > >> looks like a good approach to fixing these kinds of issues. > > > Yeah we had the same model for i915, but it's awkward and can surprise > > > compositors (since the client could change the tiling mode from > > underneath > > > the compositor). 
So freezing the tiling mode at addfb time is the right > > > thing to do, and anyway how things work with modifiers. > > > > > > Ofc maybe good to audit the -amd driver, but hopefully it doesn't do > > > anything silly with changed tiling. If it does, it's kinda sad day. > > > > Marek should know this right away, but I think we only set the tiling > > flags once while exporting the BO and then never change them. > > > > Regards, > > Christian. > > > > > > > > Btw for i915 we even went a step further, and made the set_tiling ioctl > > > return an error if a framebuffer for that gem_bo existed. Just to make > > > sure we don't ever accidentally break this. > > > > > > Cheers, Daniel > > > > > >> Regards, > > >> Nicholas Kazlauskas > > >> > > [How] > > We unfortunately can't drop this function in its entirety from > > prepare_planes since we don't know the afb->address at commit_check > > time yet. > > > > So instead of querying new tiling_flags and tmz_surface use the ones > > from the plane_state directly. > > > > While we're at it, also update the force_disable_dcc option based > > on the state from atomic check. 
> > > > Cc: Bhawanpreet Lakha > > Cc: Rodrigo Siqueira > > Signed-off-by: Nicholas Kazlauskas > > --- > > .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 36 > > ++- > > 1 file changed, 19 insertions(+), 17 deletions(-) > > > > diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c > > b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c > > index bf1881bd492c..f78c09c9585e 100644 > > --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c > > +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c > > @@ -5794,14 +5794,8 @@ static int dm_plane_helper_prepare_fb(struct > > drm_plane *plane, > > struct list_head list; > > struct ttm_validate_buffer tv; > > struct ww_acquire_ctx ticket; > > - uint64_t tiling_flags; > > uint32_t domain; > > int r; > > - bool tmz_surface = false; > > - bool force_disable_dcc = false; > > - > > - dm_plane_state_old = to_dm_plane_state(plane->state); > > - dm_plane_state_new = to_dm_plane_state(new_state); > > if (!new_state->fb) { > > DRM_DEBUG_DRIVER("No FB bound\n"); > > @@ -5845,27 +5839,35 @@ static int dm_plane_helper_prepare_fb(struct > > drm_plane *plane, > > return r; > > } > > - amdgpu_bo_get_tiling_flags(rbo, &tiling_flags); > > - > > - tmz_surface = amdgpu_bo_encrypted(rbo); > > - > > ttm_eu_backoff_reservation(&ticket, &list); > > afb->address = amdgpu_bo_gpu_offset(rbo); > > amdgpu_bo_ref(rbo); > > + /** > > + * We don't do surface updates on planes that have been newly >
Re: RFC: How to adjust the trace pid?
On Wed, Aug 12, 2020 at 3:42 PM Christian König wrote: > > Ping? Daniel, Dave any opinion on this? Type patch, cc: tracing people, see what they say? tbh I have no idea, but they have been making unhappy noises about some of the tricks we've played in the past in i915 tracepoints. So not everything is cool in there. Otherwise I guess just add another tracepoint parameter to dump the correct userspace mm. 3rd option could be to dump the current mm (since I'm assuming those threads do kthread_use/unuse_mm to impersonate the right userspace process correctly) in the tracepoint infrastructure too? Cheers, Daniel > Christian. > > Am 07.08.20 um 15:36 schrieb Christian König: > > Hi everybody, > > > > in amdgpu we got the following issue which I'm seeking advice on how to > > cleanly handle. > > > > We have a bunch of trace points which are related to the VM subsystem and > > executed in either a work item, kthread or foreign process context. > > > > Now tracing the pid of the context which we are executing in is not really > > that useful, so I'm wondering if we could just overwrite the pid recorded > > in the trace entry? > > > > The following patch does exactly that for the vm_grab_id() trace point, but > > I'm not 100% sure if that is legal or not. > > > > Any ideas? Comments? > > > > Thanks, > > Christian. > > > > > -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
Re: RFC: How to adjust the trace pid?
Ping? Daniel, Dave, any opinion on this? Christian. On 07.08.20 at 15:36, Christian König wrote: Hi everybody, in amdgpu we got the following issue which I'm seeking advice on how to cleanly handle. We have a bunch of trace points which are related to the VM subsystem and executed in either a work item, kthread or foreign process context. Now tracing the pid of the context which we are executing in is not really that useful, so I'm wondering if we could just overwrite the pid recorded in the trace entry? The following patch does exactly that for the vm_grab_id() trace point, but I'm not 100% sure if that is legal or not. Any ideas? Comments? Thanks, Christian.
Re: [PATCH 3/4] drm/amd/powerplay: enable Navi1X mgpu fan boost feature
On Wed, Aug 12, 2020 at 12:57 AM Evan Quan wrote: > > Support Navi1X mgpu fan boost enablement. > > Change-Id: Iafbf07c56462120d2db578b6af45dd7f985a4cc1 > Signed-off-by: Evan Quan > --- > .../drm/amd/powerplay/inc/smu_v11_0_ppsmc.h | 4 +++- > drivers/gpu/drm/amd/powerplay/navi10_ppt.c| 21 +++ > 2 files changed, 24 insertions(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/amd/powerplay/inc/smu_v11_0_ppsmc.h > b/drivers/gpu/drm/amd/powerplay/inc/smu_v11_0_ppsmc.h > index 406bfd187ce8..fa0174dc7e0e 100644 > --- a/drivers/gpu/drm/amd/powerplay/inc/smu_v11_0_ppsmc.h > +++ b/drivers/gpu/drm/amd/powerplay/inc/smu_v11_0_ppsmc.h > @@ -123,7 +123,9 @@ > #define PPSMC_MSG_DALDisableDummyPstateChange0x49 > #define PPSMC_MSG_DALEnableDummyPstateChange 0x4A > > -#define PPSMC_Message_Count 0x4B > +#define PPSMC_MSG_SetMGpuFanBoostLimitRpm0x4C > + > +#define PPSMC_Message_Count 0x4D > > typedef uint32_t PPSMC_Result; > typedef uint32_t PPSMC_Msg; > diff --git a/drivers/gpu/drm/amd/powerplay/navi10_ppt.c > b/drivers/gpu/drm/amd/powerplay/navi10_ppt.c > index 61e2971be9f3..a86cd819b44b 100644 > --- a/drivers/gpu/drm/amd/powerplay/navi10_ppt.c > +++ b/drivers/gpu/drm/amd/powerplay/navi10_ppt.c > @@ -138,6 +138,7 @@ static struct cmn2asic_msg_mapping > navi10_message_map[SMU_MSG_MAX_COUNT] = { > MSG_MAP(DAL_ENABLE_DUMMY_PSTATE_CHANGE, > PPSMC_MSG_DALEnableDummyPstateChange, 0), > MSG_MAP(GetVoltageByDpm,PPSMC_MSG_GetVoltageByDpm, > 0), > MSG_MAP(GetVoltageByDpmOverdrive, > PPSMC_MSG_GetVoltageByDpmOverdrive, 0), > + MSG_MAP(SetMGpuFanBoostLimitRpm, > PPSMC_MSG_SetMGpuFanBoostLimitRpm, 0), > }; > > static struct cmn2asic_mapping navi10_clk_map[SMU_CLK_COUNT] = { > @@ -2555,6 +2556,25 @@ static ssize_t navi10_get_gpu_metrics(struct > smu_context *smu, > return sizeof(struct gpu_metrics_v1_0); > } > > +static int navi10_enable_mgpu_fan_boost(struct smu_context *smu) > +{ > + struct amdgpu_device *adev = smu->adev; > + uint32_t param = 0; > + > + /* Navi12 does not support this */ > 
+ if (adev->asic_type == CHIP_NAVI12) > + return 0; > + > + if (adev->pdev->device == 0x7312 && > + adev->external_rev_id == 0) You want adev->pdev->revision here. With that fixed, series is: Reviewed-by: Alex Deucher > + param = 0xD188; > + > + return smu_cmn_send_smc_msg_with_param(smu, > + > SMU_MSG_SetMGpuFanBoostLimitRpm, > + param, > + NULL); > +} > + > static const struct pptable_funcs navi10_ppt_funcs = { > .get_allowed_feature_mask = navi10_get_allowed_feature_mask, > .set_default_dpm_table = navi10_set_default_dpm_table, > @@ -2636,6 +2656,7 @@ static const struct pptable_funcs navi10_ppt_funcs = { > .get_pp_feature_mask = smu_cmn_get_pp_feature_mask, > .set_pp_feature_mask = smu_cmn_set_pp_feature_mask, > .get_gpu_metrics = navi10_get_gpu_metrics, > + .enable_mgpu_fan_boost = navi10_enable_mgpu_fan_boost, > }; > > void navi10_set_ppt_funcs(struct smu_context *smu) > -- > 2.28.0 > > ___ > amd-gfx mailing list > amd-gfx@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/amd-gfx ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Re: [PATCH v3] drm/amdgpu: add new trace event for page table update v3
Am 12.08.20 um 14:09 schrieb Shashank Sharma: On 12/08/20 2:02 pm, Christian König wrote: Am 12.08.20 um 10:15 schrieb Shashank Sharma: Hello Christian, On 12/08/20 12:15 pm, Christian König wrote: Am 12.08.20 um 06:33 schrieb Shashank Sharma: This patch adds a new trace event to track the PTE update events. This specific event will provide information like: - start and end of virtual memory mapping - HW engine flags for the map - physical address for mapping This will be particularly useful for memory profiling tools (like RMV) which are monitoring the page table update events. V2: Added physical address lookup logic in trace point V3: switch to use __dynamic_array added nptes int the TPprint arguments list added page size in the arg list Cc: Christian König Cc: Alex Deucher Signed-off-by: Christian König Signed-off-by: Shashank Sharma --- drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h | 38 +++ drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c| 9 -- 2 files changed, 45 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h index 63e734a125fb..b9aae7983b4b 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h @@ -321,6 +321,44 @@ DEFINE_EVENT(amdgpu_vm_mapping, amdgpu_vm_bo_cs, TP_ARGS(mapping) ); +TRACE_EVENT(amdgpu_vm_update_ptes, + TP_PROTO(struct amdgpu_vm_update_params *p, +uint64_t start, uint64_t end, +unsigned int nptes, uint64_t dst, +uint64_t incr, uint64_t flags), + TP_ARGS(p, start, end, nptes, dst, incr, flags), + TP_STRUCT__entry( +__field(u64, start) +__field(u64, end) +__field(u64, flags) +__field(u64, incr) +__field(unsigned int, nptes) +__dynamic_array(u64, dst, nptes) + ), + + TP_fast_assign( + unsigned int i; + + __entry->start = start; + __entry->end = end; + __entry->flags = flags; + __entry->incr = incr; + __entry->nptes = nptes; + + for (i = 0; i < nptes; ++i) { + u64 addr = p->pages_addr ? 
amdgpu_vm_map_gart( + p->pages_addr, dst) : dst; + + ((u64 *)__get_dynamic_array(dst))[i] = addr; + dst += incr; + } + ), + TP_printk("seg:0x%010llx-0x%010llx, flags:0x%llx, nptr=%u, pgsz:%llu," + " dst:\n%s", __entry->start, __entry->end, __entry->flags, + __entry->nptes, __entry->incr, This is not correct. The increment is NOT the page size, but rather the page size rounded down to a power of 512+4K. In other words page size can be 4K, 8K, 16K, 32K, 64K1M, 2M, 4M512M, 1G, 2G But the increment can only be 4K, 2M, 1G Understood. But I think the requirement here is for increment. My understanding is that the tool needs to save the page entries, and for that, it will need start of virtual mem, start of physical mem, mapping size and step to increment the entries. If that's so, we can re-label this entry as "step" instead of "page size". Please let me know if you think it's the right thing to do. We could stick with the naming increment if that helps, but this can also be derived from the number of destination addresses we have. sure, i will make it increment. On the other hand explicitly mentioning it probably won't hurt us either. And by the way what does "seg" mean? Ah, to get into 80 char limit, I made 'segment' as 'seg' and later just realized I have to still break the print into two lines :) . I will make it back to segment or start/end Ah, maybe rather use VA for virtual address. But start/end is probably the best option. Christian. - Shashank Christian. And do we need the nptes here? We just need it to print the correct number of destination addresses. Agree, we don't really need nptes here, I will remove that and send V4. - Shashank Regards, Christian. 
+ __print_array(__get_dynamic_array(dst), __entry->nptes, 8)) +); + TRACE_EVENT(amdgpu_vm_set_ptes, TP_PROTO(uint64_t pe, uint64_t addr, unsigned count, uint32_t incr, uint64_t flags, bool direct), diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c index 71e005cf2952..b5dbb5e8bc61 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c @@ -1513,17 +1513,22 @@ static int amdgpu_vm_update_ptes(struct amdgpu_vm_update_params *params, do { uint64_t
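Christian's distinction between page size and increment can be sketched as follows. `pte_increment` is a hypothetical helper (not driver code): a GPUVM mapping size can be any power-of-two multiple of 4K, but the PTE walk steps only in the sizes backed by a page-table level, i.e. 4K * 512^n (4K, 2M, 1G), so the increment is the mapping size rounded down to one of those:

```c
#include <stdint.h>

/* Round an arbitrary power-of-two mapping size down to the nearest
 * page-table-level step: 1G, 2M or 4K. Returns 0 for sizes below 4K,
 * which would not be a valid GPUVM mapping granularity. */
static uint64_t pte_increment(uint64_t page_size)
{
	const uint64_t levels[] = {
		1ULL << 30, /* 1G */
		1ULL << 21, /* 2M */
		1ULL << 12, /* 4K */
	};
	for (unsigned int i = 0; i < 3; ++i)
		if (page_size >= levels[i])
			return levels[i];
	return 0;
}
```

So a 64K mapping still updates PTEs in 4K steps, and a 512M mapping in 2M steps — which is why the trace point's `incr` field is a step size, not the page size.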
Re: [PATCH] drm/amd/powerplay: Fix uninitialized warning in arcturus ppt driver
[AMD Official Use Only - Internal Distribution Only] Possibly, but since the arcturus_get_smu_metrics_data() can error out we should check that return value no? (also setting *value to 0 avoids this bug in the future...). Tom From: Das, Nirmoy Sent: Wednesday, August 12, 2020 08:40 To: StDenis, Tom; amd-gfx@lists.freedesktop.org Subject: Re: [PATCH] drm/amd/powerplay: Fix uninitialized warning in arcturus ppt driver On 8/12/20 2:20 PM, Tom St Denis wrote: > Fixes: > >CC [M] > drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm_mst_types.o > drivers/gpu/drm/amd/amdgpu/../powerplay/arcturus_ppt.c: In function > ‘arcturus_log_thermal_throttling_event’: > drivers/gpu/drm/amd/amdgpu/../powerplay/arcturus_ppt.c:2223:24: warning: > ‘throttler_status’ may be used uninitialized in this function > [-Wmaybe-uninitialized] > 2223 | if (throttler_status & logging_label[throttler_idx].feature_mask) { > > by making arcturus_get_smu_metrics_data() assign a default value > (of zero) before any possible return point as well as simply error > out of arcturus_log_thermal_throttling_event() if it fails. 
> > Signed-off-by: Tom St Denis > --- > drivers/gpu/drm/amd/powerplay/arcturus_ppt.c | 12 ++-- > 1 file changed, 10 insertions(+), 2 deletions(-) > > diff --git a/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c > b/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c > index 8b1025dc54fd..78f7ec95e4f5 100644 > --- a/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c > +++ b/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c > @@ -551,6 +551,9 @@ static int arcturus_get_smu_metrics_data(struct > smu_context *smu, > > mutex_lock(>metrics_lock); > > + // assign default value > + *value = 0; > + > ret = smu_cmn_get_metrics_table_locked(smu, > NULL, > false); > @@ -2208,15 +2211,20 @@ static const struct throttling_logging_label { > }; > static void arcturus_log_thermal_throttling_event(struct smu_context *smu) > { > - int throttler_idx, throtting_events = 0, buf_idx = 0; > + int throttler_idx, throtting_events = 0, buf_idx = 0, ret; > struct amdgpu_device *adev = smu->adev; > uint32_t throttler_status; I think initializing throttler_status here should resolve the warning. > char log_buf[256]; > > - arcturus_get_smu_metrics_data(smu, > + ret = arcturus_get_smu_metrics_data(smu, > METRICS_THROTTLER_STATUS, > _status); > > + if (ret) { > + dev_err(adev->dev, "Could not read from > arcturus_get_smu_metrics_data()\n"); > + return; > + } > + > memset(log_buf, 0, sizeof(log_buf)); > for (throttler_idx = 0; throttler_idx < ARRAY_SIZE(logging_label); >throttler_idx++) { ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Re: [PATCH] drm/amd/powerplay: Fix uninitialized warning in arcturus ppt driver
On 8/12/20 2:20 PM, Tom St Denis wrote: Fixes: CC [M] drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm_mst_types.o drivers/gpu/drm/amd/amdgpu/../powerplay/arcturus_ppt.c: In function ‘arcturus_log_thermal_throttling_event’: drivers/gpu/drm/amd/amdgpu/../powerplay/arcturus_ppt.c:2223:24: warning: ‘throttler_status’ may be used uninitialized in this function [-Wmaybe-uninitialized] 2223 | if (throttler_status & logging_label[throttler_idx].feature_mask) { by making arcturus_get_smu_metrics_data() assign a default value (of zero) before any possible return point as well as simply error out of arcturus_log_thermal_throttling_event() if it fails. Signed-off-by: Tom St Denis --- drivers/gpu/drm/amd/powerplay/arcturus_ppt.c | 12 ++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c b/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c index 8b1025dc54fd..78f7ec95e4f5 100644 --- a/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c +++ b/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c @@ -551,6 +551,9 @@ static int arcturus_get_smu_metrics_data(struct smu_context *smu, mutex_lock(>metrics_lock); + // assign default value + *value = 0; + ret = smu_cmn_get_metrics_table_locked(smu, NULL, false); @@ -2208,15 +2211,20 @@ static const struct throttling_logging_label { }; static void arcturus_log_thermal_throttling_event(struct smu_context *smu) { - int throttler_idx, throtting_events = 0, buf_idx = 0; + int throttler_idx, throtting_events = 0, buf_idx = 0, ret; struct amdgpu_device *adev = smu->adev; uint32_t throttler_status; I think initializing throttler_status here should resolve the warning. 
char log_buf[256]; - arcturus_get_smu_metrics_data(smu, + ret = arcturus_get_smu_metrics_data(smu, METRICS_THROTTLER_STATUS, _status); + if (ret) { + dev_err(adev->dev, "Could not read from arcturus_get_smu_metrics_data()\n"); + return; + } + memset(log_buf, 0, sizeof(log_buf)); for (throttler_idx = 0; throttler_idx < ARRAY_SIZE(logging_label); throttler_idx++) { ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
[PATCH] drm/amd/powerplay: Fix uninitialized warning in arcturus ppt driver
Fixes: CC [M] drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm_mst_types.o drivers/gpu/drm/amd/amdgpu/../powerplay/arcturus_ppt.c: In function ‘arcturus_log_thermal_throttling_event’: drivers/gpu/drm/amd/amdgpu/../powerplay/arcturus_ppt.c:2223:24: warning: ‘throttler_status’ may be used uninitialized in this function [-Wmaybe-uninitialized] 2223 | if (throttler_status & logging_label[throttler_idx].feature_mask) { by making arcturus_get_smu_metrics_data() assign a default value (of zero) before any possible return point as well as simply error out of arcturus_log_thermal_throttling_event() if it fails. Signed-off-by: Tom St Denis --- drivers/gpu/drm/amd/powerplay/arcturus_ppt.c | 12 ++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c b/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c index 8b1025dc54fd..78f7ec95e4f5 100644 --- a/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c +++ b/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c @@ -551,6 +551,9 @@ static int arcturus_get_smu_metrics_data(struct smu_context *smu, mutex_lock(>metrics_lock); + // assign default value + *value = 0; + ret = smu_cmn_get_metrics_table_locked(smu, NULL, false); @@ -2208,15 +2211,20 @@ static const struct throttling_logging_label { }; static void arcturus_log_thermal_throttling_event(struct smu_context *smu) { - int throttler_idx, throtting_events = 0, buf_idx = 0; + int throttler_idx, throtting_events = 0, buf_idx = 0, ret; struct amdgpu_device *adev = smu->adev; uint32_t throttler_status; char log_buf[256]; - arcturus_get_smu_metrics_data(smu, + ret = arcturus_get_smu_metrics_data(smu, METRICS_THROTTLER_STATUS, _status); + if (ret) { + dev_err(adev->dev, "Could not read from arcturus_get_smu_metrics_data()\n"); + return; + } + memset(log_buf, 0, sizeof(log_buf)); for (throttler_idx = 0; throttler_idx < ARRAY_SIZE(logging_label); throttler_idx++) { -- 2.26.2 ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org 
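The `-Wmaybe-uninitialized` pattern and the combined fix discussed in this thread (zero the out-parameter up front *and* have the caller bail out on error) can be modeled outside the kernel like this; `get_metrics_data` and its error value are stand-ins, not the real SMU interface:

```c
#include <stdint.h>

/* Model of the bug: the getter originally wrote *value only on success,
 * so a caller that ignored the return code read an uninitialized local.
 * The fix defines *value on every path and still reports the error. */
static int get_metrics_data(int fail, uint32_t *value)
{
	*value = 0;        /* default, so *value is defined on every path */
	if (fail)
		return -5; /* stand-in for -EIO: metrics table unavailable */
	*value = 0x40;     /* stand-in for a real throttler status */
	return 0;
}

static uint32_t read_throttler_status(int fail)
{
	uint32_t status;   /* would stay uninitialized without the fix */

	if (get_metrics_data(fail, &status))
		return 0;  /* error out instead of using a stale value */
	return status;
}
```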
Re: [PATCH] drm/amdgpu: Fix repeatly flr issue
On 8/12/20 11:19 AM, Emily.Deng wrote: From: jqdeng Only for no job running test case need to do recover in flr notification. For having job in mirror list, then let guest driver to hit job timeout, and then do recover. Signed-off-by: jqdeng Change-Id: Ic6234fce46fa1655ba81c4149235eeac75e75868 --- drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 20 +++- drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 22 -- 2 files changed, 39 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c index fe31cbeccfe9..12fe5164aaf3 100644 --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c @@ -238,6 +238,9 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct *work) struct amdgpu_virt *virt = container_of(work, struct amdgpu_virt, flr_work); struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, virt); int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT; + int i; + bool need_do_recover = true; We should find a better name for "need_do_recover", may be "need_to_recover" ? + struct drm_sched_job *job; /* block amdgpu_gpu_recover till msg FLR COMPLETE received, * otherwise the mailbox msg will be ruined/reseted by @@ -258,10 +261,25 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct *work) flr_done: up_read(>reset_sem); + for (i = 0; i < AMDGPU_MAX_RINGS; ++i) { + struct amdgpu_ring *ring = adev->rings[i]; + + if (!ring || !ring->sched.thread) + continue; + + spin_lock(>sched.job_list_lock); + job = list_first_entry_or_null(>sched.ring_mirror_list, + struct drm_sched_job, node); + spin_unlock(>sched.job_list_lock); + if (job) { + need_do_recover = false; + break; + } + } This 1st job retrieval logic can move to a function as there are two instance of it. 
/* Trigger recovery for world switch failure if no TDR */ if (amdgpu_device_should_recover_gpu(adev) - && adev->sdma_timeout == MAX_SCHEDULE_TIMEOUT) + && (need_do_recover || adev->sdma_timeout == MAX_SCHEDULE_TIMEOUT)) amdgpu_device_gpu_recover(adev, NULL); } diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c index 6f55172e8337..fc92c494df0b 100644 --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c @@ -259,6 +259,9 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct *work) struct amdgpu_virt *virt = container_of(work, struct amdgpu_virt, flr_work); struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, virt); int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT; + int i; + bool need_do_recover = true; + struct drm_sched_job *job; /* block amdgpu_gpu_recover till msg FLR COMPLETE received, * otherwise the mailbox msg will be ruined/reseted by @@ -279,10 +282,25 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct *work) flr_done: up_read(>reset_sem); + for (i = 0; i < AMDGPU_MAX_RINGS; ++i) { + struct amdgpu_ring *ring = adev->rings[i]; + + if (!ring || !ring->sched.thread) + continue; + + spin_lock(>sched.job_list_lock); + job = list_first_entry_or_null(>sched.ring_mirror_list, + struct drm_sched_job, node); + spin_unlock(>sched.job_list_lock); + if (job) { + need_do_recover = false; + break; + } + } /* Trigger recovery for world switch failure if no TDR */ - if (amdgpu_device_should_recover_gpu(adev) - && (adev->sdma_timeout == MAX_SCHEDULE_TIMEOUT || + if (amdgpu_device_should_recover_gpu(adev) && (need_do_recover || + adev->sdma_timeout == MAX_SCHEDULE_TIMEOUT || adev->gfx_timeout == MAX_SCHEDULE_TIMEOUT || adev->compute_timeout == MAX_SCHEDULE_TIMEOUT || adev->video_timeout == MAX_SCHEDULE_TIMEOUT)) ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
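Nirmoy's suggestion to move the duplicated first-job peek into one function might look roughly like the sketch below; `mock_ring` and `any_job_pending` are hypothetical user-space stand-ins (the real helper would take `sched.job_list_lock` around the `list_first_entry_or_null()` peek on each ring):

```c
#include <stdbool.h>
#include <stddef.h>

/* A ring mocked down to the two things the FLR loop cares about. */
struct mock_ring {
	bool ready;        /* models ring && ring->sched.thread */
	int pending_jobs;  /* models ring_mirror_list being non-empty */
};

/* Walk every ring and report whether any scheduler job is still pending.
 * If one is, the FLR work skips the recover and lets the guest's job
 * timeout drive the reset instead, as the patch intends. */
static bool any_job_pending(const struct mock_ring *rings, size_t n)
{
	for (size_t i = 0; i < n; ++i) {
		if (!rings[i].ready)
			continue;    /* skip unused rings, as the loop does */
		if (rings[i].pending_jobs > 0)
			return true;
	}
	return false;                /* no jobs: FLR work must recover */
}
```

Both `xgpu_ai_mailbox_flr_work()` and `xgpu_nv_mailbox_flr_work()` could then call the one helper instead of open-coding the loop twice.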
Re: [PATCH v3] drm/amdgpu: add new trace event for page table update v3
On 12/08/20 2:02 pm, Christian König wrote: > Am 12.08.20 um 10:15 schrieb Shashank Sharma: >> Hello Christian, >> >> On 12/08/20 12:15 pm, Christian König wrote: >>> Am 12.08.20 um 06:33 schrieb Shashank Sharma: This patch adds a new trace event to track the PTE update events. This specific event will provide information like: - start and end of virtual memory mapping - HW engine flags for the map - physical address for mapping This will be particularly useful for memory profiling tools (like RMV) which are monitoring the page table update events. V2: Added physical address lookup logic in trace point V3: switch to use __dynamic_array added nptes int the TPprint arguments list added page size in the arg list Cc: Christian König Cc: Alex Deucher Signed-off-by: Christian König Signed-off-by: Shashank Sharma --- drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h | 38 +++ drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c| 9 -- 2 files changed, 45 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h index 63e734a125fb..b9aae7983b4b 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h @@ -321,6 +321,44 @@ DEFINE_EVENT(amdgpu_vm_mapping, amdgpu_vm_bo_cs, TP_ARGS(mapping) ); +TRACE_EVENT(amdgpu_vm_update_ptes, + TP_PROTO(struct amdgpu_vm_update_params *p, + uint64_t start, uint64_t end, + unsigned int nptes, uint64_t dst, + uint64_t incr, uint64_t flags), + TP_ARGS(p, start, end, nptes, dst, incr, flags), + TP_STRUCT__entry( + __field(u64, start) + __field(u64, end) + __field(u64, flags) + __field(u64, incr) + __field(unsigned int, nptes) + __dynamic_array(u64, dst, nptes) + ), + + TP_fast_assign( + unsigned int i; + + __entry->start = start; + __entry->end = end; + __entry->flags = flags; + __entry->incr = incr; + __entry->nptes = nptes; + + for (i = 0; i < nptes; ++i) { + u64 addr = p->pages_addr ? 
amdgpu_vm_map_gart( + p->pages_addr, dst) : dst; + + ((u64 *)__get_dynamic_array(dst))[i] = addr; + dst += incr; + } + ), + TP_printk("seg:0x%010llx-0x%010llx, flags:0x%llx, nptr=%u, pgsz:%llu," +" dst:\n%s", __entry->start, __entry->end, __entry->flags, +__entry->nptes, __entry->incr, >>> This is not correct. The increment is NOT the page size, but rather the >>> page size rounded down to a power of 512+4K. >>> >>> In other words page size can be 4K, 8K, 16K, 32K, 64K1M, 2M, >>> 4M512M, 1G, 2G >>> >>> But the increment can only be 4K, 2M, 1G >> Understood. But I think the requirement here is for increment. My >> understanding is that the tool needs to save the page entries, and for that, >> it will need start of virtual mem, start of physical mem, mapping size and >> step to increment the entries. If that's so, we can re-label this entry as >> "step" instead of "page size". Please let me know if you think it's the >> right thing to do. > We could stick with the naming increment if that helps, but this can > also be derived from the number of destination addresses we have. sure, i will make it increment. > > On the other hand explicitly mentioning it probably won't hurt us either. > > And by the way what does "seg" mean? Ah, to get into 80 char limit, I made 'segment' as 'seg' and later just realized I have to still break the print into two lines :) . I will make it back to segment or start/end - Shashank > > Christian. > >>> And do we need the nptes here? We just need it to print the correct >>> number of destination addresses. >> Agree, we don't really need nptes here, I will remove that and send V4. >> >> - Shashank >> >>> Regards, >>> Christian. 
>>> +__print_array(__get_dynamic_array(dst), __entry->nptes, 8)) +); + TRACE_EVENT(amdgpu_vm_set_ptes, TP_PROTO(uint64_t pe, uint64_t addr, unsigned count, uint32_t incr, uint64_t flags, bool direct), diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c index 71e005cf2952..b5dbb5e8bc61 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c +++
RE: [PATCH] drm/amdgpu: fix a potential circular locking dependency
[AMD Official Use Only - Internal Distribution Only] Am 12.08.20 um 12:02 schrieb Li, Dennis: > [AMD Official Use Only - Internal Distribution Only] > > Am 12.08.20 um 11:23 schrieb Li, Dennis: >> [AMD Official Use Only - Internal Distribution Only] >> >> Am 12.08.20 um 03:33 schrieb Li, Dennis: >>> [AMD Official Use Only - Internal Distribution Only] >>> >>> Hi, Christian, >>> >>> Re: I was wondering the same thing for the amdgpu_gem_va_ioctl() as well. >>> We shouldn't have any hardware access here, so taking the reset_sem looks >>> like overkill to me. >>> >>> [Dennis Li] amdgpu_vm_bo_unmap, amdgpu_vm_bo_clear_mappings, >>> amdgpu_vm_bo_replace_map and amdgpu_gem_va_update_vm all a chance to >>> access hardware. >> This is complete nonsense. The functions intentionally work through the >> scheduler to avoid accessing the hardware directly for exactly that reason. >> >> The only hardware access we have here is the HDP flush and that can fail in >> the case of a GPU reset without causing problems. >> >> [Dennis Li] amdgpu_vm_bo_clear_mappings -> amdgpu_vm_prt_get -> >> amdgpu_vm_update_prt_state -> gmc_v8_0_set_prt > That is for pre gfx9 hardware and only called once during initial enabling of > the feature. > > Please remove that locking again since it is clearly completely against the > driver design. > > [Dennis Li] okay, if you agree, I will change to only protect > amdgpu_gem_va_update_vm in this function. Better even only protect the amdgpu_vm_update_prt_state() function. [Dennis Li] Got it. According to your suggestion, I will also narrow down the scope of reset_sem in other functions. Christian. > > Christian. > >> Regards, >> Christian. 
>> >>> Best Regards >>> Dennis Li >>> -Original Message- >>> From: Koenig, Christian >>> Sent: Wednesday, August 12, 2020 12:15 AM >>> To: Kuehling, Felix ; Li, Dennis >>> ; amd-gfx@lists.freedesktop.org; Deucher, >>> Alexander ; Zhang, Hawking >>> >>> Subject: Re: [PATCH] drm/amdgpu: fix a potential circular locking >>> dependency >>> >>> Am 11.08.20 um 15:57 schrieb Felix Kuehling: Am 2020-08-11 um 5:32 a.m. schrieb Dennis Li: > [ 653.902305] > == > [ 653.902928] WARNING: possible circular locking dependency detected > [ 653.903517] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G > OE > [ 653.904098] > -- > [ 653.904675] amdgpu_test/3975 is trying to acquire lock: > [ 653.905241] 97848f8647a0 (>reset_sem){.+.+}, at: > amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu] [ 653.905953] >but task is already holding lock: > [ 653.907087] 9744adbee1f8 > (reservation_ww_class_mutex){+.+.}, > at: ttm_eu_reserve_buffers+0x1ae/0x520 [ttm] [ 653.907694] >which lock already depends on the new lock. > > [ 653.909423] >the existing dependency chain (in reverse order) is: > [ 653.910594] >-> #1 (reservation_ww_class_mutex){+.+.}: > [ 653.911759]__ww_mutex_lock.constprop.15+0xca/0x1120 > [ 653.912350]ww_mutex_lock+0x73/0x80 > [ 653.913044]amdgpu_amdkfd_alloc_gtt_mem+0xde/0x380 [amdgpu] > [ 653.913724]kgd2kfd_device_init+0x13f/0x5e0 [amdgpu] > [ 653.914388]amdgpu_amdkfd_device_init+0x155/0x190 [amdgpu] > [ 653.915033]amdgpu_device_init+0x1303/0x1e10 [amdgpu] > [ 653.915685]amdgpu_driver_load_kms+0x5c/0x2c0 [amdgpu] > [ 653.916349]amdgpu_pci_probe+0x11d/0x200 [amdgpu] > [ 653.916959]local_pci_probe+0x47/0xa0 > [ 653.917570]work_for_cpu_fn+0x1a/0x30 > [ 653.918184]process_one_work+0x29e/0x630 > [ 653.918803]worker_thread+0x22b/0x3f0 > [ 653.919427]kthread+0x12f/0x150 > [ 653.920047]ret_from_fork+0x3a/0x50 > [ 653.920661] >-> #0 (>reset_sem){.+.+}: > [ 653.921893]__lock_acquire+0x13ec/0x16e0 > [ 653.922531]lock_acquire+0xb8/0x1c0 > [ 653.923174]down_read+0x48/0x230 > [ 
653.923886]amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu] > [ 653.924588]drm_ioctl_kernel+0xb6/0x100 [drm] > [ 653.925283]drm_ioctl+0x389/0x450 [drm] > [ 653.926013]amdgpu_drm_ioctl+0x4f/0x80 [amdgpu] > [ 653.926686]ksys_ioctl+0x98/0xb0 > [ 653.927357]__x64_sys_ioctl+0x1a/0x20 > [ 653.928030]do_syscall_64+0x5f/0x250 > [ 653.928697]entry_SYSCALL_64_after_hwframe+0x49/0xbe > [ 653.929373] >other info that might help us debug this: > > [ 653.931356] Possible unsafe locking scenario: > > [ 653.932647]CPU0CPU1 > [ 653.933287] > [ 653.933911]
Re: [PATCH] drm/amdgpu: fix a potential circular locking dependency
Am 12.08.20 um 12:02 schrieb Li, Dennis: [AMD Official Use Only - Internal Distribution Only] Am 12.08.20 um 11:23 schrieb Li, Dennis: [AMD Official Use Only - Internal Distribution Only] Am 12.08.20 um 03:33 schrieb Li, Dennis: [AMD Official Use Only - Internal Distribution Only] Hi, Christian, Re: I was wondering the same thing for the amdgpu_gem_va_ioctl() as well. We shouldn't have any hardware access here, so taking the reset_sem looks like overkill to me. [Dennis Li] amdgpu_vm_bo_unmap, amdgpu_vm_bo_clear_mappings, amdgpu_vm_bo_replace_map and amdgpu_gem_va_update_vm all a chance to access hardware. This is complete nonsense. The functions intentionally work through the scheduler to avoid accessing the hardware directly for exactly that reason. The only hardware access we have here is the HDP flush and that can fail in the case of a GPU reset without causing problems. [Dennis Li] amdgpu_vm_bo_clear_mappings -> amdgpu_vm_prt_get -> amdgpu_vm_update_prt_state -> gmc_v8_0_set_prt That is for pre gfx9 hardware and only called once during initial enabling of the feature. Please remove that locking again since it is clearly completely against the driver design. [Dennis Li] okay, if you agree, I will change to only protect amdgpu_gem_va_update_vm in this function. Better even only protect the amdgpu_vm_update_prt_state() function. Christian. Christian. Regards, Christian. Best Regards Dennis Li -Original Message- From: Koenig, Christian Sent: Wednesday, August 12, 2020 12:15 AM To: Kuehling, Felix ; Li, Dennis ; amd-gfx@lists.freedesktop.org; Deucher, Alexander ; Zhang, Hawking Subject: Re: [PATCH] drm/amdgpu: fix a potential circular locking dependency Am 11.08.20 um 15:57 schrieb Felix Kuehling: Am 2020-08-11 um 5:32 a.m. 
schrieb Dennis Li: [ 653.902305] == [ 653.902928] WARNING: possible circular locking dependency detected [ 653.903517] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G OE [ 653.904098] -- [ 653.904675] amdgpu_test/3975 is trying to acquire lock: [ 653.905241] 97848f8647a0 (>reset_sem){.+.+}, at: amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu] [ 653.905953] but task is already holding lock: [ 653.907087] 9744adbee1f8 (reservation_ww_class_mutex){+.+.}, at: ttm_eu_reserve_buffers+0x1ae/0x520 [ttm] [ 653.907694] which lock already depends on the new lock. [ 653.909423] the existing dependency chain (in reverse order) is: [ 653.910594] -> #1 (reservation_ww_class_mutex){+.+.}: [ 653.911759]__ww_mutex_lock.constprop.15+0xca/0x1120 [ 653.912350]ww_mutex_lock+0x73/0x80 [ 653.913044]amdgpu_amdkfd_alloc_gtt_mem+0xde/0x380 [amdgpu] [ 653.913724]kgd2kfd_device_init+0x13f/0x5e0 [amdgpu] [ 653.914388]amdgpu_amdkfd_device_init+0x155/0x190 [amdgpu] [ 653.915033]amdgpu_device_init+0x1303/0x1e10 [amdgpu] [ 653.915685]amdgpu_driver_load_kms+0x5c/0x2c0 [amdgpu] [ 653.916349]amdgpu_pci_probe+0x11d/0x200 [amdgpu] [ 653.916959]local_pci_probe+0x47/0xa0 [ 653.917570]work_for_cpu_fn+0x1a/0x30 [ 653.918184]process_one_work+0x29e/0x630 [ 653.918803]worker_thread+0x22b/0x3f0 [ 653.919427]kthread+0x12f/0x150 [ 653.920047]ret_from_fork+0x3a/0x50 [ 653.920661] -> #0 (>reset_sem){.+.+}: [ 653.921893]__lock_acquire+0x13ec/0x16e0 [ 653.922531]lock_acquire+0xb8/0x1c0 [ 653.923174]down_read+0x48/0x230 [ 653.923886]amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu] [ 653.924588]drm_ioctl_kernel+0xb6/0x100 [drm] [ 653.925283]drm_ioctl+0x389/0x450 [drm] [ 653.926013]amdgpu_drm_ioctl+0x4f/0x80 [amdgpu] [ 653.926686]ksys_ioctl+0x98/0xb0 [ 653.927357]__x64_sys_ioctl+0x1a/0x20 [ 653.928030]do_syscall_64+0x5f/0x250 [ 653.928697]entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 653.929373] other info that might help us debug this: [ 653.931356] Possible unsafe locking scenario: [ 653.932647]CPU0CPU1 [ 653.933287] [ 653.933911] 
lock(reservation_ww_class_mutex); [ 653.934530]lock(>reset_sem); [ 653.935154]lock(reservation_ww_class_mutex); [ 653.935766] lock(>reset_sem); [ 653.936360] *** DEADLOCK *** [ 653.938028] 2 locks held by amdgpu_test/3975: [ 653.938574] #0: b2a862d6bcd0 (reservation_ww_class_acquire){+.+.}, at: amdgpu_gem_va_ioctl+0x39b/0x4f0 [amdgpu] [ 653.939233] #1: 9744adbee1f8 (reservation_ww_class_mutex){+.+.}, at: ttm_eu_reserve_buffers+0x1ae/0x520 [ttm] change the order of reservation_ww_class_mutex and adev->reset_sem in amdgpu_gem_va_ioctl the same
RE: [PATCH] drm/amdgpu: fix a potential circular locking dependency
[AMD Official Use Only - Internal Distribution Only] Am 12.08.20 um 11:23 schrieb Li, Dennis: > [AMD Official Use Only - Internal Distribution Only] > > Am 12.08.20 um 03:33 schrieb Li, Dennis: >> [AMD Official Use Only - Internal Distribution Only] >> >> Hi, Christian, >> >> Re: I was wondering the same thing for the amdgpu_gem_va_ioctl() as well. We >> shouldn't have any hardware access here, so taking the reset_sem looks like >> overkill to me. >> >> [Dennis Li] amdgpu_vm_bo_unmap, amdgpu_vm_bo_clear_mappings, >> amdgpu_vm_bo_replace_map and amdgpu_gem_va_update_vm all a chance to access >> hardware. > This is complete nonsense. The functions intentionally work through the > scheduler to avoid accessing the hardware directly for exactly that reason. > > The only hardware access we have here is the HDP flush and that can fail in > the case of a GPU reset without causing problems. > > [Dennis Li] amdgpu_vm_bo_clear_mappings -> amdgpu_vm_prt_get -> > amdgpu_vm_update_prt_state -> gmc_v8_0_set_prt That is for pre gfx9 hardware and only called once during initial enabling of the feature. Please remove that locking again since it is clearly completely against the driver design. [Dennis Li] okay, if you agree, I will change to only protect amdgpu_gem_va_update_vm in this function. Christian. > > Regards, > Christian. > >> Best Regards >> Dennis Li >> -Original Message- >> From: Koenig, Christian >> Sent: Wednesday, August 12, 2020 12:15 AM >> To: Kuehling, Felix ; Li, Dennis >> ; amd-gfx@lists.freedesktop.org; Deucher, >> Alexander ; Zhang, Hawking >> >> Subject: Re: [PATCH] drm/amdgpu: fix a potential circular locking >> dependency >> >> Am 11.08.20 um 15:57 schrieb Felix Kuehling: >>> Am 2020-08-11 um 5:32 a.m. 
schrieb Dennis Li: [ 653.902305] == [ 653.902928] WARNING: possible circular locking dependency detected [ 653.903517] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G OE [ 653.904098] -- [ 653.904675] amdgpu_test/3975 is trying to acquire lock: [ 653.905241] 97848f8647a0 (>reset_sem){.+.+}, at: amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu] [ 653.905953] but task is already holding lock: [ 653.907087] 9744adbee1f8 (reservation_ww_class_mutex){+.+.}, at: ttm_eu_reserve_buffers+0x1ae/0x520 [ttm] [ 653.907694] which lock already depends on the new lock. [ 653.909423] the existing dependency chain (in reverse order) is: [ 653.910594] -> #1 (reservation_ww_class_mutex){+.+.}: [ 653.911759]__ww_mutex_lock.constprop.15+0xca/0x1120 [ 653.912350]ww_mutex_lock+0x73/0x80 [ 653.913044]amdgpu_amdkfd_alloc_gtt_mem+0xde/0x380 [amdgpu] [ 653.913724]kgd2kfd_device_init+0x13f/0x5e0 [amdgpu] [ 653.914388]amdgpu_amdkfd_device_init+0x155/0x190 [amdgpu] [ 653.915033]amdgpu_device_init+0x1303/0x1e10 [amdgpu] [ 653.915685]amdgpu_driver_load_kms+0x5c/0x2c0 [amdgpu] [ 653.916349]amdgpu_pci_probe+0x11d/0x200 [amdgpu] [ 653.916959]local_pci_probe+0x47/0xa0 [ 653.917570]work_for_cpu_fn+0x1a/0x30 [ 653.918184]process_one_work+0x29e/0x630 [ 653.918803]worker_thread+0x22b/0x3f0 [ 653.919427]kthread+0x12f/0x150 [ 653.920047]ret_from_fork+0x3a/0x50 [ 653.920661] -> #0 (>reset_sem){.+.+}: [ 653.921893]__lock_acquire+0x13ec/0x16e0 [ 653.922531]lock_acquire+0xb8/0x1c0 [ 653.923174]down_read+0x48/0x230 [ 653.923886]amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu] [ 653.924588]drm_ioctl_kernel+0xb6/0x100 [drm] [ 653.925283]drm_ioctl+0x389/0x450 [drm] [ 653.926013]amdgpu_drm_ioctl+0x4f/0x80 [amdgpu] [ 653.926686]ksys_ioctl+0x98/0xb0 [ 653.927357]__x64_sys_ioctl+0x1a/0x20 [ 653.928030]do_syscall_64+0x5f/0x250 [ 653.928697]entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 653.929373] other info that might help us debug this: [ 653.931356] Possible unsafe locking scenario: [ 653.932647]CPU0CPU1 [ 653.933287] [ 653.933911] 
lock(reservation_ww_class_mutex); [ 653.934530]lock(>reset_sem); [ 653.935154] lock(reservation_ww_class_mutex); [ 653.935766] lock(>reset_sem); [ 653.936360] *** DEADLOCK *** [ 653.938028] 2 locks held by amdgpu_test/3975: [ 653.938574] #0: b2a862d6bcd0
Re: [PATCH] drm/amdgpu: fix a potential circular locking dependency
Am 12.08.20 um 11:23 schrieb Li, Dennis: [AMD Official Use Only - Internal Distribution Only] Am 12.08.20 um 03:33 schrieb Li, Dennis: [AMD Official Use Only - Internal Distribution Only] Hi, Christian, Re: I was wondering the same thing for the amdgpu_gem_va_ioctl() as well. We shouldn't have any hardware access here, so taking the reset_sem looks like overkill to me. [Dennis Li] amdgpu_vm_bo_unmap, amdgpu_vm_bo_clear_mappings, amdgpu_vm_bo_replace_map and amdgpu_gem_va_update_vm all a chance to access hardware. This is complete nonsense. The functions intentionally work through the scheduler to avoid accessing the hardware directly for exactly that reason. The only hardware access we have here is the HDP flush and that can fail in the case of a GPU reset without causing problems. [Dennis Li] amdgpu_vm_bo_clear_mappings -> amdgpu_vm_prt_get -> amdgpu_vm_update_prt_state -> gmc_v8_0_set_prt That is for pre gfx9 hardware and only called once during initial enabling of the feature. Please remove that locking again since it is clearly completely against the driver design. Christian. Regards, Christian. Best Regards Dennis Li -Original Message- From: Koenig, Christian Sent: Wednesday, August 12, 2020 12:15 AM To: Kuehling, Felix ; Li, Dennis ; amd-gfx@lists.freedesktop.org; Deucher, Alexander ; Zhang, Hawking Subject: Re: [PATCH] drm/amdgpu: fix a potential circular locking dependency Am 11.08.20 um 15:57 schrieb Felix Kuehling: Am 2020-08-11 um 5:32 a.m. 
schrieb Dennis Li: [same lockdep trace as above, snipped] change the order of reservation_ww_class_mutex and adev->reset_sem in amdgpu_gem_va_ioctl the same as ones in amdgpu_amdkfd_alloc_gtt_mem, to avoid potential dead lock. It may be better to fix it the other way around in amdgpu_amdkfd_alloc_gtt_mem. Always take the reset_sem inside the reservation. Otherwise you will never be able to take the reset_sem while any BOs are reserved.
RE: [PATCH] drm/amdgpu: fix a potential circular locking dependency
[AMD Official Use Only - Internal Distribution Only] Am 12.08.20 um 03:19 schrieb Li, Dennis: > [AMD Official Use Only - Internal Distribution Only] > > Hi, Felix, > > Re: It may be better to fix it the other way around in > amdgpu_amdkfd_alloc_gtt_mem. Always take the reset_sem inside the > reservation. Otherwise you will never be able to take the reset_sem while any > BOs are reserved. That's probably going to cause you other problems later. > [Dennis Li] Thanks that you find the potential issue, I will change it in > version 2. > > Re: That makes me wonder, why do you need the reset_sem in > amdgpu_amdkfd_alloc_gtt_mem in the first place? There is no obvious hardware > access in that function. Is it for amdgpu_ttm_alloc_gart updating the GART > table through HDP? > [Dennis Li] Yes, amdgpu_gart_bind will flush HDP and TLB. I have considered > to only protect amdgpu_ttm_alloc_gart before. That access is irrelevant and the lock should be removed or changed into a trylock. See we need the HDP flush only because the hardware could have accessed the data before. But after a GPU reset the HDP is known to be clean, so this doesn't need any protection. > But I worry other functions will access hardware in the future. Therefore I > select an aggressive approach which lock reset_sem at the beginning of entry > functions of amdgpu driver. This is not a good idea. We used to have such a global lock before and removed it because it caused all kind of problems. [Dennis Li] okay. If you don't agree this aggressive approach. I will change to protect amdgpu_ttm_alloc_gart only. When was this added? Looks like it slipped under my radar or I wasn't awake enough at that moment. Christian. 
> > Best Regards > Dennis Li > -----Original Message----- > From: Kuehling, Felix > Sent: Tuesday, August 11, 2020 9:57 PM > To: Li, Dennis ; amd-gfx@lists.freedesktop.org; > Deucher, Alexander ; Zhang, Hawking > ; Koenig, Christian > Subject: Re: [PATCH] drm/amdgpu: fix a potential circular locking > dependency > > Am 2020-08-11 um 5:32 a.m. schrieb Dennis Li: >> [same lockdep trace as above, snipped]
[PATCH] drm/amdgpu: Limit the error info print rate
From: jqdeng

Use printk_ratelimit() to limit the print rate.

Signed-off-by: jqdeng
Change-Id: Ief05debe30d975cbcf88e473c9f486d70b5a202c
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index a94b3f862fc2..727b909b4b9e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -1296,7 +1296,8 @@ int amdgpu_cs_ioctl(struct drm_device *dev, void *data, struct drm_file *filp)
 	r = amdgpu_cs_parser_init(&parser, data);
 	if (r) {
-		DRM_ERROR("Failed to initialize parser %d!\n", r);
+		if (printk_ratelimit())
+			DRM_ERROR("Failed to initialize parser %d!\n", r);
 		goto out;
 	}
--
2.25.1
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
RE: [PATCH] drm/amdgpu: fix a potential circular locking dependency
[AMD Official Use Only - Internal Distribution Only] Am 12.08.20 um 03:33 schrieb Li, Dennis: > [AMD Official Use Only - Internal Distribution Only] > > Hi, Christian, > > Re: I was wondering the same thing for the amdgpu_gem_va_ioctl() as well. We > shouldn't have any hardware access here, so taking the reset_sem looks like > overkill to me. > > [Dennis Li] amdgpu_vm_bo_unmap, amdgpu_vm_bo_clear_mappings, > amdgpu_vm_bo_replace_map and amdgpu_gem_va_update_vm all a chance to access > hardware. This is complete nonsense. The functions intentionally work through the scheduler to avoid accessing the hardware directly for exactly that reason. The only hardware access we have here is the HDP flush and that can fail in the case of a GPU reset without causing problems. [Dennis Li] amdgpu_vm_bo_clear_mappings -> amdgpu_vm_prt_get -> amdgpu_vm_update_prt_state -> gmc_v8_0_set_prt Regards, Christian. > > Best Regards > Dennis Li > -----Original Message----- > From: Koenig, Christian > Sent: Wednesday, August 12, 2020 12:15 AM > To: Kuehling, Felix ; Li, Dennis > ; amd-gfx@lists.freedesktop.org; Deucher, Alexander > ; Zhang, Hawking > Subject: Re: [PATCH] drm/amdgpu: fix a potential circular locking > dependency > > Am 11.08.20 um 15:57 schrieb Felix Kuehling: >> Am 2020-08-11 um 5:32 a.m. schrieb Dennis Li: >>> [same lockdep trace as above, snipped] >>> change the order of reservation_ww_class_mutex and adev->reset_sem >>> in amdgpu_gem_va_ioctl the same as ones in >>> amdgpu_amdkfd_alloc_gtt_mem, to avoid potential dead lock. >> It may be better to fix it the other way around in >> amdgpu_amdkfd_alloc_gtt_mem. Always take the reset_sem inside the >> reservation.
[PATCH] drm/amdgpu: Fix repeatly flr issue
From: jqdeng

Only the no-job-running test case needs to do the recovery in the FLR
notification. When there are jobs in the mirror list, let the guest
driver hit the job timeout and then do the recovery.

Signed-off-by: jqdeng
Change-Id: Ic6234fce46fa1655ba81c4149235eeac75e75868
---
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 20 +++-
 drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 22 --
 2 files changed, 39 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
index fe31cbeccfe9..12fe5164aaf3 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
@@ -238,6 +238,9 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
 	struct amdgpu_virt *virt = container_of(work, struct amdgpu_virt, flr_work);
 	struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, virt);
 	int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
+	int i;
+	bool need_do_recover = true;
+	struct drm_sched_job *job;
 
 	/* block amdgpu_gpu_recover till msg FLR COMPLETE received,
 	 * otherwise the mailbox msg will be ruined/reseted by
@@ -258,10 +261,25 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
 flr_done:
 	up_read(&adev->reset_sem);
 
+	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+		struct amdgpu_ring *ring = adev->rings[i];
+
+		if (!ring || !ring->sched.thread)
+			continue;
+
+		spin_lock(&ring->sched.job_list_lock);
+		job = list_first_entry_or_null(&ring->sched.ring_mirror_list,
+				struct drm_sched_job, node);
+		spin_unlock(&ring->sched.job_list_lock);
+		if (job) {
+			need_do_recover = false;
+			break;
+		}
+	}
 	/* Trigger recovery for world switch failure if no TDR */
 	if (amdgpu_device_should_recover_gpu(adev)
-		&& adev->sdma_timeout == MAX_SCHEDULE_TIMEOUT)
+		&& (need_do_recover || adev->sdma_timeout == MAX_SCHEDULE_TIMEOUT))
 		amdgpu_device_gpu_recover(adev, NULL);
 }
 
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
index 6f55172e8337..fc92c494df0b 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
@@ -259,6 +259,9 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct *work)
 	struct amdgpu_virt *virt = container_of(work, struct amdgpu_virt, flr_work);
 	struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, virt);
 	int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
+	int i;
+	bool need_do_recover = true;
+	struct drm_sched_job *job;
 
 	/* block amdgpu_gpu_recover till msg FLR COMPLETE received,
 	 * otherwise the mailbox msg will be ruined/reseted by
@@ -279,10 +282,25 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct *work)
 flr_done:
 	up_read(&adev->reset_sem);
 
+	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+		struct amdgpu_ring *ring = adev->rings[i];
+
+		if (!ring || !ring->sched.thread)
+			continue;
+
+		spin_lock(&ring->sched.job_list_lock);
+		job = list_first_entry_or_null(&ring->sched.ring_mirror_list,
+				struct drm_sched_job, node);
+		spin_unlock(&ring->sched.job_list_lock);
+		if (job) {
+			need_do_recover = false;
+			break;
+		}
+	}
 	/* Trigger recovery for world switch failure if no TDR */
-	if (amdgpu_device_should_recover_gpu(adev)
-		&& (adev->sdma_timeout == MAX_SCHEDULE_TIMEOUT ||
+	if (amdgpu_device_should_recover_gpu(adev) && (need_do_recover ||
+		adev->sdma_timeout == MAX_SCHEDULE_TIMEOUT ||
 		adev->gfx_timeout == MAX_SCHEDULE_TIMEOUT ||
 		adev->compute_timeout == MAX_SCHEDULE_TIMEOUT ||
 		adev->video_timeout == MAX_SCHEDULE_TIMEOUT))
--
2.25.1
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Re: [PATCH] drm/amdgpu: fix a potential circular locking dependency
Am 12.08.20 um 03:33 schrieb Li, Dennis: [AMD Official Use Only - Internal Distribution Only] Hi, Christian, Re: I was wondering the same thing for the amdgpu_gem_va_ioctl() as well. We shouldn't have any hardware access here, so taking the reset_sem looks like overkill to me. [Dennis Li] amdgpu_vm_bo_unmap, amdgpu_vm_bo_clear_mappings, amdgpu_vm_bo_replace_map and amdgpu_gem_va_update_vm all a chance to access hardware. This is complete nonsense. The functions intentionally work through the scheduler to avoid accessing the hardware directly for exactly that reason. The only hardware access we have here is the HDP flush and that can fail in the case of a GPU reset without causing problems. Regards, Christian. Best Regards Dennis Li -----Original Message----- From: Koenig, Christian Sent: Wednesday, August 12, 2020 12:15 AM To: Kuehling, Felix ; Li, Dennis ; amd-gfx@lists.freedesktop.org; Deucher, Alexander ; Zhang, Hawking Subject: Re: [PATCH] drm/amdgpu: fix a potential circular locking dependency Am 11.08.20 um 15:57 schrieb Felix Kuehling: Am 2020-08-11 um 5:32 a.m. schrieb Dennis Li: [same lockdep trace as above, snipped] change the order of reservation_ww_class_mutex and adev->reset_sem in amdgpu_gem_va_ioctl the same as ones
in amdgpu_amdkfd_alloc_gtt_mem, to avoid potential dead lock. It may be better to fix it the other way around in amdgpu_amdkfd_alloc_gtt_mem. Always take the reset_sem inside the reservation. Otherwise you will never be able to take the reset_sem while any BOs are reserved. That's probably going to cause you other problems later. That makes me wonder, why do you need the reset_sem in amdgpu_amdkfd_alloc_gtt_mem in the first place? There is no obvious hardware access in that function. Is it for amdgpu_ttm_alloc_gart updating the GART table through HDP? I was wondering the same thing for the amdgpu_gem_va_ioctl() as well. We shouldn't have any hardware access here, so taking the
Re: [PATCH] drm/amdgpu: fix a potential circular locking dependency
Am 12.08.20 um 03:19 schrieb Li, Dennis: [AMD Official Use Only - Internal Distribution Only] Hi, Felix, Re: It may be better to fix it the other way around in amdgpu_amdkfd_alloc_gtt_mem. Always take the reset_sem inside the reservation. Otherwise you will never be able to take the reset_sem while any BOs are reserved. That's probably going to cause you other problems later. [Dennis Li] Thanks that you find the potential issue, I will change it in version 2. Re: That makes me wonder, why do you need the reset_sem in amdgpu_amdkfd_alloc_gtt_mem in the first place? There is no obvious hardware access in that function. Is it for amdgpu_ttm_alloc_gart updating the GART table through HDP? [Dennis Li] Yes, amdgpu_gart_bind will flush HDP and TLB. I have considered to only protect amdgpu_ttm_alloc_gart before. That access is irrelevant and the lock should be removed or changed into a trylock. See we need the HDP flush only because the hardware could have accessed the data before. But after a GPU reset the HDP is known to be clean, so this doesn't need any protection. But I worry other functions will access hardware in the future. Therefore I select an aggressive approach which lock reset_sem at the beginning of entry functions of amdgpu driver. This is not a good idea. We used to have such a global lock before and removed it because it caused all kind of problems. When was this added? Looks like it slipped under my radar or I wasn't awake enough at that moment. Christian. Best Regards Dennis Li -Original Message- From: Kuehling, Felix Sent: Tuesday, August 11, 2020 9:57 PM To: Li, Dennis ; amd-gfx@lists.freedesktop.org; Deucher, Alexander ; Zhang, Hawking ; Koenig, Christian Subject: Re: [PATCH] drm/amdgpu: fix a potential circular locking dependency Am 2020-08-11 um 5:32 a.m. 
schrieb Dennis Li: [same lockdep trace as above, snipped] change the order of reservation_ww_class_mutex and adev->reset_sem
RE: [PATCH] drm/amdgpu: disable gfxoff for navy_flounder
[AMD Public Use] Please remember to revert it when root cause is found out, the patch is: Reviewed-by: Tao Zhou > -Original Message- > From: Jiansong Chen > Sent: Wednesday, August 12, 2020 4:44 PM > To: amd-gfx@lists.freedesktop.org > Cc: Zhou1, Tao ; Feng, Kenneth > ; Chen, Jiansong (Simon) > Subject: [PATCH] drm/amdgpu: disable gfxoff for navy_flounder > > gfxoff is temporarily disabled for navy_flounder, since at present the > feature has > broken some basic amdgpu test. > > Signed-off-by: Jiansong Chen > Change-Id: Icc030370997a66fb9f01cdd4b1c45816e3c88584 > --- > drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c > b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c > index d851fe80eaf4..de6e6de41867 100644 > --- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c > +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c > @@ -3610,6 +3610,9 @@ static void gfx_v10_0_check_gfxoff_flag(struct > amdgpu_device *adev) > if (!gfx_v10_0_navi10_gfxoff_should_enable(adev)) > adev->pm.pp_feature &= ~PP_GFXOFF_MASK; > break; > + case CHIP_NAVY_FLOUNDER: > + adev->pm.pp_feature &= ~PP_GFXOFF_MASK; > + break; > default: > break; > } > -- > 2.25.1 ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
[PATCH] drm/amdgpu: disable gfxoff for navy_flounder
gfxoff is temporarily disabled for navy_flounder, since at present the feature has broken some basic amdgpu test. Signed-off-by: Jiansong Chen Change-Id: Icc030370997a66fb9f01cdd4b1c45816e3c88584 --- drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c index d851fe80eaf4..de6e6de41867 100644 --- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c @@ -3610,6 +3610,9 @@ static void gfx_v10_0_check_gfxoff_flag(struct amdgpu_device *adev) if (!gfx_v10_0_navi10_gfxoff_should_enable(adev)) adev->pm.pp_feature &= ~PP_GFXOFF_MASK; break; + case CHIP_NAVY_FLOUNDER: + adev->pm.pp_feature &= ~PP_GFXOFF_MASK; + break; default: break; } -- 2.25.1 ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Re: [PATCH v3] drm/amdgpu: add new trace event for page table update v3
Am 12.08.20 um 10:15 schrieb Shashank Sharma: Hello Christian, On 12/08/20 12:15 pm, Christian König wrote: Am 12.08.20 um 06:33 schrieb Shashank Sharma: This patch adds a new trace event to track the PTE update events. This specific event will provide information like: - start and end of virtual memory mapping - HW engine flags for the map - physical address for mapping This will be particularly useful for memory profiling tools (like RMV) which are monitoring the page table update events. V2: Added physical address lookup logic in trace point V3: switch to use __dynamic_array added nptes int the TPprint arguments list added page size in the arg list Cc: Christian König Cc: Alex Deucher Signed-off-by: Christian König Signed-off-by: Shashank Sharma --- drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h | 38 +++ drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c| 9 -- 2 files changed, 45 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h index 63e734a125fb..b9aae7983b4b 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h @@ -321,6 +321,44 @@ DEFINE_EVENT(amdgpu_vm_mapping, amdgpu_vm_bo_cs, TP_ARGS(mapping) ); +TRACE_EVENT(amdgpu_vm_update_ptes, + TP_PROTO(struct amdgpu_vm_update_params *p, +uint64_t start, uint64_t end, +unsigned int nptes, uint64_t dst, +uint64_t incr, uint64_t flags), + TP_ARGS(p, start, end, nptes, dst, incr, flags), + TP_STRUCT__entry( +__field(u64, start) +__field(u64, end) +__field(u64, flags) +__field(u64, incr) +__field(unsigned int, nptes) +__dynamic_array(u64, dst, nptes) + ), + + TP_fast_assign( + unsigned int i; + + __entry->start = start; + __entry->end = end; + __entry->flags = flags; + __entry->incr = incr; + __entry->nptes = nptes; + + for (i = 0; i < nptes; ++i) { + u64 addr = p->pages_addr ? 
amdgpu_vm_map_gart( + p->pages_addr, dst) : dst; + + ((u64 *)__get_dynamic_array(dst))[i] = addr; + dst += incr; + } + ), + TP_printk("seg:0x%010llx-0x%010llx, flags:0x%llx, nptr=%u, pgsz:%llu," + " dst:\n%s", __entry->start, __entry->end, __entry->flags, + __entry->nptes, __entry->incr, This is not correct. The increment is NOT the page size, but rather the page size rounded down to a power of 512+4K. In other words page size can be 4K, 8K, 16K, 32K, 64K1M, 2M, 4M512M, 1G, 2G But the increment can only be 4K, 2M, 1G Understood. But I think the requirement here is for increment. My understanding is that the tool needs to save the page entries, and for that, it will need start of virtual mem, start of physical mem, mapping size and step to increment the entries. If that's so, we can re-label this entry as "step" instead of "page size". Please let me know if you think it's the right thing to do. We could stick with the naming increment if that helps, but this can also be derived from the number of destination addresses we have. On the other hand explicitly mentioning it probably won't hurt us either. And by the way what does "seg" mean? Christian. And do we need the nptes here? We just need it to print the correct number of destination addresses. Agree, we don't really need nptes here, I will remove that and send V4. - Shashank Regards, Christian. 
+ __print_array(__get_dynamic_array(dst), __entry->nptes, 8)) +); + TRACE_EVENT(amdgpu_vm_set_ptes, TP_PROTO(uint64_t pe, uint64_t addr, unsigned count, uint32_t incr, uint64_t flags, bool direct), diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c index 71e005cf2952..b5dbb5e8bc61 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c @@ -1513,17 +1513,22 @@ static int amdgpu_vm_update_ptes(struct amdgpu_vm_update_params *params, do { uint64_t upd_end = min(entry_end, frag_end); unsigned nptes = (upd_end - frag_start) >> shift; + uint64_t upd_flags = flags | AMDGPU_PTE_FRAG(frag); /* This can happen when we set higher level PDs to * silent to stop fault floods. */ nptes = max(nptes, 1u); + +
Re: [PATCH v3] drm/amdgpu: add new trace event for page table update v3
Hello Christian, On 12/08/20 12:15 pm, Christian König wrote: > Am 12.08.20 um 06:33 schrieb Shashank Sharma: >> This patch adds a new trace event to track the PTE update >> events. This specific event will provide information like: >> - start and end of virtual memory mapping >> - HW engine flags for the map >> - physical address for mapping >> >> This will be particularly useful for memory profiling tools >> (like RMV) which are monitoring the page table update events. >> >> V2: Added physical address lookup logic in trace point >> V3: switch to use __dynamic_array >> added nptes int the TPprint arguments list >> added page size in the arg list >> >> Cc: Christian König >> Cc: Alex Deucher >> Signed-off-by: Christian König >> Signed-off-by: Shashank Sharma >> --- >> drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h | 38 +++ >> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c| 9 -- >> 2 files changed, 45 insertions(+), 2 deletions(-) >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h >> index 63e734a125fb..b9aae7983b4b 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h >> @@ -321,6 +321,44 @@ DEFINE_EVENT(amdgpu_vm_mapping, amdgpu_vm_bo_cs, >> TP_ARGS(mapping) >> ); >> >> +TRACE_EVENT(amdgpu_vm_update_ptes, >> +TP_PROTO(struct amdgpu_vm_update_params *p, >> + uint64_t start, uint64_t end, >> + unsigned int nptes, uint64_t dst, >> + uint64_t incr, uint64_t flags), >> +TP_ARGS(p, start, end, nptes, dst, incr, flags), >> +TP_STRUCT__entry( >> + __field(u64, start) >> + __field(u64, end) >> + __field(u64, flags) >> + __field(u64, incr) >> + __field(unsigned int, nptes) >> + __dynamic_array(u64, dst, nptes) >> +), >> + >> +TP_fast_assign( >> +unsigned int i; >> + >> +__entry->start = start; >> +__entry->end = end; >> +__entry->flags = flags; >> +__entry->incr = incr; >> +__entry->nptes = nptes; >> + >> +for (i = 0; i < nptes; ++i) { >> +u64 addr = p->pages_addr ? 
amdgpu_vm_map_gart( >> +p->pages_addr, dst) : dst; >> + >> +((u64 *)__get_dynamic_array(dst))[i] = addr; >> +dst += incr; >> +} >> +), >> +TP_printk("seg:0x%010llx-0x%010llx, flags:0x%llx, nptr=%u, pgsz:%llu," >> + " dst:\n%s", __entry->start, __entry->end, __entry->flags, >> + __entry->nptes, __entry->incr, > This is not correct. The increment is NOT the page size, but rather the > page size rounded down to a power of 512 times 4K. > > In other words the page size can be 4K, 8K, 16K, 32K, 64K..1M, 2M, > 4M..512M, 1G, 2G > > But the increment can only be 4K, 2M, 1G Understood. But I think what is needed here is the increment anyway. My understanding is that the tool needs to save the page entries, and for that it will need the start of virtual memory, the start of physical memory, the mapping size and the step by which to increment the entries. If that's so, we can re-label this entry as "step" instead of "page size". Please let me know if you think that's the right thing to do. > And do we need the nptes here? We just need it to print the correct > number of destination addresses. Agree, we don't really need nptes here; I will remove that and send V4. - Shashank > > Regards, > Christian.
>> */ >> nptes = max(nptes, 1u); >> + >> +trace_amdgpu_vm_update_ptes(params, frag_start, upd_end, >> +nptes, dst, incr, >> +
Re: [PATCH v3] drm/amdgpu: add new trace event for page table update v3
On 12.08.20 at 06:33, Shashank Sharma wrote: This patch adds a new trace event to track the PTE update events. This specific event will provide information like: - start and end of virtual memory mapping - HW engine flags for the map - physical address for mapping This will be particularly useful for memory profiling tools (like RMV) which are monitoring the page table update events. V2: Added physical address lookup logic in trace point V3: switch to use __dynamic_array added nptes in the TP_printk arguments list added page size in the arg list Cc: Christian König Cc: Alex Deucher Signed-off-by: Christian König Signed-off-by: Shashank Sharma --- drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h | 38 +++ drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c| 9 -- 2 files changed, 45 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h index 63e734a125fb..b9aae7983b4b 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h @@ -321,6 +321,44 @@ DEFINE_EVENT(amdgpu_vm_mapping, amdgpu_vm_bo_cs, TP_ARGS(mapping) ); +TRACE_EVENT(amdgpu_vm_update_ptes, + TP_PROTO(struct amdgpu_vm_update_params *p, +uint64_t start, uint64_t end, +unsigned int nptes, uint64_t dst, +uint64_t incr, uint64_t flags), + TP_ARGS(p, start, end, nptes, dst, incr, flags), + TP_STRUCT__entry( +__field(u64, start) +__field(u64, end) +__field(u64, flags) +__field(u64, incr) +__field(unsigned int, nptes) +__dynamic_array(u64, dst, nptes) + ), + + TP_fast_assign( + unsigned int i; + + __entry->start = start; + __entry->end = end; + __entry->flags = flags; + __entry->incr = incr; + __entry->nptes = nptes; + + for (i = 0; i < nptes; ++i) { + u64 addr = p->pages_addr ?
amdgpu_vm_map_gart( + p->pages_addr, dst) : dst; + + ((u64 *)__get_dynamic_array(dst))[i] = addr; + dst += incr; + } + ), + TP_printk("seg:0x%010llx-0x%010llx, flags:0x%llx, nptr=%u, pgsz:%llu," + " dst:\n%s", __entry->start, __entry->end, __entry->flags, + __entry->nptes, __entry->incr, This is not correct. The increment is NOT the page size, but rather the page size rounded down to a power of 512 times 4K. In other words the page size can be 4K, 8K, 16K, 32K, 64K..1M, 2M, 4M..512M, 1G, 2G But the increment can only be 4K, 2M, 1G And do we need the nptes here? We just need it to print the correct number of destination addresses. Regards, Christian. + __print_array(__get_dynamic_array(dst), __entry->nptes, 8)) +); + TRACE_EVENT(amdgpu_vm_set_ptes, TP_PROTO(uint64_t pe, uint64_t addr, unsigned count, uint32_t incr, uint64_t flags, bool direct), diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c index 71e005cf2952..b5dbb5e8bc61 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c @@ -1513,17 +1513,22 @@ static int amdgpu_vm_update_ptes(struct amdgpu_vm_update_params *params, do { uint64_t upd_end = min(entry_end, frag_end); unsigned nptes = (upd_end - frag_start) >> shift; + uint64_t upd_flags = flags | AMDGPU_PTE_FRAG(frag); /* This can happen when we set higher level PDs to * silent to stop fault floods. */ nptes = max(nptes, 1u); + + trace_amdgpu_vm_update_ptes(params, frag_start, upd_end, + nptes, dst, incr, + upd_flags); amdgpu_vm_update_flags(params, pt, cursor.level, pe_start, dst, nptes, incr, - flags | AMDGPU_PTE_FRAG(frag)); + upd_flags); pe_start += nptes * 8; - dst += (uint64_t)nptes * AMDGPU_GPU_PAGE_SIZE << shift; + dst += nptes * incr; frag_start = upd_end; if (frag_start >= frag_end) {
Re: [PATCH 3/4] drm/amd/powerplay: enable Navi1X mgpu fan boost feature
On 8/12/20 6:56 AM, Evan Quan wrote: Support Navi1X mgpu fan boost enablement. Change-Id: Iafbf07c56462120d2db578b6af45dd7f985a4cc1 Signed-off-by: Evan Quan --- .../drm/amd/powerplay/inc/smu_v11_0_ppsmc.h | 4 +++- drivers/gpu/drm/amd/powerplay/navi10_ppt.c| 21 +++ 2 files changed, 24 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/powerplay/inc/smu_v11_0_ppsmc.h b/drivers/gpu/drm/amd/powerplay/inc/smu_v11_0_ppsmc.h index 406bfd187ce8..fa0174dc7e0e 100644 --- a/drivers/gpu/drm/amd/powerplay/inc/smu_v11_0_ppsmc.h +++ b/drivers/gpu/drm/amd/powerplay/inc/smu_v11_0_ppsmc.h @@ -123,7 +123,9 @@ #define PPSMC_MSG_DALDisableDummyPstateChange0x49 #define PPSMC_MSG_DALEnableDummyPstateChange 0x4A -#define PPSMC_Message_Count 0x4B +#define PPSMC_MSG_SetMGpuFanBoostLimitRpm0x4C + +#define PPSMC_Message_Count 0x4D typedef uint32_t PPSMC_Result; typedef uint32_t PPSMC_Msg; diff --git a/drivers/gpu/drm/amd/powerplay/navi10_ppt.c b/drivers/gpu/drm/amd/powerplay/navi10_ppt.c index 61e2971be9f3..a86cd819b44b 100644 --- a/drivers/gpu/drm/amd/powerplay/navi10_ppt.c +++ b/drivers/gpu/drm/amd/powerplay/navi10_ppt.c @@ -138,6 +138,7 @@ static struct cmn2asic_msg_mapping navi10_message_map[SMU_MSG_MAX_COUNT] = { MSG_MAP(DAL_ENABLE_DUMMY_PSTATE_CHANGE, PPSMC_MSG_DALEnableDummyPstateChange, 0), MSG_MAP(GetVoltageByDpm,PPSMC_MSG_GetVoltageByDpm, 0), MSG_MAP(GetVoltageByDpmOverdrive, PPSMC_MSG_GetVoltageByDpmOverdrive, 0), + MSG_MAP(SetMGpuFanBoostLimitRpm, PPSMC_MSG_SetMGpuFanBoostLimitRpm, 0), }; static struct cmn2asic_mapping navi10_clk_map[SMU_CLK_COUNT] = { @@ -2555,6 +2556,25 @@ static ssize_t navi10_get_gpu_metrics(struct smu_context *smu, return sizeof(struct gpu_metrics_v1_0); } +static int navi10_enable_mgpu_fan_boost(struct smu_context *smu) +{ + struct amdgpu_device *adev = smu->adev; + uint32_t param = 0; + + /* Navi12 does not support this */ + if (adev->asic_type == CHIP_NAVI12) + return 0; + + if (adev->pdev->device == 0x7312 && + adev->external_rev_id == 0) + 
param = 0xD188; Can you please add a comment explaining the above condition? Apart from that, the series is Acked-by: Nirmoy Das + + return smu_cmn_send_smc_msg_with_param(smu, + SMU_MSG_SetMGpuFanBoostLimitRpm, + param, + NULL); +} + static const struct pptable_funcs navi10_ppt_funcs = { .get_allowed_feature_mask = navi10_get_allowed_feature_mask, .set_default_dpm_table = navi10_set_default_dpm_table, @@ -2636,6 +2656,7 @@ static const struct pptable_funcs navi10_ppt_funcs = { .get_pp_feature_mask = smu_cmn_get_pp_feature_mask, .set_pp_feature_mask = smu_cmn_set_pp_feature_mask, .get_gpu_metrics = navi10_get_gpu_metrics, + .enable_mgpu_fan_boost = navi10_enable_mgpu_fan_boost, }; void navi10_set_ppt_funcs(struct smu_context *smu)