[PATCH] drm/amdgpu: drop log message in amdgpu_dpm_baco_reset()

2020-08-12 Thread Alex Deucher
The caller now does this for all reset types, so this is
a duplicate call.

Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_dpm.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dpm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_dpm.c
index 2082c0acd216..1c661c7ab49f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dpm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dpm.c
@@ -1110,8 +1110,6 @@ int amdgpu_dpm_baco_reset(struct amdgpu_device *adev)
struct smu_context *smu = &adev->smu;
int ret = 0;
 
-   dev_info(adev->dev, "GPU BACO reset\n");
-
if (is_support_sw_smu(adev)) {
ret = smu_baco_enter(smu);
if (ret)
-- 
2.25.4

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[pull] amdgpu drm-fixes-5.9

2020-08-12 Thread Alex Deucher
Hi Dave, Daniel,

Fixes for 5.9.

The following changes since commit f87812284172a9809820d10143b573d833cd3f75:

  drm/amdgpu: Fix bug where DPM is not enabled after hibernate and resume 
(2020-08-07 17:52:15 -0400)

are available in the Git repository at:

  git://people.freedesktop.org/~agd5f/linux tags/amd-drm-fixes-5.9-2020-08-12

for you to fetch changes up to f41ed88cbd6f025f7a683a11a74f901555fba11c:

  drm/amdgpu/display: use GFP_ATOMIC in dcn20_validate_bandwidth_internal 
(2020-08-10 18:09:46 -0400)


amd-drm-fixes-5.9-2020-08-12:

amdgpu:
- Fix allocation size
- SR-IOV fixes
- Vega20 SMU feature state caching fix
- Fix custom pptable handling
- Arcturus golden settings update
- Several display fixes


Anthony Koo (2):
  drm/amd/display: Fix LFC multiplier changing erratically
  drm/amd/display: Switch to immediate mode for updating infopackets

Aric Cyr (1):
  drm/amd/display: Fix incorrect backlight register offset for DCN

Christophe JAILLET (1):
  drm: amdgpu: Use the correct size when allocating memory

Daniel Kolesa (1):
  drm/amdgpu/display: use GFP_ATOMIC in dcn20_validate_bandwidth_internal

Evan Quan (2):
  drm/amd/powerplay: correct Vega20 cached smu feature state
  drm/amd/powerplay: correct UVD/VCE PG state on custom pptable uploading

Jaehyun Chung (1):
  drm/amd/display: Blank stream before destroying HDCP session

Liu ChengZhe (1):
  drm/amdgpu: Skip some registers config for SRIOV

Stylon Wang (1):
  drm/amd/display: Fix EDID parsing after resume from suspend

shiwu.zhang (1):
  drm/amdgpu: update gc golden register for arcturus

 drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c   |  2 +-
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c  |  1 +
 drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c   | 19 ++
 drivers/gpu/drm/amd/amdgpu/mmhub_v2_0.c| 19 ++
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c  |  1 +
 drivers/gpu/drm/amd/display/dc/core/dc_link.c  |  3 +-
 .../gpu/drm/amd/display/dc/dce/dce_panel_cntl.h|  2 +-
 .../amd/display/dc/dcn10/dcn10_stream_encoder.c| 16 
 .../amd/display/dc/dcn10/dcn10_stream_encoder.h| 14 +++
 .../gpu/drm/amd/display/dc/dcn20/dcn20_resource.c  |  2 +-
 .../drm/amd/display/modules/freesync/freesync.c| 36 ++
 drivers/gpu/drm/amd/powerplay/hwmgr/vega20_hwmgr.c | 44 ++
 12 files changed, 114 insertions(+), 45 deletions(-)
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [Nouveau] [PATCH 1/4] drm: retrieve EDID via ACPI _DDC method

2020-08-12 Thread Alex Deucher
On Wed, Aug 12, 2020 at 10:31 PM Daniel Dadap  wrote:
>
> Thanks, Lukas. I've incorporated your feedback into my local tree, but
> will wait for additional feedback from the individual DRM driver
> maintainers before sending out a series v2.
>
> On 8/8/20 5:11 PM, Lukas Wunner wrote:
> > On Mon, Jul 27, 2020 at 03:53:54PM -0500, Daniel Dadap wrote:
> >> + for (i = 0; i < num_dod_entries; i++) {
> >> + if (adr == dod_entries[i]) {
> >> + ret = do_acpi_ddc(child->handle);
> >> +
> >> + if (ret != NULL)
> >> + goto done;
> > I guess ideally we'd want to correlate the display objects with
> > drm_connectors or at least constrain the search to Display Type
> > "Internal/Integrated Digital Flat Panel" instead of picking the
> > first EDID found.  Otherwise we might erroneously use the DDC
> > for an externally attached display.
>
>
> Yes, we'd definitely need a way to do this if this functionality ever
> needs to be extended to systems with more than one _DDC method.
> Unfortunately, this will be much easier said than done, since I'm not
> aware of any way to reliably map _DOD entries to connectors in a GPU
> driver, especially when we're talking about possibly correlating
> connectors on multiple GPUs which mux to the same internal display or
> external connector. All systems which I am aware of that implement ACPI
> _DDC do so for a single internal panel. I don't believe there's any
> reason to ever retrieve an EDID via ACPI _DDC for an external panel, but
> a hypothetical design with multiple internal panels, more than one of
> which needs to retrieve an EDID via ACPI _DDC, would certainly be
> problematic.
>
>
> On at least the system I'm working with for the various switcheroo and
> platform-x86 driver patches I've recently sent off, the dGPU has an ACPI
> _DOD table and one _DDC method corresponding to one of the _DOD entries,
> but the iGPU has neither a _DOD table nor a _DDC method. Either GPU can
> be connected to the internal panel via the dynamically switchable mux,
> and the internal panel's EDID is available via _DDC to allow a
> disconnected GPU to read the EDID. Since only the dGPU has _DOD and
> _DDC, and there's no obvious way to associate connectors on the iGPU
> with connectors on the dGPU, I've implemented the ACPI _DDC EDID
> retrieval with the "first available" implementation you see here. I'm
> open to other ideas if you have them, but didn't see a good way to
> search for the "right" _DDC implementation should there be more than one.
>
>
> As for preventing the ACPI EDID retrieval from being used for external
> panels, I've done this in the individual DRM drivers that call into the
> new drm_edid_acpi() API since it seemed that each DRM driver had its own
> way of distinguishing display connector types. If there's a good way to
> filter for internal panels in DRM core, I'd be happy to do that instead.

I can double check with our ACPI and vbios teams, but I'm not sure
that we ever used the _DDC method on any AMD platforms.  Even if we
did, the driver is still able to get the integrated panel's mode info
via other means.  In general, external connectors would always get the
EDID via i2c or aux.  The only use case we ever had to hardcoding an
EDID was for some really old server chips so they would always think
something was connected if you wanted to attach a crash cart.  That
EDID was stored in the vbios, not ACPI.

As for enumerating which displays were muxed or not and the local
mappings to acpi ids, etc., we had a whole set of ATPX methods to do
that if anyone is interested:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/amd/include/amd_acpi.h

Alex
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH v4] drm/amdgpu: add new trace event for page table update v3

2020-08-12 Thread Shashank Sharma
This patch adds a new trace event to track the PTE update
events. This specific event will provide information like:
- start and end of virtual memory mapping
- HW engine flags for the map
- physical address for mapping

This will be particularly useful for memory profiling tools
(like RMV) which are monitoring the page table update events.

V2: Added physical address lookup logic in trace point
V3: switch to use __dynamic_array
added nptes in the TP_printk arguments list
added page size in the arg list
V4: Addressed Christian's review comments
add start/end instead of seg
use incr instead of page_sz to be accurate

Cc: Christian König 
Cc: Alex Deucher 
Signed-off-by: Christian König 
Signed-off-by: Shashank Sharma 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h | 37 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c|  9 --
 2 files changed, 44 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
index 63e734a125fb..df12cf8466c2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
@@ -321,6 +321,43 @@ DEFINE_EVENT(amdgpu_vm_mapping, amdgpu_vm_bo_cs,
TP_ARGS(mapping)
 );
 
+TRACE_EVENT(amdgpu_vm_update_ptes,
+   TP_PROTO(struct amdgpu_vm_update_params *p,
+uint64_t start, uint64_t end,
+unsigned int nptes, uint64_t dst,
+uint64_t incr, uint64_t flags),
+   TP_ARGS(p, start, end, nptes, dst, incr, flags),
+   TP_STRUCT__entry(
+__field(u64, start)
+__field(u64, end)
+__field(u64, flags)
+__field(unsigned int, nptes)
+__field(u64, incr)
+__dynamic_array(u64, dst, nptes)
+   ),
+
+   TP_fast_assign(
+   unsigned int i;
+
+   __entry->start = start;
+   __entry->end = end;
+   __entry->flags = flags;
+   __entry->incr = incr;
+   __entry->nptes = nptes;
+   for (i = 0; i < nptes; ++i) {
+   u64 addr = p->pages_addr ? amdgpu_vm_map_gart(
+   p->pages_addr, dst) : dst;
+
+   ((u64 *)__get_dynamic_array(dst))[i] = addr;
+   dst += incr;
+   }
+   ),
+   TP_printk("start:0x%010llx end:0x%010llx, flags:0x%llx, incr:%llu,"
+ " dst:\n%s", __entry->start, __entry->end, __entry->flags,
+ __entry->incr, __print_array(
+ __get_dynamic_array(dst), __entry->nptes, 8))
+);
+
 TRACE_EVENT(amdgpu_vm_set_ptes,
TP_PROTO(uint64_t pe, uint64_t addr, unsigned count,
 uint32_t incr, uint64_t flags, bool direct),
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 71e005cf2952..b5dbb5e8bc61 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -1513,17 +1513,22 @@ static int amdgpu_vm_update_ptes(struct 
amdgpu_vm_update_params *params,
do {
uint64_t upd_end = min(entry_end, frag_end);
unsigned nptes = (upd_end - frag_start) >> shift;
+   uint64_t upd_flags = flags | AMDGPU_PTE_FRAG(frag);
 
/* This can happen when we set higher level PDs to
 * silent to stop fault floods.
 */
nptes = max(nptes, 1u);
+
+   trace_amdgpu_vm_update_ptes(params, frag_start, upd_end,
+   nptes, dst, incr,
+   upd_flags);
amdgpu_vm_update_flags(params, pt, cursor.level,
   pe_start, dst, nptes, incr,
-  flags | AMDGPU_PTE_FRAG(frag));
+  upd_flags);
 
pe_start += nptes * 8;
-   dst += (uint64_t)nptes * AMDGPU_GPU_PAGE_SIZE << shift;
+   dst += nptes * incr;
 
frag_start = upd_end;
if (frag_start >= frag_end) {
-- 
2.25.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] drm/amd/powerplay: Fix uninitialized warning in arcturus ppt driver

2020-08-12 Thread Wang, Kevin(Yang)
[AMD Official Use Only - Internal Distribution Only]

Hi Tom,

drm/amdgpu: fix uninit-value in arcturus_log_thermal_throttling_event()

the fixed patch has been merged into drm-next branch.

Best Regards,
Kevin


From: amd-gfx  on behalf of Quan, Evan 

Sent: Thursday, August 13, 2020 10:07 AM
To: StDenis, Tom ; amd-gfx@lists.freedesktop.org 

Cc: StDenis, Tom 
Subject: RE: [PATCH] drm/amd/powerplay: Fix uninitialized warning in arcturus 
ppt driver

[AMD Official Use Only - Internal Distribution Only]


Your change below should be able to suppress the compile warning.
-arcturus_get_smu_metrics_data(smu,
+ret = arcturus_get_smu_metrics_data(smu,
   METRICS_THROTTLER_STATUS,
   &throttler_status);

+if (ret) {
+dev_err(adev->dev, "Could not read from arcturus_get_smu_metrics_data()\n");
+return;
+}
+
Setting *value to 0 may be unnecessary. However, this patch is 
Reviewed-by: Evan Quan 

BR
Evan
-Original Message-
From: amd-gfx  On Behalf Of Tom St Denis
Sent: Wednesday, August 12, 2020 8:21 PM
To: amd-gfx@lists.freedesktop.org
Cc: StDenis, Tom 
Subject: [PATCH] drm/amd/powerplay: Fix uninitialized warning in arcturus ppt 
driver

Fixes:

  CC [M]  drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm_mst_types.o
drivers/gpu/drm/amd/amdgpu/../powerplay/arcturus_ppt.c: In function 
‘arcturus_log_thermal_throttling_event’:
drivers/gpu/drm/amd/amdgpu/../powerplay/arcturus_ppt.c:2223:24: warning: 
‘throttler_status’ may be used uninitialized in this function 
[-Wmaybe-uninitialized]
 2223 |   if (throttler_status & logging_label[throttler_idx].feature_mask) {

by making arcturus_get_smu_metrics_data() assign a default value (of zero) 
before any possible return point, as well as simply erroring out of 
arcturus_log_thermal_throttling_event() if it fails.

Signed-off-by: Tom St Denis 
---
 drivers/gpu/drm/amd/powerplay/arcturus_ppt.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c 
b/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c
index 8b1025dc54fd..78f7ec95e4f5 100644
--- a/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c
+++ b/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c
@@ -551,6 +551,9 @@ static int arcturus_get_smu_metrics_data(struct smu_context 
*smu,

 mutex_lock(&smu->metrics_lock);

+// assign default value
+*value = 0;
+
 ret = smu_cmn_get_metrics_table_locked(smu,
NULL,
false);
@@ -2208,15 +2211,20 @@ static const struct throttling_logging_label {  };  
static void arcturus_log_thermal_throttling_event(struct smu_context *smu)  {
-int throttler_idx, throtting_events = 0, buf_idx = 0;
+int throttler_idx, throtting_events = 0, buf_idx = 0, ret;
 struct amdgpu_device *adev = smu->adev;
 uint32_t throttler_status;
 char log_buf[256];

-arcturus_get_smu_metrics_data(smu,
+ret = arcturus_get_smu_metrics_data(smu,
   METRICS_THROTTLER_STATUS,
   &throttler_status);

+if (ret) {
+dev_err(adev->dev, "Could not read from arcturus_get_smu_metrics_data()\n");
+return;
+}
+
 memset(log_buf, 0, sizeof(log_buf));
 for (throttler_idx = 0; throttler_idx < ARRAY_SIZE(logging_label);
  throttler_idx++) {
--
2.26.2

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [Nouveau] [PATCH 1/4] drm: retrieve EDID via ACPI _DDC method

2020-08-12 Thread Daniel Dadap
Thanks, Lukas. I've incorporated your feedback into my local tree, but 
will wait for additional feedback from the individual DRM driver 
maintainers before sending out a series v2.


On 8/8/20 5:11 PM, Lukas Wunner wrote:

On Mon, Jul 27, 2020 at 03:53:54PM -0500, Daniel Dadap wrote:

+ for (i = 0; i < num_dod_entries; i++) {
+ if (adr == dod_entries[i]) {
+ ret = do_acpi_ddc(child->handle);
+
+ if (ret != NULL)
+ goto done;

I guess ideally we'd want to correlate the display objects with
drm_connectors or at least constrain the search to Display Type
"Internal/Integrated Digital Flat Panel" instead of picking the
first EDID found.  Otherwise we might erroneously use the DDC
for an externally attached display.



Yes, we'd definitely need a way to do this if this functionality ever 
needs to be extended to systems with more than one _DDC method. 
Unfortunately, this will be much easier said than done, since I'm not 
aware of any way to reliably map _DOD entries to connectors in a GPU 
driver, especially when we're talking about possibly correlating 
connectors on multiple GPUs which mux to the same internal display or 
external connector. All systems which I am aware of that implement ACPI 
_DDC do so for a single internal panel. I don't believe there's any 
reason to ever retrieve an EDID via ACPI _DDC for an external panel, but 
a hypothetical design with multiple internal panels, more than one of 
which needs to retrieve an EDID via ACPI _DDC, would certainly be 
problematic.



On at least the system I'm working with for the various switcheroo and 
platform-x86 driver patches I've recently sent off, the dGPU has an ACPI 
_DOD table and one _DDC method corresponding to one of the _DOD entries, 
but the iGPU has neither a _DOD table nor a _DDC method. Either GPU can 
be connected to the internal panel via the dynamically switchable mux, 
and the internal panel's EDID is available via _DDC to allow a 
disconnected GPU to read the EDID. Since only the dGPU has _DOD and 
_DDC, and there's no obvious way to associate connectors on the iGPU 
with connectors on the dGPU, I've implemented the ACPI _DDC EDID 
retrieval with the "first available" implementation you see here. I'm 
open to other ideas if you have them, but didn't see a good way to 
search for the "right" _DDC implementation should there be more than one.



As for preventing the ACPI EDID retrieval from being used for external 
panels, I've done this in the individual DRM drivers that call into the 
new drm_edid_acpi() API since it seemed that each DRM driver had its own 
way of distinguishing display connector types. If there's a good way to 
filter for internal panels in DRM core, I'd be happy to do that instead.
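
A rough sketch of that kind of filtering, assuming the drm_edid_acpi() helper
proposed in this series takes a connector and returns an EDID (the helper name
and signature come from this series, not from an existing DRM core API):

  #include <drm/drm_connector.h>
  #include <drm/drm_edid.h>

  /* Sketch only: fall back to ACPI _DDC just for internal panels, so an
   * external connector never picks up the internal panel's EDID.
   */
  static struct edid *internal_panel_acpi_edid(struct drm_connector *connector)
  {
          switch (connector->connector_type) {
          case DRM_MODE_CONNECTOR_LVDS:
          case DRM_MODE_CONNECTOR_eDP:
          case DRM_MODE_CONNECTOR_DSI:
                  return drm_edid_acpi(connector);  /* proposed helper */
          default:
                  return NULL;  /* external: keep using i2c/aux */
          }
  }

The eDP/LVDS/DSI check is just one way of expressing "internal"; individual
drivers may have a better notion of which connector is the panel.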


___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [PATCH] drm/amd/powerplay: Fix uninitialized warning in arcturus ppt driver

2020-08-12 Thread Quan, Evan
[AMD Official Use Only - Internal Distribution Only]

Your change below should be able to suppress the compile warning.
-arcturus_get_smu_metrics_data(smu,
+ret = arcturus_get_smu_metrics_data(smu,
   METRICS_THROTTLER_STATUS,
   &throttler_status);

+if (ret) {
+dev_err(adev->dev, "Could not read from arcturus_get_smu_metrics_data()\n");
+return;
+}
+
Setting *value to 0 may be unnecessary. However, this patch is 
Reviewed-by: Evan Quan 

BR
Evan
-Original Message-
From: amd-gfx  On Behalf Of Tom St Denis
Sent: Wednesday, August 12, 2020 8:21 PM
To: amd-gfx@lists.freedesktop.org
Cc: StDenis, Tom 
Subject: [PATCH] drm/amd/powerplay: Fix uninitialized warning in arcturus ppt 
driver

Fixes:

  CC [M]  drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm_mst_types.o
drivers/gpu/drm/amd/amdgpu/../powerplay/arcturus_ppt.c: In function 
‘arcturus_log_thermal_throttling_event’:
drivers/gpu/drm/amd/amdgpu/../powerplay/arcturus_ppt.c:2223:24: warning: 
‘throttler_status’ may be used uninitialized in this function 
[-Wmaybe-uninitialized]
 2223 |   if (throttler_status & logging_label[throttler_idx].feature_mask) {

by making arcturus_get_smu_metrics_data() assign a default value (of zero) 
before any possible return point, as well as simply erroring out of 
arcturus_log_thermal_throttling_event() if it fails.

Signed-off-by: Tom St Denis 
---
 drivers/gpu/drm/amd/powerplay/arcturus_ppt.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c 
b/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c
index 8b1025dc54fd..78f7ec95e4f5 100644
--- a/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c
+++ b/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c
@@ -551,6 +551,9 @@ static int arcturus_get_smu_metrics_data(struct smu_context 
*smu,

 mutex_lock(&smu->metrics_lock);

+// assign default value
+*value = 0;
+
 ret = smu_cmn_get_metrics_table_locked(smu,
NULL,
false);
@@ -2208,15 +2211,20 @@ static const struct throttling_logging_label {  };  
static void arcturus_log_thermal_throttling_event(struct smu_context *smu)  {
-int throttler_idx, throtting_events = 0, buf_idx = 0;
+int throttler_idx, throtting_events = 0, buf_idx = 0, ret;
 struct amdgpu_device *adev = smu->adev;
 uint32_t throttler_status;
 char log_buf[256];

-arcturus_get_smu_metrics_data(smu,
+ret = arcturus_get_smu_metrics_data(smu,
   METRICS_THROTTLER_STATUS,
   &throttler_status);

+if (ret) {
+dev_err(adev->dev, "Could not read from arcturus_get_smu_metrics_data()\n");
+return;
+}
+
 memset(log_buf, 0, sizeof(log_buf));
 for (throttler_idx = 0; throttler_idx < ARRAY_SIZE(logging_label);
  throttler_idx++) {
--
2.26.2

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] drm/amdgpu: load ta firmware for navy_flounder

2020-08-12 Thread Deucher, Alexander
[AMD Public Use]

Reviewed-by: Alex Deucher 

From: Bhawanpreet Lakha 
Sent: Wednesday, August 12, 2020 4:17 PM
To: Kazlauskas, Nicholas ; Deucher, Alexander 
; amd-gfx@lists.freedesktop.org 

Cc: Clements, John ; Lakha, Bhawanpreet 

Subject: [PATCH] drm/amdgpu: load ta firmware for navy_flounder

Call psp_init_ta_microcode() to parse the TA firmware.

Signed-off-by: Bhawanpreet Lakha 
---
 drivers/gpu/drm/amd/amdgpu/psp_v11_0.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c 
b/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
index d488d250805d..6c5d9612abcb 100644
--- a/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
@@ -58,7 +58,7 @@ MODULE_FIRMWARE("amdgpu/arcturus_ta.bin");
 MODULE_FIRMWARE("amdgpu/sienna_cichlid_sos.bin");
 MODULE_FIRMWARE("amdgpu/sienna_cichlid_ta.bin");
 MODULE_FIRMWARE("amdgpu/navy_flounder_sos.bin");
-MODULE_FIRMWARE("amdgpu/navy_flounder_asd.bin");
+MODULE_FIRMWARE("amdgpu/navy_flounder_ta.bin");

 /* address block */
 #define smnMP1_FIRMWARE_FLAGS   0x3010024
@@ -179,12 +179,11 @@ static int psp_v11_0_init_microcode(struct psp_context 
*psp)
 }
 break;
 case CHIP_SIENNA_CICHLID:
+   case CHIP_NAVY_FLOUNDER:
 err = psp_init_ta_microcode(&adev->psp, chip_name);
 if (err)
 return err;
 break;
-   case CHIP_NAVY_FLOUNDER:
-   break;
 default:
 BUG();
 }
--
2.17.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH] drm/amdgpu: load ta firmware for navy_flounder

2020-08-12 Thread Bhawanpreet Lakha
Call psp_init_ta_microcode() to parse the TA firmware.

Signed-off-by: Bhawanpreet Lakha 
---
 drivers/gpu/drm/amd/amdgpu/psp_v11_0.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c 
b/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
index d488d250805d..6c5d9612abcb 100644
--- a/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/psp_v11_0.c
@@ -58,7 +58,7 @@ MODULE_FIRMWARE("amdgpu/arcturus_ta.bin");
 MODULE_FIRMWARE("amdgpu/sienna_cichlid_sos.bin");
 MODULE_FIRMWARE("amdgpu/sienna_cichlid_ta.bin");
 MODULE_FIRMWARE("amdgpu/navy_flounder_sos.bin");
-MODULE_FIRMWARE("amdgpu/navy_flounder_asd.bin");
+MODULE_FIRMWARE("amdgpu/navy_flounder_ta.bin");
 
 /* address block */
 #define smnMP1_FIRMWARE_FLAGS  0x3010024
@@ -179,12 +179,11 @@ static int psp_v11_0_init_microcode(struct psp_context 
*psp)
}
break;
case CHIP_SIENNA_CICHLID:
+   case CHIP_NAVY_FLOUNDER:
err = psp_init_ta_microcode(&adev->psp, chip_name);
if (err)
return err;
break;
-   case CHIP_NAVY_FLOUNDER:
-   break;
default:
BUG();
}
-- 
2.17.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] drm/amdgpu: revert "fix system hang issue during GPU reset"

2020-08-12 Thread Alex Deucher
On Wed, Aug 12, 2020 at 11:54 AM Christian König
 wrote:
>
> The whole approach wasn't thought through till the end.
>
> We already had a reset lock like this in the past and it caused the same 
> problems as this one.
>
> Completely revert the patch for now and add individual trylock protection to 
> the hardware access functions as necessary.
>
> This reverts commit edad8312cbbf9a33c86873fc4093664f150dd5c1.
>
> Signed-off-by: Christian König 

This also broke GPU overclocking.

Acked-by: Alex Deucher 

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h   |   9 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c|  40 +-
>  .../drm/amd/amdgpu/amdgpu_amdkfd_gfx_v10.c|   2 +-
>  .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v7.c |   2 +-
>  .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v8.c |   2 +-
>  .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c |   2 +-
>  .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |   7 -
>  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c|   4 -
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c   |   4 -
>  drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c   |  14 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c|  57 ++-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c   |   4 -
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c   |   6 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c   |  14 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c   |   4 -
>  drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c| 353 --
>  drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c   |   4 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c   |   4 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c   |   2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c |   3 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c  |   2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h  |   4 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c  |  11 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h  |   3 +-
>  drivers/gpu/drm/amd/amdgpu/atom.c |   1 -
>  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c|  10 +-
>  drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c |   6 +-
>  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c |  10 +-
>  drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c|   4 +-
>  drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c |   2 +-
>  drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c |   2 +-
>  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c |   7 +-
>  drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c |  13 +-
>  drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c |  13 +-
>  .../drm/amd/amdkfd/kfd_device_queue_manager.c |  16 +-
>  drivers/gpu/drm/amd/amdkfd/kfd_process.c  |   4 -
>  .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c |   4 +-
>  drivers/gpu/drm/amd/powerplay/amdgpu_smu.c|   2 +-
>  .../drm/amd/powerplay/hwmgr/vega20_hwmgr.c|   2 +-
>  39 files changed, 184 insertions(+), 469 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 1f9d97f61aa5..9c6fb38ce59d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -952,9 +952,9 @@ struct amdgpu_device {
> boolin_suspend;
> boolin_hibernate;
>
> -   atomic_tin_gpu_reset;
> +   boolin_gpu_reset;
> enum pp_mp1_state   mp1_state;
> -   struct rw_semaphore reset_sem;
> +   struct mutex  lock_reset;
> struct amdgpu_doorbell_index doorbell_index;
>
> struct mutexnotifier_lock;
> @@ -1269,9 +1269,4 @@ static inline bool amdgpu_is_tmz(struct amdgpu_device 
> *adev)
> return adev->gmc.tmz_enabled;
>  }
>
> -static inline bool amdgpu_in_reset(struct amdgpu_device *adev)
> -{
> -   return atomic_read(&adev->in_gpu_reset) ? true : false;
> -}
> -
>  #endif
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> index 9738dccb1c2c..0effc1d46824 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> @@ -244,14 +244,11 @@ int amdgpu_amdkfd_alloc_gtt_mem(struct kgd_dev *kgd, 
> size_t size,
> if (cp_mqd_gfx9)
> bp.flags |= AMDGPU_GEM_CREATE_CP_MQD_GFX9;
>
> -   if (!down_read_trylock(&adev->reset_sem))
> -   return -EIO;
> -
> r = amdgpu_bo_create(adev, &bp, &bo);
> if (r) {
> dev_err(adev->dev,
> "failed to allocate BO for amdkfd (%d)\n", r);
> -   goto err;
> +   return r;
> }
>
> /* map the buffer */
> @@ -286,7 +283,6 @@ int amdgpu_amdkfd_alloc_gtt_mem(struct kgd_dev *kgd, 
> size_t size,
>
> amdgpu_bo_unreserve(bo);
>
> -   up_read(&adev->reset_sem);
> return 0;
>
>  allocate_mem_kmap_bo_failed:
> @@ -295,25 +291,19 @@ int amdgpu_amdkfd_alloc_gtt_mem(struct kgd_dev *kgd, 
> size_t size,
> 

[PATCH] drm/amdgpu: revert "fix system hang issue during GPU reset"

2020-08-12 Thread Christian König
The whole approach wasn't thought through till the end.

We already had a reset lock like this in the past and it caused the same 
problems as this one.

Completely revert the patch for now and add individual trylock protection to 
the hardware access functions as necessary.

This reverts commit edad8312cbbf9a33c86873fc4093664f150dd5c1.

Signed-off-by: Christian König 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h   |   9 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c|  40 +-
 .../drm/amd/amdgpu/amdgpu_amdkfd_gfx_v10.c|   2 +-
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v7.c |   2 +-
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v8.c |   2 +-
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c |   2 +-
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |   7 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c|   4 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c   |   4 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c   |  14 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c|  57 ++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c   |   4 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c   |   6 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c   |  14 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c   |   4 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c| 353 --
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c   |   4 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c   |   4 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c   |   2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c |   3 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c  |   2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h  |   4 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c  |  11 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h  |   3 +-
 drivers/gpu/drm/amd/amdgpu/atom.c |   1 -
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c|  10 +-
 drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c |   6 +-
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c |  10 +-
 drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c|   4 +-
 drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c |   2 +-
 drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c |   2 +-
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c |   7 +-
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c |  13 +-
 drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c |  13 +-
 .../drm/amd/amdkfd/kfd_device_queue_manager.c |  16 +-
 drivers/gpu/drm/amd/amdkfd/kfd_process.c  |   4 -
 .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c |   4 +-
 drivers/gpu/drm/amd/powerplay/amdgpu_smu.c|   2 +-
 .../drm/amd/powerplay/hwmgr/vega20_hwmgr.c|   2 +-
 39 files changed, 184 insertions(+), 469 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 1f9d97f61aa5..9c6fb38ce59d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -952,9 +952,9 @@ struct amdgpu_device {
boolin_suspend;
boolin_hibernate;
 
-   atomic_tin_gpu_reset;
+   boolin_gpu_reset;
enum pp_mp1_state   mp1_state;
-   struct rw_semaphore reset_sem;
+   struct mutex  lock_reset;
struct amdgpu_doorbell_index doorbell_index;
 
struct mutexnotifier_lock;
@@ -1269,9 +1269,4 @@ static inline bool amdgpu_is_tmz(struct amdgpu_device 
*adev)
return adev->gmc.tmz_enabled;
 }
 
-static inline bool amdgpu_in_reset(struct amdgpu_device *adev)
-{
-   return atomic_read(&adev->in_gpu_reset) ? true : false;
-}
-
 #endif
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index 9738dccb1c2c..0effc1d46824 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -244,14 +244,11 @@ int amdgpu_amdkfd_alloc_gtt_mem(struct kgd_dev *kgd, 
size_t size,
if (cp_mqd_gfx9)
bp.flags |= AMDGPU_GEM_CREATE_CP_MQD_GFX9;
 
-   if (!down_read_trylock(&adev->reset_sem))
-   return -EIO;
-
r = amdgpu_bo_create(adev, &bp, &bo);
if (r) {
dev_err(adev->dev,
"failed to allocate BO for amdkfd (%d)\n", r);
-   goto err;
+   return r;
}
 
/* map the buffer */
@@ -286,7 +283,6 @@ int amdgpu_amdkfd_alloc_gtt_mem(struct kgd_dev *kgd, size_t 
size,
 
amdgpu_bo_unreserve(bo);
 
-   up_read(&adev->reset_sem);
return 0;
 
 allocate_mem_kmap_bo_failed:
@@ -295,25 +291,19 @@ int amdgpu_amdkfd_alloc_gtt_mem(struct kgd_dev *kgd, 
size_t size,
amdgpu_bo_unreserve(bo);
 allocate_mem_reserve_bo_failed:
amdgpu_bo_unref(&bo);
-err:
-   up_read(&adev->reset_sem);
+
return r;
 }
 
 void amdgpu_amdkfd_free_gtt_mem(struct kgd_dev *kgd, void *mem_obj)
 {
-   struct amdgpu_device *adev = (struct amdgpu_device *)kgd;
struct amdgpu_bo *bo = (struct amdgpu_bo *) mem_obj;
 
-

Re: [PATCH] drm/amdgpu: adjust the pid in the grab_id trace point

2020-08-12 Thread Christian König

On 12.08.20 at 17:19, Steven Rostedt wrote:

On Wed, 12 Aug 2020 16:36:36 +0200
Christian König  wrote:


On 12.08.20 at 16:17, Steven Rostedt wrote:

On Fri, Aug 07, 2020 at 03:36:58PM +0200, Christian König wrote:

Trace something useful instead of the pid of a kernel thread here.

Signed-off-by: Christian König 
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h | 1 +
   1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
index 5da20fc166d9..07f99ef69d91 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
@@ -228,6 +228,7 @@ TRACE_EVENT(amdgpu_vm_grab_id,
 ),
   
   	TP_fast_assign(

+  __entry->ent.pid = vm->task_info.pid;

If the ent.pid is not the pid you are interested in for this trace event, just
add a "pid" field to the trace event and place it there. Do not modify the
generic pid that is recorded, as we would like that to be consistent for all
trace events.

The problem my userspace guys have is that this doesn't work with
"trace-cmd -P $pid".

But I think I can teach them how filters work :)

Yep, trace-cmd record -e event -f "pid == $pid"


The "ent.pid" turns into "common_pid" in the field, leaving "pid" free to use.
Other trace events (like sched_waking) record a pid field that is not the same
as the pid of the executing task.

Yes, we thought about this alternative as well.


The "ent.pid" should always be the pid of the task that executed the event.

Why? For the case here we just execute a work item in the background for
a userspace process.

Tracing the pid of the worker pool which executes it doesn't seem to
make too much sense.

Maybe not for you, but it does for me. All trace events show what
happened when it happened and who executed it. I like to see what
worker threads are executing. I may filter on the worker thread, and by
changing the ent.pid, I won't see what it is doing.


That's enough explanation for me. Going with the separate pid field then.

Thanks,
Christian.



That said, I think I may add a feature to a trace event for a special filter
to say, "test the field against set_event_pid", and if it exists in that
file to include that event in the filtered trace. This would include
sched_waking trace events as well.

That way "trace-cmd record -P $pid" will still work for your case.

-- Steve


___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] drm/amdgpu: fix a potential circular locking dependency

2020-08-12 Thread Christian König

On 12.08.20 at 17:07, Felix Kuehling wrote:

On 2020-08-12 at 4:53 a.m., Christian König wrote:

On 12.08.20 at 03:19, Li, Dennis wrote:

[AMD Official Use Only - Internal Distribution Only]

Hi, Felix,

Re: It may be better to fix it the other way around in
amdgpu_amdkfd_alloc_gtt_mem. Always take the reset_sem inside the
reservation. Otherwise you will never be able to take the reset_sem
while any BOs are reserved. That's probably going to cause you other
problems later.
[Dennis Li] Thanks for finding the potential issue, I will change
it in version 2.

Re: That makes me wonder, why do you need the reset_sem in
amdgpu_amdkfd_alloc_gtt_mem in the first place? There is no obvious
hardware access in that function. Is it for amdgpu_ttm_alloc_gart
updating the GART table through HDP?
[Dennis Li] Yes, amdgpu_gart_bind will flush HDP and TLB. I had
considered protecting only amdgpu_ttm_alloc_gart before.

That access is irrelevant and the lock should be removed or changed
into a trylock.

See we need the HDP flush only because the hardware could have
accessed the data before.

But after a GPU reset the HDP is known to be clean, so this doesn't
need any protection.
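
A minimal sketch of that trylock variant, assuming the reset_sem introduced by
the patch under discussion and the existing amdgpu_asic_flush_hdp() helper; if
the trylock fails a reset is in flight and the HDP is already clean, so the
flush can simply be skipped:

  /* Sketch: guard only the HDP flush; skip it while a reset holds the
   * semaphore, since the HDP is known to be clean after a GPU reset.
   */
  if (down_read_trylock(&adev->reset_sem)) {
          amdgpu_asic_flush_hdp(adev, NULL);
          up_read(&adev->reset_sem);
  }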


   But I worry other functions will access hardware in the future.
Therefore I selected an aggressive approach which locks reset_sem at the
beginning of the entry functions of the amdgpu driver.

This is not a good idea. We used to have such a global lock before and
removed it because it caused all kinds of problems.

In most cases it's a global read-lock, so most of the time it should not
impact concurrency. The only "writer" that blocks all readers is the GPU
reset.



When was this added? Looks like it slipped under my radar or I wasn't
awake enough at that moment.

The change that added the reset_sem added read-locks in about 70 places.
I'm still concerned that this will be hard to maintain, to make sure
future HW access will do the necessary locking. I guess that's the
motivation for doing the locking more coarse-grained. Taking the lock
higher up the call chains means fewer places that take the lock and new
low-level code may not need its own locking in the future because it's
already covered by higher-level callers.

On the other hand, it needs to be low enough in the call chains to avoid
recursive locking or circular dependencies with other locks.


Well the whole approach is a NAK. We already tried this and it didn't 
work at all.


See how much effort it took to remove the global reset rwsem again in the 
past.


We have only minimal functions accessing the hardware directly which can 
run concurrently with a GPU reset.


Everything else works through ring buffers where the processing is 
stopped before we reset the GPU.


So the whole thing is just a bad idea as far as I can see.

Regards,
Christian.



Regards,
   Felix



Christian.


Best Regards
Dennis Li
-Original Message-
From: Kuehling, Felix 
Sent: Tuesday, August 11, 2020 9:57 PM
To: Li, Dennis ; amd-gfx@lists.freedesktop.org;
Deucher, Alexander ; Zhang, Hawking
; Koenig, Christian 
Subject: Re: [PATCH] drm/amdgpu: fix a potential circular locking
dependency

On 2020-08-11 at 5:32 a.m., Dennis Li wrote:

[  653.902305] ==
[  653.902928] WARNING: possible circular locking dependency detected
[  653.903517] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted:
G   OE
[  653.904098] --
[  653.904675] amdgpu_test/3975 is trying to acquire lock:
[  653.905241] 97848f8647a0 (&adev->reset_sem){.+.+}, at:
amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu] [  653.905953]
     but task is already holding lock:
[  653.907087] 9744adbee1f8 (reservation_ww_class_mutex){+.+.},
at: ttm_eu_reserve_buffers+0x1ae/0x520 [ttm] [  653.907694]
     which lock already depends on the new lock.

[  653.909423]
     the existing dependency chain (in reverse order) is:
[  653.910594]
     -> #1 (reservation_ww_class_mutex){+.+.}:
[  653.911759]    __ww_mutex_lock.constprop.15+0xca/0x1120
[  653.912350]    ww_mutex_lock+0x73/0x80
[  653.913044]    amdgpu_amdkfd_alloc_gtt_mem+0xde/0x380 [amdgpu]
[  653.913724]    kgd2kfd_device_init+0x13f/0x5e0 [amdgpu]
[  653.914388]    amdgpu_amdkfd_device_init+0x155/0x190 [amdgpu]
[  653.915033]    amdgpu_device_init+0x1303/0x1e10 [amdgpu]
[  653.915685]    amdgpu_driver_load_kms+0x5c/0x2c0 [amdgpu]
[  653.916349]    amdgpu_pci_probe+0x11d/0x200 [amdgpu]
[  653.916959]    local_pci_probe+0x47/0xa0
[  653.917570]    work_for_cpu_fn+0x1a/0x30
[  653.918184]    process_one_work+0x29e/0x630
[  653.918803]    worker_thread+0x22b/0x3f0
[  653.919427]    kthread+0x12f/0x150
[  653.920047]    ret_from_fork+0x3a/0x50
[  653.920661]
     -> #0 (&adev->reset_sem){.+.+}:
[  653.921893]    __lock_acquire+0x13ec/0x16e0
[  653.922531]    lock_acquire+0xb8/0x1c0
[  

Re: [PATCH] drm/amdgpu: adjust the pid in the grab_id trace point

2020-08-12 Thread Steven Rostedt
On Wed, 12 Aug 2020 16:36:36 +0200
Christian König  wrote:

> Am 12.08.20 um 16:17 schrieb Steven Rostedt:
> > On Fri, Aug 07, 2020 at 03:36:58PM +0200, Christian König wrote:  
> >> Trace something useful instead of the pid of a kernel thread here.
> >>
> >> Signed-off-by: Christian König 
> >> ---
> >>   drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h | 1 +
> >>   1 file changed, 1 insertion(+)
> >>
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h 
> >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
> >> index 5da20fc166d9..07f99ef69d91 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
> >> @@ -228,6 +228,7 @@ TRACE_EVENT(amdgpu_vm_grab_id,
> >> ),
> >>   
> >>TP_fast_assign(
> >> + __entry->ent.pid = vm->task_info.pid;  
> > If the ent.pid is not the pid you are interested in for this trace event, 
> > just
> > add a "pid" field to the trace event and place it there. Do not modify the
> > generic pid that is recorded, as we would like that to be consistent for all
> > trace events.  
> 
> The problem my userspace guys have is that this doesn't work with 
> "trace-cmd -P $pid".
> 
> But I think I can teach them how filters work :)

Yep, trace-cmd record -e event -f "pid == $pid"

> 
> > The "ent.pid" turns into "common_pid" in the field, leaving "pid" free to 
> > use.
> > Other trace events (like sched_waking) record a pid field that is not the 
> > same
> > as the pid of the executing task.  
> 
> Yes, we thought about this alternative as well.
> 
> > The "ent.pid" should always be the pid of the task that executed the event. 
> >  
> 
> Why? For the case here we just execute a work item in the background for 
> a userspace process.
> 
> Tracing the pid of the worker pool which executes it doesn't seem to 
> make too much sense.

Maybe not for you, but it does for me. All trace events show what
happened when it happened and who executed it. I like to see what
worker threads are executing. I may filter on the worker thread, and by
changing the ent.pid, I won't see what it is doing.

That said, I think I may add a feature to a trace event for a special filter
to say, "test the field against set_event_pid", and if it exists in that
file to include that event in the filtered trace. This would include
sched_waking trace events as well.

That way "trace-cmd record -P $pid" will still work for your case.

-- Steve
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] drm/amd/powerplay: Fix uninitialized warning in arcturus ppt driver

2020-08-12 Thread Nirmoy



On 8/12/20 2:43 PM, StDenis, Tom wrote:

[AMD Official Use Only - Internal Distribution Only]

Possibly, but since arcturus_get_smu_metrics_data() can error out we should 
check that return value, no?



Yes, we need that return check.



(also setting *value to 0 avoids this bug in the future...).



I think the caller should initialize the result value before passing it to 
arcturus_get_smu_metrics_data(), as the warning is generated from the calling 
function. My comment only concerns setting "value" to 0 inside the helper, 
which seems wrong. The rest of the patch is fine.


Nirmoy




Tom


From: Das, Nirmoy 
Sent: Wednesday, August 12, 2020 08:40
To: StDenis, Tom; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amd/powerplay: Fix uninitialized warning in arcturus 
ppt driver


On 8/12/20 2:20 PM, Tom St Denis wrote:

Fixes:

CC [M]  
drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm_mst_types.o
drivers/gpu/drm/amd/amdgpu/../powerplay/arcturus_ppt.c: In function 
‘arcturus_log_thermal_throttling_event’:
drivers/gpu/drm/amd/amdgpu/../powerplay/arcturus_ppt.c:2223:24: warning: 
‘throttler_status’ may be used uninitialized in this function 
[-Wmaybe-uninitialized]
   2223 |   if (throttler_status & logging_label[throttler_idx].feature_mask) {

by making arcturus_get_smu_metrics_data() assign a default value
(of zero) before any possible return point, as well as simply erroring
out of arcturus_log_thermal_throttling_event() if it fails.

Signed-off-by: Tom St Denis 
---
   drivers/gpu/drm/amd/powerplay/arcturus_ppt.c | 12 ++--
   1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c 
b/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c
index 8b1025dc54fd..78f7ec95e4f5 100644
--- a/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c
+++ b/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c
@@ -551,6 +551,9 @@ static int arcturus_get_smu_metrics_data(struct smu_context 
*smu,

   mutex_lock(&smu->metrics_lock);

+ // assign default value
+ *value = 0;
+
   ret = smu_cmn_get_metrics_table_locked(smu,
  NULL,
  false);
@@ -2208,15 +2211,20 @@ static const struct throttling_logging_label {
   };
   static void arcturus_log_thermal_throttling_event(struct smu_context *smu)
   {
- int throttler_idx, throtting_events = 0, buf_idx = 0;
+ int throttler_idx, throtting_events = 0, buf_idx = 0, ret;
   struct amdgpu_device *adev = smu->adev;
   uint32_t throttler_status;


I think initializing throttler_status here should resolve the warning.
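
As a sketch, that alternative is a one-line change to the declaration in
arcturus_log_thermal_throttling_event():

  uint32_t throttler_status = 0;  /* quiets -Wmaybe-uninitialized */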


   char log_buf[256];

- arcturus_get_smu_metrics_data(smu,
+ ret = arcturus_get_smu_metrics_data(smu,
 METRICS_THROTTLER_STATUS,
 &throttler_status);

+ if (ret) {
+ dev_err(adev->dev, "Could not read from 
arcturus_get_smu_metrics_data()\n");
+ return;
+ }
+
   memset(log_buf, 0, sizeof(log_buf));
   for (throttler_idx = 0; throttler_idx < ARRAY_SIZE(logging_label);
throttler_idx++) {

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] drm/amdgpu: fix a potential circular locking dependency

2020-08-12 Thread Felix Kuehling
On 2020-08-12 at 4:53 a.m., Christian König wrote:
> On 12.08.20 at 03:19, Li, Dennis wrote:
>> [AMD Official Use Only - Internal Distribution Only]
>>
>> Hi, Felix,
>>
>> Re: It may be better to fix it the other way around in
>> amdgpu_amdkfd_alloc_gtt_mem. Always take the reset_sem inside the
>> reservation. Otherwise you will never be able to take the reset_sem
>> while any BOs are reserved. That's probably going to cause you other
>> problems later.
>> [Dennis Li] Thanks for finding the potential issue, I will change
>> it in version 2.
>>
>> Re: That makes me wonder, why do you need the reset_sem in
>> amdgpu_amdkfd_alloc_gtt_mem in the first place? There is no obvious
>> hardware access in that function. Is it for amdgpu_ttm_alloc_gart
>> updating the GART table through HDP?
>> [Dennis Li] Yes, amdgpu_gart_bind will flush HDP and TLB. I had
>> considered protecting only amdgpu_ttm_alloc_gart before.
>
> That access is irrelevant and the lock should be removed or changed
> into a trylock.
>
> See we need the HDP flush only because the hardware could have
> accessed the data before.
>
> But after a GPU reset the HDP is known to be clean, so this doesn't
> need any protection.
>
>>   But I worry other functions will access hardware in the future.
>> Therefore I selected an aggressive approach which locks reset_sem at the
>> beginning of the entry functions of the amdgpu driver.
>
> This is not a good idea. We used to have such a global lock before and
> removed it because it caused all kinds of problems.

In most cases it's a global read-lock, so most of the time it should not
impact concurrency. The only "writer" that blocks all readers is the GPU
reset.


>
> When was this added? Looks like it slipped under my radar or I wasn't
> awake enough at that moment.

The change that added the reset_sem added read-locks in about 70 places.
I'm still concerned that this will be hard to maintain, to make sure
future HW access will do the necessary locking. I guess that's the
motivation for doing the locking more coarse-grained. Taking the lock
higher up the call chains means fewer places that take the lock and new
low-level code may not need its own locking in the future because it's
already covered by higher-level callers.

On the other hand, it needs to be low enough in the call chains to avoid
recursive locking or circular dependencies with other locks.

Regards,
  Felix


>
> Christian.
>
>>
>> Best Regards
>> Dennis Li
>> -Original Message-
>> From: Kuehling, Felix 
>> Sent: Tuesday, August 11, 2020 9:57 PM
>> To: Li, Dennis ; amd-gfx@lists.freedesktop.org;
>> Deucher, Alexander ; Zhang, Hawking
>> ; Koenig, Christian 
>> Subject: Re: [PATCH] drm/amdgpu: fix a potential circular locking
>> dependency
>>
>> On 2020-08-11 at 5:32 a.m., Dennis Li wrote:
>>> [  653.902305] ==
>>> [  653.902928] WARNING: possible circular locking dependency detected
>>> [  653.903517] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted:
>>> G   OE
>>> [  653.904098] --
>>> [  653.904675] amdgpu_test/3975 is trying to acquire lock:
>>> [  653.905241] 97848f8647a0 (&adev->reset_sem){.+.+}, at:
>>> amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu] [  653.905953]
>>>     but task is already holding lock:
>>> [  653.907087] 9744adbee1f8 (reservation_ww_class_mutex){+.+.},
>>> at: ttm_eu_reserve_buffers+0x1ae/0x520 [ttm] [  653.907694]
>>>     which lock already depends on the new lock.
>>>
>>> [  653.909423]
>>>     the existing dependency chain (in reverse order) is:
>>> [  653.910594]
>>>     -> #1 (reservation_ww_class_mutex){+.+.}:
>>> [  653.911759]    __ww_mutex_lock.constprop.15+0xca/0x1120
>>> [  653.912350]    ww_mutex_lock+0x73/0x80
>>> [  653.913044]    amdgpu_amdkfd_alloc_gtt_mem+0xde/0x380 [amdgpu]
>>> [  653.913724]    kgd2kfd_device_init+0x13f/0x5e0 [amdgpu]
>>> [  653.914388]    amdgpu_amdkfd_device_init+0x155/0x190 [amdgpu]
>>> [  653.915033]    amdgpu_device_init+0x1303/0x1e10 [amdgpu]
>>> [  653.915685]    amdgpu_driver_load_kms+0x5c/0x2c0 [amdgpu]
>>> [  653.916349]    amdgpu_pci_probe+0x11d/0x200 [amdgpu]
>>> [  653.916959]    local_pci_probe+0x47/0xa0
>>> [  653.917570]    work_for_cpu_fn+0x1a/0x30
>>> [  653.918184]    process_one_work+0x29e/0x630
>>> [  653.918803]    worker_thread+0x22b/0x3f0
>>> [  653.919427]    kthread+0x12f/0x150
>>> [  653.920047]    ret_from_fork+0x3a/0x50
>>> [  653.920661]
>>>     -> #0 (&adev->reset_sem){.+.+}:
>>> [  653.921893]    __lock_acquire+0x13ec/0x16e0
>>> [  653.922531]    lock_acquire+0xb8/0x1c0
>>> [  653.923174]    down_read+0x48/0x230
>>> [  653.923886]    amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu]
>>> [  653.924588]    drm_ioctl_kernel+0xb6/0x100 [drm]
>>> [  653.925283]    drm_ioctl+0x389/0x450 [drm]
>>> [  653.926013]    

Re: [RFC PATCH 1/1] drm/amdgpu: add initial support for pci error handler

2020-08-12 Thread Andrey Grodzovsky



On 8/11/20 9:30 AM, Nirmoy Das wrote:

This patch will ignore non-fatal errors and try to
stop amdgpu's sw stack on fatal errors.

Signed-off-by: Nirmoy Das 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 56 -
  1 file changed, 54 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index c1219af2e7d6..2b9ede3000ee 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -35,6 +35,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  
  #include "amdgpu.h"

@@ -1516,6 +1517,58 @@ static struct drm_driver kms_driver = {
.patchlevel = KMS_DRIVER_PATCHLEVEL,
  };
  
+static pci_ers_result_t amdgpu_pci_err_detected(struct pci_dev *pdev,

+   pci_channel_state_t state)
+{
+   struct drm_device *dev = pci_get_drvdata(pdev);
+   struct amdgpu_device *adev = dev->dev_private;
+   int i;
+   int ret = PCI_ERS_RESULT_DISCONNECT;
+
+   switch (state) {
+   case pci_channel_io_normal:
+   ret = PCI_ERS_RESULT_CAN_RECOVER;
+   break;
+   default:
+   /* Disable power management */
+   adev->runpm = 0;
+   /* Suspend all IO operations */
+   amdgpu_fbdev_set_suspend(adev, 1);
+   cancel_delayed_work_sync(&adev->delayed_init_work);
+   for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+   struct amdgpu_ring *ring = adev->rings[i];
+
+   if (!ring || !ring->sched.thread)
+   continue;
+
+   amdgpu_job_stop_all_jobs_on_sched(&ring->sched);



You need to call drm_sched_stop() before calling this.
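
A sketch of the ordering being asked for here, reusing the per-ring loop from
the patch (drm_sched_stop() parks the scheduler before its remaining jobs are
torn down):

  /* Sketch: stop the scheduler first, then remove its jobs. */
  drm_sched_stop(&ring->sched, NULL);
  amdgpu_job_stop_all_jobs_on_sched(&ring->sched);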


+   }
+
+   if (adev->mode_info.mode_config_initialized) {
+   if (!amdgpu_device_has_dc_support(adev))
+   drm_helper_force_disable_all(adev->ddev);
+   else
+   drm_atomic_helper_shutdown(adev->ddev);
+   }
+
+   amdgpu_fence_driver_fini(adev);
+   amdgpu_fbdev_fini(adev);
+   /* Try to close drm device to stop applications
+* from opening dri files for further IO operations.
+* TODO: This will throw warning as ttm is not
+* cleaned properly */
+   drm_dev_fini(dev);



I think user mode applications might still hold a reference to the drm device 
through drm_dev_get(), either by directly opening the device file or indirectly 
through importing a DMA-buf. If so, when the last of them terminates, 
drm_dev_put->drm_dev_release->...->drm_dev_fini might get called again, causing 
use-after-free etc. issues. Maybe it is better to call drm_dev_put() here 
instead, so that drm_dev_fini() gets called when the last user client releases 
its reference.
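
A sketch of that suggestion, with the rest of the error path unchanged
(drm_dev_put() only drops the driver's reference; drm_dev_fini() then runs from
drm_dev_release() once the last open file or imported DMA-buf reference is
gone):

  /* Sketch: drop our reference instead of tearing the device down here. */
  drm_dev_put(dev);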

Also a general question - in my work on the DPC recovery feature, which tries to 
recover after a PCIe error, once the PCI error has happened the MMIO registers 
become inaccessible for r/w because the PCI link is dead until it is reset by 
the DPC driver (see 
https://www.kernel.org/doc/html/latest/PCI/pci-error-recovery.html section 6.1.4).
Your case is to try to gracefully close the drm device once a fatal error has 
happened - didn't you encounter errors or warnings when accessing HW registers 
during any of the operations above?

Andrey



+   break;
+   }
+
+   return ret;
+}
+
+static const struct pci_error_handlers amdgpu_err_handler = {
+   .error_detected = amdgpu_pci_err_detected,
+};
+
+
  static struct pci_driver amdgpu_kms_pci_driver = {
.name = DRIVER_NAME,
.id_table = pciidlist,
@@ -1523,10 +1576,9 @@ static struct pci_driver amdgpu_kms_pci_driver = {
.remove = amdgpu_pci_remove,
.shutdown = amdgpu_pci_shutdown,
.driver.pm = &amdgpu_pm_ops,
+   .err_handler = &amdgpu_err_handler,
  };
  
-

-
  static int __init amdgpu_init(void)
  {
int r;

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] drm/amdgpu: adjust the pid in the grab_id trace point

2020-08-12 Thread Steven Rostedt
On Fri, Aug 07, 2020 at 03:36:58PM +0200, Christian König wrote:
> Trace something useful instead of the pid of a kernel thread here.
> 
> Signed-off-by: Christian König 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
> index 5da20fc166d9..07f99ef69d91 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
> @@ -228,6 +228,7 @@ TRACE_EVENT(amdgpu_vm_grab_id,
>),
>  
>   TP_fast_assign(
> +__entry->ent.pid = vm->task_info.pid;

If the ent.pid is not the pid you are interested in for this trace event, just
add a "pid" field to the trace event and place it there. Do not modify the
generic pid that is recorded, as we would like that to be consistent for all
trace events.

The "ent.pid" turns into "common_pid" in the field, leaving "pid" free to use.
Other trace events (like sched_waking) record a pid field that is not the same
as the pid of the executing task.

The "ent.pid" should always be the pid of the task that executed the event.

-- Steve


>  __entry->pasid = vm->pasid;
>  __assign_str(ring, ring->name)
>  __entry->vmid = job->vmid;
> -- 
> 2.17.1
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] drm/amdgpu: adjust the pid in the grab_id trace point

2020-08-12 Thread Christian König

On 12.08.20 at 16:17, Steven Rostedt wrote:

On Fri, Aug 07, 2020 at 03:36:58PM +0200, Christian König wrote:

Trace something useful instead of the pid of a kernel thread here.

Signed-off-by: Christian König 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h | 1 +
  1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
index 5da20fc166d9..07f99ef69d91 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
@@ -228,6 +228,7 @@ TRACE_EVENT(amdgpu_vm_grab_id,
 ),
  
  	TP_fast_assign(

+  __entry->ent.pid = vm->task_info.pid;

If the ent.pid is not the pid you are interested in for this trace event, just
add a "pid" field to the trace event and place it there. Do not modify the
generic pid that is recorded, as we would like that to be consistent for all
trace events.


The problem my userspace guys have is that this doesn't work with 
"trace-cmd -P $pid".


But I think I can teach them how filters work :)


The "ent.pid" turns into "common_pid" in the field, leaving "pid" free to use.
Other trace events (like sched_waking) record a pid field that is not the same
as the pid of the executing task.


Yes, we thought about this alternative as well.


The "ent.pid" should always be the pid of the task that executed the event.


Why? For the case here we just execute a work item in the background for 
a userspace process.


Tracing the pid of the worker pool which executes it doesn't seem to 
make too much sense.


Thanks for the quick reply,
Christian.



-- Steve



   __entry->pasid = vm->pasid;
   __assign_str(ring, ring->name)
   __entry->vmid = job->vmid;
--
2.17.1


___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH 3/7] drm/amd/display: Avoid using unvalidated tiling_flags and tmz_surface in prepare_planes

2020-08-12 Thread Daniel Vetter
On Tue, Aug 11, 2020 at 09:42:11AM -0400, Marek Olšák wrote:
> There are a few cases when the flags can change, for example DCC can be
> disabled due to a hw limitation in the 3d engine. Modifiers give the
> misleading impression that they help with that, but they don't. They don't
> really help with anything.

But if that happens, how do you tell the other side that it needs to
sample new flags? Does that just happen all the time?

Also, do the DCC state changes happen for shared buffers too?
-Daniel

> 
> Marek
> 
> On Mon., Aug. 10, 2020, 08:30 Christian König, <
> ckoenig.leichtzumer...@gmail.com> wrote:
> 
> > Am 10.08.20 um 14:25 schrieb Daniel Vetter:
> > > On Fri, Aug 07, 2020 at 10:29:09AM -0400, Kazlauskas, Nicholas wrote:
> > >> On 2020-08-07 4:30 a.m., dan...@ffwll.ch wrote:
> > >>> On Thu, Jul 30, 2020 at 04:36:38PM -0400, Nicholas Kazlauskas wrote:
> >  [Why]
> >  We're racing with userspace as the flags could potentially change
> >  from when we acquired and validated them in commit_check.
> > >>> Uh ... I didn't know these could change. I think my comments on Bas'
> > >>> series are even more relevant now. I think long term would be best to
> > bake
> > >>> these flags in at addfb time when modifiers aren't set. And otherwise
> > >>> always use the modifiers flag, and completely ignore the legacy flags
> > >>> here.
> > >>> -Daniel
> > >>>
> > >> There's a set tiling/mod flags IOCTL that can be called after addfb
> > happens,
> > >> so unless there's some sort of driver magic preventing this from working
> > >> when it's already been pinned for scanout then I don't see anything
> > stopping
> > >> this from happening.
> > >>
> > >> I still need to review the modifiers series in a little more detail but
> > that
> > >> looks like a good approach to fixing these kind of issues.
> > > Yeah we had the same model for i915, but it's awkward and can surprise
> > > compositors (since the client could change the tiling mode from
> > underneath
> > > the compositor). So freezing the tiling mode at addfb time is the right
> > > thing to do, and anyway how things work with modifiers.
> > >
> > > Ofc maybe good to audit the -amd driver, but hopefully it doesn't do
> > > anything silly with changed tiling. If it does, it's kinda sad day.
> >
> > Marek should know this right away, but I think we only set the tilling
> > flags once while exporting the BO and then never change them.
> >
> > Regards,
> > Christian.
> >
> > >
> > > Btw for i915 we even went a step further, and made the set_tiling ioctl
> > > return an error if a framebuffer for that gem_bo existed. Just to make
> > > sure we don't ever accidentally break this.
> > >
> > > Cheers, Daniel
> > >
> > >> Regards,
> > >> Nicholas Kazlauskas
> > >>
> >  [How]
> >  We unfortunately can't drop this function in its entirety from
> >  prepare_planes since we don't know the afb->address at commit_check
> >  time yet.
> > 
> >  So instead of querying new tiling_flags and tmz_surface use the ones
> >  from the plane_state directly.
> > 
> >  While we're at it, also update the force_disable_dcc option based
> >  on the state from atomic check.
> > 
> >  Cc: Bhawanpreet Lakha 
> >  Cc: Rodrigo Siqueira 
> >  Signed-off-by: Nicholas Kazlauskas 
> >  ---
> > .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 36
> > ++-
> > 1 file changed, 19 insertions(+), 17 deletions(-)
> > 
> >  diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> >  index bf1881bd492c..f78c09c9585e 100644
> >  --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> >  +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> >  @@ -5794,14 +5794,8 @@ static int dm_plane_helper_prepare_fb(struct
> > drm_plane *plane,
> > struct list_head list;
> > struct ttm_validate_buffer tv;
> > struct ww_acquire_ctx ticket;
> >  -  uint64_t tiling_flags;
> > uint32_t domain;
> > int r;
> >  -  bool tmz_surface = false;
> >  -  bool force_disable_dcc = false;
> >  -
> >  -  dm_plane_state_old = to_dm_plane_state(plane->state);
> >  -  dm_plane_state_new = to_dm_plane_state(new_state);
> > if (!new_state->fb) {
> > DRM_DEBUG_DRIVER("No FB bound\n");
> >  @@ -5845,27 +5839,35 @@ static int dm_plane_helper_prepare_fb(struct
> > drm_plane *plane,
> > return r;
> > }
> >  -  amdgpu_bo_get_tiling_flags(rbo, _flags);
> >  -
> >  -  tmz_surface = amdgpu_bo_encrypted(rbo);
> >  -
> > ttm_eu_backoff_reservation(, );
> > afb->address = amdgpu_bo_gpu_offset(rbo);
> > amdgpu_bo_ref(rbo);
> >  +  /**
> >  +   * We don't do surface updates on planes that have been newly
> 

Re: RFC: How to adjust the trace pid?

2020-08-12 Thread Daniel Vetter
On Wed, Aug 12, 2020 at 3:42 PM Christian König
 wrote:
>
> Ping? Daniel, Dave any opinion on this?

Type patch, cc: tracing people, see what they say? tbh I have no idea,
but they have been making unhappy noises about some of the tricks
we've played in the past in i915 tracepoints. So not everything is
cool in there.

Otherwise I guess just add another tracepoint parameter to dump the
correct userspace mm.

3rd option could be to dump the current mm (since I'm assuming those
threads do kthread_use/unuse_mm to impersonate the right userspace
process correctly) in the tracepoint infrastructure too?
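
A sketch of that assumption, for reference - a hypothetical worker body, assuming
the driver keeps a reference to the client's mm (not actual amdgpu code):

    kthread_use_mm(mm);     /* current->mm now points at the client process */
    /* ... run the VM update, tracepoints fire with that mm current ... */
    kthread_unuse_mm(mm);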

Cheers, Daniel

> Christian.
>
> Am 07.08.20 um 15:36 schrieb Christian König:
> > Hi everybody,
> >
> > in amdgpu we got the following issue for which I'm seeking advice on how to 
> > cleanly handle it.
> >
> > We have a bunch of trace points which are related to the VM subsystem and 
> > executed in either a work item, kthread or foreign process context.
> >
> > Now tracing the pid of the context which we are executing in is not really 
> > that useful, so I'm wondering if we could just overwrite the pid recorded 
> > in the trace entry?
> >
> > The following patch does exactly that for the vm_grab_id() trace point, but 
> > I'm not 100% sure if that is legal or not.
> >
> > Any ideas? Comments?
> >
> > Thanks,
> > Christian.
> >
> >
>


-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: RFC: How to adjust the trace pid?

2020-08-12 Thread Christian König

Ping? Daniel, Dave any opinion on this?

Christian.

Am 07.08.20 um 15:36 schrieb Christian König:

Hi everybody,

in amdgpu we got the following issue for which I'm seeking advice on how to cleanly 
handle it.

We have a bunch of trace points which are related to the VM subsystem and 
executed in either a work item, kthread or foreign process context.

Now tracing the pid of the context which we are executing in is not really that 
useful, so I'm wondering if we could just overwrite the pid recorded in the 
trace entry?

The following patch does exactly that for the vm_grab_id() trace point, but I'm 
not 100% sure if that is legal or not.

Any ideas? Comments?

Thanks,
Christian.




___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH 3/4] drm/amd/powerplay: enable Navi1X mgpu fan boost feature

2020-08-12 Thread Alex Deucher
On Wed, Aug 12, 2020 at 12:57 AM Evan Quan  wrote:
>
> Support Navi1X mgpu fan boost enablement.
>
> Change-Id: Iafbf07c56462120d2db578b6af45dd7f985a4cc1
> Signed-off-by: Evan Quan 
> ---
>  .../drm/amd/powerplay/inc/smu_v11_0_ppsmc.h   |  4 +++-
>  drivers/gpu/drm/amd/powerplay/navi10_ppt.c| 21 +++
>  2 files changed, 24 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/powerplay/inc/smu_v11_0_ppsmc.h 
> b/drivers/gpu/drm/amd/powerplay/inc/smu_v11_0_ppsmc.h
> index 406bfd187ce8..fa0174dc7e0e 100644
> --- a/drivers/gpu/drm/amd/powerplay/inc/smu_v11_0_ppsmc.h
> +++ b/drivers/gpu/drm/amd/powerplay/inc/smu_v11_0_ppsmc.h
> @@ -123,7 +123,9 @@
>  #define PPSMC_MSG_DALDisableDummyPstateChange0x49
>  #define PPSMC_MSG_DALEnableDummyPstateChange 0x4A
>
> -#define PPSMC_Message_Count  0x4B
> +#define PPSMC_MSG_SetMGpuFanBoostLimitRpm0x4C
> +
> +#define PPSMC_Message_Count  0x4D
>
>  typedef uint32_t PPSMC_Result;
>  typedef uint32_t PPSMC_Msg;
> diff --git a/drivers/gpu/drm/amd/powerplay/navi10_ppt.c 
> b/drivers/gpu/drm/amd/powerplay/navi10_ppt.c
> index 61e2971be9f3..a86cd819b44b 100644
> --- a/drivers/gpu/drm/amd/powerplay/navi10_ppt.c
> +++ b/drivers/gpu/drm/amd/powerplay/navi10_ppt.c
> @@ -138,6 +138,7 @@ static struct cmn2asic_msg_mapping 
> navi10_message_map[SMU_MSG_MAX_COUNT] = {
> MSG_MAP(DAL_ENABLE_DUMMY_PSTATE_CHANGE, 
> PPSMC_MSG_DALEnableDummyPstateChange,   0),
> MSG_MAP(GetVoltageByDpm,PPSMC_MSG_GetVoltageByDpm,
>   0),
> MSG_MAP(GetVoltageByDpmOverdrive,   
> PPSMC_MSG_GetVoltageByDpmOverdrive, 0),
> +   MSG_MAP(SetMGpuFanBoostLimitRpm,
> PPSMC_MSG_SetMGpuFanBoostLimitRpm,  0),
>  };
>
>  static struct cmn2asic_mapping navi10_clk_map[SMU_CLK_COUNT] = {
> @@ -2555,6 +2556,25 @@ static ssize_t navi10_get_gpu_metrics(struct 
> smu_context *smu,
> return sizeof(struct gpu_metrics_v1_0);
>  }
>
> +static int navi10_enable_mgpu_fan_boost(struct smu_context *smu)
> +{
> +   struct amdgpu_device *adev = smu->adev;
> +   uint32_t param = 0;
> +
> +   /* Navi12 does not support this */
> +   if (adev->asic_type == CHIP_NAVI12)
> +   return 0;
> +
> +   if (adev->pdev->device == 0x7312 &&
> +   adev->external_rev_id == 0)

You want adev->pdev->revision here.  With that fixed, series is:
Reviewed-by: Alex Deucher 
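
For clarity, the corrected check being pointed at would presumably read (sketch
only, not the final patch):

    if (adev->pdev->device == 0x7312 &&
        adev->pdev->revision == 0)
            param = 0xD188;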

> +   param = 0xD188;
> +
> +   return smu_cmn_send_smc_msg_with_param(smu,
> +  
> SMU_MSG_SetMGpuFanBoostLimitRpm,
> +  param,
> +  NULL);
> +}
> +
>  static const struct pptable_funcs navi10_ppt_funcs = {
> .get_allowed_feature_mask = navi10_get_allowed_feature_mask,
> .set_default_dpm_table = navi10_set_default_dpm_table,
> @@ -2636,6 +2656,7 @@ static const struct pptable_funcs navi10_ppt_funcs = {
> .get_pp_feature_mask = smu_cmn_get_pp_feature_mask,
> .set_pp_feature_mask = smu_cmn_set_pp_feature_mask,
> .get_gpu_metrics = navi10_get_gpu_metrics,
> +   .enable_mgpu_fan_boost = navi10_enable_mgpu_fan_boost,
>  };
>
>  void navi10_set_ppt_funcs(struct smu_context *smu)
> --
> 2.28.0
>
> ___
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH v3] drm/amdgpu: add new trace event for page table update v3

2020-08-12 Thread Christian König

Am 12.08.20 um 14:09 schrieb Shashank Sharma:

On 12/08/20 2:02 pm, Christian König wrote:

Am 12.08.20 um 10:15 schrieb Shashank Sharma:

Hello Christian,

On 12/08/20 12:15 pm, Christian König wrote:

Am 12.08.20 um 06:33 schrieb Shashank Sharma:

This patch adds a new trace event to track the PTE update
events. This specific event will provide information like:
- start and end of virtual memory mapping
- HW engine flags for the map
- physical address for mapping

This will be particularly useful for memory profiling tools
(like RMV) which are monitoring the page table update events.

V2: Added physical address lookup logic in trace point
V3: switch to use __dynamic_array
   added nptes in the TP_printk arguments list
   added page size in the arg list

Cc: Christian König 
Cc: Alex Deucher 
Signed-off-by: Christian König 
Signed-off-by: Shashank Sharma 
---
drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h | 38 +++
drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c|  9 --
2 files changed, 45 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
index 63e734a125fb..b9aae7983b4b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
@@ -321,6 +321,44 @@ DEFINE_EVENT(amdgpu_vm_mapping, amdgpu_vm_bo_cs,
TP_ARGS(mapping)
);

+TRACE_EVENT(amdgpu_vm_update_ptes,

+   TP_PROTO(struct amdgpu_vm_update_params *p,
+uint64_t start, uint64_t end,
+unsigned int nptes, uint64_t dst,
+uint64_t incr, uint64_t flags),
+   TP_ARGS(p, start, end, nptes, dst, incr, flags),
+   TP_STRUCT__entry(
+__field(u64, start)
+__field(u64, end)
+__field(u64, flags)
+__field(u64, incr)
+__field(unsigned int, nptes)
+__dynamic_array(u64, dst, nptes)
+   ),
+
+   TP_fast_assign(
+   unsigned int i;
+
+   __entry->start = start;
+   __entry->end = end;
+   __entry->flags = flags;
+   __entry->incr = incr;
+   __entry->nptes = nptes;
+
+   for (i = 0; i < nptes; ++i) {
+   u64 addr = p->pages_addr ? amdgpu_vm_map_gart(
+   p->pages_addr, dst) : dst;
+
+   ((u64 *)__get_dynamic_array(dst))[i] = addr;
+   dst += incr;
+   }
+   ),
+   TP_printk("seg:0x%010llx-0x%010llx, flags:0x%llx, nptr=%u, pgsz:%llu,"
+ " dst:\n%s", __entry->start, __entry->end, __entry->flags,
+ __entry->nptes, __entry->incr,

This is not correct. The increment is NOT the page size, but rather the
page size rounded down to a power of 512+4K.

In other words page size can be 4K, 8K, 16K, 32K, 64K ... 1M, 2M,
4M ... 512M, 1G, 2G

But the increment can only be 4K, 2M, 1G
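
As an illustration of the distinction, a hypothetical helper (not amdgpu code;
SZ_* constants come from <linux/sizes.h>) that rounds a mapping's page size down
to the only increments the page tables actually step in:

    /* hypothetical: the PTE increment is the page size rounded down to 4K, 2M or 1G */
    static u64 example_pte_increment(u64 page_size)
    {
            if (page_size >= SZ_1G)
                    return SZ_1G;
            if (page_size >= SZ_2M)
                    return SZ_2M;
            return SZ_4K;
    }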

Understood. But I think the requirement here is for increment. My understanding is that the tool 
needs to save the page entries, and for that, it will need start of virtual mem, start of physical 
mem, mapping size and step to increment the entries. If that's so, we can re-label this entry as 
"step" instead of "page size". Please let me know if you think it's the right 
thing to do.

We could stick with the naming increment if that helps, but this can
also be derived from the number of destination addresses we have.

sure, i will make it increment.

On the other hand explicitly mentioning it probably won't hurt us either.

And by the way what does "seg" mean?

Ah, to get under the 80-char limit I shortened 'segment' to 'seg', and later
realized I still have to break the print into two lines :). I will change it
back to segment or start/end.


Ah, maybe rather use VA for virtual address. But start/end is probably 
the best option.


Christian.



- Shashank


Christian.


And do we need the nptes here? We just need it to print the correct
number of destination addresses.

Agree, we don't really need nptes here, I will remove that and send V4.

- Shashank


Regards,
Christian.


+ __print_array(__get_dynamic_array(dst), __entry->nptes, 8))
+);
+
TRACE_EVENT(amdgpu_vm_set_ptes,
TP_PROTO(uint64_t pe, uint64_t addr, unsigned count,
 uint32_t incr, uint64_t flags, bool direct),
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 71e005cf2952..b5dbb5e8bc61 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -1513,17 +1513,22 @@ static int amdgpu_vm_update_ptes(struct 
amdgpu_vm_update_params *params,
do {
uint64_t 

Re: [PATCH] drm/amd/powerplay: Fix uninitialized warning in arcturus ppt driver

2020-08-12 Thread StDenis, Tom
[AMD Official Use Only - Internal Distribution Only]

Possibly, but since the arcturus_get_smu_metrics_data() can error out we should 
check that return value no?

(also setting *value to 0 avoids this bug in the future...).

Tom


From: Das, Nirmoy 
Sent: Wednesday, August 12, 2020 08:40
To: StDenis, Tom; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amd/powerplay: Fix uninitialized warning in arcturus 
ppt driver


On 8/12/20 2:20 PM, Tom St Denis wrote:
> Fixes:
>
>CC [M]  
> drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm_mst_types.o
> drivers/gpu/drm/amd/amdgpu/../powerplay/arcturus_ppt.c: In function 
> ‘arcturus_log_thermal_throttling_event’:
> drivers/gpu/drm/amd/amdgpu/../powerplay/arcturus_ppt.c:2223:24: warning: 
> ‘throttler_status’ may be used uninitialized in this function 
> [-Wmaybe-uninitialized]
>   2223 |   if (throttler_status & logging_label[throttler_idx].feature_mask) {
>
> by making arcturus_get_smu_metrics_data() assign a default value
> (of zero) before any possible return point as well as simply error
> out of arcturus_log_thermal_throttling_event() if it fails.
>
> Signed-off-by: Tom St Denis 
> ---
>   drivers/gpu/drm/amd/powerplay/arcturus_ppt.c | 12 ++--
>   1 file changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c 
> b/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c
> index 8b1025dc54fd..78f7ec95e4f5 100644
> --- a/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c
> +++ b/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c
> @@ -551,6 +551,9 @@ static int arcturus_get_smu_metrics_data(struct 
> smu_context *smu,
>
>   mutex_lock(>metrics_lock);
>
> + // assign default value
> + *value = 0;
> +
>   ret = smu_cmn_get_metrics_table_locked(smu,
>  NULL,
>  false);
> @@ -2208,15 +2211,20 @@ static const struct throttling_logging_label {
>   };
>   static void arcturus_log_thermal_throttling_event(struct smu_context *smu)
>   {
> - int throttler_idx, throtting_events = 0, buf_idx = 0;
> + int throttler_idx, throtting_events = 0, buf_idx = 0, ret;
>   struct amdgpu_device *adev = smu->adev;
>   uint32_t throttler_status;


I think initializing throttler_status here should resolve the warning.

>   char log_buf[256];
>
> - arcturus_get_smu_metrics_data(smu,
> + ret = arcturus_get_smu_metrics_data(smu,
> METRICS_THROTTLER_STATUS,
> _status);
>
> + if (ret) {
> + dev_err(adev->dev, "Could not read from 
> arcturus_get_smu_metrics_data()\n");
> + return;
> + }
> +
>   memset(log_buf, 0, sizeof(log_buf));
>   for (throttler_idx = 0; throttler_idx < ARRAY_SIZE(logging_label);
>throttler_idx++) {
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] drm/amd/powerplay: Fix uninitialized warning in arcturus ppt driver

2020-08-12 Thread Nirmoy


On 8/12/20 2:20 PM, Tom St Denis wrote:

Fixes:

   CC [M]  drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm_mst_types.o
drivers/gpu/drm/amd/amdgpu/../powerplay/arcturus_ppt.c: In function 
‘arcturus_log_thermal_throttling_event’:
drivers/gpu/drm/amd/amdgpu/../powerplay/arcturus_ppt.c:2223:24: warning: 
‘throttler_status’ may be used uninitialized in this function 
[-Wmaybe-uninitialized]
  2223 |   if (throttler_status & logging_label[throttler_idx].feature_mask) {

by making arcturus_get_smu_metrics_data() assign a default value
(of zero) before any possible return point as well as simply error
out of arcturus_log_thermal_throttling_event() if it fails.

Signed-off-by: Tom St Denis 
---
  drivers/gpu/drm/amd/powerplay/arcturus_ppt.c | 12 ++--
  1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c 
b/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c
index 8b1025dc54fd..78f7ec95e4f5 100644
--- a/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c
+++ b/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c
@@ -551,6 +551,9 @@ static int arcturus_get_smu_metrics_data(struct smu_context 
*smu,
  
  	mutex_lock(>metrics_lock);
  
+	// assign default value

+   *value = 0;
+
ret = smu_cmn_get_metrics_table_locked(smu,
   NULL,
   false);
@@ -2208,15 +2211,20 @@ static const struct throttling_logging_label {
  };
  static void arcturus_log_thermal_throttling_event(struct smu_context *smu)
  {
-   int throttler_idx, throtting_events = 0, buf_idx = 0;
+   int throttler_idx, throtting_events = 0, buf_idx = 0, ret;
struct amdgpu_device *adev = smu->adev;
uint32_t throttler_status;



I think initializing throttler_status here should resolve the warning.
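
i.e., as a sketch, initializing the local at its declaration (illustration only,
not taken from the patch):

    uint32_t throttler_status = 0;  /* silences -Wmaybe-uninitialized */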


char log_buf[256];
  
-	arcturus_get_smu_metrics_data(smu,

+   ret = arcturus_get_smu_metrics_data(smu,
  METRICS_THROTTLER_STATUS,
  _status);
  
+	if (ret) {

+   dev_err(adev->dev, "Could not read from 
arcturus_get_smu_metrics_data()\n");
+   return;
+   }
+
memset(log_buf, 0, sizeof(log_buf));
for (throttler_idx = 0; throttler_idx < ARRAY_SIZE(logging_label);
 throttler_idx++) {

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH] drm/amd/powerplay: Fix uninitialized warning in arcturus ppt driver

2020-08-12 Thread Tom St Denis
Fixes:

  CC [M]  drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm_mst_types.o
drivers/gpu/drm/amd/amdgpu/../powerplay/arcturus_ppt.c: In function 
‘arcturus_log_thermal_throttling_event’:
drivers/gpu/drm/amd/amdgpu/../powerplay/arcturus_ppt.c:2223:24: warning: 
‘throttler_status’ may be used uninitialized in this function 
[-Wmaybe-uninitialized]
 2223 |   if (throttler_status & logging_label[throttler_idx].feature_mask) {

by making arcturus_get_smu_metrics_data() assign a default value
(of zero) before any possible return point as well as simply error
out of arcturus_log_thermal_throttling_event() if it fails.

Signed-off-by: Tom St Denis 
---
 drivers/gpu/drm/amd/powerplay/arcturus_ppt.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c 
b/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c
index 8b1025dc54fd..78f7ec95e4f5 100644
--- a/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c
+++ b/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c
@@ -551,6 +551,9 @@ static int arcturus_get_smu_metrics_data(struct smu_context 
*smu,
 
mutex_lock(>metrics_lock);
 
+   // assign default value
+   *value = 0;
+
ret = smu_cmn_get_metrics_table_locked(smu,
   NULL,
   false);
@@ -2208,15 +2211,20 @@ static const struct throttling_logging_label {
 };
 static void arcturus_log_thermal_throttling_event(struct smu_context *smu)
 {
-   int throttler_idx, throtting_events = 0, buf_idx = 0;
+   int throttler_idx, throtting_events = 0, buf_idx = 0, ret;
struct amdgpu_device *adev = smu->adev;
uint32_t throttler_status;
char log_buf[256];
 
-   arcturus_get_smu_metrics_data(smu,
+   ret = arcturus_get_smu_metrics_data(smu,
  METRICS_THROTTLER_STATUS,
  _status);
 
+   if (ret) {
+   dev_err(adev->dev, "Could not read from 
arcturus_get_smu_metrics_data()\n");
+   return;
+   }
+
memset(log_buf, 0, sizeof(log_buf));
for (throttler_idx = 0; throttler_idx < ARRAY_SIZE(logging_label);
 throttler_idx++) {
-- 
2.26.2

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] drm/amdgpu: Fix repeatly flr issue

2020-08-12 Thread Nirmoy



On 8/12/20 11:19 AM, Emily.Deng wrote:

From: jqdeng 

Only the test case with no job running needs to do recovery in the
FLR notification.
If there are jobs in the mirror list, let the guest driver hit the
job timeout and then do the recovery.

Signed-off-by: jqdeng 
Change-Id: Ic6234fce46fa1655ba81c4149235eeac75e75868
---
  drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 20 +++-
  drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 22 --
  2 files changed, 39 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c 
b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
index fe31cbeccfe9..12fe5164aaf3 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
@@ -238,6 +238,9 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct 
*work)
struct amdgpu_virt *virt = container_of(work, struct amdgpu_virt, 
flr_work);
struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, 
virt);
int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
+   int i;
+   bool need_do_recover = true;



We should find a better name for "need_do_recover", maybe "need_to_recover"?




+   struct drm_sched_job *job;
  
  	/* block amdgpu_gpu_recover till msg FLR COMPLETE received,

 * otherwise the mailbox msg will be ruined/reseted by
@@ -258,10 +261,25 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct 
*work)
  
  flr_done:

up_read(>reset_sem);
+   for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+   struct amdgpu_ring *ring = adev->rings[i];
+
+   if (!ring || !ring->sched.thread)
+   continue;
+
+   spin_lock(>sched.job_list_lock);
+   job = list_first_entry_or_null(>sched.ring_mirror_list,
+   struct drm_sched_job, node);
+   spin_unlock(>sched.job_list_lock);
+   if (job) {
+   need_do_recover = false;
+   break;
+   }
+   }



This 1st job retrieval logic can move to a function, as there are two
instances of it, for example the sketch below.
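
A sketch of such a helper - hypothetical name, body lifted from the two
duplicated loops in the patch:

    static bool xgpu_rings_have_pending_job(struct amdgpu_device *adev)
    {
            struct drm_sched_job *job;
            int i;

            for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
                    struct amdgpu_ring *ring = adev->rings[i];

                    if (!ring || !ring->sched.thread)
                            continue;

                    spin_lock(&ring->sched.job_list_lock);
                    job = list_first_entry_or_null(&ring->sched.ring_mirror_list,
                                                   struct drm_sched_job, node);
                    spin_unlock(&ring->sched.job_list_lock);
                    if (job)
                            return true;
            }

            return false;
    }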



  
  	/* Trigger recovery for world switch failure if no TDR */

if (amdgpu_device_should_recover_gpu(adev)
-   && adev->sdma_timeout == MAX_SCHEDULE_TIMEOUT)
+   && (need_do_recover || adev->sdma_timeout == 
MAX_SCHEDULE_TIMEOUT))
amdgpu_device_gpu_recover(adev, NULL);
  }
  
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c

index 6f55172e8337..fc92c494df0b 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
@@ -259,6 +259,9 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct 
*work)
struct amdgpu_virt *virt = container_of(work, struct amdgpu_virt, 
flr_work);
struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, 
virt);
int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
+   int i;
+   bool need_do_recover = true;
+   struct drm_sched_job *job;
  
  	/* block amdgpu_gpu_recover till msg FLR COMPLETE received,

 * otherwise the mailbox msg will be ruined/reseted by
@@ -279,10 +282,25 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct 
*work)
  
  flr_done:

up_read(>reset_sem);
+   for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+   struct amdgpu_ring *ring = adev->rings[i];
+
+   if (!ring || !ring->sched.thread)
+   continue;
+
+   spin_lock(>sched.job_list_lock);
+   job = list_first_entry_or_null(>sched.ring_mirror_list,
+   struct drm_sched_job, node);
+   spin_unlock(>sched.job_list_lock);
+   if (job) {
+   need_do_recover = false;
+   break;
+   }
+   }
  
  	/* Trigger recovery for world switch failure if no TDR */

-   if (amdgpu_device_should_recover_gpu(adev)
-   && (adev->sdma_timeout == MAX_SCHEDULE_TIMEOUT ||
+   if (amdgpu_device_should_recover_gpu(adev) && (need_do_recover ||
+   adev->sdma_timeout == MAX_SCHEDULE_TIMEOUT ||
adev->gfx_timeout == MAX_SCHEDULE_TIMEOUT ||
adev->compute_timeout == MAX_SCHEDULE_TIMEOUT ||
adev->video_timeout == MAX_SCHEDULE_TIMEOUT))

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH v3] drm/amdgpu: add new trace event for page table update v3

2020-08-12 Thread Shashank Sharma

On 12/08/20 2:02 pm, Christian König wrote:
> Am 12.08.20 um 10:15 schrieb Shashank Sharma:
>> Hello Christian,
>>
>> On 12/08/20 12:15 pm, Christian König wrote:
>>> Am 12.08.20 um 06:33 schrieb Shashank Sharma:
 This patch adds a new trace event to track the PTE update
 events. This specific event will provide information like:
 - start and end of virtual memory mapping
 - HW engine flags for the map
 - physical address for mapping

 This will be particularly useful for memory profiling tools
 (like RMV) which are monitoring the page table update events.

 V2: Added physical address lookup logic in trace point
 V3: switch to use __dynamic_array
   added nptes in the TP_printk arguments list
   added page size in the arg list

 Cc: Christian König 
 Cc: Alex Deucher 
 Signed-off-by: Christian König 
 Signed-off-by: Shashank Sharma 
 ---
drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h | 38 +++
drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c|  9 --
2 files changed, 45 insertions(+), 2 deletions(-)

 diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h 
 b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
 index 63e734a125fb..b9aae7983b4b 100644
 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
 +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
 @@ -321,6 +321,44 @@ DEFINE_EVENT(amdgpu_vm_mapping, amdgpu_vm_bo_cs,
TP_ARGS(mapping)
);

 +TRACE_EVENT(amdgpu_vm_update_ptes,
 +  TP_PROTO(struct amdgpu_vm_update_params *p,
 +   uint64_t start, uint64_t end,
 +   unsigned int nptes, uint64_t dst,
 +   uint64_t incr, uint64_t flags),
 +  TP_ARGS(p, start, end, nptes, dst, incr, flags),
 +  TP_STRUCT__entry(
 +   __field(u64, start)
 +   __field(u64, end)
 +   __field(u64, flags)
 +   __field(u64, incr)
 +   __field(unsigned int, nptes)
 +   __dynamic_array(u64, dst, nptes)
 +  ),
 +
 +  TP_fast_assign(
 +  unsigned int i;
 +
 +  __entry->start = start;
 +  __entry->end = end;
 +  __entry->flags = flags;
 +  __entry->incr = incr;
 +  __entry->nptes = nptes;
 +
 +  for (i = 0; i < nptes; ++i) {
 +  u64 addr = p->pages_addr ? amdgpu_vm_map_gart(
 +  p->pages_addr, dst) : dst;
 +
 +  ((u64 *)__get_dynamic_array(dst))[i] = addr;
 +  dst += incr;
 +  }
 +  ),
 +  TP_printk("seg:0x%010llx-0x%010llx, flags:0x%llx, nptr=%u, pgsz:%llu,"
 +" dst:\n%s", __entry->start, __entry->end, __entry->flags,
 +__entry->nptes, __entry->incr,
>>> This is not correct. The increment is NOT the page size, but rather the
>>> page size rounded down to a power of 512+4K.
>>>
>>> In other words page size can be 4K, 8K, 16K, 32K, 64K ... 1M, 2M,
>>> 4M ... 512M, 1G, 2G
>>>
>>> But the increment can only be 4K, 2M, 1G
>> Understood. But I think the requirement here is for increment. My 
>> understanding is that the tool needs to save the page entries, and for that, 
>> it will need start of virtual mem, start of physical mem, mapping size and 
>> step to increment the entries. If that's so, we can re-label this entry as 
>> "step" instead of "page size". Please let me know if you think it's the 
>> right thing to do.
> We could stick with the naming increment if that helps, but this can 
> also be derived from the number of destination addresses we have.
sure, i will make it increment.
>
> On the other hand explicitly mentioning it probably won't hurt us either.
>
> And by the way what does "seg" mean?

Ah, to get under the 80-char limit I shortened 'segment' to 'seg', and later
realized I still have to break the print into two lines :). I will change it
back to segment or start/end.

- Shashank

>
> Christian.
>
>>> And do we need the nptes here? We just need it to print the correct
>>> number of destination addresses.
>> Agree, we don't really need nptes here, I will remove that and send V4.
>>
>> - Shashank
>>
>>> Regards,
>>> Christian.
>>>
 +__print_array(__get_dynamic_array(dst), __entry->nptes, 8))
 +);
 +
TRACE_EVENT(amdgpu_vm_set_ptes,
TP_PROTO(uint64_t pe, uint64_t addr, unsigned count,
 uint32_t incr, uint64_t flags, bool direct),
 diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
 b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
 index 71e005cf2952..b5dbb5e8bc61 100644
 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
 +++ 

RE: [PATCH] drm/amdgpu: fix a potential circular locking dependency

2020-08-12 Thread Li, Dennis
[AMD Official Use Only - Internal Distribution Only]

Am 12.08.20 um 12:02 schrieb Li, Dennis:
> [AMD Official Use Only - Internal Distribution Only]
>
> Am 12.08.20 um 11:23 schrieb Li, Dennis:
>> [AMD Official Use Only - Internal Distribution Only]
>>
>> Am 12.08.20 um 03:33 schrieb Li, Dennis:
>>> [AMD Official Use Only - Internal Distribution Only]
>>>
>>> Hi, Christian,
>>>
>>> Re: I was wondering the same thing for the amdgpu_gem_va_ioctl() as well. 
>>> We shouldn't have any hardware access here, so taking the reset_sem looks 
>>> like overkill to me.
>>>
>>> [Dennis Li] amdgpu_vm_bo_unmap, amdgpu_vm_bo_clear_mappings,
>>> amdgpu_vm_bo_replace_map and amdgpu_gem_va_update_vm all have a chance to
>>> access hardware.
>> This is complete nonsense. The functions intentionally work through the 
>> scheduler to avoid accessing the hardware directly for exactly that reason.
>>
>> The only hardware access we have here is the HDP flush and that can fail in 
>> the case of a GPU reset without causing problems.
>>
>> [Dennis Li]  amdgpu_vm_bo_clear_mappings -> amdgpu_vm_prt_get -> 
>> amdgpu_vm_update_prt_state -> gmc_v8_0_set_prt
> That is for pre gfx9 hardware and only called once during initial enabling of 
> the feature.
>
> Please remove that locking again since it is clearly completely against the 
> driver design.
>
> [Dennis Li] okay, if you agree, I will change to only protect 
> amdgpu_gem_va_update_vm in this function.

Better even only protect the amdgpu_vm_update_prt_state() function.
[Dennis Li] Got it. According to your suggestion, I will also narrow down the 
scope of reset_sem in other functions. 

Christian.

>
> Christian.
>
>> Regards,
>> Christian.
>>
>>> Best Regards
>>> Dennis Li
>>> -Original Message-
>>> From: Koenig, Christian 
>>> Sent: Wednesday, August 12, 2020 12:15 AM
>>> To: Kuehling, Felix ; Li, Dennis 
>>> ; amd-gfx@lists.freedesktop.org; Deucher, 
>>> Alexander ; Zhang, Hawking 
>>> 
>>> Subject: Re: [PATCH] drm/amdgpu: fix a potential circular locking 
>>> dependency
>>>
>>> Am 11.08.20 um 15:57 schrieb Felix Kuehling:
 Am 2020-08-11 um 5:32 a.m. schrieb Dennis Li:
> [  653.902305]
> ==
> [  653.902928] WARNING: possible circular locking dependency detected
> [  653.903517] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G   
> OE
> [  653.904098]
> --
> [  653.904675] amdgpu_test/3975 is trying to acquire lock:
> [  653.905241] 97848f8647a0 (>reset_sem){.+.+}, at:
> amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu] [  653.905953]
>but task is already holding lock:
> [  653.907087] 9744adbee1f8 
> (reservation_ww_class_mutex){+.+.},
> at: ttm_eu_reserve_buffers+0x1ae/0x520 [ttm] [  653.907694]
>which lock already depends on the new lock.
>
> [  653.909423]
>the existing dependency chain (in reverse order) is:
> [  653.910594]
>-> #1 (reservation_ww_class_mutex){+.+.}:
> [  653.911759]__ww_mutex_lock.constprop.15+0xca/0x1120
> [  653.912350]ww_mutex_lock+0x73/0x80
> [  653.913044]amdgpu_amdkfd_alloc_gtt_mem+0xde/0x380 [amdgpu]
> [  653.913724]kgd2kfd_device_init+0x13f/0x5e0 [amdgpu]
> [  653.914388]amdgpu_amdkfd_device_init+0x155/0x190 [amdgpu]
> [  653.915033]amdgpu_device_init+0x1303/0x1e10 [amdgpu]
> [  653.915685]amdgpu_driver_load_kms+0x5c/0x2c0 [amdgpu]
> [  653.916349]amdgpu_pci_probe+0x11d/0x200 [amdgpu]
> [  653.916959]local_pci_probe+0x47/0xa0
> [  653.917570]work_for_cpu_fn+0x1a/0x30
> [  653.918184]process_one_work+0x29e/0x630
> [  653.918803]worker_thread+0x22b/0x3f0
> [  653.919427]kthread+0x12f/0x150
> [  653.920047]ret_from_fork+0x3a/0x50
> [  653.920661]
>-> #0 (>reset_sem){.+.+}:
> [  653.921893]__lock_acquire+0x13ec/0x16e0
> [  653.922531]lock_acquire+0xb8/0x1c0
> [  653.923174]down_read+0x48/0x230
> [  653.923886]amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu]
> [  653.924588]drm_ioctl_kernel+0xb6/0x100 [drm]
> [  653.925283]drm_ioctl+0x389/0x450 [drm]
> [  653.926013]amdgpu_drm_ioctl+0x4f/0x80 [amdgpu]
> [  653.926686]ksys_ioctl+0x98/0xb0
> [  653.927357]__x64_sys_ioctl+0x1a/0x20
> [  653.928030]do_syscall_64+0x5f/0x250
> [  653.928697]entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [  653.929373]
>other info that might help us debug this:
>
> [  653.931356]  Possible unsafe locking scenario:
>
> [  653.932647]CPU0CPU1
> [  653.933287]
> [  653.933911]   

Re: [PATCH] drm/amdgpu: fix a potential circular locking dependency

2020-08-12 Thread Christian König

Am 12.08.20 um 12:02 schrieb Li, Dennis:

[AMD Official Use Only - Internal Distribution Only]

Am 12.08.20 um 11:23 schrieb Li, Dennis:

[AMD Official Use Only - Internal Distribution Only]

Am 12.08.20 um 03:33 schrieb Li, Dennis:

[AMD Official Use Only - Internal Distribution Only]

Hi, Christian,

Re: I was wondering the same thing for the amdgpu_gem_va_ioctl() as well. We 
shouldn't have any hardware access here, so taking the reset_sem looks like 
overkill to me.

[Dennis Li] amdgpu_vm_bo_unmap, amdgpu_vm_bo_clear_mappings,
amdgpu_vm_bo_replace_map and amdgpu_gem_va_update_vm all have a chance to access
hardware.

This is complete nonsense. The functions intentionally work through the 
scheduler to avoid accessing the hardware directly for exactly that reason.

The only hardware access we have here is the HDP flush and that can fail in the 
case of a GPU reset without causing problems.

[Dennis Li]  amdgpu_vm_bo_clear_mappings -> amdgpu_vm_prt_get ->
amdgpu_vm_update_prt_state -> gmc_v8_0_set_prt

That is for pre gfx9 hardware and only called once during initial enabling of 
the feature.

Please remove that locking again since it is clearly completely against the 
driver design.

[Dennis Li] okay, if you agree, I will change to only protect 
amdgpu_gem_va_update_vm in this function.


Better even only protect the amdgpu_vm_update_prt_state() function.

Christian.



Christian.


Regards,
Christian.


Best Regards
Dennis Li
-Original Message-
From: Koenig, Christian 
Sent: Wednesday, August 12, 2020 12:15 AM
To: Kuehling, Felix ; Li, Dennis
; amd-gfx@lists.freedesktop.org; Deucher,
Alexander ; Zhang, Hawking

Subject: Re: [PATCH] drm/amdgpu: fix a potential circular locking
dependency

Am 11.08.20 um 15:57 schrieb Felix Kuehling:

Am 2020-08-11 um 5:32 a.m. schrieb Dennis Li:

[  653.902305]
==
[  653.902928] WARNING: possible circular locking dependency detected
[  653.903517] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G   OE
[  653.904098]
--
[  653.904675] amdgpu_test/3975 is trying to acquire lock:
[  653.905241] 97848f8647a0 (>reset_sem){.+.+}, at:
amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu] [  653.905953]
   but task is already holding lock:
[  653.907087] 9744adbee1f8 (reservation_ww_class_mutex){+.+.},
at: ttm_eu_reserve_buffers+0x1ae/0x520 [ttm] [  653.907694]
   which lock already depends on the new lock.

[  653.909423]
   the existing dependency chain (in reverse order) is:
[  653.910594]
   -> #1 (reservation_ww_class_mutex){+.+.}:
[  653.911759]__ww_mutex_lock.constprop.15+0xca/0x1120
[  653.912350]ww_mutex_lock+0x73/0x80
[  653.913044]amdgpu_amdkfd_alloc_gtt_mem+0xde/0x380 [amdgpu]
[  653.913724]kgd2kfd_device_init+0x13f/0x5e0 [amdgpu]
[  653.914388]amdgpu_amdkfd_device_init+0x155/0x190 [amdgpu]
[  653.915033]amdgpu_device_init+0x1303/0x1e10 [amdgpu]
[  653.915685]amdgpu_driver_load_kms+0x5c/0x2c0 [amdgpu]
[  653.916349]amdgpu_pci_probe+0x11d/0x200 [amdgpu]
[  653.916959]local_pci_probe+0x47/0xa0
[  653.917570]work_for_cpu_fn+0x1a/0x30
[  653.918184]process_one_work+0x29e/0x630
[  653.918803]worker_thread+0x22b/0x3f0
[  653.919427]kthread+0x12f/0x150
[  653.920047]ret_from_fork+0x3a/0x50
[  653.920661]
   -> #0 (>reset_sem){.+.+}:
[  653.921893]__lock_acquire+0x13ec/0x16e0
[  653.922531]lock_acquire+0xb8/0x1c0
[  653.923174]down_read+0x48/0x230
[  653.923886]amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu]
[  653.924588]drm_ioctl_kernel+0xb6/0x100 [drm]
[  653.925283]drm_ioctl+0x389/0x450 [drm]
[  653.926013]amdgpu_drm_ioctl+0x4f/0x80 [amdgpu]
[  653.926686]ksys_ioctl+0x98/0xb0
[  653.927357]__x64_sys_ioctl+0x1a/0x20
[  653.928030]do_syscall_64+0x5f/0x250
[  653.928697]entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  653.929373]
   other info that might help us debug this:

[  653.931356]  Possible unsafe locking scenario:

[  653.932647]CPU0CPU1
[  653.933287]
[  653.933911]   lock(reservation_ww_class_mutex);
[  653.934530]lock(>reset_sem);
[  653.935154]lock(reservation_ww_class_mutex);
[  653.935766]   lock(>reset_sem);
[  653.936360]
*** DEADLOCK ***

[  653.938028] 2 locks held by amdgpu_test/3975:
[  653.938574]  #0: b2a862d6bcd0
(reservation_ww_class_acquire){+.+.}, at:
amdgpu_gem_va_ioctl+0x39b/0x4f0 [amdgpu] [  653.939233]  #1:
9744adbee1f8 (reservation_ww_class_mutex){+.+.}, at:
ttm_eu_reserve_buffers+0x1ae/0x520 [ttm]

change the order of reservation_ww_class_mutex and adev->reset_sem
in amdgpu_gem_va_ioctl the same 

RE: [PATCH] drm/amdgpu: fix a potential circular locking dependency

2020-08-12 Thread Li, Dennis
[AMD Official Use Only - Internal Distribution Only]

Am 12.08.20 um 11:23 schrieb Li, Dennis:
> [AMD Official Use Only - Internal Distribution Only]
>
> Am 12.08.20 um 03:33 schrieb Li, Dennis:
>> [AMD Official Use Only - Internal Distribution Only]
>>
>> Hi, Christian,
>>
>> Re: I was wondering the same thing for the amdgpu_gem_va_ioctl() as well. We 
>> shouldn't have any hardware access here, so taking the reset_sem looks like 
>> overkill to me.
>>
>> [Dennis Li] amdgpu_vm_bo_unmap, amdgpu_vm_bo_clear_mappings,
>> amdgpu_vm_bo_replace_map and amdgpu_gem_va_update_vm all have a chance to access
>> hardware.
> This is complete nonsense. The functions intentionally work through the 
> scheduler to avoid accessing the hardware directly for exactly that reason.
>
> The only hardware access we have here is the HDP flush and that can fail in 
> the case of a GPU reset without causing problems.
>
> [Dennis Li]  amdgpu_vm_bo_clear_mappings -> amdgpu_vm_prt_get -> 
> amdgpu_vm_update_prt_state -> gmc_v8_0_set_prt

That is for pre gfx9 hardware and only called once during initial enabling of 
the feature.

Please remove that locking again since it is clearly completely against the 
driver design.

[Dennis Li] okay, if you agree, I will change to only protect 
amdgpu_gem_va_update_vm in this function. 

Christian.

>
> Regards,
> Christian.
>
>> Best Regards
>> Dennis Li
>> -Original Message-
>> From: Koenig, Christian 
>> Sent: Wednesday, August 12, 2020 12:15 AM
>> To: Kuehling, Felix ; Li, Dennis 
>> ; amd-gfx@lists.freedesktop.org; Deucher, 
>> Alexander ; Zhang, Hawking 
>> 
>> Subject: Re: [PATCH] drm/amdgpu: fix a potential circular locking 
>> dependency
>>
>> Am 11.08.20 um 15:57 schrieb Felix Kuehling:
>>> Am 2020-08-11 um 5:32 a.m. schrieb Dennis Li:
 [  653.902305]
 ==
 [  653.902928] WARNING: possible circular locking dependency detected
 [  653.903517] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G   
 OE
 [  653.904098]
 --
 [  653.904675] amdgpu_test/3975 is trying to acquire lock:
 [  653.905241] 97848f8647a0 (>reset_sem){.+.+}, at:
 amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu] [  653.905953]
   but task is already holding lock:
 [  653.907087] 9744adbee1f8 (reservation_ww_class_mutex){+.+.},
 at: ttm_eu_reserve_buffers+0x1ae/0x520 [ttm] [  653.907694]
   which lock already depends on the new lock.

 [  653.909423]
   the existing dependency chain (in reverse order) is:
 [  653.910594]
   -> #1 (reservation_ww_class_mutex){+.+.}:
 [  653.911759]__ww_mutex_lock.constprop.15+0xca/0x1120
 [  653.912350]ww_mutex_lock+0x73/0x80
 [  653.913044]amdgpu_amdkfd_alloc_gtt_mem+0xde/0x380 [amdgpu]
 [  653.913724]kgd2kfd_device_init+0x13f/0x5e0 [amdgpu]
 [  653.914388]amdgpu_amdkfd_device_init+0x155/0x190 [amdgpu]
 [  653.915033]amdgpu_device_init+0x1303/0x1e10 [amdgpu]
 [  653.915685]amdgpu_driver_load_kms+0x5c/0x2c0 [amdgpu]
 [  653.916349]amdgpu_pci_probe+0x11d/0x200 [amdgpu]
 [  653.916959]local_pci_probe+0x47/0xa0
 [  653.917570]work_for_cpu_fn+0x1a/0x30
 [  653.918184]process_one_work+0x29e/0x630
 [  653.918803]worker_thread+0x22b/0x3f0
 [  653.919427]kthread+0x12f/0x150
 [  653.920047]ret_from_fork+0x3a/0x50
 [  653.920661]
   -> #0 (>reset_sem){.+.+}:
 [  653.921893]__lock_acquire+0x13ec/0x16e0
 [  653.922531]lock_acquire+0xb8/0x1c0
 [  653.923174]down_read+0x48/0x230
 [  653.923886]amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu]
 [  653.924588]drm_ioctl_kernel+0xb6/0x100 [drm]
 [  653.925283]drm_ioctl+0x389/0x450 [drm]
 [  653.926013]amdgpu_drm_ioctl+0x4f/0x80 [amdgpu]
 [  653.926686]ksys_ioctl+0x98/0xb0
 [  653.927357]__x64_sys_ioctl+0x1a/0x20
 [  653.928030]do_syscall_64+0x5f/0x250
 [  653.928697]entry_SYSCALL_64_after_hwframe+0x49/0xbe
 [  653.929373]
   other info that might help us debug this:

 [  653.931356]  Possible unsafe locking scenario:

 [  653.932647]CPU0CPU1
 [  653.933287]
 [  653.933911]   lock(reservation_ww_class_mutex);
 [  653.934530]lock(>reset_sem);
 [  653.935154]
 lock(reservation_ww_class_mutex);
 [  653.935766]   lock(>reset_sem);
 [  653.936360]
*** DEADLOCK ***

 [  653.938028] 2 locks held by amdgpu_test/3975:
 [  653.938574]  #0: b2a862d6bcd0 
 

Re: [PATCH] drm/amdgpu: fix a potential circular locking dependency

2020-08-12 Thread Christian König

Am 12.08.20 um 11:23 schrieb Li, Dennis:

[AMD Official Use Only - Internal Distribution Only]

Am 12.08.20 um 03:33 schrieb Li, Dennis:

[AMD Official Use Only - Internal Distribution Only]

Hi, Christian,

Re: I was wondering the same thing for the amdgpu_gem_va_ioctl() as well. We 
shouldn't have any hardware access here, so taking the reset_sem looks like 
overkill to me.

[Dennis Li] amdgpu_vm_bo_unmap, amdgpu_vm_bo_clear_mappings,
amdgpu_vm_bo_replace_map and amdgpu_gem_va_update_vm all have a chance to access
hardware.

This is complete nonsense. The functions intentionally work through the 
scheduler to avoid accessing the hardware directly for exactly that reason.

The only hardware access we have here is the HDP flush and that can fail in the 
case of a GPU reset without causing problems.

[Dennis Li]  amdgpu_vm_bo_clear_mappings -> amdgpu_vm_prt_get -> 
amdgpu_vm_update_prt_state -> gmc_v8_0_set_prt


That is for pre gfx9 hardware and only called once during initial 
enabling of the feature.


Please remove that locking again since it is clearly completely against 
the driver design.


Christian.



Regards,
Christian.


Best Regards
Dennis Li
-Original Message-
From: Koenig, Christian 
Sent: Wednesday, August 12, 2020 12:15 AM
To: Kuehling, Felix ; Li, Dennis
; amd-gfx@lists.freedesktop.org; Deucher, Alexander
; Zhang, Hawking 
Subject: Re: [PATCH] drm/amdgpu: fix a potential circular locking
dependency

Am 11.08.20 um 15:57 schrieb Felix Kuehling:

Am 2020-08-11 um 5:32 a.m. schrieb Dennis Li:

[  653.902305]
==
[  653.902928] WARNING: possible circular locking dependency detected
[  653.903517] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G   OE
[  653.904098]
--
[  653.904675] amdgpu_test/3975 is trying to acquire lock:
[  653.905241] 97848f8647a0 (>reset_sem){.+.+}, at:
amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu] [  653.905953]
  but task is already holding lock:
[  653.907087] 9744adbee1f8 (reservation_ww_class_mutex){+.+.},
at: ttm_eu_reserve_buffers+0x1ae/0x520 [ttm] [  653.907694]
  which lock already depends on the new lock.

[  653.909423]
  the existing dependency chain (in reverse order) is:
[  653.910594]
  -> #1 (reservation_ww_class_mutex){+.+.}:
[  653.911759]__ww_mutex_lock.constprop.15+0xca/0x1120
[  653.912350]ww_mutex_lock+0x73/0x80
[  653.913044]amdgpu_amdkfd_alloc_gtt_mem+0xde/0x380 [amdgpu]
[  653.913724]kgd2kfd_device_init+0x13f/0x5e0 [amdgpu]
[  653.914388]amdgpu_amdkfd_device_init+0x155/0x190 [amdgpu]
[  653.915033]amdgpu_device_init+0x1303/0x1e10 [amdgpu]
[  653.915685]amdgpu_driver_load_kms+0x5c/0x2c0 [amdgpu]
[  653.916349]amdgpu_pci_probe+0x11d/0x200 [amdgpu]
[  653.916959]local_pci_probe+0x47/0xa0
[  653.917570]work_for_cpu_fn+0x1a/0x30
[  653.918184]process_one_work+0x29e/0x630
[  653.918803]worker_thread+0x22b/0x3f0
[  653.919427]kthread+0x12f/0x150
[  653.920047]ret_from_fork+0x3a/0x50
[  653.920661]
  -> #0 (>reset_sem){.+.+}:
[  653.921893]__lock_acquire+0x13ec/0x16e0
[  653.922531]lock_acquire+0xb8/0x1c0
[  653.923174]down_read+0x48/0x230
[  653.923886]amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu]
[  653.924588]drm_ioctl_kernel+0xb6/0x100 [drm]
[  653.925283]drm_ioctl+0x389/0x450 [drm]
[  653.926013]amdgpu_drm_ioctl+0x4f/0x80 [amdgpu]
[  653.926686]ksys_ioctl+0x98/0xb0
[  653.927357]__x64_sys_ioctl+0x1a/0x20
[  653.928030]do_syscall_64+0x5f/0x250
[  653.928697]entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  653.929373]
  other info that might help us debug this:

[  653.931356]  Possible unsafe locking scenario:

[  653.932647]CPU0CPU1
[  653.933287]
[  653.933911]   lock(reservation_ww_class_mutex);
[  653.934530]lock(>reset_sem);
[  653.935154]lock(reservation_ww_class_mutex);
[  653.935766]   lock(>reset_sem);
[  653.936360]
   *** DEADLOCK ***

[  653.938028] 2 locks held by amdgpu_test/3975:
[  653.938574]  #0: b2a862d6bcd0
(reservation_ww_class_acquire){+.+.}, at:
amdgpu_gem_va_ioctl+0x39b/0x4f0 [amdgpu] [  653.939233]  #1:
9744adbee1f8 (reservation_ww_class_mutex){+.+.}, at:
ttm_eu_reserve_buffers+0x1ae/0x520 [ttm]

change the order of reservation_ww_class_mutex and adev->reset_sem
in amdgpu_gem_va_ioctl the same as ones in
amdgpu_amdkfd_alloc_gtt_mem, to avoid potential dead lock.

It may be better to fix it the other way around in
amdgpu_amdkfd_alloc_gtt_mem. Always take the reset_sem inside the
reservation. Otherwise you will never be able to take the reset_sem
while any BOs are reserved. 
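
A sketch of that ordering, for illustration (hypothetical, simplified - not the
actual amdgpu_amdkfd_alloc_gtt_mem code):

    r = amdgpu_bo_reserve(bo, false);       /* reservation (ww_mutex) first */
    if (r)
            return r;

    down_read(&adev->reset_sem);            /* reset_sem always inside it */
    /* ... GART update / other HW access ... */
    up_read(&adev->reset_sem);

    amdgpu_bo_unreserve(bo);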

RE: [PATCH] drm/amdgpu: fix a potential circular locking dependency

2020-08-12 Thread Li, Dennis
[AMD Official Use Only - Internal Distribution Only]

Am 12.08.20 um 03:19 schrieb Li, Dennis:
> [AMD Official Use Only - Internal Distribution Only]
>
> Hi, Felix,
>
> Re: It may be better to fix it the other way around in 
> amdgpu_amdkfd_alloc_gtt_mem. Always take the reset_sem inside the 
> reservation. Otherwise you will never be able to take the reset_sem while any 
> BOs are reserved. That's probably going to cause you other problems later.
> [Dennis Li] Thanks for finding the potential issue, I will change it in 
> version 2.
>
> Re: That makes me wonder, why do you need the reset_sem in 
> amdgpu_amdkfd_alloc_gtt_mem in the first place? There is no obvious hardware 
> access in that function. Is it for amdgpu_ttm_alloc_gart updating the GART 
> table through HDP?
> [Dennis Li] Yes, amdgpu_gart_bind will flush HDP and TLB. I had considered 
> protecting only amdgpu_ttm_alloc_gart before.

That access is irrelevant and the lock should be removed or changed into a 
trylock.

See we need the HDP flush only because the hardware could have accessed the 
data before.

But after a GPU reset the HDP is known to be clean, so this doesn't need any 
protection.
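
A sketch of the trylock variant mentioned above (hypothetical, around the HDP
flush only):

    if (down_read_trylock(&adev->reset_sem)) {
            amdgpu_asic_flush_hdp(adev, NULL);
            up_read(&adev->reset_sem);
    }
    /* if the trylock fails, a reset is in flight and the HDP will be clean afterwards */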

>   But I worry other functions will access hardware in the future. Therefore I 
> selected an aggressive approach which locks reset_sem at the beginning of the entry 
> functions of the amdgpu driver.

This is not a good idea. We used to have such a global lock before and removed 
it because it caused all kinds of problems.

[Dennis Li] Okay. If you don't agree with this aggressive approach, I will change it to 
protect amdgpu_ttm_alloc_gart only.

When was this added? Looks like it slipped under my radar or I wasn't awake 
enough at that moment.

Christian.

>
> Best Regards
> Dennis Li
> -Original Message-
> From: Kuehling, Felix 
> Sent: Tuesday, August 11, 2020 9:57 PM
> To: Li, Dennis ; amd-gfx@lists.freedesktop.org; 
> Deucher, Alexander ; Zhang, Hawking 
> ; Koenig, Christian 
> Subject: Re: [PATCH] drm/amdgpu: fix a potential circular locking 
> dependency
>
> Am 2020-08-11 um 5:32 a.m. schrieb Dennis Li:
>> [  653.902305] ==
>> [  653.902928] WARNING: possible circular locking dependency detected
>> [  653.903517] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G   OE
>> [  653.904098] --
>> [  653.904675] amdgpu_test/3975 is trying to acquire lock:
>> [  653.905241] 97848f8647a0 (>reset_sem){.+.+}, at:
>> amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu] [  653.905953]
>> but task is already holding lock:
>> [  653.907087] 9744adbee1f8 (reservation_ww_class_mutex){+.+.},
>> at: ttm_eu_reserve_buffers+0x1ae/0x520 [ttm] [  653.907694]
>> which lock already depends on the new lock.
>>
>> [  653.909423]
>> the existing dependency chain (in reverse order) is:
>> [  653.910594]
>> -> #1 (reservation_ww_class_mutex){+.+.}:
>> [  653.911759]__ww_mutex_lock.constprop.15+0xca/0x1120
>> [  653.912350]ww_mutex_lock+0x73/0x80
>> [  653.913044]amdgpu_amdkfd_alloc_gtt_mem+0xde/0x380 [amdgpu]
>> [  653.913724]kgd2kfd_device_init+0x13f/0x5e0 [amdgpu]
>> [  653.914388]amdgpu_amdkfd_device_init+0x155/0x190 [amdgpu]
>> [  653.915033]amdgpu_device_init+0x1303/0x1e10 [amdgpu]
>> [  653.915685]amdgpu_driver_load_kms+0x5c/0x2c0 [amdgpu]
>> [  653.916349]amdgpu_pci_probe+0x11d/0x200 [amdgpu]
>> [  653.916959]local_pci_probe+0x47/0xa0
>> [  653.917570]work_for_cpu_fn+0x1a/0x30
>> [  653.918184]process_one_work+0x29e/0x630
>> [  653.918803]worker_thread+0x22b/0x3f0
>> [  653.919427]kthread+0x12f/0x150
>> [  653.920047]ret_from_fork+0x3a/0x50
>> [  653.920661]
>> -> #0 (>reset_sem){.+.+}:
>> [  653.921893]__lock_acquire+0x13ec/0x16e0
>> [  653.922531]lock_acquire+0xb8/0x1c0
>> [  653.923174]down_read+0x48/0x230
>> [  653.923886]amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu]
>> [  653.924588]drm_ioctl_kernel+0xb6/0x100 [drm]
>> [  653.925283]drm_ioctl+0x389/0x450 [drm]
>> [  653.926013]amdgpu_drm_ioctl+0x4f/0x80 [amdgpu]
>> [  653.926686]ksys_ioctl+0x98/0xb0
>> [  653.927357]__x64_sys_ioctl+0x1a/0x20
>> [  653.928030]do_syscall_64+0x5f/0x250
>> [  653.928697]entry_SYSCALL_64_after_hwframe+0x49/0xbe
>> [  653.929373]
>> other info that might help us debug this:
>>
>> [  653.931356]  Possible unsafe locking scenario:
>>
>> [  653.932647]CPU0CPU1
>> [  653.933287]
>> [  653.933911]   lock(reservation_ww_class_mutex);
>> [  653.934530]lock(>reset_sem);
>> [  653.935154]
>> lock(reservation_ww_class_mutex);
>> [  653.935766]   lock(>reset_sem);
>> [  653.936360]

[PATCH] drm/amdgpu: Limit the error info print rate

2020-08-12 Thread Emily . Deng
From: jqdeng 

Use function printk_ratelimit to limit the print rate.

Signed-off-by: jqdeng 
Change-Id: Ief05debe30d975cbcf88e473c9f486d70b5a202c
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index a94b3f862fc2..727b909b4b9e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -1296,7 +1296,8 @@ int amdgpu_cs_ioctl(struct drm_device *dev, void *data, 
struct drm_file *filp)
 
r = amdgpu_cs_parser_init(, data);
if (r) {
-   DRM_ERROR("Failed to initialize parser %d!\n", r);
+   if (printk_ratelimit())
+   DRM_ERROR("Failed to initialize parser %d!\n", r);
goto out;
}
 
-- 
2.25.1
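
For comparison, a hedged alternative sketch using the ratelimited device printk
helper (assuming adev is already available in amdgpu_cs_ioctl, which this hunk
does not show):

    if (r) {
            dev_err_ratelimited(adev->dev,
                                "Failed to initialize parser %d!\n", r);
            goto out;
    }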

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [PATCH] drm/amdgpu: fix a potential circular locking dependency

2020-08-12 Thread Li, Dennis
[AMD Official Use Only - Internal Distribution Only]

Am 12.08.20 um 03:33 schrieb Li, Dennis:
> [AMD Official Use Only - Internal Distribution Only]
>
> Hi, Christian,
>
> Re: I was wondering the same thing for the amdgpu_gem_va_ioctl() as well. We 
> shouldn't have any hardware access here, so taking the reset_sem looks like 
> overkill to me.
>
> [Dennis Li] amdgpu_vm_bo_unmap, amdgpu_vm_bo_clear_mappings, 
> amdgpu_vm_bo_replace_map and amdgpu_gem_va_update_vm all have a chance to access 
> hardware.

This is complete nonsense. The functions intentionally work through the 
scheduler to avoid accessing the hardware directly for exactly that reason.

The only hardware access we have here is the HDP flush and that can fail in the 
case of a GPU reset without causing problems.

[Dennis Li]  amdgpu_vm_bo_clear_mappings -> amdgpu_vm_prt_get -> 
amdgpu_vm_update_prt_state -> gmc_v8_0_set_prt

Regards,
Christian.

>
> Best Regards
> Dennis Li
> -Original Message-
> From: Koenig, Christian 
> Sent: Wednesday, August 12, 2020 12:15 AM
> To: Kuehling, Felix ; Li, Dennis 
> ; amd-gfx@lists.freedesktop.org; Deucher, Alexander 
> ; Zhang, Hawking 
> Subject: Re: [PATCH] drm/amdgpu: fix a potential circular locking 
> dependency
>
> Am 11.08.20 um 15:57 schrieb Felix Kuehling:
>> Am 2020-08-11 um 5:32 a.m. schrieb Dennis Li:
>>> [  653.902305] 
>>> ==
>>> [  653.902928] WARNING: possible circular locking dependency detected
>>> [  653.903517] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G   OE
>>> [  653.904098] 
>>> --
>>> [  653.904675] amdgpu_test/3975 is trying to acquire lock:
>>> [  653.905241] 97848f8647a0 (>reset_sem){.+.+}, at:
>>> amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu] [  653.905953]
>>>  but task is already holding lock:
>>> [  653.907087] 9744adbee1f8 (reservation_ww_class_mutex){+.+.},
>>> at: ttm_eu_reserve_buffers+0x1ae/0x520 [ttm] [  653.907694]
>>>  which lock already depends on the new lock.
>>>
>>> [  653.909423]
>>>  the existing dependency chain (in reverse order) is:
>>> [  653.910594]
>>>  -> #1 (reservation_ww_class_mutex){+.+.}:
>>> [  653.911759]__ww_mutex_lock.constprop.15+0xca/0x1120
>>> [  653.912350]ww_mutex_lock+0x73/0x80
>>> [  653.913044]amdgpu_amdkfd_alloc_gtt_mem+0xde/0x380 [amdgpu]
>>> [  653.913724]kgd2kfd_device_init+0x13f/0x5e0 [amdgpu]
>>> [  653.914388]amdgpu_amdkfd_device_init+0x155/0x190 [amdgpu]
>>> [  653.915033]amdgpu_device_init+0x1303/0x1e10 [amdgpu]
>>> [  653.915685]amdgpu_driver_load_kms+0x5c/0x2c0 [amdgpu]
>>> [  653.916349]amdgpu_pci_probe+0x11d/0x200 [amdgpu]
>>> [  653.916959]local_pci_probe+0x47/0xa0
>>> [  653.917570]work_for_cpu_fn+0x1a/0x30
>>> [  653.918184]process_one_work+0x29e/0x630
>>> [  653.918803]worker_thread+0x22b/0x3f0
>>> [  653.919427]kthread+0x12f/0x150
>>> [  653.920047]ret_from_fork+0x3a/0x50
>>> [  653.920661]
>>>  -> #0 (&adev->reset_sem){.+.+}:
>>> [  653.921893]__lock_acquire+0x13ec/0x16e0
>>> [  653.922531]lock_acquire+0xb8/0x1c0
>>> [  653.923174]down_read+0x48/0x230
>>> [  653.923886]amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu]
>>> [  653.924588]drm_ioctl_kernel+0xb6/0x100 [drm]
>>> [  653.925283]drm_ioctl+0x389/0x450 [drm]
>>> [  653.926013]amdgpu_drm_ioctl+0x4f/0x80 [amdgpu]
>>> [  653.926686]ksys_ioctl+0x98/0xb0
>>> [  653.927357]__x64_sys_ioctl+0x1a/0x20
>>> [  653.928030]do_syscall_64+0x5f/0x250
>>> [  653.928697]entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>> [  653.929373]
>>>  other info that might help us debug this:
>>>
>>> [  653.931356]  Possible unsafe locking scenario:
>>>
>>> [  653.932647]        CPU0                    CPU1
>>> [  653.933287]        ----                    ----
>>> [  653.933911]   lock(reservation_ww_class_mutex);
>>> [  653.934530]                               lock(&adev->reset_sem);
>>> [  653.935154]                               lock(reservation_ww_class_mutex);
>>> [  653.935766]   lock(&adev->reset_sem);
>>> [  653.936360]
>>>   *** DEADLOCK ***
>>>
>>> [  653.938028] 2 locks held by amdgpu_test/3975:
>>> [  653.938574]  #0: b2a862d6bcd0 
>>> (reservation_ww_class_acquire){+.+.}, at:
>>> amdgpu_gem_va_ioctl+0x39b/0x4f0 [amdgpu] [  653.939233]  #1:
>>> 9744adbee1f8 (reservation_ww_class_mutex){+.+.}, at:
>>> ttm_eu_reserve_buffers+0x1ae/0x520 [ttm]
>>>
>>> change the order of reservation_ww_class_mutex and adev->reset_sem 
>>> in amdgpu_gem_va_ioctl the same as ones in 
>>> amdgpu_amdkfd_alloc_gtt_mem, to avoid potential dead lock.
>> It may be better to fix it the other way around in 
>> amdgpu_amdkfd_alloc_gtt_mem. Always take the reset_sem inside the 
>> reservation. 

[PATCH] drm/amdgpu: Fix repeated FLR issue

2020-08-12 Thread Emily . Deng
From: jqdeng 

Only do the recovery in the FLR notification for the case
where no job is running.
If there are jobs in the mirror list, let the guest driver
hit the job timeout and then do the recovery.

Signed-off-by: jqdeng 
Change-Id: Ic6234fce46fa1655ba81c4149235eeac75e75868
---
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 20 +++-
 drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 22 --
 2 files changed, 39 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c 
b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
index fe31cbeccfe9..12fe5164aaf3 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
@@ -238,6 +238,9 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct 
*work)
struct amdgpu_virt *virt = container_of(work, struct amdgpu_virt, 
flr_work);
struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, 
virt);
int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
+   int i;
+   bool need_do_recover = true;
+   struct drm_sched_job *job;
 
/* block amdgpu_gpu_recover till msg FLR COMPLETE received,
 * otherwise the mailbox msg will be ruined/reseted by
@@ -258,10 +261,25 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct 
*work)
 
 flr_done:
up_read(&adev->reset_sem);
+   for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+   struct amdgpu_ring *ring = adev->rings[i];
+
+   if (!ring || !ring->sched.thread)
+   continue;
+
+   spin_lock(&ring->sched.job_list_lock);
+   job = list_first_entry_or_null(&ring->sched.ring_mirror_list,
+   struct drm_sched_job, node);
+   spin_unlock(&ring->sched.job_list_lock);
+   if (job) {
+   need_do_recover = false;
+   break;
+   }
+   }
 
/* Trigger recovery for world switch failure if no TDR */
if (amdgpu_device_should_recover_gpu(adev)
-   && adev->sdma_timeout == MAX_SCHEDULE_TIMEOUT)
+   && (need_do_recover || adev->sdma_timeout == 
MAX_SCHEDULE_TIMEOUT))
amdgpu_device_gpu_recover(adev, NULL);
 }
 
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c 
b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
index 6f55172e8337..fc92c494df0b 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
@@ -259,6 +259,9 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct 
*work)
struct amdgpu_virt *virt = container_of(work, struct amdgpu_virt, 
flr_work);
struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, 
virt);
int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
+   int i;
+   bool need_do_recover = true;
+   struct drm_sched_job *job;
 
/* block amdgpu_gpu_recover till msg FLR COMPLETE received,
 * otherwise the mailbox msg will be ruined/reseted by
@@ -279,10 +282,25 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct 
*work)
 
 flr_done:
up_read(&adev->reset_sem);
+   for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+   struct amdgpu_ring *ring = adev->rings[i];
+
+   if (!ring || !ring->sched.thread)
+   continue;
+
+   spin_lock(&ring->sched.job_list_lock);
+   job = list_first_entry_or_null(&ring->sched.ring_mirror_list,
+   struct drm_sched_job, node);
+   spin_unlock(&ring->sched.job_list_lock);
+   if (job) {
+   need_do_recover = false;
+   break;
+   }
+   }
 
/* Trigger recovery for world switch failure if no TDR */
-   if (amdgpu_device_should_recover_gpu(adev)
-   && (adev->sdma_timeout == MAX_SCHEDULE_TIMEOUT ||
+   if (amdgpu_device_should_recover_gpu(adev) && (need_do_recover ||
+   adev->sdma_timeout == MAX_SCHEDULE_TIMEOUT ||
adev->gfx_timeout == MAX_SCHEDULE_TIMEOUT ||
adev->compute_timeout == MAX_SCHEDULE_TIMEOUT ||
adev->video_timeout == MAX_SCHEDULE_TIMEOUT))
-- 
2.25.1
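
Since the ring scan above is duplicated verbatim between mxgpu_ai.c and
mxgpu_nv.c, one option (only a sketch, assuming a shared home such as
amdgpu_virt.c) would be to factor it into a small helper:

	/* Sketch: return true if any scheduler still has a job in its mirror list. */
	static bool amdgpu_virt_ring_has_job(struct amdgpu_device *adev)
	{
		int i;

		for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
			struct amdgpu_ring *ring = adev->rings[i];
			bool busy;

			if (!ring || !ring->sched.thread)
				continue;

			spin_lock(&ring->sched.job_list_lock);
			busy = !list_empty(&ring->sched.ring_mirror_list);
			spin_unlock(&ring->sched.job_list_lock);

			if (busy)
				return true;
		}

		return false;
	}

Both FLR handlers could then test !amdgpu_virt_ring_has_job(adev) instead of
open-coding the loop and the need_do_recover bookkeeping.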



Re: [PATCH] drm/amdgpu: fix a potential circular locking dependency

2020-08-12 Thread Christian König

On 12.08.20 at 03:33, Li, Dennis wrote:

[AMD Official Use Only - Internal Distribution Only]

Hi, Christian,

Re: I was wondering the same thing for the amdgpu_gem_va_ioctl() as well. We 
shouldn't have any hardware access here, so taking the reset_sem looks like 
overkill to me.

[Dennis Li] amdgpu_vm_bo_unmap, amdgpu_vm_bo_clear_mappings,
amdgpu_vm_bo_replace_map and amdgpu_gem_va_update_vm all have a chance to
access the hardware.


This is complete nonsense. The functions intentionally work through the 
scheduler to avoid accessing the hardware directly for exactly that reason.


The only hardware access we have here is the HDP flush and that can fail 
in the case of a GPU reset without causing problems.


Regards,
Christian.



Best Regards
Dennis Li
-Original Message-
From: Koenig, Christian 
Sent: Wednesday, August 12, 2020 12:15 AM
To: Kuehling, Felix ; Li, Dennis ; 
amd-gfx@lists.freedesktop.org; Deucher, Alexander ; Zhang, Hawking 

Subject: Re: [PATCH] drm/amdgpu: fix a potential circular locking dependency

On 11.08.20 at 15:57, Felix Kuehling wrote:

On 2020-08-11 at 5:32 a.m., Dennis Li wrote:

[  653.902305] ==
[  653.902928] WARNING: possible circular locking dependency detected
[  653.903517] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G   OE
[  653.904098] --
[  653.904675] amdgpu_test/3975 is trying to acquire lock:
[  653.905241] 97848f8647a0 (&adev->reset_sem){.+.+}, at:
amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu] [  653.905953]
 but task is already holding lock:
[  653.907087] 9744adbee1f8 (reservation_ww_class_mutex){+.+.},
at: ttm_eu_reserve_buffers+0x1ae/0x520 [ttm] [  653.907694]
 which lock already depends on the new lock.

[  653.909423]
 the existing dependency chain (in reverse order) is:
[  653.910594]
 -> #1 (reservation_ww_class_mutex){+.+.}:
[  653.911759]__ww_mutex_lock.constprop.15+0xca/0x1120
[  653.912350]ww_mutex_lock+0x73/0x80
[  653.913044]amdgpu_amdkfd_alloc_gtt_mem+0xde/0x380 [amdgpu]
[  653.913724]kgd2kfd_device_init+0x13f/0x5e0 [amdgpu]
[  653.914388]amdgpu_amdkfd_device_init+0x155/0x190 [amdgpu]
[  653.915033]amdgpu_device_init+0x1303/0x1e10 [amdgpu]
[  653.915685]amdgpu_driver_load_kms+0x5c/0x2c0 [amdgpu]
[  653.916349]amdgpu_pci_probe+0x11d/0x200 [amdgpu]
[  653.916959]local_pci_probe+0x47/0xa0
[  653.917570]work_for_cpu_fn+0x1a/0x30
[  653.918184]process_one_work+0x29e/0x630
[  653.918803]worker_thread+0x22b/0x3f0
[  653.919427]kthread+0x12f/0x150
[  653.920047]ret_from_fork+0x3a/0x50
[  653.920661]
 -> #0 (&adev->reset_sem){.+.+}:
[  653.921893]__lock_acquire+0x13ec/0x16e0
[  653.922531]lock_acquire+0xb8/0x1c0
[  653.923174]down_read+0x48/0x230
[  653.923886]amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu]
[  653.924588]drm_ioctl_kernel+0xb6/0x100 [drm]
[  653.925283]drm_ioctl+0x389/0x450 [drm]
[  653.926013]amdgpu_drm_ioctl+0x4f/0x80 [amdgpu]
[  653.926686]ksys_ioctl+0x98/0xb0
[  653.927357]__x64_sys_ioctl+0x1a/0x20
[  653.928030]do_syscall_64+0x5f/0x250
[  653.928697]entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  653.929373]
 other info that might help us debug this:

[  653.931356]  Possible unsafe locking scenario:

[  653.932647]        CPU0                    CPU1
[  653.933287]        ----                    ----
[  653.933911]   lock(reservation_ww_class_mutex);
[  653.934530]                               lock(&adev->reset_sem);
[  653.935154]                               lock(reservation_ww_class_mutex);
[  653.935766]   lock(&adev->reset_sem);
[  653.936360]
  *** DEADLOCK ***

[  653.938028] 2 locks held by amdgpu_test/3975:
[  653.938574]  #0: b2a862d6bcd0
(reservation_ww_class_acquire){+.+.}, at:
amdgpu_gem_va_ioctl+0x39b/0x4f0 [amdgpu] [  653.939233]  #1:
9744adbee1f8 (reservation_ww_class_mutex){+.+.}, at:
ttm_eu_reserve_buffers+0x1ae/0x520 [ttm]

change the order of reservation_ww_class_mutex and adev->reset_sem in
amdgpu_gem_va_ioctl the same as ones in amdgpu_amdkfd_alloc_gtt_mem,
to avoid potential dead lock.

It may be better to fix it the other way around in
amdgpu_amdkfd_alloc_gtt_mem. Always take the reset_sem inside the
reservation. Otherwise you will never be able to take the reset_sem
while any BOs are reserved. That's probably going to cause you other
problems later.

That makes me wonder, why do you need the reset_sem in
amdgpu_amdkfd_alloc_gtt_mem in the first place? There is no obvious
hardware access in that function. Is it for amdgpu_ttm_alloc_gart
updating the GART table through HDP?

I was wondering the same thing for the amdgpu_gem_va_ioctl() as well.

We shouldn't have any hardware access here, so taking the 

Re: [PATCH] drm/amdgpu: fix a potential circular locking dependency

2020-08-12 Thread Christian König

On 12.08.20 at 03:19, Li, Dennis wrote:

[AMD Official Use Only - Internal Distribution Only]

Hi, Felix,

Re: It may be better to fix it the other way around in 
amdgpu_amdkfd_alloc_gtt_mem. Always take the reset_sem inside the reservation. 
Otherwise you will never be able to take the reset_sem while any BOs are 
reserved. That's probably going to cause you other problems later.
[Dennis Li] Thanks for catching the potential issue; I will change it in
version 2.

Re: That makes me wonder, why do you need the reset_sem in 
amdgpu_amdkfd_alloc_gtt_mem in the first place? There is no obvious hardware 
access in that function. Is it for amdgpu_ttm_alloc_gart updating the GART 
table through HDP?
[Dennis Li] Yes, amdgpu_gart_bind will flush the HDP and TLB. I had considered
protecting only amdgpu_ttm_alloc_gart before.


That access is irrelevant and the lock should be removed or changed into 
a trylock.


See we need the HDP flush only because the hardware could have accessed 
the data before.


But after a GPU reset the HDP is known to be clean, so this doesn't need 
any protection.
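
In other words, something along these lines around the HDP flush would already
be enough (a sketch only, assuming reset_sem stays a plain rw_semaphore):

	/*
	 * Skip the flush while a reset holds reset_sem for write; after the
	 * reset the HDP is known to be clean anyway.
	 */
	if (down_read_trylock(&adev->reset_sem)) {
		amdgpu_asic_flush_hdp(adev, NULL);
		up_read(&adev->reset_sem);
	}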



  But I worry that other functions will access the hardware in the future.
Therefore I selected an aggressive approach which locks reset_sem at the
beginning of the amdgpu driver's entry functions.


This is not a good idea. We used to have such a global lock before and
removed it because it caused all kinds of problems.


When was this added? Looks like it slipped under my radar or I wasn't 
awake enough at that moment.


Christian.



Best Regards
Dennis Li
-Original Message-
From: Kuehling, Felix 
Sent: Tuesday, August 11, 2020 9:57 PM
To: Li, Dennis ; amd-gfx@lists.freedesktop.org; Deucher, Alexander 
; Zhang, Hawking ; Koenig, Christian 

Subject: Re: [PATCH] drm/amdgpu: fix a potential circular locking dependency

On 2020-08-11 at 5:32 a.m., Dennis Li wrote:

[  653.902305] ==
[  653.902928] WARNING: possible circular locking dependency detected
[  653.903517] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G   OE
[  653.904098] --
[  653.904675] amdgpu_test/3975 is trying to acquire lock:
[  653.905241] 97848f8647a0 (&adev->reset_sem){.+.+}, at:
amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu] [  653.905953]
but task is already holding lock:
[  653.907087] 9744adbee1f8 (reservation_ww_class_mutex){+.+.},
at: ttm_eu_reserve_buffers+0x1ae/0x520 [ttm] [  653.907694]
which lock already depends on the new lock.

[  653.909423]
the existing dependency chain (in reverse order) is:
[  653.910594]
-> #1 (reservation_ww_class_mutex){+.+.}:
[  653.911759]__ww_mutex_lock.constprop.15+0xca/0x1120
[  653.912350]ww_mutex_lock+0x73/0x80
[  653.913044]amdgpu_amdkfd_alloc_gtt_mem+0xde/0x380 [amdgpu]
[  653.913724]kgd2kfd_device_init+0x13f/0x5e0 [amdgpu]
[  653.914388]amdgpu_amdkfd_device_init+0x155/0x190 [amdgpu]
[  653.915033]amdgpu_device_init+0x1303/0x1e10 [amdgpu]
[  653.915685]amdgpu_driver_load_kms+0x5c/0x2c0 [amdgpu]
[  653.916349]amdgpu_pci_probe+0x11d/0x200 [amdgpu]
[  653.916959]local_pci_probe+0x47/0xa0
[  653.917570]work_for_cpu_fn+0x1a/0x30
[  653.918184]process_one_work+0x29e/0x630
[  653.918803]worker_thread+0x22b/0x3f0
[  653.919427]kthread+0x12f/0x150
[  653.920047]ret_from_fork+0x3a/0x50
[  653.920661]
-> #0 (&adev->reset_sem){.+.+}:
[  653.921893]__lock_acquire+0x13ec/0x16e0
[  653.922531]lock_acquire+0xb8/0x1c0
[  653.923174]down_read+0x48/0x230
[  653.923886]amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu]
[  653.924588]drm_ioctl_kernel+0xb6/0x100 [drm]
[  653.925283]drm_ioctl+0x389/0x450 [drm]
[  653.926013]amdgpu_drm_ioctl+0x4f/0x80 [amdgpu]
[  653.926686]ksys_ioctl+0x98/0xb0
[  653.927357]__x64_sys_ioctl+0x1a/0x20
[  653.928030]do_syscall_64+0x5f/0x250
[  653.928697]entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  653.929373]
other info that might help us debug this:

[  653.931356]  Possible unsafe locking scenario:

[  653.932647]        CPU0                    CPU1
[  653.933287]        ----                    ----
[  653.933911]   lock(reservation_ww_class_mutex);
[  653.934530]                               lock(&adev->reset_sem);
[  653.935154]                               lock(reservation_ww_class_mutex);
[  653.935766]   lock(&adev->reset_sem);
[  653.936360]
 *** DEADLOCK ***

[  653.938028] 2 locks held by amdgpu_test/3975:
[  653.938574]  #0: b2a862d6bcd0
(reservation_ww_class_acquire){+.+.}, at:
amdgpu_gem_va_ioctl+0x39b/0x4f0 [amdgpu] [  653.939233]  #1:
9744adbee1f8 (reservation_ww_class_mutex){+.+.}, at:
ttm_eu_reserve_buffers+0x1ae/0x520 [ttm]

change the order of reservation_ww_class_mutex and adev->reset_sem 

RE: [PATCH] drm/amdgpu: disable gfxoff for navy_flounder

2020-08-12 Thread Zhou1, Tao
[AMD Public Use]

Please remember to revert it once the root cause is found. The patch is:

Reviewed-by: Tao Zhou 

> -Original Message-
> From: Jiansong Chen 
> Sent: Wednesday, August 12, 2020 4:44 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Zhou1, Tao ; Feng, Kenneth
> ; Chen, Jiansong (Simon) 
> Subject: [PATCH] drm/amdgpu: disable gfxoff for navy_flounder
> 
> gfxoff is temporarily disabled for navy_flounder, since at present the 
> feature has
> broken some basic amdgpu tests.
> 
> Signed-off-by: Jiansong Chen 
> Change-Id: Icc030370997a66fb9f01cdd4b1c45816e3c88584
> ---
>  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> index d851fe80eaf4..de6e6de41867 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> @@ -3610,6 +3610,9 @@ static void gfx_v10_0_check_gfxoff_flag(struct
> amdgpu_device *adev)
>   if (!gfx_v10_0_navi10_gfxoff_should_enable(adev))
>   adev->pm.pp_feature &= ~PP_GFXOFF_MASK;
>   break;
> + case CHIP_NAVY_FLOUNDER:
> + adev->pm.pp_feature &= ~PP_GFXOFF_MASK;
> + break;
>   default:
>   break;
>   }
> --
> 2.25.1


[PATCH] drm/amdgpu: disable gfxoff for navy_flounder

2020-08-12 Thread Jiansong Chen
gfxoff is temporarily disabled for navy_flounder,
since at present the feature has broken some basic
amdgpu tests.

Signed-off-by: Jiansong Chen 
Change-Id: Icc030370997a66fb9f01cdd4b1c45816e3c88584
---
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index d851fe80eaf4..de6e6de41867 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -3610,6 +3610,9 @@ static void gfx_v10_0_check_gfxoff_flag(struct 
amdgpu_device *adev)
if (!gfx_v10_0_navi10_gfxoff_should_enable(adev))
adev->pm.pp_feature &= ~PP_GFXOFF_MASK;
break;
+   case CHIP_NAVY_FLOUNDER:
+   adev->pm.pp_feature &= ~PP_GFXOFF_MASK;
+   break;
default:
break;
}
-- 
2.25.1



Re: [PATCH v3] drm/amdgpu: add new trace event for page table update v3

2020-08-12 Thread Christian König

On 12.08.20 at 10:15, Shashank Sharma wrote:

Hello Christian,

On 12/08/20 12:15 pm, Christian König wrote:

On 12.08.20 at 06:33, Shashank Sharma wrote:

This patch adds a new trace event to track the PTE update
events. This specific event will provide information like:
- start and end of virtual memory mapping
- HW engine flags for the map
- physical address for mapping

This will be particularly useful for memory profiling tools
(like RMV) which are monitoring the page table update events.

V2: Added physical address lookup logic in trace point
V3: switch to use __dynamic_array
  added nptes in the TP_printk arguments list
  added page size in the arg list

Cc: Christian König 
Cc: Alex Deucher 
Signed-off-by: Christian König 
Signed-off-by: Shashank Sharma 
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h | 38 +++
   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c|  9 --
   2 files changed, 45 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
index 63e734a125fb..b9aae7983b4b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
@@ -321,6 +321,44 @@ DEFINE_EVENT(amdgpu_vm_mapping, amdgpu_vm_bo_cs,
TP_ARGS(mapping)
   );
   
+TRACE_EVENT(amdgpu_vm_update_ptes,

+   TP_PROTO(struct amdgpu_vm_update_params *p,
+uint64_t start, uint64_t end,
+unsigned int nptes, uint64_t dst,
+uint64_t incr, uint64_t flags),
+   TP_ARGS(p, start, end, nptes, dst, incr, flags),
+   TP_STRUCT__entry(
+__field(u64, start)
+__field(u64, end)
+__field(u64, flags)
+__field(u64, incr)
+__field(unsigned int, nptes)
+__dynamic_array(u64, dst, nptes)
+   ),
+
+   TP_fast_assign(
+   unsigned int i;
+
+   __entry->start = start;
+   __entry->end = end;
+   __entry->flags = flags;
+   __entry->incr = incr;
+   __entry->nptes = nptes;
+
+   for (i = 0; i < nptes; ++i) {
+   u64 addr = p->pages_addr ? amdgpu_vm_map_gart(
+   p->pages_addr, dst) : dst;
+
+   ((u64 *)__get_dynamic_array(dst))[i] = addr;
+   dst += incr;
+   }
+   ),
+   TP_printk("seg:0x%010llx-0x%010llx, flags:0x%llx, nptr=%u, pgsz:%llu,"
+ " dst:\n%s", __entry->start, __entry->end, __entry->flags,
+ __entry->nptes, __entry->incr,

This is not correct. The increment is NOT the page size, but rather the
page size rounded down to a power of 512+4K.

In other words page size can be 4K, 8K, 16K, 32K, 64K...1M, 2M,
4M...512M, 1G, 2G

But the increment can only be 4K, 2M, 1G

Understood. But I think the requirement here is for increment. My understanding is that the tool 
needs to save the page entries, and for that, it will need start of virtual mem, start of physical 
mem, mapping size and step to increment the entries. If that's so, we can re-label this entry as 
"step" instead of "page size". Please let me know if you think it's the right 
thing to do.


We could stick with the naming increment if that helps, but this can 
also be derived from the number of destination addresses we have.


On the other hand explicitly mentioning it probably won't hurt us either.

And by the way what does "seg" mean?

Christian.


And do we need the nptes here? We just need it to print the correct
number of destination addresses.

Agree, we don't really need nptes here, I will remove that and send V4.

- Shashank


Regards,
Christian.


+ __print_array(__get_dynamic_array(dst), __entry->nptes, 8))
+);
+
   TRACE_EVENT(amdgpu_vm_set_ptes,
TP_PROTO(uint64_t pe, uint64_t addr, unsigned count,
 uint32_t incr, uint64_t flags, bool direct),
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 71e005cf2952..b5dbb5e8bc61 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -1513,17 +1513,22 @@ static int amdgpu_vm_update_ptes(struct 
amdgpu_vm_update_params *params,
do {
uint64_t upd_end = min(entry_end, frag_end);
unsigned nptes = (upd_end - frag_start) >> shift;
+   uint64_t upd_flags = flags | AMDGPU_PTE_FRAG(frag);
   
   			/* This can happen when we set higher level PDs to

 * silent to stop fault floods.
 */
nptes = max(nptes, 1u);
+
+   

Re: [PATCH v3] drm/amdgpu: add new trace event for page table update v3

2020-08-12 Thread Shashank Sharma
Hello Christian,

On 12/08/20 12:15 pm, Christian König wrote:
> On 12.08.20 at 06:33, Shashank Sharma wrote:
>> This patch adds a new trace event to track the PTE update
>> events. This specific event will provide information like:
>> - start and end of virtual memory mapping
>> - HW engine flags for the map
>> - physical address for mapping
>>
>> This will be particularly useful for memory profiling tools
>> (like RMV) which are monitoring the page table update events.
>>
>> V2: Added physical address lookup logic in trace point
>> V3: switch to use __dynamic_array
>>  added nptes in the TP_printk arguments list
>>  added page size in the arg list
>>
>> Cc: Christian König 
>> Cc: Alex Deucher 
>> Signed-off-by: Christian König 
>> Signed-off-by: Shashank Sharma 
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h | 38 +++
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c|  9 --
>>   2 files changed, 45 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
>> index 63e734a125fb..b9aae7983b4b 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
>> @@ -321,6 +321,44 @@ DEFINE_EVENT(amdgpu_vm_mapping, amdgpu_vm_bo_cs,
>>  TP_ARGS(mapping)
>>   );
>>   
>> +TRACE_EVENT(amdgpu_vm_update_ptes,
>> +TP_PROTO(struct amdgpu_vm_update_params *p,
>> + uint64_t start, uint64_t end,
>> + unsigned int nptes, uint64_t dst,
>> + uint64_t incr, uint64_t flags),
>> +TP_ARGS(p, start, end, nptes, dst, incr, flags),
>> +TP_STRUCT__entry(
>> + __field(u64, start)
>> + __field(u64, end)
>> + __field(u64, flags)
>> + __field(u64, incr)
>> + __field(unsigned int, nptes)
>> + __dynamic_array(u64, dst, nptes)
>> +),
>> +
>> +TP_fast_assign(
>> +unsigned int i;
>> +
>> +__entry->start = start;
>> +__entry->end = end;
>> +__entry->flags = flags;
>> +__entry->incr = incr;
>> +__entry->nptes = nptes;
>> +
>> +for (i = 0; i < nptes; ++i) {
>> +u64 addr = p->pages_addr ? amdgpu_vm_map_gart(
>> +p->pages_addr, dst) : dst;
>> +
>> +((u64 *)__get_dynamic_array(dst))[i] = addr;
>> +dst += incr;
>> +}
>> +),
>> +TP_printk("seg:0x%010llx-0x%010llx, flags:0x%llx, nptr=%u, pgsz:%llu,"
>> +  " dst:\n%s", __entry->start, __entry->end, __entry->flags,
>> +  __entry->nptes, __entry->incr,
> This is not correct. The increment is NOT the page size, but rather the 
> page size rounded down to a power of 512+4K.
>
> In other words page size can be 4K, 8K, 16K, 32K, 64K...1M, 2M,
> 4M...512M, 1G, 2G
>
> But the increment can only be 4K, 2M, 1G
Understood. But I think the requirement here is for increment. My understanding 
is that the tool needs to save the page entries, and for that, it will need 
start of virtual mem, start of physical mem, mapping size and step to increment 
the entries. If that's so, we can re-label this entry as "step" instead of 
"page size". Please let me know if you think it's the right thing to do. 
> And do we need the nptes here? We just need it to print the correct 
> number of destination addresses.

Agree, we don't really need nptes here, I will remove that and send V4.

- Shashank

>
> Regards,
> Christian.
>
>> +  __print_array(__get_dynamic_array(dst), __entry->nptes, 8))
>> +);
>> +
>>   TRACE_EVENT(amdgpu_vm_set_ptes,
>>  TP_PROTO(uint64_t pe, uint64_t addr, unsigned count,
>>   uint32_t incr, uint64_t flags, bool direct),
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> index 71e005cf2952..b5dbb5e8bc61 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> @@ -1513,17 +1513,22 @@ static int amdgpu_vm_update_ptes(struct 
>> amdgpu_vm_update_params *params,
>>  do {
>>  uint64_t upd_end = min(entry_end, frag_end);
>>  unsigned nptes = (upd_end - frag_start) >> shift;
>> +uint64_t upd_flags = flags | AMDGPU_PTE_FRAG(frag);
>>   
>>  /* This can happen when we set higher level PDs to
>>   * silent to stop fault floods.
>>   */
>>  nptes = max(nptes, 1u);
>> +
>> +trace_amdgpu_vm_update_ptes(params, frag_start, upd_end,
>> +nptes, dst, incr,
>> +

Re: [PATCH v3] drm/amdgpu: add new trace event for page table update v3

2020-08-12 Thread Christian König

On 12.08.20 at 06:33, Shashank Sharma wrote:

This patch adds a new trace event to track the PTE update
events. This specific event will provide information like:
- start and end of virtual memory mapping
- HW engine flags for the map
- physical address for mapping

This will be particularly useful for memory profiling tools
(like RMV) which are monitoring the page table update events.

V2: Added physical address lookup logic in trace point
V3: switch to use __dynamic_array
 added nptes in the TP_printk arguments list
 added page size in the arg list

Cc: Christian König 
Cc: Alex Deucher 
Signed-off-by: Christian König 
Signed-off-by: Shashank Sharma 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h | 38 +++
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c|  9 --
  2 files changed, 45 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
index 63e734a125fb..b9aae7983b4b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
@@ -321,6 +321,44 @@ DEFINE_EVENT(amdgpu_vm_mapping, amdgpu_vm_bo_cs,
TP_ARGS(mapping)
  );
  
+TRACE_EVENT(amdgpu_vm_update_ptes,

+   TP_PROTO(struct amdgpu_vm_update_params *p,
+uint64_t start, uint64_t end,
+unsigned int nptes, uint64_t dst,
+uint64_t incr, uint64_t flags),
+   TP_ARGS(p, start, end, nptes, dst, incr, flags),
+   TP_STRUCT__entry(
+__field(u64, start)
+__field(u64, end)
+__field(u64, flags)
+__field(u64, incr)
+__field(unsigned int, nptes)
+__dynamic_array(u64, dst, nptes)
+   ),
+
+   TP_fast_assign(
+   unsigned int i;
+
+   __entry->start = start;
+   __entry->end = end;
+   __entry->flags = flags;
+   __entry->incr = incr;
+   __entry->nptes = nptes;
+
+   for (i = 0; i < nptes; ++i) {
+   u64 addr = p->pages_addr ? amdgpu_vm_map_gart(
+   p->pages_addr, dst) : dst;
+
+   ((u64 *)__get_dynamic_array(dst))[i] = addr;
+   dst += incr;
+   }
+   ),
+   TP_printk("seg:0x%010llx-0x%010llx, flags:0x%llx, nptr=%u, pgsz:%llu,"
+ " dst:\n%s", __entry->start, __entry->end, __entry->flags,
+ __entry->nptes, __entry->incr,


This is not correct. The increment is NOT the page size, but rather the 
page size rounded down to a power of 512+4K.


In other words page size can be 4K, 8K, 16K, 32K, 64K...1M, 2M,
4M...512M, 1G, 2G


But the increment can only be 4K, 2M, 1G

And do we need the nptes here? We just need it to print the correct 
number of destination addresses.
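
To make the distinction concrete with one worked case: a 64K mapping is still
walked with incr = 4K, i.e. 64K / 4K = 16 destination entries end up in dst[],
and the 64K granularity only shows up in the AMDGPU_PTE_FRAG() bits of the
flags; roughly speaking, only mappings that are 2M or 1G aligned and sized
ever get the larger increments.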


Regards,
Christian.


+ __print_array(__get_dynamic_array(dst), __entry->nptes, 8))
+);
+
  TRACE_EVENT(amdgpu_vm_set_ptes,
TP_PROTO(uint64_t pe, uint64_t addr, unsigned count,
 uint32_t incr, uint64_t flags, bool direct),
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 71e005cf2952..b5dbb5e8bc61 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -1513,17 +1513,22 @@ static int amdgpu_vm_update_ptes(struct 
amdgpu_vm_update_params *params,
do {
uint64_t upd_end = min(entry_end, frag_end);
unsigned nptes = (upd_end - frag_start) >> shift;
+   uint64_t upd_flags = flags | AMDGPU_PTE_FRAG(frag);
  
  			/* This can happen when we set higher level PDs to

 * silent to stop fault floods.
 */
nptes = max(nptes, 1u);
+
+   trace_amdgpu_vm_update_ptes(params, frag_start, upd_end,
+   nptes, dst, incr,
+   upd_flags);
amdgpu_vm_update_flags(params, pt, cursor.level,
   pe_start, dst, nptes, incr,
-  flags | AMDGPU_PTE_FRAG(frag));
+  upd_flags);
  
  			pe_start += nptes * 8;

-   dst += (uint64_t)nptes * AMDGPU_GPU_PAGE_SIZE << shift;
+   dst += nptes * incr;
  
  			frag_start = upd_end;

if (frag_start >= frag_end) {




Re: [PATCH 3/4] drm/amd/powerplay: enable Navi1X mgpu fan boost feature

2020-08-12 Thread Nirmoy



On 8/12/20 6:56 AM, Evan Quan wrote:

Support Navi1X mgpu fan boost enablement.

Change-Id: Iafbf07c56462120d2db578b6af45dd7f985a4cc1
Signed-off-by: Evan Quan 
---
  .../drm/amd/powerplay/inc/smu_v11_0_ppsmc.h   |  4 +++-
  drivers/gpu/drm/amd/powerplay/navi10_ppt.c| 21 +++
  2 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/powerplay/inc/smu_v11_0_ppsmc.h 
b/drivers/gpu/drm/amd/powerplay/inc/smu_v11_0_ppsmc.h
index 406bfd187ce8..fa0174dc7e0e 100644
--- a/drivers/gpu/drm/amd/powerplay/inc/smu_v11_0_ppsmc.h
+++ b/drivers/gpu/drm/amd/powerplay/inc/smu_v11_0_ppsmc.h
@@ -123,7 +123,9 @@
  #define PPSMC_MSG_DALDisableDummyPstateChange 0x49
  #define PPSMC_MSG_DALEnableDummyPstateChange 0x4A
  
-#define PPSMC_Message_Count  0x4B

+#define PPSMC_MSG_SetMGpuFanBoostLimitRpm 0x4C
+
+#define PPSMC_Message_Count  0x4D
  
  typedef uint32_t PPSMC_Result;

  typedef uint32_t PPSMC_Msg;
diff --git a/drivers/gpu/drm/amd/powerplay/navi10_ppt.c 
b/drivers/gpu/drm/amd/powerplay/navi10_ppt.c
index 61e2971be9f3..a86cd819b44b 100644
--- a/drivers/gpu/drm/amd/powerplay/navi10_ppt.c
+++ b/drivers/gpu/drm/amd/powerplay/navi10_ppt.c
@@ -138,6 +138,7 @@ static struct cmn2asic_msg_mapping 
navi10_message_map[SMU_MSG_MAX_COUNT] = {
MSG_MAP(DAL_ENABLE_DUMMY_PSTATE_CHANGE, 
PPSMC_MSG_DALEnableDummyPstateChange,   0),
MSG_MAP(GetVoltageByDpm,PPSMC_MSG_GetVoltageByDpm,  
0),
MSG_MAP(GetVoltageByDpmOverdrive,   
PPSMC_MSG_GetVoltageByDpmOverdrive, 0),
+   MSG_MAP(SetMGpuFanBoostLimitRpm,
PPSMC_MSG_SetMGpuFanBoostLimitRpm,  0),
  };
  
  static struct cmn2asic_mapping navi10_clk_map[SMU_CLK_COUNT] = {

@@ -2555,6 +2556,25 @@ static ssize_t navi10_get_gpu_metrics(struct smu_context 
*smu,
return sizeof(struct gpu_metrics_v1_0);
  }
  
+static int navi10_enable_mgpu_fan_boost(struct smu_context *smu)

+{
+   struct amdgpu_device *adev = smu->adev;
+   uint32_t param = 0;
+
+   /* Navi12 does not support this */
+   if (adev->asic_type == CHIP_NAVI12)
+   return 0;
+
+   if (adev->pdev->device == 0x7312 &&
+   adev->external_rev_id == 0)
+   param = 0xD188;



Can you please add a comment explaining the above condition?
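
A placeholder for the kind of comment being asked for (the wording is only a
guess; what 0xD188 actually encodes needs to come from the author):

	/*
	 * XXX: name the SKU this targets (PCI device 0x7312, external
	 * revision 0) and document what boost limit 0xD188 represents.
	 */
	if (adev->pdev->device == 0x7312 &&
	    adev->external_rev_id == 0)
		param = 0xD188;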

Apart from that, the series is Acked-by: Nirmoy Das 



+
+   return smu_cmn_send_smc_msg_with_param(smu,
+  SMU_MSG_SetMGpuFanBoostLimitRpm,
+  param,
+  NULL);
+}
+
  static const struct pptable_funcs navi10_ppt_funcs = {
.get_allowed_feature_mask = navi10_get_allowed_feature_mask,
.set_default_dpm_table = navi10_set_default_dpm_table,
@@ -2636,6 +2656,7 @@ static const struct pptable_funcs navi10_ppt_funcs = {
.get_pp_feature_mask = smu_cmn_get_pp_feature_mask,
.set_pp_feature_mask = smu_cmn_set_pp_feature_mask,
.get_gpu_metrics = navi10_get_gpu_metrics,
+   .enable_mgpu_fan_boost = navi10_enable_mgpu_fan_boost,
  };
  
  void navi10_set_ppt_funcs(struct smu_context *smu)
