RE: [PATCH 2/2] drm/amdgpu: reconfigure spm golden settings on Navi1x after GFXOFF exit(v3)

2020-08-06 Thread Chen, Guchun
[AMD Public Use]

Please check one comment inline.

Regards,
Guchun

-Original Message-
From: Tianci Yin  
Sent: Friday, August 7, 2020 10:37 AM
To: amd-gfx@lists.freedesktop.org
Cc: Tuikov, Luben ; Deucher, Alexander 
; Zhang, Hawking ; Xu, Feifei 
; Hesik, Christopher ; Swamy, 
Manjunatha ; Quan, Evan ; Chen, 
Guchun ; Feng, Kenneth ; Yin, Tianci 
(Rico) 
Subject: [PATCH 2/2] drm/amdgpu: reconfigure spm golden settings on Navi1x 
after GFXOFF exit(v3)

From: "Tianci.Yin" 

On Navi1x, the SPM golden settings are lost after GFXOFF enter/exit, so 
reconfigure the golden settings after GFXOFF exit.

Change-Id: I9358ba9c65f241c36f8a35916170b19535148ee9
Signed-off-by: Tianci.Yin 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
index 8eff0173360d..9e133fd0372d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
@@ -571,8 +571,14 @@ void amdgpu_gfx_off_ctrl(struct amdgpu_device *adev, bool enable)
 	if (enable && !adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
 		schedule_delayed_work(&adev->gfx.gfx_off_delay_work, GFX_OFF_DELAY_ENABLE);
 	} else if (!enable && adev->gfx.gfx_off_state) {
-		if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, false))
+		if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, false)) {
 			adev->gfx.gfx_off_state = false;
+
+			if (adev->gfx.funcs->init_spm_golden) {
+				dev_dbg(adev->dev, "GFXOFF is disabled, re-init SPM golden settings\n");
+				amdgpu_gfx_init_spm_golden(adev);
[Guchun] Since we have the function pointer check, why do we use another
function for the call here? Is init_spm_golden meant as a guard to indicate
SPM availability on Navi1x ASICs?

+			}
+		}
 	}

 	mutex_unlock(&adev->gfx.gfx_off_mutex);
--
2.17.1
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [PATCH 2/2] drm/amdgpu: reconfigure spm golden settings on Navi1x after GFXOFF exit(v3)

2020-08-06 Thread Zhang, Hawking
[AMD Public Use]

Series is

Reviewed-by: Hawking Zhang 

Regards,
Hawking
-Original Message-
From: Tianci Yin  
Sent: Friday, August 7, 2020 10:37
To: amd-gfx@lists.freedesktop.org
Cc: Tuikov, Luben ; Deucher, Alexander 
; Zhang, Hawking ; Xu, Feifei 
; Hesik, Christopher ; Swamy, 
Manjunatha ; Quan, Evan ; Chen, 
Guchun ; Feng, Kenneth ; Yin, Tianci 
(Rico) 
Subject: [PATCH 2/2] drm/amdgpu: reconfigure spm golden settings on Navi1x 
after GFXOFF exit(v3)

From: "Tianci.Yin" 

On Navi1x, the SPM golden settings are lost after GFXOFF enter/exit, so 
reconfigure the golden settings after GFXOFF exit.

Change-Id: I9358ba9c65f241c36f8a35916170b19535148ee9
Signed-off-by: Tianci.Yin 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
index 8eff0173360d..9e133fd0372d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
@@ -571,8 +571,14 @@ void amdgpu_gfx_off_ctrl(struct amdgpu_device *adev, bool enable)
 	if (enable && !adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
 		schedule_delayed_work(&adev->gfx.gfx_off_delay_work, GFX_OFF_DELAY_ENABLE);
 	} else if (!enable && adev->gfx.gfx_off_state) {
-		if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, false))
+		if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, false)) {
 			adev->gfx.gfx_off_state = false;
+
+			if (adev->gfx.funcs->init_spm_golden) {
+				dev_dbg(adev->dev, "GFXOFF is disabled, re-init SPM golden settings\n");
+				amdgpu_gfx_init_spm_golden(adev);
+			}
+		}
 	}

 	mutex_unlock(&adev->gfx.gfx_off_mutex);
--
2.17.1
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH 2/2] drm/amdgpu: reconfigure spm golden settings on Navi1x after GFXOFF exit(v3)

2020-08-06 Thread Tianci Yin
From: "Tianci.Yin" 

On Navi1x, the SPM golden settings are lost after GFXOFF
enter/exit, so reconfigure the golden settings after GFXOFF
exit.

Change-Id: I9358ba9c65f241c36f8a35916170b19535148ee9
Signed-off-by: Tianci.Yin 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
index 8eff0173360d..9e133fd0372d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
@@ -571,8 +571,14 @@ void amdgpu_gfx_off_ctrl(struct amdgpu_device *adev, bool enable)
 	if (enable && !adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
 		schedule_delayed_work(&adev->gfx.gfx_off_delay_work, GFX_OFF_DELAY_ENABLE);
 	} else if (!enable && adev->gfx.gfx_off_state) {
-		if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, false))
+		if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, false)) {
 			adev->gfx.gfx_off_state = false;
+
+			if (adev->gfx.funcs->init_spm_golden) {
+				dev_dbg(adev->dev, "GFXOFF is disabled, re-init SPM golden settings\n");
+				amdgpu_gfx_init_spm_golden(adev);
+			}
+		}
 	}

 	mutex_unlock(&adev->gfx.gfx_off_mutex);
-- 
2.17.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH 1/2] drm/amdgpu: add interface amdgpu_gfx_init_spm_golden for Navi1x

2020-08-06 Thread Tianci Yin
From: "Tianci.Yin" 

On Navi1x, the SPM golden settings are lost after GFXOFF
enter/exit, so reconfiguration is needed. Turn the
configuration code into an interface for future use.

Change-Id: I172f3dc7f59da69b0364052dcad75a9c9aab019e
Reviewed-by: Luben Tuikov 
Reviewed-by: Feifei Xu 
Signed-off-by: Tianci.Yin 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h |  2 ++
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c  | 34 ++---
 2 files changed, 27 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
index 1e7a2b0997c5..a611e78dd4ba 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
@@ -216,6 +216,7 @@ struct amdgpu_gfx_funcs {
 	int (*ras_error_inject)(struct amdgpu_device *adev, void *inject_if);
 	int (*query_ras_error_count)(struct amdgpu_device *adev, void *ras_error_status);
 	void (*reset_ras_error_count)(struct amdgpu_device *adev);
+	void (*init_spm_golden)(struct amdgpu_device *adev);
 };

 struct sq_work {
@@ -324,6 +325,7 @@ struct amdgpu_gfx {
 #define amdgpu_gfx_get_gpu_clock_counter(adev) (adev)->gfx.funcs->get_gpu_clock_counter((adev))
 #define amdgpu_gfx_select_se_sh(adev, se, sh, instance) (adev)->gfx.funcs->select_se_sh((adev), (se), (sh), (instance))
 #define amdgpu_gfx_select_me_pipe_q(adev, me, pipe, q, vmid) (adev)->gfx.funcs->select_me_pipe_q((adev), (me), (pipe), (q), (vmid))
+#define amdgpu_gfx_init_spm_golden(adev) (adev)->gfx.funcs->init_spm_golden((adev))
 
 /**
  * amdgpu_gfx_create_bitmask - create a bitmask
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index db9f1e89a0f8..da21ad04ac0f 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -3307,6 +3307,29 @@ static void gfx_v10_0_set_kiq_pm4_funcs(struct amdgpu_device *adev)
 	adev->gfx.kiq.pmf = &gfx_v10_0_kiq_pm4_funcs;
 }

+static void gfx_v10_0_init_spm_golden_registers(struct amdgpu_device *adev)
+{
+	switch (adev->asic_type) {
+	case CHIP_NAVI10:
+		soc15_program_register_sequence(adev,
+						golden_settings_gc_rlc_spm_10_0_nv10,
+						(const u32)ARRAY_SIZE(golden_settings_gc_rlc_spm_10_0_nv10));
+		break;
+	case CHIP_NAVI14:
+		soc15_program_register_sequence(adev,
+						golden_settings_gc_rlc_spm_10_1_nv14,
+						(const u32)ARRAY_SIZE(golden_settings_gc_rlc_spm_10_1_nv14));
+		break;
+	case CHIP_NAVI12:
+		soc15_program_register_sequence(adev,
+						golden_settings_gc_rlc_spm_10_1_2_nv12,
+						(const u32)ARRAY_SIZE(golden_settings_gc_rlc_spm_10_1_2_nv12));
+		break;
+	default:
+		break;
+	}
+}
+
 static void gfx_v10_0_init_golden_registers(struct amdgpu_device *adev)
 {
switch (adev->asic_type) {
@@ -3317,9 +3340,6 @@ static void gfx_v10_0_init_golden_registers(struct amdgpu_device *adev)
 		soc15_program_register_sequence(adev,
 						golden_settings_gc_10_0_nv10,
 						(const u32)ARRAY_SIZE(golden_settings_gc_10_0_nv10));
-		soc15_program_register_sequence(adev,
-						golden_settings_gc_rlc_spm_10_0_nv10,
-						(const u32)ARRAY_SIZE(golden_settings_gc_rlc_spm_10_0_nv10));
 		break;
 	case CHIP_NAVI14:
 		soc15_program_register_sequence(adev,
@@ -3328,9 +3348,6 @@ static void gfx_v10_0_init_golden_registers(struct amdgpu_device *adev)
 		soc15_program_register_sequence(adev,
 						golden_settings_gc_10_1_nv14,
 						(const u32)ARRAY_SIZE(golden_settings_gc_10_1_nv14));
-		soc15_program_register_sequence(adev,
-						golden_settings_gc_rlc_spm_10_1_nv14,
-						(const u32)ARRAY_SIZE(golden_settings_gc_rlc_spm_10_1_nv14));
 		break;
 	case CHIP_NAVI12:
 		soc15_program_register_sequence(adev,
@@ -3339,9 +3356,6 @@ static void gfx_v10_0_init_golden_registers(struct amdgpu_device *adev)
 		soc15_program_register_sequence(adev,
 						golden_settings_gc_10_1_2_nv12,
 						(const u32)ARRAY_SIZE(golden_settings_gc_10_1_2_nv12));
-		soc15_program_register_sequence(adev,
-
RE: [PATCH] drm/amdgpu: annotate a false positive recursive locking

2020-08-06 Thread Li, Dennis
[AMD Public Use]

> [SNIP]
>>> I think it is a limitation of init_rwsem.
>> And exactly that's wrong, this is intentional and perfectly correct.
>>
>> [Dennis Li] I couldn't understand. Why is it perfectly correct?
>> For example, we define two rw_sems: a and b. If we don't check the
>> init_rwsem definition, we may think case #1 and case #2 have the same
>> behavior, but in fact they are different.
>>
>> Case 1:
>> init_rwsem(&a);
>> init_rwsem(&b);
>>
>> Case 2:
>> void init_rwsem_ext(rw_sem *sem)
>> {
>>	init_rwsem(sem);
>> }
>> init_rwsem_ext(&a);
>> init_rwsem_ext(&b);
>>
>> As far as I know it is perfectly possible that the locks in the hive are not 
>> always grabbed in the same order. And that's why lockdep is complaining 
>> about this.
>> [Dennis Li] No. Take a hive with two devices (a and b) to explain why
>> lockdep complains.
>>
>> // First the driver locks the reset_sem of all devices:
>> down_write(&a->reset_sem); do_suspend(a);
>> down_write(&b->reset_sem);   // Because b shares the same lock_class_key
>> with a, lockdep takes a and b as the same rw_sem and complains here.
>> do_suspend(b);
>>
>> // do recovery
>> do_hive_recovery()
>>
>> // unlock the reset_sem of all devices
>> do_resume(a); up_write(&a->reset_sem);
>> do_resume(b); up_write(&b->reset_sem);
> Ah! Now I understand what you are working around. So the problem is the 
> static lock_class_key in the macro?
> [Dennis Li] Yes. The author of init_rwsem might not have considered a use
> case like ours.
>
>> What we should do instead is to make sure we have only a single lock for the 
>> complete hive instead.
>> [Dennis Li] If we use a single lock, users must wait for all devices to
>> resume successfully, even though their tasks are only running on device a.
>> It is not friendly to users.
> Well that is intentional I would say. We can only restart submissions after 
> all devices are resumed successfully, cause otherwise it can happen that a 
> job on device A depends on memory on device B which isn't accessible.
> [Dennis Li] Yes, you are right. The driver must make sure that the shared
> resources (such as the shared memory) are ready before unlocking the lock of
> each adev one by one. But each device still has private hardware resources
> such as video and display.

Yeah, but those are negligible and we have a team working on display support 
for XGMI. So this will sooner or later become a problem as well.

I still think that a single rwsem for the whole hive is still the best option 
here.

[Dennis Li] For your above concern, we can open a new thread to discuss it. If
we decide to use a single lock after discussing, we can create another
patch for it.

Best Regards
Dennis Li
>
> Regards,
> Christian.
>
>> Regards,
>> Christian.
>>
On 06.08.20 at 11:15, Li, Dennis wrote:
>>> [AMD Official Use Only - Internal Distribution Only]
>>>
>>> Hi, Christian,
>>> For this case, it is safe to use separate lock keys. Please see the
>>> definition of init_rwsem below. Every init_rwsem call site uses a new
>>> static key, but the devices of the hive share the same code to do the
>>> initialization, so their lock_class_keys are the same. I think it is a
>>> limitation of init_rwsem. In our case, it should be correct that the
>>> reset_sem of each adev has a different lock_class_key. BTW, this change
>>> doesn't affect deadlock detection and just corrects it.
>>>
>>> #define init_rwsem(sem) \
>>> do {\
>>> static struct lock_class_key __key; \
>>> \
>>> __init_rwsem((sem), #sem, &__key);  \
>>> } while (0)
>>>
>>> Best Regards
>>> Dennis Li
>>> -Original Message-
>>> From: Koenig, Christian 
>>> Sent: Thursday, August 6, 2020 4:13 PM
>>> To: Li, Dennis ; amd-gfx@lists.freedesktop.org; 
>>> Deucher, Alexander ; Kuehling, Felix 
>>> ; Zhang, Hawking 
>>> Subject: Re: [PATCH] drm/amdgpu: annotate a false positive recursive 
>>> locking
>>>
>>> Preventing locking problems during implementation is obviously a good 
>>> approach, but lockdep has proven to be massively useful for finding and 
>>> fixing problems.
>>>
>>> Disabling lockdep splat by annotating lock with separate classes is usually 
>>> a no-go and only allowed if there is no other potential approach.
>>>
>>> In this case here we should really clean things up instead.
>>>
>>> Christian.
>>>
On 06.08.20 at 09:44, Li, Dennis wrote:
 [AMD Official Use Only - Internal Distribution Only]

 Hi, Christian,
   I agree with your concern. However, we shouldn't rely on the system
 to detect deadlocks, and should consider this when implementing the
 code. In fact, deadlock detection isn't enabled by default.
   For your proposal to move reset_sem into the hive structure,
 we can open a new topic to discuss it. Currently we 
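
For context: the init_rwsem() limitation discussed above comes from its one
static lock_class_key per call site, so every rwsem initialized from the same
line of code shares a single lockdep class. A minimal sketch of the kind of
per-device annotation being debated -- the structure and helper names here are
illustrative, not the actual patch -- could look like this:

	#include <linux/lockdep.h>
	#include <linux/rwsem.h>

	struct my_adev {
		struct rw_semaphore reset_sem;
		/* one key per device, so lockdep can tell the locks apart */
		struct lock_class_key reset_sem_key;
	};

	static void my_adev_init_reset_sem(struct my_adev *adev)
	{
		/* keys embedded in dynamic objects must be registered */
		lockdep_register_key(&adev->reset_sem_key);

		/* init_rwsem() can't be used here: its static __key would
		 * be shared by every adev initialized from this call site.
		 */
		__init_rwsem(&adev->reset_sem, "adev->reset_sem",
			     &adev->reset_sem_key);
	}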

Re: [PATCH 3/3] drm/radeon: drop superflous AGP handling

2020-08-06 Thread Dave Airlie
On Thu, 6 Aug 2020 at 23:51, Christian König
 wrote:
>
> The object flags created in radeon_ttm_placement_from_domain take care that
> we use the correct caching for AGP; this is just superfluous.
>
> Signed-off-by: Christian König 

Reviewed-by: Dave Airlie 

> ---
>  drivers/gpu/drm/radeon/radeon_ttm.c | 21 ++---
>  1 file changed, 2 insertions(+), 19 deletions(-)
>
> diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c 
> b/drivers/gpu/drm/radeon/radeon_ttm.c
> index 31f4cf211b6a..290c8b479853 100644
> --- a/drivers/gpu/drm/radeon/radeon_ttm.c
> +++ b/drivers/gpu/drm/radeon/radeon_ttm.c
> @@ -76,26 +76,9 @@ static int radeon_ttm_init_vram(struct radeon_device *rdev)
>
>  static int radeon_ttm_init_gtt(struct radeon_device *rdev)
>  {
> -   uint32_t available_caching, default_caching;
> -
> -   available_caching = TTM_PL_MASK_CACHING;
> -   default_caching = TTM_PL_FLAG_CACHED;
> -
> -#if IS_ENABLED(CONFIG_AGP)
> -   if (rdev->flags & RADEON_IS_AGP) {
> -   if (!rdev->ddev->agp) {
> -   DRM_ERROR("AGP is not enabled\n");
> -   return -EINVAL;
> -   }
> -   available_caching = TTM_PL_FLAG_UNCACHED |
> -   TTM_PL_FLAG_WC;
> -   default_caching = TTM_PL_FLAG_WC;
> -   }
> -#endif
> -
> 	return ttm_range_man_init(&rdev->mman.bdev, TTM_PL_TT,
> - available_caching,
> - default_caching, true,
> + TTM_PL_MASK_CACHING,
> + TTM_PL_FLAG_CACHED, true,
>   rdev->mc.gtt_size >> PAGE_SHIFT);
>  }
>
> --
> 2.17.1
>
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH 2/3] drm/ttm: give resource functions their own [ch] files

2020-08-06 Thread Dave Airlie
On Thu, 6 Aug 2020 at 23:51, Christian König
 wrote:
>
> This is a separate object we work with in TTM.

Reviewed-by: Dave Airlie 
>
> Signed-off-by: Christian König 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c |   2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c|   8 +-
>  drivers/gpu/drm/nouveau/nouveau_bo.c   |   4 +-
>  drivers/gpu/drm/radeon/radeon_ttm.c|   4 +-
>  drivers/gpu/drm/ttm/Makefile   |   3 +-
>  drivers/gpu/drm/ttm/ttm_bo.c   | 124 +-
>  drivers/gpu/drm/ttm/ttm_bo_util.c  |   4 +-
>  drivers/gpu/drm/ttm/ttm_resource.c | 151 
>  include/drm/ttm/ttm_bo_api.h   |  70 +-
>  include/drm/ttm/ttm_bo_driver.h| 189 ---
>  include/drm/ttm/ttm_resource.h | 263 +
>  11 files changed, 446 insertions(+), 376 deletions(-)
>  create mode 100644 drivers/gpu/drm/ttm/ttm_resource.c
>  create mode 100644 include/drm/ttm/ttm_resource.h
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> index 43f4966331dd..b36d94f57d42 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> @@ -381,7 +381,7 @@ int amdgpu_bo_create_kernel_at(struct amdgpu_device *adev,
> if (cpu_addr)
> amdgpu_bo_kunmap(*bo_ptr);
>
> -   ttm_bo_mem_put(&(*bo_ptr)->tbo, &(*bo_ptr)->tbo.mem);
> +   ttm_resource_free(&(*bo_ptr)->tbo, &(*bo_ptr)->tbo.mem);
>
> for (i = 0; i < (*bo_ptr)->placement.num_placement; ++i) {
> (*bo_ptr)->placements[i].fpfn = offset >> PAGE_SHIFT;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> index 67d2eef2f9eb..462402fcd1a4 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> @@ -578,7 +578,7 @@ static int amdgpu_move_vram_ram(struct ttm_buffer_object 
> *bo, bool evict,
> /* move BO (in tmp_mem) to new_mem */
> r = ttm_bo_move_ttm(bo, ctx, new_mem);
>  out_cleanup:
> -	ttm_bo_mem_put(bo, &tmp_mem);
> +	ttm_resource_free(bo, &tmp_mem);
> return r;
>  }
>
> @@ -625,7 +625,7 @@ static int amdgpu_move_ram_vram(struct ttm_buffer_object 
> *bo, bool evict,
> goto out_cleanup;
> }
>  out_cleanup:
> -	ttm_bo_mem_put(bo, &tmp_mem);
> +	ttm_resource_free(bo, &tmp_mem);
> return r;
>  }
>
> @@ -1203,11 +1203,11 @@ int amdgpu_ttm_alloc_gart(struct ttm_buffer_object 
> *bo)
> gtt->offset = (u64)tmp.start << PAGE_SHIFT;
> r = amdgpu_ttm_gart_bind(adev, bo, flags);
> if (unlikely(r)) {
> -		ttm_bo_mem_put(bo, &tmp);
> +		ttm_resource_free(bo, &tmp);
> return r;
> }
>
> -	ttm_bo_mem_put(bo, &bo->mem);
> +	ttm_resource_free(bo, &bo->mem);
> bo->mem = tmp;
> }
>
> diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c 
> b/drivers/gpu/drm/nouveau/nouveau_bo.c
> index 604a74323696..29d7d7e100f7 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_bo.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
> @@ -1191,7 +1191,7 @@ nouveau_bo_move_flipd(struct ttm_buffer_object *bo, 
> bool evict, bool intr,
>
> 	ret = ttm_bo_move_ttm(bo, &ctx, new_reg);
>  out:
> -	ttm_bo_mem_put(bo, &tmp_reg);
> +	ttm_resource_free(bo, &tmp_reg);
> return ret;
>  }
>
> @@ -1227,7 +1227,7 @@ nouveau_bo_move_flips(struct ttm_buffer_object *bo, 
> bool evict, bool intr,
> goto out;
>
>  out:
> -	ttm_bo_mem_put(bo, &tmp_reg);
> +	ttm_resource_free(bo, &tmp_reg);
> return ret;
>  }
>
> diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c 
> b/drivers/gpu/drm/radeon/radeon_ttm.c
> index 3355b69b13d1..31f4cf211b6a 100644
> --- a/drivers/gpu/drm/radeon/radeon_ttm.c
> +++ b/drivers/gpu/drm/radeon/radeon_ttm.c
> @@ -271,7 +271,7 @@ static int radeon_move_vram_ram(struct ttm_buffer_object 
> *bo,
> }
> 	r = ttm_bo_move_ttm(bo, &ctx, new_mem);
>  out_cleanup:
> -	ttm_bo_mem_put(bo, &tmp_mem);
> +	ttm_resource_free(bo, &tmp_mem);
> return r;
>  }
>
> @@ -309,7 +309,7 @@ static int radeon_move_ram_vram(struct ttm_buffer_object 
> *bo,
> goto out_cleanup;
> }
>  out_cleanup:
> -	ttm_bo_mem_put(bo, &tmp_mem);
> +	ttm_resource_free(bo, &tmp_mem);
> return r;
>  }
>
> diff --git a/drivers/gpu/drm/ttm/Makefile b/drivers/gpu/drm/ttm/Makefile
> index e54326e6cea4..90c0da88cc98 100644
> --- a/drivers/gpu/drm/ttm/Makefile
> +++ b/drivers/gpu/drm/ttm/Makefile
> @@ -4,7 +4,8 @@
>
>  ttm-y := ttm_memory.o ttm_tt.o ttm_bo.o \
> ttm_bo_util.o ttm_bo_vm.o ttm_module.o \
> -   ttm_execbuf_util.o ttm_page_alloc.o ttm_range_manager.o
> +   ttm_execbuf_util.o ttm_page_alloc.o ttm_range_manager.o \
> +   ttm_resource.o
ttm-$(CONFIG_AGP) += ttm_agp_backend.o

Re: [PATCH] drm/amdgpu: Skip some registers config for SRIOV

2020-08-06 Thread Luben Tuikov
On 2020-08-06 3:40 a.m., Liu ChengZhe wrote:
> For VF, registers L2_CNTL, L2_CONTEXT1_IDENTITY_APERTURE
> L2_PROTECTION_FAULT_CNTL are not accesible, skip the

Spelling: "accessible".

I'd rather move the "For VF" to the end, after "accessible",
like this. I also believe it should be "to VF" as opposed
to "for VF". Both are correct, but mean different things.
Maybe like this:

Some registers are not accessible to virtual function
setup, so skip their initialization when in VF-SRIOV mode.

> configuration for them in SRIOV mode.
> 
> Signed-off-by: Liu ChengZhe 
> ---
>  drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c | 12 ++--
>  drivers/gpu/drm/amd/amdgpu/mmhub_v2_0.c  | 12 ++--
>  2 files changed, 20 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c 
> b/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c
> index 1f6112b7fa49..6b96f45fde2a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c
> @@ -330,10 +330,13 @@ int gfxhub_v2_1_gart_enable(struct amdgpu_device *adev)
>   gfxhub_v2_1_init_gart_aperture_regs(adev);
>   gfxhub_v2_1_init_system_aperture_regs(adev);
>   gfxhub_v2_1_init_tlb_regs(adev);
> - gfxhub_v2_1_init_cache_regs(adev);
> + if (!amdgpu_sriov_vf(adev))
> + gfxhub_v2_1_init_cache_regs(adev);

Yes, that's the literal meaning of the commit description,
but it would be cleaner if gfxhub_v2_1_init_cache_regs(adev)
did that check instead. (Some might even argue it'd be more
object-oriented...)

So then, this code would look like a sequence of
statements, unchanged, and each method of initialization
would know/check whether it is applicable to "adev".
It also makes it more centralized, less code duplication
and less code clutter.
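
A minimal sketch of that restructuring (reusing the driver's existing
amdgpu_sriov_vf() helper; the register programming body is elided here):

	static void gfxhub_v2_1_init_cache_regs(struct amdgpu_device *adev)
	{
		/* These registers are not accessible to VF-SRIOV.
		 * The PF will program them instead.
		 */
		if (amdgpu_sriov_vf(adev))
			return;

		/* ... existing cache register programming ... */
	}

The call site in gfxhub_v2_1_gart_enable() then remains an unconditional
sequence of init calls, and each helper decides for itself whether it
applies to "adev".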

>  
>   gfxhub_v2_1_enable_system_domain(adev);
> - gfxhub_v2_1_disable_identity_aperture(adev);
> + if (!amdgpu_sriov_vf(adev))
> + gfxhub_v2_1_disable_identity_aperture(adev);
> +

Ditto.

>   gfxhub_v2_1_setup_vmid_config(adev);
>   gfxhub_v2_1_program_invalidation(adev);
>  
> @@ -372,7 +375,12 @@ void gfxhub_v2_1_gart_disable(struct amdgpu_device *adev)
>  void gfxhub_v2_1_set_fault_enable_default(struct amdgpu_device *adev,
> bool value)
>  {
> + /*These regs are not accessible for VF, PF will program in SRIOV */

Add a space after the asterisk and before the beginning of the sentence.
Format the comment in one of the two acceptable Linux Kernel Coding styles.
End the sentence with a full stop. It also helps to spell out "registers".
Like this perhaps:

/* These registers are not accessible to VF-SRIOV.
 * The PF will program them instead.
 */
> + if (amdgpu_sriov_vf(adev))
> + return;
> +
>   u32 tmp;

The definition of this automatic variable should precede
all other code in this function. The compiler will optimize
it away anyway.

> +
>   tmp = RREG32_SOC15(GC, 0, mmGCVM_L2_PROTECTION_FAULT_CNTL);
>   tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
>   RANGE_PROTECTION_FAULT_ENABLE_DEFAULT, value);
> diff --git a/drivers/gpu/drm/amd/amdgpu/mmhub_v2_0.c 
> b/drivers/gpu/drm/amd/amdgpu/mmhub_v2_0.c
> index d83912901f73..9cfde9b81600 100644
> --- a/drivers/gpu/drm/amd/amdgpu/mmhub_v2_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/mmhub_v2_0.c
> @@ -321,10 +321,13 @@ int mmhub_v2_0_gart_enable(struct amdgpu_device *adev)
>   mmhub_v2_0_init_gart_aperture_regs(adev);
>   mmhub_v2_0_init_system_aperture_regs(adev);
>   mmhub_v2_0_init_tlb_regs(adev);
> - mmhub_v2_0_init_cache_regs(adev);
> + if (!amdgpu_sriov_vf(adev))
> + mmhub_v2_0_init_cache_regs(adev);

Move this check inside said function.

>  
>   mmhub_v2_0_enable_system_domain(adev);
> - mmhub_v2_0_disable_identity_aperture(adev);
> + if (!amdgpu_sriov_vf(adev))
> + mmhub_v2_0_disable_identity_aperture(adev);
> +

Ditto.

>   mmhub_v2_0_setup_vmid_config(adev);
>   mmhub_v2_0_program_invalidation(adev);
>  
> @@ -364,7 +367,12 @@ void mmhub_v2_0_gart_disable(struct amdgpu_device *adev)
>   */
>  void mmhub_v2_0_set_fault_enable_default(struct amdgpu_device *adev, bool 
> value)
>  {
> + /*These regs are not accessible for VF, PF will program in SRIOV */

You can duplicate this comment from the one above.

> + if (amdgpu_sriov_vf(adev))
> + return;
> +
>   u32 tmp;

Should be the first thing in the function body, as noted above.

Regards,
Luben

> +
>   tmp = RREG32_SOC15(MMHUB, 0, mmMMVM_L2_PROTECTION_FAULT_CNTL);
>   tmp = REG_SET_FIELD(tmp, MMVM_L2_PROTECTION_FAULT_CNTL,
>   RANGE_PROTECTION_FAULT_ENABLE_DEFAULT, value);
> 

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH v2] drm/amdgpu: unlock mutex on error

2020-08-06 Thread Luben Tuikov
Yes, that's fine. Thanks for the fix.

Reviewed-by: Luben Tuikov 

On 2020-08-05 9:31 p.m., Dennis Li wrote:
> Make sure to unlock the mutex when an error happens
> 
> v2:
> 1. correct syntax error in the commit comment
> 2. remove change-Id
> 
> Acked-by: Nirmoy Das 
> Signed-off-by: Dennis Li 
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> index a0ea663ecdbc..5e5369abc6fa 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> @@ -632,13 +632,14 @@ int amdgpu_amdkfd_submit_ib(struct kgd_dev *kgd, enum 
> kgd_engine_type engine,
>   }
>  
>   ret = amdgpu_ib_schedule(ring, 1, ib, job, &f);
> +
> + up_read(&adev->reset_sem);
> +
>   if (ret) {
>   DRM_ERROR("amdgpu: failed to schedule IB.\n");
>   goto err_ib_sched;
>   }
>  
> - up_read(&adev->reset_sem);
> -
>   ret = dma_fence_wait(f, false);
>  
>  err_ib_sched:
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> index 4e017f379eb6..67a756f4337b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> @@ -545,7 +545,7 @@ int amdgpu_xgmi_add_device(struct amdgpu_device *adev)
>   }
>   ret = amdgpu_xgmi_update_topology(hive, tmp_adev);
>   if (ret)
> - goto exit;
> + goto exit_unlock;
>   }
>  
>   /* get latest topology info for each device from psp */
> @@ -558,7 +558,7 @@ int amdgpu_xgmi_add_device(struct amdgpu_device *adev)
>   tmp_adev->gmc.xgmi.node_id,
>   tmp_adev->gmc.xgmi.hive_id, ret);
>   /* To do : continue with some node failed or 
> disable the whole hive */
> - goto exit;
> + goto exit_unlock;
>   }
>   }
>   }
> @@ -566,7 +566,7 @@ int amdgpu_xgmi_add_device(struct amdgpu_device *adev)
>   if (!ret)
>   ret = amdgpu_xgmi_sysfs_add_dev_info(adev, hive);
>  
> -
> +exit_unlock:
>   mutex_unlock(&hive->hive_lock);
>  exit:
>   if (!ret)
> 

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH 5/7] drm/amd/display: Reset plane for anything that's not a FAST update

2020-08-06 Thread Kazlauskas, Nicholas

On 2020-08-05 4:45 p.m., Rodrigo Siqueira wrote:

On 07/30, Nicholas Kazlauskas wrote:

[Why]
MEDIUM or FULL updates can require global validation or affect
bandwidth. By treating these all simply as surface updates we aren't
actually passing this through DC global validation.

[How]
There's currently no way to pass surface updates through DC global
validation, nor do I think it's a good idea to change the interface
to accept these.

DC global validation itself is currently stateless, and we can move
our update type checking to be stateless as well by duplicating DC
surface checks in DM based on DRM properties.

We wanted to rely on DC automatically determining this since DC knows
best, but DM is ultimately what fills in everything into DC plane
state so it does need to know as well.

There are basically only three paths that we exercise in DM today:

1) Cursor (async update)
2) Pageflip (fast update)
3) Full pipe programming (medium/full updates)

Which means that anything that's more than a pageflip really needs to
go down path #3.

So this change duplicates all the surface update checks based on DRM
state instead inside of should_reset_plane().

Next step is dropping dm_determine_update_type_for_commit and we no
longer require the old DC state at all for global validation.

Optimization can come later so we don't reset DC planes at all for
MEDIUM updates and avoid validation, but we might require some extra
checks in DM to achieve this.


How about adding this optimization description to our TODO list
under the display folder?

Reviewed-by: Rodrigo Siqueira 


Sure, I'll make another patch to clean up some of the TODO items in the 
text file.


Regards,
Nicholas Kazlauskas

  

Cc: Bhawanpreet Lakha 
Cc: Hersen Wu 
Signed-off-by: Nicholas Kazlauskas 
---
  .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 25 +++
  1 file changed, 25 insertions(+)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 0d5f45742bb5..2cbb29199e61 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -8336,6 +8336,31 @@ static bool should_reset_plane(struct drm_atomic_state *state,
if (old_other_state->crtc != new_other_state->crtc)
return true;
  
+		/* Src/dst size and scaling updates. */

+   if (old_other_state->src_w != new_other_state->src_w ||
+   old_other_state->src_h != new_other_state->src_h ||
+   old_other_state->crtc_w != new_other_state->crtc_w ||
+   old_other_state->crtc_h != new_other_state->crtc_h)
+   return true;
+
+   /* Rotation / mirroring updates. */
+   if (old_other_state->rotation != new_other_state->rotation)
+   return true;
+
+   /* Blending updates. */
+   if (old_other_state->pixel_blend_mode !=
+   new_other_state->pixel_blend_mode)
+   return true;
+
+   /* Alpha updates. */
+   if (old_other_state->alpha != new_other_state->alpha)
+   return true;
+
+   /* Colorspace changes. */
+		if (old_other_state->color_range != new_other_state->color_range ||
+		    old_other_state->color_encoding != new_other_state->color_encoding)
+			return true;
+
/* Framebuffer checks fall at the end. */
if (!old_other_state->fb || !new_other_state->fb)
continue;
--
2.25.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx




___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH 2/7] drm/amd/display: Reset plane when tiling flags change

2020-08-06 Thread Kazlauskas, Nicholas

On 2020-08-05 5:11 p.m., Rodrigo Siqueira wrote:

On 07/30, Nicholas Kazlauskas wrote:

[Why]
Enabling or disabling DCC or switching between tiled and linear formats
can require bandwidth updates.

They're currently skipping all DC validation by being treated as purely
surface updates.

[How]
Treat tiling_flag changes (which encode DCC state) as a condition for
resetting the plane.

Cc: Bhawanpreet Lakha 
Cc: Rodrigo Siqueira 
Cc: Hersen Wu 
Signed-off-by: Nicholas Kazlauskas 
---
  .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 19 ---
  1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 7cc5ab90ce13..bf1881bd492c 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -8332,6 +8332,8 @@ static bool should_reset_plane(struct drm_atomic_state *state,
 	 * TODO: Come up with a more elegant solution for this.
 	 */
 	for_each_oldnew_plane_in_state(state, other, old_other_state, new_other_state, i) {
+   struct dm_plane_state *old_dm_plane_state, *new_dm_plane_state;
+
if (other->type == DRM_PLANE_TYPE_CURSOR)
continue;
  
@@ -8342,9 +8344,20 @@ static bool should_reset_plane(struct drm_atomic_state *state,

if (old_other_state->crtc != new_other_state->crtc)
return true;
  
-		/* TODO: Remove this once we can handle fast format changes. */

-   if (old_other_state->fb && new_other_state->fb &&
-   old_other_state->fb->format != new_other_state->fb->format)
+   /* Framebuffer checks fall at the end. */
+   if (!old_other_state->fb || !new_other_state->fb)
+   continue;
+
+   /* Pixel format changes can require bandwidth updates. */
+   if (old_other_state->fb->format != new_other_state->fb->format)
+   return true;
+
+   old_dm_plane_state = to_dm_plane_state(old_other_state);
+   new_dm_plane_state = to_dm_plane_state(new_other_state);
+
+   /* Tiling and DCC changes also require bandwidth updates. */
+   if (old_dm_plane_state->tiling_flags !=
+   new_dm_plane_state->tiling_flags)


Why not add a case when we move to a TMZ area?

Reviewed-by: Rodrigo Siqueira 


TMZ doesn't affect DML calculations or validation in this case so we can 
safely skip it.


Regards,
Nicholas Kazlauskas




return true;
}
  
--

2.25.1





___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] drm/amdgpu: don't create entity when use cpu to update page table

2020-08-06 Thread Koenig, Christian
NAK, we also use the entity context number for debugging.

In addition to this, the entities should not need any additional resources, so
the functions are only initializing fields.

Regards,
Christian.


On 06.08.2020 18:06, "Wang, Kevin(Yang)" wrote:
the entity isn't needed when the VM uses the CPU to update page tables.

Signed-off-by: Kevin Wang 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 45 ++
 1 file changed, 25 insertions(+), 20 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 71e005cf2952..e15c29d613d9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -2802,20 +2802,6 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm,
 	spin_lock_init(&vm->invalidated_lock);
 	INIT_LIST_HEAD(&vm->freed);

-
-	/* create scheduler entities for page table updates */
-	r = drm_sched_entity_init(&vm->immediate, DRM_SCHED_PRIORITY_NORMAL,
-				  adev->vm_manager.vm_pte_scheds,
-				  adev->vm_manager.vm_pte_num_scheds, NULL);
-	if (r)
-		return r;
-
-	r = drm_sched_entity_init(&vm->delayed, DRM_SCHED_PRIORITY_NORMAL,
-				  adev->vm_manager.vm_pte_scheds,
-				  adev->vm_manager.vm_pte_num_scheds, NULL);
-	if (r)
-		goto error_free_immediate;
-
 vm->pte_support_ats = false;
 vm->is_compute_context = false;

@@ -2835,10 +2821,25 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm,
 					!amdgpu_gmc_vram_full_visible(&adev->gmc)),
 		  "CPU update of VM recommended only for large BAR system\n");

-	if (vm->use_cpu_for_update)
+	if (vm->use_cpu_for_update) {
 		vm->update_funcs = &amdgpu_vm_cpu_funcs;
-	else
+	} else {
+		/* create scheduler entities for page table updates */
+		r = drm_sched_entity_init(&vm->immediate, DRM_SCHED_PRIORITY_NORMAL,
+					  adev->vm_manager.vm_pte_scheds,
+					  adev->vm_manager.vm_pte_num_scheds, NULL);
+		if (r)
+			return r;
+
+		r = drm_sched_entity_init(&vm->delayed, DRM_SCHED_PRIORITY_NORMAL,
+					  adev->vm_manager.vm_pte_scheds,
+					  adev->vm_manager.vm_pte_num_scheds, NULL);
+		if (r)
+			goto error_free_immediate;
+
 		vm->update_funcs = &amdgpu_vm_sdma_funcs;
+	}
+
 	vm->last_update = NULL;
 	vm->last_unlocked = dma_fence_get_stub();

@@ -2895,10 +2896,12 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm,

 error_free_delayed:
 	dma_fence_put(vm->last_unlocked);
-	drm_sched_entity_destroy(&vm->delayed);
+	if (!vm->use_cpu_for_update)
+		drm_sched_entity_destroy(&vm->delayed);

 error_free_immediate:
-	drm_sched_entity_destroy(&vm->immediate);
+	if (!vm->use_cpu_for_update)
+		drm_sched_entity_destroy(&vm->immediate);

 	return r;
 }
@@ -3120,8 +3123,10 @@ void amdgpu_vm_fini(struct amdgpu_device *adev, struct amdgpu_vm *vm)
 	amdgpu_bo_unref(&root);
 	WARN_ON(vm->root.base.bo);

-	drm_sched_entity_destroy(&vm->immediate);
-	drm_sched_entity_destroy(&vm->delayed);
+	if (!vm->use_cpu_for_update) {
+		drm_sched_entity_destroy(&vm->immediate);
+		drm_sched_entity_destroy(&vm->delayed);
+	}

 	if (!RB_EMPTY_ROOT(&vm->va.rb_root)) {
 		dev_err(adev->dev, "still active bo inside vm\n");
--
2.27.0


___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] drm/amdgpu: fix uninit-value in arcturus_log_thermal_throttling_event()

2020-08-06 Thread Alex Deucher
On Thu, Aug 6, 2020 at 11:59 AM Kevin Wang  wrote:
>
> When the arcturus_get_smu_metrics_data() call fails, the variable
> "throttler_status" is not initialized before use.
>
> warning:
> powerplay/arcturus_ppt.c:2268:24: warning: ‘throttler_status’ may be used 
> uninitialized in this function [-Wmaybe-uninitialized]
>  2268 |   if (throttler_status & logging_label[throttler_idx].feature_mask) {
>
> Signed-off-by: Kevin Wang 

Reviewed-by: Alex Deucher 

> ---
>  drivers/gpu/drm/amd/powerplay/arcturus_ppt.c | 9 ++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c 
> b/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c
> index e58af84433c7..946dbc0db1b1 100644
> --- a/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c
> +++ b/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c
> @@ -2253,14 +2253,17 @@ static const struct throttling_logging_label {
>  };
>  static void arcturus_log_thermal_throttling_event(struct smu_context *smu)
>  {
> +   int ret;
> int throttler_idx, throtting_events = 0, buf_idx = 0;
> struct amdgpu_device *adev = smu->adev;
> uint32_t throttler_status;
> char log_buf[256];
>
> -	arcturus_get_smu_metrics_data(smu,
> -				      METRICS_THROTTLER_STATUS,
> -				      &throttler_status);
> +	ret = arcturus_get_smu_metrics_data(smu,
> +					    METRICS_THROTTLER_STATUS,
> +					    &throttler_status);
> +   if (ret)
> +   return;
>
> memset(log_buf, 0, sizeof(log_buf));
> for (throttler_idx = 0; throttler_idx < ARRAY_SIZE(logging_label);
> --
> 2.27.0
>
> ___
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH] drm/amdgpu: don't create entity when use cpu to update page table

2020-08-06 Thread Kevin Wang
the entity isn't needed when the VM uses the CPU to update page tables.

Signed-off-by: Kevin Wang 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 45 ++
 1 file changed, 25 insertions(+), 20 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 71e005cf2952..e15c29d613d9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -2802,20 +2802,6 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm,
 	spin_lock_init(&vm->invalidated_lock);
 	INIT_LIST_HEAD(&vm->freed);
 
-
-	/* create scheduler entities for page table updates */
-	r = drm_sched_entity_init(&vm->immediate, DRM_SCHED_PRIORITY_NORMAL,
-				  adev->vm_manager.vm_pte_scheds,
-				  adev->vm_manager.vm_pte_num_scheds, NULL);
-	if (r)
-		return r;
-
-	r = drm_sched_entity_init(&vm->delayed, DRM_SCHED_PRIORITY_NORMAL,
-				  adev->vm_manager.vm_pte_scheds,
-				  adev->vm_manager.vm_pte_num_scheds, NULL);
-	if (r)
-		goto error_free_immediate;
-
vm->pte_support_ats = false;
vm->is_compute_context = false;
 
@@ -2835,10 +2821,25 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm,
 					!amdgpu_gmc_vram_full_visible(&adev->gmc)),
 		  "CPU update of VM recommended only for large BAR system\n");

-	if (vm->use_cpu_for_update)
+	if (vm->use_cpu_for_update) {
 		vm->update_funcs = &amdgpu_vm_cpu_funcs;
-	else
+	} else {
+		/* create scheduler entities for page table updates */
+		r = drm_sched_entity_init(&vm->immediate, DRM_SCHED_PRIORITY_NORMAL,
+					  adev->vm_manager.vm_pte_scheds,
+					  adev->vm_manager.vm_pte_num_scheds, NULL);
+		if (r)
+			return r;
+
+		r = drm_sched_entity_init(&vm->delayed, DRM_SCHED_PRIORITY_NORMAL,
+					  adev->vm_manager.vm_pte_scheds,
+					  adev->vm_manager.vm_pte_num_scheds, NULL);
+		if (r)
+			goto error_free_immediate;
+
 		vm->update_funcs = &amdgpu_vm_sdma_funcs;
+	}
+
 	vm->last_update = NULL;
 	vm->last_unlocked = dma_fence_get_stub();
 
@@ -2895,10 +2896,12 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm,

 error_free_delayed:
 	dma_fence_put(vm->last_unlocked);
-	drm_sched_entity_destroy(&vm->delayed);
+	if (!vm->use_cpu_for_update)
+		drm_sched_entity_destroy(&vm->delayed);

 error_free_immediate:
-	drm_sched_entity_destroy(&vm->immediate);
+	if (!vm->use_cpu_for_update)
+		drm_sched_entity_destroy(&vm->immediate);

 	return r;
 }
@@ -3120,8 +3123,10 @@ void amdgpu_vm_fini(struct amdgpu_device *adev, struct amdgpu_vm *vm)
 	amdgpu_bo_unref(&root);
 	WARN_ON(vm->root.base.bo);

-	drm_sched_entity_destroy(&vm->immediate);
-	drm_sched_entity_destroy(&vm->delayed);
+	if (!vm->use_cpu_for_update) {
+		drm_sched_entity_destroy(&vm->immediate);
+		drm_sched_entity_destroy(&vm->delayed);
+	}

 	if (!RB_EMPTY_ROOT(&vm->va.rb_root)) {
 		dev_err(adev->dev, "still active bo inside vm\n");
-- 
2.27.0

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH] drm/amdgpu: fix uninit-value in arcturus_log_thermal_throttling_event()

2020-08-06 Thread Kevin Wang
When the arcturus_get_smu_metrics_data() call fails, the variable
"throttler_status" is not initialized before use.

warning:
powerplay/arcturus_ppt.c:2268:24: warning: ‘throttler_status’ may be used 
uninitialized in this function [-Wmaybe-uninitialized]
 2268 |   if (throttler_status & logging_label[throttler_idx].feature_mask) {

Signed-off-by: Kevin Wang 
---
 drivers/gpu/drm/amd/powerplay/arcturus_ppt.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c 
b/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c
index e58af84433c7..946dbc0db1b1 100644
--- a/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c
+++ b/drivers/gpu/drm/amd/powerplay/arcturus_ppt.c
@@ -2253,14 +2253,17 @@ static const struct throttling_logging_label {
 };
 static void arcturus_log_thermal_throttling_event(struct smu_context *smu)
 {
+   int ret;
int throttler_idx, throtting_events = 0, buf_idx = 0;
struct amdgpu_device *adev = smu->adev;
uint32_t throttler_status;
char log_buf[256];
 
-	arcturus_get_smu_metrics_data(smu,
-				      METRICS_THROTTLER_STATUS,
-				      &throttler_status);
+	ret = arcturus_get_smu_metrics_data(smu,
+					    METRICS_THROTTLER_STATUS,
+					    &throttler_status);
+   if (ret)
+   return;
 
memset(log_buf, 0, sizeof(log_buf));
for (throttler_idx = 0; throttler_idx < ARRAY_SIZE(logging_label);
-- 
2.27.0

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH 7/7] drm/amd/display: Replace DRM private objects with subclassed DRM atomic state

2020-08-06 Thread Kazlauskas, Nicholas

On 2020-08-05 4:37 p.m., Rodrigo Siqueira wrote:

Hi,

I have some minor inline comments, but everything looks fine when I
tested it on Raven; feel free to add

Tested-by: Rodrigo Siqueira 

in the whole series.


Thanks for the reviews!

I can clean up the nitpicks for this patch and make a v2.

Regards,
Nicholas Kazlauskas



On 07/30, Nicholas Kazlauskas wrote:

[Why]
DM atomic check was structured in a way that we required old DC state
in order to dynamically add and remove planes and streams from the
context to build the DC state context for validation.

DRM private objects were used to carry over the last DC state and
were added to the context on nearly every commit - regardless of fast
or full so we could check whether or not the new state could affect
bandwidth.

The problem with this model is that DRM private objects do not
implicitly stall out other commits.

So if you have two commits touching separate DRM objects they could
run concurrently and potentially execute out of order - leading to a
use-after-free.

If we want this to be safe we have two options:
1. Stall out concurrent commits since they touch the same private object
2. Refactor DM to not require old DC state and drop private object usage

[How]
This implements approach #2 since it still allows for judder free
updates in multi-display scenarios.

I'll list the big changes in order at a high level:

1. Subclass DRM atomic state instead of using DRM private objects.

DC relied on the old state to determine which changes cause bandwidth
updates but now we have DM perform similar checks based on DRM state
instead - dropping the requirement for old state to exist at all.

This means that we can now build a new DC context from scratch whenever
we have something that DM thinks could affect bandwidth.

Whenever we need to rebuild bandwidth we now add all CRTCs and planes
to the DRM state in order to get the absolute set of DC streams and
DC planes.

This introduces a stall on other commits, but this stall already
exists because of the lock_and_validation logic and it's necessary
since updates may move around pipes and require full reprogramming.

2. Drop workarounds to add planes to maintain z-order early in atomic
check. This is no longer needed because of the changes for (1).

This also involves fixing up should_plane_reset checks since we can just
avoid resetting streams and planes when they haven't actually changed.

3. Rework dm_update_crtc_state and dm_update_plane_state to be single
pass instead of two pass.

This is necessary since we no longer have the dc_state to add and
remove planes to the context in and we want to defer creation to the
end of commit_check.

It also makes the logic a lot simpler to follow, as an added bonus.
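
For reference, subclassing works by embedding struct drm_atomic_state in a
driver-private struct and letting DRM allocate it through custom hooks. A
minimal sketch of the wiring, using the dm_atomic_state_alloc/free functions
added below (the other callbacks shown are illustrative):

	static const struct drm_mode_config_funcs example_mode_funcs = {
		.fb_create = amdgpu_display_user_framebuffer_create,
		.atomic_check = amdgpu_dm_atomic_check,
		.atomic_commit = drm_atomic_helper_commit,
		/* DRM builds every atomic state through these hooks, so
		 * each one is really a dm_atomic_state underneath. */
		.atomic_state_alloc = dm_atomic_state_alloc,
		.atomic_state_free = dm_atomic_state_free,
	};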

Cc: Bhawanpreet Lakha 
Cc: Harry Wentland 
Cc: Leo Li 
Cc: Daniel Vetter 
Signed-off-by: Nicholas Kazlauskas 
---
  .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 720 +++---
  .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h |  11 +-
  2 files changed, 280 insertions(+), 451 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 59829ec81694..97a7dfc620e8 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -1839,7 +1839,6 @@ static int dm_resume(void *handle)
struct drm_plane *plane;
struct drm_plane_state *new_plane_state;
struct dm_plane_state *dm_new_plane_state;
-	struct dm_atomic_state *dm_state = to_dm_atomic_state(dm->atomic_obj.state);
enum dc_connection_type new_connection_type = dc_connection_none;
struct dc_state *dc_state;
int i, r, j;
@@ -1879,11 +1878,6 @@ static int dm_resume(void *handle)
  
  		return 0;

}
-	/* Recreate dc_state - DC invalidates it when setting power state to S3. */
-	dc_release_state(dm_state->context);
-	dm_state->context = dc_create_state(dm->dc);
-	/* TODO: Remove dc_state->dccg, use dc->dccg directly. */
-	dc_resource_state_construct(dm->dc, dm_state->context);
  
  	/* Before powering on DC we need to re-initialize DMUB. */

r = dm_dmub_hw_init(adev);
@@ -2019,11 +2013,51 @@ const struct amdgpu_ip_block_version dm_ip_block =
   * *WIP*
   */
  
+struct drm_atomic_state *dm_atomic_state_alloc(struct drm_device *dev)

+{
+   struct dm_atomic_state *dm_state;
+
+   dm_state = kzalloc(sizeof(*dm_state), GFP_KERNEL);


How about using GFP_ATOMIC here?


+
+   if (!dm_state)
+   return NULL;
+
+	if (drm_atomic_state_init(dev, &dm_state->base) < 0) {
+		kfree(dm_state);
+		return NULL;
+	}
+
+	return &dm_state->base;
+}
+
+void dm_atomic_state_free(struct drm_atomic_state *state)
+{
+   struct dm_atomic_state *dm_state = to_dm_atomic_state(state);
+
+   if (dm_state->context) {
+   dc_release_state(dm_state->context);
+  

RE: [PATCH] drm/amdgpu: use mode1 reset by default for sienna_cichlid

2020-08-06 Thread Zhang, Hawking
[AMD Public Use]

Reviewed-by: Hawking Zhang 

Regards,
Hawking
-Original Message-
From: Gao, Likun  
Sent: Thursday, August 6, 2020 17:43
To: amd-gfx@lists.freedesktop.org
Cc: Zhang, Hawking ; Feng, Kenneth 
; Sheng, Wenhui ; Gao, Likun 

Subject: [PATCH] drm/amdgpu: use mode1 reset by default for sienna_cichlid

From: Likun Gao 

Switch the default GPU reset method for sienna_cichlid to MODE1 reset.

Signed-off-by: Likun Gao 
Change-Id: I775e5a66bbac474f65ca8c999136ccaf9c1dc14e
---
 drivers/gpu/drm/amd/amdgpu/nv.c | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c b/drivers/gpu/drm/amd/amdgpu/nv.c 
index 74d02d270d34..da8024c2826e 100644
--- a/drivers/gpu/drm/amd/amdgpu/nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/nv.c
@@ -362,10 +362,15 @@ nv_asic_reset_method(struct amdgpu_device *adev)
dev_warn(adev->dev, "Specified reset method:%d isn't supported, 
using AUTO instead.\n",
  amdgpu_reset_method);
 
-   if (smu_baco_is_support(smu))
-   return AMD_RESET_METHOD_BACO;
-   else
+   switch (adev->asic_type) {
+   case CHIP_SIENNA_CICHLID:
return AMD_RESET_METHOD_MODE1;
+   default:
+   if (smu_baco_is_support(smu))
+   return AMD_RESET_METHOD_BACO;
+   else
+   return AMD_RESET_METHOD_MODE1;
+   }
 }
 
 static int nv_asic_reset(struct amdgpu_device *adev)
--
2.25.1
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH 1/3] drm/ttm: rename ttm_resource_manager_func callbacks

2020-08-06 Thread Christian König
The names get/put are associated with reference counting
in the Linux kernel; use alloc/free instead.

Signed-off-by: Christian König 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c   |  4 ++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c  |  6 +++---
 drivers/gpu/drm/nouveau/nouveau_ttm.c | 12 ++--
 drivers/gpu/drm/ttm/ttm_bo.c  |  8 
 drivers/gpu/drm/ttm/ttm_range_manager.c   | 16 
 drivers/gpu/drm/vmwgfx/vmwgfx_gmrid_manager.c |  4 ++--
 drivers/gpu/drm/vmwgfx/vmwgfx_thp.c   |  4 ++--
 include/drm/ttm/ttm_bo_driver.h   | 18 +-
 8 files changed, 36 insertions(+), 36 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c
index c847a5fe94c9..4f8451bdc28c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c
@@ -324,7 +324,7 @@ static void amdgpu_gtt_mgr_debug(struct 
ttm_resource_manager *man,
 }
 
 static const struct ttm_resource_manager_func amdgpu_gtt_mgr_func = {
-   .get_node = amdgpu_gtt_mgr_new,
-   .put_node = amdgpu_gtt_mgr_del,
+   .alloc = amdgpu_gtt_mgr_new,
+   .free = amdgpu_gtt_mgr_del,
.debug = amdgpu_gtt_mgr_debug
 };
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
index 895634cbf999..044262b9ded4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
@@ -619,7 +619,7 @@ static void amdgpu_vram_mgr_debug(struct 
ttm_resource_manager *man,
 }
 
 static const struct ttm_resource_manager_func amdgpu_vram_mgr_func = {
-   .get_node   = amdgpu_vram_mgr_new,
-   .put_node   = amdgpu_vram_mgr_del,
-   .debug  = amdgpu_vram_mgr_debug
+   .alloc  = amdgpu_vram_mgr_new,
+   .free   = amdgpu_vram_mgr_del,
+   .debug  = amdgpu_vram_mgr_debug
 };
diff --git a/drivers/gpu/drm/nouveau/nouveau_ttm.c 
b/drivers/gpu/drm/nouveau/nouveau_ttm.c
index e6a30865a00b..53c6f8827322 100644
--- a/drivers/gpu/drm/nouveau/nouveau_ttm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_ttm.c
@@ -64,8 +64,8 @@ nouveau_vram_manager_new(struct ttm_resource_manager *man,
 }
 
 const struct ttm_resource_manager_func nouveau_vram_manager = {
-   .get_node = nouveau_vram_manager_new,
-   .put_node = nouveau_manager_del,
+   .alloc = nouveau_vram_manager_new,
+   .free = nouveau_manager_del,
 };
 
 static int
@@ -87,8 +87,8 @@ nouveau_gart_manager_new(struct ttm_resource_manager *man,
 }
 
 const struct ttm_resource_manager_func nouveau_gart_manager = {
-   .get_node = nouveau_gart_manager_new,
-   .put_node = nouveau_manager_del,
+   .alloc = nouveau_gart_manager_new,
+   .free = nouveau_manager_del,
 };
 
 static int
@@ -119,8 +119,8 @@ nv04_gart_manager_new(struct ttm_resource_manager *man,
 }
 
 const struct ttm_resource_manager_func nv04_gart_manager = {
-   .get_node = nv04_gart_manager_new,
-   .put_node = nouveau_manager_del,
+   .alloc = nv04_gart_manager_new,
+   .free = nouveau_manager_del,
 };
 
 int
diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index ad09329b62d3..ae71c3ab6cc4 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -846,20 +846,20 @@ static int ttm_bo_mem_get(struct ttm_buffer_object *bo,
struct ttm_resource_manager *man = ttm_manager_type(bo->bdev, 
mem->mem_type);
 
mem->mm_node = NULL;
-   if (!man->func || !man->func->get_node)
+   if (!man->func || !man->func->alloc)
return 0;
 
-   return man->func->get_node(man, bo, place, mem);
+   return man->func->alloc(man, bo, place, mem);
 }
 
 void ttm_bo_mem_put(struct ttm_buffer_object *bo, struct ttm_resource *mem)
 {
struct ttm_resource_manager *man = ttm_manager_type(bo->bdev, 
mem->mem_type);
 
-   if (!man->func || !man->func->put_node)
+   if (!man->func || !man->func->free)
return;
 
-   man->func->put_node(man, mem);
+   man->func->free(man, mem);
mem->mm_node = NULL;
mem->mem_type = TTM_PL_SYSTEM;
 }
diff --git a/drivers/gpu/drm/ttm/ttm_range_manager.c 
b/drivers/gpu/drm/ttm/ttm_range_manager.c
index 274a05ca13d3..770c8988c139 100644
--- a/drivers/gpu/drm/ttm/ttm_range_manager.c
+++ b/drivers/gpu/drm/ttm/ttm_range_manager.c
@@ -54,10 +54,10 @@ static inline struct ttm_range_manager 
*to_range_manager(struct ttm_resource_man
return container_of(man, struct ttm_range_manager, manager);
 }
 
-static int ttm_range_man_get_node(struct ttm_resource_manager *man,
- struct ttm_buffer_object *bo,
- const struct ttm_place *place,
- struct ttm_resource *mem)
+static int ttm_range_man_alloc(struct ttm_resource_manager *man,
+  struct 

[PATCH 3/3] drm/radeon: drop superflous AGP handling

2020-08-06 Thread Christian König
The object flags created in radeon_ttm_placement_from_domain take care that
we use the correct caching for AGP; this is just superfluous.

Signed-off-by: Christian König 
---
 drivers/gpu/drm/radeon/radeon_ttm.c | 21 ++---
 1 file changed, 2 insertions(+), 19 deletions(-)

diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c 
b/drivers/gpu/drm/radeon/radeon_ttm.c
index 31f4cf211b6a..290c8b479853 100644
--- a/drivers/gpu/drm/radeon/radeon_ttm.c
+++ b/drivers/gpu/drm/radeon/radeon_ttm.c
@@ -76,26 +76,9 @@ static int radeon_ttm_init_vram(struct radeon_device *rdev)
 
 static int radeon_ttm_init_gtt(struct radeon_device *rdev)
 {
-   uint32_t available_caching, default_caching;
-
-   available_caching = TTM_PL_MASK_CACHING;
-   default_caching = TTM_PL_FLAG_CACHED;
-
-#if IS_ENABLED(CONFIG_AGP)
-   if (rdev->flags & RADEON_IS_AGP) {
-   if (!rdev->ddev->agp) {
-   DRM_ERROR("AGP is not enabled\n");
-   return -EINVAL;
-   }
-   available_caching = TTM_PL_FLAG_UNCACHED |
-   TTM_PL_FLAG_WC;
-   default_caching = TTM_PL_FLAG_WC;
-   }
-#endif
-
	return ttm_range_man_init(&rdev->mman.bdev, TTM_PL_TT,
- available_caching,
- default_caching, true,
+ TTM_PL_MASK_CACHING,
+ TTM_PL_FLAG_CACHED, true,
  rdev->mc.gtt_size >> PAGE_SHIFT);
 }
 
-- 
2.17.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH 2/3] drm/ttm: give resource functions their own [ch] files

2020-08-06 Thread Christian König
This is a separate object we work with in TTM.

Signed-off-by: Christian König 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c |   2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c|   8 +-
 drivers/gpu/drm/nouveau/nouveau_bo.c   |   4 +-
 drivers/gpu/drm/radeon/radeon_ttm.c|   4 +-
 drivers/gpu/drm/ttm/Makefile   |   3 +-
 drivers/gpu/drm/ttm/ttm_bo.c   | 124 +-
 drivers/gpu/drm/ttm/ttm_bo_util.c  |   4 +-
 drivers/gpu/drm/ttm/ttm_resource.c | 151 
 include/drm/ttm/ttm_bo_api.h   |  70 +-
 include/drm/ttm/ttm_bo_driver.h| 189 ---
 include/drm/ttm/ttm_resource.h | 263 +
 11 files changed, 446 insertions(+), 376 deletions(-)
 create mode 100644 drivers/gpu/drm/ttm/ttm_resource.c
 create mode 100644 include/drm/ttm/ttm_resource.h

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index 43f4966331dd..b36d94f57d42 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -381,7 +381,7 @@ int amdgpu_bo_create_kernel_at(struct amdgpu_device *adev,
if (cpu_addr)
amdgpu_bo_kunmap(*bo_ptr);
 
-   ttm_bo_mem_put(&(*bo_ptr)->tbo, &(*bo_ptr)->tbo.mem);
+   ttm_resource_free(&(*bo_ptr)->tbo, &(*bo_ptr)->tbo.mem);
 
for (i = 0; i < (*bo_ptr)->placement.num_placement; ++i) {
(*bo_ptr)->placements[i].fpfn = offset >> PAGE_SHIFT;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index 67d2eef2f9eb..462402fcd1a4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -578,7 +578,7 @@ static int amdgpu_move_vram_ram(struct ttm_buffer_object 
*bo, bool evict,
/* move BO (in tmp_mem) to new_mem */
r = ttm_bo_move_ttm(bo, ctx, new_mem);
 out_cleanup:
-   ttm_bo_mem_put(bo, &tmp_mem);
+   ttm_resource_free(bo, &tmp_mem);
return r;
 }
 
@@ -625,7 +625,7 @@ static int amdgpu_move_ram_vram(struct ttm_buffer_object 
*bo, bool evict,
goto out_cleanup;
}
 out_cleanup:
-   ttm_bo_mem_put(bo, &tmp_mem);
+   ttm_resource_free(bo, &tmp_mem);
return r;
 }
 
@@ -1203,11 +1203,11 @@ int amdgpu_ttm_alloc_gart(struct ttm_buffer_object *bo)
gtt->offset = (u64)tmp.start << PAGE_SHIFT;
r = amdgpu_ttm_gart_bind(adev, bo, flags);
if (unlikely(r)) {
-   ttm_bo_mem_put(bo, &tmp);
+   ttm_resource_free(bo, &tmp);
return r;
}
 
-   ttm_bo_mem_put(bo, &bo->mem);
+   ttm_resource_free(bo, &bo->mem);
bo->mem = tmp;
}
 
diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c 
b/drivers/gpu/drm/nouveau/nouveau_bo.c
index 604a74323696..29d7d7e100f7 100644
--- a/drivers/gpu/drm/nouveau/nouveau_bo.c
+++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
@@ -1191,7 +1191,7 @@ nouveau_bo_move_flipd(struct ttm_buffer_object *bo, bool 
evict, bool intr,
 
ret = ttm_bo_move_ttm(bo, &ctx, new_reg);
 out:
-   ttm_bo_mem_put(bo, &tmp_reg);
+   ttm_resource_free(bo, &tmp_reg);
return ret;
 }
 
@@ -1227,7 +1227,7 @@ nouveau_bo_move_flips(struct ttm_buffer_object *bo, bool 
evict, bool intr,
goto out;
 
 out:
-   ttm_bo_mem_put(bo, &tmp_reg);
+   ttm_resource_free(bo, &tmp_reg);
return ret;
 }
 
diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c 
b/drivers/gpu/drm/radeon/radeon_ttm.c
index 3355b69b13d1..31f4cf211b6a 100644
--- a/drivers/gpu/drm/radeon/radeon_ttm.c
+++ b/drivers/gpu/drm/radeon/radeon_ttm.c
@@ -271,7 +271,7 @@ static int radeon_move_vram_ram(struct ttm_buffer_object 
*bo,
}
r = ttm_bo_move_ttm(bo, &ctx, new_mem);
 out_cleanup:
-   ttm_bo_mem_put(bo, &tmp_mem);
+   ttm_resource_free(bo, &tmp_mem);
return r;
 }
 
@@ -309,7 +309,7 @@ static int radeon_move_ram_vram(struct ttm_buffer_object 
*bo,
goto out_cleanup;
}
 out_cleanup:
-   ttm_bo_mem_put(bo, &tmp_mem);
+   ttm_resource_free(bo, &tmp_mem);
return r;
 }
 
diff --git a/drivers/gpu/drm/ttm/Makefile b/drivers/gpu/drm/ttm/Makefile
index e54326e6cea4..90c0da88cc98 100644
--- a/drivers/gpu/drm/ttm/Makefile
+++ b/drivers/gpu/drm/ttm/Makefile
@@ -4,7 +4,8 @@
 
 ttm-y := ttm_memory.o ttm_tt.o ttm_bo.o \
ttm_bo_util.o ttm_bo_vm.o ttm_module.o \
-   ttm_execbuf_util.o ttm_page_alloc.o ttm_range_manager.o
+   ttm_execbuf_util.o ttm_page_alloc.o ttm_range_manager.o \
+   ttm_resource.o
 ttm-$(CONFIG_AGP) += ttm_agp_backend.o
 ttm-$(CONFIG_DRM_TTM_DMA_PAGE_POOL) += ttm_page_alloc_dma.o
 
diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index ae71c3ab6cc4..55890314316b 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -77,19 +77,6 @@ static inline int 

Re: [PATCH] drm/amdgpu: make sure userptr ttm is allocated

2020-08-06 Thread Michel Dänzer
On 2020-08-06 2:56 p.m., Christian König wrote:
> We need to allocate that manually now.
> 
> Signed-off-by: Christian König 
> Fixes: 2ddef17678bc ("drm/ttm: make TT creation purely optional v3")

Reviewed-by: Michel Dänzer 
Tested-by: Michel Dänzer 


Thanks!


-- 
Earthling Michel Dänzer   |   https://redhat.com
Libre software enthusiast | Mesa and X developer
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] drm/amdgpu: annotate a false positive recursive locking

2020-08-06 Thread Christian König

On 06.08.20 at 13:38, Li, Dennis wrote:

[AMD Official Use Only - Internal Distribution Only]

[SNIP]

I think it is a limitation of init_rwsem.

And exactly that's wrong, this is intentional and perfectly correct.

[Dennis Li] I don't understand. Why is it perfectly correct?
For example, we define two rw_sems: a and b. If we don't check the init_rwsem
definition, we may think case#1 and case#2 have the same behavior, but in fact
they are different.

Case 1:
init_rwsem(&a);
init_rwsem(&b);

Case 2:
void init_rwsem_ext(rw_sem* sem)
{
   init_rwsem(sem);
}
init_rwsem_ext(&a);
init_rwsem_ext(&b);

As far as I know it is perfectly possible that the locks in the hive are not 
always grabbed in the same order. And that's why lockdep is complaining about 
this.
[Dennis Li] No. Let me take a hive with two devices (a and b) to explain why
lockdep complains.

// First the driver locks the reset_sem of all devices
down_write(&a->reset_sem); do_suspend(a);
down_write(&b->reset_sem);   // Because b shares the same lock_class_key as
a, lockdep takes a and b as the same rw_sem and complains here.
do_suspend(b);

// do recovery
do_hive_recovery()

// unlock the reset_sem of all devices
do_resume(a); up_write(&a->reset_sem);
do_resume(b); up_write(&b->reset_sem);

Ah! Now I understand what you are working around. So the problem is the static 
lock_class_key in the macro?
[Dennis Li] Yes. The author of init_rwsem might not have considered a use case
like ours.
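
To make that concrete, here is a minimal sketch (not the actual amdgpu code;
the lock_key member matches the proposed patch):

	/* All hive devices run through this single init_rwsem() expansion,
	 * so they all share the one static __key it declares and lockdep
	 * folds every reset_sem into the same lock class. */
	static void reset_sem_init_shared(struct amdgpu_device *adev)
	{
		init_rwsem(&adev->reset_sem);
	}

	/* Per-device key: each adev carries its own lock_class_key, so
	 * lockdep tracks every reset_sem as a separate class. A key that
	 * lives in dynamically allocated memory must be registered first. */
	static void reset_sem_init_per_device(struct amdgpu_device *adev)
	{
		lockdep_register_key(&adev->lock_key);
		__init_rwsem(&adev->reset_sem, "adev->reset_sem",
			     &adev->lock_key);
	}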


What we should do instead is to make sure we have only a single lock for the
complete hive.
[Dennis Li] If we use a single lock, users must wait for all devices to resume
successfully, but in fact their tasks are only running on device a. It is not
friendly to users.

Well, that is intentional I would say. We can only restart submissions after
all devices are resumed successfully, because otherwise it can happen that a
job on device A depends on memory on device B which isn't accessible.
[Dennis Li] Yes, you are right. The driver has to make sure that the shared
resources (such as the shared memory) are ready before unlocking the lock of
each adev one by one. But each device still has private hardware resources,
such as video and display.


Yeah, but those are negligible and we have a team working on display 
support for XGMI. So this will sooner or later become a problem as well.


I still think that a single rwsem for the whole hive is the best
option here.
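
As a rough sketch of that alternative (the reset_sem member and the helper
below are illustrative assumptions, not an existing patch):

	/* One reset lock per XGMI hive instead of one per device. */
	struct amdgpu_hive_info {
		/* ... existing hive members ... */
		struct rw_semaphore reset_sem;
	};

	static void hive_recover(struct amdgpu_hive_info *hive)
	{
		down_write(&hive->reset_sem);	/* taken once, single order */
		/* suspend all devices, do hive recovery, resume all devices */
		up_write(&hive->reset_sem);
	}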


Regards,
Christian.



Regards,
Christian.


Regards,
Christian.

On 06.08.20 at 11:15, Li, Dennis wrote:

[AMD Official Use Only - Internal Distribution Only]

Hi, Christian,
   For this case, it is safe to use separate lock keys. Please see the
definition of init_rwsem below. Every init_rwsem call site uses a new static
key, but the devices of the hive share the same initialization code, so their
lock_class_keys are the same. I think it is a limitation of init_rwsem. In our
case, it is correct for each adev's reset_sem to have a different
lock_class_key. BTW, this change doesn't affect dead-lock detection; it just
corrects the false positive.

#define init_rwsem(sem) \
do {\
static struct lock_class_key __key; \
\
__init_rwsem((sem), #sem, &__key);  \
} while (0)

Best Regards
Dennis Li
-Original Message-
From: Koenig, Christian 
Sent: Thursday, August 6, 2020 4:13 PM
To: Li, Dennis ; amd-gfx@lists.freedesktop.org;
Deucher, Alexander ; Kuehling, Felix
; Zhang, Hawking 
Subject: Re: [PATCH] drm/amdgpu: annotate a false positive recursive
locking

Preventing locking problems during implementation is obviously a good approach, 
but lockdep has proven to be massively useful for finding and fixing problems.

Disabling lockdep splat by annotating lock with separate classes is usually a 
no-go and only allowed if there is no other potential approach.

In this case here we should really clean things up instead.

Christian.

On 06.08.20 at 09:44, Li, Dennis wrote:

[AMD Official Use Only - Internal Distribution Only]

Hi, Christian,
   I agree with your concern. However, we shouldn't rely on the system to
detect deadlocks; we should consider this when doing the code implementation.
In fact, dead-lock detection isn't enabled by default.
   As for your proposal to move reset_sem into the hive structure, we can open
a new topic to discuss it. Currently we can't be sure which is the best
solution. For example, with your proposal, we must wait for all devices to
resume successfully before resubmitting an old task on one device, which will
affect performance.

Best Regards
Dennis Li
-Original Message-
From: amd-gfx  On Behalf Of
Christian König
Sent: Thursday, August 6, 2020 3:08 PM
To: Li, Dennis ; amd-gfx@lists.freedesktop.org;
Deucher, Alexander ; Kuehling, Felix
; Zhang, Hawking 

[PATCH] drm/amdgpu: make sure userptr ttm is allocated

2020-08-06 Thread Christian König
We need to allocate that manually now.

Signed-off-by: Christian König 
Fixes: 2ddef17678bc ("drm/ttm: make TT creation purely optional v3")
---
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c|  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 17 +++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h |  4 ++--
 4 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 9015c7b76d60..55d2e870fdda 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -564,7 +564,7 @@ static int init_user_pages(struct kgd_mem *mem, uint64_t 
user_addr)
 
mutex_lock(&process_info->lock);
 
-   ret = amdgpu_ttm_tt_set_userptr(bo->tbo.ttm, user_addr, 0);
+   ret = amdgpu_ttm_tt_set_userptr(&bo->tbo, user_addr, 0);
if (ret) {
pr_err("%s: Failed to set userptr: %d\n", __func__, ret);
goto out;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
index de9784b0c19b..b2dc320e7305 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
@@ -332,7 +332,7 @@ int amdgpu_gem_userptr_ioctl(struct drm_device *dev, void 
*data,
bo = gem_to_amdgpu_bo(gobj);
bo->preferred_domains = AMDGPU_GEM_DOMAIN_GTT;
bo->allowed_domains = AMDGPU_GEM_DOMAIN_GTT;
-   r = amdgpu_ttm_tt_set_userptr(bo->tbo.ttm, args->addr, args->flags);
+   r = amdgpu_ttm_tt_set_userptr(&bo->tbo, args->addr, args->flags);
if (r)
goto release_object;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index 28557839f132..67d2eef2f9eb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -1407,21 +1407,26 @@ static void amdgpu_ttm_tt_unpopulate(struct ttm_tt *ttm)
  * amdgpu_ttm_tt_set_userptr - Initialize userptr GTT ttm_tt for the current
  * task
  *
- * @ttm: The ttm_tt object to bind this userptr object to
+ * @bo: The ttm_buffer_object to bind this userptr to
  * @addr:  The address in the current tasks VM space to use
  * @flags: Requirements of userptr object.
  *
  * Called by amdgpu_gem_userptr_ioctl() to bind userptr pages
  * to current task
  */
-int amdgpu_ttm_tt_set_userptr(struct ttm_tt *ttm, uint64_t addr,
- uint32_t flags)
+int amdgpu_ttm_tt_set_userptr(struct ttm_buffer_object *bo,
+ uint64_t addr, uint32_t flags)
 {
-   struct amdgpu_ttm_tt *gtt = (void *)ttm;
+   struct amdgpu_ttm_tt *gtt;
 
-   if (gtt == NULL)
-   return -EINVAL;
+   if (!bo->ttm) {
+   /* TODO: We want a separate TTM object type for userptrs */
+   bo->ttm = amdgpu_ttm_tt_create(bo, 0);
+   if (bo->ttm == NULL)
+   return -ENOMEM;
+   }
 
+   gtt = (void*)bo->ttm;
gtt->userptr = addr;
gtt->userflags = flags;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
index 36b024fd077e..433156c2e8fb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h
@@ -132,8 +132,8 @@ static inline bool amdgpu_ttm_tt_get_user_pages_done(struct 
ttm_tt *ttm)
 #endif
 
 void amdgpu_ttm_tt_set_user_pages(struct ttm_tt *ttm, struct page **pages);
-int amdgpu_ttm_tt_set_userptr(struct ttm_tt *ttm, uint64_t addr,
-uint32_t flags);
+int amdgpu_ttm_tt_set_userptr(struct ttm_buffer_object *bo,
+ uint64_t addr, uint32_t flags);
 bool amdgpu_ttm_tt_has_userptr(struct ttm_tt *ttm);
 struct mm_struct *amdgpu_ttm_tt_get_usermm(struct ttm_tt *ttm);
 bool amdgpu_ttm_tt_affect_userptr(struct ttm_tt *ttm, unsigned long start,
-- 
2.17.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [PATCH] drm/amdgpu: annotate a false positive recursive locking

2020-08-06 Thread Li, Dennis
[AMD Official Use Only - Internal Distribution Only]

[SNIP]
>> I think it is a limitation of init_rwsem.
> And exactly that's wrong, this is intentional and perfectly correct.
>
> [Dennis Li] I don't understand. Why is it perfectly correct?
> For example, we define two rw_sems: a and b. If we don't check the
> init_rwsem definition, we may think case#1 and case#2 have the same
> behavior, but in fact they are different.
>
> Case 1:
> init_rwsem(&a);
> init_rwsem(&b);
>
> Case 2:
> void init_rwsem_ext(rw_sem* sem)
> {
>   init_rwsem(sem);
> }
> init_rwsem_ext(&a);
> init_rwsem_ext(&b);
>
> As far as I know it is perfectly possible that the locks in the hive are not 
> always grabbed in the same order. And that's why lockdep is complaining about 
> this.
> [Dennis Li] No. Let me take a hive with two devices (a and b) to explain
> why lockdep complains.
>
> // First the driver locks the reset_sem of all devices
> down_write(&a->reset_sem); do_suspend(a);
> down_write(&b->reset_sem);   // Because b shares the same lock_class_key
> as a, lockdep takes a and b as the same rw_sem and complains here.
> do_suspend(b);
>
> // do recovery
> do_hive_recovery()
>
> // unlock the reset_sem of all devices
> do_resume(a); up_write(&a->reset_sem);
> do_resume(b); up_write(&b->reset_sem);

Ah! Now I understand what you are working around. So the problem is the static 
lock_class_key in the macro?
[Dennis Li] Yes. The author of init_rwsem might not have considered a use case
like ours.

> What we should do instead is to make sure we have only a single lock for
> the complete hive.
> [Dennis Li] If we use a single lock, users must wait for all devices to
> resume successfully, but in fact their tasks are only running on device a.
> It is not friendly to users.

Well, that is intentional I would say. We can only restart submissions after
all devices are resumed successfully, because otherwise it can happen that a
job on device A depends on memory on device B which isn't accessible.
[Dennis Li] Yes, you are right. The driver has to make sure that the shared
resources (such as the shared memory) are ready before unlocking the lock of
each adev one by one. But each device still has private hardware resources,
such as video and display.

Regards,
Christian.

>
> Regards,
> Christian.
>
> On 06.08.20 at 11:15, Li, Dennis wrote:
>> [AMD Official Use Only - Internal Distribution Only]
>>
>> Hi, Christian,
>>    For this case, it is safe to use separate lock keys. Please see the
>> definition of init_rwsem below. Every init_rwsem call site uses a new
>> static key, but the devices of the hive share the same initialization
>> code, so their lock_class_keys are the same. I think it is a limitation
>> of init_rwsem. In our case, it is correct for each adev's reset_sem to
>> have a different lock_class_key. BTW, this change doesn't affect
>> dead-lock detection; it just corrects the false positive.
>>
>> #define init_rwsem(sem)  \
>> do { \
>>  static struct lock_class_key __key; \
>>  \
>>  __init_rwsem((sem), #sem, &__key);  \
>> } while (0)
>>
>> Best Regards
>> Dennis Li
>> -Original Message-
>> From: Koenig, Christian 
>> Sent: Thursday, August 6, 2020 4:13 PM
>> To: Li, Dennis ; amd-gfx@lists.freedesktop.org; 
>> Deucher, Alexander ; Kuehling, Felix 
>> ; Zhang, Hawking 
>> Subject: Re: [PATCH] drm/amdgpu: annotate a false positive recursive 
>> locking
>>
>> Preventing locking problems during implementation is obviously a good 
>> approach, but lockdep has proven to be massively useful for finding and 
>> fixing problems.
>>
>> Disabling lockdep splat by annotating lock with separate classes is usually 
>> a no-go and only allowed if there is no other potential approach.
>>
>> In this case here we should really clean things up instead.
>>
>> Christian.
>>
>> On 06.08.20 at 09:44, Li, Dennis wrote:
>>> [AMD Official Use Only - Internal Distribution Only]
>>>
>>>  I agree with your concern. However, we shouldn't rely on the system to
>>> detect deadlocks; we should consider this when doing the code
>>> implementation. In fact, dead-lock detection isn't enabled by default.
>>>  As for your proposal to move reset_sem into the hive structure, we can
>>> open a new topic to discuss it. Currently we can't be sure which is the
>>> best solution. For example, with your proposal, we must wait for all
>>> devices to resume successfully before resubmitting an old task on one
>>> device, which will affect performance.
>>> which will effect performance.
>>>
>>> Best Regards
>>> Dennis Li
>>> -Original Message-
>>> From: amd-gfx  On Behalf Of 
>>> Christian König
>>> Sent: Thursday, August 6, 2020 3:08 PM
>>> To: Li, Dennis ; amd-gfx@lists.freedesktop.org; 
>>> Deucher, Alexander ; Kuehling, Felix 
>>> ; Zhang, Hawking 
>>> Subject: Re: 

Re: [PATCH] drm/amdgpu: annotate a false positive recursive locking

2020-08-06 Thread Christian König


[SNIP]

I think it is a limitation of init_rwsem.

And exactly that's wrong, this is intentional and perfectly correct.

[Dennis Li] I don't understand. Why is it perfectly correct?
For example, we define two rw_sems: a and b. If we don't check the init_rwsem
definition, we may think case#1 and case#2 have the same behavior, but in fact
they are different.

Case 1:
init_rwsem(&a);
init_rwsem(&b);

Case 2:
void init_rwsem_ext(rw_sem* sem)
{
  init_rwsem(sem);
}
init_rwsem_ext(&a);
init_rwsem_ext(&b);

As far as I know it is perfectly possible that the locks in the hive are not 
always grabbed in the same order. And that's why lockdep is complaining about 
this.
[Dennis Li] No. Let me take a hive with two devices (a and b) to explain why
lockdep complains.

// First the driver locks the reset_sem of all devices
down_write(&a->reset_sem);
do_suspend(a);
down_write(&b->reset_sem);   // Because b shares the same lock_class_key as
a, lockdep takes a and b as the same rw_sem and complains here.
do_suspend(b);

// do recovery
do_hive_recovery()

// unlock the reset_sem of all devices
do_resume(a);
up_write(&a->reset_sem);
do_resume(b);
up_write(&b->reset_sem);


Ah! Now I understand what you are working around. So the problem is the 
static lock_class_key in the macro?



What we should do instead is to make sure we have only a single lock for the
complete hive.
[Dennis Li] If we use a single lock, users must wait for all devices to resume
successfully, but in fact their tasks are only running on device a. It is not
friendly to users.


Well, that is intentional I would say. We can only restart submissions
after all devices are resumed successfully, because otherwise it can
happen that a job on device A depends on memory on device B which isn't
accessible.


Regards,
Christian.



Regards,
Christian.

On 06.08.20 at 11:15, Li, Dennis wrote:

[AMD Official Use Only - Internal Distribution Only]

Hi, Christian,
   For this case, it is safe to use separate lock keys. Please see the
definition of init_rwsem below. Every init_rwsem call site uses a new static
key, but the devices of the hive share the same initialization code, so their
lock_class_keys are the same. I think it is a limitation of init_rwsem. In our
case, it is correct for each adev's reset_sem to have a different
lock_class_key. BTW, this change doesn't affect dead-lock detection; it just
corrects the false positive.

#define init_rwsem(sem) \
do {\
static struct lock_class_key __key; \
\
__init_rwsem((sem), #sem, &__key);  \
} while (0)

Best Regards
Dennis Li
-Original Message-
From: Koenig, Christian 
Sent: Thursday, August 6, 2020 4:13 PM
To: Li, Dennis ; amd-gfx@lists.freedesktop.org;
Deucher, Alexander ; Kuehling, Felix
; Zhang, Hawking 
Subject: Re: [PATCH] drm/amdgpu: annotate a false positive recursive
locking

Preventing locking problems during implementation is obviously a good approach, 
but lockdep has proven to be massively useful for finding and fixing problems.

Disabling lockdep splat by annotating lock with separate classes is usually a 
no-go and only allowed if there is no other potential approach.

In this case here we should really clean things up instead.

Christian.

On 06.08.20 at 09:44, Li, Dennis wrote:

[AMD Official Use Only - Internal Distribution Only]

Hi, Christian,
   I agree with your concern. However, we shouldn't rely on the system to
detect deadlocks; we should consider this when doing the code implementation.
In fact, dead-lock detection isn't enabled by default.
   As for your proposal to move reset_sem into the hive structure, we can open
a new topic to discuss it. Currently we can't be sure which is the best
solution. For example, with your proposal, we must wait for all devices to
resume successfully before resubmitting an old task on one device, which will
affect performance.

Best Regards
Dennis Li
-Original Message-
From: amd-gfx  On Behalf Of
Christian König
Sent: Thursday, August 6, 2020 3:08 PM
To: Li, Dennis ; amd-gfx@lists.freedesktop.org;
Deucher, Alexander ; Kuehling, Felix
; Zhang, Hawking 
Subject: Re: [PATCH] drm/amdgpu: annotate a false positive recursive
locking

On 06.08.20 at 09:02, Dennis Li wrote:

[  584.110304] 
[  584.110590] WARNING: possible recursive locking detected
[  584.110876] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G   OE
[  584.64] 
[  584.111456] kworker/38:1/553 is trying to acquire lock:
[  584.111721] 9b15ff0a47a0 (>reset_sem){}, at:
amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.112112]
   but task is already holding lock:
[  584.112673] 9b1603d247a0 (>reset_sem){}, at:

Re: [PATCH] drm/amd/powerplay: update driver if file for sienna_cichlid

2020-08-06 Thread Feng, Kenneth
[AMD Official Use Only - Internal Distribution Only]

Reviewed-by: Kenneth Feng 


On 2020/8/6 at 5:42 PM, "Gao, Likun" wrote:

From: Likun Gao 

Update the driver if (interface) file for sienna_cichlid.

Signed-off-by: Likun Gao 
Change-Id: If405461cfbe0133ceb61fa123272b2e53db99755
---
 .../drm/amd/powerplay/inc/smu11_driver_if_sienna_cichlid.h  | 6 +++---
 drivers/gpu/drm/amd/powerplay/inc/smu_v11_0.h   | 2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git 
a/drivers/gpu/drm/amd/powerplay/inc/smu11_driver_if_sienna_cichlid.h 
b/drivers/gpu/drm/amd/powerplay/inc/smu11_driver_if_sienna_cichlid.h
index aa2708fccb6d..5ef9c92f57c4 100644
--- a/drivers/gpu/drm/amd/powerplay/inc/smu11_driver_if_sienna_cichlid.h
+++ b/drivers/gpu/drm/amd/powerplay/inc/smu11_driver_if_sienna_cichlid.h
@@ -27,7 +27,7 @@
 // *** IMPORTANT ***
 // SMU TEAM: Always increment the interface version if
 // any structure is changed in this file
-#define SMU11_DRIVER_IF_VERSION 0x34
+#define SMU11_DRIVER_IF_VERSION 0x35

 #define PPTABLE_Sienna_Cichlid_SMU_VERSION 5

@@ -127,7 +127,7 @@
 #define FEATURE_DF_CSTATE_BIT   45
 #define FEATURE_2_STEP_PSTATE_BIT   46
 #define FEATURE_SMNCLK_DPM_BIT  47
-#define FEATURE_SPARE_48_BIT48
+#define FEATURE_PERLINK_GMIDOWN_BIT 48
 #define FEATURE_GFX_EDC_BIT 49
 #define FEATURE_SPARE_50_BIT50
 #define FEATURE_SPARE_51_BIT51
@@ -169,7 +169,7 @@ typedef enum {
 #define DPM_OVERRIDE_DISABLE_DFLL_PLL_SHUTDOWN   0x0200
 #define DPM_OVERRIDE_DISABLE_MEMORY_TEMPERATURE_READ 0x0400
 #define DPM_OVERRIDE_DISABLE_VOLT_LINK_VCN_DCEFCLK   0x0800
-#define DPM_OVERRIDE_DISABLE_FAST_FCLK_TIMER 0x1000
+#define DPM_OVERRIDE_ENABLE_FAST_FCLK_TIMER  0x1000
 #define DPM_OVERRIDE_DISABLE_VCN_PG  0x2000
 #define DPM_OVERRIDE_DISABLE_FMAX_VMAX   0x4000

diff --git a/drivers/gpu/drm/amd/powerplay/inc/smu_v11_0.h 
b/drivers/gpu/drm/amd/powerplay/inc/smu_v11_0.h
index 6a42331aba8a..737b6d14372c 100644
--- a/drivers/gpu/drm/amd/powerplay/inc/smu_v11_0.h
+++ b/drivers/gpu/drm/amd/powerplay/inc/smu_v11_0.h
@@ -30,7 +30,7 @@
 #define SMU11_DRIVER_IF_VERSION_NV10 0x36
 #define SMU11_DRIVER_IF_VERSION_NV12 0x33
 #define SMU11_DRIVER_IF_VERSION_NV14 0x36
-#define SMU11_DRIVER_IF_VERSION_Sienna_Cichlid 0x34
+#define SMU11_DRIVER_IF_VERSION_Sienna_Cichlid 0x35
 #define SMU11_DRIVER_IF_VERSION_Navy_Flounder 0x3

 /* MP Apertures */
--
2.25.1


___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
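
For context on why the header comment insists on bumping the version: the
driver checks its compiled-in interface version against what the SMU firmware
reports at init. A rough, hypothetical sketch of that handshake follows;
smu_get_fw_if_version() is an assumed helper name, not the exact smu code.

	static int check_driver_if_version(struct smu_context *smu)
	{
		/* assumed helper: queries the interface version the running
		 * SMU firmware was built against */
		uint32_t fw_if_version = smu_get_fw_if_version(smu);

		if (fw_if_version != SMU11_DRIVER_IF_VERSION_Sienna_Cichlid)
			dev_warn(smu->adev->dev,
				 "SMU driver if version 0x%x does not match firmware 0x%x\n",
				 SMU11_DRIVER_IF_VERSION_Sienna_Cichlid,
				 fw_if_version);

		return 0; /* a mismatch is warned about, not fatal */
	}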


RE: [PATCH] drm/amdgpu: annotate a false positive recursive locking

2020-08-06 Thread Li, Dennis
[AMD Official Use Only - Internal Distribution Only]

Hi, Christian,
   See my comments below.

Best Regards
Dennis Li
-Original Message-
From: Koenig, Christian  
Sent: Thursday, August 6, 2020 5:19 PM
To: Li, Dennis ; amd-gfx@lists.freedesktop.org; Deucher, 
Alexander ; Kuehling, Felix 
; Zhang, Hawking 
Subject: Re: [PATCH] drm/amdgpu: annotate a false positive recursive locking

> I think it is a limitation of init_rwsem.

And exactly that's wrong, this is intentional and perfectly correct.

[Dennis Li] I don't understand. Why is it perfectly correct?
For example, we define two rw_sems: a and b. If we don't check the init_rwsem
definition, we may think case#1 and case#2 have the same behavior, but in fact
they are different.

Case 1:
init_rwsem(&a);
init_rwsem(&b);

Case 2:
void init_rwsem_ext(rw_sem* sem)
{
 init_rwsem(sem);
}
init_rwsem_ext(&a);
init_rwsem_ext(&b);

As far as I know it is perfectly possible that the locks in the hive are not 
always grabbed in the same order. And that's why lockdep is complaining about 
this.
[Dennis Li] No. Let me take a hive with two devices (a and b) to explain why
lockdep complains.

// First the driver locks the reset_sem of all devices
down_write(&a->reset_sem);
do_suspend(a);
down_write(&b->reset_sem);   // Because b shares the same lock_class_key as
a, lockdep takes a and b as the same rw_sem and complains here.
do_suspend(b);

// do recovery
do_hive_recovery()

// unlock the reset_sem of all devices
do_resume(a);
up_write(&a->reset_sem);
do_resume(b);
up_write(&b->reset_sem);

What we should do instead is to make sure we have only a single lock for the
complete hive.
[Dennis Li] If we use a single lock, users must wait for all devices to resume
successfully, but in fact their tasks are only running on device a. It is not
friendly to users.

Regards,
Christian.

On 06.08.20 at 11:15, Li, Dennis wrote:
> [AMD Official Use Only - Internal Distribution Only]
>
> Hi, Christian,
>    For this case, it is safe to use separate lock keys. Please see the
> definition of init_rwsem below. Every init_rwsem call site uses a new
> static key, but the devices of the hive share the same initialization
> code, so their lock_class_keys are the same. I think it is a limitation
> of init_rwsem. In our case, it is correct for each adev's reset_sem to
> have a different lock_class_key. BTW, this change doesn't affect
> dead-lock detection; it just corrects the false positive.
>
> #define init_rwsem(sem)   \
> do {  \
>   static struct lock_class_key __key; \
>   \
>   __init_rwsem((sem), #sem, &__key);  \
> } while (0)
>
> Best Regards
> Dennis Li
> -Original Message-
> From: Koenig, Christian 
> Sent: Thursday, August 6, 2020 4:13 PM
> To: Li, Dennis ; amd-gfx@lists.freedesktop.org; 
> Deucher, Alexander ; Kuehling, Felix 
> ; Zhang, Hawking 
> Subject: Re: [PATCH] drm/amdgpu: annotate a false positive recursive 
> locking
>
> Preventing locking problems during implementation is obviously a good 
> approach, but lockdep has proven to be massively useful for finding and 
> fixing problems.
>
> Disabling lockdep splat by annotating lock with separate classes is usually a 
> no-go and only allowed if there is no other potential approach.
>
> In this case here we should really clean things up instead.
>
> Christian.
>
> On 06.08.20 at 09:44, Li, Dennis wrote:
>> [AMD Official Use Only - Internal Distribution Only]
>>
>> Hi, Christian,
>> I agree with your concern. However, we shouldn't rely on the system to
>> detect deadlocks; we should consider this when doing the code
>> implementation. In fact, dead-lock detection isn't enabled by default.
>> As for your proposal to move reset_sem into the hive structure, we can
>> open a new topic to discuss it. Currently we can't be sure which is the
>> best solution. For example, with your proposal, we must wait for all
>> devices to resume successfully before resubmitting an old task on one
>> device, which will affect performance.
>>
>> Best Regards
>> Dennis Li
>> -Original Message-
>> From: amd-gfx  On Behalf Of 
>> Christian König
>> Sent: Thursday, August 6, 2020 3:08 PM
>> To: Li, Dennis ; amd-gfx@lists.freedesktop.org; 
>> Deucher, Alexander ; Kuehling, Felix 
>> ; Zhang, Hawking 
>> Subject: Re: [PATCH] drm/amdgpu: annotate a false positive recursive 
>> locking
>>
>> On 06.08.20 at 09:02, Dennis Li wrote:
>>> [  584.110304] 
>>> [  584.110590] WARNING: possible recursive locking detected
>>> [  584.110876] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G   OE
>>> [  584.64] 
>>> [  584.111456] kworker/38:1/553 is trying to acquire lock:
>>> [  584.111721] 9b15ff0a47a0 

[PATCH] drm/amdgpu: use mode1 reset by default for sienna_cichlid

2020-08-06 Thread Likun Gao
From: Likun Gao 

Switch the default GPU reset method for sienna_cichlid to MODE1 reset.

Signed-off-by: Likun Gao 
Change-Id: I775e5a66bbac474f65ca8c999136ccaf9c1dc14e
---
 drivers/gpu/drm/amd/amdgpu/nv.c | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c b/drivers/gpu/drm/amd/amdgpu/nv.c
index 74d02d270d34..da8024c2826e 100644
--- a/drivers/gpu/drm/amd/amdgpu/nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/nv.c
@@ -362,10 +362,15 @@ nv_asic_reset_method(struct amdgpu_device *adev)
dev_warn(adev->dev, "Specified reset method:%d isn't supported, 
using AUTO instead.\n",
  amdgpu_reset_method);
 
-   if (smu_baco_is_support(smu))
-   return AMD_RESET_METHOD_BACO;
-   else
+   switch (adev->asic_type) {
+   case CHIP_SIENNA_CICHLID:
return AMD_RESET_METHOD_MODE1;
+   default:
+   if (smu_baco_is_support(smu))
+   return AMD_RESET_METHOD_BACO;
+   else
+   return AMD_RESET_METHOD_MODE1;
+   }
 }
 
 static int nv_asic_reset(struct amdgpu_device *adev)
-- 
2.25.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH] drm/amd/powerplay: update driver if file for sienna_cichlid

2020-08-06 Thread Likun Gao
From: Likun Gao 

Update the driver if (interface) file for sienna_cichlid.

Signed-off-by: Likun Gao 
Change-Id: If405461cfbe0133ceb61fa123272b2e53db99755
---
 .../drm/amd/powerplay/inc/smu11_driver_if_sienna_cichlid.h  | 6 +++---
 drivers/gpu/drm/amd/powerplay/inc/smu_v11_0.h   | 2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/powerplay/inc/smu11_driver_if_sienna_cichlid.h 
b/drivers/gpu/drm/amd/powerplay/inc/smu11_driver_if_sienna_cichlid.h
index aa2708fccb6d..5ef9c92f57c4 100644
--- a/drivers/gpu/drm/amd/powerplay/inc/smu11_driver_if_sienna_cichlid.h
+++ b/drivers/gpu/drm/amd/powerplay/inc/smu11_driver_if_sienna_cichlid.h
@@ -27,7 +27,7 @@
 // *** IMPORTANT ***
 // SMU TEAM: Always increment the interface version if 
 // any structure is changed in this file
-#define SMU11_DRIVER_IF_VERSION 0x34
+#define SMU11_DRIVER_IF_VERSION 0x35
 
 #define PPTABLE_Sienna_Cichlid_SMU_VERSION 5
 
@@ -127,7 +127,7 @@
 #define FEATURE_DF_CSTATE_BIT   45
 #define FEATURE_2_STEP_PSTATE_BIT   46
 #define FEATURE_SMNCLK_DPM_BIT  47
-#define FEATURE_SPARE_48_BIT48
+#define FEATURE_PERLINK_GMIDOWN_BIT 48
 #define FEATURE_GFX_EDC_BIT 49
 #define FEATURE_SPARE_50_BIT50
 #define FEATURE_SPARE_51_BIT51
@@ -169,7 +169,7 @@ typedef enum {
 #define DPM_OVERRIDE_DISABLE_DFLL_PLL_SHUTDOWN   0x0200
 #define DPM_OVERRIDE_DISABLE_MEMORY_TEMPERATURE_READ 0x0400
 #define DPM_OVERRIDE_DISABLE_VOLT_LINK_VCN_DCEFCLK   0x0800
-#define DPM_OVERRIDE_DISABLE_FAST_FCLK_TIMER 0x1000
+#define DPM_OVERRIDE_ENABLE_FAST_FCLK_TIMER  0x1000
 #define DPM_OVERRIDE_DISABLE_VCN_PG  0x2000
 #define DPM_OVERRIDE_DISABLE_FMAX_VMAX   0x4000
 
diff --git a/drivers/gpu/drm/amd/powerplay/inc/smu_v11_0.h 
b/drivers/gpu/drm/amd/powerplay/inc/smu_v11_0.h
index 6a42331aba8a..737b6d14372c 100644
--- a/drivers/gpu/drm/amd/powerplay/inc/smu_v11_0.h
+++ b/drivers/gpu/drm/amd/powerplay/inc/smu_v11_0.h
@@ -30,7 +30,7 @@
 #define SMU11_DRIVER_IF_VERSION_NV10 0x36
 #define SMU11_DRIVER_IF_VERSION_NV12 0x33
 #define SMU11_DRIVER_IF_VERSION_NV14 0x36
-#define SMU11_DRIVER_IF_VERSION_Sienna_Cichlid 0x34
+#define SMU11_DRIVER_IF_VERSION_Sienna_Cichlid 0x35
 #define SMU11_DRIVER_IF_VERSION_Navy_Flounder 0x3
 
 /* MP Apertures */
-- 
2.25.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] drm/amdgpu: annotate a false positive recursive locking

2020-08-06 Thread Christian König

I think it is a limitation of init_rwsem.


And exactly that's wrong, this is intentional and perfectly correct.

As far as I know it is perfectly possible that the locks in the hive are 
not always grabbed in the same order. And that's why lockdep is 
complaining about this.


What we should do instead is to make sure we have only a single lock for
the complete hive.


Regards,
Christian.

On 06.08.20 at 11:15, Li, Dennis wrote:

[AMD Official Use Only - Internal Distribution Only]

Hi, Christian,
   For this case, it is safe to use separate lock keys. Please see the
definition of init_rwsem below. Every init_rwsem call site uses a new static
key, but the devices of the hive share the same initialization code, so their
lock_class_keys are the same. I think it is a limitation of init_rwsem. In our
case, it is correct for each adev's reset_sem to have a different
lock_class_key. BTW, this change doesn't affect dead-lock detection; it just
corrects the false positive.

#define init_rwsem(sem) \
do {\
static struct lock_class_key __key; \
\
__init_rwsem((sem), #sem, &__key);  \
} while (0)

Best Regards
Dennis Li
-Original Message-
From: Koenig, Christian 
Sent: Thursday, August 6, 2020 4:13 PM
To: Li, Dennis ; amd-gfx@lists.freedesktop.org; Deucher, Alexander 
; Kuehling, Felix ; Zhang, Hawking 

Subject: Re: [PATCH] drm/amdgpu: annotate a false positive recursive locking

Preventing locking problems during implementation is obviously a good approach, 
but lockdep has proven to be massively useful for finding and fixing problems.

Disabling lockdep splat by annotating lock with separate classes is usually a 
no-go and only allowed if there is no other potential approach.

In this case here we should really clean things up instead.

Christian.

On 06.08.20 at 09:44, Li, Dennis wrote:

[AMD Official Use Only - Internal Distribution Only]

Hi, Christian,
I agree with your concern. However, we shouldn't rely on the system to
detect deadlocks; we should consider this when doing the code implementation.
In fact, dead-lock detection isn't enabled by default.
As for your proposal to move reset_sem into the hive structure, we can open
a new topic to discuss it. Currently we can't be sure which is the best
solution. For example, with your proposal, we must wait for all devices to
resume successfully before resubmitting an old task on one device, which will
affect performance.

Best Regards
Dennis Li
-Original Message-
From: amd-gfx  On Behalf Of
Christian König
Sent: Thursday, August 6, 2020 3:08 PM
To: Li, Dennis ; amd-gfx@lists.freedesktop.org;
Deucher, Alexander ; Kuehling, Felix
; Zhang, Hawking 
Subject: Re: [PATCH] drm/amdgpu: annotate a false positive recursive
locking

On 06.08.20 at 09:02, Dennis Li wrote:

[  584.110304] 
[  584.110590] WARNING: possible recursive locking detected
[  584.110876] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G   OE
[  584.64] 
[  584.111456] kworker/38:1/553 is trying to acquire lock:
[  584.111721] 9b15ff0a47a0 (>reset_sem){}, at:
amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.112112]
  but task is already holding lock:
[  584.112673] 9b1603d247a0 (>reset_sem){}, at:
amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.113068]
  other info that might help us debug this:
[  584.113689]  Possible unsafe locking scenario:

[  584.114350]CPU0
[  584.114685]
[  584.115014]   lock(&adev->reset_sem);
[  584.115349]   lock(&adev->reset_sem);
[  584.115678]
   *** DEADLOCK ***

[  584.116624]  May be due to missing lock nesting notation

[  584.117284] 4 locks held by kworker/38:1/553:
[  584.117616]  #0: 9ad635c1d348 ((wq_completion)events){+.+.},
at: process_one_work+0x21f/0x630 [  584.117967]  #1: ac708e1c3e58
((work_completion)(>recovery_work)){+.+.}, at:
process_one_work+0x21f/0x630 [  584.118358]  #2: c1c2a5d0
(>hive_lock){+.+.}, at: amdgpu_device_gpu_recover+0xae/0x1030 [amdgpu] [  
584.118786]  #3: 9b1603d247a0 (>reset_sem){}, at: 
amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.119222]
  stack backtrace:
[  584.119990] CPU: 38 PID: 553 Comm: kworker/38:1 Kdump: loaded Tainted: G 
  OE 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1
[  584.120782] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT,
BIOS 3.1 05/23/2019 [  584.121223] Workqueue: events
amdgpu_ras_do_recovery [amdgpu] [  584.121638] Call Trace:
[  584.122050]  dump_stack+0x98/0xd5
[  584.122499]  __lock_acquire+0x1139/0x16e0 [  584.122931]  ?
trace_hardirqs_on+0x3b/0xf0 [  584.123358]  ?
cancel_delayed_work+0xa6/0xc0 [  

RE: [PATCH] drm/amdgpu: annotate a false positive recursive locking

2020-08-06 Thread Li, Dennis
[AMD Official Use Only - Internal Distribution Only]

Hi, Christian,
   For this case, it is safe to use separate lock keys. Please see the
definition of init_rwsem below. Every init_rwsem call site uses a new static
key, but the devices of the hive share the same initialization code, so their
lock_class_keys are the same. I think it is a limitation of init_rwsem. In our
case, it is correct for each adev's reset_sem to have a different
lock_class_key. BTW, this change doesn't affect dead-lock detection; it just
corrects the false positive.

#define init_rwsem(sem) \
do {\
static struct lock_class_key __key; \
\
__init_rwsem((sem), #sem, &__key);  \
} while (0)

Best Regards
Dennis Li
-Original Message-
From: Koenig, Christian  
Sent: Thursday, August 6, 2020 4:13 PM
To: Li, Dennis ; amd-gfx@lists.freedesktop.org; Deucher, 
Alexander ; Kuehling, Felix 
; Zhang, Hawking 
Subject: Re: [PATCH] drm/amdgpu: annotate a false positive recursive locking

Preventing locking problems during implementation is obviously a good approach, 
but lockdep has proven to be massively useful for finding and fixing problems.

Disabling lockdep splat by annotating lock with separate classes is usually a 
no-go and only allowed if there is no other potential approach.

In this case here we should really clean things up instead.

Christian.

On 06.08.20 at 09:44, Li, Dennis wrote:
> [AMD Official Use Only - Internal Distribution Only]
>
> Hi, Christian,
> I agree with your concern. However, we shouldn't rely on the system to
> detect deadlocks; we should consider this when doing the code
> implementation. In fact, dead-lock detection isn't enabled by default.
> As for your proposal to move reset_sem into the hive structure, we can
> open a new topic to discuss it. Currently we can't be sure which is the
> best solution. For example, with your proposal, we must wait for all
> devices to resume successfully before resubmitting an old task on one
> device, which will affect performance.
>
> Best Regards
> Dennis Li
> -Original Message-
> From: amd-gfx  On Behalf Of 
> Christian König
> Sent: Thursday, August 6, 2020 3:08 PM
> To: Li, Dennis ; amd-gfx@lists.freedesktop.org; 
> Deucher, Alexander ; Kuehling, Felix 
> ; Zhang, Hawking 
> Subject: Re: [PATCH] drm/amdgpu: annotate a false positive recursive 
> locking
>
> On 06.08.20 at 09:02, Dennis Li wrote:
>> [  584.110304] 
>> [  584.110590] WARNING: possible recursive locking detected
>> [  584.110876] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G   OE
>> [  584.64] 
>> [  584.111456] kworker/38:1/553 is trying to acquire lock:
>> [  584.111721] 9b15ff0a47a0 (>reset_sem){}, at:
>> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.112112]
>>  but task is already holding lock:
>> [  584.112673] 9b1603d247a0 (>reset_sem){}, at:
>> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.113068]
>>  other info that might help us debug this:
>> [  584.113689]  Possible unsafe locking scenario:
>>
>> [  584.114350]CPU0
>> [  584.114685]
>> [  584.115014]   lock(&adev->reset_sem);
>> [  584.115349]   lock(&adev->reset_sem);
>> [  584.115678]
>>   *** DEADLOCK ***
>>
>> [  584.116624]  May be due to missing lock nesting notation
>>
>> [  584.117284] 4 locks held by kworker/38:1/553:
>> [  584.117616]  #0: 9ad635c1d348 ((wq_completion)events){+.+.},
>> at: process_one_work+0x21f/0x630 [  584.117967]  #1: ac708e1c3e58 
>> ((work_completion)(>recovery_work)){+.+.}, at:
>> process_one_work+0x21f/0x630 [  584.118358]  #2: c1c2a5d0 
>> (>hive_lock){+.+.}, at: amdgpu_device_gpu_recover+0xae/0x1030 [amdgpu] 
>> [  584.118786]  #3: 9b1603d247a0 (>reset_sem){}, at: 
>> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.119222]
>>  stack backtrace:
>> [  584.119990] CPU: 38 PID: 553 Comm: kworker/38:1 Kdump: loaded Tainted: G  
>>  OE 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1
>> [  584.120782] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, 
>> BIOS 3.1 05/23/2019 [  584.121223] Workqueue: events 
>> amdgpu_ras_do_recovery [amdgpu] [  584.121638] Call Trace:
>> [  584.122050]  dump_stack+0x98/0xd5
>> [  584.122499]  __lock_acquire+0x1139/0x16e0 [  584.122931]  ?
>> trace_hardirqs_on+0x3b/0xf0 [  584.123358]  ?
>> cancel_delayed_work+0xa6/0xc0 [  584.123771]  lock_acquire+0xb8/0x1c0 
>> [  584.124197]  ? amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 
>> 584.124599]  down_write+0x49/0x120 [  584.125032]  ?
>> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.125472]
>> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  

Re: [PATCH] drm/amdgpu: Enable P2P dmabuf over XGMI

2020-08-06 Thread Christian König

On 06.08.20 at 11:04, Arunpravin wrote:

Access the exported P2P dmabuf over XGMI, if available.
Otherwise, fall back to the existing PCIe method.

Signed-off-by: Arunpravin 


Reviewed-by: Christian König 


---
  drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 34 +
  drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.h |  2 ++
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c  | 19 ++--
  3 files changed, 52 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
index ffeb20f11c07..589d008f91df 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
@@ -35,6 +35,7 @@
  #include "amdgpu_display.h"
  #include "amdgpu_gem.h"
  #include "amdgpu_dma_buf.h"
+#include "amdgpu_xgmi.h"
 #include <drm/amdgpu_drm.h>
 #include <linux/dma-buf.h>
 #include <linux/dma-fence-array.h>
@@ -560,3 +561,36 @@ struct drm_gem_object *amdgpu_gem_prime_import(struct 
drm_device *dev,
obj->import_attach = attach;
return obj;
  }
+
+/**
+ * amdgpu_dmabuf_is_xgmi_accessible - Check if xgmi available for P2P transfer
+ *
+ * @adev: amdgpu_device pointer of the importer
+ * @bo: amdgpu buffer object
+ *
+ * Returns:
+ * True if dmabuf accessible over xgmi, false otherwise.
+ */
+bool amdgpu_dmabuf_is_xgmi_accessible(struct amdgpu_device *adev,
+ struct amdgpu_bo *bo)
+{
+   struct drm_gem_object *obj = &bo->tbo.base;
+   struct drm_gem_object *gobj;
+
+   if (obj->import_attach) {
+   struct dma_buf *dma_buf = obj->import_attach->dmabuf;
+
+   if (dma_buf->ops != &amdgpu_dmabuf_ops)
+   /* No XGMI with non AMD GPUs */
+   return false;
+
+   gobj = dma_buf->priv;
+   bo = gem_to_amdgpu_bo(gobj);
+   }
+
+   if (amdgpu_xgmi_same_hive(adev, amdgpu_ttm_adev(bo->tbo.bdev)) &&
+   (bo->preferred_domains & AMDGPU_GEM_DOMAIN_VRAM))
+   return true;
+
+   return false;
+}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.h
index ec447a7b6b28..2c5c84a06bb9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.h
@@ -29,6 +29,8 @@ struct dma_buf *amdgpu_gem_prime_export(struct drm_gem_object 
*gobj,
int flags);
  struct drm_gem_object *amdgpu_gem_prime_import(struct drm_device *dev,
struct dma_buf *dma_buf);
+bool amdgpu_dmabuf_is_xgmi_accessible(struct amdgpu_device *adev,
+ struct amdgpu_bo *bo);
  void *amdgpu_gem_prime_vmap(struct drm_gem_object *obj);
  void amdgpu_gem_prime_vunmap(struct drm_gem_object *obj, void *vaddr);
  int amdgpu_gem_prime_mmap(struct drm_gem_object *obj,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 71e005cf2952..771c27478bb1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -28,6 +28,7 @@
 #include <linux/dma-fence-array.h>
 #include <linux/interval_tree_generic.h>
 #include <linux/idr.h>
+#include <linux/dma-buf.h>
  
 #include <drm/amdgpu_drm.h>

  #include "amdgpu.h"
@@ -35,6 +36,7 @@
  #include "amdgpu_amdkfd.h"
  #include "amdgpu_gmc.h"
  #include "amdgpu_xgmi.h"
+#include "amdgpu_dma_buf.h"
  
  /**

   * DOC: GPUVM
@@ -1778,15 +1780,24 @@ int amdgpu_vm_bo_update(struct amdgpu_device *adev, 
struct amdgpu_bo_va *bo_va,
nodes = NULL;
resv = vm->root.base.bo->tbo.base.resv;
} else {
+   struct drm_gem_object *obj = &bo->tbo.base;
struct ttm_dma_tt *ttm;
  
+		resv = bo->tbo.base.resv;

+   if (obj->import_attach && bo_va->is_xgmi) {
+   struct dma_buf *dma_buf = obj->import_attach->dmabuf;
+   struct drm_gem_object *gobj = dma_buf->priv;
+   struct amdgpu_bo *abo = gem_to_amdgpu_bo(gobj);
+
+   if (abo->tbo.mem.mem_type == TTM_PL_VRAM)
+   bo = gem_to_amdgpu_bo(gobj);
+   }
mem = &bo->tbo.mem;
nodes = mem->mm_node;
if (mem->mem_type == TTM_PL_TT) {
ttm = container_of(bo->tbo.ttm, struct ttm_dma_tt, ttm);
pages_addr = ttm->dma_address;
}
-   resv = bo->tbo.base.resv;
}
  
  	if (bo) {

@@ -2132,8 +2143,10 @@ struct amdgpu_bo_va *amdgpu_vm_bo_add(struct 
amdgpu_device *adev,
INIT_LIST_HEAD(&bo_va->valids);
INIT_LIST_HEAD(&bo_va->invalids);
  
-	if (bo && amdgpu_xgmi_same_hive(adev, amdgpu_ttm_adev(bo->tbo.bdev)) &&

-   (bo->preferred_domains & AMDGPU_GEM_DOMAIN_VRAM)) {
+   if (!bo)
+   return bo_va;
+
+   if (amdgpu_dmabuf_is_xgmi_accessible(adev, bo)) {
bo_va->is_xgmi = true;
/* Power up XGMI if it can be potentially used */
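
Condensed from the amdgpu_vm_bo_add() hunks above, the new helper slots in
like this (fragment only; the surrounding allocation and list setup are
elided):

	if (!bo)
		return bo_va;

	if (amdgpu_dmabuf_is_xgmi_accessible(adev, bo)) {
		bo_va->is_xgmi = true;
		/* Power up XGMI if it can be potentially used */
		amdgpu_xgmi_set_pstate(adev, AMDGPU_XGMI_PSTATE_MAX_VEGA20);
	}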

[PATCH] drm/amdgpu: Enable P2P dmabuf over XGMI

2020-08-06 Thread Arunpravin
Access the exported P2P dmabuf over XGMI, if available.
Otherwise, fall back to the existing PCIe method.

Signed-off-by: Arunpravin 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 34 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.h |  2 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c  | 19 ++--
 3 files changed, 52 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
index ffeb20f11c07..589d008f91df 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
@@ -35,6 +35,7 @@
 #include "amdgpu_display.h"
 #include "amdgpu_gem.h"
 #include "amdgpu_dma_buf.h"
+#include "amdgpu_xgmi.h"
 #include <drm/amdgpu_drm.h>
 #include <linux/dma-buf.h>
 #include <linux/dma-fence-array.h>
@@ -560,3 +561,36 @@ struct drm_gem_object *amdgpu_gem_prime_import(struct 
drm_device *dev,
obj->import_attach = attach;
return obj;
 }
+
+/**
+ * amdgpu_dmabuf_is_xgmi_accessible - Check if xgmi available for P2P transfer
+ *
+ * @adev: amdgpu_device pointer of the importer
+ * @bo: amdgpu buffer object
+ *
+ * Returns:
+ * True if dmabuf accessible over xgmi, false otherwise.
+ */
+bool amdgpu_dmabuf_is_xgmi_accessible(struct amdgpu_device *adev,
+ struct amdgpu_bo *bo)
+{
+   struct drm_gem_object *obj = &bo->tbo.base;
+   struct drm_gem_object *gobj;
+
+   if (obj->import_attach) {
+   struct dma_buf *dma_buf = obj->import_attach->dmabuf;
+
+   if (dma_buf->ops != &amdgpu_dmabuf_ops)
+   /* No XGMI with non AMD GPUs */
+   return false;
+
+   gobj = dma_buf->priv;
+   bo = gem_to_amdgpu_bo(gobj);
+   }
+
+   if (amdgpu_xgmi_same_hive(adev, amdgpu_ttm_adev(bo->tbo.bdev)) &&
+   (bo->preferred_domains & AMDGPU_GEM_DOMAIN_VRAM))
+   return true;
+
+   return false;
+}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.h
index ec447a7b6b28..2c5c84a06bb9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.h
@@ -29,6 +29,8 @@ struct dma_buf *amdgpu_gem_prime_export(struct drm_gem_object 
*gobj,
int flags);
 struct drm_gem_object *amdgpu_gem_prime_import(struct drm_device *dev,
struct dma_buf *dma_buf);
+bool amdgpu_dmabuf_is_xgmi_accessible(struct amdgpu_device *adev,
+ struct amdgpu_bo *bo);
 void *amdgpu_gem_prime_vmap(struct drm_gem_object *obj);
 void amdgpu_gem_prime_vunmap(struct drm_gem_object *obj, void *vaddr);
 int amdgpu_gem_prime_mmap(struct drm_gem_object *obj,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 71e005cf2952..771c27478bb1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -28,6 +28,7 @@
 #include <linux/dma-fence-array.h>
 #include <linux/interval_tree_generic.h>
 #include <linux/idr.h>
+#include <linux/dma-buf.h>
 
 #include <drm/amdgpu_drm.h>
 #include "amdgpu.h"
@@ -35,6 +36,7 @@
 #include "amdgpu_amdkfd.h"
 #include "amdgpu_gmc.h"
 #include "amdgpu_xgmi.h"
+#include "amdgpu_dma_buf.h"
 
 /**
  * DOC: GPUVM
@@ -1778,15 +1780,24 @@ int amdgpu_vm_bo_update(struct amdgpu_device *adev, 
struct amdgpu_bo_va *bo_va,
nodes = NULL;
resv = vm->root.base.bo->tbo.base.resv;
} else {
+   struct drm_gem_object *obj = &bo->tbo.base;
struct ttm_dma_tt *ttm;
 
+   resv = bo->tbo.base.resv;
+   if (obj->import_attach && bo_va->is_xgmi) {
+   struct dma_buf *dma_buf = obj->import_attach->dmabuf;
+   struct drm_gem_object *gobj = dma_buf->priv;
+   struct amdgpu_bo *abo = gem_to_amdgpu_bo(gobj);
+
+   if (abo->tbo.mem.mem_type == TTM_PL_VRAM)
+   bo = gem_to_amdgpu_bo(gobj);
+   }
mem = &bo->tbo.mem;
nodes = mem->mm_node;
if (mem->mem_type == TTM_PL_TT) {
ttm = container_of(bo->tbo.ttm, struct ttm_dma_tt, ttm);
pages_addr = ttm->dma_address;
}
-   resv = bo->tbo.base.resv;
}
 
if (bo) {
@@ -2132,8 +2143,10 @@ struct amdgpu_bo_va *amdgpu_vm_bo_add(struct 
amdgpu_device *adev,
INIT_LIST_HEAD(&bo_va->valids);
INIT_LIST_HEAD(&bo_va->invalids);
 
-   if (bo && amdgpu_xgmi_same_hive(adev, amdgpu_ttm_adev(bo->tbo.bdev)) &&
-   (bo->preferred_domains & AMDGPU_GEM_DOMAIN_VRAM)) {
+   if (!bo)
+   return bo_va;
+
+   if (amdgpu_dmabuf_is_xgmi_accessible(adev, bo)) {
bo_va->is_xgmi = true;
/* Power up XGMI if it can be potentially used */
amdgpu_xgmi_set_pstate(adev, AMDGPU_XGMI_PSTATE_MAX_VEGA20);
-- 
2.25.1


RE: [PATCH] drm/amd/mst: clean DP main link status only when unplug mst 1st link

2020-08-06 Thread Lin, Wayne
[AMD Public Use]



> -Original Message-
> From: Siqueira, Rodrigo 
> Sent: Wednesday, August 5, 2020 10:55 PM
> To: Lin, Wayne 
> Cc: amd-gfx@lists.freedesktop.org; Lakha, Bhawanpreet 
> ; Wu, Hersen ; 
> Kazlauskas, Nicholas ; Zuo, Jerry 
> 
> Subject: Re: [PATCH] drm/amd/mst: clean DP main link status only when 
> unplug mst 1st link
> 
> On 08/04, Wayne Lin wrote:
> > [Why]
> > Under a DP daisy-chain scenario as below:
> >
> > Src - Monitor_1 - Monitor_2
> >
> > If we unplug the 2nd monitor (Monitor_2) and plug it in again, we
> > observe that Monitor_1 doesn't light up.
> >
> > When we unplug the 2nd monitor, we clear
> > dc_link->cur_link_settings.lane_count in dm_dp_destroy_mst_connector().
> > However, this link status is a data structure shared by all connected
> > MST monitors. Although the 2nd monitor is gone, this link status
> > should still be retained for the other connected MST monitors.
> > Otherwise, when we plug the 2nd monitor in again, we find that the
> > main link is not trained and do link training again; the Payload ID
> > Table for Monitor_1 is then ruined and we don't reallocate it.
> >
> > [How]
> > In dm_dp_destroy_mst_connector(), only clear cur_link_settings
> > when we are no longer in MST mode.
> >
> > Signed-off-by: Wayne Lin 
> > ---
> >  drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_mst_types.c | 5
> -
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git
> > a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_mst_types.c
> > b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_mst_types.c
> > index 2c10352fa514..526f29598403 100644
> > --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_mst_types.c
> > +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_mst_types.c
> > @@ -415,7 +415,10 @@ static void dm_dp_destroy_mst_connector(struct
> drm_dp_mst_topology_mgr *mgr,
> >aconnector->dc_sink);
> > dc_sink_release(aconnector->dc_sink);
> > aconnector->dc_sink = NULL;
> > -   aconnector->dc_link->cur_link_settings.lane_count = 0;
> > +   mutex_lock(&mgr->lock);
> > +   if (!mgr->mst_state)
> > +   aconnector->dc_link->cur_link_settings.lane_count = 0;
> > +   mutex_unlock(&mgr->lock);
> Hi Wayne,
> 
> The change looks good to me.
> 
> Reviewed-by: Rodrigo Siqueira 
> 
> Just for curiosity, do you know why we use a mutex instead of 
> spin_lock_irq for this case? FWIU, the spin_lock_irq behavior looks 
> better for this sort of manipulation.

Hi Siqueira,

Thanks for your time.
AFAIK, changing mst_state (e.g. enabling MST mode) involves reading/writing
DPCD registers over the AUX channel, which can sleep and take some time, so a
sleeping lock is needed here rather than a spinlock.
> 
> Thanks
> 
> > }
> >
> > drm_connector_unregister(connector);
> > --
> > 2.17.1
> >
> 
> --
> Rodrigo Siqueira
> https://siqueira.tech
--
Wayne Lin
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] drm/amdgpu: annotate a false positive recursive locking

2020-08-06 Thread Christian König
Preventing locking problems during implementation is obviously a good 
approach, but lockdep has proven to be massively useful for finding and 
fixing problems.


Disabling lockdep splat by annotating lock with separate classes is 
usually a no-go and only allowed if there is no other potential approach.


In this case here we should really clean things up instead.

Christian.

On 06.08.20 at 09:44, Li, Dennis wrote:

[AMD Official Use Only - Internal Distribution Only]

Hi, Christian,
   I agree with your concern. However, we shouldn't rely on the system to
detect deadlocks; we should consider this when doing the code implementation.
In fact, dead-lock detection isn't enabled by default.
   As for your proposal to move reset_sem into the hive structure, we can open
a new topic to discuss it. Currently we can't be sure which is the best
solution. For example, with your proposal, we must wait for all devices to
resume successfully before resubmitting an old task on one device, which will
affect performance.

Best Regards
Dennis Li
-Original Message-
From: amd-gfx  On Behalf Of Christian 
König
Sent: Thursday, August 6, 2020 3:08 PM
To: Li, Dennis ; amd-gfx@lists.freedesktop.org; Deucher, Alexander 
; Kuehling, Felix ; Zhang, Hawking 

Subject: Re: [PATCH] drm/amdgpu: annotate a false positive recursive locking

On 06.08.20 at 09:02, Dennis Li wrote:

[  584.110304] 
[  584.110590] WARNING: possible recursive locking detected
[  584.110876] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G   OE
[  584.64] 
[  584.111456] kworker/38:1/553 is trying to acquire lock:
[  584.111721] 9b15ff0a47a0 (>reset_sem){}, at:
amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.112112]
 but task is already holding lock:
[  584.112673] 9b1603d247a0 (>reset_sem){}, at:
amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.113068]
 other info that might help us debug this:
[  584.113689]  Possible unsafe locking scenario:

[  584.114350]CPU0
[  584.114685]
[  584.115014]   lock(&adev->reset_sem);
[  584.115349]   lock(&adev->reset_sem);
[  584.115678]
  *** DEADLOCK ***

[  584.116624]  May be due to missing lock nesting notation

[  584.117284] 4 locks held by kworker/38:1/553:
[  584.117616]  #0: 9ad635c1d348 ((wq_completion)events){+.+.},
at: process_one_work+0x21f/0x630 [  584.117967]  #1: ac708e1c3e58
((work_completion)(>recovery_work)){+.+.}, at:
process_one_work+0x21f/0x630 [  584.118358]  #2: c1c2a5d0
(>hive_lock){+.+.}, at: amdgpu_device_gpu_recover+0xae/0x1030 [amdgpu] [  
584.118786]  #3: 9b1603d247a0 (>reset_sem){}, at: 
amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.119222]
 stack backtrace:
[  584.119990] CPU: 38 PID: 553 Comm: kworker/38:1 Kdump: loaded Tainted: G 
  OE 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1
[  584.120782] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT,
BIOS 3.1 05/23/2019 [  584.121223] Workqueue: events
amdgpu_ras_do_recovery [amdgpu] [  584.121638] Call Trace:
[  584.122050]  dump_stack+0x98/0xd5
[  584.122499]  __lock_acquire+0x1139/0x16e0 [  584.122931]  ?
trace_hardirqs_on+0x3b/0xf0 [  584.123358]  ?
cancel_delayed_work+0xa6/0xc0 [  584.123771]  lock_acquire+0xb8/0x1c0
[  584.124197]  ? amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [
584.124599]  down_write+0x49/0x120 [  584.125032]  ?
amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.125472]
amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [  584.125910]  ?
amdgpu_ras_error_query+0x1b8/0x2a0 [amdgpu] [  584.126367]
amdgpu_ras_do_recovery+0x159/0x190 [amdgpu] [  584.126789]
process_one_work+0x29e/0x630 [  584.127208]  worker_thread+0x3c/0x3f0
[  584.127621]  ? __kthread_parkme+0x61/0x90 [  584.128014]
kthread+0x12f/0x150 [  584.128402]  ? process_one_work+0x630/0x630 [
584.128790]  ? kthread_park+0x90/0x90 [  584.129174]
ret_from_fork+0x3a/0x50

Give each adev its own lock_class_key to avoid false-positive recursive
locking reports.

NAK, that is not a false positive but a real problem.

The issue here is that we have multiple reset semaphores, one for each device 
in the hive. If those are not acquired in the correct order we deadlock.

The real solution would be to move the reset_sem into the hive structure and 
make sure that we lock it only once.

Christian.
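
(A rough sketch of this suggestion, with illustrative names rather than
the actual amdgpu structures: one reset semaphore owned by the hive, so
recovery locks once instead of taking one semaphore per device.)

#include <linux/rwsem.h>

struct hive_sketch {
	struct rw_semaphore reset_sem;  /* shared by every device in the hive */
};

static void hive_recover_sketch(struct hive_sketch *hive)
{
	down_write(&hive->reset_sem);   /* one lock, taken exactly once */
	/* ... reset all devices in the hive ... */
	up_write(&hive->reset_sem);
}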


Signed-off-by: Dennis Li 
Change-Id: I7571efeccbf15483982031d00504a353031a854a

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index e97c088d03b3..766dc8f8c8a0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -967,6 +967,7 @@ struct amdgpu_device {
atomic_t in_gpu_reset;
enum pp_mp1_state   mp1_state;
struct rw_semaphore reset_sem;
+   struct lock_class_key lock_key;
struct 

RE: [PATCH] drm/amdgpu: annotate a false positive recursive locking

2020-08-06 Thread Li, Dennis
[AMD Official Use Only - Internal Distribution Only]

Hi, Christian,
  I agree with your concern. However, we shouldn't rely on the system to 
detect deadlocks; we should take this into account during implementation. In 
fact, deadlock detection isn't enabled by default. 
  As for your proposal to move reset_sem into the hive structure, we can 
open a new topic to discuss it. Currently we can't be sure which solution is 
best. For example, with your proposal, we must wait for all devices to resume 
successfully before resubmitting an old task on one device, which will 
affect performance. 

Best Regards
Dennis Li
-----Original Message-----
From: amd-gfx  On Behalf Of Christian 
König
Sent: Thursday, August 6, 2020 3:08 PM
To: Li, Dennis ; amd-gfx@lists.freedesktop.org; Deucher, 
Alexander ; Kuehling, Felix 
; Zhang, Hawking 
Subject: Re: [PATCH] drm/amdgpu: annotate a false positive recursive locking

On 06.08.20 at 09:02, Dennis Li wrote:
> [  584.110304] ============================================
> [  584.110590] WARNING: possible recursive locking detected
> [  584.110876] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G   OE
> [  584.111164] --------------------------------------------
> [  584.111456] kworker/38:1/553 is trying to acquire lock:
> [  584.111721] ffff9b15ff0a47a0 (&adev->reset_sem){++++}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
> [  584.112112]
>    but task is already holding lock:
> [  584.112673] ffff9b1603d247a0 (&adev->reset_sem){++++}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
> [  584.113068]
>    other info that might help us debug this:
> [  584.113689]  Possible unsafe locking scenario:
>
> [  584.114350]        CPU0
> [  584.114685]        ----
> [  584.115014]   lock(&adev->reset_sem);
> [  584.115349]   lock(&adev->reset_sem);
> [  584.115678]
>     *** DEADLOCK ***
>
> [  584.116624]  May be due to missing lock nesting notation
>
> [  584.117284] 4 locks held by kworker/38:1/553:
> [  584.117616]  #0: ffff9ad635c1d348 ((wq_completion)events){+.+.}, at: process_one_work+0x21f/0x630
> [  584.117967]  #1: ffffac708e1c3e58 ((work_completion)(&con->recovery_work)){+.+.}, at: process_one_work+0x21f/0x630
> [  584.118358]  #2: ffffffffc1c2a5d0 (&hive->hive_lock){+.+.}, at: amdgpu_device_gpu_recover+0xae/0x1030 [amdgpu]
> [  584.118786]  #3: ffff9b1603d247a0 (&adev->reset_sem){++++}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
> [  584.119222]
>    stack backtrace:
> [  584.119990] CPU: 38 PID: 553 Comm: kworker/38:1 Kdump: loaded Tainted: G   OE 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1
> [  584.120782] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, BIOS 3.1 05/23/2019
> [  584.121223] Workqueue: events amdgpu_ras_do_recovery [amdgpu]
> [  584.121638] Call Trace:
> [  584.122050]  dump_stack+0x98/0xd5
> [  584.122499]  __lock_acquire+0x1139/0x16e0
> [  584.122931]  ? trace_hardirqs_on+0x3b/0xf0
> [  584.123358]  ? cancel_delayed_work+0xa6/0xc0
> [  584.123771]  lock_acquire+0xb8/0x1c0
> [  584.124197]  ? amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
> [  584.124599]  down_write+0x49/0x120
> [  584.125032]  ? amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
> [  584.125472]  amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
> [  584.125910]  ? amdgpu_ras_error_query+0x1b8/0x2a0 [amdgpu]
> [  584.126367]  amdgpu_ras_do_recovery+0x159/0x190 [amdgpu]
> [  584.126789]  process_one_work+0x29e/0x630
> [  584.127208]  worker_thread+0x3c/0x3f0
> [  584.127621]  ? __kthread_parkme+0x61/0x90
> [  584.128014]  kthread+0x12f/0x150
> [  584.128402]  ? process_one_work+0x630/0x630
> [  584.128790]  ? kthread_park+0x90/0x90
> [  584.129174]  ret_from_fork+0x3a/0x50
>
> Give each adev its own lock_class_key to avoid false-positive recursive
> locking reports.

NAK, that is not a false positive but a real problem.

The issue here is that we have multiple reset semaphores, one for each device 
in the hive. If those are not acquired in the correct order we deadlock.

The real solution would be to move the reset_sem into the hive structure and 
make sure that we lock it only once.

Christian.

>
> Signed-off-by: Dennis Li 
> Change-Id: I7571efeccbf15483982031d00504a353031a854a
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index e97c088d03b3..766dc8f8c8a0 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -967,6 +967,7 @@ struct amdgpu_device {
>   atomic_t in_gpu_reset;
>   enum pp_mp1_state   mp1_state;
>   struct rw_semaphore reset_sem;
> + struct lock_class_key lock_key;
>   struct amdgpu_doorbell_index doorbell_index;
>   
>   struct mutex notifier_lock;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 6c572db42d92..d78df9312d34 100644
> --- 

Re: [PATCH] drm/amdgpu: fix spelling mistake "Falied" -> "Failed"

2020-08-06 Thread Stephen Rothwell
Hi all,

On Wed, 05 Aug 2020 15:19:38 -0700 Joe Perches  wrote:
>
> On Wed, 2020-08-05 at 17:27 -0400, Alex Deucher wrote:
> > On Wed, Aug 5, 2020 at 4:53 PM Joe Perches  wrote:  
> > > On Wed, 2020-08-05 at 16:01 -0400, Alex Deucher wrote:  
> > > > On Wed, Aug 5, 2020 at 7:35 AM Colin King  
> > > > wrote:  
> > > > > From: Colin Ian King 
> > > > > 
> > > > > There is a spelling mistake in a DRM_ERROR message. Fix it.
> > > > > 
> > > > > Signed-off-by: Colin Ian King   
> > > > 
> > > > This is already fixed.  
> > > 
> > > This fix is not in today's -next.
> > > 
> > > Perhaps whatever tree it's fixed in should be in -next.
> > >   
> > 
> > Weird.  It's in the drm-next tree as:
> > 
> > commit 4afaa61db9cf5250b5734c2531b226e7b3a3d691
> > Author: Colin Ian King 
> > Date:   Fri Jul 10 09:37:58 2020 +0100
> > 
> > drm/amdgpu: fix spelling mistake "Falied" -> "Failed"
> > 
> > There is a spelling mistake in a DRM_ERROR error message. Fix it.
> > 
> > Signed-off-by: Colin Ian King 
> > Signed-off-by: Alex Deucher 
> > 
> > Alex
> >   
> > > $ git show --oneline -s
> > > d15fe4ec0435 (HEAD, tag: next-20200805, origin/master, origin/HEAD) Add 
> > > linux-next specific files for 20200805
> > > 
> > > $ git grep -i falied drivers
> > > drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c:DRM_ERROR("Falied 
> > > to terminate tmr\n");
> > >   
> > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c 
> > > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c  
> > > []  
> > > > > @@ -2010,7 +2010,7 @@ static int psp_suspend(void *handle)
> > > > > 
> > > > > ret = psp_tmr_terminate(psp);
> > > > > if (ret) {
> > > > > -   DRM_ERROR("Falied to terminate tmr\n");
> > > > > +   DRM_ERROR("Failed to terminate tmr\n");
> > > > > return ret;
> > > > > }  
> 
> Dunno.
> 
> Maybe it's due to some ordering of trees in
> how -next accumulates patches?

The spelling error is introduced in two commits:

  c564b8601ae9 ("drm/amdgpu: add TMR destory function for psp")

in Linus' tree between v5.8-rc4 and rc5

  90937420c44f ("drm/amdgpu: add TMR destory function for psp")

in the amdgpu tree between two merges by the drm tree.  In this same
interval, the error is corrected by commit

  4afaa61db9cf ("drm/amdgpu: fix spelling mistake "Falied" -> "Failed"")

so when David comes to merge the amdgpu tree in commit

  206739119508 ("Merge tag 'amd-drm-next-5.9-2020-07-17' of 
git://people.freedesktop.org/~agd5f/linux into drm-next")

the spelling error has been introduced on one side of the merge and
introduced and corrected on the other.  This would have produced a
conflict which David presumably resolved in haste by picking the HEAD
side of the merge instead of the MERGE_HEAD side (it happens).

This could have been avoided by not cherry-picking fix commits around
in the amdgpu process - instead having a fixes branch that is merged
into the next branch after the fixes branch has been accepted upstream
(that way there is only one commit for each fix and fewer conflicts).

I have to deal with this sort of conflict (sometimes daily) due to
the drm processes.  It's a pain, as I have to track down each conflict to
see if the same patches appear on both sides of merges and then try to
figure out what other changes occur.  (This is only slightly helped by
having the "cherry-picked from" tags in the fix commits.)
-- 
Cheers,
Stephen Rothwell




[PATCH] drm/amdgpu: Skip some registers config for SRIOV

2020-08-06 Thread Liu ChengZhe
For a VF, the registers L2_CNTL, L2_CONTEXT1_IDENTITY_APERTURE and
L2_PROTECTION_FAULT_CNTL are not accessible, so skip their
configuration in SRIOV mode.

Signed-off-by: Liu ChengZhe 
---
 drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c | 12 ++--
 drivers/gpu/drm/amd/amdgpu/mmhub_v2_0.c  | 12 ++--
 2 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c 
b/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c
index 1f6112b7fa49..6b96f45fde2a 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c
@@ -330,10 +330,13 @@ int gfxhub_v2_1_gart_enable(struct amdgpu_device *adev)
gfxhub_v2_1_init_gart_aperture_regs(adev);
gfxhub_v2_1_init_system_aperture_regs(adev);
gfxhub_v2_1_init_tlb_regs(adev);
-   gfxhub_v2_1_init_cache_regs(adev);
+   if (!amdgpu_sriov_vf(adev))
+   gfxhub_v2_1_init_cache_regs(adev);
 
gfxhub_v2_1_enable_system_domain(adev);
-   gfxhub_v2_1_disable_identity_aperture(adev);
+   if (!amdgpu_sriov_vf(adev))
+   gfxhub_v2_1_disable_identity_aperture(adev);
+
gfxhub_v2_1_setup_vmid_config(adev);
gfxhub_v2_1_program_invalidation(adev);
 
@@ -372,7 +375,12 @@ void gfxhub_v2_1_gart_disable(struct amdgpu_device *adev)
 void gfxhub_v2_1_set_fault_enable_default(struct amdgpu_device *adev,
  bool value)
 {
u32 tmp;
+
+   /* These regs are not accessible for VF, PF will program them in SRIOV */
+   if (amdgpu_sriov_vf(adev))
+       return;
+
tmp = RREG32_SOC15(GC, 0, mmGCVM_L2_PROTECTION_FAULT_CNTL);
tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
RANGE_PROTECTION_FAULT_ENABLE_DEFAULT, value);
diff --git a/drivers/gpu/drm/amd/amdgpu/mmhub_v2_0.c 
b/drivers/gpu/drm/amd/amdgpu/mmhub_v2_0.c
index d83912901f73..9cfde9b81600 100644
--- a/drivers/gpu/drm/amd/amdgpu/mmhub_v2_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/mmhub_v2_0.c
@@ -321,10 +321,13 @@ int mmhub_v2_0_gart_enable(struct amdgpu_device *adev)
mmhub_v2_0_init_gart_aperture_regs(adev);
mmhub_v2_0_init_system_aperture_regs(adev);
mmhub_v2_0_init_tlb_regs(adev);
-   mmhub_v2_0_init_cache_regs(adev);
+   if (!amdgpu_sriov_vf(adev))
+   mmhub_v2_0_init_cache_regs(adev);
 
mmhub_v2_0_enable_system_domain(adev);
-   mmhub_v2_0_disable_identity_aperture(adev);
+   if (!amdgpu_sriov_vf(adev))
+   mmhub_v2_0_disable_identity_aperture(adev);
+
mmhub_v2_0_setup_vmid_config(adev);
mmhub_v2_0_program_invalidation(adev);
 
@@ -364,7 +367,12 @@ void mmhub_v2_0_gart_disable(struct amdgpu_device *adev)
  */
 void mmhub_v2_0_set_fault_enable_default(struct amdgpu_device *adev, bool 
value)
 {
u32 tmp;
+
+   /* These regs are not accessible for VF, PF will program them in SRIOV */
+   if (amdgpu_sriov_vf(adev))
+       return;
+
tmp = RREG32_SOC15(MMHUB, 0, mmMMVM_L2_PROTECTION_FAULT_CNTL);
tmp = REG_SET_FIELD(tmp, MMVM_L2_PROTECTION_FAULT_CNTL,
RANGE_PROTECTION_FAULT_ENABLE_DEFAULT, value);
-- 
2.25.1



Re: [PATCH] drm/amdgpu: fix spelling mistake "Falied" -> "Failed"

2020-08-06 Thread daniel
On Thu, Aug 06, 2020 at 09:36:41AM +1000, Stephen Rothwell wrote:
> Hi all,
> 
> On Wed, 05 Aug 2020 15:19:38 -0700 Joe Perches  wrote:
> >
> > On Wed, 2020-08-05 at 17:27 -0400, Alex Deucher wrote:
> > > On Wed, Aug 5, 2020 at 4:53 PM Joe Perches  wrote:  
> > > > On Wed, 2020-08-05 at 16:01 -0400, Alex Deucher wrote:  
> > > > > On Wed, Aug 5, 2020 at 7:35 AM Colin King  
> > > > > wrote:  
> > > > > > From: Colin Ian King 
> > > > > > 
> > > > > > There is a spelling mistake in a DRM_ERROR message. Fix it.
> > > > > > 
> > > > > > Signed-off-by: Colin Ian King   
> > > > > 
> > > > > This is already fixed.  
> > > > 
> > > > This fix is not in today's -next.
> > > > 
> > > > Perhaps whatever tree it's fixed in should be in -next.
> > > >   
> > > 
> > > Weird.  It's in the drm-next tree as:
> > > 
> > > commit 4afaa61db9cf5250b5734c2531b226e7b3a3d691
> > > Author: Colin Ian King 
> > > Date:   Fri Jul 10 09:37:58 2020 +0100
> > > 
> > > drm/amdgpu: fix spelling mistake "Falied" -> "Failed"
> > > 
> > > There is a spelling mistake in a DRM_ERROR error message. Fix it.
> > > 
> > > Signed-off-by: Colin Ian King 
> > > Signed-off-by: Alex Deucher 
> > > 
> > > Alex
> > >   
> > > > $ git show --oneline -s
> > > > d15fe4ec0435 (HEAD, tag: next-20200805, origin/master, origin/HEAD) Add 
> > > > linux-next specific files for 20200805
> > > > 
> > > > $ git grep -i falied drivers
> > > > drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c:
> > > > DRM_ERROR("Falied to terminate tmr\n");
> > > >   
> > > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c 
> > > > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c  
> > > > []  
> > > > > > @@ -2010,7 +2010,7 @@ static int psp_suspend(void *handle)
> > > > > > 
> > > > > > ret = psp_tmr_terminate(psp);
> > > > > > if (ret) {
> > > > > > -   DRM_ERROR("Falied to terminate tmr\n");
> > > > > > +   DRM_ERROR("Failed to terminate tmr\n");
> > > > > > return ret;
> > > > > > }  
> > 
> > Dunno.
> > 
> > Maybe it's due to some ordering of trees in
> > how -next accumulates patches?
> 
> The spelling error is introduced in two commits:
> 
>   c564b8601ae9 ("drm/amdgpu: add TMR destory function for psp")
> 
> in Linus' tree between v5.8-rc4 and rc5
> 
>   90937420c44f ("drm/amdgpu: add TMR destory function for psp")
> 
> in the amdgpu tree between two merges by the drm tree.  In this same
> interval, the error is corrected by commit
> 
>   4afaa61db9cf ("drm/amdgpu: fix spelling mistake "Falied" -> "Failed"")
> 
> so when David comes to merge the amdgpu tree in commit
> 
>   206739119508 ("Merge tag 'amd-drm-next-5.9-2020-07-17' of 
> git://people.freedesktop.org/~agd5f/linux into drm-next")
> 
> the spelling error has been introduced on one side of the merge and
> introduced and corrected on the other.  This would have produced a
> conflict which David presumably resolved in haste by picking the HEAD
> side of the merge instead of the MERGE_HEAD side (it happens).
> 
> This could have been avoided by not cherry-picking fix commits around
> in the amdgpu process - instead having a fixes branch that is merged
> into the next branch after the fixes branch has been accepted upstream
> (that way there is only one commit for each fix and fewer conflicts).
> 
> I have to deal with this sort of conflict (sometimes daily) due to
> the drm processes.  It's a pain, as I have to track down each conflict to
> see if the same patches appear on both sides of merges and then try to
> figure out what other changes occur.  (This is only slightly helped by
> having the "cherry-picked from" tags in the fix commits.)

Yeah cherry-picking breaks if you only occasionally cherry-pick - then
stuff gets lost.

I'd say just apply it once more for the merge window fixes pull.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [PATCH] drm/amdgpu: annotate a false positive recursive locking

2020-08-06 Thread Christian König

On 06.08.20 at 09:02, Dennis Li wrote:

[  584.110304] ============================================
[  584.110590] WARNING: possible recursive locking detected
[  584.110876] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G   OE
[  584.111164] --------------------------------------------
[  584.111456] kworker/38:1/553 is trying to acquire lock:
[  584.111721] ffff9b15ff0a47a0 (&adev->reset_sem){++++}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
[  584.112112]
   but task is already holding lock:
[  584.112673] ffff9b1603d247a0 (&adev->reset_sem){++++}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
[  584.113068]
   other info that might help us debug this:
[  584.113689]  Possible unsafe locking scenario:

[  584.114350]        CPU0
[  584.114685]        ----
[  584.115014]   lock(&adev->reset_sem);
[  584.115349]   lock(&adev->reset_sem);
[  584.115678]
    *** DEADLOCK ***

[  584.116624]  May be due to missing lock nesting notation

[  584.117284] 4 locks held by kworker/38:1/553:
[  584.117616]  #0: ffff9ad635c1d348 ((wq_completion)events){+.+.}, at: process_one_work+0x21f/0x630
[  584.117967]  #1: ffffac708e1c3e58 ((work_completion)(&con->recovery_work)){+.+.}, at: process_one_work+0x21f/0x630
[  584.118358]  #2: ffffffffc1c2a5d0 (&hive->hive_lock){+.+.}, at: amdgpu_device_gpu_recover+0xae/0x1030 [amdgpu]
[  584.118786]  #3: ffff9b1603d247a0 (&adev->reset_sem){++++}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
[  584.119222]
   stack backtrace:
[  584.119990] CPU: 38 PID: 553 Comm: kworker/38:1 Kdump: loaded Tainted: G   OE 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1
[  584.120782] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, BIOS 3.1 05/23/2019
[  584.121223] Workqueue: events amdgpu_ras_do_recovery [amdgpu]
[  584.121638] Call Trace:
[  584.122050]  dump_stack+0x98/0xd5
[  584.122499]  __lock_acquire+0x1139/0x16e0
[  584.122931]  ? trace_hardirqs_on+0x3b/0xf0
[  584.123358]  ? cancel_delayed_work+0xa6/0xc0
[  584.123771]  lock_acquire+0xb8/0x1c0
[  584.124197]  ? amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
[  584.124599]  down_write+0x49/0x120
[  584.125032]  ? amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
[  584.125472]  amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
[  584.125910]  ? amdgpu_ras_error_query+0x1b8/0x2a0 [amdgpu]
[  584.126367]  amdgpu_ras_do_recovery+0x159/0x190 [amdgpu]
[  584.126789]  process_one_work+0x29e/0x630
[  584.127208]  worker_thread+0x3c/0x3f0
[  584.127621]  ? __kthread_parkme+0x61/0x90
[  584.128014]  kthread+0x12f/0x150
[  584.128402]  ? process_one_work+0x630/0x630
[  584.128790]  ? kthread_park+0x90/0x90
[  584.129174]  ret_from_fork+0x3a/0x50

Give each adev its own lock_class_key to avoid false-positive
recursive locking reports.


NAK, that is not a false positive but a real problem.

The issue here is that we have multiple reset semaphores, one for each 
device in the hive. If those are not acquired in the correct order we 
deadlock.


The real solution would be to move the reset_sem into the hive structure 
and make sure that we lock it only once.


Christian.
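
(If the per-device semaphores were kept, the usual way to avoid the ABBA
deadlock described above is a global acquisition order plus an explicit
nesting annotation. A sketch with illustrative names, not amdgpu code:)

#include <linux/kernel.h>
#include <linux/lockdep.h>
#include <linux/rwsem.h>

/*
 * Acquire two write locks in ascending address order and tell lockdep
 * that the second acquisition is intentionally nested.
 */
static void lock_pair_sketch(struct rw_semaphore *a, struct rw_semaphore *b)
{
	if (a > b)
		swap(a, b);
	down_write(a);
	down_write_nested(b, SINGLE_DEPTH_NESTING);
}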



Signed-off-by: Dennis Li 
Change-Id: I7571efeccbf15483982031d00504a353031a854a

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index e97c088d03b3..766dc8f8c8a0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -967,6 +967,7 @@ struct amdgpu_device {
atomic_t in_gpu_reset;
enum pp_mp1_state   mp1_state;
struct rw_semaphore reset_sem;
+   struct lock_class_key lock_key;
struct amdgpu_doorbell_index doorbell_index;
  
  	struct mutex			notifier_lock;

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 6c572db42d92..d78df9312d34 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3037,6 +3037,7 @@ int amdgpu_device_init(struct amdgpu_device *adev,
mutex_init(&adev->virt.vf_errors.lock);
hash_init(adev->mn_hash);
init_rwsem(&adev->reset_sem);
+   lockdep_set_class(&adev->reset_sem, &adev->lock_key);
atomic_set(&adev->in_gpu_reset, 0);
mutex_init(&adev->psp.mutex);
mutex_init(&adev->notifier_lock);


___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH] drm/amdgpu: annotate a false positive recursive locking

2020-08-06 Thread Dennis Li
[  584.110304] ============================================
[  584.110590] WARNING: possible recursive locking detected
[  584.110876] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G   OE
[  584.111164] --------------------------------------------
[  584.111456] kworker/38:1/553 is trying to acquire lock:
[  584.111721] ffff9b15ff0a47a0 (&adev->reset_sem){++++}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
[  584.112112]
   but task is already holding lock:
[  584.112673] ffff9b1603d247a0 (&adev->reset_sem){++++}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
[  584.113068]
   other info that might help us debug this:
[  584.113689]  Possible unsafe locking scenario:

[  584.114350]        CPU0
[  584.114685]        ----
[  584.115014]   lock(&adev->reset_sem);
[  584.115349]   lock(&adev->reset_sem);
[  584.115678]
    *** DEADLOCK ***

[  584.116624]  May be due to missing lock nesting notation

[  584.117284] 4 locks held by kworker/38:1/553:
[  584.117616]  #0: ffff9ad635c1d348 ((wq_completion)events){+.+.}, at: process_one_work+0x21f/0x630
[  584.117967]  #1: ffffac708e1c3e58 ((work_completion)(&con->recovery_work)){+.+.}, at: process_one_work+0x21f/0x630
[  584.118358]  #2: ffffffffc1c2a5d0 (&hive->hive_lock){+.+.}, at: amdgpu_device_gpu_recover+0xae/0x1030 [amdgpu]
[  584.118786]  #3: ffff9b1603d247a0 (&adev->reset_sem){++++}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
[  584.119222]
   stack backtrace:
[  584.119990] CPU: 38 PID: 553 Comm: kworker/38:1 Kdump: loaded Tainted: G   OE 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1
[  584.120782] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, BIOS 3.1 05/23/2019
[  584.121223] Workqueue: events amdgpu_ras_do_recovery [amdgpu]
[  584.121638] Call Trace:
[  584.122050]  dump_stack+0x98/0xd5
[  584.122499]  __lock_acquire+0x1139/0x16e0
[  584.122931]  ? trace_hardirqs_on+0x3b/0xf0
[  584.123358]  ? cancel_delayed_work+0xa6/0xc0
[  584.123771]  lock_acquire+0xb8/0x1c0
[  584.124197]  ? amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
[  584.124599]  down_write+0x49/0x120
[  584.125032]  ? amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
[  584.125472]  amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
[  584.125910]  ? amdgpu_ras_error_query+0x1b8/0x2a0 [amdgpu]
[  584.126367]  amdgpu_ras_do_recovery+0x159/0x190 [amdgpu]
[  584.126789]  process_one_work+0x29e/0x630
[  584.127208]  worker_thread+0x3c/0x3f0
[  584.127621]  ? __kthread_parkme+0x61/0x90
[  584.128014]  kthread+0x12f/0x150
[  584.128402]  ? process_one_work+0x630/0x630
[  584.128790]  ? kthread_park+0x90/0x90
[  584.129174]  ret_from_fork+0x3a/0x50

Give each adev its own lock_class_key to avoid false-positive
recursive locking reports.

Signed-off-by: Dennis Li 
Change-Id: I7571efeccbf15483982031d00504a353031a854a

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index e97c088d03b3..766dc8f8c8a0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -967,6 +967,7 @@ struct amdgpu_device {
atomic_t in_gpu_reset;
enum pp_mp1_state   mp1_state;
struct rw_semaphore reset_sem;
+   struct lock_class_key lock_key;
struct amdgpu_doorbell_index doorbell_index;
 
struct mutex notifier_lock;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 6c572db42d92..d78df9312d34 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3037,6 +3037,7 @@ int amdgpu_device_init(struct amdgpu_device *adev,
mutex_init(&adev->virt.vf_errors.lock);
hash_init(adev->mn_hash);
init_rwsem(&adev->reset_sem);
+   lockdep_set_class(&adev->reset_sem, &adev->lock_key);
atomic_set(&adev->in_gpu_reset, 0);
mutex_init(&adev->psp.mutex);
mutex_init(&adev->notifier_lock);
-- 
2.17.1
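
(For context, a sketch of the mechanism the patch relies on, with
illustrative types rather than the actual amdgpu code: lockdep folds every
lock initialized from the same init_rwsem() call site into one class, so
taking two different devices' reset_sem looks recursive. A per-instance
lock_class_key splits the class per device and silences the report --
though, as noted in the replies, it also hides the real ordering hazard.)

#include <linux/lockdep.h>
#include <linux/rwsem.h>

struct dev_sketch {
	struct rw_semaphore reset_sem;
	struct lock_class_key key;      /* one key per instance */
};

static void dev_sketch_init(struct dev_sketch *d)
{
	init_rwsem(&d->reset_sem);
	lockdep_set_class(&d->reset_sem, &d->key);  /* per-device lock class */
}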
