[PATCH] drm/amdgpu: enable support error injection broadcast to all instances

2021-06-11 Thread Dennis Li
When the address is -1, the TA will do error injection for all instances of the special SRAM. Signed-off-by: Dennis Li diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c index 885a78301bbf..c828ce9525d4 100644 --- a/drivers/gpu/drm/amd/amdgpu

[PATCH] drm/amdkfd: fix a resource leakage issue

2021-05-18 Thread Dennis Li
The function kfd_lookup_process_by_pasid increases the reference count of the kfd_process object, so its caller should call kfd_unref_process to decrease the reference count. Otherwise a resource leak will happen. Signed-off-by: Dennis Li diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_events.c b
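The lookup/unref pairing described in this patch can be sketched in plain userspace C. This is only an illustration under stated assumptions: `toy_process`, `toy_table`, `toy_lookup_by_pasid`, `toy_unref` and `toy_signal_event` are hypothetical names, and the plain integer counter stands in for whatever reference counting the kernel object actually uses.

```c
#include <stddef.h>

/* Hypothetical stand-in for a refcounted process object. */
struct toy_process {
    int pasid;
    int refcount;
};

/* Each entry starts with one reference owned by the table itself. */
struct toy_process toy_table[] = { { 1, 1 }, { 2, 1 } };

/* Lookup takes an extra reference on success; the caller owns it. */
struct toy_process *toy_lookup_by_pasid(int pasid)
{
    size_t i;

    for (i = 0; i < sizeof(toy_table) / sizeof(toy_table[0]); i++) {
        if (toy_table[i].pasid == pasid) {
            toy_table[i].refcount++;
            return &toy_table[i];
        }
    }
    return NULL;
}

/* Drop the reference taken by a successful lookup. */
void toy_unref(struct toy_process *p)
{
    if (p)
        p->refcount--;
}

/* Balanced caller: every successful lookup is paired with an unref. */
int toy_signal_event(int pasid)
{
    struct toy_process *p = toy_lookup_by_pasid(pasid);

    if (!p)
        return -1;
    /* ... use p while the reference is held ... */
    toy_unref(p);
    return 0;
}
```

If `toy_signal_event` returned without calling `toy_unref`, the counter would never fall back to its initial value and the object could never be released, which is the kind of leak the patch description refers to.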

[PATCH] drm/amdkfd: refine the poison data consumption handling

2021-05-11 Thread Dennis Li
context to re-dispatch works. Signed-off-by: Dennis Li diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_events.c b/drivers/gpu/drm/amd/amdkfd/kfd_events.c index ba2c2ce0c55a..4d210f23c33c 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_events.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_events.c @@ -1050,3

[PATCH] drm/amdgpu: add synchronization among waves in the same threadgroup

2021-05-10 Thread Dennis Li
It is possible that the previous waves have exited before the others are created, so the other waves may reuse physical resources left by the previous ones. Therefore add a barrier instruction to synchronize waves within the same threadgroup. Signed-off-by: Dennis Li diff --git a/drivers/gpu/drm/amd

[PATCH] drm/amdgpu: add function to clear MMEA error status for aldebaran

2021-05-10 Thread Dennis Li
For aldebaran, hardware will not clear the error status automatically when the error status register is read; instead, the driver should explicitly set the clear bit of the error status register to clear the error status. Signed-off-by: Dennis Li diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mmhub.h b/drivers

[PATCH] drm/amdgpu: correct the funtion to clear GCEA error status

2021-05-10 Thread Dennis Li
The bit 11 of GCEA_ERR_STATUS register is used to clear GCEA error status. Signed-off-by: Dennis Li diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_2.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_2.c index e943cd2923ac..c63599686708 100644 --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_2.c +++ b/drivers
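As a rough illustration of the bit manipulation this description implies, the sketch below computes a value with bit 11 set. The bit position comes from the patch description; the macro and function names are hypothetical, and the sketch assumes the clear is requested by writing the value back with that bit set. The driver's own register access helpers are not shown.

```c
#include <stdint.h>

/* Bit position taken from the patch description; the macro name is made up. */
#define TOY_GCEA_ERR_STATUS_CLEAR_BIT 11u

/* Return the register value with the (assumed) clear bit set. */
uint32_t toy_gcea_err_status_clear_value(uint32_t status)
{
    return status | (UINT32_C(1) << TOY_GCEA_ERR_STATUS_CLEAR_BIT);
}
```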

[PATCH] drm/amdgpu: covert ras status to kernel errno

2021-05-09 Thread Dennis Li
The original code uses RAS status and kernel errno together in the same function, which is poor code style. Signed-off-by: Dennis Li diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c index 17b728d2c1f2..231479b67b33 100644 --- a/drivers/gpu/drm

[PATCH] drm/amdgpu: update the shader to clear specific SGPRs

2021-05-06 Thread Dennis Li
Add shader codes to explicitly clear specific SGPRs, such as flat_scratch_lo, flat_scratch_hi and so on. And also correct the allocation size of SGPRs in PGM_RSRC1. Signed-off-by: Dennis Li diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_2.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_2.c index

[PATCH] drm/amdgpu: fix no full coverage issue for gprs initialization

2021-04-27 Thread Dennis Li
The number of waves is changed to 8, so it is impossible to use the old solution to cover all SGPRs. Signed-off-by: Dennis Li diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c index a2fe2dac32c1..2e6789a7dc46 100644 --- a/drivers/gpu/drm/amd/amdgpu

[PATCH] drm/amdgpu: refine gprs init shaders to check coverage

2021-04-20 Thread Dennis Li
Add code to check whether all SIMDs are covered, to make sure that all GPRs are initialized. Signed-off-by: Dennis Li diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c index 9889bd495ba5..9e629f239288 100644 --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c

[PATCH] drm/amdgpu: fix a error injection failed issue

2021-04-16 Thread Dennis Li
Because "sscanf(str, "retire_page")" always returns 0, if an application uses the raw data for error injection, it always wrongly falls into "op == 3". Change to use strstr instead. Signed-off-by: Dennis Li diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
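The behaviour described here can be reproduced in userspace C. This sketch uses the standard library `sscanf` and `strstr` rather than the kernel's implementations, and the function names are hypothetical. A format string with no conversion specifier assigns nothing, so the return value is 0 both when the literal text matches and when it mismatches, and cannot be used to detect the keyword.

```c
#include <stdio.h>
#include <string.h>

/*
 * The pattern the patch describes as broken: the format contains no
 * conversion specifier, so nothing is ever assigned and the result
 * does not distinguish a matching command from a non-matching one.
 */
int toy_sscanf_literal_match(const char *str)
{
    return sscanf(str, "retire_page");
}

/* The replacement approach: search for the keyword directly. */
int toy_is_retire_page_cmd(const char *str)
{
    return strstr(str, "retire_page") != NULL;
}
```

With inputs such as "retire_page 0x1000" and "inject umc", the first helper returns 0 for both, while the second returns 1 and 0 respectively.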

[PATCH] drm/amdkfd: add edc error interrupt handle for poison propogate mode

2021-04-15 Thread Dennis Li
In poison propagate mode, when the driver receives the EDC error interrupt from SQ, it should kill, by pasid, the process that is using the poison data, and then trigger a GPU reset. Signed-off-by: Dennis Li diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c b/drivers/gpu/drm/amd/amdkfd

[PATCH 4/4] drm/amdkfd: add reset lock protection for kfd entry functions

2021-03-18 Thread Dennis Li
read_lock from process queue manager, and add read_lock into related ioctls instead. v3: put pqm_query_dev_by_qid under the protection of p->mutex Signed-off-by: Dennis Li Acked-by: Christian König diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c in

[PATCH 3/4] drm/amdgpu: instead of using down/up_read directly

2021-03-18 Thread Dennis Li
change to use amdgpu_read_lock/unlock which could handle more cases Signed-off-by: Dennis Li diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c index bcaf271b39bf..66dec0f49c4a 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c +++ b

[PATCH 2/4] drm/amdgpu: refine the GPU recovery sequence

2021-03-18 Thread Dennis Li
hen system detect hung timeout in the recovery thread. Signed-off-by: Dennis Li diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index 02a34f9a26aa..67c716e5ee8d 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -104

[PATCH 1/4] drm/amdgpu: remove reset lock from low level functions

2021-03-18 Thread Dennis Li
Taking a lock in low-level functions can easily cause a performance drop. Signed-off-by: Dennis Li diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index 0b1e0127056f..24ff5992cb02 100644 --- a/drivers/gpu/drm/amd/amdgpu

[PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-03-18 Thread Dennis Li
pu_reset is 1, it should release the read lock if it holds one, and then block itself to wait for the recovery finished event. If the thread successfully holds the read lock and in_gpu_reset is 0, it continues. It will exit normally or be stopped by the recovery thread in step 1. Dennis Li (4): drm/amd

[PATCH] drm/amdgpu: block hardware accessed by other threads when doing gpu recovery

2021-03-01 Thread Dennis Li
When the GPU recovery thread is doing a GPU reset, it is unsafe for other threads to access hardware concurrently, which could cause the GPU reset to hang randomly. Signed-off-by: Dennis Li diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index 1624c2bc8285..c71d3bba5f69

[PATCH v2] drm/amdgpu: remove unnecessary reading for epprom header

2021-02-25 Thread Dennis Li
old Signed-off-by: Dennis Li diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index f0f7ed42ee7f..f2ff10403d93 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -4397,7 +4397,7 @@ static

[PATCH] drm/amdgpu: remove unnecessary reading for epprom header

2021-02-25 Thread Dennis Li
If the number of bad page records exceeds the threshold, the driver has already updated both the EEPROM header and control->tbl_hdr.header before GPU reset; therefore the GPU recovery thread does not need to read the EEPROM header directly. Signed-off-by: Dennis Li diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eepro

[PATCH v2] drm/amdgpu: reserve backup pages for bad page retirment

2021-02-22 Thread Dennis Li
ned-off-by: Dennis Li diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c index b7ee587484b2..ff4387bbfb1e 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c @@ -170,7 +170,7 @@ struct amdgpu_mgpu_info

[PATCH] drm/amdgpu: reserve backup pages for bad page retirment

2021-02-22 Thread Dennis Li
It is not user friendly that the user-visible unused memory decreases when bad pages are retired. Therefore reserve a limited number of backup pages at init, and return them when bad pages are retired, to keep the unused memory size unchanged. Signed-off-by: Dennis Li diff --git a/drivers/gpu/drm/

[PATCH] drm/amdgpu: Fix issue no bad_pages after umc ue injection

2021-01-04 Thread Dennis Li
The old code wrongly used the bad page status as the function return value, which caused amdgpu_ras_badpages_read to always return failure. Signed-off-by: Dennis Li diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c index c136bd449744..82e952696d24 100644

[PATCH v2] drm/amdgpu: fix a GPU hang issue when remove device

2020-12-30 Thread Dennis Li
amdgpu :03:00.0: amdgpu: failed to write reg 2890 wait reg 28a2 amdgpu :03:00.0: amdgpu: failed to write reg 1a6f4 wait reg 1a706 amdgpu :03:00.0: amdgpu: failed to write reg 2890 wait reg 28a2 amdgpu :03:00.0: amdgpu: failed to write reg 1a6f4 wait reg 1a706 Signed-off-by: Dennis Li

[PATCH] drm/amdgpu: fix a GPU hang issue when remove device

2020-12-30 Thread Dennis Li
28a2 amdgpu :03:00.0: amdgpu: failed to write reg 1a6f4 wait reg 1a706 amdgpu :03:00.0: amdgpu: failed to write reg 2890 wait reg 28a2 amdgpu :03:00.0: amdgpu: failed to write reg 1a6f4 wait reg 1a706 Signed-off-by: Dennis Li Change-Id: I42431f5d0bf54909e1df888a0d72fc009d8e196c diff

[PATCH] drm/amdgpu: fix a memory protection fault when remove amdgpu device

2020-12-29 Thread Dennis Li
101594] pci_unregister_driver+0x22/0xa0 [ 84.106806] amdgpu_exit+0x15/0x2b [amdgpu] Signed-off-by: Dennis Li Change-Id: Icc981a421499dff844855d5a662e91d1730c2754 diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c index eb19ae734396..b44b46dd60f2 100644 --- a/d

[PATCH 2/3] drm/amdgpu: remove redundant GPU reset

2020-10-27 Thread Dennis Li
Because bad pages saving has been moved to UMC error interrupt callback, which will trigger a new GPU reset after saving. Signed-off-by: Dennis Li --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h| 10 +- drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 16 2 files

[PATCH 3/3] drm/amdgpu: fix the issue of reserving bad pages failed

2020-10-27 Thread Dennis Li
nused include "amdgpu_ras.h"; 3. rename amdgpu_vram_mgr_check_and_reserve as amdgpu_vram_mgr_do_reserve; 4. refine amdgpu_vram_mgr_reserve_range to call amdgpu_vram_mgr_do_reserve. Signed-off-by: Dennis Li Signed-off-by: Wenhui Sheng --- drivers/gpu/drm/amd/amdgpu/amdgpu_r

[PATCH 1/3] drm/amdgpu: change to save bad pages in UMC error interrupt callback

2020-10-27 Thread Dennis Li
Instead of saving bad pages in amdgpu_ras_reset_gpu, it will reduce the unnecessary calling of amdgpu_ras_save_bad_pages. Signed-off-by: Dennis Li --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 5 + drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 1 + drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 7

[PATCH 0/3] Refine the codes about reseving bad pages.

2020-10-27 Thread Dennis Li
. The third patch will reserve the bad page when freeing it, so the system has no chance to allocate it to another process. Dennis Li (3): drm/amdgpu: change to save bad pages in UMC error interrupt callback drm/amdgpu: remove redundant GPU reset drm/amdgpu: fix the issue of reserving bad pages f

[PATCH] drm/amdgpu: protect eeprom update from GPU reset

2020-10-14 Thread Dennis Li
Because i2c is unstable during GPU reset, the driver needs to protect the EEPROM update from GPU reset, so that no bad page record is missed. Signed-off-by: Dennis Li diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c index 0e64c39a2372..695bcfc5c983 100644

[PATCH] drm/amdkfd: fix a memory leak issue

2020-09-02 Thread Dennis Li
In the resume stage of GPU recovery, start_cpsch calls pm_init, which sets pm->allocated to false, so the next pm_release_ib has no chance to release the IB memory. Add pm_release_ib in stop_cpsch, which is called in the suspend stage of GPU recovery. Signed-off-by: Dennis Li diff --gi

[PATCH v2] drm/kfd: fix a system crash issue during GPU recovery

2020-09-01 Thread Dennis Li
return value from execute_queues. Signed-off-by: Dennis Li diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c index 560adc57a050..069ba4be1e8f 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c +++ b/drivers/g

[PATCH v2] drm/amdgpu: block ring buffer access during GPU recovery

2020-09-01 Thread Dennis Li
When the GPU is in reset, its status isn't stable and the ring buffer also needs to be reset when resuming. Therefore the driver should protect the GPU recovery thread from ring buffer accesses by other threads. Otherwise the GPU will randomly hang during recovery. v2: correct indent Signed-off-by: Dennis Li

[PATCH] drm/kfd: fix a system crash issue during GPU recovery

2020-08-31 Thread Dennis Li
queue to queue list of the process. And then kfd_process_evict_queues will access freed memory, which causes a system crash. Signed-off-by: Dennis Li diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c index 560adc

[PATCH] drm/amdgpu: block ring buffer access during GPU recovery

2020-08-31 Thread Dennis Li
When the GPU is in reset, its status isn't stable and the ring buffer also needs to be reset when resuming. Therefore the driver should protect the GPU recovery thread from ring buffer accesses by other threads. Otherwise the GPU will randomly hang during recovery. Signed-off-by: Dennis Li diff --git a/drivers/gp

[PATCH] drm/amdgpu: skip scheduling IBs when GPU recovery

2020-08-21 Thread Dennis Li
If the GPU begins recovery, skip scheduling IBs. Otherwise GPU recovery randomly fails. Signed-off-by: Dennis Li diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c index dcfe8a3b03ff..054d7b0357fd 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c

[PATCH v3] drm/amdgpu: change reset lock from mutex to rw_semaphore

2020-08-20 Thread Dennis Li
clients don't need reset-lock for synchronization when no GPU recovery. v2: change to return the return value of down_read_killable. v3: if GPU recovery begin, VF ignore FLR notification. Signed-off-by: Dennis Li diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/a

[PATCH v2] drm/amdgpu: change reset lock from mutex to rw_semaphore

2020-08-20 Thread Dennis Li
clients don't need reset-lock for synchronization when no GPU recovery. v2: change to return the return value of down_read_killable. Signed-off-by: Dennis Li diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index c8aec832b244..ec11ed2a9ca4 100644

[PATCH] drm/amdgpu: refine message print for devices of hive

2020-08-19 Thread Dennis Li
Use dev_xxx instead of DRM_xxx/pr_xxx to indicate which device of a hive the message is for. Signed-off-by: Dennis Li diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index 81b1d9a1dca0..08548e051cc0 100644 --- a/drivers/gpu/drm/amd/amdgpu

[PATCH] drm/amdgpu: fix the nullptr issue when reenter GPU recovery

2020-08-19 Thread Dennis Li
In a single-GPU system, if the driver reenters GPU recovery, amdgpu_device_lock_adev will return false, but hive is a null pointer at that point. Signed-off-by: Dennis Li diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index 82242e2f5658..81b1d9a1dca0 100644 --- a

[PATCH] drm/amdgpu: change reset lock from mutex to rw_semaphore

2020-08-19 Thread Dennis Li
clients don't need reset-lock for synchronization when no GPU recovery. Signed-off-by: Dennis Li diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index c8aec832b244..ec11ed2a9ca4 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/dr

[PATCH v2] drm/amdgpu: refine codes to avoid reentering GPU recovery

2020-08-19 Thread Dennis Li
If other threads hold the reset lock, recovery will fail to try_lock. Therefore we introduce atomic hive->in_reset and adev->in_gpu_reset, to avoid reentering GPU recovery. v2: drop "? true : false" in the definition of amdgpu_in_reset Signed-off-by: Dennis Li diff --g
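The idea of guarding recovery with an atomic flag instead of a try_lock can be sketched with C11 atomics in userspace. This is an assumption-laden illustration, not the driver's code: `toy_device` and the `toy_*` functions are hypothetical, and the kernel's own atomic helpers are replaced by `<stdatomic.h>`.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical device with an atomic "recovery in progress" flag. */
struct toy_device {
    atomic_int in_gpu_reset;
};

/*
 * Only the caller that flips the flag from 0 to 1 may run recovery.
 * A second caller sees the flag already set and backs off, so the
 * recovery path is not reentered.
 */
bool toy_try_enter_recovery(struct toy_device *dev)
{
    int expected = 0;

    return atomic_compare_exchange_strong(&dev->in_gpu_reset, &expected, 1);
}

/* Mark recovery as finished so a later hang can be handled again. */
void toy_leave_recovery(struct toy_device *dev)
{
    atomic_store(&dev->in_gpu_reset, 0);
}

/* Already a boolean expression, so no "? true : false" is needed. */
bool toy_in_reset(struct toy_device *dev)
{
    return atomic_load(&dev->in_gpu_reset) != 0;
}
```

Unlike a try_lock on the reset lock, the flag does not fail merely because readers currently hold the lock; it only fails when another recovery is already running.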

[PATCH] drm/amdgpu: refine codes to avoid reentering GPU recovery

2020-08-19 Thread Dennis Li
If other threads hold the reset lock, recovery will fail to try_lock. Therefore we introduce atomic hive->in_reset and adev->in_gpu_reset, to avoid reentering GPU recovery. Signed-off-by: Dennis Li diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/am

[PATCH v3] drm/amdgpu: refine create and release logic of hive info

2020-08-18 Thread Dennis Li
amdgpu_hive_info*. 2. remove unnecessary variable initialization. Signed-off-by: Dennis Li diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index 98d0c6e5ab3c..e25f952d8836 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu

[PATCH v2] drm/amdgpu: refine create and release logic of hive info

2020-08-18 Thread Dennis Li
Change to dynamically create and release the hive info object, which helps the driver support more hives in the future. v2: Change to save the hive object pointer in adev, to avoid locking xgmi_mutex every time amdgpu_get_xgmi_hive is called. Signed-off-by: Dennis Li diff --git a/drivers/gpu/drm/amd

[PATCH] drm/amdgpu: refine create and release logic of hive info

2020-08-17 Thread Dennis Li
Change to dynamically create and release the hive info object, which helps the driver support more hives in the future. Signed-off-by: Dennis Li diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index 8a55b0bc044a..fdfdc2f678c9 100644 --- a/drivers/gpu

[PATCH] drm/amdgpu: fix a potential circular locking dependency

2020-08-11 Thread Dennis Li
653.939233] #1: 9744adbee1f8 (reservation_ww_class_mutex){+.+.}, at: ttm_eu_reserve_buffers+0x1ae/0x520 [ttm] Change the order of reservation_ww_class_mutex and adev->reset_sem in amdgpu_gem_va_ioctl to match the order in amdgpu_amdkfd_alloc_gtt_mem, to avoid a potential deadlock. Signed-off

[PATCH] drm/amdgpu: refine create and release logic of hive info

2020-08-10 Thread Dennis Li
Change to dynamically create and release the hive info object, which helps the driver support more hives in the future. Signed-off-by: Dennis Li diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index 8a55b0bc044a..fdfdc2f678c9 100644 --- a/drivers/gpu

[PATCH v3] drm/amdgpu: annotate a false positive recursive locking

2020-08-10 Thread Dennis Li
en registered! [ 1216.705924] [ cut here ] [ 1216.705972] DEBUG_LOCKS_WARN_ON(1) [ 1216.705997] WARNING: CPU: 20 PID: 541 at kernel/locking/lockdep.c:3743 lockdep_init_map+0x150/0x210 v3: change to use down_write_nest_lock to annotate the false dead-lock warning. Signed-off-by: Dennis Li diff -

[PATCH v2] drm/amdgpu: annotate a false positive recursive locking

2020-08-07 Thread Dennis Li
en registered! [ 1216.705924] [ cut here ] [ 1216.705972] DEBUG_LOCKS_WARN_ON(1) [ 1216.705997] WARNING: CPU: 20 PID: 541 at kernel/locking/lockdep.c:3743 lockdep_init_map+0x150/0x210 Signed-off-by: Dennis Li Change-Id: I7571efeccbf15483982031d00504a353031a854a diff --git a/drivers/gpu/d

[PATCH] drm/amdgpu: annotate a false positive recursive locking

2020-08-06 Thread Dennis Li
0x90/0x90 [ 584.129174] ret_from_fork+0x3a/0x50 Each adev owns its own lock_class_key to avoid false positive recursive locking. Signed-off-by: Dennis Li Change-Id: I7571efeccbf15483982031d00504a353031a854a diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h inde

[PATCH v2] drm/amdgpu: unlock mutex on error

2020-08-05 Thread Dennis Li
Make sure to unlock the mutex when an error happens. v2: 1. correct syntax error in the commit comment 2. remove change-Id Acked-by: Nirmoy Das Signed-off-by: Dennis Li diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c index a0ea663ecdbc
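A common way to guarantee the unlock on every path is a single exit label, sketched below in userspace C. The truncated diff does not show how the patch itself does it, so this is an assumed pattern, not the patch's code. `toy_lock` is a hypothetical counter standing in for a real mutex, and `toy_do_work` is an invented function with a simulated failure.

```c
#include <errno.h>

/* Hypothetical lock: a counter standing in for a real mutex. */
struct toy_lock {
    int held;
};

void toy_lock_acquire(struct toy_lock *l)
{
    l->held++;
}

void toy_lock_release(struct toy_lock *l)
{
    l->held--;
}

/*
 * Every path after the acquire leaves through out_unlock, so an
 * early error cannot return with the lock still held.
 */
int toy_do_work(struct toy_lock *l, int simulate_failure)
{
    int ret = 0;

    toy_lock_acquire(l);

    if (simulate_failure) {
        ret = -ENOMEM;
        goto out_unlock;
    }

    /* ... normal work under the lock ... */

out_unlock:
    toy_lock_release(l);
    return ret;
}
```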

[PATCH] drm/amdgpu: annotate a false positive locking dependency

2020-08-05 Thread Dennis Li
r+0xb0/0x1030 [amdgpu] [ 264.512450] #3: 965fd31647a0 (&adev->reset_sem){}, at: amdgpu_device_gpu_recover+0x264/0x1030 [amdgpu] Remove the lock(&hive->hive_lock) out of amdgpu_get_xgmi_hive, to disable its locking dependency on xgmi_mutex. Signed-off-by: Dennis Li Change-

[PATCH] drm/amdgpu: unlock mutex on error

2020-08-05 Thread Dennis Li
Make sure to unlock the mutex when an error happens. Signed-off-by: Dennis Li Change-Id: I6c36a193df5fe70516282d8136b4eadf32d20915 diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c index a0ea663ecdbc..5e5369abc6fa 100644 --- a/drivers/gpu/drm/amd

[PATCH v5] drm/amdgpu: fix system hang issue during GPU reset

2020-07-18 Thread Dennis Li
.c 2. remove comment codes in amdgpu_device.c 3. add more detailed comment in commit message 4. define a wrap function amdgpu_in_reset v5: 1. Fix some style issues. Signed-off-by: Dennis Li Reviewed-by: Andrey Grodzovsky Reviewed-by: Christian König Reviewed-by: Felix Kuehling Reviewed-by

[PATCH v4] drm/amdgpu: fix system hang issue during GPU reset

2020-07-15 Thread Dennis Li
.c 2. remove comment codes in amdgpu_device.c 3. add more detailed comment in commit message 4. define a wrap function amdgpu_in_reset Signed-off-by: Dennis Li Reviewed-by: Andrey Grodzovsky Reviewed-by: Christian König Reviewed-by: Felix Kuehling Reviewed-

[PATCH v2] drm/amdgpu: fix system hang issue during GPU reset

2020-07-08 Thread Dennis Li
ove try_lock and change adev->in_gpu_reset as atomic, to avoid re-enter GPU recovery for the same GPU hang. Signed-off-by: Dennis Li Change-Id: I7f77a72795462587ed7d5f51fe53a594a0f1f708 diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index 80f32b

[PATCH] drm/amdgpu: fix system hang issue during GPU reset

2020-07-06 Thread Dennis Li
During GPU reset, the driver should hold off all external access to the GPU; otherwise PSP will randomly fail to do post, which then causes a system hang. Signed-off-by: Dennis Li Change-Id: I7d5d41f9c4198b917d7b49606ba3850988e5b936 diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd

[PATCH] drm/amdkfd: change to return status when flush tlb

2020-07-06 Thread Dennis Li
If the GPU hangs, the driver will fail to flush the TLB. Return the hang error to callers so that they have a chance to handle it. Signed-off-by: Dennis Li Change-Id: Ie305ad0a77675f6eab7d5b8f68e279b7f4e7a8b9 diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c b/drivers/gpu/drm/amd/amdkfd

[PATCH v2] drm/amdgpu: set error query ready after all IPs late init

2020-04-21 Thread Dennis Li
If error query ready is set in amdgpu_ras_late_init, the error query of some IP blocks is marked ready before those blocks are initialized. v2: change the prefix of title to "drm/amdgpu" and remove the unnecessary "{}". Change-Id: I5087527261cb1b462afd82ad7592cf1ef73b15bd

[PATCH] drm/amd/amdgpu: set error query ready after all IPs late init

2020-04-21 Thread Dennis Li
If error query ready is set in amdgpu_ras_late_init, the error query of some IP blocks is marked ready before those blocks are initialized. Change-Id: I5087527261cb1b462afd82ad7592cf1ef73b15bd Signed-off-by: Dennis Li diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/dr

[PATCH] drm/amdgpu: replace DRM prefix with PCI device info for gfx/mmhub

2020-04-17 Thread Dennis Li
Prefix RAS message printing in gfx/mmhub with PCI device info, which helps debugging in the multi-GPU case. Change-Id: Iceba7cafd5aac7d0251d9f871503745cc617fba2 Signed-off-by: Dennis Li diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4.c old mode 100644

[PATCH] drm/amdgpu: fix the coverage issue to clear ArcVPGRs

2020-03-22 Thread Dennis Li
Set ComputePGMRSRC1.VGPRS as 0x3f to clear all ArcVGPRs. Change-Id: I296c3a162c0d5c7b84d4b48dc2002340a5c22e2a Signed-off-by: Dennis Li diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c old mode 100644 new mode 100755 index 324838baa71c..44fb64460c1f

[PATCH v2] drm/amdgpu: add codes to clear AccVGPR for arcturus

2020-03-12 Thread Dennis Li
AccVGPRs are newly added in arcturus. Before reading these registers, they should be initialized. Otherwise an EDC error happens when RAS is enabled. v2: reuse the existing logic to calculate register size Change-Id: I4ed384f0cc4b781a10cfd6ad1e3a132445bdc261 Signed-off-by: Dennis Li diff --git

[PATCH] drm/amdgpu: add codes to clear AccVGPR for arcturus

2020-03-12 Thread Dennis Li
AccVGPRs are newly added in arcturus. Before reading these registers, they should be initialized. Otherwise an EDC error happens when RAS is enabled. Change-Id: I4ed384f0cc4b781a10cfd6ad1e3a132445bdc261 Signed-off-by: Dennis Li diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm

[PATCH] drm/amdgpu: fix a bug NULL pointer dereference

2020-02-18 Thread Dennis Li
check whether the queue of entity is null to avoid null pointer dereference. Change-Id: I08d56774012cf229ba2fe7a011c1359e8d1e2781 Signed-off-by: Dennis Li diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c index 4cc7881f438c..67cca463ddcc

[PATCH 0/2] query edc counter for more mmhub sub-blocks of Acrturus

2020-01-18 Thread Dennis Li
1. Add RAS support for MAM D(0~3)_MEM in mmhub. 2. Add RAS support for other mmhub ranges from 2 to 7. Dennis Li (2): drm/amdgpu: update mmhub 9.4.1 header files for Acrturus drm/amdgpu: enable RAS feature for more mmhub sub-blocks of Acrturus drivers/gpu/drm/amd/amdgpu/mmhub_v9_4.c

[PATCH 2/2] drm/amdgpu: enable RAS feature for more mmhub sub-blocks of Acrturus

2020-01-18 Thread Dennis Li
Compared with Vg20, the size of mmhub range is changed from 2 to 8. Change-Id: I529c0ff0aaed200e5b102d482563ed9dc2278260 Signed-off-by: Dennis Li --- drivers/gpu/drm/amd/amdgpu/mmhub_v9_4.c | 701 +++- 1 file changed, 695 insertions(+), 6 deletions(-) diff --git a/drivers

[PATCH 1/2] drm/amdgpu: update mmhub 9.4.1 header files for Acrturus

2020-01-18 Thread Dennis Li
Add mask & shift definition of MAM_D(0~3)MEM for all mmhub ranges. Change-Id: I65c8a3040611198273a4b6da77c1a1ad2ffe7fd3 Signed-off-by: Dennis Li --- .../asic_reg/mmhub/mmhub_9_4_1_sh_mask.h | 128 ++ 1 file changed, 128 insertions(+) diff --git a/drivers/gpu/drm

[PATCH 4/4] drm/amdgpu: add RAS support for the gfx block of Arcturus

2020-01-18 Thread Dennis Li
Implement functions to do the RAS error injection and query EDC counter. Change-Id: I4d947511331a19c1967551b9d42997698073f795 Signed-off-by: Dennis Li --- drivers/gpu/drm/amd/amdgpu/Makefile | 1 + drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 26 +- drivers/gpu/drm/amd/amdgpu/gfx_v9_4.c | 978

[PATCH 0/4] Enable RAS feature for the gc of Arcturus

2020-01-18 Thread Dennis Li
support querying of EDC counter and error injection. Dennis Li (4): drm/amdgpu: refine the security check for RAS functions drm/amdgpu: abstract EDC counter clear to a separated function drm/amdgpu: add EDC counter registers of gc for Arcturus drm/amdgpu: add RAS support for the gfx block

[PATCH 1/4] drm/amdgpu: refine the security check for RAS functions

2020-01-18 Thread Dennis Li
ge-Id: Ia3f73bd9ee41ee3d0dd18d6f46e67124cf88d653 Signed-off-by: Dennis Li --- drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c index e3d466bd5c4e..759d8144f9c0 100644 --- a/drivers/gpu/dr

[PATCH 3/4] drm/amdgpu: add EDC counter registers of gc for Arcturus

2020-01-18 Thread Dennis Li
add reg headers to gc includes v2: remove unused registers and fields in this patch set Change-Id: If3476c0b0ed88e5d11bdb8bec1278ae10fc5af25 Signed-off-by: Dennis Li --- .../amd/include/asic_reg/gc/gc_9_4_1_offset.h | 264 +++ .../include/asic_reg/gc/gc_9_4_1_sh_mask.h| 748

[PATCH 2/4] drm/amdgpu: abstract EDC counter clear to a separated function

2020-01-18 Thread Dennis Li
1. Add IP prefix for the IP related codes. 2. Refactor the code to clear EDC counter. Change-Id: I1cd9ec304a7ace9a74480264d24368fd11a87833 Signed-off-by: Dennis Li --- drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 112 ++ 1 file changed, 77 insertions(+), 35 deletions(-) diff

[PATCH v2 2/3] drm/amdgpu: refine query function of mmhub EDC counter in vg20

2019-11-20 Thread Dennis Li
Add code to print the detailed EDC info for the subblock of mmhub v2: Move the EDC_CNT registers' definition from mmhub_9_4 header files to mmhub_1_0 ones. Add mmhub_v1_0_ prefix for the local static variable and function. Change-Id: I1d5b3df38caa8f0b437c96b78091662aaeaf264b Signed-off-by: D

[PATCH v2 0/3] RAS support for mmhub

2019-11-20 Thread Dennis Li
error. 3. Implement the query function of RAS error counter for Mi100 v2: 1. Fix some comment issues. 2. Add IP name prefix for the local static variable and function. 3. Move the EDC_CNT registers' definition from mmhub_9_4 header files to mmhub_1_0 ones for vg20. Dennis Li (3): drm/amdgpu: d

[PATCH v2 1/3] drm/amdgpu: define soc15_ras_field_entry for reuse

2019-11-20 Thread Dennis Li
The struct soc15_ras_field_entry will be reused by other IPs, such as mmhub and gc v2: rename ras_subblock_regs to gc_ras_fields_vg20, because a future ASIC may have a different table. Change-Id: I6c3388a09b5fbf927ad90fcd626baa448d1681a6 Signed-off-by: Dennis Li --- drivers/gpu/drm/amd

[PATCH v2 3/3] drm/amdgpu: implement querying ras error count for mmhub9.4

2019-11-20 Thread Dennis Li
Get mmhub error counter by accessing EDC_CNT registers. v2: Add mmhub_v9_4_ prefix for local static variable and function Change-Id: I728d4183a08707aaf0fc71d184e86322a681e725 Signed-off-by: Dennis Li --- drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 3 + drivers/gpu/drm/amd/amdgpu/mmhub_v9_4.c

[PATCH 3/3] drm/amdgpu: add RAS support for VML2 and ATCL2

2019-10-10 Thread Dennis Li
Add codes to query the EDC count of VML2 & ATCL2 Change-Id: If2c251481ba0a1a34ce3405a85f86d65eecee461 Signed-off-by: Dennis Li --- drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 167 ++ 1 file changed, 167 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.

[PATCH 1/3] drm/amdgpu: change to query the actual EDC counter

2019-10-10 Thread Dennis Li
For the potential request in the future, change to query the actual EDC counter. Change-Id: I783ccd76f4c65f9829f7a8967a539a23ae5484b5 Signed-off-by: Dennis Li --- drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 819 -- drivers/gpu/drm/amd/amdgpu/soc15.h| 2 + 2 files

[PATCH 2/3] drm/amd/include: add register define for VML2 and ATCL2

2019-10-10 Thread Dennis Li
Add VML2 and ATCL2 ECC registers to support VEGA20 RAS Change-Id: I8860f2e37fa7afd8d6123290fb7b9dcee56edd6e Signed-off-by: Dennis Li --- .../amd/include/asic_reg/gc/gc_9_0_offset.h| 18 -- .../amd/include/asic_reg/gc/gc_9_0_sh_mask.h | 18 -- 2 files

[PATCH 0/3] RAS Support for GFX blocks

2019-10-10 Thread Dennis Li
1. Add the EDC count from hardware. 2. Add RAS support for VML2 and ATCL2 sub blocks. Dennis Li (3): drm/amdgpu: change to query the actual EDC counter drm/amd/include: add register define for VML2 and ATCL2 drm/amdgpu: add RAS support for VML2 and ATCL2 drivers/gpu/drm/amd/amdgpu