When the address is -1, the TA will do error injection for all instances of
the special SRAM.
Signed-off-by: Dennis Li
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 885a78301bbf..c828ce9525d4 100644
--- a/drivers/gpu/drm/amd/amdgpu
The function kfd_lookup_process_by_pasid increases the reference
count of the kfd_process object, so its caller must call kfd_unref_process to
decrease the reference count. Otherwise a resource leak will happen.
Signed-off-by: Dennis Li
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_events.c
b
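
The lookup/unref pairing described above can be sketched in plain C. This is
only an illustration under stated assumptions: demo_process,
demo_lookup_by_pasid and demo_unref are hypothetical stand-ins, not the
amdkfd API, and the reference count is a plain int rather than a kref.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a reference-counted process object. */
struct demo_process {
    unsigned int pasid;
    int refcount;
};

/* Each entry starts with one reference owned by the table itself. */
static struct demo_process demo_table[] = {
    { .pasid = 0x8001, .refcount = 1 },
    { .pasid = 0x8002, .refcount = 1 },
};

/* Look up by PASID. On success the reference count is increased,
 * so the caller owns one reference and must drop it later. */
static struct demo_process *demo_lookup_by_pasid(unsigned int pasid)
{
    for (size_t i = 0; i < sizeof(demo_table) / sizeof(demo_table[0]); i++) {
        if (demo_table[i].pasid == pasid) {
            demo_table[i].refcount++;
            return &demo_table[i];
        }
    }
    return NULL;
}

/* Drop a reference taken by demo_lookup_by_pasid. */
static void demo_unref(struct demo_process *p)
{
    assert(p != NULL && p->refcount > 0);
    p->refcount--;
}

/* A balanced caller: every successful lookup is paired with an unref. */
static int demo_handle_event(unsigned int pasid)
{
    struct demo_process *p = demo_lookup_by_pasid(pasid);

    if (!p)
        return -1;
    /* ... use p here ... */
    demo_unref(p);
    return 0;
}
```

After demo_handle_event returns, the count is back at its initial value; a
caller that skipped demo_unref would leave it one higher on every call.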
context to re-dispatch works.
Signed-off-by: Dennis Li
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_events.c
b/drivers/gpu/drm/amd/amdkfd/kfd_events.c
index ba2c2ce0c55a..4d210f23c33c 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_events.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_events.c
@@ -1050,3
It is possible that earlier waves have exited before others are
created, so the later waves may reuse physical resources left by the
earlier ones. Therefore add a barrier instruction to synchronize waves within
the same threadgroup.
Signed-off-by: Dennis Li
diff --git a/drivers/gpu/drm/amd
For Aldebaran, hardware does not clear the error status automatically when
the error status register is read; instead, the driver should explicitly set
the clear bit of the error status register to clear the error status.
Signed-off-by: Dennis Li
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mmhub.h
b/drivers
Bit 11 of the GCEA_ERR_STATUS register is used to clear the GCEA error
status.
Signed-off-by: Dennis Li
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_2.c
b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_2.c
index e943cd2923ac..c63599686708 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_2.c
+++ b/drivers
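
A read-then-explicitly-clear sequence of the kind described above might look
like the sketch below. Only the bit position (11) comes from the commit
message; the register model, the demo_* names and the write semantics are
assumptions made for illustration, not the driver's register accessors.

```c
#include <assert.h>
#include <stdint.h>

/* Bit 11 is the clear bit named in the commit message. */
#define DEMO_ERR_STATUS_CLEAR (1u << 11)

/* Hypothetical model of a latched error status register. */
static uint32_t demo_err_status;

/* Reading does NOT clear the latched status (assumed hardware behavior). */
static uint32_t demo_rreg(void)
{
    return demo_err_status;
}

/* Writing with the clear bit set resets the latched status (assumption). */
static void demo_wreg(uint32_t val)
{
    if (val & DEMO_ERR_STATUS_CLEAR)
        demo_err_status = 0;
    else
        demo_err_status = val;
}

/* Query the status and, if anything is latched, clear it explicitly. */
static uint32_t demo_query_and_clear(void)
{
    uint32_t status = demo_rreg();

    if (status)
        demo_wreg(status | DEMO_ERR_STATUS_CLEAR);
    return status;
}
```

Without the explicit write, repeated queries would keep reporting the same
stale error.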
The original code uses RAS status and kernel errno together in the same
function, which is bad code style.
Signed-off-by: Dennis Li
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index 17b728d2c1f2..231479b67b33 100644
--- a/drivers/gpu/drm
Add shader code to explicitly clear specific SGPRs, such as
flat_scratch_lo, flat_scratch_hi and so on. Also correct the
allocation size of SGPRs in PGM_RSRC1.
Signed-off-by: Dennis Li
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_2.c
b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_2.c
index
The number of waves is changed to 8, so the old solution can no longer
cover all SGPRs.
Signed-off-by: Dennis Li
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
index a2fe2dac32c1..2e6789a7dc46 100644
--- a/drivers/gpu/drm/amd/amdgpu
Add code to check whether all SIMDs are covered, to make sure that all
GPRs are initialized.
Signed-off-by: Dennis Li
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
index 9889bd495ba5..9e629f239288 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
Because sscanf(str, "retire_page") always returns 0, if an application uses
the raw data for error injection, it always wrongly falls into "op ==
3". Change to use strstr instead.
Signed-off-by: Dennis Li
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
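
The sscanf behavior described above can be reproduced in userspace C: sscanf
returns the number of successful conversions, and a format string with no
conversion specifiers can never produce one. The helper names below are
hypothetical, and the sketch uses the C library rather than the kernel's own
sscanf.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* The problematic check: the format has no conversion specifier, so the
 * result is 0 whether or not the input starts with "retire_page". */
static int demo_sscanf_result(const char *str)
{
    assert(str != NULL);
    return sscanf(str, "retire_page");
}

/* The replacement check: strstr reports whether the keyword is present. */
static int demo_is_retire_page(const char *str)
{
    assert(str != NULL);
    return strstr(str, "retire_page") != NULL;
}
```

Because the sscanf result is 0 for both matching and non-matching input, a
comparison against 0 cannot tell the two apart, while the strstr check can.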
In poison propagation mode, when the driver receives the EDC error interrupt
from SQ, it should kill the process, identified by PASID, that is using the
poison data, and then trigger a GPU reset.
Signed-off-by: Dennis Li
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
b/drivers/gpu/drm/amd/amdkfd
read_lock from process
queue manager, and add read_lock into related ioctls instead.
v3: put pqm_query_dev_by_qid under the protection of p->mutex
Signed-off-by: Dennis Li
Acked-by: Christian König
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
in
change to use amdgpu_read_lock/unlock which could handle more cases
Signed-off-by: Dennis Li
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
index bcaf271b39bf..66dec0f49c4a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
+++ b
hen system detect hung timeout
in the recovery thread.
Signed-off-by: Dennis Li
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 02a34f9a26aa..67c716e5ee8d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -104
Taking locks in low-level functions can easily cause a performance
drop.
Signed-off-by: Dennis Li
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 0b1e0127056f..24ff5992cb02 100644
--- a/drivers/gpu/drm/amd/amdgpu
pu_reset is 1, it should release the read lock if it holds
one, and then block itself to wait for the recovery-finished event. If the thread
successfully holds the read lock and in_gpu_reset is 0, it continues. It will exit
normally or be stopped by the recovery thread in step 1.
Dennis Li (4):
drm/amd
While the GPU recovery thread is doing a GPU reset, it is unsafe for other
threads to access hardware concurrently, which could cause the GPU reset
to hang randomly.
Signed-off-by: Dennis Li
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 1624c2bc8285..c71d3bba5f69
old
Signed-off-by: Dennis Li
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index f0f7ed42ee7f..f2ff10403d93 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4397,7 +4397,7 @@ static
If the number of bad page records exceeds the threshold, the driver has
already updated both the EEPROM header and control->tbl_hdr.header before GPU
reset, so the GPU recovery thread does not need to read the EEPROM header directly.
Signed-off-by: Dennis Li
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eepro
ned-off-by: Dennis Li
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index b7ee587484b2..ff4387bbfb1e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -170,7 +170,7 @@ struct amdgpu_mgpu_info
It is not user friendly for the unused memory visible to users to
decrease when bad pages are retired. Therefore reserve a limited number of
backup pages at init, and return them as bad pages are retired, so that the
unused memory size does not change.
Signed-off-by: Dennis Li
diff --git a/drivers/gpu/drm/
The old code wrongly used the bad page status as the function return value,
which caused amdgpu_ras_badpages_read to always return failure.
Signed-off-by: Dennis Li
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index c136bd449744..82e952696d24 100644
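
Keeping the per-page status separate from the function's return code could
look like the sketch below. The types, constants and demo_badpages_read are
hypothetical and are not the amdgpu_ras interface; the point is only that
the return value carries 0 or a negative errno, while the status travels in
the output records.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical per-page status, deliberately non-zero. */
enum demo_page_status {
    DEMO_PAGE_RETIRED = 1,
    DEMO_PAGE_PENDING = 2,
};

struct demo_bad_page {
    unsigned long pfn;
    enum demo_page_status status;
};

/* Returns 0 on success or a negative errno. The status of each page is
 * written into the output records and never used as the return value. */
static int demo_badpages_read(const unsigned long *pfns, size_t n,
                              struct demo_bad_page *out, size_t cap,
                              size_t *count)
{
    if (!pfns || !out || !count)
        return -EINVAL;
    if (n > cap)
        return -ENOSPC;

    for (size_t i = 0; i < n; i++) {
        out[i].pfn = pfns[i];
        out[i].status = DEMO_PAGE_RETIRED;
    }
    *count = n;
    return 0;
}
```

Had the function returned the last status instead, every successful read
would look like a failure to callers that test the result against 0.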
amdgpu :03:00.0: amdgpu: failed to write reg 2890 wait reg 28a2
amdgpu :03:00.0: amdgpu: failed to write reg 1a6f4 wait reg 1a706
amdgpu :03:00.0: amdgpu: failed to write reg 2890 wait reg 28a2
amdgpu :03:00.0: amdgpu: failed to write reg 1a6f4 wait reg 1a706
Signed-off-by: Dennis Li
28a2
amdgpu :03:00.0: amdgpu: failed to write reg 1a6f4 wait reg 1a706
amdgpu :03:00.0: amdgpu: failed to write reg 2890 wait reg 28a2
amdgpu :03:00.0: amdgpu: failed to write reg 1a6f4 wait reg 1a706
Signed-off-by: Dennis Li
Change-Id: I42431f5d0bf54909e1df888a0d72fc009d8e196c
diff
101594] pci_unregister_driver+0x22/0xa0
[ 84.106806] amdgpu_exit+0x15/0x2b [amdgpu]
Signed-off-by: Dennis Li
Change-Id: Icc981a421499dff844855d5a662e91d1730c2754
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index eb19ae734396..b44b46dd60f2 100644
--- a/d
Because bad pages saving has been moved to UMC error interrupt callback,
which will trigger a new GPU reset after saving.
Signed-off-by: Dennis Li
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h| 10 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 16
2 files
nused include "amdgpu_ras.h";
3. rename amdgpu_vram_mgr_check_and_reserve as
amdgpu_vram_mgr_do_reserve;
4. refine amdgpu_vram_mgr_reserve_range to call
amdgpu_vram_mgr_do_reserve.
Signed-off-by: Dennis Li
Signed-off-by: Wenhui Sheng
---
drivers/gpu/drm/amd/amdgpu/amdgpu_r
Instead of saving bad pages in amdgpu_ras_reset_gpu, it will reduce
the unnecessary calling of amdgpu_ras_save_bad_pages.
Signed-off-by: Dennis Li
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 5 +
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 1 +
drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 7
. The third
patch will reserve the bad page when freeing it, so that the system has no
chance to allocate it to another process.
Dennis Li (3):
drm/amdgpu: change to save bad pages in UMC error interrupt callback
drm/amdgpu: remove redundant GPU reset
drm/amdgpu: fix the issue of reserving bad pages f
Because i2c is unstable during GPU reset, the driver needs to protect the
EEPROM update from GPU reset, so as not to miss any bad page record.
Signed-off-by: Dennis Li
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
index 0e64c39a2372..695bcfc5c983 100644
In the resume stage of GPU recovery, start_cpsch calls pm_init,
which sets pm->allocated to false, so the next pm_release_ib has
no chance to release the IB memory.
Add pm_release_ib to stop_cpsch, which is called in the suspend
stage of GPU recovery.
Signed-off-by: Dennis Li
diff --gi
return value from execute_queues.
Signed-off-by: Dennis Li
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
index 560adc57a050..069ba4be1e8f 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
+++ b/drivers/g
When the GPU is in reset, its state isn't stable and the ring buffer also
needs to be reset when resuming. Therefore the driver should protect the GPU
recovery thread from ring buffer accesses by other threads. Otherwise the GPU
will randomly hang during recovery.
v2: correct indent
Signed-off-by: Dennis Li
queue to
queue list of the process. Then kfd_process_evict_queues will
access freed memory, which causes a system crash.
Signed-off-by: Dennis Li
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
index 560adc
When the GPU is in reset, its state isn't stable and the ring buffer also
needs to be reset when resuming. Therefore the driver should protect the GPU
recovery thread from ring buffer accesses by other threads. Otherwise the GPU
will randomly hang during recovery.
Signed-off-by: Dennis Li
diff --git a/drivers/gp
If the GPU has begun recovery, skip scheduling IBs. Otherwise
GPU recovery randomly fails.
Signed-off-by: Dennis Li
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index dcfe8a3b03ff..054d7b0357fd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
clients don't need reset-lock for synchronization when there is no
GPU recovery.
v2:
change to return the return value of down_read_killable.
v3:
if GPU recovery has begun, the VF ignores the FLR notification.
Signed-off-by: Dennis Li
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
b/drivers/gpu/drm/amd/a
clients don't need reset-lock for synchronization when there is no
GPU recovery.
v2:
change to return the return value of down_read_killable.
Signed-off-by: Dennis Li
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index c8aec832b244..ec11ed2a9ca4 100644
Using dev_xxx instead of DRM_xxx/pr_xxx to indicate which device
of a hive is the message for.
Signed-off-by: Dennis Li
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 81b1d9a1dca0..08548e051cc0 100644
--- a/drivers/gpu/drm/amd/amdgpu
In a single-GPU system, if the driver re-enters GPU recovery,
amdgpu_device_lock_adev will return false, but hive is
NULL in that case.
Signed-off-by: Dennis Li
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 82242e2f5658..81b1d9a1dca0 100644
--- a
clients don't need reset-lock for synchronization when there is no
GPU recovery.
Signed-off-by: Dennis Li
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index c8aec832b244..ec11ed2a9ca4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/dr
If other threads hold the reset lock, recovery will
fail to try_lock. Therefore we introduce the atomic hive->in_reset
and adev->in_gpu_reset to avoid re-entering GPU recovery.
v2:
drop "? true : false" in the definition of amdgpu_in_reset
Signed-off-by: Dennis Li
diff --g
If other threads hold the reset lock, recovery will
fail to try_lock. Therefore we introduce the atomic hive->in_reset
and adev->in_gpu_reset to avoid re-entering GPU recovery.
Signed-off-by: Dennis Li
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
b/drivers/gpu/drm/amd/amdgpu/am
amdgpu_hive_info*.
2. remove unnecessary variable initialization.
Signed-off-by: Dennis Li
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 98d0c6e5ab3c..e25f952d8836 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu
Change to dynamically create and release the hive info object,
which helps the driver support more hives in the future.
v2:
Change to save the hive object pointer in adev, to avoid locking
xgmi_mutex on every call to amdgpu_get_xgmi_hive.
Signed-off-by: Dennis Li
diff --git a/drivers/gpu/drm/amd
Change to dynamically create and release the hive info object,
which helps the driver support more hives in the future.
Signed-off-by: Dennis Li
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 8a55b0bc044a..fdfdc2f678c9 100644
--- a/drivers/gpu
653.939233] #1: 9744adbee1f8 (reservation_ww_class_mutex){+.+.}, at:
ttm_eu_reserve_buffers+0x1ae/0x520 [ttm]
Change the order of reservation_ww_class_mutex and adev->reset_sem in
amdgpu_gem_va_ioctl to match the order in amdgpu_amdkfd_alloc_gtt_mem, to
avoid a potential deadlock.
Signed-off
Change to dynamically create and release the hive info object,
which helps the driver support more hives in the future.
Signed-off-by: Dennis Li
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 8a55b0bc044a..fdfdc2f678c9 100644
--- a/drivers/gpu
en registered!
[ 1216.705924] [ cut here ]
[ 1216.705972] DEBUG_LOCKS_WARN_ON(1)
[ 1216.705997] WARNING: CPU: 20 PID: 541 at kernel/locking/lockdep.c:3743
lockdep_init_map+0x150/0x210
v3:
change to use down_write_nest_lock to annotate the false deadlock
warning.
Signed-off-by: Dennis Li
diff -
en registered!
[ 1216.705924] [ cut here ]
[ 1216.705972] DEBUG_LOCKS_WARN_ON(1)
[ 1216.705997] WARNING: CPU: 20 PID: 541 at kernel/locking/lockdep.c:3743
lockdep_init_map+0x150/0x210
Signed-off-by: Dennis Li
Change-Id: I7571efeccbf15483982031d00504a353031a854a
diff --git a/drivers/gpu/d
0x90/0x90
[ 584.129174] ret_from_fork+0x3a/0x50
Each adev has its own lock_class_key to avoid false-positive
recursive locking warnings.
Signed-off-by: Dennis Li
Change-Id: I7571efeccbf15483982031d00504a353031a854a
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
inde
Make sure to unlock the mutex when an error happens
v2:
1. correct syntax error in the commit comment
2. remove change-Id
Acked-by: Nirmoy Das
Signed-off-by: Dennis Li
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index a0ea663ecdbc
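
One common way to guarantee the unlock on every error path is a single exit
label, sketched below. The demo_mutex type is a hypothetical single-threaded
stand-in for a real mutex, used so that the lock state can be inspected, and
demo_alloc_step is not the function the patch touches.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for a mutex that records whether it is held. */
struct demo_mutex {
    int held;
};

static void demo_mutex_lock(struct demo_mutex *m)
{
    assert(!m->held);
    m->held = 1;
}

static void demo_mutex_unlock(struct demo_mutex *m)
{
    assert(m->held);
    m->held = 0;
}

/* fail_step selects which step fails (0 means no failure). Every path,
 * successful or not, leaves through out_unlock. */
static int demo_alloc_step(struct demo_mutex *m, int fail_step)
{
    int ret = 0;

    demo_mutex_lock(m);

    if (fail_step == 1) {
        ret = -ENOMEM;      /* first step failed */
        goto out_unlock;
    }
    if (fail_step == 2) {
        ret = -EINVAL;      /* second step failed */
        goto out_unlock;
    }

out_unlock:
    demo_mutex_unlock(m);
    return ret;
}
```

An early return inside either error branch would leave the mutex held, which
is the class of bug the commit message describes.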
r+0xb0/0x1030 [amdgpu]
[ 264.512450] #3: 965fd31647a0 (&adev->reset_sem){}, at:
amdgpu_device_gpu_recover+0x264/0x1030 [amdgpu]
Move the lock(&hive->hive_lock) out of amdgpu_get_xgmi_hive,
to remove its locking dependency on xgmi_mutex.
Signed-off-by: Dennis Li
Change-
Make sure to unlock the mutex when an error happens
Signed-off-by: Dennis Li
Change-Id: I6c36a193df5fe70516282d8136b4eadf32d20915
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index a0ea663ecdbc..5e5369abc6fa 100644
--- a/drivers/gpu/drm/amd
.c
2. remove comment codes in amdgpu_device.c
3. add more detailed comment in commit message
4. define a wrap function amdgpu_in_reset
v5:
1. Fix some style issues.
Signed-off-by: Dennis Li
Reviewed-by: Andrey Grodzovsky
Reviewed-by: Christian König
Reviewed-by: Felix Kuehling
Reviewed-by
.c
2. remove comment codes in amdgpu_device.c
3. add more detailed comment in commit message
4. define a wrap function amdgpu_in_reset
Signed-off-by: Dennis Li
Reviewed-by: Andrey Grodzovsky
Reviewed-by: Christian König
Reviewed-by: Felix Kuehling
Reviewed-
ove try_lock and make adev->in_gpu_reset atomic, to avoid
re-entering GPU recovery for the same GPU hang.
Signed-off-by: Dennis Li
Change-Id: I7f77a72795462587ed7d5f51fe53a594a0f1f708
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 80f32b
During GPU reset, the driver should hold off all external access to
the GPU; otherwise PSP will randomly fail to do post, and then cause a
system hang.
Signed-off-by: Dennis Li
Change-Id: I7d5d41f9c4198b917d7b49606ba3850988e5b936
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
b/drivers/gpu/drm/amd
If the GPU hangs, the driver will fail to flush the TLB. Return the hang
error to callers, so that they have a chance to handle it.
Signed-off-by: Dennis Li
Change-Id: Ie305ad0a77675f6eab7d5b8f68e279b7f4e7a8b9
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
b/drivers/gpu/drm/amd/amdkfd
If error query ready is set in amdgpu_ras_late_init, the error
query of some IP blocks is marked ready even though those blocks
aren't initialized yet.
v2: change the prefix of title to "drm/amdgpu" and remove
the unnecessary "{}".
Change-Id: I5087527261cb1b462afd82ad7592cf1ef73b15bd
If error query ready is set in amdgpu_ras_late_init, the error
query of some IP blocks is marked ready even though those blocks
aren't initialized yet.
Change-Id: I5087527261cb1b462afd82ad7592cf1ef73b15bd
Signed-off-by: Dennis Li
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/dr
Prefix RAS message printing in gfx/mmhub with PCI device info,
which helps debugging in multi-GPU cases.
Change-Id: Iceba7cafd5aac7d0251d9f871503745cc617fba2
Signed-off-by: Dennis Li
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4.c
b/drivers/gpu/drm/amd/amdgpu/gfx_v9_4.c
old mode 100644
Set ComputePGMRSRC1.VGPRS as 0x3f to clear all ArcVGPRs.
Change-Id: I296c3a162c0d5c7b84d4b48dc2002340a5c22e2a
Signed-off-by: Dennis Li
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
old mode 100644
new mode 100755
index 324838baa71c..44fb64460c1f
AccVGPRs are newly added in Arcturus. Before these registers are
read, they should be initialized. Otherwise EDC errors
occur when RAS is enabled.
v2: reuse the existing logic to calculate register size
Change-Id: I4ed384f0cc4b781a10cfd6ad1e3a132445bdc261
Signed-off-by: Dennis Li
diff --git
AccVGPRs are newly added in Arcturus. Before these registers are
read, they should be initialized. Otherwise EDC errors
occur when RAS is enabled.
Change-Id: I4ed384f0cc4b781a10cfd6ad1e3a132445bdc261
Signed-off-by: Dennis Li
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
b/drivers/gpu/drm
check whether the queue of entity is null to avoid null
pointer dereference.
Change-Id: I08d56774012cf229ba2fe7a011c1359e8d1e2781
Signed-off-by: Dennis Li
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
index 4cc7881f438c..67cca463ddcc
1. Add RAS support for MAM D(0~3)_MEM in mmhub.
2. Add RAS support for other mmhub ranges from 2 to 7.
Dennis Li (2):
drm/amdgpu: update mmhub 9.4.1 header files for Acrturus
drm/amdgpu: enable RAS feature for more mmhub sub-blocks of Acrturus
drivers/gpu/drm/amd/amdgpu/mmhub_v9_4.c
Compared with Vg20, the size of mmhub range is changed from 2 to 8.
Change-Id: I529c0ff0aaed200e5b102d482563ed9dc2278260
Signed-off-by: Dennis Li
---
drivers/gpu/drm/amd/amdgpu/mmhub_v9_4.c | 701 +++-
1 file changed, 695 insertions(+), 6 deletions(-)
diff --git a/drivers
Add mask & shift definition of MAM_D(0~3)MEM for all mmhub
ranges.
Change-Id: I65c8a3040611198273a4b6da77c1a1ad2ffe7fd3
Signed-off-by: Dennis Li
---
.../asic_reg/mmhub/mmhub_9_4_1_sh_mask.h | 128 ++
1 file changed, 128 insertions(+)
diff --git a/drivers/gpu/drm
Implement functions to do the RAS error injection and
query EDC counter.
Change-Id: I4d947511331a19c1967551b9d42997698073f795
Signed-off-by: Dennis Li
---
drivers/gpu/drm/amd/amdgpu/Makefile | 1 +
drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 26 +-
drivers/gpu/drm/amd/amdgpu/gfx_v9_4.c | 978
support querying of EDC counter and error injection.
Dennis Li (4):
drm/amdgpu: refine the security check for RAS functions
drm/amdgpu: abstract EDC counter clear to a separated function
drm/amdgpu: add EDC counter registers of gc for Arcturus
drm/amdgpu: add RAS support for the gfx block
ge-Id: Ia3f73bd9ee41ee3d0dd18d6f46e67124cf88d653
Signed-off-by: Dennis Li
---
drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
index e3d466bd5c4e..759d8144f9c0 100644
--- a/drivers/gpu/dr
add reg headers to gc includes
v2: remove unused registers and fields in this patch set
Change-Id: If3476c0b0ed88e5d11bdb8bec1278ae10fc5af25
Signed-off-by: Dennis Li
---
.../amd/include/asic_reg/gc/gc_9_4_1_offset.h | 264 +++
.../include/asic_reg/gc/gc_9_4_1_sh_mask.h| 748
1. Add IP prefix for the IP related codes.
2. Refactor the code to clear EDC counter.
Change-Id: I1cd9ec304a7ace9a74480264d24368fd11a87833
Signed-off-by: Dennis Li
---
drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 112 ++
1 file changed, 77 insertions(+), 35 deletions(-)
diff
Add code to print the detailed EDC info for the sub-blocks of mmhub
v2: Move the EDC_CNT registers' definition from mmhub_9_4 header
files to mmhub_1_0 ones. Add mmhub_v1_0_ prefix for the local
static variable and function.
Change-Id: I1d5b3df38caa8f0b437c96b78091662aaeaf264b
Signed-off-by: D
error.
3. Implement the query function of RAS error counter for Mi100
v2:
1. Fix some comment issues.
2. Add IP name prefix for the local static variable and function.
3. Move the EDC_CNT registers' definition from mmhub_9_4 header files to
mmhub_1_0 ones for vg20.
Dennis Li (3):
drm/amdgpu: d
The struct soc15_ras_field_entry will be reused by
other IPs, such as mmhub and gc
v2: rename ras_subblock_regs to gc_ras_fields_vg20,
because future ASICs may have a different table.
Change-Id: I6c3388a09b5fbf927ad90fcd626baa448d1681a6
Signed-off-by: Dennis Li
---
drivers/gpu/drm/amd
Get mmhub error counter by accessing EDC_CNT registers.
v2: Add mmhub_v9_4_ prefix for local static variable and function
Change-Id: I728d4183a08707aaf0fc71d184e86322a681e725
Signed-off-by: Dennis Li
---
drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 3 +
drivers/gpu/drm/amd/amdgpu/mmhub_v9_4.c
Add code to query the EDC count of VML2 & ATCL2
Change-Id: If2c251481ba0a1a34ce3405a85f86d65eecee461
Signed-off-by: Dennis Li
---
drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 167 ++
1 file changed, 167 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.
For the potential request in the future, change to
query the actual EDC counter.
Change-Id: I783ccd76f4c65f9829f7a8967a539a23ae5484b5
Signed-off-by: Dennis Li
---
drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 819 --
drivers/gpu/drm/amd/amdgpu/soc15.h| 2 +
2 files
Add VML2 and ATCL2 ECC registers to support VEGA20 RAS
Change-Id: I8860f2e37fa7afd8d6123290fb7b9dcee56edd6e
Signed-off-by: Dennis Li
---
.../amd/include/asic_reg/gc/gc_9_0_offset.h| 18 --
.../amd/include/asic_reg/gc/gc_9_0_sh_mask.h | 18 --
2 files
1. Add the EDC count from hardware.
2. Add RAS support for VML2 and ATCL2 sub blocks.
Dennis Li (3):
drm/amdgpu: change to query the actual EDC counter
drm/amd/include: add register define for VML2 and ATCL2
drm/amdgpu: add RAS support for VML2 and ATCL2
drivers/gpu/drm/amd/amdgpu