We observed a GPU hang when querying mmhub CG status (i.e.,
cat amdgpu_pm_info) on cyan skillfish. Actually, cyan
skillfish doesn't support any CG features.
Only allow ASICs which support CG features to access the related
registers. Will add similar safeguards for other IPs in the
future.
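The safeguard described above can be sketched in plain C as follows. This is a minimal userspace illustration, not the driver code: the struct, the flag names and the simulated register read are all hypothetical stand-ins, assumed only for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical CG feature bits, for illustration only. */
#define SKETCH_CG_MGCG (1u << 0)
#define SKETCH_CG_LS   (1u << 1)

struct sketch_device {
    uint32_t cg_flags;   /* CG features the ASIC advertises; 0 means none */
    unsigned reg_reads;  /* counts simulated register accesses */
};

/* Stand-in for a register read that could hang on unsupported hardware. */
static uint32_t sketch_read_cg_reg(struct sketch_device *dev)
{
    dev->reg_reads++;
    return SKETCH_CG_MGCG;
}

/* Report CG state, touching the register only if the ASIC supports CG. */
static void sketch_get_cg_state(struct sketch_device *dev, uint32_t *flags)
{
    const uint32_t relevant = SKETCH_CG_MGCG | SKETCH_CG_LS;

    assert(dev != 0 && flags != 0);
    *flags = 0;
    if (!(dev->cg_flags & relevant))
        return;          /* no CG support: skip the register access */
    *flags = sketch_read_cg_reg(dev) & dev->cg_flags;
}
```

With cg_flags set to 0 the function returns before any register access, which is the behaviour the message asks for on ASICs without CG support.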
On 25.01.22 at 16:58, Felix Kuehling wrote:
On GPUs with RAS, poison can propagate between processes if VRAM is not
cleared when it is freed or allocated. The reason is that not all write
accesses clear RAS poison. 32-byte writes by the SDMA engine do clear RAS
poison. Clearing memory in the
[AMD Official Use Only]
Thanks Kevin and Felix!
In the gfxoff state, the dequeue request (made by writing a CP register) can't make
gfxoff exit; the CP is actually powered off, so the CP register write is invalid.
Writing the doorbell registers (the regular way) or directly requesting the SMU to disable gfx
[Public]
To simplify the code, I guess we can drop the variable 'r' and use 'return
svm_ioctl(p, args->op ' directly.
Regards,
Guchun
-Original Message-
From: amd-gfx On Behalf Of Philip Yang
Sent: Wednesday, January 26, 2022 2:04 AM
To: amd-gfx@lists.freedesktop.org
Cc: Yang, Philip
A few suggestions inline.
On 1/25/2022 12:18, Tom St Denis wrote:
Newer hardware has a discovery table in hardware that the kernel will
rely on instead of header files for things like IP offsets. This
sysfs entry adds a simple to parse table of IP instances and segment
offsets.
Produces
== Description ==
Scnprintf use within the kernel is not recommended, but simple sysfs_emit
replacement has not been successful due to the page alignment requirement of
the function. This patch set implements a new api "emit_clock_levels" to
facilitate passing both the base and offset to the
(v3)
Rewrote patchset to order patches as (API, hw impl, usecase)
- added API for new power management function emit_clk_levels
This function should duplicate the functionality of print_clk_levels,
but this solution passes the buffer base and write offset down the
(v3)
Rewrote patchset to order patches as (API, hw impl, usecase)
- implement emit_clk_levels for navi10, based on print_clk_levels, but
using sysfs_emit without smu_cmn_get_sysfs() workaround
Signed-off-by: Darren Powell
---
.../gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c |
This will cause misconfigured systems to not run the GPU suspend
routines.
* In APUs that are properly configured system will go into s2idle.
* In APUs that are intended to be S3 but user selects
s2idle the GPU will stay fully powered for the suspend.
* In APUs that are intended to be s2idle
dGPUs connected to Intel systems configured for suspend to idle
will not necessarily have the power rails cut at suspend and
resetting the GPU may lead to problematic behaviors.
Fixes: 6dc8265f9803 ("drm/amdgpu: always reset the asic in suspend (v2)")
Link:
This will be used to help make decisions on what to do in
misconfigured systems.
Signed-off-by: Mario Limonciello
---
drivers/gpu/drm/amd/amdgpu/amdgpu.h | 2 ++
drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c | 17 +
2 files changed, 19 insertions(+)
diff --git
On some OEM setups users can configure the BIOS for S3 or S2idle.
When configured to S3 users can still choose 's2idle' in the kernel by
using `/sys/power/mem_sleep`. Before commit 6dc8265f9803 ("drm/amdgpu:
always reset the asic in suspend (v2)"), the GPU would crash. Now when
configured this
[AMD Official Use Only]
> -Original Message-
> From: Alex Deucher
> Sent: Tuesday, January 25, 2022 11:58 PM
> To: Quan, Evan
> Cc: amd-gfx list ; Deucher, Alexander
> ; Lazar, Lijo
> Subject: Re: [PATCH V2 2/7] drm/amd/pm: unify the interface for retrieving
> enabled ppfeatures
>
>
[AMD Official Use Only]
> -Original Message-
> From: Alex Deucher
> Sent: Tuesday, January 25, 2022 11:35 PM
> To: Quan, Evan
> Cc: amd-gfx list ; Deucher, Alexander
>
> Subject: Re: [PATCH 2/2] drm/amd/pm: fix the deadlock observed on
> performance_level setting
>
> On Tue, Jan 25,
On 1/25/2022 4:16 PM, Tao Zhou wrote:
On ALDEBARAN, we need to traverse all column bits higher than
BIT11(C4C3C2) in a row, the shift of R14 bit should be also taken
into account. Retire all pages we find.
Signed-off-by: Tao Zhou
---
drivers/gpu/drm/amd/amdgpu/umc_v6_7.c | 41
A number of BIOS versions have a problem with the watermarks table not
being configured properly. This manifests as a very scary looking warning
during resume from s0i3. This should be harmless in most cases and is well
understood, so decrease the assertion to a clearer warning about the
Since we have a single instance of reset semaphore which we
lock only once even for XGMI hive we don't need the nested
locking hint anymore.
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 14 --
1 file changed, 4 insertions(+), 10 deletions(-)
Since all GPU resets are now serialized there is no need for this.
This patch also reverts 'drm/amdgpu: race issue when jobs on 2 ring timeout'
Signed-off-by: Andrey Grodzovsky
Reviewed-by: Christian König
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 89 ++
1 file
This function needs to be split into 2 parts where
one is called only once for locking single instance of
reset_domain's sem and reset flag and the other part
which handles MP1 states should still be called for
each device in XGMI hive.
Signed-off-by: Andrey Grodzovsky
---
We should have a single instance per entire reset domain.
Signed-off-by: Andrey Grodzovsky
Suggested-by: Lijo Lazar
---
drivers/gpu/drm/amd/amdgpu/amdgpu.h| 7 ++-
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 10 +++---
drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 1 +
We want a single instance of the reset sem across all
reset clients because in the case of XGMI we should stop
cross-device MMIO access, since any of the devices could be
in a reset at the moment.
Signed-off-by: Andrey Grodzovsky
---
drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 -
The reset domain now contains the register access semaphore
and so needs to be present as long as each device
in a hive needs it, so it cannot be bound to the XGMI
hive life cycle.
Address this by making the reset domain refcounted and pointed
to by each member of the hive and the hive itself.
Signed-off-by:
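The refcounting idea in the message above can be illustrated with a small userspace sketch. All names here are hypothetical, and the plain counter is an assumption made for brevity; kernel code would normally use an atomic reference count such as kref, and the snippet does not show which mechanism the patch uses.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a reset domain shared by a hive and its devices. */
struct sketch_reset_domain {
    unsigned refcount;
};

static unsigned sketch_live_domains;  /* domains currently allocated */

/* Create a domain holding one reference for the creator (e.g. the hive). */
static struct sketch_reset_domain *sketch_domain_create(void)
{
    struct sketch_reset_domain *d = malloc(sizeof(*d));

    assert(d != NULL);
    d->refcount = 1;
    sketch_live_domains++;
    return d;
}

/* Take an extra reference, e.g. when a device joins the hive. */
static struct sketch_reset_domain *sketch_domain_get(struct sketch_reset_domain *d)
{
    assert(d->refcount > 0);
    d->refcount++;
    return d;
}

/* Drop a reference; the domain is freed only when the last holder lets go. */
static void sketch_domain_put(struct sketch_reset_domain *d)
{
    assert(d->refcount > 0);
    if (--d->refcount == 0) {
        free(d);
        sketch_live_domains--;
    }
}
```

In this sketch the hive can drop its reference first and the domain still outlives it until every device has dropped its own, which is the decoupling from the hive life cycle that the message describes.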
Since we serialize all resets no need to protect from concurrent
resets.
Signed-off-by: Andrey Grodzovsky
Reviewed-by: Christian König
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 19 +--
drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 1 -
No need to trigger another work queue inside the work queue.
v3:
Problem:
Extra reset caused by host side FLR notification
following guest side triggered reset.
Fix: Prevent queuing flr_work from the mailbox irq if the guest
is already executing a reset.
Suggested-by: Liu Shaoyun
Signed-off-by: Andrey
Use the reset domain wq also for non-TDR GPU recovery triggers
such as sysfs and RAS. We must serialize all possible
GPU recoveries to guarantee no concurrency there.
For TDR call the original recovery function directly since
it's already executed from within the wq. For others just
use a wrapper to
Restrict job resubmission to the suspend case
only, since schedulers are not initialised yet on
probe.
Signed-off-by: Andrey Grodzovsky
Reviewed-by: Christian König
---
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 9 -
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git
Before we initialize schedulers we must know which reset
domain we are in: for a single device there is a single
domain per device and so a single wq per device. For XGMI
the reset domain spans the entire XGMI hive and so the
reset wq is per hive.
Signed-off-by: Andrey Grodzovsky
---
Defined a reset_domain struct such that
all the entities that go through reset
together will be serialized one against
another. Do it for both single device and
XGMI hive cases.
Signed-off-by: Andrey Grodzovsky
Suggested-by: Daniel Vetter
Suggested-by: Christian König
Reviewed-by: Christian
This patchset is based on earlier work by Boris[1] that allowed to have an
ordered workqueue at the driver level that will be used by the different
schedulers to queue their timeout work. On top of that I also serialized
any GPU reset we trigger from within amdgpu code to also go through the same
Inlined:
On 2022-01-25 13:18, Tom St Denis wrote:
> Newer hardware has a discovery table in hardware that the kernel will
> rely on instead of header files for things like IP offsets. This
> sysfs entry adds a simple to parse table of IP instances and segment
> offsets.
>
> Produces output that
Applied. Thanks!
Alex
On Tue, Jan 25, 2022 at 12:53 PM Harry Wentland wrote:
>
> On 2022-01-22 21:38, Bas Nieuwenhuizen wrote:
> > Unused. Convert the divisions into asserts on the divisor, to
> > debug why it is zero. The divide by zero is suspected of causing
> > kernel panics.
> >
> > While
On 2022-01-25 at 13:04, Philip Yang wrote:
SVM ioctls take the proper svms->lock to handle race conditions, so there is no need
to take the process mutex to serialize ioctls. This also fixes a circular locking
warning:
WARNING: possible circular locking dependency detected
Possible unsafe locking scenario:
Newer hardware has a discovery table in hardware that the kernel will
rely on instead of header files for things like IP offsets. This
sysfs entry adds a simple to parse table of IP instances and segment
offsets.
Produces output that looks like:
$ cat ip_discovery_text
ATHUB{0} v2.0.0: 0c00
SVM ioctls take the proper svms->lock to handle race conditions, so there is no need
to take the process mutex to serialize ioctls. This also fixes a circular locking
warning:
WARNING: possible circular locking dependency detected
Possible unsafe locking scenario:
CPU0                    CPU1
On 2022-01-22 21:38, Bas Nieuwenhuizen wrote:
> Unused. Convert the divisions into asserts on the divisor, to
> debug why it is zero. The divide by zero is suspected of causing
> kernel panics.
>
> While I have no idea where the zero is coming from I think this
> patch is a positive either way.
>
On 2022-01-20 at 18:13, Philip Yang wrote:
Define new system management interface event IDs, migration triggers and
user queue eviction triggers, those will be implemented in the following
patches.
Signed-off-by: Philip Yang
---
include/uapi/linux/kfd_ioctl.h | 27
On Tue, Jan 25, 2022 at 11:42 AM StDenis, Tom wrote:
>
> I literally brought this up in our initial discussion
>
> Frankly from umr's point of view a single file is easier.
>
> But I can't code anything until it's in the tree...
yeah, the single file is arguably easier to deal with. We could
I literally brought this up in our initial discussion
Frankly from umr's point of view a single file is easier.
But I can't code anything until it's in the tree...
Tom
From: Alex Deucher
Sent: Tuesday, January 25, 2022 11:39
To: StDenis, Tom
Cc:
On 2022-01-25 01:25, Fangzhi Zuo wrote:
> [Why]
> configure_dp_hpo_throttled_vcp_size() was missing promotion before, but it
> was covered by
> not calling the missing function hook in the old interface
> hpo_dp_link_encoder->funcs.
>
> Recent refactor replaces with new caller
On Mon, Jan 24, 2022 at 1:07 PM Tom St Denis wrote:
>
> Newer hardware has a discovery table in hardware that the kernel will
> rely on instead of header files for things like IP offsets. This
> sysfs entry adds a simple to parse table of IP instances and segment
> offsets.
>
> Produces output
On 2022-01-20 at 18:13, Philip Yang wrote:
sizeof(buf) is 8 bytes because it is defined as unsigned char *buf, so
each SMI event read copies at most 8 bytes to the user buffer. Correct this
by using the allocated size of buf.
Signed-off-by: Philip Yang
Reviewed-by: Felix Kuehling
---
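The pitfall described in the message above can be reproduced in a few lines of standalone C. This is a sketch under stated assumptions: the buffer size, message text and helper names are hypothetical, and the pointer size is 8 bytes only on a typical 64-bit build.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_BUF_SIZE 512  /* hypothetical allocation size */

static const char sketch_msg[] =
    "a message that is clearly longer than eight bytes";

/* Copy at most `limit` bytes of `msg` into `dst`; return the bytes copied. */
static size_t sketch_copy(char *dst, const char *msg, size_t limit)
{
    size_t n = strlen(msg);

    if (n > limit)
        n = limit;
    memcpy(dst, msg, n);
    return n;
}

/* Run the copy with the buggy limit and with the corrected limit. */
static void sketch_compare_limits(size_t *buggy, size_t *fixed)
{
    unsigned char *buf = malloc(SKETCH_BUF_SIZE);
    char out[SKETCH_BUF_SIZE];

    assert(buf != NULL);
    /* Bug: buf is a pointer, so sizeof(buf) is the pointer size. */
    *buggy = sketch_copy(out, sketch_msg, sizeof(buf));
    /* Fix: pass the size that was actually allocated. */
    *fixed = sketch_copy(out, sketch_msg, SKETCH_BUF_SIZE);
    free(buf);
}
```

The first call truncates the message to the size of a pointer, while the second copies it in full because the limit is the allocation size rather than sizeof applied to the pointer.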
On Tue, Jan 25, 2022 at 6:32 AM Lazar, Lijo wrote:
>
>
>
> On 1/20/2022 11:34 PM, Alex Deucher wrote:
> > Some architectures (e.g., ARM) throw a compilation error if the
> > udelay is too long. In general udelays of longer than 2000us are
> > not recommended on any architecture. Switch to
Patch is fine, if it does what you want. A few comments inline.
On 2022-01-24 13:07, Tom St Denis wrote:
> Newer hardware has a discovery table in hardware that the kernel will
> rely on instead of header files for things like IP offsets. This
> sysfs entry adds a simple to parse table of IP
Reviewed-by: Alex Deucher
On Tue, Jan 25, 2022 at 4:00 AM Evan Quan wrote:
>
> Use uint64_t instead of an array of uint32_t. This can avoid
> some unnecessary intermediate uint32_t -> uint64_t conversions.
>
> Signed-off-by: Evan Quan
> Change-Id: I4e217357203a23440f058d7e25f55eaebd15c5ef
>
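The difference between the two representations mentioned in the quoted patch can be sketched as follows. The function names are hypothetical and the word order of the 32-bit array (low word first) is an assumption made for the example, not something the snippet states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Array style: the 64-bit mask is split into two words, low word first. */
static bool sketch_enabled_u32(const uint32_t mask[2], unsigned bit)
{
    assert(bit < 64);
    return (mask[bit / 32] >> (bit % 32)) & 1u;
}

/* Intermediate conversion that the array style forces on 64-bit callers. */
static uint64_t sketch_combine(const uint32_t mask[2])
{
    return ((uint64_t)mask[1] << 32) | mask[0];
}

/* Single-value style: no index arithmetic and no conversion step. */
static bool sketch_enabled_u64(uint64_t mask, unsigned bit)
{
    assert(bit < 64);
    return (mask >> bit) & 1u;
}
```

With a single uint64_t the caller tests a bit with one shift, whereas the array form needs either index arithmetic or a combine step before the value can be used as a whole.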
On GPUs with RAS, poison can propagate between processes if VRAM is not
cleared when it is freed or allocated. The reason is that not all write
accesses clear RAS poison. 32-byte writes by the SDMA engine do clear RAS
poison. Clearing memory in the background when it is freed should avoid
major
Acked-by: Alex Deucher
On Tue, Jan 25, 2022 at 4:00 AM Evan Quan wrote:
>
> As other dGPU asics, Renoir should use smu_cmn_get_enabled_mask() for
> that job.
>
> Signed-off-by: Evan Quan
> Change-Id: I9e845ba84dd45d0826506de44ef4760fa851a516
> ---
> drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c | 3
On Tue, Jan 25, 2022 at 4:00 AM Evan Quan wrote:
>
> Instead of having two which do the same thing.
>
> Signed-off-by: Evan Quan
> Change-Id: I6302c9b5abdb999c4b7c83a0d1852181208b1c1f
> ---
> .../amd/pm/swsmu/smu11/cyan_skillfish_ppt.c | 2 +-
> .../gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c
On Mon, Jan 24, 2022 at 08:55:40AM +0100, Geert Uytterhoeven wrote:
> > + /kisskb/src/lib/test_printf.c: error: "PTR" redefined [-Werror]: =>
> > 247:0, 247
> > + /kisskb/src/sound/pci/ca0106/ca0106.h: error: "PTR" redefined [-Werror]:
> > => 62, 62:0
>
> mips-gcc8/mips-allmodconfig
>
On Tue, Jan 25, 2022 at 3:57 AM Evan Quan wrote:
>
> The sub-routine(amdgpu_gfx_off_ctrl) tried to obtain the lock
> adev->pm.mutex which was actually held by amdgpu_dpm_force_performance_level.
> A deadlock happened then.
>
> Signed-off-by: Evan Quan
> Change-Id:
On 2022-01-25 06:32, Lazar, Lijo wrote:
>
>
> On 1/20/2022 11:34 PM, Alex Deucher wrote:
>> Some architectures (e.g., ARM) throw a compilation error if the
>> udelay is too long. In general udelays of longer than 2000us are
>> not recommended on any architecture. Switch to msleep in these
I have no objection to the change. It restores the sequence that was
used before e9669fb78262. But I don't understand why GFX_OFF is causing
a preemption error during module unload, but not when KFD is in normal
use. Maybe it's because of the compute power profile that's normally set
by
Reviewed-by: Alex Deucher
On Tue, Jan 25, 2022 at 3:57 AM Evan Quan wrote:
>
> The existing way cannot handle Beige Goby well as a different
> PPTable data structure(PPTable_beige_goby_t instead of PPTable_t)
> is used there.
>
> Signed-off-by: Evan Quan
> Change-Id:
> -Original Message-
> From: Geert Uytterhoeven
> Sent: Monday, January 24, 2022 1:26 PM
> To: linux-ker...@vger.kernel.org
> Cc: linuxppc-...@lists.ozlabs.org; sparcli...@vger.kernel.org; linux-
> u...@lists.infradead.org; D, Lakshmi Sowjanya
> ; k...@vger.kernel.org; linux-
>
On 1/25/2022 5:28 AM, James Turner wrote:
Hi Lijo,
Not able to relate to how it affects gfx/mem DPM alone. Unless Alex
has other ideas, would you be able to enable drm debug messages and
share the log?
Sure, I'm happy to provide drm debug messages. Enabling everything
(0x1ff) generates *a
On 1/20/2022 11:34 PM, Alex Deucher wrote:
Some architectures (e.g., ARM) throw a compilation error if the
udelay is too long. In general udelays of longer than 2000us are
not recommended on any architecture. Switch to msleep in these
cases.
Signed-off-by: Alex Deucher
---
[AMD Official Use Only]
The issue was introduced by the following patch, so it would be better to add
this information:
fixes: (e9669fb78262) drm/amdgpu: Add early fini callback
Reviewed-by: Yang Wang
Best Regards,
Kevin
From: amd-gfx on behalf of Tianci Yin
Sent:
On ALDEBARAN, the umc channel bits are not original values, they
are hashed.
Signed-off-by: Tao Zhou
---
drivers/gpu/drm/amd/amdgpu/umc_v6_7.c | 8
drivers/gpu/drm/amd/amdgpu/umc_v6_7.h | 15 +++
2 files changed, 23 insertions(+)
diff --git
One piece of umc normalizing address can be mapped to 16 pieces of
physical address in each umc channel on ALDEBARAN.
Signed-off-by: Tao Zhou
---
drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 3 ++-
drivers/gpu/drm/amd/amdgpu/umc_v6_7.h | 4
2 files changed, 6 insertions(+), 1 deletion(-)
diff
On ALDEBARAN, we need to traverse all column bits higher than
BIT11(C4C3C2) in a row, the shift of R14 bit should be also taken
into account. Retire all pages we find.
Signed-off-by: Tao Zhou
---
drivers/gpu/drm/amd/amdgpu/umc_v6_7.c | 41 +--
Create common amdgpu_umc_fill_error_record function for all versions
of UMC and clean up related code.
Signed-off-by: Tao Zhou
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 23
drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 21 +++
From: "Tianci.Yin"
[why]
In the rmmod procedure, kfd sends the CP a dequeue request, but the
request gets no response, and the error message "cp
queue pipe 4 queue 0 preemption failed" is printed.
[how]
Suspending kfd after disabling gfxoff fixes it.
Change-Id:
[AMD Official Use Only]
> -Original Message-
> From: Lazar, Lijo
> Sent: Monday, January 24, 2022 1:03 PM
> To: Quan, Evan ; amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander ; Chen, Guchun
> ; Huang, Ray
> Subject: Re: [PATCH 3/7] drm/amd/pm: drop the redundant 'supported'
>
As there is no internal cache for enabled ppfeatures now, the 2nd
parameter is no longer needed.
Signed-off-by: Evan Quan
Change-Id: I0c1811f216c55d6ddfabdc9e099dc214c21bdf2e
---
drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c | 9 ++---
As the enabled ppfeatures have just been retrieved, we can use
them directly instead of retrieving them again and again.
Signed-off-by: Evan Quan
Change-Id: I08827437fcbbc52084418c8ca6a90cfa503306a9
---
drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c | 10 +-
1 file changed, 9 insertions(+), 1
The following scenarios make the driver cache for enabled ppfeatures
outdated and invalid:
- Other tools interact with PMFW to change the enabled ppfeatures.
- PMFW may enable/disable some features behind driver's back. E.g.
for sienna_cichlid, on gfxoff entering, PMFW will disable gfx
The supported features should be retrieved just after the EnableAllDpmFeatures
message completes. And the check (whether some dpm feature is supported) is only
needed when we decide to enable or disable it.
Signed-off-by: Evan Quan
Change-Id: I07c9a5ac5290cd0d88a40ce1768d393156419b5a
---
Instead of having two which do the same thing.
Signed-off-by: Evan Quan
Change-Id: I6302c9b5abdb999c4b7c83a0d1852181208b1c1f
---
.../amd/pm/swsmu/smu11/cyan_skillfish_ppt.c | 2 +-
.../gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c | 6 +-
.../drm/amd/pm/swsmu/smu13/yellow_carp_ppt.c | 6 +-
As other dGPU asics, Renoir should use smu_cmn_get_enabled_mask() for
that job.
Signed-off-by: Evan Quan
Change-Id: I9e845ba84dd45d0826506de44ef4760fa851a516
---
drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git
Hi Lijo,
> Not able to relate to how it affects gfx/mem DPM alone. Unless Alex
> has other ideas, would you be able to enable drm debug messages and
> share the log?
Sure, I'm happy to provide drm debug messages. Enabling everything
(0x1ff) generates *a lot* of log messages, though. Is there a
On 1/24/22 17:23, Felix Kuehling wrote:
>
> On 2022-01-24 at 14:11, Randy Dunlap wrote:
>> On 1/24/22 10:55, Geert Uytterhoeven wrote:
>>> Hi Alex,
>>>
>>> On Mon, Jan 24, 2022 at 7:52 PM Alex Deucher wrote:
On Mon, Jan 24, 2022 at 5:25 AM Geert Uytterhoeven
wrote:
> On Sun,
The sub-routine(amdgpu_gfx_off_ctrl) tried to obtain the lock
adev->pm.mutex which was actually held by amdgpu_dpm_force_performance_level.
A deadlock happened then.
Signed-off-by: Evan Quan
Change-Id: Id692829381dedc6380f5464d74107d696f7abca1
---
drivers/gpu/drm/amd/pm/amdgpu_dpm.c | 50
The existing way cannot handle Beige Goby well as a different
PPTable data structure(PPTable_beige_goby_t instead of PPTable_t)
is used there.
Signed-off-by: Evan Quan
Change-Id: I02208c011e93c4d37769bd022e65e9084faa97e4
---
drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c | 6 +++---
1