Re: [RFC] How to test panic handlers, without crashing the kernel

2024-03-07 Thread Guilherme G. Piccoli
On 07/03/2024 14:22, Jocelyn Falempe wrote:
> [...]
> 
> For drm_panic, I changed the way the debugfs is calling the drm_panic 
> functions in the last version:
> https://patchwork.freedesktop.org/patch/581845/?series=122244&rev=9
> 
> It doesn't use the panic notifier list, but creates a file for each plane
> of each device directly.
> It allows testing the panic handler - not in a real panic condition, but
> that's still better than nothing.
> 

Nice! Seems like a very good idea, at least as a first step to unblock the
work you're doing.
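
Just for reference, a minimal sketch of how such a per-plane debugfs
trigger could be wired up - all names here are hypothetical, this is
not the actual drm_panic patch:

#include <linux/debugfs.h>
#include <linux/fs.h>
#include <linux/module.h>
#include <drm/drm_plane.h>

static ssize_t plane_panic_write(struct file *file, const char __user *ubuf,
				 size_t count, loff_t *ppos)
{
	struct drm_plane *plane = file->private_data;

	drm_panic_try_draw_plane(plane);	/* hypothetical helper */
	return count;
}

static const struct file_operations plane_panic_fops = {
	.owner	= THIS_MODULE,
	.open	= simple_open,	/* file->private_data = inode->i_private */
	.write	= plane_panic_write,
};

/* called once per plane, at device init time */
static void plane_panic_debugfs_add(struct drm_plane *plane, struct dentry *dir)
{
	debugfs_create_file("drm_panic_plane", 0200, dir, plane,
			    &plane_panic_fops);
}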

Thanks again for the effort, much appreciated =)


Re: [RFC] How to test panic handlers, without crashing the kernel

2024-03-05 Thread Guilherme G. Piccoli
On 05/03/2024 13:52, Jocelyn Falempe wrote:
> [...]
> Or maybe have two lists of panic notifiers, the safe and the destructive 
> list. So in case of fake panic, we can only call the safe notifiers.
> 

I tried something like that:
https://lore.kernel.org/lkml/20220427224924.592546-1-gpicc...@igalia.com/

There were many suggestions, including a complete refactor of the idea
(panic notifier lists are not really seen as reliable things).

Given that, I'm not really sure splitting into lists is going to fly; maybe
restricting the test infrastructure to drm_panic plus some paths of
panic() would be enough for this debugfs interface, in principle? I mean,
to unblock your work on the drm panic stuff.
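
For concreteness, a minimal sketch of the two-list idea from your quote
(hypothetical names, not a real proposal):

#include <linux/notifier.h>

/* Sketch only: split the notifiers so a fake panic (e.g. triggered from
 * debugfs) runs just the safe ones, and a real panic runs both. */
static ATOMIC_NOTIFIER_HEAD(panic_safe_notifier_list);
static ATOMIC_NOTIFIER_HEAD(panic_destructive_notifier_list);

static void panic_call_notifier_lists(bool fake_panic, const char *msg)
{
	atomic_notifier_call_chain(&panic_safe_notifier_list, 0, (void *)msg);
	if (!fake_panic)
		atomic_notifier_call_chain(&panic_destructive_notifier_list,
					   0, (void *)msg);
}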

Cheers,


Guilherme


Re: [RFC] How to test panic handlers, without crashing the kernel

2024-03-04 Thread Guilherme G. Piccoli
On 04/03/2024 18:12, John Ogness wrote:
> [...]
>> The second question is how to simulate a panic context in a
>> non-destructive way, so we can test the panic notifiers in CI, without
>> crashing the machine.
> 
> I'm wondering if a "fake panic" can be implemented that quiesces all the
> other CPUs via NMI (similar to kdb) and then calls the panic
> notifiers. And finally releases everything back to normal. That might
> produce a fairly realistic panic situation and should be fairly
> non-destructive (depending on what the notifiers do and how long they
> take).
> 

Hi Jocelyn / John,

One concern here is that the panic notifiers are kind of a no man's
land: we have very simple / safe ones, while others are destructive in
nature.

An example of a well-behaved notifier that is nevertheless destructive is
the Hyper-V one, which tears down an essential host-guest interface (the
"vmbus connection"). What happens if we trigger this one just for
testing purposes in a debugfs interface? Likely the guest would die...

[+CCing Michael Kelley here since he seems interested in panic and is
also expert in Hyper-V, just in case my example is bogus.]

So, maybe the problem could be split in 2: the non-notifiers portion of
the panic path, and the notifiers. Maybe restricting which notifiers
you run is a way to circumvent the risks - if you could pass a list of
the specific notifiers you aim to test, that could be interesting (see
the rough sketch below). Let's see what the others think - and thanks
for your work on the DRM panic notifier =)
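
For illustration, a minimal sketch of that idea (hypothetical helper,
not an existing interface) - the debugfs handler would invoke only the
callbacks explicitly selected by the user, instead of walking the whole
panic_notifier_list:

/* Sketch: run a single, explicitly chosen panic notifier for testing.
 * panic() itself passes 0 as the action and the panic message as data. */
static int panic_test_run_notifier(struct notifier_block *nb)
{
	return nb->notifier_call(nb, 0, (void *)"fake panic (debugfs test)");
}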

Cheers,


Guilherme


Re: [PATCH] Revert "drm/amd/display: Program OTG vtotal min/max selectors unconditionally for DCN1+"

2023-07-12 Thread Guilherme G. Piccoli
On 12/07/2023 11:47, Pillai, Aurabindo wrote:
> Hi Guilherme,
> 
> Sorry, there was one more patch which I missed attaching. Please add this
> 3rd patch and retry.
> 
> Reverting that patch would cause high power consumption on Navi2x GPUs and
> also cause hangs on certain multi-monitor configurations. With these 3
> patches, you're getting the same effect as reverting the aforementioned
> patches, but it makes the reverted sequence available only for Steam
> Deck hardware.
> 

Thanks a lot for your detailed explanation, and the 3rd patch! Indeed,
amdgpu works fine on Deck with that - great =)

Feel free to add my:
Tested-by: Guilherme G. Piccoli  #Steam Deck

Oh, a Fixes tag would also make sense, I guess.
BTW, if it's possible to submit the 3 patches as a proper series to get them
merged in the 6.5-rc cycle (the sooner the better), I'd really appreciate it!

Cheers,


Guilherme






Re: [PATCH] Revert "drm/amd/display: Program OTG vtotal min/max selectors unconditionally for DCN1+"

2023-07-11 Thread Guilherme G. Piccoli
On 11/07/2023 15:22, Aurabindo Pillai wrote:
> [...]
> Hi,
> 
> Sorry for the delayed response, this patch went unnoticed. This revert would 
> break asics. Could you try the attached patch without reverting this one ?

Hi Aurabindo, thanks for your response!

I've tried kernel 6.5-rc1, and the issue is present there, since the
patch was merged into Linus' tree [as 1598fc576420 ("drm/amd/display:
Program OTG vtotal min/max selectors unconditionally for DCN1+")].

Then, I tried both your attached patches on top of that, and
unfortunately, the behavior is the same: Steam Deck doesn't boot with
graphics, and we can see the single error "amdgpu :04:00.0: [drm]
*ERROR* [CRTC:67:crtc-0] flip_done timed out" on dmesg.

Do you / Alex think we could get this revert for 6.5-rc2, so at least we
could boot mainline there while the issue is handled? It would be an
intermediate fix. You mentioned it breaks some asics, but did they work
until now, without your patch?

Thanks,


Guilherme


[PATCH] Revert "drm/amd/display: Program OTG vtotal min/max selectors unconditionally for DCN1+"

2023-07-02 Thread Guilherme G. Piccoli
This reverts commit 06c3a652a787efc960af7c8816036d25c4227c6c.

After this commit, the Steam Deck cannot boot with graphics anymore;
the following message is observed on dmesg:

"[drm] ERROR [CRTC:67:crtc-0] flip_done timed out"

No other error is observed; it just stays like that. After bisecting
amd-staging-drm-next, we narrowed it down to this commit. Seems it
makes sense to revert it to keep the tree bootable until a proper
solution is worked out.

Cc: Aurabindo Pillai 
Cc: André Almeida 
Cc: Melissa Wen 
Cc: Rodrigo Siqueira 
Signed-off-by: Guilherme G. Piccoli 

---

Hi Alex / Aurabindo, we couldn't boot the Deck with graphics in HEAD
(amd-staging-drm-next); git bisect led to this commit. Since its
description already mentions a potential proper solution, related
to the DMCUB (and some complex state tracking), I thought it was
more effective to revert it to allow booting the tree on the Deck (and
maybe other HW - I only tested the Deck, BTW).
Lemme know your thoughts.

Special thanks to André and Melissa for helping the debug / bisect!
We're open to test alternative patches, feel free to ping.
Cheers,

Guilherme


 drivers/gpu/drm/amd/display/dc/dcn10/dcn10_optc.c | 15 ++++++++++++---
 drivers/gpu/drm/amd/display/dc/dcn20/dcn20_optc.c | 10 ----------
 2 files changed, 12 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/dc/dcn10/dcn10_optc.c b/drivers/gpu/drm/amd/display/dc/dcn10/dcn10_optc.c
index 0e8f4f36c87c..27419cd98264 100644
--- a/drivers/gpu/drm/amd/display/dc/dcn10/dcn10_optc.c
+++ b/drivers/gpu/drm/amd/display/dc/dcn10/dcn10_optc.c
@@ -945,10 +945,19 @@ void optc1_set_drr(
OTG_FORCE_LOCK_ON_EVENT, 0,
OTG_SET_V_TOTAL_MIN_MASK_EN, 0,
OTG_SET_V_TOTAL_MIN_MASK, 0);
-   }
 
-   // Setup manual flow control for EOF via TRIG_A
-   optc->funcs->setup_manual_trigger(optc);
+   // Setup manual flow control for EOF via TRIG_A
+   optc->funcs->setup_manual_trigger(optc);
+
+   } else {
+   REG_UPDATE_4(OTG_V_TOTAL_CONTROL,
+   OTG_SET_V_TOTAL_MIN_MASK, 0,
+   OTG_V_TOTAL_MIN_SEL, 0,
+   OTG_V_TOTAL_MAX_SEL, 0,
+   OTG_FORCE_LOCK_ON_EVENT, 0);
+
+   optc->funcs->set_vtotal_min_max(optc, 0, 0);
+   }
 }
 
 void optc1_set_vtotal_min_max(struct timing_generator *optc, int vtotal_min, int vtotal_max)
diff --git a/drivers/gpu/drm/amd/display/dc/dcn20/dcn20_optc.c b/drivers/gpu/drm/amd/display/dc/dcn20/dcn20_optc.c
index 58bdbd859bf9..d6f095b4555d 100644
--- a/drivers/gpu/drm/amd/display/dc/dcn20/dcn20_optc.c
+++ b/drivers/gpu/drm/amd/display/dc/dcn20/dcn20_optc.c
@@ -462,16 +462,6 @@ void optc2_setup_manual_trigger(struct timing_generator *optc)
 {
struct optc *optc1 = DCN10TG_FROM_TG(optc);
 
-   /* Set the min/max selectors unconditionally so that
-* DMCUB fw may change OTG timings when necessary
-* TODO: Remove the w/a after fixing the issue in DMCUB firmware
-*/
-   REG_UPDATE_4(OTG_V_TOTAL_CONTROL,
-OTG_V_TOTAL_MIN_SEL, 1,
-OTG_V_TOTAL_MAX_SEL, 1,
-OTG_FORCE_LOCK_ON_EVENT, 0,
-OTG_SET_V_TOTAL_MIN_MASK, (1 << 1)); /* TRIGA */
-
REG_SET_8(OTG_TRIGA_CNTL, 0,
OTG_TRIGA_SOURCE_SELECT, 21,
OTG_TRIGA_SOURCE_PIPE_SELECT, optc->inst,
-- 
2.40.1



[PATCH] drm/amdgpu/vcn: Disable indirect SRAM on Vangogh broken BIOSes

2023-03-12 Thread Guilherme G. Piccoli
The VCN firmware loading path enables the indirect SRAM mode if it's
advertised as supported. We might have some cases of FW issues that
prevent this mode from working properly though, ending up in a failed
probe. An example below, observed on the Steam Deck:

[...]
[drm] failed to load ucode VCN0_RAM(0x3A)
[drm] psp gfx command LOAD_IP_FW(0x6) failed and response status is (0x)
amdgpu :04:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring vcn_dec_0 test failed (-110)
[drm:amdgpu_device_init.cold [amdgpu]] *ERROR* hw_init of IP block  failed -110
amdgpu :04:00.0: amdgpu: amdgpu_device_ip_init failed
amdgpu :04:00.0: amdgpu: Fatal error during GPU init
[...]

Disabling the VCN block circumvents this, but it's a very invasive
workaround that turns off the entire feature. So, let's add a quirk
on VCN loading that checks for known problematic BIOSes on Vangogh,
so we can proactively disable the indirect SRAM mode and allow the
HW to probe properly and the VCN IP block to work fine.

Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/2385
Fixes: 82132ecc5432 ("drm/amdgpu: enable Vangogh VCN indirect sram mode")
Cc: sta...@vger.kernel.org
Cc: James Zhu 
Cc: Leo Liu 
Signed-off-by: Guilherme G. Piccoli 
---


Hi folks, based on the feedback from the gitlab issue, here is the upstream
attempt to quirk the Steam Deck BIOSes that have known issues with the
indirect SRAM mode. I've tested it on both of the quirked BIOSes, and also
with some working ones. This patch is based on agd5f/amd-staging-drm-next.

Thanks in advance for reviews!
Cheers,

Guilherme


 drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
index 02d428ddf2f8..dc4f3f4cb644 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
@@ -26,6 +26,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -114,6 +115,24 @@ int amdgpu_vcn_sw_init(struct amdgpu_device *adev)
if ((adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) &&
(adev->pg_flags & AMD_PG_SUPPORT_VCN_DPG))
adev->vcn.indirect_sram = true;
 
+   /*
+* Some Steam Deck's BIOS versions are incompatible with the
+* indirect SRAM mode, leading to amdgpu being unable to get
+* properly probed (and even potentially crashing the kernel).
+* Hence, check for these versions here - notice this is
+* restricted to Vangogh (Deck's APU).
+*/
+   if (adev->ip_versions[UVD_HWIP][0] == IP_VERSION(3, 0, 2)) {
+   const char *bios_ver = dmi_get_system_info(DMI_BIOS_VERSION);
+
+   if (bios_ver && (!strncmp("F7A0113", bios_ver, 7) ||
+!strncmp("F7A0114", bios_ver, 7))) {
+   adev->vcn.indirect_sram = false;
+   dev_info(adev->dev,
+   "Steam Deck quirk: indirect SRAM disabled on BIOS %s\n", bios_ver);
+   }
+   }
+
hdr = (const struct common_firmware_header *)adev->vcn.fw->data;
adev->vcn.fw_version = le32_to_cpu(hdr->ucode_version);
 
-- 
2.39.2



Re: [PATCH v3] drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini

2023-02-02 Thread Guilherme G. Piccoli
On 02/02/2023 13:12, Luben Tuikov wrote:
> Hi Guilherme,
> 
> Thanks for redoing to a v3. This patch is:
> 
> Reviewed-by: Luben Tuikov 
> 
> Regards,
> Luben
> 

Thank you for the reviews Luben, much appreciated!


Re: [PATCH v2] drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini

2023-02-02 Thread Guilherme G. Piccoli
On 02/02/2023 08:58, Christian König wrote:
> [...]
>> +if (!ring->no_scheduler && ring->sched.ops)
>>  drm_sched_fini(&ring->sched);
> 
> I think we should drop the check for no_scheduler here and just call 
> drm_sched_fini() when the scheduler instance was initialized before.
> 
> Background is that I've found a couple of places where we potentially 
> set no_scheduler to false after the scheduler was already initialized.
> 
> This is then probably leading to a memory leak or worse.
> 
> Regards,
> Christian.

Thanks Christian, nice finding! And thanks for the reviews Guchun / Luben =)

I just submitted a V3 [0], but didn't add the review tags from Guchun /
Luben, since the patch changed.

Cheers,


Guilherme


[0]
https://lore.kernel.org/dri-devel/20230202134856.1382169-1-gpicc...@igalia.com/


[PATCH v3] drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini

2023-02-02 Thread Guilherme G. Piccoli
Currently amdgpu calls drm_sched_fini() from the fence driver sw fini
routine - such function is expected to be called only after the
respective init function - drm_sched_init() - was executed successfully.

Happens that we faced a driver probe failure on the Steam Deck
recently, and drm_sched_fini() was called even though its counterpart
had not been previously called, causing the following oops:

amdgpu: probe of :04:00.0 failed with error -110
BUG: kernel NULL pointer dereference, address: 0090
PGD 0 P4D 0
Oops: 0002 [#1] PREEMPT SMP NOPTI
CPU: 0 PID: 609 Comm: systemd-udevd Not tainted 6.2.0-rc3-gpiccoli #338
Hardware name: Valve Jupiter/Jupiter, BIOS F7A0113 11/04/2022
RIP: 0010:drm_sched_fini+0x84/0xa0 [gpu_sched]
[...]
Call Trace:
 
 amdgpu_fence_driver_sw_fini+0xc8/0xd0 [amdgpu]
 amdgpu_device_fini_sw+0x2b/0x3b0 [amdgpu]
 amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
 devm_drm_dev_init_release+0x49/0x70
 [...]

To prevent that, check if the drm_sched was properly initialized for a
given ring before calling its fini counter-part.

Notice that ideally we'd use sched.ready for that; such field is set as the
last step of drm_sched_init(). But amdgpu seems to "override" the meaning of
that field - in the above oops, for example, it was a GFX ring causing the
crash, and sched.ready was set to true in the ring init routine, regardless
of the state of the DRM scheduler. Hence, we ended up using sched.ops as per
Christian's suggestion [0], and also removed the no_scheduler check [1].

[0] https://lore.kernel.org/amd-gfx/984ee981-2906-0eaf-ccec-9f80975cb...@amd.com/
[1] https://lore.kernel.org/amd-gfx/cd0e2994-f85f-d837-609f-7056d5fb7...@amd.com/

Fixes: 067f44c8b459 ("drm/amdgpu: avoid over-handle of fence driver fini in s3 test (v2)")
Suggested-by: Christian König 
Cc: Guchun Chen 
Cc: Luben Tuikov 
Cc: Mario Limonciello 
Signed-off-by: Guilherme G. Piccoli 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 00444203220d..faff4a3f96e6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -618,7 +618,13 @@ void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev)
if (!ring || !ring->fence_drv.initialized)
continue;
 
-   if (!ring->no_scheduler)
+   /*
+* Notice we check for sched.ops since there's some
+* override on the meaning of sched.ready by amdgpu.
+* The natural check would be sched.ready, which is
+* set as drm_sched_init() finishes...
+*/
+   if (ring->sched.ops)
drm_sched_fini(&ring->sched);
 
for (j = 0; j <= ring->fence_drv.num_fences_mask; ++j)
-- 
2.39.0



Re: [PATCH] drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini

2023-02-01 Thread Guilherme G. Piccoli
On 01/02/2023 13:21, Luben Tuikov wrote:
> Hi Guilherme,
> 
> Since setting sched->ready to false, seems to be taking place in, directly 
> amdgpu_ring_fini()
> and in amdgpu_fence_driver_sw_fini() indirectly as that function calls 
> drm_sched_fini()
> which sets it to false, we seem to have two competing policies of,
> "set ready to false to show that _fini() was called, and set to false to 
> disable IB submissions".
> 
> To that effect, your patch is generally correct, as it would be the case of 
> an early failure
> and unroll from (indirectly) amdgpu_device_init_schedulers().
> 
> Please resubmit your patch but using .ops as Christian suggested, as .name is 
> sufficient,
> but .ops is necessary.
> 
> On a side-note: in the future we should probably discern between
> "this ring has an initialized and working scheduler" (looking up at DRM), from
> "this ring can take on IBs to send them down to the hardware" (looking down 
> at hardware).
> Sched->ready seems to be overloaded with these disparate states, and this is 
> why you need
> to use .ops to guard calling drm_sched_fini().
> 
> Regards,
> Luben

Thanks a lot Luben, makes perfect sense!

Also, thanks to everyone who provided feedback here - a very interesting
discussion.

Submitted V2:
https://lore.kernel.org/dri-devel/20230201164814.1353383-1-gpicc...@igalia.com/
Cheers,


Guilherme


[PATCH v2] drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini

2023-02-01 Thread Guilherme G. Piccoli
Currently amdgpu calls drm_sched_fini() from the fence driver sw fini
routine - such function is expected to be called only after the
respective init function - drm_sched_init() - was executed successfully.

Happens that we faced a driver probe failure on the Steam Deck
recently, and drm_sched_fini() was called even though its counterpart
had not been previously called, causing the following oops:

amdgpu: probe of :04:00.0 failed with error -110
BUG: kernel NULL pointer dereference, address: 0090
PGD 0 P4D 0
Oops: 0002 [#1] PREEMPT SMP NOPTI
CPU: 0 PID: 609 Comm: systemd-udevd Not tainted 6.2.0-rc3-gpiccoli #338
Hardware name: Valve Jupiter/Jupiter, BIOS F7A0113 11/04/2022
RIP: 0010:drm_sched_fini+0x84/0xa0 [gpu_sched]
[...]
Call Trace:
 
 amdgpu_fence_driver_sw_fini+0xc8/0xd0 [amdgpu]
 amdgpu_device_fini_sw+0x2b/0x3b0 [amdgpu]
 amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
 devm_drm_dev_init_release+0x49/0x70
 [...]

To prevent that, check if the drm_sched was properly initialized for a
given ring before calling its fini counter-part.

Notice that ideally we'd use sched.ready for that; such field is set as the
last step of drm_sched_init(). But amdgpu seems to "override" the meaning of
that field - in the above oops, for example, it was a GFX ring causing the
crash, and sched.ready was set to true in the ring init routine, regardless
of the state of the DRM scheduler. Hence, we ended up using sched.ops as per
Christian's suggestion [0].

[0] https://lore.kernel.org/amd-gfx/984ee981-2906-0eaf-ccec-9f80975cb...@amd.com/

Fixes: 067f44c8b459 ("drm/amdgpu: avoid over-handle of fence driver fini in s3 test (v2)")
Suggested-by: Christian König 
Cc: Guchun Chen 
Cc: Luben Tuikov 
Cc: Mario Limonciello 
Signed-off-by: Guilherme G. Piccoli 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 00444203220d..3b962cb680a6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -618,7 +618,13 @@ void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev)
if (!ring || !ring->fence_drv.initialized)
continue;
 
-   if (!ring->no_scheduler)
+   /*
+* Notice we check for sched.ops since there's some
+* override on the meaning of sched.ready by amdgpu.
+* The natural check would be sched.ready, which is
+* set as drm_sched_init() finishes...
+*/
+   if (!ring->no_scheduler && ring->sched.ops)
drm_sched_fini(&ring->sched);
 
for (j = 0; j <= ring->fence_drv.num_fences_mask; ++j)
-- 
2.39.0



Re: [PATCH] drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini

2023-01-31 Thread Guilherme G. Piccoli
On 31/01/2023 14:52, Alex Deucher wrote:
> [...]
>> (b) We can't use sched.ready, which would make sense...but amdgpu
>> overrides its meaning, the driver manipulates this value for its own
>> purposes of tracking ring init, or something like that.
>>
>> This is the tangential topic: what should we do here? My understanding
>> of Alex's message is that we could have a "ready" field in the ring
>> structure and stop messing with sched.ready - does it make sense Alex?
> 
> Yes, I think so.  The tricky part will be figuring out which current
> sched.ready checks are checking for the scheduler being ready vs.
> whether the ring itself is ready.
> 

Thanks, makes sense!

$ grep -nr "sched.ready" drivers/gpu/drm/amd/ | wc -l
83

Maybe not super tough, I hope heh

>>
>> Guchun / Christian, does it also make sense for you?
>>
>>
>> Regarding (a), I could re-submit having s/sched.name/sched.ops, no
>> biggies, I tested both to be fair, before sending...I just chose name
>> but any field that is proper initialized on drm_sched_init() would work.
> 
> Yeah, I think ops is fine.  You could even use sched.ready after we
> clean up the use of that in the driver.  There are already a bunch of
> places where we check sched.ready to check if the scheduler really is
> ready.

Hmm... unfortunately, that doesn't work. This was a case in which sched.ready
was set to true in the ring init routine, but the scheduler wasn't properly
initialized. So, a different key for comparison is required... I'll
re-submit with sched.ops.

After a potential rework of the driver to get rid of the sched.ready
manipulation, this check could then be changed to properly use that
flag... does that make sense to you?

Thanks again for the prompt review!
Cheers,


Guilherme


Re: [PATCH] drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini

2023-01-31 Thread Guilherme G. Piccoli
On 31/01/2023 10:58, Chen, Guchun wrote:
> Hi Christian,
> 
> Do you think if it makes sense that we can set 'ring->sched.ready' to be true 
> in each ring init, even if before executing/setting up drm_sched_init in 
> amdgpu_device_init_schedulers? As 'ready' is a member of gpu scheduler 
> structure.
> 
> Regards,
> Guchun
> 

Hi folks, thanks a lot for the feedback so far, much appreciated!

I'm feeling a bit confused, especially since there seem to be 2
orthogonal (yet related) topics being discussed; let me try to summarize
my understanding so we can then discuss further:

(a) The first problem is the one addressed in the patch - how to prevent
drm_sched_fini() from getting called if drm_sched_init() wasn't called?

I've proposed sched.name; it seems Christian prefers sched.ops, correct?


(b) We can't use sched.ready, which would make sense... but amdgpu
overrides its meaning - the driver manipulates this value for its own
purposes of tracking ring init, or something like that.

This is the tangential topic: what should we do here? My understanding
of Alex's message is that we could have a "ready" field in the ring
structure and stop messing with sched.ready - does it make sense Alex?

Guchun / Christian, does it also make sense for you?


Regarding (a), I could re-submit with s/sched.name/sched.ops, no
biggies - to be fair, I tested both before sending... I just chose name,
but any field that is properly initialized in drm_sched_init() would work.
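
And back to (b) for a second: a rough sketch of what that ring-level
flag could look like (field name hypothetical, not an actual patch):

/* Sketch: track "this ring can take IBs" in amdgpu's own structure, so
 * sched.ready keeps the DRM scheduler's meaning of "drm_sched_init()
 * completed successfully". */
struct amdgpu_ring {
	/* ... existing members ... */
	bool	hw_ready;	/* set by amdgpu ring init / ring tests */
};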

Thanks,


Guilherme


Re: [PATCH] drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini

2023-01-30 Thread Guilherme G. Piccoli
+ Luben

(sorry, missed that in the first submission).

On 30/01/2023 18:45, Guilherme G. Piccoli wrote:
> Currently amdgpu calls drm_sched_fini() from the fence driver sw fini
> routine - such function is expected to be called only after the
> respective init function - drm_sched_init() - was executed successfully.
> 
> Happens that we faced a driver probe failure in the Steam Deck
> recently, and the function drm_sched_fini() was called even without
> its counter-part had been previously called, causing the following oops:
> 
> amdgpu: probe of :04:00.0 failed with error -110
> BUG: kernel NULL pointer dereference, address: 0090
> PGD 0 P4D 0
> Oops: 0002 [#1] PREEMPT SMP NOPTI
> CPU: 0 PID: 609 Comm: systemd-udevd Not tainted 6.2.0-rc3-gpiccoli #338
> Hardware name: Valve Jupiter/Jupiter, BIOS F7A0113 11/04/2022
> RIP: 0010:drm_sched_fini+0x84/0xa0 [gpu_sched]
> [...]
> Call Trace:
>  
>  amdgpu_fence_driver_sw_fini+0xc8/0xd0 [amdgpu]
>  amdgpu_device_fini_sw+0x2b/0x3b0 [amdgpu]
>  amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
>  devm_drm_dev_init_release+0x49/0x70
>  [...]
> 
> To prevent that, check if the drm_sched was properly initialized for a
> given ring before calling its fini counter-part.
> 
> Notice ideally we'd use sched.ready for that; such field is set as the latest
> thing on drm_sched_init(). But amdgpu seems to "override" the meaning of such
> field - in the above oops for example, it was a GFX ring causing the crash, 
> and
> the sched.ready field was set to true in the ring init routine, regardless of
> the state of the DRM scheduler. Hence, we ended-up using another sched field.
> 
> Fixes: 067f44c8b459 ("drm/amdgpu: avoid over-handle of fence driver fini in 
> s3 test (v2)")
> Cc: Andrey Grodzovsky 
> Cc: Guchun Chen 
> Cc: Mario Limonciello 
> Signed-off-by: Guilherme G. Piccoli 
> ---
> 
> 
> Hi folks, first of all thanks in advance for reviews / comments!
> Notice that I've used the Fixes tag more in the sense to bring it
> to stable, I didn't find a good patch candidate that added the
> call to drm_sched_fini(), was reaching way too old commits...so
> 067f44c8b459 seems a good candidate - or maybe not?
> 
> Now, with regards sched.ready, spent a bit of time to figure what
> was happening...would be feasible maybe to stop using that to
> mark some kind ring status? I think it should be possible to add a
> flag to the ring structure for that, and free sched.ready from
> being manipulate by the amdgpu driver, what's your thoughts on that?
> 
> I could try myself, but first of course I'd like to raise the
> "temperature" on this topic and check if somebody is already working
> on that.
> 
> Cheers,
> 
> Guilherme
> 
> 
>  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index 00444203220d..e154eb8241fb 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -618,7 +618,13 @@ void amdgpu_fence_driver_sw_fini(struct amdgpu_device 
> *adev)
>   if (!ring || !ring->fence_drv.initialized)
>   continue;
>  
> - if (!ring->no_scheduler)
> + /*
> +  * Notice we check for sched.name since there's some
> +  * override on the meaning of sched.ready by amdgpu.
> +  * The natural check would be sched.ready, which is
> +  * set as drm_sched_init() finishes...
> +  */
> + if (!ring->no_scheduler && ring->sched.name)
>   drm_sched_fini(&ring->sched);
>  
>   for (j = 0; j <= ring->fence_drv.num_fences_mask; ++j)


[PATCH] drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini

2023-01-30 Thread Guilherme G. Piccoli
Currently amdgpu calls drm_sched_fini() from the fence driver sw fini
routine - such function is expected to be called only after the
respective init function - drm_sched_init() - was executed successfully.

Happens that we faced a driver probe failure on the Steam Deck
recently, and drm_sched_fini() was called even though its counterpart
had not been previously called, causing the following oops:

amdgpu: probe of :04:00.0 failed with error -110
BUG: kernel NULL pointer dereference, address: 0090
PGD 0 P4D 0
Oops: 0002 [#1] PREEMPT SMP NOPTI
CPU: 0 PID: 609 Comm: systemd-udevd Not tainted 6.2.0-rc3-gpiccoli #338
Hardware name: Valve Jupiter/Jupiter, BIOS F7A0113 11/04/2022
RIP: 0010:drm_sched_fini+0x84/0xa0 [gpu_sched]
[...]
Call Trace:
 
 amdgpu_fence_driver_sw_fini+0xc8/0xd0 [amdgpu]
 amdgpu_device_fini_sw+0x2b/0x3b0 [amdgpu]
 amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
 devm_drm_dev_init_release+0x49/0x70
 [...]

To prevent that, check if the drm_sched was properly initialized for a
given ring before calling its fini counter-part.

Notice that ideally we'd use sched.ready for that; such field is set as the
last step of drm_sched_init(). But amdgpu seems to "override" the meaning of
that field - in the above oops, for example, it was a GFX ring causing the
crash, and sched.ready was set to true in the ring init routine, regardless
of the state of the DRM scheduler. Hence, we ended up using another sched field.

Fixes: 067f44c8b459 ("drm/amdgpu: avoid over-handle of fence driver fini in s3 test (v2)")
Cc: Andrey Grodzovsky 
Cc: Guchun Chen 
Cc: Mario Limonciello 
Signed-off-by: Guilherme G. Piccoli 
---


Hi folks, first of all thanks in advance for reviews / comments!
Notice that I've used the Fixes tag more in the sense of bringing it
to stable; I didn't find a good patch candidate that added the
call to drm_sched_fini(), as I was reaching way too old commits... so
067f44c8b459 seems a good candidate - or maybe not?

Now, with regard to sched.ready, I spent a bit of time figuring out what
was happening... would it maybe be feasible to stop using that to
mark some kind of ring status? I think it should be possible to add a
flag to the ring structure for that, and free sched.ready from
being manipulated by the amdgpu driver - what are your thoughts on that?

I could try myself, but first of course I'd like to raise the
"temperature" on this topic and check if somebody is already working
on that.

Cheers,

Guilherme


 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 00444203220d..e154eb8241fb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -618,7 +618,13 @@ void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev)
if (!ring || !ring->fence_drv.initialized)
continue;
 
-   if (!ring->no_scheduler)
+   /*
+* Notice we check for sched.name since there's some
+* override on the meaning of sched.ready by amdgpu.
+* The natural check would be sched.ready, which is
+* set as drm_sched_init() finishes...
+*/
+   if (!ring->no_scheduler && ring->sched.name)
drm_sched_fini(&ring->sched);
 
for (j = 0; j <= ring->fence_drv.num_fences_mask; ++j)
-- 
2.39.0



[PATCH] drm/amd/display: Trivial swizzle-related code clean-ups

2023-01-30 Thread Guilherme G. Piccoli
This is a very trivial code clean-up related to commit 5468c36d6285
("drm/amd/display: Filter Invalid 420 Modes for HDMI TMDS"). That commit
added a validation on driver probe to prevent invalid TMDS modes, but one
of the fake properties (swizzle) ended up causing a warning on driver
probe; it was reported here: https://gitlab.freedesktop.org/drm/amd/-/issues/2264.

It was fixed by commit 105a8b8698e2 ("drm/amd/display: patch cases with
unknown plane state to prevent warning"), but the validation code had
a double variable assignment, which we hereby remove. Also, the fix relies
on the dcn2{0,1}_patch_unknown_plane_state() callbacks, so while at it we
took the opportunity to perform a small code clean-up in those routines.

Cc: Aurabindo Pillai 
Cc: Daniel Wheeler 
Cc: Fangzhi Zuo 
Cc: Harry Wentland 
Cc: Leo Li 
Cc: Mark Broadworth 
Cc: Melissa Wen 
Cc: Rodrigo Siqueira 
Cc: Sung Joon Kim 
Cc: Swapnil Patel 
Signed-off-by: Guilherme G. Piccoli 
---
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 1 -
 drivers/gpu/drm/amd/display/dc/dcn20/dcn20_resource.c | 8 ++--
 drivers/gpu/drm/amd/display/dc/dcn21/dcn21_resource.c | 6 ++
 3 files changed, 4 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 86a2f7f58550..e71e94663d14 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -6336,7 +6336,6 @@ static enum dc_status dm_validate_stream_and_context(struct dc *dc,
dc_plane_state->plane_size.surface_size.width  = stream->src.width;
dc_plane_state->plane_size.chroma_size.height  = stream->src.height;
dc_plane_state->plane_size.chroma_size.width   = stream->src.width;
-   dc_plane_state->tiling_info.gfx9.swizzle =  DC_SW_UNKNOWN;
dc_plane_state->format = SURFACE_PIXEL_FORMAT_GRPH_ARGB;
dc_plane_state->tiling_info.gfx9.swizzle = DC_SW_UNKNOWN;
dc_plane_state->rotation = ROTATION_ANGLE_0;
diff --git a/drivers/gpu/drm/amd/display/dc/dcn20/dcn20_resource.c b/drivers/gpu/drm/amd/display/dc/dcn20/dcn20_resource.c
index 531f405d2554..3af24ef9cb2d 100644
--- a/drivers/gpu/drm/amd/display/dc/dcn20/dcn20_resource.c
+++ b/drivers/gpu/drm/amd/display/dc/dcn20/dcn20_resource.c
@@ -2225,14 +2225,10 @@ enum dc_status dcn20_patch_unknown_plane_state(struct dc_plane_state *plane_stat
enum surface_pixel_format surf_pix_format = plane_state->format;
unsigned int bpp = resource_pixel_format_to_bpp(surf_pix_format);
 
-   enum swizzle_mode_values swizzle = DC_SW_LINEAR;
-
+   plane_state->tiling_info.gfx9.swizzle = DC_SW_64KB_S;
if (bpp == 64)
-   swizzle = DC_SW_64KB_D;
-   else
-   swizzle = DC_SW_64KB_S;
+   plane_state->tiling_info.gfx9.swizzle = DC_SW_64KB_D;
 
-   plane_state->tiling_info.gfx9.swizzle = swizzle;
return DC_OK;
 }
 
diff --git a/drivers/gpu/drm/amd/display/dc/dcn21/dcn21_resource.c b/drivers/gpu/drm/amd/display/dc/dcn21/dcn21_resource.c
index fbcf0afeae0d..8f9244fe5c86 100644
--- a/drivers/gpu/drm/amd/display/dc/dcn21/dcn21_resource.c
+++ b/drivers/gpu/drm/amd/display/dc/dcn21/dcn21_resource.c
@@ -1393,15 +1393,13 @@ static uint32_t read_pipe_fuses(struct dc_context *ctx)
 
 static enum dc_status dcn21_patch_unknown_plane_state(struct dc_plane_state *plane_state)
 {
-   enum dc_status result = DC_OK;
-
if (plane_state->ctx->dc->debug.disable_dcc == DCC_ENABLE) {
plane_state->dcc.enable = 1;
/* align to our worst case block width */
plane_state->dcc.meta_pitch = ((plane_state->src_rect.width + 1023) / 1024) * 1024;
}
-   result = dcn20_patch_unknown_plane_state(plane_state);
-   return result;
+
+   return dcn20_patch_unknown_plane_state(plane_state);
 }
 
 static const struct resource_funcs dcn21_res_pool_funcs = {
-- 
2.39.0



Re: [PATCH v2 2/2] drm/amdgpu/vcn: Remove redundant indirect SRAM HW model check

2023-01-17 Thread Guilherme G. Piccoli
On 17/01/2023 15:14, Limonciello, Mario wrote:
> [...]
>>
>> On 17/01/2023 15:08, Limonciello, Mario wrote:
>>> [...]
>>>
>>> Should have added this tag too:
>>> Suggested-by: Alexander Deucher 
>>>
>>> Looks good to me, thanks!
>>> Reviewed-by: Mario Limonciello 
>>>
>>
>> You're totally right, thanks for the reminder and apologies for missing
>> that! Just sending V3 heheh
>>
>> Ah, thanks for the reviews and prompt responses.
>> Cheers,
>>
>>
>> Guilherme
> 
> No need to resend.  Patchwork will embed the tags when we pick this up.

Already did, but thanks again for the info - learning a lot in this
thread =)


[PATCH v3 2/2] drm/amdgpu/vcn: Remove redundant indirect SRAM HW model check

2023-01-17 Thread Guilherme G. Piccoli
The HW model validation that guards the indirect SRAM checking in the
VCN code path is redundant - every model is covered by the switch, and
every case performs the exact same check, making the switch useless in
practice [0].

So, let's remove this switch statement for good.

[0] lore.kernel.org/amd-gfx/mn0pr12mb61013d20b8a2263b22ae1bcfe2...@mn0pr12mb6101.namprd12.prod.outlook.com

Suggested-by: Alex Deucher 
Reviewed-by: Mario Limonciello 
Cc: James Zhu 
Cc: Lazar Lijo 
Cc: Leo Liu 
Cc: Sonny Jiang 
Signed-off-by: Guilherme G. Piccoli 
---


V3:
* Added Mario's review tag and Alex's suggested tag - thanks
for the reminder Mario!

V2:
* Changed the approach after ML discussion - instead of cleaning up
the switch statement, removed it entirely - special thanks to Alex
and Mario for the feedback!
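
For reference, with the switch gone, what remains is just the common
check that every case above duplicated (it shows up as context in the
Vangogh quirk patch from 2023-03-12, earlier in this digest):

	if ((adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) &&
	    (adev->pg_flags & AMD_PG_SUPPORT_VCN_DPG))
		adev->vcn.indirect_sram = true;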


 drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c | 81 +
 1 file changed, 3 insertions(+), 78 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
index 1b1a3c9e1863..02d428ddf2f8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
@@ -110,84 +110,9 @@ int amdgpu_vcn_sw_init(struct amdgpu_device *adev)
for (i = 0; i < adev->vcn.num_vcn_inst; i++)
atomic_set(&adev->vcn.inst[i].dpg_enc_submission_cnt, 0);
 
-   switch (adev->ip_versions[UVD_HWIP][0]) {
-   case IP_VERSION(1, 0, 0):
-   case IP_VERSION(1, 0, 1):
-   case IP_VERSION(2, 5, 0):
-   if ((adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) &&
-   (adev->pg_flags & AMD_PG_SUPPORT_VCN_DPG))
-   adev->vcn.indirect_sram = true;
-   break;
-   case IP_VERSION(2, 2, 0):
-   if ((adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) &&
-   (adev->pg_flags & AMD_PG_SUPPORT_VCN_DPG))
-   adev->vcn.indirect_sram = true;
-   break;
-   case IP_VERSION(2, 6, 0):
-   if ((adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) &&
-   (adev->pg_flags & AMD_PG_SUPPORT_VCN_DPG))
-   adev->vcn.indirect_sram = true;
-   break;
-   case IP_VERSION(2, 0, 0):
-   if ((adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) &&
-   (adev->pg_flags & AMD_PG_SUPPORT_VCN_DPG))
-   adev->vcn.indirect_sram = true;
-   break;
-   case IP_VERSION(2, 0, 2):
-   if ((adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) &&
-   (adev->pg_flags & AMD_PG_SUPPORT_VCN_DPG))
-   adev->vcn.indirect_sram = true;
-   break;
-   case IP_VERSION(3, 0, 0):
-   case IP_VERSION(3, 0, 64):
-   case IP_VERSION(3, 0, 192):
-   if ((adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) &&
-   (adev->pg_flags & AMD_PG_SUPPORT_VCN_DPG))
-   adev->vcn.indirect_sram = true;
-   break;
-   case IP_VERSION(3, 0, 2):
-   if ((adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) &&
-   (adev->pg_flags & AMD_PG_SUPPORT_VCN_DPG))
-   adev->vcn.indirect_sram = true;
-   break;
-   case IP_VERSION(3, 0, 16):
-   if ((adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) &&
-   (adev->pg_flags & AMD_PG_SUPPORT_VCN_DPG))
-   adev->vcn.indirect_sram = true;
-   break;
-   case IP_VERSION(3, 0, 33):
-   if ((adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) &&
-   (adev->pg_flags & AMD_PG_SUPPORT_VCN_DPG))
-   adev->vcn.indirect_sram = true;
-   break;
-   case IP_VERSION(3, 1, 1):
-   if ((adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) &&
-   (adev->pg_flags & AMD_PG_SUPPORT_VCN_DPG))
-   adev->vcn.indirect_sram = true;
-   break;
-   case IP_VERSION(3, 1, 2):
-   if ((adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) &&
-   (adev->pg_flags & AMD_PG_SUPPORT_VCN_DPG))
-   adev->vcn.indirect_sram = true;
-   break;
-   case IP_VERSION(4, 0, 0):
-   if ((adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) &&
-   (adev->pg_flags & AMD_PG_SUPPORT_VCN_DPG))
-   adev->vcn.indirect_sram = true;
-   break;
-   case IP_VERSION(4, 0, 2):
-   if ((adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) &&
-   (adev->pg_flags & AMD_PG_SUPPORT_VCN_DPG))
-   adev->vcn.indirect_s

[PATCH v3 1/2] drm/amdgpu/vcn: Adjust firmware names indentation

2023-01-17 Thread Guilherme G. Piccoli
This is an incredibly trivial fix, just for the sake of the "aesthetic"
organization of the defines. Some were space-based, most were tab-based,
and there was a lack of alignment; now they're all consistent and
aligned.

Cc: James Zhu 
Cc: Lazar Lijo 
Cc: Leo Liu 
Cc: Mario Limonciello 
Cc: Sonny Jiang 
Reviewed-by: Alex Deucher 
Signed-off-by: Guilherme G. Piccoli 
---


V2/V3:
* Added Alex's review tag - thanks!


 drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c | 38 -
 1 file changed, 19 insertions(+), 19 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
index f8397d993f23..1b1a3c9e1863 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
@@ -36,26 +36,26 @@
 #include "soc15d.h"
 
 /* Firmware Names */
-#define FIRMWARE_RAVEN "amdgpu/raven_vcn.bin"
-#define FIRMWARE_PICASSO   "amdgpu/picasso_vcn.bin"
-#define FIRMWARE_RAVEN2"amdgpu/raven2_vcn.bin"
-#define FIRMWARE_ARCTURUS  "amdgpu/arcturus_vcn.bin"
-#define FIRMWARE_RENOIR"amdgpu/renoir_vcn.bin"
-#define FIRMWARE_GREEN_SARDINE "amdgpu/green_sardine_vcn.bin"
-#define FIRMWARE_NAVI10"amdgpu/navi10_vcn.bin"
-#define FIRMWARE_NAVI14"amdgpu/navi14_vcn.bin"
-#define FIRMWARE_NAVI12"amdgpu/navi12_vcn.bin"
-#define FIRMWARE_SIENNA_CICHLID"amdgpu/sienna_cichlid_vcn.bin"
-#define FIRMWARE_NAVY_FLOUNDER "amdgpu/navy_flounder_vcn.bin"
-#define FIRMWARE_VANGOGH   "amdgpu/vangogh_vcn.bin"
+#define FIRMWARE_RAVEN "amdgpu/raven_vcn.bin"
+#define FIRMWARE_PICASSO   "amdgpu/picasso_vcn.bin"
+#define FIRMWARE_RAVEN2"amdgpu/raven2_vcn.bin"
+#define FIRMWARE_ARCTURUS  "amdgpu/arcturus_vcn.bin"
+#define FIRMWARE_RENOIR"amdgpu/renoir_vcn.bin"
+#define FIRMWARE_GREEN_SARDINE "amdgpu/green_sardine_vcn.bin"
+#define FIRMWARE_NAVI10"amdgpu/navi10_vcn.bin"
+#define FIRMWARE_NAVI14"amdgpu/navi14_vcn.bin"
+#define FIRMWARE_NAVI12"amdgpu/navi12_vcn.bin"
+#define FIRMWARE_SIENNA_CICHLID"amdgpu/sienna_cichlid_vcn.bin"
+#define FIRMWARE_NAVY_FLOUNDER "amdgpu/navy_flounder_vcn.bin"
+#define FIRMWARE_VANGOGH   "amdgpu/vangogh_vcn.bin"
 #define FIRMWARE_DIMGREY_CAVEFISH  "amdgpu/dimgrey_cavefish_vcn.bin"
-#define FIRMWARE_ALDEBARAN "amdgpu/aldebaran_vcn.bin"
-#define FIRMWARE_BEIGE_GOBY"amdgpu/beige_goby_vcn.bin"
-#define FIRMWARE_YELLOW_CARP   "amdgpu/yellow_carp_vcn.bin"
-#define FIRMWARE_VCN_3_1_2 "amdgpu/vcn_3_1_2.bin"
-#define FIRMWARE_VCN4_0_0  "amdgpu/vcn_4_0_0.bin"
-#define FIRMWARE_VCN4_0_2  "amdgpu/vcn_4_0_2.bin"
-#define FIRMWARE_VCN4_0_4  "amdgpu/vcn_4_0_4.bin"
+#define FIRMWARE_ALDEBARAN "amdgpu/aldebaran_vcn.bin"
+#define FIRMWARE_BEIGE_GOBY"amdgpu/beige_goby_vcn.bin"
+#define FIRMWARE_YELLOW_CARP   "amdgpu/yellow_carp_vcn.bin"
+#define FIRMWARE_VCN_3_1_2 "amdgpu/vcn_3_1_2.bin"
+#define FIRMWARE_VCN4_0_0  "amdgpu/vcn_4_0_0.bin"
+#define FIRMWARE_VCN4_0_2  "amdgpu/vcn_4_0_2.bin"
+#define FIRMWARE_VCN4_0_4  "amdgpu/vcn_4_0_4.bin"
 
 MODULE_FIRMWARE(FIRMWARE_RAVEN);
 MODULE_FIRMWARE(FIRMWARE_PICASSO);
-- 
2.39.0



Re: [PATCH v2 2/2] drm/amdgpu/vcn: Remove redundant indirect SRAM HW model check

2023-01-17 Thread Guilherme G. Piccoli
On 17/01/2023 15:08, Limonciello, Mario wrote:
> [...]
> 
> Should have added this tag too:
> Suggested-by: Alexander Deucher 
> 
> Looks good to me, thanks!
> Reviewed-by: Mario Limonciello 
> 

You're totally right, thanks for the reminder and apologies for missing
that! Just sending V3 heheh

Ah, thanks for the reviews and prompt responses.
Cheers,


Guilherme


[PATCH v2 2/2] drm/amdgpu/vcn: Remove redundant indirect SRAM HW model check

2023-01-17 Thread Guilherme G. Piccoli
The HW model validation that guards the indirect SRAM checking in the
VCN code path is redundant - every model is covered by the switch, and
every case performs the exact same check, making the switch useless in
practice [0].

So, let's remove this switch statement for good.

[0] lore.kernel.org/amd-gfx/mn0pr12mb61013d20b8a2263b22ae1bcfe2...@mn0pr12mb6101.namprd12.prod.outlook.com

Cc: James Zhu 
Cc: Lazar Lijo 
Cc: Leo Liu 
Cc: Mario Limonciello 
Cc: Sonny Jiang 
Signed-off-by: Guilherme G. Piccoli 
---


V2:
* Changed the approach after ML discussion - instead of cleaning up
the switch statement, removed it entirely - special thanks to Alex
and Mario for the feedback!

Notice that patch 3 was dropped from this series after reviews.


 drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c | 81 +
 1 file changed, 3 insertions(+), 78 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
index 1b1a3c9e1863..02d428ddf2f8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
@@ -110,84 +110,9 @@ int amdgpu_vcn_sw_init(struct amdgpu_device *adev)
for (i = 0; i < adev->vcn.num_vcn_inst; i++)
atomic_set(&adev->vcn.inst[i].dpg_enc_submission_cnt, 0);
 
-   switch (adev->ip_versions[UVD_HWIP][0]) {
-   case IP_VERSION(1, 0, 0):
-   case IP_VERSION(1, 0, 1):
-   case IP_VERSION(2, 5, 0):
-   if ((adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) &&
-   (adev->pg_flags & AMD_PG_SUPPORT_VCN_DPG))
-   adev->vcn.indirect_sram = true;
-   break;
-   case IP_VERSION(2, 2, 0):
-   if ((adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) &&
-   (adev->pg_flags & AMD_PG_SUPPORT_VCN_DPG))
-   adev->vcn.indirect_sram = true;
-   break;
-   case IP_VERSION(2, 6, 0):
-   if ((adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) &&
-   (adev->pg_flags & AMD_PG_SUPPORT_VCN_DPG))
-   adev->vcn.indirect_sram = true;
-   break;
-   case IP_VERSION(2, 0, 0):
-   if ((adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) &&
-   (adev->pg_flags & AMD_PG_SUPPORT_VCN_DPG))
-   adev->vcn.indirect_sram = true;
-   break;
-   case IP_VERSION(2, 0, 2):
-   if ((adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) &&
-   (adev->pg_flags & AMD_PG_SUPPORT_VCN_DPG))
-   adev->vcn.indirect_sram = true;
-   break;
-   case IP_VERSION(3, 0, 0):
-   case IP_VERSION(3, 0, 64):
-   case IP_VERSION(3, 0, 192):
-   if ((adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) &&
-   (adev->pg_flags & AMD_PG_SUPPORT_VCN_DPG))
-   adev->vcn.indirect_sram = true;
-   break;
-   case IP_VERSION(3, 0, 2):
-   if ((adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) &&
-   (adev->pg_flags & AMD_PG_SUPPORT_VCN_DPG))
-   adev->vcn.indirect_sram = true;
-   break;
-   case IP_VERSION(3, 0, 16):
-   if ((adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) &&
-   (adev->pg_flags & AMD_PG_SUPPORT_VCN_DPG))
-   adev->vcn.indirect_sram = true;
-   break;
-   case IP_VERSION(3, 0, 33):
-   if ((adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) &&
-   (adev->pg_flags & AMD_PG_SUPPORT_VCN_DPG))
-   adev->vcn.indirect_sram = true;
-   break;
-   case IP_VERSION(3, 1, 1):
-   if ((adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) &&
-   (adev->pg_flags & AMD_PG_SUPPORT_VCN_DPG))
-   adev->vcn.indirect_sram = true;
-   break;
-   case IP_VERSION(3, 1, 2):
-   if ((adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) &&
-   (adev->pg_flags & AMD_PG_SUPPORT_VCN_DPG))
-   adev->vcn.indirect_sram = true;
-   break;
-   case IP_VERSION(4, 0, 0):
-   if ((adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) &&
-   (adev->pg_flags & AMD_PG_SUPPORT_VCN_DPG))
-   adev->vcn.indirect_sram = true;
-   break;
-   case IP_VERSION(4, 0, 2):
-   if ((adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) &&
-   (adev->pg_flags & AMD_PG_SUPPORT_VCN_DPG))
-   adev->vcn.indirect_sram = true;
-   break;
-   case I

[PATCH v2 1/2] drm/amdgpu/vcn: Adjust firmware names indentation

2023-01-17 Thread Guilherme G. Piccoli
This is an incredibly trivial fix, just for the sake of the "aesthetic"
organization of the defines. Some were space-based, most were tab-based,
and there was a lack of alignment; now they're all consistent and
aligned.

Cc: James Zhu 
Cc: Lazar Lijo 
Cc: Leo Liu 
Cc: Mario Limonciello 
Cc: Sonny Jiang 
Reviewed-by: Alex Deucher 
Signed-off-by: Guilherme G. Piccoli 
---


V2:
* Added Alex's review tag - thanks!


 drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c | 38 -
 1 file changed, 19 insertions(+), 19 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
index f8397d993f23..1b1a3c9e1863 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
@@ -36,26 +36,26 @@
 #include "soc15d.h"
 
 /* Firmware Names */
-#define FIRMWARE_RAVEN "amdgpu/raven_vcn.bin"
-#define FIRMWARE_PICASSO   "amdgpu/picasso_vcn.bin"
-#define FIRMWARE_RAVEN2"amdgpu/raven2_vcn.bin"
-#define FIRMWARE_ARCTURUS  "amdgpu/arcturus_vcn.bin"
-#define FIRMWARE_RENOIR"amdgpu/renoir_vcn.bin"
-#define FIRMWARE_GREEN_SARDINE "amdgpu/green_sardine_vcn.bin"
-#define FIRMWARE_NAVI10"amdgpu/navi10_vcn.bin"
-#define FIRMWARE_NAVI14"amdgpu/navi14_vcn.bin"
-#define FIRMWARE_NAVI12"amdgpu/navi12_vcn.bin"
-#define FIRMWARE_SIENNA_CICHLID"amdgpu/sienna_cichlid_vcn.bin"
-#define FIRMWARE_NAVY_FLOUNDER "amdgpu/navy_flounder_vcn.bin"
-#define FIRMWARE_VANGOGH   "amdgpu/vangogh_vcn.bin"
+#define FIRMWARE_RAVEN "amdgpu/raven_vcn.bin"
+#define FIRMWARE_PICASSO   "amdgpu/picasso_vcn.bin"
+#define FIRMWARE_RAVEN2"amdgpu/raven2_vcn.bin"
+#define FIRMWARE_ARCTURUS  "amdgpu/arcturus_vcn.bin"
+#define FIRMWARE_RENOIR"amdgpu/renoir_vcn.bin"
+#define FIRMWARE_GREEN_SARDINE "amdgpu/green_sardine_vcn.bin"
+#define FIRMWARE_NAVI10"amdgpu/navi10_vcn.bin"
+#define FIRMWARE_NAVI14"amdgpu/navi14_vcn.bin"
+#define FIRMWARE_NAVI12"amdgpu/navi12_vcn.bin"
+#define FIRMWARE_SIENNA_CICHLID"amdgpu/sienna_cichlid_vcn.bin"
+#define FIRMWARE_NAVY_FLOUNDER "amdgpu/navy_flounder_vcn.bin"
+#define FIRMWARE_VANGOGH   "amdgpu/vangogh_vcn.bin"
 #define FIRMWARE_DIMGREY_CAVEFISH  "amdgpu/dimgrey_cavefish_vcn.bin"
-#define FIRMWARE_ALDEBARAN "amdgpu/aldebaran_vcn.bin"
-#define FIRMWARE_BEIGE_GOBY"amdgpu/beige_goby_vcn.bin"
-#define FIRMWARE_YELLOW_CARP   "amdgpu/yellow_carp_vcn.bin"
-#define FIRMWARE_VCN_3_1_2 "amdgpu/vcn_3_1_2.bin"
-#define FIRMWARE_VCN4_0_0  "amdgpu/vcn_4_0_0.bin"
-#define FIRMWARE_VCN4_0_2  "amdgpu/vcn_4_0_2.bin"
-#define FIRMWARE_VCN4_0_4  "amdgpu/vcn_4_0_4.bin"
+#define FIRMWARE_ALDEBARAN "amdgpu/aldebaran_vcn.bin"
+#define FIRMWARE_BEIGE_GOBY"amdgpu/beige_goby_vcn.bin"
+#define FIRMWARE_YELLOW_CARP   "amdgpu/yellow_carp_vcn.bin"
+#define FIRMWARE_VCN_3_1_2 "amdgpu/vcn_3_1_2.bin"
+#define FIRMWARE_VCN4_0_0  "amdgpu/vcn_4_0_0.bin"
+#define FIRMWARE_VCN4_0_2  "amdgpu/vcn_4_0_2.bin"
+#define FIRMWARE_VCN4_0_4  "amdgpu/vcn_4_0_4.bin"
 
 MODULE_FIRMWARE(FIRMWARE_RAVEN);
 MODULE_FIRMWARE(FIRMWARE_PICASSO);
-- 
2.39.0



Re: [PATCH 3/3] drm/amdgpu/vcn: Add parameter to force (en/dis)abling indirect SRAM mode

2023-01-17 Thread Guilherme G. Piccoli
On 17/01/2023 13:24, Limonciello, Mario wrote:
> [...]
>>> Though I see two problems with that: first, I'm not sure what's the
>>> impact in the GPU functioning when I disable some IP block.
>>>
> 
> It depends on the individual block what the impact is.  For example
> if you don't have VCN, then you can't do any accelerated video playback.
> 
>>> Second, the parameter is a bit hard to figure - we need to clear a bit
>>> for the IP block we want to disable, and the doc suggest to read on
>>> dmesg to get this information (it seems it changes depending on the HW
>>> model), but I couldn't parse the proper bit from dmesg. Needed to
>>> instrument the kernel to find the proper bit heh
>>>
> 
> Isn't it this stuff (taken from a CZN system):
> 
> [7.797779] [drm] add ip block number 0 
> [7.797781] [drm] add ip block number 1 
> [7.797782] [drm] add ip block number 2 
> [7.797783] [drm] add ip block number 3 
> [7.797783] [drm] add ip block number 4 
> [7.797784] [drm] add ip block number 5 
> [7.797785] [drm] add ip block number 6 
> [7.797786] [drm] add ip block number 7 
> [7.797787] [drm] add ip block number 8 
> [7.797788] [drm] add ip block number 9 
> 
> So for that system it would be bit 8 to disable vcn.
> 
> In terms of how debugging would work:
> I would expect when you get your failure it will have been the previous
> block # that failed, and so you can reboot with that block masked and
> see if you get further.
> 

Thanks Mario, much appreciated! You're totally right, and I messed up by
not seeing these obvious messages...
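
For the record - assuming the CZN dmesg above, where VCN is IP block
number 8 - the mask would be computed like this (hypothetical example):

/* default mask has all bits set; clear bit 8 to disable IP block 8 (VCN) */
unsigned int ip_block_mask = 0xffffffff & ~(1u << 8);	/* = 0xfffffeff */
/* i.e. boot with: amdgpu.ip_block_mask=0xfffffeff */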

So, I'll just drop the parameter on V2.
Cheers,


Guilherme


Re: [PATCH 3/3] drm/amdgpu/vcn: Add parameter to force (en/dis)abling indirect SRAM mode

2023-01-17 Thread Guilherme G. Piccoli
On 16/01/2023 23:33, Limonciello, Mario wrote:
> [...]
> 
> For debugging these type of problems, I think an effective debugging 
> tactic would have been to mask the IP block (amdgpu.ip_block_mask).

Thank you, it worked indeed - nice suggestion!

Though I see two problems with that: first, I'm not sure what the
impact on GPU functioning is when I disable some IP block.

Second, the parameter is a bit hard to figure out - we need to clear a bit
for the IP block we want to disable, and the doc suggests reading
dmesg to get this information (it seems to change depending on the HW
model), but I couldn't parse the proper bit from dmesg. I needed to
instrument the kernel to find the proper bit heh

The second part is easy to improve (we can just show this bit in
dmesg!); I might do that instead of proposing this parameter, which
didn't seem to raise much excitement after all heheh

Finally, I'm still curious why the Deck was working fine with the
indirect SRAM mode disabled (by mistake) in many kernels - was it in
practice the same as disabling the VCN IP block?

Thanks,


Guilherme



Re: [PATCH 3/3] drm/amdgpu/vcn: Add parameter to force (en/dis)abling indirect SRAM mode

2023-01-16 Thread Guilherme G. Piccoli
On 16/01/2023 20:00, Alex Deucher wrote:
> [...]
> 
> It's not clear to me when this would be used.  We only disable it
> briefly during new asic bring up, after that we never touch it again.
> No end user on production hardware should ever have to change it and
> doing so would break VCN on their system.
> 
> Alex

[+ Pierre-Loup]

Steam Deck is facing a pretty weird scenario then heheh

Commit 82132ecc543 ("drm/amdgpu: enable Vangogh VCN indirect sram mode")
corrected a long-term issue in which the indirect SRAM mode wasn't
enabled for Vangogh - and the Deck's GPU architecture is Vangogh, so it
had been working perfectly with that disabled.

Happens that a bad FW candidate seems to have broken something - it was
a bit convoluted to debug, but we proved that disabling indirect SRAM is
a good workaround so far; it "restores the behavior" from before
82132ecc543.

Hence my proposal - this parameter would've made life so much easier,
and we're starting to use it "downstream" now. While I understand that of
course the FW should be fixed, in the meantime this is a cheap solution
to allow further debugging and real use of the system.

Thanks,


Guilherme


Re: [PATCH 3/3] drm/amdgpu/vcn: Add parameter to force (en/dis)abling indirect SRAM mode

2023-01-16 Thread Guilherme G. Piccoli
On 16/01/2023 19:27, Liu, Leo wrote:
> Secure part requires PSP load VCN boot sequence which is with indirect
> sram mode.
> 
> Regards,
> Leo

With that said, and with the plan of just setting the indirect SRAM
mode unconditionally (with no switch/cases anymore - I'll submit the V2
tomorrow if everybody agrees), does anybody oppose having this
parameter for debugging purposes?

I first considered doing it through debugfs, but that was obviously a
(very) bad idea, since this should be an early boot tunable heheh

Cheers,


Guilherme


Re: [PATCH 2/3] drm/amdgpu/vcn: Deduplicate indirect SRAM checking code

2023-01-16 Thread Guilherme G. Piccoli
On 16/01/2023 18:49, Limonciello, Mario wrote:
> [...] 
> 
> The problem with adding a comment about which platform it's in is that
> this will probably go stale.  Some IP blocks are re-used in multiple ASICs.
> 
> VCN 3, 1, 1 is a great example.  This is used in both Yellow Carp (Rembrandt) 
> as well
> as Mendocino.   I would think a better approach would be to have a single 
> point
> of documentation somewhere in the source tree that we can update after new
> ASICs launch that could act as a decoder ring.
> 
> It's effort to remember to update it, but it's at least a single point of 
> failure.
> 

Hi Mario, thanks - it makes a lot of sense! I can remove the comments in
V2; I'll wait a bit more for (potential) extra feedback, but I agree
with you, so I'm inclined to remove them.

Cheers,


Guilherme


Re: [PATCH 2/3] drm/amdgpu/vcn: Deduplicate indirect SRAM checking code

2023-01-16 Thread Guilherme G. Piccoli
On 16/01/2023 18:41, Alex Deucher wrote:
> [...]
>> +   case IP_VERSION(4, 0, 0):   /* Vega10 */
> 
> This comment is incorrect.  Vega10 does not have VCN (it has UVD and VCE).
> 
> Alex

Thanks Alex! I'll resubmit V2 with this comment removed.

I've stolen it from
https://gitlab.freedesktop.org/agd5f/linux/-/blob/amd-staging-drm-next/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c#L1147
, but obviously I misunderstood the mapping here heheh

Cheers,


Guilherme


[PATCH 3/3] drm/amdgpu/vcn: Add parameter to force (en/dis)abling indirect SRAM mode

2023-01-16 Thread Guilherme G. Piccoli
Currently the FW loading path performs some checks based on the IP model
and, in case it is advertised as supported, the VCN indirect SRAM
mode is used.

Happens that if there's any issue in the FW and this mode ends up
not being properly supported, the driver probe fails [0]. Debugging
this requires rebuilding the driver, so to allow fast debugging and
experiments, add a parameter to force setting the indirect SRAM mode
to true/false from the kernel command line; the parameter default is -1,
which doesn't change the current driver's behavior.

[0] Example of this issue, observed on Steam Deck:

[drm] kiq ring mec 2 pipe 1 q 0
[drm] failed to load ucode VCN0_RAM(0x3A)
[drm] psp gfx command LOAD_IP_FW(0x6) failed and response status is (0x)
amdgpu 0000:04:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring vcn_dec_0 test failed (-110)
[drm:amdgpu_device_init.cold [amdgpu]] *ERROR* hw_init of IP block  failed -110
amdgpu 0000:04:00.0: amdgpu: amdgpu_device_ip_init failed
amdgpu 0000:04:00.0: amdgpu: Fatal error during GPU init

Cc: James Zhu 
Cc: Lazar Lijo 
Cc: Leo Liu 
Cc: Mario Limonciello 
Cc: Sonny Jiang 
Signed-off-by: Guilherme G. Piccoli 
---


This work is based on agd5f/amd-staging-drm-next branch.
Thanks in advance for reviews/comments!
Cheers,

Guilherme


 drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 9 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c | 3 +++
 3 files changed, 13 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 872450a3a164..5d3c92c94f18 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -215,6 +215,7 @@ extern int amdgpu_noretry;
 extern int amdgpu_force_asic_type;
 extern int amdgpu_smartshift_bias;
 extern int amdgpu_use_xgmi_p2p;
+extern int amdgpu_indirect_sram;
 #ifdef CONFIG_HSA_AMD
 extern int sched_policy;
 extern bool debug_evictions;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 06aba201d4db..c7182c0bc841 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -187,6 +187,7 @@ int amdgpu_num_kcq = -1;
 int amdgpu_smartshift_bias;
 int amdgpu_use_xgmi_p2p = 1;
 int amdgpu_vcnfw_log;
+int amdgpu_indirect_sram = -1;
 
 static void amdgpu_drv_delayed_reset_work_handler(struct work_struct *work);
 
@@ -941,6 +942,14 @@ MODULE_PARM_DESC(smu_pptable_id,
"specify pptable id to be used (-1 = auto(default) value, 0 = use 
pptable from vbios, > 0 = soft pptable id)");
 module_param_named(smu_pptable_id, amdgpu_smu_pptable_id, int, 0444);
 
+/**
+ * DOC: indirect_sram (int)
+ * Allow users to force using (or not) the VCN indirect SRAM mode in the fw 
load
+ * code. Default is -1, meaning auto (aka, don't mess with driver's behavior).
+ */
+MODULE_PARM_DESC(indirect_sram, "Force VCN indirect SRAM (-1 = auto (default), 
0 = disabled, 1 = enabled)");
+module_param_named(indirect_sram, amdgpu_indirect_sram, int, 0444);
+
 /* These devices are not supported by amdgpu.
  * They are supported by the mach64, r128, radeon drivers
  */
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
index 1f880e162d9d..a2290087e01c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
@@ -137,6 +137,9 @@ int amdgpu_vcn_sw_init(struct amdgpu_device *adev)
return -EINVAL;
}
 
+   if (amdgpu_indirect_sram >= 0)
+   adev->vcn.indirect_sram = (bool)amdgpu_indirect_sram;
+
hdr = (const struct common_firmware_header *)adev->vcn.fw->data;
adev->vcn.fw_version = le32_to_cpu(hdr->ucode_version);
 
-- 
2.39.0



[PATCH 2/3] drm/amdgpu/vcn: Deduplicate indirect SRAM checking code

2023-01-16 Thread Guilherme G. Piccoli
Currently both conditions checked to enable indirect SRAM are
repeated for every HW/IP model. Let's just simplify it a bit so the
code is more compact and readable.

While at it, add the legacy names as a comment per case block, to
help whoever is reading the code and doesn't have much experience
with the name/number mapping.

Cc: James Zhu 
Cc: Lazar Lijo 
Cc: Leo Liu 
Cc: Mario Limonciello 
Cc: Sonny Jiang 
Signed-off-by: Guilherme G. Piccoli 
---


Hi folks, first of all thanks in advance for reviews/comments!

This work is based on the agd5f/amd-staging-drm-next branch - there is
a patch from Mario there that refactored amdgpu_vcn.c, and since it
dropped the legacy names from the switch cases, I've decided to also
include them here as comments.

I'm not sure if that's a good idea; it feels helpful for somebody not
so used to the code to read the codenames instead of pure numbers, but
if you wanna move away from the legacy names for good, lemme know and
I can rework without these comments.
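
For clarity, the deduplicated code ends up with roughly this shape (a
sketch of the pattern with only a few case labels shown - the full
list is in the diff below):

	switch (adev->ip_versions[UVD_HWIP][0]) {
	case IP_VERSION(1, 0, 0):	/* Raven (1/2) / Picasso */
	case IP_VERSION(2, 5, 0):	/* Arcturus */
	case IP_VERSION(3, 0, 2):	/* Vangogh */
		/* ...all the other supported IP versions fall through... */
		if ((adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) &&
		    (adev->pg_flags & AMD_PG_SUPPORT_VCN_DPG))
			adev->vcn.indirect_sram = true;
		break;
	default:
		break;
	}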

Cheers,


Guilherme


 drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c | 84 +
 1 file changed, 16 insertions(+), 68 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
index 1b1a3c9e1863..1f880e162d9d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
@@ -111,78 +111,26 @@ int amdgpu_vcn_sw_init(struct amdgpu_device *adev)
		atomic_set(&adev->vcn.inst[i].dpg_enc_submission_cnt, 0);
 
switch (adev->ip_versions[UVD_HWIP][0]) {
-   case IP_VERSION(1, 0, 0):
-   case IP_VERSION(1, 0, 1):
-   case IP_VERSION(2, 5, 0):
-   if ((adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) &&
-   (adev->pg_flags & AMD_PG_SUPPORT_VCN_DPG))
-   adev->vcn.indirect_sram = true;
-   break;
-   case IP_VERSION(2, 2, 0):
-   if ((adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) &&
-   (adev->pg_flags & AMD_PG_SUPPORT_VCN_DPG))
-   adev->vcn.indirect_sram = true;
-   break;
-   case IP_VERSION(2, 6, 0):
-   if ((adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) &&
-   (adev->pg_flags & AMD_PG_SUPPORT_VCN_DPG))
-   adev->vcn.indirect_sram = true;
-   break;
-   case IP_VERSION(2, 0, 0):
-   if ((adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) &&
-   (adev->pg_flags & AMD_PG_SUPPORT_VCN_DPG))
-   adev->vcn.indirect_sram = true;
-   break;
-   case IP_VERSION(2, 0, 2):
-   if ((adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) &&
-   (adev->pg_flags & AMD_PG_SUPPORT_VCN_DPG))
-   adev->vcn.indirect_sram = true;
-   break;
-   case IP_VERSION(3, 0, 0):
-   case IP_VERSION(3, 0, 64):
-   case IP_VERSION(3, 0, 192):
-   if ((adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) &&
-   (adev->pg_flags & AMD_PG_SUPPORT_VCN_DPG))
-   adev->vcn.indirect_sram = true;
-   break;
-   case IP_VERSION(3, 0, 2):
-   if ((adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) &&
-   (adev->pg_flags & AMD_PG_SUPPORT_VCN_DPG))
-   adev->vcn.indirect_sram = true;
-   break;
-   case IP_VERSION(3, 0, 16):
-   if ((adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) &&
-   (adev->pg_flags & AMD_PG_SUPPORT_VCN_DPG))
-   adev->vcn.indirect_sram = true;
-   break;
-   case IP_VERSION(3, 0, 33):
-   if ((adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) &&
-   (adev->pg_flags & AMD_PG_SUPPORT_VCN_DPG))
-   adev->vcn.indirect_sram = true;
-   break;
-   case IP_VERSION(3, 1, 1):
-   if ((adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) &&
-   (adev->pg_flags & AMD_PG_SUPPORT_VCN_DPG))
-   adev->vcn.indirect_sram = true;
-   break;
+   case IP_VERSION(1, 0, 0):   /* Raven (1/2) / Picasso */
+   case IP_VERSION(1, 0, 1):   /* Raven (1/2) / Picasso */
+   case IP_VERSION(2, 0, 0):   /* Navi10 */
+   case IP_VERSION(2, 0, 2):   /* Navi12 / Navi14 */
+   case IP_VERSION(2, 2, 0):   /* Renoir / Green Sardine */
+   case IP_VERSION(2, 5, 0):   /* Arcturus */
+   case IP_VERSION(2, 6, 0):   /* Aldebaran */
+   case IP_VERSION(3, 0, 0):   /* Sienna Cichlid / Navy Flounder */
+   case IP_VERSION(3, 0, 2):   /* Vangogh */
+   case IP_VERSION(3, 0, 

[PATCH 1/3] drm/amdgpu/vcn: Adjust firmware names indentation

2023-01-16 Thread Guilherme G. Piccoli
This is an incredibly trivial fix, just for the sake of "aesthetic"
organization of the defines. Some were space-based, most were tab-based,
and there was a lack of alignment; now they're all consistent and
aligned.

Cc: James Zhu 
Cc: Lazar Lijo 
Cc: Leo Liu 
Cc: Mario Limonciello 
Cc: Sonny Jiang 
Signed-off-by: Guilherme G. Piccoli 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c | 38 -
 1 file changed, 19 insertions(+), 19 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
index f8397d993f23..1b1a3c9e1863 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vcn.c
@@ -36,26 +36,26 @@
 #include "soc15d.h"
 
 /* Firmware Names */
-#define FIRMWARE_RAVEN "amdgpu/raven_vcn.bin"
-#define FIRMWARE_PICASSO   "amdgpu/picasso_vcn.bin"
-#define FIRMWARE_RAVEN2"amdgpu/raven2_vcn.bin"
-#define FIRMWARE_ARCTURUS  "amdgpu/arcturus_vcn.bin"
-#define FIRMWARE_RENOIR"amdgpu/renoir_vcn.bin"
-#define FIRMWARE_GREEN_SARDINE "amdgpu/green_sardine_vcn.bin"
-#define FIRMWARE_NAVI10"amdgpu/navi10_vcn.bin"
-#define FIRMWARE_NAVI14"amdgpu/navi14_vcn.bin"
-#define FIRMWARE_NAVI12"amdgpu/navi12_vcn.bin"
-#define FIRMWARE_SIENNA_CICHLID"amdgpu/sienna_cichlid_vcn.bin"
-#define FIRMWARE_NAVY_FLOUNDER "amdgpu/navy_flounder_vcn.bin"
-#define FIRMWARE_VANGOGH   "amdgpu/vangogh_vcn.bin"
+#define FIRMWARE_RAVEN "amdgpu/raven_vcn.bin"
+#define FIRMWARE_PICASSO   "amdgpu/picasso_vcn.bin"
+#define FIRMWARE_RAVEN2"amdgpu/raven2_vcn.bin"
+#define FIRMWARE_ARCTURUS  "amdgpu/arcturus_vcn.bin"
+#define FIRMWARE_RENOIR"amdgpu/renoir_vcn.bin"
+#define FIRMWARE_GREEN_SARDINE "amdgpu/green_sardine_vcn.bin"
+#define FIRMWARE_NAVI10"amdgpu/navi10_vcn.bin"
+#define FIRMWARE_NAVI14"amdgpu/navi14_vcn.bin"
+#define FIRMWARE_NAVI12"amdgpu/navi12_vcn.bin"
+#define FIRMWARE_SIENNA_CICHLID"amdgpu/sienna_cichlid_vcn.bin"
+#define FIRMWARE_NAVY_FLOUNDER "amdgpu/navy_flounder_vcn.bin"
+#define FIRMWARE_VANGOGH   "amdgpu/vangogh_vcn.bin"
 #define FIRMWARE_DIMGREY_CAVEFISH  "amdgpu/dimgrey_cavefish_vcn.bin"
-#define FIRMWARE_ALDEBARAN "amdgpu/aldebaran_vcn.bin"
-#define FIRMWARE_BEIGE_GOBY"amdgpu/beige_goby_vcn.bin"
-#define FIRMWARE_YELLOW_CARP   "amdgpu/yellow_carp_vcn.bin"
-#define FIRMWARE_VCN_3_1_2 "amdgpu/vcn_3_1_2.bin"
-#define FIRMWARE_VCN4_0_0  "amdgpu/vcn_4_0_0.bin"
-#define FIRMWARE_VCN4_0_2  "amdgpu/vcn_4_0_2.bin"
-#define FIRMWARE_VCN4_0_4  "amdgpu/vcn_4_0_4.bin"
+#define FIRMWARE_ALDEBARAN "amdgpu/aldebaran_vcn.bin"
+#define FIRMWARE_BEIGE_GOBY"amdgpu/beige_goby_vcn.bin"
+#define FIRMWARE_YELLOW_CARP   "amdgpu/yellow_carp_vcn.bin"
+#define FIRMWARE_VCN_3_1_2 "amdgpu/vcn_3_1_2.bin"
+#define FIRMWARE_VCN4_0_0  "amdgpu/vcn_4_0_0.bin"
+#define FIRMWARE_VCN4_0_2  "amdgpu/vcn_4_0_2.bin"
+#define FIRMWARE_VCN4_0_4  "amdgpu/vcn_4_0_4.bin"
 
 MODULE_FIRMWARE(FIRMWARE_RAVEN);
 MODULE_FIRMWARE(FIRMWARE_PICASSO);
-- 
2.39.0



[PATCH] drm/amd/display: Fix warning caused by a bad fake property

2023-01-14 Thread Guilherme G. Piccoli
Commit 5468c36d6285 ("drm/amd/display: Filter Invalid 420 Modes for HDMI TMDS")
added a validation on driver probe to prevent invalid TMDS modes. For that,
it needs to fake some plane properties before submitting them to the
validation routines.

It happens that one of these fake properties (swizzle) causes a boot-time
warning due to an assert in the swizzle routines. Since it's a fake property,
change it to a valid value to prevent that warning. Also, while at it,
remove the unnecessary double assignment of this same property.

Fixes: 5468c36d6285 ("drm/amd/display: Filter Invalid 420 Modes for HDMI TMDS")
Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/2264
Reported-by: Melissa Wen 
Cc: Fangzhi Zuo 
Cc: Harry Wentland 
Cc: Leo Li 
Cc: Mark Broadworth 
Cc: Rodrigo Siqueira 
Cc: Sung Joon Kim 
Signed-off-by: Guilherme G. Piccoli 
---


Hi folks, notice my choice here was to set swizzle to DC_SW_LINEAR;
I considered creating a new enum member (like "DC_SW_FAKE", or
something like that), but it seems easier to just use the LINEAR one.
In my (very cheap!) understanding, this shouldn't affect the purpose
of the TMDS validation.

Lemme know if there's something to improve here, and thanks in
advance for reviews/comments.
Cheers,


Guilherme


 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 1b7f20a9d4ae..35ab39fd9345 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -6261,9 +6261,8 @@ static enum dc_status dm_validate_stream_and_context(struct dc *dc,
dc_plane_state->plane_size.surface_size.width  = stream->src.width;
dc_plane_state->plane_size.chroma_size.height  = stream->src.height;
dc_plane_state->plane_size.chroma_size.width   = stream->src.width;
-   dc_plane_state->tiling_info.gfx9.swizzle =  DC_SW_UNKNOWN;
dc_plane_state->format = SURFACE_PIXEL_FORMAT_GRPH_ARGB;
-   dc_plane_state->tiling_info.gfx9.swizzle = DC_SW_UNKNOWN;
+   dc_plane_state->tiling_info.gfx9.swizzle = DC_SW_LINEAR;
dc_plane_state->rotation = ROTATION_ANGLE_0;
dc_plane_state->is_tiling_rotated = false;
dc_plane_state->tiling_info.gfx8.array_mode = DC_ARRAY_LINEAR_GENERAL;
-- 
2.39.0



Re: [PATCH v1] drm/scheduler: Fix lockup in drm_sched_entity_kill()

2022-12-29 Thread Guilherme G. Piccoli
On 29/12/2022 07:15, Dmitry Osipenko wrote:
> [...]
> I'll push the patch to misc-fixes as soon as it will be rebased on
> 6.2-rc. Thanks!
> 
> Best regards,
> Dmitry
> 

Thank you Dmitry, much appreciated!
Cheers,


Guilherme


Re: [PATCH v1] drm/scheduler: Fix lockup in drm_sched_entity_kill()

2022-12-27 Thread Guilherme G. Piccoli
Hi Dmitry / Christian, thanks for the fix!

(And thanks Melissa for pointing that out, saving me lots of time in
research heh)

Is this fix planned to be released in the 6.2-rc cycle? I've just tested
it on the Steam Deck, and it resolved a lockup observed since v6.2-rc1 -
exactly the same thing mentioned in the commit message.

FWIW:
Tested-By: Guilherme G. Piccoli  # Steam Deck

Cheers,


Guilherme


Re: [PATCH] drm/todo: Update panic handling todo

2022-02-24 Thread Guilherme G. Piccoli
On 24/02/2022 10:24, Daniel Vetter wrote:
> Some things changed, and add two useful links.
> 
> v2: Also include a link to the QR encoding work. Plus review from
> Javier.
> 
> Reviewed-by: Javier Martinez Canillas 
> Cc: Javier Martinez Canillas 
> Cc: Pekka Paalanen 
> Cc: gpicc...@igalia.com
> Signed-off-by: Daniel Vetter 
> ---

Hi Daniel, thanks for the improvement - much appreciated!

Below a comment inline; other than that, feel free to add my:
Reviewed-by: Guilherme G. Piccoli 

Cheers,


Guilherme

> [...]
>  
> -* There's also proposal for a simplied DRM console instead of the full-blown
> -  fbcon and DRM fbdev emulation. Any kind of panic handling tricks should
> -  obviously work for both console, in case we ever get kmslog merged.
> +* Encoding the actual oops and preceeding dmesg in a QR might help with the

Email client complains about "preceeding" word - misspelled maybe?

> +  dread "important stuff scrolled away" problem. See `[RFC][PATCH] Oops 
> messages
> +  transfer using QR codes
> +  
> <https://lore.kernel.org/lkml/1446217392-11981-1-git-send-email-alexandru.murt...@intel.com/>`_
> +  for some example code that could be reused.
>  
>  Contact: Daniel Vetter
>  


Re: Reuse framebuffer after a kexec (amdgpu / efifb)

2021-12-10 Thread Guilherme G. Piccoli
On 10/12/2021 12:13, Christian König wrote:
> [...]
> How about issuing a PCIe reset and re-initializing the ASIC with just 
> the VBIOS?
> 
> That should be pretty straightforward I think.
> 
> Christian.


Thanks Christian, that'd be perfect! Is it feasible? Per Alex's comment,
we'd need to run atombios commands to reprogram the timings, display
info, etc... like a small driver would do - a full init.

Also, what kind of PCIe reset is recommended for this adapter? A hot
reset, power-off/re-power, FLR, or that MODE2 reset present in the
amdgpu code? Remember this is an APU device.

Thanks a lot!



Re: Reuse framebuffer after a kexec (amdgpu / efifb)

2021-12-10 Thread Guilherme G. Piccoli
On 10/12/2021 11:16, Alex Deucher wrote:
> [...]
> Why not just reload the driver after kexec?
> 
> Alex

Because the original issue is the kdump case, where we want a very very
tiny kernel - also, the crash could have been caused by amdgpu itself,
so if it's a GPU issue, we don't want to mess with the GPU in kdump.
And I confess I tried modprobing amdgpu after a kdump, with no
success - kdump won't call shutdown handlers, so the GPU will be in a
"rogue" state...

My question was about regular kexec because it's usually much simpler -
we can do whatever we want there. My line of thought was: if I make it
work in regular kexec with a simple framebuffer, I might be able to get
it working on kdump heheh




Re: Reuse framebuffer after a kexec (amdgpu / efifb)

2021-12-10 Thread Guilherme G. Piccoli
Thanks a lot Alex / Gerd / Thomas, very informative stuff! I'm glad
there are projects to collect/save the data and reuse it after a
kdump - this is very useful.

I'll continue my study on the atombios thing of AMD and QXL; maybe at
least we can make it work in qemu, which would be great (like a small
init driver to reprogram the paravirtual device on kexec boot).

Cheers,


Guilherme


Reuse framebuffer after a kexec (amdgpu / efifb)

2021-12-09 Thread Guilherme G. Piccoli
Hi all, I have a question about the possibility of reusing a framebuffer
after a regular (or panic) kexec - my case is with amdgpu (an APU, i.e.,
not a separate GPU), but I guess the question is kinda generic, hence
I've looped in most of the lists / people I think make sense here
(apologies for duplicates).


The context is: we have hardware with an amdgpu-controlled device
(Vangogh model), and as soon as the machine boots, efifb provides
graphics - I understand the UEFI/GRUB outputs rely on the EFI
framebuffer as well. As soon as the amdgpu module is available, the
kernel loads it and it takes over the GPU, providing graphics. The
kexec_file_load syscall allows passing a valid screen_info structure,
so by kexec'ing a new kernel, we again have efifb taking over at boot
time, but this time I see nothing on the screen. I've manually
blacklisted amdgpu in this new kexec'ed kernel, since I'd like to rely
on the simple framebuffer - the goal is to have a tiny kernel kexec'ed.
I'm using kernel version 5.16.0-rc4.
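
(For reference, this is the kind of invocation I mean - paths are
illustrative; the -s flag makes kexec-tools use the kexec_file_load
syscall:

  kexec -s -l /boot/vmlinuz-test --initrd=/boot/initrd-test \
        --append="console=tty0 modprobe.blacklist=amdgpu"
  kexec -e
)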

I've done some other experiments, for example: I've forced the
screen_info model to match VLFB, so vesafb took over after the kexec,
with the same result. I also noticed that the BusMaster bit was off
after kexec in the AMD APU PCIe device, so I've set it in efifb before
probe; and finally I tested the same things in qemu with qxl - all with
the same result (blank screen).
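
(The BusMaster experiment was essentially the following - a minimal
sketch assuming we already hold the GPU's struct pci_dev, error
handling mostly omitted; not the actual hack I used:

  #include <linux/pci.h>

  /* Re-enable decoding and bus mastering that the previous kernel's
   * driver may have left disabled across the kexec. */
  static void reenable_gpu_busmaster(struct pci_dev *pdev)
  {
	if (pci_enable_device(pdev))	/* turn MEM/IO decoding back on */
		return;
	pci_set_master(pdev);		/* set the BusMaster bit in PCI_COMMAND */
  }
)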
The most interesting result I got (both with amdgpu and qemu/qxl) is
that if I blacklist these drivers and let the machine keep using efifb
from the beginning, then after kexec efifb is still able to produce
graphics.

This then led me to think that there's likely something fundamentally
"blocking" the reuse of the simple framebuffer after kexec - maybe the
DRM stack is destroying the old framebuffer somehow? What kind of
preparation is required at the firmware level to make the simple EFI
VGA framebuffer work, and could we perform this in a kexec (or "save
it" before the amdgpu/qxl drivers take over, and reuse it later)?

Any advice is greatly appreciated!
Thanks in advance,


Guilherme


Re: Reuse framebuffer after a kexec (amdgpu / efifb)

2021-12-09 Thread Guilherme G. Piccoli
On 09/12/2021 14:31, Alex Deucher wrote:
> [...] 
> Once the driver takes over, none of the pre-driver state is retained.
> You'll need to load the driver in the new kernel to initialize the
> displays.  Note the efifb doesn't actually have the ability to program
> any hardware, it just takes over the memory region that was used for
> the pre-OS framebuffer and whatever display timing was set up by the
> GOP driver prior to the OS loading.  Once that OS driver has loaded
> the area is gone and the display configuration may have changed.
> 

Hi Christian and Alex, thanks for the clarifications!

Is there any way to save/retain this state before amdgpu takes over?
Would simpledrm be able to program the device again, to a working state?

Finally, do you have any example of such a GOP driver (open source) so
I can take a look? I tried to find something like that in the Tianocore
project, but didn't find anything that seemed useful for my issue.

Thanks again!


Re: Reuse framebuffer after a kexec (amdgpu / efifb)

2021-12-09 Thread Guilherme G. Piccoli
Thanks again Alex! Some comments inlined below:

On 09/12/2021 15:06, Alex Deucher wrote:
> Not really in a generic way.  It's asic and platform specific.  In
> addition most modern displays require link training to bring up the
> display, so you can't just save and restore registers.

Oh sure, I understand that. My question is more like: is there a way,
inside the amdgpu driver, to save this state before taking
over/overwriting/reprogramming the device? Then we could (again, from
inside the amdgpu driver) dump this pre-saved state in the shutdown
handler, for example, leaving the device in a "pre-OS" state when the
newly kexec'ed kernel starts.

> 
> The drivers are asic and platform specific.  E.g., the driver for
> vangogh is different from renoir is different from skylake, etc.  The
> display programming interfaces are asic specific.

Cool, that makes sense! But if you (or anybody here) know some of these
GOP drivers, e.g. for the qemu/qxl device, I'm just curious to
see/understand how complex the FW driver is, to just put the
device/screen in a usable state.

Cheers,


Guilherme