RE: [PATCH] drm/amd/mst: clean DP main link status only when unplug mst 1st link

2020-08-03 Thread Lin, Wayne
[AMD Public Use]

Sorry for spamming. 
It's a duplicate one and should be ignored.

> -Original Message-
> From: Wayne Lin 
> Sent: Tuesday, August 4, 2020 11:37 AM
> To: amd-gfx@lists.freedesktop.org
> Cc: Pillai, Aurabindo ; Wu, Hersen
> ; Kazlauskas, Nicholas
> ; Siqueira, Rodrigo
> ; Zuo, Jerry ; Lin, Wayne
> 
> Subject: [PATCH] drm/amd/mst: clean DP main link status only when unplug
> mst 1st link
> 
> [Why]
> Under DP daisy chain scenario as below:
> 
>   Src - Monitor_1 - Monitor_2
> 
> If unplug 2nd Monitor_2 and plug in again, observe that Monitor_1 doesn't
> light up.
> 
> When unplug 2nd monitor, we clear the
> dc_link->cur_link_settings.lane_count in dm_dp_destroy_mst_connector().
> However this link status is a shared data structure by all connected mst
> monitors. Although the 2nd monitor is gone, this link status should still be
> retained for other connected mst monitors. Otherwise, when we plug the 2nd
> monitor in again, we find out that main link is not trained and do link 
> training
> again. Payload ID Table for Monitor_1 is ruined and we don't reallocate it.
> 
> [How]
> In dm_dp_destroy_mst_connector(), only clean the cur_link_settings when we
> no longer do mst mode.
> 
> Signed-off-by: Wayne Lin 
> ---
>  drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_mst_types.c | 5
> -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git
> a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_mst_types.c
> b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_mst_types.c
> index 2c10352fa514..526f29598403 100644
> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_mst_types.c
> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_mst_types.c
> @@ -415,7 +415,10 @@ static void dm_dp_destroy_mst_connector(struct
> drm_dp_mst_topology_mgr *mgr,
>  aconnector->dc_sink);
>   dc_sink_release(aconnector->dc_sink);
>   aconnector->dc_sink = NULL;
> - aconnector->dc_link->cur_link_settings.lane_count = 0;
> + mutex_lock(>lock);
> + if (!mgr->mst_state)
> + aconnector->dc_link->cur_link_settings.lane_count = 0;
> + mutex_unlock(>lock);
>   }
> 
>   drm_connector_unregister(connector);
> --
> 2.17.1
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH] drm/amd/dc: Read from DP_SINK_COUNT_ESI for DPDC r1.2 or higher

2020-08-03 Thread Wayne Lin
[Why]
According to DP spec, DPRX with DPCD r1.2 or higher shall have the
same Link/Sink Device Status field registers at DPCD Addresses 00200h
through 00205h to the corresponding DPRX Event Status Indicator
registers at DPCD Addresses 02002h through 0200Fh. We now only read from
02002h when DPCD revision number is r1.4 or higher while handling short
HPD. Need to correct that.

[How]
Set to read from 02002h when DPCD is r1.2 or higher

Signed-off-by: Wayne Lin 
---
 drivers/gpu/drm/amd/display/dc/core/dc_link_dp.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/dc/core/dc_link_dp.c 
b/drivers/gpu/drm/amd/display/dc/core/dc_link_dp.c
index 2bfa4e35c2cf..9fb1543b4c73 100644
--- a/drivers/gpu/drm/amd/display/dc/core/dc_link_dp.c
+++ b/drivers/gpu/drm/amd/display/dc/core/dc_link_dp.c
@@ -1834,9 +1834,9 @@ static enum dc_status read_hpd_rx_irq_data(
 * fail, so we now explicitly read 6 bytes which is
 * the req from the above mentioned test cases.
 *
-* For DP 1.4 we need to read those from 2002h range.
+* For DPCD r1.2 or higher, we need to read those from 2002h range.
 */
-   if (link->dpcd_caps.dpcd_rev.raw < DPCD_REV_14)
+   if (link->dpcd_caps.dpcd_rev.raw < DPCD_REV_12)
retval = core_link_read_dpcd(
link,
DP_SINK_COUNT,
-- 
2.17.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH] drm/amd/mst: clean DP main link status only when unplug mst 1st link

2020-08-03 Thread Wayne Lin
[Why]
Under DP daisy chain scenario as below:

Src - Monitor_1 - Monitor_2

If unplug 2nd Monitor_2 and plug in again, observe that Monitor_1
doesn't light up.

When unplug 2nd monitor, we clear the
dc_link->cur_link_settings.lane_count in dm_dp_destroy_mst_connector().
However this link status is a shared data structure by all connected mst
monitors. Although the 2nd monitor is gone, this link status should
still be retained for other connected mst monitors. Otherwise, when we
plug the 2nd monitor in again, we find out that main link is not trained
and do link training again. Payload ID Table for Monitor_1 is ruined and
we don't reallocate it.

[How]
In dm_dp_destroy_mst_connector(), only clean the cur_link_settings when
we no longer do mst mode.

Signed-off-by: Wayne Lin 
---
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_mst_types.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_mst_types.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_mst_types.c
index 2c10352fa514..526f29598403 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_mst_types.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_mst_types.c
@@ -415,7 +415,10 @@ static void dm_dp_destroy_mst_connector(struct 
drm_dp_mst_topology_mgr *mgr,
   aconnector->dc_sink);
dc_sink_release(aconnector->dc_sink);
aconnector->dc_sink = NULL;
-   aconnector->dc_link->cur_link_settings.lane_count = 0;
+   mutex_lock(>lock);
+   if (!mgr->mst_state)
+   aconnector->dc_link->cur_link_settings.lane_count = 0;
+   mutex_unlock(>lock);
}
 
drm_connector_unregister(connector);
-- 
2.17.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH] drm/amd/mst: clean DP main link status only when unplug mst 1st link

2020-08-03 Thread Wayne Lin
[Why]
Under DP daisy chain scenario as below:

Src - Monitor_1 - Monitor_2

If unplug 2nd Monitor_2 and plug in again, observe that Monitor_1
doesn't light up.

When unplug 2nd monitor, we clear the
dc_link->cur_link_settings.lane_count in dm_dp_destroy_mst_connector().
However this link status is a shared data structure by all connected mst
monitors. Although the 2nd monitor is gone, this link status should
still be retained for other connected mst monitors. Otherwise, when we
plug the 2nd monitor in again, we find out that main link is not trained
and do link training again. Payload ID Table for Monitor_1 is ruined and
we don't reallocate it.

[How]
In dm_dp_destroy_mst_connector(), only clean the cur_link_settings when
we no longer do mst mode.

Signed-off-by: Wayne Lin 
---
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_mst_types.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_mst_types.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_mst_types.c
index 2c10352fa514..526f29598403 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_mst_types.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_mst_types.c
@@ -415,7 +415,10 @@ static void dm_dp_destroy_mst_connector(struct 
drm_dp_mst_topology_mgr *mgr,
   aconnector->dc_sink);
dc_sink_release(aconnector->dc_sink);
aconnector->dc_sink = NULL;
-   aconnector->dc_link->cur_link_settings.lane_count = 0;
+   mutex_lock(>lock);
+   if (!mgr->mst_state)
+   aconnector->dc_link->cur_link_settings.lane_count = 0;
+   mutex_unlock(>lock);
}
 
drm_connector_unregister(connector);
-- 
2.17.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH v6 12/12] x86/traps: Fix up invalid PASID

2020-08-03 Thread Dave Hansen
On 8/3/20 8:12 AM, Andy Lutomirski wrote:
> I could easily be convinced that the PASID fixup is so trivial and so
> obviously free of misfiring in a way that causes an infinite loop that
> this code is fine.  But I think we first need to answer the bigger
> question of why we're doing a lazy fixup in the first place.

There was an (internal to Intel) implementation of this about a year ago
that used smp_call_function_many() to force the MSR state into all
threads of a process.  I took one look at it, decided there was a 0%
chance of it actually functioning and recommended we find another way.
While I'm sure it could be done more efficiently, the implementation I
looked at took ~200 lines of code and comments.  It started to look too
much like another instance of mm->cpumask for my comfort.

The only other option I could think of would be an ABI where threads
were required to call into the kernel at least once after creation
before calling ENQCMD.  All ENQCMDs would be required to be "wrapped" by
code doing this syscall.  Something like this:

if (!thread_local(did_pasid_init))
sys_pasid_init(); // new syscall or prctl

thread_local(did_pasid_init) = 1;

// ENQCMD here
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH v6 12/12] x86/traps: Fix up invalid PASID

2020-08-03 Thread Raj, Ashok
On Mon, Aug 03, 2020 at 08:12:18AM -0700, Andy Lutomirski wrote:
> On Mon, Aug 3, 2020 at 8:03 AM Dave Hansen  wrote:
> >
> > On 7/31/20 4:34 PM, Andy Lutomirski wrote:
> > >> Thomas suggested to provide a reason for the #GP caused by executing 
> > >> ENQCMD
> > >> without a valid PASID value programmed. #GP error codes are 16 bits and 
> > >> all
> > >> 16 bits are taken. Refer to SDM Vol 3, Chapter 16.13 for details. The 
> > >> other
> > >> choice was to reflect the error code in an MSR. ENQCMD can also cause #GP
> > >> when loading from the source operand, so its not fully comprehending all
> > >> the reasons. Rather than special case the ENQCMD, in future Intel may
> > >> choose a different fault mechanism for such cases if recovery is needed 
> > >> on
> > >> #GP.
> > > Decoding the user instruction is ugly and sets a bad architecture
> > > precedent, but we already do it in #GP for UMIP.  So I'm unconvinced.
> >
> > I'll try to do one more bit of convincing. :)
> >
> > In the end, we need a way to figure out if the #GP was from a known "OK"
> > source that we can fix up.  You're right that we could fire up the
> > instruction decoder to help answer that question.  But, it (also)
> > doesn't easily yield a perfect answer as to the source of the #GP, it
> > always involves a user copy, and it's a larger code impact than what
> > we've got.
> >
> > I think I went and looked at fixup_umip_exception(), and compared it to
> > the alternative which is essentially just these three lines of code:
> >
> > > + /*
> > > +  * If the current task already has a valid PASID in the MSR,
> > > +  * the #GP must be for some other reason.
> > > +  */
> > > + if (current->has_valid_pasid)
> > > + return false;
> > ...> +  /* Now the current task has a valid PASID in the MSR. */
> > > + current->has_valid_pasid = 1;
> >
> > and *I* was convinced that instruction decoding wasn't worth it.
> >
> > There's a lot of stuff that fixup_umip_exception() does which we don't
> > have to duplicate, but it's going to be really hard to get it anywhere
> > near as compact as what we've got.
> >
> 
> I could easily be convinced that the PASID fixup is so trivial and so
> obviously free of misfiring in a way that causes an infinite loop that
> this code is fine.  But I think we first need to answer the bigger
> question of why we're doing a lazy fixup in the first place.

We choose lazy fixup for 2 reasons. 

- If some threads were already created before the MSR is programmed, then
  we need to fixup those in a race free way. scheduling some task-work etc.
  We did do that early on, but decided it was ugly. 
- Not all threads need to submit ENQCMD, force feeding the MSR probably
  isn't even required for all. Yes the overhead isn't probably big, but
  might not even be required for all threads.

We needed to fixup MSR in two different way. To keep the code simple, the
choice was to only fixup on #GP, that eliminated the extra code we need to
support case1.

Cheers,
Ashok
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH v6 12/12] x86/traps: Fix up invalid PASID

2020-08-03 Thread Dave Hansen
On 8/3/20 10:16 AM, Andy Lutomirski wrote:
> - TILE: genuinely per-thread, but it's expensive so it's
> lazy-loadable.  But the lazy-load mechanism reuses #NM, and it's not
> fully disambiguated from the other use of #NM.  So it sort of works,
> but it's gross.

For those playing along at home, there's a new whitepaper out from Intel
about some new CPU features which are going to be fun:

> https://software.intel.com/content/dam/develop/public/us/en/documents/architecture-instruction-set-extensions-programming-reference.pdf

Which part were you worried about?  I thought it was fully disambuguated
from this:

> When XFD causes an instruction to generate #NM, the processor loads
> the IA32_XFD_ERR MSR to identify the disabled state component(s).
> Specifically, the MSR is loaded with the logical AND of the IA32_XFD
> MSR and the bitmap corresponding to the state components required by
> the faulting instruction.
> 
> Device-not-available exceptions that are not due to XFD — those 
> resulting from setting CR0.TS to 1 — do not modify the IA32_XFD_ERR
> MSR.

So if you always make sure to *clear* IA32_XFD_ERR after handing and XFD
exception, any #NM's with a clear IA32_XFD_ERR are from "legacy"
CR0.TS=1.  Any bits set in IA32_XFD_ERR mean a new-style XFD exception.

Am I missing something?
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH v6 12/12] x86/traps: Fix up invalid PASID

2020-08-03 Thread Fenghua Yu
Hi, Andy,

On Fri, Jul 31, 2020 at 06:28:37PM -0700, Andy Lutomirski wrote:
> On Mon, Jul 13, 2020 at 4:48 PM Fenghua Yu  wrote:
> >
> > A #GP fault is generated when ENQCMD instruction is executed without
> > a valid PASID value programmed in the current thread's PASID MSR. The
> > #GP fault handler will initialize the MSR if a PASID has been allocated
> > for this process.
> 
> Let's take a step back here.  Why are we trying to avoid IPIs?  If you
> call munmap(), you IPI other CPUs running tasks in the current mm.  If
> you do perf_event_open() and thus acquire RDPMC permission, you IPI
> other CPUs running tasks in the current mm.  If you call modify_ldt(),
> you IPI other CPUs running tasks in the current mm.  These events can
> all happen more than once per process.
> 
> Now we have ENQCMD.  An mm can be assigned a PASID *once* in the model
> that these patches support.  Why not just send an IPI using
> essentially identical code to the LDT sync or the CR4.PCE sync?

ldt (or the other two cases) is different from ENQCMD: the PASID MSR
is per-task and is supported by xsaves.

The per-task PASID MSR needs to updated to ALL tasks. That means IPI,
which only updates running tasks' MSRs, is not enough. All tasks' MSRs
need to be updated when a PASID is allocated.

This difference increases the complexity of sending IPI to running tasks
and updating sleeping tasks's MSRs with locking etc.

Of course, it's doable not to update the MSRs in all task when a new PASID
is allocated to the mm. But that means we need to discard xsaves support
for the MSR and create our own switch function to load the MSR. That
increases complexity.

We tried similar IPI way to update the PASID in about 200 lines of code.
As Dave Hansen pointed, it's too complex. The current lazy updating the MSR
only takes essential 3 lines of code in #GP.

Does it make sense to still use the current fix up method to update the MSR?

Thanks.

-Fenghua
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH v6 12/12] x86/traps: Fix up invalid PASID

2020-08-03 Thread Dave Hansen
On 7/31/20 4:34 PM, Andy Lutomirski wrote:
>> Thomas suggested to provide a reason for the #GP caused by executing ENQCMD
>> without a valid PASID value programmed. #GP error codes are 16 bits and all
>> 16 bits are taken. Refer to SDM Vol 3, Chapter 16.13 for details. The other
>> choice was to reflect the error code in an MSR. ENQCMD can also cause #GP
>> when loading from the source operand, so its not fully comprehending all
>> the reasons. Rather than special case the ENQCMD, in future Intel may
>> choose a different fault mechanism for such cases if recovery is needed on
>> #GP.
> Decoding the user instruction is ugly and sets a bad architecture
> precedent, but we already do it in #GP for UMIP.  So I'm unconvinced.

I'll try to do one more bit of convincing. :)

In the end, we need a way to figure out if the #GP was from a known "OK"
source that we can fix up.  You're right that we could fire up the
instruction decoder to help answer that question.  But, it (also)
doesn't easily yield a perfect answer as to the source of the #GP, it
always involves a user copy, and it's a larger code impact than what
we've got.

I think I went and looked at fixup_umip_exception(), and compared it to
the alternative which is essentially just these three lines of code:

> + /*
> +  * If the current task already has a valid PASID in the MSR,
> +  * the #GP must be for some other reason.
> +  */
> + if (current->has_valid_pasid)
> + return false;
...> +  /* Now the current task has a valid PASID in the MSR. */
> + current->has_valid_pasid = 1;

and *I* was convinced that instruction decoding wasn't worth it.

There's a lot of stuff that fixup_umip_exception() does which we don't
have to duplicate, but it's going to be really hard to get it anywhere
near as compact as what we've got.

I guess Fenghua could go give instruction decoding a shot and see how it
looks, though.
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH v6 12/12] x86/traps: Fix up invalid PASID

2020-08-03 Thread Andy Lutomirski
On Mon, Aug 3, 2020 at 9:37 AM Dave Hansen  wrote:
>
> On 8/3/20 8:12 AM, Andy Lutomirski wrote:
> > I could easily be convinced that the PASID fixup is so trivial and so
> > obviously free of misfiring in a way that causes an infinite loop that
> > this code is fine.  But I think we first need to answer the bigger
> > question of why we're doing a lazy fixup in the first place.
>
> There was an (internal to Intel) implementation of this about a year ago
> that used smp_call_function_many() to force the MSR state into all
> threads of a process.  I took one look at it, decided there was a 0%
> chance of it actually functioning and recommended we find another way.
> While I'm sure it could be done more efficiently, the implementation I
> looked at took ~200 lines of code and comments.  It started to look too
> much like another instance of mm->cpumask for my comfort.

If I were implementing this, I would try making switch_mm_irqs_off()
do, roughly:

void load_mm_pasid(...) {
  if (cpu_feature_enabled(X86_FEATURE_ENQCMD))
tsk->xstate[offset] = READ_ONCE(next->context.pasid);
}

This costs one cache miss, although the cache line in question is
about to be read anyway.  It might be faster to, instead, do:

void load_mm_pasid(...) {
  u32 pasid = READ_ONCE(next->context.pasid);

  if (tsk->xstate[offset] != pasid)
tsk->state[offset] = pasid;
}

so we don't dirty the cache line in the common case.  The actual
generated code ought to be pretty good -- surely the offset of PASID
in XSTATE is an entry in an array somewhere that can be found with a
single read, right?

The READ_ONCE is because this could race against a write to
context.pasid, so this code needs to be at the end of the function
where it's protected by mm_cpumask.  With all this done, the pasid
update is just on_each_cpu_mask(mm_cpumask(mm), load_mm_pasid, mm,
true).

This looks like maybe 20 lines of code.  As an added bonus, it lets us
free PASIDs early if we ever decide we want to.




May I take this opportunity to ask Intel to please put some real
thought into future pieces of CPU state?  Here's a summary of some
things we have:

- Normal extended state (FPU, XMM, etc): genuinely per thread and only
ever used explicitly.  Actually makes sense with XSAVE(C/S).

- PKRU: affects CPL0-originated memory accesses, so it needs to be
eagerly loaded in the kernel.  Does not make sense with XRSTOR(C/S),
but it's in there anyway.

- CR3: per-mm state.  Makes some sense in CR3, but it's tangled up
with CR4 in nasty ways.

- LDTR: per-mm on Linux and mostly obsolete everyone.  In it's own
register, so it's not a big deal.

- PASID: per-mm state (surely Intel always intended it to be per-mm,
since it's for shared _virtual memory_!).  But for some reason it's in
an MSR (which is slow), and it's cleverly, but not that cleverly,
accessible with XSAVES/XRSTORS.  Doesn't actually make sense.  Also,
PASID is lazy-loadable, but the mechanism for telling the kernel that
a lazy load is needed got flubbed.

- TILE: genuinely per-thread, but it's expensive so it's
lazy-loadable.  But the lazy-load mechanism reuses #NM, and it's not
fully disambiguated from the other use of #NM.  So it sort of works,
but it's gross.

- "KERNEL_GS_BASE", i.e. the shadow GS base.  This is logically
per-user-thread state, but it's only accessible in MSRs.  For some
reason this is *not* in XSAVES/XRSTORS state, nor is there any
efficient way to access it at all.

- Segment registers: can't be properly saved except by hypervisors,
and can almost, but not quite, be properly loaded (assuming the state
was sane to begin with) by normal kernels.  Just don't try to load 1,
2, or 3 into any of them.

Sometimes I think that this is all intended to be as confusing as
possible and that it's all a ploy to keep context switches slow and
complicated.  Maybe Intel doesn't actually want to compete with other
architectures that can context switch quickly?


It would be really nice if we had a clean way to load per-mm state
(see my private emails about this), a clean way to load CPL3 register
state, and a clean way to load per-user-thread *kernel* register state
(e.g. PKRU and probably PKRS).  And there should be an exception that
says "user code accessed a lazy-loaded resource that isn't loaded, and
this is the resource it tried to access".
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [Bug][Regression][Bisected] pp_table writes began to fail for Navi10 on amd-staging-drm-next

2020-08-03 Thread Matt Coffin
Hi Evan,

Thanks for the work. It would have taken me probably forever to find that.

I replied to your series with a Tested-by, and I tested locally. With
edad8312cbbf9a33c86873fc4093664f150dd5c1 reverted to fix the sysfs
interfaces, and your series, I once again have a 100% working OD system!

Thanks for the snappy work,
Matt

On 8/2/20 10:50 PM, Quan, Evan wrote:
> [AMD Official Use Only - Internal Distribution Only]
>
> Hi Matt,
>
> I just sent out two patches to fix the issue reported here. It worked
on my local env.
> Please confirm on your platform.
> https://lists.freedesktop.org/archives/amd-gfx/2020-August/052245.html
> https://lists.freedesktop.org/archives/amd-gfx/2020-August/052246.html
>
> BR,
> Evan
> -Original Message-
> From: Quan, Evan
> Sent: Friday, July 31, 2020 9:20 AM
> To: Matt Coffin ; amd-gfx@lists.freedesktop.org
> Cc: Alex Deucher 
> Subject: RE: [Bug][Regression][Bisected] pp_table writes began to fail
for Navi10 on amd-staging-drm-next
>
> Thanks for reporting this. I will check it.
>
> BR
> Evan
> -Original Message-
> From: Matt Coffin 
> Sent: Thursday, July 30, 2020 10:25 PM
> To: Quan, Evan ; amd-gfx@lists.freedesktop.org
> Cc: Alex Deucher 
> Subject: [Bug][Regression][Bisected] pp_table writes began to fail for
Navi10 on amd-staging-drm-next
>
> Hey Evan,
> 
> I've been having an issue with uploading `pp_table`s on recent 
> `amd-staging-drm-next` kernels. I bisected the issue, and it came back to a 
> commit of yours - ec8ee23f610578c71885a36ddfcf58d35cccab67.
> 
> I didn't have your gitlab handle to CC you on the issue, so I thought I'd at 
> least alert you to it.
> 
> Here's a link to the issue on GitLab:
> https://gitlab.freedesktop.org/drm/amd/-/issues/1243
> 
> I'd appreciate any help or insight you could offer here as I work on a fix.
> 
> First bad commit header:
> 
> commit ec8ee23f610578c71885a36ddfcf58d35cccab67 (refs/bisect/bad)
> Author: Evan Quan 
> Date:   Wed Jun 10 16:52:32 2020 +0800
> 
> drm/amd/powerplay: update Navi10 default dpm table setup
> 
> Cache all clocks levels for every dpm table. They are needed
> by other APIs.
> 
> Change-Id: I8114cf31e6ec8c9af4578d51749eb213befdcc71
> Signed-off-by: Evan Quan 
> Reviewed-by: Alex Deucher 
> 
> Thanks everyone, and cheers,
> Matt Coffin
>



signature.asc
Description: OpenPGP digital signature
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH 2/2] drm/amd/powerplay: put VCN/JPEG into PG ungate state before dpm table setup

2020-08-03 Thread Matt Coffin
Thanks Evan! I can confirm that this resolved the following GitLab
issue. Thanks for CC'ing me!

https://gitlab.freedesktop.org/drm/amd/-/issues/1243

Series is Tested-by: Matt Coffin 

On 8/2/20 10:46 PM, Evan Quan wrote:
> As VCN related dpm table setup needs VCN be in PG ungate state. Same logics
> applies to JPEG.
> 
> Change-Id: I94335efc4e0424cfe0991e984c938998fd8f1287
> Signed-off-by: Evan Quan 
> ---
>  drivers/gpu/drm/amd/powerplay/amdgpu_smu.c | 38 +-
>  1 file changed, 30 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c 
> b/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c
> index 85b04c48bd09..1349d05c5f3d 100644
> --- a/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c
> +++ b/drivers/gpu/drm/amd/powerplay/amdgpu_smu.c
> @@ -134,7 +134,8 @@ int smu_get_dpm_freq_range(struct smu_context *smu,
>  }
>  
>  static int smu_dpm_set_vcn_enable(struct smu_context *smu,
> -   bool enable)
> +   bool enable,
> +   int *previous_pg_state)
>  {
>   struct smu_power_context *smu_power = >smu_power;
>   struct smu_power_gate *power_gate = _power->power_gate;
> @@ -148,6 +149,9 @@ static int smu_dpm_set_vcn_enable(struct smu_context *smu,
>   if (atomic_read(_gate->vcn_gated) ^ enable)
>   goto out;
>  
> + if (previous_pg_state)
> + *previous_pg_state = atomic_read(_gate->vcn_gated);
> +
>   ret = smu->ppt_funcs->dpm_set_vcn_enable(smu, enable);
>   if (!ret)
>   atomic_set(_gate->vcn_gated, !enable);
> @@ -159,7 +163,8 @@ static int smu_dpm_set_vcn_enable(struct smu_context *smu,
>  }
>  
>  static int smu_dpm_set_jpeg_enable(struct smu_context *smu,
> -bool enable)
> +bool enable,
> +int *previous_pg_state)
>  {
>   struct smu_power_context *smu_power = >smu_power;
>   struct smu_power_gate *power_gate = _power->power_gate;
> @@ -173,6 +178,9 @@ static int smu_dpm_set_jpeg_enable(struct smu_context 
> *smu,
>   if (atomic_read(_gate->jpeg_gated) ^ enable)
>   goto out;
>  
> + if (previous_pg_state)
> + *previous_pg_state = atomic_read(_gate->jpeg_gated);
> +
>   ret = smu->ppt_funcs->dpm_set_jpeg_enable(smu, enable);
>   if (!ret)
>   atomic_set(_gate->jpeg_gated, !enable);
> @@ -212,7 +220,7 @@ int smu_dpm_set_power_gate(struct smu_context *smu, 
> uint32_t block_type,
>*/
>   case AMD_IP_BLOCK_TYPE_UVD:
>   case AMD_IP_BLOCK_TYPE_VCN:
> - ret = smu_dpm_set_vcn_enable(smu, !gate);
> + ret = smu_dpm_set_vcn_enable(smu, !gate, NULL);
>   if (ret)
>   dev_err(smu->adev->dev, "Failed to power %s VCN!\n",
>   gate ? "gate" : "ungate");
> @@ -230,7 +238,7 @@ int smu_dpm_set_power_gate(struct smu_context *smu, 
> uint32_t block_type,
>   gate ? "gate" : "ungate");
>   break;
>   case AMD_IP_BLOCK_TYPE_JPEG:
> - ret = smu_dpm_set_jpeg_enable(smu, !gate);
> + ret = smu_dpm_set_jpeg_enable(smu, !gate, NULL);
>   if (ret)
>   dev_err(smu->adev->dev, "Failed to power %s JPEG!\n",
>   gate ? "gate" : "ungate");
> @@ -407,6 +415,7 @@ static int smu_late_init(void *handle)
>  {
>   struct amdgpu_device *adev = (struct amdgpu_device *)handle;
>   struct smu_context *smu = >smu;
> + int vcn_gate, jpeg_gate;
>   int ret = 0;
>  
>   if (!smu->pm_enabled)
> @@ -418,6 +427,14 @@ static int smu_late_init(void *handle)
>   return ret;
>   }
>  
> + /*
> +  * 1. Power up VCN/JPEG as the succeeding smu_set_default_dpm_table()
> +  *needs VCN/JPEG up.
> +  * 2. Save original gate states and then we can restore back afterwards.
> +  */
> + smu_dpm_set_vcn_enable(smu, true, _gate);
> + smu_dpm_set_jpeg_enable(smu, true, _gate);
> +
>   /*
>* Set initialized values (get from vbios) to dpm tables context such as
>* gfxclk, memclk, dcefclk, and etc. And enable the DPM feature for each
> @@ -429,6 +446,11 @@ static int smu_late_init(void *handle)
>   return ret;
>   }
>  
> + /* Restore back to original VCN/JPEG power gate states */
> + smu_dpm_set_vcn_enable(smu, !vcn_gate, NULL);
> + smu_dpm_set_jpeg_enable(smu, !vcn_gate, NULL);
> +
> +
>   ret = smu_populate_umd_state_clk(smu);
>   if (ret) {
>   dev_err(adev->dev, "Failed to populate UMD state clocks!\n");
> @@ -991,8 +1013,8 @@ static int smu_hw_init(void *handle)
>  
>   if (smu->is_apu) {
>   smu_powergate_sdma(>smu, false);
> - smu_dpm_set_vcn_enable(smu, true);
> - smu_dpm_set_jpeg_enable(smu, true);
> + 

Re: [PATCH v6 12/12] x86/traps: Fix up invalid PASID

2020-08-03 Thread Andy Lutomirski
On Mon, Aug 3, 2020 at 8:03 AM Dave Hansen  wrote:
>
> On 7/31/20 4:34 PM, Andy Lutomirski wrote:
> >> Thomas suggested to provide a reason for the #GP caused by executing ENQCMD
> >> without a valid PASID value programmed. #GP error codes are 16 bits and all
> >> 16 bits are taken. Refer to SDM Vol 3, Chapter 16.13 for details. The other
> >> choice was to reflect the error code in an MSR. ENQCMD can also cause #GP
> >> when loading from the source operand, so its not fully comprehending all
> >> the reasons. Rather than special case the ENQCMD, in future Intel may
> >> choose a different fault mechanism for such cases if recovery is needed on
> >> #GP.
> > Decoding the user instruction is ugly and sets a bad architecture
> > precedent, but we already do it in #GP for UMIP.  So I'm unconvinced.
>
> I'll try to do one more bit of convincing. :)
>
> In the end, we need a way to figure out if the #GP was from a known "OK"
> source that we can fix up.  You're right that we could fire up the
> instruction decoder to help answer that question.  But, it (also)
> doesn't easily yield a perfect answer as to the source of the #GP, it
> always involves a user copy, and it's a larger code impact than what
> we've got.
>
> I think I went and looked at fixup_umip_exception(), and compared it to
> the alternative which is essentially just these three lines of code:
>
> > + /*
> > +  * If the current task already has a valid PASID in the MSR,
> > +  * the #GP must be for some other reason.
> > +  */
> > + if (current->has_valid_pasid)
> > + return false;
> ...> +  /* Now the current task has a valid PASID in the MSR. */
> > + current->has_valid_pasid = 1;
>
> and *I* was convinced that instruction decoding wasn't worth it.
>
> There's a lot of stuff that fixup_umip_exception() does which we don't
> have to duplicate, but it's going to be really hard to get it anywhere
> near as compact as what we've got.
>

I could easily be convinced that the PASID fixup is so trivial and so
obviously free of misfiring in a way that causes an infinite loop that
this code is fine.  But I think we first need to answer the bigger
question of why we're doing a lazy fixup in the first place.

--Andy
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH] drm/amd/display: Indent an if statement

2020-08-03 Thread Dan Carpenter
The if statement wasn't indented so it's confusing.

Signed-off-by: Dan Carpenter 
---
 drivers/gpu/drm/amd/display/dc/core/dc_resource.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/display/dc/core/dc_resource.c 
b/drivers/gpu/drm/amd/display/dc/core/dc_resource.c
index ca26714c800e..c6b737dd8425 100644
--- a/drivers/gpu/drm/amd/display/dc/core/dc_resource.c
+++ b/drivers/gpu/drm/amd/display/dc/core/dc_resource.c
@@ -71,7 +71,7 @@ enum dce_version resource_parse_asic_id(struct hw_asic_id 
asic_id)
if (ASIC_REV_IS_TAHITI_P(asic_id.hw_internal_rev) ||
ASIC_REV_IS_PITCAIRN_PM(asic_id.hw_internal_rev) ||
ASIC_REV_IS_CAPEVERDE_M(asic_id.hw_internal_rev))
-   dc_version = DCE_VERSION_6_0;
+   dc_version = DCE_VERSION_6_0;
else if (ASIC_REV_IS_OLAND_M(asic_id.hw_internal_rev))
dc_version = DCE_VERSION_6_4;
else
-- 
2.27.0

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH v6 12/12] x86/traps: Fix up invalid PASID

2020-08-03 Thread Fenghua Yu
Hi, Andy,

On Fri, Jul 31, 2020 at 04:34:11PM -0700, Andy Lutomirski wrote:
> On Mon, Jul 13, 2020 at 4:48 PM Fenghua Yu  wrote:
> >
> > A #GP fault is generated when ENQCMD instruction is executed without
> > a valid PASID value programmed in the current thread's PASID MSR. The
> > #GP fault handler will initialize the MSR if a PASID has been allocated
> > for this process.
> >
> > Decoding the user instruction is ugly and sets a bad architecture
> > precedent. It may not function if the faulting instruction is modified
> > after #GP.
> >
> > Thomas suggested to provide a reason for the #GP caused by executing ENQCMD
> > without a valid PASID value programmed. #GP error codes are 16 bits and all
> > 16 bits are taken. Refer to SDM Vol 3, Chapter 16.13 for details. The other
> > choice was to reflect the error code in an MSR. ENQCMD can also cause #GP
> > when loading from the source operand, so its not fully comprehending all
> > the reasons. Rather than special case the ENQCMD, in future Intel may
> > choose a different fault mechanism for such cases if recovery is needed on
> > #GP.
> 
> Decoding the user instruction is ugly and sets a bad architecture
> precedent, but we already do it in #GP for UMIP.  So I'm unconvinced.

Maybe just remove the "Decoding the user instruction ... bad architecture
precedent" sentence? The sentence is vague.

As described in the following "It may not function ..." sentence, the real
issue of parsing the instruction is the instruction may be modified by
another processor before it's parsed in the #GP handler.

If just keep the "It may not function ..." sentence, is that good enough to
explain why we don't parse the faulting instruction?

> 
> Memo to Intel, though: you REALLY need to start thinking about what
> the heck an OS is supposed to do with all these new faults you're
> coming up with.  The new #NM for TILE is utterly nonsensical.  Sure,
> it works for an OS that does not use CR0.TS and as long as no one
> tries to extend the same mechanism for some new optional piece of
> state, but as soon as Intel tries to use the same mechanism for
> anything else, it falls apart.
> 
> Please do better.

Internally we did discuss the error code in #GP for PASID with HW architects.
But due to some uarch reason, it's not simple to report the error code for
PASID:( Please see previous discussion on the error code for PASID:
https://lore.kernel.org/lkml/20200427224646.GA103955@otc-nc-03/

It's painful for our SW guys to check exception reasons if hardware
doesn't explicitly tell us.

Hopefully the heuristics (fixup the PASID MSR if the process already has
a valid PASID but the MSR doesn't have one yet) in this patch is acceptable.
 
> 
> > +
> > +/*
> > + * Write the current task's PASID MSR/state. This is called only when PASID
> > + * is enabled.
> > + */
> > +static void fpu__pasid_write(u32 pasid)
> > +{
> > +   u64 msr_val = pasid | MSR_IA32_PASID_VALID;
> > +
> > +   fpregs_lock();
> > +
> > +   /*
> > +* If the MSR is active and owned by the current task's FPU, it can
> > +* be directly written.
> > +*
> > +* Otherwise, write the fpstate.
> > +*/
> > +   if (!test_thread_flag(TIF_NEED_FPU_LOAD)) {
> > +   wrmsrl(MSR_IA32_PASID, msr_val);
> > +   } else {
> > +   struct ia32_pasid_state *ppasid_state;
> > +
> > +   ppasid_state = 
> > get_xsave_addr(>thread.fpu.state.xsave,
> > + XFEATURE_PASID);
> > +   /*
> > +* ppasid_state shouldn't be NULL because XFEATURE_PASID
> > +* is enabled.
> > +*/
> > +   WARN_ON_ONCE(!ppasid_state);
> > +   ppasid_state->pasid = msr_val;
> 
> WARN instead of BUG is nice, but you'll immediate oops if this fails.
> How about:
> 
> if (!WARN_ON_ONCE(!ppasid_state))
>   ppasid_state->pasid = msr_val;

OK. I will fix this issue.

Thank you very much for your review!

-Fenghua
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH] drm/amd/display: Fix wrong return value in dm_update_plane_state()

2020-08-03 Thread Tianjia Zhang
On an error exit path, a negative error code should be returned
instead of a positive return value.

Fixes: 9e869063b0021 ("drm/amd/display: Move iteration out of dm_update_planes")
Cc: Leo Li 
Signed-off-by: Tianjia Zhang 
---
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 710edc70e37e..5810416e2ddc 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -8182,8 +8182,7 @@ static int dm_update_plane_state(struct dc *dc,
dm_old_plane_state->dc_state,
dm_state->context)) {
 
-   ret = EINVAL;
-   return ret;
+   return -EINVAL;
}
 
 
-- 
2.26.2

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [Linux-kernel-mentees] [PATCH] drm/amdgpu: Prevent kernel-infoleak in amdgpu_info_ioctl()

2020-08-03 Thread Arnd Bergmann
On Thu, Jul 30, 2020 at 11:09 PM Luben Tuikov  wrote:
> On 2020-07-29 9:49 a.m., Alex Deucher wrote:
> > On Wed, Jul 29, 2020 at 4:11 AM Christian König
> >  wrote:
> >>
> >> Am 28.07.20 um 21:29 schrieb Peilin Ye:
> >>> Compiler leaves a 4-byte hole near the end of `dev_info`, causing
> >>> amdgpu_info_ioctl() to copy uninitialized kernel stack memory to userspace
> >>> when `size` is greater than 356.
> >>>
> >>> In 2015 we tried to fix this issue by doing `= {};` on `dev_info`, which
> >>> unfortunately does not initialize that 4-byte hole. Fix it by using
> >>> memset() instead.
> >>>
> >>> Cc: sta...@vger.kernel.org
> >>> Fixes: c193fa91b918 ("drm/amdgpu: information leak in 
> >>> amdgpu_info_ioctl()")
> >>> Fixes: d38ceaf99ed0 ("drm/amdgpu: add core driver (v4)")
> >>> Suggested-by: Dan Carpenter 
> >>> Signed-off-by: Peilin Ye 
> >>
> >> Reviewed-by: Christian König 
> >>
> >> I can't count how many of those we have fixed over the years.
> >>
> >> At some point we should probably document that using "= {}" or "= { 0 }"
> >> in the kernel is a really bad idea and should be avoided.
> >
> > Moreover, it seems like different compilers seem to behave relatively
> > differently with these and we often get reports of warnings with these
> > on clang.  When in doubt, memset.
>
> There are quite a few of those under drivers/gpu/drm, for "amd/", "scheduler/"
> drm*.c files,
>
> $find . \( -regex "./drm.*\.c" -or -regex "./amd/.*\.c" -or -regex 
> "./scheduler/.*\.c" \) -exec egrep -n -- " *= *{ *(|NULL|0) *}" \{\} \+ | wc 
> -l
> 374
> $_
>
> Out of which only 16 are of the non-ISO C variety, "= {}",
>
> $find . \( -regex "./drm.*\.c" -or -regex "./amd/.*\.c" -or -regex 
> "./scheduler/.*\.c" \) -exec egrep -n -- " *= *{ *}" \{\} \+ | wc -l
> 16
> $_

That is an unrelated issue, those were introduced to deal with older compilers
that do not accept '{0}' as an initializer for an aggregate whose
first member is
another aggregate. Generally speaking, '= { }' is better to use in the
kernel than
'= { 0 }' because all supported compilers interpret that the same way for all
structures.

 Arnd
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] drm/amdgpu: Fix regression in adjusting power table/profile

2020-08-03 Thread Paweł Gronowski
Hello again,

I just finished a bisect of amd-staging-drm-next and it looks like
the hang is first introduced in edad8312cbbf9a33c86873fc4093664f150dd5c1
("drm/amdgpu: fix system hang issue during GPU reset").

It is a bit tricky, because it is commited on top of my first faulty patch
7173949df45482 ("drm/amdgpu: Fix NULL dereference in dpm sysfs handlers") so
it needs to be reverted fix the premature -INVAL.

Test case:
  sudo sh -c 'echo "s 0 305 750" > 
/sys/class/drm/card0/device/pp_od_clk_voltage'
Results:
  edad8312cbbf9a3 + revert 7173949df45482 = hang
  edad8312cbbf9a3~1 + revert 7173949df45482 = no hang

Could you confirm that you get the same results?

Thanks,
Paweł Gronowski


On Fri, Jul 31, 2020 at 03:34:40PM +0200, Paweł Gronowski wrote:
> Hey Matt,
> 
> I have just tested the amd-staging-drm-next branch 
> (dd654c76d6e854afad716ded899e4404734aaa10) with my patches reverted
> and I can reproduce your issue with:
> 
>   sudo sh -c 'echo "s 0 305 750" > 
> /sys/class/drm/card0/device/pp_od_clk_voltage'
> 
> Which makes the sh hang with 100% usage.
> 
> The issue does not happen on the mainline 
> (d8b9faec54ae4bc2fff68bcd0befa93ace8256ce)
> both without and with my patches reapplied.
> So the problem must be related to some commit that is present in the
> amd-staging-drm-next but not in the mainline.
> 
> 
> Paweł Gronowski
> 
> On Thu, Jul 30, 2020 at 06:34:14PM -0600, Matt Coffin wrote:
> > Hey Pawel,
> > 
> > I did confirm that this patch *introduced* the issue both with the
> > bisect, and by testing reverting it.
> > 
> > Now, there's a lot of fragile pieces in the dpm handling, so it could be
> > this patch's interaction with something else that's causing it and it
> > may well not be the fault of this code, but this is the patch that
> > introduced the issue.
> > 
> > I'll have some more time tomorrow to try to get down to root cause here,
> > so maybe I'll have more to offer then.
> > 
> > Thanks for taking a look,
> > Matt
> > 
> > On 7/30/20 6:31 PM, Paweł Gronowski wrote:
> > > Hello Matt,
> > > 
> > > Thank you for your testing. It seems that my gpu (RX 570) does not 
> > > support the
> > > vc setting so I can not exactly reproduce the issue. However I did trace 
> > > the
> > > code path the test case takes and it seems to correctly pass through the 
> > > while
> > > loop that parses the input and fails only in 
> > > amdgpu_dpm_odn_edit_dpm_table.
> > > The 'parameter' array is populated the same way as the original code did. 
> > > Since
> > > the amdgpu_dpm_odn_edit_dpm_table is reached, I think that your problem is
> > > unfortunately caused by something else.
> > > 
> > > 
> > > Paweł Gronowski
> > > 
> > > On Thu, Jul 30, 2020 at 08:49:41AM -0600, Matt Coffin wrote:
> > >> Hello all, I just did some testing with this applied, and while it no
> > >> longer returns -EINVAL, running `sudo sh -c 'echo "vc 2 2150 1195" >
> > >> /sys/class/drm/card1/device/pp_od_clk_voltage'` results in `sh` spiking
> > >> to, and staying at 100% CPU usage, with no indicating information in
> > >> `dmesg` from the kernel.
> > >>
> > >> It appeared to work at least ONCE, but potentially not after.
> > >>
> > >> This is not unique to Navi, and caused the problem on a POLARIS10 card
> > >> as well.
> > >>
> > >> Sorry for the bad news, and thanks for any insight you may have,
> > >> Matt Coffin
> > >>
> > >> On 7/29/20 8:53 PM, Alex Deucher wrote:
> > >>> On Wed, Jul 29, 2020 at 10:20 PM Paweł Gronowski  
> > >>> wrote:
> > 
> >  Regression was introduced in commit 38e0c89a19fd
> >  ("drm/amdgpu: Fix NULL dereference in dpm sysfs handlers") which
> >  made the set_pp_od_clk_voltage and set_pp_power_profile_mode return
> >  -EINVAL for previously valid input. This was caused by an empty
> >  string (starting at the \0 character) being passed to the kstrtol.
> > 
> >  Signed-off-by: Paweł Gronowski 
> > >>>
> > >>> Applied.  Thanks!
> > >>>
> > >>> Alex
> > >>>
> >  ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c | 9 +++--
> >   1 file changed, 7 insertions(+), 2 deletions(-)
> > 
> >  diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c 
> >  b/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c
> >  index ebb8a28ff002..cbf623ff03bd 100644
> >  --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c
> >  +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c
> >  @@ -778,12 +778,14 @@ static ssize_t 
> >  amdgpu_set_pp_od_clk_voltage(struct device *dev,
> >  tmp_str++;
> >  while (isspace(*++tmp_str));
> > 
> >  -   while ((sub_str = strsep(_str, delimiter)) != NULL) {
> >  +   while ((sub_str = strsep(_str, delimiter)) && *sub_str) {
> >  ret = kstrtol(sub_str, 0, [parameter_size]);
> >  if (ret)
> >  return -EINVAL;
> >  parameter_size++;
> > 
> >  +   if (!tmp_str)
> >  +

Re: [PATCH] drm/amdgpu: Fix regression in adjusting power table/profile

2020-08-03 Thread Paweł Gronowski
Hey Matt,

I have just tested the amd-staging-drm-next branch 
(dd654c76d6e854afad716ded899e4404734aaa10) with my patches reverted
and I can reproduce your issue with:

  sudo sh -c 'echo "s 0 305 750" > 
/sys/class/drm/card0/device/pp_od_clk_voltage'

Which makes the sh hang with 100% usage.

The issue does not happen on the mainline 
(d8b9faec54ae4bc2fff68bcd0befa93ace8256ce)
both without and with my patches reapplied.
So the problem must be related to some commit that is present in the
amd-staging-drm-next but not in the mainline.


Paweł Gronowski

On Thu, Jul 30, 2020 at 06:34:14PM -0600, Matt Coffin wrote:
> Hey Pawel,
> 
> I did confirm that this patch *introduced* the issue both with the
> bisect, and by testing reverting it.
> 
> Now, there's a lot of fragile pieces in the dpm handling, so it could be
> this patch's interaction with something else that's causing it and it
> may well not be the fault of this code, but this is the patch that
> introduced the issue.
> 
> I'll have some more time tomorrow to try to get down to root cause here,
> so maybe I'll have more to offer then.
> 
> Thanks for taking a look,
> Matt
> 
> On 7/30/20 6:31 PM, Paweł Gronowski wrote:
> > Hello Matt,
> > 
> > Thank you for your testing. It seems that my gpu (RX 570) does not support 
> > the
> > vc setting so I can not exactly reproduce the issue. However I did trace the
> > code path the test case takes and it seems to correctly pass through the 
> > while
> > loop that parses the input and fails only in amdgpu_dpm_odn_edit_dpm_table.
> > The 'parameter' array is populated the same way as the original code did. 
> > Since
> > the amdgpu_dpm_odn_edit_dpm_table is reached, I think that your problem is
> > unfortunately caused by something else.
> > 
> > 
> > Paweł Gronowski
> > 
> > On Thu, Jul 30, 2020 at 08:49:41AM -0600, Matt Coffin wrote:
> >> Hello all, I just did some testing with this applied, and while it no
> >> longer returns -EINVAL, running `sudo sh -c 'echo "vc 2 2150 1195" >
> >> /sys/class/drm/card1/device/pp_od_clk_voltage'` results in `sh` spiking
> >> to, and staying at 100% CPU usage, with no indicating information in
> >> `dmesg` from the kernel.
> >>
> >> It appeared to work at least ONCE, but potentially not after.
> >>
> >> This is not unique to Navi, and caused the problem on a POLARIS10 card
> >> as well.
> >>
> >> Sorry for the bad news, and thanks for any insight you may have,
> >> Matt Coffin
> >>
> >> On 7/29/20 8:53 PM, Alex Deucher wrote:
> >>> On Wed, Jul 29, 2020 at 10:20 PM Paweł Gronowski  wrote:
> 
>  Regression was introduced in commit 38e0c89a19fd
>  ("drm/amdgpu: Fix NULL dereference in dpm sysfs handlers") which
>  made the set_pp_od_clk_voltage and set_pp_power_profile_mode return
>  -EINVAL for previously valid input. This was caused by an empty
>  string (starting at the \0 character) being passed to the kstrtol.
> 
>  Signed-off-by: Paweł Gronowski 
> >>>
> >>> Applied.  Thanks!
> >>>
> >>> Alex
> >>>
>  ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c | 9 +++--
>   1 file changed, 7 insertions(+), 2 deletions(-)
> 
>  diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c 
>  b/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c
>  index ebb8a28ff002..cbf623ff03bd 100644
>  --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c
>  +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c
>  @@ -778,12 +778,14 @@ static ssize_t amdgpu_set_pp_od_clk_voltage(struct 
>  device *dev,
>  tmp_str++;
>  while (isspace(*++tmp_str));
> 
>  -   while ((sub_str = strsep(_str, delimiter)) != NULL) {
>  +   while ((sub_str = strsep(_str, delimiter)) && *sub_str) {
>  ret = kstrtol(sub_str, 0, [parameter_size]);
>  if (ret)
>  return -EINVAL;
>  parameter_size++;
> 
>  +   if (!tmp_str)
>  +   break;
>  while (isspace(*tmp_str))
>  tmp_str++;
>  }
>  @@ -1635,11 +1637,14 @@ static ssize_t 
>  amdgpu_set_pp_power_profile_mode(struct device *dev,
>  i++;
>  memcpy(buf_cpy, buf, count-i);
>  tmp_str = buf_cpy;
>  -   while ((sub_str = strsep(_str, delimiter)) != NULL) {
>  +   while ((sub_str = strsep(_str, delimiter)) && 
>  *sub_str) {
>  ret = kstrtol(sub_str, 0, 
>  [parameter_size]);
>  if (ret)
>  return -EINVAL;
>  parameter_size++;
>  +
>  +   if (!tmp_str)
>  +   break;
>  while (isspace(*tmp_str))
>  tmp_str++;
>  }
>  --
> 

RE: [PATCH] drm/amdgpu: added RAS EEPROM device support check

2020-08-03 Thread Chen, Guchun
[AMD Public Use]

Please put the check __is_ras_eeprom_supported behind the line 
'*exceed_err_limit = false;' in function amdgpu_ras_eeprom_init and 
amdgpu_ras_eeprom_check_err_threshold respectively. That promises, even when 
eeprom is not available on several ASICs, exceed_err_limit must set to be false.

With above fixed, the patch is:
Reviewed-by: Guchun Chen guchun.c...@amd.com

Regards,
Guchun

From: Clements, John 
Sent: Monday, August 3, 2020 2:48 PM
To: amd-gfx list ; Chen, Guchun 
; Zhang, Hawking 
Subject: [PATCH] drm/amdgpu: added RAS EEPROM device support check


[AMD Public Use]

Submitting patch to with added device support check before trying to access RAS 
EEPROM.

Thank you,
John Clements
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH] drm/amdgpu: added RAS EEPROM device support check

2020-08-03 Thread Clements, John
[AMD Public Use]

Submitting patch to with added device support check before trying to access RAS 
EEPROM.

Thank you,
John Clements


0001-drm-amdgpu-added-RAS-EEPROM-device-support-check.patch
Description: 0001-drm-amdgpu-added-RAS-EEPROM-device-support-check.patch
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [PATCH] drm/amdgpu: enable RAS support for sienna

2020-08-03 Thread Clements, John
[AMD Public Use]

Thank you Guchun!

From: Chen, Guchun 
Sent: Monday, August 3, 2020 2:15 PM
To: Clements, John ; amd-gfx list 
; Zhang, Hawking 
Subject: RE: [PATCH] drm/amdgpu: enable RAS support for sienna


[AMD Public Use]

In patch subject, chip full name SIENNA_CICHLID is preferred.

With above addressed, the patch is:
Reviewed-by: Guchun Chen mailto:guchun.c...@amd.com>>

Regards,
Guchun

From: Clements, John mailto:john.cleme...@amd.com>>
Sent: Monday, August 3, 2020 2:01 PM
To: amd-gfx list 
mailto:amd-gfx@lists.freedesktop.org>>; Zhang, 
Hawking mailto:hawking.zh...@amd.com>>; Chen, Guchun 
mailto:guchun.c...@amd.com>>
Subject: [PATCH] drm/amdgpu: enable RAS support for sienna


[AMD Public Use]

Submitting patch to turn on GECC error injection /query support for sienna 
cichilid.

Thank you,
John Clements
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [PATCH] drm/amdgpu: enable RAS support for sienna

2020-08-03 Thread Chen, Guchun
[AMD Public Use]

In patch subject, chip full name SIENNA_CICHLID is preferred.

With above addressed, the patch is:
Reviewed-by: Guchun Chen 

Regards,
Guchun

From: Clements, John 
Sent: Monday, August 3, 2020 2:01 PM
To: amd-gfx list ; Zhang, Hawking 
; Chen, Guchun 
Subject: [PATCH] drm/amdgpu: enable RAS support for sienna


[AMD Public Use]

Submitting patch to turn on GECC error injection /query support for sienna 
cichilid.

Thank you,
John Clements
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH] drm/amdgpu: enable RAS support for sienna

2020-08-03 Thread Clements, John
[AMD Public Use]

Submitting patch to turn on GECC error injection /query support for sienna 
cichilid.

Thank you,
John Clements


0001-drm-amdgpu-enable-RAS-support-for-sienna.patch
Description: 0001-drm-amdgpu-enable-RAS-support-for-sienna.patch
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx