Re: [PATCH] drm/i915/guc: Correct capture of EIR register on hang

2024-02-27 Thread Teres Alexis, Alan Previn
On Fri, 2024-02-23 at 12:32 -0800, john.c.harri...@intel.com wrote:
> From: John Harrison 
alan:snip

> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
> @@ -51,6 +51,7 @@
> { RING_ESR(0),  0,  0, "ESR" }, \
> { RING_DMA_FADD(0), 0,  0, "RING_DMA_FADD_LDW" }, \
> { RING_DMA_FADD_UDW(0), 0,  0, "RING_DMA_FADD_UDW" }, \
> +   { RING_EIR(0),  0,  0, "EIR" }, \
> { RING_IPEIR(0),    0,  0, "IPEIR" }, \
> { RING_IPEHR(0),    0,  0, "IPEHR" }, \
> { RING_INSTPS(0),   0,  0, "INSTPS" }, \
> @@ -80,9 +81,6 @@
> { GEN8_RING_PDP_LDW(0, 3),  0,  0, "PDP3_LDW" }, \
> { GEN8_RING_PDP_UDW(0, 3),  0,  0, "PDP3_UDW" }
>  
> -#define COMMON_BASE_HAS_EU \
> -   { EIR,  0,  0, "EIR" }
> -
alan:snip

alan: Thanks for catching this one.
Reviewed-by: Alan Previn 


Re: ✗ Fi.CI.IGT: failure for Resolve suspend-resume racing with GuC destroy-context-worker (rev13)

2024-01-04 Thread Teres Alexis, Alan Previn
On Thu, 2024-01-04 at 10:57 +, Patchwork wrote:
> Patch Details
> Series: Resolve suspend-resume racing with GuC destroy-context-worker (rev13)
> URL:https://patchwork.freedesktop.org/series/121916/
> State:  failure
> Details:
> https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v13/index.html
> CI Bug Log - changes from CI_DRM_14076_full -> Patchwork_121916v13_full
> Summary
> 
> FAILURE
alan:snip


> Here are the unknown changes that may have been introduced in 
> Patchwork_121916v13_full:
> 
> IGT changes
> Possible regressions
> 
>   *   igt@gem_eio@wait-wedge-immediate:
>  *   shard-mtlp: PASS -> ABORT
> 
alan: from the code and dmesg, this is unrelated to guc context destruction
flows.
It's reading an MCR register that times out. Additionally, I believe this error
is occurring during post-reset-init flows.
So it's definitely not doing any context destruction at this point (as a reset
would have happened sooner).
> Known issues
> 



Re: [PATCH v9 0/2] Resolve suspend-resume racing with GuC destroy-context-worker

2024-01-02 Thread Teres Alexis, Alan Previn
On Wed, 2023-12-27 at 20:55 -0800, Teres Alexis, Alan Previn wrote:
> This series is the result of debugging issues root caused to
> races between the GuC's destroyed_worker_func being triggered
> vs repeating suspend-resume cycles with concurrent delayed
> fence signals for engine-freeing.
> 
alan:snip.

alan: I did not receive the CI-premerge email where the following was reported:
  IGT changes
  Possible regressions
  igt@i915_selftest@live@gt_pm:
  shard-rkl: PASS -> DMESG-FAIL
After going through the error in dmesg and the code, I am confident this failure
is not related to the series. This selftest calls rdmsrl functions (which don't do
any requests / GuC submissions) but gets a reply power of zero (the reported
bug).
So this is unrelated.


Hi Rodrigo, just an FYI note that after the last requested rebase, BAT has
passed twice in a row, so I am confident the failures on rev7 and prior were
unrelated and that this series is ready for merging. Thanks again
for all your help and patience - this was a long one :)


Re: [PATCH v8 2/2] drm/i915/guc: Close deregister-context race against CT-loss

2023-12-27 Thread Teres Alexis, Alan Previn
On Tue, 2023-12-26 at 10:11 -0500, Vivi, Rodrigo wrote:
> On Wed, Dec 20, 2023 at 11:08:59PM +0000, Teres Alexis, Alan Previn wrote:
> > On Wed, 2023-12-13 at 16:23 -0500, Vivi, Rodrigo wrote:
alan:snip

> > 
> > 
> > alan: Thanks Rodrigo for the RB last week, just quick update:
> > 
> > I can't reproduce the BAT failures, which seem to be intermittent
> > across platform and test - however, a noticeable number of failures
> > do keep occurring on i915_selftest @live @requests where the
> > last test leaked a wakeref and the failing test hangs waiting
> > for the gt to idle before starting its test.
> > 
> > I have to debug this further, although from code inspection it
> > is unrelated to the patches in this series.
> > Hopefully it's a different issue.
> 
> Yeap, likely not related. Anyway, I'm sorry for not merging
> this sooner. Could you please send a rebased version? This
> on is not applying cleanly anymore.

Hi Rodrigo, I will rebase it as soon as I do a bit more testing.
I realize I was using a slightly older GuC; with the newer GuC I am
seeing all kinds of failures, but they are trending as not an issue with the series.



Re: [PATCH v8 2/2] drm/i915/guc: Close deregister-context race against CT-loss

2023-12-20 Thread Teres Alexis, Alan Previn
On Wed, 2023-12-13 at 16:23 -0500, Vivi, Rodrigo wrote:
> On Tue, Dec 12, 2023 at 08:57:16AM -0800, Alan Previn wrote:
> > If we are at the end of suspend or very early in resume
> > its possible an async fence signal (via rcu_call) is triggered
> > to free_engines which could lead us to the execution of
> > the context destruction worker (after a prior worker flush).
alan:snip
> 
> > Thus, do an unroll in guc_lrc_desc_unpin and
> > deregister_destroyed_contexts if guc_lrc_desc_unpin fails due to
> > CT send failure. When unrolling, keep the context in the GuC's
> > destroy-list so it can get picked up on the next destroy worker
> > invocation (if suspend aborted) or get fully purged as part of a
> > GuC sanitization (end of suspend) or a reset flow.
> > 
> > Signed-off-by: Alan Previn 
> > Signed-off-by: Anshuman Gupta 
> > Tested-by: Mousumi Jana 
> > Acked-by: Daniele Ceraolo Spurio 
> 
> Thanks for all the explanations, patience and great work!
> 
> Reviewed-by: Rodrigo Vivi 

alan: Thanks Rodrigo for the RB last week, just quick update:

I can't reproduce the BAT failures, which seem to be intermittent
across platform and test - however, a noticeable number of failures
do keep occurring on i915_selftest @live @requests where the
last test leaked a wakeref and the failing test hangs waiting
for the gt to idle before starting its test.

I have to debug this further, although from code inspection it
is unrelated to the patches in this series.
Hopefully it's a different issue.


Re: [Intel-gfx] ✗ Fi.CI.BAT: failure for Resolve suspend-resume racing with GuC destroy-context-worker (rev8)

2023-11-30 Thread Teres Alexis, Alan Previn
On Fri, 2023-12-01 at 02:20 +, Patchwork wrote:
> Patch Details
> Series: Resolve suspend-resume racing with GuC destroy-context-worker (rev8)
> URL:https://patchwork.freedesktop.org/series/121916/
> State:  failure
> Details:
> https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v8/index.html

alan:snip
> Here are the unknown changes that may have been introduced in 
> Patchwork_121916v8:
> 
> IGT changes
> Possible regressions
> 
>   *   igt@i915_selftest@live@requests:
>  *   bat-mtlp-6: PASS -> DMESG-FAIL
alan: inspecting the selftest code that is failing WRT the changes in this series,
I really don't see any relation.
The series changes how we handle context deregistration, either through the
regular context-close or through the flushing during suspend... however, the
selftest reporting a reset would be an issue where the context is not closed
(because it would be hanging) - and it's also a kernel context, which should
not get closed.
I will run another retest after I verify trybot sees no issue.


Re: [Intel-gfx] [PATCH v7 2/2] drm/i915/guc: Close deregister-context race against CT-loss

2023-11-30 Thread Teres Alexis, Alan Previn
> As far as i can tell, its only if we started resetting / wedging right after 
> this
> queued worker got started.

alan: hope Daniele can proofread my tracing and confirm I got it right.


Re: [Intel-gfx] [PATCH v7 2/2] drm/i915/guc: Close deregister-context race against CT-loss

2023-11-30 Thread Teres Alexis, Alan Previn
On Thu, 2023-11-30 at 16:18 -0500, Vivi, Rodrigo wrote:
> On Wed, Nov 29, 2023 at 04:20:13PM -0800, Alan Previn wrote:
alan:snip
> > +
> > if (unlikely(disabled)) {
> > release_guc_id(guc, ce);
> > __guc_context_destroy(ce);
> > -   return;
> > +   return 0;
> 
> is success the right return case here?
alan: yes: we may discover "disabled == true" if submission_disabled
found that the gt is wedged. I don't believe such a case will happen as
part of flushing destroyed_worker_func during suspend, but it may occur
as part of regular runtime context closing that just happens to
get queued in the middle of a gt-reset/wedging process. In such a case,
the reset-prepare code will sanitize everything, including cleaning up
the pending-destruction-contexts link-list. So either we pick it up
from here and dump it... or reset picks it up first and dumps it there
(where both dumpings only occur if GuC got disabled first).


Supplemental: How regular context cleanup leads to the same path -->

i915_sw_fence_notify -> engines_notify -> free_engines_rcu ->
intel_context_put -> kref_put(&ce->ref, ...) -> ce->ops->destroy ->

(where ce->ops = engine->cops and engine->cops = guc_context_ops)
*and, guc_context_ops->destroy == guc_context_destroy so ->

ce->ops->destroy -> guc_context_destroy ->
queue_work(..., &guc->submission_state.destroyed_worker);
-> [eventually] -> the same guc_lrc_desc_unpin above

However, with the additional "if (!intel_guc_is_ready(guc))" in destroyed_worker_func
as part of this patch, hitting this "disabled==true" case will be even less
likely.
As far as I can tell, it's only if we started resetting / wedging right after this
queued worker got started. (I ran out of time to check if reset can ever occur
from within the same system_unbound_wq, but then I recall we also have CI using
debugfs to force wedging for select (/all?) igt tests, so I suspect it can still
happen in parallel.) NOTE: the original checking of "is disabled" is not new
code - it's the original code.
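To summarize the paths traced above, here is a tiny stand-alone C sketch (names are hypothetical, not the i915 code) of the three possible outcomes of one guc_lrc_desc_unpin() attempt: free locally when submission is already disabled, send the deregister H2G, or unroll on CT-send failure:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model only -- not the i915 implementation.
 * (a) disabled == true: reset/wedging won the race, free locally;
 * (b) H2G sent: GuC will confirm the deregister later;
 * (c) H2G failed: undo the state change, context stays on the
 *     destroy-list for a later worker run or GuC sanitization.
 */
enum unpin_outcome {
	UNPIN_FREED_LOCALLY,
	UNPIN_DEREGISTERED,
	UNPIN_UNROLLED,
};

static enum unpin_outcome model_guc_lrc_desc_unpin(bool submission_disabled,
						   bool h2g_send_ok)
{
	if (submission_disabled)
		return UNPIN_FREED_LOCALLY; /* release guc_id + destroy here */
	if (!h2g_send_ok)
		return UNPIN_UNROLLED;      /* keep listed, bail for now */
	return UNPIN_DEREGISTERED;
}
```

All three outcomes leave the context accounted for somewhere, which is why returning 0 in the disabled case is safe.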

...alan

P.S. - oh man, that took a lot of code tracing, as I can't remember these paths
by heart.



Re: [Intel-gfx] [PATCH v2 1/1] drm/i915/pxp: Add missing tag for Wa_14019159160

2023-11-29 Thread Teres Alexis, Alan Previn
On Wed, 2023-11-29 at 13:13 -0800, Teres Alexis, Alan Previn wrote:
> On Mon, 2023-11-27 at 15:24 -0500, Vivi, Rodrigo wrote:
> > On Wed, Nov 22, 2023 at 12:30:03PM -0800, Alan Previn wrote:
> alan:snip
> alan: thanks for reviewing and apologize for replying to this late.
> 
> > >   /*
> > > -  * On MTL and newer platforms, protected contexts require setting
> > > -  * the LRC run-alone bit or else the encryption will not happen.
> > > +  * Wa_14019159160 - Case 2: mtl
> > > +  * On some platforms, protected contexts require setting
> > > +  * the LRC run-alone bit or else the encryption/decryption will not 
> > > happen.
> > > +  * NOTE: Case 2 only applies to PXP use-case of said workaround.
> > >*/
> > 
> > hmm, interesting enough, on the wa description I read that it is incomplete 
> > for MTL/ARL
> > and something about a fuse bit. We should probably chase for some 
> > clarification in the
> > HSD?!
alan: yes, I went through the HSD description with the architect and we both
agreed on the view from the KMD's perspective. At that time, the checking of
the fuse from the KMD could be optimized out for Case-2-PXP: if the fuse was
set a certain way, KMD can skip setting the bit in the LRC because hw will
enforce runalone in pxp mode irrespective of what KMD does, but if the fuse
was not set that way, KMD needed to set the runalone bit in the LRC. So for
case 2, it's simpler to just always turn it on when the context is protected.
However, that said, I realized the wording of the HSDs has changed since the
original patch was implemented, so I will reclarify offline and reply back.
NOTE: I believe John Harrison is working on case-3 through a separate series.
alan: based on conversations with the architect, a different WA number,
14019725520, has the details of case-2 that explain the relationship with the
new fuse. However, as before, it can be optimized as explained above, where a
PXP context needs to always set this bit to ensure encryption/decryption
engages.
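To make the optimization concrete, here is a hedged stand-alone C sketch (hypothetical names; the real logic lives in i915's LRC setup) of why the fuse check can be dropped for the PXP case:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model only. Depending on a fuse, the HW may enforce run-alone in
 * PXP mode by itself; when it does not, the KMD must set the bit in the
 * LRC. Setting the bit unconditionally for protected contexts is correct
 * in both fuse configurations, so the fuse read can be optimized out.
 */
static bool kmd_sets_lrc_runalone(bool ctx_is_protected,
				  bool fuse_hw_enforces_runalone)
{
	if (!ctx_is_protected)
		return false;
	/* harmless when HW already enforces it, required when it does not */
	(void)fuse_hw_enforces_runalone;
	return true;
}
```

Either fuse setting yields the same decision, which is exactly the simplification described above.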




Re: [Intel-gfx] [PATCH v2 1/1] drm/i915/pxp: Add missing tag for Wa_14019159160

2023-11-29 Thread Teres Alexis, Alan Previn
On Mon, 2023-11-27 at 15:24 -0500, Vivi, Rodrigo wrote:
> On Wed, Nov 22, 2023 at 12:30:03PM -0800, Alan Previn wrote:
alan:snip
alan: thanks for reviewing and apologize for replying to this late.

> > /*
> > -* On MTL and newer platforms, protected contexts require setting
> > -* the LRC run-alone bit or else the encryption will not happen.
> > +* Wa_14019159160 - Case 2: mtl
> > +* On some platforms, protected contexts require setting
> > +* the LRC run-alone bit or else the encryption/decryption will not 
> > happen.
> > +* NOTE: Case 2 only applies to PXP use-case of said workaround.
> >  */
> 
> hmm, interesting enough, on the wa description I read that it is incomplete 
> for MTL/ARL
> and something about a fuse bit. We should probably chase for some 
> clarification in the
> HSD?!
alan: yes, I went through the HSD description with the architect and we both
agreed on the view from the KMD's perspective. At that time, the checking of
the fuse from the KMD could be optimized out for Case-2-PXP: if the fuse was
set a certain way, KMD can skip setting the bit in the LRC because hw will
enforce runalone in pxp mode irrespective of what KMD does, but if the fuse
was not set that way, KMD needed to set the runalone bit in the LRC. So for
case 2, it's simpler to just always turn it on when the context is protected.
However, that said, I realized the wording of the HSDs has changed since the
original patch was implemented, so I will reclarify offline and reply back.
NOTE: I believe John Harrison is working on case-3 through a separate series.

> 
> > -   if (GRAPHICS_VER_FULL(ce->engine->i915) >= IP_VER(12, 70) &&
> > -   (ce->engine->class == COMPUTE_CLASS || ce->engine->class == 
> > RENDER_CLASS)) {
> > +   if (IS_METEORLAKE(ce->engine->i915) && (ce->engine->class == 
> > COMPUTE_CLASS ||
> > +   ce->engine->class == 
> > RENDER_CLASS)) {
> 
> This check now excludes the ARL with the same IP, no?!
alan: yeah - re-revved this part just now.
> 
> > rcu_read_lock();
> > gem_ctx = rcu_dereference(ce->gem_context);
> > if (gem_ctx)
> > 
> > base-commit: 5429d55de723544dfc0630cf39d96392052b27a1
> > -- 
> > 2.39.0
> > 



Re: [Intel-gfx] [PATCH v5] drm/i915/pxp: Add drm_dbgs for critical PXP events.

2023-11-29 Thread Teres Alexis, Alan Previn
On Fri, 2023-11-24 at 08:30 +, Tvrtko Ursulin wrote:
> On 22/11/2023 19:15, Alan Previn wrote:
alan:snip
alan: thanks for reviewing.
> > if (iir & GEN12_DISPLAY_STATE_RESET_COMPLETE_INTERRUPT)
> > -   pxp->session_events |= PXP_TERMINATION_COMPLETE;
> > +   pxp->session_events |= PXP_TERMINATION_COMPLETE | 
> > PXP_EVENT_TYPE_IRQ;
> 
> This looked to be doing more than adding debug logs, but then no one is 
> actually consuming this new flag?!
alan: see the hunk below; inside pxp_session_work, the drm_dbg prints
the "events" that came from above. We need this debug message to uniquely
recognize that a context's pxp keys would get invalidated in response to
a KCR IRQ (as opposed to other known flows that are being reported by other
means, like banning hanging contexts, or as part of suspend, etc.).
NOTE: pxp->session_events does get set in at least one other path
outside the IRQ (which triggers teardown but does not come from KCR-hw).
This flag is solely for the debug message. Hope this works.

> 
> Regards,
> 
> Tvrtko
> 
> >   
> > if (pxp->session_events)
> > queue_work(system_unbound_wq, &pxp->session_work);
> > diff --git a/drivers/gpu/drm/i915/pxp/intel_pxp_session.c 
> > b/drivers/gpu/drm/i915/pxp/intel_pxp_session.c
> > index 0a3e66b0265e..091c86e03d1a 100644
> > --- a/drivers/gpu/drm/i915/pxp/intel_pxp_session.c
> > +++ b/drivers/gpu/drm/i915/pxp/intel_pxp_session.c
> > @@ -137,8 +137,10 @@ void intel_pxp_terminate(struct intel_pxp *pxp, bool 
> > post_invalidation_needs_res
> >   static void pxp_terminate_complete(struct intel_pxp *pxp)
> >   {
> > /* Re-create the arb session after teardown handle complete */
> > -   if (fetch_and_zero(&pxp->hw_state_invalidated))
> > +   if (fetch_and_zero(&pxp->hw_state_invalidated)) {
> > +   drm_dbg(&pxp->ctrl_gt->i915->drm, "PXP: creating arb_session after invalidation");
> > pxp_create_arb_session(pxp);
> > +   }
> >   
> > complete_all(&pxp->termination);
> >   }
> >   }
> > @@ -157,6 +159,8 @@ static void pxp_session_work(struct work_struct *work)
> > if (!events)
> > return;
> >   
> > +   drm_dbg(&gt->i915->drm, "PXP: processing event-flags 0x%08x", events);
> > +
> > if (events & PXP_INVAL_REQUIRED)
> > intel_pxp_invalidate(pxp);
> >   
> > diff --git a/drivers/gpu/drm/i915/pxp/intel_pxp_types.h 
> > b/drivers/gpu/drm/i915/pxp/intel_pxp_types.h
> > index 7e11fa8034b2..07864b584cf4 100644
> > --- a/drivers/gpu/drm/i915/pxp/intel_pxp_types.h
> > +++ b/drivers/gpu/drm/i915/pxp/intel_pxp_types.h
> > @@ -124,6 +124,7 @@ struct intel_pxp {
> >   #define PXP_TERMINATION_REQUEST  BIT(0)
> >   #define PXP_TERMINATION_COMPLETE BIT(1)
> >   #define PXP_INVAL_REQUIRED   BIT(2)
> > +#define PXP_EVENT_TYPE_IRQ   BIT(3)
> >   };
> >   
> >   #endif /* __INTEL_PXP_TYPES_H__ */
> > 
> > base-commit: 5429d55de723544dfc0630cf39d96392052b27a1



Re: [Intel-gfx] [PATCH v6 2/2] drm/i915/guc: Close deregister-context race against CT-loss

2023-11-29 Thread Teres Alexis, Alan Previn
On Mon, 2023-11-27 at 16:51 -0500, Vivi, Rodrigo wrote:

alan: Firstly, thanks for taking the time to review this, knowing you have a 
lot on your plate right now.
> 
alan:snip
> > @@ -3301,19 +3315,38 @@ static inline void guc_lrc_desc_unpin(struct 
> > intel_context *ce)
> > /* Seal race with Reset */
> > spin_lock_irqsave(&ce->guc_state.lock, flags);
> > disabled = submission_disabled(guc);
> > -   if (likely(!disabled)) {
> > -   __intel_gt_pm_get(gt);
> > -   set_context_destroyed(ce);
> > -   clr_context_registered(ce);
> > -   }
> > -   spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> 
> you are now holding this spin lock for too long...
alan: my intention was to ensure that an async H2G request to change this
gucid-context state wouldn't come in while we are in the middle of attempting
to send the context-destruction H2G and later realize it needs to be
unrolled (the corner-case race we are solving here, as discovered on a
customer platform). But after discussing with Daniele and John, they
agreed that we should move the unlock back to before the deregister and then
take the lock again inside the unrolling code (if deregister_context fails).
alan:snip
> 
> > -   deregister_context(ce, ce->guc_id.id);
> > +   /* GuC is active, lets destroy this context,
> 
> for multi-line comments you need to start with '/*' only
> and start the real comment below, like:
alan:my bad. will fix.
> *
>  * GuC is active, ...
> > +* but at this point we can still be racing with
> > +* suspend, so we undo everything if the H2G fails
> > +*/
> > +
> > +   /* Change context state to destroyed and get gt-pm */
> > +   __intel_gt_pm_get(gt);
> > +   set_context_destroyed(ce);
> > +   clr_context_registered(ce);
> > +
> > +   ret = deregister_context(ce, ce->guc_id.id);
> > +   if (ret) {
> > +   /* Undo the state change and put gt-pm if that failed */
> > +   set_context_registered(ce);
> > +   clr_context_destroyed(ce);
> > +   /*
> > +* Don't use might_sleep / ASYNC version of put because
> > +* CT loss in deregister_context could mean an ongoing
> > +* reset or suspend flow. Immediately put before the unlock
> > +*/
> > +   __intel_wakeref_put(&gt->wakeref, 0);
> 
> interesting enough you use the '__' version to bypass the might_sleep(),
> but also sending 0 as argument what might lead you in the mutex_lock inside
> the spin lock area, what is not allowed.
alan: one thing to note: the motivation for an alternative unlock was only
driven by the fact we were holding that spinlock. However, as per the review
comment and response on the spin lock above, if we move the pm-put
outside the spin lock, we can call intel_wakeref_put_async (which will not
actually trigger a delayed task, because this function, guc_lrc_desc_unpin,
starts with GEM_BUG_ON(!intel_gt_pm_is_awake(gt)), which means it would only
decrement the ref count).
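A minimal user-space model (illustrative only, not the real intel_wakeref code) of why the async put cannot schedule deferred work here: the GEM_BUG_ON guarantees the gt is awake, i.e. at least one other reference is held, so the put only decrements:

```c
#include <assert.h>

/* Toy refcount model of an async wakeref put (hypothetical names). */
struct wakeref_model {
	int count;
	int deferred_work_scheduled;
};

static void model_put_async(struct wakeref_model *wf)
{
	/* only the put that drops the last reference defers work */
	if (--wf->count == 0)
		wf->deferred_work_scheduled = 1;
}
```

With count >= 2 on entry (the gt is still awake), the put is a plain decrement and no deferred task is queued.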
alan:snip
> > +   spin_lock_irqsave(&guc->submission_state.lock, flags);
> > +   list_add_tail(&ce->destroyed_link,
> > + &guc->submission_state.destroyed_contexts);
> > +   spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> > +   /* Bail now since the list might never be emptied if h2gs fail */
> 
> For this GuC interaction part I'd like to get an ack from John Harrison 
> please.
alan: okay - will do. I might alternatively tap Daniele since he knows the
GuC just as well.
> 
alan:snip

> > @@ -3392,6 +3440,17 @@ static void destroyed_worker_func(struct work_struct 
> > *w)
> > struct intel_gt *gt = guc_to_gt(guc);
> > int tmp;
> >  
> > +   /*
> > +* In rare cases we can get here via async context-free fence-signals 
> > that
> > +* come very late in suspend flow or very early in resume flows. In 
> > these
> > +* cases, GuC won't be ready but just skipping it here is fine as these
> > +* pending-destroy-contexts get destroyed totally at GuC reset time at 
> > the
> > +* end of suspend.. OR.. this worker can be picked up later on the next
> > +* context destruction trigger after resume-completes
> > +*/
> > +   if (!intel_guc_is_ready(guc))
> > +   return;
> 
> is this racy?
alan: this is to reduce raciness. This worker function eventually calls
deregister_destroyed_contexts, which calls the guc_lrc_desc_unpin that does
all the work we discussed above (taking locks, taking refs, sending h2g,
potentially unrolling). So this check is an optimization to skip all that
without going through the locking. So yes, it's a bit racy, but it's trying
to reduce raciness. NOTE1: Without this line of code, in theory everything
would still work fine, with the driver potentially experiencing more unroll
cases in those thousands of cycles. NOTE2: This check using intel_guc_is_ready
is done in many places in the driver, knowing that it doesn't take any locks.
Of 

Re: [Intel-gfx] [PATCH v3 1/1] drm/i915/pxp: Add missing tag for Wa_14019159160

2023-11-29 Thread Teres Alexis, Alan Previn
On Tue, 2023-11-28 at 10:03 -0800, Roper, Matthew D wrote:
> On Mon, Nov 27, 2023 at 12:11:50PM -0800, Alan Previn wrote:
> > Add missing tag for "Wa_14019159160 - Case 2" (for existing
> > PXP code that ensures the run-alone mode bit is set to allow
> > PxP-decryption).
alan:snip.
alan: thanks for reviewing.
> > -   if (GRAPHICS_VER_FULL(ce->engine->i915) >= IP_VER(12, 70) &&
> > +   if (GRAPHICS_VER_FULL(ce->engine->i915) == IP_VER(12, 70) &&
> 
> The workaround database lists this as being needed on both 12.70 and
> 12.71.  Should this be a
> 
> IS_GFX_GT_IP_RANGE(gt, IP_VER(12, 70), IP_VER(12, 71))
alan: sure - will do this.
> 
> check instead?
> 
> The workaround is also listed in the database as applying to DG2; is
> this "case 2" subset of the workaround not relevant to that platform?
alan: I don't believe so - not the pxp case 2.



Re: [Intel-gfx] ✗ Fi.CI.IGT: failure for drm/i915/pxp: Add drm_dbgs for critical PXP events. (rev5)

2023-11-22 Thread Teres Alexis, Alan Previn
On Wed, 2023-09-27 at 11:08 +, Patchwork wrote:
> Patch Details
> Series: drm/i915/pxp: Add drm_dbgs for critical PXP events. (rev5)
> URL:https://patchwork.freedesktop.org/series/123803/
alan:snip
> 
> Here are the unknown changes that may have been introduced in 
> Patchwork_123803v5_full:
> 
> IGT changes
> Possible regressions
> 
>   *   igt@perf_pmu@frequency@gt0:
>  *   shard-dg2: PASS -> SKIP
alan: the failure is verified as not related, but I will post a rebase since
it's been a while.


Re: [Intel-gfx] [PATCH v1 1/1] drm/i915/pxp: Bail early in pxp tee backend on first teardown error

2023-11-16 Thread Teres Alexis, Alan Previn
On Thu, 2023-11-16 at 15:20 -0800, Teres Alexis, Alan Previn wrote:
> For Gen12, when using the mei-pxp tee backend transport, if we are coming
> up from a cold boot or from a resume (not runtime resume), we can
> optionally quicken the very first session cleanup that would occur
> as part of starting up a default PXP session. Typically a cleanup
> from a cold-start is expected to be quick, so we can use a shorter
> timeout and skip retries (when getting non-success on calling the
> backend transport for intel_pxp_tee_end_arb_fw_session).
alan:snip

> @@ -387,10 +389,14 @@ void intel_pxp_tee_end_arb_fw_session(struct intel_pxp 
> *pxp, u32 session_id)
>   ret = intel_pxp_tee_io_message(pxp,
>  &msg_in, sizeof(msg_in),
>  &msg_out, sizeof(msg_out),
> -NULL);
> +NULL, pxp->hw_state_coldstart ?
> +PXP_TRANSPORT_TIMEOUT_FAST_MS : PXP_TRANSPORT_TIMEOUT_MS);
>  
> - /* Cleanup coherency between GT and Firmware is critical, so try again if it fails */
> - if ((ret || msg_out.header.status != 0x0) && ++trials < 3)
> + /*
> +  * Cleanup coherency between GT and Firmware is critical, so try again if it
> +  * fails, unless we are performing a cold-start reset
> +  */
> + if ((ret || msg_out.header.status != 0x0) && !pxp->hw_state_coldstart && ++trials < 3)
alan: Take note that I am working offline with sister teams to perform some
end-to-end conformance testing with more comprehensive OS stacks before we can
verify that this optimization doesn't break any existing use-cases.


Re: [Intel-gfx] [char-misc-next 3/4] mei: pxp: re-enable client on errors

2023-11-15 Thread Teres Alexis, Alan Previn
On Wed, 2023-11-15 at 13:31 +, Tvrtko Ursulin wrote:
> On 14/11/2023 15:31, Teres Alexis, Alan Previn wrote:
> > On Tue, 2023-11-14 at 16:00 +0200, Ville Syrjälä wrote:
> > > On Wed, Oct 11, 2023 at 02:01:56PM +0300, Tomas Winkler wrote:
> > > 
> 
> Regardless of the mei_pxp_send_message being temporarily broken, doesn't 
> Ville's logs suggest the PXP detection is altogether messed up? AFAIR 
> the plan was exactly to avoid stalls during Mesa init and new uapi was 
> added to achieve that. But it doesn't seem to be working?!
> 
> commit 3b918f4f0c8b5344af4058f1a12e2023363d0097
> Author: Alan Previn 
> Date:   Wed Aug 2 11:25:50 2023 -0700
> 
>  drm/i915/pxp: Optimize GET_PARAM:PXP_STATUS
> 
>  After recent discussions with Mesa folks, it was requested
>  that we optimize i915's GET_PARAM for the PXP_STATUS without
>  changing the UAPI spec.
> 
>  Add these additional optimizations:
> - If any PXP initialization flow failed, then ensure that
>   we catch it so that we can change the returned PXP_STATUS
>   from "2" (i.e. 'PXP is supported but not yet ready')
>   to "-ENODEV". This typically should not happen and if it
>   does, we have a platform configuration issue.
> - If a PXP arbitration session creation event failed
>   due to incorrect firmware version or blocking SOC fusing
>   or blocking BIOS configuration (platform reasons that won't
>   change if we retry), then reflect that blockage by also
>   returning -ENODEV in the GET_PARAM:PXP_STATUS.
> - GET_PARAM:PXP_STATUS should not wait at all if PXP is
>   supported but non-i915 dependencies (component-driver /
>   firmware) we are still pending to complete the init flows.
>   In this case, just return "2" immediately (i.e. 'PXP is
>   supported but not yet ready').
> 
> AFAIU if things failed there shouldn't be long waits, repeated/constant 
> ones even less so.
> 
alan: I agree - but I don't think MESA is detecting PXP via the above case...
which is designed to be quick when using the GET_PARAM IOCTL - I believe MESA
may actually be opting in to enforce PXP. When this happens, we are required
to be stringent when managing the protected-hw-sessions, and for certain PXP
operations we retry when a failure occurs - one in particular is the PXP hw
session teardown or reset. We are expected to retry up to 3 times to ensure
that the session is properly torn down (a requirement from architecture),
unless the error returned from FW indicates that we don't have proper fusing or
security assets in the platform (i.e. lower level aspects of the platform won't
permit PXP support). All of this is already coded.
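The retry policy described above can be sketched as a small stand-alone C model (illustrative only; the real code lives in i915's PXP session teardown):

```c
#include <assert.h>
#include <errno.h>

#define MODEL_MAX_TEARDOWN_TRIES 3

/*
 * Toy model only: fw_status[] holds the firmware result of each attempt
 * (0 == success, -ENODEV == platform fusing/assets block PXP). Retry up
 * to 3 times, but bail immediately on a platform-level blockage, since
 * retrying cannot change fusing or BIOS configuration.
 */
static int model_teardown_arb_session(const int *fw_status, int attempts)
{
	int ret = -EIO;
	int i;

	for (i = 0; i < MODEL_MAX_TEARDOWN_TRIES && i < attempts; i++) {
		ret = fw_status[i];
		if (ret == 0)
			return 0;       /* torn down cleanly */
		if (ret == -ENODEV)
			return -ENODEV; /* platform won't support PXP */
	}
	return ret;
}
```

A transient failure followed by success still reports success, while the -ENODEV class of errors short-circuits the retries entirely.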

So we have two problems here:

1. The code for enforcing PXP when PXP is explicitly requested by the UMD is
expecting the mei-component driver to follow the mei-component-interface spec,
but a change introduced by Alex caused a bug against this lowest-layer spec
(see: 
https://elixir.bootlin.com/linux/latest/source/drivers/misc/mei/pxp/mei_pxp.c#L25)
"Return: 0 on Success, <0 on Failure" for send and "Return: bytes sent on
Success, <0 on Failure" for receive - the change they made was returning
a positive number on send's success, while i915 was checking according to the spec:

 ret = pxp_component->ops->send(pxp_component->tee_dev, msg_in, msg_in_size,
PXP_TRANSPORT_TIMEOUT_MS);
->   if (ret) {
 drm_err(&i915->drm, "Failed to send PXP TEE message\n");
 goto unlock;
 }

 ret = pxp_component->ops->recv(pxp_component->tee_dev, msg_out, msg_out_max_size,
PXP_TRANSPORT_TIMEOUT_MS);
 if (ret < 0) {
 drm_err(&i915->drm, "Failed to receive PXP TEE message\n");
 goto unlock;
 }

So I really cannot guarantee anything if the mei code itself is broken and not
complying with the spec we agreed on - I can change the above check for send
from "if (ret)" to "if (ret < 0)", but that would only be "working around"
the mei bug. As a side note, when we bail on a successful "send" thinking it's
a failure and try to "send" again, it triggers a link reset in the mei-hw
link because the receive was never picked up. If I understand this correctly,
this is a hw-fw design requirement, i.e. the send-recv must be an atomic FSM
sequence. That said, if the send-failure was a real failure, the timeout
would not have been hit and the retry would have been much faster. As a side note,
Alex from the mei team did reply to me offline with an indication that the fix was
supposed to have been sent
Re: [Intel-gfx] [PATCH v1 1/1] drm/i915/gt: Dont wait forever when idling in suspend

2023-11-14 Thread Teres Alexis, Alan Previn
On Tue, 2023-11-14 at 08:22 -0800, Teres Alexis, Alan Previn wrote:
> When suspending, add a timeout when calling
> intel_gt_pm_wait_for_idle, else if we have a leaked
> wakeref (which would be indicative of a bug elsewhere
> in the driver), the driver will at least exit the suspend-resume
> cycle after the kernel detects the held reference and
> prints a message to abort suspending, instead of hanging
> in the kernel forever, which then requires a serial connection
> or ramoops dump to debug further.
NOTE: this patch originates from Patch #3 of this other series
https://patchwork.freedesktop.org/series/121916/ (rev 5 and prior)
and it was decided to move it out as its own patch, since this patch
is trying to improve general debuggability as opposed to resolving
the bug being addressed in the above series.
alan:snip

> +int intel_wakeref_wait_for_idle(struct intel_wakeref *wf, int timeout_ms)
>  {
> - int err;
> + int err = 0;
>  
>   might_sleep();
>  
> - err = wait_var_event_killable(&wf->wakeref,
> -   !intel_wakeref_is_active(wf));
> + if (!timeout_ms)
> + err = wait_var_event_killable(&wf->wakeref,
> +   !intel_wakeref_is_active(wf));
> + else if (wait_var_event_timeout(&wf->wakeref,
> + !intel_wakeref_is_active(wf),
> + msecs_to_jiffies(timeout_ms)) < 1)
> + err = -ETIMEDOUT;
> +
alan: paraphrasing feedback from Tvrtko on the series this patch originates
from: it would be a good idea to add error-injection into this timeout to
ensure we don't have any other subsystem that could inadvertently leak an rpm
wakeref (and to catch such bugs in future pre-merge CI).
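The semantics of the timeout_ms parameter in the quoted hunk can be modeled in a few lines of stand-alone C (illustrative only; the real code waits on the wakeref variable):

```c
#include <assert.h>

#define MODEL_ETIMEDOUT 110

/*
 * Toy model of the quoted wait logic: timeout_ms == 0 keeps the old
 * behavior (a killable wait that blocks until idle), while a non-zero
 * timeout returns -ETIMEDOUT if the wakeref doesn't go idle in time,
 * letting the suspend path bail instead of hanging forever.
 */
static int model_wait_for_idle(int timeout_ms, int goes_idle_after_ms)
{
	if (!timeout_ms)
		return 0;			/* blocks (killable) until idle */
	if (goes_idle_after_ms > timeout_ms)
		return -MODEL_ETIMEDOUT;	/* leaked wakeref: bail */
	return 0;
}
```

Callers that pass 0 see no behavior change; only the suspend path opts into the bounded wait.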




Re: [Intel-gfx] [PATCH v4 3/3] drm/i915/gt: Timeout when waiting for idle in suspending

2023-11-14 Thread Teres Alexis, Alan Previn
On Tue, 2023-11-14 at 17:52 +, Tvrtko Ursulin wrote:
> On 14/11/2023 17:37, Teres Alexis, Alan Previn wrote:
> > On Tue, 2023-11-14 at 17:27 +, Tvrtko Ursulin wrote:
> > > On 13/11/2023 17:57, Teres Alexis, Alan Previn wrote:
> > > > On Wed, 2023-10-25 at 13:58 +0100, Tvrtko Ursulin wrote:
> > > > > On 04/10/2023 18:59, Teres Alexis, Alan Previn wrote:
> > > > > > On Thu, 2023-09-28 at 13:46 +0100, Tvrtko Ursulin wrote:
> > > > > > > On 27/09/2023 17:36, Teres Alexis, Alan Previn wrote:
> 
alan:snip

> > alan: So i did trace back the gt->wakeref before i posted these patches and
> > see that within these runtime get/put calls, i believe the first 'get' leads
> > to __intel_wakeref_get_first which calls intel_runtime_pm_get via the rpm_get
> > helper and eventually executes pm_runtime_get_sync(rpm->kdev); (hanging 
> > off
> > i915_device). (naturally there is a corresponding mirror for 
> > '_put_last').
> > So this means the first-get and last-put let the kernel know. That's why 
> > when
> > i tested this patch, it did actually cause the suspend to abort from the 
> > kernel side
> > and the kernel would print a message indicating i915 was the one that didn't
> > release all refs.
> 
> Ah that would be much better then.
> 
> Do you know if everything gets resumed/restored correctly in that case 
> or we would need some additional work to maybe early exit from callers 
> of wait_for_suspend()?
alan: So assuming we are still discussing a "potentially new
future leaked-wakeref bug" (i.e. putting aside the fact that
Patch #1 + #2 resolve this specific series' bug), based on the
previous testing we did, after this timeout-bail triggers,
the suspend flow bails and gt/guc operation does actually continue
as normal. However, it's been a long time since we tested this, so
i am not sure how accidental future bugs might play into this
assumption, especially if some other subsystem leaked the rpm
wakeref but that subsystem did NOT get reset the way GuC is reset
at the end of suspend.

> 
> What I would also ask is to see if something like injecting a probing 
> failure is feasible, so we can have this new timeout exit path 
> constantly/regularly tested in CI.
alan: That's a good idea. In line with this, i would like to point out that
rev6 of this series has been posted but i removed this Patch #3. However i did
post this Patch #3 as a standalone patch here: 
https://patchwork.freedesktop.org/series/126414/
as i anticipate this patch will truly help with debuggability of future issues.

That said, i shall post a review comment on that patch with your suggestion
to add an injected probe error for the suspend-resume flow and follow up on
that one separately.

> 
> Regards,
> 
> Tvrtko
> 
> > alan: Anyways, i have pulled this patch out of rev6 of this series and 
> > created a
> > separate standalone patch for patch #3 that we can review independently.
> > 



Re: [Intel-gfx] [PATCH v4 3/3] drm/i915/gt: Timeout when waiting for idle in suspending

2023-11-14 Thread Teres Alexis, Alan Previn
On Tue, 2023-11-14 at 12:36 -0500, Vivi, Rodrigo wrote:
> On Tue, Nov 14, 2023 at 05:27:18PM +, Tvrtko Ursulin wrote:
> > 
> > On 13/11/2023 17:57, Teres Alexis, Alan Previn wrote:
> > > On Wed, 2023-10-25 at 13:58 +0100, Tvrtko Ursulin wrote:
> > > > On 04/10/2023 18:59, Teres Alexis, Alan Previn wrote:
> > > > > On Thu, 2023-09-28 at 13:46 +0100, Tvrtko Ursulin wrote:
> > > > > > On 27/09/2023 17:36, Teres Alexis, Alan Previn wrote:
> 
> That's a good point.
> Well, current code is already bad and buggy on suspend-resume. We could get
> suspend stuck forever without any clue of what happened.
> At least the proposed patch adds some gt_warn. But, yes, the right thing
> to do should be to entirely abort the suspend in case of timeout, besides the
> gt_warn.
alan: yes - that was the whole idea for Patch #3. Only after adding such
code did we have much better debuggability on real-world production platforms
+ configs that may not have a serial port or ramoops dump by default.

Btw, as per previous comments by Tvrtko - which i agreed with - I have
moved this single patch out into a separate patch on its own.
See here: https://patchwork.freedesktop.org/series/126414/
(I also maintained the RB from you, Rodrigo, because the patch did not change).
And yes, the gt_warn is still in place :)


Re: [Intel-gfx] [PATCH v4 3/3] drm/i915/gt: Timeout when waiting for idle in suspending

2023-11-14 Thread Teres Alexis, Alan Previn
On Tue, 2023-11-14 at 17:27 +, Tvrtko Ursulin wrote:
> On 13/11/2023 17:57, Teres Alexis, Alan Previn wrote:
> > On Wed, 2023-10-25 at 13:58 +0100, Tvrtko Ursulin wrote:
> > > On 04/10/2023 18:59, Teres Alexis, Alan Previn wrote:
> > > > On Thu, 2023-09-28 at 13:46 +0100, Tvrtko Ursulin wrote:
> > > > > On 27/09/2023 17:36, Teres Alexis, Alan Previn wrote:
alan: snip
> 
> > alan: I won't say it's NOT fixing anything - i am saying it's not fixing
> > this specific bug where we have the outstanding G2H from a context 
> > destruction
> > operation that got dropped. I am okay with dropping this patch in the next 
> > rev
> > but shall post a separate standalone version of Patch #3 - because I believe
> > all other i915 subsystems that take runtime-pm's DO NOT wait forever when 
> > entering
> > suspend - but GT is doing this. This means if there ever was a bug 
> > introduced,
> > it would require serial port or ramoops collection to debug. So i think 
> > such a
> > patch, despite not fixing this specific bug, will be very helpful for 
> > debuggability
> > of future issues. After all, it's better to fail our suspend when we have a 
> > bug
> > rather than to hang the kernel forever.
> 
> Issue I have is that I am not seeing how it fails the suspend when 
> nothing is passed out from *void* wait_suspend(gt). To me it looks the 
> suspend just carries on. And if so, it sounds dangerous to allow it to 
> do that with any future/unknown bugs in the suspend sequence. Existing 
> timeout wedges the GPU (and all that entails). New code just says "meh 
> I'll just carry on regardless".

alan: So i did trace back the gt->wakeref before i posted these patches and
see that within these runtime get/put calls, i believe the first 'get' leads
to __intel_wakeref_get_first which calls intel_runtime_pm_get via the rpm_get
helper and eventually executes pm_runtime_get_sync(rpm->kdev); (hanging off
i915_device). (naturally there is a corresponding mirror for '_put_last').
So this means the first-get and last-put let the kernel know. That's why when
i tested this patch, it did actually cause the suspend to abort from the
kernel side and the kernel would print a message indicating i915 was the one
that didn't release all refs.

alan: Anyways, i have pulled this patch out of rev6 of this series and created
a separate standalone patch for patch #3 that we can review independently.



Re: [Intel-gfx] [char-misc-next 3/4] mei: pxp: re-enable client on errors

2023-11-14 Thread Teres Alexis, Alan Previn
On Tue, 2023-11-14 at 16:00 +0200, Ville Syrjälä wrote:
> On Wed, Oct 11, 2023 at 02:01:56PM +0300, Tomas Winkler wrote:
> > From: Alexander Usyskin 
> > 
> > Disable and enable mei-pxp client on errors to clean the internal state.
> 
> This broke i915 on my Alderlake-P laptop.
> 


Hi Alex, i just relooked at the series that got merged, and i noticed
that in patch #3 of the series, you had changed mei_pxp_send_message
to return bytes sent instead of zero on success. IIRC, we had
agreed to not affect the behavior of this component interface (other
than adding the timeout) - this was the intention of Patch #4 that i
was pushing for in order to spec the interface (which continues
to say zero on success). We should fix this to stay with the original
behavior - where mei-pxp should NOT send partial packets and
will only return zero in the success case, where success means sending
the complete packet - so we don't need to get back the "bytes sent"
from mei_pxp_send_message. So i think this might be causing the problem.


Side note to Ville: are you enabling the PXP kernel config by default in
all MESA contexts? I recall that MESA folks were running some CI testing
with pxp-enabled contexts, but i didn't realize this is being enabled by
default in all contexts. Please be aware that enabling pxp contexts
would temporarily disable runtime-pm during that context's lifetime.
Also, pxp contexts will be forced to be irrecoverable if they ever hang.
The former is a hardware architecture requirement but doesn't do anything
if you're enabling display (which I believe also blocks in ADL). The
latter was a requirement to comply with Vulkan.

...alan




Re: [Intel-gfx] [PATCH v3] drm/i915: Skip pxp init if gt is wedged

2023-11-13 Thread Teres Alexis, Alan Previn
On Mon, 2023-11-13 at 14:49 -0800, Zhanjun Dong wrote:
> The gt wedged could be triggered by missing guc firmware file, HW not
> working, etc. Once triggered, it means all gt usage is dead, therefore we
> can't enable pxp under this fatal error condition.
> 
> 
alan:skip
alan: this looks good (as per our offline review/discussion),
we don't mess with the current driver startup sequence (i.e. pxp
failures can never pull down the driver probe and will not
generate warnings or errors). Also, if something does break for PXP,
we only do a drm_debug if the failure returned is not -ENODEV
(since -ENODEV can happen in the majority of cases with
legacy products or with non-PXP kernel configs):

Reviewed-by: Alan Previn 


Re: [Intel-gfx] [PATCH] drm/i915: Initialize residency registers earlier

2023-11-13 Thread Teres Alexis, Alan Previn
On Mon, 2023-10-30 at 16:45 -0700, Belgaumkar, Vinay wrote:
alan:skip
> +++ b/drivers/gpu/drm/i915/gt/intel_rc6.c
> @@ -608,11 +608,13 @@ void intel_rc6_init(struct intel_rc6 *rc6)
>   /* Disable runtime-pm until we can save the GPU state with rc6 pctx */
>   rpm_get(rc6);
>  
> - if (!rc6_supported(rc6))
> - return;
> -
>   rc6_res_reg_init(rc6);
>  
> + if (!rc6_supported(rc6)) {
> + rpm_put(rc6);
> + return;
> + }
> +
>   if (IS_CHERRYVIEW(i915))
>   err = chv_rc6_init(rc6);
>   else if (IS_VALLEYVIEW(i915))

alan: as far as the bug this patch is addressing (i.e. ensuring that
intel_rc6_print_residency has valid rc6.res_reg values for correct dprc
debugfs output when rc6 is disabled) and releasing the rpm ref, this looks
good to me.

However, when looking at the other code flows around intel_rc6_init/fini
and intel_rc6_enable/disable, i must point out that the calls to rpm_get
and rpm_put from these functions don't seem to be designed with proper
mirroring. For example, during driver startup, intel_rc6_init (which is
called by intel_gt_pm_init) calls rpm_get at the start but doesn't call
rpm_put before it returns. But back up the callstack in intel_gt_init,
after intel_gt_pm_init, a couple of subsystems get initialized before
intel_gt_resume is called - which in turn calls intel_rc6_enable, which does
the rpm_put at its end. However, between that get and put, i see several
error paths that trigger cleanups (leading eventually to driver load
failure), and i think some cases are completely missing the rpm_put that
intel_rc6_init took. Additionally, the function names of rc6_init and
__get_rc6 inside i915_pmu.c seem confusing, although static. I wish those
were named pmu_rc6_init, __pmu_rc6_init, etc.

Anyways, as per offline conversation, we are not trying to solve every
bug and design gap in this patch but just one specific bug fix. So as
per the agreed condition that we create a separate internal issue
to address this "lack of a clean mirrored-function design of rpm_get/put
across the rc6 startup sequences", here is my rb:

Reviewed-by: Alan Previn 



Re: [Intel-gfx] [PATCH v4 3/3] drm/i915/gt: Timeout when waiting for idle in suspending

2023-11-13 Thread Teres Alexis, Alan Previn
On Wed, 2023-10-25 at 13:58 +0100, Tvrtko Ursulin wrote:
> On 04/10/2023 18:59, Teres Alexis, Alan Previn wrote:
> > On Thu, 2023-09-28 at 13:46 +0100, Tvrtko Ursulin wrote:
> > > On 27/09/2023 17:36, Teres Alexis, Alan Previn wrote:
alan:snip
> > > It is not possible to wait for lost G2H in something like
> > > intel_uc_suspend() and simply declare "bad things happened" if it times
> > > out there, and forcibly clean it all up? (Which would include releasing
> > > all the abandoned pm refs, so this patch wouldn't be needed.)
> > > 
> > alan: I'm not sure if intel_uc_suspend should be held up by a gt-level
> > wakeref check unless huc/guc/gsc-uc are the only ones ever taking a gt
> > wakeref.
> > 
> > What we do know from a uc-perspective:
> > - ensure the outstanding guc-related worker is flushed, which we didn't
> > do before (addressed by patch #1).
> > - any further late H2G-SchedDisable is not leaking wakerefs when calling H2G
> > and not realizing it failed (addressed by patch #2).
> > - we already "forcibly clean it all up" at the end of intel_uc_suspend
> > when we do the guc reset and clean up all guc-ids (pre-existing upstream
> > code).
> > - we previously didn't have a coherent guarantee that "this is the end",
> > i.e. no more new requests after intel_uc_suspend. By code logic, we thought
> > we did (that's why intel_uc_suspend ends with a guc reset), but we now know
> > otherwise. So we fix that by adding the additional rcu_barrier (also part
> > of patch #2).
> 
> It is not clear to me from the above if that includes cleaning up the 
> outstanding CT replies or no. But anyway..
alan: Apologies, i should have made it clearer by citing the actual function
names calling out the steps of interest: So if you look at the function
"intel_gt_suspend_late", it calls "intel_uc_suspend" which in turn calls
"intel_guc_suspend" which does a soft reset of the guc firmware - so after
that we can assume all outstanding G2Hs are done. Going back up the stack,
intel_gt_suspend_late finally calls gt_sanitize which calls
"intel_uc_reset_prepare" which calls "intel_guc_submission_reset_prepare"
which calls "scrub_guc_desc_for_outstanding_g2h", which does what we are
looking for for all types of outstanding G2H. As per prior review comments,
besides closing the race condition, we were missing an rcu_barrier (which
let new requests come in way late). So we have resolved both in Patch #2.

> > That said, patch-3 is NOT fixing a bug in guc - it's about "if we ever have
> > a future racy gt-wakeref late-leak somewhere - no matter which subsystem
> > took it (guc is not the only subsystem taking gt wakerefs), we at
> > least don't hang forever in this code". Ofc, based on that, even without
> > patch-3 i am confident the issue is resolved anyway.
> > So we could just drop patch-3 if you prefer?
> 
> .. given this it does sound to me that if you are confident patch 3 
> isn't fixing anything today that it should be dropped.
alan: I won't say it's NOT fixing anything - i am saying it's not fixing
this specific bug where we have the outstanding G2H from a context
destruction operation that got dropped. I am okay with dropping this patch
in the next rev but shall post a separate standalone version of Patch #3 -
because I believe all other i915 subsystems that take runtime-pm's DO NOT
wait forever when entering suspend - but GT is doing this. This means if
there ever was a bug introduced, it would require serial port or ramoops
collection to debug. So i think such a patch, despite not fixing this
specific bug, will be very helpful for debuggability of future issues.
After all, it's better to fail our suspend when we have a bug rather than
to hang the kernel forever.





Re: [Intel-gfx] [PATCH] drm/i915: Skip pxp init if gt is wedged

2023-10-31 Thread Teres Alexis, Alan Previn
On Fri, 2023-10-27 at 10:13 +0300, Jani Nikula wrote:
> On Thu, 26 Oct 2023, Zhanjun Dong  wrote:
> 
alan:snip
> I'll note that nobody checks intel_pxp_init() return status, so this
> silently skips PXP.
> 
> BR,
> Jani.

alan:snip
> > +   if (intel_gt_is_wedged(gt))
> > +   return -ENODEV;
> > +

alan: wondering if we can add a drm_dbg in the caller of intel_pxp_init and
use a unique return value for the case of gt_is_wedged (for example -ENXIO).
As we know, gt being wedged at startup basically means all gt usage is dead
and therefore we can't enable PXP (along with everything else that needs
submission/guc/etc). With a drm-debug in the caller that prints that return
value, it helps us differentiate between gt-is-wedged vs a platform config
that doesn't support PXP. However, this would mean new drm-debug 'noise' for
platforms that i915 just doesn't support PXP on at all, which would be okay
if we don't use drm_warn or drm_err and use a 'softer' message like
"PXP skipped with %d".

Please treat the above comment as a "nit" - i.e. the existing patch is good
enough for me... (after addressing Jani's request for more commit message
info). ...alan




Re: [Intel-gfx] [PATCH v4 3/3] drm/i915/gt: Timeout when waiting for idle in suspending

2023-10-04 Thread Teres Alexis, Alan Previn
On Thu, 2023-09-28 at 13:46 +0100, Tvrtko Ursulin wrote:
> On 27/09/2023 17:36, Teres Alexis, Alan Previn wrote:
> > Thanks for taking the time to review this Tvrtko, replies inline below.
alan:snip

> > > 
> > > Main concern is that we need to be sure there are no possible
> > > ill-effects, like letting the GPU/GuC scribble on some memory we
> > > unmapped (or will unmap), having let the suspend continue after timing
> > > out, and not perhaps doing the forced wedge like wait_for_suspend() does
> > > on the existing timeout path.
> > alan: this will not happen because the held wakeref is never force-released
> > after the timeout - so what happens is the kernel would bail the suspend.
> 
> How does it know to fail the suspend when there is no error code 
> returned with this timeout? Maybe a stupid question.. my knowledge of 
> suspend-resume paths was not great even before I forgot it all.
alan: Tvrtko, you and I both, sir. (apologies for the tardy response - yet
again a busy week). So i did trace back the gt->wakeref before i posted
these patches and discovered that within these runtime get/put calls, i
believe the first 'get' leads to __intel_wakeref_get_first, which calls
intel_runtime_pm_get via the rpm_get helper and eventually executes
pm_runtime_get_sync(rpm->kdev); (hanging off i915). (ofc, there is a
corresponding call for '_put_last') - so non-first/non-last calls just
adjust the counter for the gt... but a missed last put will mean the kernel
knows i915 hasn't 'put' everything.
alan:snip
> > 
> > Recap: so in both cases (original vs this patch), if we had a buggy 
> > gt-wakeref leak,
> > we don't get invalid guc-accesses, but without this patch, we wait forever,
> > and with this patch, we get some messages and eventually bail the suspend.
> 
> It is not possible to wait for lost G2H in something like 
> intel_uc_suspend() and simply declare "bad things happened" if it times 
> out there, and forcibly clean it all up? (Which would include releasing 
> all the abandoned pm refs, so this patch wouldn't be needed.)
> 
alan: I'm not sure if intel_uc_suspend should be held up by a gt-level
wakeref check unless huc/guc/gsc-uc are the only ones ever taking a gt
wakeref.

What we do know from a uc-perspective:
- ensure the outstanding guc-related worker is flushed, which we didn't do
before (addressed by patch #1).
- any further late H2G-SchedDisable is not leaking wakerefs when calling H2G
and not realizing it failed (addressed by patch #2).
- we already "forcibly clean it all up" at the end of intel_uc_suspend
when we do the guc reset and clean up all guc-ids (pre-existing upstream
code).
- we previously didn't have a coherent guarantee that "this is the end",
i.e. no more new requests after intel_uc_suspend. By code logic, we thought
we did (that's why intel_uc_suspend ends with a guc reset), but we now know
otherwise. So we fix that by adding the additional rcu_barrier (also part of
patch #2).

That said, patch-3 is NOT fixing a bug in guc - it's about "if we ever have
a future racy gt-wakeref late-leak somewhere - no matter which subsystem
took it (guc is not the only subsystem taking gt wakerefs), we at
least don't hang forever in this code". Ofc, based on that, even without
patch-3 i am confident the issue is resolved anyway.
So we could just drop patch-3 if you prefer?

...alan


Re: [Intel-gfx] [PATCH v4 2/3] drm/i915/guc: Close deregister-context race against CT-loss

2023-10-04 Thread Teres Alexis, Alan Previn
On Wed, 2023-10-04 at 06:34 +, Gupta, Anshuman wrote:
> 
> > -Original Message-
> > From: Teres Alexis, Alan Previn  > @@ -289,6 +289,13 @@ int intel_gt_resume(struct intel_gt *gt)
> > 
> >  static void wait_for_suspend(struct intel_gt *gt)  {
> > +   /*
> > +* On rare occasions, we've observed the fence completion trigger
> > +* free_engines asynchronously via rcu_call. Ensure those are done.
> > +* This path is only called on suspend, so it's an acceptable cost.
> > +*/
> > +   rcu_barrier();
> Let's add the barrier after the end of prepare suspend and at start of late 
> suspend.
> To make sure we don't have any async destroy from any user request or any 
> internal  kmd request during i915 suspend?
> Br,
> Anshuman Gupta.
alan: some thoughts: actually wait_for_suspend is called from both
intel_gt_suspend_prepare and intel_gt_suspend_late, so putting the barrier
in the above location means it gets hit in both steps.
However, because wait_for_suspend may optionally wedge the system if
outstanding requests were stuck for more than I915_GT_SUSPEND_IDLE_TIMEOUT,
wouldn't it be better to add the barrier before that check (i.e. in the
above location) as opposed to after the return from wait_for_suspend at the
end of suspend_prepare - which would defeat the purpose of even checking
intel_gt_wait_for_idle from within wait_for_suspend? (which is existing
code). Perhaps we need to get one last test on this?



Re: [Intel-gfx] [PATCH v4 3/3] drm/i915/gt: Timeout when waiting for idle in suspending

2023-09-27 Thread Teres Alexis, Alan Previn
Thanks for taking the time to review this Tvrtko, replies inline below.

On Wed, 2023-09-27 at 10:02 +0100, Tvrtko Ursulin wrote:
> On 26/09/2023 20:05, Alan Previn wrote:
> > When suspending, add a timeout when calling
> > intel_gt_pm_wait_for_idle else if we have a lost
> > G2H event that holds a wakeref (which would be
> > indicative of a bug elsewhere in the driver),
> > driver will at least complete the suspend-resume
> > cycle, (albeit not hitting all the targets for
> > low power hw counters), instead of hanging in the kernel.
alan:snip

> >   {
> > +   int timeout_ms = CONFIG_DRM_I915_MAX_REQUEST_BUSYWAIT ? : 1;
> 
> CONFIG_DRM_I915_MAX_REQUEST_BUSYWAIT is in ns so assigning it to _ms is 
> a bit too arbitrary.
> 
> Why not the existing I915_GT_SUSPEND_IDLE_TIMEOUT for instance?
alan: good catch, my bad. However, I915_GT_SUSPEND_IDLE_TIMEOUT is only half
a second, which i feel is too quick. I guess i could create a new #define or
use a multiple of I915_GT_SUSPEND_IDLE_TIMEOUT?

> > /*
> >  * On rare occasions, we've observed the fence completion trigger
> >  * free_engines asynchronously via rcu_call. Ensure those are done.
> > @@ -308,7 +309,10 @@ static void wait_for_suspend(struct intel_gt *gt)
> > intel_gt_retire_requests(gt);
> > }
> >   
> > -   intel_gt_pm_wait_for_idle(gt);
> > +   /* we are suspending, so we shouldn't be waiting forever */
> > +   if (intel_gt_pm_wait_timeout_for_idle(gt, timeout_ms) == -ETIMEDOUT)
> > +   gt_warn(gt, "bailing from %s after %d milisec timeout\n",
> > +   __func__, timeout_ms);
> 
> Does the timeout in intel_gt_pm_wait_timeout_for_idle always comes in 
> pair with the timeout first in intel_gt_wait_for_idle?
alan: Not necessarily, ... IIRC in nearly all cases, the first wait call
completes in time (i.e. no more ongoing work), and the second wait only
waits if the last bit of work 'just finished'; then that second wait may
take a touch longer because of possible async gt-pm-put calls (which appear
in several places in the driver as part of regular runtime operation,
including request completion). NOTE: these are not high-end workloads, so the
> 
> Also, is the timeout here hit from the intel_gt_suspend_prepare, 
> intel_gt_suspend_late, or can be both?
> 
> Main concern is that we need to be sure there are no possible 
> ill-effects, like letting the GPU/GuC scribble on some memory we 
> unmapped (or will unmap), having let the suspend continue after timing 
> out, and not perhaps doing the forced wedge like wait_for_suspend() does 
> on the existing timeout path.
alan: this will not happen because the held wakeref is never force-released
after the timeout - so what happens is the kernel would bail the suspend.
> 
> Would it be possible to handle the lost G2H events directly in the 
> respective component instead of here? Like apply the timeout during the 
> step which explicitly idles the CT for suspend (presumably that 
> exists?), and so cleanup from there once declared a lost event.
alan: So yes, we don't solve the problem with this patch - Patch #2
is addressing the root cause - and verification is still ongoing, because
it's hard to thoroughly test / reproduce. (i.e. Patch #2 could get re-rev'd).

The intent of this patch is to be informed that the gt-pm wait failed in
prep for suspending and that the kernel will eventually bail the suspend,
that's all. Because without this timed-out version of the gt-pm wait, if we
did have a leaky gt-wakeref, the kernel would hang here forever and we would
need a serial console or ramoops to eventually discover the kernel's
cpu-hung error with the call-stack pointing to this location. So the goal
here is to help speed up the future debug process (if, let's say, some
future code change with another async gt-pm-put came along and caused new
problems after Patch #2 resolved this instance).

Recap: so in both cases (original vs this patch), if we had a buggy
gt-wakeref leak, we don't get invalid guc-accesses, but without this patch,
we wait forever, and with this patch, we get some messages and eventually
bail the suspend.

alan:snip



Re: [Intel-gfx] [PATCH v3 2/3] drm/i915/guc: Close deregister-context race against CT-loss

2023-09-26 Thread Teres Alexis, Alan Previn
On Thu, 2023-09-14 at 11:34 -0400, Vivi, Rodrigo wrote:
> On Sat, Sep 09, 2023 at 08:58:45PM -0700, Alan Previn wrote:
alan:snip
> 
> > +   /* Change context state to destroyed and get gt-pm */
> > +   __intel_gt_pm_get(gt);
> > +   set_context_destroyed(ce);
> > +   clr_context_registered(ce);
> > +
> > +   ret = deregister_context(ce, ce->guc_id.id);
> > +   if (ret) {
> > +   /* Undo the state change and put gt-pm if that failed */
> > +   set_context_registered(ce);
> > +   clr_context_destroyed(ce);
> > +   intel_gt_pm_put(gt);
> 
> This is a might_sleep inside a spin_lock! Are you 100% confident no WARN
> was seeing during the tests indicated in the commit msg?
> 
> > +   }
> > +   spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> > +
> > +   return 0;
> 
> If you are always returning 0, there's no pointing in s/void/int...
alan: Hi Rodrigo - i actually forgot that i need the error value returned
so that _guc_context_destroy can put the context back into the
guc->submission_state.destroyed_contexts linked list. So i clearly missed
returning the error in the case where deregister_context fails.
> 
> >  }
> >  
> >  static void __guc_context_destroy(struct intel_context *ce)
> > @@ -3268,7 +3296,22 @@ static void deregister_destroyed_contexts(struct 
> > intel_guc *guc)
> > if (!ce)
> > break;
> >  
> > -   guc_lrc_desc_unpin(ce);
> > +   if (guc_lrc_desc_unpin(ce)) {
> > +   /*
> > +* This means GuC's CT link severed mid-way which could 
> > happen
> > +* in suspend-resume corner cases. In this case, put the
> > +* context back into the destroyed_contexts list which 
> > will
> > +* get picked up on the next context deregistration 
> > event or
> > +* purged in a GuC sanitization event 
> > (reset/unload/wedged/...).
> > +*/
> > +   spin_lock_irqsave(&guc->submission_state.lock, flags);
> > +   list_add_tail(&ce->destroyed_link,
> > + 
> > &guc->submission_state.destroyed_contexts);
> > +   spin_unlock_irqrestore(&guc->submission_state.lock, 
> > flags);
alan:snip


Re: [Intel-gfx] [PATCH v3 2/3] drm/i915/guc: Close deregister-context race against CT-loss

2023-09-26 Thread Teres Alexis, Alan Previn
> 


> > alan:snip
> > > > @@ -3279,6 +3322,17 @@ static void destroyed_worker_func(struct
> > work_struct *w)
> > > > struct intel_gt *gt = guc_to_gt(guc);
> > > > int tmp;
> > > > 
> > > > +   /*
> > > > +* In rare cases we can get here via async context-free 
> > > > fence-signals
> > that
> > > > +* come very late in suspend flow or very early in resume 
> > > > flows. In
> > these
> > > > +* cases, GuC won't be ready but just skipping it here is fine 
> > > > as these
> > > > +* pending-destroy-contexts get destroyed totally at GuC reset 
> > > > time at
> > the
> > > > +* end of suspend.. OR.. this worker can be picked up later on 
> > > > the next
> > > > +* context destruction trigger after resume-completes
> > > 
> > > who is triggering the work queue again?
> > 
> > alan: short answer: we dont know - and still hunting this (getting closer 
> > now..
> > using task tgid str-name lookups).
> > in the few times I've seen it, the callstack I've seen looked like this:
> > 
> > [33763.582036] Call Trace:
> > [33763.582038]  
> > [33763.582040]  dump_stack_lvl+0x69/0x97 [33763.582054]
> > guc_context_destroy+0x1b5/0x1ec [33763.582067]
> > free_engines+0x52/0x70 [33763.582072]  rcu_do_batch+0x161/0x438
> > [33763.582084]  rcu_nocb_cb_kthread+0xda/0x2d0 [33763.582093]
> > kthread+0x13a/0x152 [33763.582102]  ?
> > rcu_nocb_gp_kthread+0x6a7/0x6a7 [33763.582107]  ? css_get+0x38/0x38
> > [33763.582118]  ret_from_fork+0x1f/0x30 [33763.582128]  

> Alan, the above trace is not due to a missing GT wakeref; it is due to an
> intel_context_put() which is called asynchronously by
> call_rcu(__free_engines). We need to insert an rcu_barrier() to flush all
> rcu callbacks in late suspend.
> 
> Thanks,
> Anshuman.
> > 
Thanks Anshuman for following up with the ongoing debug. I shall re-rev 
accordingly.
...alan


Re: [Intel-gfx] [PATCH v3 2/3] drm/i915/guc: Close deregister-context race against CT-loss

2023-09-22 Thread Teres Alexis, Alan Previn
(cc Anshuman who is working directly with the taskforce debugging this)

Thanks again for taking the time to review this patch.
Apologies for the tardiness; rest assured debug is still ongoing.

As mentioned in prior comments, the signatures and frequency are
now different compared to without the patches of this series.
We are still hunting for data, as we suspect a different wakeref is
still being held with the same trigger event despite these patches.

That said, we will continue to rebase / update this series but hold off on
the actual merge until we can be sure we have all the issues resolved.

On Thu, 2023-09-14 at 11:34 -0400, Vivi, Rodrigo wrote:
> On Sat, Sep 09, 2023 at 08:58:45PM -0700, Alan Previn wrote:
> > If we are at the end of suspend or very early in resume
> > its possible an async fence signal could lead us to the
alan:snip


> > @@ -3188,19 +3202,33 @@ static inline void guc_lrc_desc_unpin(struct 
> > intel_context *ce)
> > /* Seal race with Reset */
> > spin_lock_irqsave(&ce->guc_state.lock, flags);
> > disabled = submission_disabled(guc);
> > 
alan:snip
> > +   /* Change context state to destroyed and get gt-pm */
> > +   __intel_gt_pm_get(gt);
> > +   set_context_destroyed(ce);
> > +   clr_context_registered(ce);
> > +
> > +   ret = deregister_context(ce, ce->guc_id.id);
> > +   if (ret) {
> > +   /* Undo the state change and put gt-pm if that failed */
> > +   set_context_registered(ce);
> > +   clr_context_destroyed(ce);
> > +   intel_gt_pm_put(gt);
> 
> This is a might_sleep inside a spin_lock! Are you 100% confident no WARN
> was seeing during the tests indicated in the commit msg?
alan: Good catch - i don't think we saw a WARN - I'll go back and check
with the task force - i shall rework this function to do that call outside
the lock.
> 
> > +   }
> > +   spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> > +
> > +   return 0;
> 
> If you are always returning 0, there's no pointing in s/void/int...
Alan: agreed - will change to void.
> > 
> > 

alan:snip
> > @@ -3279,6 +3322,17 @@ static void destroyed_worker_func(struct work_struct 
> > *w)
> > struct intel_gt *gt = guc_to_gt(guc);
> > int tmp;
> >  
> > +   /*
> > +* In rare cases we can get here via async context-free fence-signals 
> > that
> > +* come very late in suspend flow or very early in resume flows. In 
> > these
> > +* cases, GuC won't be ready but just skipping it here is fine as these
> > +* pending-destroy-contexts get destroyed totally at GuC reset time at 
> > the
> > +* end of suspend.. OR.. this worker can be picked up later on the next
> > +* context destruction trigger after resume-completes
> 
> who is triggering the work queue again?

alan: short answer: we don't know - and still hunting this
(getting closer now.. using task tgid str-name lookups).
in the few times I've seen it, the callstack I've seen looked like this:

[33763.582036] Call Trace:
[33763.582038]  
[33763.582040]  dump_stack_lvl+0x69/0x97
[33763.582054]  guc_context_destroy+0x1b5/0x1ec
[33763.582067]  free_engines+0x52/0x70
[33763.582072]  rcu_do_batch+0x161/0x438
[33763.582084]  rcu_nocb_cb_kthread+0xda/0x2d0
[33763.582093]  kthread+0x13a/0x152
[33763.582102]  ? rcu_nocb_gp_kthread+0x6a7/0x6a7
[33763.582107]  ? css_get+0x38/0x38
[33763.582118]  ret_from_fork+0x1f/0x30
[33763.582128]  

I did add additional debug-msgs for tracking and I recall seeing this sequence 
via independent callstacks in the big picture:
i915_sw_fence_complete -> __i915_sw_fence_complete -> 
__i915_sw_fence_notify(fence, FENCE_FREE) -> <..delayed?..>
[ drm fence sync func ] <...> engines_notify -> call_rcu(&engines->rcu, 
free_engines_rcu) <..delayed?..>
free_engines -> intel_context_put -> ... [kref-dec] --> 
guc_context_destroy

Unfortunately, we still don't know why this initial "i915_sw_fence_complete" is 
coming during suspend-late.
NOTE1: in the cover letter or a prior comment, I hope I mentioned the 
reproduction steps: it only occurs with a workload that does a network download 
which begins just before suspend is started but completes before suspend-late. 
We are getting close to finding this - it is taking time because of the 
reproduction steps.

Anshuman can chime in if he is seeing new signatures with different callstack /
events after this patch is applied.


> I mean, I'm wondering if it is necessary to re-schedule it from inside...
> but also wonder if this is safe anyway...

alan: so my thought process, also after consulting with John and Daniele, was:
since we have a link list to collect the contexts that need to be deregistered,
and since we have up to 64k (32k excluding multi-lrc) guc-ids, there really is
no risk in just keeping them inside the link list until we reach one of the 
following:

1. full suspend happens and we actually reset the guc - in which case we will 
   remove all contexts in this link list and completely destroy them and 
   release all references
 

Re: [Intel-gfx] [PATCH v3 1/3] drm/i915/guc: Flush context destruction worker at suspend

2023-09-22 Thread Teres Alexis, Alan Previn
On Thu, 2023-09-14 at 11:35 -0400, Vivi, Rodrigo wrote:
> On Sat, Sep 09, 2023 at 08:58:44PM -0700, Alan Previn wrote:
> > When suspending, flush the context-guc-id
> > deregistration worker at the final stages of
> > intel_gt_suspend_late when we finally call gt_sanitize
> > that eventually leads down to __uc_sanitize so that
> > the deregistration worker doesn't fire off later as
> > we reset the GuC microcontroller.
> > 
> > Signed-off-by: Alan Previn 
> > Tested-by: Mousumi Jana 
> 
> Reviewed-by: Rodrigo Vivi 
alan:snip
alan: thanks for the RB Rodrigo.


Re: [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for drm/i915/pxp: Add drm_dbgs for critical PXP events.

2023-09-19 Thread Teres Alexis, Alan Previn
On Sat, 2023-09-16 at 03:09 +, Patchwork wrote:
> 
alan:snip
> 2eab9e4e637a drm/i915/pxp: Add drm_dbgs for critical PXP events.
> -:7: WARNING:COMMIT_LOG_LONG_LINE: Prefer a maximum 75 chars per line 
> (possible unwrapped commit description?)
> #7: 
> sequence of important events. Add drm_dbg into the most important PXP events.
> 
> -:15: ERROR:BAD_SIGN_OFF: Unrecognized email address: 'Balasubrawmanian, 
> Vivaik '
> #15: 
> Reviewed-by: Balasubrawmanian, Vivaik 
> 
> -:96: WARNING:LONG_LINE: line length of 105 exceeds 100 columns
> #96: FILE: drivers/gpu/drm/i915/pxp/intel_pxp_irq.c:43:
> + pxp->session_events |= PXP_TERMINATION_REQUEST | 
> PXP_INVAL_REQUIRED | PXP_EVENT_TYPE_IRQ;

Will fix these before we merge.
...alan


Re: [Intel-gfx] ✗ Fi.CI.BAT: failure for drm/i915/pxp/mtl: Update gsc-heci cmd submission to align with fw/hw spec

2023-09-18 Thread Teres Alexis, Alan Previn
On Sun, 2023-09-17 at 22:04 +, Patchwork wrote:
> Patch Details
> Series: drm/i915/pxp/mtl: Update gsc-heci cmd submission to align with fw/hw 
> spec
> URL:https://patchwork.freedesktop.org/series/123830/
> 

alan:snip
Below issue is unrelated since this series only changes code paths in MTL-PXP 
and in pxp-specific lrc code (double checked). Will retrigger the test since 
it was a BAT failure.

> Possible new issues
> 
> Here are the unknown changes that may have been introduced in 
> Patchwork_123830v1:
> 
> IGT changes
> Possible regressions
> 
>   *   igt@gem_exec_suspend@basic-s3@lmem0:
>  *   bat-dg2-8: NOTRUN -> 
> INCOMPLETE
> 



Re: [Intel-gfx] [PATCH v6 1/3] drm/i915/pxp/mtl: Update pxp-firmware response timeout

2023-09-17 Thread Teres Alexis, Alan Previn
On Sat, 2023-09-16 at 10:25 +0800, lkp wrote:
> Hi Alan,
> 
> kernel test robot noticed the following build errors:
> 
> [auto build test ERROR on cf1e91e884bb1113c653e654e9de1754fc1d4488]
> 
> All errors (new ones prefixed by >>):
> 
> 
alan:snip
alan: missed building with PXP config after that last change below - will 
re-rev this.

>drivers/gpu/drm/i915/pxp/intel_pxp_gsccs.c: In function 
> 'gsccs_send_message':
> > > drivers/gpu/drm/i915/pxp/intel_pxp_gsccs.c:115:52: error: 
> > > 'GSC_REPLY_LATENCY_MS' undeclared (first use in this function); did you 
> > > mean 'GSC_HECI_REPLY_LATENCY_MS'?
>  115 |
> GSC_REPLY_LATENCY_MS);
>  |
> ^~~~
>  |
> GSC_HECI_REPLY_LATENCY_MS
>drivers/gpu/drm/i915/pxp/intel_pxp_gsccs.c:115:52: note: each undeclared 
> identifier is reported only once for each function it appears in
> 




Re: [Intel-gfx] [PATCH v3] drm/i915/pxp: Add drm_dbgs for critical PXP events.

2023-09-15 Thread Teres Alexis, Alan Previn
On Fri, 2023-09-15 at 13:15 -0700, Teres Alexis, Alan Previn wrote:
> > Debugging PXP issues can't even begin without understanding precedding
> sequence of important events. Add drm_dbg into the most important PXP events.
> 
>  v3 : - move gt_dbg to after mutex block in function
> i915_gsc_proxy_component_bind. (Vivaik)
>  v2 : - remove __func__ since drm_dbg covers that (Jani).
>   - add timeout dbg of the restart from front-end (Alan).
> 
alan:snip

> +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_proxy.c
> @@ -322,6 +322,7 @@ static int i915_gsc_proxy_component_bind(struct device 
> *i915_kdev,
>   gsc->proxy.component = data;
>   gsc->proxy.component->mei_dev = mei_kdev;
>   mutex_unlock(&gsc->proxy.mutex);
> + gt_dbg(gt, "GSC proxy mei component bound\n");

forgot to include RB from Vivaik, per condition of fixing above hunk, from rev2 
dri-devel 
https://lists.freedesktop.org/archives/dri-devel/2023-September/422858.html :

Reviewed-by: Balasubrawmanian, Vivaik  


Re: [Intel-gfx] [PATCH v5 3/3] drm/i915/lrc: User PXP contexts requires runalone bit in lrc

2023-09-15 Thread Teres Alexis, Alan Previn
On Sat, 2023-09-09 at 15:38 -0700, Teres Alexis, Alan Previn wrote:
> On Meteorlake onwards, HW specs require that all user contexts that
> run on render or compute engines and require PXP must enforce
> run-alone bit in lrc. Add this enforcement for protected contexts.
alan:snip
> 
> Signed-off-by: Alan Previn 
> @@ -860,6 +881,8 @@ static void init_common_regs(u32 * const regs,
>   if (GRAPHICS_VER(engine->i915) < 11)
>   ctl |= _MASKED_BIT_DISABLE(CTX_CTRL_ENGINE_CTX_SAVE_INHIBIT |
>  CTX_CTRL_RS_CTX_ENABLE);
> + if (ctx_needs_runalone(ce))
> + ctl |= _MASKED_BIT_ENABLE(BIT(7));
>   regs[CTX_CONTEXT_CONTROL] = ctl;
>  
>   regs[CTX_TIMESTAMP] = ce->stats.runtime.last;
alan: to align intel-gfx ml, Vivaik reviewed this on dri-devel at 
https://lists.freedesktop.org/archives/dri-devel/2023-September/422875.html - 
snippet:
thus, will rerev this (with the others fixes in this series).

>> Can we please get the bit defined in intel_engine_regs.h with a define 
>> instead of a number identification?
>> 
>> Review completed conditional to the above fix.
>> 
>> Reviewed-by: Balasubrawmanian, Vivaik 
>>  
>> <mailto:vivaik.balasubrawmanian at intel.com>




Re: [Intel-gfx] [PATCH v5 2/3] drm/i915/pxp/mtl: Update pxp-firmware packet size

2023-09-15 Thread Teres Alexis, Alan Previn
On Sat, 2023-09-09 at 15:38 -0700, Teres Alexis, Alan Previn wrote:
> Update the GSC-fw input/output HECI packet size to match
> updated internal fw specs.
> 
> Signed-off-by: Alan Previn 
> ---
>  drivers/gpu/drm/i915/pxp/intel_pxp_cmd_interface_43.h | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/pxp/intel_pxp_cmd_interface_43.h 
> b/drivers/gpu/drm/i915/pxp/intel_pxp_cmd_interface_43.h
> index 0165d38fbead..e017a7d952e9 100644
> --- a/drivers/gpu/drm/i915/pxp/intel_pxp_cmd_interface_43.h
> +++ b/drivers/gpu/drm/i915/pxp/intel_pxp_cmd_interface_43.h
> @@ -14,8 +14,8 @@
>  #define PXP43_CMDID_NEW_HUC_AUTH 0x003F /* MTL+ */
>  #define PXP43_CMDID_INIT_SESSION 0x0036
>  
> -/* PXP-Packet sizes for MTL's GSCCS-HECI instruction */
> -#define PXP43_MAX_HECI_INOUT_SIZE (SZ_32K)
> +/* PXP-Packet sizes for MTL's GSCCS-HECI instruction is spec'd at 65K before 
> page alignment*/
> +#define PXP43_MAX_HECI_INOUT_SIZE (PAGE_ALIGNED(SZ_64K + SZ_1K))
>  
>  /* PXP-Packet size for MTL's NEW_HUC_AUTH instruction */
>  #define PXP43_HUC_AUTH_INOUT_SIZE (SZ_4K)
Vivaik replied with RB on dri-devel: 
https://lists.freedesktop.org/archives/dri-devel/2023-September/422862.htmll
we connected offline and agreed that his RB can remain standing on condition i 
fix the PAGE_ALIGNED -> PAGE_ALIGN fix.
Thanks Vivaik for reviewing.



Re: [Intel-gfx] [PATCH v5 1/3] drm/i915/pxp/mtl: Update pxp-firmware response timeout

2023-09-15 Thread Teres Alexis, Alan Previn
On Sat, 2023-09-09 at 15:38 -0700, Teres Alexis, Alan Previn wrote:
> Update the max GSC-fw response time to match updated internal
> fw specs. Because this response time is an SLA on the firmware,
> not inclusive of i915->GuC->HW handoff latency, when submitting
> requests to the GSC fw via intel_gsc_uc_heci_cmd_submit helpers,
alan:snip
Vivaik replied with RB on dri-devel: 
https://lists.freedesktop.org/archives/dri-devel/2023-September/422861.html
we connected offline and agreed that his RB can remain standing on condition i 
fix the PAGE_ALIGNED -> PAGE_ALIGN fix.
Thanks Vivaik for reviewing.


Re: [Intel-gfx] [PATCH v5 1/3] drm/i915/pxp/mtl: Update pxp-firmware response timeout

2023-09-15 Thread Teres Alexis, Alan Previn
On Sat, 2023-09-09 at 15:38 -0700, Teres Alexis, Alan Previn wrote:
> Update the max GSC-fw response time to match updated internal
> fw specs. Because this response time is an SLA on the firmware,
> not inclusive of i915->GuC->HW handoff latency, when submitting
> requests to the GSC fw via intel_gsc_uc_heci_cmd_submit helpers,
> start the count after the request hits the GSC command streamer.
> Also, move GSC_REPLY_LATENCY_MS definition from pxp header to
> intel_gsc_uc_heci_cmd_submit.h since its for any GSC HECI packet.
> 
> Signed-off-by: Alan Previn 
> ---
>  .../i915/gt/uc/intel_gsc_uc_heci_cmd_submit.c | 20 +--
>  .../i915/gt/uc/intel_gsc_uc_heci_cmd_submit.h |  6 ++
>  drivers/gpu/drm/i915/pxp/intel_pxp_gsccs.h| 11 ++
>  3 files changed, 31 insertions(+), 6 deletions(-)
alan: snip


> index 09d3fbdad05a..5ae5c5d9608b 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_heci_cmd_submit.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_heci_cmd_submit.h
> @@ -12,6 +12,12 @@ struct i915_vma;
>  struct intel_context;
>  struct intel_gsc_uc;
>  
> +#define GSC_HECI_REPLY_LATENCY_MS 350
> +/*
> + * Max FW response time is 350ms, but this should be counted from the time 
> the
> + * command has hit the GSC-CS hardware, not the preceding handoff to GuC CTB.
> + */
alan: continue to face timeout issues - so increasing this to ~500ms to absorb 
other hw/sw system latencies.
this also matches what the gsc-proxy code was doing - so i could use the same 
macro for that other code path.



Re: [Intel-gfx] [PATCH v5 2/3] drm/i915/pxp/mtl: Update pxp-firmware packet size

2023-09-15 Thread Teres Alexis, Alan Previn
On Sat, 2023-09-09 at 15:38 -0700, Teres Alexis, Alan Previn wrote:
> Update the GSC-fw input/output HECI packet size to match
> updated internal fw specs.
> 
> Signed-off-by: Alan Previn 
> 
alan:snip

> -/* PXP-Packet sizes for MTL's GSCCS-HECI instruction */
> -#define PXP43_MAX_HECI_INOUT_SIZE (SZ_32K)
> +/* PXP-Packet sizes for MTL's GSCCS-HECI instruction is spec'd at 65K before 
> page alignment*/
> +#define PXP43_MAX_HECI_INOUT_SIZE (PAGE_ALIGNED(SZ_64K + SZ_1K))
alan: silly ctrl-c/v bug on my part - should be PAGE_ALIGN, not ALIGNED
>  
>  /* PXP-Packet size for MTL's NEW_HUC_AUTH instruction */
>  #define PXP43_HUC_AUTH_INOUT_SIZE (SZ_4K)



Re: [Intel-gfx] [PATCH v1 1/1] drm/i915/pxp: Add drm_dbgs for critical PXP events.

2023-09-13 Thread Teres Alexis, Alan Previn
On Mon, 2023-09-11 at 12:26 +0300, Jani Nikula wrote:
> On Wed, 06 Sep 2023, Alan Previn  wrote:
> > Debugging PXP issues can't even begin without understanding precedding
> > sequence of events. Add drm_dbg into the most important PXP events.
> > 
> > Signed-off-by: Alan Previn 
alan:snip

> 
> > +
> > +   drm_dbg(&pxp->ctrl_gt->i915->drm, "PXP: %s invoked", __func__);
> 
> drm_dbg already covers __func__ (via __builtin_return_address(0) in
> __drm_dev_dbg), it's redundant.
> 
> Ditto for all added debugs below.

My bad - yup - will fix them.
Thanks for taking time to review this patch.
...alan
> 



Re: [Intel-gfx] [PATCH] i915/pmu: Move execlist stats initialization to execlist specific setup

2023-09-13 Thread Teres Alexis, Alan Previn
I went up the call stack to ensure the differences between the
old and new location aren't skipping over other functions that may reference
something engine related (that may also end up triggering the stats variables).

Without digging further, I see the old position here:
i915_driver_probe -> i915_driver_mmio_probe -> intel_gt_init_mmio -> 
intel_engines_init_mmio -> intel_engine_setup
and the new position here:
i915_driver_probe -> i915_gem_init -> (for_each_gt) intel_gt_init -> 
intel_engines_init -> setup (intel_execlists_submission_setup)

And between i915_driver_mmio_probe and i915_gem_init are only mem/ggtt, display 
and irq related init functions.
That said, LGTM (although you do need to address the BAT failure before 
merging):

Reviewed-by: Alan Previn 

On Tue, 2023-09-12 at 14:22 -0700, Umesh Nerlige Ramappa wrote:
> engine->stats is a union of execlist and guc stat objects. When execlist
> specific fields are initialized, the initial state of guc stats is
> affected. This results in bad busyness values when using GuC mode. Move
> the execlist initialization from common code to execlist specific code.
> 
> Fixes: 77cdd054dd2c ("drm/i915/pmu: Connect engine busyness stats from GuC to 
> pmu")
> Signed-off-by: Umesh Nerlige Ramappa 
> ---
>  drivers/gpu/drm/i915/gt/intel_engine_cs.c| 1 -
>  drivers/gpu/drm/i915/gt/intel_execlists_submission.c | 2 ++
>  2 files changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c 
> b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> index dfb69fc977a0..84a75c95f3f7 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> @@ -558,7 +558,6 @@ static int intel_engine_setup(struct intel_gt *gt, enum 
> intel_engine_id id,
>   DRIVER_CAPS(i915)->has_logical_contexts = true;
>  
>   ewma__engine_latency_init(&engine->latency);
> - seqcount_init(&engine->stats.execlists.lock);
>  
>   ATOMIC_INIT_NOTIFIER_HEAD(&engine->context_status_notifier);
>  
> diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c 
> b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> index 4d05321dc5b5..e8f42ec6b1b4 100644
> --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> @@ -3548,6 +3548,8 @@ int intel_execlists_submission_setup(struct 
> intel_engine_cs *engine)
>   logical_ring_default_vfuncs(engine);
>   logical_ring_default_irqs(engine);
>  
> + seqcount_init(&engine->stats.execlists.lock);
> +
>   if (engine->flags & I915_ENGINE_HAS_RCS_REG_STATE)
>   rcs_submission_override(engine);
>  



Re: [Intel-gfx] [PATCH v4 2/3] drm/i915/pxp/mtl: Update pxp-firmware packet size

2023-09-06 Thread Teres Alexis, Alan Previn
On Wed, 2023-09-06 at 17:15 -0700, Teres Alexis, Alan Previn wrote:
> Update the GSC-fw input/output HECI packet size to match
> updated internal fw specs.
alan:snip

> +++ b/drivers/gpu/drm/i915/pxp/intel_pxp_cmd_interface_43.h
> @@ -14,8 +14,8 @@
> 


> +/* PXP-Packet sizes for MTL's GSCCS-HECI instruction is 65K*/
> +#define PXP43_MAX_HECI_INOUT_SIZE (SZ_64K + SZ_1K)
alan: my mistake - didn't fix this before posting - should have been 
PAGE_ALIGN(65K)




Re: [Intel-gfx] [PATCH v2 2/3] drm/i915/guc: Close deregister-context race against CT-loss

2023-08-28 Thread Teres Alexis, Alan Previn
Additional update from the most recent testing.

When relying solely on guc_lrc_desc_unpin getting a failure from 
deregister_context
as a means of identifying that we are in the 
"deregister-context-vs-suspend-late" race,
it is too late a location to handle this safely. This is because one of the
first things destroyed_worker_func does is take a gt pm wakeref - which 
triggers
the gt_unpark function that runs a whole bunch of other flows, including 
triggering more
workers and taking additional refs. That said, it's best to not even call
deregister_destroyed_contexts from the worker when !intel_guc_is_ready 
(ct-is-disabled).

...alan


On Fri, 2023-08-25 at 11:54 -0700, Teres Alexis, Alan Previn wrote:
> just a follow up note-to-self:
> 
> On Tue, 2023-08-15 at 12:08 -0700, Teres Alexis, Alan Previn wrote:
> > On Tue, 2023-08-15 at 09:56 -0400, Vivi, Rodrigo wrote:
> > > On Mon, Aug 14, 2023 at 06:12:09PM -0700, Alan Previn wrote:
> > > > 
> [snip]
> 
> in guc_submission_send_busy_loop, we are incrementing the following
> that needs to be decremented if the function fails.
> 
> atomic_inc(&guc->outstanding_submission_g2h);
> 
> also, it seems that even with this unroll design - we are still
> leaking a wakeref elsewhere. this is despite a cleaner redesign of
> flows in function "guc_lrc_desc_unpin"
> (discussed earlier that wasn't very readable).
> 
> will re-rev today but will probably need more follow-ups
> tracking that one more leaking gt-wakeref (one in thousands of cycles)
> but at least now we are not hanging mid-suspend.. we bail from suspend
> with useful kernel messages.
> 
> 
> 
> 



Re: [Intel-gfx] [PATCH v2 2/3] drm/i915/guc: Close deregister-context race against CT-loss

2023-08-25 Thread Teres Alexis, Alan Previn
just a follow up note-to-self:

On Tue, 2023-08-15 at 12:08 -0700, Teres Alexis, Alan Previn wrote:
> On Tue, 2023-08-15 at 09:56 -0400, Vivi, Rodrigo wrote:
> > On Mon, Aug 14, 2023 at 06:12:09PM -0700, Alan Previn wrote:
> > > 
[snip]

in guc_submission_send_busy_loop, we are incrementing the following
that needs to be decremented if the function fails.

atomic_inc(&guc->outstanding_submission_g2h);

also, it seems that even with this unroll design - we are still
leaking a wakeref elsewhere. this is despite a cleaner redesign of
flows in function "guc_lrc_desc_unpin"
(discussed earlier that wasn't very readable).

will re-rev today but will probably need more follow-ups
tracking that one more leaking gt-wakeref (one in thousands of cycles)
but at least now we are not hanging mid-suspend.. we bail from suspend
with useful kernel messages.






Re: [Intel-gfx] [PATCH v2 1/3] drm/i915/guc: Flush context destruction worker at suspend

2023-08-25 Thread Teres Alexis, Alan Previn
Thanks again Rodrigo for reviewing and apologies for my tardy replies.
We are still testing on shipping platforms and these latest patches seem
to have reduced the frequency and solved the "system hangs" while suspending,
but it's still causing issues, so we continue to debug. (The issue is that
it's running thousands of cycles overnight on multiple systems, which takes
time.)

[snip]

> > @@ -693,6 +693,8 @@ void intel_uc_suspend(struct intel_uc *uc)
> > return;
> > }
> >  
> > +   intel_guc_submission_flush_work(guc);
> > +
> 
> what happens if a new job comes exactly here?
> This still sounds a bit racy, although this already looks
> much cleaner than the previous version.
alan: intel_uc_suspend is called from suspend-late or on init-failure,
and the very next function -> intel_guc_suspend - will soft-reset the guc
and eventually nuke it. this is the most appropriate "as late as
possible" place to put this.
> 
> > with_intel_runtime_pm(&uc_to_gt(uc)->i915->runtime_pm, wakeref) {
> > err = intel_guc_suspend(guc);
> > if (err)






Re: [Intel-gfx] [PATCH v3 1/3] drm/i915/pxp/mtl: Update pxp-firmware response timeout

2023-08-15 Thread Teres Alexis, Alan Previn
On Tue, 2023-08-15 at 13:29 -0700, Teres Alexis, Alan Previn wrote:
> Update the max GSC-fw response time to match updated internal
> fw specs. Because this response time is an SLA on the firmware,
> not inclusive of i915->GuC->HW handoff latency, when submitting
> requests to the GSC fw via intel_gsc_uc_heci_cmd_submit_nonpriv,
> start the count after the request hits the GSC command streamer.
> 
> Signed-off-by: Alan Previn 
> ---
>  drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_heci_cmd_submit.c | 3 +++
>  drivers/gpu/drm/i915/pxp/intel_pxp_gsccs.h| 6 +++---
>  2 files changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_heci_cmd_submit.c 
> b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_heci_cmd_submit.c
> index 89ed5ee9cded..ae45855594ac 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_heci_cmd_submit.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_heci_cmd_submit.c
> @@ -186,6 +186,9 @@ intel_gsc_uc_heci_cmd_submit_nonpriv(struct intel_gsc_uc 
> *gsc,
>   i915_request_add(rq);
>  
>   if (!err) {
> + if (wait_for(i915_request_started(rq), 200))
> + drm_dbg(_uc_to_gt(gsc)->i915->drm,
> + "Delay in gsc-heci-non-priv submission to 
> gsccs-hw");
alan: offline discussion with Daniele. Daniele provided the following review 
comments:
We should add this wait-check for both priv and non-priv, but we should 
increase the
timeout to be more than the guaranteed fw response time of 1 other message 
(since we
have a total of 2 contexts that could be sending messages concurrently at any 
time,
including this one)... so maybe a timeout of the new GSC_REPLY_LATENCY_MS + 150.
More importantly, he highlighted the fact that we should wait for the 
request-started
AND ensure there was no error in the request status.
[snip]


Re: [Intel-gfx] [PATCH v2 2/3] drm/i915/guc: Close deregister-context race against CT-loss

2023-08-15 Thread Teres Alexis, Alan Previn
On Tue, 2023-08-15 at 09:56 -0400, Vivi, Rodrigo wrote:
> On Mon, Aug 14, 2023 at 06:12:09PM -0700, Alan Previn wrote:
> > If we are at the end of suspend or very early in resume
> > its possible an async fence signal could lead us to the
> > execution of the context destruction worker (after the
> > prior worker flush).

[snip]
> 
> > @@ -3199,10 +3206,20 @@ static inline void guc_lrc_desc_unpin(struct 
> > intel_context *ce)
> > if (unlikely(disabled)) {
> > release_guc_id(guc, ce);
> > __guc_context_destroy(ce);
> > -   return;
> > +   return 0;
> > +   }
> > +
> > +   if (deregister_context(ce, ce->guc_id.id)) {
> > +   /* Seal race with concurrent suspend by unrolling */
> > +   spin_lock_irqsave(&ce->guc_state.lock, flags);
> > +   set_context_registered(ce);
> > +   clr_context_destroyed(ce);
> > +   intel_gt_pm_put(gt);
> 
> how sure we are this is not calling unbalanced puts?
alan: in this function (guc_lrc_desc_unpin), the summarized flow is:

check-status-stuff
if guc-is-not-disabled, take-pm, change ctx-state-bits
[implied else] if guc-is-disabled, scrub-ctx and return

thus deregister_context is only called if we didn't exit, i.e. when 
guc-is-not-disabled, i.e. after the pm was taken.
> could we wrap this in some existent function to make this clear?
alan: yeah - it's not so readable as it is now - let me tweak this function and 
make it cleaner

> 
> > +   spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> > +   return -ENODEV;
> > }
> >  
> > -   deregister_context(ce, ce->guc_id.id);
> > +   return 0;
> >  }



Re: [Intel-gfx] [PATCH v2 3/3] drm/i915/gt: Timeout when waiting for idle in suspending

2023-08-15 Thread Teres Alexis, Alan Previn
Thanks Rodrigo - agreed on everything below - will re-rev.

On Tue, 2023-08-15 at 09:51 -0400, Vivi, Rodrigo wrote:
> On Mon, Aug 14, 2023 at 06:12:10PM -0700, Alan Previn wrote:
> > When suspending, add a timeout when calling
> > intel_gt_pm_wait_for_idle else if we have a lost
> > G2H event that holds a wakeref (which would be
> > 
[snip]

> > @@ -301,7 +303,10 @@ static void wait_for_suspend(struct intel_gt *gt)
> > intel_gt_retire_requests(gt);
> > }
> >  
> > -   intel_gt_pm_wait_for_idle(gt);
> > +   /* we are suspending, so we shouldn't be waiting forever */
> > +   if (intel_gt_pm_wait_timeout_for_idle(gt, timeout_ms) == -ETIME)
> 
> you forgot to change the error code here..^
> 
> but maybe we don't even need this here and a simple
> if (intel_gt_pm_wait_timeout_for_idle(gt, timeout_ms)) should be enough
> since the error from the killable one is unlikely and the only place
> I error I could check on that path would be a catastrophic -ERESTARTSYS.
> 
> but up to you.
alan: my bad - I'll fix it - but i agree with not needing to check the failure 
type.
and I'll keep the error the same ("bailing from...")
[snip]

> > +static inline int intel_gt_pm_wait_timeout_for_idle(struct intel_gt *gt, 
> > int timeout_ms)
> > +{
> > +   return intel_wakeref_wait_for_idle(&gt->wakeref, timeout_ms);
> >  }
> 
> I was going to ask why you created a single use function here, but then I
> noticed the above one. So it makes sense.
> Then I was going to ask why in here you didn't use the same change of
> timeout = 0 in the existent function like you used below, but then I
> noticed that the above function is called in multiple places and the
> patch with this change is much cleaner and the function is static inline
> so your approach was good here.
alan: yes that was my exact reason - thus no impact across other callers.
[snip]


> Please add a documentation for this function making sure you have the 
> following
> mentions:
alan: good catch -will do.

> 
> /**
> [snip]
> * @timeout_ms: Timeout in ms, 0 means never timeout.
> *
> * Returns 0 on success, -ETIMEDOUT upon a timeout, or the unlikely
> * error propagation from wait_var_event_killable if timeout_ms is 0.
> */
> 
> with the return error fixed above and the documentation in place:
> 
> Reviewed-by: Rodrigo Vivi 
> 
> > -int intel_wakeref_wait_for_idle(struct intel_wakeref *wf)
> > +int intel_wakeref_wait_for_idle(struct intel_wakeref *wf, int timeout_ms)
> >  {
> > -   int err;
> > 
[snip]


Re: [Intel-gfx] [PATCH v2 0/3] Resolve suspend-resume racing with GuC destroy-context-worker

2023-08-14 Thread Teres Alexis, Alan Previn
On Mon, 2023-08-14 at 18:12 -0700, Teres Alexis, Alan Previn wrote:
> This series is the result of debugging issues root caused to
> races between the GuC's destroyed_worker_func being triggered
> vs repeating suspend-resume cycles with concurrent delayed
> fence signals for engine-freeing.
alan: forgot credit:
Tested-by: Mousumi Jana 

alan:snip.
> 
> 
> Alan Previn (3):
>   drm/i915/guc: Flush context destruction worker at suspend
>   drm/i915/guc: Close deregister-context race against CT-loss
>   drm/i915/gt: Timeout when waiting for idle in suspending
> 
>  drivers/gpu/drm/i915/gt/intel_engine_cs.c |  2 +-
>  drivers/gpu/drm/i915/gt/intel_gt_pm.c |  7 ++-
>  drivers/gpu/drm/i915/gt/intel_gt_pm.h |  7 ++-
>  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 45 +--
>  .../gpu/drm/i915/gt/uc/intel_guc_submission.h |  2 +
>  drivers/gpu/drm/i915/gt/uc/intel_uc.c |  2 +
>  drivers/gpu/drm/i915/intel_wakeref.c  | 14 --
>  drivers/gpu/drm/i915/intel_wakeref.h  |  5 ++-
>  8 files changed, 71 insertions(+), 13 deletions(-)
> 
> 
> base-commit: 85f20fb339f05ec4221bb295c13e46061c5c566f



Re: [Intel-gfx] [PATCH v1 1/3] drm/i915/guc: Flush context destruction worker at suspend

2023-08-14 Thread Teres Alexis, Alan Previn
> 

> > Rodrigo: And why here and not some upper layer? like in prepare
> alan: wait_for_suspend does both the checking for idle as well as the 
> potential
> wedging if guc or hw has hung at this late state. if i call from the upper
> layer before this wait_for_suspend, it might be too early because the
> wait-for-idle could experience workloads completing and new 
> context-deregistrations 
> being queued. If i call it from upper layer after wait_for_suspend, then it 
> would
> be unnecessary if wait_for_suspend decided to nuke everything. Hmmm.. but i 
> guess
> the latter could work too - since the nuke case would have emptied out the 
> link-list
> of that worker and so it would either run and do nothing or would not even be 
> queued.
> Would you rather i go that way? (i'll recheck with my team mates for 
> i-dotting/t-crossing
> discussion.

actually, after going up a layer, i realize the right place might be to insert
late-stage worker-flushing into intel_uc_suspend (called from
intel_gt_suspend_late), which is also where the gsc worker is flushed. This
will also mean we don't need to create intel_uc_suspend_prepare for new
plumbing.


Re: [Intel-gfx] [PATCH v1] drm/i915/pxp/mtl: Update gsc-heci cmd size and timeout

2023-08-10 Thread Teres Alexis, Alan Previn
On Fri, 2023-07-07 at 11:34 -0700, Teres Alexis, Alan Previn wrote:
> Update the max GSC-HECI packet size and the max firmware
> response timeout to match internal fw specs.
> 
> Signed-off-by: Alan Previn 

I'm going to re-rev this and change the subject slightly to "Update gsc-heci 
cmd submission
to align with spec". The two changes in this patch will be included but the 
cmd-packet size
is now to be increased to 64K. Also, the counting of the timeout needs to start 
from when
the request has hit the cmd streamer hardware (not from when the PXP subsystem 
has thrown it
over the wall to the GuC). Although this latter change would imply a longer 
timeout period,
the change to observe this longer timeout should be applicable to code that is 
actually
triggering a PXP session creation/teardown.

In addition to that, we also need to update the LRC common defaults to include 
forcing
forcing
the runalone bit for PXP contexts (as that is the updated hardware spec'd
expectation).

...alan


 



Re: [Intel-gfx] [PATCH v1 2/3] drm/i915/guc: Close deregister-context race against CT-loss

2023-08-09 Thread Teres Alexis, Alan Previn
On Wed, 2023-08-02 at 16:35 -0700, Teres Alexis, Alan Previn wrote:
> If we are at the end of suspend or very early in resume
> its possible an async fence signal could lead us to the
> execution of the context destruction worker (after the
> prior worker flush).
> 
alan:snip
>  
>  static void __guc_context_destroy(struct intel_context *ce)
> @@ -3270,7 +3287,20 @@ static void deregister_destroyed_contexts(struct 
> intel_guc *guc)
>   if (!ce)
>   break;
>  
> - guc_lrc_desc_unpin(ce);
> + if (guc_lrc_desc_unpin(ce)) {
> + /*
> +  * This means GuC's CT link severed mid-way which only 
> happens
> +  * in suspend-resume corner cases. In this case, put the
> +  * context back into the destroyed_contexts list which 
> will
> +  * get picked up on the next context deregistration 
> event or
> +  * purged in a GuC sanitization event 
> (reset/unload/wedged/...).
> +  */
> + spin_lock_irqsave(&guc->submission_state.lock, flags);
> + list_add_tail(&ce->destroyed_link,
> +   &guc->submission_state.destroyed_contexts);
alan: i completely missed the fact this new code is sitting within a
while (!list_empty(&guc->submission_state.destroyed_contexts)) block,
so putting the context back will cause the loop to spin forever.

will fix and re-rev.

> + spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> + }
> +
>   }
>  }
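To make the loop hazard concrete, here is a user-space C sketch of the corrected drain pattern. The structure and function names mirror the kernel code, but the linked list, locking, and failure injection are stubs of my own; the `break` after re-adding the context is the fix being described:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for the kernel structures; names are hypothetical. */
struct ctx {
	struct ctx *next;
	bool unpin_fails; /* simulates guc_lrc_desc_unpin() failing on a severed CT link */
};

static struct ctx *destroyed_contexts; /* singly-linked stack instead of a list_head */

static void push_ctx(struct ctx *ce)
{
	ce->next = destroyed_contexts;
	destroyed_contexts = ce;
}

static struct ctx *pop_ctx(void)
{
	struct ctx *ce = destroyed_contexts;

	if (ce)
		destroyed_contexts = ce->next;
	return ce;
}

static int guc_lrc_desc_unpin(struct ctx *ce)
{
	return ce->unpin_fails ? -1 : 0;
}

/*
 * Drain the destroyed-contexts list. On an unpin failure, put the
 * context back for a later deregistration event and break out of the
 * loop -- re-adding without the break would make the surrounding
 * while-not-empty loop spin forever on the same context.
 */
static int deregister_destroyed_contexts(void)
{
	struct ctx *ce;
	int done = 0;

	while ((ce = pop_ctx())) {
		if (guc_lrc_desc_unpin(ce)) {
			push_ctx(ce);	/* retry on a later event */
			break;		/* the fix: stop looping now */
		}
		done++;
	}
	return done;
}
```

A failed context simply waits for the next deregistration event (or a GuC sanitization) instead of starving the worker.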
>  



Re: [Intel-gfx] [PATCH v1 1/3] drm/i915/guc: Flush context destruction worker at suspend

2023-08-09 Thread Teres Alexis, Alan Previn
Thanks Rodrigo for reviewing this.

On Mon, 2023-08-07 at 13:52 -0400, Vivi, Rodrigo wrote:
> On Wed, Aug 02, 2023 at 04:34:59PM -0700, Alan Previn wrote:
> > Suspend is not like reset, it can unroll, so we have to properly
> > flush pending context-guc-id deregistrations to complete before
> > we return from suspend calls.
> 
> But if it 'unrolls' the execution should just continue, no?!
> In other words, why is this flush needed? What happens if we
> don't flush, but resume doesn't proceed? in which case
> of resume you are thinking that it returns and not having flushed?

alan: I apologize, i realize I should have explained it better.
The flush is needed for when we DON'T unroll. I wanted to point
out that in intel_gt_suspend_foo we don't actually know
if the suspend is going to unroll or not, so we should always flush
properly in that case. I will re-rev with a better comment on this patch.

alan:snip
> 
> >  
> >  static void wait_for_suspend(struct intel_gt *gt)
> >  {
> > -   if (!intel_gt_pm_is_awake(gt))
> > +   if (!intel_gt_pm_is_awake(gt)) {
> > +   intel_uc_suspend_prepare(&gt->uc);
> 
> why only on idle?

alan: actually no - i am flushing for both idle and non-idle cases (see farther
below) but for the non-idle case, i am skipping the flush if we decide to wedge
(since the wedge will clear up everything -> all guc-contexts that are inflight
are nuked and freed).
> 
> Well, I know, if we are in idle it is because all the requests had
> already ended and gt will be wedged, but why do we need to do anything
> if we are in idle?

Because the issue is a very late context-deregistration that is
triggered by an orthogonal thread via the following callstack:
a drm-fence signal triggers a FENCE_FREE via __i915_sw_fence_notify that
connects to engines_notify -> free_engines_rcu -> (thread) ->
intel_context_put -> kref_put(&ce->ref, ...) that queues the
context-destruction worker. That said, even though the gt is idle because
the last workload has been retired, a context-destruction worker
may have just gotten queued. 

> 
> And why here and not some upper layer? like in prepare
alan: wait_for_suspend does both the checking for idle as well as the potential
wedging if guc or hw has hung at this late stage. if i call it from the upper
layer before wait_for_suspend, it might be too early because the
wait-for-idle could experience workloads completing and new
context-deregistrations being queued. If i call it from the upper layer after
wait_for_suspend, then it would be unnecessary if wait_for_suspend decided
to nuke everything. Hmmm.. but i guess the latter could work too - since the
nuke case would have emptied out the linked list of that worker and so it
would either run and do nothing or would not even be queued.
Would you rather i go that way? (i'll recheck with my team mates for an
i-dotting/t-crossing discussion.)
> 
> > return;
> > +   }
> >  
> > if (intel_gt_wait_for_idle(gt, I915_GT_SUSPEND_IDLE_TIMEOUT) == -ETIME) 
> > {
> > /*
> > @@ -299,6 +301,8 @@ static void wait_for_suspend(struct intel_gt *gt)
> >  */
> > intel_gt_set_wedged(gt);
> > intel_gt_retire_requests(gt);
> > +   } else {
> > +   intel_uc_suspend_prepare(&gt->uc);
> > }
> >  
> > intel_gt_pm_wait_for_idle(gt);
alan:snip


Re: [Intel-gfx] [PATCH v1 3/3] drm/i915/gt: Timeout when waiting for idle in suspending

2023-08-09 Thread Teres Alexis, Alan Previn
Thanks Rodrigo for reviewing this.

On Mon, 2023-08-07 at 13:56 -0400, Vivi, Rodrigo wrote:
> On Wed, Aug 02, 2023 at 04:35:01PM -0700, Alan Previn wrote:
> > When suspending, add a timeout when calling
> > intel_gt_pm_wait_for_idle else if we have a lost
> > G2H event that holds a wakeref (which would be
> > indicating of a bug elsewhere in the driver), we
> > get to complete the suspend-resume cycle, albeit
> > without all the lower power hw counters hitting
> > its targets, instead of hanging in the kernel.
> > 
alan:snip

> > -int intel_wakeref_wait_for_idle(struct intel_wakeref *wf)
> > +int intel_wakeref_wait_for_idle(struct intel_wakeref *wf, int timeout_ms)
> >  {
> > -   int err;
> > +   int err = 0;
> >  
> > might_sleep();
> >  
> > -   err = wait_var_event_killable(&wf->wakeref,
> > - !intel_wakeref_is_active(wf));
> > +   if (!timeout_ms)
> > +   err = wait_var_event_killable(&wf->wakeref,
> > + !intel_wakeref_is_active(wf));
> > +   else if (wait_var_event_timeout(&wf->wakeref,
> > +   !intel_wakeref_is_active(wf),
> > +   msecs_to_jiffies(timeout_ms)) < 1)
> > +   err = -ETIME;
> 
> it looks to me that -ETIMEDOUT would be a better error.
alan: okay - will do, thanks.



Re: [Intel-gfx] [PATCH v4 1/1] drm/i915/pxp: Optimize GET_PARAM:PXP_STATUS

2023-08-03 Thread Teres Alexis, Alan Previn
On Wed, 2023-08-02 at 11:25 -0700, Teres Alexis, Alan Previn wrote:
> After recent discussions with Mesa folks, it was requested
> that we optimize i915's GET_PARAM for the PXP_STATUS without
> changing the UAPI spec.
> 
> Add these additional optimizations:
>- If any PXP initialization flow failed, then ensure that
>  we catch it so that we can change the returned PXP_STATUS
>  from "2" (i.e. 'PXP is supported but not yet ready')
>  to "-ENODEV". This typically should not happen and if it
>  does, we have a platform configuration issue.
>- If a PXP arbitration session creation event failed
>  due to incorrect firmware version or blocking SOC fusing
>  or blocking BIOS configuration (platform reasons that won't
>  change if we retry), then reflect that blockage by also
>  returning -ENODEV in the GET_PARAM:PXP_STATUS.
>- GET_PARAM:PXP_STATUS should not wait at all if PXP is
>  supported but non-i915 dependencies (component-driver /
>  firmware) we are still pending to complete the init flows.
>  In this case, just return "2" immediately (i.e. 'PXP is
>  supported but not yet ready').
> 
> Difference from prior revs:
>   v3: - Rebase with latest tip that has pulled in setting the
> gsc fw load to fail if proxy init fails.
>   v2: - Use a #define for the default readiness timeout (Vivaik).
>   - Improve comments around the failing of proxy-init.
>   v1: - Change the commit msg style to be imperative. (Jani)
>   - Rename timeout to timeout_ms. (Jani)
>   - Fix is_fw_err_platform_config to use higher order
> param (pxp) first. (Jani)
> 
> Signed-off-by: Alan Previn 

alan: Daniele pointed out that i removed Vivaik's R-b from rev-3.
The difference between this rev and rev-3 is that a hunk of code was
removed and went into a different patch (that reuses the same code, is
higher priority - CI related, and had to go first). So this rev
is identical to rev-3 except for that removed hunk.
I've checked with Vivaik and he is good with keeping his R-b
since the rest of the patch is the same. Thus repasting his
R-b from rev-3 (Thanks Daniele and Vivaik):

Reviewed-by: Balasubrawmanian, Vivaik 
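For reference, the three PXP_STATUS behaviours listed in the commit message reduce to a small decision function. This is a user-space sketch with hypothetical names and stand-in values (the real uapi values live in the i915 headers, and the real code also handles the waiting case):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the uapi values. */
enum { PXP_STATUS_NOT_SUPPORTED = 0, PXP_STATUS_READY = 1, PXP_STATUS_NOT_READY = 2 };
#define PXP_ENODEV (-19) /* stand-in for -ENODEV */

struct pxp_state {
	bool supported;
	bool platform_blocked; /* fusing / BIOS / fw-version blockage: retry won't help */
	bool init_failed;      /* a PXP init flow failed */
	bool deps_done;        /* component driver + firmware init flows completed */
	bool arb_ready;        /* arbitration session established */
};

static int pxp_get_status(const struct pxp_state *p)
{
	if (!p->supported)
		return PXP_STATUS_NOT_SUPPORTED;
	if (p->platform_blocked || p->init_failed)
		return PXP_ENODEV;           /* platform reason: won't change on retry */
	if (!p->deps_done)
		return PXP_STATUS_NOT_READY; /* return "2" immediately, no waiting */
	return p->arb_ready ? PXP_STATUS_READY : PXP_STATUS_NOT_READY;
}
```

The key optimization is the third branch: pending non-i915 dependencies return "2" immediately rather than blocking the GET_PARAM caller.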


Re: [Intel-gfx] [PATCH v1 0/3] Resolve suspend-resume racing with GuC destroy-context-worker

2023-08-02 Thread Teres Alexis, Alan Previn
On Wed, 2023-08-02 at 16:34 -0700, Teres Alexis, Alan Previn wrote:
> This series is the result of debugging issues root caused to
> races between the GuC's destroyed_worker_func being triggered vs
> repeating suspend-resume cycles with concurrent delayed
> fence signals for engine-freeing.
> 
> The reproduction steps require that an app is created right before
> the start of the suspend cycle where it creates a new gem
> context and submits a tiny workload that would complete in the
> middle of the suspend cycle. However this app uses dma-buffer
> sharing or dma-fence with non-GPU objects or signals that
> eventually triggers a FENCE_FREE via __i915_sw_fence_notify that
> connects to engines_notify -> free_engines_rcu ->
> intel_context_put -> kref_put(&ce->ref, ...) that queues the
> worker after the GuCs CTB has been disabled (i.e. after
> i915-gem's suspend-late flows).
> 
As an FYI - in offline conversations with John and Daniele, we have agreed
that at least the first two patches in this series are necessary improvements,
but the last patch may remain open as further offline debug is continuing
to pin down the source of the above fence-signal flow. For now we are hoping
to proceed with reviewing the first two patches and only look into the 3rd
patch if there is system-level fence signalling that truly can trigger
this anomaly, or if it's just a straggling request somewhere within i915
that has appeared or hung at the wrong time and needs to be fixed.

alan:snip


Re: [Intel-gfx] [PATCH] drm/i915/pxp/mtl: intel_pxp_init_hw needs runtime-pm inside pm-complete

2023-08-02 Thread Teres Alexis, Alan Previn
> > 
> > 
alan:snip

Thanks Vinay and Daniele - i'll respin with below fix.


> > @@ -48,7 +50,8 @@ void intel_pxp_resume_complete()
> > if (!HAS_ENGINE(pxp->ctrl_gt, GSC0) && !pxp->pxp_component)
> > return;
> >   
> > -   intel_pxp_init_hw(pxp);
> > +   with_intel_runtime_pm(&pxp->ctrl_gt->i915->runtime_pm, wakeref)
> 
> This is called from within the rpm resume path, so you can't do an rpm 
> get or it will deadlock. Maybe have:
> 
> __pxp_resume_complete(struct intel_pxp *pxp, bool needs_rpm);
> 
> intel_pxp_resume_complete(..)
> {
>      return __pxp_resume_complete(pxp, true);
> }
> 
> intel_pxp_runtime_resume(..)
> {
>      return __pxp_resume_complete(pxp, false);
> }
> 
> 
> or something like that.
> Daniele
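Daniele's suggested split can be sketched as follows in user-space C; the rpm reference counting is a stub of my own, added only to show why the runtime-resume path must not take another reference:

```c
#include <assert.h>
#include <stdbool.h>

/* Counts runtime-pm references held in this user-space stub. */
static int rpm_refs;

static void pxp_init_hw(void)
{
	assert(rpm_refs > 0); /* hw access requires a runtime-pm reference */
}

static void __pxp_resume_complete(bool needs_rpm)
{
	if (needs_rpm)
		rpm_refs++;	/* with_intel_runtime_pm(...) { */
	pxp_init_hw();
	if (needs_rpm)
		rpm_refs--;	/* } */
}

/* System-resume path: nobody holds rpm yet, so take it here. */
static void pxp_resume_complete(void)
{
	__pxp_resume_complete(true);
}

/*
 * Runtime-resume path: called from inside the rpm resume flow, where
 * taking another reference would deadlock in the real driver -- so it
 * relies on the reference the rpm core already holds.
 */
static void pxp_runtime_resume(void)
{
	rpm_refs++;			/* simulate the rpm core's reference */
	__pxp_resume_complete(false);
	rpm_refs--;
}
```

Both entry points end up in the same helper; only the caller knows whether a wakeref must be acquired.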



Re: [Intel-gfx] ✗ Fi.CI.BAT: failure for drm/i915/selftest/gsc: Ensure GSC Proxy init completes before selftests (rev7)

2023-07-26 Thread Teres Alexis, Alan Previn
> IGT changes
> Possible regressions
> 
>   *   igt@vgem_basic@dmabuf-fence-before:
>  *   fi-kbl-soraka: 
> PASS
>  -> 
> INCOMPLETE
> 
This failure is unrelated for two reasons: #1, the patch only takes effect on 
MTL, and #2, the patch is only part of self-test startup and this is not a 
self-test failure.


Re: [Intel-gfx] [PATCH v6] drm/i915/selftest/gsc: Ensure GSC Proxy init completes before selftests

2023-07-20 Thread Teres Alexis, Alan Previn
On Thu, 2023-07-20 at 14:52 -0700, Ceraolo Spurio, Daniele wrote:
> 
> On 7/20/2023 2:40 PM, Alan Previn wrote:
> > On MTL, if the GSC Proxy init flows haven't completed, submissions to the
> > GSC engine will fail. Those init flows are dependent on the mei's
> > gsc_proxy component that is loaded in parallel with i915 and a
> > worker that could potentially start after i915 driver init is done.
> > 
> > That said, all subsystems that access the GSC engine today do check
> > for such init flow completion before using the GSC engine. However,
> > selftests currently don't wait on anything before starting.
> > 
> > To fix this, add a waiter function at the start of __run_selftests
> > that waits for gsc-proxy init flows to complete. Selftests shouldn't
> > care if the proxy-init failed as that should be flagged elsewhere.
> > 
> > Difference from prior versions:
> > v6: - Add a helper that returns something more than a boolean
> >   so we selftest can stop waiting if proxy-init hadn't
> >   completed but failed (Daniele).

alan:snip
> 

> > +int intel_gsc_uc_fw_proxy_get_status(struct intel_gsc_uc *gsc)
> > +{
> > +   if (!(IS_ENABLED(CONFIG_INTEL_MEI_GSC_PROXY)))
> > +   return -ENODEV;
> > +   if (!intel_uc_fw_is_loadable(>fw))
> > +   return -ENODEV;
> > +   if (__intel_uc_fw_status(>fw) == INTEL_UC_FIRMWARE_LOAD_FAIL)
> 
> You're missing the change to move the FW status to LOAD_FAIL if the 
> proxy fails to initialize. Or are you expecting 
> https://patchwork.freedesktop.org/series/118723/, which included that 
> change, to be merged first?
> 
> Daniele

alan: as per our offline sync, I'll respin this one and move it away from the
other patch (since this is more critical) and we can respin the other after
this is done so we get a smooth merge. Also, as i move that "change fw status
to fail" from that PXP patch to this patch, I'll fix that issue where i missed
the 2nd failure point in the proxy init flow.

Thanks for your help. :)



Re: [Intel-gfx] [PATCH v4] drm/i915/selftest/gsc: Ensure GSC Proxy init completes before selftests

2023-07-12 Thread Teres Alexis, Alan Previn
On Wed, 2023-07-12 at 10:19 +0100, Tvrtko Ursulin wrote:
> On 11/07/2023 23:02, Alan Previn wrote:
> > On MTL, if the GSC Proxy init flows haven't completed, submissions to the
> > GSC engine will fail. Those init flows are dependent on the mei's
> > gsc_proxy component that is loaded in parallel with i915 and a
> > worker that could potentially start after i915 driver init is done.
> > 
> > That said, all subsystems that access the GSC engine today do check
> > for such init flow completion before using the GSC engine. However,
> > selftests currently don't wait on anything before starting.
> > 
> > 
> > 
alan:snip

> > +   /*
> > +* The gsc proxy component depends on the kernel component driver load 
> > ordering
> > +* and in corner cases (the first time after an IFWI flash), 
> > init-completion
> > +* firmware flows take longer.
> > +*/
> > +   unsigned long timeout_ms = 8000;
> > +
> > +   if (need_to_wait &&
> > +   (wait_for(intel_gsc_uc_fw_proxy_init_done(>media_gt->uc.gsc, 
> > true),
> > +   timeout_ms)))
> > +   pr_info(DRIVER_NAME "Timed out waiting for 
> > gsc_proxy_completion!\n");
> 
> Would it make sense to error out here? Or at least upgrade to pr_warn or 
> something?
alan: agree on pr_warn (especially since need_to_wait being true and we tried 
for 8 secs - this should never happen).

> 
> I didn't quite understand the points Daniele raised about engine loops 
> and resets - in my mind GSC engine is this special thing exercised for 
> highly specialized operations and not touched in random for_each_engine 
> loop tests, but I also did not really look so might be totally wrong.

alan: after consulting with Daniele further, the concern in the case of
having gsc-proxy-init mid-execution while other selftests
are running (thinking of tests that have nothing to do with GSC
but have indirect effects like memory-pressure, engine-resets,
etc) is that proxy-init will bail out and print an error, but
that would result in CI reporting a false negative. We can't skip
that error since CONFIG_INTEL_MEI_GSC_PROXY tells us that the kernel
owner wants GSC-PROXY working.

> 
> In any case, v4 reads clear - no confusing comments and not 
> over-engineered so is acceptable to me.
> 
alan: thanks Tvrtko.


> P.S. Maybe the check *could* be moved to i915_live_selftests, where hw 
> dependencies conceptually fit better, and maybe i915_perf_selftests 
> would need it too then (?), but it is up to you.
alan: i can do this quickly as a rev5 (after a bit of manual check if perf 
needs it)

> 
> Maybe even in the array selftests/i915_live_selftests.h if we could add 
> a facility to make unskippable tests and add this one after the sanity 
> check. Which would then achieve the same generalized thing you had in 
> the previous version without needing to add a new array/loop.
alan: i'd rather not attempt this as part of the current patch but will
consider it as a separate patch, if you are okay with that?



Re: [Intel-gfx] ✗ Fi.CI.IGT: failure for drm/i915/pxp/mtl: Update gsc-heci cmd size and timeout

2023-07-12 Thread Teres Alexis, Alan Previn
On Fri, 2023-07-07 at 23:43 +, Patchwork wrote:
> Patch Details
> Series: drm/i915/pxp/mtl: Update gsc-heci cmd size and timeout
> URL:https://patchwork.freedesktop.org/series/120360/
> State:  failure
> Details:
> https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_120360v1/index.html
> CI Bug Log - changes from CI_DRM_13355_full -> Patchwork_120360v1_full
> Summary
> 
> FAILURE
> 
> Serious unknown changes coming with Patchwork_120360v1_full absolutely need 
> to be
> verified manually.

alan:snip


> Possible new issues
> 
> Here are the unknown changes that may have been introduced in 
> Patchwork_120360v1_full:
> 
> IGT changes
> Possible regressions
> 
>   *   igt@kms_vblank@pipe-b-accuracy-idle:
> 
>  *   shard-glk: 
> PASS
>  -> 
> FAIL
>   *   igt@prime_vgem@fence-wait@ccs0:
> 
>  *   shard-mtlp: 
> PASS
>  -> 
> DMESG-WARN
>   *   igt@prime_vgem@fence-wait@vecs0:
> 
>  *   shard-mtlp: 
> PASS
>  -> 
> ABORT
> 
The above 3 reported regressions are unrelated to PXP - the first is a 
display issue and the remaining two are about gt park-unpark incorrectly 
attempting to free a vma (looks to me like a genuine bug, not related to PXP 
but related to some last bits of GT code that don't seem to be aware of 
dual-GT concurrent operation).

alan:snip


Re: [Intel-gfx] [PATCH v3] drm/i915/selftest/gsc: Ensure GSC Proxy init completes before selftests

2023-07-11 Thread Teres Alexis, Alan Previn
On Tue, 2023-07-11 at 11:49 -0700, Ceraolo Spurio, Daniele wrote:
> 
> > > > @@ -134,6 +193,8 @@ static int __run_selftests(const char *name,
> > > >{
> > > > int err = 0;
> > > >
> > > > +   __wait_on_all_system_dependencies(data);
> > > Why does this need to be top level selftests and not just a wait for
> > > intel_gsc_uc_fw_proxy_init_done in the tests where it is relevant, via
> > > some helper or something?
> > Alan: it was an offline decision because we didn't want to repeat
> > the same check for all permutations of selftests' subtests (i.e. considering
> > module params can dictate to skip some subtests but execute others).
> > 
> > Anyways, let me get back to you on how many selftests' subtests actually
> > exercise the need for proxy-init to complete - if it's just 1-to-2 subtests
> > I'll remove the code from here and move it into the individual subtests.
> 
> I don't think it is going to be easy to figure out which selftest are 
> impacted. All selftests looping on all engines of course, but also tests 
> triggering GT resets and/or messing with the system in other ways. Any 
> new tests added will also need to be evaluated.
> 
> IMO there is minimal impact of having this check on every test. When 
> running selftests we load i915 after the rest of the system has already 
> fully booted, so there are no delays in getting the mei component up and 
> therefore proxy init is sometimes completed even before the selftest 
> code starts; when we do have to wait, it's usually for a very short 
> time, because the expected total execution time for the GSC worker when 
> not having to wait for the mei component to load is ~750ms (~200ms for 
> GSC load + 20ms for HuC auth + ~500ms for proxy init). Having a few 
> seconds added to the total selftests runtime is IMO a better option that 
> having to maintain a list of impacted tests.
> 
> Daniele
> 

Thanks Daniele - I completely forgot about reset or other system disruptive 
tests.
For now I'll re-rev to address Tvrtko's other comments but will keep the waiter
as 'once-top-down' for now and wait for Tvrtko's thoughts on that next rev.
...alan


Re: [Intel-gfx] [PATCH v3] drm/i915/selftest/gsc: Ensure GSC Proxy init completes before selftests

2023-07-11 Thread Teres Alexis, Alan Previn
Thanks for reviewing, Tvrtko; below are my responses.
I'll re-rev without the generalized func ptr and only for the subtests that need it.
...alan

On Thu, 2023-06-29 at 22:44 +0100, Tvrtko Ursulin wrote:
> On 29/06/2023 21:42, Alan Previn wrote:
> > On MTL, if the GSC Proxy init flows haven't completed, submissions to the
> > GSC engine will fail. Those init flows are dependent on the mei's
> > gsc_proxy component that is loaded in parallel with i915 and a
> > worker that could potentially start after i915 driver init is done.
> > 
> > That said, all subsystems that access the GSC engine today do check
> > for such init flow completion before using the GSC engine. However,
> > selftests currently don't wait on anything before starting.

alan:snip
> > +static int
> > +__wait_gsc_proxy_completed(struct drm_i915_private *i915,
> > +  unsigned long timeout_ms)
> > +{
> > +   bool need_to_wait = (IS_ENABLED(CONFIG_INTEL_MEI_GSC_PROXY) &&
> > +i915->media_gt &&
> > +HAS_ENGINE(i915->media_gt, GSC0) &&
> > +
> > intel_uc_fw_is_loadable(&i915->media_gt->uc.gsc.fw));
> > +
> > +   /*
> > +* For gsc proxy component loading + init, we need a much longer timeout
> > +* than what CI selftest infrastructure currently uses. This longer wait
> > +* period depends on the kernel config and component driver load 
> > ordering
> > +*/
> 
> How is a CI timeout value relevant?
> 
> Plus from the commit message it sounds like the point of wait is so 
> submission to gsc does not fail if loading is still in progress, not 
> that the CI times out. So what is the main problem?

Alan: The comment was meant to explain why we override the CI selftest timeout
(an input param to the generalized func ptr loop) to something much larger,
specifically for gsc-proxy waiting. However, since your other review comment
below is to remove the generalization, this comment will no longer make sense,
so I'll remove it accordingly. The point was that CI might have a system-level
selftest timeout of something much smaller, like 500 millisecs (used to
have some control over the execution time), but the gsc-proxy wait is not in
i915's control; it depends on the kernel component driver loading flow (and on
rare occasions, after a fresh IFWI flash, a one-time longer period for
fw-proxy flows to complete).
In any case, I'll remove the comment as per your direction.

> 
> > +   if (timeout_ms < 8000)
> > +   timeout_ms = 8000;
> > +
> 

alan:snip
> > +static int __wait_on_all_system_dependencies(struct drm_i915_private *i915)
> > +{
> > +   struct __startup_waiter *waiter = all_startup_waiters;
> > +   int count = ARRAY_SIZE(all_startup_waiters);
> > +   int ret;
> > +
> > +   if (!waiter || !count || !i915)
> > +   return 0;
> 
> Ugh.
> 
> If it ever becomes an empty array just zap this whole code and not have 
> these checks.
Alan: Okay sure - will remove these check except the i915 - see below.
> 
> Also, no i915 is a possibility?
Alan: i915_mock_selftests passes in NULL for i915. This checking of the i915
aligns with the existing __run_selftests code - but in that function the
param is called "data" even though in all callers of __run_selftests, that
"data" is actually i915 when it's not NULL.
> 
> But actually.. please don't add the function table generalization unless 
> it is already known something else is coming to be plugged into it.
Alan: Okay- I'll remove it.

alan:snip
> 

> > @@ -134,6 +193,8 @@ static int __run_selftests(const char *name,
> >   {
> > int err = 0;
> >   
> > +   __wait_on_all_system_dependencies(data);
> 
> Why does this need to be top level selftests and not just a wait for 
> intel_gsc_uc_fw_proxy_init_done in the tests where it is relevant, via 
> some helper or something?
Alan: it was an offline decision because we didn't want to repeat
the same check for all permutations of selftests' subtests (i.e. considering
module params can dictate to skip some subtests but execute others).

Anyways, let me get back to you on how many selftests' subtests actually
exercise the need for proxy-init to complete - if it's just 1-to-2 subtests
I'll remove the code from here and move it into the individual subtests.

alan:snip
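The timeout handling under discussion (respecting a caller-supplied CI timeout but enforcing a floor that covers component-driver load ordering) is just a clamp. A minimal sketch, assuming the 8000 ms floor from the patch:

```c
#include <assert.h>

/*
 * Clamp a caller-supplied selftest timeout to the floor discussed above
 * (8000 ms), since gsc-proxy init depends on component-driver load
 * ordering that i915 does not control. Purely illustrative.
 */
static unsigned long clamp_proxy_timeout_ms(unsigned long timeout_ms)
{
	return timeout_ms < 8000 ? 8000 : timeout_ms;
}
```

Callers with a larger timeout keep it; only smaller (or zero) values get raised to the floor.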


Re: [Intel-gfx] [PATCH] drm/i915/pxp: Optimize GET_PARAM:PXP_STATUS

2023-06-29 Thread Teres Alexis, Alan Previn
On Tue, 2023-06-20 at 09:30 -0500, Balasubrawmanian, Vivaik wrote:
> On 6/1/2023 12:45 PM, Alan Previn wrote:
> > After recent discussions with Mesa folks, it was requested
> > that we optimize i915's GET_PARAM for the PXP_STATUS without
> > changing the UAPI spec.
> > 
> > This patch adds this additional optimizations:
> > - If any PXP initialization flow failed, then ensure that
> >   we catch it so that we can change the returned PXP_STATUS
> >   from "2" (i.e. 'PXP is supported but not yet ready')
> >   to "-ENODEV". This typically should not happen and if it
> >   does, we have a platform configuration issue.
> > - If a PXP arbitration session creation event failed
> >   due to incorrect firmware version or blocking SOC fusing
> >   or blocking BIOS configuration (platform reasons that won't
> >   change if we retry), then reflect that blockage by also
> >   returning -ENODEV in the GET_PARAM-PXP_STATUS.
> > - GET_PARAM:PXP_STATUS should not wait at all if PXP is
> >   supported but non-i915 dependencies (component-driver /
> >   firmware) we are still pending to complete the init flows.
> >   In this case, just return "2" immediately (i.e. 'PXP is
> >   supported but not yet ready').
> > 
> > Signed-off-by: Alan Previn 
> > ---
> >   drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.c  | 11 +-
> >   drivers/gpu/drm/i915/i915_getparam.c   |  2 +-
> >   drivers/gpu/drm/i915/pxp/intel_pxp.c   | 25 ++
> >   drivers/gpu/drm/i915/pxp/intel_pxp.h   |  2 +-
> >   drivers/gpu/drm/i915/pxp/intel_pxp_gsccs.c |  7 +++---
> >   drivers/gpu/drm/i915/pxp/intel_pxp_tee.c   |  7 +++---
> >   drivers/gpu/drm/i915/pxp/intel_pxp_types.h |  9 
> >   7 files changed, 50 insertions(+), 13 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.c 
> > b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.c
> > index fb0984f875f9..4dd744c96a37 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.c
> > @@ -42,8 +42,17 @@ static void gsc_work(struct work_struct *work)
> > }
> >   
> > ret = intel_gsc_proxy_request_handler(gsc);
> > -   if (ret)
> > +   if (ret) {
> > +   if (actions & GSC_ACTION_FW_LOAD) {
> > +   /*
> > +* a proxy request failure that came together 
> > with the
> > +* firmware load action means the last part of 
> > init has
> > +* failed so GSC fw won't be usable after this
> > +*/
> > +   intel_uc_fw_change_status(>fw, 
> > INTEL_UC_FIRMWARE_LOAD_FAIL);
> > +   }
> > goto out_put;
> > +   }
> >   
> > /* mark the GSC FW init as done the first time we run this */
> > if (actions & GSC_ACTION_FW_LOAD) {
> 
> On the huc authentication comment block above this snippet, the last 
> statement: "Note that we can only do the GSC auth if the GuC auth was" 
> is confusing as the code below is only dealing with HuC Authentication.
alan: i believe what he meant was "can only do the GSC-based auth if
the GuC-based auth"... but I can't change that code as part
of this patch - I believe the rule for kernel patches is to ensure each
single patch is modular (not mixing unrelated changes) and focuses just
on what it's described to do. IIRC, we would need to create a separate
patch review for that change.

> 
> This function seems to have a section to deal with FW load action and 
> another to deal with SW Proxy requests, but we seem to be mixing both 
> actions in the SW proxy section. instead, can we move this call to 
> intel_gsc_proxy_request_handler to the FW load section itself instead of 
> handling it as an additional check in the SW_proxy section? In the same 
> vein, we should also move the intel_uc_fw_change_status() call into the 
> above FW Load action section. i think that way the code reads better.
alan: GSC_ACTION_FW_LOAD is used for loading the GSC firmware, which is a
one-time thing per i915 load. However, GSC_ACTION_SW_PROXY events can happen
any time the GSC fw needs to communicate with the CSE firmware (or vice versa)
due to platform events that may not have been triggered by i915, long after
init. However, the rule is that after the GSC FW is loaded, i915 is required
to do a one-time proxy-init step to prime both the GSC and CSE fws so that
proxy comms are available. Without this step, we can't use the gsc-fw for
other ops.

So to recap the rules:
1. we launch the worker to do the one-time GSC firmware load.
2. after the GSC firmware load is successful, we have to do a one-time
SW-proxy init.
-> this is why we add the GSC_ACTION_SW_PROXY flag on successful load
completion.
3. If we are doing proxy-handling for the very first time, we ensure
   -> FW status is only set to RUNNING if proxy init was 

Re: [Intel-gfx] [PATCH v2 3/5] drm/i915/mtl/gsc: query the GSC FW for its compatibility version

2023-06-08 Thread Teres Alexis, Alan Previn
On Mon, 2023-06-05 at 19:24 -0700, Ceraolo Spurio, Daniele wrote:
> The compatibility version is queried via an MKHI command. Right now, the
> only existing interface is 1.0
> This is basically the interface version for the GSC FW, so the plan is
> to use it as the main tracked version, including for the binary naming
> in the fetch code.
> 
> v2: use define for the object size (Alan)
> 
> Cc: Alan Previn 
> Cc: John Harrison 
> Reviewed-by: Alan Previn  #v1


Diff'd v1 vs v2 of this patch - differences are good. So my R'B still holds:
Reviewed-by: Alan Previn 

> ---



Re: [Intel-gfx] [PATCH v1] drm/i915/gsc: take a wakeref for the proxy-init-completion check

2023-06-08 Thread Teres Alexis, Alan Previn
On Thu, 2023-06-08 at 11:19 -0700, Ceraolo Spurio, Daniele wrote:
> On 6/8/2023 11:04 AM, Alan Previn wrote:
> > Ensure intel_gsc_uc_fw_init_done and intel_gsc_uc_fw_proxy_init
> > takes a wakeref before reading GSC Shim registers.
alan:snip
> 
> >   bool intel_gsc_uc_fw_proxy_init_done(struct intel_gsc_uc *gsc)
> >   {
> > struct intel_uncore *uncore = gsc_uc_to_gt(gsc)->uncore;
> > -   u32 fw_status = intel_uncore_read(uncore, GSC_FW_STATUS_REG);
> > +   intel_wakeref_t wakeref;
> > +   u32 fw_status;
> > +
> > +   with_intel_runtime_pm(uncore->rpm, wakeref)
> > +   fw_status = intel_uncore_read(uncore, GSC_FW_STATUS_REG);
> 
> I think this could be moved to an helper (gsc_uc_get_fw_status?), so we 
> don't have to re-do the wakeref in all the callers.
alan: thanks - will fix.



Re: [Intel-gfx] [PATCH v2 2/5] drm/i915/mtl/gsc: extract release and security versions from the gsc binary

2023-06-08 Thread Teres Alexis, Alan Previn
Everything looks good to me, so
Reviewed-by: Alan Previn 

On Mon, 2023-06-05 at 19:23 -0700, Ceraolo Spurio, Daniele wrote:
> The release and security versions of the GSC binary are not used at
> runtime to decide interface compatibility (there is a separate version
> for that), but they're still useful for debug, so it is still worth
> extracting them and printing them out in dmesg.
alan:snip

> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_binary_headers.h 
> b/drivers/gpu/drm/i915/gt/uc/intel_gsc_binary_headers.h
> index 714f0c256118..ad80afcafd23 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_gsc_binary_headers.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_binary_headers.h
alan:snip


> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_fw.c 
> b/drivers/gpu/drm/i915/gt/uc/intel_gsc_fw.c
alan:snip


> @@ -42,6 +43,157 @@ bool intel_gsc_uc_fw_init_done(struct intel_gsc_uc *gsc)
alan:snip

> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_fw.h 
> b/drivers/gpu/drm/i915/gt/uc/intel_gsc_fw.h
alan:snip


> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.h 
> b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.h
alan:snip

> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_huc_fw.c 
> b/drivers/gpu/drm/i915/gt/uc/intel_huc_fw.c
alan:snip

> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_uc_fw.c 
> b/drivers/gpu/drm/i915/gt/uc/intel_uc_fw.c
alan:snip

> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_uc_fw.h 
> b/drivers/gpu/drm/i915/gt/uc/intel_uc_fw.h
alan:snip



>  /*
> @@ -289,6 +290,8 @@ static inline u32 intel_uc_fw_get_upload_size(struct 
> intel_uc_fw *uc_fw)
alan:snip


Re: [Intel-gfx] [PATCH] drm/i915/gsc: Fix error code in intel_gsc_uc_heci_cmd_submit_nonpriv()

2023-06-08 Thread Teres Alexis, Alan Previn
On Thu, 2023-06-08 at 11:37 +0300, Dan Carpenter wrote:
> On Wed, Jun 07, 2023 at 06:44:54PM +0000, Teres Alexis, Alan Previn wrote:
> > On Tue, 2023-06-06 at 21:32 +0300, Dan Carpenter wrote:
> > > On Tue, Jun 06, 2023 at 06:07:19PM +0000, Teres Alexis, Alan Previn wrote:
> > > > That was my bad, i could have sword i'd fixed that before the final 
> > > > rev. Thanks for fixing this.
> > > > nit: below function applies to MTL only which at the moment is still 
> > > > force-probed, so not sure if the fixes tag is significant.
> > > > 
> > > 
> > > The Fixes tag is correct.  It's definitely a bug fix.
> > > 
> > > (I invented the Fixes tag so technically that makes me the worlds #1
> > > expert on Fixes tags).
> > > 
> > So sorry my bad.
> 
> LOL.
> 
> Presumably the appology was intended sarcastically?  Hopefully?  Oh my
> goodness.  ROFL...
> 

lol - i'll take the 5th here - i should have phrased it differently - my bad.
 


Re: [Intel-gfx] [v2] drm/i915/selftest/gsc: Ensure GSC Proxy init completes before selftests

2023-06-08 Thread Teres Alexis, Alan Previn
On Thu, 2023-06-08 at 18:14 +, Dong, Zhanjun wrote:
> See my comments below.
> 
> > -Original Message-
> > From: Alan Previn 
alan:snip

> > +static int
> > +__wait_gsc_proxy_completed(struct drm_i915_private *i915,
> > +  unsigned long timeout_ms)
> > +{
> > +   bool need_to_wait = (IS_ENABLED(CONFIG_INTEL_MEI_GSC_PROXY)
> > &&
> > +i915->media_gt &&
> > +HAS_ENGINE(i915->media_gt, GSC0) &&
> > +intel_uc_fw_is_loadable(&i915->media_gt->uc.gsc.fw));
> > +
> > +   /*
> > +* For gsc proxy component loading + init, we need a much longer
> > timeout
> > +* than what CI selftest infrastrucutre currently uses. This longer wait
> > +* period depends on the kernel config and component driver load
> > ordering
> > +*/
> > +   if (timeout_ms < 8000)
> > +   timeout_ms = 8000;
> 
> 
> Lgtm, just an concern about the fixed number here, shall we set the minimal 
> here, or let i915_selftest.timeout_ms take control? Thus no longer need 
> coding change here in the future.
> 
> Reviewed-by: Zhanjun Dong 

Thanks Zhanjun, unfortunately, based on internal testing, the
i915_selftest.timeout_ms default is low enough that it does occasionally time
out for CI. From experience, with a lean ubuntu config, it typically takes
about ~1 second for the mei-gsc-sw-proxy component driver to load after i915
loads. Since CI regularly unloads and reloads i915, the timeout observed ends
up being reported as an issue.

8 seconds was based on internal testing of the worst case scenario - which
hardly ever happens. We've only seen the 8 seconds happen when the kernel
config has very many SOC IP drivers and component drivers enabled (seen in at
least one customer config) or if the MTL board IFWI was only just reflashed
(this would be a one-off 8 seconds, we suspect due to the firmware doing
additional steps).
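To make the clamp concrete, here is a minimal, self-contained sketch of the minimum-timeout logic discussed above (the helper name and standalone form are mine for illustration; the real check lives inside __wait_gsc_proxy_completed()):

```c
#include <assert.h>

/*
 * Illustrative floor for the GSC proxy-init wait: CI can pass a selftest
 * timeout that is shorter than the worst case observed for the
 * mei-gsc-sw-proxy component driver to load, so clamp it up to 8000 ms.
 */
#define GSC_PROXY_INIT_TIMEOUT_MS_MIN 8000UL

static unsigned long gsc_proxy_wait_timeout_ms(unsigned long requested_ms)
{
	if (requested_ms < GSC_PROXY_INIT_TIMEOUT_MS_MIN)
		return GSC_PROXY_INIT_TIMEOUT_MS_MIN;
	return requested_ms;
}
```

With i915_selftest.timeout_ms left at its default, the helper would return the 8-second floor; a caller asking for more than 8 seconds keeps its own value.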




Re: [Intel-gfx] [PATCH v3] drm/i915/mtl/gsc: Add a gsc_info debugfs

2023-06-07 Thread Teres Alexis, Alan Previn
On Mon, 2023-06-05 at 21:32 -0700, Ceraolo Spurio, Daniele wrote:
> Add a new debugfs to dump information about the GSC. This includes:
> 
> - the FW path and SW tracking status;
> - the release, security and compatibility versions;
> - the HECI1 status registers.
> 
> Note that those are the same registers that the mei driver dumps in
> their own status sysfs on DG2 (where mei owns the GSC).
> 
alan:snip


all looks good. (ofc we do have to follow up on those 2 actions from the last
rev's conversation that we agreed should be a separate patch.) For now, this
patch is
Reviewed-by: Alan Previn 



Re: [Intel-gfx] [PATCH v2 1/5] drm/i915/gsc: fixes and updates for GSC memory allocation

2023-06-07 Thread Teres Alexis, Alan Previn
On Mon, 2023-06-05 at 19:23 -0700, Ceraolo Spurio, Daniele wrote:
> A few fixes/updates are required around the GSC memory allocation and it
> is easier to do them all at the same time. The changes are as follows:
> 
> 1 - Switch the memory allocation to stolen memory. We need to avoid
> accesses from GSC FW to normal memory after the suspend function has
> completed and to do so we can either switch to using stolen or make sure
> the GSC is gone to sleep before the end of the suspend function. Given
> that the GSC waits for a bit before going idle even if there are no
> pending operations, it is easier and quicker to just use stolen memory.
> 
> 2 - Reduce the GSC allocation size to 4MBs, which is the POR requirement.
> The 8MBs were needed only for early FW and I had misunderstood that as
> being the expected POR size when I sent the original patch.
> 
> 3 - Perma-map the GSC allocation. This isn't required immediately, but it
> will be needed later to be able to quickly extract the GSC logs, which are
> inside the allocation. Since the mapping code needs to be rewritten due to
> switching to stolen, it makes sense to do the switch immediately to avoid
> having to change it again later.
> 
> Note that the explicit setting of CACHE_NONE for Wa_22016122933 has been
> dropped because that's the default setting for stolen memory on !LLC
> platforms.
> 
> v2: only memset the memory we're not overwriting (Alan)
> 
LGTM so ..
Reviewed-by: Alan Previn 


Re: [Intel-gfx] [PATCH] drm/i915/gsc: Fix error code in intel_gsc_uc_heci_cmd_submit_nonpriv()

2023-06-07 Thread Teres Alexis, Alan Previn
On Tue, 2023-06-06 at 21:32 +0300, Dan Carpenter wrote:
> On Tue, Jun 06, 2023 at 06:07:19PM +0000, Teres Alexis, Alan Previn wrote:
> > That was my bad, i could have sword i'd fixed that before the final rev. 
> > Thanks for fixing this.
> > nit: below function applies to MTL only which at the moment is still 
> > force-probed, so not sure if the fixes tag is significant.
> > 
> 
> The Fixes tag is correct.  It's definitely a bug fix.
> 
> (I invented the Fixes tag so technically that makes me the worlds #1
> expert on Fixes tags).
> 
So sorry my bad.


Re: [Intel-gfx] [PATCH] drm/i915/gsc: Fix error code in intel_gsc_uc_heci_cmd_submit_nonpriv()

2023-06-06 Thread Teres Alexis, Alan Previn
That was my bad, i could have sworn i'd fixed that before the final rev. Thanks
for fixing this.
nit: the below function applies to MTL only, which at the moment is still
force-probed, so not sure if the fixes tag is significant.

Reviewed-by: Alan Previn 

On Tue, 2023-06-06 at 11:22 +0300, Dan Carpenter wrote:
> This should return negative -EAGAIN instead of positive EAGAIN.
> 
> Fixes: e5e1e6d28ebc ("drm/i915/pxp: Add MTL helpers to submit Heci-Cmd-Packet 
> to GSC")
> Signed-off-by: Dan Carpenter 
> ---
>  drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_heci_cmd_submit.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_heci_cmd_submit.c 
> b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_heci_cmd_submit.c
> index 579c0f5a1438..42218ae6ef13 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_heci_cmd_submit.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_heci_cmd_submit.c
> @@ -202,7 +202,7 @@ intel_gsc_uc_heci_cmd_submit_nonpriv(struct intel_gsc_uc 
> *gsc,
>   if (++trials < 10)
>   goto retry;
>   else
> - err = EAGAIN;
> + err = -EAGAIN;



Re: [Intel-gfx] [PATCH 2/6] drm/i915/uc/gsc: fixes and updates for GSC memory allocation

2023-06-05 Thread Teres Alexis, Alan Previn
On Tue, 2023-05-23 at 08:21 -0700, Ceraolo Spurio, Daniele wrote:
> 
> > 
> > > +static int gsc_allocate_and_map_vma(struct intel_gsc_uc *gsc, u32 size)
> > alan:snip
> > > + obj = i915_gem_object_create_stolen(gt->i915, size);
> > > + if (IS_ERR(obj))
> > > + return PTR_ERR(obj);
> > > +
> > > + vma = i915_gem_object_ggtt_pin(obj, NULL, 0, 0, 0);
> > alan: should we be passing in the PIN_MAPPABLE flag into the last param?
> 
> No, PIN_MAPPABLE is only for legacy platform that used the aperture BAR 
> for stolen mem access via GGTT. MTL doesn't have it and stolen is 
> directly accessible via the LMEM BAR (which is actually the same BAR 2, 
> but now behaves differently).

alan: thanks - sounds good  - i forgot about those differentiations

> 
> > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.h
> > > @@ -18,6 +18,7 @@ struct intel_gsc_uc {
> > >   
> > >   /* GSC-specific additions */
> > >   struct i915_vma *local; /* private memory for GSC usage */
> > > + void __iomem *local_vaddr; /* pointer to access the private memory */
> > alan:nit: relooking at the these variable names that originate from
> > last year's patch you worked on introducing gsc_uc, i am wondering now
> > if we should rename "local" to "privmem" and local_vaddr becomes 
> > privmem_vaddr.
> > (no significant reason other than improving readibility of the code)
> 
> IIRC I used local because one of the GSC docs referred to it that way. I 
> don't mind the renaming, but I don't think it should be done as part of 
> this patch.
alan: sure - sounds good.




Re: [Intel-gfx] [PATCH 3/6] drm/i915/uc/gsc: extract release and security versions from the gsc binary

2023-06-05 Thread Teres Alexis, Alan Previn
On Fri, 2023-05-26 at 18:27 -0700, Ceraolo Spurio, Daniele wrote:
> 
> > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_meu_headers.h 
> > > b/drivers/gpu/drm/i915/gt/uc/intel_gsc_meu_headers.h
> > > index d55a66202576..8bce2b8aed84 100644
> > > --- a/drivers/gpu/drm/i915/gt/uc/intel_gsc_meu_headers.h
> > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_meu_headers.h
alan:snip

> > alan: structure layout seems unnecessarily repetitive... why not ->
> > struct partition_info {
> > u32 offset;
> > u32 size;
> > };
> > struct intel_gsc_layout_pointers {
> > u8 rom_bypass_vector[16];
> > ...
> > struct partition_info datap;
> > struct partition_info bootregion[5];
> > struct partition_info trace;
> > }__packed;
> > 
> 
> I've just realized that I didn't reply to this comment. The specs have 
> the structure defined that way, so I tried to keep a 1:1 match like we 
> usually do. I think switching to a partition_info structure is ok, but 
> I'll avoid the array because the GSC partition are 1-based, which could 
> cause confusion (i.e. partition boot1 would be bootregion[0]).
alan: sure - that's fine - let's stick to aligning with the spec definitions
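For illustration, the layout being discussed - kept with named 1-based boot fields rather than a 0-based array, per the conclusion above - could look roughly like this. The struct and field names here are a sketch, not the spec:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reusable offset/size pair for one partition entry. */
struct partition_info {
	uint32_t offset;
	uint32_t size;
} __attribute__((packed));

/* Sketch of the layout-pointers block, keeping the spec's 1-based
 * boot1..boot5 naming instead of bootregion[0..4]. */
struct gsc_layout_pointers_sketch {
	uint8_t rom_bypass_vector[16];
	struct partition_info datap;
	struct partition_info boot1;
	struct partition_info boot2;
	struct partition_info boot3;
	struct partition_info boot4;
	struct partition_info boot5;
	struct partition_info traceable;	/* the "trace" partition */
} __attribute__((packed));
```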



Re: [Intel-gfx] [PATCH 6/6] drm/i915/uc/gsc: Add a gsc_info debugfs

2023-06-05 Thread Teres Alexis, Alan Previn
On Wed, 2023-05-31 at 17:25 -0700, Ceraolo Spurio, Daniele wrote:
> 
> On 5/26/2023 3:57 PM, Teres Alexis, Alan Previn wrote:
> > On Fri, 2023-05-05 at 09:04 -0700, Ceraolo Spurio, Daniele wrote:
> > > Add a new debugfs to dump information about the GSC. This includes:
> > alan:snip
> > Actually everything looks good except for a couple of questions + asks - 
> > hope we can close on this patch in next rev.
> > 
> > > - the FW path and SW tracking status;
> > > - the release, security and compatibility versions;
> > > - the HECI1 status registers.
> > > 
> > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_fw.c 
> > > b/drivers/gpu/drm/i915/gt/uc/intel_gsc_fw.c
> > > index 0b6dcd982b14..3014e982aab2 100644
> > > --- a/drivers/gpu/drm/i915/gt/uc/intel_gsc_fw.c
> > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_fw.c
> > > @@ -12,36 +12,31 @@
> > >   #include "intel_gsc_fw.h"
> > >   #include "intel_gsc_meu_headers.h"
> > >   #include "intel_gsc_uc_heci_cmd_submit.h"
> > > -
> > > -#define GSC_FW_STATUS_REG_MMIO(0x116C40)
> > > -#define GSC_FW_CURRENT_STATE REG_GENMASK(3, 0)
> > > -#define   GSC_FW_CURRENT_STATE_RESET 0
> > > -#define   GSC_FW_PROXY_STATE_NORMAL  5
> > > -#define GSC_FW_INIT_COMPLETE_BIT REG_BIT(9)
> > > +#include "i915_reg.h"
> > >   
> > alan:snip
> >   
> > alan: btw, just to be consistent with other top-level "intel_foo_is..." 
> > checking functions,
> > why don't we take the runtime wakeref inside the following functions and 
> > make it easier for any callers?
> > (just like what we do for "intel_huc_is_authenticated"):
> >  static bool gsc_is_in_reset(struct intel_uncore *uncore)
> >  bool intel_gsc_uc_fw_proxy_init_done(struct intel_gsc_uc *gsc)
> >  bool intel_gsc_uc_fw_init_done(struct intel_gsc_uc *gsc)
> 
> The idea was that we shouldn't check the FW status if we're not planning 
> to do something with it, in which case we should already have a wakeref. 
> HuC is a special case because userspace can query that when the HW is 
> idle. This said, I have nothing against adding an extra wakeref , but I 
> don't think it should be in this patch.
alan: i believe intel_pxp_gsccs_is_ready_for_sessions is being used in a
similar way, where one of the uses is to check huc-status and gsc-proxy
status without actually doing any operation. If you still want to keep it
differently then I'll have to update the PXP code. Or perhaps you could
help me fix that on the PXP side?

alan:snip
> > 

> > > +void intel_gsc_uc_load_status(struct intel_gsc_uc *gsc, struct 
> > > drm_printer *p)
> > > +{
> > > + struct intel_gt *gt = gsc_uc_to_gt(gsc);
> > > + struct intel_uncore *uncore = gt->uncore;
> > > + intel_wakeref_t wakeref;
> > > +
> > > + if (!intel_gsc_uc_is_supported(gsc)) {
> > alan: this was already checked in caller so we'll never get here. i think 
> > we should remove the check in the caller, let below msg appear.
> 
> I did the same as what we do for GuC and HuC. I'd prefer to be 
> consistent in behavior with those.
alan: okay - sounds good
> 
> > > + drm_printf(p, "GSC not supported\n");
> > > + return;
> > > + }
> > alan:snip

> > > + drm_printf(p, "HECI1 FWSTST%u = 0x%08x\n", i, status);
> > alan:nit: do you we could add those additional shim regs? (seemed useful in 
> > recent offline debugs).
> 
> Agreed that it would be useful; I'll try to get a complete list from 
> arch and/or the GSC FW team. Are you ok if we go ahead with this in the 
> meantime?
alan: yes sure.
> 
> > 
> 



Re: [Intel-gfx] [PATCH] drm/i915/pxp: Optimize GET_PARAM:PXP_STATUS

2023-06-02 Thread Teres Alexis, Alan Previn
Thanks Jani - will rev this up and fix these.

On Fri, 2023-06-02 at 16:03 +0300, Jani Nikula wrote:
> On Thu, 01 Jun 2023, Alan Previn  wrote:
> > After recent discussions with Mesa folks, it was requested
> > that we optimize i915's GET_PARAM for the PXP_STATUS without
> > changing the UAPI spec.
> > 
> > This patch adds this additional optimizations:
> 
> Nitpick, please avoid "This patch". It's redundant, and after the patch
> gets applied it becomes a commit, not a patch.
> 
> Instead, use the imperative mood, e.g. "Add these additional
> optimizations".
> 
> See 
> https://docs.kernel.org/process/submitting-patches.html#describe-your-changes
alan:snip

> 
> > -int intel_pxp_get_readiness_status(struct intel_pxp *pxp)
> > +int intel_pxp_get_readiness_status(struct intel_pxp *pxp, int timeout)
> 
> It would help the reader if you named the parameter timeout_ms. Assuming
> the unit is ms.
alan:snip

> > -is_fw_err_platform_config(u32 type)
> > +is_fw_err_platform_config(u32 type, struct intel_pxp *pxp)
> 
> It's customary to have the parameters ordered from higher context to
> lower.
> 

alan:snip



Re: [Intel-gfx] [PATCH v5 5/7] drm/i915/mtl/huc: auth HuC via GSC

2023-06-01 Thread Teres Alexis, Alan Previn
On Wed, 2023-05-31 at 16:54 -0700, Ceraolo Spurio, Daniele wrote:
> The full authentication via the GSC requires an heci packet submission
> to the GSC FW via the GSC CS. The GSC has new PXP command for this
> (literally called NEW_HUC_AUTH).
> The intel_huc_auth function is also updated to handle both authentication
> types.
> 
> 
alan:snip

> @@ -399,6 +416,9 @@ void intel_huc_fini(struct intel_huc *huc)
>*/
>   delayed_huc_load_fini(huc);
>  
> + if (huc->heci_pkt)
> + i915_vma_unpin_and_release(&huc->heci_pkt, 0);
alan: nit: i just realized that for consistency (init mirroring fini), we
should be doing this heci releasing after the intel_uc_fw_fini below.
But since the below object isn't referencing the heci packet, this doesn't
matter, so consider it a nit.

alan:snip
> @@ -470,31 +491,41 @@ int intel_huc_auth(struct intel_huc *huc)
>   if (!intel_uc_fw_is_loaded(>fw))
>   return -ENOEXEC;
>  
> - /* GSC will do the auth */
> + /* GSC will do the auth with the load */
>   if (intel_huc_is_loaded_by_gsc(huc))
>   return -ENODEV;
alan: nit: sorry for another late comment but merely a nit - wondering if we
should add a warn in here (to catch if we could end up here) but it's okay -
this in theory shouldn't happen anyway.

alan:snip


> +++ b/drivers/gpu/drm/i915/gt/uc/intel_huc_fw.c
alan:snip



Only a couple of late-comer-nits that you can ignore, else LGTM:
Reviewed-by: Alan Previn 


Re: [Intel-gfx] [PATCH] drm/i915/pxp: use correct format string for size_t

2023-06-01 Thread Teres Alexis, Alan Previn
On Thu, 2023-06-01 at 23:36 +0200, Arnd Bergmann wrote:
> From: Arnd Bergmann 
> 
> While 'unsigned long' needs the %ld format string, size_t needs the %z
> modifier:

alan:snip


> +++ b/drivers/gpu/drm/i915/pxp/intel_pxp_gsccs.c
> @@ -143,7 +143,7 @@ gsccs_send_message(struct intel_pxp *pxp,
>  
>   reply_size = header->message_size - sizeof(*header);
>   if (reply_size > msg_out_size_max) {
> - drm_warn(&i915->drm, "caller with insufficient PXP reply size %u (%ld)\n",
> + drm_warn(&i915->drm, "caller with insufficient PXP reply size %u (%zd)\n",
>reply_size, msg_out_size_max);
>   reply_size = msg_out_size_max;
>   }
Thanks Arnd for catching this, although i believe Nathan submitted a patch for
the same fix yesterday and received an RB from Andi.


Re: [Intel-gfx] [PATCH v3 5/7] drm/i915/mtl/huc: auth HuC via GSC

2023-05-30 Thread Teres Alexis, Alan Previn
On Fri, 2023-05-26 at 17:52 -0700, Ceraolo Spurio, Daniele wrote:
> The full authentication via the GSC requires an heci packet submission
> to the GSC FW via the GSC CS. The GSC has new PXP command for this
> (literally called NEW_HUC_AUTH).
> The intel_huc_auth fuction is also updated to handle both authentication
> types.
> 


alan:snip

> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.c 
> b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.c
> index b26f493f86fa..c659cc01f32f 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.c
> @@ -29,13 +29,32 @@ static void gsc_work(struct work_struct *work)
>  
>   if (actions & GSC_ACTION_FW_LOAD) {
>   ret = intel_gsc_uc_fw_upload(gsc);
> - if (ret == -EEXIST) /* skip proxy if not a new load */
> - actions &= ~GSC_ACTION_FW_LOAD;
> - else if (ret)
> + if (!ret)
> + /* setup proxy on a new load */
> + actions |= GSC_ACTION_SW_PROXY;
> + else if (ret != -EEXIST)
>   goto out_put;
alan: I realize that this worker doesn't print error values that
come back from intel_gsc_uc_fw_upload - wonder if we should print
it before the goto out_put.

> +
> + /*
> +  * The HuC auth can be done both before or after the proxy init;
> +  * if done after, a proxy request will be issued and must be
> +  * serviced before the authentication can complete.
> +  * Since this worker also handles proxy requests, we can't
> +  * perform an action that requires the proxy from within it and
> +  * then stall waiting for it, because we'd be blocking the
> +  * service path. Therefore, it is easier for us to load HuC
> +  * first and do proxy later. The GSC will ack the HuC auth and
> +  * then send the HuC proxy request as part of the proxy init
> +  * flow.
> +  * Note that we can only do the GSC auth if the GuC auth was
> +  * successful.
> +  */
> + if (intel_uc_uses_huc(&gt->uc) &&
> + intel_huc_is_authenticated(&gt->uc.huc, INTEL_HUC_AUTH_BY_GUC))
> + intel_huc_auth(&gt->uc.huc, INTEL_HUC_AUTH_BY_GSC);
>   }
>  
> - if (actions & (GSC_ACTION_FW_LOAD | GSC_ACTION_SW_PROXY)) {
> + if (actions & GSC_ACTION_SW_PROXY) {
>   if (!intel_gsc_uc_fw_init_done(gsc)) {
>   gt_err(gt, "Proxy request received with GSC not 
> loaded!\n");
>   goto out_put;
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_heci_cmd_submit.c 
> b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_heci_cmd_submit.c
> index 579c0f5a1438..0ad090304ca0 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_heci_cmd_submit.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_heci_cmd_submit.c
> 
alan:snip

> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_huc.c 
> b/drivers/gpu/drm/i915/gt/uc/intel_huc.c
> index ab5246ae3979..5a4058d39550 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_huc.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_huc.c
> 

alan:snip

> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_huc_fw.c 
> b/drivers/gpu/drm/i915/gt/uc/intel_huc_fw.c
> index d2b4176c81d6..8e538d639b05 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_huc_fw.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_huc_fw.c
> 
alan:snip


> +int intel_huc_fw_auth_via_gsccs(struct intel_huc *huc)
> +{
> + struct drm_i915_gem_object *obj;
> + void *pkt_vaddr;
> + u64 pkt_offset;
> + int retry = 5;
> + int err = 0;
> +
> + if (!huc->heci_pkt)
> + return -ENODEV;
> +
> + obj = huc->heci_pkt->obj;
> + pkt_offset = i915_ggtt_offset(huc->heci_pkt);
> +
> + pkt_vaddr = i915_gem_object_pin_map_unlocked(obj,
> +  
> i915_coherent_map_type(i915, obj, true));
> + if (IS_ERR(pkt_vaddr))
> + return PTR_ERR(pkt_vaddr);
> +
> + msg_in = pkt_vaddr;
> + msg_out = pkt_vaddr + SZ_4K;
this 4K*2 (4K for input and 4K for output) seems to be hardcoded in two
different locations. One is here in intel_huc_fw_auth_via_gsccs and the other
is in intel_huc_init, which is in a different file. Perhaps it's better to use
a #define next to the definition of PXP43_CMDID_NEW_HUC_AUTH... maybe
something like a "#define PXP43_NEW_HUC_AUTH_INOUT_PKT_SIZE (SZ_4K)"
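A rough sketch of what that shared define could look like - the names and the userspace SZ_4K stand-in below are assumptions for illustration, not the actual kernel header:

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's SZ_4K. */
#define SZ_4K 0x1000u

/* One define, usable by both intel_huc_init() (allocation size) and
 * intel_huc_fw_auth_via_gsccs() (msg_out offset). */
#define PXP43_NEW_HUC_AUTH_INOUT_PKT_SIZE (SZ_4K)
#define HUC_AUTH_HECI_PKT_SIZE (2 * PXP43_NEW_HUC_AUTH_INOUT_PKT_SIZE)

/* input message sits at offset 0, output in the second 4K half */
static size_t huc_auth_msg_out_offset(void)
{
	return PXP43_NEW_HUC_AUTH_INOUT_PKT_SIZE;
}
```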
> +
> + intel_gsc_uc_heci_cmd_emit_mtl_header(&msg_in->header,
> +   HECI_MEADDRESS_PXP,
> +   sizeof(*msg_in), 0);
> +
> + msg_in->huc_in.header.api_version = PXP_APIVER(4, 3);
> + msg_in->huc_in.header.command_id = PXP43_CMDID_NEW_HUC_AUTH;
> + msg_in->huc_in.header.status = 0;
> + msg_in->huc_in.header.buffer_len = sizeof(msg_in->huc_in) -
> +   

Re: [Intel-gfx] [PATCH 6/6] drm/i915/uc/gsc: Add a gsc_info debugfs

2023-05-26 Thread Teres Alexis, Alan Previn
On Fri, 2023-05-05 at 09:04 -0700, Ceraolo Spurio, Daniele wrote:
> Add a new debugfs to dump information about the GSC. This includes:
alan:snip
Actually everything looks good except for a couple of questions + asks - hope 
we can close on this patch in next rev.

> 
> - the FW path and SW tracking status;
> - the release, security and compatibility versions;
> - the HECI1 status registers.
> 
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_fw.c 
> b/drivers/gpu/drm/i915/gt/uc/intel_gsc_fw.c
> index 0b6dcd982b14..3014e982aab2 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_gsc_fw.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_fw.c
> @@ -12,36 +12,31 @@
>  #include "intel_gsc_fw.h"
>  #include "intel_gsc_meu_headers.h"
>  #include "intel_gsc_uc_heci_cmd_submit.h"
> -
> -#define GSC_FW_STATUS_REG_MMIO(0x116C40)
> -#define GSC_FW_CURRENT_STATE REG_GENMASK(3, 0)
> -#define   GSC_FW_CURRENT_STATE_RESET 0
> -#define   GSC_FW_PROXY_STATE_NORMAL  5
> -#define GSC_FW_INIT_COMPLETE_BIT REG_BIT(9)
> +#include "i915_reg.h"
>  
alan:snip
 
alan: btw, just to be consistent with other top-level "intel_foo_is..." 
checking functions,
why don't we take the runtime wakeref inside the following functions and make 
it easier for any callers?
(just like what we do for "intel_huc_is_authenticated"):
static bool gsc_is_in_reset(struct intel_uncore *uncore)
bool intel_gsc_uc_fw_proxy_init_done(struct intel_gsc_uc *gsc)
bool intel_gsc_uc_fw_init_done(struct intel_gsc_uc *gsc)
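The pattern being asked for can be mocked in a few lines - a status helper that grabs and drops its own wakeref so callers don't need one. The mock_* names below are obviously stand-ins, not the real runtime-PM API:

```c
#include <assert.h>
#include <stdbool.h>

static int mock_wakeref_count;

static int mock_runtime_pm_get(void)
{
	return ++mock_wakeref_count;	/* pretend this powers up the GT */
}

static void mock_runtime_pm_put(int wakeref)
{
	(void)wakeref;
	--mock_wakeref_count;		/* and this releases it */
}

static bool mock_read_fw_init_done_reg(void)
{
	return true;	/* pretend the FW status register reads "done" */
}

/* The helper takes the wakeref itself, like intel_huc_is_authenticated(). */
static bool gsc_fw_init_done_self_waking(void)
{
	int wakeref = mock_runtime_pm_get();
	bool done = mock_read_fw_init_done_reg();

	mock_runtime_pm_put(wakeref);
	return done;
}
```

The caller then needs no wakeref of its own, at the cost of a get/put on every status check.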


> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.c 
> b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.c
> index 2ae693b01b49..5475e95d61c6 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.c
> @@ -9,8 +9,9 @@
>  #include "gt/intel_gt_print.h"
>  #include "intel_gsc_uc.h"
alan: nit: not this patch's fault, alphabetically, intel_gsc_uc.h is after 
intel_gsc_proxy.h
>  #include "intel_gsc_fw.h"
> -#include "i915_drv.h"
>  #include "intel_gsc_proxy.h"
> +#include "i915_drv.h"
> +#include "i915_reg.h"
>  
>  static void gsc_work(struct work_struct *work)
>  {
> @@ -301,3 +302,46 @@ void intel_gsc_uc_load_start(struct intel_gsc_uc *gsc)
>   queue_work(gsc->wq, >work);
>  }
> +
alan: btw, why are we putting intel_gsc_uc_load_status in intel_gsc_uc.c if the
only caller is gsc_uc's debugfs? why not just make it static in there? unless
you plan to call it from "err_print_uc" - then can we add that in the next rev?

> +void intel_gsc_uc_load_status(struct intel_gsc_uc *gsc, struct drm_printer 
> *p)
> +{
> + struct intel_gt *gt = gsc_uc_to_gt(gsc);
> + struct intel_uncore *uncore = gt->uncore;
> + intel_wakeref_t wakeref;
> +
> + if (!intel_gsc_uc_is_supported(gsc)) {
alan: this was already checked in caller so we'll never get here. i think we 
should remove the check in the caller, let below msg appear.
> + drm_printf(p, "GSC not supported\n");
> + return;
> + }
alan:snip



> + drm_printf(p, "HECI1 FWSTST%u = 0x%08x\n", i, status);
alan:nit: do you think we could add those additional shim regs? (seemed useful
in recent offline debugs).
alan:snip

> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_debugfs.c 
> b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_debugfs.c
> new file mode 100644
> index ..da9f96b72291
> --- /dev/null
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_debugfs.c
> @@ -0,0 +1,38 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2020 Intel Corporation
alan:2023?

alan:snip

> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_debugfs.h 
> b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_debugfs.h
> new file mode 100644
> index ..c405e5574253
> --- /dev/null
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_debugfs.h
> @@ -0,0 +1,14 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2020 Intel Corporation
alan:2023?
alan:snip

> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_huc.c 
> b/drivers/gpu/drm/i915/gt/uc/intel_huc.c
alan:snip

 
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_uc_debugfs.c 
> b/drivers/gpu/drm/i915/gt/uc/intel_uc_debugfs.c
> index 2f93cc4e408a..6d541c866edb 100644
alan:snip





Re: [Intel-gfx] [PATCH 5/6] drm/i915/uc/gsc: define gsc fw

2023-05-25 Thread Teres Alexis, Alan Previn
Considering the only request i have below is touching up existing comments (as
far as this patch is concerned), and since the rest of the code looks good,
here is my R-b - but i hope you can answer my newbie question at the bottom:

Reviewed-by: Alan Previn 

On Fri, 2023-05-05 at 09:04 -0700, Ceraolo Spurio, Daniele wrote:
> Add FW definition and the matching override modparam.
> 
> The GSC FW has both a release version, based on platform and a rolling
> counter, and a compatibility version, which is the one tracking
> interface changes. Since what we care about is the interface, we use
> the compatibility version in the buinary names.
alan: s/buinary/binary

> 
> Same as with the GuC, a major version bump indicate a
> backward-incompatible change, while a minor version bump indicates a
> backward-compatible one, so we use only the former in the file name.
> 
> Signed-off-by: Daniele Ceraolo Spurio 
> Cc: Alan Previn 
> Cc: John Harrison 
> ---
>  drivers/gpu/drm/i915/gt/uc/intel_uc_fw.c | 32 ++--
>  1 file changed, 24 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_uc_fw.c 
> b/drivers/gpu/drm/i915/gt/uc/intel_uc_fw.c
> index 36ee96c02d74..531cd172151d 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_uc_fw.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_uc_fw.c
> @@ -124,6 +124,18 @@ void intel_uc_fw_change_status(struct intel_uc_fw *uc_fw,
>   fw_def(BROXTON,  0, huc_mmp(bxt,  2, 0, 0)) \
>   fw_def(SKYLAKE,  0, huc_mmp(skl,  2, 0, 0))
>  
> +/*
> + * The GSC FW has both a release version, based on platform and a rolling
> + * counter, and a compatibility version, which is the one tracking
> + * interface changes. Since what we care about is the interface, we use
> + * the compatibility version in the buinary names.
alan:s/buinary/binary
also, since we will (i hope) be adding several comments (alongside the new
version objects under intel_gsc_uc structure) in the patch #3 about what
their differences are and which one we care about and when they get populated,
perhaps we can minimize the information here and redirect to that other
comment... OR ... we can minimize the comments in patch #3 and redirect here
(will be good to have a single location with detailed explaination in the
comments and a redirect-ptr from the other location since a reader would
most likely stumble onto those questions from either of these locations).

> + * Same as with the GuC, a major version bump indicate a
> + * backward-incompatible change, while a minor version bump indicates a
> + * backward-compatible one, so we use only the former in the file name.
> + */
> +#define INTEL_GSC_FIRMWARE_DEFS(fw_def, gsc_def) \
> + fw_def(METEORLAKE,   0, gsc_def(mtl, 1, 0))
> +
>  /*
>  
> 
alan:snip

> @@ -257,14 +281,6 @@ __uc_fw_auto_select(struct drm_i915_private *i915, 
> struct intel_uc_fw *uc_fw)
>   int i;
>   bool found;
>  
> - /*
> -  * GSC FW support is still not fully in place, so we're not defining
> -  * the FW blob yet because we don't want the driver to attempt to load
> -  * it until we're ready for it.
> -  */
> - if (uc_fw->type == INTEL_UC_FW_TYPE_GSC)
> - return;
> -
alan: more of a newbie question from myself: considering this is a new firmware
we are adding, is there some kind of requirement to provide a link to the patch
targetting the linux firmware repo that is a dependency of this series?
or perhaps we should mention in the series that merge will only happen after
that patch gets merged (with a final rev that includes the patch on
the fw-repo side?). Just trying to understand the process.


>   /*
>* The only difference between the ADL GuC FWs is the HWConfig support.
>* ADL-N does not support HWConfig, so we should use the same binary as



Re: [Intel-gfx] [PATCH 4/6] drm/i915/uc/gsc: query the GSC FW for its compatibility version

2023-05-25 Thread Teres Alexis, Alan Previn
On Fri, 2023-05-05 at 09:04 -0700, Ceraolo Spurio, Daniele wrote:
> The compatibility version is queried via an MKHI command. Right now, the
> only existing interface is 1.0
> This is basically the interface version for the GSC FW, so the plan is
> to use it as the main tracked version, including for the binary naming
> in the fetch code.
> 
> Signed-off-by: Daniele Ceraolo Spurio 
> 

alan: just a couple of minor nits below.
One ask though, in line with the clarification we had over the offline
conversation,
I am wondering if we can document the fact that the file_selected.ver remains 
as major-minor::zero-zero
for the case of gsc until after the firmware is loaded and we query via this 
function (which happens
later at gt-late-init). However, that comment might not belong here - perhaps 
it belongs in the prior
patch together with the other comment i requested (asking for additional
explanations about the
different types of versions for gsc).

That said, for this patch, LGTM:
Reviewed-by: Alan Previn 

alan:snip
> +static int gsc_fw_query_compatibility_version(struct intel_gsc_uc *gsc)
> +{
> + struct intel_gt *gt = gsc_uc_to_gt(gsc);
> + struct mtl_gsc_ver_msg_in *msg_in;
> + struct mtl_gsc_ver_msg_out *msg_out;
> + struct i915_vma *vma;
> + u64 offset;
> + void *vaddr;
> + int err;
> +
> + err = intel_guc_allocate_and_map_vma(&gt->uc.guc, PAGE_SIZE * 2,
> +  &vma, &vaddr);
alan: nit: i'm assuming this code will be used for future discrete cards... if
so, perhaps we should also be using "SZ_4K * 2" above, since a different
host cpu arch could have a different PAGE size - this way we'll be consistent
with exact size allocations. also it's more consistent in this function -
maybe a #define GSC_UC_GET_ABI_VER_PKT_SIZE SZ_4K at the top of the function
is nice. either way, i consider this a nit.
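To make the nit concrete, a fixed-size define could replace the PAGE_SIZE * 2 along these lines (all names here, and the userspace SZ_4K stand-in, are made up for the sketch):

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's SZ_4K. */
#define SZ_4K 0x1000u

/* Fixed allocation for the version query: one 4K half for the request,
 * one for the reply, independent of the host PAGE_SIZE. */
#define GSC_UC_GET_ABI_VER_PKT_SIZE   SZ_4K
#define GSC_UC_GET_ABI_VER_ALLOC_SIZE (2 * GSC_UC_GET_ABI_VER_PKT_SIZE)

static size_t gsc_ver_msg_out_offset(void)
{
	return GSC_UC_GET_ABI_VER_PKT_SIZE;	/* reply lands in the second half */
}
```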

> + if (err) {
> + gt_err(gt, "failed to allocate vma for GSC version query\n");
> + return err;
> + }

alan:snip

> +
>  int intel_gsc_uc_fw_upload(struct intel_gsc_uc *gsc)
>  {
>   struct intel_gt *gt = gsc_uc_to_gt(gsc);
> @@ -327,11 +406,21 @@ int intel_gsc_uc_fw_upload(struct intel_gsc_uc *gsc)
>   if (err)
>   goto fail;
>  
> + err = gsc_fw_query_compatibility_version(gsc);
> + if (err)
> + goto fail;
> +
> + /* we only support compatibility version 1.0 at the moment */
> + err = intel_uc_check_file_version(gsc_fw, NULL);
> + if (err)
> + goto fail;
> +
>   /* FW is not fully operational until we enable SW proxy */
>   intel_uc_fw_change_status(gsc_fw, INTEL_UC_FIRMWARE_TRANSFERRED);
>  
> - gt_info(gt, "Loaded GSC firmware %s (r%u.%u.%u.%u, svn%u)\n",
> + gt_info(gt, "Loaded GSC firmware %s (cv%u.%u, r%u.%u.%u.%u, svn %u)\n",
alan:nit "abi" instead of "cv"?
>   gsc_fw->file_selected.path,
> + gsc_fw->file_selected.ver.major, 
> gsc_fw->file_selected.ver.minor,
>   gsc->release.major, gsc->release.minor,
>   gsc->release.patch, gsc->release.build,
>   gsc->security_version);
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_heci_cmd_submit.h 
> b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_heci_cmd_submit.h
> index 8f199d5f963e..fb1453ed4ecf 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_heci_cmd_submit.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc_heci_cmd_submit.h
alan:snip

> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_uc_fw.c 
> b/drivers/gpu/drm/i915/gt/uc/intel_uc_fw.c
> index cd8fc194f7fa..36ee96c02d74 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_uc_fw.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_uc_fw.c
alan:snip

> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_uc_fw.h 
> b/drivers/gpu/drm/i915/gt/uc/intel_uc_fw.h
> index 279244744d43..4406e7b48b27 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_uc_fw.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_uc_fw.h
alan:snip


Re: [Intel-gfx] [PATCH 3/6] drm/i915/uc/gsc: extract release and security versions from the gsc binary

2023-05-25 Thread Teres Alexis, Alan Previn
On Thu, 2023-05-25 at 09:56 -0700, Ceraolo Spurio, Daniele wrote:
> On 5/24/2023 10:14 PM, Teres Alexis, Alan Previn wrote:
> > On Fri, 2023-05-05 at 09:04 -0700, Ceraolo Spurio, Daniele wrote:
alan:snip
> > > --- a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.h
> > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.h
> > > @@ -17,6 +17,9 @@ struct intel_gsc_uc {
> > >   struct intel_uc_fw fw;
> > > 
> > >   /* GSC-specific additions */
> > > + struct intel_uc_fw_ver release;
> > > + u32 security_version;
> > alan: for consistency and less redundancy, can't we add "security_version"
> > into 'struct intel_uc_fw_ver' (which is zero for firmware that doesn't
> > have it). That way, intel_gsc_uc can re-use intel_uc_fw.file_selected
> > just like huc?
> 
> I'm not sure what you mean by re-using intel_uc_fw.file_selected. Is 
> that for the call from intel_uc_fw_version_from_meu_manifest? I'm 
> purposely not doing that. Note that the GSC has 3 versions:
> 
> Release version (incremented with each build and encoded in the header)
> Security version (also encoded in the header)
> Compatibility version (queried via message to the GSC)
> 
> The one we care about for communicating with the GSC is the last one, so 
> that's the one I stored in intel_uc_fw.file_selected (in the next 
> patch). The other 2  versions are not strictly required to use the GSC 
> and we only fetch them for debug purposes, so if something goes wrong we 
> know exactly what we've loaded.
> 
> Daniele
alan: okay thanks - seeing that now in the next patch... (and I also forgot that
the GSC release version doesn't reflect interface versioning in any way like the
GuC's does). In that case, the above additional versions are fine. Would
definitely love to see additional comments under "GSC-specific additions" that
explain those 3 versioning items and what we care about, as you have explained
here.


Re: [Intel-gfx] [PATCH 3/6] drm/i915/uc/gsc: extract release and security versions from the gsc binary

2023-05-24 Thread Teres Alexis, Alan Previn
On Fri, 2023-05-05 at 09:04 -0700, Ceraolo Spurio, Daniele wrote:


alan: snip


> +int intel_gsc_fw_get_binary_info(struct intel_uc_fw *gsc_fw, const void 
> *data, size_t size)
> +{
alan:snip
> + /*
> +  * The GSC binary starts with the pointer layout, which contains the
> +  * locations of the various partitions of the binary. The one we're
> +  * interested in to get the version is the boot1 partition, where we can
> +  * find a BPDT header followed by entries, one of which points to the
> +  * RBE sub-section of the partition. From here, we can parse the CPD
alan: nit: could we add the meaning of 'RBE', probably not here but in the 
header file where GSC_RBE is defined?
> +  * header and the following entries to find the manifest location
> +  * (entry identified by the "RBEP.man" name), from which we can finally
> +  * extract the version.
> +  *
> +  * --
> +  * [  intel_gsc_layout_pointers ]
> +  * [  ...   ]
> +  * [  boot1_offset  >---]--o
> +  * [  ...   ]  |
> +  * --  |
> +  * |
> +  * --  |
> +  * [  intel_gsc_bpdt_header ]<-o
> +  * --
alan: special thanks for the diagram - love these! :)
alan: snip

> + min_size = layout->boot1_offset + layout->boot1_offset > size;
alan: due to operator precedence, the '>' comparison makes the whole RHS
evaluate to 0 or 1, so min_size ends up a boolean - is this a typo and
supposed to be:
min_size = layout->boot1_offset;
actually, since we are accessing a bpdt_header hanging off that offset, it
should rather be:
min_size = layout->boot1_offset + sizeof(*bpdt_header);
> + if (size < min_size) {
> + gt_err(gt, "GSC FW too small for boot section! %zu < %zu\n",
> +size, min_size);
> + return -ENODATA;
> + }
> +
> + bpdt_header = data + layout->boot1_offset;
> + if (bpdt_header->signature != INTEL_GSC_BPDT_HEADER_SIGNATURE) {
> + gt_err(gt, "invalid signature for meu BPDT header: 0x%08x!\n",
> +bpdt_header->signature);
> + return -EINVAL;
> + }
> +
alan: IIUC, to be strict about the size-crawl checking, we should check min_size
again - this time adding "bpdt_header->descriptor_count * sizeof(*bpdt_entry)".
(hope I got that right?) - and do that check before walking through each
(bpdt_entry++)->type below.
> + bpdt_entry = (void *)bpdt_header + sizeof(*bpdt_header);
> + for (i = 0; i < bpdt_header->descriptor_count; i++, bpdt_entry++) {
> + if ((bpdt_entry->type & INTEL_GSC_BPDT_ENTRY_TYPE_MASK) !=
> + INTEL_GSC_BPDT_ENTRY_TYPE_GSC_RBE)
> + continue;
> +
> + cpd_header = (void *)bpdt_header + 
> bpdt_entry->sub_partition_offset;
> + break;
> + }
> +
> + if (!cpd_header) {
> + gt_err(gt, "couldn't find CPD header in GSC binary!\n");
> + return -ENODATA;
> + }
alan: same as above - for size-crawl checking, we should re-check the size
before dereferencing cpd_header (its offset + sizeof(*cpd_header)), no?
> +
> + if (cpd_header->header_marker != INTEL_GSC_CPD_HEADER_MARKER) {
> + gt_err(gt, "invalid marker for meu CPD header in GSC bin: 
> 0x%08x!\n",
> +cpd_header->header_marker);
> + return -EINVAL;
> + }
alan: and again here, the size-crawl check with the addition of
cpd_header->num_of_entries * sizeof(*cpd_entry).
> + cpd_entry = (void *)cpd_header + cpd_header->header_length;
> + for (i = 0; i < cpd_header->num_of_entries; i++, cpd_entry++) {
> + if (strcmp(cpd_entry->name, "RBEP.man") == 0) {
> + manifest = (void *)cpd_header + 
> cpd_entry_offset(cpd_entry);
> + intel_uc_fw_version_from_meu_manifest(>release,
> +   manifest);
> + gsc->security_version = manifest->security_version;
> + break;
> + }
> + }
> +
> + return 0;

alan:snip

>  
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_fw.h 
> b/drivers/gpu/drm/i915/gt/uc/intel_gsc_fw.h
> index fff8928218df..8d7b9e4f1ffc 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_gsc_fw.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_fw.h
alan:snip


> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_meu_headers.h 
> b/drivers/gpu/drm/i915/gt/uc/intel_gsc_meu_headers.h
> index d55a66202576..8bce2b8aed84 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_gsc_meu_headers.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_meu_headers.h
alan:snip



> +struct intel_gsc_layout_pointers {
> + u8 

Re: [Intel-gfx] ✗ Fi.CI.IGT: failure for drm/i915/selftest/gsc: Ensure GSC Proxy init completes before selftests

2023-05-24 Thread Teres Alexis, Alan Previn
The error below seems unrelated to the change in this patch. In fact, the test
below fails on APL, which won't exercise the patch's code change.
However, from internal testing we did see cases where CI's selftest timeout is
lower than what GSC proxy init requires to complete (i.e. the selftest would
bail with a timeout because it was waiting for the GSC proxy to complete, but
the completion came in after the timeout).

So this might need a re-rev anyway with a longer timeout (proxy init typically
takes longer on first boot, not at driver-reload or suspend-resume).

...alan

On Sat, 2023-05-13 at 04:55 +, Patchwork wrote:
> Patch Details
> Series: drm/i915/selftest/gsc: Ensure GSC Proxy init completes before 
> selftests
> URL:https://patchwork.freedesktop.org/series/117713/
> State:  failure
> Details:
> https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_117713v1/index.html
> CI Bug Log - changes from CI_DRM_13143_full -> Patchwork_117713v1_full
> Summary
> 
> FAILURE
> 
> Serious unknown changes coming with Patchwork_117713v1_full absolutely need 
> to be
> verified manually.
> 
> If you think the reported changes have nothing to do with the changes
> introduced in Patchwork_117713v1_full, please notify your bug team to allow 
> them
> to document this new failure mode, which will reduce false positives in CI.
> 
> Participating hosts (7 -> 7)
> 
> No changes in participating hosts
> 
> Possible new issues
> 
> Here are the unknown changes that may have been introduced in 
> Patchwork_117713v1_full:
> 
> IGT changes
> Possible regressions
> 
>   *   igt@kms_plane@plane-panning-bottom-right-suspend@pipe-b-planes:
>  *   shard-apl: 
> PASS
>  -> 
> ABORT
> 
alan:snip


Re: [Intel-gfx] [PATCH 2/6] drm/i915/uc/gsc: fixes and updates for GSC memory allocation

2023-05-22 Thread Teres Alexis, Alan Previn
On Fri, 2023-05-05 at 09:04 -0700, Ceraolo Spurio, Daniele wrote:
> A few fixes/updates are required around the GSC memory allocation and it
> is easier to do them all at the same time. The changes are as follows:


alan:snip

> @@ -109,38 +110,21 @@ static int gsc_fw_load_prepare(struct intel_gsc_uc *gsc)
>  {
>   struct intel_gt *gt = gsc_uc_to_gt(gsc);
>   struct drm_i915_private *i915 = gt->i915;
> - struct drm_i915_gem_object *obj;
> - void *src, *dst;
> + void *src;


alan:snip

>  
> - memset(dst, 0, obj->base.size);
> - memcpy(dst, src, gsc->fw.size);
> + memset_io(gsc->local_vaddr, 0, gsc->local->size);
> + memcpy_toio(gsc->local_vaddr, src, gsc->fw.size);
alan: I wonder if there is a benefit to doing the memcpy_toio first
and then the memset_io only for the remainder of the area, i.e. from
offset 'gsc->fw.size' for (gsc->local->size - gsc->fw.size) bytes.

alan:snip

> --- a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.c
> @@ -130,26 +130,85 @@ void intel_gsc_uc_init_early(struct intel_gsc_uc *gsc)
>   }
>  }
>  
> +static int gsc_allocate_and_map_vma(struct intel_gsc_uc *gsc, u32 size)
alan:snip

> + obj = i915_gem_object_create_stolen(gt->i915, size);
> + if (IS_ERR(obj))
> + return PTR_ERR(obj);
> +
> + vma = i915_gem_object_ggtt_pin(obj, NULL, 0, 0, 0);
alan: should we be passing in the PIN_MAPPABLE flag into the last param?


alan:snip

> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.h 
> b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.h
> index a2a0813b8a76..c01286dddbdb 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.h
> @@ -18,6 +18,7 @@ struct intel_gsc_uc {
>  
>   /* GSC-specific additions */
>   struct i915_vma *local; /* private memory for GSC usage */
> + void __iomem *local_vaddr; /* pointer to access the private memory */
alan: nit: relooking at these variable names that originate from last year's
patch of yours introducing gsc_uc, I am now wondering if we should rename
"local" to "privmem" and "local_vaddr" to "privmem_vaddr".
(no significant reason other than improving readability of the code)




Re: [Intel-gfx] [PATCH v2 5/8] drm/i915/huc: differentiate the 2 steps of the MTL HuC auth flow

2023-05-12 Thread Teres Alexis, Alan Previn
On Fri, 2023-04-28 at 11:58 -0700, Ceraolo Spurio, Daniele wrote:
> Before we add the second step of the MTL HuC auth (via GSC), we need to
> have the ability to differentiate between them. To do so, the huc
> authentication check is duplicated for GuC and GSC auth, with meu
> binaries being considered fully authenticated only after the GSC auth
> step.
> 
> To report the difference between the 2 auth steps, a new case is added
> to the HuC getparam. This way, the clear media driver can start
> submitting before full auth, as partial auth is enough for those
> workloads.
> 
> v2: fix authentication status check for DG2
> 
> Signed-off-by: Daniele Ceraolo Spurio 
> Cc: Alan Previn 
> ---
>  drivers/gpu/drm/i915/gt/uc/intel_huc.c| 94 +--
>  drivers/gpu/drm/i915/gt/uc/intel_huc.h| 16 +++-
>  drivers/gpu/drm/i915/gt/uc/intel_huc_fw.c |  4 +-
>  drivers/gpu/drm/i915/i915_reg.h   |  3 +
>  include/uapi/drm/i915_drm.h   |  3 +-
>  5 files changed, 91 insertions(+), 29 deletions(-)
> 
I believe you need a rebase with the PXP single session merged (the
readiness code in gsccs backend). Other than that, all looks good:

Reviewed-by: Alan Previn 


Re: [Intel-gfx] [PATCH v2] drm/i915/huc: Parse the GSC-enabled HuC binary

2023-05-12 Thread Teres Alexis, Alan Previn
On Tue, 2023-05-02 at 08:27 -0700, Ceraolo Spurio, Daniele wrote:
> The new binaries that support the 2-step authentication have contain the
> legacy-style binary, which we can use for loading the HuC via DMA. To
> find out where this is located in the image, we need to parse the meu
> manifest of the GSC binary. The manifest consist of a partition header
> followed by entries, one of which contains the offset we're looking for.
> Note that the DG2 GSC binary contains entries with the same names, but
> it doesn't contain a full legacy binary, so we need to skip assigning
> the dma offset in that case (which we can do by checking the ccs).
> Also, since we're now parsing the entries, we can extract the HuC
> version that way instead of using hardcoded offsets.
> 
> Note that the meu structure will be re-used for parsing the GSC binary,
> so they've been added in their own header.
> 
> v2: fix structure names to match meu defines (s/CPT/CPD/), update commit
> message, check ccs validity, drop old version location defines.
> 
> Signed-off-by: Daniele Ceraolo Spurio 
> Cc: Alan Previn 
> ---
>  .../drm/i915/gt/uc/intel_gsc_meu_headers.h|  74 ++

Compared line by line as per internal reviews and the spec.
All looks good to me - nice to see the additional ccs validity check.
LGTM, Reviewed-by: Alan Previn 


Re: [Intel-gfx] [PATCH] drm/i915/guc: Fix confused register capture list creation

2023-05-11 Thread Teres Alexis, Alan Previn
On Thu, 2023-05-11 at 18:35 -0700, john.c.harri...@intel.com wrote:
> From: John Harrison 
> 
> The GuC has a completely separate engine class enum when referring to
> register capture lists, which combines render and compute. The driver
> was using the 'normal' GuC specific engine class enum instead. That
> meant that it thought it was defining a capture list for compute
> engines, the list was actually being applied to the GSC engine. And if
> a platform didn't have a render engine, then it would get no compute
> register captures at all.
> 
> Fix that.
> 
> Signed-off-by: John Harrison 
alan:snip.

LGTM, simple and straight-forward patch - although I can only imagine the
pain of debugging this one. For the benefit of others on the mailing
list: the COMPUTE and RENDER enums of the i915 (non-GuC-ABI) namespace are
different, but the GuC-ABI capture namespace uses its own Render class for
both (coincidentally '0' == render for both the GuC ABI and i915). This
means ADS population of the capture-register list would only cause problems
for devices that have no render engine or have a GSC (all have VD/VE/Blt).
Since MTL is not yet fully POR and none of the existing upstream-supported
devices lack a render engine, the "Fixes" tag isn't needed, IIUC.

That said, pending CI results,
Reviewed-by: Alan Previn 


Re: [Intel-gfx] [PATCH v9 6/8] drm/i915/uapi/pxp: Add a GET_PARAM for PXP

2023-05-10 Thread Teres Alexis, Alan Previn
> 
alan:snip

> 
> Assuming that:
> 
>   2 = PXP feature is supported but should be ready soon (pending
>   initialization of non-i915 system dependencies).
> 
> really means, "it'll be ready soon or there is a bug somewhere",
> 
> Acked-by: Jordan Justen 
> 
> If Mesa finds that it can't really rely on that, we may have to decide
> on a different approach in how to use that return value.
> 
> Is it possible to hold on for another ~12 hours to see if Lionel wants
> to Ack as well?
> 
> -Jordan

alan: agreed and thanks. And sure, let's give ourselves till tomorrow afternoon
PST.



Re: [Intel-gfx] [PATCH v9 6/8] drm/i915/uapi/pxp: Add a GET_PARAM for PXP

2023-05-10 Thread Teres Alexis, Alan Previn

alan:snip

> This is why I asked if it was it was "basically certain that in a
> production environment, then it will eventually return 1 meaning it's
> ready". Alan's response was a little ambiguous on this point.
alan: if we get a '2' and never transition to '1' - that's a kernel bug or
a firmware / system issue.

> Arguably a transition from 2 to -ENODEV could be considered a kernel
> bug, but it doesn't sound like Alan would agree. :) Maybe Alan would
> agree to saying it's either a kernel, or system integration bug.
alan: agreed - that would be a kernel bug or a system integration bug.

I also agreed on the init-flow vs app-usage thoughts Jordan had.
That said MESA has many ways it can use this UAPI and we can discuss
that on the MESA patch.


hmmm.. so... ack anyone? [insert big hopeful smiley here]
apologies if I am asking too often.


Re: [Intel-gfx] [PATCH v9 6/8] drm/i915/uapi/pxp: Add a GET_PARAM for PXP

2023-05-10 Thread Teres Alexis, Alan Previn
> > > > > Because of the additional firmware, component-driver and
> > > > > initialization depedencies required on MTL platform before a
> > > > > PXP context can be created, UMD calling for PXP creation as a
> > > > > way to get-caps can take a long time. An actual real world
> > > > > customer stack has seen this happen in the 4-to-8 second range
> > > > > after the kernel starts (which sees MESA's init appear in the
> > > > > middle of this range as the compositor comes up). To avoid
> > > > > unncessary delays experienced by the UMD for get-caps purposes,
> > > > > add a GET_PARAM for I915_PARAM_PXP_SUPPORT.
> > > > > 
> > > > alan:snip.
> > > > Progress update on the UMD side - I'm working on patch for PR here: 
> > > > https://gitlab.freedesktop.org/alan_previn_intel/mesa-alan-previn-features/-/commit/fb9d4fbfbef7dfd3f41df335cd31549fd39ddb57
> > > > but taking time to verify certain code paths
> > > 
> > > Just to confirm, if I915_PARAM_PXP_STATUS returns 2 ("will be ready
> > > soon"), then it is basically certain that in a production environment,
> > > then it will eventually return 1 meaning it's ready, right?
> > alan: yes - but not 100%. non-platform-state-machine could still
> > cause unexpected failures for example, [1] if hw was fused in a way
> > that doesnt permit PXP or [2] enabling certain BIOS debug knobs
> > doesnt allow PXP. However at the moment, there is no way for us
> > to know for sure without actually creating a protected context.
> > Of course having hw-fusing + bios-config that supports PXP have
> > always 100% succeeded for me.
> 
> In my opinion it is problematic mark that we support protected context but 
> then it fails to create protected context.
> 
> It should check the I915_PARAM_PXP_STATUS in 
> i915_gem_supports_protected_context() return earlier if know that protected 
> context is not supported.
> Then create a protected context so we know that all other calls will succeed.

Hi Jose, I think your comment is WRT how the MESA change could be implemented.
Right now this UAPI provides all possible information:
'-ENODEV'==no-support... '1'==supported-and-ready...
'2'==supported-but-not-ready.

Unfortunately the '2'->'1' transition is not something the kernel driver has
control over. As per the earlier review comments in prior revs and weeks with
others (Lionel, Jordan, Tvrtko, Daniele, etc.), depending on certain scenarios
(kernel configs with many components + interdependencies... OR a fresh
platform-IFWI update... OR a distro that is designed to boot under 1 sec),
the '2'->'1' transition can take up to 8 seconds from kernel start. In almost
all our internal CI testing we only see it take 2 seconds from kernel start.

That said, perhaps we can discuss the MESA behavior on the MESA patch:
do we want to block at init-time get-caps (i915_gem_supports_protected_context),
or during runtime context-creation (if the user explicitly requests it), or
never block from MESA and instead report when the PXP context fails because it
wasn't ready (so the user knows it can retry)?
<-- this last one was what Jordan wanted and what I posted here on the MESA patch:
https://gitlab.freedesktop.org/alan_previn_intel/mesa-alan-previn-features/-/commit/8728ab5a40c1a83f65afe0072f45a906ff26f706

However, as it stands today, this kernel UAPI patch has incorporated all
the requests from prior review comments and provides as much information as
required. Do you see any other shortcoming in this kernel-side UAPI behavior?
If not, an ack would be greatly appreciated :) since the earlier ack from
Lionel was on the prior behavior, before Jordan's request for this more
comprehensive response-set (-ENODEV vs '1' vs '2').


Re: [Intel-gfx] [PATCH v9 6/8] drm/i915/uapi/pxp: Add a GET_PARAM for PXP

2023-05-09 Thread Teres Alexis, Alan Previn
On Tue, 2023-05-09 at 13:27 +, Teres Alexis, Alan Previn wrote:
> > 
> alan:snip
> 
> > > [Jordan:] 
> > > Another option besides from the timeout loop in
> > > iris_create_hw_context() might be to check I915_PARAM_PXP_STATUS after
> > > the context create fails to tweak the debug message.
> > alan: Yeah, that is an option - I'm thinking we can add a DBG that reads
> > either "PXP context failure expected due not ready" vs
> > "Unexpected PXP context failure" 
> 
> Hi Jordan, should i proceed this way? (I can repost the UMD patch accordingly)

MESA-patch proposal: 
https://gitlab.freedesktop.org/alan_previn_intel/mesa-alan-previn-features/-/commit/8728ab5a40c1a83f65afe0072f45a906ff26f706


Re: [Intel-gfx] [PATCH v9 6/8] drm/i915/uapi/pxp: Add a GET_PARAM for PXP

2023-05-09 Thread Teres Alexis, Alan Previn
> 
alan:snip

> > [Jordan:] 
> > Another option besides from the timeout loop in
> > iris_create_hw_context() might be to check I915_PARAM_PXP_STATUS after
> > the context create fails to tweak the debug message.
> alan: Yeah, that is an option - I'm thinking we can add a DBG that reads
> either "PXP context failure expected due not ready" vs
> "Unexpected PXP context failure" 

Hi Jordan, should i proceed this way? (I can repost the UMD patch accordingly)


Re: [Intel-gfx] [PATCH v9 6/8] drm/i915/uapi/pxp: Add a GET_PARAM for PXP

2023-05-08 Thread Teres Alexis, Alan Previn
On Fri, 2023-05-05 at 00:39 -0700, Justen, Jordan L wrote:
> On 2023-05-04 22:30:07, Teres Alexis, Alan Previn wrote:
> > On Thu, 2023-04-27 at 16:48 -0700, Teres Alexis, Alan Previn wrote:
> > > Because of the additional firmware, component-driver and
> > > initialization depedencies required on MTL platform before a
> > > PXP context can be created, UMD calling for PXP creation as a
> > > way to get-caps can take a long time. An actual real world
> > > customer stack has seen this happen in the 4-to-8 second range
> > > after the kernel starts (which sees MESA's init appear in the
> > > middle of this range as the compositor comes up). To avoid
> > > unncessary delays experienced by the UMD for get-caps purposes,
> > > add a GET_PARAM for I915_PARAM_PXP_SUPPORT.
> > > 
> > alan:snip.
> > Progress update on the UMD side - I'm working on patch for PR here: 
> > https://gitlab.freedesktop.org/alan_previn_intel/mesa-alan-previn-features/-/commit/fb9d4fbfbef7dfd3f41df335cd31549fd39ddb57
> > but taking time to verify certain code paths
> 
> Just to confirm, if I915_PARAM_PXP_STATUS returns 2 ("will be ready
> soon"), then it is basically certain that in a production environment,
> then it will eventually return 1 meaning it's ready, right?
alan: yes - but not 100%. Non-platform-state-machine factors could still
cause unexpected failures, for example: [1] if hw was fused in a way
that doesn't permit PXP, or [2] certain BIOS debug knobs being enabled
that don't allow PXP. However, at the moment, there is no way for us
to know for sure without actually creating a protected context.
Of course, hw-fusing + bios-config that supports PXP has
always succeeded 100% for me.

> 
> If this is correct, then I think that the change in
> i915_gem_supports_protected_context() is good, and probably we can
> skip the change in iris_create_hw_context().
> 
> Basically, we're timing out for multiple seconds either way. And, I'm
> hoping that the kernel will eventually get the PXP init done and
> create the context.
> 
> I think there's 2 cases of interest here.
> 
> First, and most likely, the application running doesn't care about
> protected content. In this case we can quickly advertise the support,
> but there will be no consequence because the application won't use the
> feature.
alan: yes - that was one of the goals of this UAPI.
> 
> Second, the application does care about protected content. In this
> case we can quickly advertise the support, but if the feature is used
> too quickly, then the context create might take a long time.
alan: no, that isn't the case now: we started at 8 secs, then 2 secs,
and finally settled on the same timeout as ADL, since this new UAPI
is something that can be polled on to be sure we have readiness
to make the create-context call. That's why I wanted to add the
polling wait in the actual create call - but not the get-caps call,
which can be quick (as you said above).

> 
> If I915_PARAM_PXP_STATUS returning 2 has a reasonable chance in a
> production environment of eventually finding out that pxp can't work,
> then perhaps more disussion is needed. I guess the worst case scenario
> is no worse than today though. (Except it is still somewhat better,
> because the common case would not involve protected content being
> used by the application.)
alan: hmmm... I guess it depends on the meaning of reasonable: if 50%
of the CI platforms being run have incorrect BIOS config / hw fusing,
does that mean this UAPI is only 50% useful? My opinion is to ask the
alternative: in the case of a platform that has correctly configured
BIOS + fusing, is the chance of '2' eventually leading to a failure high?
The answer is no. I have actually never seen this happen (note: this UAPI
is new, but I know from past debug of customer issues). There are some
corner cases, but those get into the weeds of the PXP runtime state
machine... I am sure we don't want to go down that rabbit hole right now.
> 
> Another option besides from the timeout loop in
> iris_create_hw_context() might be to check I915_PARAM_PXP_STATUS after
> the context create fails to tweak the debug message.
alan: Yeah, that is an option - I'm thinking we can add a DBG that reads
either "PXP context failure expected due to not-ready" vs
"Unexpected PXP context failure"

> 
> -Jordan



Re: [Intel-gfx] [PATCH v9 6/8] drm/i915/uapi/pxp: Add a GET_PARAM for PXP

2023-05-04 Thread Teres Alexis, Alan Previn
On Thu, 2023-04-27 at 16:48 -0700, Teres Alexis, Alan Previn wrote:
> Because of the additional firmware, component-driver and
> initialization depedencies required on MTL platform before a
> PXP context can be created, UMD calling for PXP creation as a
> way to get-caps can take a long time. An actual real world
> customer stack has seen this happen in the 4-to-8 second range
> after the kernel starts (which sees MESA's init appear in the
> middle of this range as the compositor comes up). To avoid
> unncessary delays experienced by the UMD for get-caps purposes,
> add a GET_PARAM for I915_PARAM_PXP_SUPPORT.
> 
alan:snip.
Progress update on the UMD side - I'm working on a patch for the PR here:
https://gitlab.freedesktop.org/alan_previn_intel/mesa-alan-previn-features/-/commit/fb9d4fbfbef7dfd3f41df335cd31549fd39ddb57
but it's taking time to verify certain code paths.


Re: [Intel-gfx] [PATCH v3 4/4] drm/i915/gsc: add support for GSC proxy interrupt

2023-05-04 Thread Teres Alexis, Alan Previn
On Tue, 2023-05-02 at 09:38 -0700, Ceraolo Spurio, Daniele wrote:
> The GSC notifies us of a proxy request via the HECI2 interrupt. The
> interrupt must be enabled both in the HECI layer and in our usual gt irq
> programming; for the latter, the interrupt is enabled via the same enable
> register as the GSC CS, but it does have its own mask register. When the
> interrupt is received, we also need to de-assert it in both layers.
> 
> The handling of the proxy request is deferred to the same worker that we
> use for GSC load. New flags have been added to distinguish between the
> init case and the proxy interrupt.
> 
> v2: Make sure not to set the reset bit when enabling/disabling the GSC
> interrupts, fix defines (Alan)
> 
> v3: rebase on proxy status register check
> 
alan:snip
Reviewed-by: Alan Previn 

