Re: [Intel-gfx] [PATCH v5 06/10] drm/i915/guc: Update GuC's log-buffer-state access for error capture.

2022-02-14 Thread Teres Alexis, Alan Previn
Looks like trying to move the vma up into guc-upper is impacting many other 
functions
in intel_guc_log and intel_guc_log_debugfs. I'll have to take it back (the 
level of
redesign i shall attempt with this series).

I'll just move the log_stats back into intel_guc_log and have intel_guc_capture
have its own stats structure - but keep the VMA allocation of this shared 
buffer in
intel_guc_log (like it was in the prior revs) and have intel_guc_capture "reach 
across"
into intel_guc_log to get the vma ptr (like it was in the prior revs).

...alan


On Mon, 2022-02-14 at 11:22 -0800, Alan Previn Teres Alexis wrote:
> Matt, just a final confirmation on below
> 
> > > > > > On Fri, 2022-02-04 at 10:19 -0800, Matthew Brost wrote:
> > > > > > > On Wed, Jan 26, 2022 at 02:48:18AM -0800, Alan Previn wrote:
> > > If the object lives at the GuC level, operate on it at the GuC level.
> > > 
> > > e.g.
> > > intel_guc_log_init_early calls mutex init on guc->log_state - that is
> > > wrong and breaks the layering. intel_guc_log_init_early should only
> > > operate on guc_log or below objects, not above it.
> > > 
> > > The key here is consisteny, if the GuC level owns the object it should
> > > be initialized there + passed into the lower levels if possible. The
> > > lower levels should avoid reaching back to GuC level for objects
> > > whenever possible.
> > > 
> > > You could have 2 intel_guc_log_stats objects below the guc_log object
> > > and 1 intel_guc_log_stats object for capture at the GuC level. That's
> > > likely the right approach here.
> > 
> > Thanks Matt - I'm in agreement... I was concerned about too much of
> > change - but you're right, I should be focusing on the design consistency.
> > Above sounds like the correct design (these stats and locks should belong
> > to their sole user).
> > 
> > ...alan
> > 
> 
> So this means:
> 1. guc[upper] allocates the shared-logging-buffer
>- but would ask the lower level components for the sizes before
>  buffering-up or capping-down to match interface spec.
> 2. guc-log and guc-error-capture requests guc for a vmap at their init.
> 3. guc-log and guc-error-capture owns independent log-stats and
>(and separate locks if needed).
> 4. when lower level components are done, they relinquish access to
>their region by requesting guc[upper] to unmap and free
> 
> A super clean separation like above could mean ripping apart enums
> and other #defines to split them across guc_log and guc_error_capture
> headers (such as region sizes).
> 
> I believe that separation complicates the understanding of the fw interface
> for logging as we break that picture into independant files / components.
> For now I want to keep guc[upper] aware of the individual sub-region
> allocation requirements (no ripping apart of enums but moving them around)
> but only keep the requesting of vmap and independant log-region-stats
> within the lower level?
> 
> Are you okay with this?
> 
> side note: error-capture no longer need locks after the recent G2H triggered
> linked-list extraction redesign.
> 
> 



Re: [Intel-gfx] [PATCH v5 06/10] drm/i915/guc: Update GuC's log-buffer-state access for error capture.

2022-02-14 Thread Teres Alexis, Alan Previn
Matt, just a final confirmation on below

> > > > > 
> > > > > On Fri, 2022-02-04 at 10:19 -0800, Matthew Brost wrote:
> > > > > > On Wed, Jan 26, 2022 at 02:48:18AM -0800, Alan Previn wrote:
> > > > > > > 
> > If the object lives at the GuC level, operate on it at the GuC level.
> > 
> > e.g.
> > intel_guc_log_init_early calls mutex init on guc->log_state - that is
> > wrong and breaks the layering. intel_guc_log_init_early should only
> > operate on guc_log or below objects, not above it.
> > 
> > The key here is consisteny, if the GuC level owns the object it should
> > be initialized there + passed into the lower levels if possible. The
> > lower levels should avoid reaching back to GuC level for objects
> > whenever possible.
> > 
> > You could have 2 intel_guc_log_stats objects below the guc_log object
> > and 1 intel_guc_log_stats object for capture at the GuC level. That's
> > likely the right approach here.
> 
> Thanks Matt - I'm in agreement... I was concerned about too much of
> change - but you're right, I should be focusing on the design consistency.
> Above sounds like the correct design (these stats and locks should belong
> to their sole user).
> 
> ...alan
> 

So this means:
1. guc[upper] allocates the shared-logging-buffer
   - but would ask the lower level components for the sizes before
 buffering-up or capping-down to match interface spec.
2. guc-log and guc-error-capture requests guc for a vmap at their init.
3. guc-log and guc-error-capture owns independent log-stats and
   (and separate locks if needed).
4. when lower level components are done, they relinquish access to
   their region by requesting guc[upper] to unmap and free

A super clean separation like above could mean ripping apart enums
and other #defines to split them across guc_log and guc_error_capture
headers (such as region sizes).

I believe that separation complicates the understanding of the fw interface
for logging as we break that picture into independant files / components.
For now I want to keep guc[upper] aware of the individual sub-region
allocation requirements (no ripping apart of enums but moving them around)
but only keep the requesting of vmap and independant log-region-stats
within the lower level?

Are you okay with this?

side note: error-capture no longer need locks after the recent G2H triggered
linked-list extraction redesign.




Re: [Intel-gfx] [PATCH v5 06/10] drm/i915/guc: Update GuC's log-buffer-state access for error capture.

2022-02-08 Thread Teres Alexis, Alan Previn

On Tue, 2022-02-08 at 19:34 -0800, Matthew Brost wrote:
> On Tue, Feb 08, 2022 at 02:55:07PM -0800, Teres Alexis, Alan Previn wrote:
> > On Tue, 2022-02-08 at 14:18 -0800, Matthew Brost wrote:
> > > On Tue, Feb 08, 2022 at 11:38:13AM -0800, Teres Alexis, Alan Previn wrote:
> > > > Hi Matt, thank you for taking the time to review the codes.
> > > > Answering your question inline below.
> > > > 
> > > > 
> > > > On Fri, 2022-02-04 at 10:19 -0800, Matthew Brost wrote:
> > > > > On Wed, Jan 26, 2022 at 02:48:18AM -0800, Alan Previn wrote:
> > > > > > GuC log buffer regions for debug-log-events, crash-dumps and
> > > > > > error-state-capture are all a single bo allocation that includes
> > > > > > the guc_log_buffer_state structures.
> > > > > > 
> > > > > > Since the error-capture region is accessed with high priority at 
> > > > > > non-
> > > > > > deterministic times (as part of gpu coredump) while the 
> > > > > > debug-log-event
> > > > > > region is populated and accessed with different priorities, timings 
> > > > > > and
> > > > > > consumers, let's split out separate locks for buffer-state accesses
> > > > > > of each region.
> > > > > > 
> > > > > > Also, ensure a global mapping is made up front for the entire bo
> > > > > > throughout GuC operation so that dynamic mapping and unmapping isn't
> > > > > > required for error capture log access if relay-logging isn't 
> > > > > > running.
> > > > > > 
> > > > > > Additionally, while here, make some readibility improvements:
> > > > > > 1. change previous function names with "capture_logs" to
> > > > > >"copy_debug_logs" to help make the distinction clearer.
> > > > > > 2. Update the guc log region mapping comments to order them
> > > > > >according to the enum definition as per the GuC interface.
> > > > > > 
> > > > > > Signed-off-by: Alan Previn 
> > > > > > ---
> > > > > >  drivers/gpu/drm/i915/gt/uc/intel_guc.h|   2 +
> > > > > >  .../gpu/drm/i915/gt/uc/intel_guc_capture.c|  47 ++
> > > > > >  .../gpu/drm/i915/gt/uc/intel_guc_capture.h|   1 +
> > > > > >  drivers/gpu/drm/i915/gt/uc/intel_guc_log.c| 135 
> > > > > > +++---
> > > > > >  drivers/gpu/drm/i915/gt/uc/intel_guc_log.h|  16 ++-
> > > > > >  5 files changed, 141 insertions(+), 60 deletions(-)
> > > > > > 
> > > > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h 
> > > > > > b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > > > > > index 4e819853ec2e..be1ad7fa2bf8 100644
> > > > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > > > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > > > > > @@ -34,6 +34,8 @@ struct intel_guc {
> > > > > > struct intel_uc_fw fw;
> > > > > > /** @log: sub-structure containing GuC log related data and 
> > > > > > objects */
> > > > > > struct intel_guc_log log;
> > > > > > +   /** @log_state: states and locks for each subregion of GuC's 
> > > > > > log buffer */
> > > > > > +   struct intel_guc_log_stats log_state[GUC_MAX_LOG_BUFFER];
> > > > > 
> > > > > Why move this? This still probably should be sub-structure of
> > > > > intel_guc_log. Most of the access is from functions that pass in
> > > > > intel_guc_log, then retrieve the GuC object, only to access this new
> > > > > intel_guc_log_stats object. That layering seems wrong, if the argument
> > > > > to a function is intel_guc_log it should really try to access members
> > > > > within that object or below. Obv some exceptions exist but here it 
> > > > > seems
> > > > > clear to me this is in the wrong place.
> > > > > 
> > > > So the reasoning i had was because because intel_guc_log module only 
> > > > managed
> > > > guc-relay-logging and guc-log-dumping for log-events but allocates the 
> > > > buffer
> > > > for 3 separate subregion/usages (i.e. log-events, crash-dump and 
> > > > error-capture).
> > > > That said, I did not want intel_guc_capture and relay-logging (or 
> > > > log-dumping)
> > > > to have to contend with the same lock because these two subregions are 
> > > > completely
> > > > independant of each other in terms of when they are being accessed and 
> > > > why.
> > > > 
> > > 
> > > All that is fine, I just think the 'struct intel_guc_log_stats' should
> > > be sub-substure of struct intel_guc_log.
> > > 
> > > > However, after the redesign of rev5 (this rev), I now believe the 
> > > > intel_guc_capture
> > > > module does not require a lock because its only ever accessing its log
> > > > subregion in response to the ctb handler functions that run out of the 
> > > > same queue.
> > > > As we know intel_guc_capture reacts to G2H-error-capture-notification 
> > > > and G2H-context-reset
> > > > (that trickles into i915_gpu_coredump). At the point of 
> > > > i915_error_state dump,
> > > > intel_guc_capture module does not access the buffer - it merely dumps 
> > > > the already-parsed
> > > > and engine-dump-node (that was detached from error-capture's internal 
> > > > linked-list
> > > > and attached to the 

Re: [Intel-gfx] [PATCH v5 06/10] drm/i915/guc: Update GuC's log-buffer-state access for error capture.

2022-02-08 Thread Matthew Brost
On Tue, Feb 08, 2022 at 07:34:37PM -0800, Matthew Brost wrote:
> On Tue, Feb 08, 2022 at 02:55:07PM -0800, Teres Alexis, Alan Previn wrote:
> > 
> > On Tue, 2022-02-08 at 14:18 -0800, Matthew Brost wrote:
> > > On Tue, Feb 08, 2022 at 11:38:13AM -0800, Teres Alexis, Alan Previn wrote:
> > > > Hi Matt, thank you for taking the time to review the codes.
> > > > Answering your question inline below.
> > > >
> > > >
> > > > On Fri, 2022-02-04 at 10:19 -0800, Matthew Brost wrote:
> > > > > On Wed, Jan 26, 2022 at 02:48:18AM -0800, Alan Previn wrote:
> > > > > > GuC log buffer regions for debug-log-events, crash-dumps and
> > > > > > error-state-capture are all a single bo allocation that includes
> > > > > > the guc_log_buffer_state structures.
> > > > > >
> > > > > > Since the error-capture region is accessed with high priority at 
> > > > > > non-
> > > > > > deterministic times (as part of gpu coredump) while the 
> > > > > > debug-log-event
> > > > > > region is populated and accessed with different priorities, timings 
> > > > > > and
> > > > > > consumers, let's split out separate locks for buffer-state accesses
> > > > > > of each region.
> > > > > >
> > > > > > Also, ensure a global mapping is made up front for the entire bo
> > > > > > throughout GuC operation so that dynamic mapping and unmapping isn't
> > > > > > required for error capture log access if relay-logging isn't 
> > > > > > running.
> > > > > >
> > > > > > Additionally, while here, make some readibility improvements:
> > > > > > 1. change previous function names with "capture_logs" to
> > > > > >"copy_debug_logs" to help make the distinction clearer.
> > > > > > 2. Update the guc log region mapping comments to order them
> > > > > >according to the enum definition as per the GuC interface.
> > > > > >
> > > > > > Signed-off-by: Alan Previn 
> > > > > > ---
> > > > > >  drivers/gpu/drm/i915/gt/uc/intel_guc.h|   2 +
> > > > > >  .../gpu/drm/i915/gt/uc/intel_guc_capture.c|  47 ++
> > > > > >  .../gpu/drm/i915/gt/uc/intel_guc_capture.h|   1 +
> > > > > >  drivers/gpu/drm/i915/gt/uc/intel_guc_log.c| 135 
> > > > > > +++---
> > > > > >  drivers/gpu/drm/i915/gt/uc/intel_guc_log.h|  16 ++-
> > > > > >  5 files changed, 141 insertions(+), 60 deletions(-)
> > > > > >
> > > > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h 
> > > > > > b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > > > > > index 4e819853ec2e..be1ad7fa2bf8 100644
> > > > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > > > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > > > > > @@ -34,6 +34,8 @@ struct intel_guc {
> > > > > > struct intel_uc_fw fw;
> > > > > > /** @log: sub-structure containing GuC log related data and 
> > > > > > objects */
> > > > > > struct intel_guc_log log;
> > > > > > +   /** @log_state: states and locks for each subregion of GuC's 
> > > > > > log buffer */
> > > > > > +   struct intel_guc_log_stats log_state[GUC_MAX_LOG_BUFFER];
> > > > >
> > > > > Why move this? This still probably should be sub-structure of
> > > > > intel_guc_log. Most of the access is from functions that pass in
> > > > > intel_guc_log, then retrieve the GuC object, only to access this new
> > > > > intel_guc_log_stats object. That layering seems wrong, if the argument
> > > > > to a function is intel_guc_log it should really try to access members
> > > > > within that object or below. Obv some exceptions exist but here it 
> > > > > seems
> > > > > clear to me this is in the wrong place.
> > > > >
> > > > So the reasoning i had was because because intel_guc_log module only 
> > > > managed
> > > > guc-relay-logging and guc-log-dumping for log-events but allocates the 
> > > > buffer
> > > > for 3 separate subregion/usages (i.e. log-events, crash-dump and 
> > > > error-capture).
> > > > That said, I did not want intel_guc_capture and relay-logging (or 
> > > > log-dumping)
> > > > to have to contend with the same lock because these two subregions are 
> > > > completely
> > > > independant of each other in terms of when they are being accessed and 
> > > > why.
> > > >
> > >
> > > All that is fine, I just think the 'struct intel_guc_log_stats' should
> > > be sub-substure of struct intel_guc_log.
> > >
> > > > However, after the redesign of rev5 (this rev), I now believe the 
> > > > intel_guc_capture
> > > > module does not require a lock because its only ever accessing its log
> > > > subregion in response to the ctb handler functions that run out of the 
> > > > same queue.
> > > > As we know intel_guc_capture reacts to G2H-error-capture-notification 
> > > > and G2H-context-reset
> > > > (that trickles into i915_gpu_coredump). At the point of 
> > > > i915_error_state dump,
> > > > intel_guc_capture module does not access the buffer - it merely dumps 
> > > > the already-parsed
> > > > and engine-dump-node (that was detached from error-capture's internal 
> > > > linked-list
> > > > and attached to the 

Re: [Intel-gfx] [PATCH v5 06/10] drm/i915/guc: Update GuC's log-buffer-state access for error capture.

2022-02-08 Thread Matthew Brost
On Tue, Feb 08, 2022 at 02:55:07PM -0800, Teres Alexis, Alan Previn wrote:
> 
> On Tue, 2022-02-08 at 14:18 -0800, Matthew Brost wrote:
> > On Tue, Feb 08, 2022 at 11:38:13AM -0800, Teres Alexis, Alan Previn wrote:
> > > Hi Matt, thank you for taking the time to review the codes.
> > > Answering your question inline below.
> > >
> > >
> > > On Fri, 2022-02-04 at 10:19 -0800, Matthew Brost wrote:
> > > > On Wed, Jan 26, 2022 at 02:48:18AM -0800, Alan Previn wrote:
> > > > > GuC log buffer regions for debug-log-events, crash-dumps and
> > > > > error-state-capture are all a single bo allocation that includes
> > > > > the guc_log_buffer_state structures.
> > > > >
> > > > > Since the error-capture region is accessed with high priority at non-
> > > > > deterministic times (as part of gpu coredump) while the 
> > > > > debug-log-event
> > > > > region is populated and accessed with different priorities, timings 
> > > > > and
> > > > > consumers, let's split out separate locks for buffer-state accesses
> > > > > of each region.
> > > > >
> > > > > Also, ensure a global mapping is made up front for the entire bo
> > > > > throughout GuC operation so that dynamic mapping and unmapping isn't
> > > > > required for error capture log access if relay-logging isn't running.
> > > > >
> > > > > Additionally, while here, make some readibility improvements:
> > > > > 1. change previous function names with "capture_logs" to
> > > > >"copy_debug_logs" to help make the distinction clearer.
> > > > > 2. Update the guc log region mapping comments to order them
> > > > >according to the enum definition as per the GuC interface.
> > > > >
> > > > > Signed-off-by: Alan Previn 
> > > > > ---
> > > > >  drivers/gpu/drm/i915/gt/uc/intel_guc.h|   2 +
> > > > >  .../gpu/drm/i915/gt/uc/intel_guc_capture.c|  47 ++
> > > > >  .../gpu/drm/i915/gt/uc/intel_guc_capture.h|   1 +
> > > > >  drivers/gpu/drm/i915/gt/uc/intel_guc_log.c| 135 
> > > > > +++---
> > > > >  drivers/gpu/drm/i915/gt/uc/intel_guc_log.h|  16 ++-
> > > > >  5 files changed, 141 insertions(+), 60 deletions(-)
> > > > >
> > > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h 
> > > > > b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > > > > index 4e819853ec2e..be1ad7fa2bf8 100644
> > > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > > > > @@ -34,6 +34,8 @@ struct intel_guc {
> > > > > struct intel_uc_fw fw;
> > > > > /** @log: sub-structure containing GuC log related data and 
> > > > > objects */
> > > > > struct intel_guc_log log;
> > > > > +   /** @log_state: states and locks for each subregion of GuC's log 
> > > > > buffer */
> > > > > +   struct intel_guc_log_stats log_state[GUC_MAX_LOG_BUFFER];
> > > >
> > > > Why move this? This still probably should be sub-structure of
> > > > intel_guc_log. Most of the access is from functions that pass in
> > > > intel_guc_log, then retrieve the GuC object, only to access this new
> > > > intel_guc_log_stats object. That layering seems wrong, if the argument
> > > > to a function is intel_guc_log it should really try to access members
> > > > within that object or below. Obv some exceptions exist but here it seems
> > > > clear to me this is in the wrong place.
> > > >
> > > So the reasoning i had was because because intel_guc_log module only 
> > > managed
> > > guc-relay-logging and guc-log-dumping for log-events but allocates the 
> > > buffer
> > > for 3 separate subregion/usages (i.e. log-events, crash-dump and 
> > > error-capture).
> > > That said, I did not want intel_guc_capture and relay-logging (or 
> > > log-dumping)
> > > to have to contend with the same lock because these two subregions are 
> > > completely
> > > independant of each other in terms of when they are being accessed and 
> > > why.
> > >
> >
> > All that is fine, I just think the 'struct intel_guc_log_stats' should
> > be sub-substure of struct intel_guc_log.
> >
> > > However, after the redesign of rev5 (this rev), I now believe the 
> > > intel_guc_capture
> > > module does not require a lock because its only ever accessing its log
> > > subregion in response to the ctb handler functions that run out of the 
> > > same queue.
> > > As we know intel_guc_capture reacts to G2H-error-capture-notification and 
> > > G2H-context-reset
> > > (that trickles into i915_gpu_coredump). At the point of i915_error_state 
> > > dump,
> > > intel_guc_capture module does not access the buffer - it merely dumps the 
> > > already-parsed
> > > and engine-dump-node (that was detached from error-capture's internal 
> > > linked-list
> > > and attached to the gpu_coredump structure in the second G2H above).
> > >
> > > And of course, intel_guc_log only ever accesses the log-events subregion
> > > and never the error-capture regions.
> > >
> > > That said, i could revert the lock structure to the original and not have
> > > intel_guc_capture using 

Re: [Intel-gfx] [PATCH v5 06/10] drm/i915/guc: Update GuC's log-buffer-state access for error capture.

2022-02-08 Thread Teres Alexis, Alan Previn

On Tue, 2022-02-08 at 14:18 -0800, Matthew Brost wrote:
> On Tue, Feb 08, 2022 at 11:38:13AM -0800, Teres Alexis, Alan Previn wrote:
> > Hi Matt, thank you for taking the time to review the codes.
> > Answering your question inline below.
> > 
> > 
> > On Fri, 2022-02-04 at 10:19 -0800, Matthew Brost wrote:
> > > On Wed, Jan 26, 2022 at 02:48:18AM -0800, Alan Previn wrote:
> > > > GuC log buffer regions for debug-log-events, crash-dumps and
> > > > error-state-capture are all a single bo allocation that includes
> > > > the guc_log_buffer_state structures.
> > > > 
> > > > Since the error-capture region is accessed with high priority at non-
> > > > deterministic times (as part of gpu coredump) while the debug-log-event
> > > > region is populated and accessed with different priorities, timings and
> > > > consumers, let's split out separate locks for buffer-state accesses
> > > > of each region.
> > > > 
> > > > Also, ensure a global mapping is made up front for the entire bo
> > > > throughout GuC operation so that dynamic mapping and unmapping isn't
> > > > required for error capture log access if relay-logging isn't running.
> > > > 
> > > > Additionally, while here, make some readibility improvements:
> > > > 1. change previous function names with "capture_logs" to
> > > >"copy_debug_logs" to help make the distinction clearer.
> > > > 2. Update the guc log region mapping comments to order them
> > > >according to the enum definition as per the GuC interface.
> > > > 
> > > > Signed-off-by: Alan Previn 
> > > > ---
> > > >  drivers/gpu/drm/i915/gt/uc/intel_guc.h|   2 +
> > > >  .../gpu/drm/i915/gt/uc/intel_guc_capture.c|  47 ++
> > > >  .../gpu/drm/i915/gt/uc/intel_guc_capture.h|   1 +
> > > >  drivers/gpu/drm/i915/gt/uc/intel_guc_log.c| 135 +++---
> > > >  drivers/gpu/drm/i915/gt/uc/intel_guc_log.h|  16 ++-
> > > >  5 files changed, 141 insertions(+), 60 deletions(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h 
> > > > b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > > > index 4e819853ec2e..be1ad7fa2bf8 100644
> > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > > > @@ -34,6 +34,8 @@ struct intel_guc {
> > > > struct intel_uc_fw fw;
> > > > /** @log: sub-structure containing GuC log related data and objects 
> > > > */
> > > > struct intel_guc_log log;
> > > > +   /** @log_state: states and locks for each subregion of GuC's log 
> > > > buffer */
> > > > +   struct intel_guc_log_stats log_state[GUC_MAX_LOG_BUFFER];
> > > 
> > > Why move this? This still probably should be sub-structure of
> > > intel_guc_log. Most of the access is from functions that pass in
> > > intel_guc_log, then retrieve the GuC object, only to access this new
> > > intel_guc_log_stats object. That layering seems wrong, if the argument
> > > to a function is intel_guc_log it should really try to access members
> > > within that object or below. Obv some exceptions exist but here it seems
> > > clear to me this is in the wrong place.
> > > 
> > So the reasoning i had was because because intel_guc_log module only managed
> > guc-relay-logging and guc-log-dumping for log-events but allocates the 
> > buffer
> > for 3 separate subregion/usages (i.e. log-events, crash-dump and 
> > error-capture).
> > That said, I did not want intel_guc_capture and relay-logging (or 
> > log-dumping)
> > to have to contend with the same lock because these two subregions are 
> > completely
> > independant of each other in terms of when they are being accessed and why.
> > 
> 
> All that is fine, I just think the 'struct intel_guc_log_stats' should
> be sub-substure of struct intel_guc_log.
> 
> > However, after the redesign of rev5 (this rev), I now believe the 
> > intel_guc_capture
> > module does not require a lock because its only ever accessing its log
> > subregion in response to the ctb handler functions that run out of the same 
> > queue.
> > As we know intel_guc_capture reacts to G2H-error-capture-notification and 
> > G2H-context-reset
> > (that trickles into i915_gpu_coredump). At the point of i915_error_state 
> > dump,
> > intel_guc_capture module does not access the buffer - it merely dumps the 
> > already-parsed
> > and engine-dump-node (that was detached from error-capture's internal 
> > linked-list
> > and attached to the gpu_coredump structure in the second G2H above).
> > 
> > And of course, intel_guc_log only ever accesses the log-events subregion
> > and never the error-capture regions.
> > 
> > That said, i could revert the lock structure to the original and not have
> > intel_guc_capture using locks. What do you think?
> > 
> 
> Again my comment has nothing to do with locking, it is where the
> structure lives.
> 
> 
IMHO, based on my understanding of the codes and the GuC interface,
i see intel_guc_log and intel_guc_capture as 2 subsystems at equal level
under the intel_guc framework. 

Re: [Intel-gfx] [PATCH v5 06/10] drm/i915/guc: Update GuC's log-buffer-state access for error capture.

2022-02-08 Thread Matthew Brost
On Tue, Feb 08, 2022 at 11:38:13AM -0800, Teres Alexis, Alan Previn wrote:
> Hi Matt, thank you for taking the time to review the codes.
> Answering your question inline below.
> 
> 
> On Fri, 2022-02-04 at 10:19 -0800, Matthew Brost wrote:
> > On Wed, Jan 26, 2022 at 02:48:18AM -0800, Alan Previn wrote:
> > > GuC log buffer regions for debug-log-events, crash-dumps and
> > > error-state-capture are all a single bo allocation that includes
> > > the guc_log_buffer_state structures.
> > >
> > > Since the error-capture region is accessed with high priority at non-
> > > deterministic times (as part of gpu coredump) while the debug-log-event
> > > region is populated and accessed with different priorities, timings and
> > > consumers, let's split out separate locks for buffer-state accesses
> > > of each region.
> > >
> > > Also, ensure a global mapping is made up front for the entire bo
> > > throughout GuC operation so that dynamic mapping and unmapping isn't
> > > required for error capture log access if relay-logging isn't running.
> > >
> > > Additionally, while here, make some readibility improvements:
> > > 1. change previous function names with "capture_logs" to
> > >"copy_debug_logs" to help make the distinction clearer.
> > > 2. Update the guc log region mapping comments to order them
> > >according to the enum definition as per the GuC interface.
> > >
> > > Signed-off-by: Alan Previn 
> > > ---
> > >  drivers/gpu/drm/i915/gt/uc/intel_guc.h|   2 +
> > >  .../gpu/drm/i915/gt/uc/intel_guc_capture.c|  47 ++
> > >  .../gpu/drm/i915/gt/uc/intel_guc_capture.h|   1 +
> > >  drivers/gpu/drm/i915/gt/uc/intel_guc_log.c| 135 +++---
> > >  drivers/gpu/drm/i915/gt/uc/intel_guc_log.h|  16 ++-
> > >  5 files changed, 141 insertions(+), 60 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h 
> > > b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > > index 4e819853ec2e..be1ad7fa2bf8 100644
> > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > > @@ -34,6 +34,8 @@ struct intel_guc {
> > > struct intel_uc_fw fw;
> > > /** @log: sub-structure containing GuC log related data and objects */
> > > struct intel_guc_log log;
> > > +   /** @log_state: states and locks for each subregion of GuC's log 
> > > buffer */
> > > +   struct intel_guc_log_stats log_state[GUC_MAX_LOG_BUFFER];
> >
> > Why move this? This still probably should be sub-structure of
> > intel_guc_log. Most of the access is from functions that pass in
> > intel_guc_log, then retrieve the GuC object, only to access this new
> > intel_guc_log_stats object. That layering seems wrong, if the argument
> > to a function is intel_guc_log it should really try to access members
> > within that object or below. Obv some exceptions exist but here it seems
> > clear to me this is in the wrong place.
> >
> So the reasoning i had was because because intel_guc_log module only managed
> guc-relay-logging and guc-log-dumping for log-events but allocates the buffer
> for 3 separate subregion/usages (i.e. log-events, crash-dump and 
> error-capture).
> That said, I did not want intel_guc_capture and relay-logging (or log-dumping)
> to have to contend with the same lock because these two subregions are 
> completely
> independant of each other in terms of when they are being accessed and why.
> 

All that is fine, I just think the 'struct intel_guc_log_stats' should
be sub-substure of struct intel_guc_log.

> However, after the redesign of rev5 (this rev), I now believe the 
> intel_guc_capture
> module does not require a lock because its only ever accessing its log
> subregion in response to the ctb handler functions that run out of the same 
> queue.
> As we know intel_guc_capture reacts to G2H-error-capture-notification and 
> G2H-context-reset
> (that trickles into i915_gpu_coredump). At the point of i915_error_state dump,
> intel_guc_capture module does not access the buffer - it merely dumps the 
> already-parsed
> and engine-dump-node (that was detached from error-capture's internal 
> linked-list
> and attached to the gpu_coredump structure in the second G2H above).
> 
> And of course, intel_guc_log only ever accesses the log-events subregion
> and never the error-capture regions.
> 
> That said, i could revert the lock structure to the original and not have
> intel_guc_capture using locks. What do you think?
> 

Again my comment has nothing to do with locking, it is where the
structure lives.

Matt

> ...alan
> 
> > Another nit, I'd personally break this out into multiple patches.
> >
> > e.g. 1 to rename relay log functions, 1 introducing intel_guc_log_stats
> > + lock, and 1 adding intel_guc_capture_output_min_size_est. Maybe I'm
> > missing another patch or two in there.
> >
> > Not a blocker but it does make reviews easier.
> >
> Will do.
> 
> > Matt
> >
> > > /** @ct: the command transport communication channel */
> > > 

Re: [Intel-gfx] [PATCH v5 06/10] drm/i915/guc: Update GuC's log-buffer-state access for error capture.

2022-02-08 Thread Teres Alexis, Alan Previn
Hi Matt, thank you for taking the time to review the codes.
Answering your question inline below.


On Fri, 2022-02-04 at 10:19 -0800, Matthew Brost wrote:
> On Wed, Jan 26, 2022 at 02:48:18AM -0800, Alan Previn wrote:
> > GuC log buffer regions for debug-log-events, crash-dumps and
> > error-state-capture are all a single bo allocation that includes
> > the guc_log_buffer_state structures.
> > 
> > Since the error-capture region is accessed with high priority at non-
> > deterministic times (as part of gpu coredump) while the debug-log-event
> > region is populated and accessed with different priorities, timings and
> > consumers, let's split out separate locks for buffer-state accesses
> > of each region.
> > 
> > Also, ensure a global mapping is made up front for the entire bo
> > throughout GuC operation so that dynamic mapping and unmapping isn't
> > required for error capture log access if relay-logging isn't running.
> > 
> > Additionally, while here, make some readibility improvements:
> > 1. change previous function names with "capture_logs" to
> >"copy_debug_logs" to help make the distinction clearer.
> > 2. Update the guc log region mapping comments to order them
> >according to the enum definition as per the GuC interface.
> > 
> > Signed-off-by: Alan Previn 
> > ---
> >  drivers/gpu/drm/i915/gt/uc/intel_guc.h|   2 +
> >  .../gpu/drm/i915/gt/uc/intel_guc_capture.c|  47 ++
> >  .../gpu/drm/i915/gt/uc/intel_guc_capture.h|   1 +
> >  drivers/gpu/drm/i915/gt/uc/intel_guc_log.c| 135 +++---
> >  drivers/gpu/drm/i915/gt/uc/intel_guc_log.h|  16 ++-
> >  5 files changed, 141 insertions(+), 60 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h 
> > b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > index 4e819853ec2e..be1ad7fa2bf8 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> > @@ -34,6 +34,8 @@ struct intel_guc {
> > struct intel_uc_fw fw;
> > /** @log: sub-structure containing GuC log related data and objects */
> > struct intel_guc_log log;
> > +   /** @log_state: states and locks for each subregion of GuC's log buffer 
> > */
> > +   struct intel_guc_log_stats log_state[GUC_MAX_LOG_BUFFER];
> 
> Why move this? This still probably should be sub-structure of
> intel_guc_log. Most of the access is from functions that pass in
> intel_guc_log, then retrieve the GuC object, only to access this new
> intel_guc_log_stats object. That layering seems wrong, if the argument
> to a function is intel_guc_log it should really try to access members
> within that object or below. Obv some exceptions exist but here it seems
> clear to me this is in the wrong place.
> 
So the reasoning i had was because because intel_guc_log module only managed
guc-relay-logging and guc-log-dumping for log-events but allocates the buffer
for 3 separate subregion/usages (i.e. log-events, crash-dump and error-capture).
That said, I did not want intel_guc_capture and relay-logging (or log-dumping)
to have to contend with the same lock because these two subregions are 
completely
independant of each other in terms of when they are being accessed and why.

However, after the redesign of rev5 (this rev), I now believe the 
intel_guc_capture
module does not require a lock because its only ever accessing its log
subregion in response to the ctb handler functions that run out of the same 
queue.
As we know intel_guc_capture reacts to G2H-error-capture-notification and 
G2H-context-reset
(that trickles into i915_gpu_coredump). At the point of i915_error_state dump,
intel_guc_capture module does not access the buffer - it merely dumps the 
already-parsed
and engine-dump-node (that was detached from error-capture's internal 
linked-list
and attached to the gpu_coredump structure in the second G2H above).

And of course, intel_guc_log only ever accesses the log-events subregion
and never the error-capture regions.

That said, i could revert the lock structure to the original and not have
intel_guc_capture using locks. What do you think?

...alan

> Another nit, I'd personally break this out into multiple patches.
> 
> e.g. 1 to rename relay log functions, 1 introducing intel_guc_log_stats
> + lock, and 1 adding intel_guc_capture_output_min_size_est. Maybe I'm
> missing another patch or two in there.
> 
> Not a blocker but it does make reviews easier.
> 
Will do.

> Matt
> 
> > /** @ct: the command transport communication channel */
> > struct intel_guc_ct ct;
> > /** @slpc: sub-structure containing SLPC related data and objects */
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c 
> > b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
> > index 70d2ee841289..e7f99d051636 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
> > @@ -651,6 +651,53 @@ int intel_guc_capture_prep_lists(struct intel_guc 
> > *guc, struct 

Re: [Intel-gfx] [PATCH v5 06/10] drm/i915/guc: Update GuC's log-buffer-state access for error capture.

2022-02-04 Thread Matthew Brost
On Wed, Jan 26, 2022 at 02:48:18AM -0800, Alan Previn wrote:
> GuC log buffer regions for debug-log-events, crash-dumps and
> error-state-capture are all a single bo allocation that includes
> the guc_log_buffer_state structures.
> 
> Since the error-capture region is accessed with high priority at non-
> deterministic times (as part of gpu coredump) while the debug-log-event
> region is populated and accessed with different priorities, timings and
> consumers, let's split out separate locks for buffer-state accesses
> of each region.
> 
> Also, ensure a global mapping is made up front for the entire bo
> throughout GuC operation so that dynamic mapping and unmapping isn't
> required for error capture log access if relay-logging isn't running.
> 
> Additionally, while here, make some readibility improvements:
> 1. change previous function names with "capture_logs" to
>"copy_debug_logs" to help make the distinction clearer.
> 2. Update the guc log region mapping comments to order them
>according to the enum definition as per the GuC interface.
> 
> Signed-off-by: Alan Previn 
> ---
>  drivers/gpu/drm/i915/gt/uc/intel_guc.h|   2 +
>  .../gpu/drm/i915/gt/uc/intel_guc_capture.c|  47 ++
>  .../gpu/drm/i915/gt/uc/intel_guc_capture.h|   1 +
>  drivers/gpu/drm/i915/gt/uc/intel_guc_log.c| 135 +++---
>  drivers/gpu/drm/i915/gt/uc/intel_guc_log.h|  16 ++-
>  5 files changed, 141 insertions(+), 60 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h 
> b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> index 4e819853ec2e..be1ad7fa2bf8 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> @@ -34,6 +34,8 @@ struct intel_guc {
>   struct intel_uc_fw fw;
>   /** @log: sub-structure containing GuC log related data and objects */
>   struct intel_guc_log log;
> + /** @log_state: states and locks for each subregion of GuC's log buffer 
> */
> + struct intel_guc_log_stats log_state[GUC_MAX_LOG_BUFFER];

Why move this? This still probably should be sub-structure of
intel_guc_log. Most of the access is from functions that pass in
intel_guc_log, then retrieve the GuC object, only to access this new
intel_guc_log_stats object. That layering seems wrong, if the argument
to a function is intel_guc_log it should really try to access members
within that object or below. Obv some exceptions exist but here it seems
clear to me this is in the wrong place.

Another nit, I'd personally break this out into multiple patches.

e.g. 1 to rename relay log functions, 1 introducing intel_guc_log_stats
+ lock, and 1 adding intel_guc_capture_output_min_size_est. Maybe I'm
missing another patch or two in there.

Not a blocker but it does make reviews easier.

Matt

>   /** @ct: the command transport communication channel */
>   struct intel_guc_ct ct;
>   /** @slpc: sub-structure containing SLPC related data and objects */
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c 
> b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
> index 70d2ee841289..e7f99d051636 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
> @@ -651,6 +651,53 @@ int intel_guc_capture_prep_lists(struct intel_guc *guc, 
> struct guc_ads *blob, u3
>   return PAGE_ALIGN(alloc_size);
>  }
>  
> +#define GUC_CAPTURE_OVERBUFFER_MULTIPLIER 3
> +int intel_guc_capture_output_min_size_est(struct intel_guc *guc)
> +{
> + struct intel_gt *gt = guc_to_gt(guc);
> + struct intel_engine_cs *engine;
> + enum intel_engine_id id;
> + int worst_min_size = 0, num_regs = 0;
> + u16 tmp = 0;
> +
> + /*
> +  * If every single engine-instance suffered a failure in quick 
> succession but
> +  * were all unrelated, then a burst of multiple error-capture events 
> would dump
> +  * registers for every one engine instance, one at a time. In this 
> case, GuC
> +  * would even dump the global-registers repeatedly.
> +  *
> +  * For each engine instance, there would be 1 x 
> guc_state_capture_group_t output
> +  * followed by 3 x guc_state_capture_t lists. The latter is how the 
> register
> +  * dumps are split across different register types (where the '3' are 
> global vs class
> +  * vs instance). Finally, let's multiply the whole thing by 3x (just so 
> we are
> +  * not limited to just 1 round of data in a worst case full register 
> dump log)
> +  *
> +  * NOTE: intel_guc_log that allocates the log buffer would round this 
> size up to
> +  * a power of two.
> +  */
> +
> + for_each_engine(engine, gt, id) {
> + worst_min_size += sizeof(struct 
> guc_state_capture_group_header_t) +
> +   (3 * sizeof(struct 
> guc_state_capture_header_t));
> +
> + if (!guc_capture_list_count(guc, 0, 
> GUC_CAPTURE_LIST_TYPE_GLOBAL, 0, ))
> + num_regs 

Re: [Intel-gfx] [PATCH v5 06/10] drm/i915/guc: Update GuC's log-buffer-state access for error capture.

2022-01-26 Thread Teres Alexis, Alan Previn
As per the rev 5 CI results between this patch and patch7, i have introduced a 
lockdep splat bug, i shall fix that in
the next rev.

...alan

On Wed, 2022-01-26 at 02:48 -0800, Alan Previn wrote:
> GuC log buffer regions for debug-log-events, crash-dumps and
> error-state-capture are all a single bo allocation that includes
> the guc_log_buffer_state structures.
> 
> Since the error-capture region is accessed with high priority at non-
> deterministic times (as part of gpu coredump) while the debug-log-event
> region is populated and accessed with different priorities, timings and
> consumers, let's split out separate locks for buffer-state accesses
> of each region.
> 
> Also, ensure a global mapping is made up front for the entire bo
> throughout GuC operation so that dynamic mapping and unmapping isn't
> required for error capture log access if relay-logging isn't running.
> 
> Additionally, while here, make some readibility improvements:
> 1. change previous function names with "capture_logs" to
>"copy_debug_logs" to help make the distinction clearer.
> 2. Update the guc log region mapping comments to order them
>according to the enum definition as per the GuC interface.
> 


[Intel-gfx] [PATCH v5 06/10] drm/i915/guc: Update GuC's log-buffer-state access for error capture.

2022-01-26 Thread Alan Previn
GuC log buffer regions for debug-log-events, crash-dumps and
error-state-capture are all a single bo allocation that includes
the guc_log_buffer_state structures.

Since the error-capture region is accessed with high priority at non-
deterministic times (as part of gpu coredump) while the debug-log-event
region is populated and accessed with different priorities, timings and
consumers, let's split out separate locks for buffer-state accesses
of each region.

Also, ensure a global mapping is made up front for the entire bo
throughout GuC operation so that dynamic mapping and unmapping isn't
required for error capture log access if relay-logging isn't running.

Additionally, while here, make some readibility improvements:
1. change previous function names with "capture_logs" to
   "copy_debug_logs" to help make the distinction clearer.
2. Update the guc log region mapping comments to order them
   according to the enum definition as per the GuC interface.

Signed-off-by: Alan Previn 
---
 drivers/gpu/drm/i915/gt/uc/intel_guc.h|   2 +
 .../gpu/drm/i915/gt/uc/intel_guc_capture.c|  47 ++
 .../gpu/drm/i915/gt/uc/intel_guc_capture.h|   1 +
 drivers/gpu/drm/i915/gt/uc/intel_guc_log.c| 135 +++---
 drivers/gpu/drm/i915/gt/uc/intel_guc_log.h|  16 ++-
 5 files changed, 141 insertions(+), 60 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h 
b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
index 4e819853ec2e..be1ad7fa2bf8 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
@@ -34,6 +34,8 @@ struct intel_guc {
struct intel_uc_fw fw;
/** @log: sub-structure containing GuC log related data and objects */
struct intel_guc_log log;
+   /** @log_state: states and locks for each subregion of GuC's log buffer 
*/
+   struct intel_guc_log_stats log_state[GUC_MAX_LOG_BUFFER];
/** @ct: the command transport communication channel */
struct intel_guc_ct ct;
/** @slpc: sub-structure containing SLPC related data and objects */
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c 
b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
index 70d2ee841289..e7f99d051636 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
@@ -651,6 +651,53 @@ int intel_guc_capture_prep_lists(struct intel_guc *guc, 
struct guc_ads *blob, u3
return PAGE_ALIGN(alloc_size);
 }
 
+#define GUC_CAPTURE_OVERBUFFER_MULTIPLIER 3
+int intel_guc_capture_output_min_size_est(struct intel_guc *guc)
+{
+   struct intel_gt *gt = guc_to_gt(guc);
+   struct intel_engine_cs *engine;
+   enum intel_engine_id id;
+   int worst_min_size = 0, num_regs = 0;
+   u16 tmp = 0;
+
+   /*
+* If every single engine-instance suffered a failure in quick 
succession but
+* were all unrelated, then a burst of multiple error-capture events 
would dump
+* registers for every one engine instance, one at a time. In this 
case, GuC
+* would even dump the global-registers repeatedly.
+*
+* For each engine instance, there would be 1 x 
guc_state_capture_group_t output
+* followed by 3 x guc_state_capture_t lists. The latter is how the 
register
+* dumps are split across different register types (where the '3' are 
global vs class
+* vs instance). Finally, let's multiply the whole thing by 3x (just so 
we are
+* not limited to just 1 round of data in a worst case full register 
dump log)
+*
+* NOTE: intel_guc_log that allocates the log buffer would round this 
size up to
+* a power of two.
+*/
+
+   for_each_engine(engine, gt, id) {
+   worst_min_size += sizeof(struct 
guc_state_capture_group_header_t) +
+ (3 * sizeof(struct 
guc_state_capture_header_t));
+
+   if (!guc_capture_list_count(guc, 0, 
GUC_CAPTURE_LIST_TYPE_GLOBAL, 0, ))
+   num_regs += tmp;
+
+   if (!guc_capture_list_count(guc, 0, 
GUC_CAPTURE_LIST_TYPE_ENGINE_CLASS,
+   engine->class, )) {
+   num_regs += tmp;
+   }
+   if (!guc_capture_list_count(guc, 0, 
GUC_CAPTURE_LIST_TYPE_ENGINE_INSTANCE,
+   engine->class, )) {
+   num_regs += tmp;
+   }
+   }
+
+   worst_min_size += (num_regs * sizeof(struct guc_mmio_reg));
+
+   return (worst_min_size * GUC_CAPTURE_OVERBUFFER_MULTIPLIER);
+}
+
 void intel_guc_capture_destroy(struct intel_guc *guc)
 {
guc_capture_clear_ext_regs(guc->capture.priv->reglists);
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h 
b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h
index 6b5594ca529d..4d3e5221128c 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h
+++