Re: [Intel-gfx] [PATCH v5 06/10] drm/i915/guc: Update GuC's log-buffer-state access for error capture.
Looks like trying to move the vma up into guc-upper is impacting many other functions in intel_guc_log and intel_guc_log_debugfs. I'll have to take it back (the level of redesign i shall attempt with this series). I'll just move the log_stats back into intel_guc_log and have intel_guc_capture have its own stats structure - but keep the VMA allocation of this shared buffer in intel_guc_log (like it was in the prior revs) and have intel_guc_capture "reach across" into intel_guc_log to get the vma ptr (like it was in the prior revs). ...alan On Mon, 2022-02-14 at 11:22 -0800, Alan Previn Teres Alexis wrote: > Matt, just a final confirmation on below > > > > > > > On Fri, 2022-02-04 at 10:19 -0800, Matthew Brost wrote: > > > > > > > On Wed, Jan 26, 2022 at 02:48:18AM -0800, Alan Previn wrote: > > > If the object lives at the GuC level, operate on it at the GuC level. > > > > > > e.g. > > > intel_guc_log_init_early calls mutex init on guc->log_state - that is > > > wrong and breaks the layering. intel_guc_log_init_early should only > > > operate on guc_log or below objects, not above it. > > > > > > The key here is consisteny, if the GuC level owns the object it should > > > be initialized there + passed into the lower levels if possible. The > > > lower levels should avoid reaching back to GuC level for objects > > > whenever possible. > > > > > > You could have 2 intel_guc_log_stats objects below the guc_log object > > > and 1 intel_guc_log_stats object for capture at the GuC level. That's > > > likely the right approach here. > > > > Thanks Matt - I'm in agreement... I was concerned about too much of > > change - but you're right, I should be focusing on the design consistency. > > Above sounds like the correct design (these stats and locks should belong > > to their sole user). > > > > ...alan > > > > So this means: > 1. guc[upper] allocates the shared-logging-buffer >- but would ask the lower level components for the sizes before > buffering-up or capping-down to match interface spec. > 2. guc-log and guc-error-capture requests guc for a vmap at their init. > 3. guc-log and guc-error-capture owns independent log-stats and >(and separate locks if needed). > 4. when lower level components are done, they relinquish access to >their region by requesting guc[upper] to unmap and free > > A super clean separation like above could mean ripping apart enums > and other #defines to split them across guc_log and guc_error_capture > headers (such as region sizes). > > I believe that separation complicates the understanding of the fw interface > for logging as we break that picture into independant files / components. > For now I want to keep guc[upper] aware of the individual sub-region > allocation requirements (no ripping apart of enums but moving them around) > but only keep the requesting of vmap and independant log-region-stats > within the lower level? > > Are you okay with this? > > side note: error-capture no longer need locks after the recent G2H triggered > linked-list extraction redesign. > >
Re: [Intel-gfx] [PATCH v5 06/10] drm/i915/guc: Update GuC's log-buffer-state access for error capture.
Matt, just a final confirmation on below > > > > > > > > > > On Fri, 2022-02-04 at 10:19 -0800, Matthew Brost wrote: > > > > > > On Wed, Jan 26, 2022 at 02:48:18AM -0800, Alan Previn wrote: > > > > > > > > > If the object lives at the GuC level, operate on it at the GuC level. > > > > e.g. > > intel_guc_log_init_early calls mutex init on guc->log_state - that is > > wrong and breaks the layering. intel_guc_log_init_early should only > > operate on guc_log or below objects, not above it. > > > > The key here is consisteny, if the GuC level owns the object it should > > be initialized there + passed into the lower levels if possible. The > > lower levels should avoid reaching back to GuC level for objects > > whenever possible. > > > > You could have 2 intel_guc_log_stats objects below the guc_log object > > and 1 intel_guc_log_stats object for capture at the GuC level. That's > > likely the right approach here. > > Thanks Matt - I'm in agreement... I was concerned about too much of > change - but you're right, I should be focusing on the design consistency. > Above sounds like the correct design (these stats and locks should belong > to their sole user). > > ...alan > So this means: 1. guc[upper] allocates the shared-logging-buffer - but would ask the lower level components for the sizes before buffering-up or capping-down to match interface spec. 2. guc-log and guc-error-capture requests guc for a vmap at their init. 3. guc-log and guc-error-capture owns independent log-stats and (and separate locks if needed). 4. when lower level components are done, they relinquish access to their region by requesting guc[upper] to unmap and free A super clean separation like above could mean ripping apart enums and other #defines to split them across guc_log and guc_error_capture headers (such as region sizes). I believe that separation complicates the understanding of the fw interface for logging as we break that picture into independant files / components. For now I want to keep guc[upper] aware of the individual sub-region allocation requirements (no ripping apart of enums but moving them around) but only keep the requesting of vmap and independant log-region-stats within the lower level? Are you okay with this? side note: error-capture no longer need locks after the recent G2H triggered linked-list extraction redesign.
Re: [Intel-gfx] [PATCH v5 06/10] drm/i915/guc: Update GuC's log-buffer-state access for error capture.
On Tue, 2022-02-08 at 19:34 -0800, Matthew Brost wrote: > On Tue, Feb 08, 2022 at 02:55:07PM -0800, Teres Alexis, Alan Previn wrote: > > On Tue, 2022-02-08 at 14:18 -0800, Matthew Brost wrote: > > > On Tue, Feb 08, 2022 at 11:38:13AM -0800, Teres Alexis, Alan Previn wrote: > > > > Hi Matt, thank you for taking the time to review the codes. > > > > Answering your question inline below. > > > > > > > > > > > > On Fri, 2022-02-04 at 10:19 -0800, Matthew Brost wrote: > > > > > On Wed, Jan 26, 2022 at 02:48:18AM -0800, Alan Previn wrote: > > > > > > GuC log buffer regions for debug-log-events, crash-dumps and > > > > > > error-state-capture are all a single bo allocation that includes > > > > > > the guc_log_buffer_state structures. > > > > > > > > > > > > Since the error-capture region is accessed with high priority at > > > > > > non- > > > > > > deterministic times (as part of gpu coredump) while the > > > > > > debug-log-event > > > > > > region is populated and accessed with different priorities, timings > > > > > > and > > > > > > consumers, let's split out separate locks for buffer-state accesses > > > > > > of each region. > > > > > > > > > > > > Also, ensure a global mapping is made up front for the entire bo > > > > > > throughout GuC operation so that dynamic mapping and unmapping isn't > > > > > > required for error capture log access if relay-logging isn't > > > > > > running. > > > > > > > > > > > > Additionally, while here, make some readibility improvements: > > > > > > 1. change previous function names with "capture_logs" to > > > > > >"copy_debug_logs" to help make the distinction clearer. > > > > > > 2. Update the guc log region mapping comments to order them > > > > > >according to the enum definition as per the GuC interface. > > > > > > > > > > > > Signed-off-by: Alan Previn > > > > > > --- > > > > > > drivers/gpu/drm/i915/gt/uc/intel_guc.h| 2 + > > > > > > .../gpu/drm/i915/gt/uc/intel_guc_capture.c| 47 ++ > > > > > > .../gpu/drm/i915/gt/uc/intel_guc_capture.h| 1 + > > > > > > drivers/gpu/drm/i915/gt/uc/intel_guc_log.c| 135 > > > > > > +++--- > > > > > > drivers/gpu/drm/i915/gt/uc/intel_guc_log.h| 16 ++- > > > > > > 5 files changed, 141 insertions(+), 60 deletions(-) > > > > > > > > > > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h > > > > > > b/drivers/gpu/drm/i915/gt/uc/intel_guc.h > > > > > > index 4e819853ec2e..be1ad7fa2bf8 100644 > > > > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h > > > > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h > > > > > > @@ -34,6 +34,8 @@ struct intel_guc { > > > > > > struct intel_uc_fw fw; > > > > > > /** @log: sub-structure containing GuC log related data and > > > > > > objects */ > > > > > > struct intel_guc_log log; > > > > > > + /** @log_state: states and locks for each subregion of GuC's > > > > > > log buffer */ > > > > > > + struct intel_guc_log_stats log_state[GUC_MAX_LOG_BUFFER]; > > > > > > > > > > Why move this? This still probably should be sub-structure of > > > > > intel_guc_log. Most of the access is from functions that pass in > > > > > intel_guc_log, then retrieve the GuC object, only to access this new > > > > > intel_guc_log_stats object. That layering seems wrong, if the argument > > > > > to a function is intel_guc_log it should really try to access members > > > > > within that object or below. Obv some exceptions exist but here it > > > > > seems > > > > > clear to me this is in the wrong place. > > > > > > > > > So the reasoning i had was because because intel_guc_log module only > > > > managed > > > > guc-relay-logging and guc-log-dumping for log-events but allocates the > > > > buffer > > > > for 3 separate subregion/usages (i.e. log-events, crash-dump and > > > > error-capture). > > > > That said, I did not want intel_guc_capture and relay-logging (or > > > > log-dumping) > > > > to have to contend with the same lock because these two subregions are > > > > completely > > > > independant of each other in terms of when they are being accessed and > > > > why. > > > > > > > > > > All that is fine, I just think the 'struct intel_guc_log_stats' should > > > be sub-substure of struct intel_guc_log. > > > > > > > However, after the redesign of rev5 (this rev), I now believe the > > > > intel_guc_capture > > > > module does not require a lock because its only ever accessing its log > > > > subregion in response to the ctb handler functions that run out of the > > > > same queue. > > > > As we know intel_guc_capture reacts to G2H-error-capture-notification > > > > and G2H-context-reset > > > > (that trickles into i915_gpu_coredump). At the point of > > > > i915_error_state dump, > > > > intel_guc_capture module does not access the buffer - it merely dumps > > > > the already-parsed > > > > and engine-dump-node (that was detached from error-capture's internal > > > > linked-list > > > > and attached to the
Re: [Intel-gfx] [PATCH v5 06/10] drm/i915/guc: Update GuC's log-buffer-state access for error capture.
On Tue, Feb 08, 2022 at 07:34:37PM -0800, Matthew Brost wrote: > On Tue, Feb 08, 2022 at 02:55:07PM -0800, Teres Alexis, Alan Previn wrote: > > > > On Tue, 2022-02-08 at 14:18 -0800, Matthew Brost wrote: > > > On Tue, Feb 08, 2022 at 11:38:13AM -0800, Teres Alexis, Alan Previn wrote: > > > > Hi Matt, thank you for taking the time to review the codes. > > > > Answering your question inline below. > > > > > > > > > > > > On Fri, 2022-02-04 at 10:19 -0800, Matthew Brost wrote: > > > > > On Wed, Jan 26, 2022 at 02:48:18AM -0800, Alan Previn wrote: > > > > > > GuC log buffer regions for debug-log-events, crash-dumps and > > > > > > error-state-capture are all a single bo allocation that includes > > > > > > the guc_log_buffer_state structures. > > > > > > > > > > > > Since the error-capture region is accessed with high priority at > > > > > > non- > > > > > > deterministic times (as part of gpu coredump) while the > > > > > > debug-log-event > > > > > > region is populated and accessed with different priorities, timings > > > > > > and > > > > > > consumers, let's split out separate locks for buffer-state accesses > > > > > > of each region. > > > > > > > > > > > > Also, ensure a global mapping is made up front for the entire bo > > > > > > throughout GuC operation so that dynamic mapping and unmapping isn't > > > > > > required for error capture log access if relay-logging isn't > > > > > > running. > > > > > > > > > > > > Additionally, while here, make some readibility improvements: > > > > > > 1. change previous function names with "capture_logs" to > > > > > >"copy_debug_logs" to help make the distinction clearer. > > > > > > 2. Update the guc log region mapping comments to order them > > > > > >according to the enum definition as per the GuC interface. > > > > > > > > > > > > Signed-off-by: Alan Previn > > > > > > --- > > > > > > drivers/gpu/drm/i915/gt/uc/intel_guc.h| 2 + > > > > > > .../gpu/drm/i915/gt/uc/intel_guc_capture.c| 47 ++ > > > > > > .../gpu/drm/i915/gt/uc/intel_guc_capture.h| 1 + > > > > > > drivers/gpu/drm/i915/gt/uc/intel_guc_log.c| 135 > > > > > > +++--- > > > > > > drivers/gpu/drm/i915/gt/uc/intel_guc_log.h| 16 ++- > > > > > > 5 files changed, 141 insertions(+), 60 deletions(-) > > > > > > > > > > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h > > > > > > b/drivers/gpu/drm/i915/gt/uc/intel_guc.h > > > > > > index 4e819853ec2e..be1ad7fa2bf8 100644 > > > > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h > > > > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h > > > > > > @@ -34,6 +34,8 @@ struct intel_guc { > > > > > > struct intel_uc_fw fw; > > > > > > /** @log: sub-structure containing GuC log related data and > > > > > > objects */ > > > > > > struct intel_guc_log log; > > > > > > + /** @log_state: states and locks for each subregion of GuC's > > > > > > log buffer */ > > > > > > + struct intel_guc_log_stats log_state[GUC_MAX_LOG_BUFFER]; > > > > > > > > > > Why move this? This still probably should be sub-structure of > > > > > intel_guc_log. Most of the access is from functions that pass in > > > > > intel_guc_log, then retrieve the GuC object, only to access this new > > > > > intel_guc_log_stats object. That layering seems wrong, if the argument > > > > > to a function is intel_guc_log it should really try to access members > > > > > within that object or below. Obv some exceptions exist but here it > > > > > seems > > > > > clear to me this is in the wrong place. > > > > > > > > > So the reasoning i had was because because intel_guc_log module only > > > > managed > > > > guc-relay-logging and guc-log-dumping for log-events but allocates the > > > > buffer > > > > for 3 separate subregion/usages (i.e. log-events, crash-dump and > > > > error-capture). > > > > That said, I did not want intel_guc_capture and relay-logging (or > > > > log-dumping) > > > > to have to contend with the same lock because these two subregions are > > > > completely > > > > independant of each other in terms of when they are being accessed and > > > > why. > > > > > > > > > > All that is fine, I just think the 'struct intel_guc_log_stats' should > > > be sub-substure of struct intel_guc_log. > > > > > > > However, after the redesign of rev5 (this rev), I now believe the > > > > intel_guc_capture > > > > module does not require a lock because its only ever accessing its log > > > > subregion in response to the ctb handler functions that run out of the > > > > same queue. > > > > As we know intel_guc_capture reacts to G2H-error-capture-notification > > > > and G2H-context-reset > > > > (that trickles into i915_gpu_coredump). At the point of > > > > i915_error_state dump, > > > > intel_guc_capture module does not access the buffer - it merely dumps > > > > the already-parsed > > > > and engine-dump-node (that was detached from error-capture's internal > > > > linked-list > > > > and attached to the
Re: [Intel-gfx] [PATCH v5 06/10] drm/i915/guc: Update GuC's log-buffer-state access for error capture.
On Tue, Feb 08, 2022 at 02:55:07PM -0800, Teres Alexis, Alan Previn wrote: > > On Tue, 2022-02-08 at 14:18 -0800, Matthew Brost wrote: > > On Tue, Feb 08, 2022 at 11:38:13AM -0800, Teres Alexis, Alan Previn wrote: > > > Hi Matt, thank you for taking the time to review the codes. > > > Answering your question inline below. > > > > > > > > > On Fri, 2022-02-04 at 10:19 -0800, Matthew Brost wrote: > > > > On Wed, Jan 26, 2022 at 02:48:18AM -0800, Alan Previn wrote: > > > > > GuC log buffer regions for debug-log-events, crash-dumps and > > > > > error-state-capture are all a single bo allocation that includes > > > > > the guc_log_buffer_state structures. > > > > > > > > > > Since the error-capture region is accessed with high priority at non- > > > > > deterministic times (as part of gpu coredump) while the > > > > > debug-log-event > > > > > region is populated and accessed with different priorities, timings > > > > > and > > > > > consumers, let's split out separate locks for buffer-state accesses > > > > > of each region. > > > > > > > > > > Also, ensure a global mapping is made up front for the entire bo > > > > > throughout GuC operation so that dynamic mapping and unmapping isn't > > > > > required for error capture log access if relay-logging isn't running. > > > > > > > > > > Additionally, while here, make some readibility improvements: > > > > > 1. change previous function names with "capture_logs" to > > > > >"copy_debug_logs" to help make the distinction clearer. > > > > > 2. Update the guc log region mapping comments to order them > > > > >according to the enum definition as per the GuC interface. > > > > > > > > > > Signed-off-by: Alan Previn > > > > > --- > > > > > drivers/gpu/drm/i915/gt/uc/intel_guc.h| 2 + > > > > > .../gpu/drm/i915/gt/uc/intel_guc_capture.c| 47 ++ > > > > > .../gpu/drm/i915/gt/uc/intel_guc_capture.h| 1 + > > > > > drivers/gpu/drm/i915/gt/uc/intel_guc_log.c| 135 > > > > > +++--- > > > > > drivers/gpu/drm/i915/gt/uc/intel_guc_log.h| 16 ++- > > > > > 5 files changed, 141 insertions(+), 60 deletions(-) > > > > > > > > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h > > > > > b/drivers/gpu/drm/i915/gt/uc/intel_guc.h > > > > > index 4e819853ec2e..be1ad7fa2bf8 100644 > > > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h > > > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h > > > > > @@ -34,6 +34,8 @@ struct intel_guc { > > > > > struct intel_uc_fw fw; > > > > > /** @log: sub-structure containing GuC log related data and > > > > > objects */ > > > > > struct intel_guc_log log; > > > > > + /** @log_state: states and locks for each subregion of GuC's log > > > > > buffer */ > > > > > + struct intel_guc_log_stats log_state[GUC_MAX_LOG_BUFFER]; > > > > > > > > Why move this? This still probably should be sub-structure of > > > > intel_guc_log. Most of the access is from functions that pass in > > > > intel_guc_log, then retrieve the GuC object, only to access this new > > > > intel_guc_log_stats object. That layering seems wrong, if the argument > > > > to a function is intel_guc_log it should really try to access members > > > > within that object or below. Obv some exceptions exist but here it seems > > > > clear to me this is in the wrong place. > > > > > > > So the reasoning i had was because because intel_guc_log module only > > > managed > > > guc-relay-logging and guc-log-dumping for log-events but allocates the > > > buffer > > > for 3 separate subregion/usages (i.e. log-events, crash-dump and > > > error-capture). > > > That said, I did not want intel_guc_capture and relay-logging (or > > > log-dumping) > > > to have to contend with the same lock because these two subregions are > > > completely > > > independant of each other in terms of when they are being accessed and > > > why. > > > > > > > All that is fine, I just think the 'struct intel_guc_log_stats' should > > be sub-substure of struct intel_guc_log. > > > > > However, after the redesign of rev5 (this rev), I now believe the > > > intel_guc_capture > > > module does not require a lock because its only ever accessing its log > > > subregion in response to the ctb handler functions that run out of the > > > same queue. > > > As we know intel_guc_capture reacts to G2H-error-capture-notification and > > > G2H-context-reset > > > (that trickles into i915_gpu_coredump). At the point of i915_error_state > > > dump, > > > intel_guc_capture module does not access the buffer - it merely dumps the > > > already-parsed > > > and engine-dump-node (that was detached from error-capture's internal > > > linked-list > > > and attached to the gpu_coredump structure in the second G2H above). > > > > > > And of course, intel_guc_log only ever accesses the log-events subregion > > > and never the error-capture regions. > > > > > > That said, i could revert the lock structure to the original and not have > > > intel_guc_capture using
Re: [Intel-gfx] [PATCH v5 06/10] drm/i915/guc: Update GuC's log-buffer-state access for error capture.
On Tue, 2022-02-08 at 14:18 -0800, Matthew Brost wrote: > On Tue, Feb 08, 2022 at 11:38:13AM -0800, Teres Alexis, Alan Previn wrote: > > Hi Matt, thank you for taking the time to review the codes. > > Answering your question inline below. > > > > > > On Fri, 2022-02-04 at 10:19 -0800, Matthew Brost wrote: > > > On Wed, Jan 26, 2022 at 02:48:18AM -0800, Alan Previn wrote: > > > > GuC log buffer regions for debug-log-events, crash-dumps and > > > > error-state-capture are all a single bo allocation that includes > > > > the guc_log_buffer_state structures. > > > > > > > > Since the error-capture region is accessed with high priority at non- > > > > deterministic times (as part of gpu coredump) while the debug-log-event > > > > region is populated and accessed with different priorities, timings and > > > > consumers, let's split out separate locks for buffer-state accesses > > > > of each region. > > > > > > > > Also, ensure a global mapping is made up front for the entire bo > > > > throughout GuC operation so that dynamic mapping and unmapping isn't > > > > required for error capture log access if relay-logging isn't running. > > > > > > > > Additionally, while here, make some readibility improvements: > > > > 1. change previous function names with "capture_logs" to > > > >"copy_debug_logs" to help make the distinction clearer. > > > > 2. Update the guc log region mapping comments to order them > > > >according to the enum definition as per the GuC interface. > > > > > > > > Signed-off-by: Alan Previn > > > > --- > > > > drivers/gpu/drm/i915/gt/uc/intel_guc.h| 2 + > > > > .../gpu/drm/i915/gt/uc/intel_guc_capture.c| 47 ++ > > > > .../gpu/drm/i915/gt/uc/intel_guc_capture.h| 1 + > > > > drivers/gpu/drm/i915/gt/uc/intel_guc_log.c| 135 +++--- > > > > drivers/gpu/drm/i915/gt/uc/intel_guc_log.h| 16 ++- > > > > 5 files changed, 141 insertions(+), 60 deletions(-) > > > > > > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h > > > > b/drivers/gpu/drm/i915/gt/uc/intel_guc.h > > > > index 4e819853ec2e..be1ad7fa2bf8 100644 > > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h > > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h > > > > @@ -34,6 +34,8 @@ struct intel_guc { > > > > struct intel_uc_fw fw; > > > > /** @log: sub-structure containing GuC log related data and objects > > > > */ > > > > struct intel_guc_log log; > > > > + /** @log_state: states and locks for each subregion of GuC's log > > > > buffer */ > > > > + struct intel_guc_log_stats log_state[GUC_MAX_LOG_BUFFER]; > > > > > > Why move this? This still probably should be sub-structure of > > > intel_guc_log. Most of the access is from functions that pass in > > > intel_guc_log, then retrieve the GuC object, only to access this new > > > intel_guc_log_stats object. That layering seems wrong, if the argument > > > to a function is intel_guc_log it should really try to access members > > > within that object or below. Obv some exceptions exist but here it seems > > > clear to me this is in the wrong place. > > > > > So the reasoning i had was because because intel_guc_log module only managed > > guc-relay-logging and guc-log-dumping for log-events but allocates the > > buffer > > for 3 separate subregion/usages (i.e. log-events, crash-dump and > > error-capture). > > That said, I did not want intel_guc_capture and relay-logging (or > > log-dumping) > > to have to contend with the same lock because these two subregions are > > completely > > independant of each other in terms of when they are being accessed and why. > > > > All that is fine, I just think the 'struct intel_guc_log_stats' should > be sub-substure of struct intel_guc_log. > > > However, after the redesign of rev5 (this rev), I now believe the > > intel_guc_capture > > module does not require a lock because its only ever accessing its log > > subregion in response to the ctb handler functions that run out of the same > > queue. > > As we know intel_guc_capture reacts to G2H-error-capture-notification and > > G2H-context-reset > > (that trickles into i915_gpu_coredump). At the point of i915_error_state > > dump, > > intel_guc_capture module does not access the buffer - it merely dumps the > > already-parsed > > and engine-dump-node (that was detached from error-capture's internal > > linked-list > > and attached to the gpu_coredump structure in the second G2H above). > > > > And of course, intel_guc_log only ever accesses the log-events subregion > > and never the error-capture regions. > > > > That said, i could revert the lock structure to the original and not have > > intel_guc_capture using locks. What do you think? > > > > Again my comment has nothing to do with locking, it is where the > structure lives. > > IMHO, based on my understanding of the codes and the GuC interface, i see intel_guc_log and intel_guc_capture as 2 subsystems at equal level under the intel_guc framework.
Re: [Intel-gfx] [PATCH v5 06/10] drm/i915/guc: Update GuC's log-buffer-state access for error capture.
On Tue, Feb 08, 2022 at 11:38:13AM -0800, Teres Alexis, Alan Previn wrote: > Hi Matt, thank you for taking the time to review the codes. > Answering your question inline below. > > > On Fri, 2022-02-04 at 10:19 -0800, Matthew Brost wrote: > > On Wed, Jan 26, 2022 at 02:48:18AM -0800, Alan Previn wrote: > > > GuC log buffer regions for debug-log-events, crash-dumps and > > > error-state-capture are all a single bo allocation that includes > > > the guc_log_buffer_state structures. > > > > > > Since the error-capture region is accessed with high priority at non- > > > deterministic times (as part of gpu coredump) while the debug-log-event > > > region is populated and accessed with different priorities, timings and > > > consumers, let's split out separate locks for buffer-state accesses > > > of each region. > > > > > > Also, ensure a global mapping is made up front for the entire bo > > > throughout GuC operation so that dynamic mapping and unmapping isn't > > > required for error capture log access if relay-logging isn't running. > > > > > > Additionally, while here, make some readibility improvements: > > > 1. change previous function names with "capture_logs" to > > >"copy_debug_logs" to help make the distinction clearer. > > > 2. Update the guc log region mapping comments to order them > > >according to the enum definition as per the GuC interface. > > > > > > Signed-off-by: Alan Previn > > > --- > > > drivers/gpu/drm/i915/gt/uc/intel_guc.h| 2 + > > > .../gpu/drm/i915/gt/uc/intel_guc_capture.c| 47 ++ > > > .../gpu/drm/i915/gt/uc/intel_guc_capture.h| 1 + > > > drivers/gpu/drm/i915/gt/uc/intel_guc_log.c| 135 +++--- > > > drivers/gpu/drm/i915/gt/uc/intel_guc_log.h| 16 ++- > > > 5 files changed, 141 insertions(+), 60 deletions(-) > > > > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h > > > b/drivers/gpu/drm/i915/gt/uc/intel_guc.h > > > index 4e819853ec2e..be1ad7fa2bf8 100644 > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h > > > @@ -34,6 +34,8 @@ struct intel_guc { > > > struct intel_uc_fw fw; > > > /** @log: sub-structure containing GuC log related data and objects */ > > > struct intel_guc_log log; > > > + /** @log_state: states and locks for each subregion of GuC's log > > > buffer */ > > > + struct intel_guc_log_stats log_state[GUC_MAX_LOG_BUFFER]; > > > > Why move this? This still probably should be sub-structure of > > intel_guc_log. Most of the access is from functions that pass in > > intel_guc_log, then retrieve the GuC object, only to access this new > > intel_guc_log_stats object. That layering seems wrong, if the argument > > to a function is intel_guc_log it should really try to access members > > within that object or below. Obv some exceptions exist but here it seems > > clear to me this is in the wrong place. > > > So the reasoning i had was because because intel_guc_log module only managed > guc-relay-logging and guc-log-dumping for log-events but allocates the buffer > for 3 separate subregion/usages (i.e. log-events, crash-dump and > error-capture). > That said, I did not want intel_guc_capture and relay-logging (or log-dumping) > to have to contend with the same lock because these two subregions are > completely > independant of each other in terms of when they are being accessed and why. > All that is fine, I just think the 'struct intel_guc_log_stats' should be sub-substure of struct intel_guc_log. > However, after the redesign of rev5 (this rev), I now believe the > intel_guc_capture > module does not require a lock because its only ever accessing its log > subregion in response to the ctb handler functions that run out of the same > queue. > As we know intel_guc_capture reacts to G2H-error-capture-notification and > G2H-context-reset > (that trickles into i915_gpu_coredump). At the point of i915_error_state dump, > intel_guc_capture module does not access the buffer - it merely dumps the > already-parsed > and engine-dump-node (that was detached from error-capture's internal > linked-list > and attached to the gpu_coredump structure in the second G2H above). > > And of course, intel_guc_log only ever accesses the log-events subregion > and never the error-capture regions. > > That said, i could revert the lock structure to the original and not have > intel_guc_capture using locks. What do you think? > Again my comment has nothing to do with locking, it is where the structure lives. Matt > ...alan > > > Another nit, I'd personally break this out into multiple patches. > > > > e.g. 1 to rename relay log functions, 1 introducing intel_guc_log_stats > > + lock, and 1 adding intel_guc_capture_output_min_size_est. Maybe I'm > > missing another patch or two in there. > > > > Not a blocker but it does make reviews easier. > > > Will do. > > > Matt > > > > > /** @ct: the command transport communication channel */ > > >
Re: [Intel-gfx] [PATCH v5 06/10] drm/i915/guc: Update GuC's log-buffer-state access for error capture.
Hi Matt, thank you for taking the time to review the codes. Answering your question inline below. On Fri, 2022-02-04 at 10:19 -0800, Matthew Brost wrote: > On Wed, Jan 26, 2022 at 02:48:18AM -0800, Alan Previn wrote: > > GuC log buffer regions for debug-log-events, crash-dumps and > > error-state-capture are all a single bo allocation that includes > > the guc_log_buffer_state structures. > > > > Since the error-capture region is accessed with high priority at non- > > deterministic times (as part of gpu coredump) while the debug-log-event > > region is populated and accessed with different priorities, timings and > > consumers, let's split out separate locks for buffer-state accesses > > of each region. > > > > Also, ensure a global mapping is made up front for the entire bo > > throughout GuC operation so that dynamic mapping and unmapping isn't > > required for error capture log access if relay-logging isn't running. > > > > Additionally, while here, make some readibility improvements: > > 1. change previous function names with "capture_logs" to > >"copy_debug_logs" to help make the distinction clearer. > > 2. Update the guc log region mapping comments to order them > >according to the enum definition as per the GuC interface. > > > > Signed-off-by: Alan Previn > > --- > > drivers/gpu/drm/i915/gt/uc/intel_guc.h| 2 + > > .../gpu/drm/i915/gt/uc/intel_guc_capture.c| 47 ++ > > .../gpu/drm/i915/gt/uc/intel_guc_capture.h| 1 + > > drivers/gpu/drm/i915/gt/uc/intel_guc_log.c| 135 +++--- > > drivers/gpu/drm/i915/gt/uc/intel_guc_log.h| 16 ++- > > 5 files changed, 141 insertions(+), 60 deletions(-) > > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h > > b/drivers/gpu/drm/i915/gt/uc/intel_guc.h > > index 4e819853ec2e..be1ad7fa2bf8 100644 > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h > > @@ -34,6 +34,8 @@ struct intel_guc { > > struct intel_uc_fw fw; > > /** @log: sub-structure containing GuC log related data and objects */ > > struct intel_guc_log log; > > + /** @log_state: states and locks for each subregion of GuC's log buffer > > */ > > + struct intel_guc_log_stats log_state[GUC_MAX_LOG_BUFFER]; > > Why move this? This still probably should be sub-structure of > intel_guc_log. Most of the access is from functions that pass in > intel_guc_log, then retrieve the GuC object, only to access this new > intel_guc_log_stats object. That layering seems wrong, if the argument > to a function is intel_guc_log it should really try to access members > within that object or below. Obv some exceptions exist but here it seems > clear to me this is in the wrong place. > So the reasoning i had was because because intel_guc_log module only managed guc-relay-logging and guc-log-dumping for log-events but allocates the buffer for 3 separate subregion/usages (i.e. log-events, crash-dump and error-capture). That said, I did not want intel_guc_capture and relay-logging (or log-dumping) to have to contend with the same lock because these two subregions are completely independant of each other in terms of when they are being accessed and why. However, after the redesign of rev5 (this rev), I now believe the intel_guc_capture module does not require a lock because its only ever accessing its log subregion in response to the ctb handler functions that run out of the same queue. As we know intel_guc_capture reacts to G2H-error-capture-notification and G2H-context-reset (that trickles into i915_gpu_coredump). At the point of i915_error_state dump, intel_guc_capture module does not access the buffer - it merely dumps the already-parsed and engine-dump-node (that was detached from error-capture's internal linked-list and attached to the gpu_coredump structure in the second G2H above). And of course, intel_guc_log only ever accesses the log-events subregion and never the error-capture regions. That said, i could revert the lock structure to the original and not have intel_guc_capture using locks. What do you think? ...alan > Another nit, I'd personally break this out into multiple patches. > > e.g. 1 to rename relay log functions, 1 introducing intel_guc_log_stats > + lock, and 1 adding intel_guc_capture_output_min_size_est. Maybe I'm > missing another patch or two in there. > > Not a blocker but it does make reviews easier. > Will do. > Matt > > > /** @ct: the command transport communication channel */ > > struct intel_guc_ct ct; > > /** @slpc: sub-structure containing SLPC related data and objects */ > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c > > b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c > > index 70d2ee841289..e7f99d051636 100644 > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c > > @@ -651,6 +651,53 @@ int intel_guc_capture_prep_lists(struct intel_guc > > *guc, struct
Re: [Intel-gfx] [PATCH v5 06/10] drm/i915/guc: Update GuC's log-buffer-state access for error capture.
On Wed, Jan 26, 2022 at 02:48:18AM -0800, Alan Previn wrote: > GuC log buffer regions for debug-log-events, crash-dumps and > error-state-capture are all a single bo allocation that includes > the guc_log_buffer_state structures. > > Since the error-capture region is accessed with high priority at non- > deterministic times (as part of gpu coredump) while the debug-log-event > region is populated and accessed with different priorities, timings and > consumers, let's split out separate locks for buffer-state accesses > of each region. > > Also, ensure a global mapping is made up front for the entire bo > throughout GuC operation so that dynamic mapping and unmapping isn't > required for error capture log access if relay-logging isn't running. > > Additionally, while here, make some readibility improvements: > 1. change previous function names with "capture_logs" to >"copy_debug_logs" to help make the distinction clearer. > 2. Update the guc log region mapping comments to order them >according to the enum definition as per the GuC interface. > > Signed-off-by: Alan Previn > --- > drivers/gpu/drm/i915/gt/uc/intel_guc.h| 2 + > .../gpu/drm/i915/gt/uc/intel_guc_capture.c| 47 ++ > .../gpu/drm/i915/gt/uc/intel_guc_capture.h| 1 + > drivers/gpu/drm/i915/gt/uc/intel_guc_log.c| 135 +++--- > drivers/gpu/drm/i915/gt/uc/intel_guc_log.h| 16 ++- > 5 files changed, 141 insertions(+), 60 deletions(-) > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h > b/drivers/gpu/drm/i915/gt/uc/intel_guc.h > index 4e819853ec2e..be1ad7fa2bf8 100644 > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h > @@ -34,6 +34,8 @@ struct intel_guc { > struct intel_uc_fw fw; > /** @log: sub-structure containing GuC log related data and objects */ > struct intel_guc_log log; > + /** @log_state: states and locks for each subregion of GuC's log buffer > */ > + struct intel_guc_log_stats log_state[GUC_MAX_LOG_BUFFER]; Why move this? This still probably should be sub-structure of intel_guc_log. Most of the access is from functions that pass in intel_guc_log, then retrieve the GuC object, only to access this new intel_guc_log_stats object. That layering seems wrong, if the argument to a function is intel_guc_log it should really try to access members within that object or below. Obv some exceptions exist but here it seems clear to me this is in the wrong place. Another nit, I'd personally break this out into multiple patches. e.g. 1 to rename relay log functions, 1 introducing intel_guc_log_stats + lock, and 1 adding intel_guc_capture_output_min_size_est. Maybe I'm missing another patch or two in there. Not a blocker but it does make reviews easier. Matt > /** @ct: the command transport communication channel */ > struct intel_guc_ct ct; > /** @slpc: sub-structure containing SLPC related data and objects */ > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c > b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c > index 70d2ee841289..e7f99d051636 100644 > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c > @@ -651,6 +651,53 @@ int intel_guc_capture_prep_lists(struct intel_guc *guc, > struct guc_ads *blob, u3 > return PAGE_ALIGN(alloc_size); > } > > +#define GUC_CAPTURE_OVERBUFFER_MULTIPLIER 3 > +int intel_guc_capture_output_min_size_est(struct intel_guc *guc) > +{ > + struct intel_gt *gt = guc_to_gt(guc); > + struct intel_engine_cs *engine; > + enum intel_engine_id id; > + int worst_min_size = 0, num_regs = 0; > + u16 tmp = 0; > + > + /* > + * If every single engine-instance suffered a failure in quick > succession but > + * were all unrelated, then a burst of multiple error-capture events > would dump > + * registers for every one engine instance, one at a time. In this > case, GuC > + * would even dump the global-registers repeatedly. > + * > + * For each engine instance, there would be 1 x > guc_state_capture_group_t output > + * followed by 3 x guc_state_capture_t lists. The latter is how the > register > + * dumps are split across different register types (where the '3' are > global vs class > + * vs instance). Finally, let's multiply the whole thing by 3x (just so > we are > + * not limited to just 1 round of data in a worst case full register > dump log) > + * > + * NOTE: intel_guc_log that allocates the log buffer would round this > size up to > + * a power of two. > + */ > + > + for_each_engine(engine, gt, id) { > + worst_min_size += sizeof(struct > guc_state_capture_group_header_t) + > + (3 * sizeof(struct > guc_state_capture_header_t)); > + > + if (!guc_capture_list_count(guc, 0, > GUC_CAPTURE_LIST_TYPE_GLOBAL, 0, )) > + num_regs
Re: [Intel-gfx] [PATCH v5 06/10] drm/i915/guc: Update GuC's log-buffer-state access for error capture.
As per the rev 5 CI results between this patch and patch7, i have introduced a lockdep splat bug, i shall fix that in the next rev. ...alan On Wed, 2022-01-26 at 02:48 -0800, Alan Previn wrote: > GuC log buffer regions for debug-log-events, crash-dumps and > error-state-capture are all a single bo allocation that includes > the guc_log_buffer_state structures. > > Since the error-capture region is accessed with high priority at non- > deterministic times (as part of gpu coredump) while the debug-log-event > region is populated and accessed with different priorities, timings and > consumers, let's split out separate locks for buffer-state accesses > of each region. > > Also, ensure a global mapping is made up front for the entire bo > throughout GuC operation so that dynamic mapping and unmapping isn't > required for error capture log access if relay-logging isn't running. > > Additionally, while here, make some readibility improvements: > 1. change previous function names with "capture_logs" to >"copy_debug_logs" to help make the distinction clearer. > 2. Update the guc log region mapping comments to order them >according to the enum definition as per the GuC interface. >
[Intel-gfx] [PATCH v5 06/10] drm/i915/guc: Update GuC's log-buffer-state access for error capture.
GuC log buffer regions for debug-log-events, crash-dumps and error-state-capture are all a single bo allocation that includes the guc_log_buffer_state structures. Since the error-capture region is accessed with high priority at non- deterministic times (as part of gpu coredump) while the debug-log-event region is populated and accessed with different priorities, timings and consumers, let's split out separate locks for buffer-state accesses of each region. Also, ensure a global mapping is made up front for the entire bo throughout GuC operation so that dynamic mapping and unmapping isn't required for error capture log access if relay-logging isn't running. Additionally, while here, make some readibility improvements: 1. change previous function names with "capture_logs" to "copy_debug_logs" to help make the distinction clearer. 2. Update the guc log region mapping comments to order them according to the enum definition as per the GuC interface. Signed-off-by: Alan Previn --- drivers/gpu/drm/i915/gt/uc/intel_guc.h| 2 + .../gpu/drm/i915/gt/uc/intel_guc_capture.c| 47 ++ .../gpu/drm/i915/gt/uc/intel_guc_capture.h| 1 + drivers/gpu/drm/i915/gt/uc/intel_guc_log.c| 135 +++--- drivers/gpu/drm/i915/gt/uc/intel_guc_log.h| 16 ++- 5 files changed, 141 insertions(+), 60 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h index 4e819853ec2e..be1ad7fa2bf8 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h @@ -34,6 +34,8 @@ struct intel_guc { struct intel_uc_fw fw; /** @log: sub-structure containing GuC log related data and objects */ struct intel_guc_log log; + /** @log_state: states and locks for each subregion of GuC's log buffer */ + struct intel_guc_log_stats log_state[GUC_MAX_LOG_BUFFER]; /** @ct: the command transport communication channel */ struct intel_guc_ct ct; /** @slpc: sub-structure containing SLPC related data and objects */ diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c index 70d2ee841289..e7f99d051636 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c @@ -651,6 +651,53 @@ int intel_guc_capture_prep_lists(struct intel_guc *guc, struct guc_ads *blob, u3 return PAGE_ALIGN(alloc_size); } +#define GUC_CAPTURE_OVERBUFFER_MULTIPLIER 3 +int intel_guc_capture_output_min_size_est(struct intel_guc *guc) +{ + struct intel_gt *gt = guc_to_gt(guc); + struct intel_engine_cs *engine; + enum intel_engine_id id; + int worst_min_size = 0, num_regs = 0; + u16 tmp = 0; + + /* +* If every single engine-instance suffered a failure in quick succession but +* were all unrelated, then a burst of multiple error-capture events would dump +* registers for every one engine instance, one at a time. In this case, GuC +* would even dump the global-registers repeatedly. +* +* For each engine instance, there would be 1 x guc_state_capture_group_t output +* followed by 3 x guc_state_capture_t lists. The latter is how the register +* dumps are split across different register types (where the '3' are global vs class +* vs instance). Finally, let's multiply the whole thing by 3x (just so we are +* not limited to just 1 round of data in a worst case full register dump log) +* +* NOTE: intel_guc_log that allocates the log buffer would round this size up to +* a power of two. +*/ + + for_each_engine(engine, gt, id) { + worst_min_size += sizeof(struct guc_state_capture_group_header_t) + + (3 * sizeof(struct guc_state_capture_header_t)); + + if (!guc_capture_list_count(guc, 0, GUC_CAPTURE_LIST_TYPE_GLOBAL, 0, )) + num_regs += tmp; + + if (!guc_capture_list_count(guc, 0, GUC_CAPTURE_LIST_TYPE_ENGINE_CLASS, + engine->class, )) { + num_regs += tmp; + } + if (!guc_capture_list_count(guc, 0, GUC_CAPTURE_LIST_TYPE_ENGINE_INSTANCE, + engine->class, )) { + num_regs += tmp; + } + } + + worst_min_size += (num_regs * sizeof(struct guc_mmio_reg)); + + return (worst_min_size * GUC_CAPTURE_OVERBUFFER_MULTIPLIER); +} + void intel_guc_capture_destroy(struct intel_guc *guc) { guc_capture_clear_ext_regs(guc->capture.priv->reglists); diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h index 6b5594ca529d..4d3e5221128c 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h +++