On 23/03/2022 13:19, Tvrtko Ursulin wrote:

Hi,

On 21/03/2022 16:45, Alan Previn wrote:
This series:
   1. Enables support of GuC to report error-state-capture
      using a list of MMIO registers the driver registers
      and GuC will dump, log and notify right before a GuC
      triggered engine-reset event.
   2. Updates the ADS blob creation to register said lists
      of global, engine class and engine instance registers
      with GuC.
   3. Defines tables of register lists that are global or
      engine class or engine instance in scope.
   4. Updates usage and buffer-state data for the regions
      of the shared GuC log-buffer to accomdate both
      the existing relay logging of general debug logs
      along with the new error state capture usage.
   5. Using a pool of preallocated memory, provide ability
      to extract and format the GuC reported register-capture
      data into chunks consistent with existing i915 error-
      state collection flows and structures.
   6. Connects the i915_gpu_coredump reporting function
      to the GuC error capture module to print all GuC
      error state capture dumps that is reported.

Story is finished with this series and we have everything on feature parity? Intel_error_decode handles it fine?

Would you have an example error capture at hand, I am curious how it looks?

Ping? I need to write a meaningful pull request text in ~2 weeks time when submitting the feature to drm-next so looking for a high level statement as described - does it complete the work and is error capture now at feature parity and all usual userspace tools to access it handle it?

Regards,

Tvrtko




Regards,

Tvrtko


This is the 13th rev of this series where the first 3 revs
are RFC

Prior receipts of rvb's:
   - Patch #2, #3, #4, #5, #10, #11, #12, #13 have received
     R-v-b's from Umesh Nerlige Ramappa <umesh.nerlige.rama...@intel.com>
   - Patch #1, #6, #7, #8, #9 has received an R-v-b from Matthew Brost
     <matthew.br...@intel.com>. NOTE: some of these came in on the
     trybot series. https://patchwork.freedesktop.org/series/100831/

Changes from prior revs:
   v13:- Fixing register list definition styling as per Jani's request.
   v12:- Re-sending it because previous revs only got to intel-gfx,
         and only cover letter was in dri-devel. Also rebased again.
   v11:- Rebase again on latest drm-tip to fix merge error.
   v10:- Rebase on latest drm-tip again. Fix a number of checkpatch
         warnings and an error Reported-by: kernel test robot <l...@intel.com>.
   v9: - Rebase on latest drm-tip to solve CI merge-build error.
   v8: - Fix a bug found by CI in rev7: Create a cached ADS
         capture list for null-header like the other lists.
       - Fixed a bug on the ggtt offset calculation in the
         ADS population loop. Thanks to Matt Brost.
       - Change the storage uses for initial allocation and
         caching of the ADS register lists so we only store
         a regular pointer instead of file handle.
       - Multiple improvements on code styling, variable names,
         comments and code reduction from Umesh suggestions
         across multiple patches.

   v7: - Rebased on lastest drm_tip that has the ADS now using
         shmem based ads_blob_write utilities. Stress test
         was performed with this patch included to fix a
         legacy bug:
         https://patchwork.freedesktop.org/series/100768/

   v6: - In patch #1, ADS reg-list population, we now alloc
         regular memory to create the lists and cache them for
         simpler and faster use by GuC ADS module at init,
         suspend-resume and reset cycles. This was in response
         to review comments from Lucas De Marchi that also
         wanted to ensure the GuC ADS module owns the final
         copying into the ADS phyical memory.
       - Thanks to Jani Nikula for pointing out that patch #2
         and #3 should ensure static tables as constant and
         dynamic lists should be allocated and cached but
         attached to the GT level for the case of multiple
         cards with different fusings for steered registers.
         These are addressed now along with multiple code
         style fixups (thanks to review comment from Umesh)
         and splitting the steered register list generation
         as a seperate patch.
       - The extraction functionality, Patch #10 and #11 (was
         patch #7), has fixed all of Umesh's review comments
         related to the code styling. Additionally, it was
         discovered during stress tests that the extraction
         function could be called by the ct processing thread
         at the same time as the start of a GT reset event.
         Thus, a redesign was done whereby the linked list of
         processed capture-output-nodes are allocated up
         front and reused throughout the driver's life to
         ensure no memory locks are taken during extraction.
       - For patch #6 (now 7, 8 and 9), updates to
         intel_guc_log was split into smaller chunks and the
         log_state structure was returned back to inside of
         the intel_guc_log struct as opposed to the
         intel_guc struct in prior rev. This is in response
         to review comments by Matt Brost.
       - #Patch 13 (previously #10) is mostly identical but
         addresses all of the code styling comments reviews
         from Umesh.
   v5: - Added Gen9->Gen11 register list for CI coverage that
         included Gen9 with GuC submission.
       - Redesigned the extraction of the GuC error-capture
         dumps by grouping them into complete per-engine-reset
         nodes. Complete here means each node includes the
         global, engine-class and engine-instance register
         lists in a single structure.
       - Extraction is decoupled from the print-out. We now
         do the extraction immediately when receiving the
         G2H for error-capture notification. A link list of
         nodes is maintained with a FIFO based threshold
         while awaiting retrieval from i915_gpu_coredump's
         capture_engine function.
       - Added new plumbing through the i915_gpu_coredump
         allocation and capture functions to include a flag
         that is used indicate that GuC had triggered the
         reset. This new plumbing guarantees an exact match
         from i915_gpu_coredump's per-engine vma recording
         and node-retrieval from the guc-error-capture.
       - Broke the coredump gt_global capture and recording
         functions into smaller subsets so we can reuse as
         much of the existing legacy register reading + printing
         functions and only rely on GuC error-capture for
         the smaller subset of registers that are tied to
         engine workload execution.
       - Updated the register list to follow the legacy execlist
         format of printout.
   v4:
       - Rebased on latest drm-tip that has been merged with the
         support of GuC firmware version 69.0.3 that is required
         for GuC error-state-catpure to work.
       - Added register list for DG2 which is the same as XE_LP
         except an additional steering register set.
       - Fixed a bug in the end of capture parsing loop in
         intel_guc_capture_out_print_next_group that was not
         properly comparing the engine-instance and engine-
         class being parsed against the one that triggered
         the i915_gpu_coredump.
   v3:
       - Fixed all review comments from rev2 except the following:
           - Michal Wajdeczko proposed adding a seperate function
             to lookup register string nameslookup (based on offset)
             but decided against it because of offset conflicts
             and the current table layout is easier to maintain.
           - Last set of checkpatch errors pertaining to "COMPLEX
             MACROS" should be fixed on next rev.
       - Abstracted internal-to-guc-capture information into a new
         __guc_state_capture_priv structure that allows the exclusion
         of intel_guc.h and intel_guc_fwif.h from intel_guc_capture.h.
         Now, only the first 2 patches have a wider build time
         impact because of the changes to intel_guc_fwif.h but
         subsequent changes to guc-capture internal structures
         or firmware interfaces used solely by guc-capture module
         shoudn't impact the rest of the driver build.
       - Added missing Gen12LP registers and added slice+subslice
         indices when reporting extended steered registers.
       - Add additional checks to ensure that the GuC reported
         error capture information matches the i915_gpu_coredump
         that is being printed before we print out the corresponding
         VMA dumps such as the batch buffer.
    v2:
       - Ignore - failed CI retest.

Alan Previn (13):
   drm/i915/guc: Update GuC ADS size for error capture lists
   drm/i915/guc: Add XE_LP static registers for GuC error capture.
   drm/i915/guc: Add XE_LP steered register lists support
   drm/i915/guc: Add DG2 registers for GuC error state capture.
   drm/i915/guc: Add Gen9 registers for GuC error state capture.
   drm/i915/guc: Add GuC's error state capture output structures.
   drm/i915/guc: Update GuC-log relay function names
   drm/i915/guc: Add capture region into intel_guc_log
   drm/i915/guc: Check sizing of guc_capture output
   drm/i915/guc: Extract GuC error capture lists on G2H notification.
   drm/i915/guc: Pre-allocate output nodes for extraction
   drm/i915/guc: Plumb GuC-capture into gpu_coredump
   drm/i915/guc: Print the GuC error capture output register list.

  drivers/gpu/drm/i915/Makefile                 |    1 +
  drivers/gpu/drm/i915/gt/intel_engine_cs.c     |    4 +-
  .../drm/i915/gt/intel_execlists_submission.c  |    4 +-
  drivers/gpu/drm/i915/gt/intel_reset.c         |    2 +-
  .../gpu/drm/i915/gt/uc/abi/guc_actions_abi.h  |    7 +
  drivers/gpu/drm/i915/gt/uc/guc_capture_fwif.h |  218 +++
  drivers/gpu/drm/i915/gt/uc/intel_guc.c        |   13 +-
  drivers/gpu/drm/i915/gt/uc/intel_guc.h        |   12 +-
  drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c    |  127 +-
  .../gpu/drm/i915/gt/uc/intel_guc_capture.c    | 1655 +++++++++++++++++
  .../gpu/drm/i915/gt/uc/intel_guc_capture.h    |   33 +
  drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |   14 +-
  drivers/gpu/drm/i915/gt/uc/intel_guc_log.c    |  127 +-
  drivers/gpu/drm/i915/gt/uc/intel_guc_log.h    |    7 +-
  .../gpu/drm/i915/gt/uc/intel_guc_submission.c |   18 +-
  drivers/gpu/drm/i915/i915_debugfs.c           |    3 +-
  drivers/gpu/drm/i915/i915_gpu_error.c         |  282 ++-
  drivers/gpu/drm/i915/i915_gpu_error.h         |   35 +-
  18 files changed, 2379 insertions(+), 183 deletions(-)
  create mode 100644 drivers/gpu/drm/i915/gt/uc/guc_capture_fwif.h
  create mode 100644 drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
  create mode 100644 drivers/gpu/drm/i915/gt/uc/intel_guc_capture.h

Reply via email to