Re: [Intel-gfx] [RFC 7/7] drm/i915/guc: Print the GuC error capture output register list.

2022-01-13 Thread Teres Alexis, Alan Previn
> This made me think guc engine reset notification is a "handshake" 
> operation and not a pure notification? Does it imply GuC will wait for
> i915 to reply what to do next meaning it won't continue to execute ContextA 
> before i915 replies to engine reset notification?

> If so that would resolve my concern.

Yes: the GuC-to-host action is used to report a hung context to the VF host
if an engine reset was triggered and a hung context was detected during the
engine reset. That context is automatically put into a non-runnable state.
Apologies for the delay - some task IRQs.

...alan

-Original Message-
From: Tvrtko Ursulin  
Sent: Tuesday, January 11, 2022 2:09 AM
To: Teres Alexis, Alan Previn ; Brost, 
Matthew 
Cc: intel-gfx@lists.freedesktop.org
Subject: Re: [Intel-gfx] [RFC 7/7] drm/i915/guc: Print the GuC error capture 
output register list.


On 10/01/2022 18:19, Teres Alexis, Alan Previn wrote:
> 
> On Mon, 2022-01-10 at 08:07 +, Tvrtko Ursulin wrote:
>> On 07/01/2022 17:03, Teres Alexis, Alan Previn wrote:
>>> On Fri, 2022-01-07 at 09:03 +, Tvrtko Ursulin wrote:
>>>> On 06/01/2022 18:33, Teres Alexis, Alan Previn wrote:
>>>>> On Thu, 2022-01-06 at 09:38 +, Tvrtko Ursulin wrote:
>>>>>> On 05/01/2022 17:30, Teres Alexis, Alan Previn wrote:
>>>>>>> On Tue, 2022-01-04 at 13:56 +, Tvrtko Ursulin wrote:
>>>>>>>>> The flow of events are as below:
>>>>>>>>>
>>>>>>>>> 1. guc sends notification that an error capture was done and ready to 
>>>>>>>>> take.
>>>>>>>>>   - at this point we copy the guc error captured dump into an 
>>>>>>>>> interim store
>>>>>>>>> (larger buffer that can hold multiple captures).
>>>>>>>>> 2. guc sends notification that a context was reset (after the prior)
>>>>>>>>>   - this triggers a call to i915_gpu_coredump with the 
>>>>>>>>> corresponding engine-mask
>>>>>>>>>from the context that was reset
>>>>>>>>>   - i915_gpu_coredump proceeds to gather entire gpu state 
>>>>>>>>> including driver state,
>>>>>>>>>global gpu state, engine state, context vmas and also 
>>>>>>>>> engine registers. For the
>>>>>>>>>engine registers now call into the guc_capture code 
>>>>>>>>> which merely needs to verify
>>>>>>>>> that GuC had already done a step 1 and we have data ready to 
>>>>>>>>> be parsed.
>>>>>>>>
>>>>>>>> What about the time between the actual reset and receiving the 
>>>>>>>> context reset notification? Latter will contain 
>>>>>>>> intel_context->guc_id - can that be re-assigned or "retired" in 
>>>>>>>> between the two and so cause problems for matching the correct (or 
>>>>>>>> any) vmas?
>>>>>>>>
>>>>>>> No, it cannot, because it's only after the context reset
>>>>>>> notification that i915 starts taking action against that context -
>>>>>>> and even that happens after the i915_gpu_coredump
>>>>>>> (engine-mask-of-context) happens.
>>>>>>> That's what I've observed in the code flow.
>>>>>>
>>>>>> The fact it is "only after" is exactly why I asked.
>>>>>>
>>>>>> Reset notification is in a CT queue with other stuff, right? So 
>>>>>> can be some unrelated time after the actual reset. Could have 
>>>>>> context be retired in the meantime and guc_id released is the question.
>>>>>>
>>>>>> Because i915 has no idea there was a reset until this delayed 
>>>>>> message comes over, but it could see user interrupt signaling end 
>>>>>> of batch, after the reset has happened, unbeknown to i915, right?
>>>>>>
>>>>>> Perhaps the answer is guc_id cannot be released via the request 
>>>>>> retire flows. Or GuC signaling release of guc_id is a thing, 
>>>>>> which is then ordered via the same CT buffer.
>>>>>>
>>>>>> I don't know, just asking.
>>>>>>
>>>>> As long as the con

Re: [Intel-gfx] [RFC 7/7] drm/i915/guc: Print the GuC error capture output register list.

2022-01-11 Thread Tvrtko Ursulin



On 10/01/2022 18:19, Teres Alexis, Alan Previn wrote:


On Mon, 2022-01-10 at 08:07 +, Tvrtko Ursulin wrote:

On 07/01/2022 17:03, Teres Alexis, Alan Previn wrote:

On Fri, 2022-01-07 at 09:03 +, Tvrtko Ursulin wrote:

On 06/01/2022 18:33, Teres Alexis, Alan Previn wrote:

On Thu, 2022-01-06 at 09:38 +, Tvrtko Ursulin wrote:

On 05/01/2022 17:30, Teres Alexis, Alan Previn wrote:

On Tue, 2022-01-04 at 13:56 +, Tvrtko Ursulin wrote:

The flow of events are as below:

1. guc sends notification that an error capture was done and ready to take.
- at this point we copy the guc error captured dump into an interim 
store
  (larger buffer that can hold multiple captures).
2. guc sends notification that a context was reset (after the prior)
- this triggers a call to i915_gpu_coredump with the corresponding 
engine-mask
   from the context that was reset
- i915_gpu_coredump proceeds to gather entire gpu state including 
driver state,
   global gpu state, engine state, context vmas and also engine 
registers. For the
   engine registers now call into the guc_capture code which merely 
needs to verify
  that GuC had already done a step 1 and we have data ready to be 
parsed.


What about the time between the actual reset and receiving the context
reset notification? Latter will contain intel_context->guc_id - can that
be re-assigned or "retired" in between the two and so cause problems for
matching the correct (or any) vmas?


No, it cannot, because it's only after the context reset notification that
i915 starts taking action against that context - and even that happens
after the i915_gpu_coredump (engine-mask-of-context) happens.
That's what I've observed in the code flow.


The fact it is "only after" is exactly why I asked.

Reset notification is in a CT queue with other stuff, right? So can be
some unrelated time after the actual reset. Could have context be
retired in the meantime and guc_id released is the question.

Because i915 has no idea there was a reset until this delayed message
comes over, but it could see user interrupt signaling end of batch,
after the reset has happened, unbeknown to i915, right?

Perhaps the answer is guc_id cannot be released via the request retire
flows. Or GuC signaling release of guc_id is a thing, which is then
ordered via the same CT buffer.

I don't know, just asking.


As long as the context is pinned, the guc-id won't be re-assigned. After a
bit of offline brain-dump from John Harrison, there are many factors that
can keep the context pinned (refcounts), including new or outstanding
requests. So a guc-id can't get re-assigned between a capture-notify and a
context-reset even if that outstanding request is the only refcount left,
since it would still be considered outstanding by the driver. I also think
we may be talking past each other in the sense that the guc-id is something
the driver assigns to a context being pinned and only the driver can
un-assign it (both assigning and un-assigning are via H2G interactions).
I get the sense you are assuming the GuC can un-assign the guc-ids on its
own - which isn't the case. Apologies if I mis-assumed.


I did not think GuC can re-assign ce->guc_id. I asked about request/context 
complete/retire happening before reset/capture notification is received.

That would be the time window between the last intel_context_put, so last 
i915_request_put from retire, at which point AFAICT GuC code releases the 
guc_id. Execution timeline like:


|-- rq1 --|-- rq2 --|
   ^ engine reset   ^ rq2, rq1 retire, guc id released
                        ^ GuC reset notify received - guc_id not known any more?

You are saying something is guaranteed to be holding onto the guc_id at the
point of receiving the notification? "There are many factors that can keep
the context pinned" - what is it in this case? Or the case cannot happen?


Regards,

Tvrtko


above chart is incorrect: GuC reset notification is sent from GuC to host 
before it sends the engine reset notification


Meaning? And how does it relate to actual reset vs retire vs reset
notification (sent or received)?

Plus, I thought so far we were talking about reset notification and
capture notification, so what you say here now extra confuses me without
providing an answer to my question.

Regards,

Tvrtko

So I think the confusion at this point of the conversation is because, in
the prior discussion, the focus was on the printout of the error capture
status (which happens when the user triggers the debugfs to dump). In your
previous reply, you provided a timeline that references the engine-reset,
request/retire and reset-notification events, which are separate from the
print-out event.

So, a recap of the timeline of events that highlights when things occur,
including the printout:
(apologies for 

Re: [Intel-gfx] [RFC 7/7] drm/i915/guc: Print the GuC error capture output register list.

2022-01-10 Thread Teres Alexis, Alan Previn

On Mon, 2022-01-10 at 08:07 +, Tvrtko Ursulin wrote:
> On 07/01/2022 17:03, Teres Alexis, Alan Previn wrote:
> > On Fri, 2022-01-07 at 09:03 +, Tvrtko Ursulin wrote:
> > > On 06/01/2022 18:33, Teres Alexis, Alan Previn wrote:
> > > > On Thu, 2022-01-06 at 09:38 +, Tvrtko Ursulin wrote:
> > > > > On 05/01/2022 17:30, Teres Alexis, Alan Previn wrote:
> > > > > > On Tue, 2022-01-04 at 13:56 +, Tvrtko Ursulin wrote:
> > > > > > > > The flow of events are as below:
> > > > > > > > 
> > > > > > > > 1. guc sends notification that an error capture was done and 
> > > > > > > > ready to take.
> > > > > > > > - at this point we copy the guc error captured dump 
> > > > > > > > into an interim store
> > > > > > > >   (larger buffer that can hold multiple captures).
> > > > > > > > 2. guc sends notification that a context was reset (after the 
> > > > > > > > prior)
> > > > > > > > - this triggers a call to i915_gpu_coredump with the 
> > > > > > > > corresponding engine-mask
> > > > > > > >   from the context that was reset
> > > > > > > > - i915_gpu_coredump proceeds to gather entire gpu state 
> > > > > > > > including driver state,
> > > > > > > >   global gpu state, engine state, context vmas and 
> > > > > > > > also engine registers. For the
> > > > > > > >   engine registers now call into the guc_capture 
> > > > > > > > code which merely needs to verify
> > > > > > > >   that GuC had already done a step 1 and we have data 
> > > > > > > > ready to be parsed.
> > > > > > > 
> > > > > > > What about the time between the actual reset and receiving the 
> > > > > > > context
> > > > > > > reset notification? Latter will contain intel_context->guc_id - 
> > > > > > > can that
> > > > > > > be re-assigned or "retired" in between the two and so cause 
> > > > > > > problems for
> > > > > > > matching the correct (or any) vmas?
> > > > > > > 
> > > > > > No, it cannot, because it's only after the context reset
> > > > > > notification that i915 starts taking action against that context
> > > > > > - and even that happens after the i915_gpu_coredump
> > > > > > (engine-mask-of-context) happens.
> > > > > > That's what I've observed in the code flow.
> > > > > 
> > > > > The fact it is "only after" is exactly why I asked.
> > > > > 
> > > > > Reset notification is in a CT queue with other stuff, right? So can be
> > > > > some unrelated time after the actual reset. Could have context be
> > > > > retired in the meantime and guc_id released is the question.
> > > > > 
> > > > > Because i915 has no idea there was a reset until this delayed message
> > > > > comes over, but it could see user interrupt signaling end of batch,
> > > > > after the reset has happened, unbeknown to i915, right?
> > > > > 
> > > > > Perhaps the answer is guc_id cannot be released via the request retire
> > > > > flows. Or GuC signaling release of guc_id is a thing, which is then
> > > > > ordered via the same CT buffer.
> > > > > 
> > > > > I don't know, just asking.
> > > > > 
> > > > As long as the context is pinned, the guc-id won't be re-assigned.
> > > > After a bit of offline brain-dump from John Harrison, there are many
> > > > factors that can keep the context pinned (refcounts), including new
> > > > or outstanding requests. So a guc-id can't get re-assigned between a
> > > > capture-notify and a context-reset even if that outstanding request
> > > > is the only refcount left, since it would still be considered
> > > > outstanding by the driver. I also think we may be talking past each
> > > > other in the sense that the guc-id is something the driver assigns
> > > > to a context being pinned and only the driver can un-assign it (both
> > > > assigning and un-assigning are via H2G interactions).
> > > > I get the sense you are assuming the GuC can un-assign the guc-ids
> > > > on its own - which isn't the case. Apologies if I mis-assumed.
> > > 
> > > I did not think GuC can re-assign ce->guc_id. I asked about 
> > > request/context complete/retire happening before reset/capture 
> > > notification is received.
> > > 
> > > That would be the time window between the last intel_context_put, so last 
> > > i915_request_put from retire, at which point AFAICT GuC code releases the 
> > > guc_id. Execution timeline like:
> > > 
> > > |-- rq1 --|-- rq2 --|
> > >    ^ engine reset   ^ rq2, rq1 retire, guc id released
> > >                        ^ GuC reset notify received - guc_id not known any more?
> > >
> > > You are saying something is guaranteed to be holding onto the guc_id at 
> > > the point of receiving the notification? "There are many factors that can 
> > > keep the context pinned" - what is it in this case? Or the case cannot 
> > > happen?
> > > 
> > > Regards,
> > > 
> > > Tvrtko
> > 
> > 

Re: [Intel-gfx] [RFC 7/7] drm/i915/guc: Print the GuC error capture output register list.

2022-01-10 Thread Tvrtko Ursulin



On 07/01/2022 17:03, Teres Alexis, Alan Previn wrote:


On Fri, 2022-01-07 at 09:03 +, Tvrtko Ursulin wrote:

On 06/01/2022 18:33, Teres Alexis, Alan Previn wrote:

On Thu, 2022-01-06 at 09:38 +, Tvrtko Ursulin wrote:

On 05/01/2022 17:30, Teres Alexis, Alan Previn wrote:

On Tue, 2022-01-04 at 13:56 +, Tvrtko Ursulin wrote:

The flow of events are as below:

1. guc sends notification that an error capture was done and ready to take.
- at this point we copy the guc error captured dump into an interim 
store
  (larger buffer that can hold multiple captures).
2. guc sends notification that a context was reset (after the prior)
- this triggers a call to i915_gpu_coredump with the corresponding 
engine-mask
  from the context that was reset
- i915_gpu_coredump proceeds to gather entire gpu state including 
driver state,
  global gpu state, engine state, context vmas and also engine 
registers. For the
  engine registers now call into the guc_capture code which merely 
needs to verify
  that GuC had already done a step 1 and we have data ready to be 
parsed.


What about the time between the actual reset and receiving the context
reset notification? Latter will contain intel_context->guc_id - can that
be re-assigned or "retired" in between the two and so cause problems for
matching the correct (or any) vmas?


No, it cannot, because it's only after the context reset notification that
i915 starts taking action against that context - and even that happens
after the i915_gpu_coredump (engine-mask-of-context) happens.
That's what I've observed in the code flow.


The fact it is "only after" is exactly why I asked.

Reset notification is in a CT queue with other stuff, right? So can be
some unrelated time after the actual reset. Could have context be
retired in the meantime and guc_id released is the question.

Because i915 has no idea there was a reset until this delayed message
comes over, but it could see user interrupt signaling end of batch,
after the reset has happened, unbeknown to i915, right?

Perhaps the answer is guc_id cannot be released via the request retire
flows. Or GuC signaling release of guc_id is a thing, which is then
ordered via the same CT buffer.

I don't know, just asking.


As long as the context is pinned, the guc-id won't be re-assigned. After a
bit of offline brain-dump from John Harrison, there are many factors that
can keep the context pinned (refcounts), including new or outstanding
requests. So a guc-id can't get re-assigned between a capture-notify and a
context-reset even if that outstanding request is the only refcount left,
since it would still be considered outstanding by the driver. I also think
we may be talking past each other in the sense that the guc-id is something
the driver assigns to a context being pinned and only the driver can
un-assign it (both assigning and un-assigning are via H2G interactions).
I get the sense you are assuming the GuC can un-assign the guc-ids on its
own - which isn't the case. Apologies if I mis-assumed.


I did not think GuC can re-assign ce->guc_id. I asked about request/context 
complete/retire happening before reset/capture notification is received.

That would be the time window between the last intel_context_put, so last 
i915_request_put from retire, at which point AFAICT GuC code releases the 
guc_id. Execution timeline like:


|-- rq1 --|-- rq2 --|
   ^ engine reset   ^ rq2, rq1 retire, guc id released
                        ^ GuC reset notify received - guc_id not known any more?

You are saying something is guaranteed to be holding onto the guc_id at the
point of receiving the notification? "There are many factors that can keep
the context pinned" - what is it in this case? Or the case cannot happen?


Regards,

Tvrtko


above chart is incorrect: GuC reset notification is sent from GuC to host 
before it sends the engine reset notification


Meaning? And how does it relate to actual reset vs retire vs reset 
notification (sent or received)?


Plus, I thought so far we were talking about reset notification and 
capture notification, so what you say here now extra confuses me without 
providing an answer to my question.


Regards,

Tvrtko


Re: [Intel-gfx] [RFC 7/7] drm/i915/guc: Print the GuC error capture output register list.

2022-01-07 Thread Teres Alexis, Alan Previn

On Fri, 2022-01-07 at 09:03 +, Tvrtko Ursulin wrote:
> On 06/01/2022 18:33, Teres Alexis, Alan Previn wrote:
> > On Thu, 2022-01-06 at 09:38 +, Tvrtko Ursulin wrote:
> > > On 05/01/2022 17:30, Teres Alexis, Alan Previn wrote:
> > > > On Tue, 2022-01-04 at 13:56 +, Tvrtko Ursulin wrote:
> > > > > > The flow of events are as below:
> > > > > > 
> > > > > > 1. guc sends notification that an error capture was done and ready 
> > > > > > to take.
> > > > > > - at this point we copy the guc error captured dump into an 
> > > > > > interim store
> > > > > >   (larger buffer that can hold multiple captures).
> > > > > > 2. guc sends notification that a context was reset (after the prior)
> > > > > > - this triggers a call to i915_gpu_coredump with the 
> > > > > > corresponding engine-mask
> > > > > >  from the context that was reset
> > > > > > - i915_gpu_coredump proceeds to gather entire gpu state 
> > > > > > including driver state,
> > > > > >  global gpu state, engine state, context vmas and also 
> > > > > > engine registers. For the
> > > > > >  engine registers now call into the guc_capture code 
> > > > > > which merely needs to verify
> > > > > >   that GuC had already done a step 1 and we have data ready to 
> > > > > > be parsed.
> > > > > 
> > > > > What about the time between the actual reset and receiving the context
> > > > > reset notification? Latter will contain intel_context->guc_id - can 
> > > > > that
> > > > > be re-assigned or "retired" in between the two and so cause problems 
> > > > > for
> > > > > matching the correct (or any) vmas?
> > > > > 
> > > > No, it cannot, because it's only after the context reset
> > > > notification that i915 starts taking action against that context -
> > > > and even that happens after the i915_gpu_coredump
> > > > (engine-mask-of-context) happens.
> > > > That's what I've observed in the code flow.
> > > 
> > > The fact it is "only after" is exactly why I asked.
> > > 
> > > Reset notification is in a CT queue with other stuff, right? So can be
> > > some unrelated time after the actual reset. Could have context be
> > > retired in the meantime and guc_id released is the question.
> > > 
> > > Because i915 has no idea there was a reset until this delayed message
> > > comes over, but it could see user interrupt signaling end of batch,
> > > after the reset has happened, unbeknown to i915, right?
> > > 
> > > Perhaps the answer is guc_id cannot be released via the request retire
> > > flows. Or GuC signaling release of guc_id is a thing, which is then
> > > ordered via the same CT buffer.
> > > 
> > > I don't know, just asking.
> > > 
> > As long as the context is pinned, the guc-id won't be re-assigned.
> > After a bit of offline brain-dump from John Harrison, there are many
> > factors that can keep the context pinned (refcounts), including new or
> > outstanding requests. So a guc-id can't get re-assigned between a
> > capture-notify and a context-reset even if that outstanding request is
> > the only refcount left, since it would still be considered outstanding
> > by the driver. I also think we may be talking past each other in the
> > sense that the guc-id is something the driver assigns to a context
> > being pinned and only the driver can un-assign it (both assigning and
> > un-assigning are via H2G interactions).
> > I get the sense you are assuming the GuC can un-assign the guc-ids on
> > its own - which isn't the case. Apologies if I mis-assumed.
> 
> I did not think GuC can re-assign ce->guc_id. I asked about request/context 
> complete/retire happening before reset/capture notification is received.
> 
> That would be the time window between the last intel_context_put, so last 
> i915_request_put from retire, at which point AFAICT GuC code releases the 
> guc_id. Execution timeline like:
> 
> |-- rq1 --|-- rq2 --|
>    ^ engine reset   ^ rq2, rq1 retire, guc id released
>                        ^ GuC reset notify received - guc_id not known any more?
> You are saying something is guaranteed to be holding onto the guc_id at the 
> point of receiving the notification? "There are many factors that can keep 
> the context pinned" - what is it in this case? Or the case cannot happen?
> 
> Regards,
> 
> Tvrtko

above chart is incorrect: GuC reset notification is sent from GuC to host 
before it sends the engine reset notification 
...alan







Re: [Intel-gfx] [RFC 7/7] drm/i915/guc: Print the GuC error capture output register list.

2022-01-07 Thread Tvrtko Ursulin



On 06/01/2022 18:33, Teres Alexis, Alan Previn wrote:


On Thu, 2022-01-06 at 09:38 +, Tvrtko Ursulin wrote:

On 05/01/2022 17:30, Teres Alexis, Alan Previn wrote:

On Tue, 2022-01-04 at 13:56 +, Tvrtko Ursulin wrote:

The flow of events are as below:

1. guc sends notification that an error capture was done and ready to take.
- at this point we copy the guc error captured dump into an interim 
store
  (larger buffer that can hold multiple captures).
2. guc sends notification that a context was reset (after the prior)
- this triggers a call to i915_gpu_coredump with the corresponding 
engine-mask
 from the context that was reset
- i915_gpu_coredump proceeds to gather entire gpu state including 
driver state,
 global gpu state, engine state, context vmas and also engine 
registers. For the
 engine registers now call into the guc_capture code which merely 
needs to verify
  that GuC had already done a step 1 and we have data ready to be 
parsed.


What about the time between the actual reset and receiving the context
reset notification? Latter will contain intel_context->guc_id - can that
be re-assigned or "retired" in between the two and so cause problems for
matching the correct (or any) vmas?


No, it cannot, because it's only after the context reset notification that
i915 starts taking action against that context - and even that happens
after the i915_gpu_coredump (engine-mask-of-context) happens.
That's what I've observed in the code flow.


The fact it is "only after" is exactly why I asked.

Reset notification is in a CT queue with other stuff, right? So can be
some unrelated time after the actual reset. Could have context be
retired in the meantime and guc_id released is the question.

Because i915 has no idea there was a reset until this delayed message
comes over, but it could see user interrupt signaling end of batch,
after the reset has happened, unbeknown to i915, right?

Perhaps the answer is guc_id cannot be released via the request retire
flows. Or GuC signaling release of guc_id is a thing, which is then
ordered via the same CT buffer.

I don't know, just asking.


As long as the context is pinned, the guc-id won't be re-assigned. After a
bit of offline brain-dump from John Harrison, there are many factors that
can keep the context pinned (refcounts), including new or outstanding
requests. So a guc-id can't get re-assigned between a capture-notify and a
context-reset even if that outstanding request is the only refcount left,
since it would still be considered outstanding by the driver. I also think
we may be talking past each other in the sense that the guc-id is something
the driver assigns to a context being pinned and only the driver can
un-assign it (both assigning and un-assigning are via H2G interactions).
I get the sense you are assuming the GuC can un-assign the guc-ids on its
own - which isn't the case. Apologies if I mis-assumed.


I did not think GuC can re-assign ce->guc_id. I asked about request/context 
complete/retire happening before reset/capture notification is received.

That would be the time window between the last intel_context_put, so last 
i915_request_put from retire, at which point AFAICT GuC code releases the 
guc_id. Execution timeline like:

|-- rq1 --|-- rq2 --|
   ^ engine reset   ^ rq2, rq1 retire, guc id released
                        ^ GuC reset notify received - guc_id not known any more?

You are saying something is guaranteed to be holding onto the guc_id at the
point of receiving the notification? "There are many factors that can keep
the context pinned" - what is it in this case? Or the case cannot happen?


Regards,

Tvrtko


Re: [Intel-gfx] [RFC 7/7] drm/i915/guc: Print the GuC error capture output register list.

2022-01-06 Thread Teres Alexis, Alan Previn

On Thu, 2022-01-06 at 09:38 +, Tvrtko Ursulin wrote:
> On 05/01/2022 17:30, Teres Alexis, Alan Previn wrote:
> > On Tue, 2022-01-04 at 13:56 +, Tvrtko Ursulin wrote:
> > > > The flow of events are as below:
> > > > 
> > > > 1. guc sends notification that an error capture was done and ready to 
> > > > take.
> > > > - at this point we copy the guc error captured dump into an 
> > > > interim store
> > > >   (larger buffer that can hold multiple captures).
> > > > 2. guc sends notification that a context was reset (after the prior)
> > > > - this triggers a call to i915_gpu_coredump with the 
> > > > corresponding engine-mask
> > > > from the context that was reset
> > > > - i915_gpu_coredump proceeds to gather entire gpu state 
> > > > including driver state,
> > > > global gpu state, engine state, context vmas and also 
> > > > engine registers. For the
> > > > engine registers now call into the guc_capture code which 
> > > > merely needs to verify
> > > >   that GuC had already done a step 1 and we have data ready to 
> > > > be parsed.
> > > 
> > > What about the time between the actual reset and receiving the context
> > > reset notification? Latter will contain intel_context->guc_id - can that
> > > be re-assigned or "retired" in between the two and so cause problems for
> > > matching the correct (or any) vmas?
> > > 
> > No, it cannot, because it's only after the context reset notification
> > that i915 starts taking action against that context - and even that
> > happens after the i915_gpu_coredump (engine-mask-of-context) happens.
> > That's what I've observed in the code flow.
> 
> The fact it is "only after" is exactly why I asked.
> 
> Reset notification is in a CT queue with other stuff, right? So can be 
> some unrelated time after the actual reset. Could have context be 
> retired in the meantime and guc_id released is the question.
> 
> Because i915 has no idea there was a reset until this delayed message 
> comes over, but it could see user interrupt signaling end of batch, 
> after the reset has happened, unbeknown to i915, right?
> 
> Perhaps the answer is guc_id cannot be released via the request retire 
> flows. Or GuC signaling release of guc_id is a thing, which is then 
> ordered via the same CT buffer.
> 
> I don't know, just asking.
> 
As long as the context is pinned, the guc-id won't be re-assigned. After a
bit of offline brain-dump from John Harrison, there are many factors that
can keep the context pinned (refcounts), including new or outstanding
requests. So a guc-id can't get re-assigned between a capture-notify and a
context-reset even if that outstanding request is the only refcount left,
since it would still be considered outstanding by the driver. I also think
we may be talking past each other in the sense that the guc-id is something
the driver assigns to a context being pinned and only the driver can
un-assign it (both assigning and un-assigning are via H2G interactions).
I get the sense you are assuming the GuC can un-assign the guc-ids on its
own - which isn't the case. Apologies if I mis-assumed.
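
A minimal sketch of the lifetime rule described above, purely illustrative
(the unpin path is simplified and sketch_release_guc_id_h2g() is a made-up
helper, not the actual intel_guc_submission code):

/*
 * Illustrative only: the guc-id is handed back only on the final unpin,
 * and outstanding requests hold pins, so a retire racing with the
 * capture/reset notifications cannot release the guc-id early.
 */
static void sketch_context_unpin(struct intel_guc *guc, struct intel_context *ce)
{
	if (!atomic_dec_and_test(&ce->pin_count))
		return;	/* still pinned, e.g. by outstanding requests */

	/* last pin dropped: only now does the driver un-assign the guc-id,
	 * and it does so itself via an H2G message (hypothetical helper) */
	sketch_release_guc_id_h2g(guc, ce);
}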

> Regards,
> 
> Tvrtko



Re: [Intel-gfx] [RFC 7/7] drm/i915/guc: Print the GuC error capture output register list.

2022-01-06 Thread Tvrtko Ursulin



On 05/01/2022 17:30, Teres Alexis, Alan Previn wrote:


On Tue, 2022-01-04 at 13:56 +, Tvrtko Ursulin wrote:



The flow of events are as below:

1. guc sends notification that an error capture was done and ready to take.
- at this point we copy the guc error captured dump into an interim 
store
  (larger buffer that can hold multiple captures).
2. guc sends notification that a context was reset (after the prior)
- this triggers a call to i915_gpu_coredump with the corresponding 
engine-mask
from the context that was reset
- i915_gpu_coredump proceeds to gather entire gpu state including 
driver state,
global gpu state, engine state, context vmas and also engine 
registers. For the
engine registers now call into the guc_capture code which merely 
needs to verify
  that GuC had already done a step 1 and we have data ready to be 
parsed.


What about the time between the actual reset and receiving the context
reset notification? Latter will contain intel_context->guc_id - can that
be re-assigned or "retired" in between the two and so cause problems for
matching the correct (or any) vmas?


No, it cannot, because it's only after the context reset notification that
i915 starts taking action against that context - and even that happens
after the i915_gpu_coredump (engine-mask-of-context) happens.
That's what I've observed in the code flow.


The fact it is "only after" is exactly why I asked.

Reset notification is in a CT queue with other stuff, right? So can be 
some unrelated time after the actual reset. Could have context be 
retired in the meantime and guc_id released is the question.


Because i915 has no idea there was a reset until this delayed message 
comes over, but it could see user interrupt signaling end of batch, 
after the reset has happened, unbeknown to i915, right?


Perhaps the answer is guc_id cannot be released via the request retire 
flows. Or GuC signaling release of guc_id is a thing, which is then 
ordered via the same CT buffer.


I don't know, just asking.

Regards,

Tvrtko


Re: [Intel-gfx] [RFC 7/7] drm/i915/guc: Print the GuC error capture output register list.

2022-01-05 Thread Teres Alexis, Alan Previn

On Tue, 2022-01-04 at 13:56 +, Tvrtko Ursulin wrote:
> 
> > The flow of events are as below:
> > 
> > 1. guc sends notification that an error capture was done and ready to take.
> > - at this point we copy the guc error captured dump into an interim 
> > store
> >   (larger buffer that can hold multiple captures).
> > 2. guc sends notification that a context was reset (after the prior)
> > - this triggers a call to i915_gpu_coredump with the corresponding 
> > engine-mask
> >from the context that was reset
> > - i915_gpu_coredump proceeds to gather entire gpu state including 
> > driver state,
> >global gpu state, engine state, context vmas and also engine 
> > registers. For the
> >engine registers now call into the guc_capture code which merely 
> > needs to verify
> >   that GuC had already done a step 1 and we have data ready to be 
> > parsed.
> 
> What about the time between the actual reset and receiving the context 
> reset notification? Latter will contain intel_context->guc_id - can that 
> be re-assigned or "retired" in between the two and so cause problems for 
> matching the correct (or any) vmas?
> 
No, it cannot, because it's only after the context reset notification that
i915 starts taking action against that context - and even that happens
after the i915_gpu_coredump (engine-mask-of-context) happens.
That's what I've observed in the code flow.

> Regards,
> 
> Tvrtko


Re: [Intel-gfx] [RFC 7/7] drm/i915/guc: Print the GuC error capture output register list.

2022-01-04 Thread Tvrtko Ursulin



On 24/12/2021 13:34, Teres Alexis, Alan Previn wrote:


On Fri, 2021-12-24 at 12:09 +, Tvrtko Ursulin wrote:

Hi,

Somehow I stumbled on this while browsing through the mailing list.

On 23/12/2021 18:54, Teres Alexis, Alan Previn wrote:

Revisiting below hunk of patch-7 comment, as per offline discussion with Matt,
there is little benefit to even making that guc-id lookup because:

1. the delay between the context reset notification (when the vmas are copied
 and when we verify we had received a guc err capture dump) may be 
subjectively
 large enough and not tethered that the guc-id may have already been 
re-assigned.

2. I was really looking for some kind of unique context handle to print out 
that could
 be correlated (by user inspecting the dump) back to a unique app or 
process or
 context-id but cant find such a param in struct intel_context.

As part of further reviewing the end to end flows and possible error scenarios, 
there
also may potentially be a mismatch between "which context was reset by guc at 
time-n"
vs "which context's vma buffers is being printed out at time-n+x" if
we are experiencing back-to-back resets and the user dumped the debugfs x-time 
later.


What does this all actually mean, because it sounds rather alarming,
that it just won't be possible to know which context, belonging to which
process, was reset? And because of guc_id potentially re-assigned even
the captured VMAs may not be the correct ones?




The flow of events are as below:

1. guc sends notification that an error capture was done and ready to take.
- at this point we copy the guc error captured dump into an interim 
store
  (larger buffer that can hold multiple captures).
2. guc sends notification that a context was reset (after the prior)
- this triggers a call to i915_gpu_coredump with the corresponding 
engine-mask
   from the context that was reset
- i915_gpu_coredump proceeds to gather entire gpu state including 
driver state,
   global gpu state, engine state, context vmas and also engine 
registers. For the
   engine registers now call into the guc_capture code which merely 
needs to verify
  that GuC had already done a step 1 and we have data ready to be 
parsed.


What about the time between the actual reset and receiving the context 
reset notification? Latter will contain intel_context->guc_id - can that 
be re-assigned or "retired" in between the two and so cause problems for 
matching the correct (or any) vmas?


Regards,

Tvrtko



(time elapses)

3. end user triggers the sysfs to dump the error state and all prior 
information is
printed out in proper format.


Between 2 and 3:
- Looking at existing framework (established by execlist-capture codes),
 I believe we only hold on to the first error state capture and drop any
 subsequent context reset captures occurring before #3 (i.e. before the 
end user
 triggers the debugfs)
- However, in that same space, guc can send us more and more error-capture
  logs as long as we have space for them in the buffer.

So the issue was that in my original patch, for every next capture-snapshot
we find in the guc-error-capture output buffer, I would find the matching
engine and print out all the VMA data (that was successfully captured in
#2). However, I should only do that for the first dump, since that would
correlate exactly with the existing execlist code behavior. So this fix is
actually pretty straightforward to get the right matching VMA.

WRT my statement about the "context-to-process" lookup, I was initially
hoping that I could get that information "on my own" (within the
guc-err-capture module), but it would be a stretch (in terms of
inter-component information access). More importantly, it's totally
unnecessary since the existing execlist code already did that in Step 2.
That code remains intact with guc-error-capture.

One open I plan to test before the final rev is with shared engines like
CCS and RCS, where I want to trigger cascading hangs + resets in quick
succession just to see how the overall flow behaves.

I will attach a GuC-error-capture-based gpu error dump as per the review
comment from Matthew on the last rev.

..alan






Regards,

Tvrtko


(Recap: First, guc notifies capture event, second, guc notifies context reset 
during
which we trigger i915_gpu_coredump. In this second step, the vma's are dumped 
and we
verify that the guc capture happened but don't parse the guc-err-capture-logs 
yet.
Third step is when user triggers the debugfs to dump which is when we parse the 
error
capture logs.)

As a fix, what we can do in the guc_error_capture report out is to ensure that
we dont re-print the previously dumped vmas if we end up finding multiple
guc-error-capture dumps since the i915_gpu_coredump would have only captured 
the vma's
for the very first context that was reset. And with 

Re: [Intel-gfx] [RFC 7/7] drm/i915/guc: Print the GuC error capture output register list.

2021-12-24 Thread Teres Alexis, Alan Previn

On Fri, 2021-12-24 at 12:09 +, Tvrtko Ursulin wrote:
> Hi,
> 
> Somehow I stumbled on this while browsing through the mailing list.
> 
> On 23/12/2021 18:54, Teres Alexis, Alan Previn wrote:
> > Revisiting below hunk of patch-7 comment, as per offline discussion with 
> > Matt,
> > there is little benefit to even making that guc-id lookup because:
> > 
> > 1. the delay between the context reset notification (when the vmas are 
> > copied
> > and when we verify we had received a guc err capture dump) may be 
> > subjectively
> > large enough and not tethered that the guc-id may have already been 
> > re-assigned.
> > 
> > 2. I was really looking for some kind of unique context handle to print out 
> > that could
> > be correlated (by user inspecting the dump) back to a unique app or 
> > process or
> > context-id but cant find such a param in struct intel_context.
> > 
> > As part of further reviewing the end to end flows and possible error 
> > scenarios, there
> > also may potentially be a mismatch between "which context was reset by guc 
> > at time-n"
> > vs "which context's vma buffers is being printed out at time-n+x" if
> > we are experiencing back-to-back resets and the user dumped the debugfs 
> > x-time later.
> 
> What does this all actually mean, because it sounds rather alarming, 
> that it just won't be possible to know which context, belonging to which 
> process, was reset? And because of guc_id potentially re-assigned even 
> the captured VMAs may not be the correct ones?
> 
> 

The flow of events is as below:

1. guc sends notification that an error capture was done and is ready to take.
   - at this point we copy the guc error capture dump into an interim store
     (a larger buffer that can hold multiple captures).
2. guc sends notification that a context was reset (after the prior)
   - this triggers a call to i915_gpu_coredump with the corresponding
     engine-mask from the context that was reset
   - i915_gpu_coredump proceeds to gather the entire gpu state including
     driver state, global gpu state, engine state, context vmas and also
     engine registers. For the engine registers, we now call into the
     guc_capture code, which merely needs to verify that GuC had already
     done step 1 and that we have data ready to be parsed.

(time elapses)

3. end user triggers the sysfs to dump the error state and all prior
   information is printed out in proper format.
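
To make the ordering easier to follow, here is a rough sketch of the three
steps in code form. It is illustrative only: the handler names follow the
flow comments in patch 7 of this series, but the signatures and glue code
are guesses, not the posted patches.

/* Step 1: G2H "error capture ready" - copy GuC's output buffer into the
 * larger interim store before it can be overwritten. */
static void sketch_g2h_capture_notify(struct intel_guc *guc)
{
	intel_guc_capture_store_snapshot(guc);	/* queues the copy work */
}

/* Step 2: G2H "context reset" (always after step 1) - trigger the coredump
 * for the engines of the reset context. i915_gpu_coredump gathers driver,
 * GPU and engine state plus the context vmas, and only *verifies* that a
 * step-1 snapshot exists; the register data is not parsed yet. */
static void sketch_g2h_context_reset(struct intel_guc *guc, u32 ctx_id)
{
	struct intel_context *ce = guc_context_lookup(guc, ctx_id);

	if (ce)
		i915_capture_error_state(guc_to_gt(guc), ce->engine->mask);
}

/* Step 3: user reads the error state via debugfs/sysfs - only now is the
 * stored GuC capture snapshot parsed and pretty-printed. */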


Between 2 and 3:
   - Looking at the existing framework (established by the execlist-capture
     code), I believe we only hold on to the first error state capture and
     drop any subsequent context reset captures occurring before #3 (i.e.
     before the end user triggers the debugfs).
   - However, in that same space, guc can send us more and more
     error-capture logs as long as we have space for them in the buffer.

So the issue was that in my original patch, for every next capture-snapshot
we find in the guc-error-capture output buffer, I would find the matching
engine and print out all the VMA data (that was successfully captured in
#2). However, I should only do that for the first dump, since that would
correlate exactly with the existing execlist code behavior. So this fix is
actually pretty straightforward to get the right matching VMA (see the
sketch below).
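
Something like the following, where the helper names are hypothetical and
only the shape of the loop matters:

/* Illustrative sketch, not the actual intel_guc_capture API: registers are
 * printed for every capture group found in the interim store, but the vmas
 * (saved once by i915_gpu_coredump at context-reset time) are printed for
 * the first group only. */
static void sketch_print_capture_groups(struct intel_guc *guc,
					struct drm_i915_error_state_buf *ebuf)
{
	bool vmas_printed = false;

	while (sketch_next_capture_group(guc, ebuf)) {	/* hypothetical iterator */
		sketch_print_group_registers(guc, ebuf);

		if (!vmas_printed) {
			sketch_print_first_context_vmas(ebuf);
			vmas_printed = true;
		}
	}
}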

WRT my statement about the "context-to-process" lookup, I was initially
hoping that I could get that information "on my own" (within the
guc-err-capture module), but it would be a stretch (in terms of
inter-component information access). More importantly, it's totally
unnecessary since the existing execlist code already did that in Step 2.
That code remains intact with guc-error-capture.

One open I plan to test before the final rev is with shared engines like
CCS and RCS, where I want to trigger cascading hangs + resets in quick
succession just to see how the overall flow behaves.

I will attach a GuC-error-capture-based gpu error dump as per the review
comment from Matthew on the last rev.

..alan
> 


> Regards,
> 
> Tvrtko
> 
> > (Recap: First, guc notifies capture event, second, guc notifies context 
> > reset during
> > which we trigger i915_gpu_coredump. In this second step, the vma's are 
> > dumped and we
> > verify that the guc capture happened but don't parse the 
> > guc-err-capture-logs yet.
> > Third step is when user triggers the debugfs to dump which is when we parse 
> > the error
> > capture logs.)
> > 
> > As a fix, what we can do in the guc_error_capture report out is to ensure 
> > that
> > we dont re-print the previously dumped vmas if we end up finding multiple
> > guc-error-capture dumps since the i915_gpu_coredump would have only 
> > captured the vma's
> > for the very first context that was reset. And with guc-submission, that 
> > would always
> > correlate to the "next-yet-to-be-parsed" guc-err-capture dump (since the 
> > guc-error-capture
> > logs are large enough 

Re: [Intel-gfx] [RFC 7/7] drm/i915/guc: Print the GuC error capture output register list.

2021-12-24 Thread Tvrtko Ursulin



Hi,

Somehow I stumbled on this while browsing through the mailing list.

On 23/12/2021 18:54, Teres Alexis, Alan Previn wrote:

Revisiting below hunk of patch-7 comment, as per offline discussion with Matt,
there is little benefit to even making that guc-id lookup because:

1. the delay between the context reset notification (when the vmas are copied
and when we verify we had received a guc err capture dump) may be 
subjectively
large enough and not tethered that the guc-id may have already been 
re-assigned.

2. I was really looking for some kind of unique context handle to print out 
that could
be correlated (by user inspecting the dump) back to a unique app or process 
or
context-id but cant find such a param in struct intel_context.

As part of further reviewing the end to end flows and possible error scenarios, 
there
also may potentially be a mismatch between "which context was reset by guc at 
time-n"
vs "which context's vma buffers is being printed out at time-n+x" if
we are experiencing back-to-back resets and the user dumped the debugfs x-time 
later.


What does this all actually mean, because it sounds rather alarming, 
that it just won't be possible to know which context, belonging to which 
process, was reset? And because of guc_id potentially re-assigned even 
the captured VMAs may not be the correct ones?


Regards,

Tvrtko



(Recap: First, guc notifies capture event, second, guc notifies context reset 
during
which we trigger i915_gpu_coredump. In this second step, the vma's are dumped 
and we
verify that the guc capture happened but don't parse the guc-err-capture-logs 
yet.
Third step is when user triggers the debugfs to dump which is when we parse the 
error
capture logs.)

As a fix, what we can do in the guc_error_capture report out is to ensure that
we dont re-print the previously dumped vmas if we end up finding multiple
guc-error-capture dumps since the i915_gpu_coredump would have only captured 
the vma's
for the very first context that was reset. And with guc-submission, that would 
always
correlate to the "next-yet-to-be-parsed" guc-err-capture dump (since the 
guc-error-capture
logs are large enough to hold data for multiple dumps).

The changes (removal of the below hunk and addition of
"only-print-the-first-vma") are trivial, but I felt it warranted a good
explanation. Apologies for the inbox noise.

...alan

On Tue, 2021-12-07 at 22:32 -0800, Alan Previn Teres Alexis wrote:

Thanks again for the detailed review here.
Will fix all the rest on next rev.
One special response for this one:


On Tue, 2021-12-07 at 16:22 -0800, Matthew Brost wrote:

On Mon, Nov 22, 2021 at 03:04:02PM -0800, Alan Previn wrote:

+   if (datatype == GUC_CAPTURE_LIST_TYPE_ENGINE_INSTANCE) {
+   GCAP_PRINT_GUC_INST_INFO(i915, ebuf, data);
+   eng_inst = 
FIELD_GET(GUC_CAPTURE_DATAHDR_SRC_INSTANCE, data.info);
+   eng = guc_lookup_engine(guc, engineclass, 
eng_inst);
+   if (eng) {
+   GCAP_PRINT_INTEL_ENG_INFO(i915, ebuf, 
eng);
+   } else {
+   PRINT(&i915->drm, ebuf, "i915-Eng-Lookup Fail!\n");
+   }
+   ce = guc_context_lookup(guc, data.guc_ctx_id);


You are going to need to reference count the 'ce' here. See
intel_guc_context_reset_process_msg for an example.



Oh crap - i missed this one - which you had explicitly mentioned offline when i 
was doing the
development. Sorry about that i just totally missed it from my todo-notes.

...alan




Re: [Intel-gfx] [RFC 7/7] drm/i915/guc: Print the GuC error capture output register list.

2021-12-23 Thread Teres Alexis, Alan Previn
Revisiting the below hunk of the patch-7 comment, as per offline discussion
with Matt, there is little benefit to even making that guc-id lookup
because:

1. the delay between the context reset notification (when the vmas are
   copied and when we verify we had received a guc err capture dump) may be
   subjectively large enough, and not tethered, such that the guc-id may
   have already been re-assigned.

2. I was really looking for some kind of unique context handle to print out
   that could be correlated (by a user inspecting the dump) back to a
   unique app or process or context-id, but can't find such a param in
   struct intel_context.

As part of further reviewing the end-to-end flows and possible error
scenarios, there may also potentially be a mismatch between "which context
was reset by guc at time-n" vs "which context's vma buffers are being
printed out at time-n+x" if we are experiencing back-to-back resets and the
user dumped the debugfs x-time later.

(Recap: First, guc notifies capture event, second, guc notifies context reset 
during
which we trigger i915_gpu_coredump. In this second step, the vma's are dumped 
and we
verify that the guc capture happened but don't parse the guc-err-capture-logs 
yet.
Third step is when user triggers the debugfs to dump which is when we parse the 
error
capture logs.)

As a fix, what we can do in the guc_error_capture report-out is to ensure
that we don't re-print the previously dumped vmas if we end up finding
multiple guc-error-capture dumps, since i915_gpu_coredump would have only
captured the vmas for the very first context that was reset. And with
guc-submission, that would always correlate to the "next-yet-to-be-parsed"
guc-err-capture dump (since the guc-error-capture logs are large enough to
hold data for multiple dumps).

The changes (removal of the below hunk and addition of
"only-print-the-first-vma") are trivial, but I felt it warranted a good
explanation. Apologies for the inbox noise.

...alan

On Tue, 2021-12-07 at 22:32 -0800, Alan Previn Teres Alexis wrote:
> Thanks again for the detailed review here.
> Will fix all the rest on next rev.
> One special response for this one:
> 
> 
> On Tue, 2021-12-07 at 16:22 -0800, Matthew Brost wrote:
> > On Mon, Nov 22, 2021 at 03:04:02PM -0800, Alan Previn wrote:
> > > + if (datatype == GUC_CAPTURE_LIST_TYPE_ENGINE_INSTANCE) {
> > > + GCAP_PRINT_GUC_INST_INFO(i915, ebuf, data);
> > > + eng_inst = 
> > > FIELD_GET(GUC_CAPTURE_DATAHDR_SRC_INSTANCE, data.info);
> > > + eng = guc_lookup_engine(guc, engineclass, 
> > > eng_inst);
> > > + if (eng) {
> > > + GCAP_PRINT_INTEL_ENG_INFO(i915, ebuf, 
> > > eng);
> > > + } else {
> > > + PRINT(&i915->drm, ebuf, "i915-Eng-Lookup Fail!\n");
> > > + }
> > > + ce = guc_context_lookup(guc, data.guc_ctx_id);
> > 
> > You are going to need to reference count the 'ce' here. See
> > intel_guc_context_reset_process_msg for an example. 
> > 
> 
> Oh crap - i missed this one - which you had explicitly mentioned offline when 
> i was doing the
> development. Sorry about that i just totally missed it from my todo-notes.
> 
> ...alan



Re: [Intel-gfx] [RFC 7/7] drm/i915/guc: Print the GuC error capture output register list.

2021-12-07 Thread Teres Alexis, Alan Previn
Thanks again for the detailed review here.
Will fix all the rest on next rev.
One special response for this one:


On Tue, 2021-12-07 at 16:22 -0800, Matthew Brost wrote:
> On Mon, Nov 22, 2021 at 03:04:02PM -0800, Alan Previn wrote:
> > +   if (datatype == GUC_CAPTURE_LIST_TYPE_ENGINE_INSTANCE) {
> > +   GCAP_PRINT_GUC_INST_INFO(i915, ebuf, data);
> > +   eng_inst = 
> > FIELD_GET(GUC_CAPTURE_DATAHDR_SRC_INSTANCE, data.info);
> > +   eng = guc_lookup_engine(guc, engineclass, 
> > eng_inst);
> > +   if (eng) {
> > +   GCAP_PRINT_INTEL_ENG_INFO(i915, ebuf, 
> > eng);
> > +   } else {
> > +   PRINT(&i915->drm, ebuf, "i915-Eng-Lookup Fail!\n");
> > +   }
> > +   ce = guc_context_lookup(guc, data.guc_ctx_id);
> 
> You are going to need to reference count the 'ce' here. See
> intel_guc_context_reset_process_msg for an example. 
> 

Oh crap - i missed this one - which you had explicitly mentioned offline when i 
was doing the
development. Sorry about that i just totally missed it from my todo-notes.

...alan
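
For completeness, the hunk above would likely grow a get/put pair along
these lines - a sketch only, modelled loosely on what
intel_guc_context_reset_process_msg does, with locking details omitted:

	/* NB: in the real fix the reference needs to be taken under the same
	 * lock that protects the guc_id -> ce lookup, as the referenced
	 * example does; that detail is elided here. */
	ce = guc_context_lookup(guc, data.guc_ctx_id);
	if (ce) {
		intel_context_get(ce);	/* keep ce alive while printing from it */
		/* ... print the context / vma details into ebuf ... */
		intel_context_put(ce);
	}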


Re: [Intel-gfx] [RFC 7/7] drm/i915/guc: Print the GuC error capture output register list.

2021-12-07 Thread Matthew Brost
On Mon, Nov 22, 2021 at 03:04:02PM -0800, Alan Previn wrote:
> Print the GuC captured error state register list (offsets
> and values) when gpu_coredump_state printout is invoked.
> 
> Also, since the GuC can report multiple engine class registers in a
> single notification event, parse the captured data (appearing as a
> stream of structures) to identify multiple captures of different
> 'engine-capture-group-outputs'.
> 
> Finally, for each 'engine-capture-group-output', identify the last
> running context and print already-identified vma's so that user's
> output report follows the same layout as execlist submission. I.e.
> engine1-registers, engine1-context-vmas, engine2-registers,
> engine2-context-vmas, etc.
> 

Can you include a sample error capture in the next rev cover letter
assuming it is posted after 69.0.0 is merged.

A couple of comments below.

> Signed-off-by: Alan Previn 
> ---
>  drivers/gpu/drm/i915/gt/intel_engine_cs.c |   4 +-
>  .../gpu/drm/i915/gt/uc/intel_guc_capture.c| 389 ++
>  .../gpu/drm/i915/gt/uc/intel_guc_capture.h|   6 +
>  drivers/gpu/drm/i915/i915_gpu_error.c |  53 ++-
>  drivers/gpu/drm/i915/i915_gpu_error.h |   5 +
>  5 files changed, 439 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c 
> b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> index 332756036007..5806e2c05212 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> @@ -1595,9 +1595,7 @@ static void intel_engine_print_registers(struct 
> intel_engine_cs *engine,
>   drm_printf(m, "\tIPEHR: 0x%08x\n", ENGINE_READ(engine, IPEHR));
>   }
>  
> - if (intel_engine_uses_guc(engine)) {
> - /* nothing to print yet */
> - } else if (HAS_EXECLISTS(dev_priv)) {
> + if (HAS_EXECLISTS(dev_priv) && !intel_engine_uses_guc(engine)) {
>   struct i915_request * const *port, *rq;
>   const u32 *hws =
>   &engine->status_page.addr[I915_HWS_CSB_BUF0_INDEX];
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c 
> b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
> index 459fe81c77ae..998ce1b474ed 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
> @@ -415,8 +415,389 @@ int intel_guc_capture_output_min_size_est(struct 
> intel_guc *guc)
>   *   L--> intel_guc_capture_store_snapshot
>   *L--> queue(__guc_capture_store_snapshot_work)
>   * Copies from B (head->tail) into C
> + *
> + * GUC --> notify context reset:
> + * -
> + * --> G2H CONTEXT RESET
> + *   L--> guc_handle_context_reset --> 
> i915_capture_error_state
> + *--> i915_gpu_coredump --> intel_guc_capture_store_ptr
> + *L--> keep a ptr to capture_store in
> + * i915_gpu_coredump struct.
> + *
> + * User Sysfs / Debugfs
> + * 
> + *  --> i915_gpu_coredump_copy_to_buffer->
> + *   L--> err_print_to_sgl --> err_print_gt
> + *L--> error_print_guc_captures
> + * L--> loop: 
> intel_guc_capture_out_print_next_group
> + *
>   */
>  
> +#if IS_ENABLED(CONFIG_DRM_I915_CAPTURE_ERROR)
> +
> +static char *
> +guc_capture_register_string(const struct intel_guc *guc, u32 owner, u32 type,
> + u32 class, u32 id, u32 offset)
> +{
> + struct __guc_mmio_reg_descr_group *reglists = guc->capture.reglists;
> + struct __guc_mmio_reg_descr_group *match;
> + int num_regs, j = 0;
> +
> + if (!reglists)
> + return NULL;
> +
> + match = guc_capture_get_one_list(reglists, owner, type, id);
> + if (match) {
> + num_regs = match->num_regs;
> + while (num_regs--) {

for (num_regs = match->num_regs, j = 0; num_regs; ++j, --num_regs)

> + if (offset == match->list[j].reg.reg)
> + return match->list[j].regname;
> + ++j;
> + }
> + }
> +
> + return NULL;
> +}
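
For reference, a sketch of that walk with Matt's suggestion folded in (same
behaviour, just the while/++j collapsed into a single for statement;
illustrative, not the posted patch):

	match = guc_capture_get_one_list(reglists, owner, type, id);
	if (match) {
		/* walk the descriptor list once, matching on MMIO offset */
		for (num_regs = match->num_regs, j = 0; num_regs; ++j, --num_regs) {
			if (offset == match->list[j].reg.reg)
				return match->list[j].regname;
		}
	}

	return NULL;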
> +
> +static inline int
> +guc_capture_store_remove_dw(struct guc_capture_out_store *store, u32 
> *bytesleft,
> + u32 *dw)
> +{
> + int tries = 2;
> + int avail = 0;
> + u32 *src_data;
> +
> + if (!*bytesleft)
> + return 0;
> +
> + while (tries--) {
> + avail = CIRC_CNT_TO_END(store->head, store->tail, store->size);
> + if (avail >= sizeof(u32)) {
> + src_data = (u32 *)(store->addr + store->tail);
> + *dw = *src_data;
> + store->tail = (store->tail + 4) & (store->size - 1);
> + *bytesleft -= 4;
> + return 4;
> + }
> + if 

Re: [Intel-gfx] [RFC 7/7] drm/i915/guc: Print the GuC error capture output register list.

2021-11-22 Thread Teres Alexis, Alan Previn
I realize I missed checkpatch on patch-7 before send-mail. Will fix that on
the next rev.
Patch #2 also has checkpatch failures which I was aware of - I'm still
wrestling with how to instance those register tables in a clean, readable
way.

...alan

-Original Message-
From: Teres Alexis, Alan Previn  
Sent: Monday, November 22, 2021 3:04 PM
To: intel-gfx@lists.freedesktop.org
Cc: Teres Alexis, Alan Previn 
Subject: [RFC 7/7] drm/i915/guc: Print the GuC error capture output register 
list.

Print the GuC captured error state register list (offsets and values) when 
gpu_coredump_state printout is invoked.

Also, since the GuC can report multiple engine class registers in a single 
notification event, parse the captured data (appearing as a stream of 
structures) to identify multiple captures of different 
'engine-capture-group-outputs'.

Finally, for each 'engine-capture-group-output', identify the last running 
context and print already-identified vma's so that user's output report follows 
the same layout as execlist submission. I.e.
engine1-registers, engine1-context-vmas, engine2-registers, 
engine2-context-vmas, etc.

Signed-off-by: Alan Previn 
---
 drivers/gpu/drm/i915/gt/intel_engine_cs.c |   4 +-
 .../gpu/drm/i915/gt/uc/intel_guc_capture.c| 389 ++
 .../gpu/drm/i915/gt/uc/intel_guc_capture.h|   6 +
 drivers/gpu/drm/i915/i915_gpu_error.c |  53 ++-
 drivers/gpu/drm/i915/i915_gpu_error.h |   5 +
 5 files changed, 439 insertions(+), 18 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 332756036007..5806e2c05212 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -1595,9 +1595,7 @@ static void intel_engine_print_registers(struct intel_engine_cs *engine,
drm_printf(m, "\tIPEHR: 0x%08x\n", ENGINE_READ(engine, IPEHR));
}
 
-   if (intel_engine_uses_guc(engine)) {
-   /* nothing to print yet */
-   } else if (HAS_EXECLISTS(dev_priv)) {
+   if (HAS_EXECLISTS(dev_priv) && !intel_engine_uses_guc(engine)) {
struct i915_request * const *port, *rq;
const u32 *hws =
			&engine->status_page.addr[I915_HWS_CSB_BUF0_INDEX];
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
index 459fe81c77ae..998ce1b474ed 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
@@ -415,8 +415,389 @@ int intel_guc_capture_output_min_size_est(struct intel_guc *guc)
  *   L--> intel_guc_capture_store_snapshot
  *L--> queue(__guc_capture_store_snapshot_work)
  * Copies from B (head->tail) into C
+ *
+ * GUC --> notify context reset:
+ * -
+ * --> G2H CONTEXT RESET
+ *   L--> guc_handle_context_reset --> i915_capture_error_state
+ *--> i915_gpu_coredump --> intel_guc_capture_store_ptr
+ *L--> keep a ptr to capture_store in
+ * i915_gpu_coredump struct.
+ *
+ * User Sysfs / Debugfs
+ * 
+ *  --> i915_gpu_coredump_copy_to_buffer->
+ *   L--> err_print_to_sgl --> err_print_gt
+ *L--> error_print_guc_captures
+ * L--> loop: intel_guc_capture_out_print_next_group
+ *
  */
 
+#if IS_ENABLED(CONFIG_DRM_I915_CAPTURE_ERROR)
+
+static char *
+guc_capture_register_string(const struct intel_guc *guc, u32 owner, u32 type,
+   u32 class, u32 id, u32 offset)
+{
+   struct __guc_mmio_reg_descr_group *reglists = guc->capture.reglists;
+   struct __guc_mmio_reg_descr_group *match;
+   int num_regs, j = 0;
+
+   if (!reglists)
+   return NULL;
+
+   match = guc_capture_get_one_list(reglists, owner, type, id);
+   if (match) {
+   num_regs = match->num_regs;
+   while (num_regs--) {
+   if (offset == match->list[j].reg.reg)
+   return match->list[j].regname;
+   ++j;
+   }
+   }
+
+   return NULL;
+}
+
+static inline int
+guc_capture_store_remove_dw(struct guc_capture_out_store *store, u32 *bytesleft,
+   u32 *dw)
+{
+   int tries = 2;
+   int avail = 0;
+   u32 *src_data;
+
+   if (!*bytesleft)
+   return 0;
+
+   while (tries--) {
+   avail = CIRC_CNT_TO_END(store->head, store->tail, store->size);
+   if (avail >= sizeof(u32)) {
+   src_data = (u32 *)(store->addr + store->tail);
+   *dw = *src_data;
+   store->tail = (store->tail + 4) & (store->size - 1);
+   *bytesleft -= 
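
As a rough illustration of how the 'stream of structures' mentioned in the commit message gets peeled out of the circular store, a fixed-size header can be read one dword at a time with guc_capture_store_remove_dw() from the hunk above. This is only a sketch, simplified from the RFC's capture_store_get_group_hdr(), with error handling trimmed and assuming dword-multiple structure sizes; the helper name sketch_read_capture_struct() is made up for illustration:

static int
sketch_read_capture_struct(struct guc_capture_out_store *store, u32 *bytesleft,
			   void *out, int fullsize)
{
	u32 *dst = out;
	int read = 0;

	/* consume the structure a dword at a time, honouring buffer wrap */
	while (read < fullsize) {
		int got = guc_capture_store_remove_dw(store, bytesleft, dst);

		if (!got)
			return -ENODATA; /* store drained mid-structure */
		read += got;
		dst++;
	}

	return read;
}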

[Intel-gfx] [RFC 7/7] drm/i915/guc: Print the GuC error capture output register list.

2021-11-22 Thread Alan Previn
Print the GuC captured error state register list (offsets
and values) when gpu_coredump_state printout is invoked.

Also, since the GuC can report multiple engine class registers in a
single notification event, parse the captured data (appearing as a
stream of structures) to identify multiple captures of different
'engine-capture-group-outputs'.

Finally, for each 'engine-capture-group-output', identify the last
running context and print the already-identified VMAs so that the
user's output report follows the same layout as execlist submission, i.e.
engine1-registers, engine1-context-vmas, engine2-registers,
engine2-context-vmas, etc.

Signed-off-by: Alan Previn 
---
 drivers/gpu/drm/i915/gt/intel_engine_cs.c |   4 +-
 .../gpu/drm/i915/gt/uc/intel_guc_capture.c| 389 ++
 .../gpu/drm/i915/gt/uc/intel_guc_capture.h|   6 +
 drivers/gpu/drm/i915/i915_gpu_error.c |  53 ++-
 drivers/gpu/drm/i915/i915_gpu_error.h |   5 +
 5 files changed, 439 insertions(+), 18 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 332756036007..5806e2c05212 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -1595,9 +1595,7 @@ static void intel_engine_print_registers(struct intel_engine_cs *engine,
drm_printf(m, "\tIPEHR: 0x%08x\n", ENGINE_READ(engine, IPEHR));
}
 
-   if (intel_engine_uses_guc(engine)) {
-   /* nothing to print yet */
-   } else if (HAS_EXECLISTS(dev_priv)) {
+   if (HAS_EXECLISTS(dev_priv) && !intel_engine_uses_guc(engine)) {
struct i915_request * const *port, *rq;
const u32 *hws =
			&engine->status_page.addr[I915_HWS_CSB_BUF0_INDEX];
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
index 459fe81c77ae..998ce1b474ed 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
@@ -415,8 +415,389 @@ int intel_guc_capture_output_min_size_est(struct intel_guc *guc)
  *   L--> intel_guc_capture_store_snapshot
  *L--> queue(__guc_capture_store_snapshot_work)
  * Copies from B (head->tail) into C
+ *
+ * GUC --> notify context reset:
+ * -
+ * --> G2H CONTEXT RESET
+ *   L--> guc_handle_context_reset --> i915_capture_error_state
+ *--> i915_gpu_coredump --> intel_guc_capture_store_ptr
+ *L--> keep a ptr to capture_store in
+ * i915_gpu_coredump struct.
+ *
+ * User Sysfs / Debugfs
+ * 
+ *  --> i915_gpu_coredump_copy_to_buffer->
+ *   L--> err_print_to_sgl --> err_print_gt
+ *L--> error_print_guc_captures
+ * L--> loop: intel_guc_capture_out_print_next_group
+ *
  */
 
+#if IS_ENABLED(CONFIG_DRM_I915_CAPTURE_ERROR)
+
+static char *
+guc_capture_register_string(const struct intel_guc *guc, u32 owner, u32 type,
+   u32 class, u32 id, u32 offset)
+{
+   struct __guc_mmio_reg_descr_group *reglists = guc->capture.reglists;
+   struct __guc_mmio_reg_descr_group *match;
+   int num_regs, j = 0;
+
+   if (!reglists)
+   return NULL;
+
+   match = guc_capture_get_one_list(reglists, owner, type, id);
+   if (match) {
+   num_regs = match->num_regs;
+   while (num_regs--) {
+   if (offset == match->list[j].reg.reg)
+   return match->list[j].regname;
+   ++j;
+   }
+   }
+
+   return NULL;
+}
+
+static inline int
+guc_capture_store_remove_dw(struct guc_capture_out_store *store, u32 *bytesleft,
+   u32 *dw)
+{
+   int tries = 2;
+   int avail = 0;
+   u32 *src_data;
+
+   if (!*bytesleft)
+   return 0;
+
+   while (tries--) {
+   avail = CIRC_CNT_TO_END(store->head, store->tail, store->size);
+   if (avail >= sizeof(u32)) {
+   src_data = (u32 *)(store->addr + store->tail);
+   *dw = *src_data;
+   store->tail = (store->tail + 4) & (store->size - 1);
+   *bytesleft -= 4;
+   return 4;
+   }
+   if (store->tail == (store->size - 1) && store->head > 0)
+   store->tail = 0;
+   }
+
+   return 0;
+}
+
+static int
+capture_store_get_group_hdr(const struct intel_guc *guc,
+   struct guc_capture_out_store *store, u32 *bytesleft,
+   struct intel_guc_capture_out_group_header *group)
+{
+   int read = 0;
+   int fullsize = sizeof(struct