Re: [PATCH 2/2] drm/amdgpu: add reset register trace function on GPU reset

2022-02-10 Thread Alex Deucher
On Thu, Feb 10, 2022 at 9:11 AM Sharma, Shashank
 wrote:
>
>
>
> On 2/10/2022 3:05 PM, Christian König wrote:
> > Am 10.02.22 um 14:18 schrieb Sharma, Shashank:
> >>
> >>
> >> On 2/10/2022 8:38 AM, Christian König wrote:
> >>> Am 10.02.22 um 08:34 schrieb Somalapuram, Amaranath:
> 
>  On 2/10/2022 12:39 PM, Christian König wrote:
> > Am 10.02.22 um 06:29 schrieb Somalapuram, Amaranath:
> >>
> >> On 2/9/2022 1:17 PM, Christian König wrote:
> >>> Am 08.02.22 um 16:28 schrieb Alex Deucher:
>  On Tue, Feb 8, 2022 at 3:17 AM Somalapuram Amaranath
>   wrote:
> > Dump the list of register values to trace event on GPU reset.
> >
> > Signed-off-by: Somalapuram Amaranath
> > 
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 21
> > -
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h  | 19
> > +++
> >   2 files changed, 39 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index 1e651b959141..057922fb7e37 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -4534,6 +4534,23 @@ int amdgpu_device_pre_asic_reset(struct
> > amdgpu_device *adev,
> >  return r;
> >   }
> >
> > +static int amdgpu_reset_reg_dumps(struct amdgpu_device *adev)
> > +{
> > +   int i;
> > +   uint32_t reg_value[128];
> > +
> > +   for (i = 0; adev->reset_dump_reg_list[i] != 0; i++) {
> > +   if (adev->asic_type >= CHIP_NAVI10)
>  This check should be against CHIP_VEGA10.  Also, this only
>  allows for
>  GC registers.  If we wanted to dump other registers, we'd need a
>  different macro.  Might be better to just use RREG32 here for
>  everything and then encode the full offset using
>  SOC15_REG_ENTRY_OFFSET() or a similar macro.  Also, we need to
>  think
>  about how to handle gfxoff in this case.  gfxoff needs to be
>  disabled
>  or we'll hang the chip if we try and read GC or SDMA registers via
>  MMIO which will adversely affect the hang signature.
> >>>
> >>> Well this should execute right before a GPU reset, so I think it
> >>> shouldn't matter if we hang the chip or not as long as the read
> >>> comes back correctly (I remember a very long UVD debug session
> >>> because of this).
> >>>
> >>> But in general I agree, we should just use RREG32() here and
> >>> always encode the full register offset.
> >>>
> >>> Regards,
> >>> Christian.
> >>>
> >> Can I use something like this:
> >>
> >> +   reg_value[i] =
> >> RREG32((adev->reg_offset[adev->reset_dump_reg_list[i][0]]
> >> + [adev->reset_dump_reg_list[i][1]]
> >> + [adev->reset_dump_reg_list[i][2]])
> >> + + adev->reset_dump_reg_list[i][3]);
> >>
> >> ip --> adev->reset_dump_reg_list[i][0]
> >>
> >> inst --> adev->reset_dump_reg_list[i][1]
> >>
> >> BASE_IDX--> adev->reset_dump_reg_list[i][2]
> >>
> >> reg --> adev->reset_dump_reg_list[i][3]
> >
> > No, that won't work.
> >
> > What you need to do is to use the full 32bit address of the
> > register. Userspace can worry about figuring out which ip,
> > instance, base and reg to resolve into that address.
> >
> > Regards,
> > Christian.
> >
>  Thanks Christian.
> 
>  should I consider using gfxoff like below code or not required:
>  amdgpu_gfx_off_ctrl(adev, false);
>  amdgpu_gfx_off_ctrl(adev, true);
> >>>
> >>> That's a really good question I can't fully answer.
> >>>
> >>> I think we don't want that because the GPU is stuck when the dump is
> >>> made, but better let Alex comment as well.
> >>>
> >>> Regards,
> >>> Christian.
> >>
> >>
> >> I had a quick look at the function amdgpu_gfx_off_ctrl, and it locks
> >> this mutex internally:
> >> mutex_lock(>gfx.gfx_off_mutex);
> >>
> >> and the reference state is tracked in:
> >> adev->gfx.gfx_off_state
> >>
> >> We can do something like this maybe:
> >> - If (adev->gfx_off_state == 0) {
> >>   trylock(gfx_off_mutex)
> >>   read_regs_now;
> >>   unlock_mutex();
> >> }
> >>
> >> How does it sounds ?
> >
> > As far as I know that won't work. GFX_off is only disabled intentionally
> > in very few places.
> >
> > So we will probably never get a register trace with that.
> >
>
> Ok, I don't know much about this feature, but due to the name I was
> udner the impression that gfx_off will be mostly disabled. But if it is
> hardly ever disabled, we need more infomrmation about it first, like
> when is this 

Re: [PATCH 2/2] drm/amdgpu: add reset register trace function on GPU reset

2022-02-10 Thread Sharma, Shashank




On 2/10/2022 3:05 PM, Christian König wrote:

Am 10.02.22 um 14:18 schrieb Sharma, Shashank:



On 2/10/2022 8:38 AM, Christian König wrote:

Am 10.02.22 um 08:34 schrieb Somalapuram, Amaranath:


On 2/10/2022 12:39 PM, Christian König wrote:

Am 10.02.22 um 06:29 schrieb Somalapuram, Amaranath:


On 2/9/2022 1:17 PM, Christian König wrote:

Am 08.02.22 um 16:28 schrieb Alex Deucher:

On Tue, Feb 8, 2022 at 3:17 AM Somalapuram Amaranath
 wrote:

Dump the list of register values to trace event on GPU reset.

Signed-off-by: Somalapuram Amaranath 


---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 21 
-
  drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h  | 19 
+++

  2 files changed, 39 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

index 1e651b959141..057922fb7e37 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4534,6 +4534,23 @@ int amdgpu_device_pre_asic_reset(struct 
amdgpu_device *adev,

 return r;
  }

+static int amdgpu_reset_reg_dumps(struct amdgpu_device *adev)
+{
+   int i;
+   uint32_t reg_value[128];
+
+   for (i = 0; adev->reset_dump_reg_list[i] != 0; i++) {
+   if (adev->asic_type >= CHIP_NAVI10)
This check should be against CHIP_VEGA10.  Also, this only 
allows for

GC registers.  If we wanted to dump other registers, we'd need a
different macro.  Might be better to just use RREG32 here for
everything and then encode the full offset using
SOC15_REG_ENTRY_OFFSET() or a similar macro.  Also, we need to 
think
about how to handle gfxoff in this case.  gfxoff needs to be 
disabled

or we'll hang the chip if we try and read GC or SDMA registers via
MMIO which will adversely affect the hang signature.


Well this should execute right before a GPU reset, so I think it 
shouldn't matter if we hang the chip or not as long as the read 
comes back correctly (I remember a very long UVD debug session 
because of this).


But in general I agree, we should just use RREG32() here and 
always encode the full register offset.


Regards,
Christian.


Can I use something like this:

+   reg_value[i] = 
RREG32((adev->reg_offset[adev->reset_dump_reg_list[i][0]]

+ [adev->reset_dump_reg_list[i][1]]
+ [adev->reset_dump_reg_list[i][2]])
+                                 + adev->reset_dump_reg_list[i][3]);

ip --> adev->reset_dump_reg_list[i][0]

inst --> adev->reset_dump_reg_list[i][1]

BASE_IDX--> adev->reset_dump_reg_list[i][2]

reg --> adev->reset_dump_reg_list[i][3]


No, that won't work.

What you need to do is to use the full 32bit address of the 
register. Userspace can worry about figuring out which ip, 
instance, base and reg to resolve into that address.


Regards,
Christian.


Thanks Christian.

should I consider using gfxoff like below code or not required:
amdgpu_gfx_off_ctrl(adev, false);
amdgpu_gfx_off_ctrl(adev, true);


That's a really good question I can't fully answer.

I think we don't want that because the GPU is stuck when the dump is 
made, but better let Alex comment as well.


Regards,
Christian.



I had a quick look at the function amdgpu_gfx_off_ctrl, and it locks 
this mutex internally:

mutex_lock(>gfx.gfx_off_mutex);

and the reference state is tracked in:
adev->gfx.gfx_off_state

We can do something like this maybe:
- If (adev->gfx_off_state == 0) {
  trylock(gfx_off_mutex)
  read_regs_now;
  unlock_mutex();
}

How does it sounds ?


As far as I know that won't work. GFX_off is only disabled intentionally 
in very few places.


So we will probably never get a register trace with that.



Ok, I don't know much about this feature, but due to the name I was 
udner the impression that gfx_off will be mostly disabled. But if it is 
hardly ever disabled, we need more infomrmation about it first, like 
when is this disabled and why ?


Alex ?

- Shashank


Regards,
Christian.



- Shashank





Regards,

S.Amarnath


which requires 4 values in user space for each register.

using any existing macro like RREG32_SOC15** will not be able to 
pass proper argument from user space (like ip##_HWIP or 
reg##_BASE_IDX)






Alex

+ reg_value[i] = RREG32_SOC15_IP(GC, 
adev->reset_dump_reg_list[i]);

+   else
+   reg_value[i] = 
RREG32(adev->reset_dump_reg_list[i]);

+   }
+
+ trace_amdgpu_reset_reg_dumps(adev->reset_dump_reg_list, 
reg_value, i);

+
+   return 0;
+}
+
  int amdgpu_do_asic_reset(struct list_head *device_list_handle,
  struct amdgpu_reset_context 
*reset_context)

  {
@@ -4567,8 +4584,10 @@ int amdgpu_do_asic_reset(struct 
list_head *device_list_handle,

tmp_adev->gmc.xgmi.pending_reset = false;
 if 
(!queue_work(system_unbound_wq, _adev->xgmi_reset_work))

 r = -EALREADY;
-   } else
+ 

Re: [PATCH 2/2] drm/amdgpu: add reset register trace function on GPU reset

2022-02-10 Thread Christian König

Am 10.02.22 um 14:18 schrieb Sharma, Shashank:



On 2/10/2022 8:38 AM, Christian König wrote:

Am 10.02.22 um 08:34 schrieb Somalapuram, Amaranath:


On 2/10/2022 12:39 PM, Christian König wrote:

Am 10.02.22 um 06:29 schrieb Somalapuram, Amaranath:


On 2/9/2022 1:17 PM, Christian König wrote:

Am 08.02.22 um 16:28 schrieb Alex Deucher:

On Tue, Feb 8, 2022 at 3:17 AM Somalapuram Amaranath
 wrote:

Dump the list of register values to trace event on GPU reset.

Signed-off-by: Somalapuram Amaranath 


---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 21 
-
  drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h  | 19 
+++

  2 files changed, 39 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

index 1e651b959141..057922fb7e37 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4534,6 +4534,23 @@ int amdgpu_device_pre_asic_reset(struct 
amdgpu_device *adev,

 return r;
  }

+static int amdgpu_reset_reg_dumps(struct amdgpu_device *adev)
+{
+   int i;
+   uint32_t reg_value[128];
+
+   for (i = 0; adev->reset_dump_reg_list[i] != 0; i++) {
+   if (adev->asic_type >= CHIP_NAVI10)
This check should be against CHIP_VEGA10.  Also, this only 
allows for

GC registers.  If we wanted to dump other registers, we'd need a
different macro.  Might be better to just use RREG32 here for
everything and then encode the full offset using
SOC15_REG_ENTRY_OFFSET() or a similar macro.  Also, we need to 
think
about how to handle gfxoff in this case.  gfxoff needs to be 
disabled

or we'll hang the chip if we try and read GC or SDMA registers via
MMIO which will adversely affect the hang signature.


Well this should execute right before a GPU reset, so I think it 
shouldn't matter if we hang the chip or not as long as the read 
comes back correctly (I remember a very long UVD debug session 
because of this).


But in general I agree, we should just use RREG32() here and 
always encode the full register offset.


Regards,
Christian.


Can I use something like this:

+   reg_value[i] = 
RREG32((adev->reg_offset[adev->reset_dump_reg_list[i][0]]

+ [adev->reset_dump_reg_list[i][1]]
+ [adev->reset_dump_reg_list[i][2]])
+                                 + adev->reset_dump_reg_list[i][3]);

ip --> adev->reset_dump_reg_list[i][0]

inst --> adev->reset_dump_reg_list[i][1]

BASE_IDX--> adev->reset_dump_reg_list[i][2]

reg --> adev->reset_dump_reg_list[i][3]


No, that won't work.

What you need to do is to use the full 32bit address of the 
register. Userspace can worry about figuring out which ip, 
instance, base and reg to resolve into that address.


Regards,
Christian.


Thanks Christian.

should I consider using gfxoff like below code or not required:
amdgpu_gfx_off_ctrl(adev, false);
amdgpu_gfx_off_ctrl(adev, true);


That's a really good question I can't fully answer.

I think we don't want that because the GPU is stuck when the dump is 
made, but better let Alex comment as well.


Regards,
Christian.



I had a quick look at the function amdgpu_gfx_off_ctrl, and it locks 
this mutex internally:

mutex_lock(>gfx.gfx_off_mutex);

and the reference state is tracked in:
adev->gfx.gfx_off_state

We can do something like this maybe:
- If (adev->gfx_off_state == 0) {
  trylock(gfx_off_mutex)
  read_regs_now;
  unlock_mutex();
}

How does it sounds ?


As far as I know that won't work. GFX_off is only disabled intentionally 
in very few places.


So we will probably never get a register trace with that.

Regards,
Christian.



- Shashank





Regards,

S.Amarnath


which requires 4 values in user space for each register.

using any existing macro like RREG32_SOC15** will not be able to 
pass proper argument from user space (like ip##_HWIP or 
reg##_BASE_IDX)






Alex

+ reg_value[i] = RREG32_SOC15_IP(GC, 
adev->reset_dump_reg_list[i]);

+   else
+   reg_value[i] = 
RREG32(adev->reset_dump_reg_list[i]);

+   }
+
+ trace_amdgpu_reset_reg_dumps(adev->reset_dump_reg_list, 
reg_value, i);

+
+   return 0;
+}
+
  int amdgpu_do_asic_reset(struct list_head *device_list_handle,
  struct amdgpu_reset_context 
*reset_context)

  {
@@ -4567,8 +4584,10 @@ int amdgpu_do_asic_reset(struct 
list_head *device_list_handle,

tmp_adev->gmc.xgmi.pending_reset = false;
 if 
(!queue_work(system_unbound_wq, _adev->xgmi_reset_work))

 r = -EALREADY;
-   } else
+   } else {
+ amdgpu_reset_reg_dumps(tmp_adev);
 r = amdgpu_asic_reset(tmp_adev);
+   }

 if (r) {
dev_err(tmp_adev->dev, "ASIC reset failed with error, %d for 
drm dev, %s",
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h 

Re: [PATCH 2/2] drm/amdgpu: add reset register trace function on GPU reset

2022-02-10 Thread Sharma, Shashank




On 2/10/2022 8:38 AM, Christian König wrote:

Am 10.02.22 um 08:34 schrieb Somalapuram, Amaranath:


On 2/10/2022 12:39 PM, Christian König wrote:

Am 10.02.22 um 06:29 schrieb Somalapuram, Amaranath:


On 2/9/2022 1:17 PM, Christian König wrote:

Am 08.02.22 um 16:28 schrieb Alex Deucher:

On Tue, Feb 8, 2022 at 3:17 AM Somalapuram Amaranath
 wrote:

Dump the list of register values to trace event on GPU reset.

Signed-off-by: Somalapuram Amaranath 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 21 
-
  drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h  | 19 
+++

  2 files changed, 39 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

index 1e651b959141..057922fb7e37 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4534,6 +4534,23 @@ int amdgpu_device_pre_asic_reset(struct 
amdgpu_device *adev,

 return r;
  }

+static int amdgpu_reset_reg_dumps(struct amdgpu_device *adev)
+{
+   int i;
+   uint32_t reg_value[128];
+
+   for (i = 0; adev->reset_dump_reg_list[i] != 0; i++) {
+   if (adev->asic_type >= CHIP_NAVI10)

This check should be against CHIP_VEGA10.  Also, this only allows for
GC registers.  If we wanted to dump other registers, we'd need a
different macro.  Might be better to just use RREG32 here for
everything and then encode the full offset using
SOC15_REG_ENTRY_OFFSET() or a similar macro.  Also, we need to think
about how to handle gfxoff in this case.  gfxoff needs to be disabled
or we'll hang the chip if we try and read GC or SDMA registers via
MMIO which will adversely affect the hang signature.


Well this should execute right before a GPU reset, so I think it 
shouldn't matter if we hang the chip or not as long as the read 
comes back correctly (I remember a very long UVD debug session 
because of this).


But in general I agree, we should just use RREG32() here and always 
encode the full register offset.


Regards,
Christian.


Can I use something like this:

+   reg_value[i] = 
RREG32((adev->reg_offset[adev->reset_dump_reg_list[i][0]]

+ [adev->reset_dump_reg_list[i][1]]
+ [adev->reset_dump_reg_list[i][2]])
+                                 + adev->reset_dump_reg_list[i][3]);

ip --> adev->reset_dump_reg_list[i][0]

inst --> adev->reset_dump_reg_list[i][1]

BASE_IDX--> adev->reset_dump_reg_list[i][2]

reg --> adev->reset_dump_reg_list[i][3]


No, that won't work.

What you need to do is to use the full 32bit address of the register. 
Userspace can worry about figuring out which ip, instance, base and 
reg to resolve into that address.


Regards,
Christian.


Thanks Christian.

should I consider using gfxoff like below code or not required:
amdgpu_gfx_off_ctrl(adev, false);
amdgpu_gfx_off_ctrl(adev, true);


That's a really good question I can't fully answer.

I think we don't want that because the GPU is stuck when the dump is 
made, but better let Alex comment as well.


Regards,
Christian.



I had a quick look at the function amdgpu_gfx_off_ctrl, and it locks 
this mutex internally:

mutex_lock(>gfx.gfx_off_mutex);

and the reference state is tracked in:
adev->gfx.gfx_off_state

We can do something like this maybe:
- If (adev->gfx_off_state == 0) {
  trylock(gfx_off_mutex)
  read_regs_now;
  unlock_mutex();
}

How does it sounds ?

- Shashank





Regards,

S.Amarnath


which requires 4 values in user space for each register.

using any existing macro like RREG32_SOC15** will not be able to 
pass proper argument from user space (like ip##_HWIP or reg##_BASE_IDX)






Alex


+ reg_value[i] = RREG32_SOC15_IP(GC, adev->reset_dump_reg_list[i]);
+   else
+   reg_value[i] = 
RREG32(adev->reset_dump_reg_list[i]);

+   }
+
+ trace_amdgpu_reset_reg_dumps(adev->reset_dump_reg_list, 
reg_value, i);

+
+   return 0;
+}
+
  int amdgpu_do_asic_reset(struct list_head *device_list_handle,
  struct amdgpu_reset_context 
*reset_context)

  {
@@ -4567,8 +4584,10 @@ int amdgpu_do_asic_reset(struct list_head 
*device_list_handle,

tmp_adev->gmc.xgmi.pending_reset = false;
 if 
(!queue_work(system_unbound_wq, _adev->xgmi_reset_work))

 r = -EALREADY;
-   } else
+   } else {
+ amdgpu_reset_reg_dumps(tmp_adev);
 r = amdgpu_asic_reset(tmp_adev);
+   }

 if (r) {
dev_err(tmp_adev->dev, "ASIC reset failed with error, %d for drm 
dev, %s",
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h

index d855cb53c7e0..3fe33de3564a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
@@ -537,6 +537,25 @@ TRACE_EVENT(amdgpu_ib_pipe_sync,
   

Re: [PATCH 2/2] drm/amdgpu: add reset register trace function on GPU reset

2022-02-10 Thread Christian König

Am 10.02.22 um 12:59 schrieb Sharma, Shashank:



On 2/10/2022 6:29 AM, Somalapuram, Amaranath wrote:


On 2/9/2022 1:17 PM, Christian König wrote:

Am 08.02.22 um 16:28 schrieb Alex Deucher:

On Tue, Feb 8, 2022 at 3:17 AM Somalapuram Amaranath
 wrote:

Dump the list of register values to trace event on GPU reset.

Signed-off-by: Somalapuram Amaranath 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 21 
-

  drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h  | 19 +++
  2 files changed, 39 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

index 1e651b959141..057922fb7e37 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4534,6 +4534,23 @@ int amdgpu_device_pre_asic_reset(struct 
amdgpu_device *adev,

 return r;
  }

+static int amdgpu_reset_reg_dumps(struct amdgpu_device *adev)
+{
+   int i;
+   uint32_t reg_value[128];
+
+   for (i = 0; adev->reset_dump_reg_list[i] != 0; i++) {
+   if (adev->asic_type >= CHIP_NAVI10)

This check should be against CHIP_VEGA10.  Also, this only allows for
GC registers.  If we wanted to dump other registers, we'd need a
different macro.  Might be better to just use RREG32 here for
everything and then encode the full offset using
SOC15_REG_ENTRY_OFFSET() or a similar macro.  Also, we need to think
about how to handle gfxoff in this case.  gfxoff needs to be disabled
or we'll hang the chip if we try and read GC or SDMA registers via
MMIO which will adversely affect the hang signature.


Well this should execute right before a GPU reset, so I think it 
shouldn't matter if we hang the chip or not as long as the read 
comes back correctly (I remember a very long UVD debug session 
because of this).


But in general I agree, we should just use RREG32() here and always 
encode the full register offset.


Regards,
Christian.


Can I use something like this:

+   reg_value[i] = 
RREG32((adev->reg_offset[adev->reset_dump_reg_list[i][0]]

+ [adev->reset_dump_reg_list[i][1]]
+ [adev->reset_dump_reg_list[i][2]])
+                                 + adev->reset_dump_reg_list[i][3]);

ip --> adev->reset_dump_reg_list[i][0]

inst --> adev->reset_dump_reg_list[i][1]

BASE_IDX--> adev->reset_dump_reg_list[i][2]

reg --> adev->reset_dump_reg_list[i][3]

which requires 4 values in user space for each register.

using any existing macro like RREG32_SOC15** will not be able to pass 
proper argument from user space (like ip##_HWIP or reg##_BASE_IDX)





Why cant we use just a simple array
adev->reset_dump_reg_list[10] for both ip and reg offsets ?


That won't work. The IPs are separated into several base registers, see 
how the SOC15 functions work.


But that's also not necessary. Userspace should have the same 
information as the kernel about which IP is mapped where.


So all we need here is the already resolved 32bit value of which 
register to read and are basically done.


Regards,
Christian.



Userspace can provide the IP engine enum in first entry of the array,
reset_dump_reg_list[0], and register offsets in other entries starting 
from 1. We can convert that into desirable engine substring using an 
array of char *, something like:


const char *ip_engine_name_substing[] = {
/* Same order as enum amd_hw_ip_block_type */
"GC", "HDP", ..
}

engine enum;
u32 ip = adev->reset_dump_reg_list[0];
const char *ip_name = ip_engine_name_subs[ip];

for (i = 0; i < 9; i++) {
reg_val = RREG_SOC15_IP(ip_name, reset_dump_reg_list[i+1]);
}

- Shashank





Alex

+   reg_value[i] = RREG32_SOC15_IP(GC, 
adev->reset_dump_reg_list[i]);

+   else
+   reg_value[i] = 
RREG32(adev->reset_dump_reg_list[i]);

+   }
+
+ trace_amdgpu_reset_reg_dumps(adev->reset_dump_reg_list, 
reg_value, i);

+
+   return 0;
+}
+
  int amdgpu_do_asic_reset(struct list_head *device_list_handle,
  struct amdgpu_reset_context *reset_context)
  {
@@ -4567,8 +4584,10 @@ int amdgpu_do_asic_reset(struct list_head 
*device_list_handle,

tmp_adev->gmc.xgmi.pending_reset = false;
 if 
(!queue_work(system_unbound_wq, _adev->xgmi_reset_work))

 r = -EALREADY;
-   } else
+   } else {
+ amdgpu_reset_reg_dumps(tmp_adev);
 r = amdgpu_asic_reset(tmp_adev);
+   }

 if (r) {
 dev_err(tmp_adev->dev, "ASIC 
reset failed with error, %d for drm dev, %s",
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h

index d855cb53c7e0..3fe33de3564a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
@@ -537,6 +537,25 @@ 

Re: [PATCH 2/2] drm/amdgpu: add reset register trace function on GPU reset

2022-02-10 Thread Sharma, Shashank




On 2/10/2022 6:29 AM, Somalapuram, Amaranath wrote:


On 2/9/2022 1:17 PM, Christian König wrote:

Am 08.02.22 um 16:28 schrieb Alex Deucher:

On Tue, Feb 8, 2022 at 3:17 AM Somalapuram Amaranath
 wrote:

Dump the list of register values to trace event on GPU reset.

Signed-off-by: Somalapuram Amaranath 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 21 -
  drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h  | 19 +++
  2 files changed, 39 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

index 1e651b959141..057922fb7e37 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4534,6 +4534,23 @@ int amdgpu_device_pre_asic_reset(struct 
amdgpu_device *adev,

 return r;
  }

+static int amdgpu_reset_reg_dumps(struct amdgpu_device *adev)
+{
+   int i;
+   uint32_t reg_value[128];
+
+   for (i = 0; adev->reset_dump_reg_list[i] != 0; i++) {
+   if (adev->asic_type >= CHIP_NAVI10)

This check should be against CHIP_VEGA10.  Also, this only allows for
GC registers.  If we wanted to dump other registers, we'd need a
different macro.  Might be better to just use RREG32 here for
everything and then encode the full offset using
SOC15_REG_ENTRY_OFFSET() or a similar macro.  Also, we need to think
about how to handle gfxoff in this case.  gfxoff needs to be disabled
or we'll hang the chip if we try and read GC or SDMA registers via
MMIO which will adversely affect the hang signature.


Well this should execute right before a GPU reset, so I think it 
shouldn't matter if we hang the chip or not as long as the read comes 
back correctly (I remember a very long UVD debug session because of 
this).


But in general I agree, we should just use RREG32() here and always 
encode the full register offset.


Regards,
Christian.


Can I use something like this:

+   reg_value[i] = 
RREG32((adev->reg_offset[adev->reset_dump_reg_list[i][0]]

+ [adev->reset_dump_reg_list[i][1]]
+ [adev->reset_dump_reg_list[i][2]])
+                                 + adev->reset_dump_reg_list[i][3]);

ip --> adev->reset_dump_reg_list[i][0]

inst --> adev->reset_dump_reg_list[i][1]

BASE_IDX--> adev->reset_dump_reg_list[i][2]

reg --> adev->reset_dump_reg_list[i][3]

which requires 4 values in user space for each register.

using any existing macro like RREG32_SOC15** will not be able to pass 
proper argument from user space (like ip##_HWIP or reg##_BASE_IDX)





Why cant we use just a simple array
adev->reset_dump_reg_list[10] for both ip and reg offsets ?

Userspace can provide the IP engine enum in first entry of the array,
reset_dump_reg_list[0], and register offsets in other entries starting 
from 1. We can convert that into desirable engine substring using an 
array of char *, something like:


const char *ip_engine_name_substing[] = {
/* Same order as enum amd_hw_ip_block_type */
"GC", "HDP", ..
}

engine enum;
u32 ip = adev->reset_dump_reg_list[0];
const char *ip_name = ip_engine_name_subs[ip];

for (i = 0; i < 9; i++) {
reg_val = RREG_SOC15_IP(ip_name, reset_dump_reg_list[i+1]);
}

- Shashank





Alex

+   reg_value[i] = RREG32_SOC15_IP(GC, 
adev->reset_dump_reg_list[i]);

+   else
+   reg_value[i] = 
RREG32(adev->reset_dump_reg_list[i]);

+   }
+
+ trace_amdgpu_reset_reg_dumps(adev->reset_dump_reg_list, reg_value, 
i);

+
+   return 0;
+}
+
  int amdgpu_do_asic_reset(struct list_head *device_list_handle,
  struct amdgpu_reset_context *reset_context)
  {
@@ -4567,8 +4584,10 @@ int amdgpu_do_asic_reset(struct list_head 
*device_list_handle,

tmp_adev->gmc.xgmi.pending_reset = false;
 if (!queue_work(system_unbound_wq, 
_adev->xgmi_reset_work))

 r = -EALREADY;
-   } else
+   } else {
+ amdgpu_reset_reg_dumps(tmp_adev);
 r = amdgpu_asic_reset(tmp_adev);
+   }

 if (r) {
 dev_err(tmp_adev->dev, "ASIC reset 
failed with error, %d for drm dev, %s",
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h

index d855cb53c7e0..3fe33de3564a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
@@ -537,6 +537,25 @@ TRACE_EVENT(amdgpu_ib_pipe_sync,
   __entry->seqno)
  );

+TRACE_EVENT(amdgpu_reset_reg_dumps,
+   TP_PROTO(long *address, uint32_t *value, int length),
+   TP_ARGS(address, value, length),
+   TP_STRUCT__entry(
+    __array(long, address, 128)
+    __array(uint32_t, value, 128)
+    

Re: [PATCH 2/2] drm/amdgpu: add reset register trace function on GPU reset

2022-02-09 Thread Christian König

Am 10.02.22 um 08:34 schrieb Somalapuram, Amaranath:


On 2/10/2022 12:39 PM, Christian König wrote:

Am 10.02.22 um 06:29 schrieb Somalapuram, Amaranath:


On 2/9/2022 1:17 PM, Christian König wrote:

Am 08.02.22 um 16:28 schrieb Alex Deucher:

On Tue, Feb 8, 2022 at 3:17 AM Somalapuram Amaranath
 wrote:

Dump the list of register values to trace event on GPU reset.

Signed-off-by: Somalapuram Amaranath 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 21 
-
  drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h  | 19 
+++

  2 files changed, 39 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

index 1e651b959141..057922fb7e37 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4534,6 +4534,23 @@ int amdgpu_device_pre_asic_reset(struct 
amdgpu_device *adev,

 return r;
  }

+static int amdgpu_reset_reg_dumps(struct amdgpu_device *adev)
+{
+   int i;
+   uint32_t reg_value[128];
+
+   for (i = 0; adev->reset_dump_reg_list[i] != 0; i++) {
+   if (adev->asic_type >= CHIP_NAVI10)

This check should be against CHIP_VEGA10.  Also, this only allows for
GC registers.  If we wanted to dump other registers, we'd need a
different macro.  Might be better to just use RREG32 here for
everything and then encode the full offset using
SOC15_REG_ENTRY_OFFSET() or a similar macro.  Also, we need to think
about how to handle gfxoff in this case.  gfxoff needs to be disabled
or we'll hang the chip if we try and read GC or SDMA registers via
MMIO which will adversely affect the hang signature.


Well this should execute right before a GPU reset, so I think it 
shouldn't matter if we hang the chip or not as long as the read 
comes back correctly (I remember a very long UVD debug session 
because of this).


But in general I agree, we should just use RREG32() here and always 
encode the full register offset.


Regards,
Christian.


Can I use something like this:

+   reg_value[i] = 
RREG32((adev->reg_offset[adev->reset_dump_reg_list[i][0]]

+ [adev->reset_dump_reg_list[i][1]]
+ [adev->reset_dump_reg_list[i][2]])
+                                 + adev->reset_dump_reg_list[i][3]);

ip --> adev->reset_dump_reg_list[i][0]

inst --> adev->reset_dump_reg_list[i][1]

BASE_IDX--> adev->reset_dump_reg_list[i][2]

reg --> adev->reset_dump_reg_list[i][3]


No, that won't work.

What you need to do is to use the full 32bit address of the register. 
Userspace can worry about figuring out which ip, instance, base and 
reg to resolve into that address.


Regards,
Christian.


Thanks Christian.

should I consider using gfxoff like below code or not required:
amdgpu_gfx_off_ctrl(adev, false);
amdgpu_gfx_off_ctrl(adev, true);


That's a really good question I can't fully answer.

I think we don't want that because the GPU is stuck when the dump is 
made, but better let Alex comment as well.


Regards,
Christian.




Regards,

S.Amarnath


which requires 4 values in user space for each register.

using any existing macro like RREG32_SOC15** will not be able to 
pass proper argument from user space (like ip##_HWIP or reg##_BASE_IDX)






Alex


+ reg_value[i] = RREG32_SOC15_IP(GC, adev->reset_dump_reg_list[i]);
+   else
+   reg_value[i] = 
RREG32(adev->reset_dump_reg_list[i]);

+   }
+
+ trace_amdgpu_reset_reg_dumps(adev->reset_dump_reg_list, 
reg_value, i);

+
+   return 0;
+}
+
  int amdgpu_do_asic_reset(struct list_head *device_list_handle,
  struct amdgpu_reset_context 
*reset_context)

  {
@@ -4567,8 +4584,10 @@ int amdgpu_do_asic_reset(struct list_head 
*device_list_handle,

tmp_adev->gmc.xgmi.pending_reset = false;
 if 
(!queue_work(system_unbound_wq, _adev->xgmi_reset_work))

 r = -EALREADY;
-   } else
+   } else {
+ amdgpu_reset_reg_dumps(tmp_adev);
 r = amdgpu_asic_reset(tmp_adev);
+   }

 if (r) {
dev_err(tmp_adev->dev, "ASIC reset failed with error, %d for drm 
dev, %s",
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h

index d855cb53c7e0..3fe33de3564a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
@@ -537,6 +537,25 @@ TRACE_EVENT(amdgpu_ib_pipe_sync,
   __entry->seqno)
  );

+TRACE_EVENT(amdgpu_reset_reg_dumps,
+   TP_PROTO(long *address, uint32_t *value, int length),
+   TP_ARGS(address, value, length),
+   TP_STRUCT__entry(
+    __array(long, address, 128)
+    __array(uint32_t, value, 128)
+    __field(int, len)
+    ),
+   

Re: [PATCH 2/2] drm/amdgpu: add reset register trace function on GPU reset

2022-02-09 Thread Somalapuram, Amaranath



On 2/10/2022 12:39 PM, Christian König wrote:

Am 10.02.22 um 06:29 schrieb Somalapuram, Amaranath:


On 2/9/2022 1:17 PM, Christian König wrote:

Am 08.02.22 um 16:28 schrieb Alex Deucher:

On Tue, Feb 8, 2022 at 3:17 AM Somalapuram Amaranath
 wrote:

Dump the list of register values to trace event on GPU reset.

Signed-off-by: Somalapuram Amaranath 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 21 
-

  drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h  | 19 +++
  2 files changed, 39 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

index 1e651b959141..057922fb7e37 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4534,6 +4534,23 @@ int amdgpu_device_pre_asic_reset(struct 
amdgpu_device *adev,

 return r;
  }

+static int amdgpu_reset_reg_dumps(struct amdgpu_device *adev)
+{
+   int i;
+   uint32_t reg_value[128];
+
+   for (i = 0; adev->reset_dump_reg_list[i] != 0; i++) {
+   if (adev->asic_type >= CHIP_NAVI10)

This check should be against CHIP_VEGA10.  Also, this only allows for
GC registers.  If we wanted to dump other registers, we'd need a
different macro.  Might be better to just use RREG32 here for
everything and then encode the full offset using
SOC15_REG_ENTRY_OFFSET() or a similar macro.  Also, we need to think
about how to handle gfxoff in this case.  gfxoff needs to be disabled
or we'll hang the chip if we try and read GC or SDMA registers via
MMIO which will adversely affect the hang signature.


Well this should execute right before a GPU reset, so I think it 
shouldn't matter if we hang the chip or not as long as the read 
comes back correctly (I remember a very long UVD debug session 
because of this).


But in general I agree, we should just use RREG32() here and always 
encode the full register offset.


Regards,
Christian.


Can I use something like this:

+   reg_value[i] = 
RREG32((adev->reg_offset[adev->reset_dump_reg_list[i][0]]

+ [adev->reset_dump_reg_list[i][1]]
+ [adev->reset_dump_reg_list[i][2]])
+                                 + adev->reset_dump_reg_list[i][3]);

ip --> adev->reset_dump_reg_list[i][0]

inst --> adev->reset_dump_reg_list[i][1]

BASE_IDX--> adev->reset_dump_reg_list[i][2]

reg --> adev->reset_dump_reg_list[i][3]


No, that won't work.

What you need to do is to use the full 32bit address of the register. 
Userspace can worry about figuring out which ip, instance, base and 
reg to resolve into that address.


Regards,
Christian.


Thanks Christian.

should I consider using gfxoff like below code or not required:
amdgpu_gfx_off_ctrl(adev, false);
amdgpu_gfx_off_ctrl(adev, true);


Regards,

S.Amarnath


which requires 4 values in user space for each register.

using any existing macro like RREG32_SOC15** will not be able to pass 
proper argument from user space (like ip##_HWIP or reg##_BASE_IDX)






Alex

+   reg_value[i] = RREG32_SOC15_IP(GC, 
adev->reset_dump_reg_list[i]);

+   else
+   reg_value[i] = 
RREG32(adev->reset_dump_reg_list[i]);

+   }
+
+ trace_amdgpu_reset_reg_dumps(adev->reset_dump_reg_list, 
reg_value, i);

+
+   return 0;
+}
+
  int amdgpu_do_asic_reset(struct list_head *device_list_handle,
  struct amdgpu_reset_context *reset_context)
  {
@@ -4567,8 +4584,10 @@ int amdgpu_do_asic_reset(struct list_head 
*device_list_handle,

tmp_adev->gmc.xgmi.pending_reset = false;
 if 
(!queue_work(system_unbound_wq, _adev->xgmi_reset_work))

 r = -EALREADY;
-   } else
+   } else {
+ amdgpu_reset_reg_dumps(tmp_adev);
 r = amdgpu_asic_reset(tmp_adev);
+   }

 if (r) {
 dev_err(tmp_adev->dev, "ASIC 
reset failed with error, %d for drm dev, %s",
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h

index d855cb53c7e0..3fe33de3564a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
@@ -537,6 +537,25 @@ TRACE_EVENT(amdgpu_ib_pipe_sync,
   __entry->seqno)
  );

+TRACE_EVENT(amdgpu_reset_reg_dumps,
+   TP_PROTO(long *address, uint32_t *value, int length),
+   TP_ARGS(address, value, length),
+   TP_STRUCT__entry(
+    __array(long, address, 128)
+    __array(uint32_t, value, 128)
+    __field(int, len)
+    ),
+   TP_fast_assign(
+  memcpy(__entry->address, address, 128);
+  memcpy(__entry->value, value, 128);
+  

Re: [PATCH 2/2] drm/amdgpu: add reset register trace function on GPU reset

2022-02-09 Thread Christian König

Am 10.02.22 um 06:29 schrieb Somalapuram, Amaranath:


On 2/9/2022 1:17 PM, Christian König wrote:

Am 08.02.22 um 16:28 schrieb Alex Deucher:

On Tue, Feb 8, 2022 at 3:17 AM Somalapuram Amaranath
 wrote:

Dump the list of register values to trace event on GPU reset.

Signed-off-by: Somalapuram Amaranath 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 21 
-

  drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h  | 19 +++
  2 files changed, 39 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

index 1e651b959141..057922fb7e37 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4534,6 +4534,23 @@ int amdgpu_device_pre_asic_reset(struct 
amdgpu_device *adev,

 return r;
  }

+static int amdgpu_reset_reg_dumps(struct amdgpu_device *adev)
+{
+   int i;
+   uint32_t reg_value[128];
+
+   for (i = 0; adev->reset_dump_reg_list[i] != 0; i++) {
+   if (adev->asic_type >= CHIP_NAVI10)

This check should be against CHIP_VEGA10.  Also, this only allows for
GC registers.  If we wanted to dump other registers, we'd need a
different macro.  Might be better to just use RREG32 here for
everything and then encode the full offset using
SOC15_REG_ENTRY_OFFSET() or a similar macro.  Also, we need to think
about how to handle gfxoff in this case.  gfxoff needs to be disabled
or we'll hang the chip if we try and read GC or SDMA registers via
MMIO which will adversely affect the hang signature.


Well this should execute right before a GPU reset, so I think it 
shouldn't matter if we hang the chip or not as long as the read comes 
back correctly (I remember a very long UVD debug session because of 
this).


But in general I agree, we should just use RREG32() here and always 
encode the full register offset.


Regards,
Christian.


Can I use something like this:

+   reg_value[i] = 
RREG32((adev->reg_offset[adev->reset_dump_reg_list[i][0]]

+ [adev->reset_dump_reg_list[i][1]]
+ [adev->reset_dump_reg_list[i][2]])
+                                 + adev->reset_dump_reg_list[i][3]);

ip --> adev->reset_dump_reg_list[i][0]

inst --> adev->reset_dump_reg_list[i][1]

BASE_IDX--> adev->reset_dump_reg_list[i][2]

reg --> adev->reset_dump_reg_list[i][3]


No, that won't work.

What you need to do is to use the full 32bit address of the register. 
Userspace can worry about figuring out which ip, instance, base and reg 
to resolve into that address.


Regards,
Christian.



which requires 4 values in user space for each register.

using any existing macro like RREG32_SOC15** will not be able to pass 
proper argument from user space (like ip##_HWIP or reg##_BASE_IDX)






Alex

+   reg_value[i] = RREG32_SOC15_IP(GC, 
adev->reset_dump_reg_list[i]);

+   else
+   reg_value[i] = 
RREG32(adev->reset_dump_reg_list[i]);

+   }
+
+ trace_amdgpu_reset_reg_dumps(adev->reset_dump_reg_list, 
reg_value, i);

+
+   return 0;
+}
+
  int amdgpu_do_asic_reset(struct list_head *device_list_handle,
  struct amdgpu_reset_context *reset_context)
  {
@@ -4567,8 +4584,10 @@ int amdgpu_do_asic_reset(struct list_head 
*device_list_handle,

tmp_adev->gmc.xgmi.pending_reset = false;
 if (!queue_work(system_unbound_wq, 
_adev->xgmi_reset_work))

 r = -EALREADY;
-   } else
+   } else {
+ amdgpu_reset_reg_dumps(tmp_adev);
 r = amdgpu_asic_reset(tmp_adev);
+   }

 if (r) {
 dev_err(tmp_adev->dev, "ASIC reset 
failed with error, %d for drm dev, %s",
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h

index d855cb53c7e0..3fe33de3564a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
@@ -537,6 +537,25 @@ TRACE_EVENT(amdgpu_ib_pipe_sync,
   __entry->seqno)
  );

+TRACE_EVENT(amdgpu_reset_reg_dumps,
+   TP_PROTO(long *address, uint32_t *value, int length),
+   TP_ARGS(address, value, length),
+   TP_STRUCT__entry(
+    __array(long, address, 128)
+    __array(uint32_t, value, 128)
+    __field(int, len)
+    ),
+   TP_fast_assign(
+  memcpy(__entry->address, address, 128);
+  memcpy(__entry->value,  value, 128);
+  __entry->len = length;
+  ),
+   TP_printk("amdgpu register dump offset: %s value: %s ",
+ __print_array(__entry->address, __entry->len, 
8),

+ 

Re: [PATCH 2/2] drm/amdgpu: add reset register trace function on GPU reset

2022-02-09 Thread Somalapuram, Amaranath



On 2/9/2022 1:17 PM, Christian König wrote:

Am 08.02.22 um 16:28 schrieb Alex Deucher:

On Tue, Feb 8, 2022 at 3:17 AM Somalapuram Amaranath
 wrote:

Dump the list of register values to trace event on GPU reset.

Signed-off-by: Somalapuram Amaranath 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 21 -
  drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h  | 19 +++
  2 files changed, 39 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

index 1e651b959141..057922fb7e37 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4534,6 +4534,23 @@ int amdgpu_device_pre_asic_reset(struct 
amdgpu_device *adev,

 return r;
  }

+static int amdgpu_reset_reg_dumps(struct amdgpu_device *adev)
+{
+   int i;
+   uint32_t reg_value[128];
+
+   for (i = 0; adev->reset_dump_reg_list[i] != 0; i++) {
+   if (adev->asic_type >= CHIP_NAVI10)

This check should be against CHIP_VEGA10.  Also, this only allows for
GC registers.  If we wanted to dump other registers, we'd need a
different macro.  Might be better to just use RREG32 here for
everything and then encode the full offset using
SOC15_REG_ENTRY_OFFSET() or a similar macro.  Also, we need to think
about how to handle gfxoff in this case.  gfxoff needs to be disabled
or we'll hang the chip if we try and read GC or SDMA registers via
MMIO which will adversely affect the hang signature.


Well this should execute right before a GPU reset, so I think it 
shouldn't matter if we hang the chip or not as long as the read comes 
back correctly (I remember a very long UVD debug session because of 
this).


But in general I agree, we should just use RREG32() here and always 
encode the full register offset.


Regards,
Christian.


Can I use something like this:

+   reg_value[i] = 
RREG32((adev->reg_offset[adev->reset_dump_reg_list[i][0]]

+ [adev->reset_dump_reg_list[i][1]]
+ [adev->reset_dump_reg_list[i][2]])
+                                 + adev->reset_dump_reg_list[i][3]);

ip --> adev->reset_dump_reg_list[i][0]

inst --> adev->reset_dump_reg_list[i][1]

BASE_IDX--> adev->reset_dump_reg_list[i][2]

reg --> adev->reset_dump_reg_list[i][3]

which requires 4 values in user space for each register.

using any existing macro like RREG32_SOC15** will not be able to pass 
proper argument from user space (like ip##_HWIP or reg##_BASE_IDX)






Alex

+   reg_value[i] = RREG32_SOC15_IP(GC, 
adev->reset_dump_reg_list[i]);

+   else
+   reg_value[i] = 
RREG32(adev->reset_dump_reg_list[i]);

+   }
+
+ trace_amdgpu_reset_reg_dumps(adev->reset_dump_reg_list, reg_value, 
i);

+
+   return 0;
+}
+
  int amdgpu_do_asic_reset(struct list_head *device_list_handle,
  struct amdgpu_reset_context *reset_context)
  {
@@ -4567,8 +4584,10 @@ int amdgpu_do_asic_reset(struct list_head 
*device_list_handle,

tmp_adev->gmc.xgmi.pending_reset = false;
 if (!queue_work(system_unbound_wq, 
_adev->xgmi_reset_work))

 r = -EALREADY;
-   } else
+   } else {
+ amdgpu_reset_reg_dumps(tmp_adev);
 r = amdgpu_asic_reset(tmp_adev);
+   }

 if (r) {
 dev_err(tmp_adev->dev, "ASIC reset 
failed with error, %d for drm dev, %s",
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h

index d855cb53c7e0..3fe33de3564a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
@@ -537,6 +537,25 @@ TRACE_EVENT(amdgpu_ib_pipe_sync,
   __entry->seqno)
  );

+TRACE_EVENT(amdgpu_reset_reg_dumps,
+   TP_PROTO(long *address, uint32_t *value, int length),
+   TP_ARGS(address, value, length),
+   TP_STRUCT__entry(
+    __array(long, address, 128)
+    __array(uint32_t, value, 128)
+    __field(int, len)
+    ),
+   TP_fast_assign(
+  memcpy(__entry->address, address, 128);
+  memcpy(__entry->value,  value, 128);
+  __entry->len = length;
+  ),
+   TP_printk("amdgpu register dump offset: %s value: %s ",
+ __print_array(__entry->address, __entry->len, 8),
+ __print_array(__entry->value, __entry->len, 8)
+    )
+);
+
  #undef AMDGPU_JOB_GET_TIMELINE_NAME
  #endif

--
2.25.1





Re: [PATCH 2/2] drm/amdgpu: add reset register trace function on GPU reset

2022-02-08 Thread Christian König

Am 08.02.22 um 16:28 schrieb Alex Deucher:

On Tue, Feb 8, 2022 at 3:17 AM Somalapuram Amaranath
 wrote:

Dump the list of register values to trace event on GPU reset.

Signed-off-by: Somalapuram Amaranath 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 21 -
  drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h  | 19 +++
  2 files changed, 39 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 1e651b959141..057922fb7e37 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4534,6 +4534,23 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device 
*adev,
 return r;
  }

+static int amdgpu_reset_reg_dumps(struct amdgpu_device *adev)
+{
+   int i;
+   uint32_t reg_value[128];
+
+   for (i = 0; adev->reset_dump_reg_list[i] != 0; i++) {
+   if (adev->asic_type >= CHIP_NAVI10)

This check should be against CHIP_VEGA10.  Also, this only allows for
GC registers.  If we wanted to dump other registers, we'd need a
different macro.  Might be better to just use RREG32 here for
everything and then encode the full offset using
SOC15_REG_ENTRY_OFFSET() or a similar macro.  Also, we need to think
about how to handle gfxoff in this case.  gfxoff needs to be disabled
or we'll hang the chip if we try and read GC or SDMA registers via
MMIO which will adversely affect the hang signature.


Well this should execute right before a GPU reset, so I think it 
shouldn't matter if we hang the chip or not as long as the read comes 
back correctly (I remember a very long UVD debug session because of this).


But in general I agree, we should just use RREG32() here and always 
encode the full register offset.


Regards,
Christian.




Alex


+   reg_value[i] = RREG32_SOC15_IP(GC, 
adev->reset_dump_reg_list[i]);
+   else
+   reg_value[i] = RREG32(adev->reset_dump_reg_list[i]);
+   }
+
+   trace_amdgpu_reset_reg_dumps(adev->reset_dump_reg_list, reg_value, i);
+
+   return 0;
+}
+
  int amdgpu_do_asic_reset(struct list_head *device_list_handle,
  struct amdgpu_reset_context *reset_context)
  {
@@ -4567,8 +4584,10 @@ int amdgpu_do_asic_reset(struct list_head 
*device_list_handle,
 tmp_adev->gmc.xgmi.pending_reset = false;
 if (!queue_work(system_unbound_wq, 
_adev->xgmi_reset_work))
 r = -EALREADY;
-   } else
+   } else {
+   amdgpu_reset_reg_dumps(tmp_adev);
 r = amdgpu_asic_reset(tmp_adev);
+   }

 if (r) {
 dev_err(tmp_adev->dev, "ASIC reset failed with 
error, %d for drm dev, %s",
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
index d855cb53c7e0..3fe33de3564a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
@@ -537,6 +537,25 @@ TRACE_EVENT(amdgpu_ib_pipe_sync,
   __entry->seqno)
  );

+TRACE_EVENT(amdgpu_reset_reg_dumps,
+   TP_PROTO(long *address, uint32_t *value, int length),
+   TP_ARGS(address, value, length),
+   TP_STRUCT__entry(
+__array(long, address, 128)
+__array(uint32_t, value, 128)
+__field(int, len)
+),
+   TP_fast_assign(
+  memcpy(__entry->address, address, 128);
+  memcpy(__entry->value,  value, 128);
+  __entry->len = length;
+  ),
+   TP_printk("amdgpu register dump offset: %s value: %s ",
+ __print_array(__entry->address, __entry->len, 8),
+ __print_array(__entry->value, __entry->len, 8)
+)
+);
+
  #undef AMDGPU_JOB_GET_TIMELINE_NAME
  #endif

--
2.25.1





Re: [PATCH 2/2] drm/amdgpu: add reset register trace function on GPU reset

2022-02-08 Thread Alex Deucher
On Tue, Feb 8, 2022 at 3:17 AM Somalapuram Amaranath
 wrote:
>
> Dump the list of register values to trace event on GPU reset.
>
> Signed-off-by: Somalapuram Amaranath 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 21 -
>  drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h  | 19 +++
>  2 files changed, 39 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 1e651b959141..057922fb7e37 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4534,6 +4534,23 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device 
> *adev,
> return r;
>  }
>
> +static int amdgpu_reset_reg_dumps(struct amdgpu_device *adev)
> +{
> +   int i;
> +   uint32_t reg_value[128];
> +
> +   for (i = 0; adev->reset_dump_reg_list[i] != 0; i++) {
> +   if (adev->asic_type >= CHIP_NAVI10)

This check should be against CHIP_VEGA10.  Also, this only allows for
GC registers.  If we wanted to dump other registers, we'd need a
different macro.  Might be better to just use RREG32 here for
everything and then encode the full offset using
SOC15_REG_ENTRY_OFFSET() or a similar macro.  Also, we need to think
about how to handle gfxoff in this case.  gfxoff needs to be disabled
or we'll hang the chip if we try and read GC or SDMA registers via
MMIO which will adversely affect the hang signature.

Alex

> +   reg_value[i] = RREG32_SOC15_IP(GC, 
> adev->reset_dump_reg_list[i]);
> +   else
> +   reg_value[i] = RREG32(adev->reset_dump_reg_list[i]);
> +   }
> +
> +   trace_amdgpu_reset_reg_dumps(adev->reset_dump_reg_list, reg_value, i);
> +
> +   return 0;
> +}
> +
>  int amdgpu_do_asic_reset(struct list_head *device_list_handle,
>  struct amdgpu_reset_context *reset_context)
>  {
> @@ -4567,8 +4584,10 @@ int amdgpu_do_asic_reset(struct list_head 
> *device_list_handle,
> tmp_adev->gmc.xgmi.pending_reset = false;
> if (!queue_work(system_unbound_wq, 
> _adev->xgmi_reset_work))
> r = -EALREADY;
> -   } else
> +   } else {
> +   amdgpu_reset_reg_dumps(tmp_adev);
> r = amdgpu_asic_reset(tmp_adev);
> +   }
>
> if (r) {
> dev_err(tmp_adev->dev, "ASIC reset failed 
> with error, %d for drm dev, %s",
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
> index d855cb53c7e0..3fe33de3564a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
> @@ -537,6 +537,25 @@ TRACE_EVENT(amdgpu_ib_pipe_sync,
>   __entry->seqno)
>  );
>
> +TRACE_EVENT(amdgpu_reset_reg_dumps,
> +   TP_PROTO(long *address, uint32_t *value, int length),
> +   TP_ARGS(address, value, length),
> +   TP_STRUCT__entry(
> +__array(long, address, 128)
> +__array(uint32_t, value, 128)
> +__field(int, len)
> +),
> +   TP_fast_assign(
> +  memcpy(__entry->address, address, 128);
> +  memcpy(__entry->value,  value, 128);
> +  __entry->len = length;
> +  ),
> +   TP_printk("amdgpu register dump offset: %s value: %s ",
> + __print_array(__entry->address, __entry->len, 8),
> + __print_array(__entry->value, __entry->len, 8)
> +)
> +);
> +
>  #undef AMDGPU_JOB_GET_TIMELINE_NAME
>  #endif
>
> --
> 2.25.1
>


[PATCH 2/2] drm/amdgpu: add reset register trace function on GPU reset

2022-02-08 Thread Somalapuram Amaranath
Dump the list of register values to trace event on GPU reset.

Signed-off-by: Somalapuram Amaranath 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 21 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h  | 19 +++
 2 files changed, 39 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 1e651b959141..057922fb7e37 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4534,6 +4534,23 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device 
*adev,
return r;
 }
 
+static int amdgpu_reset_reg_dumps(struct amdgpu_device *adev)
+{
+   int i;
+   uint32_t reg_value[128];
+
+   for (i = 0; adev->reset_dump_reg_list[i] != 0; i++) {
+   if (adev->asic_type >= CHIP_NAVI10)
+   reg_value[i] = RREG32_SOC15_IP(GC, 
adev->reset_dump_reg_list[i]);
+   else
+   reg_value[i] = RREG32(adev->reset_dump_reg_list[i]);
+   }
+
+   trace_amdgpu_reset_reg_dumps(adev->reset_dump_reg_list, reg_value, i);
+
+   return 0;
+}
+
 int amdgpu_do_asic_reset(struct list_head *device_list_handle,
 struct amdgpu_reset_context *reset_context)
 {
@@ -4567,8 +4584,10 @@ int amdgpu_do_asic_reset(struct list_head 
*device_list_handle,
tmp_adev->gmc.xgmi.pending_reset = false;
if (!queue_work(system_unbound_wq, 
_adev->xgmi_reset_work))
r = -EALREADY;
-   } else
+   } else {
+   amdgpu_reset_reg_dumps(tmp_adev);
r = amdgpu_asic_reset(tmp_adev);
+   }
 
if (r) {
dev_err(tmp_adev->dev, "ASIC reset failed with 
error, %d for drm dev, %s",
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
index d855cb53c7e0..3fe33de3564a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
@@ -537,6 +537,25 @@ TRACE_EVENT(amdgpu_ib_pipe_sync,
  __entry->seqno)
 );
 
+TRACE_EVENT(amdgpu_reset_reg_dumps,
+   TP_PROTO(long *address, uint32_t *value, int length),
+   TP_ARGS(address, value, length),
+   TP_STRUCT__entry(
+__array(long, address, 128)
+__array(uint32_t, value, 128)
+__field(int, len)
+),
+   TP_fast_assign(
+  memcpy(__entry->address, address, 128);
+  memcpy(__entry->value,  value, 128);
+  __entry->len = length;
+  ),
+   TP_printk("amdgpu register dump offset: %s value: %s ",
+ __print_array(__entry->address, __entry->len, 8),
+ __print_array(__entry->value, __entry->len, 8)
+)
+);
+
 #undef AMDGPU_JOB_GET_TIMELINE_NAME
 #endif
 
-- 
2.25.1