RE: [PATCH] amdgpu: disable GPU reset if amdgpu.lockup_timeout=0

2017-12-17 Thread Liu, Monk
Lockup_timeout = 0 doesn't indicate GPU reset isn't ready, kernel/amdgpu never 
tell you that, instead if means there is no Timeout of jobs 
So no warning, no gpu recover triggered by time out event, but that doesn't 
mean gpu recover cannot be triggered, e.g. for SRIOV we can
Trigger gpu recover by hypervisor.

Your patch shouldn't and cannot break exist logics, that's very simple rule ...
If you insist your change, at least make sure it doesn't change any logic of 
SRIOV and that's not hard for you, just add "if (!amdgpu_sriov_vf(adev))" 
checking 
Prior to your path, although I didn't encourage such ugly actions...


-Original Message-
From: Marek Olšák [mailto:mar...@gmail.com] 
Sent: Tuesday, December 12, 2017 11:02 PM
To: Liu, Monk 
Cc: amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] amdgpu: disable GPU reset if amdgpu.lockup_timeout=0

On Tue, Dec 12, 2017 at 4:18 AM, Liu, Monk  wrote:
> NAK, you change break SRIOV logic:
>
> Without lockup_timeout set, this gpu_recover() won't get called at all 
> , unless your IB triggered invalid instruct and that IRQ invoked 
> Amdgpu_gpu_recover(), by this cause you should disable the logic that 
> in that IRQ instead of change gpu_recover() itself because For SRIOV 
> we need gpu_recover() even lockup_timeout is zero

The default value of 0 indicates that GPU reset isn't ready to be enabled by 
default. That's what it means. Once the GPU reset works, the default should be 
non-zero (e.g. 1) and
amdgpu.lockup_timeout=0 should be used to disable all GPU resets in order to be 
able do scandumps and debug GPU hangs.

Marek
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] amdgpu: disable GPU reset if amdgpu.lockup_timeout=0

2017-12-12 Thread Marek Olšák
On Tue, Dec 12, 2017 at 5:36 PM, Christian König
 wrote:
> Am 12.12.2017 um 15:57 schrieb Marek Olšák:
>>
>> On Tue, Dec 12, 2017 at 10:01 AM, Christian König
>>  wrote:
>>>
>>> Am 11.12.2017 um 22:29 schrieb Marek Olšák:

 From: Marek Olšák 

 Signed-off-by: Marek Olšák 
 ---

 Is this really correct? I have no easy way to test it.
>>>
>>>
>>> It's a step in the right direction, but I would rather vote for something
>>> else:
>>>
>>> Instead of disabling the timeout by default we only disable the GPU
>>> reset/recovery.
>>>
>>> The idea is to add a new parameter amdgpu_gpu_recovery which makes
>>> amdgpu_gpu_recover only prints out an error and doesn't touch the GPU at
>>> all
>>> (on bare metal systems).
>>>
>>> Then we finally set the amdgpu_lockup_timeout to a non zero value by
>>> default.
>>>
>>> Andrey could you take care of this when you have time?
>>
>> I don't understand this.
>>
>> Why can't we keep the previous behavior where amdgpu.lockup_timeout=0
>> disabled GPU reset? Why do we have to add another option for the same
>> thing?
>
>
> lockup_timeout=0 never disabled the GPU reset, it just disabled the timeout.

It disabled the automatic reset before we had those interrupt callbacks.

>
> You could still manually trigger a reset and also invalid commands, invalid
> register writes and requests from the SRIOV hypervisor could trigger this.

That's OK. Manual resets should always be allowed.

>
> And as Monk explained GPU resets are mandatory for SRIOV, you can't disable
> them at all in this case.

What is preventing Monk from setting amdgpu.lockup_timeout > 0, which
should be the default state anyway?

Let's just say lockup_timeout=0 has undefined behavior with SRIOV.

>
> Additional to that we probably want the error message that something timed
> out, but not touching the hardware in any way.

Yes that is a fair point.

Marek
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] amdgpu: disable GPU reset if amdgpu.lockup_timeout=0

2017-12-12 Thread Christian König

Am 12.12.2017 um 15:57 schrieb Marek Olšák:

On Tue, Dec 12, 2017 at 10:01 AM, Christian König
 wrote:

Am 11.12.2017 um 22:29 schrieb Marek Olšák:

From: Marek Olšák 

Signed-off-by: Marek Olšák 
---

Is this really correct? I have no easy way to test it.


It's a step in the right direction, but I would rather vote for something
else:

Instead of disabling the timeout by default we only disable the GPU
reset/recovery.

The idea is to add a new parameter amdgpu_gpu_recovery which makes
amdgpu_gpu_recover only prints out an error and doesn't touch the GPU at all
(on bare metal systems).

Then we finally set the amdgpu_lockup_timeout to a non zero value by
default.

Andrey could you take care of this when you have time?

I don't understand this.

Why can't we keep the previous behavior where amdgpu.lockup_timeout=0
disabled GPU reset? Why do we have to add another option for the same
thing?


lockup_timeout=0 never disabled the GPU reset, it just disabled the timeout.

You could still manually trigger a reset and also invalid commands, 
invalid register writes and requests from the SRIOV hypervisor could 
trigger this.


And as Monk explained GPU resets are mandatory for SRIOV, you can't 
disable them at all in this case.


Additional to that we probably want the error message that something 
timed out, but not touching the hardware in any way.


Regards,
Christian.



Marek
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] amdgpu: disable GPU reset if amdgpu.lockup_timeout=0

2017-12-12 Thread Marek Olšák
On Tue, Dec 12, 2017 at 4:18 AM, Liu, Monk  wrote:
> NAK, you change break SRIOV logic:
>
> Without lockup_timeout set, this gpu_recover() won't get called at all , 
> unless your IB triggered invalid instruct and that IRQ invoked
> Amdgpu_gpu_recover(), by this cause you should disable the logic that in that 
> IRQ instead of change gpu_recover() itself because
> For SRIOV we need gpu_recover() even lockup_timeout is zero

The default value of 0 indicates that GPU reset isn't ready to be
enabled by default. That's what it means. Once the GPU reset works,
the default should be non-zero (e.g. 1) and
amdgpu.lockup_timeout=0 should be used to disable all GPU resets in
order to be able do scandumps and debug GPU hangs.

Marek
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] amdgpu: disable GPU reset if amdgpu.lockup_timeout=0

2017-12-12 Thread Marek Olšák
On Tue, Dec 12, 2017 at 10:01 AM, Christian König
 wrote:
> Am 11.12.2017 um 22:29 schrieb Marek Olšák:
>>
>> From: Marek Olšák 
>>
>> Signed-off-by: Marek Olšák 
>> ---
>>
>> Is this really correct? I have no easy way to test it.
>
>
> It's a step in the right direction, but I would rather vote for something
> else:
>
> Instead of disabling the timeout by default we only disable the GPU
> reset/recovery.
>
> The idea is to add a new parameter amdgpu_gpu_recovery which makes
> amdgpu_gpu_recover only prints out an error and doesn't touch the GPU at all
> (on bare metal systems).
>
> Then we finally set the amdgpu_lockup_timeout to a non zero value by
> default.
>
> Andrey could you take care of this when you have time?

I don't understand this.

Why can't we keep the previous behavior where amdgpu.lockup_timeout=0
disabled GPU reset? Why do we have to add another option for the same
thing?

Marek
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] amdgpu: disable GPU reset if amdgpu.lockup_timeout=0

2017-12-12 Thread Andrey Grodzovsky



On 12/12/2017 04:01 AM, Christian König wrote:

Am 11.12.2017 um 22:29 schrieb Marek Olšák:

From: Marek Olšák 

Signed-off-by: Marek Olšák 
---

Is this really correct? I have no easy way to test it.


It's a step in the right direction, but I would rather vote for 
something else:


Instead of disabling the timeout by default we only disable the GPU 
reset/recovery.


The idea is to add a new parameter amdgpu_gpu_recovery which makes 
amdgpu_gpu_recover only prints out an error and doesn't touch the GPU 
at all (on bare metal systems).


Then we finally set the amdgpu_lockup_timeout to a non zero value by 
default.


Andrey could you take care of this when you have time?

Thanks,
Christian.


Sure.

Thanks,
Andrey





  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 
  1 file changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

index 8d03baa..56c41cf 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3018,20 +3018,24 @@ static int amdgpu_reset_sriov(struct 
amdgpu_device *adev, uint64_t *reset_flags,

   *
   * Attempt to reset the GPU if it has hung (all asics).
   * Returns 0 for success or an error on failure.
   */
  int amdgpu_gpu_recover(struct amdgpu_device *adev, struct 
amdgpu_job *job)

  {
  struct drm_atomic_state *state = NULL;
  uint64_t reset_flags = 0;
  int i, r, resched;
  +    /* amdgpu.lockup_timeout=0 disables GPU reset. */
+    if (amdgpu_lockup_timeout == 0)
+    return 0;
+
  if (!amdgpu_check_soft_reset(adev)) {
  DRM_INFO("No hardware hang detected. Did some blocks 
stall?\n");

  return 0;
  }
    dev_info(adev->dev, "GPU reset begin!\n");
    mutex_lock(&adev->lock_reset);
  atomic_inc(&adev->gpu_reset_counter);
  adev->in_gpu_reset = 1;




___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] amdgpu: disable GPU reset if amdgpu.lockup_timeout=0

2017-12-12 Thread Christian König

Am 11.12.2017 um 22:29 schrieb Marek Olšák:

From: Marek Olšák 

Signed-off-by: Marek Olšák 
---

Is this really correct? I have no easy way to test it.


It's a step in the right direction, but I would rather vote for 
something else:


Instead of disabling the timeout by default we only disable the GPU 
reset/recovery.


The idea is to add a new parameter amdgpu_gpu_recovery which makes 
amdgpu_gpu_recover only prints out an error and doesn't touch the GPU at 
all (on bare metal systems).


Then we finally set the amdgpu_lockup_timeout to a non zero value by 
default.


Andrey could you take care of this when you have time?

Thanks,
Christian.



  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 
  1 file changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 8d03baa..56c41cf 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3018,20 +3018,24 @@ static int amdgpu_reset_sriov(struct amdgpu_device 
*adev, uint64_t *reset_flags,
   *
   * Attempt to reset the GPU if it has hung (all asics).
   * Returns 0 for success or an error on failure.
   */
  int amdgpu_gpu_recover(struct amdgpu_device *adev, struct amdgpu_job *job)
  {
struct drm_atomic_state *state = NULL;
uint64_t reset_flags = 0;
int i, r, resched;
  
+	/* amdgpu.lockup_timeout=0 disables GPU reset. */

+   if (amdgpu_lockup_timeout == 0)
+   return 0;
+
if (!amdgpu_check_soft_reset(adev)) {
DRM_INFO("No hardware hang detected. Did some blocks stall?\n");
return 0;
}
  
  	dev_info(adev->dev, "GPU reset begin!\n");
  
  	mutex_lock(&adev->lock_reset);

atomic_inc(&adev->gpu_reset_counter);
adev->in_gpu_reset = 1;


___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [PATCH] amdgpu: disable GPU reset if amdgpu.lockup_timeout=0

2017-12-11 Thread Liu, Monk
NAK, you change break SRIOV logic:

Without lockup_timeout set, this gpu_recover() won't get called at all , unless 
your IB triggered invalid instruct and that IRQ invoked 
Amdgpu_gpu_recover(), by this cause you should disable the logic that in that 
IRQ instead of change gpu_recover() itself because 
For SRIOV we need gpu_recover() even lockup_timeout is zero 


-Original Message-
From: amd-gfx [mailto:amd-gfx-boun...@lists.freedesktop.org] On Behalf Of Marek 
Ol?ák
Sent: 2017年12月12日 5:30
To: amd-gfx@lists.freedesktop.org
Subject: [PATCH] amdgpu: disable GPU reset if amdgpu.lockup_timeout=0

From: Marek Olšák 

Signed-off-by: Marek Olšák 
---

Is this really correct? I have no easy way to test it.

 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 8d03baa..56c41cf 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3018,20 +3018,24 @@ static int amdgpu_reset_sriov(struct amdgpu_device 
*adev, uint64_t *reset_flags,
  *
  * Attempt to reset the GPU if it has hung (all asics).
  * Returns 0 for success or an error on failure.
  */
 int amdgpu_gpu_recover(struct amdgpu_device *adev, struct amdgpu_job *job)  {
struct drm_atomic_state *state = NULL;
uint64_t reset_flags = 0;
int i, r, resched;
 
+   /* amdgpu.lockup_timeout=0 disables GPU reset. */
+   if (amdgpu_lockup_timeout == 0)
+   return 0;
+
if (!amdgpu_check_soft_reset(adev)) {
DRM_INFO("No hardware hang detected. Did some blocks stall?\n");
return 0;
}
 
dev_info(adev->dev, "GPU reset begin!\n");
 
mutex_lock(&adev->lock_reset);
atomic_inc(&adev->gpu_reset_counter);
adev->in_gpu_reset = 1;
--
2.7.4

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH] amdgpu: disable GPU reset if amdgpu.lockup_timeout=0

2017-12-11 Thread Marek Olšák
From: Marek Olšák 

Signed-off-by: Marek Olšák 
---

Is this really correct? I have no easy way to test it.

 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 8d03baa..56c41cf 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3018,20 +3018,24 @@ static int amdgpu_reset_sriov(struct amdgpu_device 
*adev, uint64_t *reset_flags,
  *
  * Attempt to reset the GPU if it has hung (all asics).
  * Returns 0 for success or an error on failure.
  */
 int amdgpu_gpu_recover(struct amdgpu_device *adev, struct amdgpu_job *job)
 {
struct drm_atomic_state *state = NULL;
uint64_t reset_flags = 0;
int i, r, resched;
 
+   /* amdgpu.lockup_timeout=0 disables GPU reset. */
+   if (amdgpu_lockup_timeout == 0)
+   return 0;
+
if (!amdgpu_check_soft_reset(adev)) {
DRM_INFO("No hardware hang detected. Did some blocks stall?\n");
return 0;
}
 
dev_info(adev->dev, "GPU reset begin!\n");
 
mutex_lock(&adev->lock_reset);
atomic_inc(&adev->gpu_reset_counter);
adev->in_gpu_reset = 1;
-- 
2.7.4

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx