RE: [PATCH] amdgpu: disable GPU reset if amdgpu.lockup_timeout=0
Lockup_timeout = 0 doesn't indicate GPU reset isn't ready, kernel/amdgpu never tell you that, instead if means there is no Timeout of jobs So no warning, no gpu recover triggered by time out event, but that doesn't mean gpu recover cannot be triggered, e.g. for SRIOV we can Trigger gpu recover by hypervisor. Your patch shouldn't and cannot break exist logics, that's very simple rule ... If you insist your change, at least make sure it doesn't change any logic of SRIOV and that's not hard for you, just add "if (!amdgpu_sriov_vf(adev))" checking Prior to your path, although I didn't encourage such ugly actions... -Original Message- From: Marek Olšák [mailto:mar...@gmail.com] Sent: Tuesday, December 12, 2017 11:02 PM To: Liu, Monk Cc: amd-gfx@lists.freedesktop.org Subject: Re: [PATCH] amdgpu: disable GPU reset if amdgpu.lockup_timeout=0 On Tue, Dec 12, 2017 at 4:18 AM, Liu, Monk wrote: > NAK, you change break SRIOV logic: > > Without lockup_timeout set, this gpu_recover() won't get called at all > , unless your IB triggered invalid instruct and that IRQ invoked > Amdgpu_gpu_recover(), by this cause you should disable the logic that > in that IRQ instead of change gpu_recover() itself because For SRIOV > we need gpu_recover() even lockup_timeout is zero The default value of 0 indicates that GPU reset isn't ready to be enabled by default. That's what it means. Once the GPU reset works, the default should be non-zero (e.g. 1) and amdgpu.lockup_timeout=0 should be used to disable all GPU resets in order to be able do scandumps and debug GPU hangs. Marek ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Re: [PATCH] amdgpu: disable GPU reset if amdgpu.lockup_timeout=0
On Tue, Dec 12, 2017 at 5:36 PM, Christian König wrote: > Am 12.12.2017 um 15:57 schrieb Marek Olšák: >> >> On Tue, Dec 12, 2017 at 10:01 AM, Christian König >> wrote: >>> >>> Am 11.12.2017 um 22:29 schrieb Marek Olšák: From: Marek Olšák Signed-off-by: Marek Olšák --- Is this really correct? I have no easy way to test it. >>> >>> >>> It's a step in the right direction, but I would rather vote for something >>> else: >>> >>> Instead of disabling the timeout by default we only disable the GPU >>> reset/recovery. >>> >>> The idea is to add a new parameter amdgpu_gpu_recovery which makes >>> amdgpu_gpu_recover only prints out an error and doesn't touch the GPU at >>> all >>> (on bare metal systems). >>> >>> Then we finally set the amdgpu_lockup_timeout to a non zero value by >>> default. >>> >>> Andrey could you take care of this when you have time? >> >> I don't understand this. >> >> Why can't we keep the previous behavior where amdgpu.lockup_timeout=0 >> disabled GPU reset? Why do we have to add another option for the same >> thing? > > > lockup_timeout=0 never disabled the GPU reset, it just disabled the timeout. It disabled the automatic reset before we had those interrupt callbacks. > > You could still manually trigger a reset and also invalid commands, invalid > register writes and requests from the SRIOV hypervisor could trigger this. That's OK. Manual resets should always be allowed. > > And as Monk explained GPU resets are mandatory for SRIOV, you can't disable > them at all in this case. What is preventing Monk from setting amdgpu.lockup_timeout > 0, which should be the default state anyway? Let's just say lockup_timeout=0 has undefined behavior with SRIOV. > > Additional to that we probably want the error message that something timed > out, but not touching the hardware in any way. Yes that is a fair point. Marek ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Re: [PATCH] amdgpu: disable GPU reset if amdgpu.lockup_timeout=0
Am 12.12.2017 um 15:57 schrieb Marek Olšák: On Tue, Dec 12, 2017 at 10:01 AM, Christian König wrote: Am 11.12.2017 um 22:29 schrieb Marek Olšák: From: Marek Olšák Signed-off-by: Marek Olšák --- Is this really correct? I have no easy way to test it. It's a step in the right direction, but I would rather vote for something else: Instead of disabling the timeout by default we only disable the GPU reset/recovery. The idea is to add a new parameter amdgpu_gpu_recovery which makes amdgpu_gpu_recover only prints out an error and doesn't touch the GPU at all (on bare metal systems). Then we finally set the amdgpu_lockup_timeout to a non zero value by default. Andrey could you take care of this when you have time? I don't understand this. Why can't we keep the previous behavior where amdgpu.lockup_timeout=0 disabled GPU reset? Why do we have to add another option for the same thing? lockup_timeout=0 never disabled the GPU reset, it just disabled the timeout. You could still manually trigger a reset and also invalid commands, invalid register writes and requests from the SRIOV hypervisor could trigger this. And as Monk explained GPU resets are mandatory for SRIOV, you can't disable them at all in this case. Additional to that we probably want the error message that something timed out, but not touching the hardware in any way. Regards, Christian. Marek ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Re: [PATCH] amdgpu: disable GPU reset if amdgpu.lockup_timeout=0
On Tue, Dec 12, 2017 at 4:18 AM, Liu, Monk wrote: > NAK, you change break SRIOV logic: > > Without lockup_timeout set, this gpu_recover() won't get called at all , > unless your IB triggered invalid instruct and that IRQ invoked > Amdgpu_gpu_recover(), by this cause you should disable the logic that in that > IRQ instead of change gpu_recover() itself because > For SRIOV we need gpu_recover() even lockup_timeout is zero The default value of 0 indicates that GPU reset isn't ready to be enabled by default. That's what it means. Once the GPU reset works, the default should be non-zero (e.g. 1) and amdgpu.lockup_timeout=0 should be used to disable all GPU resets in order to be able do scandumps and debug GPU hangs. Marek ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Re: [PATCH] amdgpu: disable GPU reset if amdgpu.lockup_timeout=0
On Tue, Dec 12, 2017 at 10:01 AM, Christian König wrote: > Am 11.12.2017 um 22:29 schrieb Marek Olšák: >> >> From: Marek Olšák >> >> Signed-off-by: Marek Olšák >> --- >> >> Is this really correct? I have no easy way to test it. > > > It's a step in the right direction, but I would rather vote for something > else: > > Instead of disabling the timeout by default we only disable the GPU > reset/recovery. > > The idea is to add a new parameter amdgpu_gpu_recovery which makes > amdgpu_gpu_recover only prints out an error and doesn't touch the GPU at all > (on bare metal systems). > > Then we finally set the amdgpu_lockup_timeout to a non zero value by > default. > > Andrey could you take care of this when you have time? I don't understand this. Why can't we keep the previous behavior where amdgpu.lockup_timeout=0 disabled GPU reset? Why do we have to add another option for the same thing? Marek ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Re: [PATCH] amdgpu: disable GPU reset if amdgpu.lockup_timeout=0
On 12/12/2017 04:01 AM, Christian König wrote: Am 11.12.2017 um 22:29 schrieb Marek Olšák: From: Marek Olšák Signed-off-by: Marek Olšák --- Is this really correct? I have no easy way to test it. It's a step in the right direction, but I would rather vote for something else: Instead of disabling the timeout by default we only disable the GPU reset/recovery. The idea is to add a new parameter amdgpu_gpu_recovery which makes amdgpu_gpu_recover only prints out an error and doesn't touch the GPU at all (on bare metal systems). Then we finally set the amdgpu_lockup_timeout to a non zero value by default. Andrey could you take care of this when you have time? Thanks, Christian. Sure. Thanks, Andrey drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 1 file changed, 4 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index 8d03baa..56c41cf 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -3018,20 +3018,24 @@ static int amdgpu_reset_sriov(struct amdgpu_device *adev, uint64_t *reset_flags, * * Attempt to reset the GPU if it has hung (all asics). * Returns 0 for success or an error on failure. */ int amdgpu_gpu_recover(struct amdgpu_device *adev, struct amdgpu_job *job) { struct drm_atomic_state *state = NULL; uint64_t reset_flags = 0; int i, r, resched; + /* amdgpu.lockup_timeout=0 disables GPU reset. */ + if (amdgpu_lockup_timeout == 0) + return 0; + if (!amdgpu_check_soft_reset(adev)) { DRM_INFO("No hardware hang detected. Did some blocks stall?\n"); return 0; } dev_info(adev->dev, "GPU reset begin!\n"); mutex_lock(&adev->lock_reset); atomic_inc(&adev->gpu_reset_counter); adev->in_gpu_reset = 1; ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Re: [PATCH] amdgpu: disable GPU reset if amdgpu.lockup_timeout=0
Am 11.12.2017 um 22:29 schrieb Marek Olšák: From: Marek Olšák Signed-off-by: Marek Olšák --- Is this really correct? I have no easy way to test it. It's a step in the right direction, but I would rather vote for something else: Instead of disabling the timeout by default we only disable the GPU reset/recovery. The idea is to add a new parameter amdgpu_gpu_recovery which makes amdgpu_gpu_recover only prints out an error and doesn't touch the GPU at all (on bare metal systems). Then we finally set the amdgpu_lockup_timeout to a non zero value by default. Andrey could you take care of this when you have time? Thanks, Christian. drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 1 file changed, 4 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index 8d03baa..56c41cf 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -3018,20 +3018,24 @@ static int amdgpu_reset_sriov(struct amdgpu_device *adev, uint64_t *reset_flags, * * Attempt to reset the GPU if it has hung (all asics). * Returns 0 for success or an error on failure. */ int amdgpu_gpu_recover(struct amdgpu_device *adev, struct amdgpu_job *job) { struct drm_atomic_state *state = NULL; uint64_t reset_flags = 0; int i, r, resched; + /* amdgpu.lockup_timeout=0 disables GPU reset. */ + if (amdgpu_lockup_timeout == 0) + return 0; + if (!amdgpu_check_soft_reset(adev)) { DRM_INFO("No hardware hang detected. Did some blocks stall?\n"); return 0; } dev_info(adev->dev, "GPU reset begin!\n"); mutex_lock(&adev->lock_reset); atomic_inc(&adev->gpu_reset_counter); adev->in_gpu_reset = 1; ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
RE: [PATCH] amdgpu: disable GPU reset if amdgpu.lockup_timeout=0
NAK, you change break SRIOV logic: Without lockup_timeout set, this gpu_recover() won't get called at all , unless your IB triggered invalid instruct and that IRQ invoked Amdgpu_gpu_recover(), by this cause you should disable the logic that in that IRQ instead of change gpu_recover() itself because For SRIOV we need gpu_recover() even lockup_timeout is zero -Original Message- From: amd-gfx [mailto:amd-gfx-boun...@lists.freedesktop.org] On Behalf Of Marek Ol?ák Sent: 2017年12月12日 5:30 To: amd-gfx@lists.freedesktop.org Subject: [PATCH] amdgpu: disable GPU reset if amdgpu.lockup_timeout=0 From: Marek Olšák Signed-off-by: Marek Olšák --- Is this really correct? I have no easy way to test it. drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 1 file changed, 4 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index 8d03baa..56c41cf 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -3018,20 +3018,24 @@ static int amdgpu_reset_sriov(struct amdgpu_device *adev, uint64_t *reset_flags, * * Attempt to reset the GPU if it has hung (all asics). * Returns 0 for success or an error on failure. */ int amdgpu_gpu_recover(struct amdgpu_device *adev, struct amdgpu_job *job) { struct drm_atomic_state *state = NULL; uint64_t reset_flags = 0; int i, r, resched; + /* amdgpu.lockup_timeout=0 disables GPU reset. */ + if (amdgpu_lockup_timeout == 0) + return 0; + if (!amdgpu_check_soft_reset(adev)) { DRM_INFO("No hardware hang detected. Did some blocks stall?\n"); return 0; } dev_info(adev->dev, "GPU reset begin!\n"); mutex_lock(&adev->lock_reset); atomic_inc(&adev->gpu_reset_counter); adev->in_gpu_reset = 1; -- 2.7.4 ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
[PATCH] amdgpu: disable GPU reset if amdgpu.lockup_timeout=0
From: Marek Olšák Signed-off-by: Marek Olšák --- Is this really correct? I have no easy way to test it. drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 1 file changed, 4 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index 8d03baa..56c41cf 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -3018,20 +3018,24 @@ static int amdgpu_reset_sriov(struct amdgpu_device *adev, uint64_t *reset_flags, * * Attempt to reset the GPU if it has hung (all asics). * Returns 0 for success or an error on failure. */ int amdgpu_gpu_recover(struct amdgpu_device *adev, struct amdgpu_job *job) { struct drm_atomic_state *state = NULL; uint64_t reset_flags = 0; int i, r, resched; + /* amdgpu.lockup_timeout=0 disables GPU reset. */ + if (amdgpu_lockup_timeout == 0) + return 0; + if (!amdgpu_check_soft_reset(adev)) { DRM_INFO("No hardware hang detected. Did some blocks stall?\n"); return 0; } dev_info(adev->dev, "GPU reset begin!\n"); mutex_lock(&adev->lock_reset); atomic_inc(&adev->gpu_reset_counter); adev->in_gpu_reset = 1; -- 2.7.4 ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx