Re: [PATCH 2/2] drm/i915: Don't treat FLR resets as errors

2024-05-22 Thread Nirmoy Das

Hi Andi,

On 5/21/2024 12:56 PM, Andi Shyti wrote:

Hi Nirmoy,

On Fri, May 17, 2024 at 10:13:37PM +0200, Nirmoy Das wrote:

Hi Andi,

On 5/17/2024 9:34 PM, Andi Shyti wrote:

 Hi Nirmoy,

 On Fri, May 17, 2024 at 04:00:02PM +0200, Nirmoy Das wrote:

 On 5/17/2024 1:25 PM, Andi Shyti wrote:

 If we timeout while waiting for an FLR reset, there is nothing we
 can do and i915 doesn't have any control on it. In any case the
 system is still perfectly usable

 If a FLR reset fails then we will have a dead GPU, I don't think the 
GPU is
 usable without a cold reboot.

 fact is that the GPU keeps going and even though the timeout has
 expired, the system moves to the next phase.

The current test might look like it is has passed, but if you look into the
subsequent tests you can see a dead GPU:

<7>[  369.168121] pci :00:02.0: [drm:intel_uncore_fini_mmio [i915]] 
Triggering Driver-FLR
<3>[  372.170189] pci :00:02.0: [drm] *ERROR* Driver-FLR-teardown wait 
completion failed! -110
<7>[  372.437630] [IGT] i915_selftest: finished subtest requests, SUCCESS
<7>[  372.438356] [IGT] i915_selftest: starting dynamic subtest migrate
<5>[  373.110580] Setting dangerous option live_selftests - tainting kernel
<3>[  373.183499] i915 :00:02.0: Unable to change power state from D0 to 
D0, device inaccessible
<3>[  373.246921] i915 :00:02.0: [drm] *ERROR* Unrecognized display IP 
version 1023.255; disabling display.
<7>[  373.247130] i915 :00:02.0: [drm:intel_step_init [i915]] Using future 
steppings
<7>[  373.247716] i915 :00:02.0: [drm:intel_step_init [i915]] Using future 
steppings
<7>[  373.248263] i915 :00:02.0: [drm:intel_step_init [i915]] Using future 
display steppings
<7>[  373.251843] i915 :00:02.0: [drm:intel_gt_common_init_early [i915]] 
WOPCM: 2048K
<7>[  373.252505] i915 :00:02.0: [drm:intel_uc_init_early [i915]] GT0: 
enable_guc=3 (guc:yes submission:yes huc:no slpc:yes)
<7>[  373.253140] i915 :00:02.0: [drm:intel_gt_probe_all [i915]] GT0: 
Setting up Primary GT
<7>[  373.253556] i915 :00:02.0: [drm:intel_gt_probe_all [i915]] GT1: 
Setting up Standalone Media GT
<7>[  373.253941] i915 :00:02.0: [drm:intel_gt_common_init_early [i915]] 
WOPCM: 2048K
<7>[  373.254365] i915 :00:02.0: [drm:intel_uc_init_early [i915]] GT1: 
enable_guc=3 (guc:yes submission:yes huc:yes slpc:yes)
<3>[  375.256235] i915 :00:02.0: [drm] *ERROR* Device is non-operational; 
MMIO access returns 0x!
<3>[  375.259089] i915 :00:02.0: Device initialization failed (-5)
<3>[  375.260521] i915 :00:02.0: probe with driver i915 failed with error -5
<7>[  375.392209] [IGT] i915_selftest: finished subtest migrate, FAIL

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_14724/bat-arls-3/dmesg0.txt

Are we sure this is dependent on the FLR reset?


Yes, while on FLR read into memory will return either 0/F.



  There are cases
when the FLR reset doesn't make any difference and in any case
this error is completely ignored by the driver.


This happens at very late with no recovery possible and hope is module  
reload works.





Perhaps we can change it to a warning?


I think it should be error. CI will still complain even on warning.





 This is a serious issue and should be report as an error.  I think we 
need
 to create a HW ticket to understand

 why is FLR reset fails.

 Maybe it takes longer and longer to reset. We've been sending
 several patches in the latest years to fix the timings.

HW spec says 3 sec but we can try increasing it bit higher to try it out.

We could go, then, with just patch 1 and see if it improves.


Does it help ? If helps then we can go ahead with increased timeout.



  Also
because, the FLR reset might also depend on the firmware.


Possible. In that case we should wait for firmware fix ?


Regards,

Nirmoy



Thanks, Nirmoy,
Andi


Re: [PATCH 2/2] drm/i915: Don't treat FLR resets as errors

2024-05-21 Thread Andi Shyti
Hi Nirmoy,

On Fri, May 17, 2024 at 10:13:37PM +0200, Nirmoy Das wrote:
> Hi Andi,
> 
> On 5/17/2024 9:34 PM, Andi Shyti wrote:
> 
> Hi Nirmoy,
> 
> On Fri, May 17, 2024 at 04:00:02PM +0200, Nirmoy Das wrote:
> 
> On 5/17/2024 1:25 PM, Andi Shyti wrote:
> 
> If we timeout while waiting for an FLR reset, there is nothing we
> can do and i915 doesn't have any control on it. In any case the
> system is still perfectly usable
> 
> If a FLR reset fails then we will have a dead GPU, I don't think the 
> GPU is
> usable without a cold reboot.
> 
> fact is that the GPU keeps going and even though the timeout has
> expired, the system moves to the next phase.
> 
> The current test might look like it is has passed, but if you look into the
> subsequent tests you can see a dead GPU:
> 
> <7>[  369.168121] pci :00:02.0: [drm:intel_uncore_fini_mmio [i915]] 
> Triggering Driver-FLR
> <3>[  372.170189] pci :00:02.0: [drm] *ERROR* Driver-FLR-teardown wait 
> completion failed! -110
> <7>[  372.437630] [IGT] i915_selftest: finished subtest requests, SUCCESS
> <7>[  372.438356] [IGT] i915_selftest: starting dynamic subtest migrate
> <5>[  373.110580] Setting dangerous option live_selftests - tainting kernel
> <3>[  373.183499] i915 :00:02.0: Unable to change power state from D0 to 
> D0, device inaccessible
> <3>[  373.246921] i915 :00:02.0: [drm] *ERROR* Unrecognized display IP 
> version 1023.255; disabling display.
> <7>[  373.247130] i915 :00:02.0: [drm:intel_step_init [i915]] Using 
> future steppings
> <7>[  373.247716] i915 :00:02.0: [drm:intel_step_init [i915]] Using 
> future steppings
> <7>[  373.248263] i915 :00:02.0: [drm:intel_step_init [i915]] Using 
> future display steppings
> <7>[  373.251843] i915 :00:02.0: [drm:intel_gt_common_init_early [i915]] 
> WOPCM: 2048K
> <7>[  373.252505] i915 :00:02.0: [drm:intel_uc_init_early [i915]] GT0: 
> enable_guc=3 (guc:yes submission:yes huc:no slpc:yes)
> <7>[  373.253140] i915 :00:02.0: [drm:intel_gt_probe_all [i915]] GT0: 
> Setting up Primary GT
> <7>[  373.253556] i915 :00:02.0: [drm:intel_gt_probe_all [i915]] GT1: 
> Setting up Standalone Media GT
> <7>[  373.253941] i915 :00:02.0: [drm:intel_gt_common_init_early [i915]] 
> WOPCM: 2048K
> <7>[  373.254365] i915 :00:02.0: [drm:intel_uc_init_early [i915]] GT1: 
> enable_guc=3 (guc:yes submission:yes huc:yes slpc:yes)
> <3>[  375.256235] i915 :00:02.0: [drm] *ERROR* Device is non-operational; 
> MMIO access returns 0x!
> <3>[  375.259089] i915 :00:02.0: Device initialization failed (-5)
> <3>[  375.260521] i915 :00:02.0: probe with driver i915 failed with error 
> -5
> <7>[  375.392209] [IGT] i915_selftest: finished subtest migrate, FAIL
> 
> https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_14724/bat-arls-3/dmesg0.txt

Are we sure this is dependent on the FLR reset? There are cases
when the FLR reset doesn't make any difference and in any case
this error is completely ignored by the driver.

Perhaps we can change it to a warning?

> This is a serious issue and should be report as an error.  I think we 
> need
> to create a HW ticket to understand
> 
> why is FLR reset fails.
> 
> Maybe it takes longer and longer to reset. We've been sending
> several patches in the latest years to fix the timings.
> 
> HW spec says 3 sec but we can try increasing it bit higher to try it out.

We could go, then, with just patch 1 and see if it improves. Also
because, the FLR reset might also depend on the firmware.

Thanks, Nirmoy,
Andi


Re: [PATCH 2/2] drm/i915: Don't treat FLR resets as errors

2024-05-17 Thread Nirmoy Das

Hi Andi,

On 5/17/2024 9:34 PM, Andi Shyti wrote:

Hi Nirmoy,

On Fri, May 17, 2024 at 04:00:02PM +0200, Nirmoy Das wrote:

On 5/17/2024 1:25 PM, Andi Shyti wrote:

If we timeout while waiting for an FLR reset, there is nothing we
can do and i915 doesn't have any control on it. In any case the
system is still perfectly usable

If a FLR reset fails then we will have a dead GPU, I don't think the GPU is
usable without a cold reboot.

fact is that the GPU keeps going and even though the timeout has
expired, the system moves to the next phase.
The current test might look like it is has passed, but if you look into 
the subsequent tests you can see a dead GPU:


<7>[  369.168121] pci :00:02.0: [drm:intel_uncore_fini_mmio [i915]] 
Triggering Driver-FLR
*<3>[ 372.170189] pci :00:02.0: [drm] *ERROR* Driver-FLR-teardown 
wait completion failed! -110*

*<7>[ 372.437630] [IGT] i915_selftest: finished subtest requests, SUCCESS*
<7>[  372.438356] [IGT] i915_selftest: starting dynamic subtest migrate
<5>[  373.110580] Setting dangerous option live_selftests - tainting kernel
<3>[  373.183499] i915 :00:02.0: Unable to change power state from D0 to 
D0, device inaccessible
<3>[  373.246921] i915 :00:02.0: [drm] *ERROR* Unrecognized display IP 
version 1023.255; disabling display.
<7>[  373.247130] i915 :00:02.0: [drm:intel_step_init [i915]] Using future 
steppings
<7>[  373.247716] i915 :00:02.0: [drm:intel_step_init [i915]] Using future 
steppings
<7>[  373.248263] i915 :00:02.0: [drm:intel_step_init [i915]] Using future 
display steppings
<7>[  373.251843] i915 :00:02.0: [drm:intel_gt_common_init_early [i915]] 
WOPCM: 2048K
<7>[  373.252505] i915 :00:02.0: [drm:intel_uc_init_early [i915]] GT0: 
enable_guc=3 (guc:yes submission:yes huc:no slpc:yes)
<7>[  373.253140] i915 :00:02.0: [drm:intel_gt_probe_all [i915]] GT0: 
Setting up Primary GT
<7>[  373.253556] i915 :00:02.0: [drm:intel_gt_probe_all [i915]] GT1: 
Setting up Standalone Media GT
<7>[  373.253941] i915 :00:02.0: [drm:intel_gt_common_init_early [i915]] 
WOPCM: 2048K
<7>[  373.254365] i915 :00:02.0: [drm:intel_uc_init_early [i915]] GT1: 
enable_guc=3 (guc:yes submission:yes huc:yes slpc:yes)
*<3>[ 375.256235] i915 :00:02.0: [drm] *ERROR* Device is 
non-operational; MMIO access returns 0x!*

<3>[  375.259089] i915 :00:02.0: Device initialization failed (-5)
<3>[  375.260521] i915 :00:02.0: probe with driver i915 failed with error -5
<7>[  375.392209] [IGT] i915_selftest: finished subtest migrate, FAIL

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_14724/bat-arls-3/dmesg0.txt




This is a serious issue and should be report as an error.  I think we need
to create a HW ticket to understand

why is FLR reset fails.

Maybe it takes longer and longer to reset. We've been sending
several patches in the latest years to fix the timings.


HW spec says 3 sec but we can try increasing it bit higher to try it out.


Regards,

Nirmoy



Andi

Re: [PATCH 2/2] drm/i915: Don't treat FLR resets as errors

2024-05-17 Thread Andi Shyti
Hi Nirmoy,

On Fri, May 17, 2024 at 04:00:02PM +0200, Nirmoy Das wrote:
> On 5/17/2024 1:25 PM, Andi Shyti wrote:
> > If we timeout while waiting for an FLR reset, there is nothing we
> > can do and i915 doesn't have any control on it. In any case the
> > system is still perfectly usable
> 
> If a FLR reset fails then we will have a dead GPU, I don't think the GPU is
> usable without a cold reboot.

fact is that the GPU keeps going and even though the timeout has
expired, the system moves to the next phase.

> This is a serious issue and should be report as an error.  I think we need
> to create a HW ticket to understand
> 
> why is FLR reset fails.

Maybe it takes longer and longer to reset. We've been sending
several patches in the latest years to fix the timings.

Andi


Re: [PATCH 2/2] drm/i915: Don't treat FLR resets as errors

2024-05-17 Thread Nirmoy Das

Hi Andi,

On 5/17/2024 1:25 PM, Andi Shyti wrote:

If we timeout while waiting for an FLR reset, there is nothing we
can do and i915 doesn't have any control on it. In any case the
system is still perfectly usable


If a FLR reset fails then we will have a dead GPU, I don't think the GPU 
is usable without a cold reboot.


This is a serious issue and should be report as an error.  I think we 
need to create a HW ticket to understand


why is FLR reset fails.


Regards,

Nirmoy




  and the function returns void.

We don't need to be alarmed, therefore, print the timeout
expiration as a debug message instead of an error.

Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/10955
Signed-off-by: Andi Shyti 
---
  drivers/gpu/drm/i915/intel_uncore.c | 6 +++---
  1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/intel_uncore.c 
b/drivers/gpu/drm/i915/intel_uncore.c
index 2eba289d88ad..a3fa2ed91aae 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -2637,7 +2637,7 @@ static void driver_initiated_flr(struct intel_uncore 
*uncore)
 */
ret = intel_wait_for_register_fw(uncore, GU_CNTL, DRIVERFLR, 0, 
flr_timeout_ms);
if (ret) {
-   drm_err(>drm,
+   drm_dbg(>drm,
"Failed to wait for Driver-FLR bit to clear! %d\n",
ret);
return;
@@ -2652,7 +2652,7 @@ static void driver_initiated_flr(struct intel_uncore 
*uncore)
 DRIVERFLR, 0,
 flr_timeout_ms);
if (ret) {
-   drm_err(>drm, "Driver-FLR-teardown wait completion failed! 
%d\n", ret);
+   drm_dbg(>drm, "Driver-FLR-teardown wait completion failed! 
%d\n", ret);
return;
}
  
@@ -2661,7 +2661,7 @@ static void driver_initiated_flr(struct intel_uncore *uncore)

 DRIVERFLR_STATUS, DRIVERFLR_STATUS,
 flr_timeout_ms);
if (ret) {
-   drm_err(>drm, "Driver-FLR-reinit wait completion failed! 
%d\n", ret);
+   drm_dbg(>drm, "Driver-FLR-reinit wait completion failed! 
%d\n", ret);
return;
}
  


[PATCH 2/2] drm/i915: Don't treat FLR resets as errors

2024-05-17 Thread Andi Shyti
If we timeout while waiting for an FLR reset, there is nothing we
can do and i915 doesn't have any control on it. In any case the
system is still perfectly usable and the function returns void.

We don't need to be alarmed, therefore, print the timeout
expiration as a debug message instead of an error.

Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/10955
Signed-off-by: Andi Shyti 
---
 drivers/gpu/drm/i915/intel_uncore.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/intel_uncore.c 
b/drivers/gpu/drm/i915/intel_uncore.c
index 2eba289d88ad..a3fa2ed91aae 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -2637,7 +2637,7 @@ static void driver_initiated_flr(struct intel_uncore 
*uncore)
 */
ret = intel_wait_for_register_fw(uncore, GU_CNTL, DRIVERFLR, 0, 
flr_timeout_ms);
if (ret) {
-   drm_err(>drm,
+   drm_dbg(>drm,
"Failed to wait for Driver-FLR bit to clear! %d\n",
ret);
return;
@@ -2652,7 +2652,7 @@ static void driver_initiated_flr(struct intel_uncore 
*uncore)
 DRIVERFLR, 0,
 flr_timeout_ms);
if (ret) {
-   drm_err(>drm, "Driver-FLR-teardown wait completion 
failed! %d\n", ret);
+   drm_dbg(>drm, "Driver-FLR-teardown wait completion 
failed! %d\n", ret);
return;
}
 
@@ -2661,7 +2661,7 @@ static void driver_initiated_flr(struct intel_uncore 
*uncore)
 DRIVERFLR_STATUS, DRIVERFLR_STATUS,
 flr_timeout_ms);
if (ret) {
-   drm_err(>drm, "Driver-FLR-reinit wait completion failed! 
%d\n", ret);
+   drm_dbg(>drm, "Driver-FLR-reinit wait completion failed! 
%d\n", ret);
return;
}
 
-- 
2.43.0