[PATCH 1/3] drm/i915/guc: Temporarily bump the GuC load timeout

2021-12-21 John C. Harrison
From: John Harrison 

There is a known (but exceedingly unlikely) race condition where the
asynchronous frequency management code could reduce the GT clock while
a GuC reload is in progress (during a full GT reset). A fix is in
progress, but there are complex locking issues to be resolved. In the
meantime, bump the timeout to 200ms. Even at the slowest clock, this
should be sufficient, and in the working case a larger timeout makes
no difference.

Signed-off-by: John Harrison 
Reviewed-by: Matthew Brost 
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_fw.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fw.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_fw.c
index 31420ce1ce6b..d09c205b2beb 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fw.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fw.c
@@ -105,12 +105,21 @@ static int guc_wait_ucode(struct intel_uncore *uncore)
 	/*
 	 * Wait for the GuC to start up.
 	 * NB: Docs recommend not using the interrupt for completion.
-	 * Measurements indicate this should take no more than 20ms, so a
+	 * Measurements indicate this should take no more than 20ms
+	 * (assuming the GT clock is at maximum frequency). So, a
 	 * timeout here indicates that the GuC has failed and is unusable.
 	 * (Higher levels of the driver may decide to reset the GuC and
 	 * attempt the ucode load again if this happens.)
+	 *
+	 * FIXME: There is a known (but exceedingly unlikely) race condition
+	 * where the asynchronous frequency management code could reduce
+	 * the GT clock while a GuC reload is in progress (during a full
+	 * GT reset). A fix is in progress but there are complex locking
+	 * issues to be resolved. In the meantime bump the timeout to
+	 * 200ms. Even at slowest clock, this should be sufficient. And
+	 * in the working case, a larger timeout makes no difference.
 	 */
-	ret = wait_for(guc_ready(uncore, &status), 100);
+	ret = wait_for(guc_ready(uncore, &status), 200);
 	if (ret) {
 		struct drm_device *drm = &uncore->i915->drm;
 
-- 
2.25.1
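The timeout sizing in the patch is simple arithmetic on the 20ms figure. The sketch below assumes, which the patch itself does not state, that the GuC load time scales inversely with the GT clock. Under that assumption it computes how much slower than maximum the clock may run before a given timeout is exceeded. The helper names max_tolerated_ratio() and scaled_load_ms() are hypothetical and are not part of i915.

```c
#include <assert.h>

/*
 * ASSUMPTION (not from the patch): load time scales inversely with the
 * GT clock, so a load taking base_ms at max_mhz takes roughly
 * base_ms * max_mhz / min_mhz when the clock drops to min_mhz.
 */

/* Largest max/min clock ratio that timeout_ms still covers. */
static unsigned int max_tolerated_ratio(unsigned int base_ms,
					unsigned int timeout_ms)
{
	assert(base_ms > 0);
	return timeout_ms / base_ms;
}

/* Estimated load time at min_mhz, rounded up so it never under-reports. */
static unsigned int scaled_load_ms(unsigned int base_ms,
				   unsigned int max_mhz, unsigned int min_mhz)
{
	assert(min_mhz > 0 && min_mhz <= max_mhz);
	return (base_ms * max_mhz + min_mhz - 1) / min_mhz;
}
```

With the 20ms baseline, the old 100ms timeout covers a clock up to 5x slower than maximum. The 200ms timeout in this version covers 10x, and the 500ms timeout in the earlier version covers 25x.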



Re: [Intel-gfx] [PATCH 1/3] drm/i915/guc: Temporarily bump the GuC load timeout

2021-12-20 Matthew Brost
On Mon, Dec 20, 2021 at 04:52:19PM -0800, john.c.harri...@intel.com wrote:
> From: John Harrison 
> 
> There is a known (but exceedingly unlikely) race condition where the
> asynchronous frequency management code could reduce the GT clock while
> a GuC reload is in progress (during a full GT reset). A fix is in
> progress, but there are complex locking issues to be resolved. In the
> meantime, bump the timeout to 500ms. Even at the slowest clock, this
> should be sufficient, and in the working case a larger timeout makes
> no difference.
> 
> Signed-off-by: John Harrison 

Any idea of the ETA for the proper fix? Also, if the proper fix makes the
locking more complicated, I'm probably of the opinion that we just live
with a longer timeout, as full GT resets shouldn't really ever happen in
practice, and if they take a longer time, so be it.

Anyways for this patch:
Reviewed-by: Matthew Brost 

> ---
>  drivers/gpu/drm/i915/gt/uc/intel_guc_fw.c | 13 +++++++++++--
>  1 file changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fw.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_fw.c
> index 31420ce1ce6b..c03bde5ec61f 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fw.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fw.c
> @@ -105,12 +105,21 @@ static int guc_wait_ucode(struct intel_uncore *uncore)
>  	/*
>  	 * Wait for the GuC to start up.
>  	 * NB: Docs recommend not using the interrupt for completion.
> -	 * Measurements indicate this should take no more than 20ms, so a
> +	 * Measurements indicate this should take no more than 20ms
> +	 * (assuming the GT clock is at maximum frequency). So, a
>  	 * timeout here indicates that the GuC has failed and is unusable.
>  	 * (Higher levels of the driver may decide to reset the GuC and
>  	 * attempt the ucode load again if this happens.)
> +	 *
> +	 * FIXME: There is a known (but exceedingly unlikely) race condition
> +	 * where the asynchronous frequency management code could reduce
> +	 * the GT clock while a GuC reload is in progress (during a full
> +	 * GT reset). A fix is in progress but there are complex locking
> +	 * issues to be resolved. In the meantime bump the timeout to
> +	 * 500ms. Even at slowest clock, this should be sufficient. And
> +	 * in the working case, a larger timeout makes no difference.
>  	 */
> -	ret = wait_for(guc_ready(uncore, &status), 100);
> +	ret = wait_for(guc_ready(uncore, &status), 500);
>  	if (ret) {
>  		struct drm_device *drm = &uncore->i915->drm;
>  
> -- 
> 2.25.1
> 
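The claim that a larger timeout "makes no difference" in the working case follows from the load wait being a poll. The poll returns as soon as the GuC reports ready, so the timeout only bounds how long a failed load is waited on. The sketch below is deterministic and uses simulated 1ms ticks instead of real time. poll_until_ready() and fw_ready_after() are hypothetical names, and this is not the i915 wait_for() implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Readiness predicate: ctx is caller data, now_ms is simulated elapsed time. */
typedef bool (*ready_fn)(void *ctx, unsigned int now_ms);

/*
 * Hypothetical poll-until-ready helper (NOT the i915 wait_for() macro).
 * Time advances in simulated 1ms ticks; a real helper would read a clock
 * and sleep between checks. Returns 0 as soon as ready() is true, or
 * -ETIMEDOUT after timeout_ms. *elapsed_ms reports when the loop stopped.
 */
static int poll_until_ready(ready_fn ready, void *ctx,
			    unsigned int timeout_ms, unsigned int *elapsed_ms)
{
	assert(ready != NULL && elapsed_ms != NULL);

	for (unsigned int t = 0; t <= timeout_ms; t++) {
		if (ready(ctx, t)) {
			/* Returns immediately; a larger timeout_ms changes nothing here. */
			*elapsed_ms = t;
			return 0;
		}
		/* A real implementation would sleep here before re-checking. */
	}
	*elapsed_ms = timeout_ms;
	return -ETIMEDOUT;	/* only a failed (or too slow) load waits this long */
}

/* Simulated firmware: ready once *(unsigned int *)ctx ms have elapsed. */
static bool fw_ready_after(void *ctx, unsigned int now_ms)
{
	return now_ms >= *(const unsigned int *)ctx;
}
```

With a simulated 20ms load, the poll finishes at 20ms whether the timeout is 100ms or 500ms. Only a load slower than the timeout, such as 150ms against 100ms, changes the outcome.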


[PATCH 1/3] drm/i915/guc: Temporarily bump the GuC load timeout

2021-12-20 John C. Harrison
From: John Harrison 

There is a known (but exceedingly unlikely) race condition where the
asynchronous frequency management code could reduce the GT clock while
a GuC reload is in progress (during a full GT reset). A fix is in
progress, but there are complex locking issues to be resolved. In the
meantime, bump the timeout to 500ms. Even at the slowest clock, this
should be sufficient, and in the working case a larger timeout makes
no difference.

Signed-off-by: John Harrison 
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_fw.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fw.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_fw.c
index 31420ce1ce6b..c03bde5ec61f 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fw.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fw.c
@@ -105,12 +105,21 @@ static int guc_wait_ucode(struct intel_uncore *uncore)
 	/*
 	 * Wait for the GuC to start up.
 	 * NB: Docs recommend not using the interrupt for completion.
-	 * Measurements indicate this should take no more than 20ms, so a
+	 * Measurements indicate this should take no more than 20ms
+	 * (assuming the GT clock is at maximum frequency). So, a
 	 * timeout here indicates that the GuC has failed and is unusable.
 	 * (Higher levels of the driver may decide to reset the GuC and
 	 * attempt the ucode load again if this happens.)
+	 *
+	 * FIXME: There is a known (but exceedingly unlikely) race condition
+	 * where the asynchronous frequency management code could reduce
+	 * the GT clock while a GuC reload is in progress (during a full
+	 * GT reset). A fix is in progress but there are complex locking
+	 * issues to be resolved. In the meantime bump the timeout to
+	 * 500ms. Even at slowest clock, this should be sufficient. And
+	 * in the working case, a larger timeout makes no difference.
 	 */
-	ret = wait_for(guc_ready(uncore, &status), 100);
+	ret = wait_for(guc_ready(uncore, &status), 500);
 	if (ret) {
 		struct drm_device *drm = &uncore->i915->drm;
 
-- 
2.25.1