On Mon, Jan 30, 2023 at 09:48:31AM +0100, Andi Shyti wrote:
> Hi GG,
> 
> thanks for the deep analysis!
> 
> > Hi Andi,
> > In the call flow invoked by intel_pcode_init(), I've added brief comments
> > where further clarification is needed in this scenario, and a description of
> > the suspicious scenario at the bottom.
> > 
> > -------------------------------------------------------------------------
> > intel_pcode_init()
> >  |
> >  +-> skl_pcode_request(uncore, DG1_PCODE_STATUS,
> >                        DG1_UNCORE_GET_INIT_STATUS,
> >                        DG1_UNCORE_INIT_STATUS_COMPLETE,
> >                        DG1_UNCORE_INIT_STATUS_COMPLETE, 180000);
> >        |
> >        +-> skl_pcode_try_request()
> >              |
> >              +->  *status = __snb_pcode_rw(uncore, mbox, &request, NULL,
> >                                            500, 0, true);
> > 
> > -------------------------------------------------------------------------
> > static int __snb_pcode_rw(struct intel_uncore *uncore, u32 mbox,
> >                       u32 *val, u32 *val1,
> >                       int fast_timeout_us, int slow_timeout_ms,
> >                       bool is_read)
> > {
> > ...
> >         /* Before writing a value to the GEN6_PCODE_DATA register,
> >            check if the bit in the GEN6_PCODE_MAILBOX register indicates
> >            BUSY. */
> >     if (intel_uncore_read_fw(uncore, GEN6_PCODE_MAILBOX) & GEN6_PCODE_READY)
> >             return -EAGAIN;
> 
> what if we fail here because the punit is still initializing and
> will be ready, say, in 10 seconds?
> 
> GG, without going any further, we fail here! The -EAGAIN we
> receive from the test comes from this point. We don't fail with
> -ETIMEDOUT, but with -EAGAIN and the reason is because the punit
> is not ready to perform the very fist communication and we fail
> the probing.
> 
> It doesn't mean, though, that there is anything wrong, we just
> need to wait a bit before "taking drastic decisions"!
> 
> This patch is suggesting to wait up to 10s for the punit to be
> ready and eventually try to probe again... and, indeed, it works!

As GG, what I still don't understand is how this extra 10 seconds
wait helps... have you tried to simple add the 10 to the 180 and
make the code 190 sec instead?

> 
> Andi
> 
> > 
> >         /* write value to GEN6_PCODE_DATA register */
> >     intel_uncore_write_fw(uncore, GEN6_PCODE_DATA, *val);
> > 
> >     intel_uncore_write_fw(uncore, GEN6_PCODE_DATA1, val1 ? *val1 : 0);
> > 
> >         /* In this scenario, the value
> >            "DG1_PCODE_STATUS | GEN6_PCODE_READY"
> >            is written to the GEN6_PCODE_MAILBOX register,
> >            so that the Busy status of the GEN6_PCODE_MAILBOX register
> >            can be checked later.
> >            (When the value of the GEN6_PCODE_READY bit of the
> >             GEN6_PCODE_MAILBOX register changes to 0, the operation can
> >             be considered completed.) */
> >     intel_uncore_write_fw(uncore,
> >                           GEN6_PCODE_MAILBOX, GEN6_PCODE_READY | mbox);
> > 
> >         /* In this scenario, verify that the BUSY status bit in the
> >            GEN6_PCODE_MAILBOX register turns off for up to 500us. */
> >     if (__intel_wait_for_register_fw(uncore,
> >                                      GEN6_PCODE_MAILBOX,
> >                                      GEN6_PCODE_READY, 0,
> >                                      fast_timeout_us,
> >                                      slow_timeout_ms,
> >                                      &mbox))
> >             return -ETIMEDOUT;
> >         /* If there is a failure here, it may be considered that the
> >            "DG1_PCODE_STATUS | GEN6_PCODE_READY" operation was not
> >            completed within 500us */
> > ...
> > }
> > 
> > int skl_pcode_request(struct intel_uncore *uncore, u32 mbox, u32 request,
> >                   u32 reply_mask, u32 reply, int timeout_base_ms)
> > {
> >     u32 status;
> >     int ret;
> > 
> >     mutex_lock(&uncore->i915->sb_lock);
> > 
> > #define COND \
> >     skl_pcode_try_request(uncore, mbox, request, reply_mask, reply, &status)
> > 
> >         /* the first trial for skl_pcode_try_request() can return
> >            -EAGAIN or -ETIMEDOUT. And the code did not check the error
> >            code here, so we don't know how far the __snb_pcode_rw()
> >            function went. It is not known whether the pcode_mailbox
> >            status was busy before writing the value to the
> >            GEN6_PCODE_DATA register or after.*/
> >     if (COND) {
> >             ret = 0;
> >             goto out;
> >     }
> > 
> >         /* In this scenario, skl_pcode_try_request() is invoked every
> >            10us for 180 seconds. When skl_pcode_try_request() returns
> >            -EAGAIN and -ETIMEDOUT by _wait_for(),
> >            -ETIMEDOUT is returned to a variable ret. */
> > 
> >     ret = _wait_for(COND, timeout_base_ms * 1000, 10, 10);
> > 
> >     if (!ret)
> >             goto out;
> > 
> >     /*
> >      * The above can time out if the number of requests was low (2 in the
> >      * worst case) _and_ PCODE was busy for some reason even after a
> >      * (queued) request and @timeout_base_ms delay. As a workaround retry
> >      * the poll with preemption disabled to maximize the number of
> >      * requests. Increase the timeout from @timeout_base_ms to 50ms to
> >      * account for interrupts that could reduce the number of these
> >      * requests, and for any quirks of the PCODE firmware that delays
> >      * the request completion.
> >      */
> >     drm_dbg_kms(&uncore->i915->drm,
> >                 "PCODE timeout, retrying with preemption disabled\n");
> >     drm_WARN_ON_ONCE(&uncore->i915->drm, timeout_base_ms > 3);
> >     preempt_disable();
> >     ret = wait_for_atomic(COND, 50);
> > 
> >     preempt_enable();
> > 
> > out:
> >     mutex_unlock(&uncore->i915->sb_lock);
> >     return status ? status : ret;
> > #undef COND
> > }
> > 
> > ---------------------------------------------------------
> > 
> > If you try skl_pcode_try_request() for 180 seconds in skl_pcode_request(),
> > and the first "intel_uncore_read_fw(uncore, GEN6_PCODE_MAILBOX) &
> > GEN6_PCODE_READY)" call in __snb_pcode_rw() that skl_pcode_try_request()
> > invokes always fails. if then it does not make sense to me why this patch
> > fixes it by just waiting 10 seconds.This is because if it was called with
> > the flow described above, 180 seconds is longer than 10 seconds, so the
> > scenario you mentioned is also covered in the existing code.
> > 
> > To describe in more detail the second scenario I previously commented on:
> > skl_pcode_request() tries skl_pcode_try_request() for 180 seconds
> >  1) In skl_pcode_try_request(), the GEN6_PCODE_MAILBOX register bit is not
> > BUSY, so the value is written to the GEN6_PCODE_DATA1 register.
> >  2) skl_pcode_try_request() returns -ETIMEDOUT value because the operation
> > of 1) does not complete within 500us.
> >  3) Scenario in which the GEN6_PCODE_MAILBOX register bit is checked as BUSY
> > and returns -EAGAIN in the last call of skl_pcode_try_request() invoked by
> > skl_pcode_request()
> > 
> > If the reason why this problem occurred is because of this scenario,
> > shouldn't there be an attempt to increase fast_timeout_us used as an
> > argument of __snb_pcode_rw() to 500us or more when skl_pcode_try_request()
> > returns -ETIMEDOUT?
> > 
> > Br,
> > G.G.

Reply via email to