On Fri, Sep 27, 2024 at 02:59:09PM -0700, Bjorn Andersson wrote:
> On Sat, Sep 28, 2024 at 01:07:43AM +0530, Mukesh Ojha wrote:
> > On Wed, Sep 25, 2024 at 08:41:55PM -0700, Bjorn Andersson wrote:
> > > On Wed, Sep 25, 2024 at 04:03:51PM +0530, Mukesh Ojha wrote:
> > > > Multiple call to glink_subdev_stop() for the same remoteproc can happen
> > > > if rproc_stop() fails from Process-A that leaves the rproc state to
> > > > RPROC_CRASHED state later a call to recovery_store from user space in
> > > > Process B triggers rproc_trigger_recovery() of the same remoteproc to
> > > > recover it results in NULL pointer dereference issue in
> > > > qcom_glink_smem_unregister().
> > > > 
> > > > Fix it by having a NULL check in glink_subdev_stop().
> > > > 
> > > >         Process-A                                       Process-B
> > > > 
> > > >   fatal error interrupt happens
> > > > 
> > > >   rproc_crash_handler_work()
> > > >     mutex_lock_interruptible(&rproc->lock);
> > > >     ...
> > > > 
> > > >        rproc->state = RPROC_CRASHED;
> > > >     ...
> > > >     mutex_unlock(&rproc->lock);
> > > > 
> > > >     rproc_trigger_recovery()
> > > >      mutex_lock_interruptible(&rproc->lock);
> > > > 
> > > >       adsp_stop()
> > > >       qcom_q6v5_pas 20c00000.remoteproc: failed to shutdown: -22
> > > >       remoteproc remoteproc3: can't stop rproc: -22
> > > 
> > > I presume that at this point this remoteproc is in some undefined state
> > > and the only way to recover is for the user to reboot the machine?
> > 
> > Here, 50+ (5s) retry of scm shutdown is failing during decryption of
> > remote processor memory region, and i don't think, it is anyway to do
> > with remote processor state here, as a best effort more number of
> > retries can be tried instead of 50 or wait for some other recovery
> > command like recovery_store() to let it do the retry again from
> > beginning.
> > 
> 
> But are you saying that retrying a bit later would allow us to get out
> of this problem? If we just didn't hit the NULL pointer(s)?

I am not sure whether adding more number of retries will solve the issue
and initially thinking from perspective that, it is better to retry than
to leave the remoteproc in some unknown state however, I do get that
letting it retry could give unnecessary patching every code e.g., ssr
notifier callbacks, which is not expecting to be called twice as a
side-effect of remoteproc unknown state.
> 
> How long are we talking about here? Is the timeout too short?

5sec is very significant amount of time in wait for remote processor to
get recovered, we should not change this.
> 
> > > 
> > > 
> > > The check for glink->edge avoids one pitfall following this, but I'd
> > > prefer to see a solution that avoids issues in this scenario in the
> > > remoteproc core - rather than working around side effects of this in
> > > different places.
> > 
> > Handling in a remoteproc core means we may need another state something
> > like "RPROC_UNKNOWN" which can be kept after one attempt of recovery
> > failure and checking the same during another try return immediately with
> > some log message.
> > 
> 
> Yes, if we are failing to shut down the remoteproc and there's no way
> for us to reliably recover from that (e.g. we are not able to reclaim
> the memory), it seems reasonable that we have to mark it using a new
> state.
> 
> If that is the case, I'd call it RPROC_DEFUNCT (or something like that
> instead), because while in some "unknown" state, from a remoteproc
> framework's point of view, it's in a well known (broken) state.

Ack.

-Mukesh
> 
> Regards,
> Bjorn

Reply via email to