On Tue, Oct 01, 2024 at 02:15:52PM -0700, Bjorn Andersson wrote:
> On Tue, Oct 01, 2024 at 12:06:17PM +0530, Mukesh Ojha wrote:
> > On Fri, Sep 27, 2024 at 02:59:09PM -0700, Bjorn Andersson wrote:
> > > On Sat, Sep 28, 2024 at 01:07:43AM +0530, Mukesh Ojha wrote:
> > > > On Wed, Sep 25, 2024 at 08:41:55PM -0700, Bjorn Andersson wrote:
> > > > > On Wed, Sep 25, 2024 at 04:03:51PM +0530, Mukesh Ojha wrote:
> > > > > > Multiple calls to glink_subdev_stop() for the same remoteproc can
> > > > > > happen if rproc_stop() fails in Process-A, leaving the rproc in the
> > > > > > RPROC_CRASHED state. A later write to recovery_store from user
> > > > > > space in Process-B then triggers rproc_trigger_recovery() for the
> > > > > > same remoteproc to recover it, which results in a NULL pointer
> > > > > > dereference in qcom_glink_smem_unregister().
> > > > > > 
> > > > > > Fix it by adding a NULL check in glink_subdev_stop().
> > > > > > 
> > > > > >     Process-A                                       Process-B
> > > > > > 
> > > > > >   fatal error interrupt happens
> > > > > > 
> > > > > >   rproc_crash_handler_work()
> > > > > >     mutex_lock_interruptible(&rproc->lock);
> > > > > >     ...
> > > > > > 
> > > > > >        rproc->state = RPROC_CRASHED;
> > > > > >     ...
> > > > > >     mutex_unlock(&rproc->lock);
> > > > > > 
> > > > > >     rproc_trigger_recovery()
> > > > > >      mutex_lock_interruptible(&rproc->lock);
> > > > > > 
> > > > > >       adsp_stop()
> > > > > >       qcom_q6v5_pas 20c00000.remoteproc: failed to shutdown: -22
> > > > > >       remoteproc remoteproc3: can't stop rproc: -22
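
To make the proposed fix concrete: the check I have in mind is roughly the
following, in glink_subdev_stop() in drivers/remoteproc/qcom_common.c (a
sketch of the idea written from memory, not the exact patch):

  static void glink_subdev_stop(struct rproc_subdev *subdev, bool crashed)
  {
          struct qcom_rproc_glink *glink = to_glink_subdev(subdev);

          /*
           * A previous (failed) stop already unregistered the edge and
           * cleared the pointer; bail out instead of passing NULL to
           * qcom_glink_smem_unregister() a second time.
           */
          if (!glink->edge)
                  return;

          qcom_glink_smem_unregister(glink->edge);
          glink->edge = NULL;
  }
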
> > > > > 
> > > > > I presume that at this point this remoteproc is in some undefined 
> > > > > state
> > > > > and the only way to recover is for the user to reboot the machine?
> > > > 
> > > > Here, the scm shutdown is still failing after 50+ retries (~5s) while
> > > > decrypting the remote processor memory region, and I don't think it
> > > > has anything to do with the remote processor's state. As a best
> > > > effort, we could either retry more than 50 times, or wait for some
> > > > other recovery command such as recovery_store() to let it retry the
> > > > whole sequence again from the beginning.
> > > > 
> > > 
> > > But are you saying that retrying a bit later would allow us to get out
> > > of this problem? If we just didn't hit the NULL pointer(s)?
> > 
> > I am not sure whether adding more retries will solve the issue. My
> > initial thinking was that it is better to retry than to leave the
> > remoteproc in some unknown state. However, I do see that letting it
> > retry would mean patching every piece of code that does not expect to
> > be called twice, e.g. the ssr notifier callbacks, as a side effect of
> > the remoteproc being in an unknown state.
> 
> That's not what I'm asking you. When the remote processor fails to stop,
> can you recover the system by just trying a bit later, or is the
> remoteproc dead until reboot?

I cannot say that with certainty. In the current issue, the remoteproc
fails to stop because it is running out of heap memory.

-Mukesh
> 
> > > 
> > > How long are we talking about here? Is the timeout too short?
> > 
> > 5 sec is already a significant amount of time to wait for the remote
> > processor to recover; we should not change this.
> 
> Okay, I'm just trying to understand the problem you're trying to solve.
> 
> Regards,
> Bjorn
> 
> > > 
> > > > > 
> > > > > 
> > > > > The check for glink->edge avoids one pitfall following this, but I'd
> > > > > prefer to see a solution that avoids issues in this scenario in the
> > > > > remoteproc core - rather than working around side effects of this in
> > > > > different places.
> > > > 
> > > > Handling this in the remoteproc core means we may need another state,
> > > > something like "RPROC_UNKNOWN", which would be set after a failed
> > > > recovery attempt; a subsequent attempt would then check it and return
> > > > immediately with a log message.
> > > > 
> > > 
> > > Yes, if we are failing to shut down the remoteproc and there's no way
> > > for us to reliably recover from that (e.g. we are not able to reclaim
> > > the memory), it seems reasonable that we have to mark it using a new
> > > state.
> > > 
> > > If that is the case, I'd call it RPROC_DEFUNCT (or something like that)
> > > instead, because while it is in some "unknown" state, from the
> > > remoteproc framework's point of view it is in a well-known (broken)
> > > state.
> > 
> > Ack.
> > 
> > -Mukesh
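
For the RPROC_DEFUNCT direction, something along these lines is what I
have in mind for rproc_trigger_recovery() (a rough, untested sketch; the
RPROC_DEFUNCT state does not exist yet and the exact naming/placement is
still to be decided):

  /* 1) a new terminal state in enum rproc_state */
  RPROC_DEFUNCT,          /* shutdown failed, recovery no longer possible */

  /* 2) in rproc_trigger_recovery(), with rproc->lock held */
  if (rproc->state == RPROC_DEFUNCT) {
          dev_err(&rproc->dev, "%s is defunct, not attempting recovery\n",
                  rproc->name);
          ret = -EINVAL;
          goto unlock_mutex;
  }

  /* 3) in the same function, where rproc_stop() fails today */
  ret = rproc_stop(rproc, true);
  if (ret) {
          /* mark it so recovery_store() cannot rerun the subdev stop path */
          rproc->state = RPROC_DEFUNCT;
          goto unlock_mutex;
  }

That way a second recovery attempt never reaches glink_subdev_stop() at
all.
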
> > > 
> > > Regards,
> > > Bjorn
