On Sat, Sep 28, 2024 at 01:07:43AM +0530, Mukesh Ojha wrote: > On Wed, Sep 25, 2024 at 08:41:55PM -0700, Bjorn Andersson wrote: > > On Wed, Sep 25, 2024 at 04:03:51PM +0530, Mukesh Ojha wrote: > > > Multiple call to glink_subdev_stop() for the same remoteproc can happen > > > if rproc_stop() fails from Process-A that leaves the rproc state to > > > RPROC_CRASHED state later a call to recovery_store from user space in > > > Process B triggers rproc_trigger_recovery() of the same remoteproc to > > > recover it results in NULL pointer dereference issue in > > > qcom_glink_smem_unregister(). > > > > > > Fix it by having a NULL check in glink_subdev_stop(). > > > > > > Process-A Process-B > > > > > > fatal error interrupt happens > > > > > > rproc_crash_handler_work() > > > mutex_lock_interruptible(&rproc->lock); > > > ... > > > > > > rproc->state = RPROC_CRASHED; > > > ... > > > mutex_unlock(&rproc->lock); > > > > > > rproc_trigger_recovery() > > > mutex_lock_interruptible(&rproc->lock); > > > > > > adsp_stop() > > > qcom_q6v5_pas 20c00000.remoteproc: failed to shutdown: -22 > > > remoteproc remoteproc3: can't stop rproc: -22 > > > > I presume that at this point this remoteproc is in some undefined state > > and the only way to recover is for the user to reboot the machine? > > Here, 50+ (5s) retry of scm shutdown is failing during decryption of > remote processor memory region, and i don't think, it is anyway to do > with remote processor state here, as a best effort more number of > retries can be tried instead of 50 or wait for some other recovery > command like recovery_store() to let it do the retry again from > beginning. >
But are you saying that retrying a bit later would allow us to get out of this problem? If we just didn't hit the NULL pointer(s)? How long are we talking about here? Is the timeout too short? > > > > > > The check for glink->edge avoids one pitfall following this, but I'd > > prefer to see a solution that avoids issues in this scenario in the > > remoteproc core - rather than working around side effects of this in > > different places. > > Handling in a remoteproc core means we may need another state something > like "RPROC_UNKNOWN" which can be kept after one attempt of recovery > failure and checking the same during another try return immediately with > some log message. > Yes, if we are failing to shut down the remoteproc and there's no way for us to reliably recover from that (e.g. we are not able to reclaim the memory), it seems reasonable that we have to mark it using a new state. If that is the case, I'd call it RPROC_DEFUNCT (or something like that instead), because while in some "unknown" state, from a remoteproc framework's point of view, it's in a well known (broken) state. Regards, Bjorn