On Fri, Oct 25, 2024 at 09:08:03AM -0600, Mathieu Poirier wrote:
> On Fri, Oct 25, 2024 at 01:40:45PM +0530, Mukesh Ojha wrote:
> > On Mon, Oct 21, 2024 at 09:12:47AM -0600, Mathieu Poirier wrote:
> > > Hi Mukesh,
> > > 
> > > On Wed, Oct 16, 2024 at 10:25:46AM +0530, Mukesh Ojha wrote:
> > > > Multiple call to glink_subdev_stop() for the same remoteproc can happen
> > > > if rproc_stop() fails from Process-A that leaves the rproc state to
> > > > RPROC_CRASHED state later a call to recovery_store from user space in
> > > > Process B triggers rproc_trigger_recovery() of the same remoteproc to
> > > > recover it results in NULL pointer dereference issue in
> > > > qcom_glink_smem_unregister().
> > > > 
> > > > There is other side to this issue if we want to fix this via adding a
> > > > NULL check on glink->edge which does not guarantees that the remoteproc
> > > > will recover in second call from Process B as it has failed in the first
> > > > Process A during SMC shutdown call and may again fail at the same call
> > > > and rproc can not recover for such case.
> > > > 
> > > > Add a new rproc state RPROC_DEFUNCT i.e., non recoverable state of
> > > > remoteproc and the only way to recover from it via system restart.
> > > > 
> > > >         Process-A                                       Process-B
> > > > 
> > > >   fatal error interrupt happens
> > > > 
> > > >   rproc_crash_handler_work()
> > > >     mutex_lock_interruptible(&rproc->lock);
> > > >     ...
> > > > 
> > > >        rproc->state = RPROC_CRASHED;
> > > >     ...
> > > >     mutex_unlock(&rproc->lock);
> > > > 
> > > >     rproc_trigger_recovery()
> > > >      mutex_lock_interruptible(&rproc->lock);
> > > > 
> > > >       adsp_stop()
> > > >       qcom_q6v5_pas 20c00000.remoteproc: failed to shutdown: -22
> > > >       remoteproc remoteproc3: can't stop rproc: -22
> > > >      mutex_unlock(&rproc->lock);
> > > 
> > > Ok, that can happen.
> > > 
> > > > 
> > > >                                                 echo enabled > 
> > > > /sys/class/remoteproc/remoteprocX/recovery
> > > >                                                 recovery_store()
> > > >                                                  
> > > > rproc_trigger_recovery()
> > > >                                                   
> > > > mutex_lock_interruptible(&rproc->lock);
> > > >                                                    rproc_stop()
> > > >                                                     glink_subdev_stop()
> > > >                                                       
> > > > qcom_glink_smem_unregister() ==|
> > > >                                                                         
> > > >              |
> > > >                                                                         
> > > >              V
> > > 
> > > I am missing some information here but I will _assume_ this is caused by
> > > glink->edge being set to NULL [1] when glink_subdev_stop() is first 
> > > called by
> > > process A.  Instead of adding a new state to the core I think a better 
> > > idea
> > > would be to add a check for a NULL value on @smem in
> > > qcom_glink_smem_unregister().  This is a problem that should be fixed in 
> > > the
> > > driver rather than the core.
> > > 
> > > [1]. 
> > > https://elixir.bootlin.com/linux/v6.12-rc4/source/drivers/remoteproc/qcom_common.c#L213
> > 
> > 
> > I did the same here [1] but after discussion with Bjorn, realized that
> > remoteproc might not even recover and may fail in the second attempt as
> > well and only way is reboot of the machine.
> 
> Whether in RPROC_CRASHED or RPROC_DEFUNCT state, the end result is the same -
> manual intervention is needed.  I don't see why another state needs to be 
> added.

Is it really true ? As when recovery is disabled and any rproc crash
will result in RPROC_CRASHED state, while recovery enablement can
recover the rproc back to ONLINE while if rproc recovery is not
successful it can be put into RPROC_DEFUNCT state.

-Mukesh

> 
> > 
> > [1]
> > https://lore.kernel.org/lkml/[email protected]/
> > 
> > > 
> > > >                                                       Unable to handle 
> > > > kernel NULL pointer dereference
> > > >                                                                 at 
> > > > virtual address 0000000000000358
> > > > 
> > > > Signed-off-by: Mukesh Ojha <[email protected]>
> > > > ---
> > > > Changes in v3:
> > > >  - Fix kernel test reported error.
> > > > 
> > > > Changes in v2:
> > > >  - Removed NULL pointer check instead added a new state to signify
> > > >    non-recoverable state of remoteproc.
> > > > 
> > > >  drivers/remoteproc/remoteproc_core.c  | 3 ++-
> > > >  drivers/remoteproc/remoteproc_sysfs.c | 1 +
> > > >  include/linux/remoteproc.h            | 5 ++++-
> > > >  3 files changed, 7 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/drivers/remoteproc/remoteproc_core.c 
> > > > b/drivers/remoteproc/remoteproc_core.c
> > > > index f276956f2c5c..c4e14503b971 100644
> > > > --- a/drivers/remoteproc/remoteproc_core.c
> > > > +++ b/drivers/remoteproc/remoteproc_core.c
> > > > @@ -1727,6 +1727,7 @@ static int rproc_stop(struct rproc *rproc, bool 
> > > > crashed)
> > > >         /* power off the remote processor */
> > > >         ret = rproc->ops->stop(rproc);
> > > >         if (ret) {
> > > > +               rproc->state = RPROC_DEFUNCT;
> > > >                 dev_err(dev, "can't stop rproc: %d\n", ret);
> > > >                 return ret;
> > > >         }
> > > > @@ -1839,7 +1840,7 @@ int rproc_trigger_recovery(struct rproc *rproc)
> > > >                 return ret;
> > > >  
> > > >         /* State could have changed before we got the mutex */
> > > > -       if (rproc->state != RPROC_CRASHED)
> > > > +       if (rproc->state == RPROC_DEFUNCT || rproc->state != 
> > > > RPROC_CRASHED)
> > > >                 goto unlock_mutex;
> > > 
> > > The problem is that rproc_trigger_recovery() an only be called once for a
> > > remoteproc, something that modifies the state machine and may introduce 
> > > backward
> > > compatibility issues for other remote processor implementations.
> > > 
> > 
> > I missed one more point to add here which i tried to highlight in second
> > version[2] that setting of RPROC_DEFUNCT should happen for this case
> > from vendor remoteproc driver and not at the core and that should take
> > care of the backward compatibility.
> > 
> > [2]
> > https://lore.kernel.org/lkml/[email protected]/
> > 
> > -Mukesh
> > 
> > > Thanks,
> > > Mathieu
> > > 
> > > >  
> > > >         dev_err(dev, "recovering %s\n", rproc->name);
> > > > diff --git a/drivers/remoteproc/remoteproc_sysfs.c 
> > > > b/drivers/remoteproc/remoteproc_sysfs.c
> > > > index 138e752c5e4e..5f722b4576b2 100644
> > > > --- a/drivers/remoteproc/remoteproc_sysfs.c
> > > > +++ b/drivers/remoteproc/remoteproc_sysfs.c
> > > > @@ -171,6 +171,7 @@ static const char * const rproc_state_string[] = {
> > > >         [RPROC_DELETED]         = "deleted",
> > > >         [RPROC_ATTACHED]        = "attached",
> > > >         [RPROC_DETACHED]        = "detached",
> > > > +       [RPROC_DEFUNCT]         = "defunct",
> > > >         [RPROC_LAST]            = "invalid",
> > > >  };
> > > >  
> > > > diff --git a/include/linux/remoteproc.h b/include/linux/remoteproc.h
> > > > index b4795698d8c2..3e4ba06c6a9a 100644
> > > > --- a/include/linux/remoteproc.h
> > > > +++ b/include/linux/remoteproc.h
> > > > @@ -417,6 +417,8 @@ struct rproc_ops {
> > > >   *                     has attached to it
> > > >   * @RPROC_DETACHED:    device has been booted by another entity and 
> > > > waiting
> > > >   *                     for the core to attach to it
> > > > + * @RPROC_DEFUNCT:     device neither crashed nor responding to any of 
> > > > the
> > > > + *                     requests and can only recover on system restart.
> > > >   * @RPROC_LAST:                just keep this one at the end
> > > >   *
> > > >   * Please note that the values of these states are used as indices
> > > > @@ -433,7 +435,8 @@ enum rproc_state {
> > > >         RPROC_DELETED   = 4,
> > > >         RPROC_ATTACHED  = 5,
> > > >         RPROC_DETACHED  = 6,
> > > > -       RPROC_LAST      = 7,
> > > > +       RPROC_DEFUNCT   = 7,
> > > > +       RPROC_LAST      = 8,
> > > >  };
> > > >  
> > > >  /**
> > > > -- 
> > > > 2.34.1
> > > > 

Reply via email to