Re: [PATCH] remoteproc: qcom: Fix NULL pointer in glink_subdev_stop()

2024-10-07 Thread Mukesh Ojha
On Tue, Oct 01, 2024 at 02:15:52PM -0700, Bjorn Andersson wrote:
> On Tue, Oct 01, 2024 at 12:06:17PM +0530, Mukesh Ojha wrote:
> > On Fri, Sep 27, 2024 at 02:59:09PM -0700, Bjorn Andersson wrote:
> > > On Sat, Sep 28, 2024 at 01:07:43AM +0530, Mukesh Ojha wrote:
> > > > On Wed, Sep 25, 2024 at 08:41:55PM -0700, Bjorn Andersson wrote:
> > > > > On Wed, Sep 25, 2024 at 04:03:51PM +0530, Mukesh Ojha wrote:
> > > > > > Multiple call to glink_subdev_stop() for the same remoteproc can 
> > > > > > happen
> > > > > > if rproc_stop() fails from Process-A that leaves the rproc state to
> > > > > > RPROC_CRASHED state later a call to recovery_store from user space 
> > > > > > in
> > > > > > Process B triggers rproc_trigger_recovery() of the same remoteproc 
> > > > > > to
> > > > > > recover it results in NULL pointer dereference issue in
> > > > > > qcom_glink_smem_unregister().
> > > > > > 
> > > > > > Fix it by having a NULL check in glink_subdev_stop().
> > > > > > 
> > > > > > Process-A   Process-B
> > > > > > 
> > > > > >   fatal error interrupt happens
> > > > > > 
> > > > > >   rproc_crash_handler_work()
> > > > > > mutex_lock_interruptible(&rproc->lock);
> > > > > > ...
> > > > > > 
> > > > > >rproc->state = RPROC_CRASHED;
> > > > > > ...
> > > > > > mutex_unlock(&rproc->lock);
> > > > > > 
> > > > > > rproc_trigger_recovery()
> > > > > >  mutex_lock_interruptible(&rproc->lock);
> > > > > > 
> > > > > >   adsp_stop()
> > > > > >   qcom_q6v5_pas 20c0.remoteproc: failed to shutdown: -22
> > > > > >   remoteproc remoteproc3: can't stop rproc: -22
> > > > > 
> > > > > I presume that at this point this remoteproc is in some undefined 
> > > > > state
> > > > > and the only way to recover is for the user to reboot the machine?
> > > > 
> > > > Here, 50+ (5s) retry of scm shutdown is failing during decryption of
> > > > remote processor memory region, and i don't think, it is anyway to do
> > > > with remote processor state here, as a best effort more number of
> > > > retries can be tried instead of 50 or wait for some other recovery
> > > > command like recovery_store() to let it do the retry again from
> > > > beginning.
> > > > 
> > > 
> > > But are you saying that retrying a bit later would allow us to get out
> > > of this problem? If we just didn't hit the NULL pointer(s)?
> > 
> > I am not sure whether adding more number of retries will solve the issue
> > and initially thinking from perspective that, it is better to retry than
> > to leave the remoteproc in some unknown state however, I do get that
> > letting it retry could give unnecessary patching every code e.g., ssr
> > notifier callbacks, which is not expecting to be called twice as a
> > side-effect of remoteproc unknown state.
> 
> That's not what I'm asking you. When the remote processor fails to stop,
> can you recover the system by just trying a bit later, or is the
> remoteproc dead until reboot?

I cannot say this with certainty. For the current issue, the remoteproc
fails to stop as it is running out of heap memory.

-Mukesh
> 
> > > 
> > > How long are we talking about here? Is the timeout too short?
> > 
> > 5sec is very significant amount of time in wait for remote processor to
> > get recovered, we should not change this.
> 
> Okay, I'm just trying to understand the problem you're trying to solve.
> 
> Regards,
> Bjorn
> 
> > > 
> > > > > 
> > > > > 
> > > > > The check for glink->edge avoids one pitfall following this, but I'd
> > > > > prefer to see a solution that avoids issues in this scenario in the
> > > > > remoteproc core - rather than working around side effects of this in
> > > > > different places.
> > > > 
> > > > Handling in a remoteproc core means we may need another state something
> > > > like "RPROC_UNKNOWN" which can be kept after one attempt of recovery
> > > > failure and checking the same during another try return immediately with
> > > > some log message.
> > > > 
> > > 
> > > Yes, if we are failing to shut down the remoteproc and there's no way
> > > for us to reliably recover from that (e.g. we are not able to reclaim
> > > the memory), it seems reasonable that we have to mark it using a new
> > > state.
> > > 
> > > If that is the case, I'd call it RPROC_DEFUNCT (or something like that
> > > instead), because while in some "unknown" state, from a remoteproc
> > > framework's point of view, it's in a well known (broken) state.
> > 
> > Ack.
> > 
> > -Mukesh
> > > 
> > > Regards,
> > > Bjorn



Re: [PATCH] remoteproc: qcom: Fix NULL pointer in glink_subdev_stop()

2024-10-01 Thread Bjorn Andersson
On Tue, Oct 01, 2024 at 12:06:17PM +0530, Mukesh Ojha wrote:
> On Fri, Sep 27, 2024 at 02:59:09PM -0700, Bjorn Andersson wrote:
> > On Sat, Sep 28, 2024 at 01:07:43AM +0530, Mukesh Ojha wrote:
> > > On Wed, Sep 25, 2024 at 08:41:55PM -0700, Bjorn Andersson wrote:
> > > > On Wed, Sep 25, 2024 at 04:03:51PM +0530, Mukesh Ojha wrote:
> > > > > Multiple call to glink_subdev_stop() for the same remoteproc can 
> > > > > happen
> > > > > if rproc_stop() fails from Process-A that leaves the rproc state to
> > > > > RPROC_CRASHED state later a call to recovery_store from user space in
> > > > > Process B triggers rproc_trigger_recovery() of the same remoteproc to
> > > > > recover it results in NULL pointer dereference issue in
> > > > > qcom_glink_smem_unregister().
> > > > > 
> > > > > Fix it by having a NULL check in glink_subdev_stop().
> > > > > 
> > > > >   Process-A   Process-B
> > > > > 
> > > > >   fatal error interrupt happens
> > > > > 
> > > > >   rproc_crash_handler_work()
> > > > > mutex_lock_interruptible(&rproc->lock);
> > > > > ...
> > > > > 
> > > > >rproc->state = RPROC_CRASHED;
> > > > > ...
> > > > > mutex_unlock(&rproc->lock);
> > > > > 
> > > > > rproc_trigger_recovery()
> > > > >  mutex_lock_interruptible(&rproc->lock);
> > > > > 
> > > > >   adsp_stop()
> > > > >   qcom_q6v5_pas 20c0.remoteproc: failed to shutdown: -22
> > > > >   remoteproc remoteproc3: can't stop rproc: -22
> > > > 
> > > > I presume that at this point this remoteproc is in some undefined state
> > > > and the only way to recover is for the user to reboot the machine?
> > > 
> > > Here, 50+ (5s) retry of scm shutdown is failing during decryption of
> > > remote processor memory region, and i don't think, it is anyway to do
> > > with remote processor state here, as a best effort more number of
> > > retries can be tried instead of 50 or wait for some other recovery
> > > command like recovery_store() to let it do the retry again from
> > > beginning.
> > > 
> > 
> > But are you saying that retrying a bit later would allow us to get out
> > of this problem? If we just didn't hit the NULL pointer(s)?
> 
> I am not sure whether adding more number of retries will solve the issue
> and initially thinking from perspective that, it is better to retry than
> to leave the remoteproc in some unknown state however, I do get that
> letting it retry could give unnecessary patching every code e.g., ssr
> notifier callbacks, which is not expecting to be called twice as a
> side-effect of remoteproc unknown state.

That's not what I'm asking you. When the remote processor fails to stop,
can you recover the system by just trying a bit later, or is the
remoteproc dead until reboot?

> > 
> > How long are we talking about here? Is the timeout too short?
> 
> 5sec is very significant amount of time in wait for remote processor to
> get recovered, we should not change this.

Okay, I'm just trying to understand the problem you're trying to solve.

Regards,
Bjorn

> > 
> > > > 
> > > > 
> > > > The check for glink->edge avoids one pitfall following this, but I'd
> > > > prefer to see a solution that avoids issues in this scenario in the
> > > > remoteproc core - rather than working around side effects of this in
> > > > different places.
> > > 
> > > Handling in a remoteproc core means we may need another state something
> > > like "RPROC_UNKNOWN" which can be kept after one attempt of recovery
> > > failure and checking the same during another try return immediately with
> > > some log message.
> > > 
> > 
> > Yes, if we are failing to shut down the remoteproc and there's no way
> > for us to reliably recover from that (e.g. we are not able to reclaim
> > the memory), it seems reasonable that we have to mark it using a new
> > state.
> > 
> > If that is the case, I'd call it RPROC_DEFUNCT (or something like that
> > instead), because while in some "unknown" state, from a remoteproc
> > framework's point of view, it's in a well known (broken) state.
> 
> Ack.
> 
> -Mukesh
> > 
> > Regards,
> > Bjorn



Re: [PATCH] remoteproc: qcom: Fix NULL pointer in glink_subdev_stop()

2024-09-30 Thread Mukesh Ojha
On Fri, Sep 27, 2024 at 02:59:09PM -0700, Bjorn Andersson wrote:
> On Sat, Sep 28, 2024 at 01:07:43AM +0530, Mukesh Ojha wrote:
> > On Wed, Sep 25, 2024 at 08:41:55PM -0700, Bjorn Andersson wrote:
> > > On Wed, Sep 25, 2024 at 04:03:51PM +0530, Mukesh Ojha wrote:
> > > > Multiple call to glink_subdev_stop() for the same remoteproc can happen
> > > > if rproc_stop() fails from Process-A that leaves the rproc state to
> > > > RPROC_CRASHED state later a call to recovery_store from user space in
> > > > Process B triggers rproc_trigger_recovery() of the same remoteproc to
> > > > recover it results in NULL pointer dereference issue in
> > > > qcom_glink_smem_unregister().
> > > > 
> > > > Fix it by having a NULL check in glink_subdev_stop().
> > > > 
> > > > Process-A   Process-B
> > > > 
> > > >   fatal error interrupt happens
> > > > 
> > > >   rproc_crash_handler_work()
> > > > mutex_lock_interruptible(&rproc->lock);
> > > > ...
> > > > 
> > > >rproc->state = RPROC_CRASHED;
> > > > ...
> > > > mutex_unlock(&rproc->lock);
> > > > 
> > > > rproc_trigger_recovery()
> > > >  mutex_lock_interruptible(&rproc->lock);
> > > > 
> > > >   adsp_stop()
> > > >   qcom_q6v5_pas 20c0.remoteproc: failed to shutdown: -22
> > > >   remoteproc remoteproc3: can't stop rproc: -22
> > > 
> > > I presume that at this point this remoteproc is in some undefined state
> > > and the only way to recover is for the user to reboot the machine?
> > 
> > Here, 50+ (5s) retry of scm shutdown is failing during decryption of
> > remote processor memory region, and i don't think, it is anyway to do
> > with remote processor state here, as a best effort more number of
> > retries can be tried instead of 50 or wait for some other recovery
> > command like recovery_store() to let it do the retry again from
> > beginning.
> > 
> 
> But are you saying that retrying a bit later would allow us to get out
> of this problem? If we just didn't hit the NULL pointer(s)?

I am not sure whether adding more number of retries will solve the issue
and initially thinking from perspective that, it is better to retry than
to leave the remoteproc in some unknown state however, I do get that
letting it retry could give unnecessary patching every code e.g., ssr
notifier callbacks, which is not expecting to be called twice as a
side-effect of remoteproc unknown state.
> 
> How long are we talking about here? Is the timeout too short?

5sec is very significant amount of time in wait for remote processor to
get recovered, we should not change this.
> 
> > > 
> > > 
> > > The check for glink->edge avoids one pitfall following this, but I'd
> > > prefer to see a solution that avoids issues in this scenario in the
> > > remoteproc core - rather than working around side effects of this in
> > > different places.
> > 
> > Handling in a remoteproc core means we may need another state something
> > like "RPROC_UNKNOWN" which can be kept after one attempt of recovery
> > failure and checking the same during another try return immediately with
> > some log message.
> > 
> 
> Yes, if we are failing to shut down the remoteproc and there's no way
> for us to reliably recover from that (e.g. we are not able to reclaim
> the memory), it seems reasonable that we have to mark it using a new
> state.
> 
> If that is the case, I'd call it RPROC_DEFUNCT (or something like that
> instead), because while in some "unknown" state, from a remoteproc
> framework's point of view, it's in a well known (broken) state.

Ack.

-Mukesh
> 
> Regards,
> Bjorn



Re: [PATCH] remoteproc: qcom: Fix NULL pointer in glink_subdev_stop()

2024-09-27 Thread Bjorn Andersson
On Sat, Sep 28, 2024 at 01:07:43AM +0530, Mukesh Ojha wrote:
> On Wed, Sep 25, 2024 at 08:41:55PM -0700, Bjorn Andersson wrote:
> > On Wed, Sep 25, 2024 at 04:03:51PM +0530, Mukesh Ojha wrote:
> > > Multiple call to glink_subdev_stop() for the same remoteproc can happen
> > > if rproc_stop() fails from Process-A that leaves the rproc state to
> > > RPROC_CRASHED state later a call to recovery_store from user space in
> > > Process B triggers rproc_trigger_recovery() of the same remoteproc to
> > > recover it results in NULL pointer dereference issue in
> > > qcom_glink_smem_unregister().
> > > 
> > > Fix it by having a NULL check in glink_subdev_stop().
> > > 
> > >   Process-A   Process-B
> > > 
> > >   fatal error interrupt happens
> > > 
> > >   rproc_crash_handler_work()
> > > mutex_lock_interruptible(&rproc->lock);
> > > ...
> > > 
> > >rproc->state = RPROC_CRASHED;
> > > ...
> > > mutex_unlock(&rproc->lock);
> > > 
> > > rproc_trigger_recovery()
> > >  mutex_lock_interruptible(&rproc->lock);
> > > 
> > >   adsp_stop()
> > >   qcom_q6v5_pas 20c0.remoteproc: failed to shutdown: -22
> > >   remoteproc remoteproc3: can't stop rproc: -22
> > 
> > I presume that at this point this remoteproc is in some undefined state
> > and the only way to recover is for the user to reboot the machine?
> 
> Here, 50+ (5s) retry of scm shutdown is failing during decryption of
> remote processor memory region, and i don't think, it is anyway to do
> with remote processor state here, as a best effort more number of
> retries can be tried instead of 50 or wait for some other recovery
> command like recovery_store() to let it do the retry again from
> beginning.
> 

But are you saying that retrying a bit later would allow us to get out
of this problem? If we just didn't hit the NULL pointer(s)?

How long are we talking about here? Is the timeout too short?

> > 
> > 
> > The check for glink->edge avoids one pitfall following this, but I'd
> > prefer to see a solution that avoids issues in this scenario in the
> > remoteproc core - rather than working around side effects of this in
> > different places.
> 
> Handling in a remoteproc core means we may need another state something
> like "RPROC_UNKNOWN" which can be kept after one attempt of recovery
> failure and checking the same during another try return immediately with
> some log message.
> 

Yes, if we are failing to shut down the remoteproc and there's no way
for us to reliably recover from that (e.g. we are not able to reclaim
the memory), it seems reasonable that we have to mark it using a new
state.

If that is the case, I'd call it RPROC_DEFUNCT (or something like that
instead), because while in some "unknown" state, from a remoteproc
framework's point of view, it's in a well known (broken) state.

Regards,
Bjorn



Re: [PATCH] remoteproc: qcom: Fix NULL pointer in glink_subdev_stop()

2024-09-27 Thread Mukesh Ojha
On Wed, Sep 25, 2024 at 08:41:55PM -0700, Bjorn Andersson wrote:
> On Wed, Sep 25, 2024 at 04:03:51PM +0530, Mukesh Ojha wrote:
> > Multiple call to glink_subdev_stop() for the same remoteproc can happen
> > if rproc_stop() fails from Process-A that leaves the rproc state to
> > RPROC_CRASHED state later a call to recovery_store from user space in
> > Process B triggers rproc_trigger_recovery() of the same remoteproc to
> > recover it results in NULL pointer dereference issue in
> > qcom_glink_smem_unregister().
> > 
> > Fix it by having a NULL check in glink_subdev_stop().
> > 
> > Process-A   Process-B
> > 
> >   fatal error interrupt happens
> > 
> >   rproc_crash_handler_work()
> > mutex_lock_interruptible(&rproc->lock);
> > ...
> > 
> >rproc->state = RPROC_CRASHED;
> > ...
> > mutex_unlock(&rproc->lock);
> > 
> > rproc_trigger_recovery()
> >  mutex_lock_interruptible(&rproc->lock);
> > 
> >   adsp_stop()
> >   qcom_q6v5_pas 20c0.remoteproc: failed to shutdown: -22
> >   remoteproc remoteproc3: can't stop rproc: -22
> 
> I presume that at this point this remoteproc is in some undefined state
> and the only way to recover is for the user to reboot the machine?

Here, 50+ (5s) retry of scm shutdown is failing during decryption of
remote processor memory region, and i don't think, it is anyway to do
with remote processor state here, as a best effort more number of
retries can be tried instead of 50 or wait for some other recovery
command like recovery_store() to let it do the retry again from
beginning.

> 
> 
> The check for glink->edge avoids one pitfall following this, but I'd
> prefer to see a solution that avoids issues in this scenario in the
> remoteproc core - rather than working around side effects of this in
> different places.

Handling in a remoteproc core means we may need another state something
like "RPROC_UNKNOWN" which can be kept after one attempt of recovery
failure and checking the same during another try return immediately with
some log message.

-Mukesh



Re: [PATCH] remoteproc: qcom: Fix NULL pointer in glink_subdev_stop()

2024-09-25 Thread Bjorn Andersson
On Wed, Sep 25, 2024 at 04:03:51PM +0530, Mukesh Ojha wrote:
> Multiple call to glink_subdev_stop() for the same remoteproc can happen
> if rproc_stop() fails from Process-A that leaves the rproc state to
> RPROC_CRASHED state later a call to recovery_store from user space in
> Process B triggers rproc_trigger_recovery() of the same remoteproc to
> recover it results in NULL pointer dereference issue in
> qcom_glink_smem_unregister().
> 
> Fix it by having a NULL check in glink_subdev_stop().
> 
>   Process-A   Process-B
> 
>   fatal error interrupt happens
> 
>   rproc_crash_handler_work()
> mutex_lock_interruptible(&rproc->lock);
> ...
> 
>rproc->state = RPROC_CRASHED;
> ...
> mutex_unlock(&rproc->lock);
> 
> rproc_trigger_recovery()
>  mutex_lock_interruptible(&rproc->lock);
> 
>   adsp_stop()
>   qcom_q6v5_pas 20c0.remoteproc: failed to shutdown: -22
>   remoteproc remoteproc3: can't stop rproc: -22

I presume that at this point this remoteproc is in some undefined state
and the only way to recover is for the user to reboot the machine?


The check for glink->edge avoids one pitfall following this, but I'd
prefer to see a solution that avoids issues in this scenario in the
remoteproc core - rather than working around side effects of this in
different places.

Regards,
Bjorn

>  mutex_unlock(&rproc->lock);
> 
>   echo enabled > 
> /sys/class/remoteproc/remoteprocX/recovery
>   recovery_store()
>rproc_trigger_recovery()
> 
> mutex_lock_interruptible(&rproc->lock);
>  rproc_stop()
>   glink_subdev_stop()
> 
> qcom_glink_smem_unregister() ==|
>   
>|
>   
>V
> Unable to handle kernel 
> NULL pointer dereference
> at virtual 
> address 0358
> 
> Signed-off-by: Mukesh Ojha 
> ---
> - We can do this NULL check in qcom_glink_smem_unregister() as it is
>   exported function however, there is only one user of this. So, doing
>   it with current approach should also be fine.
> 
>  drivers/remoteproc/qcom_common.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/remoteproc/qcom_common.c 
> b/drivers/remoteproc/qcom_common.c
> index 8c8688f99f0a..52d6c9b99fdb 100644
> --- a/drivers/remoteproc/qcom_common.c
> +++ b/drivers/remoteproc/qcom_common.c
> @@ -209,6 +209,9 @@ static void glink_subdev_stop(struct rproc_subdev 
> *subdev, bool crashed)
>  {
>   struct qcom_rproc_glink *glink = to_glink_subdev(subdev);
>  
> + if (!glink->edge)
> + return;
> +
>   qcom_glink_smem_unregister(glink->edge);
>   glink->edge = NULL;
>  }
> -- 
> 2.34.1
> 
>