On 28/11/2025 10:32, Mukesh Ojha wrote:
There is a scenario, when fatal interrupt triggers rproc crash handling
while a user-space recovery is initiated in parallel. The overlapping
recovery/stop sequences race on rproc state and subdevice teardown,
resulting in a NULL pointer dereference in the GLINK SMEM unregister
path.

        Process-A                                       Process-B

   fatal error interrupt happens

   rproc_crash_handler_work()
     mutex_lock_interruptible(&rproc->lock);
     ...

        rproc->state = RPROC_CRASHED;
     ...
     mutex_unlock(&rproc->lock);

     rproc_trigger_recovery()
      mutex_lock_interruptible(&rproc->lock);

       qcom_pas_stop()
       qcom_q6v5_pas 20c00000.remoteproc: failed to shutdown: -22
       remoteproc remoteproc3: can't stop rproc: -22
      mutex_unlock(&rproc->lock);

                                                echo enabled > 
/sys/class/remoteproc/remoteprocX/recovery
                                                recovery_store()
                                                 rproc_trigger_recovery()
                                                  
mutex_lock_interruptible(&rproc->lock);
                                                   rproc_stop()
                                                    glink_subdev_stop()
                                                      
qcom_glink_smem_unregister() ==|
                                                                                
      |
                                                                                
      V
                                                      Unable to handle kernel 
NULL pointer dereference
                                                                 at virtual 
address 0000000000000358

It is tempting to introduce a remoteproc state that could be set from
the ->ops->stop() callback, which would have avoided the second attempt
and prevented the crash. However, making remoteproc recovery dependent
on manual intervention or a system reboot is not ideal. We should always
try to recover the remote processor if possible. A failure in the
->ops->stop() callback might be temporary or caused by a timeout, and a
recovery attempt could still succeed, as seen in similar scenarios.
Therefore, instead of adding a restrictive state, let’s add a NULL check
at the appropriate places to avoid a kernel crash and allow the system
to move forward gracefully.

Signed-off-by: Mukesh Ojha <[email protected]>
---
Changes in v4: 
https://lore.kernel.org/all/[email protected]/
  - Brought the same change from v2.
  - Added smd->edge NULL check.
  - Rephrased the commit text.

Changes in v3:
  - Fix kernel test reported error.

Changes in v2: 
https://lore.kernel.org/lkml/[email protected]/
  - Removed NULL pointer check instead added a new state to signify
    non-recoverable state of remoteproc.

  drivers/remoteproc/qcom_common.c | 6 ++++++
  1 file changed, 6 insertions(+)

diff --git a/drivers/remoteproc/qcom_common.c b/drivers/remoteproc/qcom_common.c
index 8c8688f99f0a..6480293d2f61 100644
--- a/drivers/remoteproc/qcom_common.c
+++ b/drivers/remoteproc/qcom_common.c
@@ -209,6 +209,9 @@ static void glink_subdev_stop(struct rproc_subdev *subdev, 
bool crashed)
  {
        struct qcom_rproc_glink *glink = to_glink_subdev(subdev);

+       if (!glink->edge)
+               return;
+
        qcom_glink_smem_unregister(glink->edge);
        glink->edge = NULL;
  }
@@ -320,6 +323,9 @@ static void smd_subdev_stop(struct rproc_subdev *subdev, 
bool crashed)
  {
        struct qcom_rproc_subdev *smd = to_smd_subdev(subdev);

+       if (!smd->edge)
+               return;
+
        qcom_smd_unregister_edge(smd->edge);
        smd->edge = NULL;
  }
--
2.50.1



Since this fixes a real bug, you need a Fixes tag.

Once added.

Reviewed-by: Bryan O'Donoghue <[email protected]>

Reply via email to