On 28/11/2025 10:32, Mukesh Ojha wrote:
There is a scenario, when fatal interrupt triggers rproc crash handling while a user-space recovery is initiated in parallel. The overlapping recovery/stop sequences race on rproc state and subdevice teardown, resulting in a NULL pointer dereference in the GLINK SMEM unregister path.Process-A Process-B fatal error interrupt happens rproc_crash_handler_work() mutex_lock_interruptible(&rproc->lock); ... rproc->state = RPROC_CRASHED; ... mutex_unlock(&rproc->lock); rproc_trigger_recovery() mutex_lock_interruptible(&rproc->lock); qcom_pas_stop() qcom_q6v5_pas 20c00000.remoteproc: failed to shutdown: -22 remoteproc remoteproc3: can't stop rproc: -22 mutex_unlock(&rproc->lock); echo enabled > /sys/class/remoteproc/remoteprocX/recovery recovery_store() rproc_trigger_recovery() mutex_lock_interruptible(&rproc->lock); rproc_stop() glink_subdev_stop() qcom_glink_smem_unregister() ==| | V Unable to handle kernel NULL pointer dereference at virtual address 0000000000000358 It is tempting to introduce a remoteproc state that could be set from the ->ops->stop() callback, which would have avoided the second attempt and prevented the crash. However, making remoteproc recovery dependent on manual intervention or a system reboot is not ideal. We should always try to recover the remote processor if possible. A failure in the ->ops->stop() callback might be temporary or caused by a timeout, and a recovery attempt could still succeed, as seen in similar scenarios. Therefore, instead of adding a restrictive state, let’s add a NULL check at the appropriate places to avoid a kernel crash and allow the system to move forward gracefully. Signed-off-by: Mukesh Ojha <[email protected]> --- Changes in v4: https://lore.kernel.org/all/[email protected]/ - Brought the same change from v2. - Added smd->edge NULL check. - Rephrased the commit text. Changes in v3: - Fix kernel test reported error. Changes in v2: https://lore.kernel.org/lkml/[email protected]/ - Removed NULL pointer check instead added a new state to signify non-recoverable state of remoteproc. drivers/remoteproc/qcom_common.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/drivers/remoteproc/qcom_common.c b/drivers/remoteproc/qcom_common.c index 8c8688f99f0a..6480293d2f61 100644 --- a/drivers/remoteproc/qcom_common.c +++ b/drivers/remoteproc/qcom_common.c @@ -209,6 +209,9 @@ static void glink_subdev_stop(struct rproc_subdev *subdev, bool crashed) { struct qcom_rproc_glink *glink = to_glink_subdev(subdev); + if (!glink->edge) + return; + qcom_glink_smem_unregister(glink->edge); glink->edge = NULL; } @@ -320,6 +323,9 @@ static void smd_subdev_stop(struct rproc_subdev *subdev, bool crashed) { struct qcom_rproc_subdev *smd = to_smd_subdev(subdev); + if (!smd->edge) + return; + qcom_smd_unregister_edge(smd->edge); smd->edge = NULL; } -- 2.50.1
Since this fixes a real bug, you need a Fixes tag. Once added. Reviewed-by: Bryan O'Donoghue <[email protected]>

