On Tue, Mar 10, 2026 at 06:50:30AM -0700, Bartosz Golaszewski wrote:
> On Tue, 10 Mar 2026 11:03:21 +0100, Jingyi Wang
> <[email protected]> said:
> > rproc_add() called by rproc probe function failure will tear down all
> > the resources including do device_del() and remove subdev etc. If
> > rproc_report_crash() is called in this path, the rproc_crash_handler_work
> > could be excuted asynchronously, rproc_boot_recovery()->rproc_stop() will
> > be called with recovery enabled, which may cause NULL pointer dereference
> > if the resource has already been cleaned up.
> >
> > [    5.251483] Unable to handle kernel NULL pointer dereference at virtual 
> > address 0000000000000300
> > [    5.260499] Mem abort info:
> > [    5.263384]   ESR = 0x0000000096000006
> > [    5.267248]   EC = 0x25: DABT (current EL), IL = 32 bits
> > [    5.272711]   SET = 0, FnV = 0
> > [    5.275865]   EA = 0, S1PTW = 0
> > [    5.279106]   FSC = 0x06: level 2 translation fault
> > [    5.284125] Data abort info:
> > [    5.287101]   ISV = 0, ISS = 0x00000006, ISS2 = 0x00000000
> > [    5.292742]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
> > [    5.297939]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
> > [    5.303400] user pgtable: 4k pages, 48-bit VAs, pgdp=000000089e086000
> > [    5.310022] [0000000000000300] pgd=080000089e087403, 
> > p4d=080000089e087403, pud=080000089e088403, pmd=0000000000000000
> > [    5.320917] Internal error: Oops: 0000000096000006 [#1]  SMP
> > [    5.392494] Hardware name: Qualcomm Technologies, Inc. Kaanapali QRD (DT)
> > [    5.399466] Workqueue: rproc_recovery_wq rproc_crash_handler_work
> > [    5.405729] pstate: 23400005 (nzCv daif +PAN -UAO +TCO +DIT -SSBS 
> > BTYPE=--)
> > [    5.412879] pc : qcom_glink_smem_unregister+0x14/0x48 [qcom_glink_smem]
> > [    5.419674] lr : glink_subdev_stop+0x1c/0x30 [qcom_common]
> > [    5.425308] sp : ffff800080ffbc90
> > [    5.428724] x29: ffff800080ffbc90 x28: ffff00081be833f0 x27: 
> > ffff000800059c00
> > [    5.436053] x26: 0000000000000000 x25: ffff000800a56f80 x24: 
> > 61c8864680b583eb
> > [    5.443384] x23: ffff00081be83038 x22: 0000000000000001 x21: 
> > ffff00081be83000
> > [    5.450714] x20: ffff00081be833c0 x19: 0000000000000000 x18: 
> > 0000000000000010
> > [    5.458043] x17: 0000000000000000 x16: 0000000000000000 x15: 
> > ffff0008042684f8
> > [    5.465374] x14: 00000000000002dd x13: ffff0008042684f8 x12: 
> > ffffd37f69f967a0
> > [    5.472705] x11: ffffd37f6a006800 x10: ffffd37f69fee7c0 x9 : 
> > ffffd37f69fee818
> > [    5.480036] x8 : 0000000000017fe8 x7 : c0000000ffffefff x6 : 
> > 0000000000000001
> > [    5.487366] x5 : ffff000d6536d408 x4 : 0000000000000001 x3 : 
> > 0000000000000000
> > [    5.494697] x2 : ffffd37f5703c18c x1 : 0000000000000001 x0 : 
> > 0000000000000000
> > [    5.502028] Call trace:
> > [    5.504549]  qcom_glink_smem_unregister+0x14/0x48 [qcom_glink_smem] (P)
> > [    5.511344]  glink_subdev_stop+0x1c/0x30 [qcom_common]
> > [    5.516622]  rproc_stop+0x58/0x17c
> > [    5.520127]  rproc_trigger_recovery+0xb0/0x150
> > [    5.524693]  rproc_crash_handler_work+0xa4/0xc4
> > [    5.529346]  process_scheduled_works+0x18c/0x2d8
> > [    5.534092]  worker_thread+0x144/0x280
> > [    5.537952]  kthread+0x124/0x138
> > [    5.541280]  ret_from_fork+0x10/0x20
> > [    5.544965] Code: a9be7bfd 910003fd a90153f3 aa0003f3 (b9430000)
> > [    5.551224] ---[ end trace 0000000000000000 ]---
> >
> > So set recovery_disabled during rproc_add().
> >
> > Signed-off-by: Jingyi Wang <[email protected]>
> > ---
> >  drivers/remoteproc/remoteproc_core.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> >
> > diff --git a/drivers/remoteproc/remoteproc_core.c 
> > b/drivers/remoteproc/remoteproc_core.c
> > index b087ed21858a..f66dde712cec 100644
> > --- a/drivers/remoteproc/remoteproc_core.c
> > +++ b/drivers/remoteproc/remoteproc_core.c
> > @@ -2286,7 +2286,10 @@ int rproc_add(struct rproc *rproc)
> >  {
> >     struct device *dev = &rproc->dev;
> >     int ret;
> > +   bool rproc_recovery_save;
> >
> > +   rproc_recovery_save  = rproc->recovery_disabled;
> > +   rproc->recovery_disabled = true;
> >     ret = rproc_validate(rproc);
> >     if (ret < 0)
> >             return ret;
> > @@ -2319,6 +2322,7 @@ int rproc_add(struct rproc *rproc)
> >     list_add_rcu(&rproc->node, &rproc_list);
> >     mutex_unlock(&rproc_list_mutex);
> >
> > +   rproc->recovery_disabled = rproc_recovery_save;
> >     return 0;
> >
> >  rproc_remove_dev:
> >
> > --
> > 2.25.1
> >
> >
> 
> Ideally things like this would be passed to the rproc core in some kind of a
> config structure and only set when registration succeeds. This looks to me
> like papering over the real issue and I think it's still racy as there's no
> true synchronization.
> 
> Wouldn't it be better to take rproc->lock for the entire duration of
> rproc_add()? It's already initialized in rproc_alloc().

It would still be racy as rproc_trigger_recovery() is called outside of
the lock. Instead the error cleanup path (and BTW, rproc_del() path too)
must explicitly call cancel_work_sync() on the crash_handler work (and
any other work items that can be scheduled).

> 
> Bart

-- 
With best wishes
Dmitry

Reply via email to