> On Tue, Mar 17, 2026 at 11:43:49PM +0000, Long Li wrote:
> > >
> > > On Fri, Mar 13, 2026 at 01:59:28PM -0300, Jason Gunthorpe wrote:
> > > > On Sat, Mar 07, 2026 at 07:38:14PM +0200, Leon Romanovsky wrote:
> > > > > On Fri, Mar 06, 2026 at 05:47:14PM -0800, Long Li wrote:
> > > > > > When the MANA hardware undergoes a service reset, the ETH
> > > > > > auxiliary device
> > > > > > (mana.eth) used by DPDK persists across the reset cycle — it
> > > > > > is not removed and re-added like RC/UD/GSI QPs. This means
> > > > > > userspace RDMA consumers such as DPDK have no way of knowing
> > > > > > that firmware handles for their PD, CQ, WQ, QP and MR resources
> > > > > > have become stale.
> > > > >
> > > > > NAK to any of this.
> > > > >
> > > > > In case of hardware reset, mana_ib AUX device needs to be
> > > > > destroyed and recreated later.
> > > >
> > > > Yeah, that is our general model for any serious RAS event where
> > > > the driver's view of resources becomes out of sync with the HW.
> > > >
> > > > You have to tear down the ib_device by removing the aux and then
> > > > bring back a new one.
> > > >
> > > > There is an IB_EVENT_DEVICE_FATAL, but the purpose of that event
> > > > is to tell userspace to close and re-open their uverbs FD.
> > > >
> > > > We don't have a model where a uverbs FD in userspace can continue
> > > > to work after the device has a catastrophic RAS event.
> > > >
> > > > There may be room to have a model where the ib device doesn't
> > > > fully unplug/replug so it retains its name and things, but that is
> > > > core code not driver stuff.
> > >
> > > Good luck with that model. It is going to break RDMA-CM hotplug support.
> > >
> >
> >    I think we can preserve RDMA-CM behavior without requiring ib_device
> >    unregister/re-register.
> >
> >    On device reset, the driver can dispatch IB_EVENT_DEVICE_FATAL (or a
> >    new reset event) through ib_dispatch_event(). RDMA-CM already handles
> >    device events — we would add a handler that iterates all rdma_cm_ids
> >    on the device and sends RDMA_CM_EVENT_DEVICE_REMOVAL to each, same
> >    as cma_process_remove() does today. The difference: cma_device stays
> >    alive, so applications can reconnect on the same device after recovery
> >    instead of waiting for a new one to appear.
> >
> >    The motivation for keeping ib_device alive is that some RDMA consumers
> >    — DPDK and NCCL — don't use RDMA-CM at all. They use raw verbs and
> >    manage QP state themselves.
> 
> RDMA-CM provides an "external QP" model where the QP is managed by the
> rdma-cm user.
> 
> As Jason noted, you should propose the core changes together with the
> corresponding librdmacm updates. The final result must ensure that legacy
> applications continue to function correctly with the new kernel.
> 
> Thanks

Will send RFC patches.

Thank you,
Long
