> 
> On Fri, Mar 13, 2026 at 01:59:28PM -0300, Jason Gunthorpe wrote:
> > On Sat, Mar 07, 2026 at 07:38:14PM +0200, Leon Romanovsky wrote:
> > > On Fri, Mar 06, 2026 at 05:47:14PM -0800, Long Li wrote:
> > > > When the MANA hardware undergoes a service reset, the ETH auxiliary
> > > > device (mana.eth) used by DPDK persists across the reset cycle; it is
> > > > not removed and re-added like RC/UD/GSI QPs. This means userspace
> > > > RDMA consumers such as DPDK have no way of knowing that firmware
> > > > handles for their PD, CQ, WQ, QP and MR resources have become stale.
> > >
> > > NAK to any of this.
> > >
> > > In case of hardware reset, mana_ib AUX device needs to be destroyed
> > > and recreated later.
> >
> > Yeah, that is our general model for any serious RAS event where the
> > driver's view of resources becomes out of sync with the HW.
> >
> > You have to tear down the ib_device by removing the aux and then bring
> > back a new one.
> >
> > There is an IB_EVENT_DEVICE_FATAL, but the purpose of that event is to
> > tell userspace to close and re-open their uverbs FD.
> >
> > We don't have a model where a uverbs FD in userspace can continue to
> > work after the device has a catastrophic RAS event.
> >
> > There may be room to have a model where the ib device doesn't fully
> > unplug/replug so it retains its name and things, but that is core code
> > not driver stuff.
> 
> Good luck with that model. It is going to break RDMA-CM hotplug support.
> 

   I think we can preserve RDMA-CM behavior without requiring ib_device
   unregister/re-register.

   On device reset, the driver can dispatch IB_EVENT_DEVICE_FATAL (or a
   new reset event) through ib_dispatch_event(). RDMA-CM already handles
   device events; we would add a handler that iterates over all
   rdma_cm_ids on the device and sends RDMA_CM_EVENT_DEVICE_REMOVAL to
   each, the same as cma_process_remove() does today. The difference is
   that the cma_device stays alive, so applications can reconnect on the
   same device after recovery instead of waiting for a new one to appear.
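
   To make the intended flow concrete, here is a rough userspace model
   of it. This is not kernel code and not a patch: every model_* name is
   made up for illustration, and the real cma.c locking and refcounting
   are left out. It only shows the ordering: each bound id is told the
   device went away, while the device entry itself stays registered so
   that ids can bind to it again after recovery.

```c
/* Userspace model only. All model_* names are hypothetical. */
#include <assert.h>
#include <stddef.h>

enum model_cm_event {
	MODEL_CM_EVENT_NONE,
	MODEL_CM_EVENT_DEVICE_REMOVAL,
};

struct model_cm_id {
	enum model_cm_event last_event;	/* last event delivered to this id */
	struct model_cm_id *next;	/* next id bound to the same device */
};

struct model_cma_device {
	int registered;			/* stays 1 across a reset in this model */
	struct model_cm_id *id_list;	/* ids currently bound to the device */
};

/* Bind an id to the device by pushing it onto the device's list. */
static void model_bind_id(struct model_cma_device *dev, struct model_cm_id *id)
{
	assert(dev && id);
	id->last_event = MODEL_CM_EVENT_NONE;
	id->next = dev->id_list;
	dev->id_list = id;
}

/*
 * Reset handler: deliver DEVICE_REMOVAL to every bound id and detach it,
 * but leave dev->registered untouched. Returns the number of ids notified.
 */
static int model_handle_device_reset(struct model_cma_device *dev)
{
	int notified = 0;

	assert(dev);
	while (dev->id_list) {
		struct model_cm_id *id = dev->id_list;

		dev->id_list = id->next;
		id->next = NULL;
		id->last_event = MODEL_CM_EVENT_DEVICE_REMOVAL;
		notified++;
	}
	return notified;
}
```

   After model_handle_device_reset() returns, the id list is empty and
   the device is still registered, so a later model_bind_id() on the
   same device succeeds. That is the only behavioral difference from a
   full remove in this model.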

   The motivation for keeping the ib_device alive is that some RDMA
   consumers, DPDK and NCCL among them, don't use RDMA-CM at all. They
   use raw verbs and manage QP state themselves. For these users, a
   persistent ib_device with IB_EVENT_PORT_ERR / IB_EVENT_PORT_ACTIVE
   notifications enables reliable in-place recovery without reopening
   the device.
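
   As a sketch of the consumer side, the small state machine below
   models how a raw verbs application might react to those two events.
   It is a standalone model with made-up model_* names, not DPDK or NCCL
   code. A real consumer would presumably read the events with
   ibv_get_async_event(), and the assumption that the application
   re-arms its own QPs once the port comes back is mine.

```c
/* Userspace model only. All model_* names are hypothetical. */
#include <assert.h>

enum model_port_event {
	MODEL_PORT_ERR,
	MODEL_PORT_ACTIVE,
};

enum model_state {
	MODEL_RUNNING,
	MODEL_WAITING_FOR_PORT,
};

struct model_consumer {
	enum model_state state;
	int device_opens;	/* times the device was opened; never bumped here */
	int qp_rearms;		/* times QPs were re-armed after a recovery */
};

static void model_consumer_init(struct model_consumer *c)
{
	assert(c);
	c->state = MODEL_RUNNING;
	c->device_opens = 1;
	c->qp_rearms = 0;
}

/*
 * PORT_ERR: stop posting work and wait.
 * PORT_ACTIVE: if we were waiting, re-arm QPs (assumed to be the
 * application's job) and resume, without reopening the device.
 */
static void model_consumer_on_event(struct model_consumer *c,
				    enum model_port_event ev)
{
	assert(c);
	switch (ev) {
	case MODEL_PORT_ERR:
		c->state = MODEL_WAITING_FOR_PORT;
		break;
	case MODEL_PORT_ACTIVE:
		if (c->state == MODEL_WAITING_FOR_PORT) {
			c->qp_rearms++;
			c->state = MODEL_RUNNING;
		}
		break;
	}
}
```

   The point of the model is that device_opens stays at 1 through the
   whole PORT_ERR / PORT_ACTIVE cycle. A PORT_ACTIVE that arrives while
   the consumer is already running is ignored.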

   This matters especially for PCI DPC recovery, which is becoming
   critical for large-scale GPU/storage deployments. See this talk for
   context on the value of surviving DPC events:
   https://www.youtube.com/watch?v=TpNNeMGEsdU&t=1619s

   Today a DPC event on one NIC kills all RDMA connections and can
   crash entire training jobs. If the ib_device persists and the driver
   recreates firmware resources after recovery, raw verbs users can
   resume without full teardown, and RDMA-CM users get the same
   disconnect/reconnect behavior they have today.

Thanks,
Long
