On Tue, Mar 17, 2026 at 11:43:49PM +0000, Long Li wrote:
> >
> > On Fri, Mar 13, 2026 at 01:59:28PM -0300, Jason Gunthorpe wrote:
> > > On Sat, Mar 07, 2026 at 07:38:14PM +0200, Leon Romanovsky wrote:
> > > > On Fri, Mar 06, 2026 at 05:47:14PM -0800, Long Li wrote:
> > > > > When the MANA hardware undergoes a service reset, the ETH
> > > > > auxiliary device
> > > > > (mana.eth) used by DPDK persists across the reset cycle — it is
> > > > > not removed and re-added like RC/UD/GSI QPs. This means userspace
> > > > > RDMA consumers such as DPDK have no way of knowing that firmware
> > > > > handles for their PD, CQ, WQ, QP and MR resources have become stale.
> > > >
> > > > NAK to any of this.
> > > >
> > > > In case of hardware reset, mana_ib AUX device needs to be destroyed
> > > > and recreated later.
> > >
> > > Yeah, that is our general model for any serious RAS event where the
> > > driver's view of resources becomes out of sync with the HW.
> > >
> > > You have to tear down the ib_device by removing the aux and then bring
> > > back a new one.
> > >
> > > There is an IB_EVENT_DEVICE_FATAL, but the purpose of that event is to
> > > tell userspace to close and re-open their uverbs FD.
> > >
> > > We don't have a model where a uverbs FD in userspace can continue to
> > > work after the device has a catastrophic RAS event.
> > >
> > > There may be room to have a model where the ib device doesn't fully
> > > unplug/replug so it retains its name and things, but that is core code
> > > not driver stuff.
> >
> > Good luck with that model. It is going to break RDMA-CM hotplug support.
> >
>
> I think we can preserve RDMA-CM behavior without requiring ib_device
> unregister/re-register.
>
> On device reset, the driver can dispatch IB_EVENT_DEVICE_FATAL (or a
> new reset event) through ib_dispatch_event(). RDMA-CM already handles
> device events — we would add a handler that iterates all rdma_cm_ids
> on the device and sends RDMA_CM_EVENT_DEVICE_REMOVAL to each, same
> as cma_process_remove() does today. The difference: cma_device stays
> alive, so applications can reconnect on the same device after recovery
> instead of waiting for a new one to appear.
>
> The motivation for keeping ib_device alive is that some RDMA consumers
> — DPDK and NCCL — don't use RDMA-CM at all. They use raw verbs and
> manage QP state themselves.

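For reference, the flow proposed in the quoted text (a fatal device event
fans out as a removal event to every cm id while the device itself stays
registered) can be modeled roughly as below. This is a standalone userspace
sketch, not kernel code: every type and function name here (model_device,
model_cm_id, model_dispatch_event, ...) is hypothetical and only stands in
for the ib_device / rdma_cm_id / ib_dispatch_event() pieces named above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; NOT the kernel's ib_device, rdma_cm_id,
 * IB_EVENT_DEVICE_FATAL or RDMA_CM_EVENT_DEVICE_REMOVAL definitions. */
enum model_dev_event { MODEL_EVENT_NONE, MODEL_EVENT_DEVICE_FATAL };
enum model_cm_event  { MODEL_CM_NONE, MODEL_CM_DEVICE_REMOVAL };

#define MODEL_MAX_IDS 8

struct model_cm_id {
    enum model_cm_event last_event;   /* last event delivered to this id */
};

struct model_device {
    int registered;                   /* stays 1 across the reset in this model */
    size_t nr_ids;                    /* number of ids attached to the device */
    struct model_cm_id *ids[MODEL_MAX_IDS];
};

/* Attach a cm id to the device; returns 0 on success, -1 if the table is full. */
static int model_attach_id(struct model_device *dev, struct model_cm_id *id)
{
    assert(dev != NULL && id != NULL);
    if (dev->nr_ids >= MODEL_MAX_IDS)
        return -1;
    dev->ids[dev->nr_ids++] = id;
    return 0;
}

/* On a fatal event, deliver a removal event to every attached id.
 * The device is deliberately left registered. Returns the number of
 * ids that were notified. */
static size_t model_dispatch_event(struct model_device *dev,
                                   enum model_dev_event ev)
{
    size_t i, notified = 0;

    assert(dev != NULL);
    if (ev != MODEL_EVENT_DEVICE_FATAL)
        return 0;

    for (i = 0; i < dev->nr_ids; i++) {
        dev->ids[i]->last_event = MODEL_CM_DEVICE_REMOVAL;
        notified++;
    }
    return notified;
}
```

The model only shows the fan-out and the fact that the device is not torn
down. It says nothing about how stale firmware handles or existing uverbs
FDs would be handled, which is the point under dispute in this thread.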
RDMA-CM provides an "external QP" model where the QP is managed by the
rdma-cm user.

As Jason noted, you should propose the core changes together with the
corresponding librdmacm updates. The final result must ensure that legacy
applications continue to function correctly with the new kernel.

Thanks

