On Mon, 8 Jun 2026 05:08:24 -0700 Wei Hu <[email protected]> wrote:
> From: Wei Hu <[email protected]> > > Add support for handling hardware reset events in the MANA driver. > When the MANA kernel driver receives a hardware service event, it > initiates a device reset and notifies userspace via > IBV_EVENT_DEVICE_FATAL. The DPDK driver handles this by performing > an automatic teardown and recovery sequence. > > The reset flow has two phases. In the enter phase, running on the > EAL interrupt thread, the driver transitions the device state, > waits for data path threads to drain using per-queue atomic flags, > stops queues, tears down IB resources, and frees per-queue MR > caches. A control thread is then spawned to handle the exit phase: > it waits for the hardware to recover, unregisters the interrupt > handler, re-probes the PCI device, reinitializes MR caches, and > restarts queues. > > Each queue has an atomic burst_state variable where bit 0 is the > in-burst flag and bits 1+ encode device state. The data path uses > a single compare-and-swap (0 to 1) to enter a burst, which fails > immediately if the reset path has set any state bits. The reset > path sets state bits via atomic fetch-or and polls bit 0 to wait > for in-flight bursts to drain. This single-variable design avoids > the need for sequential consistency ordering. > > A per-device mutex serializes the reset path with ethdev > operations. The mutex uses PTHREAD_PROCESS_SHARED for multi-process > support and is held across blocking IB verbs calls. A trylock > helper encapsulates the lock acquisition and device state check > for all ethdev operation wrappers. Operations that cannot wait > (configure, queue setup) return -EBUSY during reset, while > dev_stop and dev_close join the reset thread before acquiring > the lock to ensure proper sequencing. A CAS-based helper prevents > double-join of the reset thread. > > Multi-process support is included: secondary processes unmap and > remap doorbell pages via IPC during the reset enter and exit > phases. Data path functions in both primary and secondary > processes check the device state atomically and return early when > the device is not active. > > The driver emits RTE_ETH_EVENT_ERR_RECOVERING before entering the > reset path so that upper layers (e.g. netvsc) can switch their > data path before queues are stopped. The event is emitted outside > the reset lock to avoid deadlock if the callback calls dev_stop or > dev_close. On completion, the driver emits RECOVERY_SUCCESS or > RECOVERY_FAILED. If the enter phase fails internally, > RECOVERY_FAILED is sent immediately so the application receives a > terminal event. A PCI device removal event callback distinguishes > hot-remove from service reset. > > Documentation for the device reset feature is added in the MANA > NIC guide and the 26.07 release notes. > > Signed-off-by: Wei Hu <[email protected]> > --- This is is a rather complex set of state transitions so admit to relying on AI as backup for tracking this. It still sees some errors here. Worth asking the question, "what does mlx5 do?" and "should DPDK EAL be doing this at the PCI layer instead?" --- Much better - this addresses the RCU, the macros, the thread-safety-analysis suppressions, and the callback-under-lock deadlock. The single-variable burst_state CAS is a clean way to do the drain and the acquire/release reasoning checks out. One structural thing remains. The enter phase still runs the heavy teardown on the EAL interrupt thread under reset_ops_lock: dev_stop, then mana_mp_req_on_rxtx(RESET_ENTER) which is a blocking rte_mp_request_sync with a multi-second timeout, then dev_close with its ibv calls. You already moved the exit phase to a control thread because intr_callback_unregister cannot run on the interrupt thread; the same argument applies to blocking IPC and verbs teardown. A slow or absent secondary will stall the interrupt thread for the MP timeout, and this is the blocking-under-a-sleeping-mutex pattern. Please have the interrupt handler just set state and drain, then hand the rest of the teardown to the control thread. That also removes the last lock hand-off between functions/threads, so each function can own its lock. Smaller points: - RECOVERY_SUCCESS/FAILED are emitted from the reset thread. If the callback calls dev_stop/dev_close, mana_join_reset_thread() joins the current thread (EDEADLK, leaked handle). INTR_RMV is fine since it runs on the dev-event thread. - The burst_state comment says bits 1+ encode device state, but only RESET_ENTER<<1 is ever stored - it is effectively a single "blocked" flag.

