Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider: - [High] Missing re-allocation of `apc->rxqs` during interface bring-up causes deterministic NULL pointer dereference - [High] Uninitialized queue handles (0 and INVALID_MANA_HANDLE) are sent to the firmware during error teardown, corrupting hardware state - [High] GDMA admin commands are issued immediately after a PCI Function Level Reset (FLR), causing severe cascading timeouts and hardware state violations --
commit 2129283d37aabfe724d7509e34346eab93d302a2 Author: Dipayaan Roy <[email protected]> net: mana: Add NULL guards in teardown path to prevent panic on attach failure When queue allocation fails partway through, the error cleanup frees and NULLs apc->tx_qp and apc->rxqs. The patch adds NULL guards for these pointers in teardown paths such as mana_destroy_vport(), mana_fence_rqs(), and mana_dealloc_queues() to prevent NULL pointer dereferences. > diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c > b/drivers/net/ethernet/microsoft/mana/mana_en.c > index 9afc786b297a8..9e7e4bf526bf6 100644 > --- a/drivers/net/ethernet/microsoft/mana/mana_en.c > +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c [ ... ] > @@ -2858,13 +2861,16 @@ static void mana_destroy_vport(struct > mana_port_context *apc) > struct mana_rxq *rxq; > u32 rxq_idx; > > - for (rxq_idx = 0; rxq_idx < apc->num_queues; rxq_idx++) { > - rxq = apc->rxqs[rxq_idx]; > - if (!rxq) > - continue; > + if (apc->rxqs) { > > - mana_destroy_rxq(apc, rxq, true); > - apc->rxqs[rxq_idx] = NULL; > + for (rxq_idx = 0; rxq_idx < apc->num_queues; rxq_idx++) { > + rxq = apc->rxqs[rxq_idx]; > + if (!rxq) > + continue; > + > + mana_destroy_rxq(apc, rxq, true); > + apc->rxqs[rxq_idx] = NULL; > + } > } If the teardown path frees apc->rxqs and sets it to NULL on attach failure, how does the interface recover if the user subsequently brings it back up? When mana_open() is invoked, it directly calls mana_alloc_queues(). Inside mana_alloc_queues() -> mana_add_rx_queues(), the code does: apc->rxqs[i] = rxq; If apc->rxqs was left as NULL during the teardown, will this cause a NULL pointer dereference on the next interface bring-up? [ ... ] > @@ -3287,33 +3294,38 @@ static int mana_dealloc_queues(struct net_device > *ndev) > * number of queues. > */ > > - for (i = 0; i < apc->num_queues; i++) { > - txq = &apc->tx_qp[i].txq; > - tsleep = 1000; > - while (atomic_read(&txq->pending_sends) > 0 && > - time_before(jiffies, timeout)) { > - usleep_range(tsleep, tsleep + 1000); > - tsleep <<= 1; > - } > - if (atomic_read(&txq->pending_sends)) { > - err = pcie_flr(to_pci_dev(gd->gdma_context->dev)); > - if (err) { > - netdev_err(ndev, "flr failed %d with %d pkts > pending in txq %u\n", > - err, > atomic_read(&txq->pending_sends), > - txq->gdma_txq_id); > + if (apc->tx_qp) { In the error path where queue creation fails partway through, will mana_destroy_txq() be called? If mana_destroy_txq() unconditionally iterates over all apc->num_queues: for (i = 0; i < apc->num_queues; i++) { ... mana_destroy_wq_obj(apc, GDMA_SQ, apc->tx_qp[i].tx_object); ... Will it pass 0 or INVALID_MANA_HANDLE to mana_destroy_wq_obj() for queues that were never fully initialized? Does mana_destroy_wq_obj() send a MANA_DESTROY_WQ_OBJ command to the hardware with these uninitialized handles, and could handle 0 corrupt the firmware state? > + for (i = 0; i < apc->num_queues; i++) { > + txq = &apc->tx_qp[i].txq; > + tsleep = 1000; > + while (atomic_read(&txq->pending_sends) > 0 && > + time_before(jiffies, timeout)) { > + usleep_range(tsleep, tsleep + 1000); > + tsleep <<= 1; > + } > + if (atomic_read(&txq->pending_sends)) { > + err = > + pcie_flr(to_pci_dev(gd->gdma_context->dev)); If pcie_flr() is triggered here, the PCI Function Level Reset wipes the internal GDMA state. Further down in this function, mana_disable_vport_rx(), mana_fence_rqs(), and mana_destroy_vport() are called. Will these functions attempt to send commands to the now-destroyed GDMA admin queue via mana_send_request() and wait for completions? Could this lead to cascading timeouts since the hardware was just reset and cannot process admin commands? > + if (err) { > + netdev_err(ndev, "flr failed %d with %d > pkts pending in txq %u\n", > + err, > + atomic_read(&txq->pending_sends), > + txq->gdma_txq_id); > + } > + break; > } > - break; > } -- Sashiko AI review ยท https://sashiko.dev/#/patchset/[email protected]?part=1
