On 04/02/2013 02:00 PM, Roland Dreier wrote:
>> diff --git a/drivers/infiniband/hw/mlx4/qp.c
>> b/drivers/infiniband/hw/mlx4/qp.c
>> index 35cced2..0fa4f72 100644
>> --- a/drivers/infiniband/hw/mlx4/qp.c
>> +++ b/drivers/infiniband/hw/mlx4/qp.c
>> @@ -2216,6 +2216,9 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct
>> ib_send_wr *wr,
>>         __be32 blh;
>>         int i;
>>
>> +       if (pci_channel_offline(to_mdev(ibqp->device)->dev->pdev))
>> +               return -EIO;
>> +
>>         spin_lock_irqsave(&qp->sq.lock, flags);
>>
>>         ind = qp->sq_next_wqe;
>
> To pile on to what Or and Jack asked, why here?  Why not in post_recv?
> Why not in mlx4_en?  What about userspace consumers?  What if the
> error condition triggers just after the pci_channel_offline() check?
> What if a command is queued but a PCI error occurs before the
> completion can be returned?
>
> Is there some practical scenario where this change makes a difference?
>
> I would assume that in case of a PCI error, the driver would notice a
> catastrophic error and send that asynchronous event to consumers, who
> would know that commands might have been lost.
>
The problem I'm trying to solve is that some IB core modules hang
waiting on completion queues in their remove path during error
recovery. I added the PCI offline check in post_send, which seemed to
have solved the problem, but while running other tests I was able to
hit the bug again. Adding the check in post_recv also only hid the
problem for a few test cases.

Adding a check in mlx4_en doesn't make sense in this case, because the
problem occurs only with IB adapters. The Ethernet/RoCE adapters are
recovering fine; the check has already been added in the relevant
places in mlx4_core.

What async event should be sent to consumers before calling the remove
functions? IB_EVENT_DEVICE_FATAL, which mlx4_core currently sends in
case of a catastrophic error (but not in PCI error recovery), doesn't
seem to be handled by most of the registered event handlers. Sending
IB_EVENT_PORT_ERR seems to solve the problem for most modules, but
rdma_cm, which doesn't have an event handler, still hangs. Should we
implement an event handler for rdma_cm?

Thanks!

-- 
Kleber Sacilotto de Souza
IBM Linux Technology Center