On 04/02/2013 02:00 PM, Roland Dreier wrote:
>> diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
>> index 35cced2..0fa4f72 100644
>> --- a/drivers/infiniband/hw/mlx4/qp.c
>> +++ b/drivers/infiniband/hw/mlx4/qp.c
>> @@ -2216,6 +2216,9 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
>>         __be32 blh;
>>         int i;
>>
>> +       if (pci_channel_offline(to_mdev(ibqp->device)->dev->pdev))
>> +               return -EIO;
>> +
>>         spin_lock_irqsave(&qp->sq.lock, flags);
>>
>>         ind = qp->sq_next_wqe;
> 
> To pile on to what Or and Jack asked, why here?  Why not in post_recv?
>  Why not in mlx4_en?  What about userspace consumers?  What if the
> error condition triggers just after the pci_channel_offline() check?
> What if a command is queued but a PCI error occurs before the
> completion can be returned?
> 
> Is there some practical scenario where this change makes a difference?
> 
> I would assume that in case of a PCI error, the driver would notice a
> catastrophic error and send that asynchronous event to consumers, who
> would know that commands might have been lost.
> 

The problem I'm trying to solve is that some IB core modules hang
waiting on completion queues in their remove path during error
recovery. I added the PCI offline check in post_send, which seemed to
have solved the problem, but while running other tests I was able to
hit the bug again. Adding the check in post_recv as well only hid the
problem for a few test cases.

Adding a check in mlx4_en doesn't make sense in this case, because the
problem is only with IB adapters. The Ethernet/RoCE adapters recover
fine; the check has already been added in the relevant places in
mlx4_core.

What async event should be sent to consumers before calling the remove
functions? IB_EVENT_DEVICE_FATAL, which is currently sent by mlx4_core
in case of catastrophic error (but not in PCI error recovery), doesn't
seem to be handled by most of the event handlers registered. Sending
IB_EVENT_PORT_ERR seems to solve the problem for most modules, but
rdma_cm, which doesn't have an event handler, is still hanging. Should
we implement an event handler for rdma_cm?
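
To make the question concrete, here is a minimal userspace sketch of
the behaviour described above. It is only a model, not kernel code:
every name in it (model_event, model_consumer, model_port_err_only,
model_dispatch) is hypothetical, and it assumes a consumer stops
waiting on its completion queue only when a registered handler reacts
to the event it receives.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
enum model_event {
	MODEL_EVENT_PORT_ERR,
	MODEL_EVENT_DEVICE_FATAL,
};

struct model_consumer {
	const char *name;
	bool waiting_on_cq;	/* true while blocked waiting for completions */
	/* NULL models a consumer that registered no event handler */
	void (*handler)(struct model_consumer *c, enum model_event ev);
};

/*
 * Models a handler that reacts to a port error but ignores a fatal
 * device event, as described for most of the registered handlers.
 */
static void model_port_err_only(struct model_consumer *c, enum model_event ev)
{
	if (ev == MODEL_EVENT_PORT_ERR)
		c->waiting_on_cq = false;
}

/*
 * Deliver an event to every consumer that has a handler and return
 * how many consumers are still waiting afterwards.
 */
static size_t model_dispatch(struct model_consumer *consumers, size_t n,
			     enum model_event ev)
{
	size_t still_waiting = 0;
	size_t i;

	assert(consumers != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (consumers[i].handler)
			consumers[i].handler(&consumers[i], ev);
		if (consumers[i].waiting_on_cq)
			still_waiting++;
	}
	return still_waiting;
}
```

In this model, a consumer with model_port_err_only stays blocked after
MODEL_EVENT_DEVICE_FATAL and is released by MODEL_EVENT_PORT_ERR, while
a consumer with a NULL handler stays blocked in both cases.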


Thanks!

-- 
Kleber Sacilotto de Souza
IBM Linux Technology Center
