Jon Mason wrote:
I am seeing some unusual behavior during the shutdown phase of ompi at the end 
of my testcase.  While running an IMB pingpong test over the rdmacm on openib, I 
get cq flush errors on my iWARP adapters.

This error is happening because the remote node is still polling the endpoint 
while the other one has shut down.  This occurs because iWARP puts the qps in error 
state when the channel is disconnected (IB does not do this).  Since the cq is 
still being polled when the event is received on the remote node, ompi thinks 
it hit an error and kills the run.  Since this is expected behavior on iWARP, 
this is not really an error case.


The key here, I think, is that when an iWARP QP moves out of RTS, all the RECVs and any pending SQ WRs get flushed. Further, disconnecting the iWARP connection forces the QP out of RTS. This is probably different from the way IB works, i.e., "disconnecting" in IB is an out-of-band exchange done by the IBCM. For iWARP, "disconnecting" is an in-band operation (a TCP close or abort), so the QP cannot remain in RTS during this process.
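
To make the difference concrete, here is a toy model of the behavior described above. It is only a sketch: the toy_* types and toy_disconnect() are made up for illustration and are not libibverbs or OMPI code, and the IB branch simply encodes the assumption stated above that the IBCM disconnect leaves the QP alone.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in types for illustration only; these are NOT the libibverbs types. */
enum toy_transport { TOY_IB, TOY_IWARP };
enum toy_qp_state  { TOY_RTS, TOY_ERROR };

struct toy_qp {
    enum toy_transport transport;
    enum toy_qp_state  state;
    int posted_recvs;   /* RECV WRs posted but not yet completed */
    int pending_sends;  /* SQ WRs posted but not yet completed */
};

/*
 * Model a connection disconnect.  Returns the number of flushed
 * completions the CQ would report afterwards.
 */
int toy_disconnect(struct toy_qp *qp)
{
    int flushed;

    assert(qp != NULL);

    if (qp->transport != TOY_IWARP) {
        /* IB, as described above: the disconnect is an out-of-band CM
         * exchange, so this step by itself does not touch the QP. */
        return 0;
    }

    /* iWARP: the disconnect is in-band (TCP close or abort), so the QP
     * leaves RTS and every outstanding WR is flushed. */
    qp->state = TOY_ERROR;
    flushed = qp->posted_recvs + qp->pending_sends;
    qp->posted_recvs = 0;
    qp->pending_sends = 0;
    return flushed;
}
```

With 4 posted RECVs and 2 pending sends, the iWARP case yields 6 flushed completions for whoever is still polling the CQ, while the IB case yields none from the disconnect alone.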

There is a larger question regarding why the remote node is still polling the HCA and not shutting down, but my immediate question is whether it is an acceptable fix to simply disregard this "error" if it is an iWARP adapter.
Opinions?

If the openib btl (or the layers above) assume the "disconnect" will notify the remote rank that the connection should be finalized, then we must deal with FLUSHED WRs for the iWARP case. If some sort of "finalizing" is done by OMPI and then the connections are disconnected, then that "finalizing" should include not polling the CQ anymore. But that's not what we observe.
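
One way "dealing with" the flushed WRs could look, as a rough sketch only: treat a flush completion as benign only when the endpoint is on iWARP and is already known to be going away. Everything named sketch_* below is hypothetical, including the disconnecting flag; the openib btl has no such field that I know of, and the enums are stand-ins rather than the real verbs definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for illustration; not the real verbs enums. */
enum sketch_status {
    SKETCH_WC_SUCCESS,
    SKETCH_WC_WR_FLUSH_ERR,
    SKETCH_WC_OTHER_ERR
};

enum sketch_transport { SKETCH_TRANSPORT_IB, SKETCH_TRANSPORT_IWARP };

/* Hypothetical per-endpoint state. */
struct sketch_endpoint {
    enum sketch_transport transport;
    bool disconnecting;  /* assumed to be set once teardown has started or a
                            disconnect event for this peer has been seen */
};

/*
 * Return true if a completion with this status should be dropped quietly
 * instead of being reported as a fatal error.
 */
bool sketch_flush_is_benign(const struct sketch_endpoint *ep,
                            enum sketch_status status)
{
    assert(ep != NULL);

    return status == SKETCH_WC_WR_FLUSH_ERR
        && ep->transport == SKETCH_TRANSPORT_IWARP
        && ep->disconnecting;
}
```

Compared with ignoring every flush error on iWARP, this keeps a flush on a connection that is supposed to be live visible as an error. Whether the btl can know reliably that an endpoint is disconnecting at that point is exactly the open question.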

Thanks,
Jon

The patch would look something like this:

===================================================================
--- ompi/mca/btl/openib/btl_openib_component.c  (revision 18362)
+++ ompi/mca/btl/openib/btl_openib_component.c  (working copy)
@@ -2062,6 +2062,11 @@
     if(endpoint && endpoint->endpoint_proc && endpoint->endpoint_proc->proc_ompi)
         remote_proc = endpoint->endpoint_proc->proc_ompi;

+    if (wc->status == IBV_WC_WR_FLUSH_ERR &&
+        IBV_TRANSPORT_IWARP == hca->ib_dev->transport_type) {
+        return;
+    }
+
     if(wc->status != IBV_WC_WR_FLUSH_ERR || !flush_err_printed[cq]++) {
         BTL_PEER_ERROR(remote_proc, ("error polling %s with status %s "
                     "status number %d for wr_id %llu opcode %d qp_idx %d",
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
