I was wondering what the logic is behind allowing an MPI job to continue
in the presence of a fatal QP error?
Note the "will try to continue" sentence:
--------------------------------------------------------------------------
The OpenFabrics stack has reported a network error event. Open MPI
will try to continue, but your job may end up failing.
Local host: escher
MPI process PID: 19136
Error number: 1 (IBV_EVENT_QP_FATAL)
This error may indicate connectivity problems within the fabric;
please contact your system administrator.
--------------------------------------------------------------------------
Due to other bugs I'm chasing, I get these sorts of errors, and I notice
that the job just hangs even though the connections have been disconnected,
the QPs flushed, and all pending WRs completed with status == FLUSH.
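
To make the question concrete, here is a minimal sketch of the alternative
policy: treat a fatal QP event as a reason to abort instead of continuing.
This is not Open MPI's code. Every sketch_* name is hypothetical, and the
enums are local stand-ins so the snippet builds without libibverbs; real code
would include <infiniband/verbs.h> and use IBV_EVENT_QP_FATAL and
IBV_WC_WR_FLUSH_ERR directly.

```c
/* Local stand-ins (assumption): mirror the verbs constants by name only,
 * so this sketch compiles without libibverbs installed. */
enum sketch_event { SKETCH_EVENT_QP_FATAL = 1, SKETCH_EVENT_OTHER = 99 };
enum sketch_wc_status { SKETCH_WC_SUCCESS = 0, SKETCH_WC_WR_FLUSH_ERR = 5 };

enum sketch_action { SKETCH_CONTINUE, SKETCH_ABORT };

/* Hypothetical policy: a fatal QP event means the connection is gone,
 * so ask the caller to abort the job; any other event lets it continue. */
enum sketch_action sketch_on_async_event(enum sketch_event ev)
{
    return ev == SKETCH_EVENT_QP_FATAL ? SKETCH_ABORT : SKETCH_CONTINUE;
}

/* Returns 1 if there is at least one completion and every one of them
 * carries the flush status, i.e. nothing on this QP can still succeed. */
int sketch_all_flushed(const enum sketch_wc_status *st, int n)
{
    for (int i = 0; i < n; i++) {
        if (st[i] != SKETCH_WC_WR_FLUSH_ERR)
            return 0;
    }
    return n > 0;
}
```

With a policy like this, the case I describe (fatal QP event, then every
pending WR completing as flushed) would end the job rather than leave it
hanging.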
Thoughts?
Thanks,
Steve.