I was wondering what the logic is behind allowing an MPI job to continue in the presence of a fatal QP error.

Note the "will try to continue" sentence:


--------------------------------------------------------------------------
The OpenFabrics stack has reported a network error event.  Open MPI
will try to continue, but your job may end up failing.

  Local host:        escher
  MPI process PID:   19136
  Error number:      1 (IBV_EVENT_QP_FATAL)

This error may indicate connectivity problems within the fabric;
please contact your system administrator.
--------------------------------------------------------------------------


Due to other bugs I'm chasing, I get these sorts of errors, and I notice that the job just hangs, even though the connections have been disconnected, the QPs flushed, and all pending WRs completed with status == FLUSH.

Thoughts?

Thanks,

Steve.
