On Sep 1, 2010, at 4:47 PM, Steve Wise wrote:

> I was wondering what the logic is behind allowing an MPI job to
> continue in the presence of a fatal qp error?
It's a feature...?

> Note the "will try to continue" sentence:
>
> --------------------------------------------------------------------------
> The OpenFabrics stack has reported a network error event.  Open MPI
> will try to continue, but your job may end up failing.
>
>   Local host:        escher
>   MPI process PID:   19136
>   Error number:      1 (IBV_EVENT_QP_FATAL)
>
> This error may indicate connectivity problems within the fabric;
> please contact your system administrator.
> --------------------------------------------------------------------------
>
> Due to other bugs I'm chasing, I get these sorts of errors, and I notice
> the job just hangs even though the connections have been disconnected,
> the qps flushed, and all pending WRs completed with status == FLUSH.

Would it be better to make it a fatal error?  (I'm thinking probably "yes")

Feel free to submit a patch...

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/