On Sep 1, 2010, at 4:47 PM, Steve Wise wrote:

> I was wondering what the logic is behind allowing an MPI job to continue in
> the presence of a fatal QP error?

It's a feature...?

> Note the "will try to continue" sentence:
> 
> --------------------------------------------------------------------------
> The OpenFabrics stack has reported a network error event.  Open MPI
> will try to continue, but your job may end up failing.
> 
>  Local host:        escher
>  MPI process PID:   19136
>  Error number:      1 (IBV_EVENT_QP_FATAL)
> 
> This error may indicate connectivity problems within the fabric;
> please contact your system administrator.
> --------------------------------------------------------------------------
> 
> Due to other bugs I'm chasing, I get these sorts of errors, and I notice the
> job just hangs even though the connections have been disconnected, the QPs
> flushed, and all pending WRs completed with status == FLUSH.

Would it be better to make it a fatal error?  (I'm thinking probably "yes")

Feel free to submit a patch...
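
For the record, here's a minimal sketch of what such a patch could do.  This is
illustrative only -- it is not the actual openib BTL event handler; the function
name, the exact set of events treated as fatal, and the call to abort() are all
assumptions -- but the idea would be to have the async event thread abort the
process instead of just printing the warning:

    /* Illustrative sketch only; not Open MPI's real async event handler. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <infiniband/verbs.h>

    static void *async_event_thread(void *arg)
    {
        struct ibv_context *ctx = (struct ibv_context *) arg;
        struct ibv_async_event event;

        while (ibv_get_async_event(ctx, &event) == 0) {
            switch (event.event_type) {
            case IBV_EVENT_QP_FATAL:
            case IBV_EVENT_CQ_ERR:
            case IBV_EVENT_DEVICE_FATAL:
                fprintf(stderr, "Fatal OpenFabrics event %s; aborting\n",
                        ibv_event_type_str(event.event_type));
                ibv_ack_async_event(&event);
                abort();            /* fail fast rather than hang */
            default:
                /* Non-fatal events: acknowledge and keep going. */
                ibv_ack_async_event(&event);
                break;
            }
        }
        return NULL;
    }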

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

