On Apr 6, 2010, at 1:03 PM, Sam Preston wrote:

> I have a problem with the cluster I'm currently using where nodes
> 'hang' silently from time to time during an MPI call.  This causes the
> blocked MPI processes to block indefinitely -- the only way to detect
> an error is to notice that no more output is being written to the log
> files.  We're trying to resolve the underlying cause of the nodes
> hanging, but in the meantime, is there a way to set a timeout or
> something similar to detect this situation?  Sorry if this has been
> addressed before, I searched the FAQ and archives and didn't come up
> with anything.

Unfortunately, no.  MPI doesn't actively check whether an application has 
deadlocked (although there are third-party tools for this kind of thing -- 
google around for them).  Alternatively, something may have gone wrong that 
Open MPI is not detecting properly.  Hopefully, it's not an Open MPI bug!
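
In the meantime, one workaround outside of MPI is to automate what you are 
already doing by hand: watch the log files and flag a hang when they stop 
growing.  The following is only a minimal sketch under stated assumptions, 
not anything Open MPI provides.  It assumes each process writes to its own 
log file at least once every timeout_s seconds while healthy; the function 
names, the file names, and the 300-second timeout are all hypothetical.

```python
import os
import time


def seconds_since_update(path, now=None):
    """Return how many seconds ago `path` was last modified."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(path)


def stalled_logs(paths, timeout_s, now=None):
    """Return the log files that have been silent for more than timeout_s."""
    return [p for p in paths if seconds_since_update(p, now) > timeout_s]


def watch(paths, timeout_s, poll_s=30.0):
    """Poll the logs until at least one goes quiet; return the stalled paths.

    What to do next (send mail, kill the job, ...) is left to the caller.
    """
    while True:
        stalled = stalled_logs(paths, timeout_s)
        if stalled:
            return stalled
        time.sleep(poll_s)
```

For example, watch(["rank0.log", "rank1.log"], timeout_s=300) returns once 
either file has gone five minutes without a write.  The timeout has to be 
longer than the longest quiet stretch a healthy run can have, or a long 
compute phase will look like a hang.  Modification times on network 
filesystems can also lag, so treat the result as a hint, not a diagnosis.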

I wish I had more helpful information for you -- let us know what you find 
about the underlying cause.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

