On Dec 16, 2010, at 3:29 AM, Gilbert Grosdidier wrote:

>> Does this problem *always* happen, or does it only happen once in a great 
>> while?
>> 
> gg= No, this problem happens rather often, almost every other time.
> Seems to happen more often as the number of cores increases.

Well that's a bummer -- it seems to indicate that this may be a problem in OMPI.

Are you running multiple OMPI jobs concurrently?  More specifically, are you 
starting multiple jobs on the same machine more-or-less at the same time?  I'm 
wondering if our TCP startup mechanism is somehow accidentally getting the TCP 
ports from a different job.  I can't imagine how that would be happening, but...

> gg= Is there a way, with the current code, to direct Open MPI to use a
> restricted range of TCP ports that I can choose at launch time?

Yes.  In OMPI v1.4, there are two MCA params:

oob_tcp_port_min_v4 (default: 0)
oob_tcp_port_range_v4 (default: 65536)

Try setting these values to mutually exclusive ranges for each of your jobs and 
see if that fixes the problem.  Keep in mind that user-level ports start at 
1024, so your lowest range might as well start at 1024.
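
To make that concrete, here is a minimal sketch of one way to hand each 
concurrently launched job its own port window.  Only the two MCA parameter 
names come from the list above; the helper name port_args, the base port, the 
window width, and the "-np 16 ./my_app" command line are made up for 
illustration.  It also assumes the range parameter is a count of ports starting 
at the minimum, which is worth double-checking with ompi_info on your install.

```python
# Hypothetical helper: give each concurrently launched job its own TCP port window.
# Assumption: oob_tcp_port_range_v4 is the number of ports starting at
# oob_tcp_port_min_v4. Base port and window width are illustrative values.

def port_args(job_index, base=1024, width=1000):
    """Return mpirun arguments restricting the OOB TCP ports for one job."""
    port_min = base + job_index * width
    if port_min + width > 65536:
        raise ValueError("port window would run past port 65535")
    return [
        "--mca", "oob_tcp_port_min_v4", str(port_min),
        "--mca", "oob_tcp_port_range_v4", str(width),
    ]


if __name__ == "__main__":
    for job in range(3):
        # "-np 16" and "./my_app" are placeholders for your real job.
        print(" ".join(["mpirun"] + port_args(job) + ["-np", "16", "./my_app"]))
```

Each job index gets a window that does not overlap its neighbors', so two jobs 
started at the same time on the same node should not be able to pick the same 
port.  The same parameters can also be set through the environment (e.g., 
OMPI_MCA_oob_tcp_port_min_v4) instead of the mpirun command line.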

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

