On Apr 27, 2011, at 2:46 PM, Ralph Castain wrote:

> Actually, I understood you correctly. I'm just saying that I find no evidence 
> in the code that we try three times before giving up. What I see is a single 
> attempt to bind the port - if it fails, then we abort. There is no parameter 
> to control that behavior.
> 
> So if the OS hasn't released the port by the time a new job starts on that 
> node, then it will indeed abort if the job was unfortunately given the same 
> port reservation.

FWIW, the OS may be trying multiple times under the covers, but as far as 
OMPI is concerned, we're just trying once.

OMPI asks for whatever port the OS has free (i.e., we pass in 0 where we would 
otherwise ask for a specific port number, and the OS fills it in for us).  If 
it gives us back a port that isn't actually available, that would be really 
surprising.
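
To illustrate the "pass in 0 and let the OS fill it in" idea, here is a minimal 
sketch in Python. It is not OMPI's actual code; the function name 
bind_ephemeral and the loopback address are made up for the example. It makes 
a single bind attempt with port 0 and then asks the socket which port the OS 
chose.

```python
import socket


def bind_ephemeral(host="127.0.0.1"):
    """Bind a TCP socket to port 0 and return (socket, port picked by the OS).

    Hypothetical helper for illustration only; not taken from OMPI.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Port 0 means "give me any free port". This is one attempt only:
        # if bind() fails, the error propagates to the caller, no retry.
        sock.bind((host, 0))
    except OSError:
        sock.close()
        raise
    sock.listen(1)
    # getsockname() reports the port number the OS filled in for us.
    return sock, sock.getsockname()[1]


if __name__ == "__main__":
    s, port = bind_ephemeral()
    print("OS assigned port", port)
    s.close()
```

While two such sockets are both open and listening on the same address, the OS 
has to hand out two different ports, which is why getting back an unusable one 
would be surprising.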

If you have a bajillion short jobs running, I wonder if there's some kind of 
race condition occurring in which some MPI processes are getting messages from 
the wrong mpirun.  And then things go downhill from there.

I can't immediately imagine how that would happen, but maybe there's some kind 
of weird race condition in there somewhere...?  We pass specific IP addresses 
and ports around on the command line, though, so I don't quite see how that 
would happen...

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

