On Dec 15, 2010, at 1:11 PM, Gilbert Grosdidier wrote:

> Ralph,
>
> I am not using the TCP BTL, only the OPENIB one. Does this change the number
> of sockets in use per node, please?

I believe the openib btl opens sockets for connection purposes, so the count is likely the same. An IB person can confirm that...

> But I suspect the ORTE daemons are communicating only through TCP anyway,
> right?

Yes

> Also, is there anybody in the Open MPI team using an SGI Altix cluster with a
> high number of nodes (1k nodes, i.e. 8k cores) whom I could ask about the
> right setup?

Not sure, but 1k nodes isn't particularly large - we launch much larger clusters without problem.

My guess is that you have some zombie procs running that are messing up the communication. You said you had a lot of crashed jobs that might have left zombies in their wake. Can you clean those up?
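
A minimal cleanup sketch, assuming password-less ssh to the workers and a
hypothetical "nodelist" file naming them - not a prescribed procedure, just one
way to hunt for leftovers with standard tools plus the orte-clean helper that
ships with Open MPI:

  # list, then kill, any leftover ORTE daemons owned by you on each node
  # ($USER expands on the remote side because of the single quotes)
  for node in $(cat nodelist); do
    ssh "$node" 'pgrep -u $USER -fl orted; pkill -u $USER -f orted'
  done

  # if orte-clean is in the PATH on the workers, it also removes stale
  # session directories left behind by crashed jobs
  for node in $(cat nodelist); do
    ssh "$node" orte-clean
  done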

> Thanks, Best, G.
>
>
> On 15/12/2010 21:03, Ralph Castain wrote:
>> On Dec 15, 2010, at 12:30 PM, Gilbert Grosdidier wrote:
>>
>>> Good evening Ralph,
>>>
>>> On 15/12/2010 18:45, Ralph Castain wrote:
>>>> It looks like all the messages are flowing within a single job (all three
>>>> processes mentioned in the error have the same identifier). The only
>>>> possibility I can think of is that somehow you are reusing ports - is it
>>>> possible your system doesn't have enough ports to support all the procs?
>>>>
>>> It seems there is a range of almost 30k ports available on every worker node:
>>>> ssh r33i0n0 cat /proc/sys/net/ipv4/ip_local_port_range
>>> 32768 61000
>>>
>>> This is, AFAIK, the only way I can get info about this.
>>> Are these 30k ports enough?
>> Depends on how many nodes there are in your system.
>>
>>> The question is: is Open MPI opening ports from every node towards every
>>> other node? In such a case I could figure out why it would run out of ports
>>> when I increase the number of nodes.
>> Yes - in two ways:
>>
>> 1. Each ORTE daemon opens a port to every other daemon in the system. Thus,
>> you need at least M ports if your job is running across M nodes.
>>
>> 2. Each MPI process will open a direct port to any other MPI process that it
>> communicates with. So if you have N processes on a node, and they only
>> communicate with the 8 nearest neighbor nodes (each of which has N
>> processes), and you are using the TCP btl, then you will consume an
>> additional 8*N*N sockets on each node.
>>
>>> But: is there a possibility (MCA param?) to prevent Open MPI from opening
>>> so many ports? Indeed, apart from the rank 0 node, every MPI process will
>>> need to communicate with ONLY the 8 (nearest) neighbour nodes. So, there
>>> should be a switch somewhere telling Open MPI to open a port ONLY when
>>> needed, but I did not find it among the ompi_info stuff ;-)
>> It always only opens a port when something tries to communicate - we never
>> open ports in advance.
>>
>>> Which one is it?
>>>
>>> Thanks, Best, G.
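
As a rough sanity check on the port question, one can count the TCP sockets
actually open on a worker node and compare against the ephemeral range shown
above (32768-61000, about 28k ports). Plugging the numbers from this thread
into the formula quoted above - roughly 1k daemon-to-daemon connections for 1k
nodes, plus 8*N*N = 8*8*8 = 512 per-node BTL sockets if N = 8 procs/node were
using the TCP btl - stays far below that limit. A sketch, reusing the r33i0n0
node name from the earlier example:

  # count all TCP sockets currently open on the node
  ssh r33i0n0 'netstat -tan | wc -l'

  # or read the kernel table directly (the first line is a header)
  ssh r33i0n0 'wc -l < /proc/net/tcp'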
