[OMPI users] large memory usage and hangs when preconnecting beyond 1000 cpus

2014-10-17 Thread Marshall Ward
I currently have a numerical model that, for reasons unknown, requires preconnection to avoid hanging on an initial MPI_Allreduce call. But when we try to scale out beyond around 1000 cores, we are unable to get past MPI_Init's preconnection phase. To test this, I have a basic C program containing

Re: [OMPI users] large memory usage and hangs when preconnecting beyond 1000 cpus

2014-10-18 Thread Ralph Castain
> On Oct 17, 2014, at 3:37 AM, Marshall Ward wrote:
>
> I currently have a numerical model that, for reasons unknown, requires
> preconnection to avoid hanging on an initial MPI_Allreduce call.

That is indeed odd - it might take a while for all the connections to form, but it shouldn’t hang
>

Re: [OMPI users] large memory usage and hangs when preconnecting beyond 1000 cpus

2014-10-20 Thread Marshall Ward
Thanks, it's at least good to know that the behaviour isn't normal! Could it be some sort of memory leak in the call? The code in ompi/runtime/ompi_mpi_preconnect.c looks reasonably safe, though maybe doing thousands of isend/irecv pairs is causing problems with the buffer used in ptp mes

Re: [OMPI users] large memory usage and hangs when preconnecting beyond 1000 cpus

2014-10-21 Thread Nathan Hjelm
At those sizes it is possible you are running into resource exhaustion issues. Some of the resource exhaustion code paths still lead to hangs. If the code does not need to be fully connected, I would suggest not using mpi_preconnect_mpi and instead tracking down why the initial MPI_Allreduce hangs. I w

Re: [OMPI users] large memory usage and hangs when preconnecting beyond 1000 cpus

2014-10-30 Thread Marshall Ward
Hi, I'm just following up on this to say that the problem was not related to preconnection itself, but to very large memory usage in jobs with high CPU counts. Preconnecting simply sent off a large number of isend/irecv messages, which triggered the memory consumption. I tried experimenting a bit with XRC
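
For a rough sense of scale, here is a minimal sketch in plain C (no MPI calls) of how many peers a single rank touches when a job is fully preconnected. It assumes a ring-style schedule in which each rank exchanges one message with the ranks at offsets 1..size/2 on either side; the thread does not confirm that ompi_mpi_preconnect.c works this way, and the function names `count_preconnect_peers` and `total_connections` are made up for illustration.

```c
#include <stdlib.h>

/* Hypothetical helper: count the distinct peers that `rank` exchanges a
 * message with when it walks offsets 1..size/2 around a ring, sending to
 * (rank + i) and receiving from (rank - i). Returns -1 on invalid input
 * or allocation failure. */
int count_preconnect_peers(int rank, int size)
{
    if (size < 1 || rank < 0 || rank >= size) {
        return -1;
    }

    unsigned char *seen = calloc((size_t)size, 1);
    if (seen == NULL) {
        return -1;
    }

    int peers = 0;
    for (int i = 1; i <= size / 2; ++i) {
        int next = (rank + i) % size;         /* peer we would isend to   */
        int prev = (rank - i + size) % size;  /* peer we would irecv from */

        if (!seen[next]) { seen[next] = 1; ++peers; }
        if (!seen[prev]) { seen[prev] = 1; ++peers; }
    }

    free(seen);
    return peers;
}

/* Hypothetical helper: number of unordered rank pairs in a fully
 * connected job of `size` ranks, i.e. size * (size - 1) / 2. */
long long total_connections(int size)
{
    return (long long)size * (size - 1) / 2;
}
```

Under that assumption, `count_preconnect_peers(0, 1024)` gives 1023 peers for one rank and `total_connections(1024)` gives 523776 pairs across the job, so any per-connection buffer cost is multiplied by roughly a thousand on every rank once the job passes 1000 cores.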