Thanks, it's at least good to know that the behaviour isn't normal! Could it
be some sort of memory leak in the call? The code in
ompi/runtime/ompi_mpi_preconnect.c looks reasonably safe, though maybe doing
thousands of isend/irecv pairs is causing problems with the buffer used in
ptp messages? I'm trying to see if valgrind can see anything, but nothing
from ompi_init_preconnect_mpi is coming up (although there are some other
warnings).

On Sun, Oct 19, 2014 at 2:37 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> On Oct 17, 2014, at 3:37 AM, Marshall Ward <marshall.w...@gmail.com> wrote:
>>
>> I currently have a numerical model that, for reasons unknown, requires
>> preconnection to avoid hanging on an initial MPI_Allreduce call.
>
> That is indeed odd - it might take a while for all the connections to form,
> but it shouldn’t hang
>
>> But when we try to scale out beyond around 1000 cores, we are unable to
>> get past MPI_Init's preconnection phase.
>>
>> To test this, I have a basic C program containing only MPI_Init() and
>> MPI_Finalize() named `mpi_init`, which I compile and run using `mpirun
>> -mca mpi_preconnect_mpi 1 mpi_init`.
>
> I doubt preconnect has been tested in a rather long time as I’m unaware of
> anyone still using it (we originally provided it for some legacy code that
> otherwise took a long time to initialize). However, I could give it a try
> and see what happens. FWIW: because it was so targeted and hasn’t been used
> in a long time, the preconnect algo is really not very efficient. Still, it
> shouldn’t have anything to do with memory footprint.
>
>> This preconnection seems to consume a large amount of memory, and is
>> exceeding the available memory on our nodes (~2 GiB/core) as the number
>> gets into the thousands (~4000 or so). If we try to preconnect to
>> around ~6000, we start to see hangs and crashes.
>>
>> A failed 5600-core preconnection gave this warning (~10k times) while
>> hanging for 30 minutes:
>>
>> [warn] opal_libevent2021_event_base_loop: reentrant invocation.
>> Only one event_base_loop can run on each event_base at once.
>>
>> A failed 6000-core preconnection job crashed almost immediately with
>> the following error:
>>
>> [r104:18459] [[32743,0],0] ORTE_ERROR_LOG: File open failure in
>> file ras_tm_module.c at line 159
>> [r104:18459] [[32743,0],0] ORTE_ERROR_LOG: File open failure in
>> file ras_tm_module.c at line 85
>> [r104:18459] [[32743,0],0] ORTE_ERROR_LOG: File open failure in
>> file base/ras_base_allocate.c at line 187
>
> This doesn’t have anything to do with preconnect - it indicates that mpirun
> was unable to open the Torque allocation file. However, it shouldn’t have
> “crashed”, but instead simply exited with an error message.
>
>> Should we expect to use very large amounts of memory for
>> preconnections of thousands of CPUs? And can these
>>
>> I am using Open MPI 1.8.2 on Linux 2.6.32 (CentOS) and an FDR InfiniBand
>> network. This is probably not enough information, but I'll try to
>> provide more if necessary. My knowledge of the implementation is
>> unfortunately very limited.
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:
>> http://www.open-mpi.org/community/lists/users/2014/10/25527.php
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/10/25536.php
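
P.S. For anyone wanting to reproduce this: the test program quoted above is
just an empty MPI program (nothing beyond the calls named in the thread; the
comments and source filename are my own). A minimal sketch:

```c
/* mpi_init.c - minimal reproducer: an MPI program that does nothing but
 * initialize and finalize.  All connection setup triggered by the
 * preconnect MCA parameter happens inside MPI_Init().
 *
 * Build and run (as described in the thread):
 *   mpicc -o mpi_init mpi_init.c
 *   mpirun -mca mpi_preconnect_mpi 1 ./mpi_init
 */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);   /* preconnection occurs here when enabled */
    MPI_Finalize();
    return 0;
}
```

Any memory growth or hang observed with this program can then be attributed
to the preconnect phase rather than the application itself.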