I currently have a numerical model that, for reasons unknown, requires preconnection to avoid hanging on an initial MPI_Allreduce call. However, when we try to scale beyond roughly 1000 cores, we cannot get past MPI_Init's preconnection phase.
To test this, I have a basic C program named `mpi_init` containing only `MPI_Init()` and `MPI_Finalize()`, which I compile and run using `mpirun -mca mpi_preconnect_mpi 1 mpi_init`. This preconnection seems to consume a large amount of memory, and it exceeds the available memory on our nodes (~2 GiB/core) as the core count gets into the thousands (~4000 or so). If we try to preconnect around ~6000 cores, we start to see hangs and crashes.

A failed 5600-core preconnection hung for 30 minutes while emitting this warning ~10,000 times:

```
[warn] opal_libevent2021_event_base_loop: reentrant invocation. Only one event_base_loop can run on each event_base at once.
```

A failed 6000-core preconnection job crashed almost immediately with the following errors:

```
[r104:18459] [[32743,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 159
[r104:18459] [[32743,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 85
[r104:18459] [[32743,0],0] ORTE_ERROR_LOG: File open failure in file base/ras_base_allocate.c at line 187
```

Should we expect preconnection across thousands of cores to use very large amounts of memory? And can these hangs and crashes be avoided?

I am using Open MPI 1.8.2 on Linux 2.6.32 (CentOS) with an FDR InfiniBand network. This is probably not enough information, but I'll try to provide more if necessary; my knowledge of the implementation is unfortunately very limited.
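For reference, a test program like the one described above is essentially this minimal skeleton (the exact source isn't shown here, so this is just a sketch of such a program):

```c
/* mpi_init.c -- minimal MPI program for testing preconnection.
 * Sketch only; the actual test program may differ slightly.
 *
 * Compile:  mpicc mpi_init.c -o mpi_init
 * Run:      mpirun -mca mpi_preconnect_mpi 1 mpi_init
 */
#include <mpi.h>

int main(int argc, char **argv)
{
    /* With mpi_preconnect_mpi=1, Open MPI establishes all pairwise
     * connections eagerly during (or just after) MPI_Init, instead of
     * lazily on first communication -- this is the phase that hangs. */
    MPI_Init(&argc, &argv);
    MPI_Finalize();
    return 0;
}
```

Because the program performs no communication of its own, any memory growth or hang observed with it should be attributable to the preconnection phase itself.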