I currently have a numerical model that, for reasons unknown, requires
preconnection to avoid hanging on an initial MPI_Allreduce call. But
when we try to scale out beyond around 1000 cores, we are unable to
get past MPI_Init's preconnection phase.
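
For context, the model's first collective is something like the following minimal sketch (not the actual model code; the buffer contents and the reduction operation are just placeholders for illustration):

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* Stand-in for the model's first collective: a global sum across
           all ranks. Without preconnection, this is roughly where the
           model appears to hang. */
        double local = 1.0, global = 0.0;
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }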

To test this, I have a basic C program, `mpi_init`, containing only
MPI_Init() and MPI_Finalize(), which I compile and run using `mpirun
-mca mpi_preconnect_mpi 1 mpi_init`.
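
For completeness, the whole test program is just this (a sketch; it assumes the usual `mpicc` wrapper for compilation, e.g. `mpicc -o mpi_init mpi_init.c`):

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        /* Nothing but setup and teardown; with mpi_preconnect_mpi set to 1,
           MPI_Init also establishes connections between every pair of ranks. */
        MPI_Init(&argc, &argv);
        MPI_Finalize();
        return 0;
    }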

This preconnection seems to consume a large amount of memory, and it
exceeds the available memory on our nodes (~2 GiB/core) as the core
count gets into the thousands (~4000 or so). If we try to preconnect
around 6000 cores, we start to see hangs and crashes.

A failed 5600-core preconnection gave this warning (~10k times) while
hanging for 30 minutes:

    [warn] opal_libevent2021_event_base_loop: reentrant invocation.  Only one event_base_loop can run on each event_base at once.

A failed 6000-core preconnection job crashed almost immediately with
the following errors:

    [r104:18459] [[32743,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 159
    [r104:18459] [[32743,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 85
    [r104:18459] [[32743,0],0] ORTE_ERROR_LOG: File open failure in file base/ras_base_allocate.c at line 187

Should we expect preconnection of thousands of CPUs to use very large
amounts of memory? And can these memory requirements be reduced or
avoided?

I am using Open MPI 1.8.2 on Linux 2.6.32 (CentOS) with an FDR
InfiniBand network. This is probably not enough information, but I'll
try to provide more if necessary. My knowledge of the implementation
is unfortunately very limited.
