I suspect the problem is related to the reserved ports described here:
http://slurm.schedmd.com/mpi_guide.html#open_mpi
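For reference, Slurm reserves a block of communication ports per step when a port range is configured via MpiParams in slurm.conf; with -c4 each task claims more ports, so a range sized for -n64 alone can run out. A minimal sketch of the relevant setting (the 12000-12999 range is the example from the MPI guide; the actual range is site-specific):

```
# slurm.conf (illustrative range; size it for tasks-per-node x steps)
MpiParams=ports=12000-12999
```

If the range is too small for the node's task count, widening it (or checking what else on the node binds ports in that range) would be the first thing to try.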
Quoting Yann Sagon <[email protected]>:
(sorry for the previous post, sent by mistake)
Hello,
I'm facing the following problem: one of our users wrote a simple C wrapper
that launches a multithreaded program. It worked before an update of the
cluster (OS and OFED).
The wrapper is invoked like this:
$srun -n64 -c4 wrapper
The result is something like this:
[...]
srun: error: node04: task 12: Killed
srun: error: node04: tasks 13-15 unable to claim reserved port, retrying.
srun: Terminating job step 47498.0
slurmd[node04]: *** STEP 47498.0 KILLED AT 2013-09-13T17:13:33 WITH SIGNAL 9 ***
[...]
If we call the wrapper like this:
$srun -n64 wrapper
it works, but then each task gets only one core for its threads.
We were using Slurm 2.5.4; now I have tried 2.6.2 as well.
Tested with Open MPI 1.6.4 and 1.6.5.
Here is the code of the wrapper:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;
    char buf[512];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Pass this task's rank and the world size to the real program. */
    snprintf(buf, sizeof(buf), "the_multithreaded_binary %d %d", rank, size);
    system(buf);

    MPI_Finalize();
    return 0;
}