I suspect the problem is related to reserved ports described here:
http://slurm.schedmd.com/mpi_guide.html#open_mpi
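For Open MPI's reserved-port support, Slurm takes a port range from the MpiParams setting in slurm.conf, and the range has to be large enough for all tasks of a step on a node. A minimal sketch of that setting (the 12000-12999 range is an illustrative value, not a recommendation; see the guide above for sizing):

```
# slurm.conf (illustrative value -- the range must cover the
# largest number of tasks any single step will run)
MpiParams=ports=12000-12999
```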

Quoting Yann Sagon <[email protected]>:

(Sorry for the previous post; it was sent by mistake.)

Hello,

I'm facing the following problem: one of our users wrote a simple C wrapper
that launches a multithreaded program. It was working before an update of
the cluster (OS and OFED).

The wrapper is invoked like this:

$srun -n64 -c4 wrapper

The result is something like this:

[...]
srun: error: node04: task 12: Killed
srun: error: node04: tasks 13-15 unable to claim reserved port, retrying.
srun: Terminating job step 47498.0
slurmd[node04]: *** STEP 47498.0 KILLED AT 2013-09-13T17:13:33 WITH SIGNAL
9 ***
[...]

If we call the wrapper like this:

$srun -n64 wrapper

it works, but then each task gets only one core instead of the four we want.

We were using Slurm 2.5.4; I have now tried 2.6.2 as well.
Tested with Open MPI 1.6.4 and 1.6.5.


Here is the code of the wrapper:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;
    char buf[512];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Build the command "the_multithreaded_binary <rank> <size>" and run
       it in a subshell; snprintf bounds the write to the buffer size. */
    snprintf(buf, sizeof(buf), "the_multithreaded_binary %d %d", rank, size);
    if (system(buf) == -1)
        perror("system");

    MPI_Finalize();

    return 0;
}
