I have this set in slurm.conf:
MpiDefault=openmpi
MpiParams=ports=12000-12999
The cluster has 56 nodes with 16 cores each. Do I need to increase something?
If I issue this on my nodes, nothing appears:
cexec "netstat -laputen | grep ':12[0-9]\{3\}'"
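As a sanity check on the range size (assuming, as I read the mpi_guide page, on the order of one reserved port per task of a step):

```shell
# Back-of-the-envelope check of MpiParams=ports=12000-12999.
# Assumption: roughly one reserved port is needed per task of a step.
PORTS=$((12999 - 12000 + 1))  # ports available in the configured range
TASKS=64                      # srun -n64
echo "$PORTS ports available for $TASKS tasks"
```

So the range itself looks far larger than a single 64-task step would need.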
2013/9/13 Moe Jette <[email protected]>
>
> I suspect the problem is related to reserved ports described here:
> http://slurm.schedmd.com/mpi_guide.html#open_mpi
>
>
> Quoting Yann Sagon <[email protected]>:
>
>> (sorry for the previous post, sent by mistake)
>>
>> Hello,
>>
>> I'm facing the following problem: one of our users wrote a simple C
>> wrapper that launches a multithreaded program. It worked before an
>> update of the cluster (OS and OFED).
>>
>> The wrapper is invoked like this:
>>
>> $srun -n64 -c4 wrapper
>>
>> The result is something like this:
>>
>> [...]
>> srun: error: node04: task 12: Killed
>> srun: error: node04: tasks 13-15 unable to claim reserved port, retrying.
>> srun: Terminating job step 47498.0
>> slurmd[node04]: *** STEP 47498.0 KILLED AT 2013-09-13T17:13:33 WITH SIGNAL 9 ***
>> [...]
>>
>> If we call the wrapper like that:
>>
>> $srun -n64 wrapper
>>
>> it works, but then each task gets only one core for its threads.
>>
>> We were using Slurm 2.5.4; now I have tried 2.6.2.
>> Tested with Open MPI 1.6.4 and 1.6.5.
>>
>>
>> Here is the code of the wrapper:
>>
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <mpi.h>
>>
>> int main(int argc, char *argv[])
>> {
>>     int rank, size;
>>     char buf[512];
>>
>>     MPI_Init(&argc, &argv);
>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>>
>>     /* Each MPI rank launches the multithreaded binary, passing its
>>        rank and the total number of ranks as arguments. */
>>     snprintf(buf, sizeof(buf), "the_multithreaded_binary %d %d", rank, size);
>>     system(buf);
>>
>>     MPI_Finalize();
>>     return 0;
>> }
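One more check I can run, assuming Slurm exports the reserved range to tasks via the SLURM_STEP_RESV_PORTS environment variable when MpiParams ports are configured (that variable name is an assumption on my side):

```shell
# Hedged sketch: print the reserved-port range each task sees.
# SLURM_STEP_RESV_PORTS is assumed to be set by slurmd for the step
# when MpiParams=ports=... is active; "none" would mean no reservation.
srun -n64 -c4 sh -c 'echo "task $SLURM_PROCID: ${SLURM_STEP_RESV_PORTS:-none}"'
```

If every task prints "none", the ports are never being handed out in the first place, which would match the empty netstat output above.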