I have this set in slurm.conf:

MpiDefault=openmpi
MpiParams=ports=12000-12999

The cluster has 56 nodes with 16 cores each. Do I need to increase something?
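For scale, here is a rough sanity check. My understanding of the mpi_guide page is that srun reserves on the order of one port per task on a node, so this is only a back-of-the-envelope comparison, not an authoritative rule:

```shell
# Rough check: size of the reserved port range vs. tasks per node.
# Assumes at most one task per core on a 16-core node, which is my
# reading of how many reserved ports one step would need on a node.
ports=$(( 12999 - 12000 + 1 ))
tasks_per_node=16
echo "ports=$ports tasks_per_node=$tasks_per_node"
# prints: ports=1000 tasks_per_node=16
```

With 1000 ports against at most 16 tasks per node, the range size itself does not look like the bottleneck.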

If I issue this on my nodes, nothing appears:

cexec "netstat -laputen | grep ':12[0-9]\{3\}'"
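An empty result here may just mean no step was running at that moment: as far as I understand, the reserved ports are only bound while a job step is active. To show what a hit would look like, here is a sketch that feeds the same grep pattern some made-up netstat-style lines (the addresses and ports are invented for illustration):

```shell
# Feed fabricated netstat-style lines through the same pattern; only the
# line whose local port falls in 12000-12999 should survive the grep.
printf 'tcp 0 0 10.0.0.4:12345 0.0.0.0:* LISTEN\ntcp 0 0 10.0.0.4:22 0.0.0.0:* LISTEN\n' \
    | grep ':12[0-9]\{3\}'
# prints only the :12345 line
```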

2013/9/13 Moe Jette <[email protected]>

>
> I suspect the problem is related to reserved ports described here:
> http://slurm.schedmd.com/mpi_guide.html#open_mpi
>
>
> Quoting Yann Sagon <[email protected]>:
>
>  (sorry for the previous post, sent by mistake)
>>
>> Hello,
>>
>> I'm facing the following problem: one of our users wrote a simple C wrapper
>> that launches a multithreaded program. It was working before an update of
>> the cluster (OS and OFED).
>>
>> The wrapper is invoked like this:
>>
>> $srun -n64 -c4 wrapper
>>
>> The result is something like this:
>>
>> [...]
>> srun: error: node04: task 12: Killed
>> srun: error: node04: tasks 13-15 unable to claim reserved port, retrying.
>> srun: Terminating job step 47498.0
>> slurmd[node04]: *** STEP 47498.0 KILLED AT 2013-09-13T17:13:33 WITH SIGNAL
>> 9 ***
>> [...]
>>
>> If we call the wrapper like this:
>>
>> $srun -n64 wrapper
>>
>> it works, but then we get only one core per task.
>>
>> We were using slurm 2.5.4, now I tried with 2.6.2
>> Tested with openmpi 1.6.4 and 1.6.5
>>
>>
>> here is the code of the wrapper:
>>
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <mpi.h>
>>
>> int main(int argc, char *argv[])
>> {
>>     int rank, size;
>>     char buf[512];
>>
>>     MPI_Init(&argc, &argv);
>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>>     /* pass each rank its id and the world size, then launch the
>>        multithreaded binary as a child process */
>>     snprintf(buf, sizeof buf, "the_multithreaded_binary %d %d", rank, size);
>>     system(buf);
>>     MPI_Finalize();
>>
>>     return 0;
>> }
>>
>>
>
>
