Thanks a lot, Ralph, that was exactly it!
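For context, Ralph's fix below amounts to rebuilding Open MPI with PMI support, so that srun bootstraps the MPI processes directly instead of relying on Slurm's reserved-port mechanism. A minimal sketch of the rebuild; the install prefix and the Slurm path are assumptions for this cluster, not values from the thread:

```shell
# Rebuild Open MPI against Slurm's PMI library (paths below are examples).
# --with-pmi should point at the directory containing include/pmi.h
# and the libpmi library shipped with Slurm.
./configure --prefix=/opt/openmpi-1.6.5 --with-pmi=/usr/local/slurm
make -j8
make install

# Afterwards the job step can be launched exactly as before:
srun -n64 -c4 wrapper
```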

2013/9/13 Ralph Castain <r...@open-mpi.org>

> Configure OMPI --with-pmi, as the port reservation method won't work in
> this scenario.
>
> On Sep 13, 2013, at 8:29 AM, Yann Sagon <ysa...@gmail.com> wrote:
>
>  I have that set in slurm.conf:
>
> MpiDefault=openmpi
> MpiParams=ports=12000-12999
>
> The cluster is 56 nodes of 16 cores. Do I need to increase something?
>
> If I issue this on my nodes, nothing appears:
>
> cexec "netstat -laputen | grep ':12[0-9]\{3\}'"
>
>
>
>
> 2013/9/13 Moe Jette <je...@schedmd.com>
>
>>
>> I suspect the problem is related to reserved ports described here:
>> http://slurm.schedmd.com/mpi_guide.html#open_mpi
>>
>>
>> Quoting Yann Sagon <ysa...@gmail.com>:
>>
>>  (sorry for the previous post, it was sent by accident)
>>>
>>> Hello,
>>>
>>> I'm facing the following problem: one of our users wrote a simple C
>>> wrapper that launches a multithreaded program. It was working before an
>>> update of the cluster (OS and OFED).
>>>
>>> The wrapper is invoked like this:
>>>
>>> $srun -n64 -c4 wrapper
>>>
>>> The result is something like this:
>>>
>>> [...]
>>> srun: error: node04: task 12: Killed
>>> srun: error: node04: tasks 13-15 unable to claim reserved port, retrying.
>>> srun: Terminating job step 47498.0
>>> slurmd[node04]: *** STEP 47498.0 KILLED AT 2013-09-13T17:13:33 WITH SIGNAL 9 ***
>>> [...]
>>>
>>> If we call the wrapper like this:
>>>
>>> $srun -n64 wrapper
>>>
>>> it works, but then we have only one core per thread.
>>>
>>> We were using Slurm 2.5.4; now I have tried 2.6.2.
>>> Tested with Open MPI 1.6.4 and 1.6.5.
>>>
>>>
>>> Here is the code of the wrapper:
>>>
>>> #include <stdio.h>
>>> #include <stdlib.h>
>>> #include <mpi.h>
>>>
>>> int main(int argc, char *argv[])
>>> {
>>>     int rank, size;
>>>     char buf[512];
>>>
>>>     MPI_Init(&argc, &argv);
>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>     /* Pass this task's rank and the total task count to the program;
>>>        snprintf bounds the write to the buffer size. */
>>>     snprintf(buf, sizeof(buf), "the_multithreaded_binary %d %d", rank, size);
>>>     system(buf);
>>>     MPI_Finalize();
>>>
>>>     return 0;
>>> }
>>>
>>>
>>
>>
>
>
