Yes, 2.0.1 has a spawn issue. We believe that 2.0.2 is okay if you want to give
it a try.

Sent from my iPad

> On Feb 15, 2017, at 1:14 PM, Jason Maldonis <maldo...@wisc.edu> wrote:
> 
> Just to throw this out there -- to me, that doesn't seem to be just a problem 
> with SLURM. I'm guessing the exact same error would be thrown interactively 
> (unless I didn't read the above messages carefully enough).  I had a lot of 
> problems running spawned jobs on 2.0.x a few months ago, so I switched back 
> to 1.10.2 and everything worked. Just in case that helps someone.
> 
> Jason
> 
>> On Wed, Feb 15, 2017 at 1:09 PM, Anastasia Kruchinina 
>> <nastja.kruchin...@gmail.com> wrote:
>> Hi!
>> 
>> I am doing it like this:
>> 
>> sbatch -N 2 -n 5 ./job.sh
>> 
>> where job.sh is:
>> 
>> #!/bin/bash -l
>> module load openmpi/2.0.1-icc
>> mpirun -np 1 ./manager 4
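>> 
>> (For reference, a minimal illustrative sketch of what a manager run as
>> "./manager 4" might do with MPI_Comm_spawn -- this is not the actual test
>> program, which is linked further down the thread, and the "./worker"
>> executable name is just a placeholder:)
>> 
>> /* manager.c -- illustrative MPI_Comm_spawn sketch, not the real test code */
>> #include <mpi.h>
>> #include <stdlib.h>
>> 
>> int main(int argc, char *argv[])
>> {
>>     MPI_Comm intercomm;
>>     int nworkers;
>> 
>>     MPI_Init(&argc, &argv);
>>     nworkers = (argc > 1) ? atoi(argv[1]) : 1;   /* e.g. the "4" above */
>> 
>>     /* Spawn the workers; with "sbatch -N 2 -n 5" there is presumably 1 slot
>>      * used by the manager and 4 left over for the spawned workers. */
>>     MPI_Comm_spawn("./worker", MPI_ARGV_NULL, nworkers, MPI_INFO_NULL,
>>                    0, MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);
>> 
>>     /* ... exchange data with the workers over 'intercomm' ... */
>> 
>>     MPI_Comm_disconnect(&intercomm);
>>     MPI_Finalize();
>>     return 0;
>> }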
>> 
>>> On 15 February 2017 at 17:58, r...@open-mpi.org <r...@open-mpi.org> wrote:
>>> The cmd line looks fine - when you do your “sbatch” request, what is in the 
>>> shell script you give it? Or are you saying you just “sbatch” the mpirun 
>>> cmd directly?
>>> 
>>> 
>>>> On Feb 15, 2017, at 8:07 AM, Anastasia Kruchinina 
>>>> <nastja.kruchin...@gmail.com> wrote:
>>>> 
>>>> Hi, 
>>>> 
>>>> I am running it like this:
>>>> mpirun -np 1 ./manager
>>>> 
>>>> Should I do it differently?
>>>> 
>>>> I also thought that all sbatch does is create an allocation and then run
>>>> my script in it. But it seems that is not the case, since I am getting
>>>> these results...
>>>> 
>>>> I would like to upgrade Open MPI, but no clusters near me have 2.0.2 yet :( 
>>>> So I cannot even check whether it works with Open MPI 2.0.2.
>>>> 
>>>>> On 15 February 2017 at 16:04, Howard Pritchard <hpprit...@gmail.com> 
>>>>> wrote:
>>>>> Hi Anastasia,
>>>>> 
>>>>> Definitely check which mpirun is being used in the batch environment, but
>>>>> you may also want to upgrade to Open MPI 2.0.2.
>>>>> 
>>>>> Howard
>>>>> 
>>>>> r...@open-mpi.org <r...@open-mpi.org> wrote on Wed., Feb. 15, 2017 at
>>>>> 07:49:
>>>>>> Nothing immediate comes to mind - all sbatch does is create an
>>>>>> allocation and then run your script in it. Perhaps your script is using
>>>>>> a different “mpirun” than the one you get when you type it interactively?
>>>>>> 
>>>>>>> On Feb 14, 2017, at 5:11 AM, Anastasia Kruchinina 
>>>>>>> <nastja.kruchin...@gmail.com> wrote:
>>>>>>> 
>>>>>>> Hi, 
>>>>>>> 
>>>>>>> I am trying to use the MPI_Comm_spawn function in my code. I am having
>>>>>>> trouble with Open MPI 2.0.x + sbatch (the Slurm batch system).
>>>>>>> My test program is located here:
>>>>>>> http://user.it.uu.se/~anakr367/files/MPI_test/ 
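>>>>>>> 
>>>>>>> (For reference, a generic sketch of the worker side of the spawn -- the
>>>>>>> MPI_Comm_get_parent pattern -- not necessarily the exact code in the
>>>>>>> test program above:)
>>>>>>> 
>>>>>>> /* worker.c -- illustrative sketch of a spawned child process */
>>>>>>> #include <mpi.h>
>>>>>>> #include <stdio.h>
>>>>>>> 
>>>>>>> int main(int argc, char *argv[])
>>>>>>> {
>>>>>>>     MPI_Comm parent;
>>>>>>>     int rank;
>>>>>>> 
>>>>>>>     MPI_Init(&argc, &argv);
>>>>>>>     MPI_Comm_get_parent(&parent);   /* intercommunicator back to the manager */
>>>>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>>> 
>>>>>>>     if (parent == MPI_COMM_NULL) {
>>>>>>>         fprintf(stderr, "worker %d: not started via MPI_Comm_spawn\n", rank);
>>>>>>>     } else {
>>>>>>>         /* ... talk to the manager over 'parent' ... */
>>>>>>>         MPI_Comm_disconnect(&parent);
>>>>>>>     }
>>>>>>> 
>>>>>>>     MPI_Finalize();
>>>>>>>     return 0;
>>>>>>> }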
>>>>>>> 
>>>>>>> When I am running my code I am getting an error: 
>>>>>>> 
>>>>>>> OPAL ERROR: Timeout in file 
>>>>>>> ../../../../openmpi-2.0.1/opal/mca/pmix/base/pmix_base_fns.c at line 
>>>>>>> 193 
>>>>>>> *** An error occurred in MPI_Init_thread 
>>>>>>> *** on a NULL communicator 
>>>>>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now 
>>>>>>> abort, 
>>>>>>> ***    and potentially your MPI job) 
>>>>>>> --------------------------------------------------------------------------
>>>>>>>  
>>>>>>> It looks like MPI_INIT failed for some reason; your parallel process is 
>>>>>>> likely to abort.  There are many reasons that a parallel process can 
>>>>>>> fail during MPI_INIT; some of which are due to configuration or 
>>>>>>> environment 
>>>>>>> problems.  This failure appears to be an internal failure; here's some 
>>>>>>> additional information (which may only be relevant to an Open MPI 
>>>>>>> developer): 
>>>>>>> 
>>>>>>>    ompi_dpm_dyn_init() failed 
>>>>>>>    --> Returned "Timeout" (-15) instead of "Success" (0) 
>>>>>>> --------------------------------------------------------------------------
>>>>>>>  
>>>>>>> 
>>>>>>> The interesting thing is that there is no error when I first allocate
>>>>>>> nodes with salloc and then run my program. So the program works fine with
>>>>>>> Open MPI 1.x + sbatch/salloc and with Open MPI 2.0.x + salloc, but not
>>>>>>> with Open MPI 2.0.x + sbatch.
>>>>>>> 
>>>>>>> The error was reproduced on three different computer clusters. 
>>>>>>> 
>>>>>>> Best regards, 
>>>>>>> Anastasia 
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
