Received from Lev Givon on Mon, Feb 23, 2015 at 11:32:53PM EST:
> 
> (Not sure if this is more of a SLURM question or an OpenMPI question.)
> 
> I've been successfully using SLURM 2.6.5 to submit MPI jobs on a fixed number
> of Ubuntu 14.04.2 nodes with mpi4py 1.3.1 manually built against OpenMPI 1.8.4
> (which in turn was manually built against the PMI libraries included with
> SLURM 2.6.5 on Ubuntu). Recently, however, I encountered problems attempting
> to submit jobs that make use of dynamic process creation via MPI_Comm_spawn;
> for example, if I submit a job that spawns several processes using
> 
> srun -n 1 python prog.py
> 
> I observe the following error:
> 
> [huxley:05020] [[5080,1],0] ORTE_ERROR_LOG: Not available in file dpm_orte.c
> at line 1100
> [huxley:5020] *** An error occurred in MPI_Comm_spawn
> [huxley:5020] *** reported by process [332922881,0]
> [huxley:5020] *** on communicator MPI_COMM_SELF
> [huxley:5020] *** MPI_ERR_UNKNOWN: unknown error
> [huxley:5020] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> [huxley:5020] ***    and potentially your MPI job)
> In: PMI_Abort(14, N/A)
> srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
> 
> The same error occurs if I try to increase the number of tasks to match the
> number of spawned processes. Running the job directly with mpiexec, i.e.,
> 
> mpiexec -np 1 python prog.py
> 
> does work properly, however.
> 
> Is there something I am overlooking when submitting spawning jobs to SLURM?
> Or are there currently limitations in SLURM's support for launching MPI
> programs that use dynamic process allocation?

For future reference, the OpenMPI folks indicated that dynamic spawning isn't
currently supported when an OpenMPI job is launched directly via srun:

http://www.open-mpi.org/community/lists/users/2015/02/26404.php
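
For future readers, a workaround consistent with the above is to request the
allocation with sbatch (or salloc) and let OpenMPI's own mpiexec perform the
launch inside it, rather than starting the job with srun directly. The batch
script below is a minimal sketch; the job name, task count, and time limit are
illustrative, and prog.py stands for the spawning program from the question:

```shell
#!/bin/bash
#SBATCH --job-name=spawn-test   # illustrative job name
#SBATCH --ntasks=4              # reserve slots for the parent plus spawned processes
#SBATCH --time=00:05:00         # illustrative time limit

# Launch via OpenMPI's mpiexec instead of srun, since MPI_Comm_spawn
# is not supported when the job is started directly with srun.
mpiexec -np 1 python prog.py
```

Submitted with sbatch, the job runs under a SLURM allocation while mpiexec
handles process management, so MPI_Comm_spawn can place the children in the
slots SLURM reserved.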
-- 
Lev Givon
Bionet Group | Neurokernel Project
http://www.columbia.edu/~lev/
http://lebedov.github.io/
http://neurokernel.github.io/
