(Not sure if this is more of a SLURM question or an OpenMPI question.)

I've been successfully using SLURM 2.6.5 to submit MPI jobs on a fixed number of
Ubuntu 14.04.2 nodes with mpi4py 1.3.1, manually built against OpenMPI 1.8.4
(which in turn was manually built against the PMI libraries included with SLURM
2.6.5 on Ubuntu). Recently, however, I encountered problems attempting to submit
jobs that make use of dynamic process creation via MPI_Comm_spawn; for example,
if I submit a job that spawns several processes using

srun -n 1 python prog.py

I observe the following error:

[huxley:05020] [[5080,1],0] ORTE_ERROR_LOG: Not available in file dpm_orte.c at
line 1100
[huxley:5020] *** An error occurred in MPI_Comm_spawn
[huxley:5020] *** reported by process [332922881,0]
[huxley:5020] *** on communicator MPI_COMM_SELF
[huxley:5020] *** MPI_ERR_UNKNOWN: unknown error
[huxley:5020] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[huxley:5020] ***    and potentially your MPI job)
In: PMI_Abort(14, N/A)
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.

The same error occurs if I increase the number of tasks to match the number of
spawned processes. Running the job directly with mpiexec, i.e.,

mpiexec -np 1 python prog.py

does work properly, however.
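
My actual program is more involved, but the spawning pattern is essentially the
following (a minimal sketch; the worker count and the trick of having the script
spawn copies of itself are just for illustration):

import sys
from mpi4py import MPI

if len(sys.argv) > 1 and sys.argv[1] == 'child':
    # Spawned worker: synchronize with the parent and exit.
    parent = MPI.Comm.Get_parent()
    parent.Barrier()
    parent.Disconnect()
else:
    # Parent: spawn several copies of this script as workers.
    comm = MPI.COMM_SELF.Spawn(sys.executable,
                               args=[__file__, 'child'],
                               maxprocs=4)
    comm.Barrier()
    comm.Disconnect()

Note that the spawn is performed over MPI_COMM_SELF, which matches the
communicator reported in the error above.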

Is there something I am overlooking when submitting spawning jobs to SLURM? Or
are there currently limitations in SLURM's support for launching MPI programs
that use dynamic process creation?
-- 
Lev Givon
Bionet Group | Neurokernel Project
http://www.columbia.edu/~lev/
http://lebedov.github.io/
http://neurokernel.github.io/
