Received from Lev Givon on Mon, Feb 23, 2015 at 11:32:53PM EST:
> (Not sure if this is more of a SLURM question or an OpenMPI question.)
>
> I've been successfully using SLURM 2.6.5 to submit MPI jobs on a fixed
> number of Ubuntu 14.04.2 nodes with mpi4py 1.3.1 manually built against
> OpenMPI 1.8.4 (which in turn was manually built against the PMI libraries
> included with SLURM 2.6.5 on Ubuntu). Recently, however, I encountered
> problems attempting to submit jobs that make use of dynamic process
> creation via MPI_Comm_spawn; for example, if I submit a job that spawns
> several processes using
>
>     srun -n 1 python prog.py
>
> I observe the following error:
>
>     [huxley:05020] [[5080,1],0] ORTE_ERROR_LOG: Not available in file dpm_orte.c at line 1100
>     [huxley:5020] *** An error occurred in MPI_Comm_spawn
>     [huxley:5020] *** reported by process [332922881,0]
>     [huxley:5020] *** on communicator MPI_COMM_SELF
>     [huxley:5020] *** MPI_ERR_UNKNOWN: unknown error
>     [huxley:5020] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>     [huxley:5020] *** and potentially your MPI job)
>     In: PMI_Abort(14, N/A)
>     srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
>
> The same error occurs if I increase the number of tasks to the number of
> spawned processes. Running the job directly with mpiexec, i.e.,
>
>     mpiexec -np 1 python prog.py
>
> does work properly, however.
>
> Is there something I am overlooking when submitting spawning jobs to
> SLURM? Or are there currently limitations in SLURM's support for
> launching MPI programs that use dynamic process allocation?
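For readers unfamiliar with dynamic process creation, a spawning parent of the general shape described above might look like the following mpi4py sketch. The actual prog.py from the report is not shown, so the worker script name and process count here are hypothetical; the mpi4py import is deferred into the function so the sketch can be read or loaded without an MPI runtime present.

```python
import sys

def spawn_workers(nworkers=4, worker_script='worker.py'):
    """Spawn nworkers child processes running worker_script
    (both values are hypothetical placeholders)."""
    # Deferred import: mpi4py initializes MPI on import, which
    # requires a working MPI runtime.
    from mpi4py import MPI

    # MPI_Comm_spawn on MPI_COMM_SELF, as in the error log above:
    # launch nworkers copies of the Python interpreter running
    # worker_script, yielding an intercommunicator to the children.
    intercomm = MPI.COMM_SELF.Spawn(sys.executable,
                                    args=[worker_script],
                                    maxprocs=nworkers)
    size = intercomm.Get_remote_size()
    intercomm.Disconnect()
    return size

if __name__ == '__main__':
    print(spawn_workers())
```

Under the reported setup, running this with `mpiexec -np 1 python prog.py` succeeds, while `srun -n 1 python prog.py` aborts inside MPI_Comm_spawn as shown in the log.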
For future reference, the OpenMPI folks indicated that dynamic spawning
isn't currently supported when an OpenMPI job is launched directly via
srun:

http://www.open-mpi.org/community/lists/users/2015/02/26404.php

-- 
Lev Givon
Bionet Group | Neurokernel Project
http://www.columbia.edu/~lev/
http://lebedov.github.io/
http://neurokernel.github.io/
