(Not sure if this is more of a SLURM question or an OpenMPI question.) I've been successfully using SLURM 2.6.5 to submit MPI jobs on a fixed number of Ubuntu 14.04.2 nodes with mpi4py 1.3.1, manually built against OpenMPI 1.8.4 (which in turn was manually built against the PMI libraries included with SLURM 2.6.5 on Ubuntu). Recently, however, I ran into problems submitting jobs that use dynamic process creation via MPI_Comm_spawn.
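I haven't reproduced my actual program here, but a minimal sketch of the kind of spawning code involved (the worker script name and process count are illustrative, not the real values) looks something like this:

    # prog.py -- parent; spawns child copies via MPI_Comm_spawn
    import sys
    from mpi4py import MPI

    # The Spawn call is what triggers the ORTE error under srun:
    comm = MPI.COMM_SELF.Spawn(sys.executable,
                               args=['worker.py'], maxprocs=3)
    comm.Disconnect()

    # worker.py -- child; connects back to the parent and exits
    from mpi4py import MPI

    comm = MPI.Comm.Get_parent()
    comm.Disconnect()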
For example, if I submit such a job with

    srun -n 1 python prog.py

I observe the following error:

    [huxley:05020] [[5080,1],0] ORTE_ERROR_LOG: Not available in file dpm_orte.c at line 1100
    [huxley:5020] *** An error occurred in MPI_Comm_spawn
    [huxley:5020] *** reported by process [332922881,0]
    [huxley:5020] *** on communicator MPI_COMM_SELF
    [huxley:5020] *** MPI_ERR_UNKNOWN: unknown error
    [huxley:5020] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
    [huxley:5020] *** and potentially your MPI job)
    In: PMI_Abort(14, N/A)
    srun: Job step aborted: Waiting up to 2 seconds for job step to finish.

The same error occurs if I increase the number of tasks to match the number of spawned processes. Running the job directly with mpiexec, i.e.,

    mpiexec -np 1 python prog.py

does work properly, however. Is there something I am overlooking when submitting spawning jobs to SLURM? Or are there currently limitations in SLURM's support for launching MPI programs that use dynamic process creation?

--
Lev Givon
Bionet Group | Neurokernel Project
http://www.columbia.edu/~lev/
http://lebedov.github.io/
http://neurokernel.github.io/
