Each MPI implementation is a bit different. Check your MpiDefault configuration parameter; see:
http://slurm.schedmd.com/slurm.conf.html
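
For illustration only, a minimal slurm.conf fragment (whether this is the right value depends on how your MPI library was built; for an MVAPICH2 built against Slurm's PMI library, the generic plugin is typically what you want):

    # slurm.conf (fragment) -- illustrative sketch, not a drop-in config.
    # When MVAPICH2 is configured with --with-pm=no --with-pmi=slurm, the
    # PMI handshake goes through the linked libpmi, so no special MPI
    # plugin is needed:
    MpiDefault=none

You can also override the plugin per-step with srun's --mpi option instead of changing the cluster-wide default.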
Quoting Jonathan Perkins <[email protected]>:
> Hi there. Can you share the output of mpiname -a? In order to use srun
> with mvapich2 you will need to configure mvapich2 with the following
> options:
>
>     ./configure --with-pm=no --with-pmi=slurm
>
> On Mon, Nov 25, 2013 at 8:46 AM, Arjun J Rao <[email protected]> wrote:
>
>> I have a cluster with two nodes, qdr3 and qdr4. I run slurmctld on
>> qdr3, and slurmd on both qdr3 and qdr4. I have attached the slurm.conf
>> file. I am using MVAPICH2 2.0a (the latest is 2.0b).
>>
>> I then wrote a simple MPI hello world program that prints the process
>> rank and the processor name from whichever node it is run on. I
>> compiled the code using
>>
>>     mpicc -L/usr/local/lib/slurm -lpmi Hello.c
>>
>> where /usr/local/lib/slurm is where the Slurm libraries reside.
>> Compilation and the subsequent commands were all entered in qdr3's
>> terminal, where slurmctld runs too.
>>
>>     $: salloc -N2 bash
>>     salloc: Granted job allocation 24
>>     $: sbcast a.out /tmp/random.a.out
>>     $: srun /tmp/random.a.out
>>     In: PMI_Abort(1, Fatal error in MPI_Init: Other MPI error)
>>     In: PMI_Abort(1, Fatal error in MPI_Init: Other MPI error)
>>     slurmd[qdr4]: *** STEP 24.0 KILLED AT 2013-11-25T18:52:52 WITH SIGNAL 9 ***
>>     srun: Job step aborted: Waiting up to 2 seconds for job step to finish
>>     srun: error: qdr3: task 0: Exited with exit code 1
>>     srun: error: qdr4: task 1: Exited with exit code 1
>>
>> I checked the /tmp folder on qdr3 and qdr4, and both contained
>> random.a.out as a file. I can log in to each machine from the other
>> without having to use a password.
>>
>> Even
>>
>>     srun -n4 /tmp/random.a.out
>>     srun -n2 /tmp/random.a.out
>>     srun -n14 /tmp/random.a.out
>>
>> don't work; they give similar errors. What could be going wrong here?
>
> --
> Jonathan Perkins
> http://www.cse.ohio-state.edu/~perkinjo
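
For reference, a minimal version of the kind of hello-world program described above (prints each process's rank and processor name). This is a generic sketch, not the poster's actual Hello.c; it requires an MPI toolchain (mpicc) to build:

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, name_len;
    char name[MPI_MAX_PROCESSOR_NAME];

    /* MPI_Init is where the PMI handshake with the launcher happens --
     * this is the call that fails with PMI_Abort when the MPI library
     * and srun are not configured to use the same process manager. */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &name_len);

    printf("Hello from rank %d of %d on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}
```

Built with the flags from the thread (e.g. `mpicc -L/usr/local/lib/slurm -lpmi Hello.c`), it should link against Slurm's PMI library; but note that linking `-lpmi` alone does not help unless MVAPICH2 itself was configured with `--with-pm=no --with-pmi=slurm`.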
