Is this OpenMPI? We experienced similar behaviour with OpenMPI; it was fixed after recompiling OpenMPI with PMI support, i.e.
./configure [...] --with-pmi=/path/to/slurm [...]

2017-01-23 14:22 GMT+01:00 liu junjun <ljjl...@gmail.com>:
> Hi Paddy,
>
> Thanks a lot for your kind help!
>
> Replacing mpirun with srun still doesn't seem to work. Here's what I did:
> >cat a.sh
> #!/bin/bash
> srun ./mpitest
> >sbatch -n 2 ./a.sh
> Submitted batch job 3611
> >cat slurm-3611.out
> Hello world: rank 0 of 1 running on cdd001
> Hello world: rank 0 of 1 running on cdd001
>
> So srun just executed the program twice, instead of running it in parallel.
>
> If I put mpirun back in place of srun, the expected output is produced:
> >cat a.sh
> #!/bin/bash
> mpirun ./mpitest
> >sbatch a.sh
> Submitted batch job 3612
> >cat slurm-3612.out
> Hello world: rank 0 of 2 running on cdd001
> Hello world: rank 1 of 2 running on cdd001
>
> I also tried using srun inside the batch script with a serial program:
> >cat a.sh
> #!/bin/bash
> srun echo Hello
> >sbatch -n 2 ./a.sh
> Submitted batch job 3614
> >cat slurm-3614.out
> Hello
> Hello
>
> Any idea?
>
> Thanks in advance!
>
> Junjun
>
>
> On Mon, Jan 23, 2017 at 6:16 PM, Paddy Doyle <pa...@tchpc.tcd.ie> wrote:
>>
>> Hi Junjun,
>>
>> On Mon, Jan 23, 2017 at 12:04:17AM -0800, liu junjun wrote:
>>
>> > Hi all,
>> >
>> > I have a small MPI test program that just prints the rank id of a
>> > parallel job. The output is like this:
>> > >mpirun -n 2 ./mpitest
>> > Hello world: rank 0 of 2 running on cddlogin
>> > Hello world: rank 1 of 2 running on cddlogin
>> >
>> > I ran this test program with salloc. It produces similar output:
>> > >salloc -n 2
>> > salloc: Granted job allocation 3605
>> > >mpirun -n 2 ./mpitest
>> > Hello world: rank 0 of 2 running on cdd001
>> > Hello world: rank 1 of 2 running on cdd001
>> >
>> > I put this one-line command into a bash script and ran it with sbatch.
>> > It also gives the same result, as expected. However, it is totally
>> > different if it is run with srun:
>> > >srun -n 2 mpirun -n 2 ./mpitest
>> > Hello world: rank 0 of 2 running on cdd001
>> > Hello world: rank 1 of 2 running on cdd001
>> > Hello world: rank 0 of 2 running on cdd001
>> > Hello world: rank 1 of 2 running on cdd001
>>
>> That looks like expected behaviour from calling both srun and mpirun; I've
>> never tried it, but it looks like what might happen if you call them both.
>>
>> It's not recommended to run your code like that, though. Basically, don't
>> call both srun and mpirun! In your sbatch script either put:
>>
>> #SBATCH -n 2
>> ....
>> mpirun ./mpitest
>>
>> ...or:
>>
>> #SBATCH -n 2
>> ....
>> srun ./mpitest
>>
>> You don't need both. And it's simpler not to repeat the '-n 2' on the
>> mpirun/srun line, as that will lead to copy/paste errors when you change
>> it in the '#SBATCH' line but not below.
>>
>> > The test program was invoked twice ($SLURM_NTASKS), and each invocation
>> > asked for 2 ($SLURM_NTASKS) CPUs for the MPI program!!
>>
>> Yes.
>>
>> > The problem with srun is actually not about MPI:
>> > >srun -n 2 echo "Hello"
>> > Hello
>> > Hello
>> >
>> > How can I resolve the problem with srun, and make it behave like sbatch
>> > or salloc, where the program is executed only once?
>> >
>> > The version of slurm is 16.05.3, and
>>
>> Thanks,
>> Paddy
>>
>> --
>> Paddy Doyle
>> Trinity Centre for High Performance Computing,
>> Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
>> Phone: +353-1-896-3725
>> http://www.tchpc.tcd.ie/
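For what it's worth, here is a minimal sketch of the batch-script pattern Paddy describes, assuming an OpenMPI build configured with --with-pmi as above; the job name is a placeholder and ompi_info is assumed to be on your PATH:

#!/bin/bash
#SBATCH -n 2
#SBATCH -J mpitest          # placeholder job name

# Optional sanity check: an OpenMPI build configured with --with-pmi
# should list pmi components in its MCA output.
ompi_info | grep -i pmi

# Let srun pick up the task count from the #SBATCH line above;
# don't repeat '-n 2' here.
srun ./mpitest

# ...or, equivalently, let mpirun read the allocation from Slurm:
# mpirun ./mpitest

With a PMI-enabled build, either launcher should start exactly $SLURM_NTASKS ranks. The duplicated "rank 0 of 1" output in job 3611 would be consistent with each srun-launched task falling back to an MPI singleton because OpenMPI couldn't talk to Slurm's PMI.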