I think this is what Paddy was getting at above: the `-n` argument to
`srun` will run `-n` copies of the program indicated on the command line.
This is how `srun` was designed and isn't a bug or problem.
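
For example (a quick illustration with hypothetical output, run inside a
two-task allocation), srun starts one copy of the command per task:

$ srun -n 2 hostname
node114
node114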

I guess the question is what you're trying to accomplish.  From appearances,
you want to run an MPI program interactively from the command line.  For that,
I'd say `salloc` as you've done it is the way to go.  As rhc indicates above,
if your MPI is built with PMI you don't need to tell `mpirun` what to do; it
will detect the number and location of the assigned nodes:

$ salloc -n 4 -N 4
salloc: Pending job allocation 46557547
salloc: job 46557547 queued and waiting for resources
salloc: job 46557547 has been allocated resources
salloc: Granted job allocation 46557547
$ srun hostname
node114
node311
node310
node312
$ mpirun bin/helloMPI
Hello world from processor node114, rank 0 out of 4 processors
Hello world from processor node312, rank 3 out of 4 processors
Hello world from processor node310, rank 1 out of 4 processors
Hello world from processor node311, rank 2 out of 4 processors
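
Since your Slurm is 16.05, you could also try launching the MPI program
directly with srun from inside the allocation, as rhc suggests.  A rough
sketch, assuming your Slurm was built with the PMIx plugin (check
`srun --mpi=list` to see which MPI plugins are available on your system):

$ srun --mpi=pmix bin/helloMPI

which should start the same four ranks as the mpirun example above.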

HTH


On Mon, Jan 23, 2017 at 1:36 PM, r...@open-mpi.org <r...@open-mpi.org> wrote:

>
> Note that 16.05 contains support for PMIx, so if you are using OMPI 2.0 or
> above, you should ensure that the slurm PMIx support is configured “on” and
> use that for srun (I believe you have to tell srun the PMI version to use,
> so perhaps “srun --mpi=pmix”?)
>
>
> > On Jan 23, 2017, at 7:10 AM, TO_Webmaster <luftha...@gmail.com> wrote:
> >
> >
> > Is this OpenMPI? We experienced similar behaviour with OpenMPI. This
> > was fixed after recompiling OpenMPI with PMI, i.e.
> >
> > ./configure [...] --with-pmi=/path/to/slurm [...]
> >
> > 2017-01-23 14:22 GMT+01:00 liu junjun <ljjl...@gmail.com>:
> >> Hi Paddy,
> >>
> >> Thanks a lot for your kind help!
> >>
> >> Replacing mpirun with srun still doesn't seem to work. Here's how I did it:
> >>> cat a.sh
> >> #!/bin/bash
> >> srun ./mpitest
> >>> sbatch -n 2 ./a.sh
> >> Submitted batch job 3611
> >>> cat slurm-3611.out
> >> Hello world: rank 0 of 1 running on cdd001
> >> Hello world: rank 0 of 1 running on cdd001
> >>
> >> So, srun just executed the program twice, instead of running it in parallel.
> >>
> >> If I change srun back to mpirun, the correct output is produced:
> >>> cat a.sh
> >> #!/bin/bash
> >> mpirun ./mpitest
> >>> sbatch a.sh
> >> Submitted batch job 3612
> >>> cat slurm-3612.out
> >> Hello world: rank 0 of 2 running on cdd001
> >> Hello world: rank 1 of 2 running on cdd001
> >>
> >> I also tried using srun inside a bash script with a serial program:
> >>> cat a.sh
> >> #!/bin/bash
> >> srun echo Hello
> >>> sbatch -n 2 ./a.sh
> >> Submitted batch job 3614
> >>> cat slurm-3614.out
> >> Hello
> >> Hello
> >>
> >> Any idea?
> >>
> >> Thanks in advance!
> >>
> >> Junjun
> >>
> >>
> >>
> >> On Mon, Jan 23, 2017 at 6:16 PM, Paddy Doyle <pa...@tchpc.tcd.ie> wrote:
> >>>
> >>>
> >>> Hi Junjun,
> >>>
> >>> On Mon, Jan 23, 2017 at 12:04:17AM -0800, liu junjun wrote:
> >>>
> >>>> Hi all,
> >>>>
> >>>> I have a small MPI test program that just prints the rank ID of a parallel
> >>>> job.
> >>>> The output is like this:
> >>>>> mpirun -n 2 ./mpitest
> >>>> Hello world: rank 0 of 2 running on cddlogin
> >>>> Hello world: rank 1 of 2 running on cddlogin
> >>>>
> >>>> I ran this test program with salloc. It produces similar output:
> >>>>> salloc -n 2
> >>>> salloc: Granted job allocation 3605
> >>>>> mpirun -n 2 ./mpitest
> >>>> Hello world: rank 0 of 2 running on cdd001
> >>>> Hello world: rank 1 of 2 running on cdd001
> >>>>
> >>>> I put this one-line command into a bash script to run with sbatch. It
> >>>> also gets the same result, as expected. However, it is totally different
> >>>> if it is run with srun:
> >>>>> srun -n 2 mpirun -n 2 ./mpitest
> >>>> Hello world: rank 0 of 2 running on cdd001
> >>>> Hello world: rank 1 of 2 running on cdd001
> >>>> Hello world: rank 0 of 2 running on cdd001
> >>>> Hello world: rank 1 of 2 running on cdd001
> >>>
> >>> That looks like expected behaviour from calling both srun and mpirun; I've
> >>> never tried it, but it looks like what might happen if you call them both.
> >>>
> >>> But it's not recommended to run your code like that.
> >>>
> >>> Basically, don't call both srun and mpirun! In your sbatch script, either
> >>> put:
> >>>
> >>>  #SBATCH -n 2
> >>>  ....
> >>>  mpirun ./mpitest
> >>>
> >>>
> >>> ..or:
> >>>
> >>>
> >>>  #SBATCH -n 2
> >>>  ....
> >>>  srun ./mpitest
> >>>
> >>>
> >>> You don't need both. And it's simpler not to repeat the '-n 2' again in the
> >>> mpirun/srun line, as it will lead to copy/paste errors when you change it
> >>> in the '#SBATCH' line but not below.
> >>>
> >>>> The test program was invoked twice ($SLURM_NTASKS), and each invocation
> >>>> asked for 2 ($SLURM_NTASKS) CPUs for the MPI program!!
> >>>
> >>> Yes.
> >>>
> >>>> The problem with srun is actually not about MPI:
> >>>>> srun -n 2 echo "Hello"
> >>>> Hello
> >>>> Hello
> >>>>
> >>>> How can I resolve the problem with srun and make it behave like sbatch or
> >>>> salloc, where the program is executed only once?
> >>>>
> >>>> The version of slurm is 16.05.3, and
> >>>
> >>> Thanks,
> >>> Paddy
> >>>
> >>> --
> >>> Paddy Doyle
> >>> Trinity Centre for High Performance Computing,
> >>> Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
> >>> Phone: +353-1-896-3725
> >>> http://www.tchpc.tcd.ie/
> >>
> >>
>
