Is this OpenMPI? We experienced similar behaviour with OpenMPI. It was
fixed by recompiling OpenMPI with PMI support, i.e.

./configure [...] --with-pmi=/path/to/slurm [...]
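
To check whether an existing build already has PMI support, you can look at the
MCA components (assuming the OpenMPI you run against is first in your PATH):

ompi_info | grep -i pmi

and list the PMI plugins that Slurm itself provides with:

srun --mpi=list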

2017-01-23 14:22 GMT+01:00 liu junjun <ljjl...@gmail.com>:
> Hi Paddy,
>
> Thanks a lot for your kind help!
>
> Replacing mpirun with srun still doesn't seem to work. Here's what I did:
>>cat a.sh
> #!/bin/bash
> srun ./mpitest
>>sbatch -n 2 ./a.sh
> Submitted batch job 3611
>>cat slurm-3611.out
> Hello world: rank 0 of 1 running on cdd001
> Hello world: rank 0 of 1 running on cdd001
>
> So, srun just executed the program twice, instead of running it in parallel.
>
> If I change srun back to mpirun, the correct output is produced:
>>cat a.sh
> #!/bin/bash
> mpirun ./mpitest
>> sbatch a.sh
> Submitted batch job 3612
>>cat slurm-3612.out
> Hello world: rank 0 of 2 running on cdd001
> Hello world: rank 1 of 2 running on cdd001
>
> I also tried using srun inside the bash script with a serial program:
>>cat a.sh
> #!/bin/bash
> srun echo Hello
>>sbatch -n 2 ./a.sh
> Submitted batch job 3614
>>cat slurm-3614.out
> Hello
> Hello
>
> Any idea?
>
> Thanks in advance!
>
> Junjun
>
>
>
> On Mon, Jan 23, 2017 at 6:16 PM, Paddy Doyle <pa...@tchpc.tcd.ie> wrote:
>>
>>
>> Hi Junjun,
>>
>> On Mon, Jan 23, 2017 at 12:04:17AM -0800, liu junjun wrote:
>>
>> > Hi all,
>> >
>> > I have a small MPI test program that just prints the rank id of a parallel
>> > job.
>> > The output is like this:
>> > >mpirun -n 2 ./mpitest
>> > Hello world: rank 0 of 2 running on cddlogin
>> > Hello world: rank 1 of 2 running on cddlogin
>> >
>> > I ran this test program with salloc. It produces similar output:
>> > >salloc -n 2
>> > salloc: Granted job allocation 3605
>> > >mpirun -n 2 ./mpitest
>> > Hello world: rank 0 of 2 running on cdd001
>> > Hello world: rank 1 of 2 running on cdd001
>> >
>> > I put this one-line command into a bash script to run with sbatch. It also
>> > gets the same result, as expected. However, the result is totally different
>> > if it is run with srun:
>> > >srun -n 2 mpirun -n 2 ./mpitest
>> > Hello world: rank 0 of 2 running on cdd001
>> > Hello world: rank 1 of 2 running on cdd001
>> > Hello world: rank 0 of 2 running on cdd001
>> > Hello world: rank 1 of 2 running on cdd001
>>
>> That looks like the expected behaviour from calling both srun and mpirun; I've
>> never tried it, but that's what I'd expect to happen when you call them both.
>>
>> But it's not recommended to run your code like that.
>>
>> Basically, don't call both srun and mpirun! In your sbatch script, either
>> put:
>>
>>   #SBATCH -n 2
>>   ....
>>   mpirun ./mpitest
>>
>>
>> ..or:
>>
>>
>>   #SBATCH -n 2
>>   ....
>>   srun ./mpitest
>>
>>
>> You don't need both. And it's simpler not to repeat the '-n 2' on the
>> mpirun/srun line, as that can lead to copy/paste errors when you change it in
>> the '#SBATCH' line but not below.
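>>
>> A complete minimal job script (using your mpitest example) would then be
>> something like:
>>
>>   #!/bin/bash
>>   #SBATCH -n 2
>>   srun ./mpitest
>>
>> ..and you'd submit it with just 'sbatch a.sh'.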
>>
>> > The test program was invoked twice ($SLURM_NTASKS), and each invocation
>> > asked for 2 ($SLURM_NTASKS) CPUs for the MPI program!
>>
>> Yes.
>>
>> > The problem with srun is actually not specific to MPI:
>> > >srun -n 2 echo "Hello"
>> > Hello
>> > Hello
>> >
>> > How can I resolve the problem with srun, so that it behaves like sbatch or
>> > salloc, where the program is executed only once?
>> >
>> > The version of slurm is 16.05.3, and
>>
>> Thanks,
>> Paddy
>>
>> --
>> Paddy Doyle
>> Trinity Centre for High Performance Computing,
>> Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
>> Phone: +353-1-896-3725
>> http://www.tchpc.tcd.ie/
>
>
