Re: [OMPI users] mpirun 2.1.1 refuses to start a Torque 6.1.1.1 job if I change the scheduler to Maui 3.3.1

2017-08-10 Thread A M

All solved, and everything now works well! The culprit was a missing line in
the "maui.cfg" file:

JOBNODEMATCHPOLICY EXACTNODE

The default value for this variable is EXACTPROC and, with that value in
effect, Maui completely ignores the "-l nodes=N:ppn=M" PBS instruction and
allocates the first M available cores inside the first free node.
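
For anyone checking their own setup, here is a minimal sketch of how one might confirm that the line is actually present and not commented out. The function name show_node_match_policy and the default path /usr/local/maui/maui.cfg are assumptions for illustration and do not come from this thread; pass the real location of your maui.cfg as the argument.

```shell
#!/bin/sh
# Print the active JOBNODEMATCHPOLICY line(s) from a Maui config file.
# Lines starting with '#' do not match, so commented-out settings are ignored.
# The default path is an assumption; pass your own maui.cfg as $1.
show_node_match_policy() {
    cfg="${1:-/usr/local/maui/maui.cfg}"
    grep -E '^[[:space:]]*JOBNODEMATCHPOLICY[[:space:]]+' "$cfg" \
        || echo "JOBNODEMATCHPOLICY not set in $cfg"
}
```

If the function reports that the parameter is not set, the scheduler falls back to whatever default it was built with, which is the situation described above.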

Andy.

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] mpirun 2.1.1 refuses to start a Torque 6.1.1.1 job if I change the scheduler to Maui 3.3.1

2017-08-09 Thread A M

Thanks!

It does indeed look like a problem with Maui's node allocation settings. I
have checked the $PBS_NODEFILE contents (the same can also be seen with
"qstat -n1"): while the default Torque scheduler correctly allocates one
slot on node1 and another slot on node2, with Maui I always see two slots
allocated on one of the nodes. Apparently my allocation policy is not
correct, so I will now go through the maui.cfg file more carefully.
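
A small sketch of that check, for readers who want to repeat it. It assumes, as Torque normally does, that $PBS_NODEFILE lists one hostname per allocated slot; the function name slots_per_node is made up for illustration.

```shell
#!/bin/sh
# Summarise a Torque node file: one output line per host with its slot count.
# Assumes the file holds one hostname per allocated slot.
slots_per_node() {
    sort "$1" | uniq -c | awk '{ print $2 ": " $1 " slot(s)" }'
}

# Inside a job script one would call:
#   slots_per_node "$PBS_NODEFILE"
```

For a "nodes=2:ppn=1" request, a healthy allocation should show two hosts with one slot each; a single host with two slots is the symptom described here.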

Andy.

Re: [OMPI users] mpirun 2.1.1 refuses to start a Torque 6.1.1.1 job if I change the scheduler to Maui 3.3.1

2017-08-09 Thread r...@open-mpi.org

Sounds to me like your Maui scheduler didn’t provide any allocated slots on
the nodes. Did you check $PBS_NODEFILE?


[OMPI users] mpirun 2.1.1 refuses to start a Torque 6.1.1.1 job if I change the scheduler to Maui 3.3.1

2017-08-09 Thread A M

Hello,

I have just run into a strange issue with "mpirun". Here is what happened:

I successfully installed Torque 6.1.1.1 with the plain pbs_sched on a
minimal set of 2 IB nodes. Then I added Open MPI 2.1.1 compiled with verbs
and tm support, and verified that mpirun works as it should with a small
"pingpong" program.

Here is the minimal Torque jobscript I used to check the IB message
passing:

#!/bin/sh
#PBS -o Out
#PBS -e Err
#PBS -l nodes=2:ppn=1
cd $PBS_O_WORKDIR
mpirun -np 2 -pernode ./pingpong 400

The job correctly used IB as the default message-passing interface and
reached 3.6 Gb/sec "pingpong" bandwidth, which is correct in my case since
the two batch nodes have QDR HCAs.

I then stopped "pbs_sched" and started the Maui 3.3.1 scheduler instead.
Serial jobs work without any problem, but the same jobscript now fails
with the following message:


Your job has requested more processes than the ppr for this topology can
support:
App: /lustre/work/user/testus/pingpong
Number of procs:  2
PPR: 1:node
Please revise the conflict and try again.
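
A note on the message above: "-pernode" asks mpirun for one process per allocated node (hence "PPR: 1:node"), so "-np 2" can only be satisfied if the allocation contains at least two distinct nodes. Below is a minimal pre-flight check one could put in the jobscript, sketched under the assumption that $PBS_NODEFILE lists one hostname per allocated slot. The function names are hypothetical and not part of Torque or Open MPI.

```shell
#!/bin/sh
# Count the distinct hosts in a Torque node file.
distinct_nodes() {
    sort -u "$1" | wc -l | tr -d ' '
}

# Return 0 if the node file ($1) has enough distinct hosts for a
# one-process-per-node launch of $2 processes, non-zero otherwise.
enough_nodes_for_pernode() {
    [ "$(distinct_nodes "$1")" -ge "$2" ]
}

# Hypothetical use in the jobscript, before the mpirun line:
#   enough_nodes_for_pernode "$PBS_NODEFILE" 2 || {
#       echo "allocation spans too few nodes for -pernode" >&2
#       exit 1
#   }
```

Failing early like this points at the scheduler's allocation rather than at mpirun itself.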


I then tried the "--nooversubscribe" and "--pernode 2" options, but the
error persisted. It looks like the newest "mpirun" is getting some
information from the latest available Maui scheduler. Going back to
"pbs_sched" is enough to make everything work like a charm. I used a
preexisting "maui.cfg" file which still works well on an older CentOS 6
system with an old 1.8.5 version of Open MPI.

Thanks in advance for any hint or comment on how to address this. Are there
any other mpirun options to try? Should I try to downgrade Open MPI to the
latest 1.x series?

Andy.

mpirun -np 2 -pernode --mca btl ^tcp ./pingpong 400
