Hello Ralph,

When I do -np 7 it still fails with "There are not enough slots available in the system to satisfy the 7 slots that were requested by the application". When I do -np 2 it does run from a machine that was previously failing, but it only uses one other machine; in this case it launched from a machine with 2 processors and ran on a machine with only 1 processor. Anything higher than -np 2 fails the same way.
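To be concrete, here is a minimal sketch of the two invocations (same hostfile as in the thread below; I have left out the --mca tuning flags from the full command, shown further down, for brevity):

# fails when launched from the 2-processor machine
mpirun -np 7 -hostfile /home/soesterreich/ce-mpi-hosts IMB-MPI1

# runs, but only spans this machine and a single 1-processor machine
mpirun -np 2 -hostfile /home/soesterreich/ce-mpi-hosts IMB-MPI1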
-Adam LeBlanc

On Thu, Nov 1, 2018 at 3:53 PM Ralph H Castain <r...@open-mpi.org> wrote:

> Hmmm - try adding a value for nprocs instead of leaving it blank. Say "-np 7"
>
> Sent from my iPhone
>
> On Nov 1, 2018, at 11:56 AM, Adam LeBlanc <alebl...@iol.unh.edu> wrote:
>
> Hello Ralph,
>
> Here is the output for a failing machine:
>
> [130_02:44:13_aleblanc@farbauti]{~}$ mpirun --mca btl_openib_warn_no_device_params_found 0 --mca orte_base_help_aggregate 0 --mca btl openib,vader,self --mca pml ob1 --mca btl_openib_receive_queues P,65536,120,64,32 -hostfile /home/soesterreich/ce-mpi-hosts --mca ras_base_verbose 5 IMB-MPI1
>
> ====================== ALLOCATED NODES ======================
> farbauti: flags=0x11 slots=1 max_slots=0 slots_inuse=0 state=UP
> hyperion-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
> io-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
> jarnsaxa-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
> rhea-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
> tarqeq-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
> tarvos-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
> =================================================================
> --------------------------------------------------------------------------
> There are not enough slots available in the system to satisfy the 7 slots
> that were requested by the application:
>   10
>
> Either request fewer slots for your application, or make more slots available
> for use.
> --------------------------------------------------------------------------
>
> Here is the output for a passing machine:
>
> [1_02:54:26_aleblanc@hyperion]{~}$ mpirun --mca btl_openib_warn_no_device_params_found 0 --mca orte_base_help_aggregate 0 --mca btl openib,vader,self --mca pml ob1 --mca btl_openib_receive_queues P,65536,120,64,32 -hostfile /home/soesterreich/ce-mpi-hosts --mca ras_base_verbose 5 IMB-MPI1
>
> ====================== ALLOCATED NODES ======================
> hyperion: flags=0x11 slots=1 max_slots=0 slots_inuse=0 state=UP
> farbauti-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
> io-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
> jarnsaxa-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
> rhea-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
> tarqeq-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
> tarvos-ce: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
> =================================================================
>
> Yes, the hostfile is available on all nodes through an NFS mount for all of our home directories.
>
> On Thu, Nov 1, 2018 at 2:44 PM Adam LeBlanc <alebl...@iol.unh.edu> wrote:
>
>> ---------- Forwarded message ---------
>> From: Ralph H Castain <r...@open-mpi.org>
>> Date: Thu, Nov 1, 2018 at 2:34 PM
>> Subject: Re: [OMPI users] Bug with Open-MPI Processor Count
>> To: Open MPI Users <users@lists.open-mpi.org>
>>
>> I'm a little under the weather and so will only be able to help a bit at a time. However, a couple of things to check:
>>
>> * add -mca ras_base_verbose 5 to the cmd line to see what mpirun thought the allocation was
>>
>> * is the hostfile available on every node?
>>
>> Ralph
>>
>> On Nov 1, 2018, at 10:55 AM, Adam LeBlanc <alebl...@iol.unh.edu> wrote:
>>
>> Hello Ralph,
>>
>> Attached below is the verbose output for a failing machine and a passing machine.
>>
>> Thanks,
>> Adam LeBlanc
>>
>> On Thu, Nov 1, 2018 at 1:41 PM Adam LeBlanc <alebl...@iol.unh.edu> wrote:
>>
>>> ---------- Forwarded message ---------
>>> From: Ralph H Castain <r...@open-mpi.org>
>>> Date: Thu, Nov 1, 2018 at 1:07 PM
>>> Subject: Re: [OMPI users] Bug with Open-MPI Processor Count
>>> To: Open MPI Users <users@lists.open-mpi.org>
>>>
>>> Set rmaps_base_verbose=10 for debugging output
>>>
>>> Sent from my iPhone
>>>
>>> On Nov 1, 2018, at 9:31 AM, Adam LeBlanc <alebl...@iol.unh.edu> wrote:
>>>
>>> The version, by the way, for Open-MPI is 3.1.2.
>>>
>>> -Adam LeBlanc
>>>
>>> On Thu, Nov 1, 2018 at 12:05 PM Adam LeBlanc <alebl...@iol.unh.edu> wrote:
>>>
>>>> Hello, I am an employee of the UNH InterOperability Lab, and we are in the process of testing OFED-4.17-RC1 for the OpenFabrics Alliance. We have purchased some new hardware that has one processor, and noticed an issue when running MPI jobs on nodes that do not have similar processor counts.
>>>>
>>>> If we launch the MPI job from a node that has 2 processors, it fails, stating that there are not enough resources, and will not start the run, like so:
>>>>
>>>> --------------------------------------------------------------------------
>>>> There are not enough slots available in the system to satisfy the 14 slots
>>>> that were requested by the application:
>>>>   IMB-MPI1
>>>>
>>>> Either request fewer slots for your application, or make more slots available
>>>> for use.
>>>> --------------------------------------------------------------------------
>>>>
>>>> If we launch the MPI job from the node with one processor, without changing the mpirun command at all, it runs as expected. Here is the command being run:
>>>>
>>>> mpirun --mca btl_openib_warn_no_device_params_found 0 --mca orte_base_help_aggregate 0 --mca btl openib,vader,self --mca pml ob1 --mca btl_openib_receive_queues P,65536,120,64,32 -hostfile /home/soesterreich/ce-mpi-hosts IMB-MPI1
>>>>
>>>> Here is the hostfile being used:
>>>>
>>>> farbauti-ce.ofa.iol.unh.edu slots=1
>>>> hyperion-ce.ofa.iol.unh.edu slots=1
>>>> io-ce.ofa.iol.unh.edu slots=1
>>>> jarnsaxa-ce.ofa.iol.unh.edu slots=1
>>>> rhea-ce.ofa.iol.unh.edu slots=1
>>>> tarqeq-ce.ofa.iol.unh.edu slots=1
>>>> tarvos-ce.ofa.iol.unh.edu slots=1
>>>>
>>>> This seems like a bug and we would like some help to explain and fix what is happening. The IBTA plugfest saw similar behaviours, so this should be reproducible.
>>>>
>>>> Thanks,
>>>> Adam LeBlanc
>>
>> <passing_verbose_output.txt><failing_verbose_output.txt>
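P.S. While this gets debugged, here is a sketch of two workarounds we can try on our side, assuming the per-node slot counts declared in the hostfile are what mpirun is enforcing (my assumption, not a confirmed root cause):

# hostfile variant declaring the real core count, e.g. for the 2-processor node
farbauti-ce.ofa.iol.unh.edu slots=2

# or keep the hostfile as-is and let mpirun place more ranks than declared slots
mpirun --oversubscribe -np 7 -hostfile /home/soesterreich/ce-mpi-hosts IMB-MPI1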
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users