That is correct. If you don't specify a slot count, we auto-discover the number of cores on each node and set the slot count to that number. If a resource manager (RM) is involved, we use the allocation it gives us instead.
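If the goal is one process per hostfile entry, the remedy is to make the slot counts explicit rather than letting them be auto-discovered. A minimal sketch using the hostnames from your example (this is the same "slots=" hostfile syntax your later test confirms works):

  # cat hostfile_slots.txt
  node21.emperor slots=1
  node22.emperor slots=2
  node23.emperor slots=1

  # mpirun --hostfile hostfile_slots.txt mpi_hello

You can also cap the total rank count with -np (e.g. "mpirun -np 4 --hostfile hostfile.txt mpi_hello"), but note that without explicit slots the ranks are still placed according to the auto-discovered core counts.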
Sent from my iPad

> On Sep 26, 2017, at 8:11 PM, Anthony Thyssen <a.thys...@griffith.edu.au> wrote:
>
> I have been having problems with OpenMPI on a new cluster of machines,
> using stock RHEL7 packages.
>
> ASIDE: This will be used with Torque-PBS (from the EPEL archives), though
> OpenMPI (currently) does not have the "tm" resource manager configured to
> use PBS, as you will be able to see in the debug output below.
>
>   # mpirun -V
>   mpirun (Open MPI) 1.10.6
>
>   # sudo yum list installed openmpi
>   ...
>   Installed Packages
>   openmpi.x86_64    1.10.6-2.el7    @rhel-7-server-rpms
>   ...
>
> More than likely I am doing something fundamentally stupid, but I have no
> idea what.
>
> The problem is that OpenMPI is not obeying the given hostfile by running
> one process for each host entry in the list, which is what the manual and
> all my (meagre) experience say it is meant to do.
>
> Instead it runs as many processes on each machine as that machine's CPU
> allows. That is a nice feature, but NOT what is wanted.
>
> There is no "/etc/openmpi-x86_64/openmpi-default-hostfile" configuration
> present.
>
> For example, given the hostfile
>
>   # cat hostfile.txt
>   node21.emperor
>   node22.emperor
>   node22.emperor
>   node23.emperor
>
> and running OpenMPI on the head node "shrek" (with ras debugging enabled
> to see the result), I get the following:
>
>   # mpirun --hostfile hostfile.txt --mca ras_base_verbose 5 mpi_hello
>   [shrek.emperor:93385] mca:base:select:( ras) Querying component [gridengine]
>   [shrek.emperor:93385] mca:base:select:( ras) Skipping component [gridengine]. Query failed to return a module
>   [shrek.emperor:93385] mca:base:select:( ras) Querying component [loadleveler]
>   [shrek.emperor:93385] mca:base:select:( ras) Skipping component [loadleveler]. Query failed to return a module
>   [shrek.emperor:93385] mca:base:select:( ras) Querying component [simulator]
>   [shrek.emperor:93385] mca:base:select:( ras) Skipping component [simulator]. Query failed to return a module
>   [shrek.emperor:93385] mca:base:select:( ras) Querying component [slurm]
>   [shrek.emperor:93385] mca:base:select:( ras) Skipping component [slurm]. Query failed to return a module
>   [shrek.emperor:93385] mca:base:select:( ras) No component selected!
>
>   ======================   ALLOCATED NODES   ======================
>       node21.emperor: slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>       node22.emperor: slots=2 max_slots=0 slots_inuse=0 state=UNKNOWN
>       node23.emperor: slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>   =================================================================
>   Hello World! from process 0 out of 6 on node21.emperor
>   Hello World! from process 2 out of 6 on node22.emperor
>   Hello World! from process 1 out of 6 on node21.emperor
>   Hello World! from process 3 out of 6 on node22.emperor
>   Hello World! from process 4 out of 6 on node23.emperor
>   Hello World! from process 5 out of 6 on node23.emperor
>
> These machines all have dual-core CPUs. If a quad-core machine is added to
> the list I get 4 processes on that node, and so on. BUT NOT always.
>
> Note that the "ALLOCATED NODES" list is NOT obeyed.
>
> If, on the other hand, I add "slots=#" to the provided hostfile, it works
> as expected!
> (The debug output is not included as it is essentially the same as above.)
>
>   # awk '{n[$0]++} END {for(i in n) print i, "slots=" n[i]}' hostfile.txt > hostfile_slots.txt
>   # cat hostfile_slots.txt
>   node23.emperor slots=1
>   node22.emperor slots=2
>   node21.emperor slots=1
>
>   # mpirun --hostfile hostfile_slots.txt mpi_hello
>   Hello World! from process 0 out of 4 on node23.emperor
>   Hello World! from process 1 out of 4 on node22.emperor
>   Hello World! from process 3 out of 4 on node21.emperor
>   Hello World! from process 2 out of 4 on node22.emperor
>
> Likewise, if I convert the hostfile into a comma-separated host list, it
> also works:
>
>   # tr '\n' , <hostfile.txt; echo
>   node21.emperor,node22.emperor,node22.emperor,node23.emperor,
>   # mpirun --host $(tr '\n' , <hostfile.txt) mpi_hello
>   Hello World! from process 0 out of 4 on node21.emperor
>   Hello World! from process 1 out of 4 on node22.emperor
>   Hello World! from process 3 out of 4 on node23.emperor
>   Hello World! from process 2 out of 4 on node22.emperor
>
> Any help as to why --hostfile does not work as expected, even though the
> debug output says it should, would be appreciated.
>
> As you can see, I have been studying this problem for a long time. Google
> has not been very helpful; all I seem to get are man pages and general
> help guides.
>
>
> Anthony Thyssen ( System Programmer )    <a.thys...@griffith.edu.au>
> --------------------------------------------------------------------------
>   All the books of Power had their own particular nature.
>   The "Octavo" was harsh and imperious.
>   The "Bumper Fun Grimore" went in for deadly practical jokes.
>   The "Joy of Tantric Sex" had to be kept under iced water.
>                          -- Terry Pratchett, "Moving Pictures"
> --------------------------------------------------------------------------
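Regarding the Torque-PBS aside: you can confirm whether a build has tm support at all by querying ompi_info (a sketch; the exact component list varies by build):

  # ompi_info | grep -i ' tm '

If no "tm" components (plm/ras) are listed, the package was built without Torque support; building Open MPI with the --with-tm configure option lets mpirun take its allocation directly from PBS, which is the RM case described above.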
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users