I have been having problems with OpenMPI on a new cluster of machines, using stock RHEL7 packages.
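To confirm what the stock build actually contains, the ompi_info tool from the same openmpi package can list the compiled-in resource-allocation ("ras") components. I am quoting the command from memory, so the grep pattern may need adjusting:

*# ompi_info | grep ' ras:'*

On this build that lists gridengine, loadleveler, simulator and slurm, but no "tm".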
ASIDE: This cluster will be used with Torque-PBS (from the EPEL archives), though OpenMPI (currently) does not have the "tm" resource manager component configured to use PBS, as you will be able to see in the debug output below.

*# mpirun -V*
mpirun (Open MPI) 1.10.6

*# sudo yum list installed openmpi*
...
Installed Packages
openmpi.x86_64        1.10.6-2.el7        @rhel-7-server-rpms
...

More than likely I am doing something fundamentally stupid, but I have no idea what.

The problem is that OpenMPI is not obeying the given hostfile, that is, running one process for each host entry in the list. The manual, and all my (meagre) experience, says that is what it is meant to do. Instead it runs the maximum number of processes the CPU of each machine allows. That is a nice feature, but NOT what is wanted here. There is no "/etc/openmpi-x86_64/openmpi-default-hostfile" configuration present.

For example, given the hostfile

*# cat hostfile.txt*
node21.emperor
node22.emperor
node22.emperor
node23.emperor

and running OpenMPI on the head node "shrek", I get the following (ras debugging enabled to see the result):

*# mpirun --hostfile hostfile.txt --mca ras_base_verbose 5 mpi_hello*
[shrek.emperor:93385] mca:base:select:( ras) Querying component [gridengine]
[shrek.emperor:93385] mca:base:select:( ras) Skipping component [gridengine]. Query failed to return a module
[shrek.emperor:93385] mca:base:select:( ras) Querying component [loadleveler]
[shrek.emperor:93385] mca:base:select:( ras) Skipping component [loadleveler]. Query failed to return a module
[shrek.emperor:93385] mca:base:select:( ras) Querying component [simulator]
[shrek.emperor:93385] mca:base:select:( ras) Skipping component [simulator]. Query failed to return a module
[shrek.emperor:93385] mca:base:select:( ras) Querying component [slurm]
[shrek.emperor:93385] mca:base:select:( ras) Skipping component [slurm]. Query failed to return a module
[shrek.emperor:93385] mca:base:select:( ras) No component selected!

======================   ALLOCATED NODES   ======================
        node21.emperor: slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
        node22.emperor: slots=2 max_slots=0 slots_inuse=0 state=UNKNOWN
        node23.emperor: slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
=================================================================
Hello World! from process 0 out of 6 on node21.emperor
Hello World! from process 2 out of 6 on node22.emperor
Hello World! from process 1 out of 6 on node21.emperor
Hello World! from process 3 out of 6 on node22.emperor
Hello World! from process 4 out of 6 on node23.emperor
Hello World! from process 5 out of 6 on node23.emperor

These machines all have dual-core CPUs. If a quad-core machine is added to the list I get 4 processes on that node. And so on, BUT NOT always.

*Note that the "ALLOCATED NODES" list is NOT obeyed.*

If, on the other hand, I add "slots=#" to the provided hostfile, it works as expected! (The debug output is not included as it is essentially the same as above.)

*# awk '{n[$0]++} END {for(i in n)print i,"slots="n[i]}' hostfile.txt > hostfile_slots.txt*
*# cat hostfile_slots.txt*
node23.emperor slots=1
node22.emperor slots=2
node21.emperor slots=1

*# mpirun --hostfile hostfile_slots.txt mpi_hello*
Hello World! from process 0 out of 4 on node23.emperor
Hello World! from process 1 out of 4 on node22.emperor
Hello World! from process 3 out of 4 on node21.emperor
Hello World! from process 2 out of 4 on node22.emperor
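(For reference, "mpi_hello" is nothing special; it is essentially the textbook MPI hello-world, along these lines. This is a from-memory sketch, not the exact source:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, len;
        char name[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total processes started */
        MPI_Get_processor_name(name, &len);     /* node the process landed on */
        printf("Hello World! from process %d out of %d on %s\n", rank, size, name);
        MPI_Finalize();
        return 0;
    }

compiled with "mpicc mpi_hello.c -o mpi_hello".)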
Or if I convert the hostfile into a comma-separated host list, it also works:

*# tr '\n' , <hostfile.txt; echo*
node21.emperor,node22.emperor,node22.emperor,node23.emperor,

*# mpirun --host $(tr '\n' , <hostfile.txt) mpi_hello*
Hello World! from process 0 out of 4 on node21.emperor
Hello World! from process 1 out of 4 on node22.emperor
Hello World! from process 3 out of 4 on node23.emperor
Hello World! from process 2 out of 4 on node22.emperor

Any help as to why --hostfile does not work as expected, when the debug output says it should be working, would be appreciated. As you can see, I have been studying this problem for a long time. Google has not been very helpful; all I seem to get are man pages and general help guides.

Anthony Thyssen ( System Programmer )    <a.thys...@griffith.edu.au>
--------------------------------------------------------------------------
All the books of Power had their own particular nature.  The "Octavo" was
harsh and imperious.  The "Bumper Fun Grimoire" went in for deadly
practical jokes.  The "Joy of Tantric Sex" had to be kept under iced
water.  -- Terry Pratchett, "Moving Pictures"
--------------------------------------------------------------------------