I have been having problems with OpenMPI on a new cluster of machines, using
stock RHEL7 packages.

ASIDE: This will be used with Torque-PBS (from the EPEL archives), though it
(currently) does not have the "tm" resource manager configured to use PBS,
as you will be able to see in the debug output below.

*# mpirun -V*
mpirun (Open MPI) 1.10.6

*# sudo yum list installed openmpi*
Installed Packages
openmpi.x86_64    1.10.6-2.el7    @rhel-7-server-rpms

More than likely I am doing something fundamentally stupid, but I have no
idea what.

The problem is that OpenMPI is not obeying the given hostfile and running one
process for each host listed in it. The manual and all my (meagre)
understanding say that is what it is meant to do.

Instead it runs the maximum number of processes that is allowed to run for
the CPU of that machine.  That is a nice feature, but NOT what is wanted.

There is no "/etc/openmpi-x86_64/openmpi-default-hostfile" configuration.

For example given the hostfile

*# cat hostfile.txt*

Running OpenMPI on the head node "shrek", I get the following,
(ras debugging enabled to see the result)

*# mpirun --hostfile hostfile.txt --mca ras_base_verbose 5 mpi_hello*
[shrek.emperor:93385] mca:base:select:(  ras) Querying component [gridengine]
[shrek.emperor:93385] mca:base:select:(  ras) Skipping component [gridengine]. Query failed to return a module
[shrek.emperor:93385] mca:base:select:(  ras) Querying component [loadleveler]
[shrek.emperor:93385] mca:base:select:(  ras) Skipping component [loadleveler]. Query failed to return a module
[shrek.emperor:93385] mca:base:select:(  ras) Querying component [simulator]
[shrek.emperor:93385] mca:base:select:(  ras) Skipping component [simulator]. Query failed to return a module
[shrek.emperor:93385] mca:base:select:(  ras) Querying component [slurm]
[shrek.emperor:93385] mca:base:select:(  ras) Skipping component [slurm]. Query failed to return a module
[shrek.emperor:93385] mca:base:select:(  ras) No component selected!

======================   ALLOCATED NODES   ======================
        node21.emperor: slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
        node22.emperor: slots=2 max_slots=0 slots_inuse=0 state=UNKNOWN
        node23.emperor: slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
Hello World! from process 0 out of 6 on node21.emperor
Hello World! from process 2 out of 6 on node22.emperor
Hello World! from process 1 out of 6 on node21.emperor
Hello World! from process 3 out of 6 on node22.emperor
Hello World! from process 4 out of 6 on node23.emperor
Hello World! from process 5 out of 6 on node23.emperor

These machines all have dual-core CPUs.  If a quad-core machine is added to
the list I get 4 processes on that node. And so on, BUT NOT always.

*Note that the "ALLOCATED NODES" list is NOT obeyed.*
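
To make the mismatch easier to see, the per-host rank count can be tallied
from the program output. This is only a sketch: the function name
count_ranks_per_host is made up for illustration, and it assumes mpi_hello
prints lines ending in "on <hostname>" exactly as shown above.

```shell
# Tally how many ranks reported from each host.
# Assumption: each line of interest starts with "Hello World!" and ends
# with the hostname, as in the mpi_hello output above.
count_ranks_per_host() {
    awk '/^Hello World!/ { n[$NF]++ } END { for (h in n) print h, n[h] }' | sort
}

# Usage (not run here):
#   mpirun --hostfile hostfile.txt mpi_hello | count_ranks_per_host
```

Fed the six "Hello World!" lines above, this reports 2 ranks for each of the
three nodes, against the 1, 2, 1 slots in the "ALLOCATED NODES" list.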

If on the other hand I add "slots=#" to the provided hostfile, it works as
expected (the debug output was not included as it is essentially the same as
above).

*# awk '{n[$0]++} END {for(i in n)print i,"slots="n[i]}' hostfile.txt > hostfile_slots.txt*
*# cat hostfile_slots.txt*
node23.emperor slots=1
node22.emperor slots=2
node21.emperor slots=1

*# mpirun --hostfile hostfile_slots.txt mpi_hello*
Hello World! from process 0 out of 4 on node23.emperor
Hello World! from process 1 out of 4 on node22.emperor
Hello World! from process 3 out of 4 on node21.emperor
Hello World! from process 2 out of 4 on node22.emperor
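
The awk one-liner above walks the array with "for (i in n)", whose order is
unspecified, which is why the hosts come out reordered. A variant that keeps
the hosts in first-seen order and skips blank lines might look like the
following sketch. The function name hostfile_to_slots is hypothetical, and it
assumes one hostname per line with nothing else on the line.

```shell
# Convert a one-host-per-line hostfile into "host slots=N" form,
# keeping hosts in first-seen order and ignoring blank lines.
# Assumption: the hostname is the first (and only) field on each line.
hostfile_to_slots() {
    awk 'NF { if (!($1 in n)) order[++k] = $1; n[$1]++ }
         END { for (i = 1; i <= k; i++) print order[i], "slots=" n[order[i]] }' "$1"
}

# Usage (not run here):
#   hostfile_to_slots hostfile.txt > hostfile_slots.txt
```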

Or if I convert the hostfile into a comma separated host list it also works.

*# tr '\n' , <hostfile.txt; echo*
*# mpirun --host $(tr '\n' , <hostfile.txt) mpi_hello*
Hello World! from process 0 out of 4 on node21.emperor
Hello World! from process 1 out of 4 on node22.emperor
Hello World! from process 3 out of 4 on node23.emperor
Hello World! from process 2 out of 4 on node22.emperor
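
The tr conversion above leaves a trailing comma on the list, which was
evidently tolerated in this run. A sketch that avoids it, using paste instead
of tr, is below. The function name hostfile_to_list is hypothetical, and it
assumes one hostname per line.

```shell
# Join the non-blank lines of a hostfile with commas, with no trailing comma.
# Assumption: one hostname per line; blank lines are dropped first.
hostfile_to_list() {
    grep -v '^[[:space:]]*$' "$1" | paste -sd, -
}

# Usage (not run here):
#   mpirun --host "$(hostfile_to_list hostfile.txt)" mpi_hello
```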

Any help as to why --hostfile does not work as expected, even though the
debug output says it should be working, would be appreciated.

As you can see I have been studying this problem a long time.  Google has not
been very helpful.  All I seem to get are man pages, and general help

