I have been having problems with OpenMPI on a new cluster of machines, using
stock RHEL7 packages.

ASIDE: This will be used with Torque-PBS (from the EPEL archives), though
OpenMPI (currently) does not have the "tm" resource manager configured to
use PBS, as you will see in the debug output below.
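
FWIW the missing "tm" support can be confirmed from the build itself; a
quick check, assuming this stock ompi_info lists its RAS components in the
usual "MCA ras:" form:

*# ompi_info | grep -i "MCA ras"*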

*# mpirun -V*
mpirun (Open MPI) 1.10.6

*# sudo yum list installed openmpi*
...
Installed Packages
openmpi.x86_64    1.10.6-2.el7    @rhel-7-server-rpms
...

More than likely I am doing something fundamentally stupid, but I have no
idea what.

The problem is that OpenMPI is not obeying the given hostfile: it should
run one process for each host entry in the list. The manual, and all my
(meagre) experience, says that is what it is meant to do.

Instead it runs the maximum number of processes the CPU of each machine
allows, one per core.  That is a nice feature, but NOT what is wanted.

There is no "/etc/openmpi-x86_64/openmpi-default-hostfile" configuration file present.
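
Just to be sure no default hostfile is sneaking in from elsewhere, the MCA
parameter can also be queried; a sketch, assuming this 1.10 build names the
parameter "orte_default_hostfile":

*# ompi_info --all | grep -i orte_default_hostfile*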

For example given the hostfile

*# cat hostfile.txt*
node21.emperor
node22.emperor
node22.emperor
node23.emperor

Running OpenMPI on the head node "shrek", I get the following
(ras debugging enabled to see the resulting allocation):

*# mpirun --hostfile hostfile.txt --mca ras_base_verbose 5 mpi_hello*
[shrek.emperor:93385] mca:base:select:(  ras) Querying component [gridengine]
[shrek.emperor:93385] mca:base:select:(  ras) Skipping component [gridengine]. Query failed to return a module
[shrek.emperor:93385] mca:base:select:(  ras) Querying component [loadleveler]
[shrek.emperor:93385] mca:base:select:(  ras) Skipping component [loadleveler]. Query failed to return a module
[shrek.emperor:93385] mca:base:select:(  ras) Querying component [simulator]
[shrek.emperor:93385] mca:base:select:(  ras) Skipping component [simulator]. Query failed to return a module
[shrek.emperor:93385] mca:base:select:(  ras) Querying component [slurm]
[shrek.emperor:93385] mca:base:select:(  ras) Skipping component [slurm]. Query failed to return a module
[shrek.emperor:93385] mca:base:select:(  ras) No component selected!

======================   ALLOCATED NODES   ======================
        node21.emperor: slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
        node22.emperor: slots=2 max_slots=0 slots_inuse=0 state=UNKNOWN
        node23.emperor: slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
=================================================================
Hello World! from process 0 out of 6 on node21.emperor
Hello World! from process 2 out of 6 on node22.emperor
Hello World! from process 1 out of 6 on node21.emperor
Hello World! from process 3 out of 6 on node22.emperor
Hello World! from process 4 out of 6 on node23.emperor
Hello World! from process 5 out of 6 on node23.emperor

These machines all have dual-core CPUs.  If a quad-core machine is added to
the list, I get 4 processes on that node. And so on, BUT NOT always.

*Note that the "ALLOCATED NODES" list is NOT obeyed.*
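
Forcing the process count explicitly should also side-step this (an
untested sketch; it assumes the hostfile contains only real entries, no
blanks or comments):

*# mpirun --hostfile hostfile.txt -np $(wc -l < hostfile.txt) mpi_hello*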

If on the other hand I add "slots=#" to the provided hostfile, it works as
expected!
(The debug output is not included as it is essentially the same as above.)


*# awk '{n[$0]++} END {for(i in n)print i,"slots="n[i]}' hostfile.txt > hostfile_slots.txt*
*# cat hostfile_slots.txt*
node23.emperor slots=1
node22.emperor slots=2
node21.emperor slots=1

*# mpirun --hostfile hostfile_slots.txt mpi_hello*
Hello World! from process 0 out of 4 on node23.emperor
Hello World! from process 1 out of 4 on node22.emperor
Hello World! from process 3 out of 4 on node21.emperor
Hello World! from process 2 out of 4 on node22.emperor

Or if I convert the hostfile into a comma-separated host list, it also works.

*# tr '\n' , <hostfile.txt; echo*
node21.emperor,node22.emperor,node22.emperor,node23.emperor,
*# mpirun --host $(tr '\n' , <hostfile.txt) mpi_hello*
Hello World! from process 0 out of 4 on node21.emperor
Hello World! from process 1 out of 4 on node22.emperor
Hello World! from process 3 out of 4 on node23.emperor
Hello World! from process 2 out of 4 on node22.emperor
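
The tr conversion leaves a trailing comma, which this mpirun apparently
tolerates; if it ever bites, a sketch of the cleanup:

*# mpirun --host $(tr '\n' , <hostfile.txt | sed 's/,$//') mpi_hello*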


Any help as to why --hostfile does not work as expected, even though the
debug output says it should, would be appreciated.

As you can see, I have been studying this problem for a long time.  Google
has not been very helpful; all I seem to get are man pages and general help
guides.


  Anthony Thyssen ( System Programmer )    <a.thys...@griffith.edu.au>
 --------------------------------------------------------------------------
  All the books of Power had their own particular nature.
  The "Octavo" was harsh and imperious.
  The "Bumper Fun Grimore" went in for deadly practical jokes.
  The "Joy of Tantric Sex" had to be kept under iced water.
                                    -- Terry Pratchett, "Moving Pictures"
 --------------------------------------------------------------------------