That is correct. If you don't specify a slot count, we auto-discover the number
of cores on each node and set #slots to that number. If an RM is involved, then
we use what it gives us.
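
For example, the same node can be listed either way in a hostfile (a minimal
sketch; node21.emperor stands in for one of the dual-core machines reported
below):

  # no slot count given: slots default to the auto-detected core count (2 here)
  node21.emperor

  # explicit count: exactly one slot is used, regardless of cores
  node21.emperor slots=1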

> On Sep 26, 2017, at 8:11 PM, Anthony Thyssen <a.thys...@griffith.edu.au> wrote:
> 
> 
> I have been having problems with OpenMPI on a new cluster of machines, using
> stock RHEL7 packages.
> 
> ASIDE: This will be used with Torque-PBS (from the EPEL archives), though OpenMPI
> (currently) does not have the "tm" resource manager component configured to use
> PBS, as you will be able to see in the debug output below.
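> 
> (A quick check is to grep ompi_info's component listing; "tm" entries should
> only appear if OpenMPI was built with Torque support, and none appear here:)
> 
> # ompi_info | grep tm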
> 
> # mpirun -V
> mpirun (Open MPI) 1.10.6
> 
> # sudo yum list installed openmpi
> ...
> Installed Packages
> openmpi.x86_64    1.10.6-2.el7    @rhel-7-server-rpms
> ...
> 
> More than likely I am doing something fundamentally stupid, but I have no 
> idea what.
> 
> The problem is that OpenMPI is not obeying the given hostfile and running one
> process for each host entry in the list. The manual, and all my (meagre)
> experience, says that is what it is meant to do.
> 
> Instead it runs the maximum number of processes that the CPU of each machine
> allows. That is a nice feature, but NOT what is wanted.
> 
> There is no "/etc/openmpi-x86_64/openmpi-default-hostfile" configuration present.
> 
> For example given the hostfile
> 
> # cat hostfile.txt
> node21.emperor
> node22.emperor
> node22.emperor
> node23.emperor
> 
> Running OpenMPI on the head node "shrek", I get the following,
> (ras debugging enabled to see the result)
> 
> # mpirun --hostfile hostfile.txt --mca ras_base_verbose 5 mpi_hello
> [shrek.emperor:93385] mca:base:select:(  ras) Querying component [gridengine]
> [shrek.emperor:93385] mca:base:select:(  ras) Skipping component [gridengine]. Query failed to return a module
> [shrek.emperor:93385] mca:base:select:(  ras) Querying component [loadleveler]
> [shrek.emperor:93385] mca:base:select:(  ras) Skipping component [loadleveler]. Query failed to return a module
> [shrek.emperor:93385] mca:base:select:(  ras) Querying component [simulator]
> [shrek.emperor:93385] mca:base:select:(  ras) Skipping component [simulator]. Query failed to return a module
> [shrek.emperor:93385] mca:base:select:(  ras) Querying component [slurm]
> [shrek.emperor:93385] mca:base:select:(  ras) Skipping component [slurm]. Query failed to return a module
> [shrek.emperor:93385] mca:base:select:(  ras) No component selected!
> 
> ======================   ALLOCATED NODES   ======================
>         node21.emperor: slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>         node22.emperor: slots=2 max_slots=0 slots_inuse=0 state=UNKNOWN
>         node23.emperor: slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
> =================================================================
> Hello World! from process 0 out of 6 on node21.emperor
> Hello World! from process 2 out of 6 on node22.emperor
> Hello World! from process 1 out of 6 on node21.emperor
> Hello World! from process 3 out of 6 on node22.emperor
> Hello World! from process 4 out of 6 on node23.emperor
> Hello World! from process 5 out of 6 on node23.emperor
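> 
> (For reference: the source of mpi_hello was not included in this mail; it is
> presumably a minimal MPI hello-world along these lines, guessed from the
> output format above:)
> 
>     #include <mpi.h>
>     #include <stdio.h>
> 
>     int main(int argc, char **argv)
>     {
>         int rank, size, len;
>         char name[MPI_MAX_PROCESSOR_NAME];
> 
>         MPI_Init(&argc, &argv);
>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's rank  */
>         MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total process count  */
>         MPI_Get_processor_name(name, &len);    /* node this rank is on */
>         printf("Hello World! from process %d out of %d on %s\n",
>                rank, size, name);
>         MPI_Finalize();
>         return 0;
>     }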
> 
> These machines all have dual-core CPUs.  If a quad-core machine is added to the
> list I get 4 processes on that node. And so on, BUT NOT always.
> 
> Note that the "ALLOCATED NODES" list is NOT obeyed.
> 
> If on the other hand I add "slots=#" to the provided hostfile, it works as expected!
> (the debug output is not included as it is essentially the same as above)
> 
> # awk '{n[$0]++} END {for(i in n)print i,"slots="n[i]}' hostfile.txt > hostfile_slots.txt
> # cat hostfile_slots.txt
> node23.emperor slots=1
> node22.emperor slots=2
> node21.emperor slots=1
> 
> # mpirun --hostfile hostfile_slots.txt mpi_hello
> Hello World! from process 0 out of 4 on node23.emperor
> Hello World! from process 1 out of 4 on node22.emperor
> Hello World! from process 3 out of 4 on node21.emperor
> Hello World! from process 2 out of 4 on node22.emperor
> 
> Or if I convert the hostfile into a comma separated host list it also works.
> 
> # tr '\n' , <hostfile.txt; echo
> node21.emperor,node22.emperor,node22.emperor,node23.emperor,
> # mpirun --host $(tr '\n' , <hostfile.txt) mpi_hello
> Hello World! from process 0 out of 4 on node21.emperor
> Hello World! from process 1 out of 4 on node22.emperor
> Hello World! from process 3 out of 4 on node23.emperor
> Hello World! from process 2 out of 4 on node22.emperor
> 
> 
> Any help as to why --hostfile does not work as expected, even though the debug
> output says it should, would be appreciated.
> 
> As you can see I have been studying this problem for a long time.  Google has not
> been very helpful.  All I seem to get are man pages and general help guides.
> 
> 
>   Anthony Thyssen ( System Programmer )    <a.thys...@griffith.edu.au>
>  --------------------------------------------------------------------------
>   All the books of Power had their own particular nature.
>   The "Octavo" was harsh and imperious.
>   The "Bumper Fun Grimoire" went in for deadly practical jokes.
>   The "Joy of Tantric Sex" had to be kept under iced water.
>                                     -- Terry Pratchett, "Moving Pictures"
>  --------------------------------------------------------------------------