Just to be clear, the hostfile contains the correct info:

dancer00 slots=8
dancer01 slots=8

The allocation output below for those two nodes (dancer00 and dancer01) is clearly wrong: they are reported with slots=1 instead of the slots=8 given in the hostfile.
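
If I read the mpirun man page correctly, a possible workaround is to attach the slot count directly to the --host argument (host:slots syntax). A sketch, untested here, using the same application as the run below:

$ mpirun -np 4 --host dancer00:8,dancer01:8 startup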

  George.



On Tue, Apr 25, 2017 at 4:32 PM, George Bosilca <bosi...@icl.utk.edu> wrote:

> I confirm a similar issue in a more managed environment. I have a
> hostfile that has worked for the last few years and that spans a small
> cluster (30 nodes of 8 cores each).
>
> Trying to spawn processes across P nodes fails whenever the number of
> processes is larger than P, despite the fact that there are more than
> enough resources and that this information is provided via the hostfile.
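>
> To illustrate with the two hosts from the command below (any -np larger
> than the number of listed hosts triggers the error):
>
> $ mpirun -np 2 --host dancer00,dancer01 startup    # runs
> $ mpirun -np 3 --host dancer00,dancer01 startup    # "not enough slots"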
>
> George.
>
>
> $ mpirun -mca ras_base_verbose 10 --display-allocation -np 4 --host dancer00,dancer01 --map-by
>
> [dancer.icl.utk.edu:13457] mca: base: components_register: registering framework ras components
> [dancer.icl.utk.edu:13457] mca: base: components_register: found loaded component simulator
> [dancer.icl.utk.edu:13457] mca: base: components_register: component simulator register function successful
> [dancer.icl.utk.edu:13457] mca: base: components_register: found loaded component slurm
> [dancer.icl.utk.edu:13457] mca: base: components_register: component slurm register function successful
> [dancer.icl.utk.edu:13457] mca: base: components_register: found loaded component loadleveler
> [dancer.icl.utk.edu:13457] mca: base: components_register: component loadleveler register function successful
> [dancer.icl.utk.edu:13457] mca: base: components_register: found loaded component tm
> [dancer.icl.utk.edu:13457] mca: base: components_register: component tm register function successful
> [dancer.icl.utk.edu:13457] mca: base: components_open: opening ras components
> [dancer.icl.utk.edu:13457] mca: base: components_open: found loaded component simulator
> [dancer.icl.utk.edu:13457] mca: base: components_open: found loaded component slurm
> [dancer.icl.utk.edu:13457] mca: base: components_open: component slurm open function successful
> [dancer.icl.utk.edu:13457] mca: base: components_open: found loaded component loadleveler
> [dancer.icl.utk.edu:13457] mca: base: components_open: component loadleveler open function successful
> [dancer.icl.utk.edu:13457] mca: base: components_open: found loaded component tm
> [dancer.icl.utk.edu:13457] mca: base: components_open: component tm open function successful
> [dancer.icl.utk.edu:13457] mca:base:select: Auto-selecting ras components
> [dancer.icl.utk.edu:13457] mca:base:select:(  ras) Querying component [simulator]
> [dancer.icl.utk.edu:13457] mca:base:select:(  ras) Querying component [slurm]
> [dancer.icl.utk.edu:13457] mca:base:select:(  ras) Querying component [loadleveler]
> [dancer.icl.utk.edu:13457] mca:base:select:(  ras) Querying component [tm]
> [dancer.icl.utk.edu:13457] mca:base:select:(  ras) No component selected!
>
> ======================   ALLOCATED NODES   ======================
> dancer00: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer01: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer02: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer03: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer04: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer05: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer06: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer07: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer08: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer09: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer10: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer11: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer12: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer13: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer14: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer15: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer16: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer17: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer18: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer19: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer20: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer21: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer22: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer23: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer24: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer25: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer26: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer27: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer28: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer29: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer30: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer31: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> =================================================================
>
> ======================   ALLOCATED NODES   ======================
> dancer00: flags=0x13 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer01: flags=0x13 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer02: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer03: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer04: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer05: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer06: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer07: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer08: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer09: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer10: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer11: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer12: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer13: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer14: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer15: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer16: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer17: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer18: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer19: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer20: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer21: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer22: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer23: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer24: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer25: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer26: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer27: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer28: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer29: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer30: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer31: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> =================================================================
> --------------------------------------------------------------------------
> There are not enough slots available in the system to satisfy the 4 slots
> that were requested by the application:
>   startup
>
> Either request fewer slots for your application, or make more slots available
> for use.
> --------------------------------------------------------------------------
>
>
>
>
> On Tue, Apr 25, 2017 at 4:00 PM, r...@open-mpi.org <r...@open-mpi.org>
> wrote:
>
>> Okay - so effectively you have no hostfile, and no allocation. So this is
>> running just on the one node where mpirun exists?
>>
>> Add “-mca ras_base_verbose 10 --display-allocation” to your cmd line and
>> let’s see what it found
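>>
>> For example, combined with the command from your first mail:
>>
>> $ mpirun -mca ras_base_verbose 10 --display-allocation -n 8 echo "hello"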
>>
>> > On Apr 25, 2017, at 12:56 PM, Eric Chamberland <eric.chamberl...@giref.ulaval.ca> wrote:
>> >
>> > Hi,
>> >
>> > the host file has been constructed automatically by the configuration+installation process and seems to contain only comments and a blank line:
>> >
>> > (15:53:50) [zorg]:~> cat /opt/openmpi-3.x_debug/etc/openmpi-default-hostfile
>> > #
>> > # Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
>> > #                         University Research and Technology
>> > #                         Corporation.  All rights reserved.
>> > # Copyright (c) 2004-2005 The University of Tennessee and The University
>> > #                         of Tennessee Research Foundation.  All rights
>> > #                         reserved.
>> > # Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
>> > #                         University of Stuttgart.  All rights reserved.
>> > # Copyright (c) 2004-2005 The Regents of the University of California.
>> > #                         All rights reserved.
>> > # $COPYRIGHT$
>> > #
>> > # Additional copyrights may follow
>> > #
>> > # $HEADER$
>> > #
>> > # This is the default hostfile for Open MPI.  Notice that it does not
>> > # contain any hosts (not even localhost).  This file should only
>> > # contain hosts if a system administrator wants users to always have
>> > # the same set of default hosts, and is not using a batch scheduler
>> > # (such as SLURM, PBS, etc.).
>> > #
>> > # Note that this file is *not* used when running in "managed"
>> > # environments (e.g., running in a job under a job scheduler, such as
>> > # SLURM or PBS / Torque).
>> > #
>> > # If you are primarily interested in running Open MPI on one node, you
>> > # should *not* simply list "localhost" in here (contrary to prior MPI
>> > # implementations, such as LAM/MPI).  A localhost-only node list is
>> > # created by the RAS component named "localhost" if no other RAS
>> > # components were able to find any hosts to run on (this behavior can
>> > # be disabled by excluding the localhost RAS component by specifying
>> > # the value "^localhost" [without the quotes] to the "ras" MCA
>> > # parameter).
>> >
>> > (15:53:52) [zorg]:~>
>> >
>> > Thanks!
>> >
>> > Eric
>> >
>> >
>> > On 25/04/17 03:52 PM, r...@open-mpi.org wrote:
>> >> What is in your hostfile?
>> >>
>> >>
>> >>> On Apr 25, 2017, at 11:39 AM, Eric Chamberland <eric.chamberl...@giref.ulaval.ca> wrote:
>> >>>
>> >>> Hi,
>> >>>
>> >>> just testing the 3.x branch... I launch:
>> >>>
>> >>> mpirun -n 8 echo "hello"
>> >>>
>> >>> and I get:
>> >>>
>> >>> --------------------------------------------------------------------------
>> >>> There are not enough slots available in the system to satisfy the 8 slots
>> >>> that were requested by the application:
>> >>> echo
>> >>>
>> >>> Either request fewer slots for your application, or make more slots available
>> >>> for use.
>> >>> --------------------------------------------------------------------------
>> >>>
>> >>> I have to oversubscribe, so what do I have to do to bypass this "limitation"?
>> >>>
>> >>> Thanks,
>> >>>
>> >>> Eric
>> >>>
>> >>> configure log:
>> >>>
>> >>> http://www.giref.ulaval.ca/~cmpgiref/ompi_3.x/2017.04.25.10h46m08s_config.log
>> >>> http://www.giref.ulaval.ca/~cmpgiref/ompi_3.x/2017.04.25.10h46m08s_ompi_info_all.txt
>> >>>
>> >>>
>> >>> here is the complete message:
>> >>>
>> >>> [zorg:30036] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
>> >>> [zorg:30036] plm:base:set_hnp_name: initial bias 30036 nodename hash 810220270
>> >>> [zorg:30036] plm:base:set_hnp_name: final jobfam 49136
>> >>> [zorg:30036] [[49136,0],0] plm:rsh_setup on agent ssh : rsh path NULL
>> >>> [zorg:30036] [[49136,0],0] plm:base:receive start comm
>> >>> [zorg:30036] [[49136,0],0] plm:base:setup_job
>> >>> [zorg:30036] [[49136,0],0] plm:base:setup_vm
>> >>> [zorg:30036] [[49136,0],0] plm:base:setup_vm creating map
>> >>> [zorg:30036] [[49136,0],0] setup:vm: working unmanaged allocation
>> >>> [zorg:30036] [[49136,0],0] using default hostfile /opt/openmpi-3.x_debug/etc/openmpi-default-hostfile
>> >>> [zorg:30036] [[49136,0],0] plm:base:setup_vm only HNP in allocation
>> >>> [zorg:30036] [[49136,0],0] plm:base:setting slots for node zorg by cores
>> >>> [zorg:30036] [[49136,0],0] complete_setup on job [49136,1]
>> >>> [zorg:30036] [[49136,0],0] plm:base:launch_apps for job [49136,1]
>> >>> --------------------------------------------------------------------------
>> >>> There are not enough slots available in the system to satisfy the 8 slots
>> >>> that were requested by the application:
>> >>> echo
>> >>>
>> >>> Either request fewer slots for your application, or make more slots available
>> >>> for use.
>> >>> --------------------------------------------------------------------------
>> >>> [zorg:30036] [[49136,0],0] plm:base:orted_cmd sending orted_exit commands
>> >>> [zorg:30036] [[49136,0],0] plm:base:receive stop comm
>> >>>
>
>
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
