Just to be clear, the hostfile contains the correct info:

dancer00 slots=8
dancer01 slots=8
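As a sanity check, a hostfile in exactly this form can also be passed explicitly on the command line and the resulting allocation printed before any mapping happens. A minimal sketch, where ~/dancer_hosts is a hypothetical path to a copy of the file above:

$ cat ~/dancer_hosts
dancer00 slots=8
dancer01 slots=8

$ mpirun --hostfile ~/dancer_hosts --display-allocation -np 4 hostname

If the file is being read at all, the ALLOCATED NODES table should report slots=8 for both nodes.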
The output regarding the 2 nodes (dancer00 and dancer01) is clearly wrong.

George.

On Tue, Apr 25, 2017 at 4:32 PM, George Bosilca <bosi...@icl.utk.edu> wrote:

> I confirm a similar issue on a more managed environment. I have a
> hostfile that has worked for the last few years, and that spans a small
> cluster (30 nodes of 8 cores each).
>
> Trying to spawn any number of processes across P nodes fails if the number
> of processes is larger than P (despite the fact that there are largely
> enough resources, and that this information is provided via the hostfile).
>
> George.
>
> $ mpirun -mca ras_base_verbose 10 --display-allocation -np 4 --host dancer00,dancer01 --map-by
>
> [dancer.icl.utk.edu:13457] mca: base: components_register: registering framework ras components
> [dancer.icl.utk.edu:13457] mca: base: components_register: found loaded component simulator
> [dancer.icl.utk.edu:13457] mca: base: components_register: component simulator register function successful
> [dancer.icl.utk.edu:13457] mca: base: components_register: found loaded component slurm
> [dancer.icl.utk.edu:13457] mca: base: components_register: component slurm register function successful
> [dancer.icl.utk.edu:13457] mca: base: components_register: found loaded component loadleveler
> [dancer.icl.utk.edu:13457] mca: base: components_register: component loadleveler register function successful
> [dancer.icl.utk.edu:13457] mca: base: components_register: found loaded component tm
> [dancer.icl.utk.edu:13457] mca: base: components_register: component tm register function successful
> [dancer.icl.utk.edu:13457] mca: base: components_open: opening ras components
> [dancer.icl.utk.edu:13457] mca: base: components_open: found loaded component simulator
> [dancer.icl.utk.edu:13457] mca: base: components_open: found loaded component slurm
> [dancer.icl.utk.edu:13457] mca: base: components_open: component slurm open function successful
> [dancer.icl.utk.edu:13457] mca: base: components_open: found loaded component loadleveler
> [dancer.icl.utk.edu:13457] mca: base: components_open: component loadleveler open function successful
> [dancer.icl.utk.edu:13457] mca: base: components_open: found loaded component tm
> [dancer.icl.utk.edu:13457] mca: base: components_open: component tm open function successful
> [dancer.icl.utk.edu:13457] mca:base:select: Auto-selecting ras components
> [dancer.icl.utk.edu:13457] mca:base:select:( ras) Querying component [simulator]
> [dancer.icl.utk.edu:13457] mca:base:select:( ras) Querying component [slurm]
> [dancer.icl.utk.edu:13457] mca:base:select:( ras) Querying component [loadleveler]
> [dancer.icl.utk.edu:13457] mca:base:select:( ras) Querying component [tm]
> [dancer.icl.utk.edu:13457] mca:base:select:( ras) No component selected!
>
> ====================== ALLOCATED NODES ======================
> dancer00: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer01: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer02: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer03: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer04: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer05: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer06: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer07: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer08: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer09: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer10: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer11: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer12: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer13: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer14: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer15: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer16: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer17: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer18: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer19: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer20: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer21: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer22: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer23: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer24: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer25: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer26: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer27: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer28: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer29: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer30: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer31: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> =================================================================
>
> ====================== ALLOCATED NODES ======================
> dancer00: flags=0x13 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer01: flags=0x13 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer02: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer03: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer04: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer05: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer06: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer07: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer08: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer09: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer10: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer11: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer12: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer13: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer14: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer15: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer16: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer17: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer18: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer19: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer20: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer21: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer22: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer23: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer24: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer25: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer26: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer27: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer28: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer29: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer30: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> dancer31: flags=0x10 slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
> =================================================================
> --------------------------------------------------------------------------
> There are not enough slots available in the system to satisfy the 4 slots
> that were requested by the application:
>   startup
>
> Either request fewer slots for your application, or make more slots available
> for use.
> --------------------------------------------------------------------------
>
> On Tue, Apr 25, 2017 at 4:00 PM, r...@open-mpi.org <r...@open-mpi.org> wrote:
>
>> Okay - so effectively you have no hostfile, and no allocation. So this is
>> running just on the one node where mpirun exists?
>>
>> Add “-mca ras_base_verbose 10 --display-allocation” to your cmd line and
>> let’s see what it found
>>
>> > On Apr 25, 2017, at 12:56 PM, Eric Chamberland
>> > <eric.chamberl...@giref.ulaval.ca> wrote:
>> >
>> > Hi,
>> >
>> > the host file has been constructed automatically by the
>> > configuration+installation process and seems to contain only comments
>> > and a blank line:
>> >
>> > (15:53:50) [zorg]:~> cat /opt/openmpi-3.x_debug/etc/openmpi-default-hostfile
>> > #
>> > # Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
>> > #                         University Research and Technology
>> > #                         Corporation. All rights reserved.
>> > # Copyright (c) 2004-2005 The University of Tennessee and The University
>> > #                         of Tennessee Research Foundation. All rights
>> > #                         reserved.
>> > # Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
>> > #                         University of Stuttgart. All rights reserved.
>> > # Copyright (c) 2004-2005 The Regents of the University of California.
>> > #                         All rights reserved.
>> > # $COPYRIGHT$
>> > #
>> > # Additional copyrights may follow
>> > #
>> > # $HEADER$
>> > #
>> > # This is the default hostfile for Open MPI. Notice that it does not
>> > # contain any hosts (not even localhost). This file should only
>> > # contain hosts if a system administrator wants users to always have
>> > # the same set of default hosts, and is not using a batch scheduler
>> > # (such as SLURM, PBS, etc.).
>> > #
>> > # Note that this file is *not* used when running in "managed"
>> > # environments (e.g., running in a job under a job scheduler, such as
>> > # SLURM or PBS / Torque).
>> > #
>> > # If you are primarily interested in running Open MPI on one node, you
>> > # should *not* simply list "localhost" in here (contrary to prior MPI
>> > # implementations, such as LAM/MPI). A localhost-only node list is
>> > # created by the RAS component named "localhost" if no other RAS
>> > # components were able to find any hosts to run on (this behavior can
>> > # be disabled by excluding the localhost RAS component by specifying
>> > # the value "^localhost" [without the quotes] to the "ras" MCA
>> > # parameter).
>> >
>> > (15:53:52) [zorg]:~>
>> >
>> > Thanks!
>> >
>> > Eric
>> >
>> > On 25/04/17 03:52 PM, r...@open-mpi.org wrote:
>> >> What is in your hostfile?
>> >>
>> >>> On Apr 25, 2017, at 11:39 AM, Eric Chamberland
>> >>> <eric.chamberl...@giref.ulaval.ca> wrote:
>> >>>
>> >>> Hi,
>> >>>
>> >>> just testing the 3.x branch... I launch:
>> >>>
>> >>> mpirun -n 8 echo "hello"
>> >>>
>> >>> and I get:
>> >>>
>> >>> --------------------------------------------------------------------------
>> >>> There are not enough slots available in the system to satisfy the 8 slots
>> >>> that were requested by the application:
>> >>>   echo
>> >>>
>> >>> Either request fewer slots for your application, or make more slots available
>> >>> for use.
>> >>> --------------------------------------------------------------------------
>> >>>
>> >>> I have to oversubscribe, so what do I have to do to bypass this "limitation"?
>> >>>
>> >>> Thanks,
>> >>>
>> >>> Eric
>> >>>
>> >>> configure log:
>> >>>
>> >>> http://www.giref.ulaval.ca/~cmpgiref/ompi_3.x/2017.04.25.10h46m08s_config.log
>> >>> http://www.giref.ulaval.ca/~cmpgiref/ompi_3.x/2017.04.25.10h46m08s_ompi_info_all.txt
>> >>>
>> >>> here is the complete message:
>> >>>
>> >>> [zorg:30036] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
>> >>> [zorg:30036] plm:base:set_hnp_name: initial bias 30036 nodename hash 810220270
>> >>> [zorg:30036] plm:base:set_hnp_name: final jobfam 49136
>> >>> [zorg:30036] [[49136,0],0] plm:rsh_setup on agent ssh : rsh path NULL
>> >>> [zorg:30036] [[49136,0],0] plm:base:receive start comm
>> >>> [zorg:30036] [[49136,0],0] plm:base:setup_job
>> >>> [zorg:30036] [[49136,0],0] plm:base:setup_vm
>> >>> [zorg:30036] [[49136,0],0] plm:base:setup_vm creating map
>> >>> [zorg:30036] [[49136,0],0] setup:vm: working unmanaged allocation
>> >>> [zorg:30036] [[49136,0],0] using default hostfile /opt/openmpi-3.x_debug/etc/openmpi-default-hostfile
>> >>> [zorg:30036] [[49136,0],0] plm:base:setup_vm only HNP in allocation
>> >>> [zorg:30036] [[49136,0],0] plm:base:setting slots for node zorg by cores
>> >>> [zorg:30036] [[49136,0],0] complete_setup on job [49136,1]
>> >>> [zorg:30036] [[49136,0],0] plm:base:launch_apps for job [49136,1]
>> >>> --------------------------------------------------------------------------
>> >>> There are not enough slots available in the system to satisfy the 8 slots
>> >>> that were requested by the application:
>> >>>   echo
>> >>>
>> >>> Either request fewer slots for your application, or make more slots available
>> >>> for use.
>> >>> --------------------------------------------------------------------------
>> >>> [zorg:30036] [[49136,0],0] plm:base:orted_cmd sending orted_exit commands
>> >>> [zorg:30036] [[49136,0],0] plm:base:receive stop comm
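For anyone who hits the same "not enough slots" message: two possible workarounds, sketched here under the assumption of an Open MPI 3.x build like the ones above (check mpirun --help on your own build before relying on them). Eric's single-node question is what the --oversubscribe flag is for, and George's two-node failure is consistent with 3.x counting each bare --host entry as a single slot, which the host:slots form overrides:

# Single node: allow 8 ranks even when fewer slots were detected
$ mpirun --oversubscribe -n 8 echo "hello"

# Two nodes: give each --host entry an explicit slot count, so the
# hostfile's slots=8 is not reduced to 1 per entry ("startup" stands
# in for the application binary from George's run)
$ mpirun -np 4 --host dancer00:8,dancer01:8 ./startup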