Can you try not specifying "max-slots" in the hostfile? If you are the only user of the nodes, there will be no oversubscribing of the processors. This one definitely looks like a bug, but as Ralph said, there is an ongoing discussion and work on this component.

Lenny.
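[Editor's note: applied to the th_02 hostfile quoted later in the thread, Lenny's suggestion would look roughly like this. This is a sketch, not a tested configuration:]

```
nano_00.uzh.ch slots=2
nano_02.uzh.ch slots=2
```

With only "slots" set, mpirun still maps at most two processes per node, but the hard "max-slots" cap involved in the oversubscription check is gone.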
On Mon, Aug 17, 2009 at 2:37 PM, Ralph Castain <r...@open-mpi.org> wrote:

>> Is there an explanation for this?
>
> I believe the word is "bug". :-)
>
> The rank_file mapper has been substantially revised lately - we are
> discussing now how much of that revision to bring to 1.3.4 versus the next
> major release.
>
> Ralph
>
> On Aug 17, 2009, at 4:45 AM, jody wrote:
>
>> Hi Lenny
>>
>>> I think it has something to do with your environment, /etc/hosts, IT
>>> setup, hostname function return value, etc.
>>> I am not sure if it has something to do with Open MPI at all.
>>
>> OK. I just thought this was Open MPI related because I was able to use the
>> aliases of the hosts (i.e. plankton instead of plankton.uzh.ch) in
>> the host file...
>>
>> However, I encountered a new problem:
>> if the rankfile lists all the entries which occur in the host file,
>> there is an error message.
>> In the following example, the hostfile is
>>
>> [jody@plankton neander]$ cat th_02
>> nano_00.uzh.ch slots=2 max-slots=2
>> nano_02.uzh.ch slots=2 max-slots=2
>>
>> and the rankfile is:
>>
>> [jody@plankton neander]$ cat rf_02
>> rank 0=nano_00.uzh.ch slot=0
>> rank 2=nano_00.uzh.ch slot=1
>> rank 1=nano_02.uzh.ch slot=0
>> rank 3=nano_02.uzh.ch slot=1
>>
>> Here is the error:
>>
>> [jody@plankton neander]$ mpirun -np 4 -hostfile th_02 -rf rf_02 ./HelloMPI
>> --------------------------------------------------------------------------
>> There are not enough slots available in the system to satisfy the 4 slots
>> that were requested by the application:
>> ./HelloMPI
>>
>> Either request fewer slots for your application, or make more slots
>> available for use.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
>> launch so we are aborting.
>>
>> There may be more information reported by the environment (see above).
>>
>> This may be because the daemon was unable to find all the needed shared
>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>> location of the shared libraries on the remote nodes and this will
>> automatically be forwarded to the remote nodes.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpirun noticed that the job aborted, but has no info as to the process
>> that caused that situation.
>> --------------------------------------------------------------------------
>> mpirun: clean termination accomplished
>>
>> If I use a hostfile with one more entry
>>
>> [jody@aim-plankton neander]$ cat th_021
>> aim-nano_00.uzh.ch slots=2 max-slots=2
>> aim-nano_02.uzh.ch slots=2 max-slots=2
>> aim-nano_01.uzh.ch slots=1 max-slots=1
>>
>> then this works fine:
>>
>> [jody@aim-plankton neander]$ mpirun -np 4 -hostfile th_021 -rf rf_02 ./HelloMPI
>>
>> Is there an explanation for this?
>>
>> Thank You
>> Jody
>>
>>> Lenny.
>>> On Mon, Aug 17, 2009 at 12:59 PM, jody <jody....@gmail.com> wrote:
>>>
>>>> Hi Lenny
>>>>
>>>> Thanks - using the full names makes it work!
>>>> Is there a reason why the rankfile option treats
>>>> host names differently than the hostfile option?
>>>>
>>>> Thanks
>>>> Jody
>>>>
>>>> On Mon, Aug 17, 2009 at 11:20 AM, Lenny Verkhovsky <lenny.verkhov...@gmail.com> wrote:
>>>>
>>>>> Hi
>>>>> This message means that you are trying to use host "plankton",
>>>>> which was not allocated via hostfile or hostlist.
>>>>> But according to the files and command line, everything seems fine.
>>>>> Can you try using the "plankton.uzh.ch" hostname instead of "plankton"?
>>>>> Thanks
>>>>> Lenny.
>>>>> On Mon, Aug 17, 2009 at 10:36 AM, jody <jody....@gmail.com> wrote:
>>>>>
>>>>>> Hi
>>>>>>
>>>>>> When I use a rankfile, I get an error message which I don't understand:
>>>>>>
>>>>>> [jody@plankton tests]$ mpirun -np 3 -rf rankfile -hostfile testhosts ./HelloMPI
>>>>>> --------------------------------------------------------------------------
>>>>>> Rankfile claimed host plankton that was not allocated or
>>>>>> oversubscribed it's slots:
>>>>>> --------------------------------------------------------------------------
>>>>>> [plankton.uzh.ch:24327] [[44857,0],0] ORTE_ERROR_LOG: Bad parameter in
>>>>>> file rmaps_rank_file.c at line 108
>>>>>> [plankton.uzh.ch:24327] [[44857,0],0] ORTE_ERROR_LOG: Bad parameter in
>>>>>> file base/rmaps_base_map_job.c at line 87
>>>>>> [plankton.uzh.ch:24327] [[44857,0],0] ORTE_ERROR_LOG: Bad parameter in
>>>>>> file base/plm_base_launch_support.c at line 77
>>>>>> [plankton.uzh.ch:24327] [[44857,0],0] ORTE_ERROR_LOG: Bad parameter in
>>>>>> file plm_rsh_module.c at line 990
>>>>>> --------------------------------------------------------------------------
>>>>>> A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
>>>>>> launch so we are aborting.
>>>>>>
>>>>>> There may be more information reported by the environment (see above).
>>>>>>
>>>>>> This may be because the daemon was unable to find all the needed shared
>>>>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>>>>>> location of the shared libraries on the remote nodes and this will
>>>>>> automatically be forwarded to the remote nodes.
>>>>>> >>>>>> >>>>>> -------------------------------------------------------------------------- >>>>>> >>>>>> >>>>>> -------------------------------------------------------------------------- >>>>>> mpirun noticed that the job aborted, but has no info as to the process >>>>>> that caused that situation. >>>>>> >>>>>> >>>>>> -------------------------------------------------------------------------- >>>>>> mpirun: clean termination accomplished >>>>>> >>>>>> >>>>>> >>>>>> With out the '-rf rankfile' option everything works as expected. >>>>>> >>>>>> My hostfile : >>>>>> [jody@plankton tests]$ cat testhosts >>>>>> # The following node is a quad-processor machine, and we absolutely >>>>>> # want to disallow over-subscribing it: >>>>>> plankton slots=3 max-slots=3 >>>>>> # The following nodes are dual-processor machines: >>>>>> nano_00 slots=2 max-slots=2 >>>>>> nano_01 slots=2 max-slots=2 >>>>>> nano_02 slots=2 max-slots=2 >>>>>> nano_03 slots=2 max-slots=2 >>>>>> nano_04 slots=2 max-slots=2 >>>>>> nano_05 slots=2 max-slots=2 >>>>>> nano_06 slots=2 max-slots=2 >>>>>> >>>>>> my rank file: >>>>>> [jody@plankton neander]$ cat rankfile >>>>>> rank 0=nano_00 slot=1 >>>>>> rank 1=plankton slot=0 >>>>>> rank 2=nano_01 slot=1 >>>>>> >>>>>> my Open MPI version: 1.3.2 >>>>>> >>>>>> i get the same error if i use a rankfile which has a single line >>>>>> rank 0=plankton slot=0 >>>>>> (plankton is my local machine) and call mpirun with np 1 >>>>>> >>>>>> What does the "Rankfile claimed..." message mean? >>>>>> Did i make an error in my rankfile? >>>>>> If yes, what would be the correct way to write it? 
>>>>>> >>>>>> Thank You >>>>>> Jody >>>>>> _______________________________________________ >>>>>> users mailing list >>>>>> us...@open-mpi.org >>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> users mailing list >>>>> us...@open-mpi.org >>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>> >>>>> >>>> _______________________________________________ >>>> users mailing list >>>> us...@open-mpi.org >>>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>>> >>> >>> >>> _______________________________________________ >>> users mailing list >>> us...@open-mpi.org >>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>> >>> >> _______________________________________________ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users >> > > _______________________________________________ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users >