Can you try not specifying "max-slots" in the hostfile?
If you are the only user of the nodes, there will be no oversubscribing of
the processors.
This one definitely looks like a bug,
but as Ralph said, there is an ongoing discussion and work on this
component.
Lenny.
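For reference, a minimal sketch of the change Lenny suggests, using the hostnames from the thread below (purely illustrative; adjust for your own cluster):

```shell
# Hostfile with the "max-slots" limit removed, per the suggestion above.
# Hostnames are the ones from this thread and are illustrative only.
cat > th_02 <<'EOF'
nano_00.uzh.ch  slots=2
nano_02.uzh.ch  slots=2
EOF
cat th_02
```

mpirun would then be invoked as before, e.g. mpirun -np 4 -hostfile th_02 -rf rf_02 ./HelloMPI.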

On Mon, Aug 17, 2009 at 2:37 PM, Ralph Castain <r...@open-mpi.org> wrote:

> Is there an explanation for this?
>>
>
> I believe the word is "bug". :-)
>
> The rank_file mapper has been substantially revised lately - we are
> discussing now how much of that revision to bring to 1.3.4 versus the next
> major release.
>
> Ralph
>
> On Aug 17, 2009, at 4:45 AM, jody wrote:
>
>  Hi Lenny
>>
>>  I think it has something to do with your environment,  /etc/hosts, IT
>>> setup,
>>> hostname function return value, etc.
>>> I am not sure if it has something to do with Open MPI at all.
>>>
>>
>> OK. I just thought this was Open MPI related because I was able to use the
>> aliases of the hosts (i.e. plankton instead of plankton.uzh.ch) in
>> the host file...
>>
>> However, I encountered a new problem:
>> if the rankfile covers every host that occurs in the hostfile,
>> I get an error message.
>> In the following example, the hostfile is
>> [jody@plankton neander]$ cat th_02
>> nano_00.uzh.ch  slots=2 max-slots=2
>> nano_02.uzh.ch  slots=2 max-slots=2
>>
>> and the rankfile is:
>> [jody@plankton neander]$ cat rf_02
>> rank  0=nano_00.uzh.ch  slot=0
>> rank  2=nano_00.uzh.ch  slot=1
>> rank  1=nano_02.uzh.ch  slot=0
>> rank  3=nano_02.uzh.ch  slot=1
>>
>> Here is the error:
>> [jody@plankton neander]$ mpirun -np 4 -hostfile th_02  -rf rf_02
>> ./HelloMPI
>> --------------------------------------------------------------------------
>> There are not enough slots available in the system to satisfy the 4 slots
>> that were requested by the application:
>>   ./HelloMPI
>>
>> Either request fewer slots for your application, or make more slots
>> available
>> for use.
>>
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
>> launch so we are aborting.
>>
>> There may be more information reported by the environment (see above).
>>
>> This may be because the daemon was unable to find all the needed shared
>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>> location of the shared libraries on the remote nodes and this will
>> automatically be forwarded to the remote nodes.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpirun noticed that the job aborted, but has no info as to the process
>> that caused that situation.
>> --------------------------------------------------------------------------
>> mpirun: clean termination accomplished
>>
>> If I use a hostfile with one more entry
>> [jody@aim-plankton neander]$ cat th_021
>> aim-nano_00.uzh.ch  slots=2 max-slots=2
>> aim-nano_02.uzh.ch  slots=2 max-slots=2
>> aim-nano_01.uzh.ch  slots=1 max-slots=1
>>
>> Then this works fine:
>> [jody@aim-plankton neander]$ mpirun -np 4 -hostfile th_021  -rf rf_02
>> ./HelloMPI
>>
>> Is there an explanation for this?
>>
>> Thank You
>>  Jody
>>
>>  Lenny.
>>> On Mon, Aug 17, 2009 at 12:59 PM, jody <jody....@gmail.com> wrote:
>>>
>>>>
>>>> Hi Lenny
>>>>
>>>> Thanks - using the full names makes it work!
>>>> Is there a reason why the rankfile option treats
>>>> host names differently than the hostfile option?
>>>>
>>>> Thanks
>>>>  Jody
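Lenny's earlier suggestion (use fully qualified hostnames consistently in both files) can be sketched like this; the file names and hostnames are taken from this thread and are purely illustrative:

```shell
# Hostfile and rankfile using the same fully qualified names throughout,
# so the rankfile mapper can match hosts against the allocation.
cat > testhosts <<'EOF'
plankton.uzh.ch slots=3 max-slots=3
nano_00.uzh.ch  slots=2 max-slots=2
nano_01.uzh.ch  slots=2 max-slots=2
EOF
cat > rankfile <<'EOF'
rank 0=nano_00.uzh.ch  slot=1
rank 1=plankton.uzh.ch slot=0
rank 2=nano_01.uzh.ch  slot=1
EOF
# The launch would then be:
#   mpirun -np 3 -hostfile testhosts -rf rankfile ./HelloMPI
cat rankfile
```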
>>>>
>>>>
>>>>
>>>> On Mon, Aug 17, 2009 at 11:20 AM, Lenny
>>>> Verkhovsky<lenny.verkhov...@gmail.com> wrote:
>>>>
>>>>> Hi
>>>>> This message means
>>>>> that you are trying to use host "plankton", that was not allocated via
>>>>> hostfile or hostlist.
>>>>> But according to the files and command line, everything seems fine.
>>>>> Can you try using "plankton.uzh.ch" hostname instead of "plankton".
>>>>> thanks
>>>>> Lenny.
>>>>> On Mon, Aug 17, 2009 at 10:36 AM, jody <jody....@gmail.com> wrote:
>>>>>
>>>>>>
>>>>>> Hi
>>>>>>
>>>>>> When I use a rankfile, I get an error message which I don't
>>>>>> understand:
>>>>>>
>>>>>> [jody@plankton tests]$ mpirun -np 3 -rf rankfile -hostfile testhosts
>>>>>> ./HelloMPI
>>>>>>
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>> Rankfile claimed host plankton that was not allocated or
>>>>>> oversubscribed it's slots:
>>>>>>
>>>>>>
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>> [plankton.uzh.ch:24327] [[44857,0],0] ORTE_ERROR_LOG: Bad parameter
>>>>>> in
>>>>>> file rmaps_rank_file.c at line 108
>>>>>> [plankton.uzh.ch:24327] [[44857,0],0] ORTE_ERROR_LOG: Bad parameter
>>>>>> in
>>>>>> file base/rmaps_base_map_job.c at line 87
>>>>>> [plankton.uzh.ch:24327] [[44857,0],0] ORTE_ERROR_LOG: Bad parameter
>>>>>> in
>>>>>> file base/plm_base_launch_support.c at line 77
>>>>>> [plankton.uzh.ch:24327] [[44857,0],0] ORTE_ERROR_LOG: Bad parameter
>>>>>> in
>>>>>> file plm_rsh_module.c at line 990
>>>>>>
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>> A daemon (pid unknown) died unexpectedly on signal 1  while attempting
>>>>>> to
>>>>>> launch so we are aborting.
>>>>>>
>>>>>> There may be more information reported by the environment (see above).
>>>>>>
>>>>>> This may be because the daemon was unable to find all the needed
>>>>>> shared
>>>>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have
>>>>>> the
>>>>>> location of the shared libraries on the remote nodes and this will
>>>>>> automatically be forwarded to the remote nodes.
>>>>>>
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>>
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>>>> that caused that situation.
>>>>>>
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>> mpirun: clean termination accomplished
>>>>>>
>>>>>>
>>>>>>
>>>>>> Without the '-rf rankfile' option everything works as expected.
>>>>>>
>>>>>> My hostfile :
>>>>>> [jody@plankton tests]$ cat testhosts
>>>>>> # The following node is a quad-processor machine, and we absolutely
>>>>>> # want to disallow over-subscribing it:
>>>>>> plankton slots=3  max-slots=3
>>>>>> # The following nodes are dual-processor machines:
>>>>>> nano_00  slots=2 max-slots=2
>>>>>> nano_01  slots=2 max-slots=2
>>>>>> nano_02  slots=2 max-slots=2
>>>>>> nano_03  slots=2 max-slots=2
>>>>>> nano_04  slots=2 max-slots=2
>>>>>> nano_05  slots=2 max-slots=2
>>>>>> nano_06  slots=2 max-slots=2
>>>>>>
>>>>>> my rank file:
>>>>>> [jody@plankton neander]$ cat rankfile
>>>>>> rank  0=nano_00  slot=1
>>>>>> rank  1=plankton slot=0
>>>>>> rank  2=nano_01  slot=1
>>>>>>
>>>>>> my Open MPI version: 1.3.2
>>>>>>
>>>>>> I get the same error if I use a rankfile which has a single line:
>>>>>>  rank  0=plankton  slot=0
>>>>>> (plankton is my local machine) and call mpirun with -np 1.
>>>>>>
>>>>>> What does the "Rankfile claimed..." message mean?
>>>>>> Did I make an error in my rankfile?
>>>>>> If yes, what would be the correct way to write it?
>>>>>>
>>>>>> Thank You
>>>>>>  Jody
>>>>>> _______________________________________________
>>>>>> users mailing list
>>>>>> us...@open-mpi.org
>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>
>
