Can you try not specifying "max-slots" in the hostfile?
If you are the only user of the nodes, there will be no oversubscribing of
the processors.
This one definitely looks like a bug,
but as Ralph said, there is an ongoing discussion and work on this.
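
As a rough sketch of what I mean: the hostfile below keeps the "slots" count
but drops the "max-slots" limit. The host names node_a and node_b are
placeholders (use the same names that appear in your rankfile), and the file
name th_02_nomax is made up for illustration.

```shell
# Write a copy of the hostfile that keeps "slots" but omits "max-slots".
# node_a and node_b are placeholder host names, not the real ones.
cat > th_02_nomax <<'EOF'
node_a slots=2
node_b slots=2
EOF

# Show the result.
cat th_02_nomax

# Then rerun the failing command against the new hostfile:
#   mpirun -np 4 -hostfile th_02_nomax -rf rf_02 ./HelloMPI
```

Whether this avoids the "not enough slots" error depends on the rank_file
mapper behaviour in your version, so treat it as something to try rather
than a confirmed fix.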

On Mon, Aug 17, 2009 at 2:37 PM, Ralph Castain <> wrote:

> Is there an explanation for this?
> I believe the word is "bug". :-)
> The rank_file mapper has been substantially revised lately - we are
> discussing now how much of that revision to bring to 1.3.4 versus the next
> major release.
> Ralph
> On Aug 17, 2009, at 4:45 AM, jody wrote:
>  Hi Lenny
>>  I think it has something to do with your environment,  /etc/hosts, IT
>>> setup,
>>> hostname function return value, etc.
>>> I am not sure if it has something to do with Open MPI at all.
>> OK. I just thought this was Open MPI related because i was able to use the
>> aliases of the hosts (i.e. plankton instead of in
>> the host file...
>> However, I encountered a new problem:
>> if the rankfile lists all the entries which occur in the host file
>> there is an error message.
>> In the following example, the hostfile is
>> [jody@plankton neander]$ cat th_02
>>  slots=2 max-slots=2
>>  slots=2 max-slots=2
>> and the rankfile is:
>> [jody@plankton neander]$ cat rf_02
>> rank  slot=0
>> rank  slot=1
>> rank  slot=0
>> rank  slot=1
>> Here is the error:
>> [jody@plankton neander]$ mpirun -np 4 -hostfile th_02  -rf rf_02
>> ./HelloMPI
>> --------------------------------------------------------------------------
>> There are not enough slots available in the system to satisfy the 4 slots
>> that were requested by the application:
>>   ./HelloMPI
>> Either request fewer slots for your application, or make more slots
>> available
>> for use.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
>> launch so we are aborting.
>> There may be more information reported by the environment (see above).
>> This may be because the daemon was unable to find all the needed shared
>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>> location of the shared libraries on the remote nodes and this will
>> automatically be forwarded to the remote nodes.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpirun noticed that the job aborted, but has no info as to the process
>> that caused that situation.
>> --------------------------------------------------------------------------
>> mpirun: clean termination accomplished
>> If i use a hostfile with one more entry
>> [jody@aim-plankton neander]$ cat th_021
>>  slots=2 max-slots=2
>>  slots=2 max-slots=2
>>  slots=1 max-slots=1
>> Then this works fine:
>> [jody@aim-plankton neander]$ mpirun -np 4 -hostfile th_021  -rf rf_02
>> ./HelloMPI
>> Is there an explanation for this?
>> Thank You
>>  Jody
>>  Lenny.
>>> On Mon, Aug 17, 2009 at 12:59 PM, jody <> wrote:
>>>> Hi Lenny
>>>> Thanks - using the full names makes it work!
>>>> Is there a reason why the rankfile option treats
>>>> host names differently than the hostfile option?
>>>> Thanks
>>>>  Jody
>>>> On Mon, Aug 17, 2009 at 11:20 AM, Lenny
>>>> Verkhovsky<> wrote:
>>>>> Hi
>>>>> This message means
>>>>> that you are trying to use host "plankton", that was not allocated via
>>>>> hostfile or hostlist.
>>>>> But according to the files and command line, everything seems fine.
>>>>> Can you try using "" hostname instead of "plankton".
>>>>> thanks
>>>>> Lenny.
>>>>> On Mon, Aug 17, 2009 at 10:36 AM, jody <> wrote:
>>>>>> Hi
>>>>>> When i use a rankfile, i get an error message which i don't
>>>>>> understand:
>>>>>> [jody@plankton tests]$ mpirun -np 3 -rf rankfile -hostfile testhosts
>>>>>> ./HelloMPI
>>>>>> --------------------------------------------------------------------------
>>>>>> Rankfile claimed host plankton that was not allocated or
>>>>>> oversubscribed it's slots:
>>>>>> --------------------------------------------------------------------------
>>>>>> [] [[44857,0],0] ORTE_ERROR_LOG: Bad parameter
>>>>>> in
>>>>>> file rmaps_rank_file.c at line 108
>>>>>> [] [[44857,0],0] ORTE_ERROR_LOG: Bad parameter
>>>>>> in
>>>>>> file base/rmaps_base_map_job.c at line 87
>>>>>> [] [[44857,0],0] ORTE_ERROR_LOG: Bad parameter
>>>>>> in
>>>>>> file base/plm_base_launch_support.c at line 77
>>>>>> [] [[44857,0],0] ORTE_ERROR_LOG: Bad parameter
>>>>>> in
>>>>>> file plm_rsh_module.c at line 990
>>>>>> --------------------------------------------------------------------------
>>>>>> A daemon (pid unknown) died unexpectedly on signal 1  while attempting
>>>>>> to
>>>>>> launch so we are aborting.
>>>>>> There may be more information reported by the environment (see above).
>>>>>> This may be because the daemon was unable to find all the needed
>>>>>> shared
>>>>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have
>>>>>> the
>>>>>> location of the shared libraries on the remote nodes and this will
>>>>>> automatically be forwarded to the remote nodes.
>>>>>> --------------------------------------------------------------------------
>>>>>> --------------------------------------------------------------------------
>>>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>>>> that caused that situation.
>>>>>> --------------------------------------------------------------------------
>>>>>> mpirun: clean termination accomplished
>>>>>> Without the '-rf rankfile' option everything works as expected.
>>>>>> My hostfile :
>>>>>> [jody@plankton tests]$ cat testhosts
>>>>>> # The following node is a quad-processor machine, and we absolutely
>>>>>> # want to disallow over-subscribing it:
>>>>>> plankton slots=3  max-slots=3
>>>>>> # The following nodes are dual-processor machines:
>>>>>> nano_00  slots=2 max-slots=2
>>>>>> nano_01  slots=2 max-slots=2
>>>>>> nano_02  slots=2 max-slots=2
>>>>>> nano_03  slots=2 max-slots=2
>>>>>> nano_04  slots=2 max-slots=2
>>>>>> nano_05  slots=2 max-slots=2
>>>>>> nano_06  slots=2 max-slots=2
>>>>>> my rank file:
>>>>>> [jody@plankton neander]$ cat rankfile
>>>>>> rank  0=nano_00  slot=1
>>>>>> rank  1=plankton slot=0
>>>>>> rank  2=nano_01  slot=1
>>>>>> my Open MPI version: 1.3.2
>>>>>> i get the same error if i use a rankfile which has a single line
>>>>>>  rank  0=plankton  slot=0
>>>>>> (plankton is my local machine) and call mpirun with np 1
>>>>>> What does the "Rankfile claimed..." message mean?
>>>>>> Did i make an error in my rankfile?
>>>>>> If yes, what would be the correct way to write it?
>>>>>> Thank You
>>>>>>  Jody
>>>>>> _______________________________________________
>>>>>> users mailing list