On Jan 19, 2014, at 1:36 AM, tmish...@jcity.maeda.co.jp wrote:

> 
> 
> Thank you for your fix. I will try it tomorrow.
> 
> Before that, although I could not understand everything,
> let me ask some questions about the new hostfile.c.
> 
> 1. Lines 244-248 are included in the else clause, which might cause a
> memory leak (it seems to me). Should they be outside the clause?
> 
> 244            if (NULL != node_alias) {
> 245                /* add to list of aliases for this node - only add if unique */
> 246                opal_argv_append_unique_nosize(&node->alias, node_alias, false);
> 247                free(node_alias);
> 248            }

Yes, although that condition shouldn't ever actually be true there unless the
node was previously seen anyway.
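
To make the suggested change concrete, here is a minimal sketch of the hoisted
block (assuming the surrounding parsing loop from the quoted snippet, which is
not shown here):

    /* runs after both branches of the if/else, so the alias string is
     * always appended (if set) and always freed */
    if (NULL != node_alias) {
        /* add to list of aliases for this node - only add if unique */
        opal_argv_append_unique_nosize(&node->alias, node_alias, false);
        free(node_alias);
        node_alias = NULL;
    }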

> 
> 2. For a similar reason, should lines 306-314 be outside the else clause?

Those lines actually shouldn't exist, as we don't define an alias in that code
block, so node_alias is always NULL there.

> 
> 3. I think that node->slots_given of hosts detected through the rank-file
> should always be true, to avoid being overridden by orte_set_default_slots.
> Should line 305 be outside the else clause as well?
> 
> 305            node->slots_given = true;

Yes - thanks, it was meant to be outside the clause
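
In other words, the end of that parsing block would set the flag
unconditionally, roughly like this (a sketch only, sitting alongside the
hoisted alias handling shown above):

    /* always mark the slot count as explicitly given for hosts taken
     * from the rank-file, so orte_set_default_slots cannot later
     * override it */
    node->slots_given = true;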

> 
> Regards,
> Tetsuya Mishima
> 
>> I believe I now have this working correctly on the trunk and setup for
>> 1.7.4. If you get a chance, please give it a try and confirm it solves
>> the problem.
>> 
>> Thanks
>> Ralph
>> 
>> On Jan 17, 2014, at 2:16 PM, Ralph Castain <r...@open-mpi.org> wrote:
>> 
>>> Sorry for the delay - I understood and was just occupied with something
>>> else for a while. Thanks for the follow-up. I'm looking at the issue and
>>> trying to decipher the right solution.
>>> 
>>> 
>>> On Jan 17, 2014, at 2:00 PM, tmish...@jcity.maeda.co.jp wrote:
>>> 
>>>> 
>>>> 
>>>> Hi Ralph,
>>>> 
>>>> I'm sorry that my explanation was not sufficient.
>>>> Here is a summary of my situation:
>>>> 
>>>> 1. I create a hostfile as shown below manually.
>>>> 
>>>> 2. I use mpirun to start the job without Torque, which means I'm
>>>> running in an unmanaged environment.
>>>> 
>>>> 3. Firstly, ORTE detects 8 slots on each host (maybe in
>>>> "orte_ras_base_allocate").
>>>>  node05: slots=8 max_slots=0 slots_inuse=0
>>>>  node06: slots=8 max_slots=0 slots_inuse=0
>>>> 
>>>> 4. Then, the code I identified is resetting the slot counts.
>>>>  node05: slots=1 max_slots=0 slots_inuse=0
>>>>  node06: slots=1 max_slots=0 slots_inuse=0
>>>> 
>>>> 5. Therefore, ORTE believes that there is only one slot on each host.
>>>> 
>>>> Regards,
>>>> Tetsuya Mishima
>>>> 
>>>>> No, I didn't use Torque this time.
>>>>> 
>>>>> This issue occurs only when we are not running in a managed
>>>>> environment - namely, when orte_managed_allocation is false
>>>>> (and orte_set_slots is NULL).
>>>>> 
>>>>> Under Torque management, it works fine.
>>>>> 
>>>>> I hope you can understand the situation.
>>>>> 
>>>>> Tetsuya Mishima
>>>>> 
>>>>>> I'm sorry, but I'm really confused, so let me try to understand the
>>>>>> situation.
>>>>>> 
>>>>>> You use Torque to get an allocation, so you are running in a managed
>>>>>> environment.
>>>>>> 
>>>>>> You then use mpirun to start the job, but pass it a hostfile as
>>>>>> shown below.
>>>>>> 
>>>>>> Somehow, ORTE believes that there is only one slot on each host, and
>>>>>> you believe the code you've identified is resetting the slot counts.
>>>>>> 
>>>>>> Is that a correct summary of the situation?
>>>>>> 
>>>>>> Thanks
>>>>>> Ralph
>>>>>> 
>>>>>> On Jan 16, 2014, at 4:00 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>>> 
>>>>>>> 
>>>>>>> Hi Ralph,
>>>>>>> 
>>>>>>> I encountered the hostfile issue again, where slots are counted by
>>>>>>> listing the node multiple times. This was supposed to be fixed by
>>>>>>> r29765 - Fix hostfile parsing for the case where RMs count slots ....
>>>>>>> 
>>>>>>> The difference is whether an RM is used or not. At that time, I
>>>>>>> executed mpirun through the Torque manager. This time I executed it
>>>>>>> directly from the command line as shown at the bottom, where node05
>>>>>>> and node06 each have 8 cores.
>>>>>>> 
>>>>>>> Then, I checked the source files around it and found that lines
>>>>>>> 151-160 in plm_base_launch_support.c cause this issue. As
>>>>>>> node->slots is already counted in hostfile.c @ r29765 even when
>>>>>>> node->slots_given is false, I think this part of
>>>>>>> plm_base_launch_support.c is unnecessary.
>>>>>>> 
>>>>>>> orte/mca/plm/base/plm_base_launch_support.c @ 30189:
>>>>>>> 151             } else {
>>>>>>> 152                 /* set any non-specified slot counts to 1 */
>>>>>>> 153                 for (i=0; i < orte_node_pool->size; i++) {
>>>>>>> 154                     if (NULL == (node = (orte_node_t*)opal_pointer_array_get_item(orte_node_pool, i))) {
>>>>>>> 155                         continue;
>>>>>>> 156                     }
>>>>>>> 157                     if (!node->slots_given) {
>>>>>>> 158                         node->slots = 1;
>>>>>>> 159                     }
>>>>>>> 160                 }
>>>>>>> 161             }
>>>>>>> 
>>>>>>> With this part removed, it works very well, and the function of
>>>>>>> orte_set_default_slots is still preserved. I think this would be
>>>>>>> better for a compatible extension of openmpi-1.7.3.
>>>>>>> 
>>>>>>> Regards,
>>>>>>> Tetsuya Mishima
>>>>>>> 
>>>>>>> [mishima@manage work]$ cat pbs_hosts
>>>>>>> node05
>>>>>>> node05
>>>>>>> node05
>>>>>>> node05
>>>>>>> node05
>>>>>>> node05
>>>>>>> node05
>>>>>>> node05
>>>>>>> node06
>>>>>>> node06
>>>>>>> node06
>>>>>>> node06
>>>>>>> node06
>>>>>>> node06
>>>>>>> node06
>>>>>>> node06
>>>>>>> [mishima@manage work]$ mpirun -np 4 -hostfile pbs_hosts -cpus-per-proc 4 -report-bindings myprog
>>>>>>> [node05.cluster:22287] MCW rank 2 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
>>>>>>> [node05.cluster:22287] MCW rank 3 is not bound (or bound to all available processors)
>>>>>>> [node05.cluster:22287] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
>>>>>>> [node05.cluster:22287] MCW rank 1 is not bound (or bound to all available processors)
>>>>>>> Hello world from process 0 of 4
>>>>>>> Hello world from process 1 of 4
>>>>>>> Hello world from process 3 of 4
>>>>>>> Hello world from process 2 of 4