On Jan 19, 2014, at 1:36 AM, tmish...@jcity.maeda.co.jp wrote:

> Thank you for your fix. I will try it tomorrow.
>
> Before that, although I could not understand everything,
> let me ask some questions about the new hostfile.c.
>
> 1. Lines 244-248 are included in the else-clause, which might cause a
> memory leak (it seems to me). Should they be moved out of the clause?
>
> 244    if (NULL != node_alias) {
> 245        /* add to list of aliases for this node - only add if unique */
> 246        opal_argv_append_unique_nosize(&node->alias, node_alias, false);
> 247        free(node_alias);
> 248    }
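To make the concern concrete: the worry is that a node_alias string allocated earlier in the parse is only released on the path that creates a new node. Below is a minimal, self-contained C sketch of that lookup-or-create pattern; node_t, find_node() and add_host() are simplified stand-ins, not the real hostfile.c code, and it only shows how hoisting the alias/free block out of the else-clause releases node_alias on both paths.

/*
 * Simplified, self-contained model of the lookup-or-create pattern under
 * discussion.  node_t, find_node() and add_host() are hypothetical
 * stand-ins, not the real hostfile.c code.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    char *name;
    char *alias;   /* single alias here; the real code keeps an argv array */
    int   slots;
} node_t;

static node_t pool[16];
static int pool_len = 0;

static node_t *find_node(const char *name)
{
    for (int i = 0; i < pool_len; i++) {
        if (0 == strcmp(pool[i].name, name)) {
            return &pool[i];
        }
    }
    return NULL;
}

static void add_host(const char *name, char *node_alias)
{
    node_t *node = find_node(name);

    if (NULL != node) {
        /* node was seen before - just bump its slot count */
        node->slots++;
    } else {
        /* first sighting - create the node with one slot */
        node = &pool[pool_len++];
        node->name = strdup(name);
        node->alias = NULL;
        node->slots = 1;
    }

    /* hoisted out of the else-clause: node_alias is consumed (and freed)
     * on both paths, so it cannot leak when the node already existed */
    if (NULL != node_alias) {
        if (NULL == node->alias) {
            node->alias = strdup(node_alias);
        }
        free(node_alias);
    }
}

int main(void)
{
    add_host("node05", strdup("n05"));
    add_host("node05", strdup("n05"));   /* second sighting: alias still freed */
    printf("%s slots=%d alias=%s\n", pool[0].name, pool[0].slots, pool[0].alias);
    return 0;
}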
Yes, although it shouldn't ever actually be true unless the node was previously seen anyway.

> 2. For a similar reason, should lines 306-314 be moved out of the else-clause?

Those lines actually shouldn't exist, as we don't define an alias in that code block, so node_alias is always NULL.

> 3. I think that node->slots_given of hosts detected through a rank-file should
> always be true, to avoid it being overridden by orte_set_default_slots. Should
> line 305 be moved out of the else-clause as well?
>
> 305        node->slots_given = true;

Yes - thanks, it was meant to be outside the clause.

> Regards,
> Tetsuya Mishima
>
>> I believe I now have this working correctly on the trunk and set up for
>> 1.7.4. If you get a chance, please give it a try and confirm it solves the
>> problem.
>>
>> Thanks
>> Ralph
>>
>> On Jan 17, 2014, at 2:16 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>>> Sorry for the delay - I understood and was just occupied with something
>>> else for a while. Thanks for the follow-up. I'm looking at the issue and
>>> trying to decipher the right solution.
>>>
>>> On Jan 17, 2014, at 2:00 PM, tmish...@jcity.maeda.co.jp wrote:
>>>
>>>> Hi Ralph,
>>>>
>>>> I'm sorry that my explanation was not clear enough ...
>>>> This is a summary of my situation:
>>>>
>>>> 1. I create a hostfile, as shown below, manually.
>>>>
>>>> 2. I use mpirun to start the job without Torque, which means I'm
>>>> running in an un-managed environment.
>>>>
>>>> 3. First, ORTE detects 8 slots on each host (maybe in
>>>> "orte_ras_base_allocate"):
>>>>    node05: slots=8 max_slots=0 slots_inuse=0
>>>>    node06: slots=8 max_slots=0 slots_inuse=0
>>>>
>>>> 4. Then, the code I identified resets the slot counts:
>>>>    node05: slots=1 max_slots=0 slots_inuse=0
>>>>    node06: slots=1 max_slots=0 slots_inuse=0
>>>>
>>>> 5. Therefore, ORTE believes that there is only one slot on each host.
>>>>
>>>> Regards,
>>>> Tetsuya Mishima
>>>>
>>>>> No, I didn't use Torque this time.
>>>>>
>>>>> This issue is caused only when it is not in a managed
>>>>> environment - namely, orte_managed_allocation is false
>>>>> (and orte_set_slots is NULL).
>>>>>
>>>>> Under Torque management, it works fine.
>>>>>
>>>>> I hope you can understand the situation.
>>>>>
>>>>> Tetsuya Mishima
>>>>>
>>>>>> I'm sorry, but I'm really confused, so let me try to understand the
>>>>>> situation.
>>>>>>
>>>>>> You use Torque to get an allocation, so you are running in a managed
>>>>>> environment.
>>>>>>
>>>>>> You then use mpirun to start the job, but pass it a hostfile as shown
>>>>>> below.
>>>>>>
>>>>>> Somehow, ORTE believes that there is only one slot on each host, and
>>>>>> you believe the code you've identified is resetting the slot counts.
>>>>>>
>>>>>> Is that a correct summary of the situation?
>>>>>>
>>>>>> Thanks
>>>>>> Ralph
>>>>>>
>>>>>> On Jan 16, 2014, at 4:00 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>>>
>>>>>>> Hi Ralph,
>>>>>>>
>>>>>>> I encountered the hostfile issue again, where slots are counted by
>>>>>>> listing the node multiple times. This was supposed to be fixed by r29765
>>>>>>> - "Fix hostfile parsing for the case where RMs count slots ....".
>>>>>>>
>>>>>>> The difference is whether an RM is used or not. At that time, I executed
>>>>>>> mpirun through the Torque manager. This time I executed it directly from
>>>>>>> the command line, as shown at the bottom, where node05 and node06 each
>>>>>>> have 8 cores.
>>>>>>>
>>>>>>> Then, I checked the source files around it and found that lines 151-160
>>>>>>> in plm_base_launch_support.c caused this issue.
>>>>>>> As node->slots is already counted in hostfile.c @ r29765, even when
>>>>>>> node->slots_given is false, I think this part of
>>>>>>> plm_base_launch_support.c would be unnecessary.
>>>>>>>
>>>>>>> orte/mca/plm/base/plm_base_launch_support.c @ 30189:
>>>>>>> 151    } else {
>>>>>>> 152        /* set any non-specified slot counts to 1 */
>>>>>>> 153        for (i=0; i < orte_node_pool->size; i++) {
>>>>>>> 154            if (NULL == (node = (orte_node_t*)opal_pointer_array_get_item(orte_node_pool, i))) {
>>>>>>> 155                continue;
>>>>>>> 156            }
>>>>>>> 157            if (!node->slots_given) {
>>>>>>> 158                node->slots = 1;
>>>>>>> 159            }
>>>>>>> 160        }
>>>>>>> 161    }
>>>>>>>
>>>>>>> If I remove this part, it works very well, and the function
>>>>>>> orte_set_default_slots still works. I think this would be better for a
>>>>>>> compatible extension of openmpi-1.7.3.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Tetsuya Mishima
>>>>>>>
>>>>>>> [mishima@manage work]$ cat pbs_hosts
>>>>>>> node05
>>>>>>> node05
>>>>>>> node05
>>>>>>> node05
>>>>>>> node05
>>>>>>> node05
>>>>>>> node05
>>>>>>> node05
>>>>>>> node06
>>>>>>> node06
>>>>>>> node06
>>>>>>> node06
>>>>>>> node06
>>>>>>> node06
>>>>>>> node06
>>>>>>> node06
>>>>>>> [mishima@manage work]$ mpirun -np 4 -hostfile pbs_hosts -cpus-per-proc 4
>>>>>>> -report-bindings myprog
>>>>>>> [node05.cluster:22287] MCW rank 2 bound to socket 1[core 4[hwt 0]],
>>>>>>> socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]:
>>>>>>> [./././.][B/B/B/B]
>>>>>>> [node05.cluster:22287] MCW rank 3 is not bound (or bound to all
>>>>>>> available processors)
>>>>>>> [node05.cluster:22287] MCW rank 0 bound to socket 0[core 0[hwt 0]],
>>>>>>> socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]:
>>>>>>> [B/B/B/B][./././.]
>>>>>>> [node05.cluster:22287] MCW rank 1 is not bound (or bound to all
>>>>>>> available processors)
>>>>>>> Hello world from process 0 of 4
>>>>>>> Hello world from process 1 of 4
>>>>>>> Hello world from process 3 of 4
>>>>>>> Hello world from process 2 of 4
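For readers trying to follow the slot-count discussion, here is a minimal, self-contained C sketch of the defaulting logic in question. node_t and set_default_slots() are simplified stand-ins for orte_node_t and the loop quoted above, not the actual Open MPI code, and the extra "0 == slots" guard is just one illustrative way to avoid clobbering counts already derived from repeated hostfile entries; it is not the fix that went into the tree.

/*
 * Minimal sketch of the slot-defaulting loop being discussed.  node_t and
 * set_default_slots() are hypothetical stand-ins for orte_node_t and the
 * quoted plm_base_launch_support.c code; the "0 == slots" guard is only an
 * illustration, not the committed Open MPI fix.
 */
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    const char *name;
    int  slots;        /* 8 here means "counted from eight hostfile lines" */
    bool slots_given;  /* true only when the hostfile said slots=N explicitly */
} node_t;

static void set_default_slots(node_t *nodes, int n)
{
    for (int i = 0; i < n; i++) {
        /* only fall back to 1 slot when nothing was specified AND nothing
         * was counted; a count of 8 from repeated "node05" lines survives
         * even though slots_given is false */
        if (!nodes[i].slots_given && 0 == nodes[i].slots) {
            nodes[i].slots = 1;
        }
    }
}

int main(void)
{
    node_t nodes[] = {
        { "node05", 8, false },   /* counted from repeated hostfile lines */
        { "node06", 8, false },
        { "node07", 0, false },   /* never listed with a count: defaults to 1 */
    };
    set_default_slots(nodes, 3);
    for (int i = 0; i < 3; i++) {
        printf("%s slots=%d\n", nodes[i].name, nodes[i].slots);
    }
    return 0;
}

With the inputs above it prints slots=8 for node05 and node06 (the counts taken from the repeated hostfile entries) and slots=1 only for the node that was never counted.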