I believe I now have this working correctly on the trunk and set up for 1.7.4. If you get a chance, please give it a try and confirm it solves the problem.
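
(For anyone skimming the thread below, here is a minimal, self-contained sketch of the behaviour in question. It is only a toy model -- toy_node_t and managed_allocation loosely mirror the orte_node_t fields and orte_managed_allocation flag quoted further down, not the real ORTE structures -- showing how defaulting unspecified slot counts to 1 clobbers counts derived from repeated hostfile entries unless that pass is skipped for un-managed allocations. It is not necessarily the change that went into the trunk.)

/* Toy model only: mimics the slot-defaulting logic discussed in this
 * thread.  This is NOT the actual Open MPI code. */
#include <stdio.h>
#include <stdbool.h>

typedef struct {
    const char *name;
    int         slots;        /* may already be set by hostfile parsing */
    bool        slots_given;  /* true only for an explicit slots=N entry */
} toy_node_t;

/* The 1.7.3 behaviour: every node without an explicit slots=N is reset
 * to 1, even if repeated hostfile entries already produced slots=8. */
static void default_slots_unguarded(toy_node_t *nodes, int n)
{
    for (int i = 0; i < n; i++) {
        if (!nodes[i].slots_given) {
            nodes[i].slots = 1;
        }
    }
}

/* One possible guard: only apply the default when a resource manager
 * provided the allocation, so hostfile-derived counts survive
 * un-managed runs.  (Illustrative; the real fix may differ.) */
static void default_slots_guarded(toy_node_t *nodes, int n, bool managed_allocation)
{
    if (managed_allocation) {
        default_slots_unguarded(nodes, n);
    }
}

int main(void)
{
    /* 8 slots per node, counted from the repeated-hostname hostfile */
    toy_node_t nodes[] = {
        { "node05", 8, false },
        { "node06", 8, false },
    };
    int n = (int)(sizeof(nodes) / sizeof(nodes[0]));

    default_slots_guarded(nodes, n, false);   /* un-managed, as in this report */
    for (int i = 0; i < n; i++) {
        printf("%s: slots=%d\n", nodes[i].name, nodes[i].slots);
    }
    /* prints slots=8 for both nodes; default_slots_unguarded() would
     * have printed slots=1, which is the problem reported below */
    return 0;
}
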
Thanks
Ralph

On Jan 17, 2014, at 2:16 PM, Ralph Castain <r...@open-mpi.org> wrote:

> Sorry for the delay - I understood and was just occupied with something
> else for a while. Thanks for the follow-up. I'm looking at the issue and
> trying to decipher the right solution.
>
>
> On Jan 17, 2014, at 2:00 PM, tmish...@jcity.maeda.co.jp wrote:
>
>> Hi Ralph,
>>
>> I'm sorry that my explanation was not enough ...
>> This is the summary of my situation:
>>
>> 1. I create a hostfile, shown below, manually.
>>
>> 2. I use mpirun to start the job without Torque, which means I'm running
>> in an un-managed environment.
>>
>> 3. First, ORTE detects 8 slots on each host (maybe in
>> "orte_ras_base_allocate"):
>>      node05: slots=8 max_slots=0 slots_inuse=0
>>      node06: slots=8 max_slots=0 slots_inuse=0
>>
>> 4. Then, the code I identified resets the slot counts:
>>      node05: slots=1 max_slots=0 slots_inuse=0
>>      node06: slots=1 max_slots=0 slots_inuse=0
>>
>> 5. Therefore, ORTE believes that there is only one slot on each host.
>>
>> Regards,
>> Tetsuya Mishima
>>
>>> No, I didn't use Torque this time.
>>>
>>> This issue occurs only when it is not in a managed environment -
>>> namely, orte_managed_allocation is false (and orte_set_slots is NULL).
>>>
>>> Under Torque management, it works fine.
>>>
>>> I hope you can understand the situation.
>>>
>>> Tetsuya Mishima
>>>
>>>> I'm sorry, but I'm really confused, so let me try to understand the
>>>> situation.
>>>>
>>>> You use Torque to get an allocation, so you are running in a managed
>>>> environment.
>>>>
>>>> You then use mpirun to start the job, but pass it a hostfile as shown
>>>> below.
>>>>
>>>> Somehow, ORTE believes that there is only one slot on each host, and
>>>> you believe the code you've identified is resetting the slot counts.
>>>>
>>>> Is that a correct summary of the situation?
>>>>
>>>> Thanks
>>>> Ralph
>>>>
>>>> On Jan 16, 2014, at 4:00 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>
>>>>> Hi Ralph,
>>>>>
>>>>> I encountered the hostfile issue again where slots are counted by
>>>>> listing the node multiple times. This should have been fixed by r29765
>>>>> - "Fix hostfile parsing for the case where RMs count slots ....".
>>>>>
>>>>> The difference is whether an RM is used or not. At that time, I
>>>>> executed mpirun through the Torque manager. This time I executed it
>>>>> directly from the command line, as shown at the bottom, where node05
>>>>> and node06 each have 8 cores.
>>>>>
>>>>> Then I checked the source files around it and found that lines 151-160
>>>>> in plm_base_launch_support.c cause this issue. Since node->slots is
>>>>> already counted in hostfile.c @ r29765 even when node->slots_given is
>>>>> false, I think this part of plm_base_launch_support.c is unnecessary.
>>>>>
>>>>> orte/mca/plm/base/plm_base_launch_support.c @ 30189:
>>>>> 151        } else {
>>>>> 152            /* set any non-specified slot counts to 1 */
>>>>> 153            for (i=0; i < orte_node_pool->size; i++) {
>>>>> 154                if (NULL == (node =
>>>>> (orte_node_t*)opal_pointer_array_get_item(orte_node_pool, i))) {
>>>>> 155                    continue;
>>>>> 156                }
>>>>> 157                if (!node->slots_given) {
>>>>> 158                    node->slots = 1;
>>>>> 159                }
>>>>> 160            }
>>>>> 161        }
>>>>>
>>>>> With this part removed, it works very well, and the function of
>>>>> orte_set_default_slots is still preserved. I think this would be a
>>>>> better, compatible extension of openmpi-1.7.3.
>>>>>
>>>>> Regards,
>>>>> Tetsuya Mishima
>>>>>
>>>>> [mishima@manage work]$ cat pbs_hosts
>>>>> node05
>>>>> node05
>>>>> node05
>>>>> node05
>>>>> node05
>>>>> node05
>>>>> node05
>>>>> node05
>>>>> node06
>>>>> node06
>>>>> node06
>>>>> node06
>>>>> node06
>>>>> node06
>>>>> node06
>>>>> node06
>>>>> [mishima@manage work]$ mpirun -np 4 -hostfile pbs_hosts -cpus-per-proc 4
>>>>> -report-bindings myprog
>>>>> [node05.cluster:22287] MCW rank 2 bound to socket 1[core 4[hwt 0]],
>>>>> socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]:
>>>>> [./././.][B/B/B/B]
>>>>> [node05.cluster:22287] MCW rank 3 is not bound (or bound to all
>>>>> available processors)
>>>>> [node05.cluster:22287] MCW rank 0 bound to socket 0[core 0[hwt 0]],
>>>>> socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]:
>>>>> [B/B/B/B][./././.]
>>>>> [node05.cluster:22287] MCW rank 1 is not bound (or bound to all
>>>>> available processors)
>>>>> Hello world from process 0 of 4
>>>>> Hello world from process 1 of 4
>>>>> Hello world from process 3 of 4
>>>>> Hello world from process 2 of 4
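
(As an aside, a sketch of a possible workaround while testing: the same 8-slots-per-node allocation can be written with explicit slot counts instead of repeated hostnames. This assumes the standard "slots=N" hostfile syntax; an explicit count should mark the node's slots as given and therefore survive the defaulting pass quoted above.)

    node05 slots=8
    node06 slots=8

    mpirun -np 4 -hostfile pbs_hosts -cpus-per-proc 4 -report-bindings myprog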