No, that is actually correct: we map processes onto a socket until it is full, then move to the next. What you want is --map-by socket:span
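To make the distinction concrete, here is a small simulation of the two placement policies (an illustrative sketch only, not Open MPI's actual mapper code): the default "socket" policy fills each socket before moving on, while "socket:span" round-robins consecutive ranks across sockets, which is the layout Tetsuya expected.

```python
# Sketch of the two mapping policies discussed above: 8 ranks,
# 4 sockets x 8 cores, 4 cpus per rank. Function names are made up
# for illustration; this is not Open MPI code.

def map_fill(nranks, cores_per_socket, cpus_per_rank):
    """-map-by socket (observed): fill a socket before moving to the next."""
    placement = []
    socket, used = 0, 0
    for rank in range(nranks):
        if used + cpus_per_rank > cores_per_socket:
            socket, used = socket + 1, 0   # current socket is full
        placement.append((rank, socket))
        used += cpus_per_rank
    return placement

def map_span(nranks, nsockets):
    """-map-by socket:span: round-robin consecutive ranks across sockets."""
    return [(rank, rank % nsockets) for rank in range(nranks)]

# Matches the -report-bindings output above: ranks 0,1 on socket 0,
# ranks 2,3 on socket 1, and so on.
print(map_fill(8, 8, 4))
# Matches the layout Tetsuya expected: rank 0 on socket 0, rank 1 on
# socket 1, rank 2 on socket 2, ...
print(map_span(8, 4))
```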
On Dec 10, 2013, at 3:42 PM, tmish...@jcity.maeda.co.jp wrote:

> Hi Ralph,
>
> I had time to try your patch yesterday using openmpi-1.7.4a1r29646.
>
> It stopped the error, but unfortunately "mapping by socket" itself didn't
> work well, as shown below:
>
> [mishima@manage demos]$ qsub -I -l nodes=1:ppn=32
> qsub: waiting for job 8260.manage.cluster to start
> qsub: job 8260.manage.cluster ready
>
> [mishima@node04 ~]$ cd ~/Desktop/openmpi-1.7/demos/
> [mishima@node04 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 -map-by socket myprog
> [node04.cluster:27489] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]:
> [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
> [node04.cluster:27489] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]:
> [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
> [node04.cluster:27489] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]:
> [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
> [node04.cluster:27489] MCW rank 5 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]:
> [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
> [node04.cluster:27489] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]:
> [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
> [node04.cluster:27489] MCW rank 7 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket 3[core 31[hwt 0]]:
> [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
> [node04.cluster:27489] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]:
> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
> [node04.cluster:27489] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]:
> [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
> Hello world from process 2 of 8
> Hello world from process 1 of 8
> Hello world from process 3 of 8
> Hello world from process 0 of 8
> Hello world from process 6 of 8
> Hello world from process 5 of 8
> Hello world from process 4 of 8
> Hello world from process 7 of 8
>
> I think this should be like this:
>
> rank 00  [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
> rank 01  [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
> rank 02  [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
> ...
>
> Regards,
> Tetsuya Mishima
>
>> I fixed this under the trunk (it was an issue regardless of RM) and have
>> scheduled it for 1.7.4.
>>
>> Thanks!
>> Ralph
>>
>> On Nov 25, 2013, at 4:22 PM, tmish...@jcity.maeda.co.jp wrote:
>>
>>> Hi Ralph,
>>>
>>> Thank you very much for your quick response.
>>>
>>> I'm afraid I found one more issue...
>>>
>>> It's not so serious. Please check it when you have time.
>>>
>>> The problem is cpus-per-proc with the -map-by option under the Torque
>>> manager. It doesn't work, as shown below. I guess you would get the same
>>> behaviour under the Slurm manager.
>>>
>>> Of course, if I remove the -map-by option, it works quite well.
>>> [mishima@manage testbed2]$ qsub -I -l nodes=1:ppn=32
>>> qsub: waiting for job 8116.manage.cluster to start
>>> qsub: job 8116.manage.cluster ready
>>>
>>> [mishima@node03 ~]$ cd ~/Ducom/testbed2
>>> [mishima@node03 testbed2]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 -map-by socket mPre
>>> --------------------------------------------------------------------------
>>> A request was made to bind to that would result in binding more
>>> processes than cpus on a resource:
>>>
>>>    Bind to:     CORE
>>>    Node:        node03
>>>    #processes:  2
>>>    #cpus:       1
>>>
>>> You can override this protection by adding the "overload-allowed"
>>> option to your binding directive.
>>> --------------------------------------------------------------------------
>>>
>>> [mishima@node03 testbed2]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 mPre
>>> [node03.cluster:18128] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]:
>>> [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
>>> [node03.cluster:18128] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]:
>>> [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
>>> [node03.cluster:18128] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]:
>>> [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
>>> [node03.cluster:18128] MCW rank 5 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]:
>>> [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
>>> [node03.cluster:18128] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]:
>>> [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
>>> [node03.cluster:18128] MCW rank 7 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket 3[core 31[hwt 0]]:
>>> [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
>>> [node03.cluster:18128] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]:
>>> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
>>> [node03.cluster:18128] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]:
>>> [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
>>>
>>> Regards,
>>> Tetsuya Mishima
>>>
>>>> Fixed and scheduled to move to 1.7.4. Thanks again!
>>>>
>>>> On Nov 17, 2013, at 6:11 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>
>>>> Thanks! That's precisely where I was going to look when I had time :-)
>>>>
>>>> I'll update tomorrow.
>>>> Ralph
>>>>
>>>> On Sun, Nov 17, 2013 at 7:01 PM, <tmish...@jcity.maeda.co.jp> wrote:
>>>>
>>>> Hi Ralph,
>>>>
>>>> This is a continuation of "Segmentation fault in oob_tcp.c of
>>>> openmpi-1.7.4a1r29646".
>>>>
>>>> I found the cause.
>>>>
>>>> First, I noticed that your hostfile works and mine does not.
>>>>
>>>> Your hostfile:
>>>> cat hosts
>>>> bend001 slots=12
>>>>
>>>> My hostfile:
>>>> cat hosts
>>>> node08
>>>> node08
>>>> ...(total 8 lines)
>>>>
>>>> I modified my script file to add "slots=1" to each line of my hostfile
>>>> just before launching mpirun. Then it worked.
>>>> My hostfile (modified):
>>>> cat hosts
>>>> node08 slots=1
>>>> node08 slots=1
>>>> ...(total 8 lines)
>>>>
>>>> Second, I confirmed that there's a slight difference between
>>>> orte/util/hostfile/hostfile.c of 1.7.3 and that of 1.7.4a1r29646.
>>>>
>>>> $ diff hostfile.c.org ../../../../openmpi-1.7.3/orte/util/hostfile/hostfile.c
>>>> 394,401c394,399
>>>> <     if (got_count) {
>>>> <         node->slots_given = true;
>>>> <     } else if (got_max) {
>>>> <         node->slots = node->slots_max;
>>>> <         node->slots_given = true;
>>>> <     } else {
>>>> <         /* should be set by obj_new, but just to be clear */
>>>> <         node->slots_given = false;
>>>> ---
>>>> >     if (!got_count) {
>>>> >         if (got_max) {
>>>> >             node->slots = node->slots_max;
>>>> >         } else {
>>>> >             ++node->slots;
>>>> >         }
>>>> ....
>>>>
>>>> Finally, I added line 402 below, just as a tentative trial.
>>>> Then it worked.
>>>>
>>>> cat -n orte/util/hostfile/hostfile.c:
>>>> ...
>>>>    394      if (got_count) {
>>>>    395          node->slots_given = true;
>>>>    396      } else if (got_max) {
>>>>    397          node->slots = node->slots_max;
>>>>    398          node->slots_given = true;
>>>>    399      } else {
>>>>    400          /* should be set by obj_new, but just to be clear */
>>>>    401          node->slots_given = false;
>>>>    402          ++node->slots; /* added by tmishima */
>>>>    403      }
>>>> ...
>>>>
>>>> Please fix the problem properly, because my change is just based on a
>>>> guess. It's related to the treatment of a hostfile where no slots
>>>> information is given.
>>>> Regards,
>>>> Tetsuya Mishima
>>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> us...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users