Hi Ralph, sorry for confusing you.
At that time, I cut and pasted the output of "cat $PBS_NODEFILE", and I guess
I failed to paste the last line by mistake.

I reran the test, and below is exactly what I got:

[mishima@manage ~]$ qsub -I -l nodes=node11:ppn=8+node12:ppn=8
qsub: waiting for job 8338.manage.cluster to start
qsub: job 8338.manage.cluster ready

[mishima@node11 ~]$ cat $PBS_NODEFILE
node11
node11
node11
node11
node11
node11
node11
node11
node12
node12
node12
node12
node12
node12
node12
node12
[mishima@node11 ~]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings myprog
--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:

Bind to: CORE
Node: node12
#processes: 2
#cpus: 1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------

Regards,
Tetsuya Mishima

> I removed the debug in #2 - thanks for reporting it
>
> For #1, it actually looks to me like this is correct. If you look at your
> allocation, there are only 7 slots being allocated on node12, yet you have
> asked for 8 cpus to be assigned (2 procs with 4 cpus/proc). So the warning
> is in fact correct
>
>
> On Dec 18, 2013, at 4:04 PM, tmish...@jcity.maeda.co.jp wrote:
>
> >
> >
> > Hi Ralph, I found that openmpi-1.7.4rc1 was already uploaded. So I'd like
> > to report 3 issues, mainly regarding -cpus-per-proc.
> >
> > 1) When I use 2 nodes (node11, node12), which have 8 cores each
> > (= 2 sockets X 4 cores/socket), it starts to produce the error again as
> > shown below. At least, openmpi-1.7.4a1r29646 did work well.
> > > > [mishima@manage ~]$ qsub -I -l nodes=2:ppn=8 > > qsub: waiting for job 8336.manage.cluster to start > > qsub: job 8336.manage.cluster ready > > > > [mishima@node11 ~]$ cd ~/Desktop/openmpi-1.7/demos/ > > [mishima@node11 demos]$ cat $PBS_NODEFILE > > node11 > > node11 > > node11 > > node11 > > node11 > > node11 > > node11 > > node11 > > node12 > > node12 > > node12 > > node12 > > node12 > > node12 > > node12 > > [mishima@node11 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings > > myprog > > -------------------------------------------------------------------------- > > A request was made to bind to that would result in binding more > > processes than cpus on a resource: > > > > Bind to: CORE > > Node: node12 > > #processes: 2 > > #cpus: 1 > > > > You can override this protection by adding the "overload-allowed" > > option to your binding directive. > > -------------------------------------------------------------------------- > > > > Of course it works well using only one node. > > > > [mishima@node11 demos]$ mpirun -np 2 -cpus-per-proc 4 -report-bindings > > myprog > > [node11.cluster:26238] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket > > 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so > > cket 0[core 3[hwt 0]]: [B/B/B/B][./././.] > > [node11.cluster:26238] MCW rank 1 bound to socket 1[core 4[hwt 0]], socket > > 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], so > > cket 1[core 7[hwt 0]]: [./././.][B/B/B/B] > > Hello world from process 1 of 2 > > Hello world from process 0 of 2 > > > > > > 2) Adding "-bind-to numa", it works but the message "bind:upward target > > NUMANode type NUMANode" appears. > > As far as I remember, I didn't see such a kind of message before. 
> > > > mishima@node11 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings > > -bind-to numa myprog > > [node11.cluster:26260] [[8844,0],0] bind:upward target NUMANode type > > NUMANode > > [node11.cluster:26260] [[8844,0],0] bind:upward target NUMANode type > > NUMANode > > [node11.cluster:26260] [[8844,0],0] bind:upward target NUMANode type > > NUMANode > > [node11.cluster:26260] [[8844,0],0] bind:upward target NUMANode type > > NUMANode > > [node11.cluster:26260] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket > > 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so > > cket 0[core 3[hwt 0]]: [B/B/B/B][./././.] > > [node11.cluster:26260] MCW rank 1 bound to socket 1[core 4[hwt 0]], socket > > 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], so > > cket 1[core 7[hwt 0]]: [./././.][B/B/B/B] > > [node12.cluster:23607] MCW rank 3 bound to socket 1[core 4[hwt 0]], socket > > 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], so > > cket 1[core 7[hwt 0]]: [./././.][B/B/B/B] > > [node12.cluster:23607] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket > > 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so > > cket 0[core 3[hwt 0]]: [B/B/B/B][./././.] > > Hello world from process 1 of 4 > > Hello world from process 0 of 4 > > Hello world from process 3 of 4 > > Hello world from process 2 of 4 > > > > > > 3) I use PGI compiler. It can not accept compiler switch > > "-Wno-variadic-macros", which is > > included in configure script. > > > > btl_usnic_CFLAGS="-Wno-variadic-macros" > > > > I removed this switch, then I could continue to build 1.7.4rc1. > > > > Regards, > > Tetsuya Mishima > > > > > >> Hmmm...okay, I understand the scenario. Must be something in the algo > > when it only has one node, so it shouldn't be too hard to track down. > >> > >> I'm off on travel for a few days, but will return to this when I get > > back. 
> >>
> >> Sorry for delay - will try to look at this while I'm gone, but can't promise anything :-(
> >>
> >>
> >> On Dec 10, 2013, at 6:58 PM, tmish...@jcity.maeda.co.jp wrote:
> >>
> >>>
> >>>
> >>> Hi Ralph, sorry for the confusion.
> >>>
> >>> We usually log on to "manage", which is our control node.
> >>> From manage, we submit jobs or enter a remote node such as
> >>> node03 using Torque interactive mode (qsub -I).
> >>>
> >>> At that time, instead of using Torque, I just did rsh to node03 from manage
> >>> and ran myprog on that node. I hope that makes clear what I did.
> >>>
> >>> Now, I retried with "-host node03", which still causes the problem:
> >>> (I confirmed that a local run on manage causes the same problem too)
> >>>
> >>> [mishima@manage ~]$ rsh node03
> >>> Last login: Wed Dec 11 11:38:57 from manage
> >>> [mishima@node03 ~]$ cd ~/Desktop/openmpi-1.7/demos/
> >>> [mishima@node03 demos]$
> >>> [mishima@node03 demos]$ mpirun -np 8 -host node03 -report-bindings -cpus-per-proc 4 -map-by socket myprog
> >>> --------------------------------------------------------------------------
> >>> A request was made to bind to that would result in binding more
> >>> processes than cpus on a resource:
> >>>
> >>> Bind to: CORE
> >>> Node: node03
> >>> #processes: 2
> >>> #cpus: 1
> >>>
> >>> You can override this protection by adding the "overload-allowed"
> >>> option to your binding directive.
> >>> --------------------------------------------------------------------------
> >>>
> >>> It's strange, but I have to report that "-map-by socket:span" worked well.
> >>>
> >>> [mishima@node03 demos]$ mpirun -np 8 -host node03 -report-bindings -cpus-per-proc 4 -map-by socket:span myprog
> >>> [node03.cluster:11871] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
> >>> [node03.cluster:11871] MCW rank 3 bound to socket 1[core 12[hwt 0]], > > socket > >>> 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], > >>> socket 1[core 15[hwt 0]]: > >>> [./././././././.][././././B/B/B/B][./././././././.][./././././././.] > >>> [node03.cluster:11871] MCW rank 4 bound to socket 2[core 16[hwt 0]], > > socket > >>> 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], > >>> socket 2[core 19[hwt 0]]: > >>> [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.] > >>> [node03.cluster:11871] MCW rank 5 bound to socket 2[core 20[hwt 0]], > > socket > >>> 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], > >>> socket 2[core 23[hwt 0]]: > >>> [./././././././.][./././././././.][././././B/B/B/B][./././././././.] > >>> [node03.cluster:11871] MCW rank 6 bound to socket 3[core 24[hwt 0]], > > socket > >>> 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], > >>> socket 3[core 27[hwt 0]]: > >>> [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.] > >>> [node03.cluster:11871] MCW rank 7 bound to socket 3[core 28[hwt 0]], > > socket > >>> 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], > >>> socket 3[core 31[hwt 0]]: > >>> [./././././././.][./././././././.][./././././././.][././././B/B/B/B] > >>> [node03.cluster:11871] MCW rank 0 bound to socket 0[core 0[hwt 0]], > > socket > >>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so > >>> cket 0[core 3[hwt 0]]: > >>> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.] > >>> [node03.cluster:11871] MCW rank 1 bound to socket 0[core 4[hwt 0]], > > socket > >>> 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], so > >>> cket 0[core 7[hwt 0]]: > >>> [././././B/B/B/B][./././././././.][./././././././.][./././././././.] 
> >>> Hello world from process 2 of 8 > >>> Hello world from process 6 of 8 > >>> Hello world from process 3 of 8 > >>> Hello world from process 7 of 8 > >>> Hello world from process 1 of 8 > >>> Hello world from process 5 of 8 > >>> Hello world from process 0 of 8 > >>> Hello world from process 4 of 8 > >>> > >>> Regards, > >>> Tetsuya Mishima > >>> > >>> > >>>> On Dec 10, 2013, at 6:05 PM, tmish...@jcity.maeda.co.jp wrote: > >>>> > >>>>> > >>>>> > >>>>> Hi Ralph, > >>>>> > >>>>> I tried again with -cpus-per-proc 2 as shown below. > >>>>> Here, I found that "-map-by socket:span" worked well. > >>>>> > >>>>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc > > 2 > >>>>> -map-by socket:span myprog > >>>>> [node03.cluster:10879] MCW rank 2 bound to socket 1[core 8[hwt 0]], > >>> socket > >>>>> 1[core 9[hwt 0]]: [./././././././.][B/B/././. > >>>>> /././.][./././././././.][./././././././.] > >>>>> [node03.cluster:10879] MCW rank 3 bound to socket 1[core 10[hwt 0]], > >>> socket > >>>>> 1[core 11[hwt 0]]: [./././././././.][././B/B > >>>>> /./././.][./././././././.][./././././././.] > >>>>> [node03.cluster:10879] MCW rank 4 bound to socket 2[core 16[hwt 0]], > >>> socket > >>>>> 2[core 17[hwt 0]]: [./././././././.][./././. > >>>>> /./././.][B/B/./././././.][./././././././.] > >>>>> [node03.cluster:10879] MCW rank 5 bound to socket 2[core 18[hwt 0]], > >>> socket > >>>>> 2[core 19[hwt 0]]: [./././././././.][./././. > >>>>> /./././.][././B/B/./././.][./././././././.] > >>>>> [node03.cluster:10879] MCW rank 6 bound to socket 3[core 24[hwt 0]], > >>> socket > >>>>> 3[core 25[hwt 0]]: [./././././././.][./././. > >>>>> /./././.][./././././././.][B/B/./././././.] > >>>>> [node03.cluster:10879] MCW rank 7 bound to socket 3[core 26[hwt 0]], > >>> socket > >>>>> 3[core 27[hwt 0]]: [./././././././.][./././. > >>>>> /./././.][./././././././.][././B/B/./././.] 
> >>>>> [node03.cluster:10879] MCW rank 0 bound to socket 0[core 0[hwt 0]], > >>> socket > >>>>> 0[core 1[hwt 0]]: [B/B/./././././.][././././. > >>>>> /././.][./././././././.][./././././././.] > >>>>> [node03.cluster:10879] MCW rank 1 bound to socket 0[core 2[hwt 0]], > >>> socket > >>>>> 0[core 3[hwt 0]]: [././B/B/./././.][././././. > >>>>> /././.][./././././././.][./././././././.] > >>>>> Hello world from process 1 of 8 > >>>>> Hello world from process 0 of 8 > >>>>> Hello world from process 4 of 8 > >>>>> Hello world from process 2 of 8 > >>>>> Hello world from process 7 of 8 > >>>>> Hello world from process 6 of 8 > >>>>> Hello world from process 5 of 8 > >>>>> Hello world from process 3 of 8 > >>>>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc > > 2 > >>>>> -map-by socket myprog > >>>>> [node03.cluster:10921] MCW rank 2 bound to socket 0[core 4[hwt 0]], > >>> socket > >>>>> 0[core 5[hwt 0]]: [././././B/B/./.][././././. > >>>>> /././.][./././././././.][./././././././.] > >>>>> [node03.cluster:10921] MCW rank 3 bound to socket 0[core 6[hwt 0]], > >>> socket > >>>>> 0[core 7[hwt 0]]: [././././././B/B][././././. > >>>>> /././.][./././././././.][./././././././.] > >>>>> [node03.cluster:10921] MCW rank 4 bound to socket 1[core 8[hwt 0]], > >>> socket > >>>>> 1[core 9[hwt 0]]: [./././././././.][B/B/././. > >>>>> /././.][./././././././.][./././././././.] > >>>>> [node03.cluster:10921] MCW rank 5 bound to socket 1[core 10[hwt 0]], > >>> socket > >>>>> 1[core 11[hwt 0]]: [./././././././.][././B/B > >>>>> /./././.][./././././././.][./././././././.] > >>>>> [node03.cluster:10921] MCW rank 6 bound to socket 1[core 12[hwt 0]], > >>> socket > >>>>> 1[core 13[hwt 0]]: [./././././././.][./././. > >>>>> /B/B/./.][./././././././.][./././././././.] > >>>>> [node03.cluster:10921] MCW rank 7 bound to socket 1[core 14[hwt 0]], > >>> socket > >>>>> 1[core 15[hwt 0]]: [./././././././.][./././. > >>>>> /././B/B][./././././././.][./././././././.] 
> >>>>> [node03.cluster:10921] MCW rank 0 bound to socket 0[core 0[hwt 0]], > >>> socket > >>>>> 0[core 1[hwt 0]]: [B/B/./././././.][././././. > >>>>> /././.][./././././././.][./././././././.] > >>>>> [node03.cluster:10921] MCW rank 1 bound to socket 0[core 2[hwt 0]], > >>> socket > >>>>> 0[core 3[hwt 0]]: [././B/B/./././.][././././. > >>>>> /././.][./././././././.][./././././././.] > >>>>> Hello world from process 5 of 8 > >>>>> Hello world from process 1 of 8 > >>>>> Hello world from process 6 of 8 > >>>>> Hello world from process 4 of 8 > >>>>> Hello world from process 2 of 8 > >>>>> Hello world from process 0 of 8 > >>>>> Hello world from process 7 of 8 > >>>>> Hello world from process 3 of 8 > >>>>> > >>>>> "-np 8" and "-cpus-per-proc 4" just filled all sockets. > >>>>> In this case, I guess "-map-by socket:span" and "-map-by socket" has > >>> same > >>>>> meaning. > >>>>> Therefore, there's no problem about that. Sorry for distubing. > >>>> > >>>> No problem - glad you could clear that up :-) > >>>> > >>>>> > >>>>> By the way, through this test, I found another problem. > >>>>> Without torque manager and just using rsh, it causes the same error > >>> like > >>>>> below: > >>>>> > >>>>> [mishima@manage openmpi-1.7]$ rsh node03 > >>>>> Last login: Wed Dec 11 09:42:02 from manage > >>>>> [mishima@node03 ~]$ cd ~/Desktop/openmpi-1.7/demos/ > >>>>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc > > 4 > >>>>> -map-by socket myprog > >>>> > >>>> I don't understand the difference here - you are simply starting it > > from > >>> a different node? It looks like everything is expected to run local to > >>> mpirun, yes? So there is no rsh actually involved here. > >>>> Are you still running in an allocation? > >>>> > >>>> If you run this with "-host node03" on the cmd line, do you see the > > same > >>> problem? 
> >>>> > >>>> > >>>>> > >>> > > -------------------------------------------------------------------------- > >>>>> A request was made to bind to that would result in binding more > >>>>> processes than cpus on a resource: > >>>>> > >>>>> Bind to: CORE > >>>>> Node: node03 > >>>>> #processes: 2 > >>>>> #cpus: 1 > >>>>> > >>>>> You can override this protection by adding the "overload-allowed" > >>>>> option to your binding directive. > >>>>> > >>> > > -------------------------------------------------------------------------- > >>>>> [mishima@node03 demos]$ > >>>>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc > > 4 > >>>>> myprog > >>>>> [node03.cluster:11036] MCW rank 2 bound to socket 1[core 8[hwt 0]], > >>> socket > >>>>> 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], s > >>>>> ocket 1[core 11[hwt 0]]: > >>>>> [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.] > >>>>> [node03.cluster:11036] MCW rank 3 bound to socket 1[core 12[hwt 0]], > >>> socket > >>>>> 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], > >>>>> socket 1[core 15[hwt 0]]: > >>>>> [./././././././.][././././B/B/B/B][./././././././.][./././././././.] > >>>>> [node03.cluster:11036] MCW rank 4 bound to socket 2[core 16[hwt 0]], > >>> socket > >>>>> 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], > >>>>> socket 2[core 19[hwt 0]]: > >>>>> [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.] > >>>>> [node03.cluster:11036] MCW rank 5 bound to socket 2[core 20[hwt 0]], > >>> socket > >>>>> 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], > >>>>> socket 2[core 23[hwt 0]]: > >>>>> [./././././././.][./././././././.][././././B/B/B/B][./././././././.] > >>>>> [node03.cluster:11036] MCW rank 6 bound to socket 3[core 24[hwt 0]], > >>> socket > >>>>> 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], > >>>>> socket 3[core 27[hwt 0]]:>>>>> [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.] 
> >>>>> [node03.cluster:11036] MCW rank 7 bound to socket 3[core 28[hwt 0]], > >>> socket > >>>>> 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], > >>>>> socket 3[core 31[hwt 0]]: > >>>>> [./././././././.][./././././././.][./././././././.][././././B/B/B/B] > >>>>> [node03.cluster:11036] MCW rank 0 bound to socket 0[core 0[hwt 0]], > >>> socket > >>>>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so > >>>>> cket 0[core 3[hwt 0]]: > >>>>> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.] > >>>>> [node03.cluster:11036] MCW rank 1 bound to socket 0[core 4[hwt 0]], > >>> socket > >>>>> 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], so > >>>>> cket 0[core 7[hwt 0]]: > >>>>> [././././B/B/B/B][./././././././.][./././././././.][./././././././.] > >>>>> Hello world from process 4 of 8 > >>>>> Hello world from process 2 of 8 > >>>>> Hello world from process 6 of 8 > >>>>> Hello world from process 5 of 8 > >>>>> Hello world from process 3 of 8 > >>>>> Hello world from process 7 of 8 > >>>>> Hello world from process 0 of 8 > >>>>> Hello world from process 1 of 8 > >>>>> > >>>>> Regards, > >>>>> Tetsuya Mishima > >>>>> > >>>>>> Hmmm...that's strange. I only have 2 sockets on my system, but let > > me > >>>>> poke around a bit and see what might be happening. > >>>>>> > >>>>>> On Dec 10, 2013, at 4:47 PM, tmish...@jcity.maeda.co.jp wrote: > >>>>>> > >>>>>>> > >>>>>>> > >>>>>>> Hi Ralph, > >>>>>>> > >>>>>>> Thanks. I didn't know the meaning of "socket:span". > >>>>>>> > >>>>>>> But it still causes the problem, which seems socket:span doesn't > >>> work. 
> >>>>>>> > >>>>>>> [mishima@manage demos]$ qsub -I -l nodes=node03:ppn=32 > >>>>>>> qsub: waiting for job 8265.manage.cluster to start > >>>>>>> qsub: job 8265.manage.cluster ready > >>>>>>> > >>>>>>> [mishima@node03 ~]$ cd ~/Desktop/openmpi-1.7/demos/ > >>>>>>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings > > -cpus-per-proc > >>> 4 > >>>>>>> -map-by socket:span myprog > >>>>>>> [node03.cluster:10262] MCW rank 2 bound to socket 1[core 8[hwt 0]], > >>>>> socket > >>>>>>> 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], s > >>>>>>> ocket 1[core 11[hwt 0]]: > >>>>>>> > > [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.] > >>>>>>> [node03.cluster:10262] MCW rank 3 bound to socket 1[core 12[hwt > > 0]], > >>>>> socket > >>>>>>> 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], > >>>>>>> socket 1[core 15[hwt 0]]: > >>>>>>> > > [./././././././.][././././B/B/B/B][./././././././.][./././././././.] > >>>>>>> [node03.cluster:10262] MCW rank 4 bound to socket 2[core 16[hwt > > 0]], > >>>>> socket > >>>>>>> 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], > >>>>>>> socket 2[core 19[hwt 0]]: > >>>>>>> > > [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.] > >>>>>>> [node03.cluster:10262] MCW rank 5 bound to socket 2[core 20[hwt > > 0]], > >>>>> socket > >>>>>>> 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], > >>>>>>> socket 2[core 23[hwt 0]]: > >>>>>>> > > [./././././././.][./././././././.][././././B/B/B/B][./././././././.] > >>>>>>> [node03.cluster:10262] MCW rank 6 bound to socket 3[core 24[hwt > > 0]], > >>>>> socket > >>>>>>> 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], > >>>>>>> socket 3[core 27[hwt 0]]: > >>>>>>> > > [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.] 
> >>>>>>> [node03.cluster:10262] MCW rank 7 bound to socket 3[core 28[hwt > > 0]], > >>>>> socket > >>>>>>> 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], > >>>>>>> socket 3[core 31[hwt 0]]: > >>>>>>> > > [./././././././.][./././././././.][./././././././.][././././B/B/B/B] > >>>>>>> [node03.cluster:10262] MCW rank 0 bound to socket 0[core 0[hwt 0]], > >>>>> socket > >>>>>>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so > >>>>>>> cket 0[core 3[hwt 0]]: > >>>>>>> > > [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.] > >>>>>>> [node03.cluster:10262] MCW rank 1 bound to socket 0[core 4[hwt 0]], > >>>>> socket > >>>>>>> 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], so > >>>>>>> cket 0[core 7[hwt 0]]: > >>>>>>> > > [././././B/B/B/B][./././././././.][./././././././.][./././././././.] > >>>>>>> Hello world from process 0 of 8 > >>>>>>> Hello world from process 3 of 8 > >>>>>>> Hello world from process 1 of 8 > >>>>>>> Hello world from process 4 of 8 > >>>>>>> Hello world from process 6 of 8 > >>>>>>> Hello world from process 5 of 8 > >>>>>>> Hello world from process 2 of 8 > >>>>>>> Hello world from process 7 of 8 > >>>>>>> > >>>>>>> Regards, > >>>>>>> Tetsuya Mishima > >>>>>>> > >>>>>>>> No, that is actually correct. We map a socket until full, then > > move > >>> to > >>>>>>> the next. What you want is --map-by socket:span > >>>>>>>> > >>>>>>>> On Dec 10, 2013, at 3:42 PM, tmish...@jcity.maeda.co.jp wrote: > >>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> Hi Ralph, > >>>>>>>>> > >>>>>>>>> I had a time to try your patch yesterday using > >>> openmpi-1.7.4a1r29646. 
> >>>>>>>>>>>>>>>> It stopped the error but unfortunately "mapping by > > socket" itself > >>>>>>> didn't > >>>>>>>>> work > >>>>>>>>> well as shown bellow: > >>>>>>>>> > >>>>>>>>> [mishima@manage demos]$ qsub -I -l nodes=1:ppn=32 > >>>>>>>>> qsub: waiting for job 8260.manage.cluster to start > >>>>>>>>> qsub: job 8260.manage.cluster ready > >>>>>>>>> > >>>>>>>>> [mishima@node04 ~]$ cd ~/Desktop/openmpi-1.7/demos/ > >>>>>>>>> [mishima@node04 demos]$ mpirun -np 8 -report-bindings > >>> -cpus-per-proc > >>>>> 4 > >>>>>>>>> -map-by socket myprog > >>>>>>>>> [node04.cluster:27489] MCW rank 2 bound to socket 1[core 8[hwt > > 0]], > >>>>>>> socket > >>>>>>>>> 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], s > >>>>>>>>> ocket 1[core 11[hwt 0]]: > >>>>>>>>> > >>> [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.] > >>>>>>>>> [node04.cluster:27489] MCW rank 3 bound to socket 1[core 12[hwt > >>> 0]], > >>>>>>> socket > >>>>>>>>> 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], > >>>>>>>>> socket 1[core 15[hwt 0]]: > >>>>>>>>> > >>> [./././././././.][././././B/B/B/B][./././././././.][./././././././.] > >>>>>>>>> [node04.cluster:27489] MCW rank 4 bound to socket 2[core 16[hwt > >>> 0]], > >>>>>>> socket > >>>>>>>>> 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], > >>>>>>>>> socket 2[core 19[hwt 0]]: > >>>>>>>>> > >>> [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.] > >>>>>>>>> [node04.cluster:27489] MCW rank 5 bound to socket 2[core 20[hwt > >>> 0]], > >>>>>>> socket > >>>>>>>>> 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], > >>>>>>>>> socket 2[core 23[hwt 0]]: > >>>>>>>>> > >>> [./././././././.][./././././././.][././././B/B/B/B][./././././././.] > >>>>>>>>> [node04.cluster:27489] MCW rank 6 bound to socket 3[core 24[hwt > >>> 0]], > >>>>>>> socket > >>>>>>>>> 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], > >>>>>>>>> socket 3[core 27[hwt 0]]: > >>>>>>>>> > >>> [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.] 
> >>>>>>>>> [node04.cluster:27489] MCW rank 7 bound to socket 3[core 28[hwt > >>> 0]], > >>>>>>> socket > >>>>>>>>> 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], > >>>>>>>>> socket 3[core 31[hwt 0]]: > >>>>>>>>> > >>> [./././././././.][./././././././.][./././././././.][././././B/B/B/B] > >>>>>>>>> [node04.cluster:27489] MCW rank 0 bound to socket 0[core 0[hwt > > 0]], > >>>>>>> socket > >>>>>>>>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so > >>>>>>>>> cket 0[core 3[hwt 0]]: > >>>>>>>>> > >>> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.] > >>>>>>>>> [node04.cluster:27489] MCW rank 1 bound to socket 0[core 4[hwt > > 0]], > >>>>>>> socket > >>>>>>>>> 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], so > >>>>>>>>> cket 0[core 7[hwt 0]]: > >>>>>>>>> > >>> [././././B/B/B/B][./././././././.][./././././././.][./././././././.] > >>>>>>>>> Hello world from process 2 of 8 > >>>>>>>>> Hello world from process 1 of 8 > >>>>>>>>> Hello world from process 3 of 8 > >>>>>>>>> Hello world from process 0 of 8 > >>>>>>>>> Hello world from process 6 of 8 > >>>>>>>>> Hello world from process 5 of 8 > >>>>>>>>> Hello world from process 4 of 8 > >>>>>>>>> Hello world from process 7 of 8 > >>>>>>>>> > >>>>>>>>> I think this should be like this: > >>>>>>>>> > >>>>>>>>> rank 00 > >>>>>>>>> > >>> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.] > >>>>>>>>> rank 01 > >>>>>>>>> > >>> [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.] > >>>>>>>>> rank 02 > >>>>>>>>> > >>> [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.] > >>>>>>>>> ... > >>>>>>>>> > >>>>>>>>> Regards, > >>>>>>>>> Tetsuya Mishima > >>>>>>>>> > >>>>>>>>>> I fixed this under the trunk (was an issue regardless of RM) and > >>>>> have > >>>>>>>>> scheduled it for 1.7.4. > >>>>>>>>>> > >>>>>>>>>> Thanks! 
> >>>>>>>>>> Ralph
> >>>>>>>>>>
> >>>>>>>>>> On Nov 25, 2013, at 4:22 PM, tmish...@jcity.maeda.co.jp wrote:
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Hi Ralph,
> >>>>>>>>>>>
> >>>>>>>>>>> Thank you very much for your quick response.
> >>>>>>>>>>>
> >>>>>>>>>>> I'm afraid to say that I found one more issue...
> >>>>>>>>>>>
> >>>>>>>>>>> It's not so serious. Please check it when you have time.
> >>>>>>>>>>>
> >>>>>>>>>>> The problem is -cpus-per-proc combined with the -map-by option under the
> >>>>>>>>>>> Torque manager. It doesn't work, as shown below. I guess you can get the
> >>>>>>>>>>> same behaviour under the Slurm manager.
> >>>>>>>>>>>
> >>>>>>>>>>> Of course, if I remove the -map-by option, it works quite well.
> >>>>>>>>>>>
> >>>>>>>>>>> [mishima@manage testbed2]$ qsub -I -l nodes=1:ppn=32
> >>>>>>>>>>> qsub: waiting for job 8116.manage.cluster to start
> >>>>>>>>>>> qsub: job 8116.manage.cluster ready
> >>>>>>>>>>>
> >>>>>>>>>>> [mishima@node03 ~]$ cd ~/Ducom/testbed2
> >>>>>>>>>>> [mishima@node03 testbed2]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 -map-by socket mPre
> >>>>>>>>>>> --------------------------------------------------------------------------
> >>>>>>>>>>> A request was made to bind to that would result in binding more
> >>>>>>>>>>> processes than cpus on a resource:
> >>>>>>>>>>>
> >>>>>>>>>>> Bind to: CORE
> >>>>>>>>>>> Node: node03
> >>>>>>>>>>> #processes: 2
> >>>>>>>>>>> #cpus: 1
> >>>>>>>>>>>
> >>>>>>>>>>> You can override this protection by adding the "overload-allowed"
> >>>>>>>>>>> option to your binding directive.
> >>>>>>>>>>> > >>>>>>>>> > >>>>>>> > >>>>> > >>> > > -------------------------------------------------------------------------- > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> [mishima@node03 testbed2]$ mpirun -np 8 -report-bindings > >>>>>>> -cpus-per-proc > >>>>>>>>> 4 > >>>>>>>>>>> mPre > >>>>>>>>>>> [node03.cluster:18128] MCW rank 2 bound to socket 1[core 8 [hwt > >>> 0]], > >>>>>>>>> socket > >>>>>>>>>>> 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], s > >>>>>>>>>>> ocket 1[core 11[hwt 0]]: > >>>>>>>>>>> > >>>>> [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.] > >>>>>>>>>>> [node03.cluster:18128] MCW rank 3 bound to socket 1[core 12 [hwt > >>>>> 0]], > >>>>>>>>> socket > >>>>>>>>>>> 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], > >>>>>>>>>>> socket 1[core 15[hwt 0]]: > >>>>>>>>>>> > >>>>> [./././././././.][././././B/B/B/B][./././././././.][./././././././.] > >>>>>>>>>>> [node03.cluster:18128] MCW rank 4 bound to socket 2[core 16 [hwt > >>>>> 0]], > >>>>>>>>> socket > >>>>>>>>>>> 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], > >>>>>>>>>>> socket 2[core 19[hwt 0]]: > >>>>>>>>>>> > >>>>> [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.] > >>>>>>>>>>> [node03.cluster:18128] MCW rank 5 bound to socket 2[core 20 [hwt > >>>>> 0]], > >>>>>>>>> socket > >>>>>>>>>>> 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], > >>>>>>>>>>> socket 2[core 23[hwt 0]]: > >>>>>>>>>>> > >>>>> [./././././././.][./././././././.][././././B/B/B/B][./././././././.] > >>>>>>>>>>> [node03.cluster:18128] MCW rank 6 bound to socket 3[core 24 [hwt > >>>>> 0]], > >>>>>>>>> socket > >>>>>>>>>>> 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], > >>>>>>>>>>> socket 3[core 27[hwt 0]]: > >>>>>>>>>>> > >>>>> [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.] 
> >>>>>>>>>>> [node03.cluster:18128] MCW rank 7 bound to socket 3[core 28 [hwt > >>>>> 0]], > >>>>>>>>> socket > >>>>>>>>>>> 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], > >>>>>>>>>>> socket 3[core 31[hwt 0]]: > >>>>>>>>>>> > >>>>> [./././././././.][./././././././.][./././././././.][././././B/B/B/B] > >>>>>>>>>>> [node03.cluster:18128] MCW rank 0 bound to socket 0[core 0 [hwt > >>> 0]], > >>>>>>>>> socket > >>>>>>>>>>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so > >>>>>>>>>>> cket 0[core 3[hwt 0]]: > >>>>>>>>>>> > >>>>> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.] > >>>>>>>>>>> [node03.cluster:18128] MCW rank 1 bound to socket 0[core 4 [hwt > >>> 0]], > >>>>>>>>> socket > >>>>>>>>>>> 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], so > >>>>>>>>>>> cket 0[core 7[hwt 0]]: > >>>>>>>>>>> > >>>>> [././././B/B/B/B][./././././././.][./././././././.][./././././././.] > >>>>>>>>>>> > >>>>>>>>>>> Regards, > >>>>>>>>>>> Tetsuya Mishima > >>>>>>>>>>> > >>>>>>>>>>>> Fixed and scheduled to move to 1.7.4. Thanks again! > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> On Nov 17, 2013, at 6:11 PM, Ralph Castain <r...@open-mpi.org> > >>>>> wrote: > >>>>>>>>>>>> > >>>>>>>>>>>> Thanks! That's precisely where I was going to look when I had > >>>>>>> time :-) > >>>>>>>>>>>> > >>>>>>>>>>>> I'll update tomorrow. > >>>>>>>>>>>> Ralph > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> On Sun, Nov 17, 2013 at 7:01 PM, > >>>>> <tmish...@jcity.maeda.co.jp>wrote: > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> Hi Ralph, > >>>>>>>>>>>> > >>>>>>>>>>>> This is the continuous story of "Segmentation fault in > > oob_tcp.c > >>>>> of > >>>>>>>>>>>> openmpi-1.7.4a1r29646". > >>>>>>>>>>>> > >>>>>>>>>>>> I found the cause. > >>>>>>>>>>>> > >>>>>>>>>>>> Firstly, I noticed that your hostfile can work and mine can > > not. 
> >>>>>>>>>>>>
> >>>>>>>>>>>> Your host file:
> >>>>>>>>>>>> cat hosts
> >>>>>>>>>>>> bend001 slots=12
> >>>>>>>>>>>>
> >>>>>>>>>>>> My host file:
> >>>>>>>>>>>> cat hosts
> >>>>>>>>>>>> node08
> >>>>>>>>>>>> node08
> >>>>>>>>>>>> ...(total 8 lines)
> >>>>>>>>>>>>
> >>>>>>>>>>>> I modified my script file to add "slots=1" to each line of my hostfile
> >>>>>>>>>>>> just before launching mpirun. Then it worked.
> >>>>>>>>>>>>
> >>>>>>>>>>>> My host file (modified):
> >>>>>>>>>>>> cat hosts
> >>>>>>>>>>>> node08 slots=1
> >>>>>>>>>>>> node08 slots=1
> >>>>>>>>>>>> ...(total 8 lines)
> >>>>>>>>>>>>
> >>>>>>>>>>>> Secondly, I confirmed that there's a slight difference between
> >>>>>>>>>>>> orte/util/hostfile/hostfile.c of 1.7.3 and that of 1.7.4a1r29646.
> >>>>>>>>>>>>
> >>>>>>>>>>>> $ diff hostfile.c.org ../../../../openmpi-1.7.3/orte/util/hostfile/hostfile.c
> >>>>>>>>>>>> 394,401c394,399
> >>>>>>>>>>>> < if (got_count) {
> >>>>>>>>>>>> <     node->slots_given = true;
> >>>>>>>>>>>> < } else if (got_max) {
> >>>>>>>>>>>> <     node->slots = node->slots_max;
> >>>>>>>>>>>> <     node->slots_given = true;
> >>>>>>>>>>>> < } else {
> >>>>>>>>>>>> <     /* should be set by obj_new, but just to be clear */
> >>>>>>>>>>>> <     node->slots_given = false;
> >>>>>>>>>>>> ---
> >>>>>>>>>>>> > if (!got_count) {
> >>>>>>>>>>>> >     if (got_max) {
> >>>>>>>>>>>> >         node->slots = node->slots_max;
> >>>>>>>>>>>> >     } else {
> >>>>>>>>>>>> >         ++node->slots;
> >>>>>>>>>>>> >     }
> >>>>>>>>>>>> ....
> >>>>>>>>>>>>
> >>>>>>>>>>>> Finally, I added the line 402 below just as a tentative trial.
> >>>>>>>>>>>> Then, it worked.
> >>>>>>>>>>>>
> >>>>>>>>>>>> cat -n orte/util/hostfile/hostfile.c:
> >>>>>>>>>>>> ...
> >>>>>>>>>>>> 394 if (got_count) {
> >>>>>>>>>>>> 395     node->slots_given = true;
> >>>>>>>>>>>> 396 } else if (got_max) {
> >>>>>>>>>>>> 397     node->slots = node->slots_max;
> >>>>>>>>>>>> 398     node->slots_given = true;
> >>>>>>>>>>>> 399 } else {
> >>>>>>>>>>>> 400     /* should be set by obj_new, but just to be clear */
> >>>>>>>>>>>> 401     node->slots_given = false;
> >>>>>>>>>>>> 402     ++node->slots; /* added by tmishima */
> >>>>>>>>>>>> 403 }
> >>>>>>>>>>>> ...
> >>>>>>>>>>>>
> >>>>>>>>>>>> Please fix the problem properly, because it's just based on my
> >>>>>>>>>>>> random guess. It's related to the treatment of hostfile where slots
> >>>>>>>>>>>> information is not given.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Regards,
> >>>>>>>>>>>> Tetsuya Mishima
> >>>>>>>>>>>>
> >>>>>>>>>>>> _______________________________________________
> >>>>>>>>>>>> users mailing list
> >>>>>>>>>>>> us...@open-mpi.org
> >>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
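
For readers following the slot arithmetic at the top of this thread, here is a minimal sketch of the check Ralph describes: count one slot per line of $PBS_NODEFILE, then compare that with the cpus requested per node (2 procs on each node at 4 cpus per proc). This is only an illustration of the arithmetic, not Open MPI's actual mapping code, and the function names slots_per_node and cpus_short are hypothetical.

```python
from collections import Counter


def slots_per_node(nodefile_lines):
    """Count one slot per hostname line, as in a Torque $PBS_NODEFILE."""
    return Counter(line.strip() for line in nodefile_lines if line.strip())


def cpus_short(slots, procs_per_node, cpus_per_proc):
    """Return {node: shortfall} for nodes with fewer slots than requested cpus.

    Assumption for illustration: each node must hold
    procs_per_node * cpus_per_proc cpus, and one slot counts as one cpu.
    """
    needed = procs_per_node * cpus_per_proc
    return {node: needed - n for node, n in slots.items() if n < needed}


# Node file as first reported: 8 lines for node11, only 7 for node12.
first_report = ["node11"] * 8 + ["node12"] * 7
# Node file from the retest: 8 lines for each node.
retest = ["node11"] * 8 + ["node12"] * 8

print(cpus_short(slots_per_node(first_report), 2, 4))  # {'node12': 1}
print(cpus_short(slots_per_node(retest), 2, 4))        # {}
```

Under this simple count the retest allocation has enough slots on both nodes, which is why the error reported with the full 16-line node file is not explained by the slot count alone.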