Very strange - I can't seem to replicate it. Is there any chance that you have < 8 actual cores on node12?
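If you get a chance, could you check what the topology actually looks like on that node? Running hwloc's lstopo there (assuming it is installed - OMPI uses hwloc internally for binding) would show the socket/core layout the mapper sees:

[mishima@node12 ~]$ lstopo

As a cross-check, the overload protection can be relaxed by adding the "overload-allowed" qualifier to the binding directive (e.g., "-bind-to core:overload-allowed"); if the job then runs but oversubscribes a core on node12, that would confirm the mapper believes node12 has fewer cpus than you expect.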
On Dec 18, 2013, at 4:53 PM, tmish...@jcity.maeda.co.jp wrote:

> Hi Ralph, sorry for the confusion.
>
> At that time, I cut and pasted the output of "cat $PBS_NODEFILE".
> I must have failed to paste the last line by mistake.
>
> I retried the test, and below is exactly what I got.
>
> [mishima@manage ~]$ qsub -I -l nodes=node11:ppn=8+node12:ppn=8
> qsub: waiting for job 8338.manage.cluster to start
> qsub: job 8338.manage.cluster ready
>
> [mishima@node11 ~]$ cat $PBS_NODEFILE
> node11
> node11
> node11
> node11
> node11
> node11
> node11
> node11
> node12
> node12
> node12
> node12
> node12
> node12
> node12
> node12
> [mishima@node11 ~]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings myprog
> --------------------------------------------------------------------------
> A request was made to bind to that would result in binding more
> processes than cpus on a resource:
>
> Bind to: CORE
> Node: node12
> #processes: 2
> #cpus: 1
>
> You can override this protection by adding the "overload-allowed"
> option to your binding directive.
> --------------------------------------------------------------------------
>
> Regards,
>
> Tetsuya Mishima
>
>> I removed the debug in #2 - thanks for reporting it
>>
>> For #1, it actually looks to me like this is correct. If you look at your
>> allocation, there are only 7 slots being allocated on node12, yet you have
>> asked for 8 cpus to be assigned (2 procs with 4 cpus/proc). So the warning
>> is in fact correct
>>
>> On Dec 18, 2013, at 4:04 PM, tmish...@jcity.maeda.co.jp wrote:
>>
>>> Hi Ralph, I found that openmpi-1.7.4rc1 was already uploaded, so I'd like
>>> to report 3 issues, mainly regarding -cpus-per-proc.
>>>
>>> 1) When I use 2 nodes (node11 and node12), which have 8 cores each
>>> (= 2 sockets x 4 cores/socket), it starts to produce the error again
>>> as shown below. At least openmpi-1.7.4a1r29646 worked well.
>>>
>>> [mishima@manage ~]$ qsub -I -l nodes=2:ppn=8
>>> qsub: waiting for job 8336.manage.cluster to start
>>> qsub: job 8336.manage.cluster ready
>>>
>>> [mishima@node11 ~]$ cd ~/Desktop/openmpi-1.7/demos/
>>> [mishima@node11 demos]$ cat $PBS_NODEFILE
>>> node11
>>> node11
>>> node11
>>> node11
>>> node11
>>> node11
>>> node11
>>> node11
>>> node12
>>> node12
>>> node12
>>> node12
>>> node12
>>> node12
>>> node12
>>> [mishima@node11 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings myprog
>>> --------------------------------------------------------------------------
>>> A request was made to bind to that would result in binding more
>>> processes than cpus on a resource:
>>>
>>> Bind to: CORE
>>> Node: node12
>>> #processes: 2
>>> #cpus: 1
>>>
>>> You can override this protection by adding the "overload-allowed"
>>> option to your binding directive.
>>> --------------------------------------------------------------------------
>>>
>>> Of course it works well using only one node.
>>>
>>> [mishima@node11 demos]$ mpirun -np 2 -cpus-per-proc 4 -report-bindings myprog
>>> [node11.cluster:26238] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
>>> [node11.cluster:26238] MCW rank 1 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
>>> Hello world from process 1 of 2
>>> Hello world from process 0 of 2
>>>
>>> 2) Adding "-bind-to numa", it works, but the message "bind:upward target
>>> NUMANode type NUMANode" appears. As far as I remember, I didn't see this
>>> kind of message before.
>>>
>>> [mishima@node11 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings -bind-to numa myprog
>>> [node11.cluster:26260] [[8844,0],0] bind:upward target NUMANode type NUMANode
>>> [node11.cluster:26260] [[8844,0],0] bind:upward target NUMANode type NUMANode
>>> [node11.cluster:26260] [[8844,0],0] bind:upward target NUMANode type NUMANode
>>> [node11.cluster:26260] [[8844,0],0] bind:upward target NUMANode type NUMANode
>>> [node11.cluster:26260] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
>>> [node11.cluster:26260] MCW rank 1 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
>>> [node12.cluster:23607] MCW rank 3 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
>>> [node12.cluster:23607] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
>>> Hello world from process 1 of 4
>>> Hello world from process 0 of 4
>>> Hello world from process 3 of 4
>>> Hello world from process 2 of 4
>>>
>>> 3) I use the PGI compiler. It cannot accept the compiler switch
>>> "-Wno-variadic-macros", which is included in the configure script:
>>>
>>> btl_usnic_CFLAGS="-Wno-variadic-macros"
>>>
>>> I removed this switch, and then I could continue to build 1.7.4rc1.
>>>
>>> Regards,
>>> Tetsuya Mishima
>>>
>>>> Hmmm...okay, I understand the scenario. Must be something in the algo
>>>> when it only has one node, so it shouldn't be too hard to track down.
>>>>
>>>> I'm off on travel for a few days, but will return to this when I get back.
>>>>
>>>> Sorry for the delay - will try to look at this while I'm gone, but can't
>>>> promise anything :-(
>>>>
>>>> On Dec 10, 2013, at 6:58 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>
>>>>> Hi Ralph, sorry for the confusion.
>>>>>
>>>>> We usually log on to "manage", which is our control node.
>>>>> From manage, we submit jobs or enter a remote node such as
>>>>> node03 via Torque interactive mode (qsub -I).
>>>>>
>>>>> This time, instead of using Torque, I just rsh'd to node03 from manage
>>>>> and ran myprog on that node. I hope that clarifies what I did.
>>>>>
>>>>> Now, I retried with "-host node03", which still causes the problem
>>>>> (I confirmed that a local run on manage causes the same problem too):
>>>>>
>>>>> [mishima@manage ~]$ rsh node03
>>>>> Last login: Wed Dec 11 11:38:57 from manage
>>>>> [mishima@node03 ~]$ cd ~/Desktop/openmpi-1.7/demos/
>>>>> [mishima@node03 demos]$
>>>>> [mishima@node03 demos]$ mpirun -np 8 -host node03 -report-bindings -cpus-per-proc 4 -map-by socket myprog
>>>>> --------------------------------------------------------------------------
>>>>> A request was made to bind to that would result in binding more
>>>>> processes than cpus on a resource:
>>>>>
>>>>> Bind to: CORE
>>>>> Node: node03
>>>>> #processes: 2
>>>>> #cpus: 1
>>>>>
>>>>> You can override this protection by adding the "overload-allowed"
>>>>> option to your binding directive.
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> It's strange, but I have to report that "-map-by socket:span" worked well.
>>>>>
>>>>> [mishima@node03 demos]$ mpirun -np 8 -host node03 -report-bindings -cpus-per-proc 4 -map-by socket:span myprog
>>>>> [node03.cluster:11871] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
>>>>> [node03.cluster:11871] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
>>>>> [node03.cluster:11871] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]: [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
>>>>> [node03.cluster:11871] MCW rank 5 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]: [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
>>>>> [node03.cluster:11871] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]: [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
>>>>> [node03.cluster:11871] MCW rank 7 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket 3[core 31[hwt 0]]: [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
>>>>> [node03.cluster:11871] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
>>>>> [node03.cluster:11871] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
>>>>> Hello world from process 2 of 8
>>>>> Hello world from process 6 of 8
>>>>> Hello world from process 3 of 8
>>>>> Hello world from process 7 of 8
>>>>> Hello world from process 1 of 8
>>>>> Hello world from process 5 of 8
>>>>> Hello world from process 0 of 8
>>>>> Hello world from process 4 of 8
>>>>>
>>>>> Regards,
>>>>> Tetsuya Mishima
>>>>>
>>>>>> On Dec 10, 2013, at 6:05 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>>>
>>>>>>> Hi Ralph,
>>>>>>>
>>>>>>> I tried again with -cpus-per-proc 2 as shown below.
>>>>>>> Here, I found that "-map-by socket:span" worked well.
>>>>>>>
>>>>>>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 2 -map-by socket:span myprog
>>>>>>> [node03.cluster:10879] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]]: [./././././././.][B/B/./././././.][./././././././.][./././././././.]
>>>>>>> [node03.cluster:10879] MCW rank 3 bound to socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][././B/B/./././.][./././././././.][./././././././.]
>>>>>>> [node03.cluster:10879] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]]: [./././././././.][./././././././.][B/B/./././././.][./././././././.]
>>>>>>> [node03.cluster:10879] MCW rank 5 bound to socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]: [./././././././.][./././././././.][././B/B/./././.][./././././././.]
>>>>>>> [node03.cluster:10879] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]]: [./././././././.][./././././././.][./././././././.][B/B/./././././.]
>>>>>>> [node03.cluster:10879] MCW rank 7 bound to socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]: [./././././././.][./././././././.][./././././././.][././B/B/./././.]
>>>>>>> [node03.cluster:10879] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././././.][./././././././.][./././././././.][./././././././.]
>>>>>>> [node03.cluster:10879] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B/./././.][./././././././.][./././././././.][./././././././.]
>>>>>>> Hello world from process 1 of 8
>>>>>>> Hello world from process 0 of 8
>>>>>>> Hello world from process 4 of 8
>>>>>>> Hello world from process 2 of 8
>>>>>>> Hello world from process 7 of 8
>>>>>>> Hello world from process 6 of 8
>>>>>>> Hello world from process 5 of 8
>>>>>>> Hello world from process 3 of 8
>>>>>>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 2 -map-by socket myprog
>>>>>>> [node03.cluster:10921] MCW rank 2 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [././././B/B/./.][./././././././.][./././././././.][./././././././.]
>>>>>>> [node03.cluster:10921] MCW rank 3 bound to socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././././B/B][./././././././.][./././././././.][./././././././.]
>>>>>>> [node03.cluster:10921] MCW rank 4 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]]: [./././././././.][B/B/./././././.][./././././././.][./././././././.]
>>>>>>> [node03.cluster:10921] MCW rank 5 bound to socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][././B/B/./././.][./././././././.][./././././././.]
>>>>>>> [node03.cluster:10921] MCW rank 6 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]]: [./././././././.][././././B/B/./.][./././././././.][./././././././.]
>>>>>>> [node03.cluster:10921] MCW rank 7 bound to socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././././B/B][./././././././.][./././././././.]
>>>>>>> [node03.cluster:10921] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././././.][./././././././.][./././././././.][./././././././.]
>>>>>>> [node03.cluster:10921] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B/./././.][./././././././.][./././././././.][./././././././.]
>>>>>>> Hello world from process 5 of 8
>>>>>>> Hello world from process 1 of 8
>>>>>>> Hello world from process 6 of 8
>>>>>>> Hello world from process 4 of 8
>>>>>>> Hello world from process 2 of 8
>>>>>>> Hello world from process 0 of 8
>>>>>>> Hello world from process 7 of 8
>>>>>>> Hello world from process 3 of 8
>>>>>>>
>>>>>>> "-np 8" and "-cpus-per-proc 4" just filled all the sockets.
>>>>>>> In this case, I guess "-map-by socket:span" and "-map-by socket" have
>>>>>>> the same meaning. Therefore, there's no problem there. Sorry for
>>>>>>> disturbing you.
>>>>>>
>>>>>> No problem - glad you could clear that up :-)
>>>>>>
>>>>>>> By the way, through this test, I found another problem.
>>>>>>> Without the Torque manager, just using rsh, it causes the same error,
>>>>>>> as below:
>>>>>>>
>>>>>>> [mishima@manage openmpi-1.7]$ rsh node03
>>>>>>> Last login: Wed Dec 11 09:42:02 from manage
>>>>>>> [mishima@node03 ~]$ cd ~/Desktop/openmpi-1.7/demos/
>>>>>>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 -map-by socket myprog
>>>>>>
>>>>>> I don't understand the difference here - you are simply starting it from
>>>>>> a different node? It looks like everything is expected to run local to
>>>>>> mpirun, yes? So there is no rsh actually involved here.
>>>>>> Are you still running in an allocation?
>>>>>>
>>>>>> If you run this with "-host node03" on the cmd line, do you see the same
>>>>>> problem?
>>>>>>
>>>>>>> --------------------------------------------------------------------------
>>>>>>> A request was made to bind to that would result in binding more
>>>>>>> processes than cpus on a resource:
>>>>>>>
>>>>>>> Bind to: CORE
>>>>>>> Node: node03
>>>>>>> #processes: 2
>>>>>>> #cpus: 1
>>>>>>>
>>>>>>> You can override this protection by adding the "overload-allowed"
>>>>>>> option to your binding directive.
>>>>>>> --------------------------------------------------------------------------
>>>>>>> [mishima@node03 demos]$
>>>>>>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 myprog
>>>>>>> [node03.cluster:11036] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
>>>>>>> [node03.cluster:11036] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
>>>>>>> [node03.cluster:11036] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]: [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
>>>>>>> [node03.cluster:11036] MCW rank 5 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]: [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
>>>>>>> [node03.cluster:11036] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]: [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
>>>>>>> [node03.cluster:11036] MCW rank 7 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket 3[core 31[hwt 0]]: [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
>>>>>>> [node03.cluster:11036] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
>>>>>>> [node03.cluster:11036] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
>>>>>>> Hello world from process 4 of 8
>>>>>>> Hello world from process 2 of 8
>>>>>>> Hello world from process 6 of 8
>>>>>>> Hello world from process 5 of 8
>>>>>>> Hello world from process 3 of 8
>>>>>>> Hello world from process 7 of 8
>>>>>>> Hello world from process 0 of 8
>>>>>>> Hello world from process 1 of 8
>>>>>>>
>>>>>>> Regards,
>>>>>>> Tetsuya Mishima
>>>>>>>
>>>>>>>> Hmmm...that's strange. I only have 2 sockets on my system, but let me
>>>>>>>> poke around a bit and see what might be happening.
>>>>>>>>
>>>>>>>> On Dec 10, 2013, at 4:47 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>>>>>
>>>>>>>>> Hi Ralph,
>>>>>>>>>
>>>>>>>>> Thanks. I didn't know the meaning of "socket:span".
>>>>>>>>>
>>>>>>>>> But it still causes the problem; it seems socket:span doesn't work.
>>>>>>>>>
>>>>>>>>> [mishima@manage demos]$ qsub -I -l nodes=node03:ppn=32
>>>>>>>>> qsub: waiting for job 8265.manage.cluster to start
>>>>>>>>> qsub: job 8265.manage.cluster ready
>>>>>>>>>
>>>>>>>>> [mishima@node03 ~]$ cd ~/Desktop/openmpi-1.7/demos/
>>>>>>>>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 -map-by socket:span myprog
>>>>>>>>> [node03.cluster:10262] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
>>>>>>>>> [node03.cluster:10262] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
>>>>>>>>> [node03.cluster:10262] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]: [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
>>>>>>>>> [node03.cluster:10262] MCW rank 5 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]: [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
>>>>>>>>> [node03.cluster:10262] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]: [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
>>>>>>>>> [node03.cluster:10262] MCW rank 7 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket 3[core 31[hwt 0]]: [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
>>>>>>>>> [node03.cluster:10262] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
>>>>>>>>> [node03.cluster:10262] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
>>>>>>>>> Hello world from process 0 of 8
>>>>>>>>> Hello world from process 3 of 8
>>>>>>>>> Hello world from process 1 of 8
>>>>>>>>> Hello world from process 4 of 8
>>>>>>>>> Hello world from process 6 of 8
>>>>>>>>> Hello world from process 5 of 8
>>>>>>>>> Hello world from process 2 of 8
>>>>>>>>> Hello world from process 7 of 8
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Tetsuya Mishima
>>>>>>>>>
>>>>>>>>>> No, that is actually correct. We map a socket until full, then move
>>>>>>>>>> to the next. What you want is --map-by socket:span
>>>>>>>>>>
>>>>>>>>>> On Dec 10, 2013, at 3:42 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Ralph,
>>>>>>>>>>>
>>>>>>>>>>> I had time to try your patch yesterday using openmpi-1.7.4a1r29646.
>>>>>>>>>>> It stopped the error, but unfortunately "mapping by socket" itself
>>>>>>>>>>> didn't work well, as shown below:
>>>>>>>>>>>
>>>>>>>>>>> [mishima@manage demos]$ qsub -I -l nodes=1:ppn=32
>>>>>>>>>>> qsub: waiting for job 8260.manage.cluster to start
>>>>>>>>>>> qsub: job 8260.manage.cluster ready
>>>>>>>>>>>
>>>>>>>>>>> [mishima@node04 ~]$ cd ~/Desktop/openmpi-1.7/demos/
>>>>>>>>>>> [mishima@node04 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 -map-by socket myprog
>>>>>>>>>>> [node04.cluster:27489] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
>>>>>>>>>>> [node04.cluster:27489] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
>>>>>>>>>>> [node04.cluster:27489] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]: [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
>>>>>>>>>>> [node04.cluster:27489] MCW rank 5 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]: [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
>>>>>>>>>>> [node04.cluster:27489] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]: [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
>>>>>>>>>>> [node04.cluster:27489] MCW rank 7 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket 3[core 31[hwt 0]]: [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
>>>>>>>>>>> [node04.cluster:27489] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
>>>>>>>>>>> [node04.cluster:27489] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
>>>>>>>>>>> Hello world from process 2 of 8
>>>>>>>>>>> Hello world from process 1 of 8
>>>>>>>>>>> Hello world from process 3 of 8
>>>>>>>>>>> Hello world from process 0 of 8
>>>>>>>>>>> Hello world from process 6 of 8
>>>>>>>>>>> Hello world from process 5 of 8
>>>>>>>>>>> Hello world from process 4 of 8
>>>>>>>>>>> Hello world from process 7 of 8
>>>>>>>>>>>
>>>>>>>>>>> I think it should be like this:
>>>>>>>>>>>
>>>>>>>>>>> rank 00 [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
>>>>>>>>>>> rank 01 [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
>>>>>>>>>>> rank 02 [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
>>>>>>>>>>> ...
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Tetsuya Mishima
>>>>>>>>>>>
>>>>>>>>>>>> I fixed this under the trunk (was an issue regardless of RM) and
>>>>>>>>>>>> have scheduled it for 1.7.4.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks!
>>>>>>>>>>>> Ralph
>>>>>>>>>>>>
>>>>>>>>>>>> On Nov 25, 2013, at 4:22 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Ralph,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you very much for your quick response.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm afraid to say that I found one more issue...
>>>>>>>>>>>>>
>>>>>>>>>>>>> It's not so serious. Please check it when you have time.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The problem is cpus-per-proc with the -map-by option under the
>>>>>>>>>>>>> Torque manager. It doesn't work as shown below. I guess you can
>>>>>>>>>>>>> get the same behaviour under the Slurm manager.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Of course, if I remove the -map-by option, it works quite well.
>>>>>>>>>>>>>
>>>>>>>>>>>>> [mishima@manage testbed2]$ qsub -I -l nodes=1:ppn=32
>>>>>>>>>>>>> qsub: waiting for job 8116.manage.cluster to start
>>>>>>>>>>>>> qsub: job 8116.manage.cluster ready
>>>>>>>>>>>>>
>>>>>>>>>>>>> [mishima@node03 ~]$ cd ~/Ducom/testbed2
>>>>>>>>>>>>> [mishima@node03 testbed2]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 -map-by socket mPre
>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>> A request was made to bind to that would result in binding more
>>>>>>>>>>>>> processes than cpus on a resource:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Bind to: CORE
>>>>>>>>>>>>> Node: node03
>>>>>>>>>>>>> #processes: 2
>>>>>>>>>>>>> #cpus: 1
>>>>>>>>>>>>>
>>>>>>>>>>>>> You can override this protection by adding the "overload-allowed"
>>>>>>>>>>>>> option to your binding directive.
>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>>
>>>>>>>>>>>>> [mishima@node03 testbed2]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 mPre
>>>>>>>>>>>>> [node03.cluster:18128] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
>>>>>>>>>>>>> [node03.cluster:18128] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
>>>>>>>>>>>>> [node03.cluster:18128] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]: [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
>>>>>>>>>>>>> [node03.cluster:18128] MCW rank 5 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]: [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
>>>>>>>>>>>>> [node03.cluster:18128] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]: [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
>>>>>>>>>>>>> [node03.cluster:18128] MCW rank 7 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket 3[core 31[hwt 0]]: [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
>>>>>>>>>>>>> [node03.cluster:18128] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
>>>>>>>>>>>>> [node03.cluster:18128] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Tetsuya Mishima
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Fixed and scheduled to move to 1.7.4. Thanks again!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Nov 17, 2013, at 6:11 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks! That's precisely where I was going to look when I had time :-)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'll update tomorrow.
>>>>>>>>>>>>>> Ralph
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sun, Nov 17, 2013 at 7:01 PM, <tmish...@jcity.maeda.co.jp> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Ralph,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This is a continuation of "Segmentation fault in oob_tcp.c of
>>>>>>>>>>>>>> openmpi-1.7.4a1r29646".
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I found the cause.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> First, I noticed that your hostfile works and mine does not.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Your host file:
>>>>>>>>>>>>>> cat hosts
>>>>>>>>>>>>>> bend001 slots=12
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> My host file:
>>>>>>>>>>>>>> cat hosts
>>>>>>>>>>>>>> node08
>>>>>>>>>>>>>> node08
>>>>>>>>>>>>>> ...(total 8 lines)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I modified my script file to add "slots=1" to each line of my
>>>>>>>>>>>>>> hostfile just before launching mpirun. Then it worked.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> My host file (modified):
>>>>>>>>>>>>>> cat hosts
>>>>>>>>>>>>>> node08 slots=1
>>>>>>>>>>>>>> node08 slots=1
>>>>>>>>>>>>>> ...(total 8 lines)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Second, I confirmed that there's a slight difference between
>>>>>>>>>>>>>> orte/util/hostfile/hostfile.c of 1.7.3 and that of 1.7.4a1r29646.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> $ diff hostfile.c.org ../../../../openmpi-1.7.3/orte/util/hostfile/hostfile.c
>>>>>>>>>>>>>> 394,401c394,399
>>>>>>>>>>>>>> <     if (got_count) {
>>>>>>>>>>>>>> <         node->slots_given = true;
>>>>>>>>>>>>>> <     } else if (got_max) {
>>>>>>>>>>>>>> <         node->slots = node->slots_max;
>>>>>>>>>>>>>> <         node->slots_given = true;
>>>>>>>>>>>>>> <     } else {
>>>>>>>>>>>>>> <         /* should be set by obj_new, but just to be clear */
>>>>>>>>>>>>>> <         node->slots_given = false;
>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>> >     if (!got_count) {
>>>>>>>>>>>>>> >         if (got_max) {
>>>>>>>>>>>>>> >             node->slots = node->slots_max;
>>>>>>>>>>>>>> >         } else {
>>>>>>>>>>>>>> >             ++node->slots;
>>>>>>>>>>>>>> >         }
>>>>>>>>>>>>>> ....
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Finally, I added line 402 below just as a tentative trial.
>>>>>>>>>>>>>> Then, it worked.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> cat -n orte/util/hostfile/hostfile.c:
>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>> 394    if (got_count) {
>>>>>>>>>>>>>> 395        node->slots_given = true;
>>>>>>>>>>>>>> 396    } else if (got_max) {
>>>>>>>>>>>>>> 397        node->slots = node->slots_max;
>>>>>>>>>>>>>> 398        node->slots_given = true;
>>>>>>>>>>>>>> 399    } else {
>>>>>>>>>>>>>> 400        /* should be set by obj_new, but just to be clear */
>>>>>>>>>>>>>> 401        node->slots_given = false;
>>>>>>>>>>>>>> 402        ++node->slots;  /* added by tmishima */
>>>>>>>>>>>>>> 403    }
>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Please fix the problem properly, because my change is just based
>>>>>>>>>>>>>> on a random guess. It's related to the treatment of a hostfile
>>>>>>>>>>>>>> where slots information is not given.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>> Tetsuya Mishima
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users