Thank you, Ralph.

I just hope it helps you improve the quality of the openmpi-1.7 series.
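
Just for the record, here is a minimal sketch of the mappings we have been
comparing below, assuming node03's layout of 4 sockets with 8 cores each
(myprog is just our small hello-world test program):

  # 8 ranks x 4 cores each = 32 cores = the whole node, so "socket" and
  # "socket:span" should give the same layout here (they do under Torque;
  # outside Torque the plain "socket" form currently hits the overload
  # error reported below):
  mpirun -np 8 -host node03 -report-bindings -cpus-per-proc 4 -map-by socket myprog
  mpirun -np 8 -host node03 -report-bindings -cpus-per-proc 4 -map-by socket:span myprog

  # With -cpus-per-proc 2 the two policies differ: "socket" fills each
  # socket in turn, while "socket:span" spreads the ranks over all four
  # sockets before reusing one.
  mpirun -np 8 -report-bindings -cpus-per-proc 2 -map-by socket myprog
  mpirun -np 8 -report-bindings -cpus-per-proc 2 -map-by socket:span myprog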

Tetsuya Mishima


> Hmmm...okay, I understand the scenario. Must be something in the algo when it only has one node, so it shouldn't be too hard to track down.
>
> I'm off on travel for a few days, but will return to this when I get back.
>
> Sorry for the delay - will try to look at this while I'm gone, but can't promise anything :-(
>
>
> On Dec 10, 2013, at 6:58 PM, tmish...@jcity.maeda.co.jp wrote:
>
> >
> >
> > Hi Ralph, sorry for the confusion.
> >
> > We usually log on to "manage", which is our control node.
> > From manage, we submit jobs or enter a remote node such as
> > node03 via Torque interactive mode (qsub -I).
> >
> > This time, instead of using Torque, I just did rsh to node03 from manage
> > and ran myprog on the node. I hope that makes clear what I did.
> >
> > Now, I retried with "-host node03", which still causes the problem
> > (I confirmed that a local run on manage caused the same problem too):
> >
> > [mishima@manage ~]$ rsh node03
> > Last login: Wed Dec 11 11:38:57 from manage
> > [mishima@node03 ~]$ cd ~/Desktop/openmpi-1.7/demos/
> > [mishima@node03 demos]$
> > [mishima@node03 demos]$ mpirun -np 8 -host node03 -report-bindings -cpus-per-proc 4 -map-by socket myprog
> >
> > --------------------------------------------------------------------------
> > A request was made to bind to that would result in binding more
> > processes than cpus on a resource:
> >
> >   Bind to:         CORE
> >   Node:            node03
> >   #processes:  2
> >   #cpus:          1
> >
> > You can override this protection by adding the "overload-allowed"
> > option to your binding directive.
> >
> > --------------------------------------------------------------------------
> >
> > It's strange, but I have to report that "-map-by socket:span" worked well.
> >
> > [mishima@node03 demos]$ mpirun -np 8 -host node03 -report-bindings -cpus-per-proc 4 -map-by socket:span myprog
> > [node03.cluster:11871] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
> > [node03.cluster:11871] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
> > [node03.cluster:11871] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]: [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
> > [node03.cluster:11871] MCW rank 5 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]: [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
> > [node03.cluster:11871] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]: [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
> > [node03.cluster:11871] MCW rank 7 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket 3[core 31[hwt 0]]: [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
> > [node03.cluster:11871] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
> > [node03.cluster:11871] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
> > Hello world from process 2 of 8
> > Hello world from process 6 of 8
> > Hello world from process 3 of 8
> > Hello world from process 7 of 8
> > Hello world from process 1 of 8
> > Hello world from process 5 of 8
> > Hello world from process 0 of 8
> > Hello world from process 4 of 8
> >
> > Regards,
> > Tetsuya Mishima
> >
> >
> >> On Dec 10, 2013, at 6:05 PM, tmish...@jcity.maeda.co.jp wrote:
> >>
> >>>
> >>>
> >>> Hi Ralph,
> >>>
> >>> I tried again with -cpus-per-proc 2 as shown below.
> >>> Here, I found that "-map-by socket:span" worked well.
> >>>
> >>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 2 -map-by socket:span myprog
> >>> [node03.cluster:10879] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]]: [./././././././.][B/B/./././././.][./././././././.][./././././././.]
> >>> [node03.cluster:10879] MCW rank 3 bound to socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][././B/B/./././.][./././././././.][./././././././.]
> >>> [node03.cluster:10879] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]]: [./././././././.][./././././././.][B/B/./././././.][./././././././.]
> >>> [node03.cluster:10879] MCW rank 5 bound to socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]: [./././././././.][./././././././.][././B/B/./././.][./././././././.]
> >>> [node03.cluster:10879] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]]: [./././././././.][./././././././.][./././././././.][B/B/./././././.]
> >>> [node03.cluster:10879] MCW rank 7 bound to socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]: [./././././././.][./././././././.][./././././././.][././B/B/./././.]
> >>> [node03.cluster:10879] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././././.][./././././././.][./././././././.][./././././././.]
> >>> [node03.cluster:10879] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B/./././.][./././././././.][./././././././.][./././././././.]
> >>> Hello world from process 1 of 8
> >>> Hello world from process 0 of 8
> >>> Hello world from process 4 of 8
> >>> Hello world from process 2 of 8
> >>> Hello world from process 7 of 8
> >>> Hello world from process 6 of 8
> >>> Hello world from process 5 of 8
> >>> Hello world from process 3 of 8
> >>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 2 -map-by socket myprog
> >>> [node03.cluster:10921] MCW rank 2 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [././././B/B/./.][./././././././.][./././././././.][./././././././.]
> >>> [node03.cluster:10921] MCW rank 3 bound to socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././././B/B][./././././././.][./././././././.][./././././././.]
> >>> [node03.cluster:10921] MCW rank 4 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]]: [./././././././.][B/B/./././././.][./././././././.][./././././././.]
> >>> [node03.cluster:10921] MCW rank 5 bound to socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][././B/B/./././.][./././././././.][./././././././.]
> >>> [node03.cluster:10921] MCW rank 6 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]]: [./././././././.][././././B/B/./.][./././././././.][./././././././.]
> >>> [node03.cluster:10921] MCW rank 7 bound to socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././././B/B][./././././././.][./././././././.]
> >>> [node03.cluster:10921] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././././.][./././././././.][./././././././.][./././././././.]
> >>> [node03.cluster:10921] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B/./././.][./././././././.][./././././././.][./././././././.]
> >>> Hello world from process 5 of 8
> >>> Hello world from process 1 of 8
> >>> Hello world from process 6 of 8
> >>> Hello world from process 4 of 8
> >>> Hello world from process 2 of 8
> >>> Hello world from process 0 of 8
> >>> Hello world from process 7 of 8
> >>> Hello world from process 3 of 8
> >>>
> >>> "-np 8" with "-cpus-per-proc 4" just fills all the sockets.
> >>> In that case, I guess "-map-by socket:span" and "-map-by socket" have the
> >>> same meaning. Therefore, there's no problem there. Sorry for disturbing you.
> >>
> >> No problem - glad you could clear that up :-)
> >>
> >>>
> >>> By the way, through this test I found another problem.
> >>> Without the Torque manager, just using rsh, it causes the same error as
> >>> shown below:
> >>>
> >>> [mishima@manage openmpi-1.7]$ rsh node03
> >>> Last login: Wed Dec 11 09:42:02 from manage
> >>> [mishima@node03 ~]$ cd ~/Desktop/openmpi-1.7/demos/
> >>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 -map-by socket myprog
> >>
> >> I don't understand the difference here - you are simply starting it from
> >> a different node? It looks like everything is expected to run local to
> >> mpirun, yes? So there is no rsh actually involved here.
> >> Are you still running in an allocation?
> >>
> >> If you run this with "-host node03" on the cmd line, do you see the same
> >> problem?
> >>
> >>
> >>>
> >>> --------------------------------------------------------------------------
> >>> A request was made to bind to that would result in binding more
> >>> processes than cpus on a resource:
> >>>
> >>>  Bind to:         CORE
> >>>  Node:            node03
> >>>  #processes:  2
> >>>  #cpus:          1
> >>>
> >>> You can override this protection by adding the "overload-allowed"
> >>> option to your binding directive.
> >>>
> >>> --------------------------------------------------------------------------
> >>> [mishima@node03 demos]$
> >>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 myprog
> >>> [node03.cluster:11036] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
> >>> [node03.cluster:11036] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
> >>> [node03.cluster:11036] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]: [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
> >>> [node03.cluster:11036] MCW rank 5 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]: [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
> >>> [node03.cluster:11036] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]: [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
> >>> [node03.cluster:11036] MCW rank 7 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket 3[core 31[hwt 0]]: [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
> >>> [node03.cluster:11036] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
> >>> [node03.cluster:11036] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
> >>> Hello world from process 4 of 8
> >>> Hello world from process 2 of 8
> >>> Hello world from process 6 of 8
> >>> Hello world from process 5 of 8
> >>> Hello world from process 3 of 8
> >>> Hello world from process 7 of 8
> >>> Hello world from process 0 of 8
> >>> Hello world from process 1 of 8
> >>>
> >>> Regards,
> >>> Tetsuya Mishima
> >>>
> >>>> Hmmm...that's strange. I only have 2 sockets on my system, but let me
> >>>> poke around a bit and see what might be happening.
> >>>>
> >>>> On Dec 10, 2013, at 4:47 PM, tmish...@jcity.maeda.co.jp wrote:
> >>>>
> >>>>>
> >>>>>
> >>>>> Hi Ralph,
> >>>>>
> >>>>> Thanks. I didn't know the meaning of "socket:span".
> >>>>>
> >>>>> But it still causes the problem; it seems that socket:span doesn't work.
> >>>>>
> >>>>> [mishima@manage demos]$ qsub -I -l nodes=node03:ppn=32
> >>>>> qsub: waiting for job 8265.manage.cluster to start
> >>>>> qsub: job 8265.manage.cluster ready
> >>>>>
> >>>>> [mishima@node03 ~]$ cd ~/Desktop/openmpi-1.7/demos/
> >>>>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 -map-by socket:span myprog
> >>>>> [node03.cluster:10262] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
> >>>>> [node03.cluster:10262] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
> >>>>> [node03.cluster:10262] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]: [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
> >>>>> [node03.cluster:10262] MCW rank 5 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]: [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
> >>>>> [node03.cluster:10262] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]: [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
> >>>>> [node03.cluster:10262] MCW rank 7 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket 3[core 31[hwt 0]]: [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
> >>>>> [node03.cluster:10262] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
> >>>>> [node03.cluster:10262] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
> >>>>> Hello world from process 0 of 8
> >>>>> Hello world from process 3 of 8
> >>>>> Hello world from process 1 of 8
> >>>>> Hello world from process 4 of 8
> >>>>> Hello world from process 6 of 8
> >>>>> Hello world from process 5 of 8
> >>>>> Hello world from process 2 of 8
> >>>>> Hello world from process 7 of 8
> >>>>>
> >>>>> Regards,
> >>>>> Tetsuya Mishima
> >>>>>
> >>>>>> No, that is actually correct. We map a socket until full, then move to
> >>>>>> the next. What you want is --map-by socket:span
> >>>>>>
> >>>>>> On Dec 10, 2013, at 3:42 PM, tmish...@jcity.maeda.co.jp wrote:
> >>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> Hi Ralph,
> >>>>>>>
> >>>>>>> I had time to try your patch yesterday using openmpi-1.7.4a1r29646.
> >>>>>>> It stopped the error, but unfortunately "mapping by socket" itself didn't
> >>>>>>> work well, as shown below:
> >>>>>>>
> >>>>>>> [mishima@manage demos]$ qsub -I -l nodes=1:ppn=32
> >>>>>>> qsub: waiting for job 8260.manage.cluster to start
> >>>>>>> qsub: job 8260.manage.cluster ready
> >>>>>>>
> >>>>>>> [mishima@node04 ~]$ cd ~/Desktop/openmpi-1.7/demos/
> >>>>>>> [mishima@node04 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 -map-by socket myprog
> >>>>>>> [node04.cluster:27489] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
> >>>>>>> [node04.cluster:27489] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
> >>>>>>> [node04.cluster:27489] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]: [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
> >>>>>>> [node04.cluster:27489] MCW rank 5 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]: [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
> >>>>>>> [node04.cluster:27489] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]: [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
> >>>>>>> [node04.cluster:27489] MCW rank 7 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket 3[core 31[hwt 0]]: [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
> >>>>>>> [node04.cluster:27489] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
> >>>>>>> [node04.cluster:27489] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
> >>>>>>> Hello world from process 2 of 8
> >>>>>>> Hello world from process 1 of 8
> >>>>>>> Hello world from process 3 of 8
> >>>>>>> Hello world from process 0 of 8
> >>>>>>> Hello world from process 6 of 8
> >>>>>>> Hello world from process 5 of 8
> >>>>>>> Hello world from process 4 of 8
> >>>>>>> Hello world from process 7 of 8
> >>>>>>>
> >>>>>>> I think this should be like this:
> >>>>>>>
> >>>>>>> rank 00
> >>>>>>> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
> >>>>>>> rank 01
> >>>>>>> [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
> >>>>>>> rank 02
> >>>>>>> [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
> >>>>>>> ...
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> Tetsuya Mishima
> >>>>>>>
> >>>>>>>> I fixed this under the trunk (was an issue regardless of RM) and have
> >>>>>>>> scheduled it for 1.7.4.
> >>>>>>>>
> >>>>>>>> Thanks!
> >>>>>>>> Ralph
> >>>>>>>>
> >>>>>>>> On Nov 25, 2013, at 4:22 PM, tmish...@jcity.maeda.co.jp wrote:
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Hi Ralph,
> >>>>>>>>>
> >>>>>>>>> Thank you very much for your quick response.
> >>>>>>>>>
> >>>>>>>>> I'm afraid to say that I found one more issue...
> >>>>>>>>>
> >>>>>>>>> It's not so serious. Please check it when you have a lot of time.
> >>>>>>>>>
> >>>>>>>>> The problem is cpus-per-proc with the -map-by option under the Torque
> >>>>>>>>> manager.
> >>>>>>>>> It doesn't work as shown below. I guess you can get the same
> >>>>>>>>> behaviour under Slurm manager.
> >>>>>>>>>
> >>>>>>>>> Of course, if I remove the -map-by option, it works quite well.
> >>>>>>>>>
> >>>>>>>>> [mishima@manage testbed2]$ qsub -I -l nodes=1:ppn=32
> >>>>>>>>> qsub: waiting for job 8116.manage.cluster to start
> >>>>>>>>> qsub: job 8116.manage.cluster ready
> >>>>>>>>>
> >>>>>>>>> [mishima@node03 ~]$ cd ~/Ducom/testbed2
> >>>>>>>>> [mishima@node03 testbed2]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 -map-by socket mPre
> >>>>>>>>>
> >>>>>>>>> --------------------------------------------------------------------------
> >>>>>>>>> A request was made to bind to that would result in binding more
> >>>>>>>>> processes than cpus on a resource:
> >>>>>>>>>
> >>>>>>>>> Bind to:         CORE
> >>>>>>>>> Node:            node03
> >>>>>>>>> #processes:  2
> >>>>>>>>> #cpus:          1
> >>>>>>>>>
> >>>>>>>>> You can override this protection by adding the "overload-allowed"
> >>>>>>>>> option to your binding directive.
> >>>>>>>>>
> >>>>>>>>> --------------------------------------------------------------------------
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> [mishima@node03 testbed2]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 mPre
> >>>>>>>>> [node03.cluster:18128] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
> >>>>>>>>> [node03.cluster:18128] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
> >>>>>>>>> [node03.cluster:18128] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]: [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
> >>>>>>>>> [node03.cluster:18128] MCW rank 5 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]: [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
> >>>>>>>>> [node03.cluster:18128] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]: [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
> >>>>>>>>> [node03.cluster:18128] MCW rank 7 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket 3[core 31[hwt 0]]: [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
> >>>>>>>>> [node03.cluster:18128] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
> >>>>>>>>> [node03.cluster:18128] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
> >>>>>>>>>
> >>>>>>>>> Regards,
> >>>>>>>>> Tetsuya Mishima
> >>>>>>>>>
> >>>>>>>>>> Fixed and scheduled to move to 1.7.4. Thanks again!
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Nov 17, 2013, at 6:11 PM, Ralph Castain <r...@open-mpi.org> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Thanks! That's precisely where I was going to look when I had time :-)
> >>>>>>>>>>
> >>>>>>>>>> I'll update tomorrow.
> >>>>>>>>>> Ralph
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Sun, Nov 17, 2013 at 7:01 PM, <tmish...@jcity.maeda.co.jp> wrote:
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Hi Ralph,
> >>>>>>>>>>
> >>>>>>>>>> This is a continuation of "Segmentation fault in oob_tcp.c of
> >>>>>>>>>> openmpi-1.7.4a1r29646".
> >>>>>>>>>>
> >>>>>>>>>> I found the cause.
> >>>>>>>>>>
> >>>>>>>>>> Firstly, I noticed that your hostfile works and mine does not.
> >>>>>>>>>>
> >>>>>>>>>> Your host file:
> >>>>>>>>>> cat hosts
> >>>>>>>>>> bend001 slots=12
> >>>>>>>>>>
> >>>>>>>>>> My host file:
> >>>>>>>>>> cat hosts
> >>>>>>>>>> node08
> >>>>>>>>>> node08
> >>>>>>>>>> ...(total 8 lines)
> >>>>>>>>>>
> >>>>>>>>>> I modified my script file to add "slots=1" to each line of my hostfile
> >>>>>>>>>> just before launching mpirun. Then it worked.
> >>>>>>>>>>
> >>>>>>>>>> My host file(modified):
> >>>>>>>>>> cat hosts
> >>>>>>>>>> node08 slots=1
> >>>>>>>>>> node08 slots=1
> >>>>>>>>>> ...(total 8 lines)
> >>>>>>>>>>
> >>>>>>>>>> Secondly, I confirmed that there's a slight difference between
> >>>>>>>>>> orte/util/hostfile/hostfile.c of 1.7.3 and that of 1.7.4a1r29646.
> >>>>>>>>>>
> >>>>>>>>>> $ diff hostfile.c.org ../../../../openmpi-1.7.3/orte/util/hostfile/hostfile.c
> >>>>>>>>>> 394,401c394,399
> >>>>>>>>>> <     if (got_count) {
> >>>>>>>>>> <         node->slots_given = true;
> >>>>>>>>>> <     } else if (got_max) {
> >>>>>>>>>> <         node->slots = node->slots_max;
> >>>>>>>>>> <         node->slots_given = true;
> >>>>>>>>>> <     } else {
> >>>>>>>>>> <         /* should be set by obj_new, but just to be clear */
> >>>>>>>>>> <         node->slots_given = false;
> >>>>>>>>>> ---
> >>>>>>>>>>> if (!got_count) {
> >>>>>>>>>>>     if (got_max) {
> >>>>>>>>>>>         node->slots = node->slots_max;
> >>>>>>>>>>>     } else {
> >>>>>>>>>>>         ++node->slots;
> >>>>>>>>>>>     }
> >>>>>>>>>> ....
> >>>>>>>>>>
> >>>>>>>>>> Finally, I added line 402 below just as a tentative trial.
> >>>>>>>>>> Then it worked.
> >>>>>>>>>>
> >>>>>>>>>> cat -n orte/util/hostfile/hostfile.c:
> >>>>>>>>>> ...
> >>>>>>>>>> 394      if (got_count) {
> >>>>>>>>>> 395          node->slots_given = true;
> >>>>>>>>>> 396      } else if (got_max) {
> >>>>>>>>>> 397          node->slots = node->slots_max;
> >>>>>>>>>> 398          node->slots_given = true;
> >>>>>>>>>> 399      } else {
> >>>>>>>>>> 400          /* should be set by obj_new, but just to be clear */
> >>>>>>>>>> 401          node->slots_given = false;
> >>>>>>>>>> 402          ++node->slots; /* added by tmishima */
> >>>>>>>>>> 403      }
> >>>>>>>>>> ...
> >>>>>>>>>>
> >>>>>>>>>> Please fix the problem properly, because my change is just based on a
> >>>>>>>>>> random guess. It's related to the treatment of hostfiles where slots
> >>>>>>>>>> information is not given.
> >>>>>>>>>>
> >>>>>>>>>> Regards,
> >>>>>>>>>> Tetsuya Mishima
> >>>>>>>>>>