Hi Ralph,
Thank you very much for your quick response. I'm afraid I have to report one
more issue... It's not so serious, so please check it whenever you have time.

The problem is -cpus-per-proc combined with the -map-by option under the
Torque manager. It doesn't work, as shown below. I guess you would get the
same behaviour under the Slurm manager. Of course, if I remove the -map-by
option, it works quite well.

[mishima@manage testbed2]$ qsub -I -l nodes=1:ppn=32
qsub: waiting for job 8116.manage.cluster to start
qsub: job 8116.manage.cluster ready

[mishima@node03 ~]$ cd ~/Ducom/testbed2
[mishima@node03 testbed2]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 -map-by socket mPre
--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to:     CORE
   Node:        node03
   #processes:  2
   #cpus:       1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------

[mishima@node03 testbed2]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 mPre
[node03.cluster:18128] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
[node03.cluster:18128] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
[node03.cluster:18128] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]: [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
[node03.cluster:18128] MCW rank 5 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]: [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
[node03.cluster:18128] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]: [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
[node03.cluster:18128] MCW rank 7 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket 3[core 31[hwt 0]]: [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
[node03.cluster:18128] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
[node03.cluster:18128] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
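By the way, I have not actually tried it, but if I read the error message
correctly, the override it mentions would be something like the line below
(this is only my guess at the exact syntax). Of course it would just suppress
the protection rather than produce the intended mapping, so it is not a real
fix:

mpirun -np 8 -report-bindings -cpus-per-proc 4 -map-by socket -bind-to core:overload-allowed mPre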
Regards,
Tetsuya Mishima

> Fixed and scheduled to move to 1.7.4. Thanks again!
>
>
> On Nov 17, 2013, at 6:11 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
> Thanks! That's precisely where I was going to look when I had time :-)
>
> I'll update tomorrow.
> Ralph
>
>
> On Sun, Nov 17, 2013 at 7:01 PM, <tmish...@jcity.maeda.co.jp> wrote:
>
> Hi Ralph,
>
> This is the continuation of "Segmentation fault in oob_tcp.c of
> openmpi-1.7.4a1r29646". I found the cause.
>
> Firstly, I noticed that your hostfile works and mine does not.
>
> Your host file:
> cat hosts
> bend001 slots=12
>
> My host file:
> cat hosts
> node08
> node08
> ...(total 8 lines)
>
> I modified my script file to add "slots=1" to each line of my hostfile
> just before launching mpirun. Then it worked.
>
> My host file (modified):
> cat hosts
> node08 slots=1
> node08 slots=1
> ...(total 8 lines)
>
> Secondly, I confirmed that there is a slight difference between
> orte/util/hostfile/hostfile.c of 1.7.3 and that of 1.7.4a1r29646.
>
> $ diff hostfile.c.org ../../../../openmpi-1.7.3/orte/util/hostfile/hostfile.c
> 394,401c394,399
> <     if (got_count) {
> <         node->slots_given = true;
> <     } else if (got_max) {
> <         node->slots = node->slots_max;
> <         node->slots_given = true;
> <     } else {
> <         /* should be set by obj_new, but just to be clear */
> <         node->slots_given = false;
> ---
> >     if (!got_count) {
> >         if (got_max) {
> >             node->slots = node->slots_max;
> >         } else {
> >             ++node->slots;
> >         }
> ....
>
> Finally, I added line 402 below, just as a tentative trial. Then it worked.
>
> cat -n orte/util/hostfile/hostfile.c:
> ...
>    394      if (got_count) {
>    395          node->slots_given = true;
>    396      } else if (got_max) {
>    397          node->slots = node->slots_max;
>    398          node->slots_given = true;
>    399      } else {
>    400          /* should be set by obj_new, but just to be clear */
>    401          node->slots_given = false;
>    402          ++node->slots; /* added by tmishima */
>    403      }
> ...
>
> Please fix the problem properly, because my change is just based on a
> random guess. It is related to the treatment of a hostfile where the slots
> information is not given.
>
> Regards,
> Tetsuya Mishima
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users