Hi

> > thank you very much for your answer. I have compiled your program
> > and get different behaviours for openmpi-1.6.4rc3 and openmpi-1.9.
> >
> > I get the following output for openmpi-1.9 (different outputs !!!).
> >
> > sunpc1 rankfiles 104 mpirun --report-bindings --rankfile myrankfile \
> >   ./a.out
> > [sunpc1:26554] MCW rank 0 bound to socket 0[core 0[hwt 0]],
> >   socket 0[core 1[hwt 0]]: [B/B][./.]
> > unbound
> >
> > sunpc1 rankfiles 105 mpirun --report-bindings --rankfile myrankfile_0 \
> >   ./a.out
> > [sunpc1:26557] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/.][./.]
> > bind to 0
>
> I think what's happening is that although you specified "0:0" or "0:1"
> in the rankfile, the string "0,0" or "0,1" is getting passed in (at
> least in the runs I looked at). That colon became a comma. So it's
> just by accident that myrankfile_0 is working out all right.
It is working for 0:0 and 1:1, and it isn't working for 0:1 and 1:0.
The machine is a Sun Ultra 40, by the way.

sunpc1 rankfiles 104 ompi_info | grep "MPI:"
                Open MPI: 1.9a1r28035
sunpc1 rankfiles 105 cat myrankfile_*
rank 0=sunpc1 slot=0:0
rank 0=sunpc1 slot=0:1
rank 0=sunpc1 slot=1:0
rank 0=sunpc1 slot=1:1
sunpc1 rankfiles 106 cc check.c
sunpc1 rankfiles 107 mpirun --report-bindings \
  --rankfile myrankfile_0 ./a.out
bind to 0
[sunpc1:26988] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/.][./.]
sunpc1 rankfiles 108 mpirun --report-bindings \
  --rankfile myrankfile_1 ./a.out
[sunpc1:26991] MCW rank 0 bound to socket 0[core 0[hwt 0]],
  socket 0[core 1[hwt 0]]: [B/B][./.]
unbound
sunpc1 rankfiles 109 mpirun --report-bindings \
  --rankfile myrankfile_2 ./a.out
[sunpc1:26994] MCW rank 0 bound to socket 1[core 2[hwt 0]],
  socket 1[core 3[hwt 0]]: [./.][B/B]
unbound
sunpc1 rankfiles 110 mpirun --report-bindings \
  --rankfile myrankfile_3 ./a.out
[sunpc1:26997] MCW rank 0 bound to socket 1[core 3[hwt 0]]: [./.][./B]
bind to 3
sunpc1 rankfiles 111

> Could someone who knows the code better than I do help me narrow this
> down? E.g., where is the rankfile parsed? For what it's worth, by the
> time mpirun reaches orte_odls_base_default_get_add_procs_data(),
> orte_job_data already contains the corrupted cpu_bitmap string.

Thank you very much for any help in advance.

Kind regards

Siegmar