Hi,

> Okay, I have a fix for not specifying the number of procs when
> using a rankfile.
>
> As for the binding pattern, the problem is a syntax error in
> your rankfile. You need a semi-colon instead of a comma to
> separate the sockets for rank 0:
>
> rank 0=bend001 slot=0:0-1,1:0-1  =>  rank 0=bend001 slot=0:0-1;1:0-1
>
> This is required because you use commas to list specific cores
> - e.g., slot=0:0,1,4,6 ...
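Thank you for the explanation. Just to make sure I understand the new
syntax: the semicolon separates sockets, while the comma only enumerates
single cores within one socket, so a line like the following (the
hostname and the core numbers are only placeholders for this example)
should bind one rank to cores 0 and 2 of socket 0 and to cores 1-3 of
socket 1.

rank 0=somehost slot=0:0,2;1:1-3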
OK, you have changed the syntax: Open MPI 1.6.x needs "," and Open MPI
1.9.x needs ";". Unfortunately my rankfiles still don't work as expected
(even if I add "-np <number>", so that everything is specified now).
These are some of my rankfiles, which I use to show you the different
errors.

::::::::::::::
rf_linpc_semicolon
::::::::::::::
# Open MPI 1.7.x and newer needs ";" to separate sockets.
# mpiexec -report-bindings -rf rf_linpc_semicolon -np 1 hostname
rank 0=linpc1 slot=0:0-1;1:0-1
::::::::::::::
rf_linpc_linpc_semicolon
::::::::::::::
# Open MPI 1.7.x and newer needs ";" to separate sockets.
# mpiexec -report-bindings -rf rf_linpc_linpc_semicolon -np 4 hostname
rank 0=linpc0 slot=0:0-1;1:0-1
rank 1=linpc1 slot=0:0-1
rank 2=linpc1 slot=1:0
rank 3=linpc1 slot=1:1
::::::::::::::
rf_tyr_semicolon
::::::::::::::
# Open MPI 1.7.x and newer needs ";" to separate sockets.
# mpiexec -report-bindings -rf rf_tyr_semicolon -np 1 hostname
rank 0=tyr slot=0:0;1:0
tyr rankfiles 198

These are my results. "linpc?" use openSuSE Linux, "sunpc?" use
Solaris 10 x86_64, and "tyr" uses Solaris 10 sparc. "linpc?" and
"sunpc?" use identical hardware.

tyr rankfiles 107 ompi_info | grep "Open MPI:"
                Open MPI: 1.9a1r29097


1) It seems that I can use the rankfile only on a node that is
   specified in the rankfile.

linpc1 rankfiles 98 mpiexec -report-bindings \
  -rf rf_linpc_semicolon -np 1 hostname
[linpc1:12504] MCW rank 0 bound to socket 0[core 0[hwt 0]],
  socket 0[core 1[hwt 0]], socket 1[core 2[hwt 0]],
  socket 1[core 3[hwt 0]]: [B/B][B/B]
linpc1
linpc1 rankfiles 98 exit

tyr rankfiles 125 ssh sunpc1
...
sunpc1 rankfiles 102 mpiexec -report-bindings \
  -rf rf_linpc_semicolon -np 1 hostname
--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------
sunpc1 rankfiles 103 exit

linpc0 rankfiles 93 mpiexec -report-bindings \
  -rf rf_linpc_semicolon -np 1 hostname
--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------
linpc0 rankfiles 94 exit


I can use the rankfile on any machine with Open MPI 1.6.x.

tyr rankfiles 105 ompi_info | grep "Open MPI:"
                Open MPI: 1.6.5a1r28554
tyr rankfiles 106 mpiexec -report-bindings \
  -rf rf_linpc_semicolon -np 1 hostname
[tyr.informatik.hs-fulda.de:29380] Got an error!
[linpc1:12637] MCW rank 0 bound to socket 0[core 0-1]: [B B][. .]
  (slot list 0:0-1)
linpc1

The semicolon isn't allowed in 1.6.x, but the comma version works.

tyr rankfiles 107 mpiexec -report-bindings \
  -rf rf_linpc_comma -np 1 hostname
[linpc1:12704] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]:
  [B B][B B] (slot list 0:0,1,1:0,1)
linpc1
tyr rankfiles 108


2) I cannot use two Linux machines with Open MPI 1.9.x.

linpc1 rankfiles 105 mpiexec -report-bindings \
  -rf rf_linpc_linpc_semicolon -np 4 hostname
--------------------------------------------------------------------------
The rankfile that was used claimed that a host was either not allocated
or oversubscribed its slots. Please review your rank-slot assignments
and your host allocation to ensure a proper match. Also, some systems
may require using full hostnames, such as "host1.example.com" (instead
of just plain "host1").

  Host: linpc0
--------------------------------------------------------------------------
linpc1 rankfiles 106

Perhaps this problem is a follow-up of problem 1 above.
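Following the hint about full hostnames in the error message, I could
also test a rankfile with fully qualified names, e.g. (the domain below
is only a guess based on my other machines and just meant as an
illustration):

rank 0=linpc0.informatik.hs-fulda.de slot=0:0-1;1:0-1
rank 1=linpc1.informatik.hs-fulda.de slot=0:0-1
rank 2=linpc1.informatik.hs-fulda.de slot=1:0
rank 3=linpc1.informatik.hs-fulda.de slot=1:1

So far I have only used the short names shown above.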
No problem with Open MPI 1.6.x.

linpc1 rankfiles 106 mpiexec -report-bindings \
  -rf rf_linpc_linpc_comma -np 4 hostname
[linpc1:12975] MCW rank 1 bound to socket 0[core 0-1]: [B B][. .]
  (slot list 0:0-1)
[linpc1:12975] MCW rank 2 bound to socket 1[core 0]: [. .][B .]
  (slot list 1:0)
[linpc1:12975] MCW rank 3 bound to socket 1[core 1]: [. .][. B]
  (slot list 1:1)
linpc1
linpc1
[linpc0:13855] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]:
  [B B][B B] (slot list 0:0-1,1:0-1)
linpc0
linpc1
linpc1 rankfiles 107


3) I have a problem on "tyr" (Solaris 10 sparc).

tyr rankfiles 106 mpiexec -report-bindings \
  -rf rf_tyr_semicolon -np 1 hostname
[tyr.informatik.hs-fulda.de:29849] [[53951,0],0] ORTE_ERROR_LOG: Not found in file ../../../../../openmpi-1.9a1r29097/orte/mca/rmaps/rank_file/rmaps_rank_file.c at line 276
[tyr.informatik.hs-fulda.de:29849] [[53951,0],0] ORTE_ERROR_LOG: Not found in file ../../../../openmpi-1.9a1r29097/orte/mca/rmaps/base/rmaps_base_map_job.c at line 173
tyr rankfiles 107

I get the following output if I try the rankfile from a different
machine (also Solaris 10 sparc).

rs0 rankfiles 104 mpiexec -report-bindings -rf rf_tyr_semicolon -np 1 hostname
--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------
rs0 rankfiles 105


This time I also have a small problem with Open MPI 1.6.x.

tyr rankfiles 134 ompi_info | grep "Open MPI:"
                Open MPI: 1.6.5a1r28554
tyr rankfiles 135 mpiexec -report-bindings \
  -rf rf_tyr_comma -np 1 hostname
--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------
tyr rankfiles 136 ssh rs0
...
rs0 rankfiles 104 ompi_info | grep "Open MPI:"
                Open MPI: 1.6.5a1r28554
rs0 rankfiles 105 mpiexec -report-bindings \
  -rf rf_tyr_comma -np 1 hostname
[tyr.informatik.hs-fulda.de:29770] MCW rank 0 bound to socket 0[core 0]
  socket 1[core 0]: [B][B] (slot list 0:0,1:0)
tyr.informatik.hs-fulda.de
rs0 rankfiles 106

Why doesn't it work when I'm logged in on the machine that is specified
in the rankfile, while it works when I use the rankfile from a different
machine?

Thank you very much for any help in advance.


Kind regards

Siegmar


> On Sep 2, 2013, at 7:52 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
> > It seems to run for me on CentOS, though I note that rank 0 isn't
> > bound to both sockets 0 and 1 as specified and I had to tell it how
> > many procs to run:
> >
> > [rhc@bend001 svn-trunk]$ mpirun --report-bindings -rf rf -n 4 hostname
> > [bend001:13297] MCW rank 0 bound to socket 0[core 0[hwt 0-1]],
> >   socket 0[core 1[hwt 0-1]]: [BB/BB/../../../..][../../../../../..]
> > bend001
> > [bend002:25899] MCW rank 3 bound to socket 1[core 7[hwt 0-1]]:
> >   [../../../../../..][../BB/../../../..]
> > bend002
> > [bend002:25899] MCW rank 1 bound to socket 0[core 0[hwt 0-1]],
> >   socket 0[core 1[hwt 0-1]]: [BB/BB/../../../..][../../../../../..]
> > bend002
> > [bend002:25899] MCW rank 2 bound to socket 1[core 6[hwt 0-1]]:
> >   [../../../../../..][BB/../../../../..]
> > bend002
> >
> > [rhc@bend001 svn-trunk]$ cat rf
> > rank 0=bend001 slot=0:0-1,1:0-1
> > rank 1=bend002 slot=0:0-1
> > rank 2=bend002 slot=1:0
> > rank 3=bend002 slot=1:1
> >
> > I'll work on those issues, but don't know why you are getting
> > this "not allocated" error.
> >
> > On Sep 2, 2013, at 7:10 AM, Siegmar Gross
> > <siegmar.gr...@informatik.hs-fulda.de> wrote:
> >
> >> Hi,
> >>
> >> I installed openmpi-1.9a1r29097 on "openSuSE Linux 12.1", "Solaris 10
> >> x86_64", and "Solaris 10 sparc" with "Sun C 5.12" in 64-bit mode.
> >> Unfortunately I still have a problem with rankfiles. I reported the
> >> problems already in May. I show the problems with Linux, although I
> >> have the same problems on all Solaris machines as well.
> >>
> >> linpc1 rankfiles 99 cat rf_linpc1
> >> # mpiexec -report-bindings -rf rf_linpc1 hostname
> >> rank 0=linpc1 slot=0:0-1,1:0-1
> >>
> >> linpc1 rankfiles 100 mpiexec -report-bindings -rf rf_linpc1 hostname
> >> [linpc1:23413] MCW rank 0 bound to socket 0[core 0[hwt 0]],
> >>   socket 0[core 1[hwt 0]]: [B/B][./.]
> >> linpc1
> >>
> >>
> >> linpc1 rankfiles 101 cat rf_ex_linpc
> >> # mpiexec -report-bindings -rf rf_ex_linpc hostname
> >> rank 0=linpc0 slot=0:0-1,1:0-1
> >> rank 1=linpc1 slot=0:0-1
> >> rank 2=linpc1 slot=1:0
> >> rank 3=linpc1 slot=1:1
> >>
> >> linpc1 rankfiles 102 mpiexec -report-bindings -rf rf_ex_linpc hostname
> >> --------------------------------------------------------------------------
> >> The rankfile that was used claimed that a host was either not
> >> allocated or oversubscribed its slots. Please review your rank-slot
> >> assignments and your host allocation to ensure a proper match. Also,
> >> some systems may require using full hostnames, such as
> >> "host1.example.com" (instead of just plain "host1").
> >>
> >> Host: linpc0
> >> --------------------------------------------------------------------------
> >> linpc1 rankfiles 103
> >>
> >>
> >> I don't have these problems with openmpi-1.6.5a1r28554.
> >>
> >> linpc1 rankfiles 95 ompi_info | grep "Open MPI:"
> >>                 Open MPI: 1.6.5a1r28554
> >>
> >> linpc1 rankfiles 95 mpiexec -report-bindings -rf rf_linpc1 hostname
> >> [linpc1:23583] MCW rank 0 bound to socket 0[core 0-1]
> >>   socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
> >> linpc1
> >>
> >>
> >> linpc1 rankfiles 96 mpiexec -report-bindings -rf rf_ex_linpc hostname
> >> [linpc1:23585] MCW rank 1 bound to socket 0[core 0-1]:
> >>   [B B][. .] (slot list 0:0-1)
> >> [linpc1:23585] MCW rank 2 bound to socket 1[core 0]:
> >>   [. .][B .] (slot list 1:0)
> >> [linpc1:23585] MCW rank 3 bound to socket 1[core 1]:
> >>   [. .][. B] (slot list 1:1)
> >> linpc1
> >> linpc1
> >> linpc1
> >> [linpc0:10422] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]:
> >>   [B B][B B] (slot list 0:0-1,1:0-1)
> >> linpc0
> >>
> >>
> >> I would be grateful, if somebody can fix the problem. Thank you
> >> very much for any help in advance.
> >>
> >>
> >> Kind regards
> >>
> >> Siegmar
> >>
> >> _______________________________________________
> >> users mailing list
> >> us...@open-mpi.org
> >> http://www.open-mpi.org/mailman/listinfo.cgi/users