Hmmm…well, it seems to be working fine in 1.8.4rc1 (I only have 12 cores on my humble machine). However, I can’t test any interactions with LSF, though that shouldn’t be an issue:
$ mpirun -host bend001 -rf ./rankfile --report-bindings --display-devel-map hostname
 Data for JOB [60677,1] offset 0

 Mapper requested: NULL  Last mapper: rank_file  Mapping policy: BYUSER  Ranking policy: SLOT
 Binding policy: CPUSET  Cpu set: NULL  PPR: NULL  Cpus-per-rank: 1
   Num new daemons: 0  New daemon starting vpid INVALID
   Num nodes: 1

 Data for node: bend001  Launch id: -1  State: 2
   Daemon: [[60677,0],0]  Daemon launched: True
   Num slots: 12  Slots in use: 12  Oversubscribed: FALSE
   Num slots allocated: 12  Max slots: 0
   Username on node: NULL
   Num procs: 12  Next node_rank: 12
   Data for proc: [[60677,1],0]
     Pid: 0  Local rank: 0  Node rank: 0  App rank: 0
     State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 0,12
   Data for proc: [[60677,1],1]
     Pid: 0  Local rank: 1  Node rank: 1  App rank: 1
     State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 8,20
   Data for proc: [[60677,1],2]
     Pid: 0  Local rank: 2  Node rank: 2  App rank: 2
     State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 5,17
   Data for proc: [[60677,1],3]
     Pid: 0  Local rank: 3  Node rank: 3  App rank: 3
     State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 9,21
   Data for proc: [[60677,1],4]
     Pid: 0  Local rank: 4  Node rank: 4  App rank: 4
     State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 11,23
   Data for proc: [[60677,1],5]
     Pid: 0  Local rank: 5  Node rank: 5  App rank: 5
     State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 7,19
   Data for proc: [[60677,1],6]
     Pid: 0  Local rank: 6  Node rank: 6  App rank: 6
     State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 3,15
   Data for proc: [[60677,1],7]
     Pid: 0  Local rank: 7  Node rank: 7  App rank: 7
     State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 6,18
   Data for proc: [[60677,1],8]
     Pid: 0  Local rank: 8  Node rank: 8  App rank: 8
     State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 2,14
   Data for proc: [[60677,1],9]
     Pid: 0  Local rank: 9  Node rank: 9  App rank: 9
     State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 4,16
   Data for proc: [[60677,1],10]
     Pid: 0  Local rank: 10  Node rank: 10  App rank: 10
     State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 10,22
   Data for proc: [[60677,1],11]
     Pid: 0  Local rank: 11  Node rank: 11  App rank: 11
     State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 1,13
[bend001:24667] MCW rank 1 bound to socket 0[core 4[hwt 0-1]]: [../../../../BB/..][../../../../../..]
[bend001:24667] MCW rank 2 bound to socket 1[core 8[hwt 0-1]]: [../../../../../..][../../BB/../../..]
[bend001:24667] MCW rank 3 bound to socket 1[core 10[hwt 0-1]]: [../../../../../..][../../../../BB/..]
[bend001:24667] MCW rank 4 bound to socket 1[core 11[hwt 0-1]]: [../../../../../..][../../../../../BB]
[bend001:24667] MCW rank 5 bound to socket 1[core 9[hwt 0-1]]: [../../../../../..][../../../BB/../..]
[bend001:24667] MCW rank 6 bound to socket 1[core 7[hwt 0-1]]: [../../../../../..][../BB/../../../..]
[bend001:24667] MCW rank 7 bound to socket 0[core 3[hwt 0-1]]: [../../../BB/../..][../../../../../..]
[bend001:24667] MCW rank 8 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../..][../../../../../..]
[bend001:24667] MCW rank 9 bound to socket 0[core 2[hwt 0-1]]: [../../BB/../../..][../../../../../..]
[bend001:24667] MCW rank 10 bound to socket 0[core 5[hwt 0-1]]: [../../../../../BB][../../../../../..]
[bend001:24667] MCW rank 11 bound to socket 1[core 6[hwt 0-1]]: [../../../../../..][BB/../../../../..]
[bend001:24667] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../..][../../../../../..]

Can you try with the latest nightly 1.8 tarball?
http://www.open-mpi.org/nightly/v1.8/

Note that it is also possible that hwloc isn’t correctly identifying the cores here. Can you tell us something about the hardware? Do you have hardware threads enabled?

I ask because the binding we report uses the cpu numbers as identified by hwloc, which may not match the numbering you are expecting from some hardware vendor’s map. We are using logical processor assignments, not physical. You can use the --report-bindings option to show the resulting map, as above.

> On Nov 5, 2014, at 7:21 AM, twu...@goodyear.com wrote:
>
> I am using openmpi v 1.8.3 and LSF 9.1.3.
>
> LSF creates a rankfile that looks like:
>
> RANK_FILE:
> ======================================================================
> rank 0=mach1 slot=0
> rank 1=mach1 slot=4
> rank 2=mach1 slot=8
> rank 3=mach1 slot=12
> rank 4=mach1 slot=16
> rank 5=mach1 slot=20
> rank 6=mach1 slot=24
> rank 7=mach1 slot=28
> rank 8=mach1 slot=32
> rank 9=mach1 slot=36
> rank 10=mach1 slot=40
> rank 11=mach1 slot=44
> rank 12=mach1 slot=1
> rank 13=mach1 slot=5
> rank 14=mach1 slot=9
> rank 15=mach1 slot=13
>
> which really are the cores I want to use, in order.
>
> I logon to this machine and type (all on one line):
>
> /apps/share/openmpi/1.8.3.I1217913/bin/mpirun \
>   --mca orte_base_help_aggregate 0 \
>   -v -display-devel-allocation \
>   -display-devel-map \
>   --rankfile RANK_FILE \
>   --mca btl openib,tcp,sm,self \
>   --x LD_LIBRARY_PATH \
>   --np 16 \
>   my_executable \
>   -i model.i \
>   -l model.o
>
> And I get the following on the screen:
>
> ======================   ALLOCATED NODES   ======================
>   mach1: slots=16 max_slots=0 slots_inuse=0 state=UP
> =================================================================
> Data for JOB [52387,1] offset 0
>
> Mapper requested: NULL  Last mapper: rank_file  Mapping policy: BYUSER
> Ranking policy: SLOT
> Binding policy: CPUSET  Cpu set: NULL  PPR: NULL  Cpus-per-rank: 1
>   Num new daemons: 0  New daemon starting vpid INVALID
>   Num nodes: 1
>
> Data for node: mach1  Launch id: -1  State: 2
>   Daemon: [[52387,0],0]  Daemon launched: True
>   Num slots: 16  Slots in use: 16  Oversubscribed: FALSE
>   Num slots allocated: 16  Max slots: 0
>   Username on node: NULL
>   Num procs: 16  Next node_rank: 16
>   Data for proc: [[52387,1],0]
>     Pid: 0  Local rank: 0  Node rank: 0  App rank: 0
>     State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 0
>   Data for proc: [[52387,1],1]
>     Pid: 0  Local rank: 1  Node rank: 1  App rank: 1
>     State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 16
>   Data for proc: [[52387,1],2]
>     Pid: 0  Local rank: 2  Node rank: 2  App rank: 2
>     State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 32
>   Data for proc: [[52387,1],3]
>     Pid: 0  Local rank: 3  Node rank: 3  App rank: 3
>     State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 1
>   Data for proc: [[52387,1],4]
>     Pid: 0  Local rank: 4  Node rank: 4  App rank: 4
>     State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 17
>   Data for proc: [[52387,1],5]
>     Pid: 0  Local rank: 5  Node rank: 5  App rank: 5
>     State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 33
>   Data for proc: [[52387,1],6]
>     Pid: 0  Local rank: 6  Node rank: 6  App rank: 6
>     State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 2
>   Data for proc: [[52387,1],7]
>     Pid: 0  Local rank: 7  Node rank: 7  App rank: 7
>     State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 18
>   Data for proc: [[52387,1],8]
>     Pid: 0  Local rank: 8  Node rank: 8  App rank: 8
>     State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 34
>   Data for proc: [[52387,1],9]
>     Pid: 0  Local rank: 9  Node rank: 9  App rank: 9
>     State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 3
>   Data for proc: [[52387,1],10]
>     Pid: 0  Local rank: 10  Node rank: 10  App rank: 10
>     State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 19
>   Data for proc: [[52387,1],11]
>     Pid: 0  Local rank: 11  Node rank: 11  App rank: 11
>     State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 35
>   Data for proc: [[52387,1],12]
>     Pid: 0  Local rank: 12  Node rank: 12  App rank: 12
>     State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 4
>   Data for proc: [[52387,1],13]
>     Pid: 0  Local rank: 13  Node rank: 13  App rank: 13
>     State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 20
>   Data for proc: [[52387,1],14]
>     Pid: 0  Local rank: 14  Node rank: 14  App rank: 14
>     State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 36
>   Data for proc: [[52387,1],15]
>     Pid: 0  Local rank: 15  Node rank: 15  App rank: 15
>     State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 5
>
> And a numa-map of the
node shows:
>
>   PID COMMAND       CPUMASK  TOTAL [     N0     N1     N2     N3     N4     N5     N6     N7 ]
> 31044 my_executable       0 443.3M [ 443.3M      0      0      0      0      0      0      0 ]
> 31045 my_executable      16 459.7M [ 459.7M      0      0      0      0      0      0      0 ]
> 31046 my_executable      32 435.0M [      0 435.0M      0      0      0      0      0      0 ]
> 31047 my_executable       1 468.8M [      0      0 468.8M      0      0      0      0      0 ]
> 31048 my_executable      17 493.2M [      0      0 493.2M      0      0      0      0      0 ]
> 31049 my_executable      33 498.0M [      0      0      0 498.0M      0      0      0      0 ]
> 31050 my_executable       2 501.2M [      0      0      0      0 501.2M      0      0      0 ]
> 31051 my_executable      18 502.4M [      0      0      0      0 502.4M      0      0      0 ]
> 31052 my_executable      34 500.5M [      0      0      0      0      0 500.5M      0      0 ]
> 31053 my_executable       3 515.6M [      0      0      0      0      0      0 515.6M      0 ]
> 31054 my_executable      19 508.1M [      0      0      0      0      0      0 508.1M      0 ]
> 31055 my_executable      35 503.9M [      0      0      0      0      0      0      0 503.9M ]
> 31056 my_executable       4 502.1M [ 502.1M      0      0      0      0      0      0      0 ]
> 31057 my_executable      20 515.2M [ 515.2M      0      0      0      0      0      0      0 ]
> 31058 my_executable      36 508.1M [      0 508.1M      0      0      0      0      0      0 ]
> 31059 my_executable       5 446.7M [      0      0 446.7M      0      0      0      0      0 ]
> --
>
> Why didn't mpirun honor the rankfile and put the processes on the correct
> cores in the proper order? It looks to me like mpirun doesn't like the
> rankfile...??
>
> Thanks for any help.
>
> Tom
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/11/16199.php
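
For anyone who wants to check the logical-versus-physical point against the numbers in the quoted message, here is a minimal Python sketch. It parses the rankfile as posted and lines the requested slots up against the "Binding:" values from the quoted --display-devel-map output. The helper names (`parse_rankfile`, `guess_os_index`) are made up for illustration, the parser only handles the plain `slot=N` form shown above, and the permutation in `guess_os_index` is an assumption fitted to these sixteen numbers only. It says nothing certain about the machine's real topology, which hwloc would have to confirm.

```python
import re

# Rankfile text copied from the quoted message (host "mach1" as posted).
RANKFILE = """\
rank 0=mach1 slot=0
rank 1=mach1 slot=4
rank 2=mach1 slot=8
rank 3=mach1 slot=12
rank 4=mach1 slot=16
rank 5=mach1 slot=20
rank 6=mach1 slot=24
rank 7=mach1 slot=28
rank 8=mach1 slot=32
rank 9=mach1 slot=36
rank 10=mach1 slot=40
rank 11=mach1 slot=44
rank 12=mach1 slot=1
rank 13=mach1 slot=5
rank 14=mach1 slot=9
rank 15=mach1 slot=13
"""

# "Binding:" values copied, in rank order, from the quoted devel-map output.
REPORTED = [0, 16, 32, 1, 17, 33, 2, 18, 34, 3, 19, 35, 4, 20, 36, 5]

# Matches only the simple "rank N=host slot=S" form used in this thread.
LINE = re.compile(r"rank\s+(\d+)=(\S+)\s+slot=(\d+)")


def parse_rankfile(text):
    """Return {rank: slot} for every line of the form 'rank N=host slot=S'."""
    slots = {}
    for line in text.splitlines():
        match = LINE.fullmatch(line.strip())
        if match:
            slots[int(match.group(1))] = int(match.group(3))
    return slots


def guess_os_index(logical, per_group=12, stride=4):
    """Hypothetical logical-to-reported permutation, fitted to the posted data.

    Assumption: per_group=12 and stride=4 were chosen only because they
    reproduce the sixteen values above; they are not taken from hwloc.
    """
    return (logical % per_group) * stride + logical // per_group


slots = parse_rankfile(RANKFILE)

# Ranks whose reported binding differs from the slot number in the rankfile.
mismatched = [r for r in sorted(slots) if slots[r] != REPORTED[r]]

# True if treating the slot as a logical index explains every reported value.
explained = all(guess_os_index(slots[r]) == REPORTED[r] for r in slots)

for rank in sorted(slots):
    print(f"rank {rank:2d}: requested slot {slots[rank]:2d}, "
          f"reported binding {REPORTED[rank]:2d}")
print("ranks that differ:", mismatched)
print("explained by the guessed permutation:", explained)
```

If a guessed permutation does account for every rank, that points to a numbering difference between the rankfile and the reported binding rather than to mpirun ignoring the rankfile, which is the possibility raised in the reply.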