I am using Open MPI v1.8.3 and LSF 9.1.3.

LSF creates a rankfile that looks like:

RANK_FILE:
======================================================================
rank 0=mach1 slot=0
rank 1=mach1 slot=4
rank 2=mach1 slot=8
rank 3=mach1 slot=12
rank 4=mach1 slot=16
rank 5=mach1 slot=20
rank 6=mach1 slot=24
rank 7=mach1 slot=28
rank 8=mach1 slot=32
rank 9=mach1 slot=36
rank 10=mach1 slot=40
rank 11=mach1 slot=44
rank 12=mach1 slot=1
rank 13=mach1 slot=5
rank 14=mach1 slot=9
rank 15=mach1 slot=13

These really are the cores I want to use, in that order.
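
A rankfile of this simple form can be sanity-checked before it is handed to mpirun. The following is a minimal Python sketch, not part of the original job setup: the names parse_rankfile and check_rankfile are hypothetical, and it assumes only the plain "rank N=host slot=S" syntax shown above (no socket:core or range notation).

```python
import re

# Matches only the simple "rank N=host slot=S" form shown above; richer
# slot syntax (socket:core pairs, ranges) is deliberately not handled.
RANK_LINE = re.compile(r"^rank\s+(\d+)=(\S+)\s+slot=(\d+)$")


def parse_rankfile(text):
    """Return a list of (rank, host, slot) tuples in file order."""
    entries = []
    for line in text.splitlines():
        m = RANK_LINE.match(line.strip())
        if m:
            entries.append((int(m.group(1)), m.group(2), int(m.group(3))))
    return entries


def check_rankfile(entries):
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    ranks = [rank for rank, _, _ in entries]
    if ranks != list(range(len(ranks))):
        problems.append("ranks are not 0..N-1 in order")
    seen = set()
    for _, host, slot in entries:
        if (host, slot) in seen:
            problems.append(f"slot {slot} on {host} is assigned twice")
        seen.add((host, slot))
    return problems


# First three lines of the rankfile above, used as sample input.
SAMPLE = "rank 0=mach1 slot=0\nrank 1=mach1 slot=4\nrank 2=mach1 slot=8\n"

if __name__ == "__main__":
    print(check_rankfile(parse_rankfile(SAMPLE)))
```

An empty list means the file is at least self-consistent: the ranks are contiguous and no slot is assigned twice.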

I log on to this machine and type (all on one line):

/apps/share/openmpi/1.8.3.I1217913/bin/mpirun \
  --mca orte_base_help_aggregate 0 \
  -v -display-devel-allocation \
  -display-devel-map \
  --rankfile RANK_FILE \
  --mca btl openib,tcp,sm,self \
  --x LD_LIBRARY_PATH \
  --np 16 \
  my_executable \
  -i model.i \
  -l model.o

And I get the following on the screen:

======================   ALLOCATED NODES   ======================
        mach1: slots=16 max_slots=0 slots_inuse=0 state=UP
=================================================================
 Data for JOB [52387,1] offset 0

 Mapper requested: NULL  Last mapper: rank_file  Mapping policy: BYUSER  Ranking policy: SLOT
 Binding policy: CPUSET  Cpu set: NULL  PPR: NULL  Cpus-per-rank: 1
        Num new daemons: 0      New daemon starting vpid INVALID
        Num nodes: 1

 Data for node: mach1           Launch id: -1   State: 2
        Daemon: [[52387,0],0]   Daemon launched: True
        Num slots: 16   Slots in use: 16        Oversubscribed: FALSE
        Num slots allocated: 16 Max slots: 0
        Username on node: NULL
        Num procs: 16   Next node_rank: 16
        Data for proc: [[52387,1],0]
                Pid: 0  Local rank: 0   Node rank: 0    App rank: 0
                State: INITIALIZED      Restarts: 0     App_context: 0  Locale: UNKNOWN Bind location: (null)   Binding: 0
        Data for proc: [[52387,1],1]
                Pid: 0  Local rank: 1   Node rank: 1    App rank: 1
                State: INITIALIZED      Restarts: 0     App_context: 0  Locale: UNKNOWN Bind location: (null)   Binding: 16
        Data for proc: [[52387,1],2]
                Pid: 0  Local rank: 2   Node rank: 2    App rank: 2
                State: INITIALIZED      Restarts: 0     App_context: 0  Locale: UNKNOWN Bind location: (null)   Binding: 32
        Data for proc: [[52387,1],3]
                Pid: 0  Local rank: 3   Node rank: 3    App rank: 3
                State: INITIALIZED      Restarts: 0     App_context: 0  Locale: UNKNOWN Bind location: (null)   Binding: 1
        Data for proc: [[52387,1],4]
                Pid: 0  Local rank: 4   Node rank: 4    App rank: 4
                State: INITIALIZED      Restarts: 0     App_context: 0  Locale: UNKNOWN Bind location: (null)   Binding: 17
        Data for proc: [[52387,1],5]
                Pid: 0  Local rank: 5   Node rank: 5    App rank: 5
                State: INITIALIZED      Restarts: 0     App_context: 0  Locale: UNKNOWN Bind location: (null)   Binding: 33
        Data for proc: [[52387,1],6]
                Pid: 0  Local rank: 6   Node rank: 6    App rank: 6
                State: INITIALIZED      Restarts: 0     App_context: 0  Locale: UNKNOWN Bind location: (null)   Binding: 2
        Data for proc: [[52387,1],7]
                Pid: 0  Local rank: 7   Node rank: 7    App rank: 7
                State: INITIALIZED      Restarts: 0     App_context: 0  Locale: UNKNOWN Bind location: (null)   Binding: 18
        Data for proc: [[52387,1],8]
                Pid: 0  Local rank: 8   Node rank: 8    App rank: 8
                State: INITIALIZED      Restarts: 0     App_context: 0  Locale: UNKNOWN Bind location: (null)   Binding: 34
        Data for proc: [[52387,1],9]
                Pid: 0  Local rank: 9   Node rank: 9    App rank: 9
                State: INITIALIZED      Restarts: 0     App_context: 0  Locale: UNKNOWN Bind location: (null)   Binding: 3
        Data for proc: [[52387,1],10]
                Pid: 0  Local rank: 10  Node rank: 10   App rank: 10
                State: INITIALIZED      Restarts: 0     App_context: 0  Locale: UNKNOWN Bind location: (null)   Binding: 19
        Data for proc: [[52387,1],11]
                Pid: 0  Local rank: 11  Node rank: 11   App rank: 11
                State: INITIALIZED      Restarts: 0     App_context: 0  Locale: UNKNOWN Bind location: (null)   Binding: 35
        Data for proc: [[52387,1],12]
                Pid: 0  Local rank: 12  Node rank: 12   App rank: 12
                State: INITIALIZED      Restarts: 0     App_context: 0  Locale: UNKNOWN Bind location: (null)   Binding: 4
        Data for proc: [[52387,1],13]
                Pid: 0  Local rank: 13  Node rank: 13   App rank: 13
                State: INITIALIZED      Restarts: 0     App_context: 0  Locale: UNKNOWN Bind location: (null)   Binding: 20
        Data for proc: [[52387,1],14]
                Pid: 0  Local rank: 14  Node rank: 14   App rank: 14
                State: INITIALIZED      Restarts: 0     App_context: 0  Locale: UNKNOWN Bind location: (null)   Binding: 36
        Data for proc: [[52387,1],15]
                Pid: 0  Local rank: 15  Node rank: 15   App rank: 15
                State: INITIALIZED      Restarts: 0     App_context: 0  Locale: UNKNOWN Bind location: (null)   Binding: 5

And a numa-map of the node shows:

  PID COMMAND         CPUMASK     TOTAL [     N0     N1     N2     N3     N4     N5     N6     N7 ]
31044 my_executable         0    443.3M [ 443.3M     0      0      0      0      0      0      0  ]
31045 my_executable        16    459.7M [ 459.7M     0      0      0      0      0      0      0  ]
31046 my_executable        32    435.0M [     0  435.0M     0      0      0      0      0      0  ]
31047 my_executable         1    468.8M [     0      0  468.8M     0      0      0      0      0  ]
31048 my_executable        17    493.2M [     0      0  493.2M     0      0      0      0      0  ]
31049 my_executable        33    498.0M [     0      0      0  498.0M     0      0      0      0  ]
31050 my_executable         2    501.2M [     0      0      0      0  501.2M     0      0      0  ]
31051 my_executable        18    502.4M [     0      0      0      0  502.4M     0      0      0  ]
31052 my_executable        34    500.5M [     0      0      0      0      0  500.5M     0      0  ]
31053 my_executable         3    515.6M [     0      0      0      0      0      0  515.6M     0  ]
31054 my_executable        19    508.1M [     0      0      0      0      0      0  508.1M     0  ]
31055 my_executable        35    503.9M [     0      0      0      0      0      0      0  503.9M ]
31056 my_executable         4    502.1M [ 502.1M     0      0      0      0      0      0      0  ]
31057 my_executable        20    515.2M [ 515.2M     0      0      0      0      0      0      0  ]
31058 my_executable        36    508.1M [     0  508.1M     0      0      0      0      0      0  ]
31059 my_executable         5    446.7M [     0      0  446.7M     0      0      0      0      0  ]
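
To make the mismatch concrete, the slots requested in the rankfile can be lined up against the "Binding:" values that mpirun printed (these agree with the CPUMASK column of the numa-map). This is only a small Python sketch; the two lists are copied by hand from the output above, and the helper name mismatches is hypothetical.

```python
# Slots requested in RANK_FILE, indexed by rank (copied from the rankfile).
requested = [0, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 1, 5, 9, 13]

# "Binding:" values from the -display-devel-map output, indexed by rank.
observed = [0, 16, 32, 1, 17, 33, 2, 18, 34, 3, 19, 35, 4, 20, 36, 5]


def mismatches(requested, observed):
    """Return (rank, requested_slot, observed_cpu) for every rank that differs."""
    return [
        (rank, want, got)
        for rank, (want, got) in enumerate(zip(requested, observed))
        if want != got
    ]


if __name__ == "__main__":
    for rank, want, got in mismatches(requested, observed):
        print(f"rank {rank:2d}: rankfile slot={want:2d}, reported binding={got:2d}")
```

With the values above, only rank 0 ends up on the CPU number written in the rankfile; the other 15 ranks report a different number.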
-- 

Why didn't mpirun honor the rankfile and put the processes on the correct cores
in the proper order? It looks to me like mpirun doesn't like the rankfile.

Thanks for any help.

Tom
