Oh ye gods of rankfiles:

I have a node with two sockets, each with four cores. With a rankfile I can bind a rank to a specific core, to a range of cores, or to a specific core or range of cores on a specific socket. What I can't manage is binding to all cores of a specific socket: with slot=0:* it goes looking for core 4 on socket 0. I understand why it can't find it (each socket only has cores 0-3), but I don't understand why it's looking for it in the first place. Bug? My error/misunderstanding?
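
For concreteness, the working cases I mean are single-line rankfiles along these lines (core and socket numbers are just illustrative; the parenthetical notes aren't part of the file):

rank 0=saem9 slot=2        (a specific core)
rank 0=saem9 slot=2-3      (a range of cores)
rank 0=saem9 slot=0:2      (a specific core on socket 0)
rank 0=saem9 slot=0:0-2    (a range of cores on socket 0)

It's the wildcard form, slot=0:*, that blows up. Here's what the flight recorder black box says: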

% cat rankfile
rank 0=saem9 slot=0:*
% mpirun -np 1 --host saem9 --rankfile rankfile --mca paffinity_base_verbose 5 ./a.out
[saem9:20649] mca:base:select:(paffinity) Querying component [linux]
[saem9:20649] mca:base:select:(paffinity) Query of component [linux] set priority to 10
[saem9:20649] mca:base:select:(paffinity) Selected component [linux]
[saem9:20650] mca:base:select:(paffinity) Querying component [linux]
[saem9:20650] mca:base:select:(paffinity) Query of component [linux] set priority to 10
[saem9:20650] mca:base:select:(paffinity) Selected component [linux]
[saem9:20650] paffinity slot assignment: slot_list == 0:*
[saem9:20650] Rank 0: PAFFINITY cannot get physical core id for logical core 4 in physical socket 0 (0)
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

 opal_paffinity_base_slot_list_set() returned an error
 --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[saem9:20650] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 20650 on
node saem9 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
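
For what it's worth, what I expected slot=0:* to mean on this box is the equivalent of the explicit range

rank 0=saem9 slot=0:0-3

i.e. all four cores of socket 0, which is why the hunt for a core 4 surprises me.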
