Oh ye gods of rankfiles:

I have a node with two sockets, each with four cores. With a rankfile I can bind a rank to a specific core, to a range of cores, or to a specific core or range of cores on a specific socket. What I can't manage is binding to all cores of a specific socket: with slot=0:* it goes looking for core 4 on socket 0. I understand why it can't find it (each socket only has cores 0-3), but I don't understand why it's looking for it in the first place. Bug? My error/misunderstanding?
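
For concreteness, the working cases I mean are single-line rankfiles along these lines (core and socket numbers are just illustrative; the parenthetical notes aren't part of the file):

rank 0=saem9 slot=2        (a specific core)
rank 0=saem9 slot=2-3      (a range of cores)
rank 0=saem9 slot=0:2      (a specific core on socket 0)
rank 0=saem9 slot=0:0-2    (a range of cores on socket 0)

It's the wildcard form, slot=0:*, that blows up. Here's what the flight recorder black box says: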

% cat rankfile
rank 0=saem9 slot=0:*
% mpirun -np 1 --host saem9 --rankfile rankfile --mca paffinity_base_verbose 5 ./a.out
[saem9:20649] mca:base:select:(paffinity) Querying component [linux]
[saem9:20649] mca:base:select:(paffinity) Query of component [linux] set priority to 10
[saem9:20649] mca:base:select:(paffinity) Selected component [linux]
[saem9:20650] mca:base:select:(paffinity) Querying component [linux]
[saem9:20650] mca:base:select:(paffinity) Query of component [linux] set priority to 10
[saem9:20650] mca:base:select:(paffinity) Selected component [linux]
[saem9:20650] paffinity slot assignment: slot_list == 0:*
[saem9:20650] Rank 0: PAFFINITY cannot get physical core id for logical core 4 in physical socket 0 (0)
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

 opal_paffinity_base_slot_list_set() returned an error
 --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[saem9:20650] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 20650 on
node saem9 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
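
For what it's worth, what I expected slot=0:* to mean on this box is the equivalent of the explicit range

rank 0=saem9 slot=0:0-3

i.e. all four cores of socket 0, which is why the hunt for a core 4 surprises me.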
