Hi,
I am evaluating Open MPI 5.0.0 and I am experiencing a race condition when
spawning a different number of processes across different nodes with
MPI_Comm_spawn.

With this hostfile:


$ cat hostfile
node00
node01
node02
node03


If I run this code:


#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char* argv[]){
    MPI_Init(&argc, &argv);

    MPI_Comm intercomm;
    int final_nranks = 4, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Get_processor_name(name, &len);
    MPI_Comm_get_parent(&intercomm);

    if(intercomm == MPI_COMM_NULL){
        /* Parent: spawn one child per node listed in the hostfile. */
        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "hostfile", "hostfile");
        MPI_Info_set(info, "map_by", "ppr:1:node");
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, final_nranks, info, 0,
                       MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);
        printf("PARENT %s\n", name);
    } else {
        /* Child: just report which node it landed on. */
        printf("CHILD %s\n", name);
    }

    MPI_Finalize();
    return 0;
}
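
In case it is useful for debugging, below is a minimal sketch of the same spawn with the error handler on MPI_COMM_WORLD switched to MPI_ERRORS_RETURN, so a failed spawn prints an error string instead of aborting the job. This is only an instrumented variant of the reproducer above, not the code I normally run:


#include <stdio.h>
#include <mpi.h>

int main(int argc, char* argv[]){
    MPI_Init(&argc, &argv);

    MPI_Comm parent, intercomm;
    MPI_Comm_get_parent(&parent);

    if(parent == MPI_COMM_NULL){
        int errcodes[4];
        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "hostfile", "hostfile");
        MPI_Info_set(info, "map_by", "ppr:1:node");

        /* Return errors from the spawn instead of aborting the job. */
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        int rc = MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 4, info, 0,
                                MPI_COMM_WORLD, &intercomm, errcodes);
        if(rc != MPI_SUCCESS){
            char msg[MPI_MAX_ERROR_STRING];
            int len;
            MPI_Error_string(rc, msg, &len);
            fprintf(stderr, "MPI_Comm_spawn failed: %s\n", msg);
        }
        MPI_Info_free(&info);
    }

    MPI_Finalize();
    return 0;
}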


With the command:


$ mpirun -np 2 --hostfile hostfile --map-by node ./a.out
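
If it helps, I can also repeat the run with daemon debugging enabled to see whether the prted on node01 dies before or after the spawn completes (assuming --debug-daemons is still accepted by mpirun in 5.0.0; I have not verified that on this exact build):


$ mpirun -np 2 --hostfile hostfile --map-by node --debug-daemons ./a.out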


Sometimes I get this output (which is what I want, apart from the PMIx errors):


[node00:281361] PMIX ERROR: ERROR in file prted/pmix/pmix_server_dyn.c at line 1034
[node00:281361] PMIX ERROR: OUT-OF-RESOURCE in file base/bfrop_base_unpack.c at line 1839
PARENT node00
CHILD node00
PARENT node01
CHILD node01
CHILD node02
CHILD node03

However, in other executions I get the following output:

[node00:281468] PMIX ERROR: ERROR in file prted/pmix/pmix_server_dyn.c at line 1034
[node00:281468] PMIX ERROR: OUT-OF-RESOURCE in file base/bfrop_base_unpack.c at line 1839
--------------------------------------------------------------------------
PRTE has lost communication with a remote daemon.

  HNP daemon   : [prterun-node00-281468@0,0] on node node00
  Remote daemon: [prterun-node00-281468@0,2] on node node01

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
[node00:00000] *** An error occurred in Socket closed
[node00:00000] *** reported by process [3933011970,0]
[node00:00000] *** on a NULL communicator
[node00:00000] *** Unknown error
[node00:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[node00:00000] ***    and MPI will try to terminate your MPI job as well)


I have also filed this issue on GitHub:
https://github.com/open-mpi/ompi/issues/11421

Any help is appreciated, even hints about which parts of the code might be
causing this issue and would be worth digging into.

Thank you.
