Hi,
I am evaluating Open MPI 5.0.0 and I am experiencing a race condition when
spawning a different number of processes on different nodes.
With this hostfile:
$ cat hostfile
node00
node01
node02
node03
If I run this code:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char* argv[]){
    MPI_Init(&argc, &argv);

    MPI_Comm intercomm;
    int final_nranks = 4, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Get_processor_name(name, &len);
    MPI_Comm_get_parent(&intercomm);

    if(intercomm == MPI_COMM_NULL){
        /* Parent: spawn final_nranks children, one per node in the hostfile */
        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "hostfile", "hostfile");
        MPI_Info_set(info, "map_by", "ppr:1:node");
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, final_nranks, info, 0,
                       MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);
        printf("PARENT %s\n", name);
    } else {
        /* Child: started by MPI_Comm_spawn */
        printf("CHILD %s\n", name);
    }

    MPI_Finalize();
    return 0;
}
With the command:
$ mpirun -np 2 --hostfile hostfile --map-by node ./a.out
Sometimes I get this output (which is what I want, apart from the PMIX errors):
[node00:281361] PMIX ERROR: ERROR in file prted/pmix/pmix_server_dyn.c at line 1034
[node00:281361] PMIX ERROR: OUT-OF-RESOURCE in file base/bfrop_base_unpack.c at line 1839
PARENT node00
CHILD node00
PARENT node01
CHILD node01
CHILD node02
CHILD node03
However, in other executions I get the following output:
[node00:281468] PMIX ERROR: ERROR in file prted/pmix/pmix_server_dyn.c at line 1034
[node00:281468] PMIX ERROR: OUT-OF-RESOURCE in file base/bfrop_base_unpack.c at line 1839
--------------------------------------------------------------------------
PRTE has lost communication with a remote daemon.
HNP daemon : [prterun-node00-281468@0,0] on node node00
Remote daemon: [prterun-node00-281468@0,2] on node node01
This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
[node00:00000] *** An error occurred in Socket closed
[node00:00000] *** reported by process [3933011970,0]
[node00:00000] *** on a NULL communicator
[node00:00000] *** Unknown error
[node00:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[node00:00000] *** and MPI will try to terminate your MPI job as well)
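Since the failure is intermittent, a loop along the following lines could be used to estimate how often it occurs. This is only a sketch: it assumes that mpirun returns a nonzero exit status when the job is aborted, and both the helper name count_failures and the run count of 20 are arbitrary choices for illustration.

```shell
#!/bin/sh
# Run a command N times and print how many runs exited with a nonzero status.
# count_failures is a hypothetical helper name, not part of Open MPI.
count_failures() {
    runs=$1
    shift
    fails=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # Discard output; only the exit status is inspected here.
        if ! "$@" >/dev/null 2>&1; then
            fails=$((fails + 1))
        fi
        i=$((i + 1))
    done
    echo "$fails"
}

# Example (20 is an arbitrary run count):
# count_failures 20 mpirun -np 2 --hostfile hostfile --map-by node ./a.out
```

Note that this counts only runs that end with an error status. The PMIX ERROR lines also appear in the runs that complete, so they would have to be detected separately, for example by searching the captured output.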
I have also submitted the issue on GitHub:
https://github.com/open-mpi/ompi/issues/11421
Any help is appreciated, even if it is only hints about which parts of the
code I could hack on that may be causing this issue.
Thank you.