I have a new question about Checkpoint/Restart.
The 16th question is as follows:
(16) If a program uses the MPI_Init_thread function,
a checkpoint cannot be taken by the opal_cr_thread_fn thread.
Framework : ompi/mpi
Component : c
The source file : ompi/mpi/c/init_thread.c
The function name : MPI_Init_thread
Here's the code that causes the problem:
#define LOOP 60
    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 0) {
        printf(" rank=%d 60 seconds sleeping start \n",rank); fflush(stdout);
    }
    for (i=0;i<LOOP;i++) { /* Take checkpoint while the process is in this loop. */
        sleep(1);
        if (rank == 0) {
            printf(" rank=%d loop=%d \n",rank,i); fflush(stdout);
        }
    }
    if (rank == 0) {
        printf(" rank=%d 60 seconds sleeping finished \n",rank); fflush(stdout);
    }
    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 0) {
        printf(" rank=%d executes Finalize \n",rank); fflush(stdout);
    }
    MPI_Finalize();
* This problem can be reproduced even when running a single process:
mpiexec -n 1 .... ./a.out
* Take the checkpoint while the process is inside the 60-second loop.
* Example restart output for a program that uses MPI_Init:
-bash-3.2$ ompi-restart ompi_global_snapshot_20762.ckpt
rank=0 loop=42
rank=0 loop=43
rank=0 loop=44
rank=0 loop=45
rank=0 loop=46
rank=0 loop=47
rank=0 loop=48
rank=0 loop=49
rank=0 loop=50
rank=0 loop=51
rank=0 loop=52
rank=0 loop=53
rank=0 loop=54
rank=0 loop=55
rank=0 loop=56
rank=0 loop=57
rank=0 loop=58
rank=0 loop=59
rank=0 60 seconds sleeping finished
rank=0 executes Finalize
rank=0 program end
Because the checkpoint was taken by the opal_cr_thread_fn function immediately
when the checkpoint operation was requested,
the program restarts from inside the loop.
* Example restart output for a program that uses MPI_Init_thread:
-bash-3.2$ ompi-restart ompi_global_snapshot_20660.ckpt
rank=0 executes Finalize
rank=0 program end
The checkpoint was actually taken inside the MPI_Barrier call
that follows the loop.
Therefore, the program restarts from that MPI_Barrier call.
* I think the problem is that MPI_Init_thread does not execute
OPAL_CR_INIT_LIBRARY.
As a result, opal_cr_thread_is_active remains false,
and the following while loop does not terminate.
    /*
     * Wait to become active
     */
    while( !opal_cr_thread_is_active && !opal_cr_thread_is_done) {
        sched_yield();
    }
* MPI_Init_thread uses OPAL_CR_ENTER_LIBRARY and OPAL_CR_EXIT_LIBRARY.
I think this is not correct.
Because MPI_Init_thread is an MPI initialization function,
I think it should follow the same specification as MPI_Init.
-bash-3.2$ cat t_mpi_question-16.c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include "mpi.h"

#define LOOP 60

int main(int ac,char **av)
{
    int i;
    int rank,size;
    int required,provided,provided_for_query;

    required = MPI_THREAD_SINGLE;
    provided = -1;
    provided_for_query = -1;
#if defined(USE_INITTHREAD)
    MPI_Init_thread(&ac,&av,required,&provided);
    MPI_Query_thread(&provided_for_query);
#else
    MPI_Init(&ac,&av);
#endif
    MPI_Comm_rank(MPI_COMM_WORLD,&rank);
    MPI_Comm_size(MPI_COMM_WORLD,&size);
    if (rank == 0) {
        printf(" rank=%d sz=%d required=%d provided=%d provided_for_query=%d \n"
               ,rank,size,required,provided,provided_for_query); fflush(stdout);
    }
    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 0) {
        printf(" rank=%d 60 seconds sleeping start \n",rank); fflush(stdout);
    }
    for (i=0;i<LOOP;i++) {
        sleep(1);
        if (rank == 0) {
            printf(" rank=%d loop=%d \n",rank,i); fflush(stdout);
        }
    }
    if (rank == 0) {
        printf(" rank=%d 60 seconds sleeping finished \n",rank); fflush(stdout);
    }
    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 0) {
        printf(" rank=%d executes Finalize \n",rank); fflush(stdout);
    }
    MPI_Finalize();
    if (rank == 0) {
        printf(" rank=%d program end \n",rank); fflush(stdout);
    }
    return(0);
}