Dear All,

I am running a simple MPI application, which looks as follows:

######################################

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <signal.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello\n");
    sleep(15);
    printf("Hello again\n");
    sleep(15);
    printf("Final Hello\n");
    sleep(15);
    printf("bye \n");
    MPI_Finalize();
    return 0;
}
#################################

The application checkpoints correctly, but when I try to restart it, it gives
the following errors:

######################################

ompi-restart ompi_global_snapshot_380.ckpt
Hello again
[sun06:00381] *** Process received signal ***
[sun06:00381] Signal: Bus error (7)
[sun06:00381] Signal code:  (2)
[sun06:00381] Failing at address: 0xae7cb054
[sun06:00381] [ 0] [0xb7f8640c]
[sun06:00381] [ 1] /home/raj/openmpisof/lib/libopen-pal.so.0(opal_progress+0x123) [0xb7b95456]
[sun06:00381] [ 2] /home/raj/openmpisof/lib/libopen-pal.so.0 [0xb7bcb093]
[sun06:00381] [ 3] /home/raj/openmpisof/lib/libopen-pal.so.0 [0xb7bcae97]
[sun06:00381] [ 4] /home/raj/openmpisof/lib/libopen-pal.so.0(opal_crs_blcr_checkpoint+0x187) [0xb7bca69b]
[sun06:00381] [ 5] /home/raj/openmpisof/lib/libopen-pal.so.0(opal_cr_inc_core+0xc3) [0xb7b970bd]
[sun06:00381] [ 6] /home/raj/openmpisof/lib/libopen-rte.so.0 [0xb7cab06f]
[sun06:00381] [ 7] /home/raj/openmpisof/lib/libopen-pal.so.0(opal_cr_test_if_checkpoint_ready+0x129) [0xb7b96fca]
[sun06:00381] [ 8] /home/raj/openmpisof/lib/libopen-pal.so.0 [0xb7b97698]
[sun06:00381] [ 9] /lib/libpthread.so.0 [0xb7ac4f3b]
[sun06:00381] [10] /lib/libc.so.6(clone+0x5e) [0xb7a4bbee]
[sun06:00381] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 399 on node sun06 exited on signal 
7 (Bus error).
--------------------------------------------------------------------------
#####################################################################

I am running it as follows:

################################################################
mpirun -am ft-enable-cr -np 2 -mca btl ^openib -mca snapc_base_global_snapshot_dir /tmp mpisleepbas.
################################################################

Once a checkpoint is taken, I have to copy it to the home directory and then
try to restart it.

Please note that if I use -np 1, the restart works fine. The problem is mainly
when the application has more than one process running.


Any help would be much appreciated.


Raj





