Dear All,

I am running a simple MPI application which looks as follows:
######################################
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <signal.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("Hello\n");
    sleep(15);
    printf("Hello again\n");
    sleep(15);
    printf("Final Hello\n");
    sleep(15);
    printf("bye \n");

    MPI_Finalize();
    return 0;
}
#################################

When I run my application (using the command given below), it checkpoints correctly, but when I try to restart it, it gives the following errors:

######################################
ompi-restart ompi_global_snapshot_380.ckpt
Hello again
[sun06:00381] *** Process received signal ***
[sun06:00381] Signal: Bus error (7)
[sun06:00381] Signal code: (2)
[sun06:00381] Failing at address: 0xae7cb054
[sun06:00381] [ 0] [0xb7f8640c]
[sun06:00381] [ 1] /home/raj/openmpisof/lib/libopen-pal.so.0(opal_progress+0x123) [0xb7b95456]
[sun06:00381] [ 2] /home/raj/openmpisof/lib/libopen-pal.so.0 [0xb7bcb093]
[sun06:00381] [ 3] /home/raj/openmpisof/lib/libopen-pal.so.0 [0xb7bcae97]
[sun06:00381] [ 4] /home/raj/openmpisof/lib/libopen-pal.so.0(opal_crs_blcr_checkpoint+0x187) [0xb7bca69b]
[sun06:00381] [ 5] /home/raj/openmpisof/lib/libopen-pal.so.0(opal_cr_inc_core+0xc3) [0xb7b970bd]
[sun06:00381] [ 6] /home/raj/openmpisof/lib/libopen-rte.so.0 [0xb7cab06f]
[sun06:00381] [ 7] /home/raj/openmpisof/lib/libopen-pal.so.0(opal_cr_test_if_checkpoint_ready+0x129) [0xb7b96fca]
[sun06:00381] [ 8] /home/raj/openmpisof/lib/libopen-pal.so.0 [0xb7b97698]
[sun06:00381] [ 9] /lib/libpthread.so.0 [0xb7ac4f3b]
[sun06:00381] [10] /lib/libc.so.6(clone+0x5e) [0xb7a4bbee]
[sun06:00381] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 399 on node sun06 exited on signal 7 (Bus error).
--------------------------------------------------------------------------
#####################################################################

I am running it as follows:

################################################################
mpirun -am ft-enable-cr -np 2 -mca btl ^openib -mca snapc_base_global_snapshot_dir /tmp mpisleepbas.
################################################################

Once a checkpoint is taken, I have to copy it to the home directory and then try to restart it. Please note that if I use -np 1, it works fine when I restart it. The problem mainly occurs when the application has more than one process running.

Any help would be much appreciated.

Raj