I downloaded the nightly build of the trunk (r23756) and found that the 
checkpoint functionality is broken. My MPI program is a simple hello-world 
program that increments and prints a counter once every few seconds.

These are the steps I followed:
1. Run mpirun with NP set to 32.
2. Call ompi-checkpoint with the "-term" option; it terminates the program 
after successfully creating the checkpoint image.
3. Call ompi-restart with the checkpoint image; it terminates with a 
segmentation fault.

I tried the same steps with 1.5rc6 and 1.4.2, and with both I am able to 
restart the process from the checkpoint image. Am I missing a step here? 
Reducing the number of processes did not change the behavior.

The output from my checkpoint and restart attempt follows:

=== Output START ==================================
mpirun -am ft-enable-cr --mca opal_cr_enable_timer 1 --mca 
sstore_stage_global_is_shared 1 --mca sstore_base_global_snapshot_dir 
/scratch/hpl005/UIT_test/amudar/FWI --mca mpi_paffinity_alone 1  -np 32 
-hostfile hostfile-32 ../hellompi
Hello, world, I am 0 of 32
  1 Hello, world, I am 4 of 32
  1 Hello, world, I am 5 of 32
  1 Hello, world, I am 1 of 32
  1 Hello, world, I am 9 of 32
  1 Hello, world, I am 8 of 32
  1 Hello, world, I am 2 of 32
  1 Hello, world, I am 7 of 32
  1 Hello, world, I am 16 of 32
  1 Hello, world, I am 10 of 32
  1 Hello, world, I am 14 of 32
  1 Hello, world, I am 3 of 32
  1 Hello, world, I am 11 of 32
  1 Hello, world, I am 13 of 32
  1 Hello, world, I am 15 of 32
  1 Hello, world, I am 20 of 32
  1 Hello, world, I am 18 of 32
  1 Hello, world, I am 17 of 32
  1 Hello, world, I am 23 of 32
  1 Hello, world, I am 24 of 32
  1 Hello, world, I am 22 of 32
  1 Hello, world, I am 19 of 32
  1 Hello, world, I am 21 of 32
  1 Hello, world, I am 28 of 32
  1 Hello, world, I am 6 of 32
  1 Hello, world, I am 26 of 32
  1 Hello, world, I am 27 of 32
  1 Hello, world, I am 25 of 32
  1 Hello, world, I am 30 of 32
  1 Hello, world, I am 31 of 32
  1 Hello, world, I am 29 of 32
  1 Hello, world, I am 12 of 32
  1   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2 
  2   2   2   2   2   2   2   2   2   2   2   2   2   3   3   3   3   3   3   3 
  3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3 
  3   3   3   3   3   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4 
  4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   4   5   5   5 
  5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5   5 
  5   5   5   5   5   5   5   5   5   6   6   6   6   6   6   6   6   6   6   6 
  6   6   6   6   6   6   6   6   6   6   6   6   6   6   6   6   6   6   6   6 
  6   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7   7 
  7   7   7   7   7   7   7   7   7   7   7   7   7   8   8   8   8   8   8   8 
  8   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8 
  8   8   8   8   8   9   9   9   9   9   9   9   9   9   9   9   9   9   9   9 
  9   9   9   9   9   9   9   9   9   9   9   9   9   9   9   9   9 
[hplcnlj158:13937] OPAL CR Timing: ******************** Summary Begin
[hplcnlj158:13937] opal_cr: timing: Start Entry Point    =       0.01 s       1.22 s      0.57
[hplcnlj158:13937] opal_cr: timing: CRCP Protocol        =       0.43 s       1.22 s     35.45
[hplcnlj158:13937] opal_cr: timing: P2P Suspend          =       0.00 s       1.22 s      0.34
[hplcnlj158:13937] opal_cr: timing: Checkpoint           =       0.64 s       1.22 s     52.87
[hplcnlj158:13937] opal_cr: timing: P2P Reactivation     = -1284678958.98 s       1.22 s     -105438618322.51
[hplcnlj158:13937] opal_cr: timing: CRCP Cleanup         =       0.00 s       1.22 s      0.00
[hplcnlj158:13937] opal_cr: timing: Finish Entry Point   = 1284678959.11 s       1.22 s     105438618333.28
[hplcnlj158:13937] OPAL CR Timing: ******************** Summary End
hplcnlj158> ompi-restart -am ft-enable-cr --mca opal_cr_enable_timer 1 
-hostfile hostfile-32 --mca sstore_stage_global_is_shared 1 --mca 
sstore_base_global_snapshot_dir /scratch/hpl005/UIT_test/amudar/FWI 
ompi_global_snapshot_13933.ckpt
  9   9   9   9   9   9   9   9   9   9   9   9   9   9   9   9   9   9   9   9 
  9   9   9   9   9   9   9   9   9   9   9   9 [hplcnlj158:13937] *** Process 
received signal ***
[hplcnlj158:13937] Signal: Segmentation fault (11)
[hplcnlj158:13937] Signal code: Address not mapped (1)
[hplcnlj158:13937] Failing at address: 0x2aaa00000001
[hplcnlj158:13937] [ 0] /lib64/libpthread.so.0 [0x2b4019a064c0]
[hplcnlj158:13937] [ 1] 
/users/amudar/openmpi-1.7/lib/libmca_common_sm.so.0(mca_common_sm_param_register+0x262)
 [0x2aaaad96628a]
[hplcnlj158:13937] [ 2] /users/amudar/openmpi-1.7/lib/openmpi/mca_btl_sm.so 
[0x2aaaaf0a55e8]
[hplcnlj158:13937] [ 3] /users/amudar/openmpi-1.7/lib/libmpi.so.0 
[0x2b4018c3c11b]
[hplcnlj158:13937] [ 4] 
/users/amudar/openmpi-1.7/lib/libmpi.so.0(mca_base_components_open+0x3ef) 
[0x2b4018c3b70b]
[hplcnlj158:13937] [ 5] 
/users/amudar/openmpi-1.7/lib/libmpi.so.0(mca_btl_base_open+0xfd) 
[0x2b4018b620fe]
[hplcnlj158:13937] [ 6] /users/amudar/openmpi-1.7/lib/openmpi/mca_bml_r2.so 
[0x2aaaadd9e4fb]
[hplcnlj158:13937] [ 7] /users/amudar/openmpi-1.7/lib/openmpi/mca_pml_ob1.so 
[0x2aaaae5fa429]
[hplcnlj158:13937] [ 8] /users/amudar/openmpi-1.7/lib/openmpi/mca_pml_crcpw.so 
[0x2aaaadfadce6]
[hplcnlj158:13937] [ 9] /users/amudar/openmpi-1.7/lib/libmpi.so.0 
[0x2b4018b01a0d]
[hplcnlj158:13937] [10] 
/users/amudar/openmpi-1.7/lib/libmpi.so.0(ompi_cr_coord+0xc0) [0x2b4018b017ba]
[hplcnlj158:13937] [11] 
/users/amudar/openmpi-1.7/lib/libmpi.so.0(opal_cr_inc_core_recover+0xed) 
[0x2b4018c0efab]
[hplcnlj158:13937] [12] /users/amudar/openmpi-1.7/lib/openmpi/mca_snapc_full.so 
[0x2aaaabd280fc]
[hplcnlj158:13937] [13] 
/users/amudar/openmpi-1.7/lib/libmpi.so.0(opal_cr_test_if_checkpoint_ready+0x11b)
 [0x2b4018c0ecd3]
[hplcnlj158:13937] [14] /users/amudar/openmpi-1.7/lib/libmpi.so.0 
[0x2b4018c0f6e7]
[hplcnlj158:13937] [15] /lib64/libpthread.so.0 [0x2b40199fe367]
[hplcnlj158:13937] [16] /lib64/libc.so.6(clone+0x6d) [0x2b4019ce5f7d]
[hplcnlj158:13937] *** End of error message ***
[hplcnlj161:00637] *** Process received signal ***
[hplcnlj161:00637] Signal: Segmentation fault (11)
[hplcnlj161:00637] Signal code: Address not mapped (1)
[hplcnlj161:00637] Failing at address: 0x2aaa00000001
[hplcnlj161:00649] *** Process received signal ***
[hplcnlj161:00649] Signal: Segmentation fault (11)
[hplcnlj161:00649] Signal code: Address not mapped (1)
[hplcnlj161:00649] Failing at address: 0x2aaa00000001
/users/amudar/Fix_for_pidinuse/cr_restart: line 5: 14012 Segmentation fault     
 /usr/blcr/bin/cr_restart --no-restore-pid "$@"
[hplcnlj161:00643] *** Process received signal ***
[hplcnlj161:00643] Signal: Segmentation fault (11)
[hplcnlj161:00643] Signal code: Address not mapped (1)
[hplcnlj161:00643] Failing at address: 0x2aaa00000001
[hplcnlj161:00640] *** Process received signal ***
[hplcnlj161:00640] Signal: Segmentation fault (11)
[hplcnlj161:00640] Signal code: Address not mapped (1)
[hplcnlj161:00640] Failing at address: 0x2aaa00000001
[hplcnlj161:00636] *** Process received signal ***
[hplcnlj161:00652] *** Process received signal ***
[hplcnlj161:00652] Signal: Segmentation fault (11)
[hplcnlj161:00652] Signal code: Address not mapped (1)
[hplcnlj161:00652] Failing at address: 0x2aaa00000001
[hplcnlj161:00636] Signal: Segmentation fault (11)
[hplcnlj161:00636] Signal code: Address not mapped (1)
[hplcnlj161:00636] Failing at address: 0x2aaa00000001
[hplcnlj161:00637] [ 0] /lib64/libpthread.so.0 [0x2b86c74694c0]
[hplcnlj161:00637] [ 1] 
/users/amudar/openmpi-1.7/lib/libmca_common_sm.so.0(mca_common_sm_param_register+0x262)
 [0x2aaaad96628a]
[hplcnlj161:00637] [ 2] /users/amudar/openmpi-1.7/lib/openmpi/mca_btl_sm.so 
[0x2aaaaf0a55e8]
[hplcnlj161:00637] [ 3] /users/amudar/openmpi-1.7/lib/libmpi.so.0 
[0x2b86c669f11b]
[hplcnlj161:00637] [ 4] 
/users/amudar/openmpi-1.7/lib/libmpi.so.0(mca_base_components_open+0x3ef) 
[0x2b86c669e70b]
[hplcnlj161:00637] [ 5] 
/users/amudar/openmpi-1.7/lib/libmpi.so.0(mca_btl_base_open+0xfd) 
[0x2b86c65c50fe]
[hplcnlj161:00637] [ 6] /users/amudar/openmpi-1.7/lib/openmpi/mca_bml_r2.so 
[0x2aaaadd9e4fb]
[hplcnlj161:00637] [ 7] /users/amudar/openmpi-1.7/lib/openmpi/mca_pml_ob1.so 
[0x2aaaae5fa429]
[hplcnlj161:00637] [ 8] /users/amudar/openmpi-1.7/lib/openmpi/mca_pml_crcpw.so 
[0x2aaaadfadce6]
[hplcnlj161:00637] [ 9] /users/amudar/openmpi-1.7/lib/libmpi.so.0 
[0x2b86c6564a0d]
[hplcnlj161:00637] [10] 
/users/amudar/openmpi-1.7/lib/libmpi.so.0(ompi_cr_coord+0xc0) [0x2b86c65647ba]
[hplcnlj161:00637] [11] 
/users/amudar/openmpi-1.7/lib/libmpi.so.0(opal_cr_inc_core_recover+0xed) 
[0x2b86c6671fab]
[hplcnlj161:00637] [12] /users/amudar/openmpi-1.7/lib/openmpi/mca_snapc_full.so 
[0x2aaaabd280fc]
[hplcnlj161:00637] [13] 
/users/amudar/openmpi-1.7/lib/libmpi.so.0(opal_cr_test_if_checkpoint_ready+0x11b)
 [0x2b86c6671cd3]
[hplcnlj161:00637] [14] /users/amudar/openmpi-1.7/lib/libmpi.so.0 
[0x2b86c66726e7]
[hplcnlj161:00637] [15] /lib64/libpthread.so.0 [0x2b86c7461367]
[hplcnlj161:00637] [16] /lib64/libc.so.6(clone+0x6d) [0x2b86c7748f7d]
[hplcnlj161:00637] *** End of error message ***
[hplcnlj161:00649] [ 0] /lib64/libpthread.so.0 [0x2b7bfa6204c0]
[hplcnlj161:00649] [ 1] 
/users/amudar/openmpi-1.7/lib/libmca_common_sm.so.0(mca_common_sm_param_register+0x262)
 [0x2aaaad96628a]
[hplcnlj161:00649] [ 2] /users/amudar/openmpi-1.7/lib/openmpi/mca_btl_sm.so 
[0x2aaaaf0a55e8]
[hplcnlj161:00649] [ 3] /users/amudar/openmpi-1.7/lib/libmpi.so.0 
[0x2b7bf985611b]
[hplcnlj161:00649] [ 4] 
/users/amudar/openmpi-1.7/lib/libmpi.so.0(mca_base_components_open+0x3ef) 
[0x2b7bf985570b]
[hplcnlj161:00649] [ 5] 
/users/amudar/openmpi-1.7/lib/libmpi.so.0(mca_btl_base_open+0xfd) 
[0x2b7bf977c0fe]
[hplcnlj161:00649] [ 6] /users/amudar/openmpi-1.7/lib/openmpi/mca_bml_r2.so 
[0x2aaaadd9e4fb]
[hplcnlj161:00649] [ 7] /users/amudar/openmpi-1.7/lib/openmpi/mca_pml_ob1.so 
[0x2aaaae5fa429]
[hplcnlj161:00649] [ 8] /users/amudar/openmpi-1.7/lib/openmpi/mca_pml_crcpw.so 
[0x2aaaadfadce6]
[hplcnlj161:00649] [ 9] /users/amudar/openmpi-1.7/lib/libmpi.so.0 
[0x2b7bf971ba0d]
[hplcnlj161:00649] [10] 
/users/amudar/openmpi-1.7/lib/libmpi.so.0(ompi_cr_coord+0xc0) [0x2b7bf971b7ba]
[hplcnlj161:00649] [11] 
/users/amudar/openmpi-1.7/lib/libmpi.so.0(opal_cr_inc_core_recover+0xed) 
[0x2b7bf9828fab]
[hplcnlj161:00649] [12] /users/amudar/openmpi-1.7/lib/openmpi/mca_snapc_full.so 
[0x2aaaabd280fc]
[hplcnlj161:00649] [13] 
/users/amudar/openmpi-1.7/lib/libmpi.so.0(opal_cr_test_if_checkpoint_ready+0x11b)
 [0x2b7bf9828cd3]
[hplcnlj161:00649] [14] /users/amudar/openmpi-1.7/lib/libmpi.so.0 
[0x2b7bf98296e7]
[hplcnlj161:00649] [15] /lib64/libpthread.so.0 [0x2b7bfa618367]
[hplcnlj161:00649] [16] /lib64/libc.so.6(clone+0x6d) [0x2b7bfa8fff7d]
[hplcnlj161:00649] *** End of error message ***
======= Output END =======================
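
A side note on the OPAL CR timing summary in the output above: the "P2P 
Reactivation" and "Finish Entry Point" values have the magnitude of a Unix 
epoch timestamp rather than a duration. The short Python sketch below is only 
a sanity check on the printed numbers and is not part of Open MPI. It decodes 
the value as a date and checks whether the two outliers cancel out. Reading 
the value as seconds since the epoch is an assumption, and the variable names 
are illustrative.

```python
from datetime import datetime, timezone
import math

# Per-stage durations in seconds, copied from the timing summary above.
stages = {
    "Start Entry Point": 0.01,
    "CRCP Protocol": 0.43,
    "P2P Suspend": 0.00,
    "Checkpoint": 0.64,
    "P2P Reactivation": -1284678958.98,
    "CRCP Cleanup": 0.00,
    "Finish Entry Point": 1284678959.11,
}

# Assumption: the huge magnitude is seconds since the Unix epoch.
as_date = datetime.fromtimestamp(stages["Finish Entry Point"], tz=timezone.utc)
print("Finish Entry Point as a UTC date:", as_date.date().isoformat())

# The two outliers have opposite signs and nearly cancel each other.
residual_s = round(stages["P2P Reactivation"] + stages["Finish Entry Point"], 2)
print("Sum of the two outliers:", residual_s, "s")

# Sum of every stage, for comparison with the 1.22 s total in the summary.
total_s = round(math.fsum(stages.values()), 2)
print("Sum of all stages:", total_s, "s")
```

The decoded date falls in September 2010, and the stage values add up to 
roughly the 1.22 s total that the summary reports. One possible reading, which 
is a guess and not confirmed by anything in the output, is that one timer slot 
was never set, so two adjacent stages each absorbed a raw timestamp with 
opposite sign. This may be unrelated to the segmentation fault on restart.
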
Ananda B Mudar, PMP
Senior Technical Architect
Wipro Technologies
Ph: 972 765 8093
