Hi


I believe the new common shared memory component was committed to the
trunk in late August. I did not try this trunk version until last week,
and I am seeing a problem with this component, specifically with the
checkpoint functionality: I cannot checkpoint any program with the
latest trunk version. Am I missing something here? Are there other
options I should be using to enable checkpointing with the shared
memory component?



However, if I disable the shared memory component and use only self, tcp,
and openib (--mca btl self,tcp,openib), I can checkpoint successfully.



Following are the options I have used with mpirun:



mpirun -am ft-enable-cr --mca opal_cr_enable_timer 1 \
    --mca sstore_stage_global_is_shared 1 \
    --mca sstore_base_global_snapshot_dir /scratch/hpl005/UIT_test/amudar/FWI \
    --mca mpi_paffinity_alone 1 -np 32 -hostfile hostfile-32 ../hellompi
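
For comparison, here is a minimal sketch of the variant that checkpoints successfully. It assumes every option stays the same as in the command above and only the BTL list (--mca btl self,tcp,openib) is added. The helper name build_cmd is hypothetical and exists only for this illustration; it prints the command line as a dry run and does not launch anything.

```shell
#!/bin/sh
# Dry-run sketch: print the mpirun command line with an explicit BTL list.
# build_cmd is a hypothetical helper; $1 is the comma-separated BTL list.
# All other options are copied from the original command (assumed unchanged).
build_cmd() {
    printf '%s ' mpirun -am ft-enable-cr \
        --mca btl "$1" \
        --mca opal_cr_enable_timer 1 \
        --mca sstore_stage_global_is_shared 1 \
        --mca sstore_base_global_snapshot_dir /scratch/hpl005/UIT_test/amudar/FWI \
        --mca mpi_paffinity_alone 1 \
        -np 32 -hostfile hostfile-32 ../hellompi
    printf '\n'
}

# Leave out the shared memory BTL by listing only self, tcp, and openib.
build_cmd self,tcp,openib
```

Open MPI also accepts an exclusion list for component selection (for example, a BTL value of ^sm). I have not tested whether that behaves the same way as the explicit list in this case.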



Please note that hellompi is a very simple program without any
collective calls. When I issue a checkpoint, the program fails with the
following messages:



[hplcnlj158:13937] Signal: Segmentation fault (11)
[hplcnlj158:13937] Signal code: Address not mapped (1)
[hplcnlj158:13937] Failing at address: 0x2aaa00000001
[hplcnlj158:13937] [ 0] /lib64/libpthread.so.0 [0x2b4019a064c0]
[hplcnlj158:13937] [ 1] /users/amudar/openmpi-1.7/lib/libmca_common_sm.so.0(mca_common_sm_param_register+0x262) [0x2aaaad96628a]
[hplcnlj158:13937] [ 2] /users/amudar/openmpi-1.7/lib/openmpi/mca_btl_sm.so [0x2aaaaf0a55e8]
[hplcnlj158:13937] [ 3] /users/amudar/openmpi-1.7/lib/libmpi.so.0 [0x2b4018c3c11b]
[hplcnlj158:13937] [ 4] /users/amudar/openmpi-1.7/lib/libmpi.so.0(mca_base_components_open+0x3ef) [0x2b4018c3b70b]
[hplcnlj158:13937] [ 5] /users/amudar/openmpi-1.7/lib/libmpi.so.0(mca_btl_base_open+0xfd) [0x2b4018b620fe]
[hplcnlj158:13937] [ 6] /users/amudar/openmpi-1.7/lib/openmpi/mca_bml_r2.so [0x2aaaadd9e4fb]
[hplcnlj158:13937] [ 7] /users/amudar/openmpi-1.7/lib/openmpi/mca_pml_ob1.so [0x2aaaae5fa429]
[hplcnlj158:13937] [ 8] /users/amudar/openmpi-1.7/lib/openmpi/mca_pml_crcpw.so [0x2aaaadfadce6]
[hplcnlj158:13937] [ 9] /users/amudar/openmpi-1.7/lib/libmpi.so.0 [0x2b4018b01a0d]
[hplcnlj158:13937] [10] /users/amudar/openmpi-1.7/lib/libmpi.so.0(ompi_cr_coord+0xc0) [0x2b4018b017ba]
[hplcnlj158:13937] [11] /users/amudar/openmpi-1.7/lib/libmpi.so.0(opal_cr_inc_core_recover+0xed) [0x2b4018c0efab]
[hplcnlj158:13937] [12] /users/amudar/openmpi-1.7/lib/openmpi/mca_snapc_full.so [0x2aaaabd280fc]
[hplcnlj158:13937] [13] /users/amudar/openmpi-1.7/lib/libmpi.so.0(opal_cr_test_if_checkpoint_ready+0x11b) [0x2b4018c0ecd3]
[hplcnlj158:13937] [14] /users/amudar/openmpi-1.7/lib/libmpi.so.0 [0x2b4018c0f6e7]
[hplcnlj158:13937] [15] /lib64/libpthread.so.0 [0x2b40199fe367]
[hplcnlj158:13937] [16] /lib64/libc.so.6(clone+0x6d) [0x2b4019ce5f7d]
[hplcnlj158:13937] *** End of error message ***
[hplcnlj161:00637] *** Process received signal ***
[hplcnlj161:00637] Signal: Segmentation fault (11)
[hplcnlj161:00637] Signal code: Address not mapped (1)
[hplcnlj161:00637] Failing at address: 0x2aaa00000001
[hplcnlj161:00649] *** Process received signal ***
[hplcnlj161:00649] Signal: Segmentation fault (11)
[hplcnlj161:00649] Signal code: Address not mapped (1)
[hplcnlj161:00649] Failing at address: 0x2aaa00000001
/users/amudar/Fix_for_pidinuse/cr_restart: line 5: 14012 Segmentation fault      /usr/blcr/bin/cr_restart --no-restore-pid "$@"
[hplcnlj161:00643] *** Process received signal ***
[hplcnlj161:00643] Signal: Segmentation fault (11)
[hplcnlj161:00643] Signal code: Address not mapped (1)
[hplcnlj161:00643] Failing at address: 0x2aaa00000001
[hplcnlj161:00640] *** Process received signal ***
[hplcnlj161:00640] Signal: Segmentation fault (11)
[hplcnlj161:00640] Signal code: Address not mapped (1)
[hplcnlj161:00640] Failing at address: 0x2aaa00000001
[hplcnlj161:00636] *** Process received signal ***
[hplcnlj161:00652] *** Process received signal ***
[hplcnlj161:00652] Signal: Segmentation fault (11)
[hplcnlj161:00652] Signal code: Address not mapped (1)
[hplcnlj161:00652] Failing at address: 0x2aaa00000001
[hplcnlj161:00636] Signal: Segmentation fault (11)
[hplcnlj161:00636] Signal code: Address not mapped (1)
[hplcnlj161:00636] Failing at address: 0x2aaa00000001
[hplcnlj161:00637] [ 0] /lib64/libpthread.so.0 [0x2b86c74694c0]
[hplcnlj161:00637] [ 1] /users/amudar/openmpi-1.7/lib/libmca_common_sm.so.0(mca_common_sm_param_register+0x262) [0x2aaaad96628a]
[hplcnlj161:00637] [ 2] /users/amudar/openmpi-1.7/lib/openmpi/mca_btl_sm.so [0x2aaaaf0a55e8]
[hplcnlj161:00637] [ 3] /users/amudar/openmpi-1.7/lib/libmpi.so.0 [0x2b86c669f11b]
[hplcnlj161:00637] [ 4] /users/amudar/openmpi-1.7/lib/libmpi.so.0(mca_base_components_open+0x3ef) [0x2b86c669e70b]
[hplcnlj161:00637] [ 5] /users/amudar/openmpi-1.7/lib/libmpi.so.0(mca_btl_base_open+0xfd) [0x2b86c65c50fe]
[hplcnlj161:00637] [ 6] /users/amudar/openmpi-1.7/lib/openmpi/mca_bml_r2.so [0x2aaaadd9e4fb]
[hplcnlj161:00637] [ 7] /users/amudar/openmpi-1.7/lib/openmpi/mca_pml_ob1.so [0x2aaaae5fa429]
[hplcnlj161:00637] [ 8] /users/amudar/openmpi-1.7/lib/openmpi/mca_pml_crcpw.so [0x2aaaadfadce6]
[hplcnlj161:00637] [ 9] /users/amudar/openmpi-1.7/lib/libmpi.so.0 [0x2b86c6564a0d]
[hplcnlj161:00637] [10] /users/amudar/openmpi-1.7/lib/libmpi.so.0(ompi_cr_coord+0xc0) [0x2b86c65647ba]
[hplcnlj161:00637] [11] /users/amudar/openmpi-1.7/lib/libmpi.so.0(opal_cr_inc_core_recover+0xed) [0x2b86c6671fab]
[hplcnlj161:00637] [12] /users/amudar/openmpi-1.7/lib/openmpi/mca_snapc_full.so [0x2aaaabd280fc]
[hplcnlj161:00637] [13] /users/amudar/openmpi-1.7/lib/libmpi.so.0(opal_cr_test_if_checkpoint_ready+0x11b) [0x2b86c6671cd3]
[hplcnlj161:00637] [14] /users/amudar/openmpi-1.7/lib/libmpi.so.0 [0x2b86c66726e7]
[hplcnlj161:00637] [15] /lib64/libpthread.so.0 [0x2b86c7461367]
[hplcnlj161:00637] [16] /lib64/libc.so.6(clone+0x6d) [0x2b86c7748f7d]
[hplcnlj161:00637] *** End of error message ***
[hplcnlj161:00649] [ 0] /lib64/libpthread.so.0 [0x2b7bfa6204c0]
[hplcnlj161:00649] [ 1] /users/amudar/openmpi-1.7/lib/libmca_common_sm.so.0(mca_common_sm_param_register+0x262) [0x2aaaad96628a]
[hplcnlj161:00649] [ 2] /users/amudar/openmpi-1.7/lib/openmpi/mca_btl_sm.so [0x2aaaaf0a55e8]
[hplcnlj161:00649] [ 3] /users/amudar/openmpi-1.7/lib/libmpi.so.0 [0x2b7bf985611b]
[hplcnlj161:00649] [ 4] /users/amudar/openmpi-1.7/lib/libmpi.so.0(mca_base_components_open+0x3ef) [0x2b7bf985570b]
[hplcnlj161:00649] [ 5] /users/amudar/openmpi-1.7/lib/libmpi.so.0(mca_btl_base_open+0xfd) [0x2b7bf977c0fe]
[hplcnlj161:00649] [ 6] /users/amudar/openmpi-1.7/lib/openmpi/mca_bml_r2.so [0x2aaaadd9e4fb]
[hplcnlj161:00649] [ 7] /users/amudar/openmpi-1.7/lib/openmpi/mca_pml_ob1.so [0x2aaaae5fa429]
[hplcnlj161:00649] [ 8] /users/amudar/openmpi-1.7/lib/openmpi/mca_pml_crcpw.so [0x2aaaadfadce6]
[hplcnlj161:00649] [ 9] /users/amudar/openmpi-1.7/lib/libmpi.so.0 [0x2b7bf971ba0d]
[hplcnlj161:00649] [10] /users/amudar/openmpi-1.7/lib/libmpi.so.0(ompi_cr_coord+0xc0) [0x2b7bf971b7ba]
[hplcnlj161:00649] [11] /users/amudar/openmpi-1.7/lib/libmpi.so.0(opal_cr_inc_core_recover+0xed) [0x2b7bf9828fab]
[hplcnlj161:00649] [12] /users/amudar/openmpi-1.7/lib/openmpi/mca_snapc_full.so [0x2aaaabd280fc]
[hplcnlj161:00649] [13] /users/amudar/openmpi-1.7/lib/libmpi.so.0(opal_cr_test_if_checkpoint_ready+0x11b) [0x2b7bf9828cd3]
[hplcnlj161:00649] [14] /users/amudar/openmpi-1.7/lib/libmpi.so.0 [0x2b7bf98296e7]
[hplcnlj161:00649] [15] /lib64/libpthread.so.0 [0x2b7bfa618367]
[hplcnlj161:00649] [16] /lib64/libc.so.6(clone+0x6d) [0x2b7bfa8fff7d]
[hplcnlj161:00649] *** End of error message ***





Thanks

Ananda



Ananda B Mudar, PMP
Senior Technical Architect
Wipro Technologies
Ph: 972 765 8093
ananda.mu...@wipro.com



