Hi,

Just to be clear - do you see similar checkpoint performance differences in 1.5rc6 and 1.4.2 with and without shared memory enabled?

Thanks,

--
Samuel K. Gutierrez
Los Alamos National Laboratory

On Sep 21, 2010, at 9:35 AM, <ananda.mu...@wipro.com> wrote:

Hello Samuel,

This problem seems to be resolved after I moved to r23781. However, I now see a different discrepancy: the time to create the checkpoint image depends on whether shared memory is used. For this simple program, creating the checkpoint image takes about 0.4 seconds when I disable shared memory (--mca btl self,tcp,openib), but close to 6.5 seconds when I use the shared memory component. I have not seen this behavior before. Do I have to tune any other parameter to reduce the time?

Thanks
Ananda
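
For anyone trying to reproduce the timing comparison described above, here is a minimal sketch of the two launch lines being compared. The mpirun flags are copied from this thread; the helper name build_cmd is hypothetical, and the sstore and paffinity options are left out for brevity. The script only prints the commands, it does not run them.

```shell
#!/bin/sh
# Sketch only: print the two mpirun command lines under comparison.
# build_cmd is a hypothetical helper; flags come from this thread.

build_cmd() {
    # $1: comma-separated BTL list, or empty to let Open MPI pick
    #     its default components (which includes shared memory).
    cmd="mpirun -am ft-enable-cr --mca opal_cr_enable_timer 1"
    if [ -n "$1" ]; then
        cmd="$cmd --mca btl $1"
    fi
    echo "$cmd -np 32 -hostfile hostfile-32 ../hellompi"
}

# Case 1: default BTL selection, shared memory in use (about 6.5 s reported)
build_cmd ""

# Case 2: shared memory disabled (about 0.4 s reported)
build_cmd "self,tcp,openib"
```

With each job running, the checkpoint itself would be requested from a second terminal with ompi-checkpoint and the PID of the corresponding mpirun, wrapped in "time" if an external wall-clock figure is wanted. The numbers reported above work out to roughly a 16x difference between the two cases.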
Hi Ananda,

This issue should be resolved in r23781. Please let me know if it is
not.

Thanks!

--
Samuel K. Gutierrez
Los Alamos National Laboratory
On Sep 20, 2010, at 11:26 AM, <ananda.mudar_at_[hidden]> wrote:
> I used the following options to build:
> ./configure CC=/usr/bin/gcc CXX=/usr/bin/c++ F77=/usr/bin/gfortran \
>   FC=/usr/bin/gfortran --prefix /users/amudar/openmpi-1.7 \
>   --with-tm=/usr/local/pbs --with-openib --with-threads=posix \
>   --enable-mpi-thread-multiple --enable-ft-thread --enable-debug \
>   --with-ft=cr --with-blcr=/usr/blcr --with-blcr-libdir=/usr/blcr/lib
>
> Also, please note that this is with the r23756 build.
>
> Let me know if you need any other information.
>
> Thanks
> Ananda
> Let me take a look at it. How did you configure your build?
> Thanks,
>
> --
> Samuel K. Gutierrez
> Los Alamos National Laboratory
> On Sep 20, 2010, at 10:14 AM, <ananda.mudar_at_[hidden]> wrote:
> > Hi
> >
> > I believe the new common shared memory component was committed to
> > the trunk sometime towards the later part of August. I had not tried
> > this trunk version until last week and I have seen some discrepancy
> > with this component specifically related to checkpoint
> > functionality. I am not able to checkpoint any program with the
> > latest trunk version. Am I missing something here? Should I be using
> > any other options to enable checkpoint functionality for shared
> > memory component?
> >
> > However if I disable shared memory component and use only self, tcp,
> > and openib (--mca btl self,tcp,openib), I can checkpoint
> > successfully!!
> >
> > Following are the options I have used with mpirun:
> >
> > mpirun -am ft-enable-cr --mca opal_cr_enable_timer 1 \
> >   --mca sstore_stage_global_is_shared 1 \
> >   --mca sstore_base_global_snapshot_dir /scratch/hpl005/UIT_test/amudar/FWI \
> >   --mca mpi_paffinity_alone 1 -np 32 -hostfile hostfile-32 ../hellompi
> >
> > Please note that hellompi is a very simple program without any
> > collective calls. When I issue checkpoint, this program fails with
> > the following messages:
> >
> > [hplcnlj158:13937] Signal: Segmentation fault (11)
> > [hplcnlj158:13937] Signal code: Address not mapped (1)
> > [hplcnlj158:13937] Failing at address: 0x2aaa00000001
> > [hplcnlj158:13937] [ 0] /lib64/libpthread.so.0 [0x2b4019a064c0]
> > [hplcnlj158:13937] [ 1] /users/amudar/openmpi-1.7/lib/libmca_common_sm.so.0(mca_common_sm_param_register+0x262) [0x2aaaad96628a]
> > [hplcnlj158:13937] [ 2] /users/amudar/openmpi-1.7/lib/openmpi/mca_btl_sm.so [0x2aaaaf0a55e8]
> > [hplcnlj158:13937] [ 3] /users/amudar/openmpi-1.7/lib/libmpi.so.0 [0x2b4018c3c11b]
> > [hplcnlj158:13937] [ 4] /users/amudar/openmpi-1.7/lib/libmpi.so.0(mca_base_components_open+0x3ef) [0x2b4018c3b70b]
> > [hplcnlj158:13937] [ 5] /users/amudar/openmpi-1.7/lib/libmpi.so.0(mca_btl_base_open+0xfd) [0x2b4018b620fe]
> > [hplcnlj158:13937] [ 6] /users/amudar/openmpi-1.7/lib/openmpi/mca_bml_r2.so [0x2aaaadd9e4fb]
> > [hplcnlj158:13937] [ 7] /users/amudar/openmpi-1.7/lib/openmpi/mca_pml_ob1.so [0x2aaaae5fa429]
> > [hplcnlj158:13937] [ 8] /users/amudar/openmpi-1.7/lib/openmpi/mca_pml_crcpw.so [0x2aaaadfadce6]
> > [hplcnlj158:13937] [ 9] /users/amudar/openmpi-1.7/lib/libmpi.so.0 [0x2b4018b01a0d]
> > [hplcnlj158:13937] [10] /users/amudar/openmpi-1.7/lib/libmpi.so.0(ompi_cr_coord+0xc0) [0x2b4018b017ba]
> > [hplcnlj158:13937] [11] /users/amudar/openmpi-1.7/lib/libmpi.so.0(opal_cr_inc_core_recover+0xed) [0x2b4018c0efab]
> > [hplcnlj158:13937] [12] /users/amudar/openmpi-1.7/lib/openmpi/mca_snapc_full.so [0x2aaaabd280fc]
> > [hplcnlj158:13937] [13] /users/amudar/openmpi-1.7/lib/libmpi.so.0(opal_cr_test_if_checkpoint_ready+0x11b) [0x2b4018c0ecd3]
> > [hplcnlj158:13937] [14] /users/amudar/openmpi-1.7/lib/libmpi.so.0 [0x2b4018c0f6e7]
> > [hplcnlj158:13937] [15] /lib64/libpthread.so.0 [0x2b40199fe367]
> > [hplcnlj158:13937] [16] /lib64/libc.so.6(clone+0x6d) [0x2b4019ce5f7d]
> > [hplcnlj158:13937] *** End of error message ***
> > [hplcnlj161:00637] *** Process received signal ***
> > [hplcnlj161:00637] Signal: Segmentation fault (11)
> > [hplcnlj161:00637] Signal code: Address not mapped (1)
> > [hplcnlj161:00637] Failing at address: 0x2aaa00000001
> > [hplcnlj161:00649] *** Process received signal ***
> > [hplcnlj161:00649] Signal: Segmentation fault (11)
> > [hplcnlj161:00649] Signal code: Address not mapped (1)
> > [hplcnlj161:00649] Failing at address: 0x2aaa00000001
> > /users/amudar/Fix_for_pidinuse/cr_restart: line 5: 14012 Segmentation fault      /usr/blcr/bin/cr_restart --no-restore-pid "$@"
> > [hplcnlj161:00643] *** Process received signal ***
> > [hplcnlj161:00643] Signal: Segmentation fault (11)
> > [hplcnlj161:00643] Signal code: Address not mapped (1)
> > [hplcnlj161:00643] Failing at address: 0x2aaa00000001
> > [hplcnlj161:00640] *** Process received signal ***
> > [hplcnlj161:00640] Signal: Segmentation fault (11)
> > [hplcnlj161:00640] Signal code: Address not mapped (1)
> > [hplcnlj161:00640] Failing at address: 0x2aaa00000001
> > [hplcnlj161:00636] *** Process received signal ***
> > [hplcnlj161:00652] *** Process received signal ***
> > [hplcnlj161:00652] Signal: Segmentation fault (11)
> > [hplcnlj161:00652] Signal code: Address not mapped (1)
> > [hplcnlj161:00652] Failing at address: 0x2aaa00000001
> > [hplcnlj161:00636] Signal: Segmentation fault (11)
> > [hplcnlj161:00636] Signal code: Address not mapped (1)
> > [hplcnlj161:00636] Failing at address: 0x2aaa00000001
> > [hplcnlj161:00637] [ 0] /lib64/libpthread.so.0 [0x2b86c74694c0]
> > [hplcnlj161:00637] [ 1] /users/amudar/openmpi-1.7/lib/libmca_common_sm.so.0(mca_common_sm_param_register+0x262) [0x2aaaad96628a]
> > [hplcnlj161:00637] [ 2] /users/amudar/openmpi-1.7/lib/openmpi/mca_btl_sm.so [0x2aaaaf0a55e8]
> > [hplcnlj161:00637] [ 3] /users/amudar/openmpi-1.7/lib/libmpi.so.0 [0x2b86c669f11b]
> > [hplcnlj161:00637] [ 4] /users/amudar/openmpi-1.7/lib/libmpi.so.0(mca_base_components_open+0x3ef) [0x2b86c669e70b]
> > [hplcnlj161:00637] [ 5] /users/amudar/openmpi-1.7/lib/libmpi.so.0(mca_btl_base_open+0xfd) [0x2b86c65c50fe]
> > [hplcnlj161:00637] [ 6] /users/amudar/openmpi-1.7/lib/openmpi/mca_bml_r2.so [0x2aaaadd9e4fb]
> > [hplcnlj161:00637] [ 7] /users/amudar/openmpi-1.7/lib/openmpi/mca_pml_ob1.so [0x2aaaae5fa429]
> > [hplcnlj161:00637] [ 8] /users/amudar/openmpi-1.7/lib/openmpi/mca_pml_crcpw.so [0x2aaaadfadce6]
> > [hplcnlj161:00637] [ 9] /users/amudar/openmpi-1.7/lib/libmpi.so.0 [0x2b86c6564a0d]
> > [hplcnlj161:00637] [10] /users/amudar/openmpi-1.7/lib/libmpi.so.0(ompi_cr_coord+0xc0) [0x2b86c65647ba]
> > [hplcnlj161:00637] [11] /users/amudar/openmpi-1.7/lib/libmpi.so.0(opal_cr_inc_core_recover+0xed) [0x2b86c6671fab]
> > [hplcnlj161:00637] [12] /users/amudar/openmpi-1.7/lib/openmpi/mca_snapc_full.so [0x2aaaabd280fc]
> > [hplcnlj161:00637] [13] /users/amudar/openmpi-1.7/lib/libmpi.so.0(opal_cr_test_if_checkpoint_ready+0x11b) [0x2b86c6671cd3]
> > [hplcnlj161:00637] [14] /users/amudar/openmpi-1.7/lib/libmpi.so.0 [0x2b86c66726e7]
> > [hplcnlj161:00637] [15] /lib64/libpthread.so.0 [0x2b86c7461367]
> > [hplcnlj161:00637] [16] /lib64/libc.so.6(clone+0x6d) [0x2b86c7748f7d]
> > [hplcnlj161:00637] *** End of error message ***
> > [hplcnlj161:00649] [ 0] /lib64/libpthread.so.0 [0x2b7bfa6204c0]
> > [hplcnlj161:00649] [ 1] /users/amudar/openmpi-1.7/lib/libmca_common_sm.so.0(mca_common_sm_param_register+0x262) [0x2aaaad96628a]
> > [hplcnlj161:00649] [ 2] /users/amudar/openmpi-1.7/lib/openmpi/mca_btl_sm.so [0x2aaaaf0a55e8]
> > [hplcnlj161:00649] [ 3] /users/amudar/openmpi-1.7/lib/libmpi.so.0 [0x2b7bf985611b]
> > [hplcnlj161:00649] [ 4] /users/amudar/openmpi-1.7/lib/libmpi.so.0(mca_base_components_open+0x3ef) [0x2b7bf985570b]
> > [hplcnlj161:00649] [ 5] /users/amudar/openmpi-1.7/lib/libmpi.so.0(mca_btl_base_open+0xfd) [0x2b7bf977c0fe]
> > [hplcnlj161:00649] [ 6] /users/amudar/openmpi-1.7/lib/openmpi/mca_bml_r2.so [0x2aaaadd9e4fb]
> > [hplcnlj161:00649] [ 7] /users/amudar/openmpi-1.7/lib/openmpi/mca_pml_ob1.so [0x2aaaae5fa429]
> > [hplcnlj161:00649] [ 8] /users/amudar/openmpi-1.7/lib/openmpi/mca_pml_crcpw.so [0x2aaaadfadce6]
> > [hplcnlj161:00649] [ 9] /users/amudar/openmpi-1.7/lib/libmpi.so.0 [0x2b7bf971ba0d]
> > [hplcnlj161:00649] [10] /users/amudar/openmpi-1.7/lib/libmpi.so.0(ompi_cr_coord+0xc0) [0x2b7bf971b7ba]
> > [hplcnlj161:00649] [11] /users/amudar/openmpi-1.7/lib/libmpi.so.0(opal_cr_inc_core_recover+0xed) [0x2b7bf9828fab]
> > [hplcnlj161:00649] [12] /users/amudar/openmpi-1.7/lib/openmpi/mca_snapc_full.so [0x2aaaabd280fc]
> > [hplcnlj161:00649] [13] /users/amudar/openmpi-1.7/lib/libmpi.so.0(opal_cr_test_if_checkpoint_ready+0x11b) [0x2b7bf9828cd3]
> > [hplcnlj161:00649] [14] /users/amudar/openmpi-1.7/lib/libmpi.so.0 [0x2b7bf98296e7]
> > [hplcnlj161:00649] [15] /lib64/libpthread.so.0 [0x2b7bfa618367]
> > [hplcnlj161:00649] [16] /lib64/libc.so.6(clone+0x6d) [0x2b7bfa8fff7d]
> > [hplcnlj161:00649] *** End of error message ***
> >
> >
> > Thanks
> > Ananda
> >
> > Ananda B Mudar, PMP
> > Senior Technical Architect
> > Wipro Technologies
> > Ph: 972 765 8093
> > ananda.mudar_at_[hidden]
> >
> > Please do not print this email unless it is absolutely necessary.
> >
> > The information contained in this electronic message and any
> > attachments to this message are intended for the exclusive use of
> > the addressee(s) and may contain proprietary, confidential or
> > privileged information. If you are not the intended recipient, you
> > should not disseminate, distribute or copy this e-mail. Please
> > notify the sender immediately and destroy all copies of this message
> > and any attachments.
> >
> > WARNING: Computer viruses can be transmitted via email. The
> > recipient should check this email and any attachments for the
> > presence of viruses. The company accepts no liability for any damage
> > caused by any virus transmitted by this email.
> >
> > www.wipro.com
> >
> > _______________________________________________
> > devel mailing list
> > devel_at_[hidden]
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel


Ananda B Mudar, PMP
Senior Technical Architect
Wipro Technologies
Ph: 972 765 8093
ananda.mu...@wipro.com

