[OMPI devel] SVN / Mercurial

2010-09-21 Thread Jeff Squyres
It came up on the call today, so I thought I'd throw out some references here 
on the list...

A while ago, we investigated moving OMPI development away from SVN to a 100% 
Mercurial solution.  We ended up not doing so, for a few reasons:

1. We actually kinda like the combo SVN+Mercurial solution that we use now.  A 
bunch of us OMPI developers use variants of this "combo" scheme; the most 
common ones are documented on the wiki (and sketched just below):

https://svn.open-mpi.org/trac/ompi/wiki/UsingMercurial
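
As a rough illustration, one minimal variant looks something like the 
following (a sketch only -- the repo URL, ignore rules, and exact commands 
here are assumptions; the wiki page above has the real recipes):

  # overlay a private Mercurial repo on top of an SVN checkout
  svn checkout https://svn.open-mpi.org/svn/ompi/trunk ompi-trunk
  cd ompi-trunk
  hg init
  printf 'syntax: glob\n.svn/**\n' > .hgignore  # keep SVN metadata out of hg
  hg add && hg commit -m "baseline import of SVN trunk"

  # ...private development: hg commit / push / pull between machines...

  # when the feature is done, publish one coherent change to the mainline
  svn commit -m "bring the finished feature back to the public mainline"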

2. Perhaps we're reflecting our CVS/SVN backgrounds, but we didn't like the 
fact that if you make a bunch of "private" commits internally, all of those 
private commits are listed when you push your final set of changes back up to 
the public mainline.  Instead, we tend to prefer making bunches of short-lived 
branches off the OMPI public mainline, doing a bunch of private development, 
and then bringing that functionality back to the OMPI mainline in a (series 
of) public commits.  There's no need to show all the internal "just committing 
so that I can synchronize to another cluster" kinds of commits.

That being said, there are actually (at least) two ways to avoid exposing 
these kinds of internal commits: Mercurial queues (MQ) and rebasing.  But we 
didn't think the OMPI developer community would be wild about these options.  
We could be wrong, of course, but that was our assessment at the time.
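
For reference, here's a minimal sketch of each approach (assuming the stock 
mq and rebase extensions are enabled in ~/.hgrc; the revision arguments are 
placeholders):

  # Mercurial queues: develop as one mutable patch, land it as one commit
  hg qnew my-feature        # start a new patch
  # ...edit and test, folding work in with "hg qrefresh" as you go...
  hg qfinish --applied      # turn the finished patch into a real commit

  # Rebasing: replay (and squash) a private run of commits onto the tip
  hg rebase --source FIRST_PRIVATE_REV --dest MAINLINE_TIP --collapse

Either way, the "just syncing to another cluster" commits never appear in 
the public history.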

-

So there ya go.  :-)  We still love Mercurial and use it every day, but we've 
been fairly happy with our SVN+Mercurial solutions.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] Question regarding recently common shared-memory component

2010-09-21 Thread ananda.mudar
Hello Samuel

Like I said in my earlier response, I have never tried this option. So I
ran these tests on 1.4.2 now, and apparently the behavior is the same,
i.e., the checkpoint creation time increases when I enable the shared
memory component.

Is there any parameter that can be tuned to improve the performance?
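
For concreteness, the comparison being described is along these lines (a
sketch only; the binary and hostfile names come from earlier in the thread,
and <PID_of_mpirun> is a placeholder):

  # shared memory disabled
  mpirun -am ft-enable-cr --mca btl self,tcp,openib \
      -np 32 -hostfile hostfile-32 ../hellompi &
  time ompi-checkpoint <PID_of_mpirun>     # ~0.4 sec reported

  # shared memory enabled (default BTL selection includes sm)
  mpirun -am ft-enable-cr -np 32 -hostfile hostfile-32 ../hellompi &
  time ompi-checkpoint <PID_of_mpirun>     # ~6.5 sec reported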

Regards

Ananda

---

Hi,

Just to be clear - do you see similar checkpoint performance
differences in 1.5rc6 and 1.4.2 with and without shared memory enabled?

Thanks,

--
Samuel K. Gutierrez
Los Alamos National Laboratory
On Sep 21, 2010, at 9:35 AM, ananda.mudar wrote:
> Hello Samuel
> This problem seems to be resolved after I moved to r23781. However,
> I see another discrepancy in checkpoint image creation time when I
> disable shared memory (--mca btl self,tcp,openib) vs using it. I
> mean the time to create checkpoint image for this simple program is
> about 0.4 seconds if I disable shared memory while it is close to
> 6.5 seconds when I use shared memory component. I have not seen this
> behavior earlier. Do I have to tune any other parameter to reduce
> the time?
> Thanks
> Ananda
> Hi Ananda,
>
> This issue should be resolved in r23781. Please let me know if it is
> not.
>
> Thanks!
>
> --
> Samuel K. Gutierrez
> Los Alamos National Laboratory
> On Sep 20, 2010, at 11:26 AM, ananda.mudar wrote:
> > I have used the following options to build:
> >
> >   ./configure CC=/usr/bin/gcc CXX=/usr/bin/c++ F77=/usr/bin/gfortran \
> >     FC=/usr/bin/gfortran --prefix /users/amudar/openmpi-1.7 \
> >     --with-tm=/usr/local/pbs --with-openib --with-threads=posix \
> >     --enable-mpi-thread-multiple --enable-ft-thread --enable-debug \
> >     --with-ft=cr --with-blcr=/usr/blcr --with-blcr-libdir=/usr/blcr/lib
> >
> > Also, please note that this is with the r23756 build.
> >
> > Let me know if you need any other information.
> >
> > Thanks
> > Ananda
> > Let me take a look at it. How did you configure your build?
> > Thanks,
> >
> > --
> > Samuel K. Gutierrez
> > Los Alamos National Laboratory
> > On Sep 20, 2010, at 10:14 AM, ananda.mudar wrote:
> > > Hi
> > >
> > > I believe the new common shared memory component was committed to
> > > the trunk sometime towards the latter part of August. I had not
> > > tried this trunk version until last week, and I have seen some
> > > discrepancy with this component, specifically related to checkpoint
> > > functionality. I am not able to checkpoint any program with the
> > > latest trunk version. Am I missing something here? Should I be
> > > using any other options to enable checkpoint functionality for the
> > > shared memory component?
> > >
> > > However, if I disable the shared memory component and use only self,
> > > tcp, and openib (--mca btl self,tcp,openib), I can checkpoint
> > > successfully!!
> > >
> > > Following are the options I have used with mpirun:
> > >
> > >   mpirun -am ft-enable-cr --mca opal_cr_enable_timer 1 \
> > >     --mca sstore_stage_global_is_shared 1 \
> > >     --mca sstore_base_global_snapshot_dir /scratch/hpl005/UIT_test/amudar/FWI \
> > >     --mca mpi_paffinity_alone 1 -np 32 -hostfile hostfile-32 ../hellompi
> > >
> > > Please note that hellompi is a very simple program without any
> > > collective calls. When I issue a checkpoint, this program fails with
> > > the following messages:
> > > [hplcnlj158:13937] Signal: Segmentation fault (11)
> > > [hplcnlj158:13937] Signal code: Address not mapped (1)
> > > [hplcnlj158:13937] Failing at address: 0x2aaa0001
> > > [hplcnlj158:13937] [ 0] /lib64/libpthread.so.0 [0x2b4019a064c0]
> > > [hplcnlj158:13937] [ 1] /users/amudar/openmpi-1.7/lib/libmca_common_sm.so.0(mca_common_sm_param_register+0x262) [0x2d96628a]
> > > [hplcnlj158:13937] [ 2] /users/amudar/openmpi-1.7/lib/openmpi/mca_btl_sm.so [0x2f0a55e8]
> > > [hplcnlj158:13937] [ 3] /users/amudar/openmpi-1.7/lib/libmpi.so.0 [0x2b4018c3c11b]
> > > [hplcnlj158:13937] [ 4] /users/amudar/openmpi-1.7/lib/libmpi.so.0(mca_base_components_open+0x3ef) [0x2b4018c3b70b]
> > > [hplcnlj158:13937] [ 5] /users/amudar/openmpi-1.7/lib/libmpi.so.0(mca_btl_base_open+0xfd) [0x2b4018b620fe]
> > > [hplcnlj158:13937] [ 6] /users/amudar/openmpi-1.7/lib/openmpi/mca_bml_r2.so [0x2dd9e4fb]
> > > [hplcnlj158:13937] [ 7] /users/amudar/openmpi-1.7/lib/openmpi/mca_pml_ob1.so [0x2e5fa429]
> > > [hplcnlj158:13937] [ 8] /users/amudar/openmpi-1.7/lib/openmpi/mca_pml_crcpw.so [0x2dfadce6]
> > > [hplcnlj158:13937] [ 9] /users/amudar/openmpi-1.7/lib/libmpi.so.0 [0x2b4018b01a0d]
> > > [hplcnlj158:13937] [10] /users/amudar/openmpi-1.7/lib/libmpi.so.0(ompi_cr_coord+0xc0) [0x2b4018b017ba]
> > > [hplcnlj158:13937] [11] /users/amudar/openmpi-1.7/lib/libmpi.so.0(opal_cr_inc_core_recover+0xed) [0x2b4018c0efab]
> > > [hplcnlj158:13937] [12] /users/amudar/openmpi-1.7/lib/openmpi/mca_snapc_full.so [0x2bd280fc]
> > > [hplcnlj158:13937] [13] /users/amudar/openmpi-1.7/lib/libmpi.so.0(opal_cr_test_if_checkpoint_ready+0x11b) [0x2b4018c0ecd3]
> > > [hplcnlj158:13937] [14] /users/amudar/openmpi-1.7/lib/libmpi.so.0 [0x2b4018c0f6e7]
> > > [hplcnlj158:13937] [15] /lib64/libpthread.so.0 [0x2b40199fe367]
> > > [hplcnlj158:13937] [16] /lib64/libc.so.6(clone+0x6d) [0x2b4019ce5f7d]
> > > [hplcnlj158:13937] *** End of error message ***
> > > [hplcnlj161:00637] *** Process received signal ***
> > > [hplcnlj161:00637] Signal: Segmentation fault (11)
> > > [hplcnlj161:00637] Signal code: Address not mapped (1)
> > > [hplcnlj161:00637] Failing at address: 0x2aaa0001
> > > [hplcnlj161:00649] *** Process received signal ***
> > > [hplcnlj161:00649] Signal: Segmentation fault (11)
> > > [hplcnlj161:00649] Signal code: Address not mapped (1)
> > > [hplcnlj161:00649] Failing at address: 0x2aaa000

Re: [OMPI devel] Question regarding recently common shared-memory component

2010-09-21 Thread Samuel K. Gutierrez

Hi,

Just to be clear - do you see similar checkpoint performance
differences in 1.5rc6 and 1.4.2 with and without shared memory enabled?

Thanks,

--
Samuel K. Gutierrez
Los Alamos National Laboratory


Re: [OMPI devel] Question regarding recently common shared-memory component

2010-09-21 Thread ananda.mudar
Hello Samuel

This problem seems to be resolved after I moved to r23781. However, I
see another discrepancy in checkpoint image creation time when I disable
shared memory (--mca btl self,tcp,openib) vs. using it: the time to
create a checkpoint image for this simple program is about 0.4 seconds
if I disable shared memory, while it is close to 6.5 seconds when I use
the shared memory component. I have not seen this behavior earlier. Do I
have to tune any other parameter to reduce the time?

Thanks

Ananda


