Re: [OMPI devel] [mpich-discuss] BUG in ADIOI_NFS_WriteStrided

2014-12-19 Thread Rob Latham



On 12/19/2014 02:33 PM, Eric Chamberland wrote:

Hi Howard,

The bug is also present with MPICH-3.1.3...

So, for disc...@mpich.org list readers, here is the valgrind output for
the bug (sorry, I didn't compile MPICH in debug mode), which I
reported to Open MPI earlier today:
http://www.open-mpi.org/community/lists/devel/2014/12/16691.php

Sorry again for the duplicated report.

I encountered a new bug while testing our collective MPI I/O
functionality over NFS.  This is not a big issue for us, but I think
someone should have a look at it.


Please don't use NFS for MPI-IO.  ROMIO makes a best effort, but there's 
no way to guarantee you won't corrupt a block of data (NFS clients are 
allowed to cache... arbitrarily, it seems).  There are so many good 
parallel file systems with saner consistency semantics.


This looks like maybe a calloc would clean it right up.
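
To make that concrete, here is a small standalone illustration -- not the
ROMIO code itself; the file name and sizes are made up -- of why memcheck
flags the write and why a calloc-style (zero-initialized) allocation would
silence it: the strided pack leaves gaps in the staging buffer, and the
whole buffer, gaps included, is handed to write().

#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const size_t bufsize = 4096;

    /* With malloc() the bytes between the two packed regions stay
     * uninitialised, so memcheck reports "Syscall param write(buf)
     * points to uninitialised byte(s)".  Swapping in calloc(1, bufsize)
     * (or memset'ing after the malloc) makes every byte defined. */
    char *writebuf = malloc(bufsize);          /* or: calloc(1, bufsize) */

    memcpy(writebuf,        "block A", 7);     /* packed region #1 */
    memcpy(writebuf + 2048, "block B", 7);     /* packed region #2; gap in between */

    int fd = open("strided-demo.out", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd >= 0) {
        (void) write(fd, writebuf, bufsize);   /* memcheck flags this write */
        close(fd);
    }
    free(writebuf);
    return 0;
}

Run it under valgrind to see the same class of warning as above; switch to
calloc and the warning goes away.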

==rob



While running with 3 processes, we get this error on rank #0 and rank #2,
knowing that rank #1 has nothing to write (a zero-length contribution) on this
particular PMPI_File_write_all_begin call:

==3434== Syscall param write(buf) points to uninitialised byte(s)
==3434==at 0x108D0380: __write_nocancel (in /lib64/libpthread-2.17.so)
==3434==by 0x11DB9D46: ADIOI_NFS_WriteStrided (in
/opt/mpich-3.1.3/lib64/libmpi.so.12.0.4)
==3434==by 0x11DD264F: ADIOI_GEN_WriteStridedColl (in
/opt/mpich-3.1.3/lib64/libmpi.so.12.0.4)
==3434==by 0x11DB5F8F: MPIOI_File_write_all_begin (in
/opt/mpich-3.1.3/lib64/libmpi.so.12.0.4)
==3434==by 0x11DB60F3: PMPI_File_write_all_begin (in
/opt/mpich-3.1.3/lib64/libmpi.so.12.0.4)
==3434==by 0x4CCAE6: SYEnveloppeMessage
PAIO::ecritureIndexeParBlocMPI, FunctorCopieInfosSurDansVectPAType*, std::allocator*> > const>, FunctorAccesseurPorteurLocal > >(PAGroupeProcessus&, ADIOI_FileD*, long long,
PtrPorteurConst, PtrPorteurConst,
FunctorCopieInfosSurDansVectPAType*, std::allocator*> > const>&, FunctorAccesseurPorteurLocal >&, long, DistributionComposantes&, long, unsigned long, unsigned
long, std::string const&) (in
/home/mefpp_ericc/GIREF/bin/Test.LectureEcritureGISMPI.opt)
==3434==by 0x4DDBFE:
GISLectureEcriture::visiteMaillage(Maillage const&) (in
/home/mefpp_ericc/GIREF/bin/Test.LectureEcritureGISMPI.opt)
==3434==by 0x4BCB22:
GISLectureEcriture::ecritGISMPI(std::string,
GroupeInfoSur const&, std::string const&) (in
/home/mefpp_ericc/GIREF/bin/Test.LectureEcritureGISMPI.opt)
==3434==by 0x48E213: main (in
/home/mefpp_ericc/GIREF/bin/Test.LectureEcritureGISMPI.opt)
==3434==  Address 0x1a12cd10 is 224 bytes inside a block of size 524,448
alloc'd
==3434==at 0x4C2C27B: malloc (in
/usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==3434==by 0x11DADA96: MPL_trmalloc (in
/opt/mpich-3.1.3/lib64/libmpi.so.12.0.4)
==3434==by 0x11DD9285: ADIOI_Malloc_fn (in
/opt/mpich-3.1.3/lib64/libmpi.so.12.0.4)
==3434==by 0x11DB9AC8: ADIOI_NFS_WriteStrided (in
/opt/mpich-3.1.3/lib64/libmpi.so.12.0.4)
==3434==by 0x11DD264F: ADIOI_GEN_WriteStridedColl (in
/opt/mpich-3.1.3/lib64/libmpi.so.12.0.4)
==3434==by 0x11DB5F8F: MPIOI_File_write_all_begin (in
/opt/mpich-3.1.3/lib64/libmpi.so.12.0.4)
==3434==by 0x11DB60F3: PMPI_File_write_all_begin (in
/opt/mpich-3.1.3/lib64/libmpi.so.12.0.4)
==3434==by 0x4CCAE6: SYEnveloppeMessage
PAIO::ecritureIndexeParBlocMPI, FunctorCopieInfosSurDansVectPAType*, std::allocator*> > const>, FunctorAccesseurPorteurLocal > >(PAGroupeProcessus&, ADIOI_FileD*, long long,
PtrPorteurConst, PtrPorteurConst,
FunctorCopieInfosSurDansVectPAType*, std::allocator*> > const>&, FunctorAccesseurPorteurLocal >&, long, DistributionComposantes&, long, unsigned long, unsigned
long, std::string const&) (in
/home/mefpp_ericc/GIREF/bin/Test.LectureEcritureGISMPI.opt)
==3434==by 0x4DDBFE:
GISLectureEcriture::visiteMaillage(Maillage const&) (in
/home/mefpp_ericc/GIREF/bin/Test.LectureEcritureGISMPI.opt)
==3434==by 0x4BCB22:
GISLectureEcriture::ecritGISMPI(std::string,
GroupeInfoSur const&, std::string const&) (in
/home/mefpp_ericc/GIREF/bin/Test.LectureEcritureGISMPI.opt)
==3434==by 0x48E213: main (in
/home/mefpp_ericc/GIREF/bin/Test.LectureEcritureGISMPI.opt)
==3434==  Uninitialised value was created by a client request
==3434==at 0x11DADEE5: MPL_trmalloc (in
/opt/mpich-3.1.3/lib64/libmpi.so.12.0.4)
==3434==by 0x11DD9285: ADIOI_Malloc_fn (in
/opt/mpich-3.1.3/lib64/libmpi.so.12.0.4)
==3434==by 0x11DB9AC8: 

Re: [OMPI devel] [1.8.4rc5] preliminary results

2014-12-19 Thread Paul Hargrove
A quick glance at the results for the two "configurations of interest"
appears to show that the problem is resolved.

Tonight I will take a complete look through my results and report ONLY if I
find new regressions.
Unless you hear from me, assume "openmpi-v1.8.3-322-g562a764" gets my
"thumbs up" with respect to "Fortran Sadness".

-Paul

On Fri, Dec 19, 2014 at 12:51 PM, Paul Hargrove  wrote:

> Jeff,
>
> Less typing to launch 50+ testers than to pick out just those two.
> Starting them now...
>
> -Paul
>
> On Fri, Dec 19, 2014 at 12:22 PM, Jeff Squyres (jsquyres) <
> jsquy...@cisco.com> wrote:
>>
>> Paul --
>>
>> The fix for this is now on the v1.8 branch.  I'm spinning up a new
>> nightly tarball for you.
>>
>> http://www.open-mpi.org/nightly/v1.8/
>>
>> Could you test it in the 2 cases where you had fortran failures?
>>
>>
>>
>> On Dec 18, 2014, at 8:50 PM, Paul Hargrove  wrote:
>>
>> > Update:
>> >
>> > I now have 59 of 61 results, with only the QEMU-emulated MIPS platforms
>> outstanding.
>> > Those will not likely finish until near to (or after) midnight tonight.
>> > Unless something turns up on the MIPS systems my "smoke test" of rc5 is
>> complete.
>> >
>> > The only issues I found are the vader and fortran ones mentioned
>> previously.
>> >
>> > Nathan now has an account on the same SGI UV as I have been using.
>> > Jeff now has my configure and ompi_info output for my fortran failures.
>> >
>> > NOTE (primarily directed at Jeff):
>> > I define "issue" to *exclude* known problems with certain compilers that
>> are also present in earlier releases.  In particular, I pass explicit
>> --with-mpi-fortran=XXX and/or --disable-oshmem-fortran options to configure
>> when using certain PGI and XLC versions because (as of 1.8.3 when I last
>> adjusted those settings) configure was not able to automatically disqualify
>> their "deficient" fortran support.  If there is a desire/need to follow up
>> on this, let me know.  However, all those "deficient" fortran compilers have
>> been reported by me on this list at least once in testing prior releases
>> (just never in one place).
>> >
>> > -Paul
>> >
>> > On Thu, Dec 18, 2014 at 8:55 AM, Paul Hargrove 
>> wrote:
>> > With results from about 50 out of 61 platforms:
>> >
>> > + KNOWN: SGI UV is still "broken-by-default" (fails compiling vader
>> unless configured with --without-xpmem)
>> > + NEW: I see Fortran bindings failing to compile w/ gfortran
>> > + NEW: I see Fortran bindings fail to link with Open64
>> >
>> > I also have unexplained errors on my Solaris-10/SPARC system.
>> > It looks like there may have been a loss of network connectivity during
>> the tests.
>> > I need to check these deeper, but I expect them to pass when I get a
>> chance to re-run them.
>> >
>> > -Paul
>> >
>> > --
>> > Paul H. Hargrove  phhargr...@lbl.gov
>> > Computer Languages & Systems Software (CLaSS) Group
>> > Computer Science Department   Tel: +1-510-495-2352
>> > Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>> >
>> >
>> > --
>> > Paul H. Hargrove  phhargr...@lbl.gov
>> > Computer Languages & Systems Software (CLaSS) Group
>> > Computer Science Department   Tel: +1-510-495-2352
>> > Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>> > ___
>> > devel mailing list
>> > de...@open-mpi.org
>> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> > Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2014/12/16683.php
>>
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2014/12/16697.php
>>
>
>
> --
> Paul H. Hargrove  phhargr...@lbl.gov
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department   Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>



-- 
Paul H. Hargrove  phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] [1.8.4rc5] preliminary results

2014-12-19 Thread Paul Hargrove
Jeff,

Less typing to launch 50+ testers than to pick out just those two.
Starting them now...

-Paul

On Fri, Dec 19, 2014 at 12:22 PM, Jeff Squyres (jsquyres) <
jsquy...@cisco.com> wrote:
>
> Paul --
>
> The fix for this is now on the v1.8 branch.  I'm spinning up a new nightly
> tarball for you.
>
> http://www.open-mpi.org/nightly/v1.8/
>
> Could you test it in the 2 cases where you had fortran failures?
>
>
>
> On Dec 18, 2014, at 8:50 PM, Paul Hargrove  wrote:
>
> > Update:
> >
> > I now have 59 of 61 results, with only the QEMU-emulated MIPS platforms
> outstanding.
> > Those will not likely finish until near to (or after) midnight tonight.
> > Unless something turns up on the MIPS systems my "smoke test" of rc5 is
> complete.
> >
> > The only issues I found are the vader and fortran ones mentioned
> previously.
> >
> > Nathan now has an account on the same SGI UV as I have been using.
> > Jeff now has my configure and ompi_info output for my fortran failures.
> >
> > NOTE (primarily directed at Jeff):
> > I define "issue" to *exclude* known problems with certain compilers that
> are also present in earlier releases.  In particular, I pass explicit
> --with-mpi-fortran=XXX and/or --disable-oshmem-fortran options to configure
> when using certain PGI and XLC versions because (as of 1.8.3 when I last
> adjusted those settings) configure was not able to automatically disqualify
> their "deficient" fortran support.  If there is a desire/need to follow up
> on this, let me know.  However, all those "deficient" fortran compilers have
> been reported by me on this list at least once in testing prior releases
> (just never in one place).
> >
> > -Paul
> >
> > On Thu, Dec 18, 2014 at 8:55 AM, Paul Hargrove 
> wrote:
> > With results from about 50 out of 61 platforms:
> >
> > + KNOWN: SGI UV is still "broken-by-default" (fails compiling vader
> unless configured with --without-xpmem)
> > + NEW: I see Fortran bindings failing to compile w/ gfortran
> > + NEW: I see Fortran bindings fail to link with Open64
> >
> > I also have unexplained errors on my Solaris-10/SPARC system.
> > It looks like there may have been a loss of network connectivity during
> the tests.
> > I need to check these deeper, but I expect them to pass when I get a
> chance to re-run them.
> >
> > -Paul
> >
> > --
> > Paul H. Hargrove  phhargr...@lbl.gov
> > Computer Languages & Systems Software (CLaSS) Group
> > Computer Science Department   Tel: +1-510-495-2352
> > Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> >
> >
> > --
> > Paul H. Hargrove  phhargr...@lbl.gov
> > Computer Languages & Systems Software (CLaSS) Group
> > Computer Science Department   Tel: +1-510-495-2352
> > Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/12/16683.php
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/12/16697.php
>


-- 
Paul H. Hargrove  phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


[OMPI devel] BUG in ADIOI_NFS_WriteStrided

2014-12-19 Thread Eric Chamberland

Hi Howard,

The bug is also present with MPICH-3.1.3...

So, for disc...@mpich.org list readers, here is the valgrind output for 
the bug (sorry, I didn't compile MPICH in debug mode), which I 
reported to Open MPI earlier today: 
http://www.open-mpi.org/community/lists/devel/2014/12/16691.php


Sorry again for the duplicated report.

I encountered a new bug while testing our collective MPI I/O 
functionality over NFS.  This is not a big issue for us, but I think 
someone should have a look at it.


While running with 3 processes, we get this error on rank #0 and rank #2, 
knowing that rank #1 has nothing to write (a zero-length contribution) on this 
particular PMPI_File_write_all_begin call:


==3434== Syscall param write(buf) points to uninitialised byte(s)
==3434==at 0x108D0380: __write_nocancel (in /lib64/libpthread-2.17.so)
==3434==by 0x11DB9D46: ADIOI_NFS_WriteStrided (in 
/opt/mpich-3.1.3/lib64/libmpi.so.12.0.4)
==3434==by 0x11DD264F: ADIOI_GEN_WriteStridedColl (in 
/opt/mpich-3.1.3/lib64/libmpi.so.12.0.4)
==3434==by 0x11DB5F8F: MPIOI_File_write_all_begin (in 
/opt/mpich-3.1.3/lib64/libmpi.so.12.0.4)
==3434==by 0x11DB60F3: PMPI_File_write_all_begin (in 
/opt/mpich-3.1.3/lib64/libmpi.so.12.0.4)
==3434==by 0x4CCAE6: SYEnveloppeMessage 
PAIO::ecritureIndexeParBlocMPI, FunctorCopieInfosSurDansVectPAType*, std::allocator*> > const>, FunctorAccesseurPorteurLocal > >(PAGroupeProcessus&, ADIOI_FileD*, long long, 
PtrPorteurConst, PtrPorteurConst, 
FunctorCopieInfosSurDansVectPAType*, std::allocator*> > const>&, FunctorAccesseurPorteurLocal >&, long, DistributionComposantes&, long, unsigned long, unsigned 
long, std::string const&) (in 
/home/mefpp_ericc/GIREF/bin/Test.LectureEcritureGISMPI.opt)
==3434==by 0x4DDBFE: 
GISLectureEcriture::visiteMaillage(Maillage const&) (in 
/home/mefpp_ericc/GIREF/bin/Test.LectureEcritureGISMPI.opt)
==3434==by 0x4BCB22: 
GISLectureEcriture::ecritGISMPI(std::string, 
GroupeInfoSur const&, std::string const&) (in 
/home/mefpp_ericc/GIREF/bin/Test.LectureEcritureGISMPI.opt)
==3434==by 0x48E213: main (in 
/home/mefpp_ericc/GIREF/bin/Test.LectureEcritureGISMPI.opt)
==3434==  Address 0x1a12cd10 is 224 bytes inside a block of size 524,448 
alloc'd
==3434==at 0x4C2C27B: malloc (in 
/usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==3434==by 0x11DADA96: MPL_trmalloc (in 
/opt/mpich-3.1.3/lib64/libmpi.so.12.0.4)
==3434==by 0x11DD9285: ADIOI_Malloc_fn (in 
/opt/mpich-3.1.3/lib64/libmpi.so.12.0.4)
==3434==by 0x11DB9AC8: ADIOI_NFS_WriteStrided (in 
/opt/mpich-3.1.3/lib64/libmpi.so.12.0.4)
==3434==by 0x11DD264F: ADIOI_GEN_WriteStridedColl (in 
/opt/mpich-3.1.3/lib64/libmpi.so.12.0.4)
==3434==by 0x11DB5F8F: MPIOI_File_write_all_begin (in 
/opt/mpich-3.1.3/lib64/libmpi.so.12.0.4)
==3434==by 0x11DB60F3: PMPI_File_write_all_begin (in 
/opt/mpich-3.1.3/lib64/libmpi.so.12.0.4)
==3434==by 0x4CCAE6: SYEnveloppeMessage 
PAIO::ecritureIndexeParBlocMPI, FunctorCopieInfosSurDansVectPAType*, std::allocator*> > const>, FunctorAccesseurPorteurLocal > >(PAGroupeProcessus&, ADIOI_FileD*, long long, 
PtrPorteurConst, PtrPorteurConst, 
FunctorCopieInfosSurDansVectPAType*, std::allocator*> > const>&, FunctorAccesseurPorteurLocal >&, long, DistributionComposantes&, long, unsigned long, unsigned 
long, std::string const&) (in 
/home/mefpp_ericc/GIREF/bin/Test.LectureEcritureGISMPI.opt)
==3434==by 0x4DDBFE: 
GISLectureEcriture::visiteMaillage(Maillage const&) (in 
/home/mefpp_ericc/GIREF/bin/Test.LectureEcritureGISMPI.opt)
==3434==by 0x4BCB22: 
GISLectureEcriture::ecritGISMPI(std::string, 
GroupeInfoSur const&, std::string const&) (in 
/home/mefpp_ericc/GIREF/bin/Test.LectureEcritureGISMPI.opt)
==3434==by 0x48E213: main (in 
/home/mefpp_ericc/GIREF/bin/Test.LectureEcritureGISMPI.opt)

==3434==  Uninitialised value was created by a client request
==3434==at 0x11DADEE5: MPL_trmalloc (in 
/opt/mpich-3.1.3/lib64/libmpi.so.12.0.4)
==3434==by 0x11DD9285: ADIOI_Malloc_fn (in 
/opt/mpich-3.1.3/lib64/libmpi.so.12.0.4)
==3434==by 0x11DB9AC8: ADIOI_NFS_WriteStrided (in 
/opt/mpich-3.1.3/lib64/libmpi.so.12.0.4)
==3434==by 0x11DD264F: ADIOI_GEN_WriteStridedColl (in 
/opt/mpich-3.1.3/lib64/libmpi.so.12.0.4)
==3434==by 0x11DB5F8F: MPIOI_File_write_all_begin (in 
/opt/mpich-3.1.3/lib64/libmpi.so.12.0.4)
==3434==by 0x11DB60F3: PMPI_File_write_all_begin (in 
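
For anyone who wants to poke at the scenario described at the top of this
message -- three ranks, a strided collective write in which rank #1
contributes zero elements -- the following sketch shows the general shape.
It is not the application code from the report: the file path, datatype and
counts are illustrative assumptions, and the file has to live on an NFS
mount for the ADIOI_NFS_* code path to be taken.

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, i, count;
    double *buf;
    MPI_File fh;
    MPI_Datatype filetype;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Non-contiguous file view so the collective goes through the
     * WriteStridedColl / WriteStrided path: 16 blocks of 4 doubles,
     * stride 12 doubles, interleaved across the 3 ranks. */
    MPI_Type_vector(16, 4, 12, MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);

    /* Path is an assumption -- it must be on NFS to exercise
     * ADIOI_NFS_WriteStrided. */
    MPI_File_open(MPI_COMM_WORLD, "/nfs/scratch/strided_test.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, (MPI_Offset)(rank * 4 * sizeof(double)),
                      MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);

    count = (rank == 1) ? 0 : 16 * 4;          /* rank #1 writes nothing */
    buf = malloc((count ? count : 1) * sizeof(double));
    for (i = 0; i < count; i++)
        buf[i] = (double) rank;

    MPI_File_write_all_begin(fh, buf, count, MPI_DOUBLE);
    MPI_File_write_all_end(fh, buf, &status);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
    free(buf);
    MPI_Finalize();
    return 0;
}

Running it with something like "mpiexec -n 3 valgrind --track-origins=yes
./strided_test" on an NFS-mounted directory should show whether the same
warning appears.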

Re: [OMPI devel] [1.8.4rc5] preliminary results

2014-12-19 Thread Jeff Squyres (jsquyres)
Paul --

The fix for this is now on the v1.8 branch.  I'm spinning up a new nightly 
tarball for you.

http://www.open-mpi.org/nightly/v1.8/

Could you test it in the 2 cases where you had fortran failures?



On Dec 18, 2014, at 8:50 PM, Paul Hargrove  wrote:

> Update:
> 
> I now have 59 of 61 results, with only the QEMU-emulated MIPS platforms 
> outstanding.
> Those will not likely finish until near to (or after) midnight tonight.
> Unless something turns up on the MIPS systems my "smoke test" of rc5 is 
> complete. 
> 
> The only issues I found are the vader and fortran ones mentioned previously.
> 
> Nathan now has an account on the same SGI UV as I have been using.
> Jeff now has my configure and ompi_info output for my fortran failures.
> 
> NOTE (primarily directed at Jeff):
> I define "issue" to *exclude* known problems with certain compilers that are 
> also present in earlier releases.  In particular, I pass explicit 
> --with-mpi-fortran=XXX and/or --disable-oshmem-fortran options to configure 
> when using certain PGI and XLC versions because (as of 1.8.3 when I last 
> adjusted those settings) configure was not able to automatically disqualify 
> their "deficient" fortran support.  If there is a desire/need to follow up on 
> this, let me know.  However, all those "deficient" fortran compilers have been 
> reported by me on this list at least once in testing prior releases (just 
> never in one place).
> 
> -Paul
> 
> On Thu, Dec 18, 2014 at 8:55 AM, Paul Hargrove  wrote:
> With results from about 50 out of 61 platforms:
> 
> + KNOWN: SGI UV is still "broken-by-default" (fails compiling vader unless 
> configured with --without-xpmem)
> + NEW: I see Fortran bindings failing to compile w/ gfortran
> + NEW: I see Fortran bindings fail to link with Open64
> 
> I also have unexplained errors on my Solaris-10/SPARC system.
> It looks like there may have been a loss of network connectivity during the 
> tests.
> I need to check these deeper, but I expect them to pass when I get a chance 
> to re-run them.
> 
> -Paul
> 
> -- 
> Paul H. Hargrove  phhargr...@lbl.gov
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department   Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> 
> 
> -- 
> Paul H. Hargrove  phhargr...@lbl.gov
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department   Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/12/16683.php


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] BUG in ADIOI_NFS_WriteStrided

2014-12-19 Thread Howard Pritchard
Hi Eric,

Does your app also work with MPICH?  The ROMIO in Open MPI is getting a bit
old, so it would be useful to know if you see the same valgrind error using
a recent MPICH.

Howard


2014-12-19 9:50 GMT-07:00 Eric Chamberland :
>
> Hi,
>
> I encountered a new bug while testing our collective MPI I/O
> functionality over NFS.  This is not a big issue for us, but I think
> someone should have a look at it.
>
> While running with 3 processes, we get this error on rank #0 and rank #2,
> knowing that rank #1 has nothing to write (a zero-length contribution) on this
> particular PMPI_File_write_all_begin call:
>
>
> ==19211== Syscall param write(buf) points to uninitialised byte(s)
> ==19211==at 0x10CB739D: ??? (in /lib64/libpthread-2.17.so)
> ==19211==by 0x27438431: ADIOI_NFS_WriteStrided (ad_nfs_write.c:645)
> ==19211==by 0x27451963: ADIOI_GEN_WriteStridedColl
> (ad_write_coll.c:159)
> ==19211==by 0x274321BD: MPIOI_File_write_all_begin (write_allb.c:114)
> ==19211==by 0x27431DBF: mca_io_romio_dist_MPI_File_write_all_begin
> (write_allb.c:44)
> ==19211==by 0x2742A367: mca_io_romio_file_write_all_begin
> (io_romio_file_write.c:264)
> ==19211==by 0x12126520: PMPI_File_write_all_begin
> (pfile_write_all_begin.c:74)
> ==19211==by 0x4D7CFB: SYEnveloppeMessage PAIO::
> ecritureIndexeParBlocMPI,
> FunctorCopieInfosSurDansVectPAType std::vector*, std::allocator Arete>*> > const>, FunctorAccesseurPorteurLocal Arete> > >(PAGroupeProcessus&, ompi_file_t*, long long,
> PtrPorteurConst, PtrPorteurConst,
> FunctorCopieInfosSurDansVectPAType std::vector*, std::allocator Arete>*> > const>&, FunctorAccesseurPorteurLocal Arete> >&, long, DistributionComposantes&, long, unsigned long, unsigned
> long, std::string const&) (in /home/mefpp_ericc/GIREF/bin/
> Test.LectureEcritureGISMPI.opt)
> ==19211==by 0x4E9A67: GISLectureEcriture::visiteMaillage(Maillage
> const&) (in /home/mefpp_ericc/GIREF/bin/Test.LectureEcritureGISMPI.opt)
> ==19211==by 0x4C79A2: GISLectureEcriture::ecritGISMPI(std::string,
> GroupeInfoSur const&, std::string const&) (in
> /home/mefpp_ericc/GIREF/bin/Test.LectureEcritureGISMPI.opt)
> ==19211==by 0x4961AD: main (in /home/mefpp_ericc/GIREF/bin/
> Test.LectureEcritureGISMPI.opt)
> ==19211==  Address 0x295af060 is 144 bytes inside a block of size 524,288
> alloc'd
> ==19211==at 0x4C2C27B: malloc (in /usr/lib64/valgrind/vgpreload_
> memcheck-amd64-linux.so)
> ==19211==by 0x2745E78E: ADIOI_Malloc_fn (malloc.c:50)
> ==19211==by 0x2743757C: ADIOI_NFS_WriteStrided (ad_nfs_write.c:497)
> ==19211==by 0x27451963: ADIOI_GEN_WriteStridedColl
> (ad_write_coll.c:159)
> ==19211==by 0x274321BD: MPIOI_File_write_all_begin (write_allb.c:114)
> ==19211==by 0x27431DBF: mca_io_romio_dist_MPI_File_write_all_begin
> (write_allb.c:44)
> ==19211==by 0x2742A367: mca_io_romio_file_write_all_begin
> (io_romio_file_write.c:264)
> ==19211==by 0x12126520: PMPI_File_write_all_begin
> (pfile_write_all_begin.c:74)
> ==19211==by 0x4D7CFB: SYEnveloppeMessage PAIO::
> ecritureIndexeParBlocMPI,
> FunctorCopieInfosSurDansVectPAType std::vector*, std::allocator Arete>*> > const>, FunctorAccesseurPorteurLocal Arete> > >(PAGroupeProcessus&, ompi_file_t*, long long,
> PtrPorteurConst, PtrPorteurConst,
> FunctorCopieInfosSurDansVectPAType std::vector*, std::allocator Arete>*> > const>&, FunctorAccesseurPorteurLocal Arete> >&, long, DistributionComposantes&, long, unsigned long, unsigned
> long, std::string const&) (in /home/mefpp_ericc/GIREF/bin/
> Test.LectureEcritureGISMPI.opt)
> ==19211==by 0x4E9A67: GISLectureEcriture::visiteMaillage(Maillage
> const&) (in /home/mefpp_ericc/GIREF/bin/Test.LectureEcritureGISMPI.opt)
> ==19211==by 0x4C79A2: GISLectureEcriture::ecritGISMPI(std::string,
> GroupeInfoSur const&, std::string const&) (in
> /home/mefpp_ericc/GIREF/bin/Test.LectureEcritureGISMPI.opt)
> ==19211==by 0x4961AD: main (in /home/mefpp_ericc/GIREF/bin/
> Test.LectureEcritureGISMPI.opt)
> ==19211==  Uninitialised value was created by a heap allocation
> ==19211==at 0x4C2C27B: malloc (in /usr/lib64/valgrind/vgpreload_
> memcheck-amd64-linux.so)
> ==19211==by 0x2745E78E: ADIOI_Malloc_fn (malloc.c:50)
> ==19211==by 0x2743757C: ADIOI_NFS_WriteStrided (ad_nfs_write.c:497)
> ==19211==by 0x27451963: ADIOI_GEN_WriteStridedColl
> (ad_write_coll.c:159)
> ==19211==by 0x274321BD: MPIOI_File_write_all_begin (write_allb.c:114)
> ==19211==by 0x27431DBF: 

Re: [OMPI devel] FT code (again)

2014-12-19 Thread Joshua Ladd
George is correct; opal_pmix.fence replaces the grpcomm barrier.
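
Concretely, a sketch of what that replacement might look like for the snapc
barrier Adrian quoted -- untested, and the error handling is left to whatever
the surrounding FT code already does:

/* Old pattern in orte/mca/ess/env/ess_env_module.c (per Adrian's excerpt):
 *
 *     OBJ_CONSTRUCT(&coll, orte_grpcomm_collective_t);
 *     coll.id = orte_process_info.snapc_init_barrier;
 *     ...
 *     orte_grpcomm.barrier(&coll);
 *     ...
 *     coll.active = true;
 *     ORTE_WAIT_FOR_COMPLETION(coll.active);
 *
 * New pattern, same as in the commit quoted below: a single blocking fence
 * across all processes replaces the construct/barrier/wait sequence. */
opal_pmix.fence(NULL, 0);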

Josh

On Fri, Dec 19, 2014 at 10:47 AM, George Bosilca 
wrote:
>
> An opal_pmix.fence seems like a perfect replacement.
>
>   George.
>
>
> On Fri, Dec 19, 2014 at 10:26 AM, Adrian Reber  wrote:
>
>> Again I am trying to get the FT code working. This time I am unsure how
>> to resolve the code changes from this commit:
>>
>> commit aec5cd08bd8c33677276612b899b48618d271efa
>> Author: Ralph Castain 
>> Date:   Thu Aug 21 18:56:47 2014 +
>>
>> Per the PMIx RFC:
>>
>>
>> This includes changes like this:
>>
>>
>> @@ -172,17 +164,7 @@ static int rte_init(void)
>>   * in the job won't be executing this step, so we would hang
>>   */
>>  if (ORTE_PROC_IS_NON_MPI && !orte_do_not_barrier) {
>> -orte_grpcomm_collective_t coll;
>> -OBJ_CONSTRUCT(&coll, orte_grpcomm_collective_t);
>> -coll.id = orte_process_info.peer_modex;
>> -coll.active = true;
>> -if (ORTE_SUCCESS != (ret = orte_grpcomm.modex(&coll))) {
>> -ORTE_ERROR_LOG(ret);
>> -error = "orte modex";
>> -goto error;
>> -}
>> -ORTE_WAIT_FOR_COMPLETION(coll.active);
>> -OBJ_DESTRUCT(&coll);
>> +opal_pmix.fence(NULL, 0);
>>  }
>>
>>
>> In the FT code in orte/mca/ess/env/ess_env_module.c there is similar code:
>>
>> OBJ_CONSTRUCT(&coll, orte_grpcomm_collective_t);
>> coll.id = orte_process_info.snapc_init_barrier;
>>
>> ...
>>
>> if (ORTE_SUCCESS != (ret = orte_grpcomm.barrier(&coll))) {
>>
>> ...
>>
>> coll.active = true;
>> ORTE_WAIT_FOR_COMPLETION(coll.active);
>>
>>
>> How can this be expressed with the new code?
>>
>>
>> Adrian
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2014/12/16688.php
>>
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/12/16689.php
>


Re: [OMPI devel] [1.8.4rc5] preliminary results

2014-12-19 Thread Ralph Castain
Wow! Thanks - I was turning blue…

Actually, I truly do appreciate all the tests you run for us!

> On Dec 19, 2014, at 9:03 AM, Paul Hargrove  wrote:
> 
> 
> On Thu, Dec 18, 2014 at 5:50 PM, Paul Hargrove  wrote:
> Unless something turns up on the MIPS systems my "smoke test" of rc5 is 
> complete. 
> 
> In case anybody was holding their breath:
> The MIPS testers completed just fine.
> 
> -Paul
> 
> 
> -- 
> Paul H. Hargrove  phhargr...@lbl.gov 
> 
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department   Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/12/16693.php



Re: [OMPI devel] [1.8.4rc5] preliminary results

2014-12-19 Thread Paul Hargrove
On Thu, Dec 18, 2014 at 5:50 PM, Paul Hargrove  wrote:
>
> Unless something turns up on the MIPS systems my "smoke test" of rc5 is
> complete.


In case anybody was holding their breath:
The MIPS testers completed just fine.

-Paul


-- 
Paul H. Hargrove  phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] [1.8.4rc5] preliminary results

2014-12-19 Thread Paul Hargrove
Gilles,

Whether we think this is an Open64 issue or not, this compiler worked with
1.8.3 and 1.8.4rc4.  I don't know the nature of the Fortran changes between
rc4 and rc5, but perhaps they can be made conditional to allow Open64 to
work with 1.8.4?

I will send the configure output off-list momentarily.

-Paul

On Fri, Dec 19, 2014 at 3:03 AM, Gilles Gouaillardet <
gilles.gouaillar...@iferc.org> wrote:
>
>  Paul,
>
> I faced the very same issue with open64-5.0
>
> here is attached a simple reproducer.
>
> main2 can be built, but main cannot be built.
> the only difference is that, unlike main.F90, main2.F90 contains the line:
> use, intrinsic :: iso_c_binding
> /* and they both link with libfoo.so, and foo.F90 *does* contain the same
> line */
>
>
> at this stage, all I can conclude is that this is an Open64 compiler issue.
>
>
> I am unable to reproduce the issue with gcc; could you please detail:
> - your configure command line
> - the version of the gnu compilers you are using
>
>
> I hit a glitch with the Solaris Studio 12.4 compilers on Linux if I configure
> with FC=f77:
> f77 does not recognize the 'present' keyword and fails, which raises the
> question:
> why is there some f90 code in the mpif-h directory?
>
> Cheers,
>
> Gilles
>
> make[2]: Entering directory
> `/csc/home1/gouaillardet/build/openmpi-1.8.4rc5-os124/ompi/mpi/fortran/mpif-h'
>   PPFC libmpi_mpifh_sizeof_la-sizeof-mpif08-pre-1.8.4_f.lo
> mpi_f08_sizeof:
>
> MODULE mpi_f08_sizeof
>^
> "../../../../../../src/openmpi-1.8.4rc5/ompi/mpi/fortran/mpif-h/sizeof-mpif08-pre-1.8.4_f.F90",
> Line = 31, Column = 8: ERROR: The compiler has detected errors in module
> "MPI_F08_SIZEOF".  No module information file will be created for this
> module.
>
>   if (present(ierror)) ierror = 0
>   ^
> "../../../../../../src/openmpi-1.8.4rc5/ompi/mpi/fortran/mpif-h/sizeof-mpif08-pre-1.8.4_f.F90",
> Line = 45, Column = 11: ERROR: IMPLICIT NONE is specified in the local
> scope, therefore an explicit type must be specified for function "PRESENT".
>
>
>
> On 2014/12/19 3:40, Paul Hargrove wrote:
>
> Jeff,
>
> See below for some failure details.
> They look like different symptoms of the same issue.
>
> -Paul
>
> Open64 link failure:
>
> $ mpifort -g hello_mpifh.f -o hello_mpifh
> /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.4rc5-linux-x86_64-open64/INST/lib/libmpi_mpifh.so:
> undefined reference to `_Iso_c_binding'
> collect2: ld returned 1 exit status
> make[2]: *** [hello_mpifh] Error 1
>
> Gcc build failure:
>
> libtool: compile:  gfortran -DHAVE_CONFIG_H -I.
> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.4rc5-linux-x86_64-gcc-atomics/openmpi-1.8.4rc5/ompi/mpi/fortran/mpif-h
> -I../../../../opal/include -I../../../../orte/include
> -I../../../../ompi/include -I../../../../oshmem/include
> -I../../../../opal/mca/hwloc/hwloc191/hwloc/include/private/autogen
> -I../../../../opal/mca/hwloc/hwloc191/hwloc/include/hwloc/autogen
> -DOMPI_PROFILE_LAYER=0 -DOMPI_COMPILING_FORTRAN_WRAPPERS=1
> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.4rc5-linux-x86_64-gcc-atomics/openmpi-1.8.4rc5
> -I../../../..
> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.4rc5-linux-x86_64-gcc-atomics/openmpi-1.8.4rc5/opal/include
> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.4rc5-linux-x86_64-gcc-atomics/openmpi-1.8.4rc5/orte/include
> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.4rc5-linux-x86_64-gcc-atomics/openmpi-1.8.4rc5/ompi/include
> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.4rc5-linux-x86_64-gcc-atomics/openmpi-1.8.4rc5/oshmem/include
> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.4rc5-linux-x86_64-gcc-atomics/openmpi-1.8.4rc5/opal/mca/hwloc/hwloc191/hwloc/include
> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.4rc5-linux-x86_64-gcc-atomics/BLD/opal/mca/hwloc/hwloc191/hwloc/include
> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.4rc5-linux-x86_64-gcc-atomics/openmpi-1.8.4rc5/opal/mca/event/libevent2021/libevent
> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.4rc5-linux-x86_64-gcc-atomics/openmpi-1.8.4rc5/opal/mca/event/libevent2021/libevent/include
> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.4rc5-linux-x86_64-gcc-atomics/BLD/opal/mca/event/libevent2021/libevent/include
> -I../../../../ompi/mpi/fortran/use-mpi-tkr -g -c
> /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.4rc5-linux-x86_64-gcc-atomics/openmpi-1.8.4rc5/ompi/mpi/fortran/mpif-h/sizeof-mpif08-pre-1.8.4_f.F90
>  -fPIC -o .libs/libmpi_mpifh_sizeof_la-sizeof-mpif08-pre-1.8.4_f.o
>  In file
> /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.4rc5-linux-x86_64-gcc-atomics/openmpi-1.8.4rc5/ompi/mpi/fortran/mpif-h/sizeof-mpif08-pre-1.8.4_f.F90:32
>
>use, intrinsic :: ISO_C_BINDING
>   1
> Error: Unclassifiable statement at (1)
>  In file
> /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.8.4rc5-linux-x86_64-gcc-atomics/openmpi-1.8.4rc5/omp
> i/mpi/fortran/mpif-h/sizeof-mpif08-pre-1.8.4_f.F90:38
>
>   use, 

[OMPI devel] BUG in ADIOI_NFS_WriteStrided

2014-12-19 Thread Eric Chamberland

Hi,

I encountered a new bug while testing our collective MPI I/O 
functionality over NFS.  This is not a big issue for us, but I think 
someone should have a look at it.


While running with 3 processes, we get this error on rank #0 and rank #2, 
knowing that rank #1 has nothing to write (a zero-length contribution) on this 
particular PMPI_File_write_all_begin call:



==19211== Syscall param write(buf) points to uninitialised byte(s)
==19211==at 0x10CB739D: ??? (in /lib64/libpthread-2.17.so)
==19211==by 0x27438431: ADIOI_NFS_WriteStrided (ad_nfs_write.c:645)
==19211==by 0x27451963: ADIOI_GEN_WriteStridedColl (ad_write_coll.c:159)
==19211==by 0x274321BD: MPIOI_File_write_all_begin (write_allb.c:114)
==19211==by 0x27431DBF: mca_io_romio_dist_MPI_File_write_all_begin 
(write_allb.c:44)
==19211==by 0x2742A367: mca_io_romio_file_write_all_begin 
(io_romio_file_write.c:264)
==19211==by 0x12126520: PMPI_File_write_all_begin 
(pfile_write_all_begin.c:74)
==19211==by 0x4D7CFB: SYEnveloppeMessage 
PAIO::ecritureIndexeParBlocMPI, FunctorCopieInfosSurDansVectPAType*, std::allocator*> > const>, FunctorAccesseurPorteurLocal > >(PAGroupeProcessus&, ompi_file_t*, long long, 
PtrPorteurConst, PtrPorteurConst, 
FunctorCopieInfosSurDansVectPAType*, std::allocator*> > const>&, FunctorAccesseurPorteurLocal >&, long, DistributionComposantes&, long, unsigned long, unsigned 
long, std::string const&) (in 
/home/mefpp_ericc/GIREF/bin/Test.LectureEcritureGISMPI.opt)
==19211==by 0x4E9A67: 
GISLectureEcriture::visiteMaillage(Maillage const&) (in 
/home/mefpp_ericc/GIREF/bin/Test.LectureEcritureGISMPI.opt)
==19211==by 0x4C79A2: 
GISLectureEcriture::ecritGISMPI(std::string, 
GroupeInfoSur const&, std::string const&) (in 
/home/mefpp_ericc/GIREF/bin/Test.LectureEcritureGISMPI.opt)
==19211==by 0x4961AD: main (in 
/home/mefpp_ericc/GIREF/bin/Test.LectureEcritureGISMPI.opt)
==19211==  Address 0x295af060 is 144 bytes inside a block of size 
524,288 alloc'd
==19211==at 0x4C2C27B: malloc (in 
/usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)

==19211==by 0x2745E78E: ADIOI_Malloc_fn (malloc.c:50)
==19211==by 0x2743757C: ADIOI_NFS_WriteStrided (ad_nfs_write.c:497)
==19211==by 0x27451963: ADIOI_GEN_WriteStridedColl (ad_write_coll.c:159)
==19211==by 0x274321BD: MPIOI_File_write_all_begin (write_allb.c:114)
==19211==by 0x27431DBF: mca_io_romio_dist_MPI_File_write_all_begin 
(write_allb.c:44)
==19211==by 0x2742A367: mca_io_romio_file_write_all_begin 
(io_romio_file_write.c:264)
==19211==by 0x12126520: PMPI_File_write_all_begin 
(pfile_write_all_begin.c:74)
==19211==by 0x4D7CFB: SYEnveloppeMessage 
PAIO::ecritureIndexeParBlocMPI, FunctorCopieInfosSurDansVectPAType*, std::allocator*> > const>, FunctorAccesseurPorteurLocal > >(PAGroupeProcessus&, ompi_file_t*, long long, 
PtrPorteurConst, PtrPorteurConst, 
FunctorCopieInfosSurDansVectPAType*, std::allocator*> > const>&, FunctorAccesseurPorteurLocal >&, long, DistributionComposantes&, long, unsigned long, unsigned 
long, std::string const&) (in 
/home/mefpp_ericc/GIREF/bin/Test.LectureEcritureGISMPI.opt)
==19211==by 0x4E9A67: 
GISLectureEcriture::visiteMaillage(Maillage const&) (in 
/home/mefpp_ericc/GIREF/bin/Test.LectureEcritureGISMPI.opt)
==19211==by 0x4C79A2: 
GISLectureEcriture::ecritGISMPI(std::string, 
GroupeInfoSur const&, std::string const&) (in 
/home/mefpp_ericc/GIREF/bin/Test.LectureEcritureGISMPI.opt)
==19211==by 0x4961AD: main (in 
/home/mefpp_ericc/GIREF/bin/Test.LectureEcritureGISMPI.opt)

==19211==  Uninitialised value was created by a heap allocation
==19211==at 0x4C2C27B: malloc (in 
/usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)

==19211==by 0x2745E78E: ADIOI_Malloc_fn (malloc.c:50)
==19211==by 0x2743757C: ADIOI_NFS_WriteStrided (ad_nfs_write.c:497)
==19211==by 0x27451963: ADIOI_GEN_WriteStridedColl (ad_write_coll.c:159)
==19211==by 0x274321BD: MPIOI_File_write_all_begin (write_allb.c:114)
==19211==by 0x27431DBF: mca_io_romio_dist_MPI_File_write_all_begin 
(write_allb.c:44)
==19211==by 0x2742A367: mca_io_romio_file_write_all_begin 
(io_romio_file_write.c:264)
==19211==by 0x12126520: PMPI_File_write_all_begin 
(pfile_write_all_begin.c:74)
==19211==by 0x4D7CFB: SYEnveloppeMessage 
PAIO::ecritureIndexeParBlocMPI, FunctorCopieInfosSurDansVectPAType

[OMPI devel] Git security vulnerability, please upgrade Windows & OS X Git clients

2014-12-19 Thread Dave Goodell (dgoodell)
Quoting from 
https://github.com/blog/1938-vulnerability-announced-update-your-git-clients

"""
A critical Git security vulnerability has been announced today, affecting all 
versions of the official Git client and all related software that interacts 
with Git repositories, including GitHub for Windows and GitHub for Mac. Because 
this is a client-side only vulnerability, github.com and GitHub Enterprise are 
not directly affected.

The vulnerability concerns Git and Git-compatible clients that access Git 
repositories in a case-insensitive or case-normalizing filesystem. An attacker 
can craft a malicious Git tree that will cause Git to overwrite its own 
.git/config file when cloning or checking out a repository, leading to 
arbitrary command execution in the client machine. Git clients running on OS X 
(HFS+) or any version of Microsoft Windows (NTFS, FAT) are exploitable through 
this vulnerability. Linux clients are not affected if they run in a 
case-sensitive filesystem.

We strongly encourage all users of GitHub and GitHub Enterprise to update their 
Git clients as soon as possible, and to be particularly careful when cloning or 
accessing Git repositories hosted on unsafe or untrusted hosts.
"""

The official Git release post: 
http://article.gmane.org/gmane.linux.kernel/1853266

-Dave



Re: [OMPI devel] FT code (again)

2014-12-19 Thread George Bosilca
An opal_pmix.fence seems like a perfect replacement.

  George.


On Fri, Dec 19, 2014 at 10:26 AM, Adrian Reber  wrote:

> Again I am trying to get the FT code working. This time I am unsure how
> to resolve the code changes from this commit:
>
> commit aec5cd08bd8c33677276612b899b48618d271efa
> Author: Ralph Castain 
> Date:   Thu Aug 21 18:56:47 2014 +
>
> Per the PMIx RFC:
>
>
> This includes changes like this:
>
>
> @@ -172,17 +164,7 @@ static int rte_init(void)
>   * in the job won't be executing this step, so we would hang
>   */
>  if (ORTE_PROC_IS_NON_MPI && !orte_do_not_barrier) {
> -orte_grpcomm_collective_t coll;
> -OBJ_CONSTRUCT(&coll, orte_grpcomm_collective_t);
> -coll.id = orte_process_info.peer_modex;
> -coll.active = true;
> -if (ORTE_SUCCESS != (ret = orte_grpcomm.modex(&coll))) {
> -ORTE_ERROR_LOG(ret);
> -error = "orte modex";
> -goto error;
> -}
> -ORTE_WAIT_FOR_COMPLETION(coll.active);
> -OBJ_DESTRUCT(&coll);
> +opal_pmix.fence(NULL, 0);
>  }
>
>
> In the FT code in orte/mca/ess/env/ess_env_module.c there is similar code:
>
> OBJ_CONSTRUCT(&coll, orte_grpcomm_collective_t);
> coll.id = orte_process_info.snapc_init_barrier;
>
> ...
>
> if (ORTE_SUCCESS != (ret = orte_grpcomm.barrier(&coll))) {
>
> ...
>
> coll.active = true;
> ORTE_WAIT_FOR_COMPLETION(coll.active);
>
>
> How can this be expressed with the new code?
>
>
> Adrian
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/12/16688.php
>


[OMPI devel] FT code (again)

2014-12-19 Thread Adrian Reber
Again I am trying to get the FT code working. This time I am unsure how
to resolve the code changes from this commit:

commit aec5cd08bd8c33677276612b899b48618d271efa
Author: Ralph Castain 
Date:   Thu Aug 21 18:56:47 2014 +

Per the PMIx RFC:


This includes changes like this:


@@ -172,17 +164,7 @@ static int rte_init(void)
  * in the job won't be executing this step, so we would hang
  */
 if (ORTE_PROC_IS_NON_MPI && !orte_do_not_barrier) {
-orte_grpcomm_collective_t coll;
-OBJ_CONSTRUCT(&coll, orte_grpcomm_collective_t);
-coll.id = orte_process_info.peer_modex;
-coll.active = true;
-if (ORTE_SUCCESS != (ret = orte_grpcomm.modex(&coll))) {
-ORTE_ERROR_LOG(ret);
-error = "orte modex";
-goto error;
-}
-ORTE_WAIT_FOR_COMPLETION(coll.active);
-OBJ_DESTRUCT(&coll);
+opal_pmix.fence(NULL, 0);
 }


In the FT code in orte/mca/ess/env/ess_env_module.c there is similar code:

OBJ_CONSTRUCT(&coll, orte_grpcomm_collective_t);
coll.id = orte_process_info.snapc_init_barrier;

...

if (ORTE_SUCCESS != (ret = orte_grpcomm.barrier(&coll))) {

...

coll.active = true;
ORTE_WAIT_FOR_COMPLETION(coll.active);


How can this be expressed with the new code?


Adrian


Re: [OMPI devel] Still getting DDT test assert fails

2014-12-19 Thread Jeff Squyres (jsquyres)
I posted the full output from running the test on the still-open issue about 
this:

https://github.com/open-mpi/ompi/issues/294#issuecomment-67638568


On Dec 19, 2014, at 6:46 AM, Jeff Squyres (jsquyres)  wrote:

> George --
> 
> You uncommented the "#if 0 ..." section in the opal datatype test yesterday 
> (https://github.com/open-mpi/ompi/commit/1895f29537820ee06492ae3b2e66c1cf5ef78c70),
>  but we're still getting assert fails on opal_datatype_test.  It caused the 
> nightly tarball to fail last night, and I'm able to reproduce this on a Linux 
> x86_64 machine (but not on my Mac laptop):
> 
> -
> #0  0x003491632925 in raise () from /lib64/libc.so.6
> #1  0x003491634105 in abort () from /lib64/libc.so.6
> #2  0x00349162ba4e in __assert_fail_base () from /lib64/libc.so.6
> #3  0x00349162bb10 in __assert_fail () from /lib64/libc.so.6
> #4  0x00403bb5 in local_copy_with_convertor (pdt=0x60e550, 
> count=4500, 
>chunk=956) at opal_datatype_test.c:438
> #5  0x00405a86 in main (argc=1, argv=0x7fffd2a8)
>at opal_datatype_test.c:667
> -
> 
> Specifically, it fails on this line:
> 
> -
>if(outputFlags & QUIT_ON_FIRST_ERROR) { assert(0); exit(-1); }
> -
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/12/16686.php


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



[OMPI devel] Still getting DDT test assert fails

2014-12-19 Thread Jeff Squyres (jsquyres)
George --

You uncommented the "#if 0 ..." section in the opal datatype test yesterday 
(https://github.com/open-mpi/ompi/commit/1895f29537820ee06492ae3b2e66c1cf5ef78c70),
 but we're still getting assert fails on opal_datatype_test.  It caused the 
nightly tarball to fail last night, and I'm able to reproduce this on a Linux 
x86_64 machine (but not on my Mac laptop):

-
#0  0x003491632925 in raise () from /lib64/libc.so.6
#1  0x003491634105 in abort () from /lib64/libc.so.6
#2  0x00349162ba4e in __assert_fail_base () from /lib64/libc.so.6
#3  0x00349162bb10 in __assert_fail () from /lib64/libc.so.6
#4  0x00403bb5 in local_copy_with_convertor (pdt=0x60e550, count=4500, 
chunk=956) at opal_datatype_test.c:438
#5  0x00405a86 in main (argc=1, argv=0x7fffd2a8)
at opal_datatype_test.c:667
-

Specifically, it fails on this line:

-
if(outputFlags & QUIT_ON_FIRST_ERROR) { assert(0); exit(-1); }
-

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] libfabric, config.h and hwloc

2014-12-19 Thread Gilles Gouaillardet
Jeff,

the include path is $top_srcdir/opal/mca/event/libevent2021/libevent
and the libevent config.h is in
$top_builddir/opal/mca/event/libevent2021/libevent

so if you do not use VPATH, $top_srcdir = $top_builddir and make succeeds,
but since I use VPATH, $top_srcdir != $top_builddir and there is no
config.h in my include path,
and hence make fails.

Cheers,

Gilles



 On 2014/12/19 4:12, Jeff Squyres (jsquyres) wrote:
> On Dec 18, 2014, at 3:13 AM, Gilles Gouaillardet 
>  wrote:
>
>> currently, ompi master cannot be built if configured with
>> --without-hwloc *and without* --without-libfabric.
>>
>> the root cause is HAVE_CONFIG_H is defined but no config.h file is found.
>>
>> I dug a bit and found that config.h is taken from a hwloc directory
>> (if the --without-hwloc option is not used),
>> so even if this "works", it is unlikely to be the expected behavior.
> Mmm.  I see what you're saying -- the libfabric code expects there to be a 
> config.h file; it'll basically take any config.h that's in the include path.
>
> I actually find several config.h's in the tree:
>
> -
> $ find . -name config.h
> ./opal/libltdl/config.h
> ./opal/mca/hwloc/hwloc191/hwloc/include/hwloc/autogen/config.h
> ./opal/mca/hwloc/hwloc191/hwloc/include/private/autogen/config.h
> ./opal/mca/event/libevent2021/libevent/config.h
> ./ompi/contrib/vt/vt/extlib/otf/config.h
> ./ompi/contrib/vt/vt/config.h
> -
>
> However, even if I exclude libltdl, vt, and hwloc (so that there's only a 
> single config.h left in the tree -- for libevent), everything still works:
>
> 
> $ ./configure --prefix=$bogus --disable-dlopen --disable-vt --without-hwloc
> ...etc...
> $ make
> [...succeeds...]
> $ find . -name config.h
> ./opal/mca/event/libevent2021/libevent/config.h
> -
>
> So I agree with you that it only works by chance (sorta -- libevent's 
> config.h will still have all the Right Stuff in it), but I can't find a case
> that fails.
>
> Can you detail what your specific case is that is failing?
>
> (SIDENOTE: I might not be able to find the failure because of what I mention 
> below...)
>
>> the attached patch fixes some missing #ifdef
> Good catch.  I fixed those a different way (just deleted the #includes -- 
> they weren't necessary); I committed the fix both in OMPI and upstream in 
> libfabric.
>
>> my last (cosmetic) comment is about
>> $srcdir/opal/mca/common/libfabric/Makefile.in (and several other
>> Makefile.in):
>> [snipped]
> Good catch.  Fixed in 
> https://github.com/open-mpi/ompi/commit/be6d46490f7b80d4f5ea90c859ccbebe96bdaaba.
>   And then later fixed *that* in a followup commit.  :-(
>