Re: [OMPI devel] 2.0.0rc4 Crash in MPI_File_write_all_end

2016-07-14 Thread Eric Chamberland

Hi Gilles,

On 13/07/16 08:01 PM, Gilles Gouaillardet wrote:

Eric,


OpenMPI 2.0.0 has been released, so the fix should land into the v2.x
branch shortly.


ok, thanks again.



If I understand correctly, your script downloads/compiles OpenMPI and then
downloads/compiles PETSc.

More precisely, for OpenMPI I am cloning 
https://github.com/open-mpi/ompi.git, and for PETSc I just compile the 
latest version proven stable with our code, which is now 3.7.2.



If this is correct then, for the time being, feel free to patch Open MPI
v2.x before compiling it; the fix can be downloaded at
https://patch-diff.githubusercontent.com/raw/open-mpi/ompi-release/pull/1263.patch



Ok, but I think it is already included in the master of the clone I 
get... :)


Cheers,

Eric




Cheers,


Gilles


On 7/14/2016 3:37 AM, Eric Chamberland wrote:

Hi Howard,

ok, I will wait for 2.0.1rcX... ;)

I've put in place a script to download/compile OpenMPI+PETSc(3.7.2)
and our code from the git repos.

Now I am in a somewhat uncomfortable situation where neither the
ompi-release.git nor the ompi.git repo is working for me.

The former gives me the errors with MPI_File_write_all_end I reported,
while the latter gives me errors like these:

[lorien:106919] [[INVALID],INVALID] ORTE_ERROR_LOG: Bad parameter in
file ess_singleton_module.c at line 167
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***and potentially your MPI job)
[lorien:106919] Local abort before MPI_INIT completed completed
successfully, but am not able to aggregate error messages, and not
able to guarantee that all other processes were killed!

So, for my continuous integration of OpenMPI I am in a no man's
land... :(

Thanks anyway for the follow-up!

Eric

On 13/07/16 07:49 AM, Howard Pritchard wrote:

Hi Eric,

Thanks very much for finding this problem.  We decided that, in order to have
a reasonably timely release, we'd triage issues and only turn around a new RC
if something drastic appeared.  We want to fix this issue (and it will be
fixed), but we've decided to defer the fix to a 2.0.1 bug fix release.

Howard



2016-07-12 13:51 GMT-06:00 Eric Chamberland
<eric.chamberl...@giref.ulaval.ca>:

Hi Edgard,

I just saw that your patch got into ompi/master... any chances it
goes into ompi-release/v2.x before rc5?

thanks,

Eric


On 08/07/16 03:14 PM, Edgar Gabriel wrote:

I think I found the problem, I filed a pr towards master, and if
that
passes I will file a pr for the 2.x branch.

Thanks!
Edgar


On 7/8/2016 1:14 PM, Eric Chamberland wrote:


On 08/07/16 01:44 PM, Edgar Gabriel wrote:

ok, but just to be able to construct a test case,
basically what you are
doing is

MPI_File_write_all_begin (fh, NULL, 0, some datatype);

MPI_File_write_all_end (fh, NULL, &status),

is this correct?

Yes, but with 2 processes:

rank 0 writes something, but not rank 1...

other info: rank 0 didn't wait for rank 1 after MPI_File_write_all_end,
so it continued to the next MPI_File_write_all_begin with a different
datatype but on the same file...

thanks!

Eric
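
For reference, a minimal standalone reproducer along the lines described in
this exchange might look like the sketch below (run with 2 processes); the
file name "testfile", the buffer contents, and the concrete datatypes
(MPI_INT for the first write, MPI_DOUBLE for the second) are illustrative
assumptions only, not taken from the actual code.

/* Hedged sketch of a minimal reproducer for the scenario described above.
 * Run with 2 processes.  The file name, buffer contents and the concrete
 * datatypes (MPI_INT then MPI_DOUBLE) are illustrative assumptions only. */
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_File   fh;
    MPI_Status status;
    int        rank;
    int        idata[4] = {1, 2, 3, 4};
    double     ddata[2] = {1.0, 2.0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_File_open(MPI_COMM_WORLD, "testfile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* First split collective write: rank 0 contributes data, rank 1 contributes nothing. */
    if (rank == 0) {
        MPI_File_write_all_begin(fh, idata, 4, MPI_INT);
        MPI_File_write_all_end(fh, idata, &status);
    } else {
        MPI_File_write_all_begin(fh, NULL, 0, MPI_INT);
        MPI_File_write_all_end(fh, NULL, &status);
    }

    /* No synchronization here: rank 0 goes straight to a second split
     * collective write on the same file with a different datatype. */
    if (rank == 0) {
        MPI_File_write_all_begin(fh, ddata, 2, MPI_DOUBLE);
        MPI_File_write_all_end(fh, ddata, &status);
    } else {
        MPI_File_write_all_begin(fh, NULL, 0, MPI_DOUBLE);
        MPI_File_write_all_end(fh, NULL, &status);
    }

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}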


Re: [OMPI devel] 2.0.0rc4 Crash in MPI_File_write_all_end

2016-07-14 Thread Eric Chamberland

Thanks Ralph,

It is now *much* better: all sequential executions are working... ;)
but I still have issues with a lot of parallel tests... (but not all)

The SHA tested last night was c3c262b.

http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.07.14.01h20m32s_config.log

Here is the backtrace for most of these issues:

*** Error in 
`/pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/bin/Test.ProblemeGD.opt': 
free(): invalid pointer: 0x7f9ab09c6020 ***

=== Backtrace: =
/lib64/libc.so.6(+0x7277f)[0x7f9ab019b77f]
/lib64/libc.so.6(+0x78026)[0x7f9ab01a1026]
/lib64/libc.so.6(+0x78d53)[0x7f9ab01a1d53]
/opt/openmpi-2.x_opt/lib/openmpi/mca_pml_ob1.so(+0x172a1)[0x7f9aa3df32a1]
/opt/openmpi-2.x_opt/lib/libmpi.so.0(MPI_Request_free+0x4c)[0x7f9ab0761dac]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(+0x4adaf9)[0x7f9ab7fa2af9]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(VecScatterDestroy+0x68d)[0x7f9ab7f9dc35]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(+0x4574e7)[0x7f9ab7f4c4e7]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(VecDestroy+0x648)[0x7f9ab7ef28ca]
/pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/lib/libgiref_opt_Petsc.so(_Z15GIREFVecDestroyRP6_p_Vec+0xe)[0x7f9abc9746de]
/pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/lib/libgiref_opt_Petsc.so(_ZN12VecteurPETScD1Ev+0x31)[0x7f9abca8bfa1]
/pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/lib/libgiref_opt_Petsc.so(_ZN10SolveurGCPD2Ev+0x20c)[0x7f9abc9a013c]
/pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/lib/libgiref_opt_Petsc.so(_ZN10SolveurGCPD0Ev+0x9)[0x7f9abc9a01f9]
/pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/lib/libgiref_opt_Formulation.so(_ZN10ProblemeGDD2Ev+0x42)[0x7f9abeeb94e2]
/pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/bin/Test.ProblemeGD.opt[0x4159b9]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f9ab014ab25]
/pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/bin/Test.ProblemeGD.opt[0x4084dc]

The very same code and tests are all working well with 
openmpi-1.{8.4,10.2} and the same version of PETSc...


And the segfault with MPI_File_write_all_end seems gone... Thanks to 
Edgar! :)


Btw, I am wondering when I should report a bug and when I should not, since 
I am "blindly" cloning around 01h20 each day, independently of the 
"status" of master...  I don't want to bother anyone on this list 
with annoying bug reports...  So please tell me what you would like...


Thanks,

Eric


On 13/07/16 08:36 PM, Ralph Castain wrote:

Fixed on master


On Jul 13, 2016, at 12:47 PM, Jeff Squyres (jsquyres)  
wrote:

I literally just noticed that this morning (that singleton was broken on 
master), but hadn't gotten to bisecting / reporting it yet...

I also haven't tested 2.0.0.  I really hope singletons aren't broken then...

/me goes to test 2.0.0...

Whew -- 2.0.0 singletons are fine.  :-)



On Jul 13, 2016, at 3:01 PM, Ralph Castain  wrote:

Hmmm…I see where the singleton on master might be broken - will check later 
today


On Jul 13, 2016, at 11:37 AM, Eric Chamberland 
 wrote:

Hi Howard,

ok, I will wait for 2.0.1rcX... ;)

I've put in place a script to download/compile OpenMPI+PETSc(3.7.2) and our 
code from the git repos.

Now I am in a somewhat uncomfortable situation where neither the 
ompi-release.git nor the ompi.git repo is working for me.

The former gives me the errors with MPI_File_write_all_end I reported, while 
the latter gives me errors like these:

[lorien:106919] [[INVALID],INVALID] ORTE_ERROR_LOG: Bad parameter in file 
ess_singleton_module.c at line 167
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***and potentially your MPI job)
[lorien:106919] Local abort before MPI_INIT completed completed successfully, 
but am not able to aggregate error messages, and not able to guarantee that all 
other processes were killed!

So, for my continuous integration of OpenMPI I am in a no man's land... :(

Thanks anyway for the follow-up!

Eric

On 13/07/16 07:49 AM, Howard Pritchard wrote:

Hi Eric,

Thanks very much for finding this problem.  We decided that, in order to have
a reasonably timely release, we'd triage issues and only turn around a new RC
if something drastic appeared.  We want to fix this issue (and it will be
fixed), but we've decided to defer the fix to a 2.0.1 bug fix release.

Howard



2016-07-12 13:51 GMT-06:00 Eric Chamberland
<eric.chamberl...@giref.ulaval.ca>:

 Hi Edgard,

 I just saw that your patch got into ompi/master... any chances it
 goes into ompi-release/v2.x before rc5?

 thanks,

 Eric


 On 08/07/16 03:14 PM, Edgar Gabriel wrote:

 I think I found the problem, I filed a pr towards master, and if
 that
 passes I will file a pr for the 2.x branch.

 Thanks!
 Edgar


 On 7/8/2016 1:14 PM, Eric Chamberland wrote:


 On 08/07/1

Re: [OMPI devel] 2.0.0rc4 Crash in MPI_File_write_all_end

2016-07-14 Thread Jeff Squyres (jsquyres)
This looks like a new error -- something is potentially going wrong in 
MPI_Request_free (or perhaps in the underlying progress engine invoked by 
MPI_Request_free).

I think cloning at that time and running tests is absolutely fine.

We tend to track our bugs in Github issues, so if you'd like to file future 
issues there, that would likely save a step.

I filed an issue for this one:

https://github.com/open-mpi/ompi/issues/1875
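
For context, the backtrace points at MPI_Request_free being called from
PETSc's VecScatterDestroy. Below is a hedged sketch of that general pattern
-- persistent requests created with MPI_Send_init/MPI_Recv_init, used a few
times, then released with MPI_Request_free during teardown. It is
illustrative only; all names, message sizes and the ring exchange are
assumptions and it is not taken from the PETSc sources.

/* Hedged sketch (not PETSc source): the general lifecycle the backtrace
 * suggests -- persistent requests created up front, started/completed a
 * few times, then released with MPI_Request_free during teardown.
 * All names, counts and the ring pattern are illustrative assumptions. */
#include <mpi.h>

int main(int argc, char *argv[])
{
    int         rank, size, next, prev, i;
    double      sendbuf[16], recvbuf[16];
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    next = (rank + 1) % size;
    prev = (rank + size - 1) % size;

    for (i = 0; i < 16; i++)
        sendbuf[i] = (double)rank;

    /* Setup phase: persistent requests, as a VecScatter-like object might keep. */
    MPI_Send_init(sendbuf, 16, MPI_DOUBLE, next, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Recv_init(recvbuf, 16, MPI_DOUBLE, prev, 0, MPI_COMM_WORLD, &reqs[1]);

    /* Use phase: start and complete the persistent requests several times. */
    for (i = 0; i < 4; i++) {
        MPI_Startall(2, reqs);
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }

    /* Teardown phase: release the now-inactive persistent requests.
       MPI_Request_free is the call that shows up in the backtrace above. */
    MPI_Request_free(&reqs[0]);
    MPI_Request_free(&reqs[1]);

    MPI_Finalize();
    return 0;
}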


> On Jul 14, 2016, at 9:47 AM, Eric Chamberland 
>  wrote:
> 
> Thanks Ralph,
> 
> It is now *much* better: all sequential executions are working... ;)
> but I still have issues with a lot of parallel tests... (but not all)
> 
> The SHA tested last night was c3c262b.
> 
> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.07.14.01h20m32s_config.log
> 
> Here is the backtrace for most of these issues:
> 
> *** Error in 
> `/pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/bin/Test.ProblemeGD.opt':
>  free(): invalid pointer: 0x7f9ab09c6020 ***
> === Backtrace: =
> /lib64/libc.so.6(+0x7277f)[0x7f9ab019b77f]
> /lib64/libc.so.6(+0x78026)[0x7f9ab01a1026]
> /lib64/libc.so.6(+0x78d53)[0x7f9ab01a1d53]
> /opt/openmpi-2.x_opt/lib/openmpi/mca_pml_ob1.so(+0x172a1)[0x7f9aa3df32a1]
> /opt/openmpi-2.x_opt/lib/libmpi.so.0(MPI_Request_free+0x4c)[0x7f9ab0761dac]
> /opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(+0x4adaf9)[0x7f9ab7fa2af9]
> /opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(VecScatterDestroy+0x68d)[0x7f9ab7f9dc35]
> /opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(+0x4574e7)[0x7f9ab7f4c4e7]
> /opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(VecDestroy+0x648)[0x7f9ab7ef28ca]
> /pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/lib/libgiref_opt_Petsc.so(_Z15GIREFVecDestroyRP6_p_Vec+0xe)[0x7f9abc9746de]
> /pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/lib/libgiref_opt_Petsc.so(_ZN12VecteurPETScD1Ev+0x31)[0x7f9abca8bfa1]
> /pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/lib/libgiref_opt_Petsc.so(_ZN10SolveurGCPD2Ev+0x20c)[0x7f9abc9a013c]
> /pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/lib/libgiref_opt_Petsc.so(_ZN10SolveurGCPD0Ev+0x9)[0x7f9abc9a01f9]
> /pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/lib/libgiref_opt_Formulation.so(_ZN10ProblemeGDD2Ev+0x42)[0x7f9abeeb94e2]
> /pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/bin/Test.ProblemeGD.opt[0x4159b9]
> /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f9ab014ab25]
> /pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/bin/Test.ProblemeGD.opt[0x4084dc]
> 
> The very same code and tests are all working well with openmpi-1.{8.4,10.2} 
> and the same version of PETSc...
> 
> And the segfault with MPI_File_write_all_end seems gone... Thanks to Edgar! :)
> 
> Btw, I am wondering when I should report a bug and when I should not, since I am 
> "blindly" cloning around 01h20 each day, independently of the "status" of 
> master...  I don't want to bother anyone on this list with annoying bug 
> reports...  So please tell me what you would like...
> 
> Thanks,
> 
> Eric
> 
> 
> On 13/07/16 08:36 PM, Ralph Castain wrote:
>> Fixed on master
>> 
>>> On Jul 13, 2016, at 12:47 PM, Jeff Squyres (jsquyres)  
>>> wrote:
>>> 
>>> I literally just noticed that this morning (that singleton was broken on 
>>> master), but hadn't gotten to bisecting / reporting it yet...
>>> 
>>> I also haven't tested 2.0.0.  I really hope singletons aren't broken then...
>>> 
>>> /me goes to test 2.0.0...
>>> 
>>> Whew -- 2.0.0 singletons are fine.  :-)
>>> 
>>> 
 On Jul 13, 2016, at 3:01 PM, Ralph Castain  wrote:
 
 Hmmm…I see where the singleton on master might be broken - will check 
 later today
 
> On Jul 13, 2016, at 11:37 AM, Eric Chamberland 
>  wrote:
> 
> Hi Howard,
> 
> ok, I will wait for 2.0.1rcX... ;)
> 
> I've put in place a script to download/compile OpenMPI+PETSc(3.7.2) and 
> our code from the git repos.
> 
> Now I am in a somewhat uncomfortable situation where neither the 
> ompi-release.git nor the ompi.git repo is working for me.
> 
> The former gives me the errors with MPI_File_write_all_end I reported, 
> while the latter gives me errors like these:
> 
> [lorien:106919] [[INVALID],INVALID] ORTE_ERROR_LOG: Bad parameter in file 
> ess_singleton_module.c at line 167
> *** An error occurred in MPI_Init_thread
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***and potentially your MPI job)
> [lorien:106919] Local abort before MPI_INIT completed completed 
> successfully, but am not able to aggregate error messages, and not able 
> to guarantee that all other processes were killed!
> 
> So, for my continuous integration of OpenMPI I am in a no man's land... :(
> 
> Thanks anyway for the follow-up!
> 
> Eric
> 
> On 13/

[OMPI devel] MPI_Init() affecting rand()

2016-07-14 Thread Cabral, Matias A
Hi All,

Doing a quick test with rand()/srand(), I found that MPI_Init() seems to be 
calling a function in that family that affects the values seen in the user 
application.  Please see my simple test and the results below. Yes, moving the 
second call to srand() after MPI_Init() solves the problem. However, I'm 
confused, since this was supposedly addressed in version 1.7.5. From the release 
notes:


1.7.5 20 Mar 2014:



- OMPI now uses its own internal random number generator and will not perturb 
srand() and friends.


I tested on OMPI 1.10.2 and 1.10.3. The result is deterministic.



Any ideas?



Thanks,
Regards,

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rand1;
    int rand2;
    int name_len;   /* unused in this test */

    /* First draw: seed and call rand() before MPI_Init(). */
    srand(10);
    rand1 = rand();

    /* Re-seed with the same value, then let MPI_Init() run before drawing again. */
    srand(10);
    MPI_Init(&argc, &argv);
    rand2 = rand();

    if (rand1 != rand2) {
        printf("%d != %d\n", rand1, rand2);
        fflush(stdout);
    }
    else {
        printf("%d == %d\n", rand1, rand2);
        fflush(stdout);
    }

    MPI_Finalize();
    return 0;
}


host1:/tmp> mpirun -np 1 -host host1 -mca pml ob1 -mca btl tcp,self ./rand1

964940668 != 865007240
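
For reference, a sketch of the workaround mentioned above (not part of the
original message): seed the generator only after MPI_Init(), so that whatever
MPI_Init() calls internally cannot perturb the application's sequence.

/* Hedged sketch of the workaround: seed after MPI_Init(). Illustrative only. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int value;

    MPI_Init(&argc, &argv);

    srand(10);          /* seed only after MPI_Init() has returned */
    value = rand();
    printf("first draw after seeding: %d\n", value);
    fflush(stdout);

    MPI_Finalize();
    return 0;
}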


_MAC



[OMPI devel] Git / Github branching and repo plans

2016-07-14 Thread Jeff Squyres (jsquyres)
We have talked about this on the weekly calls, but for those of you who have 
not been able to attend, here's a summary of our expected plans with git 
branches, github repos, etc.:

1. v2.0.0 has been released.  Yay!

2. A picture is worth 1,000 words: below is the git branching plan for the v2.x 
series going forward (more discussion below):

[Inline image: git branching plan diagram for the v2.x series]

3. Today will be the end of the mandatory 2-day hold on merging anything into 
the v2.x branch.  We have this 2-day hold for the "oh crud!" factor after a 
large release -- i.e., in case anything major is discovered right after the 
release, we can do a small commit to fix the problem, and then re-release 
(without anything else new).  This means that tomorrow, Howard and I will start 
merging some of the existing v2.0.1 PRs.  We'll likely merge in several a day 
and let MTT and friends chug through them.  It'll probably only take a few days 
to chug through the v2.0.1 PRs -- most of them are small / low risk.

4. As a reminder, v2.0.1 is ONLY for bug fixes.  No new features will be 
accepted.  Backwards compatibility with v2.0.0 MUST be preserved.

5. Once we have finished merging most (all?) v2.0.1 PRs, we'll create a v2.0.x 
branch from the v2.x branch.  The v2.0.1 release (and subsequent v2.0.x 
releases) will come from that branch.

6. The v2.x branch will continue on and eventually become v2.1.0.  v2.1.0 will 
be backwards compatible with v2.0.0.

7. We may need to have pull requests to multiple release branches (e.g., to fix 
a bug in both v2.0.1 and v2.1.0).

8. As a reminder, we have two Github repos: ompi and ompi-release.  ompi 
contains the development master, and ompi-release contains all the release 
branches.  We did this split because Github did not support per-branch ACLs 
at the time.  Now it does, and now that we have released v2.0.0 we can 
officially start thinking about folding the ompi-release repo back into the 
ompi repo.  ...however, as you all know, we're in the middle of migrating all 
of Open MPI's hosting infrastructure away from Indiana U., and that's taking 
quite a bit of time and effort.  As such, the "fold ompi-release back into 
ompi" plan may get delayed a bit.  Sorry folks -- there are only so many 
sysadmin-related cycles to go around, and the migration efforts have concrete 
deadlines.  :-\  Bear with us until we can get this done.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/