Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-10-06 Thread Gilles Gouaillardet
Eric, 2.0.2 is scheduled to happen. 2.1.0 will bring some new features whereas v2.0.2 is a bug-fix release. My guess is v2.0.2 will come first, but this is just a guess (even if v2.1.0 comes first, v2.0.2 will be released anyway) Cheers, Gilles On 10/7/2016 2:34 AM, Eric Chamberland wro

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-10-06 Thread Eric Chamberland
Hi Gilles, just to mention that since PR 2091 has been merged into 2.0.x, I haven't had any failures! Since 2.0.0 and 2.0.1 aren't usable for us, the next version should be a good one... So will there be a 2.0.2 release, or will it go to 2.1.0 directly? Thanks, Eric On 16/09/16 10:01 AM

Re: [OMPI devel] OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-30 Thread Jeremy McCaslin
Hi, Can I please be removed from this list? Thanks, Jeremy On Thu, Sep 15, 2016 at 8:44 AM, r...@open-mpi.org wrote: > I don’t think a collision was the issue here. We were taking the > mpirun-generated jobid and passing it thru the hash, thus creating an > incorrect and invalid value. What I’

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-16 Thread Gilles Gouaillardet
Eric, I expect the PR will fix this bug. The crash occurs after the unexpected process identifier error, and that error should not happen in the first place. So at this stage, I would not worry too much about the crash (to me, it is undefined behavior anyway). Cheers, Gilles On Friday, September

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-16 Thread Eric Chamberland
Hi, I know the pull request has not (yet) been merged, but here is a somewhat "different" output from a single sequential test (automatically) launched without mpirun last night: [lorien:172229] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL [lorien:172229] plm:base:set_hnp_

Re: [OMPI devel] OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-15 Thread r...@open-mpi.org
I don’t think a collision was the issue here. We were taking the mpirun-generated jobid and passing it thru the hash, thus creating an incorrect and invalid value. What I’m more surprised by is that it doesn’t -always- fail. Only thing I can figure is that, unlike with PMIx, the usock oob compon

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-15 Thread r...@open-mpi.org
It’s okay - it was just confusing. This actually wound up having nothing to do with how the jobid is generated. The root cause of the problem was that we took an mpirun-generated jobid, and then mistakenly passed it back thru a hash function instead of just using it. So we hashed a perfectly goo

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-15 Thread Joshua Ladd
Ralph, We love PMIx :). In this context, when I say PMIx, I am referring to the PMIx framework in OMPI/OPAL, not the standalone PMIx library. Sorry that wasn't clear. Josh On Thu, Sep 15, 2016 at 10:07 AM, r...@open-mpi.org wrote: > I don’t understand this fascination with PMIx. PMIx didn’t ca

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-15 Thread r...@open-mpi.org
I don’t understand this fascination with PMIx. PMIx didn’t calculate this jobid - OMPI did. Yes, it is in the opal/pmix layer, but it had -nothing- to do with PMIx. So why do you want to continue to blame PMIx for this problem?? > On Sep 15, 2016, at 4:29 AM, Joshua Ladd wrote: > > Great cat

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-15 Thread Eric Chamberland
Hi Gilles, On 15/09/16 03:38 AM, Gilles Gouaillardet wrote: Eric, a bug has been identified, and a patch is available at https://patch-diff.githubusercontent.com/raw/open-mpi/ompi-release/pull/1376.patch the bug is specific to singleton mode (e.g. ./a.out vs mpirun -np 1 ./a.out), so if appl

Re: [OMPI devel] OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-15 Thread Gilles Gouaillardet
I just realized I screwed up my test, and I was missing some relevant info... So on one hand, I fixed a bug in singleton mode, but on the other hand, I cannot tell whether a collision was involved in this issue. Cheers, Gilles Joshua Ladd wrote: >Great catch, Gilles! Not much of a surprise though. 

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-15 Thread Joshua Ladd
Great catch, Gilles! Not much of a surprise though. Indeed, this issue has EVERYTHING to do with how PMIx is calculating the jobid, which, in this case, results in hash collisions. ;-P Josh On Thursday, September 15, 2016, Gilles Gouaillardet wrote: > Eric, > > > a bug has been identified, and

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-15 Thread Gilles Gouaillardet
Eric, a bug has been identified, and a patch is available at https://patch-diff.githubusercontent.com/raw/open-mpi/ompi-release/pull/1376.patch the bug is specific to singleton mode (e.g. ./a.out vs mpirun -np 1 ./a.out), so if applying a patch does not fit your test workflow, it might be

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-15 Thread Gilles Gouaillardet
Ralph, I fixed master at https://github.com/open-mpi/ompi/commit/11ebf3ab23bdaeb0ec96818c119364c6d837cd3b and a PR for v2.x at https://github.com/open-mpi/ompi-release/pull/1376 Cheers, Gilles On 9/15/2016 12:26 PM, r...@open-mpi.org wrote: Ah...I take that back. We changed this and now

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-14 Thread r...@open-mpi.org
Just in the FWIW category: the HNP used to send the singleton’s name down the pipe at startup, which eliminated the code line you identified. Now, we are pushing the name into the environment as a PMIx envar, and having the PMIx component pick it up. Roundabout way of getting it, and that’s what

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-14 Thread r...@open-mpi.org
Ah...I take that back. We changed this and now we _do_ indeed go down that code path. Not good. So yes, we need that putenv so it gets the jobid from the HNP that was launched, like it used to do. You want to throw that in? Thanks Ralph > On Sep 14, 2016, at 8:18 PM, r...@open-mpi.org wrote: >

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-14 Thread r...@open-mpi.org
Nah, something isn’t right here. The singleton doesn’t go thru that code line, or it isn’t supposed to do so. I think the problem lies in the way the singleton in 2.x is starting up. Let me take a look at how singletons are working over there. > On Sep 14, 2016, at 8:10 PM, Gilles Gouaillardet

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-14 Thread Gilles Gouaillardet
Ralph, I think I just found the root cause :-) from pmix1_client_init() in opal/mca/pmix/pmix112/pmix1_client.c /* store our jobid and rank */ if (NULL != getenv(OPAL_MCA_PREFIX"orte_launch")) { /* if we were launched by the OMPI RTE, then * the jobid is in a special fo
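
The decision being described is roughly the one sketched below. This is a simplified, hypothetical illustration, not the actual source in opal/mca/pmix/pmix112/pmix1_client.c; the helpers convert_orte_jobid_string() and hash_nspace_to_jobid() are invented stand-ins for the real conversion and hashing logic.

    /* Hypothetical sketch of the jobid decision discussed above; not Open MPI code. */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define OPAL_MCA_PREFIX "OMPI_MCA_"

    /* Stand-in: parse an nspace string that already encodes an RTE-assigned jobid. */
    static uint32_t convert_orte_jobid_string(const char *nspace)
    {
        return (uint32_t) strtoul(nspace, NULL, 10);
    }

    /* Stand-in: hash an arbitrary nspace string (djb2-style) into a jobid. */
    static uint32_t hash_nspace_to_jobid(const char *nspace)
    {
        uint32_t h = 5381;
        for (; '\0' != *nspace; nspace++) {
            h = (h * 33) ^ (uint32_t) (unsigned char) *nspace;
        }
        return h;
    }

    static uint32_t derive_jobid(const char *nspace)
    {
        if (NULL != getenv(OPAL_MCA_PREFIX "orte_launch")) {
            /* Launched by the OMPI RTE (mpirun): the nspace already carries a
             * valid jobid and must be used directly. Feeding it through the
             * hash anyway (the mistake discussed in this thread) produces an
             * incorrect, invalid jobid. */
            return convert_orte_jobid_string(nspace);
        }
        /* No RTE-provided jobid (e.g. singleton / direct launch): hashing the
         * nspace string is the fallback. */
        return hash_nspace_to_jobid(nspace);
    }

    int main(void)
    {
        printf("jobid = %u\n", derive_jobid("1234567890"));
        return 0;
    }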

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-14 Thread Eric Chamberland
Ok, one test segfaulted *but* I can't tell if it is the *same* bug because there has been a segfault: stderr: http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.faultyCerr.Triangle.h_cte_1.txt [lorien:190552] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-14 Thread r...@open-mpi.org
Many things are possible, given infinite time :-) The issue with this notion lies in direct launch scenarios - i.e., when procs are launched directly by the RM and not via mpirun. In this case, there is nobody who can give us the session directory (well, until PMIx becomes universal), and so th

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-14 Thread Gilles Gouaillardet
Ralph, is there any reason to use a session directory based on the jobid (or job family)? I mean, could we use mkstemp to generate a unique directory, and then propagate the path via orted comm or the environment? Cheers, Gilles On Wednesday, September 14, 2016, r...@open-mpi.org wrote: > T
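
As a rough sketch of that suggestion (not code from the Open MPI tree): mkdtemp(3) creates a unique directory in one call, and the resulting path could then be handed to child processes through the environment. The variable name OMPI_SESSION_DIR_SKETCH below is invented for the example.

    /* Sketch of the "unique session directory" idea: create the directory with
     * mkdtemp() and publish its path via the environment. Hypothetical example,
     * not Open MPI code; OMPI_SESSION_DIR_SKETCH is an invented variable name. */
    #define _XOPEN_SOURCE 700
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        char path[] = "/tmp/ompi-session-XXXXXX";

        /* mkdtemp() replaces the XXXXXX with a unique suffix and creates the
         * directory with mode 0700, so concurrent launches cannot collide. */
        if (NULL == mkdtemp(path)) {
            perror("mkdtemp");
            return EXIT_FAILURE;
        }

        /* Propagate the path so forked children (e.g. an orted or the
         * application process) can find the same session directory. */
        if (0 != setenv("OMPI_SESSION_DIR_SKETCH", path, 1)) {
            perror("setenv");
            return EXIT_FAILURE;
        }

        printf("session directory: %s\n", path);
        return EXIT_SUCCESS;
    }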

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-14 Thread Eric Chamberland
On 14/09/16 10:27 AM, Gilles Gouaillardet wrote: Eric, do you mean you have a unique $TMP per a.out? No or a unique $TMP per "batch" of runs? Yes. I was happy because each nightly batch has its own TMP, so I can check afterward for problems related to a specific night without interfe

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-14 Thread Gilles Gouaillardet
Eric, do you mean you have a unique $TMP per a.out? Or a unique $TMP per "batch" of runs? In the first case, my understanding is that conflicts cannot happen... Once you hit the bug, can you please post the output of the failed a.out, and run egrep 'jobfam|stop' on all your logs, so we

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-14 Thread r...@open-mpi.org
This has nothing to do with PMIx, Josh - the error is coming out of the usock OOB component. > On Sep 14, 2016, at 7:17 AM, Joshua Ladd wrote: > > Eric, > > We are looking into the PMIx code path that sets up the jobid. The session > directories are created based on the jobid. It might be th

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-14 Thread Joshua Ladd
Eric, We are looking into the PMIx code path that sets up the jobid. The session directories are created based on the jobid. It might be the case that the jobids (generated with rand) happen to be the same for different jobs, resulting in multiple jobs sharing the same session directory, but we nee

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-14 Thread Gilles Gouaillardet
Thanks Eric, the goal of the patch is simply not to output info that is not needed (by both orted and a.out) /* since you run ./a.out, an orted is forked under the hood */, so the patch is really optional, though convenient. Cheers, Gilles On Wednesday, September 14, 2016, Eric Chamberland < eric.c

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-14 Thread Eric Chamberland
Lucky! Since each run has a specific TMP, I still have it on disk. For the faulty run, the TMP variable was: TMP=/tmp/tmp.wOv5dkNaSI and in $TMP I have: openmpi-sessions-40031@lorien_0 and in this subdirectory I have a bunch of empty dirs: cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-ses

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-14 Thread Eric Chamberland
On 14/09/16 01:36 AM, Gilles Gouaillardet wrote: Eric, can you please provide more information on how your tests are launched? Yes! Do you mpirun -np 1 ./a.out or do you simply ./a.out? For all sequential tests, we do ./a.out. Do you use a batch manager? If yes, which one? N

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-14 Thread Joshua Ladd
Hi Eric, I **think** this might be related to the following: https://github.com/pmix/master/pull/145 I'm wondering if you can look into the /tmp directory and see if you have a bunch of stale usock files. Best, Josh On Wed, Sep 14, 2016 at 1:36 AM, Gilles Gouaillardet wrote: > Eric, > >

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-13 Thread Gilles Gouaillardet
Eric, can you please provide more information on how your tests are launched? Do you mpirun -np 1 ./a.out, or do you simply ./a.out? Do you use a batch manager? If yes, which one? Do you run one test per job, or multiple tests per job? How are these tests launched? Do the test that

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-13 Thread Eric Chamberland
On 13/09/16 12:11 PM, Pritchard Jr., Howard wrote: Hello Eric, Is the failure seen with the same two tests? Or is it random which tests fail? If it's not random, would you be able to post No, the tests that failed were different ones... the tests to the list? Also, if possible, it would

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-13 Thread Pritchard Jr., Howard
Hello Eric, Is the failure seen with the same two tests? Or is it random which tests fail? If it's not random, would you be able to post the tests to the list? Also, if possible, it would be great if you could test against a master snapshot: https://www.open-mpi.org/nightly/master/ Thanks,

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-13 Thread Eric Chamberland
Other relevant info: I never saw this problem with OpenMPI 1.6.5, 1.8.4 and 1.10.[3,4], which run the same test suite... Thanks, Eric On 13/09/16 11:35 AM, Eric Chamberland wrote: Hi, It is the third time this has happened in the last 10 days. While running nightly tests (~2200), we have one

[OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-13 Thread Eric Chamberland
Hi, It is the third time this has happened in the last 10 days. While running nightly tests (~2200), we have one or two tests that fail at the very beginning with this strange error: [lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack: received unexpected process identifier [[9325,0],