Eric,
2.0.2 is scheduled to happen.
2.1.0 will bring some new features whereas v2.0.2 is a bug fix release.
My guess is v2.0.2 will come first, but this is just a guess.
(Even if v2.1.0 comes first, v2.0.2 will be released anyway.)
Cheers,
Gilles
On 10/7/2016 2:34 AM, Eric Chamberland wrote:
Hi Gilles,
just to mention that since PR 2091 has been merged into 2.0.x, I
haven't had any failures!
Since 2.0.0 and 2.0.1 aren't usable for us, the next version should be a
good one... So will there be a 2.0.2 release or will it go to 2.1.0
directly?
Thanks,
Eric
On 16/09/16 10:01 AM
Hi,
Can I please be removed from this list?
Thanks,
Jeremy
On Thu, Sep 15, 2016 at 8:44 AM, r...@open-mpi.org wrote:
> I don’t think a collision was the issue here. We were taking the
> mpirun-generated jobid and passing it thru the hash, thus creating an
> incorrect and invalid value. What I’
Eric,
I expect the PR will fix this bug.
The crash occurs after the unexpected process identifier error, and this
error should not happen in the first place. So at this stage, I would not
worry too much about that crash (to me, it is undefined behavior anyway).
Cheers,
Gilles
On Friday, September
Hi,
I know the pull request has not (yet) been merged, but here is a
somewhat "different" output from a single sequential test
(automatically) launched without mpirun last night:
[lorien:172229] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh
path NULL
[lorien:172229] plm:base:set_hnp_
I don’t think a collision was the issue here. We were taking the
mpirun-generated jobid and passing it thru the hash, thus creating an incorrect
and invalid value. What I’m more surprised by is that it doesn’t -always- fail.
Only thing I can figure is that, unlike with PMIx, the usock oob compon
It’s okay - it was just confusing
This actually wound up having nothing to do with how the jobid is generated.
The root cause of the problem was that we took an mpirun-generated jobid, and
then mistakenly passed it back thru a hash function instead of just using it.
So we hashed a perfectly goo
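A toy illustration of that root cause, with made-up names and a made-up hash rather than the actual OMPI/ORTE code: a jobid that mpirun has already assigned is run through a hash once more, so the resulting identifier no longer matches what the peers expect, which is the "unexpected process identifier" symptom seen elsewhere in this thread.

    #include <stdint.h>
    #include <stdio.h>

    /* stand-in for a jobid hash, purely illustrative */
    static uint32_t toy_hash(uint32_t x)
    {
        x ^= x >> 16;
        x *= 0x45d9f3bU;
        x ^= x >> 16;
        return x;
    }

    int main(void)
    {
        uint32_t mpirun_jobid = 0x246d0000;           /* already assigned by mpirun */
        uint32_t rehashed = toy_hash(mpirun_jobid);   /* the mistaken extra hash    */

        /* a peer that kept the original jobid will not recognize the rehashed
         * one, hence the "unexpected process identifier" errors */
        printf("expected %08x, got %08x -> %s\n",
               (unsigned)mpirun_jobid, (unsigned)rehashed,
               mpirun_jobid == rehashed ? "match" : "mismatch");
        return 0;
    }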
Ralph,
We love PMIx :). In this context, when I say PMIx, I am referring to the
PMIx framework in OMPI/OPAL, not the standalone PMIx library. Sorry that
wasn't clear.
Josh
On Thu, Sep 15, 2016 at 10:07 AM, r...@open-mpi.org wrote:
> I don’t understand this fascination with PMIx. PMIx didn’t ca
I don’t understand this fascination with PMIx. PMIx didn’t calculate this jobid
- OMPI did. Yes, it is in the opal/pmix layer, but it had -nothing- to do with
PMIx.
So why do you want to continue to blame PMIx for this problem??
> On Sep 15, 2016, at 4:29 AM, Joshua Ladd wrote:
>
> Great cat
Hi Gilles,
On 15/09/16 03:38 AM, Gilles Gouaillardet wrote:
Eric,
a bug has been identified, and a patch is available at
https://patch-diff.githubusercontent.com/raw/open-mpi/ompi-release/pull/1376.patch
the bug is specific to singleton mode (e.g. ./a.out vs mpirun -np 1
./a.out), so if appl
I just realized I screwed up my test, and I was missing some relevant info...
So on one hand, I fixed a bug in singleton,
but on the other hand, I cannot tell whether a collision was involved in this
issue
Cheers,
Gilles
Joshua Ladd wrote:
>Great catch, Gilles! Not much of a surprise though.
Great catch, Gilles! Not much of a surprise though.
Indeed, this issue has EVERYTHING to do with how PMIx is calculating the
jobid, which, in this case, results in hash collisions. ;-P
Josh
On Thursday, September 15, 2016, Gilles Gouaillardet
wrote:
> Eric,
>
>
> a bug has been identified, and
Eric,
a bug has been identified, and a patch is available at
https://patch-diff.githubusercontent.com/raw/open-mpi/ompi-release/pull/1376.patch
the bug is specific to singleton mode (e.g. ./a.out vs mpirun -np 1
./a.out), so if applying a patch does not fit your test workflow,
it might be
Ralph,
I fixed master at
https://github.com/open-mpi/ompi/commit/11ebf3ab23bdaeb0ec96818c119364c6d837cd3b
and PR for v2.x at https://github.com/open-mpi/ompi-release/pull/1376
Cheers,
Gilles
On 9/15/2016 12:26 PM, r...@open-mpi.org wrote:
Ah...I take that back. We changed this and now
Just in the FWIW category: the HNP used to send the singleton’s name down the
pipe at startup, which eliminated the code line you identified. Now, we are
pushing the name into the environment as a PMIx envar, and having the PMIx
component pick it up. Roundabout way of getting it, and that’s what
Ah...I take that back. We changed this and now we _do_ indeed go down that code
path. Not good.
So yes, we need that putenv so it gets the jobid from the HNP that was
launched, like it used to do. You want to throw that in?
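A rough sketch of the publish-and-pick-up pattern being described here, with a made-up envar name and format rather than the real OMPI/PMIx one: the launcher puts the assigned name into the environment, and the client-side code reads it back instead of deriving a new one.

    #include <stdio.h>
    #include <stdlib.h>

    /* launcher side: publish the assigned name; putenv keeps a reference,
     * so the buffer must outlive the call (hence static) */
    static int publish_name(unsigned jobid, unsigned vpid)
    {
        static char buf[64];
        snprintf(buf, sizeof(buf), "TOY_ORTE_NAME=%u.%u", jobid, vpid);
        return putenv(buf);
    }

    /* client side: read the name back instead of deriving a new one */
    static int lookup_name(unsigned *jobid, unsigned *vpid)
    {
        const char *val = getenv("TOY_ORTE_NAME");
        if (NULL == val) return -1;      /* not set: e.g. plain singleton */
        return (2 == sscanf(val, "%u.%u", jobid, vpid)) ? 0 : -1;
    }

    int main(void)
    {
        unsigned j, v;
        publish_name(1234u, 0u);
        if (0 == lookup_name(&j, &v))
            printf("picked up name %u.%u from the environment\n", j, v);
        return 0;
    }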
Thanks
Ralph
> On Sep 14, 2016, at 8:18 PM, r...@open-mpi.org wrote:
>
Nah, something isn’t right here. The singleton doesn’t go thru that code line,
or it isn’t supposed to do so. I think the problem lies in the way the
singleton in 2.x is starting up. Let me take a look at how singletons are
working over there.
> On Sep 14, 2016, at 8:10 PM, Gilles Gouaillardet
Ralph,
I think I just found the root cause :-)
from pmix1_client_init() in opal/mca/pmix/pmix112/pmix1_client.c
    /* store our jobid and rank */
    if (NULL != getenv(OPAL_MCA_PREFIX"orte_launch")) {
        /* if we were launched by the OMPI RTE, then
         * the jobid is in a special fo
Ok,
one test segfaulted *but* I can't tell if it is the *same* bug, because
this time there was a segfault:
stderr:
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.faultyCerr.Triangle.h_cte_1.txt
[lorien:190552] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh
path NULL
Many things are possible, given infinite time :-)
The issue with this notion lies in direct launch scenarios - i.e., when procs
are launched directly by the RM and not via mpirun. In this case, there is
nobody who can give us the session directory (well, until PMIx becomes
universal), and so th
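For context, the appeal of a jobid-based name is that every process can rebuild the same path locally, without anyone handing it over. A minimal sketch under that assumption, with a layout loosely modeled on the openmpi-sessions-<uid>@<host>_0 directories mentioned elsewhere in the thread (not the exact ORTE code):

    #include <stdint.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* every process derives the same path from (uid, host, jobid) alone,
     * so nothing has to be communicated; the layout here is an assumption */
    static void session_dir(char *buf, size_t len, const char *tmp, uint32_t jobid)
    {
        char host[256];
        if (0 != gethostname(host, sizeof(host)))
            snprintf(host, sizeof(host), "unknown");
        snprintf(buf, len, "%s/openmpi-sessions-%u@%s_0/%u",
                 tmp, (unsigned)getuid(), host, (unsigned)jobid);
    }

    int main(void)
    {
        char path[4096];
        session_dir(path, sizeof(path), "/tmp", 611188736u);
        printf("%s\n", path);
        return 0;
    }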
Ralph,
Is there any reason to use a session directory based on the jobid (or job
family)?
I mean, could we use mkstemp to generate a unique directory, and then
propagate the path via orted comm or the environment?
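A minimal sketch of that idea, using mkdtemp (the directory counterpart of mkstemp) and a made-up environment variable name: create the unique directory once, then export its path so that orted and a.out agree on it.

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const char *tmp = getenv("TMP");
        char tmpl[4096];

        snprintf(tmpl, sizeof(tmpl), "%s/ompi-session-XXXXXX",
                 NULL != tmp ? tmp : "/tmp");

        if (NULL == mkdtemp(tmpl)) {        /* unique directory, no jobid involved */
            perror("mkdtemp");
            return 1;
        }

        /* propagate the path (here via the environment) so that orted and
         * a.out end up using the same session directory */
        setenv("TOY_OMPI_SESSION_DIR", tmpl, 1);
        printf("session directory: %s\n", tmpl);
        return 0;
    }

The trade-off, as noted in the reply above, is that something still has to hand the path to every process, which is what breaks down for direct launch.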
Cheers,
Gilles
On Wednesday, September 14, 2016, r...@open-mpi.org wrote:
> T
On 14/09/16 10:27 AM, Gilles Gouaillardet wrote:
Eric,
Do you mean you have a unique $TMP per a.out?
No.
Or a unique $TMP per "batch" of runs?
Yes.
I was happy because each nightly batch has its own TMP, so I can check
afterward for problems related to a specific night without interfe
Eric,
Do you mean you have a unique $TMP per a.out?
Or a unique $TMP per "batch" of runs?
In the first case, my understanding is that conflicts cannot happen ...
Once you hit the bug, can you please post the output of the failed
a.out,
and run
egrep 'jobfam|stop'
on all your logs, so we
This has nothing to do with PMIx, Josh - the error is coming out of the usock
OOB component.
> On Sep 14, 2016, at 7:17 AM, Joshua Ladd wrote:
>
> Eric,
>
> We are looking into the PMIx code path that sets up the jobid. The session
> directories are created based on the jobid. It might be th
Eric,
We are looking into the PMIx code path that sets up the jobid. The session
directories are created based on the jobid. It might be the case that the
jobids (generated with rand) happen to be the same for different jobs
resulting in multiple jobs sharing the same session directory, but we nee
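A quick back-of-the-envelope check of that concern, assuming purely for illustration that the random part of the jobid (the job family) is drawn from a 16-bit space; the standard birthday argument then gives the chance of at least one collision among k jobs:

    #include <stdio.h>

    int main(void)
    {
        const double N = 65536.0;      /* assumed size of the job-family space (2^16) */
        double p_none = 1.0;           /* probability that no collision has occurred  */

        for (int k = 1; k <= 2200; k++) {   /* ~2200 nightly tests, as in this thread */
            p_none *= 1.0 - (double)(k - 1) / N;
            if (k == 100 || k == 500 || k == 2200)
                printf("%4d jobs -> P(at least one collision) = %.3f\n",
                       k, 1.0 - p_none);
        }
        return 0;
    }

Under that assumption the collision probability is already non-negligible for a few hundred jobs per night, which would be consistent with the failure showing up only occasionally.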
Thanks Eric,
the goal of the patch is simply not to output info that is not needed (by
both orted and a.out)
/* since you run ./a.out, an orted is forked under the hood */
so the patch is really optional, though convenient.
Cheers,
Gilles
On Wednesday, September 14, 2016, Eric Chamberland <
eric.c
Lucky!
Since each run has a specific TMP, I still have it on disk.
For the faulty run, the TMP variable was:
TMP=/tmp/tmp.wOv5dkNaSI
and in $TMP I have:
openmpi-sessions-40031@lorien_0
and in this subdirectory I have a bunch of empty dirs:
cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-ses
On 14/09/16 01:36 AM, Gilles Gouaillardet wrote:
Eric,
Can you please provide more information on how your tests are launched?
Yes!
Do you
mpirun -np 1 ./a.out
or do you simply
./a.out
For all sequential tests, we do ./a.out.
Do you use a batch manager? If yes, which one?
N
Hi, Eric
I **think** this might be related to the following:
https://github.com/pmix/master/pull/145
I'm wondering if you can look into the /tmp directory and see if you have a
bunch of stale usock files.
Best,
Josh
On Wed, Sep 14, 2016 at 1:36 AM, Gilles Gouaillardet
wrote:
> Eric,
>
>
>
Eric,
Can you please provide more information on how your tests are launched?
Do you
mpirun -np 1 ./a.out
or do you simply
./a.out
Do you use a batch manager? If yes, which one?
Do you run one test per job, or multiple tests per job?
How are these tests launched?
Do the test that
On 13/09/16 12:11 PM, Pritchard Jr., Howard wrote:
Hello Eric,
Is the failure seen with the same two tests? Or is it random
which tests fail? If it's not random, would you be able to post
No, the tests that failed were different ones...
the tests to the list?
Also, if possible, it would
Hello Eric,
Is the failure seen with the same two tests? Or is it random
which tests fail? If it's not random, would you be able to post
the tests to the list?
Also, if possible, it would be great if you could test against a master
snapshot:
https://www.open-mpi.org/nightly/master/
Thanks,
Other relevant info: I never saw this problem with OpenMPI 1.6.5, 1.8.4
and 1.10.[3,4], which run the same test suite...
thanks,
Eric
On 13/09/16 11:35 AM, Eric Chamberland wrote:
Hi,
It is the third time this has happened in the last 10 days.
While running nightly tests (~2200), we have one
Hi,
It is the third time this has happened in the last 10 days.
While running nightly tests (~2200), we have one or two tests that fail
at the very beginning with this strange error:
[lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack: received
unexpected process identifier [[9325,0],