Thanks Ralph,
i investigated this a bit deeper, and found the $enable_dlopen variable
is not correctly used in pmix3x.
/* my understanding of pmix3x is that --disable-dlopen implies
--disable-pdl-dlopen,
but that did not happen */
i opened https://github.com/open-mpi/ompi/pull/2079 so le
Hi, Eric
I **think** this might be related to the following:
https://github.com/pmix/master/pull/145
I'm wondering if you can look into the /tmp directory and see if you have a
bunch of stale usock files.
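
For reference, a minimal sketch of one way to list socket files in a directory. The function name count_socket_files is hypothetical and not part of Open MPI or PMIx. It only reports entries whose file type is a UNIX-domain socket; it does not decide whether a given socket is stale, and it does not assume any particular usock file name.

```c
#define _XOPEN_SOURCE 700   /* for lstat() and S_ISSOCK() */
#include <dirent.h>
#include <stdio.h>
#include <sys/stat.h>

/* Hypothetical helper: prints and counts the socket files located directly
 * inside dir (non-recursive). Returns the count, or -1 if dir cannot be
 * opened. */
int count_socket_files(const char *dir)
{
    DIR *d = opendir(dir);
    if (NULL == d) {
        return -1;
    }

    int count = 0;
    struct dirent *entry;
    while (NULL != (entry = readdir(d))) {
        char path[4096];
        struct stat st;
        int n = snprintf(path, sizeof path, "%s/%s", dir, entry->d_name);
        if (n < 0 || (size_t)n >= sizeof path) {
            continue;           /* path too long for the buffer, skip it */
        }
        /* lstat() so a symlink is not followed to its target */
        if (0 == lstat(path, &st) && S_ISSOCK(st.st_mode)) {
            printf("%s\n", path);
            count++;
        }
    }
    closedir(d);
    return count;
}
```

Calling count_socket_files("/tmp") would print each socket found directly under /tmp; session subdirectories would have to be scanned separately.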
Best,
Josh
On Wed, Sep 14, 2016 at 1:36 AM, Gilles Gouaillardet
wrote:
> Eric,
>
>
>
On 14/09/16 01:36 AM, Gilles Gouaillardet wrote:
Eric,
can you please provide more information on how your tests are launched ?
Yes!
do you
mpirun -np 1 ./a.out
or do you simply
./a.out
For all sequential tests, we do ./a.out.
do you use a batch manager ? if yes, which one ?
N
Lucky!
Since each run has a specific TMP, I still have it on disk.
for the faulty run, the TMP variable was:
TMP=/tmp/tmp.wOv5dkNaSI
and in $TMP I have:
openmpi-sessions-40031@lorien_0
and in this subdirectory I have a bunch of empty dirs:
cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-ses
Thanks Eric,
the goal of the patch is simply not to output info that is not needed (by
both orted and a.out)
/* since you ./a.out, an orted is forked under the hood */
so the patch is really optional, though convenient.
Cheers,
Gilles
On Wednesday, September 14, 2016, Eric Chamberland <
eric.c
Eric,
We are looking into the PMIx code path that sets up the jobid. The session
directories are created based on the jobid. It might be the case that the
jobids (generated with rand) happen to be the same for different jobs
resulting in multiple jobs sharing the same session directory, but we nee
This has nothing to do with PMIx, Josh - the error is coming out of the usock
OOB component.
> On Sep 14, 2016, at 7:17 AM, Joshua Ladd wrote:
>
> Eric,
>
> We are looking into the PMIx code path that sets up the jobid. The session
> directories are created based on the jobid. It might be th
Eric,
do you mean you have a unique $TMP per a.out ?
or a unique $TMP per "batch" of run ?
in the first case, my understanding is that conflicts cannot happen ...
once you hit the bug, can you please please post the output of the failed
a.out,
and run
egrep 'jobfam|stop'
on all your logs, so we
On 14/09/16 10:27 AM, Gilles Gouaillardet wrote:
Eric,
do you mean you have a unique $TMP per a.out ?
No
or a unique $TMP per "batch" of run ?
Yes.
I was happy because each nightly batch has its own TMP, so I can check
afterward for problems related to a specific night without interfe
Ralph,
is there any reason to use a session directory based on the jobid (or job
family) ?
I mean, could we use mkstemp to generate a unique directory, and then
propagate the path via orted comm or the environment ?
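
A rough sketch of that idea, under stated assumptions: mkstemp() creates a file, so the sketch uses its directory counterpart mkdtemp(), and the path is propagated through the environment only. The function name create_session_dir and the variable EXAMPLE_SESSION_DIR are hypothetical and are not Open MPI names.

```c
#define _XOPEN_SOURCE 700   /* for mkdtemp() and setenv() */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical environment variable name, used for illustration only. */
#define SESSION_DIR_ENVAR "EXAMPLE_SESSION_DIR"

/* Creates a uniquely named directory under base and exports its path in the
 * environment so that child processes inherit it.
 * out receives the final path and must hold out_len bytes.
 * Returns 0 on success, -1 on failure. */
int create_session_dir(const char *base, char *out, size_t out_len)
{
    int n = snprintf(out, out_len, "%s/session-XXXXXX", base);
    if (n < 0 || (size_t)n >= out_len) {
        return -1;                  /* template did not fit in the buffer */
    }
    /* mkdtemp() replaces the trailing XXXXXX with a unique suffix and
     * creates the directory with mode 0700. */
    if (NULL == mkdtemp(out)) {
        return -1;
    }
    /* Children forked after this call inherit the variable. */
    return setenv(SESSION_DIR_ENVAR, out, 1);
}
```

Two calls with the same base return two different paths, so the directory name no longer depends on the jobid. This only covers processes that inherit the environment of the caller.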
Cheers,
Gilles
On Wednesday, September 14, 2016, r...@open-mpi.org wrote:
> T
Many things are possible, given infinite time :-)
The issue with this notion lies in direct launch scenarios - i.e., when procs
are launched directly by the RM and not via mpirun. In this case, there is
nobody who can give us the session directory (well, until PMIx becomes
universal), and so th
Ok,
one test segfaulted *but* I can't tell if it is the *same* bug because
there has been a segfault:
stderr:
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.faultyCerr.Triangle.h_cte_1.txt
[lorien:190552] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh
path NULL
- Code reviews got better / more organized
- Some project management tools now available
- We can enforce the use of 2-factor authentication
https://github.com/blog/2256-a-whole-new-github-universe-announcing-new-tools-forums-and-features
Sweet!
--
Jeff Squyres
jsquy...@cisco.com
For corporate
> On Sep 14, 2016, at 11:37 AM, Jeff Squyres (jsquyres)
> wrote:
>
> - Code reviews got better / more organized
> - Some project management tools now available
> - We can enforce the use of 2-factor authentication
Please don’t do that...
>
> https://github.com/blog/2256-a-whole-new-github-un
On Sep 14, 2016, at 2:40 PM, r...@open-mpi.org wrote:
>
>> - Code reviews got better / more organized
>> - Some project management tools now available
>> - We can enforce the use of 2-factor authentication
>
> Please don’t do that...
Certainly wouldn't do the last one without talking it through
I’d want to _fully_ understand the implications before forcing something on
everyone that might prove burdensome, especially when it “solves” a currently
non-existent problem
> On Sep 14, 2016, at 11:43 AM, Jeff Squyres (jsquyres)
> wrote:
>
> On Sep 14, 2016, at 2:40 PM, r...@open-mpi.org w
Sure. There's no rush at all; in fact, this is probably a decent topic for our
next face-to-face.
> On Sep 14, 2016, at 2:46 PM, r...@open-mpi.org wrote:
>
> I’d want to _fully_ understand the implications before forcing something on
> everyone that might prove burdensome, especially when it
Ralph,
I know with older versions of git you may have problems since you can’t use
https. I think with newer versions it will prompt not just for password but
also 2-factor.
That’s one problem I hit anyway when first enabling 2-factor.
Howard
--
Howard Pritchard
HPC-DES
Los Alamos National Lab
The problem I hit, and the reason I’m pushing back, was that it required me to
have a smart phone handy. Not everyone has a smart phone, nor do they always
have it sitting next to them. In the case I hit, I was sitting somewhere that
(a) had poor cell reception, and (b) didn’t have my cell phone
Ralph,
On 9/15/2016 12:11 AM, r...@open-mpi.org wrote:
Many things are possible, given infinite time :-)
i could not agree more :-D
The issue with this notion lies in direct launch scenarios - i.e.,
when procs are launched directly by the RM and not via mpirun. In this
case, there is nobody
If we are going to make a change, then let’s do it only once. Since we
introduced PMIx and the concept of the string namespace, the plan has been to
switch away from a numerical jobid and to the namespace. This eliminates the
issue of the hash altogether. If we are going to make a disruptive cha
Ralph,
i think i just found the root cause :-)
from pmix1_client_init() in opal/mca/pmix/pmix112/pmix1_client.c
/* store our jobid and rank */
if (NULL != getenv(OPAL_MCA_PREFIX"orte_launch")) {
/* if we were launched by the OMPI RTE, then
* the jobid is in a special fo
Nah, something isn’t right here. The singleton doesn’t go thru that code line,
or it isn’t supposed to do so. I think the problem lies in the way the
singleton in 2.x is starting up. Let me take a look at how singletons are
working over there.
> On Sep 14, 2016, at 8:10 PM, Gilles Gouaillardet
Ah...I take that back. We changed this and now we _do_ indeed go down that code
path. Not good.
So yes, we need that putenv so it gets the jobid from the HNP that was
launched, like it used to do. You want to throw that in?
Thanks
Ralph
> On Sep 14, 2016, at 8:18 PM, r...@open-mpi.org wrote:
>
Just in the FWIW category: the HNP used to send the singleton’s name down the
pipe at startup, which eliminated the code line you identified. Now, we are
pushing the name into the environment as a PMIx envar, and having the PMIx
component pick it up. Roundabout way of getting it, and that’s what