Re: [OMPI devel] === CREATE FAILURE (dev-1979-g13425e7) ===

2015-06-25 Thread Ralph Castain
I have a fix - about to commit On Thu, Jun 25, 2015 at 8:46 PM, Jeff Squyres (jsquyres) wrote: > Howard -- > > The LANL distcheck jenkins hasn't been running all day. > > > > On Jun 25, 2015, at 8:33 PM, Howard Pritchard > wrote: > > > > Hi folks, > > > > I'm confused about this build failure.

Re: [OMPI devel] === CREATE FAILURE (dev-1979-g13425e7) ===

2015-06-25 Thread Jeff Squyres (jsquyres)
Howard -- The LANL distcheck jenkins hasn't been running all day. > On Jun 25, 2015, at 8:33 PM, Howard Pritchard wrote: > > Hi folks, > > I'm confused about this build failure. It should have been caught by the > make distcheck IU jenkins > project I would think. Should the IU jenkins pro

Re: [OMPI devel] === CREATE FAILURE (dev-1979-g13425e7) ===

2015-06-25 Thread Howard Pritchard
Hi folks, I'm confused about this build failure. It should have been caught by the make distcheck IU jenkins project, I would think. Should the IU jenkins project do something else besides make -j X distcheck to catch this problem? Or, did this problem happen because someone bypassed the PR pro

Re: [OMPI devel] [OMPI users] simple mpi hello world segfaults when coll ml not disabled

2015-06-25 Thread Paul Hargrove
On Thu, Jun 25, 2015 at 5:05 PM, Paul Hargrove wrote: > > On Thu, Jun 25, 2015 at 4:59 PM, Gilles Gouaillardet > wrote: > >> In this case, mca_coll_hcoll module is linked with the proprietary >> libhcoll.so. >> the ml symbols are defined in both mca_coll_ml.so and libhcoll.so >> i am not sure (i

Re: [OMPI devel] [OMPI users] simple mpi hello world segfaults when coll ml not disabled

2015-06-25 Thread Paul Hargrove
On Thu, Jun 25, 2015 at 4:59 PM, Gilles Gouaillardet wrote: > In this case, mca_coll_hcoll module is linked with the proprietary > libhcoll.so. > the ml symbols are defined in both mca_coll_ml.so and libhcoll.so > i am not sure (i blame my poor understanding of linkers) this is an error > if > Op

Re: [OMPI devel] [OMPI users] simple mpi hello world segfaults when coll ml not disabled

2015-06-25 Thread Gilles Gouaillardet
Paul, generally speaking, that is a good point. Another option could be to write a script that detects symbols defined more than once. In this case, the mca_coll_hcoll module is linked with the proprietary libhcoll.so. The ml symbols are defined in both mca_coll_ml.so and libhcoll.so. i am not s
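A minimal sketch of the kind of duplicate-symbol script suggested above, under stated assumptions: it parses text in the three-column layout printed by GNU `nm -D --defined-only` (address, type letter, name) and reports global symbols defined in more than one library. The function names `parse_nm`, `find_duplicates`, and `run_nm` are hypothetical, and this is not a script from the Open MPI tree.

```python
import subprocess
import sys
from collections import defaultdict

# nm type letters treated as global definitions (text, data, bss, read-only).
# Assumption: weak and local symbols are deliberately ignored in this sketch.
GLOBAL_DEFINED = set("TDBR")


def parse_nm(text):
    """Return the set of globally defined symbol names in nm-style output."""
    symbols = set()
    for line in text.splitlines():
        parts = line.split()
        # Expected layout: <address> <type letter> <symbol name>
        if len(parts) == 3 and parts[1] in GLOBAL_DEFINED:
            symbols.add(parts[2])
    return symbols


def find_duplicates(nm_outputs):
    """Map each symbol defined in more than one library to those libraries.

    nm_outputs maps a library name to the nm output text for that library.
    """
    owners = defaultdict(list)
    for lib, text in sorted(nm_outputs.items()):
        for sym in parse_nm(text):
            owners[sym].append(lib)
    return {sym: libs for sym, libs in owners.items() if len(libs) > 1}


def run_nm(path):
    """Run GNU nm on a shared object and return its output (requires binutils)."""
    result = subprocess.run(
        ["nm", "-D", "--defined-only", path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Library paths are taken from the command line.
    outputs = {path: run_nm(path) for path in sys.argv[1:]}
    for sym, libs in sorted(find_duplicates(outputs).items()):
        print(f"{sym}: defined in {', '.join(libs)}")
```

Run against mca_coll_ml.so and libhcoll.so, a script along these lines would list any symbol names exported by both. Whether a given duplicate is actually harmful still depends on how the objects are loaded and linked.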

Re: [OMPI devel] [OMPI users] simple mpi hello world segfaults when coll ml not disabled

2015-06-25 Thread Paul Hargrove
I can see cloning an existing component's source as a starting point for a new one being a common occurrence (at least relative to creating new components from zero). So, this is probably not the last time this will ever occur. Would a build with --disable-dlopen have detected this problem (by failin

Re: [OMPI devel] [OMPI users] simple mpi hello world segfaults when coll ml not disabled

2015-06-25 Thread Jeff Squyres (jsquyres)
Devendar literally just reproduced here at the developer meeting, too. Sweet -- ok, so we understand what is going on. Devendar/Mellanox is going to talk about this internally and get back to us. > On Jun 25, 2015, at 2:59 PM, Gilles Gouaillardet > wrote: > > Jeff, > > this is exactly what

Re: [OMPI devel] [OMPI users] simple mpi hello world segfaults when coll ml not disabled

2015-06-25 Thread Gilles Gouaillardet
Jeff, this is exactly what happens. I will send a stack trace later Cheers, Gilles On Thursday, June 25, 2015, Jeff Squyres (jsquyres) wrote: > Gilles -- > > Can you send a stack trace from one of these crashes? > > I am *guessing* that the following is happening: > > 1. coll selection begin

Re: [OMPI devel] [OMPI users] simple mpi hello world segfaults when coll ml not disabled

2015-06-25 Thread Ralph Castain
That appears to be correct. On Thu, Jun 25, 2015 at 9:51 AM, Shamis, Pavel wrote: > As I read this thread - this issue is not related to the ML bootstrap > itself, > but the naming conflict between public functions in HCOLL and ML. > > Did I get it right? > > If this is the case, we can work with

Re: [OMPI devel] [OMPI users] simple mpi hello world segfaults when coll ml not disabled

2015-06-25 Thread Shamis, Pavel
As I read this thread - this issue is not related to the ML bootstrap itself, but the naming conflict between public functions in HCOLL and ML. Did I get it right? If this is the case, we can work with Mellanox folks to resolve this conflict. Best, Pavel (Pasha) Shamis --- Computer Science Rese

[OMPI devel] Pruning from the 2.x branch

2015-06-25 Thread Jeff Squyres (jsquyres)
We have removed the following stale / inactive frameworks/components from the 2.x tree:
- ompi coll hierarch (it was effectively already removed, anyway)
- ompi coll ml
- ompi sbgp
- ompi bcol
- orte reachable
*** DEVELOPERS: Are there other components / frameworks that should be removed from t

Re: [OMPI devel] [OMPI users] simple mpi hello world segfaults when coll ml not disabled

2015-06-25 Thread Jeff Squyres (jsquyres)
Gilles -- Can you send a stack trace from one of these crashes? I am *guessing* that the following is happening:
1. coll selection begins
2. coll ml is queried, and disqualifies itself (but is not dlclosed yet)
3. coll hcoll is queried, which ends up calling down into libhcoll. libhcoll calls a c

Re: [OMPI devel] [OMPI users] simple mpi hello world segfaults when coll ml not disabled

2015-06-25 Thread Ralph Castain
We at least need to release an immediate 1.8.7 to rectify the situation, either by "rm -rf" of ml or ompi_ignore it. I'll ompi_ignore it in the 1.10 branch for now as that hasn't been released yet - if we can get a fix in the next week or two, we can "unignore" it for the release. I'm still angling

Re: [OMPI devel] [OMPI users] simple mpi hello world segfaults when coll ml not disabled

2015-06-25 Thread Jeff Squyres (jsquyres)
Should we .ompi_ignore ml? > On Jun 25, 2015, at 4:41 AM, Joshua Ladd wrote: > > Thanks, Gilles. > > We are addressing this. > > Josh > > Sent from my iPhone > > On Jun 25, 2015, at 11:03 AM, Gilles Gouaillardet wrote: > >> Folks, >> >> this is a followup on an issue reported by Daniel o

Re: [OMPI devel] [OMPI users] simple mpi hello world segfaults when coll ml not disabled

2015-06-25 Thread Joshua Ladd
Thanks, Gilles. We are addressing this. Josh Sent from my iPhone > On Jun 25, 2015, at 11:03 AM, Gilles Gouaillardet wrote: > > Folks, > > this is a followup on an issue reported by Daniel on the users mailing list : > OpenMPI is built with hcoll from Mellanox. > the coll ml module has defau

Re: [OMPI devel] [OMPI users] simple mpi hello world segfaults when coll ml not disabled

2015-06-25 Thread Gilles Gouaillardet
Folks, this is a follow-up on an issue reported by Daniel on the users mailing list: OpenMPI is built with hcoll from Mellanox. The coll ml module has default priority zero. On my cluster, it works just fine; on Daniel's cluster, it crashes. I was able to reproduce the crash by tweaking mca_ba