Re: [OMPI devel] [OMPI users] simple mpi hello world segfaults when coll ml not disabled

2015-06-29 Thread Devendar Bureddy
> Crud - thanks Paul! Mellanox is working on a fix (renaming the symbols in their proprietary library so they don't conflict). If they can release that soon, I'm hopin

Re: [OMPI devel] [OMPI users] simple mpi hello world segfaults when coll ml not disabled

2015-06-26 Thread Jeff Squyres (jsquyres)
On Jun 25, 2015, at 10:48 PM, Gilles Gouaillardet wrote: > > as far as i understand, the behavior depends on how plugins are enumerated > and this is system dependent > (by default, Daniel got a crash, but i got none ...) > should we sort the plugins by name/library name so we do not fall into t
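A minimal sketch of the sorting idea raised in the quoted text, assuming components are discovered as mca_*.so files in a directory. The function names and the directory layout are illustrative assumptions, not Open MPI's actual component loader. Directory listings come back in an unspecified order, so an explicit sort is what makes the order the same on every system.

```python
from pathlib import Path


def stable_order(names):
    """Return component file names in a deterministic, name-sorted order."""
    # Keep only files that look like MCA plugins (assumed naming convention).
    return sorted(n for n in names if n.startswith("mca_") and n.endswith(".so"))


def enumerate_components(directory):
    """List plugin files in `directory` (hypothetical helper) in a stable order."""
    # iterdir() yields entries in an arbitrary, filesystem-dependent order,
    # so the result is sorted before it is returned.
    return stable_order(p.name for p in Path(directory).iterdir())


if __name__ == "__main__":
    listing = ["mca_coll_ml.so", "mca_coll_hcoll.so", "mca_coll_basic.so"]
    print(stable_order(listing))
```

Sorting would make the behavior reproducible across systems, but it would not remove the duplicate symbols themselves.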

Re: [OMPI devel] [OMPI users] simple mpi hello world segfaults when coll ml not disabled

2015-06-26 Thread Paul Hargrove
On Thu, Jun 25, 2015 at 10:48 PM, Gilles Gouaillardet wrote: > Paul, > > i assume you ran the test with Open MPI configured with --disable-dlopen, > right ? > > --disable-dlopen is like forcing coll_ml to be loaded first, hence the > crash, even with --mca coll ^ml > > without --disable-dlopen, a

Re: [OMPI devel] [OMPI users] simple mpi hello world segfaults when coll ml not disabled

2015-06-26 Thread Gilles Gouaillardet
Paul, i assume you ran the test with Open MPI configured with --disable-dlopen, right ? --disable-dlopen is like forcing coll_ml to be loaded first, hence the crash, even with --mca coll ^ml. Without --disable-dlopen, and with default coll_ml_priority=0, the crash only occurs if coll_ml is

Re: [OMPI devel] [OMPI users] simple mpi hello world segfaults when coll ml not disabled

2015-06-26 Thread Ralph Castain
Crud - thanks Paul! Mellanox is working on a fix (renaming the symbols in their proprietary library so they don't conflict). If they can release that soon, I'm hoping to avoid having to release a quick 1.8.7 to fix the problem from inside OMPI (i.e., removing one of the conflicting plugins). On

Re: [OMPI devel] [OMPI users] simple mpi hello world segfaults when coll ml not disabled

2015-06-25 Thread Paul Hargrove
On Thu, Jun 25, 2015 at 5:05 PM, Paul Hargrove wrote: > > On Thu, Jun 25, 2015 at 4:59 PM, Gilles Gouaillardet > wrote: > >> In this case, mca_coll_hcoll module is linked with the proprietary >> libhcoll.so. >> the ml symbols are defined in both mca_coll_ml.so and libhcoll.so >> i am not sure (i

Re: [OMPI devel] [OMPI users] simple mpi hello world segfaults when coll ml not disabled

2015-06-25 Thread Paul Hargrove
On Thu, Jun 25, 2015 at 4:59 PM, Gilles Gouaillardet wrote: > In this case, mca_coll_hcoll module is linked with the proprietary > libhcoll.so. > the ml symbols are defined in both mca_coll_ml.so and libhcoll.so > i am not sure (i blame my poor understanding of linkers) this is an error > if > Op

Re: [OMPI devel] [OMPI users] simple mpi hello world segfaults when coll ml not disabled

2015-06-25 Thread Gilles Gouaillardet
Paul, generally speaking, that is a good point. Another option could be to write a script that detects symbols defined more than once. In this case, the mca_coll_hcoll module is linked with the proprietary libhcoll.so. The ml symbols are defined in both mca_coll_ml.so and libhcoll.so. i am not s
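One possible shape for such a script, as a sketch only: it assumes GNU nm is on the PATH and that the shared objects to compare are passed on the command line. The function names are made up for illustration and nothing here comes from the Open MPI tree.

```python
import subprocess
import sys
from collections import defaultdict


def defined_symbols(nm_output):
    """Parse `nm` text output and return the globally defined symbol names."""
    symbols = set()
    for line in nm_output.splitlines():
        parts = line.split()
        # Defined symbols have three fields: address, type letter, name.
        if len(parts) != 3:
            continue
        _addr, kind, name = parts
        # Uppercase T/D/B/R mark global text, data, bss and read-only data.
        if kind in {"T", "D", "B", "R"}:
            symbols.add(name)
    return symbols


def find_duplicates(symbols_by_library):
    """Map each symbol defined in more than one library to those libraries."""
    owners = defaultdict(list)
    for library, symbols in symbols_by_library.items():
        for name in symbols:
            owners[name].append(library)
    return {name: sorted(libs) for name, libs in owners.items() if len(libs) > 1}


def scan(paths):
    """Run `nm -D --defined-only` on each path and report duplicate symbols."""
    table = {}
    for path in paths:
        result = subprocess.run(
            ["nm", "-D", "--defined-only", path],
            check=True, capture_output=True, text=True,
        )
        table[path] = defined_symbols(result.stdout)
    return find_duplicates(table)


if __name__ == "__main__":
    for name, libs in sorted(scan(sys.argv[1:]).items()):
        print(f"{name}: {', '.join(libs)}")
```

It could be run as, for example, `python dupsyms.py mca_coll_ml.so libhcoll.so` (the script name is hypothetical). Expect some harmless hits such as `_init` and `_fini`, which most shared objects define, so the output would need filtering before it could gate a build.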

Re: [OMPI devel] [OMPI users] simple mpi hello world segfaults when coll ml not disabled

2015-06-25 Thread Paul Hargrove
I can see cloning of existing component's source as a starting point for a new one as a common occurrence (at least relative to creating new components from zero). So, this is probably not the last time this will ever occur. Would a build with --disable-dlopen have detected this problem (by failin

Re: [OMPI devel] [OMPI users] simple mpi hello world segfaults when coll ml not disabled

2015-06-25 Thread Jeff Squyres (jsquyres)
Devendar literally just reproduced here at the developer meeting, too. Sweet -- ok, so we understand what is going on. Devendar/Mellanox is going to talk about this internally and get back to us. > On Jun 25, 2015, at 2:59 PM, Gilles Gouaillardet > wrote: > > Jeff, > > this is exactly what

Re: [OMPI devel] [OMPI users] simple mpi hello world segfaults when coll ml not disabled

2015-06-25 Thread Gilles Gouaillardet
Jeff, this is exactly what happens. I will send a stack trace later Cheers, Gilles On Thursday, June 25, 2015, Jeff Squyres (jsquyres) wrote: > Gilles -- > > Can you send a stack trace from one of these crashes? > > I am *guessing* that the following is happening: > > 1. coll selection begin

Re: [OMPI devel] [OMPI users] simple mpi hello world segfaults when coll ml not disabled

2015-06-25 Thread Ralph Castain
That appears to be correct. On Thu, Jun 25, 2015 at 9:51 AM, Shamis, Pavel wrote: > As I read this thread - this issue is not related to the ML bootstrap > itself, > but the naming conflict between public functions in HCOLL and ML. > > Did I get it right ? > > If this is the case, we can work with

Re: [OMPI devel] [OMPI users] simple mpi hello world segfaults when coll ml not disabled

2015-06-25 Thread Shamis, Pavel
As I read this thread, this issue is not related to the ML bootstrap itself, but to the naming conflict between public functions in HCOLL and ML. Did I get it right? If this is the case, we can work with Mellanox folks to resolve this conflict. Best, Pavel (Pasha) Shamis --- Computer Science Rese

Re: [OMPI devel] [OMPI users] simple mpi hello world segfaults when coll ml not disabled

2015-06-25 Thread Jeff Squyres (jsquyres)
Gilles -- Can you send a stack trace from one of these crashes? I am *guessing* that the following is happening: 1. coll selection begins 2. coll ml is queried, and disqualifies itself (but is not dlclosed yet) 3. coll hcoll is queried, which ends up calling down into libhcoll. libhcoll calls a c
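A toy model of the load-order effect described in this thread, offered as a sketch under assumptions rather than a description of the real dynamic linker. It assumes all plugins share one flat symbol namespace in which the first definition loaded wins, and `ml_example_fn` is a made-up symbol name standing in for whichever ml symbols actually clash.

```python
class GlobalScope:
    """Toy flat symbol namespace: the first library to define a name keeps it."""

    def __init__(self):
        self._table = {}
        self.load_order = []

    def load(self, library, symbols):
        """Record `library` as loaded and register the symbols it defines."""
        self.load_order.append(library)
        for name in symbols:
            # setdefault keeps an earlier definition and ignores later ones.
            self._table.setdefault(name, library)

    def resolve(self, name):
        """Return the library whose definition a lookup of `name` would reach."""
        return self._table[name]


if __name__ == "__main__":
    # coll ml is queried first and stays loaded; hcoll's library comes second.
    scope = GlobalScope()
    scope.load("mca_coll_ml.so", ["ml_example_fn"])
    scope.load("libhcoll.so", ["ml_example_fn"])
    print(scope.resolve("ml_example_fn"))  # prints mca_coll_ml.so
```

Swapping the two `load` calls makes the lookup land in libhcoll.so instead, which matches the observation elsewhere in the thread that the crash only shows up when coll_ml is loaded before hcoll.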

Re: [OMPI devel] [OMPI users] simple mpi hello world segfaults when coll ml not disabled

2015-06-25 Thread Ralph Castain
We at least need to release an immediate 1.8.7 to rectify the situation, either by "rm -rf" of ml or ompi_ignore it. I'll ompi_ignore it in the 1.10 branch for now as that hasn't been released yet - if we can get a fix in the next week or two, we can "unignore" it for the release. I'm still angling

Re: [OMPI devel] [OMPI users] simple mpi hello world segfaults when coll ml not disabled

2015-06-25 Thread Jeff Squyres (jsquyres)
Should we .ompi_ignore ml? > On Jun 25, 2015, at 4:41 AM, Joshua Ladd wrote: > > Thanks, Gilles. > > We are addressing this. > > Josh > > Sent from my iPhone > > On Jun 25, 2015, at 11:03 AM, Gilles Gouaillardet wrote: > >> Folks, >> >> this is a followup on an issue reported by Daniel o

Re: [OMPI devel] [OMPI users] simple mpi hello world segfaults when coll ml not disabled

2015-06-25 Thread Joshua Ladd
Thanks, Gilles. We are addressing this. Josh Sent from my iPhone > On Jun 25, 2015, at 11:03 AM, Gilles Gouaillardet wrote: > > Folks, > > this is a followup on an issue reported by Daniel on the users mailing list : > OpenMPI is built with hcoll from Mellanox. > the coll ml module has defau

Re: [OMPI devel] [OMPI users] simple mpi hello world segfaults when coll ml not disabled

2015-06-25 Thread Gilles Gouaillardet
Folks, this is a followup on an issue reported by Daniel on the users mailing list: Open MPI is built with hcoll from Mellanox. The coll ml module has default priority zero. On my cluster, it works just fine; on Daniel's cluster, it crashes. i was able to reproduce the crash by tweaking mca_ba

Re: [OMPI devel] [OMPI users] simple mpi hello world segfaults when coll ml not disabled

2015-06-24 Thread Gilles Gouaillardet
Daniel, thanks for the logs. Another workaround is to run mpirun --mca coll ^hcoll ... I was able to reproduce the issue, and surprisingly it occurs only if the coll_ml module is loaded *before* the hcoll module. /* this is not the case on my system, so i had to hack my mca_base_component_path i