Open MPI Developers
Subject: Re: [OMPI devel] [OMPI users] simple mpi hello world segfaults when coll ml not disabled
Crud - thanks Paul! Mellanox is working on a fix (renaming the symbols in their
proprietary library so they don't conflict). If they can release that soon, I'm
hopin
On Jun 25, 2015, at 10:48 PM, Gilles Gouaillardet wrote:
>
> as far as i understand, the behavior depends on how plugins are enumerated
> and this is system dependent
> (by default, Daniel got a crash, but i got none ...)
> should we sort the plugins by name/library name so we do not fall into t
On Thu, Jun 25, 2015 at 10:48 PM, Gilles Gouaillardet
wrote:
> Paul,
>
> i assume you ran the test with Open MPI configured with --disable-dlopen,
> right ?
>
> --disable-dlopen is like forcing coll_ml to be loaded first, hence the
> crash, even with --mca coll ^ml
>
> without --disable-dlopen, a
Paul,
i assume you ran the test with Open MPI configured with
--disable-dlopen, right ?
--disable-dlopen is like forcing coll_ml to be loaded first, hence the
crash, even with --mca coll ^ml
without --disable-dlopen, and with default coll_ml_priority=0, the crash
only occurs if coll_ml is
Crud - thanks Paul! Mellanox is working on a fix (renaming the symbols in
their proprietary library so they don't conflict). If they can release that
soon, I'm hoping to avoid having to release a quick 1.8.7 to fix the
problem from inside OMPI (i.e., removing one of the conflicting plugins).
On
On Thu, Jun 25, 2015 at 5:05 PM, Paul Hargrove wrote:
>
> On Thu, Jun 25, 2015 at 4:59 PM, Gilles Gouaillardet
> wrote:
>
>> In this case, mca_coll_hcoll module is linked with the proprietary
>> libhcoll.so.
>> the ml symbols are defined in both mca_coll_ml.so and libhcoll.so
>> i am not sure (i
On Thu, Jun 25, 2015 at 4:59 PM, Gilles Gouaillardet
wrote:
> In this case, mca_coll_hcoll module is linked with the proprietary
> libhcoll.so.
> the ml symbols are defined in both mca_coll_ml.so and libhcoll.so
> i am not sure (i blame my poor understanding of linkers) this is an error
> if
> Op
Paul,
generally speaking, that is a good point.
another option could be to write a script that detects symbols defined
more than once.
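A minimal sketch of such a script, assuming GNU `nm` is available on the PATH and that the conflicting symbols are exported in each library's dynamic symbol table. The function names `defined_symbols` and `find_duplicates` are illustrative and not part of Open MPI:

```python
import subprocess
import sys
from collections import defaultdict


def defined_symbols(path):
    """Return the set of dynamic symbols defined by the shared object at `path`.

    Assumption: GNU binutils `nm` is installed and prints
    "<address> <type> <name>" for each defined symbol.
    """
    out = subprocess.run(
        ["nm", "-D", "--defined-only", path],
        check=True, capture_output=True, text=True,
    ).stdout
    symbols = set()
    for line in out.splitlines():
        fields = line.split()
        # Skip blank or unexpected lines; the symbol name is the last field.
        if len(fields) >= 3:
            symbols.add(fields[-1])
    return symbols


def find_duplicates(tables):
    """Map each symbol defined by more than one library to its sorted owners.

    `tables` maps a library path to the set of symbols it defines.
    """
    owners = defaultdict(list)
    for lib, symbols in tables.items():
        for sym in symbols:
            owners[sym].append(lib)
    return {sym: sorted(libs) for sym, libs in owners.items() if len(libs) > 1}


def main(paths):
    tables = {path: defined_symbols(path) for path in paths}
    for sym, libs in sorted(find_duplicates(tables).items()):
        print(f"{sym}: {', '.join(libs)}")


if __name__ == "__main__":
    main(sys.argv[1:])
```

It would be run with the shared objects to compare as arguments, for example mca_coll_ml.so and libhcoll.so. Linker-generated symbols such as `_init` and `_fini` typically appear in every shared object, so a real check would probably need to filter those out.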
In this case, mca_coll_hcoll module is linked with the proprietary
libhcoll.so.
the ml symbols are defined in both mca_coll_ml.so and libhcoll.so
i am not s
I can see cloning of existing component's source as a starting point for a
new one as a common occurrence (at least relative to creating new
components from zero).
So, this is probably not the last time this will ever occur.
Would a build with --disable-dlopen have detected this problem (by failin
Devendar literally just reproduced it here at the developer meeting, too.
Sweet -- ok, so we understand what is going on.
Devendar/Mellanox is going to talk about this internally and get back to us.
> On Jun 25, 2015, at 2:59 PM, Gilles Gouaillardet
> wrote:
>
> Jeff,
>
> this is exactly what
Jeff,
this is exactly what happens.
I will send a stack trace later
Cheers,
Gilles
On Thursday, June 25, 2015, Jeff Squyres (jsquyres)
wrote:
> Gilles --
>
> Can you send a stack trace from one of these crashes?
>
> I am *guessing* that the following is happening:
>
> 1. coll selection begin
That appears to be correct.
On Thu, Jun 25, 2015 at 9:51 AM, Shamis, Pavel wrote:
> As I read this thread - this issue is not related to the ML bootstrap
> itself,
> but the naming conflict between public functions in HCOLL and ML.
>
> Did I get it right ?
>
> If this is the case, we can work with
As I read this thread - this issue is not related to the ML bootstrap itself,
but the naming conflict between public functions in HCOLL and ML.
Did I get it right ?
If this is the case, we can work with Mellanox folks to resolve this conflict.
Best,
Pavel (Pasha) Shamis
---
Computer Science Rese
Gilles --
Can you send a stack trace from one of these crashes?
I am *guessing* that the following is happening:
1. coll selection begins
2. coll ml is queried, and disqualifies itself (but is not dlclosed yet)
3. coll hcoll is queried, which ends up calling down into libhcoll. libhcoll
calls a c
We at least need to release an immediate 1.8.7 to rectify the situation,
either by "rm -rf" of ml or ompi_ignore it. I'll ompi_ignore it in the 1.10
branch for now as that hasn't been released yet - if we can get a fix in
the next week or two, we can "unignore" it for the release. I'm still
angling
Should we .ompi_ignore ml?
> On Jun 25, 2015, at 4:41 AM, Joshua Ladd wrote:
>
> Thanks, Gilles.
>
> We are addressing this.
>
> Josh
>
> Sent from my iPhone
>
> On Jun 25, 2015, at 11:03 AM, Gilles Gouaillardet wrote:
>
>> Folks,
>>
>> this is a followup on an issue reported by Daniel o
Thanks, Gilles.
We are addressing this.
Josh
Sent from my iPhone
> On Jun 25, 2015, at 11:03 AM, Gilles Gouaillardet wrote:
>
> Folks,
>
> this is a followup on an issue reported by Daniel on the users mailing list :
> OpenMPI is built with hcoll from Mellanox.
> the coll ml module has defau
Folks,
this is a followup on an issue reported by Daniel on the users mailing
list :
OpenMPI is built with hcoll from Mellanox.
the coll ml module has default priority zero.
on my cluster, it works just fine
on Daniel's cluster, it crashes.
i was able to reproduce the crash by tweaking mca_ba
Daniel,
thanks for the logs.
another workaround is to
mpirun --mca coll ^hcoll ...
i was able to reproduce the issue, and it surprisingly occurs only if
the coll_ml module is loaded *before* the hcoll module.
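The load-order dependence is consistent with how a dynamic linker typically resolves a symbol that is defined twice in the same global lookup scope: the first loaded object that defines it wins. The toy model below only illustrates that rule under this assumption. It is not Open MPI or linker code, and the symbol name `ml_shared_symbol` is made up for illustration:

```python
def resolve(symbol, load_order, definitions):
    """Return the first loaded object that defines `symbol`, or None.

    `load_order` lists objects in the order they were loaded;
    `definitions` maps each object to the set of symbols it defines.
    """
    for obj in load_order:
        if symbol in definitions.get(obj, set()):
            return obj
    return None


# Hypothetical symbol defined by both objects, as described in the thread.
definitions = {
    "mca_coll_ml.so": {"ml_shared_symbol"},
    "libhcoll.so": {"ml_shared_symbol"},
}

# coll_ml loaded first: a call from libhcoll binds to the coll_ml copy.
print(resolve("ml_shared_symbol", ["mca_coll_ml.so", "libhcoll.so"], definitions))

# libhcoll loaded first: libhcoll binds to its own copy.
print(resolve("ml_shared_symbol", ["libhcoll.so", "mca_coll_ml.so"], definitions))
```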
/* this is not the case on my system, so i had to hack my
mca_base_component_path i