I have a fix - about to commit
On Thu, Jun 25, 2015 at 8:46 PM, Jeff Squyres (jsquyres) wrote:
> Howard --
>
> The LANL distcheck jenkins hasn't been running all day.
>
>
> > On Jun 25, 2015, at 8:33 PM, Howard Pritchard wrote:
> >
> > Hi folks,
> >
> > I'm confused about this build failure.
Howard --
The LANL distcheck jenkins hasn't been running all day.
> On Jun 25, 2015, at 8:33 PM, Howard Pritchard wrote:
>
> Hi folks,
>
> I'm confused about this build failure. It should have been caught by the
> make distcheck IU jenkins
> project I would think. Should the IU jenkins pro
Hi folks,
I'm confused about this build failure. It should have been caught by the
make distcheck IU jenkins project, I would think. Should the IU jenkins
project do something else besides
make -j X distcheck
to catch this problem?
Or, did this problem happen because someone bypassed the PR pro
On Thu, Jun 25, 2015 at 5:05 PM, Paul Hargrove wrote:
>
> On Thu, Jun 25, 2015 at 4:59 PM, Gilles Gouaillardet wrote:
>
>> In this case, mca_coll_hcoll module is linked with the proprietary
>> libhcoll.so.
>> the ml symbols are defined in both mca_coll_ml.so and libhcoll.so
>> i am not sure (i
On Thu, Jun 25, 2015 at 4:59 PM, Gilles Gouaillardet wrote:
> In this case, mca_coll_hcoll module is linked with the proprietary
> libhcoll.so.
> the ml symbols are defined in both mca_coll_ml.so and libhcoll.so
> i am not sure (i blame my poor understanding of linkers) this is an error
> if
> Op
Paul,
generally speaking, that is a good point.
another option could be to write a script that detects symbols defined
more than once.
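
A minimal sketch of such a duplicate-symbol check, assuming GNU nm is
available on the build machine. The helper names (nm_listing,
defined_symbols, duplicate_symbols) are made up for illustration and are
not part of Open MPI. The check only looks at strong global definitions
(nm types T, D, R, B), so it is a starting point rather than a complete
tool.

```python
import subprocess
from collections import defaultdict

# nm type letters for strong, globally visible definitions:
# T = text, D = data, R = read-only data, B = BSS.
# Weak (W/V) and undefined (U) symbols are deliberately ignored here.
DEFINED_TYPES = set("TDRB")


def nm_listing(path):
    """Run GNU nm on a shared object and return its textual output."""
    result = subprocess.run(
        ["nm", "-D", "--defined-only", path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def defined_symbols(nm_output):
    """Return the set of defined global symbol names in one nm listing."""
    symbols = set()
    for line in nm_output.splitlines():
        parts = line.split()
        # Defined symbols have three fields: address, type, name.
        if len(parts) == 3 and parts[1] in DEFINED_TYPES:
            # Drop any symbol-version suffix such as "@@VERS_1".
            symbols.add(parts[2].split("@", 1)[0])
    return symbols


def duplicate_symbols(listings):
    """Map each symbol defined in more than one library to those libraries.

    `listings` maps a library name to the nm output for that library.
    """
    owners = defaultdict(list)
    for lib, nm_output in listings.items():
        for sym in defined_symbols(nm_output):
            owners[sym].append(lib)
    return {sym: sorted(libs) for sym, libs in owners.items() if len(libs) > 1}
```

Something like duplicate_symbols({p: nm_listing(p) for p in paths}), with
paths holding the component and library .so files, would then report the
names exported by more than one of them.
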
In this case, mca_coll_hcoll module is linked with the proprietary
libhcoll.so.
the ml symbols are defined in both mca_coll_ml.so and libhcoll.so
i am not s
I can see cloning of existing component's source as a starting point for a
new one as a common occurrence (at least relative to creating new
components from zero).
So, this is probably not the last time this will ever occur.
Would a build with --disable-dlopen have detected this problem (by failin
Devendar literally just reproduced this here at the developer meeting, too.
Sweet -- ok, so we understand what is going on.
Devendar/Mellanox is going to talk about this internally and get back to us.
> On Jun 25, 2015, at 2:59 PM, Gilles Gouaillardet wrote:
>
> Jeff,
>
> this is exactly what
Jeff,
this is exactly what happens.
I will send a stack trace later
Cheers,
Gilles
On Thursday, June 25, 2015, Jeff Squyres (jsquyres) wrote:
> Gilles --
>
> Can you send a stack trace from one of these crashes?
>
> I am *guessing* that the following is happening:
>
> 1. coll selection begin
That appears to be correct.
On Thu, Jun 25, 2015 at 9:51 AM, Shamis, Pavel wrote:
> As I read this thread - this issue is not related to the ML bootstrap
> itself,
> but the naming conflict between public functions in HCOLL and ML.
>
> Did I get it right ?
>
> If this is the case, we can work with
As I read this thread, this issue is not related to the ML bootstrap itself,
but to the naming conflict between public functions in HCOLL and ML.
Did I get it right?
If this is the case, we can work with Mellanox folks to resolve this conflict.
Best,
Pavel (Pasha) Shamis
---
Computer Science Rese
We have removed the following stale / inactive frameworks/components from the
2.x tree:
- ompi coll hierarch (it was effectively already removed, anyway)
- ompi coll ml
- ompi sbgp
- ompi bcol
- orte reachable
*** DEVELOPERS: Are there other components / frameworks that should be removed
from t
Gilles --
Can you send a stack trace from one of these crashes?
I am *guessing* that the following is happening:
1. coll selection begins
2. coll ml is queried, and disqualifies itself (but is not dlclosed yet)
3. coll hcoll is queried, which ends up calling down into libhcoll. libhcoll
calls a c
We at least need to release an immediate 1.8.7 to rectify the situation,
either by "rm -rf"-ing ml or by ompi_ignore-ing it. I'll ompi_ignore it in the 1.10
branch for now as that hasn't been released yet - if we can get a fix in
the next week or two, we can "unignore" it for the release. I'm still
angling
Should we .ompi_ignore ml?
> On Jun 25, 2015, at 4:41 AM, Joshua Ladd wrote:
>
> Thanks, Gilles.
>
> We are addressing this.
>
> Josh
>
> Sent from my iPhone
>
> On Jun 25, 2015, at 11:03 AM, Gilles Gouaillardet wrote:
>
>> Folks,
>>
>> this is a followup on an issue reported by Daniel o
Thanks, Gilles.
We are addressing this.
Josh
Sent from my iPhone
> On Jun 25, 2015, at 11:03 AM, Gilles Gouaillardet wrote:
>
> Folks,
>
> this is a followup on an issue reported by Daniel on the users mailing list :
> OpenMPI is built with hcoll from Mellanox.
> the coll ml module has defau
Folks,
this is a followup on an issue reported by Daniel on the users mailing
list :
OpenMPI is built with hcoll from Mellanox.
the coll ml module has default priority zero.
on my cluster, it works just fine
on Daniel's cluster, it crashes.
i was able to reproduce the crash by tweaking mca_ba