Ralph,
Indeed, configuration with --enable-mca-no-build=coll-ml resolved my
problem.
So, this *is* the same problem that was already known.
Sorry for the false alarm.
-Paul
On Sun, Aug 23, 2015 at 9:43 AM, Ralph Castain wrote:
I think that’s true - this looks like the hcoll symbol issue. I’d suggest
configuring with --enable-mca-no-build=coll-ml to resolve the problem in static
builds, or follow Gilles' suggestion about .ompi_ignore.
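
For reference, a minimal sketch of what that configure line could look like for the static build discussed in this thread. The three flags are the ones quoted in these messages. The variable name CONFIGURE_ARGS is only illustrative, and site-specific options such as an install prefix are deliberately left out.

```shell
# Flags quoted in this thread: a static-only build with the coll/ml
# component excluded at build time. CONFIGURE_ARGS is an illustrative name.
CONFIGURE_ARGS="--enable-static --disable-shared --enable-mca-no-build=coll-ml"

# Print the command for review instead of running it here; it is meant to be
# run from the top of an Open MPI source tree.
echo ./configure $CONFIGURE_ARGS
```

Excluding the component at configure time matters for static builds because, as noted further down the thread, "--mca coll ^ml" at run time did not prevent the crash.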
> On Aug 22, 2015, at 10:14 PM, Gilles Gouaillardet
> mailto:gilles.gouaillar...@gma
Paul,
if ompi is built statically or with --disable-dlopen, I do not think --mca
coll ^ml can prevent the crash (assuming this is the same issue we
discussed before).
note if you build dynamically and without --disable-dlopen, it might or
might not crash, depending on how modules are enumerated, a
Gilles,
This is on Mellanox's own system where /opt/mellanox/hcoll was updated Aug
2.
This problem also did not occur unless I built libmpi statically.
A run of "mpirun -mca coll ^ml -np 2 examples/ring_c" still crashes.
So, I really don't know if this is the same issue, but suspect that it is
not
Paul,
isn't this an issue that was already discussed?
mellanox proprietary hcoll library includes its own coll ml module that
conflicts with the ompi one.
mellanox folks fixed this internally but I am not sure this has been
released.
you can run
nm libhcoll.so
if there are some symbols starting w
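
A rough sketch of how such a symbol check could be scripted. The helper name defined_symbols_with_prefix is hypothetical, and the prefix to search for is left as a parameter because the message above is cut off before naming it. The helper only filters nm output read from standard input.

```shell
# Hypothetical helper: print defined symbols whose names start with a prefix.
# Reads nm output on stdin. nm prints "<address> <type> <name>" for defined
# symbols and only "<type> <name>" for undefined ones, so keeping
# three-field lines drops the undefined entries.
defined_symbols_with_prefix() {
    prefix="$1"
    awk -v p="$prefix" 'NF == 3 && index($3, p) == 1 { print $3 }'
}

# Example use (HCOLL_LIB and PREFIX are placeholders to fill in locally):
#   nm -D "$HCOLL_LIB" | defined_symbols_with_prefix "$PREFIX"
```

If plain nm reports no symbols for a stripped shared library, nm -D lists the dynamic symbol table instead.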
Having seen problems with mtl:ofi with "--enable-static --disable-shared",
I tried mtl:psm and mtl:mxm with those options as well.
The good news is that mtl:psm was fine, but the bad news is when testing
mtl:mxm I encountered a new problem involving coll:hcoll.
Ralph probably wants to strangle me r