Folks,
this is a follow-up on an issue Daniel reported on the users mailing
list:
OpenMPI is built with hcoll from Mellanox, and the coll ml module has
default priority zero.
On my cluster it works just fine; on Daniel's cluster it crashes.
I was able to reproduce the crash by tweaking mca_base_component_path
to ensure the coll ml module is loaded first.
Basically, I found two issues:
1) libhcoll.so (the vendor library provided by Mellanox; I tested
hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.2-x86_64) seems to include its
own coll ml, since it exports some *public* symbols that are also
defined by this module (ml_open, ml_coll_hier_barrier_setup, ...);
see the quick nm check below.
2) coll ml priority is zero, and even though the library is dlclose'd,
the dlclose seems to be ineffective
(nothing changed in /proc/xxx/maps before and after dlclose).
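A quick way to check the first point (nm is from binutils; the
libhcoll.so path below is just an example and depends on the install
prefix):

    nm -D /opt/mellanox/hcoll/lib/libhcoll.so | grep ' T ml_'

If the ml_* symbols show up as global text ('T') entries in the dynamic
symbol table, they can clash with the same symbols exported by Open
MPI's mca_coll_ml.so.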
There are two workarounds:
    mpirun --mca coll ^ml
or
    mpirun --mca coll ^hcoll ...   (probably not what is wanted, though)
Is it expected that the library is not unloaded after dlclose?
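Here is a minimal, self-contained sketch of the kind of check I did
(this is not Open MPI code; the dlopen flags and the test program are
just mine): dlopen a library, dlclose it, and see whether it is still
listed in /proc/self/maps. Compile with -ldl and pass the library path
on the command line.

#include <dlfcn.h>
#include <stdio.h>
#include <string.h>

/* return 1 if /proc/self/maps still references the library, 0 otherwise */
static int is_mapped(const char *lib)
{
    char line[4096];
    int found = 0;
    FILE *maps = fopen("/proc/self/maps", "r");
    if (!maps) return -1;
    while (fgets(line, sizeof(line), maps))
        if (strstr(line, lib)) { found = 1; break; }
    fclose(maps);
    return found;
}

int main(int argc, char *argv[])
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s /path/to/library.so\n", argv[0]);
        return 1;
    }
    void *handle = dlopen(argv[1], RTLD_NOW | RTLD_GLOBAL);
    if (!handle) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }
    printf("after dlopen : mapped = %d\n", is_mapped(argv[1]));
    dlclose(handle);
    /* a library marked NODELETE, or still referenced as a dependency of
       another loaded object, legitimately stays mapped at this point */
    printf("after dlclose: mapped = %d\n", is_mapped(argv[1]));
    return 0;
}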
Mellanox folks,
can you please double-check how libhcoll is built?
I guess it would work if the ml_ symbols were private to the library
(see the version-script sketch below).
If not, the only workaround is to mpirun --mca coll ^ml;
otherwise it might crash (if coll_ml is loaded before coll_hcoll, which
is really system dependent).
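For illustration only, since I do not know how libhcoll is actually
built: the ml_* symbols could be hidden with a linker version script
(the hcoll_* prefix for the public API is an assumption on my part):

    /* hcoll.map (hypothetical) */
    {
      global: hcoll_*;
      local:  *;
    };

    gcc -shared ... -Wl,--version-script=hcoll.map -o libhcoll.so

or equivalently by compiling with -fvisibility=hidden and marking only
the public entry points with default visibility.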
Cheers,
Gilles
On 6/25/2015 10:46 AM, Gilles Gouaillardet wrote:
Daniel,
Thanks for the logs.
Another workaround is to
    mpirun --mca coll ^hcoll ...
I was able to reproduce the issue, and surprisingly it occurs only if
the coll_ml module is loaded *before* the hcoll module.
/* this is not the case on my system, so I had to hack my
mca_base_component_path in order to reproduce the issue */
As far as I understand, libhcoll is proprietary software, so I
cannot dig into it.
That being said, I noticed libhcoll defines some symbols (such as
ml_coll_hier_barrier_setup) that are also defined by the coll_ml
module, so it is likely that the coll_ml inside hcoll and Open MPI's
coll_ml are not binary compatible, hence the error.
I will dig a bit more and see whether this is even supposed to happen
(since coll_ml_priority is zero, why is the module still loaded?).
For now, you *have to* run mpirun --mca coll ^ml, or update your
user/system-wide config file to blacklist the coll_ml module, to make
this work (an example follows below).
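For the config-file route, a per-user blacklist would be a single line
like this (assuming the default per-user MCA parameter file location):

    # $HOME/.openmpi/mca-params.conf
    coll = ^ml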
Mike and Mellanox folks, could you please comment on that?
Cheers,
Gilles
On 6/24/2015 5:23 PM, Daniel Letai wrote:
Gilles,
Attached are the two output logs.
Thanks,
Daniel
On 06/22/2015 08:08 AM, Gilles Gouaillardet wrote:
Daniel,
I double-checked this and I cannot make any sense of these logs.
If coll_ml_priority is zero, then I do not see any way
ml_coll_hier_barrier_setup could be invoked.
Could you please run again with --mca coll_base_verbose 100,
with and without --mca coll ^ml?
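For example, something like the following would produce the two logs
to compare (the log file names are just suggestions):

    mpirun --mca coll_base_verbose 100 -n 2 ./hello 2>&1 | tee coll_with_ml.log
    mpirun --mca coll_base_verbose 100 --mca coll ^ml -n 2 ./hello 2>&1 | tee coll_no_ml.log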
Cheers,
Gilles
On 6/22/2015 12:08 AM, Gilles Gouaillardet wrote:
Daniel,
OK, thanks.
It seems that even if the priority is zero, some code gets executed.
I will confirm this tomorrow and send you a patch to work around
the issue if my guess proves right.
Cheers,
Gilles
On Sunday, June 21, 2015, Daniel Letai <d...@letai.org.il> wrote:
MCA coll: parameter "coll_ml_priority" (current value: "0",
data source: default, level: 9 dev/all, type: int)
Not sure how to read this, but for any n>1, mpirun only works
with --mca coll ^ml.
Thanks for helping
On 06/18/2015 04:36 PM, Gilles Gouaillardet wrote:
This is really odd...
You can run
    ompi_info --all
and search for coll_ml_priority;
it will display the current value and its origin
(e.g. default, system-wide config, user config, command line,
environment variable).
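For example (just grep'ing instead of searching by hand):

    ompi_info --all | grep coll_ml_priority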
Cheers,
Gilles
On Thursday, June 18, 2015, Daniel Letai <d...@letai.org.il> wrote:
No, that's the issue.
I had to disable it to get things working.
That's why I included my config settings: I couldn't figure out which
option enabled it, so that I could remove it from the
configuration...
On 06/18/2015 02:43 PM, Gilles Gouaillardet wrote:
Daniel,
The ML module is not ready for production and is disabled by
default.
Did you explicitly enable this module?
If so, I encourage you to disable it.
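(If it was enabled explicitly, that would typically have been done by
raising its priority, e.g. something like
    mpirun --mca coll_ml_priority 90 ...
or an equivalent coll_ml_priority line in an MCA parameter file; the
value 90 is just an example.)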
Cheers,
Gilles
On Thursday, June 18, 2015, Daniel Letai
<d...@letai.org.il> wrote:
given a simple hello.c:
#include <stdio.h>
#include <mpi.h>

int main(int argc, char* argv[])
{
    int size, rank, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(name, &len);
    printf("%s: Process %d out of %d\n", name, rank, size);
    MPI_Finalize();
}
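(compiled with the Open MPI wrapper compiler, e.g.
    mpicc -o hello hello.c
assuming the mpicc from this 1.8.5 install is first in PATH)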
For n=1,
    mpirun -n 1 ./hello
works correctly.
For n>1 it segfaults with signal 11.
I used gdb to trace the problem to the ml coll module:
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff6750845 in ml_coll_hier_barrier_setup()
from <path to openmpi
1.8.5>/lib/openmpi/mca_coll_ml.so
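(One way to get such a per-rank backtrace, assuming an X display is
available, is to run one gdb per rank, e.g.
    mpirun -n 2 xterm -e gdb ./hello
and then "run" inside each gdb.)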
Running with
    mpirun -n 2 --mca coll ^ml ./hello
works correctly.
Using Mellanox OFED 2.3-2.0.5-rhel6.4-x86_64, if it's at all relevant.
OpenMPI 1.8.5 was built with the following options:
rpmbuild --rebuild --define 'configure_options
--with-verbs=/usr --with-verbs-libdir=/usr/lib64
CC=gcc CXX=g++ FC=gfortran CFLAGS="-g -O3"
--enable-mpirun-prefix-by-default
--enable-orterun-prefix-by-default --disable-debug
--with-knem=/opt/knem-1.1.1.90mlnx
--with-platform=optimized --without-mpi-param-check
--with-contrib-vt-flags=--disable-iotrace
--enable-builtin-atomics --enable-cxx-exceptions
--enable-sparse-groups --enable-mpi-thread-multiple
--enable-memchecker --enable-btl-openib-failover
--with-hwloc=internal --with-verbs --with-x
--with-slurm --with-pmi=/opt/slurm
--with-fca=/opt/mellanox/fca
--with-mxm=/opt/mellanox/mxm
--with-hcoll=/opt/mellanox/hcoll' openmpi-1.8.5-1.src.rpm
gcc version 5.1.1
Thanks in advance