A couple of things stand out. You should remove the following configure options:

--enable-mpi-thread-multiple
--with-threads
--enable-heterogeneous

Thread multiple is not ready yet in OMPI (and openib doesn't support threaded 
operations anyway), and the support for heterogeneous systems really isn't working. 
I'm not saying that's the sole source of the problem, but it may well be 
contributing if you are trying to run a multi-threaded app and it exposes 
alternative code paths that may not be fully debugged.
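
For example, your most recent configure line (quoted below) with those three 
options dropped would look roughly like this:

 ./configure --with-sge --with-openib --with-hwloc --disable-vt \
     --enable-openib-dynamic-sl --prefix=/home/jescudero/opt/openmpi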


On Jun 11, 2013, at 7:40 AM, Jesús Escudero Sahuquillo <jescud...@dsi.uclm.es> 
wrote:

> In fact, I have also tried configuring Open MPI with this:
> 
> ./configure --with-sge --with-openib --enable-mpi-thread-multiple 
> --with-threads --with-hwloc --enable-heterogeneous --disable-vt 
> --enable-openib-dynamic-sl --prefix=/home/jescudero/opt/openmpi
> 
> And the problem is still present.
> 
> On 11/06/13 15:32, Mike Dubman wrote:
>> The --mca btl_openib_ib_path_record_service_level 1 flag controls the openib 
>> BTL; you need to remove --mca mtl mxm from the command line.
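>> 
>> For example, the launch command from the message quoted below, with the MXM 
>> MTL removed, would look roughly like this:
>> 
>> mpirun -display-allocation -display-map -np 8 -machinefile maquinas.aux 
>> --mca btl openib,self,sm --mca btl_openib_ib_path_record_service_level 1 
>> --mca btl_openib_cpc_include oob hpcc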
>> 
>> Have you compiled Open MPI against the RHEL 6.4 inbox OFED driver? AFAIK, 
>> MOFED 2.x does not have XRC, and you mentioned the "--enable-openib-connectx-xrc" 
>> flag in your configure line.
>> 
>> 
>> On Tue, Jun 11, 2013 at 3:02 PM, Jesús Escudero Sahuquillo 
>> <jescud...@dsi.uclm.es> wrote:
>> I have a 16-node Mellanox cluster built with Mellanox ConnectX-3 cards. 
>> Recently I have updated MLNX_OFED to version 2.0.5. The reason for this 
>> e-mail to the Open MPI users list is that I am not able to run MPI 
>> applications using the service level (SL) feature of Open MPI's openib BTL.
>> 
>> Currently, the nodes run Red Hat 6.4 with kernel 
>> 2.6.32-358.el6.x86_64. I have compiled Open MPI 1.6.4 with:
>> 
>>  ./configure --with-sge --with-openib=/usr --enable-openib-connectx-xrc 
>> --enable-mpi-thread-multiple --with-threads --with-hwloc 
>> --enable-heterogeneous --with-fca=/opt/mellanox/fca 
>> --with-mxm-libdir=/opt/mellanox/mxm/lib --with-mxm=/opt/mellanox/mxm 
>> --prefix=/home/jescudero/opt/openmpi
>> 
>> I have modified the OpenSM code (based on version 3.3.15) in order to include 
>> a special routing algorithm based on "ftree". Apparently everything is correct 
>> with OpenSM, since it returns the SLs when I execute the command 
>> "saquery --src-to-dst slid:dlid". I have also tried running OpenSM with the 
>> DFSSSP algorithm.
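>> 
>> For instance, with hypothetical source and destination LIDs 4 and 11, the query
>> 
>>  saquery --src-to-dst 4:11
>> 
>> returns the PathRecord for that pair, including the SL chosen by the routing 
>> engine.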
>> 
>> However, when I try to run MPI applications (e.g. HPCC, OSU, or even 
>> alltoall.c, included in the Open MPI sources), I experience some errors if 
>> "btl_openib_ib_path_record_service_level" is set to "1"; otherwise (i.e. if 
>> that parameter is not enabled) the application execution ends correctly. 
>> I run the MPI application with the following command:
>> 
>> mpirun -display-allocation -display-map -np 8 -machinefile maquinas.aux 
>> --mca btl openib,self,sm --mca mtl mxm --mca 
>> btl_openib_ib_path_record_service_level 1 --mca btl_openib_cpc_include oob 
>> hpcc
>> 
>> I obtain the following trace:
>> 
>> [nodo20.XXXXX][[31227,1],6][connect/btl_openib_connect_sl.c:239:get_pathrecord_info]
>>  error posting receive on QP [0x16db] errno says: Success [0]
>> [nodo15.XXXXX][[31227,1],4][connect/btl_openib_connect_sl.c:239:get_pathrecord_info]
>>  error posting receive on QP [0x1749] errno says: Success [0]
>> [nodo17.XXXXX][[31227,1],5][connect/btl_openib_connect_sl.c:239:get_pathrecord_info]
>>  error posting receive on QP [0x1783] errno says: Success [0]
>> [nodo21.XXXXX][[31227,1],7][connect/btl_openib_connect_sl.c:239:get_pathrecord_info]
>>  error posting receive on QP [0x1838] errno says: Success [0]
>> [nodo21.XXXXX][[31227,1],7][connect/btl_openib_connect_oob.c:885:rml_recv_cb]
>>  endpoint connect error: -1
>> [nodo17.XXXXX][[31227,1],5][connect/btl_openib_connect_oob.c:885:rml_recv_cb]
>>  endpoint connect error: -1
>> [nodo15.XXXXX][[31227,1],4][connect/btl_openib_connect_oob.c:885:rml_recv_cb]
>>  endpoint connect error: -1
>> [nodo20.XXXXX][[31227,1],6][connect/btl_openib_connect_oob.c:885:rml_recv_cb]
>>  endpoint connect error: -1
>> 
>> Does anyone know what I am doing wrong?
>> 
>> All the best,
>> 