The btl_openib_receive_queues parameters that Howard provided
fixed our problem with getting 2.0.1 working with RoCE, so thanks for
all the help. However, we are seeing segfaults when Open MPI is
configured with --enable-btl-openib-failover. I've included below the
configure line that the package manager uses under Gentoo.
I also tested after removing all of the redundant enable/disable options,
and it is definitely --enable-btl-openib-failover that causes 2.0.1
on RoCE to segfault. I can enable debugging and recompile if more
information is needed.
Could someone also explain why these parameters need to
be set explicitly for RoCE rather than being embedded in the code?
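For anyone else hitting this, here is a minimal sketch of one way to avoid
retyping the option on every mpirun invocation. It relies on Open MPI reading
MCA parameters from OMPI_MCA_<name> environment variables and from a per-user
mca-params.conf file. The queue values are copied from Howard's suggestion
below; the "queues" shell variable is just an illustrative name, and I have
not checked whether these values are appropriate for other hardware.

```shell
# Queue specs copied from Howard's suggestion: one per-peer (P) queue
# followed by three shared-receive (S) queues of increasing buffer size.
queues="P,128,64,32,32,32"
queues="$queues:S,2048,1024,128,32"
queues="$queues:S,12288,1024,128,32"
queues="$queues:S,65536,1024,128,32"

# Open MPI picks up any MCA parameter from an OMPI_MCA_<name> variable,
# so a later plain "mpirun ..." in this shell would see the setting.
export OMPI_MCA_btl_openib_receive_queues="$queues"

# Alternative: persist the setting per user (uncomment to use).
# echo "btl_openib_receive_queues = $queues" >> "$HOME/.openmpi/mca-params.conf"

# Show the value that mpirun would inherit.
echo "$OMPI_MCA_btl_openib_receive_queues"
```

A value given with --mca on the mpirun command line still takes precedence
over the environment variable and the file, so a one-off override remains
possible.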
Dave
This is the configure line that our package manager generates:
./configure --prefix=/usr --build=x86_64-pc-linux-gnu \
  --host=x86_64-pc-linux-gnu --mandir=/usr/share/man \
  --infodir=/usr/share/info --datadir=/usr/share --sysconfdir=/etc \
  --localstatedir=/var/lib --disable-dependency-tracking \
  --disable-silent-rules --docdir=/usr/share/doc/openmpi-2.0.1 \
  --htmldir=/usr/share/doc/openmpi-2.0.1/html --libdir=/usr/lib64 \
  --sysconfdir=/etc/openmpi --enable-pretty-print-stacktrace \
  --enable-orterun-prefix-by-default --with-hwloc=/usr \
  --with-libltdl=/usr --enable-mpi-fortran=all --enable-mpi-cxx \
  --without-cma --with-cuda=/opt/cuda --disable-io-romio \
  --disable-heterogeneous --enable-ipv6 --disable-java \
  --disable-mpi-java --disable-mpi-thread-multiple --without-verbs \
  --without-knem --without-psm --disable-openib-control-hdr-padding \
  --disable-openib-connectx-xrc --disable-openib-rdmacm \
  --disable-openib-udcm --disable-openib-dynamic-sl \
  --disable-btl-openib-failover --without-tm --without-slurm --with-sge \
  --enable-openib-connectx-xrc --enable-openib-rdmacm \
  --enable-openib-udcm --enable-openib-dynamic-sl \
  --enable-btl-openib-failover --with-verbs
On Thu, Jan 5, 2017 at 10:53 AM, Howard Pritchard <[email protected]>
wrote:
> Hi Dave,
>
> Sorry for the delayed response.
>
> Anyway, you have to use rdmacm for connection management when using RoCE.
> However, with 2.0.1 and later, you have to specify per-peer QP info
> manually on the mpirun command line.
>
> Could you try rerunning with
>
> mpirun --mca btl_openib_receive_queues \
>   P,128,64,32,32,32:S,2048,1024,128,32:S,12288,1024,128,32:S,65536,1024,128,32 \
>   (all the rest of the command line args)
>
> and see if it then works?
>
> Howard
>
>
> 2017-01-04 16:37 GMT-07:00 Dave Turner <[email protected]>:
>
>> --------------------------------------------------------------------------
>> No OpenFabrics connection schemes reported that they were able to be
>> used on a specific port. As such, the openib BTL (OpenFabrics
>> support) will be disabled for this port.
>>
>> Local host: elf22
>> Local device: mlx4_2
>> Local port: 1
>> CPCs attempted: rdmacm, udcm
>> --------------------------------------------------------------------------
>>
>> I posted this to the user list but got no answer so I'm reposting to
>> the devel list.
>>
>> We recently upgraded to OpenMPI 2.0.1. Everything works fine
>> on our QDR connections but we get the error above for our
>> 40 GbE connections running RoCE. I traced through the code and
>> it looks like udcm cannot be used with RoCE. I've also read that
>> there are currently some problems with rdmacm under 2.0.1, which
>> would mean 2.0.1 does not currently work on RoCE. We've tested
>> 10.4 using rdmacm and that works fine so I don't think we have anything
>> configured wrong on the RoCE side.
>> Could someone please verify whether this information is correct: that
>> RoCE requires rdmacm rather than udcm, and that rdmacm is currently
>> not working? If so, is it being worked on?
>>
>> Dave
>>
>>
>> --
>> Work: [email protected] (785) 532-7791
>> 2219 Engineering Hall, Manhattan KS 66506
>> Home: [email protected]
>> cell: (785) 770-5929
>>
>> _______________________________________________
>> devel mailing list
>> [email protected]
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>>
>
>