Did we just recently discuss the openib BTL failover capability and decide that it had bit-rotted?
If so, we need to amend our documentation and disable the code.

> On Jan 11, 2017, at 3:11 PM, Dave Turner <drdavetur...@gmail.com> wrote:
>
> The btl_openib_receive_queues parameters that Howard provided
> fixed our problem with getting 2.0.1 working with RoCE, so thanks for
> all the help. However, we are seeing segfaults with this when
> configured with --enable-btl-openib-failover. I've included the
> configuration below that the package manager uses under Gentoo.
> I also tested this after removing all of the redundant enable/disables,
> and it's definitely the --enable-btl-openib-failover that causes 2.0.1
> on RoCE to segfault. I can enable debugging and recompile if more
> information is needed.
>
> Could someone also explain why these parameters need to
> be set explicitly for RoCE rather than being embedded in the code?
>
> Dave
>
> This is the configure line that our package manager generates:
> ./configure --prefix=/usr --build=x86_64-pc-linux-gnu
> --host=x86_64-pc-linux-gnu --mandir=/usr/share/man
> --infodir=/usr/share/info --datadir=/usr/share --sysconfdir=/etc
> --localstatedir=/var/lib --disable-dependency-tracking
> --disable-silent-rules --docdir=/usr/share/doc/openmpi-2.0.1
> --htmldir=/usr/share/doc/openmpi-2.0.1/html --libdir=/usr/lib64
> --sysconfdir=/etc/openmpi --enable-pretty-print-stacktrace
> --enable-orterun-prefix-by-default --with-hwloc=/usr
> --with-libltdl=/usr --enable-mpi-fortran=all --enable-mpi-cxx
> --without-cma --with-cuda=/opt/cuda --disable-io-romio
> --disable-heterogeneous --enable-ipv6 --disable-java
> --disable-mpi-java --disable-mpi-thread-multiple --without-verbs
> --without-knem --without-psm --disable-openib-control-hdr-padding
> --disable-openib-connectx-xrc --disable-openib-rdmacm
> --disable-openib-udcm --disable-openib-dynamic-sl
> --disable-btl-openib-failover --without-tm --without-slurm --with-sge
> --enable-openib-connectx-xrc --enable-openib-rdmacm
> --enable-openib-udcm --enable-openib-dynamic-sl
> --enable-btl-openib-failover --with-verbs
>
> On Thu, Jan 5, 2017 at 10:53 AM, Howard Pritchard <hpprit...@gmail.com> wrote:
> Hi Dave,
>
> Sorry for the delayed response.
>
> Anyway, you have to use rdmacm for connection management when using RoCE.
> However, with 2.0.1 and later, you have to specify per-peer QP info manually
> on the mpirun command line.
>
> Could you try rerunning with
>
> mpirun --mca btl_openib_receive_queues
> P,128,64,32,32,32:S,2048,1024,128,32:S,12288,1024,128,32:S,65536,1024,128,32
> (all the rest of the command line args)
>
> and see if it then works?
>
> Howard
>
>
> 2017-01-04 16:37 GMT-07:00 Dave Turner <drdavetur...@gmail.com>:
> --------------------------------------------------------------------------
> No OpenFabrics connection schemes reported that they were able to be
> used on a specific port. As such, the openib BTL (OpenFabrics
> support) will be disabled for this port.
>
>   Local host:      elf22
>   Local device:    mlx4_2
>   Local port:      1
>   CPCs attempted:  rdmacm, udcm
> --------------------------------------------------------------------------
>
> I posted this to the user list but got no answer, so I'm reposting to
> the devel list.
>
> We recently upgraded to OpenMPI 2.0.1. Everything works fine
> on our QDR connections, but we get the error above for our
> 40 GbE connections running RoCE. I traced through the code and
> it looks like udcm cannot be used with RoCE. I've also read that
> there are currently some problems with rdmacm under 2.0.1, which
> would mean 2.0.1 does not currently work on RoCE. We've tested
> 10.4 using rdmacm and that works fine, so I don't think we have anything
> configured wrong on the RoCE side.
>
> Could someone please verify whether this information is correct: that
> RoCE requires rdmacm only and not udcm, and that rdmacm is currently
> not working? If so, is it being worked on?
>
> Dave
>
>
> --
> Work: davetur...@ksu.edu (785) 532-7791
> 2219 Engineering Hall, Manhattan KS 66506
> Home: drdavetur...@gmail.com
> cell: (785) 770-5929
>
> _______________________________________________
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

--
Jeff Squyres
jsquy...@cisco.com