Did we just recently discuss the openib BTL failover capability and decide that 
it had bit-rotted?

If so, we need to amend our documentation and disable the code.


> On Jan 11, 2017, at 3:11 PM, Dave Turner <drdavetur...@gmail.com> wrote:
> 
> 
>      The btl_openib_receive_queues parameters that Howard provided
> fixed our problem with getting 2.0.1 working with RoCE so thanks for
> all the help.  However, we are seeing segfaults with this when
> configured with --enable-btl-openib-failover.  I've included the 
> configuration below that the package manager uses under Gentoo.
> I also tested this after removing all of the redundant enable/disables,
> and it's definitely the --enable-btl-openib-failover that causes 2.0.1
> on RoCE to segfault.  I can enable debugging and recompile if more
> information is needed.
> 
>      Could someone also explain why these parameters need to
> be set explicitly for RoCE rather than being embedded in the code?
> 
>                    Dave
> 
> This is the configure line that our package manager generates:
> ./configure --prefix=/usr --build=x86_64-pc-linux-gnu
> --host=x86_64-pc-linux-gnu --mandir=/usr/share/man
> --infodir=/usr/share/info --datadir=/usr/share --sysconfdir=/etc
> --localstatedir=/var/lib --disable-dependency-tracking
> --disable-silent-rules --docdir=/usr/share/doc/openmpi-2.0.1
> --htmldir=/usr/share/doc/openmpi-2.0.1/html --libdir=/usr/lib64
> --sysconfdir=/etc/openmpi --enable-pretty-print-stacktrace
> --enable-orterun-prefix-by-default --with-hwloc=/usr
> --with-libltdl=/usr --enable-mpi-fortran=all --enable-mpi-cxx
> --without-cma --with-cuda=/opt/cuda --disable-io-romio
> --disable-heterogeneous --enable-ipv6 --disable-java
> --disable-mpi-java --disable-mpi-thread-multiple --without-verbs
> --without-knem --without-psm --disable-openib-control-hdr-padding
> --disable-openib-connectx-xrc --disable-openib-rdmacm
> --disable-openib-udcm --disable-openib-dynamic-sl
> --disable-btl-openib-failover --without-tm --without-slurm --with-sge
> --enable-openib-connectx-xrc --enable-openib-rdmacm
> --enable-openib-udcm --enable-openib-dynamic-sl
> --enable-btl-openib-failover --with-verbs
> 
> On Thu, Jan 5, 2017 at 10:53 AM, Howard Pritchard <hpprit...@gmail.com> wrote:
> Hi Dave,
> 
> Sorry for the delayed response.  
> 
> Anyway, you have to use rdmacm for connection management when using RoCE.
> However, with 2.0.1 and later, you have to specify the per-peer QP info manually
> on the mpirun command line.
> 
> Could you try rerunning with
> 
> mpirun --mca btl_openib_receive_queues \
>   P,128,64,32,32,32:S,2048,1024,128,32:S,12288,1024,128,32:S,65536,1024,128,32 \
>   (all the rest of the command line args)
> 
> and see if it then works?
> 
> Howard
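
For anyone reading this thread later: the receive_queues value above is a colon-separated list of queue specifications, each a comma-separated record that starts with a queue type letter. The sketch below is only an illustration of how that string breaks apart. The helper name parse_receive_queues and the QueueSpec type are hypothetical, not part of Open MPI. The field meanings in the comments (buffer size first, then buffer count and watermarks) are my reading of the Open MPI FAQ and should be treated as an assumption.

```python
from typing import List, NamedTuple


class QueueSpec(NamedTuple):
    """One entry of a btl_openib_receive_queues string (hypothetical helper type)."""
    kind: str          # "P" (per-peer), "S" (shared), or "X" (XRC); assumed set of types
    values: List[int]  # numeric fields; values[0] is assumed to be the buffer size in bytes


def parse_receive_queues(spec: str) -> List[QueueSpec]:
    """Split a receive_queues string into its per-queue records.

    Whitespace around fields is tolerated so that a value damaged by
    mail line-wrapping (for example "S, 12288") still parses.
    """
    queues = []
    for entry in spec.split(":"):
        fields = [field.strip() for field in entry.split(",")]
        kind, numbers = fields[0], fields[1:]
        if kind not in ("P", "S", "X"):
            raise ValueError(f"unknown queue type {kind!r} in {entry!r}")
        if not numbers:
            raise ValueError(f"queue entry {entry!r} has no buffer size")
        queues.append(QueueSpec(kind, [int(n) for n in numbers]))
    return queues


if __name__ == "__main__":
    value = ("P,128,64,32,32,32:S,2048,1024,128,32:"
             "S,12288,1024,128,32:S,65536,1024,128,32")
    for queue in parse_receive_queues(value):
        print(queue.kind, queue.values)
```

Running it on the value from this thread lists one per-peer queue followed by three shared queues, which makes it easier to spot a stray space or a missing field before passing the string to mpirun.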
> 
> 
> 2017-01-04 16:37 GMT-07:00 Dave Turner <drdavetur...@gmail.com>:
> --------------------------------------------------------------------------
> No OpenFabrics connection schemes reported that they were able to be
> used on a specific port.  As such, the openib BTL (OpenFabrics
> support) will be disabled for this port.
> 
>   Local host:           elf22
>   Local device:         mlx4_2
>   Local port:           1
>   CPCs attempted:       rdmacm, udcm
> --------------------------------------------------------------------------
> 
>     I posted this to the user list but got no answer so I'm reposting to
> the devel list.
> 
>     We recently upgraded to OpenMPI 2.0.1.  Everything works fine
> on our QDR connections but we get the error above for our
> 40 GbE connections running RoCE.  I traced through the code and
> it looks like udcm cannot be used with RoCE.  I've also read that 
> there are currently some problems with rdmacm under 2.0.1, which
> would mean 2.0.1 does not currently work on RoCE.  We've tested
> 10.4 using rdmacm and that works fine so I don't think we have anything
> configured wrong on the RoCE side.  
>      Could someone please verify whether this information is correct: that
> RoCE requires rdmacm only and not udcm, and that rdmacm is currently
> not working?  If so, is it being worked on?
> 
>                      Dave
> 
> 
> -- 
> Work:     davetur...@ksu.edu     (785) 532-7791
>              2219 Engineering Hall, Manhattan KS  66506
> Home:    drdavetur...@gmail.com
>               cell: (785) 770-5929
> 
> _______________________________________________
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
> 
> 


-- 
Jeff Squyres
jsquy...@cisco.com