As of 2.0.0 we support the experimental verbs API. It looks like one of those calls is failing:
#if HAVE_DECL_IBV_EXP_QUERY_DEVICE
    /* Request every experimental attribute bit defined by the headers
     * this build was compiled against. */
    device->ib_exp_dev_attr.comp_mask = IBV_EXP_DEVICE_ATTR_RESERVED - 1;
    if (ibv_exp_query_device(device->ib_dev_context, &device->ib_exp_dev_attr)) {
        BTL_ERROR(("error obtaining device attributes for %s errno says %s",
                   ibv_get_device_name(device->ib_dev), strerror(errno)));
        goto error;
    }
#endif

Do you know what OFED or MOFED version you are running?

-Nathan

> On Jul 13, 2016, at 7:15 AM, Matt Thompson <fort...@gmail.com> wrote:
>
> All,
>
> I've been struggling here at NASA Goddard trying to get PGI 16.5 + Open MPI
> 1.10.3 working on the Discover cluster. What was happening was I'd run our
> climate model at, say, 4x24 and it would work sometimes. Most of the time.
> Every once in a while, it'd throw a segfault. If we changed the layout or
> number of processors, more (and sometimes different) segfaults were triggered.
>
> As we could build with PGI 15.7 + Open MPI 1.10.3 (where Open MPI is built
> exactly the same) and run perfectly, I was focusing on the Open MPI build. I
> tried compiling it at -O3, -O, -O0, all sorts of things, and was about to
> throw in the towel as all failed.
>
> But I saw Open MPI 2.0.0 was out and figured I may as well try the latest
> before reporting to the mailing list. I built it and, huzzah!, it works! I'm
> happy! Except that every time I execute 'mpirun' I get odd errors:
>
> (1034) $ mpirun -np 4 ./helloWorld.mpi2.exe
> --------------------------------------------------------------------------
> WARNING: There was an error initializing an OpenFabrics device.
>
> Local host: borgr074
> Local device: mlx5_0
> --------------------------------------------------------------------------
> [borgr074][[35244,1],1][btl_openib_component.c:1618:init_one_device] error obtaining device attributes for mlx5_0 errno says Cannot allocate memory
> [borgr074][[35244,1],3][btl_openib_component.c:1618:init_one_device] error obtaining device attributes for mlx5_0 errno says Cannot allocate memory
> [borgr074][[35244,1],0][btl_openib_component.c:1618:init_one_device] error obtaining device attributes for mlx5_0 errno says Cannot allocate memory
> [borgr074][[35244,1],2][btl_openib_component.c:1618:init_one_device] error obtaining device attributes for mlx5_0 errno says Cannot allocate memory
> MPI Version: 3.1
> MPI Library Version: Open MPI v2.0.0, package: Open MPI mathomp4@borg01z239 Distribution, ident: 2.0.0, repo rev: v2.x-dev-1570-g0a4a5d7, Jul 12, 2016
> Process 0 of 4 is on borgr074
> Process 3 of 4 is on borgr074
> Process 1 of 4 is on borgr074
> Process 2 of 4 is on borgr074
> [borgr074:29032] 3 more processes have sent help message help-mpi-btl-openib.txt / error in device init
> [borgr074:29032] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>
> If I run with --mca btl_base_verbose 1 and use more than one node, I see that
> the openib/verbs (still not sure what to call this) btl isn't being used, but
> rather tcp:
>
> [borgr075:14374] mca: bml: Using tcp btl for send to [[35628,1],15] on node borgr074
> [borgr075:14374] mca: bml: Using tcp btl for send to [[35628,1],15] on node borgr074
>
> which makes sense since it can't find an Infiniband device.
>
> My first thought is that the build/configure procedure of the past doesn't
> quite jibe with what Open MPI 2.0.0 is expecting?
> I build Open MPI as:
>
> export CC=pgcc
> export CXX=pgc++
> export FC=pgfortran
>
> export CFLAGS="-fpic -m64"
> export CXXFLAGS="-fpic -m64"
> export FCFLAGS="-m64 -fpic"
> export PREFIX=/discover/swdev/mathomp4/MPI/openmpi/2.0.0/pgi-16.5-k40
>
> export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/slurm/lib64
> export LDFLAGS="-L/usr/slurm/lib64"
> export CPPFLAGS="-I/usr/slurm/include"
>
> export LIBS="-lpciaccess"
>
> build() {
>    echo `pwd`
>    ./configure --with-slurm --disable-wrapper-rpath --enable-shared --prefix=${PREFIX}
>    make -j8
>    make install
> }
>
> echo "calling build"
> build
> echo "exiting"
>
> This is a build script built over time; it might have things unnecessary for
> an Open MPI 2.0 build, but perhaps now it needs more info? I can say that in
> the past (say with 1.10.3) it definitely found the openib/verbs btl and used
> it!
>
> Per the website, I'm attaching links to my config.log and "ompi_info --all"
> information:
>
> https://dl.dropboxusercontent.com/u/61696/Open%20MPI/config.log.gz
> https://dl.dropboxusercontent.com/u/61696/Open%20MPI/build.pgi16.5.log.gz
> https://dl.dropboxusercontent.com/u/61696/Open%20MPI/ompi_info.txt.gz
>
> I tried to run "ompi_info -v ompi full --parsable" as asked, but that doesn't
> seem possible anymore:
>
> (1053) $ ompi_info -v ompi full --parsable
> ompi_info: Error: unknown option "-v"
> Type 'ompi_info --help' for usage.
>
> I am asking our machine gurus about the Infiniband network per:
> https://www.open-mpi.org/faq/?category=openfabrics#ofa-troubleshoot
> --
> Matt Thompson
> Man Among Men
> Fulcrum of History
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2016/07/29656.php