As of 2.0.0 we support the experimental (ibv_exp) verbs API, and it looks 
like one of those calls is failing. The comp_mask below requests every 
attribute bit the headers define, so a mismatch between the libibverbs 
headers Open MPI was built against and the driver on the node could make 
the query fail:

#if HAVE_DECL_IBV_EXP_QUERY_DEVICE
    device->ib_exp_dev_attr.comp_mask = IBV_EXP_DEVICE_ATTR_RESERVED - 1;
    if(ibv_exp_query_device(device->ib_dev_context, &device->ib_exp_dev_attr)){
        BTL_ERROR(("error obtaining device attributes for %s errno says %s",
                    ibv_get_device_name(device->ib_dev), strerror(errno)));
        goto error;
    }
#endif
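
If you want to check this outside of Open MPI, here is a quick standalone 
sketch (my untested guess at a reproducer, assuming MOFED's 
infiniband/verbs_exp.h is installed; build with something like 
"cc exp_query.c -libverbs") that issues the same query and reports errno 
for each device:

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <infiniband/verbs_exp.h>

int main(void)
{
    int i, num;
    struct ibv_device **devs = ibv_get_device_list(&num);

    if (NULL == devs) {
        fprintf(stderr, "no verbs devices found\n");
        return 1;
    }

    for (i = 0; i < num; ++i) {
        struct ibv_context *ctx = ibv_open_device(devs[i]);
        struct ibv_exp_device_attr attr;

        if (NULL == ctx) {
            fprintf(stderr, "%s: could not open device\n",
                    ibv_get_device_name(devs[i]));
            continue;
        }

        memset(&attr, 0, sizeof(attr));
        /* same mask the openib BTL uses: all defined attribute bits */
        attr.comp_mask = IBV_EXP_DEVICE_ATTR_RESERVED - 1;
        if (ibv_exp_query_device(ctx, &attr)) {
            fprintf(stderr, "%s: ibv_exp_query_device failed: %s\n",
                    ibv_get_device_name(devs[i]), strerror(errno));
        } else {
            printf("%s: experimental query ok\n",
                   ibv_get_device_name(devs[i]));
        }
        ibv_close_device(ctx);
    }

    ibv_free_device_list(devs);
    return 0;
}

If that fails the same way on mlx5_0, the problem is between the verbs 
library and the driver rather than anything Open MPI is doing on top of it.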

Do you know what OFED or MOFED version you are running?
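
("ofed_info -s" should print the version string, if the OFED scripts are 
installed.) In the meantime, running with "mpirun --mca btl ^openib ..." 
should at least quiet these warnings by excluding the openib BTL; you will 
fall back to tcp, as you are already seeing across nodes.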

-Nathan

> On Jul 13, 2016, at 7:15 AM, Matt Thompson <fort...@gmail.com> wrote:
> 
> All,
> 
> I've been struggling here at NASA Goddard trying to get PGI 16.5 + Open MPI 
> 1.10.3 working on the Discover cluster. I'd run our climate model at, say, 
> 4x24 and it would work sometimes. Most of the time, even. But every once in 
> a while it'd throw a segfault, and if we changed the layout or number of 
> processors, more (and sometimes different) segfaults were triggered.
> 
> As we could build with PGI 15.7 + Open MPI 1.10.3 (with Open MPI built 
> exactly the same way) and run perfectly, I focused on the Open MPI build. I 
> tried compiling it at -O3, -O, -O0, all sorts of things, and was about to 
> throw in the towel as they all failed.
> 
> But, I saw Open MPI 2.0.0 was out and figured, may as well try the latest 
> before reporting to the mailing list. I built it and, huzzah!, it works! I'm 
> happy! Except that every time I execute 'mpirun' I get odd errors:
> 
> (1034) $ mpirun -np 4 ./helloWorld.mpi2.exe 
> --------------------------------------------------------------------------
> WARNING: There was an error initializing an OpenFabrics device.
> 
>   Local host:   borgr074
>   Local device: mlx5_0
> --------------------------------------------------------------------------
> [borgr074][[35244,1],1][btl_openib_component.c:1618:init_one_device] error 
> obtaining device attributes for mlx5_0 errno says Cannot allocate memory
> [borgr074][[35244,1],3][btl_openib_component.c:1618:init_one_device] error 
> obtaining device attributes for mlx5_0 errno says Cannot allocate memory
> [borgr074][[35244,1],0][btl_openib_component.c:1618:init_one_device] error 
> obtaining device attributes for mlx5_0 errno says Cannot allocate memory
> [borgr074][[35244,1],2][btl_openib_component.c:1618:init_one_device] error 
> obtaining device attributes for mlx5_0 errno says Cannot allocate memory
> MPI Version: 3.1
> MPI Library Version: Open MPI v2.0.0, package: Open MPI mathomp4@borg01z239 
> Distribution, ident: 2.0.0, repo rev: v2.x-dev-1570-g0a4a5d7, Jul 12, 2016
> Process    0 of    4 is on borgr074
> Process    3 of    4 is on borgr074
> Process    1 of    4 is on borgr074
> Process    2 of    4 is on borgr074
> [borgr074:29032] 3 more processes have sent help message 
> help-mpi-btl-openib.txt / error in device init
> [borgr074:29032] Set MCA parameter "orte_base_help_aggregate" to 0 to see all 
> help / error messages
> 
> If I run with --mca btl_base_verbose 1 and use more than one node, I see 
> that the openib/verbs BTL (still not sure what to call it) isn't being 
> used; tcp is used instead:
> 
> [borgr075:14374] mca: bml: Using tcp btl for send to [[35628,1],15] on node 
> borgr074
> [borgr075:14374] mca: bml: Using tcp btl for send to [[35628,1],15] on node 
> borgr074
> 
> which makes sense since it can't find an InfiniBand device.
> 
> My first thought is that my build/configure procedure from the past doesn't 
> quite jibe with what Open MPI 2.0.0 expects. I build Open MPI as:
> 
> export CC=pgcc
> export CXX=pgc++
> export FC=pgfortran
> 
> export CFLAGS="-fpic -m64"
> export CXXFLAGS="-fpic -m64"
> export FCFLAGS="-m64 -fpic"
> export PREFIX=/discover/swdev/mathomp4/MPI/openmpi/2.0.0/pgi-16.5-k40
> 
> export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/slurm/lib64
> export LDFLAGS="-L/usr/slurm/lib64"
> export CPPFLAGS="-I/usr/slurm/include"
> 
> export LIBS="-lpciaccess"
> 
> build() {
>   echo `pwd`
>   ./configure --with-slurm --disable-wrapper-rpath --enable-shared \
>     --prefix=${PREFIX}
>   make -j8
>   make install
> }
> 
> echo "calling build"
> build
> echo "exiting"
> 
> This build script has grown over time; it might have things in it that are 
> unnecessary for an Open MPI 2.0 build, but perhaps now it needs more 
> options? I can say that in the past (say, with 1.10.3) it definitely found 
> the openib/verbs BTL and used it!
> 
> Per the website, I'm attaching links to my config.log and "ompi_info --all" 
> information:
> 
> https://dl.dropboxusercontent.com/u/61696/Open%20MPI/config.log.gz
> https://dl.dropboxusercontent.com/u/61696/Open%20MPI/build.pgi16.5.log.gz
> https://dl.dropboxusercontent.com/u/61696/Open%20MPI/ompi_info.txt.gz
> 
> I tried to run "ompi_info -v ompi full --parsable" as asked but that doesn't 
> seem possible anymore:
> 
> (1053) $ ompi_info -v ompi full --parsable
> ompi_info: Error: unknown option "-v"
> Type 'ompi_info --help' for usage.
> 
> I am asking our machine gurus about the Infiniband network per: 
> https://www.open-mpi.org/faq/?category=openfabrics#ofa-troubleshoot
> -- 
> Matt Thompson
> Man Among Men
> Fulcrum of History
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2016/07/29656.php
