Perhaps we can go back one step: use your mpicc to build a "hello world" MPI test, then run it on a compute node (with a GPU) to see if it works. If it fails, your MPI environment has problems; if it succeeds, use the same mpicc to build PETSc (with PETSc's GPU support turned on: --with-cuda --with-cudac=nvcc), and then build your code.
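Something like this is enough (a minimal sketch; the file name and the
process count are arbitrary):

    /* hello_mpi.c - smoke test for the MPI toolchain */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, len;
        char name[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank   */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total process count   */
        MPI_Get_processor_name(name, &len);     /* node we are running on */
        printf("Hello from rank %d of %d on %s\n", rank, size, name);
        MPI_Finalize();
        return 0;
    }

Build and run it with the same wrappers you will use for PETSc:

    mpicc hello_mpi.c -o hello_mpi
    mpiexec -n 2 ./hello_mpi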
--Junchao Zhang


On Fri, Oct 7, 2022 at 10:45 PM Rob Kudyba <rk3...@columbia.edu> wrote:

> The error changes now and at an earlier place, 66% vs 70%:
>
>   make LDFLAGS="-Wl,--copy-dt-needed-entries"
>   Consolidate compiler generated dependencies of target fmt
>   [ 12%] Built target fmt
>   Consolidate compiler generated dependencies of target richdem
>   [ 37%] Built target richdem
>   Consolidate compiler generated dependencies of target wtm
>   [ 62%] Built target wtm
>   Consolidate compiler generated dependencies of target wtm.x
>   [ 66%] Linking CXX executable wtm.x
>   /usr/bin/ld: libwtm.a(transient_groundwater.cpp.o): undefined reference to symbol 'MPI_Abort'
>   /path/to/openmpi-4.1.1_ucx_cuda_11.0.3_support/lib/libmpi.so.40: error adding symbols: DSO missing from command line
>   collect2: error: ld returned 1 exit status
>   make[2]: *** [CMakeFiles/wtm.x.dir/build.make:103: wtm.x] Error 1
>   make[1]: *** [CMakeFiles/Makefile2:225: CMakeFiles/wtm.x.dir/all] Error 2
>   make: *** [Makefile:136: all] Error 2
>
> So perhaps PETSc is now being found. Any other suggestions?
>
> On Fri, Oct 7, 2022 at 11:18 PM Rob Kudyba <rk3...@columbia.edu> wrote:
>
>>>> Thanks for the quick reply. I added these options to make, and make check
>>>> still produced the warnings, so I used the command like this:
>>>>
>>>>   make PETSC_DIR=/path/to/petsc PETSC_ARCH=arch-linux-c-debug MPIEXEC="mpiexec -mca orte_base_help_aggregate 0 --mca opal_warn_on_missing_libcuda 0 -mca pml ucx --mca btl '^openib'" check
>>>>   Running check examples to verify correct installation
>>>>   Using PETSC_DIR=/path/to/petsc and PETSC_ARCH=arch-linux-c-debug
>>>>   C/C++ example src/snes/tutorials/ex19 run successfully with 1 MPI process
>>>>   C/C++ example src/snes/tutorials/ex19 run successfully with 2 MPI processes
>>>>   Completed test examples
>>>>
>>>> Could be useful for the FAQ.
>>>>
>>> You mentioned you had "OpenMPI 4.1.1 with CUDA aware", so I think a
>>> workable mpicc should automatically find the CUDA libraries. Maybe you
>>> unloaded the CUDA libraries?
>>>
>> Oh, let me clarify: OpenMPI is CUDA-aware, but this code, and the node
>> where PETSc is compiling, does not have a GPU, hence CUDA is not needed,
>> and using the MPIEXEC option worked during the 'check' to suppress the
>> warning.
>>
>> I'm now trying to use PETSc to compile, and linking appears to go awry:
>>
>>>>   [ 58%] Building CXX object CMakeFiles/wtm.dir/src/update_effective_storativity.cpp.o
>>>>   [ 62%] Linking CXX static library libwtm.a
>>>>   [ 62%] Built target wtm
>>>>   [ 66%] Building CXX object CMakeFiles/wtm.x.dir/src/WTM.cpp.o
>>>>   [ 70%] Linking CXX executable wtm.x
>>>>   /usr/bin/ld: cannot find -lpetsc
>>>>   collect2: error: ld returned 1 exit status
>>>>   make[2]: *** [CMakeFiles/wtm.x.dir/build.make:103: wtm.x] Error 1
>>>>   make[1]: *** [CMakeFiles/Makefile2:269: CMakeFiles/wtm.x.dir/all] Error 2
>>>>   make: *** [Makefile:136: all] Error 2
>>>>
>>> It seems cmake could not find petsc. Look
>>> at $PETSC_DIR/share/petsc/CMakeLists.txt and try to modify your
>>> CMakeLists.txt.
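>>> Something along these lines might work (a sketch only, not tested here;
>>> "wtm.x" stands in for your executable target, and it assumes the
>>> PKG_CONFIG_PATH you already set so pkg-config can locate PETSc.pc):
>>>
>>>   # Pull in PETSc via pkg-config; IMPORTED_TARGET creates the
>>>   # PkgConfig::PETSC target, which carries the include dirs, link
>>>   # dirs, and libraries, so the linker can actually find -lpetsc.
>>>   find_package(PkgConfig REQUIRED)
>>>   pkg_check_modules(PETSC PETSc>=3.17.1 IMPORTED_TARGET REQUIRED)
>>>
>>>   # Link MPI explicitly as well, so MPI symbols resolve at link time
>>>   # instead of relying on transitively loaded DSOs.
>>>   find_package(MPI REQUIRED)
>>>
>>>   target_link_libraries(wtm.x PRIVATE PkgConfig::PETSC MPI::MPI_CXX)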
>>
>> There is an explicit reference to the path in CMakeLists.txt:
>>
>>   # NOTE: You may need to update this path to identify PETSc's location
>>   set(ENV{PKG_CONFIG_PATH}
>>       "$ENV{PKG_CONFIG_PATH}:/path/to/petsc/arch-linux-cxx-debug/lib/pkgconfig/")
>>   pkg_check_modules(PETSC PETSc>=3.17.1 IMPORTED_TARGET REQUIRED)
>>   message(STATUS "Found PETSc ${PETSC_VERSION}")
>>   add_subdirectory(common/richdem EXCLUDE_FROM_ALL)
>>   add_subdirectory(common/fmt EXCLUDE_FROM_ALL)
>>
>> And that exists:
>>
>>   ls /path/to/petsc/arch-linux-cxx-debug/lib/pkgconfig/
>>   petsc.pc  PETSc.pc
>>
>> Is there an environment variable I'm missing? I've seen the suggestion
>> <https://www.mail-archive.com/search?l=petsc-users@mcs.anl.gov&q=subject:%22%5C%5Bpetsc%5C-users%5C%5D+CMake+error+in+PETSc%22&o=newest&f=1>
>> to add it to LD_LIBRARY_PATH, which I did with
>> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$PETSC_DIR/$PETSC_ARCH/lib, and
>> that points to:
>>
>>   ls -l /path/to/petsc/arch-linux-c-debug/lib
>>   total 83732
>>   lrwxrwxrwx 1 rk3199 user       18 Oct 7 13:56 libpetsc.so -> libpetsc.so.3.18.0
>>   lrwxrwxrwx 1 rk3199 user       18 Oct 7 13:56 libpetsc.so.3.18 -> libpetsc.so.3.18.0
>>   -rwxr-xr-x 1 rk3199 user 85719200 Oct 7 13:56 libpetsc.so.3.18.0
>>   drwxr-xr-x 3 rk3199 user     4096 Oct 6 10:22 petsc
>>   drwxr-xr-x 2 rk3199 user     4096 Oct 6 10:23 pkgconfig
>>
>> Anything else to check?
>>
>>> If modifying CMakeLists.txt does not work, you can try
>>> export LIBRARY_PATH=$LIBRARY_PATH:$PETSC_DIR/$PETSC_ARCH/lib
>>> LD_LIBRARY_PATH is for run time, but the error happened at link time.
>>
>> Yes, that's what I already had. Any other debugging info that I can provide?
>>
>>> On Fri, Oct 7, 2022 at 1:53 PM Satish Balay <ba...@mcs.anl.gov> wrote:
>>>
>>>> you can try
>>>>
>>>>   make PETSC_DIR=/path/to/petsc PETSC_ARCH=arch-linux-c-debug MPIEXEC="mpiexec -mca orte_base_help_aggregate 0 --mca opal_warn_on_missing_libcuda 0 -mca pml ucx --mca btl '^openib'"
>>>>
>>>> Wrt configure - it can be set with the --with-mpiexec option - it's saved
>>>> in PETSC_ARCH/lib/petsc/conf/petscvariables
>>>>
>>>> Satish
>>>>
>>>> On Fri, 7 Oct 2022, Rob Kudyba wrote:
>>>>
>>>>> We are on RHEL 8, using modules so that we can load/unload various
>>>>> versions of packages/libraries, and I have OpenMPI 4.1.1 with CUDA-aware
>>>>> support loaded along with GDAL 3.3.0, GCC 10.2.0, and cmake 3.22.1.
>>>>>
>>>>>   make PETSC_DIR=/path/to/petsc PETSC_ARCH=arch-linux-c-debug check
>>>>>
>>>>> fails with the below errors:
>>>>>
>>>>>   Running check examples to verify correct installation
>>>>>   Using PETSC_DIR=/path/to/petsc and PETSC_ARCH=arch-linux-c-debug
>>>>>   Possible error running C/C++ src/snes/tutorials/ex19 with 1 MPI process
>>>>>   See https://petsc.org/release/faq/
>>>>>   --------------------------------------------------------------------------
>>>>>   The library attempted to open the following supporting CUDA libraries,
>>>>>   but each of them failed. CUDA-aware support is disabled.
>>>>>   libcuda.so.1: cannot open shared object file: No such file or directory
>>>>>   libcuda.dylib: cannot open shared object file: No such file or directory
>>>>>   /usr/lib64/libcuda.so.1: cannot open shared object file: No such file or directory
>>>>>   /usr/lib64/libcuda.dylib: cannot open shared object file: No such file or directory
>>>>>   If you are not interested in CUDA-aware support, then run with
>>>>>   --mca opal_warn_on_missing_libcuda 0 to suppress this message. If you are
>>>>>   interested in CUDA-aware support, then try setting LD_LIBRARY_PATH to the
>>>>>   location of libcuda.so.1 to get passed this issue.
>>>>>   --------------------------------------------------------------------------
>>>>>   --------------------------------------------------------------------------
>>>>>   WARNING: There was an error initializing an OpenFabrics device.
>>>>>
>>>>>     Local host:   g117
>>>>>     Local device: mlx5_0
>>>>>   --------------------------------------------------------------------------
>>>>>   lid velocity = 0.0016, prandtl # = 1., grashof # = 1.
>>>>>   Number of SNES iterations = 2
>>>>>   Possible error running C/C++ src/snes/tutorials/ex19 with 2 MPI processes
>>>>>   See https://petsc.org/release/faq/
>>>>>
>>>>>   The library attempted to open the following supporting CUDA libraries,
>>>>>   but each of them failed. CUDA-aware support is disabled.
>>>>>   libcuda.so.1: cannot open shared object file: No such file or directory
>>>>>   libcuda.dylib: cannot open shared object file: No such file or directory
>>>>>   /usr/lib64/libcuda.so.1: cannot open shared object file: No such file or directory
>>>>>   /usr/lib64/libcuda.dylib: cannot open shared object file: No such file or directory
>>>>>   If you are not interested in CUDA-aware support, then run with
>>>>>   --mca opal_warn_on_missing_libcuda 0 to suppress this message. If you are
>>>>>   interested in CUDA-aware support, then try setting LD_LIBRARY_PATH to the
>>>>>   location of libcuda.so.1 to get passed this issue.
>>>>>
>>>>>   WARNING: There was an error initializing an OpenFabrics device.
>>>>>
>>>>>     Local host:   xxx
>>>>>     Local device: mlx5_0
>>>>>
>>>>>   lid velocity = 0.0016, prandtl # = 1., grashof # = 1.
>>>>>   Number of SNES iterations = 2
>>>>>   [g117:4162783] 1 more process has sent help message
>>>>>   help-mpi-common-cuda.txt / dlopen failed
>>>>>   [g117:4162783] Set MCA parameter "orte_base_help_aggregate" to 0 to see all
>>>>>   help / error messages
>>>>>   [g117:4162783] 1 more process has sent help message help-mpi-btl-openib.txt
>>>>>   / error in device init
>>>>>   Completed test examples
>>>>>   Error while running make check
>>>>>   gmake[1]: *** [makefile:149: check] Error 1
>>>>>   make: *** [GNUmakefile:17: check] Error 2
>>>>>
>>>>> Where is $MPI_RUN set? I'd like to be able to pass options such as --mca
>>>>> orte_base_help_aggregate 0 --mca opal_warn_on_missing_libcuda 0 -mca pml
>>>>> ucx --mca btl '^openib', which will help me troubleshoot and hide unneeded
>>>>> warnings.
>>>>>
>>>>> Thanks,
>>>>> Rob