Perhaps we can back up one step:
Use your mpicc to build a "hello world" MPI test, then run it on a compute
node (with a GPU) to see if it works.
If not, then your MPI environment has problems;
if yes, then use it to build PETSc (turn on PETSc's GPU support with
--with-cuda --with-cudac=nvcc), and then your code.
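
For example, a minimal test along these lines (any equivalent hello-world MPI
program is fine; the file name here is just an example):

  /* hello_mpi.c - smoke test for the MPI toolchain */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of ranks */
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
  }

built and run with something like (adjust to your scheduler/site):

  mpicc hello_mpi.c -o hello_mpi
  mpiexec -n 2 ./hello_mpi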

--Junchao Zhang


On Fri, Oct 7, 2022 at 10:45 PM Rob Kudyba <rk3...@columbia.edu> wrote:

> The error has changed and now occurs at an earlier place, 66% vs 70%:
> make LDFLAGS="-Wl,--copy-dt-needed-entries"
> Consolidate compiler generated dependencies of target fmt
> [ 12%] Built target fmt
> Consolidate compiler generated dependencies of target richdem
> [ 37%] Built target richdem
> Consolidate compiler generated dependencies of target wtm
> [ 62%] Built target wtm
> Consolidate compiler generated dependencies of target wtm.x
> [ 66%] Linking CXX executable wtm.x
> /usr/bin/ld: libwtm.a(transient_groundwater.cpp.o): undefined reference to
> symbol 'MPI_Abort'
> /path/to/openmpi-4.1.1_ucx_cuda_11.0.3_support/lib/libmpi.so.40: error
> adding symbols: DSO missing from command line
> collect2: error: ld returned 1 exit status
> make[2]: *** [CMakeFiles/wtm.x.dir/build.make:103: wtm.x] Error 1
> make[1]: *** [CMakeFiles/Makefile2:225: CMakeFiles/wtm.x.dir/all] Error 2
> make: *** [Makefile:136: all] Error 2
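>
> From what I've read, "DSO missing from command line" usually means the
> executable itself references MPI symbols (here MPI_Abort) but libmpi is not
> on its own link line. Would linking MPI explicitly in CMakeLists.txt,
> something like
>
>   find_package(MPI REQUIRED)
>   target_link_libraries(wtm.x PRIVATE MPI::MPI_CXX)
>
> be a cleaner fix than the --copy-dt-needed-entries workaround? (Just a guess
> on my part; wtm.x is the existing executable target.)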
>
> So perhaps PETSc is now being found. Any other suggestions?
>
> On Fri, Oct 7, 2022 at 11:18 PM Rob Kudyba <rk3...@columbia.edu> wrote:
>
>>
>>>> Thanks for the quick reply. I added these options to make, and make check still
>>>> produced the warnings, so I used the command like this:
>>>> make PETSC_DIR=/path/to/petsc PETSC_ARCH=arch-linux-c-debug
>>>>  MPIEXEC="mpiexec -mca orte_base_help_aggregate 0 --mca
>>>> opal_warn_on_missing_libcuda 0 -mca pml ucx --mca btl '^openib'" check
>>>> Running check examples to verify correct installation
>>>> Using PETSC_DIR=/path/to/petsc and PETSC_ARCH=arch-linux-c-debug
>>>> C/C++ example src/snes/tutorials/ex19 run successfully with 1 MPI
>>>> process
>>>> C/C++ example src/snes/tutorials/ex19 run successfully with 2 MPI
>>>> processes
>>>> Completed test examples
>>>>
>>>> Could be useful for the FAQ.
>>>>
>>> You mentioned you had "OpenMPI 4.1.1 with CUDA aware", so I think a
>>> working mpicc should automatically find the CUDA libraries. Maybe you
>>> unloaded the CUDA libraries?
>>>
>> Oh, let me clarify: OpenMPI is CUDA-aware, but this code and the node
>> where PETSc is being compiled do not have a GPU, hence it is not needed, and
>> using the MPIEXEC option worked during the 'check' to suppress the warning.
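>>
>> (If it would help, I can also send the output of "mpicc --showme:link" so
>> you can see exactly which libraries our Open MPI wrapper adds at link time.)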
>>
>>>> I'm now trying to use PETSc to compile, and linking appears to go awry:
>>>> [ 58%] Building CXX object
>>>> CMakeFiles/wtm.dir/src/update_effective_storativity.cpp.o
>>>> [ 62%] Linking CXX static library libwtm.a
>>>> [ 62%] Built target wtm
>>>> [ 66%] Building CXX object CMakeFiles/wtm.x.dir/src/WTM.cpp.o
>>>> [ 70%] Linking CXX executable wtm.x
>>>> /usr/bin/ld: cannot find -lpetsc
>>>> collect2: error: ld returned 1 exit status
>>>> make[2]: *** [CMakeFiles/wtm.x.dir/build.make:103: wtm.x] Error 1
>>>> make[1]: *** [CMakeFiles/Makefile2:269: CMakeFiles/wtm.x.dir/all] Error
>>>> 2
>>>> make: *** [Makefile:136: all] Error 2
>>>>
>>> It seems CMake could not find PETSc. Look
>>> at $PETSC_DIR/share/petsc/CMakeLists.txt and try to modify your
>>> CMakeLists.txt.
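>>>
>>> For instance (just a sketch, assuming the pkg_check_modules(...
>>> IMPORTED_TARGET ...) call is what provides PETSc to your build), the
>>> executable also needs to link against the imported target, roughly:
>>>
>>>   find_package(PkgConfig REQUIRED)
>>>   pkg_check_modules(PETSC PETSc>=3.17.1 IMPORTED_TARGET REQUIRED)
>>>   target_link_libraries(wtm.x PRIVATE PkgConfig::PETSC)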
>>>
>>
>> There is an explicit reference to the path in CMakeLists.txt:
>> # NOTE: You may need to update this path to identify PETSc's location
>> set(ENV{PKG_CONFIG_PATH}
>> "$ENV{PKG_CONFIG_PATH}:/path/to/petsc/arch-linux-cxx-debug/lib/pkgconfig/")
>> pkg_check_modules(PETSC PETSc>=3.17.1 IMPORTED_TARGET REQUIRED)
>> message(STATUS "Found PETSc ${PETSC_VERSION}")
>> add_subdirectory(common/richdem EXCLUDE_FROM_ALL)
>> add_subdirectory(common/fmt EXCLUDE_FROM_ALL)
>>
>> And that exists:
>> ls /path/to/petsc/arch-linux-cxx-debug/lib/pkgconfig/
>> petsc.pc  PETSc.pc
>>
>>>> Is there an environment variable I'm missing? I've seen the suggestion
>>>> <https://www.mail-archive.com/search?l=petsc-users@mcs.anl.gov&q=subject:%22%5C%5Bpetsc%5C-users%5C%5D+CMake+error+in+PETSc%22&o=newest&f=1>
>>>> to add it to LD_LIBRARY_PATH which I did with export
>>>> LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$PETSC_DIR/$PETSC_ARCH/lib and that
>>>> points to:
>>>
>>>> ls -l /path/to/petsc/arch-linux-c-debug/lib
>>>> total 83732
>>>> lrwxrwxrwx 1 rk3199 user       18 Oct  7 13:56 libpetsc.so ->
>>>> libpetsc.so.3.18.0
>>>> lrwxrwxrwx 1 rk3199 user       18 Oct  7 13:56 libpetsc.so.3.18 ->
>>>> libpetsc.so.3.18.0
>>>> -rwxr-xr-x 1 rk3199 user 85719200 Oct  7 13:56 libpetsc.so.3.18.0
>>>> drwxr-xr-x 3 rk3199 user     4096 Oct  6 10:22 petsc
>>>> drwxr-xr-x 2 rk3199 user     4096 Oct  6 10:23 pkgconfig
>>>>
>>>> Anything else to check?
>>>>
>>> If modifying CMakeLists.txt does not work, you can try export
>>> LIBRARY_PATH=$LIBRARY_PATH:$PETSC_DIR/$PETSC_ARCH/lib
>>> LD_LIBRARY_PATH is for run time, but the error happened at link time.
>>>
>>
>> Yes, that's what I already had. Any other debugging info I can provide?
>>
>>
>>
>>>> On Fri, Oct 7, 2022 at 1:53 PM Satish Balay <ba...@mcs.anl.gov> wrote:
>>>>
>>>>> you can try
>>>>>
>>>>> make PETSC_DIR=/path/to/petsc PETSC_ARCH=arch-linux-c-debug
>>>>> MPIEXEC="mpiexec -mca orte_base_help_aggregate 0 --mca
>>>>> opal_warn_on_missing_libcuda 0 -mca pml ucx --mca btl '^openib'"
>>>>>
>>>>> Wrt configure - it can be set with the --with-mpiexec option - it's saved
>>>>> in PETSC_ARCH/lib/petsc/conf/petscvariables
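>>>>>
>>>>> e.g. something like (adjust the flags to your site):
>>>>>
>>>>>   ./configure --with-mpiexec="mpiexec --mca opal_warn_on_missing_libcuda 0 -mca pml ucx" ...
>>>>>
>>>>> and you can check what configure recorded with:
>>>>>
>>>>>   grep MPIEXEC $PETSC_DIR/$PETSC_ARCH/lib/petsc/conf/petscvariables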
>>>>>
>>>>> Satish
>>>>>
>>>>> On Fri, 7 Oct 2022, Rob Kudyba wrote:
>>>>>
>>>>> > We are on RHEL 8, using modules so we can load/unload various versions of
>>>>> > packages/libraries, and I have CUDA-aware OpenMPI 4.1.1 loaded along with
>>>>> > GDAL 3.3.0, GCC 10.2.0, and CMake 3.22.1.
>>>>> >
>>>>> > make PETSC_DIR=/path/to/petsc PETSC_ARCH=arch-linux-c-debug check
>>>>> > fails with the below errors,
>>>>> > Running check examples to verify correct installation
>>>>> >
>>>>> > Using PETSC_DIR=/path/to/petsc and PETSC_ARCH=arch-linux-c-debug
>>>>> > Possible error running C/C++ src/snes/tutorials/ex19 with 1 MPI process
>>>>> > See https://petsc.org/release/faq/
>>>>> >
>>>>> > --------------------------------------------------------------------------
>>>>> > The library attempted to open the following supporting CUDA libraries,
>>>>> > but each of them failed.  CUDA-aware support is disabled.
>>>>> > libcuda.so.1: cannot open shared object file: No such file or directory
>>>>> > libcuda.dylib: cannot open shared object file: No such file or directory
>>>>> > /usr/lib64/libcuda.so.1: cannot open shared object file: No such file or directory
>>>>> > /usr/lib64/libcuda.dylib: cannot open shared object file: No such file or directory
>>>>> > If you are not interested in CUDA-aware support, then run with
>>>>> > --mca opal_warn_on_missing_libcuda 0 to suppress this message.  If you are
>>>>> > interested in CUDA-aware support, then try setting LD_LIBRARY_PATH to the
>>>>> > location of libcuda.so.1 to get passed this issue.
>>>>> >
>>>>> > --------------------------------------------------------------------------
>>>>> >
>>>>> > --------------------------------------------------------------------------
>>>>> > WARNING: There was an error initializing an OpenFabrics device.
>>>>> >
>>>>> >   Local host:   g117
>>>>> >   Local device: mlx5_0
>>>>> >
>>>>> > --------------------------------------------------------------------------
>>>>> > lid velocity = 0.0016, prandtl # = 1., grashof # = 1.
>>>>> > Number of SNES iterations = 2
>>>>> > Possible error running C/C++ src/snes/tutorials/ex19 with 2 MPI processes
>>>>> > See https://petsc.org/release/faq/
>>>>> >
>>>>> > The library attempted to open the following supporting CUDA libraries,
>>>>> > but each of them failed.  CUDA-aware support is disabled.
>>>>> > libcuda.so.1: cannot open shared object file: No such file or directory
>>>>> > libcuda.dylib: cannot open shared object file: No such file or directory
>>>>> > /usr/lib64/libcuda.so.1: cannot open shared object file: No such file or directory
>>>>> > /usr/lib64/libcuda.dylib: cannot open shared object file: No such file or directory
>>>>> > If you are not interested in CUDA-aware support, then run with
>>>>> > --mca opal_warn_on_missing_libcuda 0 to suppress this message.  If you are
>>>>> > interested in CUDA-aware support, then try setting LD_LIBRARY_PATH to the
>>>>> > location of libcuda.so.1 to get passed this issue.
>>>>> >
>>>>> > WARNING: There was an error initializing an OpenFabrics device.
>>>>> >
>>>>> >   Local host:   xxx
>>>>> >   Local device: mlx5_0
>>>>> >
>>>>> > lid velocity = 0.0016, prandtl # = 1., grashof # = 1.
>>>>> > Number of SNES iterations = 2
>>>>> > [g117:4162783] 1 more process has sent help message
>>>>> > help-mpi-common-cuda.txt / dlopen failed
>>>>> > [g117:4162783] Set MCA parameter "orte_base_help_aggregate" to 0 to see all
>>>>> > help / error messages
>>>>> > [g117:4162783] 1 more process has sent help message help-mpi-btl-openib.txt
>>>>> > / error in device init
>>>>> > Completed test examples
>>>>> > Error while running make check
>>>>> > gmake[1]: *** [makefile:149: check] Error 1
>>>>> > make: *** [GNUmakefile:17: check] Error 2
>>>>> >
>>>>> > Where is $MPI_RUN set? I'd like to be able to pass options such as --mca
>>>>> > orte_base_help_aggregate 0 --mca opal_warn_on_missing_libcuda 0 -mca pml
>>>>> > ucx --mca btl '^openib' which will help me troubleshoot and hide unneeded
>>>>> > warnings.
>>>>> >
>>>>> > Thanks,
>>>>> > Rob
>>>>> >
>>>>>
>>>>>
