I still get the same error after deactivating GPU-aware MPI. I also tried unloading Spectrum MPI and using Open MPI instead (recompiling everything), and in that case I get a segfault in PETSc (still using GPU-aware MPI, I think; at least I did not explicitly turn it off):
0 TS dt 1e-12 time 0.
[ERROR] [0]PETSC ERROR: [ERROR] ------------------------------------------------------------------------
[ERROR] [0]PETSC ERROR: [ERROR] Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[ERROR] [0]PETSC ERROR: [ERROR] Try option -start_in_debugger or -on_error_attach_debugger
[ERROR] [0]PETSC ERROR: [ERROR] or see https://petsc.org/release/faq/#valgrind and https://petsc.org/release/faq/
[ERROR] [0]PETSC ERROR: [ERROR] or try https://docs.nvidia.com/cuda/cuda-memcheck/index.html on NVIDIA CUDA systems to find memory corruption errors
[ERROR] [0]PETSC ERROR: [ERROR] configure using --with-debugging=yes, recompile, link, and run
[ERROR] [0]PETSC ERROR: [ERROR] to get more information on the crash.
[ERROR] [0]PETSC ERROR: [ERROR] Run with -malloc_debug to check if memory corruption is causing the crash.
--------------------------------------------------------------------------

Best,

Sophie
________________________________
From: Blondel, Sophie via Xolotl-psi-development <xolotl-psi-developm...@lists.sourceforge.net>
Sent: Thursday, February 29, 2024 10:17
To: xolotl-psi-developm...@lists.sourceforge.net <xolotl-psi-developm...@lists.sourceforge.net>; petsc-users@mcs.anl.gov <petsc-users@mcs.anl.gov>
Subject: [Xolotl-psi-development] PAMI error on Summit

Hi,

I am using PETSc built with the Kokkos CUDA backend on Summit, but when I run my code with multiple MPI tasks I get the following error:

0 TS dt 1e-12 time 0.
errno 14 pid 864558
xolotl: /__SMPI_build_dir__________________________/ibmsrc/pami/ibm-pami/buildtools/pami_build_port/../pami/components/devices/shmem/shaddr/CMAShaddr.h:164: size_t PAMI::Device::Shmem::CMAShaddr::read_impl(PAMI::Memregion*, size_t, PAMI::Memregion*, size_t, size_t, bool*): Assertion `cbytes > 0' failed.
errno 14 pid 864557
xolotl: /__SMPI_build_dir__________________________/ibmsrc/pami/ibm-pami/buildtools/pami_build_port/../pami/components/devices/shmem/shaddr/CMAShaddr.h:164: size_t PAMI::Device::Shmem::CMAShaddr::read_impl(PAMI::Memregion*, size_t, PAMI::Memregion*, size_t, size_t, bool*): Assertion `cbytes > 0' failed.
[e28n07:864557] *** Process received signal ***
[e28n07:864557] Signal: Aborted (6)
[e28n07:864557] Signal code: (-6)
[e28n07:864557] [ 0] linux-vdso64.so.1(__kernel_sigtramp_rt64+0x0)[0x2000000604d8]
[e28n07:864557] [ 1] /lib64/glibc-hwcaps/power9/libc-2.28.so(gsignal+0xd8)[0x200005d796f8]
[e28n07:864557] [ 2] /lib64/glibc-hwcaps/power9/libc-2.28.so(abort+0x164)[0x200005d53ff4]
[e28n07:864557] [ 3] /lib64/glibc-hwcaps/power9/libc-2.28.so(+0x3d280)[0x200005d6d280]
[e28n07:864557] [ 4]
[e28n07:864558] *** Process received signal ***
[e28n07:864558] Signal: Aborted (6)
[e28n07:864558] Signal code: (-6)
[e28n07:864558] [ 0] linux-vdso64.so.1(__kernel_sigtramp_rt64+0x0)[0x2000000604d8]
[e28n07:864558] [ 1] /lib64/glibc-hwcaps/power9/libc-2.28.so(gsignal+0xd8)[0x200005d796f8]
[e28n07:864558] [ 2] /lib64/glibc-hwcaps/power9/libc-2.28.so(abort+0x164)[0x200005d53ff4]
[e28n07:864558] [ 3] /lib64/glibc-hwcaps/power9/libc-2.28.so(+0x3d280)[0x200005d6d280]
[e28n07:864558] [ 4] /lib64/glibc-hwcaps/power9/libc-2.28.so(__assert_fail+0x64)[0x200005d6d324]
[e28n07:864557] [ 5] /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/pami_port/libpami.so.3(_ZN4PAMI8Protocol3Get7GetRdmaINS_6Device5Shmem8DmaModelINS3_11ShmemDeviceINS_4Fifo8WrapFifoINS7_10FifoPacketILj64ELj4096EEENS_7Counter15IndirectBoundedINS_6Atomic12NativeAtomicEEELj256EEENSB_8IndirectINSB_6NativeEEENS4_9CMAShaddrELj256ELj512EEELb0EEESL_E6simpleEP18pami_rget_simple_t+0x1d8)[0x20007f3971d8]
[e28n07:864557] [ 6] /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/pami_port/libpami.so.3(_ZN4PAMI8Protocol3Get13CompositeRGetINS1_4RGetES3_E6simpleEP18pami_rget_simple_t+0x40)[0x20007f2ecc10]
[e28n07:864557] [ 7] /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/pami_port/libpami.so.3(_ZN4PAMI7Context9rget_implEP18pami_rget_simple_t+0x28c)[0x20007f31a78c]
[e28n07:864557] [ 8] /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/pami_port/libpami.so.3(PAMI_Rget+0x18)[0x20007f2d94a8]
[e28n07:864557] [ 9] /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/spectrum_mpi/mca_pml_pami.so(process_rndv_msg+0x46c)[0x2000a80159ac]
[e28n07:864557] [10] /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/spectrum_mpi/mca_pml_pami.so(pml_pami_recv_rndv_cb+0x2bc)[0x2000a801670c]
[e28n07:864557] [11] /lib64/glibc-hwcaps/power9/libc-2.28.so(__assert_fail+0x64)[0x200005d6d324]
[e28n07:864558] [ 5] /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/pami_port/libpami.so.3(_ZN4PAMI8Protocol3Get7GetRdmaINS_6Device5Shmem8DmaModelINS3_11ShmemDeviceINS_4Fifo8WrapFifoINS7_10FifoPacketILj64ELj4096EEENS_7Counter15IndirectBoundedINS_6Atomic12NativeAtomicEEELj256EEENSB_8IndirectINSB_6NativeEEENS4_9CMAShaddrELj256ELj512EEELb0EEESL_E6simpleEP18pami_rget_simple_t+0x1d8)[0x20007f3971d8]
[e28n07:864558] [ 6] /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/pami_port/libpami.so.3(_ZN4PAMI8Protocol3Get13CompositeRGetINS1_4RGetES3_E6simpleEP18pami_rget_simple_t+0x40)[0x20007f2ecc10]
[e28n07:864558] [ 7] /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/pami_port/libpami.so.3(_ZN4PAMI7Context9rget_implEP18pami_rget_simple_t+0x28c)[0x20007f31a78c]
[e28n07:864558] [ 8] /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/pami_port/libpami.so.3(PAMI_Rget+0x18)[0x20007f2d94a8]
[e28n07:864558] [ 9] /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/spectrum_mpi/mca_pml_pami.so(process_rndv_msg+0x46c)[0x2000a80159ac]
[e28n07:864558] [10] /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/spectrum_mpi/mca_pml_pami.so(pml_pami_recv_rndv_cb+0x2bc)[0x2000a801670c]
[e28n07:864558] [11] /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/pami_port/libpami.so.3(_ZN4PAMI8Protocol4Send11EagerSimpleINS_6Device5Shmem11PacketModelINS3_11ShmemDeviceINS_4Fifo8WrapFifoINS7_10FifoPacketILj64ELj4096EEENS_7Counter15IndirectBoundedINS_6Atomic12NativeAtomicEEELj256EEENSB_8IndirectINSB_6NativeEEENS4_9CMAShaddrELj256ELj512EEEEELNS1_15configuration_tE5EE15dispatch_packedEPvSP_mSP_SP_+0x4c)[0x20007f2e30ac]
[e28n07:864557] [12] /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/pami_port/libpami.so.3(PAMI_Context_advancev+0x6b0)[0x20007f2da540]
[e28n07:864557] [13] /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/spectrum_mpi/mca_pml_pami.so(mca_pml_pami_progress+0x34)[0x2000a80073e4]
[e28n07:864557] [14] /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/libopen-pal.so.3(opal_progress+0x6c)[0x20003d60640c]
[e28n07:864557] [15] /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/libmpi_ibm.so.3(ompi_request_default_wait_all+0x144)[0x2000034c4b04]
[e28n07:864557] [16] /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/libmpi_ibm.so.3(PMPI_Waitall+0x10c)[0x20000352790c]
[e28n07:864557] [17] /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/pami_port/libpami.so.3(_ZN4PAMI8Protocol4Send11EagerSimpleINS_6Device5Shmem11PacketModelINS3_11ShmemDeviceINS_4Fifo8WrapFifoINS7_10FifoPacketILj64ELj4096EEENS_7Counter15IndirectBoundedINS_6Atomic12NativeAtomicEEELj256EEENSB_8IndirectINSB_6NativeEEENS4_9CMAShaddrELj256ELj512EEEEELNS1_15configuration_tE5EE15dispatch_packedEPvSP_mSP_SP_+0x4c)[0x20007f2e30ac]
[e28n07:864558] [12] /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/pami_port/libpami.so.3(PAMI_Context_advancev+0x6b0)[0x20007f2da540]
[e28n07:864558] [13] /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/spectrum_mpi/mca_pml_pami.so(mca_pml_pami_progress+0x34)[0x2000a80073e4]
[e28n07:864558] [14] /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/libopen-pal.so.3(opal_progress+0x6c)[0x20003d60640c]
[e28n07:864558] [15] /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/libmpi_ibm.so.3(ompi_request_default_wait_all+0x144)[0x2000034c4b04]
[e28n07:864558] [16] /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/libmpi_ibm.so.3(PMPI_Waitall+0x10c)[0x20000352790c]
[e28n07:864558] [17] /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(+0x3ca7b0)[0x2000004ea7b0]
[e28n07:864557] [18] /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(+0x3ca7b0)[0x2000004ea7b0]
[e28n07:864558] [18] /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(+0x3c5e68)[0x2000004e5e68]
[e28n07:864557] [19] /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(+0x3c5e68)[0x2000004e5e68]
[e28n07:864558] [19] /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(PetscSFBcastEnd+0x74)[0x2000004c9214]
[e28n07:864557] [20] /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(PetscSFBcastEnd+0x74)[0x2000004c9214]
[e28n07:864558] [20] /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(+0x3b4cb0)[0x2000004d4cb0]
[e28n07:864557] [21] /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(+0x3b4cb0)[0x2000004d4cb0]
[e28n07:864558] [21] /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(VecScatterEnd+0x178)[0x2000004dd038]
[e28n07:864558] [22] /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(VecScatterEnd+0x178)[0x2000004dd038]
[e28n07:864557] [22] /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(+0x1112be0)[0x200001232be0]
[e28n07:864558] [23] /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(+0x1112be0)[0x200001232be0]
[e28n07:864557] [23] /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(DMGlobalToLocalEnd+0x470)[0x200000e9b0f0]
[e28n07:864557] [24] /gpfs/alpine2/mat267/proj-shared/code/xolotl-stable-cuda/xolotl/solver/libxolotlSolver.so(_ZN6xolotl6solver11PetscSolver11rhsFunctionEP5_p_TSdP6_p_VecS5_+0xc4)[0x200005f710d4]
[e28n07:864557] [25] /gpfs/alpine2/mat267/proj-shared/code/xolotl-stable-cuda/xolotl/solver/libxolotlSolver.so(_ZN6xolotl6solver11RHSFunctionEP5_p_TSdP6_p_VecS4_Pv+0x2c)[0x200005f7130c]
[e28n07:864557] [26] /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(DMGlobalToLocalEnd+0x470)[0x200000e9b0f0]
[e28n07:864558] [24] /gpfs/alpine2/mat267/proj-shared/code/xolotl-stable-cuda/xolotl/solver/libxolotlSolver.so(_ZN6xolotl6solver11PetscSolver11rhsFunctionEP5_p_TSdP6_p_VecS5_+0xc4)[0x200005f710d4]
[e28n07:864558] [25] /gpfs/alpine2/mat267/proj-shared/code/xolotl-stable-cuda/xolotl/solver/libxolotlSolver.so(_ZN6xolotl6solver11RHSFunctionEP5_p_TSdP6_p_VecS4_Pv+0x2c)[0x200005f7130c]
[e28n07:864558] [26] /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(TSComputeRHSFunction+0x1bc)[0x2000017621dc]
[e28n07:864557] [27] /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(TSComputeRHSFunction+0x1bc)[0x2000017621dc]
[e28n07:864558] [27] /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(TSComputeIFunction+0x418)[0x200001763ad8]
[e28n07:864557] [28] /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(TSComputeIFunction+0x418)[0x200001763ad8]
[e28n07:864558] [28] /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(+0x16f2ef0)[0x200001812ef0]
[e28n07:864557] [29] /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(+0x16f2ef0)[0x200001812ef0]
[e28n07:864558] [29]
/gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(TSStep+0x228)[0x200001768088]
[e28n07:864557] *** End of error message ***
/gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(TSStep+0x228)[0x200001768088]
[e28n07:864558] *** End of error message ***

It seems to be pointing to https://petsc.org/release/manualpages/PetscSF/PetscSFBcastEnd/ so I wanted to check if you had seen this type of error before and if it could be related to how the code is compiled or run. Let me know if I can provide any additional information.

Best,

Sophie
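
Because the two ranks write their backtraces at the same time, the frames above are interleaved and hard to read by eye. The following is only a rough sketch of one way to pull the named symbols out of such a log, so the call chain leading to PetscSFBcastEnd is easier to see. The function name `named_symbols` and the abbreviated sample frames (library paths shortened to the file name) are illustrative assumptions, not part of PETSc or Xolotl.

```python
import re

# Matches "(symbol+0xOFFSET)[0xADDRESS]" as printed in the backtrace frames.
# Frames without a symbol name, such as "(+0x3ca7b0)[...]", do not match.
FRAME_RE = re.compile(r"\(([A-Za-z_]\w*)\+0x[0-9a-f]+\)\[0x[0-9a-f]+\]")


def named_symbols(log_text):
    """Return the named symbols in order of first appearance, without duplicates."""
    seen = []
    for match in FRAME_RE.finditer(log_text):
        name = match.group(1)
        if name not in seen:
            seen.append(name)
    return seen


# Abbreviated frames copied from the log above (paths shortened for readability).
sample = "\n".join([
    "[e28n07:864558] [16] libmpi_ibm.so.3(PMPI_Waitall+0x10c)[0x20000352790c]",
    "[e28n07:864558] [17] libpetsc.so.3.020(+0x3ca7b0)[0x2000004ea7b0]",
    "[e28n07:864558] [19] libpetsc.so.3.020(PetscSFBcastEnd+0x74)[0x2000004c9214]",
    "[e28n07:864558] [21] libpetsc.so.3.020(VecScatterEnd+0x178)[0x2000004dd038]",
    "[e28n07:864558] [22] libpetsc.so.3.020(VecScatterEnd+0x178)[0x2000004dd038]",
    "[e28n07:864557] [23] libpetsc.so.3.020(DMGlobalToLocalEnd+0x470)[0x200000e9b0f0]",
])

for symbol in named_symbols(sample):
    print(symbol)
```

Read innermost first, the named frames suggest the chain DMGlobalToLocalEnd, VecScatterEnd, PetscSFBcastEnd, PMPI_Waitall, with the assertion raised inside the PAMI layer underneath. Mangled C++ names are returned as-is; a tool such as c++filt can demangle them separately.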