Hi Sophie,
  PetscSFBcastEnd() was calling MPI_Waitall() to finish the communication
in DMGlobalToLocal.
  I guess you are using GPU-aware MPI, and the error you saw might be due
to it. You can try without it by running with the PETSc option -use_gpu_aware_mpi 0.
  But we generally recommend GPU-aware MPI. You can also try other GPU
machines to see whether it is just an IBM Spectrum MPI problem.
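For reference, the option can be passed on the command line or through the
environment. This is just a sketch: the executable name, input file, and
jsrun flags below are illustrative placeholders, not taken from your run.

```shell
# Disable GPU-aware MPI for a single run via the PETSc options database
# (executable name and jsrun resource flags here are hypothetical):
jsrun -n 2 -g 1 ./xolotl params.txt -use_gpu_aware_mpi 0

# Or set it for every run through the environment:
export PETSC_OPTIONS="-use_gpu_aware_mpi 0"
```

If the crash disappears with the option set, that points at the GPU-aware
path in Spectrum MPI rather than at PETSc or Xolotl.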

   Thanks.
--Junchao Zhang


On Thu, Feb 29, 2024 at 9:17 AM Blondel, Sophie via petsc-users <
petsc-users@mcs.anl.gov> wrote:

> Hi,
>
> I am using PETSc built with the Kokkos CUDA backend on Summit, but when I
> run my code with multiple MPI tasks I get the following error:
> 0 TS dt 1e-12 time 0.
> errno 14 pid 864558
> xolotl:
> /__SMPI_build_dir__________________________/ibmsrc/pami/ibm-pami/buildtools/pami_build_port/../pami/components/devices/shmem/shaddr/CMAShaddr.h:164:
> size_t PAMI::Device::Shmem::CMAShaddr::read_impl(PAMI::Memregion*, size_t, PAMI::Memregion*, size_t, size_t, bool*): Assertion `cbytes > 0' failed.
> errno 14 pid 864557
> xolotl:
> /__SMPI_build_dir__________________________/ibmsrc/pami/ibm-pami/buildtools/pami_build_port/../pami/components/devices/shmem/shaddr/CMAShaddr.h:164:
> size_t PAMI::Device::Shmem::CMAShaddr::read_impl(PAMI::Memregion*, size_t, PAMI::Memregion*, size_t, size_t, bool*): Assertion `cbytes > 0' failed.
> [e28n07:864557] *** Process received signal ***
> [e28n07:864557] Signal: Aborted (6)
> [e28n07:864557] Signal code:  (-6)
> [e28n07:864557] [ 0]
> linux-vdso64.so.1(__kernel_sigtramp_rt64+0x0)[0x2000000604d8]
> [e28n07:864557] [ 1] /lib64/glibc-hwcaps/power9/libc-2.28.so
> (gsignal+0xd8)[0x200005d796f8]
> [e28n07:864557] [ 2] /lib64/glibc-hwcaps/power9/libc-2.28.so
> (abort+0x164)[0x200005d53ff4]
> [e28n07:864557] [ 3] /lib64/glibc-hwcaps/power9/libc-2.28.so
> (+0x3d280)[0x200005d6d280]
> [e28n07:864557] [ 4] [e28n07:864558] *** Process received signal ***
> [e28n07:864558] Signal: Aborted (6)
> [e28n07:864558] Signal code:  (-6)
> [e28n07:864558] [ 0]
> linux-vdso64.so.1(__kernel_sigtramp_rt64+0x0)[0x2000000604d8]
> [e28n07:864558] [ 1] /lib64/glibc-hwcaps/power9/libc-2.28.so
> (gsignal+0xd8)[0x200005d796f8]
> [e28n07:864558] [ 2] /lib64/glibc-hwcaps/power9/libc-2.28.so
> (abort+0x164)[0x200005d53ff4]
> [e28n07:864558] [ 3] /lib64/glibc-hwcaps/power9/libc-2.28.so
> (+0x3d280)[0x200005d6d280]
> [e28n07:864558] [ 4] /lib64/glibc-hwcaps/power9/libc-2.28.so
> (__assert_fail+0x64)[0x200005d6d324]
> [e28n07:864557] [ 5]
> /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/pami_port/libpami.so.3
> (_ZN4PAMI8Protocol3Get7GetRdmaINS_6Device5Shmem8DmaModelINS3_11ShmemDeviceINS_4Fifo8WrapFifoINS7_10FifoPacketILj64ELj4096EEENS_7Counter15IndirectBoundedINS_6Atomic12NativeAtomicEEELj256EEENSB_8IndirectINSB_6NativeEEENS4_9CMAShaddrELj256ELj512EEELb0EEESL_E6simpleEP18pami_rget_simple_t+0x1d8)[0x20007f3971d8]
> [e28n07:864557] [ 6]
> /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/pami_port/libpami.so.3
> (_ZN4PAMI8Protocol3Get13CompositeRGetINS1_4RGetES3_E6simpleEP18pami_rget_simple_t+0x40)[0x20007f2ecc10]
> [e28n07:864557] [ 7]
> /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/pami_port/libpami.so.3
> (_ZN4PAMI7Context9rget_implEP18pami_rget_simple_t+0x28c)[0x20007f31a78c]
> [e28n07:864557] [ 8]
> /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/pami_port/libpami.so.3
> (PAMI_Rget+0x18)[0x20007f2d94a8]
> [e28n07:864557] [ 9]
> /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/spectrum_mpi/mca_pml_pami.so(process_rndv_msg+0x46c)[0x2000a80159ac]
> [e28n07:864557] [10]
> /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/spectrum_mpi/mca_pml_pami.so(pml_pami_recv_rndv_cb+0x2bc)[0x2000a801670c]
> [e28n07:864557] [11] /lib64/glibc-hwcaps/power9/libc-2.28.so
> (__assert_fail+0x64)[0x200005d6d324]
> [e28n07:864558] [ 5]
> /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/pami_port/libpami.so.3
> (_ZN4PAMI8Protocol3Get7GetRdmaINS_6Device5Shmem8DmaModelINS3_11ShmemDeviceINS_4Fifo8WrapFifoINS7_10FifoPacketILj64ELj4096EEENS_7Counter15IndirectBoundedINS_6Atomic12NativeAtomicEEELj256EEENSB_8IndirectINSB_6NativeEEENS4_9CMAShaddrELj256ELj512EEELb0EEESL_E6simpleEP18pami_rget_simple_t+0x1d8)[0x20007f3971d8]
> [e28n07:864558] [ 6]
> /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/pami_port/libpami.so.3
> (_ZN4PAMI8Protocol3Get13CompositeRGetINS1_4RGetES3_E6simpleEP18pami_rget_simple_t+0x40)[0x20007f2ecc10]
> [e28n07:864558] [ 7]
> /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/pami_port/libpami.so.3
> (_ZN4PAMI7Context9rget_implEP18pami_rget_simple_t+0x28c)[0x20007f31a78c]
> [e28n07:864558] [ 8]
> /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/pami_port/libpami.so.3
> (PAMI_Rget+0x18)[0x20007f2d94a8]
> [e28n07:864558] [ 9]
> /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/spectrum_mpi/mca_pml_pami.so(process_rndv_msg+0x46c)[0x2000a80159ac]
> [e28n07:864558] [10]
> /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/spectrum_mpi/mca_pml_pami.so(pml_pami_recv_rndv_cb+0x2bc)[0x2000a801670c]
> [e28n07:864558] [11]
> /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/pami_port/libpami.so.3
> (_ZN4PAMI8Protocol4Send11EagerSimpleINS_6Device5Shmem11PacketModelINS3_11ShmemDeviceINS_4Fifo8WrapFifoINS7_10FifoPacketILj64ELj4096EEENS_7Counter15IndirectBoundedINS_6Atomic12NativeAtomicEEELj256EEENSB_8IndirectINSB_6NativeEEENS4_9CMAShaddrELj256ELj512EEEEELNS1_15configuration_tE5EE15dispatch_packedEPvSP_mSP_SP_+0x4c)[0x20007f2e30ac]
> [e28n07:864557] [12]
> /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/pami_port/libpami.so.3
> (PAMI_Context_advancev+0x6b0)[0x20007f2da540]
> [e28n07:864557] [13]
> /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/spectrum_mpi/mca_pml_pami.so(mca_pml_pami_progress+0x34)[0x2000a80073e4]
> [e28n07:864557] [14]
> /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/libopen-pal.so.3(opal_progress+0x6c)[0x20003d60640c]
> [e28n07:864557] [15]
> /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/libmpi_ibm.so.3(ompi_request_default_wait_all+0x144)[0x2000034c4b04]
> [e28n07:864557] [16]
> /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/libmpi_ibm.so.3(PMPI_Waitall+0x10c)[0x20000352790c]
> [e28n07:864557] [17]
> /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/pami_port/libpami.so.3
> (_ZN4PAMI8Protocol4Send11EagerSimpleINS_6Device5Shmem11PacketModelINS3_11ShmemDeviceINS_4Fifo8WrapFifoINS7_10FifoPacketILj64ELj4096EEENS_7Counter15IndirectBoundedINS_6Atomic12NativeAtomicEEELj256EEENSB_8IndirectINSB_6NativeEEENS4_9CMAShaddrELj256ELj512EEEEELNS1_15configuration_tE5EE15dispatch_packedEPvSP_mSP_SP_+0x4c)[0x20007f2e30ac]
> [e28n07:864558] [12]
> /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/pami_port/libpami.so.3
> (PAMI_Context_advancev+0x6b0)[0x20007f2da540]
> [e28n07:864558] [13]
> /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/spectrum_mpi/mca_pml_pami.so(mca_pml_pami_progress+0x34)[0x2000a80073e4]
> [e28n07:864558] [14]
> /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/libopen-pal.so.3(opal_progress+0x6c)[0x20003d60640c]
> [e28n07:864558] [15]
> /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/libmpi_ibm.so.3(ompi_request_default_wait_all+0x144)[0x2000034c4b04]
> [e28n07:864558] [16]
> /sw/summit/spack-envs/summit-plus/opt/gcc-12.1.0/spectrum-mpi-10.4.0.6-20230210-db5xakaaqowbhp3nqwebpxrdbwtm4knu/container/../lib/libmpi_ibm.so.3(PMPI_Waitall+0x10c)[0x20000352790c]
> [e28n07:864558] [17]
> /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(+0x3ca7b0)[0x2000004ea7b0]
> [e28n07:864557] [18]
> /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(+0x3ca7b0)[0x2000004ea7b0]
> [e28n07:864558] [18]
> /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(+0x3c5e68)[0x2000004e5e68]
> [e28n07:864557] [19]
> /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(+0x3c5e68)[0x2000004e5e68]
> [e28n07:864558] [19]
> /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(PetscSFBcastEnd+0x74)[0x2000004c9214]
> [e28n07:864557] [20]
> /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(PetscSFBcastEnd+0x74)[0x2000004c9214]
> [e28n07:864558] [20]
> /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(+0x3b4cb0)[0x2000004d4cb0]
> [e28n07:864557] [21]
> /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(+0x3b4cb0)[0x2000004d4cb0]
> [e28n07:864558] [21]
> /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(VecScatterEnd+0x178)[0x2000004dd038]
> [e28n07:864558] [22]
> /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(VecScatterEnd+0x178)[0x2000004dd038]
> [e28n07:864557] [22]
> /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(+0x1112be0)[0x200001232be0]
> [e28n07:864558] [23]
> /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(+0x1112be0)[0x200001232be0]
> [e28n07:864557] [23]
> /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(DMGlobalToLocalEnd+0x470)[0x200000e9b0f0]
> [e28n07:864557] [24]
> /gpfs/alpine2/mat267/proj-shared/code/xolotl-stable-cuda/xolotl/solver/libxolotlSolver.so(_ZN6xolotl6solver11PetscSolver11rhsFunctionEP5_p_TSdP6_p_VecS5_+0xc4)[0x200005f710d4]
> [e28n07:864557] [25]
> /gpfs/alpine2/mat267/proj-shared/code/xolotl-stable-cuda/xolotl/solver/libxolotlSolver.so(_ZN6xolotl6solver11RHSFunctionEP5_p_TSdP6_p_VecS4_Pv+0x2c)[0x200005f7130c]
> [e28n07:864557] [26]
> /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(DMGlobalToLocalEnd+0x470)[0x200000e9b0f0]
> [e28n07:864558] [24]
> /gpfs/alpine2/mat267/proj-shared/code/xolotl-stable-cuda/xolotl/solver/libxolotlSolver.so(_ZN6xolotl6solver11PetscSolver11rhsFunctionEP5_p_TSdP6_p_VecS5_+0xc4)[0x200005f710d4]
> [e28n07:864558] [25]
> /gpfs/alpine2/mat267/proj-shared/code/xolotl-stable-cuda/xolotl/solver/libxolotlSolver.so(_ZN6xolotl6solver11RHSFunctionEP5_p_TSdP6_p_VecS4_Pv+0x2c)[0x200005f7130c]
> [e28n07:864558] [26]
> /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(TSComputeRHSFunction+0x1bc)[0x2000017621dc]
> [e28n07:864557] [27]
> /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(TSComputeRHSFunction+0x1bc)[0x2000017621dc]
> [e28n07:864558] [27]
> /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(TSComputeIFunction+0x418)[0x200001763ad8]
> [e28n07:864557] [28]
> /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(TSComputeIFunction+0x418)[0x200001763ad8]
> [e28n07:864558] [28]
> /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(+0x16f2ef0)[0x200001812ef0]
> [e28n07:864557] [29]
> /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(+0x16f2ef0)[0x200001812ef0]
> [e28n07:864558] [29]
> /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(TSStep+0x228)[0x200001768088]
> [e28n07:864557] *** End of error message ***
>
> /gpfs/alpine2/mat267/proj-shared/dependencies/petsc-kokkos-cuda/lib/libpetsc.so.3.020(TSStep+0x228)[0x200001768088]
> [e28n07:864558] *** End of error message ***
>
> It seems to be pointing to
> https://petsc.org/release/manualpages/PetscSF/PetscSFBcastEnd/
> so I wanted to check if you had seen this type of error before and if it
> could be related to how the code is compiled or run. Let me know if I can
> provide any additional information.
>
> Best,
>
> Sophie
>
